What Reinforcement Learning Actually Learns When Routing a PCB
“AI” has become a marketing term. Every EDA vendor now claims some form of artificial intelligence, machine learning, or neural network integration¹. The demos look impressive. The benchmarks sound promising. But when you ask what the model actually does, how it makes decisions, what it optimizes for, or why it chooses one path over another, the answers tend toward hand-waving and black-box mysticism.
This is a problem. You’re being asked to trust your designs to systems you don’t understand. And while you don’t need a PhD in machine learning to use these tools effectively, you do need enough technical clarity to know what you’re trusting, where that trust is warranted, and where human oversight remains essential.
So let’s open the box. Not for AI researchers who already know this material, but for PCB engineers who want to understand the technology before betting production boards on it.
The PCB as an Environment
Reinforcement learning, at its core, is about an agent learning to make sequential decisions in an environment. The agent takes actions, the environment responds, and the agent receives feedback in the form of rewards or penalties. Over time, through trial and error, the agent learns policies: strategies for choosing actions that maximize cumulative reward.
To apply RL to PCB routing, you first need to frame the routing problem in these terms. This framing isn’t trivial, and the choices made here determine everything about what the system can and cannot learn.
The environment is the board itself: the substrate stack-up, the component placements, the netlist defining which pins must connect, the design rules constraining how connections can be made. At any given moment, the state captures everything relevant about the current routing situation: which nets have been routed, where traces currently exist, what resources (layers, vias, routing channels) remain available, and what constraints are active.
The action space defines what moves the agent can make. In PCB routing, this typically means: extend a trace in a particular direction, change layers through a via, complete a net, or abandon the current routing attempt and rip up a problematic trace². The granularity of these actions matters enormously. Too fine-grained (place copper at pixel resolution) and the search space explodes. Too coarse (route entire nets atomically) and the agent can’t learn nuanced strategies.
Most practical systems operate at an intermediate level. Something like “extend this trace segment by one grid unit in direction X” or “place a via here and continue on layer Y.” This gives the agent enough flexibility to discover creative solutions while keeping the action space tractable.
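To make the state and action framing concrete, here is a minimal sketch of a grid-based routing environment. It is an illustration under stated assumptions, not how any particular tool represents a board: the class name GridBoard, the action names, and the one-cell-per-step grid are all hypothetical, and a via is simplified to a change of layer at the same coordinate.

```python
from dataclasses import dataclass, field

# Hypothetical action set: four in-layer moves of one grid unit each.
MOVES = {"N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0)}


@dataclass
class GridBoard:
    """Toy routing state: a grid of cells, some already holding copper."""
    width: int
    height: int
    layers: int
    occupied: set = field(default_factory=set)  # cells stored as (x, y, layer)

    def legal_actions(self, head):
        """List the actions available from the trace head (x, y, layer)."""
        x, y, layer = head
        actions = []
        for name, (dx, dy) in MOVES.items():
            nx, ny = x + dx, y + dy
            in_bounds = 0 <= nx < self.width and 0 <= ny < self.height
            if in_bounds and (nx, ny, layer) not in self.occupied:
                actions.append(name)
        # Simplification: a via only needs the target cell on the other layer
        # to be free. A real through-hole via blocks every layer it crosses.
        for target in range(self.layers):
            if target != layer and (x, y, target) not in self.occupied:
                actions.append(f"VIA{target}")
        return actions

    def step(self, head, action):
        """Apply one action, mark the new cell as copper, return the new head."""
        if action not in self.legal_actions(head):
            raise ValueError(f"illegal action {action!r} from {head}")
        x, y, layer = head
        if action.startswith("VIA"):
            new_head = (x, y, int(action[3:]))
        else:
            dx, dy = MOVES[action]
            new_head = (x + dx, y + dy, layer)
        self.occupied.add(new_head)
        return new_head
```

Even in this toy version, the granularity tradeoff is visible: each step offers at most a handful of actions, but a single net may need dozens of steps, so the agent has to learn sequences rather than single moves.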
The reward function is where engineering knowledge enters the system. Every time the agent takes an action, the environment returns a scalar reward. Positive rewards encourage behaviors; negative rewards discourage them. The art of applying RL to PCB routing lies almost entirely in designing reward functions that encode what makes a good layout.
What “Learning” Means Here
Before going further, we need to clear up a common misconception. When we say an RL agent “learns” to route PCBs, we don’t mean it memorizes solutions to specific boards and replays them. That approach would be useless. Every new design is different, with different components, different nets, different constraints.
What the agent actually learns are generalizable strategies. It develops an intuition, encoded in neural network weights, for how to navigate routing decisions given the local and global context of a board. It learns patterns like: “when approaching a dense BGA breakout region with limited layer transitions available, favor escape routes that preserve via positions for inner connections.” It doesn’t learn this as an explicit rule. It learns it as a tendency, a bias in its decision-making that emerges from experiencing thousands of similar situations.
This distinction matters because it explains both the power and the limitations of the approach. The power is generalization: a well-trained agent can route boards it has never seen, applying learned strategies to novel situations. The limitation is that these strategies are implicit and approximate. The agent doesn’t “know” your design rules the way a constraint solver knows them. It has learned behaviors that tend to satisfy those rules, but edge cases and unusual configurations can still trip it up.
Training happens in simulation. The agent attempts to route boards, sometimes synthetic boards generated programmatically, sometimes real designs from a training corpus, and receives rewards based on the quality of its routing decisions. Early in training, the agent acts essentially randomly, exploring the action space through trial and error. As training progresses, it learns to favor actions that lead to higher cumulative rewards.
The learning algorithm adjusts the neural network weights to increase the probability of high-reward action sequences. After millions of routing attempts across thousands of board configurations, the agent develops robust strategies that transfer to new, unseen designs³.
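The idea of adjusting weights to make high-reward actions more probable can be sketched with a deliberately tiny policy-gradient example. This is an assumption-laden toy, not a description of any vendor’s training algorithm: the “policy” is one preference value per action rather than a neural network, each episode is a single decision, and the function names and reward values are hypothetical.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    peak = max(prefs)
    exps = [math.exp(p - peak) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train(rewards, episodes=2000, lr=0.1, seed=0):
    """REINFORCE-style updates on a one-step problem with fixed rewards.

    rewards[a] is the reward for choosing action a. Returns the final
    action probabilities.
    """
    rng = random.Random(seed)
    prefs = [0.0] * len(rewards)   # stand-in for neural network weights
    baseline = 0.0                 # running average reward, reduces variance
    for _ in range(episodes):
        probs = softmax(prefs)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Gradient of log-probability of the chosen action w.r.t. each preference.
        for a in range(len(prefs)):
            grad = (1.0 if a == action else 0.0) - probs[a]
            prefs[a] += lr * advantage * grad
    return softmax(prefs)


# Hypothetical rewards: DRC violation, long detour, clean short route.
final_probs = train([-1.0, 0.2, 1.0])
```

The agent starts out choosing uniformly at random. Actions that score above the running average get their preference raised, and the others get theirs lowered, which is the same pressure that shapes a full routing policy over far more states and far longer action sequences.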
Encoding Design Rules in Reward Functions
The reward function is the contract between engineering intent and learned behavior. If you want the agent to minimize trace length, you penalize long traces. If you want proper clearances, you heavily penalize DRC violations. If you want to discourage unnecessary layer transitions, you assign a cost to each via.
This sounds straightforward, but the details are subtle, because the objectives compete. Even a reward that says only “route the nets, but don’t place too many vias” forces a tradeoff between completion and via count. Real reward functions are considerably more complex. They typically include terms for DRC violations, manufacturing penalties, and signal integrity considerations⁴.
The multi-objective nature of PCB design means these reward terms often conflict. You might want short traces (signal integrity) but also want to avoid congested areas (manufacturing yield). You might want minimal vias (cost, reliability) but also want to avoid long horizontal runs on power layers (EMI). The reward function encodes how these tradeoffs should be resolved, and the agent learns policies that navigate them.
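A weighted sum is one common way to combine competing terms into the single scalar the agent receives. The sketch below assumes that form. The term names and the weight values are hypothetical and chosen only to show how the weights decide which tradeoff wins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Illustrative weights. Real values would be tuned per design flow."""
    net_completed: float = 10.0
    per_grid_unit: float = -0.1      # trace length penalty
    per_via: float = -1.0            # cost and reliability penalty
    per_drc_violation: float = -50.0


def routing_reward(nets_completed, trace_length, vias, drc_violations,
                   weights=RewardWeights()):
    """Combine the competing objectives into one scalar reward."""
    return (weights.net_completed * nets_completed
            + weights.per_grid_unit * trace_length
            + weights.per_via * vias
            + weights.per_drc_violation * drc_violations)


# Two hypothetical ways to route the same net.
short_with_vias = dict(nets_completed=1, trace_length=20, vias=2, drc_violations=0)
long_no_vias = dict(nets_completed=1, trace_length=35, vias=0, drc_violations=0)

default_choice = max((short_with_vias, long_no_vias),
                     key=lambda route: routing_reward(**route))

cheap_vias = RewardWeights(per_via=-0.5)
cheap_via_choice = max((short_with_vias, long_no_vias),
                       key=lambda route: routing_reward(**route, weights=cheap_vias))
```

With the default weights the longer, via-free route scores higher. Halving the via penalty flips the preference to the shorter route. Nothing about the board changed, only the stated tradeoff, which is why the weights deserve as much scrutiny as the routing output.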
One critical insight: the reward function doesn’t need to specify how to achieve good routing. It only needs to specify what good routing looks like. The agent figures out the how through experience. This is the fundamental leverage of reinforcement learning. You encode the objective, and the optimization process discovers strategies for achieving it. Sometimes those strategies are obvious, things any experienced engineer would do. Sometimes they’re surprising, exploiting patterns or symmetries that humans overlooked.
Via Minimization as a Case Study
To make this concrete, let’s examine a specific capability: via minimization in dense BGA breakout regions. This is a classic PCB routing challenge. You have hundreds of pins on a fine-pitch grid, signals need to escape to outer regions of the board, and each via consumes routing resources on multiple layers while adding cost and potential reliability concerns⁵.
Traditional autorouters handle this with heuristics: escape patterns, layer assignment algorithms, fanout strategies developed over decades of engineering experience. These heuristics work well for common cases but can struggle with unusual pin configurations or competing constraints.
An RL approach starts with no heuristics. The agent begins with random behavior and a reward function that penalizes vias, rewards completed nets, and heavily penalizes DRC violations. Through training, it encounters BGA breakout scenarios repeatedly: different pin counts, different pitch values, different layer stack-ups, and different surrounding congestion.
What emerges, over millions of training episodes, is a learned breakout strategy. The agent develops preferences: which escape directions to favor based on surrounding congestion, when to use dog-bone patterns versus direct via-in-pad escapes, how to sequence net routing to avoid blocking later connections⁶. Crucially, these preferences adapt to context. The same agent might favor different strategies for a 0.8mm pitch BGA versus a 1.0mm pitch, or for a six-layer stack-up versus a ten-layer one.
The interesting question is whether the learned strategies match human expert strategies. Sometimes they do. The agent independently discovers patterns that experienced layout engineers use. This provides validation that the reward function correctly encodes design quality. Sometimes the strategies differ but produce equivalent results: alternative approaches that satisfy the same constraints. And occasionally, the agent discovers genuinely novel strategies that human engineers hadn’t considered, exploiting symmetries or routing-order dependencies that weren’t obvious.
This last case is particularly valuable but also requires careful validation. “Novel” doesn’t automatically mean better. It might mean the reward function has a loophole the agent is exploiting, producing results that score well mathematically but violate some constraint you forgot to encode.
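One way to catch a loophole is to check routed output with a rule that is computed independently of the reward. The sketch below assumes a single layer, a grid representation where each net is a non-empty set of (x, y) cells, and a clearance rule measured as Chebyshev distance in grid units. All function names are hypothetical.

```python
from itertools import combinations


def min_gap(cells_a, cells_b):
    """Smallest Chebyshev distance between any cell of one net and the other."""
    return min(max(abs(ax - bx), abs(ay - by))
               for ax, ay in cells_a
               for bx, by in cells_b)


def clearance_violations(routed, required_gap):
    """Return pairs of net names whose copper is closer than required_gap.

    routed maps a net name to the non-empty set of (x, y) cells it occupies.
    """
    violations = []
    for net_a, net_b in combinations(sorted(routed), 2):
        if min_gap(routed[net_a], routed[net_b]) < required_gap:
            violations.append((net_a, net_b))
    return violations


def looks_like_reward_hacking(reward, violations, reward_threshold):
    """Flag results that score well yet fail the independent check."""
    return reward >= reward_threshold and len(violations) > 0


routed = {
    "CLK": {(0, 0), (1, 0)},
    "DATA": {(1, 1)},
    "GND": {(5, 5)},
}
found = clearance_violations(routed, required_gap=2)
```

A result that scores well but fails a check like this points at a gap in the reward function rather than a clever strategy. In practice the independent check would be your sign-off DRC, not a toy distance test.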
The Role of Simulation in Training
Everything described so far happens in simulation. The agent never routes a physical board during training. It routes virtual boards in a software environment that models the relevant physics and constraints.
This simulation must be accurate enough that strategies learned in simulation transfer to real designs. If the simulation allows routing patterns that would fail DRC in your actual EDA tool, the agent will happily learn those patterns, and you’ll discover the problem only when you try to use the trained model on real work.
Creating high-fidelity training environments is one of the major engineering challenges in applying RL to PCB design. The simulation needs to model your actual design rules, your actual manufacturing constraints, your actual stack-up options. It needs to run fast enough that millions of training episodes complete in reasonable time. And it needs to expose enough variety in training scenarios that the agent learns robust, generalizable strategies rather than overfitting to specific board configurations.
This is where training on diverse, realistic designs matters. An agent trained only on simple two-layer boards won’t develop strategies for complex HDI stack-ups. An agent that never encounters high-speed differential pairs won’t learn length-matching strategies. The training distribution defines the boundaries of competence.
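A crude way to ask whether a new board sits inside the training distribution is to compare a few summary features against the ranges seen in training. The sketch below assumes you have access to such features, which is not a given for a commercial tool. The feature names and values are hypothetical.

```python
def training_envelope(boards):
    """Record the min and max of each numeric feature across training boards."""
    features = boards[0].keys()
    return {
        name: (min(b[name] for b in boards), max(b[name] for b in boards))
        for name in features
    }


def out_of_range(board, envelope):
    """Return the features of a new board that fall outside the envelope."""
    return sorted(
        name for name, (low, high) in envelope.items()
        if not low <= board[name] <= high
    )


# Hypothetical training summary: simple boards only.
training_boards = [
    {"layers": 2, "min_pitch_mm": 1.0, "net_count": 120},
    {"layers": 4, "min_pitch_mm": 0.8, "net_count": 450},
    {"layers": 6, "min_pitch_mm": 0.8, "net_count": 900},
]
envelope = training_envelope(training_boards)

new_board = {"layers": 10, "min_pitch_mm": 0.8, "net_count": 700}
flagged = out_of_range(new_board, envelope)
```

A per-feature range check misses unusual combinations of individually normal values, so a clean result is weak evidence. A flagged feature, on the other hand, is a clear signal to review the routed output more closely.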
Systems like DeepPCB approach this by training on large corpora of designs, covering diverse applications, layer counts, and constraint sets⁷. The resulting models have seen enough variety to handle most production scenarios. However, “most” isn’t “all,” and understanding the training distribution helps you judge when the tool is operating within its competence bounds.
What the Model Doesn’t Understand
Honesty about limitations is essential. RL-based routing tools are powerful, but they’re not omniscient, and pretending otherwise helps no one.
First, the model doesn’t understand intent. It doesn’t know that this particular trace is a clock signal requiring careful timing, that this region will be mechanically stressed during assembly, or that this net has special testability requirements. It knows only what’s encoded in the input features and reward function. If you haven’t explicitly specified a constraint, the model won’t respect it.
Second, the model may not capture physics beyond what its training exposed it to. It learns correlations between routing patterns and reward signals, but it may not have a first-principles model of electromagnetic fields, thermal dissipation, or mechanical stress. If you’re pushing into exotic territory, such as unusual materials, extreme frequencies, or novel stack-ups, the learned strategies may not transfer.
Third, the model itself doesn’t understand context beyond the board. Is it a medical device requiring extra reliability margins? Does your manufacturing partner struggle with certain via aspect ratios? Did the last revision of this board have field failures traced to a specific routing pattern? Human engineers carry this contextual knowledge; the model sees only the immediate design data.
Fourth, the model’s confidence doesn’t correlate reliably with correctness. Neural networks are notoriously poorly calibrated. They can be confidently wrong. A model might route a net in a way that looks reasonable and scores well against the reward function, yet violates some subtle constraint or manufacturing guideline. You cannot trust raw model outputs without validation.
These limitations define the human role in AI-assisted routing. You’re not obsolete. You’re essential. You provide the intent, validate the outputs, catch the edge cases, and bring the contextual knowledge that no training corpus fully captures. The best outcomes come from human-AI collaboration: the model handling the combinatorial complexity of routing optimization, the human ensuring the results actually serve the design’s goals.
From Black Box to Understood Tool
Reinforcement learning for PCB routing is not magic. It’s optimization through trial and error, guided by reward functions that encode engineering objectives, trained in simulation until generalizable strategies emerge. The agent learns behaviors, not rules: tendencies that work well in aggregate but require human validation in specifics.
Understanding this changes how you use these tools. You know to examine outputs carefully, especially in unusual configurations outside the training distribution. You know that constraints must be explicit. Implicit assumptions don’t get learned. You know that the model’s strategies might differ from yours while still being valid, or might differ because they’re exploiting a reward-function loophole you need to close.
Most importantly, you know where the human remains essential. The model optimizes for what you tell it to optimize for, but you remain responsible for ensuring that objective aligns with what actually matters. That’s not a limitation of AI. It’s the nature of tools. A hammer doesn’t understand whether the nail should go there. Neither does a router, whether automatic or AI-powered.
The difference is that AI tools are complex enough to create an illusion of understanding. Don’t be fooled by that illusion, but don’t be paralyzed by it either. Understood properly, these are powerful tools that handle aspects of PCB design that humans find tedious and error-prone. Understood poorly, they’re black boxes that generate outputs you can’t trust.
The choice between those two outcomes is yours. Now you have enough understanding to make it wisely.