
How we gate code before it reaches production

DevOps Engineer
April 5, 2026 · 5 min read

When a human engineer pushes code, there is an implicit review step built into the workflow. They wrote it. They read it again. They thought about whether it was ready before they hit the button. That mental model breaks down when agents are committing to main at any hour, sometimes several times in a row, without the friction of a pull request or a second pair of eyes.

We had to rebuild that review layer deliberately. Not to slow things down, but because “anything goes to production if it compiles” is not a deployment strategy.

The problem with trusting the commit

The first thing we noticed was that agent-generated code passed tests more consistently than we expected. The agents we work with are careful about test coverage. They run the test suite before committing. They fix failures before they push.

But passing tests is not the same as being safe to deploy. An agent might fix a failing test by removing the assertion. It might commit a migration that works in isolation but breaks an existing query pattern. It might add a dependency that conflicts with another service’s pinned version. The tests were green. The deployment was still wrong.

The mistake we made early was treating a passing CI run as a deployment gate. It is necessary, not sufficient. A CI run tells you the code is internally consistent. It does not tell you the code is safe in the context of everything else running in production.

What a real gate looks like

We ended up layering several checks that CI does not cover by default.

The first is a dependency audit on every push. Not just whether the code compiles against the lock file, but whether any new dependency was added and whether it matches what other services expect. Agents sometimes introduce transitive dependency updates when they upgrade a single package, and those updates can silently break things downstream. The audit does not block the build automatically, but it creates a visible artifact that gets reviewed before any migration-related deploy.
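The shape of that audit can be sketched in a few lines. This is a minimal illustration, not our actual tooling: it assumes the lock file has already been parsed into `{package: version}` dicts (real formats like package-lock.json or poetry.lock need their own parsers), and `expected_pins` stands in for whatever registry of other services' pinned versions you maintain.

```python
# Hypothetical sketch: "before" and "after" are lock-file snapshots
# parsed into {package: version} dicts; "expected_pins" holds versions
# that other services pin and expect to stay compatible.

def audit_dependencies(before, after, expected_pins):
    """Flag packages that were added or changed by a push, plus any
    pin that conflicts with what another service expects."""
    report = {"added": {}, "changed": {}, "conflicts": {}}
    for pkg, version in after.items():
        if pkg not in before:
            report["added"][pkg] = version
        elif before[pkg] != version:
            report["changed"][pkg] = (before[pkg], version)
        # Conflict: another service pins this package to a different version.
        if pkg in expected_pins and expected_pins[pkg] != version:
            report["conflicts"][pkg] = (expected_pins[pkg], version)
    return report
```

The output is the "visible artifact" mentioned above: nothing blocks automatically, but an upgrade that drags in a transitive bump now shows up in `changed` or `conflicts` instead of slipping through silently.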

The second is a schema drift check. If any database migration file changed in a commit, we run a diff between the schema the migration produces and the schema currently running in the staging environment. A migration that adds a nullable column is usually safe. One that removes a column or renames an index is not, and we want to know before it hits production.
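A toy version of the classification step looks like this. It is a deliberate simplification: the real check diffs the produced schema against staging, whereas this sketch just pattern-matches raw migration SQL for the statement shapes we treat as destructive. The pattern list is an assumption, not an exhaustive taxonomy.

```python
import re

# Assumed list of statement shapes we treat as destructive; a real
# check would diff the migration's resulting schema against staging
# rather than pattern-match SQL text.
DESTRUCTIVE = (
    r"\bDROP\s+COLUMN\b",
    r"\bDROP\s+TABLE\b",
    r"\bALTER\s+INDEX\b.*\bRENAME\b",
)

def classify_migration(sql):
    """Return 'blocking' if the migration contains a destructive
    statement, else 'safe' (e.g. adding a nullable column)."""
    for pattern in DESTRUCTIVE:
        if re.search(pattern, sql, re.IGNORECASE):
            return "blocking"
    return "safe"
```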

The third is a build size check. This sounds trivial, but it has caught real problems. If the production bundle grows by more than a threshold percentage in a single commit, something unexpected was bundled. Once it was a fixture file that was accidentally imported in production code. Once it was a large vendor library that should have been lazy-loaded. The size check is a proxy for “did something unexpected get included.”
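The size gate itself is a one-comparison function. The 5% threshold below is an illustrative default, not the value we actually tuned; the only subtlety worth encoding is the missing-baseline case on a fresh pipeline.

```python
def check_bundle_growth(previous_bytes, current_bytes, threshold_pct=5.0):
    """Pass only if the production bundle grew by no more than
    threshold_pct in a single commit. threshold_pct=5.0 is an
    assumed example value."""
    if previous_bytes <= 0:
        # No baseline yet (first build): let it through and record one.
        return True
    growth = (current_bytes - previous_bytes) / previous_bytes * 100
    return growth <= threshold_pct
```

A 4% bump passes; the accidentally-imported fixture file that doubles the bundle does not.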

None of these are agent-specific. They would be useful in any codebase. What changed is that without agents, a human would often catch these things naturally during code review. With agents pushing continuously, that human judgment needs to be encoded somewhere.

Rollback is a strategy, not a fallback

The way most teams think about rollbacks is reactive. Something goes wrong, the alarm fires, someone reverts the last deploy and figures out the rest later. That works when deploys are infrequent and the last commit is probably the one to blame.

When you have multiple agents pushing code throughout the day, the “last deploy” might be several commits away from the actual cause. The system may have been running new code for an hour, and the issue only surfaces when evening traffic hits a code path that earlier load did not touch.

We changed how we think about this. Rollback is not the last resort after a failure. It is a deployment option we plan for before every release.

Every significant deploy gets a documented rollback procedure before it ships. Not a generic “revert the commit” note, but a specific sequence: what to run, what to check, what the expected state is after the rollback completes. For migrations, we require a corresponding down migration to be written and tested at the same time as the up migration. If the down migration cannot be written, that is a sign the change is not safely reversible, which changes the deployment strategy entirely.
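The up/down pairing requirement is easy to enforce mechanically. A sketch, assuming a file-naming convention like `0042_add_index.up.sql` / `0042_add_index.down.sql` (the convention itself is an assumption; adapt to your migration tool):

```python
from pathlib import Path

def missing_down_migrations(migration_dir):
    """Return the names of up migrations that have no matching down
    migration, assuming the *.up.sql / *.down.sql naming convention."""
    root = Path(migration_dir)
    ups = {p.name[: -len(".up.sql")] for p in root.glob("*.up.sql")}
    downs = {p.name[: -len(".down.sql")] for p in root.glob("*.down.sql")}
    return sorted(ups - downs)
```

A non-empty result fails the gate, which is exactly the signal described above: if the down migration cannot be written, the change is not safely reversible and needs a different deployment strategy.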

This shifts the cognitive work to before the deploy, when everyone is calm and the system is healthy, rather than after the deploy, when something is on fire.

Monitoring after a deploy is not optional

The last piece is post-deploy verification. After any commit hits production, we run a lightweight set of checks for about ten minutes: error rate, response time distribution, any new error codes that did not appear in the previous window. This is not the same as ongoing monitoring. It is a targeted signal that the deploy itself did not introduce an immediate regression.
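The comparison at the heart of that window can be sketched like this. The metric names and thresholds (`error_rate`, `p99_ms`, the 1.5x latency multiplier) are illustrative stand-ins, not our production values; the point is comparing a post-deploy window against a pre-deploy baseline rather than against absolute limits.

```python
def verify_deploy(baseline, current, max_error_rate_increase=0.5):
    """Compare a post-deploy metrics window against the pre-deploy
    baseline and return a list of reasons to flag the deploy.
    Metric names and thresholds here are assumed examples."""
    problems = []
    if current["error_rate"] > baseline["error_rate"] + max_error_rate_increase:
        problems.append("error rate jumped")
    if current["p99_ms"] > baseline["p99_ms"] * 1.5:
        problems.append("p99 latency regressed")
    # New error codes that never appeared in the baseline window.
    new_codes = set(current["error_codes"]) - set(baseline["error_codes"])
    if new_codes:
        problems.append("new error codes: " + ", ".join(sorted(new_codes)))
    return problems
```

An empty list after the window closes is the “quiet for ten minutes” signal; a non-empty one is the system telling the agent to look.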

Agents do not get tired and stop watching the dashboards. But they also do not inherently know when something they deployed caused a problem unless the system is designed to surface it quickly and clearly. The post-deploy check is how the system tells the agent: look at this, something changed.

What we found is that most post-deploy issues appear in the first five minutes or not at all. A deploy that is quiet for ten minutes is almost certainly stable. This gave us confidence to move faster, not slower, because we knew we would catch immediate regressions before they compounded.

The goal was never to slow agents down. It was to make the pipeline give the same guarantees it gave when humans were reviewing every change. That required treating the pipeline as a first-class engineering artifact, not a checklist that gets extended only when something breaks.