When a model fails the same gate twice

We run a small gauntlet before we trust a new model with anything that touches infrastructure. The first gate is the simplest thing we can check: can it call a tool with valid arguments. Most models pass. A few do not. The interesting cases are the ones that fail on the same simple input twice in a row.

What the first gate actually checks

Our first reliability gate is unglamorous. We hand the model a single task with one or two tools. The task does not need creativity. It needs the model to read the tool’s input schema, fill it in correctly, and emit a call that the runtime can parse.

That is harder than it sounds. Tool schemas are usually short enough that a fluent reader can hold them in working memory. They are also full of small constraints that are easy to violate: a flag that takes a string and not a list, a command field that does not accept newlines, a path that has to be absolute. A model that produces a confident-sounding tool call with the wrong shape is not a small failure. It is a failure of the most basic contract between the model and the runtime around it.
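
To make those constraints concrete, here is a minimal sketch of the kind of check a runtime might apply before accepting a call. The tool is hypothetical: the field names `flag`, `command`, and `path`, and the wording of the error messages, are assumptions chosen to mirror the three examples above, not our actual schema.

```python
import posixpath
from typing import Any


def validate_call(args: dict[str, Any]) -> list[str]:
    """Return a list of schema violations; an empty list means the call is valid.

    Hypothetical schema: `flag` is a string (not a list), `command` is a
    single-line string, and `path` is an absolute POSIX path.
    """
    errors: list[str] = []

    flag = args.get("flag")
    if not isinstance(flag, str):
        errors.append(f"flag: expected string, got {type(flag).__name__}")

    command = args.get("command")
    if not isinstance(command, str):
        errors.append(f"command: expected string, got {type(command).__name__}")
    elif "\n" in command:
        errors.append("command: must not contain newlines")

    path = args.get("path")
    if not isinstance(path, str) or not posixpath.isabs(path):
        errors.append("path: must be an absolute path")

    return errors
```

Returning every violation at once, rather than raising on the first, gives the model the full picture in a single reply.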

We do not treat this gate as a hard pass-fail on the first try. Models can have a bad sample. What we actually care about is whether the model recovers when the runtime returns a schema error. That is the second half of the gate: the model gets one structured retry that states exactly what was wrong, and then we look at what it does with that information.

The two-failure rule

A first failure is information. A second failure on the same input, with the schema error already in front of the model, is a different kind of information.

We stop after two failures on the same gate. The reason is not that the model is broken in some general sense. It is that we now know something specific: the model cannot reliably call this tool, even when shown the exact violation. That is not a bug we can route around. It is a fitness signal for the role.
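
The rule can be sketched as a short loop. This is an illustration under stated assumptions, not our harness: `run_first_gate`, `GateResult`, and the two callable signatures are hypothetical names, and the model is stood in for by any function that takes the task plus the previous error and returns tool-call arguments.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical interfaces, assumed for this sketch only.
Model = Callable[[str, Optional[str]], dict]  # (task, previous_error) -> call args
Validator = Callable[[dict], list]            # call args -> list of violations


@dataclass
class GateResult:
    passed: bool
    attempts: int
    errors: list  # one entry per failed attempt, each a list of violations


def run_first_gate(model: Model, validate: Validator, task: str,
                   max_attempts: int = 2) -> GateResult:
    """Give the model one structured retry, then stop."""
    feedback: Optional[str] = None
    history: list = []
    for attempt in range(1, max_attempts + 1):
        call = model(task, feedback)
        violations = validate(call)
        if not violations:
            return GateResult(True, attempt, history)
        history.append(violations)
        # The retry carries the exact violation text back to the model.
        feedback = "; ".join(violations)
    # Second failure with the error in hand: a fitness signal, not a bad sample.
    return GateResult(False, max_attempts, history)
```

The failed attempts are kept in `errors` so that whoever reads the result can see whether the second violation was the same as the first.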

The pattern that pushed us to formalize this rule was a model that produced clean prose, scored well on standard benchmarks, and confidently violated the same tool schema twice in a row when we tried it on a real task. The first violation was a string that should have been a list. The second violation, after we returned the schema error verbatim, was the same string in a slightly different shape. The model never registered that the runtime had given it a corrective signal.

A model that does not internalize a tool error from one turn to the next is not a model we can put in front of long-horizon work. The tasks we care about look like ten- or fifteen-step chains where the runtime returns small corrections all the way down. If turn two does not learn from turn one, turn ten is going to be a graveyard.

Why benchmarks miss this

Public benchmarks for tool use are usually run as one-shot calls against a fixed schema, scored on whether the call parses and the answer is right. Those benchmarks do not punish a model for ignoring an error message it has already seen. They are not designed to.

The failure mode we care about is not “can the model call this tool” but “can the model call this tool, and if it makes a mistake, can it use the runtime’s reply to fix the mistake.” Those are two different abilities. A model can score well on the first and have nothing to offer on the second.

We treat the second ability as load-bearing for everything else. A pipeline that depends on a model producing valid tool calls only when its first guess is right is not a pipeline. It is a coin flip. The thing that makes the rest of the system work is the loop: the model proposes, the runtime checks, the runtime replies, the model adjusts. Without the adjust step, every other tool, retry, and gate has to compensate for the gap.

What we do with the result

When a model fails the same gate twice, we do not retry. We do not move to a more lenient gate. We mark the model as unfit for the role and move on. The cost of getting this wrong is steep: a single tool-calling failure can cascade through a chain and require a human to untangle work that should have completed quietly.

We keep a small note for each model that records the specific failure mode. This is not for the model. It is for us. Six months from now, when someone asks why we are not using model X for a particular role, we want a one-line answer that is more useful than “we tried it and it didn’t work.” A note that says “X violates command schemas under retry pressure” is the kind of memory that survives turnover and saves the next round of evaluation from starting from zero.
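
One possible shape for such a note, as a sketch: the `FitnessNote` class and its field names are hypothetical, picked only to show that the record can stay small and still answer the question in one line.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FitnessNote:
    """A one-line record of why a model was marked unfit for a role."""
    model: str
    role: str
    gate: str
    failure_mode: str
    recorded_on: date

    def one_line(self) -> str:
        # Date first so that notes sort chronologically as plain text.
        return (f"{self.recorded_on.isoformat()} {self.model} unfit for "
                f"{self.role} ({self.gate}): {self.failure_mode}")
```

Recording the date alongside the failure mode also makes it easier to judge later whether a new release is recent enough to justify re-evaluation.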

The same model can pass the same gate later, after a new release. We do not assume failure is permanent. But the bar to re-evaluate is “the model has changed in a way that plausibly affects this,” not “we wish this had worked the first time.”

What this is really about

The work we do is full of small contracts. A tool schema is one of them. A run cycle is another. A status field on a task is a third. Most of the time, the model holds up its end of the contract and we never notice. The gates exist for the moments when it does not, and the discipline of believing them is what keeps the rest of the system honest.

The thing that took us a while to internalize is that benchmark scores and gate behavior measure different things. A model can be smart on paper and still fail to act on a corrective signal in front of it. When that happens twice on the simplest test we have, the cheapest thing we can do is take the result at face value, write down what we saw, and pick a different model for the role.
