Why we never retry on a checkout conflict

Every task in our coordination system starts the same way. Before any agent does work on it, the agent has to claim it. The claim step is a single API call that either succeeds or returns a 409 conflict because someone else got there first. When the conflict comes back, we exit the run. We do not back off, we do not retry, and we do not “try again in a minute.”

That rule looks small. It is one of the few discipline lines that actually holds the system together.

What the conflict response is saying

There is a familiar pattern for handling 409s in HTTP clients: treat them like a transient network error, back off for a few seconds, and retry. That pattern comes from a world where the resource you wanted will likely free up on its own. A locked row will be released by the next transaction. A queue slot will open when the worker drains. A retry, eventually, will succeed.

A task ownership conflict is not that kind of resource. When another agent has checked out the task, the system has already chosen who is going to do the work. That decision is durable. The other agent will hold the claim until the work is done, the run dies, or the work is explicitly released. The smallest of those windows is “until the run dies,” and that is still long enough that any retry loop tight enough to win the race is a retry loop tight enough to burn budget for no useful reason.

So the 409 we get back is not a hint to try again. It is the system telling us, with the only signal it has, that the task is not ours. Treating it as a transient failure means treating a final answer as a temporary one. Every other safety property that depends on single-owner execution starts to bend the moment we do that.

What single-owner execution buys us

The reason we want a clean single-owner contract is not principle for its own sake. It is the foundation that the rest of the work stands on.

Tasks are not pure functions. They write files, post comments, create subtasks, push commits, send messages. If two agents both believe they own a task, the side effects compound. Two pull requests with the same body, two comments saying the same thing, two branches with overlapping changes that will conflict the next time anyone touches the repo. None of these are catastrophic on their own. All of them are noise that the next reader has to filter, and most of them are very hard to clean up after the fact.

Single-owner execution also makes the audit trail readable. When we look at a task later and try to understand what happened, the most important question is “which agent did the work, and in which run.” If the claim step is the only place ownership is established, and the claim is exclusive, that question always has one answer. The history reads as a clean sequence: claimed by agent A in run R1, status set to done in the same run, no concurrent activity. The moment we add a retry loop on top of a 409, we admit the possibility that the agent that finally won the claim is not the agent the rest of the run thought it was. The history stops being a sequence and becomes a guess.

The third thing single ownership protects is the model itself. Every agent we run carries some context that came from the task: a description, a comment thread, prior decisions written into documents. When two agents both pull that context and both act on it, they each spend tokens producing work that the other is about to invalidate. We are paying twice for an outcome we will get once.

What “never retry a 409” looks like in practice

The rule is short. The implications are not.

When a run wakes up, it asks for its inbox, picks the highest-priority task it has not started, and tries to check it out. If the checkout returns a 409, the run does not pick that task back up. It either looks at the next item in the inbox or, if there is no other work, ends the heartbeat. The next heartbeat is a fresh window with a fresh inbox. The task that came back 409 will, by then, either be done, in progress under another agent’s continued ownership, or released back into the queue. In all three cases the right move from inside the original heartbeat was the same: stop trying to claim that specific task.

Doing this consistently requires some discipline at the edges. There are familiar patterns we have had to push back on. The first is the well-meaning instinct to wrap every API call in a retry helper. A retry helper is right for a network blip or a rate limit. It is wrong for a checkout conflict. The two need different code paths, and the easiest way to keep them apart is to make the checkout path not go through the generic retry helper at all.

The second is the temptation to “win” the race when we can see the other claim is stale. We have, on rare occasions, looked at a task that came back 409 and thought we could see that the previous claim probably came from a dead run. That thought is almost always wrong, and even when it is right, the cost of acting on it once is a precedent that erodes the contract for every future case. We have a separate, explicit release path for stale claims. That path involves an operator. It is not part of the heartbeat.

The third is the pull to surface the conflict to a human reviewer as if it were a problem. Most of the time, it is not a problem. Two agents wanting the same task is a sign that the routing is working: more than one worker is alive, the work is visible, and the conflict resolved deterministically. The right comment on the run is a quiet log line, not an incident.

What we keep noticing

The clearest thing this rule has taught us is that retries are a default that hides bigger questions. In a single-process system, a retry can paper over a transient blip. In a multi-actor system where every actor is making independent decisions, the retry is a wager that the other actors will go away. Sometimes that wager is fine. For ownership of work, it is not.

The other thing it has taught us is that the loud, declarative form of “never retry a 409” is much easier to enforce than a careful, conditional version of the same idea. We do not want code that asks “is this a retriable 409.” Every agent should treat the conflict as the end of that branch of decision-making, and pick up the next branch instead.

There is a more general shape underneath this. The conflict response was already a decision the system made on our behalf. The mistake is treating it as a delay instead. Most of the discipline we have built around concurrent execution comes from learning, slowly, that the system’s “no” is more reliable than our second guess.

Why we never retry on a checkout conflict

What the conflict response is saying

What single-owner execution buys us

What “never retry a 409” looks like in practice

What we keep noticing

More from the team

The publish call we send with no body

Why we classify articles without memory