The difference between a failed run and a failed task

A worker died in the middle of a task. The task is still marked in progress. Nothing is actively working on it. Somewhere between the moment the worker picked up the job and the moment we checked the queue, the process that was supposed to finish the work stopped existing.

The traditional response is to mark the task failed and alert a human. That is almost always the wrong thing to do. A failed run is not a failed task.

Why the distinction matters

A task is a description of work that needs to happen. A run is a single attempt to execute that work. The two have different lifecycles. A task can outlive many runs. A run can fail in ways that have nothing to do with whether the task is still valid.
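As a minimal sketch of the two lifecycles, assume a task record that owns a list of runs. Every name here (`Task`, `Run`, `TaskStatus`, `RunStatus`) is illustrative, not taken from any real schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class RunStatus(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


class TaskStatus(Enum):
    OPEN = "open"
    DONE = "done"
    INFEASIBLE = "infeasible"  # the work itself cannot be done


@dataclass
class Run:
    attempt: int
    status: RunStatus = RunStatus.RUNNING


@dataclass
class Task:
    description: str
    status: TaskStatus = TaskStatus.OPEN
    runs: list[Run] = field(default_factory=list)

    def start_run(self) -> Run:
        # Each run is one attempt; attempts are numbered from 1.
        run = Run(attempt=len(self.runs) + 1)
        self.runs.append(run)
        return run

    def fail_run(self, run: Run) -> None:
        # Only the attempt is marked failed; the task status is untouched.
        run.status = RunStatus.FAILED


task = Task("summarize the incident report")
first = task.start_run()
task.fail_run(first)
second = task.start_run()
```

After the first run fails, the task is still open and a second attempt can start against the same record.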

When we treat a failed run as a failed task, we lose information. The next process that looks at the work sees a dead ticket and has to reconstruct what was going on. We also pay real costs: retries that should have been cheap become investigations, and humans get paged for things that would have resolved themselves if the next attempt had just been allowed to start.

Put another way: a task fails when the underlying work is infeasible. A run fails when the attempt to do it could not finish. Most of what kills a run in practice is not about the task at all.

A small taxonomy

When we look at what actually kills runs in our system, the failures cluster into four groups, and each group calls for a different response.

Adapter failures. The process that powers the run cannot reach the model it depends on. A credential rotated without a deploy. The provider is returning 429s. A key was revoked. The task is fine. The runtime lost access to the model. Retry with backoff, after a sanity check that the credential is still valid. If that check fails, no amount of backoff will help, and the problem belongs to whoever owns the credential.

Host and network failures. The host OOM’d. The network path dropped a request mid-flight. A container got evicted. Again, the task is fine. Retry, possibly on a different host, with a circuit breaker so we don’t hammer a sick node.

Context exhaustion. The run got far enough to start real work, but burned through its reasoning budget before finishing. This is not a classical infrastructure failure, but from the outside it looks like one: the run stops, the task stays open. The difference is that retrying with the same inputs will almost always end the same way. The correct move is to split the task, attach better context, or escalate, not to retry.

Logic failures after completion. The run produced valuable output, then crashed during cleanup. The commit landed, the comment was posted, but the process died before it could mark the task done. Retrying here is worse than useless. It duplicates work and, depending on what the task touched, can leave the system in a worse state than if we had done nothing.

Each class has a different response. Conflating them costs us either reliability or money, usually both.
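The taxonomy can be sketched as a lookup from failure class to response. The enum names are hypothetical. The text says only that a crash after completion must not be retried, so the "reconcile" response is an assumption about what replaces the retry. Anything unclassified is surfaced rather than retried.

```python
from enum import Enum
from typing import Optional


class FailureClass(Enum):
    ADAPTER = "adapter"
    HOST_OR_NETWORK = "host_or_network"
    CONTEXT_EXHAUSTION = "context_exhaustion"
    CRASH_AFTER_COMPLETION = "crash_after_completion"


class Response(Enum):
    RETRY_WITH_BACKOFF = "retry_with_backoff"
    RETRY_ON_OTHER_HOST = "retry_on_other_host"
    SPLIT_OR_ESCALATE = "split_or_escalate"
    RECONCILE_NO_RETRY = "reconcile_no_retry"  # assumption: verify outputs, then close
    SURFACE_TO_OWNER = "surface_to_owner"


RESPONSES = {
    FailureClass.ADAPTER: Response.RETRY_WITH_BACKOFF,
    FailureClass.HOST_OR_NETWORK: Response.RETRY_ON_OTHER_HOST,
    FailureClass.CONTEXT_EXHAUSTION: Response.SPLIT_OR_ESCALATE,
    FailureClass.CRASH_AFTER_COMPLETION: Response.RECONCILE_NO_RETRY,
}

# Only these responses start another run automatically.
AUTOMATIC_RETRIES = {Response.RETRY_WITH_BACKOFF, Response.RETRY_ON_OTHER_HOST}


def respond(failure: Optional[FailureClass]) -> Response:
    # Unknown or unclassified failures are never retried by default.
    return RESPONSES.get(failure, Response.SURFACE_TO_OWNER)
```

Keeping the automatic-retry set explicit makes it obvious which classes are allowed to spend another run without anyone deciding.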

What we actually look at

When a run dies, before deciding what to do, we look at three things.

First, the state of the task. Were any outputs committed? Did a comment get posted with a status update? Is the repository in a stable state? If the answer is yes, the task may already be partially or fully complete. A blind retry would redo the work and potentially undo it in the process.
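The task-state check could be sketched as a guard that runs before any retry. The three fields mirror the questions above. The field names and the two outcomes are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskEvidence:
    outputs_committed: bool
    status_comment_posted: bool
    repo_stable: bool


def pre_retry_check(evidence: TaskEvidence) -> str:
    # Any visible side effect means the dead run may have done real work,
    # so a blind retry could duplicate or undo it.
    if evidence.outputs_committed or evidence.status_comment_posted:
        return "inspect_first"
    # An unstable repository is also not a safe starting point for a new run.
    if not evidence.repo_stable:
        return "inspect_first"
    return "eligible_for_retry"
```

Note that "eligible" only means the task state does not rule a retry out; the failure reason and the recent failure rate still have a say.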

Second, the tail of the run log. The last thing a dying process says is usually the most important thing it will ever say. A 401 is not the same as a timeout. A truncated response is not the same as an uncaught exception. We write our adapters to emit a final structured reason on every failure path, because that structured reason is the single strongest input to the retry decision. A free-text error message tells you something went wrong. A typed reason tells you what kind of wrong it was.
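A minimal sketch of a typed final reason, assuming the adapter writes one JSON line as its last act and the supervisor scans the log tail for it. The event name `run_final_reason` and the reason values are hypothetical.

```python
import json
from enum import Enum
from typing import Optional


class Reason(Enum):
    AUTH_REJECTED = "auth_rejected"            # e.g. an HTTP 401
    RATE_LIMITED = "rate_limited"              # e.g. an HTTP 429
    TIMEOUT = "timeout"
    TRUNCATED_RESPONSE = "truncated_response"
    UNCAUGHT_EXCEPTION = "uncaught_exception"


def final_reason_line(reason: Reason, detail: str) -> str:
    # Emitted on every failure path; the free-text detail rides along
    # but is not what the retry decision reads.
    return json.dumps(
        {"event": "run_final_reason", "reason": reason.value, "detail": detail}
    )


def read_final_reason(log_tail: list[str]) -> Optional[Reason]:
    # Walk backwards: the last structured record is the one that counts.
    for line in reversed(log_tail):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ordinary free-text log line
        if isinstance(record, dict) and record.get("event") == "run_final_reason":
            try:
                return Reason(record.get("reason"))
            except ValueError:
                return None  # structured, but not a reason we recognize
    return None  # the process died without saying why
```

A `None` result is itself informative: a run that died without emitting a reason was probably killed from outside, which points at the host rather than the adapter.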

Third, the failure rate across recent runs. One failure is a run problem. Ten failures in a row on the same worker is an infrastructure problem. A hundred failures across workers is a provider problem. The aggregate shape of failures usually decides the correct response more than any single run does.
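The aggregate check might look like the sketch below. The thresholds follow the rough numbers in the paragraph above (ten in a row on one worker, a hundred across workers). Treat them, and the `RunOutcome` record, as illustrative assumptions rather than tuned values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunOutcome:
    worker: str
    failed: bool


def failure_scope(
    recent: list[RunOutcome],
    worker: str,
    worker_streak: int = 10,
    fleet_failures: int = 100,
) -> str:
    """Classify a failure by the shape of recent outcomes (oldest first)."""
    failures = [o for o in recent if o.failed]
    failing_workers = {o.worker for o in failures}

    # Many failures spread over several workers: look upstream.
    if len(failures) >= fleet_failures and len(failing_workers) > 1:
        return "provider"

    # An unbroken streak on this one worker: suspect the worker or its host.
    own = [o for o in recent if o.worker == worker]
    tail = own[-worker_streak:]
    if len(tail) == worker_streak and all(o.failed for o in tail):
        return "infrastructure"

    return "run"
```

The provider check comes first on purpose: during a provider outage every worker also shows a local streak, and the wider explanation should win.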

Retries are cheap only when they are right

The appeal of automatic retry is that it hides flakiness. Most of the time, the next attempt works, and nobody notices the first one died. That is a good outcome when the failure was an adapter blip. It is a bad outcome when the failure was context exhaustion, because the retry consumes another run’s budget and produces the same result. And it is a genuinely costly outcome when the failure was a crash after completion, because the same work gets committed twice.

We think of automatic retry as a choice that has to be justified, not a default. Our pipeline retries automatically only for a short list of well-understood failure modes, and surfaces everything else to a human or to the task owner for a decision. The list is shorter than it feels like it should be, and that has been the right tradeoff.

The sibling rule is that retries should be visible. When a run is retried, that fact belongs in the task’s history, not buried in a log. A task that has been retried four times is telling us something different from a task that succeeded on the first attempt, even if the final state looks identical.
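One way to keep retries visible is to record each one on the task itself. This is a sketch with hypothetical names (`TaskHistory`, `RetryEvent`); it assumes the first run is attempt 1, so the first retry is attempt 2.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class RetryEvent:
    attempt: int
    reason: str
    at: datetime


@dataclass
class TaskHistory:
    task_id: str
    retries: list[RetryEvent] = field(default_factory=list)

    def record_retry(self, reason: str) -> RetryEvent:
        event = RetryEvent(
            attempt=len(self.retries) + 2,  # attempt 1 was the original run
            reason=reason,
            at=datetime.now(timezone.utc),
        )
        self.retries.append(event)
        return event

    def summary(self) -> str:
        if not self.retries:
            return f"{self.task_id}: no retries"
        reasons = ", ".join(e.reason for e in self.retries)
        return f"{self.task_id}: retried {len(self.retries)}x ({reasons})"


history = TaskHistory("task-1")
history.record_retry("rate_limited")
history.record_retry("timeout")
```

Two tasks that both end up done now read differently in their history, which is the point.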

What this is really about

The failure modes of a traditional job queue are mostly about the environment the job runs in: the network, the host, the filesystem, the database. The job itself is usually a deterministic function of its inputs. If it ran today, it will run tomorrow.

Our workers are not deterministic. They reason their way through tasks. They can exhaust their reasoning budget. They can misread a requirement and produce output that looks correct but is not. They can also, sometimes, succeed at a task and then die during the bookkeeping.

That last case is the one that changed how we think about operations. A run is not a unit of work we own. It is a unit of attempt. The work belongs to the task, and the task survives its runs.
