Why we design APIs for replay, not retry

Most systems assume retry. When an operation fails, the same caller tries again with the same intent. That assumption breaks the moment the caller can forget.

When the caller is a human at a terminal, retry is free. The intent lives in their head. They read the error, wait, and run the command again. When the caller is a long-running process with durable state, retry still works. The state carries the intent across the gap.

When the caller is an agent whose context resets between the failure and the next attempt, retry fails quietly. The next attempt is not the same caller. It is a different caller, with the same task description and a blurrier picture of what has already happened. “Retry” becomes a promise the system can’t keep.

So we stopped designing for retry. We design for replay.

The distinction

Retry means: the same logical caller, with the same intent, tries the same operation again. The server does not need to know about the first attempt, because the caller is remembering it.

Replay means: any caller, possibly a different one, can observe what has already happened and decide what to do next. The server has to be the source of truth about progress, not the caller.

The difference shows up in every endpoint we touch. A POST /orders that returns a fresh order id and assumes the caller will hold onto it is retry-shaped. If the caller forgets, we get a duplicate order. A POST /orders that accepts a client-supplied key and returns the same response on every subsequent call with that key is replay-shaped. A new caller with the same task can ask, “has this been done?” and get a definitive answer.

What this costs

Replay costs more to build. Every mutating endpoint needs:

A stable key that identifies the intent, not the attempt
A way to look up prior results by that key
A way to distinguish “in progress” from “done” from “not started”

That means keys in the request body, state stored server-side, and explicit terminal states. It rules out endpoints that only return on success and leave the caller to guess what happened on timeout.

The tradeoff is real. Our internal endpoints carry more state than they would otherwise. Our clients carry less. We have moved work from the edge to the center on purpose.

The hardest part is the key

Most of the difficulty in replay-shaped design is not in the storage. It is in picking the key.

A good key identifies the intent without being so specific that it accidentally fingerprints the attempt. “Send payment of $50 to supplier 7 on April 24th” is an intent. A UUID generated fresh on every call is not. Using the UUID means each attempt is a new intent, which collapses replay back into retry.

The keys we trust most are ones the upstream planner assigns before any network call happens: “this is subtask 3 of goal G”, “this is the eleventh run of routine R”. The key is chosen where the intent was chosen. If the caller dies, a replacement caller can derive the same key from the same context and pick up where the first one stopped.

The keys we trust least are ones generated inside a function, because a caller that forgets will not regenerate the same value. Client-side random UUIDs are effectively retry-only.

What we gained

The payoff is that we stopped writing recovery code. When an agent times out mid-transaction, the next heartbeat does not need to reconstruct the state. It asks the server. If the work is done, the server says so and the agent moves on. If the work is partial, the server describes what is left. If the work never started, the server accepts the request as new.

None of that requires the agent to remember anything across the gap.

A second benefit, one we did not expect: replay-shaped APIs let us hand work between agents without ceremony. When a task changes ownership, the new agent reads the current state and picks up where the old one left off. There is no special handoff logic. The API is already built for it.

Where we draw the line

We do not make every endpoint replayable. Read endpoints do not need to be. Endpoints whose side effects are cheap to repeat, like writing a log line, stay retry-shaped and we accept the duplicates.

The rule is: if repeating the operation would cause a measurable problem for someone, it has to be replayable. A duplicate order is a problem. A duplicate log line is not.

That is less rigorous than “idempotency everywhere”, but it is what we can afford. Replay has a cost, and we pay it where it matters.

What the habit produced

The thing we did not anticipate is how the design principle reshaped the way we talk about work itself. We stopped asking “what does the caller need to do on failure?”, because that frames the problem wrong. The question became: “what does any observer need to be able to ask the system to understand where we are?”

That second question is more useful. It pushes us toward durable state, clear transitions, and keys that outlive individual attempts. It also makes the system more debuggable for humans, which we did not set out to do.

It turns out an API that is safe for a forgetful agent to call is also an API that is easy for a tired engineer to reason about.