How we made our deploys safe to interrupt

A heartbeat ran out two minutes into a deploy. The build had uploaded, the new bundle was sitting in the platform’s staging area, and the agent that started the command was already gone. Nothing had rolled back. Nothing had rolled forward. The next agent to pick up the task had no way to know what state the deploy was in.

That was the day we realized our deploy commands quietly assumed something that wasn’t true: that whoever started a deploy would still be there when it ended. For human operators that is a reasonable assumption. For agents working inside finite execution windows, it is not. The operator might disappear mid-upload, mid-promotion, mid-anything. The current step might be the last step they ever run.

We rebuilt our deploy paths around the opposite assumption: any step might be the operator’s last. It changed almost everything.

Idempotency at every step

The first thing we audited was every command in the deploy chain. For each one we asked the same question: if this runs twice, does anything break? If the answer was anything except “no,” we changed the command.

Some changes were obvious. A mkdir became mkdir -p. A migration runner started checking a versions table before applying anything. A “create resource” call became a “create resource if not exists” call. A symlink update became atomic.
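Two of those obvious changes can be sketched in a few lines of Python. This is an illustration rather than the actual deploy code: the function names ensure_dir and atomic_symlink are hypothetical, and the atomic swap assumes a POSIX filesystem, where renaming over an existing path replaces it in one step.

```python
import os
import uuid


def ensure_dir(path: str) -> None:
    """Equivalent of `mkdir -p`: succeeds whether or not the directory exists."""
    os.makedirs(path, exist_ok=True)


def atomic_symlink(target: str, link_path: str) -> None:
    """Point link_path at target with no moment where the link is missing."""
    link_dir = os.path.dirname(os.path.abspath(link_path))
    # Create the new link under a unique temporary name in the same directory,
    # then rename it over the old link. The rename replaces the link itself,
    # so readers see either the old target or the new one, never nothing.
    tmp_path = os.path.join(link_dir, f".tmp-link-{uuid.uuid4().hex}")
    os.symlink(target, tmp_path)
    os.replace(tmp_path, link_path)
```

Calling either function a second time with the same arguments leaves the filesystem in the same state as calling it once, which is the property the audit was checking for.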

Other changes were less obvious. We had a deploy step that wrote a build artifact to a versioned path, then updated a latest pointer. The first part was idempotent. The second part raced with itself if two agents triggered the same deploy. We fixed it by switching to a compare-and-set update on the pointer, where the new version had to be greater than the current one. Two simultaneous deploys could no longer step on each other.
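The pointer update can be sketched as a single conditional write. The example below is a hedged illustration, not the real pointer store: it assumes the pointer lives in a SQL table (here an sqlite3 table named pointers, which is hypothetical) and that versions are integers.

```python
import sqlite3


def init_pointer(conn: sqlite3.Connection) -> None:
    """Create the pointer table and the 'latest' row if they are missing."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pointers ("
        "name TEXT PRIMARY KEY, version INTEGER NOT NULL)"
    )
    conn.execute(
        "INSERT OR IGNORE INTO pointers (name, version) VALUES ('latest', 0)"
    )
    conn.commit()


def advance_latest(conn: sqlite3.Connection, new_version: int) -> bool:
    """Move 'latest' forward only if new_version is greater than the stored one.

    The comparison and the write happen in one UPDATE statement, so two
    callers cannot both read the old value and then overwrite each other.
    Returns True if this call moved the pointer.
    """
    cur = conn.execute(
        "UPDATE pointers SET version = ? WHERE name = 'latest' AND version < ?",
        (new_version, new_version),
    )
    conn.commit()
    return cur.rowcount == 1
```

A return value of False does not mean the step failed. It means the pointer was already at or past the requested version, so a retried deploy can treat it as a no-op and continue.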

Once every step was idempotent, “what state is the deploy in” stopped being a question. The answer was always: whatever state the last successful step left it in. Re-running the deploy from the start always converged.

Three phases, not one

The deploy command we inherited, written before agents ran our deploys, did three things in one shell invocation: build the artifact, publish it to the platform, and switch traffic to the new version. From a human operator’s perspective this was a single mental step. From an agent’s perspective, a single failure mid-command could leave us in any of three different intermediate states, and recovery looked different in each one.

We split the command into named phases. Building is one operation. Publishing is another. Promoting is a third. Each phase exits cleanly with a status the next phase can read. If a heartbeat ends after publish, the next agent starts at promote. If a heartbeat ends mid-promote, the next agent runs promote again, and idempotency takes care of the rest.
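The resume logic can be sketched as follows. This is a minimal illustration under stated assumptions: the phase names come from the text, while the state dictionary, the last_completed key, and the next_phase and resume helpers are hypothetical, and every step function is assumed to be idempotent.

```python
from typing import Callable, Dict, Optional

PHASES = ["build", "publish", "promote"]


def next_phase(last_completed: Optional[str]) -> Optional[str]:
    """Return the phase to run next, or None when the deploy is finished."""
    if last_completed is None:
        return PHASES[0]
    idx = PHASES.index(last_completed)
    return PHASES[idx + 1] if idx + 1 < len(PHASES) else None


def resume(
    state: Dict[str, Optional[str]],
    steps: Dict[str, Callable[[], None]],
) -> None:
    """Run the remaining phases, recording each one only after it succeeds."""
    while (phase := next_phase(state["last_completed"])) is not None:
        # If the agent dies inside this call, last_completed is unchanged,
        # so the next agent simply runs the same phase again.
        steps[phase]()
        state["last_completed"] = phase
```

Because a phase is recorded only after it returns, an interruption can never mark work as done that was not done. The worst case is that one phase runs twice.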

The split also gave us something we didn’t expect: better rollbacks. The published artifact stays around even after the promote. When we need to roll back, we don’t redeploy. We re-promote the previous version. That used to take ten minutes. It now takes seconds.
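Seen this way, a rollback is just a promote with an older artifact id. The sketch below assumes a hypothetical in-memory Platform class standing in for the real platform, and a hypothetical promote function; it only illustrates the shape of the idea.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Platform:
    """Hypothetical stand-in for the deploy platform."""
    published: Set[str] = field(default_factory=set)  # artifacts kept after publish
    live: Optional[str] = None                        # artifact receiving traffic


def promote(platform: Platform, artifact_id: str) -> bool:
    """Point traffic at an already-published artifact. Safe to repeat.

    Returns True if traffic moved, False if it was already on artifact_id.
    """
    if artifact_id not in platform.published:
        raise ValueError(f"artifact {artifact_id!r} was never published")
    changed = platform.live != artifact_id
    platform.live = artifact_id
    return changed
```

Rolling back from v2 to v1 is then promote(platform, "v1"). Naming the target explicitly, rather than asking for "the previous version", keeps the operation idempotent if an interrupted agent's rollback is retried.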

State that lives somewhere durable

Agents do not share memory between heartbeats. The state of an in-flight deploy can’t live in the operator’s head, because there is no operator between heartbeats. It has to live somewhere any agent can read.

We started writing deploy state to the platform itself: a small record per deploy, updated at every phase transition, with the artifact id, the phase, and the timestamp of the last update. An agent picking up a deploy task starts by reading that record. They can tell at a glance whether the build is done, whether the artifact is published, whether traffic is on the new version. They can also tell whether a deploy is stale. A phase that hasn’t advanced in twenty minutes is probably orphaned, and we treat it accordingly.
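A record like this can be sketched in Python. The real record lives on the platform; the example below uses a local JSON file as a stand-in, and the field names, function names, and file layout are all assumptions. Only the three fields and the twenty-minute threshold come from the text.

```python
import json
import os
import time
from typing import Optional

STALE_AFTER_SECONDS = 20 * 60  # twenty minutes without a phase transition


def write_state(
    path: str, artifact_id: str, phase: str, now: Optional[float] = None
) -> dict:
    """Record a phase transition, replacing the previous record atomically."""
    record = {
        "artifact_id": artifact_id,
        "phase": phase,
        "updated_at": time.time() if now is None else now,
    }
    tmp_path = f"{path}.tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(record, fh)
    os.replace(tmp_path, path)  # readers never see a half-written record
    return record


def read_state(path: str) -> Optional[dict]:
    """Return the current record, or None if no deploy has been recorded."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return None


def is_stale(record: dict, now: Optional[float] = None) -> bool:
    """True if the record has not been updated within the staleness window."""
    now = time.time() if now is None else now
    return now - record["updated_at"] > STALE_AFTER_SECONDS
```

A real version would also need to exclude deploys that have reached their final phase, since a finished deploy stops updating its record without being orphaned.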

The cost of writing the state record was small. The cost of not having it was a recurring class of confusing failures where the deploy was actually fine, but no one could tell.

No prompts, no animations, no surprises

The last category of changes was about how the deploy commands talked to whoever was running them. We had a few commands that asked “Are you sure? [y/N]” before doing something irreversible. We had a few that printed animated progress bars. We had at least one that paged its output through a tool that expected a TTY.

None of this works for an agent. A confirmation prompt with no one to confirm it just hangs until the timeout. An animated progress bar fills the log with carriage returns and ANSI escapes that make the output unreadable when it gets pulled into a comment. A pager that needs a TTY freezes.

We removed every interactive prompt and replaced the ones that mattered with explicit flags. The agent has to pass --confirm-destructive to do anything destructive, which is a deliberate choice the agent has to make and log. We disabled all progress animations in CI mode. We made sure every command writes to stdout in plain lines, no escape codes, one event per line.
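The pattern can be sketched with a small command-line entry point. Only the --confirm-destructive flag comes from the text; the command name purge-artifacts, the emit helper, the event names, and the exit code are hypothetical choices made for the example.

```python
import argparse
from typing import List, Optional


def emit(event: str, **fields: object) -> None:
    """Write one event per line to stdout: plain text, no escape codes."""
    parts = [event] + [f"{key}={value}" for key, value in fields.items()]
    print(" ".join(parts), flush=True)


def main(argv: Optional[List[str]] = None) -> int:
    parser = argparse.ArgumentParser(prog="purge-artifacts")
    parser.add_argument(
        "--confirm-destructive",
        action="store_true",
        help="required; replaces the interactive 'Are you sure?' prompt",
    )
    args = parser.parse_args(argv)

    if not args.confirm_destructive:
        # Fail fast instead of waiting on input that will never arrive.
        emit("refused", reason="missing-confirm-destructive")
        return 2

    emit("confirmed", flag="confirm-destructive")
    # The destructive work would go here.
    emit("done")
    return 0
```

Without the flag the command exits immediately with a non-zero status, so an unattended run fails visibly instead of hanging until a timeout. With it, the log shows that the caller opted in explicitly.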

A side effect of all this was that humans started preferring the new versions of the commands too. It turns out that “no surprises” is a feature for everyone, not just for agents.

Where we ended up

Deploys feel different now. There is less ceremony around them. We don’t really have a “deploy window” anymore, because deploys aren’t risky in the way they used to be. Any agent can pick up a deploy task at any time, run it, get interrupted, get resumed by another agent, and the deploy still ends in the same place.

The thing we did not expect was how much of this thinking applied outside of deploys. Migrations got the same treatment. Backups did. Anything where “the operator is still here” used to be an implicit assumption is now a place where we ask the same question. If the heartbeat ends right now, what happens next? If we can’t answer that easily, the command isn’t ready.
