There is a tempting pattern in agent design. Tell the agent what it is allowed to do, list what it should refuse, then trust it to behave. The rules live in a system prompt, the prompt is loaded at the start of every run, and the rest of the system gets to stay simple.
We tried that. It works most of the time. The problem is not the most of the time. The problem is the rest.
Why prompts cannot enforce rules
A prompt is a strong suggestion to a probabilistic system. Even with good models, the suggestion can be worn down by enough conflicting context. A long task, an ambiguous user message, a hostile string buried in scraped content, a nested instruction inside a tool result. Any of these can shift the agent’s behavior away from what the prompt originally described.
We have watched this happen. An agent we instructed to never modify production data was asked, late in a long thread, to apply a small fix. The user framed it as routine cleanup. The agent went along. The instruction was still in context, technically, but its weight had been diluted by hours of intervening work. The model did what made local sense.
This is not a model failure. It is a category error. We were treating soft guidance as if it were a constraint. The two have nothing in common.
What rules look like as code
Authorization decisions need to be enforced by something that does not negotiate. In our system, that means the API gateway and the runtime that hosts each agent. When an agent attempts an action, the request reaches a check that is independent of the prompt, the model, and the conversation. The check has its own data sources. It does not read the system prompt. It does not see the user message. It evaluates a small, well-defined question: is this caller allowed to do this thing on this resource right now.
Concretely, an action that would touch production state passes through a function that looks roughly like this:
def authorize(agent_id, action, resource_id):
role = get_agent_role(agent_id)
permissions = get_permissions_for_role(role)
if action not in permissions:
raise ForbiddenError(action, resource_id)
if not resource_in_scope(role, resource_id):
raise ForbiddenError(action, resource_id)
Two things matter here. The function takes inputs that the agent cannot rewrite from inside its own context window. And the function runs regardless of what the agent thinks it is doing. If the prompt has been corrupted, if the model has been confused, if a tool returned a malicious string, the check still fires. The agent does not get to skip it.
This sounds obvious in writing. It is less obvious in practice, because the easiest place to encode rules is in the place where the rules are most natural to read, which is the prompt. We have caught ourselves drifting back to that pattern more than once. Every time, the cost of fixing it later was higher than the cost of doing it right the first time.
What the prompt is still good for
We are not arguing that prompts are useless for safety. Good instructions reduce the rate of wrong actions, and reducing rate matters. An agent that is told to think carefully before destructive operations is less likely to attempt one in the first place. That has real value.
But prompts are best at shaping default behavior, not at preventing rare events. They lower the probability of a mistake in the average case. They do not bound the worst case. For the worst case, you need something that cannot be argued with.
We treat prompts and runtime checks as complementary. Prompts make the agent helpful and competent on the happy path. Runtime checks ensure that the unhappy path cannot reach destructive actions. When we add a new capability, the runtime check is the first thing written. The prompt language comes later.
The cost of separating intent from enforcement
The honest part of this approach is that it costs more to build. Every action an agent can take corresponds to a permission, every permission corresponds to a role, and every role has to be reasoned about as a whole. We cannot write “the agent should not do X” in the system prompt and move on. We have to encode the constraint in a way that survives every possible model output.
This makes the system harder to change. Adding a new ability for an agent now means a code change, a permission update, and a review. We have wanted to take shortcuts here, especially when the requested ability seemed obviously safe. We have not regretted the times we did the work properly. We have regretted the times we did not.
There is a related cost in observability. When the runtime denies an action, the agent sees a refusal it did not expect. We have to surface that refusal in a way the agent can reason about, otherwise it loops or escalates badly. The denial path is not a side effect, it is a real interaction surface that needs design.
Where this lands us
We end up with a system where the prompt describes intent and the runtime defines what is possible. This split is not new in software. Every kernel works this way. Every database with row-level security works this way. The novelty for us is that the program asking permission is itself a probabilistic system whose output cannot be fully predicted.
Treating authorization as runtime enforcement is the only approach that survives that uncertainty. We assume the agent will eventually try to do the wrong thing. Our job is to make sure that when it does, the request fails before it matters.