What GPT-5.5 actually changes for people building agents

OpenAI released GPT-5.5 on April 23, six weeks after GPT-5.4. The gap between frontier releases used to be counted in years. Now it is counted in weeks. That cadence is the first thing worth registering about this launch, ahead of any single benchmark number. For anyone building agents, the model you chose last month may already have been superseded, and the shape of the decision is starting to change.

The marketing frame around the release is that GPT-5.5 is OpenAI’s most capable frontier model and the first fully retrained base model since GPT-4.5. Greg Brockman described it as “a faster, sharper thinker for fewer tokens” and “a new class of intelligence for real work.” Those are reasonable sentences for a press release. They are not why we care. We care about the parts that change what an agent can do in production, and the parts that change the math on what an agent costs to run.

What is actually new

GPT-5.5 is natively omnimodal. Text, images, audio, and video run through a single unified architecture instead of bolted-on modality heads. The context window reaches 400,000 tokens when accessed through Codex. OpenAI is shipping three variants, Standard, Thinking, and Pro, with the Pro tier restricted to Pro, Business, and Enterprise plans. API access is coming “very soon,” which means the article you are reading will age before anyone can write production code against the raw endpoint.

The capability list that matters for agents is shorter than the full feature sheet. GPT-5.5 is meaningfully better at multi-step autonomous execution. It plans, calls tools, checks its own output, and loops until a task is done, with less handholding than 5.4 required. Computer use and OS navigation have moved forward. Long-context retrieval has moved forward. The interpretation of ambiguous instructions, which is really a measure of how often an agent asks you to clarify versus just making a reasonable choice, has moved forward.
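
The plan, call tools, check, and loop cycle described above can be sketched as a small control loop. This is a minimal illustration, not how GPT-5.5 or any OpenAI product implements it: run_agent, plan, act, check, and max_iters are hypothetical names, and the three callables stand in for whatever model and tool calls a real stack would make.

```python
from typing import Callable, List, Tuple


def run_agent(
    task: str,
    plan: Callable[[str, List[str]], str],   # hypothetical: picks the next action
    act: Callable[[str], str],               # hypothetical: runs a tool, returns its output
    check: Callable[[str, str], bool],       # hypothetical: self-check on the output
    max_iters: int = 5,
) -> Tuple[str, int]:
    """Loop plan -> act -> check until the check passes or the budget runs out."""
    notes: List[str] = []  # failure notes fed back into the next planning step
    for attempt in range(1, max_iters + 1):
        action = plan(task, notes)
        output = act(action)
        if check(task, output):
            return output, attempt
        notes.append(f"attempt {attempt} failed: {output!r}")
    raise RuntimeError(f"gave up after {max_iters} attempts")
```

The hard cap on iterations is the part worth keeping in any real version. A model that needs less handholding makes the loop shorter on average, but the loop still needs a defined way to stop.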

Benchmarks, with the honest context

The benchmark story is mostly good for GPT-5.5 and mostly uncomfortable for Claude Opus 4.7, with one loud exception.

Benchmark	GPT-5.5	Claude Opus 4.7
Terminal-Bench 2.0 (agentic CLI)	82.7%	69.4%
FrontierMath Tier 4 (Pro variant)	39.6%	22.9%
GDPval (economically valuable tasks)	84.9%	—
OSWorld-Verified (computer use)	78.7%	78.0%
CyberGym	81.8%	73.1%
MRCR v2 (long-context retrieval)	74%	—
SWE-bench Pro (real GitHub issues)	58.6%	64.3%

The Terminal-Bench gap is the number that changes the most in practice. A thirteen-point spread on agentic CLI work is not a rounding error. Neither is the FrontierMath Tier 4 result, where the Pro variant scores roughly 1.7 times Claude Opus 4.7's number on a benchmark that covers postdoc-level mathematics and is genuinely hard to game.

SWE-bench Pro is the asterisk. On real-world GitHub issue resolution, Claude Opus 4.7 still wins by almost six points. That benchmark is the closest public proxy for “fix a real bug in a real codebase.” The honest read is that GPT-5.5 is ahead on synthetic agentic tasks and several specialized domains, and behind on the messy, context-heavy work of navigating an existing software project. Anyone building a coding agent should take both numbers seriously and pick the model that matches the workload, not the one on top of the press release.

What this means for agent architecture

Three things in GPT-5.5 change how we would build an agent, not just how much it would cost.

The first is token efficiency. OpenAI and several of the early-access partners report that GPT-5.5 completes the same task in fewer tokens than 5.4. That is a structural change, not a marginal one. Fewer tokens per step means reflection loops, self-check passes, and tool-use retries become cheaper in absolute terms. Agent architectures that were too expensive to run at 5.4 token counts move back into range.

The second is computer use at 78.7% on OSWorld-Verified. That is close to Claude Opus 4.7, which sits at 78.0%, and both are close enough to “usable but not reliable” that we still would not trust either one for unattended click-through work on a production account. The right read is that computer use is now good enough to be a fallback path when an API does not exist, not a replacement for APIs that do. A score of 78.7% still leaves roughly one benchmark task in five failing, which is too many to run without supervision.
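
One way to encode "API first, computer use as a supervised fallback" is a thin wrapper around each action. The sketch below is an assumption about how such a wrapper might look, not a documented pattern from either vendor: run_action, api_call, ui_call, and approve are hypothetical names.

```python
from typing import Callable, Optional, Tuple


def run_action(
    description: str,
    api_call: Optional[Callable[[], str]],  # None when no API exists for this action
    ui_call: Callable[[], str],             # hypothetical computer-use path
    approve: Callable[[str], bool],         # hypothetical human or policy gate
) -> Tuple[str, str]:
    """Prefer the API; fall back to UI automation only after approval."""
    if api_call is not None:
        return "api", api_call()
    # No API available: the UI path runs only if a supervisor signs off first.
    if not approve(description):
        raise PermissionError(f"UI action not approved: {description}")
    return "ui", ui_call()
```

The approval check runs before the UI action rather than after it, because a click on a production account cannot be reviewed into not having happened.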

The third is the 400K context window in Codex. It does not eliminate the need for retrieval. It does reduce the amount of aggressive summarization a long-running agent has to do mid-task, which is where a lot of quality loss has historically crept in. The context window is not a substitute for good context management. It is a bigger margin before bad context management starts hurting.
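
To make the "bigger margin" point concrete, here is a rough sketch of budget-triggered compaction: history is left verbatim until it exceeds a token budget, and only then are the oldest messages summarized. Everything here is illustrative. estimate_tokens is a crude four-characters-per-token heuristic rather than a real tokenizer, and compact and summarize are hypothetical names.

```python
from typing import Callable, List


def estimate_tokens(text: str) -> int:
    # Crude assumption: about 4 characters per token. Not a real tokenizer.
    return max(1, len(text) // 4)


def compact(
    history: List[str],
    budget: int,
    summarize: Callable[[List[str]], str],  # hypothetical summarizer call
) -> List[str]:
    """Return history unchanged if it fits; otherwise summarize the oldest part."""
    total = sum(estimate_tokens(m) for m in history)
    if total <= budget:
        return list(history)

    # Keep the newest messages verbatim, up to half the budget.
    kept: List[str] = []
    used = 0
    for message in reversed(history):
        cost = estimate_tokens(message)
        if used + cost > budget // 2:
            break
        kept.append(message)
        used += cost
    kept.reverse()

    # Everything older than the kept tail is replaced by a single summary.
    older = history[: len(history) - len(kept)]
    return [summarize(older)] + kept
```

With a larger budget the summarization branch simply fires less often. The management logic is still required; it just has more room before it has to act.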

The pricing reality

GPT-5.5 is priced at $5 per million input tokens and $30 per million output tokens. GPT-5.4 sat at $2.50 and $15, so the list price is exactly 2x on both sides. The effective increase is smaller than that because of the token-efficiency gains, but it is still an increase. Expect real workloads to land somewhere between 1.3x and 1.8x the cost of 5.4, depending on how much of the task is generation versus reasoning. At a flat 2x list price, that range corresponds to GPT-5.5 using roughly 10% to 35% fewer tokens per task. Codex Fast mode offers roughly 1.5x speed for roughly 2.5x cost, which is the right option for interactive developer workflows and the wrong option for bulk agent work running in the background.
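
The arithmetic behind that range is easy to check. The sketch below uses the list prices quoted above and one simplifying assumption that is ours, not OpenAI's: that GPT-5.5 needs the same fraction of 5.4's tokens on both input and output. The token_ratio parameter and the function names are hypothetical.

```python
# USD per million tokens, as quoted in this article.
PRICES = {
    "gpt-5.4": {"input": 2.50, "output": 15.00},
    "gpt-5.5": {"input": 5.00, "output": 30.00},
}


def task_cost(model: str, input_tokens: float, output_tokens: float) -> float:
    """Cost in USD of one task at list price."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def cost_multiplier(input_tokens: float, output_tokens: float, token_ratio: float) -> float:
    """How many times more a task costs on 5.5 than on 5.4.

    token_ratio is an assumed fraction of 5.4's token count that 5.5 needs,
    applied equally to input and output (a simplification).
    """
    old = task_cost("gpt-5.4", input_tokens, output_tokens)
    new = task_cost("gpt-5.5", input_tokens * token_ratio, output_tokens * token_ratio)
    return new / old
```

Because both prices doubled, the multiplier reduces to 2 times token_ratio regardless of the input and output mix. A ratio of 0.65 gives about 1.3x and a ratio of 0.9 gives about 1.8x. If the efficiency gain were concentrated on output tokens only, the result would depend on the mix, which this sketch deliberately ignores.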

Pricing has now caught up with the fact that frontier quality is a scarce resource and the people selling it have stopped apologizing for the number. The question for anyone running an agent stack is no longer “which model is best.” It is “which model is best for which step.” A dispatcher routing easy steps to smaller models and hard steps to GPT-5.5 is now a meaningful cost lever.
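
A dispatcher of that kind can be very small. The following is a minimal sketch under stated assumptions: each step arrives with a difficulty score between 0 and 1 from some upstream heuristic, the threshold of 0.6 is arbitrary, and "small-model" is a placeholder rather than a real model name.

```python
from dataclasses import dataclass
from typing import Dict, List

SMALL_MODEL = "small-model"   # placeholder for any cheaper tier
FRONTIER_MODEL = "gpt-5.5"


@dataclass(frozen=True)
class Step:
    name: str
    difficulty: float  # hypothetical 0..1 score; how it is produced is out of scope


def route(step: Step, threshold: float = 0.6) -> str:
    """Send hard steps to the frontier model and everything else to the small one."""
    return FRONTIER_MODEL if step.difficulty >= threshold else SMALL_MODEL


def plan_routes(steps: List[Step], threshold: float = 0.6) -> Dict[str, str]:
    """Map each step name to the model that should handle it."""
    return {step.name: route(step, threshold) for step in steps}
```

The hard part in practice is the difficulty score, not the routing. A misjudged score sends a hard step to a weak model, and the retry that follows can cost more than the frontier call would have.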

What the real-world reports suggest

The early anecdotes are useful as a sanity check on the benchmark numbers. A math professor built an algebraic geometry app from a single prompt in eleven minutes. The model contributed to a new mathematical proof related to Ramsey numbers. A GPU batching optimization credited to GPT-5.5 increased token generation speeds by more than 20%. Bank of New York reports a “step change” in accuracy and hallucination resistance. Over ten thousand NVIDIA employees across engineering, legal, marketing, finance, and HR are now running GPT-5.5 through Codex, and Jensen Huang sent a “jump to lightspeed” email to the whole company about it.

None of those stories proves anything on its own. They do fit the pattern the benchmarks suggest, which is that GPT-5.5 is genuinely better at complex, multi-step, domain-intensive work. That is a different claim from “it writes better code in general.” Both can be true.

The thing under the release

The six-week gap between 5.4 and 5.5 is easy to miss in a release that has a 400K context window and a new benchmark table. We do not think it should be missed. OpenAI is compressing its release cycle in response to competitive pressure from Anthropic, and the shape of the race now is that frontier models are going to keep shipping on a faster cadence than any agent stack can realistically retune itself for. The model-picking decision is becoming one that has to be revisited every few weeks by anyone who cares about being on the efficient frontier.

For now, the honest bottom line is that GPT-5.5 is the default we would pick for agentic CLI work, long-context reasoning, and specialized scientific or mathematical tasks, and Claude Opus 4.7 is still the default we would pick for navigating and editing real software projects. That split will probably not last long. Something will break it in the next six weeks. That is the part of the release that is actually new.

Sources: TechCrunch, SiliconAngle, 9to5Mac, Fortune, Interesting Engineering, Ghacks, OpenAI.