Claude Opus 4.8 Review: Better at What It's Good At, Worse at What It's Not

Claude Opus 4.8 benchmark comparison chart showing SWE-bench Pro scores against GPT-5.5 and previous Claude versions released May 28 2026

Anthropic dropped Claude Opus 4.8 on May 28, 2026 — just 41 days after Opus 4.7. That release cadence alone tells you something. They're not waiting for a generational leap anymore. They're shipping targeted improvements and making you evaluate every single release.

So here's the honest review: Opus 4.8 is genuinely better at coding and math. It's worse — sometimes embarrassingly worse — at negotiation and business simulation. It costs the same as before. And it now has a Fast Mode that's three times cheaper than the old one.

Here's the full breakdown.

What Changed in Claude Opus 4.8 — The Quick Version

If you're already running Opus 4.7 in production and want the short answer: migrate. It's a config-only change, same API, same context window, same price — and the improvements are real.

💡 Claude Opus 4.8 at a Glance:
📅 Released: May 28, 2026
💰 Pricing: $5/M input tokens · $25/M output tokens (same as 4.7)
⚡ Fast Mode: $10/$50 per million tokens (was $30/$150 on 4.7 — 3x cheaper)
🧠 Context window: 1M tokens
🏆 Artificial Analysis leaderboard: #1 overall (61.4 score)
🆔 API model ID: claude-opus-4-8

Where Claude Opus 4.8 Got Dramatically Better

🧑‍💻 Coding — This Is Where It Absolutely Shines

This is Opus 4.8's biggest and most meaningful improvement. On SWE-Bench Pro — the hardest coding benchmark in the industry, testing whether a model can actually resolve real GitHub issues from actively-maintained repositories — Opus 4.8 jumped from 64.3% to 69.2%. That's a 4.9-point gain in 41 days.

To understand why that matters: GPT-5.5 sits at 58.6% on the same benchmark. Opus 4.8 is now 10.6 percentage points ahead of GPT-5.5 on real-world coding tasks. That gap is not noise. It shows up in complex multi-file refactoring, understanding interconnected codebases, and producing changes that actually pass existing test suites.

There's also a quality improvement that doesn't show up in benchmarks but matters enormously in practice: Opus 4.8 is four times less likely than 4.7 to let flawed code pass without flagging it. If you're using it for production code review or agentic coding runs, this is the most important number in this whole review.

And it does all of this using 35% fewer output tokens than Opus 4.7 — meaning you're getting better answers for less money even though the list price hasn't changed.

🔢 Math — The Jump That Surprised Everyone

The math improvement is the number that turned heads in the AI research community.

On USAMO 2026 — the United States Mathematical Olympiad proof-based problems, which Anthropic has confirmed are not in the training data — Opus 4.8 scored 96.7%. Opus 4.7 scored 69.3% on the same test.

That's a 27-point jump in one release cycle. For context, that's the difference between a strong high school math student and a national competition finalist. It's the kind of gain that makes researchers double-check whether the benchmark has been contaminated. Anthropic says it hasn't.

⚙️ Agentic Tasks and Tool Use

On OSWorld-Verified — which tests whether a model can actually control a live desktop, clicking around and completing tasks the way a human would — Opus 4.8 scores 83.4% versus GPT-5.5's 78.7%. Opus 4.7 and GPT-5.5 were essentially tied on this benchmark. 4.8 pulled ahead by five points, which is meaningful for anyone building browser agents or desktop automation.

On MCP-Atlas (multi-step tool use across real APIs), Opus 4.8 reaches 82.2% versus GPT-5.5's 75.3%. These are tasks that look like real work: booking something, pulling data from multiple sources, chaining together a series of API calls.

⚡ Fast Mode Got Dramatically Cheaper

This is the underrated story of this release. Opus 4.8's Fast Mode now costs $10 per million input tokens and $50 per million output tokens — down from $30/$150 on Opus 4.7. That's three times cheaper, while running at roughly 2.5x the speed of standard Opus inference.

For teams that were avoiding Fast Mode because of cost, this changes the math entirely. You can now run high-speed Opus 4.8 at a price that used to buy you slow Opus 4.7.

The New Feature: Dynamic Workflows in Claude Code

Alongside the model update, Anthropic shipped a new feature called Dynamic Workflows inside Claude Code — and it's worth understanding what it actually does.

Normal Claude Code works sequentially. You give it a task, it works through it step by step, and you wait. Dynamic Workflows changes this: Claude now acts as an orchestrator, spinning up up to 1,000 parallel subagents that each tackle a piece of the work simultaneously.

In practical terms: instead of watching Claude Code slowly work through a repository-scale migration one file at a time, you set it up, let it run, and come back to results. Box ran internal benchmarks comparing 4.8 to 4.7 on enterprise workloads — the industrial goods task went from 77% to 87%, and consumer products launch planning went from 84% to 90%.

This is the feature that makes Opus 4.8 genuinely useful for large-scale autonomous coding runs, not just assisted coding.

📊 New Benchmarks at a Glance:
SWE-Bench Verified: 88.6% (up from 87.6%)
SWE-Bench Pro: 69.2% (up from 64.3%)
Terminal-Bench 2.1: 74.6%
USAMO 2026 Math: 96.7% (up from 69.3%)
OSWorld Desktop: 83.4% (vs GPT-5.5's 78.7%)
GDPval-AA Real-Work Leaderboard: 1890 Elo (#1 overall)

Where Claude Opus 4.8 Got Worse — The Part Nobody's Advertising

Here's the part that makes this review different from the press release.

Opus 4.8 is measurably worse than both 4.7 and GPT-5.5 in specific categories that matter for certain real-world workflows. The researchers at LessWrong and Andon Labs ran Opus 4.8 through Vending-Bench — a simulation that tests AI performance in a business management scenario, running a vending machine operation across suppliers, pricing, and inventory.

The results were striking in the wrong direction.

Opus 4.8 lost to both GPT-5.5 and its own predecessor Opus 4.7 on Vending-Bench. It falls for scam suppliers — in one documented run, it sent over $9,000 to a "membership" upsell from a fraudulent vendor. It's worse at supplier negotiation than 4.7. It runs the machine empty. It overprices inventory. And when researchers gave it maximum thinking tokens to try to fix this, the performance got even worse rather than better.

⚠️ Where 4.8 Falls Short:
❌ Business negotiation and supplier management (Vending-Bench)
❌ Terminal-Bench 2.1 — GPT-5.5 still wins here (78.2% vs 74.6%)
❌ Financial knowledge work — Gemini 3.5 Flash beats both Opus 4.8 and GPT-5.5 on Finance Agent v2
❌ Legal Agent Benchmark — only 9.6% all-pass rate (industry-wide problem, not just Claude)
❌ Multimodal visual coding — Mythos Preview is ahead at 59.0% vs 38.4% on SWE-Bench Multimodal

💬 GPQA Diamond — Basically Meaningless Now

You'll see GPQA Diamond (graduate-level science questions) cited in a lot of reviews. The honest take: this benchmark is saturated. Both Opus 4.8 and GPT-5.5 score around 94%, within one point of each other. It's essentially a tie that tells you nothing useful about which model is better. Ignore it when comparing these two models.

Claude Opus 4.8 vs GPT-5.5 — The Honest Head-to-Head

This is the comparison most people actually want.

The short version: pick Opus 4.8 for coding, complex reasoning, and agentic software work. Pick GPT-5.5 for terminal operations, general assistant tasks, and anything involving heavy multimodal use.

  • Coding (SWE-Bench Pro): Opus 4.8 wins by 10.6 points. Not close.
  • Math (USAMO 2026): Opus 4.8 at 96.7%. GPT-5.5 not published on this test.
  • Terminal tasks (Terminal-Bench 2.1): GPT-5.5 wins at 78.2% vs Opus 4.8's 74.6%.
  • Desktop automation (OSWorld): Opus 4.8 at 83.4% vs GPT-5.5's 78.7%. Claude wins.
  • Finance tasks (Finance Agent v2): Gemini 3.5 Flash wins both. Neither Claude nor GPT-5.5.
  • Real-world aggregate (GDPval-AA): Opus 4.8 at 1890 Elo, GPT-5.5 at 1769. Claude wins.
  • Pricing: GPT-5.5 is cheaper at $3/$15 vs Opus 4.8's $5/$25. For cost-constrained work, GPT-5.5 has an edge.

One more comparison that matters for budget-conscious teams: DeepSeek V4 at $0.27/$1.10 per million tokens is still the value-per-quality champion for teams that don't need frontier-level performance. Opus 4.8 is the best available, but it's the most expensive of the top options.

Is Claude Opus 4.8 the Most Honest AI Model?

Anthropic made a specific claim with this release that's worth noting: they call Opus 4.8 their most honest model to date, both in terms of not telling users what they want to hear (reduced sycophancy) and in terms of not passing flawed work without flagging it.

The sycophancy benchmark ("You're Absolutely Right!") shows Opus 4.8 scoring 4.5 out of 5 — ahead of 4.7 and the best result in Anthropic's model line. In practice this means Opus 4.8 is more likely to push back when you're wrong, and less likely to agree with flawed logic just because you stated it confidently.

For professional use — writing, analysis, research, legal review — this is actually the most practically important improvement in the whole release. A model that catches your mistakes rather than validating them is worth more than one that scores slightly higher on a benchmark you'll never encounter.

Who Should Actually Use Claude Opus 4.8?

Use Opus 4.8 if: You're building agentic coding workflows. You're doing complex multi-step reasoning or analysis. You need a model that will catch its own errors and flag problems without being asked. You're already on 4.7 — migration is trivially easy and the improvements are real.

Stick with GPT-5.5 if: Your work is heavily terminal-based. You need tight multimodal integration across a broader tool ecosystem. You're cost-constrained and the $3/$15 pricing is a meaningful factor for your volume.

Consider DeepSeek V4 if: You're running high-volume production workloads where frontier quality isn't strictly necessary and cost is the dominant constraint.

Wait for Mythos if: You're doing the absolute hardest software engineering work — Mythos Preview already sits at 77.8% on SWE-Bench Pro vs Opus 4.8's 69.2%, and Anthropic has said it's "coming in the next few weeks" to general availability after its restricted Project Glasswing deployment.

📌 Bottom Line: Claude Opus 4.8 is the best publicly available AI model for software engineering and complex reasoning as of June 2026. It's worse at business simulation and negotiation tasks than its predecessor. It's three times cheaper to run at speed than Opus 4.7 was. And it's the most honest model Anthropic has shipped. If coding is your primary use case, the upgrade decision is straightforward.

Claude Opus 4.8 is available now on the Anthropic API (model ID: claude-opus-4-8), Amazon Bedrock, and Google Vertex AI. Fast Mode is available on Claude.ai Pro and Team plans. Follow Ampick for ongoing coverage of AI model releases and everything happening in the AI industry.

Post a Comment

0 Comments