Anthropic Releases Claude Opus 4.8: #1 on Benchmarks, Parallel Subagents, and It Actually Tells You When Your Code Is Wrong

Anthropic releases Claude Opus 4.8 — now #1 on benchmarks, and it’ll actually tell you when your code is broken

Anthropic dropped Opus 4.8 yesterday (May 27–28) and it’s legitimately impressive, at least on paper. Same price as 4.7 ($5/$25 per million tokens), but they’ve managed to push SWE-Bench Verified to 88.6% and land the top spot on the Artificial Analysis Intelligence Index at 61.4 — beating GPT-5.5 by 1.2 points. Whether those translate to real-world gains is the usual question, but early enterprise reports look solid.

The benchmark numbers

SWE-Bench Verified: 88.6% — significant jump from 4.7
SWE-Bench Pro: 69.2%
Humanity’s Last Exam: #1 overall (beats GPT-5.5 and Gemini 3.1 Pro on scientific reasoning)
Terminal-Bench Hard: +6.8 pts vs 4.7
Online-Mind2Web (browser agent): 84%, ahead of GPT-5.5
GDPval-AA (agentic work): 1,890 Elo, +137 vs 4.7, ~67% win rate against GPT-5.5 — while using 15% fewer turns and 35% fewer output tokens

That last point matters if you’re paying per token in long agentic loops.

Dynamic Workflows — the actually interesting part

This is in research preview and it’s what makes 4.8 more than a benchmark bump. Inside Claude Code, you can now have the model spin up hundreds of parallel subagents in a single session. The subagents adapt their priorities based on real-time findings rather than following a fixed plan upfront, and Claude verifies outputs before reporting back rather than blindly running to completion.

Practically: Anthropic is pitching this for codebase-scale migrations — the kind of “we need to touch 200k lines” project that currently requires a full sprint of back-and-forth. It’s gated to Enterprise, Team, and Max plans for now.

Cursor’s evaluation noted it was “more efficient in tool calling, using fewer steps for the same intelligence.” One tester said it’s “the only model to complete every case end-to-end on the Super-Agent benchmark.”

The honesty angle — actually useful for devs

The headline marketing is “honesty as a killer feature,” which sounds soft but is concrete here: Opus 4.8 is ~4× less likely to let flawed code pass unremarked compared to 4.7. It’ll proactively flag issues with your inputs and outputs rather than just generating what you asked for.

Bridgewater put it well: “its tendency to proactively flag issues with inputs and outputs of an analysis — something other models routinely missed and left to users to catch.”

For code review workflows this is the right direction. Models that just complete tasks confidently without flagging obvious problems are actively harmful in production codebases.

Effort levels and pricing

New: you can now control how hard the model thinks, as a toggle on all claude.ai plans:

Level	Behavior
Low	Faster, fewer tokens consumed
High (default)	Quality/UX balance
`xhigh`	More tokens, harder problems
Max	Peak performance

Fast mode is 2.5× speed at $10/$50 per million input/output tokens — they’re claiming 3× cheaper than previous models at that speed tier.

Technical specs worth knowing

Context: 1M tokens on Claude API, Bedrock, and Vertex AI — 200k on Microsoft Foundry
Max output: 128k tokens per response
Model size: ~59.1B parameters
New API feature: accepts role: "system" messages mid-conversation so agents can update instructions without breaking prompt caching. This is useful for long-running agentic tasks where you want to inject new context mid-session without nuking your cache hits.

Enterprise callouts

Legal: First model past 10% on Harvey’s Legal Agent Benchmark all-pass standard; 91.1% on BigLaw Bench
Finance: Better citation precision on dense financial filings (Hebbia)
Databricks: 61% cheaper token cost vs 4.7 for their Genie product

What’s next

Anthropic is teasing a model above Opus in intelligence — Claude Mythos — under Project Glasswing. Currently restricted to a small group doing cybersecurity work while they work out safety guardrails. No public timeline beyond “coming weeks.”

TL;DR: Opus 4.8 is a meaningful upgrade over 4.7 at the same price. The agentic efficiency gains (fewer turns, fewer tokens) and the proactive code-flaw detection are the two things actually worth testing in your workflow. Dynamic Workflows is the feature to watch, but it’s enterprise-gated for now.

On This Page

Anthropic releases Claude Opus 4.8 — now #1 on benchmarks, and it’ll actually tell you when your code is broken

The benchmark numbers

Dynamic Workflows — the actually interesting part

The honesty angle — actually useful for devs

Effort levels and pricing

Technical specs worth knowing

Enterprise callouts

What’s next

Continue reading

Related Content

Anthropic Quantifies Expertise Multiplier; Practitioners Build Agent-Side Control Plane

Your AI is only as responsible as you are

Lightweight AI Workflows Outperform OpenSpec in UI Redesign Experiments