A model in the suite · Anthropic

Claude Opus 4.8

Anthropic · Claude Opus 4.8 · Opus 4.8 / 1 million context / extra high thinking · 2026-06-01

81/100
Strict suite average · Legacy 86 · 4 benchmarks

Claude Opus 4.8 finishes the four-benchmark non-image suite as a clear top-tier operator: excellent on Car Wash, Brick, and revised Dingo, and strong on Artemis. The headline is unusually strong operational judgment, source discipline, and self-repair, especially in Brick and Car Wash. The ceiling is held down by frontend/visual presentation standards: Dingo lost points for a low-quality central funnel visual, and Artemis regressed visually versus Opus 4.7 despite safer factual grounding.
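The 81 and legacy 86 headline figures are consistent with a plain unweighted mean of the four per-benchmark scores reported on this page. The sketch below reproduces that arithmetic. The aggregation and the round-half-up rule are assumptions, since the page does not state how the suite average is computed, and `suite_average` is a hypothetical helper name.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-benchmark scores as reported on this page: (strict, legacy).
scores = {
    "Dingo & Co. Knowledge Work": (80, 87),
    "Car Wash Operations": (86, 89),
    "Brick": (82, 88),
    "Artemis II Mission Visualization": (76, 78),
}

def suite_average(values):
    """Unweighted mean rounded half up to a whole number (assumed rule)."""
    mean = Decimal(sum(values)) / Decimal(len(values))
    return int(mean.quantize(Decimal("1"), rounding=ROUND_HALF_UP))

strict_avg = suite_average([strict for strict, _ in scores.values()])
legacy_avg = suite_average([legacy for _, legacy in scores.values()])

print(strict_avg, legacy_avg)  # 81 86
```

The strict mean is exactly 81 (324 / 4). The legacy mean is 85.5 (342 / 4), so it only reaches 86 if halves round up.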



How Claude Opus 4.8 handled each benchmark

Score, capability radar, and the honest read on what it nailed and where it slipped.

Dingo & Co. Knowledge Work

A 23-deliverable consulting brief: research, financial reconciliation, regulatory analysis, decks and spreadsheets. Tests whether a model can run an entire knowledge-work engagement end to end.

80 (legacy 87)
Excellent

Excellent underlying knowledge-work synthesis with strong regulatory judgment and complete deliverables, but revised downward because a central Market Creation Funnel visual reused in the board deck and dashboard is visibly low-quality: small fragile labels, awkward connectors, AI-ish artifacts, weak hierarchy, and a pasted-in callout. The run remains strong overall, but the deck/dashboard are not production-ready without rebuilding that frontend/visual artifact.

Radar axes: Instruction Following, Artifact Validity, Source Integrity, Research Grounding, Semantic Judgment, Quantitative Reasoning, Visual Storytelling, UX Reviewability, Production Readiness, Speed

  1. Claude Opus 4.8 · 80
  2. GPT-5.5 · 78
  3. Gemini 3.5 Flash (High) Fast · 62
  4. Opus 4.7 · 54
  5. Sonnet 4.6 · 52
  6. Gemini 3.1 Pro · 38

What it nailed

  • All 23 required artifacts exist as real files with no source mutation and no markdown stand-ins for binary deliverables.
  • Handled the central dingo/litter-box/import/legal/ethics canaries with unusual discipline, including the Alaska/Australia mismatch and import-created demand.
  • Core financial, launch, price, budget, NPS, import, and channel numbers are consistently reconciled across the package.
  • Fictional competitors are explicitly labeled as scenario competitors rather than hallucinated real businesses.
  • Underlying deck, dashboard, GTM, risk assessment, and investor FAQ synthesis remains strong enough for internal use after visual cleanup.

Where it slipped

  • A central Market Creation Funnel visual reused in the deck and dashboard is visibly low-quality and creates a real impression of presentation slop.
  • The HTML dashboard renders well but lacks meaningful controls or event-driven interactivity.
  • 00_output_manifest.json incorrectly reports itself missing with bytes 0.
  • One market-size sample check did not match current IMARC page figures.
  • Some official/legal sources are agency hubs or citation-level references rather than stable deep-page citations.
Wall clock 45m 41s


Car Wash Operations

A filthy operational dataset — ghost records, orphaned orders, typo'd customers, raw enum variants. Tests judgment under messy real-world data: what gets fixed, quarantined, or wrongly promoted.

86 (legacy 89)
Excellent

Excellent operational migration run with strong canary handling and real reviewer-facing provenance, held below near-mastery by missing department-code normalization, image-derived amount/line-item issues, mobile UI clipping, and minor audit-ledger inconsistencies.

Radar axes: Instruction Following, Artifact Validity, Source Integrity, Semantic Judgment, Quantitative Reasoning, UX Reviewability, Production Readiness, Speed

  1. Claude Opus 4.8 · 86
  2. GPT-5.5 · 55
  3. Gemini 3.5 Flash (High) Fast · 51
  4. GPT-5.4 · 51
  5. Opus 4.7 · 48

What it nailed

  • Handles the central messy-data canaries instead of merely producing a database: ghost/test records are rejected, Terrence Blackwood remains orphaned, SVC-007 is a high-severity conflict, and typo names merge correctly.
  • Strong provenance architecture: 467 source files, 5,657 source_records, customer_sources, and source ids on every job/payment.
  • Reviewer workflow is genuinely usable, with conflicts, rejected records, source inventory, high-priority A/R flags, and review queue visible in a static UI.
  • Conservative business judgment: orphan image customers and unresolved services are flagged instead of silently promoted or force-fit.
  • Documentation is unusually candid about review-grade revenue, reconciliation gaps, source recovery limits, and sensitive-file handling.

Where it slipped

  • Department/role code normalization is not implemented as first-class canonical data despite department values existing in recovered JSON source records.
  • Some image-derived multi-service receipts put the full receipt amount on each service line, which can inflate billed/job-line totals for those records.
  • Mobile UI screenshot shows clipped navigation and metric cards, so the reviewer console is weaker on narrow screens.
  • Source file record_count for the recovered JSON file reports 57 even though source_records show 183 customers plus 57 jobs, creating an audit-ledger inconsistency.
  • Image/OCR handling is best-effort and includes at least one notable transcription mismatch against the obstacle key.
Wall clock 46m 45s
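Two of the slips above are plain arithmetic and can be checked mechanically. The sketch below is illustrative only: the `job_lines` rows, the field names, and the `inflated_receipts` helper are hypothetical and not taken from the run's actual schema. Only the 57, 183, and 57 counts come from the review.

```python
from collections import defaultdict

# Hypothetical job lines; field names and values are illustrative only.
# Receipt R-1 shows the reported failure mode: the full receipt amount
# is copied onto every service line.
job_lines = [
    {"receipt_id": "R-1", "service": "wash", "amount": 60.0, "receipt_total": 60.0},
    {"receipt_id": "R-1", "service": "wax", "amount": 60.0, "receipt_total": 60.0},
    {"receipt_id": "R-2", "service": "wash", "amount": 25.0, "receipt_total": 25.0},
]

def inflated_receipts(lines, tolerance=0.005):
    """Map receipt_id -> (sum of line amounts, receipt total) where the lines exceed the total."""
    line_sums = defaultdict(float)
    totals = {}
    for line in lines:
        line_sums[line["receipt_id"]] += line["amount"]
        totals[line["receipt_id"]] = line["receipt_total"]
    return {
        rid: (line_sums[rid], totals[rid])
        for rid in line_sums
        if line_sums[rid] > totals[rid] + tolerance
    }

flagged = inflated_receipts(job_lines)
print(flagged)  # {'R-1': (120.0, 60.0)}

# Audit-ledger check using the counts quoted in the review.
reported_record_count = 57
expected_record_count = 183 + 57  # customers + jobs in source_records
record_count_gap = expected_record_count - reported_record_count
print(expected_record_count, record_count_gap)  # 240 183
```

The reported record_count of 57 equals the job count alone, which suggests (but does not prove) that the 183 customer records were left out of that file's tally.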


Brick — The AI LEGO Build

Four buildable LEGO models from prompt to part list to runnable browser guide. Tests spatial reasoning, physical plausibility, and whether large builds hold together or collapse into repetition.

82 (legacy 88)
Excellent

Excellent Brick benchmark run. Opus 4.8 completed all four prompts as separate runnable, data-driven animated assembly guides, passed hard structural validators, produced complete screenshots, and delivered the strongest Brick result observed in this suite so far. It remains below near-mastery because literal physical buildability is not fully proven: the larger builds still use stylized overlaps, suspended airship assumptions, decorative rods/panels, and scene-graph geometry that would need human review before becoming real interlocking-brick instructions.

Radar axes: Instruction Following, Artifact Validity, Source Integrity, Semantic Judgment, Quantitative Reasoning, Spatial Reasoning, Visual Storytelling, UX Reviewability, Production Readiness, Speed

  1. Claude Opus 4.8 · 82
  2. Gemini 3.5 Flash (High) Fast · 56

What it nailed

  • All four builds completed with in-range piece counts and full hard validation pass.
  • Strong single-source-of-truth architecture: kitSpec drives manifest, instructions, final model, and animation.
  • The 1000-piece airship keeps the exact nine requested chapters and produces a showable flagship render.
  • Viewer usability is robust: play/pause, step navigation, scrubber, speed, complete, reset, orbit controls, highlights, studs, spin/shadows, and per-build toggles.
  • Runner-reported self-correction behavior was unusually strong: repeated piece-count failures were detected, generator logic revised, and final outputs brought back into range.

Where it slipped

  • Physical buildability is not fully proven; support/collision heuristics and screenshot review show many places that would need real brick-design cleanup.
  • The large airship is a strong visual kit but relies on mooring/support assumptions and overlapping balloon panels rather than a fully literal buildable envelope.
  • Some steps, especially in large builds, are repetitive or add relatively large batches of parts.
  • Rover and helicopter visuals are serviceable but blockier and less polished than the food stall and flagship airship.
  • Raw model evidence is partial; the final scrollback does not preserve intermediate correction details.
Wall clock 51m 41s


Artemis II Mission Visualization

A fact sheet plus an interactive 3D visualization of the Artemis II mission. Tests factual grounding, source integrity, and the ability to dramatize the hard beats — launch, staging, re-entry, recovery.

76 (legacy 78)
Strong

Strong source-aware package and runnable visualization, limited by schematic/toy-like visuals, weak splashdown/recovery storytelling, mobile framing issues, and missed primary-source labeling opportunities.

Radar axes: Instruction Following, Artifact Validity, Source Integrity, Research Grounding, Semantic Judgment, Spatial Reasoning, Visual Storytelling, UX Reviewability, Production Readiness, Quantitative Reasoning, Speed

  1. GPT-5.5 · 79
  2. Claude Opus 4.8 · 76
  3. Opus 4.7 · 60
  4. Gemini 3.5 Flash (High) Fast · 54

What it nailed

  • Complete artifact package with fact sheet, source list, offline visualization, documentation, screenshots, and validation evidence.
  • Current Artemis II mission facts mostly align with NASA/CSA/ESA sources.
  • Useful confidence-tagging discipline separates official, secondary, and approximate data.
  • Interactive visualization has 14 phases, playback, scrubber, event stepping, orbit/follow camera controls, HUD, and local Three.js assets.

Where it slipped

  • Visual storytelling regresses versus Opus 4.7 despite much safer factual grounding.
  • Splashdown/recovery do not read as distinct scenes.
  • Later mission beats are visually repetitive.
  • Missed available NASA primary support for some central flyby/max-distance claims.
  • Mobile composition partly pushes scene content off-frame.
Wall clock 24m 23s
