A model in the suite · Anthropic

Opus 4.7

Anthropic · Claude · historical runs from 2026-04-19 through 2026-06-01 staging

54/100

Strict suite average · Legacy 65 · 3 benchmarks

Visually capable but comparatively fragile on source integrity, operational judgment, and factual discipline. The Artemis run preserves the evidence of visual strength but still carries a dedicated factual re-evaluation requirement.
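As a quick sanity check on the headline number, the sketch below recomputes the strict suite average from the three per-benchmark strict scores reported on this page (54, 48, and 60). It assumes the suite average is an unweighted mean rounded to the nearest integer; the page does not state the aggregation method, and the function name `suite_average` is hypothetical.

```python
from statistics import mean

# Strict scores as printed in the per-benchmark sections of this page.
strict_scores = {
    "Dingo & Co. Knowledge Work": 54,
    "Car Wash Operations": 48,
    "Artemis II Mission Visualization": 60,
}


def suite_average(scores):
    """Unweighted mean rounded to the nearest integer (assumed aggregation)."""
    return round(mean(scores.values()))


print(suite_average(strict_scores))
```

Under that assumption the result matches the 54/100 shown in the header.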


Where it ranks

Opus 4.7 against the field

ANT · Claude Opus 4.8 · 81
OAI · GPT-5.5 · 71
GOO · Gemini 3.5 Flash (High) Fast · 56
ANT · Opus 4.7 · 54
ANT · Sonnet 4.6 · 52
OAI · GPT-5.4 · 51
GOO · Gemini 3.1 Pro · 38

Strict suite average · Legacy average · Scored 0–100 · higher is better


The runs

How Opus 4.7 handled each benchmark

Score, capability radar, and the honest read on what it nailed and where it slipped.

Dingo & Co. Knowledge Work

A 23-deliverable consulting brief: research, financial reconciliation, regulatory analysis, decks and spreadsheets. Tests whether a model can run an entire knowledge-work engagement end to end.

54 · legacy 67

Interesting but Unreliable

The run is complete, polished, and strategically useful, but the strict score is pulled below competent because it misses central Dingo legal/regulatory canaries: unsupported Alaska/NCI permit-path framing, several unverified case-by-case jurisdiction claims, stale or unsupported market/pricing research, and material cross-document drift in import counts and budget release amounts.


1. Claude Opus 4.8 · 80
2. GPT-5.5 · 78
3. Gemini 3.5 Flash (High) Fast · 62
4. Opus 4.7 · 54
5. Sonnet 4.6 · 52
6. Gemini 3.1 Pro · 38

What it nailed

Completed all 23 required deliverables in real artifact formats with no validator errors.
Used provided source imagery in the board deck, sales one-pager, and dashboard.
Produced practical GTM, investor FAQ, persona, and email work rather than generic template filler.
Reconciled several core finance figures including revenue, unit count, CAC, LTV, burn, and cash.

Where it slipped

Claims an Alaska NCI permit path and several NCI case-by-case postures without official support; Alaska is directly contradicted by sampled ADFG evidence.
Uses stale or unsupported external market and pricing claims, including Grand View market sizes and Halo/SpotOn pricing anchors.
NCI completed imports drift between 7 and 16 across artifacts.
Executive summary asks for a $240K gate-1 release while deck and GTM ask for a $185K Phase-1 release.
Both spreadsheet workbooks contain formulas and structure but no charts.
Raw model output/transcript evidence is absent from the evaluation package.
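The cross-document drift described above (import counts of 7 versus 16, and a $240K versus $185K release ask) is the kind of inconsistency a simple reconciliation pass can surface. The following is a minimal sketch, not the evaluation's actual tooling: the function `find_drift`, the metric keys, and the artifact labels `artifact_a` and `artifact_b` are all hypothetical, since the page does not say which artifacts carry which import count. It also assumes the gate-1 and Phase-1 releases refer to the same ask.

```python
from collections import defaultdict


def find_drift(claims):
    """Return metrics that appear with more than one distinct value.

    `claims` is an iterable of (artifact, metric, value) tuples. The result
    maps each drifting metric to {value: [artifacts reporting that value]}.
    """
    by_metric = defaultdict(lambda: defaultdict(list))
    for artifact, metric, value in claims:
        by_metric[metric][value].append(artifact)
    return {
        metric: dict(values)
        for metric, values in by_metric.items()
        if len(values) > 1
    }


# Values taken from the findings above; artifact_a / artifact_b are placeholders.
claims = [
    ("executive_summary", "first_release_usd", 240_000),
    ("board_deck", "first_release_usd", 185_000),
    ("gtm_plan", "first_release_usd", 185_000),
    ("artifact_a", "nci_completed_imports", 7),
    ("artifact_b", "nci_completed_imports", 16),
    ("board_deck", "deliverable_count", 23),
    ("executive_summary", "deliverable_count", 23),
]

for metric, values in find_drift(claims).items():
    print(metric, values)
```

Metrics that agree everywhere (here, the deliverable count) are dropped from the result, so only the figures that need a human decision are listed.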

Illegal Or Unverified Claims · Material Number Inconsistency · Fabricated Or Broken Sources · Publication Fact Errors · Empty Raw Model Output Evidence Confidence Cap

Car Wash Operations

A filthy operational dataset — ghost records, orphaned orders, typo'd customers, raw enum variants. Tests judgment under messy real-world data: what gets fixed, quarantined, or wrongly promoted.

48 · legacy 62

Failed Core Purpose

Opus shipped a complete, fast, good-looking audit package with useful provenance, and it handled Terrence Blackwood better than the GPT-5.4 run. The strict rescore is harsh because Mickey Mouse, Test Customer, and Asdf Asdf survived as canonical customers, the SVC-007 file was not parsed despite report claims, the seven planted contact-conflict duplicates stayed split, and the report overstated what the code actually did. That combination fails the central trust test for an operational migration.


1. Claude Opus 4.8 · 86
2. GPT-5.5 · 55
3. Gemini 3.5 Flash (High) Fast · 51
4. GPT-5.4 · 51
5. Opus 4.7 · 48

What it nailed

Completed the expected artifact set quickly.
Produced the stronger reviewer UI in the April 19 cross-review set.
Flagged Terrence Blackwood as unmatched rather than silently treating the order as clean canonical data.
Used source provenance tables that external review considered genuinely useful.

Where it slipped

Promoted Mickey Mouse, Test Customer, and Asdf Asdf to canonical customers.
Missed the SVC-007 conflict because deshawn_services.tsv was not actually parsed.
Kept all seven planted duplicate customer pairs separate.
Report/docs claimed coverage the implementation did not deliver.
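To illustrate the ghost-record failure, here is a minimal sketch of a triage step that quarantines obvious placeholder customers instead of promoting them. It is an assumption-laden example, not the benchmark's reference solution: the names `PLACEHOLDER_NAMES`, `PLACEHOLDER_TOKENS`, `is_ghost`, and `triage` are hypothetical, the token heuristic is deliberately crude, and "Dana Whitfield" is an invented ordinary-looking name used only for contrast.

```python
# Exact placeholder names called out in the findings above.
PLACEHOLDER_NAMES = {"mickey mouse", "test customer", "asdf asdf"}
# Hypothetical keyboard-mash / test tokens; a real rule set would be broader.
PLACEHOLDER_TOKENS = {"test", "asdf", "qwerty"}


def is_ghost(name):
    """Return True if the name looks like a test or placeholder record."""
    normalized = " ".join(name.lower().split())
    if normalized in PLACEHOLDER_NAMES:
        return True
    return any(token in PLACEHOLDER_TOKENS for token in normalized.split())


def triage(names):
    """Split names into (canonical, quarantined) without dropping any."""
    canonical, quarantined = [], []
    for name in names:
        (quarantined if is_ghost(name) else canonical).append(name)
    return canonical, quarantined


canonical, quarantined = triage(
    ["Mickey Mouse", "Dana Whitfield", "  Test   Customer ", "Asdf Asdf"]
)
print("canonical:", canonical)
print("quarantined:", quarantined)
```

Quarantining rather than deleting keeps the suspicious rows available for human review, which is the judgment this benchmark appears to reward.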

Misses Three Or More Primary Canaries · Promotes Ghost Records · Report Honesty Failure

Artemis II Mission Visualization

A fact sheet plus an interactive 3D visualization of the Artemis II mission. Tests factual grounding, source integrity, and the ability to dramatize the hard beats — launch, staging, re-entry, recovery.

60 · legacy 60

Competent Scaffold

The visualization is the strongest video-facing artifact in the prior Artemis set: cinematic full-screen Three.js, component inspection, camera modes, staged screenshots, launch/ascent, staging, TLI, lunar flyby, max-distance, re-entry, and trajectory views. Strict scoring is capped because the fact sheet and visualization include no traceable source URLs, directly contradict NASA's Artemis II CubeSat evidence by claiming no CubeSats flew, place major lunar-flyby timings far from NASA's official EDT/UTC sequence, and include unsupported anomaly/color/patch/crater/social-media details. The result is a visually impressive but source-unsafe scaffold.


1. GPT-5.5 · 79
2. Claude Opus 4.8 · 76
3. Opus 4.7 · 60
4. Gemini 3.5 Flash (High) Fast · 54

What it nailed

Best visual storytelling of the two prior Artemis artifacts.
Cinematic full-screen Three.js presentation with camera modes, component inspection, trajectory toggle, labels, mission event list, and staged screenshot set.
Detailed procedural SLS/Orion model and stronger video/post b-roll value than GPT-5.5.
Broadly covers launch, ascent, staging, TLI, lunar flyby, max distance, re-entry, and full trajectory.

Where it slipped

No source URLs or source list were found in the fact sheet or visualization.
Material public-facing CubeSat claim contradicts NASA primary sources.
Major lunar-flyby timing values differ from NASA's official updates.
Several colorful anomaly, patch, livery, crater, and social-media claims are unsupported in this pass.
No formal historical scorecard or raw model output was found.

From the run