Anthropic · Claude · historical runs from 2026-04-19 through 2026-06-01 staging
54/100
Strict suite average · Legacy 65 · 3 benchmarks
Visually capable but comparatively fragile on source integrity, operational judgment, and factual discipline. Artemis preserves the visual-strength evidence while clearly retaining a dedicated factual reevaluation requirement.
Dingo & Co. Knowledge Work
A 23-deliverable consulting brief: research, financial reconciliation, regulatory analysis, decks and spreadsheets. Tests whether a model can run an entire knowledge-work engagement end to end.
54 · legacy 67
Interesting but Unreliable
The run is complete, polished, and strategically useful, but the strict score is pulled below competent because it misses central Dingo legal/regulatory canaries: unsupported Alaska/NCI permit-path framing, several unverified case-by-case jurisdiction claims, stale or unsupported market/pricing research, and material cross-document drift in import counts and budget release amounts.
Completed all 23 required deliverables in real artifact formats with no validator errors.
Used provided source imagery in the board deck, sales one-pager, and dashboard.
Produced practical GTM, investor FAQ, persona, and email work rather than generic template filler.
Reconciled several core finance figures including revenue, unit count, CAC, LTV, burn, and cash.
Where it slipped
Claims an Alaska NCI permit path and several NCI case-by-case postures without official support; Alaska is directly contradicted by sampled ADFG evidence.
Uses stale or unsupported external market and pricing claims, including Grand View market sizes and Halo/SpotOn pricing anchors.
NCI completed imports drift between 7 and 16 across artifacts.
Executive summary asks for a $240K gate-1 release while deck and GTM ask for a $185K Phase-1 release.
Both spreadsheet workbooks contain formulas and structure but no charts.
Raw model output/transcript evidence is absent from the evaluation package.
Illegal Or Unverified Claims · Material Number Inconsistency · Fabricated Or Broken Sources · Publication Fact Errors · Empty Raw Model Output Evidence Confidence Cap
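The cross-document drift noted above (import counts and release amounts that disagree between artifacts) is the kind of issue a simple consistency pass can surface. The following is a minimal sketch, not the evaluation's actual tooling: the `find_drift` helper, the metric keys, and the artifact labels `artifact_a` and `artifact_b` are hypothetical, and it assumes the gate-1 and Phase-1 release figures refer to the same budget ask.

```python
from collections import defaultdict


def find_drift(claims):
    """Return metrics whose stated value differs between artifacts.

    `claims` is an iterable of (artifact, metric, value) tuples.
    The result maps each drifting metric to {artifact: value}.
    """
    by_metric = defaultdict(dict)
    for artifact, metric, value in claims:
        by_metric[metric][artifact] = value
    return {
        metric: per_artifact
        for metric, per_artifact in by_metric.items()
        if len(set(per_artifact.values())) > 1
    }


# Values taken from the review text; artifact_a / artifact_b are
# hypothetical labels because the review does not name which
# artifacts carried the 7 and the 16.
claims = [
    ("executive_summary", "first_release_usd", 240_000),
    ("board_deck", "first_release_usd", 185_000),
    ("gtm_plan", "first_release_usd", 185_000),
    ("artifact_a", "nci_completed_imports", 7),
    ("artifact_b", "nci_completed_imports", 16),
]

drift = find_drift(claims)
for metric, per_artifact in sorted(drift.items()):
    print(metric, per_artifact)
```

A check like this only flags disagreement; deciding which figure is correct still needs the source data.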
Car Wash Operations
A filthy operational dataset — ghost records, orphaned orders, typo'd customers, raw enum variants. Tests judgment under messy real-world data: what gets fixed, quarantined, or wrongly promoted.
48 · legacy 62
Failed Core Purpose
Opus shipped a complete, fast, good-looking audit package with useful provenance, and it handled Terrence Blackwood better than the GPT-5.4 run. The strict rescore is harsh because Mickey Mouse, Test Customer, and Asdf Asdf survived as canonical customers, the SVC-007 file was not parsed despite report claims, the seven planted contact-conflict duplicates stayed split, and the report overstated what the code actually did. That combination fails the central trust test for an operational migration.
Produced the stronger reviewer UI in the April 19 cross-review set.
Flagged Terrence Blackwood as unmatched rather than silently treating the order as clean canonical data.
Used source provenance tables that external review considered genuinely useful.
Where it slipped
Promoted Mickey Mouse, Test Customer, and Asdf Asdf to canonical customers.
Missed the SVC-007 conflict because deshawn_services.tsv was not actually parsed.
Kept all seven planted duplicate customer pairs separate.
Report/docs claimed coverage the implementation did not deliver.
Misses Three Or More Primary Canaries · Promotes Ghost Records · Report Honesty Failure
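To illustrate what quarantining ghost records instead of promoting them could look like, here is a minimal sketch. It is not the benchmark's reference solution: the `is_ghost` and `partition` helpers, the record shape, the placeholder-token list, and the "Jane Example" record are all assumptions made for illustration.

```python
# Names the review says were wrongly promoted to canonical customers.
PLACEHOLDER_NAMES = {"mickey mouse", "test customer", "asdf asdf"}
# Hypothetical extra tokens that often signal test or junk entries.
PLACEHOLDER_TOKENS = {"test", "asdf", "qwerty"}


def is_ghost(name):
    """Heuristically decide whether a customer name looks like a placeholder."""
    normalized = " ".join(name.lower().split())
    if normalized in PLACEHOLDER_NAMES:
        return True
    return any(token in PLACEHOLDER_TOKENS for token in normalized.split())


def partition(records):
    """Split records into (canonical, quarantined) without dropping any."""
    canonical, quarantined = [], []
    for record in records:
        target = quarantined if is_ghost(record["name"]) else canonical
        target.append(record)
    return canonical, quarantined


records = [
    {"id": 1, "name": "Mickey Mouse"},
    {"id": 2, "name": "Test Customer"},
    {"id": 3, "name": "  asdf   ASDF "},
    {"id": 4, "name": "Jane Example"},  # hypothetical ordinary customer
]

canonical, quarantined = partition(records)
print([r["id"] for r in canonical], [r["id"] for r in quarantined])
```

Quarantined records stay available for human review, which keeps the migration auditable instead of silently deleting or promoting suspect rows.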
Artemis II Mission Visualization
A fact sheet plus an interactive 3D visualization of the Artemis II mission. Tests factual grounding, source integrity, and the ability to dramatize the hard beats — launch, staging, re-entry, recovery.
60 · legacy 60
Competent Scaffold
The visualization is the strongest video-facing artifact in the prior Artemis set: cinematic full-screen Three.js, component inspection, camera modes, staged screenshots, launch/ascent, staging, TLI, lunar flyby, max-distance, re-entry, and trajectory views. Strict scoring is capped because the fact sheet and visualization include no traceable source URLs, directly contradict NASA's Artemis II CubeSat evidence by claiming no CubeSats flew, place major lunar-flyby timings far from NASA's official EDT/UTC sequence, and include unsupported anomaly/color/patch/crater/social-media details. The result is a visually impressive but source-unsafe scaffold.