OpenAI · GPT · historical runs from 2026-04-23 through 2026-06-01 staging
71/100
Strict suite average · Legacy 81 · 3 benchmarks
Strongest historical non-image packet in this backfill set, led by Dingo and Artemis. The Car Wash result keeps the strict average grounded because operational canary misses remain substantial; the Brick rover is preserved only as a single-prompt reference.
Copies GPT-5.5's full data pack — paste it into ChatGPT, Claude, or any AI to talk it through.
Dingo & Co. Knowledge Work
A 23-deliverable consulting brief: research, financial reconciliation, regulatory analysis, decks and spreadsheets. Tests whether a model can run an entire knowledge-work engagement end to end.
78 · legacy 87
Strong
This is the cleanest historical Dingo run: all 23 deliverables exist as real files, source integrity passed, regulatory/import ambiguity was handled unusually well, and the strategy work is coherent. It stays below excellent strict territory because the board deck has a real PPTX XML/rendering defect, one visible NPS inconsistency crosses artifacts, pricing research includes stale or imprecise claims, and there is no raw transcript evidence.
Completed all 23 required deliverables as real files with valid types and preserved source integrity.
Handled dingo ownership, import-created demand, legal uncertainty, ethics, and Alaska/Australia mismatch as central operating constraints.
Used provided source imagery heavily in the deck, sales one-pager, and dashboard.
Delivered coherent GTM, board, pricing, risk, and investor-facing strategy with staged decisions and guardrails.
Where it slipped
Board deck contains invalid PPTX metadata XML because the Company value uses an unescaped ampersand, blocking Quick Look rendering.
Board deck slide 5 reports average NPS as 6.6 while source math and other artifacts use about 6.2.
Some pricing research was stale or imprecise, especially Halo membership pricing and PetSafe blended pricing.
Raw model output/transcript evidence is absent from the evaluation package.
PPTX Metadata XML Rendering Defect · Cross Document Number Drift · Stale Or Imprecise Pricing Claims · Empty Raw Model Output Evidence Confidence Cap
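A standard escaping pass prevents the unescaped-ampersand defect noted above. The sketch below is illustrative only: it assumes the Company value is written into an XML element named `Company`, and the string `Dingo & Co.` is a stand-in, not the value recovered from the run's deck.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

company = "Dingo & Co."  # stand-in value containing a raw ampersand

# Writing the value verbatim yields malformed XML that strict parsers reject.
broken = f"<Company>{company}</Company>"
try:
    ET.fromstring(broken)
    broken_parses = True
except ET.ParseError:
    broken_parses = False

# Escaping turns "&" into "&amp;", so the fragment is well-formed again
# and parses back to the original text.
fixed = f"<Company>{escape(company)}</Company>"
parsed_text = ET.fromstring(fixed).text
```

Building the metadata through an XML library instead of string formatting would avoid the problem entirely, since serializers escape text nodes on write.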
Car Wash Operations
A filthy operational dataset — ghost records, orphaned orders, typo'd customers, raw enum variants. Tests judgment under messy real-world data: what gets fixed, quarantined, or wrongly promoted.
55 · legacy 74
Interesting but Unreliable
This is the strongest inspected audit scaffold: complete artifacts, full source discovery, a working frontend, strong provenance, fake/test rejection, and an empirically verified idempotent rebuild. It still fails too many operational canaries for a high strict score: Terrence Blackwood became a canonical customer, SVC-007 was missed, department codes were dropped, status/payment enums stayed raw, canonical jobs were overcounted, and several name variants stayed split. The run scores at the cap applied for multiple primary canary misses, not above it.
Produced every expected artifact, including screenshots.
Discovered 465 of 465 source files and processed or partially processed almost all business-relevant files.
Rejected planted ghost/test records and preserved a large source-record provenance layer.
Passed an isolated idempotency rerun with identical counts.
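An idempotency rerun of this kind can be sketched as running the same build twice against the same store and comparing row counts. Everything here is hypothetical (the `rebuild` function, the record shape, the table names). It shows only the general check, not the run's actual pipeline.

```python
def rebuild(store, source_records):
    """Hypothetical rebuild step: upsert each record by (table, id)."""
    for rec in source_records:
        # Keyed assignment overwrites on rerun instead of appending a duplicate.
        store.setdefault(rec["table"], {})[rec["id"]] = rec
    return {table: len(rows) for table, rows in store.items()}

records = [
    {"id": "c1", "table": "customers"},
    {"id": "c1", "table": "customers"},  # duplicate source row
    {"id": "j1", "table": "jobs"},
]

store = {}
first_counts = rebuild(store, records)
second_counts = rebuild(store, records)  # rerun against the populated store
is_idempotent = first_counts == second_counts
```

An append-based rebuild would fail this check, because the second pass would double every count.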
Where it slipped
Created Terrence Blackwood as a canonical customer instead of an orphan review case.
Missed the DeShawn SVC-007 conflict and lacked a service-code column.
Dropped department/role code normalization and left raw status/payment values.
Overcounted jobs and retained several duplicate/nickname customer splits.
Misses Three Or More Primary Canaries · Promotes Orphan Order
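The raw status/payment enum miss is typically closed with an explicit mapping from observed variants to canonical values. The variants and canonical names below are invented for illustration; the dataset's real values are not given in this writeup.

```python
# Hypothetical raw variants mapped to hypothetical canonical statuses.
CANONICAL_STATUS = {
    "complete": "completed",
    "completed": "completed",
    "done": "completed",
    "canceled": "cancelled",
    "cancelled": "cancelled",
}

def normalize_status(raw):
    """Return the canonical status, or None when the value needs review."""
    key = raw.strip().lower()
    return CANONICAL_STATUS.get(key)

normalized = [normalize_status(v) for v in [" DONE ", "Canceled", "???"]]
```

Returning `None` for unknown values routes them to review rather than guessing, which matches the quarantine-over-promotion judgment this benchmark tests.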
Artemis II Mission Visualization
A fact sheet plus an interactive 3D visualization of the Artemis II mission. Tests factual grounding, source integrity, and the ability to dramatize the hard beats — launch, staging, re-entry, recovery.
79 · legacy 79
Strong
The run produced a complete, runnable React/Vite/Three package with a separately maintained missionData.js source of truth, a detailed fact sheet, NASA-heavy citations, screenshots, desktop/mobile verification images, and mission-specific visual beats for launch, ascent, staging, TLI, lunar flyby, max distance, re-entry, splashdown, and recovery. It stays below excellent because several values need strict current-source cleanup, including closest lunar approach finalization, actual ascent milestone timings, official total miles after NASA's May 7 update, and a non-primary Orion helium-leak claim. The re-entry and recovery visuals are informative but still more app-like than cinematic.