A model in the suite · OpenAI

GPT-5.5

OpenAI · GPT · historical runs from 2026-04-23 through 2026-06-01 staging

71/100

Strict suite average · Legacy 81 · 3 benchmarks

Strongest historical non-image packet in this backfill set, led by Dingo and Artemis. The Car Wash result keeps the strict average grounded because operational canary misses remain substantial; the Brick rover is preserved only as a single-prompt reference.
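As a quick arithmetic check, the strict headline figure is consistent with an unweighted mean of the three per-benchmark strict scores listed on this page (78, 55, 79). This is only a sketch: the page does not state its aggregation rule, so the unweighted mean and the rounding to the nearest integer are assumptions.

```python
from statistics import mean

# Per-benchmark scores as listed on this page.
strict = {"Dingo & Co.": 78, "Car Wash": 55, "Artemis II": 79}
legacy = {"Dingo & Co.": 87, "Car Wash": 74, "Artemis II": 79}


def suite_average(scores):
    """Unweighted mean rounded to the nearest integer (assumed rule)."""
    return round(mean(scores.values()))


strict_avg = suite_average(strict)  # 212 / 3 = 70.67, rounds to 71
legacy_avg = suite_average(legacy)  # 240 / 3 = 80
print(strict_avg, legacy_avg)
```

Under this assumed rule the strict scores reproduce the 71 shown, while the legacy scores give 80 rather than the 81 shown, so the legacy figure may use a different aggregation or unrounded inputs.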


Where it ranks

GPT-5.5 against the field

Claude Opus 4.8 (ANT) · 81
GPT-5.5 (OAI) · 71
Gemini 3.5 Flash (High) Fast (GOO) · 56
Opus 4.7 (ANT) · 54
Sonnet 4.6 (ANT) · 52
GPT-5.4 (OAI) · 51
Gemini 3.1 Pro (GOO) · 38

Strict suite average · Legacy average · Scored 0–100 · higher is better


The runs

How GPT-5.5 handled each benchmark

Score, capability radar, and the honest read on what it nailed and where it slipped.

Dingo & Co. Knowledge Work

A 23-deliverable consulting brief: research, financial reconciliation, regulatory analysis, decks and spreadsheets. Tests whether a model can run an entire knowledge-work engagement end to end.

78 · legacy 87

Strong

This is the cleanest historical Dingo run: all 23 deliverables exist as real files, source integrity passed, regulatory/import ambiguity was handled unusually well, and the strategy work is coherent. It stays below excellent strict territory because the board deck has a real PPTX XML/rendering defect, one visible NPS inconsistency crosses artifacts, pricing research includes stale or imprecise claims, and there is no raw transcript evidence.


1. Claude Opus 4.8 · 80
2. GPT-5.5 · 78
3. Gemini 3.5 Flash (High) Fast · 62
4. Opus 4.7 · 54
5. Sonnet 4.6 · 52
6. Gemini 3.1 Pro · 38

What it nailed

Completed all 23 required deliverables as real files with valid types and preserved source integrity.
Handled dingo ownership, import-created demand, legal uncertainty, ethics, and Alaska/Australia mismatch as central operating constraints.
Used provided source imagery heavily in the deck, sales one-pager, and dashboard.
Delivered coherent GTM, board, pricing, risk, and investor-facing strategy with staged decisions and guardrails.

Where it slipped

Board deck contains invalid PPTX metadata XML because the Company value uses an unescaped ampersand, blocking Quick Look rendering.
Board deck slide 5 reports average NPS as 6.6 while source math and other artifacts use about 6.2.
Some pricing research was stale or imprecise, especially Halo membership pricing and PetSafe blended pricing.
Raw model output/transcript evidence is absent from the evaluation package.
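The ampersand defect above is easy to reproduce in miniature. The sketch below is illustrative only: the element layout is simplified (real PPTX document properties, typically in docProps/app.xml, carry namespaces), and the company string is an assumption based on the benchmark name, not the actual file content.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Assumed value for illustration; the real Company field is not shown on this page.
company = "Dingo & Co."

# Raw interpolation leaves a bare '&', which is not well-formed XML.
broken = f"<Properties><Company>{company}</Company></Properties>"
try:
    ET.fromstring(broken)
    broken_parses = True
except ET.ParseError:
    broken_parses = False

# escape() rewrites '&' as '&amp;' (and '<', '>' as entities), so the parser accepts it.
fixed = f"<Properties><Company>{escape(company)}</Company></Properties>"
parsed_company = ET.fromstring(fixed).find("Company").text

print(broken_parses, parsed_company)
```

Writing metadata through an XML library rather than string formatting avoids this class of defect, since serializers escape text content automatically.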

PPTX Metadata XML Rendering Defect · Cross Document Number Drift · Stale Or Imprecise Pricing Claims · Empty Raw Model Output Evidence Confidence Cap

Car Wash Operations

A filthy operational dataset — ghost records, orphaned orders, typo'd customers, raw enum variants. Tests judgment under messy real-world data: what gets fixed, quarantined, or wrongly promoted.

55 · legacy 74

Interesting but Unreliable

This is the strongest inspected audit scaffold: complete artifacts, full source discovery, a working frontend, strong provenance, fake/test rejection, and an empirically verified idempotent rebuild. It still fails too many operational canaries for a high strict score: Terrence Blackwood became a canonical customer, SVC-007 was missed, department codes were dropped, status/payment enums stayed raw, canonical jobs were overcounted, and several name variants stayed split. The score sits at the cap imposed for multiple primary canary misses and does not exceed it.


1. Claude Opus 4.8 · 86
2. GPT-5.5 · 55
3. Gemini 3.5 Flash (High) Fast · 51
4. GPT-5.4 · 51
5. Opus 4.7 · 48

What it nailed

Produced every expected artifact, including screenshots.
Discovered 465 of 465 source files and processed or partially processed almost all business-relevant files.
Rejected planted ghost/test records and preserved a large source-record provenance layer.
Passed an isolated idempotency rerun with identical counts.
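An idempotency rerun of the kind described in the last item can be sketched as follows. Everything here is hypothetical: the `rebuild` function, the record fields, and the sample rows are illustrations, not the run's actual pipeline. The idea is that canonical rows are upserted by key, so a second rebuild over the same sources overwrites rather than appends.

```python
def rebuild(store, source_records):
    """Upsert canonical jobs keyed by job_id and return per-table row counts.

    Hypothetical pipeline step: test records are skipped, and reruns
    overwrite existing keys instead of appending new rows.
    """
    for rec in source_records:
        if rec.get("is_test"):
            continue
        store["jobs"][rec["job_id"]] = {"customer": rec["customer"].strip().title()}
    return {name: len(rows) for name, rows in store.items()}


# Illustrative source records (not from the benchmark dataset).
source = [
    {"job_id": "J1", "customer": "Ana Ruiz"},
    {"job_id": "J2", "customer": "ana ruiz "},
    {"job_id": "J3", "customer": "Test User", "is_test": True},
]

store = {"jobs": {}}
first_counts = rebuild(store, source)
second_counts = rebuild(store, source)

# Identical counts after the second pass indicate the rebuild did not duplicate rows.
is_idempotent = first_counts == second_counts
print(first_counts, second_counts, is_idempotent)
```

Comparing counts is a coarse check; comparing row contents or a hash of each table would catch reruns that keep counts stable while changing values.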

Where it slipped

Created Terrence Blackwood as a canonical customer instead of an orphan review case.
Missed the DeShawn SVC-007 conflict and lacked a service-code column.
Dropped department/role code normalization and left raw status/payment values.
Overcounted jobs and retained several duplicate/nickname customer splits.
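Two of the misses above, raw status values and an orphan order promoted to a canonical customer, come down to the same triage rule: normalize what maps cleanly and send everything else to review. The sketch below is a minimal illustration under assumed names; the alias table, field names, and sample orders are hypothetical and do not come from the benchmark dataset.

```python
# Hypothetical raw-to-canonical status aliases.
STATUS_ALIASES = {
    "done": "completed",
    "complete": "completed",
    "completed": "completed",
    "cancelled": "canceled",
    "canceled": "canceled",
}


def normalize_status(raw):
    """Return the canonical status, or None when the raw value is unrecognized."""
    return STATUS_ALIASES.get(raw.strip().lower())


def triage_orders(orders, known_customer_ids):
    """Split orders into canonical rows and review cases with a reason attached."""
    canonical, review = [], []
    for order in orders:
        status = normalize_status(order["status"])
        if order["customer_id"] not in known_customer_ids:
            # Orphan: do not create a customer just because an order mentions one.
            review.append({**order, "reason": "orphan_customer"})
        elif status is None:
            review.append({**order, "reason": "unknown_status"})
        else:
            canonical.append({**order, "status": status})
    return canonical, review


orders = [
    {"order_id": "O1", "customer_id": "C1", "status": " DONE "},
    {"order_id": "O2", "customer_id": "C9", "status": "completed"},
    {"order_id": "O3", "customer_id": "C1", "status": "???"},
]
canonical, review = triage_orders(orders, known_customer_ids={"C1"})
print(canonical)
print(review)
```

Keeping the reason on each review row preserves provenance for whoever resolves the case later.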

Misses Three Or More Primary Canaries · Promotes Orphan Order

Artemis II Mission Visualization

A fact sheet plus an interactive 3D visualization of the Artemis II mission. Tests factual grounding, source integrity, and the ability to dramatize the hard beats — launch, staging, re-entry, recovery.

79 · legacy 79

Strong

The run produced a complete, runnable React/Vite/Three package with a separately maintained missionData.js source of truth, a detailed fact sheet, NASA-heavy citations, screenshots, desktop/mobile verification images, and mission-specific visual beats for launch, ascent, staging, TLI, lunar flyby, max distance, re-entry, splashdown, and recovery. It stays below excellent because several values need strict current-source cleanup, including closest lunar approach finalization, actual ascent milestone timings, official total miles after NASA's May 7 update, and a non-primary Orion helium-leak claim. The re-entry and recovery visuals are informative but still more app-like than cinematic.


1. GPT-5.5 · 79
2. Claude Opus 4.8 · 76
3. Opus 4.7 · 60
4. Gemini 3.5 Flash (High) Fast · 54

What it nailed

Complete fact sheet plus runnable React/Vite/Three visualization.
Separates mission facts, crew, vehicle facts, component details, events, and telemetry into missionData.js.
Uses many NASA source links directly in both fact sheet and visualization.
Covers the hard mission sequence rather than staying in generic orbit-only mode.
Includes 10 staged screenshots and desktop/mobile verification images.

Where it slipped

No formal historical scorecard or raw model output was found.
Some public-facing numbers need current-source reconciliation before publication.
Re-entry, splashdown, and recovery scenes are visually abstract and less video-useful than the stronger Opus visual treatment.
At least one anomaly/issue claim relies on non-primary reporting.

From the run