Field notes · the benchmark suite

Nate’s AI Benchmark Suite

One fixed battery of real-world tasks — a full knowledge-work engagement, a filthy data-operations cleanup, a four-model spatial build, and an interactive mission visualization — run through the frontier models as the labs ship them. Every model gets the same tasks and the same strict 0–100 rubric, so the only variable is the model.

7 models across 4 benchmarks so far. Claude Opus 4.8 leads the suite at 81. Pick a model to see its full card, or read the standings below.


I · The standings

Strict suite average, every model

Averaged across each model’s primary benchmark runs. The tick shows where the legacy rubric would have placed it.

ANT · Claude Opus 4.8 · 81
OAI · GPT-5.5 · 71
GOO · Gemini 3.5 Flash (High) Fast · 56
ANT · Opus 4.7 · 54
ANT · Sonnet 4.6 · 52
OAI · GPT-5.4 · 51
GOO · Gemini 3.1 Pro · 38

Legend: Strict suite average · Legacy average · Scored 0–100, higher is better
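As a rough check on how the strict suite average comes together, here is a minimal sketch that recomputes it from the per-benchmark scores in the matrix on this page. It assumes the average is a plain mean over the benchmarks a model has actually run, that unrun benchmarks are skipped rather than counted as zero, and that the result is rounded half-up to a whole number. The names `MATRIX` and `suite_average` are hypothetical, not part of the suite.

```python
from statistics import mean

# Per-benchmark scores copied from the matrix on this page, in the order:
# knowledge work, data operations, spatial build, interactive viz.
# None marks a benchmark the model has not run yet.
MATRIX = {
    "Claude Opus 4.8": [80, 86, 82, 76],
    "GPT-5.5": [78, 55, None, 79],
    "Gemini 3.5 Flash (High) Fast": [62, 51, 56, 54],
}


def suite_average(scores):
    """Mean of completed runs, rounded half-up (the rounding rule is an assumption)."""
    completed = [s for s in scores if s is not None]
    if not completed:
        return None  # nothing to average for a model with no runs
    return int(mean(completed) + 0.5)


for model, scores in MATRIX.items():
    print(model, suite_average(scores))
```

Under those assumptions the sketch gives 81, 71, and 56 for the three models, which matches the standings above.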


II · The matrix

Who ran what, and how it landed

Every model against every benchmark. Brighter is stronger; a hatched cell means the model hasn’t run that one yet. Tap a cell to jump straight to that result.

Model (suite avg) · Knowledge work · Data operations · Spatial build · Interactive viz

Claude Opus 4.8 (avg 81) · 80 · 86 · 82 · 76

GPT-5.5 (avg 71) · 78 · 55 · — · 79

Gemini 3.5 Flash (High) Fast (avg 56) · 62 · 51 · 56 · 54

—

—

—

—

Gemini 3.1 Pro (avg 38) · 38 · —


III · The tests

The four benchmarks

Each one is a complete, messy, real-world job — not a trivia quiz.

01 · Dingo & Co. Knowledge Work

A 23-deliverable consulting brief: research, financial reconciliation, regulatory analysis, decks and spreadsheets. Tests whether a model can run an entire knowledge-work engagement end to end.

6 models run · Best so far: Claude Opus 4.8 at 80

02 · Car Wash Operations

A filthy operational dataset — ghost records, orphaned orders, typo'd customers, raw enum variants. Tests judgment under messy real-world data: what gets fixed, quarantined, or wrongly promoted.

5 models run · Best so far: Claude Opus 4.8 at 86

03 · Brick — The AI LEGO Build

Four buildable LEGO models from prompt to part list to runnable browser guide. Tests spatial reasoning, physical plausibility, and whether large builds hold together or collapse into repetition.

2 models run · Best so far: Claude Opus 4.8 at 82

04 · Artemis II Mission Visualization

A fact sheet plus an interactive 3D visualization of the Artemis II mission. Tests factual grounding, source integrity, and the ability to dramatize the hard beats — launch, staging, re-entry, recovery.

4 models run · Best so far: GPT-5.5 at 79