Field notes · the benchmark suite

Nate’s AI Benchmark Suite

One fixed battery of real-world tasks — a full knowledge-work engagement, a filthy data-operations cleanup, a four-model spatial build, and an interactive mission visualization — run through the frontier models as the labs ship them. Every model gets the same tasks and the same strict 0–100 rubric, so the only variable is the model.

7 models across 4 benchmarks so far. Claude Opus 4.8 leads the suite at 81. Pick a model to see its full card, or read the standings below.

How to read it

The strict score is the headline — it docks for the things that actually break a deliverable: missed canaries, broken sources, physical implausibility, faked beats. The lighter legacy mark behind each bar is the old, more generous read. Open any model and hit Overlay on a benchmark to put other models on the same axes.

Strict suite average, every model

Averaged across each model’s primary benchmark runs. The tick shows where the legacy rubric would have placed it.
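As a rough illustration of that averaging, here is a minimal sketch. The model names and scores are made-up placeholders, not results from the suite, and skipping benchmarks a model has not run (rather than counting them as zero) is an assumption about how "primary benchmark runs" is meant.

```python
from typing import Optional

# Hypothetical strict scores (0-100) per benchmark; None means "not run yet".
# These names and numbers are placeholders, not real suite results.
scores: dict[str, dict[str, Optional[float]]] = {
    "model_a": {"knowledge_work": 70, "car_wash": 80, "brick": None, "artemis": 75},
    "model_b": {"knowledge_work": 60, "car_wash": None, "brick": None, "artemis": 66},
}


def suite_average(runs: dict[str, Optional[float]]) -> Optional[float]:
    """Average only the benchmarks that were actually run (assumed behavior)."""
    completed = [s for s in runs.values() if s is not None]
    if not completed:
        return None  # no runs yet, so there is no average to report
    return round(sum(completed) / len(completed), 1)


averages = {model: suite_average(runs) for model, runs in scores.items()}
print(averages)
```

Under this reading, a model that has run only two benchmarks is averaged over those two, so suite averages are only loosely comparable between models with different coverage.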

Who ran what, and how it landed

Every model against every benchmark. Brighter is stronger; a hatched cell means the model hasn’t run that one yet. Tap a cell to jump straight to that result.
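The grid can be pictured as a model-by-benchmark table with gaps. The sketch below is one hypothetical way to represent it in text, using `///` for a hatched (not yet run) cell; the benchmark keys, model name, and scores are illustrative assumptions, not data from the page.

```python
from typing import Optional

# Hypothetical short keys for the four benchmarks, in display order.
BENCHMARKS = ["knowledge_work", "car_wash", "brick", "artemis"]

Runs = dict[str, Optional[int]]


def render_row(model: str, runs: Runs) -> str:
    """Render one grid row; a missing or None score becomes a hatched cell."""
    cells = [
        f"{runs[b]:>3}" if runs.get(b) is not None else "///"
        for b in BENCHMARKS
    ]
    return f"{model:<10}" + " ".join(cells)


def models_run(grid: dict[str, Runs], benchmark: str) -> int:
    """Count how many models have a score for the given benchmark."""
    return sum(1 for runs in grid.values() if runs.get(benchmark) is not None)


# Placeholder data, not real results.
grid: dict[str, Runs] = {
    "model_a": {"knowledge_work": 70, "car_wash": 80, "artemis": 75},
}
row = render_row("model_a", grid["model_a"])
print(row)
```

The same structure yields the "models run" count shown under each benchmark: it is the number of non-empty cells in that benchmark's column.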

The four benchmarks

Each one is a complete, messy, real-world job — not a trivia quiz.

01 · Dingo & Co. Knowledge Work

A 23-deliverable consulting brief: research, financial reconciliation, regulatory analysis, decks and spreadsheets. Tests whether a model can run an entire knowledge-work engagement end to end.

6 models run · Best so far: Claude Opus 4.8 at 80

02 · Car Wash Operations

A filthy operational dataset — ghost records, orphaned orders, typo'd customers, raw enum variants. Tests judgment under messy real-world data: what gets fixed, quarantined, or wrongly promoted.

5 models run · Best so far: Claude Opus 4.8 at 86

03 · Brick — The AI LEGO Build

Four buildable LEGO models from prompt to part list to runnable browser guide. Tests spatial reasoning, physical plausibility, and whether large builds hold together or collapse into repetition.

2 models run · Best so far: Claude Opus 4.8 at 82

04 · Artemis II Mission Visualization

A fact sheet plus an interactive 3D visualization of the Artemis II mission. Tests factual grounding, source integrity, and the ability to dramatize the hard beats — launch, staging, re-entry, recovery.

4 models run · Best so far: GPT-5.5 at 79