
Gemini 3.1 Pro

Google · Gemini · 2026-04-25

38/100

Strict suite average · Legacy 50 · 1 benchmark

Single-benchmark historical packet. The Dingo run failed core artifact validity and should be treated as a low-confidence comparator until additional current-suite evidence exists.


Where it ranks

Gemini 3.1 Pro against the field

ANT · Claude Opus 4.8 · 81
OAI · GPT-5.5 · 71
GOO · Gemini 3.5 Flash (High) Fast · 56
ANT · Opus 4.7 · 54
ANT · Sonnet 4.6 · 52
OAI · GPT-5.4 · 51
GOO · Gemini 3.1 Pro · 38

Strict suite average · Legacy average · Scored 0–100 · higher is better


The runs

How Gemini 3.1 Pro handled each benchmark

Score, capability radar, and the honest read on what it nailed and where it slipped.

Dingo & Co. Knowledge Work

A 23-deliverable consulting brief: research, financial reconciliation, regulatory analysis, decks and spreadsheets. Tests whether a model can run an entire knowledge-work engagement end to end.

38 · legacy 50

Failed Core Purpose

The run recognizes some of the benchmark's weird business/legal premise, but it fails the central artifact-production job: eight required DOCX/PPTX/XLSX/PDF deliverables are invalid HTML/text stand-ins, image references are broken, workbooks have no real sheets/formulas/charts, research is shallow, and multiple planning numbers drift. Under strict normalization this is not a comparable production-grade knowledge-work package.
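One way a reviewer might spot an HTML or text stand-in is to look at the file's container format instead of its extension. The sketch below is an illustration only, not the benchmark's grader: `classify_artifact` and `OOXML_MARKERS` are hypothetical names, and the part paths are the conventional ones written by common Office tooling. It relies on two well-known facts: DOCX, PPTX, and XLSX files are ZIP archives, and PDF files begin with the `%PDF-` signature.

```python
import zipfile
from pathlib import Path

# Hypothetical lookup table (an assumption for this sketch): the conventional
# main part each Office Open XML package is expected to contain.
OOXML_MARKERS = {
    ".docx": "word/document.xml",
    ".pptx": "ppt/presentation.xml",
    ".xlsx": "xl/workbook.xml",
}


def classify_artifact(path):
    """Return 'valid', 'stand-in', or 'unchecked' for one deliverable."""
    path = Path(path)
    suffix = path.suffix.lower()

    if suffix == ".pdf":
        # A real PDF starts with the %PDF- signature; HTML or plain text does not.
        with open(path, "rb") as fh:
            return "valid" if fh.read(5) == b"%PDF-" else "stand-in"

    if suffix in OOXML_MARKERS:
        # An HTML/text file renamed to .docx/.pptx/.xlsx is not a ZIP archive.
        if not zipfile.is_zipfile(path):
            return "stand-in"
        with zipfile.ZipFile(path) as zf:
            has_main_part = OOXML_MARKERS[suffix] in zf.namelist()
        return "valid" if has_main_part else "stand-in"

    # Other file types are outside the scope of this check.
    return "unchecked"
```

A check like this only establishes that the container is plausible. It says nothing about whether the content inside is complete or correct.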


1 · Claude Opus 4.8 · 80
2 · GPT-5.5 · 78
3 · Gemini 3.5 Flash (High) Fast · 62
4 · Opus 4.7 · 54
5 · Sonnet 4.6 · 52
6 · Gemini 3.1 Pro · 38

What it nailed

Recognized several core absurdities: dingoes are not normal pets, Alaska does not solve legality, and the import program creates ethics and TAM risk.
Used consistent main planning assumptions in many files, including $380K recognized revenue, 200 units, $899 MSRP, $799 early-bird, and $740K launch budget.
Persona set separates curiosity traffic from real buyers.

Where it slipped

Eight required DOCX/PPTX/XLSX/PDF artifacts are invalid HTML/text stand-ins.
Provided image assets are referenced through broken paths, so deck, sales one-pager, dashboard, blog, and emails do not actually render those images.
Regulatory research is shallow and often not tied to official jurisdiction-specific sources.
No real Excel sheets, formulas, or charts exist in the required workbooks.
Dashboard channel math conflicts with source totals and the broader 200-unit story.
Raw model output/transcript evidence is absent from the evaluation package.
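The finding that the workbooks contain no real sheets, formulas, or charts can be checked by counting parts inside the XLSX package. This is a rough sketch under stated assumptions, not the evaluation's actual method: `workbook_summary` is a hypothetical helper. It assumes the conventional package layout (`xl/worksheets/`, `xl/charts/`) and that worksheet XML uses the default namespace, so formula cells appear as `<f>` elements.

```python
import re
import zipfile


def workbook_summary(path):
    """Count worksheet parts, formula cells, and chart parts in an .xlsx file.

    Hypothetical helper for illustration. Returns all zeros when the file is
    not a ZIP archive, for example an HTML or text stand-in.
    """
    summary = {"sheets": 0, "formulas": 0, "charts": 0}
    if not zipfile.is_zipfile(path):
        return summary

    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
        sheets = [n for n in names if re.fullmatch(r"xl/worksheets/[^/]+\.xml", n)]
        charts = [n for n in names if re.fullmatch(r"xl/charts/[^/]+\.xml", n)]
        summary["sheets"] = len(sheets)
        summary["charts"] = len(charts)
        # Formula cells carry an <f> child element in SpreadsheetML
        # (assumes no namespace prefix on worksheet elements).
        summary["formulas"] = sum(
            len(re.findall(rb"<f[\s>/]", zf.read(n))) for n in sheets
        )
    return summary
```

A result with zero sheets, zero formulas, and zero charts would match the description above of a workbook with no real spreadsheet content.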

Markdown StandinsUnreadable Core ArtifactsNon Runnable Core ArtifactIgnored Source ImagesWeak Regulatory CoverageMaterial Number InconsistencyEmpty Raw Model Output Evidence Confidence Cap