Gemini 3.1 Pro

Google · Gemini · 2026-04-25

38/100
Strict suite average · Legacy 50 · 1 benchmark

Single-benchmark historical packet. The Dingo run failed core artifact validity and should be treated as a low-confidence comparator until additional current-suite evidence exists.

Gemini 3.1 Pro against the field

How Gemini 3.1 Pro handled each benchmark

Score, capability radar, and the honest read on what it nailed and where it slipped.

Dingo & Co. Knowledge Work

A 23-deliverable consulting brief: research, financial reconciliation, regulatory analysis, decks and spreadsheets. Tests whether a model can run an entire knowledge-work engagement end to end.

38/100 · Legacy 50
Failed Core Purpose

The run recognizes parts of the benchmark's odd business and legal premise, but it fails the central artifact-production job. Eight required DOCX/PPTX/XLSX/PDF deliverables are invalid HTML/text stand-ins, image references are broken, workbooks have no real sheets, formulas, or charts, research is shallow, and multiple planning numbers drift. Under strict normalization, this does not count as a comparable, production-grade knowledge-work package.

Instruction Following · Artifact Validity · Source Integrity · Research Grounding · Semantic Judgment · Quantitative Reasoning · Visual Storytelling · UX Reviewability · Production Readiness · Speed
1. Claude Opus 4.8 · 80
2. GPT-5.5 · 78
3. Gemini 3.5 Flash (High) · Fast · 62
4. Opus 4.7 · 54
5. Sonnet 4.6 · 52
6. Gemini 3.1 Pro · 38

What it nailed

  • Recognized several core absurdities: dingoes are not normal pets, Alaska does not solve legality, and the import program creates ethics and TAM risk.
  • Used consistent main planning assumptions in many files, including $380K recognized revenue, 200 units, $899 MSRP, $799 early-bird, and $740K launch budget.
  • Persona set separates curiosity traffic from real buyers.

Where it slipped

  • Eight required DOCX/PPTX/XLSX/PDF artifacts are invalid HTML/text stand-ins.
  • Provided image assets are referenced through broken paths, so deck, sales one-pager, dashboard, blog, and emails do not actually render those images.
  • Regulatory research is shallow and often not tied to official jurisdiction-specific sources.
  • No real Excel sheets, formulas, or charts exist in the required workbooks.
  • Dashboard channel math conflicts with source totals and the broader 200-unit story.
  • Raw model output/transcript evidence is absent from the evaluation package.
Markdown Standins · Unreadable Core Artifacts · Non Runnable Core Artifact · Ignored Source Images · Weak Regulatory Coverage · Material Number Inconsistency · Empty Raw Model Output Evidence Confidence Cap
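
The "invalid HTML/text stand-ins" finding above is the kind of failure a simple byte-level check can catch. The sketch below is illustrative only and is not the benchmark's actual validator: the function name `looks_like_valid_artifact` is hypothetical, and it assumes only two well-known format facts, namely that DOCX/PPTX/XLSX files are ZIP packages containing a `[Content_Types].xml` part and that PDF files begin with `%PDF-`. Passing this check does not prove a file opens cleanly or contains real sheets, formulas, or charts.

```python
import zipfile
from pathlib import Path

# Office Open XML formats are ZIP packages (assumption stated in the lead-in).
OOXML_EXTENSIONS = {".docx", ".pptx", ".xlsx"}


def looks_like_valid_artifact(path):
    """Return True if the file's bytes are plausible for its extension.

    Hypothetical helper: a minimal container-level check, not a full
    validation of document content.
    """
    p = Path(path)
    ext = p.suffix.lower()

    if ext == ".pdf":
        # A PDF file starts with the "%PDF-" header.
        with open(p, "rb") as fh:
            return fh.read(5) == b"%PDF-"

    if ext in OOXML_EXTENSIONS:
        # An HTML or plain-text stand-in is not a ZIP archive at all.
        if not zipfile.is_zipfile(p):
            return False
        # A real OOXML package carries a [Content_Types].xml part.
        with zipfile.ZipFile(p) as zf:
            return "[Content_Types].xml" in zf.namelist()

    raise ValueError(f"unsupported extension: {ext}")
```

A file named `report.docx` that actually contains HTML markup would return `False` here, which matches the stand-in pattern described in the review.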