A model in the suite · Google

Gemini 3.5 Flash (High) Fast

Google · Gemini · High / Fast · 2026-05-27

56/100
Strict suite average · Legacy 68 · 4 benchmarks

Very fast scaffold generator with broad completion, but uneven judgment. The strict score is much lower than the legacy average because the run failed core semantic, factual, visual-storytelling, and physical-plausibility checks.
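The headline numbers can be reproduced from the four per-benchmark scores reported on this page. The sketch below assumes the suite average is an unweighted mean rounded to the nearest integer; the page does not state the aggregation rule, so treat that as an assumption rather than the suite's documented method.

```python
from statistics import mean

# Per-benchmark scores as reported on this page: (strict, legacy).
scores = {
    "Dingo & Co. Knowledge Work": (62, 72),
    "Car Wash Operations": (51, 64),
    "Brick": (56, 67),
    "Artemis II Mission Visualization": (54, 68),
}

# Assumption: unweighted mean across benchmarks, rounded to the nearest integer.
strict_avg = mean([strict for strict, _ in scores.values()])
legacy_avg = mean([legacy for _, legacy in scores.values()])

print(round(strict_avg), round(legacy_avg))  # prints: 56 68
```

Under that assumption the unrounded means are 55.75 and 67.75, which match the displayed 56 and legacy 68.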


Gemini 3.5 Flash (High) Fast against the field

How Gemini 3.5 Flash (High) Fast handled each benchmark

Score, capability radar, and the honest read on what it nailed and where it slipped.

Dingo & Co. Knowledge Work

A 23-deliverable consulting brief: research, financial reconciliation, regulatory analysis, decks and spreadsheets. Tests whether a model can run an entire knowledge-work engagement end to end.

62 · legacy 72
Competent Scaffold

Full deliverable completion and decent strategic synthesis, but weak visual polish, thin source coverage, shallow legal/regulatory distinction, spreadsheet limitations, manual recovery cycles, and missing raw model evidence keep this well below strong long-term comparison territory.

Radar axes: Instruction Following, Artifact Validity, Source Integrity, Research Grounding, Semantic Judgment, Quantitative Reasoning, Visual Storytelling, UX Reviewability, Production Readiness, Speed
1. Claude Opus 4.8: 80
2. GPT-5.5: 78
3. Gemini 3.5 Flash (High) Fast: 62
4. Opus 4.7: 54
5. Sonnet 4.6: 52
6. Gemini 3.1 Pro: 38

What it nailed

  • Completed 23/23 required deliverables as real files.
  • Preserved source input copy by checksum.
  • Reconciled important financial contradictions.
  • Treated Northern Canid Imports as central rather than incidental.

Where it slipped

  • Only 15 URLs found against the benchmark's 20-URL expectation.
  • Regulatory handling compressed ownership, import, transport, quarantine, and local-law distinctions.
  • Deck and one-pager were not visually strong.
  • Spreadsheets lacked charts and one workbook was thin.
  • Raw model output evidence is absent.
Weak Regulatory Coverage · Fabricated Or Broken Sources Soft Cap · Empty Raw Model Output Evidence Confidence Cap
Wall clock 15m 11s · Partnered lane scored 67

Car Wash Operations

A filthy operational dataset — ghost records, orphaned orders, typo'd customers, raw enum variants. Tests judgment under messy real-world data: what gets fixed, quarantined, or wrongly promoted.

51 · legacy 64
Interesting but Unreliable

The run was fast and complete, but it failed the benchmark's most important operational canaries: ghost/test records survived, Terrence Blackwood was promoted instead of treated as orphaned, typo-order merges failed, department codes and enum normalization were weak, and provenance was incomplete.

Radar axes: Instruction Following, Artifact Validity, Source Integrity, Semantic Judgment, Quantitative Reasoning, UX Reviewability, Production Readiness, Speed
1. Claude Opus 4.8: 86
2. GPT-5.5: 55
3. Gemini 3.5 Flash (High) Fast: 51
4. GPT-5.4: 51
5. Opus 4.7: 48

What it nailed

  • Created all required artifacts and an openable database.
  • Accounted for 463 business files.
  • Preserved source checksums and avoided strict sensitive-term leaks.
  • Produced a usable static audit UI.

Where it slipped

  • Ghost/test records were promoted instead of quarantined.
  • Terrence Blackwood was created as a customer instead of flagged as orphaned.
  • All 13 planted typo orders stayed attached to typo-name customers.
  • Nickname variants remained split.
  • Status and payment methods remained raw variants.
Misses Three Or More Primary Canaries · Promotes Ghost Records · Promotes Orphan Order · Empty Raw Model Output Evidence Confidence Cap
Wall clock 6m 09s
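To make the canaries concrete, here is a minimal sketch of the kind of triage the benchmark rewards: quarantine ghost/test records, flag orders whose customer is unknown as orphaned rather than creating a customer, and map raw status variants to one canonical value. The field names, alias table, and sample records are all hypothetical and are not taken from the benchmark's dataset or grader.

```python
# Hypothetical alias table; the benchmark's real enum variants are not shown on this page.
STATUS_ALIASES = {
    "done": "completed",
    "complete": "completed",
    "completed": "completed",
    "canceled": "cancelled",
    "cancelled": "cancelled",
}


def normalize_status(raw):
    """Map a raw status variant to a canonical value, or None if unrecognized."""
    return STATUS_ALIASES.get(raw.strip().lower())


def triage(order, known_customers):
    """Return 'quarantine', 'orphaned', or 'promote' for one order record."""
    name = order["customer"].strip()
    if name.lower().startswith("test"):
        return "quarantine"  # ghost/test records never reach the clean tables
    if name not in known_customers:
        return "orphaned"    # flag for review instead of inventing a customer
    return "promote"


# Illustrative records only.
known_customers = {"Ana Ruiz"}
orders = [
    {"customer": "Test Customer", "status": "DONE"},
    {"customer": "Unlisted Person", "status": "complete "},
    {"customer": "Ana Ruiz", "status": "Canceled"},
]
results = [(triage(o, known_customers), normalize_status(o["status"])) for o in orders]
print(results)
```

The point of the design is ordering: the quarantine and orphan checks run before anything is promoted, so a record that fails either check cannot become a customer or a clean order by default.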

Brick — The AI LEGO Build

Four buildable LEGO models from prompt to part list to runnable browser guide. Tests spatial reasoning, physical plausibility, and whether large builds hold together or collapse into repetition.

56 · legacy 67
Interesting but Unreliable

The run completed all four prompts and maintained clean part accounting, but physical plausibility was weak, large prompts degraded into repetitive abstraction, the airship station missed core structure, and visual quality was only partial-pass.

Radar axes: Instruction Following, Artifact Validity, Source Integrity, Semantic Judgment, Quantitative Reasoning, Spatial Reasoning, Visual Storytelling, UX Reviewability, Production Readiness, Speed
1. Claude Opus 4.8: 82
2. Gemini 3.5 Flash (High) Fast: 56

What it nailed

  • Completed all four benchmark prompts.
  • Hit exact target piece counts.
  • Maintained unique IDs and one-step coverage for parts.
  • Produced runnable browser guides with controls.

Where it slipped

  • Many overlaps and unsupported-looking placements.
  • Large models used repetitive generated patterns.
  • Airship station collapsed requested nine-chapter structure to five.
  • Some HUD/text overlap and camera framing problems.
Large Scale Collapse · Severe Physical Implausibility · Empty Raw Model Output Evidence Confidence Cap
Wall clock 2m 35s
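The part-accounting checks this run passed (exact piece count, unique IDs, one-step coverage) are mechanical and can be sketched as below. This assumes "one-step coverage" means every part is placed in exactly one build step; the function name and data shapes are hypothetical and do not reflect the benchmark's actual grader.

```python
from collections import Counter


def check_part_accounting(part_ids, steps, target_count):
    """Return a list of accounting problems; an empty list means the checks pass."""
    problems = []
    if len(part_ids) != target_count:
        problems.append(f"piece count {len(part_ids)} != target {target_count}")
    duplicates = sorted(p for p, n in Counter(part_ids).items() if n > 1)
    if duplicates:
        problems.append(f"duplicate part IDs: {duplicates}")
    # Count how many times each part is placed across all build steps.
    placements = Counter(p for step in steps for p in step)
    for part in sorted(set(part_ids)):
        if placements[part] != 1:
            problems.append(f"{part} placed in {placements[part]} steps")
    unknown = sorted(set(placements) - set(part_ids))
    if unknown:
        problems.append(f"steps reference unknown parts: {unknown}")
    return problems


# Illustrative data only.
part_ids = ["p1", "p2", "p3"]
good_report = check_part_accounting(part_ids, [["p1"], ["p2", "p3"]], target_count=3)
bad_report = check_part_accounting(part_ids, [["p1"], ["p1", "p4"]], target_count=3)
print(good_report)
print(bad_report)
```

A check like this says nothing about overlaps or unsupported placements, which is consistent with a run that kept clean part accounting while still failing on physical plausibility.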


Artemis II Mission Visualization

A fact sheet plus an interactive 3D visualization of the Artemis II mission. Tests factual grounding, source integrity, and the ability to dramatize the hard beats — launch, staging, re-entry, recovery.

54 · legacy 68
Interesting but Unreliable

The artifact was fast, complete, and runnable, but it generated broken source URLs, mixed supported facts with unsupported details, overstated post-flight/anomaly claims, and avoided the hardest visual mission beats like launch, staging, re-entry, and recovery.

Radar axes: Instruction Following, Artifact Validity, Source Integrity, Research Grounding, Semantic Judgment, Spatial Reasoning, Visual Storytelling, UX Reviewability, Production Readiness, Quantitative Reasoning, Speed
1. GPT-5.5: 79
2. Claude Opus 4.8: 76
3. Opus 4.7: 60
4. Gemini 3.5 Flash (High) Fast: 54

What it nailed

  • Completed a fact sheet and interactive 3D visualization in under three minutes.
  • Produced a usable dashboard shell with timeline, HUD, narrative panel, and camera modes.
  • Correctly treated Artemis II as a completed April 2026 mission.

Where it slipped

  • Six generated bibliography URLs returned 404.
  • Some timeline and closest-approach values were off.
  • Unsupported anomaly/post-flight details were overclaimed.
  • Visualization did not convincingly show launch, staging, re-entry, or recovery.
  • Visual result trailed prior Opus 4.7 and GPT-5.5 artifacts.
Fabricated Or Broken Sources · Publication Fact Errors · Generic Orbit Scene · Empty Raw Model Output Evidence Confidence Cap
Wall clock 2m 53s
