2025-11-18 · Google

A new era of intelligence with Gemini 3

models

Source: DeepMind Date: 2025-11-18 URL: https://deepmind.google/blog/a-new-era-of-intelligence-with-gemini-3/

Summary

Google launched Gemini 3, its flagship model, which tops LMArena at Elo 1501. Reported benchmark scores: 37.5% on Humanity’s Last Exam (HLE), 91.9% on GPQA Diamond, 76.2% on SWE-bench Verified, and 81% on MMMU-Pro. The Deep Think variant raises HLE to 41.0% and scores 45.1% on ARC-AGI-2. Gemini 3 also tops Vending-Bench 2, a long-horizon planning benchmark. It is available across the Gemini app, Search AI Mode, AI Studio, Vertex AI, and Google Antigravity.

Implications

The launch benchmark stack redefines the frontier. Taken together, LMArena 1501, HLE 37.5%, and SWE-bench Verified 76.2% position Gemini 3 as a credible new frontier model as of November 2025. That is the window before Anthropic’s Opus 4.7 / next Claude cycle and OpenAI’s post-GPT-5.5 release. Google appears to have timed this to own the benchmark narrative heading into the end of 2025.

ARC-AGI-2 at 45.1% with Deep Think is the headline number. ARC-AGI-2 was explicitly designed to resist current AI systems. A score of 45.1% does not mean the benchmark is solved, but it is the highest published score on what is arguably the hardest generalization benchmark in the field. Combined with HLE at 41.0%, Deep Think is staking a claim as the strongest publicly available reasoning mode.
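
To put the Deep Think lift in concrete terms, here is a minimal sketch of the arithmetic using only the two HLE scores quoted in this note (37.5% base, 41.0% Deep Think). The function name score_gain is hypothetical, introduced purely for illustration. No base-model ARC-AGI-2 score is given here, so no comparison is attempted for that benchmark.

```python
def score_gain(base_pct: float, boosted_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = boosted_pct - base_pct
    relative = absolute / base_pct * 100.0
    return absolute, relative


# Scores quoted in this note for Humanity's Last Exam (HLE).
HLE_BASE = 37.5
HLE_DEEP_THINK = 41.0

abs_gain, rel_gain = score_gain(HLE_BASE, HLE_DEEP_THINK)
print(f"HLE: +{abs_gain:.1f} points ({rel_gain:.1f}% relative)")
```

On these figures the gain is 3.5 percentage points, or roughly 9% relative to the base score: a meaningful but incremental lift on HLE.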

Vending-Bench 2 for long-horizon planning is a new evaluation surface. Topping a planning-specific benchmark alongside coding and reasoning signals that Gemini 3 isn’t just a QA model — it’s designed for extended task completion. That’s the agentic capability foundation that Mariner and Workspace integrations need.

Watch:

Benchmark comparisons from Claude Opus 4.7 and a GPT-5.5-class model: Gemini 3’s numbers are now the yardstick
Deep Think API availability and pricing: the most important capability is gated to a mode, not the base model
Whether the LMArena Elo of 1501 holds as community use shifts to Gemini 3 as the primary comparison point
