2025-03-25 · Google

Gemini 2.5: Our most intelligent AI model

Source: DeepMind
Date: 2025-03-25
URL: https://deepmind.google/blog/gemini-2-5-our-most-intelligent-ai-model/

Summary

Google launched Gemini 2.5 as a thinking model, one that reasons before responding. It claims the #1 spot on LMArena by a “significant margin,” 18.8% on Humanity’s Last Exam, 63.8% on SWE-bench Verified, and leading scores on the GPQA and AIME 2025 science and math benchmarks. The context window is 1M tokens, with 2M forthcoming. The model is available experimentally in AI Studio and, for Advanced users, in the Gemini app.

Implications

18.8% on HLE at launch is the frontier stake. Humanity’s Last Exam was designed to be beyond the reach of then-current models. As of March 2025, 18.8% without tools marks where the frontier starts: it is the Gemini 2.5 baseline that OpenAI and Anthropic, then shipping GPT-4.5 and Claude 3.7, had to respond to. It defined the reasoning-model tier for Q1–Q2 2025.

63.8% on SWE-bench Verified is the coding integrator signal. SWE-bench Verified measures the ability to resolve real GitHub issues, which is what Cursor, Aider, and Devin-style tools actually care about. 63.8% at the launch of 2.5 is competitive with contemporaneous Claude models. Watch whether the 2.5 Pro updates (May 2025) pushed this number significantly higher.

1M → 2M context is a stated roadmap commitment. Announcing the upcoming context expansion sets developer expectations and influences integrators who plan their architecture around context windows. It also puts pressure on Anthropic’s 200K-token Claude context window to stay competitive.

Watch:

How long the LMArena #1 spot holds before competitor updates; the leaderboard moves fast in this period
Whether the 2M-token context window arrived on schedule and how it compares to Claude’s long-context performance
The trajectory of enterprise AI Studio adoption, which is where the B2B Gemini story starts
