Gemini 2.5: Our most intelligent AI model
Source: DeepMind
Date: 2025-03-25
URL: https://deepmind.google/blog/gemini-2-5-our-most-intelligent-ai-model/
Summary
Google launched Gemini 2.5 as a thinking model, one that reasons before responding. It claimed #1 on LMArena by a “significant margin,” 18.8% on Humanity’s Last Exam, 63.8% on SWE-bench Verified, and leading scores on the GPQA and AIME 2025 math/science benchmarks. The context window is 1M tokens, with 2M forthcoming. The model is available experimentally in AI Studio and, for Advanced users, in the Gemini app.
Implications
18.8% on HLE at launch is the frontier stake. Humanity’s Last Exam was designed to be beyond the reach of current models. As of March 2025, 18.8% without tools marks where the frontier starts: it is the Gemini 2.5 baseline that competing models such as GPT-4.5 and Claude 3.7 were measured against, and it defined the reasoning-model tier for Q1–Q2 2025.
63.8% on SWE-bench Verified is the signal for coding integrators. SWE-bench Verified measures the ability to resolve real GitHub issues, which is what Cursor, Aider, and Devin-style tools actually care about. At the 2.5 launch, 63.8% (reported with a custom agent setup) was competitive with contemporaneous Claude models. Watch whether the 2.5 Pro updates (May 2025) moved this number significantly.
1M → 2M context is a stated roadmap commitment. Announcing the upcoming expansion sets developer expectations and influences integrators who plan their architecture around context-window size. It also puts pressure on Anthropic, whose Claude models offer a 200K context window, to stay competitive.
Watch:
- How long the LMArena #1 position holds before competitor updates; the leaderboard moves fast in this period
- Whether the 2M context window arrives on schedule and how it compares to Claude’s long-context performance
- The trajectory of enterprise AI Studio adoption, which is where the B2B Gemini story starts