2025-05-20 · Google

Gemini 2.5: Our most intelligent models are getting even better

security, models

Source: DeepMind · Date: 2025-05-20 · URL: https://deepmind.google/blog/gemini-25-our-world-leading-model-is-getting-even-better/

Summary

Google I/O 2025 Gemini 2.5 update: Deep Think, an experimental reasoning mode for 2.5 Pro, considers “multiple hypotheses before responding.” It leads on LiveCodeBench (competition coding) and USAMO (math), and scores 84.0% on MMMU. Gemini 2.5 Pro tops WebDev Arena with an ELO of 1415 and leads all LMArena leaderboards. Flash gains 20–30% token efficiency. New capabilities: native audio output, computer use (Project Mariner), prompt injection hardening, and developer-facing thinking budgets with thought summaries.

Implications

Deep Think as a staged capability unlock. Rolling out Deep Think as experimental at I/O, weeks after the consumer toggle launched, signals that Google is managing capability release deliberately: research teams get it first, then consumers, then the production API. Anthropic follows the same sequencing with beta access to extended thinking.

WebDev Arena ELO 1415 is the integrator-facing claim. Topping WebDev Arena is a direct message to Cursor, Bolt, v0, and Replit: use Gemini 2.5 Pro for web app generation. This is Google’s strongest claim in the coding-integrator market, where Claude was previously dominant. Watch whether integrators actually shift defaults.

20–30% Flash token efficiency is the cost story. Efficiency gains on Flash matter more than benchmark wins for high-volume production use. Using 20–30% fewer tokens for equivalent outputs is a direct cost reduction for any API consumer already using Flash: no migration required, just a model version bump.
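To make the cost arithmetic concrete, here is a minimal sketch of how a 20–30% token reduction would flow through to a monthly bill. The request volume, tokens per request, and per-million-token price are hypothetical placeholders, not Google's published pricing, and the sketch assumes the reduction applies uniformly to billed tokens.

```python
def monthly_cost(requests: int, tokens_per_request: float, price_per_million: float) -> float:
    """Monthly token spend, in the same currency as price_per_million."""
    return requests * tokens_per_request * price_per_million / 1_000_000


def cost_after_reduction(baseline_cost: float, token_reduction: float) -> float:
    """Spend after cutting billed tokens by token_reduction (0.2 means 20% fewer)."""
    if not 0.0 <= token_reduction <= 1.0:
        raise ValueError("token_reduction must be between 0 and 1")
    return baseline_cost * (1.0 - token_reduction)


# Hypothetical workload: 10M requests per month, 500 billed tokens each,
# at an assumed price of 0.60 per million tokens (illustrative only).
baseline = monthly_cost(10_000_000, 500, 0.60)
low_case = cost_after_reduction(baseline, 0.20)   # 20% fewer tokens
high_case = cost_after_reduction(baseline, 0.30)  # 30% fewer tokens

print(f"baseline: {baseline:.2f}")
print(f"after 20% reduction: {low_case:.2f}")
print(f"after 30% reduction: {high_case:.2f}")
```

Under these assumptions the saving scales linearly with volume, which is why the efficiency gain matters most to high-volume consumers. Real savings depend on how much of a workload's billed tokens the efficiency gain actually touches.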

Watch:

Whether Deep Think’s USAMO (competition math) performance translates to the IMO gold claim from the October announcement; the timeline discrepancy needs reconciliation
Project Mariner’s computer use API: the developer access timeline is the real enterprise unlock
Whether independent integrators running production workloads validate the token efficiency claims
