2025-05-06 · Google

Gemini 2.5 Pro Preview: even better coding performance

agents · models · research


Source: DeepMind · Date: 2025-05-06 · URL: https://deepmind.google/blog/gemini-25-pro-preview-even-better-coding-performance/

Summary

Google updated Gemini 2.5 Pro Preview with significantly improved coding performance, ranking #1 on the WebDev Arena leaderboard for UI and web app quality and scoring 84.8% on VideoMME. Key improvements: front-end and UI development, code transformation and editing accuracy, reduced function-calling errors, and stronger agentic workflow support. Highlighted use cases include video-to-code generation, design-file-to-CSS feature development, and rapid web app prototyping.

Implications

For product teams, the #1 WebDev Arena ranking is a stronger signal than SWE-bench. SWE-bench measures bug fixing in existing codebases, a narrow slice of developer work. WebDev Arena measures human preference for aesthetically pleasing, functional web apps. Winning on human preference in UI generation is the claim that matters to frontend teams and indie developers, who evaluate models by “does this look right,” not “does this pass tests.”
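To make the contrast with pass/fail testing concrete, here is a minimal sketch of how pairwise human votes can be turned into a ranking. It assumes an Elo-style update; the source does not say how WebDev Arena computes its scores, and the model names, starting rating of 1500, and K-factor of 32 are all hypothetical.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one human vote."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def rank(votes, start: float = 1500.0):
    """votes: iterable of (model_a, model_b, a_won). Returns models sorted by rating."""
    ratings = defaultdict(lambda: start)
    for a, b, a_won in votes:
        ratings[a], ratings[b] = update(ratings[a], ratings[b], a_won)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical votes: each tuple is one person choosing between two generated apps.
votes = [
    ("model_x", "model_y", True),
    ("model_x", "model_z", True),
    ("model_y", "model_z", False),
]
leaderboard = rank(votes)
```

Under this kind of scheme a model climbs only when people prefer its output side by side, which is why a top position reflects “does this look right” rather than a test suite passing.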

The 84.8% VideoMME score opens video-to-code as a legitimate workflow. Converting YouTube tutorials or screen recordings into functional code is a qualitatively new use case, not an incremental coding improvement. If the quality holds, it shortcuts the loop from “I want to replicate this” to a working prototype without manually transcribing steps. That’s not a benchmark number; it’s a workflow change.

Reduced function-calling errors are the agentic reliability signal. Coding agents fail when they call tools incorrectly or at the wrong moment; the failure mode is rarely raw code quality. Improved function-calling trigger rates mean longer autonomous chains before human correction. That’s the measurement that predicts whether agents complete multi-step coding tasks or derail.
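The arithmetic behind “longer autonomous chains” can be sketched as follows. This is a rough model, assuming each tool call succeeds independently with the same probability; the 95% and 99% rates are hypothetical illustrations, not figures from Google's announcement.

```python
def chain_success_probability(per_call_success: float, steps: int) -> float:
    """Probability that every one of `steps` independent tool calls succeeds."""
    return per_call_success ** steps


def expected_calls_before_failure(per_call_success: float) -> float:
    """Mean number of successful calls before the first failed one."""
    if per_call_success >= 1.0:
        return float("inf")
    return per_call_success / (1.0 - per_call_success)


# Hypothetical per-call success rates, chosen only to show the compounding effect.
for rate in (0.95, 0.99):
    p20 = chain_success_probability(rate, 20)
    run = expected_calls_before_failure(rate)
    print(f"rate={rate:.2f}  20-step chain succeeds: {p20:.1%}  mean run: {run:.0f} calls")
```

Under these assumptions, moving from 95% to 99% per-call reliability lifts a 20-step chain from roughly a one-in-three chance of finishing to roughly four-in-five, which is why small function-calling gains matter disproportionately for multi-step agents.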

Watch:

  • WebDev Arena position stability as other labs release coding-focused updates: does 2.5 Pro hold #1, or is this a one-week lead?
  • Video-to-code quality on complex tutorials with heavy CSS/animation — the benchmark video inputs may not reflect real-world tutorial messiness
  • Whether the agentic workflow improvements translate to external coding agent products (Cursor, Windsurf) that use Gemini via API
