2026-05-05 · Google

Accelerating Gemma 4: faster inference with multi-token prediction drafters

protocols · models · infrastructure

Source: Google
Date: 2026-05-05
URL: https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/

Summary

Google released multi-token prediction (MTP) drafter models for Gemma 4, reporting up to a 3x inference speedup with no degradation in output quality. The technique is a form of speculative decoding: a lightweight drafter proposes several future tokens, and the Gemma 4 31B target model verifies them in parallel in a single forward pass. Standard autoregressive generation is limited by memory bandwidth, because each new token requires a full pass over the model weights; verifying several tokens per pass amortizes that cost. Output quality is preserved because the target model checks every drafted token and replaces any draft it rejects with its own prediction. Reported gains include ~2x on NVIDIA RTX PRO 6000 for the 26B model and ~2.2x on Apple Silicon at batch sizes of 4–8.
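The draft-then-verify loop described above can be sketched in a few lines. This is a toy illustration of greedy speculative decoding in general, not Google's MTP implementation: the "models" are hypothetical deterministic functions (`toy_target`, `toy_drafter`), the draft length `k` is an assumed parameter, and the per-position verification loop only emulates what a real target model would compute in one batched forward pass.

```python
from typing import Callable, List, Tuple

# A "model" here is just a greedy next-token function over a token prefix.
NextToken = Callable[[List[int]], int]


def toy_target(prefix: List[int]) -> int:
    """Hypothetical stand-in for the large target model."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def toy_drafter(prefix: List[int]) -> int:
    """Hypothetical cheap drafter: agrees with the target except at some positions."""
    guess = toy_target(prefix)
    return (guess + 1) % 50 if len(prefix) % 5 == 0 else guess


def speculative_step(target: NextToken, drafter: NextToken,
                     prefix: List[int], k: int) -> List[int]:
    """Run one draft-then-verify round and return the newly accepted tokens."""
    # 1. The drafter proposes k tokens autoregressively (cheap).
    draft: List[int] = []
    for _ in range(k):
        draft.append(drafter(prefix + draft))

    # 2. The target checks each drafted position. A real system scores all
    #    positions in a single forward pass; this loop only emulates that.
    accepted: List[int] = []
    for i in range(k):
        expected = target(prefix + draft[:i])
        if draft[i] != expected:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(draft[i])

    # 3. Every draft matched, so the same pass also yields one bonus token.
    accepted.append(target(prefix + draft))
    return accepted


def generate(target: NextToken, drafter: NextToken, prompt: List[int],
             n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate n_new tokens; also return how many verify rounds were needed."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, drafter, out, k))
        rounds += 1
    return out[:len(prompt) + n_new], rounds


def greedy_baseline(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive decoding: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out
```

In this sketch `generate(toy_target, toy_drafter, [1, 2, 3], 20)` returns the same tokens as `greedy_baseline(toy_target, [1, 2, 3], 20)` while needing fewer verify rounds than generated tokens, which is the property the "no degradation in output quality" claim rests on. Sampling-based decoding uses a probabilistic accept/reject rule instead of exact matching; that variant is not shown here.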

Implications

  • Local inference thread: The Apple Silicon numbers are directly relevant for on-device agent deployments: a ~2.2x speedup on M-series hardware at batch sizes of 4–8 meaningfully changes what’s feasible without cloud round-trips.
  • Speculative decoding maturing: MTP drafters require no fine-tuning of the target model and share its KV cache, making this an increasingly practical drop-in acceleration layer rather than a research technique.
  • Open-weight competitive pressure: Shipping verified speedup numbers with open-weight Gemma 4 raises the bar for comparable Meta Llama and Mistral deployments, and pressures closed-API vendors to match latency SLAs.
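To put speedup figures like those above in context, a back-of-envelope estimate can help. The sketch below uses the standard idealized model from the speculative decoding literature, which assumes each drafted token is accepted independently with probability `alpha`. The acceptance rate `alpha`, the draft length `k`, and the relative drafter cost `draft_cost` are all illustrative assumptions; none of these values come from Google's post, and real speedups also depend on hardware and batch size.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass with k drafted tokens.

    Assumes each draft is accepted independently with probability alpha.
    """
    if alpha >= 1.0:
        return float(k + 1)  # every draft accepted, plus one bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Idealized speedup over plain autoregressive decoding.

    draft_cost is the cost of one drafter step relative to one target pass
    (an assumed value; 0.0 would mean a free drafter).
    """
    round_cost = k * draft_cost + 1.0  # k drafter steps plus one target pass
    return expected_tokens_per_round(alpha, k) / round_cost
```

For example, `estimated_speedup(alpha, k, draft_cost)` shows how quickly the benefit shrinks as the acceptance rate drops or the drafter gets more expensive, which is why a lightweight, well-aligned drafter matters more than a long draft length.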
