Accelerating Gemma 4: faster inference with multi-token prediction drafters
protocols, models, infrastructure
Source: Google
Date: 2026-05-05
URL: https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/
Summary
Google released multi-token prediction (MTP) drafter models for Gemma 4 that deliver up to a 3x inference speedup with no degradation in output quality. The technique pairs a lightweight drafter with the Gemma 4 31B target model using speculative decoding: the drafter proposes several tokens at once, and the target model verifies them in a single forward pass. This sidesteps the memory-bandwidth bottleneck of standard autoregressive generation, where every generated token requires its own pass over the model weights. Measured gains include ~2x on NVIDIA RTX PRO 6000 for the 26B model and ~2.2x on Apple Silicon at batch sizes of 4–8.
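To make the draft-then-verify loop concrete, here is a minimal sketch of greedy speculative decoding in plain Python. It is an illustration under stated assumptions, not Google's implementation. The names `speculative_decode`, `toy_target`, and `toy_drafter` are hypothetical, and the "models" are toy integer functions. The drafter here proposes tokens one at a time, whereas an MTP drafter emits them together. Verification is an explicit loop, whereas a real target model would score all drafted positions in one batched forward pass. The acceptance rule is exact greedy matching, not the probabilistic rule used with sampling.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy "model": context -> next token


def greedy_decode(target: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, drafter: NextToken, prompt: List[Token],
                       max_new: int, k: int = 4) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (tokens, verification rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new:
        # 1. The cheap drafter proposes k tokens.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(drafter(out + draft))

        # 2. The target checks each drafted position. In a real system this is
        #    a single forward pass over all k + 1 positions; the loop below is
        #    only for readability.
        rounds += 1
        accepted: List[Token] = []
        for i in range(k):
            expected = target(out + draft[:i])
            if expected != draft[i]:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
            accepted.append(draft[i])
        else:
            accepted.append(target(out + draft))  # all drafts accepted: bonus token
        out.extend(accepted)
    return out[:len(prompt) + max_new], rounds


# Toy models (assumptions for illustration only).
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] * 3 + 1) % 17


def toy_drafter(ctx: List[Token]) -> Token:
    guess = toy_target(ctx)
    # Deliberately wrong whenever the context length is a multiple of 3.
    return guess if len(ctx) % 3 else (guess + 1) % 17


tokens, rounds = speculative_decode(toy_target, toy_drafter, [1], max_new=20, k=4)
print(tokens == greedy_decode(toy_target, [1], 20), rounds)
```

Because every emitted token is either a draft the target agreed with or the target's own choice, the output matches plain greedy decoding exactly. The gain comes from needing fewer target passes than generated tokens, which is why the acceptance rate of the drafter largely determines the realized speedup.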
Implications
- Local inference thread: The Apple Silicon numbers are directly relevant for on-device agent deployments. A ~2.2x speedup on M-series hardware at batch 4–8 meaningfully changes what’s feasible without cloud round-trips.
- Speculative decoding maturing: MTP drafters require no fine-tuning of the target model and share its KV cache, making this an increasingly practical drop-in acceleration layer rather than a research technique.
- Open-weight competitive pressure: Shipping verified speedup numbers with open-weight Gemma 4 raises the bar for comparable Meta Llama and Mistral deployments, and pressures closed-API vendors to match latency SLAs.