2026-06-09 · Google

Introducing Gemma 4 12B: a unified, encoder-free multimodal model

protocols · models

Source: DeepMind · Date: 2026-06-09 · URL: https://deepmind.google/blog/introducing-gemma-4-12b-a-unified-encoder-free-multimodal-model/

Summary

Google’s Gemma 4 12B (Apache 2.0, released June 3) is an encoder-free multimodal model. The vision encoder is replaced by a lightweight embedding module (a single matrix multiplication plus positional embedding and normalizations), and the audio encoder is removed entirely: raw audio is projected directly into the same dimensional space as text tokens, making it “the first mid-sized model to feature native audio inputs.” The model runs locally on consumer laptops with 16GB of RAM, VRAM, or unified memory, delivers performance “nearing our larger 26B MoE model on standard benchmarks” at “less than half the total memory footprint,” and ships Multi-Token Prediction (MTP) drafters for lower latency. The post also references separate Gemma 4 QAT (quantization-aware training) variants.
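The encoder-free front end described above can be illustrated with a toy NumPy sketch. This is not Gemma's implementation: the patch size, audio frame length, model width, RMS-style normalization, and the function names `embed_image` and `embed_audio` are all assumptions chosen for illustration. The point is only that each modality reaches the shared token space through a single projection rather than a dedicated encoder stack.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Scale each vector to unit root-mean-square (no learned gain, for brevity)."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def embed_image(image, w_patch, pos_emb, patch=16):
    """Cut an (H, W, C) image into flat patches and project them with one matmul."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    patches = (
        image[: gh * patch, : gw * patch]
        .reshape(gh, patch, gw, patch, c)
        .transpose(0, 2, 1, 3, 4)            # group pixels by patch
        .reshape(gh * gw, patch * patch * c)
    )
    tokens = rms_norm(patches) @ w_patch     # the single matrix multiplication
    return rms_norm(tokens + pos_emb[: gh * gw])

def embed_audio(samples, w_audio, frame=320):
    """Chop a raw waveform into fixed frames and project each to model width."""
    n = len(samples) // frame
    frames = samples[: n * frame].reshape(n, frame)
    return rms_norm(frames @ w_audio)

rng = np.random.default_rng(0)
d_model = 64                                  # toy width, assumed for the sketch
w_patch = rng.normal(size=(16 * 16 * 3, d_model)) * 0.02
pos_emb = rng.normal(size=(256, d_model)) * 0.02
w_audio = rng.normal(size=(320, d_model)) * 0.02

image_tokens = embed_image(rng.random((64, 48, 3)), w_patch, pos_emb)
audio_tokens = embed_audio(rng.normal(size=16000), w_audio)
text_tokens = rng.normal(size=(5, d_model))   # stand-in for text embeddings

# All three modalities share one width, so they concatenate into one sequence.
sequence = np.concatenate([text_tokens, image_tokens, audio_tokens])
print(sequence.shape)
```

In this toy setup the 64×48 image yields 12 patch tokens and the 16,000-sample clip yields 50 frame tokens, all at the same width as the text embeddings, which is what lets a single transformer consume them without per-modality encoders.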

Implications

Feeds the local-model / open-weight ecosystem thread, directly in a tracked family (Gemma is active-use).

Hardware fit changes for the two Apple-Silicon profiles. A 16GB footprint sits comfortably inside the M3 Max 36GB (~21–24GB budget) and M2 Max 32GB (~19–22GB budget) — GPU-resident, leaving room for a small fleet. On the 12GB-VRAM consumer-NVIDIA profile the 16GB floor forces CPU/GPU offload, though encoder-free design + MTP soften the latency cost. New strong multimodal default for the mid-tier local boxes.
Native audio in a 12B is the architectural beat. Dropping the audio encoder and projecting raw signal into the token space removes a whole subsystem and its memory — the same “collapse the pipeline into the model” move seen in the frontier (orchestration into the weights), here applied to multimodal encoders.
Counterweight to the open-frontier-narrowing read. Lands in the same window as a closed, capability-gated frontier launch and a new open coding model: the open-weight layer is widening, not contracting.
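The hardware-fit arithmetic in the first implication can be checked with a short sketch. It takes the note's working figures at face value: a 16GB footprint and the stated budget ranges for the two Apple-Silicon profiles. Treating the full 12GB of VRAM as the NVIDIA budget is an assumption, since the note gives no range for that profile, and the `Profile` and `fit` names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Profile:
    name: str
    budget_low_gb: float
    budget_high_gb: float

MODEL_FOOTPRINT_GB = 16.0  # the note's working figure, not a measured value

PROFILES = [
    Profile("M3 Max 36GB", 21, 24),
    Profile("M2 Max 32GB", 19, 22),
    Profile("NVIDIA 12GB VRAM", 12, 12),  # assumption: all VRAM is usable
]

def fit(profile, footprint_gb=MODEL_FOOTPRINT_GB):
    """Return headroom range and how much must be offloaded to fit, in GB."""
    low = profile.budget_low_gb - footprint_gb
    high = profile.budget_high_gb - footprint_gb
    return {
        "headroom_gb": (low, high),
        "gpu_resident": low >= 0,
        "offload_gb": max(0.0, footprint_gb - profile.budget_high_gb),
    }

for p in PROFILES:
    r = fit(p)
    print(f"{p.name}: headroom {r['headroom_gb']}, "
          f"resident={r['gpu_resident']}, offload={r['offload_gb']} GB")
```

Under these figures the M3 Max profile keeps roughly 5 to 8GB of headroom and the M2 Max roughly 3 to 6GB, while the 12GB card comes up about 4GB short, which is the gap that forces CPU/GPU offload.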
