2025-12-16 · Google

Gemma Scope 2: helping the AI safety community deepen understanding of complex language model behavior

Source: DeepMind
Date: 2025-12-16
URL: https://deepmind.google/blog/gemma-scope-2-helping-the-ai-safety-community-deepen-understanding-of-complex-language-model-behavior/

Summary

Google DeepMind released Gemma Scope 2, which it describes as the largest open-source release of interpretability tools by an AI lab. It covers the full Gemma 3 family (270M to 27B parameters) with sparse autoencoders (SAEs) and transcoders trained on every layer. Producing it involved storing roughly 110 petabytes of data and training more than 1 trillion total parameters of interpretability tooling. The release includes skip-transcoders and cross-layer transcoders for analyzing multi-step computation, plus Matryoshka training for detecting internal concepts. The tools can be explored interactively via Neuronpedia.
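For readers unfamiliar with the terms, here is a minimal sketch of how a sparse autoencoder differs from a transcoder. It is a generic toy, not the architecture or API used in the release: the class name `SparseDictionary`, the plain ReLU activation, the random untrained weights, and the linear skip path are all assumptions made for illustration. The point is only the difference in training target. An SAE reconstructs the activation it reads, while a transcoder predicts a layer's output from its input, and a skip-transcoder adds a direct linear path alongside the sparse features.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def add(a, b):
    """Element-wise sum of two vectors."""
    return [p + q for p, q in zip(a, b)]


def relu(v):
    """Zero out negative entries; this is what makes the features sparse."""
    return [max(0.0, t) for t in v]


def random_matrix(rows, cols, rng, scale=0.5):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def mse(a, b):
    """Mean squared error between a prediction and its target."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def l0(features):
    """Number of active (non-zero) features."""
    return sum(1 for t in features if t > 0)


class SparseDictionary:
    """Toy encoder/decoder pair: d_model -> n_features -> d_model (untrained)."""

    def __init__(self, d_model, n_features, seed=0, skip=False):
        rng = random.Random(seed)
        self.W_enc = random_matrix(n_features, d_model, rng)
        self.b_enc = [0.0] * n_features
        self.W_dec = random_matrix(d_model, n_features, rng)
        self.b_dec = [0.0] * d_model
        # Assumed parameterisation of a skip-transcoder: an extra linear map
        # straight from input to output, bypassing the sparse features.
        self.W_skip = random_matrix(d_model, d_model, rng) if skip else None

    def encode(self, x):
        return relu(add(matvec(self.W_enc, x), self.b_enc))

    def decode(self, features, x):
        out = add(matvec(self.W_dec, features), self.b_dec)
        if self.W_skip is not None:
            out = add(out, matvec(self.W_skip, x))
        return out

    def forward(self, x):
        features = self.encode(x)
        return features, self.decode(features, x)


x = [0.3, -1.2, 0.8, 0.1]   # stand-in for an activation entering a layer
y = [0.5, 0.0, -0.4, 0.9]   # stand-in for that layer's output

# SAE: the training target is the input itself.
sae = SparseDictionary(d_model=4, n_features=16, seed=0)
features, x_hat = sae.forward(x)
sae_loss = mse(x_hat, x)

# Skip-transcoder: the training target is the layer's output.
tc = SparseDictionary(d_model=4, n_features=16, seed=1, skip=True)
tc_features, y_hat = tc.forward(x)
tc_loss = mse(y_hat, y)

print("active SAE features:", l0(features), "of", len(features))
print("SAE loss:", sae_loss, "transcoder loss:", tc_loss)
```

Training would minimize these losses plus a sparsity penalty on the features; that step is omitted here, so the printed losses are meaningless beyond showing which target each variant is scored against.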

Implications

Competing with Anthropic’s interpretability program on open-source terms. Anthropic has published significant mechanistic interpretability research (superposition, circuits) but mostly keeps its tools internal. Billing Gemma Scope 2 as the “largest open-source release of interpretability tools by an AI lab” is direct competitive positioning: Google is trying to become the interpretability infrastructure layer for the research community.

SAEs covering every layer of a 27B model are genuinely new. Prior open-source interpretability work covered specific layers or smaller models. Full-layer coverage of a 27B model, as part of a release totaling more than 1 trillion trained parameters of interpretability tooling, is unprecedented in scale. If the research community adopts it, Google will influence what interpretability looks like as a field.

Jailbreaks and hallucinations as investigation targets. The explicit mention of jailbreak and refusal-mechanism research enabled by Gemma Scope 2 is strategically important: it is safety research infrastructure that could reveal why models misbehave. But it could also reveal how to circumvent safeguards. That’s the dual-use edge of interpretability tooling.

Watch:

Research output from the safety community using Gemma Scope 2: do we learn something genuinely new about model behavior?
Anthropic’s response: does its interpretability team open up more tooling in light of the Gemma Scope 2 release?
Verification of the cancer therapy pathway result (from the 27B model): that’s the most provocative application claim.
