2025-02-20 · Anthropic

Insights on Crosscoder Model Diffing



Source: Anthropic Research · Date: 2025-02-20 · URL: https://www.anthropic.com/research/crosscoder-model-diffing

Summary

Preliminary work from Anthropic’s interpretability team on “crosscoder model diffing”: a technique for comparing the internal representations of different model versions or configurations using crosscoders, that is, sparse autoencoders trained across models. The full technical content is at transformer-circuits.pub (access restricted). It is characterized as developing work, closer to lab meeting notes than to a mature paper.
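To make the idea concrete, here is a minimal sketch of what a cross-model sparse autoencoder could look like. It assumes a single shared latent code computed from both models' activations, with separate encoder and decoder weights per model. The function names, the weight layout, and the toy numbers are all hypothetical; the restricted write-up may set things up differently.

```python
def encode(acts_a, acts_b, enc_a, enc_b, bias):
    """Shared latent code: ReLU over the summed per-model encoder projections.

    enc_a[k] and enc_b[k] are latent k's (hypothetical) encoder vectors
    for model A and model B.
    """
    latents = []
    for k in range(len(bias)):
        pre = bias[k]
        pre += sum(w * x for w, x in zip(enc_a[k], acts_a))
        pre += sum(w * x for w, x in zip(enc_b[k], acts_b))
        latents.append(max(pre, 0.0))  # ReLU keeps the code non-negative
    return latents


def decode(latents, dec):
    """Reconstruct ONE model's activations from the shared latents.

    dec[k] is latent k's decoder vector for that model.
    """
    dim = len(dec[0])
    return [
        sum(latents[k] * dec[k][d] for k in range(len(latents)))
        for d in range(dim)
    ]


# Toy example: two latents, two-dimensional activations per model.
enc_a = [[1.0, 0.0], [0.0, 1.0]]
enc_b = [[1.0, 0.0], [0.0, -1.0]]
bias = [0.0, 0.0]
dec_a = [[2.0, 0.0], [0.0, 1.0]]

latents = encode([0.5, 2.0], [0.5, 3.0], enc_a, enc_b, bias)
recon_a = decode(latents, dec_a)
print(latents, recon_a)
```

Because the latent code is shared, each latent has one decoder vector per model, and comparing those vectors is what makes a diff possible.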

Implications

This thread of interpretability tooling addresses a key practical problem: when you update a model, which features changed, and how? Crosscoder model diffing would give you a diff of the representation space, letting you trace how a safety fine-tune, capability update, or RLHF pass altered what the model represents. It is the interpretability equivalent of git diff for model internals. If it works, it becomes essential infrastructure for understanding what fine-tuning and RLHF do to a model’s internal world model. Watch for a full paper.
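The git diff analogy can be sketched in a few lines. One way to read a diff off a trained crosscoder is to compare, for each latent, the norm of its decoder vector in model A with the norm in model B. That approach is an assumption here, not something this note confirms, and the 0.3 / 0.7 thresholds, function names, and toy weights below are purely illustrative.

```python
import math


def l2_norm(vec):
    """Euclidean length of a decoder vector."""
    return math.sqrt(sum(v * v for v in vec))


def diff_latents(dec_a, dec_b, low=0.3, high=0.7):
    """Label each latent by how its decoder norm splits between two models.

    dec_a[k] / dec_b[k] are latent k's decoder vectors for model A / model B.
    The thresholds are arbitrary illustrative cut-offs, not published values.
    """
    report = {}
    for k, (vec_a, vec_b) in enumerate(zip(dec_a, dec_b)):
        norm_a, norm_b = l2_norm(vec_a), l2_norm(vec_b)
        if norm_a + norm_b == 0.0:
            report[k] = "dead"  # latent writes to neither model
            continue
        share_b = norm_b / (norm_a + norm_b)  # 0 = only A, 1 = only B
        if share_b < low:
            report[k] = "mostly model A"
        elif share_b > high:
            report[k] = "mostly model B"
        else:
            report[k] = "shared"
    return report


# Toy decoder weights for three latents (hypothetical numbers).
dec_a = [[1.0, 0.0], [0.6, 0.8], [0.0, 0.05]]
dec_b = [[0.0, 0.1], [0.8, 0.6], [3.0, 4.0]]

report = diff_latents(dec_a, dec_b)
for latent, label in report.items():
    print(latent, label)
```

Read like a diff, "mostly model A" latents resemble removed lines, "mostly model B" latents resemble added lines, and "shared" latents are the unchanged context. A shared latent can still point in different directions in the two models, so a fuller comparison would also look at the angle between its two decoder vectors.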
