Insights on Crosscoder Model Diffing
Source: Anthropic Research
Date: 2025-02-20
URL: https://www.anthropic.com/research/crosscoder-model-diffing
Summary
Preliminary work from Anthropic’s interpretability team on “crosscoder model diffing”, a technique that uses cross-model sparse autoencoders to compare internal representations across different model versions or configurations. The full technical content is at transformer-circuits.pub (access restricted). The work is characterized as developing research, closer to lab meeting notes than to a mature paper.
Implications
This work belongs to the interpretability tooling thread and addresses a key practical problem: when a model is updated, which features changed, and how? Crosscoder model diffing would produce a diff of the representation space, making it possible to trace how a safety fine-tune, capability update, or RLHF pass altered what the model represents. In effect, it is the interpretability equivalent of git diff for model internals. If it works, it becomes essential infrastructure for understanding what fine-tuning and RLHF do to a model’s internal world model. Watch for a full paper.
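To make the "git diff" analogy concrete, here is a minimal sketch of how such a diff might be read off. It is not taken from the source, whose technical details are access restricted. It assumes a crosscoder learns one shared set of latents with a separate decoder vector per model, and that a latent whose decoder vector is near zero for one model is effectively specific to the other. The function names, the toy decoder values, and the 0.1 / 0.9 thresholds are all hypothetical.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a plain list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def relative_decoder_norms(dec_a, dec_b, eps=1e-12):
    """Return one score per latent in [0, 1].

    dec_a[i] and dec_b[i] are the (assumed) decoder vectors of latent i
    for model A and model B. A score near 0 means the latent is written
    almost only into model A; near 1, almost only into model B.
    """
    scores = []
    for vec_a, vec_b in zip(dec_a, dec_b):
        norm_a, norm_b = l2_norm(vec_a), l2_norm(vec_b)
        scores.append(norm_b / (norm_a + norm_b + eps))
    return scores


def classify_latents(dec_a, dec_b, low=0.1, high=0.9):
    """Label each latent as 'a_only', 'b_only', or 'shared'.

    The thresholds are illustrative assumptions, not values from the source.
    """
    labels = []
    for score in relative_decoder_norms(dec_a, dec_b):
        if score < low:
            labels.append("a_only")
        elif score > high:
            labels.append("b_only")
        else:
            labels.append("shared")
    return labels


if __name__ == "__main__":
    # Toy decoders: 3 latents, 2-dimensional activations (made-up numbers).
    base_decoder = [[1.0, 0.0], [1.0, 0.0], [0.0, 0.0]]
    tuned_decoder = [[1.0, 0.0], [0.0, 0.0], [0.0, 2.0]]
    print(classify_latents(base_decoder, tuned_decoder))
```

In this toy setup, the "b_only" latents would be the candidates for features introduced by the update, and the "a_only" latents the candidates for features it removed. A real analysis would also need to check whether "shared" latents point in the same direction in both models, which this sketch does not do.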