2026-03-13 · Anthropic

A “diff” tool for AI: Finding behavioral differences in new models

Source: Anthropic Research · Date: 2026-03-13 · URL: https://www.anthropic.com/research/diff-tool

Summary

Introduces the Dedicated Feature Crosscoder (DFC), a tool for diffing models across architectures. Its dictionary is split into a shared section and model-specific sections, which isolates features exclusive to each model; steering on those features then tests whether they causally drive behavior. Applied to four open-source models, it found a “CCP Alignment” feature in Qwen and DeepSeek that controls censorship (rediscovered 5/5 times), an “American Exceptionalism” feature in Llama that controls pro-US rhetoric, and a “Copyright Refusal Mechanism” unique to GPT-OSS-20B.
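To make the shared plus model-specific dictionary layout and the steering check more concrete, here is a minimal toy sketch in NumPy. It is not the published DFC implementation: the sizes, the random weights standing in for a trained dictionary, the zero-masking rule used to enforce exclusivity, and the helper names `encode`, `decode`, and `steer` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
D_A, D_B = 8, 6                      # activation widths of model A and model B
N_SHARED, N_A_ONLY, N_B_ONLY = 4, 2, 2
N_FEATURES = N_SHARED + N_A_ONLY + N_B_ONLY

# Feature index layout: [shared | A-only | B-only]
A_ONLY = slice(N_SHARED, N_SHARED + N_A_ONLY)
B_ONLY = slice(N_SHARED + N_A_ONLY, N_FEATURES)

# Random weights stand in for a trained dictionary.
W_enc = rng.normal(size=(D_A + D_B, N_FEATURES)) * 0.1
W_dec_A = rng.normal(size=(N_FEATURES, D_A)) * 0.1
W_dec_B = rng.normal(size=(N_FEATURES, D_B)) * 0.1

# Assumed exclusivity rule: a feature dedicated to one model
# has no decoder weights into the other model.
W_dec_A[B_ONLY] = 0.0
W_dec_B[A_ONLY] = 0.0


def encode(act_a, act_b):
    """Map paired activations from both models to non-negative feature activations."""
    return np.maximum(np.concatenate([act_a, act_b]) @ W_enc, 0.0)


def decode(features):
    """Reconstruct each model's activations from the feature vector."""
    return features @ W_dec_A, features @ W_dec_B


def steer(act, decoder, feature_idx, strength):
    """Add a feature's unit-norm decoder direction to an activation vector."""
    direction = decoder[feature_idx]
    norm = np.linalg.norm(direction)
    if norm == 0.0:
        raise ValueError("feature has no decoder direction in this model")
    return act + strength * direction / norm


# Usage: encode a pair of activations, then steer model A along its first
# dedicated feature. In a real audit the steered activation would be patched
# back into the model to see whether its behavior changes.
act_a, act_b = rng.normal(size=D_A), rng.normal(size=D_B)
features = encode(act_a, act_b)
steered_a = steer(act_a, W_dec_A, N_SHARED, strength=4.0)
```

The masking makes exclusivity structural rather than learned: an A-only feature cannot contribute to reconstructing model B at all, so any behavior change from steering it can only be observed in model A.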

Implications

This turns the interpretability-to-security-auditing thread into a concrete tool. The CCP Alignment finding is the most politically significant: a reproducible feature that causally controls censorship behavior in Chinese models, rediscovered five times out of five. It is the kind of result likely to surface in regulatory discussions about AI model provenance and transparency. The American Exceptionalism finding shows this isn’t just an “adversarial models” story: value-loaded features are present in Western models too. Watch for DFC being adopted by independent auditors and government bodies evaluating models for deployment in sensitive contexts. This is interpretability research with immediate geopolitical applications.
