2024-10-16 · Anthropic

Using dictionary learning features as classifiers

research

Source: Anthropic Research · Date: 2024-10-16 · URL: https://www.anthropic.com/research/features-as-classifiers

Summary

Preliminary work (shared as lab meeting notes, not a mature paper) exploring whether dictionary learning features discovered through sparse autoencoders (SAEs) can function directly as classifiers for model behavior. Full technical details are at transformer-circuits.pub. The work investigates the practical utility of mechanistic interpretability features beyond labeling internal representations.

Implications

A short but important step in the interpretability → tooling thread: if SAE features work as classifiers, they become a lightweight probe layer for behavioral monitoring without needing separate classifier training. This is the “features as infrastructure” direction: the same sparse decomposition that explains what a model is doing also tells you what category of task it is on. It is early stage, but watch for this becoming part of Claude’s real-time safety monitoring stack if it holds up at scale.
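
To make the "feature as classifier" idea concrete, here is a minimal sketch, not the method from the source notes. It assumes a standard ReLU SAE encoder, max-pooling of one feature's activation over the tokens of a prompt, and a fixed threshold. The weights, the feature index, the threshold, and the function names (`sae_encode`, `feature_classifier`) are all hypothetical toy values chosen for illustration.

```python
import numpy as np

def sae_encode(activations, W_enc, b_enc):
    """Assumed ReLU SAE encoder: features = max(0, x @ W_enc + b_enc)."""
    return np.maximum(0.0, activations @ W_enc + b_enc)

def feature_classifier(activations, W_enc, b_enc, feature_idx, threshold):
    """Flag a prompt if one chosen feature exceeds `threshold` on any token.

    activations: array of shape (n_tokens, d_model) for a single prompt.
    Max-pooling over tokens is an assumption of this sketch.
    """
    feats = sae_encode(activations, W_enc, b_enc)   # (n_tokens, n_features)
    score = float(feats[:, feature_idx].max())      # strongest activation in the prompt
    return score > threshold, score

# Toy encoder (hypothetical): d_model = 2, three features.
W_enc = np.array([[1.0, 0.0, -1.0],
                  [0.0, 1.0,  1.0]])
b_enc = np.array([0.0, -0.5, 0.0])

# Two toy "prompts", each a (tokens x d_model) activation matrix.
prompt_a = np.array([[0.1, 0.2], [0.0, 2.0]])
prompt_b = np.array([[1.0, 0.1], [0.5, 0.3]])

flag_a, score_a = feature_classifier(prompt_a, W_enc, b_enc, feature_idx=1, threshold=1.0)
flag_b, score_b = feature_classifier(prompt_b, W_enc, b_enc, feature_idx=1, threshold=1.0)
print(flag_a, score_a)
print(flag_b, score_b)
```

In this toy setup no classifier is trained at all: the only choices are which feature to read and where to set the threshold. A lightly trained variant (for example, a linear model over a handful of features) would be a different design with its own training cost.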
