Interpretability Research
Source: Anthropic Research
Date: 2025-03-27
URL: https://www.anthropic.com/research/team/interpretability
Summary
Team overview page for Anthropic’s Interpretability research group. Mission: discover how large language models work internally as a foundation for AI safety. Current methods: circuit tracing, sparse autoencoders, persona vector extraction, introspection analysis. Multidisciplinary team spanning ML, physics, mathematics, astronomy, biology, and data visualization. This is a team landing page, not a specific research post.
Implications
This team is the source of the entire mechanistic interpretability program at Anthropic. The multidisciplinary composition is a deliberate strategy: interpretability requires both deep ML knowledge and the ability to build conceptual frameworks for understanding complex systems, which draws on physics and biology as much as computer science. The “watch Claude think” framing from circuit tracing is Anthropic’s headline claim for public communication. Listing introspection analysis alongside circuit tracing signals that the team is actively exploring whether the tools it uses to understand models from the outside can be turned inward. Track this team’s output as the primary signal for where mechanistic interpretability is heading.