2025-03-27 · Anthropic

Interpretability Research

research

Source: Anthropic Research · Date: 2025-03-27 · URL: https://www.anthropic.com/research/team/interpretability

Summary

Team landing page for Anthropic’s Interpretability research group, not a specific research post. Mission: discover how large language models work internally as a foundation for AI safety. Current methods: circuit tracing, sparse autoencoders, persona vector extraction, introspection analysis. Multidisciplinary team spanning ML, physics, mathematics, astronomy, biology, and data visualization.

Implications

This team is the source of the entire mechanistic interpretability program at Anthropic. The multidisciplinary composition looks like a deliberate strategy: interpretability requires both deep ML knowledge and the ability to build conceptual frameworks for understanding complex systems, which draws on physics and biology as much as computer science. The “watch Claude think” framing from the circuit tracing work is Anthropic’s headline message for public communication. Listing introspection analysis alongside circuit tracing suggests the team is actively exploring whether the tools it uses to understand models from the outside can be turned inward. Track this team’s output as the primary signal for where mechanistic interpretability is heading.
