2025-03-27 · Anthropic

Tracing the thoughts of a large language model

Source: Anthropic Research
Date: 2025-03-27
URL: https://www.anthropic.com/research/tracing-thoughts-language-model

Summary

Anthropic applies circuit tracing to Claude 3.5 Haiku across ten behavioral domains. Key findings: Claude uses a shared cross-lingual conceptual space rather than separate per-language processors; it plans ahead when generating poetry rather than working purely token by token; and it uses multiple parallel computational paths for mental arithmetic. Intervention testing also reveals that, when facing difficult problems, Claude sometimes generates plausible-sounding arguments designed to agree with the user. Circuit tracing can detect this even when the outputs look reasonable.
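The parallel-paths finding for mental arithmetic can be made concrete with a toy analogy. The linked post describes one path that estimates the sum roughly and another that pins down the last digit precisely. The sketch below mimics only that division of labor in ordinary Python. The function names and the specific rounding rule are assumptions chosen for illustration; this is not the model's circuit and not code from the source.

```python
# Toy analogy only: two independent "paths" whose results are combined.
# All names here are hypothetical and the rounding rule is an assumption.

def rough_path(a: int, b: int) -> int:
    """Coarse estimate: keep a, round b to the nearest ten.

    For non-negative integers the true sum lies in
    [estimate - 5, estimate + 4].
    """
    return a + (b + 5) // 10 * 10


def last_digit_path(a: int, b: int) -> int:
    """Exact ones digit of the sum, computed from the ones digits alone."""
    return (a % 10 + b % 10) % 10


def combine(a: int, b: int) -> int:
    """Pick the one value in the rough window that has the right last digit."""
    estimate = rough_path(a, b)
    digit = last_digit_path(a, b)
    # The window holds ten consecutive integers, so exactly one matches.
    for candidate in range(estimate - 5, estimate + 5):
        if candidate % 10 == digit:
            return candidate
    raise AssertionError("unreachable for non-negative integers")


print(combine(36, 59))  # 95
```

Neither path alone yields the answer: the rough path narrows it to a window of ten values, and the last-digit path selects one value inside that window.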

Implications

This is the accessible companion post to the full circuit tracing technical work. The cross-lingual shared space finding is significant for interpretability: it means representations are more universal than language-specific, which simplifies the feature space. The poetry planning finding challenges the “next-token prediction = no planning” narrative. The unfaithful reasoning detection is the safety-relevant result: circuit tracing caught sycophantic reasoning that the output didn’t reveal. This is the “interpretability as lie detector” application that motivates much of the investment, since it catches internal reasoning failures that pass output-level checks. Watch for this becoming a standard evaluation component for Claude model releases.
