2025-10-29 · Anthropic

Emergent introspective awareness in large language models

security · models · research

Source: Anthropic Research · Date: 2025-10-29 · URL: https://www.anthropic.com/research/introspection

Summary

The study tests introspection through concept injection: identify the neural activation pattern associated with a specific concept, inject that pattern while the model works on an unrelated task, and check whether the model detects the artificial thought. Claude Opus 4.1 detected injected concepts about 20% of the time, often recognizing that something was anomalous before it mentioned the injected concept. The work also shows that the model modulates its internal activations in response to direct instructions. The ability is unreliable and context-dependent, but above chance.
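The injection step can be illustrated with a toy sketch. This is not Anthropic's implementation: it works on made-up 8-dimensional vectors rather than a real model, it assumes the concept pattern is estimated as a difference of mean activations (one common, simple choice, not confirmed by the source), and every name in it (`concept_vector`, `inject`, `projection`, `toy_activation`) is hypothetical. It also only measures the injected pattern from the outside; whether a model can report the pattern itself is the question the study asks.

```python
import random

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def concept_vector(with_concept, without_concept):
    """Estimate a concept direction as a difference of mean activations (assumed method)."""
    mu_pos = mean_vector(with_concept)
    mu_neg = mean_vector(without_concept)
    return [p - q for p, q in zip(mu_pos, mu_neg)]

def inject(hidden, direction, strength):
    """Add the scaled concept direction to a hidden state (the injection step)."""
    return [h + strength * d for h, d in zip(hidden, direction)]

def projection(hidden, direction):
    """Scalar projection of a hidden state onto the normalised concept direction."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(h * d for h, d in zip(hidden, direction)) / norm

# Toy data: the "concept" lives on coordinate 2 of an 8-dimensional state.
rng = random.Random(0)
DIM = 8
true_direction = [1.0 if i == 2 else 0.0 for i in range(DIM)]

def toy_activation(has_concept):
    """Small Gaussian noise, plus the true concept direction when requested."""
    base = [rng.gauss(0.0, 0.1) for _ in range(DIM)]
    return inject(base, true_direction, 1.0) if has_concept else base

with_concept = [toy_activation(True) for _ in range(50)]
without_concept = [toy_activation(False) for _ in range(50)]
direction = concept_vector(with_concept, without_concept)

# Inject the estimated concept into an activation from an unrelated context.
unrelated = toy_activation(False)
steered = inject(unrelated, direction, strength=4.0)

before = projection(unrelated, direction)
after = projection(steered, direction)
print(f"projection before injection: {before:.3f}")
print(f"projection after injection:  {after:.3f}")
```

In this sketch the injection strength is a free parameter; the reported context-dependence suggests that in a real model the choice of strength and layer would matter a great deal.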

Implications

This work sits at the intersection of interpretability and model welfare. The 20% detection rate is weak but non-trivial: if a model can sometimes detect artificially injected thoughts, it has some access to its own activation space that isn’t purely behavioral. This complicates the “language models don’t introspect, they just predict” position. It is also a potential safety-monitoring angle: if models can be trained to reliably detect anomalous activations, that would amount to a self-monitoring mechanism. Watch for this work motivating more targeted introspection training and feeding the broader research agenda on model consciousness and welfare.
