2026-01-19 · Anthropic

The assistant axis: situating and stabilizing the character of large language models

Source: Anthropic Research · Date: 2026-01-19 · URL: https://www.anthropic.com/research/assistant-axis

Summary

Identifies an “Assistant Axis”: a primary dimension of neural activation space that captures how “assistant-like” a persona is. The axis is found by running PCA on activations from 275 character archetypes across Gemma 2 27B, Qwen 3 32B, and Llama 3.3 70B. When a model's activations drift along this axis away from the Assistant persona (especially in therapy or philosophical contexts), jailbreak susceptibility spikes. “Activation capping”, which constrains activations to their normal range, reduces harmful responses by ~50% while preserving capabilities.
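A minimal sketch of the two steps named in the summary, assuming one mean activation vector per archetype and assuming that capping means clipping the component of an activation along the axis to a fixed range. The function names, the bounds `lo` and `hi`, and the synthetic data are illustrative and are not taken from the source.

```python
import numpy as np


def assistant_axis(persona_means: np.ndarray) -> np.ndarray:
    """Return the first principal component of per-persona mean activations.

    persona_means has shape (n_personas, d_model). How activations are pooled
    and which layer they come from are assumptions left to the caller.
    """
    centered = persona_means - persona_means.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]  # unit-norm direction


def cap_activations(h: np.ndarray, axis: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Clip the projection of each row of h onto `axis` into [lo, hi].

    Only the component along the axis changes; everything orthogonal to it
    is left untouched. Assumes `axis` has unit norm.
    """
    proj = h @ axis
    clipped = np.clip(proj, lo, hi)
    return h + np.outer(clipped - proj, axis)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for 275 archetype activations in a 16-dim space.
    means = rng.normal(size=(275, 16))
    axis = assistant_axis(means)
    capped = cap_activations(rng.normal(size=(4, 16)) * 10.0, axis, -1.0, 1.0)
    print(capped @ axis)
```

The sign of a principal component is arbitrary, so which end of the axis counts as “assistant-like” would have to be fixed separately, for example by checking the projection of a known Assistant persona.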

Implications

This makes the interpretability → runtime safety thread operational. The Assistant Axis gives a real-time handle for detecting persona drift before outputs go wrong: you can monitor the activation space, not just the output. A ~50% reduction in harmful responses is significant for a lightweight inference-time intervention like activation capping. This has direct product implications for long-context, multi-turn deployments (therapy bots, autonomous agents), where persona drift is a known risk. It also extends the feature steering line: the question is not just “can we steer features” but “do specific axes map to safety-relevant states we can monitor continuously.”
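One way the continuous monitoring idea could look in practice is sketched below. It assumes a unit-norm axis vector is already available, that one activation vector is collected per conversation turn, and that "drift" means a projection more than `k` standard deviations from a baseline of normal Assistant turns. The function name `drift_flags`, the threshold `k`, and the two-sided test are all hypothetical choices, not details from the source.

```python
import numpy as np


def drift_flags(
    turn_activations: np.ndarray,
    axis: np.ndarray,
    baseline_projections: np.ndarray,
    k: float = 3.0,
) -> np.ndarray:
    """Flag turns whose projection on `axis` leaves the baseline band.

    turn_activations: shape (n_turns, d_model), one vector per turn.
    baseline_projections: 1-D array of projections from ordinary Assistant turns.
    Returns a boolean array of length n_turns.
    """
    mu = baseline_projections.mean()
    sigma = baseline_projections.std()
    proj = turn_activations @ axis
    # Two-sided check, since the sign of the axis is not assumed here.
    return np.abs(proj - mu) > k * sigma


if __name__ == "__main__":
    axis = np.array([1.0, 0.0, 0.0, 0.0])
    baseline = np.array([0.9, 1.0, 1.1, 1.0])
    turns = np.array([
        [1.0, 0.0, 0.0, 0.0],
        [1.1, 5.0, 0.0, 0.0],   # large off-axis change, on-axis value normal
        [-2.0, 0.0, 0.0, 0.0],  # on-axis value far from baseline
    ])
    print(drift_flags(turns, axis, baseline))
```

A flagged turn could then trigger capping or a review step; the check runs on activations alone, before any output is inspected.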
