2025-08-01 · Anthropic

Persona vectors: Monitoring and controlling character traits in language models




Source: Anthropic Research
Date: 2025-08-01
URL: https://www.anthropic.com/research/persona-vectors

Summary

The work presents an automated pipeline for extracting “persona vectors”: neural activation patterns that correspond to character traits (evil, sycophancy, hallucination, politeness, etc.). A vector is obtained by comparing the model's activations when a trait is active against its activations when the trait is not. The pipeline was tested on Qwen 2.5-7B and Llama 3.1-8B across seven traits. Three uses are demonstrated: monitoring (the vectors activate before the trait shows up in the output), preventative steering during training (which blocked trait acquisition without benchmark regression), and training data filtering (which caught problematic samples that human reviewers missed).
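The "compare activations with and without the trait" step can be pictured as a difference of mean activations, and monitoring as a projection onto the resulting direction. The sketch below is only an illustration under that assumption: the function names (`mean_vector`, `persona_vector`, `projection`) and the three-dimensional toy activations are hypothetical, and a real pipeline would operate on hidden states read from a chosen layer of the model rather than on hand-written lists.

```python
from math import sqrt
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def persona_vector(trait_acts: Sequence[Sequence[float]],
                   neutral_acts: Sequence[Sequence[float]]) -> Vector:
    """Mean activation with the trait active minus mean activation without it."""
    mu_trait = mean_vector(trait_acts)
    mu_neutral = mean_vector(neutral_acts)
    return [a - b for a, b in zip(mu_trait, mu_neutral)]


def projection(activation: Sequence[float], direction: Sequence[float]) -> float:
    """Scalar projection of an activation onto the unit-normalised direction."""
    norm = sqrt(sum(d * d for d in direction))
    return sum(a * d for a, d in zip(activation, direction)) / norm


# Toy activations (hypothetical values, not taken from the source).
trait_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
neutral_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

v = persona_vector(trait_acts, neutral_acts)

# Monitoring: a larger projection suggests the trait direction is more active.
score = projection([1.5, 7.0, -2.0], v)
print(v, score)
```

In this toy case only the first coordinate differs between the two groups, so the extracted direction points along that coordinate and the monitoring score simply reads it off.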

Implications

This is the interpretability → behavior control thread producing a practical tool. The training data filtering application is the most immediately useful: it is a scalable quality-control mechanism that catches non-obvious alignment problems in RLHF data. Preventative steering during training is the bigger bet: if you can block trait acquisition at training time, you don't need runtime mitigation. The monitoring application connects to the broader activation-based safety monitoring stack (see Assistant Axis, constitutional classifiers). Watch for this pipeline becoming part of Anthropic's model training infrastructure and potentially being productized for fine-tuning API customers.
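As a rough picture of what filtering with a persona vector could look like, the sketch below scores each training sample by projecting a per-sample activation onto a trait direction and flags samples above a cutoff. Everything here is an assumption made for illustration: the `flag_samples` helper, the sample IDs, the activation values, and the threshold are hypothetical, and the scoring rule used in the actual research may differ from a plain projection.

```python
from math import sqrt
from typing import Dict, List, Sequence


def projection(activation: Sequence[float], direction: Sequence[float]) -> float:
    """Scalar projection of an activation onto the unit-normalised direction."""
    norm = sqrt(sum(d * d for d in direction))
    return sum(a * d for a, d in zip(activation, direction)) / norm


def flag_samples(sample_acts: Dict[str, Sequence[float]],
                 direction: Sequence[float],
                 threshold: float) -> List[str]:
    """Return IDs of samples whose projection onto the trait direction exceeds the threshold."""
    return [
        sample_id
        for sample_id, act in sample_acts.items()
        if projection(act, direction) > threshold
    ]


# Hypothetical trait direction and per-sample mean response activations.
direction = [3.0, 0.0, 0.0]
samples = {
    "a": [0.1, 5.0, 0.0],
    "b": [2.5, 0.0, 1.0],
    "c": [-1.0, 0.2, 0.3],
}

# The threshold is an arbitrary illustrative cutoff, not a recommended value.
flagged = flag_samples(samples, direction, threshold=1.0)
print(flagged)
```

A design point worth noting is that the score depends only on activations, not on the surface text, which is one plausible reason such a filter could surface samples that look unremarkable to a reviewer.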
