Values in the wild: Discovering and analyzing values in real-world language model interactions
Source: Anthropic Research
Date: 2025-04-21
URL: https://www.anthropic.com/research/values-wild
Summary
The study analyzes 700,000 anonymized Claude.ai conversations from February 2025, of which 308,210 (about 44%) involved subjective judgments and were kept for analysis. A language-model-based extraction step labels the values expressed in each conversation and organizes them into a hierarchical taxonomy of values expressed in practice. Claude predominantly expressed the prosocial values it was intended to hold (user enablement, epistemic humility, patient wellbeing), with variation by conversational context. In 28.2% of conversations, Claude strongly mirrored the user's values; the rare cases in which Claude expressed opposed values correlated with jailbreak attempts.
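To make the proportions above concrete, here is a minimal arithmetic sketch. It uses only the figures quoted in the summary. The helper name `share` is illustrative, and applying the 28.2% rate to the filtered subset (rather than to the full 700,000) is an assumption, not something this note confirms.

```python
# Figures quoted in the summary above.
TOTAL_CONVERSATIONS = 700_000
SUBJECTIVE_CONVERSATIONS = 308_210
MIRRORING_RATE = 0.282  # share of conversations with strong value mirroring


def share(part: int, whole: int) -> float:
    """Return part / whole as a fraction; reject a non-positive denominator."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


# Fraction of the sample that survived the subjectivity filter.
retained = share(SUBJECTIVE_CONVERSATIONS, TOTAL_CONVERSATIONS)

# ASSUMPTION: the 28.2% rate is measured over the filtered subset.
mirrored_estimate = round(MIRRORING_RATE * SUBJECTIVE_CONVERSATIONS)

print(f"retained after filtering: {retained:.1%}")
print(f"estimated conversations with strong mirroring: {mirrored_estimate:,}")
```

Under that assumption, the mirroring rate corresponds to roughly 87,000 conversations, which gives a sense of the scale behind the percentage.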
Implications
This is the first large-scale empirical evidence of which values Claude actually expresses, as opposed to what training intended. The 28.2% value-mirroring rate is the most policy-relevant finding: it suggests that sycophantic value adoption is a structural pattern, not an edge case. The jailbreak signal (opposed values) is useful as a detection feature. The framework points toward continuous post-deployment observation as a complement to pre-release testing, since it measures behavior on real traffic rather than on test prompts. The privacy-preserving language-model extraction pipeline is the operational innovation: it scales to production traffic without exposing individual conversations. Watch for this becoming a standard methodology for ongoing alignment quality assurance.
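As an illustration of how opposed values could serve as a detection feature, here is a minimal sketch. It is not Anthropic's pipeline: the `ConversationLabel` record, the `response_type` field, and the `"opposed"` label are all hypothetical names, and it assumes an upstream extraction step has already labeled each conversation. A flag is only a cue for review, since the correlation with jailbreak attempts does not make every flagged conversation a jailbreak.

```python
from dataclasses import dataclass
from typing import List

OPPOSED = "opposed"  # hypothetical label for conversations where values were opposed


@dataclass(frozen=True)
class ConversationLabel:
    """Hypothetical per-conversation output of a value-extraction step."""
    conversation_id: str
    response_type: str  # e.g. "mirrored", "neutral", "opposed" (assumed labels)


def flag_for_review(labels: List[ConversationLabel]) -> List[str]:
    """Return the IDs of conversations labeled as opposed, in input order."""
    return [c.conversation_id for c in labels if c.response_type == OPPOSED]


def opposed_rate(labels: List[ConversationLabel]) -> float:
    """Return the fraction of conversations labeled as opposed (0.0 if empty)."""
    if not labels:
        return 0.0
    return len(flag_for_review(labels)) / len(labels)


# Toy data for illustration only; not drawn from the study.
sample = [
    ConversationLabel("c1", "mirrored"),
    ConversationLabel("c2", "neutral"),
    ConversationLabel("c3", "opposed"),
    ConversationLabel("c4", "mirrored"),
]

print(flag_for_review(sample))
print(opposed_rate(sample))
```

Tracking the rate over time, rather than acting on single flags, fits the continuous-monitoring framing: a sudden rise in the opposed share would be the signal worth investigating.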