Evaluating feature steering: A case study in mitigating social biases
Source: Anthropic Research
Date: 2024-10-25
URL: https://www.anthropic.com/research/evaluating-feature-steering
Summary
Tests feature steering using 29 interpretable features from Claude 3 Sonnet’s residual stream, evaluating the effects on the MMLU, PubMedQA, and BBQ bias benchmarks. Finds a “steering sweet spot” at steering factors between -5 and 5, where capabilities aren’t degraded. A “Multiple perspectives” feature reduced bias across all nine BBQ dimensions. Key complication: steering had unexpected off-target effects. For example, steering gender bias features also shifted age bias scores.
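To make the idea of a "steering factor" concrete, here is a toy sketch in plain Python. It assumes steering means adding a scaled unit-norm feature direction to every position of a residual stream. That is a common simplification and not necessarily the procedure used in the article. The names `steer_residual`, `projection`, and `scale` are hypothetical, and the vectors are made-up illustration values, not real model activations.

```python
import math


def _unit(vec):
    """Return vec scaled to unit length (assumes vec is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("feature direction must be non-zero")
    return [x / norm for x in vec]


def steer_residual(resid, feature_dir, factor, scale=1.0):
    """Add factor * scale * unit(feature_dir) to each position's vector.

    resid:       list of per-position residual vectors (toy stand-in)
    feature_dir: hypothetical feature direction, e.g. a decoder vector
    factor:      steering factor (the summary reports -5 to 5 as safe)
    scale:       assumed normalisation constant; 1.0 here for simplicity
    """
    unit = _unit(feature_dir)
    return [[r + factor * scale * u for r, u in zip(row, unit)]
            for row in resid]


def projection(vec, feature_dir):
    """Component of vec along the unit feature direction."""
    return sum(v * u for v, u in zip(vec, _unit(feature_dir)))


# Made-up 3-dimensional "residual stream" with two positions.
resid = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]]
feature_dir = [3.0, 0.0, 4.0]

for factor in (-5, 0, 5):
    steered = steer_residual(resid, feature_dir, factor)
    shifts = [projection(s, feature_dir) - projection(r, feature_dir)
              for s, r in zip(steered, resid)]
    print(factor, [round(x, 6) for x in shifts])
```

In this toy setup the shift along the feature direction equals the steering factor at every position, and a factor of 0 leaves the residual stream unchanged. It does not model the capability degradation or off-target effects reported in the article, which can only be measured by running real benchmarks on a real model.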
Implications
Concrete progress in the interpretability → control thread: dictionary learning features aren’t just descriptive, they’re steerable. But the off-target effects are the key finding: features influence more than their activation contexts suggest, which means surgical steering is harder than it looks and broad interventions may be needed. This is both a capability and a safety concern, since the same mechanism that lets you reduce bias can introduce unexpected side effects. Feeds directly into the ongoing question of whether sparse autoencoders give you genuine causal handles or just correlated proxies.