2024-10-25 · Anthropic

Evaluating feature steering: A case study in mitigating social biases

Source: Anthropic Research · Date: 2024-10-25 · URL: https://www.anthropic.com/research/evaluating-feature-steering

Summary

Tests feature steering with 29 interpretable features taken from Claude 3 Sonnet’s residual stream. Capabilities are measured on MMLU and PubMedQA, and social bias on BBQ. Finds a “steering sweet spot” at steering factors between -5 and 5, where capabilities aren’t degraded. A “Multiple perspectives” feature reduced bias across all nine BBQ dimensions. Key complication: unexpected off-target effects. For example, steering on gender bias features also shifted age bias scores.
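To make the idea of a steering factor concrete, here is a minimal toy sketch in plain Python. It assumes steering is done by adding a scaled, unit-normalized feature direction to a single residual-stream vector; the post's exact procedure may differ (for instance, it may clamp the feature's activation instead). The names steer, projection, and max_activation are hypothetical and the vectors are made up for illustration. The second direction only shows, geometrically, how moving along one direction can also move a readout along another direction that is not orthogonal to it. It is not a claim about why the reported off-target effects occur.

```python
from typing import List


def steer(residual: List[float], direction: List[float],
          factor: float, max_activation: float = 1.0) -> List[float]:
    """Add a scaled feature direction to one residual-stream vector.

    Assumption: steering = residual + factor * max_activation * unit(direction).
    This is an illustrative additive scheme, not a confirmed implementation.
    """
    norm = sum(d * d for d in direction) ** 0.5
    if norm == 0:
        raise ValueError("feature direction must be non-zero")
    scale = factor * max_activation / norm
    return [r + scale * d for r, d in zip(residual, direction)]


def projection(vec: List[float], direction: List[float]) -> float:
    """Scalar projection of vec onto direction (a crude 'readout')."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(v * d for v, d in zip(vec, direction)) / norm


# Made-up toy vectors: target_dir is the feature being steered,
# other_dir is a different feature that overlaps with it.
residual = [1.0, 0.0, 2.0]
target_dir = [2.0, 0.0, 0.0]
other_dir = [1.0, 1.0, 0.0]

# Sweep the range the summary calls the "sweet spot" (-5 to 5).
for factor in (-5.0, 0.0, 5.0):
    steered = steer(residual, target_dir, factor)
    print(factor,
          projection(steered, target_dir),   # on-target readout
          projection(steered, other_dir))    # off-target readout moves too
```

In this toy setup the off-target readout changes whenever the two directions overlap, which is one simple way to see why steering a single feature need not be surgical.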

Implications

Concrete progress in the interpretability → control thread: dictionary learning features aren’t just descriptive, they’re steerable. But the off-target effects are the key finding. Features influence more than their activation contexts suggest, so surgical steering is harder than it looks and broad interventions may be needed. This matters for both capabilities and safety: the same mechanism that lets you reduce bias can introduce unexpected side effects. Feeds directly into the ongoing question of whether sparse autoencoders give you genuine causal handles or just correlated proxies.
