Forecasting rare language model behaviors
Source: Anthropic Research
Date: 2025-02-25
URL: https://www.anthropic.com/research/forecasting-rare-behaviors
Summary
Harmful response rates for language models follow a power law in the number of queries, so risk measured on a small evaluation can be extrapolated to larger query volumes. Extrapolating from ~900 to ~90,000 queries, forecasts of dangerous-information risk fell within one order of magnitude of the observed value 86% of the time. For misaligned behavior detection, the forecasts had 2.5x lower error than the baseline. For automated red-teaming, the method identified the optimal resource allocation 79% of the time.
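As a rough illustration of the extrapolation idea, the sketch below fits a power law to risk measured at small query counts and projects it to a larger count. It is a minimal sketch under stated assumptions, not the method from the Anthropic post: the function names (`fit_power_law`, `forecast`, `within_one_oom`) are hypothetical, the fit is an ordinary least-squares line in log-log space, and the input numbers are synthetic, generated from a made-up power law purely for demonstration.

```python
import math


def fit_power_law(ns, risks):
    """Least-squares fit of log(risk) = log(c) + k * log(n); returns (c, k)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(r) for r in risks]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx                      # slope in log-log space = exponent
    c = math.exp(mean_y - k * mean_x)  # intercept in log-log space = log(c)
    return c, k


def forecast(c, k, n):
    """Risk predicted by the fitted power law at query count n."""
    return c * n ** k


def within_one_oom(predicted, actual):
    """True if the prediction is within a factor of 10 of the actual value."""
    return abs(math.log10(predicted / actual)) <= 1.0


# ASSUMPTION: synthetic data from risk = 1e-6 * n**0.5, not real measurements.
small_ns = [100, 300, 900]
small_risks = [1e-6 * n ** 0.5 for n in small_ns]

c, k = fit_power_law(small_ns, small_risks)
predicted = forecast(c, k, 90_000)
actual = 1e-6 * 90_000 ** 0.5  # known here only because the data is synthetic

print(f"fitted exponent k = {k:.3f}")
print(f"forecast at 90,000 queries = {predicted:.2e}")
print(f"within one order of magnitude: {within_one_oom(predicted, actual)}")
```

Because the synthetic data lies exactly on a power law, the fit recovers it almost perfectly; real measurements would be noisy, which is why a within-one-order-of-magnitude check is a natural accuracy criterion for this kind of forecast.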
Implications
This work addresses the rarity problem in evaluation methodology: how do you assess, during pre-deployment testing, the risk from behaviors that appear only once in a million queries? The proposed answer is power law extrapolation from small evaluations. The 86% within-one-order-of-magnitude result is reasonable accuracy for pre-deployment risk forecasting: not precise enough for hard safety claims, but useful for resource allocation decisions. This feeds directly into Anthropic’s Responsible Scaling Policy: capability thresholds require quantitative risk estimates, and this method offers a way to get those estimates from tractable evaluation sets. Watch for this becoming a standard pre-deployment evaluation tool.
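To make the rarity problem concrete, the sketch below computes how likely a once-in-a-million behavior is to show up at least once at different query volumes. It rests on a simplifying assumption that is not from the source: every query is treated as independent and equally likely to trigger the behavior. The function name `prob_at_least_one` is hypothetical.

```python
import math


def prob_at_least_one(rate, n_queries):
    """P(at least one occurrence in n_queries).

    ASSUMPTION: queries are independent and each triggers the behavior
    with the same probability `rate`.
    """
    # 1 - (1 - rate)**n_queries, computed stably for very small rates.
    return -math.expm1(n_queries * math.log1p(-rate))


rate = 1e-6  # "once in a million queries"
p_small = prob_at_least_one(rate, 900)
p_mid = prob_at_least_one(rate, 90_000)
p_large = prob_at_least_one(rate, 1_000_000)

for n, p in [(900, p_small), (90_000, p_mid), (1_000_000, p_large)]:
    print(f"{n:>9,} queries: P(at least one occurrence) = {p:.4f}")
```

Under this assumption, a 900-query evaluation has roughly a 0.09% chance of surfacing the behavior even once, while a million deployed queries surface it about 63% of the time. Direct observation at evaluation scale is therefore close to uninformative, which is the gap extrapolation is meant to fill.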