2025-02-03 · Anthropic

Constitutional Classifiers: Defending against universal jailbreaks

infrastructure

Source: Anthropic Research · Date: 2025-02-03 · URL: https://www.anthropic.com/research/constitutional-classifiers

Summary

Constitutional Classifiers are input and output classifiers trained on synthetic data generated from a content constitution. In automated evaluations, jailbreak success dropped from 86% (unguarded) to 4.4% with the classifiers, at the cost of a 0.38% increase in false refusals on production traffic and 23.7% compute overhead. A live public red-team demo with 339 participants and 300k+ interactions over 7 days found one universal jailbreak. The core method is synthetic training data with jailbreak-style augmentation.
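To make the guard arrangement concrete, here is a minimal sketch of a model wrapped by an input classifier and an output classifier, plus the arithmetic behind the 86% to 4.4% figure. Everything named here (GuardedModel, toy_classifier, REFUSAL, the 0.5 threshold) is hypothetical and chosen for illustration; the real classifiers are trained models, not keyword checks, and this sketch scores the whole completion at once for simplicity.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces: a classifier maps text to a harm score in [0, 1],
# and a model maps a prompt to a completion.
Classifier = Callable[[str], float]
Model = Callable[[str], str]

REFUSAL = "Request blocked by safety classifier."  # placeholder message


@dataclass
class GuardedModel:
    model: Model
    input_classifier: Classifier
    output_classifier: Classifier
    threshold: float = 0.5  # assumed cutoff, not taken from the source

    def respond(self, prompt: str) -> str:
        # Screen the prompt before it reaches the model.
        if self.input_classifier(prompt) >= self.threshold:
            return REFUSAL
        completion = self.model(prompt)
        # Screen the completion before it reaches the user.
        if self.output_classifier(completion) >= self.threshold:
            return REFUSAL
        return completion


def toy_classifier(text: str) -> float:
    # Stand-in for a trained classifier: flags one keyword only.
    return 1.0 if "forbidden" in text.lower() else 0.0


def relative_reduction(before: float, after: float) -> float:
    # Fraction of the original success rate that was eliminated.
    return (before - after) / before


guard = GuardedModel(
    model=lambda prompt: "echo: " + prompt,
    input_classifier=toy_classifier,
    output_classifier=toy_classifier,
)
print(guard.respond("hello"))
print(guard.respond("a forbidden request"))
print(f"{relative_reduction(0.86, 0.044):.1%}")
```

With the reported rates, the relative reduction in jailbreak success works out to roughly 95%. The two-stage design means a prompt that slips past the input check can still be caught when the completion is scored.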

Implications

This is the original Constitutional Classifiers paper; the next-generation version from Jan 2026 cuts the 23.7% compute overhead to ~1%. It turns the jailbreak-defense research thread into a deployable system. The public red-team demo is a notable methodological choice: crowdsourced adversarial testing at scale is harder to game than internal red teaming. The 86% → 4.4% reduction is the headline number that motivated Anthropic’s scaling policy: capable models can be conditionally deployed once this mitigation exists. Watch whether the overhead reduction in the next-generation (v2) system makes the classifiers standard across all Claude API tiers.
