2026-01-09 · Anthropic

Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks

ecosystem

read at source ↗ www.anthropic.com

Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks

Source: Anthropic Research Date: 2026-01-09 URL: https://www.anthropic.com/research/next-generation-constitutional-classifiers

Summary

Two-stage cascade classifier: a lightweight probe reads internal neural activations to pre-screen conversations, escalating only suspicious ones to a full ensemble classifier. Result: 87% drop in false refusal rate (down to 0.05% on harmless queries) with only ~1% computational overhead vs. 23.7% previously. 1,700+ red-team hours found one high-risk vulnerability, no universal jailbreak.

Implications

This is the Constitutional AI / jailbreak defense thread operationalized at production scale. The key move is using interpretability (internal activations) to make safety cheaper — same compute budget, much lower false positive rate. The 1% overhead number matters for deployment economics. This likely becomes the default safety layer for Claude API and operator deployments. The activation-based probe approach also represents an architectural bet: safety via internals, not just output filtering, which is harder to jailbreak but requires the interpretability research to be solid.

← all signals