2024-08-08 · Anthropic

Expanding our model safety bug bounty program

ecosystem

Source: Anthropic · Date: 2024-08-08 · URL: https://www.anthropic.com/news/model-safety-bug-bounty

Summary

Anthropic expanded its bug bounty program to focus on universal jailbreaks: attacks that consistently bypass safety guardrails across multiple harmful domains. Scope: CBRN and cybersecurity; the program tests a next-generation safety mitigation system not yet publicly deployed. Rewards: up to $15,000 for novel universal jailbreaks. Access: invite-only via HackerOne at launch, with broader expansion planned. The program is aligned with the White House Voluntary AI Commitments and the G7 Hiroshima Process Code of Conduct.

Implications

Safety/operations: adversarial red-teaming at scale. Offering external researchers up to $15,000 for a novel universal jailbreak signals how much Anthropic values finding universal attack vectors before deployment. Because the program tests an unreleased safety system, the bug bounty is part of pre-deployment validation, not post-deployment incident response.
CBRN + cybersecurity as the critical scope. Scoping to two of the highest-stakes harm domains (chemical/biological/radiological/nuclear weapons and cyberattacks) signals that these are the categories where a universal jailbreak would be most consequential. CBRN is also the domain that drove the ASL-3 activation (May 2025).
Universal jailbreak definition matters. A “universal” jailbreak, one that consistently succeeds across multiple harmful questions, is harder to find and more valuable than a narrow one. The definition singles out the specific vulnerability class Anthropic was most worried about.
Watch: whether the expanded program produced significant jailbreak disclosures; how the next-generation safety system under test was eventually deployed; whether the $15K maximum reward remained competitive as the adversarial research field matured.
