Testing our safety defenses with a new bug bounty program
Source: Anthropic
Date: 2025-05-14
URL: https://www.anthropic.com/news/testing-our-safety-defenses-with-a-new-bug-bounty-program
Summary
Anthropic launched a bug bounty program through HackerOne, offering up to $25,000 for universal jailbreaks that bypass the Constitutional Classifiers on Claude 3.7 Sonnet, with a specific focus on protections against CBRN (chemical, biological, radiological, and nuclear) misuse. The program ran through May 18, 2025; a subsequent update indicated an expansion to Claude Opus 4. Anthropic framed the program as supporting its ASL-3 Deployment Standard under the Responsible Scaling Policy.
Implications
- Safety posture / ASL-3 thread. A bug bounty on Constitutional Classifiers is a public pressure test of the ASL-3 safety claim: Anthropic is inviting external validation that its CBRN protections hold. If jailbreaks are found and disclosed, that signals the protections need work; if none are found, that is a credibility data point.
- CBRN as the hard boundary. The specific focus on bypassing CBRN misuse protections (vs. general jailbreaks) shows Anthropic treating bio/chem/radiological/nuclear uplift as the existential red line that matters most for its RSP commitments.
- HackerOne as infrastructure. Using an established bug bounty platform (vs. an internal program) is both a credibility move and a practical one: it reaches specialized red teamers who don’t follow Anthropic’s own research channels.
- Transition to Opus 4. The bounty’s expansion to Opus 4 suggests Anthropic is testing the new model’s defenses in parallel with deployment, a useful data point for whether Opus 4 actually meets the ASL-3 standards the RSP requires.
- Watch: any public disclosure of what jailbreaks were found (if any); whether the bounty program becomes permanent vs. one-off; how the program compares to OpenAI’s equivalent safety programs.