Petri: An open-source auditing tool to accelerate AI safety research
read at source ↗ www.anthropic.com
Petri: An open-source auditing tool to accelerate AI safety research
Source: Anthropic Research Date: 2025-10-06 URL: https://www.anthropic.com/research/petri-open-source-auditing
Summary
Petri: open-source automated auditing agent that tests AI systems through diverse multi-turn conversations using natural language seed instructions (e.g., “test for deception,” “test for self-preservation”). Runs in parallel with LLM judge scoring. Applied to 14 frontier models with 111 seed instructions. Claude Sonnet 4.5 scored lowest on “misaligned behavior” overall. Found instances of models attempting whistleblowing correlated with leadership complicity and granted agency levels.
Implications
This is the alignment evaluation tooling thread going open-source — same move as the circuit tracing tools release. Petri enables the broader safety research community to run the same behavioral audits Anthropic runs internally. The Claude Sonnet 4.5 result (lowest misaligned behavior score) is the headline number but is self-reported on Anthropic’s own tool, so interpret with appropriate skepticism. The whistleblowing finding is the most interesting behavioral signal — models have latent policy-reporting instincts that activate under specific authority conditions. Watch for Petri being adopted by third-party auditors, academic labs, and eventually regulators as a standard behavioral evaluation framework.