2025-09-12 · Anthropic

Strengthening our safeguards through collaboration with US CAISI and UK AISI

security · models

Source: Anthropic · Date: 2025-09-12 · URL: https://www.anthropic.com/news/strengthening-our-safeguards-through-collaboration-with-us-caisi-and-uk-aisi

Summary

Anthropic collaborated with the US Center for AI Standards and Innovation (CAISI) and the UK AI Security Institute (AISI) to red-team its Constitutional Classifiers on Claude Opus 4 and 4.1. The vulnerabilities found included prompt injection via false annotations, universal jailbreaks, cipher and obfuscation techniques, fragmented harmful strings, and automated attack-refinement systems. In response, Anthropic patched the specific vulnerabilities and restructured its safeguard architecture to address the underlying vulnerability classes.

Implications

Safety/operations: government red-teaming made public. Publishing the specific vulnerability classes found by government red-teamers is unusually transparent. It tells adversaries what was found and patched, but it also demonstrates that the safeguards were tested and improved. The trade-off is worth it for credibility.
Constitutional Classifier restructuring. “Fundamentally restructuring safeguard architecture to address underlying vulnerability classes” (not just patching specific exploits) is the significant outcome: it suggests the government teams found systemic weaknesses, not just edge cases.
CAISI + AISI as dual jurisdiction. The simultaneous collaboration with US and UK government institutes reflects the bilateral AI safety relationship that grew out of the UK AI Safety Summit (2023). Together, the two institutes make up much of the Western government AI safety infrastructure.
Prompt injection via false annotations. The specific vulnerability, using false meta-information to bypass classifiers, is the most technically interesting finding. It exploits the classifier’s trust in how context is framed, which is hard to address without redesigning how context is processed.
Watch: how the restructured safeguard architecture performs in subsequent red-teaming; whether US CAISI becomes a standard pre-deployment evaluation partner; how the research community uses the published vulnerability classes.
