2025-09-17 · Anthropic

A postmortem of three recent issues

models · infrastructure

Source: Anthropic Engineering
Date: 2025-09-17
URL: https://www.anthropic.com/engineering/a-postmortem-of-three-recent-issues

Summary

Anthropic’s postmortem covers three overlapping infrastructure bugs (August–September 2025) that degraded Claude’s response quality: a context window routing error, output corruption, and an XLA:TPU compiler miscompilation tied to a bf16/fp32 precision disagreement in distributed token selection. Because the symptoms overlapped, diagnosis was slow, and existing evals missed the degradation because “Claude often recovers well from isolated mistakes.” The fix switched token selection from approximate to exact top-k operations, accepting a performance trade-off.
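To make the precision disagreement concrete, here is a toy sketch in pure Python. It is not the XLA:TPU code path from the postmortem; the logit values, the `to_bf16` helper, and the "survivors" filter are all illustrative assumptions. It shows how two logits that are distinct in fp32 collapse to the same bf16 value, so a step that ranks in bf16 and a step that ranks in fp32 can disagree about the top token.

```python
import struct

def to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def to_bf16(x: float) -> float:
    """Round to the nearest bfloat16 value (round-to-nearest-even).

    bf16 keeps the top 16 bits of the fp32 bit pattern. NaN and infinity
    handling is omitted because this is only an illustration.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    rounded = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack(">f", struct.pack(">I", rounded & 0xFFFFFFFF))[0]

def argmax(values):
    """Index of the largest value; ties go to the lowest index."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical logits: tokens 0 and 1 are close but distinct in fp32.
raw_logits = [2.004, 2.006, 1.5]
fp32_logits = [to_fp32(v) for v in raw_logits]
bf16_logits = [to_bf16(v) for v in raw_logits]

top_fp32 = argmax(fp32_logits)  # token 1 wins in fp32
top_bf16 = argmax(bf16_logits)  # both round to 2.0, so the tie-break picks token 0

# A mixed-precision comparison: the maximum is computed in fp32, but the
# candidates are compared in bf16. No bf16 candidate reaches the fp32 maximum.
fp32_max = max(fp32_logits)
survivors = [i for i, v in enumerate(bf16_logits) if v >= fp32_max]

print(top_fp32, top_bf16, survivors)
```

Near 2.0 the spacing between adjacent bf16 values is 2^-6 (about 0.0156), so any two logits closer than that can become indistinguishable. Rounding alone never reverses an ordering, but it creates ties, and a comparison that mixes the two precisions can drop every candidate.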

Implications

The reliability and postmortem thread. This is Anthropic’s first public infrastructure postmortem, and its existence signals a shift toward operator-level transparency. The precision-mismatch root cause (bf16 model computation vs. fp32 TPU optimization) is an unusual failure mode that won’t generalize to most shops, but the detection gap is universally relevant: evals that test isolated correctness will miss degradation in models that can recover from their own mistakes.

Continuous quality monitoring. The stated fix is to run quality evals continuously on production systems rather than rely only on pre-deploy benchmarks. This is the same recommendation Anthropic gives customers in its eval-design posts, and Anthropic is now applying it to itself publicly.
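A minimal sketch of what a continuous production check could look like, assuming each sampled production response is graded pass/fail by some eval. The class name `RollingQualityMonitor`, the window size, and the `max_drop` threshold are hypothetical choices for illustration, not anything described in the postmortem.

```python
from collections import deque

class RollingQualityMonitor:
    """Tracks the pass rate of the most recent `window` graded production samples."""

    def __init__(self, baseline_pass_rate: float, window: int = 200, max_drop: float = 0.05):
        self.baseline = baseline_pass_rate  # pass rate measured before deployment
        self.max_drop = max_drop            # tolerated absolute drop below baseline
        self.results = deque(maxlen=window) # old results fall off automatically

    def record(self, passed: bool) -> None:
        """Store the outcome of one graded production sample."""
        self.results.append(bool(passed))

    def pass_rate(self):
        """Pass rate over the current window, or None if nothing is recorded."""
        if not self.results:
            return None
        return sum(self.results) / len(self.results)

    def degraded(self) -> bool:
        """True once a full window sits more than `max_drop` below baseline."""
        if len(self.results) < self.results.maxlen:
            return False  # not enough data to judge yet
        return self.baseline - self.pass_rate() > self.max_drop


monitor = RollingQualityMonitor(baseline_pass_rate=0.95, window=100, max_drop=0.05)
for outcome in [True] * 95 + [False] * 5:
    monitor.record(outcome)
healthy = monitor.degraded()        # window pass rate is 0.95, matching baseline

for outcome in [False] * 10:
    monitor.record(outcome)
after_failures = monitor.degraded() # window pass rate has fallen to 0.85
```

A fixed threshold on a rolling window is the simplest possible design. A real deployment would likely segment by model, platform, and request type, since the source notes that overlapping symptoms made diagnosis slow.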

User feedback as a signal. Thumbs-down buttons and bug reports surfaced degradation before internal monitoring did. For any shop running Claude in production, direct user feedback channels are a faster detection layer than automated evals alone.
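One way to turn thumbs-down counts into an alert is to compare the observed rate against a historical baseline. The sketch below uses a normal approximation to the binomial; the function names, the z-score threshold of 3, and the minimum sample size are assumptions chosen for illustration.

```python
import math

def thumbs_down_z_score(downvotes: int, total: int, baseline_rate: float) -> float:
    """Z-score of the observed thumbs-down rate against a baseline rate.

    Uses the normal approximation to the binomial, so it is only
    meaningful when `total` is reasonably large.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0.0 < baseline_rate < 1.0:
        raise ValueError("baseline_rate must be strictly between 0 and 1")
    observed_rate = downvotes / total
    std_err = math.sqrt(baseline_rate * (1.0 - baseline_rate) / total)
    return (observed_rate - baseline_rate) / std_err

def feedback_alert(downvotes: int, total: int, baseline_rate: float,
                   z_threshold: float = 3.0, min_samples: int = 100) -> bool:
    """True when the thumbs-down rate is significantly above baseline."""
    if total < min_samples:
        return False  # too few ratings to distinguish signal from noise
    return thumbs_down_z_score(downvotes, total, baseline_rate) >= z_threshold


# Hypothetical numbers: a 2% baseline thumbs-down rate, 1000 ratings per window.
spike = feedback_alert(downvotes=40, total=1000, baseline_rate=0.02)
normal = feedback_alert(downvotes=22, total=1000, baseline_rate=0.02)
```

With these numbers, doubling the thumbs-down rate from 2% to 4% gives a z-score of roughly 4.5 and triggers the alert, while 2.2% stays well within normal variation.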
