2026-01-21 · Anthropic

Designing AI-resistant technical evaluations




Source: Anthropic Engineering · Date: 2026-01-21 · URL: https://www.anthropic.com/engineering/AI-resistant-technical-evaluations

Summary

An Anthropic engineer documents how successive Claude releases (Opus 4, then Opus 4.5) defeated their performance engineering take-home test across three design iterations. The eventual fix was a pivot to a novel domain, Zachtronics-inspired puzzles built on a constrained instruction set, which sacrifices realism for AI resistance: “the replacement works because it simulates novel work.” Claude Opus 4.5’s best score on the original test was 1,487 cycles within the 2-hour limit.

Implications

The eval-reliability thread — hiring edition. The same adversarial dynamic affecting benchmark evals is now documented in hiring: model capability improvements invalidate assessment instruments faster than they can be redesigned. This is a structural problem, not a one-time calibration issue.

The “passive observer” trap. A colleague’s framing, that “raising bars to substantially outperform Claude Code risks making humans passive observers,” is the clearest articulation of the ceiling problem for AI-assisted technical assessments. It signals that hiring for performance engineering (and similar specialized technical roles) is being disrupted faster than conventional hiring practices can adapt.

Novel tasks as the remaining defense. The pivot to Zachtronics-style puzzles is a temporary solution: tasks drawn from genuinely novel domains resist AI because they fall outside the training distribution. But novelty is a depletable resource. As models improve at reasoning and constraint satisfaction, even exotic puzzle formats will lose their resistance.
