2026-03-24 · Anthropic

Harness design for long-running application development

pricingprotocolsagentsmodelscommentary

read at source ↗ www.anthropic.com

Harness design for long-running application development

Source: Anthropic Engineering Date: 2026-03-24 URL: https://www.anthropic.com/engineering/harness-design-long-running-apps

Summary

Anthropic describes a GAN-inspired three-agent harness (Planner, Generator, Evaluator) for long-running application development, where the Evaluator uses Playwright MCP to test functionality and provide structured critique. Context resets between sessions proved essential — Claude exhibited “context anxiety” under compaction that hurt output quality. A retro game maker built with this harness cost $200 over 6 hours versus $9/20 minutes for a single agent, with meaningfully better output quality.

Implications

The agent harness design thread. Separating the work-doing agent from the judging agent is now an explicit Anthropic recommendation, not just a community pattern. The sprint-contract mechanism (negotiate testable “done” criteria before implementation) is the most concrete design primitive here — it bridges LLM ambiguity and automated test verification.

Context management. The finding that context resets outperform compaction for long runs is directly relevant to any multi-session Claude Code workflow. This is Anthropic acknowledging a real failure mode in their own models (Sonnet 4.5 specifically), and partially crediting Opus 4.6 for reducing it — model version matters for harness design choices.

Cost calibration. The 20x cost multiplier for multi-agent vs. single-agent ($200 vs $9) with a qualitative improvement claim is a useful planning data point. Whether that delta is worth it depends entirely on whether quality matters more than throughput for a given task.

← all signals