Opus 4.8 scored 81 in my benchmark. I still wouldn't default to it. (The full breakdown + Nate's Community Slack)
Tags: pricing, models
Source: Nate’s Newsletter
Date: 2026-06-03
URL: https://natesnewsletter.substack.com/p/opus-48-benchmark-model-selection
Summary
Nate’s Newsletter runs Opus 4.8 through a personal benchmark suite (a strict average across several task types), where it scores 81, ahead of GPT-5.5 (71) and prior Claude releases. Despite the top score, the author argues against defaulting to it: GPT-5.5 beats it on visualization tasks, and maxing out Opus 4.8’s reasoning effort degrades performance on extended business-logic workflows. The piece proposes replacing the “pick the smartest model” heuristic with a multi-factor selection framework based on task type, duration, tool access, state preservation, and failure cost.
Implications
- Model landscape: Concrete evidence that the reasoning-dial tradeoff is real and task-dependent; raw benchmark rank is a weak signal for operational model selection.
- Agentic coding: Long-horizon agent tasks are specifically called out as a regime where higher-reasoning modes can regress. This matters when calibrating reasoning effort for any multi-step coding loop.
- Dev tooling: The selection-framework angle is practical: routing tasks to different models based on structured criteria is becoming standard practice rather than a niche optimization.
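
To make the routing idea concrete, here is a minimal sketch of a rule-based selector built on the five factors named in the summary (task type, duration, tool access, state preservation, failure cost). Only two rules come from the article as summarized: visualization goes to GPT-5.5, and extended business-logic workflows avoid Opus 4.8's maximum reasoning effort. Everything else is an assumption made for illustration: the model label strings, the `Effort` levels, the 30-minute threshold, the remaining rules, and the `cheaper-default-model` placeholder. None of it is the author's actual framework.

```python
from dataclasses import dataclass
from enum import Enum


class Effort(Enum):
    """Hypothetical reasoning-effort levels; real APIs may name these differently."""
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    MAX = "max"


@dataclass(frozen=True)
class Task:
    task_type: str         # e.g. "visualization", "business_logic", "coding"
    expected_minutes: int  # rough duration of the whole run
    needs_tools: bool      # does the task call external tools?
    needs_state: bool      # must state be preserved across steps?
    failure_cost: str      # "low" or "high"


@dataclass(frozen=True)
class Route:
    model: str
    effort: Effort
    reason: str


LONG_RUN_MINUTES = 30  # assumed cutoff for an "extended" workflow


def choose_route(task: Task) -> Route:
    """Pick a model and effort level from structured task criteria."""
    # From the article: GPT-5.5 beat Opus 4.8 on visualization tasks.
    if task.task_type == "visualization":
        return Route("gpt-5.5", Effort.MEDIUM, "stronger on visualization")

    long_running = task.expected_minutes >= LONG_RUN_MINUTES

    # From the article: max effort degraded extended business-logic workflows.
    if task.task_type == "business_logic" and long_running:
        return Route("opus-4.8", Effort.MEDIUM, "max effort regresses on long workflows")

    # Assumed rule: long tool-using or stateful runs get the top scorer, below max effort.
    if long_running and (task.needs_tools or task.needs_state):
        return Route("opus-4.8", Effort.HIGH, "long agentic run")

    # Assumed rule: short, high-stakes tasks can afford maximum effort.
    if task.failure_cost == "high":
        return Route("opus-4.8", Effort.MAX, "short task, high failure cost")

    # Assumed fallback: do not default to the top-scoring model.
    return Route("cheaper-default-model", Effort.LOW, "low-stakes default")


if __name__ == "__main__":
    demo = Task("business_logic", 120, needs_tools=True, needs_state=True, failure_cost="high")
    print(choose_route(demo))
```

The rules are ordered so that the two article-backed cases are checked first and the fallback is deliberately not the top-scoring model, mirroring the piece's argument that benchmark rank alone should not set the default.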