2026-06-03 · Nate's Newsletter

Opus 4.8 scored 81 in my benchmark. I still wouldn't default to it. (The full breakdown + Nate's Community Slack)

pricing · models

Source: Nate’s Newsletter
Date: 2026-06-03
URL: https://natesnewsletter.substack.com/p/opus-48-benchmark-model-selection

Summary

Nate’s Newsletter runs Opus 4.8 through a personal benchmark suite, scored as a strict average across several task types, where it scores 81, ahead of GPT-5.5 (71) and prior Claude releases. Despite the top score, the author argues against making it the default: GPT-5.5 beats it on visualization tasks, and setting Opus 4.8’s reasoning effort to maximum degrades performance on extended business-logic workflows. The piece proposes a multi-factor selection framework (task type, duration, tool access, state preservation, failure cost) in place of the “pick the smartest model” heuristic.

Implications

  • Model landscape: Concrete evidence that the reasoning-dial tradeoff is real and task-dependent; raw benchmark rank is a weak signal for operational model selection.
  • Agentic coding: Long-horizon agent tasks are specifically called out as a regime where higher-reasoning modes can regress. This matters when calibrating reasoning effort for any multi-step coding loop.
  • Dev tooling: The selection-framework angle is practical: practitioners routing tasks to different models based on structured criteria is becoming standard practice, not a niche optimization.
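
As a rough illustration of what routing on structured criteria could look like, here is a minimal Python sketch built on the five factors named in the summary. Everything in it is an assumption made for illustration rather than the author's actual framework: the `TaskProfile` and `Route` types, the `choose_route` function, the 30-minute cutoff, the effort labels, and the model strings (which are placeholder labels, not real API identifiers). The only rules taken from the summary are that visualization work goes to GPT-5.5 and that long agentic workflows avoid maximum reasoning effort.

```python
from dataclasses import dataclass

# Placeholder labels for illustration only, not real API model identifiers.
OPUS = "opus-4.8"
GPT = "gpt-5.5"

# Hypothetical cutoff for what counts as a long-running workflow.
LONG_RUNNING_MINUTES = 30


@dataclass(frozen=True)
class TaskProfile:
    task_type: str         # e.g. "visualization", "business_logic", "coding"
    expected_minutes: int  # rough duration of the whole workflow
    needs_tools: bool      # does the task call external tools?
    needs_state: bool      # must state persist across steps?
    failure_cost: str      # "low" or "high"


@dataclass(frozen=True)
class Route:
    model: str
    reasoning_effort: str  # hypothetical labels: "medium", "high", "max"


def choose_route(task: TaskProfile) -> Route:
    """Pick a model and reasoning effort from the task's profile."""
    # The summary reports GPT-5.5 ahead on visualization tasks.
    if task.task_type == "visualization":
        return Route(GPT, "medium")

    # Treat tool-using or stateful work past the cutoff as long-horizon.
    agentic = task.needs_tools or task.needs_state
    long_horizon = agentic and task.expected_minutes >= LONG_RUNNING_MINUTES

    if long_horizon:
        # The summary reports that maximum effort degrades extended
        # workflows, so this branch never returns "max".
        effort = "high" if task.failure_cost == "high" else "medium"
        return Route(OPUS, effort)

    # Short tasks: spend maximum effort only when failure is expensive.
    if task.failure_cost == "high":
        return Route(OPUS, "max")
    return Route(OPUS, "medium")


if __name__ == "__main__":
    workflow = TaskProfile("business_logic", 120, True, True, "high")
    print(choose_route(workflow))
```

The point of the sketch is the shape of the decision, not the specific thresholds: benchmark rank never appears as an input, and the reasoning-effort setting is chosen per task rather than pinned to its maximum.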
