2025-08-12 · HuggingFace

TextQuests: How Good are LLMs at Text-Based Video Games?

agents · models · research · infrastructure

Source: HuggingFace · Date: 2025-08-12 · URL: https://huggingface.co/blog/textquests

Summary

Research benchmark release: TextQuests evaluates LLMs as autonomous agents on 25 classic Infocom interactive fiction games. The games can take human players more than 30 hours and hundreds of precise actions to finish, and the context can exceed 100k tokens during play. Key failure modes: spatial reasoning errors (for example, failing to reverse a sequence of moves to return to an earlier location), hallucinations about prior actions as the context grows, repetitive action loops, and failure at maze navigation (all frontier models failed Zork I’s maze). Efficiency finding: more test-time compute improves performance, but with diminishing returns; dynamically allocating the reasoning budget is identified as a key open problem.
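To make the spatial reasoning failure concrete, here is a minimal Python sketch of what "reversing a navigation sequence" involves. It is an illustration only, not code from the TextQuests benchmark: the `OPPOSITE` table and the `reverse_route` helper are hypothetical names, and the sketch assumes every connection between rooms is symmetric, which may not hold inside a maze.

```python
# Hypothetical illustration: not taken from the TextQuests codebase.
# Opposites for the compass and vertical moves common in interactive fiction.
OPPOSITE = {
    "north": "south", "south": "north",
    "east": "west", "west": "east",
    "up": "down", "down": "up",
    "northeast": "southwest", "southwest": "northeast",
    "northwest": "southeast", "southeast": "northwest",
}


def reverse_route(moves):
    """Return the moves that retrace a route back to its start.

    Assumes every connection is symmetric (going north and then south
    returns you to the same room), which a game does not have to guarantee.
    """
    # Walk the route backwards and flip each direction.
    return [OPPOSITE[move] for move in reversed(moves)]


route = ["north", "east", "up"]
back = reverse_route(route)
print(back)  # ['down', 'west', 'south']
```

The procedure is two mechanical steps, reverse the order and flip each direction. That is why a model failing at it deep into a long session points to a problem with tracking state rather than with the difficulty of the task.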

Implications

Open-weights ecosystem health. The failure of every frontier model on the Zork I maze is a memorable finding: a feature of a game dating back to 1977 defeats modern LLMs that score well on ostensibly harder reasoning benchmarks. TextQuests is useful precisely because it exposes qualitative failure modes that MMLU and GPQA don’t. Open-weights models matching closed models on this benchmark would be a meaningful signal.

Model release cadence: agent thread. TextQuests’ contexts of 100k+ tokens and its 500-step limit test long-horizon agent behavior that existing benchmarks cover poorly. As agentic use cases become more prominent, benchmarks like TextQuests that stress-test exploration and long-context consistency will likely be cited more often in model release evaluations.
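The interaction between a step cap and a growing context can be sketched as a simple episode loop. This is a toy under stated assumptions, not the TextQuests harness: `ToyGame`, `scripted_policy`, and `run_episode` are hypothetical stand-ins, and the sketch assumes the full transcript is passed to the agent at every step, which is consistent with contexts growing past 100k tokens.

```python
# Hypothetical sketch: ToyGame and scripted_policy are stand-ins invented for
# illustration; they are not the TextQuests harness or any real model API.
MAX_STEPS = 500  # per-run step cap mentioned in the text


class ToyGame:
    """A one-puzzle game that ends once the player types 'open door'."""

    def __init__(self):
        self.done = False

    def step(self, action):
        if action == "open door":
            self.done = True
            return "The door swings open. You win."
        return "Nothing happens."


def scripted_policy(history):
    """Stand-in for an LLM call: sees the whole history, returns an action."""
    # Tries 'look' twice (four history lines), then the winning action.
    return "look" if len(history) < 4 else "open door"


def run_episode(game, policy, max_steps=MAX_STEPS):
    history = []  # full transcript, never truncated, so it grows every step
    for _ in range(max_steps):
        action = policy(history)
        observation = game.step(action)
        history.extend([f"> {action}", observation])
        if game.done:
            break
    return history


history = run_episode(ToyGame(), scripted_policy)
print(len(history) // 2, "steps taken")
```

Because the transcript is never trimmed, each step adds to what the agent must read before acting. Errors such as misremembering an earlier action therefore become more likely the longer a run lasts.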
