2024-08-14 · HuggingFace

A failed experiment: Infini-Attention, and why we should keep trying?

models · tooling · research

Source: HuggingFace · Date: 2024-08-14 · URL: https://huggingface.co/blog/infini-attention

Summary

Research post-mortem: HF’s attempt to reproduce Google’s Infini-Attention (a compressive memory for extending LLM context) by extending Llama 3 8B to a 1M-token context. Result: the approach failed. The gating factors converged near 0.5 (neutral) instead of learning to distinguish memory from local context, so passkey retrieval succeeded 100% of the time when the key was in the current segment and 0% of the time when it was in an earlier segment. Root causes: the gating parameters need their own, much higher learning rate (0.01 vs. the 3e-4 base rate), and weight decay pushes the gating toward neutral. Conclusion: ring attention, YaRN, and RoPE scaling remain more reliable for long-context extension.
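To make the weight-decay point concrete, here is a minimal sketch in plain Python. It assumes the gate is a sigmoid of a learned scalar `beta`, as in the Infini-Attention formulation, and that weight decay is applied in decoupled (AdamW-style) form. The function names, the starting value of `beta`, and the `weight_decay=0.1` setting are illustrative assumptions, not values from the post; only the 0.01 learning rate comes from the summary above.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function that maps the raw gate parameter to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def mix(gate: float, memory_out: float, local_out: float) -> float:
    """Blend the compressive-memory output with the local attention output."""
    return gate * memory_out + (1.0 - gate) * local_out


def decay_only(beta: float, lr: float, weight_decay: float, steps: int) -> float:
    """Apply decoupled weight decay alone, i.e. with no gradient signal at all."""
    for _ in range(steps):
        beta -= lr * weight_decay * beta
    return beta


# Hypothetical starting point: a gate that strongly prefers memory.
beta = 4.0
gate_before = sigmoid(beta)

# weight_decay=0.1 is an assumed value chosen for illustration.
beta_after = decay_only(beta, lr=0.01, weight_decay=0.1, steps=10_000)
gate_after = sigmoid(beta_after)

print(f"gate before decay: {gate_before:.3f}")
print(f"gate after decay:  {gate_after:.3f}")
```

Because sigmoid(0) = 0.5, any force that shrinks `beta` toward zero drags the gate toward an even blend of memory and local context. That is one plausible reading of why excluding the gating parameters from weight decay, and giving them a higher learning rate, matters for this method.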

Implications

Open-weights ecosystem health. This post is valuable precisely because it documents a failure with reproducible evidence. The HF team spent significant compute trying to make Infini-Attention work, and publishing the negative result saves others from repeating that investment. The finding that gating instability, rather than the general approach, is the core problem is actionable for researchers who want to fix the method rather than abandon it.

Transformers library trajectory. The lesson hierarchy at the end (train baselines, start small, track qualitative evals and not just loss) is the kind of institutional knowledge that keeps the ecosystem from chasing paper results that don’t reproduce. Publishing these failure modes openly is part of what makes HF a credible technical voice.
