Mastering Long Contexts in LLMs with KVPress
Source: Hugging Face
Date: 2025-01-23
URL: https://huggingface.co/blog/nvidia/kvpress
Summary
Library release: NVIDIA’s KVPress, an open-source toolkit for KV cache compression in LLMs. It targets the memory wall in long-context inference: for Llama 3-70B at 1M tokens, the KV cache alone requires 327.6GB. KVPress applies compression algorithms (“presses”) that prune low-importance key-value pairs from the cache during the pre-filling phase. At 50% compression on Llama 3.1 8B (128k context, A100), memory drops from 45GB to 37GB and throughput rises from 11 to 17 tokens/sec; accuracy under compression is measured on the RULER benchmark.
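The 327.6GB figure can be reproduced with a short worked calculation. The configuration values used here (80 layers, 8 key-value heads, head dimension 128, and 2 bytes per value for bfloat16) are assumptions taken from the published Llama 3 70B architecture, not from the summary itself. The leading factor of 2 counts keys and values separately.

```latex
\begin{aligned}
\text{KV cache bytes}
  &= 2 \times n_{\text{layers}} \times n_{\text{kv heads}} \times d_{\text{head}}
     \times \text{bytes per value} \times n_{\text{tokens}} \\
  &= 2 \times 80 \times 8 \times 128 \times 2 \times 10^{6} \\
  &= 327.68 \times 10^{9}\ \text{bytes} \approx 327.7\ \text{GB}
\end{aligned}
```

Under these assumptions the result agrees with the quoted figure up to rounding, and it grows linearly with context length, which is why compression of the cache itself is the lever KVPress pulls.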
Implications
Transformers library trajectory. KVPress plugs into the Hugging Face pipeline API, so KV cache compression can be added to a Transformers-based long-context workflow with minimal code changes. The combination of AdaKVPress and ExpectedAttentionPress performs best on the RULER benchmark; these methods may become standard practice as 128k+ context models move into production.
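A minimal sketch of what the pipeline integration might look like in practice, based on the usage pattern in the KVPress README. The function name, the default model ID, and the device string are illustrative assumptions, and argument names may differ between library versions. Calling the function needs a CUDA GPU plus the `transformers` and `kvpress` packages, so the imports are kept local to the function.

```python
def answer_with_compressed_cache(
    context: str,
    question: str,
    model_name: str = "meta-llama/Llama-3.1-8B-Instruct",  # assumed model ID
    compression_ratio: float = 0.5,
) -> str:
    """Answer a question about a long context with a compressed KV cache.

    Hypothetical helper for illustration; it is not part of KVPress itself.
    """
    # Local imports: the heavy GPU stack is only needed when the function runs.
    from transformers import pipeline
    from kvpress import ExpectedAttentionPress  # importing kvpress registers the pipeline

    # Custom pipeline task name registered by kvpress.
    pipe = pipeline(
        "kv-press-text-generation",
        model=model_name,
        device="cuda:0",       # assumption: a single CUDA device is available
        torch_dtype="auto",
    )

    # A press that drops the requested fraction of key-value pairs from the cache.
    press = ExpectedAttentionPress(compression_ratio=compression_ratio)

    # The context is compressed; the question is appended afterwards.
    return pipe(context, question=question, press=press)["answer"]
```

The combination named above would presumably be built by wrapping this press in `AdaKVPress`; the exact constructor and any attention-backend requirements should be checked against the library documentation.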
Open-weights ecosystem health. The memory requirements for long-context open-weights inference have been a practical blocker for teams without H100 clusters. A 50% compression ratio that keeps accuracy near baseline changes the feasibility calculus for running 70B+ models at extended context lengths on available hardware.
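To make the feasibility argument concrete, the sketch below estimates how many tokens of KV cache fit in a fixed memory budget with and without compression. The function names are hypothetical, the model shape (80 layers, 8 key-value heads, head dimension 128, bfloat16) is an assumed Llama 3 70B configuration, and the 80GB budget is an arbitrary example. The estimate is idealized: it ignores model weights, activations, and any temporary peak while the cache is being built.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache stored per token (factor 2 = keys plus values)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def max_context_tokens(budget_bytes: int, compression_ratio: float,
                       n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Tokens whose compressed KV cache fits in budget_bytes.

    compression_ratio is the fraction of key-value pairs removed,
    so 0.5 keeps half of the cache.
    """
    if not 0.0 <= compression_ratio < 1.0:
        raise ValueError("compression_ratio must be in [0, 1)")
    per_token = kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value)
    kept_fraction = 1.0 - compression_ratio
    return int(budget_bytes / (per_token * kept_fraction))


if __name__ == "__main__":
    # Assumed Llama 3 70B shape; the 80 GB cache budget is illustrative only.
    shape = dict(n_layers=80, n_kv_heads=8, head_dim=128)
    budget = 80 * 10**9
    for ratio in (0.0, 0.5):
        tokens = max_context_tokens(budget, ratio, **shape)
        print(f"compression {ratio:.0%}: about {tokens:,} tokens fit")
```

In this idealized model a 50% compression ratio doubles the context that fits in the same cache budget. Real savings are smaller as a share of total memory because weights are unaffected, which is consistent with the 45GB to 37GB drop reported in the summary.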