2025-09-11 · HuggingFace

Tricks from OpenAI gpt-oss YOU 🫵 can use with transformers

models · research · infrastructure

Source: HuggingFace
Date: 2025-09-11
URL: https://huggingface.co/blog/faster-transformers

Summary

Library update: seven Transformers enhancements added for GPT-OSS model support, most of them broadly applicable:

(1) Zero-build kernels from the Hub (use_kernels=True): 2-10x speedup for RMSNorm, MoE, and FlashAttention3.
(2) MXFP4 quantization: about 4x memory reduction versus 16-bit weights (GPT-OSS 120B fits in ~80GB VRAM).
(3) Tensor parallelism via tp_plan="auto".
(4) Expert parallelism for MoE models (DistributedConfig(enable_expert_parallel=True)).
(5) Dynamic sliding window cache: ~50% KV cache reduction for GPT-OSS hybrid attention.
(6) Continuous batching via generate_batch.
(7) Faster model loading via pre-allocated GPU memory blocks.
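As a rough sketch of how the loading-time flags from items (1) and (3) might be combined, the helper below only assembles keyword arguments for from_pretrained. The names build_load_kwargs and load_model and the checkpoint id openai/gpt-oss-20b are illustrative assumptions, not taken from the source; only use_kernels=True and tp_plan="auto" come from the summary. Expert parallelism is left out because the summary does not say how DistributedConfig is imported or passed in.

```python
def build_load_kwargs(use_hub_kernels=True, tensor_parallel=False):
    """Assemble from_pretrained keyword arguments (hypothetical helper)."""
    kwargs = {}
    if use_hub_kernels:
        # Item (1): fetch pre-built kernels from the Hub instead of compiling locally.
        kwargs["use_kernels"] = True
    if tensor_parallel:
        # Item (3): let the library choose a tensor-parallel sharding plan.
        kwargs["tp_plan"] = "auto"
    return kwargs


def load_model(model_id="openai/gpt-oss-20b", **kwargs):
    """Load a causal LM; model_id is an assumed example checkpoint."""
    # Imported lazily so the helper above can be used without transformers installed.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(model_id, **kwargs)


single_gpu_kwargs = build_load_kwargs()
multi_gpu_kwargs = build_load_kwargs(tensor_parallel=True)
```

Calling load_model(**single_gpu_kwargs) would download the checkpoint and needs suitable hardware, so it is not executed here. A tensor-parallel load normally expects to be started across several processes (for example with torchrun), which this sketch does not handle.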

Implications

Transformers library trajectory. Most of the seven features apply well beyond GPT-OSS: MXFP4 quantization, tensor parallelism, and continuous batching are general improvements to Transformers' inference stack. Distributing pre-compiled kernels from the Hub (use_kernels=True) is a workflow improvement that removes the need to build or JIT-compile kernels locally at startup, a pain point for any production deployment.
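To illustrate why continuous batching counts as a general inference improvement, here is a toy scheduling model, not a description of how generate_batch is implemented. It counts decode steps for a queue of requests under two policies: static batching, where each batch runs until its longest request finishes, and continuous batching, where a freed slot is refilled at the next step. The function names and request lengths are made up for the example.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each fixed batch runs until its longest request ends."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when finished requests are replaced from the queue each step."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or active:
        # Refill any free slots before running the next decode step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every active request emits one token; drop the ones that are done.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


request_lengths = [8, 2, 2, 2, 8, 2, 2, 2]  # tokens to generate per request
static_steps = static_batching_steps(request_lengths, batch_size=4)
continuous_steps = continuous_batching_steps(request_lengths, batch_size=4)
```

With these request lengths and four slots, the static schedule takes 16 decode steps and the continuous one takes 10, because short requests no longer hold slots idle while a long request finishes.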

Open-weights ecosystem health. With MXFP4 bringing GPT-OSS 120B down to ~80GB VRAM, the larger of the two GPT-OSS models fits in roughly the memory of a single 80GB GPU, and a single 8xH100 node (640GB total) has ample room left for batch inference. The dynamic sliding window cache, which cuts KV memory by ~50% for hybrid attention models, matters more as long-context inference scales: this is the class of optimization that turns 128k context from theoretically possible to practically deployable.
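The two memory figures above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes the MXFP4 block layout from the microscaling format (32 four-bit elements sharing one 8-bit scale) and a hybrid attention model where half the layers use a sliding window. The layer count, KV head count, head size, and window length are illustrative assumptions, not values stated in the source.

```python
def mxfp4_bits_per_weight(block_size=32, element_bits=4, scale_bits=8):
    """Average bits per weight: 4-bit elements plus one shared scale per block."""
    return element_bits + scale_bits / block_size


def weight_footprint_gb(n_params, bits_per_weight):
    """Weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   window=None, sliding_fraction=0.0, bytes_per_value=2):
    """KV cache size when a fraction of layers keep only the last `window` tokens."""
    # Keys and values: 2 tensors per layer, each n_kv_heads * head_dim per token.
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_value
    n_sliding = int(n_layers * sliding_fraction)
    n_full = n_layers - n_sliding
    sliding_tokens = seq_len if window is None else min(seq_len, window)
    return per_token_per_layer * (n_full * seq_len + n_sliding * sliding_tokens)


# Weights: lower bound if every one of 120e9 parameters were stored in MXFP4.
mxfp4_gb = weight_footprint_gb(120e9, mxfp4_bits_per_weight())
bf16_gb = weight_footprint_gb(120e9, 16)

# KV cache at 128k context; shape values below are assumed for illustration.
shape = dict(seq_len=131072, n_layers=36, n_kv_heads=8, head_dim=64)
full_cache = kv_cache_bytes(**shape)
hybrid_cache = kv_cache_bytes(**shape, window=128, sliding_fraction=0.5)
kv_ratio = hybrid_cache / full_cache
```

Under these assumptions the all-MXFP4 lower bound is 63.75 GB against 240 GB at 16 bits. The ~80GB figure quoted above is higher, plausibly because some tensors stay in higher precision and runtime buffers add overhead. The KV ratio comes out just above 0.5, consistent with the ~50% reduction: the sliding layers contribute almost nothing once the context is much longer than the window.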
