Custom Kernels for All from Codex and Claude
Source: HuggingFace
Date: 2026-02-13
URL: https://huggingface.co/blog/custom-cuda-kernels-agent-skills
Summary
Agent skill release: Hugging Face ships a ~550-token agent skill that teaches Claude and Codex to write production CUDA kernels with correct PyTorch/C++ bindings and benchmarking. Validated on real models: an RMSNorm kernel for Qwen3-8B yields a 1.94x isolated speedup (2.47x at 8192 tokens), and an RMSNorm kernel for LTX-Video yields 1.88x isolated and 1.43x end-to-end with torch.compile. Generated kernels are published to the Kernel Hub, where users load them via get_kernel("org/kernel") without compiling anything locally.
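For readers unfamiliar with the operation being accelerated, here is a minimal pure-Python sketch of what an RMSNorm layer computes and how an "isolated speedup" ratio is derived from two timings. It is a reference for the math only, not the generated CUDA kernel; the function names, the eps default, and the timing values in the usage note are illustrative assumptions, not taken from the source.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Reference RMSNorm for one vector: x / sqrt(mean(x^2) + eps) * weight.

    A GPU kernel applies the same formula to every token's hidden vector;
    this version favors clarity over speed. eps=1e-6 is an assumed default.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def speedup(baseline_ms, kernel_ms):
    """Speedup ratio: baseline time divided by optimized time."""
    if baseline_ms <= 0 or kernel_ms <= 0:
        raise ValueError("timings must be positive")
    return baseline_ms / kernel_ms
```

With hypothetical timings of 2.0 ms for the baseline and 1.0 ms for the custom kernel, `speedup(2.0, 1.0)` returns 2.0. An isolated figure times only the normalization op, while an end-to-end figure times the whole model, so the end-to-end number is usually the smaller of the two.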
Implications
Thread: HF as open-source ML hub / agentic patterns. The agent skill abstraction is the key idea: a ~550-token prompt gives Claude or Codex enough context to produce correct, benchmarkable CUDA code for a specific model target. The Kernel Hub as a distribution layer is the strategic extension: once a kernel is published, any user can load it without a CUDA toolchain. This democratizes infrastructure. Custom kernel performance has historically required deep CUDA expertise; now it requires the ability to prompt an agent. Watch whether the Kernel Hub accumulates a meaningful catalog of community-contributed, model-specific optimizations.
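The load-without-a-toolchain path might look like the sketch below. It assumes that the get_kernel named in the summary is the function exported by Hugging Face's `kernels` Python package and that the package is installed alongside PyTorch. "org/kernel" is the placeholder repository id from the text, and both helper names are hypothetical. The functions a given kernel exposes are not stated in the source, so the sketch only lists them rather than calling any.

```python
def load_hub_kernel(repo_id):
    """Fetch a prebuilt kernel from the Kernel Hub; no local compilation.

    Assumption: `kernels` is the Hugging Face package exporting get_kernel.
    It is imported lazily so this module loads even where it is not installed.
    """
    from kernels import get_kernel

    return get_kernel(repo_id)


def public_names(module):
    """List the public attributes (candidate kernel functions) of a module."""
    return sorted(name for name in dir(module) if not name.startswith("_"))


# Example usage (needs network access and the `kernels` package):
#   kernel = load_hub_kernel("org/kernel")  # placeholder id from the text
#   print(public_names(kernel))
```

Inspecting the loaded module first is a cautious choice: each published kernel defines its own function names and signatures, so its repository page remains the authority on how to call it.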