2025-08-13 · HuggingFace

Arm & ExecuTorch 0.7: Bringing Generative AI to the masses

Source: HuggingFace · Date: 2025-08-13 · URL: https://huggingface.co/blog/Arm/executorch-0-dot-7

Summary

Library update: ExecuTorch 0.7 enables KleidiAI kernel optimizations by default on Arm CPUs. The kernels target the SDOT instruction, which is available on Armv8.2+ and present in 72% of all devices (~3 billion units). Benchmarks on a Galaxy S24+: 20% higher prefill throughput than without KleidiAI, 350+ tokens/second prefill, and 40+ tokens/second decode, which is faster than average human reading speed. Use cases demonstrated: a fully offline speech-to-text + LLM + TTS assistant and context-aware local text completion.
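For readers unfamiliar with SDOT, the arithmetic it performs can be sketched in a few lines of Python. This is only an illustration of the instruction's semantics for one 128-bit vector (four int32 accumulator lanes, each adding a 4-way signed int8 dot product). It is not KleidiAI or ExecuTorch code, and the function name `sdot` is a hypothetical helper chosen for this sketch.

```python
def sdot(acc, a, b):
    """Emulate the arithmetic of one 128-bit SDOT (illustrative helper, not a library API).

    acc: 4 int32 accumulator lanes
    a, b: 16 signed 8-bit values each
    Each output lane i is acc[i] plus the dot product of a[4i:4i+4] and b[4i:4i+4].
    32-bit wraparound on overflow is not modeled here.
    """
    if len(acc) != 4 or len(a) != 16 or len(b) != 16:
        raise ValueError("expected 4 accumulator lanes and 16 int8 values per input")
    if any(not -128 <= x <= 127 for x in list(a) + list(b)):
        raise ValueError("inputs must fit in signed 8 bits")

    out = []
    for lane in range(4):
        # Four int8 multiplies reduced into one int32 lane.
        dot = sum(a[4 * lane + j] * b[4 * lane + j] for j in range(4))
        out.append(acc[lane] + dot)
    return out
```

One such instruction performs 16 multiply-accumulates, which is why int8-quantized matrix multiplication, the dominant cost in LLM inference, benefits from it.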

Implications

Open-weights ecosystem health. A decode rate of 40+ tokens/second is a practical threshold for consumer on-device LLM use. The figure was measured on a Galaxy S24+ with Llama 3.2 1B, but the SDOT path it relies on is also available on mid-range phones with ~3-5 year old Arm CPUs. With ExecuTorch + KleidiAI, the on-device inference story is therefore no longer limited to flagship hardware: it covers ~3 billion devices. This significantly expands the addressable deployment surface for small open-weights models.
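To make the throughput figures concrete, here is a rough back-of-the-envelope sketch. The 350 and 40 tokens/second defaults come from the benchmark quoted in the summary. The reading speed (240 words/minute) and the tokens-per-word ratio (1.3) are assumptions introduced only for this illustration, as are the function names.

```python
def response_latency(prompt_tokens, output_tokens, prefill_tps=350.0, decode_tps=40.0):
    """Return (prefill_seconds, decode_seconds) for one request.

    Defaults are the quoted Galaxy S24+ figures; real throughput varies by
    device, model, and prompt length.
    """
    prefill_s = prompt_tokens / prefill_tps   # time to process the prompt
    decode_s = output_tokens / decode_tps     # time to generate the reply
    return prefill_s, decode_s


def reading_rate_tps(words_per_minute=240.0, tokens_per_word=1.3):
    """Approximate reading speed in tokens/second (both defaults are assumptions)."""
    return words_per_minute / 60.0 * tokens_per_word


if __name__ == "__main__":
    prefill_s, decode_s = response_latency(prompt_tokens=700, output_tokens=200)
    print(f"prefill: {prefill_s:.1f} s, decode: {decode_s:.1f} s")
    print(f"decode is {40.0 / reading_rate_tps():.1f}x the assumed reading rate")
```

Under these assumptions, generation outpaces reading by several times, so the user is never waiting on the model once output starts streaming.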

Model release cadence. As on-device inference becomes more capable, there will be pressure on model labs to release sub-3B models that are specifically optimized for ExecuTorch/Arm rather than just quantized versions of larger models. Watch for purpose-built on-device model releases citing SDOT and KleidiAI compatibility as a deployment target.
