2025-10-15 · HuggingFace

Get your VLM running in 3 simple steps on Intel CPUs

infrastructure

Get your VLM running in 3 simple steps on Intel CPUs

Source: HuggingFace Date: 2025-10-15 URL: https://huggingface.co/blog/openvino-vlm

Summary

Integration tutorial: Running VLMs on Intel CPUs via Optimum Intel and OpenVINO — convert to IR format, apply weight-only quantization, run inference. Tested on SmolVLM2-256M with Intel Core Ultra 7 265K. PyTorch baseline: 5.15s time-to-first-token, 0.72 tokens/sec. OpenVINO (unquantized): 0.42s TTFT (12x), 47.2 tokens/sec (65x). OpenVINO 8-bit WOQ: 0.247s TTFT (21x), 63.9 tokens/sec (88x). No GPU required.

Implications

Open-weights ecosystem health. 88x throughput improvement on CPU hardware via OpenVINO quantization changes the deployment calculus for edge and embedded use cases. A 256M VLM running at 64 tokens/sec on a consumer CPU without a GPU is practically usable for local inference — this is the path to open-weights VLMs on hardware that has no GPU budget.

Model release cadence (hardware-specific). Intel’s Optimum Intel library and OpenVINO as an inference backend are consistently underrepresented in discussion vs CUDA-native paths. The gap between PyTorch-on-CPU baseline and OpenVINO quantized performance is large enough that teams evaluating edge deployment should treat OpenVINO as the baseline, not an optimization step.

← all signals