2024-09-20 · HuggingFace

Optimize and deploy with Optimum-Intel and OpenVINO GenAI

models · enterprise · infrastructure



Source: HuggingFace · Date: 2024-09-20 · URL: https://huggingface.co/blog/deploy-with-openvino

Summary

Integration tutorial covering an end-to-end workflow for deploying Hugging Face Transformers models on Intel hardware with Optimum-Intel and OpenVINO GenAI. The steps: export the model to OpenVINO IR format, apply INT8 or INT4 weight-only quantization (AWQ plus scale estimation on the INT4 path), then deploy through the Python or C++ GenAI API with KV cache optimization. Reported perplexity for Llama-3.1-8B on Wikitext: FP32 7.34, OpenVINO INT8 7.35 (near-identical), OpenVINO INT4 7.83 (minor degradation). The C++ deployment path addresses Python's limitations in production edge environments.
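The export, quantize, and deploy steps can be sketched in Python roughly as follows. This is a minimal sketch, not the blog's exact code: the model ID, output directory, calibration dataset, group size, and ratio are assumptions chosen for illustration, and the parameters accepted by `OVWeightQuantizationConfig` may differ between Optimum-Intel versions. The heavy imports sit inside the functions so the file can be loaded without the libraries installed.

```python
# Assumed values for illustration; none of these come from the source summary.
MODEL_ID = "meta-llama/Llama-3.1-8B"   # assumed Hub ID for Llama-3.1-8B
OUT_DIR = "./llama-3.1-8b-ov-int4"     # hypothetical output directory

# INT4 weight-only settings: AWQ plus scale estimation, both data-aware.
INT4_SETTINGS = {
    "bits": 4,
    "quant_method": "awq",
    "scale_estimation": True,
    "dataset": "wikitext2",   # assumed calibration dataset
    "group_size": 64,         # assumed group size
    "ratio": 1.0,             # assumed: quantize all eligible layers to INT4
}


def export_int4(model_id: str = MODEL_ID, out_dir: str = OUT_DIR) -> None:
    """Export a Transformers model to OpenVINO IR with INT4 weight compression."""
    from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
    from transformers import AutoTokenizer

    config = OVWeightQuantizationConfig(**INT4_SETTINGS)
    model = OVModelForCausalLM.from_pretrained(
        model_id, export=True, quantization_config=config
    )
    model.save_pretrained(out_dir)
    # Save the tokenizer next to the IR files. OpenVINO GenAI additionally
    # expects tokenizer files converted to OpenVINO format, which the
    # `optimum-cli export openvino` route normally produces.
    AutoTokenizer.from_pretrained(model_id).save_pretrained(out_dir)


def generate(model_dir: str, prompt: str, device: str = "CPU") -> str:
    """Run generation on the exported model through the OpenVINO GenAI Python API."""
    import openvino_genai

    pipe = openvino_genai.LLMPipeline(model_dir, device)
    return str(pipe.generate(prompt, max_new_tokens=100))
```

For the INT8 variant, passing `bits=8` without the AWQ and scale-estimation settings would be the analogous configuration, since those two options apply to the 4-bit path.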

Implications

Open-weights ecosystem health. The key finding is that INT8 perplexity is near-identical to FP32 (7.35 vs 7.34 on Wikitext), so teams deploying Llama-3.1-8B on Intel CPUs or integrated GPUs can use INT8 weight-only quantization with no meaningful perplexity loss on this benchmark. The C++ GenAI API is the production-relevant path for edge deployments, where Python interpreter overhead and startup time are constraints.
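To put the perplexity figures from the summary in relative terms, a short calculation (using only the three reported numbers) shows the size of each gap:

```python
# Perplexity figures quoted in the summary (Llama-3.1-8B, Wikitext).
PERPLEXITY = {"FP32": 7.34, "INT8": 7.35, "INT4": 7.83}


def relative_increase(baseline: float, value: float) -> float:
    """Return the perplexity increase over the baseline, in percent."""
    return (value - baseline) / baseline * 100.0


for name in ("INT8", "INT4"):
    pct = relative_increase(PERPLEXITY["FP32"], PERPLEXITY[name])
    print(f"{name}: +{pct:.2f}% perplexity vs FP32")
```

INT8 comes out at roughly +0.14% and INT4 at roughly +6.68% over the FP32 baseline.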

Model release cadence (hardware-specific). Positioning OpenVINO GenAI as an alternative to llama.cpp on Intel hardware reflects Intel’s strategy of maintaining a separate, optimized inference path. The combination of AWQ and scale estimation (as opposed to simple round-to-nearest INT4) is the accuracy-preserving quantization path worth watching as INT4 adoption grows for edge inference.
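For context on the baseline mentioned above, simple round-to-nearest INT4 can be sketched as follows. This is a generic illustration of symmetric, per-group round-to-nearest quantization, not the OpenVINO or NNCF implementation; the function names and example weights are hypothetical.

```python
def quantize_group_rtn(weights, bits=4):
    """Symmetric round-to-nearest quantization of one weight group."""
    qmax = 2 ** (bits - 1) - 1        # 7 for INT4
    qmin = -(2 ** (bits - 1))         # -8 for INT4
    max_abs = max(abs(w) for w in weights)
    # One scale per group, chosen only from the weight range (no calibration data).
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


group = [0.7, -0.28, 0.13, 0.0]       # hypothetical weights
codes, scale = quantize_group_rtn(group)
restored = dequantize_group(codes, scale)
errors = [abs(w - r) for w, r in zip(group, restored)]
print(codes, max(errors))
```

Here the scale depends only on the largest weight in the group. Data-aware methods such as AWQ and scale estimation instead use calibration activations to choose scales that reduce the error on the layer's outputs, which is why they tend to preserve accuracy better at 4 bits.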
