WWDC 24: Running Mistral 7B with Core ML
Source: Hugging Face
Date: 2024-07-22
URL: https://huggingface.co/blog/mistral-coreml
Summary
Integration tutorial demonstrating Mistral 7B running on Apple silicon via Core ML with 4-bit block-wise quantization. The quantized model is 3.8GB versus 14GB in float16, and inference uses under 4GB of RAM. The tutorial relies on four Core ML features introduced at WWDC 2024: MLTensor (a Swift tensor abstraction), stateful buffers for the KV-cache (GPU-resident, no transfer overhead), block-wise quantization, and multifunction support for LoRA adapters. The post gives no explicit latency benchmarks. The Swift integration uses a preview branch of swift-transformers.
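To make the size figures concrete, here is a minimal sketch of symmetric block-wise 4-bit quantization in plain Python, followed by a rough storage estimate. The block size of 32, the float16 scale per block, and the round 7-billion parameter count are assumptions for illustration. The helper names are hypothetical, and this is not the coremltools implementation.

```python
def quantize_blockwise(weights, block_size=32):
    """Symmetric 4-bit quantization with one scale per block (illustrative)."""
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        # Map the largest magnitude in the block to 7, the top of the
        # signed 4-bit range [-8, 7]. An all-zero block gets a dummy scale.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        codes = [max(-8, min(7, round(w / scale))) for w in block]
        blocks.append((scale, codes))
    return blocks


def dequantize_blockwise(blocks):
    """Reconstruct approximate weights from (scale, codes) pairs."""
    return [scale * code for scale, codes in blocks for code in codes]


def estimate_weight_bytes(n_params, bits_per_weight, block_size=None, scale_bits=16):
    """Rough storage estimate: packed weights plus one scale per block."""
    payload = n_params * bits_per_weight / 8
    scales = 0 if block_size is None else (n_params / block_size) * scale_bits / 8
    return payload + scales


N_PARAMS = 7_000_000_000  # assumed round figure, not the exact parameter count
fp16_gb = estimate_weight_bytes(N_PARAMS, 16) / 1e9
int4_gb = estimate_weight_bytes(N_PARAMS, 4, block_size=32) / 1e9
print(f"float16: {fp16_gb:.1f} GB, 4-bit block-wise: {int4_gb:.2f} GB")
```

Under these assumptions the estimate lands at roughly 14 GB for float16 and just under 4 GB for 4-bit, in line with the reported figures. Smaller blocks track local weight ranges more closely but store more scales, so the block size trades accuracy against size.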
Implications
Thread: open-weights ecosystem health. This is a significant on-device deployment signal: Mistral 7B running in under 4GB of RAM on a Mac is practical for local inference with no cloud dependency. The Core ML stateful buffer feature for the KV-cache is the critical enabler: without a GPU-resident KV-cache, inference latency would be dominated by memory transfers. The swift-transformers integration path means HF's model ecosystem is becoming directly accessible from native Apple development, which matters for macOS/iOS AI apps. Watch for Core ML exporter support to expand to more architectures.
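As a conceptual illustration of why a persistent KV-cache matters, the sketch below contrasts a cache that is copied in and out on every decoding step with one that is updated in place. It is plain Python with hypothetical class names, and it counts copied elements as a stand-in for transfer cost. It does not use the Core ML state API.

```python
class CopiedCache:
    """Cache passed in as an input and returned as an output on every step."""

    def __init__(self):
        self.entries = []
        self.elements_copied = 0

    def step(self, new_entry):
        # Simulate round-tripping the whole cache through the model's
        # inputs and outputs: every existing entry is copied again.
        self.entries = list(self.entries) + [new_entry]
        self.elements_copied += len(self.entries)


class StatefulCache:
    """Preallocated buffer that persists across steps and is written in place."""

    def __init__(self, max_tokens):
        self.buffer = [None] * max_tokens
        self.length = 0
        self.elements_copied = 0

    def step(self, new_entry):
        # Only the slot for the new token is written.
        self.buffer[self.length] = new_entry
        self.length += 1
        self.elements_copied += 1


def run(cache, n_tokens):
    """Feed n_tokens placeholder key/value entries and report the copy count."""
    for t in range(n_tokens):
        cache.step((f"k{t}", f"v{t}"))
    return cache.elements_copied


copied = run(CopiedCache(), 256)
stateful = run(StatefulCache(256), 256)
print(f"copied per-step cache: {copied} elements, in-place cache: {stateful} elements")
```

In this toy model the copied cache moves a number of elements that grows quadratically with sequence length, while the in-place cache grows linearly. That is the intuition behind keeping the cache resident as model state.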