2024-07-18 · HuggingFace

TGI Multi-LoRA: Deploy Once, Serve 30 Models

pricing · models · enterprise · infrastructure

Source: HuggingFace · Date: 2024-07-18 · URL: https://huggingface.co/blog/multi-lora-serving

Summary

Library update and tutorial: TGI (Text Generation Inference) adds multi-LoRA serving. A single base-model deployment (e.g., Mistral-7B) can host up to 30 LoRA adapters, with the adapter selected per request. Each adapter is ~13.6MB against a 14.48GB base model, so loading 30 adapters adds only ~3% VRAM overhead. Cost: $0.80/hr on an L4 GPU at 75 req/sec throughput. Because all adapters share one deployment, cost per token stays constant regardless of adapter count, which eliminates the linear per-adapter cost of running separate model instances. Training a LoRA adapter costs ~$8, per Predibase. Deployment options: Docker, the HF Inference Endpoints GUI, and the Python API.
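
Per-request adapter selection can be sketched as follows. This is a minimal illustration, not the blog's own code: it assumes a TGI server is already running with adapters loaded, that the adapter is chosen through an `adapter_id` field inside `parameters` on the `/generate` route, and that leaving the field out falls back to the base model. The server URL and adapter names are hypothetical placeholders.

```python
import json
import urllib.request


def build_generate_payload(prompt, adapter_id=None, max_new_tokens=40):
    """Build the JSON body for a TGI /generate call."""
    parameters = {"max_new_tokens": max_new_tokens}
    if adapter_id is not None:
        # Assumption: this field selects which loaded LoRA adapter serves
        # the request; without it the base model answers.
        parameters["adapter_id"] = adapter_id
    return {"inputs": prompt, "parameters": parameters}


def generate(base_url, prompt, adapter_id=None, timeout=30.0):
    """POST the payload to a running TGI server and return the generated text."""
    body = json.dumps(build_generate_payload(prompt, adapter_id)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["generated_text"]


if __name__ == "__main__":
    # Hypothetical adapter IDs; replace with adapters actually loaded on the server.
    for adapter in ("your-org/customer-support-lora", "your-org/code-lora", None):
        print(json.dumps(build_generate_payload("Hello, who are you?", adapter)))
```

Calling `generate("http://127.0.0.1:3000", prompt, adapter_id=...)` against a live server would route each request to a different adapter on the same deployment. Only the payload changes between requests.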

Implications

HF as open-source ML hub. Multi-LoRA serving at $0.80/hr for 30 adapters on a single deployment is a practical unlock for teams that want task-specific fine-tuned behavior without running a separate endpoint per model variant. The cost math is compelling: the alternative, 30 separate deployments at $0.80/hr each, costs ~$24/hr minimum for the same adapter count.

Open-weights ecosystem health. Task-specific LoRA fine-tuning (customer support, code, domain-specific tasks) is economical: ~$8/adapter to train and ~$0.027/hr per adapter to serve when 30 adapters share one $0.80/hr deployment ($0.80/30). That makes the economics of specialized fine-tuning comparable to or better than general-purpose API calls, and gives teams migrating from GPT-4 to domain-specialized open-weights models a concrete path.
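
The arithmetic behind the figures quoted above can be checked with a short script. It is a sketch that uses only the numbers in this summary. The function names are illustrative, and the VRAM calculation assumes 1 GB = 1000 MB and ignores any runtime overhead beyond raw adapter weights.

```python
HOURLY_RATE_USD = 0.80      # one L4 deployment, as quoted in the summary
ADAPTER_COUNT = 30
ADAPTER_SIZE_MB = 13.6
BASE_MODEL_SIZE_GB = 14.48


def separate_deployments_cost(adapters, hourly_rate):
    """Hourly cost if every adapter ran on its own deployment."""
    return adapters * hourly_rate


def per_adapter_share(adapters, hourly_rate):
    """Hourly cost attributed to each adapter on one shared deployment."""
    return hourly_rate / adapters


def adapter_vram_overhead(adapters, adapter_mb, base_gb):
    """Adapter weights as a fraction of base-model size (1 GB = 1000 MB assumed)."""
    return adapters * adapter_mb / (base_gb * 1000)


if __name__ == "__main__":
    separate = separate_deployments_cost(ADAPTER_COUNT, HOURLY_RATE_USD)
    share = per_adapter_share(ADAPTER_COUNT, HOURLY_RATE_USD)
    overhead = adapter_vram_overhead(ADAPTER_COUNT, ADAPTER_SIZE_MB, BASE_MODEL_SIZE_GB)
    print(f"30 separate deployments: ${separate:.2f}/hr")
    print(f"shared deployment, per adapter: ${share:.3f}/hr")
    print(f"adapter VRAM overhead: {overhead:.1%}")
```

Under these assumptions the script reproduces the ~$24/hr, ~$0.027/hr, and ~3% figures. The shared deployment's total stays at $0.80/hr however many of the 30 adapters are loaded.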
