2024-08-19 · HuggingFace

Deploy Meta Llama 3.1 405B on Google Cloud Vertex AI

models · infrastructure

Source: HuggingFace · Date: 2024-08-19 · URL: https://huggingface.co/blog/llama31-on-vertex-ai

Summary

Integration tutorial for deploying Meta Llama 3.1 405B-Instruct-FP8 on Google Cloud Vertex AI, served by TGI v2.2 inside a Hugging Face Deep Learning Container (DLC). Hardware: one A3 instance with 8x H100 80GB GPUs (~640GB of VRAM in total, enough to hold the FP8 weights). Total setup takes ~25–30 minutes. The tutorial covers API enablement, model registration, endpoint creation, and cleanup.
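
The register, create endpoint, deploy, and clean up steps can be sketched with the `google-cloud-aiplatform` Python SDK. This is a minimal sketch under stated assumptions, not the tutorial's own code: the `DeployPlan` class, the display names, the Hub model ID string, the serving port 8080, and the choice of container environment variables are illustrative, and the container image URI is left as a parameter rather than guessed.

```python
from dataclasses import dataclass


@dataclass
class DeployPlan:
    """Hypothetical holder for the settings the workflow needs."""
    project_id: str
    location: str
    container_uri: str  # URI of the TGI DLC image; supplied by the caller, not guessed here
    hf_token: str       # Hub token with access to the gated Llama 3.1 weights
    model_id: str = "meta-llama/Meta-Llama-3.1-405B-Instruct-FP8"  # assumed Hub ID
    machine_type: str = "a3-highgpu-8g"         # A3 VM with 8 GPUs
    accelerator_type: str = "NVIDIA_H100_80GB"
    accelerator_count: int = 8

    def container_env(self) -> dict:
        # Environment variables read by TGI inside the container.
        # NUM_SHARD asks TGI to shard the model across all 8 GPUs.
        return {
            "MODEL_ID": self.model_id,
            "NUM_SHARD": str(self.accelerator_count),
            "HUGGING_FACE_HUB_TOKEN": self.hf_token,
        }


def deploy(plan: DeployPlan):
    """Register the model, create an endpoint, and deploy onto the A3 machine."""
    # Imported lazily so the plan can be built and inspected without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=plan.project_id, location=plan.location)

    # Model registration: points Vertex AI at the serving container.
    model = aiplatform.Model.upload(
        display_name="llama-31-405b-fp8",  # illustrative name
        serving_container_image_uri=plan.container_uri,
        serving_container_environment_variables=plan.container_env(),
        serving_container_ports=[8080],    # assumed serving port
    )

    # Endpoint creation, then deployment (the slow step).
    endpoint = aiplatform.Endpoint.create(display_name="llama-31-405b-fp8-endpoint")
    model.deploy(
        endpoint=endpoint,
        machine_type=plan.machine_type,
        accelerator_type=plan.accelerator_type,
        accelerator_count=plan.accelerator_count,
    )
    return model, endpoint


def cleanup(model, endpoint) -> None:
    """Release the GPUs and delete the resources so billing stops."""
    endpoint.undeploy_all()
    endpoint.delete()
    model.delete()
```

Keeping the settings in a plain dataclass means the machine type, accelerator count, and container environment can be reviewed before any billable resource is created.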

Implications

Thread: HF as open-source ML hub. This tutorial is the practical payoff of the HF + GCP partnership, and of cloud infrastructure deals more broadly: Llama 3.1 405B goes from Hub weights to a Vertex AI endpoint via a standardized workflow. Even with FP8 quantization, the weights alone need ~405GB of VRAM (405B parameters at one byte each), so this is still A3-class infrastructure, out of reach without quota increases and significant cost. The tutorial establishes the deployment pattern that lower-cost cloud tiers will inherit as hardware improves. Because TGI-in-DLC is the deployment substrate, HF controls the inference layer even when the model runs on Google infrastructure.
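
The memory figures can be checked with back-of-the-envelope arithmetic. The sketch below counts weights only, assuming 1 byte per parameter for FP8, 2 bytes for 16-bit formats, and 1GB = 10^9 bytes. It ignores KV cache, activations, and runtime overhead, so real headroom is smaller than shown. The helper name `weight_gb` is illustrative.

```python
PARAMS_BILLIONS = 405   # Llama 3.1 405B
GPU_COUNT = 8           # one A3 instance
GPU_VRAM_GB = 80        # per H100


def weight_gb(params_billions: float, bytes_per_param: float) -> float:
    # 1e9 params * bytes per param / 1e9 bytes per GB: the 1e9 factors cancel.
    return params_billions * bytes_per_param


total_vram_gb = GPU_COUNT * GPU_VRAM_GB        # 640
fp8_gb = weight_gb(PARAMS_BILLIONS, 1)         # 405
bf16_gb = weight_gb(PARAMS_BILLIONS, 2)        # 810
headroom_gb = total_vram_gb - fp8_gb           # 235 left for KV cache and overhead

print(f"total VRAM: {total_vram_gb} GB")
print(f"FP8 weights: {fp8_gb} GB, fits: {fp8_gb <= total_vram_gb}")
print(f"16-bit weights: {bf16_gb} GB, fits: {bf16_gb <= total_vram_gb}")
print(f"headroom with FP8: {headroom_gb} GB")
```

Under these assumptions the FP8 weights fit on a single 8x H100 node with roughly 235GB to spare, while 16-bit weights (810GB) would not fit on one node at all.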
