2024-11-20 · HuggingFace

From Files to Chunks: Improving HF Storage Efficiency

models, enterprise

Source: HuggingFace · Date: 2024-11-20 · URL: https://huggingface.co/blog/from-files-to-chunks

Summary

Infrastructure post: HF’s Xet team describes its content-defined chunking (CDC) approach for Hub storage. Git LFS stores whole files, so any change requires re-uploading the full file; CDC instead splits files into variable-size chunks at boundaries chosen by a rolling hash and stores each identical chunk only once. Benchmark on CORD-19 (50 incremental updates): download time 51min→19min, upload time 47min→24min, storage 8.9GB→3.52GB. GPT-2 model across two versions: total storage 1.2GB→645MB (reported as 53% savings, although the rounded sizes quoted here work out to a reduction of about 46%). Projected Hub-wide impact: up to 100TB of immediate savings, then 7-8TB/month. Rollout planned for early 2025.
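To make the chunking idea concrete, here is a minimal sketch of content-defined chunking with a rolling hash. It is an illustration only, not the Xet team's implementation: the gear-style hash, the `MASK`, `MIN_SIZE`, and `MAX_SIZE` values, and the function names `chunk`, `chunk_ids`, and `dedup_demo` are all assumptions chosen for the example.

```python
import hashlib
import random

# Hypothetical parameters for illustration; not the values used on the Hub.
MASK = (1 << 11) - 1    # cut when the low 11 hash bits are zero (~2 KiB between candidates)
MIN_SIZE = 256          # never cut before a chunk reaches this many bytes
MAX_SIZE = 16 * 1024    # always cut once a chunk reaches this many bytes

# Fixed table of pseudo-random 32-bit values, one per byte value (gear-style hash).
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]


def chunk(data: bytes) -> list[bytes]:
    """Split data at boundaries chosen by a rolling hash of recent bytes."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Shifting left ages out old bytes, so the low bits of h depend
        # only on the most recent bytes, not on the position in the file.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def chunk_ids(data: bytes) -> list[str]:
    """Identify each chunk by the SHA-256 digest of its content."""
    return [hashlib.sha256(c).hexdigest() for c in chunk(data)]


def dedup_demo() -> tuple[int, int, int]:
    """Chunk two versions of a file and count chunks new in the second."""
    rng = random.Random(1)
    v1 = bytes(rng.getrandbits(8) for _ in range(200_000))
    v2 = v1[:100_000] + b"a small edit" + v1[100_000:]  # insert 12 bytes mid-file
    ids1, ids2 = chunk_ids(v1), chunk_ids(v2)
    stored = set(ids1)
    new_chunks = [cid for cid in ids2 if cid not in stored]
    return len(ids1), len(ids2), len(new_chunks)


if __name__ == "__main__":
    n1, n2, new = dedup_demo()
    print(f"v1: {n1} chunks, v2: {n2} chunks, new chunks to upload: {new}")
```

Because boundaries depend on content rather than byte offsets, the chunker realigns shortly after the inserted bytes, so only the chunks around the edit need to be stored or uploaded again. A fixed-offset split would shift every later block and lose that property.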

Implications

HF as open-source ML hub. With 30PB of data stored, the projected savings (100TB immediately, 7-8TB/month ongoing) are operationally significant: with Xet’s CDC, the cost of storing a new version scales with the size of the actual change rather than the size of the whole file. This is the kind of infrastructure investment that keeps the Hub financially sustainable as model files continue to grow.

Open-weights ecosystem health. The GPT-2 result (reported as 53% storage savings across two versions) demonstrates practical deduplication for sequences of model checkpoints, the pattern that matters most for Hub users uploading incremental fine-tunes or quantized variants. Once Xet-backed repos replace Git LFS, teams hosting multiple versions of large models should see substantially shorter upload times and lower storage costs without changing their workflows.
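The relative reductions implied by the figures quoted in the Summary can be checked with a few lines of arithmetic. The sketch below uses only those quoted, rounded numbers; the helper names `reduction_pct` and `summarize` are hypothetical, and treating 1.2GB as 1200MB is an assumption. On these rounded sizes the GPT-2 reduction comes out at roughly 46%, lower than the reported 53%, so the reported figure presumably rests on unrounded sizes or a different baseline.

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Before/after pairs as quoted in the Summary (rounded values).
REPORTED = {
    "CORD-19 download time (min)": (51, 19),
    "CORD-19 upload time (min)": (47, 24),
    "CORD-19 storage (GB)": (8.9, 3.52),
    # Assumption: 1.2GB is treated as 1200MB.
    "GPT-2 storage (MB)": (1200, 645),
}


def summarize() -> dict[str, float]:
    """Map each reported metric to its percentage reduction."""
    return {name: reduction_pct(b, a) for name, (b, a) in REPORTED.items()}


if __name__ == "__main__":
    for name, pct in summarize().items():
        print(f"{name}: {pct:.1f}% reduction")
```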
