2024-10-05 · HuggingFace

Improving Parquet Dedupe on Hugging Face Hub




Source: HuggingFace · Date: 2024-10-05 · URL: https://huggingface.co/blog/improve_parquet_dedupe

Summary

Technical research post: the HF Xet team investigates how well Content-Defined Chunking (CDC) deduplicates Parquet files across dataset updates. Findings: appending 10k rows achieves 99.1% dedupe (only 20MB of additional storage); modifying a single row achieves 89% dedupe (230MB additional); deletions dedupe poorly (near 50%) because Parquet stores absolute file offsets and column headers get rewritten. Context: HF hosts 11PB of datasets, 2.2PB of it in Parquet. Proposed fix: content-defined row groups, with boundaries chosen by a hash of a key column, so that dedupe stays efficient across insertions and deletions. The post invites collaboration with the Apache Arrow project.
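To make the CDC idea concrete, here is a minimal sketch of content-defined chunking on raw bytes, compared against fixed-size chunking. It is not the Xet implementation: the gear-style rolling hash, the roughly 1 KiB average chunk size, the random test data, and every name in the block are assumptions chosen for illustration. It also operates on plain bytes, so it does not model the Parquet metadata effects (absolute offsets, header rewrites) that the post blames for poor dedupe on deletions.

```python
import hashlib
import random

# Hypothetical parameters for illustration; production systems use far larger chunks.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]  # one random 64-bit value per byte value
MASK = (1 << 10) - 1    # 10 hash bits must be zero at a boundary -> ~1 KiB average chunk
U64 = (1 << 64) - 1


def cdc_chunks(data: bytes) -> list:
    """Split data where a rolling hash of the most recent bytes hits a pattern."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        # Older bytes shift out of the 64-bit state, so h depends only on recent content.
        h = ((h << 1) + GEAR[b]) & U64
        if ((h >> 32) & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk without a boundary
    return chunks


def fixed_chunks(data: bytes, size: int = 1024) -> list:
    """Baseline: split at fixed byte offsets, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def dedupe_ratio(old: bytes, new: bytes, chunker) -> float:
    """Fraction of bytes in `new` that sit in chunks already stored for `old`."""
    stored = {hashlib.sha256(c).digest() for c in chunker(old)}
    reused = sum(len(c) for c in chunker(new)
                 if hashlib.sha256(c).digest() in stored)
    return reused / len(new)


# Synthetic stand-ins for "a file" and two updated versions of it.
base = bytes(_rng.getrandbits(8) for _ in range(200_000))
appended = base + bytes(_rng.getrandbits(8) for _ in range(2_000))
deleted = base[:100_000] + base[100_100:]   # drop 100 bytes mid-file

for name, version in [("append", appended), ("delete", deleted)]:
    print(name,
          "cdc:", round(dedupe_ratio(base, version, cdc_chunks), 3),
          "fixed:", round(dedupe_ratio(base, version, fixed_chunks), 3))
```

Because chunk boundaries depend only on nearby content, a mid-file deletion disturbs the chunks around the edit and the rest re-align, whereas fixed-size chunks after the edit all shift and stop matching. On real Parquet files the picture is worse than this byte-level sketch suggests, since an edit also changes metadata elsewhere in the file.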

Implications

HF as open-source ML hub. With 11PB of datasets hosted and 2.2PB of that in Parquet, Parquet dedupe efficiency is a real infrastructure cost. A single-row modification that costs 230MB is the kind of issue that matters at HF's scale, and the proposed content-defined row groups would be a meaningful contribution back to the Apache Arrow ecosystem if adopted.
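The content-defined row group proposal can be sketched in a few lines. This is an illustrative reading of "split on hash of key column", not the post's or Arrow's implementation: rows are in-memory tuples rather than a Parquet writer, and `TARGET_ROWS`, `is_boundary`, and `row_groups` are hypothetical names introduced here.

```python
import hashlib

TARGET_ROWS = 100  # hypothetical: desired average number of rows per group


def is_boundary(key) -> bool:
    """Close a row group after this row if the key's hash hits a fixed pattern."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % TARGET_ROWS == 0


def row_groups(rows):
    """Group (key, value) rows in file order; boundaries depend only on key content."""
    groups, current = [], []
    for row in rows:
        current.append(row)
        if is_boundary(row[0]):
            groups.append(tuple(current))
            current = []
    if current:
        groups.append(tuple(current))  # trailing group without a boundary row
    return groups


rows = [(k, f"value-{k}") for k in range(10_000)]
without_row = rows[:5_000] + rows[5_001:]   # delete one row from the middle

old_groups = set(row_groups(rows))
new_groups = row_groups(without_row)
changed = [g for g in new_groups if g not in old_groups]
print(f"{len(changed)} of {len(new_groups)} row groups changed after the deletion")
```

Since a boundary is decided by each row's own key rather than by its position, deleting a row alters at most the group it belonged to (or merges it with its neighbor), and every other group keeps identical contents. Whether those identical groups also dedupe as bytes on disk still depends on how the file format encodes offsets and metadata.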

Open-weights ecosystem health. More efficient dataset updates directly benefit the data pipelines used to fine-tune open-weights models: faster, cheaper dataset iteration makes more experimental runs economically viable. The dedupe estimator tool released for community testing lowers the barrier to understanding storage costs when managing large datasets.
