2025-07-25 · HuggingFace

Parquet Content-Defined Chunking

research · commentary

Source: HuggingFace · Date: 2025-07-25 · URL: https://huggingface.co/blog/parquet-cdc

Summary

Feature release: Parquet Content-Defined Chunking (CDC) is integrated into PyArrow ≥ 21.0.0 and Pandas, for use with HF’s Xet storage backend. CDC keeps Parquet’s byte-level layout stable under small edits, so content-addressable storage can deduplicate the unchanged data and a minor dataset change transfers only the delta. Reported transfer sizes: appending 10K rows to a 100K-row table, 6 MB versus 89.8 MB without CDC; exact re-upload, 0 MB; adding or removing a column, 575 KB and 37.7 KB respectively. HF hosts ~4 PB of Parquet data, and deduplication works across repositories.
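The delta-transfer behaviour summarized above rests on two ideas: chunk boundaries chosen by content rather than by offset, and a store keyed by chunk hash. The following is a minimal, generic sketch of those ideas, assuming a gear-style rolling hash and SHA-256 chunk addresses. It is not the algorithm Parquet CDC or Xet actually uses; `cdc_chunks`, `upload`, and all size parameters are hypothetical names and values chosen for illustration.

```python
import hashlib
import random

# Hypothetical parameters for illustration; not the values used by Parquet or Xet.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]  # one random 64-bit value per byte
MASK64 = (1 << 64) - 1


def cdc_chunks(data: bytes, bits: int = 10, min_size: int = 64, max_size: int = 8192) -> list:
    """Cut data wherever a rolling hash of the most recent bytes has `bits` leading zeros."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Gear hash: each step shifts older bytes out, so h depends on the last 64 bytes only.
        h = ((h << 1) + GEAR[byte]) & MASK64
        size = i - start + 1
        if (size >= min_size and h >> (64 - bits) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def upload(data: bytes, store: set) -> int:
    """Add unseen chunks to a content-addressed store; return the bytes that had to be sent."""
    sent = 0
    for chunk in cdc_chunks(data):
        digest = hashlib.sha256(chunk).digest()
        if digest not in store:
            store.add(digest)
            sent += len(chunk)
    return sent


# Random bytes stand in for a serialized dataset.
base = _rng.getrandbits(8 * 200_000).to_bytes(200_000, "little")
extra = _rng.getrandbits(8 * 20_000).to_bytes(20_000, "little")
edited = base[:100_000] + b"inserted rows" + base[100_000:]

store = set()
first = upload(base, store)             # first upload: every chunk is new
again = upload(base, store)             # identical re-upload: nothing to send
appended = upload(base + extra, store)  # append: only the tail chunks are new
inserted = upload(edited, store)        # mid-file insert: boundaries realign after the edit
print(first, again, appended, inserted)
```

With fixed-offset chunks, the mid-file insert would shift every later chunk and force most of the file to be re-sent; content-defined boundaries realign shortly after the edit. In PyArrow itself the feature is switched on by passing `use_content_defined_chunking=True` to `pyarrow.parquet.write_table`.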

Implications

Thread: HF as open-source ML hub. Parquet CDC is a meaningful infrastructure improvement for iterative dataset development: the most common Hub workflow (expand a dataset, re-upload) goes from full-file transfer to delta transfer. At HF’s ~4 PB Parquet scale, the storage savings are material. Cross-repository deduplication is the sleeper feature: datasets that share base data across forks or versions no longer pay for redundant storage. This is the Xet acquisition payoff becoming visible at the application layer: CDC in PyArrow with an hf:// URI adds very little friction for users.
