2025-10-27 · HuggingFace

Streaming datasets: 100x More Efficient

infrastructure

Source: HuggingFace · Date: 2025-10-27 · URL: https://huggingface.co/blog/streaming-datasets

Summary

Library update: major streaming optimizations in the datasets library. A persistent data files cache means only one worker resolves the file list, which cuts startup requests by 100x. Data file resolution is 10x faster, streaming throughput is 2x faster, and in-flight requests are 2x more efficient. Parquet prefetching was added via ParquetFragmentScanOptions. On 64xH100 with 256 workers, streaming now matches local SSD performance.
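
For orientation, here is a minimal sketch of what tuning Parquet prefetching might look like from user code. It assumes pyarrow and a recent datasets release are installed, and that load_dataset accepts a fragment_scan_options argument as the source post describes. The helper names, the numeric values, and the repo id in the usage comment are illustrative, not recommendations.

```python
def build_scan_options(prefetch_limit=1, range_size_limit=128 << 20):
    """Build Parquet scan options that control read-ahead buffering."""
    # Imported lazily so this module loads even without pyarrow installed.
    import pyarrow
    import pyarrow.dataset

    return pyarrow.dataset.ParquetFragmentScanOptions(
        cache_options=pyarrow.CacheOptions(
            prefetch_limit=prefetch_limit,      # ranges to fetch ahead (illustrative)
            range_size_limit=range_size_limit,  # max bytes per coalesced range read
        ),
    )


def stream_parquet(repo_id, split="train"):
    """Open a Hub Parquet dataset in streaming mode with custom scan options."""
    from datasets import load_dataset

    # Assumption: the installed `datasets` version forwards
    # `fragment_scan_options` to its Parquet reader.
    return load_dataset(
        repo_id,
        split=split,
        streaming=True,
        fragment_scan_options=build_scan_options(),
    )


# Usage (needs network access; the repo id below is hypothetical):
#   ds = stream_parquet("user/my-parquet-dataset")
#   for example in ds.take(3):
#       print(example)
```

Nothing is downloaded until the returned dataset is iterated, so the scan options only take effect once streaming starts.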

Implications

Thread: transformers library trajectory / HF as open-source ML hub. Streaming performance that matches local SSD at 64xH100 scale removes a real bottleneck: teams that couldn’t afford petabyte-scale local data caches now have a viable alternative. The 100x reduction in startup requests also matters for Hub infrastructure, since file resolution endpoints take far less load at scale. The persistent cache design is the key architectural move: it shifts the cost model from per-worker resolution to once-per-job. Watch whether this prompts teams to abandon local dataset mirrors entirely in favor of Hub-streaming training pipelines.
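
The shift from per-worker to once-per-job resolution can be illustrated with a toy model. This is a sketch under stated assumptions, not the datasets library's implementation: FakeHub, both resolve functions, and the JSON cache file are hypothetical stand-ins, and workers run one after another so no file locking is needed.

```python
import json
import tempfile
from pathlib import Path


class FakeHub:
    """Stand-in for a remote file-listing endpoint; counts requests."""

    def __init__(self, files):
        self.files = list(files)
        self.requests = 0

    def list_files(self):
        self.requests += 1
        return list(self.files)


def resolve_per_worker(hub, num_workers):
    """Per-worker cost model: every worker asks the remote for the file list."""
    return [hub.list_files() for _ in range(num_workers)]


def resolve_once_per_job(hub, num_workers, cache_path):
    """Once-per-job cost model: the first worker resolves and persists the
    list; every later worker reads the persisted copy instead."""
    results = []
    for _ in range(num_workers):
        if not cache_path.exists():
            cache_path.write_text(json.dumps(hub.list_files()))
        results.append(json.loads(cache_path.read_text()))
    return results


files = [f"shard-{i:05d}.parquet" for i in range(4)]
num_workers = 256

hub_a = FakeHub(files)
resolve_per_worker(hub_a, num_workers)

hub_b = FakeHub(files)
with tempfile.TemporaryDirectory() as tmp:
    lists = resolve_once_per_job(hub_b, num_workers, Path(tmp) / "data_files.json")

print(hub_a.requests, hub_b.requests)  # 256 1
```

The request count in the toy scales with the number of workers in the first case and stays constant in the second. The 100x figure in the summary is a measured result from the source post and is not derived from this model.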
