Streaming datasets: 100x More Efficient
Source: HuggingFace Date: 2025-10-27 URL: https://huggingface.co/blog/streaming-datasets
Summary
Library update: major streaming optimizations in the datasets library. A persistent data files cache cuts startup requests 100x, because only one worker resolves the file list and the others reuse it. Data file resolution is 10x faster, streaming throughput is 2x faster, and in-flight requests are 2x more efficient. Parquet prefetching is added via ParquetFragmentScanOptions. On 64xH100 with 256 workers, streaming now matches local SSD performance.
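To make the ParquetFragmentScanOptions point in the summary concrete, here is a minimal sketch of how a streamed Parquet dataset might be opened with explicit prefetch settings. It assumes a recent datasets release that forwards a fragment_scan_options argument to its Parquet reader and a pyarrow version that provides CacheOptions. The helper names, the default values, and the dataset id passed by the caller are illustrative, not taken from the source.

```python
def cache_option_kwargs(prefetch_limit=1, range_size_limit=128 << 20):
    """Return the buffering knobs as a plain dict (illustrative defaults).

    prefetch_limit: how many read ranges to fetch ahead of the reader.
    range_size_limit: upper bound, in bytes, on one coalesced read (128 MiB here).
    """
    return {"prefetch_limit": prefetch_limit, "range_size_limit": range_size_limit}


def build_fragment_scan_options(**overrides):
    """Build Parquet scan options carrying the cache settings above."""
    # Imported lazily so the module still loads where pyarrow is not installed.
    import pyarrow
    import pyarrow.dataset

    cache_options = pyarrow.CacheOptions(**cache_option_kwargs(**overrides))
    return pyarrow.dataset.ParquetFragmentScanOptions(cache_options=cache_options)


def stream_rows(dataset_id, limit=3, split="train"):
    """Stream the first `limit` rows of a Parquet dataset hosted on the Hub.

    Assumption: load_dataset accepts fragment_scan_options for Parquet
    datasets in the installed datasets version.
    """
    from datasets import load_dataset

    ds = load_dataset(
        dataset_id,
        split=split,
        streaming=True,  # rows are read on the fly instead of downloading the dataset first
        fragment_scan_options=build_fragment_scan_options(),
    )
    return list(ds.take(limit))
```

Calling stream_rows with the repo id of any Parquet dataset on the Hub would return its first few rows as dictionaries. Raising range_size_limit trades memory for fewer, larger requests.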
Implications
Thread: transformers library trajectory / HF as open-source ML hub. Streaming performance that matches local SSD at 64xH100 scale removes a real bottleneck: teams that couldn't afford petabyte-scale local data caches now have a viable alternative. The 100x reduction in startup requests also matters for Hub infrastructure, since large jobs put far less load on the file resolution endpoints. The persistent cache design is the key architectural move: it shifts the cost model from per-worker resolution to once-per-job resolution. Watch whether this leads teams to abandon local dataset mirrors entirely in favor of Hub-streaming training pipelines.
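The shift from per-worker to once-per-job resolution can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the datasets library's implementation: FakeHub, the JSON cache file, and the shard names are all hypothetical, and the workers run one after another, so no locking is modeled.

```python
import json
import os
import tempfile


class FakeHub:
    """Hypothetical stand-in for a remote file-listing endpoint; counts requests."""

    def __init__(self, files):
        self.files = list(files)
        self.requests = 0

    def list_files(self):
        self.requests += 1
        return list(self.files)


def resolve_per_worker(hub, num_workers):
    """Per-worker cost model: every worker asks the hub for the file list."""
    return [hub.list_files() for _ in range(num_workers)]


def resolve_once_per_job(hub, num_workers, cache_path):
    """Once-per-job cost model: the first worker resolves and persists the
    list; every later worker reads the persisted copy instead."""
    results = []
    for _ in range(num_workers):
        if os.path.exists(cache_path):
            with open(cache_path) as f:
                results.append(json.load(f))
        else:
            files = hub.list_files()
            with open(cache_path, "w") as f:
                json.dump(files, f)
            results.append(files)
    return results


def compare(num_workers=256):
    """Return (requests without cache, requests with cache)."""
    files = [f"shard-{i:05d}.parquet" for i in range(8)]

    naive_hub = FakeHub(files)
    resolve_per_worker(naive_hub, num_workers)

    cached_hub = FakeHub(files)
    with tempfile.TemporaryDirectory() as tmp:
        cache_path = os.path.join(tmp, "data_files.json")
        results = resolve_once_per_job(cached_hub, num_workers, cache_path)

    # Every worker must end up with the same file list either way.
    assert all(r == files for r in results)
    return naive_hub.requests, cached_hub.requests
```

In this toy model the saving equals the worker count, so compare(256) reports 256 requests against 1. That ratio is a property of the simulation and is not meant to reproduce the 100x figure reported in the source. Truly concurrent workers would also need a lock or an atomic write around the cache file.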