2024-08-21 · HuggingFace

Improving Hugging Face Training Efficiency Through Packing with Flash Attention 2

research · infrastructure

Source: HuggingFace · Date: 2024-08-21 · URL: https://huggingface.co/blog/packing-with-FA2

Summary

Library update: DataCollatorWithFlattening added to Transformers and TRL, enabling correct sequence packing with Flash Attention 2. Prior packing implementations allowed cross-example attention, degrading quality; the new collator concatenates examples without padding and tracks each example's boundaries (cu_seqlens) so that attention stays within a single example. Benchmarks on the FLAN dataset: up to 2x throughput, 20% memory reduction. On OrcaMath: 1.4x throughput, 6% memory reduction. No convergence degradation. Supports 14 model architectures.
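To make the boundary tracking concrete, here is a minimal pure-Python sketch of the idea, not the library's implementation. The helper names (flatten_examples, cu_seqlens_from_position_ids, can_attend) and the token ids are hypothetical, and the sketch assumes every example is non-empty and that position ids restart at 0 for each packed example.

```python
from bisect import bisect_right

def flatten_examples(examples):
    """Concatenate tokenized examples into one padding-free sequence."""
    input_ids, position_ids = [], []
    for ids in examples:
        input_ids.extend(ids)
        position_ids.extend(range(len(ids)))  # positions restart at 0 for each example
    return input_ids, position_ids

def cu_seqlens_from_position_ids(position_ids):
    """Cumulative sequence lengths: every position 0 starts a new example."""
    starts = [i for i, p in enumerate(position_ids) if p == 0]
    return starts + [len(position_ids)]

def can_attend(cu_seqlens, q, k):
    """Causal attention restricted to the query token's own example."""
    same_example = bisect_right(cu_seqlens, q) == bisect_right(cu_seqlens, k)
    return same_example and k <= q

# Hypothetical token ids for three examples of different lengths.
examples = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
input_ids, position_ids = flatten_examples(examples)
cu_seqlens = cu_seqlens_from_position_ids(position_ids)

print(position_ids)  # [0, 1, 2, 0, 1, 0, 1, 2, 3]
print(cu_seqlens)    # [0, 3, 5, 9]
print(can_attend(cu_seqlens, 4, 3))  # True: both tokens are in the second example
print(can_attend(cu_seqlens, 3, 2))  # False: token 2 belongs to the first example
```

Without the boundary check, token 3 would attend to tokens 0 through 2 of the previous example, which is the cross-example attention described above.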

Implications

Thread: transformers library trajectory. Sequence packing with correct attention boundaries is a high-value training efficiency improvement: the 2x throughput gain on variable-length datasets like FLAN is substantial. Earlier packing implementations silently produced incorrect attention (cross-example contamination), a quality issue that was likely degrading fine-tuning results for users who enabled packing without FA2 boundary tracking. The cu_seqlens solution is standard in efficient transformer implementations; it is good hygiene to have it in the official collator rather than requiring users to implement it correctly themselves.
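One plausible reading of why variable-length datasets benefit most is padding waste: when a batch is padded to its longest example, a wide spread of lengths means many of the processed tokens are padding, and packing removes them. The sketch below illustrates that arithmetic only. The function name and the example lengths are hypothetical and are not taken from the benchmarks.

```python
def padding_fraction(lengths):
    """Fraction of tokens that are padding when a batch is padded to its longest example."""
    padded_tokens = max(lengths) * len(lengths)
    return 1 - sum(lengths) / padded_tokens

# Hypothetical batches: one with widely varying lengths, one nearly uniform.
varied = [32, 48, 512, 64]
uniform = [480, 500, 512, 490]

print(f"{padding_fraction(varied):.1%}")   # 68.0%
print(f"{padding_fraction(uniform):.1%}")  # 3.2%
```

Under these assumed lengths, the varied batch spends about two thirds of its tokens on padding, while the uniform batch spends almost none, so packing has far more to recover in the first case.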
