Improving Hugging Face Training Efficiency Through Packing with Flash Attention 2
Source: HuggingFace Date: 2024-08-21 URL: https://huggingface.co/blog/packing-with-FA2
Summary
Library update: DataCollatorWithFlattening has been added to Transformers and TRL, enabling correct sequence packing with Flash Attention 2. Earlier packing implementations let tokens attend across example boundaries, which degraded quality; the fix tracks example boundaries through cu_seqlens (cumulative sequence lengths). Benchmarks: on the FLAN dataset, up to 2x throughput and 20% lower memory use; on OrcaMath, 1.4x throughput and 6% lower memory use. No convergence degradation was observed. 14 model architectures are supported.
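The packing idea can be sketched in plain Python. This is an illustration under stated assumptions, not the Transformers implementation: the helper names `flatten_examples` and `cu_seqlens_from_position_ids` and the dictionary keys are hypothetical. Examples are concatenated without padding, position IDs restart at each example, and cu_seqlens records where each example begins and ends.

```python
from itertools import accumulate


def flatten_examples(examples):
    """Concatenate token-id lists into one padding-free sequence.

    Hypothetical helper for illustration only; not the library API.
    Assumes every example contains at least one token.
    """
    input_ids = [tok for ex in examples for tok in ex]
    # Position IDs restart at 0 for every example.
    position_ids = [i for ex in examples for i in range(len(ex))]
    # cu_seqlens[k] is the offset where example k starts;
    # the final entry is the total number of tokens.
    cu_seqlens = [0] + list(accumulate(len(ex) for ex in examples))
    return {
        "input_ids": input_ids,
        "position_ids": position_ids,
        "cu_seqlens": cu_seqlens,
    }


def cu_seqlens_from_position_ids(position_ids):
    """Recover example boundaries from position IDs alone.

    A position ID of 0 marks the start of a new example.
    """
    starts = [i for i, p in enumerate(position_ids) if p == 0]
    return starts + [len(position_ids)]


batch = flatten_examples([[11, 12, 13], [21, 22], [31, 32, 33, 34]])
print(batch["cu_seqlens"])
```

Because the boundaries can be recovered from the restarting position IDs, a collator only needs to emit the flattened tokens and position IDs; no padding tokens are produced, which is where the throughput and memory savings on variable-length data come from.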
Implications
Thread: transformers library trajectory. Sequence packing with correct attention boundaries is a high-value training efficiency improvement: a throughput gain of up to 2x on variable-length datasets like FLAN is substantial. Earlier packing implementations silently allowed attention across examples (cross-example contamination), a quality issue that was likely degrading fine-tuning results for users who enabled packing without FA2 boundary tracking. The cu_seqlens solution is standard in efficient transformer implementations, and having it in the official collator is good hygiene because users no longer need to implement it correctly themselves.
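To make the cross-example contamination concrete, the sketch below compares a plain causal mask over a packed sequence with a boundary-aware mask built from cu_seqlens. It is a visualization aid only: the function names are hypothetical, and Flash Attention 2's variable-length kernels do not materialize a dense mask like this.

```python
def causal_mask(total_len):
    """Plain causal mask: query q may attend to every key k <= q."""
    return [[k <= q for k in range(total_len)] for q in range(total_len)]


def packed_causal_mask(cu_seqlens):
    """Causal mask that also blocks attention across example boundaries."""
    total = cu_seqlens[-1]
    # Map each token index to the example it belongs to.
    segment = []
    for ex_id, (start, end) in enumerate(zip(cu_seqlens, cu_seqlens[1:])):
        segment.extend([ex_id] * (end - start))
    return [
        [k <= q and segment[k] == segment[q] for k in range(total)]
        for q in range(total)
    ]


def count_leaks(cu_seqlens):
    """Count (query, key) pairs the plain mask allows but the packed mask forbids."""
    naive = causal_mask(cu_seqlens[-1])
    correct = packed_causal_mask(cu_seqlens)
    return sum(
        n and not c
        for row_n, row_c in zip(naive, correct)
        for n, c in zip(row_n, row_c)
    )


# Three packed examples of lengths 3, 2, and 4.
leaks = count_leaks([0, 3, 5, 9])
print(leaks)
```

With three short examples, more than half of the attention pairs permitted by the plain causal mask cross an example boundary, which shows how packing without boundary tracking can let unrelated examples influence each other.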