2024-10-16 · HuggingFace

Fixing Gradient Accumulation

Source: HuggingFace · Date: 2024-10-16 · URL: https://huggingface.co/blog/gradient_accumulation

Summary

Library bug fix: gradient accumulation in Transformers was mathematically incorrect for token-level tasks such as causal LM training. Root cause: the loss was averaged within each micro-batch and those averages were then averaged across accumulation steps, instead of summing the loss over all non-padding tokens and dividing by the total token count. The two only agree when every micro-batch has the same number of valid tokens. Fix: use reduction="sum" in cross_entropy, then divide by the total number of non-padding tokens across the accumulated micro-batches. The fix shipped to main within 24 hours; a new loss_function property on PreTrainedModel, intended as a custom loss API, is in progress.
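The discrepancy described above can be reproduced with a small numerical sketch. This is an illustration in plain Python, not the Transformers implementation: the helper names (token_nll, summed_loss, mean_of_means, token_normalized) and the toy logits are hypothetical, and the only borrowed convention is the label value -100 for padding, which matches the default ignore_index of PyTorch's cross_entropy.

```python
import math

IGNORE_INDEX = -100  # padding label; mirrors PyTorch's default ignore_index


def token_nll(logits, label):
    """Cross-entropy of one token: -log softmax(logits)[label]."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def summed_loss(logits_list, labels):
    """Return (sum of per-token losses, count of non-padding tokens)."""
    total, count = 0.0, 0
    for logits, label in zip(logits_list, labels):
        if label == IGNORE_INDEX:
            continue  # padding tokens contribute nothing
        total += token_nll(logits, label)
        count += 1
    return total, count


def mean_of_means(micro_batches):
    """Pre-fix behavior (sketch): average each micro-batch, then average those."""
    means = []
    for logits_list, labels in micro_batches:
        s, n = summed_loss(logits_list, labels)
        means.append(s / n)  # assumes every micro-batch has a valid token
    return sum(means) / len(means)


def token_normalized(micro_batches):
    """Post-fix behavior (sketch): sum everything, divide by total valid tokens."""
    parts = [summed_loss(logits_list, labels) for logits_list, labels in micro_batches]
    return sum(s for s, _ in parts) / sum(n for _, n in parts)


# Toy data over a 2-token vocabulary.
# Micro-batch A: three confident, correct tokens and one padding position.
batch_a = ([[2.0, 0.0]] * 3 + [[0.0, 0.0]], [0, 0, 0, IGNORE_INDEX])
# Micro-batch B: one uncertain token and three padding positions.
batch_b = ([[0.0, 0.0]] * 4, [1, IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX])
micro_batches = [batch_a, batch_b]

# Reference: the same tokens processed as one full batch.
full_sum, full_count = summed_loss(batch_a[0] + batch_b[0], batch_a[1] + batch_b[1])
full_batch_loss = full_sum / full_count

print("full batch       :", full_batch_loss)
print("token-normalized :", token_normalized(micro_batches))
print("mean of means    :", mean_of_means(micro_batches))
```

In this toy case the single token in micro-batch B carries as much weight under mean-of-means as the three tokens in micro-batch A combined, so the accumulated loss overstates B's contribution. Token-level normalization reproduces the full-batch value.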

Implications

Transformers library trajectory. This is a correctness bug, not a performance bug: a model fine-tuned with gradient accumulation in Transformers before this fix received different effective gradients than an equivalent full-batch run whenever its micro-batches contained different numbers of non-padding tokens. The fix is a breaking change to training behavior (loss values will change), so teams comparing training runs from before and after the fix need to account for the changed loss semantics.

Open-weights ecosystem health. The breadth of fine-tunes performed with gradient accumulation in Transformers before this fix (virtually all QLoRA fine-tunes on consumer hardware) means a non-trivial fraction of community fine-tuned models were trained with mathematically incorrect gradients. The practical impact on final model quality is unclear, but reproducibility claims for any fine-tune that used gradient accumulation before this fix are suspect without re-running.
