2025-08-08 · HuggingFace

Accelerate ND-Parallel: A guide to Efficient Multi-GPU Training

models · research · infrastructure

Source: HuggingFace · Date: 2025-08-08 · URL: https://huggingface.co/blog/accelerate-nd-parallel

Summary

Library feature guide: Accelerate ND-Parallel introduces N-dimensional parallelism composition via a unified ParallelismConfig class. It supports combinations of FSDP (weight and gradient sharding), DP (replicated models, sharded data), Tensor Parallel (layer splitting within nodes), and Context Parallel (sequence-length sharding via Ring Attention). Key compositions: Hybrid Sharded Data Parallel (HSDP: FSDP within nodes, DP across nodes), FSDP+TP, FSDP+CP, and HSDP+TP. Integrates with Axolotl. No benchmark numbers: this is a methodology guide.
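To make the composition idea concrete, here is a minimal sketch of how four parallelism degrees multiply into a world size and how a flat rank maps onto the resulting mesh. MeshPlan and its field names are hypothetical stand-ins written for illustration, not the Accelerate ParallelismConfig API. The axis ordering (TP varying fastest, so TP groups land on neighbouring ranks within a node) is an assumption chosen to match the "layer splitting within nodes" description.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class MeshPlan:
    """Hypothetical stand-in for a parallelism config (not the Accelerate class)."""
    dp_replicate: int = 1  # DP: full model replicas, each fed different data
    dp_shard: int = 1      # FSDP: weights and gradients sharded across ranks
    cp: int = 1            # Context Parallel: sequence sharded across ranks
    tp: int = 1            # Tensor Parallel: layers split across ranks

    def sizes(self):
        return (self.dp_replicate, self.dp_shard, self.cp, self.tp)

    def world_size(self):
        # Every degree multiplies the number of processes required.
        return prod(self.sizes())

    def coords(self, rank):
        """Map a flat rank to (dp_replicate, dp_shard, cp, tp); tp varies fastest (assumed)."""
        if not 0 <= rank < self.world_size():
            raise ValueError(f"rank {rank} outside world of size {self.world_size()}")
        out = []
        for size in reversed(self.sizes()):
            out.append(rank % size)
            rank //= size
        return tuple(reversed(out))


# HSDP+TP on two hypothetical 8-GPU nodes: replicate x2, shard x2, tensor-split x4.
hsdp_tp = MeshPlan(dp_replicate=2, dp_shard=2, tp=4)
print(hsdp_tp.world_size())  # 16
print(hsdp_tp.coords(5))     # (0, 1, 0, 1)
```

A degree of 1 leaves that axis unused, so the same structure covers plain FSDP, FSDP+CP, or all four axes together.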

Implications

Transformers library trajectory. A single ParallelismConfig that composes four parallelism strategies is the right level of abstraction for multi-node training: teams currently configure FSDP, TP, and CP through separate mechanisms with incompatible APIs. Because ND-Parallel unifies these in Accelerate, existing Accelerate training pipelines can adopt multi-dimensional parallelism with minimal code changes.

Open-weights ecosystem health. Context Parallel (Ring Attention for sequence sharding) is the critical addition for teams training on long-context data: memory consumed by long sequences often becomes the binding constraint before weight sharding is needed. Because FSDP+CP shards weights and sequences at the same time, it makes 128k+ context training at 70B+ scale feasible on available multi-node hardware.
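A rough back-of-the-envelope sketch of why the two kinds of sharding are complementary. The helper per_rank_footprint is hypothetical, and the model is deliberately crude: it assumes bf16 weights (2 bytes per parameter), divides weights only by the FSDP degree, and ignores gradients, optimizer state, and activation memory. The degrees of 8 are illustrative choices, not figures from the guide.

```python
def per_rank_footprint(n_params, seq_len, dp_shard, cp, bytes_per_param=2):
    """Return (weight bytes per rank, tokens per rank) under a simplified model.

    Assumptions: weights are split evenly across dp_shard ranks, the sequence
    is split evenly across cp ranks, and nothing else is counted.
    """
    if seq_len % cp != 0:
        raise ValueError("seq_len must be divisible by cp for an even split")
    weight_bytes = n_params * bytes_per_param // dp_shard
    tokens = seq_len // cp
    return weight_bytes, tokens


# 70B parameters, 128k-token sequences, FSDP degree 8, CP degree 8 (illustrative).
weight_bytes, tokens = per_rank_footprint(
    n_params=70_000_000_000, seq_len=128 * 1024, dp_shard=8, cp=8
)
print(weight_bytes / 1e9)  # 17.5 (GB of bf16 weights per rank)
print(tokens)              # 16384 (tokens of each sequence held per rank)
```

Unsharded, the same weights would take 140 GB in bf16 and each rank would hold all 131,072 tokens, which is the combination the paragraph above describes as the binding constraint.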
