2026-03-09 · HuggingFace

Ulysses Sequence Parallelism: Training with Million-Token Contexts

Source: HuggingFace · Date: 2026-03-09 · URL: https://huggingface.co/blog/ulysses-sp

Summary

Integration tutorial announcing production-ready Ulysses Sequence Parallelism (SP) support across Accelerate, the Transformers Trainer, and the TRL SFTTrainer. Ulysses partitions each sequence across GPUs and exchanges the shards via all-to-all communication, enabling 12x longer training sequences on the same hardware. Benchmarks on 4x H100 80GB with Qwen3-4B: SP=4 reaches 96K-token sequences versus an 8K baseline (the 12x figure), with a 3.7x throughput improvement at 64K tokens. The tutorial reports loss equivalence with standard data parallelism. Requires deepspeed≥0.18.1 and accelerate≥1.12.
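
To make the all-to-all step concrete, here is a minimal pure-Python sketch of the data movement Ulysses relies on: each rank starts with a slice of the sequence and all attention heads, and after the exchange each rank holds the full sequence for a slice of the heads. This is a single-process simulation for illustration only. The function names (`shard_sequence`, `all_to_all_seq_to_heads`, `all_to_all_heads_to_seq`) are hypothetical and are not part of DeepSpeed, Accelerate, or Transformers, which perform this exchange on tensors with collective communication.

```python
# Single-process simulation of the Ulysses all-to-all layout change.
# Assumption: "ranks" are plain list indices; no real GPUs or collectives are used.

def shard_sequence(tokens, world_size):
    """Split tokens[t][h] into contiguous, equal sequence shards, one per rank."""
    assert len(tokens) % world_size == 0, "sequence length must divide evenly"
    chunk = len(tokens) // world_size
    return [tokens[r * chunk:(r + 1) * chunk] for r in range(world_size)]

def all_to_all_seq_to_heads(seq_shards, num_heads):
    """Sequence-sharded -> head-sharded.

    Input:  seq_shards[rank][local_t][h]  (part of the sequence, all heads)
    Output: out[rank][t][local_h]         (full sequence, part of the heads)
    """
    world_size = len(seq_shards)
    assert num_heads % world_size == 0, "head count must divide evenly"
    heads_per_rank = num_heads // world_size
    out = []
    for dst in range(world_size):
        lo, hi = dst * heads_per_rank, (dst + 1) * heads_per_rank
        full_seq = []
        for src in range(world_size):  # source-rank order preserves token order
            for token in seq_shards[src]:
                full_seq.append(token[lo:hi])
        out.append(full_seq)
    return out

def all_to_all_heads_to_seq(head_shards):
    """Head-sharded -> sequence-sharded (the inverse exchange after attention)."""
    world_size = len(head_shards)
    seq_len = len(head_shards[0])
    chunk = seq_len // world_size
    out = []
    for dst in range(world_size):
        local = []
        for t in range(dst * chunk, (dst + 1) * chunk):
            token = []
            for src in range(world_size):  # concatenate head slices in rank order
                token.extend(head_shards[src][t])
            local.append(token)
        out.append(local)
    return out

# Toy example: 8 tokens, 4 heads, 4 ranks. Each entry is a (token, head) label.
seq_len, num_heads, world_size = 8, 4, 4
tokens = [[(t, h) for h in range(num_heads)] for t in range(seq_len)]

before = shard_sequence(tokens, world_size)          # 2 tokens x 4 heads per rank
after = all_to_all_seq_to_heads(before, num_heads)   # 8 tokens x 1 head per rank
restored = all_to_all_heads_to_seq(after)            # back to the original layout

print(len(before[0]), len(before[0][0]))  # tokens and heads held by rank 0 before
print(len(after[0]), len(after[0][0]))    # tokens and heads held by rank 0 after
```

In this layout every rank can compute attention over the whole sequence for its own heads, which is why the head count must be divisible by the SP degree in the sketch. Applied to the benchmark figures, a 96K-token sequence at SP=4 corresponds to 24K tokens per GPU outside the attention step.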

Implications

Thread: transformers library trajectory. This closes a significant training gap in the HF ecosystem: million-token context training was previously accessible only through custom distributed setups. With SP now wired into SFTTrainer and TrainingArguments via a single config object, long-context LLM fine-tuning becomes a mainstream workflow rather than an infrastructure project. Support for 2D parallelism (SP combined with ZeRO-3) suggests the integration targets multi-GPU training at serious scale, not just single-node experiments. Watch whether this becomes the standard path for fine-tuning the long-context models (e.g., Gemma 3, Qwen2.5-1M) that are proliferating.
