No GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL
Source: Hugging Face
Date: 2025-06-03
URL: https://huggingface.co/blog/vllm-colocate
Summary
Library update: TRL adds co-located vLLM support (vllm_mode="colocate"), which lets training and inference share the same GPUs instead of reserving dedicated inference hardware. This removes the GPU idle time that arises in online RL methods such as GRPO, where generation and training alternate and each GPU pool otherwise waits on the other. Benchmarks: a 1.43x speedup for 1.5B models and 1.73x for 7B with TP=8. For Qwen2.5-Math-72B on 8xH100, co-location achieves a 1.26x throughput improvement even with 4 fewer GPUs, with no regression in model quality on Math500.
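As a minimal sketch of how the setting might be enabled, the snippet below collects the relevant keyword arguments for TRL's GRPOConfig. The parameter names (use_vllm, vllm_mode, vllm_tensor_parallel_size, vllm_gpu_memory_utilization) are believed to match recent TRL releases, but the values shown and the build_config helper are illustrative assumptions; check the documentation for the installed version before relying on them.

```python
# Settings that select co-located vLLM generation in TRL's GRPOConfig.
# Values are illustrative assumptions, not tuned recommendations.
COLOCATE_KWARGS = {
    "use_vllm": True,                    # generate completions with vLLM
    "vllm_mode": "colocate",             # share GPUs with training, no separate server
    "vllm_tensor_parallel_size": 1,      # assumed: one GPU per vLLM engine
    "vllm_gpu_memory_utilization": 0.3,  # assumed: fraction of GPU memory vLLM may use
}


def build_config(output_dir):
    """Hypothetical helper: build a GRPOConfig with co-location enabled."""
    # Imported lazily so the settings above can be inspected without TRL installed.
    from trl import GRPOConfig

    return GRPOConfig(output_dir=output_dir, **COLOCATE_KWARGS)
```

The resulting config would be passed to GRPOTrainer in the usual way, with the script started through a multi-GPU launcher such as accelerate launch. Because training and generation compete for the same device memory in this mode, the memory fraction left to vLLM is likely the first value to tune.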
Implications
Transformers library trajectory. Co-located vLLM in TRL is a step toward making online RL (GRPO, PPO) practical on realistic hardware budgets. Previously, a 72B online RL run needed a separate GPU pool for inference. Removing that requirement lowers a meaningful infrastructure barrier for teams doing post-training at scale without datacenter-level GPU counts.
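To make the hardware-budget argument concrete, here is a toy model of GPU utilization when inference runs on a separate pool. It assumes generation and training strictly alternate, so each pool sits idle while the other works. The function name, the GPU split, and the phase durations are hypothetical and are not measurements from the source.

```python
def busy_fraction_separate_pools(n_train, n_infer, t_gen, t_train):
    """Fraction of total GPU-time spent working when pools take turns.

    Assumes a strictly sequential step: inference GPUs generate for t_gen
    while training GPUs wait, then training GPUs train for t_train while
    inference GPUs wait.
    """
    total_gpu_time = (n_train + n_infer) * (t_gen + t_train)
    busy_gpu_time = n_infer * t_gen + n_train * t_train
    return busy_gpu_time / total_gpu_time


# Hypothetical split: 6 training GPUs, 2 inference GPUs,
# 40 s of generation and 60 s of training per step.
separate = busy_fraction_separate_pools(6, 2, 40.0, 60.0)
print(f"separate pools: {separate:.2f} of GPU-time busy")
```

In the same idealized model, co-location keeps every GPU busy in both phases, so the busy fraction approaches 1. The model ignores memory contention and weight synchronization, so it gives no estimate of actual speedup.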
Open-weights ecosystem health. Faster and cheaper online RL training means more teams can afford to run post-training experiments on open-weights models. If GRPO remains the dominant method for improving math and reasoning performance (it is the technique used for DeepSeek-R1), co-located vLLM becomes core infrastructure for the next wave of open reasoning models.