2026-05-06 · HuggingFace

vLLM V0 to V1: Correctness Before Corrections in RL

Source: HuggingFace · Date: 2026-05-06 · URL: https://huggingface.co/blog/ServiceNow-AI/correctness-before-corrections

Summary

ServiceNow AI published a technical post-mortem on migrating PipelineRL from vLLM 0.8.5 to 1.18.1. It documents four categories of train-inference mismatch that silently degraded RL training dynamics: logprob semantics (V1 returns raw rather than processed logprobs by default), runtime defaults (prefix caching and async scheduling change the inference path), cache behavior during inflight weight updates, and fp32 versus fp16 precision in the final projection layer. The core finding is that V1’s architectural improvements introduce configuration footguns that break policy-ratio calculations unless each setting is explicitly matched to V0 behavior.
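To make the logprob-semantics point concrete, here is a minimal sketch in plain Python with no vLLM dependency. It assumes that "processed" means logprobs computed after temperature scaling and "raw" means logprobs computed from the unscaled logits; the logits, temperature, and helper names are hypothetical and not taken from the post. With identical weights on both sides, the policy ratio should be exactly 1, and it drifts as soon as the sampler reports raw logprobs while the trainer recomputes them at the sampling temperature.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def policy_ratio(trainer_logprob, sampler_logprob):
    """Importance ratio pi_train(token) / pi_sample(token)."""
    return math.exp(trainer_logprob - sampler_logprob)


# Hypothetical values for illustration only.
logits = [2.0, 1.0, 0.1]
temperature = 0.7
token = 0

# The trainer recomputes logprobs at the sampling temperature.
trainer_lp = log_softmax(logits, temperature)[token]

# Assumption: "processed" = after temperature scaling, "raw" = before it.
processed_lp = log_softmax(logits, temperature)[token]
raw_lp = log_softmax(logits, 1.0)[token]

ratio_matched = policy_ratio(trainer_lp, processed_lp)   # 1.0: same computation
ratio_mismatched = policy_ratio(trainer_lp, raw_lp)      # not 1.0, same weights

print(f"matched:    {ratio_matched:.4f}")
print(f"mismatched: {ratio_mismatched:.4f}")
```

The mismatched ratio here comes entirely from the reporting convention, not from any policy change, which is why a clipping or KL term downstream would be reacting to noise rather than to learning signal.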

Implications

Feeds the open inference infrastructure and RL training stack threads: vLLM V1 is a breaking upgrade for any team running RLHF or GRPO pipelines on top of it. The default configuration changes are not documented as training-safety regressions, which makes them easy to miss.
The “correctness before corrections” title is a methodological point worth internalizing: adding reward shaping or KL terms to fix training instability can mask underlying inference bugs, making the root cause harder to find later.
Relevant to anyone evaluating vLLM for production RL workloads; the migration checklist from this post is the practical artifact.
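One way to act on the "correctness before corrections" point is to compare sampler and trainer logprobs on identical weights before adding any stabilizing term. The sketch below is a hypothetical check, not the post's checklist: the function name, the tolerance, and the sample values are all assumptions chosen for illustration.

```python
import math


def logprob_mismatch_report(sampler_logprobs, trainer_logprobs, tolerance=1e-3):
    """Compare per-token logprobs from the sampler and the trainer.

    Both lists must describe the same tokens under the same weights.
    The tolerance is a hypothetical threshold, not a recommended value.
    """
    if not sampler_logprobs or len(sampler_logprobs) != len(trainer_logprobs):
        raise ValueError("expected two non-empty lists of equal length")

    diffs = [t - s for s, t in zip(sampler_logprobs, trainer_logprobs)]
    ratios = [math.exp(d) for d in diffs]
    max_abs_diff = max(abs(d) for d in diffs)

    return {
        "max_abs_diff": max_abs_diff,
        "mean_ratio": sum(ratios) / len(ratios),
        "ok": max_abs_diff <= tolerance,
    }


# Hypothetical per-token logprobs for one short completion.
trainer = [-0.27, -1.90, -0.05]
good = logprob_mismatch_report([-0.27, -1.90, -0.05], trainer)
bad = logprob_mismatch_report([-0.42, -1.90, -0.05], trainer)

print(good)
print(bad)
```

Running a check like this right after an engine upgrade, with weights frozen, separates inference-side mismatch from genuine policy drift before any reward shaping or KL term is tuned.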
