Unlocking Agentic RL Training for GPT-OSS: A Practical Retrospective
Source: HuggingFace
Date: 2026-01-27
URL: https://huggingface.co/blog/LinkedIn/gpt-oss-agentic-rl
Summary
Engineering retrospective from LinkedIn on stabilizing PPO-based agentic RL training for GPT-OSS (120B MoE). Three engineering contributions: (1) fixing MoE log-probability mismatch in on-policy PPO (force old_log_prob = log_prob.detach() when truly on-policy); (2) implementing FlashAttention v3 backward pass for attention sinks, eliminating training/inference mismatch; (3) memory-efficient MoE materialization + sequence parallelism. Tested on 16 H200 nodes, 8k prompt + 16k response lengths. Without these fixes, training collapsed on VerifyIf; with them, stable convergence on GSM8K, VerifyIf, ReTool.
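The first fix can be illustrated with a minimal, framework-free sketch. It assumes the standard clipped PPO surrogate and uses made-up numbers; `ppo_clipped_loss` and `resolve_old_log_prob` are hypothetical helper names, not functions from the post or from any library. The assumption here is that a separately recomputed "old" log-probability can differ from the current one even when the policy has not changed (for example, if MoE routing differs between the two forward passes), which pushes the importance ratio away from 1 and can trigger clipping. Reusing the current value, the plain-Python analogue of `old_log_prob = log_prob.detach()`, pins the ratio at exactly 1.

```python
import math

def ppo_clipped_loss(log_prob, old_log_prob, advantage, clip_eps=0.2):
    """Per-token clipped PPO surrogate, written as a loss (lower is better).

    Returns (loss, ratio, was_clipped).
    """
    ratio = math.exp(log_prob - old_log_prob)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    unclipped_obj = ratio * advantage
    clipped_obj = clipped_ratio * advantage
    loss = -min(unclipped_obj, clipped_obj)
    # When the clipped term wins, the token contributes no gradient
    # in an autograd framework.
    was_clipped = clipped_obj < unclipped_obj
    return loss, ratio, was_clipped

def resolve_old_log_prob(log_prob, recomputed_old_log_prob, on_policy):
    """Hypothetical helper: pick the 'old' log-prob used in the ratio.

    On-policy, reuse the current value (with autograd this would be
    log_prob.detach()); otherwise use the separately computed one.
    """
    return log_prob if on_policy else recomputed_old_log_prob

# Illustrative values only (not measurements from the post).
log_prob = -1.0                  # current forward pass
recomputed_old_log_prob = -1.5   # second pass that disagrees
advantage = 1.0

naive = ppo_clipped_loss(
    log_prob,
    resolve_old_log_prob(log_prob, recomputed_old_log_prob, on_policy=False),
    advantage,
)
fixed = ppo_clipped_loss(
    log_prob,
    resolve_old_log_prob(log_prob, recomputed_old_log_prob, on_policy=True),
    advantage,
)

print("naive (loss, ratio, clipped):", naive)
print("fixed (loss, ratio, clipped):", fixed)
```

In the naive case the ratio exceeds the clip bound although no policy update has happened, so the token is clipped; in the fixed case the ratio is 1 and the update reduces to a plain policy-gradient step.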
Implications
Thread: open-weights ecosystem health / agentic patterns. LinkedIn publishing specific fixes for GPT-OSS agentic RL training is valuable engineering knowledge for anyone attempting similar work: the MoE on-policy PPO mismatch and the missing attention sink support are non-obvious failure modes. The code is to be released pending internal review, so watch for the repo. The FlashAttention v3 attention sink backward pass is a contribution that should propagate upstream. The post supports GPT-OSS 120B as a viable foundation for production agentic RL, which was an open question before it appeared.
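To make the attention sink gap concrete, here is a small scalar reference sketch. It is not the FlashAttention v3 kernel and not the authors' code. It assumes the sink is a learned logit that joins the softmax denominator without contributing a value vector, and it handles a single query. The function names and all numbers are illustrative. Under that assumption the gradient of the loss with respect to the sink logit is `-p_sink * dot(grad_out, out)`, the extra term a backward pass has to produce; the sketch checks it against a finite difference.

```python
import math

def sink_attention(scores, values, sink_logit):
    """Single-query attention with a sink logit in the softmax denominator.

    Returns (output vector, attention probabilities, sink probability).
    """
    m = max(max(scores), sink_logit)  # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    sink_exp = math.exp(sink_logit - m)
    z = sum(exps) + sink_exp
    probs = [e / z for e in exps]
    dim = len(values[0])
    out = [sum(p * v[d] for p, v in zip(probs, values)) for d in range(dim)]
    return out, probs, sink_exp / z

def sink_logit_grad(out, sink_prob, grad_out):
    """Analytic dL/d(sink_logit) = -p_sink * <grad_out, out>."""
    return -sink_prob * sum(g * o for g, o in zip(grad_out, out))

# Illustrative inputs (assumed, not taken from the post).
scores = [0.5, -1.0, 2.0]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sink_logit = 0.3
grad_out = [1.0, -2.0]  # upstream gradient dL/d(out)

out, probs, sink_prob = sink_attention(scores, values, sink_logit)
analytic = sink_logit_grad(out, sink_prob, grad_out)

def loss_at(sink):
    """Scalar loss L = <grad_out, out> as a function of the sink logit."""
    o, _, _ = sink_attention(scores, values, sink)
    return sum(g * x for g, x in zip(grad_out, o))

eps = 1e-5
numeric = (loss_at(sink_logit + eps) - loss_at(sink_logit - eps)) / (2 * eps)

print("attention mass on real tokens:", sum(probs))
print("analytic grad:", analytic, "numeric grad:", numeric)
```

Because the sink absorbs part of the probability mass, the attention weights over real tokens sum to less than 1. A training kernel that ignores the sink would therefore compute different outputs and gradients than an inference kernel that includes it.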