Prime Intellect Releases prime-rl 0.6.0 to Train Trillion-Parameter MoE Models on Agentic RL Workloads
Prime Intellect released prime-rl 0.6.0, an open framework for asynchronous reinforcement learning on trillion-parameter MoE models, enabling efficient training on long-horizon agentic tasks like software engineering.
The Big Picture
Prime Intellect announced prime-rl version 0.6.0, an open-source framework for asynchronous reinforcement learning (RL) on trillion-parameter Mixture-of-Experts (MoE) models. The release targets heavy agentic workloads, such as long-horizon software engineering (SWE) tasks. It was demonstrated by training GLM-5 on SWE tasks at up to 131k sequence length, with sub-5-minute step times, on 28 H200 nodes. Key optimizations include disaggregating the trainer from inference so the two run asynchronously, FP8 inference with wide expert parallelism and prefill/decode disaggregation, and training with 3-D parallelism (FSDP, EP, CP) plus block-scaled FP8. The framework also introduces router replay to reduce KL mismatch between trainer and inference, and supports KV cache offloading to CPU and disk for high concurrency. Together, these changes enable stable post-training of large open-source models on agentic tasks with a comparatively small node count.
Why It Matters
This release aims to make it practical to train trillion-parameter MoE models on complex agentic tasks with a modest cluster: the reported GLM-5 run used 28 H200 nodes. By disaggregating training and inference, optimizing KV cache management, and matching trainer and inference precision, prime-rl 0.6.0 lowers the barrier for post-training large models on long-horizon software engineering and other agentic workloads. This could accelerate the development of more capable coding assistants and autonomous agents, shifting the focus from pre-training scale to efficient post-training.
Prime Intellect has released prime-rl version 0.6.0. The framework targets reinforcement learning on trillion-parameter Mixture-of-Experts (MoE) models. It focuses on heavy agentic workloads, like long-horizon software-engineering tasks.
The research team trained GLM-5 on SWE tasks at up to 131k sequence length. Step times stayed under five minutes. The batch size was 256 rollouts. The run used only 28 H200 nodes.
TL;DR
prime-rl 0.6.0 trains trillion-parameter MoE models on agentic RL workloads.
GLM-5 trained on SWE at 131k sequence length, sub-5-minute steps, 28 H200 nodes.
Asynchronous RL disaggregates trainer and inference for independent optimization.
Training uses 3-D parallelism (FSDP, EP, CP) plus block-scaled FP8.
What is prime-rl 0.6.0?
prime-rl is an open framework for asynchronous reinforcement learning. It post-trains large open-source models on agentic tasks. Version 0.6.0 extends this to trillion-parameter MoE scale.
The example model in the announcement is zai-org/GLM-5.1. The optimizations also apply to other large MoE models. Examples include moonshotai/Kimi-K2.7-Code and nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16.
A full GLM-5.1 run starts with one command on a Slurm cluster.
uv run rl @ examples/glm5_llmd/rl.toml --output-dir /shared/outputs/glm5-llmd
Role of asynchronous RL
Agentic tasks have long-tail outliers. Some coding rollouts run for hours. Waiting for them before each policy update would idle GPUs.
Asynchronous RL avoids this. The trainer and inference systems are disaggregated. They run and scale independently. The inference policy updates as soon as the optimizer step finishes.
There is one synchronization point: the policy update. prime-rl pushes new weights as soon as they exist. Already-dispatched rollouts keep their active prefix cache. So a single rollout may mix tokens from several policy versions.
New rollouts behave differently. They repopulate their own KV cache, even when prefixes match. A KV-cache salt forces this. Requests from too old a policy are dropped. The max_off_policy_steps value controls that threshold.
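The two rules above, salting the cache for new rollouts and dropping stale ones, can be sketched as follows. This is a minimal illustration and not prime-rl's code: the function names, the salt format, and the exact staleness comparison are assumptions. Only the `max_off_policy_steps` name comes from the announcement.

```python
import hashlib


def kv_cache_salt(policy_version: int) -> str:
    """Hypothetical salt: any string that changes with the policy version."""
    return f"policy-v{policy_version}"


def prefix_cache_key(prompt_token_ids: list[int], salt: str) -> str:
    """Hash the salt together with the prompt, so identical prefixes under
    different policy versions map to different cache entries."""
    payload = salt + ":" + ",".join(map(str, prompt_token_ids))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_too_stale(rollout_version: int, trainer_step: int,
                 max_off_policy_steps: int) -> bool:
    """Assumed rule: drop a rollout once it lags the trainer by more than
    max_off_policy_steps optimizer steps."""
    return trainer_step - rollout_version > max_off_policy_steps


prompt = [101, 2023, 2003, 1037]
key_v3 = prefix_cache_key(prompt, kv_cache_salt(3))
key_v4 = prefix_cache_key(prompt, kv_cache_salt(4))

print("same prefix, different policy, same key?", key_v3 == key_v4)
print("drop rollout from step 3 at trainer step 10?", is_too_stale(3, 10, 4))
```

Because the salt is part of the key, a new rollout cannot reuse KV entries computed under older weights, even for an identical prompt. Already-dispatched rollouts are outside this sketch; they keep the cache they started with.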
Inference optimizations
Inference is usually the throughput bottleneck in an RL system. prime-rl optimizes for throughput, while keeping latency bounded.
FP8 inference: Lower precision speeds up prefill and decode. prime-rl uses FP8 with DeepEP and DeepGEMM kernels.
Wide Expert Parallelism: Wide EP spreads experts across ≥32 GPUs. It pairs with a large data-parallel rank, for example 32. Each GPU holds separate experts and serves as an endpoint. Synchronization happens per-layer, through dispatch and combine operations.
Prefill and Decode Disaggregation: Some model-environment pairs hit a 4:1 prefill:decode token ratio. On shared workers, those long prefills would inflate end-to-end latency, which erodes the benefits of PipelineRL. P/D disaggregation separates prefill and decode workers, so long tool outputs stop throttling decode workers.
KV cache management: High concurrency needs large KV cache space. prime-rl supports tiered offloading to CPU and disk. vLLM native offloading creates one pool per worker. Mooncake Store instead pools RAM and disk across all nodes centrally.
Request routing: prime-rl ships a fork of vllm-router by default. It also supports the NVIDIA Dynamo router as a drop-in. Routers score workers using KV cache reuse, queue depth, and live load.
Router replay (R3): Trainer↔inference mismatch silently kills training. Router replay captures inference routing decisions. It replays them directly on the trainer. This cuts KL mismatch by roughly an order of magnitude. Routed experts have shape [num_layers, top_k, seq_len]. This payload can grow to hundreds of GB. At scale, the data rate reaches tens of Gbps. So prime-rl treats it as an opaque payload. Optimized PyTorch operations handle the processing.
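The idea behind router replay, and the size of the payload it creates, can be sketched in a few lines. This is an illustration under stated assumptions, not prime-rl's implementation: the function names are hypothetical, `top_k=8` and the bytes-per-index values are guesses, and 131,072 stands in for the "131k" sequence length. The 78-layer count and the 256-rollout batch come from the article.

```python
def top_k_experts(router_logits, k):
    """Indices of the k largest logits: what each side picks on its own."""
    return sorted(range(len(router_logits)), key=lambda i: -router_logits[i])[:k]


def select_experts(router_logits, k, replayed=None):
    """With replay, the trainer reuses the inference-side choice verbatim."""
    if replayed is not None:
        return list(replayed)
    return top_k_experts(router_logits, k)


# Tiny numeric drift between inference and trainer flips a near-tie.
inference_logits = [0.10, 0.90, 0.50, 0.51]
trainer_logits = [0.10, 0.90, 0.51, 0.50]
recorded = top_k_experts(inference_logits, k=2)
without_replay = top_k_experts(trainer_logits, k=2)
with_replay = select_experts(trainer_logits, k=2, replayed=recorded)


def routed_experts_bytes(num_layers, top_k, seq_len, bytes_per_index):
    """Size of one [num_layers, top_k, seq_len] index tensor."""
    return num_layers * top_k * seq_len * bytes_per_index


# Assumed values: top_k=8, 2-byte indices, seq_len=131_072.
per_rollout = routed_experts_bytes(78, 8, 131_072, bytes_per_index=2)
per_batch = per_rollout * 256

print("inference picked", recorded, "| trainer alone would pick", without_replay)
print(f"per rollout: {per_rollout / 1e9:.2f} GB, per batch: {per_batch / 1e9:.1f} GB")
```

With these assumed values, one full-length rollout carries about 0.16 GB of routing indices and a 256-rollout batch about 42 GB. Wider index types or more buffered rollouts push the total higher, which is consistent with the article's point that the payload needs careful handling.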
Training optimizations
The trainer builds on torchtitan, a PyTorch-native training codebase. It relies on 3-D parallelism: FSDP, CP, and EP. The GLM-5 case study uses all three.
| Strategy | What it shards | Primary use | Key detail |
| --- | --- | --- | --- |
| FSDP (FSDP2) | Parameters, gradients, optimizer states | Baseline memory amortization | Gathers weights on demand per layer via fully_shard |
| Expert Parallelism (EP) | Experts within a layer | Shrinks active layer memory | all2all dispatch/combine; torch-native or DeepEP |
| Context Parallelism (CP) | The sequence dimension | Long-context activation memory | Ulysses (default) or Ring Attention |
EP exists because layers stay huge after FSDP. With 78 layers and 800B params in float32, one layer’s all-gather needs roughly 40GB (about 10B parameters per layer at 4 bytes each). Prefetching the next layer to overlap communication with compute pushes that near 80GB. Setting EP=8 dispatches tokens instead of gathering full experts. torch-native all2all is slightly faster within one node. DeepEP wins when EP spans multiple nodes.
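The memory figures in this paragraph follow from simple arithmetic, reproduced here as a sketch. The assumptions are that parameters are spread evenly across layers (ignoring embeddings), that 1 GB means 1e9 bytes, and, for the EP line, that nearly all of a layer's parameters are expert weights. The function name is hypothetical.

```python
def layer_allgather_gb(total_params: float, num_layers: int,
                       bytes_per_param: int) -> float:
    """Approximate size of one fully gathered layer, in GB (1e9 bytes)."""
    params_per_layer = total_params / num_layers
    return params_per_layer * bytes_per_param / 1e9


# Figures from the article: 800B params, 78 layers, float32 (4 bytes).
one_layer = layer_allgather_gb(800e9, 78, 4)

# Holding the current layer plus one prefetched layer doubles the footprint.
with_prefetch = 2 * one_layer

# Idealized EP=8: each rank keeps only 1/8 of the experts, so the gathered
# expert weights per rank shrink by about the same factor.
ep_degree = 8
per_rank_with_ep = one_layer / ep_degree

print(f"one layer: {one_layer:.1f} GB")
print(f"with one prefetched layer: {with_prefetch:.1f} GB")
print(f"idealized per-rank with EP={ep_degree}: {per_rank_with_ep:.1f} GB")
```

The first two results land close to the roughly 40GB and 80GB quoted above. The EP figure is only an upper-bound style estimate, since attention and shared weights are not sharded by EP.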
CP matters at 131k+ sequence length. There, activations dominate memory, not parameters. GLM-5 uses DSA (DeepSeek Sparse Attention), which neither Ulysses nor Ring Attention parallelizes directly. So prime-rl ships a custom context-parallel implementation for it.
FP8 training: prime-rl uses DeepGEMM block-scaled FP8, as proposed by DeepSeek V3. This rarely raises throughput, due to quantization overhead. Its real value is matching trainer and inference precision. That reduces KL mismatch and stabilizes training.
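Block scaling itself is easy to illustrate in plain Python. The sketch below computes one scale per block of 128 values so that each block fits the FP8 E4M3 range, whose largest finite value is 448. It deliberately omits the actual FP8 cast and rounding, and it does not call DeepGEMM. The 128-element block follows the tile size described for DeepSeek V3; whether prime-rl uses the same shapes is an assumption, and the function names are hypothetical.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3
BLOCK = 128           # assumed block size, after DeepSeek V3's 1x128 tiles


def block_scales(values, block=BLOCK):
    """One scale per block: the block's max magnitude maps to FP8_E4M3_MAX."""
    scales = []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        amax = max(abs(v) for v in chunk)
        scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
    return scales


def scale_to_fp8_range(values, scales, block=BLOCK):
    """Divide each value by its block's scale (the FP8 cast would follow)."""
    return [v / scales[i // block] for i, v in enumerate(values)]


# Two blocks of small values, with one large outlier in the second block.
x = [0.01 * (i % 7 - 3) for i in range(256)]
x[200] = 500.0

scales = block_scales(x)
scaled = scale_to_fp8_range(x, scales)

print("scales per block:", scales)
print("largest scaled magnitude:", max(abs(v) for v in scaled))
```

The outlier inflates only the second block's scale, so the small values in the first block keep their resolution. A single per-tensor scale would not preserve that, which is the motivation for scaling per block.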
Use cases with examples
Long-horizon SWE agents: Train a model on real repository issues. Rollouts can span hundreds of turns and tool calls. P/D disaggregation keeps decode latency predictable here.
1T-scale post-training on fewer nodes: The GLM-5 run fit on 28 H200 nodes. Wide EP and KV offloading raise concurrency and throughput.
Stable agentic RL at scale: Router replay and FP8 training both reduce trainer↔inference KL mismatch. Lower mismatch means steadier training.