Orbit - Ultra-efficient RL Pipeline

Orbit is an open-source RL framework for memory-efficient post-training of trillion-parameter LLMs. By minimizing the train-rollout gap in RL, Orbit can train Kimi-K2.6 and DeepSeek V4-Pro in a single-node setting.

Today’s leading large language models contain over a trillion parameters, and the prevailing recipe for RL post-training of such models calls for high-precision, multi-node, full-parameter updates. This is extremely challenging for RL infrastructure. Orbit takes a different route: treat the base model as fixed at deployment precision, and put the gradient signal where it is actually needed, on a tiny BF16 OFT adapter [Controlling Text-to-Image Diffusion by Orthogonal Finetuning (Qiu, Liu, et al., NeurIPS 2023); Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization (Liu, Qiu, et al., ICLR 2024); Orthogonal Finetuning Made Scalable (Qiu, Liu, et al., EMNLP 2025)].

The result is that RL post-training of 1T-class models fits on a single node with 8×B200 GPUs, the precision mismatch between the training base and the serving base disappears by construction, and most of the orchestration tax that comes with multi-node full-finetuning vanishes along the way. (For the public record, Macaron AI used 8 nodes × 8 NVIDIA H800 (80 GB) GPUs, 64 GPUs in total, to RL post-train a 1T LLM, Kimi-K2 1.04T.) We have used Orbit to run stable, end-to-end RL on Kimi-K2.6 (~1T), DeepSeek V4-Flash, DeepSeek V4-Pro (~1.6T), and Qwen3 MoE, all in a single-node setting.

Abstract

Orbit moves RL post-training of frontier-scale models away from the conventional recipe (high-precision, multi-node, full-parameter updates) toward a deployment-aligned, low-precision, PEFT-centric pipeline. The base model is held at its deployment precision (INT4 / FP4) during both training and rollout, and only a small BF16 adapter receives gradient updates. (We primarily use OFT in Orbit due to its better stability under low-bit base models, but LoRA is also supported.) RL post-training of 1T-class models then fits on a single 8×B200 node, and the precision mismatch between the training base and the serving base, which is typically identified as a chronic source of instability in low-precision RL, disappears by construction.

What Orbit offers

Orbit is an adapter-first system for frontier-scale RL. Four properties set it apart from the conventional full-finetuning pipeline:

Training stability

RL is sensitive to policy log-prob drift and base precision. Orbit trains a BF16 adapter on top of a low-precision frozen base; reward and eval rise steadily, and train-rollout log-prob difference stays bounded. Moreover, we observe no performance drop compared to full-parameter RL.

Single-node trillion-scale RL

Trillion-scale RL can be reduced from complex multi-node orchestration to a single 8×B200 node, eliminating cross-node communication, simplifying scheduling, and greatly reducing the failure-recovery surface area. For the first time, 1T LLMs can be RL post-trained with a single node.

Deployment-aligned RL

The training base and the rollout / serving base are the same low-precision weights, so there is no silent precision gap between what is optimized and what is shipped.

Budget-constrained efficiency

Under the same hardware budget, Orbit supports larger models, longer responses, and larger update sizes than full-finetuning baselines.

Weight-state memory: why a single node is enough

The starting point is a simple memory count. For any RL recipe, the weight and gradient footprint of training is a function of the base model size and the precision strategy. The modern RL design space consists of four common recipes:

Method	Training	Rollout
Traditional RL full-FT	BF16 forward + BF16 backward	FP8
Low-precision RL full-FT	FP8 forward + BF16 backward	FP8
Modern low-precision RL full-FT (most commonly used)	FP8 forward + FP8 backward	INT4 / FP4
Orbit's PEFT-centric RL	INT4 / FP4 (frozen) + BF16 adapter	same INT4 / FP4 + BF16 adapter

The slope ratios make the gap explicit. In bits per parameter, the training footprint (weight + gradient) of the four recipes scales as (16+16) : (8+16) : (8+8) : (4 + 2·adapter); the rollout footprint scales as 8 : 8 : 4 : (4 + adapter), where adapter ≈ 0.04 bits/parameter. Orbit's curve is essentially the 4-bit base alone. For a fair comparison, all RL recipes use exactly the same training settings on each base model.

Weight-state memory scaling across full-ft and Orbit PEFT recipes — Theoretical lower bound on weight + gradient footprint (4-bit base; baselines scaled by precision ratio; Orbit adds a BF16 adapter at ≈ 0.04 bit/param). At Kimi-K2.6 (~1.03T), full-ft requires ≈ 4,120 GB — more than 2.6× a single 8×B200 node (1,536 GB HBM). Orbit's footprint at the same scale is ≈ 528 GB. Activation and optimizer state are not included; the gap widens further once they are.
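The counting argument can be reproduced in a few lines. The sketch below is only an illustration of the ratios quoted above: the names (`TRAIN_BITS`, `footprint_gb`) are hypothetical, gigabytes are taken as 10^9 bytes, and activations and optimizer state are ignored, as in the figure. With these assumptions the BF16 full-finetuning line gives 4,120 GB at 1.03T parameters, and the Orbit line comes out near 525 GB, within about 1% of the ≈ 528 GB in the caption; the small difference presumably comes from rounding or from details not spelled out in the text.

```python
# Bits per parameter for weights + gradients during training, copied from the
# ratios quoted in the text. ADAPTER_BITS is the quoted ~0.04 bit/param overhead.
ADAPTER_BITS = 0.04

TRAIN_BITS = {
    "traditional_full_ft": 16 + 16,           # BF16 weights + BF16 grads
    "low_precision_full_ft": 8 + 16,          # FP8 weights + BF16 grads
    "modern_low_precision_full_ft": 8 + 8,    # FP8 weights + FP8 grads
    "orbit_peft": 4 + 2 * ADAPTER_BITS,       # 4-bit frozen base + adapter weight/grad
}

NODE_HBM_GB = 8 * 192   # one 8xB200 node, 192 GB HBM per GPU
KIMI_PARAMS = 1.03e12   # Kimi-K2.6, ~1.03T parameters


def footprint_gb(num_params: float, bits_per_param: float) -> float:
    """Weight + gradient footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for name, bits in TRAIN_BITS.items():
    gb = footprint_gb(KIMI_PARAMS, bits)
    fits = "fits" if gb <= NODE_HBM_GB else "does not fit"
    print(f"{name:30s} {gb:7.0f} GB  ({gb / NODE_HBM_GB:.2f}x one node, {fits})")
```

Under this count, even the FP8-forward / FP8-backward recipe lands above the 1,536 GB of a single node at this model size, which is why only the frozen 4-bit base plus adapter clears the bar before activations are considered.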

Kimi-K2.6 1T: stable RL on a trillion-parameter model

Our first end-to-end result is a 200-step RL run on Kimi-K2.6 (~1.03T) [Kimi K2.6: Advancing Open-Source Coding (Moonshot AI, 2026)] using an INT4 frozen base with a BF16 adapter, run entirely on a single 8×B200 node. Reward, eval accuracy, and training-set pass@k all rise; the train–rollout log-prob diff stays bounded throughout the run.

Model	Kimi-K2.6 ~1.03T
Hardware	single 8×B200
Training precision	INT4 base + BF16 adapter
Rollout precision	same INT4 base + BF16 adapter

1T Kimi-K2.6 RL training curves — Kimi-K2.6 single-node stable RL (8×B200, INT4 base + BF16 adapter). *(a)* Training reward climbs from 0.33 to 0.88. *(b)* Eval accuracy on AIME24, AMC23, MATH500 rises consistently. *(c)* pass@1/2/4/8 on the training set increases over ≈220 RL steps. *(d)* Train–rollout |Δ log-prob| stays in `0.0066–0.0104` with no drift.
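For readers who want to track the same quantity as panel (d), a minimal sketch of a train–rollout |Δ log-prob| metric follows. The post does not spell out the exact reduction Orbit uses (mean vs. max, per token vs. per sequence), so the token-level mean over response tokens below is an assumption, and the function name is hypothetical.

```python
def mean_abs_logprob_diff(train_logprobs, rollout_logprobs, mask=None):
    """Mean |log p_train(token) - log p_rollout(token)| over unmasked tokens.

    train_logprobs:   per-token log-probs recomputed by the trainer
    rollout_logprobs: per-token log-probs reported by the rollout engine
    mask:             1 for response tokens to include, 0 to skip (e.g. padding)
    """
    if len(train_logprobs) != len(rollout_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    if mask is None:
        mask = [1] * len(train_logprobs)
    total, count = 0.0, 0
    for lp_train, lp_rollout, keep in zip(train_logprobs, rollout_logprobs, mask):
        if keep:
            total += abs(lp_train - lp_rollout)
            count += 1
    return total / count if count else 0.0


# Toy values, not measurements from the run.
train = [-0.50, -1.20, -0.05, -2.00]
rollout = [-0.51, -1.19, -0.05, -2.02]
mask = [1, 1, 1, 0]  # last position treated as padding
print(f"{mean_abs_logprob_diff(train, rollout, mask):.4f}")
```

Logged once per step, a value that stays inside a narrow band (as in the figure) indicates that the trainer and the rollout engine are evaluating effectively the same policy.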

1T Kimi-K2.6 GPU memory stability over 27 hours — Single-node GPU memory stability over 27 hours / ≈220 steps. Mean ≈ 160 GB, peak ≈ 191 GB out of the 192 GB B200 HBM. Bounded, no upward drift — the 1T model genuinely fits on one node.

DeepSeek V4-Flash: not Kimi-only

To show that deployment-aligned low-precision RL is not specific to one model family, we ran the same recipe on DeepSeek V4-Flash [DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence (DeepSeek AI, 2026)] with an FP4 base. Reward, eval, and pass@k all improve; log-prob diff is again bounded.

Model	DeepSeek V4-Flash
Hardware	single 8×B200
Training precision	FP4 base + BF16 adapter
Rollout precision	same FP4 base + BF16 adapter

DeepSeek V4-Flash RL training curves — DeepSeek V4-Flash single-node stable RL (8×B200, FP4 + FP8 base + BF16 adapter). Training reward rises from 0.62 to 0.84; pass@k climbs steadily; log-prob diff stays inside `0.0161–0.0247` with no drift over 160 steps.

DeepSeek V4-Flash GPU memory stability — V4-Flash memory: mean ≈ 119 GB, peak ≈ 190 GB. Headroom remains under the 192 GB cap, with no growth over the 27-hour run.

System validation on DeepSeek V4-Pro-1.6T

DeepSeek V4-Pro (~1.59T) [DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence (DeepSeek AI, 2026)] pushes Orbit to MoE models at roughly 1.6T parameters. We treat this section as a system capability proof, not a quality benchmark: the base model is already very strong, and the present RL data cannot effectively improve its accuracy further. What matters here is that a 1.6T LLM fits on a single GPU node (8×B200), RL post-training stays stable throughout the run, and the train-rollout log-prob difference remains small.

DeepSeek V4-Pro 1.6T single-node systems validation — DeepSeek V4-Pro (1.6T) single-node systems validation (8×B200, FP4 + FP8 base + BF16 adapter). *(a)* Train–rollout log-prob diff stays in `0.030–0.040` with no drift at 1.6T scale. *(b)* pass@k during the validation run. *(c)* GPU memory bounded; mean ≈ 156 GB, peak ≈ 188 GB / 192 GB — 1.6T fits and stays bounded on a single 8×B200 node.

Why Orbit uses OFT

Our previous PEFT-Arena benchmark [PEFT-Arena: Understanding Parameter-Efficient Finetuning from a Stability-Plasticity Perspective (Huang, et al., preprint 2026)] performed a comprehensive evaluation of existing PEFT methods. The following figure shows that OFT achieves better performance than full finetuning and LoRA (on both downstream adaptation and pretraining knowledge preservation). Most importantly, we find that OFT consistently outperforms full-parameter finetuning in RLVR settings, while LoRA does not. We also note that OFT works well as a supervised finetuning method. These results motivate us to use OFT as Orbit's default PEFT method.

PEFT-Arena Results: Performance on PEFT-Arena, which evaluates the stability-plasticity trade-off of PEFT methods. The y-axis shows the performance on target downstream tasks (for evaluating the adaptation effectiveness) and the x-axis shows the performance on general tasks (for evaluating the general capability preservation).

To compare OFT and LoRA in a larger-scale setting, we use Qwen3-30B-A3B (MoE) as the base model and run RL with either OFT or LoRA under the same parameter budget, in the standard BF16 regime. Specifically, we compare LoRA r16/a32 against OFT b64 under the PEFT-Arena train–eval setting. OFT outperforms LoRA on both training reward and eval accuracy while producing a log-prob difference of the same scale.

Qwen3-30B-A3B LoRA vs OFT — Qwen3-30B-A3B (MoE, BF16) — LoRA vs OFT over 500 RL steps. OFT shows a consistent advantage on training reward, eval avg accuracy, and pass@8, with comparable log-prob diff.
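The "same parameter budget" claim can be sanity-checked with a short count. The sketch below rests on several assumptions that the post does not state explicitly: "r16" is read as LoRA rank 16 ("a32" as the scaling alpha, which adds no parameters), "b64" as an OFT block size of 64, each OFT block is generated by one skew-symmetric matrix as in the Cayley parameterization of the OFT papers, and the rotation acts on the input dimension. The layer width `D` is purely illustrative.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (d_in x rank) and B (rank x d_out) per adapted layer."""
    return rank * (d_in + d_out)


def oft_params(d_in: int, block_size: int) -> int:
    """Block-diagonal OFT: one skew-symmetric generator per block.

    A b x b skew-symmetric matrix has b * (b - 1) / 2 free entries.
    """
    if d_in % block_size:
        raise ValueError("block_size must divide d_in")
    num_blocks = d_in // block_size
    return num_blocks * block_size * (block_size - 1) // 2


D = 2048  # illustrative square projection width, not taken from the post
print(lora_params(D, D, rank=16))      # 65536
print(oft_params(D, block_size=64))    # 64512
```

For a square projection the two counts are 32·D and 31.5·D, so under these assumptions the two adapters are within about 2% of each other.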

Beyond accuracy, OFT also has two system-level properties that make it a better fit than LoRA for adapter-first RL on frontier-scale models:

Zero additional network communication overhead when serving OFT. Optimized approaches like S-LoRA attempt to mitigate LoRA's inherent communication costs (which scale as 5(N−1)Br / N) by fusing the adapter's all-reduce steps with those of the base model. OFT's localized matrix design bypasses the need for these extra cross-device reductions altogether, avoiding this “communication wall” and making it faster than LoRA during distributed inference.
Custom kernels for split-projection adapters on fused layers. Modern high-throughput transformers fuse logically-independent projections — Q/K/V in attention, gate/up in SwiGLU MLPs — into single GEMMs to saturate tensor cores and memory bandwidth. Naively attaching an adapter to such a fused tensor shares one set of adapter parameters across distinct subspaces, degrading quality; the typical LoRA workaround couples the two halves through a shared low-rank A, effectively wasting half of the adapter capacity. We engineered custom split-projection kernels that natively apply separate orthogonal rotations to each subspace. Taking Kimi as an example, during inference, fused INT4 weights are dynamically sliced and repacked at load time to run as precise half-size GEMMs, avoiding discarded compute and cutting kernel time by ≈35%. On the training side, a split-aware Triton dequantization kernel paired with custom autograd hooks minimizes memory overhead.
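To make the split-projection idea concrete, here is a toy sketch in plain Python. It is not Orbit's kernel: it assumes the orthogonal rotation acts on the input side of each projection, uses 2×2 blocks so the Cayley transform has a closed form (real configurations use much larger blocks), and works on dense floats rather than packed INT4 weights. All function names are hypothetical.

```python
import math


def cayley_2x2(s: float):
    """Cayley transform (I + S)(I - S)^-1 of S = [[0, s], [-s, 0]].

    Closed form for the 2x2 case; the result is always a rotation matrix.
    """
    c = (1 - s * s) / (1 + s * s)
    n = 2 * s / (1 + s * s)
    return [[c, n], [-n, c]]


def rotate_blockdiag(x, params):
    """Apply a block-diagonal orthogonal matrix (2x2 blocks) to vector x."""
    out = []
    for i, s in enumerate(params):
        (a, b), (c, d) = cayley_2x2(s)
        x0, x1 = x[2 * i], x[2 * i + 1]
        out += [a * x0 + b * x1, c * x0 + d * x1]
    return out


def matvec(w, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def split_projection_forward(w_fused, split, x, params_a, params_b):
    """Fused weight = rows [0:split] (e.g. gate) followed by rows [split:] (e.g. up).

    Each slice sees x through its own rotation, so the two halves run as two
    half-size products instead of one fused product with a shared adapter.
    """
    w_a, w_b = w_fused[:split], w_fused[split:]
    y_a = matvec(w_a, rotate_blockdiag(x, params_a))
    y_b = matvec(w_b, rotate_blockdiag(x, params_b))
    return y_a + y_b


w_fused = [[1, 2, 0, 1], [0, 1, 1, 0], [2, 0, 1, 1], [1, 1, 0, 2]]
x = [1.0, 2.0, 3.0, 4.0]

# Zero generators give identity rotations, so training starts at the base model.
print(split_projection_forward(w_fused, 2, x, [0.0, 0.0], [0.0, 0.0]) == matvec(w_fused, x))

# Independent rotations per subspace; each one preserves the norm of x.
rotated = rotate_blockdiag(x, [0.3, -0.7])
print(math.isclose(math.sqrt(sum(v * v for v in rotated)), math.sqrt(sum(v * v for v in x))))
```

The two properties printed at the end are the ones the text relies on: the adapter is an exact no-op at initialization, and the two fused halves get separate rotations rather than a shared one.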

Technical details of Orbit

PEFT adapters as first-class citizens in RL infrastructure

The architectural shift from full-parameter to adapter-only updates is what unlocks single-node trillion-scale RL. Five properties of the adapter-first design matter most:

Trivial weight sync. Only megabyte-scale adapter deltas move between the trainer and the rollout engines — no gigabyte-scale weight exports, format conversions, or serving rebuilds on the hot path.
Persistent rollout servers. Because the base model never changes, inference engines stay constantly active across the training lifecycle. The cyclic engine restart between generation and optimization disappears.
Viable trillion-parameter scaling. The memory-heavy optimizer state is bounded to the tiny adapter while the frozen base rests in low-precision memory; massive MoE models become single-node-feasible.
Free reference policy. The KL-divergence reference simply evaluates the frozen base without the adapter applied — instantly recovering the memory normally wasted on a separate snapshot of the actor.
Adapter-native async with double-buffered rollout. Orbit's versioned adapter runtime makes async RL cheap. Each adapter delta is stamped with a monotonic version; SGLang streams the next version into an inactive slot while the active slot keeps serving in-flight requests, and activation flips atomically — zero rollout bubble. On Qwen3-4B + OFT (8×B200, TP=2), this yields a 1.42× step-time speedup and 44% higher rollout throughput, with eval accuracy unchanged within run-to-run noise.
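The double-buffered, versioned adapter swap in the last item can be sketched as a small two-slot structure. This is an illustration of the idea only, not SGLang's or Orbit's API: the class and method names are hypothetical, and adapter weights are stood in for by a plain dict.

```python
import threading


class DoubleBufferedAdapter:
    """Two adapter slots: one serves requests, the other receives the next version."""

    def __init__(self, initial_weights, version=0):
        self._slots = [(version, initial_weights), None]
        self._active = 0
        self._lock = threading.Lock()

    def active(self):
        """Return (version, weights) to use for a newly admitted request."""
        with self._lock:
            return self._slots[self._active]

    def stage(self, version, weights):
        """Load the next version into the inactive slot; serving is unaffected."""
        with self._lock:
            if version <= self._slots[self._active][0]:
                raise ValueError("adapter versions must increase monotonically")
            self._slots[1 - self._active] = (version, weights)

    def flip(self):
        """Atomically activate the staged slot and return its version."""
        with self._lock:
            staged = self._slots[1 - self._active]
            if staged is None or staged[0] <= self._slots[self._active][0]:
                raise RuntimeError("no newer adapter staged")
            self._active = 1 - self._active
            return staged[0]


adapters = DoubleBufferedAdapter({"delta": 0.0})
in_flight = adapters.active()          # a request admitted under version 0
adapters.stage(1, {"delta": 0.1})      # trainer streams version 1 in the background
adapters.flip()                        # new requests now see version 1
print(in_flight[0], adapters.active()[0])  # 0 1
```

In this sketch a request keeps the `(version, weights)` pair it was admitted with, so in-flight generation finishes on the old adapter while new requests pick up the new one, and the version stamp lets the trainer tell which policy produced each sample.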

Low-precision MoE expert modules

Active-expert-chunked dequantization dynamically groups only the router-selected experts into fixed-size batches, temporarily dequantizing them for grouped GEMMs and immediately releasing the high-precision weights. The hybrid reaches the throughput of fully grouped matrix multiplications while strictly bounding peak transient memory to a single small chunk — preventing OOM on massive low-precision MoE models during training.
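The control flow described above can be sketched in plain Python. This is a toy under stated assumptions, not Orbit's implementation: quantization is modeled as integer codes times a per-expert scale, routing is top-1 for brevity, and per-token products stand in for the grouped GEMMs. All names are hypothetical.

```python
def dequantize(q_weights, scale):
    """Toy dequantization: integer codes times a per-expert scale."""
    return [[q * scale for q in row] for row in q_weights]


def matvec(w, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def chunked_moe_forward(tokens, routed_expert, experts, chunk_size):
    """Run routed tokens through experts, dequantizing a few experts at a time.

    tokens:        list of input vectors
    routed_expert: routed_expert[i] is the expert id chosen for token i
    experts:       {expert_id: (int_codes, scale)} kept in low precision
    Returns (outputs, peak number of experts held in high precision at once).
    """
    active = sorted(set(routed_expert))            # only router-selected experts
    outputs = [None] * len(tokens)
    peak_live = 0
    for start in range(0, len(active), chunk_size):
        chunk = active[start:start + chunk_size]
        live = {e: dequantize(*experts[e]) for e in chunk}   # transient copies
        peak_live = max(peak_live, len(live))
        for i, e in enumerate(routed_expert):
            if e in live:
                outputs[i] = matvec(live[e], tokens[i])
        del live                                   # release before the next chunk
    return outputs, peak_live


experts = {e: ([[e + 1, 0], [0, e + 1]], 0.5) for e in range(4)}
tokens = [[2.0, 4.0]] * 4
outputs, peak = chunked_moe_forward(tokens, [0, 2, 2, 3], experts, chunk_size=2)
print(outputs[0], peak)  # [1.0, 2.0] 2
```

Expert 1 is never selected by the router in this example, so it is never dequantized, and at most `chunk_size` experts exist in high precision at any moment; that bound is the point of the technique.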

DeepSeek-V4 specific optimizations

The DeepSeek-V4 path includes full CUDA-graph decoding support, DeepGEMM + DeepEP V2 integration, and hand-written TileLang / Triton / CUDA kernels for attention backward and fusion. An adapter-aware bypass-base-weight-grad GEMM backward kernel exploits the fact that the base is frozen. A 6-layer debug model with debug ops enabled lands at 1e-5-level log-prob diff; step-0 log-prob diff in real V4-Flash training is 0.02.
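The idea behind bypassing the base weight gradient can be shown on a plain linear layer. The sketch below is a reference-style illustration in pure Python, not the kernel described above; the function names and the `base_frozen` flag are hypothetical, and the adapter's own gradient path is left out.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def linear_backward(x, w, grad_y, base_frozen=True):
    """Backward of y = x @ w. Returns (grad_x, grad_w).

    grad_x = grad_y @ w^T is always needed to keep backpropagating.
    grad_w = x^T @ grad_y is only needed if w is trainable, so with a frozen
    base that second GEMM is skipped entirely.
    """
    grad_x = matmul(grad_y, transpose(w))
    grad_w = None if base_frozen else matmul(transpose(x), grad_y)
    return grad_x, grad_w


x = [[1.0, 2.0]]                          # 1 token, d_in = 2
w = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]    # d_in x d_out = 2 x 3
grad_y = [[1.0, 1.0, 1.0]]

print(linear_backward(x, w, grad_y))                      # ([[3.0, 2.0]], None)
print(linear_backward(x, w, grad_y, base_frozen=False))   # ([[3.0, 2.0]], [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]])
```

Both backward products cost about the same number of multiply-adds for a given layer, so dropping the weight-gradient product removes roughly half of that layer's backward GEMM work and avoids materializing a gradient tensor the size of the base weight.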

Authors

Zeju Qiu*⁺, Le Chen*⁺, Lixin Liu*⁺, Tim Z. Xiao⁺, Yao Feng, Yangyi Huang, Zhen Liu, Han Shi, Yandong Wen, Zhouliang Yu, Bernhard Schölkopf, Weiyang Liu⁺.

The Chinese University of Hong Kong · Max Planck Institute for Intelligent Systems

Citation

Please cite this work as:

@article{spherelab2026orbit,
  author = {Qiu, Zeju and Chen, Le and Liu, Lixin and Xiao, Tim Z.
            and Feng, Yao and Huang, Yangyi and Liu, Zhen and Shi, Han
            and Wen, Yandong and Yu, Zhouliang and Sch{\"o}lkopf, Bernhard
            and Liu, Weiyang},
  title  = {Orbit: Stable and Efficient Reinforcement Learning for Trillion-Parameter LLMs},
  journal = {SphereLab Blog},
  year   = {2026},
  note   = {https://spherelab.ai/orbit}
}