Orbit is an open-source RL framework for memory-efficient post-training of trillion-parameter LLMs. By minimizing the train-rollout gap in RL, Orbit can train Kimi-K2.6 and DeepSeek V4-Pro on a single node.
Today’s leading large language models contain over a trillion parameters, and the prevailing recipe for RL post-training of such models calls for high-precision, multi-node, full-parameter updates, which is extremely challenging for RL infrastructure. Orbit takes a different route: treat the base model as fixed at deployment precision, and put the gradient signal where it is actually needed, on a tiny BF16 OFT adapter [Controlling Text-to-Image Diffusion by Orthogonal Finetuning (Qiu, Liu, et al., NeurIPS 2023); Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization (Liu, Qiu, et al., ICLR 2024); Orthogonal Finetuning Made Scalable (Qiu, Liu, et al., EMNLP 2025)].
The result is that RL post-training of 1T-class models fits on a single node with 8×B200 GPUs, the precision mismatch between the training base and the serving base disappears by construction, and most of the orchestration tax that comes with multi-node full-finetuning vanishes along the way. (For the public record, Macaron AI used 8 nodes × 8× NVIDIA H800 (80GB) GPUs, 64 GPUs in total, to RL post-train 1T LLMs (Kimi-K2 1.04T).) We have used Orbit to run stable, end-to-end RL on Kimi-K2.6 (~1T), DeepSeek V4-Flash, DeepSeek V4-Pro (~1.6T), and Qwen3 MoE, all in a single-node setting.
Abstract#
Orbit moves RL post-training of frontier-scale models away from the conventional recipe (high-precision, multi-node, full-parameter updates) toward a deployment-aligned, low-precision, PEFT-centric pipeline. The base model is held at its deployment precision (INT4 / FP4) during both training and rollout, and only a small BF16 adapter receives gradient updates. (We primarily use OFT in Orbit due to its better stability under low-bit base models, but LoRA is also supported.) RL post-training of 1T-class models then fits on a single 8×B200 node, and the precision mismatch between the training base and the serving base, which is typically identified as a chronic source of instability in low-precision RL, disappears by construction.
What Orbit offers#
Orbit is an adapter-first system for frontier-scale RL. Four properties set it apart from the conventional full-finetuning pipeline:
RL is sensitive to policy log-prob drift and base precision. Orbit trains a BF16 adapter on top of a low-precision frozen base; reward and eval rise steadily, and train-rollout log-prob difference stays bounded. Moreover, we observe no performance drop compared to full-parameter RL.
Trillion-scale RL can be reduced from complex multi-node orchestration to a single 8×B200 node, eliminating cross-node communication, simplifying scheduling, and greatly reducing the failure-recovery surface area. For the first time, 1T LLMs can be RL post-trained on a single node.
The training base and the rollout / serving base are the same low-precision weights. No silent precision gap between what is optimized and what is shipped.
Under the same hardware budget, Orbit supports larger models, longer responses, and larger update sizes than full-finetuning baselines.
Weight-state memory: why a single node is enough#
The starting point is a simple memory count. For any RL recipe, the weight and gradient footprint of training is a function of the base model size and the precision strategy. The modern RL design space consists of four common recipes:
| Method | Training | Rollout |
|---|---|---|
| Traditional RL full-FT | BF16 forward + BF16 backward | FP8 |
| Low-precision RL full-FT | FP8 forward + BF16 backward | FP8 |
| Modern low-precision RL full-FT (most commonly used) | FP8 forward + FP8 backward | INT4 / FP4 |
| Orbit's PEFT-centric RL | INT4 / FP4 (frozen) + BF16 adapter | same INT4 / FP4 + BF16 adapter |
The slope ratios make the gap explicit. In bits per base-model parameter, the training footprint (weights + gradients) of the four recipes scales as (16+16) : (8+16) : (8+8) : (4 + 2·adapter), and the rollout footprint as 8 : 8 : 4 : (4 + adapter), where adapter ≈ 0.04 bits/parameter. Orbit's curve is essentially the 4-bit base alone. For a fair comparison, all compared RL recipes use exactly the same training settings on each base model.
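To make the counting concrete, the sketch below turns the ratios above into gigabytes for a model of Kimi-K2.6's size. It is a back-of-the-envelope estimate only: it counts weights and gradients, ignores activations, optimizer state, and KV cache, and assumes 192 GB of memory per B200. The names `TRAIN_BITS`, `ROLLOUT_BITS`, `footprint_gb`, and `fits_on_node` are illustrative and not part of Orbit.

```python
# Bits per base-model parameter for each recipe, taken from the ratios above.
ADAPTER_BITS = 0.04  # adapter cost amortized per base parameter (from the text)

TRAIN_BITS = {
    "traditional_full_ft": 16 + 16,
    "low_precision_full_ft": 8 + 16,
    "modern_low_precision_full_ft": 8 + 8,
    "orbit_peft": 4 + 2 * ADAPTER_BITS,
}
ROLLOUT_BITS = {
    "traditional_full_ft": 8,
    "low_precision_full_ft": 8,
    "modern_low_precision_full_ft": 4,
    "orbit_peft": 4 + ADAPTER_BITS,
}

NODE_GB = 8 * 192  # assumption: 8 GPUs with 192 GB each


def footprint_gb(num_params: float, bits_per_param: float) -> float:
    """Weight (+ gradient) footprint in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


def fits_on_node(num_params: float, bits_per_param: float,
                 node_gb: float = NODE_GB) -> bool:
    """True if the weight-state footprint alone fits in one node's memory."""
    return footprint_gb(num_params, bits_per_param) <= node_gb


if __name__ == "__main__":
    n = 1.03e12  # roughly Kimi-K2.6
    for name, bits in TRAIN_BITS.items():
        gb = footprint_gb(n, bits)
        print(f"{name:30s} {gb:8.0f} GB  fits={fits_on_node(n, bits)}")
```

Under these assumptions, only the Orbit row (about 525 GB) stays below the 1,536 GB of a single node; the 16-bit-per-parameter recipe already needs about 2,060 GB for weights and gradients alone.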
Kimi-K2.6 1T: stable RL on a trillion-parameter model#
Our first end-to-end result is a 200-step RL run on Kimi-K2.6 (~1.03T) [Kimi K2.6: Advancing Open-Source Coding (Moonshot AI, 2026)] using an INT4 frozen base with a BF16 adapter, run entirely on a single 8×B200 node. Reward, eval accuracy, and training-set pass@k all rise; the train–rollout log-prob diff stays bounded throughout the run.
| Setting | Value |
|---|---|
| Model | Kimi-K2.6 ~1.03T |
| Hardware | single 8×B200 |
| Training precision | INT4 base + BF16 adapter |
| Rollout precision | same INT4 base + BF16 adapter |
Figure: the train–rollout log-prob difference stays within 0.0066–0.0104 with no drift.
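The train–rollout log-prob difference tracked throughout this post can be computed per batch roughly as follows. This is a minimal sketch that assumes the metric is the mean absolute difference between the per-token log-probabilities that the trainer and the rollout engine assign to the same sampled tokens; the exact definition used by Orbit is not stated here, and `mean_abs_logprob_diff` is a hypothetical helper name.

```python
from typing import Sequence


def mean_abs_logprob_diff(train_logprobs: Sequence[float],
                          rollout_logprobs: Sequence[float]) -> float:
    """Mean |log p_train(token) - log p_rollout(token)| over sampled tokens."""
    if len(train_logprobs) != len(rollout_logprobs):
        raise ValueError("both sequences must cover the same tokens")
    if not train_logprobs:
        raise ValueError("need at least one token")
    total = sum(abs(t - r) for t, r in zip(train_logprobs, rollout_logprobs))
    return total / len(train_logprobs)


# Toy values for illustration only (not measurements from any run).
train = [-1.20, -0.35, -2.10, -0.05]
rollout = [-1.21, -0.34, -2.09, -0.06]
diff = mean_abs_logprob_diff(train, rollout)
```

Because training and rollout share the same low-precision base and adapter, a value of this kind that stays in a narrow band over the run is the signal that the two engines agree.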
DeepSeek V4-Flash: not Kimi-only#
To show that deployment-aligned low-precision RL is not specific to one model family, we ran the same recipe on DeepSeek V4-Flash [DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence (DeepSeek AI, 2026)] with an FP4 base. Reward, eval, and pass@k all improve; log-prob diff is again bounded.
| Setting | Value |
|---|---|
| Model | DeepSeek V4-Flash |
| Hardware | single 8×B200 |
| Training precision | FP4 base + BF16 adapter |
| Rollout precision | same FP4 base + BF16 adapter |
Figure: the train–rollout log-prob difference stays within 0.0161–0.0247 with no drift over 160 steps.
System validation on DeepSeek V4-Pro-1.6T#
DeepSeek V4-Pro (~1.59T) [DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence (DeepSeek AI, 2026)] pushes Orbit into MoE territory at the ~1.6T-parameter scale. We treat this section as a system capability proof, not a quality benchmark: the base model is already very strong, and the present RL data cannot effectively improve its accuracy further. What matters here is that a 1.6T LLM fits on a single 8×B200 node, RL post-training stays stable for the whole run, and the train-rollout log-prob difference remains small.
(a) Train–rollout log-prob difference: 0.030–0.040 with no drift at 1.6T scale.
(b) pass@k during the validation run.
(c) GPU memory bounded; mean ≈ 156 GB, peak ≈ 188 GB / 192 GB: 1.6T fits and stays bounded on a single 8×B200 node.
Why Orbit uses OFT#
Our previous PEFT-Arena benchmark [PEFT-Arena: Understanding Parameter-Efficient Finetuning from a Stability-Plasticity Perspective (Huang, et al., preprint 2026)] performed a comprehensive evaluation of existing PEFT methods. The following figure shows that OFT achieves better performance than full-finetuning and LoRA, on both downstream adaptation and pretraining knowledge preservation. Most importantly, we find that OFT consistently outperforms full-parameter finetuning in the RLVR setting, while LoRA does not. OFT also works well as a supervised finetuning method. These results motivate us to use OFT as Orbit's default PEFT method.
To compare OFT and LoRA in a larger-scale setting, we use Qwen3-30B-A3B (MoE) as the base model and perform RL with OFT or LoRA under the same parameter budget, in the standard BF16 regime. Specifically, we compare LoRA r16/a32 against OFT b64 under the PEFT-Arena train–eval setting. OFT outperforms LoRA on both training reward and eval accuracy, while its log-prob difference stays at the same scale as LoRA's.
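As a sanity check on the "same parameter budget" comparison, the sketch below counts trainable parameters for one adapted linear layer. It reads "r16/a32" as LoRA rank 16 with alpha 32 (alpha is a scaling factor and adds no parameters) and "b64" as an OFT block size of 64, and it assumes each OFT block is parameterized by a skew-symmetric matrix, as in the Cayley parameterization of the OFT papers. The 4096×4096 layer shape and the function names are hypothetical.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def oft_params(d_in: int, block_size: int) -> int:
    """Block-diagonal OFT on the input dimension.

    Assumes d_in / block_size blocks, each generated from a skew-symmetric
    matrix with block_size * (block_size - 1) / 2 free entries.
    """
    if d_in % block_size != 0:
        raise ValueError("block_size must divide d_in")
    num_blocks = d_in // block_size
    return num_blocks * block_size * (block_size - 1) // 2


D = 4096  # hypothetical square projection, for illustration only
lora = lora_params(D, D, rank=16)   # 16 * (4096 + 4096) = 131072
oft = oft_params(D, block_size=64)  # 64 blocks * 2016 = 129024
```

For a square layer the two counts are 32·d and 31.5·d, so under these assumptions the two adapters sit within about 2% of each other.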
Beyond accuracy, OFT also has two system-level properties that make it a better fit than LoRA for adapter-first RL on frontier-scale models:
- Zero additional network communication overhead when serving OFT. Serving LoRA in a distributed setting incurs extra communication that scales as 5(N−1)Br / N. Optimized approaches like S-LoRA mitigate this cost by fusing the adapter's all-reduce steps with those of the base model, whereas OFT's localized matrix design needs no extra cross-device reductions at all. OFT therefore avoids this “communication wall” entirely, which makes it faster than LoRA during distributed inference.
- Custom kernels for split-projection adapters on fused layers. Modern high-throughput transformers fuse logically independent projections (Q/K/V in attention, gate/up in SwiGLU MLPs) into single GEMMs to saturate tensor cores and memory bandwidth. Naively attaching an adapter to such a fused tensor shares one set of adapter parameters across distinct subspaces, degrading quality; the typical LoRA workaround couples the two halves through a shared low-rank A, effectively wasting half of the adapter capacity. We engineered custom split-projection kernels that natively apply separate orthogonal rotations to each subspace. Taking Kimi as an example: for inference, fused INT4 weights are dynamically sliced and repacked at load time to run as precise half-size GEMMs, avoiding discarded compute and cutting kernel time by ≈35%. On the training side, a split-aware Triton dequantization kernel paired with custom autograd hooks minimizes memory overhead.
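The LoRA communication term quoted in the first item above can be evaluated directly. The sketch assumes the usual reading of the symbols in the S-LoRA analysis (N devices, B tokens in the batch, r the adapter rank) and reports the result as a number of communicated elements; both the symbol meanings and the unit are our reading rather than something stated in this post, and `lora_extra_comm_elements` is a hypothetical name.

```python
def lora_extra_comm_elements(num_devices: int, num_tokens: int, rank: int) -> float:
    """Extra communication attributed to LoRA: 5 (N - 1) B r / N."""
    if num_devices < 1:
        raise ValueError("need at least one device")
    return 5 * (num_devices - 1) * num_tokens * rank / num_devices


# Illustrative setting: 8 devices, 4096 tokens, rank 16.
cost = lora_extra_comm_elements(num_devices=8, num_tokens=4096, rank=16)

# On a single device the term vanishes, as expected.
single = lora_extra_comm_elements(num_devices=1, num_tokens=4096, rank=16)
```

The term grows linearly with both the token count and the rank, whereas the post states that OFT adds no cross-device reduction at all.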
Technical details of Orbit#
PEFT adapters as first-class citizens in RL infrastructure#
The architectural shift from full-parameter to adapter-only updates is what unlocks single-node trillion-scale RL. Five properties of the adapter-first design matter most:
- Trivial weight sync. Only megabyte-scale adapter deltas move between the trainer and the rollout engines; there are no gigabyte-scale weight exports, format conversions, or serving rebuilds on the hot path.
- Persistent rollout servers. Because the base model never changes, inference engines stay constantly active across the training lifecycle. The cyclic engine restart between generation and optimization disappears.
- Viable trillion-parameter scaling. The memory-heavy optimizer state is bounded to the tiny adapter while the frozen base rests in low-precision memory; massive MoE models become single-node-feasible.
- Free reference policy. The KL-divergence reference simply evaluates the frozen base without the adapter applied, instantly recovering the memory normally wasted on a separate snapshot of the actor.
- Adapter-native async with double-buffered rollout. Orbit's versioned adapter runtime makes async RL cheap. Each adapter delta is stamped with a monotonic version; SGLang streams the next version into an inactive slot while the active slot keeps serving in-flight requests, and activation flips atomically, leaving zero rollout bubble. On Qwen3-4B + OFT (8×B200, TP=2), this yields a 1.42× step-time speedup and 44% higher rollout throughput, with eval accuracy unchanged within run-to-run noise.
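The double-buffered, versioned adapter idea in the last item can be sketched in a few lines. This is an illustrative toy, not Orbit's runtime and not an SGLang API: `AdapterDelta` and `DoubleBufferedAdapter` are hypothetical names, the adapter payload is a placeholder byte string, and a real system would also wait for in-flight requests on the old slot to drain before overwriting it.

```python
import threading
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AdapterDelta:
    version: int    # monotonically increasing stamp
    weights: bytes  # stand-in for the real adapter tensors


class DoubleBufferedAdapter:
    """Two slots: one serves requests, the other receives the next version."""

    def __init__(self, initial: AdapterDelta) -> None:
        self._slots: List[Optional[AdapterDelta]] = [initial, None]
        self._active = 0
        self._lock = threading.Lock()

    def acquire(self) -> AdapterDelta:
        """A request pins whichever adapter is active when it starts."""
        with self._lock:
            return self._slots[self._active]

    def stage(self, delta: AdapterDelta) -> None:
        """Load a newer version into the inactive slot; serving continues."""
        with self._lock:
            if delta.version <= self._slots[self._active].version:
                raise ValueError("adapter versions must increase monotonically")
            self._slots[1 - self._active] = delta

    def flip(self) -> int:
        """Atomically switch new requests to the staged version."""
        with self._lock:
            staged = self._slots[1 - self._active]
            if staged is None or staged.version <= self._slots[self._active].version:
                raise RuntimeError("no newer adapter has been staged")
            self._active = 1 - self._active
            return staged.version


runtime = DoubleBufferedAdapter(AdapterDelta(version=0, weights=b"v0"))
in_flight = runtime.acquire()                           # request starts on v0
runtime.stage(AdapterDelta(version=1, weights=b"v1"))   # trainer pushes v1
runtime.flip()                                          # new requests see v1
```

The request that started before the flip keeps its reference to version 0, while requests that start afterwards pick up version 1, so generation never pauses for a weight update.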
Low-precision MoE expert modules#
Active-expert-chunked dequantization dynamically groups only the router-selected experts into fixed-size batches, temporarily dequantizing them for grouped GEMMs and immediately releasing the high-precision weights. The hybrid reaches the throughput of fully grouped matrix multiplications while strictly bounding peak transient memory to a single small chunk — preventing OOM on massive low-precision MoE models during training.
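The chunking logic described above can be sketched as follows. This is a toy model under loud assumptions: each expert is reduced to a single scalar weight stored as an integer code plus a scale, the "grouped GEMM" is a scalar multiply, router gate weights are omitted, and all names (`dequantize`, `active_expert_chunks`, `moe_forward`) are hypothetical. The point it illustrates is only the control flow: dequantize the router-selected experts one fixed-size chunk at a time, use them, and release them.

```python
from typing import Dict, List, Sequence, Tuple

# Toy quantized expert: (integer code, scale). Real experts are weight matrices.
QuantExpert = Tuple[int, float]


def dequantize(expert: QuantExpert) -> float:
    code, scale = expert
    return code * scale


def active_expert_chunks(routing: Sequence[Sequence[int]],
                         chunk_size: int) -> List[List[int]]:
    """Group only the router-selected experts into fixed-size chunks."""
    active = sorted({e for token_experts in routing for e in token_experts})
    return [active[i:i + chunk_size] for i in range(0, len(active), chunk_size)]


def moe_forward(inputs: Sequence[float],
                routing: Sequence[Sequence[int]],
                experts: Dict[int, QuantExpert],
                chunk_size: int) -> Tuple[List[float], int]:
    """Return per-token outputs and the peak number of dequantized experts."""
    outputs = [0.0] * len(inputs)
    peak_live = 0
    for chunk in active_expert_chunks(routing, chunk_size):
        # Transient high-precision copies exist for this chunk only.
        live = {e: dequantize(experts[e]) for e in chunk}
        peak_live = max(peak_live, len(live))
        for t, token_experts in enumerate(routing):
            for e in token_experts:
                if e in live:
                    outputs[t] += inputs[t] * live[e]  # stand-in for grouped GEMM
        del live  # release the dequantized weights before the next chunk
    return outputs, peak_live


experts = {0: (3, 0.5), 1: (-2, 0.5), 2: (7, 0.25),
           3: (1, 1.0), 4: (4, 0.5), 5: (-8, 0.125)}
routing = [[0, 2], [2, 5], [0, 5]]  # experts chosen per token
inputs = [1.0, 2.0, -1.0]
outputs, peak_live = moe_forward(inputs, routing, experts, chunk_size=2)
```

In this toy run only experts 0, 2, and 5 are ever dequantized, and never more than `chunk_size` of them at once, which is the memory bound the paragraph above describes.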
DeepSeek-V4 specific optimizations#
Orbit adds full CUDA-graph decoding support, DeepGEMM + DeepEP V2 integration, and hand-written TileLang / Triton / CUDA kernels for attention backward and fusion. An adapter-aware bypass-base-weight-grad GEMM backward kernel exploits the fact that the base is frozen. A 6-layer debug model with debug ops enabled lands at a 1e-5-level log-prob diff; the step-0 log-prob diff in real V4-Flash training is 0.02.
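The idea behind skipping the base-weight gradient can be shown on a plain linear layer y = x·W. The backward pass normally produces both the input gradient and the weight gradient; with a frozen base only the former is needed, so the second GEMM can be dropped. The sketch below is a pure-Python illustration of that observation, not Orbit's kernel (which runs on the GPU against quantized weights); `linear_backward` and its `base_frozen` flag are hypothetical.

```python
from typing import List, Optional, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def linear_backward(x: Matrix, w: Matrix, grad_y: Matrix,
                    base_frozen: bool = True) -> Tuple[Matrix, Optional[Matrix]]:
    """Backward of y = x @ w. Returns (grad_x, grad_w)."""
    # Always needed: gradients must still flow to adapters in earlier layers.
    grad_x = matmul(grad_y, transpose(w))
    # Only needed when the base weight is trainable; skipped when frozen.
    grad_w = None if base_frozen else matmul(transpose(x), grad_y)
    return grad_x, grad_w


x = [[1.0, 2.0]]                          # one token, two input features
w = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]    # frozen base weight (2 x 3)
grad_y = [[1.0, 1.0, 1.0]]                # upstream gradient

grad_x, grad_w = linear_backward(x, w, grad_y)  # frozen base: no grad_w
_, full_grad_w = linear_backward(x, w, grad_y, base_frozen=False)
```

Skipping the weight-gradient product removes one of the two backward GEMMs for every frozen layer and avoids allocating a gradient buffer the size of the base weight.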
Authors#
Zeju Qiu*+, Le Chen*+, Lixin Liu*+, Tim Z. Xiao+, Yao Feng, Yangyi Huang, Zhen Liu, Han Shi, Yandong Wen, Zhouliang Yu, Bernhard Schölkopf, Weiyang Liu+.
The Chinese University of Hong Kong · Max Planck Institute for Intelligent Systems
Citation#
Please cite this work as:
@article{spherelab2026orbit,
author = {Qiu, Zeju and Chen, Le and Liu, Lixin and Xiao, Tim Z.
and Feng, Yao and Huang, Yangyi and Liu, Zhen and Shi, Han
and Wen, Yandong and Yu, Zhouliang and Sch{\"o}lkopf, Bernhard
and Liu, Weiyang},
title = {Orbit: Stable and Efficient Reinforcement Learning for Trillion-Parameter LLMs},
journal = {SphereLab Blog},
year = {2026},
note = {https://spherelab.ai/orbit}
}