Orbit rollout architectures

One axis, five panels on one shared clock. Start fully serial. A detour panel tests "is the heavy push the bottleneck?" by swapping it for an adapter push while keeping the loop serial; the answer turns out to be "not really". Then add overlap (async), adapter-native push, and finally a non-blocking double-buffered swap.


Baseline

Sync RL — serial, full-weight

The naive loop: generate a full batch, train, push the full model, repeat — with no overlap. The two GPU pools take turns, so each sits idle while the other works. And a batch can't close until its slowest sample finishes, so fast slots stall at the barrier (hatched). That straggler waste, plus the serial structure, is the motivation for everything that follows. Rollouts are on-policy (staleness 0).
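The straggler cost described above can be put in numbers with a small sketch. The per-slot durations below are made up for illustration and are not measurements from this page; `barrier_waste` is a hypothetical helper, not part of Orbit.

```python
def barrier_waste(slot_durations):
    """Return (batch_time, wasted_slot_seconds) for one synchronous batch.

    The batch closes only when the slowest sample finishes, so every
    faster slot idles for (batch_time - its own duration).
    """
    batch_time = max(slot_durations)
    wasted = sum(batch_time - d for d in slot_durations)
    return batch_time, wasted


# Hypothetical per-slot generation times in seconds (illustrative only).
durations = [2.0, 3.0, 8.0, 4.0]
batch_time, wasted = barrier_waste(durations)

# Fraction of slot-seconds spent generating rather than waiting.
utilization = sum(durations) / (batch_time * len(durations))
print(batch_time, wasted, round(utilization, 3))
```

With these illustrative numbers a single 8 s straggler leaves the four slots busy only about half the time, which is the hatched region in the panel.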

Legend: gen @ v0, gen @ v1, gen @ v2; gradient step; queue depth (▼ in, ▲ out); full-model weight push; waiting on straggler (wasted).

Weight push payload: full model. Staleness: 0 (on-policy).

No direct Qwen3-4B measurement: the full-weight push was not benchmarked in this batch. Implied lower bound: step time ≥ 8.651 s (the adapter-native sync step time), since this path pays the same serial loop cost plus a much larger push.

detour · swap push payload

Adapter-native sync — serial, adapter-only

Hypothesis: maybe the heavy full-model push is what makes Sync RL slow, so swap it for a tiny OFT adapter push and keep everything else the same. The push does get cheap (warm update_weights ≈ 0.105 s), but the loop is still serial: the trainer waits for rollout generation, offload, onload, and the adapter sync before it can step. Warm train wait is still ≈ 7.142 s and warm step time is ≈ 8.651 s, so even with a near-free push the trainer spends most of each step waiting. The swap is not a meaningful improvement on its own, because the push wasn't the bottleneck; the serial loop was. This is the panel that motivates overlap (the next step).

Legend: gen @ v0, gen @ v1, gen @ v2; gradient step; queue depth (▼ in, ▲ out); adapter push (tiny); waiting on straggler (wasted).

Weight push payload: adapter only. Staleness: 0 (on-policy).

Measured (Qwen3-4B OFT, 4-GPU colocated sync, warm rollouts 1–4): update_weights 0.105 s (small ✓); train wait 7.142 s (still huge); step time 8.651 s.
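To see why the push is not the bottleneck, it helps to express the measured components as shares of the step. The sketch below uses only the three warm numbers quoted above. It assumes update_weights is a serial, non-overlapped part of the step, which matches the serial loop described here but is not stated explicitly in the measurements.

```python
# Warm measurements quoted above (Qwen3-4B OFT, 4-GPU colocated sync).
update_weights_s = 0.105
train_wait_s = 7.142
step_time_s = 8.651

push_share = update_weights_s / step_time_s
wait_share = train_wait_s / step_time_s

# Assumption: the push is serial, so a perfectly free push would remove
# at most its own duration from the step.
step_without_push_s = step_time_s - update_weights_s

print(f"push share of step:       {push_share:.1%}")
print(f"train wait share of step: {wait_share:.1%}")
print(f"step time with free push: {step_without_push_s:.3f} s")
```

Under that assumption the push accounts for roughly 1% of the step while train wait accounts for more than 80%, so making the push cheaper still cannot move the step time much.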

+ overlap (async)

Async RL — still full-weight

A background producer keeps generation in flight while the trainer takes gradient steps — they now overlap, so the trainer stays busy and slots never wait at a barrier. But the push is still the full model, so each update_weights() forces a long pause where the engine can't accept work. Pushes are too expensive to do often, so the engine lags the trainer by several versions (v0 then jumps to v2). That gap between the gen colors and the current version is staleness (≈ 2 here).
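Staleness as used above is the gap between the trainer's current version and the version a rollout was generated with. The following is a minimal sketch of that bookkeeping under an assumed schedule (a push every second gradient step). The `Rollout` class, `staleness` helper, and `push_every` value are all hypothetical and chosen only to reproduce the v0 → v2 jump in the panel.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    gen_version: int  # policy version the engine held when generation started


def staleness(rollout: Rollout, trainer_version: int) -> int:
    """How many versions the trainer is ahead of the rollout's policy."""
    return trainer_version - rollout.gen_version


# Hypothetical schedule: the trainer advances one version per tick, but a
# full-model push is only affordable every second step.
push_every = 2
engine_version = 0
observed = []
for trainer_version in range(1, 5):
    # Rollouts consumed now were generated with the engine's current
    # (possibly old) weights.
    observed.append(staleness(Rollout(engine_version), trainer_version))
    if trainer_version % push_every == 0:
        engine_version = trainer_version  # push lands: v0 -> v2 -> v4

print(observed)
```

In this toy schedule the engine never serves v1 or v3, and staleness peaks at 2 just before each push lands.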

Legend: gen @ v0, gen @ v1, gen @ v2; gradient step; queue depth (▼ in, ▲ out); full-model weight push; engine paused.

Weight push payload: full model. Staleness: ≈ 2 (engine lags).

No direct Qwen3-4B measurement: the full-weight push was not benchmarked in this batch. Expected: better than sync thanks to overlap, but each push would still pause generation for on the order of seconds (vs the 0.169 s adapter push), so the engine lags the trainer by several versions.

+ adapter-native

Adapter-native async — single slot

Only LoRA/OFT deltas cross the wire, so the push is tiny. With a single adapter slot, UpdateWeightFromTensor.update_weights still has to pause_generation → flush_cache → overwrite → continue_generation, but the stall is brief instead of a full-model pause. Cheap syncs can run every step, so the engine keeps up with the trainer and rollouts stay fresh (staleness ≈ 1).
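The single-slot sequence named above can be sketched against a mock engine. Only the four step names come from the text. `MockEngine`, `overwrite_adapter`, and `update_weights_single_slot` are hypothetical stand-ins, not Orbit's or the rollout engine's actual API.

```python
class MockEngine:
    """Stand-in for a rollout engine; records calls instead of doing work."""

    def __init__(self):
        self.adapter = {"version": 0}
        self.accepting = True
        self.calls = []

    def pause_generation(self):
        self.accepting = False
        self.calls.append("pause_generation")

    def flush_cache(self):
        self.calls.append("flush_cache")

    def overwrite_adapter(self, new_adapter):
        # Single slot: the live adapter is replaced in place, which is why
        # generation has to be paused first.
        assert not self.accepting, "overwriting while serving would mix versions"
        self.adapter = dict(new_adapter)
        self.calls.append("overwrite")

    def continue_generation(self):
        self.accepting = True
        self.calls.append("continue_generation")


def update_weights_single_slot(engine, new_adapter):
    """pause -> flush -> overwrite -> resume, as described in the panel."""
    engine.pause_generation()
    try:
        engine.flush_cache()
        engine.overwrite_adapter(new_adapter)
    finally:
        engine.continue_generation()  # always resume, even if the update fails


engine = MockEngine()
update_weights_single_slot(engine, {"version": 1})
print(engine.calls, engine.adapter["version"], engine.accepting)
```

The whole overwrite sits inside the paused window, so the stall scales with the adapter payload. That is short for LoRA/OFT deltas, but it is still a stall.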

Legend: gen @ v0, gen @ v1, gen @ v2, gen @ v3, gen @ v4; gradient step; queue depth (▼ in, ▲ out); adapter overwrite; brief engine pause.

Weight push payload: adapter only. Adapter swap: pause → overwrite → resume.

Measured (Qwen3-4B OFT, 2+2 GPU async single-slot, warm rollouts 1–4): update_weights 0.169 s; train wait 2.048 s (overlap helps); rollout throughput 1402.9 tok/GPU/s; step time 3.165 s (vs 8.651 s for adapter-native sync).

+ double-buffer

Adapter-native async + double-buffered rollout

All three changes stacked: overlap, adapter-native push, and double-buffering. The new adapter is broadcast into the inactive slot while the active slot keeps serving in-flight requests, so the slow part of the sync no longer blocks generation. Orbit still wraps the swap with the standard updater lifecycle (pause_generation → flush_cache → activate → continue_generation), so there is a brief pause at the flip point, but the activation itself is essentially instant because no tensors cross the wire. Measured: warm update_weights drops from 0.169 s (single-slot) to 0.118 s (double-buffer), a ~30% reduction. The knock-on effect matters more: warm rollout throughput rises 50.2% and warm train wait falls 81.2%. Generations that span a flip keep their start version (older color); that residual is the staleness the fully_async metrics track (≈ 1).
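The stage-then-flip idea can be illustrated with a two-slot buffer. This is a minimal sketch of generic double buffering, not Orbit's implementation: `DoubleBufferedAdapter` and its methods are hypothetical names, and the real broadcast, flush, and pause steps are omitted.

```python
class DoubleBufferedAdapter:
    """Two adapter slots: one serves requests, the other receives updates."""

    def __init__(self, initial):
        self.slots = [dict(initial), None]
        self.active = 0
        self._staged = False

    @property
    def inactive(self):
        return 1 - self.active

    def stage(self, new_adapter):
        # The expensive part (the broadcast) writes only to the inactive
        # slot, so the active slot can keep serving in the meantime.
        self.slots[self.inactive] = dict(new_adapter)
        self._staged = True

    def activate(self):
        # The flip is just an index change; no tensors move here.
        if not self._staged:
            raise RuntimeError("no staged adapter to activate")
        self.active = self.inactive
        self._staged = False

    def current(self):
        return self.slots[self.active]


buf = DoubleBufferedAdapter({"version": 0})
in_flight_version = buf.current()["version"]    # a request that started at v0
buf.stage({"version": 1})
served_during_stage = buf.current()["version"]  # active slot is unchanged
buf.activate()
served_after_flip = buf.current()["version"]
print(in_flight_version, served_during_stage, served_after_flip)
```

The request that started before the flip keeps its start version, which is the residual staleness described above. New requests see the new adapter as soon as the index flips.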

Legend: gen @ v0, gen @ v1, gen @ v2, gen @ v3, gen @ v4; gradient step; queue depth (▼ in, ▲ out); broadcast to inactive slot; atomic activate (flip); brief pause (flush + activate).

Weight push payload: adapter only. Adapter swap: broadcast staged + brief pause for flush/activate.

Measured (Qwen3-4B OFT, 2+2 GPU async double-buffer, warm rollouts 1–4): update_weights 0.118 s (−30.4%); train wait 0.384 s (−81.2%); rollout throughput 2106.9 tok/GPU/s (+50.2%); step time 2.531 s (−20.0%).
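The relative changes can be recomputed from the rounded single-slot and double-buffer numbers quoted on this page. The dictionary keys below are illustrative labels, not metric names from Orbit.

```python
# Warm measurements quoted above (Qwen3-4B OFT, 2+2 GPU async).
single_slot = {
    "update_weights_s": 0.169,
    "train_wait_s": 2.048,
    "throughput_tok_gpu_s": 1402.9,
    "step_time_s": 3.165,
}
double_buffer = {
    "update_weights_s": 0.118,
    "train_wait_s": 0.384,
    "throughput_tok_gpu_s": 2106.9,
    "step_time_s": 2.531,
}


def pct_change(old, new):
    """Signed percentage change from old to new."""
    return (new - old) / old * 100.0


changes = {k: pct_change(single_slot[k], double_buffer[k]) for k in single_slot}
for name, delta in changes.items():
    print(f"{name}: {delta:+.2f}%")
```

The rounded inputs reproduce the train wait, throughput, and step time deltas to within rounding. For update_weights they give about −30.2% rather than the reported −30.4%, presumably because the reported figure was computed from unrounded timings; that explanation is an assumption, not something stated in the source.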