One axis, five panels on one shared clock. Start fully serial. A detour panel asks "is the heavy push the bottleneck?" by swapping it for an adapter push while keeping the loop serial; the answer is no, because the serial loop itself dominates. Then add overlap (async), adapter-native push, and finally a non-blocking double-buffered swap. Press play.
The naive loop: generate a full batch, train, push the full model, repeat — with no overlap. The two GPU pools take turns, so each sits idle while the other works. And a batch can't close until its slowest sample finishes, so fast slots stall at the barrier (hatched). That straggler waste, plus the serial structure, is the motivation for everything that follows. Rollouts are on-policy (staleness 0).
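To put a number on the straggler waste described above, here is a minimal sketch. The per-sample times are hypothetical, not measurements from this run, and the model assumes every slot is held until the slowest sample in the batch finishes.

```python
def barrier_idle_fraction(sample_times):
    """Fraction of total slot-time spent idle at the batch barrier.

    Assumes one sample per slot and that no slot is reused until the
    slowest sample in the batch has finished.
    """
    if not sample_times:
        raise ValueError("need at least one sample time")
    batch_time = max(sample_times)              # the barrier closes with the slowest sample
    busy = sum(sample_times)                    # slot-time spent actually generating
    total = batch_time * len(sample_times)      # slot-time reserved for the batch
    return (total - busy) / total


# Hypothetical generation times in seconds: three fast samples and one straggler.
example = [2.0, 2.5, 3.0, 8.0]
print(f"idle at barrier: {barrier_idle_fraction(example):.1%}")
```

With one straggler at 8 s, more than half of the reserved slot-time in this toy batch is spent waiting rather than generating.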
Hypothesis: maybe the heavy full-model push is what makes Sync RL slow — so swap it for a
tiny OFT adapter push and keep everything else the same. The push does get cheap (warm
update_weights ≈ 0.105 s), but the loop is still serial: the trainer waits for
rollout generation, offload, onload, and the adapter sync before it can step. Warm train wait
stays at ≈ 7.142 s, and warm step time is ≈ 8.651 s — not meaningfully better than
full-weight sync, because the push wasn't the bottleneck; the serial loop was. This is the
panel that motivates overlap (the next step).
update_weights 0.105 s (small ✓)
Train wait 7.142 s (still huge)
Step time 8.651 s
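The arithmetic behind "the push wasn't the bottleneck" can be checked directly from the three warm numbers above. This sketch assumes both update_weights and train wait are counted inside the step time, which the panel does not state explicitly.

```python
# Warm measurements quoted in the panel above, in seconds.
step_time = 8.651
update_weights = 0.105
train_wait = 7.142

# Assumption: both quantities are contained in the step time.
push_share = update_weights / step_time
wait_share = train_wait / step_time
best_case_step = step_time - update_weights   # step time if the push cost nothing

print(f"push share of step:  {push_share:.1%}")
print(f"train-wait share:    {wait_share:.1%}")
print(f"step with free push: {best_case_step:.3f} s")
```

Under that assumption the push is roughly one percent of the step, so even a free push would barely move the step time, while the trainer's wait accounts for most of it.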
A background producer keeps generation in flight while the trainer takes gradient steps — they
now overlap, so the trainer stays busy and slots never wait at a barrier. But the push is
still the full model, so each update_weights() forces a long pause where the
engine can't accept work. Pushes are too expensive to do often, so the engine falls more
than one version behind the trainer (it serves v0, then jumps straight to v2). The gap between
the generation colors and the current trainer version is the staleness (≈ 2 here).
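The relationship between push frequency and staleness can be illustrated with a toy model. It is a sketch under simplifying assumptions, not the metric the run reports: the trainer produces one new version per step, the engine only picks up a version when a push lands, and the gap is read just before each push. The function name and the push_every parameter are hypothetical.

```python
def staleness_trace(num_steps, push_every):
    """Return (trainer_version, engine_version, gap) for each trainer step.

    Toy model: the gap is measured after the gradient step and before any
    push that lands on that step.
    """
    if push_every < 1:
        raise ValueError("push_every must be >= 1")
    trace = []
    engine_version = 0
    for step in range(1, num_steps + 1):
        trainer_version = step                    # one new policy version per step
        trace.append((trainer_version, engine_version,
                      trainer_version - engine_version))
        if step % push_every == 0:
            engine_version = trainer_version      # the expensive push finally lands
    return trace


# Pushing every second step: the engine serves v0, then jumps to v2.
for trainer_v, engine_v, gap in staleness_trace(4, push_every=2):
    print(f"trainer v{trainer_v}  engine v{engine_v}  gap {gap}")
```

In this model, pushing every second step lets the gap reach 2, while pushing every step holds it at 1.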
Only LoRA/OFT deltas cross the wire, so the push is tiny. With a single adapter slot,
UpdateWeightFromTensor.update_weights still has to pause_generation →
flush_cache → overwrite → continue_generation, but the stall is brief instead of a
full-model pause. Cheap syncs can run every step, so the engine keeps up with the trainer and
rollouts stay fresh (staleness ≈ 1).
update_weights 0.169 s
Train wait 2.048 s (overlap helps)
Rollout throughput 1402.9 tok/GPU/s
Step time 3.165 s (vs 8.651 s for the serial adapter-push loop)
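The single-slot lifecycle named above (pause_generation → flush_cache → overwrite → continue_generation) can be sketched as follows. ToyEngine and overwrite_adapter are hypothetical stand-ins for illustration; only the step names come from the panel text, and this is not the actual UpdateWeightFromTensor implementation.

```python
class ToyEngine:
    """Hypothetical stand-in for an inference engine with one adapter slot."""

    def __init__(self):
        self.adapter = {}
        self.paused = False
        self.events = []

    def pause_generation(self):
        self.paused = True
        self.events.append("pause_generation")

    def flush_cache(self):
        # Cached state computed under the old adapter is no longer valid.
        self.events.append("flush_cache")

    def overwrite_adapter(self, new_adapter):
        # Single slot: the live adapter is replaced in place, so generation
        # must be paused while this runs.
        self.adapter = dict(new_adapter)
        self.events.append("overwrite")

    def continue_generation(self):
        self.paused = False
        self.events.append("continue_generation")


def update_weights_single_slot(engine, new_adapter):
    """Run the four-step lifecycle; always resume generation afterwards."""
    engine.pause_generation()
    try:
        engine.flush_cache()
        engine.overwrite_adapter(new_adapter)
    finally:
        engine.continue_generation()


engine = ToyEngine()
update_weights_single_slot(engine, {"version": 1})
print(" -> ".join(engine.events))
```

The try/finally is a design choice in this sketch so that a failed overwrite cannot leave the engine paused.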
All three stacked: async overlap, adapter-native push, and a double-buffered swap. The new
adapter is broadcast into the inactive slot while the active
slot keeps serving in-flight requests, so the long sync no longer blocks generation. Orbit
still wraps the swap with the standard updater lifecycle —
pause_generation → flush_cache → activate → continue_generation — so there is a
brief pause at the flip point, but the activation itself is essentially instant (no
tensors crossing the wire). Measured: warm update_weights drops from
0.169 s (single-slot) to 0.118 s (double-buffer), a ~30% reduction, and the
knock-on effect is what matters: warm rollout throughput climbs 50.2% and warm train
wait drops 81.2%. Generations that span a flip keep their start version (older color); that
residual is the staleness the fully_async metrics track (≈ 1).
update_weights 0.118 s (−30.4%)
Train wait 0.384 s (−81.2%)
Rollout throughput 2106.9 tok/GPU/s (+50.2%)
Step time 2.531 s (−20.0%)
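The double-buffered swap described above can be sketched with two adapter slots and an index that flips. DoubleBufferedEngine, load_inactive, and activate are hypothetical names for illustration; the sketch only mirrors the lifecycle from the panel text and is not Orbit's implementation.

```python
class DoubleBufferedEngine:
    """Hypothetical toy engine with two adapter slots; not a real library API."""

    def __init__(self, initial_adapter):
        self.slots = [dict(initial_adapter), {}]
        self.active = 0
        self.paused = False
        self.events = []

    @property
    def active_adapter(self):
        return self.slots[self.active]

    def load_inactive(self, new_adapter):
        # The slow part: weights land in the slot nobody is reading from,
        # so generation is never paused here.
        self.slots[1 - self.active] = dict(new_adapter)
        self.events.append("load_inactive")

    def activate(self):
        # Brief pause at the flip point; only an index changes, no tensors move.
        self.paused = True
        self.events.append("pause_generation")
        try:
            self.events.append("flush_cache")
            self.active = 1 - self.active
            self.events.append("activate")
        finally:
            self.paused = False
            self.events.append("continue_generation")


engine = DoubleBufferedEngine({"version": 1})
engine.load_inactive({"version": 2})       # in-flight requests still see version 1
print(engine.active_adapter)
engine.activate()                          # short pause, then version 2 is live
print(engine.active_adapter)
```

Separating the slow load from the fast flip is what lets the active slot keep serving during the transfer in this model.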