SphereLab · Technical Report

Pion: A Spectrum-Preserving Optimizer
via Orthogonal Equivalence Transformation

An LLM optimizer that updates each weight matrix through coupled left and right orthogonal transformations — preserving its singular spectrum throughout training.

Kexuan Shi1,†, Hanxuan Li1,†, Zeju Qiu1,2, Yandong Wen3, Simon Buchholz2, Weiyang Liu1,*
1The Chinese University of Hong Kong · 2Max Planck Institute for Intelligent Systems · 3Westlake University
† Equal contribution · * Corresponding author
POET vs Pion comparison
Pion vs. POET. POET reparameterizes each weight as \(\mathbf{W} = \mathbf{R}\mathbf{W}_0\mathbf{P}\) and trains the orthogonal factors instead of the weights. Pion removes the reparameterization and instead applies left and right orthogonal transformations directly to \(\mathbf{W}_t\), preserving its singular-value spectrum without introducing auxiliary parameters.

Abstract

A new geometric route to stable LLM training.

TL;DR Pion treats each weight matrix as living on an iso‑spectral manifold and updates it via coupled left and right orthogonal transformations. The spectrum is preserved by construction, μP-compatibility follows naturally, and the optimizer remains competitive with AdamW and Muon across pretraining, supervised finetuning, and RLVR.

We introduce Pion, a spectrum-preserving optimizer for large language model training based on orthogonal equivalence transformations. Unlike additive optimizers such as AdamW and Muon, Pion updates each weight matrix through left and right orthogonal transformations, preserving its singular values throughout training. This yields an optimization mechanism that modulates the geometry of weight matrices while keeping their spectral norm fixed. We derive Pion's update rule, systematically examine its design choices, and analyze its convergence behavior along with several key properties. Empirical results show that Pion offers a stable and competitive alternative to standard optimizers for both LLM pretraining and finetuning.

Why Pion

Spectrum control as a first-class optimizer property.

Algorithmic spectrum control

The update rule lives on the iso-spectral manifold of W0. Singular values are preserved exactly — no explicit normalization or projection needed.

Minimum hyperspherical energy

Orthogonal equivalence transformations preserve the minimum-energy configuration of zero-mean Gaussian initialization, keeping neurons uniformly distributed on the hypersphere throughout training.

Naturally μP-compatible

Pion satisfies the forward spectral condition by construction. A simple bilateral normalization on the spectral norms of \(\mathbf{G}^{\mathrm{in}}\) and \(\mathbf{G}^{\mathrm{out}}\) delivers learning-rate transfer across model widths.

Stable under stress

Trains successfully on 200-layer networks and on models with no normalization layers — both regimes where AdamW and Muon diverge.

Convergent and well-behaved

We prove an 𝒪(1/√T ) stationarity bound on the iso-spectral manifold under standard L-smoothness and bounded-noise assumptions.

Competitive everywhere

Best average benchmark score on LLaMA-1.3B pretraining, best stability-plasticity tradeoff on SFT, and the strongest RLVR results on Qwen3-1.7B and DeepSeek-R1-Distill.

Method

Two rotations, one identity, zero reparameterization.

The core update rule

For any weight matrix \(\mathbf{W}_t \in \mathbb{R}^{d_{\mathrm{out}}\times d_{\mathrm{in}}}\), we can trivially write \(\mathbf{W}_t = \mathbf{I}_{d_{\mathrm{out}}}\mathbf{W}_t\mathbf{I}_{d_{\mathrm{in}}}\). Geometrically, each identity is the neutral element of an orthogonal group. Pion evolves these identity factors directly on the orthogonal group, which induces left and right orthogonal transformations on \(\mathbf{W}_t\), preserving its singular values.

Pion update rule
\[ \begin{gathered} \begin{aligned} \mathbf{G}_t^{\mathrm{in}} &= \mathbf{W}_t^\top \mathbf{G}_t - (\mathbf{W}_t^\top \mathbf{G}_t)^\top, \qquad \mathbf{G}_t^{\mathrm{out}} = \mathbf{G}_t \mathbf{W}_t^\top - (\mathbf{G}_t \mathbf{W}_t^\top)^\top, \end{aligned}\\[0.65em] \mathbf{W}_{t+1} = \exp(-\eta \mathbf{G}_t^{\mathrm{out}})\, \mathbf{W}_t\, \exp(-\eta \mathbf{G}_t^{\mathrm{in}}). \end{gathered} \]

The skew-symmetrization projects the gradients onto the Lie algebra; the matrix exponential maps them back to the Lie group, producing valid orthogonal transformations. Because \(\mathbf{R}_t = \exp(-\eta\,\mathbf{G}_t^{\mathrm{out}})\) and \(\mathbf{P}_t = \exp(-\eta\,\mathbf{G}_t^{\mathrm{in}})\) are orthogonal, the singular values of \(\mathbf{W}_t\), and hence its spectral and Frobenius norms, are preserved: the update is pure angular motion rather than rescaling.
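
To make the raw rule concrete, here is a minimal PyTorch sketch of a single spectrum-preserving step, with no momentum or second moment; the helper name pion_raw_step and the numerical sanity check are our illustration, not the authors' code.

python
# Minimal sketch of the raw Pion update (no momentum, no second moment),
# following the boxed update rule above. Illustrative only.
import torch

def pion_raw_step(W: torch.Tensor, G: torch.Tensor, eta: float) -> torch.Tensor:
    """One step of W <- exp(-eta * G_out) @ W @ exp(-eta * G_in)."""
    WtG = W.T @ G                               # (d_in, d_in)
    GWt = G @ W.T                               # (d_out, d_out)
    G_in = WtG - WtG.T                          # skew-symmetric, lives in so(d_in)
    G_out = GWt - GWt.T                         # skew-symmetric, lives in so(d_out)
    R = torch.linalg.matrix_exp(-eta * G_out)   # left orthogonal factor
    P = torch.linalg.matrix_exp(-eta * G_in)    # right orthogonal factor
    return R @ W @ P

# Sanity check: singular values are unchanged up to numerical error.
W = torch.randn(64, 128, dtype=torch.float64)
G = torch.randn_like(W)
W_next = pion_raw_step(W, G, eta=1e-3)
assert torch.allclose(torch.linalg.svdvals(W), torch.linalg.svdvals(W_next), atol=1e-6)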

Full algorithm

Algorithm: The Pion Optimizer
Input Learning rate \(\eta\), momentum \(\beta_1,\beta_2\), RMS constant \(c\), stability \(\epsilon\), alternating flag, initial \(\mathbf{W}_0 \in \mathbb{R}^{d_{\mathrm{out}}\times d_{\mathrm{in}}}\).
Output Optimized parameter \(\mathbf{W}_t\).
Initialize \(\mathbf{m}^{\mathrm{in}}_0,\mathbf{v}^{\mathrm{in}}_0 \leftarrow \mathbf{0}\in\mathbb{R}^{d_{\mathrm{in}}\times d_{\mathrm{in}}}\), \(\mathbf{m}^{\mathrm{out}}_0,\mathbf{v}^{\mathrm{out}}_0 \leftarrow \mathbf{0}\in\mathbb{R}^{d_{\mathrm{out}}\times d_{\mathrm{out}}}\).
Define \(\mathcal{E}_2(\mathbf{A},\alpha) \leftarrow \mathbf{I}+\eta\alpha \mathbf{A}+\tfrac{1}{2}(\eta\alpha \mathbf{A})^2\).
for \(t=1,2,\ldots\) do
\(\mathbf{G}_t \leftarrow \nabla_{\mathbf{W}} f(\mathbf{W}_{t-1})\).
\(\mathbf{G}^{\mathrm{in}}_t \leftarrow \mathbf{W}_{t-1}^{\top}\mathbf{G}_t - \mathbf{G}_t^{\top}\mathbf{W}_{t-1}\),   \(\mathbf{G}^{\mathrm{out}}_t \leftarrow \mathbf{G}_t\mathbf{W}_{t-1}^{\top} - \mathbf{W}_{t-1}\mathbf{G}_t^{\top}\).
\(\mathbf{m}^{\mathrm{in}}_t \leftarrow \beta_1 \mathbf{m}^{\mathrm{in}}_{t-1} + (1-\beta_1)\mathbf{G}^{\mathrm{in}}_t\),   \(\mathbf{m}^{\mathrm{out}}_t \leftarrow \beta_1 \mathbf{m}^{\mathrm{out}}_{t-1} + (1-\beta_1)\mathbf{G}^{\mathrm{out}}_t\).
\(\mathbf{v}^{\mathrm{in}}_t \leftarrow \beta_2 \mathbf{v}^{\mathrm{in}}_{t-1} + (1-\beta_2)(\mathbf{G}^{\mathrm{in}}_t \odot \mathbf{G}^{\mathrm{in}}_t)\),   \(\mathbf{v}^{\mathrm{out}}_t \leftarrow \beta_2 \mathbf{v}^{\mathrm{out}}_{t-1} + (1-\beta_2)(\mathbf{G}^{\mathrm{out}}_t \odot \mathbf{G}^{\mathrm{out}}_t)\).
\(\mathbf{A}^{\mathrm{in}}_t \leftarrow -\,\mathbf{m}^{\mathrm{in}}_t / (\sqrt{\mathbf{v}^{\mathrm{in}}_t} + \epsilon)\),   \(\mathbf{A}^{\mathrm{out}}_t \leftarrow -\,\mathbf{m}^{\mathrm{out}}_t / (\sqrt{\mathbf{v}^{\mathrm{out}}_t} + \epsilon)\) (element-wise).
if alternating update then
if \(t\) is even then
\(\alpha_t \leftarrow \dfrac{c\sqrt{d_{\mathrm{out}}d_{\mathrm{in}}}}{ \|\mathbf{A}^{\mathrm{out}}_t \mathbf{W}_{t-1}\|_F+\epsilon}\).
\(\mathbf{W}_t \leftarrow \mathcal{E}_2(\mathbf{A}^{\mathrm{out}}_t,\alpha_t)\,\mathbf{W}_{t-1}\).
else
\(\alpha_t \leftarrow \dfrac{c\sqrt{d_{\mathrm{out}}d_{\mathrm{in}}}}{ \|\mathbf{W}_{t-1}\mathbf{A}^{\mathrm{in}}_t\|_F+\epsilon}\).
\(\mathbf{W}_t \leftarrow \mathbf{W}_{t-1}\,\mathcal{E}_2(\mathbf{A}^{\mathrm{in}}_t,\alpha_t)\).
end if
else
\(\alpha_t \leftarrow \dfrac{c\sqrt{d_{\mathrm{out}}d_{\mathrm{in}}}}{ \|\mathbf{A}^{\mathrm{out}}_t\mathbf{W}_{t-1}+\mathbf{W}_{t-1}\mathbf{A}^{\mathrm{in}}_t\|_F+\epsilon}\).
\(\mathbf{W}_t \leftarrow \mathcal{E}_2(\mathbf{A}^{\mathrm{out}}_t,\alpha_t)\,\mathbf{W}_{t-1}\,\mathcal{E}_2(\mathbf{A}^{\mathrm{in}}_t,\alpha_t)\).
end if
end for
return \(\mathbf{W}_t\).
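
For reference, a compact single-matrix implementation of the algorithm above (joint, non-alternating branch) could look like the sketch below. Class and function names are ours, and the hyperparameter defaults are illustrative placeholders rather than the paper's tuned values.

python
# Illustrative single-matrix Pion step (joint branch of the algorithm box).
# Not the authors' released optimizer; defaults are placeholders.
import torch

class PionState:
    """Per-matrix optimizer state: Lie-algebra first and second moments."""
    def __init__(self, d_out, d_in, device=None, dtype=torch.float32):
        self.m_in = torch.zeros(d_in, d_in, device=device, dtype=dtype)
        self.v_in = torch.zeros(d_in, d_in, device=device, dtype=dtype)
        self.m_out = torch.zeros(d_out, d_out, device=device, dtype=dtype)
        self.v_out = torch.zeros(d_out, d_out, device=device, dtype=dtype)

def expm2(A, eta, alpha):
    """Second-order approximation E_2(A, alpha) = I + eta*alpha*A + (eta*alpha*A)^2 / 2."""
    X = eta * alpha * A
    return torch.eye(A.shape[0], device=A.device, dtype=A.dtype) + X + 0.5 * (X @ X)

def pion_step(W, G, state, eta=0.02, beta1=0.9, beta2=0.99, c=0.2, eps=1e-8):
    # Skew-symmetric (Lie-algebra) gradients on both sides.
    G_in = W.T @ G - G.T @ W
    G_out = G @ W.T - W @ G.T
    # Moments are accumulated in the Lie algebra, as in the algorithm box.
    state.m_in.mul_(beta1).add_(G_in, alpha=1 - beta1)
    state.m_out.mul_(beta1).add_(G_out, alpha=1 - beta1)
    state.v_in.mul_(beta2).add_(G_in * G_in, alpha=1 - beta2)
    state.v_out.mul_(beta2).add_(G_out * G_out, alpha=1 - beta2)
    A_in = -state.m_in / (state.v_in.sqrt() + eps)    # element-wise, stays skew-symmetric
    A_out = -state.m_out / (state.v_out.sqrt() + eps)
    # Shared RMS-style step-size scaling (joint-update branch).
    d_out, d_in = W.shape
    alpha = c * (d_out * d_in) ** 0.5 / (torch.linalg.norm(A_out @ W + W @ A_in) + eps)
    return expm2(A_out, eta, alpha) @ W @ expm2(A_in, eta, alpha)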

Design principles for practical training

We turn the bare rule above into a usable optimizer by studying its design choices empirically: where to keep momentum (in the Lie algebra versus the ambient space, ablated below), how to apply second-moment normalization, whether to alternate the left and right updates, and how to approximate the matrix exponential (the second-order expansion \(\mathcal{E}_2\) in the algorithm above).

Momentum design ablation losses
Figure Training-loss curves for different momentum designs. Lie-algebra momentum (used in the final Pion algorithm) consistently outperforms ambient-space momentum, and the Lie‑algebra first‑order + Lie‑algebra second‑order combination is the strongest.

Convergence on the iso-spectral manifold

Under standard \(L\)-smoothness, lower-boundedness of \(f\), and bounded stochastic-gradient noise \(\mathbb{E}\| \xi_t \|_F^2 \le \sigma^2\), Pion with step size \(\eta = C / \sqrt{T + 1}\) satisfies:

\[ \min_{0 \le t \le T} \mathbb{E}\Bigl[ \| \mathbf{G}_t^{\mathrm{in}} \|_F^2 + \| \mathbf{G}_t^{\mathrm{out}} \|_F^2 \Bigr] \le \frac{C_1 + C_2}{\sqrt{T + 1}} = \mathcal{O}\!\left(\frac{1}{\sqrt{T}}\right). \]

The geometric structure of the update also gives a clean interpretation: the Frobenius norm of \(\Delta\mathbf{W}\) measures the total rotational strength applied to \(\mathbf{W}_t\), while the scaled norms \(\|\mathbf{G}_t^{\mathrm{in}}\|_F / \sqrt{d_{\mathrm{in}}}\) and \(\|\mathbf{G}_t^{\mathrm{out}}\|_F / \sqrt{d_{\mathrm{out}}}\) capture the average planar-rotation angles on the two sides.
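
A small logging helper following this interpretation might look like the sketch below; the function name and its role as a diagnostic are our own illustration.

python
# Possible per-step rotation diagnostics suggested by the interpretation above.
import torch

def rotation_diagnostics(W_prev, W_next, G_in, G_out):
    d_out, d_in = W_prev.shape
    total_rotation = torch.linalg.norm(W_next - W_prev)        # ||Delta W||_F
    avg_angle_in = torch.linalg.norm(G_in) / d_in ** 0.5        # ||G_in||_F / sqrt(d_in)
    avg_angle_out = torch.linalg.norm(G_out) / d_out ** 0.5     # ||G_out||_F / sqrt(d_out)
    return total_rotation, avg_angle_in, avg_angle_out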

Results

Stable pretraining, strong post-training, faithful μP transfer.

Stable LLM pretraining (LLaMA-1.3B, 54B tokens)

We pretrain a 1.3B-parameter LLaMA-based model on 54B C4 tokens — the Chinchilla-optimal budget. Pion attains the best average benchmark score and validation loss comparable to Muon, while substantially outperforming AdamW.

Method  ARC-C  ARC-E  BoolQ  Hella.  PIQA   SciQ   TriviaQA  Wino.  Avg    Val Loss
AdamW   25.94  45.96  46.30  45.10   71.27  70.80  1.06      51.46  44.74  2.7700
Muon    25.34  47.94  51.56  46.70   72.20  71.60  1.64      53.75  46.34  2.7225
Pion    26.79  49.41  57.58  47.34   71.27  73.40  2.17      53.59  47.69  2.7350
Stability diagnostics during pretraining
Figure Four pretraining stability diagnostics on LLaMA-1.3B. Pion keeps the maximum attention logit, SwiGLU activation norm, down-projection weight norm, and output norm essentially flat throughout 54B tokens of training. AdamW's attention logits and activations grow unboundedly; Muon controls logits but its activations and weight norms continue to drift.
Singular spectrum of W_O at convergence
Figure Singular-value spectrum of an attention output projection at the end of training. Pion exactly preserves the initial spectrum; AdamW and Muon drift.
Normalization-free training curves
Figure Normalization-free pretraining. After removing every normalization layer, AdamW and Muon diverge to NaN; only Pion completes 9.6B tokens and converges cleanly.
Training loss for 200-layer LLaMA
Figure 200-layer LLaMA on 50B tokens of C4. Pion has the smallest local-loss standard deviation (σ = 0.0892 vs. 0.0931 for AdamW and 0.0927 for Muon) and the most efficient intermediate-stage descent.
Layer-wise Jacobian norm
Figure Layer-wise Jacobian norm \(\|\mathbf{J}-\mathbf{I}\|_F\). Pion maintains a uniform expressivity profile across depth, avoiding the interior degradation seen with AdamW and Muon.

μP-compatible learning-rate transfer

Pion satisfies μP's forward spectral condition by construction. With a simple bilateral spectral-norm normalization on \(\mathbf{G}_t^{\mathrm{in}}\) and \(\mathbf{G}_t^{\mathrm{out}}\), the optimal learning rate is invariant to model width across both LLaMA and Qwen architectures.

mu-P learning-rate transfer across scales
Figure μP learning-rate transfer for Pion across LLaMA and Qwen models of varying widths. The minima of the loss curves align at the same learning rate, confirming that small-scale tuning transfers reliably to larger models.
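
One plausible realization of this bilateral normalization is sketched below: each Lie-algebra gradient is divided by its spectral norm so that the induced rotation per step is width-independent. The exact scaling used in the paper is not reproduced here, so treat this specific form as an assumption for illustration.

python
# A possible bilateral spectral-norm normalization (assumed form, for illustration).
import torch

def bilateral_spectral_normalize(G_in, G_out, eps=1e-8):
    # Divide each side's Lie-algebra gradient by its largest singular value.
    G_in = G_in / (torch.linalg.matrix_norm(G_in, ord=2) + eps)
    G_out = G_out / (torch.linalg.matrix_norm(G_out, ord=2) + eps)
    return G_in, G_out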

Supervised finetuning

Full-parameter SFT on MetaMathQA and Magicoder-Evol-Instruct using Qwen2.5-1.5B and Llama-3.2-3B. Pion delivers the best stability–plasticity tradeoff: comparable in-domain math performance with markedly better OOD retention, and the strongest results on code generation across both base models.

Method  Qwen2.5-1.5B                       Llama3.2-3B
        Math           Code                Math           Code
        ID     OOD     ID     OOD          ID     OOD     ID     OOD
Base    59.81  64.83   35.98  63.99        25.47  67.59   26.22  53.08
AdamW   65.88  62.13   51.83  62.64        59.87  60.86   46.95  58.64
Muon    65.27  61.22   50.00  62.41        57.77  61.20   46.34  58.88
Pion    65.76  62.16   53.05  63.21        58.83  60.44   47.19  59.74

All numbers are accuracy (%). ID = in-domain, OOD = out-of-domain; Base is the model before finetuning. Corresponds to Table 2 in the paper.

Reinforcement learning with verifiable reward

With GRPO on DeepMath, Pion is the strongest optimizer on every mathematical-reasoning benchmark we tested for both Qwen3-1.7B and DeepSeek-R1-Distill-Qwen-1.5B — consistent with the observation that RLVR updates largely preserve pretrained spectral structure, an inductive bias that Pion enforces by design.

Method  Qwen3-1.7B                                       DeepSeek-R1-Distill-Qwen-1.5B
        AIME24  AIME25  AMC23  Minerva  Olymp.  Avg      AIME24  AIME25  AMC23  Minerva  Olymp.  Avg
Base     4.06   10.10   30.27   16.27   23.67   16.87     20.52   20.83   54.06   19.39   36.20  30.20
AdamW   22.71   20.94   58.43   25.91   46.09   34.82     25.42   23.94   62.65   23.16   44.69  35.97
Muon    20.42   19.27   54.22   24.08   42.41   32.08     29.06   23.33   66.72   22.89   44.61  37.32
Pion    25.42   21.98   59.94   26.84   46.43   36.12     30.00   24.38   66.87   23.90   46.43  38.32

Each benchmark is avg@K accuracy (%): AIME24 avg@32, AIME25 avg@32, AMC23 avg@8, Minerva Math avg@4, OlympiadBench avg@8. Avg = mean of the five. Table 3 in the paper.

RLVR training dynamics
Figure Evaluation-accuracy training dynamics during RLVR with GRPO. Pion exhibits the fastest convergence on both base models across all five benchmarks.

Citation

BibTeX

bibtex
@techreport{pion2026,
  title={Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation},
  author={Shi, Kexuan and Li, Hanxuan and Qiu, Zeju and Wen, Yandong and Buchholz, Simon and Liu, Weiyang},
  institution={SphereLab Technical Report},
  year={2026},
  note={Available at https://spherelab.ai/pion}
}