SphereLab · Technical Report

Pion: A Spectrum-Preserving Optimizer
via Orthogonal Equivalence Transformation

An LLM optimizer that updates each weight matrix through coupled left and right orthogonal transformations — preserving its singular spectrum throughout training.

Kexuan Shi1,†, Hanxuan Li1,†, Zeju Qiu1,2, Yandong Wen3, Simon Buchholz2, Weiyang Liu1,*
1The Chinese University of Hong Kong · 2Max Planck Institute for Intelligent Systems · 3Westlake University
† Equal contribution · * Corresponding author
POET vs Pion comparison
Pion vs. POET. POET reparameterizes each weight as \(\mathbf{W} = \mathbf{R}\mathbf{W}_0\mathbf{P}\) and trains the orthogonal factors instead of the weights. Pion removes the reparameterization and instead applies left and right orthogonal transformations directly to \(\mathbf{W}_t\), preserving its singular-value spectrum without introducing auxiliary parameters.

Abstract

A new geometric route to stable LLM training.

TL;DR Pion treats each weight matrix as living on an iso‑spectral manifold and updates it via coupled left and right orthogonal transformations. The spectrum is preserved by construction, μP-compatibility follows naturally, and the optimizer remains competitive with AdamW and Muon across pretraining, supervised finetuning, and RLVR.

We introduce Pion, a spectrum-preserving optimizer for large language model training based on orthogonal equivalence transformations. Unlike additive optimizers such as AdamW and Muon, Pion updates each weight matrix through left and right orthogonal transformations, preserving its singular values throughout training. This yields an optimization mechanism that modulates the geometry of weight matrices while keeping their spectral norm fixed. We derive Pion's update rule, systematically examine its design choices, and analyze its convergence behavior along with several key properties. Empirical results show that Pion offers a stable and competitive alternative to standard optimizers for both LLM pretraining and finetuning.

Why Pion

Spectrum control as a first-class optimizer property.

Algorithmic spectrum control

The update rule lives on the iso-spectral manifold of W0. Singular values are preserved exactly — no explicit normalization or projection needed.

Minimum hyperspherical energy

Orthogonal equivalence transformations preserve the minimum-energy configuration of zero-mean Gaussian initialization, keeping neurons uniformly distributed on the hypersphere throughout training.

Naturally μP-compatible

Pion satisfies the forward spectral condition by construction. A simple bilateral normalization on the spectral norms of \(\mathbf{G}^{\mathrm{in}}\) and \(\mathbf{G}^{\mathrm{out}}\) delivers learning-rate transfer across model widths.

Stable under stress

Trains successfully on 200-layer networks and on models with no normalization layers — both regimes where AdamW and Muon diverge.

Convergent and well-behaved

We prove an 𝒪(1/√T ) stationarity bound on the iso-spectral manifold under standard L-smoothness and bounded-noise assumptions.

Competitive everywhere

Best average benchmark score on LLaMA-1.3B pretraining, best stability-plasticity tradeoff on SFT, and the strongest RLVR results on Qwen3-1.7B and DeepSeek-R1-Distill.

Method

Two rotations, one identity, zero reparameterization.

The core update rule

For any weight matrix \(\mathbf{W}_t \in \mathbb{R}^{d_{\mathrm{out}}\times d_{\mathrm{in}}}\), we can trivially write \(\mathbf{W}_t = \mathbf{I}_{d_{\mathrm{out}}}\mathbf{W}_t\mathbf{I}_{d_{\mathrm{in}}}\). Geometrically, each identity is the neutral element of an orthogonal group. Pion evolves these identity factors directly on the orthogonal group, which induces left and right orthogonal transformations on \(\mathbf{W}_t\), preserving its singular values.

Pion update rule
\[ \begin{gathered} \begin{aligned} \mathbf{G}_t^{\mathrm{in}} &= \mathbf{W}_t^\top \mathbf{G}_t - (\mathbf{W}_t^\top \mathbf{G}_t)^\top, \qquad \mathbf{G}_t^{\mathrm{out}} = \mathbf{G}_t \mathbf{W}_t^\top - (\mathbf{G}_t \mathbf{W}_t^\top)^\top, \end{aligned}\\[0.65em] \mathbf{W}_{t+1} = \exp(-\eta \mathbf{G}_t^{\mathrm{out}})\, \mathbf{W}_t\, \exp(-\eta \mathbf{G}_t^{\mathrm{in}}). \end{gathered} \]

The skew-symmetrization projects the gradients onto the Lie algebra; the matrix exponential maps them back to the Lie group, producing valid orthogonal transformations. Because \(\mathbf{R}_t = \exp(-\eta\,\mathbf{G}_t^{\mathrm{out}})\) and \(\mathbf{P}_t = \exp(-\eta\,\mathbf{G}_t^{\mathrm{in}})\) are orthogonal, the singular values of \(\mathbf{W}_t\), and hence its spectral and Frobenius norms, are preserved: the update is pure angular motion rather than rescaling.
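
To make the raw rule concrete, here is a minimal PyTorch sketch of a single spectrum-preserving step, with no momentum or second moment; the helper name pion_raw_step and the numerical sanity check are our illustration, not the authors' code.

python
# Minimal sketch of the raw Pion update (no momentum, no second moment),
# following the boxed update rule above. Illustrative only.
import torch

def pion_raw_step(W: torch.Tensor, G: torch.Tensor, eta: float) -> torch.Tensor:
    """One step of W <- exp(-eta * G_out) @ W @ exp(-eta * G_in)."""
    WtG = W.T @ G                               # (d_in, d_in)
    GWt = G @ W.T                               # (d_out, d_out)
    G_in = WtG - WtG.T                          # skew-symmetric, lives in so(d_in)
    G_out = GWt - GWt.T                         # skew-symmetric, lives in so(d_out)
    R = torch.linalg.matrix_exp(-eta * G_out)   # left orthogonal factor
    P = torch.linalg.matrix_exp(-eta * G_in)    # right orthogonal factor
    return R @ W @ P

# Sanity check: singular values are unchanged up to numerical error.
W = torch.randn(64, 128, dtype=torch.float64)
G = torch.randn_like(W)
W_next = pion_raw_step(W, G, eta=1e-3)
assert torch.allclose(torch.linalg.svdvals(W), torch.linalg.svdvals(W_next), atol=1e-6)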

Full algorithm

Algorithm: The Pion Optimizer
Input Learning rate \(\eta\), momentum \(\beta_1,\beta_2\), RMS constant \(c\), stability \(\epsilon\), alternating flag, initial \(\mathbf{W}_0 \in \mathbb{R}^{d_{\mathrm{out}}\times d_{\mathrm{in}}}\).
Output Optimized parameter \(\mathbf{W}_t\).
Initialize \(\mathbf{m}^{\mathrm{in}}_0,\mathbf{v}^{\mathrm{in}}_0 \leftarrow \mathbf{0}\in\mathbb{R}^{d_{\mathrm{in}}\times d_{\mathrm{in}}}\), \(\mathbf{m}^{\mathrm{out}}_0,\mathbf{v}^{\mathrm{out}}_0 \leftarrow \mathbf{0}\in\mathbb{R}^{d_{\mathrm{out}}\times d_{\mathrm{out}}}\).
Define \(\mathcal{E}_2(\mathbf{A},\alpha) \leftarrow \mathbf{I}+\eta\alpha \mathbf{A}+\tfrac{1}{2}(\eta\alpha \mathbf{A})^2\).
for \(t=1,2,\ldots\) do
\(\mathbf{G}_t \leftarrow \nabla_{\mathbf{W}} f(\mathbf{W}_{t-1})\).
\(\mathbf{G}^{\mathrm{in}}_t \leftarrow \mathbf{W}_{t-1}^{\top}\mathbf{G}_t - \mathbf{G}_t^{\top}\mathbf{W}_{t-1}\),   \(\mathbf{G}^{\mathrm{out}}_t \leftarrow \mathbf{G}_t\mathbf{W}_{t-1}^{\top} - \mathbf{W}_{t-1}\mathbf{G}_t^{\top}\).
\(\mathbf{m}^{\mathrm{in}}_t \leftarrow \beta_1 \mathbf{m}^{\mathrm{in}}_{t-1} + (1-\beta_1)\mathbf{G}^{\mathrm{in}}_t\),   \(\mathbf{m}^{\mathrm{out}}_t \leftarrow \beta_1 \mathbf{m}^{\mathrm{out}}_{t-1} + (1-\beta_1)\mathbf{G}^{\mathrm{out}}_t\).
\(\mathbf{v}^{\mathrm{in}}_t \leftarrow \beta_2 \mathbf{v}^{\mathrm{in}}_{t-1} + (1-\beta_2)(\mathbf{G}^{\mathrm{in}}_t \odot \mathbf{G}^{\mathrm{in}}_t)\),   \(\mathbf{v}^{\mathrm{out}}_t \leftarrow \beta_2 \mathbf{v}^{\mathrm{out}}_{t-1} + (1-\beta_2)(\mathbf{G}^{\mathrm{out}}_t \odot \mathbf{G}^{\mathrm{out}}_t)\).
\(\mathbf{A}^{\mathrm{in}}_t \leftarrow -\,\mathbf{m}^{\mathrm{in}}_t / (\sqrt{\mathbf{v}^{\mathrm{in}}_t} + \epsilon)\),   \(\mathbf{A}^{\mathrm{out}}_t \leftarrow -\,\mathbf{m}^{\mathrm{out}}_t / (\sqrt{\mathbf{v}^{\mathrm{out}}_t} + \epsilon)\) (element-wise).
if alternating update then
if \(t\) is even then
\(\alpha_t \leftarrow \dfrac{c\sqrt{d_{\mathrm{out}}d_{\mathrm{in}}}}{ \|\mathbf{A}^{\mathrm{out}}_t \mathbf{W}_{t-1}\|_F+\epsilon}\).
\(\mathbf{W}_t \leftarrow \mathcal{E}_2(\mathbf{A}^{\mathrm{out}}_t,\alpha_t)\,\mathbf{W}_{t-1}\).
else
\(\alpha_t \leftarrow \dfrac{c\sqrt{d_{\mathrm{out}}d_{\mathrm{in}}}}{ \|\mathbf{W}_{t-1}\mathbf{A}^{\mathrm{in}}_t\|_F+\epsilon}\).
\(\mathbf{W}_t \leftarrow \mathbf{W}_{t-1}\,\mathcal{E}_2(\mathbf{A}^{\mathrm{in}}_t,\alpha_t)\).
end if
else
\(\alpha_t \leftarrow \dfrac{c\sqrt{d_{\mathrm{out}}d_{\mathrm{in}}}}{ \|\mathbf{A}^{\mathrm{out}}_t\mathbf{W}_{t-1}+\mathbf{W}_{t-1}\mathbf{A}^{\mathrm{in}}_t\|_F+\epsilon}\).
\(\mathbf{W}_t \leftarrow \mathcal{E}_2(\mathbf{A}^{\mathrm{out}}_t,\alpha_t)\,\mathbf{W}_{t-1}\,\mathcal{E}_2(\mathbf{A}^{\mathrm{in}}_t,\alpha_t)\).
end if
end for
return \(\mathbf{W}_t\).
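
For reference, a compact single-matrix implementation of the algorithm above (joint, non-alternating branch) could look like the sketch below. Class and function names are ours, and the hyperparameter defaults are illustrative placeholders rather than the paper's tuned values.

python
# Illustrative single-matrix Pion step (joint branch of the algorithm box).
# Not the authors' released optimizer; defaults are placeholders.
import torch

class PionState:
    """Per-matrix optimizer state: Lie-algebra first and second moments."""
    def __init__(self, d_out, d_in, device=None, dtype=torch.float32):
        self.m_in = torch.zeros(d_in, d_in, device=device, dtype=dtype)
        self.v_in = torch.zeros(d_in, d_in, device=device, dtype=dtype)
        self.m_out = torch.zeros(d_out, d_out, device=device, dtype=dtype)
        self.v_out = torch.zeros(d_out, d_out, device=device, dtype=dtype)

def expm2(A, eta, alpha):
    """Second-order approximation E_2(A, alpha) = I + eta*alpha*A + (eta*alpha*A)^2 / 2."""
    X = eta * alpha * A
    return torch.eye(A.shape[0], device=A.device, dtype=A.dtype) + X + 0.5 * (X @ X)

def pion_step(W, G, state, eta=0.02, beta1=0.9, beta2=0.99, c=0.2, eps=1e-8):
    # Skew-symmetric (Lie-algebra) gradients on both sides.
    G_in = W.T @ G - G.T @ W
    G_out = G @ W.T - W @ G.T
    # Moments are accumulated in the Lie algebra, as in the algorithm box.
    state.m_in.mul_(beta1).add_(G_in, alpha=1 - beta1)
    state.m_out.mul_(beta1).add_(G_out, alpha=1 - beta1)
    state.v_in.mul_(beta2).add_(G_in * G_in, alpha=1 - beta2)
    state.v_out.mul_(beta2).add_(G_out * G_out, alpha=1 - beta2)
    A_in = -state.m_in / (state.v_in.sqrt() + eps)    # element-wise, stays skew-symmetric
    A_out = -state.m_out / (state.v_out.sqrt() + eps)
    # Shared RMS-style step-size scaling (joint-update branch).
    d_out, d_in = W.shape
    alpha = c * (d_out * d_in) ** 0.5 / (torch.linalg.norm(A_out @ W + W @ A_in) + eps)
    return expm2(A_out, eta, alpha) @ W @ expm2(A_in, eta, alpha)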

Design principles for practical training

We turn the bare rule above into a usable optimizer by studying its design choices empirically: where to keep momentum (in the Lie algebra versus the ambient space, ablated below), how to apply second-moment normalization, whether to alternate the left and right updates, and how to approximate the matrix exponential (the second-order expansion \(\mathcal{E}_2\) in the algorithm above).

Momentum design ablation losses
Figure Training-loss curves for different momentum designs. Lie-algebra momentum (used in the final Pion algorithm) consistently outperforms ambient-space momentum, and the Lie‑algebra first‑order + Lie‑algebra second‑order combination is the strongest.

Convergence on the iso-spectral manifold

Under standard \(L\)-smoothness, lower-boundedness of \(f\), and bounded stochastic-gradient noise \(\mathbb{E}\| \xi_t \|_F^2 \le \sigma^2\), Pion with step size \(\eta = C / \sqrt{T + 1}\) satisfies:

\[ \min_{0 \le t \le T} \mathbb{E}\Bigl[ \| \mathbf{G}_t^{\mathrm{in}} \|_F^2 + \| \mathbf{G}_t^{\mathrm{out}} \|_F^2 \Bigr] \le \frac{C_1 + C_2}{\sqrt{T + 1}} = \mathcal{O}\!\left(\frac{1}{\sqrt{T}}\right). \]

The geometric structure of the update also gives a clean interpretation: the Frobenius norm of \(\Delta\mathbf{W}\) measures the total rotational strength applied to \(\mathbf{W}_t\), while the scaled norms \(\|\mathbf{G}_t^{\mathrm{in}}\|_F / \sqrt{d_{\mathrm{in}}}\) and \(\|\mathbf{G}_t^{\mathrm{out}}\|_F / \sqrt{d_{\mathrm{out}}}\) capture the average planar-rotation angles on the two sides.
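
A small logging helper following this interpretation might look like the sketch below; the function name and its role as a diagnostic are our own illustration.

python
# Possible per-step rotation diagnostics suggested by the interpretation above.
import torch

def rotation_diagnostics(W_prev, W_next, G_in, G_out):
    d_out, d_in = W_prev.shape
    total_rotation = torch.linalg.norm(W_next - W_prev)        # ||Delta W||_F
    avg_angle_in = torch.linalg.norm(G_in) / d_in ** 0.5        # ||G_in||_F / sqrt(d_in)
    avg_angle_out = torch.linalg.norm(G_out) / d_out ** 0.5     # ||G_out||_F / sqrt(d_out)
    return total_rotation, avg_angle_in, avg_angle_out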

Results

Stable pretraining, strong post-training, faithful μP transfer.

Stable LLM pretraining (LLaMA-1.3B, 54B tokens)

We pretrain a 1.3B-parameter LLaMA-based model on 54B C4 tokens — the Chinchilla-optimal budget. Pion attains the best average benchmark score and validation loss comparable to Muon, while substantially outperforming AdamW.

Method  ARC-C  ARC-E  BoolQ  Hella.  PIQA   SciQ   TriviaQA  Wino.  Avg    Val Loss
AdamW   25.94  45.96  46.30  45.10   71.27  70.80  1.06      51.46  44.74  2.7700
Muon    25.34  47.94  51.56  46.70   72.20  71.60  1.64      53.75  46.34  2.7225
Pion    26.79  49.41  57.58  47.34   71.27  73.40  2.17      53.59  47.69  2.7350
Stability diagnostics during pretraining
Figure Four pretraining stability diagnostics on LLaMA-1.3B. Pion keeps the maximum attention logit, SwiGLU activation norm, down-projection weight norm, and output norm essentially flat throughout 54B tokens of training. AdamW's attention logits and activations grow unboundedly; Muon controls logits but its activations and weight norms continue to drift.
Singular spectrum of W_O at convergence
Figure Singular-value spectrum of an attention output projection at the end of training. Pion exactly preserves the initial spectrum; AdamW and Muon drift.
Normalization-free training curves
Figure Normalization-free pretraining. After removing every normalization layer, AdamW and Muon diverge to NaN; only Pion completes 9.6B tokens and converges cleanly.
Training loss for 200-layer LLaMA
Figure 200-layer LLaMA on 50B tokens of C4. Pion has the smallest local-loss standard deviation (σ = 0.0892 vs. 0.0931 for AdamW and 0.0927 for Muon) and the most efficient intermediate-stage descent.
Layer-wise Jacobian norm
Figure Layer-wise Jacobian norm \(\|\mathbf{J}-\mathbf{I}\|_F\). Pion maintains a uniform expressivity profile across depth, avoiding the interior degradation seen with AdamW and Muon.

μP-compatible learning-rate transfer

Pion satisfies μP's forward spectral condition by construction. With a simple bilateral spectral-norm normalization on \(\mathbf{G}_t^{\mathrm{in}}\) and \(\mathbf{G}_t^{\mathrm{out}}\), the optimal learning rate is invariant to model width across both LLaMA and Qwen architectures.

mu-P learning-rate transfer across scales
Figure μP learning-rate transfer for Pion across LLaMA and Qwen models of varying widths. The minima of the loss curves align at the same learning rate, confirming that small-scale tuning transfers reliably to larger models.
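
One plausible realization of this bilateral normalization is sketched below: each Lie-algebra gradient is divided by its spectral norm so that the induced rotation per step is width-independent. The exact scaling used in the paper is not reproduced here, so treat this specific form as an assumption for illustration.

python
# A possible bilateral spectral-norm normalization (assumed form, for illustration).
import torch

def bilateral_spectral_normalize(G_in, G_out, eps=1e-8):
    # Divide each side's Lie-algebra gradient by its largest singular value.
    G_in = G_in / (torch.linalg.matrix_norm(G_in, ord=2) + eps)
    G_out = G_out / (torch.linalg.matrix_norm(G_out, ord=2) + eps)
    return G_in, G_out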

Supervised finetuning

Full-parameter SFT on MetaMathQA and Magicoder-Evol-Instruct using Qwen2.5-1.5B and Llama-3.2-3B. Pion delivers the best stability–plasticity tradeoff: comparable in-domain math performance with markedly better OOD retention, and the strongest results on code generation across both base models.

Method  Qwen2.5-1.5B                       Llama3.2-3B
        Math           Code                Math           Code
        ID     OOD     ID     OOD          ID     OOD     ID     OOD
Base    59.81  64.83   35.98  63.99        25.47  67.59   26.22  53.08
AdamW   65.88  62.13   51.83  62.64        59.87  60.86   46.95  58.64
Muon    65.27  61.22   50.00  62.41        57.77  61.20   46.34  58.88
Pion    65.76  62.16   53.05  63.21        58.83  60.44   47.19  59.74

All numbers are accuracy (%). ID = in-domain, OOD = out-of-domain; Base is the model before finetuning. Corresponds to Table 2 in the paper.

Reinforcement learning with verifiable reward

With GRPO on DeepMath, Pion is the strongest optimizer on every mathematical-reasoning benchmark we tested for both Qwen3-1.7B and DeepSeek-R1-Distill-Qwen-1.5B — consistent with the observation that RLVR updates largely preserve pretrained spectral structure, an inductive bias that Pion enforces by design.

Method  Qwen3-1.7B                                       DeepSeek-R1-Distill-Qwen-1.5B
        AIME24  AIME25  AMC23  Minerva  Olymp.  Avg      AIME24  AIME25  AMC23  Minerva  Olymp.  Avg
Base     4.06   10.10   30.27   16.27   23.67   16.87     20.52   20.83   54.06   19.39   36.20  30.20
AdamW   22.71   20.94   58.43   25.91   46.09   34.82     25.42   23.94   62.65   23.16   44.69  35.97
Muon    20.42   19.27   54.22   24.08   42.41   32.08     29.06   23.33   66.72   22.89   44.61  37.32
Pion    25.42   21.98   59.94   26.84   46.43   36.12     30.00   24.38   66.87   23.90   46.43  38.32

Each benchmark is avg@K accuracy (%): AIME24 avg@32, AIME25 avg@32, AMC23 avg@8, Minerva Math avg@4, OlympiadBench avg@8. Avg = mean of the five. Table 3 in the paper.

RLVR training dynamics
Figure Evaluation-accuracy training dynamics during RLVR with GRPO. Pion exhibits the fastest convergence on both base models across all five benchmarks.

Citation

BibTeX

bibtex
@techreport{pion2026,
  title={Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation},
  author={Shi, Kexuan and Li, Hanxuan and Qiu, Zeju and Wen, Yandong and Buchholz, Simon and Liu, Weiyang},
  institution={SphereLab Technical Report},
  year={2026},
  note={Available at https://spherelab.ai/pion}
}