POET

Reparameterized LLM Training via
Orthogonal Equivalence Transformation


¹Max Planck Institute for Intelligent Systems, Tübingen   ²The Chinese University of Hong Kong


POET's three learning phases. Left: illustration; Middle: angle; Right: loss and validation.

POET: A New Training Paradigm for Large Language Models

While large language models (LLMs) are driving the rapid advancement of artificial intelligence, effectively and reliably training these large models remains one of the field's most significant challenges. To address this challenge, we propose POET, a novel reParameterized training algorithm that uses Orthogonal Equivalence Transformation to optimize neurons. Specifically, POET reparameterizes each neuron with two learnable orthogonal matrices and a fixed random weight matrix. Because it provably preserves the spectral properties of weight matrices, POET can stably optimize the objective function with improved generalization. We further develop efficient approximations that make POET flexible and scalable for training large-scale neural networks. Extensive experiments validate the effectiveness and scalability of POET in training LLMs.



Spectral Property of Weight Matrices

Spectral Properties and Generalization

The de facto way for training LLMs is to directly optimize weight matrices with the Adam optimizer. While conceptually simple, this direct optimization can be computationally intensive (due to the poor scaling with model size) and requires careful hyperparameter tuning to ensure stable convergence. More importantly, its generalization can remain suboptimal even if the training loss is perfectly minimized. To stabilize training and enhance generalization, various weight regularization methods and weight normalization techniques have been proposed. Most of these methods boil down to improving spectral properties of weight matrices (i.e., singular values) either explicitly or implicitly. Intuitively, the spectral norm of a weight matrix (i.e., the largest singular value) provides an upper bound on how much a matrix can amplify the input vectors, which connects to the generalization properties. In general, smaller spectral norms (i.e., better smoothness) are considered to be associated with stronger generalization, which inspires explicit spectrum control. Theoretical results also suggest that weight matrices with bounded spectrum can provably guarantee generalization.

Spectrum-Preserving Weight Update

To achieve effective weight spectrum control without the limitations above, we propose POET, a reParameterized training algorithm that uses Orthogonal Equivalence Transformation to indirectly learn weight matrices. Specifically, POET reparameterizes a weight matrix \( \mathbf{W} \in \mathbb{R}^{m \times n} \) with \( \mathbf{R} \mathbf{W}_0 \mathbf{P} \) where \( \mathbf{W}_0 \in \mathbb{R}^{m \times n} \) is a randomly initialized weight matrix, \( \mathbf{R} \in \mathbb{R}^{m \times m} \) and \( \mathbf{P} \in \mathbb{R}^{n \times n} \) are two orthogonal matrices. Instead of optimizing weight matrices directly, POET keeps the randomly initialized weight matrix \( \mathbf{W}_0 \) unchanged during training and learns two orthogonal matrices \( \mathbf{R}, \mathbf{P} \) to transform \( \mathbf{W}_0 \). This reparameterization preserves the singular values of weights while allowing flexible optimization of the singular vectors.
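To make the reparameterization concrete, here is a minimal PyTorch sketch (ours, not the released implementation) that builds \( \mathbf{W} = \mathbf{R}\mathbf{W}_0\mathbf{P} \) with exact orthogonal factors and checks that the singular values of \( \mathbf{W}_0 \) are preserved:

```python
import torch

m, n = 64, 32
W0 = torch.randn(m, n)                       # fixed random weight matrix (kept frozen in POET)

# Exact orthogonal factors via QR, just for the illustration
R, _ = torch.linalg.qr(torch.randn(m, m))
P, _ = torch.linalg.qr(torch.randn(n, n))

W = R @ W0 @ P                               # equivalent weight used in the forward pass

# The orthogonal equivalence transformation preserves the singular values of W0
print(torch.allclose(torch.linalg.svdvals(W0), torch.linalg.svdvals(W), atol=1e-4))  # True
```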


Training dynamics of singular values of the same weight matrix in a LLaMA model using standard training (AdamW) and POET.

Dynamics of Singular Spectrum

Inspired by Muon [4], we conduct a spectral analysis by comparing the singular spectrum among AdamW, Muon and POET. We compute the SVD entropy of the trained LLaMA-60M model at different iterations. The SVD entropy, defined as \( H(\boldsymbol{\sigma}) = -\frac{1}{\log n} \sum_i \frac{\sigma_i^2}{\sum_j \sigma_j^2} \log \frac{\sigma_i^2}{\sum_j \sigma_j^2} \), measures the diversity of singular values; higher entropy indicates a more uniform and diverse spectrum. As shown in Figure 4, POET consistently maintains high spectral diversity throughout training, owing to its orthogonal equivalence transformation. Consistent with [4], Muon also yields more diverse spectral updates than AdamW.
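The SVD entropy can be computed directly from the singular values; the short PyTorch helper below is an illustrative sketch of the definition above (a small clamp is added for numerical safety):

```python
import torch

def svd_entropy(W: torch.Tensor) -> torch.Tensor:
    """H(sigma) = -(1/log n) * sum_i p_i * log p_i with p_i = sigma_i^2 / sum_j sigma_j^2."""
    sigma = torch.linalg.svdvals(W)
    p = sigma.square() / sigma.square().sum()
    n = sigma.numel()
    return -(p * p.clamp_min(1e-12).log()).sum() / torch.log(torch.tensor(float(n)))

# A random Gaussian matrix has a fairly uniform spectrum, so its entropy is close to 1.
print(svd_entropy(torch.randn(256, 256)).item())
```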


Dynamics comparison of average singular value entropy (singular value diversity) between direct training (AdamW, Muon) and POET.

Our proposed POET has two key properties:

  • Strong spectrum control: Because orthogonal transformations do not change the singular values of weight matrices, POET keeps the weight spectrum the same as the randomly initialized weight matrices (empirically validated even with approximations). Through the initialization scheme, POET thus directly controls the singular value distribution of its weight matrices. As a result, and in contrast to standard LLM training, POET matrices avoid undesirable large singular values after training. To further facilitate the POET algorithm, we introduce two new initialization schemes: normalized Gaussian initialization and uniform spectrum initialization, which can ensure the resulting weight matrices have bounded singular values.
  • Efficient approximation: While a naive implementation of POET can be computationally expensive, its inherent flexibility opens up opportunities for efficient and scalable training. To address the key challenge of optimizing large orthogonal matrices, we introduce two levels of approximation. First, stochastic primitive optimization factorizes a large orthogonal matrix \( \mathbf{R} \in \mathbb{R}^{m \times m} \) into a product of primitive orthogonal matrices with fewer trainable parameters and uses a merge-then-reinitialize trick for efficiency. Second, the Cayley-Neumann parameterization maintains approximate orthogonality with low computational overhead by approximating the Cayley parameterization with a truncated Neumann series; it also benefits from the merge-then-reinitialize trick, which prevents approximation errors from accumulating.

Beyond AdamW: Fewer Parameters, Lower Perplexity

LLM Pretraining using LLaMA Transformer Architecture

We pretrain LLaMA transformers of varying sizes (60M, 130M, 350M, 1.3B) with POET on the C4 dataset, a cleaned web-crawl corpus from Common Crawl that is widely used for LLM pretraining. The training results are summarized below; we report validation perplexity together with the number of trainable parameters (in parentheses).

| Model (# tokens) | 60M (30B) | 130M (40B) | 350M (40B) | 1.3B (50B) |
|---|---|---|---|---|
| AdamW | 26.68 (25.30M) | 20.82 (84.93M) | 16.78 (302.38M) | 14.73 (1.21B) |
| GaLore | 29.81 (25.30M) | 22.35 (84.93M) | 17.99 (302.38M) | 18.33 (1.21B) |
| LoRA (r=64) | 39.70 (4.85M) | 32.07 (11.21M) | 25.19 (30.28M) | 20.55 (59.38M) |
| POET-BS (b=64) | 29.52 (2.39M) | 24.52 (5.52M) | 20.29 (14.90M) | 18.28 (29.22M) |
| POET-BS (b=128) | 26.90 (4.81M) | 21.86 (11.12M) | 18.05 (30.04M) | 16.24 (58.91M) |
| POET-BS (b=256) | 25.29 (9.66M) | 19.88 (22.33M) | 16.27 (60.32M) | 14.56 (118.26M) |
| POET-FS (b=1/8) | 34.06 (0.53M) | 29.67 (1.78M) | 24.61 (6.34M) | 18.46 (25.39M) |
| POET-FS (b=1/4) | 28.69 (2.13M) | 23.55 (7.13M) | 19.42 (25.44M) | 17.60 (101.66M) |
| POET-FS (b=1/2) | 25.37 (8.54M) | 19.94 (28.56M) | 15.95 (101.86M) | 13.70 (406.88M) |

Training Speedup

To highlight that POET's performance improvement is non-trivial, we increase the number of training steps (i.e., effectively the number of tokens seen) for AdamW, and find that POET-FS (\( b = 1/2 \)) still outperforms AdamW even when AdamW is trained on almost three times as many tokens.


Parameter and Memory Complexity

By introducing a hyperparameter \( b \) as the sampling budget, fully stochastic SPO decouples parameter complexity from the size of the weight matrices. With a small \( b \), POET becomes highly parameter-efficient, though at the cost of slower convergence. This offers users a flexible trade-off between efficiency and speed. In contrast, block-stochastic SPO has parameter complexity dependent on the matrix size (i.e., \( m + n \)), making it more scalable than AdamW, which requires \( mn \) trainable parameters. In terms of memory complexity, both POET variants can be much more efficient than AdamW with a suitable sampling budget \( b \). A comparison of parameter and memory complexity is given below:

| Method | # trainable params | Memory cost |
|---|---|---|
| AdamW | \( mn \) | \( 3mn \) |
| GaLore | \( mn \) | \( mn + mr + 2nr \) |
| POET (FS) | \( b(b-1) \) | \( mn + 3b(b-1) \) |
| POET (BS) | \( \frac{1}{2}(m+n)(b-1) \) | \( mn + \frac{3}{2}(m+n)(b-1) \) |
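As a quick sanity check of the table, a small helper (ours; it follows the formulas above, counts matrix parameters only, and ignores embeddings and norms) compares the trainable-parameter counts:

```python
def poet_param_counts(m: int, n: int, b: int) -> dict:
    """Trainable-parameter counts from the complexity table above (matrix parameters only).

    POET-FS trains two b x b skew-symmetric generators (b(b-1)/2 parameters each);
    POET-BS trains ceil(m/b) + ceil(n/b) such blocks, roughly (m + n)(b - 1)/2 in total.
    """
    return {
        "AdamW":   m * n,
        "POET-FS": b * (b - 1),
        "POET-BS": (m + n) * (b - 1) // 2,
    }

# Example: a 1024 x 1024 weight with sampling budget b = 256
print(poet_param_counts(1024, 1024, 256))
```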

Training with POET

Training Algorithm

Step 1: Initialization. We initialize the weight matrices using normalized Gaussian: \( \mathbf{W} \leftarrow \mathbf{W}_0 \).

Step 2: Orthogonal matrix initialization. For fully stochastic SPO, we randomly sample index sets \( \mathbf{S}_R \subseteq \{1,\cdots,m\} \) and \( \mathbf{S}_P \subseteq \{1,\cdots,n\} \), and parameterize \( \tilde{\mathbf{G}}_{R} \in \mathbb{R}^{b \times b} \) and \( \tilde{\mathbf{G}}_{P} \in \mathbb{R}^{b \times b} \) using CNP. Both matrices are initialized as the identity, so \( \mathbf{R} \) and \( \mathbf{P} \) also start as identity matrices. For block-stochastic SPO, we sample random permutation matrices \( \mathbf{\Psi}_R, \mathbf{\Psi}_P \), and parameterize \( \{ \tilde{\mathbf{G}}_R^{1}, \cdots, \tilde{\mathbf{G}}_R^{\lceil m/b \rceil} \} \) and \( \{ \tilde{\mathbf{G}}_P^{1}, \cdots, \tilde{\mathbf{G}}_P^{\lceil n/b \rceil} \} \) using CNP. These are also initialized as the identity, so \( \mathbf{R} \) and \( \mathbf{P} \) again start as identity matrices.

Step 3: Efficient orthogonal parameterization. For fully stochastic SPO, we have \( \mathbf{R} = \mathbf{I}_m + \mathbf{D}(\mathbf{S}_R)(\tilde{\mathbf{G}}_R - \mathbf{I}_b)\mathbf{D}(\mathbf{S}_R)^\top \) and \( \mathbf{P} = \mathbf{I}_n + \mathbf{D}(\mathbf{S}_P)(\tilde{\mathbf{G}}_P - \mathbf{I}_b)\mathbf{D}(\mathbf{S}_P)^\top \). For block-stochastic SPO, we have \( \mathbf{R} = \mathbf{\Psi}_R^\top \mathrm{Diag}(\tilde{\mathbf{G}}^1_R, \cdots, \tilde{\mathbf{G}}^{\lceil m/b \rceil}_R)\mathbf{\Psi}_R \) and \( \mathbf{P} = \mathbf{\Psi}_P^\top \mathrm{Diag}(\tilde{\mathbf{G}}^1_P, \cdots, \tilde{\mathbf{G}}^{\lceil n/b \rceil}_P)\mathbf{\Psi}_P \).

Step 4: Inner training loop for updating orthogonal matrices. The equivalent weight matrix in the forward pass is \( \mathbf{R}\mathbf{W}\mathbf{P} \). Gradients are backpropagated through \( \mathbf{R} \) and \( \mathbf{P} \) to update \( \tilde{\mathbf{G}}_{R}, \tilde{\mathbf{G}}_{P} \) (fully stochastic) or \( \tilde{\mathbf{G}}_{R}^i, \tilde{\mathbf{G}}_{P}^i, \forall i \) (block-stochastic). This inner loop runs for a fixed number of iterations.

Step 5: Merge-then-reinitialize. The learned orthogonal matrices \( \mathbf{R} \) and \( \mathbf{P} \) are merged into the weight matrix by \( \mathbf{W} \leftarrow \mathbf{R}\mathbf{W}\mathbf{P} \). If not terminated, return to Step 2 for reinitialization.
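The following self-contained PyTorch sketch walks through Steps 1-5 for a single weight matrix, using fully stochastic SPO and CNP on a toy regression objective. It is illustrative only: the helper names, the toy loss, and the loop lengths are our choices, not the reference implementation.

```python
import torch

def cnp(theta: torch.Tensor, k: int = 5) -> torch.Tensor:
    """CNP: R ~= (I + Q)(I + Q + ... + Q^k) with skew-symmetric Q = theta - theta^T."""
    Q = theta - theta.T
    I = torch.eye(Q.shape[0])
    series, Q_pow = I.clone(), I.clone()
    for _ in range(k):
        Q_pow = Q_pow @ Q
        series = series + Q_pow
    return (I + Q) @ series

def embed(G_small: torch.Tensor, S: torch.Tensor, dim: int) -> torch.Tensor:
    """Step 3: I + D(S)(G_small - I_b)D(S)^T, placing G_small on the sampled indices."""
    I = torch.eye(dim)
    D = I[:, S]                                   # selection matrix D(S), shape (dim, b)
    return I + D @ (G_small - torch.eye(S.numel())) @ D.T

# Step 1: fixed random weight with normalized Gaussian initialization (unit-norm neurons)
m, n, b = 64, 32, 8
W = torch.randn(m, n)
W = W / W.norm(dim=1, keepdim=True)

x, y = torch.randn(128, n), torch.randn(128, m)   # toy regression data

for _ in range(10):                               # outer loop over merge-then-reinitialize cycles
    # Step 2: sample index sets and reinitialize the CNP parameters at zero (R = P = I)
    S_R, S_P = torch.randperm(m)[:b], torch.randperm(n)[:b]
    theta_R = torch.zeros(b, b, requires_grad=True)
    theta_P = torch.zeros(b, b, requires_grad=True)
    opt = torch.optim.AdamW([theta_R, theta_P], lr=1e-2)

    # Step 4: inner loop updating only the orthogonal factors (W stays fixed)
    for _ in range(20):
        R = embed(cnp(theta_R), S_R, m)
        P = embed(cnp(theta_P), S_P, n)
        W_eq = R @ W @ P                          # equivalent weight in the forward pass
        loss = ((x @ W_eq.T - y) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

    # Step 5: merge-then-reinitialize
    with torch.no_grad():
        W = embed(cnp(theta_R), S_R, m) @ W @ embed(cnp(theta_P), S_P, n)
```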

Efficient Approximation to Orthogonality

POET is conceptually simple, requiring only the optimization of two orthogonal matrices. However, these matrices are typically large, and naively optimizing them leads to significant computational challenges. To address this, we introduce Stochastic Primitive Optimization (SPO). The core idea of SPO is inspired by how QR factorization is performed with Givens rotations or Householder transformations. Both methods construct a large orthogonal matrix \( \mathbf{R} \) by sequentially applying primitive orthogonal transformations (e.g., Givens rotations or Householder reflections), i.e., \( \mathbf{R} = \prod_{i=1}^c \mathbf{G}_i \), where \( \mathbf{G}_i \) denotes the \( i \)-th primitive orthogonal matrix. While each \( \mathbf{G}_i \) is of the same size as \( \mathbf{R} \), it is parameterized by significantly fewer degrees of freedom. Both Givens rotations and Householder reflections use relatively low-capacity parameterizations (for example, each Givens rotation \( \mathbf{G}_i \) involves only a single effective parameter), which limits their efficiency in representing the full orthogonal matrix. SPO follows a similar idea of factorizing the original orthogonal matrix into multiple primitive orthogonal matrices. However, unlike the Givens and Householder constructions, SPO treats the number of effective parameters in each primitive matrix as a tunable hyperparameter and adopts a stochastic sparsity pattern.

  • Fully stochastic SPO. The basic idea of fully stochastic SPO is to randomly sample a small submatrix and enforce its orthogonality, which can then be extended to a full orthogonal matrix by embedding it within an identity matrix, a process similar to Givens or Householder transformations. To represent a large orthogonal matrix \( \mathbf{R} \in \mathbb{R}^{m \times m} \), we start by defining \( c \) index sets \( \mathbf{S}^j = \{ s^j_1, \cdots, s^j_b \} \subseteq \{ 1, \cdots, m \} \) (\( j \in [1, c] \)), where each set has cardinality \( |\mathbf{S}^j| = b \), a hyperparameter controlling the number of effective parameters of a primitive orthogonal matrix. Each \( \mathbf{S}^j \) is randomly sampled from the full index set \( \{ 1, \cdots, m \} \). Let \( \tilde{\mathbf{G}}_j \in \mathbb{R}^{b \times b} \) be a small orthogonal matrix, and \( \mathbf{D}(\mathbf{S}^j) = [ \mathbf{e}(s^j_1), \cdots, \mathbf{e}(s^j_b) ] \in \mathbb{R}^{m \times b} \) be a selection matrix, where \( \mathbf{e}(k) \) is the standard basis vector with a 1 in the \( k \)-th position and 0 elsewhere. The factorization is given by:

    \( \mathbf{R} = \prod_{i=1}^c \left( \underbrace{ \mathbf{I}_m + \mathbf{D}(\mathbf{S}^i) \cdot (\tilde{\mathbf{G}}_i - \mathbf{I}_b) \cdot \mathbf{D}(\mathbf{S}^i)^\top }_{ \mathbf{G}_i\text{: The } i\text{-th primitive orthogonal matrix} } \right), \quad \text{s.t.}~ \tilde{\mathbf{G}}_i^\top \tilde{\mathbf{G}}_i = \tilde{\mathbf{G}}_i \tilde{\mathbf{G}}_i^\top = \mathbf{I}_b,~\forall i \)
    where \( \mathbf{D}(\mathbf{S}^i) \cdot \mathbf{A} \cdot \mathbf{D}(\mathbf{S}^i)^\top \) embeds the \( b \times b \) matrix \( \mathbf{A} \) into the rows and columns indexed by \( \mathbf{S}^i \). \( \mathbf{I}_m \) and \( \mathbf{I}_b \) are identity matrices of size \( m \times m \) and \( b \times b \), respectively. To efficiently parameterize the small orthogonal matrices \( \tilde{\mathbf{G}}_i \), we use the CNP introduced below.
  • Block-stochastic SPO. While fully stochastic SPO is simple, it may fail to transform all neuron dimensions, because each primitive acts as the identity outside the sampled indices and thus leaves part of the space unchanged. To address this, we propose block-stochastic SPO, which first constructs a block-diagonal orthogonal matrix with small blocks for parameter efficiency, and then applies a random permutation to enhance expressiveness by randomizing the sparsity pattern. Block-stochastic SPO transforms all neuron dimensions simultaneously. Formally, we have:

    \( \mathbf{R} = \prod_{i=1}^c \left( \underbrace{ \mathbf{\Psi}_i^\top \cdot \mathrm{Diag}(\tilde{\mathbf{G}}^1_i, \tilde{\mathbf{G}}^2_i, \cdots, \tilde{\mathbf{G}}^{\lceil m/b \rceil}_i) \cdot \mathbf{\Psi}_i }_{ \mathbf{G}_i\text{: The } i\text{-th primitive orthogonal matrix} } \right), \quad \text{s.t.}~ (\tilde{\mathbf{G}}_i^j)^\top \tilde{\mathbf{G}}_i^j = \tilde{\mathbf{G}}_i^j (\tilde{\mathbf{G}}_i^j)^\top = \mathbf{I}_b,~\forall i, j \)
    where \( \tilde{\mathbf{G}}^j_i \in \mathbb{R}^{b \times b} \) is the \( j \)-th block of the block diagonal matrix, and \( \mathbf{\Psi}_i, \forall i \) are all random permutation matrices. As long as each diagonal block \( \tilde{\mathbf{G}}^j_i \) is an orthogonal matrix, both \( \mathbf{G}_i \) and \( \mathbf{R} \) are also orthogonal matrices. We also use CNP to efficiently parameterize each orthogonal block \( \tilde{\mathbf{G}}^j_i \).
  • Cayley-Neumann Parameterization (CNP)

    The classic Cayley parameterization generates an orthogonal matrix \(\mathbf{R}\) of the form \(\mathbf{R}=(\mathbf{I}+\mathbf{Q})(\mathbf{I}-\mathbf{Q})^{-1}\), where \(\mathbf{Q}\) is a skew-symmetric matrix satisfying \(\mathbf{Q}=-\mathbf{Q}^\top\). A minor caveat of this parameterization is that it only produces orthogonal matrices with determinant \(1\) (i.e., elements of the special orthogonal group), but empirical results in Orthogonal Finetuning (OFT) [3] indicate that this constraint does not hurt performance. However, the matrix inverse in the original Cayley parameterization introduces numerical instability and computational overhead, limiting its scalability to large orthogonal matrices. To address this, we approximate the matrix inverse using a truncated Neumann series:

    \begin{equation}\label{eq:cnp} \mathbf{R}=(\mathbf{I}+\mathbf{Q})(\mathbf{I}-\mathbf{Q})^{-1}=(\mathbf{I}+\mathbf{Q})\cdot\big(\sum_{i=0}^\infty \mathbf{Q}^i \big) \approx (\mathbf{I}+\mathbf{Q})\cdot\big(\mathbf{I}+\sum_{i=1}^k \mathbf{Q}^i \big), \end{equation}

    where a larger number of approximation terms \(k\) leads to a smaller approximation error. By avoiding matrix inversion, CNP improves the training stability of POET; however, this comes at a price: the approximation is valid only when the Neumann series converges, i.e., when the operator norm of \(\mathbf{Q}\) is smaller than \(1\). To initialize the orthogonal matrices as the identity, we set \(\mathbf{Q}\) to the zero matrix in CNP, which satisfies the convergence condition at initialization. As training progresses, however, updates to \(\mathbf{Q}\) may cause its operator norm to exceed \(1\), violating this condition. Fortunately, our merge-then-reinitialize trick mitigates this issue by periodically resetting \(\mathbf{Q}\) to the zero matrix, keeping its operator norm small. (A small sketch of the SPO and CNP constructions is given after this list.)
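Below is a small sketch (ours; plain PyTorch, illustrative helper names) of how one block-stochastic primitive \( \mathbf{G}_i \) is assembled from CNP-parameterized blocks and a random permutation, together with a quick check that the orthogonality error \( \|\mathbf{G}^\top\mathbf{G}-\mathbf{I}\|_F \) shrinks as the number of Neumann terms \( k \) grows (assuming \( \|\mathbf{Q}\| \) is small):

```python
import torch

def cnp(theta: torch.Tensor, k: int = 5) -> torch.Tensor:
    """CNP: R ~= (I + Q)(I + Q + ... + Q^k) with skew-symmetric Q = theta - theta^T."""
    Q = theta - theta.T
    I = torch.eye(Q.shape[0])
    series, Q_pow = I.clone(), I.clone()
    for _ in range(k):
        Q_pow = Q_pow @ Q
        series = series + Q_pow
    return (I + Q) @ series

def block_stochastic_primitive(thetas, perm, k=5):
    """Assemble G = Psi^T Diag(cnp(theta^1), ..., cnp(theta^{m/b})) Psi for one primitive."""
    blocks = torch.block_diag(*[cnp(t, k) for t in thetas])
    Psi = torch.eye(len(perm))[perm]              # random permutation matrix
    return Psi.T @ blocks @ Psi

m, b = 32, 8
thetas = [0.03 * torch.randn(b, b) for _ in range(m // b)]
perm = torch.randperm(m)

# Orthogonality error of the CNP approximation shrinks as k grows (for small ||Q||).
for k in (1, 2, 5):
    G = block_stochastic_primitive(thetas, perm, k)
    err = torch.linalg.matrix_norm(G.T @ G - torch.eye(m))
    print(f"k={k}: ||G^T G - I||_F = {err:.2e}")
```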

POET's Superior Performance Stems from Energy and Spectrum Preservation

Neuron Initialization

Since POET preserves the spectral properties of the initial weight matrix \( \mathbf{W}_0 \), the choice of initialization plays a critical role. We propose to use the normalized Gaussian initialization, which normalizes neurons drawn from a zero-mean Gaussian with fixed variance. We empirically compare different random initialization schemes for POET in the following table. Results show that the normalized Gaussian initialization leads to the best final performance. We hypothesize that the reason for this favorable performance is that POET with normalized Gaussian initialization preserves both hyperspherical energy and the spectrum during training.

| Scheme | Perplexity |
|---|---|
| Standard | 26.22 |
| Xavier | 25.79 |
| Uniform Spectrum | 27.29 |
| Normalized Gaussian | 25.37 |
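For reference, a minimal sketch of the normalized Gaussian initialization as described above (the per-neuron scale is left as a parameter here; the paper's exact convention may differ):

```python
import torch

def normalized_gaussian_init(m: int, n: int, scale: float = 1.0) -> torch.Tensor:
    """Draw neurons (rows) from a zero-mean Gaussian, then rescale each to a fixed norm.

    The per-neuron scale is an assumption of this sketch, not the paper's exact convention.
    """
    W = torch.randn(m, n)
    return scale * W / W.norm(dim=1, keepdim=True)

W0 = normalized_gaussian_init(256, 128)
print(W0.norm(dim=1)[:3])   # every neuron has the same norm by construction
```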

Hyperspherical Energy in Training

Hyperspherical energy \( \mathrm{HE}(\cdot) \) quantifies the uniformity of neurons on the unit hypersphere and can be used to characterize the neural representation of each layer. Orthogonal training [2,3], which acts on the normalized neurons \( \hat{\mathbf{w}}_i = \mathbf{w}_i / \|\mathbf{w}_i\| \), preserves hyperspherical energy during training:


\[ \text{HE}(\mathbf{W}_0):=\sum_{i\neq j}\left\| \hat{\mathbf{w}}^0_i-\hat{\mathbf{w}}^0_j \right\|^{-1} =\sum_{i\neq j}\left\| \mathbf{R}\hat{\mathbf{w}}^0_i-\mathbf{R}\hat{\mathbf{w}}^0_j \right\|^{-1} =:\text{HE}(\mathbf{W}^R). \]

Prior work [2] has shown that energy-preserving training can effectively improve generalization.
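The invariance of hyperspherical energy under an orthogonal transformation of the neurons can be checked numerically; in the sketch below (ours), neurons are taken to be the rows of \( \mathbf{W} \):

```python
import torch

def hyperspherical_energy(W: torch.Tensor) -> torch.Tensor:
    """HE(W) = sum_{i != j} ||w_hat_i - w_hat_j||^{-1} over unit-normalized neurons (rows of W)."""
    W_hat = W / W.norm(dim=1, keepdim=True)
    dists = torch.cdist(W_hat, W_hat)                      # pairwise Euclidean distances
    off_diag = ~torch.eye(W.shape[0], dtype=torch.bool)
    return dists[off_diag].reciprocal().sum()

W = torch.randn(16, 8)
Q, _ = torch.linalg.qr(torch.randn(8, 8))                  # orthogonal transform of each neuron
print(hyperspherical_energy(W).item(), hyperspherical_energy(W @ Q).item())  # equal up to numerics
```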

POET with Normalized Gaussian Initialization Preserves both Energy and Spectrum

Under a zero-mean isotropic Gaussian initialization, POET achieves spectrum-preserving and energy-preserving training simultaneously. This partially explains why the proposed normalized Gaussian initialization achieves the best performance (proof in Appendix B of the paper).


The relationship between POET, spectrum-preserving training and energy-preserving training.

Discussions and Intriguing Insights

Understanding POET Training

To better understand how POET functions, we employ vector probing to analyze the learning dynamics of the orthogonal matrices. Vector probing evaluates an orthogonal matrix \( \mathbf{R} \) using a fixed, randomly generated unit vector \( \mathbf{v} \) by computing \( \mathbf{v}^\top \mathbf{R} \mathbf{v} \), which corresponds to the cosine similarity between \( \mathbf{R}\mathbf{v} \) and \( \mathbf{v} \). By inspecting the cosine similarities of seven orthogonal matrices throughout training, we observe that the learning process can be divided into three distinct phases (see Figure 1):

  1. Conical shell searching: The cosine starts at 1 (i.e., \( \mathbf{R} \) is the identity) and gradually converges to a stable range of [0.6, 0.65], which we observe consistently across all learnable orthogonal matrices. This suggests that \( \mathbf{R} \) transforms \( \mathbf{v} \) into a thin conical shell around its original direction.
  2. Stable learning on the conical shell: The cosine remains within this range while the model begins to learn stably. Despite the cosine plateauing, validation perplexity continues to improve almost linearly.
  3. Final adjusting: Learning slows and eventually halts as the learning rate approaches zero.
We provide an in-depth discussion and empirical results in the Appendix.
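Vector probing itself is a one-liner; the sketch below (ours) also contrasts the identity initialization (cosine 1) with a random rotation (cosine near 0), which makes the reported plateau around 0.6 easier to interpret:

```python
import torch

def vector_probe(R: torch.Tensor, v: torch.Tensor) -> float:
    """Cosine similarity between R v and v for a unit vector v, i.e. v^T R v."""
    return float(v @ (R @ v))

m = 64
v = torch.randn(m)
v = v / v.norm()                                # fixed random unit probe vector

print(vector_probe(torch.eye(m), v))            # 1.0 at initialization (R = I)

R_rand, _ = torch.linalg.qr(torch.randn(m, m))  # a random rotation, for contrast
print(vector_probe(R_rand, v))                  # near 0, unlike POET's learned plateau around 0.6
```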

Connection to Generalization Theory

Several previous works base generalization guarantees on bounding the spectral norms of weight matrices. In particular, the spectrally-normalized margin analysis of Bartlett et al. bounds the misclassification error in terms of a margin-based training loss and a complexity term. The complexity term is proportional to \( Q/(\gamma n) \), where \( \gamma \) is the margin, \( n \) is the sample size, and \( Q \) bounds the spectral complexity. For an \( L \)-layer ReLU MLP with maximal width \( d \), \( Q \) is bounded by

\( Q = \left( \prod_{i=1}^L \lVert \mathbf{W}_i \rVert \right) \left( \sum_{i=1}^L \frac{(\sqrt{d} \lVert \mathbf{W}_i \rVert_F)^{2/3} }{ \lVert \mathbf{W}_i \rVert^{2/3} } \right)^{3/2} \)

where \( \lVert \cdot \rVert \) and \( \lVert \cdot \rVert_F \) denote spectral and Frobenius norm respectively. Those norms remain invariant when training the network with POET and at initialization they can be bounded with high probability using standard results from random matrix theory. The scale at initialization is typically chosen such that \( \mathbf{W} \in \mathbb{R}^{d \times d} \) satisfies \( \lVert \mathbf{W} \rVert = O(1) \) and \( \lVert \mathbf{W} \rVert_F = O(\sqrt{d}) \) so that \( Q = O_L(d) \). For detailed analysis, please refer to the Appendix of the paper.
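Plugging these scalings into the bound gives, as a quick check (with constants that depend only on \( L \) absorbed into the \( O_L \) notation):

\[ Q = \Big(\prod_{i=1}^L O(1)\Big)\Big(\sum_{i=1}^L \frac{\big(\sqrt{d}\cdot O(\sqrt{d})\big)^{2/3}}{O(1)^{2/3}}\Big)^{3/2} = O_L(1)\cdot\big(L\, d^{2/3}\big)^{3/2} = O_L(d). \]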

Performance under a Constant Parameter Budget

Since POET optimizes two orthogonal matrices \( \mathbf{R}, \mathbf{P} \) simultaneously, a natural question arises: which matrix should receive more parameter budget under a fixed total constraint? To investigate this, we conduct a controlled experiment where different ratios of trainable parameters are allocated to \( \mathbf{R} \) and \( \mathbf{P} \) under a fixed total budget. All other settings (e.g., architecture, data) remain unchanged, with full details provided in the Appendix. We use validation perplexity as the evaluation metric. The total parameter budget matches that of fully stochastic POET with \( b = \frac{1}{h}m \) for \( \mathbf{R} \) and \( b = \frac{1}{h}n \) for \( \mathbf{P} \), where \( h = 8 \), \( 4 \), and \( 3 \) correspond to small, medium, and large budgets, respectively. The results show that POET with a balanced allocation between \( \mathbf{R} \) and \( \mathbf{P} \) yields the best performance.


Performance of POET under a constant total parameter budget on \( \mathbf{R} \) and \( \mathbf{P} \).

Weight Update Evenness of Different POET Variants

To understand the higher parameter efficiency of POET-BS compared to POET-FS, we use a toy example to visualize their different weight update mechanisms by counting the total number of updates for each element of the weight matrix. The visualization results are given below. Specifically, a 64 × 64 matrix was randomly initialized and trained for 100 steps under various POET-BS and POET-FS configurations. The merge-then-reinitialize trick is performed at each iteration, so the same set of weight elements is effectively updated between two successive merge-then-reinitialize operations. For each weight element, we count its total number of updates over these 100 steps.
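The following toy simulation (ours; the counting convention is a simplification of the experiment described above) contrasts how many weight elements are touched per merge under fully stochastic and block-stochastic sampling:

```python
import torch

def update_counts(m=64, n=64, b=8, steps=100, mode="FS"):
    """Count how often each element of an m x n weight is touched per merge (toy model).

    FS: one merge changes the rows indexed by S_R and the columns indexed by S_P.
    BS: the permuted block-diagonal factors act on every row and column, so all
    elements are touched at each merge. The counting convention here is ours.
    """
    counts = torch.zeros(m, n)
    for _ in range(steps):
        if mode == "FS":
            S_R, S_P = torch.randperm(m)[:b], torch.randperm(n)[:b]
            touched = torch.zeros(m, n, dtype=torch.bool)
            touched[S_R, :] = True
            touched[:, S_P] = True
        else:                                   # "BS"
            touched = torch.ones(m, n, dtype=torch.bool)
        counts += touched.float()
    return counts

# Average number of updates per element over 100 merges
print(update_counts(mode="FS").mean().item(), update_counts(mode="BS").mean().item())
```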


Visualization of the weight update mechanism of POET-BS and POET-FS after 100 update steps and \( T_m = 1 \).

Neumann Series Approximation in POET

CNP approximates the matrix inverse using a Neumann series. As the number of Neumann terms directly influences the approximation quality, understanding its impact on model performance is essential. To this end, we evaluate how varying the number of Neumann terms affects performance, using POET-FS with \( b = 1/2 \) to train LLaMA-130M. Results in the following table show that increasing the number of Neumann terms generally improves validation perplexity, though at the cost of slower training. Using only one Neumann term (\( k=1 \)) leads to training divergence, highlighting the critical role of maintaining orthogonality. To balance overhead and performance, we find that using 5 Neumann terms is a good trade-off.


Approximation error of orthogonal matrices \( \mathbf{R} \) and \( \mathbf{P} \) of a weight matrix.

| Neumann terms | Validation perplexity |
|---|---|
| \( k=1 \) | Not converged |
| \( k=2 \) | 22.56 |
| \( k=3 \) | 21.54 |
| \( k=4 \) | 20.22 |
| \( k=5 \) | 20.19 |

BibTeX


      @article{qiu2025poet,
        title={Reparameterized LLM Training via Orthogonal Equivalence Transformation},
        author={Qiu, Zeju and Buchholz, Simon and Xiao, Tim Z. and Dax, Maximilian and Sch\"olkopf, Bernhard and Liu, Weiyang},
        journal={arXiv preprint arXiv:2506.08001},
        year={2025}
      }
    

References

  1. Liu, W., Lin, R., Liu, Z., Liu, L., Yu, Z., Dai, B., & Song, L. (2018). Learning towards minimum hyperspherical energy. Advances in neural information processing systems, 31.
  2. Liu, W., Lin, R., Liu, Z., Rehg, J. M., Paull, L., Xiong, L., ... & Weller, A. (2021). Orthogonal over-parameterized training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7251-7260).
  3. Qiu, Z., Liu, W., Feng, H., Xue, Y., Feng, Y., Liu, Z., ... & Schölkopf, B. (2023). Controlling text-to-image diffusion by orthogonal finetuning. Advances in Neural Information Processing Systems, 36, 79320-79362.
  4. Liu, J., Su, J., Yao, X., Jiang, Z., Lai, G., Du, Y., ... & Yang, Z. (2025). Muon is Scalable for LLM Training. arXiv e-prints, arXiv-2502.