POET-X

Memory-efficient LLM Training by Scaling Orthogonal Transformation

Zeju Qiu1, Lixin Liu1, Adrian Weller2, Han Shi3, Weiyang Liu1
1The Chinese University of Hong Kong  ·  2University of Cambridge  ·  3Huawei Technologies

POET-X: A Scalable and Memory-efficient Method for Pretraining Billion-parameter LLMs

Efficient and stable training of large language models (LLMs) remains a core challenge in modern machine learning systems. Reparameterized Orthogonal Equivalence Training (POET), a spectrum-preserving framework that optimizes each weight matrix through orthogonal equivalence transformations, was proposed to address this challenge. Although POET provides strong training stability, its original implementation incurs high memory consumption and computational overhead due to intensive matrix multiplications. To overcome these limitations, we introduce POET-X, a scalable and memory-efficient variant that performs orthogonal equivalence transformations at significantly reduced computational cost. POET-X maintains the generalization and stability benefits of POET while achieving substantial improvements in throughput and memory efficiency. In our experiments, POET-X enables the pretraining of billion-parameter LLMs on a single Nvidia H100 GPU, whereas standard optimizers such as AdamW run out of memory under the same settings.

Latency breakdown of POET, POET-X, and PyTorch Linear Layers with sequence length 2048 and block size \( b = 256 \).

Memory breakdown for training Llama-8B on a single GPU across different methods with batch size 1, sequence lengths 1024 and block size \(b = 256\). Since POET runs OOM under this setting, we estimate its memory footprint by profiling memory usage across different numbers of decoder layers (i.e., parameter sizes) and applying scaling.

Single-layer Benchmarking against POET

The original POET struggles to scale beyond 3B parameters due to prohibitive memory and compute requirements. We quantitatively evaluate POET-X against POET and standard linear layers using Llama-8B settings to demonstrate its scalability.

1. Latency and Compute Efficiency

Compared to POET, the combined forward and backward pass latency drops from 10.59 ms to 1.38 ms for \(\text{POET-X}_{\text{fast}}\) and 1.89 ms for \(\text{POET-X}_{\text{mem}}\). Relative to a highly optimized PyTorch linear layer (cuBLAS), POET-X incurs only modest overhead. Notably, \(\text{POET-X}_{\text{fast}}\) achieves backward-pass latency comparable to standard linear layers due to its high parameter efficiency.

2. Memory Footprint and Scalability

We profiled memory consumption for training a Llama-8B model on a single GPU. While POET was intended to exhibit PEFT-like memory reduction, its original formulation is actually more memory-intensive than AdamW because it requires storing large transformed weight matrices (\(\mathbf{W}_{RP}\)) for backpropagation. In contrast, \(\text{POET-X}_{\text{fast}}\) and \(\text{POET-X}_{\text{mem}}\) both exhibit the reduced memory footprint characteristic of PEFT methods. By drastically lowering memory requirements for activations, gradients, and optimizer states, POET-X significantly enhances the scalability of orthogonal reparameterization for large-scale transformer pretraining.

Experimental Results

Performance Comparison

Multi-node pretraining · Quantized training · Wall-clock efficiency

Multi-node Pretraining using LLaMA Transformer Architecture

We evaluated POET-X by pretraining a Llama-3B transformer on 60B tokens from the C4 dataset, following the Chinchilla scaling law of ~20 tokens per parameter. We benchmarked POET-X against AdamW, Muon (gradient orthogonalization), and memory-efficient baselines GaLore and APOLLO.

Key Experimental Findings

  • Superior Performance: POET-X achieved better validation perplexity (PPL) than AdamW and all other memory-efficient methods.
  • Efficiency Trade-off: While slightly underperforming Muon in PPL, POET-X required significantly lower GPU memory.
  • Block Size Impact: \(\text{POET-X}_{b=512}\) yielded a highly competitive validation perplexity of 12.05, the second-best result across all tested methods.

These results demonstrate that POET-X provides an optimal balance, delivering state-of-the-art training efficiency with a much smaller memory footprint than traditional second-order or orthogonal optimizers.

PPL results for POET-X vs baselines

Quantized Training

A primary advantage of POET-X is its seamless application to quantized base models. We evaluated POET-XQ by pretraining Llama-3B on 10B tokens, comparing it against GaLore and APOLLO.

Key Results

  • Superior Performance: \(\text{POET-XQ}_{b=512}\) achieved the best validation perplexity of 14.78, outperforming both the GaLore and APOLLO baselines.
  • Minimal Memory: \(\text{POET-XQ}_{b=256}\) required the lowest overall training memory across all tested configurations.
  • High Throughput: POET-XQ demonstrated superior computational efficiency by avoiding direct optimization of low-precision weights, leading to higher training throughput.

By decoupling reparameterization from the quantized base weights, POET-XQ maintains compatibility with standard optimizers and can be effortlessly integrated into any quantized model architecture.

PPL results for POET-XQ

Training Speedup

Beyond iteration-wise convergence, we evaluated wall-clock efficiency in a distributed environment (32× Nvidia H100 GPUs across 4 nodes via InfiniBand).

Distributed Scaling Advantages

  • Superior Throughput: POET-X's extreme memory efficiency allows the use of Distributed Data Parallel (DDP). Since the model, gradients, and optimizer states fit entirely on a single GPU, only data is sharded, resulting in higher throughput and stronger scalability.
  • Communication Overhead: In contrast, training AdamW in the same setting triggers Out-of-Memory (OOM) errors with DDP. It necessitates Fully Sharded Data Parallel (FSDP), which introduces significant collective communication overhead due to sharding parameters and optimizer states across GPUs.

By enabling DDP where standard optimizers require FSDP, POET-X achieves significantly better practical wall-clock efficiency and robustness in large-scale distributed pretraining.

Validation PPL with block size 256 · Validation PPL with block size 1024

Efficiency Analysis

In-depth Efficiency Study

Memory · Throughput · Distributed scaling

Memory Efficiency

We systematically benchmarked peak GPU memory on a single Nvidia H100, varying model size (3B to 13B), sequence length (512 to 2048), and block size \(b\). We compared POET-X against AdamW, Muon, GaLore, APOLLO, and a parameter-matched LoRA baseline across both BF16 and INT8 quantized settings.

Key Scalability Findings

  • Prohibitive Baseline: The original POET formulation is highly memory-intensive, failing to fit the 8B and 13B models even at the shortest sequence lengths.
  • State-of-the-Art Efficiency: \(\text{POET-X}_{\text{fast}}\) matches the memory footprint of LoRA, while \(\text{POET-X}_{\text{mem}}\) outperforms all baselines (including LoRA) across every scale and configuration.
  • Scaling Advantage: POET-X's memory benefits become most pronounced at scale. For the 13B model with a 2048 sequence length, POET-X maintains a stable footprint where other methods encounter significant overhead.

These results confirm that POET-X is the most memory-efficient framework for large-scale orthogonal reparameterization, effectively enabling the pretraining of 13B+ parameter models on single-node hardware.

Peak GPU memory comparison

The following table benchmarks the peak GPU memory usage on a single Nvidia H100 (Batch Size 1, no gradient accumulation). This comparison highlights the memory stability of POET-X as model size and sequence length increase.

Throughput Efficiency

We evaluated throughput scalability by varying model size, sequence length, and node count (scaling from 1 to 64 GPUs). For distributed training, POET-X and LoRA utilize Distributed Data Parallel (DDP), while AdamW employs a hybrid FSDP strategy.

1. Single-GPU Throughput

While AdamW shows competitive performance on smaller configurations (Llama-8B at 512/1024 lengths), it encounters Out-of-Memory (OOM) errors as sequence length or model size increases. POET-X maintains stable execution across all tested scales.

2. Distributed Node Scaling

As shown in our 64-GPU benchmarks, AdamW's performance deviates from ideal linear scaling as node counts increase. This bottleneck is driven by:

  • Network Congestion: AdamW requires a full-gradient all-reduce across all nodes at every step.
  • Communication Overhead: FSDP's intra-node all-gather and reduce-scatter operations further degrade throughput.

In contrast, POET-X closely follows the ideal scaling curve. By drastically reducing the number of trainable parameters, it minimizes communication overhead, requiring only minimal collective operations to synchronize updates across the cluster.

Throughput scaling across GPUs

Throughput (k tokens/s) across different numbers of GPUs. The solid line denotes the actual throughput, and the dashed line denotes the ideal linear-scaling throughput. The ideal throughput of \(k\) GPUs is defined as \(T_{k,\text{ideal}} = T_{8,\text{real}} \times k/8\).

Throughput comparison table

The table compares the throughput (k tokens/s) and scaling efficiency. The Scaling Ratio represents the throughput improvement when moving from a single H100 (1×1) to a 64-GPU cluster (8×8 H100).

Method

POET-X: Fast, Memory-efficient Training
by Scaling Orthogonal Transformation

Formulation · Optimizations · Variants

Preliminaries of POET

POET reparameterizes each neuron as \(\mathbf{W}_{RP} = \mathbf{R}\mathbf{W}_0\mathbf{P}\), where \(\mathbf{W}_0 \in \mathbb{R}^{m \times n}\) is a fixed random weight matrix, and \(\mathbf{R} \in \mathbb{R}^{m \times m}\), \(\mathbf{P} \in \mathbb{R}^{n \times n}\) are trainable orthogonal matrices.

This formulation performs an orthogonal equivalence transformation (OET) on \(\mathbf{W}_0\), defined as \(\text{OET}(\mathbf{W}; \mathbf{R}, \mathbf{P}) = \mathbf{R}\mathbf{W}\mathbf{P}\), which multiplies \(\mathbf{W}\) by orthogonal matrices from both sides. The forward pass of POET is thus:

\[ \begin{aligned} &\mathbf{y} = \mathbf{W}_{RP}^\top \mathbf{x} = (\mathbf{R}\mathbf{W}_0\mathbf{P})^\top \mathbf{x},\\ &\text{s.t. } \big\{ \mathbf{R}^\top\mathbf{R} = \mathbf{R}\mathbf{R}^\top = \mathbf{I}, \quad \mathbf{P}^\top\mathbf{P} = \mathbf{P}\mathbf{P}^\top = \mathbf{I} \big\}. \end{aligned} \]

After training, \(\mathbf{R}\) and \(\mathbf{P}\) can be merged into \(\mathbf{W}_{RP}\), ensuring that POET-trained networks have no inference overhead.
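The reparameterization, its spectrum-preserving property, and the post-training merge can be illustrated with a minimal NumPy sketch (toy dimensions; the orthogonal matrices are sampled via QR here rather than learned):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 6

# Fixed random weight matrix W0 and two orthogonal matrices R, P
# (sampled via QR here; in POET they are optimized during training).
W0 = rng.standard_normal((m, n))
R, _ = np.linalg.qr(rng.standard_normal((m, m)))
P, _ = np.linalg.qr(rng.standard_normal((n, n)))

# Orthogonal equivalence transformation: W_RP = R @ W0 @ P.
W_RP = R @ W0 @ P

# Spectrum preservation: the singular values of W_RP equal those of W0.
assert np.allclose(np.linalg.svd(W_RP, compute_uv=False),
                   np.linalg.svd(W0, compute_uv=False))

# After training, R and P are merged into W_RP, so inference is a plain
# linear layer y = W_RP^T x with no extra overhead.
x = rng.standard_normal(m)
y = W_RP.T @ x
assert np.allclose(y, P.T @ (W0.T @ (R.T @ x)))
```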

POET-X: Formulation

In block-stochastic POET, the orthogonal matrix \(\mathbf{R}_i\) is parameterized as a block-diagonal structure with random permutations:

\[ \mathbf{R}_i = \underbrace{\mathbf{\Psi}_i^\top}_{\text{Column-permute}} \cdot \underbrace{\text{Diag}(\tilde{\mathbf{G}}^1_i, \dots, \tilde{\mathbf{G}}^{\lceil\frac{m}{b}\rceil}_i)}_{\text{Orthogonal matrix } \mathbf{G}_i} \cdot \underbrace{\mathbf{\Psi}_i}_{\text{Row-permute}} \]

The weight-centric implementation incurs \(\mathcal{O}(nm^2)\) complexity. To solve this, we use an input-centric formulation:

\[ \underbrace{\overbrace{\underbrace{\mathbf{P}_i^\top\mathbf{W}^\top}_{\text{① mm}}\mathbf{R}_i^\top}^{\text{② mm}}\mathbf{x}}_{\text{③ mv}} \quad \Leftrightarrow \quad \underbrace{\mathbf{P}_i^\top\overbrace{\mathbf{W}^\top\underbrace{\mathbf{R}_i^\top\mathbf{x}}_{\text{① mv}}}^{\text{② mv}}}_{\text{③ mv}} \]
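The reordering is pure associativity, so both evaluation orders yield identical outputs; only the cost differs. A NumPy sketch (toy sizes, QR-sampled orthogonal factors) illustrates the equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, N = 512, 256, 4   # weight dims and batch size (toy values)

W = rng.standard_normal((m, n))
R, _ = np.linalg.qr(rng.standard_normal((m, m)))
P, _ = np.linalg.qr(rng.standard_normal((n, n)))
x = rng.standard_normal((m, N))

# Weight-centric: materialize the transformed weight first.
# Computing (P^T W^T) R^T alone costs O(n m^2), independent of batch size.
y_weight_centric = (P.T @ W.T @ R.T) @ x

# Input-centric: push x through one factor at a time; each step is a
# thin matrix product, avoiding the O(n m^2) weight reconstruction.
y_input_centric = P.T @ (W.T @ (R.T @ x))

assert np.allclose(y_weight_centric, y_input_centric)
```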

The complete inference formula for one weight matrix is defined as:

\[ \mathbf{z} = \mathbf{\Phi}_{n}\mathbf{G}_P^\top\mathbf{\Phi}_{n}^\top\mathbf{W}\mathbf{\Phi}_{m}\mathbf{G}_R^\top\mathbf{\Phi}_{m}^\top\mathbf{x} \]

where \(\mathbf{R}=\mathbf{\Phi}_m^\top\mathbf{G}_R\mathbf{\Phi}_m\) and \(\mathbf{P}=\mathbf{\Phi}_n^\top\mathbf{G}_P\mathbf{\Phi}_n\). To minimize memory overhead, we implement two core optimizations:

Permutation Acceleration

Instead of explicit matrix construction, we use a custom CUDA operator to implement index mapping. For permutations \(\pi_p\) and \(\pi_q\), we apply the following bijections:

\[ \begin{aligned} \mathbf{\Psi}_m \mathbf{W} \equiv \mathbf{W}' \Leftrightarrow (\mathbf{W}')_{i, :} = \mathbf{W}_{\pi_p(i), :} & \quad & \mathbf{W} \mathbf{\Psi}_n \equiv \mathbf{W}' \Leftrightarrow (\mathbf{W}')_{:, j} = \mathbf{W}_{:, \pi^{-1}_q(j)} \\ \mathbf{\Psi}_m^\top \mathbf{W} \equiv \mathbf{W}' \Leftrightarrow (\mathbf{W}')_{i, :} = \mathbf{W}_{\pi^{-1}_p(i), :} & \quad & \mathbf{W} \mathbf{\Psi}_n^\top \equiv \mathbf{W}' \Leftrightarrow (\mathbf{W}')_{:, j} = \mathbf{W}_{:, \pi_q(j)} \end{aligned} \]

By accessing weights in a prescribed order, this approach achieves up to 20× speedup in both forward and backward passes.
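The index-mapping idea can be sketched in NumPy, where applying a permutation matrix reduces to fancy indexing (the actual speedup comes from the custom CUDA operator, which this sketch does not model):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 4
W = rng.standard_normal((m, n))

# Permutations pi_p (rows) and pi_q (columns), their matrices, and inverses.
pi_p, pi_q = rng.permutation(m), rng.permutation(n)
Psi_m, Psi_n = np.eye(m)[pi_p], np.eye(n)[pi_q]   # row i of Psi is e_{pi(i)}
pi_p_inv = np.argsort(pi_p)

# Left-multiplication by Psi_m is row indexing ...
assert np.allclose(Psi_m @ W, W[pi_p, :])
# ... by Psi_m^T is indexing with the inverse permutation ...
assert np.allclose(Psi_m.T @ W, W[pi_p_inv, :])
# ... and right-multiplication by Psi_n^T is column indexing.
assert np.allclose(W @ Psi_n.T, W[:, pi_q])
```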

Permutation Reduction

In the input-centric formulation, we merge two permutations directly into \(\mathbf{W}\) at the start of the inner loop:

\[ \mathbf{z} = \mathbf{\Phi}_{n}\mathbf{G}_P^\top \underbrace{\mathbf{\Phi}_{n}^\top \mathbf{W} \mathbf{\Phi}_{m}}_{\text{Pre-computed}} \mathbf{G}_R^\top \mathbf{\Phi}_{m}^\top \mathbf{x} \]

Since \(\mathbf{W}\) remains fixed during the optimization of \(\mathbf{G}_P\) and \(\mathbf{G}_R\), pre-computing the permuted weights eliminates redundant calculations and significantly reduces total runtime.
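A sketch of this caching, using explicit permutation matrices for clarity (toy sizes; the real implementation uses index mapping instead of dense \(\mathbf{\Phi}\) matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, steps = 6, 4, 3
W = rng.standard_normal((n, m))          # maps R^m -> R^n in this formula
pi_m, pi_n = rng.permutation(m), rng.permutation(n)
Phi_m, Phi_n = np.eye(m)[pi_m], np.eye(n)[pi_n]

# Pre-compute the permuted weight once; it stays fixed while the
# orthogonal factors G_R, G_P are updated in the inner loop.
W_perm = Phi_n.T @ W @ Phi_m

for _ in range(steps):                   # inner optimization loop (sketch)
    G_R, _ = np.linalg.qr(rng.standard_normal((m, m)))
    G_P, _ = np.linalg.qr(rng.standard_normal((n, n)))
    x = rng.standard_normal(m)

    z_cached = Phi_n @ G_P.T @ W_perm @ G_R.T @ Phi_m.T @ x
    z_full = Phi_n @ G_P.T @ Phi_n.T @ W @ Phi_m @ G_R.T @ Phi_m.T @ x
    assert np.allclose(z_cached, z_full)
```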

Batch-Parallel Strategy

In block-stochastic POET, orthogonal matrices utilize a sparse block-diagonal structure:

\[ \mathbf{G}_P = \text{Diag}(\tilde{\mathbf{G}}^1_P, \dots, \tilde{\mathbf{G}}^{\lceil n/b \rceil}_P), \quad \mathbf{G}_R = \text{Diag}(\tilde{\mathbf{G}}^1_R, \dots, \tilde{\mathbf{G}}^{\lceil m/b \rceil}_R) \]

To avoid the overhead of constructing large sparse matrices, we observe that computations occur strictly within individual blocks. We propose a batch-parallel strategy that skips explicit matrix construction. Instead, each block is treated as an independent matrix in a batch, and we perform batch-wise matrix multiplications. This optimization significantly reduces GPU memory usage and improves runtime performance.
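A NumPy sketch of the batch-parallel trick: the blocks are stored as a 3-D tensor and applied with a batched matmul, so the full \(m \times m\) block-diagonal matrix is never materialized:

```python
import numpy as np

rng = np.random.default_rng(0)
b, k = 4, 3                  # block size and number of blocks
m = b * k
blocks = rng.standard_normal((k, b, b))  # the blocks G^1, ..., G^{m/b}
x = rng.standard_normal(m)

# Explicit construction: y = Diag(G^1, ..., G^k) @ x, allocating m x m.
G = np.zeros((m, m))
for i in range(k):
    G[i * b:(i + 1) * b, i * b:(i + 1) * b] = blocks[i]
y_dense = G @ x

# Batch-parallel: each block multiplies its own slice of x via a
# batched matmul over the leading dimension.
y_batched = (blocks @ x.reshape(k, b, 1)).reshape(m)

assert np.allclose(y_dense, y_batched)
```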

Efficient Cayley-Neumann Parameterization

POET-X stores only the upper-triangular part of skew-symmetric matrices \(\mathbf{Q}\), halving the memory footprint. We use Triton for kernel fusion in the Cayley-Neumann expansion:

\[ \mathbf{G} \approx 2 (\mathbf{Q} + \mathbf{Q}^2 + \mathbf{Q}^2 \cdot \mathbf{Q}) + \mathbf{Q}^2 \cdot \mathbf{Q}^2 + \mathbf{I} \]

Compared to a naive PyTorch implementation that repeatedly reads \(\mathbf{Q}\) and \(\mathbf{Q}^2\) from slow global GPU memory for each term in \(\mathbf{G}\), our approach drastically reduces data transfer overhead. By leveraging kernel fusion, we load these tensors into low-latency shared memory only once.

Furthermore, fusing multiple tensor operations into a single custom kernel minimizes the number of PyTorch operator calls. This reduces CPU overhead by improving kernel launch times, leading to a more efficient execution pipeline for both forward and backward passes.
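Numerically, the truncated expansion agrees closely with the exact Cayley transform when \(\mathbf{Q}\) has small norm (a plain NumPy sketch, not the fused Triton kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
b = 16
A = rng.standard_normal((b, b)) * 0.005
Q = A - A.T           # skew-symmetric; small norm so the truncated
                      # Neumann series is accurate
Q2 = Q @ Q

# Truncated Cayley-Neumann expansion:
# G ~ 2(Q + Q^2 + Q^2 Q) + Q^2 Q^2 + I
G = 2 * (Q + Q2 + Q2 @ Q) + Q2 @ Q2 + np.eye(b)

# Exact Cayley transform for comparison: (I + Q)(I - Q)^{-1}.
G_exact = (np.eye(b) + Q) @ np.linalg.inv(np.eye(b) - Q)

assert np.allclose(G, G_exact, atol=1e-3)
assert np.allclose(G.T @ G, np.eye(b), atol=1e-3)  # approximately orthogonal
```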

The backward pass is also fused using the following gradient derivation:

\[ \begin{aligned} &\color{#191970}{\nabla_{1} = \frac{\partial f}{\partial \mathbf{G}}}, \quad \color{#008080}{\nabla_2 =} \color{#191970}{\nabla_1} \color{#008080}{\mathbf{Q}^\top + \mathbf{Q}^\top} \color{#191970}{\nabla_1} \\ &\color{#C27E7E}{\nabla_3 =} \color{#191970}{\nabla_1} \color{#C27E7E}{(\mathbf{Q}^2)^\top + \mathbf{Q}^\top}\color{#008080}{\nabla_2}, \quad \color{#C2B280}{\nabla_4 =} \color{#008080}{\nabla_2} \color{#C2B280}{(\mathbf{Q}^2)^\top + (\mathbf{Q}^2)^\top}\color{#008080}{\nabla_2} \\ &\frac{\partial f}{\partial \mathbf{Q}} = 2\color{#191970}{\nabla_{1}} + 2\color{#008080}{\nabla_{2}} + 2\color{#C27E7E}{\nabla_{3}} + \color{#C2B280}{\nabla_{4}} \\ &~~~~~~~= 2 (\color{#191970}{\nabla_{1}} + \color{#008080}{\nabla_2}) + (2\mathbf{Q}^\top + (\mathbf{Q}^2)^\top)\color{#008080}{\nabla_2} + (2\color{#191970}{\nabla_{1}} + \color{#008080}{\nabla_2}) (\mathbf{Q}^2)^\top \end{aligned} \]

Illustration of efficient Cayley-Neumann parameterization (batch-wise implementation).
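The fused gradient formula can be checked against finite differences. The sketch below uses a linear objective \(f = \langle \mathbf{C}, \mathbf{G} \rangle\) so that \(\nabla_1 = \mathbf{C}\) is a fixed matrix (a verification sketch, not the fused kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
b = 8
Q = rng.standard_normal((b, b)) * 0.1
Q2 = Q @ Q
I = np.eye(b)

def G_of(Q):
    # G = 2(Q + Q^2 + Q^2 Q) + Q^2 Q^2 + I
    Q2 = Q @ Q
    return 2 * (Q + Q2 + Q2 @ Q) + Q2 @ Q2 + I

# Linear objective f = sum(C * G), so that df/dG = C (plays nabla_1).
C = rng.standard_normal((b, b))
f = lambda Q: np.sum(C * G_of(Q))

# Fused backward formula from the derivation above:
g1 = C
g2 = g1 @ Q.T + Q.T @ g1
g3 = g1 @ Q2.T + Q.T @ g2
g4 = g2 @ Q2.T + Q2.T @ g2
grad = 2 * g1 + 2 * g2 + 2 * g3 + g4

# Central finite-difference check of df/dQ, entry by entry.
eps = 1e-6
num = np.zeros_like(Q)
for i in range(b):
    for j in range(b):
        E = np.zeros_like(Q)
        E[i, j] = eps
        num[i, j] = (f(Q + E) - f(Q - E)) / (2 * eps)

assert np.allclose(grad, num, atol=1e-4)
```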

Boosting Memory-Efficiency with Checkpointing

To simplify the memory analysis, we omit permutation matrices and express the forward pass as: \(\mathbf{z} = \mathbf{G}_P^\top \mathbf{W} \mathbf{G}_R^\top \mathbf{x}\). The process is executed via three sequential multiplications:

\[ \mathbf{mm1}: \mathbf{a} = \mathbf{G}_R^\top \mathbf{x}, \quad \mathbf{mm2}: \mathbf{b} = \mathbf{W} \mathbf{a}, \quad \mathbf{mm3}: \mathbf{z} = \mathbf{G}_P^\top \mathbf{b} \]

To enable the backward pass, the PyTorch Autograd Engine must save specific activations, which impacts peak memory:

  • mm3 backward: Computes \(\nabla_{\mathbf{G}_P} = \mathbf{b} \nabla_{\mathbf{z}}^\top\). This requires saving activation \(\mathbf{b}\) (shape \(\mathbb{R}^{N \times m}\)).
  • mm2 backward: Computes \(\nabla_{\mathbf{a}} = \mathbf{W}^\top \nabla_{\mathbf{b}}\). Since \(\mathbf{W}\) is fixed and requires no gradient, no additional activation is saved.
  • mm1 backward: Computes \(\nabla_{\mathbf{G}_R} = \mathbf{x} \nabla_{\mathbf{a}}^\top\). The input \(\mathbf{x}\) is already available in memory, requiring no extra storage.

POET-X Variants

We introduce two variants to balance the compute-memory trade-off:

  • \(\text{POET-X}_{\text{fast}}\): Follows standard Autograd logic by saving \(\mathbf{b}\). It is faster but consumes more memory.
  • \(\text{POET-X}_{\text{mem}}\): Uses gradient checkpointing to recompute \(\mathbf{b}\) on-the-fly during the backward pass, serving as our most memory-efficient implementation.
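The two variants differ only in whether \(\mathbf{b}\) is stored or recomputed. A NumPy sketch of the manual backward pass (permutations omitted, loss \(f = \sum \mathbf{z}\); a simplification of the actual Autograd logic):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, N = 8, 6, 4
W = rng.standard_normal((n, m))          # fixed weight
G_R, _ = np.linalg.qr(rng.standard_normal((m, m)))
G_P, _ = np.linalg.qr(rng.standard_normal((n, n)))
x = rng.standard_normal((m, N))

# Forward: mm1, mm2, mm3 for z = G_P^T W G_R^T x.
a = G_R.T @ x
b = W @ a
z = G_P.T @ b

dz = np.ones_like(z)                     # gradient of f = sum(z)

# POET-X_fast: b was saved during the forward pass.
grad_GP_fast = b @ dz.T                  # nabla_{G_P} = b dz^T
db = G_P @ dz
da = W.T @ db                            # W is fixed: nothing extra saved
grad_GR_fast = x @ da.T                  # x is already resident in memory

# POET-X_mem: b is recomputed on-the-fly instead of being stored.
b_recomputed = W @ (G_R.T @ x)
grad_GP_mem = b_recomputed @ dz.T

assert np.allclose(grad_GP_fast, grad_GP_mem)
```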

POET-XQ: Quantized POET-X Training

Leveraging custom CUDA kernels for both forward and backward passes, POET-X readily supports quantized training. The core mechanism involves storing only the base model's low-bit quantized weight matrices and dequantizing them on the fly. This ensures that high-precision weights are never stored in memory during the activation phase.

Consequently, POET-XQ is implemented on top of \(\text{POET-X}_{\text{mem}}\), in which intermediate activations are recomputed during the backward pass to save space. In contrast, \(\text{POET-X}_{\text{fast}}\) is less suitable for quantized training because it requires storing extra activation tensors, which would necessitate keeping high-precision weight matrices in memory.
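The dequantize-on-the-fly mechanism can be sketched with a toy symmetric per-tensor INT8 scheme (the actual kernels and quantization format may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 6

# Base weights are stored only in low-bit form.
W = rng.standard_normal((m, n))
scale = np.abs(W).max() / 127.0
W_int8 = np.clip(np.round(W / scale), -127, 127).astype(np.int8)

def forward(x):
    # Dequantize on the fly: the full-precision matrix exists only
    # transiently inside this call and is never kept resident in memory.
    W_deq = W_int8.astype(np.float32) * scale
    return W_deq.T @ x

x = rng.standard_normal(m).astype(np.float32)
y = forward(x)

# Round-trip error of symmetric quantization stays within half a step.
assert np.max(np.abs(W_int8.astype(np.float32) * scale - W)) <= scale / 2 + 1e-6
```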

BibTeX


@article{qiu2026poetx,
  title={POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation},
  author={Qiu, Zeju and Liu, Lixin and Weller, Adrian and Shi, Han and Liu, Weiyang},
  journal={arXiv preprint arXiv:2603.05500},
  year={2026}
}

References

  1. Liu, W., Lin, R., Liu, Z., Liu, L., Yu, Z., Dai, B., & Song, L. (2018). Learning towards minimum hyperspherical energy. Advances in Neural Information Processing Systems, 31.
  2. Liu, W., Lin, R., Liu, Z., Rehg, J. M., Paull, L., Xiong, L., ... & Weller, A. (2021). Orthogonal over-parameterized training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7251–7260).
  3. Qiu, Z., Liu, W., Feng, H., Xue, Y., Feng, Y., Liu, Z., ... & Schölkopf, B. (2023). Controlling text-to-image diffusion by orthogonal finetuning. Advances in Neural Information Processing Systems, 36, 79320–79362.
  4. Liu, J., Su, J., Yao, X., Jiang, Z., Lai, G., Du, Y., ... & Yang, Z. (2025). Muon is Scalable for LLM Training. arXiv preprint, arXiv-2502.