OFTv2

Orthogonal Finetuning Made Scalable


¹Max Planck Institute for Intelligent Systems, Tübingen  ²The Chinese University of Hong Kong  ³University of Cambridge
Equal contribution  *Corresponding author


OFTv2 significantly reduces training time and GPU memory usage without sacrificing performance.

Orthogonal Finetuning with LoRA-Competitive Scalability

Orthogonal finetuning (OFT) offers highly parameter-efficient adaptation while preventing catastrophic forgetting, but its high runtime and memory demands limit practical deployment. We identify the core computational bottleneck in OFT as its weight-centric implementation, which relies on costly matrix-matrix multiplications with cubic complexity. To overcome this, we propose OFTv2, an input-centric reformulation that instead uses matrix-vector multiplications (i.e., matrix-free computation), reducing the computational cost to quadratic. We further introduce the Cayley–Neumann parameterization, an efficient orthogonal parameterization that approximates the matrix inversion in the Cayley transform via a truncated Neumann series. These modifications allow OFTv2 to achieve up to \(10\times\) faster training and \(3\times\) lower GPU memory usage without compromising performance. In addition, we extend OFTv2 to support finetuning quantized foundation models and show that it outperforms the popular QLoRA in training stability, efficiency, and memory usage.


Results of GPU memory usage for the same finetuning task.



Orthogonal Finetuning Motivation

Orthogonal Transformation Well Preserves the Pre-trained Knowledge

We draw inspiration from the empirical observation that angular feature differences characterize the semantic gap well. SphereNet shows that training a neural network with all neurons normalized onto a unit hypersphere yields comparable capacity and even better generalizability, implying that neuron directions can capture the most important information in the data. To demonstrate the importance of neuron angles, we conduct a toy experiment in which we train a standard convolutional autoencoder on flower images. In the training stage, we use the standard inner product to produce the feature map (\(z\) denotes the output element of the convolution kernel \(\mathbf{w}\), and \(\mathbf{x}\) is the input in the sliding window). In the testing stage, we compare three ways of generating the feature map: (a) the inner product used in training, (b) the magnitude information only, and (c) the angular information only. The results in the figure show that the angular information of the neurons can almost perfectly recover the input images, while the magnitude contains no useful information. We emphasize that the cosine activation (c) is not applied during training; training uses only the inner product. This implies that the angles (directions) of neurons play the major role in storing the semantic information of the input images. To modify the semantic content of images, finetuning the neuron directions is therefore likely to be more effective.
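For readers who want to see what the three test-time feature maps look like in code, here is a small PyTorch sketch using torch.nn.functional.unfold; the tensor shapes are illustrative assumptions and this is not the original experiment code.

import torch
import torch.nn.functional as F

# Illustrative shapes only: one bank of conv kernels w sliding over an input image x.
w = torch.randn(64, 3, 3, 3)     # (out_channels, in_channels, kH, kW)
x = torch.randn(1, 3, 32, 32)    # (batch, in_channels, H, W)

# Unfold the input into sliding-window patches: (batch, C*kH*kW, num_windows).
patches = F.unfold(x, kernel_size=3, padding=1)
w_flat = w.view(w.size(0), -1)   # (out_channels, C*kH*kW)

# (a) Inner product, as used during training: z = <w, x>.
z_inner = w_flat @ patches

# (b) Magnitude only: ||w|| * ||x||.
z_mag = w_flat.norm(dim=1, keepdim=True) * patches.norm(dim=1, keepdim=True)

# (c) Angular only: cos(theta) = <w, x> / (||w|| ||x||).
z_ang = z_inner / (z_mag + 1e-8)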


Why Does Orthogonal Transformation Make Sense?

To demonstrate the regularization induced by the orthogonality constraint, we run a controllable generation experiment using the setting of ControlNet; the results are shown in the figure. Standard OFT trains stably and achieves accurate control once training is finished (epoch 20). In comparison, OFT without the orthogonality constraint fails to generate realistic images and achieves no control. This experiment validates the importance of the orthogonality constraint in OFT.


OFT vs LoRA: Two Distinct Roads to Efficient Fine-tuning


Comparison between low-rank adaptation (e.g., LoRA) and orthogonal finetuning (e.g., OFT): low-rank vs. sparsity to reduce trainable parameters.


Comparison between sequential adaptation (e.g., LoRA) and parallel adaptation (e.g., OFT).
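To make the contrast concrete, here is a minimal sketch of the two update forms at a single linear layer, using the notation of the forward pass described later (\(\mathbf{z} = \mathbf{W}_0^\top\mathbf{R}^\top\mathbf{x}\)): LoRA adds a low-rank term to the frozen weight, whereas OFT multiplies it by an orthogonal matrix. Shapes and the rank are illustrative assumptions.

import torch

d, n, rank = 1024, 4096, 16
W0 = torch.randn(d, n)        # frozen pretrained weight (maps d-dim inputs to n-dim outputs)
x = torch.randn(2, d)         # a batch of inputs

# LoRA: additive low-rank update, W = W0 + B @ A with rank << d.
B = torch.zeros(d, rank)      # standard LoRA init: B = 0, so the update starts at zero
A = torch.randn(rank, n)
z_lora = x @ (W0 + B @ A)

# OFT: multiplicative orthogonal transform, W = R @ W0.
# Here R is a single dense orthogonal matrix; OFT actually uses a block-diagonal
# (sparse) R to keep the number of trainable parameters small.
R = torch.linalg.qr(torch.randn(d, d)).Q
z_oft = x @ (R @ W0)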

Training with OFTv2

Training with Huggingface PEFT

OFT can be used as a drop-in replacement for LoRA. Simply replace LoraConfig with OFTConfig:

from transformers import AutoModelForCausalLM
from peft import get_peft_model, OFTConfig

# Load the base model to be finetuned
model = AutoModelForCausalLM.from_pretrained("model_name")

# Configure OFT
peft_config = OFTConfig(
    oft_block_size=32,          # size of each orthogonal block
    use_cayley_neumann=True,    # efficient Cayley-Neumann parameterization
    target_modules="all-linear",
    bias="none",
    task_type="CAUSAL_LM"
)

# Wrap the base model with OFT adapters
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

Important: r x oft_block_size should equal the in_features of the target module, e.g., in_features = 4096 and oft_block_size = 32 leads to r = 128. For simplicity, we let the user specify either r or oft_block_size and infer the other one. We advise specifying oft_block_size for better clarity. A short sketch of this relation follows the parameter list below.

Key configuration parameters explained:

  • r: specifies the number of OFT blocks. Smaller values use more parameters (i.e., bigger block sizes). Typically, we set r = 0 and specify oft_block_size instead.
  • oft_block_size: controls the size of the OFT blocks. Smaller values use fewer parameters but may be less expressive, while larger values provide more flexibility at the cost of increased memory usage.
  • use_cayley_neumann: specifies whether to use the Cayley-Neumann parameterization (efficient but approximate) or the vanilla Cayley parameterization (exact but computationally expensive due to the matrix inverse). We recommend setting it to True for better efficiency, though performance may be slightly worse; test both settings (True and False) depending on your needs. Default is False.
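To make the relation between r, oft_block_size, and in_features concrete, here is a small sketch (the helper function is illustrative and not part of the PEFT API):

def infer_oft_r(in_features: int, oft_block_size: int) -> int:
    # r x oft_block_size must equal in_features of the target module,
    # so r is simply the number of blocks along the diagonal.
    assert in_features % oft_block_size == 0, "oft_block_size must divide in_features"
    return in_features // oft_block_size

# Example from the text: in_features = 4096, oft_block_size = 32 -> r = 128
print(infer_oft_r(4096, 32))  # 128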

Training with Huggingface TRL

OFT works as a drop-in replacement for LoRA in TRL—simply replace LoraConfig with OFTConfig to use it for SFT, PPO, or DPO fine-tuning:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTTrainer
from peft import OFTConfig

# Optionally quantize the base model to NF4 (QOFT); otherwise load in full precision
bnb_config = None
if use_quantization:
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_storage=torch.bfloat16,
    )

model = AutoModelForCausalLM.from_pretrained(
    "model_name",
    quantization_config=bnb_config,
)
tokenizer = AutoTokenizer.from_pretrained("model_name")

# Configure OFT
peft_config = OFTConfig(
    oft_block_size=32,
    use_cayley_neumann=True,
    target_modules="all-linear",
    bias="none",
    task_type="CAUSAL_LM"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=ds['train'],
    peft_config=peft_config,
    tokenizer=tokenizer,
    args=training_arguments,
    data_collator=collator,
)

trainer.train()

OFTv2 Methods

From Weight-centric to Input-centric Implementation

OFT performs finetuning by learning an orthogonal matrix to directly transform the weight matrix, which naturally leads to a weight-centric implementation of the forward pass:

\[ \mathbf{z} = \underbrace{\overbrace{\mathbf{W}_0^\top\mathbf{R}^\top}^{\text{(1) \textbf{Weight transform}: matrix-matrix mult.}}\mathbf{x}}_{\text{(2) \textbf{Linear map}: matrix-vector mult.}} \]

The original OFT first performs a weight transform by computing \(\mathbf{W}_{\text{OFT}}^\top=\mathbf{W}_0^\top\mathbf{R}^\top\) (i.e., a matrix-matrix multiplication) and then computes the results of a linear layer with the equivalent weight matrix \(\mathbf{W}_{\text{OFT}}^\top\) (i.e., a matrix-vector multiplication). This incurs \(\mathcal{O}(nd^2)\) complexity due to the matrix-matrix multiplication. Inspired by matrix-free methods for solving linear systems, we observe that OFT's forward pass can be interpreted as two linear maps applied to the input. This leads to an input-centric implementation:

\[ \mathbf{z} = \underbrace{\mathbf{W}_0^\top\overbrace{\mathbf{R}^\top\mathbf{x}}^{\text{(1) \textbf{Linear map}: matrix-vector mult.}}}_{\text{(2) \textbf{Linear map}: matrix-vector mult.}} \]

where only two matrix-vector multiplications are required, reducing the complexity from cubic to quadratic: \(\mathcal{O}(nd + d^2)\). This simple conceptual shift in implementation entails substantial speed-up in training time and reduction in GPU memory.
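The difference between the two implementations is easy to see in a few lines of PyTorch. The sketch below contrasts the weight-centric forward pass (which materializes the transformed weight via a matrix-matrix product) with the input-centric one (two matrix-vector products); all names and shapes are illustrative assumptions, not the released implementation.

import torch

# Illustrative dimensions: input dim d, output dim n (float64 so the comparison is numerically tight)
d, n = 1024, 4096
W0 = torch.randn(d, n, dtype=torch.float64)                    # frozen pretrained weight
R = torch.linalg.qr(torch.randn(d, d, dtype=torch.float64)).Q  # stand-in orthogonal matrix for OFT's R
x = torch.randn(d, dtype=torch.float64)                        # a single input vector

# Weight-centric (original OFT): transform the weight first, then apply it.
# The matrix-matrix product R @ W0 costs O(n * d^2).
z_weight_centric = (R @ W0).T @ x

# Input-centric (OFTv2): apply two matrix-vector products to the input.
# This costs O(d^2) for R^T x plus O(n * d) for W0^T (R^T x).
z_input_centric = W0.T @ (R.T @ x)

print(torch.allclose(z_weight_centric, z_input_centric))  # True: same result, lower cost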

Approximate Orthogonality via Cayley-Neumann Parameterization

The Cayley parameterization constructs an orthogonal matrix \(\mathbf{R}\) as \(\mathbf{R} = (\mathbf{I} + \mathbf{Q})(\mathbf{I} - \mathbf{Q})^{-1}\), where \(\mathbf{Q}\) is a skew-symmetric matrix. One limitation of this formulation is that it only generates rotation matrices, though empirical studies suggest that this restriction does not negatively affect performance. More critically, computing a matrix inverse introduces numerical instability and additional computational overhead, making it challenging to scale to large orthogonal matrices. To avoid numerical instability, we replace the matrix inverse with a truncated Neumann series:

\[ \begin{aligned} \mathbf{R}&=(\mathbf{I}+\mathbf{Q})(\mathbf{I}-\mathbf{Q})^{-1}=(\mathbf{I}+\mathbf{Q})\left(\sum_{i=0}^\infty \mathbf{Q}^i \right) \\ &\approx (\mathbf{I}+\mathbf{Q})\left(\mathbf{I}+\sum_{i=1}^k \mathbf{Q}^i \right), \end{aligned} \]

where larger \(k\) leads to better approximation. Removing the matrix inversion improves training stability. The Neumann series approximation converges in the operator norm if \(\|\mathbf{Q}\|<1\). This condition is naturally satisfied in practice: to start from the pretrained model, OFT initializes the orthogonal matrix \(\mathbf{R}\) as the identity, which requires \(\mathbf{Q}\) to start as a zero matrix. Since finetuning begins with a small learning rate and typically involves relatively few steps, \(\mathbf{Q}\) tends not to drift far from zero. Empirically, even if \(\|\mathbf{Q}\|\) slightly exceeds \(1\), it does not harm OFT's training stability, as we use only a finite number of Neumann terms.
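For concreteness, here is a minimal single-matrix sketch of the Cayley-Neumann parameterization in PyTorch; the function name and the choice k = 5 are illustrative, and in OFT the same construction is applied to each diagonal block.

import torch

def cayley_neumann(theta: torch.Tensor, k: int = 5) -> torch.Tensor:
    # Q is skew-symmetric by construction, so (I + Q)(I - Q)^{-1} is orthogonal;
    # the inverse is replaced by a truncated Neumann series I + Q + ... + Q^k.
    Q = theta - theta.T
    I = torch.eye(Q.size(0), dtype=Q.dtype, device=Q.device)
    neumann = I.clone()
    Q_power = I.clone()
    for _ in range(k):
        Q_power = Q_power @ Q
        neumann = neumann + Q_power
    return (I + Q) @ neumann

# At initialization theta = 0, so Q = 0 and R = I, preserving the pretrained weights.
theta = 0.01 * torch.randn(32, 32)
R = cayley_neumann(theta, k=5)
print((R @ R.T - torch.eye(32)).abs().max())  # small: R is approximately orthogonal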

CUDA kernel for skew-symmetric matrices. To maximize GPU memory efficiency, we leverage the skew-symmetric structure of \(\mathbf{Q}\in\mathbb{R}^{n\times n}\), where \(Q_{ii} = 0\), \(Q_{ij} = -Q_{ji}\). By storing only the upper triangular part as a vector, we reduce the storage requirement from \(n^2\) to \(\frac{n (n - 1)}{2}\). During the forward pass, \(\mathbf{Q}\) is reconstructed on-the-fly using a highly optimized custom CUDA kernel that significantly accelerates this process.
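The custom CUDA kernel is not reproduced here, but the storage layout it implements can be sketched in plain PyTorch: keep only the \(\frac{n(n-1)}{2}\) strictly upper-triangular entries as a trainable vector and rebuild the full skew-symmetric \(\mathbf{Q}\) on the fly.

import torch

def reconstruct_Q(q_vec: torch.Tensor, n: int) -> torch.Tensor:
    # Rebuild the skew-symmetric matrix (Q_ii = 0, Q_ij = -Q_ji) from its packed upper triangle.
    iu = torch.triu_indices(n, n, offset=1)
    Q = q_vec.new_zeros(n, n)
    Q[iu[0], iu[1]] = q_vec       # strict upper triangle
    Q[iu[1], iu[0]] = -q_vec      # mirrored lower triangle with flipped sign
    return Q

n = 32
q_vec = torch.zeros(n * (n - 1) // 2, requires_grad=True)   # n(n-1)/2 values instead of n^2
Q = reconstruct_Q(q_vec, n)
print(torch.allclose(Q, -Q.T))  # True: Q is skew-symmetric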

QOFT: Adapting OFT to Finetuning Quantized Foundation Models

While PEFT methods primarily aim to reduce optimizer memory by minimizing trainable parameters, the growing scale of foundation models has shifted the memory bottleneck to the pretrained weights themselves. As model dimensions grow, these frozen parameters increasingly dominate memory consumption during training. To address this emerging challenge, we argue that truly scalable OFT must operate directly on quantized model representations, such as NormalFloat4 and AWQ. This represents a critical shift that enables OFT to scale effectively.

To this end, we introduce QOFT, a natural extension of OFTv2 for quantized foundation models. QOFT largely follows the framework of QLoRA. Specifically, the quantized low-bit weight matrices are first dequantized to higher precision, after which the parameter-efficient adaptation is carried out in the higher-precision space. Formally, the forward pass of QOFT can be written as:

\[ \mathbf{z} = \underbrace{\text{Dequant}(\mathbf{W}_{\text{quant}})^\top}_{\text{Frozen}}\underbrace{\mathbf{R}^\top}_{\text{Trainable}}\mathbf{x} \]

The update of OFTv2's orthogonal matrix \(\mathbf{R}\) is performed in high precision (e.g., BF16). We denote the dequantization function as \(\text{Dequant}(\cdot)\) and follow QLoRA's design by adopting a double quantization strategy, where the quantization parameters of the weight matrices are themselves quantized to further reduce GPU memory usage.
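As a schematic of the forward pass above, the sketch below rotates the input with the trainable orthogonal matrix and then applies the frozen, dequantized weight. The dequant function here is a stand-in assumption; in practice QOFT relies on the NF4 kernels from bitsandbytes, as in the TRL example above.

import torch

def qoft_forward(x, W_quant, dequant, R):
    # z = Dequant(W_quant)^T R^T x, computed input-centrically for a batch of row vectors.
    x_rot = x @ R                 # trainable orthogonal rotation of the input (high precision)
    W = dequant(W_quant)          # frozen weight, dequantized on the fly
    return x_rot @ W              # frozen linear map

# Minimal usage with random stand-ins (no real quantization backend).
d, n = 1024, 4096
W_quant, dequant = torch.randn(d, n), (lambda w: w)   # identity "dequantization" for illustration
R = torch.eye(d)                                      # at initialization R = I: output matches the base model
x = torch.randn(2, d)
print(qoft_forward(x, W_quant, dequant, R).shape)     # torch.Size([2, 4096])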

Experimental Results

Finetuning Qwen2.5 for Math Reasoning

We perform supervised finetuning on the Huggingface OpenR1-Math-220k dataset, a large-scale mathematical reasoning corpus containing challenging problems with two to four reasoning traces distilled from DeepSeek R1. Following the evaluation protocol of Qwen2.5-Math, we report pass@1 performance on established math benchmarks. Finetuning was performed only on NormalFloat 4-bit quantized base models, due to the substantial memory requirements imposed by the large context window (16,384 tokens) needed for training on a reasoning dataset. The baseline type refers to the pre-trained Qwen2.5 models without any continual training. We observe that QOFT consistently outperforms both QLoRA and the baseline models across all evaluated scales and tasks, despite using significantly fewer trainable parameters. The results are reported below:

| Model | Type | # Params | AMC23 | AQUA | CMATH | GaoKao 2023 En | Minerva Math | OlympiadBench | SAT Math |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-1.5B-it | baseline | - | 17.5 | 49.2 | 65.2 | 36.4 | 9.6 | 12.0 | 59.4 |
| Qwen2.5-1.5B-it | QLoRA | 18.46M | 15.0 | 42.5 | 61.5 | 29.6 | 8.1 | 8.9 | 59.4 |
| Qwen2.5-1.5B-it | QOFT | 7.89M | 27.5 | 53.1 | 68.5 | 41.0 | 11.8 | 14.4 | 81.2 |
| Qwen2.5-1.5B | baseline | - | 0.0 | 18.9 | 4.0 | 4.2 | 2.6 | 2.4 | 28.1 |
| Qwen2.5-1.5B | QLoRA | 18.46M | 15.0 | 37.4 | 64.2 | 26.8 | 8.5 | 6.8 | 62.5 |
| Qwen2.5-1.5B | QOFT | 7.89M | 22.5 | 53.1 | 56.3 | 36.1 | 8.5 | 12.7 | 87.5 |
| Qwen2.5-7B-it | baseline | - | 50.0 | 16.5 | 89.3 | 61.8 | 33.5 | 36.6 | 53.1 |
| Qwen2.5-7B-it | QLoRA | 40.37M | 30.0 | 48.0 | 88.8 | 50.1 | 25.4 | 19.7 | 68.8 |
| Qwen2.5-7B-it | QOFT | 17.55M | 52.5 | 70.9 | 90.5 | 63.6 | 33.5 | 37.6 | 96.9 |
| Qwen2.5-7B | baseline | - | 25.0 | 55.1 | 61.2 | 42.9 | 11.8 | 29.9 | 71.9 |
| Qwen2.5-7B | QLoRA | 40.37M | 35.0 | 48.8 | 73.7 | 49.9 | 18.8 | 18.5 | 62.5 |
| Qwen2.5-7B | QOFT | 17.55M | 52.5 | 59.4 | 80.7 | 55.6 | 21.7 | 34.7 | 87.5 |
| Qwen2.5-32B-it | baseline | - | 62.5 | 18.5 | 92.5 | 70.1 | 41.5 | 44.4 | 65.6 |
| Qwen2.5-32B-it | QLoRA | 134.22M | 62.5 | 71.7 | 94.0 | 71.2 | 39.7 | 46.8 | 96.9 |
| Qwen2.5-32B-it | QOFT | 57.90M | 75.0 | 83.1 | 94.7 | 73.5 | 41.5 | 48.7 | 100.0 |
| Qwen2.5-32B | baseline | - | 35.0 | 23.2 | 35.7 | 46.8 | 20.2 | 25.2 | 62.5 |
| Qwen2.5-32B | QLoRA | 134.22M | 40.0 | 52.4 | 90.5 | 61.0 | 32.0 | 29.8 | 65.6 |
| Qwen2.5-32B | QOFT | 57.90M | 70.0 | 68.5 | 90.7 | 71.4 | 36.0 | 44.9 | 93.8 |

DreamBooth Finetuning SD 3.5


Qualitative results from DreamBooth finetuning of Stable Diffusion 3.5 Large (8.1B parameters), with peak allocated GPU memory: LoRA (52.33 GB), OFT (52.32 GB), QLoRA (41.60 GB), and QOFT (41.53 GB).

Explore Related Projects

Orthogonal Finetuning V1

Large text-to-image diffusion models have impressive capabilities in generating photorealistic images from text prompts. How to effectively guide or control these powerful models to perform different downstream tasks becomes an important open problem. To tackle this challenge, we introduce a principled finetuning method -- Orthogonal Finetuning (OFT), for adapting text-to-image diffusion models to downstream tasks. Unlike existing methods, OFT can provably preserve hyperspherical energy which characterizes the pairwise neuron relationship on the unit hypersphere. We find that this property is crucial for preserving the semantic generation ability of text-to-image diffusion models. To improve finetuning stability, we further propose Constrained Orthogonal Finetuning (COFT) which imposes an additional radius constraint to the hypersphere. Specifically, we consider two important finetuning text-to-image tasks: subject-driven generation where the goal is to generate subject-specific images given a few images of a subject and a text prompt, and controllable generation where the goal is to enable the model to take in additional control signals. We empirically show that our OFT framework outperforms existing methods in generation quality and convergence speed.


A Parameter-Efficient Formulation with Butterfly Factorization

Large foundation models are becoming ubiquitous, but training them from scratch is prohibitively expensive. Thus, efficiently adapting these powerful models to downstream tasks is increasingly important. In this paper, we study a principled finetuning paradigm -- Orthogonal Finetuning (OFT) -- for downstream task adaptation. Despite demonstrating good generalizability, OFT still uses a fairly large number of trainable parameters due to the high dimensionality of orthogonal matrices. To address this, we start by examining OFT from an information transmission perspective, and then identify a few key desiderata that enable better parameter-efficiency. Inspired by how the Cooley-Tukey fast Fourier transform algorithm enables efficient information transmission, we propose an efficient orthogonal parameterization using butterfly structures. We apply this parameterization to OFT, creating a novel parameter-efficient finetuning method, called Orthogonal Butterfly (BOFT). By subsuming OFT as a special case, BOFT introduces a generalized orthogonal finetuning framework. Finally, we conduct an extensive empirical study of adapting large vision transformers, large language models, and text-to-image diffusion models to various downstream tasks in vision and language.


BibTeX


      @misc{qiu2025orthogonalfinetuningscalable,
            title={Orthogonal Finetuning Made Scalable}, 
            author={Zeju Qiu and Weiyang Liu and Adrian Weller and Bernhard Schölkopf},
            year={2025},
            eprint={2506.19847},
            archivePrefix={arXiv},
            primaryClass={cs.LG},
            url={https://arxiv.org/abs/2506.19847}, 
      }

      @misc{qiu2024controllingtexttoimagediffusionorthogonal,
            title={Controlling Text-to-Image Diffusion by Orthogonal Finetuning}, 
            author={Zeju Qiu and Weiyang Liu and Haiwen Feng and Yuxuan Xue and Yao Feng and Zhen Liu and Dan Zhang and Adrian Weller and Bernhard Schölkopf},
            year={2024},
            eprint={2306.07280},
            archivePrefix={arXiv},
            primaryClass={cs.CV},
            url={https://arxiv.org/abs/2306.07280}, 
      }

      @misc{liu2024parameterefficientorthogonalfinetuningbutterfly,
        title={Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization}, 
        author={Weiyang Liu and Zeju Qiu and Yao Feng and Yuliang Xiu and Yuxuan Xue and Longhui Yu and Haiwen Feng and Zhen Liu and Juyeon Heo and Songyou Peng and Yandong Wen and Michael J. Black and Adrian Weller and Bernhard Schölkopf},
        year={2024},
        eprint={2311.06243},
        archivePrefix={arXiv},
        primaryClass={cs.LG},
        url={https://arxiv.org/abs/2311.06243}, 
      }