Orthogonal finetuning (OFT) offers highly parameter-efficient adaptation while preventing catastrophic forgetting, but its high runtime and memory demands limit practical deployment. We identify the core computational bottleneck in OFT as its weight-centric implementation, which relies on costly matrix-matrix multiplications with cubic complexity. To overcome this, we propose OFTv2, an input-centric reformulation that instead uses matrix-vector multiplications (i.e., matrix-free computation), reducing the computational cost to quadratic. We further introduce the Cayley–Neumann parameterization, an efficient orthogonal parameterization that approximates the matrix inversion in the Cayley transform via a truncated Neumann series. These modifications allow OFTv2 to achieve up to \(10\times\) faster training and \(3\times\) lower GPU memory usage without compromising performance. In addition, we extend OFTv2 to support finetuning quantized foundation models and show that it outperforms the popular QLoRA in training stability, efficiency, and memory usage.
GPU memory usage for the same finetuning task.
We draw inspiration from the empirical observation that the angular difference between features characterizes the semantic gap well. SphereNet shows that training a neural network with all neurons normalized onto a unit hypersphere yields comparable capacity and even better generalizability, implying that the directions of neurons can fully capture the most important information from the data. To better demonstrate the importance of neuron angles, we conduct a toy experiment in which we train a standard convolutional autoencoder on flower images. In the training stage, we use the standard inner product to produce the feature map (\(z\) denotes the scalar output of the convolution kernel \(\mathbf{w}\), and \(\mathbf{x}\) is the input in the sliding window). In the testing stage, we compare three ways to generate the feature map: (a) the inner product used in training, (b) the magnitude information only, and (c) the angular information only. The results in the following Figure show that the angular information of neurons can almost perfectly recover the input images, while the magnitude of neurons contains no useful information. We emphasize that the cosine activation in (c) is not applied during training; training is based only on the inner product. The results imply that the angles (directions) of neurons play the dominant role in storing the semantic information of the input images. Therefore, to modify the semantic information of images, finetuning the neuron directions is likely to be more effective.
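For concreteness, the three test-time feature maps can be computed as in the following sketch (a minimal PyTorch illustration; the kernel sizes, variable names, and helper function are assumptions, not the original experiment code):

```python
import torch
import torch.nn.functional as F

def feature_maps(x, weight, eps=1e-8):
    """Compute (a) inner-product, (b) magnitude, and (c) angular feature maps.

    x: (B, C, H, W) input images; weight: (out_channels, C, k, k) conv kernels.
    """
    inner = F.conv2d(x, weight)                                  # (a) z = <w, x>, used in training
    w_norm = weight.flatten(1).norm(dim=1).view(1, -1, 1, 1)     # ||w|| per kernel
    ones = torch.ones_like(weight)
    x_norm = torch.sqrt(F.conv2d(x ** 2, ones) + eps)            # ||x|| per sliding window
    magnitude = w_norm * x_norm                                  # (b) ||w|| * ||x|| only
    angular = inner / (magnitude + eps)                          # (c) cos(theta) between w and x
    return inner, magnitude, angular

x = torch.randn(1, 3, 64, 64)        # a toy input image batch
w = torch.randn(16, 3, 5, 5)         # toy convolution kernels
inner, magnitude, angular = feature_maps(x, w)
```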
We perform an experiment to demonstrate the effective regularization induced by the orthogonality constraint. We run the controllable generation experiment using the setting of ControlNet, and the results are given in the following Figure. We observe that our standard OFT performs quite stably and achieves accurate control after training is finished (epoch 20). In comparison, OFT without the orthogonality constraint fails to generate any realistic image and achieves no control. This experiment validates the importance of the orthogonality constraint in OFT.
Comparison between low-rank adaptation (e.g., LoRA) and orthogonal finetuning (e.g., OFT): low-rank vs. sparsity to reduce trainable parameters.
Comparison between sequential adaptation (e.g., LoRA) and parallel adaptation (e.g., OFT).
OFT can be easily used as a drop-in replacement for LoRA; simply replace the `LoraConfig` with `OFTConfig`:
```python
from transformers import AutoModelForCausalLM
from peft import get_peft_model, OFTConfig

# Load the base model to adapt
model = AutoModelForCausalLM.from_pretrained("model_name")

# Configure OFT
peft_config = OFTConfig(
    oft_block_size=32,
    use_cayley_neumann=True,
    target_modules="all-linear",
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
```
Important: `r` × `oft_block_size` should be equal to `in_features` of the target module, e.g., `in_features = 4096` and `oft_block_size = 32` leads to `r = 128`. For simplicity, we let the user specify either `r` or `oft_block_size` and infer the other one. We advise the user to specify `oft_block_size` for better clarity.
Key configuration parameters explained:

- `r` / `oft_block_size`: specify one and set the other to `0` (e.g., `r = 0` with `oft_block_size = 32`); the unspecified value is inferred from `in_features`.
- `use_cayley_neumann`: set to `True` for better efficiency, but performance may be slightly worse. Please test both settings (`True` and `False`) depending on your needs. Default is `False`.

OFT works as a drop-in replacement for LoRA in TRL: simply replace `LoraConfig` with `OFTConfig` to use it for SFT, PPO, or DPO fine-tuning:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTTrainer
from peft import OFTConfig

use_quantization = True  # set to False to finetune in full/half precision

# Optional 4-bit quantization (QOFT); requires bitsandbytes
bnb_config = None
if use_quantization:
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_storage=torch.bfloat16,
    )

model = AutoModelForCausalLM.from_pretrained(
    "model_name",
    quantization_config=bnb_config,
)
tokenizer = AutoTokenizer.from_pretrained("model_name")

# Configure OFT
peft_config = OFTConfig(
    oft_block_size=32,
    use_cayley_neumann=True,
    target_modules="all-linear",
    bias="none",
    task_type="CAUSAL_LM",
)

# ds, training_arguments, and collator are assumed to be defined beforehand
trainer = SFTTrainer(
    model=model,
    train_dataset=ds["train"],
    peft_config=peft_config,
    tokenizer=tokenizer,
    args=training_arguments,
    data_collator=collator,
)
trainer.train()
```
OFT performs finetuning by learning an orthogonal matrix to directly transform the weight matrix, which naturally leads to a weight-centric implementation of the forward pass:
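\[
\mathbf{y} \;=\; \mathbf{W}_{\text{OFT}}^\top \mathbf{x} \;=\; \big(\mathbf{W}_0^\top \mathbf{R}^\top\big)\,\mathbf{x}
\]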
The original OFT first performs a weight transform by computing \(\mathbf{W}_{\text{OFT}}^\top=\mathbf{W}_0^\top\mathbf{R}^\top\) (i.e., a matrix-matrix multiplication) and then computes the results of a linear layer with the equivalent weight matrix \(\mathbf{W}_{\text{OFT}}^\top\) (i.e., a matrix-vector multiplication). This incurs \(\mathcal{O}(nd^2)\) complexity due to the matrix-matrix multiplication. Inspired by matrix-free methods for solving linear systems, we observe that OFT's forward pass can be interpreted as two linear maps applied to the input. This leads to an input-centric implementation:
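\[
\mathbf{y} \;=\; \mathbf{W}_0^\top \big(\mathbf{R}^\top \mathbf{x}\big)
\]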
where only two matrix-vector multiplications are required, reducing the complexity from cubic to quadratic: \(\mathcal{O}(nd + d^2)\). This simple conceptual shift in the implementation yields a substantial speed-up in training time and a reduction in GPU memory usage.
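To make the contrast concrete, below is a minimal PyTorch sketch of the two mathematically equivalent forward passes (dimensions, variable names, and the random orthogonal \(\mathbf{R}\) are illustrative assumptions, not the actual PEFT implementation):

```python
import torch

d, n, batch = 256, 512, 8
W0 = torch.randn(d, n)                      # frozen pretrained weight
R, _ = torch.linalg.qr(torch.randn(d, d))   # a random orthogonal matrix for the demo
x = torch.randn(batch, d)                   # input activations (row vectors)

# Weight-centric (original OFT): materialize the transformed weight first.
# The matrix-matrix product R @ W0 costs O(n * d^2).
W_oft = R @ W0
y_weight_centric = x @ W_oft

# Input-centric (OFTv2): rotate the input, then apply the frozen weight.
# Two matrix-vector products per sample cost O(d^2 + n * d).
y_input_centric = (x @ R) @ W0

assert torch.allclose(y_weight_centric, y_input_centric, atol=1e-3)
```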
The Cayley parameterization constructs an orthogonal matrix \(\mathbf{R}\) as \(\mathbf{R} = (\mathbf{I} + \mathbf{Q})(\mathbf{I} - \mathbf{Q})^{-1}\), where \(\mathbf{Q}\) is a skew-symmetric matrix. One limitation of this formulation is that it only generates rotation matrices, though empirical studies suggest that this restriction does not negatively affect performance. More critically, computing a matrix inverse introduces numerical instability and additional computational overhead, making it challenging to scale to large orthogonal matrices. To avoid numerical instability, we replace the matrix inverse with a truncated Neumann series:
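\[
\mathbf{R} \;=\; (\mathbf{I} + \mathbf{Q})(\mathbf{I} - \mathbf{Q})^{-1} \;\approx\; (\mathbf{I} + \mathbf{Q})\Big(\mathbf{I} + \sum_{i=1}^{k} \mathbf{Q}^{i}\Big)
\]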
where larger \(k\) leads to better approximation. Removing the matrix inversion improves training stability. The Neumann series approximation converges in the operator norm if \(\|\mathbf{Q}\|<1\). This condition is naturally satisfied in practice: to start from the pretrained model, OFT initializes the orthogonal matrix \(\mathbf{R}\) as the identity, which requires \(\mathbf{Q}\) to start as a zero matrix. Since finetuning begins with a small learning rate and typically involves relatively few steps, \(\mathbf{Q}\) tends not to drift far from zero. Empirically, even if \(\|\mathbf{Q}\|\) slightly exceeds \(1\), it does not harm OFT's training stability, as we use only a finite number of Neumann terms.
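As a rough illustration, the following sketch builds \(\mathbf{R}\) with the truncated Neumann series and compares it against the exact Cayley transform; the function name, the number of terms, and the test values are assumptions made for the example:

```python
import torch

def cayley_neumann(Q: torch.Tensor, num_terms: int = 5) -> torch.Tensor:
    """Approximate R = (I + Q)(I - Q)^{-1} with a truncated Neumann series.

    Q is skew-symmetric; (I - Q)^{-1} = I + Q + Q^2 + ... converges for ||Q|| < 1.
    """
    n = Q.shape[0]
    I = torch.eye(n, dtype=Q.dtype, device=Q.device)
    series, power = I.clone(), I.clone()
    for _ in range(num_terms):          # accumulate I + Q + Q^2 + ... + Q^num_terms
        power = power @ Q
        series = series + power
    return (I + Q) @ series

# Sanity check against the exact Cayley transform on a small, well-conditioned Q.
n = 64
A = 0.01 * torch.randn(n, n)
Q = A - A.T                             # skew-symmetric with small norm (||Q|| < 1)
R_exact = (torch.eye(n) + Q) @ torch.linalg.inv(torch.eye(n) - Q)
R_approx = cayley_neumann(Q, num_terms=5)
print((R_exact - R_approx).abs().max())  # small approximation error
```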
CUDA kernel for skew-symmetric matrices. To maximize GPU memory efficiency, we leverage the skew-symmetric structure of \(\mathbf{Q}\in\mathbb{R}^{n\times n}\), where \(Q_{ii} = 0\), \(Q_{ij} = -Q_{ji}\). By storing only the upper triangular part as a vector, we reduce the storage requirement from \(n^2\) to \(\frac{n (n - 1)}{2}\). During the forward pass, \(\mathbf{Q}\) is reconstructed on-the-fly using a highly optimized custom CUDA kernel that significantly accelerates this process.
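A plain-PyTorch stand-in for this reconstruction step (not the actual CUDA kernel; the function name and layout are assumptions) might look as follows:

```python
import torch

def reconstruct_skew(upper: torch.Tensor, n: int) -> torch.Tensor:
    """Rebuild the full skew-symmetric Q from its n*(n-1)/2 upper-triangular entries."""
    Q = torch.zeros(n, n, dtype=upper.dtype, device=upper.device)
    rows, cols = torch.triu_indices(n, n, offset=1)
    Q[rows, cols] = upper          # strict upper triangle
    Q[cols, rows] = -upper         # mirror with a sign flip: Q_ij = -Q_ji, Q_ii = 0
    return Q

n = 8
upper = torch.randn(n * (n - 1) // 2)   # only n(n-1)/2 trainable values are stored
Q = reconstruct_skew(upper, n)
assert torch.allclose(Q, -Q.T)          # skew-symmetry holds, diagonal is zero
```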
While PEFT methods primarily aim to reduce optimizer memory by minimizing trainable parameters, the growing scale of foundation models has shifted the memory bottleneck to the pretrained weights themselves. As model dimensions grow, these frozen parameters increasingly dominate memory consumption during training. To address this emerging challenge, we argue that truly scalable OFT must operate directly on quantized model representations, such as NormalFloat4 and AWQ. This represents a critical shift that enables OFT to scale effectively.
To this end, we introduce QOFT, a natural extension of OFTv2 for quantized foundation models. QOFT largely follows the framework of QLoRA. Specifically, the quantized low-bit weight matrices are first dequantized to higher precision, after which the parameter-efficient adaptation is carried out in the higher-precision space. Formally, the forward pass of QOFT can be written as:
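\[
\mathbf{y} \;=\; \text{Dequant}(\mathbf{W}_0)^\top \big(\mathbf{R}^\top \mathbf{x}\big)
\]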
The update of OFTv2's orthogonal matrix \(\mathbf{R}\) is performed in high precision (e.g., BF16). We denote the dequantization function as \(\text{Dequant}(\cdot)\) and follow QLoRA's design by adopting a double quantization strategy, where the quantization parameters of the weight matrices are themselves quantized to further reduce GPU memory usage.
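Conceptually, the QOFT forward pass can be sketched as below; `dequant_fn` and `quant_state` are hypothetical placeholders standing in for the NF4 dequantization routine and its (doubly quantized) parameters, and the function is a sketch rather than the actual PEFT/bitsandbytes implementation:

```python
import torch

def qoft_forward(x: torch.Tensor, W_quant, quant_state, R: torch.Tensor, dequant_fn):
    """Sketch of the QOFT forward pass: dequantize the frozen weight, then apply
    the input-centric OFTv2 computation. R stays in high precision (e.g., bf16)."""
    W0 = dequant_fn(W_quant, quant_state)   # low-bit weight -> higher precision
    return (x @ R) @ W0                     # two matrix-vector products per sample
```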
We perform supervised finetuning on the Hugging Face OpenR1-Math-220k dataset, a large-scale mathematical reasoning corpus containing challenging problems, each with two to four reasoning traces distilled from DeepSeek R1. Following the evaluation protocol of Qwen2.5-Math, we report pass@1 performance on established math benchmarks. Finetuning is performed only on NormalFloat 4-bit quantized base models due to the substantial memory required by the large context window (16384 tokens) needed for training on a reasoning dataset. The baseline type refers to the pre-trained Qwen2.5 models without any continual training. We observe that QOFT consistently outperforms both QLoRA and the baseline models across all evaluated scales and tasks, despite using significantly fewer trainable parameters. The results are reported below:
| Model | Type | # Params | AMC23 | AQUA | CMATH | GaoKao 2023 En | Minerva Math | OlympiadBench | SAT Math |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-1.5B-it | baseline | - | 17.5 | 49.2 | 65.2 | 36.4 | 9.6 | 12.0 | 59.4 |
| | QLoRA | 18.46M | 15.0 | 42.5 | 61.5 | 29.6 | 8.1 | 8.9 | 59.4 |
| | QOFT | 7.89M | 27.5 | 53.1 | 68.5 | 41.0 | 11.8 | 14.4 | 81.2 |
| Qwen2.5-1.5B | baseline | - | 0.0 | 18.9 | 4.0 | 4.2 | 2.6 | 2.4 | 28.1 |
| | QLoRA | 18.46M | 15.0 | 37.4 | 64.2 | 26.8 | 8.5 | 6.8 | 62.5 |
| | QOFT | 7.89M | 22.5 | 53.1 | 56.3 | 36.1 | 8.5 | 12.7 | 87.5 |
| Qwen2.5-7B-it | baseline | - | 50.0 | 16.5 | 89.3 | 61.8 | 33.5 | 36.6 | 53.1 |
| | QLoRA | 40.37M | 30.0 | 48.0 | 88.8 | 50.1 | 25.4 | 19.7 | 68.8 |
| | QOFT | 17.55M | 52.5 | 70.9 | 90.5 | 63.6 | 33.5 | 37.6 | 96.9 |
| Qwen2.5-7B | baseline | - | 25.0 | 55.1 | 61.2 | 42.9 | 11.8 | 29.9 | 71.9 |
| | QLoRA | 40.37M | 35.0 | 48.8 | 73.7 | 49.9 | 18.8 | 18.5 | 62.5 |
| | QOFT | 17.55M | 52.5 | 59.4 | 80.7 | 55.6 | 21.7 | 34.7 | 87.5 |
| Qwen2.5-32B-it | baseline | - | 62.5 | 18.5 | 92.5 | 70.1 | 41.5 | 44.4 | 65.6 |
| | QLoRA | 134.22M | 62.5 | 71.7 | 94.0 | 71.2 | 39.7 | 46.8 | 96.9 |
| | QOFT | 57.90M | 75.0 | 83.1 | 94.7 | 73.5 | 41.5 | 48.7 | 100.0 |
| Qwen2.5-32B | baseline | - | 35.0 | 23.2 | 35.7 | 46.8 | 20.2 | 25.2 | 62.5 |
| | QLoRA | 134.22M | 40.0 | 52.4 | 90.5 | 61.0 | 32.0 | 29.8 | 65.6 |
| | QOFT | 57.90M | 70.0 | 68.5 | 90.7 | 71.4 | 36.0 | 44.9 | 93.8 |
Qualitative results from Dreambooth finetuning of Stable Diffusion 3.5 Large (8.1B parameters), with peak allocated GPU memory: LoRA (52.33 GB), OFT (52.32 GB), QLoRA (41.60 GB) and QOFT (41.53 GB).
Large text-to-image diffusion models have impressive capabilities in generating photorealistic images from text prompts. How to effectively guide or control these powerful models to perform different downstream tasks becomes an important open problem. To tackle this challenge, we introduce a principled finetuning method, Orthogonal Finetuning (OFT), for adapting text-to-image diffusion models to downstream tasks. Unlike existing methods, OFT can provably preserve hyperspherical energy which characterizes the pairwise neuron relationship on the unit hypersphere. We find that this property is crucial for preserving the semantic generation ability of text-to-image diffusion models. To improve finetuning stability, we further propose Constrained Orthogonal Finetuning (COFT) which imposes an additional radius constraint to the hypersphere. Specifically, we consider two important finetuning text-to-image tasks: subject-driven generation where the goal is to generate subject-specific images given a few images of a subject and a text prompt, and controllable generation where the goal is to enable the model to take in additional control signals. We empirically show that our OFT framework outperforms existing methods in generation quality and convergence speed.
Large foundation models are becoming ubiquitous, but training them from scratch is prohibitively expensive. Thus, efficiently adapting these powerful models to downstream tasks is increasingly important. In this paper, we study a principled finetuning paradigm, Orthogonal Finetuning (OFT), for downstream task adaptation. Despite demonstrating good generalizability, OFT still uses a fairly large number of trainable parameters due to the high dimensionality of orthogonal matrices. To address this, we start by examining OFT from an information transmission perspective, and then identify a few key desiderata that enable better parameter-efficiency. Inspired by how the Cooley-Tukey fast Fourier transform algorithm enables efficient information transmission, we propose an efficient orthogonal parameterization using butterfly structures. We apply this parameterization to OFT, creating a novel parameter-efficient finetuning method, called Orthogonal Butterfly (BOFT). By subsuming OFT as a special case, BOFT introduces a generalized orthogonal finetuning framework. Finally, we conduct an extensive empirical study of adapting large vision transformers, large language models, and text-to-image diffusion models to various downstream tasks in vision and language.
@misc{qiu2025orthogonalfinetuningscalable,
title={Orthogonal Finetuning Made Scalable},
author={Zeju Qiu and Weiyang Liu and Adrian Weller and Bernhard Schölkopf},
year={2025},
eprint={2506.19847},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2506.19847},
}
@misc{qiu2024controllingtexttoimagediffusionorthogonal,
title={Controlling Text-to-Image Diffusion by Orthogonal Finetuning},
author={Zeju Qiu and Weiyang Liu and Haiwen Feng and Yuxuan Xue and Yao Feng and Zhen Liu and Dan Zhang and Adrian Weller and Bernhard Schölkopf},
year={2024},
eprint={2306.07280},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2306.07280},
}
@misc{liu2024parameterefficientorthogonalfinetuningbutterfly,
title={Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization},
author={Weiyang Liu and Zeju Qiu and Yao Feng and Yuliang Xiu and Yuxuan Xue and Longhui Yu and Haiwen Feng and Zhen Liu and Juyeon Heo and Songyou Peng and Yandong Wen and Michael J. Black and Adrian Weller and Bernhard Schölkopf},
year={2024},
eprint={2311.06243},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2311.06243},
}