Evaluate PEFT through the joint lens of target adaptation, retained general ability, and update geometry.
PEFT-Arena combines benchmark evaluation, spectral retention-adaptation profiling, and interpolation-based trade-off analysis to study how PEFT methods balance plasticity and stability.
Paper Summary
Parameter-efficient finetuning (PEFT) has become the standard approach for adapting large foundation models, yet existing evaluations focus almost exclusively on downstream accuracy while largely overlooking the preservation of pretrained capabilities. PEFT-Arena argues that PEFT should be evaluated through the lens of the stability-plasticity dilemma: the trade-off between task adaptation and resistance to forgetting.
The benchmark jointly measures target-domain performance on mathematical and medical reasoning together with retained general ability on BBH, IFEval, and NQ, under both supervised finetuning (SFT) and reinforcement learning with verifiable rewards (RLVR / GRPO). Across Qwen2.5-7B and Llama3.2-3B-Instruct, all PEFT methods exhibit distinct stability-plasticity trade-off patterns, with OFT typically defining the most favorable Pareto frontier at comparable parameter budgets.
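Since the RLVR results hinge on GRPO's update rule, here is a minimal sketch of the group-relative advantage normalization GRPO is built on; the function name, tensor shapes, and `eps` constant are illustrative assumptions, not the repo's API.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages in the spirit of GRPO: each sampled
    completion's verifiable reward is normalized against the other
    completions drawn for the same prompt, so no learned value model
    is needed.

    rewards: (num_prompts, group_size) scalar rewards from the verifier.
    """
    mean = rewards.mean(dim=-1, keepdim=True)  # per-prompt group mean
    std = rewards.std(dim=-1, keepdim=True)    # per-prompt group spread
    return (rewards - mean) / (std + eps)      # standardized advantages
```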
Beyond benchmark-level metrics, the paper introduces a spectral retention-adaptation profiling framework that decomposes weight updates into retention and adaptation components and measures their smoothness. Smoother spectral profiles correlate with better retention. The paper further identifies an overshoot phenomenon in SFT and studies interpolation-based trade-off recovery, including the geometry-aware iOFT variant.
Driving Questions
How do PEFT methods trade off target adaptation against retained general capability across math and medical reasoning, under both SFT and RLVR?
Why do some parameterizations stay stable while others overshoot, as seen through spectral geometry and interpolation trajectories?
Main Findings
Full finetuning tends to deliver the strongest target gains, but it also incurs the largest forgetting. Target-only reporting systematically overestimates post-training quality.
At comparable parameter budgets, OFT consistently balances adaptation and retention better than additive low-rank baselines across the paper's main settings (a minimal sketch of the orthogonal parameterization appears after this list).
Under on-policy RLVR with GRPO, models often achieve stable target improvements with substantially less general-capability erosion than under SFT.
Smoother spectral retention and adaptation profiles correlate with better retention. Interpolation also reveals an SFT overshoot region and motivates geometry-aware trade-off recovery via iOFT.
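The OFT finding above refers to multiplicative orthogonal updates. As a rough illustration, the sketch below wraps a frozen linear layer with a Cayley-parameterized rotation; it uses a full square matrix for brevity, whereas practical OFT implementations typically use block-diagonal structure for parameter efficiency, and the class name is hypothetical.

```python
import torch
import torch.nn as nn

class OFTLinear(nn.Module):
    """Minimal sketch of orthogonal finetuning: the pretrained weight is
    rotated by a learned orthogonal matrix R, parameterized via the
    Cayley transform of a skew-symmetric matrix. Illustrative only."""

    def __init__(self, base: nn.Linear):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # pretrained weights stay frozen
        d = base.out_features
        self.skew = nn.Parameter(torch.zeros(d, d))  # Q = 0 => R = I at init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.skew - self.skew.T               # enforce skew-symmetry
        eye = torch.eye(q.shape[0], device=q.device, dtype=q.dtype)
        r = torch.linalg.solve(eye + q, eye - q)  # Cayley: R = (I+Q)^{-1}(I-Q)
        w = r @ self.base.weight                  # multiplicative update to W
        return nn.functional.linear(x, w, self.base.bias)
```

Because R is orthogonal, pairwise inner products among the pretrained weight's columns are preserved, which is one intuition for why multiplicative updates can retain more pretrained structure than additive low-rank deltas.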
Results
The benchmark reports target-domain performance together with retained general capability, rather than reducing post-training quality to a single downstream score. The strongest PEFT methods are those that move toward the upper-right frontier rather than simply maximizing target performance.
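The upper-right-frontier framing reduces to a Pareto dominance check over (target, retention) score pairs; the small sketch below uses made-up scores purely for illustration.

```python
def pareto_frontier(scores: dict[str, tuple[float, float]]) -> list[str]:
    """Methods not dominated on (target_score, retained_general_score).
    A method is dominated if another is at least as good on both axes
    and strictly better on one. Scores are hypothetical."""
    frontier = []
    for name, (t, r) in scores.items():
        dominated = any(
            t2 >= t and r2 >= r and (t2 > t or r2 > r)
            for other, (t2, r2) in scores.items() if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical numbers: full FT adapts most but forgets most; OFT balances both.
print(pareto_frontier({"full_ft": (62.0, 41.0), "lora": (55.0, 52.0), "oft": (58.0, 56.0)}))
# -> ['full_ft', 'oft']
```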
The paper decomposes updates into retention and adaptation views and quantifies local fluctuation across singular directions. OFT tends to remain in a smoother and more coherent spectral regime than spikier additive updates.
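The paper's exact decomposition and smoothness statistic are not reproduced here, but the sketch below shows one plausible reading under stated assumptions: split the update by projecting onto the pretrained weight's top singular directions (a "retention view") versus the orthogonal remainder (an "adaptation view"), and score spikiness as the mean absolute step between adjacent normalized singular values.

```python
import torch

def retention_adaptation_split(w_base: torch.Tensor, w_tuned: torch.Tensor, k: int = 32):
    """Assumed decomposition (not the paper's exact definition): the part of
    the update lying in the span of the base weight's top-k left singular
    vectors vs. the orthogonal remainder. k is an arbitrary choice here."""
    delta = (w_tuned - w_base).float()
    u, _, _ = torch.linalg.svd(w_base.float(), full_matrices=False)
    proj = u[:, :k] @ u[:, :k].T           # projector onto the top-k subspace
    retention = proj @ delta               # update overlapping pretrained directions
    return retention, delta - retention    # remainder = adaptation view

def spectral_fluctuation(update: torch.Tensor) -> float:
    """Illustrative smoothness score: normalize the update's singular values
    and average the absolute first differences. Spikier spectra score higher."""
    s = torch.linalg.svdvals(update.float())
    s = s / (s.sum() + 1e-12)
    return (s[1:] - s[:-1]).abs().mean().item()
```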
Training trajectories and interpolation trajectories are misaligned. This reveals that final SFT checkpoints often overshoot the best trade-off point, and motivates interpolation-based recovery and iOFT.
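Interpolation-based recovery amounts to sweeping alpha in W(alpha) = (1 - alpha) * W_base + alpha * W_tuned and evaluating each point on both target and general benchmarks; if the best trade-off sits at alpha < 1, the final checkpoint overshot. A minimal sketch follows, assuming plain linear interpolation over full-model weights (the geometry-aware iOFT variant is not reproduced here, and PEFT adapters would presumably first be merged into the base weights, cf. `python run.py merge ...`).

```python
import torch

def interpolate_state_dicts(base_sd: dict, tuned_sd: dict, alpha: float) -> dict:
    """W(alpha) = (1 - alpha) * W_base + alpha * W_tuned for every tensor.
    Assumes both checkpoints share keys and shapes (full-model weights)."""
    return {
        name: torch.lerp(w_base.float(), tuned_sd[name].float(), alpha)
        for name, w_base in base_sd.items()
    }

# Sweep the trade-off curve, evaluating each merged model on target + general evals:
# for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
#     model.load_state_dict(interpolate_state_dicts(base_sd, tuned_sd, alpha))
```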
Artifacts
Official Release
run.py interface

```bash
bash setup_env.sh
python run.py train sft ...
python run.py train rl ...
python run.py eval ...
python run.py merge ...
```
Reference
@misc{huang2026peftarena,
title={PEFT-Arena: Understanding Parameter-Efficient Finetuning from a Stability-Plasticity Perspective},
author={Yangyi Huang and Ruotian Peng and Zeju Qiu and Jiale Kang and Yandong Wen and Bernhard Sch{\"o}lkopf and Weiyang Liu},
year={2026},
url={https://spherelab.ai/PEFT-Arena}
}