Preprint

PEFT-Arena: Understanding Parameter-Efficient Finetuning from a Stability-Plasticity Perspective

Evaluate PEFT through the joint lens of target adaptation, retained general ability, and update geometry.

1The Chinese University of Hong Kong 2Westlake University 3Max Planck Institute for Intelligent Systems, Tübingen
PEFT-Arena overview figure

PEFT-Arena combines benchmark evaluation, spectral retention-adaptation profiling, and interpolation-based trade-off analysis to study how PEFT methods balance plasticity and stability.

Paper Summary

Abstract

Parameter-efficient finetuning (PEFT) has become the standard approach for adapting large foundation models, yet existing evaluations focus almost exclusively on downstream accuracy while largely overlooking the preservation of pretrained capabilities. PEFT-Arena argues that PEFT should be evaluated through the lens of the stability-plasticity dilemma: the trade-off between task adaptation and resistance to forgetting.

The benchmark jointly measures target-domain performance on mathematical and medical reasoning together with retained general ability on BBH, IFEval, and NQ, under both supervised finetuning (SFT) and reinforcement learning with verifiable rewards (RLVR / GRPO). Across Qwen2.5-7B and Llama3.2-3B-Instruct, all PEFT methods exhibit distinct stability-plasticity trade-off patterns, with OFT typically defining the most favorable Pareto frontier at comparable parameter budgets.

Beyond benchmark-level metrics, the paper introduces a spectral retention-adaptation profiling framework that decomposes weight updates into retention and adaptation components and measures their smoothness. Smoother spectral profiles correlate with better retention. The paper further identifies an overshoot phenomenon in SFT and studies interpolation-based trade-off recovery, including the geometry-aware iOFT variant.

Driving Questions

Research Questions

RQ1

Which PEFT method achieves the best stability-plasticity trade-off?

Benchmark PEFT by target adaptation and retained general capability across math and medical reasoning, under both SFT and RLVR.

RQ2

What internal mechanisms govern forgetting and retention?

Analyze spectral geometry and interpolation trajectories to understand why some parameterizations stay stable while others overshoot.

Benchmark Design

Protocol

Target Domains

  • Math: MATH-500, AMC23, AIME24
  • Medical: a curated suite including MedMCQA, MedQA, PubMedQA, MMLU-Pro, GPQA (Medical), Lancet/NEJM/MedBullets-style problems, and MedXpertQA

General Retention

  • IFEval for instruction following
  • NQ for knowledge and understanding
  • BBH for broad reasoning retention

Models and Training

  • Backbones: Qwen2.5-7B and Llama3.2-3B-Instruct
  • Regimes: supervised finetuning and RLVR with GRPO
  • Target data: OpenR1-Math and MedThink

Compared Methods

  • Full FT
  • LoRA family: LoRA, AdaLoRA, DoRA, MiSS, VeRA, PiSSA, MiLoRA, KeepLoRA
  • Other PEFT: OFT and IA3

Main Findings

What PEFT-Arena Shows

SFT exposes a sharp trade-off

Full finetuning tends to deliver the strongest target gains, but it also incurs the largest forgetting. Target-only reporting systematically overestimates post-training quality.

OFT sits on the favorable frontier

At comparable parameter budgets, OFT consistently balances adaptation and retention better than additive low-rank baselines across the paper's main settings.

RLVR is more forgetting-resistant

Under on-policy RLVR with GRPO, models often achieve stable target improvements with substantially less general-capability erosion than under SFT.
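For reference, the group-relative advantage at the core of GRPO can be written in a few lines; this is the standard formulation (each rollout's verifiable reward is normalized against its group's statistics), with the clipping and KL terms of the full objective omitted.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantage: normalize each sampled completion's
    verifiable reward by the mean and standard deviation of its group
    (standard GRPO formulation; clipped surrogate and KL terms omitted)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# A group of 4 rollouts for one prompt, scored by a binary verifier:
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # ~[ 1., -1.,  1., -1.]
```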

Geometry matters

Smoother spectral retention and adaptation profiles correlate with better retention. Interpolation also reveals an SFT overshoot region and motivates geometry-aware trade-off recovery via iOFT.

Results

Representative Figures

Stability-Plasticity Benchmark

The benchmark reports target-domain performance together with retained general capability, rather than reducing post-training quality to a single downstream score. The strongest PEFT methods are those that move toward the upper-right frontier rather than simply maximizing target performance.
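Given (target, retention) score pairs per method, the upper-right frontier is the standard Pareto set. A minimal sketch of that computation follows; the method names are real, but the scores are illustrative placeholders, not results from the paper.

```python
def pareto_front(points):
    """Return the names of (target, retention) points not dominated by any
    other point, where higher is better on both axes."""
    front = []
    for name, t, r in points:
        dominated = any(t2 >= t and r2 >= r and (t2, r2) != (t, r)
                        for _, t2, r2 in points)
        if not dominated:
            front.append(name)
    return front

# Illustrative scores only (target, retention) -- not the paper's numbers:
scores = [("Full FT", 0.82, 0.55), ("LoRA", 0.74, 0.66),
          ("OFT",     0.78, 0.72), ("IA3",  0.60, 0.70)]
print(pareto_front(scores))  # ['Full FT', 'OFT']
```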

Trade-off curves for PEFT interpolation

Spectral Retention-Adaptation Profiling

The paper decomposes updates into retention and adaptation views and quantifies local fluctuation across singular directions. OFT tends to remain in a smoother and more coherent spectral regime than the spikier regimes produced by additive low-rank updates.
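The idea of profiling an update along singular directions can be sketched as follows. This is an illustrative proxy, not the paper's exact metric: the update is projected onto the pretrained weight's singular directions, and smoothness is measured as normalized local fluctuation of that profile.

```python
import numpy as np

def spectral_profile(W0, W_ft):
    """Project the weight update onto the singular directions of the
    pretrained weight (illustrative proxy for the paper's profiling)."""
    U, s, Vt = np.linalg.svd(W0, full_matrices=False)
    dW = W_ft - W0
    return np.array([abs(U[:, k] @ dW @ Vt[k]) for k in range(len(s))])

def fluctuation(profile):
    """Normalized local fluctuation across singular directions
    (lower = smoother profile)."""
    return np.abs(np.diff(profile)).mean() / (profile.mean() + 1e-12)

rng = np.random.default_rng(0)
W0 = rng.standard_normal((8, 8))
U, s, Vt = np.linalg.svd(W0)

smooth_update = 0.1 * W0                         # energy spread like the spectrum
spiky_update = s[0] * np.outer(U[:, 0], Vt[0])   # all energy on one direction

f_smooth = fluctuation(spectral_profile(W0, W0 + smooth_update))
f_spiky = fluctuation(spectral_profile(W0, W0 + spiky_update))
print(f_spiky > f_smooth)  # True: the concentrated update has a rougher profile
```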

Spectral analysis figure

Interpolation and Overshoot

Optimization trajectories and interpolation trajectories are misaligned: final SFT checkpoints often overshoot the best trade-off point along the interpolation path, motivating interpolation-based recovery and the geometry-aware iOFT variant.
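Overshoot can be made concrete with a linear interpolation sweep between the pretrained weights (alpha = 0) and the finetuned weights (alpha = 1). The score curves below are synthetic, shaped only to mimic the qualitative pattern (target gains saturate while retention keeps eroding); they are not the paper's numbers.

```python
import numpy as np

# Synthetic score curves (illustrative, not measured):
def target_score(alpha):     # target-domain score of the alpha-interpolated model
    return 1.0 - (1.0 - alpha) ** 2

def retention_score(alpha):  # retained general capability
    return 1.0 - alpha ** 2

# Sweep the interpolation path and pick the best joint trade-off along it.
alphas = np.linspace(0.0, 1.0, 101)
tradeoff = target_score(alphas) + retention_score(alphas)
best_alpha = alphas[np.argmax(tradeoff)]
print(best_alpha)  # ~0.5: the final checkpoint (alpha = 1) overshoots this point
```

In practice each alpha would require evaluating an actual interpolated checkpoint; the sweep structure is the same.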

Optimization versus interpolation figure

Artifacts

Paper & Poster

Paper (coming soon)
PEFT-Arena paper first page preview
Poster
PEFT-Arena poster preview

Official Release

Code and Resources

What the Repository Includes

  • SFT and RLVR training entrypoints through a unified run.py interface
  • Math, medical, and general-retention evaluation pipelines
  • Spectral analysis and plotting code used for the paper's figures
  • Experiment checkpoints

Quick Commands

bash setup_env.sh

python run.py train sft ...
python run.py train rl ...
python run.py eval ...
python run.py merge ...

Take-home message: PEFT should be evaluated and designed through the joint lens of plasticity, retention, and geometry.

Reference

Citation

@misc{huang2026peftarena,
  title={PEFT-Arena: Understanding Parameter-Efficient Finetuning from a Stability-Plasticity Perspective},
  author={Yangyi Huang and Ruotian Peng and Zeju Qiu and Jiale Kang and Yandong Wen and Bernhard Sch{\"o}lkopf and Weiyang Liu},
  year={2026},
  url={https://spherelab.ai/PEFT-Arena}
}