Model Merging with Functional Dual Anchors

Kexuan Shi1, Yandong Wen2, Weiyang Liu1

1The Chinese University of Hong Kong    2Westlake University

Figure 1: FDA illustration.

FDA: A New Framework for Model Merging

Model merging has been an intriguing post-training strategy for integrating knowledge from existing checkpoints that share a common foundation model. Existing methods operate in the parameter space (i.e., on task vectors) and therefore suffer from its complexity. To explore alternative ways of utilizing this knowledge, we propose Functional Dual Anchors (FDAs), a framework (Figure 1) that instead models the knowledge in the input-representation space. Specifically, FDAs are synthetic inputs whose induced gradients align with the task vectors, capturing task-specific functional shifts relative to the pretrained model. We then use the FDAs to adapt the pretrained model. FDAs provide an alternative perspective on model merging by extending input-space modeling to this setting and bridging joint multi-task training and post-hoc merging.

Intuitive Understanding and Motivation of FDA

Figure 2: FDA loss landscape.

To gain an intuitive understanding of FDAs, we compare their optimization trajectories with those of task arithmetic in Figure 2. We treat the obtained FDAs as finetuning data and optimize the model parameters accordingly. As shown in the right figure, optimizing with FDAs moves the model closer to the local minima of the loss landscape (computed over eight downstream datasets). While task vectors provide useful guidance from the pretrained model, they quickly drift away from the loss basin, whereas FDAs consistently guide optimization toward more favorable regions. Moreover, by capturing functional shifts in the input space, FDAs offer greater robustness for model merging. Unlike task vectors, which are sensitive to initialization and can drift under different starting points, FDAs exhibit robustness to such variations, facilitating more reliable model merging.

Another motivation behind FDAs is that modeling the input space is generally easier than modeling the parameter space, as the input space tends to be more structured. The effectiveness of modeling the input space for knowledge transfer has been extensively explored and empirically validated in the context of dataset distillation (Wang et al., 2018b; Cazenavette et al., 2022), iterative teaching (Liu et al., 2017a; Qiu et al., 2023), dataset condensation (Zhao et al., 2021; Zhao & Bilen, 2023) and continual learning (Shin et al., 2017; Yu et al., 2023).

FDAs Provide More Flexible and Robust Merging Direction

| Method | CoLA | SST-2 | MRPC | STS-B | QQP | MNLI | QNLI | RTE | Avg | Δ |
|---|---|---|---|---|---|---|---|---|---|---|
| Pretrained | 0.1679 | 0.4897 | 0.7480 | -0.0471 | 0.3159 | 0.3545 | 0.5054 | 0.4693 | 0.3754 | - |
| Individual | 0.6335 | 0.9001 | 0.9224 | 0.9418 | 0.9055 | 0.8267 | 0.9507 | 0.9222 | 0.8754 | - |
| TA | 0.1635 | 0.8716 | 0.7480 | 0.6603 | 0.3159 | 0.6101 | 0.8716 | 0.7366 | 0.5918 | - |
| TSV | 0.4791 | 0.9323 | 0.7459 | 0.6660 | 0.3300 | 0.6750 | 0.7761 | 0.6751 | 0.6599 | - |
| WUDI | 0.4201 | 0.9232 | 0.7487 | 0.7345 | 0.5393 | 0.6430 | 0.5746 | 0.5740 | 0.6447 | - |
| FDAs (Pretrained, Gauss) | 0.3198 | 0.8463 | 0.7790 | 0.6828 | 0.7423 | 0.5605 | 0.6021 | 0.7726 | 0.6632 | +0.2878 |
| FDAs (Pretrained, Weight) | 0.3883 | 0.8911 | 0.7858 | 0.7230 | 0.7410 | 0.5791 | 0.6207 | 0.7329 | 0.6827 | +0.3073 |
| FDAs (TA, Gauss) | 0.4043 | 0.9461 | 0.7692 | 0.7897 | 0.6916 | 0.7190 | 0.7487 | 0.7076 | 0.7220 | +0.1302 |
| FDAs (TA, Weight) | 0.4511 | 0.9404 | 0.7578 | 0.7926 | 0.6518 | 0.7411 | 0.6965 | 0.7148 | 0.7183 | +0.1265 |
| FDAs (TSV, Gauss) | 0.5036 | 0.9438 | 0.7521 | 0.7975 | 0.4128 | 0.7075 | 0.8477 | 0.7365 | 0.7127 | +0.0528 |
| FDAs (TSV, Weight) | 0.5021 | 0.9427 | 0.7490 | 0.7418 | 0.5062 | 0.7292 | 0.8146 | 0.7365 | 0.7153 | +0.0554 |
| FDAs (WUDI, Gauss) | 0.4841 | 0.9404 | 0.7647 | 0.7645 | 0.6778 | 0.7004 | 0.5911 | 0.6643 | 0.6984 | +0.0537 |
| FDAs (WUDI, Weight) | 0.4848 | 0.9392 | 0.7573 | 0.7546 | 0.6979 | 0.7072 | 0.5656 | 0.6643 | 0.6964 | +0.0517 |

Performance of merging RoBERTa-Large models across eight NLU tasks (excerpt; the full table in the paper additionally includes a section of methods that use task-specific data, from RegMean to ProDistill, and a section of data-free methods). “FDAs (init model, FDA init)” denotes the choice of the initial model and the initialization strategy for FDAs, respectively. “Δ” denotes the performance improvement over the initial model.

| Method | SUN397 | Cars | RESISC45 | EuroSAT | SVHN | GTSRB | MNIST | DTD | Avg | Δ |
|---|---|---|---|---|---|---|---|---|---|---|
| Pretrained | 63.80 | 64.60 | 65.70 | 54.50 | 52.00 | 43.30 | 51.70 | 45.10 | 55.00 | - |
| Individual | 78.56 | 87.08 | 96.92 | 99.78 | 97.86 | 99.17 | 99.76 | 82.07 | 92.65 | - |
| TA | 62.07 | 66.14 | 74.00 | 76.48 | 88.02 | 73.79 | 98.52 | 52.50 | 73.94 | - |
| TSV | 72.83 | 80.20 | 88.97 | 97.22 | 93.93 | 93.94 | 99.27 | 72.66 | 87.38 | - |
| WUDI | 75.40 | 81.71 | 90.14 | 98.52 | 95.30 | 96.55 | 99.44 | 73.78 | 88.85 | - |
| FDA (Pretrained, Gauss) | 72.54 | 80.62 | 87.75 | 98.44 | 94.31 | 93.43 | 99.38 | 70.11 | 87.07 | +32.07 |
| FDA (Pretrained, Weight) | 73.60 | 80.48 | 88.00 | 98.26 | 94.35 | 93.41 | 99.31 | 70.64 | 87.26 | +32.26 |
| FDA (TA, Gauss) | 73.72 | 81.42 | 88.63 | 98.37 | 94.61 | 94.44 | 99.39 | 71.54 | 87.77 | +13.83 |
| FDA (TA, Weight) | 74.53 | 81.25 | 88.37 | 98.37 | 94.55 | 94.28 | 99.34 | 71.65 | 87.79 | +13.85 |
| FDA (TSV, Gauss) | 74.79 | 82.65 | 89.75 | 98.37 | 94.25 | 94.47 | 99.40 | 73.67 | 88.42 | +1.04 |
| FDA (TSV, Weight) | 74.93 | 81.92 | 89.79 | 98.33 | 94.10 | 93.78 | 99.36 | 73.78 | 88.25 | +0.87 |
| FDA (WUDI, Gauss) | 76.21 | 82.84 | 91.03 | 98.93 | 94.58 | 96.32 | 99.40 | 74.52 | 89.23 | +0.38 |
| FDA (WUDI, Weight) | 76.15 | 82.75 | 91.21 | 98.89 | 94.49 | 96.24 | 99.39 | 74.41 | 89.19 | +0.34 |

Performance of merging ViT-B-16 models across eight downstream vision tasks (excerpt; the full table in the paper additionally includes a section of methods that use task-specific data, from RegMean to ProDistill, and a section of data-free methods). “FDA (init model, FDA init)” denotes the choice of the initial model and the initialization strategy for FDAs, respectively. “Δ” denotes the performance improvement over the initial model.

To validate the effectiveness of the merging direction provided by FDAs, we use FDAs to adapt the pretrained model and compare its multi-task performance with that of the dual framework (i.e., task arithmetic, TA). To show robustness, we also initialize the merged model with merged parameters derived from data-free task-vector-based methods. We consider three such methods: TA (Ilharco et al., 2022), TSV (Gargiulo et al., 2025), and WUDI (Cheng et al., 2025). The first is the classical method, while the latter two are current state-of-the-art methods. We present partial results in the tables above; more results can be found in our paper. From the results, we make two key observations:

Observation 1 – FDAs can effectively leverage existing task-specific knowledge for multi-task model merging: Compared with its dual framework TA, FDAs bring a significant improvement: the multi-task performance of the pretrained model adapted by FDAs reaches 87.26, versus 73.94 for TA, a relative improvement of nearly 18%; meanwhile, the average GLUE score improves by 15.4% (relative).

Observation 2 – Flexible knowledge modeling: Although FDAs and data-free parameter-centric methods leverage the same task-specific knowledge, FDAs still improve the models merged by these methods. The average improvement from applying FDAs on top of TA, TSV, and WUDI is nearly 5.10% on ViT-B/16 and about 13% on RoBERTa-Large.

A Practical Algorithm for FDAs

The practical algorithm for FDAs involves two main stages: constructing FDAs and adapting with FDAs. In the first stage, FDAs are built for each downstream checkpoint; this stage can be viewed as projecting task-specific knowledge into the input-representation space. In the second stage, these FDAs are used to adapt the model, i.e., to integrate knowledge across multiple tasks.

1. Construction

Given the pretrained model $\varphi(\boldsymbol{\theta}_0)$ and the corresponding finetuned checkpoint $\varphi(\boldsymbol{\theta}_i)$ with task vector $\boldsymbol{\tau}_i = \boldsymbol{\theta}_i - \boldsymbol{\theta}_0$, we construct the FDAs $\{\boldsymbol{x}_{ij}\}_{j=1}^n$ for $\varphi(\boldsymbol{\theta}_i)$ by solving the following optimization problem:

$$ \min_{\mathbf{x}_{i1},\dots,\mathbf{x}_{in}} \mathrm{cos\_dist}\!\Bigg( \nabla_{\boldsymbol{\theta}}\!\sum_{j=1}^n \mathrm{Dist}\!\big(\varphi(\boldsymbol{\theta}, \mathbf{x}_{ij}), \varphi(\boldsymbol{\theta}_i, \mathbf{x}_{ij})\big) \Big|_{\boldsymbol{\theta}=\boldsymbol{\theta}_0}, \boldsymbol{\tau}_i \Bigg) $$

where $\mathrm{cos\_dist}(\mathbf{A},\mathbf{B}) = 1 - \frac{\mathrm{vec}(\mathbf{A})^\top \mathrm{vec}(\mathbf{B})} {\|\mathbf{A}\|_F \|\mathbf{B}\|_F}$, $\mathrm{vec}$ denotes the operation that vectorizes a matrix into a vector in row-major order, and $\mathrm{Dist}(\cdot)$ denotes a differentiable distance function measuring the representation discrepancy between $\varphi(\boldsymbol{\theta}_0)$ and $\varphi(\boldsymbol{\theta}_i)$.
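To make the construction step concrete, below is a minimal PyTorch-style sketch of this objective. It is only a sketch under several assumptions: the encoder maps $d$-dimensional inputs to representations, $\mathrm{Dist}$ is instantiated as squared error, the finetuned representations are treated as fixed targets when differentiating with respect to the anchors, and all names (`construct_fdas`, `model_0`, `model_i`, ...) are illustrative rather than the released implementation.

```python
import torch
import torch.nn.functional as F


def construct_fdas(model_0, model_i, n_anchors, input_dim, steps=500, lr=1e-2, sigma=0.1):
    """Optimize synthetic inputs (FDAs) so that the gradient they induce at the
    pretrained weights aligns with the task vector (cosine-distance objective)."""
    params_0 = list(model_0.parameters())
    params_i = list(model_i.parameters())
    # Task vector tau_i = theta_i - theta_0, flattened into a single vector.
    tau = torch.cat([(pi - p0).detach().flatten() for p0, pi in zip(params_0, params_i)])

    # Scaled-Gaussian initialization (one of the two schemes discussed below).
    x = (sigma * torch.randn(n_anchors, input_dim)).requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)

    for _ in range(steps):
        with torch.no_grad():
            target = model_i(x)  # phi(theta_i, x), treated as a fixed target (simplification)
        dist = F.mse_loss(model_0(x), target, reduction="sum")  # Dist(., .), here squared error
        # Gradient of Dist w.r.t. the pretrained weights, kept differentiable w.r.t. x.
        grads = torch.autograd.grad(dist, params_0, create_graph=True)
        g = torch.cat([gr.flatten() for gr in grads])
        loss = 1.0 - F.cosine_similarity(g, tau, dim=0)  # cos_dist(grad, tau_i)
        x_grad, = torch.autograd.grad(loss, x)
        opt.zero_grad()
        x.grad = x_grad
        opt.step()
    return x.detach()
```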

We adopt gradient-based iterative optimization to solve this problem. Since gradient-based methods are well known to be sensitive to initialization, we analyze the optimization dynamics of the anchors on a linear encoder and derive a principle for initialization.

Principle

An effective initialization strategy should limit the energy of the initialization point within the tail subspace spanned by the task vector.

Based on this principle, we derive two practical initialization schemes: linear weight sampling, $\boldsymbol{x}_{ij}=(\boldsymbol{W}_i)_{l_j,:}$, and scaled Gaussian sampling, $\boldsymbol{x}_{ij} = \sigma \cdot \tilde{\boldsymbol{x}}_{ij}$ with $\tilde{\boldsymbol{x}}_{ij} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{I}_d)$, where $\boldsymbol{W}_i$ denotes a weight matrix of the finetuned checkpoint and $l_j$ indexes its rows.
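The two schemes could look like the following sketch; the random row selection and the default value of $\sigma$ are assumptions for illustration, not necessarily the paper's exact choices.

```python
import torch


def init_fdas_from_weights(W_i: torch.Tensor, n_anchors: int) -> torch.Tensor:
    """Linear weight sampling: use rows of a weight matrix of the finetuned
    checkpoint as initial anchors, x_ij = (W_i)_{l_j,:}. Row indices l_j are
    chosen at random here (an assumption of this sketch)."""
    rows = torch.randperm(W_i.shape[0])[:n_anchors]
    return W_i[rows].clone()


def init_fdas_gaussian(n_anchors: int, input_dim: int, sigma: float = 0.1) -> torch.Tensor:
    """Scaled Gaussian sampling: x_ij = sigma * x~, with x~ ~ N(0, I_d); sigma
    controls the energy of the initialization (its value here is illustrative)."""
    return sigma * torch.randn(n_anchors, input_dim)
```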

2. Adaptation

The adaptation process with FDAs is the dual of the construction process above. When the merged model is initialized with the pretrained checkpoint, adaptation optimizes the following objective:

$$ \min_{\boldsymbol{\theta}} \sum_{i=1}^m \sum_{j=1}^{n} \mathrm{Dist}\!\Big( \varphi(\boldsymbol{\theta}, \mathbf{x}_{ij}), \varphi(\boldsymbol{\theta}_i, \mathbf{x}_{ij}) \Big), \quad \text{with } \boldsymbol{\theta} \text{ initialized at } \boldsymbol{\theta}_0. $$

When the merged model is initialized with the merged parameters from task-vector-based methods, adaptation instead refines the merged task vectors by optimizing:

$$ \min_{\{\phi_i(\boldsymbol{\tau}_i)\}_{i=1}^m} \sum_{i=1}^m \sum_{j=1}^{n} \mathrm{Dist}\!\Big( \varphi\Big(\boldsymbol{\theta}_0 + \sum_{k=1}^m \phi_k(\boldsymbol{\tau}_k), \mathbf{x}_{ij}\Big), \varphi(\boldsymbol{\theta}_i, \mathbf{x}_{ij}) \Big). $$
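As a rough illustration of the first objective (adapting from the pretrained checkpoint), the sketch below distills each finetuned model's representations on its own FDAs into a single model. As before, squared error stands in for $\mathrm{Dist}$ and the names (`adapt_with_fdas`, `fda_sets`, ...) are hypothetical; the second objective can be handled analogously by parameterizing the update through the merged task vectors.

```python
import torch
import torch.nn.functional as F


def adapt_with_fdas(model, finetuned_models, fda_sets, steps=1000, lr=1e-4):
    """Adapt a model (initialized from the pretrained or merged checkpoint) by
    matching the finetuned representations on each task's FDAs."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    # Fixed targets phi(theta_i, x_ij) for each task i.
    with torch.no_grad():
        targets = [m_i(x_i) for m_i, x_i in zip(finetuned_models, fda_sets)]

    for _ in range(steps):
        # Sum the representation discrepancies over all tasks and all anchors.
        loss = sum(F.mse_loss(model(x_i), y_i, reduction="sum")  # Dist(., .)
                   for x_i, y_i in zip(fda_sets, targets))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```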

Knowledge Encoded in the FDAs

Since FDAs project task-specific knowledge into the input-representation space, we investigate the knowledge they encode. We make three interesting observations.

Observation 1 – FDAs evolve into a long-tailed spectrum structure during optimization: We perform SVD on the FDA matrices and normalize the singular values by the largest one. As shown in Figure 3, the normalized tail singular values decay rapidly during the construction process for different initializations. This phenomenon is reasonable, as task-specific knowledge absorption often manifests as a long-tailed, low-rank structure in the parameter space as well.

Figure 3: SVD spectrum of FDAs.
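For reference, the spectrum in Observation 1 can be reproduced in a few lines by stacking one task's FDAs into an $n \times d$ matrix and normalizing its singular values by the largest one (a sketch; variable names are illustrative).

```python
import torch


def normalized_spectrum(fdas: torch.Tensor) -> torch.Tensor:
    """Singular values of the (n x d) FDA matrix, normalized by the largest one."""
    s = torch.linalg.svdvals(fdas)  # returned in descending order
    return s / s[0]
```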

Observation 2 – The high-energy subspaces of FDAs gradually align with those of real data: Given the long-tailed structure of FDAs, we measure the subspace similarity between real data and FDAs over the top 20% singular vectors via their projection matrices. From the examples in Figure 4, the similarity gradually increases as the optimization proceeds. This suggests a potential connection between the knowledge encoded in FDAs and the real task data.

Figure 4: Subspace of FDAs.
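One way to compute such a similarity is through the overlap of the two projection matrices, $\mathrm{tr}(\boldsymbol{P}_{\text{real}}\boldsymbol{P}_{\text{FDA}})/k$; whether this exact normalization matches the metric used in the paper is an assumption of the sketch below.

```python
import torch


def subspace_similarity(real_data: torch.Tensor, fdas: torch.Tensor, frac: float = 0.2) -> float:
    """Overlap between the top-`frac` right-singular subspaces of the real-data matrix
    and the FDA matrix, computed through their projection matrices."""
    def top_basis(mat: torch.Tensor) -> torch.Tensor:
        _, _, vh = torch.linalg.svd(mat, full_matrices=False)
        k = max(1, int(frac * vh.shape[0]))
        return vh[:k]  # top-k right singular vectors as rows, shape (k, d)

    u, v = top_basis(real_data), top_basis(fdas)
    # tr(P_u P_v) = ||U V^T||_F^2; dividing by min(k_u, k_v) maps identical subspaces to 1.
    overlap = (u @ v.T).pow(2).sum()
    return (overlap / min(u.shape[0], v.shape[0])).item()
```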

Observation 3 – The adaptation induced by FDAs increasingly aligns with that induced by real data: We analyze FDAs by re-projecting them into the parameter space, i.e., through the adaptation they induce. We project the FDA-induced adaptation onto the non-negative cone spanned by parameter update vectors derived from real data. As shown in Figure 5, the projection energy gradually increases for both the pretrained model and the merged model. This indicates that FDAs progressively produce robust task-specific functional shifts.

Figure 5: Energy of FDAs.
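A possible realization of the projection in Observation 3 uses non-negative least squares over the real-data update vectors, as sketched below; the energy ratio and the flattening of updates into single vectors are assumptions of this sketch (in practice, one may need to compute this per layer for tractability).

```python
import numpy as np
from scipy.optimize import nnls


def cone_projection_energy(fda_update: np.ndarray, real_updates: np.ndarray) -> float:
    """Energy of the FDA-induced parameter update retained after projecting it onto
    the non-negative cone spanned by the columns of `real_updates` (real-data updates).
    fda_update: flattened update of shape (P,); real_updates: matrix of shape (P, K)."""
    coeffs, _ = nnls(real_updates, fda_update)  # non-negative least-squares coefficients
    projection = real_updates @ coeffs          # closest point in the cone
    return float(projection @ projection) / float(fda_update @ fda_update)
```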

References

  1. Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A Efros. Dataset distillation.
  2. George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A Efros, and Jun-Yan Zhu. Dataset distillation by matching training trajectories.
  3. Weiyang Liu, Bo Dai, Ahmad Humayun, Charlene Tay, Chen Yu, Linda B Smith, James M Rehg, and Le Song. Iterative machine teaching.
  4. Zeju Qiu, Weiyang Liu, Tim Z Xiao, Zhen Liu, Umang Bhatt, Yucen Luo, Adrian Weller, and Bernhard Schölkopf. Iterative teaching by data hallucination.
  5. Bo Zhao, Konda Reddy Mopuri, and Hakan Bilen. Dataset condensation with gradient matching.
  6. Bo Zhao and Hakan Bilen. Dataset condensation with distribution matching.
  7. Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay.
  8. Longhui Yu, Tianyang Hu, Lanqing Hong, Zhen Liu, Adrian Weller, and Weiyang Liu. Continual learning by modeling intra-class variation.
  9. Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic.
  10. Antonio Andrea Gargiulo, Donato Crisostomi, Maria Sofia Bucarelli, Simone Scardapane, Fabrizio Silvestri, and Emanuele Rodola. Task singular vectors: Reducing task interference in model merging.
  11. Runxi Cheng, Feng Xiong, Yongxian Wei, Wanyun Zhu, and Chun Yuan. Whoever started the interference should end it: Guiding data-free model merging via task vectors.