¹The Chinese University of Hong Kong  ²Westlake University
Model merging is an intriguing post-training strategy for integrating knowledge from existing checkpoints finetuned from a shared foundation model. Existing methods operate in the parameter space (i.e., on task vectors) and therefore inherit the complexity of that space. To explore alternative ways of utilizing task-specific knowledge, we propose Functional Dual Anchors (FDAs), a framework (Figure 1) that instead models this knowledge in the input-representation space. Specifically, FDAs are synthetic inputs whose induced gradients align with the task vectors, capturing task-specific functional shifts relative to the pretrained model. We then use the FDAs to adapt the pretrained model. FDAs offer an alternative perspective on model merging by extending input-space modeling to this setting and bridging joint multi-task training and post-hoc merging.
| Method | CoLA | SST-2 | MRPC | STS-B | QQP | MNLI | QNLI | RTE | Avg | Δ |
|---|---|---|---|---|---|---|---|---|---|---|
| Pretrained | 0.1679 | 0.4897 | 0.7480 | -0.0471 | 0.3159 | 0.3545 | 0.5054 | 0.4693 | 0.3754 | - |
| Individual | 0.6335 | 0.9001 | 0.9224 | 0.9418 | 0.9055 | 0.8267 | 0.9507 | 0.9222 | 0.8754 | - |
| TA | 0.1635 | 0.8716 | 0.7480 | 0.6603 | 0.3159 | 0.6101 | 0.8716 | 0.7366 | 0.5918 | - |
| TSV | 0.4791 | 0.9323 | 0.7459 | 0.6660 | 0.3300 | 0.6750 | 0.7761 | 0.6751 | 0.6599 | - |
| WUDI | 0.4201 | 0.9232 | 0.7487 | 0.7345 | 0.5393 | 0.6430 | 0.5746 | 0.5740 | 0.6447 | - |
| FDAs (Pretrained, Gauss) | 0.3198 | 0.8463 | 0.7790 | 0.6828 | 0.7423 | 0.5605 | 0.6021 | 0.7726 | 0.6632 | +0.2878 |
| FDAs (Pretrained, Weight) | 0.3883 | 0.8911 | 0.7858 | 0.7230 | 0.7410 | 0.5791 | 0.6207 | 0.7329 | 0.6827 | +0.3073 |
| FDAs (TA, Gauss) | 0.4043 | 0.9461 | 0.7692 | 0.7897 | 0.6916 | 0.7190 | 0.7487 | 0.7076 | 0.7220 | +0.1302 |
| FDAs (TA, Weight) | 0.4511 | 0.9404 | 0.7578 | 0.7926 | 0.6518 | 0.7411 | 0.6965 | 0.7148 | 0.7183 | +0.1265 |
| FDAs (TSV, Gauss) | 0.5036 | 0.9438 | 0.7521 | 0.7975 | 0.4128 | 0.7075 | 0.8477 | 0.7365 | 0.7127 | +0.0528 |
| FDAs (TSV, Weight) | 0.5021 | 0.9427 | 0.7490 | 0.7418 | 0.5062 | 0.7292 | 0.8146 | 0.7365 | 0.7153 | +0.0554 |
| FDAs (WUDI, Gauss) | 0.4841 | 0.9404 | 0.7647 | 0.7645 | 0.6778 | 0.7004 | 0.5911 | 0.6643 | 0.6984 | +0.0537 |
| FDAs (WUDI, Weight) | 0.4848 | 0.9392 | 0.7573 | 0.7546 | 0.6979 | 0.7072 | 0.5656 | 0.6643 | 0.6964 | +0.0517 |
Performance of merging RoBERTa-Large models across eight NLU tasks (partial results; the full table, including methods that use task-specific data, is in the paper). TA, TSV, and WUDI are data-free task-vector-based methods. “FDAs (init model, FDA init)” denotes the choice of the initial model and the initialization strategy for FDAs, respectively. “Δ” denotes the performance improvement over the initial model.
| Method | SUN397 | Cars | RESISC45 | EuroSAT | SVHN | GTSRB | MNIST | DTD | Avg | Δ |
|---|---|---|---|---|---|---|---|---|---|---|
| Pretrained | 63.80 | 64.60 | 65.70 | 54.50 | 52.00 | 43.30 | 51.70 | 45.10 | 55.00 | - |
| Individual | 78.56 | 87.08 | 96.92 | 99.78 | 97.86 | 99.17 | 99.76 | 82.07 | 92.65 | - |
| TA | 62.07 | 66.14 | 74.00 | 76.48 | 88.02 | 73.79 | 98.52 | 52.50 | 73.94 | - |
| TSV | 72.83 | 80.20 | 88.97 | 97.22 | 93.93 | 93.94 | 99.27 | 72.66 | 87.38 | - |
| WUDI | 75.40 | 81.71 | 90.14 | 98.52 | 95.30 | 96.55 | 99.44 | 73.78 | 88.85 | - |
| FDA (Pretrained, Gauss) | 72.54 | 80.62 | 87.75 | 98.44 | 94.31 | 93.43 | 99.38 | 70.11 | 87.07 | +32.07 |
| FDA (Pretrained, Weight) | 73.60 | 80.48 | 88.00 | 98.26 | 94.35 | 93.41 | 99.31 | 70.64 | 87.26 | +32.26 |
| FDA (TA, Gauss) | 73.72 | 81.42 | 88.63 | 98.37 | 94.61 | 94.44 | 99.39 | 71.54 | 87.77 | +13.83 |
| FDA (TA, Weight) | 74.53 | 81.25 | 88.37 | 98.37 | 94.55 | 94.28 | 99.34 | 71.65 | 87.79 | +13.85 |
| FDA (TSV, Gauss) | 74.79 | 82.65 | 89.75 | 98.37 | 94.25 | 94.47 | 99.40 | 73.67 | 88.42 | +1.04 |
| FDA (TSV, Weight) | 74.93 | 81.92 | 89.79 | 98.33 | 94.10 | 93.78 | 99.36 | 73.78 | 88.25 | +0.87 |
| FDA (WUDI, Gauss) | 76.21 | 82.84 | 91.03 | 98.93 | 94.58 | 96.32 | 99.40 | 74.52 | 89.23 | +0.38 |
| FDA (WUDI, Weight) | 76.15 | 82.75 | 91.21 | 98.89 | 94.49 | 96.24 | 99.39 | 74.41 | 89.19 | +0.34 |
Performance of merging ViT-B-16 models across eight downstream vision tasks (partial results; the full table, including methods that use task-specific data, is in the paper). TA, TSV, and WUDI are data-free task-vector-based methods. “FDA (init model, FDA init)” denotes the choice of the initial model and the initialization strategy for FDAs, respectively. “Δ” denotes the performance improvement over the initial model.
To validate the effectiveness of the merging direction provided by FDAs, we use FDAs to adapt the pretrained model and compare its multi-task performance with that of the dual framework (i.e., task vectors, TA). To show robustness, we also initialize the merged model with parameters derived from data-free task-vector-based methods. We consider three such methods: TA (Ilharco et al., 2022), TSV (Gargiulo et al., 2025), and WUDI (Cheng et al., 2025). The first is the classical method, while the latter two are current state-of-the-art methods. We present partial results in the tables above; more results can be found in our paper. From the results, we draw two key observations:
Observation 1 – FDAs can effectively leverage existing task-specific knowledge for multi-task model merging: Compared with the dual framework TA, FDAs bring a significant improvement: on the vision benchmark, the pretrained model adapted by FDAs reaches 87.26 average accuracy versus 73.94 for TA, a relative improvement of nearly 18%; meanwhile, the average GLUE score improves by about 15.4%.
Observation 2 – Flexible knowledge modeling: Although FDAs and data-free parameter-centric methods leverage the same task-specific knowledge, FDAs still improve the models merged by these methods. The average improvement from applying FDAs on top of TA, TSV, and WUDI is nearly 5.10% on ViT-B/16 and about 13% on RoBERTa-Large.
The practical algorithm for FDAs involves two main stages: constructing FDAs and adapting with FDAs. In the first stage, FDAs are built for each downstream checkpoint; this stage can be viewed as projecting task-specific knowledge into the input-representation space. In the second stage, these FDAs are used to adapt the model, i.e., to integrate knowledge across multiple tasks.
Given the pretrained model $\varphi(\boldsymbol{\theta}_0)$ and the corresponding finetuned checkpoint $\varphi(\boldsymbol{\theta}_i)$, we construct the FDAs $\{\boldsymbol{x}_{ij}\}_{j=1}^n$ for $\varphi(\boldsymbol{\theta}_i)$ via solving the following optimization problem:
$$ \min_{\boldsymbol{x}_{i1},\dots,\boldsymbol{x}_{in}} \mathrm{cos\_dist}\!\Bigg( \nabla_{\boldsymbol{\theta}}\!\sum_{j=1}^n \mathrm{Dist}\!\big(\varphi(\boldsymbol{\theta}, \boldsymbol{x}_{ij}), \varphi(\boldsymbol{\theta}_i, \boldsymbol{x}_{ij})\big) \Big|_{\boldsymbol{\theta}=\boldsymbol{\theta}_0}, \boldsymbol{\tau}_i \Bigg) $$where $\mathrm{cos\_dist}(\mathbf{A},\mathbf{B}) = 1 - \frac{\mathrm{vec}(\mathbf{A})^\top \mathrm{vec}(\mathbf{B})} {\|\mathbf{A}\|_F \|\mathbf{B}\|_F}$, $\mathrm{vec}$ denotes the operation that vectorizes a matrix into a vector in row-major order, $\mathrm{Dist}(\cdot)$ denotes a differentiable distance function measuring the representation discrepancy between $\varphi(\boldsymbol{\theta}, \cdot)$ and $\varphi(\boldsymbol{\theta}_i, \cdot)$, and $\boldsymbol{\tau}_i = \boldsymbol{\theta}_i - \boldsymbol{\theta}_0$ is the task vector.
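A minimal PyTorch sketch of this construction objective is given below. It assumes the model accepts a batch of continuous input vectors (e.g., embedding-space anchors), takes `Dist` to be the squared L2 distance, and uses a generic Adam loop; anchor initialization is discussed next. The function name and hyperparameters are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def construct_fdas(model_0, model_i, anchors, steps=100, lr=1e-2):
    """Optimize synthetic anchors so that the gradient they induce at theta_0
    aligns (in cosine distance) with the task vector tau_i = theta_i - theta_0."""
    params_0 = list(model_0.parameters())
    with torch.no_grad():
        tau = torch.cat([(pi - p0).reshape(-1)
                         for p0, pi in zip(params_0, model_i.parameters())])
    anchors = anchors.clone().requires_grad_(True)
    opt = torch.optim.Adam([anchors], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        model_0.zero_grad(set_to_none=True)
        with torch.no_grad():
            target = model_i(anchors)                      # phi(theta_i, x_ij), fixed
        dist = F.mse_loss(model_0(anchors), target, reduction="sum")
        # gradient of the representation discrepancy w.r.t. theta, kept in the graph
        grads = torch.autograd.grad(dist, params_0, create_graph=True)
        g = torch.cat([gr.reshape(-1) for gr in grads])
        loss = 1.0 - torch.dot(g, tau) / (g.norm() * tau.norm() + 1e-12)  # cos_dist
        loss.backward()                                     # backpropagates into anchors
        opt.step()
    return anchors.detach()
```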
We solve this problem with gradient-based iterative optimization. Since gradient-based methods are sensitive to initialization, we analyze the optimization dynamics of the anchors on a linear encoder and derive a principle for initialization.
An effective initialization strategy should limit the energy of the initialization point within the tail subspace spanned by the task vector.
Based on this principle, we derive two practical initialization schemes: linear weight sampling, $\boldsymbol{x}_{ij}=(\boldsymbol{W}_i)_{l_j,:}$, and scaled Gaussian sampling, $\boldsymbol{x}_{ij} = \sigma \cdot \tilde{\boldsymbol{x}}_{ij}$ with $\tilde{\boldsymbol{x}}_{ij} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{I}_d)$, where $\boldsymbol{W}_i$ denotes a weight matrix of the checkpoint $\varphi(\boldsymbol{\theta}_i)$ and $l_j$ is a sampled row index.
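A short sketch of the two initialization schemes, assuming the anchors live in a $d$-dimensional input space; which layer's weight matrix to sample from and the value of $\sigma$ are assumptions here.

```python
import torch

def init_anchors_weight(weight_matrix, n):
    """Linear weight sampling: take n rows of a weight matrix W_i from the
    finetuned checkpoint as initial anchors (row indices l_j sampled at random)."""
    idx = torch.randperm(weight_matrix.shape[0])[:n]
    return weight_matrix[idx].clone()

def init_anchors_gauss(n, d, sigma=0.1):
    """Scaled Gaussian sampling: x_ij = sigma * x~, with x~ ~ N(0, I_d).
    The scale sigma is a hyperparameter (the value here is illustrative)."""
    return sigma * torch.randn(n, d)
```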
The adaptation process with FDAs is the dual of the construction process above. When the merged model is initialized from the pretrained checkpoint, adaptation optimizes the following objective: $$ \min_{\boldsymbol{\theta}} \sum_{i=1}^m \sum_{j=1}^{n} \mathrm{Dist}\!\Big( \varphi(\boldsymbol{\theta}, \boldsymbol{x}_{ij}), \varphi(\boldsymbol{\theta}_i, \boldsymbol{x}_{ij}) \Big), $$ with $\boldsymbol{\theta}$ initialized at $\boldsymbol{\theta}_0$. When the merged model is initialized from the parameters produced by a task-vector-based method, adaptation refines the merged task vectors with the objective $$ \min_{\{\phi_k(\boldsymbol{\tau}_k)\}_{k=1}^m} \sum_{i=1}^m \sum_{j=1}^{n} \mathrm{Dist}\!\Big( \varphi \big(\boldsymbol{\theta}_0 + \sum_{k=1}^m \phi_k(\boldsymbol{\tau}_k), \boldsymbol{x}_{ij}\big), \varphi(\boldsymbol{\theta}_i, \boldsymbol{x}_{ij}) \Big). $$
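Below is a minimal sketch of the first (pretrained-initialized) adaptation objective, again assuming `Dist` is the squared L2 distance between representations; the same loop applies when `model` is initialized from merged parameters. Names and hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def adapt_with_fdas(model, finetuned_models, fda_sets, steps=200, lr=1e-4):
    """Adapt the initial model so that its representations on each task's FDAs
    match those of the corresponding finetuned checkpoint.
    `fda_sets[i]` holds the anchors constructed for checkpoint i."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    # Cache targets once: phi(theta_i, x_ij) does not change during adaptation.
    with torch.no_grad():
        targets = [m(x) for m, x in zip(finetuned_models, fda_sets)]
    for _ in range(steps):
        opt.zero_grad()
        loss = sum(F.mse_loss(model(x), t, reduction="sum")
                   for x, t in zip(fda_sets, targets))
        loss.backward()
        opt.step()
    return model
```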
Since FDAs project task-specific knowledge into the input-representation space, we investigate the knowledge they encode. We make three observations.
Observation 1 – FDAs evolve into a long-tailed spectrum structure during optimization: We perform SVD on the FDA matrices and normalize the singular values by the largest one. As shown in Figure 3, the normalized tail singular values decay rapidly during construction for different initializations. This is reasonable, as task-specific knowledge absorption often manifests as a long-tailed, low-rank structure in the parameter space as well.
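The spectrum statistic used here is straightforward to compute; a small sketch, treating the FDAs of one task as an $n \times d$ matrix:

```python
import torch

def normalized_spectrum(anchors):
    """Singular values of the FDA matrix (n x d), normalized by the largest one,
    used to inspect the long-tailed spectrum structure."""
    s = torch.linalg.svdvals(anchors)   # returned in descending order
    return s / s[0]
```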
Observation 2 – The high-energy subspaces of FDAs gradually align with those of real data: Given the long-tailed structure of FDAs, we measure the subspace similarity of the top 20% singular vectors between real data and FDAs via projection matrices. In the examples in Figure 4, the similarity gradually increases as the optimization proceeds. This suggests a connection between the knowledge encoded in FDAs and the real task data.
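One standard way to realize this projection-matrix similarity is the normalized overlap $\mathrm{tr}(P_A P_B)/k$ between the top-$k$ right-singular subspaces; the exact metric used in the paper may differ, so treat the sketch below as an assumption.

```python
import torch

def subspace_similarity(real_data, anchors, frac=0.2):
    """Overlap between the top-`frac` right-singular subspaces of the real data
    matrix and the FDA matrix, via their projection matrices."""
    r = min(min(real_data.shape), min(anchors.shape))
    k = max(1, int(frac * r))
    _, _, Va = torch.linalg.svd(real_data, full_matrices=False)
    _, _, Vb = torch.linalg.svd(anchors, full_matrices=False)
    Va, Vb = Va[:k].T, Vb[:k].T          # d x k orthonormal bases of the top-k subspaces
    # tr(P_A P_B) = ||Va^T Vb||_F^2 for orthonormal bases, normalized by k
    return (Va.T @ Vb).pow(2).sum() / k
```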
Observation 3 – The adaptation induced by FDAs increasingly aligns with that induced by real data: We analyze FDAs by re-projecting them into the parameter space, i.e., through the adaptation they induce. We project the FDA-induced adaptation onto the non-negative cone spanned by parameter-update vectors derived from real data. As shown in Figure 5, the projection energy gradually increases for both the pretrained and merged models. This indicates that FDAs progressively produce robust task-specific functional shifts.
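A projection onto a non-negative cone of update vectors can be computed with non-negative least squares; the sketch below reports the energy ratio $\|U c\|^2 / \|v\|^2$, which is our assumption about the normalization rather than the paper's stated formula.

```python
import numpy as np
from scipy.optimize import nnls

def projection_energy(fda_update, real_updates):
    """Fraction of the FDA-induced parameter update lying inside the
    non-negative cone spanned by real-data update vectors.
    Solves min_{c >= 0} ||U c - v||_2 with NNLS (can be slow for very
    long parameter vectors; subsampling coordinates is one workaround)."""
    U = np.stack(real_updates, axis=1)      # (num_params, num_update_vectors)
    v = np.asarray(fda_update)
    coef, _ = nnls(U, v)                    # non-negative least squares
    proj = U @ coef
    return float(np.dot(proj, proj) / (np.dot(v, v) + 1e-12))
```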