1 of 18

What Matters for Model Merging at Scale?

Jina Kim

School of Computing, KAIST

2025.01.08

2 of 18

Why model merging?

  • reduces storage and serving costs
  • combines expert models and reuses already existing models; recycling!
  • models can be built independently and combined later → supports decentralized and modular model development

3 of 18

Task Arithmetic

4 of 18

Task Arithmetic

T5-base (220 million parameters), merging 2 NLP tasks
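A minimal sketch of the Task Arithmetic idea, assuming every expert was fine-tuned from the same base checkpoint; the dictionary-of-arrays weight format and the scaling coefficient lam are illustrative assumptions, not the paper's code.

```python
import numpy as np

def task_vector(base, expert):
    """Task vector = expert weights minus base weights (per tensor)."""
    return {k: expert[k] - base[k] for k in base}

def task_arithmetic_merge(base, experts, lam=0.3):
    """Add the scaled sum of task vectors back onto the base model.

    lam is the merging coefficient; using one shared value is an
    assumption here (in practice it is tuned on validation data).
    """
    merged = {k: base[k].astype(float) for k in base}
    for expert in experts:
        tv = task_vector(base, expert)
        for k in merged:
            merged[k] += lam * tv[k]
    return merged

# toy usage with dummy "weights"
base = {"w": np.zeros(4)}
experts = [{"w": np.ones(4)}, {"w": -0.5 * np.ones(4)}]
print(task_arithmetic_merge(base, experts)["w"])  # [0.15 0.15 0.15 0.15]
```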

5 of 18

TIES-merging

[Figure: TIES-merging pipeline: task vectors are trimmed by magnitude, and conflicts are resolved by sign election]
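A minimal sketch of the TIES-style steps (trim by magnitude, elect a sign per parameter, merge only the agreeing values), reusing the task-vector layout from the previous sketch; the keep fraction k and the scaling lam are illustrative assumptions.

```python
import numpy as np

def ties_merge(base, experts, k=0.2, lam=1.0):
    """TIES-style merge: trim, elect signs, average the agreeing entries.

    k is the fraction of entries kept per task vector and lam the final
    scaling; both values are illustrative, not the paper's settings.
    """
    merged = {}
    for name in base:
        # stack the task vectors for this tensor: shape (num_experts, num_params)
        tvs = np.stack([e[name] - base[name] for e in experts])
        flat = tvs.reshape(len(experts), -1)

        # 1) trim: keep only the top-k fraction by magnitude in each task vector
        keep = int(np.ceil(k * flat.shape[1]))
        thresh = -np.sort(-np.abs(flat), axis=1)[:, keep - 1 : keep]
        flat = np.where(np.abs(flat) >= thresh, flat, 0.0)

        # 2) elect a sign per entry: the sign with the larger total mass wins
        elected = np.sign(flat.sum(axis=0))

        # 3) disjoint merge: average only entries that agree with the elected sign
        agree = (np.sign(flat) == elected) & (flat != 0)
        summed = np.where(agree, flat, 0.0).sum(axis=0)
        count = np.maximum(agree.sum(axis=0), 1)
        merged[name] = base[name] + lam * (summed / count).reshape(base[name].shape)
    return merged
```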

6 of 18

TIES-merging

[Results: merging with T5-base (220 million parameters) and T5-large (770 million parameters)]

7 of 18

Della-TIES

[Figure: Della-TIES: drop probabilities are assigned according to parameter magnitude, followed by sign-based election]
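A minimal sketch of the magnitude-aware dropping step that Della adds in front of the TIES-style sign election; the linear rank-to-probability mapping and the p_min/p_max range are simplified assumptions, not the paper's exact schedule.

```python
import numpy as np

def magnitude_aware_drop(task_vector, p_min=0.1, p_max=0.9, rng=None):
    """Stochastically drop task-vector entries, favoring low-magnitude ones.

    Smaller-magnitude entries get a higher drop probability, and the kept
    entries are rescaled by 1 / (1 - p) to keep the expectation unchanged.
    """
    rng = np.random.default_rng() if rng is None else rng
    flat = task_vector.ravel()

    # rank entries by magnitude: rank 0 = smallest magnitude
    ranks = np.argsort(np.argsort(np.abs(flat)))
    # map rank to drop probability: small magnitude -> high drop probability
    p_drop = p_max - (p_max - p_min) * ranks / max(len(flat) - 1, 1)

    keep = rng.random(len(flat)) >= p_drop
    dropped = np.where(keep, flat / (1.0 - p_drop), 0.0)
    return dropped.reshape(task_vector.shape)
```

The dropped-and-rescaled task vectors are then combined with the same sign election and disjoint merge as in the TIES sketch above.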

8 of 18

Della-TIES

[Results: Della ablations (no dropping, uniform drop probability, deterministic top-k keeping, i.e. TIES) on instruction-following, math, and code tasks; cf. for math, TIES seems better]

9 of 18

Motivation

Previous work focuses on…

  • merging small models (< 7B parameters)
  • only 2-3 expert models (cf. exception: TIES-merging)
  • held-in tasks that the expert models were trained on

This paper asks:

  • What is the effect of the base model (pretrained vs. instruction-tuned)?
  • As model size increases, does merging get easier or harder?
  • How does merging affect performance on held-out tasks, especially at scale?
  • How many expert models can be merged without performance loss, and how does this depend on model size?

10 of 18

Experiment Settings

  • base models: PaLM-2 (pretrained), PaLM-2-IT (instruction-tuned)
  • model sizes: 1B, 8B, 24B, 64B parameters
  • number of merged experts: 2, 4, 6, 8
  • tasks: T0 data collection (8 held-in categories with 2 datasets each, 4 held-out categories with 7 datasets)
    • held-in: Multiple-Choice QA, Extractive QA, Closed-Book QA, Sentiment Analysis, Topic Classification, Structure-to-Text, Summarization, Paraphrase Identification
    • held-out: Sentence Completion, Natural Language Inference, Coreference Resolution, Word Sense Disambiguation
  • merging methods: averaging, Task Arithmetic, TIES-merging, Della-TIES

⇒ fully fine-tune 2 base models (PaLM-2, PaLM-2-IT) × 4 sizes on each of the 8 held-in tasks → 64 specialized expert models
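A tiny sketch of the expert grid this implies; the task identifiers below are illustrative placeholders.

```python
from itertools import product

base_models = ["PaLM-2", "PaLM-2-IT"]   # pretrained vs. instruction-tuned
sizes = ["1B", "8B", "24B", "64B"]
held_in_tasks = [
    "multiple_choice_qa", "extractive_qa", "closed_book_qa", "sentiment",
    "topic_classification", "structure_to_text", "summarization", "paraphrase_id",
]

# one fully fine-tuned expert per (base model, size, held-in task) combination
experts = list(product(base_models, sizes, held_in_tasks))
print(len(experts))  # 2 * 4 * 8 = 64 specialized experts
```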

11 of 18

Key Findings

  • merging is more effective with instruction-tuned base models
  • larger models are easier to merge
  • merging significantly improves zero-shot generalization
  • more expert models can be merged effectively when using larger models
  • different merging methods perform similarly when applied to large-scale instruction-tuned models

12 of 18

1. Instruction-tuned models facilitate easier merging

13 of 18

2. Bigger models merge better

14 of 18

3. Big & strong base models generalize better than multitask models

15 of 18

4. Bigger models generalize better

16 of 18

5. Merging methods become similar at scale (64B PaLM-2-IT)

cf. Held-in performance still needs improvement

17 of 18

Takeaway

  • Using a strong base model is always beneficial (finding 3)
  • Merged models often underperform task-specific expert models
  • With a strong base model, large-scale merging can outperform multitask training (finding 3)
  • With large instruction-tuned models, different merging methods perform very similarly (finding 5)
  • Held-in task performance still needs improvement

18 of 18

Thank you!