1 of 18

What Matters for Model Merging at Scale?

Jina Kim

School of Computing, KAIST

2025.01.08

2 of 18

Why model merging?

  • reduces storage and serving costs
  • combines expert models and reuses already existing models; recycling!
  • models can be built independently and combined later → supports decentralized and modular model development

3 of 18

Task Arithmetic

4 of 18

Task Arithmetic

T5-base (220 million parameters), merging 2 NLP tasks
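A minimal sketch of the Task Arithmetic idea, assuming every expert was fine-tuned from the same base checkpoint; the dictionary-of-arrays weight format and the scaling coefficient lam are illustrative assumptions, not the paper's code.

```python
import numpy as np

def task_vector(base, expert):
    """Task vector = expert weights minus base weights (per tensor)."""
    return {k: expert[k] - base[k] for k in base}

def task_arithmetic_merge(base, experts, lam=0.3):
    """Add the scaled sum of task vectors back onto the base model.

    lam is the merging coefficient; using one shared value is an
    assumption here (in practice it is tuned on validation data).
    """
    merged = {k: base[k].astype(float) for k in base}
    for expert in experts:
        tv = task_vector(base, expert)
        for k in merged:
            merged[k] += lam * tv[k]
    return merged

# toy usage with dummy "weights"
base = {"w": np.zeros(4)}
experts = [{"w": np.ones(4)}, {"w": -0.5 * np.ones(4)}]
print(task_arithmetic_merge(base, experts)["w"])  # [0.15 0.15 0.15 0.15]
```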

5 of 18

TIES-merging

[Figure: TIES-merging pipeline: task vectors are trimmed by magnitude, and conflicts are resolved by sign election]
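A minimal sketch of the TIES-style steps (trim by magnitude, elect a sign per parameter, merge only the agreeing values), reusing the task-vector layout from the previous sketch; the keep fraction k and the scaling lam are illustrative assumptions.

```python
import numpy as np

def ties_merge(base, experts, k=0.2, lam=1.0):
    """TIES-style merge: trim, elect signs, average the agreeing entries.

    k is the fraction of entries kept per task vector and lam the final
    scaling; both values are illustrative, not the paper's settings.
    """
    merged = {}
    for name in base:
        # stack the task vectors for this tensor: shape (num_experts, num_params)
        tvs = np.stack([e[name] - base[name] for e in experts])
        flat = tvs.reshape(len(experts), -1)

        # 1) trim: keep only the top-k fraction by magnitude in each task vector
        keep = int(np.ceil(k * flat.shape[1]))
        thresh = -np.sort(-np.abs(flat), axis=1)[:, keep - 1 : keep]
        flat = np.where(np.abs(flat) >= thresh, flat, 0.0)

        # 2) elect a sign per entry: the sign with the larger total mass wins
        elected = np.sign(flat.sum(axis=0))

        # 3) disjoint merge: average only entries that agree with the elected sign
        agree = (np.sign(flat) == elected) & (flat != 0)
        summed = np.where(agree, flat, 0.0).sum(axis=0)
        count = np.maximum(agree.sum(axis=0), 1)
        merged[name] = base[name] + lam * (summed / count).reshape(base[name].shape)
    return merged
```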

6 of 18

TIES-merging

[Results: merging with T5-base (220 million parameters) and T5-large (770 million parameters)]

7 of 18

Della-TIES

[Figure: Della-TIES: drop probabilities are assigned according to parameter magnitude, followed by sign-based election]
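A minimal sketch of the magnitude-aware dropping step that Della adds in front of the TIES-style sign election; the linear rank-to-probability mapping and the p_min/p_max range are simplified assumptions, not the paper's exact schedule.

```python
import numpy as np

def magnitude_aware_drop(task_vector, p_min=0.1, p_max=0.9, rng=None):
    """Stochastically drop task-vector entries, favoring low-magnitude ones.

    Smaller-magnitude entries get a higher drop probability, and the kept
    entries are rescaled by 1 / (1 - p) to keep the expectation unchanged.
    """
    rng = np.random.default_rng() if rng is None else rng
    flat = task_vector.ravel()

    # rank entries by magnitude: rank 0 = smallest magnitude
    ranks = np.argsort(np.argsort(np.abs(flat)))
    # map rank to drop probability: small magnitude -> high drop probability
    p_drop = p_max - (p_max - p_min) * ranks / max(len(flat) - 1, 1)

    keep = rng.random(len(flat)) >= p_drop
    dropped = np.where(keep, flat / (1.0 - p_drop), 0.0)
    return dropped.reshape(task_vector.shape)
```

The dropped-and-rescaled task vectors are then combined with the same sign election and disjoint merge as in the TIES sketch above.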

8 of 18

Della-TIES

[Results: Della ablations (no dropping, uniform drop probability, deterministic top-k keeping, i.e. TIES) on instruction-following, math, and code tasks; cf. for math, TIES seems better]

9 of 18

Motivation

Previous work focuses on…

  • merging small models (< 7B parameters)
  • only 2-3 expert models (cf. exception: TIES-merging)
  • held-in tasks that the expert models were trained on

This paper asks:

  • What is the effect of the base model (pretrained vs. instruction-tuned)?
  • As model size increases, does merging get easier or harder?
  • How does merging affect performance on held-out tasks, especially at scale?
  • How many expert models can be merged without performance loss, and how does this depend on model size?

10 of 18

Experiment Settings

  • base models: PaLM-2 (pretrained), PaLM-2-IT (instruction-tuned)
  • model sizes: 1B, 8B, 24B, 64B parameters
  • number of merged experts: 2, 4, 6, 8
  • tasks: T0 data collection (8 held-in categories with 2 datasets each, 4 held-out categories with 7 datasets)
    • held-in: Multiple-Choice QA, Extractive QA, Closed-Book QA, Sentiment Analysis, Topic Classification, Structure-to-Text, Summarization, Paraphrase Identification
    • held-out: Sentence Completion, Natural Language Inference, Coreference Resolution, Word Sense Disambiguation
  • merging methods: averaging, Task Arithmetic, TIES-merging, Della-TIES

⇒ fully fine-tune 2 base models (PaLM-2, PaLM-2-IT) × 4 sizes on each of the 8 held-in tasks → 64 specialized expert models
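A tiny sketch of the expert grid this implies; the task identifiers below are illustrative placeholders.

```python
from itertools import product

base_models = ["PaLM-2", "PaLM-2-IT"]   # pretrained vs. instruction-tuned
sizes = ["1B", "8B", "24B", "64B"]
held_in_tasks = [
    "multiple_choice_qa", "extractive_qa", "closed_book_qa", "sentiment",
    "topic_classification", "structure_to_text", "summarization", "paraphrase_id",
]

# one fully fine-tuned expert per (base model, size, held-in task) combination
experts = list(product(base_models, sizes, held_in_tasks))
print(len(experts))  # 2 * 4 * 8 = 64 specialized experts
```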

11 of 18

Key Findings

  • merging is more effective with instruction-tuned base models
  • larger models are easier to merge
  • merging significantly improves zero-shot generalization
  • more expert models can be merged effectively when using larger models
  • different merging methods perform similarly when applied to large-scale instruction-tuned models

12 of 18

1. Instruction-tuned models facilitate easier merging

13 of 18

2. Bigger models merge better

14 of 18

3. Big & strong base models generalize better than multitask models

15 of 18

4. Bigger models generalize better

16 of 18

5. Merging methods become similar at scale (64B PaLM-2-IT)

cf. Held-in performance still needs improvement

17 of 18

Takeaway

  • Using a strong base model is always beneficial (finding 3)
  • Merged models often underperform task-specific expert models
  • With a strong base model, large-scale merging can outperform multitask training (finding 3)
  • With large instruction-tuned models, different merging methods perform very similarly (finding 5)
  • Held-in task performance still needs improvement

18 of 18

Thank you!