What Matters for Model Merging at Scale?
Jina Kim
School of Computing, KAIST
2025.01.08
Why model merging?
Combine independently fine-tuned expert models into a single model, with no additional training and without serving every expert separately.
Task Arithmetic
Each expert defines a task vector τ_t = θ_t − θ_pre; merging adds the scaled sum back to the base: θ_merged = θ_pre + λ Σ_t τ_t (sketch below).
Example setup: T5-base (220 million parameters), merging 2 NLP tasks.
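A minimal sketch of Task Arithmetic in Python, assuming checkpoints are dicts of parameter name → torch.Tensor; the function name and the λ argument (`lam`) are illustrative, not from the slides:

```python
# Sketch: theta_merged = theta_pre + lam * sum_t (theta_t - theta_pre)
import torch

def task_arithmetic_merge(base, experts, lam=1.0):
    merged = {}
    for name, theta_pre in base.items():
        # Task vector for each expert: fine-tuned minus pretrained weights.
        task_vectors = [ft[name] - theta_pre for ft in experts]
        merged[name] = theta_pre + lam * torch.stack(task_vectors).sum(dim=0)
    return merged
```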
TIES-merging
1. Trim: keep only the top-k fraction of each task vector's parameters by magnitude.
2. Elect: per parameter, resolve sign conflicts by electing the sign with the larger total magnitude.
3. Disjoint merge: average only the values that agree with the elected sign (sketch below).
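A minimal sketch of the three TIES steps above, under the same checkpoint-as-dict assumption; the function name and k (fraction of parameters kept) are illustrative:

```python
import torch

def ties_merge(base, experts, k=0.2, lam=1.0):
    merged = {}
    for name, theta_pre in base.items():
        # Task vectors, flattened to shape (num_experts, num_params).
        deltas = torch.stack([ft[name] - theta_pre for ft in experts]).flatten(1)
        # 1) Trim: zero out all but the top-k fraction by magnitude.
        mags = deltas.abs()
        n_keep = max(1, int(k * deltas.shape[1]))
        thresh = mags.kthvalue(deltas.shape[1] - n_keep + 1, dim=1, keepdim=True).values
        trimmed = torch.where(mags >= thresh, deltas, torch.zeros_like(deltas))
        # 2) Elect: per parameter, the sign with the larger total magnitude.
        elected = torch.sign(trimmed.sum(dim=0))
        # 3) Disjoint merge: average only values agreeing with the elected sign.
        agree = (torch.sign(trimmed) == elected) & (trimmed != 0)
        mean = (trimmed * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)
        merged[name] = theta_pre + lam * mean.view_as(theta_pre)
    return merged
```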
TIES-merging
Results on T5-base (220 million parameters) and T5-large (770 million parameters).
Della-TIES
1. Drop: assign each delta parameter a drop probability tied to its magnitude (larger magnitudes are dropped less often), then rescale survivors.
2. Elect: sign-based election, as in TIES.
3. Merge the values that agree with the elected sign (sketch below).
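A minimal sketch of the magnitude-aware drop step for a single task vector; the linear rank-to-probability mapping and the (p_min, p_max) bounds are assumptions, not from the slides:

```python
import torch

def della_drop(delta, p_min=0.1, p_max=0.9):
    flat = delta.flatten()
    ranks = flat.abs().argsort().argsort().float()  # rank 0 = smallest magnitude
    # Larger magnitude -> larger rank -> lower drop probability.
    p_drop = p_max - (p_max - p_min) * ranks / max(len(flat) - 1, 1)
    keep = torch.rand_like(flat) >= p_drop
    # Rescale survivors by 1/(1-p) so the expected delta is unchanged.
    return torch.where(keep, flat / (1 - p_drop), torch.zeros_like(flat)).view_as(delta)
```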
Della-TIES ablations
- No drop: keep every delta parameter (no pruning).
- Same drop probability for all parameters: reduces to DARE-style random dropping.
- Deterministically keeping magnitude ranks 1 to k (top-k): reduces to TIES.
cf) Across the three expert domains (instruction following, math, code), TIES seems better than Della-TIES for math.
Motivation
Previous works focus on small models and merge only a few experts at a time
⇒ What actually matters when merging at scale: model size, base-model quality, number of experts, merging method?
Experiment Settings
⇒ Full fine-tuning of 2 base models (pretrained PaLM-2 and instruction-tuned PaLM-2-IT) × 4 model sizes on the 8 held-in tasks ⇒ 64 specialized experts.
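A sketch of the resulting expert grid; the four PaLM-2 sizes are taken from the paper (only 64B appears on these slides), and the task names are placeholders:

```python
from itertools import product

base_models = ["PaLM-2", "PaLM-2-IT"]        # pretrained vs. instruction-tuned
sizes = ["1B", "8B", "24B", "64B"]
held_in_tasks = [f"task_{i+1}" for i in range(8)]

experts = list(product(base_models, sizes, held_in_tasks))
assert len(experts) == 2 * 4 * 8 == 64       # 64 specialized experts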
Key Findings
1. Merging is more effective when experts come from a strong (instruction-tuned) base model.
2. Bigger models merge better.
3. Merged models built from big, strong base models generalize better than multitask-trained models.
4. Bigger models generalize better.
[Plots: held-out performance remains stable as scale increases]
5. Merging methods become similar at scale (64B PaLM-2-IT)
cf) Held-in performance needs more improvement
Takeaway
At scale, merging gets easier and more useful: bigger, instruction-tuned base models merge better and generalize better, and the choice of merging method matters less.
Thank you!