10-605 / 10-805
Machine Learning from Large Datasets
Announcements
Why? stay tuned!
Outline
Distributed ML
RECAP
Data Parallel ML: Minimal synchronization 1
RECAP
ACL 2010
Data Parallel ML: Minimal synchronization 1
RECAP
Comparing synchronization approaches for perceptron training
[Plot: comparing one model on 1/p of the data, model averaging with no synchronization, and iterative parameter mixing]
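As a rough illustration of the approaches compared above, here is a minimal NumPy sketch of iterative parameter mixing for a perceptron, assuming the data is already split into p shards; the shard format, epoch count, and the plain mistake-driven update are illustrative choices, not the exact setup from the paper.

```python
import numpy as np

def perceptron_epoch(w, X, y):
    """One pass of the classic perceptron update on a single shard (y in {-1,+1})."""
    for xi, yi in zip(X, y):
        if yi * (xi @ w) <= 0:        # mistake-driven update
            w = w + yi * xi
    return w

def iterative_parameter_mixing(shards, n_features, n_epochs=10):
    """After each epoch, average the per-shard weight vectors and
    broadcast the average back to every shard (the IPM scheme)."""
    w = np.zeros(n_features)
    for _ in range(n_epochs):
        local = [perceptron_epoch(w.copy(), X, y) for X, y in shards]
        w = np.mean(local, axis=0)    # synchronize once per epoch
    return w
```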
RECAP
Data Parallel ML: Minimal synchronization 2
NeurIPS 2010
RECAP
Data Parallel ML: Minimal synchronization 2
RECAP
Comparing synchronization approaches for SGD and logistic regression
[Plot: comparing a 100-model average with no synchronization, a 10-model average with no synchronization, and 1 model]
Note: These are convex optimization problems!
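For contrast, a minimal sketch of the no-synchronization baseline for SGD on logistic regression (a convex problem, as noted above): each worker runs SGD on its own shard and the parameters are averaged once at the end. The learning rate, step count, and data layout are illustrative assumptions.

```python
import numpy as np

def sgd_logreg(X, y, lr=0.1, n_steps=1000, seed=0):
    """Plain SGD on the logistic loss for one worker's shard (y in {-1,+1})."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        i = rng.integers(len(X))
        margin = y[i] * (X[i] @ w)
        w += lr * y[i] * X[i] / (1.0 + np.exp(margin))  # negative gradient of log-loss
    return w

def one_shot_average(shards):
    """Each worker trains independently; parameters are averaged once at the end."""
    return np.mean([sgd_logreg(X, y, seed=k) for k, (X, y) in enumerate(shards)], axis=0)
```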
RECAP
talk pilfered from → …
KDD 2011
RECAP
[Plot: comparing iterative SGD with no mixing, limited-memory quasi-Newton, parameter mixing, alternating least squares, and IPM]
RECAP
Recap of the recap
BRANCH-TRAIN-MERGE
2022
BTM: Key ideas
ELM = expert LM
ELMForest = group of ELMs 𝜃1, 𝜃2, …, 𝜃k
branched from a small “seed” GPT 𝜃
merge?
BTM: Key ideas
Domain posterior
Using just top k=3 domains is ok
Preferred approach: exponential moving average of the domain posterior, update every 1000 tokens
combine logits from k LLMs
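A minimal sketch of the logit-combination idea, assuming each expert LM exposes a next-token-logits function; the observed-token evidence and the EMA decay constant are illustrative assumptions, not the paper's exact posterior update.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ensemble_step(experts, context, observed_token, post, decay=0.99):
    """experts: list of k callables mapping a context to next-token logits.
    post: EMA estimate of the domain posterior P(D = j | context), shape (k,)."""
    logits = np.stack([f(context) for f in experts])           # (k, vocab)
    mixed = (post[:, None] * logits).sum(axis=0)               # posterior-weighted logits
    # evidence for each domain: probability its expert gave the observed token
    tok_prob = np.array([softmax(l)[observed_token] for l in logits])
    post = decay * post + (1 - decay) * tok_prob / tok_prob.sum()
    return mixed, post / post.sum()
```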
BTM: Key ideas
Domain posterior (using a single merged LLM*)
*parameters of 𝜃j weighted by P(D=j)
combine weights from k LLMs
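A sketch of the parameter-combination alternative: a single merged model whose weights are a convex combination of the experts' weights, with the domain posterior P(D=j) supplying the mixing weights. Identical parameter names and shapes across experts are assumed.

```python
def merge_by_posterior(state_dicts, posterior):
    """state_dicts: list of k {name: tensor}; posterior: k weights summing to 1."""
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(p * sd[name] for p, sd in zip(posterior, state_dicts))
    return merged
```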
BTM: Key ideas
ELM = expert LM
ELMForest = group of ELMs 𝜃1, 𝜃2, …, 𝜃k
branched from a small “seed” GPT 𝜃
merged with parameter averaging
BTM: Experiments with 8 domains
compute-matched perplexity with 8 train/8 eval domains
50% of compute for “seed” model 𝜃
inference-time model merging
BTM: Experiments with 8 domains
compute-matched perplexity with 8 train/8 eval domains
50% of compute for “seed” model 𝜃
model merging via parameter mixing
BTM: Training costs
2024
Background: Mixture of Experts (MoE)
Combine expert outputs E_i(x) with gating weights G(x):
y = Σ_i G(x)_i · E_i(x)
where in training G(x) = SoftMax(x · W_g)
and at inference time only the top-k experts (by gate weight) are evaluated
graphics: https://arxiv.org/pdf/2504.02263
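A minimal PyTorch sketch of such an MoE layer: a gating network G(x) = SoftMax(x · W_g), a set of expert FFNs E_i, and output Σ_i G(x)_i · E_i(x); applying the top-k restriction only at inference follows the framing above, and the dense (non-dispatched) combination is a simplification for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFN(nn.Module):
    def __init__(self, d_model, d_hidden, n_experts, top_k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)      # W_g
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)])
        self.top_k = top_k

    def forward(self, x):                       # x: (n_tokens, d_model)
        g = F.softmax(self.gate(x), dim=-1)     # G(x): (n_tokens, n_experts)
        if not self.training:                   # at inference, keep only top-k experts
            topv, topi = g.topk(self.top_k, dim=-1)
            g = torch.zeros_like(g).scatter_(-1, topi, topv)
            g = g / g.sum(dim=-1, keepdim=True)
        # dense combination for clarity; real systems dispatch tokens sparsely
        outs = torch.stack([e(x) for e in self.experts], dim=-1)   # (n, d, E)
        return (outs * g.unsqueeze(1)).sum(dim=-1)
```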
Background: Expert Parallelism
This can be parallelized!
Often need an extra loss term during training to “balance” load across the experts
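A sketch of one common form of that balance term (in the spirit of the Switch Transformer auxiliary loss): it penalizes routing patterns where a few experts receive most of the tokens. The exact constant and formulation vary by paper.

```python
import torch

def load_balance_loss(gate_probs, expert_index):
    """gate_probs: (n_tokens, n_experts) softmax outputs of the gate.
    expert_index: (n_tokens,) integer tensor of hard (top-1) assignments."""
    n_experts = gate_probs.shape[-1]
    # fraction of tokens dispatched to each expert
    frac_tokens = torch.bincount(expert_index, minlength=n_experts).float()
    frac_tokens = frac_tokens / expert_index.numel()
    # mean gate probability assigned to each expert
    frac_probs = gate_probs.mean(dim=0)
    return n_experts * torch.sum(frac_tokens * frac_probs)
```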
Background: Expert Parallelism
In Transformers, MoE is used for the FFN layers, so tokens are routed to the experts
Examples:
2024
Branch and train as in Branch-Train-Merge
Mix the expert LMs by combining their FFN layers as the experts of MoE layers (averaging the remaining parameters)
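A rough sketch of that mix step, assuming the k experts share identically-shaped Transformer weights: FFN parameters are kept separate (one expert each) and everything else is averaged. The "ffn"-in-the-name convention for spotting FFN weights is an illustrative assumption.

```python
def btx_style_merge(expert_state_dicts):
    """Combine k expert LMs into one MoE-style model:
    average non-FFN parameters, keep FFN parameters per expert."""
    merged, moe_experts = {}, {}
    k = len(expert_state_dicts)
    for name in expert_state_dicts[0]:
        tensors = [sd[name] for sd in expert_state_dicts]
        if "ffn" in name:                       # becomes one MoE expert per model
            moe_experts[name] = tensors         # list of k tensors
        else:                                   # shared parameters: simple average
            merged[name] = sum(tensors) / k
    return merged, moe_experts
```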
BTX details
The experts are different
BTX results
[Plot annotations: math specialist; “data matched”; * = BTX with no parallel training stage]
BTX results
BTX results
FlexOLMO
August 2025
FlexOLMO
Multiple local datasets Di; one public dataset Dpub; one model Mpub trained on Dpub
For each Di
FlexOLMO
Multiple local datasets Di; one public dataset Dpub; one model Mpub trained on Dpub
For each Di
Wr is “router embeddings”
ri is initialized with off-the-shelf embeddings from Di
and then trained pairwise with Mpub
Mixture output
Optionally: create local datasets D’i where each is extracted from Mpub and similar to Di
(These can be small)
Sample uniformly from the “proxy datasets” and Dpub to fine-tune Wr
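A minimal sketch of routing with learned router embeddings Wr = [r1, …, rk]: token hidden states are scored against each ri, the scores are softmaxed, and expert outputs are mixed with those weights. This is a generic reconstruction from the bullets above, not FlexOLMO's exact architecture.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def route_and_mix(h, router_embeddings, expert_outputs):
    """h: (n_tokens, d) hidden states; router_embeddings Wr: (n_experts, d);
    expert_outputs: (n_experts, n_tokens, d) outputs of each expert module."""
    scores = h @ router_embeddings.T              # (n_tokens, n_experts)
    weights = softmax(scores, axis=-1)            # per-token mixture weights
    return np.einsum("te,etd->td", weights, expert_outputs)
```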
FlexOLMO
FlexOLMO
FlexOLMO
TASK VECTORS
Recap: skip-gram embeddings (word2vec)
Training data:
positive examples are pairs of words w(t), w(t+j) that co-occur
Training data:
negative examples are samples of pairs of words w(t), w(t+j) that don’t co-occur
You want to train over a very large corpus (100M+ words) with hundreds of dimensions (or more)
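A minimal NumPy sketch of one skip-gram-with-negative-sampling update for a single (center, context) pair; vocabulary handling, subsampling, and the unigram noise distribution used to draw negatives are omitted.

```python
import numpy as np

def sgns_update(W_in, W_out, center, context, negatives, lr=0.025):
    """W_in, W_out: (vocab, dim) input/output embedding matrices.
    center, context: word indices that co-occur (positive pair).
    negatives: sampled word indices treated as non-co-occurring with `center`."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    v = W_in[center]
    grad_v = np.zeros_like(v)
    for idx, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        u = W_out[idx]
        g = sigmoid(v @ u) - label          # gradient of the logistic loss
        grad_v += g * u
        W_out[idx] -= lr * g * v
    W_in[center] -= lr * grad_v
    return W_in, W_out
```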
Recap: Results from word2vec
A number of properties of word2vec were surprising and mysterious, until they were explained by Omer Levy and Yoav Goldberg a couple of years later
2014
2014
Notation for analogies: a : a* :: b : b*
e.g., man : woman :: king : queen
Word2vec method is “3CosAdd”:
b* = argmax cos(b*, a* − a + b)
   = argmax [ cos(b*, a*) − cos(b*, a) + cos(b*, b) ]
i.e., b* (queen) should be similar to a* (woman), similar to b (king), and dissimilar to a (man)
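A small sketch of the 3CosAdd rule over a dictionary of embedding vectors; excluding the three query words from the candidate set follows the usual evaluation convention.

```python
import numpy as np

def cos(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

def three_cos_add(emb, a, a_star, b):
    """Solve a : a* :: b : ?   e.g. man : woman :: king : queen."""
    best, best_score = None, -np.inf
    for w, vec in emb.items():
        if w in (a, a_star, b):
            continue
        score = cos(vec, emb[a_star]) - cos(vec, emb[a]) + cos(vec, emb[b])
        if score > best_score:
            best, best_score = w, score
    return best
```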
Recap: distributional clustering (with LSH)
…guards at Pentonville prison in North London discovered that an escape attempt…
An American Werewolf in London is to be remade by the son of the original director…
…UK pop up shop on Monmouth Street in London today and on Friday the brand…
v(London): Pentonville, prison, in, North, …, and, on Friday
2014
Key idea: these two vectors are closely related
word2vec embeddings are an (implicit) matrix factorization of the PPMI matrix
Background
Q: what if you use the original sparse (unfactored) PPMI word vectors for analogies?
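For concreteness, a short sketch of how the sparse PPMI vectors are built from a word-context co-occurrence count matrix, using PMI(w,c) = log [ P(w,c) / (P(w)·P(c)) ] clipped at zero.

```python
import numpy as np

def ppmi(counts):
    """counts: (n_words, n_contexts) co-occurrence count matrix.
    Returns the positive pointwise mutual information (PPMI) matrix."""
    total = counts.sum()
    p_wc = counts / total
    p_w = p_wc.sum(axis=1, keepdims=True)
    p_c = p_wc.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(p_wc > 0, p_wc / (p_w * p_c), 1.0)   # log(1)=0 for empty cells
    return np.maximum(np.log(ratio), 0.0)                      # clip negative PMI to 0
```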
2014
Notation for analogies: a : a* :: b : b*
e.g., man : woman :: king : queen
Word2vec method is “3CosAdd”:
b* = argmax cos(b*, a* − a + b)
   = argmax [ cos(b*, a*) − cos(b*, a) + cos(b*, b) ]
i.e., b* (queen) should be similar to a* (woman), similar to b (king), and dissimilar to a (man)
2014
[Results: analogy accuracy for sparse PPMI vectors vs. word2vec]
Observation: old-school sparse vectors also work for analogies,
but not as well as word2vec using the 3CosAdd rule
i.e., b* (queen) should be similar to a* (woman), similar to b (king), and dissimilar to a (man)
3CosMul: b* = argmax [ cos(b*, a*) · cos(b*, b) ] / [ cos(b*, a) + ε ]
*rescale similarities to [0, 1]
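And a matching sketch of 3CosMul, with cosine similarities rescaled to [0, 1] as noted above and a small ε to avoid division by zero.

```python
import numpy as np

def three_cos_mul(emb, a, a_star, b, eps=1e-3):
    """Solve a : a* :: b : ? with the multiplicative (3CosMul) rule."""
    sim01 = lambda u, v: (1.0 + (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))) / 2.0
    best, best_score = None, -np.inf
    for w, vec in emb.items():
        if w in (a, a_star, b):
            continue
        score = sim01(vec, emb[a_star]) * sim01(vec, emb[b]) / (sim01(vec, emb[a]) + eps)
        if score > best_score:
            best, best_score = w, score
    return best
```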
2014
[Results: analogy accuracy for sparse PPMI vectors vs. word2vec, with 3CosAdd and with 3CosMul]
So: old-school sparse vectors also work for analogies.
Discussion: Analogies
It would be great if vector-based representations did have more modularity: e.g., add “past tense” to a word sense
Experiment with task negation
[Plot: comparing 𝜃, 𝜃 + 𝜏toxic, 𝜃 − 𝜏toxic, and 𝜃 + 𝜏nice, with task vectors of the same magnitude]
Subtracting the task vector is a way of “forgetting” how to perform that task!
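A minimal sketch of the task-vector arithmetic: 𝜏 is the difference between fine-tuned and pre-trained weights, and negation subtracts a scaled copy of it from the pre-trained model.

```python
def task_vector(pretrained, finetuned):
    """tau = theta_finetuned - theta_pretrained, per parameter tensor."""
    return {k: finetuned[k] - pretrained[k] for k in pretrained}

def apply_task_vector(pretrained, tau, lam=1.0):
    """theta' = theta + lam * tau; use lam < 0 to 'forget' the task."""
    return {k: pretrained[k] + lam * tau[k] for k in pretrained}
```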
Experiment with scaled task negation
[Plot: scaled task negation, comparing 𝜃, 𝜃 + 𝜏toxic, and 𝜃 + 𝜏nice, with task vectors of the same magnitude]
Experiment with scaled task negation
[Plot: scaled task negation of 𝜃 + 𝜏 vs. 𝜃, for a smaller and a larger model]
Experiments with task analogies
λ1𝜏AmazonSentiment + (λ2𝜏YelpLM − λ3𝜏AmazonLM)
Experiments with task analogies
λ1𝜏YelpSentiment + (λ2𝜏AmazonLM − λ3𝜏YelpLM)
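With the helpers above, the task-analogy construction is just arithmetic on task vectors (the λ's are tuned on a validation set); the argument names here are illustrative.

```python
def task_analogy(tau_src_task, tau_tgt_lm, tau_src_lm, lam=(1.0, 1.0, 1.0)):
    """tau_tgt_task ≈ lam1*tau_src_task + (lam2*tau_tgt_lm - lam3*tau_src_lm)."""
    l1, l2, l3 = lam
    return {k: l1 * tau_src_task[k] + l2 * tau_tgt_lm[k] - l3 * tau_src_lm[k]
            for k in tau_src_task}
```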
Experiments with task analogies
Image classification tasks with CLIP; slices are based on an image style and class.
Also using λ’s picked by validation sets
Experiments with task analogies
Collect 200 images of kings, queens, women, men.
Fine-tune CLIP models on each category, using 1000 ImageNet classes as negative examples
(class name “something”)
Experiment: adding many task vectors
𝜃
𝜃 + 𝜏*
Note the training for the subtasks is embarrassingly parallel
Experiment: adding different task vectors
Note the training for the subtasks is embarrassingly parallel
Experiment: adding different task vectors
Note the training for the subtasks is embarrassingly parallel
Why does this work?
Task vectors are mostly orthogonal except when tasks are related
MNIST / SVHN are both digit recognition; GTSRB is reading traffic signs
EuroSAT / RESISC45 are satellite images
Why does this work?
“Weight disentanglement”
2023
Disentanglement error of ⍺1𝜏1 and ⍺2𝜏2:
ξ(⍺1, ⍺2) = Σ_{i=1,2} E_{x∼μi} [ dist( f(x; 𝜃0 + ⍺i𝜏i), f(x; 𝜃0 + ⍺1𝜏1 + ⍺2𝜏2) ) ]
with dist(f, g) = 1 if the predicted labels differ, 0 otherwise
2023
Theory:
2023
Focus on merging models for different tasks
Question: how do you merge models when there is some “entanglement”? Can you improve on task arithmetic?
Question: how do you merge models when there is some “entanglement”?
average accuracy over 11 task vectors
Question: how do you merge models when there is some “entanglement”?
TIES-Merge – TrIm, Elect Sign, and Merge
TIES-Merge – TrIm, Elect Sign, and Merge
11 NLP / 8 vision datasets; k=20% and λ=1 w/o validation set
merged-model accuracy as a fraction of fine-tuned (FT) accuracy, when merging multiple tasks
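A minimal NumPy sketch of the three TIES steps, per (flattened) task vector: trim to the top-k% largest-magnitude entries, elect a sign per coordinate by total magnitude, then average only the entries that agree with the elected sign; the result is applied as 𝜃 + λ · merged.

```python
import numpy as np

def ties_merge(task_vectors, k=0.20):
    """task_vectors: list of same-shape 1-D arrays (one flattened tau per task)."""
    trimmed = []
    for tau in task_vectors:
        thresh = np.quantile(np.abs(tau), 1.0 - k)      # keep top-k% by magnitude
        trimmed.append(np.where(np.abs(tau) >= thresh, tau, 0.0))
    trimmed = np.stack(trimmed)                          # (n_tasks, dim)
    elected = np.sign(trimmed.sum(axis=0))               # elect a sign per coordinate
    agree = (np.sign(trimmed) == elected) & (trimmed != 0)
    n_agree = np.maximum(agree.sum(axis=0), 1)           # avoid divide-by-zero
    merged = (trimmed * agree).sum(axis=0) / n_agree     # mean over agreeing entries
    return merged                                        # apply as theta + lam * merged
```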
LoraHub: Method
Other results:
2024
Post-Hoc Adaptive Tokenwise Gating Over an Ocean of Specialized Adapters (PHATGOOSE)
PHATGOOSE methods
PHATGOOSE results