1 of 75

10-605 / 10-805

Machine Learning from Large Datasets

2 of 75

Recap: the course so far

  • Why big data is important in ML
  • History of big data in ML
  • Scalable computation on clusters
    • MapReduce
    • Spark and iteration
    • Sample Spark workflows
  • Learning as optimization
    • MLE for binomials, linear regression, logistic regression, k-means clustering
    • exact solutions, gradient descent (batch, parallel, stochastic), EM and EM-like methods
  • “Craft” of ML
    • feature extraction / augmentation
      • kernel methods – working in a different feature space
    • regularization
    • hyperparameter tuning


3 of 75

Today

  • Scalable optimization
    • sparsity and how it changes optimization
    • the hash trick as a kernel
    • distributed ML


4 of 75

SPARSE GRADIENTS FOR SPARSE DATA FOR LINEAR/LOGISTIC REGRESSION

5 of 75

Summary: SGD for least squares and Logistic Regression


Initialize w0

For i=1, … until converged:

  • Compute gradient g
  • Update: wi = wi-1 - ⍺ g

6 of 75

Summary: SGD for least squares and Logistic Regression


Initialize w0

For i=1, … until converged:

  • Compute gradient g
  • Update: wi = wi-1 - ⍺ g

For text problems:

  • x is a sparse document vector (e.g., of TFIDF weights)
  • w is a dense vector of weights (one for each word in the vocabulary)

How do we represent x and w in our program?

Compact representation:

  • w is a V-dimensional array of floats
    • w[t] is weight of term with integer id t
  • x is a sparse vector
    • internally a list of pairs: [(t1, x[t1]), … (tD, x[tD])]

A hash table is used to convert strings (“aardvark”) to term ids (e.g., 5317)

Remember, we may be moving these across the network… and storing millions of them

7 of 75

Summary: SGD for least squares and Logistic Regression


Initialize w0

For i=1, … until converged:

  • Compute gradient g
  • Update: wi = wi-1 - ⍺ g

For text problems:

  • x is a sparse document vector (e.g., of TFIDF weights)
  • w is a dense vector of weights (one for each word in the vocabulary)

If x is a sparse vector….

The gradient is zero where x is zero 🡺 gradient is sparse

8 of 75

Sparse updates for logistic regression

  • The algorithm:


9 of 75

Sparsity can cause overfitting

  • Consider averaging the gradient over all the examples D={(x1,y1)…,(xn , yn)}


  • This will overfit badly with sparse features
    • e.g., if feature j appears only in positive examples, its gradient is always positive!

10 of 75

Recap: Ridge regression
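The equations on this slide were images; as a reminder, the standard ridge objective adds an L2 penalty to least squares:

$$\min_w \; \sum_{i=1}^n (y_i - w^\top x_i)^2 + \lambda \lVert w \rVert_2^2$$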


11 of 75

Regularized Logistic Regression


This update is not sparse

Dense vector updates aren’t always a problem in practice but …
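Concretely (a standard formulation, since the slide’s equations were images, and not necessarily the slide’s exact notation): the L2 penalty adds a λw term to the gradient, so the per-example update becomes

$$w_j \leftarrow w_j - \alpha \lambda w_j + \alpha\,(y_i - p_i)\,x_{ij} \quad \text{for every } j$$

and the first term is nonzero for all j even when x_i is sparse, which is what makes the update dense.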

12 of 75

Sparse Logistic Regression (sketch)

  • Final algorithm:
  • Initialize hashtable W
  • For each iteration t=1,…T
    • For each example (xi,yi)
      • pi = …
      • For each feature W[j]
        • W[j] = W[j] - λW[j]
        • If xij>0 then
          • W[j] = W[j] + ⍺ (yi - pi) xij

The W[j] = W[j] - λW[j] decay line touches every weight: not sparse. The xij > 0 update touches only active features: sparse.

13 of 75

Sparse Logistic Regression (sketch)

  • Final algorithm:
  • Initialize hashtable W
  • For each iteration t=1,…T
    • For each example (xi,yi)
      • pi = …
      • For each feature W[j]
        • W[j] *= (1 - ⍺ λ )
        • If xij>0 then
          • W[j] = W[j] + ⍺ (yi - pi) xij


“weight decay”

14 of 75

Sparse Logistic Regression (sketch)

  • Final algorithm:
  • Initialize hashtable W
  • For each iteration t=1,…T
    • For each example (xi,yi)
      • pi = …
      • For each feature W[j]
        • If xij>0 then
          • W[j] *= (1 - ⍺λ)^A
          • W[j] = W[j] + ⍺ (yi - pi) xij

A is the number of examples seen since the last time we did an xij > 0 update on W[j]

15 of 75

Sparse Logistic Regression (sketch)

  • Final algorithm:
  • Initialize hashtables W, A and set k=0
  • For each iteration t=1,…T
    • For each example (xi,yi)
      • pi = …
      • For each feature W[j]
        • If xij>0 then
          • W[j] *= (1 - ⍺λ)^(k - A[j])
          • W[j] = W[j] + ⍺ (yi - pi) xij
          • A[j] = k


  • k = “clock” reading
  • A[j] = clock reading last time feature j was “active”
  • we implement the “weight decay” update using a “lazy” strategy: weights are decayed in one shot when a feature is “active”

One more “weight decay” step is needed at the end of SGD
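Putting the pieces together, here is a minimal runnable sketch of the lazy strategy (same assumptions as before: dict-based sparse vectors, y ∈ {0, 1}, and the descent sign convention w ← w + ⍺(y − p)x):

```python
import math

def sparse_regularized_sgd(data, alpha, lam, T):
    """Lazily regularized SGD for logistic regression.
    data: list of (x, y), x a dict {feature_id: value}, y in {0, 1}.
    W[j]: weight of feature j; A[j]: clock value at j's last update."""
    W, A, k = {}, {}, 0
    for _ in range(T):
        for x, y in data:
            # catch up on the decay each active weight has been skipping
            for j in x:
                W[j] = W.get(j, 0.0) * (1 - alpha * lam) ** (k - A.get(j, 0))
                A[j] = k
            p = 1.0 / (1.0 + math.exp(-sum(W[j] * xj for j, xj in x.items())))
            for j, xj in x.items():
                W[j] += alpha * (y - p) * xj
            k += 1
    # one final decay so every weight reflects all k steps
    for j in W:
        W[j] *= (1 - alpha * lam) ** (k - A[j])
    return W
```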

16 of 75

THE HASH KERNEL

Also called the “hash trick”: a trick to exploit sparse features

17 of 75

Recap: SGD for Logistic Regression

18 of 75

The “hash trick”

Assume a hash function h that maps string features j to indices in an array V, initialized to V = <0…0>.

To add feature j with value x[j]: V[h(j)] = V[h(j)] + x[j]

Ignore collisions!
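A minimal sketch of the trick (the md5-based hash and the sizes here are illustrative choices, not from the slide):

```python
import hashlib

def h(feature, m):
    """Stable hash of a string feature to a bucket index in [0, m)."""
    return int(hashlib.md5(feature.encode("utf8")).hexdigest(), 16) % m

def hash_features(x, m):
    """Map a sparse string-keyed vector x = {"word": value, ...} into an
    m-dimensional array phi(x). Colliding features simply share a bucket."""
    v = [0.0] * m
    for feature, value in x.items():
        v[h(feature, m)] += value
    return v

phi = hash_features({"aardvark": 1.0, "dear": 2.0}, m=2**20)
```

Note there is no string-to-id hash table to build, store, or ship across the network: h replaces it.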

19 of 75

The “hash trick”: intuition

  • We’ve replaced our old set of features with new ones: x → ɸ(x), where ɸ(x) might be shorter and denser
  • Each new feature is the sum of all the old features that hash to it:

ɸ(x)[k] = Σ_{j : h(j) = k} x[j]

  • If most of the original sparse features aren’t too important for the task, this won’t hurt us (much)

  • Often in classification there are a few very important words and lots of irrelevant ones.
  • It’s unlikely that important features are hashed together.

20 of 75

The “hash trick”: pros/cons

  • Memory: Without the hash trick, vocabulary size usually grows with the dataset
    • For NLP tasks, usually V = O(sqrt(N)) (Heaps’ law)
    • Now it stays constant ☺
      • Unless you change the hash table size
  • Compute:
    • It’s much faster to ignore collisions ☺
  • Debugging:
    • Now features are meaningless ☹

21 of 75

[Paper: Weinberger et al., “Feature Hashing for Large Scale Multitask Learning,” Proc. of ML Research, 2009; revised 2010]

22 of 75

Motivation

  • Authors are interested in fast large-scale linear classifiers
    • Sparse regularized logistic regression, and linear SVMs
    • Treat the “hash trick” as a kernel
      • But solved in the primal space, so it’s really feature engineering

23 of 75

Motivation

  • Hashing means we can introduce lots of features
    • We can handle more than just words on a page:
      • If we care about misspelled terms like “Schmithuber” we can hash in all substrings up to some length
  • Pipeline is set up to compute features and immediately hash them—never serialize ɸ
  • Minimal cost for i/o bound jobs

24 of 75

Motivation

  • If we care about feature interactions
    • we just add the interactions we care about explicitly (as in the quadratic-feature example for linear regression) and then hash them
  • Special case used for multiclass classification:
    • For documents in class “news”, replace words like “tariff” with “tariff#news”, and so on
    • Classify feature vectors like this as +1 (created from the true class) or 0 (created from an incorrect class)
    • Cycle through possible classes at test time and see which assigned class looks “most positive”
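A sketch of this construction (the function names are hypothetical; hash_features is the sketch from the earlier hash-trick slide):

```python
def class_conjoined(tokens, label):
    """Replace each token t with "t#label", as in "tariff#news"."""
    return {f"{t}#{label}": 1.0 for t in tokens}

# At test time, score every candidate class and pick the most positive one:
# best = max(classes,
#            key=lambda c: score(hash_features(class_conjoined(doc, c), m)))
```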

25 of 75

Formal Results

  • Q: How different is the original kernel K(x,x’) from the hash kernel?
  • A: there is a bias due to adding collisions, but it’s minimal

26 of 75

Formal Results (sketch)

  • Q: How different is the original kernel K(x,x’) from the hash kernel?
  • A: there is a bias due to adding collisions, but it’s minimal

Note this doesn’t prove the hash kernel “works” – it does give some intuition that the hash trick won’t change the problem “too much”

27 of 75

Formal results

  • Observation: in NLP (especially with substrings and term#class features) there are often highly correlated “duplicate” features
    • Or you can create duplicates explicitly by hashing “William#1”, “William#2”, “William#3”, etc.
  • Q: How much loss of information is there if you assume duplicates exist?
  • A: very little!
    • E.g., with 10k features, 10M hash buckets, and 3 duplicates of each feature, Pr(all duplicates of some feature collide) ~= 1%

28 of 75

Experimental results: RCV1

Hash kernel version of VW learner

29 of 75

Experimental results: RCV1

30 of 75

Spam filtering at Yahoo

  • 3.2M emails, 433k users
  • Emails have tokens and user-id. Features:
    • original tokens (hashed): “dear”, “sir”, “please”, ”accept”, …
    • original tokens paired with user id: “dear#wcohen”, “sir#wcohen”, …
  • Everything is put in one giant classifier
    • Intuition: the model will learn something from the general features and can refine the model with user-specific features if there is enough data from that user

From the Weinberger et al. 2010 paper

31 of 75

An example

2^26 entries ≈ 0.5 GB @ 8 bytes/weight

32 of 75

DISTRIBUTED MACHINE LEARNING


33 of 75

Distributed ML

  • Data parallelism
    • replicate the model and partition the data
    • our focus so far
  • Model parallelism (related to “pipeline parallelism”)
    • partition the model
      • e.g., each worker has different parts of a huge LLM
  • Tensor parallelism (related to “Zero Redundancy Optimizer, ZeRO”)
    • partition individual tensor/matrix operations

  • Context parallelism (for long-context LLMs)


34 of 75

Distributed ML

  • Data parallelism
    • workers get parts (usually disjoint partitions) of the data and copies of the current model
    • workers optimize independently for a while
      • without communication
    • workers communicate and synchronize models
  • Decisions to consider:
    • when do you synchronize?
      • when is it ok to optimize a “stale” set of weights?
    • how do you synchronize?
      • how do you send information?
      • how do you compress or sparsify gradients?


35 of 75

Data Parallel ML: Maximal synchronization: batch gradient (recap)


36 of 75

Data Parallel ML: Minimal synchronization 1


NeurIPS 2009

37 of 75

Data Parallel ML: Minimal synchronization 1

  • ML models they compared:
    • Distributed batch gradient
    • Two minimal-synchronization methods, both learning P different logistic regression classifiers (one per shard, batch gradient descent within each shard):
        • Combine predictions by majority vote
        • Combine weight vectors by averaging
  • All methods are comparable in CPU time and accuracy, but the minimal-synchronization methods use 1000x less network bandwidth
  • Can prove the combination has lower variance than a single shard’s model, but not lower bias


38 of 75

Data Parallel ML: Minimal synchronization 2


NeurIPS 2010

39 of 75

Data Parallel ML: Minimal synchronization 2

  • Comparisons
    • The minimal-synchronization method
      • Learn P different logistic regression classifiers, each with stochastic gradient descent over the T examples in its own shard
      • Combine weight vectors by averaging
  • Can make formal bounds on the variance and expectation of the averaged parameter vector
    • relative to a single SGD run over T examples


40 of 75

[Plots: training loss and test loss for the compared methods.]

Note: This is a convex optimization problem!

41 of 75

Data Parallel ML: Minimal synchronization 3


NeurIPS 2011

42 of 75

Data Parallel ML: Minimal synchronization 3

  • Setting:
    • Multiple SGD processes on a multicore CPU with shared memory
    • Sparse gradient updates
      • Matrix factorization
      • SVM with sparse features
    • No locking when you update gradients!
      • One process’s update might overwrite another
      • But that’s just making GD more stochastic
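A minimal lock-free sketch of the idea (with the caveat that CPython’s GIL serializes these threads, so this illustrates the scheme rather than the speedup; the hinge-loss/SVM update is one of the sparse cases considered):

```python
import threading

def hogwild_sgd(examples, dim, alpha, n_threads=4):
    """Lock-free parallel SGD sketch: all threads share w and update it
    with no locks, so racing updates can overwrite each other -- that
    just makes the descent a little more stochastic."""
    w = [0.0] * dim  # shared weight vector, written without synchronization

    def worker(shard):
        for x, y in shard:      # x: dict {index: value} (sparse), y: -1 or +1
            margin = y * sum(w[j] * xj for j, xj in x.items())
            if margin < 1.0:    # hinge-loss (linear SVM) subgradient step
                for j, xj in x.items():
                    w[j] += alpha * y * xj   # unsynchronized sparse write

    shards = [examples[i::n_threads] for i in range(n_threads)]
    threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w
```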


43 of 75

Data Parallel ML: Minimal synchronization 3

  • HogWild! optimization is still in use
    • e.g. FastText:
      • hash trick, n-grams, learned word representations, hierarchical softmax, and HogWild! optimization
      • very fast learner/classifier with ~ 20 threads on CPUs with fast RAID disks


44 of 75

Data Parallel ML: Intermediate Synchronization


NAACL 2010

45 of 75

Parallel Perceptrons*

  • Simplest idea (minimal synchronization):
    • Split data into S “shards”
    • Train a perceptron on each shard independently
      • weight vectors are w(1) , w(2) , …
    • Produce some weighted average of the w(i)‘s as the final result


*similar to logistic regression with SGD

46 of 75

Parallel Perceptrons – take 2


Better idea: do the simplest possible thing iteratively.

  • Split the data into shards
  • Let w = 0
  • For n=1,…
    • Train a perceptron on each shard with one pass starting with w
    • Average the weight vectors and let w be that average

Extra communication cost:

  • redistributing the weight vectors
  • done less frequently than if fully synchronized, more frequently than if fully parallelized
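A sketch of the loop (serial here for clarity; in the actual setup each shard’s pass runs in parallel; assumes y ∈ {-1, +1} and sparse examples as (index, value) lists):

```python
def iterative_parameter_mixing(shards, dim, n_epochs):
    """Each epoch: run one perceptron pass per shard starting from the
    shared w, then average the shard weight vectors into the new w."""
    w = [0.0] * dim
    for _ in range(n_epochs):
        shard_ws = []
        for shard in shards:                 # in reality these run in parallel
            wk = list(w)                     # start from the mixed weights
            for x, y in shard:               # x: list of (index, value)
                if y * sum(wk[j] * v for j, v in x) <= 0:   # mistake
                    for j, v in x:
                        wk[j] += y * v       # perceptron update
            shard_ws.append(wk)
        w = [sum(ws[j] for ws in shard_ws) / len(shard_ws) for j in range(dim)]
    return w
```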

47 of 75

Parallelizing perceptrons – take 2

[Diagram: the instances/labels are split into example subsets 1–3; starting from the previous w, each subset computes a local weight vector w1, w2, w3 in parallel; these are combined by averaging into the new w.]

48 of 75

Formal Results

  • Background: Perceptrons are guaranteed to converge
    • Given the data has a compact “radius” and there is a classifier with “large margin”
  • The parallel “iterative parameter mixing” method is also guaranteed to converge under the same conditions ☺
    • But not to converge faster ☹


49 of 75

Results


50 of 75

Distributed ML

  • Data parallelism
    • workers get parts (usually disjoint partitions) of the data and copies of the current model
    • workers optimize independently for a while
      • without communication
    • workers communicate and synchronize models
  • Decisions to consider:
    • when do you synchronize?
      • when is it ok to optimize a “stale” set of weights?
    • how do you synchronize?
      • how do you send information?
      • how do you compress or sparsify information?


51 of 75

Matrix Factorization

Another use of SGD


52 of 75

Recovering latent factors in a matrix

[Diagram: an n rows × m columns matrix V with entries v11 … vij … vnm.]

53 of 75

Recovering latent factors in a matrix

[Diagram: the n × m matrix V (entries v11 … vij … vnm) approximated as the product of an n × K matrix whose rows are (x1, y1), …, (xn, yn) and a K × m matrix whose columns are (a1, b1), …, (am, bm); here K = 2.]

54 of 75

Recovering latent factors in a matrix

[Diagram: the same factorization with n users and m movies: row i of the user-factor matrix is (xi, yi), and column j of the movie-factor matrix is (aj, bj).]

V[i,j] = user i’s rating of movie j

55 of 75


56 of 75

MF for images

[Diagram: a 1000 × 10,000 matrix (1000 images, 10,000 pixels each) factored the same way.]

V[i,j] = pixel j in image i

[The factorization uses K = 2: each image is a weighted combination of 2 prototype images, P1 and P2.]

57 of 75


58 of 75

Recovering latent factors in a matrix

[Diagram: the same factorization applied to the n documents × m terms doc-term matrix.]

V[i,j] = TFIDF score of term j in doc i

59 of 75

k-means Clustering

[Figure: clustered data points with the cluster centroids marked.]

60 of 75

k-means as MF

[Diagram: X ≈ Z M. X is the original data set (n examples), Z is a matrix of 0/1 indicators assigning each example to one of r clusters, and M is the matrix of cluster means.]

61 of 75

Matrix Factorization as Optimization


62 of 75


KDD 2011

63 of 75

Recovering latent factors in a matrix as optimization

[Diagram: V ≈ W H with rank r, where V is the n users × m movies rating matrix, V[i,j] = user i’s rating of movie j. Z is the set of non-zero (observed) entries, so this is matrix completion.]

64 of 75

Matrix factorization as SGD

[Equations: the per-entry SGD updates for the factors, with the step size ⍺ marked.]

why does this work?
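Since the update equations themselves were images, here is a minimal sketch of the scheme under squared loss on the observed entries (no regularization; the initialization and step size are illustrative):

```python
import random

def sgd_mf(observed, n, m, K, alpha, n_steps):
    """SGD for V ~= W H using squared loss on the observed entries.
    observed: list of (i, j, v) triples with v = V[i][j]."""
    W = [[random.gauss(0, 0.1) for _ in range(K)] for _ in range(n)]
    H = [[random.gauss(0, 0.1) for _ in range(m)] for _ in range(K)]
    for _ in range(n_steps):
        i, j, v = random.choice(observed)    # one observed rating
        err = v - sum(W[i][k] * H[k][j] for k in range(K))
        for k in range(K):   # sparse: only row i of W and column j of H change
            w_ik = W[i][k]
            W[i][k] += alpha * err * H[k][j]
            H[k][j] += alpha * err * w_ik
    return W, H
```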

65 of 75

This is a sparse update


66 of 75

Checking the claim

Think of SGD for logistic regression:

  • the LR loss compares y with ŷ = dot(w, x)
  • MF is similar, but now we update both w (the user’s weights) and x (the movie’s weights)

67 of 75

What loss functions are possible?

N1, N2: diagonal matrices, sort of like IDF factors for the users/movies

“generalized” KL-divergence

68 of 75


ALS = alternating least squares
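For contrast with SGD (a standard formulation, not the slide’s own notation): ALS alternates two convex least-squares solves, each exact given the other factor held fixed,

$$W \leftarrow \arg\min_{W} \lVert V - WH \rVert_F^2, \qquad H \leftarrow \arg\min_{H} \lVert V - WH \rVert_F^2,$$

with the norms restricted to the observed entries Z in the matrix-completion setting.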

69 of 75


talk pilfered from 🡪 …..

KDD 2011

70 of 75

[Plot legend: iterative SGD (no mixing), limited-memory quasi-Newton, parameter mixing (IPM), alternating least squares (ALS).]

71 of 75

Sparsity of the MF update


72 of 75


73 of 75


74 of 75

[Diagram (Epoch 1 of the stratified SGD scheme): V is split into 3 × 3 blocks Vij, W into row blocks W1–W3, and H into column blocks H1–H3, across Nodes 1–3. In sub-step Diag 1 the nodes process blocks V11, V22, V33 in parallel; in Diag 2, blocks V12, V23, V31; in Diag 3, blocks V13, V21, V32. The blocks in each diagonal share no rows of W and no columns of H, so the nodes’ updates never conflict.]

75 of 75
