1 of 75

10-605 / 10-805

Machine Learning from Large Datasets

2 of 75

Recap: the course so far

  • Why big data is important in ML
  • History of big data in ML
  • Scalable computation on clusters
    • MapReduce
    • Spark and iteration
    • Sample Spark workflows
  • Learning as optimization
    • MLE for binomials, linear regression, logistic regression, k-means clustering
    • exact solutions, gradient descent (batch, parallel, stochastic), EM and EM-like methods
  • “Craft” of ML
    • feature extraction / augmentation
      • kernel methods – working in a different feature space
    • regularization
    • hyperparameter tuning


3 of 75

Today

  • Scalable optimization
    • sparsity and how it changes optimization
    • the hash trick as a kernel
    • distributed ML


4 of 75

SPARSE GRADIENTS FOR SPARSE DATA FOR LINEAR/LOGISTIC REGRESSION

5 of 75

Summary: SGD for least squares and Logistic Regression


Initialize w0

For i=1, … until converged:

  • Compute gradient g
  • Update: wi = wi-1 - ⍺ g

6 of 75

Summary: SGD for least squares and Logistic Regression


Initialize w0

For i=1, … until converged:

  • Compute gradient g
  • Update: wi = wi-1 - ⍺ g

For text problems:

  • x is a sparse document vector (e.g., of TFIDF weights)
  • w is a dense vector of weights (one for each word in the vocabulary)

How do we represent x and w in our program?

Compact representation:

  • w is a V-dimensional array of floats
    • w[t] is weight of term with integer id t
  • x is a sparse vector
    • internally a list of pairs: [(t1, x[t1]), … (tD, x[tD])]

A hash table is used to convert strings (“aardvark”) to term ids (e.g., 5317)

Remember, we may be moving these across the network… and storing millions of them

7 of 75

Summary: SGD for least squares and Logistic Regression


Initialize w0

For i=1, … until converged:

  • Compute gradient g
  • Update: wi = wi-1 - ⍺ g

For text problems:

  • x is a sparse document vector (e.g., of TFIDF weights)
  • w is a dense vector of weights (one for each word in the vocabulary)

If x is a sparse vector….

The gradient is zero where x is zero 🡺 gradient is sparse

8 of 75

Sparse updates for logistic regression

  • The algorithm:


9 of 75

Sparsity can cause overfitting

  • Consider averaging the gradient over all the examples D={(x1,y1)…,(xn , yn)}


  • This will overfit badly with sparse features
    • e.g., if feature j appears only in positive examples, its gradient is always positive!

10 of 75

Recap: Ridge regression
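The equations on this slide were images; as a reminder, the standard ridge objective adds an L2 penalty to least squares:

$$\min_w \; \sum_{i=1}^n (y_i - w^\top x_i)^2 + \lambda \lVert w \rVert_2^2$$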


11 of 75

Regularized Logistic Regression


This update is not sparse

Dense vector updates aren’t always a problem in practice but …
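Concretely (a standard formulation, since the slide’s equations were images, and not necessarily the slide’s exact notation): the L2 penalty adds a λw term to the gradient, so the per-example update becomes

$$w_j \leftarrow w_j - \alpha \lambda w_j + \alpha\,(y_i - p_i)\,x_{ij} \quad \text{for every } j$$

and the first term is nonzero for all j even when x_i is sparse, which is what makes the update dense.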

12 of 75

Sparse Logistic Regression (sketch)

  • Final algorithm:
  • Initialize hashtable W
  • For each iteration t=1,…T
    • For each example (xi,yi)
      • pi = …
      • For each feature W[j]
        • W[j] = W[j] - λW[j]
        • If xij>0 then
          • W[j] = W[j] + ⍺ (yi - pi) xij

The W[j] = W[j] - λW[j] decay line touches every weight: not sparse. The xij > 0 update touches only active features: sparse.

13 of 75

Sparse Logistic Regression (sketch)

  • Final algorithm:
  • Initialize hashtable W
  • For each iteration t=1,…T
    • For each example (xi,yi)
      • pi = …
      • For each feature W[j]
        • W[j] *= (1 - ⍺ λ )
        • If xij>0 then
          • W[j] = W[j] + ⍺ (yi - pi) xij


“weight decay”

14 of 75

Sparse Logistic Regression (sketch)

  • Final algorithm:
  • Initialize hashtable W
  • For each iteration t=1,…T
    • For each example (xi,yi)
      • pi = …
      • For each feature W[j]
        • If xij>0 then
          • W[j] *= (1 - ⍺λ)^A
          • W[j] = W[j] + ⍺ (yi - pi) xij

A is the number of examples seen since the last time we did an xij > 0 update on W[j]

15 of 75

Sparse Logistic Regression (sketch)

  • Final algorithm:
  • Initialize hashtables W, A and set k=0
  • For each iteration t=1,…T
    • For each example (xi,yi)
      • pi = …
      • For each feature W[j]
        • If xij>0 then
          • W[j] *= (1 - ⍺λ)^(k - A[j])
          • W[j] = W[j] + ⍺ (yi - pi) xij
          • A[j] = k


  • k = “clock” reading
  • A[j] = clock reading last time feature j was “active”
  • we implement the “weight decay” update using a “lazy” strategy: weights are decayed in one shot when a feature is “active”

One more “weight decay” step is needed at the end of SGD
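Putting the pieces together, here is a minimal runnable sketch of the lazy strategy (same assumptions as before: dict-based sparse vectors, y ∈ {0, 1}, and the descent sign convention w ← w + ⍺(y − p)x):

```python
import math

def sparse_regularized_sgd(data, alpha, lam, T):
    """Lazily regularized SGD for logistic regression.
    data: list of (x, y), x a dict {feature_id: value}, y in {0, 1}.
    W[j]: weight of feature j; A[j]: clock value at j's last update."""
    W, A, k = {}, {}, 0
    for _ in range(T):
        for x, y in data:
            # catch up on the decay each active weight has been skipping
            for j in x:
                W[j] = W.get(j, 0.0) * (1 - alpha * lam) ** (k - A.get(j, 0))
                A[j] = k
            p = 1.0 / (1.0 + math.exp(-sum(W[j] * xj for j, xj in x.items())))
            for j, xj in x.items():
                W[j] += alpha * (y - p) * xj
            k += 1
    # one final decay so every weight reflects all k steps
    for j in W:
        W[j] *= (1 - alpha * lam) ** (k - A[j])
    return W
```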

16 of 75

THE HASH KERNEL

Also called the “hash trick”: a trick to exploit sparse features

17 of 75

Recap: SGD for Logistic Regression

18 of 75

The “hash trick”

Assume a hash function h that maps string features j to indices in an array V, initialized to V = <0…0>.

To add feature j with value x[j]: V[h(j)] = V[h(j)] + x[j]

Ignore collisions!
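A minimal sketch of the trick (the md5-based hash and the sizes here are illustrative choices, not from the slide):

```python
import hashlib

def h(feature, m):
    """Stable hash of a string feature to a bucket index in [0, m)."""
    return int(hashlib.md5(feature.encode("utf8")).hexdigest(), 16) % m

def hash_features(x, m):
    """Map a sparse string-keyed vector x = {"word": value, ...} into an
    m-dimensional array phi(x). Colliding features simply share a bucket."""
    v = [0.0] * m
    for feature, value in x.items():
        v[h(feature, m)] += value
    return v

phi = hash_features({"aardvark": 1.0, "dear": 2.0}, m=2**20)
```

Note there is no string-to-id hash table to build, store, or ship across the network: h replaces it.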

19 of 75

The “hash trick”: intuition

  • We’ve replaced our old set of features with new ones: x → ɸ(x), where ɸ(x) might be shorter and denser
  • Each new feature is the sum of all the old features that hash to it:

ɸ(x)[k] = Σ_{j : h(j) = k} x[j]

  • If most of the original sparse features aren’t too important for the task, this won’t hurt us (much)

  • Often in classification there are a few very important words and lots of irrelevant ones.
  • It’s unlikely that important features are hashed together.

20 of 75

The “hash trick”: pros/cons

  • Memory: Without the hash trick, vocabulary size usually grows with the dataset
    • For NLP tasks, usually V = O(sqrt(N)) (Heaps’ law)
    • Now it stays constant ☺
      • Unless you change the hash table size
  • Compute:
    • It’s much faster to ignore collisions ☺
  • Debugging:
    • Now features are meaningless ☹

21 of 75

[Paper: Weinberger et al., “Feature Hashing for Large Scale Multitask Learning,” Proc. of ML Research, 2009; revised 2010]

22 of 75

Motivation

  • Authors are interested in fast large-scale linear classifiers
    • Sparse regularized logistic regression, and linear SVMs
    • Treat the “hash trick” as a kernel
      • But solved in the primal space, so it’s really feature engineering

23 of 75

Motivation

  • Hashing means we can introduce lots of features
    • We can handle more than just words on a page:
      • If we care about misspelled terms like “Schmithuber” we can hash in all substrings up to some length
  • Pipeline is set up to compute features and immediately hash them—never serialize ɸ
  • Minimal cost for i/o bound jobs

24 of 75

Motivation

  • If we care about feature interactions
    • we just add the interactions we care about explicitly (as in the quadratic-feature example for linear regression) and then hash them
  • Special case used for multiclass classification:
    • For documents in class “news”, replace words like “tariff” with “tariff#news”, and so on
    • Classify feature vectors like this as +1 (created from the true class) or 0 (created from an incorrect class)
    • Cycle through possible classes at test time and see which assigned class looks “most positive”
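A sketch of this construction (the function names are hypothetical; hash_features is the sketch from the earlier hash-trick slide):

```python
def class_conjoined(tokens, label):
    """Replace each token t with "t#label", as in "tariff#news"."""
    return {f"{t}#{label}": 1.0 for t in tokens}

# At test time, score every candidate class and pick the most positive one:
# best = max(classes,
#            key=lambda c: score(hash_features(class_conjoined(doc, c), m)))
```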

25 of 75

Formal Results

  • Q: How different is the original kernel K(x,x’) from the hash kernel?
  • A: there is a bias due to adding collisions, but it’s minimal

26 of 75

Formal Results (sketch)

  • Q: How different is the original kernel K(x,x’) from the hash kernel?
  • A: there is a bias due to adding collisions, but it’s minimal

Note this doesn’t prove the hash kernel “works” – it does give some intuition that the hash trick won’t change the problem “too much”

27 of 75

Formal results

  • Observation: in NLP (especially with substrings and term#class features) there are often highly correlated “duplicate” features
    • Or you can create duplicates explicitly by hashing “William#1”, “William#2”, “William#3”, etc.
  • Q: How much loss of information is there if you assume duplicates exist?
  • A: very little!
    • E.g., with 10k features, 10M hash buckets, and 3 duplicates of each feature, Pr(all duplicates of some feature collide) ~= 1%

28 of 75

Experimental results: RCV1

Hash kernel version of VW learner

29 of 75

Experimental results: RCV1

30 of 75

Spam filtering at Yahoo

  • 3.2M emails, 433k users
  • Emails have tokens and user-id. Features:
    • original tokens (hashed): “dear”, “sir”, “please”, ”accept”, …
    • original tokens paired with user id: “dear#wcohen”, “sir#wcohen”, …
  • Everything is put in one giant classifier
    • Intuition: the model will learn something from the general features and can refine the model with user-specific features if there is enough data from that user

From the Weinberger et al. 2010 paper

31 of 75

An example

2^26 entries ≈ 0.5 GB @ 8 bytes/weight

32 of 75

DISTRIBUTED MACHINE LEARNING


33 of 75

Distributed ML

  • Data parallelism
    • replicate the model and partition the data
    • our focus so far
  • Model parallelism (related to “pipeline parallelism”)
    • partition the model
      • e.g., each worker has different parts of a huge LLM
  • Tensor parallelism (related to “Zero Redundancy Optimizer, ZeRO”)
    • partition individual tensor/matrix operations

  • Context parallelism (for long-context LLMs)


34 of 75

Distributed ML

  • Data parallelism
    • workers get parts (usually disjoint partitions) of the data and copies of the current model
    • workers optimize independently for a while
      • without communication
    • workers communicate and synchronize models
  • Decisions to consider:
    • when do you synchronize?
      • when is it ok to optimize a “stale” set of weights?
    • how do you synchronize?
      • how do you send information?
      • how do you compress or sparsify gradients?


35 of 75

Data Parallel ML: Maximal synchronization: batch gradient (recap)


36 of 75

Data Parallel ML: Minimal synchronization 1


NeurIPS 2009

37 of 75

Data Parallel ML: Minimal synchronization 1

  • ML models they compared:
    • Distributed batch gradient
    • Two minimal-synchronization methods, both learning P different logistic regression classifiers (one per shard, batch gradient descent within each shard):
        • Combine predictions by majority vote
        • Combine weight vectors by averaging
  • All methods are comparable in CPU time and accuracy, but the minimal-synchronization methods use 1000x less network bandwidth
  • Can prove the combination has lower variance than a single shard’s model, but not lower bias


38 of 75

Data Parallel ML: Minimal synchronization 2


NeurIPS 2010

39 of 75

Data Parallel ML: Minimal synchronization 2

  • Comparisons
    • The minimal-synchronization method
      • Learn P different logistic regression classifiers, each with stochastic gradient descent over the T examples in its own shard
      • Combine weight vectors by averaging
  • Can make formal bounds on the variance and expectation of the averaged parameter vector
    • relative to a single SGD run over T examples


40 of 75

[Plots: training loss and test loss for the compared methods.]

Note: This is a convex optimization problem!

41 of 75

Data Parallel ML: Minimal synchronization 3


NeurIPS 2011

42 of 75

Data Parallel ML: Minimal synchronization 3

  • Setting:
    • Multiple SGD processes on a multicore CPU with shared memory
    • Sparse gradient updates
      • Matrix factorization
      • SVM with sparse features
    • No locking when you update gradients!
      • One process’s update might overwrite another
      • But that’s just making GD more stochastic
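A minimal lock-free sketch of the idea (with the caveat that CPython’s GIL serializes these threads, so this illustrates the scheme rather than the speedup; the hinge-loss/SVM update is one of the sparse cases considered):

```python
import threading

def hogwild_sgd(examples, dim, alpha, n_threads=4):
    """Lock-free parallel SGD sketch: all threads share w and update it
    with no locks, so racing updates can overwrite each other -- that
    just makes the descent a little more stochastic."""
    w = [0.0] * dim  # shared weight vector, written without synchronization

    def worker(shard):
        for x, y in shard:      # x: dict {index: value} (sparse), y: -1 or +1
            margin = y * sum(w[j] * xj for j, xj in x.items())
            if margin < 1.0:    # hinge-loss (linear SVM) subgradient step
                for j, xj in x.items():
                    w[j] += alpha * y * xj   # unsynchronized sparse write

    shards = [examples[i::n_threads] for i in range(n_threads)]
    threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w
```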


43 of 75

Data Parallel ML: Minimal synchronization 3

  • HogWild! optimization is still in use
    • e.g. FastText:
      • hash trick, n-grams, learned word representations, hierarchical softmax, and HogWild! optimization
      • very fast learner/classifier with ~ 20 threads on CPUs with fast RAID disks


44 of 75

Data Parallel ML: Intermediate Synchronization


NAACL 2010

45 of 75

Parallel Perceptrons*

  • Simplest idea (minimal synchronization):
    • Split data into S “shards”
    • Train a perceptron on each shard independently
      • weight vectors are w(1) , w(2) , …
    • Produce some weighted average of the w(i)‘s as the final result


*similar to logistic regression with SGD

46 of 75

Parallel Perceptrons – take 2


Better idea: do the simplest possible thing iteratively.

  • Split the data into shards
  • Let w = 0
  • For n=1,…
    • Train a perceptron on each shard with one pass starting with w
    • Average the weight vectors and let w be that average

Extra communication cost:

  • redistributing the weight vectors
  • done less frequently than if fully synchronized, more frequently than if fully parallelized
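A sketch of the loop (serial here for clarity; in the actual setup each shard’s pass runs in parallel; assumes y ∈ {-1, +1} and sparse examples as (index, value) lists):

```python
def iterative_parameter_mixing(shards, dim, n_epochs):
    """Each epoch: run one perceptron pass per shard starting from the
    shared w, then average the shard weight vectors into the new w."""
    w = [0.0] * dim
    for _ in range(n_epochs):
        shard_ws = []
        for shard in shards:                 # in reality these run in parallel
            wk = list(w)                     # start from the mixed weights
            for x, y in shard:               # x: list of (index, value)
                if y * sum(wk[j] * v for j, v in x) <= 0:   # mistake
                    for j, v in x:
                        wk[j] += y * v       # perceptron update
            shard_ws.append(wk)
        w = [sum(ws[j] for ws in shard_ws) / len(shard_ws) for j in range(dim)]
    return w
```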

47 of 75

Parallelizing perceptrons – take 2

[Diagram: the instances/labels are split into example subsets 1–3; starting from the previous w, each subset computes a local weight vector w1, w2, w3 in parallel; these are combined by averaging into the new w.]

48 of 75

Formal Results

  • Background: Perceptrons are guaranteed to converge
    • Given the data has a compact “radius” and there is a classifier with “large margin”
  • The parallel “iterative parameter mixing” method is also guaranteed to converge under the same conditions ☺
    • But not to converge faster ☹


49 of 75

Results


50 of 75

Distributed ML

  • Data parallelism
    • workers get parts (usually disjoint partitions) of the data and copies of the current model
    • workers optimize independently for a while
      • without communication
    • workers communicate and synchronize models
  • Decisions to consider:
    • when do you synchronize?
      • when is it ok to optimize a “stale” set of weights?
    • how do you synchronize?
      • how do you send information?
      • how do you compress or sparsify information?


51 of 75

Matrix Factorization

Another use of SGD


52 of 75

Recovering latent factors in a matrix

[Diagram: an n rows × m columns matrix V with entries v11 … vij … vnm.]

53 of 75

Recovering latent factors in a matrix

[Diagram: the n × m matrix V (entries v11 … vij … vnm) approximated as the product of an n × K matrix whose rows are (x1, y1), …, (xn, yn) and a K × m matrix whose columns are (a1, b1), …, (am, bm); here K = 2.]

54 of 75

Recovering latent factors in a matrix

[Diagram: the same factorization with n users and m movies: row i of the user-factor matrix is (xi, yi), and column j of the movie-factor matrix is (aj, bj).]

V[i,j] = user i’s rating of movie j

55 of 75


56 of 75

MF for images

[Diagram: a 1000 × 10,000 matrix (1000 images, 10,000 pixels each) factored the same way.]

V[i,j] = pixel j in image i

[The factorization uses K = 2: each image is a weighted combination of 2 prototype images, P1 and P2.]

57 of 75


58 of 75

Recovering latent factors in a matrix

[Diagram: the same factorization applied to the n documents × m terms doc-term matrix.]

V[i,j] = TFIDF score of term j in doc i

59 of 75

k-means Clustering

[Figure: clustered data points with the cluster centroids marked.]

60 of 75

k-means as MF

[Diagram: X ≈ Z M. X is the original data set (n examples), Z is a matrix of 0/1 indicators assigning each example to one of r clusters, and M is the matrix of cluster means.]

61 of 75

Matrix Factorization as Optimization


62 of 75


KDD 2011

63 of 75

Recovering latent factors in a matrix as optimization

[Diagram: V ≈ W H with rank r, where V is the n users × m movies rating matrix, V[i,j] = user i’s rating of movie j. Z is the set of non-zero (observed) entries, so this is matrix completion.]

64 of 75

Matrix factorization as SGD

[Equations: the per-entry SGD updates for the factors, with the step size ⍺ marked.]

why does this work?
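Since the update equations themselves were images, here is a minimal sketch of the scheme under squared loss on the observed entries (no regularization; the initialization and step size are illustrative):

```python
import random

def sgd_mf(observed, n, m, K, alpha, n_steps):
    """SGD for V ~= W H using squared loss on the observed entries.
    observed: list of (i, j, v) triples with v = V[i][j]."""
    W = [[random.gauss(0, 0.1) for _ in range(K)] for _ in range(n)]
    H = [[random.gauss(0, 0.1) for _ in range(m)] for _ in range(K)]
    for _ in range(n_steps):
        i, j, v = random.choice(observed)    # one observed rating
        err = v - sum(W[i][k] * H[k][j] for k in range(K))
        for k in range(K):   # sparse: only row i of W and column j of H change
            w_ik = W[i][k]
            W[i][k] += alpha * err * H[k][j]
            H[k][j] += alpha * err * w_ik
    return W, H
```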

65 of 75

This is a sparse update


66 of 75

Checking the claim

Think of SGD for logistic regression:

  • the LR loss compares y with ŷ = dot(w, x)
  • MF is similar, but now we update both w (the user’s weights) and x (the movie’s weights)

67 of 75

What loss functions are possible?

N1, N2: diagonal matrices, sort of like IDF factors for the users/movies

“generalized” KL-divergence

68 of 75


ALS = alternating least squares
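For contrast with SGD (a standard formulation, not the slide’s own notation): ALS alternates two convex least-squares solves, each exact given the other factor held fixed,

$$W \leftarrow \arg\min_{W} \lVert V - WH \rVert_F^2, \qquad H \leftarrow \arg\min_{H} \lVert V - WH \rVert_F^2,$$

with the norms restricted to the observed entries Z in the matrix-completion setting.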

69 of 75


talk pilfered from 🡪 …..

KDD 2011

70 of 75

[Plot legend: iterative SGD (no mixing), limited-memory quasi-Newton, parameter mixing (IPM), alternating least squares (ALS).]

71 of 75

Sparsity of the MF update


72 of 75


73 of 75


74 of 75

[Diagram (Epoch 1 of the stratified SGD scheme): V is split into 3 × 3 blocks Vij, W into row blocks W1–W3, and H into column blocks H1–H3, across Nodes 1–3. In sub-step Diag 1 the nodes process blocks V11, V22, V33 in parallel; in Diag 2, blocks V12, V23, V31; in Diag 3, blocks V13, V21, V32. The blocks in each diagonal share no rows of W and no columns of H, so the nodes’ updates never conflict.]

75 of 75
