10-605 / 10-805
Machine Learning from Large Datasets
Recap: the course so far
2
Today
3
SPARSE GRADIENTS FOR SPARSE DATA FOR LINEAR/LOGISTIC REGRESSION
Summary: SGD for least squares and Logistic Regression
5
Initialize w0
For i=1, … until converged:
Summary: SGD for least squares and Logistic Regression
6
Initialize w0
For i=1, … until converged:
For text problems:
How do we represent x and w in our program?
Compact representation:
A hash table is used to convert strings (“aardvark”) to term IDs (e.g., 5317)
Remember we may be moving these around across the network…and storing millions of these
Summary: SGD for least squares and Logistic Regression
7
Initialize w0
For i=1, … until converged:
For text problems:
If x is a sparse vector…
The gradient is zero where x is zero 🡺 gradient is sparse
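A minimal sketch (not verbatim from the slides) of what the sparse update looks like in code, assuming each example x is stored as a {feature: value} dict and w as a dict of weights; names and the data layout are illustrative:

```python
import math

def sgd_logistic_sparse(examples, n_epochs=10, lr=0.1):
    """SGD for logistic regression that touches only the non-zero features.

    `examples` is a list of (x, y) pairs where x is a dict {feature: value}
    and y is 0 or 1 (hypothetical layout, chosen for illustration)."""
    w = {}  # sparse weight vector: missing keys are implicitly 0
    for _ in range(n_epochs):
        for x, y in examples:
            # p = sigmoid(w . x); the dot product only needs x's non-zeros
            score = sum(w.get(j, 0.0) * v for j, v in x.items())
            p = 1.0 / (1.0 + math.exp(-score))
            # the gradient (y - p) * x is zero wherever x is zero,
            # so only the features present in x are updated
            for j, v in x.items():
                w[j] = w.get(j, 0.0) + lr * (y - p) * v
    return w
```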
Sparse updates for logistic regression
8
Sparsity can cause overfitting
9
Recap: Ridge regression
10
Regularized Logistic Regression
11
This update is not sparse
Dense vector updates aren’t always a problem in practice but …
Sparse Logistic Regression (sketch)
12
The regularization part of the update is not sparse; the data-dependent part is sparse.
Sparse Logistic Regression (sketch)
13
“weight decay”
Sparse Logistic Regression (sketch)
14
A is the number of examples seen since the last time we did an x > 0 update on W[j]
Sparse Logistic Regression (sketch)
15
One more “weight decay” step is needed at the end of SGD
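A hedged sketch of the lazy “weight decay” idea above: each weight W[j] remembers when it was last updated, the decay for the A skipped examples is applied only when feature j is seen again, and one final decay pass runs at the end of SGD. The exact decay factor (1 - 2*lr*mu, i.e., an L2 penalty mu*||w||^2) is an assumption for illustration:

```python
import math

def sgd_logistic_sparse_l2(examples, n_epochs=10, lr=0.1, mu=0.01):
    """Sparse SGD for L2-regularized logistic regression with lazy weight decay.

    Instead of shrinking every weight on every example (a dense update),
    each weight W[j] records when it was last touched (last[j]); the
    accumulated decay is applied only when feature j is seen again."""
    W, last = {}, {}
    decay = 1.0 - 2.0 * lr * mu   # per-example multiplicative "weight decay"
    t = 0                         # number of examples seen so far
    for _ in range(n_epochs):
        for x, y in examples:
            t += 1
            # catch up the decay for the features present in this example;
            # A = t - 1 - last[j] is the number of examples skipped
            for j in x:
                if j in W:
                    W[j] *= decay ** (t - 1 - last[j])
                last[j] = t
            score = sum(W.get(j, 0.0) * v for j, v in x.items())
            p = 1.0 / (1.0 + math.exp(-score))
            for j, v in x.items():
                W[j] = W.get(j, 0.0) * decay + lr * (y - p) * v
    # one more "weight decay" step at the end, for weights not seen recently
    for j in W:
        W[j] *= decay ** (t - last[j])
    return W
```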
THE HASH KERNEL
Also called the “hash trick”: a trick to exploit sparse features
Recap: SGD for Logistic Regression
The “hash trick”
Assume a hash function h that maps string features to indices in an array V = <0…0>
For each feature j in the example: V[h(j)] = V[h(j)] + x[j]
Ignore collisions!
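A small sketch of the trick, using Python's built-in hash() as the hash function h (in practice a stable hash would be used, since hash() is randomized per process; the function name and bucket count are illustrative):

```python
def hashed_features(tokens, b=20):
    """Map a bag of string features into a fixed-size array of 2**b buckets,
    ignoring collisions."""
    V = [0.0] * (2 ** b)
    for j in tokens:                 # j is a string feature like "aardvark"
        V[hash(j) % len(V)] += 1.0   # V[h(j)] = V[h(j)] + x[j], with x[j] = 1 per occurrence
    return V

# Usage: the learner then runs ordinary SGD over the fixed-length vector V,
# so no string-to-id dictionary has to be stored or shipped around the network.
doc = "the aardvark ate the ants".split()
V = hashed_features(doc, b=10)
```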
The “hash trick”: intuition
ɸ(x)[h] = Σ_{j : hash(j) = h} x[j]  (the hashed feature at index h sums all original features that hash to h)
The “hash trick”: pros/cons
Proc of ML Research, 2009
2010
Motivation
Formal Results
Formal Results (sketch)
Note this doesn’t prove the hash kernel “works”, but it does give some intuition that the hash trick won’t change the problem “too much”
Formal results
Experimental results: RCV1
Hash kernel version of VW learner
Experimental results: RCV1
Spam filtering at Yahoo
From the Weinberger et al. 2010 paper
An example
2^26 entries ≈ 0.5 GB @ 8 bytes/weight
DISTRIBUTED MACHINE LEARNING
32
Distributed ML
33
https://arxiv.org/abs/2003.06307 2023 survey
Distributed ML
34
Data Parallel ML: Maximal synchronization: batch gradient (recap)
35
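A sketch, under the assumption that “maximal synchronization” means one fully synchronized gradient step per iteration: each worker computes the gradient of the loss on its own shard, the pieces are summed, and a single update is made. The workers are simulated sequentially here:

```python
import numpy as np

def batch_gradient_parallel(shards, w, lr=0.1, n_iters=100):
    """Fully synchronized data-parallel (batch) gradient ascent for
    logistic regression.  Each 'worker' handles one shard (X_k, y_k)."""
    for _ in range(n_iters):
        grad = np.zeros_like(w)
        n = 0
        for X_k, y_k in shards:          # in a real system: one worker per shard
            p = 1.0 / (1.0 + np.exp(-X_k @ w))
            grad += X_k.T @ (y_k - p)    # this shard's contribution to the full gradient
            n += len(y_k)
        w = w + lr * grad / n            # single synchronized update per iteration
    return w
```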
Data Parallel ML: Minimal synchronization 1
36
NeurIPS 2009
Data Parallel ML: Minimal synchronization 1
37
Data Parallel ML: Minimal synchronization 2
38
NeurIPS 2010
Data Parallel ML: Minimal synchronization 2
39
40
[Plots: training loss and test loss]
Note: This is a convex optimization problem!
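Assuming “minimal synchronization 2” refers to parameter mixing (run SGD independently on each shard and average the final weights once), a toy sketch; function names and hyperparameters are illustrative:

```python
import numpy as np

def sgd_on_shard(X, y, lr=0.1, n_epochs=5):
    """Plain SGD for logistic regression on one shard of the data."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            p = 1.0 / (1.0 + np.exp(-x_i @ w))
            w += lr * (y_i - p) * x_i
    return w

def parameter_mixing(shards):
    """Run SGD independently on every shard (no communication during
    training) and average the resulting weight vectors once at the end."""
    ws = [sgd_on_shard(X, y) for X, y in shards]
    return np.mean(ws, axis=0)
```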
Data Parallel ML: Minimal synchronization 3
41
NeurIPS 2011
Data Parallel ML: Minimal synchronization 3
42
Data Parallel ML: Minimal synchronization 3
43
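Assuming “minimal synchronization 3” refers to lock-free asynchronous SGD in the style of the NeurIPS 2011 Hogwild! paper, a toy sketch with threads sharing one weight vector and no locks (Python's GIL prevents real speedups here; this only illustrates the update pattern):

```python
import threading
import numpy as np

def hogwild_sgd(shards, dim, lr=0.05, n_epochs=5):
    """Lock-free asynchronous SGD: every thread updates the *shared* weight
    vector w in place, with no locks; with sparse examples, collisions on the
    same coordinate are rare."""
    w = np.zeros(dim)

    def worker(X, y):
        for _ in range(n_epochs):
            for x_i, y_i in zip(X, y):
                p = 1.0 / (1.0 + np.exp(-x_i @ w))
                np.add(w, lr * (y_i - p) * x_i, out=w)  # unsynchronized in-place update

    threads = [threading.Thread(target=worker, args=s) for s in shards]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w
```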
Data Parallel ML: Intermediate Synchronization
44
NAACL 2010
Parallel Perceptrons*
45
*similar to logistic regression with SGD
Parallel Perceptrons – take 2
46
Better idea: do the simplest possible thing iteratively.
Extra communication cost:
Parallelizing perceptrons – take 2
47
[Diagram: the instances/labels are split into example subsets 1–3; starting from w (previous), each node computes a local w_1, w_2, w_3; the local vectors are combined by averaging into the new w.]
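A sketch of the scheme in the diagram, with the perceptron as the local learner (function and variable names are illustrative):

```python
import numpy as np

def perceptron_epoch(X, y, w):
    """One perceptron pass over (X, y) starting from w; labels are +/-1."""
    w = w.copy()
    for x_i, y_i in zip(X, y):
        if y_i * (x_i @ w) <= 0:
            w += y_i * x_i
    return w

def iterative_parameter_mixing(shards, dim, n_iters=10):
    """Parallel perceptron with iterative parameter mixing: each iteration,
    every node runs the perceptron on its own subset starting from the shared
    w, and the local models w_1..w_k are combined by averaging."""
    w = np.zeros(dim)
    for _ in range(n_iters):
        local = [perceptron_epoch(X_k, y_k, w) for X_k, y_k in shards]  # compute local w_k's
        w = np.mean(local, axis=0)                                      # combine by averaging
    return w
```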
Formal Results
48
Results
49
Distributed ML
50
Matrix Factorization
Another use of SGD
51
Recovering latent factors in a matrix
52
[Figure: an n-row × m-column matrix V with entries v11, …, vij, …, vnm]
Recovering latent factors in a matrix
53
[Figure: V (n × m) ≈ W H, where W is n × K with rows (x1, y1), …, (xn, yn) and H is K × m with rows (a1, …, am) and (b1, …, bm); here K = 2]
Recovering latent factors in a matrix
54
[Figure: the same factorization with V an (n users) × (m movies) matrix: V ≈ W H, where row i of W holds user i’s factors (xi, yi) and column j of H holds movie j’s factors (aj, bj)]
V[i,j] = user i’s rating of movie j
55
MF for images
[Figure: V is a 1000 images × 10,000 pixels matrix (1000 × 10,000 entries), factored as V ≈ W H with per-image factors (xi, yi) and per-pixel factors (aj, bj)]
V[i,j] = pixel j in image i
[Figure: each image is represented by weights for 2 prototypes, P1 and P2]
57
Recovering latent factors in a matrix
58
m terms
n documents
doc term matrix
x1 | y1 |
x2 | y2 |
.. | .. |
| |
| |
| |
| |
| |
… | … |
xn | yn |
a1 | a2 | .. | … | am |
b1 | b2 | … | … | bm |
v11 | … | | | |
… | … | | | |
| | | | |
| vij | | | |
| | | … | |
| | | | vnm |
~
V[i,j] = TFIDF score of term j in doc i
k-means Clustering
59
[Figure: data points and cluster centroids]
k-means as MF
[Figure: k-means as MF: the original data set X (n examples) ≈ Z M, where Z is an n × r matrix of 0/1 indicators for r clusters and M holds the cluster means]
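A tiny numeric sketch of the picture above (the data values are made up): with hard assignments, Z holds 0/1 cluster indicators, M holds the cluster means, and X ≈ Z M:

```python
import numpy as np

# Toy data: n = 4 examples in 2 dimensions, r = 2 clusters (hypothetical numbers).
X = np.array([[0.0, 0.1],
              [0.2, 0.0],
              [1.0, 0.9],
              [0.9, 1.1]])

# Z: one-hot cluster indicators (row i says which cluster example i belongs to).
Z = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1]], dtype=float)

# M: cluster means under the current assignments (per-cluster averages of X).
M = np.linalg.pinv(Z) @ X

print(M)                          # the two centroids
print(np.linalg.norm(X - Z @ M))  # square root of the k-means cost (SSE)
```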
Matrix Factorization as Optimization
61
62
KDD 2011
Recovering latent factors in a matrix as optimization
63
[Figure: V (n users × m movies) ≈ W H, where W is n × r and H is r × m; V[i,j] = user i’s rating of movie j]
Z is the set of non-zero (observed) entries of V (matrix completion)
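As a sketch, the optimization problem the figure describes can be written in the usual matrix-completion form (the L2 regularizer shown is an assumption; the slide may use a different one):

```latex
\min_{W \in \mathbb{R}^{n \times r},\; H \in \mathbb{R}^{r \times m}}
  \sum_{(i,j) \in Z} \Bigl( V_{ij} - \sum_{k=1}^{r} W_{ik} H_{kj} \Bigr)^{2}
  \;+\; \lambda \bigl( \lVert W \rVert_F^{2} + \lVert H \rVert_F^{2} \bigr)
```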
Matrix factorization as SGD
64
step size
why does this work?
This is a sparse update
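A minimal sketch of the SGD loop for this factorization (hyperparameters are illustrative): for each observed entry V[i,j], only row W[i,:] and column H[:,j] are read and written, which is what makes the update sparse.

```python
import numpy as np

def sgd_mf(entries, n, m, r=10, lr=0.01, lam=0.05, n_epochs=20):
    """SGD for matrix factorization / completion.

    `entries` is a list of observed (i, j, v) triples (the set Z); the loss is
    squared error on observed entries plus L2 regularization on the factors."""
    rng = np.random.default_rng(0)
    W = 0.1 * rng.standard_normal((n, r))
    H = 0.1 * rng.standard_normal((r, m))
    for _ in range(n_epochs):
        for i, j, v in entries:
            err = v - W[i, :] @ H[:, j]                      # residual for this entry
            grad_w = -2 * err * H[:, j] + 2 * lam * W[i, :]
            grad_h = -2 * err * W[i, :] + 2 * lam * H[:, j]
            W[i, :] -= lr * grad_w                           # only row i of W changes
            H[:, j] -= lr * grad_h                           # only column j of H changes
    return W, H
```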
65
Checking the claim
66
Think of SGD for logistic regression
What loss functions are possible?
67
N1, N2: diagonal matrices, sort of like IDF factors for the users/movies
“generalized” KL-divergence
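For reference, the “generalized” KL-divergence usually meant in this setting is the NMF-style divergence below (a sketch; the slide additionally mentions diagonal weighting matrices N1, N2, which would scale these terms):

```latex
D\bigl(V \,\Vert\, WH\bigr) \;=\;
  \sum_{i,j} \Bigl( V_{ij} \log \frac{V_{ij}}{(WH)_{ij}} \;-\; V_{ij} \;+\; (WH)_{ij} \Bigr)
```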
68
ALS = alternating least squares
69
talk pilfered from 🡪 …..
KDD 2011
70
[Plot legend: iterative SGD (no mixing), limited-memory quasi-Newton, parameter mixing, alternating least squares, IPM]
Sparsity of the MF update
71
72
73
74
Epoch 1 (Node i holds row block Wi; within each “diag” the three nodes update disjoint blocks in parallel):

Diag 1:
|    | H1  | H2  | H3  |
| W1 | V11 |     |     |
| W2 |     | V22 |     |
| W3 |     |     | V33 |

Diag 2:
|    | H1  | H2  | H3  |
| W1 |     | V12 |     |
| W2 |     |     | V23 |
| W3 | V31 |     |     |

Diag 3:
|    | H1  | H2  | H3  |
| W1 |     |     | V13 |
| W2 | V21 |     |     |
| W3 |     | V32 |     |

Node 1 processes the W1 row, Node 2 the W2 row, Node 3 the W3 row.
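A short sketch that reproduces the schedule in the tables above: with p nodes, sub-epoch d has node k process block (Wk, H((k+d) mod p)), so no two blocks in the same sub-epoch share a row block of W or a column block of H:

```python
def dsgd_strata(p):
    """Enumerate the sub-epochs ("diagonals") of one DSGD epoch with p nodes.

    In sub-epoch d, node k processes block (W_k, H_{(k+d) mod p}); the p blocks
    of a sub-epoch share no W row block or H column block, so the nodes can run
    SGD on their blocks completely in parallel."""
    for d in range(p):
        yield [(k, (k + d) % p) for k in range(p)]

# Epoch 1 with 3 nodes reproduces the three tables above:
for d, stratum in enumerate(dsgd_strata(3), start=1):
    print(f"Diag {d}:", ["V%d%d" % (i + 1, j + 1) for i, j in stratum])
# Diag 1: ['V11', 'V22', 'V33']
# Diag 2: ['V12', 'V23', 'V31']
# Diag 3: ['V13', 'V21', 'V32']
```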
75