1 of 26

10-605 / 10-805

Machine Learning from Large Datasets

2 of 26

Intro 8/25

Machine Learning from Large Datasets

3 of 26

Overview of Intro Lecture

  • Administrivia
  • High-level points
    • Scalability of ML to very large datasets started ~20 years ago
    • Sample early paper:
      • Cohen’s rule-learning work
      • Banko and Brill paper, NAACL 2001
        • explored some natural NLP tasks where what happened?
        • what were the tasks?
        • why did big data help here?
    • Another paper: scaling laws for LLMs
      • what was the point here?

4 of 26

Overview of Intro Lecture

    • Another paper: scaling laws for LLMs
      • what was the point of this paper?
      • what did I emphasize?

Main idea: better model the joint relationship between model size (# parameters), model performance (test-set loss / perplexity), and training cost (floating-point operations, aka FLOPs)

Idea: fit a simple model to the data that relates compute, parameters, and loss, and use that to optimize loss given a total compute budget.
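As a toy illustration of the "fit a simple model" idea (the numbers below are made up, not from any actual scaling-law paper): assume loss follows a power law L(N) ≈ a·N^(−b) in the parameter count N, and fit a and b by least squares in log space.

```python
import math

# Made-up (parameter count, test loss) points -- illustrative only.
points = [(1e6, 4.2), (1e7, 3.1), (1e8, 2.3), (1e9, 1.7)]

# Fit L(N) ~= a * N**(-b) by ordinary least squares in log space:
# log L = log a - b * log N.
xs = [math.log(n) for n, _ in points]
ys = [math.log(loss) for _, loss in points]
xbar = sum(xs) / len(xs)
ybar = sum(ys) / len(ys)
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
b = -slope                          # power-law exponent
a = math.exp(ybar - slope * xbar)

def predicted_loss(n_params):
    return a * n_params ** (-b)
```

Given such a fit (plus one for data), you can ask the Chinchilla-style question: for a fixed compute budget, what size/data split minimizes predicted loss?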

Compute Scaling Laws = Data Scaling Laws (title text)

5 of 26

Overview of Intro Lecture

    • Another paper: scaling laws for LLMs
      • what was the point of this paper?
      • what did I emphasize?

7B params: 200B → 1T toks

1B toks: 7B → 33B params

6 of 26

Map-Reduce and Spark 8/27

Machine Learning from Large Datasets

7 of 26

Overview of Map-Reduce/Spark lecture

  • Cost of operations
  • Naïve Bayes
    • also on a homework
  • Distributed processing on large files
    • small-vocabulary word sort
    • large-vocabulary word sort
    • same using sorting to avoid disk seek
    • how shuffle-sort is used in map-reduce
  • Map-Reduce frameworks
    • To be honest, we mostly discussed other frameworks so you would know what is important about the one we used in the assignments, and what is common across different frameworks
    • Abstractions for map-reduce and other dataflow languages
    • Spark
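A toy, in-memory sketch of the word-count / shuffle-sort pattern above (all names here are illustrative): map emits (word, 1) pairs, the shuffle-sort makes equal keys adjacent, and reduce sums each run of identical keys.

```python
from itertools import groupby

docs = ["a rose is a rose", "a daisy is a flower"]

def map_phase(docs):
    for doc in docs:
        for word in doc.split():
            yield (word, 1)

def shuffle_sort(pairs):
    # on a cluster this is the framework's shuffle; here, a plain sort
    return sorted(pairs)

def reduce_phase(sorted_pairs):
    # equal keys are adjacent after sorting, so groupby sees each word once
    for word, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

counts = dict(reduce_phase(shuffle_sort(map_phase(docs))))
```

The point of sorting is that the reducer only ever needs sequential reads, never random disk seeks.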

8 of 26

Spark Workflows 9/3

Machine Learning from Large Datasets

9 of 26

Overview of Workflow lecture

  • Much deeper dives on
    • how distributed map-reduce is implemented
    • how Spark extends map-reduce
      • caching
      • transformations vs actions
      • communication pathways in Spark
        • which is not just RDDs!
  • Lots of sample workflows
    • A bunch of “toy” problems
    • A few real ML algorithms / routines I snuck in:
      • PageRank, k-Means and EM
  • Homework (we do have questions on homeworks):
    • one implementation of parallel linear regression
    • combine vs reduce computations and how they are ordered
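One of the "real" routines above, PageRank, in a minimal single-machine sketch (hypothetical 4-node graph); the same join/aggregate loop is what you would express as a Spark workflow.

```python
# Hypothetical 4-node link graph; ranks found by power iteration.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = {node: 1.0 for node in links}
damping = 0.85

for _ in range(30):
    contribs = {node: 0.0 for node in links}
    for node, outlinks in links.items():
        share = ranks[node] / len(outlinks)      # split rank over outlinks
        for dest in outlinks:
            contribs[dest] += share
    ranks = {node: (1 - damping) + damping * c for node, c in contribs.items()}
```

In Spark terms, each iteration is a join of the links RDD with the ranks RDD followed by an aggregation by destination node.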

10 of 26

Learning as Optimization 9/8, 10, 15

Machine Learning from Large Datasets

11 of 26

Overview of Optimization 1

  • Review and introduce notation for empirical risk minimization
    • You should know all this stuff well
    • Linear regression as an ERM problem
      • Solutions via the exact (closed-form) solution or batch gradient
        • or distributed batch gradient
      • When each optimization method is appropriate
    • SGD for linear regression
    • SGD for logistic regression
      • derivation of the gradient
    • EM and k-means and coordinate descent
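A sketch of SGD for logistic regression on a hypothetical toy problem, using the gradient derived in lecture: for the log loss, the per-example gradient is (σ(w·x) − y)·x.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# features are [bias, x]; label is 1 iff x > 0 (toy, separable data)
data = [([1.0, x], 1 if x > 0 else 0)
        for x in [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]]
w = [0.0, 0.0]
lr = 0.5
random.seed(0)

for _ in range(200):
    x, y = random.choice(data)
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    # SGD step with the log-loss gradient (p - y) * x
    w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]
```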

12 of 26

Overview of Optimization 2

  • Tricks of the trade for ML
    • Feature engineering
    • Regularization (e.g., ridge regression)
      • sparsity and how it affects linear/logistic regularization
    • Hyperparameter tuning (e.g., grid search)
    • Kernels
      • SGD finds global optimum w* ⇒ optimal w* can be defined by kernels
      • definition, duality and so on
      • exact algorithm for kernel linear or ridge regression
      • hash kernels and the hash trick
        • details from the paper
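A minimal sketch of the hash trick (illustrative bucket count and sign-hash scheme, not the exact construction from the paper): hash an unbounded token vocabulary into a fixed-width vector, with a second sign hash so collisions cancel in expectation.

```python
import hashlib

def hashed_features(tokens, width=16):
    vec = [0.0] * width
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        idx = h % width                                  # bucket hash
        sign = 1.0 if (h // width) % 2 == 0 else -1.0    # sign hash
        vec[idx] += sign
    return vec

v = hashed_features("the quick brown fox".split())
```

Note there is no dictionary anywhere: the feature space is fixed-size no matter how large the vocabulary grows.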

13 of 26

Overview of Optimization 2 + 3

  • Distributed optimization
    • Flavors: data, model, tensor
    • Decisions: how often to synchronize (variants of data-parallel optimization)
      • Batch … “iterative parameter mixing” … One-shot averaging
    • HogWild!
      • what is it? when is it appropriate? when is it not appropriate?
    • Tree AllReduce
  • Distributed SGD for matrix factorization
    • A pretty example of when it makes sense to use a distribution scheme that isn’t “off the shelf”
    • MF is a problem that comes up over and over and over
      • Related to (it is a generalization of) k-means clustering
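A toy simulation of one point on the synchronization spectrum above, iterative parameter mixing: each worker runs local SGD on its own shard, then the workers average parameters and repeat. All numbers here are made up.

```python
import random

# Toy model y = w * x with true w = 3; four workers, one shard each.
random.seed(1)
true_w = 3.0
shards = [[(x, true_w * x) for x in [random.uniform(-1, 1) for _ in range(20)]]
          for _ in range(4)]

def local_sgd(w, shard, lr=0.1, steps=20):
    for _ in range(steps):
        x, y = random.choice(shard)
        w -= lr * (w * x - y) * x      # gradient of the squared error
    return w

w = 0.0
for _ in range(10):                    # 10 synchronization rounds
    w = sum(local_sgd(w, shard) for shard in shards) / len(shards)
```

Setting the round count to 1 gives one-shot averaging; averaging after every step recovers synchronous data-parallel SGD.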

14 of 26

Randomized Algorithms 9/17 and 9/22

Machine Learning from Large Datasets

15 of 26

Overview of Randomized Algorithms 1

  • Bloom Filters and applications
    • What does a Bloom filter “look like” (what data structure does it approximate)? What operations are supported? How can you combine Bloom filters?
    • What’s the math behind the choice of number of buckets and number of hash functions?
      • Aside: I like the Daniely et al. math (which is almost a count-min sketch) better than the standard math
  • Count-min sketches and applications
    • How do these types of sketches compare?
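A minimal Bloom filter sketch (illustrative parameters): a bit array approximating a set, supporting add and probabilistic membership; two filters with the same m and k combine by OR-ing their bit arrays.

```python
import hashlib

class BloomFilter:
    def __init__(self, m=256, k=3):
        self.m, self.k = m, k          # m buckets, k hash functions
        self.bits = [False] * m

    def _indices(self, item):
        # derive k hash functions by salting one hash with the index i
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = True

    def __contains__(self, item):
        # no false negatives; false positives possible
        return all(self.bits[idx] for idx in self._indices(item))

bf = BloomFilter()
for word in ["spark", "hadoop", "bloom"]:
    bf.add(word)
```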

16 of 26

Overview of Randomized Algorithms 2

  • LSH (also a homework)
    • key idea
    • random projections, SimHash, cosine similarity
    • MinHash and Jaccard distance
    • online LSH
      • the pooling trick
  • Indexing for k-NN search
    • multi-index hashing using LSH
    • k-means based approaches
      • k-means and hierarchical k-means
      • product quantization
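A sketch of random-projection LSH (SimHash) on made-up vectors: each random hyperplane contributes one signature bit (the sign of the dot product), so a small cosine angle gives a small Hamming distance between signatures.

```python
import random

random.seed(0)
dim, n_bits = 5, 16
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def simhash(v):
    # one bit per hyperplane: which side of the plane is v on?
    return tuple(1 if sum(p * x for p, x in zip(plane, v)) >= 0 else 0
                 for plane in planes)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

u = [1.0, 2.0, 3.0, 4.0, 5.0]
v = [1.1, 2.0, 2.9, 4.2, 5.0]       # small angle to u
w = [-x for x in u]                 # opposite direction to u
```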

17 of 26

Autodiff 9/24

Machine Learning from Large Datasets

18 of 26

Overview of Autodiff

  • Autodiff algorithm and motivation
    • you should understand this algorithm well
    • you don’t need to memorize/write down derivatives ☺
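A minimal reverse-mode autodiff sketch in the spirit of the lecture (add and multiply only; an illustration, not the full algorithm): each node records its inputs and a local backward rule, and backward() replays the tape in reverse topological order, accumulating gradients.

```python
class Value:
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents
        self.backward_rule = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def rule():                       # d(a+b)/da = d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out.backward_rule = rule
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def rule():                       # d(a*b)/da = b, d(a*b)/db = a
            self.grad += out.grad * other.data
            other.grad += out.grad * self.data
        out.backward_rule = rule
        return out

    def backward(self):
        order, seen = [], set()
        def topo(v):
            if v not in seen:
                seen.add(v)
                for p in v.parents:
                    topo(p)
                order.append(v)
        topo(self)
        self.grad = 1.0
        for v in reversed(order):
            v.backward_rule()

x = Value(3.0)
y = Value(4.0)
z = x * y + x          # dz/dx = y + 1 = 5, dz/dy = x = 3
z.backward()
```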

19 of 26

GPUs 9/29, 10/1

Machine Learning from Large Datasets

20 of 26

Overview of GPU-based ML

  • Lots of details about sizes, types, how to define CUDA kernels, and what ML platforms do for you
  • Ring AllReduce
    • vs Tree AllReduce
  • Memory Hierarchy for GPUs
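A toy simulation of ring AllReduce (p workers in a ring, vector split into p chunks, illustrative values): p−1 reduce-scatter steps sum each chunk as it travels the ring, then p−1 all-gather steps circulate the finished chunks, so each worker sends only 2(p−1)/p of the vector.

```python
import copy

p = 4
# worker w holds a p-chunk vector whose entries are all w + 1
chunks = [[float(w + 1)] * p for w in range(p)]

for step in range(p - 1):                       # reduce-scatter
    snapshot = copy.deepcopy(chunks)            # sends happen simultaneously
    for w in range(p):
        c = (w - step) % p                      # chunk worker w sends this step
        chunks[(w + 1) % p][c] += snapshot[w][c]

for step in range(p - 1):                       # all-gather
    snapshot = copy.deepcopy(chunks)
    for w in range(p):
        c = (w + 1 - step) % p                  # reduced chunk to forward
        chunks[(w + 1) % p][c] = snapshot[w][c]
```

Every worker ends up holding the full sum 1+2+3+4 = 10 in every chunk; unlike tree AllReduce, every link carries the same traffic, so no node is a bandwidth bottleneck.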

21 of 26

Overview of GPU-based ML

  • Ways of parallelizing ML
    • Data
      • minibatch-parallel SGD
        • large batches: often better for compute time
          • less communication ⇒ more data throughput
        • small batches: often better results
          • when you average gradient updates, small batches lead to more diverse updates
          • large batches can lead to sharp local minima, which tend not to generalize well
    • Model parallel
      • inter-layer
        • pipelined
      • intra-layer
        • tensor parallel
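A quick illustration of the batch-size tradeoff above, using made-up per-example "gradients": averaging more gradients per update lowers the variance of the update (less diverse updates), which is one way to see why small batches explore more.

```python
import random
import statistics

# Made-up scalar per-example gradients (mean 1, stdev 2) -- toy only.
random.seed(0)
grads = [random.gauss(1.0, 2.0) for _ in range(10000)]

def update_variance(batch_size):
    # variance of the averaged-gradient update across minibatches
    batches = [grads[i:i + batch_size]
               for i in range(0, len(grads), batch_size)]
    means = [sum(b) / len(b) for b in batches]
    return statistics.pvariance(means)
```

Averaging b i.i.d. gradients shrinks the update variance by roughly a factor of b.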

22 of 26

Overview of GPU-based ML

  • ML and foundation models
    • pre-training + fine-tuning
    • parameter efficient fine-tuning (PEFT)
      • linear probing: adding a layer “on top” (above the output layer, or above a hidden layer close to the output)
      • adapters: adding additional layers “within” a model
      • LoRA: learning a “low rank” (factored) matrix of gradient updates and adding the product of the factors to an existing (matrix-shaped) parameter
        • quantized LoRA
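A shapes-and-arithmetic sketch of LoRA (hypothetical sizes, pure-Python matrices): freeze W, learn a low-rank product A·B, and use W + A·B at inference; with B initialized to zero the model starts exactly at the pretrained weights.

```python
import random

random.seed(0)
d, r = 8, 2        # hypothetical layer width and LoRA rank, r << d

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]    # frozen
A = [[random.gauss(0, 0.1) for _ in range(r)] for _ in range(d)]  # trainable, d x r
B = [[0.0] * d for _ in range(r)]                                 # trainable, r x d, init 0

delta = matmul(A, B)                 # rank <= r update, d x d
W_eff = [[W[i][j] + delta[i][j] for j in range(d)] for i in range(d)]
```

Only 2·d·r parameters are trained instead of d², and because B starts at zero, W_eff == W before any fine-tuning step.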

23 of 26

Overview of GPU-based ML

  • Overview of tricks used in modern ML optimizers
    • AdaGrad
      • accumulates squared gradients per parameter as a cheap, diagonal approximation to curvature (the Hessian / 2nd derivatives)
      • effectively a separate learning rate for each parameter
    • RMSProp, Momentum, Nesterov accelerated gradient, Adam

    • Didn’t discuss these in depth
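A sketch of the AdaGrad update on a made-up toy quadratic: accumulate squared gradients per parameter and divide each step by their square root, giving every parameter its own effective learning rate.

```python
import math

# Toy objective f(w) = w0**2 + 10 * w1**2 (curvature differs 10x).
def grad(w):
    return [2 * w[0], 20 * w[1]]

w = [5.0, 5.0]
accum = [0.0, 0.0]        # running sum of squared gradients, per parameter
lr, eps = 1.0, 1e-8

for _ in range(100):
    g = grad(w)
    for i in range(2):
        accum[i] += g[i] ** 2
        w[i] -= lr * g[i] / (math.sqrt(accum[i]) + eps)
```

Despite the 10x difference in curvature, the per-parameter scaling makes both coordinates shrink at roughly the same rate.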

24 of 26

FINAL POINTS

25 of 26

Writing Session ≠ Exam

  • Writing session
    • We’re checking that you didn’t overuse GenAI
    • Our intent: ask easy questions to confirm you understand the HW just submitted
      • And remember!

  • Exam
    • We want you to review the lectures and get the main points
    • Questions will test if you understand the main concepts
    • Conceptual questions about HW are also possible

26 of 26

Writing Session ≠ Exam

  • Exam is not closed book: you can bring a “cheatsheet”
    • Nothing printed
    • You have to write it by hand on paper
    • You cannot write on an iPad and print that out, photographically reduce anything, …