1 of 26

10-605 / 10-805

Machine Learning from Large Datasets

2 of 26

Intro 8/25

Machine Learning from Large Datasets

3 of 26

Overview of Intro Lecture

  • Administrivia
  • High-level points
    • Scalability of ML to very large datasets started ~20 years ago
    • Sample early paper:
      • Cohen’s rule-learning work
      • Banko and Brill paper, NAACL 2001
        • explored some natural NLP tasks where what happened?
        • what were the tasks?
        • why did big data help here?
    • Another paper: scaling laws for LLMs
      • what was the point here?

4 of 26

Overview of Intro Lecture

    • Another paper: scaling laws for LLMs
      • what was the point of this paper?
      • what did I emphasize?

Main idea: better model the joint relationship between model size (# parameters), model performance (test-set loss / perplexity), and training cost (floating-point operations, aka FLOPs)

Idea: fit a simple model to the data that relates compute, parameters, and loss, and use that to optimize loss given a total compute budget.
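As a toy illustration of the "fit a simple model" idea (the numbers below are made up, not from any actual scaling-law paper): assume loss follows a power law L(N) ≈ a·N^(−b) in the parameter count N, and fit a and b by least squares in log space.

```python
import math

# Made-up (parameter count, test loss) points -- illustrative only.
points = [(1e6, 4.2), (1e7, 3.1), (1e8, 2.3), (1e9, 1.7)]

# Fit L(N) ~= a * N**(-b) by ordinary least squares in log space:
# log L = log a - b * log N.
xs = [math.log(n) for n, _ in points]
ys = [math.log(loss) for _, loss in points]
xbar = sum(xs) / len(xs)
ybar = sum(ys) / len(ys)
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
b = -slope                          # power-law exponent
a = math.exp(ybar - slope * xbar)

def predicted_loss(n_params):
    return a * n_params ** (-b)
```

Given such a fit (plus one for data), you can ask the Chinchilla-style question: for a fixed compute budget, what size/data split minimizes predicted loss?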

Compute Scaling Laws = Data Scaling Laws (title text)

5 of 26

Overview of Intro Lecture

    • Another paper: scaling laws for LLMs
      • what was the point of this paper?
      • what did I emphasize?

7B params: 200B → 1T toks

1B toks: 7B → 33B params

6 of 26

Map-Reduce and Spark 8/27

Machine Learning from Large Datasets

7 of 26

Overview of Map-Reduce/Spark lecture

  • Cost of operations
  • Naïve Bayes
    • also on a homework
  • Distributed processing on large files
    • small-vocabulary word sort
    • large-vocabulary word sort
    • same using sorting to avoid disk seek
    • how shuffle-sort is used in map-reduce
  • Map-Reduce frameworks
    • To be honest, we mostly discussed other frameworks so you would know what is important about the one we used in the assignments, and what is common across different frameworks
    • Abstractions for map-reduce and other dataflow languages
    • Spark
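A toy, in-memory sketch of the word-count / shuffle-sort pattern above (all names here are illustrative): map emits (word, 1) pairs, the shuffle-sort makes equal keys adjacent, and reduce sums each run of identical keys.

```python
from itertools import groupby

docs = ["a rose is a rose", "a daisy is a flower"]

def map_phase(docs):
    for doc in docs:
        for word in doc.split():
            yield (word, 1)

def shuffle_sort(pairs):
    # on a cluster this is the framework's shuffle; here, a plain sort
    return sorted(pairs)

def reduce_phase(sorted_pairs):
    # equal keys are adjacent after sorting, so groupby sees each word once
    for word, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

counts = dict(reduce_phase(shuffle_sort(map_phase(docs))))
```

The point of sorting is that the reducer only ever needs sequential reads, never random disk seeks.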

8 of 26

Spark Workflows 9/3

Machine Learning from Large Datasets

9 of 26

Overview of Workflow lecture

  • Much deeper dives on
    • how distributed map-reduce is implemented
    • how Spark extends map-reduce
      • caching
      • transformations vs actions
      • communication pathways in Spark
        • which is not just RDDs!
  • Lots of sample workflows
    • A bunch of “toy” problems
    • A few real ML algorithms / routines I snuck in:
      • PageRank, k-Means and EM
  • Homework (we do have questions on homeworks):
    • one implementation of parallel linear regression
    • combine vs reduce computations and how they are ordered
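One of the "real" routines above, PageRank, in a minimal single-machine sketch (hypothetical 4-node graph); the same join/aggregate loop is what you would express as a Spark workflow.

```python
# Hypothetical 4-node link graph; ranks found by power iteration.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = {node: 1.0 for node in links}
damping = 0.85

for _ in range(30):
    contribs = {node: 0.0 for node in links}
    for node, outlinks in links.items():
        share = ranks[node] / len(outlinks)      # split rank over outlinks
        for dest in outlinks:
            contribs[dest] += share
    ranks = {node: (1 - damping) + damping * c for node, c in contribs.items()}
```

In Spark terms, each iteration is a join of the links RDD with the ranks RDD followed by an aggregation by destination node.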

10 of 26

Learning as Optimization 9/8, 10, 15

Machine Learning from Large Datasets

11 of 26

Overview of Optimization 1

  • Review and introduce notation for empirical risk minimization
    • You should know all this stuff well
    • Linear regression as an ERM problem
      • Solutions via the exact (closed-form) solution or batch gradient
        • or distributed batch gradient
      • When each optimization method is appropriate
    • SGD for linear regression
    • SGD for logistic regression
      • derivation of the gradient
    • EM and k-means and coordinate descent
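A sketch of SGD for logistic regression on a hypothetical toy problem, using the gradient derived in lecture: for the log loss, the per-example gradient is (σ(w·x) − y)·x.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# features are [bias, x]; label is 1 iff x > 0 (toy, separable data)
data = [([1.0, x], 1 if x > 0 else 0)
        for x in [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]]
w = [0.0, 0.0]
lr = 0.5
random.seed(0)

for _ in range(200):
    x, y = random.choice(data)
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    # SGD step with the log-loss gradient (p - y) * x
    w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]
```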

12 of 26

Overview of Optimization 2

  • Tricks of the trade for ML
    • Feature engineering
    • Regularization (e.g., ridge regression)
      • sparsity and how it affects linear/logistic regularization
    • Hyperparameter tuning (e.g., grid search)
    • Kernels
      • SGD finds global optimum w* ⇒ optimal w* can be defined by kernels
      • definition, duality and so on
      • exact algorithm for kernel linear or ridge regression
      • hash kernels and the hash trick
        • details from the paper
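A minimal sketch of the hash trick (illustrative bucket count and sign-hash scheme, not the exact construction from the paper): hash an unbounded token vocabulary into a fixed-width vector, with a second sign hash so collisions cancel in expectation.

```python
import hashlib

def hashed_features(tokens, width=16):
    vec = [0.0] * width
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        idx = h % width                                  # bucket hash
        sign = 1.0 if (h // width) % 2 == 0 else -1.0    # sign hash
        vec[idx] += sign
    return vec

v = hashed_features("the quick brown fox".split())
```

Note there is no dictionary anywhere: the feature space is fixed-size no matter how large the vocabulary grows.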

13 of 26

Overview of Optimization 2 + 3

  • Distributed optimization
    • Flavors: data, model, tensor
    • Decisions: how often to synchronize (variants of data-parallel optimization)
      • Batch … “iterative parameter mixing” … One-shot averaging
    • HogWild!
      • what is it? when is it appropriate? when is it not appropriate?
    • Tree AllReduce
  • Distributed SGD for matrix factorization
    • A pretty example of when it makes sense to use a distribution scheme that isn’t “off the shelf”
    • MF is a problem that comes up over and over and over
      • Related to (it is a generalization of) k-means clustering
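A toy simulation of one point on the synchronization spectrum above, iterative parameter mixing: each worker runs local SGD on its own shard, then the workers average parameters and repeat. All numbers here are made up.

```python
import random

# Toy model y = w * x with true w = 3; four workers, one shard each.
random.seed(1)
true_w = 3.0
shards = [[(x, true_w * x) for x in [random.uniform(-1, 1) for _ in range(20)]]
          for _ in range(4)]

def local_sgd(w, shard, lr=0.1, steps=20):
    for _ in range(steps):
        x, y = random.choice(shard)
        w -= lr * (w * x - y) * x      # gradient of the squared error
    return w

w = 0.0
for _ in range(10):                    # 10 synchronization rounds
    w = sum(local_sgd(w, shard) for shard in shards) / len(shards)
```

Setting the round count to 1 gives one-shot averaging; averaging after every step recovers synchronous data-parallel SGD.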

14 of 26

Randomized Algorithms 9/17 and 9/22

Machine Learning from Large Datasets

15 of 26

Overview of Randomized Algorithms 1

  • Bloom Filters and applications
    • What does a Bloom filter “look like” (what data structure does it approximate)? What operations are supported? How can you combine Bloom filters?
    • What’s the math behind the choice of number of buckets and number of hash functions?
      • Aside: I like the Daniely et al. math (which is almost a count-min sketch) better than the standard math
  • Count-min sketches and applications
    • How do these types of sketches compare?
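A minimal Bloom filter sketch (illustrative parameters): a bit array approximating a set, supporting add and probabilistic membership; two filters with the same m and k combine by OR-ing their bit arrays.

```python
import hashlib

class BloomFilter:
    def __init__(self, m=256, k=3):
        self.m, self.k = m, k          # m buckets, k hash functions
        self.bits = [False] * m

    def _indices(self, item):
        # derive k hash functions by salting one hash with the index i
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = True

    def __contains__(self, item):
        # no false negatives; false positives possible
        return all(self.bits[idx] for idx in self._indices(item))

bf = BloomFilter()
for word in ["spark", "hadoop", "bloom"]:
    bf.add(word)
```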

16 of 26

Overview of Randomized Algorithms 2

  • LSH (also a homework)
    • key idea
    • random projections, SimHash, cosine similarity
    • MinHash and Jaccard distance
    • online LSH
      • the pooling trick
  • Indexing for k-NN search
    • multi-index hashing using LSH
    • k-means based approaches
      • k-means and hierarchical k-means
      • product quantization
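A sketch of random-projection LSH (SimHash) on made-up vectors: each random hyperplane contributes one signature bit (the sign of the dot product), so a small cosine angle gives a small Hamming distance between signatures.

```python
import random

random.seed(0)
dim, n_bits = 5, 16
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def simhash(v):
    # one bit per hyperplane: which side of the plane is v on?
    return tuple(1 if sum(p * x for p, x in zip(plane, v)) >= 0 else 0
                 for plane in planes)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

u = [1.0, 2.0, 3.0, 4.0, 5.0]
v = [1.1, 2.0, 2.9, 4.2, 5.0]       # small angle to u
w = [-x for x in u]                 # opposite direction to u
```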

17 of 26

Autodiff 9/24

Machine Learning from Large Datasets

18 of 26

Overview of Autodiff

  • Autodiff algorithm and motivation
    • you should understand this algorithm well
    • you don’t need to memorize/write down derivatives ☺
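A minimal reverse-mode autodiff sketch in the spirit of the lecture (add and multiply only; an illustration, not the full algorithm): each node records its inputs and a local backward rule, and backward() replays the tape in reverse topological order, accumulating gradients.

```python
class Value:
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents
        self.backward_rule = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def rule():                       # d(a+b)/da = d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out.backward_rule = rule
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def rule():                       # d(a*b)/da = b, d(a*b)/db = a
            self.grad += out.grad * other.data
            other.grad += out.grad * self.data
        out.backward_rule = rule
        return out

    def backward(self):
        order, seen = [], set()
        def topo(v):
            if v not in seen:
                seen.add(v)
                for p in v.parents:
                    topo(p)
                order.append(v)
        topo(self)
        self.grad = 1.0
        for v in reversed(order):
            v.backward_rule()

x = Value(3.0)
y = Value(4.0)
z = x * y + x          # dz/dx = y + 1 = 5, dz/dy = x = 3
z.backward()
```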

19 of 26

GPUs 9/29, 10/1

Machine Learning from Large Datasets

20 of 26

Overview of GPU-based ML

  • Lots of details about sizes, types, how to define CUDA kernels, and what ML platforms do for you
  • Ring AllReduce
    • vs Tree AllReduce
  • Memory Hierarchy for GPUs
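A toy simulation of ring AllReduce (p workers in a ring, vector split into p chunks, illustrative values): p−1 reduce-scatter steps sum each chunk as it travels the ring, then p−1 all-gather steps circulate the finished chunks, so each worker sends only 2(p−1)/p of the vector.

```python
import copy

p = 4
# worker w holds a p-chunk vector whose entries are all w + 1
chunks = [[float(w + 1)] * p for w in range(p)]

for step in range(p - 1):                       # reduce-scatter
    snapshot = copy.deepcopy(chunks)            # sends happen simultaneously
    for w in range(p):
        c = (w - step) % p                      # chunk worker w sends this step
        chunks[(w + 1) % p][c] += snapshot[w][c]

for step in range(p - 1):                       # all-gather
    snapshot = copy.deepcopy(chunks)
    for w in range(p):
        c = (w + 1 - step) % p                  # reduced chunk to forward
        chunks[(w + 1) % p][c] = snapshot[w][c]
```

Every worker ends up holding the full sum 1+2+3+4 = 10 in every chunk; unlike tree AllReduce, every link carries the same traffic, so no node is a bandwidth bottleneck.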

21 of 26

Overview of GPU-based ML

  • Ways of parallelizing ML
    • Data
      • minibatch-parallel SGD
        • large batches: often better for compute time
          • less communication ⇒ more data throughput
        • small batches: often better results
          • when you average gradient updates, small batches lead to more diverse updates
          • large batches can lead to sharp local minima, which tend not to generalize well
    • Model parallel
      • inter-layer
        • pipelined
      • intra-layer
        • tensor parallel
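A quick illustration of the batch-size tradeoff above, using made-up per-example "gradients": averaging more gradients per update lowers the variance of the update (less diverse updates), which is one way to see why small batches explore more.

```python
import random
import statistics

# Made-up scalar per-example gradients (mean 1, stdev 2) -- toy only.
random.seed(0)
grads = [random.gauss(1.0, 2.0) for _ in range(10000)]

def update_variance(batch_size):
    # variance of the averaged-gradient update across minibatches
    batches = [grads[i:i + batch_size]
               for i in range(0, len(grads), batch_size)]
    means = [sum(b) / len(b) for b in batches]
    return statistics.pvariance(means)
```

Averaging b i.i.d. gradients shrinks the update variance by roughly a factor of b.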

22 of 26

Overview of GPU-based ML

  • ML and foundation models
    • pre-training + fine-tuning
    • parameter efficient fine-tuning (PEFT)
      • linear probing: adding a layer “on top” (above the output layer, or above a hidden layer close to the output)
      • adapters: adding additional layers “within” a model
      • LoRA: learning a “low rank” (factored) matrix of gradient updates and adding the product of the factors to an existing (matrix-shaped) parameter
        • quantized LoRA
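A shapes-and-arithmetic sketch of LoRA (hypothetical sizes, pure-Python matrices): freeze W, learn a low-rank product A·B, and use W + A·B at inference; with B initialized to zero the model starts exactly at the pretrained weights.

```python
import random

random.seed(0)
d, r = 8, 2        # hypothetical layer width and LoRA rank, r << d

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]    # frozen
A = [[random.gauss(0, 0.1) for _ in range(r)] for _ in range(d)]  # trainable, d x r
B = [[0.0] * d for _ in range(r)]                                 # trainable, r x d, init 0

delta = matmul(A, B)                 # rank <= r update, d x d
W_eff = [[W[i][j] + delta[i][j] for j in range(d)] for i in range(d)]
```

Only 2·d·r parameters are trained instead of d², and because B starts at zero, W_eff == W before any fine-tuning step.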

23 of 26

Overview of GPU-based ML

  • Overview of tricks used in modern ML optimizers
    • AdaGrad
      • accumulates squared gradients per parameter as a cheap, diagonal approximation to curvature (the Hessian / 2nd derivatives)
      • effectively a separate learning rate for each parameter
    • RMSProp, Momentum, Nesterov accelerated gradient, Adam

    • Didn’t discuss these in depth
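A sketch of the AdaGrad update on a made-up toy quadratic: accumulate squared gradients per parameter and divide each step by their square root, giving every parameter its own effective learning rate.

```python
import math

# Toy objective f(w) = w0**2 + 10 * w1**2 (curvature differs 10x).
def grad(w):
    return [2 * w[0], 20 * w[1]]

w = [5.0, 5.0]
accum = [0.0, 0.0]        # running sum of squared gradients, per parameter
lr, eps = 1.0, 1e-8

for _ in range(100):
    g = grad(w)
    for i in range(2):
        accum[i] += g[i] ** 2
        w[i] -= lr * g[i] / (math.sqrt(accum[i]) + eps)
```

Despite the 10x difference in curvature, the per-parameter scaling makes both coordinates shrink at roughly the same rate.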

24 of 26

FINAL POINTS

25 of 26

Writing Session ≠ Exam

  • Writing session
    • We’re checking that you didn’t overuse GenAI
    • Our intent: ask easy questions to confirm you understand the HW just submitted
      • And remember!

  • Exam
    • We want you to review the lectures and get the main points
    • Questions will test if you understand the main concepts
    • Conceptual questions about HW are also possible

26 of 26

Writing Session ≠ Exam

  • Exam is not closed book: you can bring a “cheatsheet”
    • Nothing printed
    • You have to write it by hand on paper
    • You cannot write on an iPad and print that out, photographically reduce anything, …