1 of 64

10-605 / 10-805

Machine Learning from Large Datasets

2 of 64

ADMINISTRIVIA

3 of 64

Who/Where/When

  • Main course page: https://10605.github.io/
    • Lectures posted as we go
  • Instructor:
    • William W. Cohen
    • Office Hours:
      • Wed 2-2:45 in GHC 8015
    • Office hours for others are still TBA
  • TAs:
    • Jacob Rast and great team of TAs
    • More coming up….

4 of 64

Who/Where/When

  • Lectures:
    • Mon/Wed 11am-12:20pm in Margaret Morrison A14
    • Recitation/Writing sessions: Fri 11am-12:20 in MM A14
      • You must attend writing sessions (more later)
  • Lectures will be recorded and posted online
    • but not livestreamed
    • we can’t guarantee when they will be posted
  • There is an online quiz to do soon after each lecture
    • It closes after 48 hours
    • We have one today
    • They don’t count for a lot but there are no makeups
      • Quizzes reinforce the lecture
      • And make sure you keep up!

 

5 of 64

Overview ….

  • MapReduce, Spark
  • Scalable ML Optimization on Clusters
    • Fast ML models
  • Randomized algorithms
    • Bloom filters, count-min sketches, LSH
  • Learning on GPUs
  • Exam 1
  • Fall Break
  • Transformers
  • Hyperparameter search
  • Other topics:
    • quantization, pruning
    • key-value caching
    • contrastive learning*
    • retrieval-augmented generation*
    • merging expert LLMs
  • Exam 2

  1. TFIDF for entity resolution, efficient parallel Naïve Bayes
  2. Linear regression on Million Song Dataset
  3. Fast approximate similarity (LSH), bespoke scalable text classifier

  • Preparing a pre-training dataset for an LLM – dedup, filter
  • Pretrain an “expert” LLM on a bespoke, curated dataset

  • Project: merging expert LLMs

* Guest lecturers: John Wieting (Google DeepMind), Michael de Jong (Cursor)

[Slide callouts: assignment tools are PySpark, Databricks, and Amazon Cloud]

6 of 64

Project / Miniproject

  • Details are being finalized
  • 15% of the grade
  • Two paths:
    • suggested topics with clear goals
      • last semester: LLM pruning
    • …or a more open-ended topic using many small LLMs (trained by your classmates)
      • open-ended project proposals will be reviewed
      • each open-ended project has a lead and 2-3 participants
        • any PhD student can be a lead
        • MS/undergrads have to send in a CV for approval

7 of 64

Who/Where/When

  • William Cohen
    • In Machine Learning Dept
    • At CMU since 2002, with 7 different job titles*
    • Spent about 8 years at Google since 2002
      • had two different LDAPs
    • Co-Chair of ICML in 1994
    • Once finished second to last in my age bracket in the Great Race
    • Vacation at the beach in SC almost every summer
    • A dataset I’ve worked with lately is MedCalc (precision instruction-following for medical tasks using patient notes)

* Adjunct Prof, Visiting Assoc Prof, Assoc Research Prof, Research Prof, Prof, Consulting Prof, Visiting Prof, Prof

9 of 64

Who/Where/When

  • William W. Cohen
    • 1990-mid 1990’s: AT&T Bell Labs (working on various things, including scalable rule-learning systems)
    • Mid 1990’s—2000: AT&T Research (text classification, data integration, knowledge-as-text, information extraction)
    • 2000-2002: WhizBang! Labs (Web information-extraction startup located in Pittsburgh)
    • 2002-May 2018: CMU (IE, scientific text, social networks, graph learning, structured output learning, …)
      • 2008-2009, Jan-Jul 2017: Visiting Scientist at Google
      • 2012: Started teaching 10-605 Learning from Large Datasets
    • May 2018-March 2024: Google (QA, neural reasoning, memory-augmented neural networks, image generation, prediction-powered inference, …)
      • Part-time at CMU
      • 2018: Stopped teaching 10-605 Learning from Large Datasets
    • May 2024-now: CMU again (controllable/interpretable chain-of-thought prompting for LLMs)
      • Still part-time at Google
    • This course will follow recent versions of 405/605/805
      • especially Spring 2025

10 of 64

Who/Where/When

  • Jacob Rast
    • Staff member in the Machine Learning Department
    • 6th semester on the course
    • Enjoys traveling and studying foreign languages (中文HSK6级, Deutsch CEFR A2+)
    • Likes working with genomics datasets, programming autograders, 3D printing, and training small language models.
    • Focusing on improving CS theory fundamentals this semester

11 of 64

TAs

12 of 64

Who/Where/When

  • Christopher Berman
    • Senior Physics Major
    • Third semester TAing for this course
    • Spent the summer at CERN working on knowledge distillation for large models
    • Lived in Lisbon for 14 years
    • Plays way too much Dota 2
    • Recently worked with a dataset I collected to trade 0DTE crypto binary options
    • Returns on August 29th (stuck in Lisbon)

13 of 64

Who/Where/When

  • Gunjan Dhanuka
    • Final-year MSCS student
    • Took 10605 in Fall’24, TA’ed in Spring’25
    • Currently researching memorization in diffusion models
    • Spends too much time cooking, playing video games, and watching YouTube
    • Beaches == Mountains (after visiting California)
    • Recently been working with LAION 🦁

14 of 64

Who/Where/When

  • Baihong Yuan
    • Second-year MS student in ECE
    • Took 10-605 in F24; second time TAing this course; previously TA’ed 18-665
    • Interested in ML and LLMs; research in LLM inference
    • Enjoys reading books, baking cakes, and playing games
    • Recently worked with financial and language datasets

15 of 64

Who/Where/When

  • Hari Rangarajan
    • Second-year Software Engineering student
    • Took 10-605 in S25; prev TA’ed 17-630
    • Research in attention mechanisms of language models; LLMs for healthcare
    • Enjoys long walks, cold coffee, and chess/cricket on the weekends
    • Past: Software Dev at Microsoft

16 of 64

Who/Where/When

  • Sam Zhou
    • Second-year MSCS student
    • Took 10-605 in S25
    • Interested in scalable data science/ML and audio models
    • Enjoys traveling, photography, cafes, tennis, and video games
    • Recently worked with ESC-50 dataset for sound classification

17 of 64

Who/Where/When

  • Nikash Bhardwaj
    • Senior in Computer Science
    • Took 10-405 in S25 and found it super useful when doing ML research
    • Interested in LLMs & reasoning
    • Enjoys ping pong, soccer, and traveling
    • A cool dataset is FrontierMath

18 of 64

Who/Where/When

  • Ayush Kumar
    • Second-year MCDS student
    • Took 10-605 in S25 and enjoyed the mix of theory, systems and ML
    • Interested in scaling up ML systems and efficient inference
    • Enjoys a good hike, tennis and rock music
    • Recently worked with distributed system log datasets

19 of 64

What/How

  • Programming Languages and Systems:
    • Python, PySpark, PyTorch
  • Resources:
    • Databricks
    • Amazon Elastic Compute Cloud
      • Amazon EC2 [http://aws.amazon.com/ec2/]

20 of 64

What/How: cheating vs working together

  • We have a long and explicit policy
    • see the syllabus
    • tl;dr: transmit information like they did in the stone age, brain-to-brain, and document it
    • do not copy anything digitally
    • exceptions (eg projects) will be explicitly stated
  • “Handled severely” … ?
    • everybody involved will fail the course by default
    • every infraction always gets reported up to the Office of Academic Integrity, the head of your program, the dean of your school, ….
    • a second offense is very bad

21 of 64

What/How: using GenAI

  • For most assignments a GenAI is treated like a collaborator
    • exceptions will be explicitly stated
    • tl;dr: ask whatever you want but do not copy anything directly
      • put it in your brain and then take it out when you are generating the text/code
  • To encourage this, we’re adding quizzes for each assignment that you must do completely closed-book
    • Some questions are take-home, but some are in-class
      • “Writing sessions” – in lecture or in the recitation timeslot

22 of 64

What/How: using GenAI

  • Why not use GenAI for everything?

23 of 64

24 of 64

ITS-2008

a study to compare three learning strategies: .. learning by solving problems with feedback and hints, learning by generalizing worked-out examples exhaustively, and learning by generalizing worked-out examples only for the skills that need to be generalized. The results showed that learning by tutored problem solving outperformed other learning strategies

Worked-example study: you only experience the correct problem-solving decisions

Tutored Problem Solving: you experience the mistakes you make

25 of 64

Commonplace books are a way to compile knowledge, usually by writing information into blank books. They have been kept from antiquity, and were kept particularly during the Renaissance and in the nineteenth century. Such books are similar to scrapbooks filled with items of many kinds: notes, proverbs, adages, aphorisms, maxims, quotes, letters, poems, tables of weights and measures, prayers, legal formulas, and recipes.

Aristotle … suggested that they also be used to explore the validity of propositions through rhetoric. Cicero … applied them to public speaking. He also created a list of commonplaces which included sententiae, or wise sayings or quotations by philosophers, statesmen, and poets. Quintilian further expanded these ideas in Institutio Oratoria, a treatise on rhetoric education, and asked his readers to commit their commonplaces to memory. In the first century AD, Seneca the Younger suggested that readers collect commonplace ideas and sententiae as a bee collects pollen, and by imitation turn them into their own honey-like words…

wikipedia

26 of 64

Next sentence prediction vs masked language modeling in BERT (2018)

27 of 64

The Legendary Tom Murphy VII

28 of 64

GenAI and Education: My Opinion

  • If you want to learn, doing ≫ watching
  • If you want to learn something, write it down
    • and/or reorganize it
    • and/or (pretend to) teach it to someone
  • Programming is like writing
    • but even more so

29 of 64

What/How: using GenAI (recap)

  • For most assignments a chatbot is treated like a collaborator
    • exceptions will be explicitly stated
    • tl;dr: ask whatever you want but do not copy anything directly
      • put it in your brain and then take it out when you are generating the text/code
  • To encourage this, we’re adding quizzes for each assignment that you must do completely closed-book
    • Some questions are take-home, but some are in-class
      • “Writing sessions” – in lecture or in the recitation timeslot

30 of 64

What/How: using GenAI

  • For some parts of some assignments you can use GenAI…
    • Warning: GenAI is only useful if you are suspicious of everything
    • Never trust text because it looks plausible
    • Never ever trust code because it looks plausible
      • “Nearly correct” code is very hard to debug
      • Cutting and pasting is a dangerous shortcut

31 of 64

What/How: using GenAI

  • GenAI is only useful if you are suspicious of everything
    • Never trust text because it looks plausible
    • Never ever trust code because it looks plausible
      • “Nearly correct” code is very hard to debug
      • Cutting and pasting is a dangerous shortcut
  • The art of ML coding is really in testing
    • How do you test a loss function?
    • How do you test a new model?
    • How do you test a randomized algorithm?
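
As an illustration of the kind of test meant here (my own sketch, not from the slides), one way to test a hand-written loss function is to check it against a value you can compute by hand, and to compare its analytic gradient against finite differences. The function names below are made up for the example.

import numpy as np

def cross_entropy(logits, label):
    # negative log-probability of the true label under a softmax
    z = logits - logits.max()                     # subtract max for numerical stability
    return -(z[label] - np.log(np.exp(z).sum()))

def cross_entropy_grad(logits, label):
    # analytic gradient: softmax(logits) - one_hot(label)
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    p[label] -= 1.0
    return p

# Test 1: a case small enough to verify by hand.
# With logits [0, 0] the softmax is uniform, so the loss must be log(2).
assert np.isclose(cross_entropy(np.array([0.0, 0.0]), 0), np.log(2))

# Test 2: finite-difference gradient check at a random point.
rng = np.random.default_rng(0)
logits, label, eps = rng.normal(size=5), 2, 1e-5
numeric = np.array([(cross_entropy(logits + eps * np.eye(5)[i], label) -
                     cross_entropy(logits - eps * np.eye(5)[i], label)) / (2 * eps)
                    for i in range(5)])
assert np.allclose(numeric, cross_entropy_grad(logits, label), atol=1e-6)
print("loss function tests passed")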

32 of 64

BIG DATA HISTORY: FROM THE DAWN OF TIME TO THE PRESENT

33 of 64

Big ML c. 1993 (Cohen, “Efficient…Rule Learning”, IJCAI 1993)

talk_announcement :- WORDS ~ talk, WORDS ~ Subject_talk (54/1).

talk_announcement :- WORDS ~ '2d416' (26/3).

talk_announcement :- WORDS ~ system, WORDS ~ 'To_1126@research' (4/0).

talk_announcement :- WORDS ~ mh, WORDS ~ time (5/1).

talk_announcement :- WORDS ~ talk, WORDS ~ used (3/0).

talk_announcement :- WORDS ~ presentations (2/1).

default non_talk_announcement (390/1).

" :-" means "if", "~" means "contains"

34 of 64

Why?

Algorithm

  • Phase 1: build rules
    • Discrete greedy search:
    • Starting with empty rule set, add conditions greedily
  • Phase 2: prune rules
    • starting with phase 1 output, remove conditions

talk_announcement :- WORDS ~ talk, WORDS ~ Subject_talk, WORDS ~ p_comma.

talk_announcement :- WORDS ~ '2d416', WORDS ~ be.

talk_announcement :- WORDS ~ show, WORDS ~ talk (7/0).

talk_announcement :- WORDS ~ mh, WORDS ~ time, WORDS ~ research (4/0).

talk_announcement :- WORDS ~ system, WORDS ~ 'To_1126@research' (3/0).

talk_announcement :- WORDS ~ '2d416', WORDS ~ memory (3/0).

talk_announcement :- WORDS ~ interfaces, WORDS ~ From_p_exclaim_point (2/0).

talk_announcement :- WORDS ~ presentations, WORDS ~ From_att (2/0).

default non_talk_announcement .

35 of 64

More on this paper

Algorithm

  • Phase 1: build rules
    • Discrete greedy search:
    • Starting with empty rule set, add conditions greedily
  • Phase 2: prune rules
    • starting with phase 1 output, remove conditions, greedily

talk_announcement :- WORDS ~ talk, WORDS ~ Subject_talk, WORDS ~ p_comma (54/0).

talk_announcement :- WORDS ~ '2d416', WORDS ~ be (19/0).

talk_announcement :- WORDS ~ show, WORDS ~ talk (7/0).

talk_announcement :- WORDS ~ mh, WORDS ~ time, WORDS ~ research (4/0).

talk_announcement :- WORDS ~ system, WORDS ~ 'To_1126@research' (3/0).

talk_announcement :- WORDS ~ '2d416', WORDS ~ memory (3/0).

talk_announcement :- WORDS ~ interfaces, WORDS ~ From_p_exclaim_point (2/0).

talk_announcement :- WORDS ~ presentations, WORDS ~ From_att (2/0).

default non_talk_announcement .

36 of 64

More on this paper

Algorithm

  • Phase 1: build rules
    • Discrete greedy search:
    • Starting with empty rule set, add conditions greedily
  • Phase 2: prune rules
    • starting with phase 1 output, remove conditions, greedily

talk_announcement :- WORDS ~ talk, WORDS ~ Subject_talk (54/1).

talk_announcement :- WORDS ~ '2d416' (26/3).

talk_announcement :- WORDS ~ system, WORDS ~ 'To_1126@research' (4/0).

talk_announcement :- WORDS ~ mh, WORDS ~ time (5/1).

talk_announcement :- WORDS ~ talk, WORDS ~ used (3/0).

talk_announcement :- WORDS ~ presentations (2/1).

default non_talk_announcement (390/1).

37 of 64

More on this paper

Algorithm

  • Fit the data: POS,NEG
  • While POS isn’t empty:
    • Let R be “pos 🡨 True”
    • While R matches any examples in NEG:
      • Pick the “best” [i] condition c of the form “xi=True” or “xi=false”
      • Add c to the antecedent of R
    • Add R to the rule set
    • Remove examples that satisfy R from POS
  • Prune the rule set

Analysis

  • The total number of iterations of L1 is the number of conditions in the rule set (before pruning) – call it m
  • Picking the “best” condition requires looking at all examples – say there are n of these
  • Time is at least m*n
  • The problem:
    • When there are noisy positive examples, the algorithm builds rules that cover just 1-2 of them
      • Because noise is incompressible!
      • So m=O(n)
    • So with huge noisy datasets you build huge rulesets

[i] “Best” is wrt some statistics on c’s coverage of POS,NEG

[Slide callouts: the outer loop is labeled L1; the run time is quadratic, or even worse]
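
Below is a minimal Python sketch (my own, not the paper's implementation) of the greedy covering loop described above, with a toy scoring function standing in for the real "best condition" statistic; the outer while loop is the L1 referred to in the analysis.

def learn_rules(pos, neg, features):
    # pos, neg: lists of dicts mapping feature name -> True/False
    # returns a list of rules; each rule is a list of (feature, value) conditions
    rules, pos = [], list(pos)
    while pos:                                    # L1: runs until all positives are covered
        rule, p, n = [], list(pos), list(neg)
        while n:                                  # grow the rule until it excludes all of NEG
            # toy "best" score: positives still covered minus negatives still covered
            f, v = max(((f, v) for f in features for v in (True, False)),
                       key=lambda c: sum(x[c[0]] == c[1] for x in p)
                                     - sum(x[c[0]] == c[1] for x in n))
            new_p = [x for x in p if x[f] == v]
            new_n = [x for x in n if x[f] == v]
            if len(new_n) == len(n):              # no progress possible; give up on this rule
                break
            rule.append((f, v))
            p, n = new_p, new_n
        rules.append(rule)
        covered = [x for x in pos if all(x[f] == v for f, v in rule)]
        if not covered:                           # safety check so noise cannot loop forever
            break
        pos = [x for x in pos if x not in covered]
    return rules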

38 of 64

So in early-mid 1990’s…..

  • Experimental datasets were small
  • Many commonly used algorithms were asymptotically “slow”
  • Not many people really cared

39 of 64

40 of 64

Big ML c. 2001 (Banko & Brill, “Scaling to Very Very Large Corpora…”)

Task: distinguish pairs of easily-confused words (“affect” vs “effect”) in context

41 of 64

Why does more data help?

  • Let's dig into affect/effect classification with a large dataset – Google books n-grams
    • All 5-grams that appear >= 40 times in a corpus of 1M English books
      • approx 80B words

42 of 64

43 of 64

Why More Data Helps: A Demo

  • Data: Google Book n-grams
    • All 5-grams that appear >= 40 times in a corpus of 1M English books
      • approx 80B words
      • 5-grams: 30GB compressed, 250-300GB uncompressed
      • Each 5-gram contains frequency distribution over years
    • Wrote code to compute
      • Pr(A,B,C,D,E|C=affect or C=effect)
      • Pr(any subset of A,…,E|any other fixed values of A,…,E with C=affect V effect)
    • Demo: …

FineWeb is 15 trillion tokens, ~180x bigger
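
The real demo code isn't shown here, but as a sketch of the idea (assuming a hypothetical dict count5 mapping 5-gram tuples to corpus counts, restricted to 5-grams whose middle word is affect or effect), the conditional distribution of the middle word given any fixed context positions is just a ratio of summed counts:

from collections import defaultdict

def p_middle_given_context(count5, **fixed):
    # estimate Pr(middle word | the fixed context positions), where the five
    # positions are named a, b, c, d, e and c is the middle word
    totals = defaultdict(float)
    for (a, b, c, d, e), cnt in count5.items():
        gram = dict(a=a, b=b, c=c, d=d, e=e)
        if all(gram[k] == v for k, v in fixed.items()):
            totals[c] += cnt
    z = sum(totals.values())
    return {w: cnt / z for w, cnt in totals.items()} if z else {}

# Toy example: distribution of the middle word when the preceding word is "no".
toy_counts = {("has", "no", "effect", "on", "the"): 120,
              ("has", "no", "affect", "on", "the"): 3}
print(p_middle_given_context(toy_counts, b="no"))   # effect ~0.98, affect ~0.02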

44 of 64

Why More Data Helps

Observations [from playing with data]:

    • Mostly "effect", not "affect"
    • The most common word before "affect" is "not"; before "effect" it is "the"
    • If you've written "no affect" it's probably wrong

45 of 64

So in 2001…..

  • We’re learning:
    • “there’s no data like more data”
    • For many tasks, there’s no real substitute for using lots of data

46 of 64

…and in 2009

Eugene Wigner’s article “The Unreasonable Effectiveness of Mathematics in the Natural Sciences” examines why so much of physics can be neatly explained with simple mathematical formulas such as f = ma or e = mc2. Meanwhile, sciences that involve human beings rather than elementary particles have proven more resistant to elegant mathematics. Economists suffer from physics envy over their inability to neatly model human behavior. An informal, incomplete grammar of the English language runs over 1,700 pages.

Perhaps when it comes to natural language processing and related fields, we’re doomed to complex theories that will never have the elegance of physics equations. But if that’s so, we should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data.

Halevy, Norvig, Pereira, “The Unreasonable Effectiveness of Data”, 2009

47 of 64

Bengio, Foundations & Trends, 2009

48 of 64

[Figure callouts: 1M vs 10M labeled examples; 2.5M examples for “pretraining”]

49 of 64

SCALING LAWS FOR LLMS

50 of 64

History of LLMs

  • Word2vec, Mikolov et al, 2013: ‘skip-gram language model’ trained on 1B word corpus
    • 46,000+ citations on Google Scholar
  • ELMo – Peters et al 2018: embeddings from biLSTM LMs, trained on 1B word corpus, error reductions of 6-20% on multiple tasks
    • 16,000+ citations
  • BERT - Devlin et al 2019: Transformer-based encoder-only model, trained on 3.3B word corpus
    • 122,000+ citations
  • GPT-2 – OpenAI, 2019: Decoder-only model, about 1.5B parameters
  • T5 – Google, 2019: encoder-decoder model, about 11B parameters
  • GPT-3 – OpenAI, 2020: Decoder-only model, 175B parameters
  • Kimi K2 – Moonshot AI, 2025: about 1T parameters (mixture-of-experts)

Models got bigger and bigger

51 of 64

Chinchilla Scaling Laws (2022)

52 of 64

Chinchilla Scaling Laws

Main idea: jointly model the relationship among model size (# parameters), model performance (test-set loss / perplexity), and training cost (floating point operations, aka FLOPs)

53 of 64

Chinchilla Scaling Laws

[Figure callouts: ~63B parameters; ~1.4T tokens; Constraint = Budget = $$]

Idea: fit a simple model to the data that relates compute, parameters, and loss, and use that to optimize loss given a total compute budget.
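
As a rough numerical sketch of this idea (mine, not the slide's): the Chinchilla paper fits a parametric loss L(N, D) = E + A/N^alpha + B/D^beta and uses the approximation that training cost is C ~= 6*N*D FLOPs; the constants below are approximately the published fit and are only illustrative.

import numpy as np

# Parametric loss from Hoffmann et al. (2022); constants are approximate.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    return E + A / N**alpha + B / D**beta

def optimal_allocation(C, n_grid=10_000):
    # Given a compute budget C (FLOPs) and the approximation C ~= 6*N*D,
    # grid-search over model sizes N for the minimum predicted loss.
    N = np.logspace(7, 13, n_grid)     # candidate parameter counts
    D = C / (6 * N)                    # tokens implied by the budget
    i = np.argmin(loss(N, D))
    return N[i], D[i], loss(N[i], D[i])

N_opt, D_opt, L_opt = optimal_allocation(C=5.8e23)   # roughly Chinchilla's budget
print(f"N ~ {N_opt:.1e} params, D ~ {D_opt:.1e} tokens, predicted loss {L_opt:.2f}")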

54 of 64

Chinchilla Scaling Laws

[Figure: MMLU results]

55 of 64

Compute Scaling Laws = Data Scaling Laws

Nowadays mostly distinct tokens

Llama 1 (2023) training mixture

56 of 64

Compute Scaling Laws = Data Scaling Laws

7B params: 400B 🡪 1T toks

1T toks: 7B🡪33B params

Llama 1 (2023) training mixture

57 of 64

REVIEW: ASYMPTOTIC COMPLEXITY

58 of 64

The lecture so far

  • For many tasks/ML subareas big data is crucial
    • Deep learning
    • LLMs

59 of 64

How do we use very large amounts of data?

  • Working with big data is not about
    • code optimization
    • learning details of today's hardware/software:
      • Hadoop, Spark, Apache Beam, Polars, JAX, PyTorch, ….
  • Working with big data is about
    • Understanding the cost of what you want to do
    • Understanding what the tools that are available offer
    • Understanding how much can be accomplished with linear or nearly-linear operations (e.g., sorting, …)
    • Understanding how to organize your computations so that they effectively use whatever’s fast
    • Understanding how to test/debug/verify with large data


* according to William

60 of 64

Asymptotic Analysis: Basic Principles

61 of 64

Asymptotic Analysis: Basic Principles

Some useful rules:

Only highest-order terms matter

Leading constants don’t matter

Degree of something in a log doesn’t matter
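
A quick worked example (added for concreteness): 3n^2 + 5n log(n^3) + 7 = 3n^2 + 15n log(n) + 7 = O(n^2). The leading constant 3 and the lower-order terms are dropped, and the exponent inside the log only changes a constant factor, since log(n^k) = k log(n) = O(log n).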

62 of 64

Rule pruning again

Algorithm

  • Fit the data: POS,NEG
  • While POS isn’t empty:
    • Let R be “pos 🡨 True”
    • While R matches any examples in NEG:
      • Pick the “best” [i] condition c of the form “xi=True” or “xi=false”
      • Add c to the antecedent of R
    • Add R to the rule set
    • Remove examples that satisfy R from POS
  • Prune the rule set

Analysis

  • The total number of iterations of L1 is the number of conditions in the rule set (before pruning) – call it m
  • Picking the “best” condition requires looking at all examples – say there are n of these
  • Time is at least m*n
  • The problem:
    • When there are noisy positive examples, the algorithm builds rules that cover just 1-2 of them
      • Because noise is incompressible!
      • So m=O(n)
    • So with huge noisy datasets you build huge rulesets

[i] “Best” is wrt some statistics on c’s coverage of POS,NEG

[Slide callouts: the outer loop is labeled L1; the run time is quadratic]

63 of 64

Empirical analysis of complexity: plot run-time on a log-log plot and measure the slope (using linear regression)

Analytic result needs to use size of learned ruleset which is hard to predict analytically.

But experimental analysis was “good enough”
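
A minimal sketch of that measurement (assuming you have measured run times at several input sizes; the numbers below are made up):

import numpy as np

# hypothetical measurements: dataset sizes and wall-clock run times (seconds)
n = np.array([1_000, 2_000, 4_000, 8_000, 16_000])
t = np.array([0.05, 0.21, 0.83, 3.30, 13.4])

# fit log(t) = slope * log(n) + intercept; the slope estimates the exponent k
# in t ~ const * n**k (slope ~1 means linear, ~2 means quadratic, ...)
slope, intercept = np.polyfit(np.log(n), np.log(t), 1)
print(f"empirical exponent ~ {slope:.2f}")   # about 2 for these made-up numbers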

64 of 64

Where do asymptotics break down?

  • When the constants are too big
    • or n is too small
  • When we can’t predict what the program will do
    • E.g., how many iterations before convergence? Does it depend on data size or not?
    • This is when you need experiments
  • When there are different types of operations with different costs
    • We need to understand what we should count