1 of 64

10-605 / 10-805

Machine Learning from Large Datasets

2 of 64

ADMINISTRIVIA

3 of 64

Who/Where/When

  • Main course page: https://10605.github.io/
    • Lectures posted as we go
  • Instructor:
    • William W. Cohen
    • Office Hours:
      • Wed 2-2:45 in GHC 8015
    • Office hours for others are still TBA
  • TAs:
    • Jacob Rast and great team of TAs
    • More coming up….

4 of 64

Who/Where/When

  • Lectures:
    • Mon/Wed 11am-12:20pm in Margaret Morrison A14
    • Recitation/Writing sessions: Fri 11am-12:20 in MM A14
      • You must attend writing sessions (more later)
  • Lectures will be recorded and posted online
    • but not livestreamed
    • we can’t guarantee when they will be posted
  • There is an online quiz to do soon after each lecture
    • It closes after 48 hours
    • We have one today
    • They don’t count for a lot but there are no makeups
      • Quizzes reinforce the lecture
      • And make sure you keep up!

 

5 of 64

Overview ….

  • MapReduce, Spark
  • Scalable ML Optimization on Clusters
    • Fast ML models
  • Randomized algorithms
    • Bloom filters, count-min sketches, LSH
  • Learning on GPUs
  • Exam 1
  • Fall Break
  • Transformers
  • Hyperparameter search
  • Other topics:
    • quantization, pruning
    • key-value caching
    • contrastive learning*
    • retrieval-augmented generation*
    • merging expert LLMs
  • Exam 2

  1. TFIDF for entity resolution, efficient parallel Naïve Bayes
  2. Linear regression on Million Song Dataset
  3. Fast approximate similarity (LSH), bespoke scalable text classifier

  • Preparing a pre-training dataset for an LLM – dedup, filter
  • Pretrain an “expert” LLM on a bespoke, curated dataset

  • Project: merging expert LLMs

* Guest lecturers: John Wieting (Google DeepMind), Michael de Jong (Cursor)

[Slide callouts: assignment tools are PySpark, Databricks, and Amazon Cloud]

6 of 64

Project / Miniproject

  • Details are being finalized
  • 15% of the grade
  • Two paths:
    • suggested topics with clear goals
      • last semester: LLM pruning
    • …or a more open-ended topic using many small LLMs (trained by your classmates)
      • open-ended project proposals will be reviewed
      • each open-ended project has a lead and 2-3 participants
        • any PhD student can be a lead
        • MS/undergrads have to send in a CV for approval

7 of 64

Who/Where/When

  • William Cohen
    • In Machine Learning Dept
    • At CMU since 2002, with 7 different job titles*
    • Spent about 8 years at Google since 2002
      • had two different LDAPs
    • Co-Chair of ICML in 1994
    • Once finished second to last in my age bracket in the Great Race
    • Vacation at the beach in SC almost every summer
    • A dataset I’ve worked with lately is MedCalc (precision instruction-following for medical tasks using patient notes)

* Adjunct Prof, Visiting Assoc Prof, Assoc Research Prof, Research Prof, Prof, Consulting Prof, Visiting Prof, Prof

9 of 64

Who/Where/When

  • William W. Cohen
    • 1990-mid 1990’s: AT&T Bell Labs (working on various things, including scalable rule-learning systems)
    • Mid 1990’s—2000: AT&T Research (text classification, data integration, knowledge-as-text, information extraction)
    • 2000-2002: WhizBang! Labs (Web information-extraction startup located in Pittsburgh)
    • 2002-May 2018: CMU (IE, scientific text, social networks, graph learning, structured output learning, …)
      • 2008-2009, Jan-Jul 2017: Visiting Scientist at Google
      • 2012: Started teaching 10-605 Learning from Large Datasets
    • May 2018-March 2024: Google (QA, neural reasoning, memory-augmented neural networks, image generation, prediction-powered inference, …)
      • Part-time at CMU
      • 2018: Stopped teaching 10-605 Learning from Large Datasets
    • May 2024-now: CMU again (controllable/interpretable chain-of-thought prompting for LLMs)
      • Still part-time at Google
    • This course will follow recent versions of 405/605/805
      • especially Spring 2025

10 of 64

Who/Where/When

  • Jacob Rast
    • Staff member in the Machine Learning Department
    • 6th semester on the course
    • Enjoys traveling and studying foreign languages (中文HSK6级, Deutsch CEFR A2+)
    • Likes working with genomics datasets, programming autograders, 3D printing, and training small language models.
    • Focusing on improving CS theory fundamentals this semester

11 of 64

TAs

12 of 64

Who/Where/When

  • Christopher Berman
    • Senior Physics Major
    • Third semester TAing for this course
    • Spent the summer at CERN working on knowledge distillation for large models
    • Lived in Lisbon for 14 years
    • Plays way too much Dota 2
    • Recently worked with a dataset I collected to trade 0DTE crypto binary options
    • Returns on August 29th (stuck in Lisbon)

13 of 64

Who/Where/When

  • Gunjan Dhanuka
    • Final-year MSCS student
    • Took 10605 in Fall’24, TA’ed in Spring’25
    • Currently researching memorization in diffusion models
    • Spends too much time cooking, playing video games, and watching YouTube
    • Beaches == Mountains (after visiting California)
    • Recently been working with LAION 🦁

14 of 64

Who/Where/When

  • Baihong Yuan
    • Second-year MS student in ECE
    • Took 10-605 in F24; second time TAing this course; previously TA’ed 18-665
    • Interested in ML and LLMs; research in LLM inference
    • Enjoys reading books, baking cakes, and playing games
    • Recently worked with financial and language datasets

15 of 64

Who/Where/When

  • Hari Rangarajan
    • Second-year Software Engineering student
    • Took 10-605 in S25; prev TA’ed 17-630
    • Research in attention mechanisms of language models; LLMs for healthcare
    • Enjoys long walks, cold coffee, and chess/cricket on the weekends
    • Past: Software Dev at Microsoft

16 of 64

Who/Where/When

  • Sam Zhou
    • Second-year MSCS student
    • Took 10-605 in S25
    • Interested in scalable data science/ML and audio models
    • Enjoys traveling, photography, cafes, tennis, and video games
    • Recently worked with ESC-50 dataset for sound classification

17 of 64

Who/Where/When

  • Nikash Bhardwaj
    • Senior in Computer Science
    • Took 10-405 in S25 and found it super useful when doing ML research
    • Interested in LLMs & reasoning
    • Enjoys ping pong, soccer, and traveling
    • A cool dataset is FrontierMath

18 of 64

Who/Where/When

  • Ayush Kumar
    • Second-year MCDS student
    • Took 10-605 in S25 and enjoyed the mix of theory, systems and ML
    • Interested in scaling up ML systems and efficient inference
    • Enjoys a good hike, tennis and rock music
    • Recently worked with distributed system log datasets

19 of 64

What/How

  • Programming Languages and Systems:
    • Python, PySpark, PyTorch
  • Resources:
    • Databricks
    • Amazon Elastic Compute Cloud
      • Amazon EC2 [http://aws.amazon.com/ec2/]

20 of 64

What/How: cheating vs working together

  • We have a long and explicit policy
    • see the syllabus
    • tl;dr: transmit information like they did in the stone age, brain-to-brain, and document it
    • do not copy anything digitally
    • exceptions (eg projects) will be explicitly stated
  • “Handled severely” … ?
    • everybody involved will fail the course by default
    • every infraction always gets reported up to the Office of Academic Integrity, the head of your program, the dean of your school, ….
    • a second offense is very bad

21 of 64

What/How: using GenAI

  • For most assignments a GenAI is treated like a collaborator
    • exceptions will be explicitly stated
    • tl;dr: ask whatever you want but do not copy anything directly
      • put it in your brain and then take it out when you are generating the text/code
  • To encourage this, we’re adding quizzes for each assignment that you must do completely closed-book
    • Some questions are take-home, but some are in-class
      • “Writing sessions” – in lecture or in the recitation timeslot

22 of 64

What/How: using GenAI

  • Why not use GenAI for everything?

23 of 64

24 of 64

ITS-2008

a study to compare three learning strategies: .. learning by solving problems with feedback and hints, learning by generalizing worked-out examples exhaustively, and learning by generalizing worked-out examples only for the skills that need to be generalized. The results showed that learning by tutored problem solving outperformed other learning strategies

Worked-example study: you only experience the correct problem-solving decisions

Tutored Problem Solving: you experience the mistakes you make

25 of 64

Commonplace books are a way to compile knowledge, usually by writing information into blank books. They have been kept from antiquity, and were kept particularly during the Renaissance and in the nineteenth century. Such books are similar to scrapbooks filled with items of many kinds: notes, proverbs, adages, aphorisms, maxims, quotes, letters, poems, tables of weights and measures, prayers, legal formulas, and recipes.

Aristotle … suggested that they also be used to explore the validity of propositions through rhetoric. Cicero … applied them to public speaking. He also created a list of commonplaces which included sententiae, or wise sayings or quotations by philosophers, statesmen, and poets. Quintilian further expanded these ideas in Institutio Oratoria, a treatise on rhetoric education, and asked his readers to commit their commonplaces to memory. In the first century AD, Seneca the Younger suggested that readers collect commonplace ideas and sententiae as a bee collects pollen, and by imitation turn them into their own honey-like words…

wikipedia

26 of 64

Next sentence prediction vs masked language modeling in BERT (2018)

27 of 64

The Legendary Tom Murphy VII

28 of 64

GenAI and Education: My Opinion

  • If you want to learn, doing ≫ watching
  • If you want to learn something, write it down
    • and/or reorganize it
    • and/or (pretend to) teach it to someone
  • Programming is like writing
    • but even more so

29 of 64

What/How: using GenAI (recap)

  • For most assignments a chatbot is treated like a collaborator
    • exceptions will be explicitly stated
    • tl;dr: ask whatever you want but do not copy anything directly
      • put it in your brain and then take it out when you are generating the text/code
  • To encourage this, we’re adding quizzes for each assignment that you must do completely closed-book
    • Some questions are take-home, but some are in-class
      • “Writing sessions” – in lecture or in the recitation timeslot

30 of 64

What/How: using GenAI

  • For some parts of some assignments you can use GenAI…
    • Warning: GenAI is only useful if you are suspicious of everything
    • Never trust text because it looks plausible
    • Never ever trust code because it looks plausible
      • “Nearly correct” code is very hard to debug
      • Cutting and pasting is a dangerous shortcut

31 of 64

What/How: using GenAI

  • GenAI is only useful if you are suspicious of everything
    • Never trust text because it looks plausible
    • Never ever trust code because it looks plausible
      • “Nearly correct” code is very hard to debug
      • Cutting and pasting is a dangerous shortcut
  • The art of ML coding is really in testing
    • How do you test a loss function?
    • How do you test a new model?
    • How do you test a randomized algorithm?
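
As an illustration of the kind of test meant here (my own sketch, not from the slides), one way to test a hand-written loss function is to check it against a value you can compute by hand, and to compare its analytic gradient against finite differences. The function names below are made up for the example.

import numpy as np

def cross_entropy(logits, label):
    # negative log-probability of the true label under a softmax
    z = logits - logits.max()                     # subtract max for numerical stability
    return -(z[label] - np.log(np.exp(z).sum()))

def cross_entropy_grad(logits, label):
    # analytic gradient: softmax(logits) - one_hot(label)
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    p[label] -= 1.0
    return p

# Test 1: a case small enough to verify by hand.
# With logits [0, 0] the softmax is uniform, so the loss must be log(2).
assert np.isclose(cross_entropy(np.array([0.0, 0.0]), 0), np.log(2))

# Test 2: finite-difference gradient check at a random point.
rng = np.random.default_rng(0)
logits, label, eps = rng.normal(size=5), 2, 1e-5
numeric = np.array([(cross_entropy(logits + eps * np.eye(5)[i], label) -
                     cross_entropy(logits - eps * np.eye(5)[i], label)) / (2 * eps)
                    for i in range(5)])
assert np.allclose(numeric, cross_entropy_grad(logits, label), atol=1e-6)
print("loss function tests passed")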

32 of 64

BIG DATA HISTORY: FROM THE DAWN OF TIME TO THE PRESENT

33 of 64

Big ML c. 1993 (Cohen, “Efficient…Rule Learning”, IJCAI 1993)

talk_announcement :- WORDS ~ talk, WORDS ~ Subject_talk (54/1).

talk_announcement :- WORDS ~ '2d416' (26/3).

talk_announcement :- WORDS ~ system, WORDS ~ 'To_1126@research' (4/0).

talk_announcement :- WORDS ~ mh, WORDS ~ time (5/1).

talk_announcement :- WORDS ~ talk, WORDS ~ used (3/0).

talk_announcement :- WORDS ~ presentations (2/1).

default non_talk_announcement (390/1).

" :-" means "if", "~" means "contains"

34 of 64

Why?

Algorithm

  • Phase 1: build rules
    • Discrete greedy search:
    • Starting with empty rule set, add conditions greedily
  • Phase 2: prune rules
    • starting with phase 1 output, remove conditions

talk_announcement :- WORDS ~ talk, WORDS ~ Subject_talk, WORDS ~ p_comma.

talk_announcement :- WORDS ~ '2d416', WORDS ~ be.

talk_announcement :- WORDS ~ show, WORDS ~ talk (7/0).

talk_announcement :- WORDS ~ mh, WORDS ~ time, WORDS ~ research (4/0).

talk_announcement :- WORDS ~ system, WORDS ~ 'To_1126@research' (3/0).

talk_announcement :- WORDS ~ '2d416', WORDS ~ memory (3/0).

talk_announcement :- WORDS ~ interfaces, WORDS ~ From_p_exclaim_point (2/0).

talk_announcement :- WORDS ~ presentations, WORDS ~ From_att (2/0).

default non_talk_announcement .

35 of 64

More on this paper

Algorithm

  • Phase 1: build rules
    • Discrete greedy search:
    • Starting with empty rule set, add conditions greedily
  • Phase 2: prune rules
    • starting with phase 1 output, remove conditions, greedily

talk_announcement :- WORDS ~ talk, WORDS ~ Subject_talk, WORDS ~ p_comma (54/0).

talk_announcement :- WORDS ~ '2d416', WORDS ~ be (19/0).

talk_announcement :- WORDS ~ show, WORDS ~ talk (7/0).

talk_announcement :- WORDS ~ mh, WORDS ~ time, WORDS ~ research (4/0).

talk_announcement :- WORDS ~ system, WORDS ~ 'To_1126@research' (3/0).

talk_announcement :- WORDS ~ '2d416', WORDS ~ memory (3/0).

talk_announcement :- WORDS ~ interfaces, WORDS ~ From_p_exclaim_point (2/0).

talk_announcement :- WORDS ~ presentations, WORDS ~ From_att (2/0).

default non_talk_announcement .

36 of 64

More on this paper

Algorithm

  • Phase 1: build rules
    • Discrete greedy search:
    • Starting with empty rule set, add conditions greedily
  • Phase 2: prune rules
    • starting with phase 1 output, remove conditions, greedily

talk_announcement :- WORDS ~ talk, WORDS ~ Subject_talk (54/1).

talk_announcement :- WORDS ~ '2d416' (26/3).

talk_announcement :- WORDS ~ system, WORDS ~ 'To_1126@research' (4/0).

talk_announcement :- WORDS ~ mh, WORDS ~ time (5/1).

talk_announcement :- WORDS ~ talk, WORDS ~ used (3/0).

talk_announcement :- WORDS ~ presentations (2/1).

default non_talk_announcement (390/1).

37 of 64

More on this paper

Algorithm

  • Fit the data: POS,NEG
  • While POS isn’t empty:
    • Let R be “pos 🡨 True”
    • While R matches any examples in NEG:
      • Pick the “best” [i] condition c of the form “xi=True” or “xi=false”
      • Add c to the antecedent of R
    • Add R to the rule set
    • Remove examples that satisfy R from POS
  • Prune the rule set

Analysis

  • The total number of iterations of L1 is the number of conditions in the rule set (before pruning) – call it m
  • Picking the “best” condition requires looking at all examples – say there are n of these
  • Time is at least m*n
  • The problem:
    • When there are noisy positive examples, the algorithm builds rules that cover just 1-2 of them
      • Because noise is incompressible!
      • So m=O(n)
    • So with huge noisy datasets you build huge rulesets

[i] “Best” is wrt some statistics on c’s coverage of POS,NEG

[Slide callouts: the outer loop is labeled L1; the run time is quadratic, or even worse]
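
Below is a minimal Python sketch (my own, not the paper's implementation) of the greedy covering loop described above, with a toy scoring function standing in for the real "best condition" statistic; the outer while loop is the L1 referred to in the analysis.

def learn_rules(pos, neg, features):
    # pos, neg: lists of dicts mapping feature name -> True/False
    # returns a list of rules; each rule is a list of (feature, value) conditions
    rules, pos = [], list(pos)
    while pos:                                    # L1: runs until all positives are covered
        rule, p, n = [], list(pos), list(neg)
        while n:                                  # grow the rule until it excludes all of NEG
            # toy "best" score: positives still covered minus negatives still covered
            f, v = max(((f, v) for f in features for v in (True, False)),
                       key=lambda c: sum(x[c[0]] == c[1] for x in p)
                                     - sum(x[c[0]] == c[1] for x in n))
            new_p = [x for x in p if x[f] == v]
            new_n = [x for x in n if x[f] == v]
            if len(new_n) == len(n):              # no progress possible; give up on this rule
                break
            rule.append((f, v))
            p, n = new_p, new_n
        rules.append(rule)
        covered = [x for x in pos if all(x[f] == v for f, v in rule)]
        if not covered:                           # safety check so noise cannot loop forever
            break
        pos = [x for x in pos if x not in covered]
    return rules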

38 of 64

So in early-mid 1990’s…..

  • Experimental datasets were small
  • Many commonly used algorithms were asymptotically “slow”
  • Not many people really cared

39 of 64

40 of 64

Big ML c. 2001 (Banko & Brill, “Scaling to Very Very Large Corpora…”)

Task: distinguish pairs of easily-confused words (“affect” vs “effect”) in context

41 of 64

Why does more data help?

  • Let's dig into affect/effect classification with a large dataset – Google books n-grams
    • All 5-grams that appear >= 40 times in a corpus of 1M English books
      • approx 80B words

42 of 64

43 of 64

Why More Data Helps: A Demo

  • Data: Google Book n-grams
    • All 5-grams that appear >= 40 times in a corpus of 1M English books
      • approx 80B words
      • 5-grams: 30GB compressed, 250-300GB uncompressed
      • Each 5-gram contains frequency distribution over years
    • Wrote code to compute
      • Pr(A,B,C,D,E|C=affect or C=effect)
      • Pr(any subset of A,…,E|any other fixed values of A,…,E with C=affect V effect)
    • Demo: …

FineWeb is 15 trillion tokens, ~180x bigger
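
The real demo code isn't shown here, but as a sketch of the idea (assuming a hypothetical dict count5 mapping 5-gram tuples to corpus counts, restricted to 5-grams whose middle word is affect or effect), the conditional distribution of the middle word given any fixed context positions is just a ratio of summed counts:

from collections import defaultdict

def p_middle_given_context(count5, **fixed):
    # estimate Pr(middle word | the fixed context positions), where the five
    # positions are named a, b, c, d, e and c is the middle word
    totals = defaultdict(float)
    for (a, b, c, d, e), cnt in count5.items():
        gram = dict(a=a, b=b, c=c, d=d, e=e)
        if all(gram[k] == v for k, v in fixed.items()):
            totals[c] += cnt
    z = sum(totals.values())
    return {w: cnt / z for w, cnt in totals.items()} if z else {}

# Toy example: distribution of the middle word when the preceding word is "no".
toy_counts = {("has", "no", "effect", "on", "the"): 120,
              ("has", "no", "affect", "on", "the"): 3}
print(p_middle_given_context(toy_counts, b="no"))   # effect ~0.98, affect ~0.02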

44 of 64

Why More Data Helps

Observations [from playing with data]:

    • Mostly "effect", not "affect"
    • The most common word before "affect" is "not"; before "effect" it is "the"
    • If you've written "no affect" it's probably wrong

45 of 64

So in 2001…..

  • We’re learning:
    • “there’s no data like more data”
    • For many tasks, there’s no real substitute for using lots of data

46 of 64

…and in 2009

Eugene Wigner’s article “The Unreasonable Effectiveness of Mathematics in the Natural Sciences” examines why so much of physics can be neatly explained with simple mathematical formulas such as f = ma or e = mc2. Meanwhile, sciences that involve human beings rather than elementary particles have proven more resistant to elegant mathematics. Economists suffer from physics envy over their inability to neatly model human behavior. An informal, incomplete grammar of the English language runs over 1,700 pages.

Perhaps when it comes to natural language processing and related fields, we’re doomed to complex theories that will never have the elegance of physics equations. But if that’s so, we should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data.

Halevy, Norvig, Pereira, “The Unreasonable Effectiveness of Data”, 2009

47 of 64

Bengio, Foundations & Trends, 2009

48 of 64

[Figure callouts: 1M vs 10M labeled examples; 2.5M examples for “pretraining”]

49 of 64

SCALING LAWS FOR LLMS

50 of 64

History of LLMs

  • Word2vec, Mikolov et al, 2013: ‘skip-gram language model’ trained on 1B word corpus
    • 46,000+ citations on Google Scholar
  • ELMo – Peters et al 2018: embeddings from biLSTM LMs, trained on 1B word corpus, error reductions of 6-20% on multiple tasks
    • 16,000+ citations
  • BERT - Devlin et al 2019: Transformer-based encoder-only model, trained on 3.3B word corpus
    • 122,000+ citations
  • GPT-2 – OpenAI, 2019: Decoder-only model, about 1.5B parameters
  • T5 – Google, 2019: encoder-decoder model, about 11B parameters
  • GPT-3 – OpenAI, 2020: Decoder-only model, 175B parameters
  • Kimi K2 – Moonshot AI, 2025: about 1T parameters (mixture-of-experts)

Models got bigger and bigger

51 of 64

Chinchilla Scaling Laws (2022)

52 of 64

Chinchilla Scaling Laws

Main idea: jointly model the relationship among model size (# parameters), model performance (test-set loss / perplexity), and training cost (floating point operations, aka FLOPs)

53 of 64

Chinchilla Scaling Laws

[Figure callouts: ~63B parameters; ~1.4T tokens; Constraint = Budget = $$]

Idea: fit a simple model to the data that relates compute, parameters, and loss, and use that to optimize loss given a total compute budget.
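
As a rough numerical sketch of this idea (mine, not the slide's): the Chinchilla paper fits a parametric loss L(N, D) = E + A/N^alpha + B/D^beta and uses the approximation that training cost is C ~= 6*N*D FLOPs; the constants below are approximately the published fit and are only illustrative.

import numpy as np

# Parametric loss from Hoffmann et al. (2022); constants are approximate.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    return E + A / N**alpha + B / D**beta

def optimal_allocation(C, n_grid=10_000):
    # Given a compute budget C (FLOPs) and the approximation C ~= 6*N*D,
    # grid-search over model sizes N for the minimum predicted loss.
    N = np.logspace(7, 13, n_grid)     # candidate parameter counts
    D = C / (6 * N)                    # tokens implied by the budget
    i = np.argmin(loss(N, D))
    return N[i], D[i], loss(N[i], D[i])

N_opt, D_opt, L_opt = optimal_allocation(C=5.8e23)   # roughly Chinchilla's budget
print(f"N ~ {N_opt:.1e} params, D ~ {D_opt:.1e} tokens, predicted loss {L_opt:.2f}")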

54 of 64

Chinchilla Scaling Laws

[Figure: MMLU results]

55 of 64

Compute Scaling Laws = Data Scaling Laws

Nowadays mostly distinct tokens

Llama 1 (2023) training mixture

56 of 64

Compute Scaling Laws = Data Scaling Laws

7B params: 400B 🡪 1T toks

1T toks: 7B🡪33B params

Llama 1 (2023) training mixture

57 of 64

REVIEW: ASYMPTOTIC COMPLEXITY

58 of 64

The lecture so far

  • For many tasks/ML subareas big data is crucial
    • Deep learning
    • LLMs

59 of 64

How do we use very large amounts of data?

  • Working with big data is not about
    • code optimization
    • learning details of today's hardware/software:
      • Hadoop, Spark, Apache Beam, Polars, JAX, PyTorch, ….
  • Working with big data is about
    • Understanding the cost of what you want to do
    • Understanding what the tools that are available offer
    • Understanding how much can be accomplished with linear or nearly-linear operations (e.g., sorting, …)
    • Understanding how to organize your computations so that they effectively use whatever’s fast
    • Understanding how to test/debug/verify with large data


* according to William

60 of 64

Asymptotic Analysis: Basic Principles

61 of 64

Asymptotic Analysis: Basic Principles

Some useful rules:

Only highest-order terms matter

Leading constants don’t matter

Degree of something in a log doesn’t matter
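
A quick worked example (added for concreteness): 3n^2 + 5n log(n^3) + 7 = 3n^2 + 15n log(n) + 7 = O(n^2). The leading constant 3 and the lower-order terms are dropped, and the exponent inside the log only changes a constant factor, since log(n^k) = k log(n) = O(log n).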

62 of 64

Rule pruning again

Algorithm

  • Fit the data: POS,NEG
  • While POS isn’t empty:
    • Let R be “pos 🡨 True”
    • While R matches any examples in NEG:
      • Pick the “best” [i] condition c of the form “xi=True” or “xi=false”
      • Add c to the antecedent of R
    • Add R to the rule set
    • Remove examples that satisfy R from POS
  • Prune the rule set

Analysis

  • The total number of iterations of L1 is the number of conditions in the rule set (before pruning) – call it m
  • Picking the “best” condition requires looking at all examples – say there are n of these
  • Time is at least m*n
  • The problem:
    • When there are noisy positive examples, the algorithm builds rules that cover just 1-2 of them
      • Because noise is incompressible!
      • So m=O(n)
    • So with huge noisy datasets you build huge rulesets

[i] “Best” is wrt some statistics on c’s coverage of POS,NEG

[Slide callouts: the outer loop is labeled L1; the run time is quadratic]

63 of 64

Empirical analysis of complexity: plot run-time on a log-log plot and measure the slope (using linear regression)

Analytic result needs to use size of learned ruleset which is hard to predict analytically.

But experimental analysis was “good enough”
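
A minimal sketch of that measurement (assuming you have measured run times at several input sizes; the numbers below are made up):

import numpy as np

# hypothetical measurements: dataset sizes and wall-clock run times (seconds)
n = np.array([1_000, 2_000, 4_000, 8_000, 16_000])
t = np.array([0.05, 0.21, 0.83, 3.30, 13.4])

# fit log(t) = slope * log(n) + intercept; the slope estimates the exponent k
# in t ~ const * n**k (slope ~1 means linear, ~2 means quadratic, ...)
slope, intercept = np.polyfit(np.log(n), np.log(t), 1)
print(f"empirical exponent ~ {slope:.2f}")   # about 2 for these made-up numbers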

64 of 64

Where do asymptotics break down?

  • When the constants are too big
    • or n is too small
  • When we can’t predict what the program will do
    • E.g., how many iterations before convergence? Does it depend on data size or not?
    • This is when you need experiments
  • When there are different types of operations with different costs
    • We need to understand what we should count