10-605 / 10-805
Machine Learning from Large Datasets
ADMINISTRIVIA
Who/Where/When
Overview…
* Guest lecturers: John Wieting (Google DeepMind), Michael de Jong (Cursor)
PySpark
Databricks
Amazon Cloud
Project / Miniproject
Who/Where/When
* Adjunct Prof, Visiting Assoc Prof, Assoc Research Prof, Research Prof, Prof, Consulting Prof, Visiting Prof, Prof
TAs
What/How
What/How: cheating vs working together
What/How: using GenAI
ITS-2008
a study to compare three learning strategies: … learning by solving problems with feedback and hints, learning by generalizing worked-out examples exhaustively, and learning by generalizing worked-out examples only for the skills that need to be generalized. The results showed that learning by tutored problem solving outperformed the other learning strategies
Example Study: you only experience the correct problem-solving decisions
Tutored Problem Solving: you experience the mistakes you make
Commonplace books are a way to compile knowledge, usually by writing information into blank books. They have been kept from antiquity, and were kept particularly during the Renaissance and in the nineteenth century. Such books are similar to scrapbooks filled with items of many kinds: notes, proverbs, adages, aphorisms, maxims, quotes, letters, poems, tables of weights and measures, prayers, legal formulas, and recipes.
Aristotle … suggested that they also be used to explore the validity of propositions through rhetoric. Cicero … applied them to public speaking. He also created a list of commonplaces which included sententiae or wise sayings or quotations by philosophers, statesmen, and poets. Quintilian further expanded these ideas in Institutio Oratoria, a treatise on rhetoric education, and asked his readers to commit their commonplaces to memory. In the first century AD, Seneca the Younger suggested that readers collect commonplace ideas and sententiae as a bee collects pollen, and by imitation turn them into their own honey-like words…
Wikipedia
Next sentence prediction vs masked language modeling in BERT
2018
The Legendary Tom Murphy VII
GenAI and Education: My Opinion
What/How: using GenAI (recap)
What/How: using GenAI
BIG DATA HISTORY: FROM THE DAWN OF TIME TO THE PRESENT
Big ML c. 1993 (Cohen, “Efficient…Rule Learning”, IJCAI 1993)
talk_announcement :- WORDS ~ talk, WORDS ~ Subject_talk (54/1).
talk_announcement :- WORDS ~ '2d416' (26/3).
talk_announcement :- WORDS ~ system, WORDS ~ 'To_1126@research' (4/0).
talk_announcement :- WORDS ~ mh, WORDS ~ time (5/1).
talk_announcement :- WORDS ~ talk, WORDS ~ used (3/0).
talk_announcement :- WORDS ~ presentations (2/1).
default non_talk_announcement (390/1).
" :-" means "if", "~" means "contains"
Why?
Algorithm
talk_announcement :- WORDS ~ talk, WORDS ~ Subject_talk, WORDS ~ p_comma.
talk_announcement :- WORDS ~ '2d416', WORDS ~ be.
talk_announcement :- WORDS ~ show, WORDS ~ talk (7/0).
talk_announcement :- WORDS ~ mh, WORDS ~ time, WORDS ~ research (4/0).
talk_announcement :- WORDS ~ system, WORDS ~ 'To_1126@research' (3/0).
talk_announcement :- WORDS ~ '2d416', WORDS ~ memory (3/0).
talk_announcement :- WORDS ~ interfaces, WORDS ~ From_p_exclaim_point (2/0).
talk_announcement :- WORDS ~ presentations, WORDS ~ From_att (2/0).
default non_talk_announcement .
Algorithm
talk_announcement :- WORDS ~ talk, WORDS ~ Subject_talk, WORDS ~ p_comma (54/0).
talk_announcement :- WORDS ~ '2d416', WORDS ~ be (19/0).
talk_announcement :- WORDS ~ show, WORDS ~ talk (7/0).
talk_announcement :- WORDS ~ mh, WORDS ~ time, WORDS ~ research (4/0).
talk_announcement :- WORDS ~ system, WORDS ~ 'To_1126@research' (3/0).
talk_announcement :- WORDS ~ '2d416', WORDS ~ memory (3/0).
talk_announcement :- WORDS ~ interfaces, WORDS ~ From_p_exclaim_point (2/0).
talk_announcement :- WORDS ~ presentations, WORDS ~ From_att (2/0).
default non_talk_announcement .
Algorithm
talk_announcement :- WORDS ~ talk, WORDS ~ Subject_talk (54/1).
talk_announcement :- WORDS ~ '2d416' (26/3).
talk_announcement :- WORDS ~ system, WORDS ~ 'To_1126@research' (4/0).
talk_announcement :- WORDS ~ mh, WORDS ~ time (5/1).
talk_announcement :- WORDS ~ talk, WORDS ~ used (3/0).
talk_announcement :- WORDS ~ presentations (2/1).
default non_talk_announcement (390/1).
Algorithm
Analysis
[i] “Best” is w.r.t. some statistics on c’s coverage of POS, NEG
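For readers who want the shape of the learner in code, here is a rough sketch (my reconstruction, not the paper’s pseudocode) of the separate-and-conquer loop that produces rule sets like the ones above: grow one rule by greedily adding the current “best” condition, then remove the positive examples the rule covers and repeat. The scoring statistic and data structures are simplified placeholders.

def best_condition(candidates, rule, pos, neg):
    """Greedy 'best' condition: scored here by (positives covered) minus
    (negatives covered); the real system uses an information-gain-style
    statistic over the rule's coverage of POS and NEG."""
    def n_covered(examples, conds):
        return sum(1 for x in examples if conds <= x)   # x is a set of words
    return max(candidates,
               key=lambda c: n_covered(pos, rule | {c}) - n_covered(neg, rule | {c}))

def learn_rules(pos, neg, candidates):
    """Separate-and-conquer: repeatedly grow one rule, then remove the
    positives it covers, until no positives remain (or no progress is made)."""
    rules = []
    pos, neg = [set(x) for x in pos], [set(x) for x in neg]
    while pos:
        rule = set()
        # grow: add conditions while the rule still covers some negatives
        while any(rule <= x for x in neg):
            remaining = [c for c in candidates if c not in rule]
            if not remaining:
                break
            rule.add(best_condition(remaining, rule, pos, neg))
        if not any(rule <= x for x in pos):   # nothing useful learned; stop
            break
        rules.append(rule)
        pos = [x for x in pos if not rule <= x]   # separate
    return rules

# usage (hypothetical bag-of-words examples):
# pos = [{"talk", "Subject_talk", "time"}, {"2d416", "talk"}]
# neg = [{"lunch", "menu"}, {"meeting", "time"}]
# print(learn_rules(pos, neg, {"talk", "Subject_talk", "2d416", "time", "lunch", "menu", "meeting"}))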
[figure: complexity curves annotated “L1”, “quadratic”, and “even worse!”]
So in the early-to-mid 1990s…
Big ML c. 2001 (Banko & Brill, “Scaling to Very Very Large Corpora…”)
Task: distinguish pairs of easily-confused words (“affect” vs “effect”) in context
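A minimal sketch of one simple way to attack this task (my illustration, not Banko & Brill’s exact learners): pick whichever confusable word co-occurs more often with the observed surrounding context in a large corpus. The corpus and test context below are hypothetical.

from collections import Counter

# Hypothetical corpus; in the Banko & Brill setting this would be
# hundreds of millions to a billion words of text.
corpus = (
    "the new policy will affect every student "
    "the effect of the policy was small "
    "it had no effect on the results "
    "this does not affect the outcome"
).split()

def choose(candidates, left, right, corpus):
    """Pick the candidate that co-occurs most often with the given
    left/right context words (a crude bigram vote)."""
    scores = Counter()
    for i, w in enumerate(corpus):
        if w in candidates:
            if i > 0 and corpus[i - 1] == left:
                scores[w] += 1
            if i + 1 < len(corpus) and corpus[i + 1] == right:
                scores[w] += 1
    return scores.most_common(1)[0][0] if scores else candidates[0]

# Which confusable word fits the slot "the ___ of"?
print(choose(("affect", "effect"), "the", "of", corpus))   # -> effect

More data means more matching contexts, so these counts become more reliable, which previews the next question.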
Why does more data help?
Why More Data Helps: A Demo
FineWeb is 15 trillion tokens, ~180x bigger
Why More Data Helps
Observations [from playing with data]:
So in 2001…
…and in 2009
Eugene Wigner’s article “The Unreasonable Effectiveness of Mathematics in the Natural Sciences” examines why so much of physics can be neatly explained with simple mathematical formulas such as f = ma or e = mc². Meanwhile, sciences that involve human beings rather than elementary particles have proven more resistant to elegant mathematics. Economists suffer from physics envy over their inability to neatly model human behavior. An informal, incomplete grammar of the English language runs over 1,700 pages.
Perhaps when it comes to natural language processing and related fields, we’re doomed to complex theories that will never have the elegance of physics equations. But if that’s so, we should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data.
Norvig, Pereira, Halevy, “The Unreasonable Effectiveness of Data”, 2009
Bengio, Foundations & Trends, 2009
[figure: 1M vs. 10M labeled examples; 2.5M examples for “pretraining”]
SCALING LAWS FOR LLMS
History of LLMs
Models got bigger and bigger
Chinchilla Scaling Laws
2022
Chinchilla Scaling Laws
Main idea: better model the joint relationship between model size (# parameters), model performance (test-set loss / perplexity), and training cost (floating-point operations, aka FLOPs)
Chinchilla Scaling Laws
[figure annotations: 63B (parameters), 1.4T (tokens)]
Constraint = Budget = $$
Idea: fit a simple model to the data that relates compute, parameters, and loss, and use that to optimize loss given a total compute budget.
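A minimal sketch of this idea (my illustration, not the paper’s code): assume the parametric loss form from Hoffmann et al. (2022), L(N, D) = E + A/N^alpha + B/D^beta, plus the common approximation that training cost is C ≈ 6·N·D FLOPs, then grid-search the model size that minimizes predicted loss under a fixed budget. The coefficient values below are roughly one published fit and are illustrative only; they will not exactly reproduce the numbers on these slides.

import numpy as np

# Chinchilla-style parametric loss: N = parameters, D = training tokens.
# Coefficients roughly follow one of the fits in Hoffmann et al. (2022);
# treat them as illustrative, not authoritative.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    return E + A / N**alpha + B / D**beta

def optimal_allocation(C_flops, n_grid=10_000):
    """For a fixed compute budget C (FLOPs), find the model size N that
    minimizes predicted loss, using the approximation C ~= 6 * N * D."""
    N = np.logspace(7, 13, n_grid)    # candidate model sizes (parameters)
    D = C_flops / (6.0 * N)           # tokens implied by the budget
    L = loss(N, D)
    i = int(np.argmin(L))
    return N[i], D[i], L[i]

N_opt, D_opt, L_opt = optimal_allocation(5.8e23)   # a roughly Chinchilla-scale budget
print(f"~{N_opt/1e9:.0f}B params, ~{D_opt/1e12:.1f}T tokens, predicted loss {L_opt:.2f}")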
Chinchilla Scaling Laws
MMLU
Compute Scaling Laws = Data Scaling Laws
Nowadays training data is mostly distinct (non-repeated) tokens
Llama 1 (2023) training mixture
Compute Scaling Laws = Data Scaling Laws
7B params: 400B → 1T toks
1T toks: 7B → 33B params
Llama 1 (2023) training mixture
REVIEW: ASYMPTOTIC COMPLEXITY
The lecture so far
How do we use very large* amounts of data?
* according to William
Asymptotic Analysis: Basic Principles
Some useful rules (worked examples below):
Only highest-order terms matter
Leading constants don’t matter
Degree of something in a log doesn’t matter
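A quick worked illustration of these rules (my example, not from the slides):
* 3n² + 10·n·log n + 5 = O(n²): only the highest-order term matters, and the leading constant 3 is dropped.
* log(n³) = 3·log n = O(log n): the degree inside the log only contributes a constant factor.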
Rule pruning again
Algorithm
Analysis
[i] “Best” is w.r.t. some statistics on c’s coverage of POS, NEG
[figure annotations: “L1”, “quadratic”]
Empirical analysis of complexity: plot run-time on a log-log plot and measure the slope (using linear regression); a short sketch follows below.
Analytic result needs to use the size of the learned ruleset, which is hard to predict analytically.
But experimental analysis was “good enough”
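As a sketch of that empirical procedure (my code, with made-up timings as placeholders), the complexity exponent can be estimated by fitting a line to (log n, log runtime):

import numpy as np

# Hypothetical measured runtimes for increasing dataset sizes.
n = np.array([1_000, 2_000, 4_000, 8_000, 16_000])        # number of examples
runtime = np.array([0.12, 0.46, 1.9, 7.8, 31.0])          # seconds (made up)

# Slope of the best-fit line on a log-log plot ~ empirical complexity exponent.
slope, intercept = np.polyfit(np.log(n), np.log(runtime), 1)
print(f"runtime grows roughly as n^{slope:.2f}")           # ~2 => roughly quadratic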
Where do asymptotics break down?