Data Geometry and DL - Lecture 5
Generalization: some mathematical interpretations
Main sources:
“Math of Deep Learning” lecture notes sec. 1-3
Belkin slides and papers (cited in the slides)
Chapters 2 and 3 of this book (used for Lectures 5-6)
Empirical Risk Minimization – overview of the classical theory
DREAM: the true risk over "all data", minimized amongst "all functions"
REALITY: the empirical risk over the training data, minimized over a (finite-dimensional) parameter space
Key notion: model capacity/complexity (more in Lecture 6); a standard formalization is sketched below
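As a minimal formalization in standard ERM notation (assumed here, not quoted from the slides): for data $(x,y)\sim\mathcal{D}$, a hypothesis class $\mathcal{H}$ and a loss $\ell$,
\[
R(h)=\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(h(x),y)\big],\qquad
\hat{R}_m(h)=\frac{1}{m}\sum_{i=1}^{m}\ell(h(x_i),y_i),\qquad
\hat{h}\in\arg\min_{h\in\mathcal{H}}\hat{R}_m(h),
\]
and generalization theory studies the gap $R(\hat{h})-\hat{R}_m(\hat{h})$.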
Empirical Risk Minimization – Basics
concentration inequalities
Proofs: see the (paper) or the book
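As a concrete instance of such a concentration bound (Hoeffding's inequality, given as a generic example; the slides may use a different one): for a fixed hypothesis $h$ and a $[0,1]$-valued loss,
\[
\mathbb{P}\big(|\hat{R}_m(h)-R(h)|\ge\varepsilon\big)\le 2\exp(-2m\varepsilon^{2}),
\]
and a union bound over a finite class $\mathcal{H}$ turns this into a uniform bound at the price of a $\log|\mathcal{H}|$ factor.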
Empirical Risk Minimization – Basics
Proofs: see the (paper) or the book
→ how rich the hypotheses' values can be over an m-tuple of points (the growth function)
→ the largest m such that some set of m points admits every possible 2-label classification (shattering); formal definitions are sketched below
(here is the paper)
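In standard notation (assumed here), for a class $\mathcal{H}$ of binary classifiers:
\[
\Pi_{\mathcal{H}}(m)=\max_{x_1,\dots,x_m}\big|\{(h(x_1),\dots,h(x_m)):h\in\mathcal{H}\}\big|
\quad\text{(growth function)},
\qquad
\mathrm{VCdim}(\mathcal{H})=\max\{m:\Pi_{\mathcal{H}}(m)=2^{m}\}.
\]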
Proof + summary
3. Chaining (see Chapter 5 in Van Handel’s notes)
Bound via the growth of covering numbers for this (Lipschitz-process-induced) distance:
if the loss is [0,1]-valued, then
Then use a packing bound to control the covering numbers; a representative chaining bound is sketched below.
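A representative form of the chaining bound (Dudley's entropy integral, stated as a generic example rather than the exact inequality from the slides): for a subgaussian process $(X_t)_{t\in T}$ with respect to a metric $d$,
\[
\mathbb{E}\sup_{t\in T}X_t \;\le\; C\int_{0}^{\infty}\sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon,
\]
where $N(T,d,\varepsilon)$ is the covering number; packing numbers $M(T,d,\varepsilon)$ enter through $N(T,d,\varepsilon)\le M(T,d,\varepsilon)\le N(T,d,\varepsilon/2)$.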
Overparametrization helps: Deep Double Descent
More experiments (paper)
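A minimal numerical sketch of the double-descent shape (an illustration, not one of the experiments from the paper; the setup, minimum-norm least squares on random Fourier features, is assumed for this example): as the number of random features passes the number of training points, the test error of the minimum-norm interpolant typically spikes near the interpolation threshold and then decreases again.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(2 * np.pi * x)

# Small 1-D regression problem with label noise (illustrative sizes).
n_train, n_test, noise = 20, 500, 0.1
x_tr = rng.uniform(-1, 1, n_train)
y_tr = target(x_tr) + noise * rng.standard_normal(n_train)
x_te = rng.uniform(-1, 1, n_test)
y_te = target(x_te)

def random_features(x, W, b):
    # Random Fourier features: cos(w_j * x + b_j) for each feature j.
    return np.cos(np.outer(x, W) + b)

# Sweep the number of features through the interpolation threshold (n_feat = n_train).
for n_feat in [2, 5, 10, 15, 20, 25, 50, 100, 500]:
    W = rng.normal(0.0, 5.0, n_feat)
    b = rng.uniform(0.0, 2.0 * np.pi, n_feat)
    Phi_tr = random_features(x_tr, W, b)
    Phi_te = random_features(x_te, W, b)
    # Minimum-norm least-squares solution via the pseudo-inverse;
    # it interpolates the training data once n_feat >= n_train.
    theta = np.linalg.pinv(Phi_tr) @ y_tr
    train_mse = np.mean((Phi_tr @ theta - y_tr) ** 2)
    test_mse = np.mean((Phi_te @ theta - y_te) ** 2)
    print(f"features={n_feat:4d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```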
Some explanation: Another concentration phenomenon
Some directions in classical theory (not used to explain the Deep Double Descent phenomena)