1 of 12

Data Geometry and DL - Lecture 5

Generalization: some mathematical interpretations

Main sources:

“Math of Deep Learning” lecture notes, sec. 1-3

Belkin slides and papers (cited in the slides)

Chapters 2-3 of this book

These sources are used for Lectures 5-6

2 of 12

Empirical Risk Minimization – Overview of the classical theory

DREAM: the true risk over “all data”, minimized among “all functions”

REALITY: the empirical risk over the training data, minimized over a (finite-dimensional) parameter space (both objectives are written out at the bottom of this slide)

Classical theory: “what you see is what you get” (Vapnik 1991)

Model capacity/complexity (more in Lecture 6)
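A hedged sketch of the DREAM vs. REALITY contrast above, in notation chosen here (loss ℓ, data distribution P, parametrized model class {f_θ : θ ∈ ℝ^p}), which may differ from the lecturer's:

\[
\text{DREAM:}\quad \inf_{f\ \text{measurable}} R(f) = \mathbb{E}_{(x,y)\sim P}\big[\ell(f(x),y)\big],
\qquad
\text{REALITY:}\quad \min_{\theta\in\mathbb{R}^p} \widehat{R}_m(f_\theta) = \frac{1}{m}\sum_{i=1}^m \ell\big(f_\theta(x_i),y_i\big).
\]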

3 of 12

Empirical Risk Minimization – Basics

  • Hypothesis set: all the possible functions that we think might represent our data well

  • Empirical risk: (definition recalled after this list)

  • We assume i.i.d. data (law denoted “P”)

  • Risk:

  • Bayes-optimal risk value:

  • Model trained on random data
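For reference, the standard definitions these bullets point to, in one common notation (symbols chosen here; the slides' notation may differ), with the training sample (x_1, y_1), ..., (x_m, y_m) drawn i.i.d. from P:

\[
\widehat{R}_m(f) = \frac{1}{m}\sum_{i=1}^m \ell\big(f(x_i),y_i\big),
\qquad
R(f) = \mathbb{E}_{(x,y)\sim P}\big[\ell(f(x),y)\big],
\qquad
R^* = \inf_{f\ \text{measurable}} R(f).
\]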

4 of 12

Empirical Risk Minimization – Basics

  • Optimization error: depends on the efficiency of our gradient descent method (SGD)

  • Generalization error: (see the decomposition after this list)

  • Approximation error: depends on the expressivity of our hypothesis class
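These three terms come from the usual decomposition of the excess risk. The following is a sketch in the notation above (the names f_alg, f_erm, f_F are chosen here): let f_alg be the model returned by the optimizer, f_erm an empirical risk minimizer over the hypothesis class \mathcal{F}, and f_{\mathcal F} a minimizer of the true risk over \mathcal{F}. Then

\[
R(f_{\mathrm{alg}}) - R^*
= \underbrace{R(f_{\mathrm{alg}}) - R(f_{\mathrm{erm}})}_{\text{optimization error}}
+ \underbrace{R(f_{\mathrm{erm}}) - R(f_{\mathcal F})}_{\text{generalization (estimation) error}}
+ \underbrace{R(f_{\mathcal F}) - R^*}_{\text{approximation error}}.
\]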

5 of 12

Empirical Risk Minimization – Basics

  • Main tool: concentration inequalities (a prototypical example is recalled below)

Proofs: see (paper) or the book
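As the prototypical example referred to above, here is Hoeffding's inequality for a single fixed hypothesis f and a [0,1]-valued loss (the application to R and R̂_m is our illustration, in the notation introduced earlier):

\[
\mathbb{P}\Big(\big|\widehat{R}_m(f) - R(f)\big| \ge t\Big) \;\le\; 2\exp\!\big(-2mt^2\big)
\qquad\text{for every } t>0.
\]

The next slides are about making this pointwise bound uniform over the whole hypothesis class.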

6 of 12

Empirical Risk Minimization – Basics

  • For DNNs, we need different controls on the size of the hypothesis space

Proofs: see (paper) or the book

→ how rich the hypotheses’ values can be over an m-tuple of points

→ the largest m such that every 2-label classification task on some set of m points can be realized by the hypotheses (both notions are written out below)

(here is the paper)
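The two arrows above presumably refer to the growth function and the VC dimension (this identification is our reading of the slide). For a class \mathcal{H} of binary-valued hypotheses, the standard definitions are

\[
\Pi_{\mathcal H}(m) = \max_{x_1,\dots,x_m}\Big|\big\{(h(x_1),\dots,h(x_m)) : h\in\mathcal H\big\}\Big|,
\qquad
\mathrm{VCdim}(\mathcal H) = \max\{\, m : \Pi_{\mathcal H}(m) = 2^m \,\}.
\]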

7 of 12

Proof + summary

1. McDiarmid’s inequality gives (outside an event of probability δ):

2. Symmetrization: introduce a symmetrized, independent (ghost) sample
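A hedged sketch of steps 1-2 in one common notation (exact constants and form may differ from the slides), for a [0,1]-valued loss. McDiarmid’s bounded-differences inequality, applied to Z = sup_{f∈F} (R(f) − R̂_m(f)) (changing one sample point moves Z by at most 1/m), gives, outside an event of probability δ,

\[
\sup_{f\in\mathcal F}\big(R(f)-\widehat{R}_m(f)\big)
\;\le\;
\mathbb{E}\Big[\sup_{f\in\mathcal F}\big(R(f)-\widehat{R}_m(f)\big)\Big]
+\sqrt{\frac{\log(1/\delta)}{2m}}\,,
\]

and symmetrization against an independent ghost sample, with i.i.d. random signs \varepsilon_i \in \{\pm1\}, bounds the expectation by (twice) a Rademacher complexity:

\[
\mathbb{E}\Big[\sup_{f\in\mathcal F}\big(R(f)-\widehat{R}_m(f)\big)\Big]
\;\le\;
2\,\mathbb{E}\Big[\sup_{f\in\mathcal F}\frac{1}{m}\sum_{i=1}^m \varepsilon_i\,\ell\big(f(x_i),y_i\big)\Big].
\]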

8 of 12

Proof + summary

3. Chaining (see Chapter 5 in Van Handel’s notes)

Bound via covering number growth for this (Lipschitz-process-induced) distance:

If the loss is [0,1]-valued, then

Then use a packing bound:
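A hedged sketch of these two ingredients, with constants that vary by source: Dudley’s entropy-integral (chaining) bound on the empirical Rademacher complexity of the loss class \mathcal G = \ell\circ\mathcal F,

\[
\widehat{\mathfrak R}_m(\mathcal G)\;\lesssim\;\frac{1}{\sqrt m}\int_0^{\infty}\sqrt{\log N\big(\mathcal G,\varepsilon,L^2(\widehat P_m)\big)}\;d\varepsilon,
\]

where N(\mathcal G,\varepsilon,d) is the ε-covering number in the empirical L^2 distance, together with the standard comparison with packing numbers M,

\[
M(\mathcal G, 2\varepsilon) \;\le\; N(\mathcal G, \varepsilon) \;\le\; M(\mathcal G, \varepsilon).
\]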

9 of 12

Overparametrization helps: Deep Double Descent

More experiments (paper)
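Double descent already shows up in the simplest model, which is also the setting of the next slide. The following is a minimal sketch, not the experimental setup of the cited paper: minimum-norm least squares with an increasing number of features (all names and sizes below are chosen just for this demo).

    import numpy as np

    rng = np.random.default_rng(0)
    m_train, m_test, d_total = 40, 500, 200

    # Ground-truth linear model over d_total features (sizes chosen for the demo).
    theta_star = rng.normal(size=d_total) / np.sqrt(d_total)
    X_train = rng.normal(size=(m_train, d_total))
    X_test = rng.normal(size=(m_test, d_total))
    y_train = X_train @ theta_star + 0.1 * rng.normal(size=m_train)
    y_test = X_test @ theta_star

    # Fit least squares using only the first p features, for increasing p.
    # The test error typically peaks near p = m_train (the interpolation
    # threshold) and decreases again as p grows: double descent.
    for p in (5, 10, 20, 35, 40, 45, 60, 100, 200):
        theta_hat, *_ = np.linalg.lstsq(X_train[:, :p], y_train, rcond=None)
        test_mse = np.mean((X_test[:, :p] @ theta_hat - y_test) ** 2)
        print(f"p = {p:3d}   test MSE = {test_mse:.3f}")

Note that np.linalg.lstsq returns the minimum-ℓ2-norm solution in the overparametrized regime p > m_train, which is what produces the second descent.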

10 of 12

Some explanation: Another concentration phenomenon

Linear regression → goal: find θ*

Then we get the following bounds:

This holds when k*/m is small, assuming X is normally distributed (c is a universal constant).
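Presumably the estimator analyzed here is the minimum-ℓ2-norm interpolant (this identification, and the notation, are our assumption). For reference, in the overparametrized regime d > m, with X ∈ ℝ^{m×d} the matrix of training inputs (XX^⊤ invertible) and y ∈ ℝ^m the responses, it has the closed form

\[
\widehat\theta \;=\; X^{\top}\big(XX^{\top}\big)^{-1}y \;=\; \operatorname*{arg\,min}\big\{\,\|\theta\|_2 \;:\; X\theta = y\,\big\},
\]

and it is also the solution reached by gradient descent on the least-squares objective initialized at θ = 0.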

  • Paper with detailed proofs for this (linear regression) case
  • Paper with more general new phenomena and heuristic hypotheses

11 of 12

Some directions of classical theory (not used for the deep double descent phenomena)

  • Neural tangent kernels: as the width of a layer tends to infinity
    • the layer tends to an inner product (kernel) on a Hilbert space — the kernel is written out after this list
    • SGD becomes a (time-discretized) gradient flow
  • Margin theory: Balance between risk and complexity (generalization vs approximation errors)
  • Implicit regularization: the SGD iterates align with the maximum-margin predictor
  • Approximation bounds in different norms
  • Different measures of complexity (counting affine pieces – see Lecture 6)
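For the first bullet, a reminder of the kernel in question (notation chosen here): for a network f(·;θ) with parameters θ, the neural tangent kernel at initialization θ_0 is

\[
K_{\mathrm{NTK}}(x,x') \;=\; \big\langle \nabla_\theta f(x;\theta_0),\; \nabla_\theta f(x';\theta_0)\big\rangle,
\]

and in the infinite-width limit (with the appropriate parametrization) this kernel concentrates around a deterministic limit and stays approximately constant during training.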

12 of 12