1 of 12

Data Geometry and DL - Lecture 5

Generalization: some mathematical interpretations

Main sources:

“Math of Deep Learning” lecture notes, sec. 1-3

Belkin slides and papers (cited in the slides)

Chapters 2-3 of this book

These sources are used for Lectures 5-6

2 of 12

Empirical Risk Minimization – Overview of the classical theory

DREAM: the true risk over “all data”, minimized among “all functions”

REALITY: the empirical risk over the training data, minimized over a (finite-dimensional) parameter space (both objectives are written out at the bottom of this slide)

Classical theory: “what you see is what you get” (Vapnik 1991)

Model capacity/complexity (more in Lecture 6)
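A hedged sketch of the DREAM vs. REALITY contrast above, in notation chosen here (loss ℓ, data distribution P, parametrized model class {f_θ : θ ∈ ℝ^p}), which may differ from the lecturer's:

\[
\text{DREAM:}\quad \inf_{f\ \text{measurable}} R(f) = \mathbb{E}_{(x,y)\sim P}\big[\ell(f(x),y)\big],
\qquad
\text{REALITY:}\quad \min_{\theta\in\mathbb{R}^p} \widehat{R}_m(f_\theta) = \frac{1}{m}\sum_{i=1}^m \ell\big(f_\theta(x_i),y_i\big).
\]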

3 of 12

Empirical Risk Minimization – Basics

  • Hypothesis set: all the possible functions that we think might represent our data well

  • Empirical risk: (definition recalled after this list)

  • We assume i.i.d. data (law denoted “P”)

  • Risk:

  • Bayes-optimal risk value:

  • Model trained on random data
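For reference, the standard definitions these bullets point to, in one common notation (symbols chosen here; the slides' notation may differ), with the training sample (x_1, y_1), ..., (x_m, y_m) drawn i.i.d. from P:

\[
\widehat{R}_m(f) = \frac{1}{m}\sum_{i=1}^m \ell\big(f(x_i),y_i\big),
\qquad
R(f) = \mathbb{E}_{(x,y)\sim P}\big[\ell(f(x),y)\big],
\qquad
R^* = \inf_{f\ \text{measurable}} R(f).
\]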

4 of 12

Empirical Risk Minimization – Basics

  • Optimization error: depends on the efficiency of our gradient descent method (SGD)

  • Generalization error: (see the decomposition after this list)

  • Approximation error: depends on the expressivity of our hypothesis class
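These three terms come from the usual decomposition of the excess risk. The following is a sketch in the notation above (the names f_alg, f_erm, f_F are chosen here): let f_alg be the model returned by the optimizer, f_erm an empirical risk minimizer over the hypothesis class \mathcal{F}, and f_{\mathcal F} a minimizer of the true risk over \mathcal{F}. Then

\[
R(f_{\mathrm{alg}}) - R^*
= \underbrace{R(f_{\mathrm{alg}}) - R(f_{\mathrm{erm}})}_{\text{optimization error}}
+ \underbrace{R(f_{\mathrm{erm}}) - R(f_{\mathcal F})}_{\text{generalization (estimation) error}}
+ \underbrace{R(f_{\mathcal F}) - R^*}_{\text{approximation error}}.
\]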

5 of 12

Empirical Risk Minimization – Basics

  • Main tool: concentration inequalities (a prototypical example is recalled below)

Proofs: see (paper) or the book
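As the prototypical example referred to above, here is Hoeffding's inequality for a single fixed hypothesis f and a [0,1]-valued loss (the application to R and R̂_m is our illustration, in the notation introduced earlier):

\[
\mathbb{P}\Big(\big|\widehat{R}_m(f) - R(f)\big| \ge t\Big) \;\le\; 2\exp\!\big(-2mt^2\big)
\qquad\text{for every } t>0.
\]

The next slides are about making this pointwise bound uniform over the whole hypothesis class.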

6 of 12

Empirical Risk Minimization – Basics

  • For DNNs, we need different controls on the size of the hypothesis space

Proofs: see (paper) or the book

→ how rich the hypotheses’ values can be over an m-tuple of points

→ the largest m such that every 2-label classification task on some set of m points can be realized by the hypotheses (both notions are written out below)

(here is the paper)
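The two arrows above presumably refer to the growth function and the VC dimension (this identification is our reading of the slide). For a class \mathcal{H} of binary-valued hypotheses, the standard definitions are

\[
\Pi_{\mathcal H}(m) = \max_{x_1,\dots,x_m}\Big|\big\{(h(x_1),\dots,h(x_m)) : h\in\mathcal H\big\}\Big|,
\qquad
\mathrm{VCdim}(\mathcal H) = \max\{\, m : \Pi_{\mathcal H}(m) = 2^m \,\}.
\]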

7 of 12

Proof + summary

1. McDiarmid’s inequality gives (outside an event of probability δ):

2. Symmetrization: introduce a symmetrized, independent (ghost) sample
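A hedged sketch of steps 1-2 in one common notation (exact constants and form may differ from the slides), for a [0,1]-valued loss. McDiarmid’s bounded-differences inequality, applied to Z = sup_{f∈F} (R(f) − R̂_m(f)) (changing one sample point moves Z by at most 1/m), gives, outside an event of probability δ,

\[
\sup_{f\in\mathcal F}\big(R(f)-\widehat{R}_m(f)\big)
\;\le\;
\mathbb{E}\Big[\sup_{f\in\mathcal F}\big(R(f)-\widehat{R}_m(f)\big)\Big]
+\sqrt{\frac{\log(1/\delta)}{2m}}\,,
\]

and symmetrization against an independent ghost sample, with i.i.d. random signs \varepsilon_i \in \{\pm1\}, bounds the expectation by (twice) a Rademacher complexity:

\[
\mathbb{E}\Big[\sup_{f\in\mathcal F}\big(R(f)-\widehat{R}_m(f)\big)\Big]
\;\le\;
2\,\mathbb{E}\Big[\sup_{f\in\mathcal F}\frac{1}{m}\sum_{i=1}^m \varepsilon_i\,\ell\big(f(x_i),y_i\big)\Big].
\]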

8 of 12

Proof + summary

3. Chaining (see Chapter 5 in Van Handel’s notes)

Bound via covering number growth for this (Lipschitz-process-induced) distance:

If the loss is [0,1]-valued, then

Then use a packing bound:
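A hedged sketch of these two ingredients, with constants that vary by source: Dudley’s entropy-integral (chaining) bound on the empirical Rademacher complexity of the loss class \mathcal G = \ell\circ\mathcal F,

\[
\widehat{\mathfrak R}_m(\mathcal G)\;\lesssim\;\frac{1}{\sqrt m}\int_0^{\infty}\sqrt{\log N\big(\mathcal G,\varepsilon,L^2(\widehat P_m)\big)}\;d\varepsilon,
\]

where N(\mathcal G,\varepsilon,d) is the ε-covering number in the empirical L^2 distance, together with the standard comparison with packing numbers M,

\[
M(\mathcal G, 2\varepsilon) \;\le\; N(\mathcal G, \varepsilon) \;\le\; M(\mathcal G, \varepsilon).
\]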

9 of 12

Overparametrization helps: Deep Double Descent

More experiments (paper)
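Double descent already shows up in the simplest model, which is also the setting of the next slide. The following is a minimal sketch, not the experimental setup of the cited paper: minimum-norm least squares with an increasing number of features (all names and sizes below are chosen just for this demo).

    import numpy as np

    rng = np.random.default_rng(0)
    m_train, m_test, d_total = 40, 500, 200

    # Ground-truth linear model over d_total features (sizes chosen for the demo).
    theta_star = rng.normal(size=d_total) / np.sqrt(d_total)
    X_train = rng.normal(size=(m_train, d_total))
    X_test = rng.normal(size=(m_test, d_total))
    y_train = X_train @ theta_star + 0.1 * rng.normal(size=m_train)
    y_test = X_test @ theta_star

    # Fit least squares using only the first p features, for increasing p.
    # The test error typically peaks near p = m_train (the interpolation
    # threshold) and decreases again as p grows: double descent.
    for p in (5, 10, 20, 35, 40, 45, 60, 100, 200):
        theta_hat, *_ = np.linalg.lstsq(X_train[:, :p], y_train, rcond=None)
        test_mse = np.mean((X_test[:, :p] @ theta_hat - y_test) ** 2)
        print(f"p = {p:3d}   test MSE = {test_mse:.3f}")

Note that np.linalg.lstsq returns the minimum-ℓ2-norm solution in the overparametrized regime p > m_train, which is what produces the second descent.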

10 of 12

Some explanation: Another concentration phenomenon

Linear regression → goal: find θ*

Then we get the following bounds:

This holds when k*/m is small, assuming X is normally distributed (c is a universal constant).
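Presumably the estimator analyzed here is the minimum-ℓ2-norm interpolant (this identification, and the notation, are our assumption). For reference, in the overparametrized regime d > m, with X ∈ ℝ^{m×d} the matrix of training inputs (XX^⊤ invertible) and y ∈ ℝ^m the responses, it has the closed form

\[
\widehat\theta \;=\; X^{\top}\big(XX^{\top}\big)^{-1}y \;=\; \operatorname*{arg\,min}\big\{\,\|\theta\|_2 \;:\; X\theta = y\,\big\},
\]

and it is also the solution reached by gradient descent on the least-squares objective initialized at θ = 0.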

  • Paper with detailed proofs for this (linear regression) case
  • Paper with more general new phenomena and heuristic hypotheses

11 of 12

Some directions of classical theory (not used for the deep double descent phenomena)

  • Neural tangent kernels: as the width of a layer tends to infinity
    • the layer tends to an inner product (kernel) on a Hilbert space — the kernel is written out after this list
    • SGD becomes a (time-discretized) gradient flow
  • Margin theory: Balance between risk and complexity (generalization vs approximation errors)
  • Implicit regularization: the SGD iterates align with the maximum-margin predictor
  • Approximation bounds in different norms
  • Different measures of complexity (counting affine pieces – see Lecture 6)
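For the first bullet, a reminder of the kernel in question (notation chosen here): for a network f(·;θ) with parameters θ, the neural tangent kernel at initialization θ_0 is

\[
K_{\mathrm{NTK}}(x,x') \;=\; \big\langle \nabla_\theta f(x;\theta_0),\; \nabla_\theta f(x';\theta_0)\big\rangle,
\]

and in the infinite-width limit (with the appropriate parametrization) this kernel concentrates around a deterministic limit and stays approximately constant during training.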

12 of 12