2 of 105

Logistics:

Homework 3 is due (again) tonight!
So is the project proposal

The proposal can change over the next couple of weeks
Just want to make sure you are thinking about project ideas and have started to work on something concrete
Office hours today 11:30-12:30 if you want to check in

Homework 4 out tonight (really, I promise, no joke)

Optical flow
OpenCV optional to do video processing (webcam demo, etc.)

4 of 105

Softmax: normalized exponential

Generalization of logistic

Input: vector of reals
Output: probability distribution

softmax([1,2,7,3,2]):
 Calculate e^x: [2.72, 7.39, 1096.63, 20.09, 7.39]
 Calculate sum(e^x): 2.72+7.39+1096.63+20.09+7.39 = 1134.22
 Normalize: e^x/sum(e^x) = [0.002, 0.007, 0.967, 0.018, 0.007]
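The three steps above (exponentiate, sum, normalize) can be sketched in a few lines of Python. This is a minimal illustration, not course code; subtracting the maximum before exponentiating is an added numerical-stability trick that does not change the result.

```python
import math

def softmax(xs):
    """Map a list of reals to a probability distribution."""
    # Subtracting the max leaves the result unchanged but avoids overflow in exp.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1, 2, 7, 3, 2])
print([round(p, 3) for p in probs])   # probabilities, largest for the input 7
print(sum(probs))                     # sums to 1 up to floating-point error
```

The largest input dominates because the exponential amplifies differences between inputs.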

5 of 105

Multinomial logistic regression

The probability that a data point belongs to a class is the normalized (softmax) weighted sum of the input variables with the learned weights.

softmax(wx + b)
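A small sketch of softmax(wx + b) for several classes, assuming one row of weights and one bias per class. The 3-class weights and the input below are made-up numbers for illustration only.

```python
import math

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(W, b, x):
    """Class probabilities softmax(Wx + b); W holds one weight row per class."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b_k
              for row, b_k in zip(W, b)]
    return softmax(scores)

# Hypothetical 3-class model over 2 input features.
W = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
b = [0.0, 0.1, -0.2]
p = predict_proba(W, b, [2.0, 1.0])
print(p, "predicted class:", p.index(max(p)))
```

The predicted class is simply the one with the highest weighted sum; softmax only rescales the scores into probabilities.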

6 of 105

Multinomial logistic regression

https://www.tensorflow.org/versions/r1.1/get_started/mnist/beginners

The probability that a data point belongs to a class is the normalized (softmax) weighted sum of the input variables with the learned weights.

7 of 105

MNIST: Handwriting recognition

50,000 images of handwriting
28 x 28 x 1 (grayscale)
Numbers 0-9

10-class softmax regression
Input is 784 pixel values
Train with SGD
> 95% accuracy

8 of 105

Support Vector Machine (SVM)

Find max-margin classifier. Examples on the margin are supporting data points, support vectors.

min ||w||₂
s.t. y_n(w·x_n - b) ≥ 1, n = 1, 2, ...

Or: minimize weights such that margin for each point is at least 1
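To make the constraint concrete, here is a small sketch that evaluates y_n(w·x_n - b) for a candidate (w, b) and picks out the points that sit exactly on the margin. The data and weights are made up, and this only checks feasibility; it does not solve the optimization problem.

```python
def margins(w, b, X, y):
    """Return y_n * (w . x_n - b) for every labeled point (labels are +1 or -1)."""
    return [y_n * (sum(w_i * x_i for w_i, x_i in zip(w, x_n)) - b)
            for x_n, y_n in zip(X, y)]

def is_feasible(w, b, X, y):
    """True when every point meets the hard-margin constraint (margin >= 1)."""
    return all(m >= 1 for m in margins(w, b, X, y))

# Made-up 2-D data: two positives, two negatives.
X = [[2.0, 2.0], [4.0, 4.0], [0.0, 0.0], [-2.0, -2.0]]
y = [1, 1, -1, -1]
w, b = [0.5, 0.5], 1.0

ms = margins(w, b, X, y)
# Points whose margin is exactly 1 are the support vectors for this (w, b).
support = [x for x, m in zip(X, ms) if abs(m - 1.0) < 1e-9]
print(ms, support, is_feasible(w, b, X, y))
```

Shrinking w further would lower ||w||₂ but push the closest points below margin 1, which is why those points determine the solution.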

9 of 105

Case study: Person detection

Dalal and Triggs ‘05:
Train SVM on HOG features of image
2 classes, person/not person

At test time:
Extract HOG features at many scales
Run SVM classifier at every location
High responses = person?

10 of 105

Case study: Person detection

Dalal and Triggs is a sliding window detector

Many scales
Every location

10k+ classifier evaluations per image.

Person? No
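A rough sketch of why a sliding window detector needs so many classifier evaluations: count every window position at every scale of an image pyramid. The window size, stride, and scale step below are illustrative assumptions, not values taken from the slides.

```python
def count_windows(img_w, img_h, win_w=64, win_h=128, stride=8, scale_step=1.2):
    """Count sliding-window positions over a simple image pyramid."""
    total = 0
    scale = 1.0
    while True:
        # Shrink the image; stop once the window no longer fits.
        w, h = int(img_w / scale), int(img_h / scale)
        if w < win_w or h < win_h:
            break
        total += ((w - win_w) // stride + 1) * ((h - win_h) // stride + 1)
        scale *= scale_step
    return total

print(count_windows(640, 480))            # thousands of windows
print(count_windows(640, 480, stride=4))  # a finer stride costs several times more
```

Each counted window is one SVM evaluation, so the count grows quickly with image size and with finer strides or scale steps.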

11 of 105

Case study: Deformable parts models

http://cs.brown.edu/people/pfelzens/papers/lsvm-pami.pdf

Objects have parts, learn to recognize parts and where they are

Latent SVM: Learn part appearances and locations without explicit data

Hard negative mining: rebalance classes for sliding window detectors

12 of 105

Case study: Image classification

https://lear.inrialpes.fr/~verbeek/mlcr.slides.11.12/sanchez11cvpr.pdf

Given an image, what’s in it?

Old state-of-the-art:
Extract features from image
 SIFT and Fisher Vectors
Train Linear SVM

On 1000 different classes, 54% accurate

13 of 105

What’s wrong with this?

Machine learning needs features!!

What are the right features?
HOG?
SIFT?
FV?

Why not let the algorithm decide

Neural networks: Feature extraction + linear model

14 of 105

Success of Neural Networks

Image classification:

54% -> 80% accuracy on 1000 classes

Object detection:

33% mAP (DPM) -> 88% mAP on 20 classes

16 of 105

What is feature engineering?

Arguably the core problem of machine learning (especially in practice)

ML models work well if there is a clear relationship between the inputs and outputs of the function you are trying to model

20 of 105

Linear model can’t do this

Cannot learn transformations of features, only use existing features. Human must create good features

21 of 105

What if we added more processing?

Generally, feature engineering is just coming up with combinations of the features we already have

22 of 105

What if we added more processing?

Create “new” features using old ones. We’ll call H our hidden layer

23 of 105

What if we added more processing?

As with linear model, H can be expressed in matrix operations

26 of 105

What if we added more processing?

Now our prediction p is a function of our hidden layer

Feature extractor

Linear model

27 of 105

What if we added more processing?

Can still express the whole process in matrix notation! Nice because matrix ops are fast

28 of 105

This is a neural network!

This one has 1 hidden layer, but can have way more
Each layer is just some function φ applied to a linear combination of the previous layer
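A minimal sketch of this forward pass with one hidden layer, assuming a sigmoid for φ and a linear output. The weights and the names `layer` and `forward` are hypothetical, chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, phi):
    """One layer: phi applied to a linear combination of the previous layer.
    weights[j] holds the incoming weights of unit j."""
    return [phi(sum(w * x for w, x in zip(row, inputs))) for row in weights]

def forward(x, W, v, phi=sigmoid):
    """Hidden layer h = phi(xW), then a linear output p = h . v."""
    h = layer(x, W, phi)
    p = sum(v_j * h_j for v_j, h_j in zip(v, h))
    return h, p

x = [1.0, 2.0]
W = [[0.1, 0.2], [0.3, -0.4], [-0.5, 0.6]]   # 3 hidden units, 2 inputs each
v = [1.0, -1.0, 0.5]
h, p = forward(x, W, v)
print(h, p)
```

Stacking more layers just means calling `layer` again on the previous layer's output.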

30 of 105

φ is our activation function

Want to apply some extra processing at each layer. Why?
 Imagine φ(x) = x, linear activation

p = v₁h₁ + v₂h₂ + v₃h₃

But h₁ = x₁w₁ + x₂w₂, h₂ = … etc
So
 p = v₁w₁x₁ + v₁w₂x₂ + v₂w₃x₁ + v₂w₄x₂ + v₃w₅x₁ + v₃w₆x₂
   = (v₁w₁+v₂w₃+v₃w₅)x₁ + (v₁w₂+v₂w₄+v₃w₆)x₂
   = u₁x₁ + u₂x₂
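The algebra above says a network with linear activation is just a linear model with weights u. A quick numeric check, with made-up weights, makes that concrete:

```python
def two_layer_linear(x, W, v):
    """Network with identity activation: p = sum_j v_j * (W_j . x)."""
    h = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return sum(vj * hj for vj, hj in zip(v, h))

def collapse(W, v):
    """Fold both layers into one weight vector u, with u_i = sum_j v_j * W[j][i]."""
    n_inputs = len(W[0])
    return [sum(vj * row[i] for vj, row in zip(v, W)) for i in range(n_inputs)]

W = [[0.1, 0.2], [0.3, -0.4], [-0.5, 0.6]]   # rows: (w1, w2), (w3, w4), (w5, w6)
v = [1.0, -1.0, 0.5]
u = collapse(W, v)

x = [3.0, -2.0]
print(two_layer_linear(x, W, v), u[0] * x[0] + u[1] * x[1])   # same value both ways
```

So without a nonlinear φ, the hidden layer adds parameters but no modeling power.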

31 of 105

Universal approximation theorem

https://en.wikipedia.org/wiki/Universal_approximation_theorem

What if φ not linear?

Universal approximation theorem (Cybenko 89, Hornik 91)
 φ: any nonconstant, bounded, monotonically increasing function
 I_m: m-dimensional unit hypercube, [0, 1]^m
 Then a neural network with 1 hidden layer and φ as activation can approximate any continuous function f: I_m -> R to any desired accuracy
 (no bound on size of hidden layer)

By extension, works on f: bounded R^m -> R

What can we learn? What can’t we?

UAT just says it’s possible to model, not how.

32 of 105

How do we learn it?

Neural networks are non-convex with no closed form solution (can’t take derivative and set = 0)

Gradient descent! Recall for linear model:

33 of 105

How do we learn it?

With gradient descent we calculate the partial derivatives of the loss (or likelihood) function for every weight: ∂/∂w_i log L(w)

Then do gradient descent by subtracting the gradient from the weight when minimizing a loss (or gradient ascent by adding it when maximizing a likelihood)

35 of 105

How do we learn it?

Simple example, say we have a data point [10, 1] and we predict some p. We also know the “correct” label Y. Maybe our prediction p is too small and we want to make it larger. How do we adjust w?

We adjust w₁ much more than w₂, why?
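One way to see the answer: for a linear model p = w·x, the derivative of p with respect to w_i is x_i, so each weight moves in proportion to its input. The sketch below assumes a squared-error loss ½(y - p)² and a made-up target and learning rate.

```python
def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def gradient_step(w, x, y, lr):
    """One descent step on 1/2 (y - p)^2 for p = w . x.
    dL/dw_i = -(y - p) * x_i, so the update is w_i + lr * (y - p) * x_i."""
    err = y - predict(w, x)
    return [wi + lr * err * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
x, y = [10.0, 1.0], 1.0          # data point [10, 1]; the target y is made up
new_w = gradient_step(w, x, y, lr=0.001)
print(new_w)                     # w1 changes ten times as much as w2
print(predict(w, x), predict(new_w, x))
```

With a negative input the same rule moves that weight in the opposite direction, since increasing it would make p smaller.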

36 of 105

How do we learn it?

Simple example, say we have a data point [-1, 1] and we predict some p. We also know the “correct” label Y. Maybe our prediction p is too small and we want to make it larger. How do we adjust w?

39 of 105

How do we learn it?

Now we have a “real” neural network (using linear activation for simplicity). How do we predict p?
 Calculate hidden layer neurons
 Calculate output p

40 of 105

How do we learn it?

Say we want to make p larger. How do we modify the weights? The first layer is easy, same as normal linear model:

42 of 105

How do we learn it?

Now what? Let’s calculate the “error” that the hidden layer makes. We want p to be larger, given current weights how should we adjust the hidden layer output to do that?

44 of 105

How do we learn it?

Now that we have an “error” in our hidden layer, want to modify the previous weights. Easy again, just like our linear model.

46 of 105

Backpropagation: just taking derivatives

This is the backpropagation algorithm. It’s really just an easy way to calculate partial derivatives in a neural network. We forward-propagate information through the network, calculate our error, then backpropagate that error through network to calculate weight updates.

47 of 105

Backpropagation: just taking derivatives

This was with linear activations but the process is the same for any φ, just have to calculate φ’(x) for that neuron as well.

53 of 105

Backpropagation: the math

1-layer NN, sigmoid activation at hidden layer, linear output:

F(X) = φ(Xw)v
F(X) = φ(x₁w₁ + x₂w₂)v₁ + φ(x₁w₃ + x₂w₄)v₂ + φ(x₁w₅ + x₂w₆)v₃

Say regression, Loss function is ½ L₂ norm, expected output is Y:
 L_{X,Y}(w,v) = ½(Y - F(X))² = ½(Y - φ(Xw)v)²

Want partial derivative of Loss w.r.t. weights, say v₂:
 ∂/∂v₂ L_{X,Y}(w,v) = ∂/∂v₂ ½(Y - φ(Xw)v)²
   = (Y - φ(Xw)v) * -[∂/∂v₂ φ(Xw)v]
   = (Y - φ(Xw)v) * -[∂/∂v₂ φ(x₁w₃ + x₂w₄)v₂]
   = (Y - φ(Xw)v) * -φ(x₁w₃ + x₂w₄)

54 of 105

Backpropagation: the math

Want partial derivative of Loss w.r.t. weights, say v₂:
 ∂/∂v₂ L_{X,Y}(w,v) = (Y - φ(Xw)v) * -φ(x₁w₃ + x₂w₄)

56 of 105

Backpropagation: the math

Want partial derivative of Loss w.r.t. weights, say v₂ (with h₂ = φ(x₁w₃ + x₂w₄)):
 ∂/∂v₂ L_{X,Y}(w,v) = -(Y - φ(Xw)v)*h₂

Weight update rule (remember descend on loss):
 v₂ = v₂ + η(Y - φ(Xw)v)*h₂
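The update rule for the output weights can be sketched directly. This assumes the same 2-input, 3-hidden-unit network with sigmoid φ; the weights, target, and learning rate are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W, v):
    """Return hidden activations h = phi(xW) and the linear output p = h . v."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W]
    return h, sum(vj * hj for vj, hj in zip(v, h))

def loss(x, y, W, v):
    return 0.5 * (y - forward(x, W, v)[1]) ** 2

def update_output_weights(x, y, W, v, lr):
    """v_j <- v_j + lr * (y - p) * h_j, a step against dL/dv_j = -(y - p) * h_j."""
    h, p = forward(x, W, v)
    return [vj + lr * (y - p) * hj for vj, hj in zip(v, h)]

x, y = [1.0, 2.0], 1.0
W = [[0.1, 0.2], [0.3, -0.4], [-0.5, 0.6]]
v = [0.2, -0.1, 0.4]
new_v = update_output_weights(x, y, W, v, lr=0.1)
print(loss(x, y, W, v), loss(x, y, W, new_v))   # loss goes down after the step
```

Only v changes here; the hidden-layer weights need the extra chain-rule factors.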

60 of 105

Backpropagation: the math

1-layer NN, sigmoid activation at hidden layer, linear output:

F(X) = φ(Xw)v
F(X) = φ(x₁w₁ + x₂w₂)v₁ + φ(x₁w₃ + x₂w₄)v₂ + φ(x₁w₅ + x₂w₆)v₃

Say regression, Loss function is ½ L₂ norm, expected output is Y:
 L_{X,Y}(w,v) = ½(Y - F(X))² = ½(Y - φ(Xw)v)²

Want partial derivative of Loss w.r.t. weights, say w₂:
 ∂/∂w₂ L_{X,Y}(w,v) = ∂/∂w₂ ½(Y - φ(Xw)v)²
   = (Y - φ(Xw)v) * -[∂/∂w₂ φ(Xw)v]
   = (Y - φ(Xw)v) * -[∂/∂w₂ φ(x₁w₁ + x₂w₂)v₁]
   = (Y - φ(Xw)v) * -v₁[∂/∂w₂ φ(x₁w₁ + x₂w₂)]
   = (Y - φ(Xw)v) * -v₁φ’(x₁w₁+x₂w₂)[∂/∂w₂ (x₁w₁ + x₂w₂)]

Chain rule!

If F(x) = f(g(x))

F’(x) = f’(g(x))g’(x)

66 of 105

Backpropagation: the math

Want partial derivative of Loss w.r.t. weights, say w₂:
 ∂/∂w₂ L_{X,Y}(w,v) = (Y - φ(Xw)v) * -v₁φ’(x₁w₁+x₂w₂) * x₂

Model error at p

Backpropagate through v₁

Model error at h₁

Multiply by x₂: gradient w.r.t. w₂

76 of 105

Backpropagation: the math

∂L/∂p = -(Y - φ(Xw)v)

∂p/∂v₁ = h₁

∂p/∂h₁ = v₁

∂h₁/∂(w₁x₁ + w₂x₂) = φ’(x₁w₁+x₂w₂)

∂(w₁x₁ + w₂x₂)/∂w₂ = x₂

How do we update v₁?
∂L/∂v₁ = ∂p/∂v₁ * ∂L/∂p = h₁ * -(Y - φ(Xw)v)

82 of 105

Backpropagation: the math

∂L/∂p = -(Y - φ(Xw)v)

∂p/∂v₁ = h₁

∂p/∂h₁ = v₁

∂h₁/∂(w₁x₁ + w₂x₂) = φ’(x₁w₁+x₂w₂)

∂(w₁x₁ + w₂x₂)/∂w₂ = x₂

How do we update w₂?
∂L/∂w₂ = ∂(w₁x₁ + w₂x₂)/∂w₂ * ∂h₁/∂(w₁x₁ + w₂x₂) * ∂p/∂h₁ * ∂L/∂p
       = x₂ * φ’(x₁w₁+x₂w₂) * v₁ * -(Y - φ(Xw)v)
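The chain of partial derivatives can be written out as a short backward pass. This is a sketch under assumptions: sigmoid φ (so φ’(z) = h(1 - h)), loss ½(Y - p)², and made-up weights. The names `forward`, `loss`, and `backprop` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W, v):
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W]  # hidden layer
    p = sum(vj * hj for vj, hj in zip(v, h))                          # linear output
    return h, p

def loss(x, y, W, v):
    return 0.5 * (y - forward(x, W, v)[1]) ** 2

def backprop(x, y, W, v):
    """Gradients of 1/2 (y - p)^2 w.r.t. v and W, one chain-rule factor per line."""
    h, p = forward(x, W, v)
    dL_dp = -(y - p)                                           # error at the output
    grad_v = [dL_dp * hj for hj in h]                          # dp/dv_j = h_j
    dL_dh = [dL_dp * vj for vj in v]                           # dp/dh_j = v_j
    dL_dz = [d * hj * (1.0 - hj) for d, hj in zip(dL_dh, h)]   # sigmoid'(z) = h(1 - h)
    grad_W = [[dz * xi for xi in x] for dz in dL_dz]           # dz_j/dw = x_i
    return grad_v, grad_W

x, y = [1.0, 2.0], 1.0
W = [[0.1, 0.2], [0.3, -0.4], [-0.5, 0.6]]   # W[0][1] plays the role of w2
v = [0.2, -0.1, 0.4]
grad_v, grad_W = backprop(x, y, W, v)
print(grad_v[0], grad_W[0][1])               # dL/dv1 and dL/dw2
```

A useful sanity check for any hand-derived gradient is to compare it against a finite difference of the loss, nudging one weight by a small amount.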

83 of 105

Backpropagation: the math

∂L/∂p

∂L/∂v₁

84 of 105

Backpropagation: the math

∂L/∂p

∂L/∂h₁

85 of 105

Backpropagation: the math

∂L/∂h₁

∂L/∂(w₁x₁ + w₂x₂)

φ’(x₁w₁+x₂w₂)

86 of 105

Backpropagation: the math

∂L/∂(w₁x₁ + w₂x₂)

∂L/∂w₂

87 of 105

Forward propagation

88 of 105

Backward propagation

89 of 105

Weight updates

90 of 105

Under and Overfitting

Underfitting: model not powerful enough, too much bias

Overfitting: model too powerful, fits to noise, doesn’t generalize well

Want the happy medium, how?

91 of 105

Under and Overfitting

Want the happy medium, how?

Pick the right model, but very hard to know a priori

Make weak model more powerful: boosting! (or other ways)

Make strong model less likely to overfit: regularization

92 of 105

With great power comes great overfitting

Neural networks are (sort of) all powerful! Which is not necessarily a good thing.

93 of 105

With great power comes great overfitting

Like SVMs, put limits on model that make it generalize better!

SVM:
min ||w||₂
s.t. y_n(w·x_n - b) ≥ 1, n = 1, 2, ...

Neural net:
Minimize loss function and weight magnitude
 Before: argmin_w L_X(w)
 Now: argmin_w L_X(w) + λ ||w||₂

94 of 105

Weight decay: neural network regularization

argmin_w L_X(w) + λ ||w||₂

λ: regularization parameter
 Higher: more penalty for large weights, less powerful model
 Lower: less penalty, more overfitting

Commonly use L₂ norm to regularize, weight decay

Gradient descent update rule (taking the penalty as (λ/2)||w||₂², whose gradient is λw):
 w_{t+1} = w_t - η[∂/∂w_t L(w_t) + λw_t]
         = w_t - η∂/∂w_t L(w_t) - ηλw_t

Subtract a little bit of weight every iteration
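A tiny sketch of the update rule above. With the loss gradient set to zero, the only thing left is the decay term, so each step shrinks every weight by a factor of (1 - ηλ). The values of η and λ here are made up so the effect is visible.

```python
def sgd_step_with_decay(w, grad, lr, lam):
    """w <- w - lr * (grad + lam * w)."""
    return [wi - lr * (gi + lam * wi) for wi, gi in zip(w, grad)]

# Zero loss gradient: each step multiplies the weights by (1 - lr * lam) = 0.95.
w = [1.0, -2.0]
for _ in range(3):
    w = sgd_step_with_decay(w, [0.0, 0.0], lr=0.1, lam=0.5)
print(w)
```

In real training the loss gradient pushes back, so weights only stay large where the data supports them.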

95 of 105

Sometimes training is SLOW

With SGD we make LOTS of little steps along the gradient

Sometimes we move in the same direction for a long time…
 Maybe we should speed up in that direction!

96 of 105

Momentum: speeding up SGD

If we keep moving in same direction we should move further every round

Before:
 Δw_t = -∂/∂w_t L(w_t)

Now:
 Δw_t = -∂/∂w_t L(w_t) + mΔw_{t-1}

w_{t+1} = w_t + ηΔw_t

Side effect: smooths out updates if gradient is in different directions

101 of 105

NN updates with weight decay and momentum

Δw_t = -∂/∂w_t L(w_t) - λw_t + mΔw_{t-1}

w_{t+1} = w_t + ηΔw_t

Gradient of loss: -∂/∂w_t L(w_t)

Weight decay: -λw_t

Momentum: mΔw_{t-1}

Learning rate: η
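The combined update is short enough to write out in full. This is a minimal sketch; `momentum_step` is a hypothetical helper name, and the constant gradient is a toy setup chosen to show the step growing while the direction stays the same.

```python
def momentum_step(w, dw_prev, grad, lr, lam, m):
    """dw_t = -grad - lam * w_t + m * dw_{t-1};  w_{t+1} = w_t + lr * dw_t."""
    dw = [-g - lam * wi + m * d for wi, g, d in zip(w, grad, dw_prev)]
    new_w = [wi + lr * d for wi, d in zip(w, dw)]
    return new_w, dw

# Constant gradient, no weight decay: the step grows every round.
w, dw = [0.0], [0.0]
steps = []
for _ in range(3):
    w, dw = momentum_step(w, dw, grad=[-1.0], lr=0.01, lam=0.0, m=0.9)
    steps.append(dw[0])
print(steps, w)
```

If the gradient flips sign between rounds, the momentum term partly cancels it, which is the smoothing side effect.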

102 of 105

What about our activation functions φ

Many options, want them to be easy to take derivative

UAT holds when bounded, in practice bounds can be problematic

103 of 105

Common activation functions φ

linear

logistic

tanh

REctified Linear Unit (RELU)

Leaky RELU
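For reference, a sketch of these activations and the derivatives that backpropagation needs. The leaky ReLU slope of 0.1 is an assumed value for illustration; the slides do not give one.

```python
import math

def linear(x):
    return x

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.1):      # slope is an assumed value
    return x if x > 0 else slope * x

# Derivatives, used as the phi'(x) factor in backpropagation.
def d_linear(x):
    return 1.0

def d_logistic(x):
    s = logistic(x)
    return s * (1.0 - s)

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2

def d_relu(x):
    return 1.0 if x > 0 else 0.0

def d_leaky_relu(x, slope=0.1):
    return 1.0 if x > 0 else slope

for x in (-2.0, 0.5):
    print(x, logistic(x), math.tanh(x), relu(x), leaky_relu(x))
```

Logistic and tanh are bounded, so their derivatives shrink toward zero for large inputs, while ReLU's derivative is exactly zero for negative inputs; leaky ReLU keeps a small nonzero slope there.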

104 of 105

So many hyper parameters!!

How do we know what to use??

105 of 105

Hyper Parameter Dark Magic

What follows are the one, true, correct, and only set of hyperparameters.
Praise be the NetLord!

η = [.0001 - .01]

λ = .0005
m = .9
φ = leaky relu
