1 of 105

2 of 105

Logistics:

  • Homework 3 is due (again) tonight!
  • So is the project proposal
    • It can change in the next couple weeks
    • Just want to make sure you are thinking about project ideas and have started to work on something concrete
    • Office hours today 11:30-12:30 if you want to check in
  • Homework 4 out tonight (really, I promise, no joke)
    • Optical flow
    • OpenCV optional to do video processing (webcam demo, etc.)

3 of 105

4 of 105

Softmax: normalized exponential

Generalization of logistic

Input: vector of reals
Output: probability distribution

softmax([1,2,7,3,2]):
  Calculate e^x: [2.72, 7.39, 1096.63, 20.09, 7.39]
  Calculate sum(e^x): 2.72+7.39+1096.63+20.09+7.39 = 1134.22
  Normalize: e^x/sum(e^x) = [0.002, 0.007, 0.967, 0.018, 0.007]
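The worked example above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; subtracting the maximum before exponentiating is a common numerical-stability trick that is not on the slide, and it does not change the result.

```python
import math

def softmax(values):
    """Normalized exponential: map a list of reals to a probability distribution."""
    # Assumption (not on the slide): shift by the max so exp() cannot overflow.
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1, 2, 7, 3, 2])
print([round(p, 3) for p in probs])  # the largest input gets almost all the mass
```

The outputs are non-negative and sum to 1, which is what lets us read them as class probabilities.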

5 of 105

Multinomial logistic regression

The probability that a data point belongs to a class is the normalized (softmax) weighted sum of the input variables with the learned weights.

softmax(wx + b)
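As a rough sketch of what softmax(wx + b) computes, the snippet below scores one input against each class and normalizes the scores. The 3-class, 2-feature weights `W`, biases `b`, and the helper name `predict` are made up for illustration; they are not from the lecture.

```python
import math

def softmax(scores):
    """Normalized exponential over a list of class scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(W, b, x):
    """Class probabilities softmax(Wx + b) for a single input vector x."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b_k
              for row, b_k in zip(W, b)]
    return softmax(scores)

# Hypothetical model: one weight row and one bias per class.
W = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
b = [0.0, 0.1, -0.1]
probs = predict(W, b, [2.0, 1.0])
best_class = max(range(len(probs)), key=lambda k: probs[k])
print(probs, best_class)
```

For MNIST-style inputs the same function would take a 784-long `x` and a 10-row `W`.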

6 of 105

Multinomial logistic regression

https://www.tensorflow.org/versions/r1.1/get_started/mnist/beginners

7 of 105

MNIST: Handwriting recognition

50,000 images of handwriting
28 x 28 x 1 (grayscale)
Numbers 0-9

10 class softmax regression
Input is 784 pixel values
Train with SGD
> 95% accuracy

8 of 105

Support Vector Machine (SVM)

Find max-margin classifier. Examples on the margin are supporting data points, support vectors.

min ||w||^2
s.t. y_n(w·x_n - b) ≥ 1, n = 1, 2, ...

Or: minimize weights such that the margin for each point is at least 1

9 of 105

Case study: Person detection

Dalal and Triggs ’05:
Train SVM on HOG features of image
2 classes, person/not person

At test time:
Extract HOG features at many scales
Run SVM classifier at every location
High responses = person?

10 of 105

Case study: Person detection

Dalal and Triggs is a sliding window detector

Many scales
Every location

10k+ classifier evaluations per image.

Person? No
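To get a feel for why a sliding window detector needs so many classifier evaluations, here is a small counting sketch. The 64x128 window matches the Dalal and Triggs person detector; the stride of 8 pixels, the pyramid scale step of 1.2, and the function name `count_windows` are assumptions chosen for illustration.

```python
def count_windows(img_w, img_h, win_w=64, win_h=128, stride=8, scale_step=1.2):
    """Count window positions over an image pyramid (one classifier call each)."""
    total = 0
    w, h = float(img_w), float(img_h)
    # Shrink the image until the detection window no longer fits.
    while round(w) >= win_w and round(h) >= win_h:
        cols = (round(w) - win_w) // stride + 1
        rows = (round(h) - win_h) // stride + 1
        total += cols * rows
        w /= scale_step
        h /= scale_step
    return total

n = count_windows(640, 480)
print(n)  # thousands of evaluations for one modest image
```

A smaller stride or a finer scale step pushes the count up quickly, which is where the "10k+" figure comes from.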

11 of 105

Case study: Deformable parts models

http://cs.brown.edu/people/pfelzens/papers/lsvm-pami.pdf

Objects have parts, learn to recognize parts and where they are

Latent SVM: Learn part appearances and locations without explicit data

Hard negative mining: rebalance classes for sliding window detectors

12 of 105

Case study: Image classification

https://lear.inrialpes.fr/~verbeek/mlcr.slides.11.12/sanchez11cvpr.pdf

Given an image, what’s in it?

Old state-of-the-art:
Extract features from image
  SIFT and Fisher Vectors
Train Linear SVM

On 1000 different classes, 54% accurate

13 of 105

What’s wrong with this?

Machine learning needs features!!

What are the right features?
HOG?
SIFT?
FV?

Why not let the algorithm decide?

Neural networks: Feature extraction + linear model

14 of 105

Success of Neural Networks

Image classification:

54% -> 80% accuracy on 1000 classes

Object detection:

33% mAP (DPM) -> 88% mAP on 20 classes

15 of 105

16 of 105

What is feature engineering?

Arguably the core problem of machine learning (especially in practice)

ML models work well if there is a clear relationship between the inputs and outputs of the function you are trying to model

20 of 105

Linear model can’t do this

Cannot learn transformations of features, only use existing features. Human must create good features

21 of 105

What if we added more processing?

Generally, feature engineering is just coming up with combinations of the features we already have

22 of 105

What if we added more processing?

Create “new” features using old ones. We’ll call H our hidden layer

23 of 105

What if we added more processing?

As with linear model, H can be expressed in matrix operations

26 of 105

What if we added more processing?

Now our prediction p is a function of our hidden layer

Feature extractor

Linear model

27 of 105

What if we added more processing?

Can still express the whole process in matrix notation! Nice because matrix ops are fast

28 of 105

This is a neural network!

This one has 1 hidden layer, but can have way more
Each layer is just some function φ applied to a linear combination of the previous layer
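A minimal sketch of that idea: each layer applies φ to linear combinations of the previous layer, and the network is just a loop over layers. The weights, the logistic choice for φ, and the helper names `layer` and `forward` are illustrative assumptions, not the lecture's code.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, phi):
    """One layer: phi applied to each linear combination of the previous layer."""
    return [phi(sum(w * x for w, x in zip(row, inputs))) for row in weights]

def forward(x, layers):
    """Run x through a list of (weights, phi) pairs, one pair per layer."""
    h = x
    for weights, phi in layers:
        h = layer(h, weights, phi)
    return h

# Hypothetical network: 2 inputs -> 3 hidden units (logistic) -> 1 linear output.
W = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # rows hold (w1, w2), (w3, w4), (w5, w6)
v = [[1.0, -1.0, 0.5]]                    # output weights v1, v2, v3
p = forward([1.0, 2.0], [(W, logistic), (v, lambda z: z)])
print(p)
```

Adding another hidden layer only means appending another `(weights, phi)` pair to the list.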

30 of 105

φ is our activation function

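Want to apply some extra processing at each layer. Why?
  Imagine φ(x) = x, linear activation

p = v_1 h_1 + v_2 h_2 + v_3 h_3

But h_1 = x_1 w_1 + x_2 w_2, h_2 = … etc
So
  p = v_1 w_1 x_1 + v_1 w_2 x_2 + v_2 w_3 x_1 + v_2 w_4 x_2 + v_3 w_5 x_1 + v_3 w_6 x_2
    = (v_1 w_1 + v_2 w_3 + v_3 w_5) x_1 + (v_1 w_2 + v_2 w_4 + v_3 w_6) x_2
    = u_1 x_1 + u_2 x_2

With a linear activation the whole network collapses to a linear model in x_1 and x_2.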
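The algebra above can be checked numerically. In this sketch the weight values are arbitrary illustrative numbers; the point is only that the two-layer network with φ(x) = x and the collapsed linear model agree on every input.

```python
# Arbitrary illustrative weights: w1..w6 feed the hidden layer, v1..v3 the output.
w = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
v = [1.0, -1.0, 0.5]

def two_layer_linear(x1, x2):
    """Hidden layer with linear activation, followed by a linear output."""
    h1 = x1 * w[0] + x2 * w[1]
    h2 = x1 * w[2] + x2 * w[3]
    h3 = x1 * w[4] + x2 * w[5]
    return v[0] * h1 + v[1] * h2 + v[2] * h3

# Collapsed weights u1, u2 from grouping the terms by x1 and x2.
u1 = v[0] * w[0] + v[1] * w[2] + v[2] * w[4]
u2 = v[0] * w[1] + v[1] * w[3] + v[2] * w[5]

def collapsed(x1, x2):
    return u1 * x1 + u2 * x2

for x1, x2 in [(10.0, 1.0), (-1.0, 1.0), (0.3, -2.5)]:
    print(two_layer_linear(x1, x2), collapsed(x1, x2))
```

No matter how many linear layers are stacked, the result is still one linear model, which is why φ needs to be nonlinear.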

31 of 105

Universal approximation theorem

https://en.wikipedia.org/wiki/Universal_approximation_theorem

What if φ not linear?

Universal approximation theorem (Cybenko 89, Hornik 91)
  φ: any nonconstant, bounded, monotonically increasing function
  I_m: m-dimensional unit hypercube, [0,1]^m
  Then a neural network with 1 hidden layer and φ as activation can approximate any continuous function f: I_m -> R arbitrarily closely
  (no bound on size of hidden layer)

By extension, works on f: bounded R^m -> R

What can we learn? What can’t we?

UAT just says it’s possible to model, not how.

32 of 105

How do we learn it?

Neural networks are non-convex with no closed form solution (can’t take derivative and set = 0)

Gradient descent! Recall for linear model:

33 of 105

How do we learn it?

With gradient descent we calculate the partial derivatives of the loss (or likelihood) function for every weight: ∂/∂w_i log L(w)

Then do gradient descent by subtracting the gradient from each weight (or ascent by adding it), scaled by the learning rate

35 of 105

How do we learn it?

Simple example, say we have a data point [10, 1] and we predict some p. We also know the “correct” label Y. Maybe our prediction p is too small and we want to make it larger. How do we adjust w?

We adjust w_1 much more than w_2, why?
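One way to see the answer is to run a single update on a linear model p = w·x. This sketch assumes a ½ squared-error loss, so the update for each weight is η(Y - p)x_i; the learning rate, the target value, and the function name `sgd_step` are illustrative assumptions.

```python
def sgd_step(w, x, y, lr=0.01):
    """One gradient descent step for p = w.x with loss 0.5 * (y - p)^2."""
    p = sum(w_i * x_i for w_i, x_i in zip(w, x))
    error = y - p  # positive when the prediction is too small
    # d(loss)/dw_i = -(y - p) * x_i, so descending adds lr * error * x_i.
    return [w_i + lr * error * x_i for w_i, x_i in zip(w, x)]

w = [0.0, 0.0]
w_big_input = sgd_step(w, [10.0, 1.0], 1.0)
w_neg_input = sgd_step(w, [-1.0, 1.0], 1.0)
print(w_big_input)  # w_1 moves 10x as far as w_2 because x_1 is 10x larger
print(w_neg_input)  # w_1 moves down: with x_1 < 0, lowering w_1 raises p
```

Each weight's update is proportional to its input, so the weight attached to the larger input changes the prediction most and gets the larger adjustment.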

36 of 105

How do we learn it?

Simple example, say we have a data point [-1, 1] and we predict some p. We also know the “correct” label Y. Maybe our prediction p is too small and we want to make it larger. How do we adjust w?

39 of 105

How do we learn it?

Now we have a “real” neural network (using linear activation for simplicity). How do we predict p?
  Calculate hidden layer neurons
  Calculate output p

40 of 105

How do we learn it?

Say we want to make p larger. How do we modify the weights? The first layer is easy, same as normal linear model:

42 of 105

How do we learn it?

Now what? Let’s calculate the “error” that the hidden layer makes. We want p to be larger; given the current weights, how should we adjust the hidden layer output to do that?

44 of 105

How do we learn it?

Now that we have an “error” in our hidden layer, want to modify the previous weights. Easy again, just like our linear model.

46 of 105

Backpropagation: just taking derivatives

This is the backpropagation algorithm. It’s really just an easy way to calculate partial derivatives in a neural network. We forward-propagate information through the network, calculate our error, then backpropagate that error through the network to calculate weight updates.

47 of 105

Backpropagation: just taking derivatives

This was with linear activations, but the process is the same for any φ; we just have to calculate φ’(x) for that neuron as well.

53 of 105

Backpropagation: the math

1-layer NN, sigmoid activation at hidden layer, linear output:

F(X) = φ(Xw)v
F(X) = φ(x_1 w_1 + x_2 w_2) v_1 + φ(x_1 w_3 + x_2 w_4) v_2 + φ(x_1 w_5 + x_2 w_6) v_3

Say regression, loss function is ½ squared L2 norm, expected output is Y:
  L_{X,Y}(w,v) = ½(Y - F(X))^2 = ½(Y - φ(Xw)v)^2

Want partial derivative of loss w.r.t. weights, say v_2:
  ∂/∂v_2 L_{X,Y}(w,v) = ∂/∂v_2 ½(Y - φ(Xw)v)^2
    = (Y - φ(Xw)v) * -[∂/∂v_2 φ(Xw)v]
    = (Y - φ(Xw)v) * -[∂/∂v_2 φ(x_1 w_3 + x_2 w_4) v_2]
    = (Y - φ(Xw)v) * -φ(x_1 w_3 + x_2 w_4)

56 of 105

Backpropagation: the math

Want partial derivative of loss w.r.t. weights, say v_2:
  ∂/∂v_2 L_{X,Y}(w,v) = -(Y - φ(Xw)v) * h_2
  where h_2 = φ(x_1 w_3 + x_2 w_4) is the second hidden neuron

Weight update rule (remember, we descend on the loss):
  v_2 = v_2 + η(Y - φ(Xw)v) * h_2

58 of 105

Backpropagation: the math

1-layer NN, sigmoid activation at hidden layer, linear output:

F(X) = φ(Xw)v
F(X) = φ(x_1 w_1 + x_2 w_2) v_1 + φ(x_1 w_3 + x_2 w_4) v_2 + φ(x_1 w_5 + x_2 w_6) v_3

Say regression, loss function is ½ squared L2 norm, expected output is Y:
  L_{X,Y}(w,v) = ½(Y - F(X))^2 = ½(Y - φ(Xw)v)^2

Want partial derivative of loss w.r.t. weights, say w_2:
  ∂/∂w_2 L_{X,Y}(w,v) = ∂/∂w_2 ½(Y - φ(Xw)v)^2
    = (Y - φ(Xw)v) * -[∂/∂w_2 φ(Xw)v]
    = (Y - φ(Xw)v) * -[∂/∂w_2 φ(x_1 w_1 + x_2 w_2) v_1]

60 of 105

Backpropagation: the math

1-layer NN, sigmoid activation at hidden layer, linear output:

F(X) = φ(Xw)v
F(X) = φ(x_1 w_1 + x_2 w_2) v_1 + φ(x_1 w_3 + x_2 w_4) v_2 + φ(x_1 w_5 + x_2 w_6) v_3

Say regression, loss function is ½ squared L2 norm, expected output is Y:
  L_{X,Y}(w,v) = ½(Y - F(X))^2 = ½(Y - φ(Xw)v)^2

Want partial derivative of loss w.r.t. weights, say w_2:
  ∂/∂w_2 L_{X,Y}(w,v) = ∂/∂w_2 ½(Y - φ(Xw)v)^2
    = (Y - φ(Xw)v) * -[∂/∂w_2 φ(Xw)v]
    = (Y - φ(Xw)v) * -v_1 [∂/∂w_2 φ(x_1 w_1 + x_2 w_2)]
    = (Y - φ(Xw)v) * -v_1 φ’(x_1 w_1 + x_2 w_2) [∂/∂w_2 (x_1 w_1 + x_2 w_2)]

Chain rule!

If F(x) = f(g(x))

F’(x) = f’(g(x))g’(x)

61 of 105

Backpropagation: the math

1-layer NN, sigmoid activation at hidden layer, linear output:

F(X) = φ(Xw)v
F(X) = φ(x_1 w_1 + x_2 w_2) v_1 + φ(x_1 w_3 + x_2 w_4) v_2 + φ(x_1 w_5 + x_2 w_6) v_3

Say regression, loss function is ½ squared L2 norm, expected output is Y:
  L_{X,Y}(w,v) = ½(Y - F(X))^2 = ½(Y - φ(Xw)v)^2

Want partial derivative of loss w.r.t. weights, say w_2:
  ∂/∂w_2 L_{X,Y}(w,v) = ∂/∂w_2 ½(Y - φ(Xw)v)^2
    = (Y - φ(Xw)v) * -[∂/∂w_2 φ(Xw)v]
    = (Y - φ(Xw)v) * -v_1 [∂/∂w_2 φ(x_1 w_1 + x_2 w_2)]
    = (Y - φ(Xw)v) * -v_1 φ’(x_1 w_1 + x_2 w_2) [∂/∂w_2 (x_1 w_1 + x_2 w_2)]
    = (Y - φ(Xw)v) * -v_1 φ’(x_1 w_1 + x_2 w_2) * x_2
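A quick way to sanity-check derivations like these is to compare them against finite differences. The sketch below does that for the v_2 and w_2 gradients, assuming φ is the logistic sigmoid; the input, target, weight values, and helper names are made up for illustration.

```python
import math

def phi(z):
    return 1.0 / (1.0 + math.exp(-z))

def dphi(z):
    s = phi(z)
    return s * (1.0 - s)

def hidden(x, w):
    """Pre-activations z and hidden outputs h; w holds w1..w6 at indices 0..5."""
    z = [x[0] * w[0] + x[1] * w[1], x[0] * w[2] + x[1] * w[3], x[0] * w[4] + x[1] * w[5]]
    return z, [phi(z_i) for z_i in z]

def loss(x, y, w, v):
    _, h = hidden(x, w)
    p = sum(h_i * v_i for h_i, v_i in zip(h, v))
    return 0.5 * (y - p) ** 2

def analytic_grads(x, y, w, v):
    """Gradients w.r.t. v_2 and w_2, exactly as derived on the slides."""
    z, h = hidden(x, w)
    p = sum(h_i * v_i for h_i, v_i in zip(h, v))
    grad_v2 = -(y - p) * h[1]
    grad_w2 = (y - p) * -v[0] * dphi(z[0]) * x[1]
    return grad_v2, grad_w2

def numeric_grad(f, params, i, eps=1e-6):
    """Central finite difference of f with respect to params[i]."""
    up, down = list(params), list(params)
    up[i] += eps
    down[i] -= eps
    return (f(up) - f(down)) / (2 * eps)

# Illustrative values only.
x, y = [1.0, 2.0], 1.0
w = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
v = [0.7, -0.3, 0.2]
grad_v2, grad_w2 = analytic_grads(x, y, w, v)
num_v2 = numeric_grad(lambda v_: loss(x, y, w, v_), v, 1)
num_w2 = numeric_grad(lambda w_: loss(x, y, w_, v), w, 1)
print(grad_v2, num_v2)
print(grad_w2, num_w2)
```

If the analytic and numeric values disagree by more than a small tolerance, the derivation (or the code) has a sign or indexing error.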

66 of 105

Backpropagation: the math

Want partial derivative of loss w.r.t. weights, say w_2:
  ∂/∂w_2 L_{X,Y}(w,v) = (Y - φ(Xw)v) * -v_1 φ’(x_1 w_1 + x_2 w_2) * x_2

Model error at p

Backpropagate through v_1

Model error at h_1

Multiply by x_2: gradient w.r.t. w_2

67 of 105

Backpropagation: the math

∂L/∂p

∂p/∂v_1

∂p/∂h_1

∂h_1/∂(w_1 x_1 + w_2 x_2)

∂(w_1 x_1 + w_2 x_2)/∂w_2

76 of 105

Backpropagation: the math

∂L/∂p = -(Y - φ(Xw)v)

∂p/∂v_1 = h_1

∂p/∂h_1 = v_1

∂h_1/∂(w_1 x_1 + w_2 x_2) = φ’(x_1 w_1 + x_2 w_2)

∂(w_1 x_1 + w_2 x_2)/∂w_2 = x_2

How do we update v_1?
∂L/∂v_1 = ∂p/∂v_1 * ∂L/∂p = h_1 * -(Y - φ(Xw)v)

82 of 105

Backpropagation: the math

∂L/∂p = -(Y - φ(Xw)v)

∂p/∂v_1 = h_1

∂p/∂h_1 = v_1

∂h_1/∂(w_1 x_1 + w_2 x_2) = φ’(x_1 w_1 + x_2 w_2)

∂(w_1 x_1 + w_2 x_2)/∂w_2 = x_2

How do we update w_2?
∂L/∂w_2 = ∂(w_1 x_1 + w_2 x_2)/∂w_2 * ∂h_1/∂(w_1 x_1 + w_2 x_2) * ∂p/∂h_1 * ∂L/∂p
        = x_2 * φ’(x_1 w_1 + x_2 w_2) * v_1 * -(Y - φ(Xw)v)

83 of 105

Backpropagation: the math

∂L/∂p

∂L/∂v_1

84 of 105

Backpropagation: the math

∂L/∂p

∂L/∂h_1

85 of 105

Backpropagation: the math

∂L/∂h_1

∂L/∂(w_1 x_1 + w_2 x_2)

φ’(x_1 w_1 + x_2 w_2)

86 of 105

Backpropagation: the math

∂L/∂(w_1 x_1 + w_2 x_2)

∂L/∂w_2

87 of 105

Forward propagation

88 of 105

Backward propagation

89 of 105

Weight updates
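The three phases (forward propagation, backward propagation, weight updates) can be put together in a short training loop. This is a sketch under the same assumptions as the derivation: one hidden layer with sigmoid activation, a linear output, and ½ squared-error loss. The starting weights, the single training point, and the learning rate are illustrative choices.

```python
import math

def phi(z):
    return 1.0 / (1.0 + math.exp(-z))

def dphi(z):
    s = phi(z)
    return s * (1.0 - s)

def forward(x, W, v):
    """Forward propagation: pre-activations z, hidden outputs h, prediction p."""
    z = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]
    h = [phi(z_i) for z_i in z]
    p = sum(v_i * h_i for v_i, h_i in zip(v, h))
    return z, h, p

def backward(x, y, z, h, p, v):
    """Backward propagation: apply the chain rule from the output back to W."""
    dL_dp = -(y - p)
    grad_v = [dL_dp * h_i for h_i in h]                   # dL/dv_i = dL/dp * h_i
    dL_dh = [dL_dp * v_i for v_i in v]                    # error at each hidden neuron
    dL_dz = [d * dphi(z_i) for d, z_i in zip(dL_dh, z)]   # through the activation
    grad_W = [[d * x_j for x_j in x] for d in dL_dz]      # multiply by the inputs
    return grad_W, grad_v

def train_step(x, y, W, v, lr):
    """One forward pass, one backward pass, one weight update."""
    z, h, p = forward(x, W, v)
    grad_W, grad_v = backward(x, y, z, h, p, v)
    W = [[w - lr * g for w, g in zip(row, g_row)] for row, g_row in zip(W, grad_W)]
    v = [v_i - lr * g for v_i, g in zip(v, grad_v)]
    return W, v, 0.5 * (y - p) ** 2

W = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
v = [0.1, 0.1, 0.1]
x, y = [1.0, 2.0], 1.0
losses = []
for _ in range(100):
    W, v, step_loss = train_step(x, y, W, v, lr=0.1)
    losses.append(step_loss)
print(losses[0], losses[-1])
```

On this single point the loss should shrink steadily; a real training run would loop over many data points (SGD) instead of one.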

90 of 105

Under and Overfitting

Underfitting: model not powerful enough, too much bias

Overfitting: model too powerful, fits to noise, doesn’t generalize well

Want the happy medium, how?

91 of 105

Under and Overfitting

Want the happy medium, how?

Pick the right model, but very hard to know a priori

Make weak model more powerful: boosting! (or other ways)

Make strong model less likely to overfit: regularization

92 of 105

With great power comes great overfitting

Neural networks are (sort of) all powerful! Which is not necessarily a good thing.

93 of 105

With great power comes great overfitting

Like SVMs, put limits on model that make it generalize better!

SVM:
min ||w||^2
s.t. y_n(w·x_n - b) ≥ 1, n = 1, 2, ...

Neural net:
Minimize loss function and weight magnitude
  Before: argmin_w L_X(w)
  Now: argmin_w L_X(w) + λ||w||^2

94 of 105

Weight decay: neural network regularization

argmin_w L_X(w) + λ||w||^2

λ: regularization parameter
  Higher: more penalty for large weights, less powerful model
  Lower: less penalty, more overfitting

Commonly use the L2 norm to regularize; this is called weight decay

Gradient descent update rule:
  w_{t+1} = w_t - η[∂/∂w_t L(w_t) + λw_t]
          = w_t - η ∂/∂w_t L(w_t) - ηλw_t

Subtract a little bit of weight every iteration
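The "subtract a little bit of weight" effect is easy to see in code. In this sketch the gradient of the loss is set to zero so that only the decay term acts; the learning rate, λ, and the function name `weight_decay_step` are illustrative.

```python
def weight_decay_step(w, grad, lr, lam):
    """w_{t+1} = w_t - lr * (grad + lam * w_t), applied element-wise."""
    return [w_i - lr * (g_i + lam * w_i) for w_i, g_i in zip(w, grad)]

w = [1.0, -2.0]
# With a zero loss gradient, every weight simply shrinks by a factor (1 - lr * lam).
w_next = weight_decay_step(w, [0.0, 0.0], lr=0.1, lam=0.5)
print(w_next)
```

Large weights lose more in absolute terms each step, which is what pushes the model toward smaller weights.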

95 of 105

Sometimes training is SLOW

With SGD we make LOTS of little steps along the gradient

Sometimes we move in the same direction for a long time…
  Maybe we should speed up in that direction!

96 of 105

Momentum: speeding up SGD

If we keep moving in same direction we should move further every round

Before:
  Δw_t = -∂/∂w_t L(w_t)

Now:
  Δw_t = -∂/∂w_t L(w_t) + mΔw_{t-1}

w_{t+1} = w_t + ηΔw_t

Side effect: smooths out updates if gradient is in different directions

101 of 105

NN updates with weight decay and momentum

Δw_t = -∂/∂w_t L(w_t) - λw_t + mΔw_{t-1}

w_{t+1} = w_t + ηΔw_t

-∂/∂w_t L(w_t): gradient of loss
-λw_t: weight decay
mΔw_{t-1}: momentum
η: learning rate
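The combined update can be sketched directly from the two equations. The default values for the learning rate, λ, and m below are taken from the hyperparameter slide at the end of the deck; the constant gradient and the function name `update` are made up to show how momentum builds up speed.

```python
def update(w, grad, prev_delta, lr=0.01, lam=0.0005, m=0.9):
    """One step of SGD with weight decay and momentum, element-wise.

    delta_t = -grad - lam * w_t + m * delta_{t-1}
    w_{t+1} = w_t + lr * delta_t
    """
    delta = [-g - lam * w_i + m * d for g, w_i, d in zip(grad, w, prev_delta)]
    new_w = [w_i + lr * d for w_i, d in zip(w, delta)]
    return new_w, delta

# Illustrative case: the loss gradient keeps pointing the same way.
w, delta = [0.0], [0.0]
step_sizes = []
for _ in range(5):
    w, delta = update(w, [1.0], delta)
    step_sizes.append(abs(delta[0]))
print(step_sizes)  # each step is larger than the last while the direction persists
```

When the gradient flips sign from step to step, the `m * delta` term partly cancels the new gradient, which is the smoothing side effect mentioned earlier.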

102 of 105

What about our activation functions φ

Many options; we want their derivatives to be easy to compute

The UAT holds when φ is bounded, but in practice bounds can be problematic

103 of 105

Common activation functions φ

linear

logistic

tanh

Rectified Linear Unit (ReLU)

Leaky ReLU
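For reference, here is a sketch of these five activations and their derivatives in plain Python. The leaky ReLU slope `alpha=0.1` is an assumption (other small values are also used), and the derivative of ReLU at exactly 0 is set to 0 by convention here.

```python
import math

def linear(x):
    return x

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.1):
    # alpha is the slope for negative inputs (assumed value).
    return x if x > 0 else alpha * x

# Derivatives, needed for the phi'(x) term in backpropagation.
def d_linear(x):
    return 1.0

def d_logistic(x):
    s = logistic(x)
    return s * (1.0 - s)

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2

def d_relu(x):
    return 1.0 if x > 0 else 0.0

def d_leaky_relu(x, alpha=0.1):
    return 1.0 if x > 0 else alpha

for f in (linear, logistic, tanh, relu, leaky_relu):
    print(f.__name__, [round(f(x), 3) for x in (-2.0, 0.0, 2.0)])
```

Logistic and tanh are bounded and flatten out for large inputs, while ReLU and leaky ReLU are unbounded on the positive side and have trivially cheap derivatives.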

104 of 105

So many hyperparameters!!

How do we know what to use??

105 of 105

Hyper Parameter Dark Magic

What follows are the one, true, correct, and only set of hyperparameters.
Praise be the NetLord!

η = [.0001 - .01]

λ = .0005
m = .9
φ = leaky ReLU