Logistics:
Softmax: normalized exponential
Generalization of logistic
Input: vector of reals
Output: probability distribution
softmax([1,2,7,3,2]):
 Calculate $e^x$: [2.72, 7.39, 1096.63, 20.09, 7.39]
 Calculate sum($e^x$): 2.72+7.39+1096.63+20.09+7.39 = 1134.22
 Normalize: $e^x$/sum($e^x$) = [0.002, 0.007, 0.967, 0.018, 0.007]
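The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the max-subtraction trick is an addition not on the slide, used only to keep exp() from overflowing.

```python
import math

def softmax(values):
    """Return the normalized exponential of a list of real numbers."""
    # Subtracting the max leaves the result unchanged but avoids
    # overflow in exp() for large inputs (an addition to the slide's recipe).
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1, 2, 7, 3, 2])
print([round(p, 3) for p in probs])
```

The outputs are non-negative and sum to 1, so they can be read as a probability distribution over the five entries.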
Multinomial logistic regression
The probability that a data point belongs to a class is the normalized (softmax) weighted sum of the input variables with the learned weights.
softmax(wx + b)
https://www.tensorflow.org/versions/r1.1/get_started/mnist/beginners
MNIST: Handwriting recognition
50,000 images of handwriting
28 x 28 x 1 (grayscale)
Numbers 0-9
10 class softmax regression
Input is 784 pixel values
Train with SGD
> 95% accuracy
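A rough sketch of what 10 class softmax regression on 784 pixel values looks like, including a single SGD step. The cross-entropy loss, the learning rate, and the random stand-in "image" are assumptions made for illustration; the slides only say the model is trained with SGD.

```python
import math
import random

NUM_PIXELS = 784   # 28 x 28 grayscale image, flattened
NUM_CLASSES = 10   # digits 0-9

def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def predict(W, b, x):
    """softmax(Wx + b): one probability per class."""
    scores = [sum(w_kj * x_j for w_kj, x_j in zip(W[k], x)) + b[k]
              for k in range(NUM_CLASSES)]
    return softmax(scores)

def sgd_step(W, b, x, label, lr=0.1):
    """One SGD step on the (assumed) cross-entropy loss for one example."""
    p = predict(W, b, x)
    for k in range(NUM_CLASSES):
        # Gradient of the loss w.r.t. score k is p_k - 1[k == label].
        err = p[k] - (1.0 if k == label else 0.0)
        for j in range(NUM_PIXELS):
            W[k][j] -= lr * err * x[j]
        b[k] -= lr * err

random.seed(0)
W = [[0.0] * NUM_PIXELS for _ in range(NUM_CLASSES)]
b = [0.0] * NUM_CLASSES
x = [random.random() for _ in range(NUM_PIXELS)]  # stand-in for a real MNIST image
before = predict(W, b, x)[3]
sgd_step(W, b, x, label=3)
after = predict(W, b, x)[3]
print(before, after)
```

With all-zero weights every class starts at probability 0.1; after one step on an example labeled 3, the probability of class 3 for that example goes up.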
Support Vector Machine (SVM)
Find max-margin classifier. Examples on the margin are the supporting data points, or support vectors.
$\min \|w\|^2$
s.t. $y_n(w \cdot x_n - b) \ge 1$, $n = 1, 2, \dots$
Or: minimize weights such that margin for each point is at least 1
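To make the constraint concrete, the sketch below checks $y_n(w \cdot x_n - b) \ge 1$ for a candidate classifier and picks out the points that sit exactly on the margin. The toy data and the values of w and b are hypothetical, and no optimization is performed here.

```python
def margins(w, b, points, labels):
    """Return y_n * (w . x_n - b) for every training point."""
    return [y * (sum(wi * xi for wi, xi in zip(w, x)) - b)
            for x, y in zip(points, labels)]

# Hypothetical toy data: two points per class, separable along the first axis.
points = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (-3.0, -1.0)]
labels = [1, 1, -1, -1]
w, b = (0.5, 0.0), 0.0

m = margins(w, b, points, labels)
feasible = all(value >= 1 for value in m)
# Points whose margin is exactly 1 lie on the margin: the support vectors.
support = [x for x, value in zip(points, m) if abs(value - 1) < 1e-9]
print(m, feasible, support)
```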
Case study: Person detection
Dalal and Triggs '05:
Train SVM on HOG features of image
2 classes, person/not person
At test time:
Extract HOG features at many scales
Run SVM classifier at every location
High responses = person?
Dalal and Triggs is a sliding window detector
Many scales, every location
10k+ classifier evaluations per image
[Figure: example window labeled "Person? No"]
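To get a feel for why a sliding window detector needs so many classifier evaluations, the sketch below enumerates window positions over an image pyramid. The window size, stride, scale step, and image size are illustrative assumptions rather than values taken from the slides, and the classifier itself is left out.

```python
def sliding_windows(width, height, win_w=64, win_h=128, stride=8, scale_step=1.2):
    """Yield (scale, x, y) for every window position at every scale."""
    scale = 1.0
    # Keep shrinking the image until the window no longer fits.
    while width / scale >= win_w and height / scale >= win_h:
        scaled_w, scaled_h = int(width / scale), int(height / scale)
        for y in range(0, scaled_h - win_h + 1, stride):
            for x in range(0, scaled_w - win_w + 1, stride):
                yield scale, x, y
        scale *= scale_step

# Each yielded window would be one feature extraction + classifier call.
count = sum(1 for _ in sliding_windows(640, 480))
print(count)
```

Even for a modest 640 x 480 image, the full-resolution scale alone produces thousands of windows, and every additional scale adds more.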
Case study: Deformable parts models
http://cs.brown.edu/people/pfelzens/papers/lsvm-pami.pdf
Objects have parts, learn to recognize parts and where they are
Latent SVM: Learn part appearances and locations without explicit part labels
Hard negative mining: rebalance classes for sliding window detectors
Case study: Image classification
https://lear.inrialpes.fr/~verbeek/mlcr.slides.11.12/sanchez11cvpr.pdf
Given an image, what’s in it?
Old state-of-the-art:
Extract features from image
 SIFT and Fisher Vectors
Train Linear SVM
On 1000 different classes, 54% accurate
What’s wrong with this?
Machine learning needs features!!
What are the right features?
HOG?
SIFT?
FV?
Why not let the algorithm decide?
Neural networks: Feature extraction + linear model
Success of Neural Networks
Image classification:
54% -> 80% accuracy on 1000 classes
Object detection:
33% mAP (DPM) -> 88% mAP on 20 classes
What is feature engineering?
Arguably the core problem of machine learning (especially in practice)
ML models work well if there is a clear relationship between the inputs and outputs of the function you are trying to model
Linear model can’t do this
Cannot learn transformations of features, only use existing features. Human must create good features
What if we added more processing?
Generally, feature engineering is just coming up with combinations of the features we already have
Create “new” features using old ones. We’ll call H our hidden layer
As with linear model, H can be expressed in matrix operations
Now our prediction p is a function of our hidden layer
Feature extractor: inputs to hidden layer H
Linear model: hidden layer H to prediction p
Can still express the whole process in matrix notation! Nice because matrix ops are fast
This is a neural network!
This one has 1 hidden layer, but can have way more
Each layer is just some function φ applied to a linear combination of the previous layer
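A small sketch of the forward pass for a network with one hidden layer: a matrix product, the function φ applied elementwise, then a linear model on top. The weights, the choice of the logistic function for φ, and the 2-input, 3-hidden-unit shape are hypothetical choices for illustration.

```python
import math

def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W, v, phi=logistic):
    """Hidden layer h = phi(xW), prediction p = h . v."""
    h = [phi(z) for z in matvec(x, W)]             # feature extractor
    p = sum(h_j * v_j for h_j, v_j in zip(h, v))   # linear model on h
    return h, p

# Hypothetical weights for a 2-input, 3-hidden-unit, 1-output network.
W = [[0.1, 0.3, 0.5],   # weights from x1 to each hidden unit
     [0.2, 0.4, 0.6]]   # weights from x2 to each hidden unit
v = [1.0, -1.0, 0.5]
h, p = forward([1.0, 2.0], W, v)
print(h, p)
```

Adding more layers would just repeat the `phi(matvec(...))` step on the previous layer's output.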
φ is our activation function
Want to apply some extra processing at each layer. Why?
 Imagine φ(x) = x, linear activation
$p = v_1 h_1 + v_2 h_2 + v_3 h_3$
But $h_1 = x_1 w_1 + x_2 w_2$, $h_2 = \dots$ etc
So
 $p = v_1 w_1 x_1 + v_1 w_2 x_2 + v_2 w_3 x_1 + v_2 w_4 x_2 + v_3 w_5 x_1 + v_3 w_6 x_2$
 $= (v_1 w_1 + v_2 w_3 + v_3 w_5) x_1 + (v_1 w_2 + v_2 w_4 + v_3 w_6) x_2$
 $= u_1 x_1 + u_2 x_2$
With a linear activation the whole network collapses to a single linear model.
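The algebra above can be checked numerically. The sketch below evaluates the two-layer network with φ(x) = x and the equivalent single-layer model with weights u1, u2; the weight values are arbitrary and chosen only for illustration.

```python
def linear_network(x, w, v):
    """Two-layer network with phi(x) = x, using the slide's weight numbering."""
    w1, w2, w3, w4, w5, w6 = w
    x1, x2 = x
    h1 = x1 * w1 + x2 * w2
    h2 = x1 * w3 + x2 * w4
    h3 = x1 * w5 + x2 * w6
    return v[0] * h1 + v[1] * h2 + v[2] * h3

def collapsed_weights(w, v):
    """The equivalent single-layer weights u1, u2."""
    w1, w2, w3, w4, w5, w6 = w
    u1 = v[0] * w1 + v[1] * w3 + v[2] * w5
    u2 = v[0] * w2 + v[1] * w4 + v[2] * w6
    return u1, u2

w = (0.5, -1.0, 2.0, 0.25, -0.75, 1.5)  # arbitrary illustrative values
v = (1.0, 2.0, -1.0)
x = (3.0, -2.0)
u1, u2 = collapsed_weights(w, v)
p_network = linear_network(x, w, v)
p_linear = u1 * x[0] + u2 * x[1]
print(p_network, p_linear)
```

Both values agree, which is the point: without a nonlinear φ, the hidden layer adds no modeling power.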
Universal approximation theorem
https://en.wikipedia.org/wiki/Universal_approximation_theorem
What if φ not linear?
Universal approximation theorem (Cybenko 89, Hornik 91)
 φ: any nonconstant, bounded, monotonically increasing continuous function
 $I_m$: m-dimensional unit hypercube, $[0,1]^m$
 Then a neural network with one hidden layer and φ as activation can approximate any continuous function $f: I_m \to \mathbb{R}$ to arbitrary accuracy
 (no bound on size of hidden layer)
By extension, works on continuous $f$ from any closed, bounded subset of $\mathbb{R}^m$ to $\mathbb{R}$
What can we learn? What can’t we?
UAT just says it’s possible to model, not how.
How do we learn it?
Neural networks are non-convex with no closed form solution (can’t take derivative and set = 0)
Gradient descent! Recall for linear model:
With gradient descent we calculate the partial derivatives of the loss (or likelihood) function for every weight: $\frac{\partial}{\partial w_i} \log L(w)$
Then do gradient descent on a loss by subtracting the gradient from the weights (or gradient ascent on a likelihood by adding it), scaled by the learning rate η
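As a reminder of how this looks for a linear model, here is a minimal gradient descent loop. The squared-error loss, the toy data, the learning rate, and the step count are all assumptions made for the sketch; the slides do not fix any of them.

```python
def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def gradient(w, data):
    """Gradient of the mean of 1/2 (y - w.x)^2 over the data set."""
    grad = [0.0] * len(w)
    for x, y in data:
        err = y - predict(w, x)
        for i, xi in enumerate(x):
            grad[i] += -err * xi / len(data)
    return grad

def gradient_descent(w, data, lr=0.1, steps=2000):
    for _ in range(steps):
        grad = gradient(w, data)
        # Descent: step against the gradient of the loss.
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

# Hypothetical data generated by y = 2*x1 - 1*x2.
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0),
        ((1.0, 1.0), 1.0), ((2.0, 1.0), 3.0)]
w = gradient_descent([0.0, 0.0], data)
print(w)
```

Starting from zero weights, the loop recovers weights close to the ones that generated the data.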
Simple example, say we have a data point [10, 1] and we predict some p. We also know the “correct” label Y. Maybe our prediction p is too small and we want to make it larger. How do we adjust w?
We adjust w1 much more than w2, why? The gradient for each weight is proportional to its input, and x1 = 10 is ten times larger than x2 = 1.
Now say we have a data point [-1, 1] and again our prediction p is too small and we want to make it larger. How do we adjust w?
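The two examples can be worked through in code, assuming a linear model p = w1*x1 + w2*x2 with squared-error loss (an assumption; the slide shows the model only as a figure). The learning rate and error value are made up for illustration.

```python
def weight_updates(x, error, lr=0.01):
    """Update for each weight of p = w1*x1 + w2*x2 under squared-error loss.

    `error` is Y - p, so a positive error means p should grow.
    """
    return [lr * error * xi for xi in x]

big_input = weight_updates([10, 1], error=1.0)
negative_input = weight_updates([-1, 1], error=1.0)
print(big_input)
print(negative_input)
```

For [10, 1] the update to w1 is ten times the update to w2. For [-1, 1] the update to w1 is negative: decreasing w1 is what increases p when x1 is negative.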
Now we have a “real” neural network (using linear activation for simplicity). How do we predict p?
 Calculate hidden layer neurons
 Calculate output p
Say we want to make p larger. How do we modify the weights? The output-layer weights are easy (the first layer we reach going backward), same as normal linear model.
Now what? Let’s calculate the “error” that the hidden layer makes. We want p to be larger; given the current weights, how should we adjust the hidden layer output to do that?
Now that we have an “error” in our hidden layer, want to modify the previous weights. Easy again, just like our linear model.
Backpropagation: just taking derivatives
This is the backpropagation algorithm. It’s really just an easy way to calculate partial derivatives in a neural network. We forward-propagate information through the network, calculate our error, then backpropagate that error through network to calculate weight updates.
This was with linear activations but the process is the same for any φ, just have to calculate φ’(x) for that neuron as well.
Backpropagation: the math
1-hidden-layer NN, sigmoid activation at hidden layer, linear output:
$F(X) = \phi(Xw)v$
$F(X) = \phi(x_1 w_1 + x_2 w_2) v_1 + \phi(x_1 w_3 + x_2 w_4) v_2 + \phi(x_1 w_5 + x_2 w_6) v_3$
Say regression, loss function is half the squared error, expected output is Y:
$L_{X,Y}(w,v) = \frac{1}{2}(Y - F(X))^2 = \frac{1}{2}(Y - \phi(Xw)v)^2$
Want partial derivative of Loss w.r.t. weights, say $v_2$:
$\frac{\partial}{\partial v_2} L_{X,Y}(w,v) = \frac{\partial}{\partial v_2} \frac{1}{2}(Y - \phi(Xw)v)^2$
 $= (Y - \phi(Xw)v) \cdot \left(-\frac{\partial}{\partial v_2}\left[\phi(Xw)v\right]\right)$
 $= (Y - \phi(Xw)v) \cdot \left(-\frac{\partial}{\partial v_2}\left[\phi(x_1 w_3 + x_2 w_4) v_2\right]\right)$
 $= (Y - \phi(Xw)v) \cdot \left(-\phi(x_1 w_3 + x_2 w_4)\right)$
 $= -(Y - \phi(Xw)v) \cdot h_2$
Weight update rule (remember descend on loss):
$v_2 = v_2 + \eta (Y - \phi(Xw)v) \cdot h_2$
Same network and loss. Want partial derivative of Loss w.r.t. weights, say $w_2$:
$\frac{\partial}{\partial w_2} L_{X,Y}(w,v) = \frac{\partial}{\partial w_2} \frac{1}{2}(Y - \phi(Xw)v)^2$
 $= (Y - \phi(Xw)v) \cdot \left(-\frac{\partial}{\partial w_2}\left[\phi(Xw)v\right]\right)$
 $= (Y - \phi(Xw)v) \cdot \left(-\frac{\partial}{\partial w_2}\left[\phi(x_1 w_1 + x_2 w_2) v_1\right]\right)$
 $= (Y - \phi(Xw)v) \cdot \left(-v_1 \frac{\partial}{\partial w_2}\left[\phi(x_1 w_1 + x_2 w_2)\right]\right)$
 $= (Y - \phi(Xw)v) \cdot \left(-v_1 \phi'(x_1 w_1 + x_2 w_2) \frac{\partial}{\partial w_2}\left[x_1 w_1 + x_2 w_2\right]\right)$
 $= (Y - \phi(Xw)v) \cdot \left(-v_1 \phi'(x_1 w_1 + x_2 w_2)\right) \cdot x_2$
Chain rule!
If $F(x) = f(g(x))$
$F'(x) = f'(g(x))\,g'(x)$
Reading the result left to right:
 Model error at p: $(Y - \phi(Xw)v)$
 Backpropagate through $v_1$ (and $\phi'$)
 Model error at $h_1$: $(Y - \phi(Xw)v) \cdot \left(-v_1 \phi'(x_1 w_1 + x_2 w_2)\right)$
 Multiply by $x_2$: gradient w.r.t. $w_2$
The same result as a chain of simple partial derivatives, with $p = \phi(Xw)v$ and $h_1 = \phi(w_1 x_1 + w_2 x_2)$:
$\frac{\partial L}{\partial p} = -(Y - \phi(Xw)v)$
$\frac{\partial p}{\partial v_1} = h_1$
$\frac{\partial p}{\partial h_1} = v_1$
$\frac{\partial h_1}{\partial (w_1 x_1 + w_2 x_2)} = \phi'(x_1 w_1 + x_2 w_2)$
$\frac{\partial (w_1 x_1 + w_2 x_2)}{\partial w_2} = x_2$
How do we update $v_1$?
$\frac{\partial L}{\partial v_1} = \frac{\partial p}{\partial v_1} \cdot \frac{\partial L}{\partial p} = h_1 \cdot \left(-(Y - \phi(Xw)v)\right)$
How do we update $w_2$?
$\frac{\partial L}{\partial w_2} = \frac{\partial (w_1 x_1 + w_2 x_2)}{\partial w_2} \cdot \frac{\partial h_1}{\partial (w_1 x_1 + w_2 x_2)} \cdot \frac{\partial p}{\partial h_1} \cdot \frac{\partial L}{\partial p}$
 $= x_2 \cdot \phi'(x_1 w_1 + x_2 w_2) \cdot v_1 \cdot \left(-(Y - \phi(Xw)v)\right)$
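One way to sanity-check these derivatives is to implement them and compare against a finite-difference estimate. The sketch below does this for the 2-input, 3-hidden-unit network used in the derivation, taking φ to be the logistic sigmoid; the input, label, and weight values are hypothetical.

```python
import math

def phi(z):
    return 1.0 / (1.0 + math.exp(-z))

def phi_prime(z):
    return phi(z) * (1.0 - phi(z))

def forward(x, w, v):
    """w = [w1..w6], v = [v1, v2, v3], numbered as in the derivation."""
    z = [x[0] * w[2 * j] + x[1] * w[2 * j + 1] for j in range(3)]
    h = [phi(zj) for zj in z]
    p = sum(hj * vj for hj, vj in zip(h, v))
    return z, h, p

def loss(x, y, w, v):
    return 0.5 * (y - forward(x, w, v)[2]) ** 2

def backprop(x, y, w, v):
    z, h, p = forward(x, w, v)
    dL_dp = -(y - p)                                  # error at the output p
    grad_v = [dL_dp * hj for hj in h]                 # dL/dv_j = dL/dp * h_j
    dL_dh = [dL_dp * vj for vj in v]                  # backpropagate through v
    dL_dz = [dL_dh[j] * phi_prime(z[j]) for j in range(3)]
    # w[i] connects input i % 2 to hidden unit i // 2.
    grad_w = [dL_dz[i // 2] * x[i % 2] for i in range(6)]
    return grad_w, grad_v

x, y = (0.5, -1.0), 1.0
w = [0.1, -0.2, 0.3, 0.4, -0.5, 0.6]
v = [0.7, -0.8, 0.9]
grad_w, grad_v = backprop(x, y, w, v)

# Central finite difference on w2 (index 1) as an independent check.
eps = 1e-6
w_plus, w_minus = list(w), list(w)
w_plus[1] += eps
w_minus[1] -= eps
numeric = (loss(x, y, w_plus, v) - loss(x, y, w_minus, v)) / (2 * eps)
print(grad_w[1], numeric)
```

The analytic and numeric values should agree to several decimal places; a mismatch usually points to a dropped sign or a missing φ' factor.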
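Backward pass, one step at a time:
 $\partial L/\partial p$ -> $\partial L/\partial v_1$
 $\partial L/\partial p$ -> $\partial L/\partial h_1$
 $\partial L/\partial h_1$ -> $\partial L/\partial (w_1 x_1 + w_2 x_2)$, multiplying by $\phi'(x_1 w_1 + x_2 w_2)$
 $\partial L/\partial (w_1 x_1 + w_2 x_2)$ -> $\partial L/\partial w_2$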
Forward propagation
Backward propagation
Weight updates
Under and Overfitting
Underfitting: model not powerful enough, too much bias
Overfitting: model too powerful, fits to noise, doesn’t generalize well
Want the happy medium, how?
Pick the right model, but very hard to know a priori
Make weak model more powerful: boosting! (or other ways)
Make strong model less likely to overfit: regularization
With great power comes great overfitting
Neural networks are (sort of) all powerful! Which is not necessarily a good thing.
Like SVMs, put limits on model that make it generalize better!
SVM:
$\min \|w\|^2$
s.t. $y_n(w \cdot x_n - b) \ge 1$, $n = 1, 2, \dots$
Neural net:
Minimize loss function and weight magnitude
 Before: $\arg\min_w L_X(w)$
 Now: $\arg\min_w L_X(w) + \lambda \|w\|^2$
Weight decay: neural network regularization
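$\arg\min_w L_X(w) + \lambda \|w\|^2$
λ: regularization parameter
 Higher: more penalty for large weights, less powerful model
 Lower: less penalty, more overfitting
Commonly use the L2 norm to regularize; this is called weight decay
Gradient descent update rule (the factor of 2 from differentiating $\lambda \|w\|^2$ is absorbed into λ):
 $w_{t+1} = w_t - \eta\left[\frac{\partial}{\partial w_t} L(w_t) + \lambda w_t\right]$
 $= w_t - \eta \frac{\partial}{\partial w_t} L(w_t) - \eta \lambda w_t$
Subtract a little bit of weight every iteration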
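The "subtract a little bit of weight" effect is easy to see in isolation. In this sketch the loss gradient is set to zero (a contrived assumption) so that the decay term is the only thing acting on the weights; the learning rate and λ are illustrative values.

```python
def decay_step(w, grad, lr=0.1, lam=0.0005):
    """w_{t+1} = w_t - lr * grad - lr * lam * w_t, applied per weight."""
    return [wi - lr * gi - lr * lam * wi for wi, gi in zip(w, grad)]

# With a zero loss gradient, the only change comes from the decay term.
w = [1.0, -2.0]
for _ in range(1000):
    w = decay_step(w, grad=[0.0, 0.0])
print(w)
```

Each step shrinks every weight by the same small factor, so weights drift toward zero unless the loss gradient pushes back.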
Sometimes training is SLOW
With SGD we make LOTS of little steps along the gradient
Sometimes we move in the same direction for a long time…
 Maybe we should speed up in that direction!
Momentum: speeding up SGD
If we keep moving in same direction we should move further every round
Before:
 $\Delta w_t = -\frac{\partial}{\partial w_t} L(w_t)$
Now:
 $\Delta w_t = -\frac{\partial}{\partial w_t} L(w_t) + m \Delta w_{t-1}$
$w_{t+1} = w_t + \eta \Delta w_t$
Side effect: smooths out updates if the gradient keeps changing direction
NN updates with weight decay and momentum
$\Delta w_t = -\frac{\partial}{\partial w_t} L(w_t) - \lambda w_t + m \Delta w_{t-1}$
$w_{t+1} = w_t + \eta \Delta w_t$
 $-\frac{\partial}{\partial w_t} L(w_t)$: gradient of loss
 $-\lambda w_t$: weight decay
 $m \Delta w_{t-1}$: momentum
 $\eta$: learning rate
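The combined update translates almost directly into code. The sketch below applies it to a made-up one-dimensional loss L(w) = ½w², whose gradient is simply w; the loss and starting point are assumptions, while the default λ and m match the values suggested later in the slides.

```python
def momentum_step(w, delta_prev, grad, lr=0.01, lam=0.0005, m=0.9):
    """delta_t = -grad - lam * w_t + m * delta_{t-1};  w_{t+1} = w_t + lr * delta_t."""
    delta = [-g - lam * wi + m * d for wi, g, d in zip(w, grad, delta_prev)]
    w_next = [wi + lr * d for wi, d in zip(w, delta)]
    return w_next, delta

# Hypothetical 1-D loss L(w) = 1/2 * w^2, so the gradient is simply w.
w, delta = [5.0], [0.0]
for _ in range(500):
    w, delta = momentum_step(w, delta, grad=list(w))
print(w)
```

The previous update `delta` has to be stored between steps, which is the only extra state momentum requires.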
What about our activation functions φ
Many options, want their derivatives to be easy to compute
UAT holds when φ is bounded; in practice bounded activations can be problematic (saturated units give tiny gradients)
Common activation functions φ
linear
logistic
tanh
Rectified Linear Unit (ReLU)
Leaky ReLU
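For reference, the activations listed above and their derivatives can be written as below. The leaky ReLU slope `alpha=0.1` is an assumed value (other small slopes are also common), and the derivative of ReLU at exactly 0 is set to 0 by convention.

```python
import math

def linear(x):
    return x

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.1):
    # alpha is an assumed slope for negative inputs.
    return x if x > 0 else alpha * x

# Derivatives, needed for backpropagation.
def d_linear(x):
    return 1.0

def d_logistic(x):
    s = logistic(x)
    return s * (1.0 - s)

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2

def d_relu(x):
    return 1.0 if x > 0 else 0.0

def d_leaky_relu(x, alpha=0.1):
    return 1.0 if x > 0 else alpha

for f in (linear, logistic, tanh, relu, leaky_relu):
    print(f.__name__, [round(f(x), 3) for x in (-2.0, 0.0, 2.0)])
```

Logistic and tanh are bounded and flatten out for large inputs, while ReLU and leaky ReLU are unbounded above and have piecewise-constant derivatives.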
So many hyper parameters!!
How do we know what to use??
Hyper Parameter Dark Magic
What follows are the one, true, correct, and only set of hyperparameters.
Praise be the NetLord!
η = [.0001 - .01]
λ = .0005
m = .9
φ = leaky relu