Neural Network Backprop in just one slide!
Ready?
Sung Kim <hunkim+ml@gmail.com>
[Diagram: a single-neuron computation graph, a0 → (*) with w → (+) with b → Sigmoid → loss, annotated with the forward pass (1) o = a0*w, (2) l = o+b, (3) a1 = sigmoid(l), (4) E = loss(a1), the backward prop via the chain rule (the upstream derivative at each gate is given from the pre-computed derivative), the gate derivatives, and the network update (learning rate, alpha).]
Too much?
Then, let’s go one by one
a1 = sigmoid(w*a0 + b)
Forward pass, OK? Read (1), (2), ...
[Diagram: the same graph, forward pass only: (1) o = a0*w, (2) l = o+b, (3) a1 = sigmoid(l), (4) E = loss(a1).]
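As a sketch, this forward pass maps line by line onto Python (NumPy here by choice; the toy input/weight/bias/target values and the squared-error loss are assumptions, since the slides only say "loss"):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

a0, w, b, t = 0.5, 2.0, -1.0, 1.0  # input, weight, bias, target (toy values)
o = a0 * w               # (1)
l = o + b                # (2)
a1 = sigmoid(l)          # (3)
E = 0.5 * (a1 - t) ** 2  # (4) squared-error loss, assumed here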
Let’s do back propagation!
The derivative of the loss, dE/da1, will be given. What would be the derivatives for w, b, and a0?
We can use the chain rule.
[Diagram: the same graph, with backward-prop arrows added alongside the forward pass.]
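Continuing the sketch above, the chain rule for dE/dw is just the product of the local gate derivatives, applied backwards from the loss:

d_a1 = a1 - t               # dE/da1, given from the (assumed squared-error) loss
d_l = d_a1 * a1 * (1 - a1)  # dE/dl = dE/da1 * da1/dl  (sigmoid gate)
d_o = d_l * 1               # dE/do = dE/dl  * dl/do   (add gate)
d_w = d_o * a0              # dE/dw = dE/do  * do/dw   (multiply gate)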
In the same manner, we can get back prop (3), (4), and (5)!
[Diagram: the same graph, back-propagating through the remaining gates.]
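The remaining derivatives follow the same pattern, reusing the upstream values already computed in the sketch above:

d_b = d_l * 1   # dE/db  = dE/dl * dl/db   (add gate, dl/db = 1)
d_a0 = d_o * w  # dE/da0 = dE/do * do/da0  (multiply gate, do/da0 = w)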
The derivatives for these gates will be given.
We can just use them.
[Diagram: the same graph, with the gate derivatives annotated.]
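For reference, these are the standard local derivatives of the three gates used here (a sketch of what "will be given"; d_sigmoid is a hypothetical helper name):

# multiply gate: o = a0 * w      ->  do/da0 = w,  do/dw = a0
# add gate:      l = o + b       ->  dl/do  = 1,  dl/db = 1
# sigmoid gate:  a = sigmoid(l)  ->  da/dl  = a * (1 - a)
def d_sigmoid(a):
    # derivative of the sigmoid, written in terms of its own output a
    return a * (1.0 - a)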
Just apply them and solve each derivative, one by one!
[Diagram: the same graph, with the chain-rule derivatives written out; the upstream derivative at each gate is given from the pre-computed derivative.]
Done! Let’s update our network using derivatives!
[Diagram: the same graph, with the network update (learning rate, alpha) added.]
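A minimal sketch of the update step, reusing the derivatives from above (the learning-rate value is an assumption):

alpha = 0.1          # learning rate (toy value)
w = w - alpha * d_w  # gradient-descent update of the weight
b = b - alpha * d_b  # gradient-descent update of the bias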
Got it now, but how about N layers?
They are the same, just more passes!
[Diagram: a two-layer graph, a0 → (*) w1 → (+) b1 → Sigmoid → (*) w2 → (+) b2 → Sigmoid → loss, with forward pass (1) o1 = a0*w1, (2) l1 = o1+b1, (3) a1 = sigmoid(l1), (4) o2 = a1*w2, (5) l2 = o2+b2, (6) a2 = sigmoid(l2), (7) E = loss(a2), plus the corresponding backward prop, gate derivatives, and network update (learning rate, alpha).]
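Continuing the earlier sketch, the two-layer forward pass is the same pattern repeated (the toy values for w1, b1, w2, b2 are assumptions):

w1, b1, w2, b2 = 2.0, -1.0, 0.5, 0.2  # toy values
o1 = a0 * w1             # (1)
l1 = o1 + b1             # (2)
a1 = sigmoid(l1)         # (3)
o2 = a1 * w2             # (4)
l2 = o2 + b2             # (5)
a2 = sigmoid(l2)         # (6)
E = 0.5 * (a2 - t) ** 2  # (7) squared-error loss, assumed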
That was for single values. How about matrices?
Almost the same. Just see the next slide!
[Diagram: the same two-layer graph in matrix form, with forward pass (1) O1 = A0W1, (2) L1 = O1+B1, (3) A1 = sigmoid(L1), (4) O2 = A1W2, (5) L2 = O2+B2, (6) A2 = sigmoid(L2), (7) E = loss(A2) (the products are matrix multiplies), plus the corresponding backward prop, gate derivatives, and network update (learning rate, alpha).]
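A NumPy sketch of the matrix version, mirroring (1)-(7); the shapes (a batch of 4 examples, 3 inputs, 2 hidden units, 1 output) and the loss are assumptions, and np/sigmoid come from the earlier sketch:

A0 = np.random.randn(4, 3)                   # batch of inputs
W1, B1 = np.random.randn(3, 2), np.zeros(2)
W2, B2 = np.random.randn(2, 1), np.zeros(1)
T = np.ones((4, 1))                          # targets (toy values)
O1 = A0 @ W1                                 # (1) matrix multiply
L1 = O1 + B1                                 # (2)
A1 = sigmoid(L1)                             # (3)
O2 = A1 @ W2                                 # (4)
L2 = O2 + B2                                 # (5)
A2 = sigmoid(L2)                             # (6)
E = 0.5 * np.mean((A2 - T) ** 2)             # (7) squared-error loss, assumed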
I finally got it, but how do I code it…?
It is a simple one-to-one mapping:
# Backprop (chain rule)
d_L2 = (A2 - T)                           # dE/dL2, given from the loss
d_B2 = d_L2 * 1                           # add gate: dL2/dB2 = 1
d_O2 = d_L2 * 1                           # add gate: dL2/dO2 = 1
d_W2 = tf.matmul(tf.transpose(A1), d_O2)  # multiply gate: dE/dW2
d_A1 = tf.matmul(d_O2, tf.transpose(W2))  # multiply gate: dE/dA1
d_L1 = d_A1 * A1 * (1 - A1)               # sigmoid gate: dA1/dL1 = A1*(1-A1)
...
See full code and others at https://github.com/hunkim/DeepLearningZeroToAll
Acknowledgement
Feel free to leave comments (on Google Slides) to help make this explanation of backprop easier to understand.