1 of 17

Neural Network Backprop in just one slide!

Ready?

Sung Kim <hunkim+ml@gmail.com>

2 of 17

[Diagram: a single-layer network drawn as a computation graph: a0 and w feed a multiply (*) gate, its output and b feed an add (+) gate, then a Sigmoid gate, then the loss, with forward and backward-prop arrows.]

Forward:
(1) o = a0*w
(2) l = o + b
(3) a1 = sigmoid(l)
(4) E = loss(a1)

Backward prop: derivatives via the chain rule, built from the gate derivatives (given) and dE/da1 (given from the pre-computed loss derivative), followed by the network update (learning rate, alpha).

3 of 17

Too much?

Then, let’s go one by one

4 of 17

a1 = sigmoid(w*a0 + b)

Forward pass, OK? Read (1), (2), ...

[Diagram: the same computation graph, forward pass only.]

(1) o = a0*w
(2) l = o + b
(3) a1 = sigmoid(l)
(4) E = loss(a1)
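
As a minimal sketch of this forward pass in plain Python (the concrete values and the squared-error loss against a target t are assumptions; the slide leaves them abstract):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

a0, w, b, t = 1.0, 0.5, -0.3, 0.0   # assumed example values

o  = a0 * w                  # (1) multiply gate
l  = o + b                   # (2) add gate
a1 = sigmoid(l)              # (3) sigmoid gate
E  = 0.5 * (a1 - t) ** 2     # (4) loss (assumed squared error)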

5 of 17

Let’s do back propagation!

dE/da1 will be given. What would dE/dl be?

We can use the chain rule.

[Diagram: the same graph, now with a backward-prop arrow running from the loss back toward w, b, and a0.]
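
Continuing the sketch above (still assuming the squared-error loss), the first backward step is one application of the chain rule: dE/dl = dE/da1 * da1/dl.

# Backward prop, first steps (continues the forward-pass sketch above)
d_a1 = a1 - t                 # dE/da1, "given" by the assumed loss
d_l  = d_a1 * a1 * (1 - a1)   # chain rule: da1/dl = a1*(1 - a1) for the sigmoid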

6 of 17

In the same manner, we can get backprop steps (3), (4), and (5)!

[Diagram: the same graph, forward and backward prop, with the remaining backward steps marked.]
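
Spelled out for this graph (still the same sketch), the remaining backward steps are:

# Remaining backward steps, one chain-rule application each
d_o  = d_l * 1     # add gate: dl/do = 1
d_b  = d_l * 1     # add gate: dl/db = 1
d_w  = d_o * a0    # multiply gate: do/dw = a0
d_a0 = d_o * w     # multiply gate: do/da0 = w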

7 of 17

These derivatives for gates will be given.

We can just use them.

[Diagram: the same graph, now annotated with the gate derivatives for the *, +, and Sigmoid gates.]
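
For reference, the local (gate) derivatives used here are the standard ones:

add gate      l = o + b:        dl/do = 1,   dl/db = 1
multiply gate o = a0*w:         do/da0 = w,  do/dw = a0
sigmoid gate  a1 = sigmoid(l):  da1/dl = sigmoid(l)*(1 - sigmoid(l)) = a1*(1 - a1)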

8 of 17

Just apply them and solve each derivative, one by one!

[Diagram: the same graph with the chain-rule derivatives written out, each built from the gate derivatives (given) and the pre-computed loss derivative (given).]
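
Chaining those pieces together, each derivative we need is just a product of the gate derivatives above:

dE/dw  = dE/da1 * da1/dl * dl/do * do/dw  = dE/da1 * a1*(1 - a1) * 1 * a0
dE/db  = dE/da1 * da1/dl * dl/db          = dE/da1 * a1*(1 - a1) * 1
dE/da0 = dE/da1 * da1/dl * dl/do * do/da0 = dE/da1 * a1*(1 - a1) * 1 * w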

9 of 17

Matrix

[Diagram: the same graph, now also showing the network update (learning rate, alpha).]

10 of 17

Done! Let’s update our network using derivatives!

[Diagram: the same graph with the chain-rule derivatives, gate derivatives, and network update (learning rate, alpha) all shown.]
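
A minimal sketch of that update, continuing the code above (alpha, the learning rate, is an assumed value):

# Gradient descent update
alpha = 0.1
w = w - alpha * d_w
b = b - alpha * d_b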

11 of 17

Now I've got it, but what about N layers?

They work the same way, just with more passes!

12 of 17

[Diagram: a two-layer network as a computation graph: a0 and w1 into a multiply gate, then add b1, then Sigmoid giving a1; a1 and w2 into a multiply gate, then add b2, then Sigmoid giving a2, then the loss; forward and backward-prop arrows, chain-rule derivatives, gate derivatives, and the network update (learning rate, alpha).]

(1) o1 = a0*w1
(2) l1 = o1 + b1
(3) a1 = sigmoid(l1)
(4) o2 = a1*w2
(5) l2 = o2 + b2
(6) a2 = sigmoid(l2)
(7) E = loss(a2)
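
As a sketch, the two-layer forward pass is the same pattern applied twice (scalar values; the example parameters and the squared-error loss against a target t are assumptions). Backward prop then runs the same chain rule back through both layers.

# Two-layer forward pass, following (1)-(7); reuses sigmoid() from the sketch above
a0, t = 1.0, 0.0                          # assumed example input and target
w1, b1, w2, b2 = 0.5, -0.3, 1.5, 0.2      # assumed example parameters

o1 = a0 * w1                  # (1)
l1 = o1 + b1                  # (2)
a1 = sigmoid(l1)              # (3)
o2 = a1 * w2                  # (4)
l2 = o2 + b2                  # (5)
a2 = sigmoid(l2)              # (6)
E  = 0.5 * (a2 - t) ** 2      # (7) loss (assumed squared error)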

13 of 17

That was for single values. How about matrices?

Almost the same. Just see the next slide!

14 of 17

[Diagram: the same two-layer graph in matrix form, with A0, W1, B1, W2, B2 in place of the scalars.]

(1) O1 = A0W1
(2) L1 = O1 + B1
(3) A1 = sigmoid(L1)
(4) O2 = A1W2
(5) L2 = O2 + B2
(6) A2 = sigmoid(L2)
(7) E = loss(A2)
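
The matrix version only swaps scalar multiplies for matrix multiplies. A NumPy sketch (the shapes are assumptions: A0 is batch x inputs, W1 is inputs x hidden, B1 is a length-hidden vector, and so on):

import numpy as np

def sigmoid(X):
    return 1.0 / (1.0 + np.exp(-X))

# assumed example shapes: 4 samples, 3 inputs, 5 hidden units, 1 output
A0 = np.random.randn(4, 3)
W1, B1 = np.random.randn(3, 5), np.zeros(5)
W2, B2 = np.random.randn(5, 1), np.zeros(1)

O1 = A0 @ W1            # (1) matrix multiply
L1 = O1 + B1            # (2) bias added (broadcast over the batch)
A1 = sigmoid(L1)        # (3)
O2 = A1 @ W2            # (4)
L2 = O2 + B2            # (5)
A2 = sigmoid(L2)        # (6)
# (7) E = loss(A2), e.g. squared error against a target T (assumed)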

15 of 17

I finally got it, but how do I code it…?

16 of 17

Simple one-to-one mapping from the derivatives to code

# Backprop (chain rule)
d_L2 = A2 - T                                # dE/dL2, given from the loss
d_B2 = d_L2 * 1                              # add gate: dL2/dB2 = 1
d_O2 = d_L2 * 1                              # add gate: dL2/dO2 = 1
d_W2 = tf.matmul(tf.transpose(A1), d_O2)     # multiply gate: dO2/dW2 = A1
d_A1 = tf.matmul(d_O2, tf.transpose(W2))     # multiply gate: dO2/dA1 = W2
d_L1 = d_A1 * A1 * (1 - A1)                  # sigmoid gate: dA1/dL1 = A1*(1 - A1)
...
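
Following the same one-to-one mapping, the elided lines would presumably continue like this (a sketch, not from the original slide; A0 as the input batch and alpha as the learning rate are assumptions):

d_B1 = d_L1 * 1                              # add gate
d_O1 = d_L1 * 1                              # add gate
d_W1 = tf.matmul(tf.transpose(A0), d_O1)     # multiply gate
# then update each parameter, e.g. W2 <- W2 - alpha * d_W2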

17 of 17

Acknowledgement

Feel free to give us comments (on Google Slides) to make this explanation of backprop easier to understand.