
Lecture 2: Neural Networks and Backpropagation

Soo Kyung Kim (This lecture is based on the cs231n class by Prof. Li Fei-Fei at Stanford)

Spring 2025


Neural Networks


Perceptron


Diagram: a perceptron computes y = f(Wx); the input x is multiplied by a weight matrix W, and the result is passed through a function f (here f = σ, the sigmoid).

The softmax classifier is a special case of the perceptron! With f = I (the identity), it is linear regression.
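As a minimal numpy sketch of this unit (the shapes and random values here are illustrative assumptions, not taken from the slides):

import numpy as np

def perceptron(x, W, f):
    # Generic perceptron: linear map followed by the function f
    return f(W.dot(x))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
identity = lambda z: z

W = np.random.randn(10, 3072)    # illustrative shapes: 10 outputs, 3072 inputs
x = np.random.randn(3072)

y_sigmoid = perceptron(x, W, sigmoid)   # f = σ
y_linear = perceptron(x, W, identity)   # f = I → linear regression scores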


Neural Network with a Single Layer

Diagram: a single linear layer computes the score vector s = Wx, mapping the input x (d = 3072) to class scores s (c = 10).


Multilayer Perceptron (MLP)

Diagram: a two-layer network first maps the input x (d = 3072) through W1 to a hidden vector h (h = 100), then through W2 to the scores s (c = 10). Compare with the single-layer network s = Wx above.

Stacking linear layers still gives a linear function (see the sketch below):

  • s = W2(W1x) = Wx with W = W2W1

How can we add non-linearity?

→ Activation functions!
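A quick numerical check of this point, as a sketch with small, arbitrarily chosen dimensions: two stacked linear layers are exactly one linear layer, while an element-wise non-linearity in between breaks the collapse.

import numpy as np

rng = np.random.default_rng(0)
d, h, c = 8, 5, 3                       # small illustrative dimensions
x = rng.standard_normal(d)
W1 = rng.standard_normal((h, d))
W2 = rng.standard_normal((c, h))

# Two stacked linear layers collapse into one linear layer W = W2 W1.
s_two_layers = W2 @ (W1 @ x)
s_one_layer = (W2 @ W1) @ x
print(np.allclose(s_two_layers, s_one_layer))   # True

# With a non-linearity (e.g. ReLU) in between, no single matrix reproduces the map.
s_nonlinear = W2 @ np.maximum(0, W1 @ x)
print(np.allclose(s_nonlinear, s_one_layer))    # False (in general)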


Activation Functions

  • Sigmoid
  • tanh
  • ReLU
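Written as a minimal numpy sketch (definitions only, not code from the slides):

import numpy as np

def sigmoid(x):
    # Squashes inputs into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes inputs into (-1, 1), zero-centered
    return np.tanh(x)

def relu(x):
    # Zero for negative inputs, identity for positive inputs
    return np.maximum(0, x)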


An Example of a Neural Network


Computing Gradients

What do we need for (Stochastic) Gradient Descent?

  • The gradient of the classification loss w.r.t. each parameter (weight).

  • Each gradient indicates how much that particular weight contributed to the incorrect prediction.

We want to find parameter values where the loss is close to 0, i.e., we are at the bottom (a minimum) of the loss surface.

Diagram: a two-layer network x → W1 → W2 → ŷ, whose weights W1 and W2 we want to update.
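A minimal sketch of the resulting update step (the shapes, random placeholder gradients, and learning rate are illustrative assumptions):

import numpy as np

def sgd_step(weights, grads, learning_rate=1e-4):
    # One gradient descent step: move each weight opposite its gradient
    for w, g in zip(weights, grads):
        w -= learning_rate * g   # in-place update

# Toy usage with random weights and placeholder gradients
w1, w2 = np.random.randn(3072, 100), np.random.randn(100, 10)
grad_w1, grad_w2 = np.random.randn(3072, 100), np.random.randn(100, 10)
sgd_step([w1, w2], [grad_w1, grad_w2])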


Implementation: 2-layer MLP

import numpy as np
from numpy.random import randn

# Network definition
# (n: #examples, d: input dim, h: hidden dim, c: #classes)
n, d, h, c = 64, 1000, 100, 10
x, y = randn(n, d), randn(n, c)
w1, w2 = randn(d, h), randn(h, c)
learning_rate = 1e-4

for t in range(1000):
    # Forward pass: predicting using the current network
    y_0 = x.dot(w1)                      # pre-activation, shape (n, h)
    h_0 = 1 / (1 + np.exp(-y_0))         # sigmoid hidden activation
    y_pred = h_0.dot(w2)                 # predicted scores, shape (n, c)
    loss = np.square(y_pred - y).sum()   # sum-of-squares loss
    print(t, loss)

    # Calculate the analytical gradients
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_0.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h_0 * (1 - h_0))

    # Gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
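A common way to sanity-check analytical gradients like these is to compare them against numerical (finite-difference) gradients. A minimal sketch, using an illustrative helper and a tiny linear least-squares loss rather than the full MLP:

import numpy as np
from numpy.random import randn

def numerical_grad(f, w, eps=1e-6):
    # Central finite-difference gradient of the scalar function f() w.r.t. array w
    grad = np.zeros_like(w)
    for idx in np.ndindex(w.shape):
        old = w[idx]
        w[idx] = old + eps
        f_plus = f()
        w[idx] = old - eps
        f_minus = f()
        w[idx] = old
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

# Tiny example: loss = sum((x.dot(w) - y)**2), analytic gradient 2 x^T (x w - y)
x, y, w = randn(4, 3), randn(4, 2), randn(3, 2)
loss = lambda: np.square(x.dot(w) - y).sum()
analytic = 2.0 * x.T.dot(x.dot(w) - y)
numeric = numerical_grad(loss, w)
print(np.allclose(analytic, numeric, atol=1e-4))   # should print True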


Computing Gradients

For even more complex neural nets, deriving every gradient by hand quickly becomes impractical. We need a systematic procedure: backpropagation.


Backpropagation: Computing Gradients


Computational Graph

f(x, W) = Wx + b

Diagram: the inputs x and W feed a × node, whose output feeds a + node together with b, producing f. The forward pass evaluates the graph from inputs to output; backpropagation then flows gradients backward through the same graph, node by node.
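As a minimal sketch of this graph in code, treating W, x, and b as scalars for illustration (the values are arbitrary):

import numpy as np

# Inputs (arbitrary illustrative values)
W, x, b = 3.0, -2.0, 5.0

# Forward pass: evaluate the graph node by node
mult = W * x          # × node
f = mult + b          # + node

# Backpropagation: walk the graph in reverse, one node at a time
grad_f = 1.0
grad_mult = grad_f * 1.0      # + node passes the gradient through
grad_b = grad_f * 1.0
grad_W = grad_mult * x        # × node: local gradient w.r.t. W is x
grad_x = grad_mult * W        # × node: local gradient w.r.t. x is W
print(f, grad_W, grad_x, grad_b)   # -1.0, -2.0, 3.0, 1.0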


Backpropagation Example

f(x, y, z) = (x+y)z

Diagram: a + node computes q = x + y, and a × node computes f = qz.

For example, suppose the input is x = -2, y = 5, z = -4.

Forward pass: q = x + y = 3 and f = qz = -12.

Backpropagation: we need the partial derivative of f w.r.t. each variable (x, y, z).

  • The very last one is simple: ∂f/∂f = 1.

  • Partial derivative of f w.r.t. z: ∂f/∂z = q = 3.

  • Partial derivative of f w.r.t. q: ∂f/∂q = z = -4.

  • Partial derivative of f w.r.t. x: by the chain rule, ∂f/∂x = (∂f/∂q)(∂q/∂x) = (-4)(1) = -4.

  • Partial derivative of f w.r.t. y: likewise, ∂f/∂y = (∂f/∂q)(∂q/∂y) = (-4)(1) = -4.
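These values are easy to verify numerically; a small sketch (not from the slides) compares the analytic gradients above with central finite differences:

def f(x, y, z):
    return (x + y) * z

x, y, z = -2.0, 5.0, -4.0
eps = 1e-6

# Analytic gradients from the chain rule above
grad_x, grad_y, grad_z = z, z, x + y            # -4, -4, 3

# Finite-difference approximations
num_x = (f(x + eps, y, z) - f(x - eps, y, z)) / (2 * eps)
num_y = (f(x, y + eps, z) - f(x, y - eps, z)) / (2 * eps)
num_z = (f(x, y, z + eps) - f(x, y, z - eps)) / (2 * eps)
print(grad_x, num_x)   # -4.0, ~-4.0
print(grad_y, num_y)   # -4.0, ~-4.0
print(grad_z, num_z)   #  3.0, ~3.0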


Chain Rule

Diagram: a single node f inside a larger graph. During the forward pass, the node computes its output from its inputs. During backpropagation, the node receives the upstream gradient (the gradient of the final output w.r.t. its own output), multiplies it by its local gradient (the gradient of its output w.r.t. each input), and passes the product on as the downstream gradient:

downstream gradient = local gradient × upstream gradient
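A minimal sketch of this pattern in code: each node ("gate") caches what it needs in the forward pass and, in the backward pass, multiplies its local gradients by the upstream gradient (the class and method names are illustrative, not an API from the slides):

class MultiplyGate:
    def forward(self, a, b):
        # Cache the inputs; they are the local gradients for the backward pass
        self.a, self.b = a, b
        return a * b

    def backward(self, upstream):
        # downstream = local gradient × upstream gradient
        return self.b * upstream, self.a * upstream

class AddGate:
    def forward(self, a, b):
        return a + b

    def backward(self, upstream):
        # Local gradient is 1 for both inputs: the add gate distributes the gradient
        return upstream, upstream

# Reproduce f(x, y, z) = (x + y)z with x = -2, y = 5, z = -4
add, mul = AddGate(), MultiplyGate()
q = add.forward(-2.0, 5.0)              # q = 3
f = mul.forward(q, -4.0)                # f = -12
grad_q, grad_z = mul.backward(1.0)      # -4, 3
grad_x, grad_y = add.backward(grad_q)   # -4, -4
print(f, grad_x, grad_y, grad_z)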


Another Example: Logistic Regression

Diagram: the inputs w0, x0, w1, x1, b feed a graph of elementary nodes: two × nodes compute w0x0 and w1x1, two + nodes add them together and then add b, and the chain *-1 → exp → +1 → 1/x computes the sigmoid, giving

f(w, x) = 1 / (1 + exp(-(w0x0 + w1x1 + b)))

Forward pass, with w0 = 2.00, x0 = -1.00, w1 = -3.00, x1 = -2.00, b = -3.00:

  • w0x0 = -2.00 and w1x1 = 6.00
  • their sum is 4.00; adding b gives 1.00
  • *-1 gives -1.00; exp gives 0.37; +1 gives 1.37; 1/x gives f = 0.73

Backpropagation, working from the output back to the inputs. At every node, the downstream gradient is the upstream gradient times the local gradient:

  • Output: gradient 1.00.
  • 1/x node: local gradient -1/x² = -1/1.37² = -0.53, so the gradient becomes 1.00 × (-0.53) = -0.53.
  • +1 node: local gradient 1, so the gradient stays -0.53.
  • exp node: local gradient exp(-1.00) = 0.37, so the gradient becomes -0.53 × 0.37 = -0.20.
  • *-1 node: local gradient -1, so the gradient becomes 0.20.
  • + nodes: local gradient 1 for every input, so the gradient 0.20 is passed unchanged to b and to both products w0x0 and w1x1.
  • × nodes: the local gradient w.r.t. one input is the other input, so
    grad w0 = 0.20 × x0 = -0.20, grad x0 = 0.20 × w0 = 0.40,
    grad w1 = 0.20 × x1 = -0.40, grad x1 = 0.20 × w1 = -0.60.

Note that the last four nodes (*-1, exp, +1, 1/x) together form the sigmoid σ, so their combined local gradient can be computed in one step as (1 - σ)σ = 0.27 × 0.73 ≈ 0.20, which matches the value flowing into the + node above.


Patterns in Gradient Flow
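As a hedged summary sketch of the patterns already visible in the worked examples above (names and values are illustrative):

# Patterns in how gradients flow through common gates, one backward step each
upstream = 0.20

# Add gate: local gradient is 1 for every input, so it distributes the
# upstream gradient unchanged (as the + nodes above did).
grad_a_add = grad_b_add = upstream

# Multiply gate: the local gradient w.r.t. one input is the other input,
# so it "swaps" the inputs and scales the upstream gradient (as the × nodes did).
a, b = 2.0, -1.0
grad_a_mul = upstream * b   # gradient w.r.t. a
grad_b_mul = upstream * a   # gradient w.r.t. b

# Sigmoid gate: the chain *-1, exp, +1, 1/x collapses into one local gradient
# (1 - sigma) * sigma, as noted in the logistic regression example.
sigma = 0.73
grad_input = upstream * (1 - sigma) * sigma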


Gradient Implementation

import numpy as np

def f(w0, x0, w1, x1, b):
    # Forward pass: compute and keep every intermediate value (s0, s1, s2, s3, f)
    s0 = w0 * x0
    s1 = w1 * x1
    s2 = s0 + s1
    s3 = s2 + b
    f = 1.0 / (1.0 + np.exp(-s3))    # sigmoid

    # Gradient computation: walk the graph backward, reusing the cached values
    grad_f = 1.0
    grad_s3 = grad_f * (1 - f) * f   # sigmoid local gradient
    grad_b = grad_s3                 # + gate distributes the gradient
    grad_s2 = grad_s3
    grad_s0 = grad_s2
    grad_s1 = grad_s2
    grad_w1 = grad_s1 * x1           # × gate: multiply by the other input
    grad_x1 = grad_s1 * w1
    grad_w0 = grad_s0 * x0
    grad_x0 = grad_s0 * w0
    return f, (grad_w0, grad_x0, grad_w1, grad_x1, grad_b)
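Calling it with the example values from the logistic regression graph above reproduces the hand-computed gradients (a usage sketch):

out, grads = f(2.0, -1.0, -3.0, -2.0, -3.0)
print(out)     # 0.731..., the forward value 0.73 from the graph above
print(grads)   # approximately (-0.197, 0.393, -0.393, -0.590, 0.197),
               # i.e. the hand-computed (-0.20, 0.40, -0.40, -0.60, 0.20) up to rounding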
