Lecture 2:
Neural Networks and
Backpropagation
Soo Kyung Kim (This lecture is based on the cs231n class by Prof. Li Fei-Fei at Stanford)
Spring 2025
Neural Networks
Perceptron
Perceptron
The softmax classifier is a special case of the perceptron (the same template with f = softmax)!
With f = I (the identity), it is linear regression.
[Figure: a perceptron computes y = f(Wx) with f = σ; input x, linear score Wx, output y]
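As a minimal NumPy sketch of this template (toy dimensions d = 4 and c = 3, chosen only for illustration), the same score s = Wx can be fed through different choices of f:

import numpy as np
from numpy.random import randn

d, c = 4, 3                      # toy input dimension and number of classes
x, W = randn(d), randn(c, d)

s = W.dot(x)                     # linear scores

y_identity = s                                  # f = I: linear regression
y_sigmoid = 1 / (1 + np.exp(-s))                # f = sigma: sigmoid outputs
p_softmax = np.exp(s - s.max())                 # f = softmax: softmax classifier
p_softmax /= p_softmax.sum()                    # (shifted by max for numerical stability)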
Neural Network with a Single Layer
[Figure: single-layer network s = Wx; input x has dimension d = 3072 and the scores s cover c = 10 classes]
Multilayer Perceptron (MLP)
[Figure: two-layer network; input x (d = 3072), hidden layer h of size 100 via W1, scores s (c = 10) via W2]
Multilayer Perceptron (MLP)
[Figure: the single-layer network s = Wx (d = 3072, c = 10) shown alongside the two-layer network with weights W1 and W2 and a hidden layer h of size 100]
Multilayer Perceptron (MLP)
[Figure: the same two-layer network, x (d = 3072) → W1 → h (size 100) → W2 → s (c = 10)]
Stacking multiple linear layers is still linear.
How can we add non-linearity?
→ Activation functions!
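A quick numerical check of this point, using the lecture's dimensions (3072 → 100 → 10) with random weights: two stacked linear layers are equivalent to a single linear layer whose weight matrix is W2·W1.

import numpy as np
from numpy.random import randn

x = randn(3072)
W1, W2 = randn(100, 3072), randn(10, 100)

two_layers = W2.dot(W1.dot(x))              # s = W2 (W1 x)
one_layer = W2.dot(W1).dot(x)               # s = (W2 W1) x, a single linear map
print(np.allclose(two_layers, one_layer))   # True: no expressive power was gained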
Activation Functions
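A minimal NumPy sketch of three activation functions commonly introduced at this point (the exact set shown on the slide may differ):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))    # squashes values into (0, 1)

def tanh(x):
    return np.tanh(x)              # squashes values into (-1, 1)

def relu(x):
    return np.maximum(0, x)        # zero for negative inputs, identity otherwise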
An Example of Neural Network
Computing Gradients
What do we need for (Stochastic) Gradient Descent?
We want to find weight values where the loss is close to 0, meaning we are at the bottom of the loss surface, where it is flat.
[Figure: two-layer network x → W1 → W2 → ŷ, whose prediction ŷ is scored by the loss ℒ]
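A toy sketch of the gradient descent update w ← w − η·∂ℒ/∂w, with made-up numbers and a squared-error loss on a single example rather than the lecture's network:

import numpy as np

eta = 1e-2                          # learning rate (assumed)
w = np.array([1.0, -2.0])           # toy weights
x, y = np.array([0.5, 1.5]), 1.0    # one training example

for step in range(100):
    y_hat = w.dot(x)                # forward pass
    loss = (y_hat - y) ** 2
    grad_w = 2 * (y_hat - y) * x    # analytical gradient dL/dw
    w -= eta * grad_w               # gradient descent step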
Computing Gradients
[Figure: the same two-layer network x → W1 → W2 → ŷ with loss ℒ; for gradient descent we need ∂ℒ/∂W1 and ∂ℒ/∂W2]
Implementation: 2-layer MLP
import numpy as np
from numpy.random import randn

# Network definition (n: #examples, d: input dim, h: hidden dim, c: #classes)
n, d, h, c = 64, 1000, 100, 10
x, y = randn(n, d), randn(n, c)
w1, w2 = randn(d, h), randn(h, c)
learning_rate = 1e-4

for t in range(1000):
    # Forward pass: predict with the current network
    y_0 = x.dot(w1)                   # hidden pre-activations
    h_0 = 1 / (1 + np.exp(-y_0))      # hidden activations (sigmoid)
    y_pred = h_0.dot(w2)
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Calculate the analytical gradients (backpropagation)
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_0.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h_0 * (1 - h_0))

    # Gradient descent update
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
Computing Gradients
Computing Gradients
Even more complex neural nets ...
Backpropagation: Computing Gradients
Computational Graph
f(x,W) = Wx + b
[Computational graph: x and W enter a × node; its output and b enter a + node producing f. The forward pass evaluates the graph left to right; backpropagation sends gradients from f back to the inputs.]
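A minimal sketch of the forward pass and backpropagation through this graph, treating W, x, and b as scalars with made-up values:

# f(x, W) = W*x + b, with scalar values chosen only for illustration
W, x, b = 2.0, 3.0, 1.0

# Forward pass: evaluate the nodes left to right
mul = W * x            # output of the × node
f = mul + b            # output of the + node

# Backpropagation: apply the chain rule from the output back to the inputs
grad_f = 1.0                 # df/df
grad_mul = grad_f * 1.0      # + node: local gradient is 1 for each input
grad_b = grad_f * 1.0
grad_W = grad_mul * x        # × node: local gradient w.r.t. W is x
grad_x = grad_mul * W        # × node: local gradient w.r.t. x is W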
Backpropagation Example
f(x, y, z) = (x+y)z
[Computational graph: x and y feed a + node producing q; q and z feed a × node producing f]
For example, suppose the input is x = -2, y = 5, z = -4.
Forward pass: q = x + y = 3, f = q · z = -12.
Backpropagation Example
f(x, y, z) = (x+y)z
For example, suppose the input is x = -2, y = 5, z = -4, so that q = 3 and f = -12.
Backpropagation: we need the partial derivative of f w.r.t. each variable (x, y, z).
Backpropagation Example
f(x, y, z) = (x+y)z
With x = -2, y = 5, z = -4 (q = 3, f = -12):
Backpropagation: the very last gradient is simple, ∂f/∂f = 1.
Backpropagation Example
f(x, y, z) = (x+y)z
With x = -2, y = 5, z = -4 (q = 3, f = -12):
Backpropagation: the partial derivative of f w.r.t. z is given by ∂f/∂z = q = 3.
Backpropagation Example
f(x, y, z) = (x+y)z
With x = -2, y = 5, z = -4 (q = 3, f = -12):
Backpropagation: the partial derivative of f w.r.t. q is given by ∂f/∂q = z = -4.
Backpropagation Example
f(x, y, z) = (x+y)z
With x = -2, y = 5, z = -4 (q = 3, f = -12):
Backpropagation: the partial derivative of f w.r.t. x follows from the chain rule, ∂f/∂x = (∂f/∂q)(∂q/∂x) = z · 1 = -4.
Backpropagation Example
f(x, y, z) = (x+y)z
With x = -2, y = 5, z = -4 (q = 3, f = -12):
Backpropagation: likewise, the partial derivative of f w.r.t. y is ∂f/∂y = (∂f/∂q)(∂q/∂y) = z · 1 = -4.
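The whole walkthrough fits in a few lines of Python; this sketch reproduces the forward values and gradients above for this specific graph:

# f(x, y, z) = (x + y) * z with the example inputs from the slides
x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y                # q = 3
f = q * z                # f = -12

# Backpropagation
grad_f = 1.0
grad_z = grad_f * q      # df/dz = q = 3
grad_q = grad_f * z      # df/dq = z = -4
grad_x = grad_q * 1.0    # dq/dx = 1, so df/dx = -4
grad_y = grad_q * 1.0    # dq/dy = 1, so df/dy = -4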
Chain Rule
At each node, the forward pass computes an output from its inputs. During backpropagation, the node receives the upstream gradient (the gradient of the loss with respect to its output), multiplies it by its local gradient (the gradient of its output with respect to each input), and sends the result to each input as the downstream gradient.
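In symbols, for a node that computes z from an input x inside a network with loss ℒ, the rule reads:

\[
\underbrace{\frac{\partial \mathcal{L}}{\partial x}}_{\text{downstream gradient}}
= \underbrace{\frac{\partial z}{\partial x}}_{\text{local gradient}}
\cdot
\underbrace{\frac{\partial \mathcal{L}}{\partial z}}_{\text{upstream gradient}}
\]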
Another Example: Logistic Regression
[Computational graph: w0·x0 and w1·x1 are computed by × nodes, summed by + nodes together with b, then passed through *(-1), exp, +1, and 1/x, i.e. f = 1 / (1 + exp(-(w0·x0 + w1·x1 + b)))]
Another Example: Logistic Regression
[Forward pass with w0 = 2.00, x0 = -1.00, w1 = -3.00, x1 = -2.00, b = -3.00: w0·x0 = -2.00, w1·x1 = 6.00, their sum = 4.00, plus b = 1.00, after *(-1) = -1.00, exp = 0.37, +1 = 1.37, 1/x = 0.73]
Another Example: Logistic Regression
[Backpropagation starts at the output: the gradient of f w.r.t. itself is 1.00]
Another Example: Logistic Regression
[1/x gate: upstream gradient 1.00, local gradient -1/x² = -1/1.37² = -0.53, so the gradient at its input is -0.53]
Another Example: Logistic Regression
[+1 gate: upstream gradient -0.53, local gradient 1, so the gradient passes through unchanged as -0.53]
Another Example: Logistic Regression
[exp gate: upstream gradient -0.53, local gradient e^(-1.00) = 0.37, so the downstream gradient is -0.53 × 0.37 = -0.20]
Another Example: Logistic Regression
[*(-1) gate: upstream gradient -0.20, local gradient -1, so the downstream gradient is 0.20]
Another Example: Logistic Regression
[+ gate: the gradient 0.20 is distributed unchanged to both inputs, so b and the sum w0·x0 + w1·x1 each receive 0.20]
Another Example: Logistic Regression
[+ gate: the gradient 0.20 is again distributed unchanged, so the w0·x0 and w1·x1 branches each receive 0.20]
Another Example: Logistic Regression
[× gate: upstream gradient 0.20; each local gradient is the other input, so grad w0 = 0.20 × x0 = -0.20 and grad x0 = 0.20 × w0 = 0.40]
Another Example: Logistic Regression
[× gate: upstream gradient 0.20, so grad w1 = 0.20 × x1 = -0.40 and grad x1 = 0.20 × w1 = -0.60]
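Note that the chain of gates *(-1) → exp → +1 → 1/x is exactly the sigmoid σ(s) = 1 / (1 + e^(-s)), and its local gradient has the closed form σ(1 - σ); a quick check that this shortcut reproduces the 0.20 computed gate by gate above:

import numpy as np

s3 = 1.00                              # value entering the sigmoid chain in the forward pass
sigma = 1.0 / (1.0 + np.exp(-s3))      # 0.73
print(sigma * (1.0 - sigma))           # ~0.20: gradient of f w.r.t. the sigmoid's input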
Patterns in Gradient Flow
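A brief sketch of the common patterns (scalar inputs with made-up values): the add gate distributes the upstream gradient unchanged to all of its inputs, the multiply gate scales it by the other input, and the max gate routes it entirely to whichever input was larger in the forward pass.

# Local-gradient patterns for three common gates (illustrative values)
x, y, upstream = 3.0, -4.0, 2.0

# add gate: gradient distributor
grad_x_add, grad_y_add = upstream, upstream

# multiply gate: "swap multiplier" (each input receives the other input times upstream)
grad_x_mul, grad_y_mul = upstream * y, upstream * x

# max gate: gradient router (only the winning input receives the gradient)
grad_x_max = upstream if x > y else 0.0
grad_y_max = upstream if y >= x else 0.0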
Gradient Implementation
import numpy as np

def f(w0, x0, w1, x1, b):
    # Forward pass through the graph: s0, s1, s2, s3, then the sigmoid output
    s0 = w0 * x0
    s1 = w1 * x1
    s2 = s0 + s1
    s3 = s2 + b
    return 1.0 / (1.0 + np.exp(-s3))

# Gradient computation for the example inputs used in the walkthrough above
w0, x0, w1, x1, b = 2.0, -1.0, -3.0, -2.0, -3.0
out = f(w0, x0, w1, x1, b)             # 0.73

grad_f = 1.0
grad_s3 = grad_f * (1 - out) * out     # sigmoid gate: local gradient (1 - out) * out
grad_b = grad_s3                       # add gates pass the gradient through unchanged
grad_s2 = grad_s3
grad_s0 = grad_s2
grad_s1 = grad_s2
grad_w1 = grad_s1 * x1                 # multiply gates: local gradient is the other input
grad_x1 = grad_s1 * w1
grad_w0 = grad_s0 * x0
grad_x0 = grad_s0 * w0
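With these example inputs the computed gradients match the graph walkthrough above: grad_w0 ≈ -0.20, grad_x0 ≈ 0.40, grad_w1 ≈ -0.40, grad_x1 ≈ -0.60, and grad_b ≈ 0.20.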