Gradient Descent
We know by now that neural networks learn by making guesses about the parameters of a function (filters in a convolutional layer, weights in a dense layer), and updating those guesses based on how closely the network's outputs match the known labels in our data.

We might also know that networks somehow do this using derivatives, and, as I learned not too long ago, that a derivative is just the rate of change of a thing.
The Gradient
To do this, we need a way to boil the comparison between our network's outputs and the labels in our data down to a single number. There are multiple loss functions we could use here, but I like RMSE (root mean squared error) because it's pretty straightforward mathematically, and I'm a simple man.
To understand RMSE, let's imagine that we have the following set of training data, where x and y are linearly related. A linear relationship can be represented in the form y = mx + b.

In this case, m = 2 and b = 30.
x     y
7     44
9     48
28    86
30    90
50    130

m_actual    2
b_actual    30
We know that the relationship between x and y is linear, but we don't know the values of the parameters m and b.

What we can do is guess that m = 1 and b = 1, then see what values of y we'd predict with that guess.
x     y_pred
7     8
9     10
28    29
30    31
50    51

m_guess    1
b_guess    1
We can compare our predicted y values to our actual y values by subtracting one from the other and squaring the result. The square root of the sum of (y_pred - y)^2 gives us our RMSE score - a single number that tells us how close our guesses are to our actual labels. (Strictly speaking, RMSE takes the mean of the squared errors before the square root; on a fixed dataset, summing instead of averaging only scales the number and doesn't change which guesses look better.)

The lower our RMSE, the more accurate our guesses.
y     y_pred    (y_pred - y)^2
44    8         1,296
48    10        1,444
86    29        3,249
90    31        3,481
130   51        6,241

rmse    125
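If it helps to see that arithmetic outside of a spreadsheet cell, here's a minimal Python sketch of the same calculation. The numbers come from the tables above; the function names are my own:

    import math

    # Training data from the table above (generated by y = 2x + 30).
    xs = [7, 9, 28, 30, 50]
    ys = [44, 48, 86, 90, 130]

    def predict(x, m, b):
        """Predicted y for a single x, using guessed values of m and b."""
        return m * x + b

    def error(m, b):
        """Square root of the summed squared errors - the score in the table above.
        (Strict RMSE would divide by the number of points before taking the root.)"""
        return math.sqrt(sum((predict(x, m, b) - y) ** 2 for x, y in zip(xs, ys)))

    print(round(error(m=1, b=1)))    # ~125, matching the table above
    print(round(error(m=2, b=30)))   # 0 - the actual parameters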
We can see that our guesses are not that accurate right now! That's fine though - we haven't done any optimization yet.

Let's start by visualizing our loss surface - the value of our loss function for every combination of guesses for m and b. (The gradient is the set of slopes of that surface at a given point, which is what we'll follow downhill in a minute.) Unfortunately, Google Sheets doesn't support 3D area charts, so we're going to have to make do with conditional formatting.

Our rows represent different guesses for the value of m, and our columns different guesses for the value of b. RMSE decreases as we get closer to the actual values of m and b (there's a short code sketch of how this grid is computed just after the table):
m \ b     0      3      6      9     12     15     18     21     24     27     30
1       127    121    115    108    102     96     89     83     77     71     66
2        67     60     54     47     40     34     27     20     13      7      0
3        37     36     35     36     38     41     45     50     55     60     66
4        83     87     91     95    100    105    110    115    120    126    131
5       145    150    154    159    164    170    175    180    186    191    197
6       209    214    219    224    230    235    240    246    251    257    263
7       274    279    285    290    295    301    306    312    317    323    328
8       339    345    350    355    361    366    372    377    383    388    394
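If you're curious how a grid like that gets filled in, it's just the same error score evaluated for every combination of guesses. Here's a rough Python version; the ranges mirror the rows and columns above, and rounding to whole numbers should land on the same values (give or take a rounding edge case):

    import math

    # The 5 (x, y) points from the tables above.
    xs = [7, 9, 28, 30, 50]
    ys = [44, 48, 86, 90, 130]

    def error(m, b):
        # Root of the summed squared errors, the same score as before.
        return math.sqrt(sum((m * x + b - y) ** 2 for x, y in zip(xs, ys)))

    # Rows are guesses for m (1 through 8), columns are guesses for b (0, 3, ..., 30).
    for m in range(1, 9):
        print(m, [round(error(m, b)) for b in range(0, 31, 3)])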
The Descent
Imagine the loss surface above as a landscape with peaks and valleys. If we were blindfolded and dropped onto this landscape with the goal of getting to its lowest possible point, one way to accomplish our objective would be to test the ground around us with a foot, take a step wherever the descent feels steepest, and repeat.

This is what derivatives allow us to do.

In the following table, we have 20 values of x along with their actual y values. In the first row, we make guesses about the values of m and b, predict y from those guesses, and calculate the squared error.

Next, we calculate what the squared error would be if we added 0.01 to our guess of m. We find the derivative of our error with respect to m (how much our error changed relative to how much we changed m), and use that information to pick a new value of m, essentially taking a step in the direction of steepest descent. The learn parameter decides how large or small a step we take.

We do the same thing with b, copy our new values of m and b to the next row, and do it all over again.

When we've completed this process for every pair of x and y values in our dataset, we've completed one epoch.

The "Run Epoch" button below completes as many epochs as we specify (5 by default) and records the results in a table, showing how RMSE changes after each epoch. Hitting "Reset" clears the recorded values and sets our guesses for both parameters back to 1.

Try messing around with the parameters and running a few epochs to see if you can build an intuition for gradient descent.

You're probably going to want your own copy of this one - the buttons don't work in view-only mode.
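If you'd rather read that procedure as code than click through it, here's a rough Python sketch of one epoch. It isn't the actual script behind the buttons (that's linked below) - the function names, the one-sided 0.01 nudge, and the use of the 5-point table from earlier instead of the sheet's full 20-point dataset are my own simplifications:

    import math

    def error_sq(x, y, m, b):
        """Squared error for a single training example."""
        return (m * x + b - y) ** 2

    def run_epoch(xs, ys, m, b, learn=0.0001, nudge=0.01):
        """One pass over the data, updating m and b after every example."""
        for x, y in zip(xs, ys):
            base = error_sq(x, y, m, b)
            # How much does the error change when we nudge each parameter a little?
            de_dm = (error_sq(x, y, m + nudge, b) - base) / nudge
            de_db = (error_sq(x, y, m, b + nudge) - base) / nudge
            # Step each parameter downhill; learn controls the step size.
            m = m - learn * de_dm
            b = b - learn * de_db
        return m, b

    def score(xs, ys, m, b):
        # Root of the summed squared errors, as in the tables above.
        return math.sqrt(sum(error_sq(x, y, m, b) for x, y in zip(xs, ys)))

    xs = [7, 9, 28, 30, 50]
    ys = [44, 48, 86, 90, 130]

    m, b = 1.0, 1.0  # starting guesses
    for epoch in range(5):  # 5 epochs, the sheet's default
        m, b = run_epoch(xs, ys, m, b)
        print(epoch + 1, round(m, 3), round(b, 3), round(score(xs, ys, m, b), 1))

One thing this makes obvious: because the error's sensitivity to m scales with x, large x values push m around much harder than they push b, which is part of why the learn rate has to be so small here.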
And if it turns out that making a copy of the spreadsheet doesn't copy the scripts, I made a gist of the scripts attached to this spreadsheet here.
                       actual    guess
epochs    5        m       2     2.051      rmse    17
learn     0.0001   b      30     25.075

(Column formulas: y_pred = mx + b and error_sq = (mx + b - y)^2; the derivative columns are computed from 2(mx + b - y) and 2x(mx + b - y).)

x     y     m       b        y_pred   error_sq   err_m1   de/dm   new_m    err_b1   de/db   new_b
7     44    2.051   25.075   39.4     21         20       -9      2.052    21       -64     25.081
9     48    2.052   25.081   43.5     20         19       -9      2.053    20       -80     25.089
28    86    2.053   25.089   82.6     12         10       -7      2.053    12       -192    25.109
30    90    2.053   25.109   86.7     11         9        -7      2.054    11       -197    25.128
50    130   2.054   25.128   127.8    5          3        -4      2.055    5        -216    25.150
22    74    2.055   25.150   70.4     13         12       -7      2.055    13       -161    25.166
3     36    2.055   25.166   31.3     22         22       -9      2.056    22       -28     25.169
9     48    2.056   25.169   43.7     19         18       -9      2.057    19       -78     25.177
30    90    2.057   25.177   86.9     10         8        -6      2.058    10       -187    25.195
23    76    2.058   25.195   72.5     12         11       -7      2.058    12       -160    25.211
40    110   2.058   25.211   107.5    6          4        -5      2.059    6        -196    25.231
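As a quick sanity check, the first row of the table works out like this in Python (just the prediction and squared-error columns; the variable names are mine):

    # First row of the table above: x = 7, y = 44, with the current guesses for m and b.
    m, b = 2.051, 25.075
    x, y = 7, 44

    y_pred = m * x + b              # 39.432 -> shown as 39.4
    error_sq = (y_pred - y) ** 2    # 20.87  -> shown as 21

    print(round(y_pred, 1), round(error_sq))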