A | B | C | D | E | F | G | H | I | J | K | L | M | N | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | ||||||||||||||

2 | Gradient Descent | |||||||||||||

3 | We know by now that neural networks learn by making guesses about the parameters of a function (filters in a convolutional layer, weights in a dense layer), and updating those guesses based on how closely the network's outputs match the known labels in our data. We might also know that networks somehow do this using derivatives, and, as I learned not too long ago, that a derivative is just the rate of change of a thing. | |||||||||||||

4 | ||||||||||||||

5 | The Gradient | |||||||||||||

6 | In order to do this, we need a way to collectively compare the outputs from our network to the labels in our data. There are multiple loss functions we can use here, but I like RMSE (root mean squared error) because it's pretty straightforward mathematically, and I'm a simple man. | |||||||||||||

7 | ||||||||||||||

8 | To understand RMSE, let's imagine that we have the following set of training data, where x and y are linearly related. A linear relationship can be represented in the form y = mx + b. In this case, m = 2 and b = 30. | x | y | |||||||||||

9 | 7 | 44 | ||||||||||||

10 | 9 | 48 | ||||||||||||

11 | 28 | 86 | ||||||||||||

12 | 30 | 90 | ||||||||||||

13 | 50 | 130 | ||||||||||||

14 | m_actual | 2 | ||||||||||||

15 | b_actual | 30 | ||||||||||||
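As a quick sanity check, here's a minimal Python sketch confirming that every (x, y) pair above sits on the line y = 2x + 30. The helper name `on_line` is just something made up for illustration; it isn't part of the sheet.

```python
# (x, y) pairs copied from the table above.
data = [(7, 44), (9, 48), (28, 86), (30, 90), (50, 130)]
m_actual, b_actual = 2, 30


def on_line(x, y, m, b):
    """Return True if the point (x, y) satisfies y = m*x + b exactly."""
    return y == m * x + b


# True only if every pair in the table fits the line.
all_on_line = all(on_line(x, y, m_actual, b_actual) for x, y in data)
print(all_on_line)
```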

16 | ||||||||||||||

17 | We know that the relationship between x and y is linear, but we don't know the value of the parameters m and b. What we can do is make a guess that m=1 and b=1 and see what values of y we'd predict using those values of m and b. | x | y_pred | |||||||||||

18 | 7 | 8 | ||||||||||||

19 | 9 | 10 | ||||||||||||

20 | 28 | 29 | ||||||||||||

21 | 30 | 31 | ||||||||||||

22 | 50 | 51 | ||||||||||||

23 | m_guess | 1 | ||||||||||||

24 | b_guess | 1 | ||||||||||||
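The same guess can be sketched in a few lines of Python. `predict` is a hypothetical helper name, and the starting values m = 1 and b = 1 are the ones used above.

```python
xs = [7, 9, 28, 30, 50]
m_guess, b_guess = 1, 1


def predict(x, m, b):
    """Predict y for a single x using the current guesses for m and b."""
    return m * x + b


# One prediction per x value, matching the y_pred column above.
y_pred = [predict(x, m_guess, b_guess) for x in xs]
print(y_pred)
```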

25 | ||||||||||||||

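26 | We can compare our predicted y values to our actual y values by subtracting one from the other and squaring the result. Summing the (y_pred - y)^2 values and taking the square root gives us a single number that tells us how close our guesses are to our actual labels. Strictly speaking, RMSE averages the squared errors before taking the root; this sheet sums them instead, which scales the number by a constant but doesn't move the best values of m and b. The lower this number, the more accurate our guesses. | y | y_pred | (y_pred - y) ^2 | ||||||||||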

27 | 44 | 8 | 1,296 | |||||||||||

28 | 48 | 10 | 1,444 | |||||||||||

29 | 86 | 29 | 3,249 | |||||||||||

30 | 90 | 31 | 3,481 | |||||||||||

31 | 130 | 51 | 6,241 | |||||||||||

32 | ||||||||||||||

33 | rmse | 125 | ||||||||||||
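Here's a rough Python sketch of that calculation, using the y and y_pred values from the table. It computes two things: the root of the summed squared errors (which is what gives 125 for this data) and the root of the mean squared error (the textbook RMSE). The variable names are my own.

```python
import math

ys = [44, 48, 86, 90, 130]
y_pred = [8, 10, 29, 31, 51]

# Squared difference between each prediction and its label.
squared_errors = [(p - y) ** 2 for p, y in zip(y_pred, ys)]

# Square root of the summed squared errors.
root_sum_sq = math.sqrt(sum(squared_errors))

# Textbook RMSE: square root of the mean squared error.
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))

print(squared_errors, root_sum_sq, rmse)
```

The two numbers differ only by a factor of the square root of the number of data points, so they rank any two guesses the same way.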

34 | ||||||||||||||

35 | We can see that our guesses are not that accurate right now! That's fine though - we haven't done any optimization yet. Let's start by visualizing our loss function as a surface: one loss value for every combination of m and b. (The gradient is not the surface itself. It's the slope of that surface at a given point, made up of the derivative of the loss with respect to each parameter.) Unfortunately, Google Sheets doesn't support 3D area charts, so we're going to have to make do with conditional formatting. Our rows represent different guesses for the value of m, and our columns different guesses for the value of b. RMSE decreases as we get closer to the actual values of m and b: | |||||||||||||

36 | ||||||||||||||

37 | ||||||||||||||

38 | ||||||||||||||

39 | ||||||||||||||

40 | ||||||||||||||

41 | ||||||||||||||

42 | ||||||||||||||

43 | ||||||||||||||

44 | ||||||||||||||

45 | m \ b | 0 | 3 | 6 | 9 | 12 | 15 | 18 | 21 | 24 | 27 | 30 | ||

46 | 1 | 127 | 121 | 115 | 108 | 102 | 96 | 89 | 83 | 77 | 71 | 66 | ||

47 | 2 | 67 | 60 | 54 | 47 | 40 | 34 | 27 | 20 | 13 | 7 | 0 | ||

48 | 3 | 37 | 36 | 35 | 36 | 38 | 41 | 45 | 50 | 55 | 60 | 66 | ||

49 | 4 | 83 | 87 | 91 | 95 | 100 | 105 | 110 | 115 | 120 | 126 | 131 | ||

50 | 5 | 145 | 150 | 154 | 159 | 164 | 170 | 175 | 180 | 186 | 191 | 197 | ||

51 | 6 | 209 | 214 | 219 | 224 | 230 | 235 | 240 | 246 | 251 | 257 | 263 | ||

52 | 7 | 274 | 279 | 285 | 290 | 295 | 301 | 306 | 312 | 317 | 323 | 328 | ||

53 | 8 | 339 | 345 | 350 | 355 | 361 | 366 | 372 | 377 | 383 | 388 | 394 | ||
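A grid like the one above can be sketched in Python by evaluating the loss for every (m, b) pair. This assumes the five training points from earlier and the same root-of-summed-squared-error loss; `loss` and `grid` are names I've made up for illustration.

```python
import math

xs = [7, 9, 28, 30, 50]
ys = [44, 48, 86, 90, 130]


def loss(m, b):
    """Square root of the summed squared errors for a given m and b."""
    return math.sqrt(sum((m * x + b - y) ** 2 for x, y in zip(xs, ys)))


m_values = range(1, 9)      # rows: m = 1..8
b_values = range(0, 31, 3)  # columns: b = 0, 3, ..., 30

# One loss value per (m, b) combination.
grid = {(m, b): loss(m, b) for m in m_values for b in b_values}

# Print the grid with m down the side and b across the top.
print("m\\b " + " ".join(f"{b:4d}" for b in b_values))
for m in m_values:
    print(f"{m:3d} " + " ".join(f"{grid[(m, b)]:4.0f}" for b in b_values))

# The (m, b) pair with the smallest loss.
best = min(grid, key=grid.get)
```

The lowest cell sits at m = 2, b = 30, which is exactly where the loss drops to 0.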

54 | ||||||||||||||

55 | The Descent | |||||||||||||

56 | Imagine the loss surface above as a landscape with peaks and valleys. If we were blindfolded and dropped onto this landscape with the goal of getting to its lowest possible point, one way to accomplish our objective would be to test the ground around us with a foot, take a step wherever the descent feels steepest, and repeat. This is what derivatives allow us to do. In the following table, we have 20 values of x along with their actual y values. In the first row, we make guesses about the values of m and b, use them to predict y, and calculate the squared error of that prediction. Next, we calculate what the squared error would be if we added 0.01 to our guess of m. Dividing the change in error by 0.01 gives us an estimate of the derivative of our error with respect to m (how quickly our error changes as m changes). We use that estimate to pick a new value of m by subtracting the derivative times the learn parameter, essentially taking a step in the direction of steepest descent. The learn parameter decides how large or small a step we take. We do the same thing with b, copy our new values of m and b to the next row, and do it all over again. When we've completed this process against every pair of x and y values in our dataset, we've completed one epoch. The "Run Epoch" button below completes as many epochs as we specify (5 by default) and records the results in a table, showing how RMSE changes after each epoch. Hitting "Reset" removes the recorded values and sets our guesses for both parameters back to 1. Try messing around with the parameters and running a few epochs to see if you can build an intuition around gradient descent. | |||||||||||||
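To make the "test the ground with a foot" idea concrete, here's a minimal Python sketch of a single step using the 0.01 nudge described above. The function names are hypothetical, and I'm assuming both derivatives are estimated from the same starting m and b before either is updated.

```python
def squared_error(m, b, x, y):
    """Squared error of the prediction m*x + b against the label y."""
    return (m * x + b - y) ** 2


def finite_difference_step(m, b, x, y, learn=0.0001, nudge=0.01):
    """Take one descent step on a single (x, y) pair.

    Each derivative is estimated by nudging one parameter by `nudge`
    and measuring how much the squared error changes.
    """
    base = squared_error(m, b, x, y)
    de_dm = (squared_error(m + nudge, b, x, y) - base) / nudge
    de_db = (squared_error(m, b + nudge, x, y) - base) / nudge
    # Step against the slope: a negative derivative pushes the parameter up.
    return m - learn * de_dm, b - learn * de_db


# Start from the "Reset" guesses (m = 1, b = 1) on the first data point.
new_m, new_b = finite_difference_step(1.0, 1.0, x=7, y=44)
print(new_m, new_b)
```

Both parameters move up from 1, which is the right direction given that the actual values are 2 and 30.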

57 | ||||||||||||||

58 | ||||||||||||||

59 | ||||||||||||||

60 | ||||||||||||||

61 | ||||||||||||||

62 | ||||||||||||||

63 | ||||||||||||||

64 | ||||||||||||||

65 | ||||||||||||||

66 | ||||||||||||||

67 | ||||||||||||||

68 | ||||||||||||||

69 | ||||||||||||||

70 | ||||||||||||||

71 | ||||||||||||||

72 | ||||||||||||||

73 | ||||||||||||||

74 | ||||||||||||||

75 | ||||||||||||||

76 | ||||||||||||||

77 | ||||||||||||||

78 | actual | guess | ||||||||||||

79 | epochs | 5 | m | 2 | 1.657 | rmse | 87 | |||||||

80 | learn | 0.0001 | b | 30 | 16.423 | |||||||||

81 | ||||||||||||||

82 | ||||||||||||||

83 | (mx+b) | (mx+b-y)^2 | 2(mx+b-y) | 2x(mx+b-y) | ||||||||||

84 | x | y | m | b | y_pred | error_sq | err_m1 | de/dm | new_m | err_b1 | de/db | new_b | ||

85 | 7 | 44 | 1.657 | 16.423 | 28.0 | 255 | 253 | -32 | 1.660 | 255 | -224 | 16.445 | ||

86 | 9 | 48 | 1.660 | 16.445 | 31.4 | 276 | 273 | -33 | 1.664 | 276 | -299 | 16.475 | ||

87 | 28 | 86 | 1.664 | 16.475 | 63.1 | 527 | 514 | -46 | 1.668 | 526 | -1,285 | 16.604 | ||

88 | 30 | 90 | 1.668 | 16.604 | 66.6 | 545 | 531 | -47 | 1.673 | 545 | -1,401 | 16.744 | ||

89 | 50 | 130 | 1.673 | 16.744 | 100.4 | 877 | 848 | -59 | 1.679 | 877 | -2,962 | 17.040 | ||

90 | 22 | 74 | 1.679 | 17.040 | 54.0 | 401 | 392 | -40 | 1.683 | 401 | -881 | 17.128 | ||

91 | 3 | 36 | 1.683 | 17.128 | 22.2 | 191 | 190 | -28 | 1.685 | 191 | -83 | 17.136 | ||

92 | 9 | 48 | 1.685 | 17.136 | 32.3 | 246 | 243 | -31 | 1.689 | 246 | -282 | 17.165 | ||

93 | 30 | 90 | 1.689 | 17.165 | 67.8 | 492 | 479 | -44 | 1.693 | 491 | -1,331 | 17.298 | ||

94 | 23 | 76 | 1.693 | 17.298 | 56.2 | 391 | 382 | -40 | 1.697 | 390 | -909 | 17.389 | ||

95 | 40 | 110 | 1.697 | 17.389 | 85.3 | 612 | 592 | -49 | 1.702 | 611 | -1,979 | 17.587 | ||

96 | 18 | 66 | 1.702 | 17.587 | 48.2 | 316 | 310 | -36 | 1.706 | 316 | -640 | 17.651 | ||

97 | 7 | 44 | 1.706 | 17.651 | 29.6 | 208 | 206 | -29 | 1.708 | 207 | -202 | 17.671 | ||

98 | 14 | 58 | 1.708 | 17.671 | 41.6 | 269 | 265 | -33 | 1.712 | 269 | -460 | 17.717 | ||

99 | 41 | 112 | 1.712 | 17.717 | 87.9 | 581 | 561 | -48 | 1.716 | 581 | -1,977 | 17.914 | ||

100 | 14 | 58 | 1.716 | 17.914 | 41.9 | 258 | 253 | -32 | 1.720 | 257 | -450 | 17.959 |
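The whole loop can be sketched in Python along these lines. A few assumptions to flag: the data is only the 16 (x, y) rows visible above, not the full 20; the derivatives use the calculus shortcuts d/dm (mx+b-y)^2 = 2x(mx+b-y) and d/db (mx+b-y)^2 = 2(mx+b-y) rather than the 0.01 nudge; and `run_epochs` is a made-up name, not the code behind the "Run Epoch" button.

```python
import math

# The 16 (x, y) rows visible in the table above.
data = [(7, 44), (9, 48), (28, 86), (30, 90), (50, 130), (22, 74),
        (3, 36), (9, 48), (30, 90), (23, 76), (40, 110), (18, 66),
        (7, 44), (14, 58), (41, 112), (14, 58)]


def loss(m, b):
    """Square root of the summed squared errors over the whole dataset."""
    return math.sqrt(sum((m * x + b - y) ** 2 for x, y in data))


def run_epochs(m, b, epochs=5, learn=0.0001):
    """Update m and b one data point at a time, for `epochs` full passes."""
    history = []
    for _ in range(epochs):
        for x, y in data:
            residual = m * x + b - y
            de_dm = 2 * x * residual  # derivative of squared error w.r.t. m
            de_db = 2 * residual      # derivative of squared error w.r.t. b
            m -= learn * de_dm
            b -= learn * de_db
        history.append(loss(m, b))    # record the loss after each epoch
    return m, b, history


# Start from the "Reset" guesses of m = 1 and b = 1.
m_final, b_final, history = run_epochs(1.0, 1.0)
print(m_final, b_final, history)
```

Try a larger `learn` or more `epochs` to see how quickly (or not) the guesses approach m = 2 and b = 30.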
