A | B | C | D | E | F | G | H | I | |
---|---|---|---|---|---|---|---|---|---|

1 | |||||||||

2 | Xavier Initialization | ||||||||

3 | This explanation of Xavier Initialization is possible thanks to andyljones.tumblr.com. | ||||||||

4 | |||||||||

5 | Xavier Initialization is a method of generating weights for a network so that its final activations start out relatively close to our desired outputs before any other optimization takes place. | ||||||||

6 | |||||||||

7 | Attempt #1: No strategy | ||||||||

8 | Let's start with a network of normally distributed weights with a mean of 0 and a standard deviation of 5. | ||||||||

9 | |||||||||

10 | Our Inputs | Our Desired Outputs | |||||||

11 | 3 | 1 | 2 | 1 | 0 | ||||

12 | |||||||||

13 | Weights #1 | -4.40 | 1.93 | -3.58 | Mean | 0 | |||

14 | -8.07 | -10.21 | 0.31 | Standard Deviation | 5 | ||||

15 | 5.97 | 0.69 | 12.17 | ||||||

16 | Activations #1 | -9.32 | -3.05 | 13.89 | |||||

17 | |||||||||

18 | Weights #2 | -2.31 | -1.94 | 3.18 | Mean | 0 | |||

19 | 6.00 | -6.64 | -4.98 | Standard Deviation | 5 | ||||

20 | 2.67 | -0.81 | -1.31 | ||||||

21 | Activations #2 | 40.30 | 27.12 | -32.56 | |||||

22 | |||||||||

23 | Weights #3 | -3.77 | 3.27 | Mean | 0 | ||||

24 | 5.88 | 6.00 | Standard Deviation | 5 | |||||

25 | -5.21 | -1.60 | |||||||

26 | Activations #3 | 177.23 | 346.91 | ||||||

27 | |||||||||
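The forward pass above can be reproduced outside the spreadsheet. This is a minimal sketch in plain Python, assuming each layer's activations are just the row of inputs multiplied by that layer's weight matrix, with no bias or nonlinearity (which is what the numbers in the sheet imply). The helper name `forward` is my own. The weights are copied from the rounded cells, so the results only match the sheet to within rounding.

```python
def forward(inputs, weights):
    """Multiply a row of inputs by a weight matrix (one matrix row per input)."""
    n_out = len(weights[0])
    return [sum(x * row[j] for x, row in zip(inputs, weights)) for j in range(n_out)]

inputs = [3, 1, 2]
# Weights copied from the rounded cells of Attempt #1.
weights_1 = [[-4.40, 1.93, -3.58], [-8.07, -10.21, 0.31], [5.97, 0.69, 12.17]]
weights_2 = [[-2.31, -1.94, 3.18], [6.00, -6.64, -4.98], [2.67, -0.81, -1.31]]
weights_3 = [[-3.77, 3.27], [5.88, 6.00], [-5.21, -1.60]]

activations = inputs
layer_outputs = []
for weights in (weights_1, weights_2, weights_3):
    activations = forward(activations, weights)
    layer_outputs.append(activations)
    print([round(a, 2) for a in activations])
```

Each printed row should land close to the corresponding "Activations" row in the sheet, growing from single digits to the hundreds by the third layer.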

28 | Notice that the activations in our first layer are much closer in magnitude to our inputs (and desired outputs) than the activations in later layers. The activations get worse as the signal passes through each successive layer in the network. Andy Jones touches on this in the post linked above, writing: | ||||||||

29 | |||||||||

30 | - If the weights in a network start too small, then the signal shrinks as it passes through each layer until it’s too tiny to be useful. - If the weights in a network start too large, then the signal grows as it passes through each layer until it’s too massive to be useful. | ||||||||

31 | |||||||||

32 | Our network weights should be initialized within a sort of Goldilocks zone so the signal stays within a reasonable range of values as it passes through multiple network layers. Since our weights (and the resulting signal) are currently too large, we can reduce their absolute values by experimenting with the standard deviation parameter used by our random number generator. It's true that 5 is a somewhat arbitrary number to use as our standard deviation, but what would be better? | ||||||||

33 | |||||||||

34 | Attempt #2: Use the standard deviation of our inputs to initialize our weights | ||||||||

35 | = STDEV(3, 1, 2) returns 1. So let's generate our weights using a standard deviation of 1 and see what happens. | ||||||||
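For anyone checking this outside a spreadsheet: STDEV is the sample standard deviation (it divides by n - 1). A small sketch using Python's standard `statistics` module shows the same result, alongside the population version (STDEVP in a spreadsheet) for contrast.

```python
import statistics

inputs = [3, 1, 2]
sample_sd = statistics.stdev(inputs)       # divides by n - 1, like STDEV
population_sd = statistics.pstdev(inputs)  # divides by n, like STDEVP
print(sample_sd, population_sd)
```

The sample version gives 1, matching the sheet; the population version is smaller, at roughly 0.82.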

36 | |||||||||

37 | Our Inputs | Our Desired Outputs | |||||||

38 | 3 | 1 | 2 | 1 | 0 | ||||

39 | |||||||||

40 | Weights #1 | -1.17 | -0.33 | 1.17 | Mean | 0 | |||

41 | -0.06 | -1.62 | 0.04 | Standard Deviation | 1 | ||||

42 | -1.20 | -0.30 | -0.05 | ||||||

43 | Activations #1 | -5.97 | -3.20 | 3.46 | |||||

44 | |||||||||

45 | Weights #2 | 1.31 | 0.48 | 0.00 | Mean | 0 | |||

46 | -1.50 | -0.19 | 0.32 | Standard Deviation | 1 | ||||

47 | 1.00 | -2.63 | -0.23 | ||||||

48 | Activations #2 | 0.44 | -11.35 | -1.83 | |||||

49 | |||||||||

50 | Weights #3 | 1.20 | -0.59 | Mean | 0 | ||||

51 | -0.22 | 1.19 | Standard Deviation | 1 | |||||

52 | -0.58 | -1.70 | |||||||

53 | Activations #3 | 4.06 | -10.63 | ||||||

54 | |||||||||

55 | This should actually work fairly well. The weights are random, so I don't know exactly what you're seeing, but I usually get single- or low-double-digit final activations (11.46 and -0.99, for example). Try refreshing the formulas a few times to get an idea of the range of values this network implementation produces. How can we improve on this? Let's see what the theory behind Xavier Initialization has to say. I've included notes where I find the process problematic. | ||||||||
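Before turning to the theory, the "refresh a few times" experiment can be automated. This is a rough sketch, not the sheet's actual formulas: it assumes the same 3-3-3-2 layer shape and inputs as above, draws fresh normally distributed weights on every trial, and reports the median of the largest final activation. All function names here are my own.

```python
import random
import statistics

def random_layer(n_in, n_out, sd, rng):
    """Weight matrix with entries drawn from a normal distribution, mean 0."""
    return [[rng.gauss(0, sd) for _ in range(n_out)] for _ in range(n_in)]

def forward(inputs, weights):
    n_out = len(weights[0])
    return [sum(x * row[j] for x, row in zip(inputs, weights)) for j in range(n_out)]

def final_activations(sd, rng, inputs=(3, 1, 2), shape=(3, 3, 3, 2)):
    """Push the inputs through freshly drawn layers and return the last layer's output."""
    activations = list(inputs)
    for n_in, n_out in zip(shape, shape[1:]):
        activations = forward(activations, random_layer(n_in, n_out, sd, rng))
    return activations

def typical_magnitude(sd, trials=2000, seed=0):
    """Median, over many redraws, of the largest absolute final activation."""
    rng = random.Random(seed)
    return statistics.median(
        max(abs(a) for a in final_activations(sd, rng)) for _ in range(trials)
    )

for sd in (5, 1):
    print(sd, round(typical_magnitude(sd), 2))
```

With three layers, scaling every weight by 5 scales the final activations by 5 × 5 × 5 = 125, which is why Attempt #1 ends up in the hundreds while Attempt #2 usually stays in the single or low double digits.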

56 | |||||||||

57 | Let's say we have an input, X: | ||||||||

58 | X₁ | X₂ | X₃ | ||||||

59 | |||||||||

60 | A neuron with random weights, W: | ||||||||

61 | W₁ | ||||||||

62 | W₂ | ||||||||

63 | W₃ | ||||||||

64 | |||||||||

65 | And an output, Y: | ||||||||

66 | Y = X₁W₁ + X₂W₂ + X₃W₃ | ||||||||

67 | |||||||||

68 | According to Wikipedia, the variance of X₁W₁ (or the product of any independent X and W) is: | ||||||||

69 | |||||||||

70 | Var(X₁W₁) = E[X₁]² Var(W₁) + E[W₁]² Var(X₁) + Var(X₁) Var(W₁) | ||||||||

71 | |||||||||

72 | I had to look up E[X₁]² and E[W₁]², but it turns out E[X] is just statistical notation for the "expected value of X" (its mean), so E[X₁]² is that mean squared. | ||||||||
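The product formula is easy to sanity-check numerically. The sketch below is my own illustration, not part of the original sheet: it draws independent X and W from normal distributions with made-up, non-zero means, then compares the variance predicted by the formula with the variance actually measured on the products.

```python
import random

def mean(values):
    return sum(values) / len(values)

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

rng = random.Random(0)
n = 100_000
# Hypothetical distributions chosen only for illustration.
xs = [rng.gauss(2.0, 1.5) for _ in range(n)]   # E[X] = 2,  Var(X) = 2.25
ws = [rng.gauss(-1.0, 0.5) for _ in range(n)]  # E[W] = -1, Var(W) = 0.25

predicted = (mean(xs) ** 2 * variance(ws)
             + mean(ws) ** 2 * variance(xs)
             + variance(xs) * variance(ws))
measured = variance([x * w for x, w in zip(xs, ws)])
print(round(predicted, 3), round(measured, 3))
```

Both numbers should come out near 4 × 0.25 + 1 × 2.25 + 2.25 × 0.25 = 3.8125, up to sampling noise.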

73 | Assumption #1: X and W have a mean of 0. | ||||||||

74 | I understand our weights having a mean of 0, because we can just decide that our weights are drawn from a distribution with a mean of 0, but our inputs? Maybe we translate all inputs to have a mean of 0. That would be pretty easy, I guess, but I'm not sure that this is actually what's happening. | ||||||||

75 | This is a neat trick because if the mean of some data X is 0, the expected value of any component in X is 0. This allows us to simplify the above equation to: | ||||||||

76 | |||||||||

77 | Var(X₁W₁) = Var(X₁) Var(W₁) | ||||||||

78 | |||||||||

79 | Assumption #2: X and W are independent and identically distributed. | ||||||||

80 | Again, I understand our weights being identically distributed, and I can understand both our inputs and our weights being independent. But how do we know for sure that our inputs are identically distributed? How many real life datasets have identically distributed inputs? | ||||||||

81 | According to the Bienaymé formula, the variance of the sum of uncorrelated random variables is the sum of their variances. We can use this to find Var(Y): | ||||||||

82 | |||||||||

83 | Var(Y) = Var(X₁W₁ + X₂W₂ + X₃W₃), which is the same as Var(Y) = Var(Sum(XW)). By the Bienaymé formula, Var(Sum(XW)) = Sum(Var(XW)). If X and W are identically distributed around a mean of 0, we then get Var(Y) = N Var(X) Var(W), where N is the number of components in X. | ||||||||
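The result Var(Y) = N Var(X) Var(W) can also be checked by simulation. This is a sketch under the same assumptions as the derivation (zero means, independent and identically distributed components); the standard deviations 2.0 and 0.5 are arbitrary values I picked for illustration.

```python
import random

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

rng = random.Random(0)
n_components = 3        # N in the text
sd_x, sd_w = 2.0, 0.5   # hypothetical standard deviations
trials = 100_000

ys = []
for _ in range(trials):
    xs = [rng.gauss(0, sd_x) for _ in range(n_components)]
    ws = [rng.gauss(0, sd_w) for _ in range(n_components)]
    ys.append(sum(x * w for x, w in zip(xs, ws)))  # Y = X1*W1 + X2*W2 + X3*W3

predicted = n_components * sd_x ** 2 * sd_w ** 2   # N * Var(X) * Var(W)
measured = variance(ys)
print(predicted, round(measured, 3))
```

Here the prediction is 3 × 4 × 0.25 = 3, and the measured variance of Y should sit close to it.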

84 | |||||||||

85 | Assumption #3: We want Var(Y) to be the same as Var(X). | ||||||||

86 | Is it not possible to have a function where small values of X map onto large values of Y or vice versa? | ||||||||

87 | We can do that by having N Var(W) = 1. In other words: | ||||||||

88 | |||||||||

89 | Var(W) = 1 / N | ||||||||

90 | |||||||||

91 | Attempt #3: Xavier Initialization | ||||||||

92 | Our input X has 3 components, so the variance we want for our first set of weights is 1/3. Note that the NORMINV function asks for a standard deviation instead of a variance, so we have to take the square root of our variance: √(1/3) ≈ 0.58. Xavier Initialization does require that we calculate a new variance for every layer of the network, but that's not obvious here because every layer has 3 components in its input. | ||||||||
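In code, the recipe might look like the sketch below. It follows the Var(W) = 1 / N rule derived above, with N taken as the number of inputs to each layer, and it uses `random.gauss` where the sheet uses NORMINV. The function names and the 3-3-3-2 layer sizes are my own choices to mirror the sheet.

```python
import math
import random

def xavier_sd(n_in):
    """Standard deviation for a layer with n_in inputs: sqrt(Var(W)) = sqrt(1 / N)."""
    return math.sqrt(1 / n_in)

def xavier_layer(n_in, n_out, rng):
    """Weight matrix with one row per input, entries drawn from N(0, 1 / n_in)."""
    sd = xavier_sd(n_in)
    return [[rng.gauss(0, sd) for _ in range(n_out)] for _ in range(n_in)]

rng = random.Random(0)
layer_sizes = [3, 3, 3, 2]
# The standard deviation is recomputed per layer from that layer's input count.
layers = [xavier_layer(n_in, n_out, rng)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
print(round(xavier_sd(3), 2))  # → 0.58
```

If a layer had, say, 100 inputs, `xavier_sd(100)` would give 0.1 instead, which is where the per-layer recalculation becomes visible.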

93 | |||||||||

94 | Our Inputs | Our Desired Outputs | |||||||

95 | 3 | 1 | 2 | 1 | 0 | ||||

96 | |||||||||

97 | Weights #1 | -1.48 | 0.13 | 0.94 | Mean | 0 | |||

98 | -0.46 | -0.46 | -0.34 | Standard Deviation | 0.58 | ||||

99 | -0.02 | -0.23 | 0.55 | ||||||

100 | Activations #1 | -4.93 | -0.53 | 3.58 |
