Neural Network + Xavier Initialization

Xavier Initialization
This explanation of Xavier Initialization is possible thanks to andyljones.tumblr.com.
Xavier Initialization is a method of generating weights for a network so that its final activations start out relatively close to our desired outputs before any other optimization takes place.
Attempt #1: No strategy
Let's start with a network of normally distributed weights with a mean of 0 and a standard deviation of 5.
Our Inputs: 3, 1, 2
Our Desired Outputs: 1, 0

Weights #1 (Mean 0, Standard Deviation 5)
  -4.40    1.93   -3.58
  -8.07  -10.21    0.31
   5.97    0.69   12.17

Activations #1
  -9.32   -3.05   13.89

Weights #2 (Mean 0, Standard Deviation 5)
  -2.31   -1.94    3.18
   6.00   -6.64   -4.98
   2.67   -0.81   -1.31

Activations #2
  40.30   27.12  -32.56

Weights #3 (Mean 0, Standard Deviation 5)
  -3.77    3.27
   5.88    6.00
  -5.21   -1.60

Activations #3
 177.23  346.91
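The Attempt #1 numbers above can be reproduced in a few lines of Python. This is a minimal sketch, not the spreadsheet's own formulas: it assumes each layer is a plain matrix product with no bias and no activation function, which is what the listed values are consistent with, and the helper name `forward` is hypothetical. The weights are copied as displayed (rounded to two decimals), so the results drift slightly from the sheet's.

```python
def forward(x, weights):
    """Multiply the row vector x by a weight matrix given as a list of rows."""
    n_out = len(weights[0])
    return [sum(x[i] * weights[i][j] for i in range(len(x))) for j in range(n_out)]


inputs = [3, 1, 2]

# Weights copied from the Attempt #1 table (rounded to two decimals).
weights_1 = [[-4.40, 1.93, -3.58],
             [-8.07, -10.21, 0.31],
             [5.97, 0.69, 12.17]]
weights_2 = [[-2.31, -1.94, 3.18],
             [6.00, -6.64, -4.98],
             [2.67, -0.81, -1.31]]
weights_3 = [[-3.77, 3.27],
             [5.88, 6.00],
             [-5.21, -1.60]]

# Assumption: no bias and no nonlinearity between layers.
activations_1 = forward(inputs, weights_1)
activations_2 = forward(activations_1, weights_2)
activations_3 = forward(activations_2, weights_3)

for name, values in [("Activations #1", activations_1),
                     ("Activations #2", activations_2),
                     ("Activations #3", activations_3)]:
    print(name, [round(v, 2) for v in values])
```

The magnitudes grow by roughly an order of magnitude from the inputs to the final layer, which is the pattern discussed next.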
Notice that the activations in our first layer are actually a lot closer in magnitude to our inputs (and desired outputs) than those in the later layers. The activations seem to get worse as the signal passes through each successive layer in the network. Andy Jones touches on this in the post linked above, writing:
- If the weights in a network start too small, then the signal shrinks as it passes through each layer until it’s too tiny to be useful.
- If the weights in a network start too large, then the signal grows as it passes through each layer until it’s too massive to be useful.
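Both failure modes are easy to see numerically. The sketch below is an illustration under assumptions of my choosing: ten 3-wide linear layers with no bias or activation function, the input 3, 1, 2, and two arbitrary standard deviations (0.05 and 5). The function names `rms` and `rms_after_layers` are hypothetical.

```python
import math
import random


def rms(values):
    """Root mean square, used here as a rough measure of signal size."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def rms_after_layers(std, depth=10, seed=0):
    """Pass a fixed input through `depth` random linear layers and return its RMS."""
    rng = random.Random(seed)
    signal = [3.0, 1.0, 2.0]
    width = len(signal)
    for _ in range(depth):
        # Fresh weights for each layer, drawn from a normal distribution with mean 0.
        weights = [[rng.gauss(0, std) for _ in range(width)] for _ in range(width)]
        signal = [sum(signal[i] * weights[i][j] for i in range(width))
                  for j in range(width)]
    return rms(signal)


print(f"input RMS: {rms([3.0, 1.0, 2.0]):.2f}")
for std in (0.05, 5.0):
    print(f"std={std}: RMS after 10 layers = {rms_after_layers(std):.3e}")
```

With the small standard deviation the signal collapses toward zero, and with the large one it explodes, even though the input is the same in both runs.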
Our network weights should be initialized within a sort of Goldilocks zone so the signal stays within a reasonable range of values as it passes through multiple network layers. Since our weights (and the resulting signal) are currently too large, we can reduce their absolute values by experimenting with the standard deviation parameter used by our random number generator.

It's true that 5 is a somewhat arbitrary number to use as our standard deviation, but what would be better?
Attempt #2: Use the standard deviation of our inputs to initialize our weights
= STDEV(3, 1, 2) returns 1. So let's generate our weights using a standard deviation of 1 and see what happens.
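For anyone following along outside the spreadsheet, the same figure can be checked in Python. STDEV is the sample standard deviation (it divides by n - 1), so the matching call is `statistics.stdev`; the population version is included only for contrast. The variable names are illustrative.

```python
import statistics

inputs = [3, 1, 2]

# Sample standard deviation (divides by n - 1), matching the sheet's STDEV.
sample_std = statistics.stdev(inputs)

# Population standard deviation (divides by n), shown for contrast.
population_std = statistics.pstdev(inputs)

print("sample:", sample_std)
print("population:", round(population_std, 3))
```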
Our Inputs: 3, 1, 2
Our Desired Outputs: 1, 0

Weights #1 (Mean 0, Standard Deviation 1)
  -1.17   -0.33    1.17
  -0.06   -1.62    0.04
  -1.20   -0.30   -0.05

Activations #1
  -5.97   -3.20    3.46

Weights #2 (Mean 0, Standard Deviation 1)
   1.31    0.48    0.00
  -1.50   -0.19    0.32
   1.00   -2.63   -0.23

Activations #2
   0.44  -11.35   -1.83

Weights #3 (Mean 0, Standard Deviation 1)
   1.20   -0.59
  -0.22    1.19
  -0.58   -1.70

Activations #3
   4.06  -10.63
This should actually work fairly well.

Again, I don't know exactly what you're seeing, since the weights are regenerated randomly, but I usually get single- or low-double-digit final activations (11.46 and -0.99, for example). Try refreshing the formulas a few times to get an idea of the range of values this network implementation produces.
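Refreshing the formulas by hand can be simulated. The sketch below assumes the same setup as the sheet appears to use (inputs 3, 1, 2, layers of width 3, 3 and 2, no bias or activation function) and draws fresh weights with a standard deviation of 1 for each of 5000 "refreshes". The function name `final_activations` and the fixed seed are my own choices.

```python
import math
import random


def final_activations(std, rng, inputs=(3.0, 1.0, 2.0), layer_sizes=(3, 3, 2)):
    """One 'refresh': draw fresh weights and return the last layer's activations."""
    signal = list(inputs)
    for n_out in layer_sizes:
        n_in = len(signal)
        weights = [[rng.gauss(0, std) for _ in range(n_out)] for _ in range(n_in)]
        signal = [sum(signal[i] * weights[i][j] for i in range(n_in))
                  for j in range(n_out)]
    return signal


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = []
for _ in range(5000):
    samples.extend(final_activations(1.0, rng))

rms_final = math.sqrt(sum(v * v for v in samples) / len(samples))
median_abs = sorted(abs(v) for v in samples)[len(samples) // 2]

print(f"RMS of final activations: {rms_final:.2f}")
print(f"Median absolute value:    {median_abs:.2f}")
```

Under these assumptions the expected squared size of a final activation is 14 × 3 × 3 = 126 (the sum of squared inputs, multiplied by 3 at each of the two 3-wide layers), so an RMS of about 11 is what the theory predicts. That agrees with the single- or low-double-digit values seen in the sheet.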

How can we improve on this? Let's see what the theory behind Xavier Initialization has to say. I've included notes where I find the process problematic.
Let's say we have an input, X:
X₁  X₂  X₃

A neuron with random weights, W:
W₁
W₂
W₃

And an output, Y:
Y = X₁W₁ + X₂W₂ + X₃W₃
According to Wikipedia, the variance of X₁W₁ (or of the product of any independent X and W) is:
Var(X₁W₁) = E[X₁]² Var(W₁) + E[W₁]² Var(X₁) + Var(X₁) Var(W₁)
I had to look up E[X₁]² and E[W₁]², but it turns out E[X] is just statistical notation for "expected value of X".
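The product formula can be sanity-checked by simulation. In this sketch the means and standard deviations are arbitrary illustrative values, X and W are drawn independently from normal distributions, and the helper names `product_variance` and `variance` are hypothetical.

```python
import random


def product_variance(mean_x, var_x, mean_w, var_w):
    """Var(XW) for independent X and W, using the formula quoted above."""
    return mean_x ** 2 * var_w + mean_w ** 2 * var_x + var_x * var_w


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


rng = random.Random(0)
mean_x, std_x = 2.0, 1.5   # arbitrary, deliberately non-zero mean
mean_w, std_w = -1.0, 0.5  # arbitrary, deliberately non-zero mean

products = [rng.gauss(mean_x, std_x) * rng.gauss(mean_w, std_w)
            for _ in range(100_000)]

predicted = product_variance(mean_x, std_x ** 2, mean_w, std_w ** 2)
empirical = variance(products)
print(f"predicted {predicted:.4f}, empirical {empirical:.4f}")
```

The two numbers should agree to within sampling noise. Setting both means to 0 in `product_variance` leaves only the Var(X) Var(W) term.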
Assumption #1: X and W have a mean of 0.
I understand our weights having a mean of 0, because we can just decide that our weights are drawn from a distribution with a mean of 0, but our inputs? Maybe we translate all inputs to have a mean of 0. That would be pretty easy, I guess, but I'm not sure that this is actually what's happening.
This is a neat trick: if X₁ and W₁ both have a mean of 0, then E[X₁] = 0 and E[W₁] = 0, so the first two terms of the equation above vanish. This allows us to simplify it to:
Var(X₁W₁) = Var(X₁) Var(W₁)
Assumption #2: X and W are independent and identically distributed.
Again, I understand our weights being identically distributed, and I can understand both our inputs and our weights being independent. But how do we know for sure that our inputs are identically distributed? How many real life datasets have identically distributed inputs?
According to the Bienaymé formula, the variance of the sum of uncorrelated random variables is the sum of their variances. We can use this to find Var(Y):
Var(Y) = Var(X₁W₁ + X₂W₂ + X₃W₃)

Which is the same as:
Var(Y) = Var(Sum(XW))

The Bienaymé formula:
Var(Y) = Var(Sum(XW))
Var(Y) = Sum(Var(XW))

If X and W are identically distributed around a mean of 0, we then get:
Var(Y) = N Var(X) Var(W)

Where N is the number of components in X.
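The result Var(Y) = N Var(X) Var(W) can also be checked empirically. This sketch takes the three assumptions at face value: X and W are independent, normally distributed with a mean of 0, and identically distributed within each group. The standard deviations (2 and 0.5) are arbitrary, and the helper `variance` is hypothetical.

```python
import random


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


rng = random.Random(0)
n = 3                    # number of components in X, as in the sheet
std_x, std_w = 2.0, 0.5  # arbitrary illustrative values

outputs = []
for _ in range(100_000):
    x = [rng.gauss(0, std_x) for _ in range(n)]
    w = [rng.gauss(0, std_w) for _ in range(n)]
    outputs.append(sum(xi * wi for xi, wi in zip(x, w)))  # Y = X1W1 + X2W2 + X3W3

predicted = n * std_x ** 2 * std_w ** 2
empirical = variance(outputs)
print(f"predicted {predicted:.3f}, empirical {empirical:.3f}")
```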
Assumption #3: We want Var(Y) to be the same as Var(X).
Is it not possible to have a function where small values of X map onto large values of Y or vice versa?
We can do that by having N Var(W) = 1. In other words:
Var(W) = 1 / N
Attempt #3: XAVIER INITIALIZATION
Our input X has 3 components, so the variance we want for our first set of weights is 1/3. Note that the NORMINV function asks for standard deviation instead of variance, so we have to get the square root of our variance.

Xavier Initialization does require that we calculate a new variance for every layer of the network, but it's not obvious here because every layer has 3 components in its input.
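Outside the spreadsheet, the per-layer rule might look like the sketch below. It follows the Var(W) = 1 / N rule derived above, with N taken as the number of inputs to each layer. `random.gauss` stands in for the sheet's NORMINV-based generator, the layer sizes 3, 3, 3, 2 mirror this sheet, and the layers are again assumed to have no bias or activation function. The function name `xavier_weights` is hypothetical.

```python
import math
import random


def xavier_weights(n_in, n_out, rng):
    """Draw an n_in x n_out weight matrix with mean 0 and variance 1 / n_in."""
    std = math.sqrt(1.0 / n_in)  # the generator takes a standard deviation, not a variance
    return [[rng.gauss(0, std) for _ in range(n_out)] for _ in range(n_in)]


rng = random.Random(0)
layer_sizes = [3, 3, 3, 2]  # 3 inputs, two 3-wide layers, 2 outputs

# One weight matrix per pair of consecutive sizes; the variance is recomputed per layer.
layers = [xavier_weights(n_in, n_out, rng)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

signal = [3.0, 1.0, 2.0]
for weights in layers:
    n_in, n_out = len(weights), len(weights[0])
    signal = [sum(signal[i] * weights[i][j] for i in range(n_in))
              for j in range(n_out)]
    std = math.sqrt(1.0 / n_in)
    print(f"fan-in {n_in}: std {std:.2f}, activations {[round(v, 2) for v in signal]}")
```

Every layer here has 3 inputs, so the standard deviation is the same each time. With different layer widths, the `std` computed inside `xavier_weights` would change from layer to layer.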
Our Inputs: 3, 1, 2
Our Desired Outputs: 1, 0

Weights #1 (Mean 0, Standard Deviation 0.58)
  -1.48    0.13    0.94
  -0.46   -0.46   -0.34
  -0.02   -0.23    0.55

Activations #1
  -4.93   -0.53    3.58