Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Motivation
An old-school related concept: feature scaling
Common normalizations
Two methods are usually used for rescaling or normalizing data:
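The slide does not name the two methods, but for feature scaling they are usually min-max scaling and z-score standardization; the sketch below assumes those two. The feature values are made up for illustration.

```python
import numpy as np

# A hypothetical 1-D feature column.
x = np.array([2.0, 4.0, 6.0, 8.0])

# Method 1 (assumed): min-max scaling, which maps values into [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Method 2 (assumed): z-score standardization, which gives zero mean
# and unit variance.
x_zscore = (x - x.mean()) / x.std()
```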
Internal covariate shift: the cup game example
“First layer parameters change and so the distribution of the input to your second layer changes”
Proposed solution: Batch Normalization (BN)
Batch normalization: why is it good?
The BN transform is scale-invariant: scaling the weights does not change its output.
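This property can be checked numerically. The sketch below uses a bare-bones normalization (gamma = 1, beta = 0) over a randomly generated mini-batch, which is an assumption for illustration, and verifies that scaling the pre-activations by a constant leaves the output (almost) unchanged.

```python
import numpy as np

def bn(x, eps=1e-5):
    # Normalize each feature over the mini-batch dimension
    # (gamma = 1, beta = 0 for simplicity).
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))  # a hypothetical mini-batch of pre-activations
a = 7.0                       # an arbitrary scale applied to the weights

# BN(a * x) is (up to eps) identical to BN(x): the transform absorbs the scale.
same = np.allclose(bn(a * x), bn(x), atol=1e-4)
```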
Batch normalization: other benefits in practice
Batch normalization: better accuracy, faster training
BN applied to MNIST (a), and activations of a randomly selected neuron over time (b, c), where the middle line is the median activation, the top line is the 85th percentile, and the bottom line is the 15th percentile.
Why does the naïve approach not work?
The proposed solution: add an extra normalization layer to the network.
A new layer is added so the gradient can “see” the normalization and make adjustments if needed.
Algorithm summary: normalization via mini-batch statistics
The BN transform: formally, from the paper.
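The transform from the paper can be sketched in a few lines of NumPy: compute the mini-batch mean and variance, normalize, then apply the learned scale gamma and shift beta. The batch shape and the parameter initialization below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Sketch of the BN transform over a mini-batch.

    x:           activations, shape (batch, features)
    gamma, beta: learned scale and shift, shape (features,)
    """
    mu = x.mean(axis=0)                     # mini-batch mean
    var = x.var(axis=0)                     # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize
    return gamma * x_hat + beta             # scale and shift

# Hypothetical mini-batch with nonzero mean and non-unit variance.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))
gamma = np.ones(4)   # identity scale for illustration
beta = np.zeros(4)   # zero shift for illustration
y = batch_norm(x, gamma, beta)
# With gamma = 1 and beta = 0, the output has (approximately)
# zero mean and unit variance per feature.
```

In training, gamma and beta are updated by backpropagation along with the other parameters, which is what lets the gradient "see" the normalization.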
Useful links