BatchNorm in Neural Networks
Presented by
Nitisha Sharad
A0504120003
B.Tech Biotechnology (Sec A)
Semester 7
OUR DISCUSSION
TOPICS
01. INTRODUCTION
02. INTERNAL COVARIATE SHIFT
03. NORMALIZING INPUT DATA
04. THE NEED FOR BATCHNORM
05. PARAMETERS AND WORKING
06. BATCH NORM DURING INFERENCE
07. ORDER OF PLACEMENT OF BATCHNORM LAYER
08. EXAMPLE
09. ADVANTAGES
10. REAL WORLD APPLICATIONS
11. PRACTICAL TIPS
12. LIMITATIONS
13. ALTERNATIVES TO BATCHNORM
14. REFERENCES
Batch Normalization
INTRODUCTION
Internal Covariate Shift
Internal covariate shift is the change in the distribution of each layer's inputs during training, caused by the constantly changing parameters of the preceding layers. Batch Normalization was introduced to reduce this effect.
It is standard practice to normalize data to zero mean and unit variance before feeding it to a deep learning model. This means that the mean of each feature is set to zero and its variance to one.
The rationale for normalizing the data is that it can help to improve the performance of the model. When the data is normalized, the features are all on the same scale, which can help the model to learn more effectively.
Normalization is typically done for each feature column separately. The mean and variance of each feature column are computed over the entire dataset.
Normalizing Input Data
Fig: How we normalize.
The image below shows the effect of normalizing data. The original values (in blue) are now centered around zero (in red). This ensures that all the feature values are now on the same scale.
The original values are scattered around a wide range of values. This can make it difficult for the model to learn effectively. However, after normalization, the values are all centered around zero and have a variance of one. This makes it easier for the model to learn and improve its performance.
Fig: What normalized data looks like.
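As a concrete illustration (not taken from the slides), this per-feature normalization can be sketched in a few lines of NumPy; the array X and the function name are invented for the example.

```python
import numpy as np

def normalize_features(X):
    """Normalize each feature column of X to zero mean and unit variance."""
    mean = X.mean(axis=0)      # per-column mean over the whole dataset
    std = X.std(axis=0)        # per-column standard deviation
    return (X - mean) / std    # each column now has mean 0, variance 1

# Hypothetical data: two features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
print(normalize_features(X))
```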
Consider any of the hidden layers in a network. The activations from the previous layer are simply the inputs to this layer. For example, from the perspective of Layer 2 in the image below, if we "blank out" all the previous layers, the activations coming from Layer 1 are no different from the original inputs.
The same logic that requires us to normalize the input for the first layer also applies to each of these hidden layers.
Therefore, batch normalization can be used to normalize the activations of any layer in a neural network, not just the input layer.
In other words, if we can somehow normalize the activations coming from each previous layer, then gradient descent will converge better during training. This is precisely what the Batch Norm layer does for us.
The need for BatchNorm
Fig: The inputs of each hidden layer are the activations from the previous layer, and must also be normalized.
Just like the parameters (e.g., weights and biases) of any network layer, a Batch Norm layer has parameters of its own: two learnable parameters, beta and gamma, and two non-learnable Moving Average parameters for the mean and variance.
These parameters are specific to each Batch Norm layer.
Parameters and Working
Fig: Parameters of a Batch Norm layer.
If there are, say, three hidden layers and three Batch Norm layers in the network, we would have three learnable beta and gamma parameters for the three layers. Similarly for the Moving Average parameters.
Fig: Each Batch Norm layer has its own copy of the parameters.
During training, the network is fed one mini-batch of data at a time. During the forward pass, each layer of the network processes that mini-batch. The Batch Norm layer processes its mini-batch as follows (a sketch of these steps appears after the list):
Fig: Calculations performed by Batch Norm layer.
01. Activations
The activations from the previous layer are passed as input to the Batch Norm. There is one activation vector for each feature in the data.
02. Calculate Mean and Variance
For each activation vector separately, the mean and variance of all the values in the mini-batch are calculated.
03. Normalize
Each activation vector is normalized using the corresponding mean and variance. The normalized values now have zero mean and unit variance.
04. Scale and Shift
Batch normalization then shifts and scales the normalized values using learnable parameters: the normalized values are multiplied by a factor, gamma, and a factor, beta, is added.
05. Moving Average
Batch Norm also keeps an exponential moving average of the mini-batch mean and variance during training. These saved statistics are later used during inference, when mini-batch statistics are not available.
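A minimal NumPy sketch of these five steps for a single mini-batch is shown below; the function name, the momentum value, and epsilon are illustrative assumptions rather than details given on the slides.

```python
import numpy as np

def batchnorm_train_step(x, gamma, beta, running_mean, running_var,
                         momentum=0.1, eps=1e-5):
    """One Batch Norm forward pass over a mini-batch x of shape (batch, features)."""
    # 02. Calculate mean and variance per feature over the mini-batch
    batch_mean = x.mean(axis=0)
    batch_var = x.var(axis=0)

    # 03. Normalize to zero mean and unit variance (eps avoids division by zero)
    x_hat = (x - batch_mean) / np.sqrt(batch_var + eps)

    # 04. Scale and shift with the learnable parameters gamma and beta
    out = gamma * x_hat + beta

    # 05. Update the moving averages that will be used at inference time
    running_mean = (1 - momentum) * running_mean + momentum * batch_mean
    running_var = (1 - momentum) * running_var + momentum * batch_var
    return out, running_mean, running_var

# Usage: a mini-batch of 4 samples with 3 features each (01. Activations)
x = np.random.randn(4, 3)
gamma, beta = np.ones(3), np.zeros(3)
out, rm, rv = batchnorm_train_step(x, gamma, beta, np.zeros(3), np.ones(3))
print(out.mean(axis=0), out.var(axis=0))   # close to 0 and 1
```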
Batch Norm during Inference
During training, BatchNorm starts by calculating the mean and variance for a mini-batch. However, during Inference, we have a single sample, not a mini-batch.
This is where the two Moving Average parameters come in: the values calculated during training and saved with the model. Those saved mean and variance values are used by Batch Norm during inference.
Calculating and saving the mean and variance of the full data is expensive, so the Moving Average is used as a proxy. It is more efficient because only the most recent Moving Average needs to be remembered.
Fig: Batch Norm calculation during Inference.
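A minimal sketch of the inference-time calculation, assuming the gamma, beta, and Moving Average values saved during training are available (all names here are illustrative):

```python
import numpy as np

def batchnorm_inference(x, gamma, beta, moving_mean, moving_var, eps=1e-5):
    """Normalize a single sample (or batch) using the saved moving averages."""
    x_hat = (x - moving_mean) / np.sqrt(moving_var + eps)
    return gamma * x_hat + beta

# Usage: one sample with 3 features, normalized with the saved statistics
sample = np.array([0.5, -1.2, 3.0])
print(batchnorm_inference(sample, np.ones(3), np.zeros(3),
                          moving_mean=np.zeros(3), moving_var=np.ones(3)))
```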
Order of Placement of BatchNorm layer
There are two schools of thought on where the Batch Norm layer should be placed in the architecture: before or after the activation function. Both placements are sketched below.
Fig: Batch Norm can be used before or after activation.
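As an illustration (the layer sizes are arbitrary and not from the slides), the two placements look like this in PyTorch:

```python
import torch.nn as nn

# Batch Norm placed before the activation
model_bn_before = nn.Sequential(
    nn.Linear(64, 32),
    nn.BatchNorm1d(32),
    nn.ReLU(),
)

# Batch Norm placed after the activation
model_bn_after = nn.Sequential(
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.BatchNorm1d(32),
)
```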
Examples
The batch_normalize() function first calculates the mean and variance of each feature in the batch of data. It then normalizes the data by subtracting the mean and dividing by the square root of the variance plus a small epsilon value, and returns the normalized data.
The epsilon parameter is a small value that is added to the variance to avoid dividing by zero. This is important because the variance of a batch of data can sometimes be zero, especially if the batch is small. Adding a small epsilon value ensures that the denominator of the normalization equation is never zero.
Running the code will print the original data and the normalized data. The normalized data will be centered around zero and have a standard deviation of 1. This will help to stabilize the training of a deep neural network that uses the data array as input.
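The code itself is not reproduced here, but based on the description above it presumably looks roughly like the following; the exact signature, the epsilon default, and the sample data array are assumptions.

```python
import numpy as np

def batch_normalize(data, epsilon=1e-5):
    """Normalize a batch of data to zero mean and unit standard deviation per feature."""
    mean = data.mean(axis=0)
    variance = data.var(axis=0)
    # epsilon keeps the denominator non-zero even if the batch variance is zero
    return (data - mean) / np.sqrt(variance + epsilon)

# Hypothetical input data
data = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print("Original data:\n", data)
print("Normalized data:\n", batch_normalize(data))
```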
The nn.BatchNorm1d class in PyTorch implements batch normalization for 1D data. The affine=False argument specifies that the batch normalization layer should not have learnable parameters, so the layer applies no learnable scale (gamma) or shift (beta) and simply normalizes each feature using the batch statistics.
The torch.randn() function in PyTorch creates a random tensor with a normal distribution. The 22, 102 argument specifies that the tensor should have 22 samples and 102 features.
The n(inputval) line normalizes the input tensor using the batch normalization layer. The normalized output is then printed out.
Because the input tensor is generated randomly by torch.randn(), its mean and variance differ on every run, and so does the output of the batch normalization layer.
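The snippet being described is not shown on the slide; based on the description, it presumably resembles the following (the names n and inputval come from the text, and the sizes 22 and 102 from the sample and feature counts above):

```python
import torch
import torch.nn as nn

# BatchNorm1d over 102 features, with no learnable scale/shift (affine=False)
n = nn.BatchNorm1d(102, affine=False)

# Random mini-batch: 22 samples, 102 features
inputval = torch.randn(22, 102)

# Normalize the batch and print the result
output = n(inputval)
print(output)
```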
A second example uses nn.BatchNorm2d. The first line imports the torch and nn modules from PyTorch.
The second line creates a BatchNorm2d layer with 120 features. The affine=False argument tells the layer to not learn any affine parameters.
The third line creates a 4D tensor with 20 batches, 120 channels, 55 rows, and 65 columns.
The fourth line passes the inputs tensor to the BatchNorm2d layer.
The fifth line prints the output of the BatchNorm2d layer.
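Again the code is described line by line but not shown; a reconstruction that matches the description would look like this (the layer variable name m is an assumption):

```python
import torch
from torch import nn

# BatchNorm2d over 120 channels, without learnable affine parameters
m = nn.BatchNorm2d(120, affine=False)

# Random 4D input: 20 batches, 120 channels, 55 rows, 65 columns
inputs = torch.randn(20, 120, 55, 65)

# Normalize the inputs and print the result
output = m(inputs)
print(output)
```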
Advantages
01. Batch normalization can speed up the training of deep neural networks by reducing internal covariate shift: normalizing the activations keeps their distribution stable from one training step to the next.
02. Batch normalization can improve the generalization performance of deep neural networks, because it makes the network less sensitive to the initialization of the weights.
03. Batch normalization allows for higher learning rates, which can further speed up training, because it helps to stabilize the gradients and makes it easier for the network to learn.
04. Batch normalization can reduce the need for regularization techniques such as dropout and L2 regularization, because it stabilizes the training process and makes the network less likely to overfit the training data.
Real World Applications
Medical imaging
In 2017, a study showed that a deep-learning model with batch normalization was able to achieve a sensitivity of 97.6% and a specificity of 96.6% in detecting breast cancer from mammograms.
Image Classification
In 2015, researchers at Google used batch normalization to improve the performance of their image classification model, called Inception V3.
Speech recognition
Batch normalization has been used to improve the performance of speech recognition models, such as those used in Amazon Alexa and Google Assistant.
Robotics
In 2019, a self-driving car with batch normalization was able to achieve a higher accuracy in detecting and avoiding obstacles than a car without batch normalization.
Practical Tips
01. Use it after activation functions
Batch normalization is applied after activation functions (e.g., ReLU, sigmoid) to stabilize training by normalizing the activation distributions, especially in the presence of non-linearity.
02. Experiment with different momentum and epsilon values.
The momentum parameter controls how quickly the moving averages of the mean and variance are updated, and epsilon keeps the denominator of the normalization non-zero. Tuning both can improve network performance (see the snippet after these tips).
03. Don't use batch normalization for all layers
Batch normalization isn't always needed for all layers. Optimal use may involve applying it selectively and experimenting with different configurations for improved network performance.
04. Be careful with batch normalization in small batches.
Batch normalization works best with reasonably large batch sizes. With small batches, the estimates of the batch mean and variance (and the moving averages built from them) become noisy, which can cause training issues.
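As a small illustration of tip 02, PyTorch exposes both values as constructor arguments (the defaults are eps=1e-5 and momentum=0.1; the alternative values below are arbitrary):

```python
import torch.nn as nn

# eps is added to the batch variance; momentum controls the moving-average update
bn = nn.BatchNorm1d(64, eps=1e-3, momentum=0.05)
print(bn)
```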
Limitations
01. Requires a large batch size.
The batch mean and variance are computed over each mini-batch. If the batch size is small, these estimates (and the moving averages built from them) may be inaccurate, which can lead to problems during training.
02. Can make the network less robust to changes in the input data.
Because batch normalization normalizes the activations, the network can become less sensitive to the scale and distribution of its inputs, which can be a problem if the input data is not normally distributed.
03. Can make the network more difficult to interpret.
Batch normalization inserts an extra transformation between layers, one whose behaviour depends on the batch statistics, which can make it harder to understand how the network produces its outputs.
Alternatives to BatchNorm
01. GROUP NORMALIZATION (GN)
02. SWITCHABLE NORMALIZATION (SN)
03. ATTENTIVE NORMALIZATION (AN)
04. STOCHASTIC NORMALIZATIONS AS BAYESIAN LEARNING
05. WEIGHT NORMALIZATION
References
https://pythonguides.com/pytorch-batch-normalization/
References for Real World Applications
01 IMAGE CLASSIFICATION
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1-9).
02 SPEECH RECOGNITION
Ravanelli, M., Brakel, P., Omologo, M., & Bengio, Y. (2016, December). Batch-normalized joint training for DNN-based distant speech recognition. In 2016 IEEE Spoken Language Technology Workshop (SLT) (pp. 28-34). IEEE.
03 MEDICAL IMAGING
"Deep learning for breast cancer histopathology: Improved classification with weakly labeled data" published in 2017 by Pranav Rajpurkar, Jeffrey Irvin, Kai Zhu, Siddharth Mehta, Kevin Chen, Lei Yang, et al. The study was conducted at the University of Pennsylvania and the University of California, Berkeley.
04 ROBOTICS
"End-to-end training of deep neural networks for self-driving cars" published in 2019 by Marek Bojarski, Guillaume Delbrouck, Lu-Chih Chen, Olivier Delalleau, Jean Chaumont, and Urs Muller. The study was conducted at the University of California, Berkeley and Uber AI.
THANK YOU
FOR YOUR TIME