1 of 24

BatchNorm in Neural Networks

Presented by

Nitisha Sharad

A0504120003

B.Tech Biotechnology (Sec A)

Semester 7

2 of 24

OUR DISCUSSION

TOPICS

01. INTRODUCTION

02. INTERNAL COVARIATE SHIFT

03. NORMALIZING INPUT DATA

04. THE NEED FOR BATCHNORM

05. PARAMETERS AND WORKING

06. BATCH NORM DURING INFERENCE

07. ORDER OF PLACEMENT OF BATCHNORM LAYER

08. EXAMPLE

09. ADVANTAGES

10. REAL WORLD APPLICATIONS

11. PRACTICAL TIPS

12. LIMITATIONS

13. ALTERNATIVES TO BATCHNORM

14. REFERENCES

3 of 24

Batch Normalization

INTRODUCTION

    • Batch normalization (aka batch norm) is used to make the training of artificial neural networks faster and more stable.

    • Proposed by Sergey Ioffe and Christian Szegedy in 2015.

    • It mitigates the problem of internal covariate shift.

4 of 24

    • A phenomenon where the distribution of the activations of a neural network layer changes during training.
    • As the parameters of the earlier layers are updated during the learning process, the subsequent layers receive inputs with varying distributions, which can slow down training and make it harder to optimize the network effectively.
    • Batch Normalization, by normalizing the activations within each mini-batch during training, mitigates the effects of internal covariate shift, leading to faster and more stable convergence.

Internal Covariate Shift

5 of 24

It is standard practice to normalize data to zero mean and unit variance before feeding it to a deep learning model. This means that the mean of each feature is set to zero and its variance is set to one.

The rationale for normalizing the data is that it can help to improve the performance of the model. When the data is normalized, the features are all on the same scale, which can help the model to learn more effectively.

Normalization is typically done for each feature column separately. The mean and variance of each feature column are computed over the entire dataset.

Normalizing Input Data

Fig: How we normalize.

6 of 24

The image below shows the effect of normalizing data. The original values (in blue) are now centered around zero (in red). This ensures that all the feature values are now on the same scale.

The original values are spread over a wide range, which can make it difficult for the model to learn effectively. After normalization, however, the values are all centered around zero and have a variance of one, which makes it easier for the model to learn and improves its performance.

Fig: What normalized data looks like.
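
As a concrete illustration, here is a minimal NumPy sketch of this per-column standardization; the data values and variable names are made up for the example.

    import numpy as np

    # Toy dataset: 4 samples, 2 feature columns (hypothetical values).
    data = np.array([[150.0, 0.2],
                     [180.0, 0.5],
                     [165.0, 0.1],
                     [200.0, 0.9]])

    # Mean and variance of each feature column, computed over the whole dataset.
    mean = data.mean(axis=0)
    var = data.var(axis=0)

    # Standardize: every column now has zero mean and unit variance.
    normalized = (data - mean) / np.sqrt(var)

    print(normalized.mean(axis=0))  # approximately [0, 0]
    print(normalized.var(axis=0))   # approximately [1, 1]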

7 of 24

Consider any of the hidden layers in a network. The activations from the previous layer are simply the inputs to this layer. For example, from the perspective of Layer 2 in the image below, if we "blank out" all the previous layers, the activations coming from Layer 1 are no different from the original inputs.

The same logic that requires us to normalize the input for the first layer also applies to each of these hidden layers.

Therefore, batch normalization can be used to normalize the activations of any layer in a neural network, not just the input layer.

In other words, if we can normalize the activations coming from each previous layer, gradient descent will converge better during training. This is precisely what the Batch Norm layer does for us.

The need for BatchNorm

Fig: The inputs of each hidden layer are the activations from the previous layer, and must also be normalized.

8 of 24

Just like the parameters (e.g., weights, biases) of any network layer, a BatchNorm layer also has parameters of its own:

    • Two learnable parameters are called beta and gamma.
    • Two non-learnable parameters (Mean Moving Average and Variance Moving Average) are saved as part of the ‘state’ of the Batch Norm layer.

These parameters exist separately for each Batch Norm layer.

Parameters and Working

Fig: Parameters of a Batch Norm layer.

9 of 24

If there are, say, three hidden layers and three Batch Norm layers in the network, we would have three pairs of learnable beta and gamma parameters, one pair for each layer. Similarly for the Moving Average parameters.

Fig: Each Batch Norm layer has its own copy of the parameters.

10 of 24

During training, the network is fed one mini-batch of data at a time. During the forward pass, each layer of the network processes that mini-batch of data. The Batch Norm layer processes its data as follows (a code sketch of these steps follows the list):

Fig: Calculations performed by Batch Norm layer.

11 of 24

01. Activations

The activations from the previous layer are passed as input to the Batch Norm. There is one activation vector for each feature in the data.

02. Calculate Mean and Variance

For each activation vector separately, the mean and variance of all the values in the mini-batch are calculated.

03. Normalize

Each activation vector is normalized using its corresponding mean and variance. The normalized values now have zero mean and unit variance.

04. Scale and Shift

Batch normalization then allows the normalized values to be scaled and shifted by learnable parameters: the normalized values are multiplied by a factor, gamma, and a factor, beta, is added to the result.

05. Moving Average

Batch Norm also keeps a moving average of the batch means and variances seen during training. These saved statistics are used during inference, when a full mini-batch may not be available to compute them.
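
Putting the five steps together, here is a minimal NumPy sketch of a training-time Batch Norm pass over a (batch, features) input. The function name and the moving-average convention (new statistics weighted by the momentum, as PyTorch does it) are illustrative assumptions, not code from the deck.

    import numpy as np

    def batch_norm_train(x, gamma, beta, running_mean, running_var,
                         momentum=0.1, eps=1e-5):
        # Steps 02-03: per-feature mean and variance over the mini-batch, then normalize.
        batch_mean = x.mean(axis=0)
        batch_var = x.var(axis=0)
        x_hat = (x - batch_mean) / np.sqrt(batch_var + eps)

        # Step 04: scale and shift with the learnable parameters gamma and beta.
        out = gamma * x_hat + beta

        # Step 05: update the non-learnable moving averages kept for inference.
        running_mean = (1 - momentum) * running_mean + momentum * batch_mean
        running_var = (1 - momentum) * running_var + momentum * batch_var
        return out, running_mean, running_var

    # Example call for a mini-batch of 8 samples with 4 features.
    x = np.random.randn(8, 4)
    out, rm, rv = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4),
                                   running_mean=np.zeros(4), running_var=np.ones(4))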

12 of 24

Batch Norm during Inference

During training, BatchNorm starts by calculating the mean and variance for a mini-batch. However, during Inference, we have a single sample, not a mini-batch.

This is where the two Moving Average parameters come in: the mean and variance that we calculated during training and saved with the model are used in place of the batch statistics during Inference.

Calculating and saving the mean and variance of the full data is expensive, so the Moving Average is used as a proxy. It is more efficient because only the most recent Moving Average needs to be remembered.

Fig: Batch Norm calculation during Inference.
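
A matching sketch of the inference-time calculation, under the same assumptions as the training sketch above: the saved moving averages stand in for the batch statistics, so even a single sample can be normalized.

    import numpy as np

    def batch_norm_infer(x, gamma, beta, running_mean, running_var, eps=1e-5):
        # No batch statistics here: only the moving averages saved during training.
        x_hat = (x - running_mean) / np.sqrt(running_var + eps)
        return gamma * x_hat + beta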

13 of 24

Order of Placement of BatchNorm layer

There are two opinions on where the Batch Norm layer should be placed in the architecture: before the activation function, or after it.

Fig: Batch Norm can be used before or after activation.
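
A small PyTorch sketch of the two placements; the layer sizes are arbitrary and only illustrate the ordering.

    import torch.nn as nn

    # Option A: Batch Norm before the activation (the placement used in the original paper).
    bn_before = nn.Sequential(
        nn.Linear(64, 32),
        nn.BatchNorm1d(32),
        nn.ReLU(),
    )

    # Option B: Batch Norm after the activation.
    bn_after = nn.Sequential(
        nn.Linear(64, 32),
        nn.ReLU(),
        nn.BatchNorm1d(32),
    )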

14 of 24

The batch_normalize() function (reconstructed below) first calculates the mean and variance of each feature in the batch of data. It then normalizes the data by subtracting the mean and dividing by the square root of the variance plus a small epsilon value, and returns the normalized data.

The epsilon parameter is a small value that is added to the variance to avoid dividing by zero. This is important because the variance of a batch of data can sometimes be zero, especially if the batch is small. Adding a small epsilon value ensures that the denominator of the normalization equation is never zero.

Running the code will print the original data and the normalized data. The normalized data will be centered around zero and have a standard deviation of 1. This will help to stabilize the training of a deep neural network that uses the data array as input.

Examples
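
Since the exact listing is not shown here, the following is a plausible reconstruction of the batch_normalize() function described above; the example data values are made up.

    import numpy as np

    def batch_normalize(data, epsilon=1e-5):
        # Mean and variance of each feature column in the batch.
        mean = data.mean(axis=0)
        var = data.var(axis=0)
        # Epsilon keeps the denominator non-zero even for a zero-variance feature.
        return (data - mean) / np.sqrt(var + epsilon)

    # Hypothetical input batch: 3 samples, 2 features.
    data = np.array([[1.0, 10.0],
                     [2.0, 20.0],
                     [3.0, 30.0]])

    print("Original data:\n", data)
    print("Normalized data:\n", batch_normalize(data))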

15 of 24

The nn.BatchNorm1d class in PyTorch implements batch normalization for 1D data (a reconstruction of the listing appears below). The affine=False argument specifies that the layer should have no learnable parameters, i.e. no gamma and beta; the layer only normalizes its input using the batch statistics.

The torch.randn() function in PyTorch creates a random tensor drawn from a normal distribution. The arguments 22 and 102 specify that the tensor should have 22 samples and 102 features.

The n(inputval) line normalizes the input tensor using the batch normalization layer. The normalized output is then printed out.

Because the input tensor is drawn randomly by torch.randn(), its mean and variance differ on every run, so the output of the batch normalization layer also changes every time.
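
The slide's listing is not reproduced here, so this is a plausible reconstruction based on the description; the names n and inputval are taken from the text above.

    import torch
    import torch.nn as nn

    # Batch norm over 102 features; affine=False disables the learnable gamma and beta.
    n = nn.BatchNorm1d(102, affine=False)

    # Random input: 22 samples, 102 features (different values on every run).
    inputval = torch.randn(22, 102)

    output = n(inputval)
    print(output)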

16 of 24

The first line imports the torch and nn modules from PyTorch.

The second line creates a BatchNorm2d layer with 120 features. The affine=False argument tells the layer to not learn any affine parameters.

The third line creates a random 4D input tensor with a batch size of 20, 120 channels, 55 rows, and 65 columns.

The fourth line passes the inputs tensor to the BatchNorm2d layer.

The fifth line prints the output of the BatchNorm2d layer.
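
Again, a plausible reconstruction of the listing described above; the variable name m is an assumption, while inputs and the tensor shape come from the text.

    import torch
    import torch.nn as nn

    # BatchNorm2d over 120 channels; affine=False means no learnable affine parameters.
    m = nn.BatchNorm2d(120, affine=False)

    # 4D input: batch of 20, 120 channels, 55 rows, 65 columns.
    inputs = torch.randn(20, 120, 55, 65)

    output = m(inputs)
    print(output)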

17 of 24

Advantages

01. Batch normalization can speed up the training of deep neural networks by reducing the amount of internal covariate shift. This is because batch normalization normalizes the activations, which helps to keep their distribution stable.

02. Batch normalization allows for higher learning rates, which can further speed up the training process. This is because batch normalization helps to stabilize the gradients, which makes it easier for the network to learn.

03. Batch normalization can improve the generalization performance of deep neural networks. This is because batch normalization helps to make the network less sensitive to the initialization of the weights.

04. Batch normalization can reduce the need for regularization techniques, such as dropout and L2 regularization. This is because batch normalization helps to stabilize the training process, which makes it less likely that the network will overfit the training data.

18 of 24

Real World

Applications

Medical imaging

In 2017, a study showed that a deep-learning model with batch normalization was able to achieve a sensitivity of 97.6% and a specificity of 96.6% in detecting breast cancer from mammograms.

Image Classification

In 2015, researchers at Google used batch normalization to improve the performance of their image classification model, called Inception V3.

Speech recognition

Batch normalization has been used to improve the performance of speech recognition models, such as those used in Amazon Alexa and Google Assistant.

Robotics

In 2019, a self-driving car with batch normalization was able to achieve a higher accuracy in detecting and avoiding obstacles than a car without batch normalization.

19 of 24

Practical Tips

01. Use it after activation functions

Batch normalization is applied after activation functions (e.g., ReLU, sigmoid) to stabilize training by normalizing the activation distributions, especially in the presence of non-linearity.

02. Experiment with different momentum and epsilon values.

The momentum parameter controls how quickly the moving averages of the mean and variance are updated, and epsilon adds numerical stability to the variance term; tuning both can improve network performance (see the sketch after these tips).

03. Don't use batch normalization for all layers

Batch normalization isn't always needed for all layers. Optimal use may involve applying it selectively and experimenting with different configurations for improved network performance.

04. Be careful with batch normalization in small batches.

Batch normalization performs best with large batch sizes. Smaller batches give noisy estimates of the batch mean and variance, which can destabilize training and make the saved moving averages less reliable.
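
For tip 02, a minimal PyTorch sketch of where the two arguments are set; the particular values are hypothetical starting points for experimentation, not recommendations.

    import torch.nn as nn

    # Smaller momentum -> smoother, slower-moving running statistics;
    # larger eps -> extra numerical stability for near-zero-variance features.
    bn = nn.BatchNorm1d(64, momentum=0.05, eps=1e-3)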

20 of 24

Limitations

01. Requires a large batch size.

The batch mean and variance are calculated over each mini-batch. If the batch size is small, these estimates (and the moving averages built from them) may be inaccurate, which can lead to problems with the training process.

02. Can make the network less robust to changes in the input data.

Batch normalization rescales the activations using statistics collected during training, so the network may behave unexpectedly if the input data at inference time is distributed differently from the training data.

03. Can make the network more difficult to interpret.

Batch normalization adds a transformation whose output depends on the statistics of the whole mini-batch, which can make it harder to understand how the network maps an individual input to its output.

21 of 24

Alternatives to BatchNorm

01. GROUP NORMALIZATION (GN)

02. SWITCHABLE NORMALIZATION (SN)

03. ATTENTIVE NORMALIZATION (AN)

04. STOCHASTIC NORMALIZATIONS AS BAYESIAN LEARNING

05. WEIGHT NORMALIZATION

22 of 24

References

23 of 24

References

for Real World Applications

01 IMAGE CLASSIFICATION

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1-9).

02 SPEECH RECOGNITION

Ravanelli, M., Brakel, P., Omologo, M., & Bengio, Y. (2016, December). Batch-normalized joint training for DNN-based distant speech recognition. In 2016 IEEE Spoken Language Technology Workshop (SLT) (pp. 28-34). IEEE.

03 MEDICAL IMAGING

"Deep learning for breast cancer histopathology: Improved classification with weakly labeled data" published in 2017 by Pranav Rajpurkar, Jeffrey Irvin, Kai Zhu, Siddharth Mehta, Kevin Chen, Lei Yang, et al. The study was conducted at the University of Pennsylvania and the University of California, Berkeley.

04 ROBOTICS

"End-to-end training of deep neural networks for self-driving cars" published in 2019 by Marek Bojarski, Guillaume Delbrouck, Lu-Chih Chen, Olivier Delalleau, Jean Chaumont, and Urs Muller. The study was conducted at the University of California, Berkeley and Uber AI.

24 of 24

THANK YOU

FOR YOUR TIME