Neural Network Design and Regularization
Lecture 16
Parameter Initialization, Normalization, Double Descent, Model Averaging, Drop-out, and Residual Connections.
EECS 189/289, Fall 2025 @ UC Berkeley
Joseph E. Gonzalez and Narges Norouzi
What would you like to learn about Deep Learning or PyTorch?
How do you feel about the midterm? (1 = Not Great, 5 = Wonderful)
Parameter Initialization
Choice of Initial Parameters Matters
The neural network loss surface is highly non-convex, so the choice of starting point can result in:
What makes a good starting point?
“Sweet-spot” of the Non-Linearity
The sweet spot is near the transition point of the non-linear transformation.
Outside the transition region the activation is saturated and its gradient is near zero.
Starting at Zero-Weights
Why shouldn't we initialize all of our neural network's weights to zero?
Unbroken Symmetry at Zero-Weights
If we initialize all weights to zero, then all hidden units produce identical activations, receive identical gradient updates, and therefore remain identical during training.
Weight Symmetry In Neural Networks
For any layer in the neural network, we could “re-arrange” the neurons to obtain an identical network.
Bishop 6.2.4
Sign Flipping Equivalence in Networks
[Figure: multiplying the weights into and out of a hidden unit by -1 leaves the network's function unchanged (for odd activations such as tanh).]
Bishop 6.2.4
Counting Network Symmetries
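As a concrete count (a sketch following Bishop 6.2.4; the specific figures on the slide are not in the extracted text): a hidden layer of $M$ tanh units has $M!$ permutation symmetries and $2^M$ sign-flip symmetries, giving
\[ M! \, 2^M \]
equivalent weight settings that all compute exactly the same function.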
Breaking Symmetry with Random Weight Initialization
He Initialization for ReLU Activations
Bishop 7.2.5
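A sketch of the standard He formula (assumed here, since the slide's own equation is not in the extracted text): for a layer with fan-in $n_{\text{in}}$ feeding a ReLU, draw each weight as
\[ w \sim \mathcal{N}\!\left(0,\ \tfrac{2}{n_{\text{in}}}\right), \]
which keeps the variance of the activations roughly constant from layer to layer.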
Xavier Initialization for Tanh Activations
Bishop 7.2.5
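A sketch of the standard Xavier/Glorot formula (assumed here, using the fan-in/fan-out notation of the figure): for a layer with fan-in $n_{\text{in}}$ and fan-out $n_{\text{out}}$, draw each weight as
\[ w \sim \mathcal{N}\!\left(0,\ \tfrac{2}{n_{\text{in}} + n_{\text{out}}}\right) \]
(or the corresponding uniform distribution), balancing the variance of forward activations and backward gradients for tanh-like activations.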
Weight Initialization in PyTorch
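A minimal sketch of how these initializers might be applied in PyTorch (the model here is illustrative, not the course code):

import torch.nn as nn

def init_weights(module, scheme="he"):
    # Apply the chosen initialization to every linear layer in the model.
    if isinstance(module, nn.Linear):
        if scheme == "he":
            # He initialization: variance 2 / fan_in, intended for ReLU activations.
            nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        else:
            # Xavier initialization: variance 2 / (fan_in + fan_out), intended for tanh.
            nn.init.xavier_normal_(module.weight)
        nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
model.apply(lambda m: init_weights(m, scheme="he"))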
Normalization
Data Normalization
Bishop 7.4.2
Example
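A minimal sketch of the usual recipe (toy data; names are illustrative): standardize each input feature using statistics computed on the training set only, then reuse those statistics at test time.

import torch

X_train = torch.randn(1000, 8) * 5.0 + 3.0   # toy features with large scale and non-zero mean
X_test  = torch.randn(200, 8) * 5.0 + 3.0

mu    = X_train.mean(dim=0)    # per-feature mean, computed on training data only
sigma = X_train.std(dim=0)     # per-feature standard deviation

X_train_norm = (X_train - mu) / sigma   # roughly zero mean, unit variance
X_test_norm  = (X_test  - mu) / sigma   # apply the *training* statistics to test data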
Normalization in the Network
Bishop 7.4.2
Why would hidden layer normalization help?
Problem: the distribution of internal activations changes during learning (SGD)!
Batch Normalization
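A sketch of the standard batch-norm computation (the slide's own notation is not in the extracted text): for one unit with pre-activations $a_1, \dots, a_B$ over a mini-batch of size $B$,
\[ \mu_{\mathcal{B}} = \frac{1}{B}\sum_{n=1}^{B} a_n, \qquad \sigma_{\mathcal{B}}^2 = \frac{1}{B}\sum_{n=1}^{B} (a_n - \mu_{\mathcal{B}})^2, \qquad \hat{a}_n = \gamma\,\frac{a_n - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}} + \beta, \]
where $\gamma$ and $\beta$ are learned scale and shift parameters and $\epsilon$ is a small constant for numerical stability.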
Making Predictions with Batch Norm.
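At test time, batch norm normalizes with running estimates of the mean and variance accumulated during training instead of the current batch's statistics. A minimal PyTorch sketch of this behavior (illustrative sizes):

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)

bn.train()                      # training mode: use batch statistics
x = torch.randn(32, 4)          # a mini-batch of 32 examples with 4 features
_ = bn(x)                       # also updates bn.running_mean and bn.running_var

bn.eval()                       # evaluation mode: use the stored running estimates
y = bn(torch.randn(1, 4))       # predictions now work even for a single example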
Batch Normalization is Difficult to Parallelize
Data-parallel stochastic gradient descent: the mini-batch is split across GPUs (GPU 1 processes Data 1, GPU 2 processes Data 2), and each GPU computes its own gradient (Gradient 1, Gradient 2).
GPUs then share gradients and compute their sum (called all-reduce).
Batch normalization couples the forward pass across the examples in a mini-batch, so gradients cannot be computed independently across GPUs.
Layer Normalization
Layer Normalization computes the layer mean and variance across the neurons in the layer instead of across the mini-batch:
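A standard formulation (a sketch; the slide's own equation is not in the extracted text): for a layer with $M$ pre-activations $a_1, \dots, a_M$ for a single example,
\[ \mu = \frac{1}{M}\sum_{i=1}^{M} a_i, \qquad \sigma^2 = \frac{1}{M}\sum_{i=1}^{M}(a_i - \mu)^2, \qquad \hat{a}_i = \gamma_i\,\frac{a_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_i, \]
with learned per-unit scale $\gamma_i$ and shift $\beta_i$.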
Normalization is applied the same way at test time.
Layer Normalization is currently used in large language models.
Layer Normalization vs Batch Normalization
[Figure: Batch Norm normalizes each unit's activations across the mini-batch dimension, while Layer Norm normalizes across the layer dimension (the units within a single example).]
Re-Scaling with Normalization
Batch-Norm and Layer-Norm in PyTorch
Creating norm layers:
if norm_kind == "batch":
    self.norms = nn.ModuleList([nn.BatchNorm1d(c) for c in dims[1:]])
else:  # norm_kind == "layer"
    self.norms = nn.ModuleList([nn.LayerNorm(c) for c in dims[1:]])
Using norm layers:
def forward(self, x):
    for i, (layer, norm) in enumerate(zip(self.layers, self.norms)):
        x = layer(x)
        if i < len(self.layers) - 1:
            x = norm(x)
            x = self.act(x)
    return x
Inductive Biases
Learning Requires Inductive Biases
“No Free Lunch Theorem” – any learning algorithm is as good as any other when considering all possible problems.
We encode inductive biases in neural networks through:
Pre-processing through Feature Engineering
Feature Engineering – the process of manually constructing features, common in many classical machine learning techniques.
Examples:
Modern deep learning uses large neural networks as “feature” functions to learn features with the “invariances” appropriate for the prediction task.
Weight Decay Regularization
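A sketch of the formulation in generic notation (not copied from the slide): weight decay adds a squared-norm penalty to the training loss,
\[ \tilde{L}(\mathbf{w}) = L(\mathbf{w}) + \frac{\lambda}{2}\,\lVert \mathbf{w} \rVert^2, \]
so each gradient step also shrinks the weights toward zero. In PyTorch this is typically passed to the optimizer, e.g. torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4).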
Learning Curves and Early Stopping
During training it is common practice to plot a learning curve, which tracks the training and validation error.
Early Stopping: stop training (and return to the best model checkpoint) when the validation error stops decreasing.
Bishop 9.3.1
Figure from Bishop Textbook (p. 267): training and validation error as a function of gradient steps; the validation error begins to increase while the training error continues to decrease.
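A minimal sketch of early stopping on a toy problem (illustrative model, data, and hyperparameters; not the course code): keep the checkpoint with the lowest validation error and stop once it fails to improve for a fixed number of epochs.

import copy
import torch
import torch.nn as nn

# Toy regression data split into training and validation sets.
torch.manual_seed(0)
X = torch.randn(600, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(600, 1)
X_tr, y_tr, X_val, y_val = X[:400], y[:400], X[400:], y[400:]

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

best_val, best_state, patience, bad_epochs = float("inf"), None, 10, 0
for epoch in range(500):
    model.train()
    opt.zero_grad()
    loss_fn(model(X_tr), y_tr).backward()
    opt.step()

    model.eval()
    with torch.no_grad():
        val_err = loss_fn(model(X_val), y_val).item()
    if val_err < best_val:
        best_val, bad_epochs = val_err, 0
        best_state = copy.deepcopy(model.state_dict())   # checkpoint the best model so far
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break                                        # validation error stopped decreasing

model.load_state_dict(best_state)                        # return to the best checkpoint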
Early Stopping is a form of Weight Decay
Because models typically start with relatively small weight values, stopping gradient descent early can be viewed as a form of weight decay.
Bishop 9.3.1
Figure from Bishop Textbook (p. 268): the optimization trajectory starts near zero; the early-stopping weights lie between the starting point and the error-minimizing weights.
Double Descent on the Bias-Variance Tradeoff
The Bias Variance Tradeoff Revisited
In previous lectures, we introduced the fundamental bias-variance tradeoff and how bias and variance contribute to test error.
[Figure: test error, (bias)², variance, and training error as a function of model complexity, with the optimal complexity at the minimum of the test error.]
Double Descent
In the case of deep neural networks trained using SGD, there is a second decrease in test error as model complexity continues to grow.
[Figure: test and training error as a function of model complexity, showing the under-parameterized and over-parameterized regimes.]
Very complex models seem to self-regularize (likely due to SGD).
Double Descent in Practice
For sufficiently large models, early stopping could actually hurt generalization.
Model Averaging
Ensembles of Models (Experts)
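One common way to combine experts (a sketch of the standard committee/averaging formulation; not text from the slide): average the predictions of $M$ models,
\[ y_{\text{ens}}(\mathbf{x}) = \frac{1}{M}\sum_{m=1}^{M} y_m(\mathbf{x}). \]
If the individual models make uncorrelated, zero-mean errors, averaging can reduce the expected error substantially (in the idealized case, by a factor of $M$).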
Mixture of Experts (MoEs)
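A sketch of the classic mixture-of-experts formulation (details assumed, since the slide body is not in the extracted text): a gating network produces input-dependent weights over $K$ experts,
\[ y(\mathbf{x}) = \sum_{k=1}^{K} g_k(\mathbf{x})\, y_k(\mathbf{x}), \qquad g_k(\mathbf{x}) \ge 0, \quad \sum_{k=1}^{K} g_k(\mathbf{x}) = 1, \]
where the $g_k$ are typically produced by a softmax; modern large models often keep only the top few $g_k$ per input so that just a subset of experts is evaluated.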
Constructing Multiple Experts
Ideally, each expert is a strong independent model.
Ensembles could be constructed by taking SoTA models from competing teams and combining them.
Bootstrap Aggregation (Bagging) is a classic technique to construct an ensemble by training on bootstrap samples of the training data.
Bagging: Bootstrap Aggregation
Bootstrap Sampling: the original dataset is resampled with replacement to create Dataset 1, Dataset 2, and Dataset 3, and a separate model is trained on each.
Bagged Model: the predictions of the individually trained models are aggregated.
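A minimal sketch of bagging on a toy regression problem (illustrative data and model class; not the course implementation): draw bootstrap samples with replacement, fit one model per sample, and average their predictions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=200)
y = np.sin(3 * X) + 0.3 * rng.normal(size=200)   # toy regression data

n_models, preds = 3, []
x_grid = np.linspace(-1, 1, 100)
for _ in range(n_models):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample (with replacement)
    coeffs = np.polyfit(X[idx], y[idx], 5)       # train one model per bootstrap sample
    preds.append(np.polyval(coeffs, x_grid))

bagged_pred = np.mean(preds, axis=0)             # bagged model: average the predictions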
Dropout
Dropout Regularization
Dropout regularizes neural networks by randomly “disabling” neurons during each iteration of SGD.
[Figure: the full network, and the randomly thinned sub-networks used at SGD iteration 1 and SGD iteration 2.]
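Concretely, in the standard inverted-dropout formulation (a sketch; the slide's notation is not in the extracted text): at each SGD iteration every hidden unit is kept independently with probability $1 - p$,
\[ m_i \sim \mathrm{Bernoulli}(1 - p), \qquad \tilde{z}_i = \frac{m_i \, z_i}{1 - p}, \]
so dropped units output zero while the surviving activations are rescaled to keep the expected activation unchanged.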
What inductive bias is introduced by dropout?
Drop-out and Model Averaging
Implementing Drop-out in PyTorch
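A minimal sketch (illustrative architecture, not the course code): nn.Dropout zeroes each activation with probability p during training (rescaling the rest) and is disabled automatically in eval mode.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes 50% of activations (and rescales the rest)
    nn.Linear(64, 10),
)

x = torch.randn(8, 32)
model.train()            # dropout active during training
y_train = model(x)
model.eval()             # dropout disabled at test time
y_test = model(x)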
Residual Connections
Vanishing Gradients in Deep Networks
Training networks with hundreds of layers is challenging: gradients must flow backward through every layer and can vanish.
We need a way to connect lower layers of the network directly to the predictions (and the subsequent loss).
Residual Networks
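The key computation (a sketch in generic notation; the slide's own equation is not in the extracted text): a residual block adds its input back to the output of the learned transformation,
\[ \mathbf{z}^{(l+1)} = \mathbf{z}^{(l)} + F^{(l)}\!\big(\mathbf{z}^{(l)}\big), \]
so the block only needs to learn a residual correction, and the identity path lets gradients flow directly to earlier layers.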
Residual Networks as Ensembles
[Figure: a residual network viewed as an ensemble of a large, a medium, and a small network (paths of different depths).]
Expanding the Residual Network Graph
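To see this concretely (a sketch of the expansion, using the residual notation above): unrolling two residual blocks gives
\[ \mathbf{z}^{(3)} = \mathbf{z}^{(1)} + F^{(1)}\!\big(\mathbf{z}^{(1)}\big) + F^{(2)}\!\Big(\mathbf{z}^{(1)} + F^{(1)}\!\big(\mathbf{z}^{(1)}\big)\Big), \]
a sum over paths of different depths, including a path that skips every block, which is why the gradient signal from the loss can reach the earliest layers directly.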
[Figure: the unrolled residual network; the gradient signal from the loss flows directly to earlier layers through the skip connections.]
Demo
Probably the coolest demo yet.
Neural Network Design and Regularization
Lecture 16
Reading: Chapter 8 in Bishop Deep Learning Textbook