1 of 63

Neural Network Design and Regularization

Lecture 16

Parameter Initialization, Normalization, Double Descent, Model Averaging, Drop-out, and Residual Connections.

EECS 189/289, Fall 2025 @ UC Berkeley

Joseph E. Gonzalez and Narges Norouzi


2 of 63

What would you like to learn about Deep Learning or PyTorch?


3 of 63

How do you feel about the midterm? (1 = Not Great, 5 = Wonderful)


4 of 63

Join at slido.com #8111828


5 of 63

Parameter Initialization

6 of 63

Choice of Initial Parameters Matters

The neural network loss surface is highly non-convex, so the choice of starting point can result in:

  • diverging solutions
  • slow convergence
  • bad local minima

What makes a good starting point?


7 of 63

“Sweet-spot” of the Non-Linearity

The sweet spot is near the transition point of the non-linear activation.

  • Non-zero gradient
  • Typically near zero

Outside the transition point the activation is saturated.

  • Small changes in the input barely change the activation.


8 of 63

Starting at Zero-Weights

[Diagram: a small feed-forward network whose weights are all initialized to zero.]

9 of 63

Why shouldn't we initialize all of our neural network’s weights to zero?


10 of 63

Unbroken Symmetry at Zero-Weights

If we initialize all weights to zero (or to any identical value), then all hidden units produce identical activations, receive identical gradients, and therefore remain identical throughout training.

 

 

 

 

[Diagram: with all weights set to zero, every hidden pre-activation and activation equals 0.]
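A minimal PyTorch sketch (not from the slides) of the problem: set every parameter to the same constant and observe that the two hidden units receive identical gradients, so they can never become different. With an all-zero initialization, the gradients into the first-layer weights are exactly zero, so those weights never move at all.

import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(3, 2), nn.Tanh(), nn.Linear(2, 1))
with torch.no_grad():
    for p in net.parameters():
        p.fill_(0.1)              # every weight and bias set to the same constant

x, y = torch.randn(8, 3), torch.randn(8, 1)
loss = ((net(x) - y) ** 2).mean()
loss.backward()

print(net[0].weight.grad)         # the two rows (one per hidden unit) are identical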

11 of 63

Weight Symmetry In Neural Networks

For any layer in the neural network, we could “re-arrange” the neurons to obtain an identical network.

 

 

 

 


Bishop 6.2.4


12 of 63

Weight Symmetry In Neural Networks

For any layer in the neural network, we could “re-arrange” the neurons to obtain an identical network.

[Diagram: the same network drawn twice, with the hidden units re-ordered as 2, 1, 3, 4; permuting the units together with their incoming and outgoing weights leaves the function unchanged.]

Bishop 6.2.4

13 of 63

Sign Flipping Equivalence in Networks

  • For tanh hidden units, multiplying a unit’s incoming weights (and bias) by −1 flips the sign of its activation; also multiplying its outgoing weights by −1 leaves the network’s output unchanged.

[Diagram: a hidden unit whose incoming and outgoing weights are each multiplied by −1, producing an equivalent network.]

Bishop 6.2.4

14 of 63

Counting Network Symmetries

  • Every hidden layer admits many equivalent weight settings: the hidden units can be permuted, and (for odd activations such as tanh) each unit’s incoming and outgoing weights can have their signs flipped without changing the network’s function.
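Following the standard counting argument (Bishop 6.2.4), a single layer of M tanh hidden units has

M! \cdot 2^{M} \quad \text{equivalent weight settings}

coming from the M! permutations of the hidden units and the 2^M independent sign flips.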


15 of 63

Breaking Symmetry with Random Weight Initialization

  • Break symmetry by initializing the weights to small random values (for example, draws from a zero-mean Gaussian): different random weights produce different activations and gradients, so each neuron can learn a different feature.


16 of 63

He Initialization for ReLU Activations

  • He initialization draws each weight from a zero-mean Gaussian with variance 2 / fan-in, which keeps the variance of the activations roughly constant across ReLU layers.

Bishop 7.2.5

17 of 63

Xavier Initialization for Tanh Activations

  • Xavier (Glorot) initialization draws each weight from a zero-mean distribution with variance 2 / (fan-in + fan-out), which keeps activation and gradient variances roughly constant for tanh units (approximately linear near zero).

Bishop 7.2.5

  • Fan-in: the number of inputs to a unit in the layer.  Fan-out: the number of outputs.

18 of 63

Weight Initialization in PyTorch

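A minimal sketch (the module and layer sizes are arbitrary) of applying He or Xavier initialization with PyTorch's built-in initializers:

import torch.nn as nn

def init_weights(module, activation="relu"):
    # Apply He init for ReLU layers and Xavier (Glorot) init otherwise (e.g., tanh).
    if isinstance(module, nn.Linear):
        if activation == "relu":
            nn.init.kaiming_normal_(module.weight, nonlinearity="relu")  # He: Var = 2 / fan-in
        else:
            nn.init.xavier_normal_(module.weight)                        # Xavier: Var = 2 / (fan-in + fan-out)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
model.apply(lambda m: init_weights(m, activation="relu"))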

19 of 63

Normalization

20 of 63

Data Normalization

  • Standardize each input feature using training-set statistics: subtract the feature’s mean and divide by its standard deviation so that all features are on a comparable scale.

Bishop 7.4.2

Example
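A minimal sketch with synthetic data (not the slide's example): standardize each input feature with statistics computed on the training set, and reuse the same statistics at test time.

import torch

X_train = torch.randn(100, 5) * 3.0 + 2.0   # synthetic training features
X_test = torch.randn(20, 5) * 3.0 + 2.0     # synthetic test features

mu, sigma = X_train.mean(dim=0), X_train.std(dim=0)
X_train_norm = (X_train - mu) / (sigma + 1e-8)
X_test_norm = (X_test - mu) / (sigma + 1e-8)   # never compute statistics on the test set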

 

 


21 of 63

Normalization in the Network

  • The same idea can be applied inside the network by normalizing the activations of each hidden layer, not just the inputs.

Bishop 7.4.2

 

 

 

 


Why would hidden-layer normalization help?

  • It keeps internal activations on a similar scale
    • By the chain rule, the gradients then stay on a similar scale too

Problem: the internal activations change during learning (SGD)!


22 of 63

Batch Normalization

  • Batch normalization standardizes each unit’s pre-activation using the mean and variance computed over the current mini-batch, then applies a learned scale γ and shift β.
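In the standard formulation, for one unit’s pre-activation z over a mini-batch of size B (γ and β are learned; ε avoids division by zero):

\mu_B = \frac{1}{B}\sum_{b=1}^{B} z_b, \qquad \sigma_B^2 = \frac{1}{B}\sum_{b=1}^{B} (z_b - \mu_B)^2

\hat{z}_b = \frac{z_b - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_b = \gamma\,\hat{z}_b + \beta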

 

 

 

 

 

 

 

 

 

 

[Diagram: a feed-forward network with a BN (batch-normalization) block inserted after each hidden layer.]

23 of 63

Making Predictions with Batch Norm.

  • Mini-batch statistics are not available (or appropriate) one example at a time, so batch norm keeps running (exponential moving average) estimates of the mean and variance during training and uses those at prediction time.
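A minimal sketch (feature and batch sizes are arbitrary) of how PyTorch's BatchNorm1d switches between mini-batch statistics and running statistics:

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(16)

bn.train()
_ = bn(torch.randn(32, 16))    # updates bn.running_mean and bn.running_var

bn.eval()
y = bn(torch.randn(4, 16))     # normalizes with the running statistics instead of the mini-batch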

 

 


24 of 63

Batch Normalization is Difficult to Parallelize

Data-parallel stochastic gradient descent

[Diagram: the mini-batch is split into Data 1 and Data 2, which are sent to GPU 1 and GPU 2.]

25 of 63

Batch Normalization is Difficult to Parallelize

Data-parallel stochastic gradient descent

[Diagram: GPU 1 and GPU 2 each compute their own gradient (Gradient 1, Gradient 2) from their share of the data.]

26 of 63

Batch Normalization is Difficult to Parallelize

Data-parallel stochastic gradient descent

[Diagram: Gradient 1 and Gradient 2 are summed across the GPUs.]

GPUs share their gradients and compute the sum (called an all-reduce).

27 of 63

Batch Normalization is Difficult to Parallelize

Batch normalization couples the examples in the mini-batch during the forward pass.

Gradients cannot be computed independently across GPUs.


28 of 63

Layer Normalization

Layer Normalization computes the layer mean and variance across the neurons in the layer instead of across the mini-batch:
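In the standard formulation, for example n with D units in the layer (γ_d and β_d are learned per-unit scale and shift; ε avoids division by zero):

\mu_n = \frac{1}{D}\sum_{d=1}^{D} z_{nd}, \qquad \sigma_n^2 = \frac{1}{D}\sum_{d=1}^{D} (z_{nd} - \mu_n)^2

\hat{z}_{nd} = \frac{z_{nd} - \mu_n}{\sqrt{\sigma_n^2 + \epsilon}}, \qquad y_{nd} = \gamma_d\,\hat{z}_{nd} + \beta_d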

 

 

 

 

 

 

Normalization is applied the same way at test time.

  • No need to estimate statistics from the training data (unlike batch norm’s running averages).

Layer Normalization is currently used in large language models.


29 of 63

Layer Normalization vs Batch Normalization

 

 

 

 

[Diagram: the same stack of layers with BN blocks drawn once per example in the mini-batch; batch normalization computes its statistics along the mini-batch dimension, layer normalization along the layer dimension.]

30 of 63

Layer Normalization vs Batch Normalization

[Diagram: a grid of activations indexed by the mini-batch dimension and the layer dimension; Batch Norm. normalizes each unit across the mini-batch dimension, while Layer Norm. normalizes each example across the layer dimension.]

31 of 63

Re-Scaling with Normalization

  • After normalizing, the activations are re-scaled by a learned scale γ and shifted by a learned offset β, so the network can recover whatever activation scale works best (including undoing the normalization).


32 of 63

Batch-Norm and Layer-Norm in PyTorch

Creating norm layers:

if norm_kind == "batch":
    self.norms = nn.ModuleList([nn.BatchNorm1d(c) for c in dims[1:]])
else:  # norm_kind == "layer"
    self.norms = nn.ModuleList([nn.LayerNorm(c) for c in dims[1:]])

Using norm layers:

def forward(self, x):
    for i, (layer, norm) in enumerate(zip(self.layers, self.norms)):
        x = layer(x)
        if i < len(self.layers) - 1:
            x = norm(x)        # normalize the hidden pre-activations
            x = self.act(x)    # then apply the non-linearity
    return x


33 of 63

Inductive Biases

34 of 63

Learning Requires Inductive Biases

“No Free Lunch Theorem” – averaged over all possible problems, any learning algorithm performs as well as any other.

  • We therefore need to exploit inductive biases.

We encode inductive biases in neural networks through:

  • Pre-processing: Transform the data to enforce desired invariances (e.g., normalize scale, orientation, or intensity).
  • Regularized objectives: Penalize models that exhibit undesired properties during training.
  • Data augmentation: Expand the training data with transformed examples to make invariances explicit (e.g., rotated images).
  • Architecture design: Choose model structures that build in invariances (e.g., convolution layers for translation invariance).


35 of 63

Stopped Here

36 of 63

Pre-processing through Feature Engineering

Feature Engineering – the process of manually constructing features, which is common in many classical machine learning techniques.

Examples:

  • Term Frequency–Inverse Document Frequency (TF-IDF) re-weights document representations by how common each word is across the corpus.
  • Histogram of Oriented Gradients (HOG) was widely used in computer vision to represent image features.
  • Mel-frequency cepstral coefficients (MFCCs) are (still) used to represent audio in speech recognition systems.

Modern deep learning instead uses large neural networks as feature functions, learning feature representations with the “invariances” needed for the prediction task.

  • Inductive bias shifts to data selection, model, and training process.


37 of 63

Weight Decay Regularization

  • Weight decay adds an L2 penalty, (λ/2)‖w‖², to the training objective; equivalently, every gradient step shrinks the weights slightly toward zero, discouraging overly large weights.
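A minimal sketch (the model and hyperparameter values are arbitrary examples): in PyTorch, L2 weight decay is typically passed directly to the optimizer.

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)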

 


38 of 63

Learning Curves and Early Stopping

During training it is common practice to plot a learning curve, which tracks the training and validation error as a function of the number of gradient steps.

Early Stopping: stop training (and return to the best model checkpoint) when the validation error stops decreasing.

Bishop 9.3.1

[Figure from Bishop textbook (p. 267): training and validation error vs. gradient steps; the validation error eventually starts to increase while the training error keeps decreasing.]
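A minimal early-stopping sketch (the model, data loaders, optimizer, and the train_one_epoch / validation_error helpers named here are hypothetical placeholders, not from the slides):

import copy

max_epochs = 100
best_val, best_state = float("inf"), None
patience, bad_epochs = 5, 0
for epoch in range(max_epochs):
    train_one_epoch(model, train_loader, optimizer)     # hypothetical training helper
    val_error = validation_error(model, val_loader)     # hypothetical validation helper
    if val_error < best_val:
        best_val, best_state, bad_epochs = val_error, copy.deepcopy(model.state_dict()), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break                                        # validation error stopped decreasing
model.load_state_dict(best_state)                        # return to the best checkpoint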


39 of 63

Early Stopping is a form of Weight Decay

Because models typically start with relatively small weight values, stopping gradient descent early can be viewed as a form of weight decay.

Bishop 9.3.1

[Figure from Bishop textbook (p. 268): weight-space view; the optimization trajectory starts near zero, and the early-stopping weights have a smaller norm than the error-minimizing weights.]


40 of 63

Double Descent on the Bias-Variance Tradeoff

41 of 63

The Bias Variance Tradeoff Revisited

In previous lectures, we introduced the fundamental bias-variance tradeoff and how bias and variance contribute to the test error.

[Plot: test error, training error, (bias)², and variance vs. model complexity; the test error is minimized at an intermediate, optimal complexity.]

42 of 63

Double Descent

For deep neural networks trained using SGD, there is a second decrease in test error as model complexity continues to grow.

[Plot: training and test error vs. model complexity; after the classical optimum in the under-parameterized regime, the test error descends a second time in the over-parameterized regime.]

Very complex models seem to self-regularize (likely due to SGD).

43 of 63

Double Descent in Practice

Examples of double descent from Nakkiran et al., ICLR 2020: [plots of test error exhibiting double descent].

For sufficiently large models, early stopping could decrease generalization.


44 of 63

Model Averaging

45 of 63

Ensembles of Models (Experts)

  • An ensemble combines the predictions of several independently trained models (“experts”); when their errors are not perfectly correlated, averaging reduces the variance of the combined prediction.


46 of 63

Mixture of Experts (MoEs)

  • A mixture of experts (MoE) uses a learned gating function to weight (or route between) several expert models, letting different experts specialize on different parts of the input space.


47 of 63

Constructing Multiple Experts

Ideally, each expert is a strong independent model.

  • Strong: state-of-the-art (SoTA – you should know this term!)
  • Independent: trained with different data and inductive biases

Ensembles can be constructed by taking SoTA models from competing teams and combining them.

  • Ensemble methods topped most classic benchmarks.

Bootstrap Aggregation (Bagging) is a classic technique to construct an ensemble by training on bootstrap samples of the training data.


48 of 63

Bagging: Bootstrap Aggregation

  • Bagging: draw several bootstrap datasets by sampling the training data with replacement, and train one model on each.

[Diagram: bootstrap sampling produces Dataset 1, Dataset 2, and Dataset 3 from the original dataset.]


51 of 63

Bagging: Bootstrap Aggregation

  • The bagged model averages the predictions of the models trained on the different bootstrap datasets.

[Diagram: a separate model is trained on each of Dataset 1, Dataset 2, and Dataset 3; their predictions are combined into the bagged model.]

Bagged Model:
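A minimal sketch of bagging by hand (train_model is a hypothetical callable that fits and returns a base model; any model class could be plugged in):

import numpy as np

def bagged_predict(X_train, y_train, X_test, train_model, n_models=3, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X_train)
    predictions = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)               # bootstrap sample: n points drawn with replacement
        model = train_model(X_train[idx], y_train[idx])
        predictions.append(model(X_test))
    return np.mean(predictions, axis=0)                # average the ensemble members' predictions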


52 of 63

Dropout

53 of 63

Dropout Regularization

Dropout regularizes neural networks by randomly “disabling” neurons during each iteration of SGD.

[Diagram: the full network, and the thinned sub-networks used on SGD iteration 1 and SGD iteration 2, each with a different random subset of neurons dropped.]

54 of 63

What inductive bias is introduced by dropout?


55 of 63

Drop-out and Model Averaging

  • Dropout approximately trains an exponentially large ensemble of sub-networks that share weights; using the full network with appropriately scaled activations at test time approximates averaging the ensemble’s predictions.


56 of 63

Implementing Drop-out in PyTorch
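A minimal sketch (layer sizes and the drop probability p are arbitrary) using PyTorch's built-in dropout layer:

import torch
import torch.nn as nn

# nn.Dropout zeroes each activation with probability p during training
# (and rescales the rest); it is the identity in eval mode.
model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(128, 10),
)

model.train()                       # dropout active during SGD updates
out_train = model(torch.randn(4, 64))

model.eval()                        # dropout disabled when making predictions
out_eval = model(torch.randn(4, 64))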


57 of 63

Residual Connections

58 of 63

Vanishing Gradients in Deep Networks

Training networks with hundreds of layers is challenging.

  • Vanishing Gradients: error signals get multiplicatively diluted as they pass back through many layers.
  • Shattered Gradients: many stacked non-linearities result in a rough loss surface.

We need a way to connect lower layers of the network directly to the predictions (and the subsequent loss).


59 of 63

Residual Networks

  • A residual (skip) connection adds a layer’s input to its output, h_{l+1} = h_l + f_l(h_l); each block only needs to learn a correction, and the gradient can flow directly through the identity path.
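A minimal sketch of a residual block in PyTorch (the widths are arbitrary):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.f(x)       # skip connection: h_{l+1} = h_l + f_l(h_l)

block = ResidualBlock(32)
h = torch.randn(8, 32)
print(block(h).shape)              # torch.Size([8, 32])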

 

 

 

 

 

 

 


60 of 63

Residual Networks as Ensembles

  • Expanding the skip connections shows that a residual network behaves like an implicit ensemble: it sums the contributions of many sub-networks of different depths (small, medium, and large) that share parameters.

[Diagram: the residual network decomposed into a small, a medium, and a large sub-network whose outputs are summed.]

61 of 63

Expanding the Residual Network Graph

  • Expanding the residual computation graph reveals short paths from the loss back to every block: the gradient signal can reach early layers through the skip connections without passing through many non-linearities.

[Diagram: the expanded residual computation graph; the gradient signal from the loss reaches each block along short skip-connection paths.]

62 of 63

Demo

Probably the coolest demo yet.


63 of 63

Neural Network Design and Regularization

Lecture 16

Reading: Chapter 8 in Bishop Deep Learning Textbook