2 of 17

Vanishing Gradient Problem

In some cases, during training, the gradients can become either very small (vanishing gradients) or very large (exploding gradients)

This results in very small or very large updates of the parameters
Solutions: change learning rate, ReLU activations, regularization, LSTM units in RNNs
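To make the effect concrete, the following is a minimal sketch, assuming a simplified chain with one sigmoid unit per layer, the same weight in every layer, and the same pre-activation at every unit. Under those assumptions the backpropagated gradient is scaled by (weight × sigmoid derivative) once per layer. The function name and its arguments are illustrative, not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_scale_through_layers(n_layers, pre_activation=0.0, weight=1.0):
    """Factor applied to a gradient after backpropagating through a chain of
    n_layers sigmoid units (simplifying assumption: identical weight and
    pre-activation in every layer)."""
    s = sigmoid(pre_activation)
    local_derivative = s * (1.0 - s)  # sigmoid derivative, at most 0.25 (at z = 0)
    return (weight * local_derivative) ** n_layers

# Small weights: the factor shrinks geometrically with depth (vanishing).
for depth in (1, 5, 10, 20):
    print("weight=1, depth", depth, gradient_scale_through_layers(depth))

# Large weights: the factor grows geometrically with depth (exploding).
for depth in (1, 5, 10, 20):
    print("weight=8, depth", depth, gradient_scale_through_layers(depth, weight=8.0))
```

Because the sigmoid derivative never exceeds 0.25, deep sigmoid stacks tend toward vanishing gradients unless the weights are large, which is one reason ReLU activations appear among the solutions listed above.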

Training Neural Networks

[Figure: a network with outputs y₁, y₂, …, y_M. Annotation: small gradients, learns very slowly]

Slide credit: Hung-yi Lee – Deep Learning Tutorial

CS 404/504, Fall 2021

3 of 17

Generalization

Underfitting

The model is too “simple” to represent all the relevant class characteristics
E.g., model with too few parameters
Produces high error on the training set and high error on the validation set

Overfitting

The model is too “complex” and fits irrelevant characteristics (noise) in the data
E.g., model with too many parameters
Produces low error on the training set and high error on the validation set
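The contrast between the two regimes can be sketched with polynomial fits of different degrees. This is an illustrative setup, not one from the slides: the data-generating function, noise level, sample sizes, and the degrees 1, 3, and 9 are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Assumed ground truth: a sine curve plus Gaussian noise.
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(15)   # small training set
x_val, y_val = make_data(200)      # larger validation set

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}
for degree in (1, 3, 9):
    # The polynomial degree controls the number of parameters (degree + 1).
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_val, y_val))
    print(f"degree {degree}: train MSE {errors[degree][0]:.4f}, "
          f"validation MSE {errors[degree][1]:.4f}")
```

In a run like this, the degree-1 model typically shows high error on both sets (underfitting), while the degree-9 model typically shows a very low training error with a noticeably higher validation error (overfitting).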


4 of 17

Overfitting

Overfitting – a model with high capacity fits the noise in the data instead of the underlying relationship

Picture from: http://cs231n.github.io/assets/nn1/layer_sizes.jpeg

The model may fit the training data very well, but fails to generalize to new examples (test or validation data)


5 of 17

Regularization is a technique used in machine learning to prevent overfitting and improve the generalization ability of a model. Overfitting occurs when a model learns to perform well on the training data but fails to generalize to unseen data. Many regularization methods add a penalty term to the model's objective function, discouraging the model from learning complex patterns that fit the noise in the training data rather than the underlying relationship; others, such as dropout and early stopping, constrain the training procedure itself.
There are different types of regularization techniques, including:
L1 Regularization (Lasso): Adds a penalty term proportional to the absolute value of the coefficients. It encourages sparsity in the model by shrinking some coefficients to exactly zero, effectively performing feature selection.
L2 Regularization (Ridge): Adds a penalty term proportional to the square of the coefficients. It penalizes large coefficients and encourages them to be small, which often leads to more stable and well-conditioned models.
Elastic Net Regularization: Combines L1 and L2 regularization, providing a balance between feature selection and coefficient shrinkage.
Dropout: Commonly used in neural networks, dropout randomly omits units (along with their connections) during training, effectively preventing units from co-adapting too much.
Early Stopping: Monitors the model's performance on a validation set during training and stops the training process when the performance starts to degrade, thereby preventing the model from overfitting.
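The three penalty-based techniques in the list above can be written as small functions. This is a minimal sketch: the strength `lam` and the mixing weight `l1_ratio` are names chosen for illustration, and libraries differ in how they scale these terms (for example, some include a factor of 1/2 on the squared term).

```python
import numpy as np

def l1_penalty(w, lam):
    # Lasso: proportional to the sum of absolute values of the coefficients.
    return lam * np.sum(np.abs(w))

def l2_penalty(w, lam):
    # Ridge: proportional to the sum of squared coefficients.
    return lam * np.sum(w ** 2)

def elastic_net_penalty(w, lam, l1_ratio):
    # Weighted combination of the L1 and L2 terms (l1_ratio in [0, 1]).
    l1_term = np.sum(np.abs(w))
    l2_term = np.sum(w ** 2)
    return lam * (l1_ratio * l1_term + (1.0 - l1_ratio) * l2_term)

w = np.array([0.5, -2.0, 0.0, 1.5])
print(l1_penalty(w, lam=0.1))                         # 0.1 * 4.0
print(l2_penalty(w, lam=0.1))                         # 0.1 * 6.5
print(elastic_net_penalty(w, lam=0.1, l1_ratio=0.5))  # 0.1 * (2.0 + 3.25)
```

The penalty is added to the data loss, so the optimizer has to trade a better fit against larger coefficients.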

6 of 17

Regularization: Weight Decay

Regularization

Data loss

Regularization loss
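As a sketch of how the two terms combine, assume the usual L2 form of weight decay, in which the regularization loss is λ times the sum of squared weights, and assume a mean-squared-error data loss for a linear model. Both are assumptions made for this example; the symbol names `lam` and `lr` are illustrative.

```python
import numpy as np

def total_loss(w, X, y, lam):
    """Return (total, data_loss, reg_loss) for a linear model X @ w."""
    data_loss = np.mean((X @ w - y) ** 2)   # assumed data loss: mean squared error
    reg_loss = lam * np.sum(w ** 2)         # assumed regularization loss: L2 weight decay
    return data_loss + reg_loss, data_loss, reg_loss

def gradient_step(w, X, y, lam, lr):
    """One gradient-descent step on data loss + regularization loss."""
    grad_data = 2.0 * X.T @ (X @ w - y) / len(y)
    grad_reg = 2.0 * lam * w                # pulls every weight toward zero
    return w - lr * (grad_data + grad_reg)

X = np.eye(2)
y = np.array([1.0, 2.0])
w = np.array([1.0, 2.0])                    # fits the data exactly: data loss is 0

print(total_loss(w, X, y, lam=0.1))
print(gradient_step(w, X, y, lam=0.1, lr=0.5))
```

Even when the data loss is already zero, the regularization gradient still shrinks the weights at every step, which is where the name "weight decay" comes from.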


9 of 17

Regularization: Dropout

Dropout

Randomly drop units (along with their connections) during training
Each unit is dropped with a fixed probability p (the dropout rate), independently of other units
The hyper-parameter p needs to be chosen (tuned)

Often, between 20% and 50% of the units are dropped
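A minimal NumPy sketch of the mechanism follows. It assumes the common "inverted dropout" variant, which rescales the surviving activations during training so that nothing needs to change at test time; the slides do not say which variant is meant. Here `p_drop` denotes the probability of dropping a unit, and the function name is illustrative.

```python
import numpy as np

def dropout(activations, p_drop, rng, training=True):
    """Inverted dropout: zero each unit with probability p_drop during
    training and rescale the rest by 1 / (1 - p_drop)."""
    if not training or p_drop == 0.0:
        return activations               # no units are dropped at test time
    keep_prob = 1.0 - p_drop
    # Independent keep/drop decision for every unit.
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones(1000)
dropped = dropout(hidden, p_drop=0.5, rng=rng)
print("fraction of units dropped:", np.mean(dropped == 0.0))
print("mean activation after dropout:", dropped.mean())
```

The rescaling keeps the expected activation unchanged, so the next layer sees inputs of a similar scale with and without dropout.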

Slide credit: Hung-yi Lee – Deep Learning Tutorial


10 of 17

Regularization: Dropout

Dropout is a kind of ensemble learning

Using one mini-batch to train one network with a slightly different architecture


Slide credit: Hung-yi Lee – Deep Learning Tutorial


11 of 17

Regularization: Early Stopping

Early-stopping

During model training, use a validation set

E.g., hold out about 25% of the data for validation and train on the remaining 75%

Stop when the validation accuracy (or loss) has not improved for n consecutive epochs

The parameter n is called patience
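The stopping rule can be sketched as a small function over a recorded sequence of validation losses. This is an illustration only: in practice the loss is computed epoch by epoch inside the training loop, and the function name and return convention are assumptions.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops at the first epoch where the validation loss has not
    improved for `patience` consecutive epochs. If that never happens,
    training runs to the last epoch.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.9]
print(early_stopping_epoch(losses, patience=3))  # (5, 2)
```

A common companion step is to restore the weights saved at `best_epoch`, since the model at the stopping epoch is already `patience` epochs past its best validation loss.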

[Figure: validation curve with the "Stop training" point marked]


12 of 17

Batch Normalization


13 of 17

In machine learning, parameters and hyperparameters are crucial concepts related to the configuration and behavior of a model:
Parameters:
Parameters are the variables that the model learns during training.
They are the coefficients in linear regression, weights in neural networks, or the split points in decision trees, for example.
Parameters are directly learned from the training data through optimization algorithms like gradient descent.
The goal of training is to find the optimal values for these parameters that minimize the loss function and make the model perform well on unseen data.
Hyperparameters:
Hyperparameters are the configuration settings of the model that govern its learning process.
They are not learned from the data but are set before the learning process begins.
Examples of hyperparameters include the learning rate in gradient descent, the number of hidden layers in a neural network, the depth of a decision tree, or the regularization strength.
Hyperparameters are usually chosen based on prior knowledge, experience, or through hyperparameter tuning techniques like grid search, random search, or Bayesian optimization.
The choice of hyperparameters significantly affects the performance of the model and its ability to generalize to unseen data.
In summary, parameters are learned from the data during training, while hyperparameters are set before training begins and control how the model learns.
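The distinction can be seen in a small gradient-descent fit of a line. In this sketch, `w` and `b` are parameters (learned from the data), while `learning_rate` and `n_steps` are hyperparameters (fixed before training starts). The data and the chosen values are assumptions made for the example.

```python
import numpy as np

def fit_line(x, y, learning_rate, n_steps):
    """Fit y ~ w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0                      # parameters: updated during training
    for _ in range(n_steps):             # n_steps: hyperparameter
        error = w * x + b - y
        grad_w = 2.0 * np.mean(error * x)
        grad_b = 2.0 * np.mean(error)
        w -= learning_rate * grad_w      # learning_rate: hyperparameter
        b -= learning_rate * grad_b
    return w, b

x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + 1.0                        # noiseless line with slope 2, intercept 1
w, b = fit_line(x, y, learning_rate=0.5, n_steps=500)
print(w, b)
```

Changing `learning_rate` or `n_steps` changes how (and whether) the fit converges, but neither value is ever updated by the training loop itself.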

14 of 17

Hyper-parameter Tuning


15 of 17

Hyper-parameter Tuning

Grid search

Check all values in a range with a step value

Random search

Randomly sample values for the hyper-parameters
Often preferred to grid search

Bayesian hyper-parameter optimization

Is an active area of research
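Grid search and random search can be sketched with the standard library alone. The function `validation_score` below is a hypothetical stand-in for "train a model with these hyper-parameters and return its validation score"; its formula, the two hyper-parameters, and their ranges are assumptions made for the example.

```python
import itertools
import random

def validation_score(learning_rate, dropout_rate):
    # Hypothetical objective: highest at learning_rate=0.01, dropout_rate=0.3.
    return -((learning_rate - 0.01) ** 2) * 1e4 - (dropout_rate - 0.3) ** 2

def grid_search(lr_values, dropout_values):
    """Evaluate every combination of the listed values."""
    candidates = itertools.product(lr_values, dropout_values)
    return max(candidates, key=lambda c: validation_score(*c))

def random_search(n_trials, seed=0):
    """Evaluate n_trials randomly sampled combinations."""
    rng = random.Random(seed)
    candidates = [
        (10 ** rng.uniform(-4, -1),   # learning rate sampled on a log scale
         rng.uniform(0.2, 0.5))       # dropout rate sampled uniformly
        for _ in range(n_trials)
    ]
    return max(candidates, key=lambda c: validation_score(*c))

print(grid_search([0.001, 0.01, 0.1], [0.2, 0.3, 0.5]))
print(random_search(n_trials=20))
```

With the same budget, random search tries a different value of every hyper-parameter in each trial, whereas a grid revisits the same few values per axis; this is one common argument for preferring it.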


16 of 17

k-Fold Cross-Validation

Using k-fold cross-validation for hyper-parameter tuning is common when the size of the training data is small

It also leads to a better and less noisy estimate of the model performance by averaging the results across several folds

E.g., 5-fold cross-validation (see the figure on the next slide)

Split the training data into 5 equal folds
First use folds 2-5 for training and fold 1 for validation
Repeat by using fold 2 for validation, then fold 3, fold 4, and fold 5
Average the results over the 5 runs (for reporting purposes)
Once the best hyper-parameters are determined, evaluate the model on the test data
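The fold rotation described above can be sketched with NumPy index arrays. The helper names and the `score_fn` callback (standing in for "train on the training indices, evaluate on the validation indices") are assumptions made for this example; scikit-learn's `KFold` provides the same splitting in practice.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs, using each fold once for validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)      # shuffle before splitting
    folds = np.array_split(indices, k)        # k (nearly) equal folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

def cross_validate(score_fn, n_samples, k=5):
    """Average score_fn(train_idx, val_idx) over the k runs."""
    scores = [score_fn(tr, va) for tr, va in k_fold_indices(n_samples, k)]
    return float(np.mean(scores))

for train_idx, val_idx in k_fold_indices(n_samples=100, k=5):
    print(len(train_idx), "training samples,", len(val_idx), "validation samples")
```

The test set is not part of these splits: it is held back and used only once, after the hyper-parameters have been chosen.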


17 of 17

k-Fold Cross-Validation

Illustration of a 5-fold cross-validation

Picture from: https://scikit-learn.org/stable/modules/cross_validation.html
