Vanishing Gradient Problem

  • In some cases, during training, the gradients can become either very small (vanishing gradients) or very large (exploding gradients)
    • This results in very small or very large updates to the parameters
    • Possible remedies: adjust the learning rate, use ReLU activations, apply regularization, use LSTM units in RNNs
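A minimal sketch of why gradients can vanish, assuming a chain of scalar units with weight 1: by the chain rule, the gradient at the input is the product of the local derivatives, and the sigmoid derivative never exceeds 0.25. The function names, the depth of 20, and the input value 0.5 are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_chain_gradient(depth, x=0.5):
    """d(output)/d(input) for `depth` stacked sigmoid units (weights fixed to 1)."""
    grad, a = 1.0, x
    for _ in range(depth):
        s = sigmoid(a)
        grad *= s * (1.0 - s)  # local derivative of the sigmoid, at most 0.25
        a = s                  # the output feeds the next unit
    return grad

def relu_chain_gradient(depth, x=0.5):
    """Same chain with ReLU units; the local derivative is 1 for positive inputs."""
    grad, a = 1.0, x
    for _ in range(depth):
        grad *= 1.0 if a > 0 else 0.0
        a = max(a, 0.0)
    return grad

print("sigmoid chain:", sigmoid_chain_gradient(20))
print("ReLU chain:   ", relu_chain_gradient(20))
```

With a positive input the ReLU chain passes the gradient through unchanged, while the sigmoid chain shrinks it by at least a factor of 4 per layer. This is one reason ReLU activations appear among the remedies.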

Training Neural Networks

[Figure: deep network with outputs y1, y2, …, yM, annotated "Small gradients, learns very slowly"]

Slide credit: Hung-yi Lee – Deep Learning Tutorial

CS 404/504, Fall 2021

Generalization

  • Underfitting
    • The model is too “simple” to represent all the relevant class characteristics
    • E.g., model with too few parameters
    • Produces high error on the training set and high error on the validation set

  • Overfitting
    • The model is too “complex” and fits irrelevant characteristics (noise) in the data
    • E.g., model with too many parameters
    • Produces low error on the training set and high error on the validation set
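The effect of model capacity can be sketched by fitting polynomials of increasing degree to noisy data. The data-generating function, the noise level, the sample sizes, and the degrees 1, 3, and 12 are all assumptions chosen for illustration. A low degree typically gives high error on both sets, while a very high degree drives the training error down and often leaves a larger gap to the validation error.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a smooth function (hypothetical toy data)."""
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(3.0 * x) + rng.normal(0.0, 0.2, n)
    return x, y

x_train, y_train = make_data(20)
x_val, y_val = make_data(200)

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with coefficients `coeffs`."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}
for degree in (1, 3, 12):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares fit
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_val, y_val))
    print(f"degree {degree:2d}: train MSE {errors[degree][0]:.4f}, "
          f"validation MSE {errors[degree][1]:.4f}")
```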

Overfitting

  • Overfitting – a model with high capacity fits the noise in the data instead of the underlying relationship

  • The model may fit the training data very well, but fails to generalize to new examples (test or validation data)

  • Regularization is a set of techniques used in machine learning to prevent overfitting and improve the generalization ability of a model. Overfitting occurs when a model performs well on the training data but fails to generalize to unseen data. Many regularization methods add a penalty term to the model's objective function, discouraging the model from learning complex patterns that fit the noise in the training data rather than the underlying relationship; others, such as dropout and early stopping, constrain the training procedure instead.
  • Common regularization techniques include:
    • L1 Regularization (Lasso): Adds a penalty term proportional to the sum of the absolute values of the coefficients. It encourages sparsity by shrinking some coefficients to exactly zero, effectively performing feature selection.
    • L2 Regularization (Ridge): Adds a penalty term proportional to the sum of the squared coefficients. It penalizes large coefficients and encourages them to be small, which often leads to more stable and well-conditioned models.
    • Elastic Net Regularization: Combines L1 and L2 regularization, providing a balance between feature selection and coefficient shrinkage.
    • Dropout: Commonly used in neural networks, dropout randomly omits units (along with their connections) during training, preventing units from co-adapting too much.
    • Early Stopping: Monitors the model's performance on a validation set during training and stops training when that performance stops improving or starts to degrade, thereby limiting overfitting.
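The three penalty-based methods can be written as small functions of the coefficient vector. This is a sketch under assumptions: the strength `lam`, the mixing weight `l1_ratio`, and the example coefficients are hypothetical, and libraries differ in how they scale these terms.

```python
import numpy as np

def l1_penalty(w, lam):
    """Lasso-style penalty: lam times the sum of absolute values."""
    return lam * np.sum(np.abs(w))

def l2_penalty(w, lam):
    """Ridge-style penalty: lam times the sum of squares."""
    return lam * np.sum(w ** 2)

def elastic_net_penalty(w, lam, l1_ratio):
    """Weighted mix of the L1 and L2 penalties (l1_ratio between 0 and 1)."""
    return l1_ratio * l1_penalty(w, lam) + (1.0 - l1_ratio) * l2_penalty(w, lam)

w = np.array([0.5, -2.0, 0.0, 1.0])
print("L1:         ", l1_penalty(w, lam=0.1))
print("L2:         ", l2_penalty(w, lam=0.1))
print("Elastic net:", elastic_net_penalty(w, lam=0.1, l1_ratio=0.5))
```

The regularized objective is then the data loss plus one of these terms. The L2 term grows quadratically, so it weighs the coefficient of magnitude 2 far more heavily than the L1 term does.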

Regularization: Weight Decay

  • [Equation not recovered from the slide; its two terms were labeled "Data loss" and "Regularization loss"]
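The slide's formula did not survive extraction, so the sketch below assumes the common L2 form, in which the total loss is the data loss plus `lam * sum(w**2)`. The name "weight decay" comes from the gradient step under that assumption: the penalty contributes `2 * lam * w` to the gradient, so each update shrinks the weights toward zero. The function name and all numeric values are hypothetical.

```python
import numpy as np

def sgd_step_with_weight_decay(w, data_grad, lr, lam):
    """One gradient step on data_loss + lam * sum(w**2).

    The penalty's gradient is 2 * lam * w, so even with a zero data
    gradient the weights are multiplied by (1 - 2 * lr * lam).
    """
    return w - lr * (data_grad + 2.0 * lam * w)

w = np.array([1.0, -2.0])
# With no data gradient, only the decay term acts on the weights.
w_new = sgd_step_with_weight_decay(w, np.zeros_like(w), lr=0.1, lam=0.5)
print(w_new)
```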

Regularization: Dropout

  • Dropout
    • Randomly drop units (along with their connections) during training
    • Each unit is dropped with a fixed probability p (the dropout rate), independently of other units
    • The hyper-parameter p needs to be chosen (tuned)
      • Often, between 20% and 50% of the units are dropped
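A minimal NumPy sketch of dropout, treating p as the probability of dropping a unit, consistent with the 20% to 50% range above. It uses the "inverted dropout" convention (rescaling the surviving units during training so that nothing changes at test time), which is an assumption here rather than something the slide specifies. The function name and signature are hypothetical.

```python
import numpy as np

def dropout(activations, p, rng, training=True):
    """Inverted dropout, assuming 0 <= p < 1.

    During training each unit is zeroed with probability p, independently
    of the others, and the survivors are scaled by 1 / (1 - p) so that the
    expected activation is unchanged. At test time the input is returned as is.
    """
    if not training or p == 0.0:
        return activations
    keep_mask = rng.random(activations.shape) >= p  # True means the unit is kept
    return activations * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
hidden = np.ones(1000)
dropped = dropout(hidden, p=0.5, rng=rng)
print("fraction of units dropped:", float(np.mean(dropped == 0.0)))
print("mean activation:", float(dropped.mean()))
```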

Slide credit: Hung-yi Lee – Deep Learning Tutorial

Regularization: Dropout

  • Dropout is a kind of ensemble learning
    • Each mini-batch trains a network with a slightly different architecture, because a different random subset of units is dropped each time

[Figure: mini-batches 1, 2, 3, …, n, each training a different network]

Slide credit: Hung-yi Lee – Deep Learning Tutorial

Regularization: Early Stopping

  • Early-stopping
    • During model training, use a validation set
      • E.g., hold out about 25% of the data for validation and train on the remaining 75%
    • Stop when the validation accuracy (or loss) has not improved after n epochs
      • The parameter n is called patience
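The patience rule can be sketched as a loop over per-epoch validation losses. The function name and the list of losses are hypothetical; in a real training loop each loss would be computed after an epoch, and the model from the best epoch would typically be checkpointed and restored.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve on its best value
    for `patience` consecutive epochs, or when the sequence runs out.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.9]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"stopped at epoch {stop_epoch}, best epoch was {best_epoch}")
```

In this example the loss last improves at epoch 2 (counting from 0), so with a patience of 3 training stops after epoch 5.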

[Figure: validation curve with the point marked "Stop training"]

Batch Normalization

  • Parameters and hyperparameters both shape the configuration and behavior of a machine learning model, but they are set in different ways:
  • Parameters:
    • Parameters are the variables that the model learns during training.
    • Examples include the coefficients in linear regression, the weights in neural networks, and the split points in decision trees.
    • They are learned directly from the training data by an optimization algorithm such as gradient descent.
    • Training searches for parameter values that minimize the loss function, with the aim that the model also performs well on unseen data.
  • Hyperparameters:
    • Hyperparameters are the configuration settings that govern the model's learning process.
    • They are not learned from the data but are set before training begins.
    • Examples include the learning rate in gradient descent, the number of hidden layers in a neural network, the depth of a decision tree, and the regularization strength.
    • They are usually chosen from prior knowledge or experience, or by tuning techniques such as grid search, random search, or Bayesian optimization.
    • The choice of hyperparameters significantly affects the model's performance and its ability to generalize to unseen data.
  • In summary, parameters are learned from the data during training, while hyperparameters are set before training begins and control how the model learns.
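The distinction can be seen in a small gradient-descent fit of a line. In this sketch the learning rate and the number of epochs are hyperparameters fixed before training, while the slope `w` and intercept `b` are parameters learned from the data. The toy data and all numeric values are assumptions for illustration.

```python
import numpy as np

# Hyperparameters: chosen before training (values here are arbitrary).
learning_rate = 0.1
num_epochs = 500

# Toy data generated from y = 2x + 1, without noise.
x = np.linspace(-1.0, 1.0, 50)
y = 2.0 * x + 1.0

# Parameters: learned from the data, starting from an arbitrary initial guess.
w, b = 0.0, 0.0
for _ in range(num_epochs):
    error = w * x + b - y
    # Gradients of the mean squared error with respect to w and b.
    w -= learning_rate * 2.0 * np.mean(error * x)
    b -= learning_rate * 2.0 * np.mean(error)

print(f"learned w = {w:.4f}, b = {b:.4f}")
```

Changing `learning_rate` or `num_epochs` changes how, and whether, the fit converges, but those values are never updated by the training loop itself.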

Hyper-parameter Tuning

  • Grid search
    • Evaluate every value in a range, spaced by a fixed step
  • Random search
    • Randomly sample values for each hyper-parameter
    • Often preferred to grid search
  • Bayesian hyper-parameter optimization
    • An active area of research
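Grid search and random search can be compared on a toy problem. In the sketch below, `validation_score` is a hypothetical stand-in for "train a model with these hyper-parameters and return its validation score"; the two hyper-parameters, their ranges, and the number of trials are also assumptions.

```python
import itertools
import random

def validation_score(learning_rate, dropout_rate):
    """Hypothetical stand-in for training a model and scoring it on validation data."""
    return -1e4 * (learning_rate - 0.01) ** 2 - (dropout_rate - 0.3) ** 2

def grid_search(lr_values, dropout_values):
    """Evaluate every combination on the grid and return the best one."""
    candidates = itertools.product(lr_values, dropout_values)
    return max(candidates, key=lambda c: validation_score(*c))

def random_search(num_trials, rng):
    """Sample each hyper-parameter independently and return the best sample."""
    candidates = [
        (10 ** rng.uniform(-4, -1),   # learning rate, sampled on a log scale
         rng.uniform(0.2, 0.5))       # dropout rate, sampled uniformly
        for _ in range(num_trials)
    ]
    return max(candidates, key=lambda c: validation_score(*c))

best_grid = grid_search([0.001, 0.01, 0.1], [0.2, 0.3, 0.4, 0.5])
best_random = random_search(20, random.Random(0))
print("grid search:  ", best_grid)
print("random search:", best_random)
```

The grid tries only three distinct learning rates in twelve evaluations, whereas random search tries a new value of every hyper-parameter on each trial. That is one common argument for preferring random search when only a few hyper-parameters matter.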

k-Fold Cross-Validation

  • Using k-fold cross-validation for hyper-parameter tuning is common when the size of the training data is small
    • It also leads to a better and less noisy estimate of the model performance by averaging the results across several folds
  • E.g., 5-fold cross-validation (see the figure on the next slide)
    1. Split the training data into 5 equal folds
    2. First use folds 2-5 for training and fold 1 for validation
    3. Repeat by using fold 2 for validation, then fold 3, fold 4, and fold 5
    4. Average the results over the 5 runs (for reporting purposes)
    5. Once the best hyper-parameters are determined, evaluate the model on the test data
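Steps 1 to 4 can be sketched in NumPy as follows. The helper names, the `fit` and `score` callables, and the toy data are assumptions for illustration; the held-out test data from step 5 is deliberately not touched here.

```python
import numpy as np

def k_fold_indices(num_samples, k, rng):
    """Yield (train_idx, val_idx) pairs; each fold serves once as validation."""
    folds = np.array_split(rng.permutation(num_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

def cross_validate(x, y, k, fit, score, rng):
    """Average the validation score over the k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(x), k, rng):
        model = fit(x[train_idx], y[train_idx])
        scores.append(score(model, x[val_idx], y[val_idx]))
    return float(np.mean(scores))

# Toy usage: fit a line to noise-free data and report the mean validation MSE.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0
fit = lambda xs, ys: np.polyfit(xs, ys, 1)
score = lambda coeffs, xs, ys: float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))
mean_val_mse = cross_validate(x, y, 5, fit, score, rng)
print("mean validation MSE over 5 folds:", mean_val_mse)
```

For hyper-parameter tuning, `cross_validate` would be called once per candidate setting and the setting with the best averaged score kept.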

k-Fold Cross-Validation

  • Illustration of a 5-fold cross-validation
