TMBA 25th FW Track

Machine Learning & Trading

Interpretable ML (III)

Mentor | Yu-Chen (Abner) Den

Date | Nov 10th, 2024

©2024 Yu-Chen Den, SinoPac Holdings | National Taiwan University

Outline

  • Preface
  • More NN
  • Deep Dive into PyTorch

Recall Last Part of Interpretable ML (II)

  • Gradient Descent
  • Activation Functions

Outline

  • Preface
  • More NN
  • Deep Dive into PyTorch

Backpropagation – Fast Convergence

  • An efficient way to compute the gradients that gradient descent needs
  • Not a new optimization algorithm
  • Backpropagation uses the chain rule to avoid repeatedly computing the gradient of each weight from scratch (a minimal autograd sketch follows below)
  • Forward pass
    • Compute ∂z/∂w for all parameters, where z is the input to an activation function
  • Backward pass
    • Compute ∂L/∂z for all activation function inputs, propagating from the output layer backwards
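To make this concrete, here is a minimal sketch with PyTorch autograd: the forward pass builds the computation graph and loss.backward() applies the chain rule to fill in every gradient. The two-layer network, its sizes, and the squared-error loss are illustrative assumptions, not the exact architecture from the slide.

import torch

# Toy two-layer network: the forward pass records the graph,
# backward() walks it with the chain rule.
x = torch.randn(4, 3)                       # mini-batch of 4 samples, 3 features (assumed sizes)
y = torch.randn(4, 1)                       # targets
w1 = torch.randn(3, 8, requires_grad=True)  # first-layer weights
w2 = torch.randn(8, 1, requires_grad=True)  # second-layer weights

z1 = x @ w1              # forward pass: pre-activation z of layer 1
h1 = torch.sigmoid(z1)   # activation
y_hat = h1 @ w2          # output layer
loss = ((y_hat - y) ** 2).mean()

loss.backward()          # backward pass: chain rule fills w1.grad and w2.grad
print(w1.grad.shape, w2.grad.shape)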

Backpropagation (Cont’d)

  • Using the model architecture from above

Optimizer

  • Algorithms that use the gradients from backpropagation to update the parameters (e.g., weights & biases)
  • Goal: minimize the loss function
  • Different types of optimizers
    • Stochastic Gradient Descent (SGD)
    • Adaptive Moment Estimation (Adam)
    • Adam with decoupled weight decay (AdamW)
  • Learning rate (lr) is an important hyperparameter in every optimizer
    • It controls how large each gradient step is (see the sketch below)
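A minimal sketch of constructing these three optimizers in PyTorch; the model, learning rates, momentum, and weight-decay values are illustrative assumptions.

import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # any nn.Module works here

# The same parameter list can be handed to different optimizers
sgd   = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
adam  = torch.optim.Adam(model.parameters(), lr=1e-3)
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)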

How Learning Rate Affects Model Convergence

  • In stochastic gradient descent (SGD), each update moves the weights by the gradient scaled by the learning rate
    • A learning rate that is too small leads to slow convergence
    • A learning rate that is too high overshoots the local / global minima, so the model cannot converge

image source: https://www.jeremyjordan.me/nn-learning-rate/
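A tiny illustrative experiment (my own toy example, not from the slide): running plain gradient descent on f(w) = w² with three learning rates reproduces the slow / good / diverging behaviors shown in the figure.

# Gradient descent on f(w) = w**2, whose gradient is 2*w
def run(lr: float, steps: int = 20) -> float:
    w = 5.0
    for _ in range(steps):
        w = w - lr * 2 * w  # one gradient step
    return w

print(run(0.01))  # too small: still far from the minimum at 0
print(run(0.3))   # reasonable: converges close to 0
print(run(1.1))   # too large: |w| grows every step, no convergence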

Regularization in NN – Normalization

  • Helps reduce the effect of differing scales among input features and limits the impact of outliers
  • Improves convergence by keeping the weights and activations in a reasonable range
  • Two normalization methods that are easy to confuse
    • Batch normalization
    • Layer normalization

image source: https://arxiv.org/pdf/1803.08494

Batch Normalization

  • Batch normalization provides a way to control the distribution of activations after each layer, which mitigates Internal Covariate Shift (ICS) & vanishing gradients
  • Batch normalization standardizes each feature across the data points of a mini-batch to zero mean and unit variance (followed by a learnable scale & shift, see Appendix I), as sketched below
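A minimal PyTorch sketch; the feature count and batch size are illustrative assumptions.

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=4)   # one mean/variance (and gamma/beta) per feature
x = torch.randn(32, 4) * 10 + 3       # batch of 32 samples with badly scaled features

y = bn(x)                             # normalized across the batch dimension (training mode)
print(y.mean(dim=0))                  # per-feature means: roughly 0
print(y.std(dim=0))                   # per-feature stds: roughly 1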

Layer Normalization

  • Batch normalization is hard to apply to RNNs / Seq2Seq models
    • Data fed to RNNs usually has time dependency, and normalizing within each batch disrupts this information
    • The sequence length varies between samples
    • With a small batch size, the batch statistics are too noisy for batch normalization to be useful
  • Layer normalization normalizes the features of each individual timestep / sample, not each mini-batch (see the sketch below)
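A minimal PyTorch sketch for contrast with batch normalization; the (batch, timestep, feature) sizes are illustrative assumptions.

import torch
import torch.nn as nn

x = torch.randn(8, 20, 16)            # (batch, timestep, feature), e.g. an RNN input

ln = nn.LayerNorm(normalized_shape=16)
y = ln(x)                             # each 16-feature vector is normalized on its own
print(y.mean(dim=-1).abs().max())     # per-position means: roughly 0, independent of batch size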

Regularization in NN – Dropout

  • Avoids overfitting by randomly zeroing out a percentage of neurons during training (see the sketch below)
  • Why does it help?
    • Ensemble
      • Every training step randomly drops different neurons, so we effectively train a slightly different sub-network each time; this is loosely similar to ensemble methods in machine learning
    • Co-adaptation
      • Prevents a neuron in the second layer from depending too heavily on any single neuron in the first layer
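A minimal PyTorch sketch; the dropout probability and tensor size are illustrative assumptions.

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # each element is zeroed with probability 0.5
x = torch.ones(2, 8)

drop.train()               # training mode: random mask, survivors scaled by 1/(1-p)
print(drop(x))

drop.eval()                # eval mode: dropout becomes a no-op
print(drop(x))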

Supplementary – Why Coding Style Matters

  • Readability
    • A proper coding style makes it easier for others (or yourself, several years later) to understand and maintain your code.
  • Collaboration
    • Makes it easier for your reviewer to review your code.
    • Your collaborators / reviewers will hate you if you don’t follow a proper coding style.
  • Maintainability
    • You can make changes without introducing errors.

The Only Proper Measure for Code Quality

  • WTFs / minute

Naming Rules & Type Annotation

  • Naming Rules
    • Variable: predict_output = 1
    • Constant: LEARNING_RATE = 1e-5
    • Function: def train_session(*args, **kwargs) -> None:
    • Class: class BertForRec(nn.Module):
  • Type Annotation (a short sketch follows below)
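A short sketch of type annotations in this naming style; the build_optimizer function and its defaults are hypothetical examples, not from the course code base.

import torch
import torch.nn as nn

LEARNING_RATE: float = 1e-5  # constants in UPPER_SNAKE_CASE

def build_optimizer(model: nn.Module, lr: float = LEARNING_RATE) -> torch.optim.Optimizer:
    """Annotated arguments and return type document the function's contract."""
    return torch.optim.AdamW(model.parameters(), lr=lr)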

Outline

  • Preface
  • More NN
  • Deep Dive into PyTorch

Training Neural Networks in PyTorch

Prerequisite - torch.Tensor

  • Tensors
    • High-dimensional matrices (arrays)
    • 1-D tensor: e.g. a word vector
    • 2-D tensor: e.g. tabular data
    • 3-D tensor: e.g. an RGB image
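A quick sketch of one tensor of each rank; the concrete sizes are illustrative assumptions.

import torch

word_vec = torch.randn(300)          # 1-D tensor, e.g. a 300-dimensional word vector
table    = torch.randn(100, 8)       # 2-D tensor, e.g. 100 rows x 8 columns of tabular data
image    = torch.randn(3, 224, 224)  # 3-D tensor, e.g. an RGB image (channels, height, width)

print(word_vec.dim(), table.dim(), image.dim())  # 1 2 3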

torch.Tensor – Creating Tensors

  • Directly from data (a list or numpy.ndarray)

x = torch.tensor([[1, 2], [3, 4]])
x = torch.from_numpy(np.array([[1, 2], [3, 4]]))

  • Tensors of constant zeros & ones

x = torch.zeros([2, 2])
x = torch.ones([2, 2])

  • Check the shape of a tensor with x.shape
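A small follow-up on shapes (my own example):

import numpy as np
import torch

x = torch.from_numpy(np.array([[1, 2], [3, 4]]))
print(x.shape)   # torch.Size([2, 2])
print(x.dtype)   # torch.int64 on most platforms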

torch.Tensor – Further

  • Tensors largely operate like numpy.ndarray
    • Numerical operations
    • Matrix transpose
    • Squeeze / Unsqueeze
    • Concatenation, Stack
  • I won’t talk too much about torch.Tensor, just explore it on your own (a few examples below)!
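A few of those operations as a quick sketch; the shapes are my own illustrative choices.

import torch

a = torch.randn(2, 3)
b = torch.randn(2, 3)

print((a + b).shape)                     # elementwise numerical operation -> (2, 3)
print(a.T.shape)                         # matrix transpose -> (3, 2)
print(a.unsqueeze(0).shape)              # add a dimension -> (1, 2, 3)
print(a.unsqueeze(0).squeeze(0).shape)   # remove it again -> (2, 3)
print(torch.cat([a, b], dim=0).shape)    # concatenation -> (4, 3)
print(torch.stack([a, b], dim=0).shape)  # stack creates a new dimension -> (2, 2, 3)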

Training & Testing Neural Networks in PyTorch

torch.utils.data.Dataset

torch.utils.data.DataLoader

Load Data – torch.utils.data.Dataset

  • torch.utils.data.Dataset
    • Stores data samples, including features and targets (labels)
    • Performs data transformation
  • Map-style dataset
    • Implements the __getitem__() and __len__() protocols, and represents a map from indices / keys to data samples
  • Iterable-style dataset
    • Implements the __iter__() protocol, and represents an iterable over data samples

torch.utils.data.Dataset – Implement a Map-style Dataset

  • __init__(): load data & preprocess
  • __len__(): return the length of the dataset
  • __getitem__(): return the specific feature & target pair (see the sketch below)
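The code from the original slide is not in the extracted text, so here is a minimal map-style Dataset sketch matching the three annotations above; the TabularDataset name and the float32 tabular features / targets are illustrative assumptions.

import numpy as np
import torch
from torch.utils.data import Dataset

class TabularDataset(Dataset):
    def __init__(self, features: np.ndarray, targets: np.ndarray) -> None:
        # Load data & preprocess (here: just convert to float32 tensors)
        self.features = torch.tensor(features, dtype=torch.float32)
        self.targets = torch.tensor(targets, dtype=torch.float32)

    def __len__(self) -> int:
        # Return the length of the dataset
        return len(self.features)

    def __getitem__(self, idx: int) -> tuple[torch.Tensor, torch.Tensor]:
        # Return the specific feature & target pair
        return self.features[idx], self.targets[idx]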

Load Data – torch.utils.data.DataLoader

  • Wraps a Dataset and serves mini-batches of a given batch size (see the sketch below)
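The slide's code is not in the extracted text; a minimal sketch with an assumed batch size of 32, using a random TensorDataset as a stand-in for the map-style dataset from the previous page.

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 8), torch.randn(1000, 1))  # stand-in data

train_loader = DataLoader(
    dataset,
    batch_size=32,   # number of samples per mini-batch
    shuffle=True,    # reshuffle the training data every epoch
)

features, targets = next(iter(train_loader))
print(features.shape)  # torch.Size([32, 8])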

torch.utils.data.DataLoader – parameters

  • Batch size
    • A smaller batch size often generalizes better but can be slower to train
    • A bigger batch size is more computationally efficient but more likely to overfit
  • Shuffle
    • Shuffle the training set; do not shuffle the validation & test sets
      • We want to preserve the order of time series / sequence data
      • Shuffling introduces randomness, which we don’t want at inference time

Build PyTorch Model – torch.nn.Module

  • Every PyTorch model and layer is built on torch.nn.Module (a minimal sketch follows below)
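The slide's model code is not in the extracted text; a minimal nn.Module sketch with assumed layer sizes.

import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim: int = 8, hidden_dim: int = 32, out_dim: int = 1) -> None:
        super().__init__()
        # Layers assigned as attributes are registered as submodules,
        # so their parameters are tracked automatically
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=0.1),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Called when we write model(x)
        return self.net(x)

model = MLP()
print(sum(p.numel() for p in model.parameters()))  # number of trainable parameters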

Training process in PyTorch

  • Set the model to different modes
    • model.train()
      • Enables training behavior: dropout is active and batch normalization uses (and updates) batch statistics
    • model.eval()
      • Disables dropout and makes batch normalization use its fixed running statistics
      • If we forget model.eval() at inference time, dropout keeps firing and the batch-norm statistics keep updating, so predictions are not produced by the fixed, already-trained behavior we want

Some important methods in the training process

  • optimizer.zero_grad()
    • Clear the gradient values stored from the previous step
  • loss.backward()
    • Run backpropagation to compute the gradients
  • optimizer.step()
    • Update the parameters based on the gradients and the learning rate
  • with torch.no_grad():
    • Turn off gradient tracking during inference
  • A minimal loop combining these calls is sketched below
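A minimal sketch tying these calls together with model.train() / model.eval(); the tiny model, random regression data, loss, optimizer, and epoch count are illustrative assumptions.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-ins: a tiny model and a random regression dataset
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Dropout(p=0.1), nn.Linear(32, 1))
train_loader = DataLoader(TensorDataset(torch.randn(256, 8), torch.randn(256, 1)),
                          batch_size=32, shuffle=True)

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(3):
    model.train()                        # dropout / batch norm in training behavior
    for features, targets in train_loader:
        optimizer.zero_grad()            # clear gradients from the previous step
        loss = criterion(model(features), targets)
        loss.backward()                  # backpropagation computes the gradients
        optimizer.step()                 # update parameters using the learning rate

    model.eval()                         # fixed behavior for inference
    with torch.no_grad():                # no gradient tracking needed here
        val_loss = criterion(model(torch.randn(64, 8)), torch.randn(64, 1))
    print(f"epoch {epoch}: train_loss={loss.item():.4f}, val_loss={val_loss.item():.4f}")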

Loss functions

  • torch.nn.MSELoss()
  • torch.nn.CrossEntropyLoss()
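A quick sketch of what each loss expects; the shapes and class count are my own illustrative choices.

import torch
import torch.nn as nn

# MSELoss: regression, prediction and target share the same shape
mse = nn.MSELoss()
pred = torch.randn(4, 1)
target = torch.randn(4, 1)
print(mse(pred, target))

# CrossEntropyLoss: classification, takes raw logits and integer class labels
ce = nn.CrossEntropyLoss()
logits = torch.randn(4, 3)             # 4 samples, 3 classes (no softmax needed)
labels = torch.tensor([0, 2, 1, 2])
print(ce(logits, labels))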

Move on to the code base

  • Let’s see the code structure!

Preview of Next Course

  • Early Stopping
  • Build an LSTM model with PyTorch

Appendix I – Internal Covariate Shift

  • Covariate Shift
    • The distributions of the training dataset & testing dataset are different
  • Internal Covariate Shift
    • The distribution of each hidden layer's inputs keeps shifting as earlier layers' weights are updated, which leads to slow convergence
    • Batch Normalization deals with internal covariate shift because it normalizes each layer's inputs
      • Instead of just standardizing the values in a mini-batch, it also uses two learnable parameters to scale & shift the normalized data (see the formulas below)
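For reference, the standard batch-normalization transform behind this description, with learnable scale γ and shift β over a mini-batch of m values:

\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta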
