TMBA 25th FW Track

Machine Learning & Trading

Interpretable ML (III)

Mentor | Yu-Chen (Abner) Den

Date | Nov 10th, 2024

©2024 Yu-Chen Den, SinoPac Holdings | National Taiwan University

Outline

  • Preface
  • More NN
  • Deep Dive into PyTorch

Recall Last Part of Interpretable ML (II)

  • Gradient Descent
  • Activation Functions

Outline

  • Preface
  • More NN
  • Deep Dive into PyTorch

Backpropagation – Fast Convergence

  • An efficient way to compute the gradients that gradient descent needs
  • Not a new optimization algorithm
  • Backpropagation uses the chain rule to avoid repeatedly computing the gradient of each weight from scratch (a minimal autograd sketch follows below)
  • Forward pass
    • Compute ∂z/∂w for all parameters, where z is the input to an activation function
  • Backward pass
    • Compute ∂L/∂z for all activation function inputs, propagating from the output layer backwards
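To make this concrete, here is a minimal sketch with PyTorch autograd: the forward pass builds the computation graph and loss.backward() applies the chain rule to fill in every gradient. The two-layer network, its sizes, and the squared-error loss are illustrative assumptions, not the exact architecture from the slide.

import torch

# Toy two-layer network: the forward pass records the graph,
# backward() walks it with the chain rule.
x = torch.randn(4, 3)                       # mini-batch of 4 samples, 3 features (assumed sizes)
y = torch.randn(4, 1)                       # targets
w1 = torch.randn(3, 8, requires_grad=True)  # first-layer weights
w2 = torch.randn(8, 1, requires_grad=True)  # second-layer weights

z1 = x @ w1              # forward pass: pre-activation z of layer 1
h1 = torch.sigmoid(z1)   # activation
y_hat = h1 @ w2          # output layer
loss = ((y_hat - y) ** 2).mean()

loss.backward()          # backward pass: chain rule fills w1.grad and w2.grad
print(w1.grad.shape, w2.grad.shape)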

Backpropagation (Cont’d)

  • Using the model architecture from above

Optimizer

  • Algorithms that use the gradients from backpropagation to update the parameters (e.g., weights & biases)
  • Goal: minimize the loss function
  • Different types of optimizers
    • Stochastic Gradient Descent (SGD)
    • Adaptive Moment Estimation (Adam)
    • Adam with decoupled weight decay (AdamW)
  • Learning rate (lr) is an important hyperparameter in every optimizer
    • It controls how large each gradient step is (see the sketch below)
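A minimal sketch of constructing these three optimizers in PyTorch; the model, learning rates, momentum, and weight-decay values are illustrative assumptions.

import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # any nn.Module works here

# The same parameter list can be handed to different optimizers
sgd   = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
adam  = torch.optim.Adam(model.parameters(), lr=1e-3)
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)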

How Learning Rate Affects Model Convergence

  • In stochastic gradient descent (SGD), each update moves the weights by the gradient scaled by the learning rate
    • A learning rate that is too small leads to slow convergence
    • A learning rate that is too high overshoots the local / global minima, so the model cannot converge

image source: https://www.jeremyjordan.me/nn-learning-rate/
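A tiny illustrative experiment (my own toy example, not from the slide): running plain gradient descent on f(w) = w² with three learning rates reproduces the slow / good / diverging behaviors shown in the figure.

# Gradient descent on f(w) = w**2, whose gradient is 2*w
def run(lr: float, steps: int = 20) -> float:
    w = 5.0
    for _ in range(steps):
        w = w - lr * 2 * w  # one gradient step
    return w

print(run(0.01))  # too small: still far from the minimum at 0
print(run(0.3))   # reasonable: converges close to 0
print(run(1.1))   # too large: |w| grows every step, no convergence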

Regularization in NN – Normalization

  • Helps reduce the effect of differing scales among input features and limits the impact of outliers
  • Improves convergence by keeping the weights and activations in a reasonable range
  • Two normalization methods that are easy to confuse
    • Batch normalization
    • Layer normalization

image source: https://arxiv.org/pdf/1803.08494

Batch Normalization

  • Batch normalization provides a way to control the distribution of activations after each layer, which mitigates Internal Covariate Shift (ICS) & vanishing gradients
  • Batch normalization standardizes each feature across the data points of a mini-batch to zero mean and unit variance (followed by a learnable scale & shift, see Appendix I), as sketched below
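A minimal PyTorch sketch; the feature count and batch size are illustrative assumptions.

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=4)   # one mean/variance (and gamma/beta) per feature
x = torch.randn(32, 4) * 10 + 3       # batch of 32 samples with badly scaled features

y = bn(x)                             # normalized across the batch dimension (training mode)
print(y.mean(dim=0))                  # per-feature means: roughly 0
print(y.std(dim=0))                   # per-feature stds: roughly 1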

Layer Normalization

  • Batch normalization is hard to apply to RNNs / Seq2Seq models
    • Data fed to RNNs usually has time dependency, and normalizing within each batch disrupts this information
    • The sequence length varies between samples
    • With a small batch size, the batch statistics are too noisy for batch normalization to be useful
  • Layer normalization normalizes the features of each individual timestep / sample, not each mini-batch (see the sketch below)
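A minimal PyTorch sketch for contrast with batch normalization; the (batch, timestep, feature) sizes are illustrative assumptions.

import torch
import torch.nn as nn

x = torch.randn(8, 20, 16)            # (batch, timestep, feature), e.g. an RNN input

ln = nn.LayerNorm(normalized_shape=16)
y = ln(x)                             # each 16-feature vector is normalized on its own
print(y.mean(dim=-1).abs().max())     # per-position means: roughly 0, independent of batch size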

Regularization in NN – Dropout

  • Avoids overfitting by randomly zeroing out a percentage of neurons during training (see the sketch below)
  • Why does it help?
    • Ensemble
      • Every training step randomly drops different neurons, so we effectively train a slightly different sub-network each time; this is loosely similar to ensemble methods in machine learning
    • Co-adaptation
      • Prevents a neuron in the second layer from depending too heavily on any single neuron in the first layer
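A minimal PyTorch sketch; the dropout probability and tensor size are illustrative assumptions.

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # each element is zeroed with probability 0.5
x = torch.ones(2, 8)

drop.train()               # training mode: random mask, survivors scaled by 1/(1-p)
print(drop(x))

drop.eval()                # eval mode: dropout becomes a no-op
print(drop(x))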

Supplementary – Why Coding Style Matters

  • Readability
    • A proper coding style makes it easier for others (or yourself, several years later) to understand and maintain your code.
  • Collaboration
    • Makes it easier for your reviewer to review your code.
    • Your collaborators / reviewers will hate you if you don’t follow a proper coding style.
  • Maintainability
    • You can make changes without introducing errors.

The Only Proper Measure for Code Quality

  • WTFs / minute

Naming Rules & Type Annotation

  • Naming Rules
    • Variable: predict_output = 1
    • Constant: LEARNING_RATE = 1e-5
    • Function: def train_session(*args, **kwargs) -> None:
    • Class: class BertForRec(nn.Module):
  • Type Annotation (a short sketch follows below)
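A short sketch of type annotations in this naming style; the build_optimizer function and its defaults are hypothetical examples, not from the course code base.

import torch
import torch.nn as nn

LEARNING_RATE: float = 1e-5  # constants in UPPER_SNAKE_CASE

def build_optimizer(model: nn.Module, lr: float = LEARNING_RATE) -> torch.optim.Optimizer:
    """Annotated arguments and return type document the function's contract."""
    return torch.optim.AdamW(model.parameters(), lr=lr)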

Outline

  • Preface
  • More NN
  • Deep Dive into PyTorch

Training Neural Networks in PyTorch

Prerequisite - torch.Tensor

  • Tensors
    • High-dimensional matrices (arrays)
    • 1-D tensor: e.g. a word vector
    • 2-D tensor: e.g. tabular data
    • 3-D tensor: e.g. an RGB image
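A quick sketch of one tensor of each rank; the concrete sizes are illustrative assumptions.

import torch

word_vec = torch.randn(300)          # 1-D tensor, e.g. a 300-dimensional word vector
table    = torch.randn(100, 8)       # 2-D tensor, e.g. 100 rows x 8 columns of tabular data
image    = torch.randn(3, 224, 224)  # 3-D tensor, e.g. an RGB image (channels, height, width)

print(word_vec.dim(), table.dim(), image.dim())  # 1 2 3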

torch.Tensor – Creating Tensors

  • Directly from data (a list or numpy.ndarray)

x = torch.tensor([[1, 2], [3, 4]])
x = torch.from_numpy(np.array([[1, 2], [3, 4]]))

  • Tensors of constant zeros & ones

x = torch.zeros([2, 2])
x = torch.ones([2, 2])

  • Check the shape of a tensor with x.shape
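A small follow-up on shapes (my own example):

import numpy as np
import torch

x = torch.from_numpy(np.array([[1, 2], [3, 4]]))
print(x.shape)   # torch.Size([2, 2])
print(x.dtype)   # torch.int64 on most platforms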

torch.Tensor – Further

  • Tensors largely operate like numpy.ndarray
    • Numerical operations
    • Matrix transpose
    • Squeeze / Unsqueeze
    • Concatenation, Stack
  • I won’t talk too much about torch.Tensor, just explore it on your own (a few examples below)!
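A few of those operations as a quick sketch; the shapes are my own illustrative choices.

import torch

a = torch.randn(2, 3)
b = torch.randn(2, 3)

print((a + b).shape)                     # elementwise numerical operation -> (2, 3)
print(a.T.shape)                         # matrix transpose -> (3, 2)
print(a.unsqueeze(0).shape)              # add a dimension -> (1, 2, 3)
print(a.unsqueeze(0).squeeze(0).shape)   # remove it again -> (2, 3)
print(torch.cat([a, b], dim=0).shape)    # concatenation -> (4, 3)
print(torch.stack([a, b], dim=0).shape)  # stack creates a new dimension -> (2, 2, 3)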

Training & Testing Neural Networks in PyTorch

torch.utils.data.Dataset

torch.utils.data.DataLoader

Load Data – torch.utils.data.Dataset

  • torch.utils.data.Dataset
    • Stores data samples, including features and targets (labels)
    • Performs data transformation
  • Map-style dataset
    • Implements the __getitem__() and __len__() protocols, and represents a map from indices / keys to data samples
  • Iterable-style dataset
    • Implements the __iter__() protocol, and represents an iterable over data samples

torch.utils.data.Dataset – Implement a Map-style Dataset

  • __init__(): load data & preprocess
  • __len__(): return the length of the dataset
  • __getitem__(): return the specific feature & target pair (see the sketch below)
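The code from the original slide is not in the extracted text, so here is a minimal map-style Dataset sketch matching the three annotations above; the TabularDataset name and the float32 tabular features / targets are illustrative assumptions.

import numpy as np
import torch
from torch.utils.data import Dataset

class TabularDataset(Dataset):
    def __init__(self, features: np.ndarray, targets: np.ndarray) -> None:
        # Load data & preprocess (here: just convert to float32 tensors)
        self.features = torch.tensor(features, dtype=torch.float32)
        self.targets = torch.tensor(targets, dtype=torch.float32)

    def __len__(self) -> int:
        # Return the length of the dataset
        return len(self.features)

    def __getitem__(self, idx: int) -> tuple[torch.Tensor, torch.Tensor]:
        # Return the specific feature & target pair
        return self.features[idx], self.targets[idx]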

Load Data – torch.utils.data.DataLoader

  • Wraps a Dataset and serves mini-batches of a given batch size (see the sketch below)
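The slide's code is not in the extracted text; a minimal sketch with an assumed batch size of 32, using a random TensorDataset as a stand-in for the map-style dataset from the previous page.

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 8), torch.randn(1000, 1))  # stand-in data

train_loader = DataLoader(
    dataset,
    batch_size=32,   # number of samples per mini-batch
    shuffle=True,    # reshuffle the training data every epoch
)

features, targets = next(iter(train_loader))
print(features.shape)  # torch.Size([32, 8])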

torch.utils.data.DataLoader – parameters

  • Batch size
    • A smaller batch size often generalizes better but can be slower to train
    • A bigger batch size is more computationally efficient but more likely to overfit
  • Shuffle
    • Shuffle the training set; do not shuffle the validation & test sets
      • We want to preserve the order of time series / sequence data
      • Shuffling introduces randomness, which we don’t want at inference time

Build PyTorch Model – torch.nn.Module

  • Every PyTorch model and layer is built on torch.nn.Module (a minimal sketch follows below)
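The slide's model code is not in the extracted text; a minimal nn.Module sketch with assumed layer sizes.

import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim: int = 8, hidden_dim: int = 32, out_dim: int = 1) -> None:
        super().__init__()
        # Layers assigned as attributes are registered as submodules,
        # so their parameters are tracked automatically
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=0.1),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Called when we write model(x)
        return self.net(x)

model = MLP()
print(sum(p.numel() for p in model.parameters()))  # number of trainable parameters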

Training process in PyTorch

  • Set the model to different modes
    • model.train()
      • Enables training behavior: dropout is active and batch normalization uses (and updates) batch statistics
    • model.eval()
      • Disables dropout and makes batch normalization use its fixed running statistics
      • If we forget model.eval() at inference time, dropout keeps firing and the batch-norm statistics keep updating, so predictions are not produced by the fixed, already-trained behavior we want

Some important methods in the training process

  • optimizer.zero_grad()
    • Clear the gradient values stored from the previous step
  • loss.backward()
    • Run backpropagation to compute the gradients
  • optimizer.step()
    • Update the parameters based on the gradients and the learning rate
  • with torch.no_grad():
    • Turn off gradient tracking during inference
  • A minimal loop combining these calls is sketched below
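A minimal sketch tying these calls together with model.train() / model.eval(); the tiny model, random regression data, loss, optimizer, and epoch count are illustrative assumptions.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-ins: a tiny model and a random regression dataset
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Dropout(p=0.1), nn.Linear(32, 1))
train_loader = DataLoader(TensorDataset(torch.randn(256, 8), torch.randn(256, 1)),
                          batch_size=32, shuffle=True)

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(3):
    model.train()                        # dropout / batch norm in training behavior
    for features, targets in train_loader:
        optimizer.zero_grad()            # clear gradients from the previous step
        loss = criterion(model(features), targets)
        loss.backward()                  # backpropagation computes the gradients
        optimizer.step()                 # update parameters using the learning rate

    model.eval()                         # fixed behavior for inference
    with torch.no_grad():                # no gradient tracking needed here
        val_loss = criterion(model(torch.randn(64, 8)), torch.randn(64, 1))
    print(f"epoch {epoch}: train_loss={loss.item():.4f}, val_loss={val_loss.item():.4f}")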

Loss functions

  • torch.nn.MSELoss()
  • torch.nn.CrossEntropyLoss()
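A quick sketch of what each loss expects; the shapes and class count are my own illustrative choices.

import torch
import torch.nn as nn

# MSELoss: regression, prediction and target share the same shape
mse = nn.MSELoss()
pred = torch.randn(4, 1)
target = torch.randn(4, 1)
print(mse(pred, target))

# CrossEntropyLoss: classification, takes raw logits and integer class labels
ce = nn.CrossEntropyLoss()
logits = torch.randn(4, 3)             # 4 samples, 3 classes (no softmax needed)
labels = torch.tensor([0, 2, 1, 2])
print(ce(logits, labels))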

Move on to the code base

  • Let’s see the code structure!

Preview of Next Course

  • Early Stopping
  • Build an LSTM model with PyTorch

Appendix I – Internal Covariate Shift

  • Covariate Shift
    • The distributions of the training dataset & testing dataset are different
  • Internal Covariate Shift
    • The distribution of each hidden layer's inputs keeps shifting as earlier layers' weights are updated, which leads to slow convergence
    • Batch Normalization deals with internal covariate shift because it normalizes each layer's inputs
      • Instead of just standardizing the values in a mini-batch, it also uses two learnable parameters to scale & shift the normalized data (see the formulas below)
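For reference, the standard batch-normalization transform behind this description, with learnable scale γ and shift β over a mini-batch of m values:

\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta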
