Reducing the Complexity of Hyperparameter Tuning
Training a Neural Network is Time Consuming and Complex
>> 2
Neural Network – Inference
>> 3
[Figure: inference. The input image's pixel values are multiplied by weights and biases are added; the network outputs a prediction (Duck, Dog, or Hedgehog) for each input, shown alongside the correct labels.]
Neural Network – Train
>> 4
[Figure: training. Inputs pass through the network (multiply by weights, plus biases) to produce outputs, which are compared against the correct labels (Duck, Dog, Hedgehog).]
Loss function = average(outputs - correct labels)
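A minimal Python sketch of this loss idea (the numbers, and the use of absolute differences so that positive and negative errors do not cancel, are illustrative assumptions; real classifiers usually use a cross-entropy loss, which appears later as CEL):

import numpy as np

# Hypothetical output scores and one-hot correct label for a single image.
outputs = np.array([0.7, 0.2, 0.1])   # scores for Duck, Dog, Hedgehog
correct = np.array([1.0, 0.0, 0.0])   # the image really is a Duck

# Average difference between outputs and correct labels.
loss = np.mean(np.abs(outputs - correct))
print(loss)   # 0.2, and lower is better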
Neural Network – Train
>> 5
Hyperparameters
Training Parameters
Optimization Parameters
Neural Network – Train
>> 6
Neural Network – Train
>> 7
[Plot: loss vs. training step.]
Neural Network – Training can take some time…
>> 8
[Plot: loss vs. training step, annotated with the questions below.]
Depending on network, data,
and hardware, this can take
hours, days, or weeks!
But what if we don’t
like the result!?
what if this
decay is too
slow for us?
what if our
customer says this
error is too high?
should we start
from a different
location?
did we stop
too early?
should we change
our training script?
We have to repeat this
after changing
Hyperparameters!
>> 9
Let's Use Accuracy Instead of Loss
[Plot: accuracy vs. training step.]
Aiming to Save Time and Resources
>> 10
[Plot: accuracy vs. training step. Red: accuracy reached by tuning M hyperparameters; green: accuracy reached by tuning N hyperparameters.]
Aiming to Save Time and Resources
>> 11
this is great!
can we have this?
* Only when you care about achieving a minimum accuracy
There are many Applications for NN
>> 12
Image classification
Medical diagnosis
Financial forecasting
Object recognition
Speech
recognition
Telecommunication Applications for NN
>> 13
Constellation design
Automatic modulation classification
Channel estimation
There are Different Network Topologies
>> 14
After Selecting Application, Dataset, and Network
>> 15
Application
Dataset
Network
How complex is training?
(application, network, dataset)
The Complexity of Training a Neural Network
>> 16
How to initialize our
Network?
random, He, Xavier, …
* This graph is generalized and simplified
Start with:
How to change our
Network variables?
Optimizer, learning rate, …
How to modify our
input?
normalizing, crop, flip, …
When do we stop?
maximum number of epochs,
minimum accuracy, …
How to Quantize?
bit budget, method, …
There is a set of suggestions, but no definite answer.
How to evaluate our
Network? MSE, MAE, CEL, …
Application
Dataset
Network
The Complexity of Training a Neural Network
>> 17
How to initialize our
Network?
random, He, Xavier, …
* This graph is generalized and simplified
Start with:
How to change our
Network variables?
Optimizer, learning rate, …
How to modify our
input?
normalizing, crop, flip, …
When do we stop?
maximum number of epochs,
minimum accuracy, …
How to Quantize?
bit budget, method, …
How to evaluate our
Network? MSE, MAE, CEL, …
Application
Dataset
Network
The answer to each question depends on the other answers!
The Complexity of Training a Neural Network
>> 18
How to initialize our
Network?
random, He, Xavier, …
* This graph is generalized and simplified
Start with:
How to change our
Network variables?
Optimizer, learning rate, …
How to modify our
input?
normalizing, crop, flip, …
When do we stop?
maximum number of epochs,
minimum accuracy, …
How to Quantize?
bit budget, method, …
How to evaluate our
Network? MSE, MAE, CEL, …
Application
Dataset
Network
The answer to each question depends on the other answers!
For example, the learning rate and optimizer can depend on quantization.
Training a 3-Node Neural Network is NP-Complete
A. Blum and R. L. Rivest, NIPS 1989
“Training a Neural Network to Produce the correct output is NP-Hard”
S. Judd, PhD Thesis 1988
The Complexity of Training a Neural Network
>> 19
* This graph is generalized and simplified
Let’s say we select these parameters
for an experiment:
Initialization:
random
Stop criteria:
maximum number of epochs
Quantization:
bit budget, method, …
Evaluation:
CEL
Modify input:
normalizing, crop, flip, …
Application
Dataset
Network
How to change our
Network variables?
Optimizer, learning rate, …
Still no definite answer, but there is a better set of suggestions.
Automatically tune
some of these parameters
Reducing The Complexity of Training
Training a Network
>> 20
Training a Network – Gradient Descent
>> 21
Update rule: w ← w - η ∇L(w)
(updated values = current values - learning rate × derivative of the loss function, computed using all the training set)
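As a minimal Python sketch of this update (the weight values, gradient, and learning rate below are illustrative placeholders):

import numpy as np

# One plain gradient-descent step: updated values = current values
# minus the learning rate times the full-dataset gradient of the loss.
def gradient_descent_step(weights, grad_full_dataset, lr=0.1):
    return weights - lr * grad_full_dataset

weights = np.array([0.5, -1.2])
grad = np.array([0.2, -0.4])   # pretend this came from the whole training set
weights = gradient_descent_step(weights, grad, lr=0.1)
print(weights)                 # [ 0.48 -1.16]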
Training a Network – A Simple Example
>> 22
Same update rule as before: w ← w - η ∇L(w), with the gradient computed over the full training set.
Training a Network – A Simple Example
>> 23
[Plots: parameter values and learning rate vs. step. Blue: goal; green: training with a fixed learning rate; red: training with a non-fixed learning rate.]
Training a Network – A Simple Example
>> 24
[Same plots as the previous slide: fixed (green) vs. non-fixed (red) learning rate against the goal (blue).]
Training a Network – We Need Tuning
>> 25
[Figure: the same update rule, w ← w - η ∇L(w); the learning rate η is the first entry in our bag of tunable hyperparameters.]
Training a Network – Gradient Descent
>> 26
Stochastic Gradient Descent: instead of the derivative of the loss function over all the training set, use the derivative over a sampled subset of size b (the ImageNet training set alone includes one million 224x224 images, so full-dataset gradients are expensive). The subset size b and how to sample it join the learning rate in the bag of tunable hyperparameters.
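A small Python sketch of the mini-batch idea, with a toy problem attached (the function names, data, and hyperparameter values are assumptions for the example, not part of the talk):

import numpy as np

rng = np.random.default_rng(0)

# One SGD step: estimate the gradient on a randomly sampled subset of size b.
def sgd_step(weights, grad_fn, train_set, b, lr):
    batch = train_set[rng.choice(len(train_set), size=b, replace=False)]
    return weights - lr * grad_fn(weights, batch)

# Toy problem: fit y = 2x with a squared error; d/dw of (w*x - 2x)^2 is 2*x^2*(w - 2).
train_x = rng.uniform(-1.0, 1.0, size=1000)
grad_fn = lambda w, x: np.mean(2.0 * x * x * (w - 2.0))

w = 0.0
for _ in range(300):
    w = sgd_step(w, grad_fn, train_x, b=32, lr=0.1)
print(w)   # close to 2.0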
Training a Network – Generalization
>> 27
[Figure: the SGD update once more, with the learning rate and the batch size b sitting in the bag of tunable hyperparameters.]
Train set
Validation set
Test set
“A Simple Weight Decay Can Improve Generalization”, Krogh, A. and Hertz, J., NIPS 1991
These sets are different!
We may overfit on train and/or validation sets, if not careful!
Weight Decay
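A minimal sketch of adding weight decay to the SGD step (the coefficient value is an illustrative assumption; in PyTorch the same effect comes from the optimizer's weight_decay argument):

# Weight decay adds an L2 penalty that shrinks the weights a little every
# step, which often helps generalization (Krogh and Hertz, NIPS 1991).
# `wd` is one more hyperparameter for the bag of tunables.
def sgd_step_with_weight_decay(weights, grad_batch, lr=0.1, wd=1e-4):
    grad_with_decay = grad_batch + wd * weights   # gradient of the L2 penalty
    return weights - lr * grad_with_decay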
Training a Network – Too many Hyperparameters!
>> 28
Bag of tunable hyperparameters so far: learning rate, batch size b, weight decay.
We can continue and introduce more and more parameters:
momentum
Nesterov momentum
Step decay
Cyclic Learning Rate
KFAC
Hessian Free
RMSPROP
ADAM
AdaBound
milestones
gamma
Maximum learning rate
Initial learning rate
dampening
cooldown
warmup
Training a Network – Review Relevant Hyperparameters
>> 29
Reminder: we are looking at only a subset of our hyperparameters!
[Figure: the SGD update with its bag of tunable hyperparameters.]
How to change our
Network variables?
Optimizer and learning rate
Calculating Second
Order Information
Using Moving
Average
Manual Tuning
Training a Network – What is a Good Learning Rate
>> 30
The learning rate controls how much we adjust the weights.
Training a Network – What is a Good Learning Rate
>> 31
The learning rate controls how much we adjust the weights.
Too large!
Training a Network – What is a Good Learning Rate
>> 32
The learning rate controls how much we adjust the weights.
Too small!
Common Trends – Calculating Second Derivative
>> 33
Bag of tunable hyperparameters
10 times more expensive than SGD
“Hessian-Free Optimization”, James Martens, University of Toronto
Expensive to implement
13 input parameters
“Large Batch Size Training of Neural Networks with Adversarial Training and Second-Order Information”, Yao, Gholami, Keutzer, and Mahoney, UC Berkeley
Can add more parameters to our tunable bag
An optimum step size
Common Trends – Using Moving Average
>> 34
Bag of tunable hyperparameters
Common Trends – Using Moving Average
>> 35
Bag of tunable hyperparameters
momentum
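A small sketch of one common momentum formulation (the names and the beta value are illustrative; libraries differ slightly in how they write this update):

import numpy as np

# SGD with momentum: keep a moving average ("velocity") of past gradients
# and step along it instead of the raw mini-batch gradient.
def momentum_step(weights, velocity, grad_batch, lr=0.1, beta=0.9):
    velocity = beta * velocity + grad_batch   # moving average of gradients
    weights = weights - lr * velocity
    return weights, velocity

w, v = np.array([0.5, -1.2]), np.zeros(2)
g = np.array([0.2, -0.4])                     # pretend mini-batch gradient
w, v = momentum_step(w, v, g)                 # w -> [0.48, -1.16], v -> [0.2, -0.4]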
Common Trends – Using More Moving Averages
>> 36
Bag of tunable hyperparameters
ADAM
Common Trends – Problem with Moving Average
>> 37
Bag of tunable hyperparameters
SGD noise is larger for
smaller batch size
“Mini batches are usually better”, Geoffrey Hinton, Lecture 6, CSC321, University of Toronto (AI scientist at Google Brain)
[Plot: training accuracy for ResNet-34 on CIFAR10, from “Adaptive Gradient Methods with Dynamic Bound of Learning Rate”, Luo et al., ICLR 2019]
Common Trends – Manual Tuning
>> 38
Bag of tunable hyperparameters
Multi-step decay – input: (initial step size, [milestone], [decay])
[Plot: learning rate vs. step or epoch number. The rate starts at its initial value and is multiplied by decay 1, decay 2, decay 3 at milestone 1, milestone 2, milestone 3.]
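A sketch of that schedule as a plain function (the signature mirrors the (initial step size, [milestone], [decay]) input above; PyTorch's MultiStepLR is similar but uses a single gamma for every milestone):

# Learning rate after `epoch` epochs of multi-step decay: start at
# `initial_lr` and multiply by decays[i] once epoch reaches milestones[i].
def multistep_lr(epoch, initial_lr, milestones, decays):
    lr = initial_lr
    for milestone, decay in zip(milestones, decays):
        if epoch >= milestone:
            lr *= decay
    return lr

# e.g. the recipe that appears later in the results table:
# multistep_lr(100, 0.1, [80, 120, 160, 180], [0.1, 0.1, 0.1, 0.05]) -> 0.01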
Common Trends – Combining Techniques
>> 39
Bag of tunable hyperparameters
ADAM
Multi-step decay
(initial step size, [milestone], [decay])
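In PyTorch terms, combining the two looks roughly like this (the model, milestones, and gamma values are placeholders, not a recommended recipe):

from torch import nn
from torch.optim import Adam
from torch.optim.lr_scheduler import MultiStepLR

model = nn.Linear(10, 3)                       # stand-in model
optimizer = Adam(model.parameters(), lr=1e-3)
scheduler = MultiStepLR(optimizer, milestones=[80, 120], gamma=0.1)

for epoch in range(150):
    # ... one epoch of training with `optimizer` goes here ...
    optimizer.step()                           # placeholder for the real parameter updates
    scheduler.step()                           # decay the Adam step size at each milestone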
A Greedy Scheduler for Learning Rate – Inspirations
>> 40
A Greedy Scheduler for Learning Rate – Inspirations
>> 41
A Greedy Scheduler for Learning Rate – Inspirations
>> 42
[Plots: two learning-rate schedules over step or epoch number, one with sudden changes and one with smooth changes.]
A Greedy Scheduler for Learning Rate – Inspirations
>> 43
The loss surface of ResNet-56 without skip connections
“Visualizing the Loss Landscape of Neural Nets”, Li et al. NIPS 2018
A Greedy Scheduler for Learning Rate – The Algorithm
>> 44
epoch
number
Possible learning rates
Initial
learning rate
A Greedy Scheduler for Learning Rate – The Algorithm
>> 45
epoch
number
Possible learning rates
Initial
learning rate
min
max
A Greedy Scheduler for Learning Rate – The Algorithm
>> 46
epoch
number
Initial
learning rate
min
max
Possible learning rates
A Greedy Scheduler for Learning Rate – The Algorithm
>> 47
epoch
number
Initial
learning rate
min
max
Possible learning rates
A Greedy Scheduler for Learning Rate – The Algorithm
>> 48
epoch
number
Initial
learning rate
min
max
Possible learning rates
A Greedy Scheduler for Learning Rate – The Algorithm
>> 49
epoch
number
Initial
learning rate
min
max
Possible learning rates
A Greedy Scheduler for Learning Rate – The Algorithm
>> 50
epoch
number
Initial
learning rate
Possible learning rates
min
max
A Greedy Scheduler for Learning Rate – The Algorithm
>> 51
epoch
number
Initial
learning rate
Possible learning rates
Validation loss did not improve
loss
A Greedy Scheduler for Learning Rate – The Algorithm
>> 52
epoch
number
Initial
learning rate
Possible learning rates
Validation loss has been improved
continue with the current learning rate
loss
A Greedy Scheduler for Learning Rate – The Algorithm
>> 53
epoch
number
Initial
learning rate
Possible learning rates
radius = 1
loss
A Greedy Scheduler for Learning Rate – The Algorithm
>> 54
epoch
number
Initial
learning rate
Possible learning rates
radius = 2
loss
A Greedy Scheduler for Learning Rate – The Algorithm
>> 55
epoch
number
Initial
learning rate
Possible learning rates
radius = 3
loss
A Greedy Scheduler for Learning Rate – The Algorithm
>> 56
epoch
number
Initial
learning rate
Possible learning rates
radius = 4
loss
A Greedy Scheduler for Learning Rate – The Algorithm
>> 57
epoch
number
Initial
learning rate
Possible learning rates
radius = 3
loss
A Greedy Scheduler for Learning Rate – Implementation
>> 58
net = prepare_model(args.model)
optimizer = SGD(net.parameters(), lr=lr)
scheduler = MultiStep(milestones, gamma)
for epoch in range(num_epochs):
    train_loss = train(optimizer)
    val_loss = validate()
    scheduler.step()
    logInfo()
Training with a step-decay scheduler*
* Steps are simplified

net = prepare_model(args.model)
optimizer = SGD(net.parameters(), lr=lr)
scheduler = GreedyLR()
for epoch in range(num_epochs):
    train_loss = train(optimizer)
    val_loss = validate()
    scheduler.step(val_loss)
    logInfo()
Training with our scheduler*
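For intuition only, here is a hypothetical sketch of a greedy, validation-driven scheduler in the spirit of the algorithm slides; the class name, the log-spaced grid, the random choice of neighbour, and all default values are assumptions of this sketch, not the presented GreedyLR implementation:

import math
import random

class GreedyLRSketch:
    """Hypothetical greedy scheduler: keep a grid of candidate learning rates
    between min_lr and max_lr, stay on the current one while validation loss
    improves, otherwise try a neighbour within a growing search radius."""

    def __init__(self, optimizer, min_lr=1e-4, max_lr=1.0, num_lrs=9):
        self.optimizer = optimizer
        # log-spaced grid of possible learning rates
        self.lrs = [min_lr * (max_lr / min_lr) ** (i / (num_lrs - 1))
                    for i in range(num_lrs)]
        init_lr = optimizer.param_groups[0]["lr"]
        self.idx = min(range(num_lrs), key=lambda i: abs(self.lrs[i] - init_lr))
        self.best_loss = math.inf
        self.radius = 1

    def step(self, val_loss):
        if val_loss < self.best_loss:
            self.best_loss = val_loss   # improved: continue with the current rate
            self.radius = 1
        else:
            # no improvement: greedily try a nearby candidate, widening the
            # search radius each time this keeps happening
            lo = max(0, self.idx - self.radius)
            hi = min(len(self.lrs) - 1, self.idx + self.radius)
            self.idx = random.randint(lo, hi)
            self.radius += 1
        for group in self.optimizer.param_groups:
            group["lr"] = self.lrs[self.idx]

A scheduler like this would be driven exactly as in the loop above, with scheduler.step(val_loss) called once per epoch.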
Results – ResNet56
>> 59
[Plots: Top-1 accuracy vs. step and learning rate vs. epoch for ResNet56.]
Original hyperparameters*: 93.8%
Our scheduler: 93.9%
Paper: 93.0%
*obtained from: keras.io/examples
[Learning-rate values annotated on the plot: 0.01, 0.02, 0.009]
Results – ResNet20
>> 60
[Plots: Top-1 accuracy vs. step and learning rate vs. epoch for ResNet20.]
Original hyperparameters*: 91.9%
Our scheduler: 91.8%
Paper: 91.2%
*obtained from: github.com/osmr
Results – ResNet20
>> 61
[Plots: Top-1 accuracy vs. step and learning rate vs. epoch for ResNet20.]
Original hyperparameters*: 92.3%
Our scheduler: 92.6%
Paper: 91.2%
*obtained from: keras.io/examples
Results – VGG11_Brevitas_8bit (Brevitas: https://github.com/Xilinx/brevitas)
>> 62
[Plots: Top-1 accuracy and learning rate vs. step for VGG11 (8-bit).]
Original hyperparameters*: 91.5%
Our scheduler: 91.4%
*obtained from: gitenterprise.xilinx.com
Results
>> 63
Network (CIFAR10) | Scheduler | Learning rate parameters (lr, milestones, gamma) | Top-1 accuracy
ResNet20 | Step Decay | 0.1, [80,120,160,180], [0.1,0.1,0.1,0.05] | 92.3%
ResNet20 | Ours | 0.1 | 92.6%
ResNet56 | Step Decay | 0.1, [80,120,160,180], [0.1,0.1,0.1,0.05] | 93.8%
ResNet56 | Ours | 0.1 | 93.9%
DenseNet40_k12 | Step Decay | 0.1, [150,250], [0.1] | 93.0%
DenseNet40_k12 | Ours | 0.1 | 92.9%
WRN20_10_1bit | Step Decay | 0.1, [80,120,160,180], [0.1] | 95.2%
WRN20_10_1bit | Ours | 0.1 | 94.9%
VGG11 8bit-fixed | Step Decay | 0.1, [150,250,350], [0.1] | 91.5%
VGG11 8bit-fixed | Ours | 0.1 | 91.4%
VGG11 6bit-fixed | Step Decay | 0.1, [150,250,350], [0.1] | 91.2%
VGG11 6bit-fixed | Ours | 0.1 | 91.2%
CNV 1bit | Step Decay | 0.1, [500,600,700,800], [0.1] | 78.5%
CNV 1bit | Ours | 0.1 | 78.5%
Wrapping up, Why Does it Matter?
>> 64
With the well-tuned recipe from the results above:

Network (CIFAR10) | Scheduler | Learning rate parameters (lr, milestones, gamma) | Top-1 accuracy
ResNet20 | Step Decay | 0.1, [80,120,160,180], [0.1,0.1,0.1,0.05] | 92.3%
ResNet20 | Ours | 0.1 | 92.6%

If you get these numbers wrong (from guessing, an online resource, a friend, etc.), you may not get good results. For example:

ResNet20 | Step Decay | 0.2, [160,180], [0.1] | 90.7%

In contrast, our method can be forgiving. For example:

ResNet20 | Ours | 0.2 | 92.6%
Disclaimer – There is no Silver Bullet for Deep Learning
>> 65
Like every other method out there, we cannot guarantee that our method works better all the time.