1 of 65

Reducing the Complexity of Hyperparameter Tuning

2 of 65

Training a Neural Network is Time Consuming and Complex


3 of 65

Neural Network – Inference

[Figure: inference on an input image. The pixel values are multiplied by weights, biases are added, and the network produces outputs (Duck, Dog, Hedgehog), shown next to the correct labels (Duck, Hedgehog, Dog).]

4 of 65

Neural Network – Train

[Figure: training. Inputs are multiplied by weights, biases are added, and the outputs are compared with the correct labels (Duck, Hedgehog, Dog).]

Loss function = average(outputs - correct labels)

5 of 65

Neural Network – Train

[Figure: the same training diagram (inputs, outputs, correct labels), with the Hyperparameters grouped into Training Parameters and Optimization Parameters.]

6 of 65

Neural Network – Train

[Figure: the training diagram again (inputs, outputs, correct labels).]

7 of 65

Neural Network – Train

[Figure: a training curve plotted against step number.]

8 of 65

Neural Network – Training can take some time…

[Figure: a training curve over steps.]

Depending on network, data, and hardware, this can take hours, days, or weeks! But what if we don't like the result!?

  • What if this decay is too slow for us?
  • What if our customer says this error is too high?
  • Should we start from a different location?
  • Did we stop too early?
  • Should we change our training script?

We have to repeat this after changing Hyperparameters!

9 of 65

Let's use Accuracy Instead of Loss

[Figure: accuracy plotted against step number.]

10 of 65

Aiming to Save Time and Resources

  • If: we reduce the number of tunable hyperparameters (M < N), then: we have less to tune manually

[Figure: accuracy vs. step. Green: accuracy obtained by tuning N hyperparameters; red: accuracy obtained by tuning M hyperparameters.]

11 of 65

Aiming to Save Time and Resources

  • If: we reduce the number of tunable hyperparameters (M < N), then: we have less to tune manually
  • If: we achieve a minimum accuracy faster, then: we will be able to test more ideas*

* Only when you care about achieving a minimum accuracy

[Figure: accuracy vs. step, as before (green: tuning N hyperparameters; red: tuning M hyperparameters), annotated "this is great! can we have this?"]

12 of 65

There are many Applications for NN


Image classification

Medical diagnosis

Financial forecasting

Object recognition

Speech recognition

13 of 65

Telecommunication Applications for NN


Constellation design

Automatic modulation classification

Channel estimation

14 of 65

There are Different Network Topologies


15 of 65

After Selecting Application, Dataset, and Network

[Figure: the chosen Application, Dataset, and Network lead to the question: how complex is training for this (application, network, dataset) combination?]

16 of 65

The Complexity of Training a Neural Network

Start with: Application, Dataset, Network

  • How to initialize our Network? (random, He, Xavier, …)
  • How to modify our input? (normalizing, crop, flip, …)
  • How to change our Network variables? (optimizer, learning rate, …)
  • How to evaluate our Network? (MSE, MAE, CEL, …)
  • When do we stop? (maximum number of epochs, minimum accuracy, …)
  • How to Quantize? (bit budget, method, …)

There exists a set of suggestions, but no definite answer.

* This graph is generalized and simplified

17 of 65

The Complexity of Training a Neural Network

[Figure: the same diagram of training questions as the previous slide.]

Answer to each question depends on other answers!

18 of 65

The Complexity of Training a Neural Network

[Figure: the same diagram of training questions as the previous slide.]

Answer to each question depends on other answers! For example, the learning rate and optimizer can depend on quantization.

"Training a 3-Node Neural Network is NP-Complete," A. Blum and R. L. Rivest, NIPS 1989

"Training a Neural Network to Produce the correct output is NP-Hard," S. Judd, PhD Thesis 1988

19 of 65

The Complexity of Training a Neural Network

* This graph is generalized and simplified

Let's say we select these parameters for an experiment (for the given Application, Dataset, and Network):

  • Initialization: random
  • Stop criteria: maximum number of epochs
  • Quantization: bit budget, method, …
  • Evaluation: CEL
  • Modify input: normalizing, crop, flip, …

Remaining question: how to change our Network variables? (optimizer, learning rate, …)

Still no definite answer, but there exists a better set of suggestions.

Goal: automatically tune some of these parameters, reducing the complexity of training.

20 of 65

Training a Network


21 of 65

Training a Network – Gradient Descent

updated values = current values - learning rate × derivative of loss function (using all the training set)
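To make the update concrete, here is a minimal NumPy sketch of one full-batch gradient descent step on a toy least-squares problem; the model, data, and learning rate are illustrative choices of ours, not from the slides.

import numpy as np

def gradient_descent_step(w, X, y, lr):
    # Loss: L(w) = mean((X @ w - y)**2), computed over ALL training samples.
    residual = X @ w - y                    # predictions minus targets
    grad = 2.0 * X.T @ residual / len(y)    # derivative of the loss with respect to w
    return w - lr * grad                    # updated values = current values - lr * derivative

# Toy usage: fit y = 3*x with a single weight.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([3.0, 6.0, 9.0])
w = np.zeros(1)
for step in range(100):
    w = gradient_descent_step(w, X, y, lr=0.05)
print(w)   # approximately [3.0]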

22 of 65

Training a Network – A Simple Example

[Figure: a simple example of the update: updated values = current values - learning rate × derivative of loss function (using all the training set).]

23 of 65

Training a Network – A Simple Example

[Figure: two panels (Parameters and Learning rate) plotted against step. Blue: goal; green: training with a fixed learning rate; red: training with a non-fixed learning rate.]

24 of 65

Training a Network – A Simple Example

[Figure: the same two panels (Parameters and Learning rate) plotted against step. Blue: goal; green: fixed learning rate; red: non-fixed learning rate.]

25 of 65

Training a Network – We Need Tuning

updated values = current values - learning rate × derivative of loss function (using all the training set)

Bag of tunable hyperparameters: learning rate

26 of 65

Training a Network – Stochastic Gradient Descent

Computing the derivative of the loss function over all the training set is expensive (the ImageNet training set includes one million 224x224 images). Stochastic gradient descent instead uses the derivative of the loss function over a sampled subset of the training set:

updated values = current values - learning rate × derivative of loss function (using a subset of the training set)

Bag of tunable hyperparameters: learning rate, subset size b, how to sample.
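A matching sketch of the stochastic variant, under the same toy setup as before (uniform sampling without replacement and the batch size b are our assumed choices):

import numpy as np

rng = np.random.default_rng(0)

def sgd_step(w, X, y, lr, b):
    # One stochastic gradient descent step using a random subset of size b.
    idx = rng.choice(len(y), size=b, replace=False)   # how to sample: uniform, without replacement
    Xb, yb = X[idx], y[idx]                           # the sampled subset (mini-batch)
    grad = 2.0 * Xb.T @ (Xb @ w - yb) / b             # derivative of the loss on the subset only
    return w - lr * grad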

27 of 65

Training a Network – Generalization

The data is split into a Train set, a Validation set, and a Test set. These sets are different! We may overfit on the train and/or validation sets, if not careful.

Weight decay is one remedy: "A Simple Weight Decay can Improve Generalization," Krogh, A. and Hertz, J., NIPS 1991.

Bag of tunable hyperparameters: learning rate, subset size b, how to sample, weight decay.
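One common way weight decay enters this update is as an L2 penalty added to the gradient; a hedged sketch, with `wd` as our own name for the decay coefficient:

import numpy as np

rng = np.random.default_rng(0)

def sgd_step_with_weight_decay(w, X, y, lr, b, wd):
    # Loss = data loss + (wd / 2) * ||w||^2, so the gradient gains a wd * w term.
    idx = rng.choice(len(y), size=b, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = 2.0 * Xb.T @ (Xb @ w - yb) / b
    return w - lr * (grad + wd * w)        # the decay term shrinks the weights toward zero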

28 of 65

Training a Network – Too many Hyperparameters!

We can continue and introduce more and more parameters: momentum, Nesterov momentum, dampening, step decay (milestones, gamma), cyclic learning rate (initial learning rate, maximum learning rate), warmup, cooldown, KFAC, Hessian Free, RMSPROP, ADAM, AdaBound, …

The bag of tunable hyperparameters keeps growing on top of the learning rate, subset size b, sampling strategy, and weight decay already in it.

29 of 65

Training a Network – Review Relevant Hyperparameters

Reminder: we are looking at a subset of our Hyperparameters: how to change our Network variables (optimizer and learning rate).

Common trends for doing this:

  • Calculating Second Order Information
  • Using Moving Average
  • Manual Tuning

30 of 65

Training a Network – What is a Good Learning Rate

The learning rate controls how much we are adjusting the weights.

31 of 65

Training a Network – What is a Good Learning Rate

The learning rate controls how much we are adjusting the weights.

[Figure: an update with a learning rate that is too large.]

32 of 65

Training a Network – What is a Good Learning Rate

The learning rate controls how much we are adjusting the weights.

[Figure: an update with a learning rate that is too small.]

33 of 65

Common Trends – Calculating Second Derivative

Second order information can give an optimum step size, but:

  • It is roughly 10 times more expensive than SGD and expensive to implement ("Hessian-Free Optimization," James Martens, University of Toronto).
  • It can add more parameters to our tunable bag, e.g., 13 input parameters in "Large Batch Size Training of NN with Adversarial Training and Second Order Information," Yao, Gholami, Keutzer, and Mahoney, UC Berkeley.
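For reference, the classical second-order (Newton) step that motivates this trend replaces the scalar learning rate with the inverse Hessian; this is the textbook formula, not an equation taken from either cited paper:

$w_{t+1} = w_t - \left(\nabla^2 L(w_t)\right)^{-1} \nabla L(w_t)$

Forming or inverting the Hessian is what makes such methods roughly an order of magnitude more expensive, which is why Hessian-free methods work with Hessian-vector products instead.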

 

34 of 65

Common Trends – Using Moving Average


35 of 65

Common Trends – Using Moving Average

Momentum: the update uses a moving average of past gradients instead of the raw gradient. This adds momentum to the bag of tunable hyperparameters.
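A minimal sketch of the momentum form of the update (the velocity `v` and coefficient `mu` are our own names; this is the standard formulation, not the slides' exact equations):

def sgd_momentum_step(w, v, grad, lr, mu):
    # v is a moving average (decayed running sum) of past gradients;
    # mu is the momentum coefficient, a new tunable hyperparameter.
    v = mu * v + grad       # accumulate past gradients
    w = w - lr * v          # update along the averaged direction
    return w, v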

 

 

36 of 65

Common Trends – Using More Moving Averages

ADAM: uses more moving averages, tracking both the gradient and its element-wise square, which adds their decay rates to the bag of tunable hyperparameters.
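For comparison, a sketch of the textbook Adam update, which keeps two moving averages; the variable names and default values below are the usual ones, not taken from the slides:

import numpy as np

def adam_step(w, m, v, grad, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    # Two moving averages: m of the gradient, v of its element-wise square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)                 # bias correction (t counts steps from 1)
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v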

 

 

 

 

37 of 65

Common Trends – Problem with Moving Average

SGD noise is larger for smaller batch sizes, and damping this noise with moving averages can hurt the final results.

"Mini batches are usually better" (Geoffrey Hinton, Lecture 6, CSC321, University of Toronto; AI scientist at Google Brain)

[Figure: training accuracy for ResNet-34 on CIFAR10, from "Adaptive Gradient Methods with Dynamic Bound of Learning Rate," Luo et al., ICLR 2019.]

38 of 65

Common Trends – Manual Tuning

Multi-step decay – input: (initial step size, [milestones], [decays])

[Figure: learning rate vs. step or epoch number. The rate starts at the initial value and is reduced by decay 1, decay 2, decay 3 at milestone 1, milestone 2, milestone 3.]
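As a sketch, such a schedule can be written as a small helper (our own illustrative function, with one decay factor per milestone):

def multistep_lr(step, initial_lr, milestones, decays):
    # Learning rate after `step`, with a separate decay factor per milestone.
    lr = initial_lr
    for milestone, decay in zip(milestones, decays):
        if step >= milestone:
            lr *= decay
    return lr

# Example: multistep_lr(130, 0.1, [80, 120, 160, 180], [0.1, 0.1, 0.1, 0.05])
# returns 0.1 * 0.1 * 0.1, i.e. about 0.001, since 130 has passed the first two milestones.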

 

39 of 65

Common Trends – Combining Techniques

Combining ADAM with multi-step decay: ADAM's bag of tunables plus the multi-step decay inputs (initial step size, [milestone], [decay]).
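In PyTorch, for instance, this combination might look like the following; the model and milestone values are placeholders of ours, not the deck's settings:

import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import MultiStepLR

model = torch.nn.Linear(10, 2)                  # placeholder model
optimizer = Adam(model.parameters(), lr=1e-3)   # Adam: moving averages of gradient and its square
scheduler = MultiStepLR(optimizer, milestones=[80, 120], gamma=0.1)  # step decay on top

for epoch in range(160):
    # ... train for one epoch with `optimizer` ...
    scheduler.step()                            # decay the learning rate at the milestones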

 

40 of 65

A Greedy Scheduler for Learning Rate – Inspirations

  • Second order information: learning rate can both increase and decrease during training


41 of 65

A Greedy Scheduler for Learning Rate – Inspirations

  • Second order information: learning rate can both increase and decrease during training
  • Using moving average: damping the SGD noise can hurt the final results


42 of 65

A Greedy Scheduler for Learning Rate – Inspirations

  • Second order information: learning rate can both increase and decrease during training
  • Using moving average: damping the SGD noise can hurt the final results
  • Gradually changing the learning rate is better than sudden changes

[Figure: learning rate vs. step or epoch number, comparing a sudden change with a smooth change.]

43 of 65

A Greedy Scheduler for Learning Rate – Inspirations

  • Second order information: learning rate can both increase and decrease during training
  • Using moving average: damping the SGD noise can hurt the final results
  • Gradually changing the learning rate is better than sudden changes
  • We have to walk on a pathological curvature

[Figure: the loss surface of ResNet-56 without skip connections, from "Visualizing the Loss Landscape of Neural Nets," Li et al., NIPS 2018.]

44 of 65

A Greedy Scheduler for Learning Rate – The Algorithm

[Figure: a set of possible learning rates plotted against epoch number, starting at the initial learning rate.]

45 of 65

A Greedy Scheduler for Learning Rate – The Algorithm

[Figure: the possible learning rates are bounded between a min and a max around the initial learning rate.]

46 of 65

A Greedy Scheduler for Learning Rate – The Algorithm


47 of 65

A Greedy Scheduler for Learning Rate – The Algorithm


48 of 65

A Greedy Scheduler for Learning Rate – The Algorithm


49 of 65

A Greedy Scheduler for Learning Rate – The Algorithm


50 of 65

A Greedy Scheduler for Learning Rate – The Algorithm


51 of 65

A Greedy Scheduler for Learning Rate – The Algorithm

[Figure: the validation loss did not improve at the current learning rate.]

52 of 65

A Greedy Scheduler for Learning Rate – The Algorithm

[Figure: the validation loss has improved; continue with the current learning rate.]

53 of 65

A Greedy Scheduler for Learning Rate – The Algorithm

[Figure: search radius over the possible learning rates = 1.]

54 of 65

A Greedy Scheduler for Learning Rate – The Algorithm

[Figure: search radius = 2.]

55 of 65

A Greedy Scheduler for Learning Rate – The Algorithm

[Figure: search radius = 3.]

56 of 65

A Greedy Scheduler for Learning Rate – The Algorithm

[Figure: search radius = 4.]

57 of 65

A Greedy Scheduler for Learning Rate – The Algorithm

[Figure: search radius = 3.]
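Pulling the animation together, the sketch below is our own reading of the greedy rule, not the authors' GreedyLR implementation. It assumes a fixed grid of candidate learning rates, keeps the current rate while the validation loss improves, and otherwise jumps to a randomly chosen neighbor within a radius that grows after each epoch without improvement.

import random

class GreedyLRSketch:
    # Illustrative greedy learning-rate scheduler; an assumption-laden sketch, not the authors' code.

    def __init__(self, candidates, initial_lr):
        self.candidates = sorted(candidates)                      # possible learning rates, min ... max
        self.index = min(range(len(self.candidates)),
                         key=lambda i: abs(self.candidates[i] - initial_lr))
        self.lr = self.candidates[self.index]
        self.best_loss = float("inf")
        self.radius = 1                                           # how far we may move in one jump

    def step(self, val_loss):
        if val_loss < self.best_loss:                             # validation loss improved:
            self.best_loss = val_loss                             #   continue with the current rate
            self.radius = 1
        else:                                                     # no improvement:
            lo = max(0, self.index - self.radius)                 #   try a neighboring rate
            hi = min(len(self.candidates) - 1, self.index + self.radius)
            neighbors = [i for i in range(lo, hi + 1) if i != self.index]
            self.index = random.choice(neighbors or [self.index])
            self.lr = self.candidates[self.index]
            self.radius += 1                                      #   widen the search next time
        return self.lr

In use, step(val_loss) would be called once per epoch, mirroring the scheduler.step(val_loss) call on the next slide.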

58 of 65

A Greedy Scheduler for Learning Rate – Implementation

Training with step decay scheduler*:

net = prepare_model(args.model)
optimizer = SGD(net.parameters(), lr=lr)
scheduler = MultiStep(milestones, gamma)
for epoch in range(num_epochs):
    train_loss = train(optimizer)
    val_loss = validate()
    scheduler.step()          # the decay follows the fixed milestones
    logInfo()

Training with our scheduler*:

net = prepare_model(args.model)
optimizer = SGD(net.parameters(), lr=lr)
scheduler = GreedyLR()
for epoch in range(num_epochs):
    train_loss = train(optimizer)
    val_loss = validate()
    scheduler.step(val_loss)  # the greedy scheduler reacts to the validation loss
    logInfo()

* Steps are simplified

59 of 65

Results – ResNet56

[Figure: Top1 accuracy and learning rate vs. step/epoch; annotated learning-rate values: 0.01, 0.02, 0.009.]

Original hyperparameters*: 93.8%
Our scheduler: 93.9%
Paper: 93.0%

* obtained from: keras.io/examples

60 of 65

Results – ResNet20

[Figure: Top1 accuracy and learning rate vs. step/epoch.]

Original hyperparameters*: 91.9%
Our scheduler: 91.8%
Paper: 91.2%

* obtained from: github.com/osmr

61 of 65

Results – ResNet20

[Figure: Top1 accuracy and learning rate vs. step/epoch.]

Original hyperparameters*: 92.3%
Our scheduler: 92.6%
Paper: 91.2%

* obtained from: keras.io/examples

62 of 65

Results – VGG11_Brevitas_8bit (Brevitas: https://github.com/Xilinx/brevitas)

[Figure: Top1 accuracy and learning rate vs. step.]

Original hyperparameters*: 91.5%
Our scheduler: 91.4%

* obtained from: gitenterprise.xilinx.com

63 of 65

Results

Network (CIFAR10)   Scheduler    Learning rate parameters (lr, milestones, gamma)   Top1 accuracy
Resnet20            Step Decay   0.1, [80,120,160,180], [0.1,0.1,0.1,0.05]          92.3%
Resnet20            Ours         0.1                                                92.6%
Resnet56            Step Decay   0.1, [80,120,160,180], [0.1,0.1,0.1,0.05]          93.8%
Resnet56            Ours         0.1                                                93.9%
DenseNet40_k12      Step Decay   0.1, [150,250], [0.1]                              93.0%
DenseNet40_k12      Ours         0.1                                                92.9%
WRN20_10_1bit       Step Decay   0.1, [80,120,160,180], [0.1]                       95.2%
WRN20_10_1bit       Ours         0.1                                                94.9%
VGG11 8bit-fixed    Step Decay   0.1, [150,250,350], [0.1]                          91.5%
VGG11 8bit-fixed    Ours         0.1                                                91.4%
VGG11 6bit-fixed    Step Decay   0.1, [150,250,350], [0.1]                          91.2%
VGG11 6bit-fixed    Ours         0.1                                                91.2%
CNV 1bit            Step Decay   0.1, [500,600,700,800], [0.1]                      78.5%
CNV 1bit            Ours         0.1                                                78.5%

64 of 65

Wrapping up, Why Does it Matter?

Reference (from the previous slide):

Network (CIFAR10)   Scheduler    Learning rate parameters (lr, milestones, gamma)   Top1 accuracy
Resnet20            Step Decay   0.1, [80,120,160,180], [0.1,0.1,0.1,0.05]          92.3%
Resnet20            Ours         0.1                                                92.6%

If you get these numbers wrong (from guessing, an online resource, a friend, etc.), you may not get good results. For example:

Resnet20            Step Decay   0.2, [160,180], [0.1]                              90.7%

In contrast, our method can be forgiving. For example:

Resnet20            Ours         0.2                                                92.6%

65 of 65

Disclaimer – There is no Silver Bullet for Deep Learning

Like every single method out there, we cannot guarantee that our method works better all the time.