1 of 65

Reducing the Complexity of Hyperparameter Tuning

2 of 65

Training a Neural Network is Time Consuming and Complex


3 of 65

Neural Network – Inference

[Figure: inference on an input image. The pixel values are multiplied by weights, biases are added, and the network produces outputs (Duck, Dog, Hedgehog), shown next to the correct labels (Duck, Hedgehog, Dog).]

4 of 65

Neural Network – Train

[Figure: training. Inputs are multiplied by weights, biases are added, and the outputs are compared with the correct labels (Duck, Hedgehog, Dog).]

Loss function = average(outputs - correct labels)

5 of 65

Neural Network – Train

[Figure: the same training diagram (inputs, outputs, correct labels), with the Hyperparameters grouped into Training Parameters and Optimization Parameters.]

6 of 65

Neural Network – Train

[Figure: the training diagram again (inputs, outputs, correct labels).]

7 of 65

Neural Network – Train

[Figure: a training curve plotted against step number.]

8 of 65

Neural Network – Training can take some time…

[Figure: a training curve over steps.]

Depending on network, data, and hardware, this can take hours, days, or weeks! But what if we don't like the result!?

  • What if this decay is too slow for us?
  • What if our customer says this error is too high?
  • Should we start from a different location?
  • Did we stop too early?
  • Should we change our training script?

We have to repeat this after changing Hyperparameters!

9 of 65

Let's use Accuracy Instead of Loss

[Figure: accuracy plotted against step number.]

10 of 65

Aiming to Save Time and Resources

  • If: we reduce the number of tunable hyperparameters (M < N), then: we have less to tune manually

[Figure: accuracy vs. step. Green: accuracy obtained by tuning N hyperparameters; red: accuracy obtained by tuning M hyperparameters.]

11 of 65

Aiming to Save Time and Resources

  • If: we reduce the number of tunable hyperparameters (M < N), then: we have less to tune manually
  • If: we achieve a minimum accuracy faster, then: we will be able to test more ideas*

* Only when you care about achieving a minimum accuracy

[Figure: accuracy vs. step, as before (green: tuning N hyperparameters; red: tuning M hyperparameters), annotated "this is great! can we have this?"]

12 of 65

There are many Applications for NN


Image classification

Medical diagnosis

Financial forecasting

Object recognition

Speech recognition

13 of 65

Telecommunication Applications for NN


Constellation design

Automatic modulation classification

Channel estimation

14 of 65

There are Different Network Topologies


15 of 65

After Selecting Application, Dataset, and Network

[Figure: the chosen Application, Dataset, and Network lead to the question: how complex is training for this (application, network, dataset) combination?]

16 of 65

The Complexity of Training a Neural Network

Start with: Application, Dataset, Network

  • How to initialize our Network? (random, He, Xavier, …)
  • How to modify our input? (normalizing, crop, flip, …)
  • How to change our Network variables? (optimizer, learning rate, …)
  • How to evaluate our Network? (MSE, MAE, CEL, …)
  • When do we stop? (maximum number of epochs, minimum accuracy, …)
  • How to Quantize? (bit budget, method, …)

There exists a set of suggestions, but no definite answer.

* This graph is generalized and simplified

17 of 65

The Complexity of Training a Neural Network

[Figure: the same diagram of training questions as the previous slide.]

Answer to each question depends on other answers!

18 of 65

The Complexity of Training a Neural Network

[Figure: the same diagram of training questions as the previous slide.]

Answer to each question depends on other answers! For example, the learning rate and optimizer can depend on quantization.

"Training a 3-Node Neural Network is NP-Complete," A. Blum and R. L. Rivest, NIPS 1989

"Training a Neural Network to Produce the correct output is NP-Hard," S. Judd, PhD Thesis 1988

19 of 65

The Complexity of Training a Neural Network

* This graph is generalized and simplified

Let's say we select these parameters for an experiment (for the given Application, Dataset, and Network):

  • Initialization: random
  • Stop criteria: maximum number of epochs
  • Quantization: bit budget, method, …
  • Evaluation: CEL
  • Modify input: normalizing, crop, flip, …

Remaining question: how to change our Network variables? (optimizer, learning rate, …)

Still no definite answer, but there exists a better set of suggestions.

Goal: automatically tune some of these parameters, reducing the complexity of training.

20 of 65

Training a Network


21 of 65

Training a Network – Gradient Descent

updated values = current values - learning rate × derivative of loss function (using all the training set)
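To make the update concrete, here is a minimal NumPy sketch of one full-batch gradient descent step on a toy least-squares problem; the model, data, and learning rate are illustrative choices of ours, not from the slides.

import numpy as np

def gradient_descent_step(w, X, y, lr):
    # Loss: L(w) = mean((X @ w - y)**2), computed over ALL training samples.
    residual = X @ w - y                    # predictions minus targets
    grad = 2.0 * X.T @ residual / len(y)    # derivative of the loss with respect to w
    return w - lr * grad                    # updated values = current values - lr * derivative

# Toy usage: fit y = 3*x with a single weight.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([3.0, 6.0, 9.0])
w = np.zeros(1)
for step in range(100):
    w = gradient_descent_step(w, X, y, lr=0.05)
print(w)   # approximately [3.0]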

22 of 65

Training a Network – A Simple Example

[Figure: a simple example of the update: updated values = current values - learning rate × derivative of loss function (using all the training set).]

23 of 65

Training a Network – A Simple Example

[Figure: two panels (Parameters and Learning rate) plotted against step. Blue: goal; green: training with a fixed learning rate; red: training with a non-fixed learning rate.]

24 of 65

Training a Network – A Simple Example

[Figure: the same two panels (Parameters and Learning rate) plotted against step. Blue: goal; green: fixed learning rate; red: non-fixed learning rate.]

25 of 65

Training a Network – We Need Tuning

updated values = current values - learning rate × derivative of loss function (using all the training set)

Bag of tunable hyperparameters: learning rate

26 of 65

Training a Network – Stochastic Gradient Descent

Computing the derivative of the loss function over all the training set is expensive (the ImageNet training set includes one million 224x224 images). Stochastic gradient descent instead uses the derivative of the loss function over a sampled subset of the training set:

updated values = current values - learning rate × derivative of loss function (using a subset of the training set)

Bag of tunable hyperparameters: learning rate, subset size b, how to sample.
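A matching sketch of the stochastic variant, under the same toy setup as before (uniform sampling without replacement and the batch size b are our assumed choices):

import numpy as np

rng = np.random.default_rng(0)

def sgd_step(w, X, y, lr, b):
    # One stochastic gradient descent step using a random subset of size b.
    idx = rng.choice(len(y), size=b, replace=False)   # how to sample: uniform, without replacement
    Xb, yb = X[idx], y[idx]                           # the sampled subset (mini-batch)
    grad = 2.0 * Xb.T @ (Xb @ w - yb) / b             # derivative of the loss on the subset only
    return w - lr * grad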

27 of 65

Training a Network – Generalization

The data is split into a Train set, a Validation set, and a Test set. These sets are different! We may overfit on the train and/or validation sets, if not careful.

Weight decay is one remedy: "A Simple Weight Decay can Improve Generalization," Krogh, A. and Hertz, J., NIPS 1991.

Bag of tunable hyperparameters: learning rate, subset size b, how to sample, weight decay.
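One common way weight decay enters this update is as an L2 penalty added to the gradient; a hedged sketch, with `wd` as our own name for the decay coefficient:

import numpy as np

rng = np.random.default_rng(0)

def sgd_step_with_weight_decay(w, X, y, lr, b, wd):
    # Loss = data loss + (wd / 2) * ||w||^2, so the gradient gains a wd * w term.
    idx = rng.choice(len(y), size=b, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = 2.0 * Xb.T @ (Xb @ w - yb) / b
    return w - lr * (grad + wd * w)        # the decay term shrinks the weights toward zero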

28 of 65

Training a Network – Too many Hyperparameters!

We can continue and introduce more and more parameters: momentum, Nesterov momentum, dampening, step decay (milestones, gamma), cyclic learning rate (initial learning rate, maximum learning rate), warmup, cooldown, KFAC, Hessian Free, RMSPROP, ADAM, AdaBound, …

The bag of tunable hyperparameters keeps growing on top of the learning rate, subset size b, sampling strategy, and weight decay already in it.

29 of 65

Training a Network – Review Relevant Hyperparameters

Reminder: we are looking at a subset of our Hyperparameters: how to change our Network variables (optimizer and learning rate).

Common trends for doing this:

  • Calculating Second Order Information
  • Using Moving Average
  • Manual Tuning

30 of 65

Training a Network – What is a Good Learning Rate

The learning rate controls how much we are adjusting the weights.

31 of 65

Training a Network – What is a Good Learning Rate

The learning rate controls how much we are adjusting the weights.

[Figure: an update with a learning rate that is too large.]

32 of 65

Training a Network – What is a Good Learning Rate

The learning rate controls how much we are adjusting the weights.

[Figure: an update with a learning rate that is too small.]

33 of 65

Common Trends – Calculating Second Derivative

Second order information can give an optimum step size, but:

  • It is roughly 10 times more expensive than SGD and expensive to implement ("Hessian-Free Optimization," James Martens, University of Toronto).
  • It can add more parameters to our tunable bag, e.g., 13 input parameters in "Large Batch Size Training of NN with Adversarial Training and Second Order Information," Yao, Gholami, Keutzer, and Mahoney, UC Berkeley.
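For reference, the classical second-order (Newton) step that motivates this trend replaces the scalar learning rate with the inverse Hessian; this is the textbook formula, not an equation taken from either cited paper:

$w_{t+1} = w_t - \left(\nabla^2 L(w_t)\right)^{-1} \nabla L(w_t)$

Forming or inverting the Hessian is what makes such methods roughly an order of magnitude more expensive, which is why Hessian-free methods work with Hessian-vector products instead.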

 

34 of 65

Common Trends – Using Moving Average


35 of 65

Common Trends – Using Moving Average

Momentum: the update uses a moving average of past gradients instead of the raw gradient. This adds momentum to the bag of tunable hyperparameters.
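A minimal sketch of the momentum form of the update (the velocity `v` and coefficient `mu` are our own names; this is the standard formulation, not the slides' exact equations):

def sgd_momentum_step(w, v, grad, lr, mu):
    # v is a moving average (decayed running sum) of past gradients;
    # mu is the momentum coefficient, a new tunable hyperparameter.
    v = mu * v + grad       # accumulate past gradients
    w = w - lr * v          # update along the averaged direction
    return w, v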

 

 

36 of 65

Common Trends – Using More Moving Averages

ADAM: uses more moving averages, tracking both the gradient and its element-wise square, which adds their decay rates to the bag of tunable hyperparameters.
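For comparison, a sketch of the textbook Adam update, which keeps two moving averages; the variable names and default values below are the usual ones, not taken from the slides:

import numpy as np

def adam_step(w, m, v, grad, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    # Two moving averages: m of the gradient, v of its element-wise square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)                 # bias correction (t counts steps from 1)
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v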

 

 

 

 

37 of 65

Common Trends – Problem with Moving Average

SGD noise is larger for smaller batch sizes, and damping this noise with moving averages can hurt the final results.

"Mini batches are usually better" (Geoffrey Hinton, Lecture 6, CSC321, University of Toronto; AI scientist at Google Brain)

[Figure: training accuracy for ResNet-34 on CIFAR10, from "Adaptive Gradient Methods with Dynamic Bound of Learning Rate," Luo et al., ICLR 2019.]

38 of 65

Common Trends – Manual Tuning

Multi-step decay – input: (initial step size, [milestones], [decays])

[Figure: learning rate vs. step or epoch number. The rate starts at the initial value and is reduced by decay 1, decay 2, decay 3 at milestone 1, milestone 2, milestone 3.]
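As a sketch, such a schedule can be written as a small helper (our own illustrative function, with one decay factor per milestone):

def multistep_lr(step, initial_lr, milestones, decays):
    # Learning rate after `step`, with a separate decay factor per milestone.
    lr = initial_lr
    for milestone, decay in zip(milestones, decays):
        if step >= milestone:
            lr *= decay
    return lr

# Example: multistep_lr(130, 0.1, [80, 120, 160, 180], [0.1, 0.1, 0.1, 0.05])
# returns 0.1 * 0.1 * 0.1, i.e. about 0.001, since 130 has passed the first two milestones.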

 

39 of 65

Common Trends – Combining Techniques

Combining ADAM with multi-step decay: ADAM's bag of tunables plus the multi-step decay inputs (initial step size, [milestone], [decay]).
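In PyTorch, for instance, this combination might look like the following; the model and milestone values are placeholders of ours, not the deck's settings:

import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import MultiStepLR

model = torch.nn.Linear(10, 2)                  # placeholder model
optimizer = Adam(model.parameters(), lr=1e-3)   # Adam: moving averages of gradient and its square
scheduler = MultiStepLR(optimizer, milestones=[80, 120], gamma=0.1)  # step decay on top

for epoch in range(160):
    # ... train for one epoch with `optimizer` ...
    scheduler.step()                            # decay the learning rate at the milestones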

 

40 of 65

A Greedy Scheduler for Learning Rate – Inspirations

  • Second order information: learning rate can both increase and decrease during training


41 of 65

A Greedy Scheduler for Learning Rate – Inspirations

  • Second order information: learning rate can both increase and decrease during training
  • Using moving average: damping the SGD noise can hurt the final results


42 of 65

A Greedy Scheduler for Learning Rate – Inspirations

  • Second order information: learning rate can both increase and decrease during training
  • Using moving average: damping the SGD noise can hurt the final results
  • Gradually changing the learning rate is better than sudden changes

[Figure: learning rate vs. step or epoch number, comparing a sudden change with a smooth change.]

43 of 65

A Greedy Scheduler for Learning Rate – Inspirations

  • Second order information: learning rate can both increase and decrease during training
  • Using moving average: damping the SGD noise can hurt the final results
  • Gradually changing the learning rate is better than sudden changes
  • We have to walk on a pathological curvature

[Figure: the loss surface of ResNet-56 without skip connections, from "Visualizing the Loss Landscape of Neural Nets," Li et al., NIPS 2018.]

44 of 65

A Greedy Scheduler for Learning Rate – The Algorithm

[Figure: a set of possible learning rates plotted against epoch number, starting at the initial learning rate.]

45 of 65

A Greedy Scheduler for Learning Rate – The Algorithm

[Figure: the possible learning rates are bounded between a min and a max around the initial learning rate.]

46 of 65

A Greedy Scheduler for Learning Rate – The Algorithm


47 of 65

A Greedy Scheduler for Learning Rate – The Algorithm


48 of 65

A Greedy Scheduler for Learning Rate – The Algorithm


49 of 65

A Greedy Scheduler for Learning Rate – The Algorithm


50 of 65

A Greedy Scheduler for Learning Rate – The Algorithm


51 of 65

A Greedy Scheduler for Learning Rate – The Algorithm

[Figure: the validation loss did not improve at the current learning rate.]

52 of 65

A Greedy Scheduler for Learning Rate – The Algorithm

[Figure: the validation loss has improved; continue with the current learning rate.]

53 of 65

A Greedy Scheduler for Learning Rate – The Algorithm

[Figure: search radius over the possible learning rates = 1.]

54 of 65

A Greedy Scheduler for Learning Rate – The Algorithm

[Figure: search radius = 2.]

55 of 65

A Greedy Scheduler for Learning Rate – The Algorithm

[Figure: search radius = 3.]

56 of 65

A Greedy Scheduler for Learning Rate – The Algorithm

[Figure: search radius = 4.]

57 of 65

A Greedy Scheduler for Learning Rate – The Algorithm

[Figure: search radius = 3.]
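Pulling the animation together, the sketch below is our own reading of the greedy rule, not the authors' GreedyLR implementation. It assumes a fixed grid of candidate learning rates, keeps the current rate while the validation loss improves, and otherwise jumps to a randomly chosen neighbor within a radius that grows after each epoch without improvement.

import random

class GreedyLRSketch:
    # Illustrative greedy learning-rate scheduler; an assumption-laden sketch, not the authors' code.

    def __init__(self, candidates, initial_lr):
        self.candidates = sorted(candidates)                      # possible learning rates, min ... max
        self.index = min(range(len(self.candidates)),
                         key=lambda i: abs(self.candidates[i] - initial_lr))
        self.lr = self.candidates[self.index]
        self.best_loss = float("inf")
        self.radius = 1                                           # how far we may move in one jump

    def step(self, val_loss):
        if val_loss < self.best_loss:                             # validation loss improved:
            self.best_loss = val_loss                             #   continue with the current rate
            self.radius = 1
        else:                                                     # no improvement:
            lo = max(0, self.index - self.radius)                 #   try a neighboring rate
            hi = min(len(self.candidates) - 1, self.index + self.radius)
            neighbors = [i for i in range(lo, hi + 1) if i != self.index]
            self.index = random.choice(neighbors or [self.index])
            self.lr = self.candidates[self.index]
            self.radius += 1                                      #   widen the search next time
        return self.lr

In use, step(val_loss) would be called once per epoch, mirroring the scheduler.step(val_loss) call on the next slide.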

58 of 65

A Greedy Scheduler for Learning Rate – Implementation

Training with step decay scheduler*:

net = prepare_model(args.model)
optimizer = SGD(net.parameters(), lr=lr)
scheduler = MultiStep(milestones, gamma)
for epoch in range(num_epochs):
    train_loss = train(optimizer)
    val_loss = validate()
    scheduler.step()          # the decay follows the fixed milestones
    logInfo()

Training with our scheduler*:

net = prepare_model(args.model)
optimizer = SGD(net.parameters(), lr=lr)
scheduler = GreedyLR()
for epoch in range(num_epochs):
    train_loss = train(optimizer)
    val_loss = validate()
    scheduler.step(val_loss)  # the greedy scheduler reacts to the validation loss
    logInfo()

* Steps are simplified

59 of 65

Results – ResNet56

[Figure: Top1 accuracy and learning rate vs. step/epoch; annotated learning-rate values: 0.01, 0.02, 0.009.]

Original hyperparameters*: 93.8%
Our scheduler: 93.9%
Paper: 93.0%

* obtained from: keras.io/examples

60 of 65

Results – ResNet20

[Figure: Top1 accuracy and learning rate vs. step/epoch.]

Original hyperparameters*: 91.9%
Our scheduler: 91.8%
Paper: 91.2%

* obtained from: github.com/osmr

61 of 65

Results – ResNet20

[Figure: Top1 accuracy and learning rate vs. step/epoch.]

Original hyperparameters*: 92.3%
Our scheduler: 92.6%
Paper: 91.2%

* obtained from: keras.io/examples

62 of 65

Results – VGG11_Brevitas_8bit (Brevitas: https://github.com/Xilinx/brevitas)

[Figure: Top1 accuracy and learning rate vs. step.]

Original hyperparameters*: 91.5%
Our scheduler: 91.4%

* obtained from: gitenterprise.xilinx.com

63 of 65

Results

Network (CIFAR10)   Scheduler    Learning rate parameters (lr, milestones, gamma)   Top1 accuracy
Resnet20            Step Decay   0.1, [80,120,160,180], [0.1,0.1,0.1,0.05]          92.3%
Resnet20            Ours         0.1                                                92.6%
Resnet56            Step Decay   0.1, [80,120,160,180], [0.1,0.1,0.1,0.05]          93.8%
Resnet56            Ours         0.1                                                93.9%
DenseNet40_k12      Step Decay   0.1, [150,250], [0.1]                              93.0%
DenseNet40_k12      Ours         0.1                                                92.9%
WRN20_10_1bit       Step Decay   0.1, [80,120,160,180], [0.1]                       95.2%
WRN20_10_1bit       Ours         0.1                                                94.9%
VGG11 8bit-fixed    Step Decay   0.1, [150,250,350], [0.1]                          91.5%
VGG11 8bit-fixed    Ours         0.1                                                91.4%
VGG11 6bit-fixed    Step Decay   0.1, [150,250,350], [0.1]                          91.2%
VGG11 6bit-fixed    Ours         0.1                                                91.2%
CNV 1bit            Step Decay   0.1, [500,600,700,800], [0.1]                      78.5%
CNV 1bit            Ours         0.1                                                78.5%

64 of 65

Wrapping up, Why Does it Matter?

Reference (from the previous slide):

Network (CIFAR10)   Scheduler    Learning rate parameters (lr, milestones, gamma)   Top1 accuracy
Resnet20            Step Decay   0.1, [80,120,160,180], [0.1,0.1,0.1,0.05]          92.3%
Resnet20            Ours         0.1                                                92.6%

If you get these numbers wrong (from guessing, an online resource, a friend, etc.), you may not get good results. For example:

Resnet20            Step Decay   0.2, [160,180], [0.1]                              90.7%

In contrast, our method can be forgiving. For example:

Resnet20            Ours         0.2                                                92.6%

65 of 65

Disclaimer – There is no Silver Bullet for Deep Learning

Like every single method out there, we cannot guarantee that our method works better all the time.