1 of 57

Recaps on ResNets

Some slides were adapted or taken from various sources, including Andrew Ng’s Coursera lectures, the CS231n: Convolutional Neural Networks for Visual Recognition lectures, and a lecture by Aharon Kalantar et al.

Also, don’t forget to watch Kaiming He’s talk at the CVPR 2016 conference on “Deep Residual Learning for Image Recognition” at https://youtu.be/C6tLw-rPQ2o

2 of 57

  • Introducing a breakthrough neural network architecture from 2015.
  • Why deep?
  • What is the problem with training deep networks?
  • ResNet, and how it lets us gain more performance from deeper networks.
  • Some results, improvements, and further work.

In this Lecture

Intro

ResNet

Technical details

Results

ResNet 1000

Comparison

3 of 57

ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners


4 of 57

What happens when we continue stacking deeper layers on a “plain” convolutional neural network?

The 56-layer model performs worse on both training and test error.

-> The deeper model performs worse, but this is not caused by overfitting!

[Plots: training error and test error vs. iterations; the 56-layer curve lies above the 20-layer curve in both.]

Deep vs Shallow Networks


5 of 57

  • The deeper model should be able to perform at least as well as the shallower model.
  • A solution by construction: copy the learned layers from the shallower model and set the additional layers to identity mappings.

Deeper models are harder to optimize
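The construction argument can be sketched in a few lines of NumPy: copy the shallow model's learned layers and set the extra layer's weights to the identity, and the deeper model computes exactly the same function (a toy sketch with dense layers, not the paper's exact construction):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# A shallow "learned" model: two dense layers.
W1 = rng.standard_normal((4, 4))
W2 = rng.standard_normal((4, 4))

def shallow(x):
    return relu(W2 @ relu(W1 @ x))

# Deeper model by construction: copy the learned layers and set the
# extra layer to the identity mapping (W3 = I, no bias).
W3 = np.eye(4)

def deeper(x):
    h = relu(W2 @ relu(W1 @ x))
    return relu(W3 @ h)  # relu(I @ h) = h, since h is already non-negative

x = rng.standard_normal(4)
assert np.allclose(shallow(x), deeper(x))
```

So a deeper model that matches the shallow one exists; the observed degradation means the optimizer fails to find it.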


6 of 57

  • Deeper neural networks start to degrade in performance.
  • Vanishing/exploding gradients – may require extremely careful parameter initialization to make training work, and can still occur even with the best initialization.
  • Long training times – due to the very large number of training parameters.

Challenges


7 of 57

  • Batch Normalization – rescales activations over each mini-batch.
  • Smart weight initialization – for example, Xavier initialization.
  • Training portions of the network individually.

Partial Solutions for Vanishing/Exploding Gradients
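As a rough sketch of what these initializations look like (the exact scaling conventions vary by library, so treat this as illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 512, 256

# Xavier/Glorot initialization: weight variance 2 / (fan_in + fan_out),
# chosen to keep activation variance roughly stable through linear layers.
W_xavier = rng.standard_normal((fan_out, fan_in)) * np.sqrt(2.0 / (fan_in + fan_out))

# "Xavier/2" (He) initialization: variance 2 / fan_in, which compensates
# for ReLU zeroing out half of the activations.
W_he = rng.standard_normal((fan_out, fan_in)) * np.sqrt(2.0 / fan_in)

print(np.var(W_xavier), np.var(W_he))
```

The empirical variances match the target scales closely because each matrix has fan_in × fan_out samples.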


8 of 57

  • A specialized network architecture introduced by Microsoft.
  • Connects the input of a layer to a later part of the network, creating “shortcuts”.
  • A simple idea – great improvements in both performance and training time.

ResNet


9 of 57

[Diagram: a plain network, a simple stack of layers each feeding only the next.]

Plain Network


10 of 57

[Diagram: residual blocks; each block's input skips ahead and is added to its output.]

Residual Blocks


11 of 57

  • Such connections are referred to as skip connections or shortcuts. In general, similar models can skip over several layers.
  • The residual part of the network is treated as a unit with an input and an output.
  • The unit's input is added to its output, so the dimensions usually need to match.
  • Another option is to use a projection into the output space.
  • With identity shortcuts, no additional training parameters are used.

Skip Connections “shortcuts”
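A minimal NumPy sketch of a two-layer residual unit with an identity shortcut (illustrative only; real blocks use convolutions, batch norm, and biases):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """Two-layer residual unit: output = relu(F(x) + x),
    where F(x) = W2 @ relu(W1 @ x) and the shortcut is the identity."""
    f = W2 @ relu(W1 @ x)
    return relu(f + x)  # the skip connection adds the input back

d = 8
x = rng.standard_normal(d)
W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
y = residual_block(x, W1, W2)

# If the residual branch has zero weights, the block reduces to an
# identity mapping of (non-negative) inputs -- easy for SGD to find.
zero = np.zeros((d, d))
assert np.allclose(residual_block(relu(x), zero, zero), relu(x))
```

This is why identity is the "default" behavior of a residual block: doing nothing only requires the weights to shrink toward zero.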


12 of 57

The new model

  • Instead of learning the desired mapping H(x) directly, each block learns the residual F(x) = H(x) - x, so its output is F(x) + x.


13 of 57

ResNet as a ConvNet

  • Until now we talked about fully connected layers.
  • The ResNet idea extends easily to convolutional models.
  • Other adaptations of this idea can easily be introduced to almost any kind of deep layered network.


14 of 57

[Diagram: full ResNet architecture. Input -> 7x7 conv, 64, /2 -> pool -> stacked 3x3 conv layers (64, then 128, ..., up to 512, with /2 downsampling stages) -> pool -> FC 1000 -> softmax. Residual block detail: X -> 3x3 conv -> relu -> 3x3 conv gives F(x); the identity shortcut adds X, and relu is applied to F(x) + x.]

Full ResNet architecture:

  - Stack residual blocks
  - Every residual block has two 3x3 conv layers

ResNet Architecture


15 of 57


Full ResNet architecture:

  - Stack residual blocks
  - Every residual block has two 3x3 conv layers
  - Periodically, double the number of filters and downsample spatially using stride 2 (/2 in each dimension), e.g. from 3x3 conv, 64 filters to 3x3 conv, 128 filters, /2 spatially with stride 2

ResNet Architecture


16 of 57


Full ResNet architecture:

  - Stack residual blocks
  - Every residual block has two 3x3 conv layers
  - Periodically, double the number of filters and downsample spatially using stride 2 (/2 in each dimension)
  - Additional conv layer at the beginning

ResNet Architecture


17 of 57


Full ResNet architecture:

  - Stack residual blocks
  - Every residual block has two 3x3 conv layers
  - Periodically, double the number of filters and downsample spatially using stride 2 (/2 in each dimension)
  - Additional conv layer at the beginning
  - No FC layers at the end (only FC 1000 to output the classes); a global average pooling layer follows the last conv layer

ResNet Architecture


18 of 57


Total depths of 34, 50, 101, or 152 layers for ImageNet.

ResNet Architecture


19 of 57

For deeper networks (ResNet-50+), use a “bottleneck” layer to improve efficiency (similar to GoogLeNet).

[Diagram: bottleneck block on a 28x28x256 input: 1x1 conv, 64 -> 3x3 conv, 64 -> 1x1 conv, 256, producing a 28x28x256 output.]

ResNet Architecture


20 of 57

For deeper networks (ResNet-50+), use a “bottleneck” layer to improve efficiency (similar to GoogLeNet):

  - 1x1 conv, 64 filters, projects the 28x28x256 input down to 28x28x64
  - the 3x3 conv then operates over only 64 feature maps
  - 1x1 conv, 256 filters, projects back to 256 feature maps (28x28x256 output)

ResNet Architecture


21 of 57

Residual Blocks (skip connections)

22 of 57

Deeper Bottleneck Architecture


23 of 57

Deeper Bottleneck Architecture (Cont.)

  • Addresses the high training time of very deep networks.
  • Keeps the time complexity the same as a two-layer convolution.
  • Allows us to increase the number of layers.
  • Allows the model to converge much faster.
  • The 152-layer ResNet has 11.3 billion FLOPs, while the VGG-16/19 nets have 15.3/19.6 billion FLOPs.
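The complexity claim is easy to check by counting weights. The sketch below compares two plain 3x3 convs on 256 channels against the bottleneck design from the architecture slides (biases ignored; per-position FLOPs are proportional to these counts):

```python
def conv_params(k, c_in, c_out):
    """Number of weights in a k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

# Two plain 3x3 convs operating directly on 256 channels:
plain = 2 * conv_params(3, 256, 256)

# Bottleneck: 1x1 reduce to 64, 3x3 on 64, 1x1 expand back to 256:
bottleneck = (conv_params(1, 256, 64)
              + conv_params(3, 64, 64)
              + conv_params(1, 64, 256))

print(plain, bottleneck)  # -> 1179648 69632, roughly 17x fewer weights
```

This is why the bottleneck keeps a 256-channel block about as cheap as a two-layer 3x3 block on 64 channels (2 * 9 * 64 * 64 = 73,728 weights).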


24 of 57

Why Do ResNets Work Well?

  •  


25 of 57

Why Do ResNets Work Well? (Cont)

  • In theory a ResNet can represent the same functions as the corresponding plain network, but in practice, thanks to the above, convergence is much faster.
  • No additional training parameters are introduced.
  • No additional complexity is introduced.


26 of 57

Training ResNet in practice

  • Batch Normalization after every CONV layer.
  • Xavier/2 initialization from He et al.
  • SGD + momentum (0.9).
  • Learning rate 0.1, divided by 10 when the validation error plateaus.
  • Mini-batch size 256.
  • Weight decay of 1e-4.
  • No dropout used.
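The learning-rate schedule can be sketched as a small helper (`step_lr_on_plateau` is a hypothetical name; the exact plateau criterion is not specified in the source):

```python
def step_lr_on_plateau(lr, val_errors, patience=3, factor=0.1, eps=1e-4):
    """Divide the learning rate by 10 when the validation error has not
    improved on its previous best for `patience` consecutive epochs."""
    if len(val_errors) > patience:
        recent = val_errors[-patience:]
        best_before = min(val_errors[:-patience])
        if min(recent) > best_before - eps:  # no improvement recently
            return lr * factor
    return lr

lr = 0.1
errors = [0.9, 0.5, 0.4, 0.41, 0.40, 0.42]  # plateaued for 3 epochs
lr = step_lr_on_plateau(lr, errors)
print(lr)  # ~0.01 after the plateau is detected
```

In practice a framework scheduler (e.g. a reduce-on-plateau policy) plays this role; the sketch only shows the decision rule.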


27 of 57

Loss Function

  • To measure the model's loss, a combination of softmax and cross-entropy is used.
  • The logits are normalized into probabilities with the softmax function, and cross-entropy is computed on the result.
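A minimal NumPy version of this loss, showing the order of operations (softmax normalizes first, then cross-entropy scores the result):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(logits, target):
    """Negative log-probability of the correct class, with the logits
    first normalized into probabilities by softmax."""
    p = softmax(logits)
    return -np.log(p[target])

logits = np.array([2.0, 1.0, 0.1])
loss = cross_entropy(logits, 0)
# A confident, correct prediction gives a smaller loss than a wrong one:
assert loss < cross_entropy(logits, 2)
```

Frameworks usually fuse the two steps into one numerically stable op, but the math is exactly this composition.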


28 of 57

Experimental Results

  - Able to train very deep networks without degrading (152 layers on ImageNet, 1202 on CIFAR)
  - Deeper networks now achieve lower training error, as expected
  - Swept 1st place in all ILSVRC and COCO 2015 competitions

ILSVRC 2015 classification winner (3.6% top-5 error), better than “human performance”! (Russakovsky 2014)


29 of 57

Comparing Plain to ResNet (18/34 Layers)


30 of 57

Comparing Plain to Deeper ResNet

Train Error:

Test Error:


31 of 57

ResNet on More than 1000 Layers

  • To further improve the learning of extremely deep ResNets, “Identity Mappings in Deep Residual Networks” (Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, 2016) suggests passing the input directly to the final residual layer, allowing the network to easily learn an identity mapping in both the forward and backward passes.


32 of 57

Identity Mappings in Deep Residual Networks


33 of 57

Identity Mappings in Deep Residual Networks

  •  

34 of 57

Improvement on CIFAR-10

  • Another important improvement: using Batch Normalization as pre-activation improves regularization.
  • This improvement leads to better performance for smaller networks as well.


35 of 57

Reduce Learning Time with Random Layer Drops

  • Layers are dropped during training, and the full network is used at test time.
  • Residual blocks are used as the network's building block.
  • During training, the input flows through both the shortcut and the weights.
  • Training: each layer has a “survival probability” and is randomly dropped.
  • Testing: all blocks are kept active, and each block's output is re-calibrated according to its survival probability during training.
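A sketch of one such block (simplified; `stochastic_depth_block` and `residual_fn` are illustrative names, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_depth_block(x, residual_fn, p_survive, training):
    """Residual block with a random layer drop (stochastic depth).
    Training: the residual branch is kept with probability p_survive.
    Testing: always kept, but scaled by p_survive so the output
    matches the branch's expected contribution during training."""
    if training:
        if rng.random() < p_survive:
            return x + residual_fn(x)
        return x  # block dropped: only the identity shortcut remains
    return x + p_survive * residual_fn(x)

f = lambda x: 2.0 * x  # a stand-in residual function
x = np.ones(4)
out = stochastic_depth_block(x, f, p_survive=0.5, training=False)
print(out)  # -> [2. 2. 2. 2.]
```

The test-time scaling is the same expectation-matching trick used by dropout.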


36 of 57

Wide Residual Networks + ResNeXt

37 of 57

Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.

An Analysis of Deep Neural Network Models for Practical Applications, 2017.

Comparing complexity...


38 of 57

An Analysis of Deep Neural Network Models for Practical Applications, 2017.

Comparing complexity...

Inception-v4: Resnet + Inception!


39 of 57

Comparing complexity...

An Analysis of Deep Neural Network Models for Practical Applications, 2017.

VGG: highest memory, most operations


40 of 57

Comparing complexity...

An Analysis of Deep Neural Network Models for Practical Applications, 2017.

GoogLeNet: most efficient


41 of 57

Comparing complexity...

An Analysis of Deep Neural Network Models for Practical Applications, 2017.

AlexNet: smaller compute, still memory-heavy, lower accuracy


42 of 57

Comparing complexity...

An Analysis of Deep Neural Network Models for Practical Applications, 2017.

ResNet: moderate efficiency depending on model, highest accuracy


43 of 57

Appendix

44 of 57

Wide Residual Networks showed that the power of these networks actually lies in the residual blocks, and that the effect of depth is supplementary beyond a certain point.

It is worth noting that training time is reduced because wider models take advantage of GPUs being more efficient at parallel computations on large tensors, even though the number of parameters and floating-point operations has increased.

Wide Residual Networks

45 of 57

ResNeXt Networks

Residual connections are helpful for simplifying a network's optimization, whereas aggregated transformations lead to stronger representational power.

The aggregated transformation in Eqn.(2) serves as the residual function, where y is the output.
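A toy NumPy sketch of the aggregated transformation y = x + sum_i T_i(x), where each branch T_i is a low-dimensional down-project/up-project pair (illustrative shapes only, not the paper's grouped-convolution implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def resnext_unit(x, branches):
    """Aggregated residual transformation: the residual function is the
    sum of C parallel low-dimensional branches (C = cardinality)."""
    return x + sum(t(x) for t in branches)

d, bottleneck, cardinality = 16, 2, 8
branches = []
for _ in range(cardinality):
    W_down = rng.standard_normal((bottleneck, d)) / np.sqrt(d)
    W_up = rng.standard_normal((d, bottleneck)) / np.sqrt(bottleneck)
    # bind the weights now to avoid Python's late-binding closure pitfall
    branches.append(lambda x, a=W_down, b=W_up: b @ relu(a @ x))

x = rng.standard_normal(d)
y = resnext_unit(x, branches)
assert y.shape == x.shape
```

Cardinality (the number of parallel branches) replaces width as the dimension being scaled up.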

46 of 57

ResNeXt Networks

47 of 57

ResNeXt Networks

48 of 57

ResNeXt Networks

Test error rates vs. model sizes: increasing cardinality is more effective than increasing width.

This graph shows the results and model sizes, compared with the Wide ResNet, which is the best published record (observed on ImageNet-1K).

49 of 57

1x1 Convolutions

50 of 57

1x1 Convolutions

1x1 filters are often used to reduce the dimensionality of a layer, typically followed by a ReLU.

[Lin et al., 2013. Network in network]
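A quick NumPy illustration with hypothetical channel counts: a 1x1 convolution is just a per-position linear map over channels, so it changes the depth without touching the spatial dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Mix 192 input channels down to 32 output channels at every pixel.
h, w, c_in, c_out = 28, 28, 192, 32
x = rng.standard_normal((h, w, c_in))
W = rng.standard_normal((c_in, c_out)) / np.sqrt(c_in)

y = x @ W  # applied independently at each of the 28x28 positions
print(y.shape)  # -> (28, 28, 32)
```

Because each output position depends only on the channels at that same position, this costs h * w * c_in * c_out multiplies, far cheaper than a spatial convolution with the same channel change.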

51 of 57

Inception Network

[Diagram: an Inception module applies 1x1, 3x3, and 5x5 convolutions and MAX-POOL in parallel on a 28x28 input and concatenates the resulting channel blocks (64, 128, 32, 32).]

[Szegedy et al. 2014. Going deeper with convolutions]

52 of 57

Inception Networks: Computational Cost

53 of 57

Inception Networks with 1x1 Convolutions

54 of 57

Network in Network (NiN)

[Lin et al. 2014]

  - Mlpconv layer with a “micronetwork” within each conv layer to compute more abstract features for local patches
  - Micronetwork uses a multilayer perceptron (FC, i.e. 1x1 conv layers)
  - Precursor to GoogLeNet and ResNet “bottleneck” layers
  - Philosophical inspiration for GoogLeNet

Figures copyright Lin et al., 2014. Reproduced with permission.

55 of 57

Improving ResNets...

[He et al. 2016]

  - Improved ResNet block design from the creators of ResNet
  - Creates a more direct path for propagating information throughout the network (moves activation to the residual mapping pathway)
  - Gives better performance

Identity Mappings in Deep Residual Networks

[Diagram: pre-activation residual block: BN -> ReLU -> conv -> BN -> ReLU -> conv.]

56 of 57

Improving ResNets...

[Zagoruyko et al. 2016]

  - Argues that residuals are the important factor, not depth
  - Uses wider residual blocks (F x k filters instead of F filters in each layer)
  - A 50-layer wide ResNet outperforms the 152-layer original ResNet
  - Increasing width instead of depth is more computationally efficient (parallelizable)

Wide Residual Networks

Basic residual block: two 3x3 conv, F layers. Wide residual block: two 3x3 conv, F x k layers.

57 of 57

Improving ResNets...

[Xie et al. 2016]

  - Also from the creators of ResNet
  - Increases the width of the residual block through multiple parallel pathways (“cardinality”)
  - Parallel pathways are similar in spirit to the Inception module

Aggregated Residual Transformations for Deep Neural Networks (ResNeXt)

[Diagram: a ResNet bottleneck block (256-d in -> 1x1 conv, 64 -> 3x3 conv, 64 -> 1x1 conv, 256 -> 256-d out) vs. a ResNeXt block with 32 parallel paths, each 256-d in -> 1x1 conv, 4 -> 3x3 conv, 4 -> 1x1 conv, 256, aggregated into the 256-d out.]