Recaps on ResNets
Some slides were adapted/taken from various sources, including Andrew Ng's Coursera lectures, the CS231n: Convolutional Neural Networks for Visual Recognition lectures, and a lecture by Aharon Kalantar et al.
Also, don't forget to watch Kaiming He's CVPR 2016 talk on "Deep Residual Learning for Image Recognition" at https://youtu.be/C6tLw-rPQ2o
In this Lecture
Intro
ResNet
Technical details
Results
ResNet 1000
Comparison
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
What happens when we continue stacking deeper layers on a "plain" convolutional neural network?
The 56-layer model performs worse than the 20-layer model on both training and test error.
-> The deeper model performs worse, but it's not caused by overfitting!
[Figure: training error and test error vs. iterations for the 20-layer and 56-layer plain networks; the 56-layer curves lie above the 20-layer curves in both plots.]
Deep vs Shallow Networks
Deeper models are harder to optimize
Challenges
Partial Solutions for Vanishing/Exploding Gradients
ResNet
Plain Network
Residual Blocks
Skip Connections ("shortcuts")
The new model
ResNet as a ConvNet
[Figure: full ResNet stack for ImageNet — input → 7x7 conv, 64, /2 → pool → stacked pairs of 3x3 conv layers (64 filters, then 128 with /2, ..., up to 512 with /2) → pool → FC 1000 → softmax. Inset: a residual block — input x goes through 3x3 conv → relu → 3x3 conv to produce F(x); the identity shortcut adds x back, and the block outputs relu(F(x) + x).]
Full ResNet architecture:
- Stack residual blocks
- Every residual block has two 3x3 conv layers
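The shortcut arithmetic above can be sketched in a few lines. This is a minimal illustration, not the real conv implementation — plain linear layers stand in for the two 3x3 convs:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """Compute relu(F(x) + x): F is two weight layers with a ReLU in
    between (linear layers standing in for the two 3x3 convs)."""
    f = W2 @ relu(W1 @ x)   # F(x): the residual mapping
    return relu(f + x)      # identity shortcut: add x before the final ReLU

rng = np.random.default_rng(0)
x = rng.standard_normal(8)

# With all-zero weights, F(x) = 0 and the block reduces to relu(x) —
# this is why extra residual blocks cannot hurt a solution the shallower
# network could already represent.
W0 = np.zeros((8, 8))
assert np.allclose(residual_block(x, W0, W0), relu(x))
```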
ResNet Architecture
Full ResNet architecture (cont.):
- Periodically, double the number of filters and downsample spatially using stride 2 (/2 in each spatial dimension), e.g. going from 3x3 conv, 64 filters to 3x3 conv, 128 filters, /2 spatially with stride 2
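The stride-2 transition can be checked with quick arithmetic. A small sketch, with channel and spatial sizes taken from the figure and 3x3-conv cost modeled as H·W·C_in·C_out·9 multiplies:

```python
import numpy as np

x = np.zeros((56, 56, 64))           # feature map before the transition (H, W, C)
y = x[::2, ::2, :]                   # stride-2 sampling halves each spatial dim
assert y.shape == (28, 28, 64)

# Approximate 3x3-conv cost: H * W * C_in * C_out * 9 multiplies.
cost_before = 56 * 56 * 64 * 64 * 9     # a 64 -> 64 layer before the transition
cost_next = 28 * 28 * 128 * 128 * 9     # a 128 -> 128 layer after it
assert cost_next == cost_before         # doubling filters balances the downsampling
```

So halving each spatial dimension while doubling the filter count keeps the per-layer compute constant through the network.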
ResNet Architecture
Full ResNet architecture (cont.):
- Additional conv layer at the beginning (the 7x7 conv, 64, /2)
ResNet Architecture
Full ResNet architecture (cont.):
- No FC layers at the end besides the final FC 1000 that outputs the class scores
- Global average pooling layer after the last conv layer
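The ending of the network can be sketched as follows — a minimal numpy illustration (the 7x7x512 last-feature-map size assumes a 224x224 input):

```python
import numpy as np

# Last conv layer output for a 224x224 input: 7x7 spatial grid, 512 channels.
feat = np.random.default_rng(0).standard_normal((7, 7, 512))

pooled = feat.mean(axis=(0, 1))      # global average pooling: one value per channel
assert pooled.shape == (512,)

# The only FC layer maps the pooled 512-d vector to the 1000 class scores.
W = np.zeros((1000, 512))
b = np.zeros(1000)
scores = W @ pooled + b
assert scores.shape == (1000,)
```

Global average pooling replaces the large FC layers of VGG/AlexNet, removing most of their parameters.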
ResNet Architecture
Total depths of 34, 50, 101, or 152 layers for ImageNet
ResNet Architecture
For deeper networks (ResNet-50+), use a "bottleneck" layer to improve efficiency (similar to GoogLeNet).
[Figure: bottleneck block — 28x28x256 input → 1x1 conv, 64 → 3x3 conv, 64 → 1x1 conv, 256 → 28x28x256 output.]
ResNet Architecture
- The first 1x1 conv, 64 filters, projects the 28x28x256 input down to 28x28x64
- The 3x3 conv then operates over only 64 feature maps
- The final 1x1 conv, 256 filters, projects back to 256 feature maps (28x28x256)
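The efficiency gain is easy to verify by counting weights — a quick sanity check using the channel sizes from the figure (biases and BN parameters ignored):

```python
# Weight counts for the two block designs on a 256-channel input.
plain = 2 * (3 * 3 * 256 * 256)        # two 3x3 convs at 256 channels
bottleneck = (1 * 1 * 256 * 64         # 1x1 projects 256 -> 64
              + 3 * 3 * 64 * 64        # 3x3 operates on only 64 maps
              + 1 * 1 * 64 * 256)      # 1x1 projects back 64 -> 256

assert plain == 1_179_648
assert bottleneck == 69_632
assert plain / bottleneck > 16         # roughly 17x fewer weights
```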
ResNet Architecture
Residual Blocks (skip connections)
Deeper Bottleneck Architecture
Deeper Bottleneck Architecture (Cont.)
Why Do ResNets Work Well?
Why Do ResNets Work Well? (Cont.)
Training ResNet in practice
Loss Function
Experimental Results
- Able to train very deep networks without degradation (152 layers on ImageNet, 1202 on CIFAR)
- Deeper networks now achieve lower training error, as expected
- Swept 1st place in all ILSVRC and COCO 2015 competitions
ILSVRC 2015 classification winner (3.6% top-5 error) -- better than "human performance"! (Russakovsky et al. 2014)
Results
Comparing Plain to ResNet (18/34 Layers)
Comparing Plain to Deeper ResNet
[Figure: train error and test error curves.]
ResNet on More than 1000 Layers
Identity Mappings in Deep Residual Networks
Identity Mappings in Deep Residual Networks
Improvement on CIFAR-10
Reduce Learning Time with Random Layer Drops
Wide Residual Networks + ResNeXt
Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.
An Analysis of Deep Neural Network Models for Practical Applications, 2017.
Comparing complexity...
Inception-v4: ResNet + Inception!
VGG: highest memory, most operations
GoogLeNet: most efficient
AlexNet: smaller compute, still memory-heavy, lower accuracy
ResNet: moderate efficiency depending on the model, highest accuracy
Appendix
Wide Residual Networks showed that the power of these networks actually lies in the residual blocks, and that beyond a certain point the effect of depth is only supplementary.
It is worth noting that training time is reduced even though the number of parameters and floating-point operations has increased, because wider models take advantage of GPUs being more efficient at parallel computations on large tensors.
Wide Residual Networks
ResNeXt Networks
Residual connections are helpful for simplifying a network's optimization, whereas aggregated transformations lead to stronger representational power.
The aggregated transformation in Eqn. (2) serves as the residual function, where y is the output.
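Written out, as in the ResNeXt paper, the aggregated transformation is a sum over C parallel low-dimensional transformations, where C is the cardinality:

```latex
\mathcal{F}(\mathbf{x}) = \sum_{i=1}^{C} \mathcal{T}_i(\mathbf{x}),
\qquad
\mathbf{y} = \mathbf{x} + \sum_{i=1}^{C} \mathcal{T}_i(\mathbf{x})
```

Each T_i is one of the narrow 1x1 → 3x3 → 1x1 paths, and the identity shortcut x is added exactly as in a standard residual block.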
ResNeXt Networks
Test error rates vs. model sizes:
Increasing cardinality is more effective than increasing width.
The graph compares results and model sizes against Wide ResNet, the best published record at the time (observed on ImageNet-1K).
1x1 Convolutions
1x1 Convolutions
1x1 filters are often used to reduce the dimensionality (number of channels) of a layer, typically followed by a ReLU nonlinearity.
[Lin et al., 2013. Network in Network]
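Concretely, a 1x1 convolution is just a per-position linear map across channels. A minimal numpy sketch (the 28x28x192 → 28x28x32 sizes are illustrative):

```python
import numpy as np

# A 1x1 conv multiplies, at each of the H x W positions, the C_in channel
# vector by a C_out x C_in weight matrix — no spatial mixing at all.
rng = np.random.default_rng(0)
x = rng.standard_normal((28, 28, 192))     # H x W x C_in
W = rng.standard_normal((32, 192))         # C_out x C_in (32 1x1 filters)

y = np.einsum('hwc,oc->hwo', x, W)         # same spatial size, fewer channels
assert y.shape == (28, 28, 32)

# Sanity check at a single spatial position:
assert np.allclose(y[3, 5], W @ x[3, 5])
```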
Inception Network
[Figure: Inception module on a 28x28 feature map — parallel conv branches and a MAX-POOL branch producing 64, 128, 32, and 32 channels, concatenated along depth.]
[Szegedy et al. 2014. Going deeper with convolutions]
Inception Networks: Computational Cost
Inception Networks with 1x1 Convolutions
Network in Network (NiN)
[Lin et al. 2014]
- Mlpconv layer with a "micronetwork" within each conv layer to compute more abstract features for local patches
- Micronetwork uses a multilayer perceptron (FC, i.e. 1x1 conv, layers)
- Precursor to GoogLeNet and ResNet "bottleneck" layers
- Philosophical inspiration for GoogLeNet
Figures copyright Lin et al., 2014. Reproduced with permission.
Improving ResNets...
[He et al. 2016]
- Improved ResNet block design from the creators of ResNet
- Creates a more direct path for propagating information throughout the network (moves activation to the residual mapping pathway)
- Gives better performance
Identity Mappings in Deep Residual Networks
[Figure: pre-activation residual block — the residual path applies BN → ReLU → conv → BN → ReLU → conv, leaving the identity shortcut clean.]
Improving ResNets...
[Zagoruyko et al. 2016]
- Argues that residuals are the important factor, not depth
- Uses wider residual blocks (F x k filters instead of F filters in each layer)
- 50-layer wide ResNet outperforms 152-layer original ResNet
- Increasing width instead of depth is more computationally efficient (parallelizable)
Wide Residual Networks
[Figure: basic residual block — two 3x3 conv layers with F filters — vs. wide residual block — two 3x3 conv layers with F x k filters.]
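The cost of widening can be quantified with a quick check (the F and k values here are illustrative): widening every layer by k scales a block's weight count by roughly k².

```python
# Weights in a residual block with two 3x3 convs, at width F vs. width F*k
# (assuming the input channel count matches the layer width).
F, k = 64, 2
basic = 2 * 3 * 3 * F * F            # two 3x3 conv layers, F -> F each
wide = 2 * 3 * 3 * (F * k) ** 2      # the same layers widened to F*k channels

assert wide == k ** 2 * basic        # widening by k multiplies weights by k^2
```

The extra weights sit in large dense tensors, which is exactly the shape of work GPUs parallelize well — hence the faster wall-clock training noted above.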
Improving ResNets...
[Xie et al. 2016]
- Also from the creators of ResNet
- Increases width of the residual block through multiple parallel pathways ("cardinality")
- Parallel pathways similar in spirit to the Inception module
Aggregated Residual Transformations for Deep
Neural Networks (ResNeXt)
[Figure: a standard bottleneck block (256-d in → 1x1 conv, 64 → 3x3 conv, 64 → 1x1 conv, 256 → 256-d out) vs. a ResNeXt block with 32 parallel paths, each 256-d in → 1x1 conv, 4 → 3x3 conv, 4 → 1x1 conv, 256, aggregated into the 256-d output.]
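Despite the 32 parallel paths, the ResNeXt block is designed to match the standard bottleneck's complexity. A quick weight count using the path sizes on this slide:

```python
# Per-block weight counts on a 256-d input (biases/BN ignored).
standard = 256 * 64 + 3 * 3 * 64 * 64 + 64 * 256       # one bottleneck path
resnext = 32 * (256 * 4 + 3 * 3 * 4 * 4 + 4 * 256)     # 32 narrow paths

assert standard == 69_632
assert resnext == 70_144          # nearly identical parameter budget
```

So cardinality adds representational power essentially for free in parameters, which is why increasing cardinality beats increasing width at equal model size.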