Logistics
2nd homework will come out soon! (probably?)
AlexNet: first good network
https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
Entry for the ImageNet challenge
Much higher accuracy than “traditional” methods (SVM on SIFT)
Why did it take so long?
GPUs: How modern CNNs are possible
Alex Krizhevsky spent a long time implementing the components of a CNN on a GPU; his software, cuda-convnet, was the first “modern” neural network framework
GPUs allow a >100x speedup compared to CPUs, but training on ImageNet still took weeks
This combination of CNNs and GPUs started a revolution in computer vision just 5 years ago
(check NVIDIA's stock over the last 5 years)
What are these networks learning?
Features! Here's the 1st layer of AlexNet
Idea: find interesting images
https://arxiv.org/pdf/1312.6199.pdf
We can feed images into our network and see which ones activate certain neurons
Idea: make interesting images
Instead of optimizing the network's weights, we can do gradient descent on the image too
Optimize the image to activate certain neurons or certain layers, then we can learn what those layers or neurons do!
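The optimization loop is ordinary gradient ascent, just with the pixels as the variables instead of the weights. A toy sketch, where the "neuron" is a made-up differentiable function standing in for an activation in a real network:

```python
# Gradient ascent on the input "image" to maximize a neuron's activation.
# The network stays fixed; only the pixels get updated.
# The activation below is a hypothetical stand-in, not a real network.

PATTERN = [0.2, 0.8, 0.5]  # the pattern this toy neuron "likes"

def activation(img):
    # peaks when the pixels match PATTERN exactly
    return -sum((p - t) ** 2 for p, t in zip(img, PATTERN))

def grad(img):
    # analytic gradient of the activation w.r.t. each pixel
    return [-2 * (p - t) for p, t in zip(img, PATTERN)]

img = [0.0, 0.0, 0.0]  # start from a blank image
lr = 0.1
for _ in range(100):
    g = grad(img)
    img = [p + lr * gp for p, gp in zip(img, g)]  # ascend, not descend

# the image converges toward the pattern that maximizes the activation
```

In a real framework the gradient comes from backprop through the network rather than an analytic formula, but the loop is the same.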
You may have seen this under the name deep dream
Early layers: blobs
https://www.youtube.com/watch?v=SCE-QeDfXtA
Neurons respond strongly to high contrast
Middle layers: edges, curves, eyes
https://www.youtube.com/watch?v=SCE-QeDfXtA
Neurons respond to simple features and shapes
Later layers: eyes, objects, dogs
https://www.youtube.com/watch?v=SCE-QeDfXtA
Neurons respond to arrangements of features
Neural networks work!
At least, they seem to do what we want them to
Low-level features: lines, oriented edges
Mid-level features: combine edges into curves, shapes
High-level features: combine shapes into objects, scenes
Predictor: process features and predict the output
VGG: networks getting bigger
From the Visual Geometry Group at Oxford
Considered “very deep” at the time: 16-19 layers
VGG-16 is still commonly used as a feature extractor
(Although it really shouldn't be; there are much better alternatives)
VGG is inefficient
https://arxiv.org/pdf/1312.4400.pdf
Just stringing together a bunch of 3x3 convolutions is, in general, pretty inefficient.
Network in network idea from Lin et al.:
Use 3x3 convolutions to process spatial information
Use 1x1 convolutions to reduce the # of channels (dimensionality reduction)
E.g.:
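To see why the 1x1 trick helps, compare parameter counts. A quick sketch with illustrative channel sizes (these numbers are assumptions, not from the paper):

```python
# Weights-only parameter count: plain 3x3 conv vs. a 1x1 "bottleneck"
# reduction followed by a 3x3 conv. Channel sizes are illustrative.
c_in, c_out = 256, 256

plain = 3 * 3 * c_in * c_out            # 3x3 conv straight through

bottleneck = 64                          # reduce channels with 1x1 first
reduce_1x1 = 1 * 1 * c_in * bottleneck   # 1x1 dimensionality reduction
conv_3x3 = 3 * 3 * bottleneck * c_out    # 3x3 on far fewer channels

print(plain, reduce_1x1 + conv_3x3)  # 589824 vs 163840
```

Same spatial processing, roughly 3.6x fewer parameters; the 1x1 layer does the channel mixing cheaply.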
VGG (and AlexNet) are inefficient
Huge fully connected layers at the end
TONS of parameters
Do they actually help?
37,748,736 parameters
16,777,216 parameters
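Where those counts come from, as a sketch: AlexNet's last conv layer outputs a 6x6x256 feature map, and the hidden fully connected layers each have 4096 units.

```python
# Parameter counts (weights only, ignoring biases) for AlexNet's
# first two fully connected layers.
conv_out = 6 * 6 * 256   # flattened last conv feature map = 9216 values
hidden = 4096            # units per fully connected layer

fc6_params = conv_out * hidden   # 9216 * 4096
fc7_params = hidden * hidden     # 4096 * 4096

print(fc6_params)  # 37748736
print(fc7_params)  # 16777216
```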
NOPE!
Just use all convolutions!
How many? How big?
AlexNet is imbalanced
Per-layer compute from the figure: 200M, 896M, 299M, 449M, 299M, 34M, 75M, 8M ops. The convolutional layers dominate the compute; the fully connected layers are cheap in ops despite holding most of the parameters.
Darknet Reference Network
Per-layer compute from the figure: 57M, 151M, 2M ops
GoogLeNet: networks getting weird
https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf
Instead of a bunch of layers that are just 3x3 convolutions, use inception modules
Why are these better?
Another way to efficiency: split layers
Not just one filter size, but many: 1x1, 3x3, 5x5
Also use 1x1 convs to compress feature maps
Can make large networks that are still efficient, better than VGG yet smaller and faster.
GoogLeNet: networks getting weird
Like… really big
And this is more or less all we do now: try to make networks bigger without making them too expensive
Multiple outputs: auxiliary classifiers along the way help gradients reach the early layers
GoogLeNet v2, or Inception Net?
https://arxiv.org/pdf/1502.03167.pdf
Or something like that: GoogLeNet + batch norm
Residual connections
Normally, the output of two layers is: f(w * f(v * x))
Residual connections: f(w * f(v * x) + x)
Learning how to modify x: add some transformed amount to it
Gives the gradient another path back through the network, so less vanishing gradient
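A minimal numeric sketch of the two formulas above, with scalars standing in for layers and f = ReLU:

```python
# Plain two-layer block vs. the same block with a residual (skip) connection.
# Scalars stand in for the weight matrices; f is ReLU.

def f(x):
    return max(0.0, x)  # ReLU

v, w = 0.5, 0.5  # toy "layer weights"
x = 2.0          # toy input

plain = f(w * f(v * x))          # ordinary two-layer output
residual = f(w * f(v * x) + x)   # same block, plus the skip path

print(plain, residual)  # 0.5 2.5
```

The skip path also matters in the backward pass: the gradient of `residual` with respect to `x` includes a direct `+1` term that bypasses both weights, which is what keeps it from vanishing through many stacked blocks.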
ResNet
3x3 conv blocks or 3x3 and 1x1 conv blocks
Residual connections
VERY deep, 100+ layers
Grouped convolutions
Most filters look at every channel in the input
Very expensive
Maybe not needed? A filter might only pull information from a few of them
Grouped convolutions:
Split the input feature map into groups
Run convolutions on each group independently
Recombine
E.g. a 3x3 conv layer with a 32 x 32 x 256 input, 128 filters, 32 groups:
Split the input into 32 feature maps, each 32 x 32 x 8
Run 4 filters of size 3x3x8 on each group
Merge the 4 * 32 channels back together to get a 32 x 32 x 128 output
Input and output keep the same dimensions, with much less computation
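Working the example above through weight counts makes the saving concrete:

```python
# Grouped vs. ordinary convolution for the example on the slide:
# 32x32x256 input, 128 filters of 3x3, 32 groups. Weights only.
c_in, c_out, k, groups = 256, 128, 3, 32

ordinary = k * k * c_in * c_out  # every filter sees all 256 channels
grouped = groups * (k * k * (c_in // groups) * (c_out // groups))
# 32 groups, each with 4 filters of size 3x3x8

print(ordinary, grouped, ordinary // grouped)  # 294912 9216 32
```

With 32 groups the parameter (and multiply) count drops by exactly the group count, since each filter now touches 1/32 of the input channels.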
ResNeXt
https://arxiv.org/pdf/1611.05431.pdf
Replace 3x3 blocks with larger grouped convs
“Larger” network but same computational complexity
What’s NeXt?
Starting to saturate ImageNet, fighting over 1-2%
Training classifiers in the wild
Typically you have a much smaller dataset than 1.2 million images and 1,000 classes.
What problems do we encounter with less data, say 10,000 images and 500 classes?
First ImageNet, then the world!
For dealing with smaller datasets where we might overfit, pretraining is key:
What else is NeXt?
Starting to saturate ImageNet, fighting over 1-2%
But now vision really works, and other tasks open up:
Segmentation
Object detection
Captioning
…
Just different loss functions!
Semantic Segmentation
https://arxiv.org/pdf/1505.04366.pdf
Encoder: downsamples the image to coarse features
Decoder: upsamples back to fine-grained, per-pixel predictions
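The encoder halves the spatial resolution at each pooling stage and the decoder mirrors it back up. A quick sanity check of the shapes, assuming a 224x224 input and a VGG-style encoder with five 2x2 poolings (sizes are illustrative):

```python
# Spatial sizes through a symmetric encoder-decoder segmentation network.
size = 224                 # assumed input resolution
encoder = [size]
for _ in range(5):         # five 2x2 poolings, VGG-style
    size //= 2
    encoder.append(size)
print(encoder)             # [224, 112, 56, 28, 14, 7]

decoder = [size]
for _ in range(5):         # five 2x upsamplings mirror the encoder
    size *= 2
    decoder.append(size)
print(decoder)             # [7, 14, 28, 56, 112, 224]
```

Because the decoder lands back at the input resolution, the network can emit one class prediction per input pixel.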
U-Net / SegNet
https://arxiv.org/pdf/1511.00561.pdf, https://arxiv.org/pdf/1505.04597.pdf
Spatial pyramid pooling
https://arxiv.org/pdf/1612.01105.pdf
DeepLabv3+
https://arxiv.org/pdf/1802.02611.pdf
Atrous convolutions: spaced inputs
Pre-train on ImageNet
Fine-tune on segmentation
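An atrous (dilated) convolution reads inputs that are `rate` pixels apart, so a k x k kernel covers a span of k + (k-1)(rate-1) pixels per axis while keeping only k*k weights. A quick check of that formula:

```python
# Effective span of an atrous (dilated) convolution kernel.
def effective_size(k, rate):
    # span covered by a k x k kernel whose taps are `rate` pixels apart
    return k + (k - 1) * (rate - 1)

# a 3x3 kernel at dilation rates 1, 2, 3 spans 3, 5, 7 pixels per axis
spans = [effective_size(3, r) for r in (1, 2, 3)]
print(spans)  # [3, 5, 7]
```

This is why atrous convs suit segmentation: the receptive field grows without adding parameters or downsampling away spatial detail.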