
Logistics

2nd homework will come out soon! (probably?)


AlexNet: first good network

https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

Entry for the ImageNet challenge
Much higher accuracy than “traditional” methods (SVM on SIFT)
Why did it take so long?


GPUs: How modern CNNs are possible

Alex Krizhevsky spent a long time implementing the components of a CNN on a GPU; his software, cuda-convnet, was the first “modern” neural network framework

GPUs allow a >100x speedup compared to CPUs, but training still took weeks on ImageNet

This idea, CNNs + GPUs, started a revolution in computer vision just 5 years ago

(check NVIDIA’s stock over the last 5 years)


What are these networks learning?

Features! Here’s the 1st layer of AlexNet


Idea: find interesting images

https://arxiv.org/pdf/1312.6199.pdf

We can feed images into our network and see which ones activate certain neurons
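A minimal PyTorch sketch of this (assuming a pretrained torchvision AlexNet, and a dataloader named loader, which is hypothetical): hook a layer, record one channel's activation per image, and keep the images that excite it most.

import torch
import torchvision

model = torchvision.models.alexnet(weights="IMAGENET1K_V1").eval()

acts = []
def hook(module, inputs, output):
    # mean activation of channel 42 in this layer, one number per image
    acts.append(output[:, 42].mean(dim=(1, 2)))
model.features[8].register_forward_hook(hook)  # conv4 in torchvision's AlexNet

scores = []
with torch.no_grad():
    for images, _ in loader:  # `loader` assumed: yields (images, labels) batches
        acts.clear()
        model(images)
        scores.append(acts[0])
top8 = torch.topk(torch.cat(scores), k=8).indices  # dataset indices to inspect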


Idea: make interesting images

Instead of optimizing the network, we can do gradient descent on the image too

Optimize the image to activate certain neurons or layers; then we can learn what those neurons or layers do!

You may have seen this under the name Deep Dream
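Here’s a minimal sketch of that idea (activation maximization, assuming a pretrained torchvision AlexNet; the layer and channel picked are arbitrary): freeze the weights and run gradient ascent on the pixels instead.

import torch
import torchvision

model = torchvision.models.alexnet(weights="IMAGENET1K_V1").eval()
for p in model.parameters():
    p.requires_grad_(False)  # optimize the image, not the network

img = torch.randn(1, 3, 224, 224, requires_grad=True)  # start from noise
opt = torch.optim.Adam([img], lr=0.05)

for step in range(200):
    opt.zero_grad()
    act = model.features[:9](img)   # run up through conv4
    loss = -act[0, 42].mean()       # minimizing the negative = maximizing activation
    loss.backward()
    opt.step()
# img now shows (roughly) what channel 42 responds to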


Early layers: blobs

https://www.youtube.com/watch?v=SCE-QeDfXtA

Neurons respond strongly to high contrast


Middle layers: edges, curves, eyes

https://www.youtube.com/watch?v=SCE-QeDfXtA

Neurons respond to simple features and shapes


Later layers: eyes, objects, dogs

https://www.youtube.com/watch?v=SCE-QeDfXtA

Neurons respond to arrangements of features


Neural networks work!

At least, they seem to do what we want them to

Low-level features: lines, oriented edges

Mid-level features: combine edges into curves, shapes

High-level features: combine shapes into objects, scenes

Predictor: process features and predict the output


VGG: networks getting bigger

From the Visual Geometry Group at Oxford
Considered “very deep” at the time: 16-19 layers
VGG-16 is still commonly used as a feature extractor
  Although it really shouldn’t be; there are much better alternatives


VGG is inefficient

https://arxiv.org/pdf/1312.4400.pdf

Just stringing together a bunch of 3x3 convolutions is, in general, pretty inefficient.

Network-in-network idea from Lin et al.:
  • Use 3x3 convolutions to process spatial information
  • Use 1x1 convolutions to reduce the # of channels (dimensionality reduction)

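For example, here’s a minimal PyTorch sketch of the 1x1-reduction idea (channel counts are illustrative, not from the paper): squeeze 256 channels down to 64 before the 3x3 conv, then expand back.

import torch.nn as nn

# 1x1 convs handle channel reduction/expansion; the 3x3 conv does the
# spatial processing on the compressed feature map.
block = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),            # 256*64 = 16,384 weights
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),  # 64*64*3*3 = 36,864 weights
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 256, kernel_size=1),            # 64*256 = 16,384 weights
)
# ~70K weights total, vs. 256*256*3*3 = 589,824 for a single plain 3x3 conv
# on 256 channels: the same spatial extent for roughly 8x fewer parameters.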


VGG (and AlexNet) are inefficient

https://arxiv.org/pdf/1312.4400.pdf

Huge fully connected layers at the end
TONS of parameters
Do they actually help?

37,748,736 parameters

16,777,216 parameters


NOPE!
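(Assuming those counts refer to AlexNet’s first two fully connected layers, the arithmetic checks out: conv5 outputs 256 * 6 * 6 = 9216 values, so fc6 has 9216 * 4096 = 37,748,736 weights, and fc7 has 4096 * 4096 = 16,777,216.)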


Just use all convolutions!

https://arxiv.org/pdf/1312.4400.pdf

How many? How big?



AlexNet is imbalanced

https://arxiv.org/pdf/1312.4400.pdf

Per-layer compute: 200M, 896M, 299M, 449M, 299M, 34M, 75M, and 8M OPS


Darknet Reference Network

https://arxiv.org/pdf/1312.4400.pdf

Per-layer compute: 57M, 151M, and 2M OPS


GoogLeNet: networks getting weird

https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf

Instead of a bunch of layers that are just 3x3 convolutions, use Inception modules

Why are these better?


GoogLeNet: networks getting weird

https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf

Another way to efficiency:
  • Split layers
  • Not just one filter size, but many: 1x1, 3x3, 5x5

Also use 1x1 convs to compress feature maps

Can make large networks that are still efficient: better than VGG, yet smaller and faster.
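A minimal Inception-style block as a sketch (channel counts are illustrative, and the real GoogLeNet module also has a pooling branch): parallel 1x1, 3x3, and 5x5 convolutions, with 1x1 convs compressing channels before the expensive filters, concatenated at the end.

import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, kernel_size=1)   # cheap 1x1 branch
        self.b3 = nn.Sequential(                        # compress, then 3x3
            nn.Conv2d(in_ch, 32, kernel_size=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1))
        self.b5 = nn.Sequential(                        # compress, then 5x5
            nn.Conv2d(in_ch, 16, kernel_size=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, padding=2))
    def forward(self, x):
        # every branch keeps the spatial size, so concatenate along channels
        return torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)

out = InceptionBlock(192)(torch.randn(1, 192, 28, 28))  # -> (1, 160, 28, 28)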


GoogLeNet: networks getting weird

Like… really big

And this is sort of all we do now: try to make networks bigger without making them too expensive


Multiple outputs (auxiliary classifiers)


GoogLeNet v2, or Inception Net?

https://arxiv.org/pdf/1502.03167.pdf

Or something like that: GoogLeNet + batch normalization


Residual connections

Normally, the output of two layers is: f(w*f(v*x))
Residual connections: f(w*f(v*x) + x)

Learning how to modify x: add some transformed amount
Gives the delta (gradient) another path back, so less vanishing gradient
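As a sketch, here’s the formula above as a PyTorch module (real ResNet blocks also include batch norm, omitted here for brevity):

import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)  # plays "v"
        self.conv2 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)  # plays "w"
        self.relu = nn.ReLU(inplace=True)
    def forward(self, x):
        # f(w*f(v*x) + x): the +x skip gives gradients a direct path back
        return self.relu(self.conv2(self.relu(self.conv1(x))) + x)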


ResNet

3x3 conv blocks or 3x3 and 1x1 conv blocks

Residual connections

VERY deep, 100+ layers


Grouped convolutions

Most filters look at every channel in the input:
  • Very expensive
  • Maybe not needed? A filter might only pull info from a few channels

Grouped convolutions:
  • Split up the input feature map into groups
  • Run convs on the groups independently
  • Recombine


E.g. a 3x3 conv layer, 32 x 32 x 256 input, 128 filters, 32 groups:

  • Split the input into 32 different feature maps, each 32 x 32 x 8
  • Run 4 filters (3x3x8) on each group
  • Merge the 4*32 channels back together to get a 32 x 32 x 128 output

Input and output keep the same dimensions, with less computation
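The example above in PyTorch (the groups argument does the splitting and recombining for you):

import torch
import torch.nn as nn

grouped = nn.Conv2d(256, 128, kernel_size=3, padding=1, groups=32)
dense = nn.Conv2d(256, 128, kernel_size=3, padding=1)

x = torch.randn(1, 256, 32, 32)
print(grouped(x).shape)  # (1, 128, 32, 32), same shape as dense(x)

# each grouped filter is 3x3x8 instead of 3x3x256: 32x fewer weights
print(sum(p.numel() for p in grouped.parameters()))  # 9,344 (incl. biases)
print(sum(p.numel() for p in dense.parameters()))    # 295,040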


ResNeXt

https://arxiv.org/pdf/1611.05431.pdf

Replace 3x3 blocks with larger grouped convs

“Larger” network but same computational complexity


What’s NeXt?

Starting to saturate ImageNet, fighting over 1-2%


Training classifiers in the wild

Typically you have a much smaller dataset than 1.2 million images and 1,000 classes.

What problems do we encounter with less data, say 10,000 images and 500 classes?


First ImageNet, then the world!

For dealing with smaller datasets where we might overfit, pretraining is key (see the sketch after this list):

  • First train on ImageNet
  • Chop off last layer
  • Keep training on your data
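A minimal sketch of this recipe with torchvision (ResNet-18 is used as the pretrained backbone here; any ImageNet model works):

import torch.nn as nn
import torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1")  # ImageNet weights
model.fc = nn.Linear(model.fc.in_features, 500)  # new head for your 500 classes

# optionally freeze the pretrained features and train only the new head,
# a common choice when the dataset is very small:
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("fc")

After this, training proceeds as usual on your own data, typically with a smaller learning rate than training from scratch.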


What else is NeXt?

Starting to saturate ImageNet, fighting over 1-2%

But now vision really works; other tasks:
  • Segmentation
  • Object detection
  • Captioning
  • …

Just different loss functions!

Semantic Segmentation

https://arxiv.org/pdf/1505.04366.pdf

Encoder: downsample the image into coarse features

Decoder: upsample back to fine-grained, per-pixel predictions
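A toy encoder-decoder for segmentation, as a sketch (channel sizes and depth are illustrative, far smaller than the paper’s network):

import torch
import torch.nn as nn

num_classes = 21  # e.g. PASCAL VOC

encoder = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                              # 1/2 resolution
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                              # 1/4 resolution: coarse features
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2), nn.ReLU(),  # back to 1/2
    nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2), nn.ReLU(),   # full size
    nn.Conv2d(32, num_classes, kernel_size=1),    # per-pixel class scores
)

x = torch.randn(1, 3, 224, 224)
print(decoder(encoder(x)).shape)  # (1, 21, 224, 224): one score map per class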


U-Net / SegNet

https://arxiv.org/pdf/1511.00561.pdf, https://arxiv.org/pdf/1505.04597.pdf


Spatial pyramid pooling

https://arxiv.org/pdf/1612.01105.pdf
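A sketch of a PSPNet-style pyramid pooling module (channel counts are illustrative, though the paper does use pool sizes 1, 2, 3, 6): pool the features to several grid sizes, compress each with a 1x1 conv, upsample back, and concatenate with the original map.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, in_ch, sizes=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(s),
                          nn.Conv2d(in_ch, in_ch // len(sizes), kernel_size=1))
            for s in sizes)
    def forward(self, x):
        h, w = x.shape[2:]
        pooled = [F.interpolate(stage(x), size=(h, w), mode="bilinear",
                                align_corners=False) for stage in self.stages]
        return torch.cat([x] + pooled, dim=1)  # original + 4 pooled views

out = PyramidPooling(512)(torch.randn(1, 512, 60, 60))  # -> (1, 1024, 60, 60)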


DeepLabv3+

https://arxiv.org/pdf/1802.02611.pdf

Atrous convolutions: spaced-out (dilated) filter inputs

Pre-train on ImageNet

Fine-tune on segmentation
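Atrous (dilated) convolution is one line in PyTorch; a quick sketch:

import torch
import torch.nn as nn

# dilation=2 spaces the 3x3 filter's taps 2 pixels apart, so it covers a
# 5x5 area with only 3x3 weights: bigger receptive field, same parameter count
atrous = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

x = torch.randn(1, 64, 56, 56)
print(atrous(x).shape)  # (1, 64, 56, 56): padding = dilation keeps the size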
