Final Project

  • Outside libraries, code, etc. are great!
    • Suggestions: PyTorch/TensorFlow, pretrained ResNets, GitHub
  • Software is always built on others’ work; just be clear in the report what your contribution is
  • Experiments are good!
    • E.g. test the effect of model FLOPs on performance
    • Try 6 different models on the bird data
    • Plot performance relative to FLOPs
    • Discuss any patterns, outliers, etc.

Batch normalization

https://arxiv.org/pdf/1502.03167.pdf

One way to deal with vanishing gradients: normalize each filter’s activations spatially and over the mini-batch

During training, the distribution of network activations changes over time because the parameters (weights) change

Learning is more stable if this change (or “internal covariate shift”) is reduced

If the output is a 32 x 32 x 16 feature map with a batch size of 64, normalize the activations for each filter across all images in the batch:

I.e. calculate 16 means and variances

Subtract the mean, divide by the std dev across both the spatial dimensions and the images in the batch
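As a sketch of the computation above, here is a minimal plain-Python batch norm over an (N, C, H, W) batch. Assumption flagged: real batch norm also learns a per-channel scale γ and shift β, which are omitted here.

```python
import math

def batch_norm(x, eps=1e-5):
    """Normalize activations per filter (channel) across the batch and
    spatial dimensions; x is a nested list of shape (N, C, H, W).
    Note: the learned scale/shift of real batch norm is omitted."""
    N, C, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    out = [[[[0.0] * W for _ in range(H)] for _ in range(C)] for _ in range(N)]
    for c in range(C):
        # One mean and one variance per channel, pooled over N, H, and W
        vals = [x[n][c][i][j] for n in range(N) for i in range(H) for j in range(W)]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        for n in range(N):
            for i in range(H):
                for j in range(W):
                    out[n][c][i][j] = (x[n][c][i][j] - mean) / math.sqrt(var + eps)
    return out

# Batch of 4 images, 2 channels, 3x3 spatial -> 2 means and 2 variances
x = [[[[float(n + 2 * c + i - j) for j in range(3)] for i in range(3)]
      for c in range(2)] for n in range(4)]
normalized = batch_norm(x)
```

After normalization, each channel has mean ~0 and variance ~1 across the batch and spatial dimensions.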

Batch normalization

https://arxiv.org/pdf/1502.03167.pdf

Other benefits:

Output is normalized before the activation; mean 0, var 1 means it’s in the “good” domain of most activation functions

Each image is seen relative to the others in its batch, which introduces a form of regularization because we never “see” the same image the same way twice

Stabilizes training so much larger learning rates can be used

Residual connections

Normally, the output of two layers is: f(w*f(v*x))
Residual connections: f(w*f(v*x) + x)

Learning how to modify x: add some transformed amount
Gives the gradient another path, less vanishing gradient
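A toy numeric sketch of the two block types, using scalar x, v, w for simplicity (real blocks use matrices and convolutions):

```python
def relu(z):
    return max(0.0, z)

def plain_block(x, v, w):
    # Two layers: f(w * f(v * x))
    return relu(w * relu(v * x))

def residual_block(x, v, w):
    # Add the input back in before the final activation: f(w * f(v * x) + x)
    return relu(w * relu(v * x) + x)

# If the learned weights are useless (w = 0), the plain block kills the
# signal, while the residual block still passes x straight through.
dead_plain = plain_block(3.0, 0.5, 0.0)        # 0.0
dead_residual = residual_block(3.0, 0.5, 0.0)  # 3.0
```

The skip connection is what lets very deep stacks default to (roughly) the identity instead of destroying the signal.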

ResNet

Grouped convolutions

Most filters look at every channel in the input
Very expensive
Maybe not needed? Might only pull info from a few of them

Grouped convolutions:
Split up the input feature map into groups
Run convs on the groups independently
Recombine

E.g. a 3x3 conv layer with a 32 x 32 x 256 input, 128 filters, 32 groups:

Split the input into 32 feature maps
Each is 32 x 32 x 8
Run 4 filters of size 3x3x8 on each group
Merge the 4*32 channels back together to get a 32 x 32 x 128 output

Input and output stay the same dimensions, with less computation
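The savings can be checked by counting weights; FLOPs scale the same way (each weight count is multiplied by the 32 x 32 output positions). A sketch:

```python
def conv_weights(k, c_in, c_out, groups=1):
    """Number of weights in a k x k convolution layer, ignoring biases.
    Each group sees c_in/groups input channels and owns c_out/groups filters."""
    assert c_in % groups == 0 and c_out % groups == 0
    return groups * (k * k * (c_in // groups) * (c_out // groups))

standard = conv_weights(3, 256, 128)            # 3*3*256*128 = 294912
grouped = conv_weights(3, 256, 128, groups=32)  # 32 * (3*3*8*4) = 9216
```

With 32 groups, the layer uses 32x fewer weights (and multiply-adds) for the same input and output shapes.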

What’s NeXt?

Starting to saturate ImageNet, fighting over 1-2%

Semantic Segmentation

https://arxiv.org/pdf/1505.04366.pdf

Encoder

Decoder

Coarse features

Fine-grained predictions

U-Net / SegNet

https://arxiv.org/pdf/1511.00561.pdf, https://arxiv.org/pdf/1505.04597.pdf

Spatial pyramid pooling

https://arxiv.org/pdf/1612.01105.pdf

DeepLabv3+

https://arxiv.org/pdf/1802.02611.pdf

Atrous convolutions: spaced inputs

Pre-train on ImageNet

Fine-tune on segmentation
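The atrous (dilated) convolutions above can be sketched in 1-D: the filter taps are spread apart so the receptive field grows without extra weights. `atrous_conv1d` is a hypothetical helper written for illustration, not a library function:

```python
def atrous_conv1d(x, w, dilation=1):
    """1-D convolution whose taps are spaced `dilation` apart ("valid" padding,
    cross-correlation form). dilation=1 is an ordinary convolution."""
    span = (len(w) - 1) * dilation + 1  # receptive field of the dilated filter
    return [sum(w[k] * x[i + k * dilation] for k in range(len(w)))
            for i in range(len(x) - span + 1)]

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
dense = atrous_conv1d(signal, [1.0, 1.0, 1.0])               # taps at i, i+1, i+2
spaced = atrous_conv1d(signal, [1.0, 1.0, 1.0], dilation=2)  # taps at i, i+2, i+4
```

A 3-tap filter with dilation 2 covers 5 input positions, which is why DeepLab can enlarge context cheaply.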

Object detection

Deformable parts models

Scoring object detection

Multiple classes, multiple objects per image
Can’t just use accuracy

“Correct” bounding box:
Intersection / Union > 0.5

Intersection: ground truth ∩ prediction
Union: ground truth ∪ prediction
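The IoU criterion above as a plain-Python sketch, assuming boxes are given as (x1, y1, x2, y2) corners (conventions vary by dataset):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

score = iou((0, 0, 2, 2), (1, 1, 3, 3))  # intersection 1, union 7 -> 1/7
```

Here the 1/7 overlap would count as an incorrect detection under the IoU > 0.5 rule.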

Recall:
Correct bounding boxes / total ground-truth boxes

Precision:
Correct bounding boxes / total predicted boxes

Only the most confident predictions: High precision, low recall

All the predictions: Low precision, high recall
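The trade-off above can be sketched by sweeping a confidence threshold over some made-up detections (the numbers below are illustrative, not from any benchmark):

```python
def precision_recall(preds, n_gt, thresh):
    """preds: list of (confidence, is_correct) detections.
    Keep only detections with confidence >= thresh."""
    kept = [ok for conf, ok in preds if conf >= thresh]
    tp = sum(kept)  # correct kept detections
    precision = tp / len(kept) if kept else 1.0
    recall = tp / n_gt
    return precision, recall

preds = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]
strict = precision_recall(preds, n_gt=4, thresh=0.7)  # (1.0, 0.5)
loose = precision_recall(preds, n_gt=4, thresh=0.0)   # (0.6, 0.75)
```

A high threshold keeps only confident detections (high precision, low recall); keeping everything flips the trade-off.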

Precision-Recall curve: vary threshold, plot precision and recall

Average precision:
Area under the PR curve
Only for a single class

Take the mean of AP across classes:
Mean AP (mAP)
Standard detection metric
Sometimes at a particular IOU, i.e. mAP@.5 or mAP@.75
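A minimal AP computation matching the area-under-the-PR-curve idea above; real benchmarks often use interpolated variants, so treat this as a sketch. mAP is then just the mean of this quantity over classes:

```python
def average_precision(preds, n_gt):
    """AP for one class: area under the precision-recall curve, obtained by
    sweeping the confidence threshold over preds = [(confidence, is_correct)]."""
    ranked = sorted(preds, key=lambda p: -p[0])  # most confident first
    tp, area, prev_recall = 0, 0.0, 0.0
    for rank, (conf, correct) in enumerate(ranked, start=1):
        if correct:
            tp += 1
            recall = tp / n_gt
            # width of this recall step times precision at this rank
            area += (recall - prev_recall) * (tp / rank)
            prev_recall = recall
    return area

# One false positive sandwiched between two correct detections
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], n_gt=2)
```

A perfect ranking (all correct detections first, all ground truth found) gives AP = 1.0; the false positive above drags it down.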

PASCAL VOC

One of the first large detection datasets:
20 classes
11,530 training images
27,450 annotated objects

DPM: 33.6% mAP

DPM predates neural networks; how do we use CNNs for detection?

R-CNN: Regions with CNN features

Selective search: fewer proposals

R-CNN: Regions with CNN features

Lots of post-processing, ~20 sec/image

PASCAL VOC:

AlexNet: 53.3% mAP

VGG-16: 62.4% mAP

YOLO

Say you have an image...

Split it into a grid

For each cell, predict P(obj)

Also predict a bounding box

Also class probabilities (e.g. dog, bicycle, car, dining table)

Threshold and non-max suppression
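The non-max suppression step can be sketched greedily: keep the most confident box, drop any remaining box that overlaps it too much, repeat. Boxes are assumed to be (x1, y1, x2, y2):

```python
def iou(a, b):
    # Intersection over union of two (x1, y1, x2, y2) boxes
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-max suppression; returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Suppress everything that overlaps the kept box too strongly
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = [(0, 0, 2, 2), (0, 0, 2, 2.2), (5, 5, 7, 7)]
kept = nms(boxes, scores=[0.9, 0.8, 0.7])  # box 1 nearly duplicates box 0
```

The near-duplicate second box is suppressed; the distant third box survives.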

Tensor encoding detection
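The output tensor size follows directly from the grid encoding: S x S cells, B boxes of 5 numbers each (x, y, w, h, confidence), plus C class probabilities per cell. For the original YOLO on PASCAL VOC, S=7, B=2, C=20:

```python
def yolo_output_size(S, B, C):
    """Numbers in a YOLO-style output tensor: S x S grid, B boxes per cell
    (x, y, w, h, confidence), C class probabilities per cell."""
    return S * S * (B * 5 + C)

size = yolo_output_size(7, 2, 20)  # 7 * 7 * (2*5 + 20) = 1470
```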

R-CNN is slow

https://dl.dropboxusercontent.com/s/vlyrkgd8nz8gy5l/fast-rcnn.pdf?dl=0

Run convnet independently over every ROI

Fast R-CNN

https://dl.dropboxusercontent.com/s/vlyrkgd8nz8gy5l/fast-rcnn.pdf?dl=0

Run convnet once, extract features using ROI pooling

ROI Pool:
Convert a variable-sized ROI to a fixed-size output

ROI Align

https://arxiv.org/pdf/1703.06870.pdf

Better than ROI Pool, so we’ll talk about it instead

Split the ROI into a fixed-size grid (say 2x2)

Sample the feature map at multiple points for each cell in the ROI (bilinear interpolation)

Pool over these samples (max, avg…)
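The three steps above as a plain-Python sketch, assuming the ROI is given in continuous feature-map coordinates (y1, x1, y2, x2) and using average pooling over the samples:

```python
def bilinear(img, y, x):
    """Sample a 2-D feature map at a real-valued point."""
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, len(img) - 1), min(x0 + 1, len(img[0]) - 1)
    dy, dx = y - y0, x - x0
    return (img[y0][x0] * (1 - dy) * (1 - dx) + img[y0][x1] * (1 - dy) * dx
            + img[y1][x0] * dy * (1 - dx) + img[y1][x1] * dy * dx)

def roi_align(img, roi, out=2, samples=2):
    """Split the ROI into an out x out grid and average samples x samples
    bilinear samples inside each cell."""
    y1, x1, y2, x2 = roi
    ch, cw = (y2 - y1) / out, (x2 - x1) / out  # cell height/width
    result = [[0.0] * out for _ in range(out)]
    for i in range(out):
        for j in range(out):
            pts = [bilinear(img,
                            y1 + (i + (a + 0.5) / samples) * ch,
                            x1 + (j + (b + 0.5) / samples) * cw)
                   for a in range(samples) for b in range(samples)]
            result[i][j] = sum(pts) / len(pts)
    return result

ramp = [[float(j) for j in range(4)] for _ in range(4)]  # value = x coordinate
aligned = roi_align(ramp, (0.0, 0.0, 3.0, 3.0))
```

Unlike ROI Pool’s snapping to integer bins, every sample location here is continuous, so the output varies smoothly with the ROI coordinates.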

Fast R-CNN

Much faster: no independent network evals (except the last linear layer)

Still a slow region proposer; selective search takes ~2 sec

Faster R-CNN

https://dl.dropboxusercontent.com/s/vlyrkgd8nz8gy5l/fast-rcnn.pdf?dl=0

Use Convnet to propose regions and generate features

ROI Pool to fix size of ROI features

Additional layers to classify and predict a bbox for each ROI

Saturating PASCAL VOC, need new data

Common Objects in COntext (COCO)

http://cocodataset.org/#home

80 object classes
117,261 train/val images
902,435 object instances

New detection metric, mAP averaged over IOU [.5 - .95]

Segmentation masks for each instance

Originally by Microsoft, but they spun it off over copyright concerns

Segmentation vs Detection

Segmentation: pixel-level labels, category only

Detection: bounding box labels, category + instance

Instance Segmentation

Given an image, produce an instance-level segmentation:
Which class does each pixel belong to
And which instance