Final Project
Batch normalization
https://arxiv.org/pdf/1502.03167.pdf
One way to deal with vanishing gradients: normalize the activations of each filter spatially and over the mini-batch
During training, the distribution of network activations changes over time because the parameters (weights) change
Learning is more stable if this change (or internal covariate shift) is reduced
If the output is a 32 x 32 x 16 feature map with a batch size of 64, normalize the activations of each filter across all images in the batch:
I.e. calculate 16 means and variances
Subtract the mean and divide by the std dev, both across the spatial dimensions and the images in the batch
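The normalization step above can be sketched in a few lines of numpy (a minimal version without the learned scale and shift parameters the paper also adds; function name is mine):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each channel over the batch and spatial dimensions.

    x: activations of shape (batch, height, width, channels).
    Returns activations with per-channel mean ~0 and variance ~1.
    (The real layer also learns a per-channel scale gamma and shift beta.)
    """
    mean = x.mean(axis=(0, 1, 2), keepdims=True)  # one mean per channel
    var = x.var(axis=(0, 1, 2), keepdims=True)    # one variance per channel
    return (x - mean) / np.sqrt(var + eps)

# The slide's example: 64 images, 32x32 spatial, 16 filters -> 16 means/vars
x = np.random.randn(64, 32, 32, 16) * 3.0 + 5.0
y = batch_norm(x)
print(y.mean(axis=(0, 1, 2)))  # ~0 for each of the 16 channels
print(y.var(axis=(0, 1, 2)))   # ~1 for each of the 16 channels
```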
Other benefits:
Output is normalized before activation, mean 0 var 1 means it’s in the “good” domain of most activation functions
Each image is seen relative to the others in its batch, which introduces a form of regularization because we never “see” the same image twice
Stabilizes training so much larger learning rates can be used
Residual connections
Normally, the output of two layers is: f(w*f(vx))
Residual connections: f(w*f(vx) + x)
Learning how to modify x: add some transformed amount
Gives the delta (gradient) another path, less vanishing gradient
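A minimal numpy sketch of the two formulas above, using ReLU as f (function and variable names are mine, not from the slides). Note that even with the second layer's weights at zero, the residual block still passes the input through:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def plain_block(x, v, w):
    """Two layers as on the slide: f(w*f(vx))."""
    return relu(w @ relu(v @ x))

def residual_block(x, v, w):
    """Same two layers, but x is added back before the final
    activation: f(w*f(vx) + x). The identity path gives the
    gradient a route around the weights."""
    return relu(w @ relu(v @ x) + x)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
v = rng.standard_normal((8, 8))
w = np.zeros((8, 8))            # even with zero weights...
print(residual_block(x, v, w))  # ...the block still outputs relu(x)
```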
ResNet
Grouped convolutions
Most filters look at every channel in the input
Very expensive
Maybe not needed? Might only pull info from a few of them
Grouped convolutions:
Split up the input feature map into groups
Run convs on the groups independently
Recombine
E.g. a 3x3 conv layer, 32 x 32 x 256 input, 128 filters, 32 groups:
Split the input into 32 different feature maps
Each is 32 x 32 x 8
Run 4 filters, 3x3x8, on each group
Merge the 4*32 channels back together, get a 32 x 32 x 128 output
Input and output stay the same dimensions, with less computation
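The arithmetic behind that claim can be checked directly (a small helper of my own, counting multiplies rather than running actual convolutions):

```python
def conv_cost(h, w, in_ch, k, filters, groups=1):
    """Multiply count and output shape for a k x k conv layer.
    Each group sees in_ch/groups channels and produces filters/groups outputs."""
    assert in_ch % groups == 0 and filters % groups == 0
    per_filter = k * k * (in_ch // groups)  # weights per filter
    mults = h * w * filters * per_filter    # one dot product per output value
    return (h, w, filters), mults

# Slide example: 32x32x256 input, 3x3 conv, 128 filters
shape_full, cost_full = conv_cost(32, 32, 256, 3, 128, groups=1)
shape_grp, cost_grp = conv_cost(32, 32, 256, 3, 128, groups=32)
print(shape_full, shape_grp)  # both (32, 32, 128): same output size
print(cost_full // cost_grp)  # 32x fewer multiplies with 32 groups
```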
What’s NeXt?
Starting to saturate ImageNet, fighting over 1-2%
Semantic Segmentation
https://arxiv.org/pdf/1505.04366.pdf
Encoder
Decoder
Coarse features
Fine-grained predictions
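The linked encoder–decoder paper recovers fine-grained predictions partly by max-unpooling: the encoder's pooling layers record where each max came from, and the decoder places values back at those locations. A minimal numpy sketch (function names are mine):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling that also records where each max came from,
    so the decoder can put values back in the right place."""
    h, w = x.shape
    out = np.zeros((h // 2, w // 2))
    switches = np.zeros((h // 2, w // 2), dtype=int)  # flat index of each max
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            win = x[i:i+2, j:j+2]
            k = int(win.argmax())
            out[i // 2, j // 2] = win.flat[k]
            switches[i // 2, j // 2] = (i + k // 2) * w + (j + k % 2)
    return out, switches

def max_unpool_2x2(pooled, switches, shape):
    """Decoder step: place each pooled value back at its recorded
    location; everything else stays zero (a sparse, upsampled map)."""
    out = np.zeros(shape)
    out.flat[switches.ravel()] = pooled.ravel()
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
p, s = max_pool_2x2(x)
u = max_unpool_2x2(p, s, x.shape)
print(p)  # [[ 5.  7.] [13. 15.]]
print(u)  # maxima restored in place, zeros elsewhere
```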
U-net/Segnet
https://arxiv.org/pdf/1511.00561.pdf, https://arxiv.org/pdf/1505.04597.pdf
Spatial pyramid pooling
https://arxiv.org/pdf/1612.01105.pdf
DeepLabv3+
https://arxiv.org/pdf/1802.02611.pdf
Atrous convolutions: spaced inputs
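"Spaced inputs" means the kernel taps skip over samples. A 1-D sketch of an atrous (dilated) convolution (my own minimal version, not the paper's implementation):

```python
import numpy as np

def atrous_conv1d(x, kernel, rate):
    """1-D atrous (dilated) convolution: the kernel taps are spaced
    `rate` samples apart, widening the receptive field with no extra
    weights. rate=1 is an ordinary convolution."""
    k = len(kernel)
    span = (k - 1) * rate + 1  # receptive field of one output sample
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(kernel[j] * x[i + j * rate] for j in range(k))
    return out

x = np.arange(10, dtype=float)
print(atrous_conv1d(x, [1.0, 1.0, 1.0], rate=1))  # sums of 3 adjacent samples
print(atrous_conv1d(x, [1.0, 1.0, 1.0], rate=2))  # taps 2 apart: 5-sample span
```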
Pre-train on ImageNet
Fine-tune on segmentation
Object detection
Deformable parts models
Scoring object detection
Multiple classes, multiple objects per image
Can’t just use accuracy
“Correct” bounding box: Intersection / Union > 0.5
Intersection: Ground truth ∩ prediction
Union: Ground truth ∪ prediction
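Intersection over union (IoU) for axis-aligned boxes is a few lines of arithmetic (a minimal sketch using (x1, y1, x2, y2) corner format):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes don't overlap)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))        # 1.0: identical boxes
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))        # 1/3: half-overlapping
print(iou((0, 0, 10, 10), (5, 0, 15, 10)) > 0.5)  # False: not "correct"
```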
Recall: correct bounding boxes / total ground-truth boxes
Precision: correct bounding boxes / total predicted boxes
Only the most confident predictions: High precision, low recall
All the predictions: Low precision, high recall
Precision-Recall curve: vary threshold, plot precision and recall
Average precision:
Area under the PR curve
Only for a single class
Take the mean of AP across classes:
Mean AP (mAP)
Standard detection metric
Sometimes at a particular IoU
I.e. mAP@.5 or mAP@.75
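A sketch of AP for one class, using simple rectangle integration of the PR curve (benchmarks like VOC/COCO use interpolated or sampled variants, so treat this as illustrative):

```python
def average_precision(hits, num_gt):
    """AP for one class: walk detections from most to least confident,
    trace the precision-recall curve, and integrate the area under it.

    hits: booleans, sorted by descending confidence; True if the
          detection matched a ground-truth box (IoU > threshold).
    num_gt: total number of ground-truth boxes for this class.
    """
    ap, tp, prev_recall = 0.0, 0, 0.0
    for n, hit in enumerate(hits, start=1):
        tp += hit
        precision = tp / n
        recall = tp / num_gt
        ap += precision * (recall - prev_recall)  # rectangle under the curve
        prev_recall = recall
    return ap

# 5 detections for a class with 3 ground-truth boxes
print(average_precision([True, True, False, True, False], num_gt=3))
```

mAP is then just the mean of this value across all classes.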
PASCAL VOC
One of the first large detection datasets:
20 classes
11,530 training images
27,450 annotated objects
DPM: 33.6% mAP
DPM is pre-neural network, how do we use CNNs for detection?
R-CNN: Regions with CNN features
Selective search: fewer proposals
Lots of post-processing, ~20 sec/image
Pascal VOC:
AlexNet: 53.3% mAP
VGG-16: 62.4% mAP
YOLO
Say you have an image...
Split it into a grid
For each cell, predict P(obj)
Also predict a bounding box
Also class probabilities
E.g. dog, bicycle, car, dining table
Threshold and non-max suppression
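Non-max suppression keeps the highest-scoring box and discards overlapping duplicates. A minimal greedy sketch (the usual approach, though implementations vary):

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-max suppression: repeatedly keep the highest-scoring
    box and drop remaining boxes that overlap it by more than iou_thresh.
    Boxes are (x1, y1, x2, y2); returns kept indices."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter)

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 is suppressed by box 0
```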
Encoding detections as a tensor
R-CNN is slow
https://dl.dropboxusercontent.com/s/vlyrkgd8nz8gy5l/fast-rcnn.pdf?dl=0
Run convnet independently over every ROI
Fast R-CNN
https://dl.dropboxusercontent.com/s/vlyrkgd8nz8gy5l/fast-rcnn.pdf?dl=0
Run convnet once, extract features using ROI pooling
ROI Pool: convert a variable-sized ROI to a fixed-size output
ROI Align
https://arxiv.org/pdf/1703.06870.pdf
Better than ROI Pool, so we’ll talk about it instead
Split the ROI into a fixed-size grid (say 2x2)
Sample the image at multiple points for each cell in the ROI (bilinear interp.)
Pool over these samples (max, avg…)
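The three steps above can be sketched in numpy: split the ROI into a grid, bilinearly sample a few points per cell, then pool them (this is my simplified single-channel version, with average pooling chosen for the last step):

```python
import numpy as np

def bilinear(feat, y, x):
    """Sample feature map `feat` at fractional location (y, x)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, feat.shape[0] - 1), min(x0 + 1, feat.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) + feat[y0, x1] * (1 - dy) * dx
            + feat[y1, x0] * dy * (1 - dx) + feat[y1, x1] * dy * dx)

def roi_align(feat, roi, out_size=2, samples=2):
    """Split the ROI (y1, x1, y2, x2) into an out_size x out_size grid,
    bilinearly sample samples x samples points per cell, and average them."""
    y1, x1, y2, x2 = roi
    ch, cw = (y2 - y1) / out_size, (x2 - x1) / out_size  # cell size
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            pts = []
            for si in range(samples):
                for sj in range(samples):
                    py = y1 + (i + (si + 0.5) / samples) * ch
                    px = x1 + (j + (sj + 0.5) / samples) * cw
                    pts.append(bilinear(feat, py, px))
            out[i, j] = np.mean(pts)  # pool the samples (max also works)
    return out

feat = np.arange(36, dtype=float).reshape(6, 6)
print(roi_align(feat, (0.5, 0.5, 4.5, 4.5)))  # fixed 2x2 output
```

No matter the ROI's size, the output is always out_size x out_size, which is what lets the later fully-connected layers run on it.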
Much faster, no independent network evals (except the last linear layers)
Still a slow region proposer: selective search takes ~2 sec
Faster R-CNN
https://dl.dropboxusercontent.com/s/vlyrkgd8nz8gy5l/fast-rcnn.pdf?dl=0
Use a convnet to propose regions and generate features
ROI Pool to fix the size of the ROI features
Additional layers to classify and predict a bbox for each ROI
Saturating PASCAL VOC, need new data
Common Objects in COntext (COCO)
http://cocodataset.org/#home
80 objects
117,261 train/val images
902,435 object instances
New detection metric: mAP averaged over IoU thresholds from .5 to .95
Segmentation masks for each instance
Originally by Microsoft, but they were worried about copyright issues, so they spun it off
Segmentation vs Detection
Segmentation: pixel-level labels, category only
Detection: bounding box labels, category + instance
Instance Segmentation
Given an image, produce an instance-level segmentation:
Which class does each pixel belong to
Also which instance