
Fully Convolutional Networks

for Semantic Segmentation

UC Berkeley (CVPR'15, PAMI'17)

Evan Shelhamer* Jonathan Long* Trevor Darrell


pixels in, pixels out


semantic segmentation

monocular depth + normals Eigen & Fergus 2015

boundary prediction Xie & Tu 2015

optical flow Fischer et al. 2015

colorization Zhang et al. 2016


convnets perform classification


“tabby cat”

1000-dim vector

< 1 millisecond

end-to-end learning


lots of pixels, little time?


~1/10 second

end-to-end learning

???


a classification network


“tabby cat”


becoming fully convolutional


becoming fully convolutional
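In code, "convolutionalizing" a classifier amounts to reinterpreting fully connected weights as convolution kernels: an fc layer over a K×K feature map is exactly a K×K convolution, so the same weights can slide over larger inputs to produce a spatial map of scores. A minimal numpy sketch (all sizes illustrative, not the actual AlexNet/VGG layers):

```python
import numpy as np

# Illustrative sizes: an fc "classifier" over a 2x2 map with C channels
C, K, classes = 3, 2, 5
rng = np.random.default_rng(0)
W_fc = rng.standard_normal((classes, C * K * K))  # fc weights

# Reinterpret the fc weight matrix as a KxK convolution kernel
W_conv = W_fc.reshape(classes, C, K, K)

def conv_valid(x, w):
    """Valid-mode correlation: x is (C, H, W), w is (O, C, K, K)."""
    O, _, k, _ = w.shape
    H, Wd = x.shape[1] - k + 1, x.shape[2] - k + 1
    out = np.empty((O, H, Wd))
    for o in range(O):
        for i in range(H):
            for j in range(Wd):
                out[o, i, j] = np.sum(x[:, i:i+k, j:j+k] * w[o])
    return out

# On a K x K input, the convolution reproduces the fc output exactly...
x = rng.standard_normal((C, K, K))
fc_out = W_fc @ x.reshape(-1)
conv_out = conv_valid(x, W_conv)[:, 0, 0]

# ...and on a larger input it yields a score per spatial location.
x_big = rng.standard_normal((C, K + 3, K + 3))
print(conv_valid(x_big, W_conv).shape)  # (5, 4, 4)
```

The forward pass is shared across overlapping windows, which is why the fully convolutional form is far cheaper than running the classifier patchwise.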


upsampling output
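The upsampling layer is a learnable deconvolution (transposed convolution), commonly initialized to bilinear interpolation. A hedged numpy sketch of that initialization and a 2× upsampling pass (function names are mine, not the paper's code):

```python
import numpy as np

def bilinear_kernel(factor):
    """Bilinear interpolation kernel of side 2*factor - factor % 2,
    the usual initialization for learned deconvolution filters."""
    size = 2 * factor - factor % 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    og = np.arange(size, dtype=float)
    k1d = 1 - np.abs(og - center) / factor
    return np.outer(k1d, k1d)

def upsample2x(img):
    """Transposed convolution, stride 2, pad 1: scatter each input
    pixel into the padded output weighted by the 4x4 bilinear kernel."""
    k = bilinear_kernel(2)
    H, W = img.shape
    out = np.zeros((2 * H + 2, 2 * W + 2))
    for i in range(H):
        for j in range(W):
            out[2*i:2*i+4, 2*j:2*j+4] += img[i, j] * k
    return out[1:-1, 1:-1]  # crop the padding: exactly (2H, 2W)

print(bilinear_kernel(2)[0])  # first row of the 4x4 kernel
```

Because the kernel is a layer parameter, the net can keep it fixed as interpolation or learn a better upsampling end-to-end.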


end-to-end, pixels-to-pixels network


end-to-end, pixels-to-pixels network


conv, pool, nonlinearity

upsampling

pixelwise output + loss
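The pixelwise loss can be sketched as a softmax cross-entropy averaged over every output location, so the whole image behaves like a batch of per-pixel classification problems; names and shapes below are illustrative:

```python
import numpy as np

def pixelwise_loss(scores, labels):
    """Mean softmax cross-entropy over all pixels of a (C, H, W)
    score map against an (H, W) integer label map."""
    s = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    log_p = s - np.log(np.exp(s).sum(axis=0, keepdims=True))
    H, W = labels.shape
    # Gather each pixel's log-probability at its true class
    return -log_p[labels, np.arange(H)[:, None], np.arange(W)].mean()

C, H, W = 4, 2, 3
loss = pixelwise_loss(np.zeros((C, H, W)), np.zeros((H, W), dtype=int))
print(loss)  # uniform scores over 4 classes: log(4) ≈ 1.386
```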


spectrum of deep features


combine where (local, shallow) with what (global, deep)

fuse features into deep jet

(cf. Hariharan et al. CVPR15 “hypercolumn”)


skip layers


skip to fuse layers!

[ figure: two skip fusions (interp + sum) produce the dense output ]

end-to-end, joint learning of semantics and location
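A toy "interp + sum" fusion, with illustrative shapes rather than the paper's code: coarse class scores from a deep layer are upsampled and summed with scores predicted from a shallower, higher-resolution layer. The paper interpolates with learnable bilinear filters; nearest-neighbor stands in here for brevity:

```python
import numpy as np

def upsample2x_nearest(x):
    """Nearest-neighbor 2x upsampling: (C, H, W) -> (C, 2H, 2W)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

rng = np.random.default_rng(0)
deep = rng.standard_normal((5, 4, 4))     # e.g. stride-32 class scores
shallow = rng.standard_normal((5, 8, 8))  # e.g. stride-16 class scores

fused = upsample2x_nearest(deep) + shallow  # skip: interp + sum
print(fused.shape)  # (5, 8, 8)
```

Because the sum is differentiable, both streams train jointly: the deep path supplies what, the shallow path supplies where.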


skip layer refinement


[ figure: input and ground truth alongside predictions at stride 32 (no skips), stride 16 (1 skip), and stride 8 (2 skips) ]


skip FCN computation

Stage 1 (60.0 ms) → Stage 2 (18.7 ms) → Stage 3 (23.0 ms)

A multi-stream network that fuses features/predictions across layers


[ figure: Input, FCN, SDS*, and Truth segmentations ]

Relative to prior state-of-the-art SDS:

  • 30% relative improvement for mean IoU

  • 286× faster

*Simultaneous Detection and Segmentation Hariharan et al. ECCV14


past and future history of fully convolutional networks


history


Convolutional Locator Network

Wolf & Platt 1994

Shape Displacement Network

Matan & LeCun 1992


pyramids


Scale Pyramid, Burt & Adelson ‘83


The scale pyramid is a classic multi-resolution representation.

Fusing multi-resolution network layers is a learned, nonlinear counterpart.


jets


Jet, Koenderink & Van Doorn ‘87

The local jet collects the partial derivatives at a point for a rich local description.

The deep jet collects layer compositions for a rich, learned description.


extensions


  • detection + instances
  • structured output
  • weak supervision
  • video


detection: fully conv. proposals


Fast R-CNN, Girshick ICCV'15

Faster R-CNN, Ren et al. NIPS'15

end-to-end detection by fully convolutional proposals + RoI classification


fully conv. nets + structured output


Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. Chen* & Papandreou* et al. ICLR 2015.


fully conv. nets + structured output


Conditional Random Fields as Recurrent Neural Networks. Zheng* & Jayasumana* et al. ICCV 2015.


dilation for structured output


Multi-Scale Context Aggregation by Dilated Convolutions. Yu & Koltun. ICLR 2016

  • enlarge the effective receptive field for the same no. of params
  • raise resolution
  • convolutional context model: similar accuracy to CRF but non-probabilistic
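The receptive-field arithmetic behind dilation can be checked directly: a stack of 3×3 convolutions with doubling dilations grows its field exponentially at constant parameter count. A sketch (the 7-layer dilation schedule is my reading of the context-module pattern, stated as an assumption):

```python
def receptive_field(dilations, k=3):
    """1-D receptive field of stacked stride-1, size-k convolutions:
    each layer with dilation d adds d * (k - 1) to the field."""
    rf = 1
    for d in dilations:
        rf += d * (k - 1)
    return rf

print(receptive_field([1, 1, 1]))               # 7: plain 3-layer stack
print(receptive_field([1, 2, 4]))               # 15: same params, dilated
print(receptive_field([1, 1, 2, 4, 8, 16, 1]))  # 67: context-module pattern
```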


[ comparison credit: CRF as RNN, Zheng* & Jayasumana* et al. ICCV 2015 ]

DeepLab: Chen* & Papandreou* et al. ICLR 2015. CRF-RNN: Zheng* & Jayasumana* et al. ICCV 2015


fully conv. nets + weak supervision


Constrained Convolutional Neural Networks for Weakly Supervised Segmentation. Pathak et al. arXiv 2015.

FCNs expose a spatial loss map to guide learning: segment from tags by MIL or pixelwise constraints


fully conv. nets + weak supervision


BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation. Dai et al. 2015.

FCNs expose a spatial loss map to guide learning: mine boxes + feedback to refine masks


fully conv. nets + weak supervision


FCNs can learn from sparse annotations == sampling the loss

What's the Point? Semantic Segmentation with Point Supervision. Bearman et al. ECCV 2016.
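Loss sampling can be sketched as masking the per-pixel loss down to annotated points; the `-1` "unlabeled" convention below is my own, not the paper's:

```python
import numpy as np

def sampled_nll(log_p, labels, ignore=-1):
    """Mean negative log-likelihood over labeled pixels only.
    log_p is (C, H, W); labels is (H, W) with `ignore` at
    unlabeled pixels, which contribute nothing to the loss."""
    ys, xs = np.nonzero(labels != ignore)
    return -log_p[labels[ys, xs], ys, xs].mean()

log_p = np.log(np.full((3, 2, 2), 1.0 / 3.0))  # uniform over 3 classes
labels = np.array([[0, -1],
                   [-1, 2]])                   # two point annotations
print(sampled_nll(log_p, labels))  # log(3) ≈ 1.099
```

Gradients flow only from the sampled pixels, so sparse point or scribble labels plug into the same dense architecture unchanged.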


leaderboard


== segmentation with Caffe

[ leaderboard: FCN appears throughout the top entries ]


caffeinated contemporaries


Hypercolumn SDS: Hariharan, Arbeláez, Girshick, Malik

Zoom-Out: Mostajabi, Yadollahpour, Shakhnarovich

Convolutional Feature Masking: Dai, He, Sun


conclusion


fully convolutional networks are fast, end-to-end models for pixelwise problems

  • code in Caffe master branch
  • models for PASCAL VOC, NYUDv2, SIFT Flow, PASCAL-Context