1 of 66

CSE 5524: Object detection

2 of 66

HW 1 & 2

  • Grades will be posted by tomorrow.
  • Regrading form will be posted on Carmen.

3 of 66

Midterm

  • Discuss the questions on Thursday!
  • Check your exam packet on Thursday!

4 of 66

Homework and quiz plan

  • HW 3: (10%)
    • Release: 3/24
    • Due: 4/7

  • HW 4: (10%)
    • Release: 4/7
    • Due: 4/21

  • Quiz: 4%, coming soon

5 of 66

Final project (30%)

  • Team forming:
    • 2–3 students: same expectations

  • Tentative plan:

6 of 66

Project first glance

  • Pre-defined tasks:

    • Reproducing existing algorithms
    • Benchmarking existing algorithms
    • Reproducing examples in the textbook

  • Self-defined “research” tasks:
    • Need approval
    • Need justification

7 of 66

Today (Chapter 50)

  • CNN & segmentation recap
  • Object detection

8 of 66

How do I know it is a zebra?

9 of 66

Convolutions

Linear receptive field

10 of 66

Convolutions

Linear receptive field

Exponential receptive field

(with pooling + down-sampling)
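The two regimes can be checked with a few lines of arithmetic. A small sketch (my own helper, not from the slides) that tracks how the receptive field grows through a stack of layers:

```python
def receptive_field(layers):
    """Receptive field at the input for a stack of (kernel_size, stride) layers."""
    rf, jump = 1, 1   # jump = spacing between neighboring outputs, in input pixels
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Five 3x3 stride-1 convs: the receptive field grows linearly (1 + 5*2 = 11).
print(receptive_field([(3, 1)] * 5))            # 11
# Interleave 2x2 stride-2 pooling: the receptive field now grows roughly exponentially.
print(receptive_field([(3, 1), (2, 2)] * 5))    # 94
```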

11 of 66

CNN

A typical CNN or vision transformer architecture involves

  • Multiple layers of computation + nonlinearity + (pooling + striding)
  • These produce a (final) feature map
  • The feature map then goes through FC layers (an MLP)

12 of 66

What is the final FC layer doing?

[Koh et al., Concept Bottleneck Models, 2020]

13 of 66

Illustration

[Figure: image → CNN → concept scores (red, blue, long beak, long leg; yes: 1, no: -1) → classes (Cardinal, Flamingo, Blue Jay)]
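The figure's idea in a few lines of NumPy (toy weights of my own, not the paper's): the final FC layer is just a matrix whose rows score each class from the ±1 concept predictions.

```python
import numpy as np

# Hypothetical concept predictions from the CNN for one image,
# ordered [red, blue, long beak, long leg]; yes: 1, no: -1.
concepts = np.array([1, -1, 1, 1])

# Final FC layer: one row of weights per class (toy values).
W = np.array([
    [ 1, -1, -1, -1],   # Cardinal: red, not blue, short beak, short legs
    [ 1, -1,  1,  1],   # Flamingo: reddish, long beak, long legs
    [-1,  1, -1, -1],   # Blue Jay: blue
])
classes = ["Cardinal", "Flamingo", "Blue Jay"]

scores = W @ concepts                    # inner product of each row with the concepts
print(classes[int(np.argmax(scores))])   # Flamingo
```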

14 of 66

Popular CNN architectures

  • Encoder, decoder

15 of 66

Popular CNN architectures

  • U-Net: Encoder + decoder + skip links

16 of 66

Exemplar computer vision tasks

[C. Rieke, 2019]

17 of 66

Representative 2D recognition tasks

  • “Same” input: images

  • “Different” outputs:
    • a) A C-dim class probability vector
    • b) A set of bounding boxes, each with a box location and class probability
    • c) A W × H × C feature map
    • d) A combination of b) and c)

  • “Different” labeled training data

[Figure: example outputs: a) class labels (dog, cat, horse, sheep); b) bounding boxes; c) a W × H feature map; d) a combination of b) and c)]

18 of 66

Does segmentation need a new architecture?

Single spatial output!

19 of 66

Fully-convolutional network (FCN)

[Figure: CNN → feature map → vectorized into a single vector → FC layer (matrix multiplication; each class score is an inner product) → scores for dog, cat, boat, bird]

20 of 66

Fully-convolutional network (FCN)

[Figure: the same FC layer viewed as convolution: each row of the FC weight matrix acts as a conv filter on the feature map, producing scores for dog, cat, boat, bird]
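A small NumPy sketch (shapes and weights made up for illustration) of the conversion: reusing the FC weight matrix as conv filters turns one classification into a dense grid of classifications, so a larger input simply gives a larger score map.

```python
import numpy as np

rng = np.random.default_rng(0)
C, K = 16, 4                        # feature channels, number of classes
Wfc = rng.standard_normal((K, C))   # FC weights: each row scores one class

# Classification view: a C-dim feature vector -> K class scores.
feat = rng.standard_normal(C)
fc_out = Wfc @ feat

# Convolutional view: the same rows used as K 1x1 filters on an H x W
# feature map give a K x H x W score map -- one prediction per location.
fmap = rng.standard_normal((C, 3, 5))
score_map = np.einsum('kc,chw->khw', Wfc, fmap)
print(score_map.shape)              # (4, 3, 5)

# On a 1x1 "map" the two views coincide exactly.
assert np.allclose(
    np.einsum('kc,chw->khw', Wfc, feat[:, None, None])[:, 0, 0], fc_out)
```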

21 of 66

What if I input a larger image?

[Figure: a larger input yields a larger feature map, and the conv-style classifier outputs a spatial map of dog/cat/boat/bird scores instead of a single vector]


24 of 66

Fully-convolutional network (FCN)

[Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015]

25 of 66

U-Net

[Figure: U-Net; the skip links help localization, while the contracting (down-sampling) path helps context + semantics]

[Ronneberger et al., U-Net: Convolutional Networks for Biomedical Image Segmentation, MICCAI 2015]

26 of 66

How to teach the model?

27 of 66

Today

  • CNN & segmentation recap
  • Object detection

28 of 66

Object detection

  • Properties:
    • Labels + bounding boxes
    • Localization
    • Multi-scale
    • Context
    • “Undetermined” number of objects

Each box: [class, u-center, v-center, width, height]

29 of 66

Naïve way

  • Sliding window
    • Time-consuming
    • What window size?

[Figure: each window crop is fed to a ResNet classifier]
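Back-of-the-envelope cost (my own numbers, not from the slides): even a modest image, a coarse stride, and a handful of window sizes yield thousands of crops, and each one would need a full classifier forward pass.

```python
def num_windows(img_w, img_h, win, stride):
    """How many win x win crops a sliding window produces over an image."""
    nx = (img_w - win) // stride + 1
    ny = (img_h - win) // stride + 1
    return nx * ny

# A 640x480 image, stride 8, four window sizes to cover unknown object scale.
total = sum(num_windows(640, 480, w, 8) for w in (32, 64, 128, 256))
print(total)  # 12604 crops, each one a full CNN forward pass
```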

30 of 66

R-CNN

  • Objectness proposal
  • CNN classifier
  • Box refinement

[Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR 2014]

31 of 66

Selective search for proposal generation

  • Step 1:
    • Not deep learning
    • Super-pixel-based segmentation

  • Step 2:
    • Recursively combine similar regions into larger ones

  • Step 3:
    • Fit boxes to the merged regions

[Stanford CS 231b]

32 of 66

R-CNN

[Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR 2014]

[Girshick, CVPR 2019 tutorial]

33 of 66

R-CNN

  • Box regression:
    • (du, dv): center offsets
    • (dw, dh): size offsets

  • Offsets are predicted by an MLP: offset = MLP(feature)

[Figure: proposal box vs. ground-truth box]
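A sketch of the standard R-CNN box parameterization (boxes as (u-center, v-center, width, height); the helper names are my own): the network regresses normalized offsets, which are applied back to the proposal at inference.

```python
import numpy as np

def encode(proposal, gt):
    """Regression targets (du, dv, dw, dh) in the standard R-CNN
    parameterization; boxes are (u_center, v_center, width, height)."""
    pu, pv, pw, ph = proposal
    gu, gv, gw, gh = gt
    return np.array([(gu - pu) / pw, (gv - pv) / ph,
                     np.log(gw / pw), np.log(gh / ph)])

def decode(proposal, deltas):
    """Apply predicted offsets to a proposal to get the refined box."""
    pu, pv, pw, ph = proposal
    du, dv, dw, dh = deltas
    return np.array([pu + du * pw, pv + dv * ph,
                     pw * np.exp(dw), ph * np.exp(dh)])

proposal = np.array([50.0, 40.0, 20.0, 30.0])    # toy proposal
gt       = np.array([54.0, 43.0, 24.0, 27.0])    # toy ground truth
deltas = encode(proposal, gt)                    # what the MLP should predict
assert np.allclose(decode(proposal, deltas), gt) # decoding recovers the gt box
```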

34 of 66

R-CNN

  • Problems:
    • Slow: every proposal goes through a “full” CNN pass
    • Missed detections: the proposal algorithm is not trained jointly with the rest

35 of 66

Fast R-CNN

ROI pooling

[Girshick, CVPR 2019 tutorial]

[Girshick, Fast R-CNN, ICCV 2015]

36 of 66

ROI pooling vs. ROI align

[Figure: ROI Pooling (hard quantization) vs. ROI Align (bilinear sampling)]

Both make features extracted from different proposals the same size!
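A naive NumPy version of ROI pooling (simplified, with the hard quantization that ROI Align replaces by bilinear sampling) to show how differently-sized ROIs become fixed-size features:

```python
import numpy as np

def roi_pool(fmap, roi, out_size=2):
    """Naive ROI pooling: quantize the ROI into an out_size x out_size grid
    of bins and max-pool each bin. fmap is C x H x W; roi = (u0, v0, u1, v1)
    in feature-map coordinates."""
    u0, v0, u1, v1 = [int(round(c)) for c in roi]   # hard quantization
    C = fmap.shape[0]
    out = np.zeros((C, out_size, out_size))
    us = np.linspace(u0, u1, out_size + 1).astype(int)
    vs = np.linspace(v0, v1, out_size + 1).astype(int)
    for i in range(out_size):
        for j in range(out_size):
            bin_ = fmap[:, vs[i]:max(vs[i + 1], vs[i] + 1),
                           us[j]:max(us[j + 1], us[j] + 1)]
            out[:, i, j] = bin_.max(axis=(1, 2))
    return out

fmap = np.arange(64, dtype=float).reshape(1, 8, 8)
small = roi_pool(fmap, (0, 0, 4, 4))
large = roi_pool(fmap, (1, 1, 7, 7))
print(small.shape, large.shape)   # both (1, 2, 2): same size regardless of ROI
```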

37 of 66

Faster R-CNN

ROI pooling

[Girshick, CVPR 2019 tutorial]

[Ren et al., Faster r-cnn: Towards real-time object detection with region proposal networks, NIPS 2015]

38 of 66

How to teach a model to propose object locations?


40 of 66

How to teach a model to propose object locations?

Ground truth

41 of 66

How to teach a model to propose object locations?

Ground truth

Label = 1: “There is a car centered around this location!”

42 of 66

What size?

Ground truth

Label = 1: “There is a car centered around this location!”

43 of 66

What size?

Ground truth

Consider “pre-defined” anchors

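A sketch of pre-defined anchor generation (the scales and aspect ratios are typical illustrative values, not prescribed by the slides): at every feature-map location we stamp K boxes of different sizes and shapes.

```python
import numpy as np

def make_anchors(center, scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    """K = len(scales) * len(ratios) anchor boxes at one location,
    as (u_center, v_center, width, height); each box has area scale**2
    and aspect ratio width/height = ratio."""
    u, v = center
    return [(u, v, s * np.sqrt(r), s / np.sqrt(r))
            for s in scales for r in ratios]

anchors = make_anchors((100, 100))
print(len(anchors))   # 9 anchors per location (K = 3 scales x 3 ratios)
```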

45 of 66

What size?

Ground truth

[Figure: anchors of several pre-defined sizes; each anchor matching the ground truth is labeled 1]

46 of 66

How to develop an RPN (region proposal network)?

Output tensor: 5 × 8 × K × (2 + 4), i.e., at each of the 5 × 8 feature-map locations, K anchors, each with 2 objectness scores and 4 box offsets

[Ren et al., 2015]

[Figure: ground-truth box and anchors on the feature map]

47 of 66

What do we learn from RPN?

  • “How to encode your labeled data so that your CNN can learn from them and predict them” is important!

  • Inference: predict these “values” and then convert them back into bounding boxes!
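One way to encode the labels, sketched in plain Python (the 0.7/0.3 thresholds follow the Faster R-CNN RPN; the helper itself is my own simplification): each anchor is compared to the ground truth by IoU and assigned a training target.

```python
def iou(a, b):
    """IoU of two boxes in (u0, v0, u1, v1) corner format."""
    iu0, iv0 = max(a[0], b[0]), max(a[1], b[1])
    iu1, iv1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, iu1 - iu0) * max(0, iv1 - iv0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def label_anchors(anchors, gt, pos_thr=0.7, neg_thr=0.3):
    """Per-anchor training targets: 1 = object, 0 = background, -1 = ignored."""
    return [1 if iou(a, gt) >= pos_thr else
            (0 if iou(a, gt) < neg_thr else -1) for a in anchors]

gt = (10, 10, 50, 50)
anchors = [(12, 12, 52, 52),      # high overlap    -> positive (1)
           (20, 20, 60, 60),      # partial overlap -> ignored (-1)
           (60, 60, 100, 100)]    # no overlap      -> negative (0)
print(label_anchors(anchors, gt))  # [1, -1, 0]
```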

48 of 66

Questions?

49 of 66

How to deal with object sizes?

[Lin et al., Feature Pyramid Networks for Object Detection, CVPR 2017]

50 of 66

Mask R-CNN

[Girshick, CVPR 2019 tutorial]

[He et al., Mask r-cnn, ICCV 2017]

51 of 66

Mask R-CNN: for instance segmentation

CNN: convolutional neural network

RPN: region proposal network

[Figure: predicted class probabilities (bulldozer: 80%, bus: 15%, motorcycle: 5%)]

52 of 66

2-stage vs. 1-stage detectors

  • Other names: single-shot, single-pass, … (e.g., YOLO, SSD)
  • Difference: no ROI pooling/align

[Redmon et al., 2016]

2-stage detector

1-stage detector

53 of 66

Exemplar 1-stage detectors

[Liu et al., 2016]

SSD

YOLO

[Redmon et al., 2016]

54 of 66

Exemplar 1-stage detectors (Retina Net)

[Lin et al., 2017]

55 of 66

2-stage vs. 1-stage detectors

  • Pros for 1-stage:
    • Faster!

  • Cons for 1-stage:
    • Many more negative (background) locations than positives; handling scale is harder

[Redmon et al., 2016]

56 of 66

Inference: choose few from many

  • Non-Maximum Suppression (NMS)

[Pictures from “towards data science” post]
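A greedy NMS sketch in NumPy (boxes and scores are toy values of my own): repeatedly keep the highest-scoring box and suppress the remaining boxes that overlap it too much.

```python
import numpy as np

def box_area(b):
    """Areas of (N, 4) boxes in (u0, v0, u1, v1) format."""
    return (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])

def nms(boxes, scores, iou_thr=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop remaining boxes that overlap it above iou_thr, repeat."""
    order = np.argsort(scores)[::-1]        # indices, best score first
    keep = []
    while order.size > 0:
        i = int(order[0])
        keep.append(i)
        rest = boxes[order[1:]]
        # IoU of the kept box with every remaining candidate
        u0 = np.maximum(boxes[i, 0], rest[:, 0])
        v0 = np.maximum(boxes[i, 1], rest[:, 1])
        u1 = np.minimum(boxes[i, 2], rest[:, 2])
        v1 = np.minimum(boxes[i, 3], rest[:, 3])
        inter = np.clip(u1 - u0, 0, None) * np.clip(v1 - v0, 0, None)
        ious = inter / (box_area(boxes[i:i + 1]) + box_area(rest) - inter)
        order = order[1:][ious <= iou_thr]
    return keep

boxes = np.array([[10, 10, 50, 50],      # duplicate detections of one object
                  [12, 12, 52, 52],
                  [80, 80, 120, 120]])   # a second object
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: one box per object survives
```

In practice this is what `torchvision.ops.nms` implements in optimized form.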

57 of 66

Example results

[Zhang, et al., 2021]

58 of 66

New approach to object detection

  • DEtection TRansformer (DETR) [Carion et al., End-to-End Object Detection with Transformers, ECCV 2020]


60 of 66

Key names

  • Tsung-Yi Lin
  • Ross Girshick
  • Kaiming He
  • Piotr Dollar

61 of 66

Take home

  • Good tutorials online:
    • CVPR 2017-2022, ECCV 2018-2020, ICCV 2017-2021 [search tutorial or workshop]
    • ICML/NeurIPS/ICLR 2018-2021 [search tutorial or workshop]

  • Good framework:
    • PyTorch: Torchvision
    • PyTorch: Detectron2

  • Good source code:
    • Papers with code: https://paperswithcode.com/

62 of 66

LiDAR-based 3D perception

63 of 66

LiDAR-based 3D perception

[Source: Graham Murdoch/Popular Science]

LiDAR:

  • Light Detection and Ranging sensor
  • Produces accurate 3D point clouds of the environment

64 of 66

LiDAR-based 3D perception

You can view the LiDAR point clouds from different angles

Frontal view

Bird’s-eye view (BEV)

65 of 66

Two major ways to process LiDAR point clouds

  • Point-wise processing
    • PointNet [Qi et al., 2017]
    • PointNet++ [Qi et al., 2017]
    • PointRCNN [Shi et al., 2019]

  • Voxel-based processing: turn points into a tensor (e.g., W × D × H × F)
    • PointPillars [Lang et al., 2019]
    • VoxelNet [Zhou et al., 2017]
    • PIXOR [Yang et al., 2018]

66 of 66

Voxel-based processing + 3D object detectors

  • Occupancy (PIXOR): represent the 3D points as a binary occupancy tensor from the bird’s-eye view
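A minimal occupancy-style voxelization in NumPy (the extent, resolution, and axis conventions are chosen for illustration, not taken from the paper): points are binned into a 3D grid and each occupied voxel is set to 1.

```python
import numpy as np

def occupancy_grid(points, extent=((0, 40), (-20, 20), (-2, 2)), res=1.0):
    """Turn an (N, 3) LiDAR point cloud (x = depth, y = left-right,
    z = height) into a binary occupancy tensor from the bird's-eye view."""
    (x0, x1), (y0, y1), (z0, z1) = extent
    shape = tuple(int((hi - lo) / res)
                  for lo, hi in ((x0, x1), (y0, y1), (z0, z1)))
    grid = np.zeros(shape, dtype=np.uint8)
    for x, y, z in points:
        if x0 <= x < x1 and y0 <= y < y1 and z0 <= z < z1:
            i = int((x - x0) / res)
            j = int((y - y0) / res)
            k = int((z - z0) / res)
            grid[i, j, k] = 1   # this voxel is occupied
    return grid

points = np.array([[5.2, 3.1, 0.4],     # two nearby points ...
                   [5.3, 3.2, 0.5],     # ... fall into the same voxel
                   [100.0, 0.0, 0.0]])  # out of range: dropped
grid = occupancy_grid(points)
print(grid.shape, int(grid.sum()))  # (40, 40, 4) 1
```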

[Yang et al., PIXOR: Real-time 3D Object Detection from Point Clouds, CVPR 2018]

[Figure: BEV occupancy tensor axes: depth, left-right, height]