1 of 66

CSE 5524: Object detection

2 of 66

HW 1 & 2

  • Grades will be posted by tomorrow.
  • Regrading form will be posted on Carmen.

3 of 66

Midterm

  • Discuss the questions on Thursday!
  • Check your exam packet on Thursday!

4 of 66

Homework and quiz plan

  • HW 3: (10%)
    • Release: 3/24
    • Due: 4/7

  • HW 4: (10%)
    • Release: 4/7
    • Due: 4/21

  • Quiz: 4%, coming soon

5 of 66

Final project (30%)

  • Team forming:
    • 2–3 students: same expectations

  • Tentative plan:

6 of 66

Project first glance

  • Pre-defined tasks:

    • Reproducing existing algorithms
    • Benchmarking existing algorithms
    • Reproducing examples in the textbook

  • Self-defined “research” tasks:
    • Need approval
    • Need justification

7 of 66

Today (Chapter 50)

  • CNN & segmentation recap
  • Object detection

8 of 66

How do I know it is a zebra?

9 of 66

Convolutions

Linear receptive field

10 of 66

Convolutions

Linear receptive field

Exponential receptive field

(with pooling + down-sampling)
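The two regimes can be checked with a few lines of arithmetic. A small sketch (my own helper, not from the slides) that tracks how the receptive field grows through a stack of layers:

```python
def receptive_field(layers):
    """Receptive field at the input for a stack of (kernel_size, stride) layers."""
    rf, jump = 1, 1   # jump = spacing between neighboring outputs, in input pixels
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Five 3x3 stride-1 convs: the receptive field grows linearly (1 + 5*2 = 11).
print(receptive_field([(3, 1)] * 5))            # 11
# Interleave 2x2 stride-2 pooling: the receptive field now grows roughly exponentially.
print(receptive_field([(3, 1), (2, 2)] * 5))    # 94
```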

11 of 66

CNN

A typical CNN or vision transformer architecture involves

  • Multiple layers of computation + nonlinearity + (pooling + striding)
  • These produce a (final) feature map
  • The feature map then goes through FC layers (an MLP)

12 of 66

What is the final FC layer doing?

[Koh et al., Concept Bottleneck Models, 2020]

13 of 66

Illustration

[Figure: image → CNN → concept scores (red, blue, long beak, long leg; yes: 1, no: -1) → classes (Cardinal, Flamingo, Blue Jay)]
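The figure's idea in a few lines of NumPy (toy weights of my own, not the paper's): the final FC layer is just a matrix whose rows score each class from the ±1 concept predictions.

```python
import numpy as np

# Hypothetical concept predictions from the CNN for one image,
# ordered [red, blue, long beak, long leg]; yes: 1, no: -1.
concepts = np.array([1, -1, 1, 1])

# Final FC layer: one row of weights per class (toy values).
W = np.array([
    [ 1, -1, -1, -1],   # Cardinal: red, not blue, short beak, short legs
    [ 1, -1,  1,  1],   # Flamingo: reddish, long beak, long legs
    [-1,  1, -1, -1],   # Blue Jay: blue
])
classes = ["Cardinal", "Flamingo", "Blue Jay"]

scores = W @ concepts                    # inner product of each row with the concepts
print(classes[int(np.argmax(scores))])   # Flamingo
```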

14 of 66

Popular CNN architectures

  • Encoder, decoder

15 of 66

Popular CNN architectures

  • U-Net: Encoder + decoder + skip links

16 of 66

Exemplar computer vision tasks

[C. Rieke, 2019]

17 of 66

Representative 2D recognition tasks

  • “Same” input: images

  • “Different” outputs:
    • a) A C-dim class probability vector
    • b) A set of bounding boxes, each with a box location and class probability
    • c) A W × H × C feature map
    • d) A combination of b) and c)

  • “Different” labeled training data

[Figure: example outputs: a) class labels (dog, cat, horse, sheep); b) bounding boxes; c) a W × H feature map; d) a combination of b) and c)]

18 of 66

Does segmentation need a new architecture?

Single spatial output!

19 of 66

Fully-convolutional network (FCN)

[Figure: CNN → feature map → vectorized into a single vector → FC layer (matrix multiplication; each class score is an inner product) → scores for dog, cat, boat, bird]

20 of 66

Fully-convolutional network (FCN)

[Figure: the same FC layer viewed as convolution: each row of the FC weight matrix acts as a conv filter on the feature map, producing scores for dog, cat, boat, bird]
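A small NumPy sketch (shapes and weights made up for illustration) of the conversion: reusing the FC weight matrix as conv filters turns one classification into a dense grid of classifications, so a larger input simply gives a larger score map.

```python
import numpy as np

rng = np.random.default_rng(0)
C, K = 16, 4                        # feature channels, number of classes
Wfc = rng.standard_normal((K, C))   # FC weights: each row scores one class

# Classification view: a C-dim feature vector -> K class scores.
feat = rng.standard_normal(C)
fc_out = Wfc @ feat

# Convolutional view: the same rows used as K 1x1 filters on an H x W
# feature map give a K x H x W score map -- one prediction per location.
fmap = rng.standard_normal((C, 3, 5))
score_map = np.einsum('kc,chw->khw', Wfc, fmap)
print(score_map.shape)              # (4, 3, 5)

# On a 1x1 "map" the two views coincide exactly.
assert np.allclose(
    np.einsum('kc,chw->khw', Wfc, feat[:, None, None])[:, 0, 0], fc_out)
```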

21 of 66

What if I input a larger image?

[Figure: a larger input yields a larger feature map, and the conv-style classifier outputs a spatial map of dog/cat/boat/bird scores instead of a single vector]


24 of 66

Fully-convolutional network (FCN)

[Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015]

25 of 66

U-Net

[Figure: U-Net; the skip links help localization, while the contracting (down-sampling) path helps context + semantics]

[Ronneberger et al., U-Net: Convolutional Networks for Biomedical Image Segmentation, MICCAI 2015]

26 of 66

How to teach the model?

27 of 66

Today

  • CNN & segmentation recap
  • Object detection

28 of 66

Object detection

  • Properties:
    • Labels + bounding boxes
    • Localization
    • Multi-scale
    • Context
    • “Undetermined” number of objects

Each box: [class, u-center, v-center, width, height]

29 of 66

Naïve way

  • Sliding window
    • Time-consuming
    • What window size?

[Figure: each window crop is fed to a ResNet classifier]
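Back-of-the-envelope cost (my own numbers, not from the slides): even a modest image, a coarse stride, and a handful of window sizes yield thousands of crops, and each one would need a full classifier forward pass.

```python
def num_windows(img_w, img_h, win, stride):
    """How many win x win crops a sliding window produces over an image."""
    nx = (img_w - win) // stride + 1
    ny = (img_h - win) // stride + 1
    return nx * ny

# A 640x480 image, stride 8, four window sizes to cover unknown object scale.
total = sum(num_windows(640, 480, w, 8) for w in (32, 64, 128, 256))
print(total)  # 12604 crops, each one a full CNN forward pass
```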

30 of 66

R-CNN

  • Objectness proposal
  • CNN classifier
  • Box refinement

[Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR 2014]

31 of 66

Selective search for proposal generation

  • Step 1:
    • Not deep learning
    • Super-pixel-based segmentation

  • Step 2:
    • Recursively combine similar regions into larger ones

  • Step 3:
    • Fit boxes to the merged regions

[Stanford CS 231b]

32 of 66

R-CNN

[Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR 2014]

[Girshick, CVPR 2019 tutorial]

33 of 66

R-CNN

  • Box regression:
    • (du, dv): center offsets
    • (dw, dh): size offsets

  • Offsets are predicted by an MLP: offset = MLP(feature)

[Figure: proposal box vs. ground-truth box]
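A sketch of the standard R-CNN box parameterization (boxes as (u-center, v-center, width, height); the helper names are my own): the network regresses normalized offsets, which are applied back to the proposal at inference.

```python
import numpy as np

def encode(proposal, gt):
    """Regression targets (du, dv, dw, dh) in the standard R-CNN
    parameterization; boxes are (u_center, v_center, width, height)."""
    pu, pv, pw, ph = proposal
    gu, gv, gw, gh = gt
    return np.array([(gu - pu) / pw, (gv - pv) / ph,
                     np.log(gw / pw), np.log(gh / ph)])

def decode(proposal, deltas):
    """Apply predicted offsets to a proposal to get the refined box."""
    pu, pv, pw, ph = proposal
    du, dv, dw, dh = deltas
    return np.array([pu + du * pw, pv + dv * ph,
                     pw * np.exp(dw), ph * np.exp(dh)])

proposal = np.array([50.0, 40.0, 20.0, 30.0])    # toy proposal
gt       = np.array([54.0, 43.0, 24.0, 27.0])    # toy ground truth
deltas = encode(proposal, gt)                    # what the MLP should predict
assert np.allclose(decode(proposal, deltas), gt) # decoding recovers the gt box
```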

34 of 66

R-CNN

  • Problems:
    • Slow: every proposal goes through a “full” CNN pass
    • Missed detections: the proposal algorithm is not trained jointly with the rest

35 of 66

Fast R-CNN

ROI pooling

[Girshick, CVPR 2019 tutorial]

[Girshick, Fast R-CNN, ICCV 2015]

36 of 66

ROI pooling vs. ROI align

[Figure: ROI Pooling (hard quantization) vs. ROI Align (bilinear sampling)]

Both make features extracted from different proposals the same size!
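A naive NumPy version of ROI pooling (simplified, with the hard quantization that ROI Align replaces by bilinear sampling) to show how differently-sized ROIs become fixed-size features:

```python
import numpy as np

def roi_pool(fmap, roi, out_size=2):
    """Naive ROI pooling: quantize the ROI into an out_size x out_size grid
    of bins and max-pool each bin. fmap is C x H x W; roi = (u0, v0, u1, v1)
    in feature-map coordinates."""
    u0, v0, u1, v1 = [int(round(c)) for c in roi]   # hard quantization
    C = fmap.shape[0]
    out = np.zeros((C, out_size, out_size))
    us = np.linspace(u0, u1, out_size + 1).astype(int)
    vs = np.linspace(v0, v1, out_size + 1).astype(int)
    for i in range(out_size):
        for j in range(out_size):
            bin_ = fmap[:, vs[i]:max(vs[i + 1], vs[i] + 1),
                           us[j]:max(us[j + 1], us[j] + 1)]
            out[:, i, j] = bin_.max(axis=(1, 2))
    return out

fmap = np.arange(64, dtype=float).reshape(1, 8, 8)
small = roi_pool(fmap, (0, 0, 4, 4))
large = roi_pool(fmap, (1, 1, 7, 7))
print(small.shape, large.shape)   # both (1, 2, 2): same size regardless of ROI
```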

37 of 66

Faster R-CNN

ROI pooling

[Girshick, CVPR 2019 tutorial]

[Ren et al., Faster r-cnn: Towards real-time object detection with region proposal networks, NIPS 2015]

38 of 66

How to teach a model to propose object locations?


40 of 66

How to teach a model to propose object locations?

Ground truth

41 of 66

How to teach a model to propose object locations?

Ground truth

Label = 1: “There is a car centered around this location!”

42 of 66

What size?

Ground truth

Label = 1: “There is a car centered around this location!”

43 of 66

What size?

Ground truth

Consider “pre-defined” anchors

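A sketch of pre-defined anchor generation (the scales and aspect ratios are typical illustrative values, not prescribed by the slides): at every feature-map location we stamp K boxes of different sizes and shapes.

```python
import numpy as np

def make_anchors(center, scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    """K = len(scales) * len(ratios) anchor boxes at one location,
    as (u_center, v_center, width, height); each box has area scale**2
    and aspect ratio width/height = ratio."""
    u, v = center
    return [(u, v, s * np.sqrt(r), s / np.sqrt(r))
            for s in scales for r in ratios]

anchors = make_anchors((100, 100))
print(len(anchors))   # 9 anchors per location (K = 3 scales x 3 ratios)
```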

45 of 66

What size?

Ground truth

[Figure: anchors of several pre-defined sizes; each anchor matching the ground truth is labeled 1]

46 of 66

How to develop an RPN (region proposal network)?

Output tensor: 5 × 8 × K × (2 + 4), i.e., at each of the 5 × 8 feature-map locations, K anchors, each with 2 objectness scores and 4 box offsets

[Ren et al., 2015]

[Figure: ground-truth box and anchors on the feature map]

47 of 66

What do we learn from RPN?

  • “How to encode your labeled data so that your CNN can learn from them and predict them” is important!

  • Inference: predict these “values” and then convert them back into bounding boxes!
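One way to encode the labels, sketched in plain Python (the 0.7/0.3 thresholds follow the Faster R-CNN RPN; the helper itself is my own simplification): each anchor is compared to the ground truth by IoU and assigned a training target.

```python
def iou(a, b):
    """IoU of two boxes in (u0, v0, u1, v1) corner format."""
    iu0, iv0 = max(a[0], b[0]), max(a[1], b[1])
    iu1, iv1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, iu1 - iu0) * max(0, iv1 - iv0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def label_anchors(anchors, gt, pos_thr=0.7, neg_thr=0.3):
    """Per-anchor training targets: 1 = object, 0 = background, -1 = ignored."""
    return [1 if iou(a, gt) >= pos_thr else
            (0 if iou(a, gt) < neg_thr else -1) for a in anchors]

gt = (10, 10, 50, 50)
anchors = [(12, 12, 52, 52),      # high overlap    -> positive (1)
           (20, 20, 60, 60),      # partial overlap -> ignored (-1)
           (60, 60, 100, 100)]    # no overlap      -> negative (0)
print(label_anchors(anchors, gt))  # [1, -1, 0]
```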

48 of 66

Questions?

49 of 66

How to deal with object sizes?

[Lin et al., Feature Pyramid Networks for Object Detection, CVPR 2017]

50 of 66

Mask R-CNN

[Girshick, CVPR 2019 tutorial]

[He et al., Mask r-cnn, ICCV 2017]

51 of 66

Mask R-CNN: for instance segmentation

CNN: convolutional neural network

RPN: region proposal network

[Figure: predicted class probabilities (bulldozer: 80%, bus: 15%, motorcycle: 5%)]

52 of 66

2-stage vs. 1-stage detectors

  • Other names: single-shot, single-pass, … (e.g., YOLO, SSD)
  • Difference: no ROI pooling/align

[Redmon et al., 2016]

2-stage detector

1-stage detector

53 of 66

Exemplar 1-stage detectors

[Liu et al., 2016]

SSD

YOLO

[Redmon et al., 2016]

54 of 66

Exemplar 1-stage detectors (Retina Net)

[Lin et al., 2017]

55 of 66

2-stage vs. 1-stage detectors

  • Pros for 1-stage:
    • Faster!

  • Cons for 1-stage:
    • Many more negative (background) locations than positives; handling scale is harder

[Redmon et al., 2016]

56 of 66

Inference: choose few from many

  • Non-Maximum Suppression (NMS)

[Pictures from “towards data science” post]
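A greedy NMS sketch in NumPy (boxes and scores are toy values of my own): repeatedly keep the highest-scoring box and suppress the remaining boxes that overlap it too much.

```python
import numpy as np

def box_area(b):
    """Areas of (N, 4) boxes in (u0, v0, u1, v1) format."""
    return (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])

def nms(boxes, scores, iou_thr=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop remaining boxes that overlap it above iou_thr, repeat."""
    order = np.argsort(scores)[::-1]        # indices, best score first
    keep = []
    while order.size > 0:
        i = int(order[0])
        keep.append(i)
        rest = boxes[order[1:]]
        # IoU of the kept box with every remaining candidate
        u0 = np.maximum(boxes[i, 0], rest[:, 0])
        v0 = np.maximum(boxes[i, 1], rest[:, 1])
        u1 = np.minimum(boxes[i, 2], rest[:, 2])
        v1 = np.minimum(boxes[i, 3], rest[:, 3])
        inter = np.clip(u1 - u0, 0, None) * np.clip(v1 - v0, 0, None)
        ious = inter / (box_area(boxes[i:i + 1]) + box_area(rest) - inter)
        order = order[1:][ious <= iou_thr]
    return keep

boxes = np.array([[10, 10, 50, 50],      # duplicate detections of one object
                  [12, 12, 52, 52],
                  [80, 80, 120, 120]])   # a second object
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: one box per object survives
```

In practice this is what `torchvision.ops.nms` implements in optimized form.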

57 of 66

Example results

[Zhang, et al., 2021]

58 of 66

New approach to object detection

  • DEtection TRansformer (DETR) [Carion et al., End-to-End Object Detection with Transformers, ECCV 2020]


60 of 66

Key names

  • Tsung-Yi Lin
  • Ross Girshick
  • Kaiming He
  • Piotr Dollar

61 of 66

Take home

  • Good tutorials online:
    • CVPR 2017-2022, ECCV 2018-2020, ICCV 2017-2021 [search tutorial or workshop]
    • ICML/NeurIPS/ICLR 2018-2021 [search tutorial or workshop]

  • Good framework:
    • PyTorch: Torchvision
    • PyTorch: Detectron2

  • Good source code:
    • Papers with code: https://paperswithcode.com/

62 of 66

LiDAR-based 3D perception

63 of 66

LiDAR-based 3D perception

[Source: Graham Murdoch/Popular Science]

LiDAR:

  • Light Detection and Ranging sensor
  • Produces accurate 3D point clouds of the environment

64 of 66

LiDAR-based 3D perception

You can view the LiDAR point clouds from different angles

Frontal view

Bird’s-eye view (BEV)

65 of 66

Two major ways to process LiDAR point clouds

  • Point-wise processing
    • PointNet [Qi et al., 2017]
    • PointNet++ [Qi et al., 2017]
    • PointRCNN [Shi et al., 2019]

  • Voxel-based processing: turn points into a tensor (e.g., W × D × H × F)
    • PointPillars [Lang et al., 2019]
    • VoxelNet [Zhou et al., 2017]
    • PIXOR [Yang et al., 2018]

66 of 66

Voxel-based processing + 3D object detectors

  • Occupancy (PIXOR): represent the 3D points as a binary occupancy tensor from the bird’s-eye view
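A minimal occupancy-style voxelization in NumPy (the extent, resolution, and axis conventions are chosen for illustration, not taken from the paper): points are binned into a 3D grid and each occupied voxel is set to 1.

```python
import numpy as np

def occupancy_grid(points, extent=((0, 40), (-20, 20), (-2, 2)), res=1.0):
    """Turn an (N, 3) LiDAR point cloud (x = depth, y = left-right,
    z = height) into a binary occupancy tensor from the bird's-eye view."""
    (x0, x1), (y0, y1), (z0, z1) = extent
    shape = tuple(int((hi - lo) / res)
                  for lo, hi in ((x0, x1), (y0, y1), (z0, z1)))
    grid = np.zeros(shape, dtype=np.uint8)
    for x, y, z in points:
        if x0 <= x < x1 and y0 <= y < y1 and z0 <= z < z1:
            i = int((x - x0) / res)
            j = int((y - y0) / res)
            k = int((z - z0) / res)
            grid[i, j, k] = 1   # this voxel is occupied
    return grid

points = np.array([[5.2, 3.1, 0.4],     # two nearby points ...
                   [5.3, 3.2, 0.5],     # ... fall into the same voxel
                   [100.0, 0.0, 0.0]])  # out of range: dropped
grid = occupancy_grid(points)
print(grid.shape, int(grid.sum()))  # (40, 40, 4) 1
```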

[Yang et al., PIXOR: Real-time 3D Object Detection from Point Clouds, CVPR 2018]

[Figure: BEV occupancy tensor axes: depth, left-right, height]