Computer Vision
Live Lecture 10 – Latest research in recognition
Structure
Recall: Detection
Open-World Object Detection
What if we want to find as many objects as possible, regardless of their classes?
A simple solution: remove the classifier heads!
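One way to read "remove the classifier heads": the detector only predicts a box and an objectness score, so category labels are not needed for training or ranking. A minimal sketch, assuming annotations are plain dictionaries with hypothetical `bbox` and `category` keys (not a specific dataset format) and that objectness scores come from some detector:

```python
def to_class_agnostic(annotations):
    """Drop category labels so every box is just an 'object'."""
    return [{"bbox": ann["bbox"], "category": "object"} for ann in annotations]


def rank_by_objectness(boxes, objectness, top_k):
    """Rank predicted boxes by objectness score alone; no class scores involved."""
    order = sorted(range(len(boxes)), key=lambda i: objectness[i], reverse=True)
    return [boxes[i] for i in order[:top_k]]


# Hypothetical annotations; the class names are only used to show they get dropped.
annotations = [
    {"bbox": [10, 20, 50, 80], "category": "dog"},
    {"bbox": [60, 30, 120, 90], "category": "bicycle"},
]
agnostic = to_class_agnostic(annotations)

# Hypothetical detector outputs: boxes as [x1, y1, x2, y2] plus objectness scores.
predicted_boxes = [[0, 0, 10, 10], [5, 5, 40, 40], [20, 20, 60, 60]]
predicted_objectness = [0.2, 0.9, 0.6]
top_boxes = rank_by_objectness(predicted_boxes, predicted_objectness, top_k=2)
```

Because nothing in the ranking depends on a class vocabulary, the same code applies unchanged to objects from categories never seen in training.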
Open-World Object Detection (OLN)
Can we do better?
What “defines” an object?
More directly: what changes little across different classes of objects?
Our answer: geometric cues!
How to incorporate geometric cues?
Our solution: pseudo-labeling, i.e., use geometric cues to create additional bounding-box labels
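The pseudo-labeling idea can be sketched as a filtering step. This is a sketch under stated assumptions, not the exact procedure from the lecture: it assumes box proposals and scores are already available from some detector that only sees geometric cues, and it keeps the top-scoring proposals that do not overlap existing labels. The names `select_pseudo_boxes`, `top_k` and `max_iou`, and all numbers, are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_pseudo_boxes(gt_boxes, proposals, scores, top_k=3, max_iou=0.3):
    """Keep the highest-scoring proposals that do not overlap existing labels."""
    order = sorted(range(len(proposals)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if len(kept) == top_k:
            break
        box = proposals[i]
        # Skip proposals that duplicate an annotated box or an already kept one.
        if all(iou(box, other) < max_iou for other in gt_boxes + kept):
            kept.append(box)
    return kept


gt_boxes = [[10, 10, 50, 50]]
# Hypothetical proposals from a detector that only sees geometric cues.
proposals = [[12, 12, 48, 52], [100, 100, 150, 160], [105, 98, 152, 158], [200, 40, 240, 90]]
scores = [0.95, 0.80, 0.70, 0.60]

pseudo_boxes = select_pseudo_boxes(gt_boxes, proposals, scores)
training_boxes = gt_boxes + pseudo_boxes  # original labels plus pseudo-labels
```

In this toy input, the first proposal is dropped because it duplicates the annotated box, and the third because it duplicates the second; the remaining two become extra training labels.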
The performance gains are significant!
Structure
Recall: Segmentation
Open-vocabulary segmentation
Can we use natural language for segmentation?
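A common way to use natural language for segmentation is to compare an embedding of each predicted mask against text embeddings of candidate names and pick the best match. A minimal sketch, assuming both embeddings live in a shared space (in practice they would come from a vision-language model); the 3-D vectors below are toy values, and `assign_names` is a hypothetical helper:

```python
import numpy as np


def assign_names(mask_embeddings, text_embeddings, names):
    """Label each mask with the name whose text embedding is most similar."""
    m = mask_embeddings / np.linalg.norm(mask_embeddings, axis=1, keepdims=True)
    t = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    similarity = m @ t.T  # cosine similarity, shape (num_masks, num_names)
    return [names[i] for i in similarity.argmax(axis=1)]


names = ["road", "tree", "person"]
# Toy 3-D embeddings standing in for the outputs of a vision-language model.
text_embeddings = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
mask_embeddings = np.array([[0.1, 0.9, 0.2], [0.8, 0.1, 0.1]])

labels = assign_names(mask_embeddings, text_embeddings, names)
```

Since the label set is just a list of strings, swapping in different names changes the output without retraining, which is why the choice of names matters.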
Problem: What names are we using?
Common practice: use the class names in the current datasets.
Problem: they were annotated as class identifiers, not for vision-language tasks!
Our solution: RENOVATE the names!
Training the renaming model
Mask-Text Cross-Attention Alignment
Qualitative results
Quantitative results
Of course, the renovated names also help evaluation, which becomes more accurate and finer-grained.
Structure
The resolution problem
Most current foundation models accept input resolutions of only 224x224 px to 384x384 px. The feature maps are smaller still: roughly 14 to 16 times smaller along each side.
Screenshot from OpenCLIP:
But dense recognition (e.g., segmentation, depth estimation) clearly benefits from higher resolutions!
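The arithmetic behind the small feature resolutions can be checked directly. A minimal sketch, assuming a ViT-style encoder that splits the image into non-overlapping square patches (the 14 to 16 times reduction matches common patch sizes of 14 and 16); `feature_grid` is a hypothetical helper:

```python
def feature_grid(input_size, patch_size):
    """Tokens per side for a ViT-style encoder with non-overlapping patches."""
    return input_size // patch_size


for input_size, patch_size in [(224, 16), (224, 14), (384, 16)]:
    side = feature_grid(input_size, patch_size)
    print(f"{input_size}x{input_size} px, patch {patch_size}: {side}x{side} features")
```

Even at the largest input size listed, the feature map has only 24 positions per side, which is coarse for per-pixel tasks.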
Recall: classical fix – sliding window
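As a reminder of how the sliding-window fix works: the high-resolution image is cut into overlapping crops at the model's native input size, each crop is processed separately, and the per-crop outputs are stitched back together (for example by averaging where crops overlap). A minimal sketch of the crop layout only; the window and stride values are assumptions, and images smaller than the window would need padding first:

```python
def window_starts(length, window, stride):
    """Start offsets of windows covering [0, length); the last one sits flush with the end."""
    if length <= window:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)
    return starts


def sliding_windows(height, width, window=224, stride=112):
    """Yield (top, left, bottom, right) crops that together cover the whole image."""
    for top in window_starts(height, window, stride):
        for left in window_starts(width, window, stride):
            yield top, left, top + window, left + window


# A 448x600 image needs 3 rows x 5 columns of 224x224 crops at stride 112.
crops = list(sliding_windows(448, 600))
```

The cost is visible in the crop count: one 448x600 image turns into 15 forward passes, and each crop sees only local context.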
Alternative route: Can we upsample the features?
Why features, not images?
Upsample features: two key considerations
pseudo-labeling
self-distillation
2. Architecture:
LoftUp architecture
Benefits of a coordinate-based model
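To make "coordinate-based" concrete, the sketch below upsamples a feature map with a single cross-attention step in which output pixel coordinates act as queries and low-resolution feature tokens act as keys and values. It is an illustration only and not the LoftUp implementation: there are no learned weights, the Fourier encoding and the `temperature` value are assumptions, and a trained model would use learned projections and may also feed image content into the queries.

```python
import numpy as np


def pixel_coords(height, width):
    """Normalized (y, x) pixel-center coordinates in [0, 1], shape (H*W, 2)."""
    ys = (np.arange(height) + 0.5) / height
    xs = (np.arange(width) + 0.5) / width
    grid_y, grid_x = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([grid_y.ravel(), grid_x.ravel()], axis=1)


def fourier_features(coords, num_bands=4):
    """Sin/cos positional encoding of coordinates, shape (N, 4 * num_bands)."""
    freqs = 2.0 ** np.arange(num_bands) * np.pi
    angles = coords[:, :, None] * freqs  # (N, 2, num_bands)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1).reshape(len(coords), -1)


def upsample_by_attention(lowres_feats, out_height, out_width, temperature=0.1):
    """Cross-attention: high-res coordinates query the low-res feature tokens."""
    h, w, c = lowres_feats.shape
    queries = fourier_features(pixel_coords(out_height, out_width))
    keys = fourier_features(pixel_coords(h, w))
    logits = queries @ keys.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over low-res tokens
    return (weights @ lowres_feats.reshape(h * w, c)).reshape(out_height, out_width, c)


rng = np.random.default_rng(0)
lowres = rng.normal(size=(4, 4, 8))                  # toy 4x4 grid of 8-D features
highres = upsample_by_attention(lowres, 16, 16)      # 4x upsampling
also_valid = upsample_by_attention(lowres, 10, 24)   # any output size can be queried
```

One property this sketch shares with coordinate-based designs in general is that the output resolution is not fixed: because queries are continuous coordinates, the same function can be evaluated on any output grid.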
Results
Multi-task enhancement
Interactive segmentation
Advertisement – Student Project call
Next for LoftUp:
Summary