1 of 28

Computer Vision

Live Lecture 10 – Latest research in recognition

2 of 28

Structure

  1. Detection → Open-world detection
  2. Segmentation → Open-vocabulary segmentation
  3. Resolution problem in Dense Recognition → And how to fix it



4 of 28

Recall: Detection


5 of 28

Open-World Object Detection

What if we want to find as many objects as possible, regardless of their classes?

A simple solution: remove the classifier heads!
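
As a minimal sketch of this idea (names and dimensions are illustrative, not the actual OLN implementation, which scores localization quality such as centerness/IoU): replace the K-way classifier with a single class-agnostic objectness score, keeping only class-agnostic box regression.

```python
import numpy as np

rng = np.random.default_rng(0)

def class_agnostic_head(roi_feats, w_box, w_obj):
    """Detector head with the classifier removed: per-RoI box deltas plus a
    single objectness score in (0, 1), instead of K-way class logits."""
    boxes = roi_feats @ w_box                                  # (N, 4) box regression
    objectness = 1.0 / (1.0 + np.exp(-(roi_feats @ w_obj)))    # (N, 1) sigmoid score
    return boxes, objectness

roi_feats = rng.normal(size=(8, 256))                          # 8 pooled RoI features
boxes, scores = class_agnostic_head(
    roi_feats, rng.normal(size=(256, 4)) * 0.01, rng.normal(size=(256, 1)) * 0.01
)
```

Every proposal now gets an objectness score regardless of its class, which is exactly what lets the detector keep boxes for objects outside the training vocabulary.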


6 of 28

Open-World Object Detection (OLN)


7 of 28

Can we do better?

What “defines” an object?

More directly — what doesn’t change as much between different classes of objects?

Our answer: geometric cues!


8 of 28

How to incorporate geometric cues?

  1. We want to learn the concept of “objectness” from geometric cues.
  2. But we don’t want to require geometric cues at inference time – that would be inconvenient.

Our solution: pseudo-labeling, i.e., use geometric cues to generate additional bounding-box labels.
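
A sketch of the pseudo-labeling step (function names and thresholds are illustrative, not from the paper): take boxes proposed from geometric cues (e.g., depth or flow), keep the confident ones that are not already covered by annotated boxes, and add them to the training labels.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-9)

def add_pseudo_labels(gt_boxes, geo_boxes, geo_scores, score_thr=0.7, iou_thr=0.5):
    """Add confident geometry-derived boxes that are not already covered by GT."""
    extra = [b for b, s in zip(geo_boxes, geo_scores)
             if s >= score_thr
             and (len(gt_boxes) == 0 or iou(b, gt_boxes).max() < iou_thr)]
    return np.vstack([gt_boxes] + extra) if extra else gt_boxes
```

After this step the detector is trained on the enlarged label set only, so geometric cues are never needed at inference time.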


9 of 28

The performance gains are significant!


10 of 28

Structure

  • Detection → Open-world detection
  • Segmentation → Open-vocabulary segmentation
  • Resolution problem in Dense Recognition → And how to fix it


11 of 28

Recall: Segmentation


12 of 28

Open-vocabulary segmentation

Can we use natural language for segmentation?

  • Give the model an image and a list of candidate object names, and let it segment!
  • No pre-defined, fixed class list anymore!
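
A minimal sketch of the open-vocabulary classification step, assuming CLIP-style mask embeddings and text embeddings are already computed (all names and shapes here are illustrative): each mask is assigned the name whose text embedding is closest in cosine similarity.

```python
import numpy as np

def label_masks(mask_embs, text_embs, names):
    """Assign each predicted mask the best-matching name by cosine similarity
    between its embedding and the text embeddings of the candidate names."""
    m = mask_embs / np.linalg.norm(mask_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sim = m @ t.T                          # (num_masks, num_names)
    return [names[i] for i in sim.argmax(axis=1)]
```

Because the name list is just an input, it can be swapped at test time without retraining – the core property of open-vocabulary segmentation.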


13 of 28

Problem: What names are we using?

Common practice: use the class names in the current datasets.

Problem: they were annotated as class identifiers, not for vision-language tasks!


14 of 28

Our solution: RENOVATE the names!


15 of 28

Training the renaming model


Mask-Text Cross-Attention Alignment

16 of 28

Qualitative results


17 of 28

Quantitative results

Of course, the renovated names also improve evaluation – at higher accuracy and finer granularity.


18 of 28

Structure

  • Detection → Open-world detection
  • Segmentation → Open-vocabulary segmentation
  • Resolution problem in Dense Recognition → And how to fix it


19 of 28

The resolution problem

Most current foundation models accept input resolutions of only 224×224 px to 384×384 px. The feature maps are smaller still – roughly 14–16× smaller per side.
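
The arithmetic behind the "~14–16× smaller" claim, for ViT-style backbones where the feature grid is simply the input side divided by the patch size:

```python
# Feature grid size of a ViT-style backbone: input side / patch size.
def grid_size(input_res: int, patch: int) -> int:
    return input_res // patch

# 224 px with 16 px patches -> 14x14 features; 224/14 -> 16x16; 384/16 -> 24x24
for res, patch in [(224, 16), (224, 14), (384, 16)]:
    print(f"{res} px, patch {patch} -> {grid_size(res, patch)}x{grid_size(res, patch)} features")
```

So even at the largest common input size, a dense prediction head starts from only a 24×24 grid.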

Screenshot from OpenCLIP:

But dense recognition (e.g., segmentation, depth estimation) clearly benefits from higher resolution!


20 of 28

Recall: classical fix – sliding window
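
As a reminder, sliding-window inference can be sketched as follows (a simplified version: `predict` stands in for the low-res model, and we assume the stride tiles the image fully): run the model on overlapping crops and average the overlapping predictions back into a full-resolution map.

```python
import numpy as np

def sliding_window_predict(image, window=224, stride=112, predict=None):
    """Run `predict` on overlapping window-sized crops and average the
    overlapping per-pixel predictions into one full-size output map."""
    H, W = image.shape[:2]
    out = np.zeros((H, W))
    count = np.zeros((H, W))
    for y in range(0, max(H - window, 0) + 1, stride):
        for x in range(0, max(W - window, 0) + 1, stride):
            crop = image[y:y + window, x:x + window]
            out[y:y + window, x:x + window] += predict(crop)
            count[y:y + window, x:x + window] += 1
    return out / np.maximum(count, 1)
```

This works but is slow: the backbone runs once per window, and the number of windows grows quadratically with the image size.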


21 of 28

Alternative route: Can we upsample the features?

Why features, not images?

  • Cost-efficiency: obtaining 4× higher-resolution features by enlarging the input would cost roughly 16× more computation in the VFM, whereas feature upsampling adds nothing to the VFM backbone itself.
  • Plug-in replacement: no need to change the VFM backbones.
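
A back-of-the-envelope check of the cost argument: for a ViT, the number of tokens grows with the square of the input side, so matching a 4× feature upsampling by enlarging the input means 16× the tokens, and therefore at least 16× the backbone compute.

```python
# Tokens processed by a ViT backbone grow quadratically with the input side,
# while feature upsampling adds only a lightweight module on fixed tokens.
def vit_tokens(input_res: int, patch: int = 16) -> int:
    return (input_res // patch) ** 2

base = vit_tokens(224)        # 14 x 14 = 196 tokens
hires = vit_tokens(224 * 4)   # 4x the side -> 56 x 56 = 3136 tokens
print(base, hires, hires / base)
```

And since self-attention is quadratic in the token count, the true cost increase for the backbone is even worse than 16×.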


22 of 28

Upsample features: two key considerations

  1. Loss:
    • Problem: there are no ground-truth high-res features!
    • Solution: pseudo-labeling using SAM masks + self-distillation (as in self-supervised learning)
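
The two loss ideas can be sketched under simple assumptions (plain mean-squared errors here; the actual LoftUp objectives may differ): a self-distillation term that pools the upsampled features back down to match the backbone's own low-res features, and a SAM-mask consistency term that pulls features within one mask toward their mean.

```python
import numpy as np

def self_distill_loss(hi_feats, lo_feats, scale=4):
    """Self-distillation: average-pool the (C, H, W) upsampled features back
    down by `scale` and match the backbone's original low-res features.
    Assumes H and W are divisible by `scale`. No ground truth needed."""
    C, H, W = hi_feats.shape
    pooled = hi_feats.reshape(C, H // scale, scale, W // scale, scale).mean(axis=(2, 4))
    return np.mean((pooled - lo_feats) ** 2)

def mask_consistency_loss(hi_feats, mask):
    """Pseudo-label term: features inside one SAM mask should be similar,
    so penalize their deviation from the mask's mean feature."""
    C = hi_feats.shape[0]
    inside = hi_feats.reshape(C, -1)[:, mask.reshape(-1)]   # (C, n_pixels_in_mask)
    return np.mean((inside - inside.mean(axis=1, keepdims=True)) ** 2)
```

Neither term requires annotated high-res features: one supervises with the backbone itself, the other with SAM's class-agnostic masks.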




24 of 28

Upsample features: two key considerations

2. Architecture:

    • Problem: Previous methods rely on multiple upsampling layers → slow and blurry (Recall: U-Nets)
    • Solution: a coordinate-based cross-attention model (Recall: Lecture 9, coordinate-based methods)
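
To make the idea concrete, here is a non-learned toy version of a coordinate-based upsampler: each output coordinate acts as a query that cross-attends over the low-res feature tokens, with attention weights computed from coordinate proximity. (In the real model, coordinates are embedded and Q/K/V are learned projections; everything here is illustrative.)

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def coord_upsample(lo_feats, out_h, out_w, temp=10.0):
    """Toy coordinate-based upsampling: each output (y, x) coordinate is a
    query that cross-attends over low-res feature tokens; attention logits
    are the negative squared distances between query and token coordinates."""
    C, h, w = lo_feats.shape
    tokens = lo_feats.reshape(C, h * w).T                    # (h*w, C) values
    ky, kx = np.meshgrid(np.linspace(0, 1, h), np.linspace(0, 1, w), indexing="ij")
    keys = np.stack([ky.ravel(), kx.ravel()], axis=1)        # (h*w, 2) token coords
    qy, qx = np.meshgrid(np.linspace(0, 1, out_h), np.linspace(0, 1, out_w), indexing="ij")
    queries = np.stack([qy.ravel(), qx.ravel()], axis=1)     # (out_h*out_w, 2)
    d2 = ((queries[:, None, :] - keys[None, :, :]) ** 2).sum(-1)
    attn = softmax(-temp * d2, axis=1)                       # rows sum to 1
    out = attn @ tokens                                      # (out_h*out_w, C)
    return out.T.reshape(C, out_h, out_w)
```

Because the queries are just coordinates, the same module handles any output size or upsampling scale in a single forward pass, which is where the flexibility listed on the next slide comes from.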


LoftUp architecture

25 of 28

Benefits of a coordinate-based model

  1. Flexible input sizes and upsampling scales
  2. Fast
  3. Great performance


26 of 28

Results


Multi-task enhancement

Interactive segmentation

27 of 28

Advertisement – Student Project call

Next for LoftUp:

  • We are very interested in how feature upsampling helps MLLMs, with evaluation on VQA benchmarks.
  • Estimated workload: 10 hours/week (min), 3-6 months
  • Expected outcome: this could potentially be integrated into LoftUpv2, or as a standalone workshop paper (technical report).


28 of 28

Summary

  • Detection → Open-world detection
    • GOOD: Introducing geometric cues for pseudo-labeling
  • Segmentation → Open-vocabulary segmentation
    • RENOVATE: renovate the names to improve both training and evaluation
  • Resolution problem in Dense Recognition → Feature Upsampling
    • LoftUp: Use pseudo-labeling and self-distillation to improve loss functions, and use coordinate-based networks to improve architecture.
