1 of 28

Computer Vision

Live Lecture 10 – Latest research in recognition

2 of 28

Structure

  1. Detection → Open-world detection
  2. Segmentation → Open-vocabulary segmentation
  3. Resolution problem in Dense Recognition → And how to fix it



4 of 28

Recall: Detection


5 of 28

Open-World Object Detection

What if we want to find as many objects as possible, regardless of their classes?

A simple solution: remove the classifier heads!
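
As a minimal sketch of this idea (names and dimensions are illustrative, not the actual OLN implementation, which scores localization quality such as centerness/IoU): replace the K-way classifier with a single class-agnostic objectness score, keeping only class-agnostic box regression.

```python
import numpy as np

rng = np.random.default_rng(0)

def class_agnostic_head(roi_feats, w_box, w_obj):
    """Detector head with the classifier removed: per-RoI box deltas plus a
    single objectness score in (0, 1), instead of K-way class logits."""
    boxes = roi_feats @ w_box                                  # (N, 4) box regression
    objectness = 1.0 / (1.0 + np.exp(-(roi_feats @ w_obj)))    # (N, 1) sigmoid score
    return boxes, objectness

roi_feats = rng.normal(size=(8, 256))                          # 8 pooled RoI features
boxes, scores = class_agnostic_head(
    roi_feats, rng.normal(size=(256, 4)) * 0.01, rng.normal(size=(256, 1)) * 0.01
)
```

Every proposal now gets an objectness score regardless of its class, which is exactly what lets the detector keep boxes for objects outside the training vocabulary.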


6 of 28

Open-World Object Detection (OLN)


7 of 28

Can we do better?

What “defines” an object?

More directly — what doesn’t change as much between different classes of objects?

Our answer: geometric cues!


8 of 28

How to incorporate geometric cues?

  1. We want to learn the concept of “objectness” from geometric cues.
  2. But we don’t want to require geometric cues at inference time – that would be inconvenient.

Our solution: pseudo-labeling, i.e., use geometric cues to generate additional bounding-box labels.
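
A sketch of the pseudo-labeling step (function names and thresholds are illustrative, not from the paper): take boxes proposed from geometric cues (e.g., depth or flow), keep the confident ones that are not already covered by annotated boxes, and add them to the training labels.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-9)

def add_pseudo_labels(gt_boxes, geo_boxes, geo_scores, score_thr=0.7, iou_thr=0.5):
    """Add confident geometry-derived boxes that are not already covered by GT."""
    extra = [b for b, s in zip(geo_boxes, geo_scores)
             if s >= score_thr
             and (len(gt_boxes) == 0 or iou(b, gt_boxes).max() < iou_thr)]
    return np.vstack([gt_boxes] + extra) if extra else gt_boxes
```

After this step the detector is trained on the enlarged label set only, so geometric cues are never needed at inference time.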


9 of 28

The performance gains are significant!


10 of 28

Structure

  • Detection → Open-world detection
  • Segmentation → Open-vocabulary segmentation
  • Resolution problem in Dense Recognition → And how to fix it


11 of 28

Recall: Segmentation


12 of 28

Open-vocabulary segmentation

Can we use natural language for segmentation?

  • Give the model an image and a list of candidate object names, and let it segment!
  • No pre-defined, fixed class list anymore!
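
A minimal sketch of the open-vocabulary classification step, assuming CLIP-style mask embeddings and text embeddings are already computed (all names and shapes here are illustrative): each mask is assigned the name whose text embedding is closest in cosine similarity.

```python
import numpy as np

def label_masks(mask_embs, text_embs, names):
    """Assign each predicted mask the best-matching name by cosine similarity
    between its embedding and the text embeddings of the candidate names."""
    m = mask_embs / np.linalg.norm(mask_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sim = m @ t.T                          # (num_masks, num_names)
    return [names[i] for i in sim.argmax(axis=1)]
```

Because the name list is just an input, it can be swapped at test time without retraining – the core property of open-vocabulary segmentation.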


13 of 28

Problem: What names are we using?

Common practice: use the class names in the current datasets.

Problem: they were annotated as class identifiers, not for vision-language tasks!


14 of 28

Our solution: RENOVATE the names!


15 of 28

Training the renaming model


Mask-Text Cross-Attention Alignment

16 of 28

Qualitative results


17 of 28

Quantitative results

Of course, the renovated names also improve evaluation – at higher accuracy and finer granularity.


18 of 28

Structure

  • Detection → Open-world detection
  • Segmentation → Open-vocabulary segmentation
  • Resolution problem in Dense Recognition → And how to fix it


19 of 28

The resolution problem

Most current foundation models accept input resolutions of only 224×224 px to 384×384 px. The feature maps are smaller still – roughly 14–16× smaller per side.
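
The arithmetic behind the "~14–16× smaller" claim, for ViT-style backbones where the feature grid is simply the input side divided by the patch size:

```python
# Feature grid size of a ViT-style backbone: input side / patch size.
def grid_size(input_res: int, patch: int) -> int:
    return input_res // patch

# 224 px with 16 px patches -> 14x14 features; 224/14 -> 16x16; 384/16 -> 24x24
for res, patch in [(224, 16), (224, 14), (384, 16)]:
    print(f"{res} px, patch {patch} -> {grid_size(res, patch)}x{grid_size(res, patch)} features")
```

So even at the largest common input size, a dense prediction head starts from only a 24×24 grid.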

Screenshot from OpenCLIP:

But dense recognition (e.g., segmentation, depth estimation) clearly benefits from higher resolution!


20 of 28

Recall: classical fix – sliding window
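
As a reminder, sliding-window inference can be sketched as follows (a simplified version: `predict` stands in for the low-res model, and we assume the stride tiles the image fully): run the model on overlapping crops and average the overlapping predictions back into a full-resolution map.

```python
import numpy as np

def sliding_window_predict(image, window=224, stride=112, predict=None):
    """Run `predict` on overlapping window-sized crops and average the
    overlapping per-pixel predictions into one full-size output map."""
    H, W = image.shape[:2]
    out = np.zeros((H, W))
    count = np.zeros((H, W))
    for y in range(0, max(H - window, 0) + 1, stride):
        for x in range(0, max(W - window, 0) + 1, stride):
            crop = image[y:y + window, x:x + window]
            out[y:y + window, x:x + window] += predict(crop)
            count[y:y + window, x:x + window] += 1
    return out / np.maximum(count, 1)
```

This works but is slow: the backbone runs once per window, and the number of windows grows quadratically with the image size.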


21 of 28

Alternative route: Can we upsample the features?

Why features, not images?

  • Cost-efficiency: obtaining 4× higher-resolution features by enlarging the input would cost roughly 16× more computation in the VFM, whereas feature upsampling adds nothing to the VFM backbone itself.
  • Plug-in replacement: no need to change the VFM backbones.
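
A back-of-the-envelope check of the cost argument: for a ViT, the number of tokens grows with the square of the input side, so matching a 4× feature upsampling by enlarging the input means 16× the tokens, and therefore at least 16× the backbone compute.

```python
# Tokens processed by a ViT backbone grow quadratically with the input side,
# while feature upsampling adds only a lightweight module on fixed tokens.
def vit_tokens(input_res: int, patch: int = 16) -> int:
    return (input_res // patch) ** 2

base = vit_tokens(224)        # 14 x 14 = 196 tokens
hires = vit_tokens(224 * 4)   # 4x the side -> 56 x 56 = 3136 tokens
print(base, hires, hires / base)
```

And since self-attention is quadratic in the token count, the true cost increase for the backbone is even worse than 16×.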


22 of 28

Upsample features: two key considerations

  1. Loss:
    • Problem: there are no ground-truth high-res features!
    • Solution: pseudo-labeling using SAM masks + self-distillation (as in self-supervised learning)
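
The two loss ideas can be sketched under simple assumptions (plain mean-squared errors here; the actual LoftUp objectives may differ): a self-distillation term that pools the upsampled features back down to match the backbone's own low-res features, and a SAM-mask consistency term that pulls features within one mask toward their mean.

```python
import numpy as np

def self_distill_loss(hi_feats, lo_feats, scale=4):
    """Self-distillation: average-pool the (C, H, W) upsampled features back
    down by `scale` and match the backbone's original low-res features.
    Assumes H and W are divisible by `scale`. No ground truth needed."""
    C, H, W = hi_feats.shape
    pooled = hi_feats.reshape(C, H // scale, scale, W // scale, scale).mean(axis=(2, 4))
    return np.mean((pooled - lo_feats) ** 2)

def mask_consistency_loss(hi_feats, mask):
    """Pseudo-label term: features inside one SAM mask should be similar,
    so penalize their deviation from the mask's mean feature."""
    C = hi_feats.shape[0]
    inside = hi_feats.reshape(C, -1)[:, mask.reshape(-1)]   # (C, n_pixels_in_mask)
    return np.mean((inside - inside.mean(axis=1, keepdims=True)) ** 2)
```

Neither term requires annotated high-res features: one supervises with the backbone itself, the other with SAM's class-agnostic masks.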




24 of 28

Upsample features: two key considerations

2. Architecture:

    • Problem: Previous methods rely on multiple upsampling layers → slow and blurry (Recall: U-Nets)
    • Solution: a coordinate-based cross-attention model (Recall: Lecture 9, coordinate-based methods)
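
To make the idea concrete, here is a non-learned toy version of a coordinate-based upsampler: each output coordinate acts as a query that cross-attends over the low-res feature tokens, with attention weights computed from coordinate proximity. (In the real model, coordinates are embedded and Q/K/V are learned projections; everything here is illustrative.)

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def coord_upsample(lo_feats, out_h, out_w, temp=10.0):
    """Toy coordinate-based upsampling: each output (y, x) coordinate is a
    query that cross-attends over low-res feature tokens; attention logits
    are the negative squared distances between query and token coordinates."""
    C, h, w = lo_feats.shape
    tokens = lo_feats.reshape(C, h * w).T                    # (h*w, C) values
    ky, kx = np.meshgrid(np.linspace(0, 1, h), np.linspace(0, 1, w), indexing="ij")
    keys = np.stack([ky.ravel(), kx.ravel()], axis=1)        # (h*w, 2) token coords
    qy, qx = np.meshgrid(np.linspace(0, 1, out_h), np.linspace(0, 1, out_w), indexing="ij")
    queries = np.stack([qy.ravel(), qx.ravel()], axis=1)     # (out_h*out_w, 2)
    d2 = ((queries[:, None, :] - keys[None, :, :]) ** 2).sum(-1)
    attn = softmax(-temp * d2, axis=1)                       # rows sum to 1
    out = attn @ tokens                                      # (out_h*out_w, C)
    return out.T.reshape(C, out_h, out_w)
```

Because the queries are just coordinates, the same module handles any output size or upsampling scale in a single forward pass, which is where the flexibility listed on the next slide comes from.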


LoftUp architecture

25 of 28

Benefits of a coordinate-based model

  1. Flexible input sizes and upsampling scales
  2. Fast
  3. Great performance


26 of 28

Results


Multi-task enhancement

Interactive segmentation

27 of 28

Advertisement – Student Project call

Next for LoftUp:

  • We are very interested in how feature upsampling helps MLLMs, with evaluation on VQA benchmarks.
  • Estimated workload: 10 hours/week (min), 3-6 months
  • Expected outcome: this could potentially be integrated into LoftUpv2, or as a standalone workshop paper (technical report).


28 of 28

Summary

  • Detection → Open-world detection
    • GOOD: Introducing geometric cues for pseudo-labeling
  • Segmentation → Open-vocabulary segmentation
    • RENOVATE: renovate the names to improve both training and evaluation
  • Resolution problem in Dense Recognition → Feature Upsampling
    • LoftUp: Use pseudo-labeling and self-distillation to improve loss functions, and use coordinate-based networks to improve architecture.
