1 of 11

CVPR 2024

MLV Lab

Korea University

Retrieval-Augmented Open-Vocabulary Object Detection

Jooyeon Kim¹,* Eulrang Cho²,* Sehyung Kim¹ Hyunwoo J. Kim¹

¹Korea University ²Samsung Research

2 of 11

Motivation

  • Open-Vocabulary Object Detection (OVD) aims to detect novel objects beyond the categories seen during training.

Vision-Language Model: Radford, Alec, et al. "Learning transferable visual models from natural language supervision." ICML, 2021.

Pseudo-labeling: Gao, Mingfei, et al. "Open vocabulary object detection with pseudo bounding-box labels." ECCV, 2022.

Utilize vocabulary sets more diversely

3 of 11

Motivation

Negative vocabularies (from the Vocabulary Set), example for ‘jaguar’:

  • Similar: cat
  • Dissimilar: bottle

Verbalized concepts (Vocabulary Set, LLM, Concepts):

  • alligator: sharp teeth, strong tail
  • iPod: handheld music player
  • sock: worn on the feet

Retrieval-Augmented Losses and visual Features (RALF)

4 of 11

RALF

Retrieval-Augmented Losses (RAL): retrieve negative vocabularies and augment loss function

Retrieval-Augmented visual Features (RAF): augment visual features using verbalized concepts

5 of 11

RALF

Retrieval-Augmented Losses (RAL): retrieve negative vocabularies and augment loss function

Figure: Training pipeline w/ RAL. Labels: Backbone; RPN; RoI Head; RAL; Ground-Truth Box; Ground-Truth Box Embedding; Ground-Truth Label; Text Encoder; Mean; Negative Retriever; Vocabulary Store.

Retrieved negative vocabularies for the ground-truth label ‘broccoli’:

  • Hard Negative: lettuce, avocado, green beans
  • Easy Negative: greylag, carillonneur, trouser press
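
The slide shows hard and easy negatives being retrieved from the vocabulary store for a ground-truth label, but it does not state the selection rule. The following is a minimal sketch under one assumption: words are ranked by cosine similarity to the label's text embedding, the most similar are treated as hard negatives, and the least similar as easy negatives. The function names, the value of `k`, and the 2-D toy embeddings are hypothetical and chosen only for illustration; a real system would use text-encoder embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def retrieve_negatives(label, store, k=2):
    """Return (hard, easy) negative vocabularies for `label`.

    `store` maps each vocabulary word to its text embedding.
    Assumption (not from the slides): hard negatives are the k most
    similar words, easy negatives are the k least similar words.
    """
    query = store[label]
    ranked = sorted(
        (word for word in store if word != label),
        key=lambda word: cosine(query, store[word]),
        reverse=True,
    )
    return ranked[:k], ranked[-k:]


# Toy 2-D embeddings, hand-picked for illustration only.
toy_store = {
    "broccoli": [1.0, 0.1],
    "lettuce": [0.9, 0.2],
    "avocado": [0.8, 0.3],
    "greylag": [0.1, 1.0],
    "trouser press": [0.0, 1.0],
}

hard, easy = retrieve_negatives("broccoli", toy_store, k=2)
print("hard:", hard)
print("easy:", easy)
```

With these toy vectors, the words whose embeddings point in nearly the same direction as ‘broccoli’ land in the hard list, and the near-orthogonal ones land in the easy list.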

6 of 11

RALF

Retrieval-Augmented visual Features (RAF): augment visual features using verbalized concepts

Figure: RAF training. Labels: Vocabulary Store; LLM; Extract Noun Chunks; Concept Store; Concept Retriever; Augmenter; RPN; Crop; Image Encoder; Offline; Generate Pseudo-label.

LLM prompt: Describe what a(n) {vocabulary} looks like.

Retrieved concepts and scores (example scores): 0.30273, 0.29639, 0.29541
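
The concept-retrieval step can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it assumes the retriever scores each entry of the concept store by cosine similarity against the image-encoder feature of a cropped region and returns the top-k concepts with their scores. `build_prompt`, `retrieve_concepts`, and the 3-D toy embeddings are hypothetical; only the prompt template is taken from the slide.

```python
import math

# Prompt template as shown on the slide.
PROMPT_TEMPLATE = "Describe what a(n) {vocabulary} looks like."


def build_prompt(vocabulary):
    """Fill the slide's prompt template for one vocabulary word."""
    return PROMPT_TEMPLATE.format(vocabulary=vocabulary)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def retrieve_concepts(region_feature, concept_store, k=3):
    """Return the k (concept, score) pairs most similar to the region.

    Assumption: scores are cosine similarities between the region
    feature and each concept embedding, sorted in descending order.
    """
    scored = [
        (concept, cosine(region_feature, embedding))
        for concept, embedding in concept_store.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-D embeddings, hand-picked for illustration only.
toy_concept_store = {
    "sharp teeth": [0.9, 0.1, 0.0],
    "strong tail": [0.8, 0.2, 0.1],
    "handheld music player": [0.0, 1.0, 0.1],
    "worn on the feet": [0.1, 0.0, 1.0],
}

region = [1.0, 0.1, 0.0]  # stand-in for an image-encoder feature
top = retrieve_concepts(region, toy_concept_store, k=2)
print(build_prompt("alligator"))
for concept, score in top:
    print(f"{concept}: {score:.5f}")
```

The retrieved (concept, score) pairs are what an augmenter would consume; how the augmenter combines them with the visual feature is not specified on the slide and is left out here.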

7 of 11

RALF

Retrieval-Augmented visual Features (RAF): augment visual features using verbalized concepts

Figure: Inference pipeline w/ RAF. Labels: Backbone + RPN; RoI Head; Crop; Image Encoder; Concept Retriever; Augmenter; Text Embeddings of … (three instances); Ensemble.
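
The slide labels an "Ensemble" step at inference but does not define it. As a hedged illustration only, the sketch below assumes the ensemble is a weighted average of two per-class probability vectors: one from the detector's RoI head and one computed from the concept-augmented feature. The weight `alpha`, the function names, and the toy logits are all hypothetical; the actual combination rule may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble(detector_logits, augmented_logits, alpha=0.5):
    """Weighted average of two class-probability vectors.

    Assumption (not from the slides): alpha weights the scores from
    the augmented feature, and (1 - alpha) weights the detector scores.
    """
    p_det = softmax(detector_logits)
    p_aug = softmax(augmented_logits)
    return [(1 - alpha) * d + alpha * a for d, a in zip(p_det, p_aug)]


# Toy logits over three classes, for illustration only.
scores = ensemble([2.0, 1.0, 0.0], [0.0, 3.0, 0.0], alpha=0.5)
predicted = max(range(len(scores)), key=scores.__getitem__)
print(scores, predicted)
```

In this toy case the detector alone prefers class 0, while the averaged scores prefer class 1, which shows how a second scoring branch can change the final prediction.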

8 of 11

Experiments

  • Main results

9 of 11

Experiments

  • Ablation studies

10 of 11

Experiments

  • Visualization

11 of 11

Conclusion

  • We propose Retrieval-Augmented Losses and visual Features (RALF), which retrieves information from a large vocabulary set and uses it to augment both the losses and the visual features.

  • To optimize the detector, we add Retrieval-Augmented Losses (RAL), which retrieve hard and easy negative vocabularies from the pre-defined vocabulary store and reflect their semantic similarity to the ground-truth label.

  • Retrieval-Augmented visual Features (RAF) augment visual features with concepts generated by a large language model, improving generalizability.

  • RALF plugs easily into various detectors and significantly improves detection on both base and novel categories.