1 of 11

CVPR 2024

MLV Lab

Korea University

Retrieval-Augmented Open-Vocabulary Object Detection

Jooyeon Kim¹,* Eulrang Cho²,* Sehyung Kim¹ Hyunwoo J. Kim¹

¹Korea University ²Samsung Research

2 of 11

Motivation

Open-Vocabulary Object Detection (OVD) aims to detect novel objects beyond the categories seen during training.

Vision-Language Model: Radford, Alec, et al. "Learning transferable visual models from natural language supervision." ICML, 2021.

Pseudo-labeling: Gao, Mingfei, et al. "Open vocabulary object detection with pseudo bounding-box labels." ECCV, 2022.

Utilize vocabulary sets more diversely

3 of 11

Motivation

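Figure: two ways of using the vocabulary set.

Negative vocabularies: for the category ‘jaguar’, the Vocabulary Set yields a similar word (cat) and a dissimilar word (bottle).

Verbalized concepts: an LLM maps Vocabulary Set words (alligator, iPod, sock) to Concepts (sharp teeth, strong tail, handheld music player, worn on the feet).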

Retrieval-Augmented Losses and visual Features (RALF)

We propose two approaches to more effectively utilize vocabulary sets to enhance generalization to novel categories.

First, we leverage negative vocabularies. [click] Given a category like "jaguar", [click] we retrieve similar and dissimilar words from the vocabulary set based on semantic relationships.

For example, a similar word like [click] "cat" is considered a hard negative, while a dissimilar word like [click] "bottle" is an easy negative. This allows us to use the relationships with various negative words.
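As a rough illustration of the retrieval step described above, here is a minimal sketch that ranks vocabulary words by cosine similarity to the query category and takes the most similar as hard negatives and the least similar as easy negatives. The function names and the toy 3-d embeddings are made up for this example; an actual system would presumably use embeddings from a text encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def retrieve_negatives(query, store, k=1):
    """Return (hard, easy): the k most and k least similar words to `query`."""
    ranked = sorted(
        (word for word in store if word != query),
        key=lambda word: cosine(store[query], store[word]),
        reverse=True,
    )
    return ranked[:k], ranked[-k:]


# Toy embeddings, invented for illustration only (assumption, not real data).
toy_store = {
    "jaguar": [0.9, 0.1, 0.0],
    "cat": [0.8, 0.2, 0.1],
    "dog": [0.6, 0.4, 0.1],
    "bottle": [0.0, 0.1, 0.9],
}

hard, easy = retrieve_negatives("jaguar", toy_store, k=1)
print("hard negatives:", hard)
print("easy negatives:", easy)
```

With these toy vectors, "cat" lands closest to "jaguar" and "bottle" farthest, which mirrors the hard/easy split in the example above.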

Second, we use verbalized concepts. [click] By employing a Large Language Model (LLM), [click] we obtain descriptions of the words in the vocabulary set.

With these descriptions, [click] we can extract verbalized concepts that provide details, such as attributes.
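The flow from vocabulary words to verbalized concepts could be sketched as follows. The prompt template is the one shown on the RAF slide; everything else is a hypothetical stand-in: `CANNED_REPLIES` replaces a real LLM call, and splitting on commas replaces proper noun-chunk extraction.

```python
PROMPT_TEMPLATE = "Describe what a(n) {vocabulary} looks like."

# Hypothetical canned replies standing in for a real LLM (assumption).
CANNED_REPLIES = {
    "alligator": "sharp teeth, strong tail",
    "iPod": "handheld music player",
    "sock": "worn on the feet",
}


def describe(word):
    """Build the prompt for `word` and return (prompt, description)."""
    prompt = PROMPT_TEMPLATE.format(vocabulary=word)
    # A real pipeline would send `prompt` to an LLM; here we look up a canned reply.
    return prompt, CANNED_REPLIES[word]


def extract_concepts(description):
    """Crude stand-in for noun-chunk extraction: split the text on commas."""
    return [part.strip() for part in description.split(",") if part.strip()]


def build_concept_store(vocabulary):
    """Collect the unique concepts for every word, preserving first-seen order."""
    store = []
    for word in vocabulary:
        _, description = describe(word)
        for concept in extract_concepts(description):
            if concept not in store:
                store.append(concept)
    return store


concept_store = build_concept_store(["alligator", "iPod", "sock"])
print(concept_store)
```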

Using these two approaches, we have developed [click] RALF, a method that augments both the loss function and visual features.

4 of 11

RALF

Retrieval-Augmented Losses (RAL): retrieve negative vocabularies and augment loss function

Retrieval-Augmented visual Features (RAF): augment visual features using verbalized concepts

5 of 11

RALF

Retrieval-Augmented Losses (RAL): retrieve negative vocabularies and augment loss function

Figure: Training pipeline w/ RAL.

Components: Ground-Truth Box, Backbone, RPN, RoI Head, Ground-Truth Box Embedding, Mean, Ground-Truth Label (‘broccoli’), Vocabulary Store, Negative Retriever, Text Encoder, RAL.

Retrieved negative vocabularies for ‘broccoli’:

Hard Negative: lettuce, avocado, green beans

Easy Negative: greylag, carillonneur, trouser press
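The slides do not give the form of the loss, so the following is only one plausible shape for it: a margin-based, triplet-style hinge that asks the ground-truth box embedding to be closer to its label's text embedding than to each retrieved negative. The function name, the margin value, and the toy 2-d embeddings are all assumptions made for this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def triplet_style_loss(box_emb, gt_text_emb, neg_text_embs, margin):
    """Mean hinge loss over negatives (assumed form, not taken from the slides).

    Each term is zero once the box embedding is at least `margin` more
    similar to the ground-truth label than to that negative.
    """
    pos = cosine(box_emb, gt_text_emb)
    terms = [max(0.0, margin - pos + cosine(box_emb, neg)) for neg in neg_text_embs]
    return sum(terms) / len(terms)


# Toy 2-d embeddings, invented for illustration only.
box = [1.0, 0.0]          # ground-truth box embedding
label = [1.0, 0.0]        # text embedding of the ground-truth label
hard_negs = [[0.8, 0.6]]  # similar word, cosine 0.8 to the box
easy_negs = [[0.0, 1.0]]  # dissimilar word, cosine 0.0 to the box

hard_loss = triplet_style_loss(box, label, hard_negs, margin=0.3)
easy_loss = triplet_style_loss(box, label, easy_negs, margin=0.3)
print(hard_loss, easy_loss)
```

In this toy setup the hard negative still incurs a penalty while the easy negative already satisfies the margin, which is the intuition behind treating the two groups separately.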

6 of 11

RALF

Retrieval-Augmented visual Features (RAF): augment visual features using verbalized concepts

Figure: RAF training.

Offline: the LLM is prompted with "Describe what a(n) {vocabulary} looks like." for words in the Vocabulary Store; noun chunks are extracted from the descriptions to build the Concept Store.

Generate Pseudo-label: RPN, Crop, Image Encoder, Concept Retriever, Augmenter.

Retrieved concepts and scores (example scores shown: 0.30273, 0.29639, 0.29541).
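A minimal sketch of the retrieve-then-augment idea: score every concept in the store against a region embedding, keep the top-k, and fold them back into the visual feature. The slides do not describe the Augmenter's internals, so a score-weighted sum is used here purely as a placeholder; the function names and toy 2-d embeddings are likewise assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def retrieve_concepts(region_emb, concept_store, k):
    """Return the k (concept, score) pairs most similar to the region embedding."""
    scored = [(name, cosine(region_emb, emb)) for name, emb in concept_store.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def augment(region_emb, retrieved, concept_store):
    """Placeholder augmenter (assumption): add score-weighted concept embeddings."""
    out = list(region_emb)
    for name, score in retrieved:
        for i, value in enumerate(concept_store[name]):
            out[i] += score * value
    return out


# Toy concept embeddings, invented for illustration only.
toy_concepts = {
    "sharp teeth": [1.0, 0.0],
    "strong tail": [0.6, 0.8],
    "worn on the feet": [0.0, 1.0],
}
region = [1.0, 0.0]  # stand-in for the image-encoder embedding of a cropped region

retrieved = retrieve_concepts(region, toy_concepts, k=2)
augmented = augment(region, retrieved, toy_concepts)
print(retrieved)
print(augmented)
```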

7 of 11

RALF

Retrieval-Augmented visual Features (RAF): augment visual features using verbalized concepts

Figure: Inference pipeline w/ RAF.

Components: Backbone + RPN, RoI Head, Crop, Image Encoder, Concept Retriever, Augmenter, "Text Embeddings of" (label as shown), Ensemble.

8 of 11

Experiments

Main results

9 of 11

Experiments

Ablation studies

10 of 11

Experiments

Visualization

11 of 11

Conclusion

We propose Retrieval-Augmented Losses and visual Features (RALF), which retrieves information from a large vocabulary set and uses it to augment losses and visual features.

To optimize the detector, we add Retrieval-Augmented Losses (RAL), which retrieve hard and easy negative vocabularies from the pre-defined vocabulary store and reflect their semantic similarity to the ground-truth label.

Retrieval-Augmented visual Features (RAF) augments visual features with concepts generated by a large language model, improving generalizability.

RALF plugs easily into various detectors and significantly improves detection on both base and novel categories.