1 of 5

Seesaw: a system for bootstrapping image searches

Oscar Moll (orm@csail.mit.edu), Manuel Favela, Sam Madden, Vijay Gadepally

2 of 5

Problem: ad-hoc searches on image datasets, without perfect models.

Searching through your own image databases is a basic building block for many downstream tasks.

A common approach to searching images is semantic embeddings, such as CLIP.

Pre-trained embeddings alone are insufficient because their accuracy varies widely across queries and datasets.

As a result, searches can be time-consuming or virtually impossible.

Example: searching for cars with open doors
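Concretely, embedding-based search reduces to a nearest-neighbor lookup in a shared text/image vector space. A minimal sketch, assuming CLIP (or similar) embeddings have already been computed, with random vectors as stand-ins for real embeddings:

```python
import numpy as np

def search(text_vec, image_vecs, k=5):
    """Rank images by cosine similarity to a text embedding."""
    t = text_vec / np.linalg.norm(text_vec)
    ims = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    scores = ims @ t
    return np.argsort(-scores)[:k]

# stand-ins for precomputed CLIP embeddings of a query and 100 images
rng = np.random.default_rng(0)
image_vecs = rng.normal(size=(100, 512))
text_vec = rng.normal(size=512)
top = search(text_vec, image_vecs, k=5)
```

The gap this poster addresses is that, for hard queries, the top of this ranked list contains few true matches, so the user must refine the query interactively.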

3 of 5

Seesaw merges text search with region-based feedback

Stage 1: Preprocessing. CLIP embeds every image in the dataset (Im0, Im1, Im2, …, ImN).

Stage 2: Querying starts with natural language, e.g. "Open car door", embedded with CLIP.

Stage 3: Region-based feedback from the user.

Stage 4: Query vector optimization.
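Stage 4 can be illustrated with a classic relevance-feedback update. This is a hedged sketch: a Rocchio-style rule that pulls the query vector toward positively labeled region embeddings and away from negatives; Seesaw's actual optimization may differ.

```python
import numpy as np

def refine_query(q, pos, neg, alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio-style update: move the query vector toward the mean of
    positive region embeddings and away from the mean of negatives,
    then renormalize to unit length."""
    q_new = alpha * np.asarray(q, dtype=float)
    if len(pos):
        q_new = q_new + beta * np.mean(pos, axis=0)
    if len(neg):
        q_new = q_new - gamma * np.mean(neg, axis=0)
    return q_new / np.linalg.norm(q_new)

# toy 2-d example: one positive region, no negatives
q = np.array([1.0, 0.0])
q_refined = refine_query(q, pos=[[0.0, 1.0]], neg=[])
```

Each feedback round repeats this loop: re-rank with the refined vector, collect more region labels, refine again.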

4 of 5

Region-based feedback necessitates region-based indexing
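A hypothetical sketch of such an index: split each image into a grid of patches, embed each patch, and keep metadata mapping every region vector back to its source image. The `embed` function below is a stand-in for a CLIP image encoder, and Seesaw's actual region scheme may differ.

```python
import numpy as np

def grid_regions(h, w, grid=3):
    """Yield (top, left, bottom, right) boxes covering a grid x grid split."""
    for i in range(grid):
        for j in range(grid):
            yield (i * h // grid, j * w // grid,
                   (i + 1) * h // grid, (j + 1) * w // grid)

def build_region_index(images, embed, grid=3):
    """Embed every region of every image, keeping (image_id, box) metadata
    so a matching region can be traced back to its source image."""
    vecs, meta = [], []
    for img_id, img in enumerate(images):
        h, w = img.shape[:2]
        for t, l, b, r in grid_regions(h, w, grid):
            vecs.append(embed(img[t:b, l:r]))
            meta.append((img_id, (t, l, b, r)))
    return np.stack(vecs), meta

# stand-in embedder: mean intensity and patch area
# (a real system would use a CLIP image encoder here)
toy_embed = lambda patch: np.array([patch.mean(), float(patch.size)])
images = [np.ones((9, 9, 3)), np.zeros((9, 9, 3))]
vecs, meta = build_region_index(images, toy_embed)
```

Because user labels apply to regions, both search and feedback operate on these region vectors rather than on one vector per image.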

5 of 5

User study

  • Task: find 10 examples of the target concept (across multiple queries)
  • Users consistently completed the task faster using Seesaw than with CLIP alone, sometimes by substantial margins.

Benchmark

  • Comprehensive benchmark of 1.4k queries comparing retrieval accuracy (NDCG)
  • NDCG improves on more than 1k of the queries and drops on only 40.
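For reference, NDCG (normalized discounted cumulative gain) scores a ranking by how early it places relevant results, normalized by the best possible ordering of the same labels. A minimal implementation:

```python
import numpy as np

def ndcg(relevances, k=None):
    """NDCG@k: DCG of the given ranking divided by the DCG of the
    ideal (sorted) ranking of the same relevance labels."""
    rel = np.asarray(relevances, dtype=float)
    k = len(rel) if k is None else min(k, len(rel))
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = (rel[:k] * discounts).sum()
    ideal = (np.sort(rel)[::-1][:k] * discounts).sum()
    return dcg / ideal if ideal > 0 else 0.0
```

A perfect ranking scores 1.0; pushing relevant items lower in the list lowers the score.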