1 of 11

Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions

Prajwal Gatti¹, Kshitij Gopal Parikh¹, Dhriti Prasanna Paul¹

Manish Gupta², Anand Mishra¹

{pgatti, parikh.2, paul.4, mishra}@iitj.ac.in, gmanish@microsoft.com

¹Indian Institute of Technology, Jodhpur; ²Microsoft


2 of 11

Problem of Complex Queries

  • How do you search for this animal when you cannot name it? Is it a chipmunk, badger, weasel, mongoose, or skunk?

  • Complex queries
    • “difficult-to-name but easy-to-draw” objects.
    • “difficult-to-sketch but easy-to-verbalize” object attributes or interactions with the scene.
    • Query: “numbat digging in the ground”

Prajwal Gatti, Kshitij Parikh, Dhriti Paul, Manish Gupta, Anand Mishra. Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions. AAAI 2024.


3 of 11

Related Work

  • Sketch-Based Image Retrieval (SBIR)
    • Methods: CNNs, transformer-based methods, deep Siamese models with triplet loss
    • Specialized forms: Zero-Shot SBIR, Fine-grained SBIR, Category-level SBIR
  • Text-Based Image Retrieval (TBIR)
    • Alignment of (query text, images) using VisualBERT, ViLT
    • Cross-attention-based models
    • Object tags in images
    • Contrastive learning methods, zero-shot learning methods


  • Multimodal Query Based Image Retrieval
    • Reference images and category text for image retrieval.
    • Speech and mouse traces as the query
    • Detailed sketch and text input
      • E-commerce product images using CNNs and LSTMs
      • Scene images using CLIP


4 of 11

CSTBIR Problem and Dataset

  • Given: a hand-drawn sketch S, a complementary text T and a database D of N natural scene images with multiple objects
  • Rank: N images according to relevance to composite ⟨S, T⟩ query.
  • Natural images and text descriptions from Visual Genome.
  • Sketches from Quick, Draw!
  • Train (∼1.89M queries, ∼97K images)
  • Validation (∼97K queries, ∼5K images)
  • Test
    • Test-1K: 1K queries, 1K images
    • Test-5K: 4K queries, 5K images
    • Open-Category set: 750 queries, 70 objects, 1K images.
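The ranking task stated above can be sketched as scoring every database image against one query vector and sorting. This is only a minimal illustration: the 2-D embeddings, the cosine scoring, and the names `cosine` and `rank_images` are assumptions made for the example, not details taken from the slides.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_images(query_vec, image_vecs):
    """Return image indices sorted from most to least similar to the query."""
    scores = [cosine(query_vec, v) for v in image_vecs]
    return sorted(range(len(image_vecs)), key=lambda i: scores[i], reverse=True)

# Toy database D with N = 3 images (hypothetical 2-D embeddings).
database = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
# Hypothetical embedding of one composite <S, T> query.
query = [0.0, 2.0]
ranking = rank_images(query, database)
```

In this toy setup the third image points in the same direction as the query, so it is ranked first.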


markhor, bodhran, and penny-farthing

5 of 11

Sketches


6 of 11

STNet Model for CSTBIR


7 of 11

Image retrieval results on Test-1K (T1K) and Test-5K (T5K)

  • Sketch-based Image Retrieval (SBIR)
    • Doodle2Search and DeepSBIR
    • ViT-based Siamese: 2 ImageNet pre-trained ViT encoders for sketch and image modalities trained using InfoNCE
  • Text-based Image Retrieval (TBIR)
    • VisualBERT, ViLT, CLIP
  • Composite Query-based Image Retrieval
    • TIRG and Taskformer
    • 2-stage
      • ViT trained for sketch classification to get an object name
      • Insert(object name, incomplete text query) and use pretrained CLIP
    • 2-stage (desc): Insert(object description, incomplete text query)
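For readers unfamiliar with the InfoNCE objective mentioned for the ViT-based Siamese baseline, the following is a minimal sketch of its standard one-directional form over a batch of (sketch, image) pairs. The temperature value, the toy similarity matrices, and the function name `info_nce` are assumptions for illustration; the slides do not give the baseline's actual training configuration.

```python
import math

def info_nce(sim, temperature=0.1):
    """Mean InfoNCE loss over a square similarity matrix.

    sim[i][j] is the similarity between sketch i and image j;
    the matching (positive) pairs sit on the diagonal.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]  # -log softmax of the positive pair
    return total / len(sim)

# Toy similarity matrices: positives score highest vs. positives score lowest.
aligned = [[1.0, 0.0], [0.0, 1.0]]
confused = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = info_nce(aligned)
loss_confused = info_nce(confused)
```

The loss is small when each sketch is most similar to its own image and large when it is more similar to another image in the batch.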


  • sketch+text > text-only > sketch-only
  • STNet > 2-stage
    • Incomplete semantics in object name
    • Ambiguous objects: mouse, bat, star
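One way to picture the ambiguity problem of the 2-stage baseline is the toy sketch below. Everything in it is hypothetical: the `[OBJ]` placeholder, the lookup-table stand-in for the sketch classifier, and the example query are assumptions made for illustration, since the slides do not specify how the incomplete text query marks the missing object.

```python
# Hypothetical placeholder marking the missing object name in the text query.
PLACEHOLDER = "[OBJ]"

def insert(object_name, incomplete_query):
    """Fill the placeholder in the incomplete text query with the object name."""
    return incomplete_query.replace(PLACEHOLDER, object_name)

def classify_sketch(sketch_id):
    """Stand-in for stage 1 (a sketch classifier); a lookup table, not a real model."""
    labels = {"sketch_a": "mouse", "sketch_b": "mouse"}
    return labels[sketch_id]

# Two different drawings (say, an animal and a computer device)
# collapse to the same word once reduced to an object name.
q_a = insert(classify_sketch("sketch_a"), "[OBJ] on a table")
q_b = insert(classify_sketch("sketch_b"), "[OBJ] on a table")
```

Both sketches yield the identical text query, so whatever the drawing itself conveyed is no longer available to the text-only retrieval stage.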


8 of 11


9 of 11

Further Experiments and Results


  • Results on Open-Category Test Set
    • Open-Category setting is difficult.
    • STNet is more robust to this complex setting.


10 of 11

capybara, sitar, penny-farthing, and okapi.


11 of 11

Conclusion

  • CSTBIR (Composite Sketch+Text Based Image Retrieval)
    • New dataset: ∼2M queries and ∼108K natural scene images.
    • STNet (Sketch+Text Network)
      • Pretrained multimodal transformer
      • Uses a hand-drawn sketch to localize relevant objects in the natural scene image
      • Encodes the text and image to perform image retrieval
      • Losses: contrastive loss, object classification loss, sketch-guided object detection loss, and sketch reconstruction loss
  • Applications: search for missing people, search for a product in digital catalogs, …
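The four STNet loss terms listed above are typically combined into a single training objective. The sketch below assumes a simple weighted sum; the weights, the scalar inputs, and the function name `total_loss` are illustrative assumptions, since the slides do not say how the terms are balanced.

```python
def total_loss(contrastive, classification, detection, reconstruction,
               weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four loss terms; the weights are illustrative assumptions."""
    terms = (contrastive, classification, detection, reconstruction)
    return sum(w * t for w, t in zip(weights, terms))

# Toy per-batch loss values, chosen only to make the arithmetic easy to follow.
loss = total_loss(0.5, 0.25, 0.125, 0.125)
```

Setting a weight to zero removes the corresponding term, which is one simple way to check how much each loss contributes.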
