CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion
Image-to-image retrieval
Query: image
Search item: image(s)
Pros:
Cons:
“a solid black shirt having long sleeves”
Image-to-text retrieval (cross-modal retrieval)
Query: text
Search item: image(s)
Pros:
Cons:
+
“with v neck”
Composed Image Retrieval (CIR)
Query: composition of image & additional condition(s)
Search item: image(s)
Pros:
Challenges:
Previous approaches to solve CIR (with text condition)
“with v neck”
Source Image
Text condition
Target Image
Main idea: training a model with triplets of (source image, text condition, target image)
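A generic sketch of this triplet-based training idea: a fusion module composes the source-image and text features into a query feature, and a contrastive loss pulls it toward the target-image feature. This is not the exact ARTEMIS or Combiner objective; the MLP composer, the 512-d feature size, and the InfoNCE loss are illustrative placeholders.

```python
import torch
import torch.nn.functional as F
from torch import nn

class Composer(nn.Module):
    """Toy fusion module: combine a source-image feature and a text feature
    into a single composed query feature (stand-in for ARTEMIS/Combiner-style models)."""
    def __init__(self, dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim * 2, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, img_feat, txt_feat):
        return self.mlp(torch.cat([img_feat, txt_feat], dim=-1))

def triplet_contrastive_loss(query, target, temperature=0.07):
    """InfoNCE over the batch: each composed query should match its own target image."""
    query = F.normalize(query, dim=-1)
    target = F.normalize(target, dim=-1)
    logits = query @ target.T / temperature
    labels = torch.arange(query.size(0))
    return F.cross_entropy(logits, labels)

# Toy training step with random features standing in for frozen encoder outputs.
composer = Composer()
src, txt, tgt = (torch.randn(8, 512) for _ in range(3))
loss = triplet_contrastive_loss(composer(src, txt), tgt)
loss.backward()
print(loss.item())
```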
Previous approaches to solve CIR (with text condition)
“ARTEMIS: Attention-based retrieval with text-explicit matching and implicit similarity.”, Delmas, Ginger, et al., ICLR 2022
Previous approaches to solve CIR (with text condition)
"Effective conditioned and composed image retrieval combining clip-based features.", Baldrati, Alberto, et al., CVPR 2022
Our motivation: CIR with various conditions in real-world retrieval scenarios (e.g., zero-shot CIR)
Current CIR methods can only handle this part
Image-to-image retrieval
Query
Retrieved items
Composed Image Retrieval (conventional approach)
+ “With cherry blossom”
Query
Retrieved items
Composed Image Retrieval with negative condition
Query
Retrieved items
+ “With cherry blossom”
- “France”
Composed Image Retrieval with mask condition
Query
Retrieved items
+ “With cherry blossom”
Challenges with the existing CIR methods
"Image retrieval on real-life images with pre-trained vision-and-language models.", Liu, Zheyuan, et al., ICCV 2021
"Fashion iq: A new dataset towards retrieving images by natural language feedback.", Wu, Hui, et al. CVPR 2021
Data collection process for image-to-image retrieval
Collecting images for image classification or i2i retrieval is relatively easy and inexpensive
“Eiffel tower”
Data collection process for CIR
Collecting triplets for CIR is expensive and labor-intensive, because the common process for constructing a CIR dataset is:
With cherry blossom
Add balloons
preprint
CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion
Geonmo Gu1*, Sanghyuk Chun2*, Wonjae Kim2, HeeJae Jun1, Yoohoon Kang1, Sangdoo Yun2
1 NAVER Vision
2 NAVER AI Lab
* Equal contribution
Key idea: think CIR as a conditioned image editing task
CIR can be viewed as conditioned image editing with various conditions: after editing the given image, we can retrieve similar images with an existing image-to-image retrieval model.
“with v neck”
Conditioned image editing model
Image-to-image retrieval model
Edited image
Source image
Retrieval database
Retrieved image
Key idea: think CIR as a conditioned image editing task
Source image
“as 4k image”
“as 4k image”
- “pink rabbit”
Conditions
Generated image
Top-1 retrieved item
(from LAION-2B)
Preliminary: InstructPix2Pix
"InstructPix2Pix: Learning to follow image editing instructions.", Brooks, Tim, et. al, CVPR 2023
Preliminary: InstructPix2Pix
The paired-image generation step (b) is based on Prompt-to-Prompt (a method for preserving the identity of an image across edits)
"Prompt-to-prompt image editing with cross attention control.", Hertz, Amir, et al., ICLR 2023
Problems of directly using InstructPix2Pix for CIR
“with v neck”
Edited image
Source image
Retrieval database
Retrieved image
Why don’t we just use InstructPix2Pix for this phase, and perform retrieval using a pre-trained image encoder?
Issue 1) InstructPix2Pix outputs are “generated” images, so they can be out-of-distribution inputs for the pre-trained image encoder.
Issue 2) InstructPix2Pix is trained on only 452k triplets, and most of its images are designed for “image editing”; this approach therefore generalizes poorly to more generic image domains.
Issue 3) This framework requires an expensive additional image editing step before retrieval. A search engine needs a much faster approach.
Issue 4) InstructPix2Pix can only handle an image + a single text instruction, while we need more varied and versatile conditions.
Overview of latent diffusion-based CIR framework
Source image
Extracted image feature
Frozen pre-trained image encoder
Conditioned feature-to-feature transform model
“with v neck”
Transformed image feature
Image-to-image retrieval
CLIP encoder
Latent Diffusion
* Note that we can also generate images via un-CLIP using only the latent diffusion framework.
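A minimal sketch of this retrieval flow, not the released CompoDiff implementation: a toy denoiser stands in for the conditioned feature-to-feature transform model operating on frozen CLIP features, followed by plain cosine-similarity retrieval. The module design, the 768-d feature size, and the crude denoising loop are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torch import nn

CLIP_DIM = 768  # illustrative feature size

class FeatureDenoiser(nn.Module):
    """Toy stand-in for the conditioned feature-to-feature transform model."""
    def __init__(self, dim=CLIP_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim * 3, dim * 2), nn.GELU(),
                                 nn.Linear(dim * 2, dim))

    def forward(self, noisy_feat, source_feat, text_feat):
        # Predict the denoised target image feature from the noisy latent,
        # the frozen source-image feature, and the CLIP text condition.
        return self.net(torch.cat([noisy_feat, source_feat, text_feat], dim=-1))

@torch.no_grad()
def compose_and_retrieve(source_feat, text_feat, gallery_feats, denoiser, steps=10):
    """Denoise a random CLIP-space latent into the transformed image feature,
    then run plain image-to-image retrieval against the gallery."""
    z = torch.randn_like(source_feat)
    for _ in range(steps):                       # crude toy loop; the real model uses
        z = denoiser(z, source_feat, text_feat)  # proper diffusion sampling steps
    z = F.normalize(z, dim=-1)
    sims = F.normalize(gallery_feats, dim=-1) @ z.T   # cosine-similarity search
    return sims.squeeze(-1).argsort(descending=True)  # ranked gallery indices

# Toy usage with random vectors standing in for frozen CLIP features.
denoiser = FeatureDenoiser()
source = torch.randn(1, CLIP_DIM)      # frozen CLIP feature of the source image
text = torch.randn(1, CLIP_DIM)        # CLIP text feature of "with v neck"
gallery = torch.randn(1000, CLIP_DIM)  # pre-extracted features of the retrieval DB
print(compose_and_retrieve(source, text, gallery, denoiser)[:5])
```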
Advantages of our new CIR framework
Scaling up the InstructPix2Pix 452k triplets to 18M triplets
How to scale up InstructPix2Pix?
"InstructPix2Pix: Learning to follow image editing instructions.", Brooks, Tim, et. al, CVPR 2023
Generating massive diverse caption triplets by templates
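As a toy illustration of the template idea (the real SynthTriplet18M generation uses a far larger keyword/template pool plus LLM-generated captions; the templates and keyword pairs below are made up):

```python
import random

# Illustrative instruction templates and keyword swaps only.
TEMPLATES = [
    "change the {src} to a {dst}",
    "replace the {src} with a {dst}",
    "make the {src} a {dst}",
]
KEYWORD_SWAPS = [("cat", "dog"), ("car", "bicycle"), ("shirt", "jacket")]

def make_caption_triplet(caption: str):
    """Turn one source caption into a (source caption, instruction, target caption)
    triplet by swapping a keyword and filling an instruction template."""
    for src, dst in KEYWORD_SWAPS:
        if src in caption:
            instruction = random.choice(TEMPLATES).format(src=src, dst=dst)
            return caption, instruction, caption.replace(src, dst)
    return None  # caption contains no swappable keyword

print(make_caption_triplet("a photo of a cat sleeping on the sofa"))
# e.g. ('a photo of a cat sleeping on the sofa', 'change the cat to a dog',
#       'a photo of a dog sleeping on the sofa')
```

The caption triplets are then rendered into image triplets with Prompt-to-Prompt-style generation, as in InstructPix2Pix.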
Model-based filtering
We apply CLIP-based filtering: extract image/text features for the source and edited images/captions, and filter using the following similarities: (I_s, I_e), (I_s, T_s), (I_e, T_e), (I_s, T_e), (I_e, T_s)
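A sketch of what such a filter can look like, assuming the five CLIP features are already extracted; the exact rule and the thresholds below are placeholders, not the SynthTriplet18M values.

```python
import torch
import torch.nn.functional as F

def keep_triplet(img_src, img_edit, txt_src, txt_edit,
                 min_img_sim=0.70, min_txt_sim=0.20, min_gap=0.05):
    """Decide whether to keep a generated triplet from its CLIP features.

    img_src/img_edit: CLIP image features of the source / edited image (I_s, I_e)
    txt_src/txt_edit: CLIP text features of the source / edited caption (T_s, T_e)
    Thresholds are illustrative placeholders.
    """
    def cos(a, b):
        return F.cosine_similarity(a, b, dim=-1).item()

    sim = {
        "(I_s, I_e)": cos(img_src, img_edit),   # the edit should not destroy the image
        "(I_s, T_s)": cos(img_src, txt_src),    # source image matches its caption
        "(I_e, T_e)": cos(img_edit, txt_edit),  # edited image matches the edited caption
        "(I_s, T_e)": cos(img_src, txt_edit),   # cross terms: the edit should move the
        "(I_e, T_s)": cos(img_edit, txt_src),   # image toward T_e and away from T_s
    }
    return (sim["(I_s, I_e)"] >= min_img_sim
            and sim["(I_s, T_s)"] >= min_txt_sim
            and sim["(I_e, T_e)"] >= min_txt_sim
            and sim["(I_e, T_e)"] - sim["(I_e, T_s)"] >= min_gap
            and sim["(I_s, T_s)"] - sim["(I_s, T_e)"] >= min_gap)

# Toy usage with random vectors standing in for CLIP features.
print(keep_triplet(*[torch.randn(512) for _ in range(4)]))
```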
Failure cases
Make the lipstick green
Make it a photograph
GPUs | 128 × A100
Elapsed time per triplet | 2.1 sec
# of generated triplets | 60,000,000
# of filtered triplets | 18,800,000
Total elapsed | 11.39 days
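Sanity check on the numbers above: 60,000,000 triplets × 2.1 sec ÷ 128 GPUs ≈ 984,000 sec ≈ 11.4 days, which matches the reported total elapsed time.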
InstructPix2Pix452k vs. SynthTriplet18M
| InstructPix2Pix452k | SynthTriplet18M
Training dataset for LLM | 700 captions sampled from LAION-Aesthetics & human instructions | 452k triplets of InstructPix2Pix
Base image-caption dataset for generating the training dataset | LAION-Aesthetics | COYO 700M, StableDiffusion Prompts, LAION-2B-en-aesthetic, LAION-COCO
Examples from SynthTriplet18M
Recap: Challenges with the existing CIR methods
"Image retrieval on real-life images with pre-trained vision-and-language models.", Liu, Zheyuan, et al., ICCV 2021
"Fashion iq: A new dataset towards retrieving images by natural language feedback.", Wu, Hui, et al. CVPR 2021
SynthTriplet18M resolves only this part
Our motivation: CIR with various conditions
Current CIR methods (as well as most image editing models) can only handle this part
CompoDiff: latent diffusion model with various conditions
If we have a negative text, we simply change the value of the null image embedding from all zeros to the textual feature of the negative text
Null text embedding (textual feature for “”)
Null image embedding (all zero)
CompoDiff training
Controllability with classifier-free guidance weighting
(Figure from the InstructPix2Pix paper)
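A minimal sketch of the two-weight classifier-free guidance in the InstructPix2Pix formulation; the weights and names are illustrative, and the negative-text handling from the earlier slide is noted as a comment.

```python
import torch

def guided_noise(eps_uncond, eps_img, eps_img_txt, w_img=1.5, w_txt=7.5):
    """Two-weight classifier-free guidance (InstructPix2Pix-style).

    eps_uncond : denoiser output with image and text conditions set to their null embeddings
    eps_img    : denoiser output conditioned on the source-image feature only
    eps_img_txt: denoiser output conditioned on both the image and the text
    w_img, w_txt control how strongly the result follows the source image / the text.
    """
    return (eps_uncond
            + w_img * (eps_img - eps_uncond)
            + w_txt * (eps_img_txt - eps_img))

# For a negative text, the branches that use the null image embedding swap the all-zero
# embedding for the CLIP textual feature of the negative text (as described on the
# "various conditions" slide), so the guidance pushes away from that concept.
eps = torch.randn(3, 1, 768)  # toy denoiser outputs for the three branches
print(guided_noise(*eps).shape)  # torch.Size([1, 768])
```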
Experiments: datasets
Experimental protocol
Experimental results on FashionIQ
CompoDiff outperforms other methods (trained on the SynthTriplet18M dataset) in zero-shot CIR tasks
Training any CIR model on the SynthTriplet18M dataset and then fine-tuning it on the downstream dataset achieves state-of-the-art performance
Experimental results on CIRR
CompoDiff outperforms other methods (trained on the SynthTriplet18M dataset) in zero-shot CIR tasks
Training any CIR model on the SynthTriplet18M dataset and then fine-tuning it on the downstream dataset achieves state-of-the-art performance
Denoising vs. performance trade-off
Qualitative comparison on LAION-2B zero-shot CIR
Bonus: image generation using CompoDiff
Source image
Extracted image feature
Frozen pre-trained image encoder
Conditioned feature-to-feature transform model
“with v neck”
Transformed image feature
CLIP encoder
Latent Diffusion
DALL-E un-CLIP decoder
Bonus: our follow-up image generation work, Graphit
Conclusion