1 of 53

CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion

Sanghyuk Chun

sanghyuk.chun@gmail.com

* Slides can be found at https://sanghyukchun.github.io/home/

2 of 53

3 of 53

Image-to-image retrieval

Query: image

Search item: image(s)

Pros:

  • We can search “visually similar items” based on “visual features”

Cons:

  • We may want different “visual features” depending on our situation and the purpose of using the search engine.
  • We cannot search for items with additional conditions (e.g., clothes in a specific color, or clothes for women, …)

4 of 53

“a solid black shirt having long sleeves”

5 of 53

Image-to-text retrieval (cross-modal retrieval)

Query: text

Search item: image(s)

Pros:

  • We can provide specific conditions with natural language

Cons:

  • A text query cannot capture specific visual features; some visual components are hard to describe with text alone

6 of 53

+

“with v neck”

7 of 53

Composed Image Retrieval (CIR)

Query: composition of image & additional condition(s)

Search item: image(s)

Pros:

  • We can now search “visually similar” items with a specific condition

Challenges:

  • How do we compose an image with conditions from a different modality?
  • Can we handle various modalities?
  • Can we handle multiple conditions?

8 of 53

Previous approaches to solve CIR (with text condition)

“with v neck”

Source Image

Text condition

Target Image


Main idea: train a model with triplets of (source image, text condition, target image)
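Below is a minimal PyTorch-style sketch of this triplet recipe: fuse the source image feature with the text feature and pull the fused query toward the target image feature with an in-batch contrastive loss. The `fusion` module, encoder interfaces, and loss are illustrative assumptions, not the exact ARTEMIS or CLIP4CIR objectives.

```python
import torch
import torch.nn.functional as F

def cir_training_step(image_encoder, text_encoder, fusion, batch, temperature=0.07):
    """One training step on (source image, text condition, target image) triplets.

    `fusion` is any module that merges the two query features into a single
    query embedding (e.g., an MLP or a cross-attention block).
    """
    src_feat = image_encoder(batch["source_image"])    # (B, D)
    txt_feat = text_encoder(batch["text_condition"])   # (B, D)
    tgt_feat = image_encoder(batch["target_image"])    # (B, D)

    query = F.normalize(fusion(src_feat, txt_feat), dim=-1)
    target = F.normalize(tgt_feat, dim=-1)

    # In-batch contrastive loss: the i-th query should match the i-th target.
    logits = query @ target.t() / temperature          # (B, B)
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```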

9 of 53

Previous approaches to solve CIR (with text condition)

“ARTEMIS: Attention-based retrieval with text-explicit matching and implicit similarity.”, Delmas, Ginger, et al., ICLR 2022

10 of 53

Previous approaches to solve CIR (with text condition)

"Effective conditioned and composed image retrieval combining clip-based features.", Baldrati, Alberto, et al., CVPR 2022

11 of 53

Our motivation: CIR with various conditions in a real-world retrieval scenario (e.g., zero-shot CIR)

Current CIR methods can only handle this part

12 of 53

Image-to-image retrieval

Query

Retrieved items

13 of 53

Composed Image Retrieval (conventional approach)

+ “With cherry blossom”

Query

Retrieved items

14 of 53

Composed Image Retrieval with negative condition

Query

Retrieved items

+ “With cherry blossom”

- “France”

15 of 53

Composed Image Retrieval with mask condition

Query

Retrieved items

+ “With cherry blossom”

16 of 53

Challenges with the existing CIR methods

  • Challenge 1. The triplet dataset is too expensive to collect; the existing CIR benchmarks are relatively small and hardly generalize to real-world search engines
    • Fashion IQ dataset: (46.6k / 15.5k / 15.5k) (training / validation / test) – only focused on fashion items
    • CIRR dataset: (28.8k / 3.6k / 3.6k) (training / validation / test) – more open-domain items, but its distribution is different from an actual user scenario
  • Challenge 2. The existing methods can only handle a composition of <image, text>; more varied conditions such as a “masked condition”, a “negative condition”, or “multiple conditions” cannot be handled by the existing approaches

"Image retrieval on real-life images with pre-trained vision-and-language models.", Liu, Zheyuan, et al., ICCV 2021

"Fashion iq: A new dataset towards retrieving images by natural language feedback.", Wu, Hui, et al. CVPR 2021

17 of 53

Data collection process for image-to-image retrieval

Collecting images for image classification or i2i retrieval is relatively easy and inexpensive

“Eiffel tower”

18 of 53

Data collection process for CIR

Collecting images for CIR is expensive and requires heavy labor cost, because the common process for constructing a CIR dataset is:

  • Select two similar images with some modifications
  • A human annotator manually describes (in text!) the difference between the two images

With cherry blossom

Add balloons

19 of 53

preprint

CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion

Geonmo Gu1 *, Sanghyuk Chun2 *, Wonjae Kim2, HeeJae Jun1, Yoohoon Kang1, Sangdoo Yun2

1 NAVER Vision

2 NAVER AI Lab

* Equal contribution

20 of 53

Key idea: treat CIR as a conditioned image editing task

CIR can be viewed as image editing with various conditions: after editing the given image, we can retrieve similar images using existing image-to-image retrieval models

“with v neck”

Conditioned image editing model

Image-to-image retrieval model

Edited image

Source image

Retrieval database

Retrieved image

21 of 53

Key idea: treat CIR as a conditioned image editing task

Source image

“as 4k image”

“as 4k image”

- “pink rabbit”

Conditions

Generated image

Top-1 retrieved item

(from LAION-2B)

22 of 53

Preliminary: InstructPix2Pix

"InstructPix2Pix: Learning to follow image editing instructions.", Brooks, Tim, et. al, CVPR 2023

23 of 53

Preliminary: InstructPix2Pix

  • InstructPix2Pix is an image editing model driven by a human instruction (a short text)
  • The main challenge of InstructPix2Pix is that collecting triplets of <original image, instruction text, edited image> is too expensive!
  • The authors tackle this problem by generating training triplets
  • InstructPix2Pix employs many recent strong generative models, such as
    • GPT-3
    • StableDiffusion
    • Prompt2Prompt

24 of 53

Preliminary: InstructPix2Pix

(b) is based on Prompt2Prompt (a method for preserving the identity of an image during editing)

"Prompt-to-prompt image editing with cross attention control.", Hertz, Amir, et al., ICLR 2023

25 of 53

Preliminary: InstructPix2Pix

"InstructPix2Pix: Learning to follow image editing instructions.", Brooks, Tim, et. al, CVPR 2023

26 of 53

Preliminary: InstructPix2Pix

"InstructPix2Pix: Learning to follow image editing instructions.", Brooks, Tim, et. al, CVPR 2023

27 of 53

Problems of directly using InstructPix2Pix for CIR

“with v neck”

Edited image

Source image

Retrieval database

Retrieved image

Why don’t we just use InstructPix2Pix for this phase, and perform retrieval using a pre-trained image encoder?

Issue 1) InstructPix2Pix outputs are “generated” images, so they can be “out-of-distribution” for the pre-trained image encoder.

Issue 2) InstructPix2Pix is trained on only 452k triplets, and most of the images are designed for “image editing”. In other words, this approach lacks generalizability to more generic image domains.

Issue 3) This framework requires an expensive additional image editing step before retrieval. We need a much faster approach for a search engine.

Issue 4) InstructPix2Pix can only handle an image + a text instruction, while we need more varied and versatile conditions than a single text instruction.

28 of 53

Overview of latent diffusion-based CIR framework

Source image

Extracted image feature

Frozen pre-trained image encoder

Conditioned feature-to-feature transform model

“with v neck”

Transformed image feature

Image-to-image retrieval

CLIP encoder

Latent Diffusion

* Note that we can still generate images via “un-CLIP” using only the latent diffusion framework.
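To make the pipeline concrete, here is a minimal sketch of inference with this framework, assuming a frozen CLIP image/text encoder, a `denoiser` that edits the CLIP image feature conditioned on the text, and a pre-extracted database of CLIP features; all call signatures, the number of denoising steps, and the retrieval depth are placeholders rather than the actual CompoDiff implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def compose_and_retrieve(clip_image_encoder, clip_text_encoder,
                         denoiser, source_image, condition_text,
                         db_features, num_steps=10):
    """Transform a CLIP image feature with a conditioned denoiser, then
    run nearest-neighbor search in the same CLIP feature space."""
    src = clip_image_encoder(source_image)      # (1, D), frozen encoder
    cond = clip_text_encoder(condition_text)    # (1, D) text condition

    # Start from noise and iteratively denoise toward the "edited" feature.
    z = torch.randn_like(src)
    for t in reversed(range(num_steps)):
        z = denoiser(z, timestep=t, image_cond=src, text_cond=cond)

    # Retrieval happens directly in the CLIP feature space.
    query = F.normalize(z, dim=-1)
    db = F.normalize(db_features, dim=-1)       # (N, D), pre-extracted
    scores = query @ db.t()
    return scores.topk(k=10, dim=-1).indices
```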

29 of 53

Advantages of our new CIR framework

  • Our framework has two advantages over the naive image editing + image-to-image retrieval approach
    • Our framework directly transforms image features, which is where retrieval (i.e., nearest-neighbor search) actually happens. Therefore, we do not have to worry about the generalizability of the pre-trained image encoder.
    • Our framework is much more efficient than the naive approach.
  • As the output lies in the same feature space as the input, our framework can handle multiple conditions by applying the editing to the transformed feature multiple times (see the sketch after this list).
  • Our framework also has a practical advantage: we can re-use a powerful image retrieval encoder and retrieval database. Even if we update the CIR model, the pre-extracted image retrieval database can be re-used! (imagine a billion-scale DB)
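A minimal sketch of the multiple-condition case mentioned above, assuming a hypothetical `transform(feature, condition)` wrapper around the feature-level model; chaining conditions sequentially is one simple way to exploit the fact that inputs and outputs share the same feature space.

```python
def compose_multiple(transform, src_feature, conditions):
    """Apply the feature-level transform once per condition; each output
    lives in the same CLIP space, so it can be fed back as the next input."""
    feat = src_feature
    for cond in conditions:        # e.g., ["with cherry blossom", "- France"]
        feat = transform(feat, cond)
    return feat
```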

30 of 53

Scaling up InstructPix2Pix 452k triplets to 18M triplets

  • Despite the advantages of our framework, we still face the generalizability issue of the 452k InstructPix2Pix generated training triplets: they are not sufficiently generic to cover real-world image retrieval tasks
    • In particular, the InstructPix2Pix dataset is based on the LAION-Aesthetics dataset, which is designed for generation tasks with aesthetic text-image pairs.
  • To solve the issue, we scale up the 452k InstructPix2Pix triplets to 18.8M triplets! (×41)
  • Our new dataset, SynthTriplet18M, is built on large-scale image-caption datasets: COYO 700M, StableDiffusion Prompts (user-generated prompts that improve StableDiffusion output quality), LAION-2B-en-aesthetic, and LAION-COCO (synthetic COCO-style captions for LAION-5B subsets; LAION-COCO uses fewer proper nouns than real web texts)

31 of 53

How to scale up InstructPix2Pix?

  • Recap:

  • InstructPix2Pix generated caption triplets <source caption, instruction, edited caption> and applied Prompt2Prompt to generate the images. Therefore, the key question is: how can we scale up the caption triplets with minimal cost? Note that InstructPix2Pix needs human labor whenever the dataset is scaled up

"InstructPix2Pix: Learning to follow image editing instructions.", Brooks, Tim, et. al, CVPR 2023

32 of 53

33 of 53

34 of 53

Generating massive diverse caption triplets by templates

  • We first extracted nouns from the large-scale image-caption datasets and used simple templates to generate caption triplets
  • We used 47 template sentences to generate various caption triplets

  • Our process is: (1) select a caption from the dataset and extract a noun from the caption, (2) randomly select a noun from the pre-extracted nouns, and (3) replace the noun in the caption with the randomly selected noun (see the sketch below)
  • Compared to using an LLM, this process ensures the diversity of the nouns in the new dataset
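A minimal sketch of the template-based generation, with a couple of illustrative templates (the actual dataset uses 47 hand-written templates) and a hypothetical noun pool; the exact templates and the noun-extraction tooling are not reproduced here.

```python
import random

# Illustrative templates only; the real dataset uses 47 hand-written ones.
TEMPLATES = [
    "replace {src} with {dst}",
    "change {src} to {dst}",
    "make {src} into {dst}",
]

def make_caption_triplet(caption, noun_in_caption, noun_pool):
    """Build <source caption, instruction, edited caption> by swapping a noun.

    `noun_in_caption` is a noun extracted from `caption` (e.g., with a POS
    tagger); `noun_pool` holds nouns pre-extracted from the whole corpus.
    """
    new_noun = random.choice(noun_pool)
    instruction = random.choice(TEMPLATES).format(src=noun_in_caption, dst=new_noun)
    edited_caption = caption.replace(noun_in_caption, new_noun, 1)
    return caption, instruction, edited_caption

# Example:
# make_caption_triplet("a photo of a dog on the beach", "dog", ["cat", "horse"])
# -> ("a photo of a dog on the beach", "replace dog with cat",
#     "a photo of a cat on the beach")
```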

35 of 53

36 of 53

37 of 53

Model-based filtering

We apply CLIP-based filtering: we extract image/text features for the source and edited images/captions and filter using the following similarities: (I_s, I_e), (I_s, T_s), (I_e, T_e), (I_s, T_e), (I_e, T_s) (see the sketch below)
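A minimal sketch of this filtering step, assuming CLIP encoders that return feature vectors; the per-pair thresholds and the keep/drop rule are hypothetical placeholders, not the actual filtering rules.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def triplet_similarities(clip_image_encoder, clip_text_encoder,
                         src_img, edit_img, src_cap, edit_cap):
    """Compute the five CLIP cosine similarities used for filtering."""
    i_s = F.normalize(clip_image_encoder(src_img), dim=-1)
    i_e = F.normalize(clip_image_encoder(edit_img), dim=-1)
    t_s = F.normalize(clip_text_encoder(src_cap), dim=-1)
    t_e = F.normalize(clip_text_encoder(edit_cap), dim=-1)
    return {
        "(I_s, I_e)": (i_s * i_e).sum(-1).item(),
        "(I_s, T_s)": (i_s * t_s).sum(-1).item(),
        "(I_e, T_e)": (i_e * t_e).sum(-1).item(),
        "(I_s, T_e)": (i_s * t_e).sum(-1).item(),
        "(I_e, T_s)": (i_e * t_s).sum(-1).item(),
    }

def keep_triplet(sims, thresholds):
    """Example rule (thresholds are hypothetical): keep a triplet only when
    every similarity clears its threshold."""
    return all(sims[name] >= thresholds[name] for name in sims)
```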

Failure case examples (filtered out): “Make the lipstick green”, “Make it a photograph”
  • GPUs: 128 × A100
  • Elapsed per triplet: 2.1 sec
  • # of generated triplets: 60,000,000
  • # of filtered triplets: 18,800,000
  • Total elapsed: 11.39 days

38 of 53

InstructPix2Pix 452k vs. SynthTriplet18M

InstructPix2Pix 452k
  • Base image-caption dataset for generating the training dataset: LAION-Aesthetics
  • Training dataset for the LLM: 700 captions sampled from LAION-Aesthetics & human instructions

SynthTriplet18M
  • Base image-caption dataset for generating the training dataset: COYO 700M, StableDiffusion Prompts, LAION-2B-en-aesthetic, LAION-COCO
  • Training dataset for the LLM: 452k triplets of InstructPix2Pix

39 of 53

Examples from SynthTriplet18M

40 of 53

Recap: Challenges with the existing CIR methods

  • Challenge 1. The triplet dataset is too expensive to collect; the existing CIR benchmarks are relatively small and hardly generalize to real-world search engines
    • Fashion IQ dataset: (46.6k / 15.5k / 15.5k) (training / validation / test) – only focused on fashion items
    • CIRR dataset: (28.8k / 3.6k / 3.6k) (training / validation / test) – more open-domain items, but its distribution is different from an actual user scenario
  • Challenge 2. The existing methods can only handle a composition of <image, text>; more varied conditions such as a “masked condition”, a “negative condition”, or “multiple conditions” cannot be handled by the existing approaches

"Image retrieval on real-life images with pre-trained vision-and-language models.", Liu, Zheyuan, et al., ICCV 2021

"Fashion iq: A new dataset towards retrieving images by natural language feedback.", Wu, Hui, et al. CVPR 2021

SynthTriplet18M only resolves this part

41 of 53

Our motivation: CIR with various conditions

Current CIR methods can only handle this part (as can most image editing models)

42 of 53

CompoDiff: latent diffusion model with various conditions

If we have a negative text, we simply change the value of the null image embedding from all zeros to the textual feature of the negative text (see the sketch below)

Null text embedding (textual feature for “”)

Null image embedding (all zeros)
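A minimal sketch of how the conditions could be assembled before being fed to the denoiser, following the rule stated above; the function itself, the encoder interface, and the embedding dimension are illustrative assumptions.

```python
import torch

def build_conditions(clip_text_encoder, text, negative_text=None, image_dim=768):
    """Prepare condition embeddings for the denoiser (hypothetical helper).

    Following the slide: the null image embedding is all zeros, and when a
    negative text is given, it replaces the null image embedding with the
    CLIP textual feature of the negative text.
    """
    text_cond = clip_text_encoder(text)                # positive text feature
    null_text = clip_text_encoder("")                  # null text embedding
    if negative_text is None:
        null_image = torch.zeros(1, image_dim)         # default: all zeros
    else:
        null_image = clip_text_encoder(negative_text)  # negative condition
    return text_cond, null_text, null_image
```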

43 of 53

CompoDiff training

  • We employ a two-stage training strategy for CompoDiff
  • Here, we generate masks using a zero-shot segmentation model, CLIPSeg
  • Simply put, the first stage trains our own stable diffusion, and the second stage fine-tunes it on our generated training data
  • During training, we randomly drop each condition with 10% probability, except the mask condition
  • To prevent forgetting the knowledge from stage 1, we adopt a multi-task learning strategy with text-to-image (30%), inpainting (30%), and the target transform (40%) (see the sketch below)
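A minimal sketch of the stage-2 sampling logic described above (task ratios 30/30/40 and 10% condition dropout, with the mask condition never dropped); the task names and condition keys are illustrative placeholders.

```python
import random

def sample_stage2_task_and_conditions(batch, drop_prob=0.1):
    """Pick a training task by the stated ratios and randomly drop each
    condition (except the mask) with `drop_prob` probability."""
    task = random.choices(
        ["text_to_image", "inpainting", "target_transform"],
        weights=[0.3, 0.3, 0.4],
    )[0]

    conditions = {}
    for key in ("text_cond", "image_cond"):
        if key in batch and random.random() >= drop_prob:
            conditions[key] = batch[key]     # keep the condition
        # else: dropped (replaced by its null embedding downstream)
    if "mask_cond" in batch:
        conditions["mask_cond"] = batch["mask_cond"]   # mask is never dropped
    return task, conditions
```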

44 of 53

Controllability with classifier-free guidance weighting

(Figure from the InstructPix2Pix paper)
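For reference, InstructPix2Pix weights the two conditions with separate guidance scales s_I (image) and s_T (text) at sampling time, roughly as below; the notation follows the InstructPix2Pix paper, and CompoDiff adopts the same style of per-condition weighting, although its exact formulation is not reproduced here.

```latex
\tilde{e}_\theta(z_t, c_I, c_T)
  = e_\theta(z_t, \varnothing, \varnothing)
  + s_I \,\bigl(e_\theta(z_t, c_I, \varnothing) - e_\theta(z_t, \varnothing, \varnothing)\bigr)
  + s_T \,\bigl(e_\theta(z_t, c_I, c_T) - e_\theta(z_t, c_I, \varnothing)\bigr)
```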

45 of 53

Experiments: datasets

  • Quantitative comparisons: conventional CIR benchmarks
    • Fashion IQ dataset: (46.6k / 15.5k / 15.5k) (training / validation / test) – only focused on fashion items / low-resolution images
    • CIRR dataset: (28.8k / 3.6k / 3.6k) (training / validation / test) – more open-domain items, but its distribution is different from an actual user scenario

46 of 53

Experimental protocol

  • CompoDiff
    • Two-stage training with LAION-2B & SynthTriplet18M
    • The pre-trained CompoDiff model has zero-shot (ZS) capability for downstream tasks, but due to the distribution shift of each dataset, zero-shot CIR is not as powerful as its task-specific counterparts
    • Hence, we fine-tune the pre-trained CompoDiff model on each downstream dataset
  • Other methods
    • The existing methods have no zero-shot capability. Also, they are not trained on our massive and high-quality dataset.
    • For a fair comparison, and to show the power of our dataset, we train each method on SynthTriplet18M and measure ZS and fine-tuned (FT) CIR performance

47 of 53

Experimental results on FashionIQ

CompoDiff outperforms the other methods (trained on the SynthTriplet18M dataset) in zero-shot CIR tasks

Training any CIR model on the SynthTriplet18M dataset and fine-tuning it on the downstream dataset achieves state-of-the-art performance

48 of 53

Experimental results on CIRR

CompoDiff outperforms the other methods (trained on the SynthTriplet18M dataset) in zero-shot CIR tasks

Training any CIR model on the SynthTriplet18M dataset and fine-tuning it on the downstream dataset achieves state-of-the-art performance

49 of 53

Denoising steps vs. performance trade-off

50 of 53

Qualitative comparison on LAION-2B zero-shot CIR

51 of 53

Bonus: image generation using CompoDiff

Source image

Extracted image feature

Frozen pre-trained image encoder

Conditioned feature-to-feature transform model

“with v neck”

Transformed image feature

CLIP encoder

Latent Diffusion

DALL-E un-CLIP decoder

52 of 53

Bonus: our following image generation work, Graphit

53 of 53

Conclusion

  • Composed Image Retrieval (CIR) is a challenging task because (1) collecting a triplet dataset for CIR is too expensive, and (2) it is difficult to design a unified framework that handles various and multiple conditions
  • We handle the first issue by generating a massive set of 18M triplets based on the InstructPix2Pix pipeline
  • For the second issue, we propose a latent diffusion-based CIR framework, CompoDiff
  • In the experiments, CompoDiff generalizes to a real-world, web-scale composed image retrieval task as well as zero-shot CIR benchmarks, while the comparison methods show inferior zero-shot CIR performance
  • In the fine-tuned scenario, we confirm that training the existing methods on our SynthTriplet18M dataset achieves state-of-the-art CIR performance