1 of 16

Project Update Presentation

Team 4

Hanrui Huang

Cheng Chen

21-02-2024

2 of 16

Content

  1. Project Introduction
  2. Motivation
  3. Approach Recap
    1. Loss function
    2. Dataset Construction
    3. Prompt2Vec
    4. Metric selection
  4. Results
  5. Challenges & Issues
  6. Updated Timeline


3 of 16

Project Introduction


Image from “High-Resolution Image Synthesis with Latent Diffusion Models” (https://arxiv.org/pdf/2112.10752.pdf)

Our project is about “Text-to-Image” generation using a diffusion model.

4 of 16

Project Introduction: Stable Diffusion


A latent text-to-image diffusion model

Input: text prompts

Output: image

Advantages:

More stable training than GAN-based models (GAN training is notoriously difficult)

High-quality images

Comparison between latent diffusion models and other approaches

5 of 16

Motivation: Semantic Understanding


The models struggle with semantic understanding when given concise narrative prompts.

Examples: counting, color

6 of 16

Motivation: Common-sense Reasoning


LLMs have the knowledge!

We want to transfer the semantic understanding and reasoning abilities of LLMs to pre-trained diffusion models, i.e., distilling knowledge from large language models (LLMs).

7 of 16

Related Works


  1. Text-to-image Diffusion
  2. Large Language Models
  3. Semantic Understanding and Reasoning Dataset
  4. Re-implementation of paper: SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models1

1.SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models: https://arxiv.org/pdf/2305.05189.pdf

8 of 16

Approach Recap: Loss Function


Objective 1: To ensure that the features of simple prompts are enhanced with gains from the LLM (Large Language Model), i.e., knowledge distillation.

Objective 2: To make the features of simple prompts resemble those of complex prompts.

The QKV-Attention in the adapter allows the model to notice important semantic relationships in the original features while generating new features.
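Below is a minimal PyTorch sketch of how the QKV-attention adapter and the two objectives could be wired together; the dimensions, the use of MSE, and the loss weights (alpha, beta) are illustrative assumptions rather than the exact SUR-adapter design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptAdapter(nn.Module):
    """Sketch of an adapter that refines simple-prompt features with QKV self-attention."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, simple_feat):
        # Self-attention over the original prompt features lets the adapter pick up
        # important semantic relationships while producing new features.
        attn_out, _ = self.attn(simple_feat, simple_feat, simple_feat)
        return simple_feat + self.proj(attn_out)  # residual refinement

def adapter_loss(refined_feat, llm_feat, complex_feat, alpha=1.0, beta=1.0):
    # Objective 1: align refined simple-prompt features with the LLM representations
    # (knowledge distillation); llm_feat is assumed to be projected to the same dimension.
    distill = F.mse_loss(refined_feat, llm_feat)
    # Objective 2: make refined simple-prompt features resemble complex-prompt features.
    resemble = F.mse_loss(refined_feat, complex_feat)
    return alpha * distill + beta * resemble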

9 of 16

Approach Recap: Dataset Construction


We constructed our dataset through the following three steps:

  1. Collect image-complex prompt pairs from the civitai website1.
  2. Use BLIP2 to generate simple prompts based on the images.
  3. Ensure semantic consistency between the simple prompts and their corresponding images (using the CLIP3 model for semantic cleaning).

1.civitai: https://civitai.com/

2.BLIP: a pre-trained model designed for vision-language tasks, focusing on generating descriptive text from images to improve the semantic match between images and texts.

3.CLIP: a multimodal pre-trained model that can understand and evaluate the semantic association between image content and textual descriptions.

10 of 16

Approach Recap: Dataset Construction


Example of image-complex prompt pairs on the Civitai website.

Use the BLIP model to generate a simple prompt based on the image:

Simple prompt: "A mysterious silhouette of a woman with glowing eyes on a dark background."
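A minimal captioning sketch using the Hugging Face transformers BLIP interface is shown below; the checkpoint name, the file name, and the generation length are assumptions for illustration, not necessarily the exact settings of our pipeline.

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Assumed BLIP captioning checkpoint; any similar checkpoint would work the same way.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("civitai_example.png").convert("RGB")  # hypothetical local file
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40)
simple_prompt = processor.decode(out[0], skip_special_tokens=True)
print(simple_prompt)  # a short narrative caption of the image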

11 of 16

Approach Recap: Dataset Construction


Ensure semantic consistency between the simple prompts and their corresponding images:

We utilize the publicly available pre-trained CLIP model to ensure the correctness of both simple narrative prompts and complex keyword-based prompts.

For each image, we ask CLIP to classify between its simple prompt and its complex prompt, selecting the prompt that best matches the image semantically.

Data cleaning process:

In general, a complex prompt often contains semantically irrelevant information, such as image-quality descriptions, so a semantically correct simple prompt generally has a higher CLIP score than the complex prompt. Therefore, if the CLIP score of a simple prompt is not lower than that of the corresponding complex prompt, we retain the sample.

After the semantic cleaning based on the CLIP score, we retain 191,33 samples.
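The comparison rule above can be sketched with the Hugging Face CLIP interface as follows; the checkpoint name and the truncation of over-long complex prompts are assumptions made for illustration.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # assumed checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def keep_sample(image_path, simple_prompt, complex_prompt):
    # Retain the sample only if the simple prompt scores at least as high as the complex one.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[simple_prompt, complex_prompt], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]  # similarities to [simple, complex]
    return bool(logits[0] >= logits[1])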

12 of 16

Approach Recap: Prompt2Vec

We utilize LLaMA, a collection of foundation language models ranging from 7B to 65B parameters, as the source of knowledge distillation from large language models (LLMs). Specifically, we save the vector representations of simple prompts from the fortieth layer of LLaMA, which serve as the text understanding used to fine-tune the diffusion models.


We store the simple prompts and their corresponding vectors generated by the LLM in a dictionary with the following format:

{
    "prompt": torch.Tensor,
}
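A hedged sketch of this Prompt2Vec step using the transformers interface; the checkpoint (a 40-layer LLaMA-13B variant), the per-token representation, and the output file name are assumptions for illustration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "huggyllama/llama-13b"  # assumed checkpoint with 40 transformer layers
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, output_hidden_states=True, torch_dtype=torch.float16, device_map="auto")

# Stand-in for the dataset's simple prompts.
simple_prompts = ["a digital painting of a woman with long black hair"]

prompt2vec = {}
for prompt in simple_prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        hidden_states = model(**inputs).hidden_states  # embeddings + one entry per layer
    # hidden_states[40] is the output of the fortieth transformer layer.
    prompt2vec[prompt] = hidden_states[40].squeeze(0).cpu()

torch.save(prompt2vec, "prompt2vec.pt")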

13 of 16

Approach Recap: Metrics Evaluation

  • Baseline: the original Stable Diffusion model
  • Evaluation metrics:
    • Semantic quality evaluation: 3 types of prompts, TISE, CLIP Score1
    • Image quality evaluation: BRISQUE2, CLIP-IQA3, MUSIQ4, and user preference


  1. CLIP Score: A measure that uses the CLIP model to evaluate the semantic similarity between generated images and their textual descriptions.
  2. BRISQUE: A no-reference image quality assessment metric that predicts the naturalness of images without requiring a reference image.
  3. CLIP-IQA: An image quality assessment method leveraging the CLIP model to estimate perceptual image quality by comparing images with textual descriptions.
  4. MUSIQ: A multi-scale deep learning model for no-reference image quality assessment that considers various aspects of human perception to evaluate image quality.

3 types of prompts: action, colour, and counting, each with fifteen prompts. These prompts are used as input to both the baseline and the SUR-adapter model.

User preference: conducting a survey on image preference.
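One possible way to compute these metrics, sketched with torchmetrics (CLIP Score) and the pyiqa package (BRISQUE, CLIP-IQA, MUSIQ); the library choices, metric names, and input conventions are assumptions, not necessarily the exact tooling we will use.

import torch
import pyiqa  # IQA-PyTorch, assumed to provide the no-reference metrics below
from torchmetrics.multimodal.clip_score import CLIPScore

clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
brisque = pyiqa.create_metric("brisque")
clip_iqa = pyiqa.create_metric("clipiqa")
musiq = pyiqa.create_metric("musiq")

def evaluate(images: torch.Tensor, prompts: list) -> dict:
    # images: (N, 3, H, W) uint8 tensor of generated images; prompts: matching list of strings.
    img01 = images.float() / 255.0  # pyiqa metrics expect inputs in [0, 1]
    return {
        "clip_score": clip_score(images, prompts).item(),
        "brisque": brisque(img01).mean().item(),
        "clip_iqa": clip_iqa(img01).mean().item(),
        "musiq": musiq(img01).mean().item(),
    }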

14 of 16

Our Resulting Dataset

Metadata:

{
    "file_name": "3fbf5f76-2b1f-4ae4-57b4-60fd6040e000.png",
    "simple_prompt": "a digital painting of a woman with long black hair",
    "complex_prompt": "(Award Winning Painting:1.3) of (Ultra detailed:1.3),(Lonely:1.3) a woman (Ana de Armas) (red rose in black hair, blue eye), <lora:audrey_kawasaki:0.8>"
}


Dataset statistics:

Image size: 512 x 512
Simple prompt length: 67
Complex prompt length: 248
Dataset size: 191,33

15 of 16

Challenges & Issues

  • Dataset:
    • The dataset contains some sexually explicit images and other content unsuitable for dissemination.

    • The quantity of data within our dataset is insufficient, comprising only 6,498 entries. This limited volume of data may adversely affect the efficacy of our training outcomes.

  • Computational Resource:
    • Generating simple-prompt vectors with the LLaMA model is highly resource-intensive, requiring significant GPU resources and entailing an extended duration of training.

  • Distillation:
    • We are uncertain which layer of the LLaMA model to select as the representational understanding of the input prompt.


16 of 16

Updated Timeline

What we have done so far:

Feb.26 - Mar.4 : Literature review

Mar.4 - Mar.11 : Vanilla dataset construction

Mar.11 - Mar.19 :

  • Prompt vector generation
  • Metric selection and investigation

We will progressively complete the following milestones in the remaining time:

Mar.23 : Extending the SUR dataset

Mar.24 - Mar.27 : Generating the prompt vector using the bigger dataset

Mar.28 - Mar.31 : Training the SUR-adapter

Apr.1 - Apr.15 :

  • Comparison between the vanilla dataset and the bigger dataset
  • Exploring the distillation of LLaMA 2 and/or Gemma

Apr.16 - Apr.17 : Final report and video clip
