1 of 16

Project Update Presentation

Team 4

Hanrui Huang

Cheng Chen

21-02-2024

2 of 16

Content

  1. Project Introduction
  2. Motivation
  3. Approach Recap
    1. Loss function
    2. Dataset Construction
    3. Prompt2Vec
    4. Metric selection
  4. Results
  5. Challenges & Issues
  6. Updated Timeline


3 of 16

Project Introduction


Image from “High-Resolution Image Synthesis with Latent Diffusion Models” (https://arxiv.org/pdf/2112.10752.pdf)

Our project is about “Text-to-Image” generation using a diffusion model.

4 of 16

Project Introduction: Stable Diffusion


A latent text-to-image diffusion model

Input: text prompts

Output: image

Advantages:

More stable training than GAN-based models (GAN training is notoriously difficult)

High-quality images

Comparison between latent diffusion models and other approaches

5 of 16

Motivation: Semantic Understanding


The models struggle with semantic understanding when given concise narrative prompts.

Examples: counting, color

6 of 16

Motivation: Common-sense Reasoning


LLMs have the knowledge!

We want to transfer the semantic understanding and reasoning abilities of LLMs to pre-trained diffusion models, i.e., distilling knowledge from large language models (LLMs).

7 of 16

Related Works


  1. Text-to-image Diffusion
  2. Large Language Models
  3. Semantic Understanding and Reasoning Dataset
  4. Re-implementation of paper: SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models1

1.SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models: https://arxiv.org/pdf/2305.05189.pdf

8 of 16

Approach Recap: Loss Function


Objective 1: To ensure that the features of simple prompts are enhanced with gains from the LLM (Large Language Model), i.e., knowledge distillation.

Objective 2: To make the features of simple prompts resemble those of complex prompts.

The QKV-Attention in the adapter allows the model to notice important semantic relationships in the original features while generating new features.
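Below is a minimal PyTorch sketch of how the QKV-attention adapter and the two objectives could be wired together; the dimensions, the use of MSE, and the loss weights (alpha, beta) are illustrative assumptions rather than the exact SUR-adapter design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptAdapter(nn.Module):
    """Sketch of an adapter that refines simple-prompt features with QKV self-attention."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, simple_feat):
        # Self-attention over the original prompt features lets the adapter pick up
        # important semantic relationships while producing new features.
        attn_out, _ = self.attn(simple_feat, simple_feat, simple_feat)
        return simple_feat + self.proj(attn_out)  # residual refinement

def adapter_loss(refined_feat, llm_feat, complex_feat, alpha=1.0, beta=1.0):
    # Objective 1: align refined simple-prompt features with the LLM representations
    # (knowledge distillation); llm_feat is assumed to be projected to the same dimension.
    distill = F.mse_loss(refined_feat, llm_feat)
    # Objective 2: make refined simple-prompt features resemble complex-prompt features.
    resemble = F.mse_loss(refined_feat, complex_feat)
    return alpha * distill + beta * resemble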

9 of 16

Approach Recap: Dataset Construction


We constructed our dataset through the following three steps:

  1. Collect image-complex prompt pairs from the civitai website1.
  2. Use BLIP2 to generate simple prompts based on the images.
  3. Ensure semantic consistency between the simple prompts and their corresponding images (using the CLIP3 model for semantic cleaning).

1.civitai: https://civitai.com/

2.BLIP: a pre-trained model designed for vision-language tasks, focusing on generating descriptive text from images to improve the semantic match between images and texts.

3.CLIP: a multimodal pre-trained model that can understand and evaluate the semantic association between image content and textual descriptions.

10 of 16

Approach Recap: Dataset Construction


Example of image-complex prompt pairs on the Civitai website.

Use the BLIP model to generate a simple prompt based on the image:

Simple prompt: "A mysterious silhouette of a woman with glowing eyes on a dark background."
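A minimal captioning sketch using the Hugging Face transformers BLIP interface is shown below; the checkpoint name, the file name, and the generation length are assumptions for illustration, not necessarily the exact settings of our pipeline.

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Assumed BLIP captioning checkpoint; any similar checkpoint would work the same way.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("civitai_example.png").convert("RGB")  # hypothetical local file
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40)
simple_prompt = processor.decode(out[0], skip_special_tokens=True)
print(simple_prompt)  # a short narrative caption of the image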

11 of 16

Approach Recap: Dataset Construction


Ensure semantic consistency between the simple prompts and their corresponding images:

We utilize the publicly available pre-trained CLIP model to ensure the correctness of both simple narrative prompts and complex keyword-based prompts.

For each image, we ask CLIP to classify between its simple prompt and its complex prompt, selecting the prompt that best matches the image semantically.

Data cleaning process:

In general, a complex prompt often contains semantically irrelevant information, such as image-quality descriptions, so a semantically correct simple prompt generally has a higher CLIP score than the complex prompt. Therefore, if the CLIP score of a simple prompt is not lower than that of the corresponding complex prompt, we retain the sample.

After the semantic cleaning based on the CLIP score, we retain 191,33 samples.
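The comparison rule above can be sketched with the Hugging Face CLIP interface as follows; the checkpoint name and the truncation of over-long complex prompts are assumptions made for illustration.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # assumed checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def keep_sample(image_path, simple_prompt, complex_prompt):
    # Retain the sample only if the simple prompt scores at least as high as the complex one.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[simple_prompt, complex_prompt], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]  # similarities to [simple, complex]
    return bool(logits[0] >= logits[1])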

12 of 16

Approach Recap: Prompt2Vec

We utilize LLaMA, a collection of foundation language models ranging from 7B to 65B parameters, as the source of knowledge distillation from large language models (LLMs). Specifically, we save the vector representations of simple prompts from the fortieth layer of LLaMA, which serve as the text understanding used to fine-tune the diffusion models.


We store the simple prompts and their corresponding vectors generated by the LLM in a dictionary with the following format:

{
    "prompt": torch.Tensor,
}
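A hedged sketch of this Prompt2Vec step using the transformers interface; the checkpoint (a 40-layer LLaMA-13B variant), the per-token representation, and the output file name are assumptions for illustration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "huggyllama/llama-13b"  # assumed checkpoint with 40 transformer layers
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, output_hidden_states=True, torch_dtype=torch.float16, device_map="auto")

# Stand-in for the dataset's simple prompts.
simple_prompts = ["a digital painting of a woman with long black hair"]

prompt2vec = {}
for prompt in simple_prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        hidden_states = model(**inputs).hidden_states  # embeddings + one entry per layer
    # hidden_states[40] is the output of the fortieth transformer layer.
    prompt2vec[prompt] = hidden_states[40].squeeze(0).cpu()

torch.save(prompt2vec, "prompt2vec.pt")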

13 of 16

Approach Recap: Metrics Evaluation

  • Baseline: the original Stable Diffusion model
  • Evaluation metrics:
    • Semantic quality evaluation: 3 types of prompts, TISE, CLIP Score1
    • Image quality evaluation: BRISQUE2, CLIP-IQA3, MUSIQ4, and user preference


  1. CLIP Score: A measure that uses the CLIP model to evaluate the semantic similarity between generated images and their textual descriptions.
  2. BRISQUE: A no-reference image quality assessment metric that predicts the naturalness of images without requiring a reference image.
  3. CLIP-IQA: An image quality assessment method leveraging the CLIP model to estimate perceptual image quality by comparing images with textual descriptions.
  4. MUSIQ: A multi-scale deep learning model for no-reference image quality assessment that considers various aspects of human perception to evaluate image quality.

3 types of prompts: action, colour, and counting, each with fifteen prompts. These prompts are used as input to both the baseline and the SUR-adapter model.

User preference: conducting a survey on image preference.
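One possible way to compute these metrics, sketched with torchmetrics (CLIP Score) and the pyiqa package (BRISQUE, CLIP-IQA, MUSIQ); the library choices, metric names, and input conventions are assumptions, not necessarily the exact tooling we will use.

import torch
import pyiqa  # IQA-PyTorch, assumed to provide the no-reference metrics below
from torchmetrics.multimodal.clip_score import CLIPScore

clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
brisque = pyiqa.create_metric("brisque")
clip_iqa = pyiqa.create_metric("clipiqa")
musiq = pyiqa.create_metric("musiq")

def evaluate(images: torch.Tensor, prompts: list) -> dict:
    # images: (N, 3, H, W) uint8 tensor of generated images; prompts: matching list of strings.
    img01 = images.float() / 255.0  # pyiqa metrics expect inputs in [0, 1]
    return {
        "clip_score": clip_score(images, prompts).item(),
        "brisque": brisque(img01).mean().item(),
        "clip_iqa": clip_iqa(img01).mean().item(),
        "musiq": musiq(img01).mean().item(),
    }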

14 of 16

Our Resulting Dataset

Metadata:

{
    "file_name": "3fbf5f76-2b1f-4ae4-57b4-60fd6040e000.png",
    "simple_prompt": "a digital painting of a woman with long black hair",
    "complex_prompt": "(Award Winning Painting:1.3) of (Ultra detailed:1.3),(Lonely:1.3) a woman (Ana de Armas) (red rose in black hair, blue eye), <lora:audrey_kawasaki:0.8>"
}


Dataset statistics:

Image size: 512 x 512
Simple prompt length: 67
Complex prompt length: 248
Dataset size: 191,33

15 of 16

Challenges & Issues

  • Dataset:
    • The dataset contains some sexually explicit images and other content unsuitable for dissemination.

    • The quantity of data within our dataset is insufficient, comprising only 6,498 entries. This limited volume of data may adversely affect the efficacy of our training outcomes.

  • Computational Resource:
    • Generating simple-prompt vectors with the LLaMA model is highly resource-intensive, requiring significant GPU resources and entailing an extended duration of training.

  • Distillation:
    • We are uncertain which layer of the LLaMA model to select as the representational understanding of the input prompt.


16 of 16

Updated Timeline

What we have done so far:

Feb.26 - Mar.4 : Literature review

Mar.4 - Mar.11 : Vanilla dataset construction

Mar.11 - Mar.19 :

  • Prompt vector generation
  • Metric selection and investigation

We will progressively complete the following milestones in the remaining time:

Mar.23 : Extending the SUR dataset

Mar.24 - Mar.27 : Generating the prompt vector using the bigger dataset

Mar.28 - Mar.31 : Training the SUR-adapter

Apr.1 - Apr.15 :

  • Comparison between the vanilla dataset and the bigger dataset
  • Exploring the distillation of LLaMA 2 and/or Gemma

Apr.16 - Apr.17 : Final report and video clip
