Project Update Presentation
Team 4
Hanrui Huang
Cheng Chen
21-02-2024
Content
Project Introduction
[Figure: High-Resolution Image Synthesis with Latent Diffusion Models, https://arxiv.org/pdf/2112.10752.pdf]
Our project is about text-to-image generation using diffusion models.
Project Introduction: Stable Diffusion
A latent text-to-image diffusion model
Input: text prompts
Output: image
Advantages:
More stable than GAN-based models (GAN training is difficult)
High-quality images
Comparison between latent diffusion models and other generative models
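To make the input/output interface concrete, here is a minimal text-to-image sketch using the diffusers library; the checkpoint name is an assumption for illustration, not necessarily the exact model we build on:

import torch
from diffusers import StableDiffusionPipeline

# Assumed checkpoint; any Stable Diffusion checkpoint works the same way.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Text prompt in, image out.
image = pipe("a mysterious silhouette of a woman with glowing eyes").images[0]
image.save("output.png")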
Motivation: Semantic Understanding
Diffusion models struggle with semantic understanding when given concise narrative prompts, for example:
Counting
Color
Motivation: Common-sense Reasoning
LLMs have the knowledge!
We want to transfer the semantic understanding and reasoning abilities of large language models (LLMs) to pre-trained diffusion models by distilling knowledge from the LLMs.
Related Work
1. SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models: https://arxiv.org/pdf/2305.05189.pdf
Approach Recap: Loss Function
Objective 1: Ensure that the features of simple prompts are enhanced with knowledge gained from an LLM (large language model), i.e. knowledge distillation.
Objective 2: Make the features of simple prompts resemble those of complex prompts.
The QKV attention in the adapter allows the model to attend to important semantic relationships in the original features while generating new features.
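A minimal PyTorch sketch of the adapter's QKV attention and the two objectives; the dimensions, loss weight, and use of MSE are our assumptions for illustration, not the exact SUR-adapter implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class QKVAdapter(nn.Module):
    """Toy adapter: QKV self-attention over the prompt features."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Attention lets the adapter pick up important semantic relationships
        # in the original features while generating new ones.
        attended, _ = self.attn(feats, feats, feats)
        return feats + self.proj(attended)  # residual keeps pretrained features

adapter = QKVAdapter()
simple_feats  = torch.randn(4, 77, 768)  # text-encoder features of simple prompts
llm_feats     = torch.randn(4, 77, 768)  # LLM-derived features (Objective 1 target)
complex_feats = torch.randn(4, 77, 768)  # features of complex prompts (Objective 2 target)

out = adapter(simple_feats)
loss_kd    = F.mse_loss(out, llm_feats)      # Objective 1: knowledge distillation
loss_align = F.mse_loss(out, complex_feats)  # Objective 2: match complex-prompt features
loss = loss_kd + 0.5 * loss_align            # 0.5 is an arbitrary illustrative weight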
Approach Recap: Dataset Construction
We constructed our dataset through the following three steps:
1. Civitai (https://civitai.com/): the source of image and complex-prompt pairs.
2. BLIP: a pre-trained vision-language model, used here to generate descriptive text from images and thereby improve the semantic match between images and texts.
3. CLIP: a multimodal pre-trained model that can understand and evaluate the semantic association between image content and textual descriptions.
Approach Recap: Dataset Construction
Example of image-complex prompt pairs on the Civitai website.
Use the BLIP model to generate a simple prompt based on the image on the left:
A mysterious silhouette of a woman with glowing eyes on a dark background.
Simple prompt
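A minimal sketch of this captioning step, using the public Salesforce/blip-image-captioning-base checkpoint as an assumed stand-in for the BLIP model we ran (the file name is hypothetical):

import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("civitai_example.png").convert("RGB")  # hypothetical file name
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=30)
simple_prompt = processor.decode(out[0], skip_special_tokens=True)
print(simple_prompt)  # a short narrative caption for the image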
Approach Recap: Dataset Construction
Ensure semantic consistency between the simple prompts and their corresponding images:
We utilize the publicly available pre-trained CLIP model to verify both simple narrative prompts and complex keyword-based prompts.
For each image, we ask CLIP to classify between its simple prompt and its complex prompt, selecting the prompt that best matches the image semantically.
Data cleaning process:
In general, a complex prompt often contains semantically irrelevant information, such as image-quality descriptions, so a semantically correct simple prompt generally has a higher CLIP score than the complex prompt. Therefore, if the CLIP score of a simple prompt is not lower than that of the corresponding complex prompt, we retain the sample.
After this automatic semantic cleaning based on the CLIP score, we retain 19,133 samples.
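The cleaning rule above can be sketched as follows; the openai/clip-vit-base-patch32 checkpoint is an assumed stand-in for the CLIP model we used:

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def keep_sample(image_path: str, simple_prompt: str, complex_prompt: str) -> bool:
    """Retain the sample only if the simple prompt's CLIP score is not
    lower than the complex prompt's score for the same image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[simple_prompt, complex_prompt], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image[0]  # [simple, complex]
    return bool(scores[0] >= scores[1])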
Approach Recap: Prompt2Vec
We utilize LLaMA, a collection of foundation language models ranging from 7B to 65B parameters, for knowledge distillation from large language models (LLMs). Specifically, we save the vector representations of simple prompts from the 40th layer of LLaMA; these serve as the text understanding used to fine-tune the diffusion model.
We stored the simple prompts and their corresponding vectors generated by the LLM in a dictionary following the format below:
{
    "<simple prompt text>": torch.Tensor,
}
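A minimal sketch of how such a dictionary could be built with the transformers library; the checkpoint name is an assumption, and the 40th hidden layer requires a LLaMA variant with at least 40 layers (e.g. 13B):

import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "huggyllama/llama-13b"  # assumed checkpoint with >= 40 layers

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
model.eval()

@torch.no_grad()
def prompt2vec(prompt: str, layer: int = 40) -> torch.Tensor:
    inputs = tokenizer(prompt, return_tensors="pt")
    # hidden_states is a tuple: (embeddings, layer 1, ..., layer N)
    outputs = model(**inputs, output_hidden_states=True)
    return outputs.hidden_states[layer].squeeze(0)  # (seq_len, hidden_dim)

prompts = ["a digital painting of a woman with long black hair"]
cache = {p: prompt2vec(p) for p in prompts}
torch.save(cache, "prompt2vec.pt")  # the dictionary described above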
Approach Recap: Metrics Evaluation
Three types of prompts (action, colour, and counting), each with fifteen prompts. These prompts are used as input to both the baseline and the SUR-adapter model.
User preference: a survey asking participants which generated images they prefer.
Our resulting dataset
Metadata:
{
"file_name": "3fbf5f76-2b1f-4ae4-57b4-60fd6040e000.png",
"simple_prompt": "a digital painting of a woman with long black hair",
"complex_prompt": "(Award Winning Painting:1.3) of (Ultra detailed:1.3),(Lonely:1.3) a woman (Ana de Armas) (red rose in black hair, blue eye), <lora:audrey_kawasaki:0.8>"
}
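Loading these records is straightforward, assuming the metadata is stored one JSON record per line (the file name metadata.jsonl is hypothetical):

import json

with open("metadata.jsonl") as f:
    records = [json.loads(line) for line in f]

print(records[0]["simple_prompt"])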
Image size | Simple prompt length | Complex prompt length | Dataset size
512 x 512  | 67                   | 248                   | 19,133
Dataset statistics
Challenges & Issues
Update Timeline
What we have done so far:
Feb 26 - Mar 4: Literature review ✅
Mar 4 - Mar 11: Vanilla dataset construction ✅
Mar 11 - Mar 19:
We will progressively complete the following milestones in the remaining time:
Mar 23: Extending the SUR dataset
Mar 24 - Mar 27: Generating the prompt vectors using the bigger dataset
Mar 28 - Mar 31: Training the SUR-adapter
Apr 1 - Apr 15:
Apr 16 - Apr 17: Final report and video clip