Seminar Presentation��Generate novel Kanji by finetuning stable diffusion model with Lora� --Side Project�
M2 YI JIU
2024/12/05
Background
2
Task
Geneate Kanji image by english text(text-to-image)
For example:
Input: “seven”
Output:
Interesting part:
Input: “multimedia” (or any other text)
Output: ??? (a novel kanji that never exists)
Data Preparation
3
No public dataset for English(text)-kanji(image)
The problem
We create our own dataset by extracting kanji’s meaning from
dictionary[1] and also matching the kanji to a rendered image[2]
What we do
Result
6k kanji image resolution 256*256 (image,text)
Method(fully fine-tune or Lora[3])
4
The initial idea(llm):
The learned over-parametrized models, in fact reside on a low intrinsic dimension. We hypothesize that the change in weights during model adaptation also has a low “intrinsic rank”.
Principle
Forward pass
W0: pretrained weight matrix
Delta W:accumulated gradient update
R: the rank of lora module
Method
5
Using lora for efficient stable diffusion[4] fine-tuning
Specifically, the weight matrices Wo, Wq, Wk, and Wv in cross-attention layers are decomposed to lower the rank of the weight updates.
Note that the usage of LoRA is not just limited to attention layers. In the original LoRA work, the authors found out that just amending the attention layers of a language model is sufficient to obtain good downstream performance with great efficiency. This is why, it’s common to just add the LoRA weights to the attention layers of a model
Experiment
6
Change the rank from 1,4,8 (default is 4)
The parameter of Lora Config
rank,lora_alpha,init_weights,target_moudules
Evaluation( subjective )
7
For the text meaning align with existing Kanji
Visual Assessment
Input: ’seven’
Output:
✅
❌
For the text meaning not align with existing Kanji(novel)�
groundtruth
generated by our finetuned model
Input: ’internet’
Output:
❓❓❓
Depends on different people’s judgement
Result
“water”
Mix with kanji and realistic background
“language model”
Seems meaningless or hard to understand
‘’forest'’
Can generate some strokes similar to Kanji, though not particularly meaningful,
and blend these strokes with realistic images.
Not the existing kanji
Problem
Reason
Expectation
Clear and meaningful strokes which represent Kanji and background should be pure white
Solution
9
1. increase the lora rank to 8,16,at most (64==fully finetune)
2. fully-finetune the stable diffustion model
3. add more traning steps,enlarge the epoches
1. Change from CLIP ViT-L/14 to better one
Future Work
10
1. Do more experiments on kanji-generation task
2. Contine reading paper about human dancing generation and build trainning pipeline
Reference
11
[1] https://www.tagaini.net/
[2] https://kanjivg.tagaini.net/viewer.html
[3]Hu, Edward J., et al. "Lora: Low-rank adaptation of large language models." arXiv preprint arXiv:2106.09685 (2021).
[4]Rombach, Robin, et al. "High-resolution image synthesis with latent diffusion models." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.