1 of 11

Seminar Presentation: Generating Novel Kanji by Fine-tuning a Stable Diffusion Model with LoRA -- Side Project

M2 YI JIU

2024/12/05

2 of 11

Background


Task

Generate kanji images from English text (text-to-image)

  • Input: English definition
  • Output: Japanese kanji (pixels)

For example:

Input: “seven”

Output:

Interesting part:

Input: “multimedia” (or any other text)

Output: ??? (a novel kanji that does not yet exist)

3 of 11

Data Preparation


The problem

There is no public dataset pairing English text with kanji images.

What we do

We build our own dataset by extracting each kanji's English meanings from a dictionary [1] and matching each kanji to a rendered image [2].

Result

~6k kanji (image, text) pairs at 256x256 resolution
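The pairing step above can be sketched as follows (a minimal sketch: the codepoint-based file naming mirrors KanjiVG's convention, and the `metadata.jsonl` layout is an assumption matching common text-to-image fine-tuning scripts, not necessarily the exact format used here):

```python
import json
from pathlib import Path

def build_metadata(meanings, out_path):
    """Pair each rendered kanji image with its English definition.

    `meanings` maps a kanji character to its dictionary meaning.
    Writes one JSON record per line: {"file_name": ..., "text": ...}.
    """
    records = []
    for kanji, meaning in meanings.items():
        # KanjiVG names files by the 5-digit hex Unicode codepoint.
        file_name = f"{ord(kanji):05x}.png"
        records.append({"file_name": file_name, "text": meaning})
    Path(out_path).write_text(
        "\n".join(json.dumps(r, ensure_ascii=False) for r in records),
        encoding="utf-8",
    )
    return records

# Usage: two entries as they might come from the dictionary export.
records = build_metadata({"七": "seven", "水": "water"}, "metadata.jsonl")
```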

4 of 11

Method (full fine-tuning vs. LoRA [3])


The initial idea (from the LoRA paper [3], originally for LLMs):

"The learned over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the change in weights during model adaptation also has a low 'intrinsic rank'."

Principle

Forward pass:

h = W0 x + ΔW x = W0 x + B A x

  • W0: pretrained weight matrix (frozen)
  • ΔW = BA: accumulated gradient update, with B ∈ R^(d×r), A ∈ R^(r×k)
  • r: the rank of the LoRA module, r ≪ min(d, k)
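A minimal numerical sketch of this forward pass (NumPy, with illustrative dimensions; following the LoRA paper, A gets a small random init and B is zero-initialized, so ΔW = BA is zero at the start of training):

```python
import numpy as np

d, k, r = 64, 64, 4              # output dim, input dim, LoRA rank (r << min(d, k))

rng = np.random.default_rng(0)
W0 = rng.standard_normal((d, k))         # pretrained weight, frozen
A = 0.01 * rng.standard_normal((r, k))   # trainable, small random init
B = np.zeros((d, r))                     # trainable, zero init -> delta W = B A = 0

def forward(x):
    # h = W0 x + delta_W x = W0 x + B (A x); only A and B are updated in training
    return W0 @ x + B @ (A @ x)

x = rng.standard_normal(k)
# At initialization the adapted model reproduces the pretrained output exactly.
assert np.allclose(forward(x), W0 @ x)

# Trainable parameters: r*(d + k) = 512, versus d*k = 4096 for a full update.
print(r * (d + k), d * k)
```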

5 of 11

Method


Using LoRA for efficient Stable Diffusion [4] fine-tuning

Specifically, the weight matrices Wq, Wk, Wv, and Wo in the cross-attention layers are decomposed so that the weight updates have low rank.

Note that LoRA is not limited to attention layers. In the original LoRA work, the authors found that adapting only a language model's attention layers is sufficient to obtain good downstream performance with great efficiency. This is why it is common to add LoRA weights only to a model's attention layers.
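A back-of-the-envelope illustration of the savings (assuming, for simplicity, square d×d projections with d = 320, roughly the hidden size of the smallest SD U-Net blocks; in real cross-attention, Wk and Wv project from the text-encoder dimension instead):

```python
d = 320          # assumed hidden size of one attention block
r = 4            # LoRA rank

full_update = 4 * d * d          # updating Wq, Wk, Wv, Wo directly
lora_update = 4 * r * (d + d)    # per projection: B (d x r) plus A (r x d)

print(full_update, lora_update, lora_update / full_update)
# LoRA trains ~2.5% of the parameters a full update of these matrices would touch.
```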

6 of 11

Experiment


Vary the rank over 1, 4, and 8 (default is 4)

The parameters of the LoRA config:

rank, lora_alpha, init_weights, target_modules
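Such a configuration might look as follows with the Hugging Face `peft` library (a sketch, not necessarily the exact config used in these experiments; the `target_modules` names match the attention projection modules in `diffusers` U-Nets):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=4,                           # the rank being varied (1, 4, 8)
    lora_alpha=4,                  # scaling factor applied to the LoRA update
    init_lora_weights="gaussian",  # random init for A, zeros for B
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # cross-attention projections
)
```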

7 of 11

Evaluation (subjective)


For texts whose meaning aligns with an existing kanji

Visual Assessment

Input: "seven"

Output:

For texts whose meaning has no existing kanji (novel)

Ground truth

Generated by our fine-tuned model

Input: "internet"

Output:

❓❓❓

Judgement varies from person to person

8 of 11

Result

"water": mixes kanji with a realistic background

"language model": seems meaningless or hard to interpret

"forest": can generate some strokes similar to kanji, though not particularly meaningful, and blends these strokes with realistic images; not an existing kanji

Problem

Reason

  • The data distributions of the two datasets are too different
    1. Stable Diffusion was originally trained on realistic images
    2. Kanji are made of strokes and are highly abstract
    3. Mapping from the original distribution to a new kanji distribution is hard

  • LoRA with the default config (rank = 4) updates only very few parameters, so fine-tuning is insufficient

Expectation

Clear, meaningful strokes that represent kanji, on a pure white background

9 of 11

Solution


  • Fine-tuning is not enough

1. Increase the LoRA rank to 8 or 16, at most 64 (== full fine-tuning)

2. Fully fine-tune the Stable Diffusion model

3. Add more training steps and enlarge the number of epochs

  • The text encoder of the original Stable Diffusion model is not good enough for high generation quality

1. Replace CLIP ViT-L/14 with a better one

  • Retrain a smaller diffusion model from scratch, given the small 256x256 generation resolution

10 of 11

Future Work


1. Do more experiments on the kanji-generation task

2. Continue reading papers on human-dance generation and build a training pipeline

11 of 11

Reference


[1] https://www.tagaini.net/

[2] https://kanjivg.tagaini.net/viewer.html

[3] Hu, Edward J., et al. "LoRA: Low-Rank Adaptation of Large Language Models." arXiv preprint arXiv:2106.09685 (2021).

[4] Rombach, Robin, et al. "High-Resolution Image Synthesis with Latent Diffusion Models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.