1 of 11

Seminar Presentation: Generating Novel Kanji by Fine-tuning a Stable Diffusion Model with LoRA -- Side Project

M2 YI JIU

2024/12/05

2 of 11

Background


Task

Generate kanji images from English text (text-to-image)

  • Input: English definition
  • Output: Japanese kanji (pixels)

For example:

Input: “seven”

Output:

Interesting part:

Input: “multimedia” (or any other text)

Output: ??? (a novel kanji that does not yet exist)

3 of 11

Data Preparation


The problem

There is no public dataset pairing English text with kanji images.

What we do

We build our own dataset by extracting each kanji's English meanings from a dictionary [1] and matching each kanji to a rendered image [2].

Result

~6k kanji (image, text) pairs at 256x256 resolution
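The pairing step above can be sketched as follows (a minimal sketch: the codepoint-based file naming mirrors KanjiVG's convention, and the `metadata.jsonl` layout is an assumption matching common text-to-image fine-tuning scripts, not necessarily the exact format used here):

```python
import json
from pathlib import Path

def build_metadata(meanings, out_path):
    """Pair each rendered kanji image with its English definition.

    `meanings` maps a kanji character to its dictionary meaning.
    Writes one JSON record per line: {"file_name": ..., "text": ...}.
    """
    records = []
    for kanji, meaning in meanings.items():
        # KanjiVG names files by the 5-digit hex Unicode codepoint.
        file_name = f"{ord(kanji):05x}.png"
        records.append({"file_name": file_name, "text": meaning})
    Path(out_path).write_text(
        "\n".join(json.dumps(r, ensure_ascii=False) for r in records),
        encoding="utf-8",
    )
    return records

# Usage: two entries as they might come from the dictionary export.
records = build_metadata({"七": "seven", "水": "water"}, "metadata.jsonl")
```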

4 of 11

Method (full fine-tuning vs. LoRA [3])


The initial idea (from the LoRA paper [3], originally for LLMs):

"The learned over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the change in weights during model adaptation also has a low 'intrinsic rank'."

Principle

Forward pass:

h = W0 x + ΔW x = W0 x + B A x

  • W0: pretrained weight matrix (frozen)
  • ΔW = BA: accumulated gradient update, with B ∈ R^(d×r), A ∈ R^(r×k)
  • r: the rank of the LoRA module, r ≪ min(d, k)
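A minimal numerical sketch of this forward pass (NumPy, with illustrative dimensions; following the LoRA paper, A gets a small random init and B is zero-initialized, so ΔW = BA is zero at the start of training):

```python
import numpy as np

d, k, r = 64, 64, 4              # output dim, input dim, LoRA rank (r << min(d, k))

rng = np.random.default_rng(0)
W0 = rng.standard_normal((d, k))         # pretrained weight, frozen
A = 0.01 * rng.standard_normal((r, k))   # trainable, small random init
B = np.zeros((d, r))                     # trainable, zero init -> delta W = B A = 0

def forward(x):
    # h = W0 x + delta_W x = W0 x + B (A x); only A and B are updated in training
    return W0 @ x + B @ (A @ x)

x = rng.standard_normal(k)
# At initialization the adapted model reproduces the pretrained output exactly.
assert np.allclose(forward(x), W0 @ x)

# Trainable parameters: r*(d + k) = 512, versus d*k = 4096 for a full update.
print(r * (d + k), d * k)
```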

5 of 11

Method


Using LoRA for efficient Stable Diffusion [4] fine-tuning

Specifically, the weight matrices Wq, Wk, Wv, and Wo in the cross-attention layers are decomposed so that the weight updates have low rank.

Note that LoRA is not limited to attention layers. In the original LoRA work, the authors found that adapting only a language model's attention layers is sufficient to obtain good downstream performance with great efficiency. This is why it is common to add LoRA weights only to a model's attention layers.
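A back-of-the-envelope illustration of the savings (assuming, for simplicity, square d×d projections with d = 320, roughly the hidden size of the smallest SD U-Net blocks; in real cross-attention, Wk and Wv project from the text-encoder dimension instead):

```python
d = 320          # assumed hidden size of one attention block
r = 4            # LoRA rank

full_update = 4 * d * d          # updating Wq, Wk, Wv, Wo directly
lora_update = 4 * r * (d + d)    # per projection: B (d x r) plus A (r x d)

print(full_update, lora_update, lora_update / full_update)
# LoRA trains ~2.5% of the parameters a full update of these matrices would touch.
```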

6 of 11

Experiment


Vary the rank over 1, 4, and 8 (default is 4)

The parameters of the LoRA config:

rank, lora_alpha, init_weights, target_modules
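Such a configuration might look as follows with the Hugging Face `peft` library (a sketch, not necessarily the exact config used in these experiments; the `target_modules` names match the attention projection modules in `diffusers` U-Nets):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=4,                           # the rank being varied (1, 4, 8)
    lora_alpha=4,                  # scaling factor applied to the LoRA update
    init_lora_weights="gaussian",  # random init for A, zeros for B
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # cross-attention projections
)
```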

7 of 11

Evaluation (subjective)


For texts whose meaning aligns with an existing kanji

Visual Assessment

Input: "seven"

Output:

For texts whose meaning has no existing kanji (novel)

Ground truth

Generated by our fine-tuned model

Input: "internet"

Output:

❓❓❓

Judgement varies from person to person

8 of 11

Result

"water": mixes kanji with a realistic background

"language model": seems meaningless or hard to interpret

"forest": can generate some strokes similar to kanji, though not particularly meaningful, and blends these strokes with realistic images; not an existing kanji

Problem

Reason

  • The data distributions of the two datasets are too different
    1. Stable Diffusion was originally trained on realistic images
    2. Kanji are made of strokes and are highly abstract
    3. Mapping from the original distribution to a new kanji distribution is hard

  • LoRA with the default config (rank = 4) updates only very few parameters, so fine-tuning is insufficient

Expectation

Clear, meaningful strokes that represent kanji, on a pure white background

9 of 11

Solution


  • Fine-tuning is not enough

1. Increase the LoRA rank to 8 or 16, at most 64 (== full fine-tuning)

2. Fully fine-tune the Stable Diffusion model

3. Add more training steps and enlarge the number of epochs

  • The text encoder of the original Stable Diffusion model is not good enough for high generation quality

1. Replace CLIP ViT-L/14 with a better one

  • Retrain a smaller diffusion model from scratch, given the small 256x256 generation resolution

10 of 11

Future Work


1. Do more experiments on the kanji-generation task

2. Continue reading papers on human-dance generation and build a training pipeline

11 of 11

Reference


[1] https://www.tagaini.net/

[2] https://kanjivg.tagaini.net/viewer.html

[3] Hu, Edward J., et al. "LoRA: Low-Rank Adaptation of Large Language Models." arXiv preprint arXiv:2106.09685 (2021).

[4] Rombach, Robin, et al. "High-Resolution Image Synthesis with Latent Diffusion Models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.