Typeface Design and Text Encoding
林文心
Department of Computer Science and Information Engineering, National Taipei University of Technology
Chapter 01: Paper / Authors
CalliGAN: Style and Structure-aware Chinese Calligraphy Character Generator
Shan-Jean Wu, Chih-Yuan Yang and Jane Yung-jen Hsu
CVPR 2020
National Taiwan University
Shan-Jean Wu 吳尚真 (first author)
Jane Yung-jen Hsu 許永真 (advisor)
Chapter 02: Abstract
Chinese calligraphy is the writing of Chinese characters as an art form performed with brushes, so Chinese characters are rich in shapes and details. Recent studies show that Chinese characters can be generated through image-to-image translation for multiple styles using a single model.
We propose a novel method within this approach by incorporating Chinese characters' component information into the model. We also propose an improved network to convert characters to their embedding space.
Experiments show that the proposed method generates higher-quality Chinese calligraphy characters than state-of-the-art methods, as measured through numerical evaluations and human subject studies.
Chapter 03: Figures / Tables
Figure 1: Results generated by the proposed method. The style used to generate the characters in the upper row is style 2 (Liu Gongquan), and in the lower row style 3 (Ouyang Xun, Huangfu Dan Stele).
Figure 2: Architecture and losses. The proposed CalliGAN is an encoder-decoder-based image translation network with two supporting branches to control styles and structures. CalliGAN has 4 image-based losses: adversarial (Eq. 2), pixel-wise (Eq. 3), constancy (Eq. 4) and category (Eq. 5).
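To make the loss structure concrete, here is a minimal PyTorch-style sketch of how the four image-based losses might be combined into a single generator objective. It is not the authors' code: the network handles, the `G.encode` helper, and the loss weights are illustrative assumptions; the equation numbers refer to the caption above.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of combining CalliGAN's four image-based losses
# (Eqs. 2-5 in the caption). Names and weights are assumptions.
def generator_objective(G, D, Ds, src_img, real_img, style_label, structure_vec):
    fake_img = G(src_img, style_label, structure_vec)

    # Eq. 2, adversarial: the generator wants D to score fakes as real.
    fake_logits = D(fake_img)
    adv = F.binary_cross_entropy_with_logits(
        fake_logits, torch.ones_like(fake_logits))

    # Eq. 3, pixel-wise: L1 distance to the ground-truth calligraphy image.
    pix = F.l1_loss(fake_img, real_img)

    # Eq. 4, constancy: re-encoding the fake should reproduce the source's
    # latent code (G.encode is a hypothetical handle on the encoder half).
    const = F.l1_loss(G.encode(fake_img), G.encode(src_img))

    # Eq. 5, category: the style classifier Ds should recognize the target style.
    cat = F.cross_entropy(Ds(fake_img), style_label)

    # Placeholder weights; the paper's actual lambdas are not reproduced here.
    return adv + 100.0 * pix + 15.0 * const + 1.0 * cat
```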
Table 1: Architecture of the image encoder and decoder. All 8 encoder layers use the same convolution kernel size of 5×5, a LeakyReLU activation function with a slope of 0.2, a batch normalization layer, and a stride of 2. The decoder's L1 to L7 layers use the same deconvolution kernel size of 5×5, a ReLU activation function, and a batch normalization layer. The decoder's L8 layer uses the hyperbolic tangent activation function and has a dropout layer with a drop rate of 0.5.
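The layer pattern in Table 1 translates into PyTorch roughly as follows. Only the kernel size, stride, activations, normalization, and dropout come from the caption; the channel widths and the single-channel (grayscale) input are assumptions, and the generator's U-Net skip connections are omitted for brevity.

```python
import torch.nn as nn

def enc_block(cin, cout):
    # Encoder layer per Table 1: 5x5 conv, stride 2, BN, LeakyReLU(0.2).
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=5, stride=2, padding=2),
        nn.BatchNorm2d(cout),
        nn.LeakyReLU(0.2),
    )

def dec_block(cin, cout):
    # Decoder layers L1-L7 per Table 1: 5x5 deconv, BN, ReLU.
    return nn.Sequential(
        nn.ConvTranspose2d(cin, cout, kernel_size=5, stride=2,
                           padding=2, output_padding=1),
        nn.BatchNorm2d(cout),
        nn.ReLU(),
    )

widths = [64, 128, 256, 512, 512, 512, 512, 512]   # assumed channel widths
encoder = nn.Sequential(*[enc_block(cin, cout)
                          for cin, cout in zip([1] + widths[:-1], widths)])

rev = widths[::-1]
decoder_l1_to_l7 = nn.Sequential(*[dec_block(cin, cout)
                                   for cin, cout in zip(rev[:7], rev[1:])])
# Decoder layer L8 per Table 1: deconv, dropout with rate 0.5, then tanh.
decoder_l8 = nn.Sequential(
    nn.ConvTranspose2d(rev[-1], 1, kernel_size=5, stride=2,
                       padding=2, output_padding=1),
    nn.Dropout(0.5),
    nn.Tanh(),
)
```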
Figure 3: Examples of component sequences. The first and second characters share the same component code k1 = 46, and the second and third characters share the same k2 = 48 and k3 = 81.
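To make the sharing pattern concrete, the component sequences of the three characters could be written as plain integer lists, sketched below. The characters themselves are unnamed in the caption, and every code other than 46, 48, and 81 is made up:

```python
# Component-code sequences mirroring Figure 3's sharing pattern.
# Codes 46, 48, and 81 come from the caption; the rest are placeholders.
char_components = {
    "character_1": [46, 12, 57],   # shares k1 = 46 with character_2
    "character_2": [46, 48, 81],   # shares k2 = 48 and k3 = 81 with character_3
    "character_3": [33, 48, 81],
}
```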
Figure 4: Architecture of the proposed component encoder.
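Figure 4's internal layer details are not reproduced in this text. A common design for turning a variable-length code sequence into a fixed-length structure vector is an embedding layer followed by a recurrent layer, and the sketch below assumes exactly that; the vocabulary size, dimensions, and the choice of a GRU are assumptions, not details taken from the figure.

```python
import torch
import torch.nn as nn

class ComponentEncoder(nn.Module):
    """Hypothetical component encoder: embed component codes, then
    summarize the sequence into one structure vector with a GRU."""
    def __init__(self, num_codes=100, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(num_codes, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, codes):            # codes: (batch, seq_len) int tensor
        _, last_hidden = self.rnn(self.embed(codes))
        return last_hidden.squeeze(0)    # (batch, hidden_dim)

enc = ComponentEncoder()
structure_vec = enc(torch.tensor([[46, 48, 81]]))  # codes from Figure 3
```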
Table 2: Architecture of the proposed discriminator D and style classifier Ds. BN means a batch normalization layer.
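Because Table 2's exact layer sizes are not reproduced here, the sketch below only shows the general shape such a network could take: a shared convolutional trunk with batch normalization (the "BN" of the caption) feeding two heads, a real/fake logit for D and style logits for Ds. The shared trunk, depths, and widths are assumptions; the 7 style classes follow Figure 5.

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """Hypothetical D/Ds pair: shared conv trunk, two linear heads."""
    def __init__(self, num_styles=7, in_channels=2):
        super().__init__()
        layers, cin = [], in_channels   # assumed: concatenated image pair
        for cout in (64, 128, 256):
            layers += [nn.Conv2d(cin, cout, kernel_size=5, stride=2, padding=2),
                       nn.BatchNorm2d(cout),          # "BN" in Table 2
                       nn.LeakyReLU(0.2)]
            cin = cout
        self.trunk = nn.Sequential(*layers,
                                   nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.d_head = nn.Linear(256, 1)             # D: real/fake logit
        self.s_head = nn.Linear(256, num_styles)    # Ds: style logits

    def forward(self, x):
        feats = self.trunk(x)
        return self.d_head(feats), self.s_head(feats)
```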
Figure 5: Example characters of the 7 styles downloaded from the online repository. The 1st, 3rd, 6th, and 7th images have a vertical long side, while the 2nd, 4th, and 5th ones have a horizontal long side.
Table 3: Statistics of our training and test samples.
Table 4: Performance comparison. One-hot means that we replace zi2zi's label embedding vector with our proposed simple one-hot vector. The symbol Ec denotes the proposed component encoder. The proposed method is equivalent to zi2zi (single channel) + one-hot + Ec.
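For clarity, the one-hot variant replaces zi2zi's learned label embedding with a fixed indicator vector over the styles. A tiny sketch, assuming the 7 styles of Figure 5 and 0-indexed style ids:

```python
import torch
import torch.nn.functional as F

# Style 4 (0-indexed id 3) as a one-hot vector over 7 styles.
style_id = torch.tensor(3)
one_hot = F.one_hot(style_id, num_classes=7).float()
print(one_hot)   # tensor([0., 0., 0., 1., 0., 0., 0.])
```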
Figure 6: Qualitative comparison of single-style transfer. All six characters are generated in style 4. Red rectangles highlight the benefits brought by the proposed component encoder, which generates the ending hook of the first character, separates the two strokes of the second character, makes the strokes of the third, fourth, and sixth characters straight, and restores the corner of the L-shaped stroke of the fifth character.
Table 5: Quantitative comparison of single-style transfer. We disable the multi-style part of both methods, so the only substantive difference between the two configurations is the presence of the component encoder, which is included in the proposed method but not in zi2zi. For each style, the training and test images used by the two methods are the same.
Figure 7: Failure cases generated by zi2zi.
Table 6: Percentage of preferred images in our human subject study. Most of our participants judged the proposed method's output images to be more similar to the ground truth than zi2zi's.
Figure 8: Comparison with AEGG. The style used in this comparison is style 2. The images generated by AEGG are extracted from its original paper. Their aspect ratios differ from those of the ground-truth images because AEGG's authors changed the ratios, without explaining why in their paper. Red rectangles highlight the regions that the proposed method handles better.
Chapter 04: Conclusion
In this paper, we propose a novel method to generate multi-style Chinese character images. It consists of a U-Net-based generator and a component encoder. Experimental results show that the proposed method generates high-quality images of calligraphy characters.
Numerical evaluations and a human subject study show that the proposed method generates images more similar to the ground truth than existing methods do. Our research is still ongoing and many questions are not yet answered.
For example, how well does the proposed method perform using other types of character images such as font-rendered images or images of cursive or semi-cursive scripts? Is there a font better than Sim Sun to render our input images? Does the choice depend on the used calligraphy styles?
How many dimensions should we use for the component codes' embedding? Is there any pattern among those embedded feature vectors? Can a GAN training method such as WGAN-GP [7] or SN-GAN [19] improve our results? What is our method's performance if we use another data split?
If we replace our shallow discriminator with a powerful, deep, pre-trained image classifier, can we get better results? We hope to be able to answer those questions soon.
Thank you for your attention!