2024.04.19 @ UOS

Geewook Kim

Vision-Language Models for Context-Rich Image Understanding Tasks

Geewook Kim @24.04.19 UOS

Language Model (LM)

안 녕 하 세 요 (Korean for "hello", one syllable per token)

Candidate next syllables: 용, 욥, 역, …

P(요 | 안녕하세) > P(역 | 안녕하세)

P(? | 안녕하세)
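The inequality P(요 | 안녕하세) > P(역 | 안녕하세) can be illustrated with a count-based character model. This is only a minimal sketch: the toy corpus and the helper names `train_char_lm` and `next_char_prob` are assumptions made for illustration, and real language models estimate these probabilities with neural networks rather than raw counts.

```python
from collections import Counter, defaultdict


def train_char_lm(corpus, context_len):
    """Count which character follows each context of length `context_len`."""
    counts = defaultdict(Counter)
    for text in corpus:
        for i in range(context_len, len(text)):
            counts[text[i - context_len:i]][text[i]] += 1
    return counts


def next_char_prob(counts, context, char):
    """Relative-frequency estimate of P(char | context); 0.0 for unseen contexts."""
    followers = counts.get(context)
    if not followers:
        return 0.0
    return followers[char] / sum(followers.values())


# Hypothetical toy corpus (not from the talk): mostly the standard greeting.
corpus = ["안녕하세요"] * 8 + ["안녕하세용"] * 2
counts = train_char_lm(corpus, context_len=4)

p_yo = next_char_prob(counts, "안녕하세", "요")
p_yeok = next_char_prob(counts, "안녕하세", "역")
print(p_yo, p_yeok)
```

With this corpus the model prefers 요 simply because it follows 안녕하세 more often than any other syllable.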

Large Language Model (LLM)

https://medium.com/@harishdatalab/unveiling-the-power-of-large-language-models-llms-e235c4eba8a9

https://jalammar.github.io/illustrated-gpt2/

Application of LMs - Text-to-Text

X

Y

https://dodnet.tistory.com/133

Application of LMs - Text-to-Text

P(y1 | x1 x2 x3 x4)

Decoder Model (e.g., GPT)
Input: x1 x2 x3 x4 y1
Output: x2 x3 x4 y1 y2

Encoder-Decoder Model (e.g., T5, BART)
Input: x1 x2 x3 x4
Output: y1 y2
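The two setups differ mainly in how X and Y are arranged into model inputs and targets. The sketch below builds both arrangements for the x1..x4 / y1 y2 example. The function names and the `<s>` start token are assumptions for illustration, not something the slides specify.

```python
def decoder_only_example(x_tokens, y_tokens):
    """Decoder-only (GPT-style): one sequence; targets are the inputs shifted by one."""
    full = x_tokens + y_tokens
    return full[:-1], full[1:]


def encoder_decoder_example(x_tokens, y_tokens, bos="<s>"):
    """Encoder-decoder (T5/BART-style): the encoder reads X, the decoder is teacher-forced on Y."""
    encoder_input = list(x_tokens)
    decoder_input = [bos] + y_tokens[:-1]  # assumed start token, then previous targets
    decoder_target = list(y_tokens)
    return encoder_input, decoder_input, decoder_target


x = ["x1", "x2", "x3", "x4"]
y = ["y1", "y2"]

dec_in, dec_tgt = decoder_only_example(x, y)
enc_in, ed_dec_in, ed_dec_tgt = encoder_decoder_example(x, y)

print(dec_in, dec_tgt)
print(enc_in, ed_dec_in, ed_dec_tgt)
```

In the decoder-only case the model also predicts the continuation of X itself, whereas the encoder-decoder case only scores the Y tokens.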

https://dodnet.tistory.com/133

Vision-Language Model (VLM) - Image-to-Text

x1 x2 x3 x4

Text -> Image

y1 y2

https://www.slideshare.net/deeplearningitalia/transformers-in-vision-from-zero-to-hero-dlipptx

Vision-Language Model (VLM) - Image-to-Text

x1 x2 x3 x4

Use ViT as Encoder

y1 y2
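A ViT encoder consumes an image as a sequence of patches, which is one way to read x1 x2 x3 x4 here. The sketch below shows only the patch-splitting step on a tiny 4 x 4 grid, under that assumption. The `patchify` helper is hypothetical, and the linear projection and position embeddings that a real ViT applies afterwards are omitted.

```python
def patchify(image, patch_size):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches.

    Patches are returned in row-major order (left to right, top to bottom).
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 4 x 4 "image" whose pixel values are 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(image, patch_size=2)
print(patches)
```

The four resulting patches play the role of the four input tokens; each would be embedded into a vector before entering the transformer encoder.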

Vision-Language Model (VLM)

Large Vision-Language Model (LVLM)

https://arxiv.org/pdf/2210.03347.pdf

https://openaccess.thecvf.com/content/CVPR2022/papers/Hu_Scaling_Up_Vision-Language_Pre-Training_for_Image_Captioning_CVPR_2022_paper.pdf

Better CLIP for Better VLM

https://arxiv.org/pdf/2103.00020.pdf

LLM-Based LVLM

https://llava-vl.github.io/

https://vaclavkosar.com/ml/Encoder-only-Decoder-only-vs-Encoder-Decoder-Transfomer

Context-Rich Image Understanding – e.g., Document/Infographic VQA

https://openaccess.thecvf.com/content/WACV2022/supplemental/Mathew_InfographicVQA_WACV_2022_supplemental.pdf

Context-Rich Image Understanding – e.g., Chart/Diagram VQA

https://arxiv.org/pdf/1603.07396.pdf

https://aclanthology.org/2022.findings-acl.177.pdf

Context-Rich Image Understanding – Service Products in NAVER

"You Visited This Place!: Finding the Store from the Receipt in a Photo" (이곳에 방문하셨군요!: 사진 속 영수증의 가게 찾기). DEVIEW 2021: https://deview.kr/2021/sessions/524

Recent Publications

(VLM) Donut 🍩: OCR-free Document Understanding Transformer
Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park
ECCV 2022.

(LLM-Based LVLM) Cream🍦: Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models
Geewook Kim, Hodong Lee, Daehee Kim, Haeji Jung, Sanghee Park, Yoonsik Kim, Sangdoo Yun, Taeho Kil, Bado Lee, and Seunghyun Park
EMNLP 2023.

Context-Rich Image Understanding;

Visual Document Understanding

VDU Model

Useful Information

{ "class": "receipt" }

{ "menu": [
    {
      "nm": "3002-Kyoto Choco Mochi",
      "unitprice": "14.000",
      "cnt": "x2",
      "price": "28.000"
    }, … }

Conventional Pipeline

Input

Output

{ "items": [
    {
      "name": "3002-Kyoto Choco Mochi",
      "count": 2,
      "priceInfo": {
        "unitPrice": 14000,
        "price": 28000
      }
    }, {
      "name": "1001 - Choco Bun",
      "count": 1,
      "priceInfo": {
        "unitPrice": 22000,
        "price": 22000
      }
    }, ...
  ],
  "total": [ {
      "menuqty_cnt": 4,
      "total_price": 50000
    }
  ]
}

≈

Conventional Pipeline

{ "items": [
    {
      "name": "3002-Kyoto Choco Mochi",
      "count": 2,
      "priceInfo": {
        "unitPrice": 14000,
        "price": 28000
      }
    }, {
      "name": "1001 - Choco Bun",
      "count": 1,
      "priceInfo": {
        "unitPrice": 22000,
        "price": 22000
      }
    }, ...
  ],
  "total": [ {
      "menuqty_cnt": 4,
      "total_price": 50000
    }
  ]
}

≈

{ "words": [ {
      "id": 1,
      "bbox":[[360,2048],...,[355,2127]],
      "text": "3002-Kyoto"
    }, {
      "id": 2,
      "bbox":[[801,2074],...,[801,2139]],
      "text": "Choco"
    }, {
      "id": 3,
      "bbox":[[1035,2074],...,[1035,2147]],
      "text": "Mochi"
    }, {
      "id": 4,
      "bbox":[[761,2172],...,[761,2253]],
      "text": "14.000"
    }, …, {
      "id": 22,
      "bbox":[[1573,3030],...,[1571,3126]],
      "text": "50.000"
    }
  ]
}

Detection! Recognition! Parsing!

OCR

Conventional Models

… 3002-Kyoto Choco Mochi 14, 000 …

Transformer Backbone (BERT, LayoutLM, …)

(Off-the-shelf) OCR Engine

“3002-Kyoto Choco Mochi”

START END

Conventional Models

… 3002-Kyoto Choco Mochi 14, 000 …

B-name I-name I-name B-price I-price

Transformer Backbone (BERT, LayoutLM, …)

(Off-the-shelf) OCR Engine

{ "menu": [
    {
      "nm": "3002-Kyoto Choco Mochi",
      "unitprice": "14.000",
      "cnt": "x2",
      "price": "28.000"
    }, … }
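The step from per-token BIO tags to structured fields is a post-processing pass that groups consecutive tokens of the same field. The sketch below is one possible grouping routine, not the pipeline used in any specific product. The `decode_bio` helper is hypothetical, and the tokens follow the OCR text shown in the example.

```python
def decode_bio(tokens, tags):
    """Group tokens into (field, text) spans following B-/I- tag prefixes."""
    spans = []
    current_field, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, field = tag.partition("-")
        if prefix == "B" or (prefix == "I" and field != current_field):
            # A new span starts: flush the open one first.
            if current_tokens:
                spans.append((current_field, " ".join(current_tokens)))
            current_field, current_tokens = field, [token]
        elif prefix == "I":
            current_tokens.append(token)
        else:
            # "O" (outside) or any unknown tag closes the open span.
            if current_tokens:
                spans.append((current_field, " ".join(current_tokens)))
            current_field, current_tokens = None, []
    if current_tokens:
        spans.append((current_field, " ".join(current_tokens)))
    return spans


tokens = ["3002-Kyoto", "Choco", "Mochi", "14,", "000"]
tags = ["B-name", "I-name", "I-name", "B-price", "I-price"]
spans = decode_bio(tokens, tags)
print(spans)
```

Any OCR mistake in `tokens` passes straight into the extracted fields, which is the error-propagation issue of OCR-dependent pipelines.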

Conventional Models

Input Image

(Off-the-shelf) OCR Engine

Backbone (BERT-like)

BIO-Tags / Answer Token Span / etc

Output

high computational costs
inflexibility of OCR on languages or document type
OCR error propagation

Conventional Models vs. Donut

AS-IS vs. TO-BE

AS-IS (Conventional Models): Input Image → (Off-the-shelf) OCR Engine → Backbone (BERT-like) → BIO-Tags / Answer Token Span / etc. → Output

TO-BE: Input Image → Donut 🍩 (End-to-end Model) → Token Sequence → Output

Donut Architecture

Donut 🍩: transformer encoder, transformer decoder

Input Image and Prompt
<vqa><question>what is the price of choco mochi?</question><answer>

Output Sequence
<class>receipt</class></classification>
14,000</answer></vqa>
<item><name>3002-Kyoto Choco Mochi</name>・・・ </parsing>

Converted JSON
{ "class":"receipt" }
{ "question": "what is the price of choco mochi?", "answer": "14,000" }
{ "items": [{"name": "3002-Kyoto Choco Mochi", "count": 2, "unitprice": 14000, …}], … }
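The conversion from a tagged output sequence to JSON can be sketched with a small recursive parser. This is a simplified illustration, not the conversion code shipped with Donut. The `sequence_to_json` helper and its regular expression are assumptions, it ignores stray task tokens such as `</classification>`, and it does not handle a field nested inside a field of the same name.

```python
import json
import re

# Matches <field>body</field> where the closing tag repeats the opening name.
TAG = re.compile(r"<(?P<name>[a-z_]+)>(?P<body>.*?)</(?P=name)>", re.DOTALL)


def sequence_to_json(sequence):
    """Recursively turn <field>value</field> markup into nested dicts and strings."""
    matches = list(TAG.finditer(sequence))
    if not matches:
        return sequence.strip()
    result = {}
    for match in matches:
        name = match.group("name")
        value = sequence_to_json(match.group("body"))
        if name in result:
            # A repeated field becomes a list of values.
            if not isinstance(result[name], list):
                result[name] = [result[name]]
            result[name].append(value)
        else:
            result[name] = value
    return result


classification = sequence_to_json("<class>receipt</class></classification>")
vqa = sequence_to_json(
    "<question>what is the price of choco mochi?</question><answer>14,000</answer>"
)
parsing = sequence_to_json("<item><name>3002-Kyoto Choco Mochi</name></item>")

print(json.dumps(classification))
print(json.dumps(vqa))
print(json.dumps(parsing))
```

Because every task is expressed as a token sequence with this kind of markup, a single decoder can serve classification, VQA, and parsing without task-specific output heads.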

Model Training and Inference Overview

Proposed Pre-Training Task is Simple

Donut 🍩 (End-to-end Model)

In terms of what that can teach, in what from …

Model

…

This can be interpreted as token classification at each step.

in terms of what

terms of what

Minimize Cross Entropy
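Treating each decoding step as a classification over the vocabulary, the training loss is the average negative log-probability of the correct next token. The sketch below computes that quantity in plain Python. The four-word vocabulary and the logit values are hypothetical, chosen only to contrast a confident model with a uniform one.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_cross_entropy(step_logits, target_ids):
    """Mean negative log-likelihood of the target token at each decoding step."""
    losses = []
    for logits, target in zip(step_logits, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


vocab = ["in", "terms", "of", "what"]
targets = [vocab.index(t) for t in ["terms", "of", "what"]]

# Hypothetical logits: one model is confident in the right token, one is uniform.
confident = [[0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0], [0.0, 0.0, 0.0, 5.0]]
uniform = [[0.0] * len(vocab) for _ in targets]

loss_confident = sequence_cross_entropy(confident, targets)
loss_uniform = sequence_cross_entropy(uniform, targets)
print(loss_confident, loss_uniform)
```

The uniform model's loss equals log(4), the cost of guessing among four tokens, while the confident model's loss is close to zero.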

Document Parsing

https://github.com/clovaai/donut

Document VQA

Q: What is the Extension Number as per the voucher?

A: (910) 741–0673

Recent Publications

(VLM) Donut 🍩: OCR-free Document Understanding Transformer
Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park
ECCV 2022.

(LLM-Based LVLM) Cream🍦: Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models
Geewook Kim, Hodong Lee, Daehee Kim, Haeji Jung, Sanghee Park, Yoonsik Kim, Sangdoo Yun, Taeho Kil, Bado Lee, and Seunghyun Park
EMNLP 2023.

LVLM vs. LLM-Based LVLM

Conventional LLM-Based LVLM

Types
Input OCR: OCR, LLM
Input Vision Embeddings: Vision Module, LLM
Input Both: OCR, Vision Module, LLM
(Ours) Input Both, but Efficient: OCR (optional), Vision Module, LLM

Cream Architecture

Cream utilizes two encoders: a vision encoder and an auxiliary encoder.

Cream Architecture – Two Encoders

Cream Architecture – Contrastive Learning

Cream Architecture

The outputs of the two encoders are used in the decoder module.

Cream Training

Evaluation Benchmarks

Key Results

Analysis – OCR Robustness

Analysis – VLM vs. LLM-Based LVLM

EOD

Q&A

gw.kim@navercorp.com
