1 of 65

2024.04.19 @ UOS

Geewook Kim

Vision-Language Models for Context-Rich Image Understanding Tasks

Geewook Kim @24.04.19 UOS

ⓒ NAVER Cloud Corp.

2 of 65

Language Model (LM)

안 녕 하 세 (the first four syllables of 안녕하세요, "hello" in Korean)

P(? | 안녕하세)

P(요 | 안녕하세) > P(역 | 안녕하세)
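The comparison above can be reproduced with a toy count-based character model. This is only a minimal sketch: the corpus, the function names, and the fixed four-character context are assumptions made for illustration, and real language models use neural networks rather than raw counts.

```python
from collections import Counter, defaultdict

def train_char_lm(corpus, context_len=4):
    """Count which character follows each context of `context_len` characters."""
    counts = defaultdict(Counter)
    for text in corpus:
        for i in range(context_len, len(text)):
            counts[text[i - context_len:i]][text[i]] += 1
    return counts

def next_char_probs(counts, context):
    """Normalize the counts for one context into a probability distribution."""
    total = sum(counts[context].values())
    return {ch: n / total for ch, n in counts[context].items()}

# Hypothetical toy corpus: the greeting three times, a typo ending in "역" once.
corpus = ["안녕하세요", "안녕하세요", "안녕하세요", "안녕하세역"]
probs = next_char_probs(train_char_lm(corpus), "안녕하세")
print(probs)
```

With this corpus the model assigns a higher probability to 요 than to 역, which is the inequality shown on the slide.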

3 of 65

Large Language Model (LLM)

https://medium.com/@harishdatalab/unveiling-the-power-of-large-language-models-llms-e235c4eba8a9

https://jalammar.github.io/illustrated-gpt2/

4 of 65

Large Language Model (LLM)

5 of 65

Application of LMs - Text-to-Text

X

Y

https://dodnet.tistory.com/133

6 of 65

Application of LMs - Text-to-Text

P(y1 | x1 x2 x3 x4)

Decoder Model (ex. GPT): input x1 x2 x3 x4 y1 -> output x2 x3 x4 y1 y2

Encoder-Decoder Model (ex. T5, BART): input x1 x2 x3 x4 -> output y1 y2
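The two layouts differ only in how the source tokens x and the target tokens y are arranged for training. A minimal sketch, with hypothetical helper names and plain strings standing in for token ids:

```python
def decoder_only_pair(x, y):
    """Decoder-only (GPT-style): one sequence; targets are the inputs shifted by one."""
    seq = x + y
    return seq[:-1], seq[1:]

def encoder_decoder_pair(x, y):
    """Encoder-decoder (T5/BART-style): the encoder reads x, the decoder produces y."""
    return x, y

x = ["x1", "x2", "x3", "x4"]
y = ["y1", "y2"]

dec_in, dec_out = decoder_only_pair(x, y)
enc_in, enc_out = encoder_decoder_pair(x, y)
print(dec_in, "->", dec_out)
print(enc_in, "->", enc_out)
```

In the decoder-only case every position predicts the next token, so the step that reads x4 is the one that models P(y1 | x1 x2 x3 x4).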

https://dodnet.tistory.com/133

7 of 65

Vision-Language Model (VLM) - Image-to-Text

x1 x2 x3 x4

Input: Text -> Image (the input sequence is now an image instead of text)

y1 y2

https://www.slideshare.net/deeplearningitalia/transformers-in-vision-from-zero-to-hero-dlipptx

8 of 65

Vision-Language Model (VLM) - Image-to-Text

x1 x2 x3 x4

Use ViT as Encoder

y1 y2
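A ViT-style encoder turns an image into a sequence by cutting it into fixed-size patches, so each patch plays the role of one input token x1, x2, and so on. The sketch below shows only that patch-splitting step on a hypothetical 4x4 grayscale image; the function name and sizes are assumptions, and the linear projection and position embeddings that a real ViT applies afterwards are omitted.

```python
def patchify(image, patch):
    """Split an H x W image (a list of rows) into flattened patch x patch blocks,
    scanned left to right, top to bottom."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            block = [image[r][c]
                     for r in range(top, top + patch)
                     for c in range(left, left + patch)]
            patches.append(block)
    return patches

# Hypothetical 4x4 image with pixel values 0..15, cut into 2x2 patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, patch=2)
print(len(tokens), tokens)
```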

9 of 65

Vision-Language Model (VLM) - Image-to-Text

10 of 65

Vision-Language Model (VLM)

11 of 65

Vision-Language Model (VLM)

12 of 65

Large Vision-Language Model (LVLM)

https://arxiv.org/pdf/2210.03347.pdf

https://openaccess.thecvf.com/content/CVPR2022/papers/Hu_Scaling_Up_Vision-Language_Pre-Training_for_Image_Captioning_CVPR_2022_paper.pdf

13 of 65

Better CLIP for Better VLM

https://arxiv.org/pdf/2103.00020.pdf

14 of 65

LLM-Based LVLM

https://llava-vl.github.io/

15 of 65

LLM-Based LVLM

https://llava-vl.github.io/

https://vaclavkosar.com/ml/Encoder-only-Decoder-only-vs-Encoder-Decoder-Transfomer

16 of 65

LLM-Based LVLM

https://llava-vl.github.io/

17 of 65

Context-Rich Image Understanding – ex. Document/Infographic VQA

https://openaccess.thecvf.com/content/WACV2022/supplemental/Mathew_InfographicVQA_WACV_2022_supplemental.pdf

18 of 65

Context-Rich Image Understanding – ex. Chart/Diagram VQA

https://arxiv.org/pdf/1603.07396.pdf

https://aclanthology.org/2022.findings-acl.177.pdf

19 of 65

Context-Rich Image Understanding – Service Products in NAVER

20 of 65

Context-Rich Image Understanding – Service Products in NAVER

21 of 65

Context-Rich Image Understanding – Service Products in NAVER

"So you visited this place!": Finding the store from a receipt in a photo. DEVIEW 2021: https://deview.kr/2021/sessions/524

22 of 65

Recent Publications

(VLM) Donut 🍩: OCR-free Document Understanding Transformer

Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park

ECCV 2022.

(LLM-Based LVLM) Cream🍦: Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models

Geewook Kim, Hodong Lee, Daehee Kim, Haeji Jung, Sanghee Park, Yoonsik Kim, Sangdoo Yun, Taeho Kil, Bado Lee, and Seunghyun Park

EMNLP 2023.

24 of 65

Context-Rich Image Understanding;

Visual Document Understanding

VDU Model

Useful Information

25 of 65

Context-Rich Image Understanding;

Visual Document Understanding

VDU Model

{ "class": "receipt" }

26 of 65

Context-Rich Image Understanding;

Visual Document Understanding

VDU Model

{ "menu": [
    {
      "nm": "3002-Kyoto Choco Mochi",
      "unitprice": "14.000",
      "cnt": "x2",
      "price": "28.000"
    }, … ] }

27 of 65

Conventional Pipeline

Input

Output

{ "items": [
    {
      "name": "3002-Kyoto Choco Mochi",
      "count": 2,
      "priceInfo": {
        "unitPrice": 14000,
        "price": 28000
      }
    }, {
      "name": "1001 - Choco Bun",
      "count": 1,
      "priceInfo": {
        "unitPrice": 22000,
        "price": 22000
      }
    }, ...
  ],
  "total": [ {
      "menuqty_cnt": 4,
      "total_price": 50000
    }
  ]
}

29 of 65

Conventional Pipeline

{ "items": [
    {
      "name": "3002-Kyoto Choco Mochi",
      "count": 2,
      "priceInfo": {
        "unitPrice": 14000,
        "price": 28000
      }
    }, {
      "name": "1001 - Choco Bun",
      "count": 1,
      "priceInfo": {
        "unitPrice": 22000,
        "price": 22000
      }
    }, ...
  ],
  "total": [ {
      "menuqty_cnt": 4,
      "total_price": 50000
    }
  ]
}

{ "words": [ {
      "id": 1,
      "bbox":[[360,2048],...,[355,2127]],
      "text": "3002-Kyoto"
    }, {
      "id": 2,
      "bbox":[[801,2074],...,[801,2139]],
      "text": "Choco"
    }, {
      "id": 3,
      "bbox":[[1035,2074],...,[1035,2147]],
      "text": "Mochi"
    }, {
      "id": 4,
      "bbox":[[761,2172],...,[761,2253]],
      "text": "14.000"
    }, …, {
      "id": 22,
      "bbox":[[1573,3030],...,[1571,3126]],
      "text": "50.000"
    }
  ]
}

Detection! Recognition! Parsing!

OCR

30 of 65

Conventional Models

3002-Kyoto Choco Mochi 14, 000

Transformer Backbone (BERT, LayoutLM, …)

(Off-the-shelf) OCR Engine

“3002-Kyoto Choco Mochi”

START END

31 of 65

Conventional Models

3002-Kyoto Choco Mochi 14, 000

B-name I-name I-name B-price I-price

Transformer Backbone (BERT, LayoutLM, …)

(Off-the-shelf) OCR Engine

32 of 65

Conventional Models

3002-Kyoto Choco Mochi 14, 000

B-name I-name I-name B-price I-price

Transformer Backbone (BERT, LayoutLM, …)

(Off-the-shelf) OCR Engine

{ "menu": [
    {
      "nm": "3002-Kyoto Choco Mochi",
      "unitprice": "14.000",
      "cnt": "x2",
      "price": "28.000"
    }, … ] }
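The step from BIO tags to structured fields can be sketched as a small grouping routine: a B- tag opens a field and the following I- tags with the same name extend it. The function name and the handling of stray tags are assumptions made for illustration; production pipelines typically add more post-processing than this.

```python
def bio_to_fields(tokens, tags):
    """Group OCR tokens into (field_name, text) pairs using B-/I- tags."""
    fields = []  # list of (field_name, [tokens])
    for token, tag in zip(tokens, tags):
        prefix, _, name = tag.partition("-")
        if prefix == "B":
            fields.append((name, [token]))
        elif prefix == "I" and fields and fields[-1][0] == name:
            fields[-1][1].append(token)
        # Anything else (e.g. "O" or an I- tag with no open field) is ignored here.
    return [(name, " ".join(parts)) for name, parts in fields]

tokens = ["3002-Kyoto", "Choco", "Mochi", "14,", "000"]
tags = ["B-name", "I-name", "I-name", "B-price", "I-price"]
fields = bio_to_fields(tokens, tags)
print(fields)
```

Note that the price comes out as "14, 000" because the OCR engine split it into two tokens; any error at the OCR stage is carried into the final fields.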

33 of 65

Conventional Models

Input Image -> (Off-the-shelf) OCR Engine -> Backbone (BERT-like) -> BIO-Tags / Answer Token Span / etc. -> Output

  • high computational costs
  • inflexibility of OCR across languages and document types
  • OCR error propagation

34 of 65

Conventional Models vs. Donut

AS-IS vs. TO-BE

AS-IS: Input Image -> (Off-the-shelf) OCR Engine -> Backbone (BERT-like) -> BIO-Tags / Answer Token Span / etc. -> Output

TO-BE: Input Image -> Donut 🍩 (End-to-end Model) -> Token Sequence -> Output

35 of 65

Donut Architecture

Input Image and Prompt -> Donut 🍩 (transformer encoder -> transformer decoder) -> Output Sequence -> Converted JSON

Prompt: <classification>
Output Sequence: <class>receipt</class></classification>
Converted JSON: { "class":"receipt" }

Prompt: <vqa><question>what is the price of choco mochi?</question><answer>
Output Sequence: 14,000</answer></vqa>
Converted JSON: { "question": "what is the price of choco mochi?", "answer": "14,000" }

Prompt: <parsing>
Output Sequence: <item><name>3002-Kyoto Choco Mochi</name>・・・ </parsing>
Converted JSON: { "items": [{"name": "3002-Kyoto Choco Mochi", "count": 2, "unitprice": 14000, …}], … }

40 of 65

Model Training and Inference Overview

41 of 65

Proposed Pre-Training Task is Simple

Donut 🍩 (End-to-end Model)

In terms of what that can teach, in what from …

42 of 65

Proposed Pre-Training Task is Simple

Model

This can be interpreted as a token classification at each step.

Input tokens: in terms of what

Target tokens: terms of what

Minimize Cross Entropy
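Treating each decoding step as a classification over the vocabulary, the training loss is the average negative log-probability of the correct next token. The sketch below computes it with the standard library only; the four-word vocabulary and the logit values are made up for illustration and do not come from the actual model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_cross_entropy(step_logits, target_ids):
    """Average of -log p(target token) over all decoding steps."""
    losses = []
    for logits, target in zip(step_logits, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

vocab = ["in", "terms", "of", "what"]
targets = [vocab.index(t) for t in ["terms", "of", "what"]]

# Hypothetical logits: one model that favors the right token, one that is uniform.
confident = [[0, 5, 0, 0], [0, 0, 5, 0], [0, 0, 0, 5]]
uniform = [[0, 0, 0, 0]] * 3

confident_loss = next_token_cross_entropy(confident, targets)
uniform_loss = next_token_cross_entropy(uniform, targets)
print(confident_loss, uniform_loss)
```

A model that spreads probability evenly over the four tokens pays log(4) per step, while one that concentrates probability on the correct token pays much less, which is what minimizing cross entropy rewards.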

43 of 65

Document Parsing

44 of 65

Document Parsing

https://github.com/clovaai/donut
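For parsing, the decoder's tagged output sequence has to be turned into JSON. The sketch below is a simplified illustration of that conversion using the tag style shown on the architecture slides; it is not the code from the repository above, the function name is hypothetical, and it does not handle a tag nested inside another tag of the same name.

```python
import json
import re

TAG_PATTERN = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)

def tokens_to_json(seq):
    """Convert '<key>value</key>' markup into nested dicts (simplified sketch)."""
    result = {}
    for key, inner in TAG_PATTERN.findall(seq):
        # Recurse when the content itself contains tags; otherwise keep the text.
        value = tokens_to_json(inner) if "<" in inner else inner.strip()
        if key in result:
            # A repeated key (e.g. several <item> entries) becomes a list.
            if not isinstance(result[key], list):
                result[key] = [result[key]]
            result[key].append(value)
        else:
            result[key] = value
    return result

# Prompt plus generated continuation, as in the VQA example on the slides.
vqa_seq = ("<vqa><question>what is the price of choco mochi?</question>"
           "<answer>14,000</answer></vqa>")
parsed = tokens_to_json(vqa_seq)
print(json.dumps(parsed["vqa"]))

# Two hypothetical items to show how repeated tags are collected.
items = tokens_to_json("<item><name>A</name></item><item><name>B</name></item>")
print(items)
```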

45 of 65

Document Parsing

46 of 65

Document VQA

Q: What is the Extension Number as per the voucher?

A: (910) 741–0673

47 of 65

Document VQA

48 of 65

Recent Publications

(VLM) Donut 🍩: OCR-free Document Understanding Transformer

Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park

ECCV 2022.

(LLM-Based LVLM) Cream🍦: Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models

Geewook Kim, Hodong Lee, Daehee Kim, Haeji Jung, Sanghee Park, Yoonsik Kim, Sangdoo Yun, Taeho Kil, Bado Lee, and Seunghyun Park

EMNLP 2023.

49 of 65

LVLM vs. LLM-Based LVLM

50 of 65

Conventional LLM-Based LVLM

Types

Input OCR: OCR -> LLM

Input Vision Embeddings: Vision Module -> LLM

Input Both: OCR + Vision Module -> LLM

52 of 65

Conventional LLM-Based LVLM

Types

Input OCR: OCR -> LLM

Input Vision Embeddings: Vision Module -> LLM

Input Both: OCR + Vision Module -> LLM

(Ours) Input Both, but Efficient: OCR + Vision Module -> LLM

optional

53 of 65

Cream Architecture

54 of 65

Cream Architecture

Cream utilizes two encoders: a vision encoder and an auxiliary encoder.

55 of 65

Cream Architecture – Two Encoders

56 of 65

Cream Architecture – Contrastive Learning

57 of 65

Cream Architecture

The outputs of the two encoders are used in the decoder module.

58 of 65

Cream Training

59 of 65

Evaluation Benchmarks

60 of 65

Key Results

61 of 65

Key Results

62 of 65

Analysis – OCR Robustness

63 of 65

Analysis – VLM vs. LLM-Based LVLM

64 of 65

Analysis – VLM vs. LLM-Based LVLM

65 of 65

EOD

Q&A

gw.kim@navercorp.com
