2024.04.19 @ UOS
Geewook Kim
Vision-Language Models for Context-Rich Image Understanding Tasks
Geewook Kim @24.04.19 UOS
ⓒ NAVER Cloud Corp.
Language Model (LM)
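Example: predicting the character that follows 안녕하세 in the Korean greeting 안녕하세요 ("Hello")
Candidates for P(? | 안녕하세): 요, 용, 욥, 역, …
P(요 | 안녕하세) > P(역 | 안녕하세)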
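A minimal sketch of the idea, assuming a count-based character model rather than a neural one. The toy corpus, the variant string 안녕하세용, and the function names are made up for illustration.

```python
from collections import Counter, defaultdict

def train_next_char_model(corpus, context_len):
    """Count which character follows each context of length `context_len`."""
    counts = defaultdict(Counter)
    for text in corpus:
        for i in range(context_len, len(text)):
            context = text[i - context_len:i]
            counts[context][text[i]] += 1
    return counts

def next_char_probs(counts, context):
    """Estimate P(next char | context) from the counts."""
    total = sum(counts[context].values())
    return {ch: n / total for ch, n in counts[context].items()}

# Toy corpus (hypothetical): the standard greeting is more frequent than the variant.
corpus = ["안녕하세요"] * 8 + ["안녕하세용"] * 2
counts = train_next_char_model(corpus, context_len=4)
probs = next_char_probs(counts, "안녕하세")
print(probs)  # {'요': 0.8, '용': 0.2}
```

A neural LM replaces the count table with a learned function, but the output is the same kind of object: a probability distribution over the next token.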
Large Language Model (LLM)
https://medium.com/@harishdatalab/unveiling-the-power-of-large-language-models-llms-e235c4eba8a9
https://jalammar.github.io/illustrated-gpt2/
Application of LMs - Text-to-Text
Input text X -> Output text Y
https://dodnet.tistory.com/133
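Decoder Model (e.g., GPT): input x1 x2 x3 x4 y1 -> output x2 x3 x4 y1 y2, where each step models a next-token probability such as P(y1 | x1 x2 x3 x4)
Encoder-Decoder Model (e.g., T5, BART): encoder input x1 x2 x3 x4 -> decoder output y1 y2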
https://dodnet.tistory.com/133
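The two setups differ mainly in how a training pair (X, Y) is laid out. The sketch below shows one common arrangement; the function names and the `<s>` start token are assumptions for illustration, not taken from the slides.

```python
def decoder_only_example(x_tokens, y_tokens):
    """Decoder-only: X and Y form one sequence; the target is the input shifted by one."""
    sequence = x_tokens + y_tokens
    return sequence[:-1], sequence[1:]

def encoder_decoder_example(x_tokens, y_tokens, bos="<s>"):
    """Encoder-decoder: X goes to the encoder; the decoder predicts Y one token at a time."""
    encoder_input = x_tokens
    decoder_input = [bos] + y_tokens[:-1]  # previous target tokens, starting from <s>
    decoder_target = y_tokens
    return encoder_input, decoder_input, decoder_target

x = ["x1", "x2", "x3", "x4"]
y = ["y1", "y2"]

dec_in, dec_out = decoder_only_example(x, y)
print(dec_in)   # ['x1', 'x2', 'x3', 'x4', 'y1']
print(dec_out)  # ['x2', 'x3', 'x4', 'y1', 'y2']

enc_in, ed_in, ed_out = encoder_decoder_example(x, y)
print(enc_in, ed_in, ed_out)
```

In practice a decoder-only model is often trained with the loss restricted to the Y positions, but the shifted layout stays the same.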
Vision-Language Model (VLM) - Image-to-Text
Text -> Image: the text input x1 x2 x3 x4 is replaced by an image, while the output y1 y2 remains text
https://www.slideshare.net/deeplearningitalia/transformers-in-vision-from-zero-to-hero-dlipptx
Use ViT as Encoder: the image takes the place of the input tokens x1 x2 x3 x4, and the decoder generates the text y1 y2
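A ViT turns an image into a token sequence by cutting it into fixed-size patches. The sketch below shows only that splitting step on a tiny single-channel image stored as nested lists. The function name and the 4x4 toy image are assumptions; a real ViT additionally projects each patch to an embedding and adds position information.

```python
def patchify(image, patch_size):
    """Split an H x W image (list of rows) into flattened patch_size x patch_size patches."""
    height, width = len(image), len(image[0])
    assert height % patch_size == 0 and width % patch_size == 0
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# Toy 4x4 "image" whose pixel values are 0..15 in row-major order.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(image, patch_size=2)
print(len(patches))  # 4 patches, playing the role of x1 x2 x3 x4
print(patches[0])    # [0, 1, 4, 5]
```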
Vision-Language Model (VLM)
Large Vision-Language Model (LVLM)
https://arxiv.org/pdf/2210.03347.pdf
https://openaccess.thecvf.com/content/CVPR2022/papers/Hu_Scaling_Up_Vision-Language_Pre-Training_for_Image_Captioning_CVPR_2022_paper.pdf
Better CLIP for Better VLM
https://arxiv.org/pdf/2103.00020.pdf
LLM-Based LVLM
https://llava-vl.github.io/
https://vaclavkosar.com/ml/Encoder-only-Decoder-only-vs-Encoder-Decoder-Transfomer
Context-Rich Image Understanding – ex. Document/Infographic VQA
https://openaccess.thecvf.com/content/WACV2022/supplemental/Mathew_InfographicVQA_WACV_2022_supplemental.pdf
Context-Rich Image Understanding – ex. Chart/Diagram VQA
https://arxiv.org/pdf/1603.07396.pdf
https://aclanthology.org/2022.findings-acl.177.pdf
Context-Rich Image Understanding – Service Products in NAVER
"You visited this place!: Finding the store on a receipt in a photo." DEVIEW 2021: https://deview.kr/2021/sessions/524
Recent Publications
(VLM) Donut 🍩: OCR-free Document Understanding Transformer
Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park
ECCV 2022.
(LLM-Based LVLM) Cream 🍦: Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models
Geewook Kim, Hodong Lee, Daehee Kim, Haeji Jung, Sanghee Park, Yoonsik Kim, Sangdoo Yun, Taeho Kil, Bado Lee, and Seunghyun Park
EMNLP 2023.
Context-Rich Image Understanding: Visual Document Understanding (VDU)
Input Image -> VDU Model -> Useful Information
{ "class": "receipt" }
{ "menu": [
    {
      "nm": "3002-Kyoto Choco Mochi",
      "unitprice": "14.000",
      "cnt": "x2",
      "price": "28.000"
    }, …
  ]
}
Conventional Pipeline
Input (image) -> Output (structured JSON):
{ "items": [
    {
      "name": "3002-Kyoto Choco Mochi",
      "count": 2,
      "priceInfo": {
        "unitPrice": 14000,
        "price": 28000
      }
    }, {
      "name": "1001 - Choco Bun",
      "count": 1,
      "priceInfo": {
        "unitPrice": 22000,
        "price": 22000
      }
    }, ...
  ],
  "total": [ {
      "menuqty_cnt": 4,
      "total_price": 50000
    }
  ]
}
OCR (Detection, Recognition) result, which is then passed to Parsing:
{ "words": [ {
      "id": 1,
      "bbox": [[360,2048],...,[355,2127]],
      "text": "3002-Kyoto"
    }, {
      "id": 2,
      "bbox": [[801,2074],...,[801,2139]],
      "text": "Choco"
    }, {
      "id": 3,
      "bbox": [[1035,2074],...,[1035,2147]],
      "text": "Mochi"
    }, {
      "id": 4,
      "bbox": [[761,2172],...,[761,2253]],
      "text": "14.000"
    }, …, {
      "id": 22,
      "bbox": [[1573,3030],...,[1571,3126]],
      "text": "50.000"
    }
  ]
}
Conventional Models
(Off-the-shelf) OCR Engine -> … 3002-Kyoto Choco Mochi 14, 000 … -> Transformer Backbone (BERT, LayoutLM, …)
Answer token span: START and END mark “3002-Kyoto Choco Mochi”
(Off-the-shelf) OCR Engine -> Transformer Backbone (BERT, LayoutLM, …) -> BIO tags
Tokens: … 3002-Kyoto Choco Mochi 14, 000 …
Tags: B-name I-name I-name B-price I-price
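The BIO tags only label individual tokens, so a post-processing step has to merge them back into fields. The sketch below is one simple way to do that, assuming the OCR text has already been tokenized as shown; `decode_bio` is a hypothetical helper, not part of any specific library.

```python
def decode_bio(tokens, tags):
    """Group tokens into (field, text) pairs according to their BIO tags."""
    fields = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A new field starts: flush the previous one, if any.
            if current_tokens:
                fields.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = label, [token]
        elif prefix == "I":
            current_tokens.append(token)
        else:
            # "O": the token belongs to no field.
            if current_tokens:
                fields.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_tokens:
        fields.append((current_label, " ".join(current_tokens)))
    return fields

tokens = ["3002-Kyoto", "Choco", "Mochi", "14,", "000"]
tags = ["B-name", "I-name", "I-name", "B-price", "I-price"]
print(decode_bio(tokens, tags))
# [('name', '3002-Kyoto Choco Mochi'), ('price', '14, 000')]
```

Note that any OCR error in the tokens (such as the split "14, 000") is carried straight into the extracted field, which is one motivation for OCR-free models.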
Conventional Models: Input Image -> (Off-the-shelf) OCR Engine -> Backbone (BERT-like) -> BIO-Tags / Answer Token Span / etc. -> Output
Conventional Models vs. Donut (AS-IS vs. TO-BE)
AS-IS (Conventional Models): Input Image -> (Off-the-shelf) OCR Engine -> Backbone (BERT-like) -> BIO-Tags / Answer Token Span / etc. -> Output
TO-BE (Donut 🍩, End-to-end Model): Input Image -> Donut 🍩 -> Token Sequence -> Output
Donut Architecture
Donut 🍩: Input Image -> transformer encoder; Prompt -> transformer decoder -> Output Sequence -> Converted JSON

Document classification
Prompt: <classification>
Output Sequence: <class>receipt</class></classification>
Converted JSON: { "class": "receipt" }

Document VQA
Prompt: <vqa><question>what is the price of choco mochi?</question><answer>
Output Sequence: 14,000</answer></vqa>
Converted JSON: { "question": "what is the price of choco mochi?", "answer": "14,000" }

Document parsing
Prompt: <parsing>
Output Sequence: <item><name>3002-Kyoto Choco Mochi</name>・・・ </parsing>
Converted JSON: { "items": [{"name": "3002-Kyoto Choco Mochi", "count": 2, "unitprice": 14000, …}], … }
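Because the output sequence uses paired tags, it can be mapped to JSON mechanically. The following is a simplified, hypothetical converter written for illustration; it is not the converter shipped with the Donut repository and does not handle tags nested inside a tag of the same name.

```python
import json
import re

# Matches <key>...</key> with the same name on both sides.
TAG_PATTERN = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)

def sequence_to_dict(sequence):
    """Convert nested <key>value</key> markup into nested dicts (simplified sketch)."""
    matches = list(TAG_PATTERN.finditer(sequence))
    if not matches:
        return sequence.strip()  # leaf: plain text value
    result = {}
    for match in matches:
        key, inner = match.group(1), match.group(2)
        value = sequence_to_dict(inner)
        if key in result:
            # Repeated key: collect the values in a list.
            if not isinstance(result[key], list):
                result[key] = [result[key]]
            result[key].append(value)
        else:
            result[key] = value
    return result

prompt = "<vqa><question>what is the price of choco mochi?</question><answer>"
output = "14,000</answer></vqa>"
parsed = sequence_to_dict(prompt + output)["vqa"]
print(json.dumps(parsed))
# {"question": "what is the price of choco mochi?", "answer": "14,000"}
```

The prompt and the generated tokens are concatenated first so that every opening tag has its closing tag, and the outer task tag is then dropped to obtain the JSON shown on the slide.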
Model Training and Inference Overview
Proposed Pre-Training Task is Simple
Donut 🍩 (End-to-end Model)
Task: read the text in the input image, e.g., "In terms of what that can teach, in what from …"
Model
Tokens fed to the model: in terms of what …
Next tokens to predict: terms of what …
This can be interpreted as a token classification at each step.
Minimize Cross Entropy
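Treating each step as a classification over the vocabulary, the loss is the average negative log-probability assigned to the correct next token. The sketch below computes it in plain Python; the four-word vocabulary and the logits are made up for illustration.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_cross_entropy(step_logits, target_ids):
    """Average negative log-likelihood of the target token at each step."""
    losses = []
    for logits, target in zip(step_logits, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

vocab = ["in", "terms", "of", "what"]
targets = [vocab.index(t) for t in ["terms", "of", "what"]]

# Hypothetical logits for three decoding steps.
uniform = [[0.0] * 4 for _ in range(3)]
confident = [[0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0], [0.0, 0.0, 0.0, 5.0]]

loss_uniform = sequence_cross_entropy(uniform, targets)      # equals log(4)
loss_confident = sequence_cross_entropy(confident, targets)  # lower than the uniform case
print(loss_uniform, loss_confident)
```

Training lowers this value by pushing probability mass toward the correct token at every step.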
Document Parsing
https://github.com/clovaai/donut
Document VQA
Q: What is the Extension Number as per the voucher?
A: (910) 741–0673
LVLM vs. LLM-Based LVLM
Conventional LLM-Based LVLM
Types:
Input OCR: OCR -> LLM
Input Vision Embeddings: Vision Module -> LLM
Input Both: OCR + Vision Module -> LLM
(Ours) Input Both, but Efficient: OCR + Vision Module -> LLM (one input is marked optional)
Cream Architecture
Cream utilizes two encoders: a vision encoder and an auxiliary encoder.
Cream Architecture – Two Encoders
Cream Architecture – Contrastive Learning
The outputs of the two encoders are used in the decoder module.
Cream Training
Evaluation Benchmarks
Key Results
Analysis – OCR Robustness
Analysis – VLM vs. LLM-Based LVLM
EOD
Q&A
gw.kim@navercorp.com