1 of 49

Multimodal Structured Generation & CVPR’s 2nd MMFM Challenge

By Franz Louis Cesista

franzlouiscesista@gmail.com

leloykun.github.io

2 of 49

Outline

  1. A brief overview of vision-language models (VLMs)
  2. A brief description of CVPR's Multimodal Foundation Models (MMFM) Challenge
  3. An overview of my approach, Multimodal Structured Generation
  4. Results
  5. Four possible reasons why current VLMs suck at document-understanding tasks, and what to do about them
  6. Bonus demo: Interleaved Multimodal Structured Generation

3 of 49

4 of 49

Types of Vision-Language Models (VLMs)

Where does interaction between modalities happen?

  • Before the encoder (early interaction): Chameleon
  • Within (the layers of) the encoder (cross interaction): Llama 3.1
  • After the encoder (late interaction): CLIP / LLaVA

5 of 49

Early-interaction VLM: Chameleon

6 of 49

Cross-interaction VLM: Llama 3.1

7 of 49

Late-interaction VLM: LLaVA

8 of 49

9 of 49

10 of 49

Do Multimodal Foundation Models still suck at document understanding tasks?

Spoiler: kinda

11 of 49

Phase 1: 10 public document-understanding datasets

12 of 49

Phase 2: 3 private test datasets

13 of 49

Phase 2: 3 private test datasets -- (1) MyDoc

<image> What is the address 1 in the image?

14 of 49

Phase 2: 3 private test datasets -- (2) MyChart

<image> Can you explain why the income from discontinued operations is (0.01)?

15 of 49

Phase 2: 3 private test datasets -- (3) MyInfographic

<image> Are there any icons or graphics that suggest a particular focus for the data?

16 of 49

17 of 49

My Approach:

Multimodal Structured Generation

Context

  • I joined < 48 hours before the deadline
  • I wasted 24+ hours working with commercial models (which weren’t allowed)
  • My laptop is 5 years old
  • I was on a student budget

https://github.com/leloykun/MMFM-Challenge

18 of 49

GPUs


19 of 49

What I couldn’t do under these constraints

  • No finetuning, because I didn’t have GPUs
  • No Retrieval-Augmented Generation (RAG), because I didn’t have the time to implement it

20 of 49

Yet, I managed to place 2nd on the hidden test set

21 of 49

So, how did I do it?

22 of 49

JSON object or whatever

23 of 49

To what end?

24 of 49

To force the models to reason before answering!

25 of 49

Structured Generation with e.g. Outlines also gives us more control over how the models “think”!

Controlled reasoning!

Hallucination-free outputs!
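As a concrete sketch of what this looks like (assuming the pre-1.0 Outlines API and an example Hermes 2 Pro checkpoint; this is not the exact challenge code), the JSON schema below puts a `reasoning` field before the `answer` field, so the model has to write out its reasoning before it can commit to an answer:

```python
# Minimal sketch of the reason-before-answer trick with Outlines (pre-1.0 API).
# The schema forces the model to fill in `reasoning` before it may emit `answer`.
from pydantic import BaseModel
import outlines

class ReasonedAnswer(BaseModel):
    reasoning: str  # generated first: the model's step-by-step thinking
    answer: str     # generated last: the final, short answer

# Example checkpoint; any HF causal LM that Outlines can load works here.
model = outlines.models.transformers("NousResearch/Hermes-2-Pro-Mistral-7B")
generator = outlines.generate.json(model, ReasonedAnswer)

prompt = (
    "Document (OCR text): ...\n"
    "Question: What is the address 1 in the image?\n"
    "Respond in JSON."
)
result = generator(prompt)  # guaranteed to parse into a ReasonedAnswer
print(result.reasoning)
print(result.answer)
```

Because decoding is constrained to this schema, the output always parses as valid JSON, and the answer tokens can only be produced after the reasoning tokens have been laid down.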

26 of 49

Folks at .TXT (Outlines) actually beat me to it:

Apr 24, 2024 -- https://blog.dottxt.co/prompt-efficiency.html

27 of 49

LLaVA-1.6 + Structured Generation performed the best on MyChart & MyInfographic… but not so much on MyDoc.

[Per-dataset result charts: MyDoc, MyChart, MyInfographic]

28 of 49

For MyDoc, I had to revert to using an LLM…

[Chart: MyDoc scores for Vision-Language Model + Structured Generation vs. Large Language Model + Structured Generation]

https://huggingface.co/leloy/Nous-Hermes-2-Pro-Docile-RASG-1ShotRetrieval-StructuredPrompt
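Concretely, "reverting to an LLM" here means dropping the visual modality and working from OCR text alone. Below is a minimal sketch of that pattern (the OCR engine, checkpoint name, and prompt format are assumptions, not the exact challenge pipeline):

```python
# Sketch of the text-only fallback for MyDoc: OCR the page, then answer with a
# text-only LLM under a JSON-schema constraint (pre-1.0 Outlines API assumed).
from PIL import Image
import pytesseract
import outlines
from pydantic import BaseModel

class KIEAnswer(BaseModel):
    reasoning: str  # where in the OCR text the value was found
    answer: str     # the extracted value

# 1) Keep only the textual modality: OCR the document image.
page_text = pytesseract.image_to_string(Image.open("mydoc_page.png"))

# 2) Constrained generation with a text-only LLM.
llm = outlines.models.transformers("NousResearch/Hermes-2-Pro-Mistral-7B")
generate = outlines.generate.json(llm, KIEAnswer)

question = "What is the address 1 in the image?"
result = generate(f"Document:\n{page_text}\n\nQuestion: {question}\n")
print(result.answer)
```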

29 of 49

Why? A brief review of literature…

30 of 49

There are three modalities of information you can extract from a document:

  • Textual Information
  • Visual Information (“what stuff is in the doc?”)
  • Layout Information (“where is the stuff in the doc?”)

31 of 49

DocLLM has shown that removing the vision encoder and treating bounding boxes (i.e. layout information) as a separate modality does not harm performance on document-understanding tasks…

32 of 49

There are three modalities of information you can extract from a document:

  • Textual Information
  • Visual Information (“what stuff is in the doc?”)
  • Layout Information (“where is the stuff in the doc?”)

33 of 49

Our previous work has shown that removing layout information does not harm performance on the Key-Information Extraction task either…

34 of 49

There are three modalities of information you can extract from a document:

  • Textual Information
  • Visual Information (“what stuff is in the doc?”)
  • Layout Information (“where is the stuff in the doc?”)

(at least for Key-Information Extraction)

35 of 49

36 of 49

37 of 49

Final Results

38 of 49

Results for the hidden test set

[Leaderboard screenshot; two entries are annotated “mine” and “also mine”]

39 of 49

Results for hidden test set by task

Task          | Best Approach                                   | Score
MyDoc         | Nous Hermes 2 Pro (LLM) + Structured Generation | 62.25%
MyChart       | LLaVA-1.6 (VLM) + Structured Generation         | 4.50%
MyInfographic | LLaVA-1.6 (VLM) + Structured Generation         | 60.98% (vs. 21% with plain LLaVA-1.6)

40 of 49

41 of 49

Why did an LLM outperform a VLM on the MyDoc dataset?

42 of 49

Hypothesis 1: Visual and layout information are simply not important for Key-Information Extraction

Retrieval Augmented Structured Generation: Business Document Information Extraction as Tool Use

arXiv: 2405.20245

43 of 49

Hypothesis 2: LLMs can already infer the location of the words in the image from their index in the prompt

What if this also applies in the 2D case?
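One way to see the intuition: if OCR words are fed to the model in reading order, a word's index in the prompt already correlates with its position on the page. The sketch below is purely illustrative (the helper and coordinates are made up, not from the talk):

```python
# Hypothetical illustration of Hypothesis 2: serializing OCR words in reading
# order makes the token index a rough proxy for the word's 2D position, so the
# LLM may not need explicit bounding boxes at all.
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, x0, y0) as an OCR engine might return

def serialize_in_reading_order(words: List[Word], line_tol: float = 10.0) -> str:
    """Sort words top-to-bottom, then left-to-right, and join them.
    After this, a word's index loosely encodes its (row, column)."""
    ordered = sorted(words, key=lambda w: (round(w[2] / line_tol), w[1]))
    return " ".join(w[0] for w in ordered)

ocr = [
    ("Total:", 40, 300), ("$12.50", 100, 300),
    ("Invoice", 40, 20), ("#0042", 120, 20),
]
print(serialize_in_reading_order(ocr))
# -> "Invoice #0042 Total: $12.50"
```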

44 of 49

Hypothesis 3: The vision-language models are simply over capacity.

  • We already train LLMs to their full capacity according to neural scaling laws
  • Grafting on the vision encoders pushes them over the edge

45 of 49

Hypothesis 4: We are not using enough image tokens

Matryoshka Multimodal Models, arXiv: 2405.17430

46 of 49

Document Understanding requires MORE tokens

47 of 49

48 of 49

Demo: Interleaved Multimodal Structured Generation

github.com/leloykun/mmsg

49 of 49