1 of 49

Multimodal Structured Generation & CVPR’s 2nd MMFM Challenge

By Franz Louis Cesista

franzlouiscesista@gmail.com

leloykun.github.io

2 of 49

Outline

  1. A brief overview of vision-language models (VLMs)
  2. A brief description of CVPR's Multimodal Foundation Models (MMFM) Challenge
  3. An overview of my approach, Multimodal Structured Generation
  4. Results
  5. Four possible reasons why current VLMs suck at document-understanding tasks, and what to do about them
  6. Bonus demo: Interleaved Multimodal Structured Generation

3 of 49

4 of 49

Types of Vision-Language Models (VLMs)

Where does interaction between modalities happen?

  • Before the encoder (early interaction): Chameleon
  • Within (the layers of) the encoder (cross interaction): Llama 3.1
  • After the encoder (late interaction): CLIP / LLaVA

5 of 49

Early-interaction VLM: Chameleon

6 of 49

Cross-interaction VLM: Llama 3.1

7 of 49

Late-interaction VLM: LLaVA

8 of 49

9 of 49

10 of 49

Do Multimodal Foundation Models still suck at document understanding tasks?

Spoiler: kinda

11 of 49

Phase 1: 10 public document-understanding datasets

12 of 49

Phase 2: 3 private test datasets

13 of 49

Phase 2: 3 private test datasets -- (1) MyDoc

<image> What is the address 1 in the image?

14 of 49

Phase 2: 3 private test datasets -- (2) MyChart

<image> Can you explain why the income from discontinued operations is (0.01)?

15 of 49

Phase 2: 3 private test datasets -- (3) MyInfographic

<image> Are there any icons or graphics that suggest a particular focus for the data?

16 of 49

17 of 49

My Approach:

Multimodal Structured Generation

Context

  • I joined < 48 hours before the deadline
  • I wasted 24+ hours working with commercial models (which weren’t allowed)
  • My laptop is 5 years old
  • I was on a student budget

https://github.com/leloykun/MMFM-Challenge

18 of 49

GPUs


19 of 49

What I couldn’t do under these constraints

  • No finetuning, because I didn’t have GPUs
  • No Retrieval-Augmented Generation (RAG), because I didn’t have the time to implement it

20 of 49

Yet, I managed to place 2nd on the hidden test set

21 of 49

So, how did I do it?

22 of 49

JSON object or whatever

23 of 49

To what end?

24 of 49

To force the models to reason before answering!

25 of 49

Structured Generation with e.g. Outlines also gives us more control over how the models “think”!

Controlled reasoning!

Hallucination-free outputs!
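As a concrete sketch of what this looks like (assuming the pre-1.0 Outlines API and an example Hermes 2 Pro checkpoint; this is not the exact challenge code), the JSON schema below puts a `reasoning` field before the `answer` field, so the model has to write out its reasoning before it can commit to an answer:

```python
# Minimal sketch of the reason-before-answer trick with Outlines (pre-1.0 API).
# The schema forces the model to fill in `reasoning` before it may emit `answer`.
from pydantic import BaseModel
import outlines

class ReasonedAnswer(BaseModel):
    reasoning: str  # generated first: the model's step-by-step thinking
    answer: str     # generated last: the final, short answer

# Example checkpoint; any HF causal LM that Outlines can load works here.
model = outlines.models.transformers("NousResearch/Hermes-2-Pro-Mistral-7B")
generator = outlines.generate.json(model, ReasonedAnswer)

prompt = (
    "Document (OCR text): ...\n"
    "Question: What is the address 1 in the image?\n"
    "Respond in JSON."
)
result = generator(prompt)  # guaranteed to parse into a ReasonedAnswer
print(result.reasoning)
print(result.answer)
```

Because decoding is constrained to this schema, the output always parses as valid JSON, and the answer tokens can only be produced after the reasoning tokens have been laid down.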

26 of 49

Folks at .TXT (Outlines) actually beat me to it:

Apr 24, 2024 -- https://blog.dottxt.co/prompt-efficiency.html

27 of 49

LLaVA-1.6 + Structured Generation performed the best on MyChart & MyInfographic… but not so much on MyDoc.

[Per-dataset result charts: MyDoc, MyChart, MyInfographic]

28 of 49

For MyDoc, I had to revert to using an LLM…

[Chart: MyDoc scores for Vision-Language Model + Structured Generation vs. Large Language Model + Structured Generation]

https://huggingface.co/leloy/Nous-Hermes-2-Pro-Docile-RASG-1ShotRetrieval-StructuredPrompt
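Concretely, "reverting to an LLM" here means dropping the visual modality and working from OCR text alone. Below is a minimal sketch of that pattern (the OCR engine, checkpoint name, and prompt format are assumptions, not the exact challenge pipeline):

```python
# Sketch of the text-only fallback for MyDoc: OCR the page, then answer with a
# text-only LLM under a JSON-schema constraint (pre-1.0 Outlines API assumed).
from PIL import Image
import pytesseract
import outlines
from pydantic import BaseModel

class KIEAnswer(BaseModel):
    reasoning: str  # where in the OCR text the value was found
    answer: str     # the extracted value

# 1) Keep only the textual modality: OCR the document image.
page_text = pytesseract.image_to_string(Image.open("mydoc_page.png"))

# 2) Constrained generation with a text-only LLM.
llm = outlines.models.transformers("NousResearch/Hermes-2-Pro-Mistral-7B")
generate = outlines.generate.json(llm, KIEAnswer)

question = "What is the address 1 in the image?"
result = generate(f"Document:\n{page_text}\n\nQuestion: {question}\n")
print(result.answer)
```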

29 of 49

Why? A brief review of literature…

30 of 49

There are three modalities of information you can extract from a document:

  • Textual Information
  • Visual Information (“what stuff is in the doc?”)
  • Layout Information (“where is the stuff in the doc?”)

31 of 49

DocLLM has shown that removing the vision encoder and treating bounding boxes (i.e. layout information) as a separate modality does not harm performance on document-understanding tasks…

32 of 49

There are three modalities of information you can extract from a document:

  • Textual Information
  • Visual Information (“what stuff is in the doc?”)
  • Layout Information (“where is the stuff in the doc?”)

33 of 49

Our previous work has shown that removing layout information does not harm performance on the Key-Information Extraction task either…

34 of 49

There are three modalities of information you can extract from a document:

  • Textual Information
  • Visual Information (“what stuff is in the doc?”)
  • Layout Information (“where is the stuff in the doc?”)

(at least for Key-Information Extraction)

35 of 49

36 of 49

37 of 49

Final Results

38 of 49

Results for the hidden test set

[Leaderboard screenshot; two entries are annotated “mine” and “also mine”]

39 of 49

Results for hidden test set by task

Task          | Best Approach                                   | Score
MyDoc         | Nous Hermes 2 Pro (LLM) + Structured Generation | 62.25%
MyChart       | LLaVA-1.6 (VLM) + Structured Generation         | 4.50%
MyInfographic | LLaVA-1.6 (VLM) + Structured Generation         | 60.98% (vs. 21% with plain LLaVA-1.6)

40 of 49

41 of 49

Why did an LLM outperform a VLM on the MyDoc dataset?

42 of 49

Hypothesis 1: Visual and layout information are simply not important for Key-Information Extraction

Retrieval Augmented Structured Generation: Business Document Information Extraction as Tool Use

arXiv: 2405.20245

43 of 49

Hypothesis 2: LLMs can already infer the location of the words in the image from their index in the prompt

What if this also applies in the 2D case?
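One way to see the intuition: if OCR words are fed to the model in reading order, a word's index in the prompt already correlates with its position on the page. The sketch below is purely illustrative (the helper and coordinates are made up, not from the talk):

```python
# Hypothetical illustration of Hypothesis 2: serializing OCR words in reading
# order makes the token index a rough proxy for the word's 2D position, so the
# LLM may not need explicit bounding boxes at all.
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, x0, y0) as an OCR engine might return

def serialize_in_reading_order(words: List[Word], line_tol: float = 10.0) -> str:
    """Sort words top-to-bottom, then left-to-right, and join them.
    After this, a word's index loosely encodes its (row, column)."""
    ordered = sorted(words, key=lambda w: (round(w[2] / line_tol), w[1]))
    return " ".join(w[0] for w in ordered)

ocr = [
    ("Total:", 40, 300), ("$12.50", 100, 300),
    ("Invoice", 40, 20), ("#0042", 120, 20),
]
print(serialize_in_reading_order(ocr))
# -> "Invoice #0042 Total: $12.50"
```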

44 of 49

Hypothesis 3: The vision-language models are simply over capacity.

  • We already train LLMs to their full capacity according to neural scaling laws
  • Grafting on the vision encoders pushes them over the edge

45 of 49

Hypothesis 4: We are not using enough image tokens

Matryoshka Multimodal Models, arXiv: 2405.17430

46 of 49

Document Understanding requires MORE tokens

47 of 49

48 of 49

Demo: Interleaved Multimodal Structured Generation

github.com/leloykun/mmsg

49 of 49