Multimodal Structured Generation & CVPR’s 2nd MMFM Challenge
By Franz Louis Cesista
franzlouiscesista@gmail.com
leloykun.github.io
Outline
Types of Vision-Language Models (VLMs)
Where does interaction between modalities happen?
Before the encoder: early interaction (Chameleon)
Within the layers of the encoder: cross interaction (Llama 3.1)
After the encoder: late interaction (CLIP / Llava)
Early-interaction VLM: Chameleon
Cross-interaction VLM: Llama 3.1
Late-interaction VLM: Llava
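A toy PyTorch sketch of where the modalities can meet. The modules and dimensions are illustrative stand-ins, not the real Chameleon / Llama 3.1 / Llava architectures:

```python
# Toy sketch of the three interaction patterns; everything here is a stand-in.
import torch
import torch.nn as nn

D = 64  # shared hidden size for the toy example

class EarlyInteractionVLM(nn.Module):
    """Chameleon-style: quantize the image into discrete tokens drawn from the
    same vocabulary as text, so a single transformer sees one token stream."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(2000, D)  # shared text + image-token vocabulary
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2)

    def forward(self, interleaved_ids):          # (B, T) interleaved text/image token ids
        return self.backbone(self.embed(interleaved_ids))

class CrossInteractionVLM(nn.Module):
    """Llama-3.1-style: a text backbone whose layers cross-attend to the
    output of a separate vision encoder."""
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(256, D)   # stands in for a ViT
        self.text_embed = nn.Embedding(1000, D)
        self.self_attn = nn.MultiheadAttention(D, 4, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(D, 4, batch_first=True)

    def forward(self, image_patches, text_ids):
        img = self.vision_encoder(image_patches)  # (B, N_img, D)
        txt = self.text_embed(text_ids)           # (B, N_txt, D)
        txt = txt + self.self_attn(txt, txt, txt)[0]
        txt = txt + self.cross_attn(txt, img, img)[0]  # modalities meet inside the layer
        return txt

class LateInteractionVLM(nn.Module):
    """Llava-style: fully encode the image, project it into the LLM's embedding
    space, then concatenate with the text embeddings before the LLM."""
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(256, D)
        self.projector = nn.Linear(D, D)          # the adapter / MLP connector
        self.text_embed = nn.Embedding(1000, D)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2)

    def forward(self, image_patches, text_ids):
        img_tokens = self.projector(self.vision_encoder(image_patches))
        txt_tokens = self.text_embed(text_ids)
        return self.backbone(torch.cat([img_tokens, txt_tokens], dim=1))
```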
Do Multimodal Foundation Models still suck at document understanding tasks?
Spoiler: kinda
Phase 1: 10 public document-understanding datasets
Phase 2: 3 private test datasets
Phase 2: 3 private test datasets -- (1) MyDoc
<image> What is the address 1 in the image?
Phase 2: 3 private test datasets -- (2) MyChart
<image> Can you explain why the income from discontinued operations is (0.01)?
Phase 2: 3 private test datasets -- (3) MyInfographic
<image> Are there any icons or graphics that suggest a particular focus for the data?
My Approach:
Multimodal Structured Generation
Context
https://github.com/leloykun/MMFM-Challenge
GPUs
What I couldn't do given the constraints
Yet, I managed to place 2nd on the hidden test set
So, how did I do it?
Structured Generation: constrain the model's output to a JSON object (or whatever format you need)
To what end?
To force the models to reason before answering!
Structured Generation with e.g. Outlines also gives us more control over how the models “think”!
Controlled reasoning!
Hallucination-free outputs!
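For example, here is a minimal sketch (assuming the Outlines 0.x API; the model and field names are illustrative) of a schema that forces the model to lay out its reasoning before it is allowed to answer:

```python
# Minimal sketch: a JSON schema that makes the model "reason" first.
# Assumes the Outlines 0.x API; model choice and field names are illustrative.
from pydantic import BaseModel
import outlines

class ReasonedAnswer(BaseModel):
    reasoning: str  # generated first: the model must lay out its reasoning...
    answer: str     # ...before it gets to produce the final answer

model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")
generator = outlines.generate.json(model, ReasonedAnswer)

result = generator("OCR text of the document...\nWhat is the address in the image?")
print(result.reasoning)
print(result.answer)
```

Because decoding is constrained to the schema, the output always parses as valid JSON, and putting the reasoning field before the answer field controls the order in which the model "thinks".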
Folks at .TXT (Outlines) actually beat me to it:
Apr 24, 2024 -- https://blog.dottxt.co/prompt-efficiency.html
Llava-1.6 + Structured Generation performed the best for MyChart & MyInfographic…
MyDoc: (not so much)
MyChart: ✅
MyInfographic: ✅
For MyDoc, I had to revert to using an LLM…
MyDoc
Vision-Language Model + Structured Generation
Large Language Model + Structured Generation ✅
https://huggingface.co/leloy/Nous-Hermes-2-Pro-Docile-RASG-1ShotRetrieval-StructuredPrompt
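Roughly, the idea behind that model (Retrieval Augmented Structured Generation, arXiv:2405.20245) is to retrieve one similar, already-annotated document, show it as a 1-shot example in a structured prompt, and constrain the output to the target schema. A sketch under those assumptions; the field names, prompt format, and the `retrieve_most_similar` helper are illustrative, not the exact pipeline:

```python
# Sketch of an RASG-style pipeline for MyDoc-like Key-Information Extraction.
# Field names, prompt format, and `retrieve_most_similar` are illustrative.
from pydantic import BaseModel
import outlines

class DocFields(BaseModel):
    address: str
    total_amount: str

def build_prompt(ocr_text: str, example_ocr: str, example_json: str) -> str:
    return (
        "Extract the key fields from the document as JSON.\n\n"
        f"### Example document:\n{example_ocr}\n"
        f"### Example output:\n{example_json}\n\n"
        f"### Document:\n{ocr_text}\n"
        "### Output:\n"
    )

model = outlines.models.transformers("NousResearch/Hermes-2-Pro-Mistral-7B")
generate_fields = outlines.generate.json(model, DocFields)

# `retrieve_most_similar` stands in for any nearest-neighbour search over
# embeddings of the labeled training documents (hypothetical helper):
# example_ocr, example_json = retrieve_most_similar(ocr_text)
# fields = generate_fields(build_prompt(ocr_text, example_ocr, example_json))
```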
Why? A brief review of the literature…
There are three modalities of information you can extract from a document: text, layout, and images.
DocLLM has shown that removing the vision encoder and treating bounding boxes (i.e., layout information) as its own modality does not harm performance on document understanding tasks…
Our previous work has shown that removing layout information does not harm performance on the Key-Information Extraction task either…
So the text modality alone may be enough (at least for Key-Information Extraction).
Final Results
Results for hidden test set
[Leaderboard chart: two of the submissions shown are mine]
Results for hidden test set by task
| Task | Best Approach | Score |
| --- | --- | --- |
| MyDoc | Nous Hermes 2 Pro (LLM) + Structured Generation | 62.25% |
| MyChart | Llava-1.6 (VLM) + Structured Generation | 4.50% |
| MyInfographic | Llava-1.6 (VLM) + Structured Generation | 60.98% |
(For MyDoc, that's 62.25% vs. 21% with Llava-1.6.)
Why did an LLM outperform a VLM on the MyDoc dataset?
Hypothesis 1: Visual and layout information are simply not important for Key-Information Extraction
Retrieval Augmented Structured Generation: Business Document Information Extraction as Tool Use (arXiv:2405.20245)
Hypothesis 2: LLMs can already infer the location of the words in the image from their index in the prompt
What if this also applies in the 2D case?
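A sketch of what that would mean in practice (the data structures are illustrative): if OCR words are serialized in reading order, a word's index in the prompt already carries a coarse 2D position, so explicit bounding boxes may be redundant.

```python
# Hypothesis 2, sketched: serialize OCR words top-to-bottom, left-to-right,
# so the 1D prompt index becomes a rough proxy for 2D location.
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    x0: float  # left edge of the word's bounding box
    y0: float  # top edge of the word's bounding box

def serialize_in_reading_order(words: list[Word], line_tolerance: float = 5.0) -> str:
    """Group words into lines by their y-coordinate, then read left to right."""
    ordered = sorted(words, key=lambda w: (round(w.y0 / line_tolerance), w.x0))
    return " ".join(w.text for w in ordered)

ocr = [Word("Total:", 10, 40), Word("Invoice", 10, 5), Word("$99", 60, 40), Word("#1234", 90, 5)]
print(serialize_in_reading_order(ocr))  # -> "Invoice #1234 Total: $99"
```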
Hypothesis 3: The vision-language models are simply at overcapacity.
Hypothesis 4: We are not using enough image tokens
Matryoshka Multimodal Models (arXiv:2405.17430)
Document Understanding requires MORE tokens
Demo: Interleaved Multimodal Structured Generation
github.com/leloykun/mmsg