Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models
Hengyi Wang, Haizhou Shi, Shiwei Tan,
Weiyi Qin, Wenyuan Wang, Tunyu Zhang, Akshay Nambi, Tanuja Ganu, Hao Wang
06/27/24
CONTENTS
1. Introduction
2. MMNeedle Benchmark
3. Experiments
4. Conclusion
1. Introduction
Needle-in-a-Haystack Test
Input: July 2010
What hard liquor, cigarettes, heroin, and crack have in common is that they're all more concentrated forms of less addictive predecessors.
Most if not all the things we describe as addictive are. And the
scary thing is, the process that created them is accelerating.
We wouldn't want to stop it. It's the same process that cures
diseases: technological progress. The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day. Technological progress means making things do more of what we want. When the thing we want is something we want to want, we consider technological progress good.
If some new technique makes solar cells x% more efficient, that
seems strictly better. When progress concentrates something we
don't want to want—when it transforms opium into heroin—it seems
bad. But it's the same process at work.
Question: What is the best thing to do in San Francisco?
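The construction behind this example can be sketched in a few lines of Python. This is a minimal illustration of the text-only test shown above, not the authors' code or the reference implementation in [1]: the function names `insert_needle`, `build_prompt`, and `is_correct`, the sentence-splitting rule, and the keyword-matching check are all assumptions made for illustration.

```python
import re


def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` at a sentence boundary of `haystack`.

    `depth` is the relative position: 0.0 puts the needle first,
    1.0 puts it last. (Hypothetical helper, for illustration only.)
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", haystack.strip())
    index = round(depth * len(sentences))
    return " ".join(sentences[:index] + [needle] + sentences[index:])


def build_prompt(context: str, question: str) -> str:
    """Format the context and question the way the slide shows them."""
    return f"Input: {context}\nQuestion: {question}"


def is_correct(answer: str, expected_phrases: list[str]) -> bool:
    """Assumed scoring rule: every expected phrase appears in the answer."""
    lowered = answer.lower()
    return all(phrase.lower() in lowered for phrase in expected_phrases)


if __name__ == "__main__":
    haystack = (
        "We wouldn't want to stop it. It's the same process that cures "
        "diseases: technological progress. Technological progress means "
        "making things do more of what we want."
    )
    needle = (
        "The best thing to do in San Francisco is eat a sandwich and "
        "sit in Dolores Park on a sunny day."
    )
    context = insert_needle(haystack, needle, depth=0.5)
    print(build_prompt(context, "What is the best thing to do in San Francisco?"))
```

Sweeping `depth` and the haystack length, then scoring the model's answers with a check such as `is_correct`, gives the usual retrieval-accuracy grid for this kind of test.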
Needle-in-a-Haystack Test
Challenges in Multimodal LLMs
2. MMNeedle Benchmark
Key Components
3. Experiments
Long-Context Capability
Hallucination on Negative Examples
Multi-Needle Evaluation
Statistical Significance
Effect of Context Length
4. Conclusion
Links and Resources
References
[1] G. Kamradt. Needle in a haystack - pressure testing LLMs. https://github.com/gkamradt/LLMTest_NeedleInAHaystack, 2023.
[2] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In Computer Vision - ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, pages 740-755. Springer, 2014.