
Analyzing Hallucination in Multimodal Document Question Answering through Parser and Vision-Language Model Ablation

Bryan Ung, Atharv Santosh, Rohan Babbellapati, Rishaan Kotian, Phil Mui*
Mui Group, Department of Computer Science and Engineering

Abstract

Hallucination remains a persistent challenge in multimodal document question answering (MDQA), where models must integrate textual, visual, and structural cues across complex layouts. This study examines how removing specific document components affects factual reliability and reasoning accuracy. Using the MMLongBench-Doc dataset, 166 question–document pairs were evaluated across two settings: (1) vision-language models (VLMs) that answered directly, including GPT-4o (vision) and Gemini 2.5 Pro, and (2) retrieval-augmented generation (RAG) pipelines that combined a text parser (LlamaParse, Unstructured.io, or Docling) with GPT-4o as the answering model. Each system was tested under four conditions: image, table, and abstract ablation, in which the corresponding element was omitted during parsing or inference, plus an unmodified control. Performance was evaluated using accuracy, precision, recall, and F1 score. Gemini 2.5 Pro with images ablated achieved the highest accuracy (0.33), followed closely by GPT-4o (vision) (0.33). Removing images often improved factual accuracy, suggesting that visual features can introduce noise or conflicting cues. In contrast, removing tables reduced performance, highlighting their importance for quantitative reasoning. Among parser-based pipelines, Docling + GPT-4o performed best (0.21), followed by LlamaParse + GPT-4o (0.20) and Unstructured.io + GPT-4o (0.18). These results suggest that while VLMs outperform parser-RAG systems, their visual grounding remains imperfect and can still lead to hallucinated or inconsistent answers. Future work will extend this ablation framework to analyze how visual density, reference placement, and layout complexity shape multimodal comprehension and factual consistency across architectures.

Introduction

Large Language Models (LLMs) have recently expanded beyond text processing to handle documents that include text, tables, and figures [1], [2]. Models such as GPT-4o and Gemini 2.5 Pro can now take visual inputs like charts and diagrams, allowing them to perform document summarization and question answering across multiple formats [3], [4]. Many current systems still rely on document parsers such as LlamaParse, Docling, or Unstructured.io [5]-[7], which convert PDF layouts into plain text or markdown. While these parsers work well for extracting structure, they often lose important semantic relationships, which can reduce factual accuracy and increase hallucination [8]. Vision-Language Models (VLMs) take a different approach by interpreting visual content directly. This lets them reason across both textual and visual elements, but they still struggle with errors such as misreading axes or inventing nonexistent details [9]-[11]. Recent benchmarks like MMLongBench-Doc, ROPE, and HALLUCINOGEN have shown that hallucination remains a serious issue in multimodal document understanding, especially in scientific and biomedical contexts [12]-[14]. Several mitigation methods, including Summary-Guided Decoding and the REVERSE framework, have been proposed to improve factual grounding [15], [16]. However, hallucinations continue to appear even in the strongest models. Scientific papers are especially challenging because they are long, technical, and contain dense visual information. This project directly compares Vision-Language Models with parser-based pipelines to analyze how each handles different document components. By testing models under "ablation" settings, where elements such as tables, figures, or abstracts are selectively removed, we aim to better understand how visual and textual cues contribute to factual reliability in multimodal reasoning.

Results and Discussion

  • VLMs performed best overall
    • Gemini 2.5 Pro and GPT-4o (vision) scored far above parser pipelines
    • VLMs combine text and visuals better than parser-to-LLM workflows
  • Removing images sometimes improved accuracy
    • Image ablation boosted Gemini and GPT-4o (vision)
    • Certain visuals can introduce noise or mislead models (axes, figure layouts)
  • Removing tables consistently hurt accuracy
    • Tables provide strong grounding for quantitative reasoning
    • Model accuracy dropped across most systems without tables
  • Parser pipelines weaker overall but more stable for tricky visuals
    • Parsers avoid vision-based errors but lose context
    • Less powerful, but not misled by cluttered figures
  • Hybrid approach fits results
    • VLMs excel with helpful visuals but struggle with misleading ones
    • Parsers do the opposite
    • A hybrid could use VLMs for meaningful visuals, fall back to parsers for noisy ones, and cross-check answers for hallucination
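
The hybrid routing idea in the last bullet was not implemented in this study; the following is a minimal sketch of how such a router could be wired, assuming hypothetical answer_with_vlm and answer_with_parser_rag functions that wrap the two pipelines described in the Methodology section. The visual-density heuristic, threshold, and token-overlap cross-check are illustrative placeholders, not tuned components.

```python
from typing import Callable

def visual_density(num_images: int, num_pages: int) -> float:
    """Crude proxy for how image-heavy a document is (images per page)."""
    return num_images / max(num_pages, 1)

def hybrid_answer(
    question: str,
    doc_path: str,
    num_images: int,
    num_pages: int,
    answer_with_vlm: Callable[[str, str], str],          # hypothetical VLM pipeline
    answer_with_parser_rag: Callable[[str, str], str],   # hypothetical parser-RAG pipeline
    density_threshold: float = 2.0,                      # assumed cutoff, not tuned
) -> str:
    """Route visually noisy documents to the parser pipeline, otherwise use the VLM,
    then cross-check the answer against the other system for possible hallucination."""
    noisy = visual_density(num_images, num_pages) > density_threshold
    primary = answer_with_parser_rag if noisy else answer_with_vlm
    secondary = answer_with_vlm if noisy else answer_with_parser_rag

    answer = primary(question, doc_path)
    check = secondary(question, doc_path)

    # Naive cross-check: if the two systems share almost no tokens, flag the
    # answer as low confidence instead of silently returning it.
    overlap = set(answer.lower().split()) & set(check.lower().split())
    if len(overlap) < 2:
        return f"[LOW CONFIDENCE] {answer} (disagrees with cross-check: {check})"
    return answer
```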




Methodology

To evaluate each model accurately, we used MMLongBench-Doc, a preexisting dataset containing 135 lengthy PDF documents and 1,090 questions that test comprehension of their contents. From these, we kept only the academic papers, both for their typical abundance of figures and tables and for their greater complexity, which stresses the limits of the models. PDFs with 40 or more pages were also excluded because of API credit limits. In total, 166 questions were used.

Document elements were ablated with a mix of code and manual editing: images were removed programmatically with the PyMuPDF library, while tables and abstracts were removed manually with Adobe Acrobat.
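
As a rough illustration of the image-ablation step, the sketch below strips every embedded image from a PDF with PyMuPDF; the exact script used in this study may differ, and the file paths and output naming here are assumptions.

```python
import fitz  # PyMuPDF

def ablate_images(input_pdf: str, output_pdf: str) -> int:
    """Delete every embedded image from input_pdf and save the result.

    Returns the number of images removed. Text, tables, and layout are left
    untouched, which is what the image-ablation condition requires.
    """
    doc = fitz.open(input_pdf)
    removed = 0
    for page in doc:
        # full=True returns one entry per image reference; the xref is the first field.
        for img in page.get_images(full=True):
            xref = img[0]
            page.delete_image(xref)  # replaces the image with a blank placeholder
            removed += 1
    doc.save(output_pdf, garbage=4, deflate=True)  # compact the rewritten file
    doc.close()
    return removed

if __name__ == "__main__":
    count = ablate_images("paper.pdf", "paper_no_images.pdf")
    print(f"Removed {count} images")
```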

Tables were ablated because their structured, numeric content often captures quantitative relationships that running text fails to convey.

Images were ablated because of their visual and spatial cues: diagrams, graphs, and other visuals that communicate information. Removing them tested how much the models depended on non-textual sources for MDQA, and it also reduced the visual "noise" in the documents.

Abstracts were ablated for their high-level contextual summaries, which typically provide condensed versions of the paper’s goals, methods, and conclusions. Excluding abstracts forced the models to synthesize meaning from distributed information within the main body, more closely mimicking real-world scenarios.

A temperature of 0.0 was used across all models to ensure replicability, along with a top_p of 1.0 and a max_tokens limit of 1000.
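
For concreteness, a minimal sketch of a single question-answering call with these decoding settings is shown below. Only the temperature, top_p, and max_tokens values come from this study; the prompt wording and system message are illustrative placeholders.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(question: str, context: str) -> str:
    """Answer one question from retrieved document context with deterministic decoding."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided document context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        temperature=0.0,   # deterministic decoding for replicability
        top_p=1.0,
        max_tokens=1000,
    )
    return response.choices[0].message.content
```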

Across the RAG systems, optical character recognition (OCR) was enabled so that content embedded in visual elements was preserved in the parsed text. For each ablation condition, a dense retrieval index was built over the allowed chunks, and top-k selection was used to retrieve the most relevant chunks for each question. OpenAI's text-embedding-3-large model handled all embedding tasks, and GPT-4o served as the base model for all question-answering tasks. On the vision-language-model side, each academic paper was rendered page by page into high-resolution PNG images, which were then base64-encoded and passed to the model alongside the corresponding question.
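
A minimal sketch of the retrieval step, assuming the documents have already been parsed into text chunks; the value of k and the batch handling are placeholders rather than the study's exact settings.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of text chunks with text-embedding-3-large."""
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([d.embedding for d in resp.data])

def retrieve_top_k(question: str, chunks: list[str], k: int = 5) -> list[str]:
    """Return the k chunks most similar to the question by cosine similarity."""
    chunk_vecs = embed(chunks)
    q_vec = embed([question])[0]
    sims = chunk_vecs @ q_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec)
    )
    top = np.argsort(sims)[::-1][:k]
    return [chunks[i] for i in top]
```

For the VLM setting, a sketch of the page-rendering path (the DPI value is an assumption):

```python
import base64
import fitz  # PyMuPDF

def pdf_pages_as_image_parts(pdf_path: str, dpi: int = 200) -> list[dict]:
    """Render every page to PNG and wrap it in the chat-completions image format."""
    parts = []
    for page in fitz.open(pdf_path):
        png_bytes = page.get_pixmap(dpi=dpi).tobytes("png")
        b64 = base64.b64encode(png_bytes).decode("utf-8")
        parts.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    return parts

# Usage: the rendered pages are sent together with the question text, e.g.
# messages = [{"role": "user", "content": [{"type": "text", "text": question}, *parts]}]
```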

Future Work

Future work will further investigate how different aspects of document design influence model understanding and accuracy. Factors such as the amount and arrangement of visual elements, the placement of references, and the overall complexity of a document’s layout may affect how well models handle textual and visual information. Examining these more systematically could reveal which layout features most often lead to hallucinations or inconsistent answers. This analysis should also go beyond PDFs to other formats, including HTML articles, slides, spreadsheets, and structured forms, which bring new challenges such as dynamic content, interactive elements, and varied table structures. Insights from these studies could support the development of adaptive pipelines that selectively prioritize or filter visual and textual content, adjust preprocessing based on layout, and route complex elements like tables or charts through specialized modules. They could also guide improvements in model architectures and training strategies, especially in how cross-modal information is combined and verified. Together with more targeted perturbation experiments and larger, more diverse datasets, this work will help develop multimodal question-answering systems that are more accurate, reliable, and robust.

Evaluation Metrics

  • Accuracy: fraction of answers that match the ground truth exactly
  • Precision: fraction of the response's tokens that are correct
  • Recall: fraction of the ground-truth tokens that the model's response contains
  • F1 score: harmonic mean of precision and recall

[Heatmaps: VLM and Parser-RAG results; the x-axis of each heatmap denotes which element of the PDF was removed.]
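
The token-overlap metrics above can be computed in a few lines of Python; this is a minimal sketch, with whitespace tokenization and only trivial answer normalization as simplifying assumptions.

```python
from collections import Counter

def token_prf(prediction: str, ground_truth: str) -> tuple[float, float, float]:
    """Token-level precision, recall, and F1 between a response and the ground truth."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    overlap = Counter(pred_tokens) & Counter(gold_tokens)  # multiset intersection
    num_common = sum(overlap.values())
    if num_common == 0:
        return 0.0, 0.0, 0.0
    precision = num_common / len(pred_tokens)   # correct tokens / response tokens
    recall = num_common / len(gold_tokens)      # covered tokens / ground-truth tokens
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def exact_match(prediction: str, ground_truth: str) -> bool:
    """Accuracy criterion: exact match, ignoring case and surrounding whitespace
    (a simplifying assumption about the normalization used)."""
    return prediction.strip().lower() == ground_truth.strip().lower()
```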



Figures

[Evaluation-metric illustrations adapted from https://medium.com/@shrutisaxena0617/precision-vs-recall-386cf9f89488 and https://www.researchgate.net/figure/Evaluation-metrics-accuracy-precision-recall-F-score-and-Intersection-over-Union_fig2_358029719]

[Figure adapted from InfoChartQA [15]: (a) 5,948 infographic–plain chart pairs with 7,475 basic and 462 metaphor-related questions enable the diagnosis of failure cases; (b) visual-element-based questions, in which individual visual elements are removed, test the hypothesis that such elements distract MLLMs.]

References

[1] OpenAI, GPT-4 Technical Report, 2023.
[2] Google DeepMind, Gemini 2.5 Pro Model Card, 2025.
[3] Anthropic, Claude 3 Model Overview, 2024.
[4] K. Lu et al., “Vision-Language Models for Document Understanding,” Proc. ACL, 2024.
[5] LlamaIndex, LlamaParse Documentation, 2024.
[6] Docling, Open-Source Document Parsing Framework, 2024.
[7] Unstructured.io, Toolkit for Unstructured Data Extraction, 2024.
[8] Y. Ji et al., “A Survey on Hallucination in Large Language Models,” Trans. Assoc. Comput. Linguistics, 2024.
[9] Z. Sun et al., “Grounded Decoding for Reducing Hallucinations in VLMs,” Proc. NeurIPS, 2024.
[10] J. Kim et al., “ROPE: Robust Optical Parsing Evaluation,” Proc. CVPR, 2024.
[11] A. Zhang et al., “HALLUCINOGEN: Benchmarking Grounding in Vision-Language Models,” Proc. ICLR, 2024.
[12] Y. Dong et al., “MMLongBench-Doc: Evaluating Long-Context Multimodal Models,” Proc. ACL, 2025.
[13] S. Tang et al., “Summary-Guided Decoding for Factuality in Scientific Summaries,” Findings EMNLP, 2024.
[14] S. Guo et al., “REVERSE: Counterfactual Decoding to Mitigate LLM Hallucination,” Proc. NeurIPS, 2024.
[15] T. Xie et al., “InfoChartQA: A Benchmark for Multimodal Question Answering on Infographic Charts,” arXiv:2310.12430, 2023.
[16] Q. Zhang et al., “Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction,” arXiv, 2025.
