1 of 25

VISUAL DESCRIPTION GROUNDING REDUCES HALLUCINATIONS AND BOOSTS REASONING IN LVLMS

1University of Maryland, College Park, USA

2Adobe, USA

2 of 25

Introduction

What are hallucinations?

Hallucination refers to the mismatch between factual content and the model’s generated responses.

Hallucination mitigation aims to reduce these discrepancies for more accurate outputs and improve task performance.

While hallucinations are well-studied, the exact causes behind them remain underexplored. We first investigate the root causes of hallucinations in LVLMs (Large Vision Language Models).

3 of 25

Motivation

  • Prior studies have proposed methods to reduce hallucinations and improve response accuracy (Sicong Leng et al., 2023; Huang et al., 2023b; Yin et al., 2023b; Zhou et al., 2023b) and evaluated their effectiveness (Wang et al., 2023a; Guan et al., 2023), but most focus on mitigating object hallucinations in visual recognition tasks.
  • Extensive experiments in the paper show that these techniques fall short when applied to cognitive prompts requiring reasoning or knowledge extraction, indicating a significant gap in their ability to address hallucinations in more complex, reasoning-intensive scenarios.

4 of 25

Preliminaries

  • Datasets: VaLLu (Ghosh et al., 2025), AMBER (Wang et al., 2023a), LLaVA-Bench (Liu et al., 2023c), MM-Vet (Yu et al., 2023), MMBench (Liu et al., 2023d), MME (Fu et al., 2023), MathVista (test-mini subset), MathVision, MMC (Liu et al., 2023a), HallusionBench (Guan et al., 2023), OVEN (Hu et al., 2023), SynthDoG (Kim et al., 2022) and MMMU (validation set).
  • Models: LLaVA-v1 (Liu et al., 2023c), LLaVA-1.5 (Liu et al., 2023b), LLaVA-1.6 (Liu et al., 2024c), mPLUG-Owl2 (Ye et al., 2023), InternLM-X (Zhang et al., 2023), CogVLM (Wang et al., 2023b), Llama-2-7b (Touvron et al., 2023), Vicuna-7b-v1.5 (Zheng et al., 2023), InternLM-7B (Cai et al., 2024), CogVLM2 (Hong et al., 2024) and Qwen2-VL (Wang et al., 2024b).
  • Baselines: VCD (Sicong Leng et al., 2023), OPERA (Huang et al., 2023b), Woodpecker (Yin et al., 2023b), LRV (Liu et al., 2024a), LURE (Zhou et al., 2023b), PAI (Liu et al., 2024d) and HALC (Chen et al., 2024a).

5 of 25

Understanding LVLM information processing

To better understand visual perception and reasoning, we break down information processing by an LVLM into its constituent steps:

6 of 25

DO HALLUCINATION MITIGATION TECHNIQUES IMPROVE COGNITIVE ABILITIES?

  1. Newer models perform better on the AMBER benchmark (visual recognition), while performance on other reasoning and information-seeking benchmarks has remained stagnant.

  • Hallucination mitigation techniques boost performance on AMBER, while performance on other reasoning and information-seeking benchmarks remains stagnant.

7 of 25

DO MITIGATION TECHNIQUES GENERALIZE BEYOND REAL-WORLD SCENES?

  1. Compared to AMBER, LVLMs hallucinate more often when describing visuals beyond real-world scenes.

  • We further show that hallucination mitigation techniques improve performance on AMBER but rarely on other benchmarks.

8 of 25

Types of Visual Understanding Hallucination

9 of 25

Categorizing Visual Hallucinations for LVLMs

  1. Language Hallucinations: These occur when LVLMs over-rely on language priors (or prior tokens in the response) rather than the input image. We attribute all hallucinated tokens that are unshifted to language hallucinations.
  2. Vision Hallucinations: The tokens for this type are generated by attending to the input image and are shifted or marginal tokens. These hallucinations occur when the model fails to accurately recognize visual elements in the input image.
  3. Style Hallucinations: These are hallucinations caused by style imitation learned during pre-training/training.
  4. Instruction Tuning (IT) Hallucinations: These are hallucinations caused by biases learned during the instruction tuning stage.
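One way to operationalize the language-vs-vision split above is to check whether conditioning on the image shifts the probability of the hallucinated token. The sketch below is illustrative, assuming access to next-token logits with and without the image; the function name and threshold are hypothetical, not the paper's exact procedure:

```python
import numpy as np

def classify_hallucinated_token(logits_with_image, logits_text_only,
                                token_id, shift_threshold=0.1):
    """Label a hallucinated token as a language or vision hallucination
    by checking whether conditioning on the image shifts its probability
    (the threshold value is an illustrative choice)."""
    def softmax(z):
        z = z - np.max(z)
        e = np.exp(z)
        return e / e.sum()
    p_img = softmax(np.asarray(logits_with_image, dtype=float))[token_id]
    p_txt = softmax(np.asarray(logits_text_only, dtype=float))[token_id]
    # Unshifted token: the image barely changes its probability,
    # so the token came from the language prior.
    if abs(p_img - p_txt) < shift_threshold:
        return "language"
    # Shifted/marginal token: generated by attending to the image.
    return "vision"
```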

10 of 25

Frequency comparison of hallucination categories

Key Findings from frequency comparison:

  1. Model-based improvements and mitigation strategies reduce language and vision hallucinations, but progress on style and IT hallucinations remains limited.
  2. Specific mitigation techniques are more effective for certain types of hallucinations; e.g., decoding-based methods like VCD and OPERA excel at reducing language hallucinations, whereas grounding approaches like Woodpecker are more effective for vision hallucinations.
  3. Existing methods fail to reduce hallucinations on reasoning benchmarks like MMMU. We attribute this to the algorithmic biases of these methods, which are crafted solely to reduce object hallucinations.

11 of 25

VISUAL PERCEPTION GAP: LVLMS CAN SEE BUT NOT PERCEIVE

  1. Prior-art hallucination mitigation techniques primarily enhance visual recognition but fail to improve other cognitive abilities, such as reasoning or knowledge extraction.
  2. Our analysis reveals that LVLMs often rely on language priors rather than attending to the input image when generating responses to reasoning prompts.
  3. Visual perception gap: While LVLMs can accurately recognize visual elements and possess the necessary knowledge and reasoning skills to respond factually, they struggle to perceive and interpret these elements in relation to the input prompt.
  4. This gap in perception causes hallucinations, resulting in incorrect responses.

12 of 25

VISUAL BLIND SPOTS: LVLMS IGNORE THE INPUT IMAGE FOR REASONING

  1. Base Rank of tokens across datasets: a lower value signifies that the model responds solely from language priors, while a higher value signifies that the model attends to the image.
  2. As the figure shows, the Base Rank for AMBER, which prompts for an image description, is higher than for MathVision, which requires the model to reason or extract knowledge based on the image.
  3. This hints at an alignment issue for MathVision: LVLMs are unable to perceive the image before responding, under-relying on visual cues and over-relying on language priors when generating responses.
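The Base Rank diagnostic above can be sketched as the rank of the token actually generated (with the image) under the text-only distribution: rank 0 means the language prior alone would have produced the same token, while a high rank suggests the image moved the prediction. This is our illustrative reading, not necessarily the paper's exact definition:

```python
import numpy as np

def base_rank(text_only_logits, generated_token_id):
    """Rank (0 = most likely) of the generated token under the text-only
    (language-prior) next-token distribution. Low rank => the response
    token is explainable by language priors alone; high rank => the image
    shifted the prediction."""
    # Sort vocabulary indices by descending text-only logit.
    order = np.argsort(-np.asarray(text_only_logits, dtype=float))
    return int(np.where(order == generated_token_id)[0][0])
```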

13 of 25

VISUAL BLIND SPOTS: LVLMS IGNORE THE INPUT IMAGE FOR REASONING

  1. LVLMs excel at visual recognition and perform better when all correct visual details are provided in the text prompt.
  2. This suggests that LVLMs have the necessary knowledge and reasoning skills but struggle to integrate these with visual perception in prompts requiring both.

14 of 25

VISUAL DESCRIPTION GROUNDING DECODING (VDGD)

15 of 25

VISUAL DESCRIPTION GROUNDING DECODING (VDGD)

16 of 25

VISUAL DESCRIPTION GROUNDING DECODING (VDGD)

17 of 25

VISUAL DESCRIPTION GROUNDING DECODING (VDGD)

We denote the VLM’s token generation as:

  y_t ∼ p_θ(y_t | v, x, y_{<t}),

where v is the input image, x is the text prompt (prefixed with the generated visual description), and y_{<t} are the previously generated tokens.

To ensure plausibility, we truncate the vocabulary by retaining only tokens that satisfy:

  V_head = { y ∈ V : p_θ(y | v, x, y_{<t}) ≥ α · max_w p_θ(w | v, x, y_{<t}) },

where α ∈ [0, 1] is a threshold hyperparameter.

Next, we compute the deviation of each plausible token using KL divergence with the prompt tokens:

  KL(y) = min_i KL( 1_y ‖ q_i ),

where q_i is the next-token distribution at the i-th prompt token and 1_y is the one-hot distribution of candidate token y.

We subtract this KL deviation from each plausible token’s logit and apply softmax. Tokens with larger deviation are down-weighted by the softmax, reducing the chance of incoherent generations.
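One VDGD decoding step can be sketched as follows. This is a simplified reading of the slide, not a verbatim implementation: the one-hot KL form (which reduces to −log q[y]) and the minimum over prompt-token distributions are our assumptions where the slide is ambiguous.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def vdgd_step(logits, prompt_token_dists, alpha=0.1):
    """One sketched VDGD decoding step.
    logits: next-token logits given the description-prefixed prompt.
    prompt_token_dists: next-token distributions recorded at the
    description/prompt tokens, used as grounding references."""
    probs = softmax(np.asarray(logits, dtype=float))
    # 1) Plausibility truncation: keep tokens within alpha of the max prob.
    keep = probs >= alpha * probs.max()
    # 2) Deviation of each candidate y: KL(one_hot(y) || q) = -log q[y],
    #    minimized over the grounding distributions q.
    kl = np.full(len(probs), np.inf)
    for q in prompt_token_dists:
        kl = np.minimum(kl, -np.log(np.asarray(q, dtype=float) + 1e-12))
    # 3) Subtract the KL deviation from the logits, mask implausible
    #    tokens, and renormalize with softmax.
    adj = np.asarray(logits, dtype=float) - kl
    adj[~keep] = -np.inf
    return softmax(adj)
```

Tokens that deviate strongly from every grounding distribution receive a large KL penalty and are suppressed, which is how grounding in the visual description steers decoding.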

18 of 25

Limitations with Existing Benchmarks

19 of 25

VaLLu Benchmark

  1. 1,500 carefully curated samples from existing benchmarks, with noisy samples removed.
  2. Focus on open-ended answers rather than MCQs for more robust evaluation.

  • LLM-as-a-judge evaluation that scores responses from 1 to 5 across five aspects.

  • Diverse question types, including reasoning and information-seeking questions.
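An LLM-as-a-judge rubric of the kind described above can be sketched as a prompt builder. The aspect names below are placeholders, not VaLLu's actual rubric:

```python
def build_judge_prompt(question, response, aspects=None):
    """Build a 1-to-5 rubric prompt for an LLM judge.
    The default aspect names are hypothetical stand-ins."""
    aspects = aspects or ["factual accuracy", "relevance", "completeness",
                          "coherence", "helpfulness"]
    rubric = "\n".join(f"- {a}: score 1 (poor) to 5 (excellent)"
                       for a in aspects)
    return (
        "You are grading an open-ended answer.\n"
        f"Question: {question}\n"
        f"Answer: {response}\n"
        "Score the answer on each aspect:\n"
        f"{rubric}\n"
        "Return one integer score per aspect."
    )
```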

20 of 25

Results

Performance comparison of VDGD with various baselines.

21 of 25

Results

Performance comparison of VDGD with various baselines.

22 of 25

Qualitative examples

The above figure compares responses from vanilla greedy decoding with VDGD. Grounding response generation results in more factual responses.

23 of 25

Conclusion and Key Takeaways

  • Although there has been considerable progress in reducing hallucinations related to visual recognition, progress on reducing hallucinations for cognitive prompts has been limited.
  • Current hallucination mitigation techniques mainly improve LVLMs’ visual recognition skills but are less effective at improving other cognitive abilities, such as reasoning. Additionally, these improvements are largely limited to real-world scenes and specific types of hallucinations, leaving areas like non-real-world scenes largely unexplored.

  • LVLMs generally possess the recognition and cognition skills needed to respond and reason accurately. However, they struggle to effectively link what they recognize to their internal knowledge during the reasoning process, which results in inaccurate responses. This issue underlines what we identify as the visual perception gap.
  • VDGD is a novel, training-free method to reduce hallucinations. Our results demonstrate that VDGD is the first method to effectively reduce hallucinations in responses that require cognitive skills beyond visual recognition.
  • VDGD outperforms comparable baselines by 2–33%.

24 of 25

Limitations

  • Error accumulation in VDGD arising from inaccurate image descriptions in the prefix.
  • VDGD is not compute-efficient, as it requires two generation passes (description, then response).

25 of 25

Thank You