VISUAL DESCRIPTION GROUNDING REDUCES HALLUCINATIONS AND BOOSTS REASONING IN LVLMS
Sreyan Ghosh1*, Chandra Kiran Reddy Evuru1*, Sonal Kumar1*,
Utkarsh Tyagi1, Oriol Nieto2, Zeyu Jin2, Dinesh Manocha1
1University of Maryland, College Park, USA
2Adobe, USA
Introduction
What are hallucinations?
Hallucination refers to a mismatch between the factual (or visual) content of the input and the model's generated response.
Hallucination mitigation aims to reduce these discrepancies, yielding more accurate outputs and improved task performance.
While hallucinations are well-studied, the exact causes behind them remain underexplored. We first investigate the root causes of hallucinations in LVLMs (Large Vision Language Models).
Motivation
Preliminaries
Understanding LVLM information processing
To better understand visual perception and reasoning, we break down information processing by an LVLM into its constituent steps:
DO HALLUCINATION MITIGATION TECHNIQUES IMPROVE COGNITIVE ABILITIES?
DO MITIGATION TECHNIQUES GENERALIZE BEYOND REAL-WORLD SCENES?
Types of Visual Understanding Hallucination
Categorizing Visual Hallucinations for LVLMs
Frequency comparison of hallucination categories
Key findings from the frequency comparison:
VISUAL PERCEPTION GAP: LVLMS CAN SEE BUT NOT PERCEIVE
VISUAL BLIND SPOTS: LVLMS IGNORE THE INPUT IMAGE FOR REASONING
VISUAL DESCRIPTION GROUNDING DECODING (VDGD)
We denote the VLM's autoregressive token generation as y_t ~ p_θ(· | v, x, y_<t), where v is the input image, x the text prompt, and y_<t the previously generated tokens.
To ensure plausibility, we truncate the vocabulary by retaining only tokens y that satisfy:
p_θ(y | v, x, y_<t) ≥ α · max_w p_θ(w | v, x, y_<t)
where α ∈ [0, 1] is a threshold hyperparameter.
Next, we compute the deviation KL(y) of each plausible token y as the KL divergence with respect to the next-token distributions at the prompt (description) token positions.
We subtract KL(y) from each plausible token's logit and apply softmax. Tokens with larger deviation are down-weighted in the softmax, reducing the chance of generations that stray from the grounded description.
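The decoding step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the deviation of a candidate token y is measured as the minimum KL divergence between the one-hot distribution of y and the next-token distributions at the description positions (which reduces to min_i −log p_i(y)); the function name `vdgd_step` and the threshold value are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

def vdgd_step(logits, desc_dists, alpha=0.1):
    """One KL-reweighted decoding step (illustrative sketch).

    logits:     (V,) next-token logits at the current step
    desc_dists: (T, V) next-token distributions at the T description
                token positions (already softmaxed)
    alpha:      plausibility threshold hyperparameter
    """
    probs = softmax(logits)
    # Adaptive plausibility: keep only tokens whose probability is
    # within a factor alpha of the most likely token.
    plausible = probs >= alpha * probs.max()
    # Deviation of each candidate y: min over description positions i
    # of KL(one-hot(y) || p_i), which reduces to min_i -log p_i(y).
    eps = 1e-12
    kl = -np.log(desc_dists + eps).min(axis=0)  # shape (V,)
    # Down-weight high-deviation tokens: subtract KL from the logits,
    # mask implausible tokens, and renormalize with softmax.
    adjusted = np.where(plausible, logits - kl, -np.inf)
    return softmax(adjusted)
```

Tokens outside the plausible set receive zero probability, and among plausible tokens, those better supported by the description distributions are favored.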
Limitations with Existing Benchmarks
VaLLu Benchmark
Results
Performance comparison of VDGD with various baselines.
Qualitative examples
The above figure compares responses from vanilla greedy decoding with VDGD. Grounding response generation in the visual description yields more factual responses.
Conclusion and Key Takeaways
Limitations
Thank You