Shortcomings of Vision-Language Models: Bias, Hallucination, and Grounding Robustness
Networks and Cyber Physical Systems Lab
0
3/19/25
Student: Ba Luan Dang
Team: 2
Outline
Gender Bias
A ___ is cooking.
Man?
Woman?
A man is cooking.
Gender Bias
Women also Snowboard: Overcoming Bias in Captioning Models (2019)
Encourages the model to predict gender words correctly when gender evidence is present
When gender evidence is ambiguous, the probabilities of predicting "man" and "woman" should be equal
To train this, the gender evidence in the image is masked
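The two objectives above can be sketched as loss terms. This is a simplified illustration under stated assumptions, not the paper's exact formulation: the function names, the toy vocabulary layout, and the `eps` constant are hypothetical, and `probs` stands for the captioner's per-step word distribution.

```python
import numpy as np

def appearance_confusion_loss(probs, woman_ids, man_ids):
    """Penalize any gender preference when the person region is masked.

    probs: (T, V) array of per-step word probabilities for the masked image.
    woman_ids / man_ids: vocabulary indices of gendered words (assumed given).
    """
    p_woman = probs[:, woman_ids].sum(axis=1)
    p_man = probs[:, man_ids].sum(axis=1)
    # Zero only when both genders are equally likely at every step.
    return float(np.abs(p_woman - p_man).mean())

def confident_loss(probs, target_ids, woman_ids, man_ids, eps=1e-6):
    """Penalize probability mass on the wrong gender for the unmasked image.

    target_ids: (T,) ground-truth word indices of the reference caption.
    """
    p_woman = probs[:, woman_ids].sum(axis=1)
    p_man = probs[:, man_ids].sum(axis=1)
    is_woman = np.isin(target_ids, woman_ids)
    is_man = np.isin(target_ids, man_ids)
    wrong_when_woman = p_man / (p_woman + eps)  # small if "woman" dominates
    wrong_when_man = p_woman / (p_man + eps)    # small if "man" dominates
    return float((is_woman * wrong_when_woman + is_man * wrong_when_man).mean())

# Toy vocabulary (hypothetical): 0="a", 1="man", 2="woman", 3="cooking"
WOMAN, MAN = [2], [1]
masked_unbiased = np.array([[0.1, 0.45, 0.45, 0.0]])
masked_biased = np.array([[0.1, 0.80, 0.10, 0.0]])
print(appearance_confusion_loss(masked_unbiased, WOMAN, MAN))
print(appearance_confusion_loss(masked_biased, WOMAN, MAN))
```

In training, both terms would be added to the usual captioning cross-entropy with tunable weights; those weights are not given in the slides.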
Gender Bias
Worst of Both Worlds: Biases Compound in Pre-trained Vision-and-Language Models (2022)
Hallucination
Credit: A Survey on Hallucination in Large Vision-Language Models (2024)
Ask LVLMs about the presence of certain objects
Output: binary (yes/no)
Measure a hallucination score on generated text
Output: a number
Construct labelled hallucination datasets and train another LVLM to detect hallucinations
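The first two evaluation styles above can be sketched in a few lines. This is a minimal illustration in the spirit of yes/no object probing and object-level hallucination rates; the function names and the toy data are assumptions, and the object lists would in practice come from dataset annotations and a caption parser.

```python
def binary_probe_metrics(answers, labels):
    """Score yes/no answers to "Is there a <object> in the image?".

    answers, labels: equal-length lists of bool; True means "present".
    """
    pairs = list(zip(answers, labels))
    tp = sum(1 for a, l in pairs if a and l)
    fp = sum(1 for a, l in pairs if a and not l)  # hallucinated objects
    fn = sum(1 for a, l in pairs if not a and l)
    tn = sum(1 for a, l in pairs if not a and not l)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(pairs),  # high values hint at "yes" bias
    }

def hallucination_score(mentioned_objects, ground_truth_objects):
    """Fraction of objects mentioned in a caption that are not in the image."""
    mentioned = list(mentioned_objects)
    if not mentioned:
        return 0.0
    truth = set(ground_truth_objects)
    hallucinated = [obj for obj in mentioned if obj not in truth]
    return len(hallucinated) / len(mentioned)

metrics = binary_probe_metrics(
    answers=[True, True, False, True],
    labels=[True, False, False, True],
)
score = hallucination_score(["dog", "frisbee", "car"], {"dog", "frisbee"})
print(metrics, score)
```

The binary probe yields classification metrics, while the score is a single number per caption that can be averaged over a dataset.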
Hallucination
Data Distribution Imbalance
A large amount of instruction-tuning data is synthesized by LLMs
Hallucination
Propose high-quality, balanced datasets
Propose datasets with high-quality, fine-grained descriptions
Hallucination
Limited visual resolution captures limited visual information
Visual encoders often focus on salient objects and fail to capture fine-grained features
Hallucination
Scale up vision resolution efficiently
Recall: “scale-then-compress” in NVILA
Use extra perception modalities:
Segmentation maps, depth maps, spatial positions, scene graphs
Hallucination
A connection module of only a few linear layers is not sufficient to align the modalities
A restricted number of query tokens prevents encoding all the information present in images
Hallucination
Use larger connection modules
Add new objectives to enhance modality alignment; employ RLHF
Hallucination
LLMs prioritize language patterns over the visual input
Disparity between pre-trained knowledge and instruction-tuning requirements
Randomness in stochastic sampling decoding
Hallucination
Weighted Scoring Beam Search
Contrastive Decoding
Employ Reinforcement Learning:
RLHF, DPO
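As an illustration of the contrastive decoding idea listed above, the sketch below contrasts next-token scores computed with the original image against scores computed with a distorted image, so that tokens driven mainly by language priors are down-weighted. It is a minimal sketch under assumptions: the function name, the `alpha` and `beta` parameters, and the toy logits are hypothetical, and the two logit vectors would come from two forward passes of the same LVLM.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D logit vector."""
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def contrastive_next_token(logits_clean, logits_distorted, alpha=1.0, beta=0.1):
    """Pick the next token by contrasting clean-image and distorted-image scores.

    alpha: strength of the contrast (0 recovers greedy decoding on the clean image).
    beta: plausibility cutoff; tokens whose clean-image probability is below
          beta * (max clean-image probability) are never selected.
    """
    logp_clean = log_softmax(np.asarray(logits_clean, dtype=float))
    logp_dist = log_softmax(np.asarray(logits_distorted, dtype=float))
    scores = (1.0 + alpha) * logp_clean - alpha * logp_dist
    plausible = logp_clean >= np.log(beta) + logp_clean.max()
    scores = np.where(plausible, scores, -np.inf)
    return int(np.argmax(scores))

# Toy case: token 0 is favored by the language prior even when the image is
# distorted, while token 1 is supported only by the real image.
clean = np.array([2.0, 1.9, -5.0])
distorted = np.array([2.0, 0.0, -5.0])
greedy_choice = int(np.argmax(clean))
contrastive_choice = contrastive_next_token(clean, distorted)
print(greedy_choice, contrastive_choice)
```

The plausibility cutoff keeps the contrast from promoting tokens that are very unlikely under the clean image in the first place.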
Hallucination
Train a revisor model to post-process generated content
Thank You! Any Questions?