
Shortcomings of Vision-Language Models: Bias, Hallucination, and Grounding Robustness

Networks and Cyber Physical Systems Lab


3/19/25

Student: Ba Luan Dang

Team: 2


Outline


  • Gender Bias in:
    • Early Captioning Models
    • Pre-trained Vision-Language Models

  • Hallucination in Large Vision-Language Models


Gender Bias


A ___ is cooking.

Man? Woman?


Gender Bias


A man is cooking.

  • Surrounding context influences gender predictions



Gender Bias


  • Gender also influences the predictions of other objects


Gender Bias


  • Why is that?
  • Bias in the training data
  • Visual grounding is not good enough:
    • Vision distortion
    • The model fails to ground visual evidence
    • Surrounding context reinforces a stereotype over the faithful description

  • Gender Bias: the model learns stereotypes or imbalanced representations of gender


Gender Bias


Women also Snowboard: Overcoming Bias in Captioning Models (2019)



Gender Bias


Women also Snowboard: Overcoming Bias in Captioning Models (2019)

Two training signals are added (sketched below):

  • Encourage the model to predict gender words correctly when the visual evidence for gender is present
  • When the gender information is confusing (the gender evidence is masked out of the image), the probabilities of predicting “man” and “woman” should be equal
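To the best of my recollection these two signals correspond to the paper's Confident Loss and Appearance Confusion Loss; the sketch below is only a simplified rendering of the idea (the per-step probabilities, the gender-word mask, and the normalization are placeholder assumptions, not the authors' exact formulation).

```python
import torch

def appearance_confusion_loss(p_man, p_woman, gender_step_mask):
    # On images where the person region is masked out, the model should be
    # equally unsure about "man" vs. "woman": penalize the probability gap
    # at the time steps where a gendered word is generated.
    gap = (p_man - p_woman).abs()
    return (gap * gender_step_mask).sum() / gender_step_mask.sum().clamp(min=1)

def confident_loss(p_correct, p_wrong, gender_step_mask, eps=1e-6):
    # On the original (unmasked) images, push the correct gender word to
    # dominate: the ratio p_wrong / (p_correct + p_wrong) shrinks toward 0
    # as the model becomes confidently right.
    ratio = p_wrong / (p_correct + p_wrong + eps)
    return (ratio * gender_step_mask).sum() / gender_step_mask.sum().clamp(min=1)

# Toy usage: per-time-step probabilities for a single caption.
p_man   = torch.tensor([0.60, 0.10])
p_woman = torch.tensor([0.30, 0.05])
mask    = torch.tensor([1.0, 0.0])          # only step 0 emits a gender word
print(appearance_confusion_loss(p_man, p_woman, mask))  # tensor(0.3000)
print(confident_loss(p_woman, p_man, mask))             # ~0.667 if "woman" is correct
```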

 

 


Gender Bias


Women also Snowboard: Overcoming Bias in Captioning Models (2019)

  • Model: Show and Tell (NIC, 2015), pre-trained on the MSCOCO dataset

  • Fine-tuning Datasets:
    • MSCOCO-Bias, a subset of MSCOCO whose captions contain “man” or “woman” (a filtering sketch follows below)
    • Person segmentation masks from MSCOCO
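A minimal sketch of how such a subset could be filtered with pycocotools; the annotation path, the gender word lists, and the "exactly one gender mentioned" rule are illustrative assumptions, not the paper's exact selection protocol.

```python
from pycocotools.coco import COCO

# Hypothetical path; adjust to your local MSCOCO layout.
captions = COCO("annotations/captions_train2014.json")

MAN_WORDS = {"man", "men", "male", "boy", "gentleman"}      # assumed word list
WOMAN_WORDS = {"woman", "women", "female", "girl", "lady"}  # assumed word list

def gender_mentions(caption):
    words = set(caption.lower().replace(".", " ").replace(",", " ").split())
    return bool(words & MAN_WORDS), bool(words & WOMAN_WORDS)

bias_subset = []
for img_id in captions.getImgIds():
    anns = captions.loadAnns(captions.getAnnIds(imgIds=img_id))
    has_man = has_woman = False
    for ann in anns:
        m, w = gender_mentions(ann["caption"])
        has_man, has_woman = has_man or m, has_woman or w
    # Keep images whose captions mention exactly one gender.
    if has_man != has_woman:
        bias_subset.append(img_id)

print(f"MSCOCO-Bias candidate images: {len(bias_subset)}")
```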


Gender Bias


Women also Snowboard: Overcoming Bias in Captioning Models (2019)

  • Strengths:
    • Improves visual grounding for gender prediction and helps overcome the bias
    • Addresses an important and socially significant issue in AI

  • Weaknesses:
    • Limited baselines
    • The model is trained to refuse to predict gender (resulting in more “person” predictions)
    • The model is not evaluated on the general captioning task after fine-tuning


Gender Bias


Worst of Both Worlds: Biases Compound in Pre-trained Vision-and-Language Models (2022)


Hallucination


Credit: A Survey on Hallucination in Large Vision-Language Models (2024)

  • Disagreement between visual input and textual output


Hallucination

How can hallucination be evaluated?

  • Ask the LVLM about the presence of certain objects (binary yes/no output)
  • Measure a hallucination score (numeric output; a CHAIR-style sketch follows below)
  • Construct labelled hallucination datasets and train another LVLM to detect hallucination
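The numeric-score approach is typified by CHAIR-style metrics, which count how many mentioned objects are not actually in the image. A minimal sketch, assuming we already have per-image ground-truth object sets and using naive word matching (real implementations also handle synonyms and plurals):

```python
def chair_scores(captions, gt_objects, object_vocab):
    """CHAIR-style hallucination scores.

    captions:     {image_id: generated caption (str)}
    gt_objects:   {image_id: set of object words actually present in the image}
    object_vocab: set of object words we try to spot in the captions
    """
    hallucinated = mentioned = captions_with_hallucination = 0
    for img_id, caption in captions.items():
        words = set(caption.lower().replace(".", " ").replace(",", " ").split())
        mentioned_objs = words & object_vocab
        halluc_objs = mentioned_objs - gt_objects.get(img_id, set())
        mentioned += len(mentioned_objs)
        hallucinated += len(halluc_objs)
        captions_with_hallucination += bool(halluc_objs)
    chair_i = hallucinated / max(mentioned, 1)                     # object-level
    chair_s = captions_with_hallucination / max(len(captions), 1)  # caption-level
    return chair_i, chair_s

# Toy usage with made-up data.
print(chair_scores(
    {1: "A dog sits next to a frisbee and a bench."},
    {1: {"dog", "bench"}},
    {"dog", "frisbee", "bench", "cat"},
))  # (0.333..., 1.0)
```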


Hallucination

Data distribution imbalance

A large amount of instruction-tuning data is synthesized by LLMs

Hallucination

Propose high-quality and balanced datasets

Propose datasets with high-quality, fine-grained descriptions

Hallucination

Limited visual resolution captures only limited visual information

Visual encoders often focus on salient objects and fail to capture fine-grained features

Hallucination

Scale up the vision resolution efficiently (recall “scale-then-compress” in NVILA; a sketch follows below)

Use extra perception modalities: segmentation maps, depth maps, spatial positions, scene graphs
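A minimal sketch of the "scale-then-compress" idea, not NVILA's actual implementation: encode a higher-resolution image (more visual tokens), then average neighbouring tokens so the LLM still sees a manageable sequence. The encoder interface, scale factor, and pooling size are assumptions.

```python
import torch
import torch.nn.functional as F

def scale_then_compress(image, vision_encoder, scale=2, pool=2):
    """image: (B, 3, H, W); vision_encoder returns (B, N, D) patch tokens."""
    B, _, H, W = image.shape
    # 1) Scale: feed a higher-resolution image so the encoder produces more
    #    (finer-grained) visual tokens.
    hi_res = F.interpolate(image, size=(H * scale, W * scale),
                           mode="bilinear", align_corners=False)
    tokens = vision_encoder(hi_res)                  # (B, N, D), N grows ~scale**2
    # 2) Compress: average pool x pool neighbouring tokens on the 2-D token
    #    grid, shrinking the sequence the LLM has to attend over.
    B, N, D = tokens.shape
    side = int(N ** 0.5)                             # assumes a square token grid
    grid = tokens.reshape(B, side, side, D).permute(0, 3, 1, 2)  # (B, D, side, side)
    pooled = F.avg_pool2d(grid, kernel_size=pool)    # (B, D, side/pool, side/pool)
    return pooled.flatten(2).transpose(1, 2)         # (B, N/pool**2, D)
```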

Hallucination

A few linear projection layers are not sufficient

A restricted number of query tokens prevents encoding all of the information present in an image
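A minimal sketch of a query-token connector in the spirit of a Q-Former, showing where the bottleneck described above comes from: a small, fixed set of learnable queries cross-attends to the image tokens, and anything that does not fit into those queries is lost. Module names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class QueryConnector(nn.Module):
    """Compress variable-length image tokens into a fixed number of queries."""
    def __init__(self, num_queries=32, vis_dim=1024, llm_dim=4096, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vis_dim, heads, batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)  # map into the LLM embedding space

    def forward(self, image_tokens):             # (B, N, vis_dim), N can be large
        B = image_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        # Whatever does not fit into num_queries tokens is lost here --
        # this is the information bottleneck.
        out, _ = self.cross_attn(q, image_tokens, image_tokens)
        return self.proj(out)                    # (B, num_queries, llm_dim)

# Toy usage: 576 image tokens squeezed into 32 query tokens.
connector = QueryConnector()
print(connector(torch.randn(2, 576, 1024)).shape)  # torch.Size([2, 32, 4096])
```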

Hallucination

Use larger connection modules

Add new training objectives to enhance the modality alignment (a contrastive-alignment sketch follows below); employ RLHF
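One widely used alignment objective is an image-text contrastive (InfoNCE/CLIP-style) loss; the sketch below is a generic example of such an added objective, not tied to any specific model in the survey.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss pulling matched image/text pairs together."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: a batch of 8 matched image/text embedding pairs.
print(image_text_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)))
```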

Hallucination

LLMs prioritize language patterns over visual evidence

Disparity between pre-trained knowledge and instruction-tuning requirements

Randomness in stochastic sampling decoding
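A minimal sketch of where the decoding randomness comes from: with temperature and nucleus (top-p) sampling, the same logits can produce different tokens on different runs, unlike greedy argmax. Pure NumPy, no particular model assumed.

```python
import numpy as np

def sample_token(logits, temperature=0.8, top_p=0.9, rng=None):
    """Nucleus (top-p) sampling over a 1-D array of next-token logits."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                   # most likely first
    cdf = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cdf, top_p) + 1]   # smallest set covering top_p mass
    keep_probs = probs[keep] / probs[keep].sum()
    return rng.choice(keep, p=keep_probs)

logits = np.array([2.0, 1.5, 0.3, -1.0])
print([sample_token(logits) for _ in range(5)])  # varies from run to run
print(int(np.argmax(logits)))                    # greedy decoding: always token 0
```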

Hallucination

Weighted-scoring beam search

Contrastive decoding (sketched below)

Employ reinforcement learning: RLHF, DPO
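A minimal sketch of visual-contrastive-style decoding: next-token logits computed with the original image are contrasted against logits computed with a distorted (or missing) image, so tokens favoured mainly by the language prior are down-weighted. The (1 + alpha) weighting follows the common recipe; exact formulations vary across papers.

```python
import numpy as np

def contrastive_logits(logits_with_image, logits_distorted, alpha=1.0):
    """Down-weight tokens that stay likely even without faithful visual input."""
    return (1 + alpha) * np.asarray(logits_with_image) - alpha * np.asarray(logits_distorted)

# Toy example: the last token is driven mostly by the language prior --
# its logit barely changes when the image is distorted, so it gets suppressed.
with_img  = np.array([1.0, 2.5, 2.4])
distorted = np.array([0.2, 0.4, 2.3])
print(contrastive_logits(with_img, distorted, alpha=1.0))  # [1.8, 4.6, 2.5]
```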

Hallucination

Train a revisor model to post-process generated content


Thank You! Any Questions?
