1 of 28

Visual Dialog: Datasets and Models

Miquel Florensa

Group 6

February 19th 2025

2 of 28

Overview

  1. Limitations of Supervised Learning approaches
  2. Paper 1 - Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning
  3. GuessWhich cooperative game
  4. Paper 2 - Improving Generative Visual Dialog by Answering Diverse Questions
  5. Conclusion

3 of 28

Limitations of Supervised Learning: A Static Approach

  • Major drawbacks:
    • Reliance on human-generated data (expensive, biased).
    • Inability for agents to "steer" the conversation.
    • Lack of exploration of the vast question/answer space.
    • Difficulty in training robust, context-aware agents.
  • Turn 1 (Human): "What color is the cat?"
  • Turn 2 (AI - Supervised): "The cat is orange."
  • Turn 3 (Human): "Is it comfortable?"
  • Turn 4 (AI - Supervised): "Yes, it is comfortable."
  • Turn 5 (Human): "What's under the couch?"
  • Turn 6 (AI - Supervised): "The couch is on a rug."
  • Note how the supervised agent deflects the unseen question in Turn 6 with a dataset-typical answer instead of addressing it.

4 of 28

Breaking Free: Reinforcement Learning for Dynamic Dialog

Paper 1: Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning (Das et al., 2017b)


5 of 28

The Q-BOT-A-BOT Architecture: Learning Through Self-Play

[Figure from Das et al., 2017: the self-play setup, in which Q-BOT asks questions about an image it cannot see and A-BOT, which sees the image, answers.]
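To make the loop concrete, here is a minimal runnable sketch of one self-play episode. The stub classes, method names, and the ten-round default are illustrative assumptions, not the paper's API:

```python
import random

class StubQBot:
    """Hypothetical stand-in for the LSTM-based Q-BOT (not the paper's model)."""
    def ask(self, history):
        # sample q_t ~ pi_Q(. | s_t^Q); here: a canned question
        return random.choice(["any people?", "what color is it?", "is it outdoors?"])

    def guess(self, history):
        # regress the image features y_t from the dialog state; here: zeros
        return [0.0] * 4096  # placeholder for a VGG-fc7-style feature vector

class StubABot:
    """Hypothetical stand-in for the LSTM-based A-BOT."""
    def answer(self, image, caption, history, question):
        # sample a_t ~ pi_A(. | s_t^A); here: a canned answer
        return "yes"

def self_play_episode(qbot, abot, image, caption, num_rounds=10):
    """One cooperative episode: Q-BOT sees only the caption, A-BOT also sees
    the image; after every exchange Q-BOT re-estimates the image features."""
    history = [caption]
    estimates = []
    for _ in range(num_rounds):
        question = qbot.ask(history)
        answer = abot.answer(image, caption, history, question)
        history += [question, answer]
        estimates.append(qbot.guess(history))
    return history, estimates

dialog, guesses = self_play_episode(StubQBot(), StubABot(), image=None,
                                    caption="a cat on a couch")
```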

6 of 28

RL Details: Shaping Behaviour Through Rewards

  • Action: both agents share a common action space consisting of all possible output sequences over a token vocabulary $V$. In addition, Q-BOT predicts an estimate $\hat{y}_t$ of the image features at each round $t$ of the dialog.
  • State: A-BOT sees the image $I$, the caption $c$, and the dialog so far, $s_t^A = [I, c, q_1, a_1, \ldots, q_t]$; Q-BOT sees everything except the image, $s_t^Q = [c, q_1, a_1, \ldots, q_{t-1}, a_{t-1}]$.
  • Policy: $\pi_Q(q_t \mid s_t^Q; \theta_Q)$ and $\pi_A(a_t \mid s_t^A; \theta_A)$, with a feature-regression network $f(\cdot; \theta_f)$ that produces $\hat{y}_t$ from Q-BOT's state.
  • Reward: the change in distance to the true image representation $y^{gt}$ after each round, $r_t = \ell(\hat{y}_{t-1}, y^{gt}) - \ell(\hat{y}_t, y^{gt})$, where $\ell$ is the Euclidean distance (see the sketch below).
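A minimal sketch of this per-round reward, assuming plain NumPy feature vectors (the function name `round_reward` is ours, not the paper's):

```python
import numpy as np

def round_reward(prev_estimate, new_estimate, target):
    """Per-round reward r_t = l(y_{t-1}, y_gt) - l(y_t, y_gt):
    positive iff Q-BOT's new feature estimate moved closer to the
    ground-truth image features (l = Euclidean distance)."""
    prev_dist = np.linalg.norm(np.asarray(prev_estimate) - np.asarray(target))
    new_dist = np.linalg.norm(np.asarray(new_estimate) - np.asarray(target))
    return prev_dist - new_dist
```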


7 of 28

Q-BOT & A-BOT Architecture

[Figure from Das et al., 2017: overview of the Q-BOT and A-BOT architectures.]

8 of 28

Q-BOT Architecture

[Figure from Das et al., 2017: Q-BOT architecture; an LSTM generates the next question, e.g. "Are there any animals?"]

9 of 28

A-BOT Architecture

[Figure from Das et al., 2017: A-BOT architecture (LSTM-based encoder-decoder).]

10 of 28

A-BOT Architecture

[Figure from Das et al., 2017: A-BOT architecture, continued (LSTM components).]

11 of 28

Q-BOT Architecture

[Figure from Das et al., 2017: Q-BOT architecture, continued (LSTM components).]

12 of 28

Q-BOT Architecture

[Figure from Das et al., 2017: Q-BOT architecture; an MLP (the feature-regression network) predicts the image features $\hat{y}_t$.]
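As a rough sketch of this feature-regression head (the dimensions and the two-layer structure are assumptions; Das et al. regress to 4096-d VGG fc7 features, and their exact head may differ):

```python
import torch
import torch.nn as nn

class FeatureRegression(nn.Module):
    """Sketch of Q-BOT's feature-regression head: maps the dialog-state
    encoding to predicted image features y_hat_t."""
    def __init__(self, state_dim=512, feat_dim=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, state_dim),
            nn.ReLU(),
            nn.Linear(state_dim, feat_dim),
        )

    def forward(self, state):
        return self.net(state)

y_hat = FeatureRegression()(torch.randn(1, 512))  # shape (1, 4096)
```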

13 of 28

Joint Training with REINFORCE

  • Q-BOT and A-BOT are jointly trained to maximize their collective reward through policy gradients (REINFORCE).
  • Update both models on the per-round rewards: $\nabla_{\theta_Q} J \approx \mathbb{E}\left[ r_t \, \nabla_{\theta_Q} \log \pi_Q(q_t \mid s_t^Q) \right]$, and analogously for $\theta_A$; the feature-regression network $f$ receives the exact gradient of $r_t$, which is differentiable in $\hat{y}_t$. A minimal surrogate-loss sketch follows below.
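A minimal REINFORCE surrogate loss in PyTorch, assuming per-round scalar log-probabilities have been collected during the dialog (a sketch, not the authors' code):

```python
import torch

def reinforce_loss(log_probs, rewards):
    """REINFORCE surrogate: minimizing -sum_t r_t * log pi(action_t | state_t)
    ascends the expected reward. `log_probs` are scalar tensors (summed token
    log-probabilities of each generated question or answer); `rewards` are the
    per-round floats r_t, shared by both agents."""
    return -(torch.tensor(rewards) * torch.stack(log_probs)).sum()

# usage (with hypothetical collected values):
loss = reinforce_loss([torch.tensor(-2.3, requires_grad=True)], [0.5])
loss.backward()
```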


14 of 28

Evaluation

Evaluated on a test set of ~9.5k images from VisDial v0.5.

[Results figure from Das et al., 2017.]

15 of 28

Evaluation

[Figure from Das et al., 2017: qualitative retrieval results on VisDial.]

16 of 28

The Problem: Are RL Agents Really Talking?

  • While the Q-BOT-A-BOT framework was a significant breakthrough, it suffers from a critical limitation: repetitive dialogs.
  • Agents converge to simple strategies that maximize reward but don't lead to meaningful communication.
  • This repetition hinders the learning process and limits the agent's ability to engage in complex conversations.

[Figure from Das et al., 2017: Q-BOT-A-BOT interactions for the SL-pretrained and RL-full-QAf models.]

17 of 28

GuessWhich: A Cooperative Game for Evaluating Visual Dialog

  • GuessWhich is a cooperative game designed to evaluate visual conversational AI agents in a human-AI collaborative setting.
  • It aims to measure not just AI performance in isolation, but its impact on human team performance.

Evaluating Visual Conversational Agents via Cooperative Human-AI Games, Chattopadhyay et al. Aug. 2017

18 of 28

How GuessWhich Works: A Cooperative Image Hunt

[Figure from Chattopadhyay et al., 2017: the GuessWhich interface.] The human asks the AI answerer (ALICE) questions about a secret image that only ALICE has seen, then tries to identify that image from a pool of similar images.

19 of 28

Why GuessWhich Matters: A Human-Centered Evaluation

GuessWhich addresses a critical gap in AI evaluation: it focuses on how AI agents impact human performance.

  • How well the AI helps humans understand visual information.
  • How naturally the AI communicates.
  • How effectively the AI collaborates with humans to solve a task.


20 of 28

Diversifying the Conversation: Encouraging More Engaging Dialog

Improving Generative Visual Dialog by Answering Diverse Questions (Murahari et al., 2019)


21 of 28

Formalizing Diversity: The Smooth L1 Penalty

  • Goal: encourage Q-BOT to ask a diverse set of questions.
  • Auxiliary objective: maximize a smooth-L1 penalty on the distance between Q-BOT's successive dialog-state encodings, $\mathrm{SmoothL1}(h_t - h_{t-1})$, pushing consecutive states (and hence consecutive questions) apart; see the sketch below.
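A sketch of such a penalty in PyTorch (the exact formulation in the paper may differ; the `states` input and the weighting are assumptions):

```python
import torch
import torch.nn.functional as F

def diversity_penalty(states):
    """Smooth-L1 distance between successive dialog states h_1..h_T of Q-BOT.
    Maximizing it pushes consecutive states (and hence consecutive questions)
    apart. `states`: list of (batch, dim) tensors."""
    total = torch.zeros(())
    for prev, cur in zip(states[:-1], states[1:]):
        # smooth_l1_loss against the previous state computes SmoothL1(cur - prev)
        total = total + F.smooth_l1_loss(cur, prev)
    return total  # subtract lambda * total from the loss to *maximize* diversity

penalty = diversity_penalty([torch.randn(1, 512) for _ in range(3)])
```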


22 of 28

Improving Agent Conversations

[Figure from Murahari et al., 2019: example agent conversations.]

23 of 28

Quantifying Diversity: Measuring the Impact

[Figure from Murahari et al., 2019: diversity measurements.]

24 of 28

Improved Results

[Results figure from Murahari et al., 2019.]

25 of 28

Conclusions: Advancing the Frontier of Conversational AI

    • RL is a powerful approach for Visual Dialog, but diversity is crucial.
    • A simple diversity penalty can significantly improve the quality of conversations.
    • The resulting agents are more engaging and informative for human users.

26 of 28

Conclusions: Thoughts on Visual Dialog Models

    • Human bias: questions asked by humans still differ from those asked by models.
    • Dataset bias: models are shaped by their training datasets, so they lack generalization and robustness to out-of-distribution examples.

27 of 28

Questions?

28 of 28

Thank you!