1 of 36

Representation Learning for Dialog Models

Manish Gupta

gmanish@microsoft.com

7th Dec 2023

2 of 36

Recent work

  • Bing Query Auto Completion team at Hyderabad.
  • Adjunct Faculty (IIIT-H) since 2013; Visiting Faculty (ISB) since 2016; 100+ publications and 2 books; 7500+ citations; 15 tutorials; served on 50+ program committees.
  • Microsoft Academic Partnership Grant since 2021.
  • Areas of interest*
    • Dialog Systems: [ACL 23, NAACL 22, EMNLP 23 (Findings)]
    • Cross-Lingual Text Generation: [WebConf 23, ECAI 23]
    • Multimodal NLP: [IJCAI 23, EMNLP 22].
    • Other NLP: [ECIR 23, PKDD 23, EMNLP 23 (Findings)]
    • Cognitive Neuroscience: [InterSpeech 23, COLING 22, ACL 23, NeurIPS 23, NAACL 22], Tutorials [IJCAI 23, IJCNN 23, CogSci 22].
    • Query Auto-Completion: [PKDD 23], Tutorials [ECIR 23, IJCAI 22]
  • Collaborators: IIIT-H, IIT-KGP, IIT-D, IIT-H, IIT-J, Inria, MPI-SWS.
  • YouTube (Data Science Gems)


3 of 36

Agenda

  • Introduction to Dialog Modeling
  • DMI-based Representation Learning [NAACL 22] (with IITKGP)
  • Representation Learning for Multimodal Persona Based Setting [ACL 23] (with IITD)
  • Representation Learning for In-Context Learning Models [EMNLP 23 (Findings)] (with IITKGP)
  • Outlook


4 of 36

What is dialog modeling?

  • Given
    • (Text or multimodal) Conversation history
    • Persona of users
    • Topic of conversation
    • Any other context
  • Generate
    • Response
  • Challenge
    • How do I produce the best input representation?
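As a minimal illustration of how these inputs come together, the context, persona, and topic can be flattened into a single model input string. The field labels and separator token below are illustrative assumptions, not the format of any specific system:

```python
def build_dialog_input(history, persona=None, topic=None, sep=" <sep> "):
    """Flatten the dialog inputs into one model input string.

    history: list of utterance strings, oldest first.
    persona: optional list of persona facts for the responding speaker.
    topic:   optional topic/knowledge string.
    (Field labels and separators here are illustrative assumptions.)
    """
    parts = []
    if persona:
        parts.append("persona: " + " ".join(persona))
    if topic:
        parts.append("topic: " + topic)
    parts.append(sep.join(history))
    return " | ".join(parts)

example = build_dialog_input(
    history=["hi , how are you doing ?", "you must be very fast ."],
    persona=["i like to remodel homes.", "i like to go hunting."],
)
```

The challenge above is exactly about what representation this flattened input should take, and how much of each field to include.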


S1: hi , how are you doing ? i am getting ready to do some cheetah chasing to stay in shape .

S2: you must be very fast . hunting is one of my favorite hobbies .

S1: i am ! for my hobby i like to do canning or some whittling .

S2: i also remodel homes when i am not out bow hunting.

Persona for S2: i like to remodel homes. i like to go hunting. i like to shoot a bow. my favorite holiday is halloween.

(Reddit and PersonaChat dataset examples, shown as context → response pairs; the dialog above is from PersonaChat.)

  • Matthew Henderson, Paweł Budzianowski, Iñigo Casanueva, Sam Coope, Daniela Gerz, Girish Kumar, Nikola Mrkšić, Georgios Spithourakis, Pei-Hao Su, Ivan Vulić, and Tsung-Hsien Wen. 2019. A repository of conversational datasets. In First Workshop on NLP for Conversational AI, pages 1–10, Florence, Italy.
  • Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In ACL, pages 2204–2213.

5 of 36

What is dialog modeling?

  • Given
    • (Text or multimodal) Conversation history
    • Persona of users
    • Topic of conversation
    • Any other context
  • Generate
    • Response
  • Challenge
    • How do I produce the best input representation?

Karthik Gopalakrishnan, Behnam Hedayatnia, Qinlang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, and Dilek Hakkani-Tür. 2019. Topical-chat: Towards knowledge-grounded open-domain conversations. Proc. Interspeech 2019, pages 1891–1895.


TopicalChat Dataset Example (context → response)

Knowledge section for topic "Fish"

A fish is an aquatic, craniate, gill-bearing animal that lacks limbs with digits. Included in this definition are the living hagfish, lampreys, and cartilaginous and bony fish as well as various extinct related groups. Approximately 95% of living fish species are ray-finned fish, belonging to the class Actinopterygii, with around 99% of those being teleosts.

S1: I think fish are so cool there is actually a breed of jellyfish that is immortal.

S2: i had rememered hearing about that before. Immortatlity is wasted on a jellyfish haha. did you know a seahorse is the only fish that has an actual neck?

S1: That is so funny I guess I never considered a seahorse a fish. The black swallower fish sounds a lot like a snake because it can eat pray that is so large.

S2: i guess they live up to their name then!

S1: It seems they do. I also didn't know that there was a difference with how freshwater and saltwater fish drink.

6 of 36

What is dialog modeling?

  • Given
    • (Text or multimodal) Conversation history
    • Persona of users
    • Topic of conversation
    • Any other context
  • Generate
    • Response
  • Challenge
    • How do I produce the best input representation?

Harsh Agrawal, Aditya Mishra, Manish Gupta, and Mausam. 2023. Multimodal persona based generation of comic dialogs. In ACL, pages 14150–14164.


ComSet Dataset Example (context → response)

7 of 36

Why care about dialog modeling?

  • Remember ELIZA in Emacs? ☺
  • Domain-specific (task-oriented) customer service: banking, airports, tech-help, car rental
  • Personal assistants: Siri, Alexa
  • Tutoring: KhanAcademy
  • Healthcare support: OneRemission, Youper, Babylon Health
  • Entertainment/chitchat systems: Virtual friend; Ruuh
  • ….
  • LLMs 🡪 Bing Chat


8 of 36

What are popular approaches for dialog modeling?

  • Seq2Seq neural NLG models
    • Standard pretrained models: BERT, ELMo, GPT-2.
    • Pretrained on dialog data: DialoGPT, BlenderBot, Meena, EDGE.
    • Pretrained with dialog specific losses: DialogRPT, ContextPretrain, ConveRT.
      • Response ranking, next-utterance retrieval, next-utterance generation, masked-utterance retrieval, inconsistency identification
    • Finetuned on persona-based data: Bert-over-Bert, PersonaGPT.
  • What is missing?
    • Can there be a loss that is conscious of (context, response) structure of dialogs?
    • How to model multimodal dialogs?
    • How to use GPTx for dialog modeling?


9 of 36

What are popular metrics for evaluating dialog systems?

  • Multi-class Classification: Accuracy
  • Response Selection (Retrieval): R@1, R@2, MRR
  • Generation
    • Perplexity
    • Syntactic: Unigram F1, ROUGE, BLEU
    • Syntax + surface forms + stemmed forms + meanings: METEOR
    • Semantic: BLEURT
    • Dialog specific semantic: MaUde, DEB
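For concreteness, two of the lighter-weight metrics above can be sketched in a few lines. This is a generic whitespace-tokenized sketch, not the exact tokenization used in any of the cited evaluation toolkits:

```python
from collections import Counter

def unigram_f1(hypothesis, reference):
    """Token-overlap F1 between a generated response and a reference."""
    hyp, ref = hypothesis.split(), reference.split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def mean_reciprocal_rank(rankings):
    """MRR for response selection; each ranking flags the gold response."""
    total = 0.0
    for ranking in rankings:
        for rank, is_gold in enumerate(ranking, start=1):
            if is_gold:
                total += 1.0 / rank
                break
    return total / len(rankings)
```

R@1 and R@2 follow the same shape as MRR, simply checking whether the gold response falls in the top 1 or top 2 positions.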


10 of 36

Agenda

  • Introduction to Dialog Modeling
  • DMI-based Representation Learning [NAACL 22] (with IITKGP)
  • Representation Learning for Multimodal Persona Based Setting [ACL 23] (with IITD)
  • Representation Learning for In-Context Learning Models [EMNLP 23 (Findings)] (with IITKGP)
  • Outlook


11 of 36

Why learn a new representation for dialog systems?

  • Factors impacting effectiveness of pretrained models
    • Pretraining corpus
    • Loss function
    • Downstream tasks
  • Standard pretraining objectives are unaware of dialog structure


Word-level reasoning

Discourse-level reasoning

12 of 36

How do we optimize with dialog structure awareness?



13 of 36

How is InfoNCE-S computed?

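The equation on this slide did not survive extraction. As a rough sketch of the underlying idea only, a generic in-batch InfoNCE objective over (context, response) pairs looks like the following; the paper's InfoNCE-S estimator differs in its exact sampling and scoring details, so treat this as an assumed illustration:

```python
import numpy as np

def infonce_loss(ctx_emb, resp_emb, temperature=0.1):
    """In-batch InfoNCE: each context's paired response is the positive;
    the other responses in the batch act as negatives.

    ctx_emb, resp_emb: (batch, dim) context/response encodings.
    Returns the mean negative log-likelihood of the correct pairing.
    """
    ctx = ctx_emb / np.linalg.norm(ctx_emb, axis=1, keepdims=True)
    resp = resp_emb / np.linalg.norm(resp_emb, axis=1, keepdims=True)
    logits = ctx @ resp.T / temperature          # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # diagonal = matched pairs
```

Minimizing this loss pulls each context embedding toward its own response and away from the other responses in the batch, which is the (context, response) structure awareness that standard LM objectives lack.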


14 of 36

Experimental Setup

  • Pretraining: Subset of Reddit-727M
  • Dialog Unrolling for Pretraining: 2.7B context–response (CR) pairs; used ~10%


Downstream task details. Adv.: Adversarial, Neg.: Negative


15 of 36

Probing (Frozen LM) Results

  • Analyze pretrained model’s effectiveness.
  • DMI outperforms baselines significantly.
  • Performance is consistent across all tasks.


Std LMs

Dialog specific LMs

16 of 36

Finetuning Results: LM weights are also updated

  • DMI outperforms all baselines under both settings.
  • Discourse level features are highly suitable for pretraining effective representations.


Std LMs

Dialog specific LMs

17 of 36

Qualitative Analysis

  • Response selection from a pool of ~7K utterances


Context:

User 1: Are you busy tomorrow morning ?

Response:

By User 2 (Human): I'm free . What's up ?

By DMI Model: No , not this weekend . I have too much work to do .

Context:

User 1: Hi , Dan . What's new ?

User 2: Where have you been ? I've been trying to get in touch with you for two days .

User 1: I went to Salt Lake City on business . What's up ?

Response

By User 2 (Human): I got fired .

By DMI Model: Not much . I had to pay an unexpected bill , so I needed the money back .

Example Predictions on Empathetic-Intent Task


18 of 36

Conclusions

  • DMI: A novel pretraining objective for better modeling of dialogs that is structure aware.
  • DMI representations are much more effective and consistent across different downstream tasks.
  • Released pretrained DMI model checkpoints of various sizes.


19 of 36

Agenda

  • Introduction to Dialog Modeling
  • DMI-based Representation Learning [NAACL 22] (with IITKGP)
  • Representation Learning for Multimodal Persona Based Setting [ACL 23] (with IITD)
  • Representation Learning for In-Context Learning Models [EMNLP 23 (Findings)] (with IITKGP)
  • Outlook


20 of 36

What is the comic dialog generation problem?

  • Challenges: Visual narrative, multi-party dialog, personas.
  • ComSet dataset
    • 13 comics from GoComics
    • Each comic strip contains transcription and an image.
    • Parsing Transcripts: POS tagging, NER, dependency parsing.
    • Panel Segmentation: Faster R-CNN; 159K panels from 54K strips.
    • Dialogue Text Detection and Masking (EasyOCR).
    • Multimodal Alignment (edit distance): 238K utterances.
    • Persona fact generation for 202 characters.


21 of 36

MPDialog Model Architecture

  • MultiModal Embedding (MME)
    • Text encodings (TE)
      • 12L PersonaGPT-base
    • Visual embeddings (VE)
      • 12L CLIP-ViT vision encoder
      • Linearly projected & reshaped.
    • Interleave text and visual embeddings
    • Prepend persona info.
  • 12L PersonaGPT-base decoder.
  • Finetuned end to end.
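A shape-level sketch of the MME assembly step, assuming one panel per utterance and a shared embedding dimension; the exact interleaving order and projection are simplified assumptions here:

```python
import numpy as np

def build_mme(persona_emb, panel_embs, turn_embs):
    """Assemble the multimodal embedding sequence for the decoder:
    persona facts first, then each panel's (projected) visual
    embeddings interleaved with the corresponding utterance embeddings.

    persona_emb: (p, d) persona-fact embeddings.
    panel_embs:  list of (v_i, d) projected visual embeddings.
    turn_embs:   list of (t_i, d) utterance text embeddings.
    """
    seq = [persona_emb]
    for panel, turn in zip(panel_embs, turn_embs):
        seq.append(panel)  # visual context for this turn
        seq.append(turn)   # the utterance itself
    return np.concatenate(seq, axis=0)  # (total_len, d) decoder input
```

The decoder then attends over this single interleaved sequence, so visual panels and dialog turns share one representation space.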


22 of 36

How does MPDialog perform?

  • MPDialog > LM only and persona-based baselines
  • LM only models (DialoGPT and EDGE) cannot generate coherent responses (high perplexity and low MaUde) for comics.
  • Adding persona info reduces perplexity.
  • LM + persona + images > LM + persona > LM


23 of 36

Comic-wise Quantitative Analysis

(Comic-wise BLEURT, MaUde, and Perplexity plots)

  • MPDialog is best in most cases.
  • The "Cleats" comic focuses on the relationships between the characters, their sportsmanship, and the challenges of being part of a team.
    • Its images do not contain much additional information.

24 of 36

Qualitative Analysis


Human Evaluation Results

  • EDGE and BoB: overly banal responses.
  • DialoGPT: completely nonsensical responses.


25 of 36

Conclusions

  • ComSet: comics dataset with ~54K strips and 200+ personas
  • MPDialog: persona-based multimodal dialog baseline
  • Experiments: evidence that leveraging multimodality and persona orientation improves the quality of dialogues.


26 of 36

Agenda

  • Introduction to Dialog Modeling
  • DMI-based Representation Learning [NAACL 22] (with IITKGP)
  • Representation Learning for Multimodal Persona Based Setting [ACL 23] (with IITD)
  • Representation Learning for In-Context Learning Models [EMNLP 23 (Findings)] (with IITKGP)
  • Outlook


27 of 36

Using LLMs as dialog models



28 of 36

Optimizing the prompts

  • Manual versus Perplexity Optimized Prompts
  • Redundancies in conversations
    • Back-channeling, clarification, mistake correction.
    • Verbose model responses.
  • Shortening Dialog Histories
    • Selection
      • Recent-k; Semantic-k
    • Summarization
      • BART-D (DialogSum): 12L+12L
      • Pegasus-DS (DialogSum and SAMSum): 16L+16L
      • Pegasus-CD (CNN/DailyMail): 16L+16L
  • Shortening Background Information (Persona/topic)
    • BART, Pegasus.
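The two selection strategies can be sketched as follows; the cosine-similarity scoring for Semantic-k is an assumption about its general shape, not the paper's exact implementation:

```python
import numpy as np

def recent_k(history, k):
    """Keep only the k most recent turns of the dialog history."""
    return history[-k:]

def semantic_k(history, turn_embs, query_emb, k):
    """Keep the k turns most similar to the current query, in their
    original order; similarity is cosine over turn embeddings.
    (Scoring details here are assumptions for illustration.)"""
    q = np.asarray(query_emb, dtype=float)
    sims = [float(np.dot(e, q) / (np.linalg.norm(e) * np.linalg.norm(q)))
            for e in (np.asarray(t, dtype=float) for t in turn_embs)]
    # Take the k highest-similarity indices, then restore dialog order.
    keep = sorted(sorted(range(len(history)), key=lambda i: -sims[i])[:k])
    return [history[i] for i in keep]
```

Both strategies trade recall of the full history for a much shorter prompt; summarization (BART, Pegasus) is the alternative when no single turn carries the needed information.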

  • Absolute performance analysis
    • GPT-3 is best; Tk-Instruct is worst.
    • Perplexity-optimized prompts underperform manually engineered ones.
    • Best results with full dialog history for TopicalChat (TC).
    • For Multi-Session Chat (MSC), even prompts with summarized history do very well.


29 of 36

Analysis of Prompt Lengths



30 of 36

Usable information-density (UID)

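The UID definition on this slide did not survive extraction. Going only by the later conclusion that UID captures "usable information per token", a simple score-per-token proxy can be sketched as follows; this ratio form is an assumption, not the paper's exact formula:

```python
def uid_proxy(task_score, prompt_tokens):
    """Score-per-token proxy for usable information density."""
    return task_score / prompt_tokens

def best_representation(candidates):
    """Pick the history representation with the highest score per token.
    candidates: dict mapping name -> (task_score, prompt_tokens).
    (The ratio-based proxy is an assumption for illustration.)"""
    return max(candidates, key=lambda name: uid_proxy(*candidates[name]))
```

Under this proxy, a slightly weaker but much shorter representation (e.g. a Semantic-1 history) can dominate a full history, which matches the cost/performance tradeoff this section explores.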



32 of 36

Conclusions

  • Explored the tradeoff between model performance and cost for dialog systems.
  • UID: Representation of dialog history that provides the highest amount of usable information per token.
  • Insights
    • Summaries > full history.
    • Recent-k or Semantic-k > summaries.
    • Semantic-1 is best from both the accuracy and the UID perspectives.
    • Zero-shot > Few-shot.


33 of 36

Agenda

  • Introduction to Dialog Modeling
  • DMI-based Representation Learning [NAACL 22] (with IITKGP)
  • Representation Learning for Multimodal Persona Based Setting [ACL 23] (with IITD)
  • Representation Learning for In-Context Learning Models [EMNLP 23 (Findings)] (with IITKGP)
  • Outlook


34 of 36

Summary

  • Introduction to Dialog Modeling: Settings, motivation, metrics.
  • DMI-based Representation Learning
    • DMI: A novel pretraining objective for better modeling of dialogs that is structure aware.
    • DMI representations are much more effective and consistent across different downstream tasks.
  • Representation Learning for Multimodal Persona Based Setting
    • ComSet: comics dataset with ~54K strips and 200+ personas
    • MPDialog: persona-based multimodal dialog baseline
    • Leveraging multimodality and persona orientation improves the quality of dialogues.
  • Representation Learning for In-Context Learning Models
    • Explored the tradeoff between model performance and cost for dialog systems.
    • UID: Representation of dialog history that provides the highest amount of usable information per token.
    • Cost is important 🡺 Recent-1 and Semantic-1 (zero-shot)
    • Cost is less important 🡺 longer dialog summaries such as Pegasus-CD and Semantic-4 (zero-shot)


35 of 36

Research Opportunities

  • Current Directions
    • DSMH: Dialog System for Mental Health at Workplace
    • VideoDialogs: Dialogs for Educational Videos
    • CORAL: Contextual Response Retrievability Loss
  • Future Opportunities
    • Query auto-completion for (multimodal) dialogs
    • Generation of humorous utterances
    • Generation of text utterances jointly with panel images
    • Hate speech/toxicity detection in (multimodal) conversations
    • RAG (Retrieval Augmented Generation) with dialog models


36 of 36

Thanks!

gmanish@microsoft.com
