1 of 36

Representation Learning for Dialog Models

Manish Gupta

gmanish@microsoft.com

7th Dec 2023

2 of 36

Recent work

  • Bing Query Auto Completion team at Hyderabad.
  • Adjunct Faculty (IIIT-H) since 2013; Visiting Faculty (ISB) since 2016; 100+ publications and 2 books; 7500+ citations; 15 tutorials; served on 50+ program committees.
  • Microsoft Academic Partnership Grant since 2021.
  • Areas of interest*
    • Dialog Systems: [ACL 23, NAACL 22, EMNLP 23 (Findings)]
    • Cross-Lingual Text Generation: [WebConf 23, ECAI 23]
    • Multimodal NLP: [IJCAI 23, EMNLP 22].
    • Other NLP: [ECIR 23, PKDD 23, EMNLP 23 (Findings)]
    • Cognitive Neuroscience: [InterSpeech 23, COLING 22, ACL 23, NeurIPS 23, NAACL 22], Tutorials [IJCAI 23, IJCNN 23, CogSci 22].
    • Query Auto-Completion: [PKDD 23], Tutorials [ECIR 23, IJCAI 22]
  • Collaborators: IIIT-H, IIT-KGP, IIT-D, IIT-H, IIT-J, Inria, MPI-SWS.
  • YouTube (Data Science Gems)


3 of 36

Agenda

  • Introduction to Dialog Modeling
  • DMI-based Representation Learning [NAACL 22] (with IITKGP)
  • Representation Learning for Multimodal Persona Based Setting [ACL 23] (with IITD)
  • Representation Learning for In-Context Learning Models [EMNLP 23 (Findings)] (with IITKGP)
  • Outlook


4 of 36

What is dialog modeling?

  • Given
    • (Text or multimodal) Conversation history
    • Persona of users
    • Topic of conversation
    • Any other context
  • Generate
    • Response
  • Challenge
    • How do I produce the best input representation?
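As a minimal illustration of how these inputs come together, the context, persona, and topic can be flattened into a single model input string. The field labels and separator token below are illustrative assumptions, not the format of any specific system:

```python
def build_dialog_input(history, persona=None, topic=None, sep=" <sep> "):
    """Flatten the dialog inputs into one model input string.

    history: list of utterance strings, oldest first.
    persona: optional list of persona facts for the responding speaker.
    topic:   optional topic/knowledge string.
    (Field labels and separators here are illustrative assumptions.)
    """
    parts = []
    if persona:
        parts.append("persona: " + " ".join(persona))
    if topic:
        parts.append("topic: " + topic)
    parts.append(sep.join(history))
    return " | ".join(parts)

example = build_dialog_input(
    history=["hi , how are you doing ?", "you must be very fast ."],
    persona=["i like to remodel homes.", "i like to go hunting."],
)
```

The challenge above is exactly about what representation this flattened input should take, and how much of each field to include.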


S1: hi , how are you doing ? i am getting ready to do some cheetah chasing to stay in shape .

S2: you must be very fast . hunting is one of my favorite hobbies .

S1: i am ! for my hobby i like to do canning or some whittling .

S2: i also remodel homes when i am not out bow hunting.

Persona for S2: i like to remodel homes. i like to go hunting. i like to shoot a bow. my favorite holiday is halloween.

(Reddit and PersonaChat dataset examples, shown as context → response pairs; the dialog above is from PersonaChat.)

  • Matthew Henderson, Paweł Budzianowski, Iñigo Casanueva, Sam Coope, Daniela Gerz, Girish Kumar, Nikola Mrkšić, Georgios Spithourakis, Pei-Hao Su, Ivan Vulić, and Tsung-Hsien Wen. 2019. A repository of conversational datasets. In First Workshop on NLP for Conversational AI, pages 1–10, Florence, Italy.
  • Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In ACL, pages 2204–2213.

5 of 36

What is dialog modeling?

  • Given
    • (Text or multimodal) Conversation history
    • Persona of users
    • Topic of conversation
    • Any other context
  • Generate
    • Response
  • Challenge
    • How do I produce the best input representation?

Karthik Gopalakrishnan, Behnam Hedayatnia, Qinlang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, and Dilek Hakkani-Tür. 2019. Topical-chat: Towards knowledge-grounded open-domain conversations. Proc. Interspeech 2019, pages 1891–1895.


TopicalChat Dataset Example (context → response)

Knowledge section for topic "Fish"

A fish is an aquatic, craniate, gill-bearing animal that lacks limbs with digits. Included in this definition are the living hagfish, lampreys, and cartilaginous and bony fish as well as various extinct related groups. Approximately 95% of living fish species are ray-finned fish, belonging to the class Actinopterygii, with around 99% of those being teleosts.

S1: I think fish are so cool there is actually a breed of jellyfish that is immortal.

S2: i had rememered hearing about that before. Immortatlity is wasted on a jellyfish haha. did you know a seahorse is the only fish that has an actual neck?

S1: That is so funny I guess I never considered a seahorse a fish. The black swallower fish sounds a lot like a snake because it can eat pray that is so large.

S2: i guess they live up to their name then!

S1: It seems they do. I also didn't know that there was a difference with how freshwater and saltwater fish drink.

6 of 36

What is dialog modeling?

  • Given
    • (Text or multimodal) Conversation history
    • Persona of users
    • Topic of conversation
    • Any other context
  • Generate
    • Response
  • Challenge
    • How do I produce the best input representation?

Harsh Agrawal, Aditya Mishra, Manish Gupta, and Mausam. 2023. Multimodal persona based generation of comic dialogs. In ACL, pages 14150–14164.


ComSet Dataset Example (context → response)

7 of 36

Why care about dialog modeling?

  • Remember ELIZA in Emacs? ☺
  • Domain-specific (task-oriented) customer service: banking, airports, tech-help, car rental
  • Personal assistants: Siri, Alexa
  • Tutoring: KhanAcademy
  • Healthcare support: OneRemission, Youper, Babylon Health
  • Entertainment/chitchat systems: Virtual friend; Ruuh
  • ….
  • LLMs 🡪 Bing Chat


8 of 36

What are popular approaches for dialog modeling?

  • Seq2Seq neural NLG models
    • Standard pretrained models: BERT, ELMo, GPT-2.
    • Pretrained on dialog data: DialoGPT, BlenderBot, Meena, EDGE.
    • Pretrained with dialog specific losses: DialogRPT, ContextPretrain, ConveRT.
      • Response ranking, next-utterance retrieval, next-utterance generation, masked-utterance retrieval, inconsistency identification
    • Finetuned on persona-based data: Bert-over-Bert, PersonaGPT.
  • What is missing?
    • Can there be a loss that is conscious of (context, response) structure of dialogs?
    • How to model multimodal dialogs?
    • How to use GPTx for dialog modeling?


9 of 36

What are popular metrics for evaluating dialog systems?

  • Multi-class Classification: Accuracy
  • Response Selection (Retrieval): R@1, R@2, MRR
  • Generation
    • Perplexity
    • Syntactic: Unigram F1, ROUGE, BLEU
    • Syntax + surface forms + stemmed forms + meanings: METEOR
    • Semantic: BLEURT
    • Dialog specific semantic: MaUde, DEB
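For concreteness, two of the lighter-weight metrics above can be sketched in a few lines. This is a generic whitespace-tokenized sketch, not the exact tokenization used in any of the cited evaluation toolkits:

```python
from collections import Counter

def unigram_f1(hypothesis, reference):
    """Token-overlap F1 between a generated response and a reference."""
    hyp, ref = hypothesis.split(), reference.split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def mean_reciprocal_rank(rankings):
    """MRR for response selection; each ranking flags the gold response."""
    total = 0.0
    for ranking in rankings:
        for rank, is_gold in enumerate(ranking, start=1):
            if is_gold:
                total += 1.0 / rank
                break
    return total / len(rankings)
```

R@1 and R@2 follow the same shape as MRR, simply checking whether the gold response falls in the top 1 or top 2 positions.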


10 of 36

Agenda

  • Introduction to Dialog Modeling
  • DMI-based Representation Learning [NAACL 22] (with IITKGP)
  • Representation Learning for Multimodal Persona Based Setting [ACL 23] (with IITD)
  • Representation Learning for In-Context Learning Models [EMNLP 23 (Findings)] (with IITKGP)
  • Outlook


11 of 36

Why learn a new representation for dialog systems?

  • Factors impacting effectiveness of pretrained models
    • Pretraining corpus
    • Loss function
    • Downstream tasks
  • Standard pretraining objectives are unaware of dialog structure


Word-level reasoning

Discourse-level reasoning

12 of 36

How do we optimize with dialog structure awareness?



13 of 36

How is InfoNCE-S computed?

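The equation on this slide did not survive extraction. As a rough sketch of the underlying idea only, a generic in-batch InfoNCE objective over (context, response) pairs looks like the following; the paper's InfoNCE-S estimator differs in its exact sampling and scoring details, so treat this as an assumed illustration:

```python
import numpy as np

def infonce_loss(ctx_emb, resp_emb, temperature=0.1):
    """In-batch InfoNCE: each context's paired response is the positive;
    the other responses in the batch act as negatives.

    ctx_emb, resp_emb: (batch, dim) context/response encodings.
    Returns the mean negative log-likelihood of the correct pairing.
    """
    ctx = ctx_emb / np.linalg.norm(ctx_emb, axis=1, keepdims=True)
    resp = resp_emb / np.linalg.norm(resp_emb, axis=1, keepdims=True)
    logits = ctx @ resp.T / temperature          # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # diagonal = matched pairs
```

Minimizing this loss pulls each context embedding toward its own response and away from the other responses in the batch, which is the (context, response) structure awareness that standard LM objectives lack.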


14 of 36

Experimental Setup

  • Pretraining: Subset of Reddit-727M
  • Dialog Unrolling for Pretraining: 2.7B context–response (CR) pairs; used ~10%


Downstream task details. Adv.: Adversarial, Neg.: Negative


15 of 36

Probing (Frozen LM) Results

  • Analyze pretrained model’s effectiveness.
  • DMI outperforms baselines significantly.
  • Performance is consistent across all tasks.


Std LMs

Dialog specific LMs

16 of 36

Finetuning Results: LM weights are also updated

  • DMI outperforms all baselines under both settings.
  • Discourse level features are highly suitable for pretraining effective representations.


Std LMs

Dialog specific LMs

17 of 36

Qualitative Analysis

  • Response selection from a pool of ~7K utterances


Context:

User 1: Are you busy tomorrow morning ?

Response:

By User 2 (Human): I'm free . What's up ?

By DMI Model: No , not this weekend . I have too much work to do .

Context:

User 1: Hi , Dan . What's new ?

User 2: Where have you been ? I've been trying to get in touch with you for two days .

User 1: I went to Salt Lake City on business . What's up ?

Response

By User 2 (Human): I got fired .

By DMI Model: Not much . I had to pay an unexpected bill , so I needed the money back .

Example Predictions on Empathetic-Intent Task


18 of 36

Conclusions

  • DMI: A novel pretraining objective for better modeling of dialogs that is structure aware.
  • DMI representations are much more effective and consistent across different downstream tasks.
  • Released pretrained DMI model checkpoints of various sizes.


19 of 36

Agenda

  • Introduction to Dialog Modeling
  • DMI-based Representation Learning [NAACL 22] (with IITKGP)
  • Representation Learning for Multimodal Persona Based Setting [ACL 23] (with IITD)
  • Representation Learning for In-Context Learning Models [EMNLP 23 (Findings)] (with IITKGP)
  • Outlook


20 of 36

What is the comic dialog generation problem?

  • Challenges: Visual narrative, multi-party dialog, personas.
  • ComSet dataset
    • 13 comics from GoComics
    • Each comic strip contains transcription and an image.
    • Parsing Transcripts: POS tagging, NER, dependency parsing.
    • Panel Segmentation: Faster R-CNN; 159K panels from 54K strips.
    • Dialogue Text Detection and Masking (EasyOCR).
    • Multimodal Alignment (edit distance): 238K utterances.
    • Persona fact generation for 202 characters.


21 of 36

MPDialog Model Architecture

  • MultiModal Embedding (MME)
    • Text encodings (TE)
      • 12L PersonaGPT-base
    • Visual embeddings (VE)
      • 12L CLIP-ViT vision encoder
      • Linearly projected & reshaped.
    • Interleave text and visual embeddings
    • Prepend persona info.
  • 12L PersonaGPT-base decoder.
  • Finetuned end to end.
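A shape-level sketch of the MME assembly step, assuming one panel per utterance and a shared embedding dimension; the exact interleaving order and projection are simplified assumptions here:

```python
import numpy as np

def build_mme(persona_emb, panel_embs, turn_embs):
    """Assemble the multimodal embedding sequence for the decoder:
    persona facts first, then each panel's (projected) visual
    embeddings interleaved with the corresponding utterance embeddings.

    persona_emb: (p, d) persona-fact embeddings.
    panel_embs:  list of (v_i, d) projected visual embeddings.
    turn_embs:   list of (t_i, d) utterance text embeddings.
    """
    seq = [persona_emb]
    for panel, turn in zip(panel_embs, turn_embs):
        seq.append(panel)  # visual context for this turn
        seq.append(turn)   # the utterance itself
    return np.concatenate(seq, axis=0)  # (total_len, d) decoder input
```

The decoder then attends over this single interleaved sequence, so visual panels and dialog turns share one representation space.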


22 of 36

How does MPDialog perform?

  • MPDialog > LM only and persona-based baselines
  • LM only models (DialoGPT and EDGE) cannot generate coherent responses (high perplexity and low MaUde) for comics.
  • Adding persona info reduces perplexity.
  • LM + persona + images > LM + persona > LM


23 of 36

Comic-wise Quantitative Analysis

(Comic-wise BLEURT, MaUde, and Perplexity plots)

  • MPDialog is best in most cases.
  • The "Cleats" comic focuses on the relationships between the characters, their sportsmanship, and the challenges of being part of a team.
    • Its images do not contain much additional information.

24 of 36

Qualitative Analysis


Human Evaluation Results

  • EDGE and BoB: overly banal responses.
  • DialoGPT: completely nonsensical responses.


25 of 36

Conclusions

  • ComSet: comics dataset with ~54K strips and 200+ personas
  • MPDialog: persona-based multimodal dialog baseline
  • Experiments: evidence that leveraging multimodality and persona orientation improves the quality of dialogues.


26 of 36

Agenda

  • Introduction to Dialog Modeling
  • DMI-based Representation Learning [NAACL 22] (with IITKGP)
  • Representation Learning for Multimodal Persona Based Setting [ACL 23] (with IITD)
  • Representation Learning for In-Context Learning Models [EMNLP 23 (Findings)] (with IITKGP)
  • Outlook


27 of 36

Using LLMs as dialog models



28 of 36

Optimizing the prompts

  • Manual versus Perplexity Optimized Prompts
  • Redundancies in conversations
    • Back-channeling, clarification, mistake correction.
    • Verbose model responses.
  • Shortening Dialog Histories
    • Selection
      • Recent-k; Semantic-k
    • Summarization
      • BART-D (DialogSum): 12L+12L
      • Pegasus-DS (DialogSum and SAMSum): 16L+16L
      • Pegasus-CD (CNN/DailyMail): 16L+16L
  • Shortening Background Information (Persona/topic)
    • BART, Pegasus.
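The two selection strategies can be sketched as follows; the cosine-similarity scoring for Semantic-k is an assumption about its general shape, not the paper's exact implementation:

```python
import numpy as np

def recent_k(history, k):
    """Keep only the k most recent turns of the dialog history."""
    return history[-k:]

def semantic_k(history, turn_embs, query_emb, k):
    """Keep the k turns most similar to the current query, in their
    original order; similarity is cosine over turn embeddings.
    (Scoring details here are assumptions for illustration.)"""
    q = np.asarray(query_emb, dtype=float)
    sims = [float(np.dot(e, q) / (np.linalg.norm(e) * np.linalg.norm(q)))
            for e in (np.asarray(t, dtype=float) for t in turn_embs)]
    # Take the k highest-similarity indices, then restore dialog order.
    keep = sorted(sorted(range(len(history)), key=lambda i: -sims[i])[:k])
    return [history[i] for i in keep]
```

Both strategies trade recall of the full history for a much shorter prompt; summarization (BART, Pegasus) is the alternative when no single turn carries the needed information.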

  • Absolute performance analysis
    • GPT-3 is best; Tk-Instruct is worst.
    • Perplexity-optimized prompts underperform manually engineered ones.
    • Best results with full dialog history for TopicalChat (TC).
    • For Multi-Session Chat (MSC), even prompts with summarized history do very well.


29 of 36

Analysis of Prompt Lengths



30 of 36

Usable information-density (UID)

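The UID definition on this slide did not survive extraction. Going only by the later conclusion that UID captures "usable information per token", a simple score-per-token proxy can be sketched as follows; this ratio form is an assumption, not the paper's exact formula:

```python
def uid_proxy(task_score, prompt_tokens):
    """Score-per-token proxy for usable information density."""
    return task_score / prompt_tokens

def best_representation(candidates):
    """Pick the history representation with the highest score per token.
    candidates: dict mapping name -> (task_score, prompt_tokens).
    (The ratio-based proxy is an assumption for illustration.)"""
    return max(candidates, key=lambda name: uid_proxy(*candidates[name]))
```

Under this proxy, a slightly weaker but much shorter representation (e.g. a Semantic-1 history) can dominate a full history, which matches the cost/performance tradeoff this section explores.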



32 of 36

Conclusions

  • Explored the tradeoff between model performance and cost for dialog systems.
  • UID: Representation of dialog history that provides the highest amount of usable information per token.
  • Insights
    • Summaries > full history.
    • Recent-k or Semantic-k > summaries.
    • Semantic-1 is best from both the accuracy and the UID perspectives.
    • Zero-shot > Few-shot.


33 of 36

Agenda

  • Introduction to Dialog Modeling
  • DMI-based Representation Learning [NAACL 22] (with IITKGP)
  • Representation Learning for Multimodal Persona Based Setting [ACL 23] (with IITD)
  • Representation Learning for In-Context Learning Models [EMNLP 23 (Findings)] (with IITKGP)
  • Outlook


34 of 36

Summary

  • Introduction to Dialog Modeling: Settings, motivation, metrics.
  • DMI-based Representation Learning
    • DMI: A novel pretraining objective for better modeling of dialogs that is structure aware.
    • DMI representations are much more effective and consistent across different downstream tasks.
  • Representation Learning for Multimodal Persona Based Setting
    • ComSet: comics dataset with ~54K strips and 200+ personas
    • MPDialog: persona-based multimodal dialog baseline
    • Leveraging multimodality and persona orientation improves the quality of dialogues.
  • Representation Learning for In-Context Learning Models
    • Explored the tradeoff between model performance and cost for dialog systems.
    • UID: Representation of dialog history that provides the highest amount of usable information per token.
    • Cost is important 🡺 Recent-1 and Semantic-1 (zero-shot)
    • Cost is less important 🡺 longer dialog summaries such as Pegasus-CD and Semantic-4 (zero-shot)


35 of 36

Research Opportunities

  • Current Directions
    • DSMH: Dialog System for Mental Health at Workplace
    • VideoDialogs: Dialogs for Educational Videos
    • CORAL: Contextual Response Retrievability Loss
  • Future Opportunities
    • Query auto-completion for (multimodal) dialogs
    • Generation of humorous utterances
    • Generation of text utterances jointly with panel images
    • Hate speech/toxicity detection in (multimodal) conversations
    • RAG (Retrieval Augmented Generation) with dialog models


36 of 36

Thanks!

gmanish@microsoft.com
