1 of 11

Multimodal Persona Based Generation of Comic Dialogs

Harsh Agrawal (IIT Delhi), Aditya M. Mishra (IIT Delhi), Manish Gupta (Microsoft), Mausam (IIT Delhi)

manishg.iitb@gmail.com

2 of 11

What is the comic dialog generation problem?

  • Task: Next utterance generation for comics
  • Inputs:
    • Conversation history with utterances
    • Aligned sequence of visual scenes
    • Persona facts for the comic characters
  • Expected Output: Next utterance by character
  • Goal: Multimodal dialog systems with a focus on comic strips
  • Challenges: Visual narrative, multi-party dialog, personas, humor
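
The inputs and output listed above can be pictured as one training example. The sketch below is only an illustration: the class names, field names, file names, and sample strings are hypothetical placeholders, not the dataset's actual schema or content.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    speaker: str      # character who speaks
    utterance: str    # what the character says
    panel_index: int  # index of the panel this utterance is aligned to


@dataclass
class ComicExample:
    history: List[Turn]         # conversation history before the target turn
    panel_paths: List[str]      # aligned sequence of visual scenes, in reading order
    persona_facts: List[str]    # persona facts for the target speaker
    target_speaker: str         # character whose next utterance is generated
    target_utterance: str = ""  # gold next utterance (empty at inference time)


# Placeholder values only; none of this is real ComSet data.
example = ComicExample(
    history=[Turn(speaker="Man", utterance="<previous utterance>", panel_index=0)],
    panel_paths=["panel_0.png", "panel_1.png"],
    persona_facts=["<persona fact about the target speaker>"],
    target_speaker="Bucky",
)
```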

Agrawal, Harsh, Aditya Mishra, Manish Gupta, and Mausam. "Multimodal Persona Based Generation of Comic Dialogs." In ACL, pp. 14150-14164. 2023.


3 of 11

Related Work

  • Dialogue Generation
    • DialoGPT: GPT-2 decoder pretrained on Reddit conversations.
    • EDGE: controls response generation by conditioning on semantic frames of exemplar responses.
    • BERT-over-BERT (BoB): a shared BERT encoder with two task-specific decoders, one for dialogue generation and one for consistency understanding.
    • PersonaGPT: GPT-2 finetuned on the PersonaChat dataset.
  • Multimodal Datasets for Dialogue Generation
    • COMICS: scanned images of comic strips, but no manually extracted transcripts or information about the comic characters; suffers from OCR detection inaccuracies.
    • PersonaChat: conversations between two agents with their corresponding persona facts, but no images.
    • ImageChat, PhotoChat and VisualDialog: conversations between speakers about a single reference image.


4 of 11

ComSet Dataset

  • 13 comics from GoComics
  • Each comic strip comes with a transcript and an image.
  • Parsing Transcripts
    • Parsing speaker (character) and utterance pairs from unstructured conversation transcripts.
    • Match list of comic characters with PERSON entities in transcripts.
    • Speakers that match no listed character are labeled OTHER: 17%.
    • Non-named entity speakers: Man, Woman, Stranger, Voice, Noise, Sound
    • Edit distance to handle spelling mistakes.
    • Free-form text, e.g. "Bucky is holding Smacky and says ...."
      • 4 parts: character/speaker name (Bucky), action or attribute phrase (is holding Smacky and), speaking verb (says, replies, asks, proclaims, etc.), and utterance.
      • Heuristics with POS, NER and dependency parsing.
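
The speaker-matching step above (match names against the character list, tolerate spelling mistakes via edit distance, fall back to OTHER) could look roughly like the sketch below. The threshold `max_dist=1`, the case-insensitive comparison, and the function names are assumptions made for illustration; the slides do not specify them.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def match_speaker(name: str, characters: List[str], max_dist: int = 1) -> str:
    """Return the closest known character, or "OTHER" if none is close enough.

    max_dist is an assumed tolerance, not a value taken from the slides.
    """
    if not characters:
        return "OTHER"
    name_l = name.lower()
    best = min(characters, key=lambda c: edit_distance(name_l, c.lower()))
    if edit_distance(name_l, best.lower()) <= max_dist:
        return best
    return "OTHER"
```

With a character list of `["Bucky", "Smacky"]`, a misspelled speaker such as "Buckey" would map to "Bucky", while a generic speaker such as "Stranger" would fall through to "OTHER".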

  • Panel Segmentation
    • Each strip image has several panels, with utterances spread across them.
    • Faster R-CNN with a ResNet-50 backbone, trained on 500 manually annotated panel bounding boxes.


5 of 11

ComSet Dataset

  • Dialogue Text Detection and Masking
    • EasyOCR extracts the text and its bounding boxes from each segmented panel.
    • The detected text bounding boxes are filled with random noise to mask the dialogue.
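
The masking step can be sketched with NumPy as follows. This is a minimal illustration that assumes the OCR boxes have already been converted to axis-aligned `(x0, y0, x1, y1)` pixel rectangles and that the panel is a `uint8` image array; the function name and the uniform noise distribution are assumptions, not details from the slides.

```python
import numpy as np


def mask_text_boxes(image: np.ndarray, boxes, seed: int = 0) -> np.ndarray:
    """Return a copy of `image` with every box overwritten by uniform random noise.

    image: uint8 array of shape (H, W) or (H, W, C)
    boxes: iterable of (x0, y0, x1, y1) rectangles, x1 and y1 exclusive (assumed format)
    """
    rng = np.random.default_rng(seed)
    masked = image.copy()
    for x0, y0, x1, y1 in boxes:
        region = masked[y0:y1, x0:x1]
        # Replace the text region with noise of the same shape.
        masked[y0:y1, x0:x1] = rng.integers(0, 256, size=region.shape, dtype=np.uint8)
    return masked
```

Masking keeps the visual scene but hides the dialogue text, so a model cannot simply read the target utterance off the panel.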

  • Multimodal Alignment
    • Compare the OCR text from each panel with the utterances from the transcript.
    • Score each pair with a Levenshtein distance-based similarity.
    • Find the best-matching panel for every utterance.
    • Use dynamic programming to obtain a monotonically increasing panel sequence.
  • Persona fact generation
    • Find personality traits for top characters
  • Split into 8 seen and 5 unseen comics; the seen set is split 70:10:20 into train, validation and test.
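
The multimodal alignment step above can be sketched as a small dynamic program. The version below is an illustration under assumptions: similarity is taken as 1 minus the normalized Levenshtein distance, text is lowercased before comparison, and several utterances may share a panel (so the panel sequence is non-decreasing). The exact scoring and tie-breaking used for ComSet are not given in the slides.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """1.0 for identical strings, 0.0 for maximally different ones (assumed scoring)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def align(utterances: List[str], panel_texts: List[str]) -> List[int]:
    """Assign each utterance a panel index so that indices never decrease
    and the summed similarity is maximal."""
    n, m = len(utterances), len(panel_texts)
    if n == 0 or m == 0:
        return []
    sim = [[similarity(u.lower(), p.lower()) for p in panel_texts] for u in utterances]
    score = [[0.0] * m for _ in range(n)]
    back = [[0] * m for _ in range(n)]
    score[0] = sim[0][:]
    for i in range(1, n):
        best_q = 0  # argmax of score[i-1][q] over q <= p
        for p in range(m):
            if score[i - 1][p] > score[i - 1][best_q]:
                best_q = p
            score[i][p] = sim[i][p] + score[i - 1][best_q]
            back[i][p] = best_q
    # Backtrack from the best final panel.
    p = max(range(m), key=lambda k: score[n - 1][k])
    path = [p]
    for i in range(n - 1, 0, -1):
        p = back[i][p]
        path.append(p)
    return path[::-1]
```

Picking the best panel for each utterance independently can produce an out-of-order sequence when OCR is noisy; the dynamic program trades a slightly worse local match for a globally consistent reading order.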


6 of 11

Overall Statistics of ComSet Dataset


Component               Acc.
Parsing transcripts     98%
Panel segmentation      97%
Text masking            96%
Multimodal alignment    95%
Overall accuracy        92%


7 of 11

MPDialog Model Architecture

  • Next Utterance Prediction Task
    • Predict: utterance at time t.
    • Given: A comic strip containing conversation history with utterances up to time t-1, and an aligned sequence of images including the one for time t.
    • Limit history to h=5.
  • Baseline Methods: Finetuned over ComSet-train
    • LM only: DialoGPT and EDGE
    • LM+Persona: PersonaGPT and BoB, which generate persona-consistent responses.
  • MPDialog
    • MultiModal Embedding (MME) encodes both text and images into a 768D space
      • Computes text encodings using a 12L PersonaGPT-base (TE) layer
      • Computes visual token embeddings using 12L CLIP-ViT Vision encoder (VE)
      • Linearly projects each D-sized embedding to n×D and reshapes it to n tokens of size D
      • Interleaves text and visual token embeddings, with panel dialogues preceded by respective panel embeddings
      • Prepend persona information.
    • 12L PersonaGPT-base decoder generates output tokens based on the encoded embeddings
    • Finetuned end to end.
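
The projection and interleaving steps of the MME can be illustrated with plain NumPy arrays. This is a shape-level sketch only: random and constant arrays stand in for the PersonaGPT text embeddings and CLIP image embeddings, `N_VIS = 4` visual tokens per panel is an assumed value, and the projection weights would be learned during finetuning rather than sampled as here.

```python
import numpy as np

D = 768    # shared embedding size (from the slide)
N_VIS = 4  # visual tokens per panel; assumed value for illustration

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(D, N_VIS * D))  # stand-in for a learned projection
b = np.zeros(N_VIS * D)


def project_panel(panel_emb: np.ndarray) -> np.ndarray:
    """Linearly map one D-sized image embedding to N_VIS*D, then reshape to (N_VIS, D)."""
    return (panel_emb @ W + b).reshape(N_VIS, D)


def build_sequence(persona_emb, panel_embs, dialog_embs) -> np.ndarray:
    """Persona tokens first, then for each panel its visual tokens followed by
    the token embeddings of the dialogue in that panel."""
    parts = [persona_emb]
    for panel_emb, text_emb in zip(panel_embs, dialog_embs):
        parts.append(project_panel(panel_emb))
        parts.append(text_emb)
    return np.concatenate(parts, axis=0)


# Two panels with 5 and 2 dialogue tokens, plus 3 persona tokens (made-up sizes).
seq = build_sequence(
    persona_emb=np.zeros((3, D)),
    panel_embs=[np.ones(D), np.ones(D)],
    dialog_embs=[np.zeros((5, D)), np.zeros((2, D))],
)
```

The resulting array has one row per input token, so it can be fed to a decoder in place of ordinary token embeddings.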


8 of 11

Main Results

  • MPDialog outperforms both the LM-only and the persona-based baselines.
  • LM only models (DialoGPT and EDGE) cannot generate coherent responses (high perplexity and low MaUde) for comics.
  • Adding persona info reduces perplexity.
  • LM + persona + images > LM + persona > LM
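
Perplexity, used in the comparisons above, is the exponential of the average negative log-likelihood per token, so lower is better. The helper below is a generic sketch of that definition; the function name and the input format (natural-log probabilities of the reference tokens) are assumptions, and it is not the evaluation script behind these results.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp(-mean(log p)) over the reference tokens; log probabilities are natural logs."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

A model that assigns probability 0.25 to every reference token has perplexity 4, as if it were choosing uniformly among four options at each step.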


9 of 11

Comic-wise Quantitative Analysis


[Figure: comic-wise results, one panel per metric: BLEURT, MaUde, Perplexity]

  • For most comics, MPDialog performs better than the other models on all three metrics.
  • The Cleats comic focuses on the relationships between the characters, their sportsmanship and the challenges of being part of a team.
    • Its images do not contain much additional information.


10 of 11

Qualitative Analysis

  • Fluency: How easy is the response to understand?
  • Engagingness: How interesting and unique is the response?
  • Dialog/Scene Consistency: How consistent is the response with the dialogue/image history?
  • Persona Detection: Given the persona facts of two characters, which persona does the response match?

Human Evaluation Results

  • Character-level KL divergence between the train vocabulary distribution and the generated-output vocabulary distribution is lowest for MPDialog.
  • Responses from the other, language-only models are either too banal (EDGE and BoB) or completely nonsensical (DialoGPT).
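
One way to compute such a KL divergence is sketched below. It rests on interpretations that the slides do not confirm: "character level" is read as per comic character, "vocab distribution" as a unigram word distribution over whitespace tokens, the direction is KL(train || generated), and additive smoothing with `eps` avoids division by zero. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def unigram_dist(texts: List[str], vocab: List[str], eps: float = 1e-6) -> Dict[str, float]:
    """Smoothed unigram distribution over `vocab` (eps is an assumed smoothing constant)."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts[w] for w in vocab) + eps * len(vocab)
    return {w: (counts[w] + eps) / total for w in vocab}


def vocab_kl(train_texts: List[str], generated_texts: List[str]) -> float:
    """KL(train || generated) over the union vocabulary of both text sets."""
    vocab = sorted({w for t in train_texts + generated_texts for w in t.lower().split()})
    p = unigram_dist(train_texts, vocab)
    q = unigram_dist(generated_texts, vocab)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocab)


def per_character_kl(train_by_char: Dict[str, List[str]],
                     gen_by_char: Dict[str, List[str]]) -> Dict[str, float]:
    """One KL value per comic character; characters without generations get an empty list."""
    return {c: vocab_kl(train_by_char[c], gen_by_char.get(c, [])) for c in train_by_char}
```

A lower value means the generated responses use words in proportions closer to the character's training utterances.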


11 of 11

Conclusion

  • ComSet: comics dataset with ~54K strips and 200+ personas
  • MPDialog: persona-based multimodal dialog baseline
  • Experiments: evidence that leveraging multimodality and persona orientation improves the quality of dialogues.
  • Future Research Opportunities
    • Make responses contextually coherent
    • Generation of humorous utterances
    • Explore generation of next utterances jointly with panel images
