ECIS-VQG: Generation of Entity-centric Information-seeking Questions from Videos

Arpan Phukan¹, Manish Gupta², Asif Ekbal¹*

¹AI-NLP-ML Lab, Department of Computer Science and Engineering, Indian Institute of Technology Patna, Patna, India
²Microsoft
*Indian Institute of Technology Jodhpur (on lien from IIT Patna)

References

  1. Vaswani, A., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 5998-6008.
  2. Raffel, C., et al. (2020). Exploring the limits of transfer learning with a unified text-to-text Transformer. Journal of Machine Learning Research, 21(140), 1-67.
  3. Lewis, M., et al. (2020). BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  4. Romero, M. (2021). T5 (base) fine-tuned on SQuAD for QG via AP. https://huggingface.co/mrm8488/t5-base-finetuned-question-generation-ap
  5. Ushio, A., Alva-Manchego, F., & Camacho-Collados, J. (2022). Generative language models for paragraph-level question generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 670-688, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  6. AI@Meta. (2024). Llama 3 model card.

Problem Statement

Generation of entity-centric information-seeking questions from videos. Our work addresses three key challenges: identifying question-worthy information, linking it to entities, and effectively utilizing multimodal signals.

Conclusions and Future Direction

  • This work introduced the new task of generating entity-centric information-seeking (ECIS) questions from videos, filling a gap in existing question generation research.
  • A Transformer-based model incorporating multimodal inputs (text and video) significantly improved question generation quality.
  • Combining contrastive loss with cross-entropy loss enabled the generation of more distinct and entity-specific questions.
  • We developed a new dataset, VideoQuestions, enabling better model training and evaluation for video question generation.
  • The proposed method shows strong potential for e-learning, search engines, and video-based chatbots, improving the generation of meaningful, context-rich questions.
  • Future work could extend the system to handle more complex questions, support multiple languages, and cover a broader range of video content types.

Contribution

  • We propose the task of entity-centric information-seeking question generation from videos.
  • We create the VideoQuestions dataset with over 2,000 annotated questions.
  • We develop a Transformer-based model that uses both text and video inputs.
  • We analyze the efficacy of various Transformer encoder-decoder techniques, prompt engineering methods, multimodal video information encoding approaches, and cross-entropy combined with contrastive loss for better question generation.
  • We achieve superior performance on metrics such as BLEU, ROUGE, and CIDEr.

VideoQuestions Dataset

  • We curate a dataset of YouTube videos (each under 10 minutes in duration) and retain those that have an English transcript (see the filtering sketch after this list).
  • Category distribution: Education: 121 videos, Entertainment: 32, How to & Style: 90, News & Politics: 8, People & Blogs: 75, Science & Technology: 65, Travel & Events: 20.
  • The VideoQuestions dataset contains 411 videos with an average length of ~6 minutes.
  • It comprises 2,789 chapter titles with an average duration of ~48 seconds per chapter.
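The curation criteria above can be expressed as a simple filter. Below is a minimal sketch, assuming a pre-collected pool of (video id, duration) candidates and using the third-party youtube_transcript_api package (not mentioned on the poster; API as of its 0.6.x releases) to check for an English transcript; the IDs are placeholders.

```python
from youtube_transcript_api import YouTubeTranscriptApi

# Placeholder candidate pool: (YouTube video id, duration in minutes).
candidates = [("VIDEO_ID_1", 3.5), ("VIDEO_ID_2", 12.0)]

def has_english_transcript(video_id: str) -> bool:
    """True if an English transcript can be fetched for the video."""
    try:
        YouTubeTranscriptApi.get_transcript(video_id, languages=["en"])
        return True
    except Exception:  # transcripts disabled, not found, network error, ...
        return False

# Retain videos shorter than 10 minutes that have an English transcript.
retained = [(vid, mins) for vid, mins in candidates
            if mins < 10 and has_english_transcript(vid)]
print(retained)
```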

Results and Analysis

Models           BLEU-1   CIDEr   METEOR   BERT-Score   ROUGE-L
Llama3-8B          19.8    2.33     45.5         68.0      38.1
Qwen-VL             2.7    0.26     31.8         56.8      17.2
GPT-4o              7.1    0.87     41.6         64.7      25.6
Ushio et al.        6.4    0.779    28.1         59.3      21.8
Proposed Model     71.3    7.311    81.9         90.0      78.6
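As a point of reference, corpus-level metrics of this kind can be computed with the Hugging Face evaluate package; the sketch below is an assumption about tooling (the poster does not name a toolkit), and CIDEr is omitted because it is not bundled with evaluate.

```python
import evaluate

# Toy prediction/reference pair; real evaluation runs over the whole test set.
predictions = ["where can you buy an eames lounge chair?"]
references = ["where can you buy an eames lounge chair?"]

# BLEU-1: unigram overlap (max_order=1); references are given as a list per example.
bleu = evaluate.load("bleu")
print(bleu.compute(predictions=predictions,
                   references=[[r] for r in references],
                   max_order=1)["bleu"])

# ROUGE-L: longest-common-subsequence F-measure.
rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references)["rougeL"])

# BERTScore: semantic similarity from contextual embeddings (downloads a model).
bertscore = evaluate.load("bertscore")
print(bertscore.compute(predictions=predictions, references=references, lang="en")["f1"])
```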

Observations:

  • Fine-tuning Transformer-based encoder-decoder models works better than prompt engineering with Alpaca and GPT.
  • Adding video embedding input (CLIP-based embeddings) leads to improvements for BART, indicating the importance of effectively encoding the visual information in video clips.
  • A combination of contrastive loss and cross-entropy loss works better than cross-entropy loss alone (a minimal sketch of such a combination follows below).
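The poster does not spell out how the two losses are combined; the following is a minimal PyTorch-style sketch, assuming token-level cross-entropy plus an in-batch InfoNCE term over pooled context and gold-question representations. The function names, pooling choice, temperature, and weighting factor alpha are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.1):
    """In-batch contrastive (InfoNCE) loss: row i of `anchors` should match
    row i of `positives` and be pushed away from all other rows."""
    a = F.normalize(anchors, dim=-1)            # (batch, hidden)
    p = F.normalize(positives, dim=-1)          # (batch, hidden)
    logits = a @ p.t() / temperature            # (batch, batch) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)

def combined_loss(lm_logits, target_ids, context_embeds, question_embeds,
                  pad_id=0, alpha=0.5):
    """Cross-entropy over generated question tokens plus a contrastive term
    that ties each multimodal context to its own gold question."""
    ce = F.cross_entropy(
        lm_logits.reshape(-1, lm_logits.size(-1)),  # (batch*seq_len, vocab)
        target_ids.reshape(-1),                     # (batch*seq_len,)
        ignore_index=pad_id,
    )
    return ce + alpha * info_nce(context_embeds, question_embeds)
```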

VideoQuestions Dataset Statistics

Category               Number of Videos   Average Video Length (min)   Number of Empty Video Transcripts
Education                    121                   6.34                               9
Entertainment                 32                   6.07                               2
How to & Style                90                   5.76                               2
News & Politics                8                   5.76                               2
People & Blogs                75                   6.18                               6
Science & Technology          65                   5.38                               1
Travel & Events               20                   6.92                               0

Motivation

There are very few studies on video-based QG:
  • Questions are generated only from transcripts, or are about common objects and attributes.

Video-understanding-based questions:
  • "What did the person get a plastic bowl from?", "What will I show you?", etc.
  • Their answers are too contextually dependent on the video content and do not provide generally applicable information-seeking questions.

Proposed System (ECIS-VQG)

Architecture of the proposed method, indicating components such as the input representations, the chapter-title classifier, and the Transformer encoder-decoder model. Inputs are shown in orange, outputs in green, models in blue, and loss functions in pink. Note that loss computation happens at train time only. The prompt is used for Alpaca only; the cross-attention Transformer layer and video embedding are not used for Alpaca.
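One way to read the cross-attention fusion described in the caption is sketched below, assuming a Hugging Face BART backbone and pre-computed CLIP frame embeddings. The projection layer, residual fusion, and module names are illustrative assumptions rather than the authors' exact architecture.

```python
import torch.nn as nn
from transformers import BartForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

class VideoFusedBart(nn.Module):
    """BART with one extra cross-attention layer letting the text encoder
    states attend over CLIP frame embeddings (illustrative sketch)."""

    def __init__(self, model_name="facebook/bart-base", clip_dim=512, num_heads=8):
        super().__init__()
        self.bart = BartForConditionalGeneration.from_pretrained(model_name)
        d_model = self.bart.config.d_model
        self.video_proj = nn.Linear(clip_dim, d_model)   # map CLIP space to BART space
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, input_ids, attention_mask, video_embeds, labels=None):
        # Encode the textual context (chapter title, video title, captions, summary).
        text_states = self.bart.get_encoder()(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                              # (B, T, d_model)

        # Text states attend over projected CLIP frame embeddings, then residual add.
        vid = self.video_proj(video_embeds)              # (B, F, d_model)
        fused, _ = self.cross_attn(query=text_states, key=vid, value=vid)
        enc = BaseModelOutput(last_hidden_state=text_states + fused)

        # Decode the target question conditioned on the fused encoder states;
        # the language-modelling cross-entropy is computed against `labels`.
        return self.bart(
            attention_mask=attention_mask,
            encoder_outputs=enc,
            labels=labels,
        )
```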

Note: This work is accepted at EMNLP 2024

Two examples of the ECIS QG task. In Example 1, although the existing QG model (Romero, 2021) generates a grammatically sound question, it lacks key contextual information such as the place (Where is the food cheap?) or the subject (Which food item?). In Example 2, without the particular chair's name, the question generated by the existing QG model is too broad.
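For concreteness, the off-the-shelf QG baseline from reference 4 can be queried roughly as follows with transformers; the answer/context input format follows that model's card, and the chair example is invented for illustration rather than taken from the poster.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "mrm8488/t5-base-finetuned-question-generation-ap"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# The model expects "answer: <answer span> context: <passage>" as input.
context = "The Eames Lounge Chair was released in 1956 by Herman Miller."
answer = "Eames Lounge Chair"
inputs = tokenizer(f"answer: {answer} context: {context}", return_tensors="pt")

# Generate a question about the answer span; retaining the entity name in the
# question is what distinguishes an ECIS-style question from a generic one.
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```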

Human Evaluation

Notation:

  • B = BART-large, T5 = T5-base, C = Chapter Title, V = Video Title, F = Frame Caption, T = Transcript, and S(F,T) = Summary of F and T generated by GPT-3.5-Turbo.
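To make the abbreviations concrete, here is a small sketch of how these signals could be concatenated into a single encoder input; the field markers and example values are illustrative assumptions, since the poster defines only the abbreviations, not an input format.

```python
def build_encoder_input(chapter_title, video_title, frame_caption, transcript_summary):
    """Concatenate the available signals (C, V, F, S(F,T)) into one textual context.

    The field markers below are illustrative; the poster only defines the
    abbreviations, not the exact input format.
    """
    parts = [
        f"chapter: {chapter_title}",       # C
        f"video: {video_title}",           # V
        f"caption: {frame_caption}",       # F
        f"summary: {transcript_summary}",  # S(F,T)
    ]
    return " | ".join(parts)

# Made-up example values for illustration only.
print(build_encoder_input(
    "Pricing street food in Bangkok",
    "A day at Chatuchak market",
    "A vendor hands over a plate of pad thai.",
    "The host compares food prices across several stalls.",
))
```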
