1 of 21

When Words Can’t Capture It All: Towards Video-Based User Complaint Text Generation with Multimodal Video Complaint Dataset

Presenter: Sarmistha Das

PhD student, CSE

Indian Institute of Technology Patna

Supervised By

Dr. Sriparna Saha (Indian Institute of Technology Patna)

&

Manish Gupta (Microsoft)

2 of 21

Contents

  • Background
  • Challenges
  • Motivation and Research Objectives
  • Dataset Curation
  • Methodology
  • Experimental Results
  • Conclusion

3 of 21

Background

4 of 21

Background

Worst Product

Bad Mouse

Clicking too much but no work

Ordered 4 but stuck as one

6 of 21

Background

Complaint: An assertion that expresses strong disappointment, arising from a gap between the expected and actual quality of a product or service.

(e.g., a vague text such as "worst product" paired with a 5-second video showing a headphone with a broken right earcup).

We present the task of generating Complaint Descriptions from Videos, CoD-V (e.g., to help the above user articulate her complaint about the defective right earcup).

worst product

7 of 21

Motivation

  • Amazon, JioMart, and Flipkart are rapidly entering rural markets, driving a rise in first-time online shoppers.

  • Users with cognitive or writing challenges, such as dysgraphia or limited language skills, need support in expressing complaints.

  • Existing video summarizers offer generic descriptions (e.g., ‘person holding a headphone’) but miss core issues (e.g., ‘broken wiring’), leading to misinterpretation and delayed resolution.

8 of 21

Research Objectives

  • Build a platform where inarticulate or busy users can express grievances via video, empowering them and improving engagement with e-commerce platforms.

  • Assess whether text generated from complaint videos accurately captures both the factual details and the user's dissatisfaction.

Example: The user wants to convey a complaint about a broken headphone, specifically the right earcup, which is barely connected.

9 of 21

Challenges

  • Lack of publicly available datasets.

  • Complex annotation requirements.

  • Overlap with basic tasks: Existing models often treat the task like basic video summarization or captioning, failing to capture complaint-specific intent and nuances.

10 of 21

Corpus Curation

  • Samples: 1,175
  • 655 on electronic gadgets
  • 273 on household items
  • 202 on fashion items
  • 45 others
  • 4 emotional categories (dissatisfaction, blame, frustration, disappointment)
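
As a quick consistency check on the category split above, the following minimal sketch sums the per-category counts and derives each category's share of the corpus. The dictionary keys are illustrative labels chosen here, not identifiers from the dataset release.

```python
# Per-category sample counts as listed on the slide.
# The key names are illustrative labels, not official dataset fields.
category_counts = {
    "electronic gadgets": 655,
    "household items": 273,
    "fashion items": 202,
    "others": 45,
}

# The four categories should account for every sample in the corpus.
total = sum(category_counts.values())

# Share of each category as a percentage of the whole corpus.
shares = {name: 100.0 * n / total for name, n in category_counts.items()}

for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
print(f"total samples: {total}")
```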

11 of 21

Methodology

12 of 21

Methodology

13 of 21

Methodology

  • Sentiment analysis using the VADER score, in [-1, 1] (S_vader)
  • Emotion score (Happy, Sad, Angry, Frustrated) (E_s)
  • Aspect detection score (Quality, Battery Life, Return Policy), in [0, 1] (A_s)
  • Complaint Retention (CR) = (S_vader + E_s + A_s) / 3
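
The averaging step in the CR formula can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name and the example scores are hypothetical, and the three component scores are assumed to have been computed elsewhere (e.g., S_vader by a VADER sentiment analyzer). The slide does not state a range for E_s, so none is enforced here.

```python
def complaint_retention(s_vader: float, e_s: float, a_s: float) -> float:
    """Average the three component scores: CR = (S_vader + E_s + A_s) / 3."""
    # Ranges stated on the slide: S_vader in [-1, 1], A_s in [0, 1].
    if not -1.0 <= s_vader <= 1.0:
        raise ValueError("s_vader is expected to lie in [-1, 1]")
    if not 0.0 <= a_s <= 1.0:
        raise ValueError("a_s is expected to lie in [0, 1]")
    # No range is given for E_s on the slide, so it is passed through unchecked.
    return (s_vader + e_s + a_s) / 3.0


# Hypothetical component scores for one generated complaint (not from the paper).
cr = complaint_retention(s_vader=-0.6, e_s=0.8, a_s=0.7)
print(f"CR = {cr:.2f}")
```

Because S_vader can be negative while A_s cannot, the three terms are not on a common scale; any rescaling before averaging would be a design choice beyond what the slide specifies.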

14 of 21

Results

15 of 21

Results

16 of 21

Results

17 of 21

18 of 21

Qualitative Analysis

Video frames: (1) the user is trying to scroll; (2) the user keeps trying; (3) nothing works and the mouse looks old.

Generated complaints from BLIP-VQA-BASE, VideoLLaVA2-7b, and VideoLLaMA2-7b+MR, with and without emotion:

the user wants to convey about the complaint of mouse. the user is saying that the mouse scroller is not working and the scroll button is not working.

the user wants to convey about the complaint of mouse is old. The user has received an old product. Which is not having good quality. The scrolling button is not working as well.

The user wants to convey about the complaint of mouse. The user shows the quality of the mouse which is not good by placing it in his hand. The sensor is not working.

the user wants to convey about the complaint of mouse. the user shows the quality of the mouse which is not good. the user has shown this by taking the mouse in his hand

The user wants to convey about the complaint of mouse. The user has dissatisfaction regarding the product durability.

The user wants to convey about the complaint of mouse. The user claims that the mouse is defective and he has received a damaged product. The scrolling button is not working properly.

Ground Truth: The user wants to convey about the complaint of mouse. The user claims that the mouse is defective. He has received a complete damaged product where the scrolling button is not working properly.

19 of 21

Error Analysis

  1. The user typed once
  2. Automatically typing
  3. The user checked the connection

Output from proposed model: The user wants to complain that the keyboard is not functioning properly. Despite multiple attempts to press the keys, they remain unresponsive and instead automatically type other. The user is disappointed with Amazon for selling defective items and is unable to contact customer care for a refund.

Ground Truth: The user wants to convey the complaint that the keyboard is not working. The USB jack is also not working. Wrong typing occurs on the screen. words and trigger the escape button.

20 of 21

Conclusion

  • We present the task of complaint generation from multimodal videos.

  • We introduce the ComVID dataset with 1,175 videos.

  • We propose the Complaint Retention (CR) evaluation measure.

  • We briefly contrast summary generation, video description, and complaint generation from video.

21 of 21

Thank You.

Authors: Sarmistha Das, R E Zera Marveen Lyngkhoi, Kirtan Jain, Vinayak Goyal, Sriparna Saha, Manish Gupta

Github: https://github.com/sarmistha-D/CoD-V

Contact Email: sarmistha1515@gmail.com