1 of 24

Anomaly Detection

Video-to-Text Translation

Sifan Zhu, Anqi Li, Kristina Kuznetsova :)

2 of 24

TABLE OF CONTENTS

01 PROJECT DESIGN & WORKFLOW OVERVIEW

02 CHOSEN MODELS

03 ANNOTATION & FINE-TUNING: CHALLENGES
Annotation example in three perspectives; example of video summarization after fine-tuning

04 EVALUATION & BRAINSTORMING
Results comparison; more to think about …

3 of 24

01 PROJECT DESIGN & WORKFLOW OVERVIEW

4 of 24

PROJECT DESIGN

State-of-the-art approach for video description using AI:

1. FRAME EXTRACTION: extract key frames using interval-based extraction or scene detection.
2. IMAGE CAPTIONING: apply a pre-trained captioning model to each frame.
3. SUMMARIZATION: use a summarization model to combine the captions into a summary.
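A minimal end-to-end sketch of this three-step pipeline, assuming OpenCV for interval-based frame extraction and Hugging Face pipelines for captioning and summarization; the checkpoints and the sampling interval are illustrative choices, not the project's exact configuration.

```python
import cv2
from PIL import Image
from transformers import pipeline

def extract_frames(video_path, every_n=30):
    """Interval-based key-frame extraction: keep every n-th frame."""
    cap, frames, idx = cv2.VideoCapture(video_path), [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        idx += 1
    cap.release()
    return frames

# Illustrative checkpoints; any captioning / summarization pair would work.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def describe_video(video_path):
    captions = [captioner(f)[0]["generated_text"] for f in extract_frames(video_path)]
    return summarizer(" ".join(captions))[0]["summary_text"]
```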

5 of 24

WORKFLOW OVERVIEW

B — BASELINE MODEL: generation of video descriptions with the pre-trained model.
A — ANNOTATION: database preparation for the project task.
F — FINE-TUNING: model training on the chosen database.
E — EVALUATION: comparative analysis of the results of the two models.

6 of 24

02 CHOSEN MODELS

7 of 24

IMAGE CAPTIONING

BLIP (Bootstrapping Language-Image Pre-training)

Strengths:

  • High-quality captions: generates detailed and accurate captions that describe not only the objects in the image but also the context and actions, providing a richer understanding of the visual content.
  • Fine-tuning capabilities: can be fine-tuned for specific tasks or domains to further improve performance, making it adaptable to various applications.
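A minimal captioning sketch using the public BLIP base checkpoint on Hugging Face; the frame path and generation length are illustrative assumptions.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Caption a single extracted frame (path is an example).
image = Image.open("abnormal_scene_2_scenario_2_240.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```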

8 of 24

SUMMARIZATION

BART (Bidirectional and Auto-Regressive Transformers)

Strengths:

  • Ability to handle long inputs: capable of processing and summarizing long sequences of text, which is essential when dealing with multiple captions generated from a video.
  • Flexibility in output length: can generate summaries of varying lengths by adjusting parameters, providing flexibility based on the required level of detail.
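A short sketch of the output-length control mentioned above, assuming the facebook/bart-large-cnn checkpoint; the input string and the min_length/max_length values are arbitrary examples.

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Concatenated frame captions (shortened example input).
captions = ("a person walking down a street. a street with a white line. "
            "a man in a vest lying on the ground. a man running away.")

# Output length is steered through min_length / max_length (in tokens).
brief = summarizer(captions, min_length=5, max_length=25)[0]["summary_text"]
detailed = summarizer(captions, min_length=40, max_length=80)[0]["summary_text"]
```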

9 of 24

[Diagram: baseline vs. fine-tuned pipeline]

BASELINE model: Video → frames F0, F1, …, Fn → description of the video, generated without manually predefined frame descriptions.

FINE-TUNED model: Video → frames F0, F1, …, Fn, each paired with manually predefined frame descriptions d0, d1, d2 → detailed description of the video, generated with the manually predefined frame descriptions.
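A minimal sketch of what the fine-tuning side could look like: one gradient step of BLIP caption training on (frame, manual description) pairs. The training loop, learning rate, and file names are illustrative assumptions, not the project's exact recipe.

```python
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# (frame, manually predefined description) pairs -- example data.
pairs = [("abnormal_scene_2_scenario_2_240.jpg",
          "A bald man looks at a construction worker lying on the ground.")]

model.train()
for path, text in pairs:
    inputs = processor(images=Image.open(path).convert("RGB"),
                       text=text, return_tensors="pt")
    loss = model(**inputs, labels=inputs["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```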

10 of 24

03 ANNOTATION & FINE-TUNING: CHALLENGES

Annotation example, summarizations of video by the fine-tuned model, and discovered challenges

11 of 24

3 Annotation Perspectives

Action perspective

Location perspective

Temporal perspective

Example frames: abnormal_scene_2_scenario_2_30, abnormal_scene_2_scenario_2_120, abnormal_scene_2_scenario_2_240

12 of 24

Annotation example (abnormal_scene_2_scenario_2)

'abnormal_scene_2_scenario_2_240.jpg': [
    'A bald man in casual clothes looks at a construction worker in a fluorescent waistcoat and a white helmet lying on the ground.',  # action perspective
    'In the middle of a pedestrian-crossable motorised road, flanked by European-style buildings.',  # location perspective
    'A dawn morning',  # temporal perspective
]
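A minimal sketch of how such an annotation entry might be turned into a single fine-tuning caption; merging the three perspective strings into one target sentence is an illustrative assumption, and load_caption_pairs / annotations.json are hypothetical names.

```python
import json

def load_caption_pairs(annotation_path):
    """Merge the three perspective strings (action, location, temporal)
    into one target caption per annotated frame."""
    with open(annotation_path) as f:
        annotations = json.load(f)  # {"frame.jpg": [action, location, temporal]}
    pairs = []
    for frame, (action, location, temporal) in annotations.items():
        caption = " ".join(p.strip() for p in (action, location, temporal))
        pairs.append({"image": frame, "text": caption})
    return pairs
```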

13 of 24

Dataset

UBnormal: a new supervised open-set benchmark composed of multiple virtual scenes for video anomaly detection.

Scenes in total: 29

Videos in total: 490 (abnormal: 247 / normal: 243)

Scene 1: 27 scenarios ➡️ 256 frames

Scene 2: 13 scenarios ➡️ 175 frames

Scene 3: 15 scenarios ➡️ 207 frames

…

Scene 20: 18 scenarios ➡️ 287 frames

In total: 5,978 frames × 3 annotation perspectives

14 of 24

Summarizations of videos (generated by the baseline model)

15 of 24

Summarizations of videos (generated by the fine-tuned model)

16 of 24

Discovered challenges

Summarization model
  • Simple concatenation, repetition, untruthful content.
  • Another summarization model is needed to create better video descriptions: BART, T5, Pegasus, … (swapping checkpoints is a one-line change; see the sketch below).

Model fine-tuning
  • Handling domain mismatches between the pre-trained model's training data and the annotated data.

Annotations
  • Potential bias: ensuring that all annotators follow the same guidelines and interpret the data uniformly.
  • Human errors: maintaining high quality and accuracy of annotations.
  • Time-consuming: inefficiency; costly for large datasets.
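Since the captions are combined by a generic seq2seq summarizer, trying BART alternatives such as T5 or Pegasus is a one-line checkpoint swap under the Hugging Face pipeline API; the checkpoints and input string below are chosen purely as examples.

```python
from transformers import pipeline

captions = ("a person walking down a street. a man in a vest lying on "
            "the ground. a bald man running away.")

# Same call, different checkpoint: compare candidate summarizers.
for checkpoint in ["facebook/bart-large-cnn", "google/pegasus-xsum", "t5-base"]:
    summarizer = pipeline("summarization", model=checkpoint)
    print(checkpoint, "->", summarizer(captions, max_length=40)[0]["summary_text"])
```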

17 of 24

04 EVALUATION

Chosen metrics and expected results

18 of 24

Evaluation metrics:

ROUGE
  • Ensures generated text captures important segments and structural elements.
  • Suitable for evaluating summarization and text generation.

METEOR
  • Provides a nuanced and semantically meaningful evaluation by considering precision, recall, synonymy, stemming, and word order.

BLEU
  • Measures the precision of n-grams in the generated text against the reference text.
  • Ensures generated text maintains fluency and readability.
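One way to compute all three scores, assuming the Hugging Face evaluate library; the prediction/reference strings are shortened stand-ins for the real generated and gold descriptions.

```python
import evaluate  # pip install evaluate

predictions = ["at dawn a bald man knocks a construction worker to the ground"]
references = ["At dawn, a bald man in casual clothes knocks a construction "
              "worker to the ground and then runs away."]

for name in ["rouge", "meteor", "bleu"]:
    metric = evaluate.load(name)
    print(name, metric.compute(predictions=predictions, references=references))
```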

19 of 24

Results (abnormal_scene_2_scenario_2)

Baseline: 'a street with a person walking down it a person walking down a street in a city a street with a white line on the road. a street with a person walking down it a person walking down it a person walking down it.'

Gold description: 'At dawn, a bald man in casual clothes walks up to a construction worker in a fluorescent vest and white helmet and knocks him to the ground in the middle of a pedestrianized, motorized road flanked by European-style buildings, and then the bald man runs away.'

Fine-tuned: 'people walk and talk on the both sides of the street. it's on a street in a man in black stands on the left side of the road while a. man in a surgical gown a man. in dark blue stands. on the right side of. the street, talking.'

20 of 24

Comparison according to metrics

21 of 24

More to think about…

Some generations by the fine-tuned model do not match the video content.

  • Is the dataset unbalanced across scenarios? A more diverse dataset may be required.
  • Is the gap between scenarios too large? (A native speaker annotated the videos for scenes 1-4.) Fine-tuning was done on scenes 5-20, validation on scenes 1-4; see the sketch below for the scene-based split.
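A minimal sketch of the scene-based split described above (train on scenes 5-20, validate on scenes 1-4), parsing the scene id from UBnormal-style file names; the file list is an illustrative placeholder.

```python
import re

def scene_id(name):
    # "abnormal_scene_2_scenario_2_240.jpg" -> 2
    return int(re.search(r"scene_(\d+)", name).group(1))

all_frames = ["abnormal_scene_2_scenario_2_240.jpg",
              "abnormal_scene_7_scenario_1_30.jpg"]  # placeholder list

train = [f for f in all_frames if scene_id(f) >= 5]  # scenes 5-20
val = [f for f in all_frames if scene_id(f) <= 4]    # scenes 1-4
```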

22 of 24

More to think about…

Train on a single perspective to justify the contribution of the three perspectives.

Which perspective contributes most?

23 of 24

More to think about…

According to the metrics, the improvement of the fine-tuned model is small.

Possible strategies:

  • Add coordinates as a fourth perspective in the annotation files to locate the features we are describing.
  • Select features in the frames to combine with the frame descriptions?

24 of 24

THANKS for your ATTENTION!