Anomaly Detection
Video-to-Text Translation
Sifan Zhu, Anqi Li, Kristina Kuznetsova :)
TABLE OF CONTENTS

01 PROJECT DESIGN & WORKFLOW OVERVIEW
02 CHOSEN MODELS
03 ANNOTATION & FINE-TUNING: Challenges
   Annotation example in three perspectives; example of video summarization after fine-tuning
04 EVALUATION & BRAINSTORMING
   Results comparison; more to think about …
PROJECT DESIGN & WORKFLOW OVERVIEW
01
PROJECT DESIGN

State-of-the-art approach for video description using AI:

FRAME EXTRACTION: Extract key frames using interval-based extraction or scene detection.
IMAGE CAPTIONING: Apply a pre-trained captioning model to each frame.
SUMMARIZATION: Use a summarization model to combine the captions into a summary.
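A minimal sketch of this three-step pipeline. The `caption_fn` and `summarize_fn` arguments are hypothetical placeholders standing in for the pre-trained captioning and summarization models; only the interval-based key-frame selection is implemented concretely:

```python
def keyframe_indices(total_frames: int, interval: int) -> list[int]:
    """Interval-based extraction: pick every `interval`-th frame index."""
    return list(range(0, total_frames, interval))

def describe_video(frames, caption_fn, summarize_fn, interval: int = 30) -> str:
    """Pipeline sketch: extract key frames, caption each, summarize the captions.

    `caption_fn` and `summarize_fn` stand in for a pre-trained captioning
    model (e.g. BLIP) and a summarization model (e.g. BART).
    """
    keyframes = [frames[i] for i in keyframe_indices(len(frames), interval)]
    captions = [caption_fn(frame) for frame in keyframes]
    return summarize_fn(" ".join(captions))
```

Scene-detection-based extraction would replace `keyframe_indices` with a shot-boundary detector, but the rest of the pipeline stays the same.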
WORKFLOW OVERVIEW

B BASELINE MODEL: generation of video descriptions with the pre-trained model.
A ANNOTATION: database preparation for the project task.
F FINE-TUNING: model training on the chosen database.
E EVALUATION: comparative analysis of the results of the two models.
CHOSEN MODELS
02
IMAGE CAPTIONING
BLIP
(Bootstrapping Language-Image Pre-training)
Strengths:
Generates detailed and accurate captions that describe not only the objects in the image but also the context and actions, providing a richer understanding of the visual content.
Can be fine-tuned for specific tasks or domains to further improve performance, making it adaptable to various applications.
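A minimal captioning sketch using the Hugging Face `transformers` API with the publicly available `Salesforce/blip-image-captioning-base` checkpoint; the function name and generation parameters are illustrative choices, not the project's exact setup (imports are done lazily so the sketch can be read without the heavy dependencies installed):

```python
def caption_frame(image_path: str) -> str:
    """Caption one extracted video frame with a pre-trained BLIP checkpoint."""
    # Lazy imports: transformers and Pillow are heavy, optional dependencies.
    from transformers import BlipProcessor, BlipForConditionalGeneration
    from PIL import Image

    checkpoint = "Salesforce/blip-image-captioning-base"
    processor = BlipProcessor.from_pretrained(checkpoint)
    model = BlipForConditionalGeneration.from_pretrained(checkpoint)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```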
SUMMARIZATION
BART (Bidirectional and Auto-Regressive Transformers)
Strengths:
Capable of processing and summarizing long sequences of text, which is essential when dealing with multiple captions generated from a video.
Can generate summaries of varying lengths by adjusting parameters, providing flexibility based on the required level of detail.
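A corresponding summarization sketch, again via `transformers`, using the `facebook/bart-large-cnn` checkpoint; the `max_length`/`min_length` values illustrate the adjustable-summary-length point above and are assumptions, not the project's tuned settings:

```python
def summarize_captions(captions: list[str], max_length: int = 60, min_length: int = 15) -> str:
    """Combine per-frame captions into a single video summary with BART.

    Summary length is controlled via `max_length`/`min_length`.
    """
    # Lazy import: transformers is a heavy, optional dependency.
    from transformers import pipeline

    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    text = " ".join(captions)
    result = summarizer(text, max_length=max_length, min_length=min_length, do_sample=False)
    return result[0]["summary_text"]
```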
[Diagram] BASELINE model: video frames F0 … Fn → description of the video, generated without manually predefined frame descriptions.
[Diagram] FINE-TUNED model: video frames F0 … Fn paired with manually predefined frame descriptions d0 … dn → detailed description of the video.
ANNOTATION & FINE-TUNING
Challenges
03
Annotation example, summarizations of videos by the fine-tuned model, and discovered challenges
3 Annotation Perspectives
Action perspective
Location perspective
Temporal perspective
abnormal_scene_2_scenario_2_240
abnormal_scene_2_scenario_2_30
abnormal_scene_2_scenario_2_120
Annotation example
'abnormal_scene_2_scenario_2_240.jpg': [
    'A bald man in casual clothes looks at a construction worker in a fluorescent waistcoat and a white helmet lying on the ground.',
    'In the middle of a pedestrian-crossable motorised road, flanked by European-style buildings.',
    'A dawn morning',
]
abnormal_scene_2_scenario_2
Action perspective
Location perspective
Temporal perspective
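One way the three perspective sentences per frame could be joined into a single fine-tuning target, following the annotation format above. The helper name `merge_perspectives` and the chosen ordering (temporal prefix, then action and location) are hypothetical illustrations, not the project's actual preprocessing:

```python
def merge_perspectives(annotation: dict[str, list[str]]) -> dict[str, str]:
    """Join the [action, location, temporal] sentences for each frame
    into one target description string for fine-tuning."""
    merged = {}
    for frame, (action, location, temporal) in annotation.items():
        # Illustrative ordering assumption: lead with the temporal setting.
        merged[frame] = f"{temporal.strip().rstrip('.')}: {action.strip()} {location.strip()}"
    return merged
```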
Dataset

UBnormal: a new supervised open-set benchmark composed of multiple virtual scenes for video anomaly detection.
Number of scenes: 29 in total
Videos in total: 490 (abnormal: 247 / normal: 243)
Scene 1: 27 scenarios ➡️ 256 frames
Scene 2: 13 scenarios ➡️ 175 frames
Scene 3: 15 scenarios ➡️ 207 frames
…
Scene 20: 18 scenarios ➡️ 287 frames
In total: 5,978 frames × 3 perspectives
Summarizations of video (generated by Baseline Model)
Summarizations of videos (generated by Fine-tuned model)
Discovered challenges
Another summarization model is needed to create better video descriptions.
BART, T5, Pegasus, …
Handling domain mismatches between the pre-trained model's training data and the annotated data.
Summarization model
Models fine-tuning
Annotations
EVALUATION
04
Chosen metrics and expected results
Evaluation metrics:
ROUGE
METEOR
BLEU
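To make the metrics concrete, here is a minimal from-scratch illustration of ROUGE-1 recall (unigram overlap with the reference). In practice, packaged implementations such as the `rouge_score` or `evaluate` libraries would be used for ROUGE, METEOR, and BLEU; this toy version only shows what the number measures:

```python
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    """ROUGE-1 recall: fraction of reference unigrams also found in the candidate."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a candidate word only counts as often as it appears in the reference.
    overlap = sum(min(cand_counts[word], count) for word, count in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)
```

For example, `rouge1_recall("a man walks", "a man walks away")` is 0.75: three of the four reference unigrams are recovered.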
Results (abnormal_scene_2_scenario_2)

Baseline: 'a street with a person walking down it a person walking down a street in a city a street with a white line on the road. a street with a person walking down it a person walking down it a person walking down it.'

Gold description: 'At dawn, a bald man in casual clothes walks up to a construction worker in a fluorescent vest and white helmet and knocks him to the ground in the middle of a pedestrianized, motorized road flanked by European-style buildings, and then the bald man runs away.'

Fine-tuned: 'people walk and talk on the both sides of the street. it's on a street in a man in black stands on the left side of the road while a. man in a surgical gown a man. in dark blue stands. on the right side of. the street, talking.'
Comparison according to metrics
More to think about…
Generations by the fine-tuned model appear inconsistent with the video content.
Is the dataset unbalanced across the different scenarios?
A more diverse dataset is required.
(Native speakers annotated videos from scenes 1-4.) Fine-tuning on scenes 5-20, validation on scenes 1-4.
Is the gap between scenarios too large?
More to think about…
Train on one perspective at a time to assess the contribution of the three perspectives.
What contributes more?
More to think about…
According to the metrics, the improvement of the fine-tuned model over the baseline is small.
Add coordinates as a 4th perspective in the annotation files to locate the features we are describing
Possible strategy:
Select features in the frames to combine with the frame descriptions?
THANKS
for your
ATTENTION!