1 of 20

Federated Weakly Supervised Video Anomaly Detection

Presenter: Jiahang Li

Supervisor: Prof. Yong Su

Affiliation: Tianjin Normal University

Flower: A Friendly Federated Learning Framework

2 of 20

Agenda

  • Introduction
  • Background
  • Motivation & Real-World Problem
  • Proposed Framework
  • Datasets and Experiments
  • Conclusion

3 of 20

Introduction

About Me

  • Rising senior undergraduate student in Intelligent Science and Technology at Tianjin Normal University.
  • Research interests include video anomaly detection, weakly supervised learning, and real-world applications of federated learning.

4 of 20

Background

  • Video Anomaly Detection
  • Weakly Supervised Learning
  • Federated Learning

5 of 20

What’s Video Anomaly Detection (VAD)?

  • Anomaly Definition: A set of anomalous activities is predefined, such as driving a car on the sidewalk, stealing a bag, or fighting.
  • Frame-Level Annotation Rule: If a frame shows a person performing one of these activities, that frame is labeled as anomalous.
  • Training Inputs: The model takes video features and the corresponding video label as inputs.
  • Frame-Level Scoring: The model outputs an anomaly score for each frame; the ideal scores are:
    • Normal frame: 0.0
    • Abnormal frame: 1.0

6 of 20

What’s Weakly Supervised Video Anomaly Detection?

Weakly Supervised Video Anomaly Detection (WS-VAD)

  • Training Data:
    • Any video with at least one abnormal frame is labeled as abnormal (1.0); otherwise, labeled as normal (0.0).
    • Only video-level labels, no frame-level annotations.
  • Pros:
    • Greatly reduces labeling cost
  • Cons:
    • Requires more advanced algorithms
    • Without frame-level labels, the model may guess blindly which frames are abnormal
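
The video-level labeling rule on this slide can be sketched in a few lines. This is a minimal illustration, assuming frame labels are available as 0/1 integers at annotation time; the function name `video_level_label` is hypothetical.

```python
from typing import Sequence


def video_level_label(frame_labels: Sequence[int]) -> float:
    """Collapse per-frame labels (0 = normal, 1 = abnormal) into one video label."""
    # A single abnormal frame is enough to mark the whole video as abnormal.
    return 1.0 if any(label == 1 for label in frame_labels) else 0.0


normal_video = [0, 0, 0, 0]
abnormal_video = [0, 0, 1, 1, 0]

print(video_level_label(normal_video))    # 0.0
print(video_level_label(abnormal_video))  # 1.0
```

During training only the returned value is kept, so the model never sees which frames triggered the abnormal label.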

7 of 20

What’s Federated Weakly Supervised Video Anomaly Detection?

  • Privacy Protection: Federated learning enables collaborative model training without sharing raw video data, preserving user privacy.
  • Data Security: Prevents sensitive surveillance data from being centralized or exposed.
  • Real-World Adaptability: Supports distributed video anomaly detection across diverse and heterogeneous environments.

Figure 3. Comparison Between Centralized and Federated Video Anomaly Detection

8 of 20

Motivation

  • Discrete snippets optimization
  • Privacy Leakage & Scene-Specific Anomalies
  • Context-agnostic

9 of 20

Discrete snippets optimization

WS-VAD MIL Loss:

Issues:

  • Noise-contaminated normal snippets receive inflated anomaly scores.
  • Anomalies with insufficient feature representation are assigned lower scores.
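
To make the "discrete snippets" issue concrete, here is a sketch of a widely used hinge-style MIL ranking loss. It is an assumption that the loss on this slide has this form; regularization terms are omitted, and the names `mil_ranking_loss` and `margin` are illustrative.

```python
from typing import Sequence


def mil_ranking_loss(abnormal_scores: Sequence[float],
                     normal_scores: Sequence[float],
                     margin: float = 1.0) -> float:
    """Hinge ranking loss between the top-scoring snippet of each bag."""
    top_abnormal = max(abnormal_scores)  # most suspicious snippet in the abnormal video
    top_normal = max(normal_scores)      # most suspicious snippet in the normal video
    # Penalize the pair unless the abnormal top score beats the normal one by `margin`.
    return max(0.0, margin - top_abnormal + top_normal)


abnormal_bag = [0.1, 0.2, 0.9, 0.3]
normal_bag = [0.1, 0.4, 0.2, 0.1]
print(mil_ranking_loss(abnormal_bag, normal_bag))
```

Only one snippet per video contributes to the loss. A noisy normal snippet that happens to score highest, or an anomaly that does not, therefore goes uncorrected, which matches the two issues listed above.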

Figure 4. Toy example: comparison of baseline discrete optimization (a) and adaptive dynamic recursive mapping for anomaly score reoptimization (b). (a) Baseline results. (b) Our results.

10 of 20

Privacy Leakage & Scene-Specific Anomalies

Privacy Protection:

  • Distributed data and strict confidentiality rules prevent sharing surveillance videos between institutions.
  • Key regulations:
    • General Data Protection Regulation (GDPR)
    • Personal Data Protection Act (PDPA)
    • California Consumer Privacy Act (CCPA)

Scene-Specific Anomalies in Real-World Scenarios:

  • Different scenes often have unique abnormal events.
  • Centralized models struggle to perform well across heterogeneous environments.

Figure 3. Comparison Between Centralized and Federated Video Anomaly Detection

11 of 20

Context-agnostic

Task-Agnostic Pretrained Feature Extractors:

  • Domain gaps cause inaccurate features.
  • Scene differences (lighting, structure, resolution, frame rate) amplify noise.
  • Data heterogeneity reduces robustness.

Model Poisoning in Federated Aggregation:

  • Local model errors can propagate globally, degrading overall system performance.

12 of 20

Proposed Framework

  • Adaptive Dynamic Recursive Mapping (ADRM)
  • Scene-Similarity Adaptive Local Aggregation (SSALA)

13 of 20

Adaptive Dynamic Recursive Mapping (ADRM)

In the mapping, the recursed quantity is the anomaly score at step t, and α is an adaptive decision parameter in the range [−1, 1] that controls how the score is updated.

Figure 6. Comparison of Different Mappings

Top row: (a) Logistic map, (b) Modified logistic map, (c) Sampled recursive mapping, (d) Recursive mapping (2-D).

Bottom row: (e)–(h) Corresponding 3-D output products for varying parameter values α and β.
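
For orientation, panel (a) refers to the logistic map, whose classic form is x_{t+1} = r · x_t · (1 − x_t). The sketch below iterates only that textbook map. It does not reproduce the ADRM update, and `r` is the standard logistic-map parameter rather than the α or β of the slides.

```python
from typing import List


def logistic_map(x0: float, r: float, steps: int) -> List[float]:
    """Iterate x_{t+1} = r * x_t * (1 - x_t) and return the full trajectory."""
    trajectory = [x0]
    for _ in range(steps):
        x = trajectory[-1]
        trajectory.append(r * x * (1.0 - x))
    return trajectory


# With r = 2 the iteration settles at the fixed point 1 - 1/r = 0.5.
stable = logistic_map(x0=0.2, r=2.0, steps=10)
print(round(stable[-1], 6))

# With r = 4 and x0 = 0.5 the trajectory jumps to 1.0 and then collapses to 0.0.
print(logistic_map(x0=0.5, r=4.0, steps=2))  # [0.5, 1.0, 0.0]
```

The point of the comparison is that a recursive map feeds each score back into the next step, so the parameter decides whether scores are pushed toward a stable value or spread apart.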

14 of 20

Scene-Similarity Adaptive Local Aggregation (SSALA)

Figure 7. Dual-architecture of the reoptimization framework with adaptive dynamic recursive mapping for WSVAD.
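
As a rough intuition for similarity-driven aggregation, the sketch below weights each client's parameters by how similar its scene is to the target client's scene. This is not the SSALA algorithm itself: the scene embeddings, the cosine similarity, the softmax with `temperature`, and all function names are assumptions made for illustration.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def similarity_weights(target: str,
                       scene_embeddings: Dict[str, List[float]],
                       temperature: float = 0.1) -> Dict[str, float]:
    """Softmax over cosine similarities between the target client and all clients."""
    sims = {cid: cosine_similarity(scene_embeddings[target], emb)
            for cid, emb in scene_embeddings.items()}
    peak = max(sims.values())  # subtract the maximum for numerical stability
    exps = {cid: math.exp((s - peak) / temperature) for cid, s in sims.items()}
    total = sum(exps.values())
    return {cid: e / total for cid, e in exps.items()}


def aggregate_for_client(target: str,
                         client_params: Dict[str, List[float]],
                         scene_embeddings: Dict[str, List[float]]) -> List[float]:
    """Similarity-weighted average of all clients' flattened parameters."""
    weights = similarity_weights(target, scene_embeddings)
    size = len(client_params[target])
    return [sum(weights[cid] * client_params[cid][i] for cid in client_params)
            for i in range(size)]


# Hypothetical scene descriptors: clients "a" and "b" watch similar scenes, "c" does not.
embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0]}
params = {"a": [1.0, 1.0], "b": [2.0, 2.0], "c": [10.0, 10.0]}
print(aggregate_for_client("a", params, embeddings))
```

In this toy setup the model returned for client "a" stays close to the parameters of "a" and "b", while the dissimilar client "c" contributes almost nothing.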

15 of 20

Experiment Setup

Datasets:

  • ShanghaiTech: 437 videos from 13 scenes (307 normal, 130 anomalous); both train and test sets cover all scenes.
  • UBnormal: 543 videos from 29 scenes; anomaly types in the train and test sets are disjoint, which mimics real-world evaluation on unseen anomalies.

Federated Simulation:

  • Videos are partitioned by scene across clients to simulate cross-scene heterogeneity.
  • Edge devices (NVIDIA Jetson AGX Xavier) serve as clients, using the Flower framework for federated learning.
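
The scene-based partitioning can be sketched as follows. This is a minimal illustration, assuming each video comes with an integer scene ID. The video names, the modulo assignment rule, and the function name `partition_by_scene` are hypothetical and not taken from the experiments.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def partition_by_scene(videos: List[Tuple[str, int]],
                       num_clients: int) -> Dict[int, List[str]]:
    """Send every video of a scene to the same client (scene_id modulo num_clients)."""
    clients: Dict[int, List[str]] = defaultdict(list)
    for video_name, scene_id in videos:
        clients[scene_id % num_clients].append(video_name)
    return dict(clients)


# Hypothetical (video name, scene ID) pairs.
videos = [
    ("scene01_clip001", 1), ("scene01_clip002", 1),
    ("scene02_clip001", 2), ("scene03_clip001", 3),
    ("scene03_clip002", 3), ("scene04_clip001", 4),
]

for client_id, names in sorted(partition_by_scene(videos, num_clients=3).items()):
    print(client_id, names)
```

Because a scene never spans two clients, each client sees its own camera views and anomaly types, which is what produces the cross-scene heterogeneity.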

Training & Aggregation:

  • Each client runs multiple local epochs; aggregation uses the SSALA protocol.

Metrics:

  • Frame-level area under the ROC curve (AUC) is the main metric; False Alarm Rate (FAR) measures robustness.
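
Both metrics can be computed directly from per-frame scores and ground-truth labels. The sketch below is a plain-Python illustration using a pairwise AUC, which is quadratic in the number of frames. The FAR threshold of 0.5 is an assumption and is not stated on the slide.

```python
from typing import Sequence


def frame_level_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC AUC: chance that a random abnormal frame outscores a random normal one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one normal and one abnormal frame")
    # Ties between an abnormal and a normal frame count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def false_alarm_rate(scores: Sequence[float], labels: Sequence[int],
                     threshold: float = 0.5) -> float:
    """Fraction of normal frames whose score exceeds the threshold."""
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not neg:
        raise ValueError("FAR needs at least one normal frame")
    return sum(s > threshold for s in neg) / len(neg)


scores = [0.1, 0.2, 0.8, 0.9, 0.6, 0.3]
labels = [0, 0, 1, 1, 0, 1]
print(frame_level_auc(scores, labels))   # 8 of 9 abnormal/normal pairs are ranked correctly
print(false_alarm_rate(scores, labels))  # 1 of 3 normal frames exceeds 0.5
```

AUC is threshold-free and measures ranking quality, whereas FAR depends on the chosen threshold and shows how often normal footage would trigger an alert.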

16 of 20

Main Results

17 of 20

18 of 20

Inference & Latency on Jetson AGX Xavier

  • Feature extraction: 14.7 fps
  • Model inference: 9,835 fps
  • Bottleneck: Feature extraction limits real-time performance.
  • Applied pruning (L2 norm) and int8 dynamic quantization to VideoMAEv2.
  • Optimizations boost speed and efficiency, with accuracy maintained even at 30% pruning.
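
A quick calculation shows why feature extraction is the bottleneck. The sketch assumes the two stages run one after the other on each frame with no pipelining or batching overlap; `pipeline_fps` is an illustrative helper.

```python
from typing import Sequence


def pipeline_fps(stage_fps: Sequence[float]) -> float:
    """Throughput when every frame passes through the stages sequentially."""
    seconds_per_frame = sum(1.0 / fps for fps in stage_fps)
    return 1.0 / seconds_per_frame


feature_extraction_fps = 14.7
model_inference_fps = 9835.0

end_to_end = pipeline_fps([feature_extraction_fps, model_inference_fps])
print(f"{end_to_end:.2f} fps")  # about 14.68 fps
```

Under this assumption the end-to-end rate is practically equal to the feature extractor's rate, so speeding up the detector brings almost nothing, while pruning and quantizing the feature extractor directly raises overall throughput.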

19 of 20

Conclusions

Fed-WS-VAD faces many real-world challenges. Our proposed framework addresses these key problems as follows:

  • Discrete Snippets Optimization:
    • Our dual-detector framework with adaptive dynamic recursive mapping and parameter interaction enables stable and accurate anomaly scoring.
  • Privacy Leakage & Scene-Specific Anomalies:
    • The SSALA algorithm enables learning private local models, supports effective parameter aggregation across clients, and mitigates scene heterogeneity.
  • Context-Agnostic Limitation:
    • The scene-similarity selection mechanism enables the FL framework to train WS-VAD models that accurately capture context-dependent anomaly definitions, while also preserving privacy in multi-scene surveillance.

20 of 20

Thank you for your attention! 😀

Questions and feedback are welcome.

Paper

Code

  • Contact:
  • For more details, see our paper or code.

Slides