Argus++: Robust Real-time Activity Detection for Unconstrained Video Streams with Overlapping Cube Proposals

Lijun Yu, Yijun Qian, Wenhe Liu, and Alexander G. Hauptmann

Overview

Activity detection

    • In unconstrained videos: untrimmed and with a large field of view
    • Three aspects
      • Temporal localization
      • Spatial localization
      • Action classification

Strict target

    • Detect all atomic activities
    • Bipartite match between predictions and ground truths

Loosened target

    • Detect either atomic activities (e.g., standing up) or continuous repetitive activities (e.g., walking)
    • Match multiple non-overlapping predictions to each ground truth

Argus++ Framework

Argus++ Architecture

Intermediate Concept: Cube Proposal

  • Proposal
    • A candidate region where activity may occur
    • Processing element for activity recognition

  • Spatio-temporal cube proposal
    • A simple six-tuple defining the boundaries in three dimensions

    • Fixed temporal duration when sampled
    • Much simpler than activity instances or tube proposals
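
A minimal sketch of the six-tuple as a data structure. The field names, their order, and the half-open range convention are assumptions made for illustration; the slides only state that a cube is a six-tuple bounding three dimensions.

```python
from typing import NamedTuple


class Cube(NamedTuple):
    """Spatio-temporal cube (hypothetical layout).

    Frames span [t0, t1); pixels span [x0, x1) horizontally and [y0, y1) vertically.
    """
    t0: int
    t1: int
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def duration(self) -> int:
        # Number of frames covered by the cube.
        return self.t1 - self.t0

    @property
    def area(self) -> float:
        # Spatial area of the box; the same box applies to every frame in the cube.
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


cube = Cube(t0=0, t1=64, x0=100.0, y0=50.0, x1=180.0, y1=210.0)
```

Because one box covers the whole temporal window, a cube needs no per-frame geometry, which is what makes it simpler than a tube proposal.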

Proposal Generation

  • Proposal sampling
    • Dense overlapping sampling on untrimmed videos
    • Ensure completeness and coverage of any activity instance

  • Proposal refinement
    • Seed track ids from the central frame in each temporal window
    • Enlarge bounding boxes as union across the window
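
The two steps above can be sketched as follows. This is an illustrative reading, not the authors' implementation: the `tracks` layout (frame index to a mapping of track id to box) and the function names are assumptions, and the defaults of 64 and 16 are the duration and stride listed under Implementation Details.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]   # (x0, y0, x1, y1)
Tracks = Dict[int, Dict[int, Box]]        # frame index -> {track id -> box} (assumed layout)


def sample_windows(num_frames: int, duration: int = 64, stride: int = 16) -> List[Tuple[int, int]]:
    """Dense overlapping temporal windows [t0, t1) over an untrimmed video."""
    # A video shorter than `duration` still yields one window starting at frame 0.
    last_start = max(num_frames - duration, 0)
    return [(t0, t0 + duration) for t0 in range(0, last_start + 1, stride)]


def union(a: Box, b: Box) -> Box:
    """Smallest box enclosing both inputs."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def refine(tracks: Tracks, t0: int, t1: int) -> List[Tuple[int, int, float, float, float, float]]:
    """One cube per track seeded at the central frame, enlarged to the union over the window."""
    center = (t0 + t1) // 2
    cubes = []
    for track_id, box in tracks.get(center, {}).items():
        for t in range(t0, t1):
            other = tracks.get(t, {}).get(track_id)
            if other is not None:
                box = union(box, other)
        cubes.append((t0, t1, *box))
    return cubes
```

With a stride of 16 and a duration of 64, consecutive windows overlap by 48 frames, so away from the video boundaries each frame falls inside four windows.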

Proposal Generation: An Example

Proposal Filtering

  • Foreground segmentation
    • Frame-level binary mask for foreground pixels
    • Proposal foreground score as average value of pixel mask inside the cube
    • Learn a filtering threshold that sacrifices no more than a small, fixed share of true positives
  • Label assignment
    • Convert annotation into cube format by dense sampling
    • Estimate spatial IoU between proposal and ground truth cubes
    • Follow Faster R-CNN in selecting positive and negative samples
  • Proposal evaluation
    • Assume perfect classifier by using assigned labels
    • Pass through following steps and use official metrics to estimate upper bound
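
A minimal sketch of three quantities named above: the foreground score of a cube, a threshold learned from positive proposals, and the spatial IoU used for label assignment. The function names, the nested-list mask layout, and the `max_loss` parameter are assumptions for illustration; the slides do not give the tolerated loss or the exact procedure.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), half-open pixel ranges


def foreground_score(masks: Sequence[Sequence[Sequence[int]]], box: Box) -> float:
    """Mean binary-mask value inside the box over all frames of the cube.

    masks[t][y][x] is 0 or 1 for each frame t covered by the cube.
    """
    x0, y0, x1, y1 = box
    total = sum(sum(row[x0:x1]) for frame in masks for row in frame[y0:y1])
    count = len(masks) * (y1 - y0) * (x1 - x0)
    return total / count if count else 0.0


def learn_threshold(positive_scores: List[float], max_loss: float) -> float:
    """Threshold t such that keeping scores >= t drops at most a max_loss fraction of positives."""
    ranked = sorted(positive_scores)
    allowed = min(int(max_loss * len(ranked)), len(ranked) - 1)  # positives we may drop
    return ranked[allowed]


def spatial_iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(iw, 0) * max(ih, 0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Proposals whose foreground score falls below the learned threshold would be discarded before recognition, trading a bounded loss in recall for fewer cubes to classify.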

Activity Recognition

  • Multi-label Classification
    • Binary cross entropy loss
    • Weighted by proposal scores
    • Balance activity-wise pos/neg samples 
    • Balance samples of different activities
    • Balance samples of different datasets when used
  • Action-wise late fusion
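
One way to read the loss described above is a per-proposal binary cross entropy over all activity classes, scaled by the proposal score, with a per-class weight on positive labels to balance positives against negatives. The sketch below follows that reading in plain Python; `weighted_bce` and `pos_weights` are hypothetical names, and the exact weighting scheme used by the authors is not given in the slides.

```python
import math
from typing import Sequence


def weighted_bce(
    logits: Sequence[float],
    targets: Sequence[int],
    proposal_score: float,
    pos_weights: Sequence[float],
) -> float:
    """Multi-label BCE for one proposal, averaged over classes.

    logits[c] is the raw classifier output for activity c, targets[c] is 0 or 1,
    and pos_weights[c] up-weights positive labels of activity c (assumed scheme).
    """
    total = 0.0
    for z, y, pw in zip(logits, targets, pos_weights):
        # Numerically stable BCE with logits: max(z, 0) - z*y + log(1 + exp(-|z|)).
        bce = max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
        total += (pw if y == 1 else 1.0) * bce
    # The proposal score scales the contribution of the whole proposal.
    return proposal_score * total / len(logits)
```

Balancing across activities or datasets could be handled the same way with additional weights, or by resampling the training set; the slides do not say which.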

Activity Deduplication

  • Overlapping instances

  • Adjacent instances
    • Merge adjacent cubes above a certain threshold, subject to a minimum duration
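
A rough sketch of merging adjacent cubes for a single activity and a single track. Several details are assumptions: the threshold is applied to the classification score, "adjacent" means the temporal ranges overlap or touch, the minimum duration is applied to the merged instance, and the merged score is the mean of its members.

```python
from typing import List, Tuple

ScoredCube = Tuple[int, int, float]  # (t0, t1, score) for one activity on one track


def merge_adjacent(cubes: List[ScoredCube], threshold: float, min_duration: int) -> List[ScoredCube]:
    """Merge temporally overlapping or touching cubes whose score passes the threshold."""
    runs = []
    run = None  # current run as [t0, t1, list of scores]
    for t0, t1, score in sorted(cubes):
        if score < threshold:
            continue  # low-confidence cubes never join an instance
        if run is not None and t0 <= run[1]:
            run[1] = max(run[1], t1)
            run[2].append(score)
        else:
            if run is not None:
                runs.append(run)
            run = [t0, t1, [score]]
    if run is not None:
        runs.append(run)
    # Keep only merged instances that are long enough (assumed reading of the constraint).
    return [(t0, t1, sum(s) / len(s)) for t0, t1, s in runs if t1 - t0 >= min_duration]
```

For example, cubes at frames 0-64 and 16-80 that both pass the threshold collapse into one instance covering frames 0-80.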

Experimental Results

Implementation Details

  • Object detection: Mask R-CNN with ResNet-101 on COCO, stride 8
  • Multi-object tracking: Towards-Realtime-MOT
  • Foreground segmentation: HoG
  • Proposal: duration 64, stride 16
  • Classifiers: R(2+1)D, X3D, TRM

Evaluation Protocols

CVPR 2021 ActivityNet Challenge: ActEV SDL Unknown Facility

NIST ActEV’21 SDL Known Facility

NIST ActEV’21 SDL Unknown Facility

NIST TRECVID 2021 ActEV

NIST TRECVID 2020 ActEV

ICCV 2021 ROAD Action Detection

Ablation Study

  • Coverage of Proposal Formats

  • Performance of Proposal Filtering

Conclusion and Future Work

  • Argus++: Robust Real-time Activity Detection
  • Overlapping spatio-temporal cube proposals
  • Superior performance in CVPR ActivityNet ActEV 2021, NIST ActEV SDL UF/KF, TRECVID ActEV 2020/2021, ICCV ROAD 2021

  • Extending strict target into ActEV settings: bipartite matching with spatial localization
  • Generalizing to more scenarios such as UAV videos
  • Zero-shot or few-shot activity detection

Argus++ in SRL

ActEV SRL Challenge - Metrics

  • Time-based false alarm (TFA) -> rate of false alarm (RFA)
    • Match one ground truth with only one prediction
    • Predictions cannot be cut into short cubes; adjacent cubes need to be merged
  • Temporal localization -> Spatio-temporal localization
    • Matching requires spatial alignment

Argus++ for SRL

  • Modifications are limited to the activity deduplication stage
  • Reuses the trained models from the SDL system
  • Still runs in real-time

Activity Deduplication

  • Filter cubes based on classification confidence
    • Thresholds taken from scorer at 0.02 TFA
  • For remaining cubes, merge adjacent ones into one instance
  • Use bounding box from central cube
    • Since cube stride is 16, bounding boxes are unions of each 16-frame window
  • Apply activity type, scene type, and activity count filters
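
The first three steps above (confidence filtering, merging, and taking the box of the central cube) might look like the sketch below. It assumes all cubes come from one track, that `thresholds` maps each activity name to its confidence cutoff, and that the merged confidence is the maximum over member cubes; none of these details are specified in the slides, and the type, scene, and count filters are omitted.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]
Cube = Tuple[int, int, Box, str, float]  # (t0, t1, box, activity, confidence)


def _close(activity: str, run: list) -> Cube:
    """Turn a run of merged cubes into one instance using the central cube's box."""
    t0 = run[0][0]
    t1 = max(c[1] for c in run)
    central = run[len(run) // 2]
    confidence = max(c[3] for c in run)
    return (t0, t1, central[2], activity, confidence)


def deduplicate(cubes: List[Cube], thresholds: Dict[str, float]) -> List[Cube]:
    """Filter cubes by per-activity confidence, then merge temporally adjacent ones."""
    by_activity = defaultdict(list)
    for t0, t1, box, activity, conf in cubes:
        if conf >= thresholds[activity]:
            by_activity[activity].append((t0, t1, box, conf))
    instances = []
    for activity, kept in by_activity.items():
        kept.sort(key=lambda c: c[0])
        run = [kept[0]]
        for cube in kept[1:]:
            if cube[0] <= max(c[1] for c in run):  # overlaps or touches the current run
                run.append(cube)
            else:
                instances.append(_close(activity, run))
                run = [cube]
        instances.append(_close(activity, run))
    return instances
```

Emitting one merged instance per run, rather than many short cubes, is what keeps a single ground truth from being matched by several predictions under the RFA metric.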

Results – under different constraints

Future Work

  • Optimization for AOD
    • Use frame-level object detection bounding boxes, at additional computation costs, depending on the efficiency requirements
  • Optimization for RFA
    • Refine deduplication algorithm, with joint optimization with recognition, e.g., classification of activity start, middle, and end

Acknowledgements

This research is supported in part by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DOI/IBC) contract number D17PC00340. This research is supported in part through the financial assistance award 60NANB17D156 from U.S. Department of Commerce, National Institute of Standards and Technology. This project is funded in part by Carnegie Mellon University’s Mobility21 National University Transportation Center, which is sponsored by the US Department of Transportation.
