1 of 32

Argus++: Robust Real-time Activity Detection for Unconstrained Video Streams with Overlapping Cube Proposals

Lijun Yu, Yijun Qian, Wenhe Liu, and Alexander G. Hauptmann

2 of 32

Overview

In unconstrained videos: untrimmed and with a large field of view
Three aspects

Temporal localization
Spatial localization
Action classification

Activity detection

Detect all atomic activities
Bipartite match between predictions and ground truths

Strict target

Detect either atomic activities (e.g., standing up) or continuous repetitive activities (e.g., walking)
Match multiple non-overlapping predictions to each ground truth

Loosened target

3 of 32

Argus++ Framework

4 of 32

Argus++ Architecture

5 of 32

Intermediate Concept: Cube Proposal

Proposal

A candidate region where activity may occur
Processing element for activity recognition

Spatio-temporal cube proposal

A simple six-tuple defining the boundaries in three dimensions

Fixed temporal duration when sampled
Much simpler than activity instances or tube proposals
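To make the six-tuple concrete, here is a minimal sketch of a cube proposal as a Python data class. The field names (t0, t1, x0, y0, x1, y1) and the exclusive end frame are assumptions for illustration; the slides only state that a cube is a six-tuple bounding one temporal and two spatial dimensions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cube:
    """Spatio-temporal cube proposal (field names are assumed, not from the slides)."""
    t0: int    # first frame, inclusive
    t1: int    # last frame, exclusive
    x0: float  # left edge
    y0: float  # top edge
    x1: float  # right edge
    y1: float  # bottom edge

    @property
    def duration(self) -> int:
        """Number of frames covered by the cube."""
        return self.t1 - self.t0

    @property
    def area(self) -> float:
        """Spatial area of the cube's bounding box."""
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


cube = Cube(t0=0, t1=64, x0=100.0, y0=50.0, x1=180.0, y1=210.0)
```

Unlike a tube proposal, which stores one box per frame, the cube keeps a single box for its whole duration.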

6 of 32

Proposal Generation

Proposal sampling

Dense overlapping sampling on untrimmed videos
Ensure completeness and coverage of any activity instance

Proposal refinement

Seed track ids from the central frame in each temporal window
Enlarge bounding boxes as union across the window
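The two steps above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function names and the track data layout (a dict of per-frame boxes) are assumptions, and the default duration of 64 and stride of 16 are taken from the Implementation Details slide.

```python
def sample_windows(num_frames, duration=64, stride=16):
    """Dense overlapping temporal windows as (start, end) pairs, end exclusive.

    Videos shorter than one window yield no windows in this sketch.
    """
    return [(s, s + duration) for s in range(0, num_frames - duration + 1, stride)]


def refine_proposals(tracks, window):
    """Build one cube per track that is visible in the window's central frame.

    tracks: {track_id: {frame_index: (x0, y0, x1, y1)}}  (assumed layout)
    Returns {track_id: (t0, t1, x0, y0, x1, y1)}.
    """
    t0, t1 = window
    center = (t0 + t1) // 2
    cubes = {}
    for track_id, boxes in tracks.items():
        if center not in boxes:
            continue  # only tracks present in the central frame seed a proposal
        in_window = [box for frame, box in boxes.items() if t0 <= frame < t1]
        # Enlarge the box to the union of the track's boxes across the window.
        cubes[track_id] = (
            t0, t1,
            min(b[0] for b in in_window), min(b[1] for b in in_window),
            max(b[2] for b in in_window), max(b[3] for b in in_window),
        )
    return cubes


windows = sample_windows(128)
tracks = {
    1: {f: (10 + f, 20, 30 + f, 60) for f in range(64)},  # moves right over time
    2: {f: (0, 0, 5, 5) for f in range(10)},              # gone before frame 32
}
cubes = refine_proposals(tracks, windows[0])
```

With a stride smaller than the duration, consecutive windows overlap, so an activity instance is covered by several cubes.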

7 of 32

Proposal Generation: An Example

9 of 32

Proposal Filtering

Foreground segmentation

Frame-level binary mask for foreground pixels
Proposal foreground score as average value of pixel mask inside the cube
Learn a filtering threshold that sacrifices at most a fixed fraction of true positives
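A minimal sketch of the foreground score and the threshold selection, under stated assumptions: masks are indexed as masks[t][y][x] with values 0 or 1, cube coordinates are integers, and the allowed loss of true positives (max_loss) is a hypothetical parameter whose value the slides do not give.

```python
def foreground_score(masks, cube):
    """Average binary mask value inside the cube; masks[t][y][x] is 0 or 1."""
    t0, t1, x0, y0, x1, y1 = cube
    values = [
        masks[t][y][x]
        for t in range(t0, t1)
        for y in range(y0, y1)
        for x in range(x0, x1)
    ]
    return sum(values) / len(values) if values else 0.0


def learn_threshold(positive_scores, max_loss=0.05):
    """Largest score threshold that discards at most max_loss of the positives.

    Proposals are kept when score >= threshold. Expects a non-empty list.
    """
    ranked = sorted(positive_scores)
    allowed = int(len(ranked) * max_loss)  # number of positives we may sacrifice
    return ranked[allowed]


# Two 4x4 frames whose left half is foreground.
masks = [[[1, 1, 0, 0] for _ in range(4)] for _ in range(2)]
whole = foreground_score(masks, (0, 2, 0, 0, 4, 4))
left = foreground_score(masks, (0, 2, 0, 0, 2, 4))
threshold = learn_threshold([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0], max_loss=0.2)
```

The threshold would be learned on proposals already labeled as positive, so the filter trades a bounded loss in recall for fewer cubes sent to the classifier.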

Label assignment

Convert annotation into cube format by dense sampling
Estimate spatial IoU between proposal and ground truth cubes
Follow Faster R-CNN in selecting positive and negative samples
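The label assignment can be illustrated with a spatial IoU and a two-threshold rule in the style of Faster R-CNN. The threshold values (pos_iou, neg_iou) are placeholders, not values from the slides, and this sketch assigns only the single best-matching ground truth, whereas a multi-label setup could keep every ground truth above the positive threshold.

```python
def spatial_iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def assign_label(proposal_box, gt_boxes, pos_iou=0.5, neg_iou=0.1):
    """Return ('positive', gt_index), ('negative', None), or ('ignore', None).

    pos_iou and neg_iou are assumed values for illustration only.
    """
    best_index, best_iou = None, 0.0
    for index, gt_box in enumerate(gt_boxes):
        iou = spatial_iou(proposal_box, gt_box)
        if iou > best_iou:
            best_index, best_iou = index, iou
    if best_iou >= pos_iou:
        return ("positive", best_index)
    if best_iou < neg_iou:
        return ("negative", None)
    return ("ignore", None)  # ambiguous overlap, excluded from training
```

Because proposals and ground-truth cubes share the same temporal windows after dense sampling, only the spatial overlap needs to be compared.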

Proposal evaluation

Assume perfect classifier by using assigned labels
Pass through following steps and use official metrics to estimate upper bound

10 of 32

Activity Recognition

Multi-label Classification

Binary cross entropy loss
Weighted by proposal scores
Balance activity-wise pos/neg samples
Balance samples of different activities
Balance samples of different datasets when used
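As an illustration of the loss, the sketch below computes a weighted binary cross entropy for one activity class and derives weights that balance positive and negative samples. How the proposal scores and the different balancing terms are combined is not specified in the slides; multiplying them into one per-sample weight is an assumption here, and the function names are hypothetical.

```python
import math


def weighted_bce(probs, targets, weights, eps=1e-7):
    """Mean of per-sample binary cross entropy, each term scaled by its weight."""
    total = 0.0
    for p, y, w in zip(probs, targets, weights):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def balance_weights(targets):
    """Weights giving positives and negatives equal total weight for one activity."""
    n_pos = sum(targets)
    n_neg = len(targets) - n_pos
    if n_pos == 0 or n_neg == 0:
        return [1.0 / len(targets)] * len(targets)  # nothing to balance
    return [0.5 / n_pos if y else 0.5 / n_neg for y in targets]


targets = [1, 0, 0, 0]
proposal_scores = [0.9, 0.8, 0.6, 0.7]  # e.g. foreground scores of the cubes
weights = [b * s for b, s in zip(balance_weights(targets), proposal_scores)]
loss = weighted_bce([0.7, 0.2, 0.1, 0.4], targets, weights)
```

In a multi-label setting this loss would be evaluated independently for every activity class and then summed or averaged.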

Action-wise late fusion

11 of 32

Activity Deduplication

Overlapping instances

Adjacent instances

Merge adjacent cubes above a certain threshold, subject to a minimum duration
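The merging of adjacent cubes might look like the sketch below for a single track and activity. The threshold value, the choice to keep the maximum score of the merged cubes, and the interpretation of "minimum duration" as dropping merged instances that are too short are all assumptions.

```python
def merge_adjacent(cubes, score_threshold=0.5, min_duration=64):
    """Merge temporally touching or overlapping cubes that pass the threshold.

    cubes: list of (t0, t1, score) for one track and one activity.
    Returns merged (t0, t1, score) instances of at least min_duration frames.
    """
    kept = sorted(c for c in cubes if c[2] >= score_threshold)
    merged = []
    for t0, t1, score in kept:
        if merged and t0 <= merged[-1][1]:
            # Extends the previous instance; keep the best score seen so far.
            prev = merged[-1]
            merged[-1] = (prev[0], max(prev[1], t1), max(prev[2], score))
        else:
            merged.append((t0, t1, score))
    return [m for m in merged if m[1] - m[0] >= min_duration]


instances = merge_adjacent(
    [(0, 64, 0.9), (16, 80, 0.8), (32, 96, 0.2), (200, 264, 0.7)]
)
```

In this example the third cube falls below the threshold, so the first instance ends at frame 80 rather than 96.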

12 of 32

Experimental Results

13 of 32

Implementation Details

Object detection: Mask R-CNN with ResNet-101 on COCO, stride 8
Multi-object tracking: Towards-Realtime-MOT
Foreground segmentation: HoG
Proposal: duration 64, stride 16
Classifiers: R(2+1)D, X3D, TRM

14 of 32

Evaluation Protocols

15 of 32

CVPR 2021 ActivityNet Challenge: ActEV SDL Unknown Facility

16 of 32

NIST ActEV’21 SDL Known Facility

17 of 32

NIST ActEV’21 SDL Unknown Facility

18 of 32

NIST TRECVID 2021 ActEV

19 of 32

NIST TRECVID 2020 ActEV

20 of 32

ICCV 2021 ROAD Action Detection

21 of 32

Ablation Study

Coverage of Proposal Formats

Performance of Proposal Filtering

22 of 32

Conclusion and Future Work

Argus++: Robust Real-time Activity Detection
Overlapping spatio-temporal cube proposals
Superior performance in CVPR ActivityNet ActEV 2021, NIST ActEV SDL UF/KF, TRECVID ActEV 2020/2021, ICCV ROAD 2021

Extending strict target into ActEV settings: bipartite matching with spatial localization
Generalizing to more scenarios such as UAV videos
Zero-shot or few-shot activity detection

23 of 32

Argus++ in SRL

24 of 32

ActEV SRL Challenge - Metrics

Time-based false alarm (TFA) -> rate of false alarm (RFA)

Match one ground truth with only one prediction
Predictions cannot be left as short cubes; they need to be merged

Temporal localization -> Spatio-temporal localization

Matching requires spatial alignment

25 of 32

Argus++ for SRL

Modifications are limited to the activity deduplication step
Trained models from SDL system
Still runs in real-time

26 of 32

Activity Deduplication

Filter cubes based on classification confidence

Thresholds taken from scorer at 0.02 TFA

For remaining cubes, merge adjacent ones into one instance
Use bounding box from central cube

Since cube stride is 16, bounding boxes are unions of each 16-frame window
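The SRL variant of deduplication can be sketched as a confidence filter followed by grouping and a box taken from the central cube of each group. The data layout, the function name, and the use of the maximum score per instance are assumptions; the threshold would come from the scorer as described above, and the value used in the example is a placeholder.

```python
def merge_with_central_box(cubes, threshold):
    """Filter cubes by confidence, merge adjacent ones, keep the central cube's box.

    cubes: list of (t0, t1, box, score) for one activity, all of equal duration.
    Returns a list of (t0, t1, box, score) instances.
    """
    kept = sorted((c for c in cubes if c[3] >= threshold), key=lambda c: c[0])
    groups = []
    for cube in kept:
        if groups and cube[0] <= groups[-1][-1][1]:
            groups[-1].append(cube)  # touches or overlaps the previous cube
        else:
            groups.append([cube])
    instances = []
    for group in groups:
        central = group[len(group) // 2]
        instances.append((
            group[0][0],
            max(c[1] for c in group),
            central[2],                # spatial box of the central cube
            max(c[3] for c in group),
        ))
    return instances


srl_instances = merge_with_central_box(
    [
        (0, 64, (0, 0, 10, 10), 0.9),
        (16, 80, (2, 0, 12, 10), 0.6),
        (32, 96, (4, 0, 14, 10), 0.7),
        (48, 112, (6, 0, 16, 10), 0.01),
    ],
    threshold=0.05,
)
```

Each ground truth is then matched by at most one merged instance, which is what the rate-of-false-alarm metric requires.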

Applied activity type filter, scene type filter, activity count filter

27 of 32

Results – under different constraints

28 of 32

Results – under different constraints

29 of 32

Results – under different constraints

30 of 32

Future Work

Optimization for AOD

Use frame-level object detection bounding boxes, at additional computation costs, depending on the efficiency requirements

Optimization for RFA

Refine deduplication algorithm, with joint optimization with recognition, e.g., classification of activity start, middle, and end

31 of 32

Acknowledgements

This research is supported in part by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DOI/IBC) contract number D17PC00340. This research is supported in part through the financial assistance award 60NANB17D156 from U.S. Department of Commerce, National Institute of Standards and Technology. This project is funded in part by Carnegie Mellon University’s Mobility21 National University Transportation Center, which is sponsored by the US Department of Transportation.
