1 of 21

EScALation: A Framework for Efficient and Scalable Spatio-temporal Action Localization

Bo Chen, Klara Nahrstedt

University of Illinois at Urbana-Champaign, USA

Email: boc2@illinois.edu

2 of 21

Introduction

  • Cameras play an important role in our daily lives, e.g., in smart homes and autonomous vehicles.
  • Computer vision algorithms are commonly applied to analyze the videos these cameras capture.

3 of 21

Introduction

  • To enable autonomous analysis of videos, Spatio-Temporal Action Localization (or Action Detection) has been proposed.
  • Action Detection aims to find the sequence of bounding boxes associated with an action (the action tube) and the action class in a video, e.g., biking, ice dancing, walking dog…

[Figure: example frames with action tubes for biking, ice dancing, and walking dog]
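To make the notion of an action tube concrete, here is a minimal sketch of one way to represent it; the class and field names are illustrative, not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class ActionTube:
    """A sequence of per-frame bounding boxes tagged with one action class."""
    label: str                                  # action class, e.g., "biking"
    start_frame: int                            # first frame of the tube
    boxes: list = field(default_factory=list)   # one (x1, y1, x2, y2) box per frame

    @property
    def end_frame(self) -> int:
        return self.start_frame + len(self.boxes) - 1

# Example: a short "biking" tube spanning frames 10..12
tube = ActionTube("biking", start_frame=10,
                  boxes=[(40, 60, 120, 200), (42, 61, 122, 201), (45, 63, 125, 203)])
print(tube.label, tube.start_frame, tube.end_frame)  # biking 10 12
```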

4 of 21

Introduction

  • Action Detection is challenging:
    • Spatial information: locate bounding boxes
    • Temporal information: determine the start and the end time of an action tube

[Figure: the same biking, ice dancing, and walking dog examples, illustrating the spatial and temporal dimensions]

5 of 21

Motivation

  • Existing problems:
    • Per-frame bounding box detection based on CNN is resource-intensive
    • Tube linking does not scale to a large number of action classes

Approach         | Backbone Network            | Hardware        | Speed (fps) | mAP (%)
Peng et al. [1]  | VGG16                       | –               | Offline     | 32.1
Saha et al. [2]  | VGG16                       | –               | Offline     | 36.4
Singh et al. [3] | VGG16                       | NVIDIA Titan X  | 40          | 40.9
YOWO [4]         | 3D-ResNext-101 + Darknet-19 | NVIDIA Titan XP | 34          | 48.8

[1] X. Peng et al., "Multi-region two-stream R-CNN for action detection," ECCV, 2016.

[2] S. Saha et al., "Deep learning for detecting multiple space-time action tubes in videos," arXiv:1608.01529, 2016.

[3] G. Singh et al., "Online real-time multiple spatiotemporal action localisation and prediction," ICCV, 2017.

[4] O. Köpüklü et al., "You only watch once: A unified CNN architecture for real-time spatiotemporal action localization," arXiv:1911.06644, 2019.

6 of 21

Motivation

  • Intuitions:
    • The temporal correlation in a video can be exploited to remove redundant computation from action detection

7 of 21

EScALation

  • A novel framework for optimizing action detection in videos. It overcomes the resource intensiveness of existing bounding-box detection networks and makes tube linking scalable by exploiting the temporal correlation in the video.

8 of 21

EScALation

  • Pipeline of YOWO
    • Bounding box detection is applied to all frames.
    • Tube linking is applied to all classes.

9 of 21

EScALation

  • Pipeline of EScALation
    • Frame sampling: sample key frames to be processed by bounding box detection.
    • Class filtering: predict the action class to drive tube linking.

  • The key to success lies in the strategies for sampling frames and predicting classes (a sketch of the overall pipeline follows)
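As a rough illustration, the two stages could wrap an existing detector and tube linker as below; detect_boxes, link_tubes, and predict_class are hypothetical callables standing in for components the paper does not name this way.

```python
def escalation_pipeline(frames, detect_boxes, link_tubes, predict_class,
                        interval=5, num_ref_frames=30):
    """Sketch of the EScALation pipeline: frame sampling + class filtering."""
    # Frame sampling: run the CNN detector only on key frames
    # (interval = number of frames skipped after each key frame).
    detections = {}
    for i in range(0, len(frames), interval + 1):
        detections[i] = detect_boxes(frames[i])

    # Class filtering: predict one action class from the reference frames,
    # then link tubes for that class only.
    action_class = predict_class(frames[:num_ref_frames])
    return link_tubes(detections, classes=[action_class])
```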

10 of 21

Frame Sampling

  • Our strategy is to slice a video into fixed-length sequences of frames, apply the CNN to the first frame of each sequence, and skip the remaining frames
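A minimal sketch of this strategy, assuming the key frame's detections are simply reused for the skipped frames (one plausible reading of how the temporal correlation is exploited; detect_boxes is a placeholder for the detector):

```python
def sample_and_detect(frames, detect_boxes, interval):
    """Run the CNN on the first frame of each (interval + 1)-frame sequence
    and reuse its boxes for the skipped frames; interval=0 detects every frame."""
    seq_len = interval + 1
    per_frame_boxes = [None] * len(frames)
    for start in range(0, len(frames), seq_len):
        key_boxes = detect_boxes(frames[start])        # CNN on the key frame only
        for i in range(start, min(start + seq_len, len(frames))):
            per_frame_boxes[i] = key_boxes             # skipped frames inherit boxes
    return per_frame_boxes
```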

11 of 21

Frame Sampling

  • To show why this strategy works, we choose IoU thresholds of 0.05, 0.1, 0.2, 0.3, 0.5, and 0.75, and vary the sampling interval from 0 to 29 frames. We apply this strategy to YOWO on the training set of the UCF101-24 dataset.

  • The mAP of YOWO with sampling changes little as the sampling interval increases (especially at low IoU thresholds).
  • The action detection time drops quickly as the sampling interval increases.
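The IoU here is spatio-temporal, between a detected tube and a ground-truth tube. A common definition, which we assume here, multiplies the temporal IoU of the two tubes by the mean spatial IoU over their overlapping frames:

```python
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def tube_iou(tube_a, tube_b):
    """Spatio-temporal IoU: temporal IoU times mean spatial IoU over the
    temporally overlapping frames. tube_* maps frame index -> box."""
    frames_a, frames_b = set(tube_a), set(tube_b)
    overlap = frames_a & frames_b
    if not overlap:
        return 0.0
    t_iou = len(overlap) / len(frames_a | frames_b)
    s_iou = sum(box_iou(tube_a[f], tube_b[f]) for f in overlap) / len(overlap)
    return t_iou * s_iou
```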

12 of 21

Frame Sampling

  • This experiment implies that frame sampling significantly reduces the action detection time while affecting the mAP only mildly.
  • The remaining question is how to choose the optimal sampling interval.

13 of 21

Action Detection Metrics


14 of 21

Frame Sampling

  • Our hypothesis is that ALS on the test set responds to the sampling interval in the same way as on the training set.
  • Results show that the mAP, time, and ALS curves on the test set and the training set are indeed similar.

15 of 21

Class Filtering

  • Class filtering predicts the action class of a video in a lightweight manner before tube linking, filters out all other classes, and links only tubes of the predicted class (see the sketch below).
  • The prediction is based on each class's score averaged over the first K frames of the video (the reference frames).
  • The accuracy on both datasets reaches a plateau when K is around 30 (after about 1 s of video).

[Plot: class prediction accuracy vs. K, plateauing at 94.5% and 84% on the two datasets]
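A minimal sketch of this step, assuming the detector exposes per-frame, per-class scores as an array (the function and variable names are hypothetical):

```python
import numpy as np

def filter_class(frame_scores, k=30):
    """Pick the single action class to link, from per-frame class scores.

    frame_scores: array of shape (num_frames, num_classes) holding the
    class-specific scores produced by the detector (an assumption about
    its output format). The scores of the first k reference frames are
    averaged per class, and the argmax class is kept for tube linking.
    """
    ref = np.asarray(frame_scores[:k])
    return int(ref.mean(axis=0).argmax())

# Usage: link tubes only for the predicted class.
scores = np.random.rand(120, 24)        # e.g., 120 frames, 24 classes (UCF101-24)
predicted = filter_class(scores, k=30)
# tubes = link_tubes(detections, classes=[predicted])  # other classes filtered out
```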

16 of 21

Evaluation


[1] O. Köpüklü et al., "You only watch once: A unified CNN architecture for real-time spatiotemporal action localization," arXiv:1911.06644, 2019.

17 of 21

Overall Performance

  • Action detection results on UCF101-24
  • Scenario 1: the action localization time is reduced by 72.2%, with a 6.1% loss of mAP and a 24.6% improvement in ALS.
  • Scenario 2: the action localization time is reduced by 76.9%, with a 9.7% loss of mAP and a 100.4% improvement in ALS.

18 of 21

Overall Performance

  • Action detection results on J-HMDB-21
  • Scenario 1: the action localization time is reduced by 73.8%, with a 10% loss of mAP and a 6.1% improvement in ALS.
  • Scenario 2: the action localization time is reduced by 84%, with a 13.6% loss of mAP and an 85.9% improvement in ALS.

19 of 21

Conclusion

  • We present EScALation, a framework for efficient and scalable spatio-temporal action localization, built mainly on frame sampling and class filtering.
  • Our evaluation on UCF101-24 and J-HMDB-21 demonstrates that EScALation greatly reduces the action detection time compared with the state-of-the-art approach, YOWO, with only a slight loss of mAP, thereby improving the ALS.

20 of 21

Future Work

  • Adapt the frame sampling interval based on the frames processed so far, to further improve the accuracy-time tradeoff.
  • Integrate frame sampling with image compression techniques to reduce the size of the video to be transmitted in an end-to-end action detection system.

21 of 21

Q&A

Thank you for listening!

Bo Chen, Klara Nahrstedt

University of Illinois at Urbana-Champaign, USA

Email: boc2@illinois.edu