1 of 21

EScALation: A Framework for Efficient and Scalable Spatio-temporal Action Localization

Bo Chen, Klara Nahrstedt

University of Illinois at Urbana-Champaign, USA

Email: boc2@illinois.edu

2 of 21

Introduction

  • Cameras play an important role in our daily lives, e.g., in smart homes and autonomous vehicles.
  • Computer vision algorithms are commonly applied to analyze the videos these cameras capture.

3 of 21

Introduction

  • To enable autonomous analysis of videos, Spatio-Temporal Action Localization (or Action Detection) has been proposed.
  • Action Detection aims to find the sequence of bounding boxes associated with an action (the action tube) and the action class in a video, e.g., biking, ice dancing, walking dog…

[Figure: example frames with action tubes for biking, ice dancing, and walking dog]
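To make the notion of an action tube concrete, here is a minimal sketch of one way to represent it; the class and field names are illustrative, not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class ActionTube:
    """A sequence of per-frame bounding boxes tagged with one action class."""
    label: str                                  # action class, e.g., "biking"
    start_frame: int                            # first frame of the tube
    boxes: list = field(default_factory=list)   # one (x1, y1, x2, y2) box per frame

    @property
    def end_frame(self) -> int:
        return self.start_frame + len(self.boxes) - 1

# Example: a short "biking" tube spanning frames 10..12
tube = ActionTube("biking", start_frame=10,
                  boxes=[(40, 60, 120, 200), (42, 61, 122, 201), (45, 63, 125, 203)])
print(tube.label, tube.start_frame, tube.end_frame)  # biking 10 12
```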

4 of 21

Introduction

  • Action Detection is challenging:
    • Spatial information: locate bounding boxes
    • Temporal information: determine the start and the end time of an action tube

[Figure: the same biking, ice dancing, and walking dog examples, illustrating the spatial and temporal dimensions]

5 of 21

Motivation

  • Existing problems:
    • Per-frame bounding box detection based on CNN is resource-intensive
    • Tube linking does not scale to a large number of action classes

Approach         | Backbone Network            | Hardware        | Speed (fps) | mAP (%)
Peng et al. [1]  | VGG16                       | –               | Offline     | 32.1
Saha et al. [2]  | VGG16                       | –               | Offline     | 36.4
Singh et al. [3] | VGG16                       | NVIDIA Titan X  | 40          | 40.9
YOWO [4]         | 3D-ResNext-101 + Darknet-19 | NVIDIA Titan XP | 34          | 48.8

[1] X. Peng et al., "Multi-region two-stream R-CNN for action detection," ECCV, 2016.

[2] S. Saha et al., "Deep learning for detecting multiple space-time action tubes in videos," arXiv:1608.01529, 2016.

[3] G. Singh et al., "Online real-time multiple spatiotemporal action localisation and prediction," ICCV, 2017.

[4] O. Köpüklü et al., "You only watch once: A unified CNN architecture for real-time spatiotemporal action localization," arXiv:1911.06644, 2019.

6 of 21

Motivation

  • Intuitions:
    • The temporal correlation in a video can be exploited to remove redundant computation from action detection

7 of 21

EScALation

  • A novel framework for optimizing action detection in videos. It overcomes the resource intensiveness of existing bounding-box detection networks and makes tube linking scalable by exploiting the temporal correlation in the video.

8 of 21

EScALation

  • Pipeline of YOWO
    • Bounding box detection is applied to all frames.
    • Tube linking is applied to all classes.

9 of 21

EScALation

  • Pipeline of EScALation
    • Frame sampling: sample key frames to be processed by bounding box detection.
    • Class filtering: predict the action class to drive tube linking.

  • The key to success lies in the strategies for sampling frames and predicting classes (a sketch of the overall pipeline follows)
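As a rough illustration, the two stages could wrap an existing detector and tube linker as below; detect_boxes, link_tubes, and predict_class are hypothetical callables standing in for components the paper does not name this way.

```python
def escalation_pipeline(frames, detect_boxes, link_tubes, predict_class,
                        interval=5, num_ref_frames=30):
    """Sketch of the EScALation pipeline: frame sampling + class filtering."""
    # Frame sampling: run the CNN detector only on key frames
    # (interval = number of frames skipped after each key frame).
    detections = {}
    for i in range(0, len(frames), interval + 1):
        detections[i] = detect_boxes(frames[i])

    # Class filtering: predict one action class from the reference frames,
    # then link tubes for that class only.
    action_class = predict_class(frames[:num_ref_frames])
    return link_tubes(detections, classes=[action_class])
```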

10 of 21

Frame Sampling

  • Our strategy is to slice a video into fixed-length sequences of frames, apply the CNN to the first frame of each sequence, and skip the remaining frames
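A minimal sketch of this strategy, assuming the key frame's detections are simply reused for the skipped frames (one plausible reading of how the temporal correlation is exploited; detect_boxes is a placeholder for the detector):

```python
def sample_and_detect(frames, detect_boxes, interval):
    """Run the CNN on the first frame of each (interval + 1)-frame sequence
    and reuse its boxes for the skipped frames; interval=0 detects every frame."""
    seq_len = interval + 1
    per_frame_boxes = [None] * len(frames)
    for start in range(0, len(frames), seq_len):
        key_boxes = detect_boxes(frames[start])        # CNN on the key frame only
        for i in range(start, min(start + seq_len, len(frames))):
            per_frame_boxes[i] = key_boxes             # skipped frames inherit boxes
    return per_frame_boxes
```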

11 of 21

Frame Sampling

  • To show why this strategy works, we choose IoU thresholds of 0.05, 0.1, 0.2, 0.3, 0.5, and 0.75, and vary the sampling interval from 0 to 29 frames. We apply this strategy to YOWO on the training set of the UCF101-24 dataset.

  • The mAP of YOWO with sampling changes little as the sampling interval increases (especially at low IoU thresholds).
  • The action detection time drops quickly as the sampling interval increases.
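The IoU here is spatio-temporal, between a detected tube and a ground-truth tube. A common definition, which we assume here, multiplies the temporal IoU of the two tubes by the mean spatial IoU over their overlapping frames:

```python
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def tube_iou(tube_a, tube_b):
    """Spatio-temporal IoU: temporal IoU times mean spatial IoU over the
    temporally overlapping frames. tube_* maps frame index -> box."""
    frames_a, frames_b = set(tube_a), set(tube_b)
    overlap = frames_a & frames_b
    if not overlap:
        return 0.0
    t_iou = len(overlap) / len(frames_a | frames_b)
    s_iou = sum(box_iou(tube_a[f], tube_b[f]) for f in overlap) / len(overlap)
    return t_iou * s_iou
```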

12 of 21

Frame Sampling

  • This experiment implies that frame sampling significantly reduces the action detection time while affecting the mAP only mildly.
  • The remaining question is how to choose the optimal sampling interval.

13 of 21

Action Detection Metrics


14 of 21

Frame Sampling

  • Our hypothesis is that ALS on the test set responds to the sampling interval in the same way as on the training set.
  • Results show that the mAP, time, and ALS curves on the test set and the training set are indeed similar.

15 of 21

Class Filtering

  • Class filtering predicts the action class of a video in a lightweight manner before tube linking, filters out all other classes, and links only tubes of the predicted class (see the sketch below).
  • The prediction is based on each class's score averaged over the first K frames of the video (the reference frames).
  • The accuracy on both datasets reaches a plateau when K is around 30 (after about 1 s of video).

[Plot: class prediction accuracy vs. K, plateauing at 94.5% and 84% on the two datasets]
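A minimal sketch of this step, assuming the detector exposes per-frame, per-class scores as an array (the function and variable names are hypothetical):

```python
import numpy as np

def filter_class(frame_scores, k=30):
    """Pick the single action class to link, from per-frame class scores.

    frame_scores: array of shape (num_frames, num_classes) holding the
    class-specific scores produced by the detector (an assumption about
    its output format). The scores of the first k reference frames are
    averaged per class, and the argmax class is kept for tube linking.
    """
    ref = np.asarray(frame_scores[:k])
    return int(ref.mean(axis=0).argmax())

# Usage: link tubes only for the predicted class.
scores = np.random.rand(120, 24)        # e.g., 120 frames, 24 classes (UCF101-24)
predicted = filter_class(scores, k=30)
# tubes = link_tubes(detections, classes=[predicted])  # other classes filtered out
```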

16 of 21

Evaluation


[1] O. Köpüklü et al., "You only watch once: A unified CNN architecture for real-time spatiotemporal action localization," arXiv:1911.06644, 2019.

17 of 21

Overall Performance

  • Action detection results on UCF101-24
  • Scenario 1: the action localization time is reduced by 72.2%, with a 6.1% loss of mAP and a 24.6% improvement in ALS.
  • Scenario 2: the action localization time is reduced by 76.9%, with a 9.7% loss of mAP and a 100.4% improvement in ALS.

18 of 21

Overall Performance

  • Action detection results on J-HMDB-21
  • Scenario 1: the action localization time is reduced by 73.8%, with a 10% loss of mAP and a 6.1% improvement in ALS.
  • Scenario 2: the action localization time is reduced by 84%, with a 13.6% loss of mAP and an 85.9% improvement in ALS.

19 of 21

Conclusion

  • We present EScALation, a framework for efficient and scalable spatio-temporal action localization, built mainly on frame sampling and class filtering.
  • Our evaluation on UCF101-24 and J-HMDB-21 demonstrates that EScALation greatly reduces the action detection time compared with the state-of-the-art approach, YOWO, with only a slight loss of mAP, thereby improving the ALS.

20 of 21

Future Work

  • Adapt the frame sampling interval based on the frames processed so far, to further improve the accuracy-time tradeoff.
  • Integrate frame sampling with image compression techniques to reduce the size of the video to be transmitted in an end-to-end action detection system.

21 of 21

Q&A

Thank you for listening!

Bo Chen, Klara Nahrstedt

University of Illinois at Urbana-Champaign, USA

Email: boc2@illinois.edu