1 of 54

Detection and Tracking of Foot Movement in Kathak Dance

- Pranit Chawla (17EC35017), Department of E&ECE

- Advisor: Prof. Dr. Partha Pratim Das

- Mentor: Saptami Ghosh

2 of 54

Introduction

  • Kathak dances are composed of several “Ladis”, which are rhythmic footwork patterns put together in a sequence.

  • A particular sequence of foot movements (strikes and lifts) can uniquely define a Ladi variation.

  • Thus, detecting foot movements and tracking the path of the foot are important for the classification and identification of Ladis.

3 of 54

Ladi Variations and foot patterns

4 of 54

Problem Formulation

  • Detect the presence of feet movement in a continuous dance video stream (Temporal Activity Detection)
  • Track foot movement across all frames to obtain the path of the foot and use it for further analysis

  • Combine temporal information with the sequence of foot postures of both feet to detect the pattern of a Ladi variation

5 of 54

Training data available

  • Untrimmed videos of Kathak dance with RGB-D information (only RGB used so far)
  • Labelled start and end frame numbers of foot movement for each bol
  • Audio data for the corresponding video streams

  • Labelled audio data containing the timestamp at which each bol is uttered

6 of 54

Example of movement

7 of 54

Image Preprocessing: Full vs. Cropped Image

8 of 54

Detection of feet movement

9 of 54

Literature Review

  • Computer Vision Approaches
    • Feature point tracking (SIFT, SURF or ORB) to track height of heel above baseline (Wang et al. 2011)
    • Optical flow approach to capture movement (Fleet and Weiss 2006)
    • Background subtraction and tracking to get a mask of the moving feet (Mahapatra et al. 2013)

  • Learning based Approaches
    • 2-D Convolutional Neural Network for activity recognition (Krizhevsky et al. 2012)
    • 3-D CNN + Optical Flow based Activity Recognition (Simonyan et al. 2014)
    • 3-D CNN + Regressor for Activity Detection (getting start and end times) (Xu et al. 2019)

10 of 54

Computer Vision Approaches (Summary) (Done last semester)

  1. Most methods do not work well due to poor generalisation and the need for manual threshold changes
  2. ORB feature tracking: keypoints detected across frames are inconsistent and not robust to occlusion
  3. Background subtraction: noisy when the left and right feet move within a short interval; both movements are detected
  4. Optical flow based detection: the magnitude of optical flow is noisy in several cases, and several pixels have missing optical flow values
  5. To handle such cases and make our model robust, we resort to deep learning approaches and the use of labelled data

11 of 54

ORB feature point tracking

N_features = 500

12 of 54

ORB feature point tracking

N_features = 5000
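For reference, a minimal OpenCV sketch of this extraction step; the image path is a placeholder and nfeatures matches the caption above:

    import cv2

    # Minimal ORB keypoint extraction on a single frame (path is hypothetical).
    frame = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
    orb = cv2.ORB_create(nfeatures=5000)
    keypoints, descriptors = orb.detectAndCompute(frame, None)

    # Visualise detected keypoints (green) to inspect coverage near the heel.
    vis = cv2.drawKeypoints(frame, keypoints, None, color=(0, 255, 0))
    cv2.imwrite("orb_keypoints.png", vis)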

13 of 54

Observations from ORB feature point tracking

  1. Features are not consistent (several frames show missing feature points at the heel) due to lack of texture
  2. Occlusion occurs (the heel gets hidden behind the other leg), leading to loss of feature points
  3. Unable to accurately detect the front and back parts of the foot during movement
  4. Differences in dancers' skin colour, low background contrast, and loss of contour during fast foot movement result in poor feature point extraction from ORB
  5. Even after varying parameters and reducing thresholds for new points, reliable keypoints for tracking are not generated

14 of 54

Background Subtraction and Median Tracking

Background Subtraction

  • Subtract the current frame from the background (the average of the first 10 frames) to get a mask of changed pixels

Pre-Processing Techniques

  • Use erosion + dilation to thin down the white points against the black background, and divide the image into two halves for the left and right feet

Median Tracking

  • Keep track of the median of the white pixels in the left and right halves. Whenever abs(median_left - median_right) > threshold (the threshold is data-dependent), the left foot is in the air, and vice versa (see the sketch below)
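A minimal sketch of the method, assuming grayscale frames and reading "median of white pixels" as the median row index of mask pixels in each half; the binary threshold of 30 is an assumed, data-dependent value:

    import cv2
    import numpy as np

    def median_tracking(frames, median_thresh):
        """Sketch: background subtraction + median tracking on grayscale frames."""
        background = np.mean(frames[:10], axis=0).astype(np.uint8)  # avg of first 10 frames
        kernel = np.ones((3, 3), np.uint8)
        for frame in frames[10:]:
            diff = cv2.absdiff(frame, background)                   # changed pixels
            _, mask = cv2.threshold(diff, 30, 255, cv2.THRESH_BINARY)
            mask = cv2.dilate(cv2.erode(mask, kernel), kernel)      # erosion + dilation
            h, w = mask.shape
            rows_l = np.where(mask[:, : w // 2] > 0)[0]             # left-half white pixels
            rows_r = np.where(mask[:, w // 2 :] > 0)[0]             # right-half white pixels
            med_l = np.median(rows_l) if rows_l.size else h
            med_r = np.median(rows_r) if rows_r.size else h
            if abs(med_l - med_r) > median_thresh:                  # data-dependent threshold
                yield "left foot in air" if med_l < med_r else "right foot in air"
            else:
                yield "no movement"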

15 of 54

Background Subtraction and Median Tracking

[Figure: background subtraction masks for no movement, left foot moving, and right foot moving]

16 of 54

Background Subtraction and Median Tracking (Observations)

  • Works well to detect movement when the last frame was still
  • Errors occur when successive movements of the left and right feet happen within a short interval (both movements are detected)
  • Requires a lot of manual, dataset-dependent thresholding and tuning
  • Several movements are missed due to inaccurate thresholding and successive movements

17 of 54

Background Subtraction with Optical Flow

Optical Flow for each image in the bottom row

18 of 54

Background Subtraction with Optical Flow

Movement detected by count of pixels
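A sketch of this count-of-pixels rule, assuming Farneback dense flow; mag_thresh and count_thresh stand in for the data-dependent thresholds discussed in the observations:

    import cv2
    import numpy as np

    def movement_detected(prev_gray, curr_gray, fg_mask,
                          mag_thresh=1.0, count_thresh=500):
        """Flag movement when enough foreground pixels carry large optical flow."""
        flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])  # per-pixel flow magnitude
        moving = (mag > mag_thresh) & (fg_mask > 0)           # restrict to subtracted foreground
        return int(np.count_nonzero(moving)) > count_thresh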

19 of 54

Background Subtraction with Optical Flow

Movement missed

20 of 54

Background Subtraction with Optical Flow (Observations)

  • Detects movements better than median tracking
  • Misses several movements due to inaccuracies in the optical flow calculation
  • Requires a lot of manual, dataset-dependent tuning to set the pixel-count threshold that decides movement
  • Almost all computer vision approaches are not robust to cases such as occlusion and fast foot movement
  • This motivates the use of labelled data and deep learning algorithms

21 of 54

2-D CNN: Foot Movement Detection

  • Binary classifier to determine whether the foot is in the air or on the ground (a sketch follows the diagram)

[Diagram: input frame → CNN → FC → binary output: in the air vs. on the ground]
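A minimal PyTorch sketch of such a binary classifier; the channel widths and depth are assumptions for illustration, not the exact model trained here:

    import torch
    import torch.nn as nn

    class FootClassifier2D(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.AdaptiveAvgPool2d(1),
            )
            self.fc = nn.Linear(32, 2)      # two classes: in the air / on the ground

        def forward(self, x):               # x: (batch, 3, H, W)
            x = self.features(x).flatten(1)
            return self.fc(x)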

22 of 54

2-D CNN: Foot Movement Detection (Observations)

  1. Loss explodes during training
  2. The CNN is not able to distinguish between feet in the air and on the ground because the images are very similar
  3. Multiple frames or motion information are needed
  4. A 3-D CNN solves this problem

23 of 54

3-D CNN: Temporal Activity Recognition

Architecture of a 3-D CNN

Img Source: 3D Convolutional Neural Networks for Crop Classification with Multi-Temporal Remote Sensing Images (Ji et al., 2018)

24 of 54

3-D CNN: Temporal Activity Recognition

[Diagram: example clips fed to the 3-D CNN and classified as Moving or Not Moving]

25 of 54

3-D CNN: Temporal Activity Recognition with Optical Flow

[Diagram: RGB and optical flow streams, each processed by a 3-D CNN, combined by a fusion module to output Moving vs. Not Moving]
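A minimal PyTorch sketch of this two-stream design, assuming k-frame RGB clips and 2-channel (dx, dy) flow clips; the layer widths and fusion-by-concatenation are illustrative assumptions, not the exact architecture trained here:

    import torch
    import torch.nn as nn

    class Stream3D(nn.Module):
        def __init__(self, in_ch):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv3d(in_ch, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool3d(2),
                nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d(1),
            )

        def forward(self, clip):                  # clip: (batch, C, k, H, W)
            return self.net(clip).flatten(1)      # (batch, 32)

    class TwoStreamFusion(nn.Module):
        def __init__(self):
            super().__init__()
            self.rgb = Stream3D(in_ch=3)          # RGB clip stream
            self.flow = Stream3D(in_ch=2)         # optical flow (dx, dy) stream
            self.fusion = nn.Linear(64, 2)        # Moving vs Not Moving

        def forward(self, rgb_clip, flow_clip):
            feats = torch.cat([self.rgb(rgb_clip), self.flow(flow_clip)], dim=1)
            return self.fusion(feats)

    # Usage: logits = TwoStreamFusion()(rgb_clip, flow_clip)
    # with rgb_clip: (B, 3, k, H, W) and flow_clip: (B, 2, k, H, W).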

26 of 54

Quantitative Results (L2V3D4R1)

Confusion matrix (number of frames) for L2V3D4R1:

                        Moving (Predicted)     Not Moving (Predicted)
  Moving (Actual)       767 (True Positive)    184 (False Negative)
  Not Moving (Actual)   41 (False Positive)    1356 (True Negative)

Classification metrics for L2V3D4R1:

  Accuracy   Precision   Recall   F1 score
  0.90       0.95        0.80     0.87
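The metrics above follow directly from the confusion matrix entries; a quick check in Python:

    # Confusion matrix entries for L2V3D4R1 (from the table above).
    tp, fn, fp, tn = 767, 184, 41, 1356

    accuracy  = (tp + tn) / (tp + fn + fp + tn)                # 0.9042
    precision = tp / (tp + fp)                                 # 0.9493
    recall    = tp / (tp + fn)                                 # 0.8065
    f1        = 2 * precision * recall / (precision + recall)  # 0.8721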

27 of 54

Quantitative Results (L2V1D4R1)

Confusion matrix (number of frames) for L2V1D4R1:

                        Moving (Predicted)     Not Moving (Predicted)
  Moving (Actual)       751 (True Positive)    198 (False Negative)
  Not Moving (Actual)   166 (False Positive)   1208 (True Negative)

Classification metrics for L2V1D4R1:

  Accuracy   Precision   Recall   F1 score
  0.84       0.82        0.79     0.80

28 of 54

Quantitative Results (L2V2D4R1)

Confusion matrix (number of frames) for L2V2D4R1:

                        Moving (Predicted)     Not Moving (Predicted)
  Moving (Actual)       756 (True Positive)    196 (False Negative)
  Not Moving (Actual)   27 (False Positive)    1419 (True Negative)

Classification metrics for L2V2D4R1:

  Accuracy   Precision   Recall   F1 score
  0.91       0.97        0.79     0.87

29 of 54

3-D CNN: Temporal Activity Recognition with Optical Flow (Results)

[Figure: sample frames with the model's predictions (Moving / Not Moving)]

30 of 54

3-D CNN: Temporal Activity Recognition with Optical Flow (Observations)

  1. Classifies almost all cases (occluded heel, quick movements) very well
  2. Works in real time, predicting the output for a frame from the last (k-1) frames plus optical flow values
  3. Accuracy of around 90% is achieved; most of the false positives and false negatives occur at the beginning or end of a movement

31 of 54

Tracking the Path of the Feet

32 of 54

Tracking the path of the foot across frames

  • Once each frame is classified as moving or not moving, we need to track the path of the foot across frames
  • This tracking allows us to classify the movements into different Ladi variations
  • Since we do not have ground truths for the trajectory, we could not employ deep learning techniques for this task
  • Instead, we use computer vision algorithms to segment the foreground, apply skin based thresholding for foot segmentation, and then track the movements

33 of 54

Overview of Tracking Pipeline

  1. Colour space conversion to YCrCb
  2. Histogram equalisation
  3. Otsu thresholding
  4. Skin colour based thresholding
  5. Morphological operations based refinement
  6. Refinement by mask multiplication and area thresholding
  7. Contour fitting and tracking of centre of mass

34 of 54

Colour Space Conversion to YCrCb

  • The input image is converted to YCrCb to allow for easier histogram equalisation and skin segmentation
  • The conversion disentangles colour information from intensity information; intensity is carried by the Y channel
  • Histogram equalisation is performed on the Y channel before the other steps

35 of 54

Histogram Equalisation on the Y Channel

  • Equalisation stretches the initial histogram across all intensity values to allow for easier segmentation
  • Equalisation is not done directly on the RGB image: it is a non-linear process, so intensity information must first be separated from colour (a sketch follows the figure)

[Figure: intensity histogram before (initial) and after (equalised)]
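A sketch of the conversion and equalisation steps in OpenCV (the input path is a placeholder):

    import cv2

    img = cv2.imread("frame.png")                       # BGR input frame
    ycrcb = cv2.cvtColor(img, cv2.COLOR_BGR2YCrCb)      # separate intensity (Y) from colour
    y, cr, cb = cv2.split(ycrcb)
    y_eq = cv2.equalizeHist(y)                          # equalise only the Y channel
    ycrcb_eq = cv2.merge([y_eq, cr, cb])
    img_eq = cv2.cvtColor(ycrcb_eq, cv2.COLOR_YCrCb2BGR)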

36 of 54

Otsu Thresholding

  • Otsu thresholding is an automatic method for segmenting an image into foreground and background
  • The algorithm checks all candidate threshold values and chooses the one that maximises the inter-class variance between foreground and background
  • We use Otsu thresholding to get a mask of the dancer and suppress the background, which improves the skin segmentation results in the following steps (a sketch follows)
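A sketch of this step in OpenCV (the input path is a placeholder):

    import cv2

    img_eq = cv2.imread("equalised_frame.png")           # output of the previous step
    gray = cv2.cvtColor(img_eq, cv2.COLOR_BGR2GRAY)

    # With THRESH_OTSU set, the threshold argument (0 here) is ignored and
    # OpenCV returns the automatically chosen value instead.
    otsu_val, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    dancer = cv2.bitwise_and(img_eq, img_eq, mask=mask)  # suppress the background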

37 of 54

Otsu Thresholding (Results)

[Figure: two examples of an original frame and its Otsu mask]

38 of 54

Skin Based Segmentation

  • After multiplying the original image by the mask obtained in the last slide, we have to segment the foot for tracking and suppress the dancer's clothing
  • As shown in the figure on the right, we first convert the image to HSV space and then find threshold ranges for skin
  • We ensure that these thresholds hold across different dancers (male and female) and different dress colours (a sketch follows the figure)

Finding segmentation threshold in HSV space
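A sketch of the thresholding step; the HSV bounds below are illustrative placeholders, not the exact ranges found for these dancers:

    import cv2
    import numpy as np

    img = cv2.imread("masked_frame.png")                # dancer with background suppressed
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    lower = np.array([0, 30, 60], dtype=np.uint8)       # assumed lower HSV bound for skin
    upper = np.array([25, 180, 255], dtype=np.uint8)    # assumed upper HSV bound
    skin_mask = cv2.inRange(hsv, lower, upper)
    skin = cv2.bitwise_and(img, img, mask=skin_mask)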

39 of 54

Skin Based Segmentation (Results)

40 of 54

Skin Based Segmentation (Observations)

  • The segmentation rule segments out some parts of the foot and suppresses the dancer's dress
  • However, spurious pixels are detected along the dress, and there are a lot of holes in the foot mask
  • When the foot is moving, occlusion takes place, resulting in inaccurate segmentation
  • The spurious pixels can be removed by erosion, and the holes can then be filled by dilation; this combined operation is called opening in computer vision

41 of 54

Refinement using Morphological Operations

  • Erosion and dilation are both performed using 3×3 kernels
  • Erosion removes the spurious pixels from the dancer's dress, and dilation fills the holes in the foot (a sketch follows the figure)

[Figure: skin based segmentation vs. refinement by morphological operations]
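A sketch of this refinement in OpenCV (the mask path is a placeholder):

    import cv2
    import numpy as np

    kernel = np.ones((3, 3), np.uint8)                  # 3×3 kernels, as above
    mask = cv2.imread("skin_mask.png", cv2.IMREAD_GRAYSCALE)
    eroded = cv2.erode(mask, kernel, iterations=1)      # remove spurious dress pixels
    refined = cv2.dilate(eroded, kernel, iterations=1)  # fill holes in the foot
    # Equivalently: cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)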

42 of 54

Refinement using Morphological Operations (Issues)

  • A single contour is sometimes formed instead of two due to occlusion, as seen in figure 1
  • Spurious contours form due to noise in the video, as seen in figure 2

Figure 1: Single contour formed due to movement and occlusion

Figure 2: Third contour formed on top due to noise

43 of 54

Fixing issues of Morphological Refinement

  • The issues in the previous slide are resolved using the following methods (a sketch follows the list):
    1. Divide the single contour in figure 1 into two halves by multiplying the image with left-sided and right-sided all-ones masks, yielding two separate contours
    2. Use area based thresholding to remove all contours other than the two biggest ones
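A sketch of both fixes; the file path and the per-half largest-contour rule are illustrative assumptions:

    import cv2
    import numpy as np

    mask = cv2.imread("refined_mask.png", cv2.IMREAD_GRAYSCALE)
    h, w = mask.shape

    # 1. Multiply by left and right all-ones masks to split a merged contour.
    left_ones = np.zeros_like(mask)
    left_ones[:, : w // 2] = 1
    right_ones = np.zeros_like(mask)
    right_ones[:, w // 2 :] = 1
    left_half, right_half = mask * left_ones, mask * right_ones

    # 2. Area thresholding: keep only the biggest contour in each half.
    def biggest_contour(m):
        contours, _ = cv2.findContours(m, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        return max(contours, key=cv2.contourArea) if contours else None

    left_foot, right_foot = biggest_contour(left_half), biggest_contour(right_half)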

44 of 54

Fixing issues of Morphological Refinement (Results)

[Figure: the initial mask is multiplied by the left and right all-ones masks to yield the left and right contours]

45 of 54

Contour Fitting and tracking Centre of Mass

  • Once we have the left- and right-masked images and have performed area thresholding, we are left with two contours corresponding to the dancer's left and right feet
  • We use the cv2.findContours() function to get the boundary of each contour as a list of points, and we use these two lists to track the feet
  • We use the cv2.moments() function to get the centre of mass of each contour and track its x and y coordinates (a sketch follows)
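A sketch of the centre-of-mass computation:

    import cv2

    def contour_centroid(contour):
        """Centre of mass of a contour from its image moments."""
        m = cv2.moments(contour)
        if m["m00"] == 0:                 # degenerate contour with zero area
            return None
        return (m["m10"] / m["m00"], m["m01"] / m["m00"])  # (cx, cy)

Tracking the returned (cx, cy) of each foot over successive frames yields the trajectories shown in the results.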

46 of 54

Contour Fitting and tracking Centre of Mass (Results)

47 of 54

Step-wise Results Overview

48 of 54

Qualitative Results on different dancers

[Figure: tracking results for Dancer 1, Dancer 4, and Dancer 3 (male)]

49 of 54

Quantitative Result on L2V3D4R1

50 of 54

Quantitative Result on L2V2D4R1

51 of 54

Tracking pipeline (Limitations)

  • Skin based thresholding can be susceptible to changes in lighting, causing erroneous segmentations

  • Small fluctuations are observed in the centre of mass of the stationary foot while the other foot is moving, as dancers try to balance themselves

  • The contour of the foot is deformed when the foot deviates too far from the ground, resulting in slightly inaccurate centre-of-mass coordinates

52 of 54

Future Experiments

  1. Make the pipeline more robust by incorporating optical flow information, which is more robust than skin based thresholding.

  2. Find methods to annotate the foot in every frame, then use supervised deep learning methods such as semantic segmentation to make the pipeline more robust.

  3. Classify the movement of a variation as one of the many Ladi variations.

  4. Use annotated audio data to improve the performance of the model.

53 of 54

References

  • Wang H, Kläser A, Schmid C, Liu CL. Action recognition by dense trajectories. In CVPR 2011 (pp. 3169-3176). IEEE.
  • Rosten E, Drummond T. Machine learning for high-speed corner detection. In European Conference on Computer Vision 2006 (pp. 430-443). Springer, Berlin, Heidelberg.
  • Fleet D, Weiss Y. Optical flow estimation. In Handbook of Mathematical Models in Computer Vision 2006 (pp. 237-257). Springer, Boston, MA.
  • Calonder M, Lepetit V, Strecha C, Fua P. BRIEF: Binary robust independent elementary features. In European Conference on Computer Vision 2010 (pp. 778-792). Springer, Berlin, Heidelberg.
  • He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016 (pp. 770-778).

54 of 54

Thank you