
Dataset

8 professional singers, 9 ragas
Recordings by 3 synchronized cameras
24 fps - 1920 x 1080 pixels
143 recordings, 7 hours 10 mins
Average recording time ~ 3 minutes

Conclusion

Multiple cameras lend themselves to 3D reconstruction – and even a single camera can be used for 3D inference
Fusion of predictions from multiple views is preferable for both 2D and 3D

Gestures are an integral part of Hindustani vocal performances. However, gestures are idiosyncratic and are neither taught nor rehearsed
We can use Human Pose Estimation (HPE) techniques to understand gesture correspondence with audio

Problem Statement

HPE: 2D and 3D Estimation

Multi-view Processing of Audio-Visual Recording

[1] S. Nadkarni, S. Roychowdhury, P. Rao, and M. Clayton, “Exploring the correspondence of melodic contour with gesture in raga alap singing,” in Proceedings of the 24th International Society for Music Information Retrieval Conference, ISMIR, Milan, Italy, 2023.

References

Human Pose Estimation for Expressive Movement Descriptors in Vocal Musical Performances

Sujoy Roychowdhury, Preeti Rao, Sharat Chandran

Indian Institute of Technology Bombay, India

HPE vs. Sensor Based Systems

Results

Correspondence between HPE Models

Confidence Thresholds

Research Questions

Does the choice of HPE algorithm matter? How should we choose parameters such as the confidence-score thresholds for these algorithms?
What is the impact of the choice of 3D reconstruction algorithm?
Is using three camera views better than using a single camera view?
What is the best way to combine multiple camera views?

F1-scores (%) for stable note detection and Accuracy (%) for gesture-based singer identification

Fusion based results

For MediaPipe, we drop the z-coordinate in the 2D experiments. We call this variant MP2
3D reconstruction can be done with any view as the reference view
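
As an illustration of the MP2 variant described above, the following minimal sketch drops the z-coordinate from per-frame 3D keypoints. The data layout (a list of frames, each a list of `(x, y, z)` tuples), the function name `drop_z` and the sample values are assumptions made for this example; they are not taken from the poster or from the MediaPipe API.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def drop_z(frames: List[List[Point3D]]) -> List[List[Point2D]]:
    """Keep only (x, y) for every keypoint in every frame."""
    return [[(x, y) for (x, y, _z) in frame] for frame in frames]


# Two frames with two keypoints each (hypothetical values).
frames_3d = [
    [(0.10, 0.20, -0.05), (0.30, 0.40, 0.02)],
    [(0.11, 0.21, -0.04), (0.31, 0.39, 0.03)],
]
frames_mp2 = drop_z(frames_3d)
```

The output keeps the frame and keypoint ordering of the input, so downstream 2D feature extraction can treat it like the output of any other 2D model.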

HPE models are AI models trained on large datasets to identify keypoints of interest in videos without any sensors attached to the people
Their accuracy depends on how well the training data of the AI models reflects the video setting

[Figure panels: Left View, Front View, Right View]

In each model-view combination, no more than 10% of frames have a confidence below the threshold for the wrists and elbows
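
The check described above can be sketched as follows: for one keypoint in one model-view combination, compute the fraction of frames whose confidence falls below the threshold and compare it with the 10% bound. The threshold value (0.3), the function name `fraction_below` and the confidence values are hypothetical; the poster text does not state the thresholds used.

```python
from typing import Sequence


def fraction_below(confidences: Sequence[float], threshold: float) -> float:
    """Fraction of frames whose keypoint confidence is below `threshold`."""
    if not confidences:
        raise ValueError("confidences must contain at least one frame")
    low = sum(1 for c in confidences if c < threshold)
    return low / len(confidences)


# Per-frame confidence of one wrist keypoint (hypothetical values).
wrist_conf = [0.9, 0.8, 0.2, 0.95, 0.7, 0.85, 0.9, 0.6, 0.75, 0.88]

frac = fraction_below(wrist_conf, threshold=0.3)  # assumed threshold
within_bound = frac <= 0.10
```

In practice the same check would be repeated for each wrist and elbow keypoint, for every model and camera view.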

[Figure panels: Left View, Front View, Right View]

[Figure: Multi-view fusion of classifiers. Diagram labels: Feature Extraction, 2D/3D Kinematic features, Classifier, Predicted prob, Fusion Model, classification label]

Classification using HPE Models and Multi-view Fusion

Stable Note from Gesture [1]

Singer ID from Gesture

Observations:

Fusion models perform consistently better than individual view models
2D fusion is similar to or better than 3D reconstruction (83.9 vs. 83.5 for Stable Note and 93.0 vs. 83.3 for Singer ID)
Decision fusion of 3D information is similar to or better than decision fusion of 2D information (86.6 vs. 83.9 for Stable Note and 93.6 vs. 93.0 for Singer ID)
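
To make the idea of decision fusion in the observations above concrete, here is a minimal sketch that combines the class probabilities predicted by one classifier per camera view. The poster text does not specify the fusion model, so simple averaging of probabilities is used here only as a stand-in; the function names, view names and probability values are all hypothetical.

```python
from typing import Dict, List


def fuse_mean(view_probs: Dict[str, List[float]]) -> List[float]:
    """Average the per-class probabilities predicted by each view."""
    views = list(view_probs.values())
    if not views:
        raise ValueError("at least one view is required")
    n_classes = len(views[0])
    if any(len(p) != n_classes for p in views):
        raise ValueError("all views must predict the same number of classes")
    return [sum(p[k] for p in views) / len(views) for k in range(n_classes)]


def predict(view_probs: Dict[str, List[float]]) -> int:
    """Return the index of the class with the highest fused probability."""
    fused = fuse_mean(view_probs)
    return max(range(len(fused)), key=fused.__getitem__)


# Predicted probabilities for a two-class problem (hypothetical values).
probs = {
    "left": [0.6, 0.4],
    "front": [0.3, 0.7],
    "right": [0.2, 0.8],
}
fused = fuse_mean(probs)
label = predict(probs)
```

Here the left view alone would pick class 0, while the fused prediction follows the other two views. A learned fusion model would replace `fuse_mean` but consume the same per-view probabilities.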

Dataset Link

Euclidean distance between keypoints
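
One way to compute such a measure, assuming it refers to the per-frame distance between the same keypoint as estimated by two HPE models, is sketched below. The function name, the track layout (one coordinate tuple per frame) and the sample values are assumptions for illustration.

```python
import math
from typing import List, Sequence


def keypoint_distances(
    track_a: Sequence[Sequence[float]],
    track_b: Sequence[Sequence[float]],
) -> List[float]:
    """Per-frame Euclidean distance between two tracks of the same keypoint."""
    if len(track_a) != len(track_b):
        raise ValueError("tracks must have the same number of frames")
    return [math.dist(p, q) for p, q in zip(track_a, track_b)]


# The same wrist keypoint from two models over two frames (hypothetical pixels).
model_a = [(100.0, 200.0), (103.0, 204.0)]
model_b = [(100.0, 200.0), (100.0, 200.0)]

distances = keypoint_distances(model_a, model_b)
mean_distance = sum(distances) / len(distances)
```

The same function works for 3D keypoints, since `math.dist` accepts points of any matching dimension.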
