
Dataset

  • 8 professional singers, 9 ragas
  • Recorded by 3 synchronized cameras
  • 24 fps, 1920 × 1080 pixels
  • 143 recordings, 7 hours 10 mins in total
  • Average recording length ~3 minutes

Conclusion

  • Multiple cameras lend themselves to 3D reconstruction, and even a single camera can be used for 3D inference
  • Fusion of predictions from multiple views is preferable for both 2D and 3D

Problem Statement

  • Gestures are an integral part of Hindustani vocal performances; however, gestures are idiosyncratic and are neither taught nor rehearsed
  • We can use Human Pose Estimation (HPE) techniques to understand how gestures correspond with the audio

HPE: 2D and 3D Estimation

Multi-view Processing of Audio-Visual Recordings

References

[1] S. Nadkarni, S. Roychowdhury, P. Rao, and M. Clayton, "Exploring the correspondence of melodic contour with gesture in raga alap singing," in Proceedings of the 24th International Society for Music Information Retrieval Conference (ISMIR), Milan, Italy, 2023.

Human Pose Estimation for Expressive Movement Descriptors in Vocal Musical Performances

Sujoy Roychowdhury, Preeti Rao, Sharat Chandran

Indian Institute of Technology Bombay, India

HPE vs. Sensor Based Systems

Results

Correspondence between HPE Models

Confidence Thresholds

Research Questions

    • Does the choice of HPE algorithm matter? How should we choose parameters such as confidence-score thresholds for these algorithms?
    • What is the impact of the choice of 3D reconstruction algorithm?
    • Is using three camera views better than using a single camera view?
    • What is the best way to combine multiple camera views?

F1-scores (%) for stable note detection and accuracy (%) for gesture-based singer identification

Fusion-based results

  • For MediaPipe, we drop the z-coordinate for 2D experiments; we call this MP2 (see the sketch below)
  • 3D reconstruction can be done with any view as the reference view
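As a concrete illustration of the MP2 convention, a minimal sketch of dropping the z-coordinate; the array layout (x, y, z, visibility) and the function name are illustrative assumptions, not from the poster:

```python
import numpy as np

def to_mp2(landmarks_xyzv: np.ndarray) -> np.ndarray:
    """Keep only (x, y) from MediaPipe landmarks for the 2D (MP2) experiments.

    landmarks_xyzv: shape (num_frames, num_keypoints, 4), storing
                    (x, y, z, visibility) per keypoint (assumed layout).
    Returns: shape (num_frames, num_keypoints, 2) with (x, y) only.
    """
    return landmarks_xyzv[..., :2]
```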

  • HPE models are AI models trained on large datasets to identify keypoints of interest in videos, without any sensors attached to the performers (see the sketch below)
  • Their accuracy depends on how well the models' training data reflect the video setting
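As an example of sensor-free keypoint extraction, a minimal sketch using MediaPipe's Pose solution (one of the HPE models used here); the video filename is hypothetical:

```python
import cv2
import mediapipe as mp

# Run MediaPipe Pose on one recording and collect wrist keypoints per frame.
mp_pose = mp.solutions.pose

cap = cv2.VideoCapture("front_view.mp4")  # hypothetical filename
wrists = []  # per frame: ((x, y) left wrist, (x, y) right wrist) or None
with mp_pose.Pose(static_image_mode=False) as pose:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV decodes frames as BGR.
        results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.pose_landmarks is None:
            wrists.append(None)  # no person detected in this frame
            continue
        lm = results.pose_landmarks.landmark
        lw = lm[mp_pose.PoseLandmark.LEFT_WRIST]
        rw = lm[mp_pose.PoseLandmark.RIGHT_WRIST]
        wrists.append(((lw.x, lw.y), (rw.x, rw.y)))
cap.release()
```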

[Figure panels: Left View | Front View | Right View]

  • In each model-view combination, no more than 10% of frames have a confidence below the threshold for the wrists and elbows (checked as sketched below)

[Figure panels: Left View | Front View | Right View]
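A minimal sketch of the check described in the bullet above; the 0.5 threshold value and the `per_joint_confidences` variable are illustrative assumptions:

```python
import numpy as np

def low_confidence_fraction(conf: np.ndarray, threshold: float) -> float:
    """Fraction of frames whose confidence score falls below `threshold`.

    conf: shape (num_frames,), per-frame confidence for one keypoint
          (e.g., a wrist) in one model-view combination.
    """
    return float(np.mean(conf < threshold))

# Keep a model-view combination if at most 10% of frames fall below the
# threshold for every wrist and elbow keypoint (threshold value assumed):
# usable = all(low_confidence_fraction(c, 0.5) <= 0.10 for c in per_joint_confidences)
```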


Classification using HPE Models and Multi-view Fusion

[Diagram: Multi-view fusion of classifiers. For each camera view: Feature Extraction → 2D/3D Kinematic features → Classifier → predicted probability; a Fusion Model combines the per-view probabilities into the final classification label]
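The poster does not specify the fusion rule; one simple form of decision fusion is averaging the per-view predicted probabilities, sketched below (the Fusion Model in the pipeline may instead be learned):

```python
import numpy as np

def decision_fusion(view_probs: list[np.ndarray]) -> np.ndarray:
    """Fuse per-view classifier outputs by averaging predicted probabilities.

    view_probs: one (num_samples, num_classes) probability matrix per
                camera view (left, front, right), each from its own classifier.
    Returns the fused class prediction per sample.
    """
    fused = np.mean(np.stack(view_probs, axis=0), axis=0)  # average over views
    return fused.argmax(axis=1)

# e.g., labels = decision_fusion([probs_left, probs_front, probs_right])
# (variable names are illustrative)
```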

Stable Note from Gesture [1]

Singer ID from Gesture

Observations:

  • Fusion models perform consistently better than individual-view models
  • 2D fusion is similar to or better than 3D reconstruction (83.9 vs. 83.5 for stable note detection; 93.0 vs. 83.3 for singer ID)
  • Decision fusion of 3D information is similar to or better than decision fusion of 2D information (86.6 vs. 83.9 for stable note detection; 93.6 vs. 93.0 for singer ID)

Dataset Link

Euclidean distance between keypoints
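The "Correspondence between HPE Models" panel reports Euclidean distance between keypoints; a minimal sketch of how such a metric could be computed, with array shapes and a shared coordinate system as illustrative assumptions:

```python
import numpy as np

def mean_keypoint_distance(kp_a: np.ndarray, kp_b: np.ndarray) -> np.ndarray:
    """Per-keypoint mean Euclidean distance between two HPE models' outputs.

    kp_a, kp_b: shape (num_frames, num_keypoints, 2), (x, y) coordinates
                from two different models run on the same video (assumed
                to share a coordinate system).
    Returns: shape (num_keypoints,), the mean distance per keypoint.
    """
    return np.linalg.norm(kp_a - kp_b, axis=-1).mean(axis=0)
```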
