Tutorial on Multi-modal Learning
Contact: {sindhu.hegde, aditya.ag, bipasha.sen, radrabha.m}@research.iiit.ac.in
A Deep Dive into the Speaker Separation Problem
Sindhu B Hegde
Aditya Agarwal
Bipasha Sen
Rudrabha Mukhopadhyay
IIIT Hyderabad
Seshadri Mazumder
Motivation: Isolating & Enhancing the Target Speaker
Speaker Separation: Potential Applications
Audio-Visual Speaker Separation: Overview
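Figure: Overview of audio-visual speaker separation.
Inputs: Mixed Speech Input ("… I don’t feel strong .." and "… keep trying to fund raise ...") and Visual Stream Input.
Processing: Audio Features and Visual Features feed the Audio-Visual Network.
Output: Isolated Target Speaker ("… keep trying to fund raise ...").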
Why do we need the Visual Stream?
Audio-Visual Network: Architecture Overview
Figure: Architecture overview of the audio-visual network.
Blocks shown: STFT, Speech Encoder (1D Conv Block), Visual Encoder (3D Conv Block), 4x upsample, Concatenation, Speech Decoder (1D Conv Block), Residual mask (Tx514), addition (+), ISTFT.
Tensor shapes shown: Tx514, (T/4)x3x96x96, (T/4)x512x1x1, Tx512, Tx600, Tx1112.
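The tensor shapes on this slide can be traced with a small NumPy sketch. It is only a shape walk-through under stated assumptions, not the tutorial's model: the conv blocks are replaced by a hypothetical `linear_stub` (a random projection), the visual frames are average-pooled instead of passing through 3D convolutions, and the assignment of each shape to a block is one plausible reading of the diagram (note that 600 + 512 = 1112, which matches the concatenation).

```python
import numpy as np

rng = np.random.default_rng(0)
T = 8  # number of spectrogram frames (chosen divisible by 4 for this sketch)

# Inputs with the shapes shown on the slide.
mixed_spec = rng.standard_normal((T, 514))          # Tx514, after the STFT
frames = rng.standard_normal((T // 4, 3, 96, 96))   # (T/4)x3x96x96 video frames


def linear_stub(x, out_dim):
    """Hypothetical stand-in for a conv block: a random linear projection."""
    w = rng.standard_normal((x.shape[-1], out_dim)) / np.sqrt(x.shape[-1])
    return x @ w


# Visual encoder stand-in: pool away the spatial dims, then project to 512.
# This mirrors (T/4)x512x1x1 with the trailing 1x1 dims squeezed.
visual = linear_stub(frames.mean(axis=(2, 3)), 512)   # (T/4)x512

# 4x temporal upsample so the visual features align with the T audio frames.
visual_up = np.repeat(visual, 4, axis=0)              # Tx512

# Speech encoder stand-in.
speech = linear_stub(mixed_spec, 600)                 # Tx600

# Concatenate along the feature axis (order is an assumption).
fused = np.concatenate([speech, visual_up], axis=1)   # Tx1112

# Decoder stand-in: back to 600, then to a residual mask of the input size.
decoded = linear_stub(fused, 600)                     # Tx600 (assumed placement)
residual_mask = linear_stub(decoded, 514)             # Tx514

# The "+" node: the residual mask is added to the mixed spectrogram.
pred_spec = mixed_spec + residual_mask                # Tx514, input to the ISTFT

print(visual_up.shape, fused.shape, pred_spec.shape)
```

The 4x upsample is needed because the visual stream has one feature vector per four audio frames; repeating each vector four times is the simplest alignment and is used here only for illustration.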
Audio-Visual Network: Detailed Architecture
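Figure: Detailed architecture of the audio-visual network.
Inputs: Mixed speech (through STFT, giving Mixed mag and Mixed phase) and Visual input.
Magnitude Sub-network: Mag encoder, Visual encoder, Mag decoder, Mag mask, addition (+), Pred mag.
Phase Sub-network: Phase network, Phase mask, addition (+), Pred phase.
Output: Target speech (through ISTFT).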
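The magnitude and phase paths can be illustrated with a minimal NumPy sketch. It assumes that each predicted mask is added to its mixed counterpart (one reading of the "+" nodes in the diagram) and that the predicted magnitude and phase are recombined into a complex spectrogram before inversion. The STFT here is deliberately simplified to non-overlapping, unwindowed frames with `n_fft = 512`, and `reconstruct` is a hypothetical helper; the networks that would predict the masks are not modelled.

```python
import numpy as np

n_fft = 512  # assumed frame size; gives n_fft // 2 + 1 = 257 frequency bins
rng = np.random.default_rng(0)
mixed = rng.standard_normal(n_fft * 10)  # stand-in for a mixed speech waveform

# Simplified STFT: split into non-overlapping frames and take a real FFT.
frames = mixed.reshape(-1, n_fft)
spec = np.fft.rfft(frames, axis=1)       # complex, shape (T, 257)
mixed_mag = np.abs(spec)
mixed_phase = np.angle(spec)


def reconstruct(mag_mask, phase_mask):
    """Add the masks to the mixed magnitude/phase and invert to a waveform."""
    pred_mag = mixed_mag + mag_mask
    pred_phase = mixed_phase + phase_mask
    pred_spec = pred_mag * np.exp(1j * pred_phase)
    # Simplified ISTFT matching the framing above.
    return np.fft.irfft(pred_spec, n=n_fft, axis=1).reshape(-1)


zero = np.zeros_like(mixed_mag)

# Zero masks leave the mixture unchanged.
unchanged = reconstruct(zero, zero)

# A magnitude mask of -0.5 * mixed_mag halves the amplitude everywhere.
halved = reconstruct(-0.5 * mixed_mag, zero)

print(np.allclose(unchanged, mixed), np.allclose(halved, 0.5 * mixed))
```

A real pipeline would use overlapping windowed frames with overlap-add in the inverse transform; the split into magnitude and phase and the recombination step stay the same.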
Audio-Visual Network: Representations
Audio-Visual Network: Training details
Dataset and Experimental setup
Dataset: VoxCeleb2, a collection of celebrity videos.
It covers many identities, languages, and face poses.
Figure: Dataset samples.
|            | Train   | Test  |
| # speakers | 5,994   | 118   |
| # videos   | 145,569 | 4,911 |
Table: Statistics of the VoxCeleb2 dataset.
Qualitative Results
Q&A Break
Time for interaction
Time for Code Walk-through!
Related works: