1 of 16

Tutorial on Multi-modal Learning

Contact: {sindhu.hegde, aditya.ag, bipasha.sen, radrabha.m}@research.iiit.ac.in

A deep-dive into the Speaker Separation problem

Sindhu B Hegde

Aditya Agarwal

Bipasha Sen

Rudrabha Mukhopadhyay

IIIT Hyderabad

Seshadri Mazumder

2 of 16

Motivation: Isolating & Enhancing the Target Speaker


  • Multi-modal learning: Engaging multiple streams/modalities to perform a desired task.

  • In a cocktail-party-like environment, separating a single speaker from the other speakers is an extremely important task.
    • Example: understanding the target speaker’s speech in news debates, as shown below.
  • In such challenging situations, using additional information from the visual modality along with the audio stream proves to be beneficial.

3 of 16

Speaker Separation: Potential Applications

  1. Debate denoising - let one person speak at a time!
  2. Automatic transcriptions with multiple speakers (such as in meetings).
  3. Controlled hearing aids - enhance the speech of the target speaker in noisy environments.
  4. Blind speech separation.



4 of 16

Audio-Visual Speaker Separation: Overview


Fig.: Overview - the mixed speech input (e.g., “… I don’t feel strong ..” overlapped with “… keep trying to fund raise ...”) and the visual stream input of the target speaker are converted into audio features and visual features, fused by the audio-visual network, and the isolated target speaker’s speech (“… keep trying to fund raise ...”) is produced.
5 of 16

Why do we need Visual Stream?

  • The task of separating the speech could, in principle, be done using the audio modality alone.
    • However, it is very hard to accomplish this using solely the audio modality.
    • Audio alone falls short in bringing in all the required information.
    • Permutation problem: there is no easy way to associate each separated audio source with its corresponding speaker in the video (example - “play this particular speaker” or “play the lady’s voice”).

  • Visual stream along with the auditory input has proven to be extremely beneficial.
    • Visual stream allows us to “focus” the audio on the desired target speakers.
    • It also improves the overall speaker separation performance.


6 of 16

Audio-Visual Network: Architecture Overview


Fig.: Architecture overview - the mixed speech is converted with STFT into a Tx514 representation and processed by a Speech Encoder (1D conv block) into Tx600 audio features. The visual stream ((T/4)x3x96x96) is processed by a Visual Encoder (3D conv block) into (T/4)x512x1x1 features and up-sampled 4x to Tx512. The two streams are concatenated (Tx1112) and fed to a Speech Decoder (1D conv block), which predicts a residual mask (Tx514) that is added to the encoder input (Tx514); the result is converted back to a waveform with ISTFT.

7 of 16

Audio-Visual Network: Detailed Architecture


Fig.: Detailed architecture - STFT of the mixed speech gives the mixed magnitude and the mixed phase. The Magnitude Sub-network (visual encoder, mag encoder, mag decoder) takes the visual input and the mixed magnitude and predicts a magnitude mask, which is added to the mixed magnitude to give the predicted magnitude. The Phase Sub-network (phase network) takes the predicted magnitude, the visual features and the mixed phase and predicts a phase mask, which is added to the mixed phase to give the predicted phase. ISTFT of the predicted magnitude and phase yields the target speech.

8 of 16

Audio-Visual Network: Representations

  • Audio-Visual network: takes both the visual stream and the mixed auditory stream as input and generates the isolated speech for the target speaker.

  • Audio representation:
    • Extract a linear spectrogram using the short-time Fourier transform (STFT) from a 1-second segment of the mixed speech input.
    • Decompose the complex time-frequency representation (T x 257) into magnitude and phase components, and normalize them to the range [0, 1].
    • The magnitude and phase components, each of dimension (T x 257), act as inputs to the respective magnitude and phase encoder networks.

  • Visual representation:
    • The corresponding 1 second of video frames (25 frames) is extracted.
    • The resized frames (96x96x3) act as input to the visual encoder (a preprocessing sketch is shown below).
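A minimal preprocessing sketch of these representations, assuming Python with librosa and OpenCV, is given below. The specific STFT parameters (16 kHz audio, a 512-point FFT giving 257 frequency bins, and a 10 ms hop giving roughly T = 100 frames per second) and the exact normalization are assumptions chosen to match the shapes on the slides, not taken from the repository.

import numpy as np
import librosa
import cv2

SR = 16000     # assumed sampling rate
N_FFT = 512    # 512-point FFT -> 257 frequency bins
HOP = 160      # 10 ms hop -> ~100 STFT frames per 1-second segment (T = 100)

def audio_representation(mixed_wav_1s):
    """1-second mixed waveform -> normalized magnitude and phase, each of shape (T, 257)."""
    spec = librosa.stft(mixed_wav_1s, n_fft=N_FFT, hop_length=HOP)   # complex spectrogram, (257, T)
    mag, phase = np.abs(spec), np.angle(spec)                        # decompose into magnitude / phase
    mag = mag / (mag.max() + 1e-8)                                   # scale magnitude to [0, 1]
    phase = (phase + np.pi) / (2 * np.pi)                            # scale phase from [-pi, pi] to [0, 1]
    return mag.T, phase.T

def visual_representation(frames_1s):
    """25 video frames (1 second at 25 fps) -> (25, 96, 96, 3) float array for the visual encoder."""
    resized = [cv2.resize(frame, (96, 96)) for frame in frames_1s]
    return np.stack(resized).astype(np.float32) / 255.0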


9 of 16

Audio-Visual Network: Training details

  • Magnitude Sub-network (a simplified sketch follows this list):

    • Visual Encoder:
      • Processes the input images using a stack of residual 2D-convolution blocks and generates a visual embedding for each frame (T’x512), where T’ = 25 frames.
      • The output of the visual encoder module is up-sampled 4× to match the spectrogram temporal dimension (Tx512), where T = 100.

    • Mag Encoder:
      • Processes the input mixed magnitude representation (T x 257) using a stack of 1D-convolution blocks with residual connections.
      • Convolutions are performed along the temporal dimension, treating the frequency components of the input spectrograms as channels (Tx600).

    • Mag Decoder:
      • Concatenates the learned features of the two streams along the channel dimension (Tx1112).
      • Processes the fused representation using a stack of residual 1D-convolution blocks.
      • Output: a magnitude mask (Tx257) that is added to the input magnitude, followed by a sigmoid activation, to generate the enhanced magnitude spectrogram output (Tx257).
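A simplified PyTorch sketch of the magnitude sub-network follows. Only the tensor shapes (Tx600 audio features, Tx512 up-sampled visual features, Tx1112 fused features, Tx257 mask) follow the slides; the number of blocks, kernel sizes, normalization and the internal layout of the residual 1D-convolution blocks are illustrative assumptions, and the visual encoder output is assumed to be given.

import torch
import torch.nn as nn

class Conv1dBlock(nn.Module):
    """1D-convolution block with a residual connection (illustrative layout)."""
    def __init__(self, channels, kernel_size=5):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
        )

    def forward(self, x):
        return x + self.conv(x)

class MagnitudeSubnet(nn.Module):
    def __init__(self, n_freq=257, a_dim=600, v_dim=512, n_blocks=3):
        super().__init__()
        # Mag encoder: frequency bins are treated as channels of a 1D signal over time.
        self.mag_in = nn.Conv1d(n_freq, a_dim, kernel_size=5, padding=2)
        self.mag_enc = nn.Sequential(*[Conv1dBlock(a_dim) for _ in range(n_blocks)])
        # Mag decoder: operates on the concatenated audio-visual features (600 + 512 = 1112 channels).
        self.dec_in = nn.Conv1d(a_dim + v_dim, a_dim + v_dim, kernel_size=5, padding=2)
        self.mag_dec = nn.Sequential(*[Conv1dBlock(a_dim + v_dim) for _ in range(n_blocks)])
        self.mask_out = nn.Conv1d(a_dim + v_dim, n_freq, kernel_size=1)

    def forward(self, mixed_mag, visual_emb):
        # mixed_mag: (B, T, 257); visual_emb: (B, T/4, 512) from the visual encoder
        a = self.mag_enc(self.mag_in(mixed_mag.transpose(1, 2)))      # (B, 600, T)
        v = visual_emb.repeat_interleave(4, dim=1).transpose(1, 2)    # 4x temporal upsample -> (B, 512, T)
        fused = torch.cat([a, v], dim=1)                              # (B, 1112, T)
        mask = self.mask_out(self.mag_dec(self.dec_in(fused)))        # residual magnitude mask, (B, 257, T)
        pred_mag = torch.sigmoid(mixed_mag + mask.transpose(1, 2))    # add mask, then sigmoid
        return pred_mag                                               # (B, T, 257)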


10 of 16

Audio-Visual Network: Training details


  • Phase Sub-network:

    • Concatenate the predicted magnitude (Tx257), the visual embeddings (Tx512) and the input mixed phase (Tx257) representations along the channel dimension (Tx1026).
    • The phase network processes the fused representation using a stack of residual 1D-convolution layers.
    • Output: a residual phase mask (Tx257) that is added to the input phase, followed by a sigmoid activation, to generate the enhanced phase spectrogram output (Tx257).

  • The enhanced speech output is obtained by computing the inverse STFT (ISTFT) from the magnitude and phase predictions.

  • Losses (a sketch is shown below):
    • Magnitude prediction: L1 loss
    • Phase prediction: Cosine similarity
    • Total loss = Mag loss + Phase loss
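A hedged sketch of the loss terms and the ISTFT reconstruction is given below. The equal weighting of the two losses follows the slide; the reduction used for the cosine-similarity term, the STFT parameters, and the mapping of the normalized phase back to [-pi, pi] before ISTFT are assumptions.

import math
import torch
import torch.nn.functional as F

def separation_loss(pred_mag, gt_mag, pred_phase, gt_phase):
    """Total loss = L1 on the magnitude + cosine-similarity loss on the phase."""
    mag_loss = F.l1_loss(pred_mag, gt_mag)
    # Flatten each (T x 257) phase map and penalize low cosine similarity with the ground truth.
    phase_loss = 1.0 - F.cosine_similarity(pred_phase.flatten(1), gt_phase.flatten(1), dim=1).mean()
    return mag_loss + phase_loss

def reconstruct_waveform(pred_mag, pred_phase, n_fft=512, hop=160):
    """ISTFT of the predicted magnitude and phase back to a waveform."""
    angle = pred_phase * (2 * math.pi) - math.pi     # undo the assumed [0, 1] phase normalization
    spec = torch.polar(pred_mag, angle)              # complex spectrogram, (B, T, 257)
    return torch.istft(spec.transpose(1, 2), n_fft=n_fft, hop_length=hop)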

11 of 16

Dataset and Experimental setup

  • VoxCeleb2 dataset:
    • A large-scale talking-face video dataset containing celebrity videos.
    • Contains over 1 million utterances from 6,112 celebrities.
    • A challenging dataset that spans a wide variety of identities, languages, and face poses.


Fig.: Dataset samples.

             Train      Test
# speakers   5,994      118
# videos     145,569    4,911

Table.: Statistics of the VoxCeleb2 dataset.
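As a point of reference for the experimental setup, the sketch below illustrates one common way speaker-separation training pairs are formed from VoxCeleb2: summing 1-second waveforms of two different speakers, with the clean target waveform kept as ground truth. This mixing strategy is an assumption for illustration and is not specified on the slides; see the repository for the actual data pipeline.

import numpy as np
import librosa

SR = 16000  # assumed sampling rate

def make_training_sample(target_path, interferer_path, offset_s=0.0):
    """Mix 1 s of speech from two different speakers; the clean target waveform is the ground truth."""
    target, _ = librosa.load(target_path, sr=SR, offset=offset_s, duration=1.0)
    other, _ = librosa.load(interferer_path, sr=SR, offset=offset_s, duration=1.0)
    n = min(len(target), len(other))
    mixed = target[:n] + other[:n]                   # simple additive mix (assumption)
    mixed = mixed / (np.abs(mixed).max() + 1e-8)     # rescale to avoid clipping
    return mixed, target[:n]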

12 of 16

Qualitative Results


13 of 16

Qualitative Results


14 of 16

Qualitative Results


15 of 16

Q&A Break


Time for interaction

16 of 16

Time for Code Walk-through!


  • Repository: https://github.com/Sindhu-Hegde/speaker-separation
    • Clone and star the repo 😄

  • The repo has the complete training and testing code, along with a pre-trained model, for the task of speaker separation.
    • A demo inference file (Colab notebook) is also provided.

Related works:

  1. The Conversation: Deep Audio-Visual Speech Enhancement. Triantafyllos Afouras, Joon Son Chung, and Andrew Zisserman. In Interspeech 2018.
  2. Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation. Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T. Freeman, and Michael Rubinstein. In ACM Transactions on Graphics (TOG) 2018.