
Heterogeneous target speech separation

Efthymios Tzinis1,2,*, Gordon Wichern1, Aswin Subramanian1,

Paris Smaragdis2 and Jonathan Le Roux1

1Mitsubishi Electric Research Laboratories (MERL)

2University of Illinois at Urbana-Champaign

*Work done during an internship at MERL.

University of Illinois at Urbana-Champaign


Introduction

  • Audio source separation
    • Co-occurrence of multiple sounds
    • Extract independent sound sources
      • All sources: Unconditional source separation
      • Specify sources: Conditional / Target source separation

  • Target speech separation
    • Resolves the ambiguity of which sources to extract
    • Resolves the alignment (permutation) of the estimated sources

  • What kind of conditional targets can we use?


Heterogeneous target separation

  • Slicing an acoustic scene has multiple solutions
    • Based on the user’s intention
    • Multiple ways to describe the same target source
  • Isolate a speaker based on different semantic concepts
    • Gender
    • Distance from the microphone
      • Far/Near microphone
    • Language spoken
      • French, English, etc.
    • Energy of the speaker
      • Loudest / Least energetic


Heterogeneous training

  • Permutation invariant training (Oracle)
    • Backpropagate the minimum loss under all permutations of the estimated speakers
  • Heterogeneous
    • Generate a mixture from a set of sources
    • Sample a discriminative concept to create the target waveform
      • Could contain more than one source
    • Train the model with a targeted L1 loss
    • Example conditions and their discriminative concepts:
      • Distance from the microphone: (Far or Near)
      • Language spoken: (French, English, etc.)

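The heterogeneous target construction and targeted L1 loss above can be sketched as follows. This is a minimal illustration, not the actual training code; `make_target`, `targeted_l1_loss`, and the distance labels are hypothetical names:

```python
import numpy as np

def make_target(sources, labels, concept_value):
    """Sum the sources matching the sampled discriminative concept.

    The target may contain more than one source (e.g. both far-field
    speakers when conditioning on 'far')."""
    target = np.zeros_like(sources[0])
    non_target = np.zeros_like(sources[0])
    for src, lab in zip(sources, labels):
        if lab == concept_value:
            target += src
        else:
            non_target += src
    return target, non_target

def targeted_l1_loss(est_target, est_other, target, non_target):
    """L1 loss on both the target and the non-target estimates."""
    return np.mean(np.abs(est_target - target)) + np.mean(np.abs(est_other - non_target))

# Toy example: two 1-second sources at 8 kHz, labeled by distance.
rng = np.random.default_rng(0)
sources = [rng.standard_normal(8000), rng.standard_normal(8000)]
target, non_target = make_target(sources, ["far", "near"], "far")
```

During training, the separation network's two outputs play the roles of `est_target` and `est_other`; with oracle estimates the loss is zero.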

Introduced datasets

  • Generated three different datasets
    • Wall Street Journal (WSJ - anechoic)
      • Energy (E), gender (G)
    • Spatial LibriSpeech (SLIB - reverberant)
      • E, G, spatial location (S)
    • Spatial VoxForge (SVOX - multi-lingual and reverberant)
      • E, S, language (L)


Conditional separation network

  • Conditional sudo rm -rf
    • One-hot conditioning vector based on all semantic concepts
    • FiLM modulation in the input of all B=16 U-ConvBlocks
    • Always estimate both the target and the non-target signal
    • Low overhead conditioning mechanism


Parameters: 9.66 million -> 9.84 million

Condition          Discriminative concept values
Energy             Loudest / Quietest
Spatial location   Far / Near field
Language           English / French / German / Spanish
Gender             Female / Male

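A minimal sketch of the FiLM conditioning mechanism described above: a feature-wise scale and shift of a block's input, with the scale and shift predicted from the conditioning vector. The dimensions and the random projections standing in for learned layers are illustrative assumptions, not the exact implementation:

```python
import numpy as np

class FiLM:
    """Feature-wise Linear Modulation: scale and shift every channel of a
    block's input using parameters predicted from the condition vector."""

    def __init__(self, cond_dim, n_channels, seed=0):
        rng = np.random.default_rng(seed)
        # Stand-ins for learned linear layers mapping condition -> (gamma, beta).
        self.w_gamma = 0.1 * rng.standard_normal((cond_dim, n_channels))
        self.w_beta = 0.1 * rng.standard_normal((cond_dim, n_channels))

    def __call__(self, x, cond):
        # x: (n_channels, time) features, cond: one-hot (cond_dim,) vector.
        gamma = cond @ self.w_gamma
        beta = cond @ self.w_beta
        return gamma[:, None] * x + beta[:, None]

# One-hot over the 2 + 2 + 4 + 2 = 10 discriminative concept values above.
cond = np.zeros(10)
cond[3] = 1.0  # e.g. selecting one spatial-location value
film = FiLM(cond_dim=10, n_channels=512)
modulated = film(np.ones((512, 100)), cond)
```

Because gamma and beta are small per-channel vectors, repeating this modulation at every block adds only a modest parameter overhead.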

Training and evaluation details

  • Training
    • Sample a discriminative concept given a pre-defined prior
    • L1 norm for both “target” and “other” estimated sources
      • We train for 120 epochs
        • 20,000 8 kHz mixtures
        • Uniform [75, 100]% overlap
  • Evaluation
    • Scale-invariant signal-to-noise ratio (SI-SNR) on the target source
    • 3,000 validation mixtures
    • 5,000 test mixtures


Condition    WSJ                SLIB                          SVOX
Input-SNR    Uniform [-5, 5]    Uniform [-2.5, 2.5]           Uniform [-2.5, 2.5]
Conditions   Energy, Gender     Energy, Gender, Spatial Loc.  Energy, Language, Spatial Loc.

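The evaluation metric can be computed as below; this is a standard SI-SNR implementation, not the authors' exact evaluation script:

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB.

    Projects the estimate onto the reference and compares the projection's
    energy to the residual's, so rescaling the estimate does not change it."""
    ref = ref - ref.mean()
    est = est - est.mean()
    s_target = (est @ ref) / (ref @ ref + eps) * ref
    e_noise = est - s_target
    return 10.0 * np.log10((s_target @ s_target) / (e_noise @ e_noise + eps))

# A clean 8 kHz tone versus a slightly noisy estimate of it.
ref = np.sin(np.linspace(0.0, 100.0, 8000))
noisy = ref + 0.01 * np.random.default_rng(0).standard_normal(8000)
```

Rescaling `noisy` leaves its SI-SNR unchanged, which is exactly why the metric ignores trivial gain errors in the separated target.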

In- and cross-domain results

  • Single-conditioned models > PIT
    • Each model trained and evaluated on the corresponding condition
  • Heterogeneous training > PIT
    • For all conditions except language
    • For in-domain data
    • For cross-domain evaluation

19

U N I V E R S I T Y O F I L L I N O I S A T U R B A N A - C H A M P A I G N

20 of 32

Robustness under degenerate conditions

  • Trade-off between the percentage of:
    • Same gender conditioning
    • Cross-gender

20

U N I V E R S I T Y O F I L L I N O I S A T U R B A N A - C H A M P A I G N

21 of 32

Robustness under degenerate conditions

  • Trade-off between the percentage of:
    • Same-gender conditioning
    • Cross-gender conditioning
  • Optimal point for both gender and energy conditions
    • Using only 0.2-0.4% of same-gender mixtures
    • Also learns the degenerate case

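The trade-off above is controlled by how often the training sampler draws same-gender pairs. A toy sketch under assumed names (`sample_pair` and the speaker lists are hypothetical):

```python
import random

def sample_pair(males, females, p_same_gender, rng):
    """Draw two speakers; with probability p_same_gender both share a gender,
    making the gender condition degenerate for that mixture."""
    if rng.random() < p_same_gender:
        pool = males if rng.random() < 0.5 else females
        return rng.sample(pool, 2)
    return [rng.choice(males), rng.choice(females)]

rng = random.Random(0)
males, females = ["m1", "m2", "m3"], ["f1", "f2"]
pairs = [sample_pair(males, females, 0.3, rng) for _ in range(2000)]
# Fraction of degenerate (same-gender) mixtures, judged by the id prefix.
same_frac = sum(a[0] == b[0] for a, b in pairs) / len(pairs)
```

Sweeping `p_same_gender` from 0 to 1 traces the trade-off curve; the result above suggests a small but nonzero fraction of same-gender mixtures works best.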

Bridge conditioning ablation

  • Learn a harder discriminative concept (e.g. gender on SLIB)
    • No access to gender metadata for the SLIB speakers
    • Learn it using the energy concept as a “bridge” condition
      • Metadata possibly available for the WSJ anechoic dataset


Using a bridge semantic condition

  • Learn a hard condition using an easier one
    • Learn how to condition on a specific language using the spatial location
    • Best model for both conditions lies between the two extremes
      • The training conditioning prior is key


Conclusions & Highlights

  • Heterogeneous target source separation
    • A new paradigm in source separation
    • Slicing acoustic scenes based on diverse, non-mutually exclusive signal-characteristic conditions
      • One can also consider AND and OR combinations of conditions
  • Heterogeneous condition training
    • Improves upon oracle permutation invariant training
    • Improves cross-domain generalization
    • Robust under degenerate cases
  • In the future
    • We want to apply our method to a variable number of sources
    • Make our method require less supervision
    • Extend our method to work with natural language queries


Thank you!

Any questions?
