
Representation Selective Self-distillation

and wav2vec 2.0 Feature Exploration

for Spoof-aware Speaker Verification

Jin Woo Lee1, Eungbeom Kim2, Junghyun Koo1, Kyogu Lee1,2,3

1 Department of Intelligence and Information, Seoul National University
2 Interdisciplinary Program in Artificial Intelligence, Seoul National University
3 AI Institute, Seoul National University

jinwlee@snu.ac.kr · eb.kim@snu.ac.kr · dg22302@snu.ac.kr · kglee@snu.ac.kr


Outline

  • Motivation
  • Contributions
  • Representation Learning for CM
    • Front-end feature exploration
    • Back-end model selection
  • RSSD: Training Strategy for SASV
    • Representation-Selective Self-Distillation
    • Evaluation using the proposed front-/back-end combinations
  • Conclusions

Representation Selective Self-distillation and wav2vec 2.0 Feature Exploration for Spoof-aware Speaker Verification

-

Jin Woo Lee et al.

MARG@SNU


Motivation

  • Spoofing-Aware Speaker Verification (SASV) Challenge
    • Automatic Speaker Verification (ASV) systems should reject both:
      • Voice uttered by a different speaker (ASV non-target)
      • Spoofed (synthesized or converted) utterances (CM non-target)
    • How to integrate two distinct systems (ASV + CM) [1]:
      • Ensemble solutions based on separate ASV and CM systems
      • Integrated single-system solutions
    • ASVspoof 2019 Logical Access (LA) database [2]
      • Collection of labeled speech data (target, non-target, spoofed)

[1] Jung, J. W., Tak, H., Shim, H. J., Heo, H. S., Lee, B. J., Chung, S. W., ... & Kinnunen, T. (2022). SASV challenge 2022: A spoofing aware speaker verification challenge evaluation plan. arXiv preprint arXiv:2201.10283.

[2] Wang, X., Yamagishi, J., Todisco, M., Delgado, H., Nautsch, A., Evans, N., ... & Ling, Z. H. (2020). ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech. Computer Speech & Language, 64, 101114.


Motivation

  • ASV systems should additionally reject spoofed utterances
    • Text-to-speech and voice conversion systems are constantly improving
    • The difference between the CM target (bona fide) and non-target (spoofed) is becoming harder to distinguish
  • We should improve not only the integrated SASV system but also the CM itself
    • Few anti-spoofing studies have analyzed which attributes matter for detecting spoofing artifacts
    • What constitutes an effective representation for finding artifacts remains an open question


Contributions

  • Representation learning for spoofing CM
    • Effective feature extractor (front-end) for the CM
      • Explore Transformer layer outputs of the wav2vec 2.0 model
      • Analyze which feature spaces effectively distinguish spoofed utterances
    • Effective CM encoder (back-end) for the feature extractor
      • Study encoder structures that effectively distinguish spoofed utterances
      • Compare EERs for each combination of feature extractor and encoder
  • Representation-Selective Self-Distillation (RSSD)
    • Inspired by representation distillation [1] and self-distillation [2]

[1] Tian, Y., Krishnan, D., & Isola, P. (2019, September). Contrastive Representation Distillation. In International Conference on Learning Representations.

[2] Zhang, L., Song, J., Gao, A., Chen, J., Bao, C., & Ma, K. (2019). Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3713-3722).


Contributions

  • ASV Network: ECAPA-TDNN [1]
  • CM Network: wav2vec 2.0 [2,3] + attentive statistics pooling [1]
  • SASV Network: Representation-Selective Self-Distillation (RSSD)

[1] Desplanques, B., Thienpondt, J., & Demuynck, K. (2020). ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification. arXiv preprint arXiv:2005.07143.

[2] Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33, 12449-12460.

[3] Conneau, A., Baevski, A., Collobert, R., Mohamed, A., & Auli, M. (2020). Unsupervised cross-lingual representation learning for speech recognition. arXiv preprint arXiv:2006.13979.



Representation Learning for CM

  • Experimental setup: front-end

[Diagram: the CM network (front-end under exploration) alongside the ASV network]


Representation Learning for CM

  • XLSR-53 Feature Exploration

[Diagram: raw waveform → CNN → latent speech representations → Transformer layers 1–24 → intermediate wav2vec 2.0 representations]


Representation Learning for CM

  • Experimental setup: back-end

[Diagram: the CM network (back-end under exploration) alongside the ASV network]


Representation Learning for CM

  • Spoofing CM using ASP

[Diagram: intermediate wav2vec 2.0 representations → Conv1x1 → ReLU, BN → Attentive Statistics Pooling (Tanh → Conv1x1 → Softmax attention over frames; attention-weighted mean and std) → Linear → CM embedding → Linear → CM logits]
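The ASP stage above can be sketched in NumPy. This is an illustrative stand-in, not the paper's implementation: the Conv1x1/BN pre-processing is omitted and the attention parameters `W`, `b`, `v` are hypothetical stand-ins for the Tanh → Conv1x1 → Softmax attention path.

```python
import numpy as np

def attentive_stats_pooling(H, W, b, v):
    """Attentive statistics pooling over frame-level features.

    H : (T, D) frame features (e.g. one wav2vec 2.0 layer output)
    W : (D, A), b : (A,), v : (A,) -- attention parameters
    Returns a (2*D,) utterance vector [weighted mean ; weighted std].
    """
    # per-frame attention score: e_t = v^T tanh(W^T h_t + b)
    e = np.tanh(H @ W + b) @ v               # (T,)
    a = np.exp(e - e.max())
    a = (a / a.sum())[:, None]               # softmax over time, (T, 1)
    mu = (a * H).sum(axis=0)                 # attention-weighted mean
    var = (a * H * H).sum(axis=0) - mu ** 2  # attention-weighted variance
    std = np.sqrt(np.clip(var, 1e-9, None))
    return np.concatenate([mu, std])

rng = np.random.default_rng(0)
T, D, A = 50, 8, 4
H = rng.standard_normal((T, D))
emb = attentive_stats_pooling(H, rng.standard_normal((D, A)),
                              np.zeros(A), rng.standard_normal(A))
print(emb.shape)  # (16,)
```

With zero attention parameters the weights become uniform, so the pooled mean reduces to the plain frame average; the attention path lets the CM emphasize frames that carry artifacts.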


Representation Learning for CM

  • CM EER Comparison

| System | Front-end | Self-supervised | Intermediate Feature | CM EER (%) |
|---|---|---|---|---|
| RawGAT-ST [4] | SincNet | ✗ | – | 1.06 |
| AASIST-L [3] | SincNet | ✗ | – | 0.99 |
| AASIST [3] | SincNet | ✗ | – | 0.83 |
| LGF [5] | XLSR-53 | ✓ | ✗ | 1.28 |
| LLGF [5] | W2V-Large | ✓ | ✗ | 0.86 |
| Ours (MLP) | XLSR-53 | ✓ | ✓ | 0.80 |
| Ours (AASIST) | XLSR-53 | ✓ | ✓ | 0.40 |
| Ours (ASP) | XLSR-53 | ✓ | ✓ | 0.31 |


Representation Learning for CM

CM EER (%) for each front-end feature (columns) and back-end encoder (rows):

| Back-end Encoder | sinc | XLSR 1st | XLSR 3rd | XLSR 5th | XLSR 7th | XLSR 9th | XLSR 13th | XLSR 17th |
|---|---|---|---|---|---|---|---|---|
| MLP | – | 2.2 | 2.1 | 0.8 | 1.4 | 1.0 | 1.7 | 2.2 |
| AASIST | 0.8 | 2.9 | 0.8 | 0.4 | 0.5 | 0.5 | 0.7 | 0.9 |
| ASP | 1.0 | 1.0 | 0.7 | 0.3 | 0.5 | 0.4 | 0.5 | 1.0 |

  • Using ASP as the back-end model outperforms the others for most front-ends
  • Using the 5th-layer output as the front-end outperforms the others for most back-ends
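All results in these tables are equal error rates. As a reference for how such numbers are obtained, here is a minimal EER sketch; the score convention (higher = more likely bona fide) is an assumption, and a simple threshold sweep stands in for the usual ROC-based computation.

```python
import numpy as np

def compute_eer(scores, labels):
    """EER: the operating point where the false-acceptance rate (spoofed
    trials accepted) equals the false-rejection rate (bona fide rejected).

    scores : detection scores, higher = more likely bona fide
    labels : 1 for bona fide / target, 0 for spoofed / non-target
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):              # sweep candidate thresholds
        far = np.mean(scores[labels == 0] >= t)
        frr = np.mean(scores[labels == 1] < t)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# perfectly separated scores give 0% EER
print(compute_eer([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 0.0
```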


Representation Learning for CM

  • Spoofing countermeasure feature exploration
    • XLSR-53 nth layer + ASP

[Figure: per-layer feature-exploration results, with TTS and VC attack annotations]


Representation Learning for CM

  • t-SNE plot of evaluated CM embeddings

[Figure: t-SNE of CM embeddings, clustered into bona fide, TTS + VC from TTS (A07–A16), and VC from human voice (A17, A18, A19)]


Representation Learning for CM

  • Spoofing countermeasure EER breakdown

| Attack | Type | Attack Algorithm Input | Waveform Generator | CM EER (%) |
|---|---|---|---|---|
| A07 | TTS | Text | WORLD | 0.12 |
| A08 | TTS | Text | NSF | 0.12 |
| A09 | TTS | Text | Vocaine | 0.00 |
| A10 | TTS | Text | WaveRNN | 0.89 |
| A11 | TTS | Text | Griffin-Lim | 0.10 |
| A12 | TTS | Text | WaveNet | 0.04 |
| A13 | VC | Synthetic speech (public TTS) | Waveform filtering | 0.00 |
| A14 | VC | Synthetic speech (commercial TTS) | STRAIGHT | 0.31 |
| A15 | VC | Synthetic speech (commercial TTS) | WaveNet | 0.43 |
| A16 | TTS | Text | Waveform concat. | 0.12 |
| A17 | VC | Genuine speech (human) | Waveform filtering | 0.29 |
| A18 | VC | Genuine speech (human) | MFCC vocoder | 0.24 |
| A19 | VC | Genuine speech (human) | Spectral filtering | 0.26 |
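The per-attack numbers above come from pooling each attack's spoofed trials against the bona fide trials and computing one EER per attack. A sketch of that bookkeeping, with made-up scores purely to show the grouping (the `eer` helper is a minimal threshold sweep, not the official scoring tool):

```python
import numpy as np

def eer(scores, labels):
    """Minimal EER; labels: 1 = bona fide, 0 = spoofed."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    best, out = np.inf, 1.0
    for t in np.unique(scores):
        far = np.mean(scores[labels == 0] >= t)   # spoofed accepted
        frr = np.mean(scores[labels == 1] < t)    # bona fide rejected
        if abs(far - frr) < best:
            best, out = abs(far - frr), (far + frr) / 2
    return out

# synthetic trials: (score, attack id); "-" marks bona fide trials
trials = [(0.9, "-"), (0.8, "-"), (0.7, "-"),
          (0.1, "A10"), (0.85, "A10"),   # one A10 trial scores high
          (0.1, "A09"), (0.2, "A09")]

bona = [s for s, a in trials if a == "-"]
for attack in ["A09", "A10"]:
    # each attack is evaluated against the SAME bona fide pool
    spoof = [s for s, a in trials if a == attack]
    scores = bona + spoof
    labels = [1] * len(bona) + [0] * len(spoof)
    print(attack, eer(scores, labels))
```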


RSSD: Training Strategy for SASV

  • Representation-Selective Self-Distillation (RSSD)

|  | Bona fide | Spoofed |
|---|---|---|
| Speaker matched | Target | Non-target |
| Speaker mismatched | Non-target | Non-target |

The ASV branch decides the speaker-match axis; the CM branch decides the bona fide/spoofed axis.


RSSD: Training Strategy for SASV

  • Representation-Selective Self-Distillation (RSSD)
    • Spoofed input → distort the test speaker embedding


RSSD: Training Strategy for SASV

  • Representation-Selective Self-Distillation (RSSD)
    • Bona fide input → keep the test speaker embedding
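This keep-or-distort rule can be written as a toy selective-distillation loss: for bona fide input, distill the test speaker embedding toward a frozen ASV teacher embedding; for spoofed input, push it away. The cosine-based form below is an illustrative guess at the idea, not the paper's exact objective, and `rssd_loss` is a hypothetical helper.

```python
import numpy as np

def rssd_loss(student_emb, teacher_emb, is_spoof):
    """Toy representation-selective distillation loss.

    student_emb : (D,) test speaker embedding from the trainable branch
    teacher_emb : (D,) frozen ASV embedding of the same utterance
    is_spoof    : CM decision; True -> distort, False -> keep
    """
    cos = float(student_emb @ teacher_emb /
                (np.linalg.norm(student_emb) * np.linalg.norm(teacher_emb)
                 + 1e-9))
    # bona fide: minimizing pulls the student TOWARD the teacher (keep)
    # spoofed:   minimizing pushes the student AWAY from it (distort)
    return 1.0 + cos if is_spoof else 1.0 - cos

v = np.array([1.0, 0.0])
print(rssd_loss(v, v, is_spoof=False))   # ~0.0: aligned bona fide is kept
print(rssd_loss(v, v, is_spoof=True))    # ~2.0: aligned spoof is penalized
```

Minimizing this loss over a batch makes the spoofed embeddings drift away from the speaker space, so a spoofed trial fails speaker verification even when the voice matches the enrolled speaker.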


RSSD: Training Strategy for SASV

  • Training with RSSD

[Diagram: RSSD training, minimizing the distillation loss between the selected representations]


RSSD: Training Strategy for SASV

  • Representation-Selective Self-Distillation (RSSD)

[Diagram: the full SASV system combining the CM network and the ASV network]


RSSD: Training Strategy for SASV

  • Evaluation using the proposed front-/back-end combinations

| SASV System | CM FE | CM BE | SV EER | SPF EER | SASV EER |
|---|---|---|---|---|---|
| Baseline 1 | SincNet | AASIST | 35.32 | 0.67 | 19.31 |
| Baseline 2 | SincNet | AASIST | 11.48 | 0.78 | 6.37 |
| RSSD | SincNet | AASIST | 1.41 | 0.76 | 1.15 |
| RSSD | XLSR-53 | AASIST | 1.34 | 0.60 | 1.11 |
| RSSD | XLSR-53 | ASP | 1.32 | 0.59 | 1.08 |

(FE: front-end, BE: back-end; all EERs in %.)


Conclusion

  • Investigated which XLSR-53 layer yields features most advantageous for defending against spoofing attacks
    • The output of the 5th Transformer layer shows the lowest CM EER
  • A CM encoder with a simple attentive statistics pooling layer outperformed the AASIST and MLP back-ends
  • Proposed RSSD for SASV
    • Selective training with representation self-distillation helps utilize the representations learned for each of the ASV and CM tasks
    • The proposed CM model benefits both the CM and SASV tasks
