
Representation Selective Self-distillation

and wav2vec 2.0 Feature Exploration

for Spoof-aware Speaker Verification

Jin Woo Lee1, Eungbeom Kim2, Junghyun Koo1, Kyogu Lee1,2,3

1 Department of Intelligence and Information, Seoul National University
2 Interdisciplinary Program in Artificial Intelligence, Seoul National University
3 AI Institute, Seoul National University

jinwlee@snu.ac.kr · eb.kim@snu.ac.kr · dg22302@snu.ac.kr · kglee@snu.ac.kr


Outline

  • Motivation
  • Contributions
  • Representation Learning for CM
    • Front-end feature exploration
    • Back-end model selection
  • RSSD: Training Strategy for SASV
    • Representation-Selective Self-Distillation
    • Evaluation using the proposed front-/back-end combinations
  • Conclusions

Representation Selective Self-distillation and wav2vec 2.0 Feature Exploration for Spoof-aware Speaker Verification

-

Jin Woo Lee et al.

MARG@SNU


Motivation

  • Spoofing-Aware Speaker Verification (SASV) Challenge
    • Automatic Speaker Verification (ASV) systems should reject both:
      • Voice uttered by a different speaker (ASV non-target)
      • Spoofed (synthesized or converted) utterances (CM non-target)
    • How to integrate two distinct systems (ASV + CM) [1]:
      • Ensemble solutions based on separate ASV and CM systems
      • Integrated single-system solutions
    • ASVspoof 2019 Logical Access (LA) database [2]
      • Collection of labeled speech data (target, non-target, spoofed)

[1] Jung, J. W., Tak, H., Shim, H. J., Heo, H. S., Lee, B. J., Chung, S. W., ... & Kinnunen, T. (2022). SASV challenge 2022: A spoofing aware speaker verification challenge evaluation plan. arXiv preprint arXiv:2201.10283.

[2] Wang, X., Yamagishi, J., Todisco, M., Delgado, H., Nautsch, A., Evans, N., ... & Ling, Z. H. (2020). ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech. Computer Speech & Language, 64, 101114.


Motivation

  • ASV systems should additionally reject spoofed utterances
    • Text-to-speech and voice conversion systems are constantly improving
    • The difference between the CM target (bona fide) and non-target (spoofed) is becoming harder to distinguish
  • We should improve not only the integrated SASV system but also the CM itself
    • Few anti-spoofing studies have analyzed which attributes matter for detecting spoofing artifacts
    • What constitutes an effective representation for finding artifacts remains an open question


Contributions

  • Representation learning for spoofing CM
    • Effective feature extractor (front-end) for the CM
      • Explore Transformer layer outputs of the wav2vec 2.0 model
      • Analyze which feature spaces effectively distinguish spoofed utterances
    • Effective CM encoder (back-end) for the feature extractor
      • Study encoder structures that effectively distinguish spoofed utterances
      • Compare EERs for each combination of feature extractor and encoder
  • Representation-Selective Self-Distillation (RSSD)
    • Inspired by representation distillation [1] and self-distillation [2]

[1] Tian, Y., Krishnan, D., & Isola, P. (2019, September). Contrastive Representation Distillation. In International Conference on Learning Representations.

[2] Zhang, L., Song, J., Gao, A., Chen, J., Bao, C., & Ma, K. (2019). Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3713-3722).


Contributions

  • ASV Network: ECAPA-TDNN [1]
  • CM Network: wav2vec 2.0 [2,3] + attentive statistics pooling [1]
  • SASV Network: Representation-Selective Self-Distillation (RSSD)

[1] Desplanques, B., Thienpondt, J., & Demuynck, K. (2020). ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification. arXiv preprint arXiv:2005.07143.

[2] Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33, 12449-12460.

[3] Conneau, A., Baevski, A., Collobert, R., Mohamed, A., & Auli, M. (2020). Unsupervised cross-lingual representation learning for speech recognition. arXiv preprint arXiv:2006.13979.



Representation Learning for CM

  • Experimental setup: front-end

[Diagram: the CM network (front-end under exploration) alongside the ASV network]


Representation Learning for CM

  • XLSR-53 Feature Exploration

[Diagram: raw waveform → CNN → latent speech representations → Transformer layers 1–24 → intermediate wav2vec 2.0 representations]


Representation Learning for CM

  • Experimental setup: back-end

[Diagram: the CM network (back-end under exploration) alongside the ASV network]


Representation Learning for CM

  • Spoofing CM using ASP

[Diagram: intermediate wav2vec 2.0 representations → Conv1x1 → ReLU, BN → Attentive Statistics Pooling (Tanh → Conv1x1 → Softmax attention over frames; attention-weighted mean and std) → Linear → CM embedding → Linear → CM logits]
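The ASP stage above can be sketched in NumPy. This is an illustrative stand-in, not the paper's implementation: the Conv1x1/BN pre-processing is omitted and the attention parameters `W`, `b`, `v` are hypothetical stand-ins for the Tanh → Conv1x1 → Softmax attention path.

```python
import numpy as np

def attentive_stats_pooling(H, W, b, v):
    """Attentive statistics pooling over frame-level features.

    H : (T, D) frame features (e.g. one wav2vec 2.0 layer output)
    W : (D, A), b : (A,), v : (A,) -- attention parameters
    Returns a (2*D,) utterance vector [weighted mean ; weighted std].
    """
    # per-frame attention score: e_t = v^T tanh(W^T h_t + b)
    e = np.tanh(H @ W + b) @ v               # (T,)
    a = np.exp(e - e.max())
    a = (a / a.sum())[:, None]               # softmax over time, (T, 1)
    mu = (a * H).sum(axis=0)                 # attention-weighted mean
    var = (a * H * H).sum(axis=0) - mu ** 2  # attention-weighted variance
    std = np.sqrt(np.clip(var, 1e-9, None))
    return np.concatenate([mu, std])

rng = np.random.default_rng(0)
T, D, A = 50, 8, 4
H = rng.standard_normal((T, D))
emb = attentive_stats_pooling(H, rng.standard_normal((D, A)),
                              np.zeros(A), rng.standard_normal(A))
print(emb.shape)  # (16,)
```

With zero attention parameters the weights become uniform, so the pooled mean reduces to the plain frame average; the attention path lets the CM emphasize frames that carry artifacts.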


Representation Learning for CM

  • CM EER Comparison

| System | Front-end | Self-supervised | Intermediate Feature | CM EER (%) |
|---|---|---|---|---|
| RawGAT-ST [4] | SincNet | ✗ | – | 1.06 |
| AASIST-L [3] | SincNet | ✗ | – | 0.99 |
| AASIST [3] | SincNet | ✗ | – | 0.83 |
| LGF [5] | XLSR-53 | ✓ | ✗ | 1.28 |
| LLGF [5] | W2V-Large | ✓ | ✗ | 0.86 |
| Ours (MLP) | XLSR-53 | ✓ | ✓ | 0.80 |
| Ours (AASIST) | XLSR-53 | ✓ | ✓ | 0.40 |
| Ours (ASP) | XLSR-53 | ✓ | ✓ | 0.31 |


Representation Learning for CM

CM EER (%) for each front-end feature (columns) and back-end encoder (rows):

| Back-end Encoder | sinc | XLSR 1st | XLSR 3rd | XLSR 5th | XLSR 7th | XLSR 9th | XLSR 13th | XLSR 17th |
|---|---|---|---|---|---|---|---|---|
| MLP | – | 2.2 | 2.1 | 0.8 | 1.4 | 1.0 | 1.7 | 2.2 |
| AASIST | 0.8 | 2.9 | 0.8 | 0.4 | 0.5 | 0.5 | 0.7 | 0.9 |
| ASP | 1.0 | 1.0 | 0.7 | 0.3 | 0.5 | 0.4 | 0.5 | 1.0 |

  • Using ASP as the back-end model outperforms the others for most front-ends
  • Using the 5th-layer output as the front-end outperforms the others for most back-ends
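All results in these tables are equal error rates. As a reference for how such numbers are obtained, here is a minimal EER sketch; the score convention (higher = more likely bona fide) is an assumption, and a simple threshold sweep stands in for the usual ROC-based computation.

```python
import numpy as np

def compute_eer(scores, labels):
    """EER: the operating point where the false-acceptance rate (spoofed
    trials accepted) equals the false-rejection rate (bona fide rejected).

    scores : detection scores, higher = more likely bona fide
    labels : 1 for bona fide / target, 0 for spoofed / non-target
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):              # sweep candidate thresholds
        far = np.mean(scores[labels == 0] >= t)
        frr = np.mean(scores[labels == 1] < t)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# perfectly separated scores give 0% EER
print(compute_eer([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 0.0
```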


Representation Learning for CM

  • Spoofing countermeasure feature exploration
    • XLSR-53 nth layer + ASP

[Figure: per-layer feature-exploration results, with TTS and VC attack annotations]


Representation Learning for CM

  • t-SNE plot of evaluated CM embeddings

[Figure: t-SNE of CM embeddings, clustered into bona fide, TTS + VC from TTS (A07–A16), and VC from human voice (A17, A18, A19)]


Representation Learning for CM

  • Spoofing countermeasure EER breakdown

| Attack | Type | Attack Algorithm Input | Waveform Generator | CM EER (%) |
|---|---|---|---|---|
| A07 | TTS | Text | WORLD | 0.12 |
| A08 | TTS | Text | NSF | 0.12 |
| A09 | TTS | Text | Vocaine | 0.00 |
| A10 | TTS | Text | WaveRNN | 0.89 |
| A11 | TTS | Text | Griffin-Lim | 0.10 |
| A12 | TTS | Text | WaveNet | 0.04 |
| A13 | VC | Synthetic speech (public TTS) | Waveform filtering | 0.00 |
| A14 | VC | Synthetic speech (commercial TTS) | STRAIGHT | 0.31 |
| A15 | VC | Synthetic speech (commercial TTS) | WaveNet | 0.43 |
| A16 | TTS | Text | Waveform concat. | 0.12 |
| A17 | VC | Genuine speech (human) | Waveform filtering | 0.29 |
| A18 | VC | Genuine speech (human) | MFCC vocoder | 0.24 |
| A19 | VC | Genuine speech (human) | Spectral filtering | 0.26 |
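The per-attack numbers above come from pooling each attack's spoofed trials against the bona fide trials and computing one EER per attack. A sketch of that bookkeeping, with made-up scores purely to show the grouping (the `eer` helper is a minimal threshold sweep, not the official scoring tool):

```python
import numpy as np

def eer(scores, labels):
    """Minimal EER; labels: 1 = bona fide, 0 = spoofed."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    best, out = np.inf, 1.0
    for t in np.unique(scores):
        far = np.mean(scores[labels == 0] >= t)   # spoofed accepted
        frr = np.mean(scores[labels == 1] < t)    # bona fide rejected
        if abs(far - frr) < best:
            best, out = abs(far - frr), (far + frr) / 2
    return out

# synthetic trials: (score, attack id); "-" marks bona fide trials
trials = [(0.9, "-"), (0.8, "-"), (0.7, "-"),
          (0.1, "A10"), (0.85, "A10"),   # one A10 trial scores high
          (0.1, "A09"), (0.2, "A09")]

bona = [s for s, a in trials if a == "-"]
for attack in ["A09", "A10"]:
    # each attack is evaluated against the SAME bona fide pool
    spoof = [s for s, a in trials if a == attack]
    scores = bona + spoof
    labels = [1] * len(bona) + [0] * len(spoof)
    print(attack, eer(scores, labels))
```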


RSSD: Training Strategy for SASV

  • Representation-Selective Self-Distillation (RSSD)

|  | Bona fide | Spoofed |
|---|---|---|
| Speaker matched | Target | Non-target |
| Speaker mismatched | Non-target | Non-target |

The ASV branch decides the speaker-match axis; the CM branch decides the bona fide/spoofed axis.


RSSD: Training Strategy for SASV

  • Representation-Selective Self-Distillation (RSSD)
    • Spoofed input → distort the test speaker embedding


RSSD: Training Strategy for SASV

  • Representation-Selective Self-Distillation (RSSD)
    • Bona fide input → keep the test speaker embedding
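This keep-or-distort rule can be written as a toy selective-distillation loss: for bona fide input, distill the test speaker embedding toward a frozen ASV teacher embedding; for spoofed input, push it away. The cosine-based form below is an illustrative guess at the idea, not the paper's exact objective, and `rssd_loss` is a hypothetical helper.

```python
import numpy as np

def rssd_loss(student_emb, teacher_emb, is_spoof):
    """Toy representation-selective distillation loss.

    student_emb : (D,) test speaker embedding from the trainable branch
    teacher_emb : (D,) frozen ASV embedding of the same utterance
    is_spoof    : CM decision; True -> distort, False -> keep
    """
    cos = float(student_emb @ teacher_emb /
                (np.linalg.norm(student_emb) * np.linalg.norm(teacher_emb)
                 + 1e-9))
    # bona fide: minimizing pulls the student TOWARD the teacher (keep)
    # spoofed:   minimizing pushes the student AWAY from it (distort)
    return 1.0 + cos if is_spoof else 1.0 - cos

v = np.array([1.0, 0.0])
print(rssd_loss(v, v, is_spoof=False))   # ~0.0: aligned bona fide is kept
print(rssd_loss(v, v, is_spoof=True))    # ~2.0: aligned spoof is penalized
```

Minimizing this loss over a batch makes the spoofed embeddings drift away from the speaker space, so a spoofed trial fails speaker verification even when the voice matches the enrolled speaker.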


RSSD: Training Strategy for SASV

  • Training with RSSD

[Diagram: RSSD training, minimizing the distillation loss between the selected representations]


RSSD: Training Strategy for SASV

  • Representation-Selective Self-Distillation (RSSD)

[Diagram: the full SASV system combining the CM network and the ASV network]


RSSD: Training Strategy for SASV

  • Evaluation using the proposed front-/back-end combinations

| SASV System | CM FE | CM BE | SV EER | SPF EER | SASV EER |
|---|---|---|---|---|---|
| Baseline 1 | SincNet | AASIST | 35.32 | 0.67 | 19.31 |
| Baseline 2 | SincNet | AASIST | 11.48 | 0.78 | 6.37 |
| RSSD | SincNet | AASIST | 1.41 | 0.76 | 1.15 |
| RSSD | XLSR-53 | AASIST | 1.34 | 0.60 | 1.11 |
| RSSD | XLSR-53 | ASP | 1.32 | 0.59 | 1.08 |

(FE: front-end, BE: back-end; all EERs in %.)


Conclusion

  • Investigated which XLSR-53 layer yields features most advantageous for defending against spoofing attacks
    • The output of the 5th Transformer layer shows the lowest CM EER
  • A CM encoder with a simple attentive statistics pooling layer outperformed the AASIST and MLP back-ends
  • Proposed RSSD for SASV
    • Selective training with representation self-distillation helps utilize the representations learned for each of the ASV and CM tasks
    • The proposed CM model benefits both the CM and SASV tasks
