Representation Selective Self-distillation
and wav2vec 2.0 Feature Exploration
for Spoof-aware Speaker Verification

Jin Woo Lee1, Eungbeom Kim2, Junghyun Koo1, Kyogu Lee1,2,3
jinwlee@snu.ac.kr, eb.kim@snu.ac.kr, dg22302@snu.ac.kr, kglee@snu.ac.kr
1 Department of Intelligence and Information, Seoul National University
2 Interdisciplinary Program in Artificial Intelligence, Seoul National University
3 AI Institute, Seoul National University
Outline
Representation Selective Self-distillation and wav2vec 2.0 Feature Exploration for Spoof-aware Speaker Verification
-
Jin Woo Lee et al.
MARG@SNU
Motivation
[1] Jung, J. W., Tak, H., Shim, H. J., Heo, H. S., Lee, B. J., Chung, S. W., ... & Kinnunen, T. (2022). SASV challenge 2022: A spoofing aware speaker verification challenge evaluation plan. arXiv preprint arXiv:2201.10283.
[2] Wang, X., Yamagishi, J., Todisco, M., Delgado, H., Nautsch, A., Evans, N., ... & Ling, Z. H. (2020). ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech. Computer Speech & Language, 64, 101114.
Contributions
[1] Tian, Y., Krishnan, D., & Isola, P. (2019, September). Contrastive Representation Distillation. In International Conference on Learning Representations.
[2] Zhang, L., Song, J., Gao, A., Chen, J., Bao, C., & Ma, K. (2019). Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3713-3722).
[1] Desplanques, B., Thienpondt, J., & Demuynck, K. (2020). Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification. arXiv preprint arXiv:2005.07143.
[2] Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33, 12449-12460.
[3] Conneau, A., Baevski, A., Collobert, R., Mohamed, A., & Auli, M. (2020). Unsupervised cross-lingual representation learning for speech recognition. arXiv preprint arXiv:2006.13979.
Contributions
[Figure: overall SASV system, composed of a CM network and an ASV network]
Representation Learning for CM
[Figure: system diagram with CM network and ASV network]
Representation Learning for CM
[Figure: wav2vec 2.0 front-end. A CNN encodes the raw waveform into latent speech representations, which pass through 24 Transformer layers; intermediate wav2vec 2.0 representations are taken from these Transformer layers.]
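The layer-tapping described above can be sketched with HuggingFace Transformers: `output_hidden_states=True` exposes the output of every Transformer layer. To keep the sketch self-contained it instantiates a small randomly initialized model (the config sizes are illustrative); in practice one would load the pretrained XLSR-53 checkpoint, e.g. `Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53")`.

```python
import torch
from transformers import Wav2Vec2Config, Wav2Vec2Model

# Small random model so the sketch runs offline; swap in
# from_pretrained("facebook/wav2vec2-large-xlsr-53") for the real thing.
config = Wav2Vec2Config(num_hidden_layers=6, hidden_size=64,
                        num_attention_heads=4, intermediate_size=128)
model = Wav2Vec2Model(config)
model.eval()

waveform = torch.randn(1, 16000)  # (batch, samples): 1 s of 16 kHz audio
with torch.no_grad():
    out = model(waveform, output_hidden_states=True)

# out.hidden_states holds num_hidden_layers + 1 tensors of shape
# (batch, frames, hidden_size); entry 0 is the projected CNN output and
# entry k is the output of Transformer layer k.
intermediate = out.hidden_states[5]  # tap an intermediate Transformer layer
print(len(out.hidden_states), intermediate.shape)
```

With the full XLSR-53 model, `hidden_states` contains 25 tensors of width 1024, and any of the 24 Transformer layers can be tapped the same way.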
Representation Learning for CM
[Figure: CM back-end. Intermediate wav2vec 2.0 representations pass through Conv1x1 with ReLU and BN, then Attentive Statistics Pooling (attention weights from Tanh, Conv1x1, and Softmax; weighted mean and std), followed by linear layers producing the CM embedding and CM logits. RSSD connects the CM network and the ASV network.]
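The pooling block above can be sketched as a minimal PyTorch module: frame-level features are scored by a small attention net (Conv1x1, Tanh, Conv1x1, Softmax over time), and the attention-weighted mean and standard deviation are concatenated into an utterance-level vector. This is a simplified variant, without the global-context statistics that ECAPA-TDNN's ASP also feeds into the attention; channel and bottleneck sizes are illustrative.

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Attentive Statistics Pooling: attention-weighted mean + std over time."""
    def __init__(self, channels: int, bottleneck: int = 128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv1d(channels, bottleneck, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(bottleneck, channels, kernel_size=1),
            nn.Softmax(dim=2),  # normalize attention over the time axis
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames)
        w = self.attention(x)                          # per-frame attention weights
        mean = torch.sum(w * x, dim=2)                 # weighted mean
        var = torch.sum(w * x * x, dim=2) - mean ** 2  # weighted variance
        std = torch.sqrt(var.clamp(min=1e-9))          # weighted std (numerically safe)
        return torch.cat([mean, std], dim=1)           # (batch, 2 * channels)

pool = AttentiveStatsPooling(channels=256)
emb = pool(torch.randn(4, 256, 100))  # 4 utterances, 100 frames each
print(emb.shape)  # torch.Size([4, 512])
```

The output dimension is twice the input channel count, since mean and std are concatenated.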
System | Front-end Architecture | Self-supervised | Intermediate Feature | CM EER (%) |
RawGAT-ST [4] | SincNet | 𝗫 | - | 1.06 |
AASIST-L [3] | SincNet | 𝗫 | - | 0.99 |
AASIST [3] | SincNet | 𝗫 | - | 0.83 |
LGF [5] | XLSR-53 | ✔ | 𝗫 | 1.28 |
LLGF [5] | W2V-Large | ✔ | 𝗫 | 0.86 |
Ours (MLP) | XLSR-53 | ✔ | ✔ | 0.80 |
Ours (AASIST) | XLSR-53 | ✔ | ✔ | 0.40 |
Ours (ASP) | XLSR-53 | ✔ | ✔ | 0.31 |
Representation Learning for CM
CM-EER (%) by front-end feature:
Back-end Encoder | sinc | XLSR 1st | XLSR 3rd | XLSR 5th | XLSR 7th | XLSR 9th | XLSR 13th | XLSR 17th |
MLP | - | 2.2 | 2.1 | 0.8 | 1.4 | 1.0 | 1.7 | 2.2 |
AASIST | 0.8 | 2.9 | 0.8 | 0.4 | 0.5 | 0.5 | 0.7 | 0.9 |
ASP | 1.0 | 1.0 | 0.7 | 0.3 | 0.5 | 0.4 | 0.5 | 1.0 |
Representation Learning for CM
Representation Learning for CM
[Figure: embedding visualization with TTS and VC attack clusters annotated]
Representation Learning for CM
[Figure: embedding visualization. Legend: bona fide; TTS + VC from TTS (A07~A16); VC from human voice (A17, A18, A19)]
Attack | Type | Attack Algorithm Input | Waveform Generator | CM EER (%) |
A07 | TTS | Text | WORLD | 0.12 |
A08 | TTS | Text | NSF | 0.12 |
A09 | TTS | Text | Vocaine | 0.00 |
A10 | TTS | Text | WaveRNN | 0.89 |
A11 | TTS | Text | Griffin-Lim | 0.10 |
A12 | TTS | Text | WaveNet | 0.04 |
A13 | VC | Synthetic speech (public TTS) | Waveform filtering | 0.00 |
A14 | VC | Synthetic speech (commercial TTS) | STRAIGHT | 0.31 |
A15 | VC | Synthetic speech (commercial TTS) | WaveNet | 0.43 |
A16 | TTS | Text | Waveform concat. | 0.12 |
A17 | VC | Genuine speech (human) | Waveform filtering | 0.29 |
A18 | VC | Genuine speech (human) | MFCC vocoder | 0.24 |
A19 | VC | Genuine speech (human) | Spectral filtering | 0.26 |
Representation Learning for CM
RSSD: Training Strategy for SASV
 | Bona fide | Spoofed |
Speaker matched | Target | Nontarget |
Speaker mismatched | Nontarget | Nontarget |
[Figure: ASV and CM networks]
RSSD: Training Strategy for SASV
[Figure: ASV and CM networks; distort the test speaker embedding]
RSSD: Training Strategy for SASV
[Figure: ASV and CM networks; keep the test speaker embedding]
RSSD: Training Strategy for SASV
[Figure: the self-distillation objective minimizes the discrepancy between the selected representations]
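The selection above can be illustrated with a hedged toy loss; this is not the paper's exact objective, and the function name and margin are illustrative. The idea: the countermeasure decision gates the distillation target for the test speaker embedding, pulling it toward the ASV embedding (keep) for bona fide inputs and pushing it away (distort) for spoofed inputs, so spoofed trials land far from the enrolled speaker.

```python
import torch
import torch.nn.functional as F

def selective_distillation_loss(student_emb: torch.Tensor,
                                teacher_emb: torch.Tensor,
                                is_bonafide: torch.Tensor,
                                margin: float = 0.5) -> torch.Tensor:
    """Toy selective objective: keep bona fide embeddings, distort spoofed ones.

    student_emb, teacher_emb: (batch, dim); is_bonafide: (batch,) bool mask.
    """
    cos = F.cosine_similarity(student_emb, teacher_emb, dim=1)
    keep_loss = 1.0 - cos               # bona fide: pull toward the teacher
    distort_loss = F.relu(cos - margin) # spoofed: push similarity below margin
    loss = torch.where(is_bonafide, keep_loss, distort_loss)
    return loss.mean()

s = torch.randn(8, 192)                      # student speaker embeddings
t = torch.randn(8, 192)                      # teacher (ASV) embeddings
flags = torch.tensor([True, False] * 4)      # CM decisions per trial
print(selective_distillation_loss(s, t, flags))
```

When the student matches the teacher exactly on bona fide trials, the keep term vanishes; the margin only penalizes spoofed embeddings that remain too similar to the teacher.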
RSSD: Training Strategy for SASV
[Figure: overall system with CM network and ASV network]
SASV System | CM Front-end | CM Back-end | SV-EER (%) | SPF-EER (%) | SASV-EER (%) |
Baseline 1 | SincNet | AASIST | 35.32 | 0.67 | 19.31 |
Baseline 2 | SincNet | AASIST | 11.48 | 0.78 | 6.37 |
RSSD | SincNet | AASIST | 1.41 | 0.76 | 1.15 |
RSSD | XLSR-53 | AASIST | 1.34 | 0.60 | 1.11 |
RSSD | XLSR-53 | ASP | 1.32 | 0.59 | 1.08 |
RSSD: Training Strategy for SASV
Conclusion