Slides by Xin Wang
National Institute of Informatics
© 2026, Xin Wang. All rights reserved.
This work is licensed under the Creative Commons Attribution 3.0 license.
See http://creativecommons.org/ for details.
1
Progress of Speech Generative AI and Countermeasures against Speech Deepfake
WANG Xin
Project Associate Professor, JST PRESTO researcher
Muroran Institute of Technology 2026/06/10
wangxin@nii.ac.jp
ワン シン
2
2015
2018
2020
2013
2024
@UESTC & USTC
China
@NII, Japan
PI Prof. Yamagishi
Ph.D.
researcher for CREST
JST PRESTO (Sakigake)
Speech watermark
Speech deepfake detection: ASVspoof, SSL
Voice anonymization
Speech synthesis: HMM, autoregressive models, F0, DNN vocoders
M.S.
https://researchmap.jp/wangxin
Muroran Institute of Technology 2026/06/10
@NII
PI Wang
M.S.
3
Speech synthesis
Speech generation, GenAI
Deepfake detection
Speech anti-spoofing
Signal processing
Computer science
signal & system, filters, …
information theory, search, …
pattern recognition
decision theory
deep neural networks
Machine learning
Linguistics
phonetics, phonology, …
https://www.magnific.com/free-photo/symbols-come-out-bulb-top-book_985250.htm
4
Introduction
Text example from: Beckman, M. E. & Ayers, G. Guidelines for ToBI labelling. OSU Res. Found. 3, (1997)
Image
Video
Audio
Text
Speech / voice
Text: Marianna made the marmalade
Speech:
5
Introduction
Kewley-Port, D. & Nearey, T. M. Speech synthesizer produced voices for disabled, including Stephen Hawking. J. Acoust. Soc. Am. 148, R1–R2 (2020)
https://commons.wikimedia.org/wiki/File:Stephen_Hawking.StarChild.jpg
Input text
Speech signal
Text-to-speech (TTS)
6
Introduction
https://collection.sciencemuseumgroup.org.uk/objects/co8911441/stephen-hawkings-speech-synthesizer-board
https://thereader.mitpress.mit.edu/stephen-hawkings-eternal-voice/
Input text
Speech signal
“Perfect Paul”
by Dennis Klatt
7
Introduction
Kewley-Port, D. & Nearey, T. M. Speech synthesizer produced voices for disabled, including Stephen Hawking. J. Acoust. Soc. Am. 148, R1–R2 (2020)
https://commons.wikimedia.org/wiki/File:Stephen_Hawking.StarChild.jpg
Input text
Reference voice (10s)
Speech signal
8
Introduction
See paper in Reference
Samples at the top are from ChatterBox. Other samples are from the ASVspoof 2019 database.
HMM+deep neural networks (DNNs)
(Zen 2013, Ling 2013, Fan 2014)
Hidden Markov model (HMM)
(Tokuda 1995, Yoshimura 1999, Tokuda 2000)
~1990s
~2000
~2013
~2017
Unit-selection
(Hunt 1996)
Transformer (Li 2018), WaveNet (Oord 2016) …
Latest
Large language model (LLM)
(e.g., VALL-E, Wang 2023)
Natural voice
9
Introduction
Samples from WildSpoof challenge https://wildspoof.github.io
Many APIs & models on Huggingface
10
Introduction
Of course, my own voice can be easily cloned.
That is what you are hearing right now.
The cloned voice can speak better than me, reading a complicated sentence like “behaving like a babbling, bumbling band of baboons”
11
Introduction
https://www.bloomberg.com/news/articles/2024-01-26/ai-startup-elevenlabs-bans-account-blamed-for-biden-audio-deepfake?embedded-checkout=true
https://www.bbc.com/news/technology-60780142
Speech signal
Input text
Reference voice
12
Introduction
https://www.bloomberg.com/news/articles/2024-01-26/ai-startup-elevenlabs-bans-account-blamed-for-biden-audio-deepfake?embedded-checkout=true
https://www.bbc.com/news/technology-60780142
Speech signal
Input text
Reference voice
Part 2
How speech generation is misused
Part 3
Deepfake speech detection
Part 4
Additional layer of countermeasure: watermark & anonymization
Part 1:
Progress of speech synthesis
13
Progress of Speech Generation
How do we speak & listen
How can machines speak – the progress
Euphonia: the talking machine
https://en.wikipedia.org/wiki/Euphonia_%28device%29
14
How do we speak & listen
Illustration from HTS Slides ver. 2.3, HTS Working Group
Illustration from https://en.wikipedia.org/wiki/Middle_ear
Speaker
Listener
Speech perception
Speech production
Speech transmission
Telephone, VoIP, air …
Channel
15
How do we speak & listen
Illustration from HTS Slides ver. 2.3, HTS Working Group
Animation from Speech production and articulation knowledge group https://sail.usc.edu/span/rtmri_ipa/je_2015.html
Contents
Marianna made the marmalade
Words:
Prosody:
H
H
L
Accent: U.S. accent
Emotion:
Sex:
Voice quality: clear, warm, …
Identity: who the speaker is
Paralinguistic
Biometric
16
How do we speak & listen
Illustration from HTS Slides ver. 2.3, HTS Working Group
Animation from Speech production and articulation knowledge group https://sail.usc.edu/span/rtmri_ipa/je_2015.html
[a]
[i]
[o]
We perceive vowel sounds by their formant frequencies, which are determined by the resonant frequencies of the vocal tract
Contents
F1 F2 …
vowels
17
How do we speak & listen
Figure from https://www.sfu.ca/sonic-studio-webdav/cmns/Handbook%20Tutorial/SpeechAcoustics.html
https://commons.wikimedia.org/wiki/File:Pipe003.gif
We perceive vowel sounds by their formant frequencies, which are determined by the resonant frequencies of the vocal tract
Contents
F1 F2 …
vowels
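A minimal sketch of why a tube has formant-like resonances, assuming the textbook simplification of the vocal tract as a uniform tube closed at the glottis and open at the lips. The 17 cm length and the 340 m/s speed of sound are assumed illustrative values, not numbers from the slides, and `tube_formants` is a hypothetical helper name.

```python
def tube_formants(length_m, n_formants=3, speed_of_sound=340.0):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other.

    Quarter-wavelength resonator: F_k = (2k - 1) * c / (4 * L).
    """
    return [(2 * k - 1) * speed_of_sound / (4.0 * length_m)
            for k in range(1, n_formants + 1)]

# Assumed values: a 17 cm vocal tract and c = 340 m/s.
formants = tube_formants(0.17)
print([round(f) for f in formants])
```

Real vowels deviate from these evenly spaced values because the tongue and lips change the tube shape, which is what moves F1 and F2 between [a], [i], and [o].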
18
How do we speak & listen
Motion picture from https://en.wikipedia.org/wiki/Vocal_cords
Animation from Speech production and articulation knowledge group https://sail.usc.edu/span/rtmri_ipa/je_2015.html
Animation from http://www.ling.fju.edu.tw/hearing/ishizaka.htm
[a]
We perceive pitch based on the fundamental frequency (F0), which is determined by the vibration rate of the vocal cords
Contents
[a]
F0
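As a rough illustration of how F0 can be measured from a waveform, the sketch below picks the lag with the highest normalized autocorrelation. This is one common approach and not necessarily the method used for the figures in these slides; the synthetic frame, the 80 to 400 Hz search range, and the function name `estimate_f0` are all assumptions.

```python
import math

def estimate_f0(frame, sample_rate, f0_min=80.0, f0_max=400.0):
    """Estimate F0 (Hz) as the lag with the highest normalized autocorrelation."""
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    best_lag, best_score = lag_min, -1.0
    for lag in range(lag_min, lag_max + 1):
        head = frame[:len(frame) - lag]
        tail = frame[lag:]
        num = sum(a * b for a, b in zip(head, tail))
        den = math.sqrt(sum(a * a for a in head) * sum(b * b for b in tail))
        score = num / den if den > 0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Synthetic "voiced" frame (assumed): a 125 Hz fundamental plus two harmonics.
sample_rate = 8000
f0_true = 125.0
frame = [sum(math.sin(2 * math.pi * f0_true * k * n / sample_rate) / k
             for k in (1, 2, 3))
         for n in range(800)]
f0_hat = estimate_f0(frame, sample_rate)
print(f0_hat)
```

Restricting the search range matters: any multiple of the true period also correlates perfectly, so a wider range can return a sub-octave of the real pitch.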
19
How do we speak & listen
Example from: Nakamura, I., Minematsu, N., Suzuki, M., Hirano, H., Nakagawa, C., Nakamura, N., Tagawa, Y., Hirose, K., Hashimoto, H. (2013) Development of a web framework for teaching and learning Japanese prosody: OJAD (online Japanese accent dictionary). Proc. Interspeech 2013, 2554-2558, doi: 10.21437/Interspeech.2013-575
[a]
We perceive pitch based on the fundamental frequency (F0), which is determined by the vibration rate of the vocal cords
Contents
[a]
F0
Japanese pitch accent
20
How do we speak & listen
Picture from DOI:10.1142/9891 See more by searching Vocal Cords Bernoulli effect
Animation from Speech production and articulation knowledge group https://sail.usc.edu/span/rtmri_ipa/je_2015.html
Our impression of who the speaker is depends on the unique shape, size, length, … of the vocal tract and vocal cords!
Speaker A
Biometric
Speaker B
Voiceprint
21
How do we speak & listen
https://en.wikipedia.org/wiki/Formant#/media/File:Spectrogram_-iua-.png
https://speechprocessingbook.aalto.fi/Representations/Fundamental_frequency_F0.html
Short-time Fourier transform
Wideband spectrogram
Formant frequencies (F1 & F2)
[i]
[u]
[a]
Frequency
Time
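The spectrogram on this slide can be approximated with a short-time Fourier transform: cut the signal into short overlapping frames, apply a window, and take the magnitude of a DFT per frame. The sketch below is a deliberately naive pure-Python version under assumed settings (64-sample Hann window, hop of 32, 8 kHz sampling rate, two synthetic tones); a short window like this gives the coarse frequency resolution of a wideband spectrogram.

```python
import cmath
import math

def stft_magnitude(signal, frame_len, hop):
    """Return a list of frames, each a list of DFT magnitudes for bins 0..frame_len/2."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]  # periodic Hann window
    n_bins = frame_len // 2 + 1
    spectrogram = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrogram.append([
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len)))
            for k in range(n_bins)
        ])
    return spectrogram

def tone(freq_hz, n_samples, sample_rate):
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate) for n in range(n_samples)]

# Assumed test signal: 50 ms at 500 Hz followed by 50 ms at 1500 Hz.
sample_rate, frame_len, hop = 8000, 64, 32
signal = tone(500, 400, sample_rate) + tone(1500, 400, sample_rate)
spec = stft_magnitude(signal, frame_len, hop)

# Frequency (Hz) of the strongest bin in each frame.
peak_hz = [max(range(len(f)), key=f.__getitem__) * sample_rate / frame_len for f in spec]
print(peak_hz[0], peak_hz[-1])
```

The peak track jumps from 500 Hz to 1500 Hz halfway through, which is the time-frequency picture the spectrogram axes (Time, Frequency) describe.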
22
How do we speak & listen
https://www.biointeractive.org/classroom-resources/cochlea
This animation is a clip from a 1999 Holiday Lecture Series, Senses and Sensitivity, by neuroscientist A. James Hudspeth
Short-time Fourier transform
Wideband spectrogram
Formant frequencies (F1 & F2)
[i]
[u]
[a]
Frequency
Time
23
How can machines speak? – Research questions
https://en.wikipedia.org/wiki/Formant#/media/File:Spectrogram_-iua-.png
Input text
Waveform generation: recover waveform
Acoustic modeling: generate intermediate features
Text analysis: decide F0, F1, F2 … from text
24
How can machines speak? – Research questions
Sentence from: Beckman, M. E. & Ayers, G. Guidelines for ToBI labelling. OSU Res. Found. 3, (1997)
Marianna made the marmalade
Discrete
Continuous
Alignment
M
a
a
m
e
a
…
m
…
d
a
l
d
e
r
Ambiguity
Biometric …
25
How can machines speak? – Research questions
Sentence from: Beckman, M. E. & Ayers, G. Guidelines for ToBI labelling. OSU Res. Found. 3, (1997)
LOGIOS Lexicon tool: http://www.speech.cs.cmu.edu/tools/lextool.html
H*, L-L%: ToBI labels Beckman, M. E. & Ayers, G. Guidelines for ToBI labelling. OSU Res. Found. 3, (1997)
Marianna made the marmalade
M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D
To phoneme
Waveform
generation
M
a
a
m
e
a
…
m
…
d
a
l
d
e
r
Normalization
+Prosody
tags
H*
H*
L-L%
M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D
Acoustic
realization
…
Waveform generation
Acoustic modeling
Text analysis
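The "To phoneme" step of text analysis can be sketched as a dictionary lookup after normalization. The pronunciations below are taken from the phoneme string on this slide; the tiny `LEXICON` table and the function name `text_to_phonemes` are illustrative assumptions, and a real front end would use a full lexicon (such as the LOGIOS tool cited above) plus rules for out-of-vocabulary words.

```python
import string

# Pronunciations copied from the slide's phoneme sequence (ARPAbet-style symbols).
LEXICON = {
    "marianna": ["M", "AA", "R", "IY", "AA", "N", "AH"],
    "made": ["M", "EY", "D"],
    "the": ["DH", "AH"],
    "marmalade": ["M", "AA", "R", "M", "AH", "L", "EY", "D"],
}

def normalize(text):
    """Lowercase the text and strip punctuation (a minimal stand-in for normalization)."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def text_to_phonemes(text):
    """Map each normalized word to its phoneme sequence; unknown words raise KeyError."""
    phonemes = []
    for word in normalize(text):
        phonemes.extend(LEXICON[word])
    return phonemes

phoneme_string = " ".join(text_to_phonemes("Marianna made the marmalade."))
print(phoneme_string)
```

Prosody tags such as H* and L-L% are not recoverable from a lexicon alone, which is one reason this stage is language dependent and historically relied on expert rules.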
26
How can machines speak? – Research questions
Kiyoshi KURIHARA, Nobumasa SEIYAMA, Tadashi KUMANO, "Prosodic Features Control by Symbols as Input of Sequence-to-Sequence Acoustic Modeling for Neural TTS" in IEICE TRANSACTIONS on Information, vol. E104-D, no. 2, pp. 302-311, February 2021, doi: 10.1587/transinf.2020EDP7104.
Marianna made the marmalade
M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D
To phoneme
Waveform
generation
M
a
a
m
e
a
…
m
…
d
a
l
d
e
r
Normalization
+Prosody
tags
H*
H*
L-L%
M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D
Acoustic
realization
…
Language dependent
27
Progress of TTS in terms of acoustic & waveform modeling
Heiga Zen, Keiichi Tokuda, and Alan W Black. 2009. Statistical parametric speech synthesis. Speech Communication 51, (2009), 1039–1064.
2019 https://www.wsj.com/articles/fraudsters-use-ai-to-mimic-ceos-voice-in-unusual-cybercrime-case-11567157402?reflink=desktopwebshare_permalink
See other papers in appendix
M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D
To phone
Waveform
generation
M
a
a
m
e
a
…
m
…
d
a
l
d
e
r
Normalization
+Prosody
tags
H*
H*
L-L%
M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D
Acoustic
realization
…
Marianna made the marmalade
Discrete
Continuous
Alignment
M
a
a
m
e
a
…
m
…
d
a
l
d
e
r
Ambiguity
Speaker identity,
Prosody …
HMM+deep neural networks
Hidden Markov model (HMM)
1996
~2000
~2013
~2017
Unit-selection
WaveNet, end-to-end (Tacotron, Transformer)
Latest
Statistical parametric speech synthesis (Zen 2009)
Codec+LLM
(VALL-E, CosyVoice)
Linguistics,
rules
Deep learning,
statistics
28
Progress of TTS – 1990s Unit selection
Hunt, A. J. & Black, A. W. Unit selection in a concatenative speech synthesis system using a large speech database. in Proc. ICASSP 373–376 (1996).
Black, A. W. & Taylor, P. A. Automatically clustering similar units for unit selection in speech synthesis. (1997)
Marianna made the marmalade
M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D
To phoneme
Waveform
concatenation
M
a
a
m
e
a
…
m
…
d
a
l
d
e
r
Normalization
+Prosody
tags
H*
H*
L-L%
M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D
Acoustic
realization
Concatenation
artifacts
29
Progress of TTS – 1990s Unit selection
[i]
[u]
[a]
Unit-selection concatenates sound segments
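Unit selection is usually framed as a search: pick one recorded unit per target so that the sum of target costs (how well a unit fits) and join costs (how well neighbours concatenate) is minimal. The sketch below is a toy dynamic-programming version of that idea; representing each unit by only its start and end F0 and the specific cost functions are simplifying assumptions, not the features used by Hunt & Black (1996).

```python
def select_units(targets, candidates, join_weight=1.0):
    """Viterbi search over candidate units.

    targets: desired F0 (Hz) per position.
    candidates[i]: list of (start_f0, end_f0) units available for position i.
    Returns (chosen index per position, total cost).
    """
    def target_cost(unit, target):
        return abs((unit[0] + unit[1]) / 2 - target)

    def join_cost(prev, cur):
        return abs(prev[1] - cur[0])  # F0 jump at the concatenation point

    best = [target_cost(u, targets[0]) for u in candidates[0]]
    back = []
    for i in range(1, len(targets)):
        new_best, pointers = [], []
        for unit in candidates[i]:
            costs = [best[j] + join_weight * join_cost(prev, unit)
                     for j, prev in enumerate(candidates[i - 1])]
            j_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[j_min] + target_cost(unit, targets[i]))
            pointers.append(j_min)
        best = new_best
        back.append(pointers)

    last = min(range(len(best)), key=best.__getitem__)
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]

# Hypothetical inventory: two candidate units for each of two positions.
targets = [100.0, 120.0]
candidates = [
    [(100.0, 100.0), (98.0, 118.0)],
    [(140.0, 120.0), (119.0, 121.0)],
]
path, total_cost = select_units(targets, candidates)
print(path, total_cost)
```

Here the first position's best-fitting unit on its own (index 0) is rejected because it would join badly; the search prefers a slightly worse fit that concatenates smoothly. When no smooth join exists in the database, concatenation artifacts remain.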
30
Progress of TTS – after 1990s
[i]
[u]
[a]
Co-articulation: human speech is continuous
Statistical parametric speech synthesis: use statistical models to generate smooth feature trajectories (parametric approach)
31
Progress of TTS – 2000s
Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T. & Kitamura, T. Speech parameter generation algorithms for HMM-based speech synthesis. in Proc. ICASSP 936–939 (2000).
Keiichi Tokuda, Heiga Zen, and Alan W Black. An HMM-Based Speech Synthesis System Applied to English. In Proc. SSW, 227–230. 2002.
HMM-Based Speech Synthesis Toolkit (HTS), home page: http://hts.sp.nitech.ac.jp/?Welcome
Marianna made the marmalade
M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D
To phone
Waveform
generation
M
a
a
m
e
a
…
m
…
d
a
l
d
e
r
Normalization
+Prosody
tags
H*
H*
L-L%
M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D
Acoustic
realization
…
Hidden Markov model
Vocoder artifacts
& oversmoothing
Vocoder using digital signal processing (DSP)
32
Progress of TTS – early 2010s
Heiga Zen, Alan Senior, and Martin Schuster. Statistical Parametric Speech Synthesis Using Deep Neural Networks. In Proc. ICASSP, 7962–7966. 2013.
Heiga Zen, and Andrew Senior. Deep Mixture Density Networks for Acoustic Modeling in Statistical Parametric Speech Synthesis. In Proc. ICASSP, 3844–3848. 2014.
Xin Wang, Shinji Takaki, and Junichi Yamagishi. 2017. An autoregressive recurrent mixture density network for parametric speech synthesis. In Proc. ICASSP, 4895–4899.
Marianna made the marmalade
M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D
To phone
Waveform
generation
M
a
a
m
e
a
…
m
…
d
a
l
d
e
r
Normalization
+Prosody
tags
H*
H*
L-L%
M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D
Acoustic
realization
…
Deep neural network (DNN)
Vocoder artifacts
& oversmoothing
Vocoder using digital signal processing (DSP)
33
Progress of TTS – late 2010s
Xin Wang, Neural statistical parametric speech synthesis, ISCA Odyssey 2020, tutorial: https://tonywangx.github.io/slide.html#dec-2020
Xu Tan, Tao Qin, Frank Soong, and Tie-Yan Liu. A Survey on Neural Speech Synthesis. ArXiv Preprint ArXiv:2106.15561. 2021.
a
a
m
e
a
…
m
…
d
a
l
d
e
r
H*
H*
L-L%
M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D
…
Vocoder using DNN
Transformer
Acoustic
features
Input text
M
Vocoder artifacts
& oversmoothing
Alignment not always stable
34
Progress of TTS – from 2000s to late 2010s
a
a
m
e
a
…
m
…
d
a
l
d
e
r
H*
H*
L-L%
M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D
…
Vocoder using DNN
Transformer
Acoustic
features
Input text
M
Expert-knowledge based
35
Progress of TTS – 2020s
Aaron Van Den Oord, Oriol Vinyals, and others. 2017. Neural discrete representation learning. In Proc. NIPS, 2017. 6309–6318.
Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. 2022. SoundStream: An End-to-End Neural Audio Codec. IEEE/ACM Trans. Audio Speech Lang. Process. 30, (2022), 495–507.
1
2
3
…
DNN decoder
Acoustic
tokens
DNN encoder
Use DNNs to learn a latent feature space
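The step from continuous latent features to acoustic tokens (1, 2, 3, …) is a vector quantization: each encoder output is replaced by the index of its nearest codebook entry, and the decoder works from those indices. The sketch below shows only that lookup; the 2-D codebook and latent vectors are made-up values, whereas in a neural codec both the encoder and the codebook are learned.

```python
def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    tokens = []
    for vec in latents:
        dists = [sum((a - b) ** 2 for a, b in zip(vec, code)) for code in codebook]
        tokens.append(min(range(len(codebook)), key=dists.__getitem__))
    return tokens

def dequantize(tokens, codebook):
    """Look the tokens up again; this is what the decoder would receive."""
    return [codebook[t] for t in tokens]

# Hypothetical 4-entry codebook and three encoder outputs.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
latents = [(0.1, -0.2), (0.9, 0.8), (0.2, 1.1)]

tokens = quantize(latents, codebook)
reconstructed = dequantize(tokens, codebook)
print(tokens, reconstructed)
```

Because speech becomes a sequence of integers, a language model can predict acoustic tokens in the same way it predicts text tokens.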
36
Progress of TTS – 2020s
Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhifu Gao, and Zhijie Yan. 2024. CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens.
Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, and others. 2023. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111 (2023).
a
a
m
e
a
…
m
…
d
a
l
d
e
r
H*
H*
L-L%
M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D
1
2
3
…
DNN decoder
Large language model
Acoustic
tokens
Input tokens
M
Code index
prompting
37
Progress of TTS – 2020s
a
a
m
e
a
m
…
…
d
a
l
d
e
r
H*
H*
L-L%
M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D
1
2
3
…
DNN decoder
Large language model
M
Speak in a firm voice like a politician,
…
Text-based control
prompting
+
+
38
Progress of TTS
More DNNs, fewer linguistic rules
Better quality, fewer artifacts
DNN decoder
Large language model
DNN vocoder
(small) Transformer
DSP vocoder
Classical DNN
/ HMM
Waveform concatenation
Unit selection
…
39
Progress of TTS
More DNNs, fewer linguistic rules
Better quality, fewer artifacts
DNN decoder
Large language model
DNN vocoder
(small) Transformer
DSP vocoder
Classical DNN
/ HMM
Waveform concatenation
Unit selection
…
Festival & MaryTTS
(C++, Lisp/perl, …)
HTS
(C++)
CURRENNT
Merlin & RNNLIB
(C++/CUDA)
Theano
TensorFlow
PyTorch
(Python)
PyTorch
(Python)
More open-sourced & easier to use
40
Misuse of Speech Generation
https://en.wikipedia.org/wiki/Turing_test
How well do deepfakes fool humans?
How well do deepfakes fool machines?
41
Misuse of speech generation
TTS generator
Attacker
User
Humans
Telephone call
Social media
Speaker verification
Services
Transmission
42
Misuse of speech generation
https://www.wsj.com/articles/fraudsters-use-ai-to-mimic-ceos-voice-in-unusual-cybercrime-case-11567157402
TTS generator
Attacker
User
Humans
Telephone call
Social media
Speaker verification
Services
Transmission
AI voice scam
I am your boss / son, please send me the money
sns
I am your boss …
43
Misuse of speech generation
https://www.mcafee.com/content/dam/consumer/en-us/resources/cybersecurity/artificial-intelligence/rp-beware-the-artificial-impostor-report.pdf
TTS generator
Attacker
User
Humans
Telephone call
Social media
Speaker verification
Services
Transmission
AI voice scam
44
Misuse of speech generation
Xin Wang, Junichi Yamagishi, et al. 2020. ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech. Computer Speech & Language 64, (November 2020), 101114.
Kimberly T. Mai, Sergi Bray, Toby Davies, and Lewis D. Griffin. 2023. Warning: Humans cannot reliably detect speech deepfakes. PLoS ONE 18, 8 (August 2023), e0285333.
Kevin Warren, Tyler Tucker, Anna Crowder, Daniel Olszewski, Allison Lu, Caroline Fedele, Magdalena Pasternak, Seth Layton, Kevin Butler, Carrie Gates, and others. 2024. “Better be computer or I’m dumb”: a large-scale evaluation of humans as audio deepfake detectors. In Proc. ACM CCS, 2024. 2696–2710.
TTS generator
Attacker
User
Humans
Telephone call
Social media
Speaker verification
Services
Transmission
Can fake speech be detected by humans?
Human-ear detection is NOT reliable
45
Misuse of speech generation
Xin Wang, Junichi Yamagishi, et al. 2020. ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech. Computer Speech & Language 64, (November 2020), 101114.
Humans
Telephone call
Social media
Depending on the attacks (Wang 2020)
Random guess
Detection error rates by human ears (%)
by 1,145 listeners
Tacotron 2
Attack IDs
46
Misuse of speech generation
TTS generator
Attacker
User
Humans
Telephone call
Social media
Services
Transmission
Speaker verification
47
Misuse of speech generation
John H L Hansen and Taufiq Hasan. 2015. Speaker recognition by machines and humans: A tutorial review. IEEE Signal processing magazine 32, 6 (2015), 74–99.
Personalized Hey Siri, https://machinelearning.apple.com/research/personalized-hey-siri
Services
Speaker verification
Speaker verification
Match
Not match
TTS
Match
Examples:
Telephone banking
“Hey Siri”
…
Spoofing
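The Match / Not match decision can be sketched as a similarity score between an enrolled speaker embedding and the embedding of the incoming utterance, compared against a threshold. The 3-D vectors, the 0.7 threshold, and the use of plain cosine scoring are all illustrative assumptions; deployed systems use learned embeddings (for example x-vectors) with hundreds of dimensions and calibrated back ends.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(enrolled, test, threshold=0.7):
    """Return True (Match) if the test embedding is close enough to the enrolled one."""
    return cosine_similarity(enrolled, test) >= threshold

# Hypothetical embeddings.
enrolled = (1.0, 0.0, 1.0)        # registered user
same_speaker = (0.9, 0.1, 1.1)    # genuine utterance from the user
other_speaker = (-1.0, 1.0, 0.0)  # a different human speaker
cloned_voice = (1.0, 0.05, 0.95)  # TTS output cloned from the user's voice

decisions = {
    "same_speaker": verify(enrolled, same_speaker),
    "other_speaker": verify(enrolled, other_speaker),
    "cloned_voice": verify(enrolled, cloned_voice),
}
print(decisions)
```

The verifier only asks whether the voice resembles the enrolled speaker, not whether a human produced it, so a good clone lands on the Match side. That gap is what spoofing exploits.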
48
Misuse of speech generation
https://www.wsj.com/articles/i-cloned-myself-with-ai-she-fooled-my-bank-and-my-family-356bd1a3 https://youtu.be/kqYSIU70N68
https://www.iso.org/standard/83828.html : ISO/IEC 30107-1:2023
Services
Speaker verification
Speaker verification
Match
Not match
TTS
Match
AI voice spoofing
49
Misuse of speech generation
Bryan L Pellom and John H L Hansen. 1999. An experimental study of speaker verification sensitivity to computer voice-altered imposters. In Proc. ICASSP, 1999. IEEE, 837–840.
Takashi Masuko, Takafumi Hitotsumatsu, Keiichi Tokuda, and Takao Kobayashi. 1999. On the security of HMM-based speaker verification systems against imposture using synthetic speech. In Proc. Eurospeech, 1999. 1223–1226.
Speaker verification
Humans
Telephone call
Social media
Speaker verification
Services
Will a speaker verification model be fooled?
Yes
TTS
Match
50
Misuse of speech generation
Xin Wang, Junichi Yamagishi, et al. 2020. ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech. Computer Speech & Language 64, (November 2020), 101114.
Humans
Telephone call
Social media
Speaker verification
Services
Will a speaker verification model be fooled? YES
Random guess
Equal error rates (%)
X-vector-based speaker verification model
Unit-selection
HMM-DNN
51
Misuse of speech generation
Xin Wang, Junichi Yamagishi, et al. 2020. ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech. Computer Speech & Language 64, (November 2020), 101114.
Humans
Telephone call
Social media
Speaker verification
Services
Random guess
Human ears differ from speaker verification models
Unit-selection
HMM-DNN
Equal error rates (%)
X-vector-based speaker verification model
52
Misuse of speech generation
Jee-weon Jung, Xin Wang, Nicholas Evans, Shinji Watanabe, Hye-jin Shim, Hemlata Tak, Siddhant Arora, Junichi Yamagishi, and Joon Son Chung. 2024. To what extent can ASV systems naturally defend against spoofing attacks? In Proc. Interspeech, 2024. 3240–3244.
Generator
Attacker
User
Humans
Telephone call
Social media
Speaker verification
Services
Transmission
But both may be fooled by speech deepfakes
See the latest study on spoofing & speaker verification (Jung 2024)
Human ears differ from speaker verification models
53
Speech Deepfake Detection
How do we do binary detection?
Poor generalization performance
https://en.wikipedia.org/wiki/Overfitting
54
Deepfake detection
Generator
Attacker
Humans
Telephone call
Social media
Speaker verification
Services
Transmission
55
Deepfake detection
See alternatives in Abdenour Hadid, Nicholas Evans, Sebastien Marcel, and Julian Fierrez. 2015. Biometrics systems under spoofing attack: an evaluation methodology and lessons learned. IEEE Signal Processing Magazine 32, 5 (2015), 20–30.
Attacker
Humans
Telephone call
Social media
Speaker verification
Services
Defender
Detector
Generator
REAL
FAKE
Transmission
Other names:
anti-spoofing, spoofing countermeasure, presentation attack detection
56
Deepfake detection
Goal: detection with no or limited prior knowledge of
Detector
Generator
REAL
FAKE
Attacker
Defender
En,
Jp,
Zh,
…
VoIP
mp3
…
Transmission
Generalization
57
Deepfake detection
Detector
Generator
REAL
FAKE
Attacker
Transmission
Detector
Feature extraction
(front end)
Classifier
(back end)
Input wav.
(ACCEPT)
(REJECT)
(ACCEPT)
(REJECT)
58
Deepfake detection
Spectrogram
Average over high-frequency band
Frame index
Freq. bin
Freq. bin
Frame index
Amplitude
S1
S2
Feature extraction
(front end)
Classifier
(back end)
Input wav.
Toy example
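The toy example can be written down directly: the front end reduces a spectrogram to one number (the average magnitude in the high-frequency band) and the back end compares it with a threshold. The spectrogram values, the band boundary, the threshold, and the assumption that spoofed audio has the weaker high band are all made up for illustration.

```python
def high_band_average(spectrogram, first_high_bin):
    """Front end: average magnitude over the high-frequency bins of all frames."""
    values = [mag for frame in spectrogram for mag in frame[first_high_bin:]]
    return sum(values) / len(values)

def toy_detector(spectrogram, first_high_bin=2, threshold=0.5):
    """Back end: threshold the single feature. Assumes fakes have weak high bands."""
    score = high_band_average(spectrogram, first_high_bin)
    return "REAL" if score >= threshold else "FAKE"

# Hypothetical spectrograms: rows are frames, columns are frequency bins (low to high).
real_speech = [[1.0, 0.8, 0.7, 0.9], [1.1, 0.9, 0.8, 0.6]]
old_tts     = [[1.0, 0.8, 0.1, 0.2], [1.1, 0.9, 0.2, 0.1]]
new_tts     = [[1.0, 0.8, 0.6, 0.7], [1.1, 0.9, 0.7, 0.6]]

results = [toy_detector(s) for s in (real_speech, old_tts, new_tts)]
print(results)
```

The hand-picked cue separates the first two inputs, but a generator that reproduces the high band is accepted as REAL. A fixed artifact-based rule does not generalize to attacks that lack that artifact.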
59
Deepfake detection
A new TTS attack
Frame index
Freq. bin
Freq. bin
Frame index
Amplitude
Spectrogram
Average over high-frequency band
Feature extraction
(front end)
Classifier
(back end)
Input wav.
Toy example
60
Deepfake detection
Detector
Generator
REAL
FAKE
Attacker
Transmission
(ACCEPT)
(REJECT)
Detector
Feature extraction
(front end)
Classifier
(back end)
Training data
Cross-entropy
Model training
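Training with cross-entropy can be illustrated with the smallest possible back end: a logistic-regression classifier on a single scalar feature, fitted by gradient descent. The six feature values, the learning rate, and the number of epochs are assumptions for the sketch; real detectors train deep networks on thousands of utterances, but the loss is the same.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(features, labels, w, b):
    """Mean binary cross-entropy; label 1 = REAL, label 0 = FAKE."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in zip(features, labels):
        p = sigmoid(w * x + b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(features)

def train(features, labels, lr=0.5, epochs=2000):
    """Plain batch gradient descent on the cross-entropy loss."""
    w, b = 0.0, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(w * x + b) - y  # derivative of the loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

# Hypothetical front-end features: three real and three fake utterances.
features = [2.0, 1.5, 1.8, 0.2, 0.5, 0.1]
labels = [1, 1, 1, 0, 0, 0]

loss_before = cross_entropy(features, labels, 0.0, 0.0)
w, b = train(features, labels)
loss_after = cross_entropy(features, labels, w, b)
predictions = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in features]
print(loss_before, loss_after, predictions)
```

A low training loss says nothing about utterances from generators or channels that were not in the training data, which is the generalization issue discussed later.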
61
Deepfake detection
Detector
Feature extraction
(front end)
Classifier
(back end)
Test data
False acceptance
False rejection
Decision threshold
Decision
Equal error rate (EER)
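The equal error rate is the operating point where the false acceptance rate (fake accepted as real) equals the false rejection rate (real rejected as fake). A minimal sketch, assuming that higher scores mean "more likely real" and using made-up scores; on small sets the two rates may not meet exactly, so the code returns the threshold where they are closest.

```python
def error_rates(real_scores, fake_scores, threshold):
    """Accept an utterance as REAL when score >= threshold."""
    far = sum(s >= threshold for s in fake_scores) / len(fake_scores)  # false acceptance
    frr = sum(s < threshold for s in real_scores) / len(real_scores)   # false rejection
    return far, frr

def equal_error_rate(real_scores, fake_scores):
    """Sweep candidate thresholds and return (EER, threshold) where FAR and FRR are closest."""
    best = None
    for threshold in sorted(set(real_scores) | set(fake_scores)):
        far, frr = error_rates(real_scores, fake_scores, threshold)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, threshold)
    return best[1], best[2]

# Hypothetical detector scores on a test set.
real_scores = [0.9, 0.8, 0.7, 0.35]
fake_scores = [0.1, 0.2, 0.3, 0.6]

eer, eer_threshold = equal_error_rate(real_scores, fake_scores)
print(eer, eer_threshold)
```

Moving the threshold trades one error type for the other; the EER summarizes the detector with a single number that does not depend on a chosen threshold.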
62
Deepfake detection
Menglu Li, Yasaman Ahmadiadli, and Xiao-Ping Zhang. 2025. A Survey on Speech Deepfake Detection. ACM Comput. Surv. 57, 7 (July 2025), 1–38.
Jiangyan Yi, Chenglong Wang, Jianhua Tao, Xiaohui Zhang, Chu Yuan Zhang, and Yan Zhao. 2023. Audio Deepfake Detection: A Survey.
Xin Wang and Junichi Yamagishi. 2022. A Practical Guide to Logical Access Voice Presentation Attack Detection. In Frontiers in Fake Media Generation and Detection. Springer
Detector
Feature extraction
(front end)
Classifier
(back end)
Training/test data
Features
Waveform
(DSP) LFCC, CQCC
(DNN) Self-supervised learning
…
Classifiers
(linear) GMM, …
(DNN) ResNet, LCNN,
AASIST, RawNet
Mamba,
Transformer …
Databases
ASVspoof series
ADD 22, 23
In-the-wild
MLAAD
…
Metrics
EER,
t-DCF, t-EER
Cllr
…
63
Deepfake detection
Xin Wang, Junichi Yamagishi, et al. 2020. ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech. Computer Speech & Language 64, (November 2020), 101114.
The average detection error rate is about 3% on the ASVspoof 2019 LA dataset
Detector: CQCC + LCNN
Random guess
HMM-DNN, detection error rates = 0%
Detection error rates (%)
voice conversion
3.11%
average
Tacotron2
ASR+TTS(WaveNet)
64
Deepfake detection
Xin Wang, Junichi Yamagishi, et al. 2020. ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech. Computer Speech & Language 64, (November 2020), 101114.
The average detection error rate is about 3% on the ASVspoof 2019 LA dataset
Detector: CQCC + LCNN
Random guess
HMM-DNN, detection error rates = 0%
Detection error rates (%)
3.11%
average
Are we safe from deepfakes? NO!
65
Issue with generalization
Detector
Generator
REAL
FAKE
Attacker
Defender
En
Transmission
Training set
Test set
En
66
Issue with generalization
Detector
Generator
REAL
FAKE
Attacker
Defender
En
Jp
Zh
Transmission
Training set
Test set
En
mp3
m4a
…
More errors when training and test sets mismatch
67
Issue with generalization
Xin Wang, and Junichi Yamagishi. Investigating Self-Supervised Front Ends for Speech Spoofing Countermeasures. Proc. Odyssey 2022.
Audio example from https://docs.pytorch.org/audio/2.6.0/tutorials/effector_tutorial.html#mp3
Detector
Generator
REAL
FAKE
Attacker
Defender
Transmission
codec
mp3,…
EER (%)
3.11
22.79
24.88
Test sets
2019LA
2021LA
2021DF
Training set
more
attacks
More errors when training and test sets mismatch
Same domain
original
mp3
68
Issue with generalization
Internal experiment
See similar study in Nicolas Müller, Franziska Dieckmann, et al. 2021. Speech is silver, silence is golden: What do ASVspoof-trained models really learn? In Proc. ASVspoof challenge workshop, 2021. 55–60.
Detector
Generator
REAL
FAKE
Attacker
Defender
Transmission
codec
mp3,…
EER (%)
3.11
22.79
24.88
Test sets
2019LA
2021LA
2021DF
Training set
more
attacks
More errors when training and test sets mismatch
~15
Same domain
69
Issue with generalization
Müller, Nicolas, Franziska Dieckmann, Pavel Czempin, Roman Canals, Konstantin Böttinger, and Jennifer Williams. 2021. “Speech Is Silver, Silence Is Golden: What Do ASVspoof-Trained Models Really Learn?” In Proc. ASVspoof Challenge Workshop, 55–60.
Shim, Hye-jin, Rosa Gonzalez Hautamäki, Md Sahidullah, and Tomi Kinnunen. 2023. “How to Construct Perfect and Worse-than-Coin-Flip Spoofing Countermeasures: A Word of Warning on Shortcut Learning.” In Proc. INTERSPEECH 2023, 785–89. https://doi.org/10.21437/Interspeech.2023-1901.
Detector
Generator
REAL
FAKE
Attacker
Defender
Trans.
Detection error rate x5
?
Human
Synthetic
70
Issue with generalization
Hemlata Tak, Massimiliano Todisco, Xin Wang, Jee-weon Jung, Junichi Yamagishi, and Nicholas Evans. 2022. Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation. In Proc. Odyssey, 2022. 112–119.
Xin Wang and Junichi Yamagishi. 2022. Investigating Self-Supervised Front Ends for Speech Spoofing Countermeasures. In Proc. Odyssey, 2022. 100–106.
Xin Wang and Junichi Yamagishi. 2023. Spoofed Training Data for Speech Spoofing Countermeasure Can Be Efficiently Created Using Neural Vocoders. In Proc. ICASSP, June 2023
Xin Wang and Junichi Yamagishi. 2024. Can Large-Scale Vocoded Spoofed Data Improve Speech Spoofing Countermeasure with a Self-Supervised Front End? In Proc. ICASSP, April 14, 2024. 10311–10315
Wanying Ge, Xin Wang, Xuechen Liu, and Junichi Yamagishi. 2025. Post-training for Deepfake Speech Detection. In Proc. ASRU, December 06, 2025. IEEE, Honolulu, HI, USA, 1–8
Xin Wang, Wanying Ge, and Junichi Yamagishi. 2026. Does Fine-tuning by Reinforcement Learning Improve Generalization in Binary Speech Deepfake Detection?
Detector
Generator
REAL
FAKE
Attacker
Defender
En,
Jp,
Zh,
…
VoIP
mp3
…
Trans.
Unseen
Work by us (advertisement :)
71
Issue with generalization
How could you claim generalization to unseen data if you haven’t seen the unseen data?
Xin Wang, Héctor Delgado, Nicholas Evans, Xuechen Liu, Tomi Kinnunen, Hemlata Tak, Kong Aik Lee, Ivan Kukanov, Md Sahidullah, Massimiliano Todisco, and Junichi Yamagishi. 2026. ASVspoof 5: Evaluation of spoofing, deepfake, and adversarial attack detection using crowdsourced speech. IEEE Transactions on Audio, Speech, and Language Processing (2026).
Detector
Generator
REAL
FAKE
Attacker
Defender
En,
Jp,
Zh,
…
VoIP
mp3
…
Trans.
Unseen
Ongoing research topic (see latest ref.)
72
Beyond deepfake detection
Attacker
Defender
Detector
Generator
REAL
FAKE
Trans.
Generator
API
Detector
REAL
FAKE
Trans.
Dataset
Generator
Training
73
Beyond deepfake detection
Attacker
Defender
Watermark detector
Generator
Trans.
Generator
API
Collaborator
Watermark detector
Message
Trans.
Dataset
Generator
Training
Watermark
w
Message
w
w
Generated by HuggingFace API
Generated by HuggingFace API
74
Beyond deepfake detection
Attacker
Defender
Watermark detector
Generator
Trans.
Generator
API
Collaborator
Watermark detector
Message
Trans.
Dataset
Generator
Training
Watermark
net
w
Message
w
w
Generated by HuggingFace API
Generated by HuggingFace API
Anonymization
75
Proactive defense: speech watermark
76
Watermark – post-processing approach
Watermark detector
Generator
Trans.
Generator
API
Watermark
w
Generator (wrapped via API)
TTS
Watermark embedder
Watermark detector
Message 101011…
101011…
General case
Message
w
77
Watermark – post-processing approach
Watermark detector
Generator
Trans.
Generator
API
Watermark
w
Generator (wrapped via API)
TTS
Watermark embedder
Watermark detector
TTS ID X, creator N
TTS ID X, creator N
General case
Message
Signature of TTS
w
78
Watermark – post-processing approach
Guangyu Chen, Yu Wu, Shujie Liu, Tao Liu, Xiaoyong Du, and Furu Wei. 2023. Wavmark: Watermarking for audio generation. arXiv preprint arXiv:2308.12770 (2023).
Robin San Roman, Pierre Fernandez, Alexandre Défossez, Teddy Furon, Tuan Tran, and Hady Elsahar. 2024. Proactive detection of voice cloning with localized watermarking. In Proc. ICML, 2024.
Mayank Kumar Singh, Naoya Takahashi, Weihsiang Liao, and Yuki Mitsufuji. 2024. SilentCipher: Deep audio watermarking. In Proc. Interspeech, 2024. 2235–2239.
Yixin Liu, Lie Lu, Jihui Jin, Lichao Sun, and Andrea Fanelli. 2025. XAttnMark: Learning Robust Audio Watermarking with Cross-Attention.
API,
Checkpoint
Watermark detector
Generator
Trans.
Generator
building
Watermark
w
Generator (wrapped via API)
TTS
Watermark embedder
Watermark detector
TTS ID X, creator N
TTS ID X, creator N
General case
Message
Signature of TTS
w
Microsoft: WavMark (Chen 2023)
Meta: AudioSeal (Roman 2024)
Dolby: XAttnMark (Liu 2025)
Sony: SilentCipher (Singh 2024)
…
79
Watermark – post-processing approach
TTS
Watermark embedder
Watermark detector
Message 101011…
101011…
80
Watermark – post-processing approach
L. F. Turner, “Digital data security system.” Patent IPN WO 89/08915, 1989.
Figure from Heiga Zen. 2017. Generative model-based text-to-speech synthesis.
TTS
Watermark embedder
Watermark detector
Message 0
0
83
Watermark – post-processing approach
L. F. Turner, “Digital data security system.” Patent IPN WO 89/08915, 1989.
Figure from Heiga Zen. 2017. Generative model-based text-to-speech synthesis.
TTS
Watermark embedder
Watermark detector
Message 1
1
84
Watermark – post-processing approach
L. F. Turner, “Digital data security system.” Patent IPN WO 89/08915, 1989.
Figure from Heiga Zen. 2017. Generative model-based text-to-speech synthesis.
TTS
Watermark embedder
Watermark detector
Message 00
00
85
Watermark – post-processing approach
L. F. Turner, “Digital data security system.” Patent IPN WO 89/08915, 1989.
Figure from Heiga Zen. 2017. Generative model-based text-to-speech synthesis.
TTS
Watermark embedder
Watermark detector
Message 000
000
Trade-off between capacity & fidelity
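One classic way to make the capacity and fidelity trade-off concrete is least-significant-bit (LSB) substitution: message bits overwrite the lowest bits of each integer audio sample. Whether the figures on these slides use exactly this scheme is an assumption, and the host signal, message, and function names below are hypothetical.

```python
import math

def embed_lsb(samples, bits, bits_per_sample=1):
    """Overwrite the lowest `bits_per_sample` bits of successive samples with the message."""
    mask = (1 << bits_per_sample) - 1
    marked = list(samples)
    for idx in range(len(bits) // bits_per_sample):
        chunk = bits[idx * bits_per_sample:(idx + 1) * bits_per_sample]
        value = int("".join(str(b) for b in chunk), 2)
        marked[idx] = (marked[idx] & ~mask) | value
    return marked

def extract_lsb(samples, n_bits, bits_per_sample=1):
    """Read the message back from the lowest bits of successive samples."""
    mask = (1 << bits_per_sample) - 1
    bits = []
    for idx in range(n_bits // bits_per_sample):
        value = samples[idx] & mask
        bits.extend(int(c) for c in format(value, "0{}b".format(bits_per_sample)))
    return bits

# Hypothetical host audio (integer samples) and a 48-bit message.
host = [round(8000 * math.sin(2 * math.pi * n / 50)) for n in range(200)]
message = [1, 0, 1, 0, 1, 1] * 8

report = {}
for k in (1, 2, 4):
    marked = embed_lsb(host, message, bits_per_sample=k)
    recovered = extract_lsb(marked, len(message), bits_per_sample=k)
    max_change = max(abs(a - b) for a, b in zip(host, marked))
    report[k] = (recovered == message, max_change)
print(report)
```

Packing k bits into each sample raises the capacity k-fold, but the worst-case change per sample grows to 2^k - 1, so more payload means more audible distortion.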
86
Watermark – post-processing approach
Figure from Heiga Zen. 2017. Generative model-based text-to-speech synthesis.
TTS
Watermark embedder
Watermark detector
Message 0
1?
Trans.
noise, reverb, codec, …
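A fragile watermark shows why robustness to transmission is the hard part. In the sketch below, a 1-bit LSB watermark (an assumed scheme, chosen only because it is easy to break) is read back perfectly from the untouched signal, but dropping the lowest bit of every sample, a crude stand-in for lossy coding, erases it.

```python
import math

def embed_bits(samples, bits):
    """Write one message bit into the least-significant bit of each sample."""
    marked = list(samples)
    for i, bit in enumerate(bits):
        marked[i] = (marked[i] & ~1) | bit
    return marked

def extract_bits(samples, n_bits):
    return [samples[i] & 1 for i in range(n_bits)]

def requantize(samples, dropped_bits=1):
    """Crude channel model (assumption): discard the lowest bits of every sample."""
    return [(s >> dropped_bits) << dropped_bits for s in samples]

# Hypothetical host audio and message.
host = [round(8000 * math.sin(2 * math.pi * n / 50)) for n in range(200)]
message = [1, 0, 1, 0, 1, 1] * 8

marked = embed_bits(host, message)
clean_read = extract_bits(marked, len(message))
degraded_read = extract_bits(requantize(marked), len(message))
bit_errors = sum(a != b for a, b in zip(message, degraded_read))
print(clean_read == message, bit_errors, len(message))
```

Every 1 in the message is read as 0 after the channel. Practical watermarks therefore spread each bit over many samples or train the embedder and detector jointly against noise, reverberation, and codecs.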
87
Watermark – post-processing approach
F. Cayre, C. Fontaine, and T. Furon. 2005. Watermarking security: theory and practice. IEEE Trans. Signal Process. 53, 10 (October 2005), 3976–3987.
TTS
Watermark embedder
Watermark detector
Message 1010101
1010101
Trans.
Focus of most recent papers
88
Watermark – post-processing approach
Guang Hua, Jiwu Huang, Yun Q Shi, Jonathan Goh, and Vrizlynn LL Thing. 2016. Twenty years of digital audio watermarking—a comprehensive review. Signal processing 128, (2016), 222–242.
Guangyu Chen, Yu Wu, Shujie Liu, Tao Liu, Xiaoyong Du, and Furu Wei. 2023. Wavmark: Watermarking for audio generation. arXiv preprint arXiv:2308.12770 (2023).
Yizhu Wen, Ashwin Innuganti, Aaron Bien Ramos, Hanqing Guo, and Qiben Yan. 2025. SoK: How Robust is Audio Watermarking in Generative AI models?
TTS
Watermark embedder
Watermark detector
Message 1010101
1010101
Trans.
(Chen 2023)
Patchwork
DNN
not sufficiently robust (Chen 2023, Wen 2025)
89
Watermark – post-processing approach
Guangyu Chen, Yu Wu, Shujie Liu, Tao Liu, Xiaoyong Du, and Furu Wei. 2023. Wavmark: Watermarking for audio generation. arXiv preprint arXiv:2308.12770 (2023).
Chang Liu, Jie Zhang, Tianwei Zhang, Xi Yang, Weiming Zhang, and Nenghai Yu. 2023. Detecting voice cloning attacks via timbre watermarking. In Proc. Network and Distributed System Security Symposium, 2023.
Robin San Roman, Pierre Fernandez, Alexandre Défossez, Teddy Furon, Tuan Tran, and Hady Elsahar. 2024. Proactive detection of voice cloning with localized watermarking. In Proc. ICML, 2024.
TTS
Watermark embedder
Watermark detector
Message 1010101
1010101
Trans.
DNN
DNN
90
Watermark – AudioSeal
Robin San Roman, Pierre Fernandez, Alexandre Défossez, Teddy Furon, Tuan Tran, and Hady Elsahar. 2024. Proactive detection of voice cloning with localized watermarking. In Proc. ICML, 2024.
TTS
Watermark embedder
Watermark detector
Message 1010101
1010101
Trans.
+
Filters,
MP3,
AAC,
91
Watermark – AudioSeal
Robin San Roman, Pierre Fernandez, Alexandre Défossez, Teddy Furon, Tuan Tran, and Hady Elsahar. 2024. Proactive detection of voice cloning with localized watermarking. In Proc. ICML, 2024.
+
Filters,
MP3,
AAC,
To preserve quality
To detect the watermark bits
92
Watermark – Timbre
Chang Liu, Jie Zhang, Tianwei Zhang, Xi Yang, Weiming Zhang, and Nenghai Yu. 2023. Detecting voice cloning attacks via timbre watermarking. In Proc. Network and Distributed System Security Symposium, 2023.
TTS
Watermark embedder
Watermark detector
Message 1010101
1010101
Trans.
93
Watermark – help deepfake detection?
Chia-Hua Wu, Wanying Ge, Xin Wang, Junichi Yamagishi, Yu Tsao, and Hsin-Min Wang. 2025. A Comparative Study on Proactive and Passive Detection of Deepfake Speech. In Proc. Interspeech, June 17, 2025. 5328--5332.
TTS
Watermark embedder
Watermark detector
1 or 0
1 or 0
Trans.
Detector
TTS
REAL
FAKE
Trans.
Conventional deepfake detection
Watermark-based detection
VS
AudioSeal, Timbre
AASIST, SSL-AASIST
94
Watermark – help deepfake detection?
Chia-Hua Wu, Wanying Ge, Xin Wang, Junichi Yamagishi, Yu Tsao, and Hsin-Min Wang. 2025. A Comparative Study on Proactive and Passive Detection of Deepfake Speech. In Proc. Interspeech, June 17, 2025. 5328--5332.
TTS
Watermark embedder
Watermark detector
1 or 0
1 or 0
Trans.
Detector
TTS
REAL
FAKE
Trans.
Conventional deepfake detection
Watermark-based detection
watermark-based
conventional detectors
No transmission
Watermark EER ~0%
Trans.
95
Watermark – help deepfake detection?
Chia-Hua Wu, Wanying Ge, Xin Wang, Junichi Yamagishi, Yu Tsao, and Hsin-Min Wang. 2025. A Comparative Study on Proactive and Passive Detection of Deepfake Speech. In Proc. Interspeech, June 17, 2025. 5328--5332.
TTS
Watermark embedder
Watermark detector
1 or 0
1 or 0
Trans.
Detector
TTS
REAL
FAKE
Trans.
Conventional deepfake detection
Watermark-based detection
watermark-based
conventional detectors
No transmission
Watermark EER ~0%
Watermark EER >50%
Trans.
Audio codec
96
Watermark – challenges
Slavko Kovačević, Murilo Z. Silvestre, Kosta Pavlović, Petar Nedić, and Igor Djurović. 2025. DeepMark Benchmark: Redefining Audio Watermarking Robustness. March 06, 2025.
Patrick O’Reilly, Zeyu Jin, Jiaqi Su, and Bryan Pardo. 2025. Deep Audio Watermarks are Shallow: Limitations of Post-Hoc Watermarking Techniques for Speech.
Yigitcan Özer, Woosung Choi, Joan Serrà, Mayank Kumar Singh, Wei-Hsiang Liao, and Yuki Mitsufuji. 2025. A comprehensive real-world assessment of audio watermarking algorithms: Will they survive neural codecs? In Proc. Interspeech, 2025. 5113–5117.
Yigitcan Özer, Wanying Ge, Zhe Zhang, Xin Wang, and Junichi Yamagishi. 2026. Self Voice Conversion as an Attack against Neural Audio Watermarking.
TTS
Watermark embedder
Watermark detector
1 or 0
1 or 0
Trans.
(Kovačević 2025)
DNN codecs remove watermark
(O’Reilly 2025, Kovačević 2025, Özer 2025, Özer 2026)
Speech enhancement
DNN tokenizer (codec)
Neural vocoder (GAN)
Variational auto-encoder
Diffusion model …
All watermark models
failed the test
97
Watermark – challenges
Slavko Kovačević, Murilo Z. Silvestre, Kosta Pavlović, Petar Nedić, and Igor Djurović. 2025. DeepMark Benchmark: Redefining Audio Watermarking Robustness. March 06, 2025.
TTS
Watermark embedder
Watermark detector
1 or 0
1 or 0
Trans.
Another watermark
collusion attack
All watermark models are vulnerable
98
Watermark – summary
Yizhu Wen, Ashwin Innuganti, Aaron Bien Ramos, Hanqing Guo, and Qiben Yan. 2025. SoK: How Robust is Audio Watermarking in Generative AI models?
Watermark can be useful
Better robustness is required
Better design for security is required
Ongoing research topic (see latest ref.)
TTS
Watermark embedder
Watermark detector
1 or 0
1 or 0
Trans.
99
Speaker Anonymization
(skip this part)
Anonymization
100
Towards Anonymization
API,
Checkpoint
Attacker
Defender (user)
Watermark detector
Generator
TRUE
FALSE
Trans.
Generator
building
Collaborator
Watermark detector
TRUE
FALSE
Trans.
Dataset
Generator
Training
Watermark
Anonymization
w
w
w
Remove or hide voiceprint
101
Anonymization
Natalia Tomashenko, Brij Mohan Lal Srivastava, Xin Wang, Emmanuel Vincent, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Jose Patino, Jean-François Bonastre, Paul-Gauthier Noé, and Massimiliano Todisco. 2020. Introducing the VoicePrivacy initiative. In Proc. Interspeech, October 2020. ISCA, 1693–1697.
Anonymization
Pseudo speaker
same speaker?
102
103
Anonymization – example system
Natalia Tomashenko, Brij Mohan Lal Srivastava, Xin Wang, Emmanuel Vincent, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Jose Patino, Jean-François Bonastre, Paul-Gauthier Noé, and Massimiliano Todisco. 2020. Introducing the VoicePrivacy Initiative. In Proc. Interspeech
https://www.nii.ac.jp/today/103/8.html
Anonymization
Linguistic
Paralinguistic
104
Anonymization – algorithms
Fuming Fang, Xin Wang, Junichi Yamagishi, Isao Echizen, Massimiliano Todisco, Nicholas Evans, and Jean-Francois Bonastre. 2019. Speaker anonymization using X-vector and neural waveform models. In Proc. SSW, 2019. 155–160.
Natalia Tomashenko, Brij Mohan Lal Srivastava, Xin Wang, Emmanuel Vincent, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Jose Patino, Jean-François Bonastre, Paul-Gauthier Noé, and Massimiliano Todisco. 2020. Introducing the VoicePrivacy Initiative. In Proc. Interspeech
input
Pool of speakers
105
Anonymization – algorithms
Fuming Fang, Xin Wang, Junichi Yamagishi, Isao Echizen, Massimiliano Todisco, Nicholas Evans, and Jean-Francois Bonastre. 2019. Speaker anonymization using X-vector and neural waveform models. In Proc. SSW, 2019. 155–160.
Natalia Tomashenko, Brij Mohan Lal Srivastava, Xin Wang, Emmanuel Vincent, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Jose Patino, Jean-François Bonastre, Paul-Gauthier Noé, and Massimiliano Todisco. 2020. Introducing the VoicePrivacy Initiative. In Proc. Interspeech
input
Pool of speakers
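A rough sketch of how a pseudo speaker can be drawn from a pool of speakers, assuming speaker embeddings (x-vectors) are plain lists of floats. The selection rule used here (rank the pool by cosine similarity to the input, keep the farthest candidates, average a random subset) is an assumption about the cited x-vector approach. The function name `pseudo_speaker` and its parameters are hypothetical.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def pseudo_speaker(source, pool, n_farthest, n_select, seed):
    """Average a random subset of the pool vectors least similar to source."""
    ranked = sorted(pool, key=lambda v: cosine(source, v))  # least similar first
    candidates = ranked[:n_farthest]
    chosen = random.Random(seed).sample(candidates, n_select)
    dim = len(source)
    return [sum(v[d] for v in chosen) / n_select for d in range(dim)]


# Toy 2-dimensional embeddings (real x-vectors have hundreds of dimensions).
source = [1.0, 0.0]
pool = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [-0.9, -0.1]]
target = pseudo_speaker(source, pool, n_farthest=2, n_select=2, seed=0)
```

The resulting vector would then replace the input speaker embedding before the waveform is regenerated. Changing `seed` changes the pseudo speaker when the candidate set is larger than the selected subset.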
106
Anonymization – algorithms
Pierre Champion. 2023. Anonymizing Speech: Evaluating and Designing Speaker Anonymization Techniques.
Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi, and Natalia Tomashenko. 2023. Speaker Anonymization Using Orthogonal Householder Neural Network. IEEE/ACM Trans. Audio Speech Lang. Process. 31, (2023), 3681–3695.
input
Pool of speakers
107
Anonymization – algorithms
Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi, and Natalia Tomashenko. 2023. Speaker Anonymization Using Orthogonal Householder Neural Network. IEEE/ACM Trans. Audio Speech Lang. Process. 31, (2023), 3681–3695.
Sarina Meyer, Pascal Tilli, Pavel Denisov, Florian Lux, Julia Koch, and Ngoc Thang Vu. 2023. Anonymizing speech with generative adversarial networks to preserve speaker privacy. In Proc. SLT, 2023. IEEE, 912–919
108
Anonymization – evaluation
Natalia Tomashenko, Brij Mohan Lal Srivastava, Xin Wang, Emmanuel Vincent, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Jose Patino, Jean-François Bonastre, Paul-Gauthier Noé, and Massimiliano Todisco. 2020. Introducing the VoicePrivacy initiative. In Proc. Interspeech, October 2020. ISCA, 1693–1697.
Anonymization
speech recognition (ASR),
human listening test
speaker verification (ASV)
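For the speech recognition side of the evaluation, a common utility measure is the word error rate (WER) of an ASR system on the anonymized speech. The sketch below computes WER from a reference and a hypothesis transcript with a word-level edit distance. The function name is illustrative and the reference is assumed to be non-empty.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ref_word != hyp_word)))  # substitution
        prev = cur
    return prev[-1] / len(ref)


wer = word_error_rate("marianna made the marmalade",
                      "marianna made a marmalade")
```

A good anonymization system keeps this number close to the WER on the original speech while making speaker verification fail.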
109
Anonymization – evaluation
Michele Panariello, Natalia Tomashenko, Xin Wang, Xiaoxiao Miao, Pierre Champion, Hubert Nourtel, Massimiliano Todisco, Nicholas Evans, Emmanuel Vincent, and Junichi Yamagishi. 2024. The VoicePrivacy 2022 Challenge: Progress and Perspectives in Voice Anonymisation. IEEE/ACM Trans. Audio Speech Lang. Process. (2024), 1–14.
Anonymization
Speaker verification
SAME
NOT SAME
Attacker
Collaborator
110
Anonymization – evaluation
Michele Panariello, Natalia Tomashenko, Xin Wang, Xiaoxiao Miao, Pierre Champion, Hubert Nourtel, Massimiliano Todisco, Nicholas Evans, Emmanuel Vincent, and Junichi Yamagishi. 2024. The VoicePrivacy 2022 Challenge: Progress and Perspectives in Voice Anonymisation. IEEE/ACM Trans. Audio Speech Lang. Process. (2024), 1–14.
Anonymization
Speaker verification
SAME
NOT SAME
Attacker
Collaborator
See more in the latest VoicePrivacy report (Panariello 2024)
Anonymization
Semi-informed
random seed N
random
seed M
111
Anonymization – best systems in 2022
Michele Panariello, Natalia Tomashenko, Xin Wang, Xiaoxiao Miao, Pierre Champion, Hubert Nourtel, Massimiliano Todisco, Nicholas Evans, Emmanuel Vincent, and Junichi Yamagishi. 2024. The VoicePrivacy 2022 Challenge: Progress and Perspectives in Voice Anonymisation. IEEE/ACM Trans. Audio Speech Lang. Process. (2024), 1–14.
Anonymization
Speaker verification
SAME
NOT SAME
Attacker
Collaborator
Anonymization
Semi-informed
random seed
random
seed’
Speech recognition error rate
Speaker recognition error rate
GAN-based system
Input speech
112
Anonymization – best systems in 2022
Tomashenko, N., Vincent, E., Tommasi, M. (2025) Exploiting Context-dependent Duration Features for Voice Anonymization Attack Systems. Proc. Interspeech 2025, 5128-5132, doi: 10.21437/Interspeech.2025-2317
Anonymization
Duration-based verification
SAME
NOT SAME
Collaborator
random seed
Speech recognition error rate
Speaker recognition error rate
GAN-based system
The original speaker can be recognized from phoneme durations
113
Anonymization – limitations
Anonymization
Feature-based verification
SAME
NOT SAME
Collaborator
random seed
Speaker-related information: accent, word choice
114
Anonymization – summary
Anonymization
Feature-based verification
SAME
NOT SAME
Collaborator
random seed
https://www.voiceprivacychallenge.org/
115
Summary
116
Summary
117
Summary
Generator
Attacker
Humans
Telephone call
Social media
Speaker verification
Services
Transmission
118
Summary
Attacker
Humans
Telephone call
Social media
Speaker verification
Services
Defender
Detector
Generator
REAL
FAKE
Transmission
119
Summary
Attacker
Defender
Watermark detector
Generator
Watermarked
Not watermarked
Transmission
Generator
API
Watermark
w
w
120
Summary
Signal processing
Computer science
signal & system, filters, …
information theory, search, …
pattern recognition
decision theory
deep neural networks
Machine learning
Linguistic
phonetics, phonology, …
121
Thank you
all the slides
https://tonywangx.github.io/slide.html#talk
PhD, postdoc, intern
122
A system should be secure, even if everything about the system, except the key, is public knowledge.
Kerckhoffs's principle
123
Reference
145
Speaker verification
146
Voice verification
Voice verification
Feature extraction
Feature extraction
Comparison
Same person?
Not the same
John’s real voice
Same person?
Not the same
I am John
147
Voice verification
Animation from Speech production and articulation knowledge group https://sail.usc.edu/span/rtmri_ipa/je_2015.html
Vocal tract
Larynx
[a]
[i]
[o]
Linguistic: what we say
Word sequence, …
148
Voice verification
Animation from Speech production and articulation knowledge group https://sail.usc.edu/span/rtmri_ipa/je_2015.html
Vocal tract
Larynx
[a]
[i]
[o]
[a]
Linguistic: what we say
Word sequence, …
Paralinguistic: how we say
Emotion, …
149
Voice verification
Vocal tract
Larynx
Speaker A
Speaker B
[a]
[a]
Biometric: what voices sound like
Vocal tract and vocal cord size, …
Other
biometrics
150
Voice verification
Tomi Kinnunen and Haizhou Li. 2010. An overview of text-independent speaker recognition: From features to supervectors. Speech communication 52, 1 (2010), 12–40.
John H L Hansen and Taufiq Hasan. 2015. Speaker recognition by machines and humans: A tutorial review. IEEE Signal processing magazine 32, 6 (2015), 74–99.
Frédéric Bimbot, Jean-François Bonastre, Corinne Fredouille, Guillaume Gravier, Ivan Magrin-Chagnolleau, Sylvain Meignier, Teva Merlin, Javier Ortega-García, Dijana Petrovska-Delacrétaz, and Douglas A Reynolds. 2004. A tutorial on text-independent speaker verification. EURASIP Journal on Advances in Signal Processing 2004, 4 (2004), 101962.
Vocal tract
Larynx
Similarity score s
Feature
extraction
Classification
(decision)
Check tutorials (Bimbot 2004, Kinnunen 2010, Hansen 2015)
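A minimal sketch of the scoring and decision steps, assuming that feature extraction has already turned each utterance into a fixed-length speaker embedding and that the similarity score s is the cosine similarity (a common choice, not stated on the slide). The embeddings and the threshold value are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Similarity score s between two speaker embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify(enrolled, test, threshold):
    """Accept the claimed identity when the score reaches the threshold."""
    score = cosine_similarity(enrolled, test)
    return "ACCEPT" if score >= threshold else "REJECT"


# Hypothetical embeddings: John's enrolled voice and two test utterances.
john_enrolled = [1.0, 0.0]
john_again = [0.9, 0.1]
someone_else = [0.0, 1.0]

decision_same = verify(john_enrolled, john_again, threshold=0.5)
decision_diff = verify(john_enrolled, someone_else, threshold=0.5)
```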
151
Voice verification - decision
Similarity score s
ACCEPT
SAME speaker
REJECT
DIFFERENT speakers
[Figure: similarity scores s of individual trials placed along the score axis]
152
Voice verification - decision
score s
False acceptance
False rejection
Decision threshold
153
Voice verification - decision
score s
Decision threshold
False acceptance
False rejection
154
Voice verification - decision
score s
Decision threshold
False acceptance
False rejection
155
Voice verification - decision
score s
False acceptance
False rejection
Decision threshold
Find the decision threshold at which P_FA = P_FR (false acceptance rate equals false rejection rate)
The error rate at this threshold is the Equal Error Rate (EER)
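A small sketch of how the EER can be estimated from two lists of scores, assuming higher scores mean "same speaker". It sweeps every observed score as a candidate threshold and keeps the one where the two error rates are closest. With finite data they rarely match exactly, so the average of the two is reported. The function name `compute_eer` is illustrative.

```python
def compute_eer(target_scores, nontarget_scores):
    """Return (EER estimate, threshold) from same/different-speaker scores."""
    best = None
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection: a same-speaker trial scored below the threshold.
        p_fr = sum(s < thr for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        p_fa = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(p_fa - p_fr)
        if best is None or gap < best[0]:
            best = (gap, (p_fa + p_fr) / 2, thr)
    return best[1], best[2]


same_speaker = [0.9, 0.8, 0.7, 0.4]
different_speakers = [0.6, 0.3, 0.2, 0.1]
eer, threshold = compute_eer(same_speaker, different_speakers)
# Here eer is 0.25 at threshold 0.6: one of four trials is wrong on each side.
```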
156
Everything in a nutshell
SAME speaker
DIFFERENT speakers
[Figure: similarity scores s of individual trials placed along the score axis]
Deepfake: increase false acceptance
Detection: decrease false acceptance
Anonymization: increase false rejection
157
Everything in a nutshell
score s
Spoofing or deepfake increases false acceptance
Detection and watermarking reduce false acceptance
158
Everything in a nutshell
score s
Spoofing or deepfake increases false acceptance
Detection and watermarking reduce false acceptance
Anonymization increases false rejection !?
159
Misuse of speech generation
Jee-weon Jung, Xin Wang, Nicholas Evans, Shinji Watanabe, Hye-jin Shim, Hemlata Tak, Siddhant Arora, Junichi Yamagishi, and Joon Son Chung. 2024. To what extent can ASV systems naturally defend against spoofing attacks? In Proc. Interspeech, 2024. 3240--3244.
Speaker verification
HMM-based TTS
HMM-DNN
End-to-end
Unit-selection
160
Misuse of speech generation
Jee-weon Jung, Xin Wang, Nicholas Evans, Shinji Watanabe, Hye-jin Shim, Hemlata Tak, Siddhant Arora, Junichi Yamagishi, and Joon Son Chung. 2024. To what extent can ASV systems naturally defend against spoofing attacks? In Proc. Interspeech, 2024. 3240--3244.
Speaker verification
HMM-based TTS
HMM-DNN
End-to-end
Unit-selection
Newer speaker verification systems better detect fake speech
161
Misuse of speech generation
Jee-weon Jung, Xin Wang, Nicholas Evans, Shinji Watanabe, Hye-jin Shim, Hemlata Tak, Siddhant Arora, Junichi Yamagishi, and Joon Son Chung. 2024. To what extent can ASV systems naturally defend against spoofing attacks? In Proc. Interspeech, 2024. 3240--3244.
Speaker verification
HMM-based TTS
HMM-DNN
End-to-end
Unit-selection
The latest speaker verification system is still vulnerable to some types of fake speech
162
Misuse of speech generation
Jee-weon Jung, Xin Wang, Nicholas Evans, Shinji Watanabe, Hye-jin Shim, Hemlata Tak, Siddhant Arora, Junichi Yamagishi, and Joon Son Chung. 2024. To what extent can ASV systems naturally defend against spoofing attacks? In Proc. Interspeech, 2024. 3240--3244.
False acceptance(%)
Date when TTS system was developed
Latest TTS systems are more difficult to detect
163
Binary classification
164
Binary classification
165
Binary classification
166
Binary classification
Each point is called an operating point
167
Binary classification
Alvin F Martin, George R Doddington, Terri Kamm, Mark Ordowski, and Mark A Przybocki. 1997. The DET curve in assessment of detection task performance. In Eurospeech, 1997. 1895–1898.
Detection Error Tradeoff (DET) curve
Making the tradeoff curve straight!
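Why the curve becomes straight can be sketched as follows, assuming same-speaker and different-speaker scores are both Gaussian with the same standard deviation (an idealization). The DET plot warps both error-rate axes with the probit function, the inverse of the standard normal CDF. Under this assumption the warped points fall on a line of slope -1. The helper names are illustrative.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def probit(p):
    """Inverse standard normal CDF, the axis warping used in DET plots."""
    return _STD_NORMAL.inv_cdf(p)


def gaussian_det_points(mu_target, mu_nontarget, sigma, thresholds):
    """(probit(P_FA), probit(P_FR)) for each threshold under Gaussian scores."""
    target = NormalDist(mu_target, sigma)
    nontarget = NormalDist(mu_nontarget, sigma)
    points = []
    for thr in thresholds:
        p_fr = target.cdf(thr)             # same-speaker scores below threshold
        p_fa = 1.0 - nontarget.cdf(thr)    # different-speaker scores above it
        points.append((probit(p_fa), probit(p_fr)))
    return points


points = gaussian_det_points(mu_target=2.0, mu_nontarget=0.0, sigma=1.0,
                             thresholds=[0.5, 1.0, 1.5])
# On the warped axes x + y is constant, so the curve is a straight line.
```

Real score distributions are only approximately Gaussian, so real DET curves are only approximately straight.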
168
Binary classification
David A Van Leeuwen and Niko Brümmer. 2007. An introduction to application-independent evaluation of speaker recognition systems. In Speaker classification I. Springer, 330–353.
169
Binary classification
See comments on EER and other metrics in Xin Wang, Héctor Delgado, Hemlata Tak, Jee-weon Jung, Hye-jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen, Nicholas Evans, Kong Aik Lee, and Junichi Yamagishi. 2024. ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale. In ASVspoof Workshop 2024, 2024. 1--8. https://doi.org/10.21437/ASVspoof.2024-1
Deep & interesting research topics
Why different?
Discrimination & calibration
Application-independent
evaluation: Cllr …
Bayes decision
Integrate ASV?
Tandem DCF
Tandem EER
…
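As one example of an application-independent metric, Cllr (the cost of log-likelihood ratio) can be sketched in a few lines. It assumes the detector outputs natural-log likelihood ratios, positive for the target class. Unlike EER, it penalizes poorly calibrated scores as well as poor discrimination. The function name is illustrative.

```python
import math


def cllr(target_llrs, nontarget_llrs):
    """Cost of log-likelihood ratio in bits; lower is better."""
    # Target trials are penalized when their LLR is low or negative.
    cost_target = sum(math.log2(1.0 + math.exp(-llr))
                      for llr in target_llrs) / len(target_llrs)
    # Non-target trials are penalized when their LLR is high or positive.
    cost_nontarget = sum(math.log2(1.0 + math.exp(llr))
                         for llr in nontarget_llrs) / len(nontarget_llrs)
    return 0.5 * (cost_target + cost_nontarget)


uninformative = cllr([0.0, 0.0], [0.0, 0.0])   # LLR = 0 for every trial
confident = cllr([5.0, 6.0], [-5.0, -6.0])     # well separated, right sign
```

A system that always outputs an LLR of 0 scores exactly 1 bit, which is the reference point for a detector that carries no information.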
170
Speech anti-spoofing – where we are : )
Todisco, M. et al. ASVspoof 2019: future horizons in spoofed and fake audio detection. in Proc. Interspeech 1008–1012 (2019). doi:10.21437/Interspeech.2019-2249
Wu, Z. et al. ASVspoof: the automatic speaker verification spoofing and countermeasures challenge. IEEE J. Sel. Top. Signal Process. 11, 588–604 (2017)
Yamagishi, J. et al. ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection. in Proc. ASVspoof Challenge workshop 47–54 (2021). doi:10.21437/ASVSPOOF.2021-
Korshunov, P. et al. Overview of BTAS 2016 speaker anti-spoofing competition. in Proc. BTAS 1–6 (2016).
Zhang, Z., Gu, Y., Yi, X. & Zhao, X. FMFCC-A: A Challenging Mandarin Dataset for Synthetic Speech Detection. arXiv Prepr. arXiv2110.09441 (2021)
Kinnunen, T. et al. Tandem Assessment of Spoofing Countermeasures and Automatic Speaker Verification: Fundamentals. IEEE/ACM Trans. Audio, Speech, Lang. Process. 28, 2195–2210 (2020)
HMM-based TTS
GMM-based VC
VTLN-based VC
…
GAN-based TTS
Tacotron
WaveNet
WaveRNN
VAE
…
VoIP, GSM, PSTN …
Out-of-domain data
Timeline: 2015 (ASVspoof 2015), 2017, 2019 (ASVspoof 2019), 2021 (ASVspoof 2021), 2024 (ASVspoof 5)
More speakers
Diverse audio env.
171
Speech anti-spoofing – where we are : )
EER (%)
Timeline: 2015 (ASVspoof 2015), 2017, 2019 (ASVspoof 2019), 2021 (ASVspoof 2021), 2024 (ASVspoof 5)
172
Speech anti-spoofing – where we are : )
Md Sahidullah, Tomi Kinnunen, and Cemal Hanilçi. A Comparison of Features for Synthetic Speech Detection. In Proc. Interspeech, 2087–2091. 2015.
Efforts by many research groups!
EER (%)
3% (Sahidullah 2015)
>20%
Timeline: 2015 (ASVspoof 2015), 2017, 2019 (ASVspoof 2019), 2021 (ASVspoof 2021), 2024 (ASVspoof 5)
173
Speech anti-spoofing – where we are : )
Md Sahidullah, Tomi Kinnunen, and Cemal Hanilçi. A Comparison of Features for Synthetic Speech Detection. In Proc. Interspeech, 2087–2091. 2015.
Xin Wang and Junichi Yamagishi. A Comparative Study on Recent Neural Spoofing Countermeasures for Synthetic Speech Detection. In Proc. Interspeech, 4259–4263. doi:10.21437/Interspeech.2021-702. 2021.
3% (Sahidullah 2015)
>20%
3% (Wang 2021)
10%
23%
Efforts by many research groups!
EER (%)
Timeline: 2015 (ASVspoof 2015), 2017, 2019 (ASVspoof 2019), 2021 (ASVspoof 2021), 2024 (ASVspoof 5)
174
Speech anti-spoofing – where we are : (
Rohan Kumar Das, Jichen Yang, and Haizhou Li. Assessing the Scope of Generalized Countermeasures for Anti-Spoofing. In Proc. ICASSP, 6589–6593. 2020.
Xin Wang, Junichi Yamagishi, Investigating self-supervised front ends for speech spoofing countermeasures, https://arxiv.org/abs/2111.07725
2015
2019
2021
3%
>20%
10%
23%
Trained on the 2019 training set, tested on the 2015 test set
Reported in one case study
8%
>23%
3% (Wang 2021)
(Das 2020)
EER (%)
Timeline: 2015 (ASVspoof 2015), 2017, 2019 (ASVspoof 2019), 2021 (ASVspoof 2021), 2024 (ASVspoof 5)
175
Speech anti-spoofing – where we are : (
Rohan Kumar Das, Jichen Yang, and Haizhou Li. Assessing the Scope of Generalized Countermeasures for Anti-Spoofing. In Proc. ICASSP, 6589–6593. 2020.
Xin Wang, Junichi Yamagishi, Investigating self-supervised front ends for speech spoofing countermeasures, https://arxiv.org/abs/2111.07725
2015
2019
2021
3%
>20%
10%
23%
8%
>23%
(Das 2020)
30%
(Wang 2021)
Another case
3% (Wang 2021)
EER (%)
Timeline: 2015 (ASVspoof 2015), 2017, 2019 (ASVspoof 2019), 2021 (ASVspoof 2021), 2024 (ASVspoof 5)
176
Speech anti-spoofing – where we go
You Zhang, Ge Zhu, Fei Jiang, and Zhiyao Duan. An Empirical Study on Channel Effects for Synthetic Voice Spoofing Countermeasure Systems. In Proc. Interspeech 2021, 4309–4313. doi:10.21437/Interspeech.2021-1820. 2021.
Anton Tomilov, Aleksei Svishchev, Marina Volkova, Artem Chirkovskiy, Alexander Kondratev, and Galina Lavrentyeva. STC Antispoofing Systems for the ASVspoof2021 Challenge. In Proc. ASVspoof Challenge Workshop, 61–67. doi:10.21437/ASVSPOOF.2021-10. 2021.
Das, R. K., Yang, J. & Li, H. Data Augmentation with Signal Companding for Detection of Logical Access Attacks. arXiv Prepr. arXiv2102.06332 (2021)
Xin Wang, Junichi Yamagishi, Investigating self-supervised front ends for speech spoofing countermeasures, https://arxiv.org/abs/2111.07725
EER (%)
2015
2019
2021
3%
>20%
10%
23%
8%
>23%
(Das 2020)
30%
(Wang 2021)
3% (Wang 2021)
0.2%
7%
Timeline: 2015 (ASVspoof 2015), 2017, 2019 (ASVspoof 2019), 2021 (ASVspoof 2021), 2024 (ASVspoof 5)
177
Speech anti-spoofing – mainstream
Sahidullah, M., Kinnunen, T. & Hanilçi, C. A comparison of features for synthetic speech detection. in Proc. Interspeech 2087–2091 (2015).
Kamble, M. R., Sailor, H. B., Patil, H. A. & Li, H. Advances in anti-spoofing: from the perspective of ASVspoof challenges. APSIPA Trans. Signal Inf. Process. 9, e2 (2020)
Sahidullah, M. et al. Introduction to voice presentation attack detection and recent advances. in Handbook of Biometric Anti-Spoofing 321–361 (Springer, 2019).
Wang, X. & Yamagishi, J. A comparative study on recent neural spoofing countermeasures for synthetic speech detection. in Proc. Interspeech 4259–4263 (2021).
Feature extraction
(front end)
Classifier
(back end)
Input trial
Wav2vec 2.0 (Baevski 2020)
A linear layer,
AASIST, or…
178
SSL-based detector training scheme
Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Proc. NeurIPS, 2020. 12449–12460.
Xin Wang and Junichi Yamagishi. 2022. Investigating Self-Supervised Front Ends for Speech Spoofing Countermeasures. In Proc. Odyssey, 2022. 100–106.
Hemlata Tak, Massimiliano Todisco, Xin Wang, Jee-weon Jung, Junichi Yamagishi, and Nicholas Evans. 2022. Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation. In Proc. Odyssey, 2022. 112–119.
Wav2vec 2.0 (Baevski 2020)
A linear layer,
AASIST, or…
Detector
SSL-based
front end
Classifier
Cross-entropy
supervised fine-tuning
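A toy sketch of the supervised training step, assuming the front end has already produced frame-level features and that the back end is a single linear layer on mean-pooled features. The small hand-made vectors stand in for wav2vec 2.0 outputs, and only the linear layer is updated here. In the scheme on the slide the SSL front end is fine-tuned as well. All names are illustrative.

```python
import math


def mean_pool(frames):
    """Average frame-level features into one utterance-level vector."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def fake_probability(weights, bias, frames):
    """Linear layer plus sigmoid: probability that the trial is fake."""
    x = mean_pool(frames)
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train_step(weights, bias, batch, lr):
    """One gradient-descent step on the binary cross-entropy loss."""
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    loss = 0.0
    for frames, label in batch:          # label: 1 = fake, 0 = real
        x = mean_pool(frames)
        p = fake_probability(weights, bias, frames)
        loss -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
        for d in range(len(weights)):
            grad_w[d] += (p - label) * x[d]
        grad_b += p - label
    n = len(batch)
    new_w = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return new_w, bias - lr * grad_b / n, loss / n


# Stand-in features: two frames per utterance, two dimensions per frame.
fake_utt = [[1.0, 0.0], [0.8, 0.2]]
real_utt = [[0.0, 1.0], [0.2, 0.8]]
batch = [(fake_utt, 1), (real_utt, 0)]

weights, bias, losses = [0.0, 0.0], 0.0, []
for _ in range(50):
    weights, bias, loss = train_step(weights, bias, batch, lr=1.0)
    losses.append(loss)
```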
179