1 of 179

Slides by Xin Wang

National Institute of Informatics

© 2026, Xin Wang. All rights reserved.

This work is licensed under the Creative Commons Attribution 3.0 license.

See http://creativecommons.org/ for details.

1

2 of 179

Progress of Speech Generative AI and Countermeasures against Speech Deepfake

WANG Xin

Project Associate Professor, JST PRESTO researcher

Muroran Institute of Technology 2026/06/10

wangxin@nii.ac.jp

ワン シン

2

3 of 179

2015

2018

2020

2013

2024

@UESTC & USTC

China

@NII, Japan

PI Prof. Yamagishi

Ph.D.

researcher for CREST

JST PRESTO (さきがけ)

Speech watermark

Speech deepfake detection: ASVspoof, SSL

Voice anonymization

Speech synthesis: HMM, autoregressive models, F0, DNN vocoders

M.S.

https://researchmap.jp/wangxin

@NII

PI Wang

M.S.

3

4 of 179

Speech synthesis

Speech generation, GenAI

Deepfake detection

Speech anti-spoofing

Signal processing

Computer science

signal & system, filters, …

information theory, search, …

pattern recognition

decision theory

deep neural networks

Machine learning

Linguistics

phonetics, phonology, …

https://www.magnific.com/free-photo/symbols-come-out-bulb-top-book_985250.htm

4

5 of 179

Introduction

Text example from: Beckman, M. E. & Ayers, G. Guidelines for ToBI labelling. OSU Res. Found. 3, (1997)

Image

Video

Audio

Text

Speech / voice

Text: Marianna made the marmalade

Speech:

5

6 of 179

Introduction

Kewley-Port, D. & Nearey, T. M. Speech synthesizer produced voices for disabled, including Stephen Hawking. J. Acoust. Soc. Am. 148, R1–R2 (2020)

https://commons.wikimedia.org/wiki/File:Stephen_Hawking.StarChild.jpg

Input text

Speech signal

Text-to-speech (TTS)

6

7 of 179

Introduction

https://collection.sciencemuseumgroup.org.uk/objects/co8911441/stephen-hawkings-speech-synthesizer-board

https://thereader.mitpress.mit.edu/stephen-hawkings-eternal-voice/

Input text

Speech signal

“Perfect Paul”

by Dennis Klatt

7

8 of 179

Introduction

Kewley-Port, D. & Nearey, T. M. Speech synthesizer produced voices for disabled, including Stephen Hawking. J. Acoust. Soc. Am. 148, R1–R2 (2020)

https://commons.wikimedia.org/wiki/File:Stephen_Hawking.StarChild.jpg

Input text

Reference voice (10s)

Speech signal

8

9 of 179

Introduction

See paper in Reference

Samples at the top are from ChatterBox. Other samples are from the ASVspoof 2019 database.

HMM+deep neural networks (DNNs)

(Zen 2013, Ling 2013, Fan 2014)

Hidden Markov model (HMM)

(Tokuda 1995, Yoshimura 1999, Tokuda 2000)

~1990s

~2000

~2013

~2017

Unit-selection

(Hunt 1996)

Transformer (Li 2018), WaveNet (van den Oord 2016)

Latest

Large language model (LLM)

(e.g., VALL-E, Wang 2023)

Natural voice

9

10 of 179

Introduction

Samples from WildSpoof challenge https://wildspoof.github.io

Many APIs & models on Huggingface

10

11 of 179

Introduction

Of course, my own voice can be easily cloned.

That is what you are hearing right now.

The cloned voice can speak better than me, reading a complicated sentence like “behaving like a babbling, bumbling band of baboons”

11

12 of 179

Introduction

https://www.bloomberg.com/news/articles/2024-01-26/ai-startup-elevenlabs-bans-account-blamed-for-biden-audio-deepfake?embedded-checkout=true

https://www.bbc.com/news/technology-60780142

Speech signal

Input text

Reference voice

12

13 of 179

Introduction

https://www.bloomberg.com/news/articles/2024-01-26/ai-startup-elevenlabs-bans-account-blamed-for-biden-audio-deepfake?embedded-checkout=true

https://www.bbc.com/news/technology-60780142

Speech signal

Input text

Reference voice

Part 2

How speech generation is misused

Part 3

Deepfake speech detection

Part 4

Additional layer of countermeasure: watermark & anonymization

Part 1:

Progress of speech synthesis

13

14 of 179

Progress of Speech Generation

How do we speak & listen

How can machines speak – the progress

Euphonia: the talking machine

https://en.wikipedia.org/wiki/Euphonia_%28device%29

14

15 of 179

How do we speak & listen

Illustration from HTS Slides ver. 2.3, HTS Working Group

Illustration from https://en.wikipedia.org/wiki/Middle_ear

Speaker

Listener

Speech perception

Speech production

Speech transmission

Telephone, VoIP, air …

Channel

15

16 of 179

How do we speak & listen

Illustration from HTS Slides ver. 2.3, HTS Working Group

Animation from Speech production and articulation knowledge group https://sail.usc.edu/span/rtmri_ipa/je_2015.html

Contents

Marianna made the marmalade

Words:

Prosody:

H

H

L

Accent: U.S. accent

Emotion:

Sex:

Voice quality: clear, warm, …

Identity: who the speaker is

Paralinguistic

Biometric

16

17 of 179

How do we speak & listen

Illustration from HTS Slides ver. 2.3, HTS Working Group

Animation from Speech production and articulation knowledge group https://sail.usc.edu/span/rtmri_ipa/je_2015.html

[a]

[i]

[o]

We perceive sounds by their formant frequencies, which are determined by the resonant frequencies of the vocal tract

Contents

F1 F2

vowels

17

18 of 179

How do we speak & listen

Figure from https://www.sfu.ca/sonic-studio-webdav/cmns/Handbook%20Tutorial/SpeechAcoustics.html

https://commons.wikimedia.org/wiki/File:Pipe003.gif

We perceive sounds by their formant frequencies, which are determined by the resonant frequencies of the vocal tract

Contents

F1 F2

vowels
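A rough way to see where formants come from is the following minimal sketch. It assumes the vocal tract is a uniform tube closed at the glottis and open at the lips, which is a textbook approximation of a neutral vowel and is not taken from the slide. The tube length (0.17 m) and the speed of sound (340 m/s) are assumed values, and `tube_resonances` is a hypothetical helper name.

```python
def tube_resonances(length_m, speed_of_sound=340.0, count=3):
    """Resonances (Hz) of a uniform tube closed at one end, open at the other.

    Quarter-wavelength resonator: F_n = (2n - 1) * c / (4 * L).
    """
    return [(2 * n - 1) * speed_of_sound / (4.0 * length_m)
            for n in range(1, count + 1)]


# Assumed adult vocal-tract length of about 17 cm.
formants = tube_resonances(0.17)
for index, freq in enumerate(formants, start=1):
    print(f"F{index} ~ {freq:.0f} Hz")
```

Moving the tongue and lips changes the tube shape and shifts these resonances, which is why [a], [i], and [o] sit at different points in the F1-F2 plane.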

18

19 of 179

How do we speak & listen

Motion picture from https://en.wikipedia.org/wiki/Vocal_cords

Animation from Speech production and articulation knowledge group https://sail.usc.edu/span/rtmri_ipa/je_2015.html

Animation from http://www.ling.fju.edu.tw/hearing/ishizaka.htm

[a]

We perceive pitch based on the fundamental frequency (F0), which is determined by the vibration rate of the vocal cords

Contents

[a]

F0
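F0 can be estimated from a short voiced frame by looking for the lag at which the signal best repeats itself. The sketch below uses a plain autocorrelation peak search; it is one simple illustration rather than the method used in the talk, and the function name, the search range (60 to 400 Hz), and the synthetic test signal are all assumptions.

```python
import numpy as np


def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate F0 (Hz) of one voiced frame from its autocorrelation peak."""
    frame = frame - np.mean(frame)
    # Full autocorrelation; keep the non-negative lags only.
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / f0_max)  # shortest period considered
    lag_max = int(sample_rate / f0_min)  # longest period considered
    best_lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return sample_rate / best_lag


sample_rate = 16000
t = np.arange(int(0.04 * sample_rate)) / sample_rate  # 40 ms frame
# Synthetic voiced-like signal: 125 Hz fundamental plus two harmonics.
frame = (np.sin(2 * np.pi * 125 * t)
         + 0.5 * np.sin(2 * np.pi * 250 * t)
         + 0.25 * np.sin(2 * np.pi * 375 * t))
f0 = estimate_f0(frame, sample_rate)
print(f"Estimated F0: {f0:.1f} Hz")
```

Real speech needs voicing decisions and protection against octave errors, which this sketch leaves out.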

19

20 of 179

How do we speak & listen

Example from: Nakamura, I., Minematsu, N., Suzuki, M., Hirano, H., Nakagawa, C., Nakamura, N., Tagawa, Y., Hirose, K., Hashimoto, H. (2013) Development of a web framework for teaching and learning Japanese prosody: OJAD (online Japanese accent dictionary). Proc. Interspeech 2013, 2554-2558, doi: 10.21437/Interspeech.2013-575

[a]

We perceive pitch based on the fundamental frequency (F0), which is determined by the vibration rate of the vocal cords

Contents

[a]

F0

Japanese pitch accent

20

21 of 179

How do we speak & listen

Picture from DOI:10.1142/9891 See more by searching Vocal Cords Bernoulli effect

Animation from Speech production and articulation knowledge group https://sail.usc.edu/span/rtmri_ipa/je_2015.html

Our impression of who the speaker is depends on the unique shape, size, length, … of the vocal tract & cords!

Speaker A

Biometric

Speaker B

Voiceprint

21

22 of 179

How do we speak & listen

https://en.wikipedia.org/wiki/Formant#/media/File:Spectrogram_-iua-.png

https://speechprocessingbook.aalto.fi/Representations/Fundamental_frequency_F0.html

Short-time Fourier transform

Wideband spectrogram

Formant frequencies (F1 & F2)

[i]

[u]

[a]

Frequency

Time
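The spectrogram on this slide can be reproduced in a few lines. The sketch below is a minimal short-time Fourier transform in NumPy; the 5 ms Hann window, the hop size, the FFT size, and the 1 kHz test tone are assumed values chosen for illustration.

```python
import numpy as np


def spectrogram(signal, frame_len, hop_len, n_fft):
    """Magnitude spectrogram with shape (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    frames = np.stack([
        signal[i * hop_len:i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])
    # Zero-pad each windowed frame to n_fft points before the FFT.
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))


sample_rate = 16000
t = np.arange(sample_rate) / sample_rate       # 1 second
tone = np.sin(2 * np.pi * 1000 * t)            # 1 kHz test tone
# A short (5 ms) window gives a wideband spectrogram.
spec = spectrogram(tone, frame_len=80, hop_len=40, n_fft=512)
peak_bin = int(spec.mean(axis=0).argmax())
print(spec.shape, peak_bin * sample_rate / 512)
```

A short window trades frequency resolution for time resolution, so formants appear as broad bands while individual harmonics are smeared; a longer window would give a narrowband spectrogram instead.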

22

23 of 179

How do we speak & listen

https://www.biointeractive.org/classroom-resources/cochlea

This animation is a clip from a 1999 Holiday Lecture Series, Senses and Sensitivity, by neuroscientist A. James Hudspeth

Short-time Fourier transform

Wideband spectrogram

Formant frequencies (F1 & F2)

[i]

[u]

[a]

Frequency

Time

23

24 of 179

How can machines speak? – Research questions

https://en.wikipedia.org/wiki/Formant#/media/File:Spectrogram_-iua-.png

Input text

Waveform generation: recover waveform

Acoustic modeling: generate intermediate features

Text analysis: decide F0, F1, F2 … from text

24

25 of 179

How can machines speak? – Research questions

Sentence from: Beckman, M. E. & Ayers, G. Guidelines for ToBI labelling. OSU Res. Found. 3, (1997)

Marianna made the marmalade

Discrete

Continuous

Alignment

[Figure labels, aligned letters: M a a m e a m d a l d e r]

Ambiguity

Biometric …

25

26 of 179

How can machines speak? – Research questions

Sentence from: Beckman, M. E. & Ayers, G. Guidelines for ToBI labelling. OSU Res. Found. 3, (1997)

LOGIOS Lexicon tool: http://www.speech.cs.cmu.edu/tools/lextool.html

H*, L-L%: ToBI labels Beckman, M. E. & Ayers, G. Guidelines for ToBI labelling. OSU Res. Found. 3, (1997)

Marianna made the marmalade

M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D

To phoneme

Waveform

generation

[Figure labels, aligned letters: M a a m e a m d a l d e r]

Normalization

+Prosody

tags

H*

H*

L-L%

M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D

Acoustic

realization

Waveform generation

Acoustic modeling

Text analysis
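The "To phoneme" step of text analysis can be sketched as a lexicon lookup. The toy lexicon below only covers the example sentence; the word boundaries inside the phoneme string are an assumed split of the sequence shown on the slide, and `text_to_phonemes` is a hypothetical helper, not the LOGIOS tool itself.

```python
# Toy pronunciation lexicon (ARPAbet-style symbols as on the slide).
LEXICON = {
    "marianna": ["M", "AA", "R", "IY", "AA", "N", "AH"],
    "made": ["M", "EY", "D"],
    "the": ["DH", "AH"],
    "marmalade": ["M", "AA", "R", "M", "AH", "L", "EY", "D"],
}


def text_to_phonemes(text):
    """Normalize the text and look each word up in the lexicon."""
    phonemes = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        # Out-of-vocabulary words raise KeyError in this sketch.
        phonemes.extend(LEXICON[word])
    return phonemes


phonemes = text_to_phonemes("Marianna made the marmalade")
print(" ".join(phonemes))
```

A real front end also needs letter-to-sound rules or a trained model for words outside the lexicon, plus prosody tags such as the ToBI labels above.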

26

27 of 179

How can machines speak? – Research questions

Kiyoshi KURIHARA, Nobumasa SEIYAMA, Tadashi KUMANO, "Prosodic Features Control by Symbols as Input of Sequence-to-Sequence Acoustic Modeling for Neural TTS" in IEICE TRANSACTIONS on Information, vol. E104-D, no. 2, pp. 302-311, February 2021, doi: 10.1587/transinf.2020EDP7104.

Marianna made the marmalade

M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D

To phoneme

Waveform

generation

[Figure labels, aligned letters: M a a m e a m d a l d e r]

Normalization

+Prosody

tags

H*

H*

L-L%

M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D

Acoustic

realization

Language dependent

27

28 of 179

Progress of TTS in terms of acoustic & waveform modeling

Heiga Zen, Keiichi Tokuda, and Alan W Black. 2009. Statistical parametric speech synthesis. Speech Communication 51, (2009), 1039–1064.

2019 https://www.wsj.com/articles/fraudsters-use-ai-to-mimic-ceos-voice-in-unusual-cybercrime-case-11567157402?reflink=desktopwebshare_permalink

See other papers in appendix

M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D

To phone

Waveform

generation

[Figure labels, aligned letters: M a a m e a m d a l d e r]

Normalization

+Prosody

tags

H*

H*

L-L%

M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D

Acoustic

realization

Marianna made the marmalade

Discrete

Continuous

Alignment

[Figure labels, aligned letters: M a a m e a m d a l d e r]

Ambiguity

Speaker identity,

Prosody …

HMM+deep neural networks

Hidden Markov model (HMM)

1996

~2000

~2013

~2017

Unit-selection

WaveNet, end-to-end (Tacotron, Transformer)

Latest

Statistical parametric speech synthesis (Zen 2009)

Codec+LLM

(VALL-E, CosyVoice)

Linguistics,

rules

Deep learning,

statistics

28

29 of 179

Progress of TTS – 1990s Unit selection

Hunt, A. J. & Black, A. W. Unit selection in a concatenative speech synthesis system using a large speech database. in Proc. ICASSP 373–376 (1996).

Black, A. W. & Taylor, P. A. Automatically clustering similar units for unit selection in speech synthesis. (1997)

Marianna made the marmalade

M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D

To phoneme

Waveform

concatenation

[Figure labels, aligned letters: M a a m e a m d a l d e r]

Normalization

+Prosody

tags

H*

H*

L-L%

M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D

Acoustic

realization

Concatenation

artifacts

29

30 of 179

Progress of TTS – 1990s Unit selection

[i]

[u]

[a]

Unit-selection concatenates sound segments
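Unit selection is commonly framed as a search: each target position has several candidate segments, and the chosen sequence should minimize the sum of target costs (how well a unit fits the requested sound) and join costs (how smoothly neighbouring units connect). The sketch below runs a small dynamic-programming search under that framing. The candidate units, their boundary F0 values, and all cost numbers are made up for illustration.

```python
def select_units(target_costs, join_cost):
    """Return (best candidate index per position, total cost).

    target_costs[pos][j]: cost of candidate j at position pos.
    join_cost(pos, i, j): cost of joining candidate i at pos - 1
                          to candidate j at pos.
    """
    best = [list(target_costs[0])]
    back = []
    for pos in range(1, len(target_costs)):
        row, ptr = [], []
        for j, tc in enumerate(target_costs[pos]):
            costs = [best[-1][i] + join_cost(pos, i, j)
                     for i in range(len(best[-1]))]
            i_best = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[i_best] + tc)
            ptr.append(i_best)
        best.append(row)
        back.append(ptr)
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    return path[::-1], total


# Hypothetical candidates: (F0 at unit start, F0 at unit end) in Hz.
units = [
    [(100, 110), (100, 150)],
    [(148, 140), (112, 120)],
    [(121, 100), (139, 100)],
]
target_costs = [[0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]


def f0_join_cost(pos, i, j):
    # Penalize F0 jumps at the concatenation point.
    return abs(units[pos - 1][i][1] - units[pos][j][0])


path, total = select_units(target_costs, f0_join_cost)
print(path, total)
```

Even the best path leaves small mismatches at the joins, which is one source of the concatenation artifacts mentioned earlier.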

30

31 of 179

Progress of TTS – after 1990s

[i]

[u]

[a]

Co-articulation: human speech is continuous

Statistical parametric speech synthesis: use statistical models to generate smooth feature trajectories (parametric approach)

31

32 of 179

Progress of TTS – 2000s

Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T. & Kitamura, T. Speech parameter generation algorithms for HMM-based speech synthesis. in Proc. ICASSP 936–939 (2000).

Keiichi Tokuda, Heiga Zen, and Alan W Black. An HMM-Based Speech Synthesis System Applied to English. In Proc. SSW, 227–230. 2002.

HMM-Based Speech Synthesis Toolkit (HTS), home page: http://hts.sp.nitech.ac.jp/?Welcome

Marianna made the marmalade

M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D

To phone

Waveform

generation

[Figure labels, aligned letters: M a a m e a m d a l d e r]

Normalization

+Prosody

tags

H*

H*

L-L%

M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D

Acoustic

realization

Hidden Markov model

Vocoder artifacts

& oversmoothing

Vocoder using digital signal processing (DSP)

32

33 of 179

Progress of TTS – early 2010s

Heiga Zen, Andrew Senior, and Mike Schuster. Statistical Parametric Speech Synthesis Using Deep Neural Networks. In Proc. ICASSP, 7962–7966. 2013.

Heiga Zen, and Andrew Senior. Deep Mixture Density Networks for Acoustic Modeling in Statistical Parametric Speech Synthesis. In Proc. ICASSP, 3844–3848. 2014.

Xin Wang, Shinji Takaki, and Junichi Yamagishi. 2017. An autoregressive recurrent mixture density network for parametric speech synthesis. In Proc. ICASSP, 4895–4899.

Marianna made the marmalade

M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D

To phone

Waveform

generation

[Figure labels, aligned letters: M a a m e a m d a l d e r]

Normalization

+Prosody

tags

H*

H*

L-L%

M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D

Acoustic

realization

Deep neural network (DNN)

Vocoder artifacts

& oversmoothing

Vocoder using digital signal processing (DSP)

33

34 of 179

Progress of TTS – late 2010s

Xin Wang, Neural statistical parametric speech synthesis, ISCA Odyssey 2020, tutorial: https://tonywangx.github.io/slide.html#dec-2020

Xu Tan, Tao Qin, Frank Soong, and Tie-Yan Liu. A Survey on Neural Speech Synthesis. ArXiv Preprint ArXiv:2106.15561. 2021.

[Figure labels, aligned letters: a a m e a m d a l d e r]

H*

H*

L-L%

M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D

Vocoder using DNN

Transformer

Acoustic

features

Input text

M

Vocoder artifacts

& oversmoothing

Alignment not always stable

34

35 of 179

Progress of TTS – from 2000s to late 2010s

[Figure labels, aligned letters: a a m e a m d a l d e r]

H*

H*

L-L%

M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D

Vocoder using DNN

Transformer

Acoustic

features

Input text

M

Expert-knowledge based

35

36 of 179

Progress of TTS – 2020s

Aaron Van Den Oord, Oriol Vinyals, and others. 2017. Neural discrete representation learning. In Proc. NIPS, 2017. 6309–6318.

Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. 2022. SoundStream: An End-to-End Neural Audio Codec. IEEE/ACM Trans. Audio Speech Lang. Process. 30, (2022), 495–507.

1

2

3

DNN decoder

Acoustic

tokens

DNN encoder

Use DNNs to learn a latent feature space
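The step from encoder output to acoustic tokens can be sketched as vector quantization: each latent frame is replaced by the index of its nearest codebook entry, and the decoder sees the corresponding code vectors. The codebook and latents below are tiny made-up arrays; in a real codec both the encoder and the codebook are learned, and several codebooks are often stacked.

```python
import numpy as np


def quantize(latents, codebook):
    """Map each latent frame to its nearest codebook entry.

    latents:  (T, D) encoder outputs.
    codebook: (K, D) code vectors.
    Returns (token indices of shape (T,), quantized latents of shape (T, D)).
    """
    # Squared Euclidean distance between every frame and every code.
    dist = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = dist.argmin(axis=1)
    return tokens, codebook[tokens]


codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
latents = np.array([[0.9, 0.1], [0.1, 0.8], [0.2, 0.1]])
tokens, quantized = quantize(latents, codebook)
print(tokens.tolist())
```

Because the tokens are discrete, a language model can predict them in the same way it predicts text tokens.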

36

37 of 179

Progress of TTS – 2020s

Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhifu Gao, and Zhijie Yan. 2024. CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens.

Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, and others. 2023. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111 (2023).

[Figure labels, aligned letters: a a m e a m d a l d e r]

H*

H*

L-L%

M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D

1

2

3

DNN decoder

Large language model

Acoustic

tokens

Input tokens

M

Code index

prompting

37

38 of 179

Progress of TTS – 2020s

[Figure labels, aligned letters: a a m e a m d a l d e r]

H*

H*

L-L%

M AA R IY AA N AH M EY D DH AH M AA R M AH L EY D

1

2

3

DNN decoder

Large language model

M

Speak in a firm voice like a politician,

Text-based control

prompting

+

+

38

39 of 179

Progress of TTS

More DNNs, fewer linguistic rules

Better quality, fewer artifacts

DNN decoder

Large language model

DNN vocoder

(small) Transformer

DSP vocoder

Classical DNN

/ HMM

Waveform concatenation

Unit selection

39

40 of 179

Progress of TTS

More DNNs, fewer linguistic rules

Better quality, fewer artifacts

DNN decoder

Large language model

DNN vocoder

(small) Transformer

DSP vocoder

Classical DNN

/ HMM

Waveform concatenation

Unit selection

Festival & MaryTTS

(C++, Lisp/perl, …)

HTS

(C++)

CURRENNT

Merlin & RNNLIB

(C++/CUDA)

Theano

TensorFlow

PyTorch

(Python)

PyTorch

(Python)

More open-sourced & easier to use

40

41 of 179

Misuse of Speech Generation

https://en.wikipedia.org/wiki/Turing_test

How well do deepfakes fool humans?

How well do deepfakes fool machines?

41

42 of 179

Misuse of speech generation

TTS generator

Attacker

User

Humans

Telephone call

Social media

Speaker verification

Services

Transmission

42

43 of 179

Misuse of speech generation

https://www.wsj.com/articles/fraudsters-use-ai-to-mimic-ceos-voice-in-unusual-cybercrime-case-11567157402

TTS generator

Attacker

User

Humans

Telephone call

Social media

Speaker verification

Services

Transmission

AI voice scam

I am your boss / son, please send me the money

sns

I am your boss …

43

44 of 179

Misuse of speech generation

https://www.mcafee.com/content/dam/consumer/en-us/resources/cybersecurity/artificial-intelligence/rp-beware-the-artificial-impostor-report.pdf

TTS generator

Attacker

User

Humans

Telephone call

Social media

Speaker verification

Services

Transmission

AI voice scam

44

45 of 179

Misuse of speech generation

Xin Wang, Junichi Yamagishi, et al. 2020. ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech. Computer Speech & Language 64, (November 2020), 101114.

Kimberly T. Mai, Sergi Bray, Toby Davies, and Lewis D. Griffin. 2023. Warning: Humans cannot reliably detect speech deepfakes. PLoS ONE 18, 8 (August 2023), e0285333.

Kevin Warren, Tyler Tucker, Anna Crowder, Daniel Olszewski, Allison Lu, Caroline Fedele, Magdalena Pasternak, Seth Layton, Kevin Butler, Carrie Gates, and others. 2024. “Better be computer or I’m dumb”: a large-scale evaluation of humans as audio deepfake detectors. In Proc. ACM CCS, 2024. 2696–2710.

TTS generator

Attacker

User

Humans

Telephone call

Social media

Speaker verification

Services

Transmission

Can fake speech be detected by humans?

Human-ear detection is NOT reliable

  • it depends on the listeners (Mai 2023)
  • it depends on the attacks (Wang 2020, Warren 2024)

45

46 of 179

Misuse of speech generation

Xin Wang, Junichi Yamagishi, et al. 2020. ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech. Computer Speech & Language 64, (November 2020), 101114.

Humans

Telephone call

Social media

Depending on the attacks (Wang 2020)

Random guess

Detection error rates by human ears (%)

by 1,145 listeners

Tacotron 2

Attack IDs

46

47 of 179

Misuse of speech generation

TTS generator

Attacker

User

Humans

Telephone call

Social media

Services

Transmission

Speaker verification

47

48 of 179

Misuse of speech generation

John H L Hansen and Taufiq Hasan. 2015. Speaker recognition by machines and humans: A tutorial review. IEEE Signal processing magazine 32, 6 (2015), 74–99.

Personalized Hey Siri, https://machinelearning.apple.com/research/personalized-hey-siri

Services

Speaker verification

Speaker verification

Match

Not match

TTS

Match

Examples:

Telephone banking

“Hey Siri”

Spoofing
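The match / not-match decision can be sketched as a similarity test between speaker embeddings. The example below compares an enrolled embedding with a test embedding using cosine similarity and a fixed threshold. The three-dimensional vectors and the threshold of 0.7 are made up for illustration; a real system would obtain embeddings from a trained speaker encoder and calibrate the threshold on held-out data.

```python
import numpy as np


def cosine_score(enrolled, test):
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(enrolled, test)
                 / (np.linalg.norm(enrolled) * np.linalg.norm(test)))


def verify(enrolled, test, threshold=0.7):
    """Return True ("Match") if the similarity reaches the threshold."""
    return cosine_score(enrolled, test) >= threshold


# Hypothetical embeddings (real ones have hundreds of dimensions).
enrolled = np.array([1.0, 0.0, 0.0])
same_speaker = np.array([0.9, 0.1, 0.0])
other_speaker = np.array([0.0, 1.0, 0.0])
cloned_voice = np.array([0.95, 0.05, 0.0])  # TTS output imitating the target

for name, emb in [("same", same_speaker), ("other", other_speaker),
                  ("cloned", cloned_voice)]:
    print(name, "Match" if verify(enrolled, emb) else "Not match")
```

The verifier only asks whether the voice sounds like the enrolled speaker, so a cloned voice whose embedding lands close to the target is accepted just like genuine speech.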

48

49 of 179

Misuse of speech generation

Services

Speaker verification

Speaker verification

Match

Not match

TTS

Match

AI voice spoofing

49

50 of 179

Misuse of speech generation

Bryan L Pellom and John H L Hansen. 1999. An experimental study of speaker verification sensitivity to computer voice-altered imposters. In Proc. ICASSP, 1999. IEEE, 837–840.

Takashi Masuko, Takafumi Hitotsumatsu, Keiichi Tokuda, and Takao Kobayashi. 1999. On the security of HMM-based speaker verification systems against imposture using synthetic speech. In Proc. Eurospeech, 1999. 1223–1226.

Speaker verification

Humans

Telephone call

Social media

Speaker verification

Services

Will a speaker verification model be fooled?

Yes

  • false acceptance rate … 86% (Pellom 1999)
  • … over 70% (Masuko 1999)

TTS

Match

50

51 of 179

Misuse of speech generation

Xin Wang, Junichi Yamagishi, et al. 2020. ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech. Computer Speech & Language 64, (November 2020), 101114.

Humans

Telephone call

Social media

Speaker verification

Services

Will a speaker verification model be fooled? YES

Random guess

Equal error rates (%)

X-vector-based speaker verification model

Unit-selection

HMM-DNN

51

52 of 179

Misuse of speech generation

Xin Wang, Junichi Yamagishi, et al. 2020. ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech. Computer Speech & Language 64, (November 2020), 101114.

Humans

Telephone call

Social media

Speaker verification

Services

Random guess

Human ears differ from speaker verification models

Unit-selection

HMM-DNN

Equal error rates (%)

X-vector-based speaker verification model

52

53 of 179

Misuse of speech generation

Jee-weon Jung, Xin Wang, Nicholas Evans, Shinji Watanabe, Hye-jin Shim, Hemlata Tak, Siddhant Arora, Junichi Yamagishi, and Joon Son Chung. 2024. To what extent can ASV systems naturally defend against spoofing attacks? In Proc. Interspeech, 2024. 3240–3244.

Generator

Attacker

User

Humans

Telephone call

Social media

Speaker verification

Services

Transmission

But both may be fooled by speech deepfake

See the latest study on spoofing & speaker verification (Jung 2024)

Human ears differ from speaker verification models

53

54 of 179

Speech Deepfake Detection

How do we do binary detection?

Poor generalization performance

https://en.wikipedia.org/wiki/Overfitting

54

55 of 179

Deepfake detection

Generator

Attacker

Humans

Telephone call

Social media

Speaker verification

Services

Transmission

55

56 of 179

Deepfake detection

See alternatives in Abdenour Hadid, Nicholas Evans, Sebastien Marcel, and Julian Fierrez. 2015. Biometrics systems under spoofing attack: an evaluation methodology and lessons learned. IEEE Signal Processing Magazine 32, 5 (2015), 20–30.

Attacker

Humans

Telephone call

Social media

Speaker verification

Services

Defender

Detector

Generator

REAL

FAKE

Transmission

Other names:

anti-spoofing, spoofing countermeasure, presentation attack detection

56

57 of 179

Deepfake detection

Goal: detection with no or limited prior knowledge of

    • generator
    • transmission channels
    • speaker, languages, …

Detector

Generator

REAL

FAKE

Attacker

Defender

En,

Jp,

Zh,

VoIP

mp3

Transmission

Generalization

57

58 of 179

Deepfake detection

Detector

Generator

REAL

FAKE

Attacker

Transmission

Detector

Feature extraction

(front end)

Classifier

(back end)

Input wav.

(ACCEPT)

(REJECT)

(ACCEPT)

(REJECT)

58

59 of 179

Deepfake detection

Spectrogram

Average over high-frequency band

Frame index

Freq. bin

Freq. bin

Frame index

Amplitude

S1

S2

Feature extraction

(front end)

Classifier

(back end)

Input wav.

Toy example
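The toy front end / back end pair can be written out directly: the feature is the average spectrogram magnitude over a high-frequency band, and the classifier is a single threshold. The sketch below assumes, purely for illustration, that the fake sample has much weaker high-frequency energy than the real one; the band edge, the threshold, and both random spectrograms are made up.

```python
import numpy as np


def high_band_average(spec, first_bin):
    """Front end: mean magnitude over frequency bins >= first_bin."""
    return float(np.mean(spec[:, first_bin:]))


def toy_detector(spec, first_bin, threshold):
    """Back end: a single threshold on the front-end feature."""
    score = high_band_average(spec, first_bin)
    return "REAL" if score >= threshold else "FAKE"


rng = np.random.default_rng(0)
# Hypothetical magnitude spectrograms with shape (frames, frequency bins).
spec_a = rng.uniform(0.5, 1.0, size=(50, 64))
spec_b = spec_a.copy()
spec_b[:, 48:] *= 0.1  # assumed artifact: weak high-frequency energy

print(toy_detector(spec_a, first_bin=48, threshold=0.3))
print(toy_detector(spec_b, first_bin=48, threshold=0.3))
```

A generator that reproduces realistic high-frequency energy would pass this rule untouched, which is the kind of failure on unseen attacks that hand-picked cues invite.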

59

60 of 179

Deepfake detection

A new TTS attack

Frame index

Freq. bin

Freq. bin

Frame index

Amplitude

Spectrogram

Average over high-frequency band

Feature extraction

(front end)

Classifier

(back end)

Input wav.

Toy example

60

61 of 179

Deepfake detection

Detector

Generator

REAL

FAKE

Attacker

Transmission

(ACCEPT)

(REJECT)

Detector

Feature extraction

(front end)

Classifier

(back end)

Training data

Cross-entropy

Model training

61

62 of 179

Deepfake detection

Detector

Feature extraction

(front end)

Classifier

(back end)

Test data

False acceptance

False rejection

Decision threshold

Decision

Equal error rate (EER)
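The EER is the operating point where the false acceptance rate (fake accepted as real) equals the false rejection rate (real rejected as fake). The sketch below sweeps the decision threshold over the observed scores and returns the point where the two rates are closest. The scores are made up, and this simple sweep is an illustration rather than the official ASVspoof scoring code.

```python
import numpy as np


def compute_eer(real_scores, fake_scores):
    """Return (EER, threshold); higher scores mean "more likely real"."""
    real_scores = np.asarray(real_scores, dtype=float)
    fake_scores = np.asarray(fake_scores, dtype=float)
    best_gap, eer, best_threshold = np.inf, None, None
    for threshold in np.sort(np.concatenate([real_scores, fake_scores])):
        frr = np.mean(real_scores < threshold)    # real speech rejected
        far = np.mean(fake_scores >= threshold)   # fake speech accepted
        if abs(far - frr) < best_gap:
            best_gap = abs(far - frr)
            eer = (far + frr) / 2.0
            best_threshold = threshold
    return float(eer), float(best_threshold)


# Hypothetical detector scores on a small test set.
real = [0.9, 0.8, 0.7, 0.4]
fake = [0.6, 0.3, 0.2, 0.1]
eer, threshold = compute_eer(real, fake)
print(f"EER = {100 * eer:.1f}% at threshold {threshold}")
```

Because the EER is defined by a threshold chosen on the test scores themselves, it summarizes how well the two score distributions separate rather than how a deployed system with a fixed threshold will behave.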

62

63 of 179

Deepfake detection

Menglu Li, Yasaman Ahmadiadli, and Xiao-Ping Zhang. 2025. A Survey on Speech Deepfake Detection. ACM Comput. Surv. 57, 7 (July 2025), 1–38.

Jiangyan Yi, Chenglong Wang, Jianhua Tao, Xiaohui Zhang, Chu Yuan Zhang, and Yan Zhao. 2023. Audio Deepfake Detection: A Survey.

Xin Wang and Junichi Yamagishi. 2022. A Practical Guide to Logical Access Voice Presentation Attack Detection. In Frontiers in Fake Media Generation and Detection. Springer

Detector

Feature extraction

(front end)

Classifier

(back end)

Training/test data

Features

Waveform

(DSP) LFCC, CQCC

(DNN) Self-supervised learning

Classifiers

(linear) GMM, …

(DNN) ResNet, LCNN,

AASIST, RawNet

Mamba,

Transformer …

Databases

ASVspoof series

ADD 22, 23

In-the-wild

MLAAD

Metrics

EER,

t-DCF, t-EER

Cllr

63

64 of 179

Deepfake detection

Xin Wang, Junichi Yamagishi, et al. 2020. ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech. Computer Speech & Language 64, (November 2020), 101114.

The average detection error rate is about 3% on the ASVspoof 2019 LA dataset

Detector: CQCC + LCNN

Random guess

HMM-DNN, detection error rates = 0%

Detection error rates (%)

voice conversion

3.11%

average

Tacotron2

ASR+TTS(WaveNet)

64

65 of 179

Deepfake detection

Xin Wang, Junichi Yamagishi, et al. 2020. ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech. Computer Speech & Language 64, (November 2020), 101114.

The average detection error rate is about 3% on the ASVspoof 2019 LA dataset

Detector: CQCC + LCNN

Random guess

HMM-DNN, detection error rates = 0%

Detection error rates (%)

3.11%

average

Are we safe from Deepfake? NO!

65

66 of 179

Issue with generalization

Detector

Generator

REAL

FAKE

Attacker

Defender

En

Transmission

Training set

Test set

En

66

67 of 179

Issue with generalization

Detector

Generator

REAL

FAKE

Attacker

Defender

En

Jp

Zh

Transmission

Training set

Test set

En

mp3

m4a

More errors when training and test sets mismatch

67

68 of 179

Issue with generalization

Xin Wang, and Junichi Yamagishi. Investigating Self-Supervised Front Ends for Speech Spoofing Countermeasures. Proc. Odyssey 2022.

Audio example from https://docs.pytorch.org/audio/2.6.0/tutorials/effector_tutorial.html#mp3

Detector

Generator

REAL

FAKE

Attacker

Defender

Transmission

codec

mp3,…

EER (%)

3.11

22.79

24.88

Test sets

2019LA

2021LA

2021DF

Training set

more

attacks

More errors when training and test sets mismatch

Same domain

original

mp3

68

69 of 179

Issue with generalization

Internal experiment

See similar study in Nicolas Müller, Franziska Dieckmann, et al. 2021. Speech is silver, silence is golden: What do ASVspoof-trained models really learn? In Proc. ASVspoof challenge workshop, 2021. 55–60.

Detector

Generator

REAL

FAKE

Attacker

Defender

Transmission

codec

mp3,…

EER (%)

3.11

22.79

24.88

Test sets

2019LA

2021LA

2021DF

Training set

more

attacks

More errors when training and test sets mismatch

~15

Same domain

69

70 of 179

Issue with generalization

Müller, Nicolas, Franziska Dieckmann, Pavel Czempin, Roman Canals, Konstantin Böttinger, and Jennifer Williams. 2021. “Speech Is Silver, Silence Is Golden: What Do ASVspoof-Trained Models Really Learn?” In Proc. ASVspoof Challenge Workshop, 55–60.

Shim, Hye-jin, Rosa Gonzalez Hautamäki, Md Sahidullah, and Tomi Kinnunen. 2023. “How to Construct Perfect and Worse-than-Coin-Flip Spoofing Countermeasures: A Word of Warning on Shortcut Learning.” In Proc. INTERSPEECH 2023, 785–89. https://doi.org/10.21437/Interspeech.2023-1901.

Detector

Generator

REAL

FAKE

Attacker

Defender

Trans.

Detection error rate ×5

?

Human

Synthetic

70

71 of 179

Issue with generalization

Hemlata Tak, Massimiliano Todisco, Xin Wang, Jee-weon Jung, Junichi Yamagishi, and Nicholas Evans. 2022. Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation. In Proc. Odyssey, 2022. 112–119.

Xin Wang and Junichi Yamagishi. 2022. Investigating Self-Supervised Front Ends for Speech Spoofing Countermeasures. In Proc. Odyssey, 2022. 100–106.

Xin Wang and Junichi Yamagishi. 2023. Spoofed Training Data for Speech Spoofing Countermeasure Can Be Efficiently Created Using Neural Vocoders. In Proc. ICASSP, June 2023

Xin Wang and Junichi Yamagishi. 2024. Can Large-Scale Vocoded Spoofed Data Improve Speech Spoofing Countermeasure with a Self-Supervised Front End? In Proc. ICASSP, April 14, 2024. 10311–10315

Wanying Ge, Xin Wang, Xuechen Liu, and Junichi Yamagishi. 2025. Post-training for Deepfake Speech Detection. In Proc. ASRU, December 06, 2025. IEEE, Honolulu, HI, USA, 1–8

Xin Wang, Wanying Ge, and Junichi Yamagishi. 2026. Does Fine-tuning by Reinforcement Learning Improve Generalization in Binary Speech Deepfake Detection?

Detector

Generator

REAL

FAKE

Attacker

Defender

En,

Jp,

Zh,

VoIP

mp3

Trans.

Unseen

Work by us (advertisement : )

  • Better feature extractor (Wang 2022, Tak 2022)
  • More & larger-scale data (Wang 2023, 2024, Ge 2025)
  • Better training criterion (Wang 2026)

71

72 of 179

Issue with generalization

How could you claim generalization to unseen data if you haven’t seen the unseen data?

Xin Wang, Héctor Delgado, Nicholas Evans, Xuechen Liu, Tomi Kinnunen, Hemlata Tak, Kong Aik Lee, Ivan Kukanov, Md Sahidullah, Massimiliano Todisco, and Junichi Yamagishi. 2026. ASVspoof 5: Evaluation of spoofing, deepfake, and adversarial attack detection using crowdsourced speech. IEEE Transactions on Audio, Speech, and Language Processing (2026).

Detector

Generator

REAL

FAKE

Attacker

Defender

En,

Jp,

Zh,

VoIP

mp3

Trans.

Unseen

Ongoing research topic (see latest ref.)

72

73 of 179

Beyond deepfake detection

Attacker

Defender

Detector

Generator

REAL

FAKE

Trans.

Generator

API

Detector

REAL

FAKE

Trans.

Dataset

Generator

Training

73

74 of 179

Beyond deepfake detection

Attacker

Defender

Watermark detector

Generator

Trans.

Generator

API

Collaborator

Watermark detector

Message

Trans.

Dataset

Generator

Training

Watermark

w

Message

w

w

Generated by HuggingFace API

Generated by HuggingFace API

74

75 of 179

Beyond deepfake detection

Attacker

Defender

Watermark detector

Generator

Trans.

Generator

API

Collaborator

Watermark detector

Message

Trans.

Dataset

Generator

Training

Watermark

net

w

Message

w

w

Generated by HuggingFace API

Generated by HuggingFace API

Anonymization

75

76 of 179

Proactive defense: speech watermark

76

77 of 179

Watermark – post-processing approach

Watermark detector

Generator

Trans.

Generator

API

Watermark

w

Generator (wrapped via API)

TTS

Watermark embedder

Watermark detector

Message 101011…

101011…

General case

Message

w

77

78 of 179

Watermark – post-processing approach

Watermark detector

Generator

Trans.

Generator

API

Watermark

w

Generator (wrapped via API)

TTS

Watermark embedder

Watermark detector

TTS ID X, creator N

TTS ID X, creator N

General case

Message

Signature of TTS

w

78

79 of 179

Watermark – post-processing approach

Guangyu Chen, Yu Wu, Shujie Liu, Tao Liu, Xiaoyong Du, and Furu Wei. 2023. Wavmark: Watermarking for audio generation. arXiv preprint arXiv:2308.12770 (2023).

Robin San Roman, Pierre Fernandez, Alexandre Défossez, Teddy Furon, Tuan Tran, and Hady Elsahar. 2024. Proactive detection of voice cloning with localized watermarking. In Proc. ICML, 2024.

Mayank Kumar Singh, Naoya Takahashi, Weihsiang Liao, and Yuki Mitsufuji. 2024. SilentCipher: Deep audio watermarking. In Proc. Interspeech, 2024. 2235--2239.

Yixin Liu, Lie Lu, Jihui Jin, Lichao Sun, and Andrea Fanelli. 2025. XAttnMark: Learning Robust Audio Watermarking with Cross-Attention.

API,

Checkpoint

Watermark detector

Generator

Trans.

Generator

building

Watermark

w

Generator (wrapped via API)

TTS

Watermark embedder

Watermark detector

TTS ID X, creator N

TTS ID X, creator N

General case

Message

Signature of TTS

w

Microsoft: WavMark (Chen 2023)

Meta: AudioSeal (Roman 2024)

Dolby: XAttnMark (Liu 2025)

Sony: SilentCipher (Singh 2024)

79

80 of 179

Watermark – post-processing approach

  • Goals (part 1)
    • Fidelity: the original and the watermarked speech are perceptually the same
    • Capacity: a message with many bits can be accurately extracted from the watermarked speech

TTS

Watermark embedder

Watermark detector

Message 101011…

101011…

80

81 of 179

Watermark – post-processing approach

L. F. Turner, “Digital data security system.” Patent IPN WO 89/08915, 1989.

Figure from Heiga Zen. 2017. Generative model-based text-to-speech synthesis.

  • Example
    • Least significant bit (Turner 1989)

TTS

Watermark embedder

Watermark detector

Message 0

0

81

82 of 179

Watermark – post-processing approach

L. F. Turner, “Digital data security system.” Patent IPN WO 89/08915, 1989.

Figure from Heiga Zen. 2017. Generative model-based text-to-speech synthesis.

  • Example
    • Least significant bit (Turner 1989)

TTS

Watermark embedder

Watermark detector

Message 0

0

82

83 of 179

Watermark – post-processing approach

L. F. Turner, “Digital data security system.” Patent IPN WO 89/08915, 1989.

Figure from Heiga Zen. 2017. Generative model-based text-to-speech synthesis.

  • Example
    • Least significant bit (Turner 1989)

TTS

Watermark embedder

Watermark detector

Message 0

0

83

84 of 179

Watermark – post-processing approach

L. F. Turner, “Digital data security system.” Patent IPN WO 89/08915, 1989.

Figure from Heiga Zen. 2017. Generative model-based text-to-speech synthesis.

  • Example
    • Least significant bit (Turner 1989)

TTS

Watermark embedder

Watermark detector

Message 1

1

84

85 of 179

Watermark – post-processing approach

L. F. Turner, “Digital data security system.” Patent IPN WO 89/08915, 1989.

Figure from Heiga Zen. 2017. Generative model-based text-to-speech synthesis.

  • Example
    • Least significant bit (Turner 1989)

TTS

Watermark embedder

Watermark detector

Message 00

00

85

86 of 179

Watermark – post-processing approach

L. F. Turner, “Digital data security system.” Patent IPN WO 89/08915, 1989.

Figure from Heiga Zen. 2017. Generative model-based text-to-speech synthesis.

  • Example
    • Least significant bit (Turner 1989)

TTS

Watermark embedder

Watermark detector

Message 000

000

Trade-off between capacity & fidelity
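The least-significant-bit example on the preceding slides can be sketched in a few lines. This is a minimal illustration, not the scheme from the Turner patent: the function names, the 16-bit PCM format, and the sine-tone stand-in for a TTS output are all assumptions made for the example. It writes 1, 2, or 3 message bits into each sample and reports how the signal-to-distortion ratio changes as capacity grows.

```python
import numpy as np


def embed_lsb(pcm, bits, n_lsb=1):
    """Overwrite the n_lsb lowest bits of every 16-bit sample with message bits."""
    assert pcm.dtype == np.int16 and bits.size == pcm.size * n_lsb
    weights = 1 << np.arange(n_lsb - 1, -1, -1)      # e.g. [4, 2, 1] for 3 bits
    payload = (bits.reshape(-1, n_lsb) @ weights).astype(np.uint16)
    keep = np.uint16(0xFFFF ^ ((1 << n_lsb) - 1))     # mask that clears the low bits
    return ((pcm.view(np.uint16) & keep) | payload).view(np.int16)


def extract_lsb(pcm, n_lsb=1):
    """Read the message back from the n_lsb lowest bits of every sample."""
    payload = pcm.view(np.uint16) & np.uint16((1 << n_lsb) - 1)
    shifts = np.arange(n_lsb - 1, -1, -1)
    return ((payload[:, None] >> shifts) & 1).astype(np.int64).ravel()


def snr_db(clean, marked):
    """Signal-to-distortion ratio in dB; lower means a more audible watermark."""
    c = clean.astype(np.float64)
    e = marked.astype(np.float64) - c
    return 10.0 * np.log10(np.sum(c ** 2) / np.sum(e ** 2))


rng = np.random.default_rng(0)
# Hypothetical stand-in for a TTS output: one second of a 220 Hz tone at 16 kHz.
speech = (8000 * np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)).astype(np.int16)

snr_by_capacity = {}
for n_lsb in (1, 2, 3):
    bits = rng.integers(0, 2, size=speech.size * n_lsb)
    marked = embed_lsb(speech, bits, n_lsb)
    assert np.array_equal(extract_lsb(marked, n_lsb), bits)  # lossless channel
    snr_by_capacity[n_lsb] = snr_db(speech, marked)
    print(f"{n_lsb} bit(s)/sample -> SNR {snr_by_capacity[n_lsb]:.1f} dB")
```

Each extra bit per sample raises capacity but lowers the SNR, which is the capacity and fidelity trade-off named on the slide.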

86

87 of 179

Watermark – post-processing approach

Figure from Heiga Zen. 2017. Generative model-based text-to-speech synthesis.

  • What is missing?
    • Robustness to distortion in transmission

TTS

Watermark embedder

Watermark detector

Message 0

1?

Trans.

noise, reverb, codec, …
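A small experiment shows why the least-significant-bit watermark is fragile. The sketch below is an illustration under assumptions (a sine tone as the TTS output, additive Gaussian noise as the only transmission effect, a noise level of three quantization steps chosen arbitrarily): it embeds one bit per sample, adds the noise, and measures the bit error rate before and after.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for a TTS output: one second of a 220 Hz tone at 16 kHz.
speech = (8000 * np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)).astype(np.int16)
bits = rng.integers(0, 2, size=speech.size)

# Embed: overwrite the least significant bit of every sample.
marked = ((speech.view(np.uint16) & np.uint16(0xFFFE)) | bits.astype(np.uint16)).view(np.int16)

# "Transmission": additive Gaussian noise with a std of 3 quantization steps.
noise = rng.normal(0.0, 3.0, size=marked.size)
noisy = np.clip(np.round(marked + noise), -32768, 32767).astype(np.int16)


def bit_error_rate(pcm):
    """Fraction of extracted least significant bits that differ from the message."""
    return float(np.mean((pcm.view(np.uint16) & np.uint16(1)) != bits))


ber_clean = bit_error_rate(marked)
ber_noisy = bit_error_rate(noisy)
print(f"BER without transmission: {ber_clean:.3f}")
print(f"BER after weak noise:     {ber_noisy:.3f}")
```

Without transmission the message is recovered exactly. After the noise, the extracted bits are close to coin flips, even though the noise is far weaker than the signal.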

87

88 of 179

Watermark – post-processing approach

  • Goals
    • Fidelity: the original and the watermarked speech are perceptually the same
    • Capacity: the message carries many bits
    • Robustness: the watermark remains after transmission
    • Security (Cayre 2005)

F. Cayre, C. Fontaine, and T. Furon. 2005. Watermarking security: theory and practice. IEEE Trans. Signal Process. 53, 10 (October 2005), 3976–3987.

TTS

Watermark embedder

Watermark detector

Message 1010101

1010101

Trans.

Focus of most recent papers

88

89 of 179

Watermark – post-processing approach

  • Main approach
    • Non-DNN approaches (Hua 2016)
      • least significant bit, quantization index, spread spectrum, side information, patchwork …

Guang Hua, Jiwu Huang, Yun Q Shi, Jonathan Goh, and Vrizlynn LL Thing. 2016. Twenty years of digital audio watermarking—a comprehensive review. Signal processing 128, (2016), 222–242.

Guangyu Chen, Yu Wu, Shujie Liu, Tao Liu, Xiaoyong Du, and Furu Wei. 2023. Wavmark: Watermarking for audio generation. arXiv preprint arXiv:2308.12770 (2023).

Yizhu Wen, Ashwin Innuganti, Aaron Bien Ramos, Hanqing Guo, and Qiben Yan. 2025. SoK: How Robust is Audio Watermarking in Generative AI models?

TTS

Watermark embedder

Watermark detector

Message 1010101

1010101

Trans.

(Chen 2023)

Patchwork

DNN

not sufficiently robust (Chen 2023, Wen 2025)
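Of the non-DNN families listed above, spread spectrum is the easiest to sketch. The code below is a simplified time-domain variant written for illustration, not the method of any cited paper: the function names, the frame-per-bit layout, the strength `alpha`, and the Gaussian stand-in for speech are assumptions. A secret key seeds a ±1 carrier, the embedder adds the carrier with a sign set by each bit, and the detector correlates the received signal with the same carrier.

```python
import numpy as np


def ss_embed(x, bits, key, alpha=0.02):
    """Add a key-dependent +/-1 carrier to each frame, signed by the message bit."""
    rng = np.random.default_rng(key)
    frame = x.size // bits.size
    carrier = rng.choice([-1.0, 1.0], size=(bits.size, frame))
    y = x.copy()
    y[: bits.size * frame] += (alpha * (2 * bits[:, None] - 1) * carrier).ravel()
    return y


def ss_detect(y, n_bits, key):
    """Regenerate the carrier from the key and decode each bit by correlation."""
    rng = np.random.default_rng(key)
    frame = y.size // n_bits
    carrier = rng.choice([-1.0, 1.0], size=(n_bits, frame))
    corr = np.sum(y[: n_bits * frame].reshape(n_bits, frame) * carrier, axis=1)
    return (corr > 0).astype(int)


rng = np.random.default_rng(1)
host = rng.normal(0.0, 0.1, size=64000)        # hypothetical stand-in for 4 s of speech
message = rng.integers(0, 2, size=16)

marked = ss_embed(host, message, key=1234)
received = marked + rng.normal(0.0, 0.05, size=marked.size)  # noisy "transmission"
decoded = ss_detect(received, message.size, key=1234)
print("bits recovered:", int(np.sum(decoded == message)), "of", message.size)
```

Because each bit is spread over thousands of samples, additive noise averages out in the correlation. That is the robustness least-significant-bit embedding lacks, paid for with a much lower capacity.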

89

90 of 179

Watermark – post-processing approach

  • Main approach
    • Non-DNN approaches
    • DNN approaches
      • frequency domain: WavMark (Chen 2023) Timbre (Liu 2023)
      • time domain: AudioSeal (Roman 2024)

Guangyu Chen, Yu Wu, Shujie Liu, Tao Liu, Xiaoyong Du, and Furu Wei. 2023. Wavmark: Watermarking for audio generation. arXiv preprint arXiv:2308.12770 (2023).

Chang Liu, Jie Zhang, Tianwei Zhang, Xi Yang, Weiming Zhang, and Nenghai Yu. 2023. Detecting voice cloning attacks via timbre watermarking. In Proc. Network and Distributed System Security Symposium, 2023.

Robin San Roman, Pierre Fernandez, Alexandre Défossez, Teddy Furon, Tuan Tran, and Hady Elsahar. 2024. Proactive detection of voice cloning with localized watermarking. In Proc. ICML, 2024.

TTS

Watermark embedder

Watermark detector

Message 1010101

1010101

Trans.

DNN

DNN

90

91 of 179

Watermark – AudioSeal

Robin San Roman, Pierre Fernandez, Alexandre Défossez, Teddy Furon, Tuan Tran, and Hady Elsahar. 2024. Proactive detection of voice cloning with localized watermarking. In Proc. ICML, 2024.

TTS

Watermark embedder

Watermark detector

Message 1010101

1010101

Trans.

+

Filters,

Mp3,

AAC,

91

92 of 179

Watermark – AudioSeal

Robin San Roman, Pierre Fernandez, Alexandre Défossez, Teddy Furon, Tuan Tran, and Hady Elsahar. 2024. Proactive detection of voice cloning with localized watermarking. In Proc. ICML, 2024.

+

Filters,

Mp3,

AAC,

To preserve quality

To detect the watermark bits
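The two labels above suggest a training objective with two groups of terms. The sketch below only illustrates that structure and is not AudioSeal's actual loss: the L1 waveform term is a simple stand-in for the quality terms, and the function names and weights `w_fid` and `w_det` are assumptions. In training, the detector logits would be computed on the watermarked signal after augmentations such as filtering or MP3/AAC coding.

```python
import numpy as np


def fidelity_loss(x, x_w):
    """Stand-in for the quality terms: mean absolute waveform difference."""
    return float(np.mean(np.abs(np.asarray(x_w, float) - np.asarray(x, float))))


def detection_loss(logits, bits):
    """Binary cross-entropy between detector logits and the embedded bits."""
    z, y = np.asarray(logits, float), np.asarray(bits, float)
    # Numerically stable form of -[y*log(s(z)) + (1-y)*log(1-s(z))].
    return float(np.mean(np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z)))))


def total_loss(x, x_w, logits, bits, w_fid=1.0, w_det=1.0):
    """Weighted sum: keep the audio unchanged while keeping the bits decodable."""
    return w_fid * fidelity_loss(x, x_w) + w_det * detection_loss(logits, bits)


x = np.zeros(8)
x_w = x + 0.01                                  # tiny additive watermark
good = total_loss(x, x_w, logits=[10.0, -10.0], bits=[1, 0])
bad = total_loss(x, x_w, logits=[-10.0, 10.0], bits=[1, 0])
print(f"correct bits: {good:.4f}, flipped bits: {bad:.4f}")
```

The two terms pull in opposite directions: a stronger watermark is easier to detect after distortion but raises the fidelity term.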

92

93 of 179

Watermark – Timbre

Chang Liu, Jie Zhang, Tianwei Zhang, Xi Yang, Weiming Zhang, and Nenghai Yu. 2023. Detecting voice cloning attacks via timbre watermarking. In Proc. Network and Distributed System Security Symposium, 2023.

TTS

Watermark embedder

Watermark detector

Message 1010101

1010101

Trans.

93

94 of 179

Watermark – help deepfake detection?

Chia-Hua Wu, Wanying Ge, Xin Wang, Junichi Yamagishi, Yu Tsao, and Hsin-Min Wang. 2025. A Comparative Study on Proactive and Passive Detection of Deepfake Speech. In Proc. Interspeech, June 17, 2025. 5328--5332.

TTS

Watermark embedder

Watermark detector

1 or 0

1 or 0

Trans.

Detector

TTS

REAL

FAKE

Trans.

Conventional deepfake detection

Watermark-based detection

VS

AudioSeal, Timbre

AASIST, SSL-AASIST
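One way to turn a multi-bit watermark decoder into the binary "1 or 0" decision shown above is to count how many decoded bits match the expected message. The sketch below is an assumption for illustration, not the decision rule of AudioSeal or Timbre: it assumes non-watermarked audio decodes to independent fair coin flips, and the function names are hypothetical. It gives the resulting false-positive rate as a binomial tail.

```python
from math import comb


def false_positive_rate(n_bits, min_matches):
    """P(at least min_matches of n_bits agree) when decoded bits are fair coin flips."""
    return sum(comb(n_bits, k) for k in range(min_matches, n_bits + 1)) / 2 ** n_bits


def is_watermarked(decoded, expected, min_matches):
    """Declare 'watermarked' when enough decoded bits match the expected message."""
    matches = sum(int(d == e) for d, e in zip(decoded, expected))
    return matches >= min_matches


for t in (16, 14, 12):
    print(f"16-bit message, >= {t} matches: FPR = {false_positive_rate(16, t):.6f}")
```

Lowering the required number of matches tolerates bit errors caused by transmission but raises the chance of flagging real speech as watermarked.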

94

95 of 179

Watermark – help deepfake detection?

Chia-Hua Wu, Wanying Ge, Xin Wang, Junichi Yamagishi, Yu Tsao, and Hsin-Min Wang. 2025. A Comparative Study on Proactive and Passive Detection of Deepfake Speech. In Proc. Interspeech, June 17, 2025. 5328--5332.

TTS

Watermark embedder

Watermark detector

1 or 0

1 or 0

Trans.

Detector

TTS

REAL

FAKE

Trans.

Conventional deepfake detection

Watermark-based detection

watermark-based

conventional detectors

No transmission

Watermark EER ~0%

Trans.

95

96 of 179

Watermark – help deepfake detection?

Chia-Hua Wu, Wanying Ge, Xin Wang, Junichi Yamagishi, Yu Tsao, and Hsin-Min Wang. 2025. A Comparative Study on Proactive and Passive Detection of Deepfake Speech. In Proc. Interspeech, June 17, 2025. 5328--5332.

TTS

Watermark embedder

Watermark detector

1 or 0

1 or 0

Trans.

Detector

TTS

REAL

FAKE

Trans.

Conventional deepfake detection

Watermark-based detection

watermark-based

conventional detectors

No transmission

Watermark EER ~0%

Watermark EER >50%

Trans.

Audio codec
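The equal error rate (EER) quoted above is the operating point where the miss rate equals the false-alarm rate. The sketch below is a minimal threshold sweep on synthetic scores. The score distributions are invented for illustration and are not results from the cited study. It contrasts a detector whose scores separate the two classes with one whose scores no longer carry any information.

```python
import numpy as np


def compute_eer(pos_scores, neg_scores):
    """EER from a threshold sweep: point where miss rate and false-alarm rate meet."""
    thresholds = np.sort(np.concatenate([pos_scores, neg_scores]))
    miss = np.array([np.mean(pos_scores < t) for t in thresholds])
    false_alarm = np.array([np.mean(neg_scores >= t) for t in thresholds])
    i = int(np.argmin(np.abs(miss - false_alarm)))
    return float((miss[i] + false_alarm[i]) / 2)


rng = np.random.default_rng(0)

# Hypothetical scores: well separated (e.g. watermark detector, no transmission).
eer_separated = compute_eer(rng.normal(5.0, 1.0, 500), rng.normal(-5.0, 1.0, 500))

# Hypothetical scores: both classes drawn from the same distribution
# (e.g. the watermark has been destroyed before detection).
eer_chance = compute_eer(rng.normal(0.0, 1.0, 500), rng.normal(0.0, 1.0, 500))

print(f"EER, separated scores:   {100 * eer_separated:.1f}%")
print(f"EER, overlapping scores: {100 * eer_chance:.1f}%")
```

An EER near 50% means the detector does no better than random guessing.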

96

97 of 179

Watermark – challenges

Slavko Kovačević, Murilo Z. Silvestre, Kosta Pavlović, Petar Nedić, and Igor Djurović. 2025. DeepMark Benchmark: Redefining Audio Watermarking Robustness. March 06, 2025.

Patrick O’Reilly, Zeyu Jin, Jiaqi Su, and Bryan Pardo. 2025. Deep Audio Watermarks are Shallow: Limitations of Post-Hoc Watermarking Techniques for Speech.

Yigitcan Özer, Woosung Choi, Joan Serrà, Mayank Kumar Singh, Wei-Hsiang Liao, and Yuki Mitsufuji. 2025. A comprehensive real-world assessment of audio watermarking algorithms: Will they survive neural codecs? In Proc. Interspeech, 2025. 5113–5117.

Yigitcan Özer, Wanying Ge, Zhe Zhang, Xin Wang, and Junichi Yamagishi. 2026. Self Voice Conversion as an Attack against Neural Audio Watermarking.

TTS

Watermark embedder

Watermark detector

1 or 0

1 or 0

Trans.

(Kovačević 2025)

DNN codecs remove watermark

(O’Reilly 2025, Kovačević 2025, Özer 2025, Özer 2026)

Speech enhancement

DNN tokenizer (codec)

Neural vocoder (GAN)

Variational auto-encoder

Diffusion model …

All watermark models

failed the test

97

98 of 179

Watermark – challenges

Slavko Kovačević, Murilo Z. Silvestre, Kosta Pavlović, Petar Nedić, and Igor Djurović. 2025. DeepMark Benchmark: Redefining Audio Watermarking Robustness. March 06, 2025.

TTS

Watermark embedder

Watermark detector

1 or 0

1 or 0

Trans.

Another watermark

collusion attack

All watermark models are vulnerable

98

99 of 179

Watermark – summary

Yizhu Wen, Ashwin Innuganti, Aaron Bien Ramos, Hanqing Guo, and Qiben Yan. 2025. SoK: How Robust is Audio Watermarking in Generative AI models?

Watermark can be useful

Better robustness is required

Better design for security is required

Ongoing research topic (see latest ref.)

TTS

Watermark embedder

Watermark detector

1 or 0

1 or 0

Trans.

99

100 of 179

Speaker Anonymization

(skip this part)

Anonymization

100

101 of 179

Towards Anonymization

API,

Checkpoint

Attacker

Defender (user)

Watermark detector

Generator

TRUE

FALSE

Trans.

Generator

building

Collaborator

Watermark detector

TRUE

FALSE

Trans.

Dataset

Generator

Training

Watermark

Anonymization

w

w

w

Remove or hide voiceprint

101

102 of 179

Anonymization

Natalia Tomashenko, Brij Mohan Lal Srivastava, Xin Wang, Emmanuel Vincent, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Jose Patino, Jean-François Bonastre, Paul-Gauthier Noé, and Massimiliano Todisco. 2020. Introducing the VoicePrivacy initiative. In Proc. Interspeech, October 2020. ISCA, 1693–1697.

  • Voice conversion towards a non-existing pseudo speaker
  • Goals (Tomashenko 2020)
    • Utility
      • Keep linguistic contents
      • Keep naturalness
    • Privacy / security
      • Hide the voice biometric (or identity) – unlinkability

Anonymization

Pseudo speaker

same speaker?

102

103 of 179

103

104 of 179

Anonymization – example system

Natalia Tomashenko, Brij Mohan Lal Srivastava, Xin Wang, Emmanuel Vincent, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Jose Patino, Jean-François Bonastre, Paul-Gauthier Noé, and Massimiliano Todisco. 2020. Introducing the VoicePrivacy Initiative. In Proc. Interspeech

https://www.nii.ac.jp/today/103/8.html

Anonymization

Linguistic

Paralinguistic

104

105 of 179

Anonymization – algorithms

Fuming Fang, Xin Wang, Junichi Yamagishi, Isao Echizen, Massimiliano Todisco, Nicholas Evans, and Jean-Francois Bonastre. 2019. Speaker anonymization using X-vector and neural waveform models. In Proc. SSW, 2019. 155–160.

Natalia Tomashenko, Brij Mohan Lal Srivastava, Xin Wang, Emmanuel Vincent, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Jose Patino, Jean-François Bonastre, Paul-Gauthier Noé, and Massimiliano Todisco. 2020. Introducing the VoicePrivacy Initiative. In Proc. Interspeech

  • k-anonymity
    • Select k* farthest speakers
    • Randomly select k out of k*

input

Pool of speakers

105

106 of 179

Anonymization – algorithms

Fuming Fang, Xin Wang, Junichi Yamagishi, Isao Echizen, Massimiliano Todisco, Nicholas Evans, and Jean-Francois Bonastre. 2019. Speaker anonymization using X-vector and neural waveform models. In Proc. SSW, 2019. 155–160.

Natalia Tomashenko, Brij Mohan Lal Srivastava, Xin Wang, Emmanuel Vincent, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Jose Patino, Jean-François Bonastre, Paul-Gauthier Noé, and Massimiliano Todisco. 2020. Introducing the VoicePrivacy Initiative. In Proc. Interspeech

  • k-anonymity
    • Select k* farthest speakers
    • Randomly select k out of k*
    • Compute average
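The three steps above can be sketched on speaker embeddings. This is an illustration under assumptions rather than the cited systems' implementation: cosine distance is used as the distance measure, the pool is random data, and the function name and the values of `k_star` and `k` are hypothetical.

```python
import numpy as np


def pseudo_speaker(x, pool, k_star=50, k=20, seed=0):
    """Build a pseudo-speaker embedding for input embedding x from a speaker pool."""
    rng = np.random.default_rng(seed)
    # Cosine distance between the input speaker and every pool speaker.
    xn = x / np.linalg.norm(x)
    pn = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    dist = 1.0 - pn @ xn
    # Step 1: keep the k_star pool speakers farthest from the input.
    farthest = np.argsort(dist)[-k_star:]
    # Step 2: randomly pick k of them.
    chosen = rng.choice(farthest, size=k, replace=False)
    # Step 3: average their embeddings.
    return pool[chosen].mean(axis=0)


rng = np.random.default_rng(1)
pool = rng.normal(size=(500, 64))     # hypothetical pool: 500 speakers, 64-dim embeddings
x = rng.normal(size=64)               # hypothetical input speaker embedding

target = pseudo_speaker(x, pool)
cos = float(target @ x / (np.linalg.norm(target) * np.linalg.norm(x)))
print(f"cosine similarity between input and pseudo speaker: {cos:.3f}")
```

The random choice in step 2 means different seeds give different pseudo speakers for the same input. The averaging in step 3 pulls the result toward the pool mean.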

input

Pool of speakers

106

107 of 179

Anonymization – algorithms

Pierre Champion. 2023. Anonymizing Speech: Evaluating and Designing Speaker Anonymization Techniques.

Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi, and Natalia Tomashenko. 2023. Speaker Anonymization Using Orthogonal Householder Neural Network. IEEE/ACM Trans. Audio Speech Lang. Process. 31, (2023), 3681–3695.

  • k-anonymity
    • Select k* farthest speakers
    • Randomly select k out of k*
    • Compute average

    • Issues (Champion 2023, Miao 2023)
      • need a pool of speakers
      • averaged voice (poor quality)

input

Pool of speakers

107

108 of 179

Anonymization – algorithms

Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi, and Natalia Tomashenko. 2023. Speaker Anonymization Using Orthogonal Householder Neural Network. IEEE/ACM Trans. Audio Speech Lang. Process. 31, (2023), 3681–3695.

Sarina Meyer, Pascal Tilli, Pavel Denisov, Florian Lux, Julia Koch, and Ngoc Thang Vu. 2023. Anonymizing speech with generative adversarial networks to preserve speaker privacy. In Proc. SLT, 2023. IEEE, 912–919

  • k-anonymity
    • Select k* farthest speakers
    • Randomly select k out of k*
    • Compute average

  • Geometry-based (Miao 2023)
    • rotation

  • GAN-based (Meyer 2023)
    • unconditional GAN
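The rotation in the geometry-based approach can be illustrated with Householder reflections. The sketch below is a simplification under assumptions: the reflection vectors are drawn at random, whereas the cited method learns them with a neural network, and the function names are hypothetical. Each reflection is orthogonal, and a product of an even number of reflections is a rotation.

```python
import numpy as np


def householder(v):
    """Reflection matrix H = I - 2 v v^T / ||v||^2 (orthogonal, det = -1)."""
    v = v / np.linalg.norm(v)
    return np.eye(v.size) - 2.0 * np.outer(v, v)


def random_rotation(dim, n_reflections=4, seed=0):
    """Compose an even number of reflections into a rotation (det = +1)."""
    assert n_reflections % 2 == 0
    rng = np.random.default_rng(seed)
    q = np.eye(dim)
    for _ in range(n_reflections):
        q = householder(rng.normal(size=dim)) @ q
    return q


rng = np.random.default_rng(1)
a, b = rng.normal(size=8), rng.normal(size=8)   # hypothetical speaker embeddings
q = random_rotation(8)

a_anon, b_anon = q @ a, q @ b
print("norm before/after:", np.linalg.norm(a), np.linalg.norm(a_anon))
print("inner product before/after:", a @ b, a_anon @ b_anon)
```

Because the transform is orthogonal, it moves every embedding while preserving norms and inner products between embeddings. No averaging over a speaker pool is involved.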

108

109 of 179

Anonymization – evaluation

Natalia Tomashenko, Brij Mohan Lal Srivastava, Xin Wang, Emmanuel Vincent, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Jose Patino, Jean-François Bonastre, Paul-Gauthier Noé, and Massimiliano Todisco. 2020. Introducing the VoicePrivacy initiative. In Proc. Interspeech, October 2020. ISCA, 1693–1697.

  • Voice conversion towards a non-existing pseudo speaker
  • Evaluation (Tomashenko 2020)
    • Utility
      • Keep linguistic contents
      • Keep naturalness
    • Privacy / security
      • Hide the voice biometric (or identity)

Anonymization

speech recognition (ASR),

human listening test

speaker verification (ASV)

109

110 of 179

Anonymization – evaluation

Michele Panariello, Natalia Tomashenko, Xin Wang, Xiaoxiao Miao, Pierre Champion, Hubert Nourtel, Massimiliano Todisco, Nicholas Evans, Emmanuel Vincent, and Junichi Yamagishi. 2024. The VoicePrivacy 2022 Challenge: Progress and Perspectives in Voice Anonymisation. IEEE/ACM Trans. Audio Speech Lang. Process. (2024), 1–14.

  • Evaluation of security
    • Use speaker verification
    • Ideally 50% error rate (random guess)

Anonymization

Speaker verification

SAME

NOT SAME

Attacker

Collaborator

110

111 of 179

Anonymization – evaluation

Michele Panariello, Natalia Tomashenko, Xin Wang, Xiaoxiao Miao, Pierre Champion, Hubert Nourtel, Massimiliano Todisco, Nicholas Evans, Emmanuel Vincent, and Junichi Yamagishi. 2024. The VoicePrivacy 2022 Challenge: Progress and Perspectives in Voice Anonymisation. IEEE/ACM Trans. Audio Speech Lang. Process. (2024), 1–14.

  • Evaluation of security
    • Use speaker verification
    • Ideally 50% error rate (random guess)

Anonymization

Speaker verification

SAME

NOT SAME

Attacker

Collaborator

See more in latest VoicePrivacy report (Panariello 2024)

Anonymization

Semi-informed

random seed N

random

seed M

111

112 of 179

Anonymization – best systems in 2022

Michele Panariello, Natalia Tomashenko, Xin Wang, Xiaoxiao Miao, Pierre Champion, Hubert Nourtel, Massimiliano Todisco, Nicholas Evans, Emmanuel Vincent, and Junichi Yamagishi. 2024. The VoicePrivacy 2022 Challenge: Progress and Perspectives in Voice Anonymisation. IEEE/ACM Trans. Audio Speech Lang. Process. (2024), 1–14.

  • Evaluation of security
    • Use speaker verification
    • Ideally 50% error rate (random guess)

Anonymization

Speaker verification

SAME

NOT SAME

Attacker

Collaborator

Anonymization

Semi-informed

random seed

random

seed’

Speech recognition error rate

Speaker recognition error rate

GAN-based system

Input speech

112

113 of 179

Anonymization – best systems in 2022

Tomashenko, N., Vincent, E., Tommasi, M. (2025) Exploiting Context-dependent Duration Features for Voice Anonymization Attack Systems. Proc. Interspeech 2025, 5128-5132, doi: 10.21437/Interspeech.2025-2317

  • Evaluation of security
    • Use speaker verification
    • Ideally 50% error rate (random guess)

Anonymization

Duration-based verification

SAME

NOT SAME

Collaborator

random seed

Speech recognition error rate

Speaker recognition error rate

GAN-based system

The original speaker can be recognized from phoneme durations

113

114 of 179

Anonymization – limitations

Anonymization

Feature-based verification

SAME

NOT SAME

Collaborator

random seed

Speaker-related information: accent, word choice

114

115 of 179

Anonymization – summary

Anonymization

Feature-based verification

SAME

NOT SAME

Collaborator

random seed

  • Voice conversion towards a pseudo speaker
  • Evaluation: utility & security
  • Young research topic
    • How to anonymize all speaker information
    • How to implement stronger attackers

https://www.voiceprivacychallenge.org/

115

116 of 179

Summary

116

117 of 179

Summary

  • Progress of speech generation
    • DNN
    • Improved quality & flexibility

117

118 of 179

Summary

  • Progress of speech generation
    • DNN
    • Improved quality & flexibility

  • Misuse of speech generation
    • Humans and machines can be fooled

Generator

Attacker

Humans

Telephone call

Social media

Speaker verification

Services

Transmission

118

119 of 179

Summary

  • Progress of speech generation
    • DNN
    • Improved quality & flexibility

  • Misuse of speech generation
    • Humans and machines can be fooled

  • Deepfake detection
    • Binary classification
    • Generalization remains unsolved

Attacker

Humans

Telephone call

Social media

Speaker verification

Services

Defender

Detector

Generator

REAL

FAKE

Transmission

119

120 of 179

Summary

  • Progress of speech generation
    • DNN
    • Improved quality & flexibility

  • Misuse of speech generation
    • Humans and machines can be fooled

  • Deepfake detection
    • Binary classification
    • Generalization remains unsolved

  • Speech Watermark
    • Robustness remains unsolved

Attacker

Defender

Watermark detector

Generator

Watermarked

Not watermarked

Transmission

Generator

API

Watermark

w

w

120

121 of 179

Summary

  • Progress of speech generation
    • DNN
    • Improved quality & flexibility

  • Misuse of speech generation
    • Humans and machines can be fooled

  • Deepfake detection
    • Binary classification
    • Generalization remains unsolved

  • Speech Watermark
    • Robustness remains unsolved

Signal processing

Computer science

signal & system, filters, …

information theory, search, …

pattern recognition

decision theory

deep neural networks

Machine learning

Linguistics

phonetics, phonology, …

121

122 of 179

Thank you

all the slides

https://tonywangx.github.io/slide.html#talk

PhD, postdoc, intern

122

123 of 179

A system should be secure, even if everything about the system, except the key, is public knowledge.

Kerckhoffs's principle

123

124 of 179

Reference

  • Text-to-speech synthesis (general introduction)
    • Dutoit, T. An Introduction to Text-to-speech Synthesis. (Kluwer Academic Publishers, 1997).
    • Taylor, P. Text-to-Speech Synthesis. (Cambridge University Press, 2009).
    • Chapter 16, Huang, X., Acero, A., Hon, H.-W. & Reddy, R. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. (Prentice Hall PTR, 2001).
    • Chapter 8, Jurafsky, D. & Martin, J. H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. (Prentice Hall PTR, 2000).

  • Text-to-speech synthesis (HMM)
    • Tokuda, K. et al. Speech synthesis based on hidden Markov models. Proc. IEEE 101, 1234–1252 (2013)
    • Zen, H., Tokuda, K. & Black, A. W. Statistical parametric speech synthesis. Speech Commun. 51, 1039–1064 (2009)

124

125 of 179

Reference

125

126 of 179

Reference

  • Neural speech synthesis beyond DNN-HMM
    • See survey paper: Xu Tan. 2023. Neural Text-to-Speech Synthesis. Springer Nature Singapore, Singapore. https://doi.org/10.1007/978-981-99-0827-1
    • Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, and others. 2018. Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions. In Proc. ICASSP, 2018. IEEE, 4779–4783.
    • Ye Jia, Yu Zhang, Ron Weiss, Quan Wang, Jonathan Shen, Fei Ren, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu, and Others. 2018. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In Proc. NIPS, 2018. 4480–4490.

  • Speech synthesis & LLMs (Recent)

126

127 of 179

Reference

  • Early works on voice spoofing & detection
    • Bryan L Pellom and John H L Hansen. 1999. An experimental study of speaker verification sensitivity to computer voice-altered imposters. In Proc. ICASSP, 837–840.
    • Takashi Masuko, Takafumi Hitotsumatsu, Keiichi Tokuda, and Takao Kobayashi. 1999. On the security of HMM-based speaker verification systems against imposture using synthetic speech. In Proc. Eurospeech. 1223--1226.
    • Takayuki Satoh, Takashi Masuko, Takao Kobayashi, and Keiichi Tokuda. 2001. A robust speaker verification system against imposture using an HMM-based speech synthesis system. In Proc. Eurospeech, 2001. 759–762.
    • Akira Shiozaki, Akio Ogihara, and Hitoshi Unno. 2005. Discrimination method of synthetic speech using pitch frequency against synthetic speech falsification. IEICE TRANSACTIONS on Fundamentals E88-A, 1 (January 2005), 280–286.
    • Driss Matrouf, J-F Bonastre, and Corinne Fredouille. 2006. Effect of speech transformation on impostor acceptance. In Proc. ICASSP, 2006.
    • Guillaume Galou and Gérard Chollet. 2011. Synthetic voice forgery in the forensic context: a short tutorial. In Forensic speech and audio analysis working group (ENFSI-FSAAWG), 2011. Rome, Italy. Retrieved from https://imt.hal.science/hal-00625918

127

128 of 179

Reference

  • Voice deepfake detection before 2015
    • Nicholas WD Evans, Tomi Kinnunen, and Junichi Yamagishi. 2013. Spoofing and countermeasures for automatic speaker verification. In Proc. Interspeech.
    • Lian-Wu Chen, Wu Guo, and Li-Rong Dai. 2010. Speaker verification against synthetic speech. In Proc. ISCSLP, November 2010. IEEE, Tainan, Taiwan, 309–312.
    • Phillip L. De Leon, Vijendra Raj Apsingekar, Michael Pucher, and Junichi Yamagishi. 2010. Revisiting the security of speaker verification systems against imposture using synthetic speech. In Proc. ICASSP, 1798–1801.
    • Phillip L. De Leon, Inma Hernaez, Ibon Saratxaga, Michael Pucher, and Junichi Yamagishi. 2011. Detection of synthetic speech for the problem of imposture. In Proc. ICASSP, 4844–4847.
    • Phillip DeLeon, Michael Pucher, and Junichi Yamagishi. 2010. Evaluation of the vulnerability of speaker verification to synthetic speech. In Proc. Odyssey 2010.
    • Federico Alegre, Ravichander Vipperla, and Nicholas Evans. 2012. Spoofing countermeasures for the protection of automatic speaker recognition from attacks with artificial signals. Proc. Interspeech, 54–58.
    • Zhizheng Wu, Eng Siong Chng, and Haizhou Li. 2012. Detecting converted speech and natural speech for anti-spoofing attack in speaker recognition. In Proc. Interspeech.

128

129 of 179

Reference

  • Feature for deepfake detection
    • LFCC & MFCC: Md Sahidullah, Tomi Kinnunen, and Cemal Hanilçi. 2015. A comparison of features for synthetic speech detection. In Proc. Interspeech, 2015. 2087–2091.
    • CQCC: Massimiliano Todisco, Héctor Delgado, and Nicholas Evans. 2017. Constant Q cepstral coefficients: A spoofing countermeasure for automatic speaker verification. Computer Speech & Language 45, (2017), 516–535. https://doi.org/10.1016/j.csl.2017.01.001
    • Phase-based features: Madhu R Kamble, Hardik B Sailor, Hemant A Patil, and Haizhou Li. 2020. Advances in anti-spoofing: from the perspective of ASVspoof challenges. APSIPA Transactions on Signal and Information Processing 9, (2020), e2. https://doi.org/10.1017/ATSIP.2019.21
    • SSL-based front end: Xin Wang and Junichi Yamagishi. 2022. Investigating Self-Supervised Front Ends for Speech Spoofing Countermeasures. In Proc. Odyssey, 2022. 100–106. https://doi.org/10.21437/Odyssey.2022-14
    • Hemlata Tak, Massimiliano Todisco, Xin Wang, Jee-weon Jung, Junichi Yamagishi, and Nicholas Evans. 2022. Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation. In Proc. Odyssey, 2022. 112–119.

129

130 of 179

Reference

  • Classifiers
    • LCNN: Galina Lavrentyeva, Sergey Novoselov, Andzhukaev Tseren, Marina Volkova, Artem Gorlanov, and Alexandr Kozlov. 2019. STC Antispoofing Systems for the ASVspoof2019 Challenge. In Interspeech 2019, 2019. 1033–1037.
    • RawNet2: Hemlata Tak, Jose Patino, Massimiliano Todisco, Andreas Nautsch, Nicholas Evans, and Anthony Larcher. 2020. End-to-end anti-spoofing with RawNet2. In Proc. ICASSP, 2020. 6369–6373. https://doi.org/10.1109/ICASSP39728.2021.9414234
    • AASIST: Jee-weon Jung, Hee-Soo Heo, Hemlata Tak, Hye-jin Shim, Joon Son Chung, Bong-Jin Lee, Ha-Jin Yu, and Nicholas Evans. 2022. AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks. In Proc. ICASSP, 2022. IEEE, 6367–6371.
    • Mamba: Yang Xiao and Rohan Kumar Das. 2025. XLSR-Mamba: A Dual-Column Bidirectional State Space Model for Spoofing Attack Detection. IEEE Signal Process. Lett. 32, (2025), 1276–1280. https://doi.org/10.1109/lsp.2025.3547861
    • SSL-SLS: Qishan Zhang, Shuangbing Wen, and Tao Hu. 2024. Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier. In Proc. ACM MM, 2024. 6765–6773. https://doi.org/10.1145/3664647.3681345

130

131 of 179

Reference

  • Metrics
    • Xin Wang, Héctor Delgado, Hemlata Tak, Jee-weon Jung, Hye-jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen, Nicholas Evans, Kong Aik Lee, and Junichi Yamagishi. 2024. ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale. In ASVspoof Workshop 2024, 2024. 1--8. https://doi.org/10.21437/ASVspoof.2024-1
    • t-EER: Tomi H. Kinnunen, Kong Aik Lee, Hemlata Tak, Nicholas Evans, and Andreas Nautsch. 2023. t-EER: Parameter-Free Tandem Evaluation of Countermeasures and Biometric Comparators. IEEE Trans. Pattern Anal. Mach. Intell. (2023), 1–16. https://doi.org/10.1109/TPAMI.2023.3313648
    • t-DCF: Tomi Kinnunen, Hector Delgado, Nicholas Evans, Kong Aik Lee, Ville Vestman, Andreas Nautsch, Massimiliano Todisco, Xin Wang, Md Sahidullah, Junichi Yamagishi, and Douglas A Reynolds. 2020. Tandem Assessment of Spoofing Countermeasures and Automatic Speaker Verification: Fundamentals. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, (2020), 2195–2210. https://doi.org/10.1109/TASLP.2020.3009494

131

132 of 179

Reference

  • Our recent works
    • Wanying Ge, Xin Wang, Xuechen Liu, and Junichi Yamagishi. 2025. Post-training for Deepfake Speech Detection. In Proc. ASRU, June 26, 2025. (accepted).
    • Hemlata Tak, Massimiliano Todisco, Xin Wang, Jee-weon Jung, Junichi Yamagishi, and Nicholas Evans. 2022. Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation. In Proc. Odyssey, 2022. 112–119.
    • Xin Wang and Junichi Yamagishi. 2022. Investigating Self-Supervised Front Ends for Speech Spoofing Countermeasures. In Proc. Odyssey, 2022. 100–106.
    • Xin Wang and Junichi Yamagishi. 2023. Investigating Active-learning-based Training Data Selection for Speech Spoofing Countermeasure. In Proc. SLT, 2023. 585–592.
    • Xin Wang and Junichi Yamagishi. 2023. Spoofed Training Data for Speech Spoofing Countermeasure Can Be Efficiently Created Using Neural Vocoders. In Proc. ICASSP,
    • Xin Wang and Junichi Yamagishi. 2024. Can Large-Scale Vocoded Spoofed Data Improve Speech Spoofing Countermeasure with a Self-Supervised Front End? In Proc. ICASSP, April 14, 2024. 10311–10315.

132

133 of 179

Reference

    • Related-work summary by Nicolas Muller: https://deepfake-total.com/related_work/

133

134 of 179

Reference

  • Passive detection – new task formulation
    • Tianxiang Chen, Avrosh Kumar, Parav Nagarsheth, Ganesh Sivaraman, and Elie Khoury. 2020. Generalization of audio deepfake detection. In Proc. Odyssey, 2020. 132–137.
    • Nicolas M. Müller, Nicholas Evans, Hemlata Tak, Philip Sperl, and Konstantin Böttinger. 2024. Harder or different? Understanding generalization of audio deepfake detection. In Interspeech, 2024. 2705–2709.
    • Nicholas Klein, Tianxiang Chen, Hemlata Tak, Ricardo Casal, and Elie Khoury. 2024. Source Tracing of Audio Deepfake Systems. In Proc. Interspeech, September 01, 2024. ISCA, 1100–1104.

134

135 of 179

Reference

  • Watermark – non-DNN methods
    • Guang Hua, Jiwu Huang, Yun Q Shi, Jonathan Goh, and Vrizlynn LL Thing. 2016. Twenty years of digital audio watermarking—a comprehensive review. Signal processing 128, (2016), 222–242. https://doi.org/10.1016/j.sigpro.2016.04.005 (summary paper)
    • B. Chen and G.W. Wornell. 2001. Quantization index modulation: a class of provably good methods for digital watermarking and information embedding. IEEE Trans. Inform. Theory 47, 4 (May 2001), 1423–1443. https://doi.org/10.1109/18.923725 (QIM)
    • Ingemar J Cox, Joe Kilian, Tom Leighton, and Talal Shamoon. 1996. Secure spread spectrum watermarking for images, audio and video. In Proceedings of 3rd IEEE international conference on image processing, 1996. IEEE, 243–246. (spread spectrum)
    • Ingemar J Cox, Matthew L Miller, and Andrew L McKellips. 1999. Watermarking as communications with side information. Proceedings of the IEEE 87, 7 (1999), 1127–1141. (side-channel)
    • Mitchell D Swanson, Bin Zhu, Ahmed H Tewfik, and Laurence Boney. 1998. Robust audio watermarking using perceptual masking. Signal Processing 66, 3 (May 1998), 337–355. https://doi.org/10.1016/S0165-1684(98)00014-0 (audio perceptual mask)

135

136 of 179

Reference

  • Watermark – DNN methods
    • Kosta Pavlović, Slavko Kovačević, Igor Djurović, and Adam Wojciechowski. 2022. Robust speech watermarking by a jointly trained embedder and detector using a DNN. Digital Signal Processing 122, 103381. https://doi.org/10.1016/j.dsp.2021.103381 (early DNN)
    • Robin San Roman, Pierre Fernandez, Alexandre Défossez, Teddy Furon, Tuan Tran, and Hady Elsahar. 2024. Proactive detection of voice cloning with localized watermarking. In Proc. ICML, 2024. (AudioSeal)
    • Chang Liu, Jie Zhang, Tianwei Zhang, Xi Yang, Weiming Zhang, and Nenghai Yu. 2023. Detecting voice cloning attacks via timbre watermarking. In Proc. Network and Distributed System Security Symposium, 2023. (Timbre)
    • Guangyu Chen, Yu Wu, Shujie Liu, Tao Liu, Xiaoyong Du, and Furu Wei. 2023. WavMark: Watermarking for audio generation. arXiv preprint arXiv:2308.12770 (2023). (WavMark)
    • Yixin Liu, Lie Lu, Jihui Jin, Lichao Sun, and Andrea Fanelli. 2025. XAttnMark: Learning Robust Audio Watermarking with Cross-Attention. https://doi.org/10.48550/arXiv.2502.04230 (AudioSeal–like)

136

137 of 179

Reference

  • Watermark – DNN methods
    • Pengcheng Li, Xulong Zhang, Jing Xiao, and Jianzong Wang. 2024. IDEAW: Robust Neural Audio Watermarking with Invertible Dual-Embedding. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, November 2024. Association for Computational Linguistics, Miami, Florida, USA, 4500–4511. https://doi.org/10.18653/v1/2024.emnlp-main.258 (WavMark-like)
    • Shiqiang Wu, Jie Liu, Ying Huang, Hu Guan, and Shuwu Zhang. 2023. Adversarial Audio Watermarking: Embedding Watermark into Deep Feature. In Proc. ICME, 61–66. https://doi.org/10.1109/ICME55011.2023.00019. (Adversarial-like)
    • Jiren Zhu, Russell Kaplan, Justin Johnson, and Li Fei-Fei. 2018. HiDDeN: Hiding Data With Deep Networks. In Proc. ECCV, July 26, 2018. arXiv, 657–672. https://doi.org/10.48550/arXiv.1807.09937 (Adversarial-like)
    • Junzuo Zhou, Jiangyan Yi, Tao Wang, Jianhua Tao, Ye Bai, Chu Yuan Zhang, Yong Ren, and Zhengqi Wen. 2024. TraceableSpeech: Towards proactively traceable text-to-speech with watermarking. In Proc. Interspeech. 2250–2254. https://doi.org/10.21437/Interspeech.2024-534

137

138 of 179

Reference

  • Watermark – DNN methods
    • Mayank Kumar Singh, Naoya Takahashi, Weihsiang Liao, and Yuki Mitsufuji. 2024. SilentCipher: Deep audio watermarking. In Proc. Interspeech. https://doi.org/10.21437/Interspeech.2024-174
    • Patrick O’Reilly, Zeyu Jin, Jiaqi Su, and Bryan Pardo. 2024. Maskmark: Robust neural watermarking for real and synthetic speech. In Proc. ICASSP, 2024, 4650–4654. https://doi.org/10.1109/ICASSP48485.2024.10447253
    • Shengpeng Ji, Ziyue Jiang, Jialong Zuo, Minghui Fang, Yifu Chen, Tao Jin, and Zhou Zhao. 2024. Speech Watermarking with Discrete Intermediate Representations. Retrieved from https://arxiv.org/abs/2412.13917 (discrete-domain)
    • Robin San Roman, Pierre Fernandez, Antoine Deleforge, Yossi Adi, and Romain Serizel. 2025. Latent Watermarking of Audio Generative Models. In Proc. ICASSP, April 06, 2025. IEEE, Hyderabad, India, 1–5. https://doi.org/10.1109/ICASSP49660.2025.10889782 (Application of AudioSeal to generator)

138

139 of 179

Reference

  • Watermark – DNN collaborative
    • Lauri Juvela and Xin Wang. 2024. Collaborative watermarking for adversarial speech synthesis. In Proc. ICASSP, 2024. 11231–11235. https://doi.org/10.1109/icassp48485.2024.10448134
    • Lauri Juvela and Xin Wang. 2025. Audio codec augmentation for robust collaborative watermarking of speech synthesis. In Proc. ICASSP, 2025. https://doi.org/10.1109/ICASSP49660.2025.10888976
    • Xiangyu Cheng, Yaofei Wang, Chang Liu, Donghui Hu, and Zhaopin Su. 2024. HiFi-GANw: Watermarked speech synthesis via fine-tuning of HiFi-GAN. IEEE Signal Processing Letters (2024). https://doi.org/10.1109/LSP.2024.3456673
    • Yuxiang Zhao, Yunchong Xiao, Yushen Chen, Zhikang Niu, Shuai Wang, Kai Yu, and Xie Chen. 2025. Traceable TTS: Toward Watermark-Free TTS with Strong Traceability. https://doi.org/10.48550/arXiv.2507.03887

139

140 of 179

Reference

  • Watermark – security
    • F. Cayre, C. Fontaine, and T. Furon. 2005. Watermarking security: theory and practice. IEEE Trans. Signal Process. 53, 10 (October 2005), 3976–3987.
    • Darko Kirovski and Fabien AP Petitcolas. 2003. Blind pattern matching attack on watermarking systems. IEEE transactions on Signal Processing 51, 4 (2003), 1045–1053.
    • Ingemar J Cox, Matt L Miller, Jean-Paul MG Linnartz, and Ton Kalker. 2018. A review of watermarking principles and practices. Digital signal processing for multimedia systems (2018), 461–485.
    • Scott Craver, Nasir Memon, B-L Yeo, and Minerva M Yeung. 1998. Resolving rightful ownerships with invisible watermarking techniques: Limitations, attacks, and implications. IEEE Journal on Selected areas in Communications 16, 4 
    • Lingfeng Yao, Chenpei Huang, Shengyao Wang, Junpei Xue, Hanqing Guo, Jiang Liu, Phone Lin, Tomoaki Ohtsuki, and Miao Pan. 2025. Yours or Mine? Overwriting Attacks against Neural Audio Watermarking. 
    • Slavko Kovačević, Murilo Z. Silvestre, Kosta Pavlović, Petar Nedić, and Igor Djurović. 2025. DeepMark Benchmark: Redefining Audio Watermarking Robustness. March 06, 2025. 

140

141 of 179

Reference

  • Watermark – benchmark
    • Patrick O’Reilly, Zeyu Jin, Jiaqi Su, and Bryan Pardo. 2025. Deep Audio Watermarks are Shallow: Limitations of Post-Hoc Watermarking Techniques for Speech. ICLR workshop Watermark, 2025
    • Slavko Kovačević, Murilo Z. Silvestre, Kosta Pavlović, Petar Nedić, and Igor Djurović. 2025. DeepMark Benchmark: Redefining Audio Watermarking Robustness. 
    • Yigitcan Özer, Woosung Choi, Joan Serrà, Mayank Kumar Singh, Wei-Hsiang Liao, and Yuki Mitsufuji. 2025. A comprehensive real-world assessment of audio watermarking algorithms: Will they survive neural codecs? In Proc. Interspeech, 2025. 5113–5117. 
    • Yizhu Wen, Ashwin Innuganti, Aaron Bien Ramos, Hanqing Guo, and Qiben Yan. 2025. SoK: How Robust is Audio Watermarking in Generative AI models? 
    • Hongbin Liu, Moyang Guo, Zhengyuan Jiang, Lun Wang, and Neil Zhenqiang Gong. 2024. AudioMarkBench: Benchmarking robustness of audio watermarking. arXiv preprint arXiv:2406.06979 (2024).

141

142 of 179

Reference

  • Anonymization – voice privacy framework
    • Natalia Tomashenko, et al. 2020. Introducing the VoicePrivacy initiative. In Proc. Interspeech, October 2020. ISCA, 1693–1697.
    • Andreas Nautsch, et al. 2020. The privacy ZEBRA: Zero evidence biometric recognition assessment. In Proc. Interspeech, 2020.

  • Anonymization – k-anonymity & selection
    • Fuming Fang, Xin Wang, Junichi Yamagishi, et al. 2019. Speaker anonymization using X-vector and neural waveform models. In Proc. SSW, 2019. 155–160. https://doi.org/10.21437/SSW.2019-28
    • Brij Mohan Lal Srivastava, Mohamed Maouche, Md Sahidullah, et al. 2022. Privacy and Utility of X-Vector Based Speaker Anonymization. IEEE/ACM Transactions on Audio, Speech, and Language Processing 30, (2022), 2383–2395. https://doi.org/10.1109/TASLP.2022.3190741

142

143 of 179

Reference

  • Anonymization – differential privacy
    • Yaowei Han, Sheng Li, Yang Cao, Qiang Ma, and Masatoshi Yoshikawa. 2020. Voice-indistinguishability: Protecting voiceprint in privacy-preserving speech data release. In Proc. ICME, 2020. 1–6. https://doi.org/10.1109/ICME46284.2020.9102875
    • Oubaïda Chouchane, Michele Panariello, Oualid Zari, et al. 2023. Differentially private adversarial auto-encoder to protect gender in voice biometrics. In Proceedings of the 2023 ACM workshop on information hiding and multimedia security, 2023. 127–132. https://doi.org/10.1145/3577163.3595102
    • Ali Shahin Shamsabadi, Brij Mohan Lal Srivastava, Aurélien Bellet, et al. 2023. Differentially Private Speaker Anonymization. In Proc. PETS, 2023. https://petsymposium.org/popets/2023/popets-2023-0007.pdf

143

144 of 179

Reference

  • Anonymization methods – generation of pseudo speaker
    • Sarina Meyer, Pascal Tilli, Pavel Denisov, Florian Lux, Julia Koch, and Ngoc Thang Vu. 2023. Anonymizing speech with generative adversarial networks to preserve speaker privacy. In Proc. SLT, 2023. IEEE, 912–919. https://doi.org/10.1109/SLT54892.2023.10022601
    • Belinda Hui Hui Soh, Xiaoxiao Miao, and Xin Wang. 2025. SecureSpeech: Prompt-based Speaker and Content Protection. In Proc. IJCB, 2025. https://arxiv.org/abs/2507.07799

  • Anonymization methods – kNN-VC
    • Carlos Franzreb, Arnab Das, Tim Polzehl, and Sebastian Möller. 2025. Private kNN-VC: Interpretable anonymization of converted speech. In Proc. Interspeech, 2025. 3224–3228. https://doi.org/10.21437/Interspeech.2025-820

144

145 of 179

Reference

  • Anonymization methods – limitation
    • Jennifer Williams, Karla Pizzi, Natalia Tomashenko, and Sneha Das. 2024. Anonymizing Speaker Voices: Easy to Imitate, Difficult to Recognize? In Proc. ICASSP, April 14, 2024. IEEE, 12491–12495.
    • Michele Panariello, et al. 2025. The Risks and Detection of Overestimated Privacy Protection in Voice Anonymisation. In Proc. SPSC, 2025.
    • Natalia Tomashenko, Emmanuel Vincent, and Marc Tommasi. 2025. Exploiting context-dependent duration features for voice anonymization attack systems. In Proc. Interspeech, 2025. 5128–5132.

145

146 of 179

Speaker verification

146

147 of 179

Voice verification

Voice verification

Feature extraction

Feature extraction

Comparison

Same person?

Not the same

John’s real voice

Same person?

Not the same

I am John

147

148 of 179

Voice verification

Animation from Speech production and articulation knowledge group https://sail.usc.edu/span/rtmri_ipa/je_2015.html

Vocal tract

Larynx

[a]

[i]

[o]

Linguistic: what we say

Word sequence, …

148

149 of 179

Voice verification

Animation from Speech production and articulation knowledge group https://sail.usc.edu/span/rtmri_ipa/je_2015.html

Vocal tract

Larynx

[a]

[i]

[o]

[a]

Linguistic: what we say

Word sequence, …

Paralinguistic: how we say

Emotion, …

149

150 of 179

Voice verification

Vocal tract

Larynx

Speaker A

Speaker B

[a]

[a]

Biometric: what voices sound like

Size of vocal tract & vocal cords …

Other

biometrics

150

151 of 179

Voice verification

Tomi Kinnunen and Haizhou Li. 2010. An overview of text-independent speaker recognition: From features to supervectors. Speech communication 52, 1 (2010), 12–40.

John H L Hansen and Taufiq Hasan. 2015. Speaker recognition by machines and humans: A tutorial review. IEEE Signal processing magazine 32, 6 (2015), 74–99.

Frédéric Bimbot, Jean-François Bonastre, Corinne Fredouille, Guillaume Gravier, Ivan Magrin-Chagnolleau, Sylvain Meignier, Teva Merlin, Javier Ortega-García, Dijana Petrovska-Delacrétaz, and Douglas A Reynolds. 2004. A tutorial on text-independent speaker verification. EURASIP Journal on Advances in Signal Processing 2004, 4 (2004), 101962.

Vocal tract

Larynx

Similarity score s

Feature

extraction

Classification

(decision)

Check tutorials (Bimbot 2004, Kinnunen 2010, Hansen 2015)
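The pipeline on this slide (feature extraction, similarity score s, decision) can be sketched as follows. This is only a minimal illustration: it assumes that some feature extractor (not shown) has already mapped each utterance to a fixed-length embedding and that cosine similarity serves as the score. The embedding values, the function names, and the threshold are all hypothetical.

```python
import math

def cosine_score(enroll, test):
    """Similarity score s between two fixed-length speaker embeddings."""
    dot = sum(a * b for a, b in zip(enroll, test))
    norm = math.sqrt(sum(a * a for a in enroll)) * math.sqrt(sum(b * b for b in test))
    return dot / norm

def decide(score, threshold):
    """Accept the identity claim when the score reaches the threshold."""
    return "ACCEPT" if score >= threshold else "REJECT"

# Hypothetical 4-dimensional embeddings; real systems use far more dimensions.
john_enrolled = [0.9, 0.1, 0.3, 0.2]     # from John's real voice
john_claim = [0.8, 0.2, 0.35, 0.15]      # "I am John", spoken by John
impostor_claim = [0.1, 0.9, -0.2, 0.4]   # "I am John", spoken by someone else

THRESHOLD = 0.7  # hypothetical; in practice chosen on held-out trials

for name, emb in [("john_claim", john_claim), ("impostor_claim", impostor_claim)]:
    s = cosine_score(john_enrolled, emb)
    print(name, round(s, 3), decide(s, THRESHOLD))
```

With these made-up vectors the genuine claim scores close to 1 and is accepted, while the impostor claim scores far lower and is rejected.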

151

152 of 179

Voice verification - decision

Similarity score s

ACCEPT

SAME speaker

REJECT

DIFFERENT speakers

[Figure: similarity scores s of individual trials, split into ACCEPT (SAME speaker) and REJECT (DIFFERENT speakers)]

152

153 of 179

Voice verification - decision

score s

False acceptance

False rejection

Decision threshold

153

154 of 179

Voice verification - decision

score s

Decision threshold

False acceptance

False rejection

154

155 of 179

Voice verification - decision

score s

Decision threshold

False acceptance

False rejection

155

156 of 179

Voice verification - decision

score s

False acceptance

False rejection

Decision threshold

Find a decision threshold so that P_FA = P_FR (false acceptance rate = false rejection rate)

The error rate at this threshold is the Equal Error Rate (EER)
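A small sketch of how the EER can be estimated from a finite set of scores, assuming that trials with a score at or above the threshold are accepted. The scores are hypothetical, and the function names are chosen here for illustration. With finitely many trials the two error rates may never be exactly equal, so the sketch takes the threshold where they are closest and reports their average.

```python
def error_rates(target_scores, nontarget_scores, threshold):
    """Return (P_FA, P_FR) when trials with score >= threshold are accepted."""
    p_fa = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    p_fr = sum(s < threshold for s in target_scores) / len(target_scores)
    return p_fa, p_fr

def equal_error_rate(target_scores, nontarget_scores):
    """Sweep candidate thresholds; return (EER, threshold) where |P_FA - P_FR| is smallest."""
    best = None
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        p_fa, p_fr = error_rates(target_scores, nontarget_scores, t)
        gap = abs(p_fa - p_fr)
        if best is None or gap < best[0]:
            best = (gap, (p_fa + p_fr) / 2, t)
    return best[1], best[2]

# Hypothetical scores for SAME-speaker (target) and DIFFERENT-speaker (non-target) trials.
target = [0.9, 0.8, 0.75, 0.6, 0.4]
nontarget = [0.7, 0.5, 0.3, 0.2, 0.1]

eer, thr = equal_error_rate(target, nontarget)
print(f"EER = {eer:.2f} at threshold {thr}")
```

For these scores, one non-target trial (0.7) is accepted and one target trial (0.4) is rejected at threshold 0.6, so both error rates are 1/5.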

156

157 of 179

Everything in a nutshell

SAME speaker

DIFFERENT speakers

[Figure: similarity scores s of SAME-speaker and DIFFERENT-speaker trials]

Deepfake: increase false acceptance

Detection: decrease false acceptance

Anonymization: increase false rejection

157

158 of 179

Everything in a nutshell

score s

Spoofing or deepfake increases false acceptance

Detection and watermarking reduce false acceptance

158

159 of 179

Everything in a nutshell

score s

Spoofing or deepfake increases false acceptance

Detection and watermarking reduce false acceptance

Anonymization increases false rejection !?
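The three statements on this slide can be illustrated with a toy calculation at a fixed decision threshold. All scores and detector outputs below are invented for illustration only. The sketch also assumes that a trial flagged by the detector is rejected before it reaches the verifier, which is one possible way of combining the two systems.

```python
def rate(flags):
    """Fraction of True values in a list of booleans."""
    return sum(flags) / len(flags)

THRESHOLD = 0.6  # hypothetical fixed decision threshold of the verifier

# Hypothetical similarity scores s against the enrolled speaker.
genuine = [0.9, 0.8, 0.75, 0.7, 0.65]     # SAME speaker
impostor = [0.5, 0.4, 0.3, 0.2, 0.1]      # DIFFERENT speakers, natural voice
deepfake = [0.85, 0.8, 0.7, 0.55, 0.5]    # DIFFERENT speakers, cloned voice
anonymized = [0.55, 0.5, 0.45, 0.7, 0.3]  # SAME speaker after anonymization

# Hypothetical detector outputs for the deepfake trials (True = flagged as fake).
flagged = [True, True, False, True, False]

far_natural = rate([s >= THRESHOLD for s in impostor])
far_deepfake = rate([s >= THRESHOLD for s in deepfake])
# A flagged trial is rejected regardless of its similarity score.
far_with_detector = rate([s >= THRESHOLD and not f for s, f in zip(deepfake, flagged)])
frr_genuine = rate([s < THRESHOLD for s in genuine])
frr_anonymized = rate([s < THRESHOLD for s in anonymized])

print("false acceptance, natural impostors:", far_natural)
print("false acceptance, deepfakes:", far_deepfake)
print("false acceptance, deepfakes + detector:", far_with_detector)
print("false rejection, genuine:", frr_genuine)
print("false rejection, anonymized:", frr_anonymized)
```

For anonymization, the higher false rejection rate is the intended effect: the verifier should no longer link the anonymized voice to the original speaker.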

159

160 of 179

Misuse of speech generation

Jee-weon Jung, Xin Wang, Nicholas Evans, Shinji Watanabe, Hye-jin Shim, Hemlata Tak, Siddhant Arora, Junichi Yamagishi, and Joon Son Chung. 2024. To what extent can ASV systems naturally defend against spoofing attacks? In Proc. Interspeech, 2024. 3240–3244.

Speaker verification

HMM-based TTS

HMM-DNN

End-to-end

Unit-selection

160

161 of 179

Misuse of speech generation

Jee-weon Jung, Xin Wang, Nicholas Evans, Shinji Watanabe, Hye-jin Shim, Hemlata Tak, Siddhant Arora, Junichi Yamagishi, and Joon Son Chung. 2024. To what extent can ASV systems naturally defend against spoofing attacks? In Proc. Interspeech, 2024. 3240–3244.

Speaker verification

HMM-based TTS

HMM-DNN

End-to-end

Unit-selection

Newer speaker verification systems better detect fake speech

161

162 of 179

Misuse of speech generation

Jee-weon Jung, Xin Wang, Nicholas Evans, Shinji Watanabe, Hye-jin Shim, Hemlata Tak, Siddhant Arora, Junichi Yamagishi, and Joon Son Chung. 2024. To what extent can ASV systems naturally defend against spoofing attacks? In Proc. Interspeech, 2024. 3240–3244.

Speaker verification

HMM-based TTS

HMM-DNN

End-to-end

Unit-selection

Latest speaker verification system is still vulnerable to some types of fake speech

162

163 of 179

Misuse of speech generation

Jee-weon Jung, Xin Wang, Nicholas Evans, Shinji Watanabe, Hye-jin Shim, Hemlata Tak, Siddhant Arora, Junichi Yamagishi, and Joon Son Chung. 2024. To what extent can ASV systems naturally defend against spoofing attacks? In Proc. Interspeech, 2024. 3240–3244.

False acceptance(%)

Date when TTS system was developed

Latest TTS systems are more difficult to detect

163

164 of 179

Binary classification

164

165 of 179

Binary classification

165

166 of 179

Binary classification

166

167 of 179

Binary classification

Each point is called an operating point

167

168 of 179

Binary classification

Alvin F Martin, George R Doddington, Terri Kamm, Mark Ordowski, and Mark A Przybocki. 1997. The DET curve in assessment of detection task performance. In Eurospeech, 1997. 1895–1898.

Detection Error Tradeoff (DET) curve

Making the tradeoff curve straight!
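Why the DET plot can make the curve straight is easiest to see in an idealized case. The sketch below assumes (purely for illustration) that target and non-target scores are Gaussian with equal variance, and warps both error rates with the inverse standard-normal CDF (probit), which is the axis scaling used in DET plots. The score model and thresholds are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()

def probit(p):
    """Inverse standard-normal CDF; the axis warping of a DET plot."""
    return std_normal.inv_cdf(p)

# Assumed score model: non-target scores ~ N(0, 1), target scores ~ N(2, 1).
nontarget = NormalDist(mu=0.0, sigma=1.0)
target = NormalDist(mu=2.0, sigma=1.0)

det_points = []
for threshold in [0.0, 0.5, 1.0, 1.5, 2.0]:
    p_fa = 1.0 - nontarget.cdf(threshold)  # non-target score above the threshold
    p_fr = target.cdf(threshold)           # target score below the threshold
    det_points.append((probit(p_fa), probit(p_fr)))

for x, y in det_points:
    print(f"probit(P_FA) = {x:+.2f}, probit(P_FR) = {y:+.2f}")
```

Under this assumed model every operating point satisfies probit(P_FA) + probit(P_FR) = -2, so the points lie on a straight line with slope -1. Real score distributions are only approximately Gaussian, so real DET curves are only approximately straight.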

168

169 of 179

Binary classification

David A Van Leeuwen and Niko Brümmer. 2007. An introduction to application-independent evaluation of speaker recognition systems. In Speaker classification I. Springer, 330–353.

169

170 of 179

Binary classification

See comments on EER and other metrics in Xin Wang, Héctor Delgado, Hemlata Tak, Jee-weon Jung, Hye-jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen, Nicholas Evans, Kong Aik Lee, and Junichi Yamagishi. 2024. ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale. In ASVspoof Workshop 2024, 2024. 1–8. https://doi.org/10.21437/ASVspoof.2024-1

Deep & interesting research topics

Why different?

Discrimination & calibration

Application-independent evaluation: Cllr …

Bayes decision

Integrate ASV?

Tandem DCF

Tandem EER
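As one concrete example of the metrics named on this slide, the sketch below computes Cllr using its usual definition, assuming the system outputs natural-log likelihood ratios. The score values are hypothetical. It also shows why discrimination and calibration differ: adding a constant to every score leaves the ranking of trials (and hence the EER) unchanged, but makes Cllr much worse.

```python
import math

def cllr(target_llrs, nontarget_llrs):
    """Log-likelihood-ratio cost in bits; inputs are natural-log likelihood ratios."""
    tar = sum(math.log2(1.0 + math.exp(-l)) for l in target_llrs) / len(target_llrs)
    non = sum(math.log2(1.0 + math.exp(l)) for l in nontarget_llrs) / len(nontarget_llrs)
    return 0.5 * (tar + non)

# Hypothetical, reasonably calibrated scores.
target = [3.0, 2.0, 1.0]
nontarget = [-3.0, -2.0, -1.0]

# Same ranking of trials, but every score shifted by +5 (miscalibrated).
shifted_target = [l + 5.0 for l in target]
shifted_nontarget = [l + 5.0 for l in nontarget]

print("Cllr, original scores:", round(cllr(target, nontarget), 3))
print("Cllr, shifted scores: ", round(cllr(shifted_target, shifted_nontarget), 3))
print("Cllr, uninformative (all LLR = 0):", cllr([0.0], [0.0]))
```

A system that always outputs LLR = 0 has Cllr = 1 bit, so a value above 1 (as for the shifted scores) means the scores are worse than uninformative when used for Bayes decisions, even though they separate the two classes perfectly.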

170

171 of 179

Speech anti-spoofing – where we are : )

Todisco, M. et al. ASVspoof 2019: future horizons in spoofed and fake audio detection. in Proc. Interspeech 1008–1012 (2019). doi:10.21437/Interspeech.2019-2249

Wu, Z. et al. ASVspoof: the automatic speaker verification spoofing and countermeasures challenge. IEEE J. Sel. Top. Signal Process. 11, 588–604 (2017)

Yamagishi, J. et al. ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection. in Proc. ASVspoof Challenge workshop 47–54 (2021). doi:10.21437/ASVSPOOF.2021-

Korshunov, P. et al. Overview of BTAS 2016 speaker anti-spoofing competition. in Proc. BTAS 1–6 (2016).

Zhang, Z., Gu, Y., Yi, X. & Zhao, X. FMFCC-A: A Challenging Mandarin Dataset for Synthetic Speech Detection. arXiv Prepr. arXiv:2110.09441 (2021)

Kinnunen, T. et al. Tandem Assessment of Spoofing Countermeasures and Automatic Speaker Verification: Fundamentals. IEEE/ACM Trans. Audio, Speech, Lang. Process. 28, 2195–2210 (2020)

HMM-based TTS

GMM-based VC

VTLN-based VC

GAN-based TTS

Tacotron

WaveNet

WaveRNN

VAE

VoIP, GSM, PSTN …

Out-of-domain data

2015

  • Task
  • Metrics
  • Database

ASVspoof 2015

2019

  • More unseen attacks

ASVspoof 2019

2017

2021

ASVspoof 2021

  • Channel (codec & compression)

2024

ASVspoof 5

  • Diverse data

More speakers

Diverse audio env.

171

172 of 179

Speech anti-spoofing – where we are : )

EER (%)

2015

ASVspoof 2015

2019

ASVspoof 2019

2017

2021

ASVspoof 2021

2024

ASVspoof 5

172

173 of 179

Speech anti-spoofing – where we are : )

Md Sahidullah, Tomi Kinnunen, and Cemal Hanilçi. A Comparison of Features for Synthetic Speech Detection. In Proc. Interspeech, 2087–2091. 2015.

Efforts by many research groups!

EER (%)

3% (Sahidullah 2015)

>20%

2015

ASVspoof 2015

2019

ASVspoof 2019

2017

2021

ASVspoof 2021

2024

ASVspoof 5

173

174 of 179

Speech anti-spoofing – where we are : )

Md Sahidullah, Tomi Kinnunen, and Cemal Hanilçi. A Comparison of Features for Synthetic Speech Detection. In Proc. Interspeech, 2087–2091. 2015.

Xin Wang and Junichi Yamagishi. A Comparative Study on Recent Neural Spoofing Countermeasures for Synthetic Speech Detection. In Proc. Interspeech, 4259–4263. doi:10.21437/Interspeech.2021-702. 2021.

3% (Sahidullah 2015)

>20%

3% (Wang 2021)

10%

23%

Efforts by many research groups!

EER (%)

2015

ASVspoof 2015

2019

ASVspoof 2019

2017

2021

ASVspoof 2021

2024

ASVspoof 5

174

175 of 179

Speech anti-spoofing – where we are : (

Rohan Kumar Das, Jichen Yang, and Haizhou Li. Assessing the Scope of Generalized Countermeasures for Anti-Spoofing. In Proc. ICASSP, 6589–6593. 2020.

Xin Wang and Junichi Yamagishi. Investigating self-supervised front ends for speech spoofing countermeasures. https://arxiv.org/abs/2111.07725

2015

2019

2021

3%

>20%

10%

23%

Trained on 2019 train set, tested on 2015 test set

Reported in one case study

8%

>23%

3% (Wang 2021)

(Das 2020)

EER (%)

2015

ASVspoof 2015

2019

ASVspoof 2019

2017

2021

ASVspoof 2021

2024

ASVspoof 5

175

176 of 179

Speech anti-spoofing – where we are : (

Rohan Kumar Das, Jichen Yang, and Haizhou Li. Assessing the Scope of Generalized Countermeasures for Anti-Spoofing. In Proc. ICASSP, 6589–6593. 2020.

Xin Wang and Junichi Yamagishi. Investigating self-supervised front ends for speech spoofing countermeasures. https://arxiv.org/abs/2111.07725

2015

2019

2021

3%

>20%

10%

23%

8%

>23%

(Das 2020)

30%

(Wang 2021)

Another case

3% (Wang 2021)

EER (%)

2015

ASVspoof 2015

2019

ASVspoof 2019

2017

2021

ASVspoof 2021

2024

ASVspoof 5

176

177 of 179

Speech anti-spoofing – where we go

You Zhang, Ge Zhu, Fei Jiang, and Zhiyao Duan. An Empirical Study on Channel Effects for Synthetic Voice Spoofing Countermeasure Systems. In Proc. Interspeech 2021, 4309–4313. doi:10.21437/Interspeech.2021-1820. 2021.

Anton Tomilov, Aleksei Svishchev, Marina Volkova, Artem Chirkovskiy, Alexander Kondratev, and Galina Lavrentyeva. STC Antispoofing Systems for the ASVspoof2021 Challenge. In Proc. ASVspoof Challenge Workshop, 61–67. doi:10.21437/ASVSPOOF.2021-10. 2021.

Das, R. K., Yang, J. & Li, H. Data Augmentation with Signal Companding for Detection of Logical Access Attacks. arXiv Prepr. arXiv:2102.06332 (2021)

Xin Wang and Junichi Yamagishi. Investigating self-supervised front ends for speech spoofing countermeasures. https://arxiv.org/abs/2111.07725

EER (%)

2015

2019

2021

3%

>20%

10%

23%

8%

>23%

(Das 2020)

30%

(Wang 2021)

3% (Wang 2021)

0.2%

7%

2015

ASVspoof 2015

2019

ASVspoof 2019

2017

2021

ASVspoof 2021

2024

ASVspoof 5

177

178 of 179

Speech anti-spoofing – mainstream

Sahidullah, M., Kinnunen, T. & Hanilçi, C. A comparison of features for synthetic speech detection. in Proc. Interspeech 2087–2091 (2015).

Kamble, M. R., Sailor, H. B., Patil, H. A. & Li, H. Advances in anti-spoofing: from the perspective of ASVspoof challenges. APSIPA Trans. Signal Inf. Process. 9, e2 (2020)

Sahidullah, M. et al. Introduction to voice presentation attack detection and recent advances. in Handbook of Biometric Anti-Spoofing 321–361 (Springer, 2019).

Wang, X. & Yamagishi, J. A comparative study on recent neural spoofing countermeasures for synthetic speech detection. in Proc. Interspeech 4259–4263 (2021).

Feature extraction

(front end)

Classifier

(back end)

Input trial

Wav2vec 2.0 (Baevski 2020)

A linear layer,

AASIST, or…

178

179 of 179

SSL-based detector training scheme

    • potentially better generalization! (Wang 2022, Tak 2022)

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Proc. NeurIPS, 2020. 12449–12460.

Xin Wang and Junichi Yamagishi. 2022. Investigating Self-Supervised Front Ends for Speech Spoofing Countermeasures. In Proc. Odyssey, 2022. 100–106.

Hemlata Tak, Massimiliano Todisco, Xin Wang, Jee-weon Jung, Junichi Yamagishi, and Nicholas Evans. 2022. Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation. In Proc. Odyssey, 2022. 112–119.

Wav2vec 2.0 (Baevski 2020)

A linear layer,

AASIST, or…

Detector

SSL-based

front end

Classifier

Cross-entropy

supervised fine-tuning
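The back-end part of this scheme can be sketched in plain Python for the simplest case on the slide, a single linear layer trained with cross-entropy. This is a toy illustration under strong assumptions: the front end is replaced by a hypothetical stand-in that emits random frame-level features (it is not wav2vec 2.0), the front end is kept frozen rather than fine-tuned, and frame features are mean-pooled before the linear layer. All names and hyperparameters are made up.

```python
import math
import random

def mean_pool(frames):
    """Average frame-level feature vectors into one utterance-level vector."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_linear_head(features, labels, epochs=200, lr=0.5):
    """Fit a linear layer by gradient descent on the binary cross-entropy loss."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Gradient of the cross-entropy loss w.r.t. the logit is (p - y).
            for d in range(dim):
                grad_w[d] += (p - y) * x[d]
            grad_b += p - y
        w = [wi - lr * g / len(features) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(features)
    return w, b

rng = random.Random(0)

def stand_in_front_end(label, n_frames=20, dim=4):
    """Hypothetical stand-in for an SSL front end: random frame-level features.
    Real speech (label 1) is centred at +1, fake speech (label 0) at -1."""
    centre = 1.0 if label == 1 else -1.0
    return [[rng.gauss(centre, 1.0) for _ in range(dim)] for _ in range(n_frames)]

labels = [1, 0] * 10
features = [mean_pool(stand_in_front_end(y)) for y in labels]
w, b = train_linear_head(features, labels)

predictions = [int(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5) for x in features]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print("training accuracy:", accuracy)
```

In an actual system the gradient of the same loss would also be propagated into the pre-trained front end, which is the fine-tuning step this sketch leaves out.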

179