1 of 37

Privacy Leakage in Speech Models: Attacks and Mitigations

Om Thakkar | Privacy Research Engineer | OpenAI*

*work done previously at Google

08/04/2025

2 of 37


Privacy Leakage in Machine Learning

Training Data

  • Collection/storage of certain types of data
  • E.g.: unauthorized access

Machine Learning (ML) Algorithms

  • (Distributed) systems for training the models
  • E.g.: gradients can leak information [ZLH’19, GBDM’20, DT.R+’21a, DT.R+’21b]

3 of 37

Information Leakage from Gradients

  • Image reconstruction from client update [GBDM’20]
  • Privacy leakage from ASR training updates
    • Utterance labels [DT.S+’21a]
    • Speaker Identity [DT.S+’21b]

[Figure: original vs. reconstructed image from a client update [GBDM’20]]

4 of 37


Privacy Leakage in Machine Learning

Training Data

  • Collection/storage of certain types of data
  • E.g.: unauthorized access

Machine Learning (ML) Algorithms

  • (Distributed) systems for training the models
  • E.g.: gradients can leak information [ZLH’19, GBDM’20, DT.R+’21a, DT.R+’21b]

Released Neural Networks (NNs)

  • Trained NNs can leak private information
  • E.g.: secret sharer attacks [CLE+’19, T.RMB’20, RT.M+’20, CTW+’21, CIJ+’22, JT.T+’23, ST.N’24, WT.M’24, …], “fill-in-the-blank” style attacks [AT.N+’22, JT.W’24]

Focus for today: Released NNs

5 of 37

Types of Speech Models

Core types:

- Text-to-Speech (TTS)
  - E.g., AudioLM
  - Input transcript (“how are you”) → TTS Model → output utterance

- Speech-to-Text (STT) / Automatic Speech Recognition (ASR)
  - E.g., Conformer
  - Input utterance → ASR Model → output transcript (“how are you”)

- Speech-to-Speech (S2S)
  - E.g., GPT-4o Advanced Voice Mode
  - Input utterance (“how are you”) → S2S Model → output utterance

6 of 37

Types of Speech Models

Additional Types:

- Speaker Recognition / Verification
  - E.g., ECAPA-TDNN

- Speech-to-Speech Translation (S2ST)
  - E.g., AudioPaLM

- Foundation Speech models
  - E.g., Whisper

Focus for today: ASR Models

7 of 37

Outline

- Types of Privacy Leaks

- Extraction Attacks (Noise Masking)

- Memorization Audits (Secret Sharer framework)

- Empirical Privacy Techniques

8 of 37

Extracting Targeted Training Data from ASR Models [AT.N+’22]

- Task for ASR: Input audio utterance → Output transcript

[Diagram: input audio utterance → ASR Model → output transcript “how are you”]

- Noise Masking: First attack for extracting information from ASR models

- Strategy: replace target words in utterance with noise; apply the ASR model
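A rough sketch of the masking step (not the exact pipeline of [AT.N+’22]): the snippet below overwrites a target word’s span in a waveform with Gaussian noise or silence before querying the model. The sample rate, span boundaries, and the `asr_model` callable are illustrative placeholders.

```python
import numpy as np

def noise_mask(waveform: np.ndarray, start_s: float, end_s: float,
               sample_rate: int = 16000, use_silence: bool = False,
               noise_std: float = 0.01) -> np.ndarray:
    """Replace the [start_s, end_s) span of a mono waveform with noise or silence."""
    masked = waveform.copy()
    lo, hi = int(start_s * sample_rate), int(end_s * sample_rate)
    if use_silence:
        masked[lo:hi] = 0.0                                         # silence masking
    else:
        masked[lo:hi] = np.random.normal(0.0, noise_std, hi - lo)   # noise masking
    return masked

# Hypothetical usage: mask the word right after "mister" and check whether the
# model still transcribes the (memorized) name.
# masked = noise_mask(utterance, start_s=0.8, end_s=1.3)
# print(asr_model.transcribe(masked))   # `asr_model` is a placeholder
```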

9 of 37

Instantiation of Noise Masking

Entity Name Extraction from LibriSpeech

  • 3.6k names appear after titles like ‘mister’, ‘miss’, ‘missus’, etc.
    • Consider such names as sensitive information


  • Train then-SOTA 120M param Conformer model on LibriSpeech [GQC+’20]
  • Pass utterances containing ‘mister’ followed by noise to extract names:

10 of 37

Instantiation of Noise Masking

Entity Name Extraction from LibriSpeech

  • 3.6k names appear after titles like ‘mister’, ‘miss’, ‘missus’, etc.
    • Consider such names as sensitive information


[Example: original utterance “mister soames was somewhat…”; with the name masked by noise, the model still transcribes “mister soames was somewhat…”]

  • Train then-SOTA 120M param Conformer model on LibriSpeech [GQC+’20]
  • Pass utterances containing ‘mister’ followed by noise to extract names:

11 of 37

Results

  • Can extract significant information via Noise Masking

[Table: avg. extraction accuracy (true name % / any name %), train vs. test]

  • Leakage on train set >> test set ⇒ memorization
  • Model leaks information even with silence masking!

12 of 37

Additional Analysis for Leakage

Details on number of leaked names:

- Unique names: # in LibriSpeech = 3.6k
- Extrapolated names (never appear after ‘mister’ in training): # in LibriSpeech = 1.1k
- For the baseline model, can extract 595 unique names and 73 extrapolated names

Details on effect of noise duration:

- Using fixed-duration noise results in similar leakage
- Using masked-word-duration noise ~doubles leakage

More in the paper:

  • Mitigation using Word Dropout + MTR [KMC+’17]

13 of 37

Extraction from Pretrained Speech Encoders?

- [AT.N+’22] demonstrated attacks in a supervised ASR training setup

- Modern ASR models (e.g., [ZHQ+’23]) are finetuned from speech encoders pretrained on large audio-only datasets


- Allows larger pretraining datasets

  • In general, easier to curate unsupervised
  • LibriLight: 60k hours unsupervised
  • LibriSpeech: 1k hours supervised

Q: Can we design an attack to extract pretraining data from encoders?

- Can’t directly run noise masking

  • Encoder has never seen transcripts!
  • Large Datasets ⇒ Fewer epochs ⇒ Less memorization?

14 of 37

Noise Masking for Pretrained Speech Encoders [JT.W’24]

- Start with a pretrained encoder (e.g., USM [ZHQ+’23])

- Finetune encoder with ASR data to create an ASR Attack Model

- Need to avoid “forgetting” pretraining data [JT.T+’23]


- Produce a noise-masked utterance (in the same way as existing attacks)

- Query the ASR attack model with the utterance
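A minimal sketch of the query-and-score loop for this attack is below. `attack_model` (with a `transcribe` method) and the dataset layout are hypothetical stand-ins, not the paper’s actual interfaces, and the scoring is only a rough analogue of the true-name / any-name metrics reported later.

```python
def run_attack(attack_model, masked_utterances, title="mister"):
    """masked_utterances: list of (masked_audio, true_name, known_names) tuples."""
    true_hits = any_hits = 0
    for masked_audio, true_name, known_names in masked_utterances:
        hyp = attack_model.transcribe(masked_audio).lower().split()
        try:
            predicted = hyp[hyp.index(title) + 1]   # word predicted after the title
        except (ValueError, IndexError):
            continue   # title missing, or nothing predicted after it
        true_hits += predicted == true_name
        any_hits += predicted in known_names
    n = max(len(masked_utterances), 1)
    # Rough analogues of the "true name" / "any name" rates.
    return true_hits / n, any_hits / n
```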

15 of 37

Experimental Setup

  • Model: 600M param SOTA Conformer
  • Training Data: LibriLight (LL), and LibriSpeech (LS)
    • Filtered formatted names from LS to produce LS-NoName


  • Training Method:
    • Pretrain encoder on LL for 1M steps using BEST-RQ [CQZ+’22]
    • Finetune with CTC loss on LS-NoName for 10k steps
  • Finetuning on LS-NoName ⇒ names only seen during pretraining!

16 of 37

Results

- Train > Test implies memorization

- Exact names extracted with 8-9% precision

- Any name produced with 26-38% precision

[Table: extraction precision on the clean/other splits of LS]

- More in the paper:

    • Mitigations (deduplication, sanitization, etc.) during pretraining

17 of 37

Looking Beyond Extraction Attacks

- Previous attacks successful at extracting targeted training data

- Limited to extractions spanning small durations

- Require some domain knowledge of training data


Q: How susceptible is a training sample to leakage?

- Line of work [CLE+’19, T.RMB’20, RT.M+’20, CTW+’21, CIJ+’22, JT.T+’23, …] on unintentional memorization in LMs

18 of 37

Outline

- Types of Privacy Leaks

- Extraction Attacks (Noise Masking)

- Memorization Audits (Secret Sharer framework)

- Empirical Privacy Techniques

19 of 37

Background - The Secret Sharer Framework [CLE+’19]

  • Used for measuring unintended memorization of textual data in LMs
  • Method
    • Insert handcrafted samples (canaries) in the training data
    • Measure performance of trained model on canaries
    • Contrast with performance on holdout set from same distribution


20 of 37

Background - The Secret Sharer Framework [CLE+’19]

  • Used for measuring unintended memorization of textual data in LMs
  • Method
    • Insert handcrafted samples (canaries) in the training data
    • Measure performance of trained model on canaries
    • Contrast with performance on holdout set from same distribution
    • Rank of canaries → a measure of memorization, formalized by:


Exposure: Given a canary c, a model M, and a holdout set R, the exposure of c is:

  exposure_M(c) = log2(|R|) − log2(rank_M(c))

where rank_M(c) is c’s rank among R for the metric of interest on M, e.g., accuracy, loss, perplexity, character error rate, etc.
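A minimal sketch of computing exposure from per-sample metric values, assuming lower values (e.g., CER or loss) indicate a better fit by the model:

```python
import math

def exposure(canary_score: float, holdout_scores: list) -> float:
    """Exposure of a canary given per-sample metric values on a holdout set R."""
    # Rank 1 means the model fits the canary better than every holdout sample.
    rank = 1 + sum(s < canary_score for s in holdout_scores)
    return math.log2(len(holdout_scores)) - math.log2(rank)

# With a 20k holdout, a fully memorized canary (rank 1) gets exposure
# log2(20000) ≈ 14.3, while an un-memorized canary gets exposure near 0.
```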

21 of 37

Auditing Large ASR Models is Challenging

- High compute cost of existing methods

- SOTA [CCN+’22, JT.T+’23, AZT’24] uses 1000s of shadow models for calibration

- Impractical with increasing model sizes


Goal: The model should perform distinguishably between seen and unseen canaries to capture memorization well

- Limited success of adaptations of methods designed for LMs

- E.g., if the training set contains the audio canary “Om’s SSN is 902-548”, the model transcribes both the canary and unseen audios (“Raj’s SSN is 532-864”) well

22 of 37

Experimental Setup

- Models: 600M parameter Conformer

- Training Data: LibriLight (LL), and LibriSpeech (LS)

- Canaries (Can):

- Transcript: 7 random words from top 10K LS vocab

- Audio: Random male/female voice using the WaveNet TTS engine

- Frequency ∈ {1, 2, 4, 8, 16}; 20 unique transcripts for each frequency


- Training Method:

- Pre-train encoder on LL for 1M steps using BEST-RQ [CQZ+’22]

- Attach 2-layer LSTM decoder, fine-tune on LS+Can for 20k steps

- Exposure: Computed using Character Error Rate (CER)

- Holdout set of size 20k drawn from the canary distribution
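An illustrative sketch of the canary-transcript construction described above (the exact sampling, TTS synthesis, and data-mixing details of [WT.M’24] may differ):

```python
import random

def make_canary_transcripts(vocab, n_per_freq=20, freqs=(1, 2, 4, 8, 16),
                            words_per_canary=7, seed=0):
    """Sample canary transcripts of 7 random vocabulary words and assign each
    an insertion frequency, mirroring the setup above (illustrative only)."""
    rng = random.Random(seed)
    canaries = []
    for freq in freqs:
        for _ in range(n_per_freq):
            transcript = " ".join(rng.sample(vocab, words_per_canary))
            canaries.append({"transcript": transcript, "frequency": freq})
    return canaries

# `vocab` would be the top-10K LibriSpeech words; each transcript is then
# synthesized with a TTS voice and inserted `frequency` times into the
# finetuning data, with a 20k-sample holdout drawn from the same distribution.
```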

23 of 37

Results

- Memorization vs. Generalization?

Sample outputs

Config  | Ground Truth                                                 | Transcribed Text
Canary  | forthwith inheritance announce pervaded worse were turned    | forthwith inheritance announce pervaded worse were turned
Holdout | mademoiselle powdered iridescent sky crucifix embrun farmers | mademoiselle powdered iridescent sky crucifix embracing farmers

- Models generalize on the holdout

- Reduces auditing power

24 of 37

Towards Efficient Privacy Auditing of ASR Models [WT.M’24]

- To address the above, propose extremely fast utterances as canaries


25 of 37

Towards Efficient Privacy Auditing of ASR Models [WT.M’24]

- To address the above, propose extremely fast utterances as canaries

[Audio demo: “Hello, this is a demo”]

- Goal: Separate learning and memorization

- Here, the utterance ↔ transcript mapping differs from that of typical utterances

- Fast Canary setup: configure TTS to generate 4x-sped-up utterances

- Repeat experiments by fine-tuning on LS + Fast Canaries instead
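The fast canaries above come from configuring the TTS engine’s speaking rate; as one illustrative alternative, already-synthesized canary audio can be time-stretched (librosa shown below; this exact preprocessing is an assumption, not the paper’s pipeline):

```python
import librosa

def make_fast_canary(wav_path: str, speed: float = 4.0, sr: int = 16000):
    """Load a synthesized canary utterance and time-stretch it by `speed`x,
    keeping pitch roughly constant (illustrative only)."""
    y, _ = librosa.load(wav_path, sr=sr)                    # mono waveform at 16 kHz
    y_fast = librosa.effects.time_stretch(y, rate=speed)    # ~4x shorter audio
    return y_fast, sr
```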

26 of 37

Towards Efficient Privacy Auditing of ASR Models [WT.M’24]

- Successful at efficiently showing high memorization
- Exposure increases sharply with canary freq. until saturation

Sample outputs

Config  | Ground Truth                                                | Transcribed Text
Canary  | adjust prudence lamplit spiral tree perception kirtland     | theyjust prudence lampitir tree perception kircepted
Holdout | rightly characters fatter accompany yielding trace clubbed  | exard me indeed

- Follow-up [ST.N’24]: privacy auditing of speech encoders

- More in the paper: the 600M model exhibits more memorization than the 300M model

27 of 37

Outline

- Types of Privacy Leaks

- Extraction Attacks (Noise Masking)

- Memorization Audits (Secret Sharer framework)

- Empirical Privacy Techniques

28 of 37

Mitigating Memorization via Sensitivity-Bounded Training

- Sensitivity-bounded (SB) training: bound the change a sample can have on training

- SB training is a necessary condition for differentially private training

- Usually achieved by per-example L2 norm clipping (PEC)
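A minimal, framework-agnostic sketch of PEC on a stack of per-example gradients (adding calibrated Gaussian noise to the clipped average would give DP-SGD; here we only bound sensitivity):

```python
import numpy as np

def clip_and_average(per_example_grads: np.ndarray, clip_norm: float = 1.0):
    """Per-example clipping (PEC): bound each example's L2 contribution to the
    update, then average. per_example_grads has shape (batch_size, num_params)."""
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    return (per_example_grads * scale).mean(axis=0)
```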


29 of 37

Per-Example Clipping (PEC) Mitigates Memorization [WT.M’24]

- PEC has also been shown to mitigate LM memorization [CLE+’19, T.RMB’20, HCT.M’22]

Sample outputs

Config  | Ground Truth                                                | Non-private (Baseline) Model                          | PEC Model
Canary  | adjust prudence lamplit spiral tree perception kirtland     | theyjust prudence lampitir tree perception kircepted  | EMPTY
Holdout | rightly characters fatter accompany yielding trace clubbed  | exard me indeed                                       | EMPTY

30 of 37

Compute and Utility Overhead of Per-Example Clipping (PEC)

- PEC limits batch processing on GPUs/TPUs, resulting in slowdowns of up to two orders of magnitude [LK’20, SVK’21]

- Each GPU/TPU core needs to materialize per-example gradients

- The larger the per-core batch size, the higher the compute/memory overhead

- PEC can also add excessive bias during training [CWH’21, SST.T’21]

Model                      | Exposure @ 16 freq. | WER (avg. over 3 reps.) | Steps/sec
Non-private baseline       | 13.5                | 4.00                    | ~2
Per-example clip (PEC) 2.5 | 1.0                 | 4.09 (+2% rel.)         | ~1 (50% speed)

31 of 37

Reducing Compute and Utility Overhead

- Microbatch clipping [PHK+’23]: Clip the average of several gradients

- Materialize only microbatch gradients, improving memory footprint
- Weaker empirical privacy with increasing microbatch size


- Special case: Per-core clipping [WT.M’24]

- Clip average of all gradients on a TPU core

- For data-sharded training, no memory overhead, ~no compute overhead
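A minimal numpy sketch of the per-core (microbatch) variant, assuming the global batch splits evenly into per-core groups:

```python
import numpy as np

def per_core_clip(per_example_grads: np.ndarray, per_core_batch: int = 4,
                  clip_norm: float = 1.0):
    """Per-core clipping (PCC): average the gradients within each core's batch
    first, clip that average, then combine across cores. Only one gradient per
    core is materialized, unlike PEC. Input shape: (batch_size, num_params)."""
    dim = per_example_grads.shape[-1]
    core_grads = per_example_grads.reshape(-1, per_core_batch, dim).mean(axis=1)
    norms = np.linalg.norm(core_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    return (core_grads * scale).mean(axis=0)
```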

32 of 37

Empirical Privacy of Per-Core Clipping [WT.M’24]

Memorization mitigation via PCC is close to that of PEC!

- Results with per-core batch size 4


33 of 37

Compute and Utility Advantages of Per-Core Clipping (PCC)

- PCC closes utility gap w.r.t. baseline training

- PCC matches compute and memory of baseline training

Model                      | Exposure @ 16 freq. | WER (avg. over 3 reps.) | Steps/sec
Non-private baseline       | 13.5                | 4.00                    | ~2
Per-example clip (PEC) 2.5 | 1.0                 | 4.09 (+2% rel.)         | ~1 (50% speed)
Per-core clip (PCC) 2.5    | 2.1                 | 3.87 (-3.2% rel.)       | ~2

- Follow-up [WT.M+’24]: Improved utility and faster convergence for ASR via PCC

34 of 37

Future Directions

Privacy attacks leveraging multi-modal user data

- NNs getting larger by the day

[Figure source: scale.com/guides/large-language-models]

35 of 37

Future Directions

Privacy attacks leveraging multi-modal user data

- NNs getting larger by the day

- Larger models shown to memorize more [HCT.M’22, WT.M’24, PPSH’25, MSG+, …]

- Multi-modal user data increasingly being used for training


- Most attacks focus on example-level privacy leakage

- Recent work [SIHS’21, KPO+’23] on user-level inference in LMs

Goal: Design user-level privacy leakage attacks for multi-modal models

- Recent work [WHG+] on cross-modality memorization in VLMs

36 of 37

Future Directions

Moving Privacy Upstream in Model Training

- Training pipelines are getting complex

- Training methods are rapidly evolving


Goal: Design privacy methods that integrate at the earliest stages of training

- Ideally, as native to data pipelines as preprocessing and augmentation

- Private synthetic data, private rewriting, etc.

- Example: “DP-fy your data” tutorial (ICML’25)

- Data practitioners already know “data”

- Privacy upstream can benefit all downstream use-cases

37 of 37

Summary

- Types of Privacy Leaks

- Extraction Attacks (Noise Masking)

- Memorization Audits (Secret Sharer framework)

- Empirical Privacy Techniques

Thank You