Lecture 8
TTS: Acoustic models and vocoders
Text-to-speech (TTS)
2
TTS is a hard generative task
3
Speech has multiple components:
Who? – Speaker
How? – Prosody (intonation), accent
What? – Text, language
So there are many ways to generate outputs for a single text
TTS pipeline
4
Text
Preprocessor
Linguistic features
Wav
End-to-end
VITS
Parametric space
Acoustic model
Tacotron, FastPitch, GradTTS, MQ-TTS
Vocoder
WaveNet,
Hi-Fi GAN, WaveGlow, LPCNet,
Vocos
/ p rr i vv ee t /
Mel-spectrogram
Discrete tokens
https://github.com/ZhangXInFD/SpeechTokenizer
Acoustic models
for Mel-Spectrograms
5
Tacotron 2 (a.k.a. just Tacotron)
6
https://arxiv.org/pdf/1712.05884.pdf
https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2
Recap: seq2seq paradigm
7
https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html
Recap: seq2seq paradigm
8
https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html
Tacotron: Ok, let's apply seq2seq to speech synthesis!
Encoder
Decoder
Attention
/ p rr i vv ee t /
Generate mel-spectrogram sequentially frame by frame
Train with MSE loss
Problem: wait… When do we stop?
Encoder
Decoder
Attention
/ p rr i vv ee t /
Generate mel-spectrogram sequentially frame by frame
We don’t have an EOS token!
Duct tape #1: separate head for stop token
Encoder
Decoder
Attention
/ p rr i vv ee t /
Generate mel-spectrogram sequentially frame by frame
Stop Token
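A minimal PyTorch sketch of such a stop-token head (module and dimensions are illustrative, not the exact Tacotron 2 code): the decoder state is projected to a single logit, trained with binary cross-entropy, and inference stops once the sigmoid output crosses a threshold (0.5 in the Tacotron 2 paper).

```python
import torch
import torch.nn as nn

class StopTokenHead(nn.Module):
    """Predicts P(stop) from the decoder state at every mel frame (illustrative sketch)."""
    def __init__(self, decoder_dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(decoder_dim, 1)

    def forward(self, decoder_states):                 # (B, T, decoder_dim)
        return self.proj(decoder_states).squeeze(-1)   # logits, (B, T)

# Training: BCE against a target that is 0 everywhere and 1 on the final frame(s)
head = StopTokenHead()
logits = head(torch.randn(2, 100, 1024))
targets = torch.zeros(2, 100)
targets[:, -1] = 1.0
loss = nn.functional.binary_cross_entropy_with_logits(logits, targets)

# Inference: stop autoregressive generation once sigmoid(logit) > 0.5
stop_now = torch.sigmoid(logits[:, -1]) > 0.5
```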
Problem: Bahdanau attention is not enough
Encoder
Decoder
Attention
/ p rr i vv ee t /
Generate mel-spectrogram sequentially frame by frame
Stop Token
No positional encoding: the model forgets which symbols it has already pronounced
Problem: Bahdanau attention is not enough
Vanilla (Bahdanau) attention matrix elements:
α_{i,j} = softmax_j(e_{i,j}),  e_{i,j} = v^T tanh(W s_{i-1} + U h_j)
Problems:
13
s_{i-1} – previous decoder output
h_j – current encoder output
Duct tape #2: Location Sensitive Attention
So, the attention weights of the next step start to depend on the attention weights of the previous step.
14
https://arxiv.org/pdf/1506.07503v1.pdf
Previous decoder output
Previous attention hidden state
Previous attention context vector
Duct tape #2: Location Sensitive Attention
So, the attention weights of the next step start to also depend on the "locally close" attention weights of the previous step. It's like running your finger along a line of a book while reading aloud.
15
Was influenced by r positions around j
https://arxiv.org/pdf/1506.07503v1.pdf
2. “Look at” close attention weights with CNN:
Previous attention hidden state
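Below is a hedged PyTorch sketch of location-sensitive attention in the spirit of Chorowski et al.: the previous attention weights are passed through a 1-D convolution and added as an extra term to the attention energies. The hyperparameters (128-dim attention space, 32 filters of length 31) follow the Tacotron 2 paper, but the module itself is illustrative rather than the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationSensitiveAttention(nn.Module):
    """e_{i,j} = v^T tanh(W s_i + V h_j + U f_{i,j}),  f_i = conv(alpha_{i-1})."""
    def __init__(self, query_dim, memory_dim, attn_dim=128, n_filters=32, kernel_size=31):
        super().__init__()
        self.query_layer = nn.Linear(query_dim, attn_dim, bias=False)
        self.memory_layer = nn.Linear(memory_dim, attn_dim, bias=False)
        self.location_conv = nn.Conv1d(1, n_filters, kernel_size,
                                       padding=(kernel_size - 1) // 2, bias=False)
        self.location_layer = nn.Linear(n_filters, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, memory, prev_weights):
        # query: (B, query_dim), memory: (B, L, memory_dim), prev_weights: (B, L)
        loc = self.location_conv(prev_weights.unsqueeze(1)).transpose(1, 2)  # (B, L, n_filters)
        energies = self.v(torch.tanh(
            self.query_layer(query).unsqueeze(1)   # (B, 1, attn_dim)
            + self.memory_layer(memory)            # (B, L, attn_dim)
            + self.location_layer(loc)             # (B, L, attn_dim)
        )).squeeze(-1)                             # (B, L)
        weights = F.softmax(energies, dim=-1)
        context = torch.bmm(weights.unsqueeze(1), memory).squeeze(1)  # (B, memory_dim)
        return context, weights
```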
Duct tape #2: Location Sensitive Attention
Encoder
Decoder
Location
Sensitive Attention
/ p rr i vv ee t /
Generate mel-spectrogram sequentially frame by frame
Stop Token
Problem: generating a high-quality frame in one step is hard
Encoder
Decoder
Location
Sensitive Attention
/ p rr i vv ee t /
Generate mel-spectrogram sequentially frame by frame
Stop Token
Low-quality sound
Duct tape #3: Pre-Net and Post-Net
Encoder
Pre-Net
Location
Sensitive Attention
/ p rr i vv ee t /
Generate mel-spectrogram sequentially frame by frame
Stop Token
Low-quality sound
Post-Net
High-quality sound
Problem: the model copies the input frame to the output
Encoder
Pre-Net
Location
Sensitive Attention
/ p rr i vv ee t /
Generate mel-spectrogram sequentially frame by frame
Stop Token
Low-quality sound
Post-Net
High-quality sound
Neighbouring frames differ very little, so copying the previous frame already gives a low MSE
Duct tape #4: intensive dropout in Pre-Net
Encoder
Pre-Net
Location
Sensitive Attention
/ p rr i vv ee t /
Generate mel-spectrogram sequentially frame by frame
Stop Token
Low-quality sound
Post-Net
High-quality sound
Dense
ReLU
DropOut 0.5
Dense
ReLU
DropOut 0.5
PreNet
Dropout is active during training & inference
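A small sketch of such a Pre-Net (dimensions follow the Tacotron 2 paper; the module itself is illustrative). The key detail is calling dropout with training=True so it stays active at inference:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreNet(nn.Module):
    """Two Dense + ReLU layers; dropout is applied even at inference time."""
    def __init__(self, in_dim=80, hidden=256, p=0.5):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.p = p

    def forward(self, x):
        # training=True keeps dropout active at inference as well, which injects
        # noise and prevents the decoder from simply copying the previous frame.
        x = F.dropout(F.relu(self.fc1(x)), p=self.p, training=True)
        x = F.dropout(F.relu(self.fc2(x)), p=self.p, training=True)
        return x
```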
Tacotron: original image
21
https://arxiv.org/pdf/1712.05884.pdf
Tacotron: original image
22
Encoder
Decoder = Pre-Net + LSTM + FF
https://arxiv.org/pdf/1712.05884.pdf
#1 Stop Token
#3 Pre-Net & Post-Net
#2 Location Sensitive Attention
#4 Dropout in Pre-Net
Tacotron: training objectives
23
https://arxiv.org/pdf/1712.05884.pdf
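A sketch of the training objective, assuming the usual formulation (summed MSE on the mel prediction before and after the Post-Net, plus a binary cross-entropy term for the stop/gate token, as in the NVIDIA reference implementation):

```python
import torch.nn.functional as F

def tacotron2_loss(mel_before, mel_after, gate_logits, mel_target, gate_target):
    """Mel MSE before and after the Post-Net + BCE for the stop (gate) token."""
    mel_loss = F.mse_loss(mel_before, mel_target) + F.mse_loss(mel_after, mel_target)
    gate_loss = F.binary_cross_entropy_with_logits(gate_logits, gate_target)
    return mel_loss + gate_loss
```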
Limitations
Has limited capacity, so
24
Questions?
25
F0, harmonics, pitch
F0, harmonics
27
F0 contour
28
Pitch
Pitch is defined as our perception of fundamental frequency.
Fun fact: If we remove F0 using a high-pass filter, the brain can still perceive the original pitch from the harmonics — this is called the missing fundamental effect.
https://speechprocessingbook.aalto.fi/Representations/Fundamental_frequency_F0.html
https://youtu.be/AZ8qZCGg4Bk?si=2_BEzAvYvIoJeKG8
A voice signal with F0 = 100 Hz
The lowest remaining harmonic is 500 Hz (the 5th harmonic of 100 Hz)
Despite the missing F0, humans typically still perceive the pitch as 100 Hz
A high-pass filter removes frequencies below 450 Hz
F0 contour: examples
30
When does finding F0 make sense?
F0 detection only makes sense for "vibrating" (voiced) sounds
31
Mel-spectrogram of the phrase “Я Алиса” (“I am Alice” in Russian)
Stressed "a" from "ya": harmonics (parallel lines) from vocal cord vibrations are visible
Pause
Voiceless "s" (no harmonic component, noise across all frequencies)
When does finding F0 make sense?
Why not in whispered speech or voiceless consonants?
32
The waveforms of speech
Whisper in Yandex Alice
*https://brianmcfee.net/dstbook-site/content/ch01-signals/Waves.html#aperiodic-signals
Mel-spectrogram of the phrase “Я Алиса” (“I am Alice” in Russian)
Stressed "a" from "ya": harmonics (parallel lines) from vocal cord vibrations are visible
Pause
Voiceless "s" (no harmonic component, noise across all frequencies)
When does finding F0 make sense?
33
In whispering, the vocal cords are hardly active, so there are no harmonics (parallel lines), even in vowels.
Speech
Whisper
How can we find F0?
When the signal has a clearly observable F0 (such as in vowels and some consonants), its spectrum has periodic patterns with a period of F0, as each harmonic is a multiple of F0.
34
https://flothesof.github.io/cepstrum-pitch-tracking.html
T
F0 = 1 / T
k * F0 – harmonics
How can we find F0?
So it makes sense to apply FFT to the (log) spectrum – one of the peaks will correspond to 1 / F0 (FFT inverts initial units: remember how T became 1 / T).
Where to look for that peak? Within the typical human pitch range: 80–450 Hz.
35
https://flothesof.github.io/cepstrum-pitch-tracking.html
T
F0 = 1 / T
k * F0 – harmonics
1 / F0 = T
Cepstrum for F0 estimation
Obtain the Cepstrum*:
*Cepstrum – a word play on "spectrum"
**Quefrency – a word play on "frequency"
36
https://flothesof.github.io/cepstrum-pitch-tracking.html
T
F0 = 1 / T
1 / F0 = T
k * F0
Cepstrum for F0 estimation
Identify the F0 Candidate:
37
https://flothesof.github.io/cepstrum-pitch-tracking.html
T
F0 = 1 / T
1 / F0 = T
k * F0
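A minimal NumPy sketch of cepstral F0 estimation under these assumptions (Hann window, peak search restricted to 80–450 Hz; function names are illustrative):

```python
import numpy as np

def cepstral_f0(frame, sr, fmin=80.0, fmax=450.0):
    """Estimate F0 of a single windowed frame via the cepstrum (illustrative sketch).

    Cepstrum = inverse FFT of the log magnitude spectrum; a periodic spectrum
    with period F0 produces a peak at quefrency T = 1 / F0.
    """
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    log_spectrum = np.log(spectrum + 1e-9)
    cepstrum = np.abs(np.fft.irfft(log_spectrum))

    # Search for a peak only inside the typical human pitch range (80-450 Hz),
    # i.e. quefrencies between 1/fmax and 1/fmin seconds (in samples).
    q_min = int(sr / fmax)
    q_max = int(sr / fmin)
    peak = np.argmax(cepstrum[q_min:q_max]) + q_min
    return sr / peak

# Example: a synthetic "voiced" frame with F0 = 120 Hz and a few harmonics
sr = 16000
t = np.arange(0, 0.05, 1 / sr)
frame = sum(np.sin(2 * np.pi * 120 * k * t) / k for k in range(1, 6))
print(cepstral_f0(frame, sr))  # ~120 Hz
```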
Cepstrum for F0 estimation
Limitations:
38
https://flothesof.github.io/cepstrum-pitch-tracking.html
T
F0 = 1 / T
1 / F0 = T
k * F0
Break time!
FastPitch
40
https://arxiv.org/pdf/2006.06873.pdf
https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch
FastPitch
41
https://arxiv.org/pdf/2006.06873.pdf
https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch
FastPitch
42
https://arxiv.org/pdf/2006.06873.pdf
https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch
https://arxiv.org/pdf/1905.09263
/ p rr i vv ee t /
FastPitch
43
https://arxiv.org/pdf/2006.06873.pdf
https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch
/ p rr i vv ee t /
FastPitch
44
https://arxiv.org/pdf/2006.06873.pdf
https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch
/ p rr i vv ee t /
Pitch of input symbols
F0 values are averaged over every input symbol using the extracted durations d
45
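A hypothetical helper showing this averaging step (unvoiced frames with F0 = 0 are excluded from the mean; the exact handling differs between implementations):

```python
import torch

def average_pitch(f0, durations):
    """Average frame-level F0 over each input symbol, given per-symbol durations.

    f0:        (T_frames,) frame-level F0, 0 for unvoiced frames
    durations: (N_symbols,) integer durations in frames, sum(durations) == T_frames
    """
    out = torch.zeros(len(durations))
    start = 0
    for i, d in enumerate(durations.tolist()):
        segment = f0[start:start + d]
        voiced = segment[segment > 0]          # ignore unvoiced frames
        out[i] = voiced.mean() if len(voiced) > 0 else 0.0
        start += d
    return out

f0 = torch.tensor([0., 110., 112., 0., 0., 220., 221., 219.])
durations = torch.tensor([3, 2, 3])
print(average_pitch(f0, durations))  # tensor([111., 0., 220.])
```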
Durations of input symbols
Monotonic alignment: either stay on the same row or jump diagonally.
Having such alignment, we can get phoneme durations as segments of the alignment path.
46
Best alignment: Monotonic Alignment Search
Suppose we have a soft-alignment matrix between text and mel-spectrogram. We can find the most probable monotonic alignment A*:
https://arxiv.org/pdf/2005.11129.pdf
https://jaketae.github.io/study/glowtts/#monotonic-alignment-search
47
Mel-frames
Monotonic Alignment Search
Let Q_{i,j} be the maximum likelihood up to the (i, j)-th element. Then it can be recursively formulated as
Q_{i,j} = max(Q_{i-1, j-1}, Q_{i, j-1}) + log P_{i,j},
where P_{i,j} is the likelihood of mel frame j under text token i.
We iteratively calculate all the values of Q_{i,j} up to Q_{T_text, T_mel}.
All the values of A* are backtracked from the end of the alignment, A*(T_mel) = T_text.
48
https://arxiv.org/pdf/2005.11129.pdf
https://jaketae.github.io/study/glowtts/#monotonic-alignment-search
Calculate Q
Backtrack the best path
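A NumPy sketch of this dynamic program and backtracking (a simplified, non-vectorized version of MAS from Glow-TTS; real implementations are heavily optimized):

```python
import numpy as np

def monotonic_alignment_search(log_p):
    """log_p[i, j] = log-likelihood of mel frame j under text token i.
    Returns, for every frame, the index of the text token it is aligned to."""
    N, T = log_p.shape                       # N text tokens, T mel frames (N <= T)
    Q = np.full((N, T), -np.inf)
    Q[0, 0] = log_p[0, 0]
    for j in range(1, T):
        for i in range(min(j + 1, N)):
            stay = Q[i, j - 1]                            # keep the same token
            move = Q[i - 1, j - 1] if i > 0 else -np.inf  # advance to the next token
            Q[i, j] = max(stay, move) + log_p[i, j]

    # Backtrack: the last frame must be aligned to the last token.
    alignment = np.zeros(T, dtype=int)
    i = N - 1
    for j in range(T - 1, -1, -1):
        alignment[j] = i
        if j > 0 and i > 0 and Q[i - 1, j - 1] >= Q[i, j - 1]:
            i -= 1
    return alignment  # per-token durations: np.bincount(alignment, minlength=N)

log_p = np.log(np.random.rand(3, 7))
print(monotonic_alignment_search(log_p))
```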
Alignment sources
49
FastPitch vs Tacotron
FastPitch
Tacotron
50
Vocoders
51
Recap: TTS pipeline
52
Text
Preprocessor
Linguistic features
Wav
End-to-end
VITS
Parametric space
Acoustic model
Tacotron,
FastPitch, GradTTS, MQ-TTS
Vocoder
WaveNet,
Hi-Fi GAN, WaveGlow, LPCNet,
Vocos
/ p rr i vv ee t /
Mel-spectrogram
Discrete tokens
https://github.com/ZhangXInFD/SpeechTokenizer
Recap: Dilated Convolutions
53
https://github.com/vdumoulin/conv_arithmetic/tree/master
Recap: Transposed Convolutions
54
https://github.com/vdumoulin/conv_arithmetic/tree/master
Recap: Generative Adversarial Networks
55
https://github.com/yandexdataschool/speech_course/tree/2022/week_09
Hi-Fi GAN
56
https://arxiv.org/pdf/2010.05646.pdf
Hi-Fi GAN: Generator
Fully-Convolutional Architecture
57
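A strongly simplified sketch of the generator's fully-convolutional data flow (mel → transposed-convolution upsampling interleaved with dilated residual blocks → waveform). The real model uses multi-receptive-field fusion with several kernel sizes, weight normalization, and carefully tuned channel counts; this is only an illustration of the structure:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Simplified residual block with dilated convolutions (the real MRF module
    uses several such blocks with different kernel sizes and dilations)."""
    def __init__(self, channels, kernel_size=3, dilations=(1, 3, 5)):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size,
                      dilation=d, padding=d * (kernel_size - 1) // 2)
            for d in dilations
        ])

    def forward(self, x):
        for conv in self.convs:
            x = x + conv(torch.nn.functional.leaky_relu(x, 0.1))
        return x

class SimplifiedHiFiGANGenerator(nn.Module):
    """Mel (B, 80, T) -> waveform (B, 1, T * 256): a sketch, not the full model."""
    def __init__(self, upsample_rates=(8, 8, 2, 2), base_channels=512):
        super().__init__()
        self.pre = nn.Conv1d(80, base_channels, 7, padding=3)
        ups, blocks = [], []
        ch = base_channels
        for r in upsample_rates:
            ups.append(nn.ConvTranspose1d(ch, ch // 2, kernel_size=2 * r,
                                          stride=r, padding=r // 2))
            ch //= 2
            blocks.append(ResBlock(ch))
        self.ups = nn.ModuleList(ups)
        self.blocks = nn.ModuleList(blocks)
        self.post = nn.Conv1d(ch, 1, 7, padding=3)

    def forward(self, mel):
        x = self.pre(mel)
        for up, block in zip(self.ups, self.blocks):
            x = block(up(torch.nn.functional.leaky_relu(x, 0.1)))
        return torch.tanh(self.post(x))

wav = SimplifiedHiFiGANGenerator()(torch.randn(1, 80, 50))
print(wav.shape)  # torch.Size([1, 1, 12800]) -- 50 frames * 256 hop
```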
Multi-Period Discriminator (MPD)
58
https://anwarvic.github.io/speech-synthesis/HiFi-GAN
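The core trick of the MPD is reshaping the 1-D waveform into a 2-D array whose width equals the period, so that 2-D convolutions only mix samples that are exactly one period apart. A small sketch of that reshaping step (the surrounding 2-D convolution stack is omitted; the paper uses periods 2, 3, 5, 7, 11):

```python
import torch
import torch.nn.functional as F

def reshape_for_period(wav, period):
    """Reshape (B, 1, T) audio into (B, 1, T // period, period)."""
    b, c, t = wav.shape
    if t % period != 0:                        # pad so T is divisible by the period
        pad = period - t % period
        wav = F.pad(wav, (0, pad), mode="reflect")
        t = t + pad
    return wav.view(b, c, t // period, period)

x = torch.randn(4, 1, 16000)
print(reshape_for_period(x, 3).shape)  # torch.Size([4, 1, 5334, 3])
```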
Multi-Scale Discriminator (MSD)
59
Losses
60
The overall loss
61
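For reference, the objectives from the HiFi-GAN paper (summed over all sub-discriminators; x is the ground-truth waveform, s the input mel-spectrogram, φ the mel transform, D^i the i-th discriminator feature map with N_i elements):

```latex
L_{Adv}(D;G) = \mathbb{E}_{(x,s)}\big[(D(x)-1)^2 + D(G(s))^2\big]
L_{Adv}(G;D) = \mathbb{E}_{s}\big[(D(G(s))-1)^2\big]
L_{Mel}(G)   = \mathbb{E}_{(x,s)}\big[\lVert \phi(x) - \phi(G(s)) \rVert_1\big]
L_{FM}(G;D)  = \mathbb{E}_{(x,s)}\Big[\textstyle\sum_{i=1}^{T} \tfrac{1}{N_i}\lVert D^i(x) - D^i(G(s)) \rVert_1\Big]
L_G = L_{Adv}(G;D) + \lambda_{fm} L_{FM}(G;D) + \lambda_{mel} L_{Mel}(G), \quad \lambda_{fm}=2,\ \lambda_{mel}=45
L_D = L_{Adv}(D;G)
```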
62
Additional slides
Tacotron: decoder & attention, detailed diagram
64
(encoder states)
Tacotron: decoder & attention - formulas
Attention is modeled by an RNN
A convolution over the previous attention weights is used as an additional input when computing the new weights
* Attention-Based Models for Speech Recognition
Location-sensitive attention: not exactly monotonic
66
(*) Some papers on the topic:
https://arxiv.org/pdf/1704.00784.pdf
https://bshall.github.io/Tacotron/
Formants
The peaks of the spectral envelope are called formants:
67
https://wiki.aalto.fi/pages/viewpage.action?pageId=149890776
Formants
68
https://www.researchgate.net/figure/Spectrograms-of-the-vowels-i-o-and-u-international-phonetic-symbols_fig2_277131520
Adult male
Adult female
Augmentations for acoustic models
Trainable alignment for FastPitch
So, an alignment can be trained jointly with FastPitch:
70
https://arxiv.org/pdf/2108.10447.pdf
https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch#dataset-guidelines
Handling Multiple Speakers
Limitations of a single-speaker corpus:
71
Handling Multiple Speakers
Just use a trainable speaker embedding.
72
https://arxiv.org/pdf/1806.04558.pdf
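A minimal sketch of this conditioning (hypothetical module; adding the projected embedding to every encoder frame is one common choice, concatenation is another):

```python
import torch
import torch.nn as nn

class MultiSpeakerConditioning(nn.Module):
    """Look up a trainable speaker embedding and add it to every encoder output frame."""
    def __init__(self, n_speakers, enc_dim=512, spk_dim=64):
        super().__init__()
        self.speaker_table = nn.Embedding(n_speakers, spk_dim)
        self.proj = nn.Linear(spk_dim, enc_dim)

    def forward(self, encoder_outputs, speaker_id):
        # encoder_outputs: (B, L, enc_dim), speaker_id: (B,)
        spk = self.proj(self.speaker_table(speaker_id))   # (B, enc_dim)
        return encoder_outputs + spk.unsqueeze(1)         # broadcast over L

cond = MultiSpeakerConditioning(n_speakers=10)
out = cond(torch.randn(2, 37, 512), torch.tensor([0, 3]))
```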
Handling Multiple Speakers
Extract the speaker embedding from a pretrained model (e.g. for speaker verification).
73
https://arxiv.org/pdf/1806.04558.pdf
Prosody
Problems with training on intonation-rich data:
Possible solution: let’s try to encode prosody as a vector hint – “style embedding”
74
Style embedding
We can model style embedding as a weighted sum of randomly initialized embeddings – Global Style Tokens (GST).
The idea:
75
Global Style Tokens: Training
The reference encoder compresses the audio signal into a fixed-length vector called the reference embedding.
The reference embedding is used as the query vector for an attention module; the GSTs are used as keys and values.
The attention module outputs a weighted sum of the GSTs: the style embedding.
It is then passed to the decoder along with the encoder outputs.
76
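A single-head sketch of the style token layer under these assumptions (the paper uses multi-head attention and a tanh on the tokens; dimensions and names here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleTokenLayer(nn.Module):
    """Attention of a reference embedding over a bank of randomly initialized
    Global Style Tokens; the output is the style embedding."""
    def __init__(self, ref_dim=128, n_tokens=10, token_dim=256):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, token_dim) * 0.3)
        self.query_proj = nn.Linear(ref_dim, token_dim)

    def forward(self, ref_embedding):                      # (B, ref_dim)
        query = self.query_proj(ref_embedding)             # (B, token_dim)
        keys = torch.tanh(self.tokens)                     # (n_tokens, token_dim)
        scores = query @ keys.t() / keys.shape[-1] ** 0.5  # (B, n_tokens)
        weights = F.softmax(scores, dim=-1)
        style_embedding = weights @ keys                   # (B, token_dim)
        return style_embedding, weights

ref = torch.randn(2, 128)                  # output of the reference encoder
style, w = StyleTokenLayer()(ref)
```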
Global Style Tokens: Training
The style token layer is jointly trained with the rest of the model, driven only by the reconstruction loss from the model.
GSTs thus do not require any explicit style or prosody labels.
77
Global Style Tokens: Inference
We can directly condition the text encoder on certain tokens:
Style 1:
Style 2:
Style 3:
78
https://arxiv.org/abs/1803.09017
https://google.github.io/tacotron/publications/global_style_tokens/
Global Style Tokens: Inference
We can feed a different audio signal to achieve style transfer:
Source:
Baseline Tacotron:
GST Tacotron:
79
https://arxiv.org/abs/1803.09017
https://google.github.io/tacotron/publications/global_style_tokens/
Visualizations
80
Inference Problems
Approach 1 - use a single GST:
81
Inference Problems
Approach 2 - Style transfer from a single audio:
82
TP-GST (Text-Predicted GST)
Estimate style tokens from text by predicting either
83
https://arxiv.org/pdf/1808.01410.pdf
style embedding (TPSE)
or attention weights (TPCW)
TP-GST (Text-Predicted GST)
84
Reference encoder: Mel GT → Conv → GRU → Attention over GSTs → style embedding
GST Estimator: text Encoder → GRU → FC → predicted style embedding
The style embedding is added (+) to the encoder outputs and passed to the Attention & Decoder.
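A sketch of the TPSE path under these assumptions (a GRU over the text-encoder outputs followed by a fully-connected layer; the prediction is trained to match the GST-based style embedding, whose gradient is stopped):

```python
import torch
import torch.nn as nn

class TPSEPredictor(nn.Module):
    """Predict the style embedding directly from text-encoder outputs (illustrative)."""
    def __init__(self, enc_dim=512, hidden=128, style_dim=256):
        super().__init__()
        self.gru = nn.GRU(enc_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, style_dim)

    def forward(self, text_encoder_outputs):        # (B, L, enc_dim)
        _, h = self.gru(text_encoder_outputs)       # h: (1, B, hidden)
        return torch.tanh(self.fc(h.squeeze(0)))    # (B, style_dim)

pred = TPSEPredictor()(torch.randn(2, 37, 512))
# Training target: the GST-based style embedding, with a stop-gradient:
# loss = torch.nn.functional.l1_loss(pred, style_embedding.detach())
```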
TP-GST (Text-Predicted GST)
The GST Estimator can be trained either jointly with the rest of the model or independently at a later stage.
Training jointly does not permit intonation manipulations and may lead to unintended shifts in intonation.
85
TP-GST (Text-Predicted GST)
Training independently at a later stage enables intonation manipulation through at least two methods:
86
Semantic hints
87
Semantic hints: pre-trained LMs
Simple approach: extract word embeddings from pre-trained models and combine them with the phoneme embeddings before sending them to the text encoder
88
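A hypothetical helper illustrating this combination step (the phoneme-to-word mapping and the way the embeddings are merged vary between systems):

```python
import torch

def add_word_embeddings(phoneme_embeddings, word_embeddings, phoneme_to_word):
    """Broadcast word-level embeddings (e.g. from BERT) to phoneme positions
    and concatenate them with the phoneme embeddings.

    phoneme_embeddings: (L_phonemes, D_ph)
    word_embeddings:    (N_words, D_word)
    phoneme_to_word:    (L_phonemes,) index of the word each phoneme belongs to
    """
    expanded = word_embeddings[phoneme_to_word]           # (L_phonemes, D_word)
    return torch.cat([phoneme_embeddings, expanded], dim=-1)

ph = torch.randn(6, 256)                    # 6 phoneme embeddings
words = torch.randn(2, 768)                 # 2 words, BERT-sized vectors
mapping = torch.tensor([0, 0, 0, 1, 1, 1])  # first 3 phonemes -> word 0, rest -> word 1
combined = add_word_embeddings(ph, words, mapping)   # (6, 1024)
```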
Semantic hints: PnG-BERT
89
PnG-BERT: pretraining
90
PnG-BERT: during AM training
91
Limitations
92
Additional slides, vocoders
93
WaveNet
94
https://arxiv.org/pdf/1609.03499.pdf
https://deepmind.google/discover/blog/wavenet-a-generative-model-for-raw-audio/
Recap: Causal Convolutions
95
Mu-law encoding
WaveNet treats the waveform as a discrete signal: all amplitude values are quantized into discrete bins, and the model predicts the index of a bin.
Audio samples are quantized using mu-law encoding:
96
MSE loss
Cross-Entropy loss
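A NumPy sketch of mu-law companding and quantization into 256 bins (mu = 255), following the formula from the WaveNet paper:

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compand a waveform in [-1, 1] and quantize it into mu + 1 = 256 bins:
    f(x) = sign(x) * ln(1 + mu * |x|) / ln(1 + mu)."""
    companded = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((companded + 1) / 2 * mu + 0.5).astype(np.int64)   # integers 0 .. 255

def mu_law_decode(bins, mu=255):
    """Inverse transform: bin index -> approximate amplitude in [-1, 1]."""
    companded = 2 * (bins.astype(np.float64) / mu) - 1
    return np.sign(companded) * ((1 + mu) ** np.abs(companded) - 1) / mu

x = np.linspace(-1, 1, 5)
print(mu_law_encode(x))                     # integer bin indices in 0..255
print(mu_law_decode(mu_law_encode(x)))      # close to the original amplitudes
```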
WaveNet: Overview
97
https://github.com/yandexdataschool/speech_course/tree/2022/week_09
WaveNet Data-flow Graph
98
https://github.com/yandexdataschool/speech_course/tree/2022/week_09
https://github.com/vincentherrmann/pytorch-wavenet/blob/master/wavenet_model.py
(the first r samples are just zeros)
Causal nature: does not look into the future
Exponential dilation increase: dilation grows exponentially: 1, 2, 4, 8, …
Kernel size: typically 2, though 3 is also feasible
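An illustrative sketch of such a causal, exponentially dilated convolution stack (WaveNet additionally uses gated activations plus residual and skip connections, which are omitted here):

```python
import torch
import torch.nn as nn

class CausalDilatedConv1d(nn.Module):
    """Kernel size 2 with left padding only: the output at time t sees inputs
    t and t - dilation, never the future."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.dilation = dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)

    def forward(self, x):                              # (B, C, T)
        x = nn.functional.pad(x, (self.dilation, 0))   # pad the past, not the future
        return self.conv(x)

# A stack with dilations 1, 2, 4, ..., 512 covers a receptive field of
# 1 + sum(dilations) = 1024 samples.
layers = nn.Sequential(*[CausalDilatedConv1d(32, 2 ** i) for i in range(10)])
out = layers(torch.randn(1, 32, 16000))
print(out.shape)                                       # torch.Size([1, 32, 16000])
```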
WaveNet Output
99
https://github.com/yandexdataschool/speech_course/tree/2022/week_09
WaveNet Conditioning
100
https://github.com/yandexdataschool/speech_course/tree/2022/week_09
https://github.com/vdumoulin/conv_arithmetic/tree/master
WaveNet Pros and Cons
Pros
Cons
101
WaveNet Inference
102