1 of 102

Lecture 8

TTS: Acoustic models and vocoders

2 of 102

Text-to-speech (TTS)

2

3 of 102

TTS is a hard generative task

3

Speech has multiple components:

Who?Speaker

How?Prosody (intonation), accent

What?Text, language

So there are many ways to generate outputs for a single text

4 of 102

TTS pipeline

4

Text

Preprocessor

Linguistic features

Wav

End-to-end

VITS

Parametric space

Acoustic model

Tacotron, FastPitch, GradTTS, MQ-TTS

Vocoder

WaveNet,

Hi-Fi GAN, WaveGlow, LPCNet,

Vocos

/ p rr i vv ee t /

Mel-spectrogram

Discrete tokens

https://github.com/ZhangXInFD/SpeechTokenizer
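A minimal sketch of the two-stage inference path shown above; `g2p`, `acoustic_model` and `vocoder` are hypothetical stand-ins (e.g. a grapheme-to-phoneme frontend, FastPitch, Hi-Fi GAN), not a specific library API:

```python
import torch

def synthesize(text: str, g2p, acoustic_model, vocoder) -> torch.Tensor:
    """Hedged sketch of the two-stage TTS pipeline: text -> phonemes -> mel -> wav."""
    phoneme_ids = torch.tensor([g2p(text)])   # (1, T_text), e.g. "привет" -> / p rr i vv ee t /
    mel = acoustic_model(phoneme_ids)         # (1, n_mels, T_frames), parametric space
    wav = vocoder(mel)                        # (1, T_samples), raw waveform
    return wav
```

The end-to-end branch (e.g. VITS) collapses the two stages into a single model.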

5 of 102

Acoustic models

for Mel-Spectrograms

  • Tacotron
  • F0, pitch
  • FastPitch

5

6 of 102

Tacotron 2 (a.k.a. just Tacotron)

  • Google developed Tacotron 2 in 2017
  • LSTM-only in the basic version (Transformers had not yet taken over the world back then)
  • There once was a Tacotron 1, but it was completely replaced by Tacotron 2
  • Does not require massive data to start producing speech (~20 hours of studio recordings may be enough for a competitive baseline)
  • Still works surprisingly well

6

https://arxiv.org/pdf/1712.05884.pdf

https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2

7 of 102

Recap: seq2seq paradigm

7

https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html

8 of 102

Recap: seq2seq paradigm

8

https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html

9 of 102

Tacotron: Ok, let’s apply seq2seq to speech synthesis!

Encoder

Decoder

Attention

/ p rr i vv ee t /

Generate mel-spectrogram sequentially frame by frame

Train with MSE loss

10 of 102

Problem: wait… When do we stop?

Encoder

Decoder

Attention

/ p rr i vv ee t /

Generate mel-spectrogram sequentially frame by frame

We don’t have an EOS token!

11 of 102

Duct tape #1: separate head for stop token

Encoder

Decoder

Attention

/ p rr i vv ee t /

Generate mel-spectrogram sequentially frame by frame

Stop Token
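A hedged sketch of what such a stop-token head can look like: a linear projection of the decoder state to a single logit, trained with binary cross-entropy against a "this is the last frame" target. Module and dimension names are illustrative only.

```python
import torch
import torch.nn as nn

class StopTokenHead(nn.Module):
    """Sketch: project the decoder state to one 'stop' logit (1 = last frame)."""
    def __init__(self, decoder_dim: int):
        super().__init__()
        self.proj = nn.Linear(decoder_dim, 1)

    def forward(self, decoder_state):                 # (B, decoder_dim)
        return self.proj(decoder_state).squeeze(-1)   # (B,) stop logits

# At inference, generation halts once the stop probability crosses a threshold:
# if torch.sigmoid(stop_logit) > 0.5: break
```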

12 of 102

Problem: Bahdanau attention is not enough

Encoder

Decoder

Attention

/ p rr i vv ee t /

Generate mel-spectrogram sequentially frame by frame

Stop Token

No positional encoding: the model forgets which symbols it has already pronounced

13 of 102

Problem: Bahdanau attention is not enough

Vanilla attention matrix elements:

Problems:

  • If we encounter the same letter in different places, Tako (Tacotron) might randomly switch between them (no positional embeddings).
  • If the sequence is long, Tako might forget where it is and go back.

13

Previous decoder output

Current encoder output

14 of 102

Duct tape #2: Location Sensitive Attention

So, the attention weights of the next step start to depend on the attention weights of the previous step.

14

https://arxiv.org/pdf/1506.07503v1.pdf

  1. Model attention with RNN:

Previous decoder output

Previous attention hidden state

Previous attention context vector

15 of 102

Duct tape #2: Location Sensitive Attention

So, the attention weights of the next step start to also depend on the “locally close” attention weights of the previous step. It's like running your finger along a line of a book while reading aloud.

15

Was influenced by r positions around j

https://arxiv.org/pdf/1506.07503v1.pdf

2. “Look at” close attention weights with CNN:

Previous attention hidden state
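A minimal PyTorch sketch of location-sensitive attention in the spirit of the cited paper: the previous attention weights are convolved and added to the usual content-based energy. Hyperparameters and layer names are illustrative, not Tacotron 2's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationSensitiveAttention(nn.Module):
    """Sketch of location-sensitive attention (Chorowski et al., 2015)."""
    def __init__(self, query_dim, key_dim, attn_dim, n_filters=32, kernel=31):
        super().__init__()
        self.query_layer = nn.Linear(query_dim, attn_dim, bias=False)
        self.key_layer = nn.Linear(key_dim, attn_dim, bias=False)
        self.location_conv = nn.Conv1d(1, n_filters, kernel, padding=kernel // 2, bias=False)
        self.location_layer = nn.Linear(n_filters, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, keys, prev_weights):
        # query: (B, query_dim) decoder state; keys: (B, T, key_dim) encoder outputs;
        # prev_weights: (B, T) attention weights from the previous decoder step
        loc = self.location_conv(prev_weights.unsqueeze(1)).transpose(1, 2)  # (B, T, n_filters)
        energies = self.v(torch.tanh(
            self.query_layer(query).unsqueeze(1)   # content term from the decoder state
            + self.key_layer(keys)                 # content term from the encoder outputs
            + self.location_layer(loc)             # location term from previous weights
        )).squeeze(-1)                             # (B, T)
        weights = F.softmax(energies, dim=-1)
        context = torch.bmm(weights.unsqueeze(1), keys).squeeze(1)           # (B, key_dim)
        return context, weights
```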

16 of 102

Duct tape #2: Location Sensitive Attention

Encoder

Decoder

Location

Sensitive Attention

/ p rr i vv ee t /

Generate mel-spectrogram sequentially frame by frame

Stop Token

17 of 102

Problem: generating high quality frame in one step is hard

Encoder

Decoder

Location

Sensitive Attention

/ p rr i vv ee t /

Generate mel-spectrogram sequentially frame by frame

Stop Token

Low-quality sound

18 of 102

Duct tape #3: Pre-Net and Post-Net

Encoder

Pre-Net

Location

Sensitive Attention

/ p rr i vv ee t /

Generate mel-spectrogram sequentially frame by frame

Stop Token

Low-quality sound

Post-Net

High-quality sound

19 of 102

Problem: the model copies input frame as output

Encoder

Pre-Net

Location

Sensitive Attention

/ p rr i vv ee t /

Generate mel-spectrogram sequentially frame by frame

Stop Token

Low-quality sound

Post-Net

High-quality sound

Neighbouring frames have little difference

20 of 102

Duct tape #4: intensive dropout in Pre-Net

Encoder

Pre-Net

Location

Sensitive Attention

/ p rr i vv ee t /

Generate mel-spectrogram sequentially frame by frame

Stop Token

Low-quality sound

Post-Net

High-quality sound

PreNet = Dense → ReLU → Dropout 0.5 → Dense → ReLU → Dropout 0.5

Dropout is active during training & inference
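A small sketch of such a Pre-Net with dropout forced on at inference time (dimensions are illustrative):

```python
import torch.nn as nn
import torch.nn.functional as F

class PreNet(nn.Module):
    """Sketch of the Tacotron 2 Pre-Net: two Dense+ReLU layers whose dropout
    stays active at inference time (hence F.dropout with training=True)."""
    def __init__(self, in_dim=80, hidden=256, p=0.5):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.p = p

    def forward(self, x):
        x = F.dropout(F.relu(self.fc1(x)), p=self.p, training=True)  # dropout even at inference
        x = F.dropout(F.relu(self.fc2(x)), p=self.p, training=True)
        return x
```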

21 of 102

Tacotron: original image

21

https://arxiv.org/pdf/1712.05884.pdf

22 of 102

Tacotron: original image

22

Encoder

Decoder = Pre-Net + LSTM + FF

https://arxiv.org/pdf/1712.05884.pdf

#1 Stop Token

#3 Pre-Net & Post-Net

#2 Location Sensitive Attention

#4 Dropout in Pre-Net

23 of 102

Tacotron: training objectives

23

https://arxiv.org/pdf/1712.05884.pdf
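The formulas did not survive the export; as a hedged reconstruction, the Tacotron 2 objective combines MSE on the mel-spectrogram before and after the Post-Net with a binary cross-entropy stop-token loss:

```latex
% y: target mel-spectrogram; \hat{y}, \hat{y}_{post}: decoder output before / after the Post-Net
% s_t, \hat{s}_t: stop-token target and prediction
\mathcal{L} = \lVert y - \hat{y} \rVert_2^2
            + \lVert y - \hat{y}_{post} \rVert_2^2
            + \sum_t \mathrm{BCE}(s_t, \hat{s}_t)
```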

24 of 102

Limitations

Has limited capacity, so

  • May train badly if audio quality is not good enough
  • Can’t capture expressive speakers
  • May be unstable on long sentences
  • Pretraining on large data and then finetuning does not help raise synthesis quality
  • Might not be able to extract semantic info such as pauses, accents, intonation patterns.

24

25 of 102

Questions?

25

26 of 102

F0, harmonics, pitch

27 of 102

F0, harmonics

  • If x(t) is a periodic function with period T, then the fundamental frequency is defined as F0 = 1 / T
  • Integer multiples k * F0 are called harmonics

27

28 of 102

F0 contour

  • F0 can encode emotions and general intonation patterns (e.g. questions)�
  • Typical F0 range is 80 to 450 Hz
    • male voices are typically lower than those of females and children

28

29 of 102

Pitch

Pitch is defined as our perception of fundamental frequency.

Fun fact: If we remove F0 using a high-pass filter, the brain can still perceive the original pitch from the harmonics — this is called the missing fundamental effect.

Example: a voice signal with F0 = 100 Hz.

A high-pass filter removes frequencies below 450 Hz, so the lowest remaining harmonic is 500 Hz (the 5th harmonic of 100 Hz).

Despite the missing F0, humans typically still perceive the pitch as 100 Hz.

30 of 102

F0 contour: examples

30

31 of 102

When does finding F0 make sense?

F0 detection applies to "vibrating" sounds

  • Sounds with periodic patterns in the spectrum (e.g., vowels, voiced consonants)
  • These sounds exhibit clear harmonic structures.

31

Mel-spectrogram of the phrase “Я Алиса” (“I am Alice” in Russian)

Stressed "a" from "ya": harmonics (parallel lines) from vocal cord vibrations are visible

Pause

Voiceless "s" (no harmonic component, noise across all frequencies)

32 of 102

When does finding F0 make sense?

Why not in whispered speech or voiceless consonants?

  • Aperiodic signal: roughly uniform amplitudes across all FFT frequencies.
  • No clear periodicity → no detectable F0
  • An aperiodic signal ≈ a signal with a huge period → extremely small F0 (effectively undetectable).

32

Mel-spectrogram of the phrase “Я Алиса” (“I am Alice” in Russian)

Stressed "a" from "ya": harmonics (parallel lines) from vocal cord vibrations are visible

Pause

Voiceless "s" (no harmonic component, noise across all frequencies)

33 of 102

When does finding F0 make sense?

33

In whispering, the vocal cords are hardly active, so there are no harmonics (the parallel lines), even in vowels.

Speech

Whisper

34 of 102

How can we find F0?

When the signal has a clearly observable F0 (such as in vowels and some consonants), its spectrum has periodic patterns with a period of F0, as each harmonic is a multiple of F0.

34

https://flothesof.github.io/cepstrum-pitch-tracking.html

[Figure: a periodic signal with period T and F0 = 1 / T; its spectrum has peaks at k * F0 – the harmonics]

35 of 102

How can we find F0?

So it makes sense to apply FFT to the (log) spectrum – one of the peaks will correspond to 1 / F0 (FFT inverts initial units: remember how T became 1 / T).��Where to look for that peak? Within typical human pitch range: 80-450 Hz.

35

https://flothesof.github.io/cepstrum-pitch-tracking.html

[Figure: periodic signal with period T, F0 = 1 / T; spectrum with harmonics at k * F0; cepstrum peak at quefrency 1 / F0 = T]

36 of 102

Cepstrum for F0 estimation

Obtain the Cepstrum*:

  • Apply the Fourier Transform (FFT) to the log-magnitude spectrum (see the code sketch below)
  • This reveals periodic structures as peaks in the quefrency** domain.

*Cepstrum – word play with spectrum

**Quefrency – word play with frequency
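A small NumPy sketch of cepstrum-based F0 estimation for a single voiced frame. The frame must be long enough to cover at least one period of the lowest expected F0 (e.g. ≥ sr / 80 samples); function and parameter names are illustrative.

```python
import numpy as np

def cepstral_f0(frame: np.ndarray, sr: int, fmin: float = 80.0, fmax: float = 450.0) -> float:
    """Sketch: FFT -> log magnitude -> FFT again (the cepstrum), then pick the
    strongest peak whose quefrency falls inside the 80-450 Hz pitch range."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    log_mag = np.log(np.abs(spectrum) + 1e-10)
    cepstrum = np.abs(np.fft.irfft(log_mag))        # quefrency domain, indexed in samples
    # a quefrency of q samples corresponds to F0 = sr / q
    q_min, q_max = int(sr / fmax), int(sr / fmin)   # search only the human pitch range
    peak_q = q_min + np.argmax(cepstrum[q_min:q_max])
    return sr / peak_q
```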

36

https://flothesof.github.io/cepstrum-pitch-tracking.html

[Figure: periodic signal with period T, F0 = 1 / T; spectrum with harmonics at k * F0; cepstrum peak at quefrency 1 / F0 = T]

37 of 102

Cepstrum for F0 estimation

Identify the F0 Candidate:

  • Find the first prominent peak in the cepstrum within typical human pitch range: 80-450 Hz.

37

https://flothesof.github.io/cepstrum-pitch-tracking.html

[Figure: periodic signal with period T, F0 = 1 / T; spectrum with harmonics at k * F0; cepstrum peak at quefrency 1 / F0 = T]

38 of 102

Cepstrum for F0 estimation

Limitations:

  • Voiced/unvoiced detection: Needs additional steps to handle unvoiced regions.
  • Ambiguity: May produce subharmonics (e.g., detecting F0/2 instead of F0)

38

https://flothesof.github.io/cepstrum-pitch-tracking.html

[Figure: periodic signal with period T, F0 = 1 / T; spectrum with harmonics at k * F0; cepstrum peak at quefrency 1 / F0 = T]

39 of 102

Break time!

40 of 102

FastPitch

  • Introduced in 2020 by NVIDIA
  • Exhibits a significantly higher real-time factor than Tacotron when synthesizing mel-spectrograms for typical utterances
  • Like Tacotron, it does not require massive data
  • During training, it learns to predict not only the mel-spectrogram but also phoneme durations and pitch (which refers to F0); a minimal inference-flow sketch follows below

40

https://arxiv.org/pdf/2006.06873.pdf

https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch
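A hedged sketch of a FastPitch-style inference flow. All modules are hypothetical stand-ins, not the NVIDIA implementation's API; the duration predictor is assumed to output log-durations.

```python
import torch

def fastpitch_like_inference(phoneme_ids, encoder, duration_predictor,
                             pitch_predictor, pitch_embedding, decoder):
    """Sketch: encoder -> per-symbol duration & pitch -> length regulation -> decoder."""
    h = encoder(phoneme_ids)                              # (B, T_text, d)
    log_dur = duration_predictor(h)                       # (B, T_text), assumed log-durations
    durations = torch.clamp(torch.round(torch.exp(log_dur) - 1), min=0).long()
    pitch = pitch_predictor(h)                            # (B, T_text), one F0 value per symbol
    h = h + pitch_embedding(pitch.unsqueeze(-1))          # add pitch info to each symbol
    # "length regulator": repeat each symbol's hidden state durations[i] times
    frames = [h_b.repeat_interleave(d_b, dim=0) for h_b, d_b in zip(h, durations)]
    frames = torch.nn.utils.rnn.pad_sequence(frames, batch_first=True)  # (B, T_frames, d)
    mel = decoder(frames)                                 # (B, T_frames, n_mels)
    return mel, durations, pitch
```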

41 of 102

FastPitch

41

https://arxiv.org/pdf/2006.06873.pdf

https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch

42 of 102

FastPitch

42

/ p rr i vv ee t /

43 of 102

FastPitch

43

https://arxiv.org/pdf/2006.06873.pdf

https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch

/ p rr i vv ee t /

44 of 102

FastPitch

44

https://arxiv.org/pdf/2006.06873.pdf

https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch

/ p rr i vv ee t /

45 of 102

Pitch of input symbols

F0 values are averaged over every input symbol using the extracted durations d
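A small sketch of this averaging step, assuming frame-level F0 with zeros for unvoiced frames and integer per-symbol durations:

```python
import torch

def average_pitch_per_symbol(f0_frames: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """f0_frames: (T_frames,) frame-level F0 (0 for unvoiced frames);
    durations: (T_text,) integer frame counts per symbol, summing to T_frames."""
    pitch_per_symbol = []
    start = 0
    for d in durations.tolist():
        seg = f0_frames[start:start + d]
        voiced = seg[seg > 0]                     # ignore unvoiced frames within the symbol
        pitch_per_symbol.append(voiced.mean() if len(voiced) > 0 else seg.new_zeros(()))
        start += d
    return torch.stack(pitch_per_symbol)          # (T_text,) one averaged F0 per symbol
```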

45

46 of 102

Durations of input symbols

Monotonic alignment: either stay on the same row or jump diagonally.

Given such an alignment, we can get phoneme durations as the segment lengths along the alignment path.

46

47 of 102

Best alignment: Monotonic Alignment Search

Suppose we have a soft-alignment matrix between text and mel-spectrogram. We can find the most probable monotonic alignment A*:

https://arxiv.org/pdf/2005.11129.pdf

https://jaketae.github.io/study/glowtts/#monotonic-alignment-search

47

Mel-frames

48 of 102

Monotonic Alignment Search

Let Q(i, j) be the maximum log-likelihood of a monotonic path up to the (i, j)-th element. Then it can be recursively formulated as

Q(i, j) = max(Q(i-1, j-1), Q(i, j-1)) + log p(i, j)

We iteratively calculate all the values of Q up to Q(T_text, T_mel).

The values of the alignment A* are then backtracked from the end of the alignment, A*(T_mel) = T_text (see the sketch below).

48

https://arxiv.org/pdf/2005.11129.pdf

https://jaketae.github.io/study/glowtts/#monotonic-alignment-search

Calculate Q

Backtrack the best path
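A NumPy sketch of the two steps above (calculate Q, then backtrack), following the recurrence on the previous slide; array names are illustrative.

```python
import numpy as np

def monotonic_alignment_search(log_p: np.ndarray) -> np.ndarray:
    """log_p[i, j]: log-likelihood of mel frame j under text position i.
    Returns a hard alignment align[j] = i, assuming the path starts at (0, 0),
    ends at (T_text-1, T_mel-1), and at each frame either stays on the same
    text position or advances by one."""
    T_text, T_mel = log_p.shape
    Q = np.full((T_text, T_mel), -np.inf)
    Q[0, 0] = log_p[0, 0]
    for j in range(1, T_mel):
        for i in range(T_text):
            stay = Q[i, j - 1]
            advance = Q[i - 1, j - 1] if i > 0 else -np.inf
            Q[i, j] = max(stay, advance) + log_p[i, j]
    # backtrack the best path from the end of the alignment
    align = np.zeros(T_mel, dtype=np.int64)
    i = T_text - 1
    for j in range(T_mel - 1, -1, -1):
        align[j] = i
        if j > 0 and i > 0 and Q[i - 1, j - 1] >= Q[i, j - 1]:
            i -= 1
    return align
```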

49 of 102

Alignment sources

  • A Tacotron trained on the same corpus (as it is done in the paper)

  • Pretrained ASR models & MFA acoustic models (based on HMM-GMM)
    • These usually don’t give access to the actual alignment and only output durations in seconds

49

50 of 102

FastPitch vs Tacotron

FastPitch

  • Non-autoregressive
  • Training and inference are fast
  • In a basic implementation, depends on the quality of ground truth durations and pitch
  • In general, produces poorer intonation than Tacotron

Tacotron

  • Autoregressive
  • Training and inference are slow
  • Builds an alignment between text and mel-spectrogram that can be used elsewhere (e.g. for extracting durations for FastPitch)

50

51 of 102

Vocoders

  • Hi-Fi GAN

51

52 of 102

Recap: TTS pipeline

52

Text

Preprocessor

Linguistic features

Wav

End-to-end

VITS

Parametric space

Acoustic model

Tacotron,

FastPitch, GradTTS, MQ-TTS

Vocoder

WaveNet,

Hi-Fi GAN, WaveGlow, LPCNet,

Vocos

/ p rr i vv ee t /

Mel-spectrogram

Discrete tokens

https://github.com/ZhangXInFD/SpeechTokenizer

53 of 102

Recap: Dilated Convolutions

53

https://github.com/vdumoulin/conv_arithmetic/tree/master

54 of 102

Recap: Transposed Convolutions

54

https://github.com/vdumoulin/conv_arithmetic/tree/master

55 of 102

Recap: Generative Adversarial Networks

  • G learns to generate realistic data
  • D learns to differentiate between fake data generated by the generator and real data
  • Min-max training objective

55

https://github.com/yandexdataschool/speech_course/tree/2022/week_09

56 of 102

Hi-Fi GAN

  • Introduced by Kakao Enterprise in 2020
  • One of the state-of-the-art vocoders at present.
  • Trained as a Generative Adversarial Network (GAN)
  • Used across various parametric spaces (not limited to spectrograms)

56

https://arxiv.org/pdf/2010.05646.pdf

57 of 102

Hi-Fi GAN: Generator

Fully-Convolutional Architecture

  • Upsamples mel-spectrogram to raw audio via transposed convolutions
  • Multi-Receptive Field Fusion observes patterns of various lengths in parallel and aggregates the outputs from multiple residual blocks.

57

58 of 102

Multi-Period Discriminator (MPD)

  • MPD is a mixture of sub-discriminators
  • Each sub-discriminator is a stack of strided convolutional layers
  • Each sub-discriminator only accepts equally spaced samples of the input audio; the spacing is given by a period p
  • In the paper, the authors set the periods to [2, 3, 5, 7, 11] (prime numbers) to minimize overlaps between the samples seen by different sub-discriminators (see the sketch below)
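A small sketch of the period-p input trick described above: a 1D waveform is viewed as a 2D map so that the sub-discriminator's 2D strided convolutions only mix samples that are p apart. Names are illustrative.

```python
import torch
import torch.nn.functional as F

def to_period_view(wav: torch.Tensor, p: int) -> torch.Tensor:
    """Reshape a waveform (B, 1, T) into (B, 1, T/p, p) for a period-p sub-discriminator."""
    B, C, T = wav.shape
    if T % p != 0:
        wav = F.pad(wav, (0, p - T % p), mode="reflect")  # pad so length divides by p
        T = wav.shape[-1]
    return wav.view(B, C, T // p, p)   # strided 2D convolutions then act on this view

# e.g. periods [2, 3, 5, 7, 11] (primes) minimize the overlap between
# the sample subsets seen by different sub-discriminators
```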

58

https://anwarvic.github.io/speech-synthesis/HiFi-GAN

59 of 102

Multi-Scale Discriminator (MSD)

  • Because each sub-discriminator in MPD only accepts disjoint samples, the authors add MSD to consecutively evaluate the audio sequence
  • MSD has three sub-discriminators working on different scales: raw audio, 2× average-pooled audio, and 4× average-pooled audio
  • Each sub-discriminator in MSD uses stacked convolutional layers with leaky ReLU activations

59

60 of 102

Losses

  • x – ground truth waveform

  • s – input mel-spectrogram

  • D_k is a single (sub-)discriminator from the MPD + MSD ensemble
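Using the notation above, the HiFi-GAN objectives as given in the paper (least-squares adversarial losses plus mel-spectrogram and feature-matching terms) can be written as:

```latex
% Least-squares adversarial losses for one sub-discriminator D_k
L_{Adv}(D_k; G) = \mathbb{E}_{(x,s)}\!\left[(D_k(x)-1)^2 + D_k(G(s))^2\right]
L_{Adv}(G; D_k) = \mathbb{E}_{s}\!\left[(D_k(G(s))-1)^2\right]

% Mel-spectrogram loss (\phi = mel-spectrogram transform) and feature matching
L_{Mel}(G) = \mathbb{E}_{(x,s)}\!\left[\lVert \phi(x) - \phi(G(s)) \rVert_1\right]
L_{FM}(G; D_k) = \mathbb{E}_{(x,s)}\!\left[\sum_{i=1}^{T} \frac{1}{N_i}\lVert D_k^{i}(x) - D_k^{i}(G(s)) \rVert_1\right]

% Overall losses over the MPD + MSD ensemble (\lambda_{fm} = 2, \lambda_{mel} = 45 in the paper)
L_G = \sum_k \left[ L_{Adv}(G; D_k) + \lambda_{fm} L_{FM}(G; D_k) \right] + \lambda_{mel} L_{Mel}(G)
L_D = \sum_k L_{Adv}(D_k; G)
```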

60

61 of 102

The overall loss

61

62 of 102

62

63 of 102

Additional slides

64 of 102

Tacotron: decoder & attention, detailed diagram

64


(encoder states)

65 of 102

Tacotron: decoder & attention - formulas

Attention is modeled by an RNN

A convolution over the previous attention weights is used as an additional input when computing the new weights
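The formulas themselves were lost in the export; as given in the cited paper (Attention-Based Models for Speech Recognition), they are roughly:

```latex
% Location features: convolve the previous attention weights \alpha_{i-1}
f_i = F * \alpha_{i-1}

% Energies combine the decoder state s_{i-1}, encoder outputs h_j and location features f_{i,j}
e_{i,j} = w^{\top} \tanh\!\left(W s_{i-1} + V h_j + U f_{i,j} + b\right)

% New attention weights and context vector
\alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{j'} \exp(e_{i,j'})}, \qquad
c_i = \sum_j \alpha_{i,j} h_j
```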

* Attention-Based Models for Speech Recognition

https://arxiv.org/abs/1803.09017

66 of 102

Location-sensitive attention: not exactly monotonic

  • Attention becomes more stable.
  • However, instabilities occur, especially in long phrases.
  • Sometimes the model looks at strange places in the phrase (rather than the one being voiced at the moment).
  • There are approaches enforcing monotonicity in attention so the model looks where it's currently reading, thus enhancing synthesis stability (*).

66

(*) Some papers on the topic: https://arxiv.org/pdf/1704.00784.pdf, https://bshall.github.io/Tacotron/

67 of 102

Formants

The peaks of the spectral envelope are called formants:

67

https://wiki.aalto.fi/pages/viewpage.action?pageId=149890776

68 of 102

Formants

68

https://www.researchgate.net/figure/Spectrograms-of-the-vowels-i-o-and-u-international-phonetic-symbols_fig2_277131520

Adult male

Adult female

69 of 102

Augmentations for acoustic models

70 of 102

Trainable alignment for FastPitch

  • Having an external alignment may not always be an option
  • External alignments may not be good enough for the model’s performance

So, an alignment can be trained jointly with FastPitch:

  • Add a trainable attention layer
  • MAS for current duration labels
  • Loss: CTC + Cross Entropy on MAS labels

70

https://arxiv.org/pdf/2108.10447.pdf

https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch#dataset-guidelines

71 of 102

Handling Multiple Speakers

Limitations of a single-speaker corpus:

  • Requires developing a new model for each new speaker
  • Challenging to collect, as Tacotron2 demands 15-20 hours of data to generate a stable and natural voice
  • Few- and zero-shot training are not supported

71

72 of 102

Handling Multiple Speakers

Just use a trainable speaker embedding.

  • This method works well enough
  • But you need to retrain the model when a new speaker comes in.

72

https://arxiv.org/pdf/1806.04558.pdf

73 of 102

Handling Multiple Speakers

Extract the speaker embedding from a pretrained model (e.g. for speaker verification).

73

https://arxiv.org/pdf/1806.04558.pdf

74 of 102

Prosody

  • Prosody: intonation, stress, and rhythm
  • A text can be pronounced with different prosodies

Problems with training on intonation-rich data:

  • A model may have difficulty learning and converging because it's trying to extract intonations from text
  • MSE loss pushes the model toward averaging the prosodies in the training data, which results in “robotic” synthesis intonation

Possible solution: let’s try to encode prosody as a vector hint – “style embedding”

74

75 of 102

Style embedding

We can model style embedding as a weighted sum of randomly initialized embeddings – Global Style Tokens (GST).

The idea:

  • compress the initial audio into a style embedding – but that would be target leakage
  • to reduce the leakage: keep the number of GSTs small

75

76 of 102

Global Style Tokens: Training

The reference encoder compresses the audio signal into a fixed-length vector called the reference embedding.

The reference embedding is used as the query vector to an attention module; the GSTs are used as values.

The attention module outputs the weighted sum of GSTs: style embedding.

It is then passed to the decoder along with the encoder outputs.
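A minimal sketch of such a style token layer (single-head attention for brevity; the paper uses multi-head attention, and the dimensions here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GSTLayer(nn.Module):
    """Sketch of a Global Style Token layer: the reference embedding attends over
    a small bank of trainable token embeddings; the weighted sum is the style embedding."""
    def __init__(self, ref_dim=128, token_dim=256, n_tokens=10):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, token_dim) * 0.3)
        self.query_proj = nn.Linear(ref_dim, token_dim)

    def forward(self, ref_embedding):                      # (B, ref_dim)
        q = self.query_proj(ref_embedding)                 # (B, token_dim)
        keys = torch.tanh(self.tokens)                     # (n_tokens, token_dim)
        scores = q @ keys.t() / keys.shape[-1] ** 0.5      # (B, n_tokens)
        weights = F.softmax(scores, dim=-1)                # attention over the tokens
        style_embedding = weights @ keys                   # (B, token_dim) weighted sum of GSTs
        return style_embedding, weights
```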

76

77 of 102

Global Style Tokens: Training

The style token layer is jointly trained with the rest of the model, driven only by the reconstruction loss from the model.

GSTs thus do not require any explicit style or prosody labels.

77

78 of 102

Global Style Tokens: Inference

We can directly condition the text encoder on certain tokens:

Style 1:

Style 2:

Style 3:

78

https://arxiv.org/abs/1803.09017

https://google.github.io/tacotron/publications/global_style_tokens/

79 of 102

Global Style Tokens: Inference

We can feed a different audio signal to achieve style transfer:

Source:

Baseline Tacotron:

GST Tacotron:

79

https://arxiv.org/abs/1803.09017

https://google.github.io/tacotron/publications/global_style_tokens/

80 of 102

Visualizations

  • We can calculate the GST-based style embedding for each sample in the dataset
  • Cluster them using KMeans
  • And visualize them using t-SNE

80

81 of 102

Inference Problems

Approach 1 - use a single GST:

  • The number of GST vectors is a hyperparameter
  • There is no guarantee on what is really learned by each of the tokens
  • For instance, tokens might not solely represent intonation but could also be associated with audio quality, duration, background noises, breathing

81

82 of 102

Inference Problems

Approach 2 - Style transfer from a single audio:

  • The model doesn't work properly when the source phrase has a different length than the text that needs to be synthesized.
  • There might not be any “ideal” phrase in the dataset

82

83 of 102

TP-GST (Text-Predicted GST)

Estimate the style embedding from text by predicting either:

83

https://arxiv.org/pdf/1808.01410.pdf

  • the attention (combination) weights over the GSTs (TPCW)
  • the style embedding directly (TPSE)

84 of 102

TP-GST (Text-Predicted GST)

84

[Diagram] Reference encoder (training): Mel GT → Conv → GRU → Attention over GSTs → style embedding.

GST Estimator (text-predicted): Encoder outputs → GRU → FC → predicted style embedding.

The style embedding is combined (+) with the encoder outputs and passed to the Attention & Decoder.

85 of 102

TP-GST (Text-Predicted GST)

The GST Estimator can be trained either

  • jointly with an acoustic model
  • or independently at a later stage.

Training jointly does not permit intonation manipulations and may lead to unintended shifts in intonation.

85

86 of 102

TP-GST (Text-Predicted GST)

Training independently at a later stage enables intonation manipulation through at least two methods:

  • Select a subset with the desired intonation and exclusively train the GST Estimator on it.
  • Train on the entire dataset but condition on an intonation cluster number.

86

87 of 102

Semantic hints

  • Training data for AM lacks semantic depth due to its small size (usually a few hundred thousand examples, rarely millions).
  • Pre-training on larger textual datasets is used to extract richer semantics.
  • This helps the TTS model better pronounce difficult words and improves intonation in complex sentences.

87

88 of 102

Semantic hints: pre-trained LMs

Simple approach: extract word embeddings from pre-trained models and combine them with phonemes before sending to the text encoder

88

89 of 102

Semantic hints: PnG-BERT

  • Google introduced this model in 2021.
  • PnG-BERT is an extension of the original BERT model: it uses both phonemic and graphemic representations of text as input data.
  • The model can be initially trained on a large text dataset in a self-supervised manner and then fine-tuned for Text-to-Speech (TTS) tasks.

89

90 of 102

PnG-BERT: pretraining

  • Inputs are phonemes and graphemes (BPE tokens)
  • The dataset with phonemes and graphemes is derived from purely textual data — G2P can be run on texts
  • An additional word-position embedding is used by PnG-BERT, providing word-level alignment between phonemes and graphemes
  • MLM objective (like regular BERT)
  • No architectural differences between phoneme and grapheme inputs

90

91 of 102

PnG-BERT: during AM training

  • Replace the encoder during Tacotron (Tako) training; only the phonemic outputs are utilized (yet both phonemes and graphemes are fed into BERT)
  • Fine-tune the last several layers of PnG-BERT along with the AM.

91

92 of 102

Limitations

  • The number of clusters is a hyperparameter
  • TP-GST does not solve the problem of GST learning something meaningless, so there will be plenty of ‘trash clusters’
    • some of them will contain only low-quality audio
    • ‘breathing’ clusters
    • clusters that memorized the day of recording
  • So the choice of a good GST cluster is a manual and subjective process

92

93 of 102

Additional slides, vocoders

93

94 of 102

WaveNet

  • Introduced by DeepMind in 2016
  • State-of-the-art (SOTA) at the time of its release
  • A generative deep convolutional neural network for producing raw audio waveforms (not necessarily a vocoder)

94

https://arxiv.org/pdf/1609.03499.pdf
https://deepmind.google/discover/blog/wavenet-a-generative-model-for-raw-audio/

95 of 102

Recap: Causal Convolutions

95

96 of 102

Mu-law encoding

WaveNet treats the waveform as a discrete signal: all amplitude values are quantized into discrete bins, and the model then predicts the index of a bin.

Audio samples are quantized using mu-law encoding:

  • lower amplitudes occur more often than higher ones in human speech, so they get finer quantization
  • one audio sample is quantized into 256 bins (instead of 65536 bins for a 16-bit integer); see the sketch below
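A NumPy sketch of mu-law quantization and its inverse for waveforms normalized to [-1, 1] (function names are illustrative):

```python
import numpy as np

def mu_law_quantize(x: np.ndarray, mu: int = 255) -> np.ndarray:
    """Compress x in [-1, 1] (finer resolution at low amplitudes) and bucket into mu+1 = 256 bins."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # still in [-1, 1]
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)           # bin index in [0, 255]

def mu_law_expand(bins: np.ndarray, mu: int = 255) -> np.ndarray:
    """Inverse transform, used after the model predicts a bin index."""
    y = 2 * bins.astype(np.float64) / mu - 1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu
```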

96

So the model is trained with a cross-entropy loss over the bins instead of an MSE regression loss.

97 of 102

WaveNet: Overview

97

https://github.com/yandexdataschool/speech_course/tree/2022/week_09

98 of 102

WaveNet Data-flow Graph

98

(the first r samples are just zeros)

  • Causal nature: does not look into the future
  • Exponential dilation increase: dilation grows exponentially: 1, 2, 4, 8, …
  • Kernel size: typically 2, though 3 is also feasible
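A PyTorch sketch of such a stack of dilated causal convolutions (gated activations and skip connections omitted for brevity; names and sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalConvStack(nn.Module):
    """Sketch of a WaveNet-style stack: kernel size 2, dilations 1, 2, 4, 8, ...
    Causality is enforced by left-padding each convolution so that output t
    never depends on inputs after t."""
    def __init__(self, channels=64, n_layers=8, kernel_size=2):
        super().__init__()
        self.layers = nn.ModuleList()
        self.pads = []
        for i in range(n_layers):
            dilation = 2 ** i
            self.pads.append((kernel_size - 1) * dilation)
            self.layers.append(nn.Conv1d(channels, channels, kernel_size, dilation=dilation))

    def forward(self, x):                                        # (B, channels, T)
        for pad, conv in zip(self.pads, self.layers):
            residual = x
            x = torch.relu(conv(F.pad(x, (pad, 0))))             # left-pad only -> causal
            x = x + residual                                     # simple residual connection
        return x
```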

99 of 102

WaveNet Output

99

https://github.com/yandexdataschool/speech_course/tree/2022/week_09

100 of 102

WaveNet Conditioning

  • Time Resolution Discrepancy: Signal and Spectrogram have varying time resolutions.
  • Upsampling Layer: Transposed Convolution.

100

https://github.com/yandexdataschool/speech_course/tree/2022/week_09

https://github.com/vdumoulin/conv_arithmetic/tree/master

101 of 102

WaveNet Pros and Cons

Pros

  • Implementation is straightforward
  • Convergence during training is quick and consistent.
  • Generated audio closely resembles the original

Cons

  • Inference is difficult to parallelize (autoregressive, sample by sample)
  • The model may perform worse on spectrograms produced by the acoustic model; it is necessary to train or fine-tune on generated spectrograms.

101

102 of 102

WaveNet Inference

102