1 of 102

Lecture 8

TTS: Acoustic models and vocoders

2 of 102

Text-to-speech (TTS)

2

3 of 102

TTS is a hard generative task

3

Speech has multiple components:

Who?Speaker

How?Prosody (intonation), accent

What?Text, language

So there are many ways to generate outputs for a single text

4 of 102

TTS pipeline

4

Text

Preprocessor

Linguistic features

Wav

End-to-end

VITS

Parametric space

Acoustic model

Tacotron, FastPitch, GradTTS, MQ-TTS

Vocoder

WaveNet,

Hi-Fi GAN, WaveGlow, LPCNet,

Vocos

/ p rr i vv ee t /

Mel-spectrogram

Discrete tokens

https://github.com/ZhangXInFD/SpeechTokenizer
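A minimal sketch of the two-stage inference path shown above; `g2p`, `acoustic_model` and `vocoder` are hypothetical stand-ins (e.g. a grapheme-to-phoneme frontend, FastPitch, Hi-Fi GAN), not a specific library API:

```python
import torch

def synthesize(text: str, g2p, acoustic_model, vocoder) -> torch.Tensor:
    """Hedged sketch of the two-stage TTS pipeline: text -> phonemes -> mel -> wav."""
    phoneme_ids = torch.tensor([g2p(text)])   # (1, T_text), e.g. "привет" -> / p rr i vv ee t /
    mel = acoustic_model(phoneme_ids)         # (1, n_mels, T_frames), parametric space
    wav = vocoder(mel)                        # (1, T_samples), raw waveform
    return wav
```

The end-to-end branch (e.g. VITS) collapses the two stages into a single model.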

5 of 102

Acoustic models

for Mel-Spectrograms

  • Tacotron
  • F0, pitch
  • FastPitch

5

6 of 102

Tacotron 2 (a.k.a. just Tacotron)

  • Google developed Tacotron 2 in 2017
  • LSTM-only in the basic version (Transformers had not yet taken over the world back then)
  • There once was a Tacotron 1, but it was completely replaced by Tacotron 2
  • Does not require massive data to start producing speech (~20 hours of studio recordings may be enough for a competitive baseline)
  • Still works surprisingly well

6

https://arxiv.org/pdf/1712.05884.pdf

https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2

7 of 102

Recap: seq2seq paradigm

7

https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html

8 of 102

Recap: seq2seq paradigm

8

https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html

9 of 102

Tacotron: Ok, let’s apply seq2seq to speech synthesis!

Encoder

Decoder

Attention

/ p rr i vv ee t /

Generate mel-spectrogram sequentially frame by frame

Train with MSE loss

10 of 102

Problem: wait… When do we stop?

Encoder

Decoder

Attention

/ p rr i vv ee t /

Generate mel-spectrogram sequentially frame by frame

We don’t have an EOS token!

11 of 102

Duct tape #1: separate head for stop token

Encoder

Decoder

Attention

/ p rr i vv ee t /

Generate mel-spectrogram sequentially frame by frame

Stop Token
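A hedged sketch of what such a stop-token head can look like: a linear projection of the decoder state to a single logit, trained with binary cross-entropy against a "this is the last frame" target. Module and dimension names are illustrative only.

```python
import torch
import torch.nn as nn

class StopTokenHead(nn.Module):
    """Sketch: project the decoder state to one 'stop' logit (1 = last frame)."""
    def __init__(self, decoder_dim: int):
        super().__init__()
        self.proj = nn.Linear(decoder_dim, 1)

    def forward(self, decoder_state):                 # (B, decoder_dim)
        return self.proj(decoder_state).squeeze(-1)   # (B,) stop logits

# At inference, generation halts once the stop probability crosses a threshold:
# if torch.sigmoid(stop_logit) > 0.5: break
```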

12 of 102

Problem: Bahdanau attention is not enough

Encoder

Decoder

Attention

/ p rr i vv ee t /

Generate mel-spectrogram sequentially frame by frame

Stop Token

No positional encoding: the model forgets which symbols it has already pronounced

13 of 102

Problem: Bahdanau attention is not enough

Vanilla attention matrix elements:

Problems:

  • If we encounter the same letter in different places, Tako (Tacotron) might randomly switch between them (no positional embeddings).
  • If the sequence is long, Tako might forget where it is and go back.

13

Previous decoder output

Current encoder output

14 of 102

Duct tape #2: Location Sensitive Attention

So, the attention weights of the next step start to depend on the attention weights of the previous step.

14

https://arxiv.org/pdf/1506.07503v1.pdf

  1. Model attention with RNN:

Previous decoder output

Previous attention hidden state

Previous attention context vector

15 of 102

Duct tape #2: Location Sensitive Attention

So, the attention weights of the next step start to also depend on the “locally close” attention weights of the previous step. It's like running your finger along a line of a book while reading aloud.

15

Was influenced by r positions around j

https://arxiv.org/pdf/1506.07503v1.pdf

2. “Look at” close attention weights with CNN:

Previous attention hidden state
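A minimal PyTorch sketch of location-sensitive attention in the spirit of the cited paper: the previous attention weights are convolved and added to the usual content-based energy. Hyperparameters and layer names are illustrative, not Tacotron 2's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationSensitiveAttention(nn.Module):
    """Sketch of location-sensitive attention (Chorowski et al., 2015)."""
    def __init__(self, query_dim, key_dim, attn_dim, n_filters=32, kernel=31):
        super().__init__()
        self.query_layer = nn.Linear(query_dim, attn_dim, bias=False)
        self.key_layer = nn.Linear(key_dim, attn_dim, bias=False)
        self.location_conv = nn.Conv1d(1, n_filters, kernel, padding=kernel // 2, bias=False)
        self.location_layer = nn.Linear(n_filters, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, keys, prev_weights):
        # query: (B, query_dim) decoder state; keys: (B, T, key_dim) encoder outputs;
        # prev_weights: (B, T) attention weights from the previous decoder step
        loc = self.location_conv(prev_weights.unsqueeze(1)).transpose(1, 2)  # (B, T, n_filters)
        energies = self.v(torch.tanh(
            self.query_layer(query).unsqueeze(1)   # content term from the decoder state
            + self.key_layer(keys)                 # content term from the encoder outputs
            + self.location_layer(loc)             # location term from previous weights
        )).squeeze(-1)                             # (B, T)
        weights = F.softmax(energies, dim=-1)
        context = torch.bmm(weights.unsqueeze(1), keys).squeeze(1)           # (B, key_dim)
        return context, weights
```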

16 of 102

Duct tape #2: Location Sensitive Attention

Encoder

Decoder

Location

Sensitive Attention

/ p rr i vv ee t /

Generate mel-spectrogram sequentially frame by frame

Stop Token

17 of 102

Problem: generating high quality frame in one step is hard

Encoder

Decoder

Location

Sensitive Attention

/ p rr i vv ee t /

Generate mel-spectrogram sequentially frame by frame

Stop Token

Low-quality sound

18 of 102

Duct tape #3: Pre-Net and Post-Net

Encoder

Pre-Net

Location

Sensitive Attention

/ p rr i vv ee t /

Generate mel-spectrogram sequentially frame by frame

Stop Token

Low-quality sound

Post-Net

High-quality sound

19 of 102

Problem: the model copies input frame as output

Encoder

Pre-Net

Location

Sensitive Attention

/ p rr i vv ee t /

Generate mel-spectrogram sequentially frame by frame

Stop Token

Low-quality sound

Post-Net

High-quality sound

Neighbouring frames have little difference

20 of 102

Duct tape #4: intensive dropout in Pre-Net

Encoder

Pre-Net

Location

Sensitive Attention

/ p rr i vv ee t /

Generate mel-spectrogram sequentially frame by frame

Stop Token

Low-quality sound

Post-Net

High-quality sound

PreNet = Dense → ReLU → Dropout 0.5 → Dense → ReLU → Dropout 0.5

Dropout is active during training & inference
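A small sketch of such a Pre-Net with dropout forced on at inference time (dimensions are illustrative):

```python
import torch.nn as nn
import torch.nn.functional as F

class PreNet(nn.Module):
    """Sketch of the Tacotron 2 Pre-Net: two Dense+ReLU layers whose dropout
    stays active at inference time (hence F.dropout with training=True)."""
    def __init__(self, in_dim=80, hidden=256, p=0.5):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.p = p

    def forward(self, x):
        x = F.dropout(F.relu(self.fc1(x)), p=self.p, training=True)  # dropout even at inference
        x = F.dropout(F.relu(self.fc2(x)), p=self.p, training=True)
        return x
```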

21 of 102

Tacotron: original image

21

https://arxiv.org/pdf/1712.05884.pdf

22 of 102

Tacotron: original image

22

Encoder

Decoder = Pre-Net + LSTM + FF

https://arxiv.org/pdf/1712.05884.pdf

#1 Stop Token

#3 Pre-Net & Post-Net

#2 Location Sensitive Attention

#4 Dropout in Pre-Net

23 of 102

Tacotron: training objectives

23

https://arxiv.org/pdf/1712.05884.pdf
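The formulas did not survive the export; as a hedged reconstruction, the Tacotron 2 objective combines MSE on the mel-spectrogram before and after the Post-Net with a binary cross-entropy stop-token loss:

```latex
% y: target mel-spectrogram; \hat{y}, \hat{y}_{post}: decoder output before / after the Post-Net
% s_t, \hat{s}_t: stop-token target and prediction
\mathcal{L} = \lVert y - \hat{y} \rVert_2^2
            + \lVert y - \hat{y}_{post} \rVert_2^2
            + \sum_t \mathrm{BCE}(s_t, \hat{s}_t)
```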

24 of 102

Limitations

Has limited capacity, so

  • May train badly if audio quality is not good enough
  • Can’t capture expressive speakers
  • May be unstable on long sentences
  • Pretraining on large data and then finetuning does not help raise synthesis quality
  • Might not be able to extract semantic info such as pauses, accents, intonation patterns.

24

25 of 102

Questions?

25

26 of 102

F0, harmonics, pitch

27 of 102

F0, harmonics

  • If x(t) is a periodic function with period T, then the fundamental frequency is defined as F0 = 1 / T
  • Integer multiples k * F0 are called harmonics

27

28 of 102

F0 contour

  • F0 can encode emotions and general intonation patterns (e.g. questions)�
  • Typical F0 range is 80 to 450 Hz
    • male voices are typically lower than those of females and children

28

29 of 102

Pitch

Pitch is defined as our perception of fundamental frequency.

Fun fact: If we remove F0 using a high-pass filter, the brain can still perceive the original pitch from the harmonics — this is called the missing fundamental effect.

Example: a voice signal with F0 = 100 Hz.

A high-pass filter removes frequencies below 450 Hz, so the lowest remaining harmonic is 500 Hz (the 5th harmonic of 100 Hz).

Despite the missing F0, humans typically still perceive the pitch as 100 Hz.

30 of 102

F0 contour: examples

30

31 of 102

When does finding F0 make sense?

F0 detection applies to "vibrating" sounds

  • Sounds with periodic patterns in the spectrum (e.g., vowels, voiced consonants)
  • These sounds exhibit clear harmonic structures.

31

Mel-spectrogram of the phrase “Я Алиса” (“I am Alice” in Russian)

Stressed "a" from "ya": harmonics (parallel lines) from vocal cord vibrations are visible

Pause

Voiceless "s" (no harmonic component, noise across all frequencies)

32 of 102

When does finding F0 make sense?

Why not in whispered speech or voiceless consonants?

  • Aperiodic signal: roughly uniform amplitudes across all FFT frequencies.
  • No clear periodicity → no detectable F0
  • An aperiodic signal ≈ a signal with a huge period → extremely small F0 (effectively undetectable).

32

Mel-spectrogram of the phrase “Я Алиса” (“I am Alice” in Russian)

Stressed "a" from "ya": harmonics (parallel lines) from vocal cord vibrations are visible

Pause

Voiceless "s" (no harmonic component, noise across all frequencies)

33 of 102

When does finding F0 make sense?

33

In whispering, the vocal cords are hardly active, so there are no harmonics (the parallel lines), even in vowels.

Speech

Whisper

34 of 102

How can we find F0?

When the signal has a clearly observable F0 (such as in vowels and some consonants), its spectrum has periodic patterns with a period of F0, as each harmonic is a multiple of F0.

34

https://flothesof.github.io/cepstrum-pitch-tracking.html

[Figure: a periodic signal with period T and F0 = 1 / T; its spectrum has peaks at k * F0 – the harmonics]

35 of 102

How can we find F0?

So it makes sense to apply FFT to the (log) spectrum – one of the peaks will correspond to 1 / F0 (FFT inverts initial units: remember how T became 1 / T).��Where to look for that peak? Within typical human pitch range: 80-450 Hz.

35

https://flothesof.github.io/cepstrum-pitch-tracking.html

[Figure: periodic signal with period T, F0 = 1 / T; spectrum with harmonics at k * F0; cepstrum peak at quefrency 1 / F0 = T]

36 of 102

Cepstrum for F0 estimation

Obtain the Cepstrum*:

  • Apply the Fourier Transform (FFT) to the log-magnitude spectrum (see the code sketch below)
  • This reveals periodic structures as peaks in the quefrency** domain.

*Cepstrum – word play with spectrum

**Quefrency – word play with frequency
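A small NumPy sketch of cepstrum-based F0 estimation for a single voiced frame. The frame must be long enough to cover at least one period of the lowest expected F0 (e.g. ≥ sr / 80 samples); function and parameter names are illustrative.

```python
import numpy as np

def cepstral_f0(frame: np.ndarray, sr: int, fmin: float = 80.0, fmax: float = 450.0) -> float:
    """Sketch: FFT -> log magnitude -> FFT again (the cepstrum), then pick the
    strongest peak whose quefrency falls inside the 80-450 Hz pitch range."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    log_mag = np.log(np.abs(spectrum) + 1e-10)
    cepstrum = np.abs(np.fft.irfft(log_mag))        # quefrency domain, indexed in samples
    # a quefrency of q samples corresponds to F0 = sr / q
    q_min, q_max = int(sr / fmax), int(sr / fmin)   # search only the human pitch range
    peak_q = q_min + np.argmax(cepstrum[q_min:q_max])
    return sr / peak_q
```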

36

https://flothesof.github.io/cepstrum-pitch-tracking.html

[Figure: periodic signal with period T, F0 = 1 / T; spectrum with harmonics at k * F0; cepstrum peak at quefrency 1 / F0 = T]

37 of 102

Cepstrum for F0 estimation

Identify the F0 Candidate:

  • Find the first prominent peak in the cepstrum within typical human pitch range: 80-450 Hz.

37

https://flothesof.github.io/cepstrum-pitch-tracking.html

[Figure: periodic signal with period T, F0 = 1 / T; spectrum with harmonics at k * F0; cepstrum peak at quefrency 1 / F0 = T]

38 of 102

Cepstrum for F0 estimation

Limitations:

  • Voiced/unvoiced detection: Needs additional steps to handle unvoiced regions.
  • Ambiguity: May produce subharmonics (e.g., detecting F0/2 instead of F0)

38

https://flothesof.github.io/cepstrum-pitch-tracking.html

[Figure: periodic signal with period T, F0 = 1 / T; spectrum with harmonics at k * F0; cepstrum peak at quefrency 1 / F0 = T]

39 of 102

Break time!

40 of 102

FastPitch

  • Introduced in 2020 by NVIDIA
  • Exhibits a significantly higher real-time factor than Tacotron when synthesizing mel-spectrograms for typical utterances
  • Like Tacotron, it does not require massive data
  • During training, it learns to predict not only the mel-spectrogram but also phoneme durations and pitch (which refers to F0); a minimal inference-flow sketch follows below

40

https://arxiv.org/pdf/2006.06873.pdf

https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch
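A hedged sketch of a FastPitch-style inference flow. All modules are hypothetical stand-ins, not the NVIDIA implementation's API; the duration predictor is assumed to output log-durations.

```python
import torch

def fastpitch_like_inference(phoneme_ids, encoder, duration_predictor,
                             pitch_predictor, pitch_embedding, decoder):
    """Sketch: encoder -> per-symbol duration & pitch -> length regulation -> decoder."""
    h = encoder(phoneme_ids)                              # (B, T_text, d)
    log_dur = duration_predictor(h)                       # (B, T_text), assumed log-durations
    durations = torch.clamp(torch.round(torch.exp(log_dur) - 1), min=0).long()
    pitch = pitch_predictor(h)                            # (B, T_text), one F0 value per symbol
    h = h + pitch_embedding(pitch.unsqueeze(-1))          # add pitch info to each symbol
    # "length regulator": repeat each symbol's hidden state durations[i] times
    frames = [h_b.repeat_interleave(d_b, dim=0) for h_b, d_b in zip(h, durations)]
    frames = torch.nn.utils.rnn.pad_sequence(frames, batch_first=True)  # (B, T_frames, d)
    mel = decoder(frames)                                 # (B, T_frames, n_mels)
    return mel, durations, pitch
```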

41 of 102

FastPitch

41

https://arxiv.org/pdf/2006.06873.pdf

https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch

42 of 102

FastPitch

42

/ p rr i vv ee t /

43 of 102

FastPitch

43

https://arxiv.org/pdf/2006.06873.pdf

https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch

/ p rr i vv ee t /

44 of 102

FastPitch

44

https://arxiv.org/pdf/2006.06873.pdf

https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch

/ p rr i vv ee t /

45 of 102

Pitch of input symbols

F0 values are averaged over every input symbol using the extracted durations d
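A small sketch of this averaging step, assuming frame-level F0 with zeros for unvoiced frames and integer per-symbol durations:

```python
import torch

def average_pitch_per_symbol(f0_frames: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """f0_frames: (T_frames,) frame-level F0 (0 for unvoiced frames);
    durations: (T_text,) integer frame counts per symbol, summing to T_frames."""
    pitch_per_symbol = []
    start = 0
    for d in durations.tolist():
        seg = f0_frames[start:start + d]
        voiced = seg[seg > 0]                     # ignore unvoiced frames within the symbol
        pitch_per_symbol.append(voiced.mean() if len(voiced) > 0 else seg.new_zeros(()))
        start += d
    return torch.stack(pitch_per_symbol)          # (T_text,) one averaged F0 per symbol
```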

45

46 of 102

Durations of input symbols

Monotonic alignment: either stay on the same row or jump diagonally.

Given such an alignment, we can get phoneme durations as the segment lengths along the alignment path.

46

47 of 102

Best alignment: Monotonic Alignment Search

Suppose we have a soft-alignment matrix between text and mel-spectrogram. We can find the most probable monotonic alignment A*:

https://arxiv.org/pdf/2005.11129.pdf

https://jaketae.github.io/study/glowtts/#monotonic-alignment-search

47

Mel-frames

48 of 102

Monotonic Alignment Search

Let Q(i, j) be the maximum log-likelihood of a monotonic path up to the (i, j)-th element. Then it can be recursively formulated as

Q(i, j) = max(Q(i-1, j-1), Q(i, j-1)) + log p(i, j)

We iteratively calculate all the values of Q up to Q(T_text, T_mel).

The values of the alignment A* are then backtracked from the end of the alignment, A*(T_mel) = T_text (see the sketch below).

48

https://arxiv.org/pdf/2005.11129.pdf

https://jaketae.github.io/study/glowtts/#monotonic-alignment-search

Calculate Q

Backtrack the best path
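A NumPy sketch of the two steps above (calculate Q, then backtrack), following the recurrence on the previous slide; array names are illustrative.

```python
import numpy as np

def monotonic_alignment_search(log_p: np.ndarray) -> np.ndarray:
    """log_p[i, j]: log-likelihood of mel frame j under text position i.
    Returns a hard alignment align[j] = i, assuming the path starts at (0, 0),
    ends at (T_text-1, T_mel-1), and at each frame either stays on the same
    text position or advances by one."""
    T_text, T_mel = log_p.shape
    Q = np.full((T_text, T_mel), -np.inf)
    Q[0, 0] = log_p[0, 0]
    for j in range(1, T_mel):
        for i in range(T_text):
            stay = Q[i, j - 1]
            advance = Q[i - 1, j - 1] if i > 0 else -np.inf
            Q[i, j] = max(stay, advance) + log_p[i, j]
    # backtrack the best path from the end of the alignment
    align = np.zeros(T_mel, dtype=np.int64)
    i = T_text - 1
    for j in range(T_mel - 1, -1, -1):
        align[j] = i
        if j > 0 and i > 0 and Q[i - 1, j - 1] >= Q[i, j - 1]:
            i -= 1
    return align
```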

49 of 102

Alignment sources

  • A Tacotron trained on the same corpus (as it is done in the paper)

  • Pretrained ASR models & MFA acoustic models (based on HMM-GMM)
    • These usually don’t give access to the actual alignment and only output durations in seconds

49

50 of 102

FastPitch vs Tacotron

FastPitch

  • Non-autoregressive
  • Training and inference are fast
  • In a basic implementation, depends on the quality of ground truth durations and pitch
  • In general, produces poorer intonation than Tacotron

Tacotron

  • Autoregressive
  • Training and inference are slow
  • Builds an alignment between text and mel-spectrogram that can be used elsewhere (e.g. for extracting durations for FastPitch)

50

51 of 102

Vocoders

  • Hi-Fi GAN

51

52 of 102

Recap: TTS pipeline

52

Text

Preprocessor

Linguistic features

Wav

End-to-end

VITS

Parametric space

Acoustic model

Tacotron,

FastPitch, GradTTS, MQ-TTS

Vocoder

WaveNet,

Hi-Fi GAN, WaveGlow, LPCNet,

Vocos

/ p rr i vv ee t /

Mel-spectrogram

Discrete tokens

https://github.com/ZhangXInFD/SpeechTokenizer

53 of 102

Recap: Dilated Convolutions

53

https://github.com/vdumoulin/conv_arithmetic/tree/master

54 of 102

Recap: Transposed Convolutions

54

https://github.com/vdumoulin/conv_arithmetic/tree/master

55 of 102

Recap: Generative Adversarial Networks

  • G learns to generate realistic data
  • D learns to differentiate between fake data generated by the generator and real data
  • Min-max training objective

55

https://github.com/yandexdataschool/speech_course/tree/2022/week_09

56 of 102

Hi-Fi GAN

  • Introduced by Kakao Enterprise in 2020
  • One of the state-of-the-art vocoders at present.
  • Trained as a Generative Adversarial Network (GAN)
  • Used across various parametric spaces (not limited to spectrograms)

56

https://arxiv.org/pdf/2010.05646.pdf

57 of 102

Hi-Fi GAN: Generator

Fully-Convolutional Architecture

  • Upsamples mel-spectrogram to raw audio via transposed convolutions
  • Multi-Receptive Field Fusion observes patterns of various lengths in parallel and aggregates the outputs from multiple residual blocks.

57

58 of 102

Multi-Period Discriminator (MPD)

  • MPD is a mixture of sub-discriminators
  • Each sub-discriminator is a stack of strided convolutional layers
  • Each sub-discriminator only accepts equally spaced samples of the input audio; the spacing is given by a period p
  • In the paper, the authors set the periods to [2, 3, 5, 7, 11] (prime numbers) to minimize overlaps between the samples seen by different sub-discriminators (see the sketch below)
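A small sketch of the period-p input trick described above: a 1D waveform is viewed as a 2D map so that the sub-discriminator's 2D strided convolutions only mix samples that are p apart. Names are illustrative.

```python
import torch
import torch.nn.functional as F

def to_period_view(wav: torch.Tensor, p: int) -> torch.Tensor:
    """Reshape a waveform (B, 1, T) into (B, 1, T/p, p) for a period-p sub-discriminator."""
    B, C, T = wav.shape
    if T % p != 0:
        wav = F.pad(wav, (0, p - T % p), mode="reflect")  # pad so length divides by p
        T = wav.shape[-1]
    return wav.view(B, C, T // p, p)   # strided 2D convolutions then act on this view

# e.g. periods [2, 3, 5, 7, 11] (primes) minimize the overlap between
# the sample subsets seen by different sub-discriminators
```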

58

https://anwarvic.github.io/speech-synthesis/HiFi-GAN

59 of 102

Multi-Scale Discriminator (MSD)

  • Because each sub-discriminator in MPD only accepts disjoint samples, the authors add MSD to consecutively evaluate the audio sequence
  • MSD has three sub-discriminators working on different scales: raw audio, 2× average-pooled audio, and 4× average-pooled audio
  • Each sub-discriminator in MSD uses stacked convolutional layers with leaky ReLU activations

59

60 of 102

Losses

  • x – ground truth waveform

  • s – input mel-spectrogram

  • D_k is a single (sub-)discriminator from the MPD + MSD ensemble
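Using the notation above, the HiFi-GAN objectives as given in the paper (least-squares adversarial losses plus mel-spectrogram and feature-matching terms) can be written as:

```latex
% Least-squares adversarial losses for one sub-discriminator D_k
L_{Adv}(D_k; G) = \mathbb{E}_{(x,s)}\!\left[(D_k(x)-1)^2 + D_k(G(s))^2\right]
L_{Adv}(G; D_k) = \mathbb{E}_{s}\!\left[(D_k(G(s))-1)^2\right]

% Mel-spectrogram loss (\phi = mel-spectrogram transform) and feature matching
L_{Mel}(G) = \mathbb{E}_{(x,s)}\!\left[\lVert \phi(x) - \phi(G(s)) \rVert_1\right]
L_{FM}(G; D_k) = \mathbb{E}_{(x,s)}\!\left[\sum_{i=1}^{T} \frac{1}{N_i}\lVert D_k^{i}(x) - D_k^{i}(G(s)) \rVert_1\right]

% Overall losses over the MPD + MSD ensemble (\lambda_{fm} = 2, \lambda_{mel} = 45 in the paper)
L_G = \sum_k \left[ L_{Adv}(G; D_k) + \lambda_{fm} L_{FM}(G; D_k) \right] + \lambda_{mel} L_{Mel}(G)
L_D = \sum_k L_{Adv}(D_k; G)
```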

60

61 of 102

The overall loss

61

62 of 102

62

63 of 102

Additional slides

64 of 102

Tacotron: decoder & attention, detailed diagram

64


(encoder states)

65 of 102

Tacotron: decoder & attention - formulas

Attention is modeled by an RNN

A convolution over the previous attention weights is used as an additional input when computing the new weights
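The formulas themselves were lost in the export; as given in the cited paper (Attention-Based Models for Speech Recognition), they are roughly:

```latex
% Location features: convolve the previous attention weights \alpha_{i-1}
f_i = F * \alpha_{i-1}

% Energies combine the decoder state s_{i-1}, encoder outputs h_j and location features f_{i,j}
e_{i,j} = w^{\top} \tanh\!\left(W s_{i-1} + V h_j + U f_{i,j} + b\right)

% New attention weights and context vector
\alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{j'} \exp(e_{i,j'})}, \qquad
c_i = \sum_j \alpha_{i,j} h_j
```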

* Attention-Based Models for Speech Recognition

https://arxiv.org/abs/1803.09017

66 of 102

Location-sensitive attention: not exactly monotonic

  • Attention becomes more stable.
  • However, instabilities occur, especially in long phrases.
  • Sometimes the model looks at strange places in the phrase (rather than the one being voiced at the moment).
  • There are approaches enforcing monotonicity in attention so the model looks where it's currently reading, thus enhancing synthesis stability (*).

66

(*) Some papers on the topic: https://arxiv.org/pdf/1704.00784.pdf, https://bshall.github.io/Tacotron/

67 of 102

Formants

The peaks of the spectral envelope are called formants:

67

https://wiki.aalto.fi/pages/viewpage.action?pageId=149890776

68 of 102

Formants

68

https://www.researchgate.net/figure/Spectrograms-of-the-vowels-i-o-and-u-international-phonetic-symbols_fig2_277131520

Adult male

Adult female

69 of 102

Augmentations for acoustic models

70 of 102

Trainable alignment for FastPitch

  • Having an external alignment may not always be an option
  • External alignments may not be good enough for the model’s performance

So, an alignment can be trained jointly with FastPitch:

  • Add a trainable attention layer
  • MAS for current duration labels
  • Loss: CTC + Cross Entropy on MAS labels

70

https://arxiv.org/pdf/2108.10447.pdf

https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch#dataset-guidelines

71 of 102

Handling Multiple Speakers

Limitations of a single-speaker corpus:

  • Requires developing a new model for each new speaker
  • Challenging to collect, as Tacotron2 demands 15-20 hours of data to generate a stable and natural voice
  • Few- and zero-shot training are not supported

71

72 of 102

Handling Multiple Speakers

Just use a trainable speaker embedding.

  • This method works well enough
  • But you need to retrain the model when a new speaker comes in.

72

https://arxiv.org/pdf/1806.04558.pdf

73 of 102

Handling Multiple Speakers

Extract the speaker embedding from a pretrained model (e.g. for speaker verification).

73

https://arxiv.org/pdf/1806.04558.pdf

74 of 102

Prosody

  • Prosody: intonation, stress, and rhythm
  • A text can be pronounced with different prosodies

Problems with training on intonation-rich data:

  • A model may have difficulty learning and converging because it's trying to extract intonations from text
  • MSE loss pushes the model toward averaging the prosodies in the training data, which results in “robotic” synthesis intonation

Possible solution: let’s try to encode prosody as a vector hint – “style embedding”

74

75 of 102

Style embedding

We can model style embedding as a weighted sum of randomly initialized embeddings – Global Style Tokens (GST).

The idea:

  • compress the initial audio into a style embedding – but that would be target leakage
  • to reduce the leakage: keep the number of GSTs small

75

76 of 102

Global Style Tokens: Training

The reference encoder compresses the audio signal into a fixed-length vector called the reference embedding.

The reference embedding is used as the query vector to an attention module; the GSTs are used as values.

The attention module outputs the weighted sum of GSTs: style embedding.

It is then passed to the decoder along with the encoder outputs.
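A minimal sketch of such a style token layer (single-head attention for brevity; the paper uses multi-head attention, and the dimensions here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GSTLayer(nn.Module):
    """Sketch of a Global Style Token layer: the reference embedding attends over
    a small bank of trainable token embeddings; the weighted sum is the style embedding."""
    def __init__(self, ref_dim=128, token_dim=256, n_tokens=10):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, token_dim) * 0.3)
        self.query_proj = nn.Linear(ref_dim, token_dim)

    def forward(self, ref_embedding):                      # (B, ref_dim)
        q = self.query_proj(ref_embedding)                 # (B, token_dim)
        keys = torch.tanh(self.tokens)                     # (n_tokens, token_dim)
        scores = q @ keys.t() / keys.shape[-1] ** 0.5      # (B, n_tokens)
        weights = F.softmax(scores, dim=-1)                # attention over the tokens
        style_embedding = weights @ keys                   # (B, token_dim) weighted sum of GSTs
        return style_embedding, weights
```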

76

77 of 102

Global Style Tokens: Training

The style token layer is jointly trained with the rest of the model, driven only by the reconstruction loss from the model.

GSTs thus do not require any explicit style or prosody labels.

77

78 of 102

Global Style Tokens: Inference

We can directly condition the text encoder on certain tokens:

Style 1:

Style 2:

Style 3:

78

https://arxiv.org/abs/1803.09017

https://google.github.io/tacotron/publications/global_style_tokens/

79 of 102

Global Style Tokens: Inference

We can feed a different audio signal to achieve style transfer:

Source:

Baseline Tacotron:

GST Tacotron:

79

https://arxiv.org/abs/1803.09017

https://google.github.io/tacotron/publications/global_style_tokens/

80 of 102

Visualizations

  • We can calculate the GST-based style embedding for each sample in the dataset
  • Cluster them using KMeans
  • And visualize them using t-SNE

80

81 of 102

Inference Problems

Approach 1 - use a single GST:

  • The number of GST vectors is a hyperparameter
  • There is no guarantee on what is really learned by each of the tokens
  • For instance, tokens might not solely represent intonation but could also be associated with audio quality, duration, background noises, breathing

81

82 of 102

Inference Problems

Approach 2 - Style transfer from a single audio:

  • The model doesn't work properly when the source phrase has a different length than the text that needs to be synthesized.
  • There might not be any “ideal” phrase in the dataset

82

83 of 102

TP-GST (Text-Predicted GST)

Estimate the style embedding from text by predicting either:

83

https://arxiv.org/pdf/1808.01410.pdf

  • the attention (combination) weights over the GSTs (TPCW)
  • the style embedding directly (TPSE)

84 of 102

TP-GST (Text-Predicted GST)

84

[Diagram] Reference encoder (training): Mel GT → Conv → GRU → Attention over GSTs → style embedding.

GST Estimator (text-predicted): Encoder outputs → GRU → FC → predicted style embedding.

The style embedding is combined (+) with the encoder outputs and passed to the Attention & Decoder.

85 of 102

TP-GST (Text-Predicted GST)

The GST Estimator can be trained either

  • jointly with an acoustic model
  • or independently at a later stage.

Training jointly does not permit intonation manipulations and may lead to unintended shifts in intonation.

85

86 of 102

TP-GST (Text-Predicted GST)

Training independently at a later stage enables intonation manipulation through at least two methods:

  • Select a subset with the desired intonation and exclusively train the GST Estimator on it.
  • Train on the entire dataset but condition on an intonation cluster number.

86

87 of 102

Semantic hints

  • Training data for AM lacks semantic depth due to its small size (usually a few hundred thousand examples, rarely millions).
  • Pre-training on larger textual datasets is used to extract richer semantics.
  • This helps the TTS model better pronounce difficult words and improves intonation in complex sentences.

87

88 of 102

Semantic hints: pre-trained LMs

Simple approach: extract word embeddings from pre-trained models and combine them with phonemes before sending to the text encoder

88

89 of 102

Semantic hints: PnG-BERT

  • Google introduced this model in 2021.
  • PnG-BERT is an extension of the original BERT model: it uses both phonemic and graphemic representations of text as input data.
  • The model can be initially trained on a large text dataset in a self-supervised manner and then fine-tuned for Text-to-Speech (TTS) tasks.

89

90 of 102

PnG-BERT: pretraining

  • Inputs are phonemes and graphemes (BPE tokens)
  • The dataset with phonemes and graphemes is derived from purely textual data — G2P can be run on texts
  • An additional word-position embedding is used by PnG-BERT, providing word-level alignment between phonemes and graphemes
  • MLM objective (like regular BERT)
  • No architectural differences between phoneme and grapheme inputs

90

91 of 102

PnG-BERT: during AM training

  • Replace the encoder during Tacotron (Tako) training; only the phonemic outputs are utilized (yet both phonemes and graphemes are fed into BERT)
  • Fine-tune the last several layers of PnG-BERT along with the AM.

91

92 of 102

Limitations

  • The number of clusters is a hyperparameter
  • TP-GST does not solve the problem of GST learning something meaningless, so there will be plenty of ‘trash clusters’
    • some of them will contain only low-quality audio
    • ‘breathing’ clusters
    • clusters that memorized the day of recording
  • So the choice of a good GST cluster is a manual and subjective process

92

93 of 102

Additional slides, vocoders

93

94 of 102

WaveNet

  • Introduced by DeepMind in 2016
  • State-of-the-art (SOTA) at the time of its release
  • A generative deep convolutional neural network for producing raw audio waveforms (not necessarily a vocoder)

94

https://arxiv.org/pdf/1609.03499.pdf
https://deepmind.google/discover/blog/wavenet-a-generative-model-for-raw-audio/

95 of 102

Recap: Causal Convolutions

95

96 of 102

Mu-law encoding

WaveNet treats the waveform as a discrete signal: all amplitude values are quantized into discrete bins, and the model then predicts the index of a bin.

Audio samples are quantized using mu-law encoding:

  • lower amplitudes occur more often than higher ones in human speech, so they get finer quantization
  • one audio sample is quantized into 256 bins (instead of 65536 bins for a 16-bit integer); see the sketch below
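A NumPy sketch of mu-law quantization and its inverse for waveforms normalized to [-1, 1] (function names are illustrative):

```python
import numpy as np

def mu_law_quantize(x: np.ndarray, mu: int = 255) -> np.ndarray:
    """Compress x in [-1, 1] (finer resolution at low amplitudes) and bucket into mu+1 = 256 bins."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # still in [-1, 1]
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)           # bin index in [0, 255]

def mu_law_expand(bins: np.ndarray, mu: int = 255) -> np.ndarray:
    """Inverse transform, used after the model predicts a bin index."""
    y = 2 * bins.astype(np.float64) / mu - 1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu
```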

96

So the model is trained with a cross-entropy loss over the bins instead of an MSE regression loss.

97 of 102

WaveNet: Overview

97

https://github.com/yandexdataschool/speech_course/tree/2022/week_09

98 of 102

WaveNet Data-flow Graph

98

(the first r samples are just zeros)

  • Causal nature: does not look into the future
  • Exponential dilation increase: dilation grows exponentially: 1, 2, 4, 8, …
  • Kernel size: typically 2, though 3 is also feasible
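A PyTorch sketch of such a stack of dilated causal convolutions (gated activations and skip connections omitted for brevity; names and sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalConvStack(nn.Module):
    """Sketch of a WaveNet-style stack: kernel size 2, dilations 1, 2, 4, 8, ...
    Causality is enforced by left-padding each convolution so that output t
    never depends on inputs after t."""
    def __init__(self, channels=64, n_layers=8, kernel_size=2):
        super().__init__()
        self.layers = nn.ModuleList()
        self.pads = []
        for i in range(n_layers):
            dilation = 2 ** i
            self.pads.append((kernel_size - 1) * dilation)
            self.layers.append(nn.Conv1d(channels, channels, kernel_size, dilation=dilation))

    def forward(self, x):                                        # (B, channels, T)
        for pad, conv in zip(self.pads, self.layers):
            residual = x
            x = torch.relu(conv(F.pad(x, (pad, 0))))             # left-pad only -> causal
            x = x + residual                                     # simple residual connection
        return x
```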

99 of 102

WaveNet Output

99

https://github.com/yandexdataschool/speech_course/tree/2022/week_09

100 of 102

WaveNet Conditioning

  • Time Resolution Discrepancy: Signal and Spectrogram have varying time resolutions.
  • Upsampling Layer: Transposed Convolution.

100

https://github.com/yandexdataschool/speech_course/tree/2022/week_09

https://github.com/vdumoulin/conv_arithmetic/tree/master

101 of 102

WaveNet Pros and Cons

Pros

  • Implementation is straightforward
  • Convergence during training is quick and consistent.
  • Generated audio closely resembles the original

Cons

  • Inference is difficult to parallelize (autoregressive, sample by sample)
  • The model may perform worse on spectrograms produced by the acoustic model; it is necessary to train or fine-tune on generated spectrograms.

101

102 of 102

WaveNet Inference

102