1 of 20

Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

Wang et al. (ICML 2018)

2 of 20

Agenda

  • Motivation
  • Related work / SOTA
  • Model Architecture
  • Inference
  • Interpretation of GSTs
  • How do GSTs capture prosody?
  • Experiments - Style Control & Transfer
  • Experiments - Unlabelled Noisy Found Data
  • Relation to our project


3 of 20

Motivation

  • Expressive long-form speech synthesis is in its infancy.
  • Prosody is the confluence of a number of phenomena in speech, such as paralinguistic information, intonation, stress, and style.
  • Prosody must be modelled for realistic speech.
  • Prosody carries information about emotion and intent; it is difficult to define and hence to annotate.
  • Solution: add Global Style Tokens (GSTs) to Tacotron (an existing SOTA TTS system) that can learn prosodic styles in the absence of labels.
  • GSTs provide "soft", interpretable, style-based labels that enable style control.


4 of 20

Related work / SOTA

  • Most TTS systems require extensive labelling, for which data is not available.
  • Cluster-based methods cluster the training set based on acoustic features, but require carefully hand-crafted features.
  • Skerry-Ryan et al. used a reference signal to transfer prosodic features to the generated signal; this work extends that idea further.
  • Wang et al. conditioned the decoder embeddings. Their approach captures information from every frame of the reference signal rather than a complete summary, so only local features such as F0 are transferred. This work mitigates that limitation.


5 of 20

Model Architecture - I


6 of 20

Model Architecture - II


  • The reference encoder takes in reference audio (as a log-mel spectrogram) with arbitrary prosodic features and compresses it into a fixed-length embedding (see the sketch below).
  • This happens in every training iteration, and the input audio needs no prosodic labels.
        • During training, the reference audio is simply the ground-truth audio.
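A minimal PyTorch sketch of such a reference encoder, assuming the hyperparameters reported in the paper (six stride-2 conv layers followed by a 128-unit GRU); names and exact settings are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Compress a variable-length log-mel spectrogram into a fixed-length
    reference embedding: conv stack + GRU, final hidden state = embedding."""
    def __init__(self, n_mels=80, ref_dim=128):
        super().__init__()
        channels = [1, 32, 32, 64, 64, 128, 128]
        self.convs = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(channels[i], channels[i + 1], kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(channels[i + 1]),
                nn.ReLU(),
            ) for i in range(6)
        ])
        out_freq = n_mels
        for _ in range(6):                       # each stride-2 conv roughly halves the mel axis
            out_freq = (out_freq + 1) // 2
        self.gru = nn.GRU(channels[-1] * out_freq, ref_dim, batch_first=True)

    def forward(self, mel):                      # mel: (batch, frames, n_mels)
        x = self.convs(mel.unsqueeze(1))         # (batch, 128, frames', mels')
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        _, h = self.gru(x)                       # h: (1, batch, ref_dim)
        return h.squeeze(0)                      # fixed-length embedding, (batch, ref_dim)
```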

7 of 20

Model Architecture - III

  • The reference encoder's output is passed to a style token layer.
  • The style token layer computes attention weights (similarity scores) over the style tokens, which are randomly initialised embeddings.
  • The style embedding is the weighted sum of the style tokens, where the weights are the attention weights; these can be obtained with any similarity metric or attention mechanism (see the sketch below).
  • The style tokens are shared across all training examples and are trained jointly with the rest of the model, driven by the reconstruction loss of the Tacotron decoder.
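A sketch of the style token layer. The paper uses multi-head attention; this illustrative version uses single-head scaled dot-product attention for brevity, and the tanh on the tokens and the dimensions are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleTokenLayer(nn.Module):
    """Attention over a bank of randomly initialised style tokens (GSTs):
    the style embedding is the attention-weighted sum of the tokens."""
    def __init__(self, num_tokens=10, token_dim=256, ref_dim=128):
        super().__init__()
        # Shared across all training examples; trained jointly with the rest
        # of the model, driven only by the Tacotron reconstruction loss.
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
        self.query_proj = nn.Linear(ref_dim, token_dim)

    def forward(self, ref_embedding):            # ref_embedding: (batch, ref_dim)
        query = self.query_proj(ref_embedding)   # (batch, token_dim)
        # Any similarity metric / attention mechanism works; here: scaled dot product.
        scores = query @ self.tokens.t() / self.tokens.shape[1] ** 0.5
        weights = F.softmax(scores, dim=-1)      # attention weights, (batch, num_tokens)
        style_embedding = weights @ torch.tanh(self.tokens)   # (batch, token_dim)
        return style_embedding, weights
```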


8 of 20

Model Architecture - IV

  • The style embedding is passed to the text encoder to condition its output at every step.
  • The authors found that the best way to condition is to simply add the style embedding to the text encoder's output at every step, as sketched below.
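A toy snippet showing this conditioning step; the shapes are hypothetical, and the style embedding is assumed to already match the encoder dimension (otherwise a projection would be needed):

```python
import torch

text_encoder_out = torch.randn(4, 57, 256)   # (batch, text_steps, enc_dim), hypothetical
style_embedding = torch.randn(4, 256)        # one style embedding per utterance

# Condition the text encoder by adding the style embedding at every step
# (broadcast over the time axis).
conditioned = text_encoder_out + style_embedding.unsqueeze(1)
print(conditioned.shape)                     # torch.Size([4, 57, 256])
```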


9 of 20

Model Architecture - V

  • "The decoder outputs 80-channel log- mel spectrogram energies, two frames at a time, which are run through a dilated convolution network that outputs linear spectrograms."�
  • "Griffin-Lim for fast waveform reconstruction. It is straightforward to replace Griffin-Lim by a WaveNet vocoder to improve the audio fidelity"


10 of 20

Inference

  • At inference time, there are two options (sketched below):
    • Condition on a reference audio signal
    • Provide your own weights for combining the GSTs into a style embedding
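Reusing the ReferenceEncoder and StyleTokenLayer sketches from the architecture slides, the two inference paths might look like this (all values hypothetical):

```python
import torch

ref_encoder, gst_layer = ReferenceEncoder(), StyleTokenLayer()   # from the sketches above

# Option 1: condition on a reference audio signal (as a log-mel spectrogram).
ref_mel = torch.randn(1, 200, 80)                  # (batch, frames, n_mels)
style_emb, weights = gst_layer(ref_encoder(ref_mel))

# Option 2: skip the reference audio and supply the token weights by hand.
manual_weights = torch.zeros(1, 10)
manual_weights[0, 3] = 1.0                         # condition on token 3 only
style_emb = manual_weights @ torch.tanh(gst_layer.tokens)
```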


11 of 20

Interpretation of GSTs

  • Quantization
    • The reference embedding is broken down into a linear combination of basis vectors (the style tokens), as formalised below.
    • The contribution of each token can be calculated with any similarity metric.
  • Memory-augmented NN
    • The GST embeddings can be thought of as an external memory that stores style information extracted from the training data.
    • The reference signal guides memory writes during training and memory reads at inference.
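In symbols (notation is mine, matching the slides' description), the style embedding s is a convex combination of the K style tokens:

```latex
s = \sum_{i=1}^{K} w_i \, t_i,
\qquad
w_i = \operatorname{softmax}_i\!\bigl(\operatorname{sim}(r,\, t_i)\bigr),
\qquad
\sum_{i=1}^{K} w_i = 1
```

where r is the reference embedding, t_1, ..., t_K are the style tokens (basis vectors), and sim is any similarity metric (the attention score).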


12 of 20

How do GSTs capture prosody?

  • How are GSTs trained? (sketched below)
    • Training is driven by the reconstruction loss of the generated audio.
    • At training time, the reference audio is also the ground-truth audio.
    • The closer we get to the reference audio, the closer we get to the ground truth!
  • At inference time, any audio can be passed in as the reference.
    • The model was trained to minimise the loss against audio that was both the ground truth AND the reference.
    • Result: prosodic features get transferred from the reference audio!
  • I found this intuitively similar to style transfer in images, although there is a lot of hand-waving in both places.
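A simplified training-step sketch; the `model` attributes (`ref_encoder`, `gst_layer`, `text_encoder`, `decode`) are hypothetical names, and L1 stands in for the Tacotron reconstruction loss:

```python
import torch.nn.functional as F

def training_step(text, target_mel, model):
    """One simplified GST-Tacotron training step; no prosody labels anywhere."""
    # At training time the reference audio IS the ground-truth audio ...
    style_emb, _ = model.gst_layer(model.ref_encoder(target_mel))
    # ... so minimising the reconstruction loss also pushes the model to
    # reproduce whatever prosody the reference (= ground truth) carries.
    pred_mel = model.decode(model.text_encoder(text) + style_emb.unsqueeze(1))
    return F.l1_loss(pred_mel, target_mel)
```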


13 of 20


14 of 20

Experiments - Style Control and Transfer

  • Style selection
    • Three different tokens were used to generate two different sentences.
    • Some style tokens capture pitch; others capture speaking rate.
    • A style token may also capture a combination of prosodic features, e.g. pitch and speaking rate.


15 of 20

Experiments - Style Control and Transfer

  • Style Scaling
    • Increasing the weight of a particular style token at inference made the corresponding style feature more pronounced in the generated audio (see the sketch below).
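A sketch of style scaling, reusing the `gst_layer` constructed in the inference sketch; the token index, its interpretation, and the scale values are made up:

```python
import torch

base_weights = torch.zeros(1, 10)
base_weights[0, 2] = 1.0                 # suppose token 2 controls speaking rate

for scale in (0.3, 1.0, 2.0):            # scale the token's weight at inference
    style_emb = (scale * base_weights) @ torch.tanh(gst_layer.tokens)
    # synthesize(text, style_emb)        # the style grows more pronounced with scale
```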


16 of 20

Experiments - Unlabelled Noisy Found Data

  • Noisy audio samples were created and used as reference audio.
  • Different style tokens were observed to learn different aspects of the noise.
  • Some tokens also learned features of the clean speech.
  • Conditioning generation directly on these clean tokens led to cleaner generated audio.
  • The authors also experimented on multi-speaker data from TED videos.
  • The style tokens learned from the TED videos, when plotted using t-SNE (sketched below), clearly separate into clusters, each corresponding to a different speaker.
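A sketch of how such a t-SNE plot could be produced; the embeddings and speaker ids below are random stand-ins for the real per-utterance style representations and labels:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

style_embeddings = np.random.randn(500, 256)      # stand-in for learned style embeddings
speaker_ids = np.random.randint(0, 10, size=500)  # stand-in speaker labels

# Project to 2-D; in the paper's setting the points separate into per-speaker clusters.
points = TSNE(n_components=2, perplexity=30).fit_transform(style_embeddings)
plt.scatter(points[:, 0], points[:, 1], c=speaker_ids, cmap="tab10", s=8)
plt.title("t-SNE of style embeddings (colour = speaker)")
plt.show()
```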


17 of 20

Experiments - Unlabelled Noisy Found Data


  • Conditioning on specific tokens shows that different sources of noise get interpreted as different "styles".
  • Each token absorbs a different source of noise.

18 of 20

Experiments - Unlabelled Noisy Found Data


19 of 20

Doubts / Things to discuss

  • The training of the style tokens is driven only by the reconstruction loss of the encoder-decoder part.
  • The paper claims that no labelled prosodic features are needed for training, so the target audio can be any audio, not necessarily a prosody-rich sample.
  • Some reconstruction loss is therefore to be expected when we train the model, since we are adding a style condition to the encoder.
  • In this scenario, what level of reconstruction loss counts as good, i.e., when do we stop the training procedure?


20 of 20

Relation to our project

  • This idea of prosodic style transfer could be instrumental for emotional speech synthesis.
  • Which style tokens should be used for generation? Maybe a reference audio will suffice?
  • We could pass a handpicked reference audio and let the model perform non-parallel style transfer, producing audio from the text with the prosodic features of the reference.
  • Caveat: we do not know in advance which prosodic features each style token learns; we only know that each learns some combination of them. What if the prosodic features characteristic of a particular emotion are not learnt at all?
