1 of 20

Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

Wang et al. (ICML 2018)

2 of 20

Agenda

  • Motivation
  • Related work / SOTA
  • Model Architecture
  • Inference
  • Interpretation of GSTs
  • How do GSTs capture prosody?
  • Experiments - Style Control & Transfer
  • Experiments - Unlabelled Noisy Found Data
  • Relation to our project


3 of 20

Motivation

  • Expressive long-form speech synthesis is in its infancy.
  • Prosody is the confluence of a number of phenomena in speech, such as paralinguistic information, intonation, stress, and style.
  • Prosody must be modelled for realistic speech.
  • Prosody carries information about emotion and intent; it is difficult to define and hence to annotate.
  • Solution: add Global Style Tokens (GSTs) to Tacotron (an existing SOTA TTS system) that can learn prosodic styles in the absence of labels.
  • GSTs provide "soft", interpretable, style-based labels that enable style control.


4 of 20

Related work / SOTA

  • Most TTS systems require extensive labelling, for which data is not available.
  • Cluster-based methods cluster the training set based on acoustic features, but require carefully hand-crafted features.
  • Skerry-Ryan et al. used a reference signal to transfer prosodic features to the generated signal; this work extends that idea further.
  • Wang et al. conditioned the decoder embeddings. Their approach captures information from every frame of the reference signal rather than a complete summary, so only local features such as F0 are transferred. This work mitigates that limitation.


5 of 20

Model Architecture - I


6 of 20

Model Architecture - II


  • The reference encoder takes in reference audio (as a log-mel spectrogram) with arbitrary prosodic features and compresses it into a fixed-length embedding (see the sketch below).
  • This happens in every training iteration, and the input audio needs no prosodic labels.
        • During training, the reference audio is simply the ground-truth audio.
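A minimal PyTorch sketch of such a reference encoder, assuming the hyperparameters reported in the paper (six stride-2 conv layers followed by a 128-unit GRU); names and exact settings are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Compress a variable-length log-mel spectrogram into a fixed-length
    reference embedding: conv stack + GRU, final hidden state = embedding."""
    def __init__(self, n_mels=80, ref_dim=128):
        super().__init__()
        channels = [1, 32, 32, 64, 64, 128, 128]
        self.convs = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(channels[i], channels[i + 1], kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(channels[i + 1]),
                nn.ReLU(),
            ) for i in range(6)
        ])
        out_freq = n_mels
        for _ in range(6):                       # each stride-2 conv roughly halves the mel axis
            out_freq = (out_freq + 1) // 2
        self.gru = nn.GRU(channels[-1] * out_freq, ref_dim, batch_first=True)

    def forward(self, mel):                      # mel: (batch, frames, n_mels)
        x = self.convs(mel.unsqueeze(1))         # (batch, 128, frames', mels')
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        _, h = self.gru(x)                       # h: (1, batch, ref_dim)
        return h.squeeze(0)                      # fixed-length embedding, (batch, ref_dim)
```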

7 of 20

Model Architecture - III

  • The reference encoder's output is passed to a style token layer.
  • The style token layer computes attention weights (similarity scores) over the style tokens, which are randomly initialised embeddings.
  • The style embedding is the weighted sum of the style tokens, where the weights are the attention weights; these can be obtained with any similarity metric or attention mechanism (see the sketch below).
  • The style tokens are shared across all training examples and are trained jointly with the rest of the model, driven by the reconstruction loss of the Tacotron decoder.
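A sketch of the style token layer. The paper uses multi-head attention; this illustrative version uses single-head scaled dot-product attention for brevity, and the tanh on the tokens and the dimensions are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleTokenLayer(nn.Module):
    """Attention over a bank of randomly initialised style tokens (GSTs):
    the style embedding is the attention-weighted sum of the tokens."""
    def __init__(self, num_tokens=10, token_dim=256, ref_dim=128):
        super().__init__()
        # Shared across all training examples; trained jointly with the rest
        # of the model, driven only by the Tacotron reconstruction loss.
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
        self.query_proj = nn.Linear(ref_dim, token_dim)

    def forward(self, ref_embedding):            # ref_embedding: (batch, ref_dim)
        query = self.query_proj(ref_embedding)   # (batch, token_dim)
        # Any similarity metric / attention mechanism works; here: scaled dot product.
        scores = query @ self.tokens.t() / self.tokens.shape[1] ** 0.5
        weights = F.softmax(scores, dim=-1)      # attention weights, (batch, num_tokens)
        style_embedding = weights @ torch.tanh(self.tokens)   # (batch, token_dim)
        return style_embedding, weights
```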


8 of 20

Model Architecture - IV

  • The style embedding is passed to the text encoder to condition its output at every step.
  • The authors found that the best way to condition is to simply add the style embedding to the text encoder's output at every step, as sketched below.
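A toy snippet showing this conditioning step; the shapes are hypothetical, and the style embedding is assumed to already match the encoder dimension (otherwise a projection would be needed):

```python
import torch

text_encoder_out = torch.randn(4, 57, 256)   # (batch, text_steps, enc_dim), hypothetical
style_embedding = torch.randn(4, 256)        # one style embedding per utterance

# Condition the text encoder by adding the style embedding at every step
# (broadcast over the time axis).
conditioned = text_encoder_out + style_embedding.unsqueeze(1)
print(conditioned.shape)                     # torch.Size([4, 57, 256])
```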


9 of 20

Model Architecture - V

  • "The decoder outputs 80-channel log- mel spectrogram energies, two frames at a time, which are run through a dilated convolution network that outputs linear spectrograms."�
  • "Griffin-Lim for fast waveform reconstruction. It is straightforward to replace Griffin-Lim by a WaveNet vocoder to improve the audio fidelity"


10 of 20

Inference

  • At inference time, there are two options (sketched below):
    • Condition on a reference audio signal
    • Provide your own weights for combining the GSTs into a style embedding
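Reusing the ReferenceEncoder and StyleTokenLayer sketches from the architecture slides, the two inference paths might look like this (all values hypothetical):

```python
import torch

ref_encoder, gst_layer = ReferenceEncoder(), StyleTokenLayer()   # from the sketches above

# Option 1: condition on a reference audio signal (as a log-mel spectrogram).
ref_mel = torch.randn(1, 200, 80)                  # (batch, frames, n_mels)
style_emb, weights = gst_layer(ref_encoder(ref_mel))

# Option 2: skip the reference audio and supply the token weights by hand.
manual_weights = torch.zeros(1, 10)
manual_weights[0, 3] = 1.0                         # condition on token 3 only
style_emb = manual_weights @ torch.tanh(gst_layer.tokens)
```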


11 of 20

Interpretation of GSTs

  • Quantization
    • The reference embedding is broken down into a linear combination of basis vectors (the style tokens), as formalised below.
    • The contribution of each token can be calculated with any similarity metric.
  • Memory-augmented NN
    • The GST embeddings can be thought of as an external memory that stores style information extracted from the training data.
    • The reference signal guides memory writes during training and memory reads at inference.
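In symbols (notation is mine, matching the slides' description), the style embedding s is a convex combination of the K style tokens:

```latex
s = \sum_{i=1}^{K} w_i \, t_i,
\qquad
w_i = \operatorname{softmax}_i\!\bigl(\operatorname{sim}(r,\, t_i)\bigr),
\qquad
\sum_{i=1}^{K} w_i = 1
```

where r is the reference embedding, t_1, ..., t_K are the style tokens (basis vectors), and sim is any similarity metric (the attention score).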


12 of 20

How do GSTs capture prosody?

  • How are GSTs trained? (sketched below)
    • Training is driven by the reconstruction loss of the generated audio.
    • At training time, the reference audio is also the ground-truth audio.
    • The closer we get to the reference audio, the closer we get to the ground truth!
  • At inference time, any audio can be passed in as the reference.
    • The model was trained to minimise the loss against audio that was both the ground truth AND the reference.
    • Result: prosodic features get transferred from the reference audio!
  • I found this intuitively similar to style transfer in images, although there is a lot of hand-waving in both places.
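A simplified training-step sketch; the `model` attributes (`ref_encoder`, `gst_layer`, `text_encoder`, `decode`) are hypothetical names, and L1 stands in for the Tacotron reconstruction loss:

```python
import torch.nn.functional as F

def training_step(text, target_mel, model):
    """One simplified GST-Tacotron training step; no prosody labels anywhere."""
    # At training time the reference audio IS the ground-truth audio ...
    style_emb, _ = model.gst_layer(model.ref_encoder(target_mel))
    # ... so minimising the reconstruction loss also pushes the model to
    # reproduce whatever prosody the reference (= ground truth) carries.
    pred_mel = model.decode(model.text_encoder(text) + style_emb.unsqueeze(1))
    return F.l1_loss(pred_mel, target_mel)
```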


13 of 20


14 of 20

Experiments - Style Control and Transfer

  • Style selection
    • Three different tokens were used to generate two different sentences.
    • Some style tokens capture pitch; others capture speaking rate.
    • A style token may also capture a combination of prosodic features, e.g. pitch and speaking rate.


15 of 20

Experiments - Style Control and Transfer

  • Style Scaling
    • Increasing the weight of a particular style token at inference made the corresponding style feature more pronounced in the generated audio (see the sketch below).
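A sketch of style scaling, reusing the `gst_layer` constructed in the inference sketch; the token index, its interpretation, and the scale values are made up:

```python
import torch

base_weights = torch.zeros(1, 10)
base_weights[0, 2] = 1.0                 # suppose token 2 controls speaking rate

for scale in (0.3, 1.0, 2.0):            # scale the token's weight at inference
    style_emb = (scale * base_weights) @ torch.tanh(gst_layer.tokens)
    # synthesize(text, style_emb)        # the style grows more pronounced with scale
```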


16 of 20

Experiments - Unlabelled Noisy Found Data

  • Noisy audio samples were created and used as reference audio.
  • Different style tokens were observed to learn different aspects of the noise.
  • Some tokens also learned features of the clean speech.
  • Conditioning generation directly on these clean tokens led to cleaner generated audio.
  • The authors also experimented on multi-speaker data from TED videos.
  • The style tokens learned from the TED videos, when plotted using t-SNE (sketched below), clearly separate into clusters, each corresponding to a different speaker.
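A sketch of how such a t-SNE plot could be produced; the embeddings and speaker ids below are random stand-ins for the real per-utterance style representations and labels:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

style_embeddings = np.random.randn(500, 256)      # stand-in for learned style embeddings
speaker_ids = np.random.randint(0, 10, size=500)  # stand-in speaker labels

# Project to 2-D; in the paper's setting the points separate into per-speaker clusters.
points = TSNE(n_components=2, perplexity=30).fit_transform(style_embeddings)
plt.scatter(points[:, 0], points[:, 1], c=speaker_ids, cmap="tab10", s=8)
plt.title("t-SNE of style embeddings (colour = speaker)")
plt.show()
```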


17 of 20

Experiments - Unlabelled Noisy Found Data


  • Conditioning on specific tokens shows that different sources of noise get interpreted as different "styles".
  • Each token absorbs a different source of noise.

18 of 20

Experiments - Unlabelled Noisy Found Data


19 of 20

Doubts / Things to discuss

  • The training of the style tokens is driven only by the reconstruction loss of the encoder-decoder part.
  • The paper claims that no labelled prosodic features are needed for training, so the target audio can be any audio, not necessarily a prosody-rich sample.
  • Some reconstruction loss is therefore to be expected when we train the model, since we are adding a style condition to the encoder.
  • In this scenario, what level of reconstruction loss counts as good, i.e., when do we stop the training procedure?


20 of 20

Relation to our project

  • This idea of prosodic style transfer could be instrumental for emotional speech synthesis.
  • Which style tokens should be used for generation? Maybe a reference audio will suffice?
  • We could pass a handpicked reference audio and let the model perform non-parallel style transfer, producing audio from the text with the prosodic features of the reference.
  • Caveat: we do not know in advance which prosodic features each style token learns; we only know that each learns some combination of them. What if the prosodic features characteristic of a particular emotion are not learnt at all?
