1 of 15

Progress Update - 16th April

Group 1 - Emotional Speech Synthesis using HMMs

2 of 15

Summary of updates as per previous timeline

  • For the HMM based approach
    • Get a thorough understanding of the nitty-gritty of the adapted HMM (By 9th March) - Done
    • Partial Implementation (By 23rd March) - Done
    • Complete Implementation (By 7th April) - In Progress
    • 1 week Buffer (7-15th April)
  • For the DL based approach
    • Understanding and implementing a vanilla neural TTS system (e.g. Tacotron) (By 12th March) - Implementation done, trying out variations in learning rate for fine-tuning
    • Implementing fine-tuned DCTTS - In Progress, pre-trained on LJSpeech
    • Implementing Lee et al. (By 7th April)
    • 1 week Buffer (7-15th April)

3 of 15

DL Approach for TTS - I

What we have done/are doing:

  • Pre-trained the DCTTS model on the LJSpeech dataset
  • Tried running vanilla Tacotron with revised hyperparameters (lr: 2e-05, with annealing; SGD optimiser) for 100k iterations
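
As a rough illustration of this optimiser setup, the sketch below shows how a low learning rate with step-wise annealing and SGD might be wired up in PyTorch; the placeholder model, the StepLR schedule, and the halving interval are assumptions for illustration, not our actual training code.

import torch

# Placeholder module standing in for the Tacotron network (illustration only).
model = torch.nn.Linear(80, 80)

# SGD optimiser with the revised learning rate from the slide (lr = 2e-05).
optimizer = torch.optim.SGD(model.parameters(), lr=2e-5)

# Assumed annealing schedule: halve the learning rate every 25k iterations.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=25_000, gamma=0.5)

for step in range(100_000):
    # loss = tacotron_loss(batch)  # hypothetical loss computation
    # loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()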

4 of 15

DL Approach for TTS - II

What we have done/are doing:

  • Fine-tuning experiments for Vanilla Tacotron
    • Fixed a bug in the code which was causing early stopping.
    • Until last time, we had not studied how this setup behaves when allowed to run for a long duration (does it improve, or continue to worsen?), so we let it run for 100k iterations this time.

Why are we doing this?

  • Up to last time we had only explored fine-tuning over shorter durations; we wanted to confirm that performance would not get better over a longer run.

5 of 15

DL Approach for TTS - III : Experiments

  • For the vanilla Tacotron model that we fine-tuned, we studied the results generated at 5 different iteration counts during fine-tuning: 1k, 25k, 50k, 100k, and 150k
  • For each saved checkpoint, we analysed the alignment plot during training. (The audio generated while training is almost always good, due to teacher forcing in the decoder.)
  • We also generated 3 different audio samples for each checkpoint, with 1 seen and 2 unseen sentences (a sketch of this sweep follows the sentence list below). Generation does not use teacher forcing, so the audio itself was analysed here.
    • Sentence 1: "Kids are talking by the door"
    • Sentence 2: "Training neural networks is very hard!"
    • Sentence 3: "Generative adversarial network or variational auto-encoder."
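
The checkpoint sweep can be summarised with the sketch below; the three callables (load_checkpoint, synthesize, save_outputs) are hypothetical placeholders for our actual inference utilities, and only the loop structure mirrors the experiment described above.

from pathlib import Path

CHECKPOINT_STEPS = [1_000, 25_000, 50_000, 100_000, 150_000]
SENTENCES = [  # 1 seen and 2 unseen sentences, as listed above
    "Kids are talking by the door",
    "Training neural networks is very hard!",
    "Generative adversarial network or variational auto-encoder.",
]

def evaluate_checkpoints(load_checkpoint, synthesize, save_outputs, ckpt_dir="checkpoints"):
    """For each saved checkpoint, synthesise the three test sentences without
    teacher forcing and store the audio together with the attention alignment."""
    for step in CHECKPOINT_STEPS:
        model = load_checkpoint(Path(ckpt_dir) / f"model_{step}.pt")
        for idx, text in enumerate(SENTENCES, start=1):
            wav, alignment = synthesize(model, text)  # free-running decoding
            save_outputs(wav, alignment, f"step{step}_sentence{idx}")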

6 of 15

Vanilla Tacotron FT Results

[Alignment plots: after 1k, 25k, 50k, 100k, and 150k iterations of fine-tuning, and with no fine-tuning]

7 of 15

Vanilla Tacotron FT Results

Observations

  • No clear inference could be drawn from the progression of the alignment plots
  • We also generated alignment plots and spectrograms for the audio samples generated at test time. They can be viewed here.
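
For context, an alignment plot of this kind can be rendered roughly as follows; this is a minimal, self-contained sketch rather than our exact plotting code (the random matrix is a stand-in for the real [decoder steps x encoder steps] attention weights).

import numpy as np
import matplotlib.pyplot as plt

def plot_alignment(alignment, path):
    """Save an attention heatmap: decoder timesteps on x, encoder timesteps on y."""
    fig, ax = plt.subplots(figsize=(6, 4))
    im = ax.imshow(alignment.T, aspect="auto", origin="lower", interpolation="none")
    ax.set_xlabel("Decoder timestep")
    ax.set_ylabel("Encoder timestep (characters)")
    fig.colorbar(im, ax=ax)
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)

# Stand-in attention matrix; a well-trained model shows a roughly diagonal band.
plot_alignment(np.random.rand(200, 60), "alignment_example.png")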

8 of 15

DL Approach for TTS - IV : DCTTS Model

  • A preprint discusses how to fine-tune models trained on larger datasets for smaller-sized data → in this case, they also focus on fine-tuning on smaller-sized emotional data
    • The preprint details the hyperparameter settings and training strategy for fine-tuning the DC-TTS model
  • DC-TTS model → similar to Tacotron, but consisting only of convolutional layers (no recurrent layers), making it faster to train. It consists of (a pipeline sketch follows this list):
    • Text2Mel maps character embeddings to the output of Mel Filter Banks (MFBs) applied to the audio signal, that is, a mel-spectrogram. We aim to fine-tune only this component on the emotional data.
    • SSRN maps the mel-spectrogram to the full-resolution spectrogram.
    • Griffin-Lim is used as a vocoder.
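
Putting the three components together, the inference path can be sketched as below; text2mel and ssrn are placeholders for the two networks (passed in as callables), the array shapes in the comments are assumptions, and only librosa.griffinlim is a real library call.

import librosa

def dctts_synthesize(text2mel, ssrn, char_ids):
    """Character ids -> coarse mel-spectrogram -> full spectrogram -> waveform."""
    # Only text2mel is fine-tuned on the emotional data; ssrn stays fixed.
    mel = text2mel(char_ids)   # assumed shape: [n_mels, T'] coarse mel-spectrogram
    mag = ssrn(mel)            # assumed shape: [1 + n_fft/2, T] magnitude spectrogram
    # Griffin-Lim iteratively estimates the phase that the two networks do not predict.
    return librosa.griffinlim(mag, n_iter=60)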

9 of 15

DL Approach for TTS - V : EMOV-DB Dataset

  • EMOV-DB is a large-scale emotional corpus of 4 speakers who read the CMU Arctic transcripts with different emotions and expressiveness levels
  • The preprint suggests fine-tuning the LJSpeech-pre-trained DCTTS on EMOV-DB.

10 of 15

DL Approach for TTS - VI : DCTTS Model

Implementation details -

  • Started off with a TensorFlow implementation of DC-TTS that is adapted for fine-tuning and also releases a pre-trained base model on LJSpeech
    • However, the released checkpoints do not match the version of the code that is made public, making them unusable.
  • Moved on to a PyTorch implementation of DC-TTS, which we are currently pre-training on the LJSpeech dataset
    • Text2Mel pre-training currently at 100k steps → the authors suggest minimum 200-250k steps for legible audio
    • SSRN pre-training currently at 50k steps

11 of 15

DL Approach for TTS - VII

Next steps:

  • For Vanilla Tacotron
    • Trying to freeze the decoder of the Tacotron model while fine-tuning (a sketch follows this list), which seems related to this paper by Google AI: https://google.github.io/tacotron/publications/semisupervised/
      • This is not exactly the same idea, so maybe we can try working with the implementations of the above paper too
    • Trying semi-teacher forcing with the Tacotron decoder (will probably have to re-train on LJSpeech for this)
  • For DCTTS
    • Complete pre-training on LJSpeech
    • Fine-tune on EMOV-DB dataset
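
A minimal sketch of the decoder-freezing idea, assuming a PyTorch Tacotron that exposes its decoder as model.decoder (attribute names vary across implementations); this illustrates only the freezing step, not the method from the Google AI paper linked above.

import torch

def freeze_decoder_and_build_optimizer(model, lr=2e-5):
    """Disable gradients for the decoder and optimise only the remaining parameters."""
    for param in model.decoder.parameters():
        param.requires_grad = False
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.SGD(trainable, lr=lr)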

12 of 15

HMM based Speech Synthesis

  • Last time we observed that, due to the small number of unique utterances in our corpus, the HMM was not generalising well (it could not utter unseen phrases clearly).
  • The proposed solution was to train the HMM on larger emotional datasets with diverse utterances.
  • However, training on local systems with larger datasets like EMOV-DB was not feasible.

13 of 15

HMM based Speech Synthesis

  • Tried to set up our HMM system on both Colab and the server
  • For the server, we have requested a Docker installation and are awaiting a reply.

14 of 15

HMM based Speech Synthesis

Next Steps

  • Shift HMM training to the server, as training on a laptop is proving infeasible
  • Try the HMM on a larger emotional corpus (like EMOV-DB) and see if the generalisability of the model improves
  • Create a bash script to allow students after us to work with HMMs in an end-to-end fashion
  • Perception surveys on sound clips generated by the models from the class (?)

15 of 15

Thank you!