1 of 15

Progress Update - 16th April

Group 1 - Emotional Speech Synthesis using HMMs

2 of 15

Summary of updates as per previous timeline

  • For the HMM based approach
    • Get a thorough understanding of the nitty-gritty of the adapted HMM (By 9th March) - Done
    • Partial Implementation (By 23rd March) - Done
    • Complete Implementation (By 7th April) - In Progress
    • 1 week Buffer (7-15th April)
  • For the DL based approach
    • Understanding and implementing a vanilla neural TTS system (e.g. Tacotron) (By 12th March) - Implementation done, trying out variations in learning rate for fine-tuning
    • Implementing fine-tuned DCTTS - In Progress, pre-trained on LJSpeech
    • Implementing Lee et al. (By 7th April)
    • 1 week Buffer (7-15th April)

3 of 15

DL Approach for TTS - I

What we have done/are doing:

  • Pre-trained the DCTTS model on the LJSpeech dataset
  • Tried running vanilla Tacotron with revised hyperparameters (lr: 2e-05, with annealing; SGD optimiser) for 100k iterations
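
As a rough illustration of this optimiser setup, the sketch below shows how a low learning rate with step-wise annealing and SGD might be wired up in PyTorch; the placeholder model, the StepLR schedule, and the halving interval are assumptions for illustration, not our actual training code.

import torch

# Placeholder module standing in for the Tacotron network (illustration only).
model = torch.nn.Linear(80, 80)

# SGD optimiser with the revised learning rate from the slide (lr = 2e-05).
optimizer = torch.optim.SGD(model.parameters(), lr=2e-5)

# Assumed annealing schedule: halve the learning rate every 25k iterations.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=25_000, gamma=0.5)

for step in range(100_000):
    # loss = tacotron_loss(batch)  # hypothetical loss computation
    # loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()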

4 of 15

DL Approach for TTS - II

What we have done/are doing:

  • Fine-tuning experiments for Vanilla Tacotron
    • Fixed a bug in the code which was causing early stopping.
    • Until last time, we had not studied how this setup behaves when allowed to run for a long duration (does it improve, or continue to worsen?), so we let it run for 100k iterations this time.

Why are we doing this?

  • Up to last time we had only explored fine-tuning over shorter durations; we wanted to confirm that performance would not get better over a longer run.

5 of 15

DL Approach for TTS - III : Experiments

  • For the vanilla Tacotron model that we fine-tuned, we studied the results generated at 5 different iteration counts during fine-tuning: 1k, 25k, 50k, 100k, and 150k
  • For each saved checkpoint, we analysed the alignment plot during training. (The audio generated while training is almost always good, due to teacher forcing in the decoder.)
  • We also generated 3 different audio samples for each checkpoint, with 1 seen and 2 unseen sentences (a sketch of this sweep follows the sentence list below). Generation does not use teacher forcing, so the audio itself was analysed here.
    • Sentence 1: "Kids are talking by the door"
    • Sentence 2: "Training neural networks is very hard!"
    • Sentence 3: "Generative adversarial network or variational auto-encoder."
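
The checkpoint sweep can be summarised with the sketch below; the three callables (load_checkpoint, synthesize, save_outputs) are hypothetical placeholders for our actual inference utilities, and only the loop structure mirrors the experiment described above.

from pathlib import Path

CHECKPOINT_STEPS = [1_000, 25_000, 50_000, 100_000, 150_000]
SENTENCES = [  # 1 seen and 2 unseen sentences, as listed above
    "Kids are talking by the door",
    "Training neural networks is very hard!",
    "Generative adversarial network or variational auto-encoder.",
]

def evaluate_checkpoints(load_checkpoint, synthesize, save_outputs, ckpt_dir="checkpoints"):
    """For each saved checkpoint, synthesise the three test sentences without
    teacher forcing and store the audio together with the attention alignment."""
    for step in CHECKPOINT_STEPS:
        model = load_checkpoint(Path(ckpt_dir) / f"model_{step}.pt")
        for idx, text in enumerate(SENTENCES, start=1):
            wav, alignment = synthesize(model, text)  # free-running decoding
            save_outputs(wav, alignment, f"step{step}_sentence{idx}")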

6 of 15

Vanilla Tacotron FT Results

[Alignment plots: after 1k, 25k, 50k, 100k, and 150k iterations of fine-tuning, and with no fine-tuning]

7 of 15

Vanilla Tacotron FT Results

Observations

  • No clear inference could be drawn from the progression of the alignment plots
  • We also generated alignment plots and spectrograms for the audio samples generated at test time. They can be viewed here.
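
For context, an alignment plot of this kind can be rendered roughly as follows; this is a minimal, self-contained sketch rather than our exact plotting code (the random matrix is a stand-in for the real [decoder steps x encoder steps] attention weights).

import numpy as np
import matplotlib.pyplot as plt

def plot_alignment(alignment, path):
    """Save an attention heatmap: decoder timesteps on x, encoder timesteps on y."""
    fig, ax = plt.subplots(figsize=(6, 4))
    im = ax.imshow(alignment.T, aspect="auto", origin="lower", interpolation="none")
    ax.set_xlabel("Decoder timestep")
    ax.set_ylabel("Encoder timestep (characters)")
    fig.colorbar(im, ax=ax)
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)

# Stand-in attention matrix; a well-trained model shows a roughly diagonal band.
plot_alignment(np.random.rand(200, 60), "alignment_example.png")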

8 of 15

DL Approach for TTS - IV : DCTTS Model

  • A preprint discusses how to fine-tune models trained on larger datasets for smaller-sized data → in this case, they also focus on fine-tuning on smaller-sized emotional data
    • The preprint details the hyperparameter settings and training strategy for fine-tuning the DC-TTS model
  • DC-TTS model → similar to Tacotron, but consisting only of convolutional layers (no recurrent layers), making it faster to train. It consists of (a pipeline sketch follows this list):
    • Text2Mel maps character embeddings to the output of Mel Filter Banks (MFBs) applied to the audio signal, that is, a mel-spectrogram. We aim to fine-tune only this component on the emotional data.
    • SSRN maps the mel-spectrogram to the full-resolution spectrogram.
    • Griffin-Lim is used as a vocoder.
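
Putting the three components together, the inference path can be sketched as below; text2mel and ssrn are placeholders for the two networks (passed in as callables), the array shapes in the comments are assumptions, and only librosa.griffinlim is a real library call.

import librosa

def dctts_synthesize(text2mel, ssrn, char_ids):
    """Character ids -> coarse mel-spectrogram -> full spectrogram -> waveform."""
    # Only text2mel is fine-tuned on the emotional data; ssrn stays fixed.
    mel = text2mel(char_ids)   # assumed shape: [n_mels, T'] coarse mel-spectrogram
    mag = ssrn(mel)            # assumed shape: [1 + n_fft/2, T] magnitude spectrogram
    # Griffin-Lim iteratively estimates the phase that the two networks do not predict.
    return librosa.griffinlim(mag, n_iter=60)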

9 of 15

DL Approach for TTS - V : EMOV-DB Dataset

  • EMOV-DB is a large-scale emotional corpus of 4 speakers who read the CMU Arctic transcripts with different emotions and expressiveness levels
  • The preprint suggests fine-tuning the LJSpeech-pre-trained DCTTS on EMOV-DB.

10 of 15

DL Approach for TTS - VI : DCTTS Model

Implementation details -

  • Started off with a TensorFlow implementation of DC-TTS that is adapted for fine-tuning and also releases a pre-trained base model on LJSpeech
    • However, the released checkpoints do not match the version of the code that is made public, making them unusable.
  • Moved on to a PyTorch implementation of DC-TTS, which we are currently pre-training on the LJSpeech dataset
    • Text2Mel pre-training currently at 100k steps → the authors suggest minimum 200-250k steps for legible audio
    • SSRN pre-training currently at 50k steps

11 of 15

DL Approach for TTS - VII

Next steps:

  • For Vanilla Tacotron
    • Trying to freeze the decoder of the Tacotron model while fine-tuning (a sketch follows this list), which seems related to this paper by Google AI: https://google.github.io/tacotron/publications/semisupervised/
      • This is not exactly the same idea, so maybe we can try working with the implementations of the above paper too
    • Trying semi-teacher forcing with the Tacotron decoder (will probably have to re-train on LJSpeech for this)
  • For DCTTS
    • Complete pre-training on LJSpeech
    • Fine-tune on EMOV-DB dataset
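
A minimal sketch of the decoder-freezing idea, assuming a PyTorch Tacotron that exposes its decoder as model.decoder (attribute names vary across implementations); this illustrates only the freezing step, not the method from the Google AI paper linked above.

import torch

def freeze_decoder_and_build_optimizer(model, lr=2e-5):
    """Disable gradients for the decoder and optimise only the remaining parameters."""
    for param in model.decoder.parameters():
        param.requires_grad = False
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.SGD(trainable, lr=lr)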

12 of 15

HMM based Speech Synthesis

  • Last time we observed that, due to the small number of unique utterances in our corpus, the HMM was not generalising well (it could not utter unseen phrases clearly).
  • The proposed solution was to train the HMM on larger emotional datasets with diverse utterances.
  • However, training on local systems with larger datasets like EMOV-DB was not feasible.

13 of 15

HMM based Speech Synthesis

  • Tried to set up our HMM system on both Colab and the server
  • For the server, we have requested a Docker installation and are awaiting a reply.

14 of 15

HMM based Speech Synthesis

Next Steps

  • Shift HMM training to the server, as training on a laptop is proving infeasible
  • Try the HMM on a larger emotional corpus (like EMOV-DB) and see if the generalisability of the model improves
  • Create a bash script to allow students after us to work with HMMs in an end-to-end fashion
  • Perception surveys on sound clips generated by the models from the class (?)

15 of 15

Thank you!