1 of 21

Progress Update - 20th April

Group 1 - Emotional Speech Synthesis using HMMs

2 of 21

Summary of updates as per previous timeline

  • For the HMM based approach
    • Get a thorough understanding of the nitty-gritty of the adapted HMM - Done
    • Partial Implementation - Done
    • Complete Implementation - In Progress, have prepared a system demo

  • For the DL based approach
    • Understanding and implementing a vanilla neural TTS system (e.g. Tacotron) (by 12th March) - Implementation done; trying out freezing different components of the network while fine-tuning
    • Implementing fine-tuned DCTTS - In Progress, pre-trained on LJSpeech
    • 1 week Buffer

3 of 21

DL Approach I - Vanilla Tacotron Fine-tuning

4 of 21

DL Approach I - Vanilla Tacotron Fine-tuning

What we have done/are doing:

  • Tried running vanilla Tacotron with revised hyperparameters (lr: 2e-05, with annealing; SGD optimiser) while freezing the encoder
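
A minimal sketch of this setup in PyTorch, assuming the Tacotron implementation exposes an `encoder` sub-module; the exponential annealing schedule below is an assumption standing in for whatever annealing we actually use:

import torch

def build_finetune_optimizer(model, lr=2e-5):
    # Freeze the encoder so gradients only flow to the rest of the network.
    for p in model.encoder.parameters():
        p.requires_grad = False

    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(trainable, lr=lr)

    # Anneal the learning rate after each epoch (schedule is an assumption).
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)
    return optimizer, scheduler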

5 of 21

DL Approach I - Vanilla Tacotron Fine-tuning

What we have done/are doing:

  • Fine-tuning experiments for Vanilla Tacotron
    • Since simple learning rate variations did not show any major improvements, we decided to freeze different components of Tacotron while fine-tuning.
    • We have tried the following experiments (see the sketch at the end of this slide)
      • Freezing the Encoder and Postnet (mel-to-linear and vocoder)
      • Freezing the Encoder only (currently running)

Why are we doing this?

  • The encoder merely learns a mapping from natural language to a latent space.
  • Since LJSpeech has a richer vocabulary, we felt that fine-tuning the encoder would not make sense.
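
A minimal sketch of the two freezing configurations, assuming the Tacotron implementation exposes `encoder` and `postnet` sub-modules (the exact attribute names are assumptions):

def configure_freezing(model, experiment):
    """Freeze sub-modules of a Tacotron-style model before fine-tuning."""
    def freeze(module):
        for p in module.parameters():
            p.requires_grad = False

    if experiment == "encoder_and_postnet":
        freeze(model.encoder)
        freeze(model.postnet)   # mel-to-linear network
    elif experiment == "encoder_only":   # currently running
        freeze(model.encoder)
    else:
        raise ValueError(f"unknown experiment: {experiment}")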

6 of 21

DL Approach I - Vanilla Tacotron Fine-tuning

  • For the vanilla Tacotron model that we fine-tuned, we studied the results generated at five checkpoints during fine-tuning: 1k, 12k, 25k, 37k, and 50k iterations.
  • For each saved checkpoint, we analysed the alignment plot during training. (The audio generated while training is almost always good, due to teacher forcing in the decoder, hence we only analysed the alignment plots.)
  • We also generated 3 different audio samples for each checkpoint, with 1 seen and 2 unseen sentences. Generation does not use teacher forcing, so here we analysed the audio itself (a sketch of this evaluation loop follows below).
    • Sentence 1: "Kids are talking by the door"
    • Sentence 2: "Training neural networks is very hard!"
    • Sentence 3: "Generative adversarial network or variational auto-encoder."
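
A sketch of the evaluation loop described above; `load_model`, `synthesize`, and `save_outputs` are hypothetical stand-ins for our own inference and plotting code:

CHECKPOINT_ITERS = [1_000, 12_000, 25_000, 37_000, 50_000]
TEST_SENTENCES = [
    "Kids are talking by the door",
    "Training neural networks is very hard!",
    "Generative adversarial network or variational auto-encoder.",
]

def evaluate_checkpoints(load_model, synthesize, save_outputs):
    # Inference uses no teacher forcing, so the audio itself is informative.
    for iters in CHECKPOINT_ITERS:
        model = load_model(iters)                     # restore fine-tuned weights at this step
        for idx, text in enumerate(TEST_SENTENCES, start=1):
            wav, alignment = synthesize(model, text)  # waveform + attention alignment
            save_outputs(iters, idx, wav, alignment)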

7 of 21

Vanilla Tacotron FT Results

[Alignment plots: after 1k, 12k, 25k, 37k, and 50k fine-tuning iterations, and with no fine-tuning.]

8 of 21

Vanilla Tacotron FT Results

Observations

  • Alignment plots at train time seem to deteriorate first, and then at 50k there seems to be an improvement. Training for a longer duration might be beneficial.
  • However, the generated audio is completely nonsensical for unseen sentences throughout the fine-tuning process.
  • We also generated alignment plots and spectrograms for the audio samples generated at test time. They can be viewed here; as such, we could not derive any insight from them.

9 of 21

DL Approach II - DCTTS

10 of 21

DL Approach II - DCTTS : DCTTS Model

  • A preprint discusses how to fine-tune models trained on larger datasets for smaller-sized data → in this case, they also focus on fine-tuning on smaller-sized emotional data
    • The preprint details the hyperparameter settings and training strategy for fine-tuning the DC-TTS model
  • DC-TTS model → similar to Tacotron, but consisting of only convolutional layers (no recurrent layers), making it faster to train. It consists of (see the pipeline sketch below):
    • Text2Mel maps character embeddings to the output of Mel Filter Banks (MFBs) applied to the audio signal, that is, a mel-spectrogram. We aim to fine-tune only this component on the emotional data.
    • SSRN maps the mel-spectrogram to the full-resolution spectrogram.
    • Griffin-Lim is used as the vocoder.
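
A sketch of the DC-TTS inference pipeline described above (PyTorch tensors plus librosa's Griffin-Lim); the tensor shapes and spectrogram layout are assumptions about the implementation:

import librosa

def dctts_synthesize(text2mel, ssrn, char_ids):
    # Text2Mel: character IDs -> coarse mel-spectrogram (the part we fine-tune).
    mel = text2mel(char_ids)
    # SSRN: mel-spectrogram -> full-resolution magnitude spectrogram (kept frozen).
    mag = ssrn(mel)
    # Griffin-Lim inverts the magnitude spectrogram to a waveform (no neural vocoder).
    mag = mag.squeeze(0).detach().cpu().numpy().T   # (freq_bins, frames), assumed layout
    return librosa.griffinlim(mag)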

11 of 21

DL Approach II - DCTTS : EMOV-DB Dataset

  • EMOV-DB is a large-scale emotional corpus of 4 speakers who speak the CMU Arctic transcripts with different emotions and expressiveness levels
  • The preprint suggests fine-tuning DCTTS pre-trained on LJSpeech on EMOV-DB.

12 of 21

DL Approach II - DCTTS : DCTTS Model

Implementation details -

  • Started off with a TensorFlow implementation of DC-TTS which is adapted for fine-tuning, and which also releases a pre-trained base model on LJSpeech
    • However, there is a mismatch between the released checkpoints and the version of the code that is made public, making the pre-trained model unusable.
  • Moved on to a PyTorch implementation of DC-TTS, which we pre-trained on the LJSpeech dataset and are currently fine-tuning
    • Pre-training of both modules was completed
    • Fine-tuning was done for Text2Mel only, using the strategy described in the preprint, keeping the SSRN frozen (see the sketch below)
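
A minimal sketch of this setup in PyTorch: restore the LJSpeech-pre-trained weights, freeze SSRN, and update only Text2Mel. The checkpoint paths and optimiser settings are placeholders, not the preprint's exact values:

import torch

def setup_finetuning(text2mel, ssrn, t2m_ckpt="text2mel_ljspeech.pt",
                     ssrn_ckpt="ssrn_ljspeech.pt", lr=2e-4):
    # Restore the modules pre-trained on LJSpeech.
    text2mel.load_state_dict(torch.load(t2m_ckpt))
    ssrn.load_state_dict(torch.load(ssrn_ckpt))

    # SSRN stays frozen; only Text2Mel is updated on the emotional data.
    for p in ssrn.parameters():
        p.requires_grad = False

    return torch.optim.Adam(text2mel.parameters(), lr=lr)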

13 of 21

DL Approach II - DCTTS : DCTTS Model

Pre-training results -

Fine-tuning results - No audio is being generated (the 5-second audio files are silent); however, the attention alignment looks near perfect. (A quick diagnostic sketch follows at the end of this slide.)

Attention matrix for the sentence: "I wrote this sentence myself to test whether it works."

“Hello and welcome to the winter offering of speech recognition and understanding.”

“Hi my name is Brihi and I am working on this repository as a part of my course project.”

More samples available here
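
A quick diagnostic we could run on the "silent" outputs to check whether they are literally all zeros or just very quiet (the file path is a placeholder):

import numpy as np
import soundfile as sf

wav, sr = sf.read("finetuned_sample.wav")
peak = np.abs(wav).max()                 # 0.0 would mean a truly empty waveform
rms = np.sqrt(np.mean(wav ** 2))         # very small but non-zero means "just too quiet"
print(f"duration: {len(wav) / sr:.2f}s, peak: {peak:.6f}, rms: {rms:.6f}")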

14 of 21

DL Approaches for TTS

Challenges

  • Faults in local systems, and repairs are not possible during the lockdown
  • Vanilla Tacotron
    • Solved catastrophic forgetting, but the model is still not generalisable to unseen utterances
    • Semi-teacher forcing → would require training the whole model again on LJSpeech
  • DCTTS
    • We followed the preprint to a tee, but no audio was generated even though we obtained attention alignment plots
    • Server resources frequently get exhausted by other users, resulting in early termination of our scripts.

15 of 21

DL Approaches for TTS

Next steps:

  • For Vanilla Tacotron
    • Compiling results for remaining component freezing experiments
    • Trying semi-teacher forcing with the Tacotron decoder (will probably have to re-train on LJSpeech for this)
  • For DCTTS
    • Since the preprint's strategy did not work as-is, we are revising our training strategy by fine-tuning the SSRN module as well
    • If that does not work either, we will apply an iterative approach to fine-tuning (see the sketch below this list)
      • First fine-tune only on the neutral samples, and then fine-tune on specific emotions
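
A minimal sketch of the iterative idea: fine-tune on the neutral subset first, then on the target emotion. `train_text2mel` is a hypothetical stand-in for our existing fine-tuning routine, and `samples` is assumed to be a list of (wav_path, text, emotion) tuples:

def staged_finetune(model, samples, train_text2mel, target_emotion):
    neutral = [s for s in samples if s[2] == "neutral"]
    target = [s for s in samples if s[2] == target_emotion]
    model = train_text2mel(model, neutral)   # stage 1: neutral speech only
    model = train_text2mel(model, target)    # stage 2: the specific emotion we want
    return model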

16 of 21

HMM based Speech Synthesis

  • Last time we observed that, due to the small number of unique utterances in our corpus, the HMM was not generalising well (it could not utter unseen phrases clearly).
  • The proposed solution was to train the HMM on larger emotional datasets with diverse utterances.
  • However, training on local systems was not feasible for larger datasets like EMOV-DB.

17 of 21

HMM based Speech Synthesis

  • We still have not been able to get around the problem of training the HMM on a larger dataset with more resources.
  • However, using our previous models, we have created a system that allows a user to generate a voice sample for a given text and emotion.
  • The demo is in the form of a web application based on Node.js.
  • It has very few required dependencies, as the pre-trained models are already included, and it can be easily hosted on one's local system.

18 of 21

Demo!

19 of 21

HMM based Speech Synthesis

Next Steps

  • Shift HMM training to the server, as training on a laptop is proving infeasible
  • Try the HMM on a larger emotional corpus (like EMOV-DB) and see if the generalisability of the model increases
  • Perception surveys on sound clips generated by the models from the class (?)

20 of 21

HMM based Speech Synthesis

Challenges

  • The biggest challenge right now is to set up HMM training on the server, taking care of the dependency issues related to its setup

21 of 21

Thank you!

All samples today are available here: https://drive.google.com/open?id=1TVfzep6fqF7d9FkQR23dcNINp5TCy6Zh