1 of 21

Progress Update - 20th April

Group 1 - Emotional Speech Synthesis using HMMs

2 of 21

Summary of updates as per previous timeline

  • For the HMM based approach
    • Get a thorough understanding of the nitty-gritty of the adapted HMM - Done
    • Partial Implementation - Done
    • Complete Implementation - In Progress, have prepared a system demo

  • For the DL based approach
    • Understanding and implementing a vanilla neural TTS system (e.g. Tacotron) (by 12th March) - Implementation done; trying out freezing different components of the network while fine-tuning
    • Implementing fine-tuned DCTTS - In Progress, pre-trained on LJSpeech
    • 1 week Buffer

3 of 21

DL Approach I - Vanilla Tacotron Fine-tuning

4 of 21

DL Approach I - Vanilla Tacotron Fine-tuning

What we have done/are doing:

  • Tried running vanilla Tacotron with revised hyperparameters (lr: 2e-05, with annealing; SGD optimiser) while freezing the encoder
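
A minimal sketch of this setup in PyTorch, assuming the Tacotron implementation exposes an `encoder` sub-module; the exponential annealing schedule below is an assumption standing in for whatever annealing we actually use:

import torch

def build_finetune_optimizer(model, lr=2e-5):
    # Freeze the encoder so gradients only flow to the rest of the network.
    for p in model.encoder.parameters():
        p.requires_grad = False

    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(trainable, lr=lr)

    # Anneal the learning rate after each epoch (schedule is an assumption).
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)
    return optimizer, scheduler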

5 of 21

DL Approach I - Vanilla Tacotron Fine-tuning

What we have done/are doing:

  • Fine-tuning experiments for Vanilla Tacotron
    • Since simple learning rate variations did not show any major improvements, we decided to freeze different components of Tacotron while fine-tuning.
    • We have tried the following experiments (see the sketch at the end of this slide)
      • Freezing the Encoder and Postnet (mel-to-linear and vocoder)
      • Freezing the Encoder only (currently running)

Why are we doing this?

  • The encoder merely learns a mapping from natural language to a latent space.
  • Since LJSpeech has a richer vocabulary, we felt that fine-tuning the encoder would not make sense.
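
A minimal sketch of the two freezing configurations, assuming the Tacotron implementation exposes `encoder` and `postnet` sub-modules (the exact attribute names are assumptions):

def configure_freezing(model, experiment):
    """Freeze sub-modules of a Tacotron-style model before fine-tuning."""
    def freeze(module):
        for p in module.parameters():
            p.requires_grad = False

    if experiment == "encoder_and_postnet":
        freeze(model.encoder)
        freeze(model.postnet)   # mel-to-linear network
    elif experiment == "encoder_only":   # currently running
        freeze(model.encoder)
    else:
        raise ValueError(f"unknown experiment: {experiment}")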

6 of 21

DL Approach I - Vanilla Tacotron Fine-tuning

  • For the vanilla Tacotron model that we fine-tuned, we studied the results generated at five checkpoints during fine-tuning: 1k, 12k, 25k, 37k, and 50k iterations.
  • For each saved checkpoint, we analysed the alignment plot during training. (The audio generated while training is almost always good, due to teacher forcing in the decoder, hence we only analysed the alignment plots.)
  • We also generated 3 different audio samples for each checkpoint, with 1 seen and 2 unseen sentences. Generation does not use teacher forcing, so here we analysed the audio itself (a sketch of this evaluation loop follows below).
    • Sentence 1: "Kids are talking by the door"
    • Sentence 2: "Training neural networks is very hard!"
    • Sentence 3: "Generative adversarial network or variational auto-encoder."
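
A sketch of the evaluation loop described above; `load_model`, `synthesize`, and `save_outputs` are hypothetical stand-ins for our own inference and plotting code:

CHECKPOINT_ITERS = [1_000, 12_000, 25_000, 37_000, 50_000]
TEST_SENTENCES = [
    "Kids are talking by the door",
    "Training neural networks is very hard!",
    "Generative adversarial network or variational auto-encoder.",
]

def evaluate_checkpoints(load_model, synthesize, save_outputs):
    # Inference uses no teacher forcing, so the audio itself is informative.
    for iters in CHECKPOINT_ITERS:
        model = load_model(iters)                     # restore fine-tuned weights at this step
        for idx, text in enumerate(TEST_SENTENCES, start=1):
            wav, alignment = synthesize(model, text)  # waveform + attention alignment
            save_outputs(iters, idx, wav, alignment)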

7 of 21

Vanilla Tacotron FT Results

[Alignment plots: after 1k, 12k, 25k, 37k, and 50k fine-tuning iterations, and with no fine-tuning.]

8 of 21

Vanilla Tacotron FT Results

Observations

  • Alignment plots at train time seem to deteriorate first, and then at 50k there seems to be an improvement. Training for a longer duration might be beneficial.
  • However, the generated audio is completely nonsensical for unseen sentences throughout the fine-tuning process.
  • We also generated alignment plots and spectrograms for the audio samples generated at test time. They can be viewed here; as such, we could not derive any insight from them.

9 of 21

DL Approach II - DCTTS

10 of 21

DL Approach II - DCTTS : DCTTS Model

  • A preprint discusses how to fine-tune models trained on larger datasets for smaller-sized data → in this case, they also focus on fine-tuning on smaller-sized emotional data
    • The preprint details the hyperparameter settings and training strategy for fine-tuning the DC-TTS model
  • DC-TTS model → similar to Tacotron, but consisting of only convolutional layers (no recurrent layers), making it faster to train. It consists of (see the pipeline sketch below):
    • Text2Mel maps character embeddings to the output of Mel Filter Banks (MFBs) applied to the audio signal, that is, a mel-spectrogram. We aim to fine-tune only this component on the emotional data.
    • SSRN maps the mel-spectrogram to the full-resolution spectrogram.
    • Griffin-Lim is used as the vocoder.
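
A sketch of the DC-TTS inference pipeline described above (PyTorch tensors plus librosa's Griffin-Lim); the tensor shapes and spectrogram layout are assumptions about the implementation:

import librosa

def dctts_synthesize(text2mel, ssrn, char_ids):
    # Text2Mel: character IDs -> coarse mel-spectrogram (the part we fine-tune).
    mel = text2mel(char_ids)
    # SSRN: mel-spectrogram -> full-resolution magnitude spectrogram (kept frozen).
    mag = ssrn(mel)
    # Griffin-Lim inverts the magnitude spectrogram to a waveform (no neural vocoder).
    mag = mag.squeeze(0).detach().cpu().numpy().T   # (freq_bins, frames), assumed layout
    return librosa.griffinlim(mag)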

11 of 21

DL Approach II - DCTTS : EMOV-DB Dataset

  • EMOV-DB is a large-scale emotional corpus of 4 speakers who speak the CMU Arctic transcripts with different emotions and expressiveness levels
  • The preprint suggests fine-tuning DCTTS pre-trained on LJSpeech on EMOV-DB.

12 of 21

DL Approach II - DCTTS : DCTTS Model

Implementation details -

  • Started off with a TensorFlow implementation of DC-TTS which is adapted for fine-tuning, and which also releases a pre-trained base model on LJSpeech
    • However, there is a mismatch between the released checkpoints and the version of the code that is made public, making the pre-trained model unusable.
  • Moved on to a PyTorch implementation of DC-TTS, which we pre-trained on the LJSpeech dataset and are currently fine-tuning
    • Pre-training of both modules was completed
    • Fine-tuning was done for Text2Mel only, using the strategy described in the preprint, keeping the SSRN frozen (see the sketch below)
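
A minimal sketch of this setup in PyTorch: restore the LJSpeech-pre-trained weights, freeze SSRN, and update only Text2Mel. The checkpoint paths and optimiser settings are placeholders, not the preprint's exact values:

import torch

def setup_finetuning(text2mel, ssrn, t2m_ckpt="text2mel_ljspeech.pt",
                     ssrn_ckpt="ssrn_ljspeech.pt", lr=2e-4):
    # Restore the modules pre-trained on LJSpeech.
    text2mel.load_state_dict(torch.load(t2m_ckpt))
    ssrn.load_state_dict(torch.load(ssrn_ckpt))

    # SSRN stays frozen; only Text2Mel is updated on the emotional data.
    for p in ssrn.parameters():
        p.requires_grad = False

    return torch.optim.Adam(text2mel.parameters(), lr=lr)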

13 of 21

DL Approach II - DCTTS : DCTTS Model

Pre-training results -

Fine-tuning results - No audio is being generated (the 5-second audio files are silent); however, the attention alignment looks near perfect. (A quick diagnostic sketch follows at the end of this slide.)

Attention matrix for the sentence: "I wrote this sentence myself to test whether it works."

“Hello and welcome to the winter offering of speech recognition and understanding.”

“Hi my name is Brihi and I am working on this repository as a part of my course project.”

More samples available here
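
A quick diagnostic we could run on the "silent" outputs to check whether they are literally all zeros or just very quiet (the file path is a placeholder):

import numpy as np
import soundfile as sf

wav, sr = sf.read("finetuned_sample.wav")
peak = np.abs(wav).max()                 # 0.0 would mean a truly empty waveform
rms = np.sqrt(np.mean(wav ** 2))         # very small but non-zero means "just too quiet"
print(f"duration: {len(wav) / sr:.2f}s, peak: {peak:.6f}, rms: {rms:.6f}")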

14 of 21

DL Approaches for TTS

Challenges

  • Faults in local systems, and repairs are not possible during the lockdown
  • Vanilla Tacotron
    • Solved catastrophic forgetting, but the model is still not generalisable to unseen utterances
    • Semi-teacher forcing → would require training the whole model again on LJSpeech
  • DCTTS
    • We followed the preprint to a tee, but no audio was generated even though we obtained attention alignment plots
    • Server resources frequently get exhausted by other users, resulting in early termination of our scripts.

15 of 21

DL Approaches for TTS

Next steps:

  • For Vanilla Tacotron
    • Compiling results for remaining component freezing experiments
    • Trying semi-teacher forcing with the Tacotron decoder (will probably have to re-train on LJSpeech for this)
  • For DCTTS
    • Since the preprint's strategy did not work as-is, we are revising our training strategy by fine-tuning the SSRN module as well
    • If that does not work either, we will apply an iterative approach to fine-tuning (see the sketch below this list)
      • First fine-tune only on the neutral samples, and then fine-tune on specific emotions
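
A minimal sketch of the iterative idea: fine-tune on the neutral subset first, then on the target emotion. `train_text2mel` is a hypothetical stand-in for our existing fine-tuning routine, and `samples` is assumed to be a list of (wav_path, text, emotion) tuples:

def staged_finetune(model, samples, train_text2mel, target_emotion):
    neutral = [s for s in samples if s[2] == "neutral"]
    target = [s for s in samples if s[2] == target_emotion]
    model = train_text2mel(model, neutral)   # stage 1: neutral speech only
    model = train_text2mel(model, target)    # stage 2: the specific emotion we want
    return model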

16 of 21

HMM based Speech Synthesis

  • Last time we observed that, due to the small number of unique utterances in our corpus, the HMM was not generalising well (it could not utter unseen phrases clearly).
  • The proposed solution was to train the HMM on larger emotional datasets with diverse utterances.
  • However, training on local systems was not feasible for larger datasets like EMOV-DB.

17 of 21

HMM based Speech Synthesis

  • We still have not been able to get around the problem of training the HMM on a larger dataset with more resources.
  • However, using our previous models, we have created a system that allows a user to generate a voice sample for a given text and emotion.
  • The demo is in the form of a web application based on Node.js.
  • It has very few required dependencies, as the pre-trained models are already included, and it can be easily hosted on one's local system.

18 of 21

Demo!

19 of 21

HMM based Speech Synthesis

Next Steps

  • Shift HMM training to the server, as training on a laptop is proving infeasible
  • Try the HMM on a larger emotional corpus (like EMOV-DB) and see if the generalisability of the model increases
  • Perception surveys on sound clips generated by the models from the class (?)

20 of 21

HMM based Speech Synthesis

Challenges

  • The biggest challenge right now is to set up HMM training on the server, taking care of the dependency issues related to its setup

21 of 21

Thank you!

All samples today are available here: https://drive.google.com/open?id=1TVfzep6fqF7d9FkQR23dcNINp5TCy6Zh