1 of 13

Accent Transfer

Sameer Pusapaty, Patrick Wang

2 of 13

Problem statement

The goal of this project is to convert the accent of any speaker into another given accent. Specifically, we focused on converting American to British and American to Spanish.

“to-ma-to”

“to-mah-to”

3 of 13

Background Research

  • Much less work has been done on style transfer for audio than for images
  • While many papers address accent classification, very few consider accent transfer

4 of 13

Data sources

Kaggle:

  • Several thousand mp3 files
  • About 2000 different speakers
  • Same paragraph (69 words)
  • Gender not specified in label
  • Words not given separately

Forvo:

  • Several thousand mp3 files
  • Online website with API
  • Made about 500 calls daily
  • Larger variety of words, but fewer speakers per word
  • Crowdsourced so audio quality varies

5 of 13

Pre-processing: isolating words

Parsing individual words

  • Used the IBM Watson speech-to-text API
  • Identified keywords in the audio and their respective time frames
  • Cut the audio at those time frames (see the sketch below)

“Please call Stella ...
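
As a minimal sketch of this step, assuming the ibm_watson Python SDK and pydub (the credentials, URL, and file names below are placeholders):

```python
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from pydub import AudioSegment  # requires ffmpeg for mp3 support

# Placeholder credentials; substitute your own.
stt = SpeechToTextV1(authenticator=IAMAuthenticator("YOUR_API_KEY"))
stt.set_service_url("YOUR_SERVICE_URL")

# Ask Watson for per-word timestamps along with the transcript.
with open("speaker.mp3", "rb") as f:
    result = stt.recognize(audio=f, content_type="audio/mp3",
                           timestamps=True).get_result()

# Cut the recording at each word's time frame.
audio = AudioSegment.from_mp3("speaker.mp3")
for segment in result["results"]:
    # Each timestamp entry is [word, start_seconds, end_seconds].
    for word, start, end in segment["alternatives"][0]["timestamps"]:
        clip = audio[int(start * 1000):int(end * 1000)]  # pydub slices in ms
        clip.export(f"{word}.mp3", format="mp3")
```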

6 of 13

Pre-processing: aligning speech

FastDTW (Dynamic Time Warping)

  • Avoids biasing the model with different rates of speech
  • FastDTW finds the lowest-cost frame-to-frame alignment between two recordings of the same word (see the sketch below)
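
A sketch of this alignment with the fastdtw package; the array names are ours, and averaging many-to-one matches is one reasonable way to warp the target onto the source's timeline:

```python
import numpy as np
from scipy.spatial.distance import euclidean
from fastdtw import fastdtw

def align(source_mfcc, target_mfcc):
    """Warp target_mfcc (n_target x 25) onto the timeline of
    source_mfcc (n_source x 25) so both have the same frame count."""
    # fastdtw returns a total distance and a list of (i, j) pairs
    # matching source frame i to target frame j.
    _, path = fastdtw(source_mfcc, target_mfcc, dist=euclidean)
    aligned = np.zeros(source_mfcc.shape)
    counts = np.zeros(len(source_mfcc))
    for i, j in path:
        aligned[i] += target_mfcc[j]
        counts[i] += 1
    return aligned / counts[:, None]  # average frames mapped to the same i
```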

7 of 13

Pre-processing: extracting MFCC coefficients

[Diagram: the audio is split into 5 ms frames, each represented as a vector of 25 MFCC coefficients, e.g. [1, 2, 3, 4 … 25]]

Information is lost when turning sound into MFCC coefficients!
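
One way to compute these coefficients, sketched with librosa (the deck does not name a library, so this choice is an assumption; hop_length sets the frame spacing in samples, so ~5 ms at 22050 Hz is about 110 samples):

```python
import librosa

y, sr = librosa.load("word.mp3")  # librosa resamples to 22050 Hz by default
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=25, hop_length=110)
print(mfcc.shape)  # (25, n_frames): one 25-dim vector per ~5 ms frame
```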

8 of 13

Data Pipeline and Preparation

Speech samples are separated into words using the Watson API, then flow through two branches:

  • Sample accent audio files → find MFCCs → DATA
  • Target accent audio files → find MFCCs → align to the sample accent using FastDTW → LABELS
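
A sketch of how aligned MFCC sequences might be flattened into DATA/LABELS pairs, using the 63-frame input and 3-frame output sizes from the model slide; the exact windowing is not specified in the deck, so this is one plausible reading:

```python
import numpy as np

def make_pairs(source_mfcc, aligned_target_mfcc, in_frames=63, out_frames=3):
    """Slide a 63-frame window over the source (flattened to 1575 values)
    and pair it with the 3 center frames of the FastDTW-aligned target
    (flattened to 75 values)."""
    data, labels = [], []
    for start in range(len(source_mfcc) - in_frames + 1):
        window = source_mfcc[start:start + in_frames]
        center = start + in_frames // 2
        target = aligned_target_mfcc[center - out_frames // 2:
                                     center + out_frames // 2 + 1]
        data.append(window.reshape(-1))    # 63 * 25 = 1575
        labels.append(target.reshape(-1))  # 3 * 25 = 75
    return np.array(data), np.array(labels)
```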

9 of 13

Post-processing

[Diagram: overlapping predicted MFCC rows for the same time frame are averaged, e.g. averaging [1, 2, 3, 4 … 25] and [0, 0, 0, 0 … 0] gives [0.5, 1, 1.5, … 12.5], while averaging [1, 2, 3, 4 … 25], [0, 0, 0, 0 … 0], and [25, 24, 23, 22 … 1] gives roughly [9, 9, 9, 9 … 9]]

Averaging rows to “smooth” the outcome
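
A sketch of this smoothing, assuming consecutive windows are offset by one frame so each frame is predicted up to three times:

```python
import numpy as np

def smooth(predictions):
    """predictions: (n_windows, 3, 25) model outputs, each window
    predicting 3 consecutive frames. Overlapping rows for the same
    frame are averaged."""
    n_frames = len(predictions) + 2  # windows offset by one frame
    total = np.zeros((n_frames, 25))
    counts = np.zeros((n_frames, 1))
    for w, frames in enumerate(predictions):
        total[w:w + 3] += frames
        counts[w:w + 3] += 1
    return total / counts
```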

10 of 13

Model

Details:

  • The same architecture is used for both the pre-training and training models
  • Input layer: 1 × 1575 (63 frames × 25 MFCC coefficients)
  • Two hidden layers
    • 100 nodes each
    • Both use tanh activation functions
  • Output layer: 1 × 75 (3 frames × 25 MFCC coefficients)
  • All weights and biases are initialized randomly from a normal distribution (see the sketch below)

[Diagram: input layer → two hidden layers → output layer]
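
The architecture above, sketched in Keras (the deck does not name a framework, so the framework is our assumption):

```python
import tensorflow as tf

init = tf.keras.initializers.RandomNormal()  # random normal weights/biases

inputs = tf.keras.Input(shape=(1575,))  # 63 frames x 25 MFCC coefficients
x = tf.keras.layers.Dense(100, activation="tanh",
                          kernel_initializer=init, bias_initializer=init)(inputs)
x = tf.keras.layers.Dense(100, activation="tanh",
                          kernel_initializer=init, bias_initializer=init)(x)
outputs = tf.keras.layers.Dense(75, kernel_initializer=init,
                                bias_initializer=init)(x)  # 3 frames x 25 MFCCs
model = tf.keras.Model(inputs, outputs)
```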

11 of 13

Training

  • Minimized mean squared error (MSE)
  • Pre-training was done for 500 epochs
  • Training was done for 1000+ epochs (see the sketch after the formula below)

MSE = (1/n) Σᵢ (predictionᵢ − labelᵢ)²
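
Continuing the Keras sketch from the model slide, with the loss and epoch counts from these bullets (the optimizer choice and data variable names are placeholders):

```python
model.compile(optimizer="adam", loss="mse")            # minimize MSE
model.fit(pretrain_data, pretrain_labels, epochs=500)  # pre-training
model.fit(train_data, train_labels, epochs=1000)       # training
```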

12 of 13

Results

  • Stanford paper implementation
    • Achieved a mean squared error of ~55
    • Although the accuracies resemble those described in the paper, the reconstructed audio does not sound human
  • Pre-training model
    • Achieved a mean squared error of ~61
    • Output sounds human and the word is maintained (albeit with a lot of noise)
  • Training with a new accent
    • Achieved a mean squared error of ~210
    • Output maintains the syllables, but the content is lost to noise

13 of 13

Extensions and applications

  • Waveform reconstruction from MFCCs needs more work
    • There is some research on using GANs for this
  • Test outputs with an accent classifier
    • To check whether our output contains an “inherent” accent
  • Test which accents convert between one another the best
  • Use cases include:
    • Helping listeners understand thick accents
    • Presenting information in a more familiar accent for a given location
    • Possibly speech training (by highlighting the differences between one’s own voice and a target accent)