1 of 13

Accent Transfer

Sameer Pusapaty, Patrick Wang

2 of 13

Problem statement

The goal of this project is to convert the accent of any speaker into another given accent. Specifically, we focused on converting American to British and American to Spanish.

“to-ma-to”

“to-mah-to”

3 of 13

Background Research

  • Much less work has been done on style transfer for audio than for images
  • While many papers address accent classification, very few consider accent transfer

4 of 13

Data sources

Kaggle:

  • Several thousand mp3 files
  • About 2000 different speakers
  • Same paragraph (69 words)
  • Gender not specified in label
  • Words not given separately

Forvo:

  • Several thousand mp3 files
  • Online website with API
  • Made about 500 calls daily
  • Larger variety of words, but fewer speakers per word
  • Crowdsourced so audio quality varies

5 of 13

Pre-processing: isolating words

Parsing individual words

  • Used the IBM Watson speech-to-text API
  • Identified keywords in the audio and their respective time frames
  • Cut the audio at those time frames (see the sketch below)

“Please call Stella ...
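
As a minimal sketch of this step, assuming the ibm_watson Python SDK and pydub (the credentials, URL, and file names below are placeholders):

```python
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from pydub import AudioSegment  # requires ffmpeg for mp3 support

# Placeholder credentials; substitute your own.
stt = SpeechToTextV1(authenticator=IAMAuthenticator("YOUR_API_KEY"))
stt.set_service_url("YOUR_SERVICE_URL")

# Ask Watson for per-word timestamps along with the transcript.
with open("speaker.mp3", "rb") as f:
    result = stt.recognize(audio=f, content_type="audio/mp3",
                           timestamps=True).get_result()

# Cut the recording at each word's time frame.
audio = AudioSegment.from_mp3("speaker.mp3")
for segment in result["results"]:
    # Each timestamp entry is [word, start_seconds, end_seconds].
    for word, start, end in segment["alternatives"][0]["timestamps"]:
        clip = audio[int(start * 1000):int(end * 1000)]  # pydub slices in ms
        clip.export(f"{word}.mp3", format="mp3")
```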

6 of 13

Pre-processing: aligning speech

FastDTW (Dynamic Time Warping)

  • Avoids biasing the model with different rates of speech
  • FastDTW finds the lowest-cost frame-to-frame alignment between two recordings of the same word (see the sketch below)
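
A sketch of this alignment with the fastdtw package; the array names are ours, and averaging many-to-one matches is one reasonable way to warp the target onto the source's timeline:

```python
import numpy as np
from scipy.spatial.distance import euclidean
from fastdtw import fastdtw

def align(source_mfcc, target_mfcc):
    """Warp target_mfcc (n_target x 25) onto the timeline of
    source_mfcc (n_source x 25) so both have the same frame count."""
    # fastdtw returns a total distance and a list of (i, j) pairs
    # matching source frame i to target frame j.
    _, path = fastdtw(source_mfcc, target_mfcc, dist=euclidean)
    aligned = np.zeros(source_mfcc.shape)
    counts = np.zeros(len(source_mfcc))
    for i, j in path:
        aligned[i] += target_mfcc[j]
        counts[i] += 1
    return aligned / counts[:, None]  # average frames mapped to the same i
```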

7 of 13

Pre-processing: extracting MFCC coefficients

[Diagram: the audio is split into 5 ms frames, each represented as a vector of 25 MFCC coefficients, e.g. [1, 2, 3, 4 … 25]]

Information is lost when turning sound into MFCC coefficients!
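
One way to compute these coefficients, sketched with librosa (the deck does not name a library, so this choice is an assumption; hop_length sets the frame spacing in samples, so ~5 ms at 22050 Hz is about 110 samples):

```python
import librosa

y, sr = librosa.load("word.mp3")  # librosa resamples to 22050 Hz by default
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=25, hop_length=110)
print(mfcc.shape)  # (25, n_frames): one 25-dim vector per ~5 ms frame
```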

8 of 13

Data Pipeline and Preparation

Speech samples are separated into words using the Watson API, then flow through two branches:

  • Sample accent audio files → find MFCCs → DATA
  • Target accent audio files → find MFCCs → align to the sample accent using FastDTW → LABELS
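
A sketch of how aligned MFCC sequences might be flattened into DATA/LABELS pairs, using the 63-frame input and 3-frame output sizes from the model slide; the exact windowing is not specified in the deck, so this is one plausible reading:

```python
import numpy as np

def make_pairs(source_mfcc, aligned_target_mfcc, in_frames=63, out_frames=3):
    """Slide a 63-frame window over the source (flattened to 1575 values)
    and pair it with the 3 center frames of the FastDTW-aligned target
    (flattened to 75 values)."""
    data, labels = [], []
    for start in range(len(source_mfcc) - in_frames + 1):
        window = source_mfcc[start:start + in_frames]
        center = start + in_frames // 2
        target = aligned_target_mfcc[center - out_frames // 2:
                                     center + out_frames // 2 + 1]
        data.append(window.reshape(-1))    # 63 * 25 = 1575
        labels.append(target.reshape(-1))  # 3 * 25 = 75
    return np.array(data), np.array(labels)
```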

9 of 13

Post-processing

[Diagram: overlapping predicted MFCC rows for the same time frame are averaged, e.g. averaging [1, 2, 3, 4 … 25] and [0, 0, 0, 0 … 0] gives [0.5, 1, 1.5, … 12.5], while averaging [1, 2, 3, 4 … 25], [0, 0, 0, 0 … 0], and [25, 24, 23, 22 … 1] gives roughly [9, 9, 9, 9 … 9]]

Averaging rows to “smooth” the outcome
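
A sketch of this smoothing, assuming consecutive windows are offset by one frame so each frame is predicted up to three times:

```python
import numpy as np

def smooth(predictions):
    """predictions: (n_windows, 3, 25) model outputs, each window
    predicting 3 consecutive frames. Overlapping rows for the same
    frame are averaged."""
    n_frames = len(predictions) + 2  # windows offset by one frame
    total = np.zeros((n_frames, 25))
    counts = np.zeros((n_frames, 1))
    for w, frames in enumerate(predictions):
        total[w:w + 3] += frames
        counts[w:w + 3] += 1
    return total / counts
```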

10 of 13

Model

Details:

  • The same architecture is used for both the pre-training and training models
  • Input layer: 1 × 1575 (63 frames × 25 MFCC coefficients)
  • Two hidden layers
    • 100 nodes each
    • Both use tanh activation functions
  • Output layer: 1 × 75 (3 frames × 25 MFCC coefficients)
  • All weights and biases are initialized randomly from a normal distribution (see the sketch below)

[Diagram: input layer → two hidden layers → output layer]
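
The architecture above, sketched in Keras (the deck does not name a framework, so the framework is our assumption):

```python
import tensorflow as tf

init = tf.keras.initializers.RandomNormal()  # random normal weights/biases

inputs = tf.keras.Input(shape=(1575,))  # 63 frames x 25 MFCC coefficients
x = tf.keras.layers.Dense(100, activation="tanh",
                          kernel_initializer=init, bias_initializer=init)(inputs)
x = tf.keras.layers.Dense(100, activation="tanh",
                          kernel_initializer=init, bias_initializer=init)(x)
outputs = tf.keras.layers.Dense(75, kernel_initializer=init,
                                bias_initializer=init)(x)  # 3 frames x 25 MFCCs
model = tf.keras.Model(inputs, outputs)
```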

11 of 13

Training

  • Minimized mean squared error (MSE)
  • Pre-training was done for 500 epochs
  • Training was done for 1000+ epochs (see the sketch after the formula below)

MSE = (1/n) Σᵢ (predictionᵢ − labelᵢ)²
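
Continuing the Keras sketch from the model slide, with the loss and epoch counts from these bullets (the optimizer choice and data variable names are placeholders):

```python
model.compile(optimizer="adam", loss="mse")            # minimize MSE
model.fit(pretrain_data, pretrain_labels, epochs=500)  # pre-training
model.fit(train_data, train_labels, epochs=1000)       # training
```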

12 of 13

Results

  • Stanford paper implementation
    • Achieved a mean squared error of ~55
    • Although the accuracies resemble those described in the paper, the reconstructed audio does not sound human
  • Pre-training model
    • Achieved a mean squared error of ~61
    • Output sounds human and the word is maintained (albeit with a lot of noise)
  • Training with a new accent
    • Achieved a mean squared error of ~210
    • Output maintains the syllables, but the content is lost to noise

13 of 13

Extensions and applications

  • Waveform reconstruction from MFCCs needs more work
    • There is some research on using GANs for this
  • Test outputs with an accent classifier
    • To check whether our output contains an “inherent” accent
  • Test which accents convert between one another the best
  • Use cases include:
    • Helping listeners understand thick accents
    • Presenting information in a more familiar accent for a given location
    • Possibly speech training (by highlighting the differences between one’s own voice and a target accent)