1 of 98


Fighting bias with a more ethical and reliable description of speech data

André Monforte

October 2021

Impact of Vocal Traits Distribution on Speech Applications’ Performance and Bias

Supervisor: Professor Doutor João Gama

Advisor: Doutor Rui Correia

2 of 98


AGENDA

01 Motivation
02 Problem Discussion
03 Research Hypothesis
04 RQ1: Most Informative Vocal Traits
05 RQ2: Impact of Vocal Traits on ASR Performance and Bias
06 Conclusions and Future Work

3 of 98


Motivation

Speech Applications

(Illustration: a user asks a speech application "How are you?")

4 of 98


Motivation

Training data + Machine Learning Algorithms = AI

Essential ingredients for speech applications

5 of 98


Motivation

Training data + Machine Learning Algorithms = AI (and BIAS)

Essential ingredients for speech applications

6 of 98


Motivation

Bias in Speech

Bias is about systematic errors “against” specific sub-groups.

Gender – Source: (Garnerin, 2019)
Ethnicity – Source: (Koenecke, 2020)
Age – Source: (Schlögl, 2013)

7 of 98


Motivation

Bias in Speech | Mitigation techniques

Bias is about systematic errors “against” specific sub-groups.

Gender – Source: (Garnerin, 2019)

INTERVENTION: EQUAL REPRESENTATION OF GENDER GROUPS ON THE TRAIN SET

8 of 98


Motivation

Bias in Speech | Mitigation techniques

Bias is about systematic errors “against” specific sub-groups.

INTERVENTION: EQUAL REPRESENTATION OF SOCIAL GROUPS (GENDER | ETHNICITY | AGE) ON THE TRAIN SET

9 of 98


Balancing speech data over metadata has 2 major limitations:

  1. It relies on self-reported statistics, which are hard to validate.
  2. Metadata labels are only proxies for the actual vocal traits of the speaker.

10 of 98


Balancing speech data over metadata has 2 major limitations:

  1. It relies on self-reported statistics, which are hard to validate.
  2. Metadata labels are only proxies for the actual vocal traits of the speaker.

11 of 98


Problem Discussion

Speech Contributors > Collection Job > Speech data

Speech collection pipeline

12 of 98


Problem Discussion

Speech Contributors > Collection Job > Speech data

Speech collection pipeline

13 of 98


Problem Discussion

Speech collection – fraud types:

Gender Mismatch – profile gender vs. gender detected from speech
Multiple Speaker – 1 account, many speakers
Multiple Account – 1 speaker, many accounts

14 of 98


Problem Discussion

Speech Contributors > Collection Job > Speech data > Validation Step

Validation checks: speech content, background conditions, nativeness

Speech collection pipeline

15 of 98


Problem Discussion

Speech Contributors > Collection Job > Speech data > Validation Step

Validation checks: speech content, background conditions, nativeness

Self-reported speaker metadata (hard to validate):
  • Age
  • Gender
  • Ethnicity

Speech collection pipeline

16 of 98


Balancing speech data over metadata has 2 major limitations:

  1. It relies on self-reported statistics, which are hard to validate.
  2. Metadata labels are only proxies for the actual vocal traits of the speaker.

Impact of Vocal Traits Distribution on Speech Applications’ Performance and Bias | Research Plan

17 of 98


Problem Discussion

Proxies for actual vocal traits of the speaker

  • Pitch
  • Amplitude
  • Jitter
  • Speaking rate

GENDER APPROACH: 50 male speakers, 50 female speakers

(Chart: speakers plotted by Pitch vs. Amplitude – the TYPICAL picture.)

18 of 98


Problem Discussion

Proxies for actual vocal traits of the speaker

  • Pitch
  • Amplitude
  • Jitter
  • Speaking rate

GENDER APPROACH: 50 male speakers, 50 female speakers

(Chart: speakers plotted by Pitch vs. Amplitude – TYPICAL vs. IDEALLY.)

19 of 98


Problem Discussion

Proxies for actual vocal traits of the speaker

  • Pitch
  • Amplitude
  • Jitter
  • Speaking rate

GENDER APPROACH: 50 male speakers, 50 female speakers

(Chart: speakers plotted by Pitch vs. Amplitude – TYPICAL vs. ACTUALLY.)

20 of 98


Problem Discussion

Proxies for actual vocal traits of the speaker

  • Pitch
  • Amplitude
  • Jitter
  • Speaking rate

Balancing data over GENDER is only effective at capturing the TYPICAL vocal traits of the group.

GENDER APPROACH: 50 male speakers, 50 female speakers

(Chart: speakers plotted by Pitch vs. Amplitude – ACTUALLY, the speakers fall unevenly across the pitch intervals: 5, 10, 25, 25, 5, 20, 10.)

21 of 98


Research Hypothesis

Actual vocal representation

WHAT WE PROPOSE:
01 Ignore social groups

(Chart: speakers plotted by Vocal trait 1 vs. Vocal trait 2.)

22 of 98


Research Hypothesis

Actual vocal representation

WHAT WE PROPOSE:
01 Ignore social groups
02 Use vocal traits as criterion

(Chart: speakers plotted by Vocal trait 1 vs. Vocal trait 2; uneven counts per interval: 5, 10, 25, 25, 5, 20, 10.)

23 of 98


Research Hypothesis

Actual vocal representation

WHAT WE PROPOSE:
01 Ignore social groups
02 Use vocal traits as criterion
03 Actually balance the dataset

(Chart: speakers plotted by Vocal trait 1 vs. Vocal trait 2; 14 per interval after balancing.)

24 of 98


Research Hypothesis

Actual vocal representation

WHAT WE PROPOSE:
01 Ignore social groups
02 Use vocal traits as criterion
03 Actually balance the dataset

+ EFFECTIVE
+ VERIFIABLE
+ ETHICAL

(Chart: speakers plotted by Vocal trait 1 vs. Vocal trait 2; 14 per interval after balancing.)

25 of 98


PROXY for individual vocal traits – WHAT PEOPLE TYPICALLY SOUND LIKE

ACTUAL individual vocal traits – WHAT PEOPLE ACTUALLY SOUND LIKE:
  • Pitch
  • Amplitude
  • Jitter
  • Speaking rate

Research Hypothesis

Actual vocal representation

26 of 98


Research Questions 

  1. What voice traits better differentiate and characterize speakers?
  2. What is the impact of balancing such features in the training dataset of a speech application?

Impact of Vocal Traits Distribution on Speech Applications’ Performance and Bias

27 of 98


RQ1: Voice traits that better differentiate speakers

RQ 1

What voice traits better differentiate and characterize speakers?

3-step approach:
A. Pool of acoustic features
B. Vocal traits portraying the same information as gender labels
C. Vocal traits that differentiate speakers

28 of 98


RQ1.A | Pool of acoustic features

| # | FEATURE | DESCRIPTION | INSIGHT |
|---|---------|-------------|---------|
| 1 | Spectral Centroid | Centre of mass of the signal | Brightness of a sound signal |
| 2 | Spectral Spread | Spread of energy around the centre of mass | Average deviation around the centroid |
| 3 | HNR | Hoarseness; distance to pure tones | Distance between pure tones and the sound frequencies |
| 4 | Pitch | Fundamental frequency (F0) | Degree of highness or lowness of a tone |
| 5 | Jitter | Pitch fluctuations | Variations of the pitch within the utterance |
| 6 | Shimmer | Amplitude fluctuations | Variations of the amplitude within the utterance |
| 7 | Energy | Root mean square energy of the signal | Intensity of the signal, and its fluctuation within the utterance |
| 8 | Loudness | Volume | Perceived volume of the signal |
| 9 | Speaking Rate | Number of words per second | Rhythm of speaking |

RQ1: Voice traits that better differentiate speakers
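To make these measures concrete, here is a minimal sketch of utterance-level extraction for a few of them. It uses librosa rather than the Surfboard toolkit employed in the thesis, and the jitter approximation below is an illustrative assumption, not the thesis's exact formula:

```python
import numpy as np
import librosa

def utterance_features(wav_path: str) -> dict:
    """Extract a few of the pooled acoustic features for one recording."""
    y, sr = librosa.load(wav_path, sr=None)  # keep the native sample rate

    # Spectral centroid ("brightness") and spread, averaged over frames.
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()
    spread = librosa.feature.spectral_bandwidth(y=y, sr=sr).mean()

    # Pitch (F0) via probabilistic YIN, keeping voiced frames only.
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    f0 = f0[~np.isnan(f0)]
    pitch = f0.mean() if f0.size else np.nan

    # Jitter approximated as mean relative frame-to-frame F0 variation.
    jitter = (np.abs(np.diff(f0)).mean() / pitch) if f0.size > 1 else np.nan

    # Root mean square energy of the signal.
    energy = librosa.feature.rms(y=y).mean()

    return {"spectral_centroid": centroid, "spectral_spread": spread,
            "pitch": pitch, "jitter": jitter, "energy": energy}
```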

29 of 98


RQ1: Voice traits that better differentiate speakers

RQ 1

What voice traits better differentiate and characterize speakers?

3-step approach:
A. Pool of acoustic features
B. Vocal traits portraying the same information as gender labels
C. Vocal traits that differentiate speakers

30 of 98


MOST PROMISING FEATURES:

01 GENDER CORRELATION – correlation with gender labels.
02 GROUP DIFFERENCES – box plots per measure and gender.

RQ1.B | Vocal traits portraying the same information as gender labels

RQ1: Voice traits that better differentiate speakers

31 of 98


MOST PROMISING FEATURES:

01 GENDER CORRELATION – correlation with gender labels.
02 GROUP DIFFERENCES – box plots per measure and gender.

#1 PITCH
  • Highest correlation with gender labels.
  • Most significant mean differences between gender groups.

RQ1.B | Vocal traits portraying the same information as gender labels

RQ1: Voice traits that better differentiate speakers

32 of 98


MOST PROMISING FEATURES:

01 GENDER CORRELATION – correlation with gender labels.
02 GROUP DIFFERENCES – box plots per measure and gender.

#1 PITCH

#2 HNR AND JITTER
  • HNR and Jitter have the 2nd highest correlation with gender labels.
  • Both show significant differences between gender groups.

RQ1.B | Vocal traits portraying the same information as gender labels

RQ1: Voice traits that better differentiate speakers

33 of 98


| # | ITERATION | % TEST |
|---|-----------|--------|
| 1 | Complete dataset | 96.3% |
| 2 | Excluding Pitch | 81.7% |
| 3 | Excluding Pitch + Jitter + HNR | 70.1% |
| 4 | Pitch only | 87.3% |
| 5 | Pitch + Jitter | 89.1% |
| 6 | Pitch + HNR | 81.8% |

Test Accuracy results | Gender predictor

03

  • Pitch alone predicts gender with 87.3% accuracy.
  • Pitch has the greatest information gain in every iteration in which it is included.
  • Removing pitch reduces accuracy by ~15 p.p.

#1

PITCH

MOST PROMISING FEATURES

PREDICTING GENDER FROM VOCAL TRAITS

RQ1.B | Vocal traits portraying the same information as gender labels

RQ1: Voice traits that better differentiate speakers

34 of 98


Test Accuracy results | Gender predictor

03

  • Pitch alone predicts gender with 87.3% accuracy.
  • Pitch has the greatest information gain in every iteration in which it is included.
  • Removing pitch reduces accuracy by ~15 p.p.

#1

PITCH

MOST PROMISING FEATURES

PREDICTING GENDER FROM VOCAL TRAITS

| # | ITERATION | % TEST |
|---|-----------|--------|
| 1 | Complete dataset | 96.3% |
| 2 | Excluding Pitch | 81.7% |
| 3 | Excluding Pitch + Jitter + HNR | 70.1% |
| 4 | Pitch only | 87.3% |
| 5 | Pitch + Jitter | 89.1% |
| 6 | Pitch + HNR | 81.8% |

#2

JITTER

  • Adding HNR to pitch leads to a worse performance than Pitch alone (81.8% vs. 87.3%)

RQ1.B | Vocal traits portraying the same information as gender labels

RQ1: Voice traits that better differentiate speakers

35 of 98


RQ1: Voice traits that better differentiate speakers

RQ 1

What voice traits better differentiate and characterize speakers?

3-step approach:
A. Pool of acoustic features
B. Vocal traits portraying the same information as gender labels > Pitch and Jitter
C. Vocal traits that differentiate speakers

36 of 98


RQ1.C | Vocal traits that differentiate speakers

RQ1: Voice traits that better differentiate speakers

Standardize acoustic features > Hierarchical clustering > Partitional clustering > Compare mean differences

37 of 98

RQ1.C | Vocal traits that differentiate speakers

RQ1: Voice traits that better differentiate speakers

Clustering experiment – mean differences of vocal traits

05

38 of 98

RQ1.C | Vocal traits that differentiate speakers

RQ1: Voice traits that better differentiate speakers

Clustering experiment – mean differences of vocal traits

05

Correlation Matrix – vocal traits and gender

06

39 of 98

RQ1.C | Vocal traits that differentiate speakers

RQ1: Voice traits that better differentiate speakers

Clustering experiment – mean differences of vocal traits

05

Correlation Matrix – vocal traits and gender

06

40 of 98


MOST EFFECTIVE VOCAL TRAITS TO SEPARATE SPEAKERS:

RQ1.C | Vocal traits that differentiate speakers

RQ1: Voice traits that better differentiate speakers

#1 SPECTRAL CENTROID
#2 LOUDNESS

Beyond Gender labels | Final hypothesis

07

41 of 98


RQ 1

What voice traits better differentiate and characterize speakers?

3-step approach:
A. Pool of acoustic features
B. Vocal traits portraying the same information as gender labels > Pitch and Jitter
C. Vocal traits that differentiate speakers > Spectral Centroid and Loudness

RQ1: Voice traits that better differentiate speakers

42 of 98


Research Questions 

  1. What voice traits better differentiate and characterize speakers?
  2. What is the impact of balancing such features in the training dataset of a speech application?

Impact of Vocal Traits Distribution on Speech Applications’ Performance and Bias | Research Plan

43 of 98


High-level experimental pipeline

FOR EACH FEATURE IN OUR SHORTLIST:
1. Most informative criterion for balancing > 2. Balanced train set > 3. Train ASR model (the same training pipeline for all) > 4. Evaluate performance > 5. Compare performance and bias

RQ2: Impact of balancing vocal traits in the training dataset

44 of 98


1. Most informative criterion for balancing

4 VOCAL MODELS:
  • PITCH
  • PITCH + JITTER
  • SPECTRAL CENTROID
  • SPECTRAL CENTROID + LOUDNESS

2 NON-VOCAL MODELS:
  • GENDER
  • RANDOM/UNBALANCED

45 of 98


RQ2: Impact of balancing vocal traits in the training dataset

Train set pipeline

2. Balanced train set

CONSTRAINTS:
  • FIXED-SIZED INTERVALS
  • 100/200 HOURS

(Chart: speakers plotted by Pitch vs. Amplitude, with 14 per pitch interval after balancing.)

46 of 98


Train set pipeline

2. Balanced train set

CONSTRAINTS:
  • BALANCING CRITERION – which vocal trait?
  • # REFERENCE GROUPS (bins)
  • FIXED-SIZED INTERVALS
  • 100/200 HOURS

(Chart: speakers plotted by Pitch vs. Amplitude, with 14 per pitch interval after balancing.)

RQ2: Impact of balancing vocal traits in the training dataset
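A minimal sketch of how such a balanced train set can be drawn, assuming a dataframe of candidate recordings with a `duration_h` column and one column per vocal trait (column names and the greedy fill strategy are illustrative assumptions, not the thesis's exact procedure):

```python
import pandas as pd

def balanced_train_set(pool: pd.DataFrame, trait: str,
                       n_bins: int, target_hours: float) -> pd.DataFrame:
    """Sample a train set with equal hours per fixed-sized trait interval."""
    # Fixed-sized intervals over the trait's range (the reference groups).
    binned = pool.assign(bin=pd.cut(pool[trait], bins=n_bins, labels=False))

    hours_per_bin = target_hours / n_bins
    parts = []
    for _, group in binned.groupby("bin"):
        shuffled = group.sample(frac=1, random_state=0)
        # Greedily keep recordings until the bin's hour budget is met.
        parts.append(shuffled[shuffled["duration_h"].cumsum() <= hours_per_bin])
    return pd.concat(parts)
```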

47 of 98


Shortlist of train sets

200 HOURS (7 train sets):
  • PITCH: 2, 3 bins
  • SPECTRAL CENTROID: 2, 3 bins
  • PITCH AND JITTER: 2x2 bins
  • GENDER
  • RANDOM

100 HOURS (10 train sets) – all combinations in the 200-hours group, plus:
  • PITCH: 4 bins
  • SPECTRAL CENTROID: 4 bins
  • SPECTRAL CENTROID AND LOUDNESS: 2x2 bins

RQ2: Impact of balancing vocal traits in the training dataset

48 of 98


Performance

RQ2: Impact of balancing vocal traits in the training dataset

Average WER per model and number of hours in the train set.

08

49 of 98


Performance | High-Performers

100 Hours

  • Pitch + Jitter (2-2)
  • Spectral Centroid + Loudness (2-2)
  • Spectral Centroid 3

200 Hours

  • Spectral Centroid 3
  • Pitch 3

HIGH-PERFORMERS: VOCAL MODELS WITH AT LEAST 3 BINS

RQ2: Impact of balancing vocal traits in the training dataset

Average WER per model and number of hours in the train set.

08

50 of 98


Performance | Low-Performers

100 Hours

  • Random
  • Pitch 2 / Spectral centroid 2 / Gender

200 Hours

  • Spectral centroid 2
  • Gender

LOW-PERFORMERS: NON-VOCAL MODELS + VOCAL MODELS WITH 2 BINS

RQ2: Impact of balancing vocal traits in the training dataset

Average WER per model and number of hours in the train set.

08

51 of 98


Performance | Mean WER per criterion and differences to Gender

RQ2: Impact of balancing vocal traits in the training dataset

WER difference to the Gender Model (p.p.):

| # | BALANCING CRITERION | 100h | 200h | mean |
|---|---------------------|------|------|------|
| 1 | Spectral Centroid + Loudness (2-2)* | -2.09 | - | -2.09 |
| 2 | Spectral Centroid 3 | -1.58 | -2.10 | -1.84 |
| 3 | Pitch + Jitter (2-2) | -2.27 | -1.19 | -1.73 |
| 4 | Pitch 3 | -0.48 | -1.97 | -1.22 |
| 5 | Pitch 4* | -1.18 | - | -1.18 |
| 6 | Spectral Centroid 4* | -0.98 | - | -0.98 |
| 7 | Pitch 2 | 0.15 | -1.36 | -0.61 |
| 8 | Spectral Centroid 2 | 0.08 | 1.50 | 0.79 |
| 9 | Random | 3.09 | -1.12 | 0.99 |

* only validated for 100h.

DIFFERENCES TO GENDER
  • 1-2 p.p. performance improvement for vocal models with 3+ reference groups.
  • Vocal models with 2 reference groups perform most similarly to the gender model.

Mean WER difference to the Gender Model (percentage points).

08

52 of 98


Bias | Gender

RQ2: Impact of balancing vocal traits in the training dataset

* only validated for 100h.

  • Vocal models with a single balancing criterion produced unbiased models:
    • Pitch 2-4
    • Spectral Centroid 2-4

PERFORMANCE IN GENDER GROUPS

Mean WER between gender groups.

09

| BALANCING CRITERION | MALE | FEMALE | mean DIFF (p.p.) |
|---------------------|------|--------|------------------|
| Gender | 44.37% | 48.09% | -3.72 |
| Pitch 2 | 46.37% | 46.96% | -0.59 |
| Pitch 4* | 49.53% | 49.67% | -0.14 |
| Spectral Centroid 2 | 48.32% | 48.43% | -0.11 |
| Pitch 3 | 46.55% | 46.46% | 0.09 |
| Spectral Centroid 3 | 46.32% | 46.21% | 0.11 |
| Spectral Centroid 4* | 50.46% | 49.86% | 0.60 |
| Random | 49.03% | 48.08% | 0.94 |
| Spectral Centroid + Loudness (2-2)* | 49.71% | 48.05% | 1.65 |
| Pitch + Jitter (2-2) | 47.16% | 45.38% | 1.78 |

53 of 98


Differences to the model's mean performance (p.p.):

| Balancing Criterion | Teens | Thirties | Forties | Fifties | Over 60 | Sum of squared diffs |
|---------------------|-------|----------|---------|---------|---------|----------------------|
| Pitch 4 | 6.12 | -2.10 | -4.74 | -1.62 | -4.88 | 96.32 |
| Spectral Centroid 3 | 8.51 | -0.84 | -2.35 | -0.64 | -3.25 | 105.89 |
| Spectral Centroid + Loudness (2-2) | 7.09 | -2.88 | -4.48 | -1.33 | -5.72 | 117.76 |
| Pitch 3 | 6.87 | -2.38 | -3.91 | -2.02 | -6.24 | 120.00 |
| Spectral Centroid 4 | 8.47 | -2.10 | -3.51 | -1.90 | -5.45 | 125.54 |
| Pitch 2 | 7.79 | -2.53 | -4.33 | -2.55 | -5.59 | 129.51 |
| Spectral Centroid 2 | 7.37 | -2.67 | -4.26 | -2.07 | -6.37 | 130.19 |
| Pitch + Jitter (2-2) | 7.26 | -2.52 | -4.66 | -1.85 | -6.35 | 131.18 |
| Random | 7.13 | -3.12 | -4.51 | -2.77 | -6.14 | 132.45 |
| Gender | 5.66 | -3.05 | -5.12 | -3.45 | -7.08 | 134.10 |

Bias | Age

RQ2: Impact of balancing vocal traits in the training dataset

* only validated for 100h.

  • Vocal models with 3+ bins reduced bias against age groups:
    • Pitch 4
    • Spectral Centroid 3

  • Non-vocal models showed the most bias between age groups.

PERFORMANCE IN AGE GROUPS

Mean WER between age groups.

09

54 of 98


Final conclusions

RQ2: Impact of balancing vocal traits in the training dataset

RQ2: What is the impact of balancing such features in the training dataset of a speech application?

A. PERFORMANCE – Vocal models with 3+ bins improved performance by 1-2 p.p.

B. BIAS – Vocal models with 3+ bins and a single balancing criterion were effective in mitigating bias between age and gender groups.

55 of 98


CONCLUSIONS AND FUTURE WORK

Research Hypothesis vs. Obtained Results

WHAT WE PROPOSE:
01 Ignore social groups
02 Use vocal traits as criterion
03 Actually balance the dataset

+ EFFECTIVE
+ VERIFIABLE
+ ETHICAL

(Chart: speakers plotted by Vocal trait 1 vs. Vocal trait 2; 14 per interval after balancing.)

56 of 98


CONCLUSIONS AND FUTURE WORK

Research Hypothesis vs. Obtained Results

OBTAINED RESULTS:
01 Ignore social groups
02 Use vocal traits as criterion
03 Actually balance the dataset

+ EFFECTIVE
+ VERIFIABLE
+ ETHICAL

(Chart: speakers plotted by Vocal trait 1 vs. Vocal trait 2; 14 per interval after balancing.)

57 of 98


CONCLUSIONS AND FUTURE WORK

Research Hypothesis vs. Obtained Results

OBTAINED RESULTS:
01 Ignore social groups
02 Use vocal traits as criterion
03 Actually balance the dataset

+ EFFECTIVE
+ VERIFIABLE
+ ETHICAL

(Chart: Vocal trait 1 vs. Vocal trait 2.)

58 of 98


CONCLUSIONS AND FUTURE WORK

Research Hypothesis vs. Obtained Results

OBTAINED RESULTS:
01 Ignore social groups
02 Use vocal traits as criterion
03 Actually balance the dataset

+ EFFECTIVE
+ VERIFIABLE
+ ETHICAL

(Chart: PITCH OR SPECTRAL CENTROID vs. Vocal trait 2.)

59 of 98


CONCLUSIONS AND FUTURE WORK

Obtained results

OBTAINED RESULTS:
01 Ignore social groups
02 Use vocal traits as criterion – PITCH OR SPECTRAL CENTROID
03 Actually balance the dataset – 3+ REFERENCE GROUPS

+ EFFECTIVE
+ VERIFIABLE
+ ETHICAL

60 of 98


CONCLUSIONS AND FUTURE WORK

Obtained results

OBTAINED RESULTS:
01 Ignore social groups
02 Use vocal traits as criterion – PITCH OR SPECTRAL CENTROID
03 Actually balance the dataset – 3+ REFERENCE GROUPS

+ EFFECTIVE
+ VERIFIABLE
+ ETHICAL

61 of 98


CONCLUSIONS AND FUTURE WORK

Obtained results

OBTAINED RESULTS:
01 Ignore social groups
02 Use vocal traits as criterion – PITCH OR SPECTRAL CENTROID
03 Actually balance the dataset – 3+ REFERENCE GROUPS

↑ PERFORMANCE
↓ BIAS

+ EFFECTIVE
+ VERIFIABLE
+ ETHICAL

62 of 98


FUTURE WORK

TEST THE CONCLUSIONS IN ALTERNATIVE SCENARIOS:
01 Languages
02 Nationalities
03 Recording conditions
04 Training set setups (distribution, # groups, # hours)

63 of 98


THANK YOU.

64 of 98


Q&A

65 of 98


RQ2

RQ1

DATASET

POOL OF RECORDINGS

ASR TRAINING PIPELINE

ASR ARCHITECTURE

TRAIN SET SELECTION

SHORTLIST

METHODOLOGY

SELECTION CRITERIA

EXTRACTION

CV RESULTS

XGBOOST RESULTS

GREY AREA

CLUSTERING PIPELINE

CONSIDERED ITERATIONS

GENDER VS. CLUSTERS

GENERAL PERFORMANCE NOTES

EXPECTED PERFORMANCE

GENDER BIAS

AGE BIAS

Q&A

66 of 98


A. ACOUSTIC FEATURES

67 of 98

Vocal Traits Conveying Speaker Characteristics

Methodology

Base Pool of Features > Utterance-level feature extraction > Speaker-level feature extraction > Selecting vocal features

Base Pool of Features – utterance-level feature extraction:
  • Vocal features for each recording.
  • Drawn from Speaker Identification and Voice Profiling techniques.

Speaker-level feature extraction – aggregate the feature results on a speaker level, using:
  • Median
  • Trimmed Mean

Selecting vocal features:
  • Define a criterion: the coefficient of variation.
  • Rank and select variables.

68 of 98

Feature Extraction Process

DefinedCrowd Speech Dataset

  • 63,363 WAV files
  • 275 speakers
  • 155.9 hours
  • Mean duration 8.9 seconds
  • 64% vs. 36% (F/M)

For each recording – extract the features for each file and aggregate them on an utterance level, using the Surfboard Python toolkit:
  • Mean
  • Standard Deviation*

Aggregate on a speaker level – aggregate each feature on a speaker level, using:
  • Median
  • Trimmed Mean

Methodology
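A minimal sketch of the speaker-level aggregation, assuming a dataframe with one row per recording, a `speaker_id` column, and the utterance-level feature values (column names and the 10% trim proportion are illustrative assumptions):

```python
import pandas as pd
from scipy.stats import trim_mean

def speaker_level(utterance_df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate utterance-level features to one row per speaker."""
    feats = utterance_df.columns.drop("speaker_id")
    grouped = utterance_df.groupby("speaker_id")[feats]
    # Median is robust to outlier utterances.
    median = grouped.median().add_suffix("_median")
    # Trimmed mean: drop the lowest/highest 10% of utterances per speaker.
    tmean = grouped.agg(lambda s: trim_mean(s, 0.1)).add_suffix("_tmean")
    return median.join(tmean)
```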

69 of 98

Selecting and ranking features

Methodology

1. EXCLUSION PREMISES

INTRA-UTTERANCE: variables with a very high variability within the utterance do not have a significant mean (aggregated) value and should be excluded.

INTRA-SPEAKER: variables with a very high variability across all recordings of the same speaker do not have a significant speaker-mean value and should be excluded.

INTER-UTTERANCE: variables with a very low variability across all recordings have low discriminatory power and should be excluded.

INTER-SPEAKER: variables with a very low variability between speakers have low discriminatory power and should be excluded.

2. COEFFICIENT OF VARIATION (CV)
  • Evaluates how well-distributed a variable is over the instance space.
  • Unitless, hence comparable across features.
  • Very low: x < 0.15 | Very high: x > 0.85
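A minimal sketch of two of these checks, reusing the per-recording dataframe from the previous step. The thresholds are the ones on the slide; the exact aggregation is an assumption:

```python
import pandas as pd

CV_VERY_LOW, CV_VERY_HIGH = 0.15, 0.85  # thresholds from the slide

def cv(s: pd.Series) -> float:
    """Coefficient of variation: std / |mean| (unitless, hence comparable)."""
    return s.std() / abs(s.mean())

def keep_feature(df: pd.DataFrame, feat: str) -> bool:
    """Apply the inter-speaker and intra-speaker premises to one feature."""
    # Inter-speaker: variability of the speaker-level means across speakers.
    inter_speaker = cv(df.groupby("speaker_id")[feat].mean())
    # Intra-speaker: variability across recordings of one speaker, averaged.
    intra_speaker = df.groupby("speaker_id")[feat].apply(cv).mean()
    # Exclude if it barely separates speakers or is unstable within a speaker.
    return inter_speaker > CV_VERY_LOW and intra_speaker < CV_VERY_HIGH
```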

70 of 98

Final pool of features

| FEATURE | INSIGHT | Intra-utterance | Inter-utterance | Intra-speaker | Inter-speaker |
|---------|---------|-----------------|-----------------|---------------|---------------|
| Spectral Centroid | Centre of mass of the signal | 0.139 | 0.265 | 0.23 | 0.23 |
| Spectral Spread | Spread of energy around the centre of mass | 0.068 | 0.16 | 0.147 | 0.147 |
| Spectral Skewness | Symmetry of the spectrum | 1.388 | - | - | - |
| Spectral Kurtosis | Flatness of the spectrum | 0.045 | 0.047 | - | - |
| HNR | Hoarseness; distance to pure tones | - | 0.25 | 0.184 | 0.184 |
| Pitch | Fundamental frequency (F0) | 0.104 | 0.234 | 0.22 | 0.22 |
| Jitter | Pitch fluctuations | - | 0.385 | 0.274 | 0.274 |
| Shimmer | Variations of the amplitude within the utterance | - | 0.557 | 0.304 | 0.304 |
| Energy | Root mean square energy of the signal | - | 0.256 | 0.234 | 0.234 |
| Loudness | Volume | - | 0.859 | 0.772 | 0.772 |
| Speaking Rate | Number of words per second | - | 0.558 | 0.2 | 0.2 |

Methodology

71 of 98

Final pool of features

| FEATURE | INSIGHT | Intra-utterance | Inter-utterance | Intra-speaker | Inter-speaker |
|---------|---------|-----------------|-----------------|---------------|---------------|
| Spectral Centroid | Centre of mass of the signal | 0.139 | 0.265 | 0.23 | 0.23 |
| Spectral Spread | Spread of energy around the centre of mass | 0.068 | 0.16 | 0.147 | 0.147 |
| HNR | Hoarseness; distance to pure tones | - | 0.25 | 0.184 | 0.184 |
| Pitch | Fundamental frequency (F0) | 0.104 | 0.234 | 0.22 | 0.22 |
| Jitter | Pitch fluctuations | - | 0.385 | 0.274 | 0.274 |
| Shimmer | Variations of the amplitude within the utterance | - | 0.557 | 0.304 | 0.304 |
| Energy | Root mean square energy of the signal | - | 0.256 | 0.234 | 0.234 |
| Loudness | Volume | - | 0.859 | 0.772 | 0.772 |
| Speaking Rate | Number of words per second | - | 0.558 | 0.2 | 0.2 |

Methodology

72 of 98


B. GENDER MIMIC

73 of 98

RQ1.B | Vocal traits portraying the same information as gender labels

RQ1: Voice traits that better differentiate speakers

  • Using the nine features in our pool, train a basic XGBoost model to classify the gender of the speaker.
  • Using the feature-importance tool, obtain the features that contribute most to the prediction.

Predicting gender from vocal traits

FINDINGS

  • Pitch alone predicts gender with 87.3% accuracy.
  • Pitch has the most information gain in every iteration in which it is included (score of 5.36 vs. 0.63 for jitter, in iteration #1).
  • Removing pitch from the train set drops test accuracy by ~15 p.p. (to 81.7%).

  • Jitter has the 2nd most information gain.
  • Adding HNR to pitch leads to worse performance than pitch alone (81.8% vs. 87.3%).

| # | ITERATION | % TEST |
|---|-----------|--------|
| 1 | Complete dataset | 96.3% |
| 2 | Excluding Pitch | 81.7% |
| 3 | Excluding Pitch + Jitter + HNR | 70.1% |
| 4 | Pitch only | 87.3% |
| 5 | Pitch + Jitter | 89.1% |
| 6 | Pitch + HNR | 81.8% |
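A minimal sketch of this probe, assuming a dataframe with one row per speaker, the nine feature columns, and a `gender` column already encoded as 0/1 (names and hyperparameters are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

def gender_probe(df: pd.DataFrame) -> None:
    """Train an XGBoost gender classifier and rank features by gain."""
    X, y = df.drop(columns=["gender"]), df["gender"]  # y assumed 0/1-encoded
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
    # "gain" importance mirrors the information-gain ranking on the slide.
    model = XGBClassifier(importance_type="gain", eval_metric="logloss")
    model.fit(X_tr, y_tr)

    print(f"test accuracy: {model.score(X_te, y_te):.1%}")
    ranking = sorted(zip(X.columns, model.feature_importances_),
                     key=lambda t: -t[1])
    for name, importance in ranking:
        print(f"{name}: {importance:.2f}")
```

Dropping a column from `X` and retraining reproduces the "Excluding ..." iterations in the table above.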


74 of 98


  • Most misclassifications are concentrated in the 128-148 Hz pitch area: the grey area.

  • This area corresponds to the frontier between the pitch ranges commonly defined for each gender:
    • Male: 85-155 Hz
    • Female: 165-255 Hz

PITCH GREY AREA

Pitch Grey Area | Pitch + Jitter scatter plot.

04

RQ1.B | Vocal traits portraying the same information as gender labels

RQ1: Voice traits that better differentiate speakers

75 of 98


C. BEYOND GENDER

76 of 98

RQ1.C | Separating vocal profiles

RQ1: Voice traits that better differentiate speakers

Standardize features > Hierarchical clustering > Partitional clustering > Compare mean differences

Standardize features:
  • Standardize the acoustic features for all speakers in our dataset.

Hierarchical clustering:
  • Identify the optimal number of clusters (k).

Partitional clustering – using the identified number of clusters:
  • generate the clusters,
  • obtain the mean vocal traits per cluster.

Compare mean differences:
  • Calculate the mean differences between each pair of clusters.
  • Identify the variables with the greatest standardized differences.
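A minimal sketch of this pipeline with SciPy/scikit-learn; the Ward linkage and the use of k-means for the partitional step are assumptions, not the thesis's stated choices:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cluster_speakers(features: np.ndarray, k: int):
    """features: one row per speaker, one column per acoustic feature."""
    # 1. Standardize so every vocal trait weighs equally.
    X = StandardScaler().fit_transform(features)

    # 2. Hierarchical clustering (Ward linkage); inspect the merge
    #    distances in Z (or a dendrogram) to pick a candidate k.
    Z = linkage(X, method="ward")
    print("last 5 merge distances:", Z[-5:, 2])

    # 3. Partitional clustering with the chosen k.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

    # 4. Mean standardized vocal traits per cluster, for pairwise comparison.
    centroids = np.vstack([X[labels == c].mean(axis=0) for c in range(k)])
    return labels, centroids
```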

77 of 98

RQ1: Voice traits that better differentiate speakers

RQ1.C | Separating vocal profiles

  • Spectral Centroid and Spectral Spread show the greatest mean differences.
  • Energy and Loudness form the 2nd set of relevant variables.

  • Findings held for all considered subsets of the dataset. Experiment replicated over:
    • Complete dataset
    • 50-50 gender-balanced sample
    • Male speakers
    • Female speakers
    • Subset of speakers in the grey area (128-148 Hz pitch)

78 of 98


  • None of the identified gender replacements (pitch, jitter, and HNR) showed significant differences between clusters.

  • The 4 most informative variables for separating speakers (loudness, energy, spectral centroid, and spectral spread) are weakly correlated with gender.
  • Findings held for all considered subsets of the dataset.

GENDER GROUPS VS. CLUSTERS

RQ1.C | Vocal traits that differentiate speakers

RQ1: Voice traits that better differentiate speakers

79 of 98


RQ2 METHODOLOGY

80 of 98


ASR Training Pipeline

0. Pool of recordings > 1. Extract acoustic features > 2. Build the train set > 3. Model training > 4. Evaluation

0. Pool of recordings – build a pool of recordings from which to extract the train sets.

1. Extract acoustic features – 4 features:
  • Pitch
  • Jitter
  • Spectral Centroid
  • Loudness

2. Build the train set:
  • ~200/100 hours
  • Criteria for balancing: the 4 features, gender, unbalanced

3. Model training – train an ASR with the DeepSpeech framework.

4. Evaluation – compute WER on the same test set for all models.

RQ2: Impact of balancing vocal traits in the training dataset

81 of 98


Pool of Recordings

A. CommonVoice 6.1 English dataset > B. Initial Pool > C. DEV set (30 H) + TEST set (30 H) > D. TRAIN set pool > E. Final Pool of recordings (after feature extraction)

A. CommonVoice 6.1 English dataset:
  • 1,686 hours
  • 66,173 speakers
  • Mean duration: 4.81 seconds

B. Initial Pool filters:
  • EN alphabet
  • Gender labels
  • Min 20 seconds/speaker
  • Max 29 minutes/speaker

E. Final Pool of recordings:
  • 13,400 speakers
  • 585.97 hours
  • 79.3% vs. 20.7% (F/M)

RQ2: Impact of balancing vocal traits in the training dataset

82 of 98


Model training pipeline

Methodology

3. Model training:
  • Train the ASR with the DeepSpeech framework (RNNs), starting from the DeepSpeech pre-trained models.

Speech signal > Windowing (capture frames of the signal) > Acoustic features (spectrograms) > Acoustic model (phones) > Lexicon (words) > Language model (utterances) > Decoding (search for the most probable sequence) > Utterance (the most probable utterance is output: "How are you?")

83 of 98


Model training pipeline

Methodology

3. Model training – train the ASR with the DeepSpeech framework.

Same architecture for all trained models (DeepSpeech):
  • 3 ReLU layers + 1 RNN (LSTM) + 1 ReLU layer
  • Softmax output layer
  • CTC as the loss function
  • Batch sizes: 64 train, 64 test, 16 dev
  • Dropout (RNN): 0.3
  • Learning rate: 0.0005
  • Epochs: 100 (early stopping after 10 epochs with a 0.2 loss delta)

84 of 98


TRAIN SETS

85 of 98


Train set pipeline

2. Balanced train set

CONSTRAINTS:
  • BALANCING CRITERION – which vocal trait?
  • # REFERENCE GROUPS (bins)
  • FIXED-SIZED INTERVALS
  • 100/200 HOURS

(Chart: speakers plotted by Pitch vs. Amplitude, with 14 per pitch interval after balancing.)

RQ2: Impact of balancing vocal traits in the training dataset

86 of 98

Selecting and ranking train sets

Train sets

1. PREMISES

TARGET: 100/200 hours.
DISTRIBUTION: uniform – a similar amount of data for each bin in the train set.
CRITERIA FOR BALANCING: Vocal Models – Pitch, Pitch+Jitter, Spectral Centroid, Spectral Centroid + Loudness; Non-Vocal Models – Gender, Random.
DISCRETIZATION: for vocal models, discretization is performed; the number of bins (reference groups) varies with the vocal traits:
  1. Pitch and Spectral Centroid: 2-10
  2. Pitch+Jitter and Spectral Centroid + Loudness: 4-12

2. UNISCORE
  • Evaluates how close the train set is to the 200/100-hour target.
  • Threshold: 80% (45%)

3. BIN_UNISCORE
  • Evaluates how uniform the distribution of speech data is across bins.
  • Threshold: 80% (45%) – a maximum difference of 20% in the number of hours per bin.
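The slide does not spell out the score formulas, so the following is one plausible reading, offered only as a sketch: UniScore as the ratio between obtained and target hours, BinUniScore as the min/max ratio of hours across bins (a 0.8 threshold then caps the per-bin spread at roughly 20%):

```python
import pandas as pd

def uniscore(total_hours: float, target_hours: float) -> float:
    """Closeness of the train set to the 100/200-hour target (1.0 = exact)."""
    return min(total_hours, target_hours) / max(total_hours, target_hours)

def bin_uniscore(hours_per_bin: pd.Series) -> float:
    """Uniformity of the hours across bins (1.0 = perfectly uniform)."""
    return hours_per_bin.min() / hours_per_bin.max()
```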

87 of 98

| # | Balancing Criterion | # Bins | # Hours | # Utterances | # Speakers |
|---|---------------------|--------|---------|--------------|------------|
| 1 | Gender | 2 | 200 | 147,014 | 12,302 |
| 2 | Pitch | 2 | 200 | 148,788 | 12,730 |
| 3 | Pitch | 3 | 200 | 148,113 | 12,268 |
| 4 | Random | 1 | 200 | 149,266 | 12,818 |
| 5 | Spectral Centroid | 2 | 200 | 149,630 | 12,797 |
| 6 | Spectral Centroid | 3 | 200 | 149,403 | 12,167 |
| 7 | Pitch+Jitter | 2x2=4 | 194.97 | 146,458 | 12,575 |
| 8 | Gender | 2 | 100 | 73,344 | 10,599 |
| 9 | Pitch | 2 | 100 | 74,426 | 11,209 |
| 10 | Pitch | 3 | 100 | 73,996 | 10,709 |
| 11 | Pitch+Jitter | 2x2=4 | 100 | 74,918 | 10,926 |
| 12 | Random | 1 | 100 | 74,899 | 11,251 |
| 13 | Spectral Centroid | 2 | 100 | 74,760 | 11,306 |
| 14 | Spectral Centroid | 3 | 100 | 74,660 | 10,685 |
| 15 | Spectral Centroid | 4 | 100 | 74,419 | 10,348 |
| 16 | Spectral Centroid+Loudness | 2x2=4 | 100 | 76,814 | 10,368 |
| 17 | Pitch | 4 | 97.96 | 72,536 | 10,331 |

Shortlist of train sets

1. 200 H (7 train sets) – UniScore > 0.8 & BinUniScore > 0.8:
  • PITCH: 2, 3 bins
  • SPECTRAL CENTROID: 2, 3 bins
  • PITCH AND JITTER: 2x2 bins
  • GENDER
  • RANDOM

2. 100 H (10 train sets) – UniScore > 0.45 & BinUniScore > 0.45:
  • All combinations in the 200-hours group, plus:
  • Pitch 4
  • Spectral Centroid 4
  • Spectral Centroid and Loudness (2-2)

88 of 98

Are the train sets significantly different? | Methodology

Train sets

E. Final Pool of recordings:
  • 13,400 speakers
  • 585.97 hours

200 H group (7 train sets, ~12,300 speakers each): 90-95% of the speakers are shared between train sets.
100 H group (10 train sets, ~10,800 speakers each): 85-90% of the speakers are shared between train sets.

Are the train sets significantly different?

89 of 98

Are the train sets significantly different? | Methodology

Train sets

A. NUMBER OF FILES FOR EACH SPEAKER IN THE TRAIN SETS
Wilcoxon signed-rank test (α=0.01): for each pair of train sets within each group, build a dataframe with the speakers present in both train sets, and compare the number of files per speaker.

B. FEATURE DISTRIBUTIONS IN THE TRAIN SETS
Mann-Whitney U test (α=0.01): for each pair of train sets, compare the distribution of a specific acoustic feature (pitch, jitter, spectral centroid, and loudness).

METHOD [similar speakers + similar features] > RESULTS [similar speakers AND features]:
  • Pitch+Jitter and Pitch 3 – similar spectral centroid.
  • Pitch 3 and Spectral Centroid 3 – no similar features.
  • Spectral Centroid 2 and the random train set – similar pitch and jitter.
  • Gender and Spectral Centroid 3 – similar spectral centroid and loudness.

No pair of train sets showed a similar distribution of all vocal traits.
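A minimal sketch of test B with SciPy; the dataframe layout and the reading of a non-significant p-value as "similar" are assumptions:

```python
from itertools import combinations
from scipy.stats import mannwhitneyu

ALPHA = 0.01

def compare_feature(train_sets: dict, feature: str) -> None:
    """Mann-Whitney U test on one acoustic feature for every pair of
    train sets; train_sets maps a name to its recordings dataframe."""
    for a, b in combinations(train_sets, 2):
        _, p = mannwhitneyu(train_sets[a][feature], train_sets[b][feature])
        verdict = "similar" if p >= ALPHA else "different"
        print(f"{a} vs. {b} [{feature}]: p={p:.4f} -> {verdict}")
```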

90 of 98

| # | Balancing Criterion | # Bins | # Hours | # Utterances | # Speakers |
|---|---------------------|--------|---------|--------------|------------|
| 1 | Gender | 2 | 200 | 147,014 | 12,302 |
| 2 | Pitch | 2 | 200 | 148,788 | 12,730 |
| 3 | Pitch | 3 | 200 | 148,113 | 12,268 |
| 4 | Random | 1 | 200 | 149,266 | 12,818 |
| 5 | Spectral Centroid | 2 | 200 | 149,630 | 12,797 |
| 6 | Spectral Centroid | 3 | 200 | 149,403 | 12,167 |
| 7 | Pitch+Jitter | 2x2=4 | 194.97 | 146,458 | 12,575 |
| 8 | Gender | 2 | 100 | 73,344 | 10,599 |
| 9 | Pitch | 2 | 100 | 74,426 | 11,209 |
| 10 | Pitch | 3 | 100 | 73,996 | 10,709 |
| 11 | Pitch+Jitter | 2x2=4 | 100 | 74,918 | 10,926 |
| 12 | Random | 1 | 100 | 74,899 | 11,251 |
| 13 | Spectral Centroid | 2 | 100 | 74,760 | 11,306 |
| 14 | Spectral Centroid | 3 | 100 | 74,660 | 10,685 |
| 15 | Spectral Centroid | 4 | 100 | 74,419 | 10,348 |
| 16 | Spectral Centroid+Loudness | 2x2=4 | 100 | 76,814 | 10,368 |
| 17 | Pitch | 4 | 97.96 | 72,536 | 10,331 |

Shortlist of train sets

1. 200 H (7 train sets) – UniScore > 0.8 & BinUniScore > 0.8:
  • PITCH: 2, 3 bins
  • SPECTRAL CENTROID: 2, 3 bins
  • PITCH AND JITTER: 2x2 bins
  • GENDER
  • RANDOM

2. 100 H (10 train sets) – UniScore > 0.45 & BinUniScore > 0.45:
  • All combinations in the 200-hours group, plus:
  • Pitch 4
  • Spectral Centroid 4
  • Spectral Centroid and Loudness (2-2)

91 of 98


BIAS & PERFORMANCE

92 of 98


4. Evaluation

Obtained performance

Methodology

  • 3 test sets
  • WER as the performance measure

WER per # hours

Expected (Chuangsuwanich, 2016):
  • 100 hours – ~+45% WER
  • 200 hours – ~35-40% WER

Obtained (CV test sets):
  • 100 hours – ~+50% WER
  • 200 hours – ~40-45% WER

(The pre-trained DeepSpeech model obtained a 28% test accuracy.)

Ekapol Chuangsuwanich. Multilingual techniques for low resource automatic speech recognition. PhD thesis, 2016.

Expected WER with respect to amount of training data.

01
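For reference, WER can be computed with the jiwer package; the transcripts below are placeholders:

```python
import jiwer

# Placeholder reference transcripts and ASR hypotheses for one test set.
refs = ["how are you", "balance the training data"]
hyps = ["how are you", "balance the train data"]

# Corpus-level WER: (substitutions + deletions + insertions) / reference words.
print(f"WER: {jiwer.wer(refs, hyps):.2%}")
```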

93 of 98


Performance | Performance results

NOTES ON PERFORMANCE
  • 5-7 p.p. WER differences between the 100- and 200-hour models.
  • Vocal models with 2+ reference groups perform better than the non-vocal models (gender and random).

Average WER per model and number of hours

01

RQ2: Impact of balancing vocal traits in the training dataset

94 of 98

Bias | Methodology

Evaluation

Bias is about systematic errors "against" specific sub-groups:
  • Gender – Source: (Garnerin, 2019)
  • Race – Source: (Koenecke, 2020)
  • Age – Source: (Schlögl, 2013)

Unbiased = no significant performance differences between groups.
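This "no significant differences" check can be sketched with a Kruskal-Wallis test; the α=0.06 level is the one used later for the age groups, while the data layout is an assumption:

```python
from scipy.stats import kruskal

ALPHA = 0.06  # significance level used for the age-group comparison

def is_unbiased(wer_by_group: dict) -> bool:
    """Unbiased = no significant WER differences between sub-groups.
    wer_by_group maps a group label to its per-utterance WER values."""
    _, p = kruskal(*wer_by_group.values())
    return p >= ALPHA  # failing to reject -> treat the groups as similar
```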

95 of 98

Bias | Gender

Evaluation

  • The gender model shows the greatest differences between genders: female speakers show, on average, a WER 3.72 p.p. higher than male speakers.

  • Gender, random, pitch+jitter, and spectral centroid + loudness obtained the most significant differences between gender groups [vocal models with 2 criteria and non-vocal models].

  • Spectral Centroid 3, Spectral Centroid 4, and Pitch 4 perform similarly across gender groups [vocal models with 3+ bins and 1 criterion].

Performance per gender group

96 of 98


Bias | Gender

RQ2: Impact of balancing vocal traits in the training dataset

* only validated for 100h.

  • Vocal models with a single balancing criterion produced unbiased models:
    • Pitch 2-4
    • Spectral Centroid 2-4

PERFORMANCE IN GENDER GROUPS

Mean WER between gender groups.

09

| BALANCING CRITERION | MALE | FEMALE | mean DIFF (p.p.) |
|---------------------|------|--------|------------------|
| Gender | 44.37% | 48.09% | -3.72 |
| Pitch 2 | 46.37% | 46.96% | -0.59 |
| Pitch 4* | 49.53% | 49.67% | -0.14 |
| Spectral Centroid 2 | 48.32% | 48.43% | -0.11 |
| Pitch 3 | 46.55% | 46.46% | 0.09 |
| Spectral Centroid 3 | 46.32% | 46.21% | 0.11 |
| Spectral Centroid 4* | 50.46% | 49.86% | 0.60 |
| Random | 49.03% | 48.08% | 0.94 |
| Spectral Centroid + Loudness (2-2)* | 49.71% | 48.05% | 1.65 |
| Pitch + Jitter (2-2) | 47.16% | 45.38% | 1.78 |

97 of 98

Bias | Age

Evaluation

  • Teens show, on average, worse performance than all other age groups in our test sets (7 p.p. above the models' average performance).

  • All models showed significant performance differences between age groups (Kruskal-Wallis test, α=0.06).

  • The differences are motivated by the discrepancies in the representation of age groups in the pool of recordings.

Performance per age group

98 of 98


Differences to the model's mean performance (p.p.):

| Balancing Criterion | Teens | Thirties | Forties | Fifties | Over 60 | Sum of squared diffs |
|---------------------|-------|----------|---------|---------|---------|----------------------|
| Pitch 4 | 6.12 | -2.10 | -4.74 | -1.62 | -4.88 | 96.32 |
| Spectral Centroid 3 | 8.51 | -0.84 | -2.35 | -0.64 | -3.25 | 105.89 |
| Spectral Centroid + Loudness (2-2) | 7.09 | -2.88 | -4.48 | -1.33 | -5.72 | 117.76 |
| Pitch 3 | 6.87 | -2.38 | -3.91 | -2.02 | -6.24 | 120.00 |
| Spectral Centroid 4 | 8.47 | -2.10 | -3.51 | -1.90 | -5.45 | 125.54 |
| Pitch 2 | 7.79 | -2.53 | -4.33 | -2.55 | -5.59 | 129.51 |
| Spectral Centroid 2 | 7.37 | -2.67 | -4.26 | -2.07 | -6.37 | 130.19 |
| Pitch + Jitter (2-2) | 7.26 | -2.52 | -4.66 | -1.85 | -6.35 | 131.18 |
| Random | 7.13 | -3.12 | -4.51 | -2.77 | -6.14 | 132.45 |
| Gender | 5.66 | -3.05 | -5.12 | -3.45 | -7.08 | 134.10 |

Bias | Age

RQ2: Impact of balancing vocal traits in the training dataset

* only validated for 100h.

  • Vocal models with 3+ bins reduced bias against age groups:
    • Pitch 4
    • Spectral Centroid 3

  • Non-vocal models showed the most bias between age groups.

PERFORMANCE IN AGE GROUPS

Mean WER between age groups.

09