1
Fighting bias with a more ethical and reliable description of speech data
André Monforte
October 2021
Impact of Vocal Traits Distribution on Speech Applications’ Performance and Bias
Supervisor: Prof. João Gama
Advisor: Dr. Rui Correia
2
AGENDA
Motivation
01
Problem Discussion
02
Research Hypothesis
03
RQ1: Most Informative Vocal Traits
04
RQ2: Impact of Vocal Traits on ASR Performance and Bias
05
Conclusions and Future Work
06
3
Motivation
Speech Applications
“How are you?”
4
Motivation
Training data + Machine Learning Algorithms = AI
Essential ingredients for speech applications
5
Motivation
Training data + Machine Learning Algorithms = AI + BIAS
Essential ingredients for speech applications
6
Motivation
Bias in Speech
Bias is about systematic errors “against” specific sub-groups.
Gender | Source: (Garnerin, 2019)
Ethnicity | Source: (Koenecke, 2020)
Age | Source: (Schlögl, 2013)
7
Motivation
Bias in Speech | Mitigation techniques
Bias is about systematic errors “against” specific sub-groups.
Gender | Source: (Garnerin, 2019)
INTERVENTION: EQUAL REPRESENTATION OF GENDER GROUPS IN THE TRAINING SET
8
Motivation
Bias in Speech | Mitigation techniques
Bias is about systematic errors “against” specific sub-groups.
INTERVENTION: EQUAL REPRESENTATION OF SOCIAL GROUPS IN THE TRAINING SET
GENDER | ETHNICITY | AGE
9
Balancing speech data over metadata has 2 major limitations:
11
Problem Discussion
Speech Contributors
>
Collection Job
>
Speech data
Speech collection pipeline
13
Problem Discussion
Speech collection | fraud types
Gender Mismatch: profile gender vs. gender detected from speech
Multiple Speakers: 1 account, many speakers
Multiple Accounts: 1 speaker, many accounts
14
Problem Discussion
Speech Contributors
>
Collection Job
>
Speech data
>
Validation Step
Speech content
Background conditions
Nativeness
Speech collection pipeline
15
Problem Discussion
Speech Contributors
>
Collection Job
>
Speech data
>
Validation Step
Speech content
Background conditions
Nativeness
Speech collection pipeline
Self-reported speaker metadata?
…
16
Balancing speech data over metadata has 2 major limitations:
Impact of Vocal Traits Distribution on Speech Applications’ Performance and Bias | Research Plan
17
Problem Discussion
Proxies for actual vocal traits of the speaker
GENDER APPROACH: 50 male speakers, 50 female speakers
[Chart: TYPICAL pitch vs. amplitude distribution per gender]
18
Problem Discussion
Proxies for actual vocal traits of the speaker
GENDER APPROACH: 50 male speakers, 50 female speakers
[Chart: TYPICAL vs. IDEAL pitch vs. amplitude distribution per gender]
19
Problem Discussion
Proxies for actual vocal traits of the speaker
GENDER APPROACH: 50 male speakers, 50 female speakers
[Chart: TYPICAL vs. ACTUAL pitch vs. amplitude distribution per gender]
20
Problem Discussion
Proxies for actual vocal traits of the speaker
GENDER APPROACH: 50 male speakers, 50 female speakers
Balancing data over GENDER is only effective in capturing the typical vocal traits of the group.
[Chart: ACTUAL pitch vs. amplitude distribution, with uneven cluster sizes: 25, 25, 20, 10, 10, 5, 5]
21
Research Hypothesis
Actual vocal representation
WHAT WE PROPOSE:
01. Ignore social groups
[Chart: speakers plotted over vocal trait 1 vs. vocal trait 2]
22
Research Hypothesis
Actual vocal representation
WHAT WE PROPOSE:
01. Ignore social groups
02. Use vocal traits as criterion
[Chart: uneven speaker clusters (25, 25, 20, 10, 10, 5, 5) over vocal trait 1 vs. vocal trait 2]
23
Research Hypothesis
Actual vocal representation
WHAT WE PROPOSE:
01. Ignore social groups
02. Use vocal traits as criterion
03. Actually balance the dataset
[Chart: balanced speaker clusters (14 each) over vocal trait 1 vs. vocal trait 2]
24
Research Hypothesis
Actual vocal representation
WHAT WE PROPOSE:
01. Ignore social groups
02. Use vocal traits as criterion
03. Actually balance the dataset
+ EFFECTIVE, + VERIFIABLE, + ETHICAL
[Chart: balanced speaker clusters (14 each) over vocal trait 1 vs. vocal trait 2]
25
PROXY for individual vocal traits: what people TYPICALLY sound like.
ACTUAL individual vocal traits: what people ACTUALLY sound like.
Research Hypothesis
Actual vocal representation
26
Research Questions
Impact of Vocal Traits Distribution on Speech Applications’ Performance and Bias
27
RQ1: Voice traits that better differentiate speakers
RQ 1
What voice traits better differentiate and characterize speakers?
3-step approach:
A. Pool of acoustic features
B. Vocal traits portraying the same information as gender labels
C. Vocal traits that differentiate speakers
28
RQ1.A | Pool of acoustic features
# | FEATURE | DESCRIPTION | INSIGHT
1 | Spectral Centroid | Centre of mass of the signal | Brightness of a sound signal.
2 | Spectral Spread | Spread of energy around the centre of mass | Average deviation around the centroid.
3 | HNR | Harmonics-to-noise ratio; distance to pure tones | Hoarseness: distance between pure tones and the sound frequencies.
4 | Pitch | Fundamental frequency (F0) | Degree of highness or lowness of a tone.
5 | Jitter | Pitch fluctuations | Variations of the pitch within the utterance.
6 | Shimmer | Amplitude fluctuations | Variations of the amplitude within the utterance.
7 | Energy | Root mean square energy of the signal | Intensity of the signal and its fluctuation within the utterance.
8 | Loudness | Volume | Perceived volume of the signal.
9 | Speaking Rate | Number of words per second | Rhythm of speaking.
RQ1: Voice traits that better differentiate speakers
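For illustration, a few of these features can be computed at the utterance level in a handful of lines. The thesis extracted them with the Surfboard toolkit (see the methodology backup slides); the sketch below uses librosa instead, the input file is hypothetical, and the jitter line is only a crude approximation of the standard measure.

```python
# Hedged sketch: utterance-level versions of a few features from the table,
# computed with librosa (the thesis itself used the Surfboard toolkit).
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical file

centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()  # brightness
spread = librosa.feature.spectral_bandwidth(y=y, sr=sr).mean()   # energy spread
energy = librosa.feature.rms(y=y).mean()                         # RMS energy

f0 = librosa.yin(y, fmin=50, fmax=500, sr=sr)                    # pitch contour (F0)
pitch = np.nanmean(f0)
jitter = np.nanmean(np.abs(np.diff(f0))) / pitch                 # crude pitch-fluctuation proxy

print(dict(centroid=centroid, spread=spread, energy=energy,
           pitch=pitch, jitter=jitter))
```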
32
RQ1.B | Vocal traits portraying the same information as gender labels
01. GENDER CORRELATION: correlation with gender
02. GROUP DIFFERENCES: box plots per measure and gender
MOST PROMISING FEATURES: #1 PITCH, #2 HNR AND JITTER
RQ1: Voice traits that better differentiate speakers
33
# | ITERATION | TEST ACCURACY
1 | Complete dataset | 96.3%
2 | Excluding Pitch | 81.7%
3 | Excluding Pitch + Jitter + HNR | 70.1%
4 | Pitch only | 87.3%
5 | Pitch + Jitter | 89.1%
6 | Pitch + HNR | 81.8%
Test Accuracy results | Gender predictor
03. PREDICTING GENDER FROM VOCAL TRAITS
MOST PROMISING FEATURES: #1 PITCH, #2 JITTER
RQ1.B | Vocal traits portraying the same information as gender labels
RQ1: Voice traits that better differentiate speakers
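A minimal sketch of the ablation behind this table, assuming a speaker-level DataFrame with one column per acoustic feature and a 0/1 gender label (the backup slides mention XGBoost; all column names here are hypothetical).

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

FEATURES = ["pitch", "jitter", "shimmer", "hnr", "spectral_centroid",
            "spectral_spread", "energy", "loudness", "speaking_rate"]

def test_accuracy(df, columns):
    """Train on one feature subset and report held-out accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[columns], df["gender"], test_size=0.2,
        stratify=df["gender"], random_state=0)
    model = XGBClassifier(n_estimators=200, max_depth=4)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))

# Iterations mirroring the table:
# test_accuracy(df, FEATURES)                               # complete dataset
# test_accuracy(df, [f for f in FEATURES if f != "pitch"])  # excluding pitch
# test_accuracy(df, ["pitch"])                              # pitch only
```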
35
RQ1: Voice traits that better differentiate speakers
RQ 1
What voice traits better differentiate and characterize speakers?
3-step approach:
A. Pool of acoustic features
B. Vocal traits portraying the same information as gender labels → Pitch and Jitter
C. Vocal traits that differentiate speakers
36
RQ1.C | Vocal traits that differentiate speakers
RQ1: Voice traits that better differentiate speakers
Standardize acoustic features > Hierarchical clustering > Partitional clustering > Compare mean differences
RQ1.C | Vocal traits that differentiate speakers
RQ1: Voice traits that better differentiate speakers
[Figure 05: Clustering experiment – mean differences in vocal traits]
[Figure 06: Correlation matrix – vocal traits and gender]
40
MOST EFFECTIVE VOCAL TRAITS TO SEPARATE SPEAKERS: #1 SPECTRAL CENTROID, #2 LOUDNESS
RQ1.C | Vocal traits that differentiate speakers
RQ1: Voice traits that better differentiate speakers
[Figure 07: Beyond gender labels | Final hypothesis]
41
RQ 1
What voice traits better differentiate and characterize speakers?
3-step approach:
A. Pool of acoustic features
B. Vocal traits portraying the same information as gender labels → Pitch and Jitter
C. Vocal traits that differentiate speakers → Spectral Centroid and Loudness
RQ1: Voice traits that better differentiate speakers
42
Research Questions
Impact of Vocal Traits Distribution on Speech Applications’ Performance and Bias | Research Plan
43
High-level experimental pipeline
FOR EACH FEATURE IN OUR SHORTLIST:
1. Most informative criterion for balancing > 2. Balanced train set > 3. Train ASR model > 4. Evaluate performance > 5. Compare performance and bias
TRAIN AN ASR USING THE SAME TRAINING PIPELINE
RQ2: Impact of balancing vocal traits in the training dataset
44
1. Most informative criterion for balancing
RQ2: Impact of balancing vocal traits in the training dataset
4 VOCAL MODELS: Pitch, Pitch + Jitter, Spectral Centroid, Spectral Centroid + Loudness
2 NON-VOCAL MODELS: Gender, Random
45
RQ2: Impact of balancing vocal traits in the training dataset
Train set pipeline
2. Balanced train set
CONSTRAINTS: FIXED-SIZED INTERVALS, 100/200 HOURS
[Chart: balanced bins (14 each) over pitch vs. amplitude]
46
Train set pipeline
2. Balanced train set
CONSTRAINTS: FIXED-SIZED INTERVALS, 100/200 HOURS, # REFERENCE GROUPS, BALANCING CRITERION?
[Chart: balanced bins (14 each) over pitch vs. amplitude]
RQ2: Impact of balancing vocal traits in the training dataset
47
Shortlist of train sets: 7 train sets at 200 hours + 10 train sets at 100 hours
RQ2: Impact of balancing vocal traits in the training dataset
48
Performance
RQ2: Impact of balancing vocal traits in the training dataset
[Figure 08: Average WER per model and number of hours in the train set]
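Word Error Rate (WER) drives every comparison that follows. A minimal example with the jiwer package (the deck does not say which scorer was actually used, so this is illustrative):

```python
import jiwer

reference = "how are you"
hypothesis = "how or you"
print(jiwer.wer(reference, hypothesis))  # 1 substitution / 3 words ≈ 0.33
```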
49
Performance | High-performers
HIGH-PERFORMERS: VOCAL MODELS WITH AT LEAST 3 BINS
RQ2: Impact of balancing vocal traits in the training dataset
[Figure 08: Average WER per model, 100 vs. 200 hours in the train set]
50
Performance | Low-performers
LOW-PERFORMERS: NON-VOCAL MODELS + VOCAL MODELS WITH 2 BINS
RQ2: Impact of balancing vocal traits in the training dataset
[Figure 08: Average WER per model, 100 vs. 200 hours in the train set]
51
Performance | Mean WER per criterion and differences to Gender
RQ2: Impact of balancing vocal traits in the training dataset
# | BALANCING CRITERION | 100h | 200h | MEAN
1 | Spectral Centroid + Loudness (2-2)* | -2.09 | - | -2.09
2 | Spectral Centroid 3 | -1.58 | -2.10 | -1.84
3 | Pitch + Jitter (2-2) | -2.27 | -1.19 | -1.73
4 | Pitch 3 | -0.48 | -1.97 | -1.22
5 | Pitch 4* | -1.18 | - | -1.18
6 | Spectral Centroid 4* | -0.98 | - | -0.98
7 | Pitch 2 | 0.15 | -1.36 | -0.61
8 | Spectral Centroid 2 | 0.08 | 1.50 | 0.79
9 | Random | 3.09 | -1.12 | 0.99
* only validated for 100h.
Table 08: Mean WER difference to the Gender model (percentage points).
52
Bias | Gender
RQ2: Impact of balancing vocal traits in the training dataset
BALANCING CRITERION | MALE | FEMALE | MEAN DIFF (pp)
Gender | 44.37% | 48.09% | -3.72
Pitch 2 | 46.37% | 46.96% | -0.59
Pitch 4* | 49.53% | 49.67% | -0.14
Spectral Centroid 2 | 48.32% | 48.43% | -0.11
Pitch 3 | 46.55% | 46.46% | 0.09
Spectral Centroid 3 | 46.32% | 46.21% | 0.11
Spectral Centroid 4* | 50.46% | 49.86% | 0.60
Random | 49.03% | 48.08% | 0.94
Spectral Centroid + Loudness (2-2)* | 49.71% | 48.05% | 1.65
Pitch + Jitter (2-2) | 47.16% | 45.38% | 1.78
* only validated for 100h.
Table 09: Performance in gender groups – mean WER per gender group.
53
Bias | Age
RQ2: Impact of balancing vocal traits in the training dataset
BALANCING CRITERION | TEENS | THIRTIES | FORTIES | FIFTIES | OVER 60 | SUM OF SQUARED DIFFS
Pitch 4 | 6.12 | -2.10 | -4.74 | -1.62 | -4.88 | 96.32
Spectral Centroid 3 | 8.51 | -0.84 | -2.35 | -0.64 | -3.25 | 105.89
Spectral Centroid + Loudness (2-2) | 7.09 | -2.88 | -4.48 | -1.33 | -5.72 | 117.76
Pitch 3 | 6.87 | -2.38 | -3.91 | -2.02 | -6.24 | 120.00
Spectral Centroid 4 | 8.47 | -2.10 | -3.51 | -1.90 | -5.45 | 125.54
Pitch 2 | 7.79 | -2.53 | -4.33 | -2.55 | -5.59 | 129.51
Spectral Centroid 2 | 7.37 | -2.67 | -4.26 | -2.07 | -6.37 | 130.19
Pitch + Jitter (2-2) | 7.26 | -2.52 | -4.66 | -1.85 | -6.35 | 131.18
Random | 7.13 | -3.12 | -4.51 | -2.77 | -6.14 | 132.45
Gender | 5.66 | -3.05 | -5.12 | -3.45 | -7.08 | 134.10
* only validated for 100h.
Table 09: Performance in age groups – WER differences to the mean model performance.
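The two bias readouts above boil down to simple aggregations. A sketch, assuming a hypothetical `results` DataFrame with one row per test utterance (`group` holds the gender or age bucket, `wer` the utterance-level error rate):

```python
import pandas as pd

def mean_wer_per_group(results: pd.DataFrame) -> pd.Series:
    """A model counts as unbiased when these means are close together."""
    return results.groupby("group")["wer"].mean()

def sum_squared_diffs(per_group: pd.Series, baseline: pd.Series) -> float:
    """Age-bias score: squared gaps to a reference (e.g. the mean model's
    per-group WER), summed over groups; lower = more evenly spread."""
    return float(((per_group - baseline) ** 2).sum())
```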
54
Final conclusions
RQ2: What is the impact of balancing such features in the training dataset of a speech application?
A. PERFORMANCE: vocal models with 3+ bins improved performance by 1-2 pp.
B. BIAS: vocal models with 3+ bins and a single balancing criterion were effective in mitigating bias between age and gender groups.
55
CONCLUSIONS AND FUTURE WORK
Research Hypothesis vs. Obtained Results
WHAT WE PROPOSE:
01. Ignore social groups
02. Use vocal traits as criterion
03. Actually balance the dataset
+ EFFECTIVE, + VERIFIABLE, + ETHICAL
[Chart: balanced speaker clusters (14 each) over vocal trait 1 vs. vocal trait 2]
61
CONCLUSIONS AND FUTURE WORK
OBTAINED RESULTS:
01. Ignore social groups
02. Use vocal traits as criterion
03. Actually balance the dataset
#1 PITCH OR SPECTRAL CENTROID + #2 3+ REFERENCE GROUPS → #3 ↑ PERFORMANCE, ↓ BIAS
+ EFFECTIVE, + VERIFIABLE, + ETHICAL
[Chart: balanced speaker clusters over PITCH OR SPECTRAL CENTROID vs. vocal trait 2]
62
FUTURE WORK
TEST THE CONCLUSIONS IN ALTERNATIVE SCENARIOS:
01. Languages
02. Nationalities
03. Recording conditions
04. Training set setups (distribution, # groups, # hours)
63
THANK YOU.
64
Q&A
65
RQ2
RQ1
DATASET
POOL OF RECORDINGS
ASR TRAINING PIPELINE
ASR ARCHITECTURE
TRAIN SET SELECTION
SHORTLIST
METHODOLOGY
SELECTION CRITERIA
EXTRACTION
CV RESULTS
XGBOOST RESULTS
GREY AREA
CLUSTERING PIPELINE
CONSIDERED ITERATIONS
GENDER VS. CLUSTERS
GENERAL PERFORMANCE NOTES
EXPECTED PERFORMANCE
GENDER BIAS
AGE BIAS
Q&A
66
A. ACOUSTIC FEATURES
Vocal Traits Conveying Speaker Characteristics
Methodology
Base Pool of Features > Utterance-level feature extraction > Speaker-level feature extraction > Selecting vocal features
Aggregate the feature results on a speaker level.
Feature Extraction Process
DefinedCrowd Speech Dataset > For each recording, extract the features and aggregate them on an utterance level, using the Surfboard Python toolkit > Aggregate the features on a speaker level.
Methodology
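In pandas terms, the two aggregation hops might look like this. `frames` is a hypothetical frame-level DataFrame (`speaker`, `utterance_id`, one column per feature); the slide does not name the aggregation function, so the mean is assumed:

```python
import pandas as pd

# frames -> utterance level -> speaker level (mean assumed at both hops)
utterance_level = frames.groupby(["speaker", "utterance_id"]).mean()
speaker_level = utterance_level.groupby("speaker").mean()
```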
Selecting and ranking features
Methodology
1. EXCLUSION PREMISES
INTRA-UTTERANCE | Variables with very high variability within the utterance do not have a significant mean (aggregated) value and should be excluded.
INTRA-SPEAKER | Variables with very high variability across all recordings of the same speaker do not have a significant speaker-mean value and should be excluded.
INTER-UTTERANCE | Variables with very low variability across all recordings have low discriminatory power and should be excluded.
INTER-SPEAKER | Variables with very low variability between speakers have low discriminatory power and should be excluded.
2. COEFFICIENT OF VARIATION (CV)
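A sketch of the CV screen, assuming a hypothetical utterance-level DataFrame `utt` (a `speaker` column plus one column per feature); the feature name is illustrative:

```python
import pandas as pd

def cv(s: pd.Series) -> float:
    """Coefficient of variation: standard deviation over absolute mean."""
    return s.std() / abs(s.mean())

def cv_report(utt: pd.DataFrame, feature: str) -> dict:
    return {
        "inter_utterance": cv(utt[feature]),                           # exclude if very low
        "inter_speaker": cv(utt.groupby("speaker")[feature].mean()),   # exclude if very low
        "intra_speaker": utt.groupby("speaker")[feature].apply(cv).mean(),  # exclude if very high
        # intra-utterance CV needs frame-level values within each utterance
    }
```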
Final pool of features
FEATURE | INSIGHT | INTRA-UTTERANCE | INTER-UTTERANCE | INTRA-SPEAKER | INTER-SPEAKER
Spectral Centroid | Centre of mass of the signal | 0.139 | 0.265 | 0.23 | 0.23
Spectral Spread | Spread of energy around the centre of mass | 0.068 | 0.16 | 0.147 | 0.147
Spectral Skewness | Symmetry of the spectrum | 1.388 | - | - | -
Spectral Kurtosis | Flatness of the spectrum | 0.045 | 0.047 | - | -
HNR | Hoarseness; distance to pure tones | - | 0.25 | 0.184 | 0.184
Pitch | Fundamental frequency (F0) | 0.104 | 0.234 | 0.22 | 0.22
Jitter | Pitch fluctuations | - | 0.385 | 0.274 | 0.274
Shimmer | Variations of the amplitude within the utterance | - | 0.557 | 0.304 | 0.304
Energy | Root mean square energy of the signal | - | 0.256 | 0.234 | 0.234
Loudness | Volume | - | 0.859 | 0.772 | 0.772
Speaking Rate | Number of words per second | - | 0.558 | 0.2 | 0.2
Methodology
Final pool of features (after exclusion)
FEATURE | INSIGHT | INTRA-UTTERANCE | INTER-UTTERANCE | INTRA-SPEAKER | INTER-SPEAKER
Spectral Centroid | Centre of mass of the signal | 0.139 | 0.265 | 0.23 | 0.23
Spectral Spread | Spread of energy around the centre of mass | 0.068 | 0.16 | 0.147 | 0.147
HNR | Hoarseness; distance to pure tones | - | 0.25 | 0.184 | 0.184
Pitch | Fundamental frequency (F0) | 0.104 | 0.234 | 0.22 | 0.22
Jitter | Pitch fluctuations | - | 0.385 | 0.274 | 0.274
Shimmer | Variations of the amplitude within the utterance | - | 0.557 | 0.304 | 0.304
Energy | Root mean square energy of the signal | - | 0.256 | 0.234 | 0.234
Loudness | Volume | - | 0.859 | 0.772 | 0.772
Speaking Rate | Number of words per second | - | 0.558 | 0.2 | 0.2
Methodology
72
B. GENDER MIMIC
RQ1.B | Vocal traits portraying the same information as gender labels
RQ1: Voice traits that better differentiate speakers
Predicting gender from vocal traits
FINDINGS
# | ITERATION | TEST ACCURACY
1 | Complete dataset | 96.3%
2 | Excluding Pitch | 81.7%
3 | Excluding Pitch + Jitter + HNR | 70.1%
4 | Pitch only | 87.3%
5 | Pitch + Jitter | 89.1%
6 | Pitch + HNR | 81.8%
74
PITCH GREY AREA
[Figure 04: Pitch grey area | Pitch + Jitter scatter plot]
RQ1.B | Vocal traits portraying the same information as gender labels
RQ1: Voice traits that better differentiate speakers
75
C. BEYOND GENDER
RQ1.C | Separating vocal profiles
RQ1: Voice traits that better differentiate speakers
Standardize features > Hierarchical clustering > Partitional clustering (using the identified number of clusters) > Compare mean differences
RQ1: Voice traits that better differentiate speakers
RQ1.C | Separating vocal profiles
Experiment replicated over:
78
GENDER GROUPS VS. CLUSTERS
RQ1.C | Vocal traits that differentiate speakers
RQ1: Voice traits that better differentiate speakers
79
RQ2 METHODOLOGY
80
ASR Training Pipeline
0. Pool of recordings > 1. Extract acoustic features > 2. Build the train set > 3. Model training > 4. Evaluation
Build a pool of recordings to extract the train sets.
4 features: pitch, jitter, spectral centroid, loudness.
RQ2: Impact of balancing vocal traits in the training dataset
81
Pool of Recordings
A. CommonVoice 6.1 English dataset > B. Initial pool > C. DEV set (30 H) + TEST set (30 H) > D. TRAIN set pool > FEATURE EXTRACTION > E. Final pool of recordings
RQ2: Impact of balancing vocal traits in the training dataset
82
Model training pipeline
Methodology
3. Model Training
Speech signal > Windowing (capture frames of the signal) > Acoustic features (spectrograms) > Acoustic model (phones) > Lexicon (words) > Language model (utterances) > Decoding (search for the most probable sequence) > Utterance: the most probable utterance is output, e.g. “How are you?”
DEEPSPEECH FRAMEWORK (RNNs) | DEEPSPEECH PRE-TRAINED MODELS
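For reference, decoding with Mozilla's DeepSpeech Python API looks like this. The model file names below are the public 0.9.3 releases; the thesis's exact checkpoints are not specified, and the input file is hypothetical:

```python
import wave
import numpy as np
from deepspeech import Model

ds = Model("deepspeech-0.9.3-models.pbmm")                  # acoustic model
ds.enableExternalScorer("deepspeech-0.9.3-models.scorer")   # language model

with wave.open("utterance.wav", "rb") as w:                 # 16 kHz, 16-bit mono expected
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(ds.stt(audio))  # most probable utterance, e.g. "how are you"
```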
83
Model training pipeline
Methodology
3. Model Training
Similar architecture for all trained models (DeepSpeech)
84
TRAIN SETS
85
Train set pipeline
2. Balanced train set
CONSTRAINTS: FIXED-SIZED INTERVALS, 100/200 HOURS, # REFERENCE GROUPS, BALANCING CRITERION?
[Chart: balanced bins (14 each) over pitch vs. amplitude]
RQ2: Impact of balancing vocal traits in the training dataset
Selecting and ranking train sets
Train sets
1. PREMISES
TARGET | 100/200 hours
DISTRIBUTION | Uniform, with a similar amount of data for each bin in the train set.
CRITERIA FOR BALANCING | Vocal models: Pitch, Pitch + Jitter, Spectral Centroid, Spectral Centroid + Loudness. Non-vocal models: Gender, Random.
DISCRETIZATION | For vocal models, discretization is performed; the number of bins (reference groups) varies with the vocal traits.
2. UNISCORE
3. BIN_UNISCORE
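UniScore and BinUniScore rate how uniform a candidate train set is, but their exact definitions are not given in this deck. The sketch below is one plausible reading, an assumption rather than the thesis's formula: 1 minus the total-variation distance to a perfectly uniform distribution over bins (1.0 = perfectly uniform):

```python
import numpy as np

def uniformity_score(hours_per_bin: np.ndarray) -> float:
    """Hypothetical UniScore stand-in: 1 - TV distance to uniform."""
    p = hours_per_bin / hours_per_bin.sum()
    uniform = np.full_like(p, 1 / len(p))
    return 1 - 0.5 * np.abs(p - uniform).sum()

print(uniformity_score(np.array([33.0, 33.0, 34.0])))  # ~0.99, near-uniform
```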
# | BALANCING CRITERION | # BINS | # HOURS | # UTTERANCES | # SPEAKERS
1 | Gender | 2 | 200 | 147,014 | 12,302
2 | Pitch | 2 | 200 | 148,788 | 12,730
3 | Pitch | 3 | 200 | 148,113 | 12,268
4 | Random | 1 | 200 | 149,266 | 12,818
5 | Spectral Centroid | 2 | 200 | 149,630 | 12,797
6 | Spectral Centroid | 3 | 200 | 149,403 | 12,167
7 | Pitch + Jitter | 2*2=4 | 194.97 | 146,458 | 12,575
8 | Gender | 2 | 100 | 73,344 | 10,599
9 | Pitch | 2 | 100 | 74,426 | 11,209
10 | Pitch | 3 | 100 | 73,996 | 10,709
11 | Pitch + Jitter | 2*2=4 | 100 | 74,918 | 10,926
12 | Random | 1 | 100 | 74,899 | 11,251
13 | Spectral Centroid | 2 | 100 | 74,760 | 11,306
14 | Spectral Centroid | 3 | 100 | 74,660 | 10,685
15 | Spectral Centroid | 4 | 100 | 74,419 | 10,348
16 | Spectral Centroid + Loudness | 2*2=4 | 100 | 76,814 | 10,368
17 | Pitch | 4 | 97.96 | 72,536 | 10,331
Shortlist of train sets
Train sets
1. 200 H (7 train sets): UniScore > 0.8 & BinUniScore > 0.8
2. 100 H (10 train sets): UniScore > 0.45 & BinUniScore > 0.45
Are the train sets significantly different? | Methodology
Train sets
E. Final pool of recordings:
07 train sets at 200 H (~12,300 speakers)
10 train sets at 100 H (~10,800 speakers)
> Are the train sets significantly different?
Are the train sets significantly different? | Methodology
Train sets
A. NUMBER OF FILES FOR EACH SPEAKER IN THE TRAIN SETS
Wilcoxon signed-rank test (α=0.01): for each pair of train sets, within each group, build a dataframe with the speakers on both train sets and compare the number of files for each speaker.
B. FEATURE DISTRIBUTIONS IN THE TRAIN SETS
Mann-Whitney U test (α=0.01): for each pair of train sets, compare the distribution of a specific acoustic feature (pitch, jitter, spectral centroid, loudness).
METHOD [similar speakers + similar features] > RESULTS [similar speakers AND features]:
Pitch + Jitter and Pitch 3: similar spectral centroid.
Pitch 3 and Spectral Centroid 3: no similar features.
Spectral Centroid 2 and the Random train set: similar pitch and jitter.
Gender and Spectral Centroid 3: similar spectral centroid and loudness.
No pair of train sets showed a similar distribution of all vocal traits.
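A sketch of the pairwise tests with scipy. `set_a` and `set_b` are hypothetical per-train-set DataFrames, one row per utterance, with a `speaker` column and one column per acoustic feature:

```python
from scipy.stats import mannwhitneyu, wilcoxon

# A. Paired: number of files per speaker, over speakers present in both sets
files_a = set_a.groupby("speaker").size()
files_b = set_b.groupby("speaker").size()
shared = files_a.index.intersection(files_b.index)
_, p_speakers = wilcoxon(files_a[shared], files_b[shared])

# B. Unpaired: distribution of one acoustic feature across the two sets
_, p_pitch = mannwhitneyu(set_a["pitch"], set_b["pitch"])

print("similar speakers:", p_speakers >= 0.01, "| similar pitch:", p_pitch >= 0.01)
```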
91
BIAS & PERFORMANCE
92
4. Evaluation | Obtained performance
Methodology
Obtained (CommonVoice test sets): (the DeepSpeech pre-trained model obtained a 28% test accuracy)
Expected: WER per # hours of training data.
[Figure 01: Expected WER with respect to the amount of training data]
Source: Ekapol Chuangsuwanich. Multilingual Techniques for Low Resource Automatic Speech Recognition. PhD thesis, 2016.
93
Performance | Performance results
NOTES ON PERFORMANCE
[Figure 01: Average WER per model and number of hours]
RQ2: Impact of balancing vocal traits in the training dataset
Bias | Methodology
Evaluation
Bias is about systematic errors “against” specific sub-groups.
Gender | Source: (Garnerin, 2019)
Race | Source: (Koenecke, 2020)
Age | Source: (Schlögl, 2013)
Unbiased = no significant performance differences between groups.
Bias | Gender
Evaluation
Performance per gender group
96
(See Table 09 above: mean WER per gender group.)
Bias | Age
Evaluation
Performance per age group
98
(See Table 09 above: WER differences to the mean model performance, per age group.)