1 of 20

Automatic voice onset time estimation from reassignment spectra (Stouten and Van hamme, 2009)

Cheonkam Jeong

2 of 20

Introduction

Problem: The settings (i.e., sliding window size and shift) used in canonical ASR fail to capture acoustic events that occur at a finer time scale, such as Voice Onset Time (VOT)

VOT: the time interval between the release of the plosive and the onset of voicing of the following vowel; a primary acoustic cue for distinguishing plosives in many languages, such as English
A limitation of traditional ASR: timing is difficult to model at different scales

Goal: to provide an algorithm that considers phone-level features, such as VOT ⇒ better performance...
Method: the reassigned time-frequency representation (RTFR), a high resolution signal analysis method

3 of 20

Spectral reassignment

Time-frequency reassignment (Auger and Flandrin, 1995, among others): sharpens the localization of the signal components by reallocating their energy distribution in the time-frequency plane
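The idea can be illustrated for a single analysis frame. The following is a minimal sketch of the standard reassignment operators, not the authors' implementation: the function name, the FFT size, and the finite-difference approximation of the window derivative are all assumptions made for illustration.

```python
import numpy as np

def reassign_frame(x, start, win_len=128, n_fft=256, fs=16000):
    """Return reassigned times (s), reassigned frequencies (Hz), and power
    for one frame of x beginning at sample index `start`.
    All parameter defaults are illustrative assumptions."""
    m = np.arange(win_len)
    h = np.hamming(win_len)                 # analysis window
    center = (win_len - 1) / 2.0
    th = (m - center) * h                   # time-weighted window
    dh = np.gradient(h)                     # approximate window derivative (per sample)

    frame = np.asarray(x[start:start + win_len], dtype=float)
    X_h = np.fft.rfft(frame * h, n_fft)
    X_th = np.fft.rfft(frame * th, n_fft)
    X_dh = np.fft.rfft(frame * dh, n_fft)

    power = np.abs(X_h) ** 2
    safe = np.maximum(power, 1e-12)         # avoid division by zero
    omega = 2 * np.pi * np.arange(n_fft // 2 + 1) / n_fft  # bin centers, rad/sample

    # Energy of each bin is moved from (frame center, bin center) to (t_hat, f_hat).
    t_hat = (start + center + np.real(X_th * np.conj(X_h)) / safe) / fs
    f_hat = (omega - np.imag(X_dh * np.conj(X_h)) / safe) * fs / (2 * np.pi)
    return t_hat, f_hat, power

# A 1010 Hz tone falls between bin centers, yet the peak bin is reassigned close to 1010 Hz.
fs = 16000
tone = np.cos(2 * np.pi * 1010 * np.arange(512) / fs)
_, tone_f_hat, tone_power = reassign_frame(tone, start=100)
tone_peak_hz = tone_f_hat[np.argmax(tone_power)]

# A single click is reassigned to its own sample position in every bin.
click = np.zeros(512)
click[140] = 1.0
click_t_hat, _, _ = reassign_frame(click, start=100)
```

In a full RTFR, the power of every bin in every frame would be accumulated at its reassigned coordinates on a fine time-frequency grid; that accumulation step is omitted here.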

4 of 20

Spectral reassignment

8 ms Hamming window, shifted by 0.625 ms per analysis frame; at a sampling frequency of 16 kHz, this corresponds to 128 and 10 samples, respectively
256 equally spaced frequency bins for reassignment
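The sample counts follow directly from the durations and the sampling frequency. A short check, with variable names chosen here for illustration:

```python
fs_hz = 16000        # sampling frequency
window_ms = 8.0      # Hamming window length
shift_ms = 0.625     # frame advance

window_samples = round(window_ms * fs_hz / 1000)   # 128
shift_samples = round(shift_ms * fs_hz / 1000)     # 10
frames_per_second = fs_hz / shift_samples          # 1600 analysis frames per second
```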

5 of 20

Properties of the VOT

Factors that affect VOT values: place of articulation, speech rate, context, position within the word, lexical stress, gender, etc.
Problem: voiceless stops and high f0 ⇒ shorter VOT; voiced stops ⇒ longer VOT in conversational speech. Because the distributions overlap, the relation between VOT value and plosive identity is not straightforward.
Solution: only consider plosives that are uttered in a constrained way

6 of 20

Data sets

Data: TIMIT (Garofolo et al., 1990)
Targets: 6 plosives ( /p, t, k, b, d, g/)
Four data sets: “forced,” “manual,” “free,” and “test”

7 of 20

Data sets - “forced”

Segment boundaries obtained by forced alignment with an HMM-based speech recognizer, using the manually verified phonetic transcriptions
Irrespective of the left and the right phonetic context
The acoustic models: context-independent HMMs with 2-4 states per phone
The speech features: mel-scaled log-filterbank outputs
Closure models shared according to the voicing of the targets
Segment boundaries for the plosive: burst only

8 of 20

Data sets - “free”

Fully automatic VOT extraction setting
Plosive segment candidates generated by a phonetic automatic speech recognizer as described in Demuynck et al. (2006), applied to the same utterances used in the “forced” data set
The HMMs described above are used to find the best-matching phonetic transcription, with a phone-level bigram language model with Witten-Bell smoothing (Witten and Bell, 1991)
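For readers unfamiliar with Witten-Bell smoothing, the sketch below shows one common interpolated form for a phone bigram: P(w | h) = (c(h, w) + T(h) · P_uni(w)) / (c(h) + T(h)), where T(h) is the number of distinct phones seen after h. The function name and the toy phone sequences are hypothetical, and the exact variant used by the recognizer is not stated in the slides.

```python
from collections import Counter, defaultdict

def witten_bell_bigram(sequences):
    """Build an interpolated Witten-Bell bigram model from phone sequences.
    Returns a probability function prob(cur, prev) and the sorted vocabulary."""
    unigrams = Counter()
    bigrams = defaultdict(Counter)
    for seq in sequences:
        unigrams.update(seq)
        for prev, cur in zip(seq, seq[1:]):
            bigrams[prev][cur] += 1
    total = sum(unigrams.values())

    def prob(cur, prev):
        p_uni = unigrams[cur] / total          # maximum-likelihood unigram
        followers = bigrams.get(prev)
        if not followers:                      # unseen history: fall back to unigram
            return p_uni
        c_h = sum(followers.values())          # count of the history
        t_h = len(followers)                   # distinct phone types following it
        return (followers[cur] + t_h * p_uni) / (c_h + t_h)

    return prob, sorted(unigrams)

# Toy data (hypothetical), only to exercise the function.
toy = [["sil", "p", "ae", "t"], ["sil", "b", "ae", "d"], ["sil", "p", "iy"]]
wb_prob, wb_vocab = witten_bell_bigram(toy)
wb_total = sum(wb_prob(w, "sil") for w in wb_vocab)
```

Unseen bigrams receive non-zero probability through the unigram term, while the distribution for each seen history still sums to one.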

9 of 20

Data sets - “manual”

A subset of the plosive speech segments selected from the “forced” set
VOTs manually measured by experts

10 of 20

Data sets - “test”

Constructed exactly like the “forced” data set, except that the sentences are taken from the TIMIT test set

11 of 20

The VOT estimation algorithm

Step 1: candidate plosive segments are detected and segment boundaries are generated
Step 2: the burst onset is detected based on “burst power”
Step 3: the onset of voicing is found based on “periodicity”

12 of 20

The VOT estimation algorithm - detection of plosive segments

Step 1: candidate plosive segments are detected and segment boundaries are generated
Using the HMM-based automatic speech recognizer described earlier
Run with (“forced”) or without (“free”) phonetic knowledge of the test utterance
The burst onset search starts 2.5 ms (4 frames) before the burst segment start found by the recognizer

13 of 20

The VOT estimation algorithm - burst onset detection

Step 2: the burst onset is detected based on “burst power”
The RTFR power in the corresponding frequency bins is summed to form the “burst power” p(n)

p(n) > p(n - j), for j = -1, 1, and 2 (local maximum)
p(n) - p(n - i) > p_m(n) for i = 2, . . . , 5 (sufficiently sharp and strong peak), where p_m(n) is taken to be the mean of p(n) over 150 plosive frames
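The two criteria above can be sketched as a simple frame scan. This is an illustration under assumptions, not the authors' code: the function name is hypothetical, the sharpness condition is assumed to hold for every i in 2..5, the threshold p_m is passed in as a single precomputed value, and the first frame meeting both criteria is taken as the burst onset.

```python
import numpy as np

def find_burst_onset(p, p_m, start=0):
    """Return the first frame index n that is a local maximum of the burst
    power p and rises by more than p_m over frames n-2..n-5, else None.
    Assumption: the sharpness test must hold for all i in 2..5."""
    p = np.asarray(p, dtype=float)
    for n in range(max(start, 5), len(p) - 1):
        is_local_max = all(p[n] > p[n - j] for j in (-1, 1, 2))
        is_sharp = all(p[n] - p[n - i] > p_m for i in range(2, 6))
        if is_local_max and is_sharp:
            return n
    return None

# Synthetic burst power: a low floor, a fast rise, then decay.
burst_power = [0.1] * 10 + [0.5, 3.0, 1.0] + [0.2] * 5
onset_frame = find_burst_onset(burst_power, p_m=1.0)   # frame 11
no_onset = find_burst_onset([0.1] * 20, p_m=1.0)       # None: no peak present
```

With a 0.625 ms frame advance, the returned frame index converts to time as n × 0.625 ms.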

14 of 20

The VOT estimation algorithm - start of periodicity

Step 3: the onset of voicing is found based on “periodicity”
A short term autocorrelation computed by multiplying every RTFR frame (for every 0.625 ms frame advance)

The autocorrelation function: takes a large value where a substantial amount of energy is periodically repeated within the analysis frame
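To make the notion of "periodicity" concrete, here is a generic normalized short-term autocorrelation on a waveform frame. It is explicitly not the RTFR-based computation of the paper; the function name, the f0 search range, and the frame length are assumptions chosen for illustration.

```python
import numpy as np

def periodicity_score(frame, fs=16000, f0_min=80.0, f0_max=400.0):
    """Maximum normalized autocorrelation over lags in the assumed f0 range.
    Values near 1 indicate strongly periodic (voiced-like) frames."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    min_lag = int(fs / f0_max)
    max_lag = min(int(fs / f0_min), len(frame) - 1)
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        a, b = frame[:-lag], frame[lag:]
        denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
        if denom > 0:
            best = max(best, float(np.dot(a, b) / denom))
    return best

fs = 16000
n = np.arange(400)                                   # 25 ms frame
voiced_like = np.sin(2 * np.pi * 125 * n / fs)       # periodic, 128-sample period
noise_like = np.random.default_rng(0).standard_normal(400)
voiced_score = periodicity_score(voiced_like, fs)
noise_score = periodicity_score(noise_like, fs)
```

A voicing onset could then be taken as the first frame after the burst whose score exceeds some threshold; any such threshold would be a tuning choice not given in the slides.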

15 of 20

Experiments - Algorithm performance for phonetic studies

The absolute difference between the manually and the automatically extracted VOT estimates

smaller than 10 ms (76.1%), smaller than 20 ms (91.4%), and smaller than 30 ms (96.2%)
Average deviation: largest for /d/, followed by /k/, /g/, /t/, /p/, and /b/
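The tolerance figures above are fractions of segments whose absolute error falls below each threshold. A minimal sketch of that computation follows; the function name is hypothetical and the VOT values are made up purely to exercise the code, not taken from the paper.

```python
import numpy as np

def tolerance_rates(manual_ms, auto_ms, tolerances_ms=(10, 20, 30)):
    """Fraction of segments with |manual - automatic| below each tolerance."""
    err = np.abs(np.asarray(manual_ms, float) - np.asarray(auto_ms, float))
    return {tol: float(np.mean(err < tol)) for tol in tolerances_ms}

# Illustrative values only (ms): absolute errors are 5, 15, 35, and 1.
rates = tolerance_rates([50, 60, 20, 15], [45, 75, 55, 14])
```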

16 of 20

Experiments - Algorithm performance for ASR

The absolute difference between the manual and fully automatic VOT estimates, analyzed on the “free” data set

smaller than 10 ms (72.6%), smaller than 20 ms (87.8%), and smaller than 30 ms (93.8%)

Only 16 (= 582 - 566) of the 582 plosives from the “manual” set could not be found automatically, far fewer than 53 (9.2% of 582)

17 of 20

Experiments - Estimated VOTs

Factors such as gender and phonetic context are considered with respect to the voicing dimension rather than place of articulation
VOT: female > male
Right context

18 of 20

Experiments - VOT as a feature for ASR

P(V | l, p, r): the probability that the estimated VOT falls in bin V for plosive p with left context l and right context r
P(V | l, p̄, r): the same probability for the plosive p̄ with the opposite voicing
The log-likelihood ratio:
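Given the two probabilities just defined, the ratio is presumably log P(V | l, p, r) - log P(V | l, p̄, r). The sketch below computes such a ratio from per-bin VOT histograms; the function name, the add-one smoothing, and the histogram counts are assumptions for illustration, since the slides do not say how the bin probabilities are estimated.

```python
import numpy as np

def vot_llr(counts_p, counts_p_bar, vot_bin, smoothing=1.0):
    """Log-likelihood ratio for a VOT bin, from histogram counts of plosive p
    and of its opposite-voicing counterpart in one (l, r) context.
    Additive smoothing is an assumption to keep empty bins finite."""
    counts_p = np.asarray(counts_p, dtype=float)
    counts_p_bar = np.asarray(counts_p_bar, dtype=float)
    prob_p = (counts_p + smoothing) / (counts_p.sum() + smoothing * counts_p.size)
    prob_p_bar = (counts_p_bar + smoothing) / (
        counts_p_bar.sum() + smoothing * counts_p_bar.size
    )
    return float(np.log(prob_p[vot_bin]) - np.log(prob_p_bar[vot_bin]))

# Hypothetical histograms over four VOT bins (short to long).
voiceless_counts = [1, 2, 10, 20]
voiced_counts = [20, 10, 2, 1]
llr_long = vot_llr(voiceless_counts, voiced_counts, vot_bin=3)    # favors voiceless
llr_short = vot_llr(voiceless_counts, voiced_counts, vot_bin=0)   # favors voiced
```

A positive value supports the hypothesized plosive, and a negative value supports its counterpart with opposite voicing.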

19 of 20

Experiments - VOT as a feature for ASR

In an attempt to improve the phone recognition rate by exploiting the VOT as a feature, phone lattices were generated on the TIMIT test data as described in (Demuynck et al., 2006)
When dealing with the “test” data set, the left and right phonetic contexts are not unique; thus, we denote the set of phone labels of arcs ending (or starting) in the starting (or ending) node of arc A by L (or R) and sum the statistics over all contexts of A allowed by the lattice

20 of 20

Experiments - VOT as a feature for ASR

The single free parameter α was tuned on the “forced” data set, which reduced the phone error rate from 26.70% to 26.53% on the TIMIT test set
Performance assessment

The VOT feature contributed only slightly to the error rate improvement (26.70% ⇒ 26.53%)
Best obtainable error rate when the voicing of the plosives in the first-best path through the phone lattice is corrected using the reference transcription ⇒ 25.85%
(26.70 - 26.53)/(26.70 - 25.85) = 20% of the performance gain achievable using an ideal voicing detector
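The 20% figure can be reproduced from the three error rates on the slide. Variable names are chosen here for illustration:

```python
baseline_per = 26.70   # phone error rate without the VOT feature (%)
with_vot_per = 26.53   # phone error rate with the VOT feature (%)
oracle_per = 25.85     # error rate with plosive voicing corrected from the reference (%)

achieved_gain = baseline_per - with_vot_per     # 0.17 points
achievable_gain = baseline_per - oracle_per     # 0.85 points
gain_fraction = achieved_gain / achievable_gain

print(f"{gain_fraction:.0%}")  # prints 20%
```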