2 of 26

Frontend Optimization Methods for Robust Speaker Verification

What is Speaker Verification?

02/09/2021

Task of Speaker Recognition

Speaker Identification

Speaker Verification

Figures are from the course Machine Learning for Speech, UEF, Fall 2020.

3 of 26

How Does Modern Speaker Verification Work?

Feature Extractor → Speaker Embedding Extractor → Backend

TDNNs, ResNet34, self-attention pooling, multi-task learning, angular softmax, data augmentation via RIR, MUSAN…

Windowing → DFT → Power Spectrum → Mel Filterbanks → Log → DCT → MFCC
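
A minimal NumPy sketch of the MFCC chain shown above (windowing, DFT, power spectrum, Mel filterbank, log, DCT). The frame length, hop, FFT size, filter count and number of coefficients are illustrative assumptions, not the settings used in this presentation, and all function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular Mel filters, shape (n_filters, n_fft // 2 + 1)."""
    n_bins = n_fft // 2 + 1
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    fbank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (bin_freqs - left) / (center - left)
        falling = (right - bin_freqs) / (right - center)
        fbank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fbank

def dct_matrix(n_ceps, n_filters):
    """First n_ceps rows of the orthonormal DCT-II basis."""
    n = np.arange(n_filters)
    k = np.arange(n_ceps)[:, None]
    basis = np.sqrt(2.0 / n_filters) * np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    basis[0] *= np.sqrt(0.5)
    return basis

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=24, n_ceps=13):
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)        # windowing
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)     # DFT
    power = np.abs(spectrum) ** 2                       # power spectrum
    mel_energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_mel = np.log(mel_energies + 1e-10)              # log compression
    return log_mel @ dct_matrix(n_ceps, n_filters).T    # DCT -> MFCC

rng = np.random.default_rng(0)
features = mfcc(rng.standard_normal(16000))  # one second of noise at 16 kHz
print(features.shape)  # (98, 13)
```

Every stage except the power and the log is a matrix multiplication, which is what makes the chain easy to treat as network layers.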

4 of 26

Why Deep Speaker Embeddings?

Speaker embeddings were dominated by statistical models for years (GMM-SVM, i-vector…)

We have witnessed the amazing (and still evolving) power of DNNs for almost 10 years.

Their wide recognition and adoption started with speech recognition.

Replacing parts of speech processing systems with DNNs seems a natural choice.

Successive examples: d-vector, Deep Speaker, x-vector…

5 of 26

But Why the Frontend, Then?

Feature Extractor → Speaker Embedding Extractor → Backend

Much work so far has focused on DNN embeddings and neural backends

We think the feature extractor, in the context of SV, has been overlooked despite holding huge potential

(and being less computationally expensive…)

People use MFCCs/Mel filterbanks, which do not account for many things such as phase, temporal operations, etc.

Windowing → DFT → Power Spectrum → Mel Filterbanks → Log → DCT → MFCC

6 of 26

Anyway - We First Need to Benchmark

Short-term features, magnitude spectrum

Short-term features, phase spectrum

Short-term features, with long-term processing

Fundamental frequency features

Feature Extractor → x-vector → PLDA

14 Hand-crafted feature extractors

We re-assess……

7 of 26

Anyway - We First Need to Benchmark

Constant-Q cepstral coefficients (CQCCs, Todisco et al., 2016)

Multi-taper MFCCs (Kinnunen et al., 2012)

Spectral centroid frequency coefficients / spectral centroid magnitude coefficients (SCFCs/SCMCs, Kua et al., 2010)

Mel frequency cepstral coefficients (MFCCs, baseline)

Linear prediction cepstral coefficients / perceptual linear prediction cepstral coefficients (LPCCs, Makhoul, 1975; PLPCCs, Hermansky, 1990)
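
To illustrate the multi-taper idea listed above: instead of one windowed periodogram per frame, several tapered periodograms are averaged to reduce the variance of the spectrum estimate. The sketch below assumes sine tapers with uniform weights and six tapers purely for illustration; the taper family, weights and count used in the benchmarked system are not stated on the slide, and the function names are hypothetical.

```python
import numpy as np

def sine_tapers(frame_len, n_tapers):
    """Orthonormal sine tapers, shape (n_tapers, frame_len)."""
    n = np.arange(1, frame_len + 1)
    k = np.arange(1, n_tapers + 1)[:, None]
    return np.sqrt(2.0 / (frame_len + 1)) * np.sin(np.pi * k * n / (frame_len + 1))

def multitaper_power_spectrum(frame, n_tapers=6, n_fft=512):
    """Average of the periodograms obtained with each taper (uniform weights)."""
    frame = np.asarray(frame, dtype=float)
    tapers = sine_tapers(len(frame), n_tapers)
    spectra = np.fft.rfft(tapers * frame, n=n_fft, axis=1)
    return np.mean(np.abs(spectra) ** 2, axis=0)

rng = np.random.default_rng(0)
spectrum = multitaper_power_spectrum(rng.standard_normal(400))
print(spectrum.shape)  # (257,)
```

The resulting spectrum can replace the single-window power spectrum before the Mel filterbank, leaving the rest of the MFCC chain unchanged.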

8 of 26

Anyway - We First Need to Benchmark

Modified group delay function (MGDF, Murthy and Gadde, 2003)

All-pole group delay function (APGDF, Rajan et al., 2013)

Cosine phase function: unwrapping + cosine function (Cosphase, Wu et al., 2012)

Cepstral magnitude-phase octave coefficients (CMPOCs, Yang et al., 2018)
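
The group delay features above start from the negative derivative of the phase spectrum with respect to frequency, which can be computed without explicit phase unwrapping. The sketch below shows only this plain group delay; the cepstral smoothing and compression steps that turn it into MGDF are not shown, and the function name, FFT size and `eps` floor are assumptions.

```python
import numpy as np

def group_delay(frame, n_fft=512, eps=1e-10):
    """Plain group delay: (X_R * Y_R + X_I * Y_I) / |X|^2,
    where X = DFT{x[n]} and Y = DFT{n * x[n]}."""
    frame = np.asarray(frame, dtype=float)
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, n=n_fft)
    Y = np.fft.rfft(n * frame, n=n_fft)
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + eps)

# Sanity check: an impulse delayed by 5 samples has a group delay of 5 at all frequencies.
impulse = np.zeros(64)
impulse[5] = 1.0
tau = group_delay(impulse)
print(np.allclose(tau, 5.0))  # True
```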

9 of 26

Anyway - We First Need to Benchmark

Mean Hilbert envelope coefficients (MHECs, Sadjadi et al., 2012)

Power-normalized cepstral coefficients (PNCCs, Kim et al., 2016)

Also, MFCC+pitch: MFCC base, appended with a 3-dimensional pitch vector (Ghahremani et al., 2014)

10 of 26

Anyway - We First Need to Benchmark

Data: VoxCeleb1

Model: basic x-vector (Snyder et al., 2018)

Two score-level fusion systems: one fuses features from the first category only; the other takes one feature from each category

Metrics: equal error rate (EER) and minimum detection cost function (minDCF)

VoxCeleb1-E Results

Feature	EER (%)	minDCF
MFCC	4.65	0.5937
SCMC	4.57	0.5875
Multi-taper	4.84	0.5459
MFCC+pitch	4.67	0.5223
MFCC+SCMC+Multi-taper	3.89	0.5396

SITW-DEV Results

Feature	EER (%)	minDCF
MFCC	8.12	0.8531
PNCC	6.08	0.7614
SCMC	6.62	0.762
MFCC+cosphase+PNCC	6.24	0.7998
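
For reference, the two metrics reported above can be computed from trial scores roughly as follows. This is a simple threshold-sweep sketch, not the scoring tool used for these results; the function names are hypothetical, and the cost parameters (`p_target=0.01`, `c_miss=c_fa=1`) are assumptions, since the slide does not state them.

```python
import numpy as np

def error_rates(target_scores, nontarget_scores):
    """Miss and false-alarm rates at every candidate threshold."""
    tar = np.asarray(target_scores, dtype=float)
    non = np.asarray(nontarget_scores, dtype=float)
    # Candidate thresholds: every observed score, plus +inf (reject everything).
    thresholds = np.sort(np.concatenate([tar, non, [np.inf]]))
    miss = np.array([(tar < t).mean() for t in thresholds])  # targets rejected
    fa = np.array([(non >= t).mean() for t in thresholds])   # non-targets accepted
    return miss, fa

def equal_error_rate(target_scores, nontarget_scores):
    """Error rate at the threshold where miss and false-alarm rates are closest."""
    miss, fa = error_rates(target_scores, nontarget_scores)
    i = np.argmin(np.abs(miss - fa))
    return 0.5 * (miss[i] + fa[i])

def min_dcf(target_scores, nontarget_scores, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost over all thresholds."""
    miss, fa = error_rates(target_scores, nontarget_scores)
    dcf = c_miss * p_target * miss + c_fa * (1.0 - p_target) * fa
    return dcf.min() / min(c_miss * p_target, c_fa * (1.0 - p_target))

targets = [0.9, 0.8, 0.3, 0.7]
nontargets = [0.1, 0.2, 0.75, 0.4]
print(equal_error_rate(targets, nontargets))  # 0.25
```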

11 of 26

Anyway - We First Need to Benchmark

Feature Extractor → Speaker Embedding Extractor → Backend

Alternatives to MFCCs?

Short-term features, magnitude-based

Short-term features, phase-based

Short-term features with long-term processing

BEST IN-DOMAIN!

MOST ROBUST!

12 of 26

We Have DNNs. Can Data-Driven Be an Option?

A natural idea is to make the feature extractor data-driven, but how to do it is a bit tricky.

A completely end-to-end system is a bit of a ‘black box’

So, we start from something straightforward and (rather) simple.

MFCC extraction is like (or essentially is) a DNN, with a bunch of matrices and non-linearities

Feature Extractor → Speaker Embedding Extractor → Backend

Windowing → DFT → Power Spectrum → Mel Filterbanks → Log → DCT → MFCC

13 of 26

Learnable MFCCs

Windowing → DFT → Power Spectrum → Mel Filterbanks → Log → DCT → MFCC

Feature Extractor → Speaker Embedding Extractor → Backend

14 of 26

Learnable MFCCs

Windowing → DFT → Power Spectrum → Mel Filterbanks → Log → DCT → MFCC

We adapt the four linear operations (windowing, DFT, Mel filterbanks, DCT)

Feature Extractor → Speaker Embedding Extractor → Backend
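
To make the "linear operations" point concrete, the sketch below writes windowing and the DFT as explicit matrices and checks them against NumPy's FFT; the Mel filterbank and the DCT are already matrix multiplications in a standard MFCC extractor. Once written this way, each matrix can serve as the initial value of a trainable layer. The sizes and function names are illustrative assumptions, not the configuration used in this work.

```python
import numpy as np

def window_matrix(frame_len):
    """Windowing as a diagonal matrix holding the Hamming window."""
    return np.diag(np.hamming(frame_len))

def dft_matrix(frame_len, n_fft):
    """Zero-padded one-sided DFT as a complex matrix, shape (n_fft // 2 + 1, frame_len)."""
    k = np.arange(n_fft // 2 + 1)[:, None]
    n = np.arange(frame_len)[None, :]
    return np.exp(-2j * np.pi * k * n / n_fft)

frame_len, n_fft = 400, 512
rng = np.random.default_rng(0)
frame = rng.standard_normal(frame_len)

W = window_matrix(frame_len)
F = dft_matrix(frame_len, n_fft)
spectrum = F @ (W @ frame)                 # two matrix multiplications

reference = np.fft.rfft(frame * np.hamming(frame_len), n=n_fft)
print(np.allclose(spectrum, reference))    # True
```

In a learnable frontend these matrices would be stored as parameters of an autodiff framework and updated by backpropagation together with the embedding extractor.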

15 of 26

Learnable MFCCs

Log

MFCC

Kernel Initialization

Feature Extractor → Speaker Embedding Extractor → Backend

16 of 26

Learnable MFCCs

Windowing → DFT + Power Spectrum → Mel Filterbanks → Log → DCT → MFCC

Loss Regularization (+loss.)

Feature Extractor → Speaker Embedding Extractor → Backend

[1] Y. Zhu and B. Mak, Orthogonality Regularizations for End-to-End Speaker Verification. Odyssey 2020.
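
The slide cites orthogonality regularization [1] for the loss-regularized variant but does not give the exact term. As an assumption, the sketch below shows one common form, the soft orthogonality penalty ||W Wᵀ − I||², which would be added to the training loss with a weight; the function name and the weight `lam` are hypothetical.

```python
import numpy as np

def soft_orthogonality_penalty(W):
    """Squared Frobenius norm of (W @ W.T - I); zero when the rows of W are orthonormal."""
    gram = W @ W.T
    return float(np.sum((gram - np.eye(W.shape[0])) ** 2))

# A kernel with orthonormal rows incurs no penalty; a scaled one does.
print(soft_orthogonality_penalty(np.eye(3)))        # 0.0
print(soft_orthogonality_penalty(2.0 * np.eye(2)))  # 18.0

# Hypothetical use in training: total_loss = task_loss + lam * soft_orthogonality_penalty(W)
```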

17 of 26

Learnable MFCCs

Windowing → DFT + Power Spectrum → Mel Filterbanks → Log → DCT → MFCC

Kernel Update (+kernel.)

Feature Extractor → Speaker Embedding Extractor → Backend

18 of 26

Learnable MFCCs

Windowing → DFT + Power Spectrum → Mel Filterbanks → Log → DCT → MFCC

6.09% EER on SITW

4.33% EER on Vox-1

0.7689 minDCF on SITW

0.4971 minDCF on Vox-1

+kernel.

+loss.

9.7% rel. lower

6.7% rel. lower

18.1% rel. lower

Baseline	EER/minDCF
Vox-1 test	4.64%/0.6071
SITW	6.72%/0.8243

Feature Extractor → Speaker Embedding Extractor → Backend

19 of 26

Robustness of Features Against Recent Challenges

Building on the earlier work, we apply the data-driven ideas to improve the robustness of the feature extractor.

We expect optimization and simplification techniques to be useful

We performed three case studies, resulting in three pieces of work

Feature Extractor → Speaker Embedding Extractor → Backend

Multi-Taper

PCEN&PCMN

Filterbank

PNCCs

20 of 26

Robustness of Features Against Recent Challenges

Feature Extractor → Speaker Embedding Extractor → Backend

Multi-Taper

PNCCs

Kernel Initialization

PCEN&PCMN

Filterbank

21 of 26

Robustness of Features Against Recent Challenges

Feature Extractor → Speaker Embedding Extractor → Backend

PNCCs

Multi-Taper

Kernel Initialization

PCEN&PCMN

Filterbank

22 of 26

How About Their Robustness Against Recent Challenges?

Feature Extractor → Speaker Embedding Extractor → Backend

PNCCs

Multi-Taper

VoxCeleb1-{E,H}

VoxMovies (new!)

PCEN&PCMN

Filterbank

23 of 26

Key Take-Aways

Temporal/Long-term operations can result in more robust features for deep speaker verification

As a pilot study, making certain parts of the MFCC pipeline data-driven, with some tricks for updating them, can improve system-level robustness

The feature extractor has huge potential, with a number of open challenges and research opportunities, especially for mismatched conditions (a recently hot topic)

I did not achieve as much as I expected during the first 1.5 years of my PhD.

24 of 26

What’s Next?

Feature Extractor → Speaker Embedding Extractor → Backend

Speech attributes

Signal processing/Filtering

Temporal/long-term operations

More robust kernels/architectures

General Speaker Verification

Scenario-Specific Speaker Verification

With applications to…

25 of 26

Papers Mentioned in This Presentation

X. Liu, M. Sahidullah and T. Kinnunen, “A Comparative Re-Assessment of Feature Extractors for Deep Speaker Embeddings,” Proc. Interspeech 2020.

X. Liu, M. Sahidullah and T. Kinnunen, “Learnable MFCCs for Speaker Verification,” 2021 IEEE International Symposium on Circuits and Systems (ISCAS), 2021.

M. Sahidullah et al., “UIAI System for Short-Duration Speaker Verification Challenge 2020,” 2021 IEEE Spoken Language Technology Workshop (SLT), 2021.

X. Liu, M. Sahidullah and T. Kinnunen, “Optimizing Multi-Taper Features for Deep Speaker Verification,” submitted to IEEE Signal Processing Letters, 2021.

X. Liu, M. Sahidullah and T. Kinnunen, “Optimized Power Normalized Cepstral Coefficients Towards Robust Deep Speaker Verification,” IEEE ASRU 2021.

X. Liu, M. Sahidullah and T. Kinnunen, “Parameterized Channel Normalization for Far-Field Deep Speaker Verification,” IEEE ASRU 2021.

26 of 26

THANKS FOR LISTENING!

For questions, please either ask on Mattermost or email:

xuechen.liu@inria.fr

(I’m not in Nancy anymore, but I’m not going anywhere either)

25/11/2020