1 of 26

Frontend Optimization Methods for Robust Speaker Verification

Xuechen Liu
PhD Student, MULTISPEECH & CSG@UEF
MULTISPEECH Weekly, 2021.09.02

-

1

02/09/2021

2 of 26

What is Speaker Verification?

Task of Speaker Recognition

Speaker Identification

Speaker Verification

Figures are from the course Machine Learning for Speech, UEF, Fall 2020.

3 of 26

How Does Modern Speaker Verification Work?

Feature Extractor → Speaker Embedding Extractor → Backend

TDNNs, ResNet34, self-attention pooling, multi-task learning, angular softmax, data augmentation via RIR, MUSAN…

Windowing → DFT → Power Spectrum → Mel Filterbanks → Log → DCT → MFCC
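The MFCC chain above can be sketched in a few lines of NumPy. This is a minimal illustration, not the configuration used in the talk: the frame length, hop, FFT size, number of filters and number of coefficients are assumed values, and the filters are triangles that are uniform on the mel axis (one common variant).

```python
import numpy as np

def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters, uniform on the mel axis, shape (n_filters, n_bins)."""
    to_mel = lambda hz: 2595.0 * np.log10(1.0 + hz / 700.0)
    bin_mels = to_mel(np.linspace(0.0, sample_rate / 2, n_bins))[None, :]
    edges = np.linspace(to_mel(0.0), to_mel(sample_rate / 2), n_filters + 2)
    lower, center, upper = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    rising = (bin_mels - lower) / (center - lower)
    falling = (upper - bin_mels) / (upper - center)
    return np.maximum(0.0, np.minimum(rising, falling))

def dct_matrix(n_ceps, n_filters):
    """First n_ceps rows of the orthonormal DCT-II basis."""
    n = np.arange(n_filters)
    k = np.arange(n_ceps)[:, None]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    basis *= np.sqrt(2.0 / n_filters)
    basis[0] /= np.sqrt(2.0)
    return basis

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=24, n_ceps=20):
    """Return an (n_frames, n_ceps) MFCC matrix. All defaults are assumptions."""
    signal = np.asarray(signal, dtype=float)
    # Windowing: slice into overlapping frames and apply a Hamming window
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # DFT followed by the power spectrum
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    # Mel filterbank energies
    mel_energies = power @ mel_filterbank(n_filters, n_fft // 2 + 1, sample_rate).T
    # Log compression (small floor avoids log(0))
    log_mel = np.log(mel_energies + 1e-10)
    # DCT gives the cepstral coefficients
    return log_mel @ dct_matrix(n_ceps, n_filters).T
```

For example, `mfcc(np.random.default_rng(0).standard_normal(16000))` returns one 20-dimensional vector per 10 ms hop for one second of 16 kHz audio.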

4 of 26

Why Deep Speaker Embeddings?

  • Speaker embeddings were dominated by statistical models for years (GMM-SVM, i-vector…)

  • We have witnessed the amazing (and still evolving) power of DNNs for almost 10 years.

  • Their wide recognition and application started with speech recognition.

  • Replacing parts of speech processing systems with DNNs seems a natural choice.

  • Successive examples: d-vector, Deep Speaker, x-vector…

5 of 26

But Why the Frontend, Then?

Feature Extractor → Speaker Embedding Extractor → Backend

  • Most work so far has focused on DNN embeddings and neural backends

  • We think the feature extractor, in the context of SV, has been overlooked despite holding huge potential (and being less computationally demanding…)

  • People use MFCCs/Mel filterbanks, which ignore many things such as phase, temporal operations, etc.

Windowing → DFT → Power Spectrum → Mel Filterbanks → Log → DCT → MFCC

6 of 26

Anyway - We First Need to Benchmark

We re-assess 14 hand-crafted feature extractors:

  • Short-term features, magnitude spectrum

  • Short-term features, phase spectrum

  • Short-term features, with long-term processing

  • Fundamental frequency features

Feature Extractor → x-vector → PLDA

7 of 26

Anyway - We First Need to Benchmark

  • Constant-Q cepstral coefficients (CQCCs, Todisco et al., 2016)

  • Multi-taper MFCCs (Kinnunen et al., 2012)

  • Spectral centroid frequency coefficients / spectral centroid magnitude coefficients (SCFCs/SCMCs, Kua et al., 2010)

  • Mel frequency cepstral coefficients (MFCCs, baseline)

  • Linear prediction cepstral coefficients / perceptual linear prediction cepstral coefficients (LPCCs, Makhoul, 1975; PLPCCs, Hermansky, 1990)
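Of the features listed above, the multi-taper idea is easy to sketch: instead of one windowed periodogram, several orthogonal tapers are applied to the same frame and the resulting spectra are averaged, which lowers the variance of the estimate. The sketch below assumes the sine taper family with uniform weights and six tapers; these choices are illustrative and are not taken from the slides.

```python
import numpy as np

def sine_tapers(frame_len, n_tapers):
    """Sine tapers h_k[n] = sqrt(2/(N+1)) * sin(pi*k*n/(N+1)), n = 1..N, shape (K, N)."""
    n = np.arange(1, frame_len + 1)
    k = np.arange(1, n_tapers + 1)[:, None]
    return np.sqrt(2.0 / (frame_len + 1)) * np.sin(np.pi * k * n / (frame_len + 1))

def multitaper_power_spectrum(frame, n_tapers=6, n_fft=512, weights=None):
    """Weighted average of the power spectra obtained with each taper."""
    frame = np.asarray(frame, dtype=float)
    tapers = sine_tapers(len(frame), n_tapers)
    if weights is None:
        # Uniform weights (assumption); other weightings are possible
        weights = np.full(n_tapers, 1.0 / n_tapers)
    spectra = np.abs(np.fft.rfft(tapers * frame, n=n_fft, axis=1)) ** 2
    return weights @ spectra
```

The output can replace the single-window power spectrum in an MFCC pipeline; the mel filterbank, log and DCT steps stay the same.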


8 of 26

Anyway - We First Need to Benchmark

  • Modified group delay function (MGDF, Murthy and Gadde, 2003)

  • All-pole group delay function (APGDF, Rajan et al., 2013)

  • Cosine phase function (Cosphase, Wu et al., 2012): unwrapping + cosine function

  • Cepstral magnitude-phase octave coefficients (CMPOCs, Yang et al., 2018)
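The phase-based features above build on the group delay function, the negative derivative of the phase spectrum. It can be computed without explicit phase unwrapping from two DFTs, one of x[n] and one of n·x[n]. The sketch below shows only this plain group delay; the modified and all-pole variants cited on the slide add further processing that is not reproduced here, and the `eps` floor is an assumption to keep the division stable.

```python
import numpy as np

def group_delay(frame, n_fft=512, eps=1e-10):
    """Plain group delay: (X_R*Y_R + X_I*Y_I) / |X|^2, with Y the DFT of n*x[n]."""
    frame = np.asarray(frame, dtype=float)
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, n=n_fft)
    Y = np.fft.rfft(n * frame, n=n_fft)
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + eps)
```

A quick sanity check: a unit impulse delayed by five samples has a group delay of five samples at every frequency.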

9 of 26

Anyway - We First Need to Benchmark

  • Mean Hilbert envelope coefficients (MHECs, Sadjadi et al., 2012)

  • Power-normalized cepstral coefficients (PNCCs, Kim et al., 2016)

  • Also, MFCC+pitch: MFCC base with a 3-dimensional pitch vector appended (Ghahremani et al., 2014)

10 of 26

Anyway - We First Need to Benchmark

  • Data: VoxCeleb1

  • Model: Basic x-vector (Snyder et al., 2018)

  • Two score-level fusion systems: one combines features from the first category only, the other takes one feature from each category

  • Metrics: Equal error rate (EER) and minimum detection cost function (minDCF)
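The two metrics named above can be computed from target (same-speaker) and non-target (different-speaker) trial scores. The sketch below is a simple threshold sweep; it ignores tied scores, and the minDCF operating point (`p_target=0.01`, unit costs) is an assumption, since the slides do not state which one was used.

```python
import numpy as np

def error_rates(target_scores, nontarget_scores):
    """Miss and false-alarm rates for a threshold placed after each sorted score."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    labels = labels[np.argsort(scores, kind="mergesort")]
    # Rejecting the i lowest scores: misses are rejected targets,
    # false alarms are the non-targets that remain accepted.
    p_miss = np.concatenate([[0.0], np.cumsum(labels) / len(target_scores)])
    p_fa = np.concatenate([[1.0], 1.0 - np.cumsum(1 - labels) / len(nontarget_scores)])
    return p_miss, p_fa

def eer(target_scores, nontarget_scores):
    """Equal error rate: the point where miss and false-alarm rates meet."""
    p_miss, p_fa = error_rates(target_scores, nontarget_scores)
    i = np.argmin(np.abs(p_miss - p_fa))
    return 0.5 * (p_miss[i] + p_fa[i])

def min_dcf(target_scores, nontarget_scores, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost over all thresholds (assumed operating point)."""
    p_miss, p_fa = error_rates(target_scores, nontarget_scores)
    dcf = c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
    return dcf.min() / min(c_miss * p_target, c_fa * (1.0 - p_target))
```

Both functions return fractions, so an EER of 0.0465 corresponds to the 4.65% style of reporting used in the result tables.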

Voxceleb1-E Results

Feature                  EER (%)   minDCF
MFCC                     4.65      0.5937
SCMC                     4.57      0.5875
Multi-taper              4.84      0.5459
MFCC+pitch               4.67      0.5223
MFCC+SCMC+Multi-taper    3.89      0.5396

SITW-DEV Results

Feature                  EER (%)   minDCF
MFCC                     8.12      0.8531
PNCC                     6.08      0.7614
SCMC                     6.62      0.762
MFCC+cosphase+PNCC       6.24      0.7998
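Rows such as MFCC+SCMC+Multi-taper are score-level fusions: each system scores the same trial list and the scores are combined per trial. The slides do not say how the fusion weights were obtained, so the sketch below assumes the simplest option, an equal-weight average of standardized scores; `standardize` and `fuse_scores` are hypothetical helper names.

```python
import numpy as np

def standardize(scores):
    """Shift and scale one system's scores to zero mean and unit variance."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / scores.std()

def fuse_scores(score_lists, weights=None):
    """Weighted sum of standardized scores; every list covers the same trials in the same order."""
    normalized = np.stack([standardize(s) for s in score_lists])
    if weights is None:
        # Equal weights (assumption); trained weights are a common alternative
        weights = np.full(len(score_lists), 1.0 / len(score_lists))
    return np.asarray(weights, dtype=float) @ normalized
```

Standardizing first keeps a system with a wide score range from dominating the sum.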

11 of 26

Anyway - We First Need to Benchmark

Alternatives to MFCCs?

Feature Extractor → Speaker Embedding Extractor → Backend

  • Short-term features, magnitude-based: BEST IN-DOMAIN!

  • Short-term features, phase-based

  • Short-term features with long-term processing: MOST ROBUST!

12 of 26

We Have DNNs. Can Data-Driven Be an Option?

  • A natural idea is to make the feature extractor data-driven, but how to do it is a bit tricky.

  • A complete end-to-end system is a bit of a ‘black box’

  • So, we start from something straightforward and (rather) simple.

  • MFCCs are like (or essentially are) a DNN, with a bunch of matrices and non-linearities

Feature Extractor → Speaker Embedding Extractor → Backend

Windowing → DFT → Power Spectrum → Mel Filterbanks → Log → DCT → MFCC
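The "MFCC as a DNN" view can be made concrete by writing each step as a weight array followed by a fixed non-linearity. In the sketch below the four arrays are initialized from the standard window, DFT, mel and DCT definitions, so the forward pass reproduces an ordinary MFCC; in a learnable frontend these arrays would become trainable parameters. The sizes and the helper names (`init_kernels`, `mfcc_forward`) are assumptions for illustration.

```python
import numpy as np

def mel_matrix(n_filters, n_bins, sample_rate):
    """Triangular filters, uniform on the mel axis, shape (n_filters, n_bins)."""
    to_mel = lambda hz: 2595.0 * np.log10(1.0 + hz / 700.0)
    bin_mels = to_mel(np.linspace(0.0, sample_rate / 2, n_bins))[None, :]
    edges = np.linspace(to_mel(0.0), to_mel(sample_rate / 2), n_filters + 2)
    lower, center, upper = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    rising = (bin_mels - lower) / (center - lower)
    falling = (upper - bin_mels) / (upper - center)
    return np.maximum(0.0, np.minimum(rising, falling))

def dct_matrix(n_ceps, n_filters):
    """First n_ceps rows of the orthonormal DCT-II basis."""
    n = np.arange(n_filters)
    k = np.arange(n_ceps)[:, None]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    basis *= np.sqrt(2.0 / n_filters)
    basis[0] /= np.sqrt(2.0)
    return basis

def init_kernels(frame_len=400, n_fft=512, n_filters=24, n_ceps=20, sample_rate=16000):
    """The four linear operations, initialized from their standard definitions."""
    k = np.arange(n_fft // 2 + 1)[:, None]
    n = np.arange(frame_len)[None, :]
    return {
        "window": np.hamming(frame_len),             # diagonal weight
        "dft": np.exp(-2j * np.pi * k * n / n_fft),  # complex DFT matrix
        "mel": mel_matrix(n_filters, n_fft // 2 + 1, sample_rate),
        "dct": dct_matrix(n_ceps, n_filters),
    }

def mfcc_forward(frame, kernels, eps=1e-10):
    """One frame through the chain: linear maps with two fixed non-linearities."""
    h = kernels["window"] * frame          # windowing
    h = np.abs(kernels["dft"] @ h) ** 2    # DFT, then squared magnitude
    h = np.log(kernels["mel"] @ h + eps)   # mel filterbank, then log
    return kernels["dct"] @ h              # DCT
```

With the default initialization, `mfcc_forward(frame, init_kernels())` matches the FFT-based computation of the same frame up to floating-point error.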

13 of 26

Learnable MFCCs

Windowing → DFT → Power Spectrum → Mel Filterbanks → Log → DCT → MFCC

Feature Extractor → Speaker Embedding Extractor → Backend

14 of 26

Learnable MFCCs

Windowing → DFT → Power Spectrum → Mel Filterbanks → Log → DCT → MFCC

We adapt the four linear operations (windowing, DFT, Mel filterbanks, DCT)

Feature Extractor → Speaker Embedding Extractor → Backend

15 of 26

Learnable MFCCs

Kernel Initialization

Log

MFCC

Feature Extractor → Speaker Embedding Extractor → Backend

16 of 26

Learnable MFCCs

Windowing → DFT + Power Spectrum → Mel Filterbanks → Log → DCT → MFCC

Loss Regularization (+loss.)

Feature Extractor → Speaker Embedding Extractor → Backend

[1] Y. Zhu and B. Mak, Orthogonality Regularizations for End-to-End Speaker Verification. Odyssey 2020.
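The exact regularizer from the slide is not recoverable here, so the following is only a sketch of one common form from the orthogonality-regularization family that [1] refers to: a soft penalty on how far a kernel's Gram matrix is from the identity, added to the training loss with a weight. Applying it to the DCT kernel and the weight `lam` are assumptions for illustration.

```python
import numpy as np

def dct_matrix(n_ceps, n_filters):
    """First n_ceps rows of the orthonormal DCT-II basis."""
    n = np.arange(n_filters)
    k = np.arange(n_ceps)[:, None]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    basis *= np.sqrt(2.0 / n_filters)
    basis[0] /= np.sqrt(2.0)
    return basis

def soft_orthogonality_penalty(w):
    """Squared Frobenius norm of (W W^T - I); zero when the rows of W are orthonormal."""
    gram = w @ w.T
    return np.sum((gram - np.eye(w.shape[0])) ** 2)

def regularized_loss(task_loss, w, lam=1e-3):
    """Task loss plus the weighted penalty; lam is a hypothetical hyperparameter."""
    return task_loss + lam * soft_orthogonality_penalty(w)
```

A kernel initialized from the standard DCT starts with a penalty of (numerically) zero, and the penalty grows as training moves the rows away from orthonormality.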

17 of 26

Learnable MFCCs

Windowing → DFT + Power Spectrum → Mel Filterbanks → Log → DCT → MFCC

Kernel Update (+kernel.)

Feature Extractor → Speaker Embedding Extractor → Backend

18 of 26

Learnable MFCCs

Windowing → DFT + Power Spectrum → Mel Filterbanks → Log → DCT → MFCC

Test set     Metric    Baseline   Learnable   Rel. lower
SITW         EER (%)   6.72       6.09        9.7%
Vox-1 test   EER (%)   4.64       4.33        6.7%
SITW         minDCF    0.8243     0.7689      6.7%
Vox-1 test   minDCF    0.6071     0.4971      18.1%

Variants annotated on the slide: +kernel., +kernel., +loss.

Feature Extractor → Speaker Embedding Extractor → Backend
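The "rel. lower" figures on this slide are relative reductions with respect to the baseline, that is, (baseline − value) / baseline. A short check using the numbers printed on the slide:

```python
def relative_reduction(baseline, value):
    """Relative reduction in percent: 100 * (baseline - value) / baseline."""
    return 100.0 * (baseline - value) / baseline

# (baseline, learnable) pairs as printed on the slide
results = {
    "SITW EER (%)": (6.72, 6.09),
    "Vox-1 test EER (%)": (4.64, 4.33),
    "SITW minDCF": (0.8243, 0.7689),
    "Vox-1 test minDCF": (0.6071, 0.4971),
}

for name, (baseline, value) in results.items():
    print(f"{name}: {relative_reduction(baseline, value):.1f}% relative reduction")
```

With the figures as printed, the last three pairs round to the 6.7%, 6.7% and 18.1% shown on the slide, while the SITW EER pair gives about 9.4% rather than 9.7%.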

19 of 26

Robustness of Features Against Recent Challenges

  • Building on the earlier work, we apply the data-driven ideas to improve the robustness of the feature extractor.

  • We expect optimization and simplification techniques to be useful

  • We performed three case studies, resulting in three pieces of work

Feature Extractor → Speaker Embedding Extractor → Backend

Multi-Taper, PCEN&PCMN Filterbank, PNCCs
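As an illustration of one of the case studies named above, here is a sketch of per-channel energy normalization (PCEN) in its commonly published form: each filterbank channel is divided by a smoothed version of its own energy (an automatic gain control) and then root-compressed. The parameter values are commonly quoted defaults and are assumptions here; the parameterization studied in the talk may differ.

```python
import numpy as np

def pcen(energies, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """PCEN on non-negative filterbank energies of shape (n_frames, n_channels)."""
    energies = np.asarray(energies, dtype=float)
    # First-order IIR smoother along time, one state per channel
    smoothed = np.empty_like(energies)
    smoothed[0] = energies[0]
    for t in range(1, len(energies)):
        smoothed[t] = (1.0 - s) * smoothed[t - 1] + s * energies[t]
    # Gain normalization followed by root compression
    return (energies / (eps + smoothed) ** alpha + delta) ** r - delta ** r
```

Because `s`, `alpha`, `delta` and `r` are plain numbers in a differentiable expression, they are natural candidates for being learned together with the rest of the network.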

20 of 26

Robustness of Features Against Recent Challenges

Feature Extractor → Speaker Embedding Extractor → Backend

Multi-Taper, PNCCs, PCEN&PCMN Filterbank

Kernel Initialization

21 of 26

Robustness of Features Against Recent Challenges

Feature Extractor → Speaker Embedding Extractor → Backend

PNCCs, Multi-Taper, PCEN&PCMN Filterbank

Kernel Initialization

22 of 26

How about their Robustness Against Recent Challenges?

Feature Extractor → Speaker Embedding Extractor → Backend

PNCCs, Multi-Taper, PCEN&PCMN Filterbank

Evaluation sets: VoxCeleb1-{E,H}, VoxMovies (new!)

23 of 26

Key Take-Aways

  • Temporal/long-term operations can result in more robust features for deep speaker verification

  • As a pilot study, making certain parts of the MFCC pipeline data-driven, with some tricks for updating them, can improve system-level robustness

  • The feature extractor has huge potential, with many open challenges and research opportunities, especially for mismatched conditions (a recently hot topic)

  • I failed to achieve as much as I expected during the first 1.5 years of my PhD.

24 of 26

What’s Next?

Feature Extractor → Speaker Embedding Extractor → Backend

  • Speech attributes

  • Signal processing/filtering

  • Temporal/long-term operations

  • More robust kernels/architectures

With applications to…

  • General speaker verification

  • Scenario-specific speaker verification

25 of 26

Papers Mentioned in This Presentation

  • X. Liu, M. Sahidullah and T. Kinnunen, "A Comparative Re-Assessment of Feature Extractors for Deep Speaker Embeddings," Proc. Interspeech 2020.

  • X. Liu, M. Sahidullah and T. Kinnunen, "Learnable MFCCs for Speaker Verification," 2021 IEEE International Symposium on Circuits and Systems (ISCAS), 2021.

  • M. Sahidullah et al., "UIAI System for Short-Duration Speaker Verification Challenge 2020," 2021 IEEE Spoken Language Technology Workshop (SLT), 2021.

  • X. Liu, M. Sahidullah and T. Kinnunen, "Optimizing Multi-Taper Features for Deep Speaker Verification," submitted to IEEE Signal Processing Letters, 2021.

  • X. Liu, M. Sahidullah and T. Kinnunen, "Optimized Power Normalized Cepstral Coefficients Towards Robust Deep Speaker Verification," IEEE ASRU 2021.

  • X. Liu, M. Sahidullah and T. Kinnunen, "Parameterized Channel Normalization for Far-Field Deep Speaker Verification," IEEE ASRU 2021.

26 of 26

THANKS FOR LISTENING!

For questions, please ask on Mattermost or email:

xuechen.liu@inria.fr

(I’m not in Nancy anymore, but I’m not going anywhere either)
