META-LEARNING FRAMEWORK FOR END-TO-END IMPOSTER
IDENTIFICATION IN UNSEEN SPEAKER RECOGNITION
Ashutosh Chaubey, Sparsh Sinha, Susmita Ghose
LG Ad Solutions*
* All the authors were part of LG Ad Solutions while working on this project
- We demonstrate the issue with fixed thresholding under a domain shift and propose a simple speaker-specific thresholding technique for robust imposter identification in unseen speaker identification.
- We also propose a novel meta-learning-based imposter detection network that learns to detect imposters in unseen speaker identification.
- We show the efficacy of the proposed approaches under a domain shift on the FFSVC 2022, VCTK, and far-field-augmented VoxCeleb1 datasets.
- Speaker identification without imposters -
- Speaker identification with a fixed threshold to detect imposters -
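The two settings above can be sketched as follows. This is a minimal illustration, not the poster's exact formulation: it assumes cosine similarity as the scoring function and one reference embedding per enrolled speaker, and the names `cosine`, `identify`, and the toy 2-D embeddings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def identify(query, enrolled, threshold=None):
    """Return the best-matching enrolled speaker, or None for an imposter.

    enrolled maps a speaker id to one reference embedding (assumed here
    to be, for example, the mean of that speaker's enrollment embeddings).
    With threshold=None the closest speaker is always returned, so an
    imposter is necessarily assigned to some enrolled speaker.
    """
    scores = {spk: cosine(query, emb) for spk, emb in enrolled.items()}
    best = max(scores, key=scores.get)
    if threshold is not None and scores[best] < threshold:
        return None  # best score is below the fixed threshold: reject
    return best

# Toy 2-D embeddings, for illustration only.
enrolled = {"spk_a": [1.0, 0.0], "spk_b": [0.0, 1.0]}
print(identify([0.9, 0.1], enrolled))                 # no imposter handling
print(identify([0.6, 0.8], enrolled))                 # always picks a speaker
print(identify([0.6, 0.8], enrolled, threshold=0.9))  # rejected as imposter
```

The single `threshold` argument is shared by every enrolled speaker, which is the property examined next.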
Problem with Fixed Thresholding
- The threshold depends on the intra-speaker and inter-speaker similarities and their distributions.
- Ideally, the threshold should be higher than the inter-speaker similarities, so that incorrect speakers are filtered out, and lower than the intra-speaker similarities, so that correct predictions are not filtered out.
- Since intra-speaker and inter-speaker similarities depend on the dataset, a fixed threshold does not generalize under a domain shift.
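The argument above can be checked on a toy example. The sketch below assumes cosine similarity and uses two made-up sets of 2-D embeddings (`clean` and `far_field`, both hypothetical) to show that a threshold chosen for one similarity distribution can fall outside the usable range of another.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def similarity_range(embeddings):
    """Return (max inter-speaker, min intra-speaker) cosine similarity.

    embeddings maps a speaker id to a list of embedding vectors.
    A threshold separates the speakers only if it lies between the two values.
    """
    items = [(spk, e) for spk, embs in embeddings.items() for e in embs]
    intra, inter = [], []
    for (s1, e1), (s2, e2) in combinations(items, 2):
        (intra if s1 == s2 else inter).append(cosine(e1, e2))
    return max(inter), min(intra)

# Hypothetical domain 1: speakers are far apart in embedding space.
clean = {"a": [[1.0, 0.0], [0.95, 0.1]], "b": [[0.0, 1.0], [0.1, 0.95]]}
# Hypothetical domain 2: all embeddings are crowded into a narrow cone.
far_field = {"a": [[1.0, 0.6], [1.0, 0.7]], "b": [[1.0, 0.9], [1.0, 1.0]]}

threshold = 0.5  # a value that separates the speakers in the first domain
for name, data in [("clean", clean), ("far_field", far_field)]:
    max_inter, min_intra = similarity_range(data)
    usable = max_inter < threshold < min_intra
    print(f"{name}: max inter={max_inter:.3f}, min intra={min_intra:.3f}, "
          f"threshold {threshold} usable: {usable}")
```

In the second toy domain a separating threshold still exists, but it lies in a much narrower and higher range, so the value carried over from the first domain accepts every speaker pair.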
Fig. 2: t-SNE plots of TDNN speaker embeddings. Speaker embeddings for SITW data are well separated, while those for VoxCeleb1 data are denser and more entangled.
Proposed Method-1: Speaker-Specific Thresholding
Proposed Method-2: Imposter Detection Network
Fig. 3: Speaker-specific thresholding for detecting imposters.
Motivation: the threshold for an enrolled speaker s_j should be
- greater than the inter-speaker similarities between the enrollment samples of s_j and all other speakers' enrollment samples.
- less than the intra-speaker similarities among the enrollment samples of s_j.
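One way to satisfy both conditions is sketched below. The poster text does not give the exact formula, so this sketch makes its own assumptions: cosine similarity, a threshold placed at the midpoint between the largest inter-speaker and the smallest intra-speaker similarity, and a query score equal to the mean similarity to a speaker's enrollment samples. The names `speaker_thresholds` and `identify` are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def speaker_thresholds(enrollment):
    """Compute one threshold per enrolled speaker from enrollment data only.

    enrollment maps a speaker id to a list of embeddings. This sketch needs
    at least two speakers and at least two samples per speaker, because the
    intra-speaker similarities are taken over pairs of samples.
    """
    thresholds = {}
    for spk, embs in enrollment.items():
        intra = [cosine(a, b) for a, b in combinations(embs, 2)]
        inter = [cosine(a, b)
                 for a in embs
                 for other, other_embs in enrollment.items() if other != spk
                 for b in other_embs]
        # Assumed rule: midpoint between the two bounds from the motivation.
        thresholds[spk] = 0.5 * (max(inter) + min(intra))
    return thresholds

def identify(query, enrollment, thresholds):
    """Pick the closest speaker, then apply that speaker's own threshold."""
    scores = {spk: sum(cosine(query, e) for e in embs) / len(embs)
              for spk, embs in enrollment.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= thresholds[best] else None

# Toy enrollment set with crowded embeddings, for illustration only.
enrollment = {"spk_a": [[1.0, 0.6], [1.0, 0.7]], "spk_b": [[1.0, 0.9], [1.0, 1.0]]}
thresholds = speaker_thresholds(enrollment)
print(thresholds)
print(identify([1.0, 0.65], enrollment, thresholds))  # close to spk_a's samples
print(identify([1.0, 0.2], enrollment, thresholds))   # far from both speakers
```

Because the thresholds are derived from the enrollment samples themselves, they move with the similarity distribution of the target domain rather than being tuned once on a different dataset.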
- Decouples the tasks of speaker identification (using relation networks) and imposter detection.
- Facilitates end-to-end training of the speaker encoder with the speaker identification and imposter detection backends.
Fig. 4: Meta-learning framework for end-to-end imposter identification (or imposter detection). A relation network is used for speaker identification and an imposter detection network is used for detecting imposters.
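The decoupling can be illustrated with a forward-pass-only sketch. The poster text does not specify the architecture of either head, so everything below is an assumption made for illustration: a relation-network-style head (a small MLP applied to the concatenation of a speaker prototype and the query embedding), a separate imposter head that takes the best-matching prototype and the query as input, the layer sizes, and the 0.5 cut-off. The weights are random and untrained, so the printed values carry no meaning beyond showing the data flow.

```python
import math
import random

def mlp(sizes, rng):
    """Random (untrained) weights and zero biases for a small dense network."""
    return [([[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes, sizes[1:])]

def forward(layers, x):
    """ReLU on hidden layers, sigmoid on the output, so outputs lie in (0, 1)."""
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return [1.0 / (1.0 + math.exp(-v)) for v in x]

def prototype(embs):
    """Mean enrollment embedding of one speaker."""
    return [sum(col) / len(embs) for col in zip(*embs)]

def predict(query, enrollment, relation_net, imposter_net, cutoff=0.5):
    protos = {spk: prototype(embs) for spk, embs in enrollment.items()}
    # Identification head: one relation score per (prototype, query) pair.
    relation = {spk: forward(relation_net, p + query)[0] for spk, p in protos.items()}
    best = max(relation, key=relation.get)
    # Imposter head (hypothetical input): best-matching prototype plus query.
    p_imposter = forward(imposter_net, protos[best] + query)[0]
    label = None if p_imposter > cutoff else best
    return label, relation, p_imposter

rng = random.Random(0)
dim = 2  # toy embedding size
relation_net = mlp([2 * dim, 8, 1], rng)
imposter_net = mlp([2 * dim, 8, 1], rng)
enrollment = {"spk_a": [[1.0, 0.6], [1.0, 0.7]], "spk_b": [[1.0, 0.9], [1.0, 1.0]]}
label, relation, p_imposter = predict([1.0, 0.65], enrollment, relation_net, imposter_net)
print(label, relation, p_imposter)
```

The identification decision depends only on the relation scores and the accept/reject decision only on the imposter head, which is the sense in which the two tasks are decoupled here. Episode construction and joint training of the encoder with both heads are omitted.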
Table 1: Performance of the proposed techniques on VCTK and VoxCeleb1 (with reverb) for unseen speaker identification.
Table 2: Performance of the proposed techniques on the FFSVC 2022 data for unseen speaker identification with M = 5 enrolled speakers.
- The proposed techniques do not perform as well as the baselines when the training and testing conditions are similar and there is no domain shift.
- The proposed approaches work better with two or more samples per speaker in the enrollment set.
- This paper highlights the issues with fixed thresholding for imposter detection in unseen speaker identification.
- We propose two novel techniques for imposter detection, one based on speaker-specific thresholding, and another based on meta-learning and relation networks.
Fig. 1: Imposter detection in speaker identification.