1 of 15

Algorithms that Recognize Uncommon Spoken Languages

Cheng-I Jeff Lai

MIT Horizon 2022

2 of 15

A Challenge for Automatic Speech Recognition

  • UNESCO: “Indigenous Language Decade” (2022-2032)
  • ~7000 spoken/unwritten languages/dialects
    • Not all of them are supported by Google/Siri/YouTube!

3 of 15

A Challenge for Automatic Speech Recognition

  • Objective: less human annotation, more automated machine learning
  • Benefits:
    • Data Scalability: much lower annotation costs (across different languages & conditions!)
    • Model Scalability: simpler model development
    • Better results!

4 of 15

Conventional Automatic Speech Recognition

Towards End-to-End Speech Recognition (Li et al., ISCSLP 2018 Tutorial)

5 of 15

Building a Speech Recognizer with 10 Minutes of Data

  • A speech recognizer trained on just 10 minutes of labeled data rivals the best supervised models from two years ago, which were trained on 1,000 hours of labeled data

Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition (Zhang et al., arXiv 2020)

6 of 15

Core Technology: Self-Supervised Pre-Training

  • Step 1: Learning without human annotated labels

Raw speech waveforms
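
The label-free part of Step 1 can be sketched with NumPy alone: slice a raw waveform into frames and mask random spans that the model must then learn to predict. This is a minimal illustration, not the real model; the frame sizes, the ~6.5% mask-start probability, and the 10-frame span length are stand-ins inspired by wav2vec 2.0-style configurations.

```python
import numpy as np

rng = np.random.default_rng(0)

def frame_waveform(wave, frame_len=400, hop=320):
    """Slice a raw waveform into overlapping frames (~25 ms windows,
    ~20 ms hop, at 16 kHz)."""
    n = 1 + (len(wave) - frame_len) // hop
    return np.stack([wave[i * hop : i * hop + frame_len] for i in range(n)])

def mask_spans(num_frames, span_len=10, mask_prob=0.065):
    """Pick random span starts, then mask each start plus the next
    span_len - 1 frames. Hyperparameters are illustrative."""
    mask = np.zeros(num_frames, dtype=bool)
    starts = rng.random(num_frames) < mask_prob
    for s in np.flatnonzero(starts):
        mask[s : s + span_len] = True
    return mask

wave = rng.standard_normal(16000)   # 1 second of fake 16 kHz audio
frames = frame_waveform(wave)
mask = mask_spans(len(frames))
# The masked frames are what pre-training asks the model to recover --
# no human labels are involved at any point.
print(frames.shape, int(mask.sum()), "frames masked")
```

Because the prediction targets come from the audio itself, this step scales to unlimited unlabeled speech in any language.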

7 of 15

Core Technology: Fine-Tuning with Minimal Data

  • Step 2: fine-tuning with a minimal amount (e.g. 10 minutes) of labeled data
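
A minimal sketch of what such fine-tuning can look like, assuming the pre-trained encoder is frozen and only a small softmax head is trained on its frame-level features. The 768-dimensional features, the class count, and the random data below are illustrative placeholders, not any paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend these are frame-level features from a frozen pre-trained encoder
# (768 dims, as in wav2vec 2.0 Base; here they are just random stand-ins).
feat_dim, num_classes, num_labeled = 768, 30, 200
X = rng.standard_normal((num_labeled, feat_dim))
y = rng.integers(0, num_classes, num_labeled)

W = np.zeros((feat_dim, num_classes))   # only this small head is trained
lr = 0.1
for _ in range(100):
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(num_labeled), y] -= 1.0              # softmax cross-entropy grad
    W -= lr * (X.T @ p) / num_labeled                # gradient-descent step

acc = (np.argmax(X @ W, axis=1) == y).mean()
print(f"training accuracy on the tiny labeled set: {acc:.2f}")
```

The point of the design: because the heavy lifting was done during unlabeled pre-training, the labeled set only has to teach a small output mapping, which is why minutes of transcripts can suffice.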

8 of 15

Core Technology: Transformer Architecture

Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (Baevski et al. NeurIPS 2020)
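
wav2vec 2.0 trains its Transformer with a contrastive (InfoNCE-style) objective: the context vector for a masked frame should be more similar to that frame's true quantized target than to distractor targets. A simplified NumPy sketch of that loss, with illustrative dimensions and temperature:

```python
import numpy as np

rng = np.random.default_rng(2)

def info_nce(context, target, distractors, temperature=0.1):
    """Contrastive loss: pull the context vector toward its true target
    and away from distractors, via temperature-scaled cosine similarity.
    (A simplified sketch of a wav2vec 2.0-style objective.)"""
    cands = np.vstack([target[None, :], distractors])   # positive is row 0
    cos = cands @ context / (
        np.linalg.norm(cands, axis=1) * np.linalg.norm(context)
    )
    logits = cos / temperature
    logits -= logits.max()                              # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

d = 256
target = rng.standard_normal(d)
context = target + 0.1 * rng.standard_normal(d)  # context that predicts well
distractors = rng.standard_normal((100, d))

loss_good = info_nce(context, target, distractors)
loss_bad = info_nce(rng.standard_normal(d), target, distractors)
print(f"aligned context loss {loss_good:.3f} vs random context loss {loss_bad:.3f}")
```

Minimizing this loss forces the network to encode what distinguishes each stretch of speech from the rest, without any transcription.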

9 of 15

Scaling up to Speech Recognition in 53 Languages

Unsupervised Cross-lingual Representation Learning for Speech Recognition (Conneau et al. Interspeech 2021)

10 of 15

Scaling up to ASR and Speech Translation in 128 Languages

XLS-R: Self-Supervised Cross-Lingual Speech Representation Learning at Scale (Babu et al., arXiv 2021)

11 of 15

Efficiency and Universal Benchmark for Self-Supervised Learning

PARP: Prune, Adjust, Re-Prune for Self-Supervised Speech Recognition (Lai et al. NeurIPS 2021)
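
PARP's loop can be sketched in a few lines: prune the pre-trained weights by magnitude, fine-tune while letting the pruned entries receive updates ("adjust", so wrongly pruned weights can recover), and periodically re-prune. The toy below uses a random matrix and a random placeholder gradient in place of a real encoder and backpropagation through an ASR loss.

```python
import numpy as np

rng = np.random.default_rng(3)

def magnitude_mask(w, sparsity):
    """True for entries to KEEP: the largest-magnitude (1 - sparsity) fraction."""
    k = int(sparsity * w.size)
    thresh = np.partition(np.abs(w).ravel(), k)[k]
    return np.abs(w) >= thresh

# Toy stand-in for one pre-trained weight matrix; PARP actually targets the
# full self-supervised encoder.
w = rng.standard_normal((64, 64))
sparsity = 0.5

mask = magnitude_mask(w, sparsity)               # 1) Prune: initial subnetwork
for step in range(50):
    w = np.where(mask, w, 0.0)                   # zero pruned weights this step
    w -= 0.01 * rng.standard_normal(w.shape)     # 2) Adjust: the update touches
                                                 #    ALL entries, so pruned ones
                                                 #    can grow back
    if step % 10 == 9:
        mask = magnitude_mask(w, sparsity)       # 3) Re-prune adjusted weights

w_final = np.where(mask, w, 0.0)                 # apply the final subnetwork
print(f"final sparsity: {(w_final == 0.0).mean():.2f}")
```

The design choice worth noting is step 2: unlike one-shot pruning, letting pruned weights revive before re-pruning makes the final subnetwork less sensitive to a bad initial mask.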

12 of 15

Efficiency and Universal Benchmark for Self-Supervised Learning

SUPERB: Speech processing Universal PERformance Benchmark (Yang et al. Interspeech 2021)

13 of 15

An Open-sourced & Ongoing Effort

Toolkits on GitHub:

Ongoing Open Challenges:

    • SUPERB @ AAAI 2022
      • https://superbbenchmark.org/
    • Zero-Resource Speech @ NeurIPS 2021, AAAI 2022
      • https://www.zerospeech.com/

14 of 15

An Open-sourced & Ongoing Effort

  • Self-supervised learning under more challenging conditions (background noise, speaker variation), e.g. the CHiME Challenge
  • Efficiency in training and deployment
  • Applications beyond automatic speech recognition:
    • Automatic speech-to-speech translation without intermediate text
    • Multi-modal learning (audio-visual retrieval, lip-reading speech recognition)
    • Self-supervised/multi-modal speaker recognition
    • Speech synthesis/voice conversion
    • and many more…

15 of 15

Thank you!

clai24@mit.edu