
Enhancing the Robustness of Speech Foundation Models Through Adaptation on Large-Scale Diverse Speech

Jessan Rendell G. Belenzo, Rowel O. Atienza, Ph.D.

Solution

We enhance SpeechGPT’s understanding of diverse speech by continuing its training on varied recordings, and we evaluate its transcription accuracy using word error rate (WER), which counts word substitutions, deletions, and insertions against a reference transcript.
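WER is the word-level edit distance between the model's transcript and the reference, divided by the reference length. A minimal sketch of the computation (a plain dynamic-programming implementation, not the project's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions to match an empty hypothesis
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions to match an empty reference
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitute = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            delete = d[i - 1][j] + 1
            insert = d[i][j - 1] + 1
            d[i][j] = min(substitute, delete, insert)
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words -> WER of 1/6.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

In practice libraries such as jiwer perform this computation, but the formula is the same.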

Problem Statement

Large language models (LLMs) are programs that mimic human language by learning from vast amounts of text. Speech foundation models are LLMs that also comprehend speech. Open-source speech models such as SpeechGPT may struggle with diverse speaking styles or noisy environments because they are trained mostly on audiobook recordings.

Features of Solution

  • Breakdown of recordings into shorter sequences of integers using the mHuBERT speech tokenizer
  • Continued training on diverse speech datasets such as GigaSpeech, VoxPopuli, SPGISpeech, and LibriSpeech
  • Up to 80.8% improvement in SpeechGPT’s WER
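A note on the last bullet: assuming the improvement is reported as a relative WER reduction (the fraction of baseline errors eliminated), it would be computed as follows. The numbers below are purely illustrative, not measurements from the project:

```python
def relative_wer_reduction(wer_baseline: float, wer_adapted: float) -> float:
    """Fraction of the baseline's word errors eliminated by the adapted model."""
    return (wer_baseline - wer_adapted) / wer_baseline

# Illustrative values only: a baseline WER of 0.25 dropping to 0.048
# corresponds to a ~80.8% relative improvement.
print(relative_wer_reduction(0.25, 0.048))
```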

Figure 1: The model architecture of SpeechGPT. The human instruction to transcribe speech is combined with the integer sequence representation of the recording to generate cross-modal inputs. The inputs are fed into SpeechGPT to train the foundation model to interpret the speech.
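The cross-modal input described in Figure 1 can be sketched as a prompt that interleaves the text instruction with the discrete speech units. The template below is a simplified illustration: the bracket labels and the `<sosp>`/`<eosp>` delimiters are assumptions about the prompt format, not necessarily SpeechGPT's exact tokens:

```python
def build_cross_modal_prompt(instruction: str, units: list[int]) -> str:
    """Combine a text instruction with discrete speech units into one prompt.

    `units` is the integer sequence produced by a speech tokenizer
    (e.g. mHuBERT); each unit is rendered as its own token like <407>.
    """
    unit_tokens = "".join(f"<{u}>" for u in units)
    # <sosp>/<eosp> mark the start/end of the speech-unit span (illustrative names).
    return f"[Human]: {instruction} <sosp>{unit_tokens}<eosp> [SpeechGPT]:"

prompt = build_cross_modal_prompt("Transcribe this recording.", [12, 407, 407, 33])
print(prompt)
```

During training, the model sees such prompts paired with the reference transcript as the target continuation, which is what teaches it to interpret the speech units.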



Capstone Project: “Enhancing the Robustness of Speech Foundation Models Through Adaptation on Large-Scale Diverse Speech”

One of the First Graduates of the Master of Engineering in Artificial Intelligence