Enhancing the Robustness of Speech Foundation Models Through Adaptation on Large-Scale Diverse Speech
Jessan Rendell G. Belenzo, Rowel O. Atienza, Ph.D.
Solution
We enhance SpeechGPT’s robustness to diverse speech by fine-tuning it on varied recordings and evaluate its transcription accuracy using word error rate (WER), which counts substitutions, deletions, and insertions relative to the reference transcript.
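To make the metric concrete, a minimal WER computation can be sketched with a standard word-level edit distance (this is the generic textbook formulation, not the project's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost, # substitution or match
            )
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("the cat sat", "the bat sat")` is 1/3: one substitution over three reference words.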
Problem Statement
Large language models (LLMs) are programs that mimic human language by learning from vast amounts of text. Speech foundation models are LLMs that also comprehend speech. Open-source speech models such as SpeechGPT may struggle with diverse speaking styles or noisy environments because they are trained mostly on audiobook recordings.
Features of Solution
Figure 1: The model architecture of SpeechGPT. The human instruction to transcribe speech is combined with the integer sequence representation of the recording to generate cross-modal inputs. The inputs are fed into SpeechGPT to train the foundation model to interpret the speech.
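The cross-modal input described in Figure 1 can be illustrated with a small sketch: discrete speech units are rendered as tokens and concatenated with the human instruction into a single prompt. The token names (`<sosp>`, `<eosp>`, the role tags) follow common discrete-unit conventions and are assumptions for illustration, not the project's actual implementation:

```python
def build_prompt(units: list[int], instruction: str = "Transcribe the speech.") -> str:
    """Combine an instruction with a discrete speech-unit sequence into one prompt.

    `units` are integer indices (e.g. from a quantized speech encoder); the
    special tokens used here are hypothetical placeholders.
    """
    # Wrap the integer unit sequence in speech boundary tokens.
    speech_seq = "<sosp> " + " ".join(f"<{u}>" for u in units) + " <eosp>"
    # Interleave the text instruction and the speech tokens as one cross-modal input.
    return f"[Human]: {instruction} {speech_seq} <eoh> [SpeechGPT]:"

prompt = build_prompt([412, 87, 903])
```

The resulting string is what a decoder-only model would consume during training, so the same vocabulary covers both text tokens and speech-unit tokens.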
Capstone Project: “Enhancing the Robustness of Speech Foundation Models Through Adaptation on Large-Scale Diverse Speech”
One of the first graduates of the Master of Engineering in Artificial Intelligence program