1 of 19


Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition

Srijith Radhakrishnan, Huck Yang, Sumeer Ahmad Khan, Rohit Kumar, Narsis A. Kiani, David Gomez-Cabrero, Jesper N. Tegner


Could You Recognize this Speech?

[Context] at a noisy party

[Context] referring to someone who has not been uninvited or excluded from an event or social circle


“He has not been dropped.”

Slide sourced from Dr. Huck Yang's ASRU 2023 tutorial on LLMs for ASR


Open Questions During this Talk

Do we need better ears or a better brain to recognize speech?

  • Could audio and text work together in a single cross-modal correction framework?


LLMs are Generative Error Correctors


Whispering-LLaMA: Prompt-Template


Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting, ASRU 2023, Yang et al.

Related Zero-Shot Correction Ref.
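To make the prompt-template idea concrete, here is a minimal sketch of how an error-correction prompt could be assembled from an n-best hypothesis list. The wording below is illustrative; it is not the exact template used in the paper or in the Task-Activating Prompting reference.

```python
# Hypothetical error-correction prompt built from n-best ASR hypotheses,
# in the spirit of the Whispering-LLaMA prompt template. The instruction
# wording here is an illustrative assumption, not the paper's template.

def build_correction_prompt(hypotheses):
    """Format an n-best hypothesis list into an instruction prompt."""
    lines = [
        "Below are the n-best transcriptions of an utterance from an "
        "automatic speech recognition system.",
        "Generate the most likely correct transcription.",
        "",
    ]
    for i, hyp in enumerate(hypotheses, start=1):
        lines.append(f"Hypothesis {i}: {hyp}")
    lines.append("Corrected transcription:")
    return "\n".join(lines)

prompt = build_correction_prompt([
    "he has not been dropped",
    "he has not been drops",
    "he is not been dropped",
])
print(prompt)
```

The LLM then completes the prompt after "Corrected transcription:", turning error correction into ordinary text generation.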


Whispering-LLaMA (1/2): Model-Level


Whispering-LLaMA (2/2): Layer-Level


N-best Hypotheses Dataset
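One way to picture such a dataset: each utterance is stored together with its beam-search candidates and the reference transcript. The field names below are illustrative assumptions, not the actual schema of the paper's (or HyPoradise's) release.

```python
# Hypothetical record layout for an n-best hypotheses dataset: each
# utterance keeps its ASR beam-search candidates plus the reference
# transcript. Field names are illustrative, not the released schema.
import json

record = {
    "utterance_id": "utt_0001",          # assumed identifier format
    "n_best": [                          # candidates, best-first
        "he has not been dropped",
        "he has not been drops",
        "he is not been dropped",
    ],
    "reference": "he has not been dropped",
}

# Serialize one record per line (JSONL-style storage).
serialized = json.dumps(record)
```

Pairing the n-best list with the reference lets the LLM be trained (or prompted) to map noisy hypothesis sets to corrected transcripts.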


Model Performance


Whispering-LLaMA Conclusion (1/2)

  • We propose a multimodal, layer-level fusion that combines the Whisper decoder and the LLaMA decoder to improve speech transcripts.
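The fusion idea above can be sketched in miniature: audio features from a Whisper decoder layer are projected to the text model's hidden size and merged into a LLaMA layer's hidden state through a gated residual. The dimensions, the gating form, and the plain-Python matrices here are illustrative assumptions, not the paper's actual adapter implementation.

```python
# Minimal sketch of layer-level cross-modal fusion: project an audio
# feature vector to the text model's hidden size, then add it into the
# text hidden state via a gated residual. All shapes are toy-sized.

def project(vec, weight):
    """Dense projection: weight is an (out_dim x in_dim) matrix."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def fuse(text_hidden, audio_hidden, weight, gate):
    """Gated residual fusion of projected audio into a text layer."""
    audio_proj = project(audio_hidden, weight)
    return [t + gate * a for t, a in zip(text_hidden, audio_proj)]

text_h = [1.0, 2.0]                      # LLaMA-side hidden state (dim 2)
audio_h = [0.5, -0.5, 1.0]               # Whisper-side feature (dim 3)
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # 2x3 projection (toy weights)
fused = fuse(text_h, audio_h, W, gate=0.5)
print(fused)  # -> [1.25, 1.75]
```

Initializing the gate near zero would let the text model start from its pretrained behavior and learn to use audio gradually.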


Whispering-LLaMA Conclusion (2/2)

  • We demonstrate that LLM integration is a promising direction

  • Utilizing audio information and quality transcripts further boosts performance

  • Whisper Tiny (70M) + LLM (7B) matches Whisper Large (1.5B)


More References

  1. Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting, ASRU 2023
  2. HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models, NeurIPS 2023, Datasets and Benchmarks
  3. Generative Error Correction for Code-Switching Speech Recognition Using Large Language Models, arXiv 2023
  4. Low-rank Adaptation of Large Language Model Rescoring for Parameter-Efficient Speech Recognition, ASRU 2023


Resources & Thank You All!


https://github.com/Srijith-rkr/Whispering-LLaMA/tree/main

100+ stars on GitHub 😊

srijithrkr@gmail.com; hucky@nvidia.com


Link to our slides


Appendix - Fine-tuning the LLM with adapters
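A bottleneck adapter is one standard way to fine-tune a frozen LLM: a small down-projection, non-linearity, and up-projection are inserted with a residual connection, and only those small matrices are trained. The toy dimensions and plain-Python weights below are illustrative assumptions, not the adapters used in the paper.

```python
# Toy sketch of a bottleneck adapter: down-project, ReLU, up-project,
# then add the result back to the frozen layer's output (residual).
# Only the adapter matrices would receive gradient updates.

def matvec(weight, vec):
    """Dense layer: weight is an (out_dim x in_dim) matrix."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def adapter(hidden, w_down, w_up):
    """Bottleneck adapter with ReLU and a residual connection."""
    z = [max(0.0, v) for v in matvec(w_down, hidden)]  # down-project + ReLU
    delta = matvec(w_up, z)                            # up-project
    return [h + d for h, d in zip(hidden, delta)]

h = [1.0, -2.0]          # frozen layer output (dim 2)
w_down = [[0.5, 0.0]]    # 1x2: bottleneck of size 1
w_up = [[1.0], [0.0]]    # 2x1: back to the hidden size
out = adapter(h, w_down, w_up)
print(out)  # -> [1.5, -2.0]
```

Initializing `w_up` at zero makes the adapter an identity at the start of training, which keeps the frozen model's behavior intact.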


Appendix - Where to extract the audio information from?


Appendix - How to learn cross-modal representations?


Appendix - Dealing with shape mismatch?
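One common source of shape mismatch is sequence length: the audio feature sequence is usually much longer than the token sequence. A simple way to reconcile the two is to mean-pool contiguous chunks of frames down to the target length. This is a generic sketch of that idea; the paper's actual mechanism may differ.

```python
# Hedged sketch: reconcile a sequence-length mismatch between audio
# frames and text tokens by mean-pooling contiguous chunks of frames
# down to a target length (a generic technique, not the paper's exact fix).

def mean_pool(frames, target_len):
    """Average contiguous chunks of frames down to target_len vectors."""
    n = len(frames)
    pooled = []
    for i in range(target_len):
        start = i * n // target_len        # chunk boundaries split n
        end = (i + 1) * n // target_len    # frames as evenly as possible
        chunk = frames[start:end]
        dim = len(chunk[0])
        pooled.append(
            [sum(f[d] for f in chunk) / len(chunk) for d in range(dim)]
        )
    return pooled

frames = [[0.0], [2.0], [4.0], [6.0]]   # 4 frames, feature dim 1
print(mean_pool(frames, 2))             # -> [[1.0], [5.0]]
```

The feature-dimension mismatch (Whisper hidden size vs. LLaMA hidden size) is the other half of the problem, and is typically handled by a learned linear projection.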


Appendix - Whisper

Trained on 680,000 hours of multilingual (99 languages), multi-task audio data
