1
Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition
Srijith Radhakrishnan, Huck Yang, Sumeer Ahmad Khan, Rohit Kumar, Narsis A. Kiani, David Gomez-Cabrero, Jesper N. Tegner
Could You Recognize this Speech?
[Context] at a noisy party
[Context] referring to someone who has not been excluded from an event or social circle
2
“He has not been dropped.”
Slide sourced from Dr. Huck Yang's ASRU 2023 tutorial on LLMs for ASR
Open Questions During this Talk
Do we need better ears or a better brain to recognize speech?
3
LLMs are Generative Error Correctors
4
Whispering-LLaMA: Prompt-Template
5
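A generative error-correction prompt of this kind can be sketched as follows; the template wording and helper name below are illustrative assumptions, not the paper's verbatim prompt:

```python
def build_gec_prompt(hypotheses):
    """Build a hypothetical error-correction prompt from ASR n-best hypotheses.

    The LLM sees several candidate transcriptions of one utterance and is
    asked to emit the most plausible corrected transcription.
    """
    lines = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(hypotheses))
    return (
        "Below are candidate transcriptions of the same utterance, "
        "produced by a speech recognizer.\n"
        f"{lines}\n"
        "Report the most likely true transcription:"
    )

prompt = build_gec_prompt(["he has not been dropt", "he has not been dropped"])
print(prompt)
```

The ranked list lets the LLM exploit agreement across hypotheses (here, everything but the final word) when choosing the correction.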
Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting, ASRU 2023, Yang et al.
Related Zero-Shot Correction Ref.
Whispering-LLaMA (1/2): Model-Level
6
Whispering-LLaMA (2/2): Layer-Level
7
N-best Hypotheses Dataset
8
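One way to build such a dataset is to decode each utterance several times with a stochastic setting (e.g. temperature sampling) and keep the distinct outputs. A minimal sketch, assuming a `sample_fn(audio) -> str` stochastic decoding call; all names here are illustrative, not the repository's API:

```python
from itertools import cycle

def collect_nbest(sample_fn, audio, n=5, max_tries=20):
    """Collect up to n distinct hypotheses by repeatedly sampling the ASR model.

    sample_fn(audio) -> str is assumed to be a stochastic decoding call;
    duplicates are dropped so the n-best list contains distinct strings.
    """
    seen, hyps = set(), []
    for _ in range(max_tries):
        h = sample_fn(audio).strip()
        if h and h not in seen:
            seen.add(h)
            hyps.append(h)
        if len(hyps) == n:
            break
    return hyps

# Stand-in for a sampling decoder, for demonstration only.
fake = cycle(["a b c", "a b c", "a b d", "a b e"]).__next__
hyps = collect_nbest(lambda audio: fake(), None, n=3)
print(hyps)  # ['a b c', 'a b d', 'a b e']
```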
Model Performance
9
Whispering-LLaMA Conclusion (1/2)
10
Whispering-LLaMA Conclusion (2/2)
11
More References
12
Resource & Thank you all!
13
https://github.com/Srijith-rkr/Whispering-LLaMA/tree/main
100+ stars on GitHub 😊
srijithrkr@gmail.com; hucky@nvidia.com
14
Link to our slides
Appendix - Fine-tuning the LLM with adapters
15
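The bottleneck-adapter idea named in this slide, down-project, nonlinearity, up-project, residual add, can be sketched in a few lines; this is an illustration of the general adapter mechanism, not the paper's exact module, and the dimensions are assumptions:

```python
import numpy as np

def adapter(x, W_down, W_up):
    """Bottleneck adapter: down-project, ReLU, up-project, residual add.

    Only the small W_down / W_up matrices are trained; the frozen LLM
    weights are untouched.
    """
    h = np.maximum(x @ W_down, 0.0)   # ReLU bottleneck
    return x + h @ W_up               # residual keeps the frozen model's path

rng = np.random.default_rng(0)
d, r = 16, 4                          # model dim, bottleneck dim (r << d)
x = rng.normal(size=(2, d))
W_down = np.zeros((d, r))             # zero-init: adapter starts as identity,
W_up = np.zeros((r, d))               # so training begins from the frozen model
print(np.allclose(adapter(x, W_down, W_up), x))  # True
```

Zero-initializing the up-projection is a common trick so that, at the start of fine-tuning, the adapted model reproduces the frozen LLM exactly.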
Appendix - Where to extract the audio information from?
16
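One common answer to this slide's question is to take hidden states from the (frozen) Whisper encoder rather than its decoder, then condense the frame sequence to a manageable length. A minimal numpy sketch of such pooling; the frame count, feature dimension, and pooling choice are assumptions for illustration:

```python
import numpy as np

def pool_encoder_features(enc_out, k):
    """Average-pool encoder frames (T, d) into k summary vectors (k, d).

    A stand-in for however a real system condenses Whisper encoder
    features before handing them to the LLM.
    """
    chunks = np.array_split(enc_out, k, axis=0)
    return np.stack([c.mean(axis=0) for c in chunks])

# Whisper encodes 30 s of audio into 1500 frames; 384 is the tiny model's dim.
feats = pool_encoder_features(np.ones((1500, 384)), k=8)
print(feats.shape)  # (8, 384)
```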
Appendix - How to learn cross-modal representations?
17
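A standard mechanism for learning cross-modal representations is cross-attention: text-side tokens act as queries over audio-side keys and values, so the language model can condition on acoustic evidence. A single-head numpy sketch with illustrative shapes (not the paper's exact formulation):

```python
import numpy as np

def cross_attention(text_q, audio_kv, Wq, Wk, Wv):
    """Single-head cross-attention: text tokens query audio features."""
    Q = text_q @ Wq                      # (n_text, d)
    K = audio_kv @ Wk                    # (n_audio, d)
    V = audio_kv @ Wv                    # (n_audio, d)
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)  # softmax over audio positions
    return w @ V                          # (n_text, d)

rng = np.random.default_rng(1)
d = 8
text = rng.normal(size=(3, d))           # 3 text tokens
audio = rng.normal(size=(6, d))          # 6 audio frames
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = cross_attention(text, audio, Wq, Wk, Wv)
print(out.shape)  # (3, 8)
```

Each output row is a convex combination of the audio value vectors, i.e. each text token reads from the audio sequence.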
Appendix - Dealing with shape mismatch?
18
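The shape mismatch this slide refers to, Whisper's feature dimension versus the LLM's hidden size, is typically bridged with a learned linear projection. A sketch with assumed dimensions (384 for Whisper-tiny features, 4096 for a LLaMA-7B hidden state):

```python
import numpy as np

def project(audio_feats, W):
    """Map Whisper-sized features (d_audio) into the LLM's hidden size (d_llm)
    with a learned linear layer; dimensions here are illustrative."""
    return audio_feats @ W

d_audio, d_llm = 384, 4096
W = np.zeros((d_audio, d_llm))           # trained in practice; zeros for demo
out = project(np.ones((10, d_audio)), W)
print(out.shape)  # (10, 4096)
```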
Appendix - Whisper
Trained on 680,000 hours of multilingual (99 languages) and multi-task audio data
19