Spoken Language Technologies and ASR
IASNLP 2026
Dr. Anil Kumar Vuppala, Speech Processing Laboratory, IIIT Hyderabad
Contents
Introduction
Speech applications
ASR
Introduction contd.
Information in speech
Typical speech signal for a word “Artificial”
Introduction
Speech is a unique, complex, and dynamic motor activity through which we express our thoughts and emotions.
Natural mode of communication for human beings.
Important in human computer interaction
Introduction contd.
Applications of speech processing:
❏ Spoken language dialog systems
❏ Speaker recognition/verification
❏ Emotion recognition
❏ Language identification
❏ Pronunciation evaluation for clinical tools
❏ Forensic tools
❏ Speech coding
❏ Speech pathology
Importance of speech technologies:
❏ For information security and authentication
❏ Speech based smart appliances
❏ For better human machine interaction
❏ Multilingual speech systems (speech to speech translation)
❏ As a personalized speech therapist
❏ As a navigator for disabled people
❏ As a personalized assistant in education, agriculture, health care, and other service sectors.
Speech to Speech Machine Translation (SSMT)
IIIT Hyderabad Speech Processing Laboratory LTRC
A king named Sumer Singh
Source Speech to Target Language Text
Speech Synthesis
Cross-lingual Style Transfer
सुमेर सिंह नाम का एक राजा
Pathological speech processing
Resonance disorders
Voice disorders
Spastic dysarthria
Stuttering
Emotions / Depression / Mood swings
Feature extraction
❏ Primary step in the development of speech systems
❏ Fourier transform
❏ Short-time Fourier analysis, because speech is non-stationary
❏ Frame size of 20-30 ms
❏ Window type: Hamming, Hanning, rectangular, etc.
❏ Cepstral coefficients are the most common feature representation; they capture the vocal tract shape and dynamic information
❏ Mel-frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPCC), perceptual linear prediction cepstral coefficients (PLPCC), etc.
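The short-time analysis described above can be sketched in plain Python. This is a minimal illustration, not the lab's feature pipeline: the 25 ms frame, 10 ms hop, 8 kHz sampling rate, and the synthetic 440 Hz tone are all assumed values, and the naive DFT stands in for an FFT.

```python
import math

def hamming(n):
    """Hamming window of length n (n > 1)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def frame_signal(signal, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a signal into overlapping, Hamming-windowed frames."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        chunk = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames

def magnitude_spectrum(frame):
    """Naive DFT magnitude for bins 0..N/2 (O(N^2), for illustration only)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        mags.append(math.hypot(re, im))
    return mags

# One second of a 440 Hz tone at 8 kHz (a synthetic stand-in for speech).
sample_rate = 8000
signal = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
frames = frame_signal(signal, sample_rate)
spectrum = magnitude_spectrum(frames[0])
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_hz = peak_bin * sample_rate / len(frames[0])
```

Each windowed frame is treated as approximately stationary, so its spectrum is meaningful even though the whole utterance is not stationary. Cepstral features such as MFCCs are computed from these per-frame spectra.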
Feature extraction contd.
Speech systems
Speaker recognition is a technique to recognize the identity of a speaker from a speech utterance.
Speaker verification aims to verify whether an input speech utterance corresponds to the claimed identity.
Emotion recognition is the process of identifying the emotional state (Anger, Happy, Sad, Neutral, Fear, etc.) of a speaker from the spoken utterance.
Spoken language identification is the task of recognising the language identity from the speech signal.
Automatic speech recognition is the process of deriving the transcription (word sequence) of an utterance, given the speech waveform.
Text to speech synthesis aims to convert the given text (word sequence) into speech.
Speech systems contd.
Speaker identification/verification, language identification, and emotion recognition are simple classification problems.
Automatic speech recognition (ASR) and text to speech synthesis are advanced speech systems that require classification, duration/pronunciation modelling, sequence-to-sequence mapping, etc.
Deep learning in AI
Ref:- https://www.quora.com/What-are-the-main-differences-between-artificial-intelligence-and-machine-learning-Is-machine-learning-a-part-of-artificial-intelligence
Why third wave? 1950s, 1990s, and now
More data from systems and sensors (IoT)
More compute power: GPUs, multi-core CPUs
Can train deep architectures
Some more applications of DL are:
Speech recognition, image classification, natural language processing, chatbots, personalized recommendations, prediction, anomaly detection, fraud detection, drug discovery, autonomous cars, video analytics, etc.
Fig Ref:- https://www.linkedin.com/pulse/how-artificial-intelligence-revolutionizing-finance-del-toro-barba/
Machine learning vs Deep learning
Ref:- https://www.xenonstack.com/blog/log-analytics-with-deep-learning-and-machine-learning
Parameters to vary for tuning
Speech systems contd.
Classification system example
Input: speech signal
Features: 39-dimensional MFCCs
Class labels:
Language ID (e.g., Telugu-1, Hindi-2, English-3, Tamil-4) in the language identification task. Similarly, the speaker ID is the class label in speaker identification.
Machine learning algorithms: Gaussian mixture modelling (GMM), GMM with universal background modelling (GMM-UBM), i-vector modelling.
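A 39-dimensional MFCC vector is commonly built from 13 static coefficients (for example 12 MFCCs plus energy) with their delta and double-delta coefficients appended. The sketch below assumes the usual regression formula over a window of two frames on each side and handles edges by repeating the first or last frame; the toy input is synthetic and the function names are illustrative.

```python
def deltas(features, width=2):
    """Regression-based delta coefficients over +/- width frames.

    Edge frames are handled by repeating the first/last frame.
    """
    n = len(features)
    dim = len(features[0])
    denom = 2 * sum(k * k for k in range(1, width + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(dim):
            num = 0.0
            for k in range(1, width + 1):
                nxt = features[min(t + k, n - 1)][d]
                prv = features[max(t - k, 0)][d]
                num += k * (nxt - prv)
            row.append(num / denom)
        out.append(row)
    return out

def add_dynamic_features(static):
    """Append delta and double-delta coefficients to each static frame."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]

# Toy input: 5 frames of 13 static coefficients that grow linearly in time.
static = [[float(t + d) for d in range(13)] for t in range(5)]
full = add_dynamic_features(static)
```

Each row of `full` has 13 + 13 + 13 = 39 values, which is the feature dimension used as classifier input on this slide.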
Gaussian Mixture Modelling (GMM)
Parameter estimation:
Gaussian Mixture Modelling (GMM) contd.
Formulation of GMM:
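The slide's equation is not reproduced in this text. The standard form of a GMM with K components, which the formulation presumably follows, is given below; the symbols K, w_k, μ_k, Σ_k, and λ are the conventional ones and are an assumption about the slide's notation.

```latex
p(\mathbf{x} \mid \lambda) = \sum_{k=1}^{K} w_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad \sum_{k=1}^{K} w_k = 1, \quad w_k \ge 0,
\qquad \lambda = \{ w_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k \}_{k=1}^{K}
```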
Gaussian Mixture Modelling (GMM) contd.
GMM distribution from three Gaussians
Gaussian Mixture Modelling (GMM) contd.
E-step
M-step
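The E-step and M-step can be sketched for a one-dimensional GMM as follows. This is an illustration under simplifying assumptions (scalar data, a fixed number of iterations, a small variance floor, hand-picked initial parameters), not the exact update equations shown on the slide.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def log_likelihood(data, weights, means, variances):
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )

def em_step(data, weights, means, variances, var_floor=1e-6):
    k = len(weights)
    # E-step: posterior probability (responsibility) of each component per sample.
    resp = []
    for x in data:
        joint = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
        total = sum(joint)
        resp.append([p / total for p in joint])
    # M-step: re-estimate weights, means, and variances from the responsibilities.
    new_w, new_m, new_v = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in resp)
        mean = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, data)) / nj
        new_w.append(nj / len(data))
        new_m.append(mean)
        new_v.append(max(var, var_floor))
    return new_w, new_m, new_v

# Toy data drawn by hand around -2 and 3; initial parameters are arbitrary.
data = [-2.1, -1.9, -2.0, -1.8, -2.2, 3.0, 3.2, 2.8, 3.1, 2.9]
params = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
history = [log_likelihood(data, *params)]
for _ in range(20):
    params = em_step(data, *params)
    history.append(log_likelihood(data, *params))
weights, means, variances = params
```

The list `history` records the log-likelihood after each iteration; EM never decreases it, which is a convenient sanity check when implementing the updates.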
GMM with universal background model (GMM-UBM)
GMMs are used for both target and background models
The target model is adapted from the universal background model (UBM)
Maximum a posteriori (MAP) adaptation:
GMM-UBM example: adjustment of the enrolled speaker’s GMM using the UBM.
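A common form of MAP adaptation updates only the component means, interpolating between the UBM mean and the mean of the enrolment data assigned to that component. The sketch below assumes that mean-only variant with a relevance factor of 16 and scalar features; the slide's exact update may differ, and all names and numbers here are illustrative.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def map_adapt_means(data, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of a 1-D UBM towards enrolment data."""
    k = len(weights)
    counts = [0.0] * k       # soft counts n_k
    first_order = [0.0] * k  # sum of gamma * x per component
    for x in data:
        joint = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
        total = sum(joint)
        for j in range(k):
            gamma = joint[j] / total
            counts[j] += gamma
            first_order[j] += gamma * x
    adapted = []
    for j in range(k):
        # alpha near 1: trust the data; alpha near 0: keep the UBM mean.
        alpha = counts[j] / (counts[j] + relevance)
        data_mean = first_order[j] / counts[j] if counts[j] > 0 else means[j]
        adapted.append(alpha * data_mean + (1 - alpha) * means[j])
    return adapted

# Toy UBM with two components; every enrolment frame lies near the first one.
ubm_weights = [0.5, 0.5]
ubm_means = [0.0, 10.0]
ubm_vars = [1.0, 1.0]
enrol = [1.0] * 16
adapted = map_adapt_means(enrol, ubm_weights, ubm_means, ubm_vars)
```

Components that see enrolment data move towards it, while components that see almost none stay at the UBM values. This is why adaptation works with little enrolment speech.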
GMM-UBM based classification
Metrics: accuracy and equal error rate (EER). The EER is the error rate at the operating point where the false acceptance rate (FAR) equals the false rejection rate (FRR).
Procedure for speaker recognition, language identification, and emotion recognition using GMM-UBM modelling
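The EER can be estimated by sweeping a decision threshold over the scores and picking the point where FAR and FRR are closest. The sketch below assumes that higher scores mean "accept" and uses made-up scores; production toolkits usually interpolate between thresholds instead.

```python
def far_frr(target_scores, impostor_scores, threshold):
    """False acceptance and false rejection rates at one threshold."""
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    frr = sum(s < threshold for s in target_scores) / len(target_scores)
    return far, frr

def equal_error_rate(target_scores, impostor_scores):
    """Return (EER estimate, threshold) where |FAR - FRR| is smallest."""
    best = None
    for threshold in sorted(set(target_scores) | set(impostor_scores)):
        far, frr = far_frr(target_scores, impostor_scores, threshold)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, threshold)
    return best[1], best[2]

# Hypothetical verification scores (higher means more likely the target).
targets = [0.9, 0.8, 0.7, 0.6, 0.3]
impostors = [0.5, 0.4, 0.35, 0.2, 0.1]
eer, threshold = equal_error_rate(targets, impostors)
```

Raising the threshold lowers FAR and raises FRR, so the two curves cross once; the EER summarises the system with a single number that does not depend on a chosen operating point.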
Joint factor analysis and I-vector modelling
Factor analysis describes the variability among observed variables in terms of a potentially lower number of unobserved variables called factors.
Intuition and interpretation
dimensions of the corresponding component
Joint factor analysis and I-vector modelling
The GMM supervector u for a speaker (or a language, an emotion, or any other class) can be decomposed as:
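The decomposition itself is not reproduced in this text. The standard forms from the JFA and total-variability (i-vector) literature are given below as a reference; the symbols other than u are the conventional ones and may differ from the slide.

```latex
% Joint factor analysis: separate speaker and channel subspaces
\mathbf{u} = \mathbf{m} + \mathbf{V}\mathbf{y} + \mathbf{U}\mathbf{x} + \mathbf{D}\mathbf{z}

% Total variability (i-vector) model: a single low-rank subspace
\mathbf{u} = \mathbf{m} + \mathbf{T}\mathbf{w}
```

Here m is the UBM mean supervector, V and U are the speaker and channel subspace matrices with factors y and x, D is a diagonal residual matrix with factors z, T is the total variability matrix, and w is the i-vector.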
Joint factor analysis and I-vector modelling
I-vector modelling in Language identification
Motivation
dimensional vector (Ex: 400)
between different language classes
I-vector modelling in Language identification
I-vector modelling in Emotion recognition
INTERSPEECH 2009 Emotion Recognition Challenge: uses the FAU-AIBO emotion corpus. It is a two-class problem: positive or negative.
GMM-MFCC systems (s1)
MFCC feature (12-MFCC+E+delta+double delta)
GMM uses 512 Gaussian Components
GMM-MFCC systems (s2)
MFCC feature (12-MFCC+E+delta+double delta)
UBM with 512 Gaussian Components
I-vector dimension of 150, Fisher Discriminant Analysis
GMM-Prosodic system (s3)
Prosody features: Pitch+ energy + duration features
GMM-UBM with 256 components.
I-vector modelling in Speaker recognition
Baseline results: Comparison of JFA and i-vector systems on the common subset of the 2008 NIST SRE database. WCCN: Within-class Covariance Normalisation, LDA: Linear discriminant analysis, SDNAP: scatter-difference Nuisance Attribute Projection, GPLDA: Gaussian Probabilistic LDA
EER: Equal error rate, DCF: Decision Cost Function is equivalent to pairwise loss function (Cavg)
Automatic speech recognition
Automatic speech recognition (ASR) is the transduction of a spoken acoustic sequence into a text sequence.
Automatic speech recognition
Building an ASR system requires various knowledge sources:
❏ Syntax - knowledge about the grammatical structure of the language
❏ Semantics - knowledge about the meaning of the words
❏ Pragmatics - knowledge about the context of the conversation
Automatic speech recognition
Speech recognition problem: P(W|Y)
Y represents the sequence of observation symbols (acoustic features such as MFCCs); W represents the sequence of words.
Objective: maximize P(W|Y) during training
By Bayes' rule, P(W|Y) = P(Y|W) P(W) / P(Y). P(Y|W): likelihood function (acoustic model); P(W): a priori probability distribution (language model).
❏ Performance of speech recognition systems: word error rate (WER) = (S + I + D) / N. Here S, I, D, and C are the numbers of substitutions, insertions, deletions, and correct words, and N = S + D + C is the number of words in the reference.
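The WER formula can be computed by aligning the reference and the hypothesis with a minimum edit distance. The sketch below is a plain dynamic-programming version on whitespace-separated words; the example sentences are invented, and a non-empty reference is assumed.

```python
def wer_counts(reference, hypothesis):
    """Return (S, D, I) from a minimum edit distance alignment of word lists."""
    r, h = reference, hypothesis
    # table[i][j] = (edits, S, D, I) for aligning r[:i] with h[:j]
    table = [[None] * (len(h) + 1) for _ in range(len(r) + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, len(r) + 1):
        table[i][0] = (i, 0, i, 0)   # delete every reference word
    for j in range(1, len(h) + 1):
        table[0][j] = (j, 0, 0, j)   # insert every hypothesis word
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            if r[i - 1] == h[j - 1]:
                match = table[i - 1][j - 1]          # correct word, no cost
            else:
                e, s, d, ins = table[i - 1][j - 1]
                match = (e + 1, s + 1, d, ins)       # substitution
            e, s, d, ins = table[i - 1][j]
            deletion = (e + 1, s, d + 1, ins)
            e, s, d, ins = table[i][j - 1]
            insertion = (e + 1, s, d, ins + 1)
            table[i][j] = min(match, deletion, insertion, key=lambda c: c[0])
    _, s, d, ins = table[len(r)][len(h)]
    return s, d, ins

def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N, with N the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    s, d, ins = wer_counts(ref, hyp)
    return (s + d + ins) / len(ref)

ref = "the cat sat on the mat"
hyp = "the cat sit on mat today"
wer = word_error_rate(ref, hyp)
```

Because insertions are counted against the reference length, WER can exceed 1.0 when the hypothesis is much longer than the reference.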
Automatic speech recognition
Widely used acoustic models:
Automatic speech recognition
Hidden Markov models:
A Markov chain is useful when we need to compute a probability for a sequence of observable events.
Three basic problems in HMM modelling:
❏ The evaluation problem: given an HMM λ and a sequence of observations O = O1, O2, O3, ..., OT, what is the probability that the observations are generated by the model, P(O|λ)?
❏ The decoding problem: given O and λ, what is the most likely state sequence, i.e., Q∗ = q1, q2, q3, ..., qT; Q∗ = arg maxQ P(Q|O; λ)? Here, q1, q2, ... are referred to as states.
❏ The learning problem: how should the model parameters λ be adjusted to maximize P(O|λ)?
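The evaluation problem, P(O|λ), and the search for the best state sequence Q∗ can be sketched for a small discrete HMM with the forward and Viterbi algorithms. The two-state model, the symbols a, b, c, and all probabilities below are invented for illustration; a real acoustic model would use continuous emission densities and log probabilities.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """P(O | lambda): sum over all state sequences (forward algorithm)."""
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            s: sum(alpha[p] * trans_p[p][s] for p in states) * emit_p[s][o]
            for s in states
        }
    return sum(alpha.values())

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence Q* and its joint probability P(Q*, O | lambda)."""
    delta = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    backpointers = []
    for o in obs[1:]:
        new_delta, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: delta[p] * trans_p[p][s])
            new_delta[s] = delta[best_prev] * trans_p[best_prev][s] * emit_p[s][o]
            pointers[s] = best_prev
        delta = new_delta
        backpointers.append(pointers)
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, delta[last]

# Toy model (all numbers are illustrative assumptions).
states = ("q1", "q2")
start_p = {"q1": 0.6, "q2": 0.4}
trans_p = {"q1": {"q1": 0.7, "q2": 0.3}, "q2": {"q1": 0.4, "q2": 0.6}}
emit_p = {
    "q1": {"a": 0.5, "b": 0.4, "c": 0.1},
    "q2": {"a": 0.1, "b": 0.3, "c": 0.6},
}
obs = ["a", "b", "c"]
likelihood = forward(obs, states, start_p, trans_p, emit_p)
best_path, best_prob = viterbi(obs, states, start_p, trans_p, emit_p)
```

The two functions differ only in replacing the sum over previous states with a max, plus the back-pointers needed to recover the path. Since P(O|λ) does not depend on Q, maximizing the joint probability P(Q, O|λ) gives the same Q∗ as maximizing P(Q|O; λ).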
Automatic speech recognition
HMM-GMM based acoustic models
Block diagram describing HMM-GMM based acoustic modeling.
Automatic speech recognition
Improvement in the performance of speech recognition systems for different acoustic models (GMM-HMM and DNN-HMM) as a function of the amount of training data
Automatic speech recognition
Performances of speech recognition systems developed using the WSJ corpus.
Thank you
Part-II
Advanced AI & ML in Speech Systems
Deep Neural Network
Deep neural network
Deep Neural Network
Frame-level probabilities of a DNN-based LID system (8 languages selected) evaluated over an English-USA (4 s) test utterance.
Deep Neural Network
i-vector vs DNN performance on LRE09 database. Cavg=average cost.
DNN with attention
DNN with attention architecture
The context vector (c) is a weighted average of the hidden representations
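The weighted average that forms the context vector can be sketched as below. The scoring function is an assumption: a dot product with a query vector (learned in a real network, fixed here) followed by a softmax. The slide's architecture may compute the scores differently, and the toy hidden vectors are invented.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden, query):
    """Context vector c = sum_t a_t * h_t with a = softmax(h_t . query)."""
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden]
    weights = softmax(scores)
    dim = len(hidden[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden)) for d in range(dim)]
    return context, weights

# Three frames of 2-D hidden representations; the first mimics a silence frame.
hidden = [[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]]
query = [1.0, 0.0]
context, weights = attention_pool(hidden, query)
```

Because the weights sum to one, the context vector has the same dimension as a single hidden vector regardless of the utterance length, which is what lets a frame-level network produce one utterance-level decision.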
DNN with attention
❏ Attention weights are low for silence frames
An example of spectrogram with attention
Sequential Networks
LSTM cell
Attention based residual time delay neural network
Residual blocks allow skip connections, which provide a smooth flow of gradients
In a TDNN, each layer captures temporal dependencies at a different context width
Attention aggregates the information of the whole input sequence
Performance of Indian LID system using DNNs
Results on IIITH-ILSC database using different neural networks
Performance of Indian LID system using DNNs
Performance of Speaker verification using Attention Networks
An example of self multi-head attention pooling with 3 heads
Evaluation results of text-independent verification on the VoxCeleb database.
Performance of Speaker identification using Attention Networks
The results for speaker identification on VoxCeleb
The results of the three most frequently used acoustic features for speaker identification on VoxCeleb. Here, Spectr. corresponds to the spectrogram feature.
Automatic speech recognition (recap)
Lexical model
DNN-HMM in Speech Recognition
1990s: Large vocabulary continuous dictation
2000s: Discriminative training (minimize word/phone error rate)
2010s: Deep learning significantly reduces the error rate
George E. Dahl, et al. Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition. IEEE Trans. Audio, Speech & Language Processing, 2012.
DNN-HMM vs. GMM-HMM
Deep models are more powerful
Deep models use data more efficiently
Deep models can be further improved by recent advances in deep learning
RNN-CTC in Speech Recognition
Connectionist Temporal Classification (CTC)
❏ No need for aligned data
❏ Reason: it can assign a probability to any label sequence, given an input
❏ It works by summing over the probabilities of all possible alignments between the input and the label
❏ Both X and Y can vary in length.
Loss function
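The CTC loss is the negative log of the probability obtained by summing over all alignments, which the forward algorithm computes without enumerating them. The sketch below works on plain probabilities for readability (real implementations work in the log domain); the blank index 0 and the uniform toy posteriors are assumptions for illustration.

```python
import math

def ctc_probability(probs, labels, blank=0):
    """P(labels | input), summed over all CTC alignments (forward algorithm).

    probs[t][k] is the network's probability of symbol k at frame t.
    """
    # Extended label sequence: blanks around and between the labels.
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    num_frames, size = len(probs), len(ext)
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, num_frames):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one position
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]             # skip the blank between two different labels
            new[s] = total * probs[t][ext[s]]
        alpha = new
    result = alpha[size - 1]
    if size > 1:
        result += alpha[size - 2]
    return result

def ctc_loss(probs, labels, blank=0):
    """Negative log-likelihood of the label sequence."""
    return -math.log(ctc_probability(probs, labels, blank))

# Three frames, three symbols (0 = blank), uniform posteriors.
uniform = [[1.0 / 3.0] * 3 for _ in range(3)]
p_single = ctc_probability(uniform, [1])
loss_single = ctc_loss(uniform, [1])
```

If the input has fewer frames than the label sequence needs, the probability is zero and the loss is undefined, which is why CTC requires the input to be at least as long as the output.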
LSTM-CTC models: Switchboard corpus
Word error rate compared with HMM based models
Encoder Decoder models for speech recognition
Listen, Attend and Spell (LAS) model:
Deep learning in ASR on CHiME and Switchboard
The HMM-based system still performs better than the end-to-end system on the large-scale dataset
WAV2VEC2
Figure 1: wav2vec 2.0 model architecture
Figure 2: Overview of the wav2vec 2.0 model architecture
Figure 3: WER for different durations
Figure 4: WER for different durations
Figure 1 is taken from the original paper, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations”.
Figure 2 is taken from https://jonathanbgn.com/2021/09/30/illustrated-wav2vec-2.html
Whisper Model
Figure: Overview of Whisper model approach
Table : Detailed comparison on different datasets
Table: Architecture details of Whisper model family
Reference: Radford A, Kim JW, Xu T, Brockman G, McLeavey C, Sutskever I. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning 2023 Jul 3 (pp. 28492-28518). PMLR.
Table: Multilingual speech recognition performance
Thank you