1 of 79

Spoken Language Technologies and ASR

IASNLP 2026

Dr. Anil Kumar Vuppala, Speech Processing Laboratory, IIIT Hyderabad

2 of 79

Contents

Introduction

Speech applications

ASR

3 of 79

Introduction contd.

Information in speech

  • Message
  • Language
  • Gender
  • Age
  • Speaker identity
  • Emotional state
  • Cognitive behaviours of
    • Depression
    • Autism
    • Language dis
  • Abnormalities in speech production etc.

Typical speech signal for a word “Artificial”

4 of 79

Introduction

Speech is a unique, complex, and dynamic motor activity through which we express our thoughts and emotions.

Natural mode of communication for human beings.

Important in human-computer interaction.

5 of 79

Introduction contd.

Applications of speech processing:

Spoken language dialog systems

Speaker recognition/verification

Emotion recognition

Language identification

Pronunciation evaluation for clinical tools

Forensic tools

  • Automatic speech recognition

  • Text-to-speech synthesis

Speech coding

6 of 79

Introduction contd.

Applications of speech processing:

Speech Pathology

Importance of speech technologies:

For information security and authentication

Speech based smart appliances

For better human machine interaction

Multilingual speech systems (speech-to-speech translation)

As a personalized speech therapist

As a navigator for disabled people

As a personalized assistant in education, agriculture, health care, and other service sectors.

7 of 79

8 of 79

Speech to Speech Machine Translation (SSMT)

IIIT Hyderabad Speech Processing Laboratory LTRC

A king named Sumer Singh

Source Speech to Target Language Text

Speech Synthesis

Cross-lingual Style Transfer

सुमेर सिंह नाम का एक राजा

  • We curated a dataset with stress annotations for Indian English and trained stress detection models at word level.
  • We modified an existing TTS architecture for the addition of stress in synthesized speech.

9 of 79

Pathological speech processing

Resonance disorders

Voice disorders

Spastic dysarthria

Stuttering

Emotions / Depression / Mood swings

  • Speech disorders are detected and analysed by Speech & Language Pathologists (SLPs).

  • What is the role of a speech researcher here?

  • Can we replace an SLP with a speech system/technology?

10 of 79

Feature extraction

Primary step in the development of speech systems

Fourier transform

Short time Fourier analysis due to non-stationarity

Frame size of 20-30 ms

Window type: Hamming, Hanning, rectangular, etc.

Cepstral coefficients are the most common feature representation; they capture the vocal tract shape and dynamic information.

Mel-frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPCC), perceptual linear prediction cepstral coefficients (PLPCC), etc.
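
The short-time analysis described above can be sketched as follows. This is a minimal illustration, not the lab's pipeline: the function name `stft_frames`, the 25 ms frame, and the 10 ms hop are assumptions (the slide only states a 20-30 ms frame size and a Hamming window option).

```python
import numpy as np

def stft_frames(signal, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a 1-D signal into overlapping Hamming-windowed frames and
    return the magnitude spectrum of each frame."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))  # hop size is an assumption
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    num_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hamming(frame_len)
    frames = np.stack([
        signal[i * hop_len : i * hop_len + frame_len] * window
        for i in range(num_frames)
    ])
    # One-sided FFT of each windowed frame: shape (num_frames, frame_len // 2 + 1)
    return np.abs(np.fft.rfft(frames, axis=1))

# Usage: a 1 kHz test tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)
spec = stft_frames(tone, sr)
```

Within each short frame the signal is treated as approximately stationary, which is why the Fourier transform is applied frame by frame rather than to the whole utterance.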

12 of 79

Feature extraction contd.

  • Including the energy term, the static cepstral features are 13-dimensional
  • 39-dimensional MFCC features: [static -- delta -- delta-delta]
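
A sketch of how the 39-dimensional vector can be assembled from 13 static coefficients. The regression-based delta formula with a window of ±2 frames is a common choice and is assumed here; the helper names `deltas` and `add_dynamic_features` are hypothetical, and random numbers stand in for real MFCCs.

```python
import numpy as np

def deltas(feats, width=2):
    """Regression-based delta features over +/- `width` frames.
    feats: (num_frames, dim) array; edges are padded by repeating the end frames."""
    padded = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    num_frames = feats.shape[0]
    out = np.zeros_like(feats, dtype=float)
    for n in range(1, width + 1):
        out += n * (padded[width + n : width + n + num_frames]
                    - padded[width - n : width - n + num_frames])
    return out / denom

def add_dynamic_features(static):
    """Stack [static, delta, delta-delta] along the feature axis."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return np.hstack([static, d1, d2])

# Placeholder input: 100 frames of 13-dim "MFCCs" (random values, illustration only)
static = np.random.default_rng(0).normal(size=(100, 13))
feats39 = add_dynamic_features(static)
```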

13 of 79

Speech systems

Speaker recognition is the task of recognizing the identity of a speaker from a speech utterance.

Speaker verification aims to verify whether an input utterance corresponds to the claimed identity.

Emotion recognition is the process of identifying the emotional state (anger, happiness, sadness, neutral, fear, etc.) of a speaker from the spoken utterance.

Spoken language identification is the task of recognising the language identity from the speech signal.

Automatic speech recognition is the process of deriving the transcription (word sequence) of an utterance, given the speech waveform.

Text-to-speech synthesis aims to convert the given text (word sequence) into speech.

14 of 79

Speech systems contd.

Speaker identification/verification, language identification, and emotion recognition are simple classification problems.

Automatic speech recognition (ASR) and text-to-speech synthesis are advanced speech systems that require classification, duration/pronunciation modelling, sequence-to-sequence mapping, etc.

15 of 79

Deep learning in AI

Ref:- https://www.quora.com/What-are-the-main-differences-between-artificial-intelligence-and-machine-learning-Is-machine-learning-a-part-of-artificial-intelligence

16 of 79

Why a third wave? 1950s, 1990s, and now

More data from systems and sensors (IoT)

More compute power: GPUs, multi-core CPUs

Can train deep architectures

Some more applications of DL are:

Speech recognition, image classification, natural language processing, chatbots, personalized recommendations, prediction, anomaly detection, fraud detection, drug discovery, autonomous cars, video analytics, etc.

Fig Ref:- https://www.linkedin.com/pulse/how-artificial-intelligence-revolutionizing-finance-del-toro-barba/

17 of 79

Machine learning vs Deep learning

Ref:- https://www.xenonstack.com/blog/log-analytics-with-deep-learning-and-machine-learning

18 of 79

Parameters to vary for tuning

  • Number of layers
  • Number of neurons in each layer
  • Activation function in each layer
  • Number of epochs
    • Iteration (equivalent to when a weight update is done)
  • Error/loss functions
  • Learning rate (α)
    • Size of the step in the direction of the negative gradient
  • Batch size
  • Momentum parameter (weight given to earlier steps taken in the process of gradient descent)
  • Kernels
  • Number of features
  • Gradient descent methods
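
To make the roles of the learning rate and the momentum parameter concrete, here is a small sketch of gradient descent with classical momentum on a toy quadratic. The function name `sgd_momentum`, the default values, and the objective are all illustrative assumptions.

```python
import numpy as np

def sgd_momentum(grad_fn, w0, learning_rate=0.1, momentum=0.9, num_iters=200):
    """Gradient descent with classical momentum.
    Each loop pass is one iteration, i.e. one weight update."""
    w = np.asarray(w0, dtype=float)
    velocity = np.zeros_like(w)
    for _ in range(num_iters):
        # momentum weights the earlier steps; learning_rate scales the
        # step taken along the negative gradient
        velocity = momentum * velocity - learning_rate * grad_fn(w)
        w = w + velocity
    return w

# Toy objective f(w) = (w - 3)^2 with gradient 2 (w - 3); the minimum is at w = 3
w_min = sgd_momentum(lambda w: 2.0 * (w - 3.0), w0=[0.0])
```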

19 of 79

Speech systems contd.

Classification system example

Input: speech signal

Features: 39-dimensional MFCCs

Class labels:

Language ID (Ex: Telugu-1, Hindi-2, English-3, Tamil-4, etc.) in language identification task. Similarly, speaker ID will be the class label in speaker identification.

Machine learning algorithms: Gaussian Mixture Modelling (GMM), GMM with universal background modelling, I-vector modelling.

20 of 79

Gaussian Mixture Modelling (GMM)

Parameter estimation:

21 of 79

Gaussian Mixture Modelling (GMM) contd.

Formulation of GMM:
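
For reference, a GMM density is conventionally written as below. The symbols (K components, weights w_k, means μ_k, covariances Σ_k) follow the usual textbook notation and are assumed here rather than taken from the slide.

```latex
p(x \mid \lambda) = \sum_{k=1}^{K} w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad \sum_{k=1}^{K} w_k = 1,
\qquad \lambda = \{ w_k, \mu_k, \Sigma_k \}_{k=1}^{K}
```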

22 of 79

Gaussian Mixture Modelling (GMM) contd.

GMM distribution from three Gaussians

23 of 79

Gaussian Mixture Modelling (GMM) contd.

E-step

M-step
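
The E-step and M-step can be sketched for a one-dimensional GMM as follows. This is a minimal illustration under stated assumptions: scalar data, a fixed number of iterations, quantile-based initialization of the means, and a hypothetical function name `em_gmm_1d`.

```python
import numpy as np

def em_gmm_1d(x, num_components, num_iters=50):
    """Fit a 1-D GMM with EM. Returns (weights, means, variances)."""
    x = np.asarray(x, dtype=float)
    k = num_components
    # Initialization (an assumption): spread the means over data quantiles
    means = np.quantile(x, (np.arange(k) + 0.5) / k)
    variances = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)
    for _ in range(num_iters):
        # E-step: posterior (responsibility) of each component for each point
        diff = x[:, None] - means[None, :]
        log_pdf = -0.5 * (np.log(2 * np.pi * variances) + diff ** 2 / variances)
        log_joint = np.log(weights) + log_pdf
        log_joint -= log_joint.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_joint)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the soft assignments
        counts = resp.sum(axis=0)
        weights = counts / len(x)
        means = (resp * x[:, None]).sum(axis=0) / counts
        diff = x[:, None] - means[None, :]
        variances = (resp * diff ** 2).sum(axis=0) / counts + 1e-6  # variance floor
    return weights, means, variances

# Synthetic data from two well-separated Gaussians (illustration only)
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-5, 1, 500), rng.normal(5, 1, 500)])
weights, means, variances = em_gmm_1d(data, num_components=2)
```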

24 of 79

GMM with universal background model (GMM-UBM)

GMMs are used for both target and background models

  • Target model of a class is trained using features corresponding to that class
  • Universal background model is trained using features from many classes

Target model is adapted from universal background model (UBM)

  • good with limited target training data

25 of 79

GMM with universal background model (GMM-UBM)

Maximum a posteriori (MAP) adaptation:

  • align target training vectors to UBM
  • accumulate sufficient statistics
  • update target model parameters with smoothing to UBM parameters
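
The three MAP steps can be sketched as below for mean-only adaptation of a diagonal-covariance UBM. The relevance factor of 16 and the function name `map_adapt_means` are assumptions for illustration; weights and variances are left unadapted in this sketch.

```python
import numpy as np

def map_adapt_means(ubm_weights, ubm_means, ubm_vars, feats, relevance=16.0):
    """Mean-only MAP adaptation.
    ubm_weights: (K,), ubm_means and ubm_vars: (K, D), feats: (T, D)."""
    # 1. Align target vectors to the UBM: component posteriors per frame
    diff = feats[:, None, :] - ubm_means[None, :, :]
    log_pdf = -0.5 * np.sum(
        np.log(2 * np.pi * ubm_vars)[None] + diff ** 2 / ubm_vars[None], axis=2)
    log_post = np.log(ubm_weights)[None] + log_pdf
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    # 2. Accumulate sufficient statistics
    n_k = post.sum(axis=0)                          # soft counts, shape (K,)
    first_order = post.T @ feats                    # shape (K, D)
    data_means = first_order / np.maximum(n_k, 1e-10)[:, None]
    # 3. Update: interpolate between data means and UBM means
    alpha = (n_k / (n_k + relevance))[:, None]
    return alpha * data_means + (1.0 - alpha) * ubm_means
```

Components that see little target data keep an adaptation coefficient near zero and stay close to the UBM, which is why the approach works with limited target training data.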

26 of 79

GMM with universal background model (GMM-UBM)

GMM-UBM Example: Adjustment of enrolled speaker’s GMM using UBM.

27 of 79

GMM-UBM based classification

Procedure for speaker recognition, language identification, emotion recognition using GMM-UBM modelling

28 of 79

GMM-UBM based classification

Metrics: accuracy and equal error rate (EER). EER is the error rate at the operating point where the false acceptance rate (FAR) equals the false rejection rate (FRR).
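
A simple way to estimate the EER from trial scores is to sweep a threshold and take the point where FAR and FRR are closest. The sketch below assumes higher scores mean "accept"; the function name `equal_error_rate` is hypothetical, and production toolkits usually interpolate the curve instead.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping the threshold over all observed scores."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    best_gap, eer = np.inf, 1.0
    for th in thresholds:
        far = np.mean(nontarget_scores >= th)  # impostor trials accepted
        frr = np.mean(target_scores < th)      # genuine trials rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return float(eer)
```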

29 of 79

Joint factor analysis and I-vector modelling

  • Factor analysis is a statistical method used to describe the variability among the observed variables in terms of a potentially smaller number of unobserved variables called factors.
  • Joint factor analysis (JFA) was the initial paradigm for speaker recognition.
  • Later, it was also applied to language identification, emotion recognition, automatic speech recognition, etc.

30 of 79

Joint factor analysis and I-vector modelling

Intuition and interpretation

  • A supervector for a speaker (language/emotion) should be decomposable into speaker-independent, speaker-dependent, channel-dependent, and residual components.
  • Each component is represented by low-dimensional factors, which operate along the principal dimensions of the corresponding component.

  • Speaker (language/emotion) dependent component, known as the eigenvoice, and the corresponding factors

31 of 79

Joint factor analysis and I-vector modelling

GMM supervector u for a speaker (language/emotion/ any other) can be decomposed as:
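
In the commonly cited JFA formulation, the decomposition is written as below. The slide's symbol u is kept for the supervector; the remaining symbols follow the conventional notation and are assumed here: m is the speaker- and channel-independent (UBM) supervector, V holds the eigenvoices with speaker factors y, U holds the eigenchannels with channel factors x, and D is a diagonal residual matrix with factors z. The i-vector model replaces the separate subspaces with a single total variability matrix T and an utterance-level vector w.

```latex
\text{JFA:} \quad u = m + V y + U x + D z
\qquad\qquad
\text{i-vector:} \quad u = m + T w
```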

32 of 79

Joint factor analysis and I-vector modelling

  • I-vectors give an utterance-level representation. Cosine similarity can be used to compare I-vectors.
  • Variable length sequence to fixed dimension representation
  • I-vectors with SVM or DNN or PLDA scoring are used for speaker/emotion/language identification
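
Cosine scoring between fixed-dimensional utterance vectors can be sketched as follows. The helper names and the idea of scoring against one mean vector per class are illustrative assumptions; SVM, DNN, or PLDA back-ends would replace `classify` in practice.

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(test_vec, class_means):
    """Pick the label whose mean vector is most similar to the test vector.
    class_means: dict mapping label -> mean vector of that class."""
    return max(class_means, key=lambda label: cosine_score(test_vec, class_means[label]))
```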

33 of 79

I-vector modelling in Language identification

Motivation

  • I-vector models are the state-of-the-art baseline models in the NIST 2009 Language Recognition Evaluation (NIST 2009 LRE) and the Oriental Language Recognition (OLR) challenge.
  • Allows a low-dimensional speech representation based on factor analysis
  • Each speech recording is mapped to a low-dimensional vector (e.g., 400 dimensions)
  • Factor analysis as feature extractor
  • Modeling the inter-language variability between different language classes

34 of 79

I-vector modelling in Language identification

  • Cavg is the average cost computed over language pairs from the miss rate and the false alarm rate.
  • EER refers to equal error rate.
  • Lower values of these metrics imply better system performance.

35 of 79

I-vector modelling in Emotion recognition

INTERSPEECH 2009 Emotion Recognition Challenge: uses the FAU-AIBO emotion corpus. It is a two-class problem: positive or negative.

GMM-MFCC systems (s1)

MFCC feature (12-MFCC+E+delta+double delta)

GMM uses 512 Gaussian Components

GMM-MFCC systems (s2)

MFCC feature (12-MFCC+E+delta+double delta)

UBM with 512 Gaussian Components

I-vector dimension of 150, Fisher Discriminant Analysis

GMM-Prosodic system (s3)

Prosody features: Pitch+ energy + duration features

GMM-UBM with 256 components.

37 of 79

I-vector modelling in Speaker recognition

Baseline results: Comparison of JFA and i-vector systems on the common subset of the 2008 NIST SRE database. WCCN: Within-class Covariance Normalisation, LDA: Linear discriminant analysis, SDNAP: scatter-difference Nuisance Attribute Projection, GPLDA: Gaussian Probabilistic LDA

EER: Equal error rate, DCF: Decision Cost Function is equivalent to pairwise loss function (Cavg)

38 of 79

Automatic speech recognition

Automatic speech recognition (ASR) is the transduction of a spoken acoustic sequence into a text sequence.

39 of 79

Automatic speech recognition

Building an ASR system requires various knowledge sources:

  • Acoustics - knowledge about variability in speech
  • Phonetics - knowledge about characteristics of speech sounds
  • Phonology - knowledge about variability of speech sounds
  • Prosodics - knowledge about stress and intonation patterns
  • Lexical - knowledge about patterns of language
  • Syntax - knowledge about the grammatical structure of language
  • Semantics - knowledge about the meaning of the words
  • Pragmatics - knowledge about the context of conversation

40 of 79

Automatic speech recognition

41 of 79

Automatic speech recognition

Speech recognition problem: P(W|Y)

Y represents the sequence of observation symbols (acoustic features such as MFCCs); W represents the sequence of words.

Objective: maximize P(W|Y) during training

43 of 79

Automatic speech recognition

P(Y|W): likelihood function; P(W): a priori probability distribution
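
The link between the posterior, the likelihood (acoustic model), and the prior (language model) is Bayes' rule. The hat notation for the recognized word sequence is an assumption; P(Y) can be dropped because it does not depend on W.

```latex
\hat{W} = \arg\max_{W} P(W \mid Y)
        = \arg\max_{W} \frac{P(Y \mid W)\, P(W)}{P(Y)}
        = \arg\max_{W} P(Y \mid W)\, P(W)
```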

Performance of speech recognition systems: word error rate (WER) = (S + I + D) / N. Here S, I, and D are the numbers of substitutions, insertions, and deletions, C is the number of correct words, and N = S + D + C is the number of words in the reference.
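
WER is usually computed from a word-level edit-distance alignment between reference and hypothesis. The sketch below returns only the ratio (S + I + D) / N; the function name `word_error_rate` and whitespace tokenization are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """WER = minimum number of word substitutions, insertions, and deletions
    needed to turn the reference into the hypothesis, divided by the
    number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j]: minimum edits to turn ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                      # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j                      # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[-1][-1] / len(ref)
```

Because insertions are counted in the numerator but not in N, WER can exceed 1 for very poor hypotheses.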

44 of 79

Automatic speech recognition

Widely used acoustic models:

  • GMM-HMM based acoustic models
  • DNN-HMM based acoustic models
  • RNN-CTC based acoustic models
  • Encoder-Decoder acoustic models

45 of 79

Automatic speech recognition

Hidden Markov models:

A Markov chain is useful when we need to compute a probability for a sequence of observable events.

46 of 79

Automatic speech recognition

Three basic problems in HMM modelling:

The Evaluation Problem: Given an HMM λ and a sequence of observations O = O1, O2, ..., OT, what is the probability that the observations are generated by the model, P(O|λ)?
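
The evaluation problem is typically solved with the forward algorithm. The sketch below assumes a discrete-observation HMM with initial probabilities `pi`, transition matrix `A`, and emission matrix `B`; it works in the probability domain, so long sequences would need scaling or log arithmetic.

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """P(O | lambda) for a discrete HMM.
    pi: (N,) initial state probabilities
    A:  (N, N), A[i, j] = P(state j at t+1 | state i at t)
    B:  (N, M), B[i, k] = P(symbol k | state i)
    obs: sequence of symbol indices."""
    pi, A, B = np.asarray(pi, float), np.asarray(A, float), np.asarray(B, float)
    alpha = pi * B[:, obs[0]]              # alpha_1(i)
    for symbol in obs[1:]:
        alpha = (alpha @ A) * B[:, symbol]  # sum over previous states, then emit
    return float(alpha.sum())
```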

47 of 79

Automatic speech recognition

  • The Decoding Problem: Given a model λ and a sequence of observations O = O1, O2, ..., OT, what is the most likely state sequence in the model that produced the observations?

i.e., Q* = arg max_Q P(Q|O; λ), where Q = q1, q2, ..., qT and q1, q2, ... denote the states.
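
The decoding problem is typically solved with the Viterbi algorithm. The sketch below uses the same assumed discrete-HMM conventions (`pi`, `A`, `B`) and works in the log domain.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state sequence for a discrete HMM (log-domain Viterbi)."""
    with np.errstate(divide="ignore"):      # log(0) = -inf is acceptable here
        log_pi = np.log(np.asarray(pi, float))
        log_A = np.log(np.asarray(A, float))
        log_B = np.log(np.asarray(B, float))
    num_states, num_steps = len(log_pi), len(obs)
    delta = log_pi + log_B[:, obs[0]]
    backptr = np.zeros((num_steps, num_states), dtype=int)
    for t in range(1, num_steps):
        # scores[i, j]: best log-probability of being in i at t-1, then moving to j
        scores = delta[:, None] + log_A
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[:, obs[t]]
    # Trace back from the best final state
    path = [int(delta.argmax())]
    for t in range(num_steps - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```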

48 of 79

Automatic speech recognition

  • The Learning Problem: Given a model λ and a sequence of observations O = O1, O2, ..., OT, how should we adjust the model parameters λ = (A, B, π) in order to maximize P(O|λ)?

49 of 79

Automatic speech recognition

HMM-GMM based acoustic models

  • A spoken word can be decomposed into a sequence of Kw basic phones (units); this sequence is called the pronunciation sequence.
  • HMMs model the temporal variability of speech, and GMMs model how well each frame or a short window of frames fits a state of the HMM.
  • In practice, each phone is represented by an HMM with left-to-right topology and three hidden states.

Block diagram describing HMM-GMM based acoustic modeling.

50 of 79

Automatic speech recognition

Improvements in the performance of speech recognition systems with respect to acoustic models (GMM-HMM and DNN-HMM) as a function of training data

51 of 79

Automatic speech recognition

Performances of speech recognition systems developed using the WSJ corpus.

52 of 79

Thank you

53 of 79

Part-II

Advanced AI & ML in Speech Systems

54 of 79

Deep Neural Network

Deep neural network

  • Decision is taken at frame level.
  • The frame level decisions are averaged to get utterance level decision.

55 of 79

Deep Neural Network

Frame-level probabilities of a DNN-based LID system (8 languages selected) evaluated over an English-USA (4 s) test utterance.

56 of 79

Deep Neural Network

i-vector vs DNN performance on LRE09 database. Cavg=average cost.

57 of 79

DNN with attention

DNN with attention architecture

Context vector (c) is a weighted average of hidden representations

  • Not all frames contribute equally to the decision / classification
  • Decision is taken at utterance level.
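
Attention pooling over frame-level hidden representations can be sketched as below. The additive scoring function e_t = v · tanh(W h_t + b) is one common parameterization and is assumed here; the parameter names `w`, `b`, `v` are hypothetical and would be learned jointly with the network.

```python
import numpy as np

def attention_pool(hidden, w, b, v):
    """Compute the context vector c as a weighted average of hidden states.
    hidden: (T, H) frame-level representations
    w: (H, A), b: (A,), v: (A,) attention parameters."""
    scores = np.tanh(hidden @ w + b) @ v   # one scalar score per frame, shape (T,)
    scores = scores - scores.max()         # numerical stability for the softmax
    alpha = np.exp(scores)
    alpha /= alpha.sum()                   # attention weights, sum to 1
    context = alpha @ hidden               # shape (H,)
    return context, alpha
```

Frames with low attention weight (for example silence) contribute little to the utterance-level context vector.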

58 of 79

DNN with attention

Attention weights are low for silence frames

An example of spectrogram with attention

59 of 79

Sequential Networks

  • Long temporal dependencies
  • Sequence to sequence mapping

LSTM cell

60 of 79

Attention based residual time delay neural network

Residual blocks allow skip connections, which provide a smooth flow of gradients

In a TDNN, each layer captures temporal dependencies at a different context width

Attention aggregates information from the whole input sequence

61 of 79

Performance of Indian LID system using DNNs

Results on IIITH-ILSC database using different neural networks

62 of 79

Performance of Indian LID system using DNNs

63 of 79

Performance of Speaker verification using Attention Networks

An example of self multi-head attention pooling with 3 heads

Evaluation results of text-independent verification on the VoxCeleb database.

64 of 79

Performance of Speaker identification using Attention Networks

The results for speaker identification on VoxCeleb

The results of the three most frequently used acoustic features for speaker identification on VoxCeleb. Here, Spectr. corresponds to the spectrogram feature.

65 of 79

Automatic speech recognition (recap)

66 of 79

Automatic speech recognition (recap)

67 of 79

Lexical model

68 of 79

DNN-HMM in Speech Recognition

69 of 79

DNN-HMM in Speech Recognition

1990s: Large vocabulary continuous dictation

2000s: Discriminative training (minimize word/phone error rate)

2010s: Deep learning significantly reduces error rate

George E. Dahl, et al. Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition. IEEE Trans. Audio, Speech & Language Processing, 2012.

70 of 79

DNN-HMM vs. GMM-HMM

Deep models are more powerful

  • GMM assumes each data point is generated from a single component of the mixture model
  • GMM with diagonal covariance matrix ignores correlation between dimensions

Deep models use data more efficiently

  • GMM consists of many components, and each learns from a small fraction of the data

Deep models can be further improved by recent advances in deep learning

71 of 79

RNN-CTC in Speech Recognition

72 of 79

Connectionist Temporal Classification (CTC)

No need for aligned data

Reason: it can assign a probability to any label sequence, given an input

It works by summing over the probability of all possible alignments between the input and the label

  • X = {x_1, x_2, ..., x_T} represents the input sequence
  • Y = {y_1, y_2, ..., y_U} represents the transcript
  • Need an accurate mapping from X to Y
  • Challenges using simpler supervised learning algorithms: both X and Y can vary in length.

73 of 79

Loss function

  • The CTC alignments give us a natural way to go from probabilities at each time-step to the probability of an output sequence
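  • Conditional probability: $P(Y \mid X) = \sum_{A \in \mathcal{A}_{X,Y}} \prod_{t=1}^{T} p_t(a_t \mid X)$
  • The product $\prod_{t=1}^{T} p_t(a_t \mid X)$ computes the probability of a single alignment step by step; the sum marginalizes over the set $\mathcal{A}_{X,Y}$ of valid alignments.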
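
For very short inputs the CTC sum can be checked by brute force: enumerate every alignment, keep those that collapse to the target, and add up their probabilities. This is only an illustration of the definition (practical systems use a dynamic-programming forward pass); index 0 as the blank symbol and the function names are assumptions.

```python
import itertools
import numpy as np

BLANK = 0  # assumed index of the CTC blank symbol

def collapse(alignment):
    """Merge repeated symbols, then remove blanks."""
    out, prev = [], None
    for symbol in alignment:
        if symbol != prev and symbol != BLANK:
            out.append(symbol)
        prev = symbol
    return out

def ctc_probability(probs, target):
    """P(Y | X) by exhaustive enumeration of alignments.
    probs: (T, V) array, probs[t, a] = p_t(a | X); target: list of label ids.
    Cost grows as V**T, so this is usable only for tiny examples."""
    probs = np.asarray(probs, dtype=float)
    num_steps, vocab = probs.shape
    total = 0.0
    for alignment in itertools.product(range(vocab), repeat=num_steps):
        if collapse(alignment) == list(target):
            total += float(np.prod([probs[t, a] for t, a in enumerate(alignment)]))
    return total
```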

74 of 79

LSTM-CTC models: Switchboard corpus

Word error rate compared with HMM based models

75 of 79

Encoder Decoder models for speech recognition

Listen, Attend and Spell (LAS) model:

  • Listener is a pyramidal BLSTM encoding our input sequence x into high level features h.

  • Speller is an attention-based decoder generating the characters y from h.

76 of 79

Deep learning in ASR on CHiME and Switchboard

The HMM-based system still performs better than the end-to-end system on the large-scale dataset

77 of 79

WAV2VEC2

  • One of the current state-of-the-art models for speech-related tasks, especially automatic speech recognition.
  • Based on the Transformer encoder, with a training objective similar to BERT's masked language modeling objective.
  • Self-supervised model for learning speech representations.
  • Architecture consists of 4 main parts:
    • Feature Encoder: Converts the raw waveform into a sequence of feature vectors Z0, Z1, Z2, ..., ZT (latent speech representations).
    • Quantization Module: Maps continuous-valued embeddings to discrete symbols using vector quantization.
    • Context Network: Improves the quality of the learned contextual embeddings generated by the feature encoder by capturing both local and global contextual information in the audio.
    • Contrastive loss: Loss function used during unsupervised pre-training to learn meaningful representations of speech data.
  • Uses the Gumbel softmax distribution and CTC loss; trained on 960 hours of unannotated LibriSpeech.
  • Wav2vec2-XLSR (cross-lingual speech representations) for multilingual speech recognition, trained on 128 languages.

Figure 1: wav2vec 2.0 model architecture

Figure 2: Overview of the wav2vec 2.0 model architecture

Figure 3: WER for different durations

Figure 4: WER for different durations

Figure 1 is taken from the original paper “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations”.

Figure 2: https://jonathanbgn.com/2021/09/30/illustrated-wav2vec-2.html

78 of 79

Whisper Model

  • Another state-of-the-art model for speech-related tasks.
  • Trained on over 680,000 hours of multilingual data.
  • Simple end-to-end approach, implemented as an encoder-decoder Transformer.
  • Input is 30-second audio chunks, converted into a log-Mel spectrogram.
  • Both encoder and decoder blocks use neural network layers and the attention mechanism.
  • In the decoder block, cross-attention is used in addition to self-attention.
  • The model is trained on many different speech processing tasks: multilingual speech recognition, speech translation, spoken language identification, and voice activity detection.

Figure: Overview of Whisper model approach

Table : Detailed comparison on different datasets

Table: Architecture details of Whisper model family

Reference: Radford A, Kim JW, Xu T, Brockman G, McLeavey C, Sutskever I. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning 2023 Jul 3 (pp. 28492-28518). PMLR.

Table: Multilingual speech recognition performance

79 of 79

Thank you