Introduction to Speech & Natural Language Processing
Lecture 10
Speech Feature Extraction
Krishnendu Ghosh
Silence, Unvoiced & Voiced Speech
Speech Spectrograms
Speech Feature Extraction
Speech feature extraction is the process of converting a raw audio waveform into a compact, informative numerical representation.
Raw audio is high-dimensional, redundant, and variable across speakers and recording conditions, so it is rarely fed to models directly.
Speech is resampled for standardization, framed because it is quasi-stationary, windowed to reduce spectral leakage, and overlapped to preserve temporal continuity.
Speech Signal Levels
Speech information exists at multiple levels: acoustic (waveform and spectrum), phonetic (individual sounds), prosodic (pitch, rhythm, stress), and linguistic (words and meaning).
Different features capture different levels.
Knowledge Sources in Speech
Speech to Digital Signal
Sampling and Quantization
Common sampling rates: 8 kHz (telephony), 16 kHz (wideband speech and most ASR systems), 44.1 kHz (CD audio), 48 kHz (professional audio).
The sampling rate defines the maximum frequency content that can be analyzed: by the Nyquist theorem, only frequencies up to half the sampling rate are captured.
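A minimal sketch of these two ideas, assuming a synthetic 440 Hz tone rather than real speech:

```python
import numpy as np

sr = 16000                                   # sampling rate in Hz
t = np.arange(sr) / sr                       # one second of sample times
x = 0.5 * np.sin(2 * np.pi * 440 * t)        # 440 Hz tone, float in [-1, 1]

# 16-bit quantization: map [-1, 1] onto the signed 16-bit integer range
x_q = np.round(x * 32767).astype(np.int16)

# Nyquist: the highest representable frequency is half the sampling rate
print("Nyquist frequency:", sr / 2, "Hz")    # 8000.0 Hz
```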
Preprocessing
Before extracting features, audio is typically resampled, framed, windowed, and overlapped; each step is detailed in the sections below.
Resampling
Reason
Speech information relevant for intelligibility lies mostly below 8 kHz.
By the Nyquist theorem, sampling at 16 kHz is sufficient to capture it.
Why standardize? Feature extraction settings (frame sizes, FFT resolution, filterbank edges) are defined relative to the sampling rate, so a fixed rate such as 16 kHz keeps features comparable across recordings.
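A resampling sketch using SciPy's polyphase resampler; the 44.1 kHz input here is random noise standing in for real audio:

```python
import numpy as np
from scipy.signal import resample_poly

sr_in, sr_out = 44100, 16000
x = np.random.randn(sr_in)        # stand-in for one second of 44.1 kHz audio

# Polyphase resampling: the ratio 16000/44100 reduces to 160/441
y = resample_poly(x, up=160, down=441)
print(len(y))                     # 16000 samples, i.e. one second at 16 kHz
```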
Framing
Reason
Speech is non-stationary, but over short durations it behaves approximately stationary (quasi-stationary).
Vocal tract shape ≈ constant for ~20–30 ms
Phoneme identity does not change within this window
Why 20–25 ms?
Short enough to ensure stationarity
Long enough to capture pitch and formants
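A minimal framing sketch in NumPy, assuming a 25 ms frame and a 10 ms hop (the hop and overlap are motivated in the sections below):

```python
import numpy as np

def frame_signal(x, sr, frame_ms=25, hop_ms=10):
    """Slice a 1-D signal into overlapping frames (25 ms / 10 ms hop assumed)."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    # Index matrix: row i selects samples [i*hop, i*hop + frame_len)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]                 # shape: (n_frames, frame_len)

sr = 16000
x = np.random.randn(sr)           # one second of dummy audio
frames = frame_signal(x, sr)
print(frames.shape)               # (98, 400): 400 samples = 25 ms at 16 kHz
```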
Windowing
Problem without windowing
Framing causes sharp cuts at frame boundaries → spectral leakage.
Solution
Multiply each frame by a smooth window such as the Hamming window, w[n] = 0.54 − 0.46 cos(2πn/(N−1)) for n = 0, 1, …, N−1:
Reduces edge discontinuities
Suppresses side lobes in frequency domain
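A sketch of the windowing step, reusing the frame shape from the framing sketch above (the placeholder frames are random, not real speech):

```python
import numpy as np

frame_len = 400                           # 25 ms at 16 kHz
window = np.hamming(frame_len)            # tapers smoothly toward zero at the edges

frames = np.random.randn(98, frame_len)   # placeholder for the framed signal
windowed = frames * window                # broadcasts the window across all frames
```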
Overlapping
Reason
Speech changes gradually, not abruptly.
Overlapping ensures that no acoustic event is lost at a frame boundary and that features evolve smoothly from frame to frame.
Why 10 ms hop?
50–60% overlap with 20–25 ms frame
Matches typical phoneme transition speed
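A quick check of the arithmetic for these default values:

```python
frame_ms, hop_ms = 25, 10
overlap_ms = frame_ms - hop_ms
print(f"overlap: {overlap_ms} ms = {100 * overlap_ms / frame_ms:.0f}% of each frame")
# overlap: 15 ms = 60% of each frame
```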
Time-Domain Features
Short-Time Energy: the sum of squared samples in each frame; high for voiced speech, low for silence.
Zero Crossing Rate (ZCR): the rate at which the signal changes sign within a frame; high for unvoiced sounds such as fricatives.
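A minimal sketch of both features, assuming frames shaped as in the framing sketch above:

```python
import numpy as np

def short_time_energy(frames):
    """Sum of squared samples per frame: high for voiced speech, low for silence."""
    return np.sum(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose sign flips: high for unvoiced sounds."""
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

frames = np.random.randn(98, 400)     # windowed frames from the earlier sketches
print(short_time_energy(frames).shape, zero_crossing_rate(frames).shape)
```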
Frequency-Domain Features
FFT (Fast Fourier Transform): converts each windowed frame from the time domain to the frequency domain.
But the raw FFT output is complex-valued, high-dimensional, and linearly spaced in frequency, which does not match human auditory perception; this motivates compact spectral summaries and Mel-scale features.
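A sketch of the per-frame FFT with NumPy; the 512-point FFT size and the random frame are illustrative choices:

```python
import numpy as np

sr, frame_len = 16000, 400
frame = np.hamming(frame_len) * np.random.randn(frame_len)   # one windowed frame

spectrum = np.fft.rfft(frame, n=512)          # zero-padded 512-point FFT
magnitude = np.abs(spectrum)                  # discard phase, keep magnitude
freqs = np.fft.rfftfreq(512, d=1 / sr)        # bin centers, linearly spaced in Hz
print(magnitude.shape, freqs[-1])             # (257,) 8000.0
```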
Spectral Features
Spectral Centroid: the magnitude-weighted mean frequency; correlates with perceived brightness.
Spectral Bandwidth: the spread of the spectrum around the centroid.
Spectral Roll-off: the frequency below which a fixed fraction (commonly 85%) of the spectral energy lies.
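A sketch computing all three summaries from one magnitude spectrum; the 85% roll-off threshold is a common convention, and the roll-off here is taken over spectral energy:

```python
import numpy as np

def spectral_features(magnitude, freqs, rolloff_pct=0.85):
    """Centroid, bandwidth, and roll-off from one magnitude spectrum."""
    p = magnitude / np.sum(magnitude)                  # normalize to weights
    centroid = np.sum(freqs * p)                       # weighted mean frequency
    bandwidth = np.sqrt(np.sum(((freqs - centroid) ** 2) * p))   # spread
    cumulative = np.cumsum(magnitude ** 2)             # cumulative spectral energy
    rolloff = freqs[np.searchsorted(cumulative, rolloff_pct * cumulative[-1])]
    return centroid, bandwidth, rolloff

freqs = np.fft.rfftfreq(512, d=1 / 16000)
magnitude = np.abs(np.fft.rfft(np.random.randn(400), n=512))
print(spectral_features(magnitude, freqs))
```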
Source–Filter Model of Speech
Speech production is modeled as a source (the glottal excitation, which sets pitch and voicing) passed through a filter (the vocal tract, which shapes articulation).
Mathematically:
speech = source ∗ filter (convolution in the time domain)
This model explains why pitch and articulation are separable.
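Stated compactly, with e[n] for the excitation and h[n] for the vocal-tract impulse response (notation chosen here for illustration):

```latex
% Source-filter model: excitation convolved with the vocal-tract response
\[
  s[n] = e[n] * h[n]
  \quad\Longleftrightarrow\quad
  S(\omega) = E(\omega)\, H(\omega)
\]
% Taking log magnitudes turns the product into a sum, so source and filter
% become additive and separable (the idea behind cepstral analysis):
\[
  \log\lvert S(\omega)\rvert = \log\lvert E(\omega)\rvert + \log\lvert H(\omega)\rvert
\]
```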
Poles and Zeros
Pole-zero modeling explains how vocal tract shape affects speech sounds: poles correspond to formants (vocal-tract resonances), while zeros correspond to anti-resonances such as those introduced by the nasal cavity.
LPC (Linear Predictive Coding)
LPC models the vocal tract filter as an all-pole filter: each sample is predicted as a linear combination of the previous p samples, and the prediction coefficients describe the spectral envelope.
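A sketch of the autocorrelation method for LPC, solving the Toeplitz normal equations with SciPy; order 12 is a typical choice for 16 kHz speech, and the frame below is a random stand-in:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_coefficients(frame, order=12):
    """All-pole (LPC) coefficients via the autocorrelation method."""
    # Autocorrelation of the frame at lags 0..order
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    r = r[: order + 1]
    # Solve the symmetric Toeplitz normal equations R a = r for the predictor
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])
    return a          # prediction: x[n] ~ sum_k a[k] * x[n-1-k]

frame = np.hamming(400) * np.random.randn(400)
print(lpc_coefficients(frame, order=12))
```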
Cepstral Analysis
Cepstral analysis takes the inverse transform of the log magnitude spectrum, turning the source-filter product into a sum so that the slowly varying filter (envelope) can be separated from the rapidly varying source (pitch). It is the foundation of MFCCs.
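A minimal real-cepstrum sketch in NumPy; the small epsilon guarding the log is an implementation detail, not part of the definition:

```python
import numpy as np

def real_cepstrum(frame, n_fft=512):
    """Inverse FFT of the log magnitude spectrum (the real cepstrum)."""
    spectrum = np.fft.rfft(frame, n=n_fft)
    log_mag = np.log(np.abs(spectrum) + 1e-10)   # epsilon avoids log(0)
    return np.fft.irfft(log_mag, n=n_fft)

frame = np.hamming(400) * np.random.randn(400)
cep = real_cepstrum(frame)
# Low quefrencies describe the vocal-tract envelope (filter);
# a peak at higher quefrency reveals the pitch period (source).
```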
Mel-Frequency Cepstral Coefficients
Why MFCCs? The Mel scale approximates the ear's nonlinear frequency resolution, the log compresses dynamic range in line with loudness perception, and the DCT decorrelates the filterbank outputs into a compact set of coefficients.
MFCC Pipeline: pre-emphasis → framing → windowing → FFT → Mel filterbank → log → DCT → keep the first ~13 coefficients.
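A sketch of the pipeline via librosa's built-in implementation, with frame and hop sizes matching the earlier sections (random noise stands in for speech):

```python
import numpy as np
import librosa

sr = 16000
y = np.random.randn(sr).astype(np.float32)    # stand-in for one second of speech

# 13 MFCCs per 25 ms frame with a 10 ms hop
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
print(mfcc.shape)    # (13, 101) with librosa's default frame centering
```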
Prosodic Features
Pitch (Fundamental Frequency, F0): the rate of vocal fold vibration; carries intonation and speaker identity.
Speaking Rate: syllables or words per unit time; reflects tempo and fluency.
Intensity: frame-level energy, often in dB; correlates with perceived loudness and stress.
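A pitch-tracking sketch using librosa's probabilistic YIN implementation; the C2–C6 search range is an illustrative choice, and real speech (not the random stand-in below) is assumed for meaningful output:

```python
import numpy as np
import librosa

sr = 16000
y = np.random.randn(2 * sr).astype(np.float32)   # stand-in for real speech

# Frame-level F0 estimates; NaN marks frames judged unvoiced
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr)
mean_f0 = np.nanmean(f0)    # a simple pitch summary over voiced frames
```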
Voice Quality Features
Jitter: cycle-to-cycle variation in the pitch period.
Shimmer: cycle-to-cycle variation in amplitude.
Harmonics-to-Noise Ratio (HNR): the ratio of periodic (harmonic) energy to noise energy; low HNR indicates a breathy or hoarse voice.
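A sketch of illustrative jitter and shimmer definitions from per-cycle measurements; the period and amplitude values below are hypothetical, and production tools such as Praat use more refined variants:

```python
import numpy as np

def jitter_shimmer(periods, amplitudes):
    """Relative jitter and shimmer from per-cycle pitch periods and
    peak amplitudes (simplified, illustrative definitions)."""
    jitter = np.mean(np.abs(np.diff(periods))) / np.mean(periods)
    shimmer = np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)
    return jitter, shimmer

# Hypothetical glottal-cycle measurements for a short voiced stretch
periods = np.array([0.0100, 0.0102, 0.0099, 0.0101])   # seconds per cycle
amps = np.array([0.80, 0.78, 0.82, 0.79])              # peak amplitude per cycle
print(jitter_shimmer(periods, amps))
```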
High-Level Statistical Functionals
Instead of frame-level features, compute utterance-level statistics of each feature: mean, standard deviation, minimum, maximum, and percentiles.
This converts variable-length speech → fixed-length vectors.
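A sketch of simple functionals over a frame-level feature matrix; the MFCC matrix here is random, standing in for the output of the earlier MFCC sketch:

```python
import numpy as np

def functionals(features):
    """Collapse a (n_coeffs, n_frames) feature matrix into one fixed vector."""
    return np.concatenate([
        features.mean(axis=1), features.std(axis=1),
        features.min(axis=1), features.max(axis=1),
    ])

mfcc = np.random.randn(13, 101)    # frame-level MFCCs, variable n_frames in practice
vec = functionals(mfcc)
print(vec.shape)                   # (52,) regardless of utterance length
```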
Deep Speech Features
Extract embeddings from pretrained models.
Examples: wav2vec 2.0, HuBERT, and the Whisper encoder.
Advantages: learned from large speech corpora, they capture phonetic, speaker, and prosodic information jointly and often outperform hand-crafted features on downstream tasks.
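A sketch of embedding extraction with torchaudio's bundled wav2vec 2.0 model, assuming torchaudio is available and the input is 16 kHz mono (random noise stands in for speech):

```python
import torch
import torchaudio

# Pretrained wav2vec 2.0 (base) bundled with torchaudio; expects 16 kHz mono
bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

waveform = torch.randn(1, bundle.sample_rate)   # stand-in for one second of speech
with torch.inference_mode():
    features, _ = model.extract_features(waveform)
# One tensor of shape (1, n_frames, 768) per transformer layer
print(len(features), features[-1].shape)
```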
Feature Choice by Task
Task: Common Features
ASR: MFCCs, log-Mel, deep embeddings
Speaker Diarization: MFCCs, x-vectors
Emotion Recognition: MFCCs + prosody
Voice Disorder Detection: jitter, shimmer, HNR
Keyword Spotting: MFCCs, log-Mel
Medical Speech: MFCCs + pitch + embeddings