1 of 29

Introduction to Speech & Natural Language Processing

Lecture 10

Speech Feature Extraction

Krishnendu Ghosh

2 of 29

Silence, Unvoiced & Voiced Speech

3 of 29

Speech Spectrograms

4 of 29

Speech Feature Extraction

Speech feature extraction is the process of converting a raw audio waveform into a compact, informative numerical representation.

Raw audio is:

  • high-dimensional
  • noisy
  • difficult for ML models to use directly

Speech is resampled for standardization, framed because it is only quasi-stationary over short intervals, windowed to reduce spectral leakage, and the frames are overlapped to preserve temporal continuity.

5 of 29

Speech Signal Levels

Speech information exists at multiple levels:

  • Signal level: waveform, energy, frequency
  • Phonetic level: sounds, articulation
  • Prosodic level: intonation, stress, rhythm
  • Paralinguistic level: emotion, speaker traits

Different features capture different levels.

6 of 29

Knowledge Sources in Speech

  • Message: Thought to be conveyed
  • Speaker: Identity of the speaker
  • Language: Language in which speech is produced
  • Naturalness: Pleasing quality of speech
  • Intonation: Rising or falling pitch
  • Duration: Variations in duration patterns
  • Stress: Uttering with special emphasis (words or syllables)
  • Accent: Dialect region of the speaker
  • Emotions (mood): State of mind
  • Health: Condition of the speech production organs
  • Device and Channel: Microphone and medium through which speech is collected

7 of 29

Knowledge Sources in Speech

8 of 29

Knowledge Sources in Speech

  • Segmental level (10-30 ms): Positioning and movement of articulators, shapes of cavities (oral and nasal), voiced and unvoiced excitation, periodicity (pitch) and formant structure

  • Sub-segmental level (3-5 ms): Glottal pulse shape, open and closure regions of glottis, consonant regions (burst and transition regions)

  • Supra-segmental level (>100 ms): Prosody (duration and pitch), stress, prominence, melody, syntax and semantics.

  • Coarticulation level: Constraints due to linguistic context, i.e., the influence of adjacent units on the articulation of the present unit.

9 of 29

Speech to Digital Signal

  • Speech processing starts with digitization.
  • Speech is:
    • captured using a microphone (air pressure → voltage)
    • sampled in time
    • quantized in amplitude
  • This converts continuous speech into a digital signal.

10 of 29

Sampling and Quantization

  • Sampling converts continuous time → discrete time
  • Quantization converts continuous amplitude → discrete values

Common sampling rates:

  • 8 kHz → telephone speech
  • 16 kHz → standard ASR and diarization

The sampling rate defines the maximum frequency that can be analyzed: by the Nyquist theorem, only content below half the sampling rate is preserved.
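
A minimal NumPy sketch of the two steps (the 440 Hz tone, 20 ms duration and 16-bit depth are illustrative assumptions):

  import numpy as np

  # Sample a 440 Hz tone at 16 kHz and quantize it to signed 16-bit integers,
  # mimicking what an ADC does to the microphone voltage.
  fs = 16000                              # sampling rate in Hz
  t = np.arange(0, 0.02, 1 / fs)          # 20 ms of discrete time instants
  x = 0.5 * np.sin(2 * np.pi * 440 * t)   # continuous-amplitude samples in [-1, 1]
  x_q = np.round(x * 32767).astype(np.int16)   # 2**16 discrete amplitude levels

  print(f"Nyquist limit: {fs / 2} Hz, samples in 20 ms: {len(x)}")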

11 of 29

Preprocessing

Before extracting features, audio is typically:

  • Resampled (e.g., 16 kHz)
  • Framed (20–25 ms frames)
  • Windowed (Hamming window)
  • Overlapped (10 ms hop)

12 of 29

Resampling

Reason

Speech information relevant for intelligibility lies mostly below 8 kHz.

By the Nyquist theorem, sampling at 16 kHz is sufficient to capture it.

Why standardize?

  • Different recordings come at 8 kHz, 16 kHz, 44.1 kHz, etc.
  • Models and features assume a fixed sampling rate
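
A short librosa sketch of standardizing the rate (the file name and the 16 kHz target are assumptions):

  import librosa

  # Load at the native rate, then resample to the standard 16 kHz
  y, sr = librosa.load("recording.wav", sr=None)            # keep original rate
  y_16k = librosa.resample(y, orig_sr=sr, target_sr=16000)  # standardize to 16 kHz
  print(sr, len(y), len(y_16k))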

13 of 29

Framing

Reason

Speech is non-stationary, but over short durations it behaves as approximately stationary (quasi-stationary).

Vocal tract shape ≈ constant for ~20–30 ms

Phoneme identity does not change within this window

Why 20–25 ms?

Short enough to ensure stationarity

Long enough to contain at least one pitch period and resolve formant structure

14 of 29

Windowing

Problem without windowing

Framing causes sharp cuts at frame boundaries → spectral leakage.

Solution

Multiply each frame by a smooth window (Hamming):

Reduces edge discontinuities

Suppresses side lobes in frequency domain

15 of 29

Overlapping

Reason

Speech changes gradually, not abruptly.

Overlapping ensures:

  • Smooth temporal tracking
  • No loss of information at frame boundaries

Why 10 ms hop?

50–60% overlap with 20–25 ms frame

Matches typical phoneme transition speed
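
A NumPy sketch combining framing, Hamming windowing and overlap (25 ms frames with a 10 ms hop at 16 kHz are the assumed settings):

  import numpy as np

  def frame_signal(y, sr, frame_ms=25, hop_ms=10):
      frame_len = int(sr * frame_ms / 1000)    # 400 samples at 16 kHz
      hop_len = int(sr * hop_ms / 1000)        # 160 samples at 16 kHz
      n_frames = 1 + (len(y) - frame_len) // hop_len
      window = np.hamming(frame_len)           # smooth taper against spectral leakage
      return np.stack([
          y[i * hop_len : i * hop_len + frame_len] * window
          for i in range(n_frames)
      ])                                       # shape: (n_frames, frame_len)

  y = np.random.randn(16000)                   # 1 s of dummy audio at 16 kHz
  print(frame_signal(y, 16000).shape)          # -> (98, 400)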

16 of 29

Time-Domain Features

Short-Time Energy

  • Measures loudness per frame
  • Useful for:
    • speech vs silence
    • emphasis detection

Zero Crossing Rate (ZCR)

  • Number of sign changes in waveform
  • High for:
    • unvoiced sounds
    • noise
  • Low for:
    • voiced speech
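
A NumPy sketch of both features for a single frame (the frame is assumed to come from the framing step shown earlier):

  import numpy as np

  def short_time_energy(frame):
      # average squared amplitude: high for loud speech, near zero for silence
      return np.mean(frame ** 2)

  def zero_crossing_rate(frame):
      # fraction of adjacent samples whose sign differs: high for unvoiced sounds and noise
      return np.mean(np.abs(np.diff(np.sign(frame))) > 0)

  frame = np.random.randn(400)                 # dummy 25 ms frame at 16 kHz
  print(short_time_energy(frame), zero_crossing_rate(frame))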

17 of 29

Frequency-Domain Features

FFT (Fast Fourier Transform)

  • Converts time signal → frequency spectrum
  • Shows which frequencies are present

But the raw FFT spectrum is:

  • high-dimensional
  • sensitive to noise
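
A small NumPy sketch of the magnitude spectrum of one windowed frame (a 512-point FFT and a 16 kHz rate are assumed):

  import numpy as np

  frame = np.hamming(400) * np.random.randn(400)    # dummy windowed frame
  spectrum = np.abs(np.fft.rfft(frame, n=512))      # 257 magnitude bins
  freqs = np.fft.rfftfreq(512, d=1 / 16000)         # bin centres in Hz
  print(spectrum.shape, freqs[-1])                  # (257,) 8000.0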

18 of 29

Spectral Features

Spectral Centroid

  • “Center of mass” of spectrum
  • Correlates with brightness of sound

Spectral Bandwidth

  • Spread of frequencies
  • Indicates richness vs sharpness

Spectral Roll-off

  • Frequency below which a fixed fraction (typically 85–95%) of the spectral energy lies
  • Distinguishes speech from noise/music
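
A NumPy sketch computing all three from one magnitude spectrum (the 85% roll-off threshold is an assumed, commonly used value; freqs and spectrum are as in the FFT example above):

  import numpy as np

  def spectral_features(freqs, spectrum, rolloff_pct=0.85):
      p = spectrum / np.sum(spectrum)                  # treat the spectrum as a distribution
      centroid = np.sum(freqs * p)                     # "centre of mass"
      bandwidth = np.sqrt(np.sum((freqs - centroid) ** 2 * p))   # spread around the centroid
      cumulative = np.cumsum(spectrum)
      rolloff = freqs[np.searchsorted(cumulative, rolloff_pct * cumulative[-1])]
      return centroid, bandwidth, rolloff

  freqs = np.fft.rfftfreq(512, d=1 / 16000)
  spectrum = np.abs(np.fft.rfft(np.random.randn(400), n=512))
  print(spectral_features(freqs, spectrum))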

19 of 29

Source–Filter Model of Speech

Speech production is modeled as:

  • Source: excitation from vocal folds or turbulence
  • Filter: vocal tract shaping

Mathematically:

Speech = Source ∗ Filter (convolution in time; equivalently, multiplication of their spectra in the frequency domain)

This model explains why pitch and articulation are separable.

20 of 29

Poles and Zeros

  • Poles represent vocal tract resonances (formants)
  • Zeros represent anti-resonances

Pole-zero modeling explains how vocal tract shape affects speech sounds.

21 of 29

LPC (Linear Predictive Coding)

LPC models the vocal tract filter

  • Predicts the current sample as a weighted sum of past samples: s[n] ≈ a1·s[n−1] + … + ap·s[n−p]
  • LPC coefficients represent vocal tract shape
  • Widely used in classical speech analysis.
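
A librosa sketch of LPC analysis on one frame (the file name, the frame position and order 12 are assumptions; roughly two coefficients per expected formant is a common rule of thumb):

  import numpy as np
  import librosa

  y, sr = librosa.load("speech.wav", sr=16000)      # placeholder recording
  frame = y[8000:8400] * np.hamming(400)            # one (assumed voiced) 25 ms frame
  a = librosa.lpc(frame, order=12)                  # prediction polynomial [1, a1, ..., a12]

  # Resonances of the all-pole model give rough formant estimates
  roots = np.roots(a)
  roots = roots[np.imag(roots) > 0]                 # keep one root of each conjugate pair
  formants = np.sort(np.angle(roots) * sr / (2 * np.pi))
  print(formants[:4])                               # approximate formant frequencies in Hz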

22 of 29

Cepstral Analysis

  • Separates source and filter information in the quefrency domain
  • Slow variations of the log spectrum (low quefrency) → vocal tract envelope
  • Fast variations (high quefrency) → excitation (pitch)

Cepstral analysis is the foundation of MFCCs.
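
A NumPy sketch of the real cepstrum of one frame, showing the source/filter split by quefrency (the cut-off of 30 samples is an arbitrary illustrative choice):

  import numpy as np

  frame = np.hamming(400) * np.random.randn(400)        # dummy voiced frame
  log_spectrum = np.log(np.abs(np.fft.rfft(frame, n=512)) + 1e-10)
  cepstrum = np.fft.irfft(log_spectrum)                 # real cepstrum
  envelope_part = cepstrum[:30]                         # slow variations -> vocal tract
  excitation_part = cepstrum[30:]                       # fast variations -> excitation/pitch
  print(cepstrum.shape)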

23 of 29

Mel-Frequency Cepstral Coefficients

Why MFCCs?

  • Model human auditory perception
  • Compress spectral information
  • More robust to noise than the raw spectrum

MFCC Pipeline:

  • Frame + window signal
  • FFT
  • Mel filterbank
  • Log energies
  • DCT → MFCCs
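
A librosa sketch of the whole pipeline (the file name, 13 coefficients and 40 mel bands are assumed settings; framing, FFT, mel filtering, log and DCT happen inside the call):

  import librosa

  y, sr = librosa.load("speech.wav", sr=16000)   # placeholder recording
  mfcc = librosa.feature.mfcc(
      y=y, sr=sr,
      n_mfcc=13,        # number of cepstral coefficients kept after the DCT
      n_fft=400,        # 25 ms frames at 16 kHz
      hop_length=160,   # 10 ms hop
      n_mels=40,        # mel filterbank size
  )
  print(mfcc.shape)     # (13, n_frames)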

24 of 29

Prosodic Features

Pitch (Fundamental Frequency, F0)

  • Indicates intonation
  • Important for:
    • emotion
    • speaker identity
    • question vs statement

Speaking Rate

  • Words/syllables per second
  • Indicates stress, fluency, pathology

Intensity

  • Loudness variation over time
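
A librosa sketch of pitch and intensity tracks (the file name and the 65–400 Hz pYIN search range are assumptions):

  import numpy as np
  import librosa

  y, sr = librosa.load("speech.wav", sr=16000)              # placeholder recording
  f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=65, fmax=400, sr=sr)
  intensity = librosa.feature.rms(y=y, frame_length=400, hop_length=160)[0]

  print(np.nanmean(f0))       # mean F0 over voiced frames, in Hz (NaN = unvoiced)
  print(intensity.mean())     # crude loudness summary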

25 of 29

Voice Quality Features

Jitter

  • Cycle-to-cycle variation in pitch period
  • Indicates vocal instability

Shimmer

  • Cycle-to-cycle variation in amplitude
  • Used in voice disorder detection

Harmonics-to-Noise Ratio (HNR)

  • Ratio of periodic (harmonic) energy to noise energy; measures voice clarity
  • Very important for:
    • pathology detection
    • clinical speech analysis
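
A NumPy sketch of simplified local jitter and shimmer (the per-cycle periods and peak amplitudes are assumed to come from a pitch tracker; clinical tools such as Praat use more elaborate definitions):

  import numpy as np

  def local_jitter(periods):
      # mean absolute cycle-to-cycle period difference, relative to the mean period
      periods = np.asarray(periods, dtype=float)
      return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

  def local_shimmer(amplitudes):
      # same idea applied to per-cycle peak amplitudes
      amplitudes = np.asarray(amplitudes, dtype=float)
      return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

  periods = [0.0100, 0.0102, 0.0099, 0.0101]     # seconds per glottal cycle (dummy values)
  amplitudes = [0.80, 0.78, 0.82, 0.79]          # peak amplitude per cycle (dummy values)
  print(local_jitter(periods), local_shimmer(amplitudes))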

26 of 29

High-Level Statistical Functionals

Instead of using frame-level features directly, compute summary statistics over them:

  • mean
  • variance
  • min / max
  • percentiles

This converts variable-length speech → fixed-length vectors.
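
A NumPy sketch of the idea, collapsing a frame-level feature matrix into one fixed-length vector:

  import numpy as np

  def functionals(features):
      # features: (n_dims, n_frames), e.g. MFCCs; output length is independent of n_frames
      return np.concatenate([
          features.mean(axis=1),
          features.var(axis=1),
          features.min(axis=1),
          features.max(axis=1),
          np.percentile(features, [25, 75], axis=1).ravel(),
      ])

  mfcc = np.random.randn(13, 98)          # dummy frame-level features
  print(functionals(mfcc).shape)          # (78,) for any number of frames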

27 of 29

Deep Speech Features

Extract embeddings from pretrained models.

Examples:

  • wav2vec-style embeddings
  • HuBERT-style representations

Advantages:

  • capture long-range context
  • task-agnostic
  • strong performance
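
A sketch using Hugging Face transformers to pull wav2vec 2.0 embeddings (the checkpoint facebook/wav2vec2-base and the file name are illustrative choices, and mean pooling is just one simple way to get an utterance vector):

  import torch
  import librosa
  from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

  extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
  model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

  y, sr = librosa.load("speech.wav", sr=16000)              # model expects 16 kHz audio
  inputs = extractor(y, sampling_rate=sr, return_tensors="pt")

  with torch.no_grad():
      outputs = model(**inputs)

  embeddings = outputs.last_hidden_state                    # (1, n_frames, 768)
  utterance_vector = embeddings.mean(dim=1)                 # simple mean pooling
  print(embeddings.shape, utterance_vector.shape)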

28 of 29

Feature Choice by Task

  • ASR → MFCCs, log-Mel, deep embeddings
  • Speaker Diarization → MFCCs, x-vectors
  • Emotion Recognition → MFCCs + prosody
  • Voice Disorder Detection → Jitter, shimmer, HNR
  • Keyword Spotting → MFCCs, log-Mel
  • Medical Speech → MFCCs + pitch + embeddings

Colab Link

29 of 29