CMSC 848U: Selected Topics in Information Processing; Modern Computational Speech and Audition
1
Large-scale models trained on audio data, capable of multiple downstream tasks
Audio
Foundation
Model
🎙 ASR / Transcription
Whisper (OpenAI, 2022)
Multilingual ASR, 680k hrs of data
Wav2Vec 2.0 (Meta, 2020)
Self-supervised speech repr.
HuBERT (Meta, 2021)
Hidden-unit BERT for speech
🏷 Audio Classification
PANN (Kong et al., 2020)
AudioSet pretrained, 527 classes
AST (MIT, 2021)
Audio Spectrogram Transformer
BEATs (MSFT, 2023)
Audio tokenizer + classifier
CLAP (LAION, 2023)
Zero-shot cls via text queries
💬 Other Audio Tasks
AudioPaLM (Google, 2023)
Speech QA & translation (LLM)
Pengi (MSFT, 2023)
Audio+text open-ended QA
Audio Flamingo Series (2024, 2025)
Few-shot audio reasoning
Qwen-Audio (Alibaba, 2023)
Multi-task audio-language model
Refs: Radford et al. (Whisper, 2022); Baevski et al. (Wav2Vec 2.0, NeurIPS 2020); Hsu et al. (HuBERT, 2021); Kong et al. (PANN, 2020); Gong et al. (AST, 2021); Chen et al. (BEATs, ICML 2023); Wu et al. (LAION-CLAP, 2023); Rubenstein et al. (AudioPaLM, 2023); Deshmukh et al. (Pengi, NeurIPS 2023); Huang et al. (Audio Flamingo, 2024).
Audio Foundation Models
🔊 Retrieval
LAION-CLAP (LAION, 2023)
Zero-shot cls via text queries
CompA-CLAP (UMD, 2024)
Compositional audio-text reasoning
Paradigm Shift: Beyond the Cascade
3
Large Audio Language Model (LALM)
4
⚡ The Paradigm Shift
Moving from single-task, narrow-domain models to general-purpose, open-ended auditory reasoning engines.
Single-Task
ASR / Classifier
General LALM
Open-Ended Reasoning
⚠ ASR-Based Pipelines
Cascaded systems fundamentally lose non-linguistic acoustic cues:
Audio
ASR
Text
LLM
✘ Acoustic cues lost at ASR step
⚖ Alignment vs. End-to-End
Contrastive (e.g. CLAP)
End-to-End LALM
Audio
Text
Shared
Space
Audio
LLM
(AR gen.)
Contrastive (CLAP)
End-to-End (LALM)
[1] Radford et al., Robust Speech Recognition via Large-Scale Weak Supervision (Whisper), ICML 2023. [2] Baevski et al., wav2vec 2.0, NeurIPS 2020. [3] Wu et al., LAION-CLAP, ICASSP 2023. [4] Deshmukh et al., Pengi: An Audio Language Model for Audio Tasks, NeurIPS 2023.
DEFINITION
LALMs are multimodal architectures that natively integrate continuous audio signals with discrete text spaces, enabling LLMs to “hear” and reason over acoustics, speech, and environmental sounds.
5
Challenges in Audio Processing
Limited Audio Datasets
Audio data is scarce compared to other modalities, with minimal availability for certain tasks.
Reasoning data: Most datasets focus on recognition, not complex reasoning.
Long audio: most datasets contain only 5–10 s clips; long-form sound and music data are nearly nonexistent outside ASR.
Weak Audio Representations
Audio encoders achieve ~60% on AudioSet, far below 95%+ on ImageNet.
Models struggle with compositional audio structure (event order) and linguistic variations.
Speech, sounds, and music are still treated separately rather than holistically.
Lacking Evaluation Benchmarks
Most benchmarks are recognition-focused and simplistic, limiting real-world progress.
25+ reasoning benchmarks for vision vs. only a handful for audio.
No benchmarks dedicated to real-world understanding & reasoning tasks.
Dataset Scale Comparison
Language
Billions of tokens
Vision
5B+ image-text pairs
Speech
60k+ hours
Non-verbal Audio
~1k hours
Structure of an LALM
6
Raw Audio
Waveform / Spectrogram
① Audio Encoder
Whisper · WavLM · BEATs
② Projection Layer
Q-Former · Linear · MLP
③ LLM Backbone
Llama · Vicuna · Qwen
Text Output
Response / Caption
↔ Cross-modal alignment spans the Encoder → Projection → LLM boundary
① Audio Encoder
The sensory organ of the LALM. Extracts high-dimensional acoustic features from raw waveforms or spectrograms.
Examples:
② Projection Layer
The translator. Downsamples and aligns continuous audio embeddings into the LLM’s discrete text-token space.
Variants:
③ LLM Backbone
The reasoning engine. Processes the fused multimodal sequence autoregressively to generate text.
Examples:
④ Cross-Modal Alignment
The optimization objective ensuring audio embeddings occupy the same semantic space as text tokens.
e.g. audio of “dog barking” ↔ text token “bark”
Methods:
Refs: [1] Hsu et al., WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing, IEEE JSTSP 2022. [2] Chen et al., BEATs: Audio Pre-Training with Acoustic Tokenizers, ICML 2023. [3] Li et al., BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, ICML 2023 (Q-Former design adopted in audio LALMs).
Some real architectures
7
AudioPaLM
Audio Flamingo 3
Llama-Mimi
GAMA
Core Problems in LALMs
8
Temporal Resolution Mismatch
Audio is sampled at 16–48 kHz, producing massive sequence lengths that overwhelm LLM context windows.
Self-attention scales O(N²) with sequence length — a 10-second clip at 16 kHz gives 160,000 samples before any encoding.
LLM ctx
Raw audio (10 s @ 16kHz = 160k samples)
Sequence length mismatch:
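The mismatch is easy to quantify. A back-of-envelope sketch (the 50 frames/s encoder rate matches Whisper's output; the 4,096-token context is an assumed typical budget):

```python
# Back-of-envelope: raw audio vs. a typical LLM context window.
sample_rate = 16_000                       # Hz
clip_seconds = 10
raw_samples = sample_rate * clip_seconds   # 160,000 samples for a 10 s clip

encoder_frame_rate = 50                    # Whisper's encoder emits ~50 frames/s
encoder_frames = encoder_frame_rate * clip_seconds  # 500 frames after encoding

llm_context = 4_096                        # assumed typical LLM context budget
print(raw_samples, encoder_frames, raw_samples // llm_context)
# prints: 160000 500 39  (raw samples alone exceed the context ~39x)
```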
Token Compression
Pool or stride audio frames to compress seconds of audio into a handful of tokens without losing semantic density.
Key papers:
1000s of audio frames
~32 tokens
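One minimal compression scheme is strided mean-pooling down to a fixed token budget. Q-Former-style learned queries are the common alternative; this pooling variant is an illustrative assumption, not any specific paper's method:

```python
# Sketch: mean-pool T encoder frames into a fixed budget of 32 audio tokens.
def compress_frames(frames, n_tokens=32):
    """frames: list of T feature vectors -> list of n_tokens pooled vectors."""
    T = len(frames)
    # Split the time axis into n_tokens roughly equal chunks.
    bounds = [round(i * T / n_tokens) for i in range(n_tokens + 1)]
    pooled = []
    for lo, hi in zip(bounds, bounds[1:]):
        chunk = frames[lo:hi]
        dim = len(chunk[0])
        # Average each feature dimension across the chunk.
        pooled.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return pooled

frames = [[float(t)] * 4 for t in range(1500)]   # 1500 frames, 4-dim features
tokens = compress_frames(frames)
print(len(tokens), len(tokens[0]))               # prints: 32 4
```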
Spectrogram vs. Waveform
A persistent debate in representation choice for audio encoders.
Spectrogram (e.g. AST)
Raw Waveform (e.g. EnCodec)
- Processes the raw waveform with a 1D convolutional encoder.
- Discretizes with residual vector quantization (RVQ).
[1] Tang et al., SALMONN: Towards Generic Hearing Abilities for Large Language Models, ICLR 2024. arXiv:2310.13289. [2] Deshmukh et al., Pengi: An Audio Language Model for Audio Tasks, NeurIPS 2023. arXiv:2305.11834. [3] Gong et al., AST: Audio Spectrogram Transformer, Interspeech 2021. [4] Défossez et al., High Fidelity Neural Audio Compression (EnCodec), TMLR 2023. [5] Ghosh et al., GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities, 2024. [6] Audio Flamingo 3, NVIDIA, 2025.
SECTION 2
Three hierarchical challenge levels: Representation › Reasoning › Interaction
Now that we understand the architecture, we map the research landscape — categorizing core challenges into three hierarchical levels: Representation, Reasoning, and Interaction.
REPRESENTATION BOTTLENECKS
Addressing the core problems
9
Reasoning Problems (Space and Time)
10
Multi-Event & Temporal Grounding
Models handle “What is this?” but struggle with “What happened at 0:45?”
LTU tries to address this via temporal-localization training on timestamped QA pairs [1].
Global understanding vs. local grounding:
Note: temporal grounding remains a major challenge even after LTU.
0:10
0:45
1:20
?
Spatial Reasoning
Deriving 3D geometry and acoustic room properties from binaural or multichannel audio is computationally complex.
👂
Multi-Speaker Modeling
Overlapping speech (the Cocktail Party Problem) causes catastrophic interference in standard attention.
Recent approach: speaker-diarized tokenization — pre-segment speakers before LLM ingestion [3].
Mixed signal → attention collapse:
⚡ attn. collapse
Long-Context Audio
Maintaining acoustic context over hour-long recordings requires memory beyond standard positional encodings.
Solutions:
Hierarchical memory layers
[1] Gong et al., Listen, Think, and Understand (LTU), ICLR 2024. [2] Comminiello et al., Spatial Audio Deep Learning, IEEE SPM 2022. [3] Cornell et al., One model to rule them all: Towards Joint Speaker Diarization and Recognition, ICASSP 2023. [4] Zhao et al., LongAudio: Towards Long-Context Audio Language Models, 2024.
SECTION 2
Three hierarchical challenge levels: Representation › Reasoning › Interaction
Even with good representations, LALMs must reason when events occur, where sounds originate, who is speaking, and how long context must be maintained.
LEVEL 2 — REASONING PROBLEMS
Addressing the Core Problem of Temporal Grounding
11
Raw Caption Text "1.12s a dog barking..."
FLOAT_PATTERN Regex "1.12s..." → "<|1.12|>s..."
Tokenizer Token IDs (e.g. 151762 for <|1.12|>)
Cross-Entropy Loss Per-token losses computed
Weighted Loss timestamp_weighted average loss
Audio spectrogram with caption overlay
PROCESSING PIPELINE
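The FLOAT_PATTERN step above can be sketched with a single regex pass. The exact pattern used in the pipeline is an assumption; the transformation it performs is as shown on the slide:

```python
import re

# Assumed form of the pipeline's FLOAT_PATTERN: a float followed by "s".
FLOAT_PATTERN = re.compile(r"(\d+\.\d+)(?=s)")

def wrap_timestamps(caption):
    """Wrap float timestamps in special-token delimiters: '1.12s' -> '<|1.12|>s'."""
    return FLOAT_PATTERN.sub(lambda m: f"<|{m.group(1)}|>", caption)

print(wrap_timestamps("1.12s a dog barking..."))  # prints: <|1.12|>s a dog barking...
```

The tokenizer then maps each `<|1.12|>`-style string to a single dedicated token ID, so timestamps survive as atomic units rather than being split into digit pieces.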
Instruction & Interaction Problems
12
SECTION 3
Taxonomy › Level 3: Instruction & Interaction
Having built robust representations, models must now follow instructions and engage in multi-modal dialogue across text and audio.
▶ 3 key challenges in instruction-following and interactive audio understanding
Zero-Shot Instruction Following
Challenge: Generalize to unseen audio tasks at inference time without task-specific fine-tuning.
Core issue: Instruction-tuning datasets for audio tasks are scarce; novel queries like "describe the mood" require broad generalization.
Seen in: Audio Flamingo 3
Multi-Turn Audio Dialogue
Challenge: Maintain conversational context across turns where inputs interleave text and audio.
Core issue: LLM context windows do not natively handle audio token sequences across turns; representations shift as the conversation evolves.
Seen in: Audio Flamingo 3 (multi-turn audio-text dialogue)
Audio Generation Conditioning
Challenge: Move beyond understanding to emitting audio tokens, requiring joint probability of text and audio codes.
Core issue: Decoder must model p(audio codes | text, context) — harder than classification or captioning.
Seen in: AudioPaLM, UALM (unified audio-language generation)
New task
?
Answer ✓
Turn 1: [audio]
Response 1
Turn 2: [text]
Text in
c1
c2
c3
♫ Audio out
References: Ghosh et al. Audio Flamingo 3. Rubenstein et al. AudioPaLM (arXiv:2306.12925). Tian et al. UALM: Unified Audio-Language Model (2024).
13
The Tokenization Debate: Continuous vs. Discrete
📡 Continuous Embeddings
Raw encoder vectors (e.g., Whisper hidden states)
✓ High fidelity — no quantization loss
✓ Rich acoustic detail preserved
✗ Cannot auto-regressively generate audio
✗ Variable-length sequences; hard to decode
VS
🎵 Discrete Audio Codes (RVQ)
q1
q1
q1
q1
q1
q2
q2
q2
q2
q2
q3
q3
q3
q3
q3
RVQ L1
RVQ L2
RVQ L3
Neural Audio Codec (EnCodec, SoundStream) → discrete vocab
✓ LLM treats audio just like text tokens
✓ True bidirectional text↔audio (AudioLM)
✗ Quantization noise — fidelity loss
✗ Codebook collapse under distribution shift
✨ SYNTHESIS
Discrete codes let LLMs model audio like text — enabling true bidirectional generation (AudioLM) at the cost of quantization noise.
References: Defossez et al. EnCodec (arXiv:2210.13438). Borsos et al. AudioLM (arXiv:2209.03143). Zeghidour et al. SoundStream (arXiv:2107.03312).
14
Why Do We Need Residual Vector Quantization?
✗ THE PROBLEM
⚠ NAIVE SOLUTION
✓ RVQ SOLUTION
6.85
Encoder output (continuous float)
LLM only understands
integers (tokens)
Audio encoders output high-precision floats. LLMs need discrete integers. There is a fundamental representation mismatch.
6.85 → round() → 7
✗ FLAW
Rounding throws away fine-grained acoustic detail: pitch, timbre, emotion. A single integer cannot capture the richness of an audio frame.
Break 6.85 into layers of precision:
CB1: ~6.0 (nearest foot)
CB2: +0.9 (nearest inch)
CB3: −0.05 (nearest mm)
CB4: +0.00 (perfect)
Sum = 6.0 + 0.9 − 0.05 = 6.85 ✓
Ref: Defossez et al. EnCodec (arXiv:2210.13438). van den Oord et al. VQ-VAE (arXiv:1711.00937).
15
Defining Our Audio Frame and Codebooks
Our encoder outputs a single frame value of 6.85. We have 4 Codebooks, each with 4 entries (Token IDs 1–4). Highlighted = winning match.
Codebook 1 (Coarse)
ID 1: 0.0
ID 2: 3.0
ID 3: 6.0 ← match
ID 4: 9.0
Codebook 2 (Fine)
ID 1: 0.0
ID 2: 0.3
ID 3: 0.6
ID 4: 0.9 ← match
Codebook 3 (Finer)
ID 1: -0.10
ID 2: -0.05 ← match
ID 3: 0.00
ID 4: 0.05
Codebook 4 (Finest)
ID 1: -0.02
ID 2: -0.01
ID 3: 0.00 ← match
ID 4: 0.01
Winning Token IDs for 6.85: CB1 → 3 | CB2 → 4 | CB3 → 2 | CB4 → 3
6.85 encodes as token sequence: [ 3, 4, 2, 3 ]
📝 Note: Real models (e.g., EnCodec) use 128-dim vectors with codebook sizes of 1024 — the math is identical!
16
Quantizing the Audio — The Forward Pass
Target value: 6.85
1
Codebook 1
Coarse Match
Input residual
6.85
Closest match
6.0
Token ID: 3
Residual = 6.85 − 6.0
0.85
Pass 0.85 → next layer
2
Codebook 2
Fix First Error
Input residual
0.85
Closest match
0.9
Token ID: 4
Residual = 0.85 − 0.9
−0.05
Pass −0.05 → next layer
3
Codebook 3
Getting Closer
Input residual
−0.05
Closest match
−0.05
Token ID: 2
Residual = −0.05−(−0.05)
0.00
Pass 0.00 → next layer
4
Codebook 4
Final Polish
Input residual
0.00
Closest match
0.00
Token ID: 3
Residual = 0.00 − 0.00
0.00
✓ Perfect match!
Result: 6.85 → Token IDs: 3, 4, 2, 3 = [ 3, 4, 2, 3 ]
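The four-layer walk-through above can be reproduced directly. This toy implementation uses the slide's scalar codebooks; real codecs such as EnCodec quantize 128-dimensional vectors against 1024-entry codebooks, but the procedure is identical:

```python
# The slide's four scalar codebooks (coarse -> finest).
CODEBOOKS = [
    [0.0, 3.0, 6.0, 9.0],        # CB1: coarse
    [0.0, 0.3, 0.6, 0.9],        # CB2: fine
    [-0.10, -0.05, 0.00, 0.05],  # CB3: finer
    [-0.02, -0.01, 0.00, 0.01],  # CB4: finest
]

def rvq_encode(x):
    """Greedily quantize x, passing the residual down to each next codebook."""
    ids, residual = [], x
    for cb in CODEBOOKS:
        i = min(range(len(cb)), key=lambda j: abs(residual - cb[j]))
        ids.append(i + 1)        # 1-indexed token IDs, as on the slide
        residual -= cb[i]
    return ids

def rvq_decode(ids):
    """Reconstruct by summing the selected entry from every codebook."""
    return sum(cb[i - 1] for cb, i in zip(CODEBOOKS, ids))

ids = rvq_encode(6.85)           # [3, 4, 2, 3]
value = rvq_decode(ids)          # 6.85, up to float rounding
```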
17
From Continuous Audio to LLM Tokens
Audio frame
6.85
RVQ Encoder
4 Codebooks
Token sequence
[ 3, 4, 2, 3 ]
LLM processes
like text tokens ✓
📤 The Encoding
Continuous value 6.85 is permanently converted into a discrete token sequence:
[ 3, 4, 2, 3 ]
The LLM treats these exactly like word tokens — no special treatment needed.
📥 The Decoding (Reconstruction)
3
Token 3 (CB1) = 6.0
4
Token 4 (CB2) = 0.9
2
Token 2 (CB3) = −0.05
3
Token 3 (CB4) = 0.00
Total = 6.0 + 0.9 − 0.05 + 0.00 =
6.85 ✓
RVQ = a near-lossless bridge between continuous audio and discrete LLM token space
Ref: Defossez et al. EnCodec (arXiv:2210.13438). Borsos et al. AudioLM (arXiv:2209.03143). van den Oord et al. VQ-VAE (arXiv:1711.00937).
18
Training Paradigms & Strategies
How Audio LLMs learn: from pre-training through alignment
SSL / Contrastive Learning
Foundation Stage
Goal: Map audio embeddings to a shared text space
Method: Contrastive loss (e.g., CLAP)
Strength: Excellent global semantic understanding
Limitation: Fails at fine-grained temporal event ordering
Data: ASR pairs (LibriSpeech) + audio captioning
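The contrastive objective named above can be written out in a few lines. This is a deliberately tiny CLAP-style sketch; the function name and toy embeddings are illustrative, not from any specific implementation:

```python
import math

# Symmetric contrastive loss: paired audio/text embeddings are pulled
# together, all mismatched pairs in the batch are pushed apart.
def contrastive_loss(audio, text, temperature=0.07):
    """audio, text: lists of B unit-norm vectors; audio[i] pairs with text[i]."""
    B = len(audio)
    sims = [[sum(a * t for a, t in zip(audio[i], text[j])) / temperature
             for j in range(B)] for i in range(B)]

    def xent(rows):  # softmax cross-entropy with the diagonal as the true class
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / B

    cols = [list(c) for c in zip(*sims)]      # text-to-audio direction
    return 0.5 * (xent(sims) + xent(cols))

eye = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
matched = contrastive_loss(eye, eye)          # aligned pairs -> low loss
shuffled = contrastive_loss(eye, eye[::-1])   # misaligned pairs -> higher loss
```

Because the loss only compares whole clips against whole captions, it rewards global semantic agreement, which is exactly why it fails at fine-grained temporal event ordering.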
Curriculum Training
Multi-Stage Pipeline
Goal: Generalization across tasks via staged learning
Stage 1: Modality alignment – map audio to text
Stage 2-N: Instruction tuning with Activation Tuning to prevent overfitting
Stage N+1: Preference alignment (DPO) for safety (Optional)
Innovation: Tasks converted to prompts (e.g., '[AUDIO] Identify emotion')
Pre-train (SSL) → Modality Alignment → Instruction Tuning → Preference Alignment (DPO) → Deployed Model
Qwen2-Audio, Audio Flamingo 3, GAMA
19
The Three-Stage Training Pipeline of Qwen2-Audio
Stage 1:
Modality Alignment
Goal
Map audio embeddings to text space
Data
ASR pairs (LibriSpeech) + Audio Captioning
Outcome
Transcribes audio but cannot chat or reason
→
Stage 2:
Instruction Tuning
Goal
Generalization across diverse audio tasks
Method
Convert tasks to prompts (e.g., '[AUDIO] Identify emotion')
Innovation
Activation Tuning prevents catastrophic forgetting and overfitting
→
Stage 3:
Preference Alignment
Goal
Safety, helpfulness, reduced hallucination
Method
DPO – Direct Preference Optimization (no reward model needed)
Result
Significantly reduces safety refusals in Qwen2-Audio
Qwen2-Audio (Chu et al., 2024) – Three-Stage Training Pipeline
20
Aligning Behavior: Direct Preference Optimization (DPO)
The Problem
Models can be helpful but unsafe (susceptible to jailbreaks) or hallucinate (confidently fabricate information).
→
The Solution: DPO
Optimize policy πθ to prefer response yw (winning) over yl (losing) without a reward model.
DPO Equation
L_DPO = −log σ( β log[ π_θ(y_w|x) / π_ref(y_w|x) ] − β log[ π_θ(y_l|x) / π_ref(y_l|x) ] )
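A per-pair sketch of this loss, given sequence log-probabilities under the policy and the frozen reference (variable names are illustrative):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid of the beta-scaled
    log-ratio margin between winning (y_w) and losing (y_l) responses."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# At initialization (policy == reference) the margin is 0, so loss = ln 2.
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))  # prints: 0.6931...
```

Training pushes the margin positive, i.e. the policy raises the probability of y_w relative to y_l faster than the reference does, without ever fitting an explicit reward model.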
Impact & Metrics
1
Qwen2-Audio
DPO significantly reduces safety refusals while maintaining helpfulness across audio understanding tasks.
2
Reliability Gain Index (RGI)
Measures 'humbleness' — the model's ability to admit ignorance rather than hallucinate confidently.
Rafailov et al. (2023), DPO; Chu et al. (2024), Qwen2-Audio
Instruction Tuning & Synthetic Bootstrapping
!
THE DATA SCARCITY PROBLEM
No massive "Common Crawl" of perfectly transcribed, multi-event environmental audio exists.
SOLUTIONS
1
LLMs as Annotators
Papers increasingly use powerful LLMs to generate synthetic dialogues, captions, and Q&A pairs from weak audio tags.
e.g., WavCaps
2
Multi-Task Training
Mixing ASR, captioning, music analysis, and QA in a single instruction-tuning batch to force generalized representation.
ASR • Captioning • Music • QA
3
Bootstrapping Pipeline
Weak audio tags
↓
LLM generates captions
↓
Train on synthetic data
22
GAMA: CompA-R Reasoning Dataset
1
Pre-training Stage
Large-scale audio-language pair alignment
↓
2
Instruction-Tuning on CompA-R
Complex audio reasoning with structured Q&A
What is CompA-R?
A dataset designed to train models to understand and reason about complex audio scenarios. Uses a synthesis pipeline: Audio Events → Caption Generation → Instruction Generation → Verified Q&A pairs.
Pipeline to synthesize CompA-R dataset
Ghosh et al., "GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities"
AudioSkills (Audio Flamingo 3)
Large-Scale Audio QA
The first attempt at creating a large-scale Audio-QA dataset focused on expert-level skills across diverse audio domains.
Skill Taxonomy & Data Pipeline
Task categories & skill distribution
Reasoning & evaluation framework
Kong et al., "Audio Flamingo 3 / AudioSkills" - NVIDIA
AudioSkills: LongAudioBench Pipeline
Multi-step synthesis: Video/Audio Segments → Segment Captions → LLM Q&A Generation → Expert Review → LongAudioBench
End-to-end pipeline for LongAudioBench dataset construction with self-verification
✓ Self-Verification Loop
✓ Expert Review Stage
✓ Multi-Modal Captions
25
GAMA: Qualitative Examples
Evaluation of LALMs
26
1
Rapid Growth
Growing number of LALMs: LTU, Pengi, SALMONN, Qwen-Audio, and more emerging models
2
No Standard Framework
No standardized evaluation framework exists — each model uses its own metrics and benchmarks
3
Fragmented Evaluation
Most works rely on ad-hoc evaluation setups, making fair comparison across models difficult
4
Benchmarks Fall Short
AudioCaps, Clotho, etc. fail to evaluate holistic LALM performance — many tasks don’t require expert-level ability
Key gap: A unified, holistic evaluation benchmark for LALMs is urgently needed
27
Speech • Sound • Music
MMAU: Massive Multi-Task Audio Understanding & Reasoning
MMLU (language) and MMMU (vision) pushed model capabilities forward — but nothing equivalent existed for audio. MMAU fills that gap.
10,000 QA Pairs
Expert Curated
MMAU: 10k curated audio clips paired with natural language Q&A spanning speech, sound, and music.
28
MMAU: Skill Distribution
27 distinct skills: 11 Information Extraction (left) and 16 Reasoning (right)
29
MMAU Benchmark: Model Performance Over Time
30
MMAU-Pro: Key Features
Multi-Cultural Music
Multi-Hop Reasoning
Variable-Length Audio
Speech-to-Speech
Extends MMAU with more complex evaluation settings and challenging skills
Evaluates understanding of diverse global music traditions and sound
Questions requiring multiple reasoning steps over audio content
Short, medium, and long audio clips including 5+ minute recordings
Tests interaction skills with speech input and speech output
31
Model | MMAU | MMAR | MMSU
gemma-3-12b-it (LLM) | 42.1 | 37.5 | 32.4
gemma-3-27b-it (LLM) | 48.3 | 35.8 | 43.6
Llama-3.1-8B-Instruct (LLM) | 37.5 | 32.4 | 38.4
Qwen2.5-Omni-7B (LALM) | 68.1 | 57.3 | 50.7
Audio Flamingo 3 (LALM) | 72.8 | 58.0 | 59.7
Random MCQ baseline = 25%
Why MMAU-Pro? Existing Benchmark Shortcomings
Text-only LLMs (without audio) can score nearly as high as LALMs — revealing that many questions don’t actually require audio understanding
32
MMAU-Pro: Most Comprehensive Audio Benchmark
MMAU-Pro uniquely covers long audio, multi-audio, spatial audio, and multicultural music diversity
33
11 Domains
Total: ~5000 QA
Multi-Audio QA:
MMAU-Pro Stats
34
MMAU-Pro Stats
35
New benchmarks reveal new shortcomings of frontier models
Models perform worst at test time on music cultures not seen during training.
36
New benchmarks reveal new shortcomings of frontier models
Comparison of top 3 best and worst performing skills.
Performance comparison of instruction following capabilities.
37
New benchmarks reveal new shortcomings of frontier models
38
Failures & Vulnerabilities
LALMs exhibit critical failure modes that limit real-world deployment and reliability
!
Object Hallucination
Generating descriptions of sounds that don’t exist in the audio. Models confidently describe non-present events due to training data biases.
!
Non-Speech Deafness
Filtering out background events (door slams, alarms) as ‘noise’ due to ASR-heavy training. Critical environmental sounds are missed.
!
Adversarial Jailbreaks
Voice modulation (changing emotion/tone) can bypass text-based safety filters. Audio-specific attack vectors remain largely unaddressed.
!
Temporal Confusion
Inability to correctly sequence or timestamp events in audio. Models struggle to distinguish ‘before’ and ‘after’ in temporal reasoning tasks.
Implication: Addressing these vulnerabilities is essential before deploying LALMs in safety-critical applications
39
Spatial Audio Understanding & Reasoning
CURRENT STATE
Existing LALMs mostly focus on reasoning over mono-channel audio, leaving spatial audio understanding largely unexplored.
Existing works on spatial audio understanding have key limitations:
1
Lacks Array Invariance
Models are tied to specific microphone array configurations and cannot generalize across different setups.
2
Poor Scalability
Does not scale well to tasks with multiple or moving sound sources in complex scenes.
3
Binaural Only
Can only process binaural audio, limiting support for higher-order formats like ambisonics.
OUR AIM
Develop a comprehensive spatial LALM that is array-agnostic and capable of reasoning over first- and second-order ambisonics.
Thank You!
Questions?
40