CMSC 848U: Selected Topics in Information Processing; Modern Computational Speech and Audition
1
Large-scale models trained on audio data, capable of multiple downstream tasks
Audio
Foundation
Model
🎙 ASR / Transcription
Whisper (OpenAI, 2022)
Multilingual ASR, 680k hrs of data
Wav2Vec 2.0 (Meta, 2020)
Self-supervised speech repr.
HuBERT (Meta, 2021)
Hidden-unit BERT for speech
🏷 Audio Classification
PANN (Kong et al., 2020)
AudioSet pretrained, 527 classes
AST (MIT, 2021)
Audio Spectrogram Transformer
BEATs (MSFT, 2023)
Audio tokenizer + classifier
CLAP (LAION, 2023)
Zero-shot cls via text queries
💬 Other Audio Tasks
AudioPaLM (Google, 2023)
Speech QA & translation (LLM)
Pengi (MSFT, 2023)
Audio+text open-ended QA
Audio Flamingo Series (2024, 2025)
Few-shot audio reasoning
Qwen-Audio (Alibaba, 2023)
Multi-task audio-language model
Refs: Radford et al. (Whisper, 2022); Baevski et al. (Wav2Vec 2.0, NeurIPS 2020); Hsu et al. (HuBERT, 2021); Kong et al. (PANN, 2020); Gong et al. (AST, 2021); Chen et al. (BEATs, ICML 2023); Wu et al. (LAION-CLAP, 2023); Rubenstein et al. (AudioPaLM, 2023); Deshmukh et al. (Pengi, NeurIPS 2023); Huang et al. (Audio Flamingo, 2024).
Audio Foundation Models
🔊 Retrieval
LAION-CLAP (LAION, 2023)
Zero-shot cls via text queries
CompA-CLAP (UMD, 2024)
Compositional audio-text reasoning
Paradigm Shift: Beyond the Cascade
3
Large Audio Language Model (LALM)
4
⚡ The Paradigm Shift
Moving from single-task, narrow-domain models to general-purpose, open-ended auditory reasoning engines.
Single-Task
ASR / Classifier
General LALM
Open-Ended Reasoning
⚠ ASR-Based Pipelines
Cascaded systems fundamentally lose non-linguistic acoustic cues:
Audio
ASR
Text
LLM
✘ Acoustic cues lost at ASR step
⚖ Alignment vs. End-to-End
Contrastive (e.g. CLAP)
End-to-End LALM
Audio
Text
Shared
Space
Audio
LLM
(AR gen.)
Contrastive (CLAP)
End-to-End (LALM)
[1] Radford et al., Robust Speech Recognition via Large-Scale Weak Supervision (Whisper), ICML 2023. [2] Baevski et al., wav2vec 2.0, NeurIPS 2020. [3] Wu et al., LAION-CLAP, ICASSP 2023. [4] Deshmukh et al., Pengi: An Audio Language Model for Audio Tasks, NeurIPS 2023.
DEFINITION
LALMs are multimodal architectures that natively integrate continuous audio signals with discrete text spaces, enabling LLMs to “hear” and reason over acoustics, speech, and environmental sounds.
5
Challenges in Audio Processing
Limited Audio Datasets
Audio data is scarce compared to other modalities, with minimal availability for certain tasks.
Reasoning data: Most datasets focus on recognition, not complex reasoning.
Long audio: most datasets contain only 5–10 s clips; long-form sound and music data are nearly nonexistent outside ASR.
Weak Audio Representations
Audio encoders achieve ~60% on AudioSet, far below 95%+ on ImageNet.
Models struggle with compositional audio structure (event order) and linguistic variations.
Speech, sounds, and music are still treated separately rather than holistically.
Lacking Evaluation Benchmarks
Most benchmarks are recognition-focused and simplistic, limiting real-world progress.
25+ reasoning benchmarks for vision vs. only a handful for audio.
No benchmarks dedicated to real-world understanding & reasoning tasks.
Dataset Scale Comparison
Language
Billions of tokens
Vision
5B+ image-text pairs
Speech
60k+ hours
Non-verbal Audio
~1k hours
Structure of an LALM
6
Raw Audio
Waveform / Spectrogram
① Audio Encoder
Whisper · WavLM · BEATs
② Projection Layer
Q-Former · Linear · MLP
③ LLM Backbone
Llama · Vicuna · Qwen
Text Output
Response / Caption
↔ Cross-modal alignment spans the Encoder → Projection → LLM boundary
① Audio Encoder
The sensory organ of the LALM. Extracts high-dimensional acoustic features from raw waveforms or spectrograms.
Examples:
② Projection Layer
The translator. Downsamples and aligns continuous audio embeddings into the LLM’s discrete text-token space.
Variants:
③ LLM Backbone
The reasoning engine. Processes the fused multimodal sequence autoregressively to generate text.
Examples:
④ Cross-Modal Alignment
The optimization objective ensuring audio embeddings occupy the same semantic space as text tokens.
e.g. audio of “dog barking” ↔ text token “bark”
Methods:
Refs: [1] Hsu et al., WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing, IEEE JSTSP 2022. [2] Chen et al., BEATs: Audio Pre-Training with Acoustic Tokenizers, ICML 2023. [3] Li et al., BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, ICML 2023 (Q-Former design adopted in audio LALMs).
Some real architectures
7
AudioPaLM
Audio Flamingo 3
Llama-Mimi
GAMA
Core Problems in LALMs
8
Temporal Resolution Mismatch
Audio is sampled at 16–48 kHz, producing massive sequence lengths that overwhelm LLM context windows.
Self-attention scales O(N²) with sequence length — a 10-second clip at 16 kHz gives 160,000 samples before any encoding.
LLM ctx
Raw audio (10 s @ 16kHz = 160k samples)
Sequence length mismatch:
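The mismatch is easy to quantify. A back-of-envelope sketch (the 50 frames/s encoder rate matches Whisper's output; the 4,096-token context is an assumed typical budget):

```python
# Back-of-envelope: raw audio vs. a typical LLM context window.
sample_rate = 16_000                       # Hz
clip_seconds = 10
raw_samples = sample_rate * clip_seconds   # 160,000 samples for a 10 s clip

encoder_frame_rate = 50                    # Whisper's encoder emits ~50 frames/s
encoder_frames = encoder_frame_rate * clip_seconds  # 500 frames after encoding

llm_context = 4_096                        # assumed typical LLM context budget
print(raw_samples, encoder_frames, raw_samples // llm_context)
# prints: 160000 500 39  (raw samples alone exceed the context ~39x)
```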
Token Compression
Pool or stride audio frames to compress seconds of audio into a handful of tokens without losing semantic density.
Key papers:
1000s of audio frames
~32 tokens
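One minimal compression scheme is strided mean-pooling down to a fixed token budget. Q-Former-style learned queries are the common alternative; this pooling variant is an illustrative assumption, not any specific paper's method:

```python
# Sketch: mean-pool T encoder frames into a fixed budget of 32 audio tokens.
def compress_frames(frames, n_tokens=32):
    """frames: list of T feature vectors -> list of n_tokens pooled vectors."""
    T = len(frames)
    # Split the time axis into n_tokens roughly equal chunks.
    bounds = [round(i * T / n_tokens) for i in range(n_tokens + 1)]
    pooled = []
    for lo, hi in zip(bounds, bounds[1:]):
        chunk = frames[lo:hi]
        dim = len(chunk[0])
        # Average each feature dimension across the chunk.
        pooled.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return pooled

frames = [[float(t)] * 4 for t in range(1500)]   # 1500 frames, 4-dim features
tokens = compress_frames(frames)
print(len(tokens), len(tokens[0]))               # prints: 32 4
```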
Spectrogram vs. Waveform
A persistent debate in representation choice for audio encoders.
Spectrogram (e.g. AST)
Raw Waveform (e.g. EnCodec)
- Processes the raw waveform with a 1D convolutional encoder.
- Discretizes with residual vector quantization (RVQ).
[1] Tang et al., SALMONN: Towards Generic Hearing Abilities for Large Language Models, ICLR 2024. arXiv:2310.13289. [2] Deshmukh et al., Pengi: An Audio Language Model for Audio Tasks, NeurIPS 2023. arXiv:2305.11834. [3] Gong et al., AST: Audio Spectrogram Transformer, Interspeech 2021. [4] Défossez et al., High Fidelity Neural Audio Compression (EnCodec), TMLR 2023. [5] Ghosh et al., GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities, 2024. [6] Audio Flamingo 3, NVIDIA, 2025.
SECTION 2
Three hierarchical challenge levels: Representation › Reasoning › Interaction
Now that we understand the architecture, we map the research landscape — categorizing core challenges into three hierarchical levels: Representation, Reasoning, and Interaction.
REPRESENTATION BOTTLENECKS
Addressing the core problems
9
Reasoning Problems (Space and Time)
10
Multi-Event & Temporal Grounding
Models handle “What is this?” but struggle with “What happened at 0:45?”
LTU tries to address this via temporal-localization training on timestamped QA pairs [1].
Global understanding vs. local grounding:
Note: temporal grounding remains a major challenge even after LTU.
0:10
0:45
1:20
?
Spatial Reasoning
Deriving 3D geometry and acoustic room properties from binaural or multichannel audio is computationally complex.
👂
Multi-Speaker Modeling
Overlapping speech (the Cocktail Party Problem) causes catastrophic interference in standard attention.
Recent approach: speaker-diarized tokenization — pre-segment speakers before LLM ingestion [3].
Mixed signal → attention collapse:
⚡ attn. collapse
Long-Context Audio
Maintaining acoustic context over hour-long recordings requires memory beyond standard positional encodings.
Solutions:
Hierarchical memory layers
[1] Gong et al., Listen, Think, and Understand (LTU), ICLR 2024. [2] Comminiello et al., Spatial Audio Deep Learning, IEEE SPM 2022. [3] Cornell et al., One model to rule them all: Towards Joint Speaker Diarization and Recognition, ICASSP 2023. [4] Zhao et al., LongAudio: Towards Long-Context Audio Language Models, 2024.
SECTION 2
Three hierarchical challenge levels: Representation › Reasoning › Interaction
Even with good representations, LALMs must reason when events occur, where sounds originate, who is speaking, and how long context must be maintained.
LEVEL 2 — REASONING PROBLEMS
Addressing the Core Problem of Temporal Grounding
11
Raw Caption Text "1.12s a dog barking..."
FLOAT_PATTERN Regex "1.12s..." → "<|1.12|>s..."
Tokenizer Token IDs (e.g. 151762 for <|1.12|>)
Cross-Entropy Loss Per-token losses computed
Weighted Loss timestamp_weighted average loss
Audio spectrogram with caption overlay
PROCESSING PIPELINE
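The FLOAT_PATTERN step above can be sketched with a single regex pass. The exact pattern used in the pipeline is an assumption; the transformation it performs is as shown on the slide:

```python
import re

# Assumed form of the pipeline's FLOAT_PATTERN: a float followed by "s".
FLOAT_PATTERN = re.compile(r"(\d+\.\d+)(?=s)")

def wrap_timestamps(caption):
    """Wrap float timestamps in special-token delimiters: '1.12s' -> '<|1.12|>s'."""
    return FLOAT_PATTERN.sub(lambda m: f"<|{m.group(1)}|>", caption)

print(wrap_timestamps("1.12s a dog barking..."))  # prints: <|1.12|>s a dog barking...
```

The tokenizer then maps each `<|1.12|>`-style string to a single dedicated token ID, so timestamps survive as atomic units rather than being split into digit pieces.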
Instruction & Interaction Problems
12
SECTION 3
Taxonomy › Level 3: Instruction & Interaction
Having built robust representations, models must now follow instructions and engage in multi-modal dialogue across text and audio.
▶ 3 key challenges in instruction-following and interactive audio understanding
Zero-Shot Instruction Following
Challenge: Generalize to unseen audio tasks at inference time without task-specific fine-tuning.
Core issue: Instruction-tuning datasets for audio tasks are scarce; novel queries like "describe the mood" require broad generalization.
Seen in: Audio Flamingo 3
Multi-Turn Audio Dialogue
Challenge: Maintain conversational context across turns where inputs interleave text and audio.
Core issue: LLM context windows do not natively handle audio token sequences across turns; representations shift as the conversation evolves.
Seen in: Audio Flamingo 3 (multi-turn audio-text dialogue)
Audio Generation Conditioning
Challenge: Move beyond understanding to emitting audio tokens, requiring joint probability of text and audio codes.
Core issue: Decoder must model p(audio codes | text, context) — harder than classification or captioning.
Seen in: AudioPaLM, UALM (unified audio-language generation)
New task
?
Answer ✓
Turn 1: [audio]
Response 1
Turn 2: [text]
Text in
c1
c2
c3
♫ Audio out
References: Ghosh et al. Audio Flamingo 3. Rubenstein et al. AudioPaLM (arXiv:2306.12925). Tian et al. UALM: Unified Audio-Language Model (2024).
13
The Tokenization Debate: Continuous vs. Discrete
📡 Continuous Embeddings
Raw encoder vectors (e.g., Whisper hidden states)
✓ High fidelity — no quantization loss
✓ Rich acoustic detail preserved
✗ Cannot auto-regressively generate audio
✗ Variable-length sequences; hard to decode
VS
🎵 Discrete Audio Codes (RVQ)
q1
q1
q1
q1
q1
q2
q2
q2
q2
q2
q3
q3
q3
q3
q3
RVQ L1
RVQ L2
RVQ L3
Neural Audio Codec (EnCodec, SoundStream) → discrete vocab
✓ LLM treats audio just like text tokens
✓ True bidirectional text↔audio (AudioLM)
✗ Quantization noise — fidelity loss
✗ Codebook collapse under distribution shift
✨ SYNTHESIS
Discrete codes let LLMs model audio like text — enabling true bidirectional generation (AudioLM) at the cost of quantization noise.
References: Defossez et al. EnCodec (arXiv:2210.13438). Borsos et al. AudioLM (arXiv:2209.03143). Zeghidour et al. SoundStream (arXiv:2107.03312).
14
Why Do We Need Residual Vector Quantization?
✗ THE PROBLEM
⚠ NAIVE SOLUTION
✓ RVQ SOLUTION
6.85
Encoder output (continuous float)
LLM only understands
integers (tokens)
Audio encoders output high-precision floats. LLMs need discrete integers. There is a fundamental representation mismatch.
6.85 → round() → 7
✗ FLAW
Rounding throws away fine-grained acoustic detail: pitch, timbre, emotion. A single integer cannot capture the richness of an audio frame.
Break 6.85 into layers of precision:
CB1: ~6.0 (nearest foot)
CB2: +0.9 (nearest inch)
CB3: −0.05 (nearest mm)
CB4: +0.00 (perfect)
Sum = 6.0 + 0.9 − 0.05 = 6.85 ✓
Ref: Defossez et al. EnCodec (arXiv:2210.13438). van den Oord et al. VQ-VAE (arXiv:1711.00937).
15
Defining Our Audio Frame and Codebooks
Our encoder outputs a single frame value of 6.85. We have 4 Codebooks, each with 4 entries (Token IDs 1–4). Highlighted = winning match.
Codebook 1 (Coarse)
ID 1: 0.0
ID 2: 3.0
ID 3: 6.0 ← match
ID 4: 9.0
Codebook 2 (Fine)
ID 1: 0.0
ID 2: 0.3
ID 3: 0.6
ID 4: 0.9 ← match
Codebook 3 (Finer)
ID 1: -0.10
ID 2: -0.05 ← match
ID 3: 0.00
ID 4: 0.05
Codebook 4 (Finest)
ID 1: -0.02
ID 2: -0.01
ID 3: 0.00 ← match
ID 4: 0.01
Winning Token IDs for 6.85: CB1 → 3 | CB2 → 4 | CB3 → 2 | CB4 → 3
6.85 encodes as token sequence: [ 3, 4, 2, 3 ]
📝 Note: Real models (e.g., EnCodec) use 128-dim vectors with codebook sizes of 1024 — the math is identical!
16
Quantizing the Audio — The Forward Pass
Target value: 6.85
1
Codebook 1
Coarse Match
Input residual
6.85
Closest match
6.0
Token ID: 3
Residual = 6.85 − 6.0
0.85
Pass 0.85 → next layer
2
Codebook 2
Fix First Error
Input residual
0.85
Closest match
0.9
Token ID: 4
Residual = 0.85 − 0.9
−0.05
Pass −0.05 → next layer
3
Codebook 3
Getting Closer
Input residual
−0.05
Closest match
−0.05
Token ID: 2
Residual = −0.05−(−0.05)
0.00
Pass 0.00 → next layer
4
Codebook 4
Final Polish
Input residual
0.00
Closest match
0.00
Token ID: 3
Residual = 0.00 − 0.00
0.00
✓ Perfect match!
Result: 6.85 → Token IDs: 3, 4, 2, 3 = [ 3, 4, 2, 3 ]
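The four-layer walk-through above can be reproduced directly. This toy implementation uses the slide's scalar codebooks; real codecs such as EnCodec quantize 128-dimensional vectors against 1024-entry codebooks, but the procedure is identical:

```python
# The slide's four scalar codebooks (coarse -> finest).
CODEBOOKS = [
    [0.0, 3.0, 6.0, 9.0],        # CB1: coarse
    [0.0, 0.3, 0.6, 0.9],        # CB2: fine
    [-0.10, -0.05, 0.00, 0.05],  # CB3: finer
    [-0.02, -0.01, 0.00, 0.01],  # CB4: finest
]

def rvq_encode(x):
    """Greedily quantize x, passing the residual down to each next codebook."""
    ids, residual = [], x
    for cb in CODEBOOKS:
        i = min(range(len(cb)), key=lambda j: abs(residual - cb[j]))
        ids.append(i + 1)        # 1-indexed token IDs, as on the slide
        residual -= cb[i]
    return ids

def rvq_decode(ids):
    """Reconstruct by summing the selected entry from every codebook."""
    return sum(cb[i - 1] for cb, i in zip(CODEBOOKS, ids))

ids = rvq_encode(6.85)           # [3, 4, 2, 3]
value = rvq_decode(ids)          # 6.85, up to float rounding
```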
17
From Continuous Audio to LLM Tokens
Audio frame
6.85
RVQ Encoder
4 Codebooks
Token sequence
[ 3, 4, 2, 3 ]
LLM processes
like text tokens ✓
📤 The Encoding
Continuous value 6.85 is permanently converted into a discrete token sequence:
[ 3, 4, 2, 3 ]
The LLM treats these exactly like word tokens — no special treatment needed.
📥 The Decoding (Reconstruction)
3
Token 3 (CB1) = 6.0
4
Token 4 (CB2) = 0.9
2
Token 2 (CB3) = −0.05
3
Token 3 (CB4) = 0.00
Total = 6.0 + 0.9 − 0.05 + 0.00 =
6.85 ✓
RVQ = a near-lossless bridge between continuous audio and discrete LLM token space
Ref: Defossez et al. EnCodec (arXiv:2210.13438). Borsos et al. AudioLM (arXiv:2209.03143). van den Oord et al. VQ-VAE (arXiv:1711.00937).
18
Training Paradigms & Strategies
How Audio LLMs learn: from pre-training through alignment
SSL / Contrastive Learning
Foundation Stage
Goal: Map audio embeddings to a shared text space
Method: Contrastive loss (e.g., CLAP)
Strength: Excellent global semantic understanding
Limitation: Fails at fine-grained temporal event ordering
Data: ASR pairs (LibriSpeech) + audio captioning
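The contrastive objective named above can be written out in a few lines. This is a deliberately tiny CLAP-style sketch; the function name and toy embeddings are illustrative, not from any specific implementation:

```python
import math

# Symmetric contrastive loss: paired audio/text embeddings are pulled
# together, all mismatched pairs in the batch are pushed apart.
def contrastive_loss(audio, text, temperature=0.07):
    """audio, text: lists of B unit-norm vectors; audio[i] pairs with text[i]."""
    B = len(audio)
    sims = [[sum(a * t for a, t in zip(audio[i], text[j])) / temperature
             for j in range(B)] for i in range(B)]

    def xent(rows):  # softmax cross-entropy with the diagonal as the true class
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / B

    cols = [list(c) for c in zip(*sims)]      # text-to-audio direction
    return 0.5 * (xent(sims) + xent(cols))

eye = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
matched = contrastive_loss(eye, eye)          # aligned pairs -> low loss
shuffled = contrastive_loss(eye, eye[::-1])   # misaligned pairs -> higher loss
```

Because the loss only compares whole clips against whole captions, it rewards global semantic agreement, which is exactly why it fails at fine-grained temporal event ordering.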
Curriculum Training
Multi-Stage Pipeline
Goal: Generalization across tasks via staged learning
Stage 1: Modality alignment – map audio to text
Stage 2-N: Instruction tuning with Activation Tuning to prevent overfitting
Stage N+1: Preference alignment (DPO) for safety (Optional)
Innovation: Tasks converted to prompts (e.g., '[AUDIO] Identify emotion')
Pre-train (SSL) → Modality Alignment → Instruction Tuning → Preference Alignment (DPO) → Deployed Model
Qwen2-Audio, Audio Flamingo 3, GAMA
19
The Three-Stage Training Pipeline of Qwen2-Audio
Stage 1:
Modality Alignment
Goal
Map audio embeddings to text space
Data
ASR pairs (LibriSpeech) + Audio Captioning
Outcome
Transcribes audio but cannot chat or reason
→
Stage 2:
Instruction Tuning
Goal
Generalization across diverse audio tasks
Method
Convert tasks to prompts (e.g., '[AUDIO] Identify emotion')
Innovation
Activation Tuning prevents catastrophic forgetting and overfitting
→
Stage 3:
Preference Alignment
Goal
Safety, helpfulness, reduced hallucination
Method
DPO – Direct Preference Optimization (no reward model needed)
Result
Significantly reduces safety refusals in Qwen2-Audio
Qwen2-Audio (Chu et al., 2024) – Three-Stage Training Pipeline
20
Aligning Behavior: Direct Preference Optimization (DPO)
The Problem
Models can be helpful but unsafe (susceptible to jailbreaks) or hallucinate (confidently fabricate information).
→
The Solution: DPO
Optimize policy πθ to prefer response yw (winning) over yl (losing) without a reward model.
DPO Equation
L_DPO = −log σ( β log[ π_θ(y_w|x) / π_ref(y_w|x) ] − β log[ π_θ(y_l|x) / π_ref(y_l|x) ] )
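A per-pair sketch of this loss, given sequence log-probabilities under the policy and the frozen reference (variable names are illustrative):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid of the beta-scaled
    log-ratio margin between winning (y_w) and losing (y_l) responses."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# At initialization (policy == reference) the margin is 0, so loss = ln 2.
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))  # prints: 0.6931...
```

Training pushes the margin positive, i.e. the policy raises the probability of y_w relative to y_l faster than the reference does, without ever fitting an explicit reward model.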
Impact & Metrics
1
Qwen2-Audio
DPO significantly reduces safety refusals while maintaining helpfulness across audio understanding tasks.
2
Reliability Gain Index (RGI)
Measures 'humbleness' — the model's ability to admit ignorance rather than hallucinate confidently.
Rafailov et al. (2023), DPO; Chu et al. (2024), Qwen2-Audio
Instruction Tuning & Synthetic Bootstrapping
!
THE DATA SCARCITY PROBLEM
No massive "Common Crawl" of perfectly transcribed, multi-event environmental audio exists.
SOLUTIONS
1
LLMs as Annotators
Papers increasingly use powerful LLMs to generate synthetic dialogues, captions, and Q&A pairs from weak audio tags.
e.g., WavCaps
2
Multi-Task Training
Mixing ASR, captioning, music analysis, and QA in a single instruction-tuning batch to force generalized representation.
ASR • Captioning • Music • QA
3
Bootstrapping Pipeline
Weak audio tags
↓
LLM generates captions
↓
Train on synthetic data
22
GAMA: CompA-R Reasoning Dataset
1
Pre-training Stage
Large-scale audio-language pair alignment
↓
2
Instruction-Tuning on CompA-R
Complex audio reasoning with structured Q&A
What is CompA-R?
A dataset designed to train models to understand and reason about complex audio scenarios. Uses a synthesis pipeline: Audio Events → Caption Generation → Instruction Generation → Verified Q&A pairs.
Pipeline to synthesize CompA-R dataset
Ghosh et al., "GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities"
AudioSkills (Audio Flamingo 3)
Large-Scale Audio QA
The first attempt at creating a large-scale Audio-QA dataset focused on expert-level skills across diverse audio domains.
Skill Taxonomy & Data Pipeline
Task categories & skill distribution
Reasoning & evaluation framework
Kong et al., "Audio Flamingo 3 / AudioSkills" - NVIDIA
AudioSkills: LongAudioBench Pipeline
Multi-step synthesis: Video/Audio Segments → Segment Captions → LLM Q&A Generation → Expert Review → LongAudioBench
End-to-end pipeline for LongAudioBench dataset construction with self-verification
✓ Self-Verification Loop
✓ Expert Review Stage
✓ Multi-Modal Captions
25
GAMA: Qualitative Examples
Evaluation of LALMs
26
1
Rapid Growth
Growing number of LALMs: LTU, Pengi, SALMONN, Qwen-Audio, and more emerging models
2
No Standard Framework
No standardized evaluation framework exists — each model uses its own metrics and benchmarks
3
Fragmented Evaluation
Most works rely on ad-hoc evaluation setups, making fair comparison across models difficult
4
Benchmarks Fall Short
AudioCaps, Clotho, etc. fail to evaluate holistic LALM performance — many tasks don’t require expert-level ability
Key gap: A unified, holistic evaluation benchmark for LALMs is urgently needed
27
Speech • Sound • Music
MMAU: Massive Multi-Task Audio Understanding & Reasoning
MMLU (language) and MMMU (vision) pushed model capabilities forward — but nothing equivalent existed for audio. MMAU fills that gap.
10,000 QA Pairs
Expert Curated
MMAU: 10k curated audio clips paired with natural language Q&A spanning speech, sound, and music.
28
MMAU: Skill Distribution
27 distinct skills: 11 Information Extraction (left) and 16 Reasoning (right)
29
MMAU Benchmark: Model Performance Over Time
30
MMAU-Pro: Key Features
Multi-Cultural Music
Multi-Hop Reasoning
Variable-Length Audio
Speech-to-Speech
Extends MMAU with more complex evaluation settings and challenging skills
Evaluates understanding of diverse global music traditions and sound
Questions requiring multiple reasoning steps over audio content
Short, medium, and long audio clips including 5+ minute recordings
Tests interaction skills with speech input and speech output
31
Model | MMAU | MMAR | MMSU
gemma-3-12b-it (LLM) | 42.1 | 37.5 | 32.4
gemma-3-27b-it (LLM) | 48.3 | 35.8 | 43.6
Llama-3.1-8B-Instruct (LLM) | 37.5 | 32.4 | 38.4
Qwen2.5-Omni-7B (LALM) | 68.1 | 57.3 | 50.7
Audio Flamingo 3 (LALM) | 72.8 | 58.0 | 59.7
Random MCQ baseline = 25%
Why MMAU-Pro? Existing Benchmark Shortcomings
Text-only LLMs (without audio) can score nearly as high as LALMs — revealing that many questions don’t actually require audio understanding
32
MMAU-Pro: Most Comprehensive Audio Benchmark
MMAU-Pro uniquely covers long audio, multi-audio, spatial audio, and multicultural music diversity
33
11 Domains
Total: ~5000 QA
Multi-Audio QA:
MMAU-Pro Stats
34
MMAU-Pro Stats
35
New benchmarks reveal new shortcomings of frontier models
Models perform worst at test time on music cultures not seen during training.
36
New benchmarks reveal new shortcomings of frontier models
Comparison of top 3 best and worst performing skills.
Performance comparison of instruction following capabilities.
37
New benchmarks reveal new shortcomings of frontier models
38
Failures & Vulnerabilities
LALMs exhibit critical failure modes that limit real-world deployment and reliability
!
Object Hallucination
Generating descriptions of sounds that don’t exist in the audio. Models confidently describe non-present events due to training data biases.
!
Non-Speech Deafness
Filtering out background events (door slams, alarms) as ‘noise’ due to ASR-heavy training. Critical environmental sounds are missed.
!
Adversarial Jailbreaks
Voice modulation (changing emotion/tone) can bypass text-based safety filters. Audio-specific attack vectors remain largely unaddressed.
!
Temporal Confusion
Inability to correctly sequence or timestamp events in audio. Models struggle to distinguish ‘before’ and ‘after’ in temporal reasoning tasks.
Implication: Addressing these vulnerabilities is essential before deploying LALMs in safety-critical applications
39
Spatial Audio Understanding & Reasoning
CURRENT STATE
Existing LALMs mostly focus on reasoning over mono-channel audio, leaving spatial audio understanding largely unexplored.
Existing works on spatial audio understanding have key limitations:
1
Lacks Array Invariance
Models are tied to specific microphone array configurations and cannot generalize across different setups.
2
Poor Scalability
Does not scale well to tasks with multiple or moving sound sources in complex scenes.
3
Binaural Only
Can only process binaural audio, limiting support for higher-order formats like ambisonics.
OUR AIM
Develop a comprehensive spatial LALM that is array-agnostic and capable of reasoning over first- and second-order ambisonics.
Thank You!
Questions?
40