1 of 64

LLMs Beyond Text: Music

Chris Donahue

CMU 11-667 LLMs

October 12, 2023

2 of 64

Music LLMs have remarkable capabilities.


Acoustic music LLMs can generate music audio from text (examples from Copet+ 23)

Symbolic music LLMs can harmonize and infill music scores (examples from Thickstun+ 23)


3 of 64

Agenda

  • What is a music LM, and what do LMs mean for music?
  • Coercing music into tokens
  • Modeling music tokens with LMs
  • Research directions in music LLMs


4 of 64

What is a music LM, and what do LMs mean for music?


5 of 64

What is a music LM?

  • A model over sequences of musical tokens
  • Specifically, factorized autoregressively: P(x) = P(x_1) · P(x_2 | x_1) ⋯ P(x_T | x_1, …, x_{T−1})
  • So, literally just an LM… except what are musical tokens?


6 of 64

Musical tokens can be symbolic

Perhaps most intuitively, symbolic music can be represented as tokens


C, C, G, G, A, A, G, …


7 of 64

Musical tokens can be acoustic

Somewhat less intuitively, musical tokens can represent audio (more on this later)

[Waveform plot: pressure over time, amplitude from -1 to 1]

0, 1, 0, -1, 0, 1, 0, -1, 0, …


8 of 64

What is a music LM? (continued)

  • A music LM is a type of generative model of music in some data modality
  • Often, we want this model to be controllable
  • Control is often multimodal, e.g., generate acoustic audio given a symbolic score
  • Why do we want controllable music generative models?


9 of 64

Everyone is a musician… but musical expression is locked behind expertise.

[Images: composition; improvisation]


10 of 64

Conventional tools impose a barrier between intuition and expression.

[Diagram: intuition 🧠 separated from expression 🎶 by the expertise divide]


11 of 64

Generative models can bridge the expertise divide.

[Diagram: a generative model (via control) carries intuition 👋 across the expertise divide to expression 🎶]


12 of 64

A concrete example: Piano Genie

Donahue+ IUI’19

[Diagram: a generative model maps a simple interface to piano improvisation, bridging the expertise divide]


13 of 64

A concrete example: SingSong

Donahue+ 23

[Diagram: a generative model maps singing to rich music, bridging the expertise divide]


14 of 64

Large music LMs: the “Midjourney moment” for music

[Diagram: a generative model maps a text prompt ("Drake and The Weeknd collab with Metro Boomin style production") to full audio]


15 of 64

Generative AI is poised to redefine the nature of music creation over the next 5 years.


We must be responsible stewards, unlocking expression while protecting music culture and musicians.


17 of 64

Why LMs, as opposed to other generative models?

  • Music generation has bifurcated into two paradigms:
    • (1) LMs on discrete music representations (e.g. tokens)
    • (2) Diffusion on continuous music representations (e.g. audio)
  • Both are qualitatively competitive, and both leverage the breadth of other research areas (NLP / CV), but LMs offer:
    • Elegance: tokenization decouples modality-specific concerns from modeling
    • Well-lit scaling: it’s currently clearer how to scale LMs than diffusion models
    • Low delay: LMs generate tokens individually, enabling real-time interaction (e.g. Piano Genie)
  • Fun fact: LMs also have a long history in music! →

“Markov chains” (n-gram models) were popular in contemporary composition (80s-90s)


18 of 64

Coercing music into tokens

Motivation: Make music modeling methodologically equivalent to text modeling by converting music to tokens and tokens to music


19 of 64

Musical tokens can be symbolic

Perhaps most intuitively, symbolic music can be represented as tokens


C, C, G, G, A, A, G, …


20 of 64

Tokenization schemes can preserve different musical info

Just like with text, there are different ways to tokenize music


Pitch classes: C, C, G, G, A, A, G

With rhythms: ♩C, ♩C, ♩G, ♩G, ♩A, ♩A, 𝅗𝅥G

With octaves: ♩C4, ♩C4, ♩G4, ♩G4, ♩A4, ♩A4, 𝅗𝅥G4

(ordered from simpler to more complex)


Note that we can manipulate both the information we tokenize as well as the vocabulary V. Trade off sequence length against vocab size:

V = {♩C, 𝅗𝅥G, …} (joint rhythm+pitch tokens) vs. V = {♩, 𝅗𝅥, C, G, …} (separate rhythm and pitch tokens)
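A minimal Python sketch of this tradeoff (toy token lists, not from the slides; "quarter"/"half" stand in for the rhythm glyphs):

notes = [("quarter", "C4"), ("quarter", "C4"), ("quarter", "G4"),
         ("quarter", "G4"), ("quarter", "A4"), ("quarter", "A4"),
         ("half", "G4")]

# Scheme 1: joint rhythm+pitch tokens. Shorter sequence, bigger vocab
# (one token per rhythm-pitch combination).
joint = [f"{dur}_{pitch}" for dur, pitch in notes]

# Scheme 2: separate rhythm and pitch tokens. Longer sequence, smaller
# vocab (durations plus pitches, rather than their product).
separate = [tok for dur, pitch in notes for tok in (dur, pitch)]

print(len(joint), len(set(joint)))        # 7 tokens, vocab size 4
print(len(separate), len(set(separate)))  # 14 tokens, vocab size 5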


22 of 64

Dealing with note simultaneity can be complicated

Unlike in text, multiple symbolic music tokens can coincide in time. One approach is to insert “time shift” tokens to encode changes in time [Oore and Simon 17]


C3, E3, G3, C4, ♩, C4, ♩, G4, …   (♩ token = time shift by 1 beat)


23 of 64

Dealing with note simultaneity can be complicated

Time shift tokens are cumbersome, especially for infilling. Instead, we can model notes as tuples of (time, pitch) [Ippolito+ 18, Hsiao+ 21, Thickstun+ 23]


(1, C3), (1, E3), (1, G3), (1, C4), (2, C4), (3, G4), …
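A minimal sketch contrasting the two encodings on this example (toy code; token names like "SHIFT" are illustrative, not from the cited papers):

# Each note is (onset_in_beats, pitch), sorted by onset.
notes = [(1, "C3"), (1, "E3"), (1, "G3"), (1, "C4"), (2, "C4"), (3, "G4")]

# Encoding A: time-shift tokens (one "SHIFT" advances time by 1 beat).
tokens, t = [], 1
for onset, pitch in notes:
    while t < onset:           # emit a shift token per elapsed beat
        tokens.append("SHIFT")
        t += 1
    tokens.append(pitch)
print(tokens)  # ['C3', 'E3', 'G3', 'C4', 'SHIFT', 'C4', 'SHIFT', 'G4']

# Encoding B: (time, pitch) tuples; no running clock to maintain, which
# makes infilling at arbitrary positions simpler.
print(notes)   # [(1, 'C3'), (1, 'E3'), (1, 'G3'), (1, 'C4'), (2, 'C4'), (3, 'G4')]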


24 of 64

A simple tutorial! Training an LM on symbolic lead sheets
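The tutorial itself isn’t reproduced here; below is a minimal sketch of its core idea (a toy pitch corpus and a bigram LM; an actual tutorial would use real lead sheets and a stronger model):

import random
from collections import Counter, defaultdict

corpus = ["C", "C", "G", "G", "A", "A", "G", "F", "F", "E", "E", "D", "D", "C"]

# Count bigrams: P(next | prev) ≈ count(prev, next) / count(prev).
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def sample_next(prev):
    counts = bigrams[prev]
    return random.choices(list(counts), weights=counts.values())[0]

# Generate a continuation autoregressively.
seq = ["C"]
for _ in range(8):
    seq.append(sample_next(seq[-1]))
print(seq)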


25 of 64

Why don’t we just model the audio itself?

  • Symbolic music is simultaneously natural (because it’s discrete) and unnatural (because of sparsity / simultaneity) to model with LMs
  • Symbolic is great for musicians but audio is more universally appreciated
  • Symbolic music data is hard to come by but recordings are everywhere
  • Hence, why not just model audio with LMs?


26 of 64

A ⚡ primer on digital audio

Audio is a continuous measurement of fluctuating air pressure caused by sound

But continuous signals cannot be natively stored on digital media

[Plot: continuous pressure signal over time, amplitude from -1 to 1]


27 of 64

A ⚡ primer on digital audio

Digital audio involves sampling the signal at uniform intervals:

[Plot: the same signal sampled at uniform intervals]


28 of 64

A ⚡ primer on digital audio

Digital audio further quantizes samples to a set of discrete amplitudes, e.g. {-1, -½, 0, ½, 1}

[Plot: the sampled signal snapped to the five discrete amplitude levels]



30 of 64

Waveforms can be “tokens”!

Hence, we could simply treat each sample of a digital audio waveform as a token

[Plot: quantized waveform whose sample values are read off as a token sequence]

0, 1, 0, -1, 0, 1, 0, -1, …
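A runnable sketch of the idea (absurdly low sample rate and the five-level amplitude set from the previous slide, purely for display):

import math

SAMPLE_RATE = 8                          # samples per second
LEVELS = [-1.0, -0.5, 0.0, 0.5, 1.0]     # discrete amplitude set

def waveform_tokens(duration_s, freq_hz=1.0):
    tokens = []
    for n in range(int(duration_s * SAMPLE_RATE)):
        x = math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)  # sample
        tokens.append(min(range(len(LEVELS)),                  # quantize
                          key=lambda i: abs(LEVELS[i] - x)))
    return tokens

print(waveform_tokens(1.0))  # [2, 3, 4, 3, 2, 1, 0, 1]: one period of a sine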


31 of 64

But there’s a catch…

Durations are large (e.g. 180 s) and sample rates are high (e.g. 48 kHz), with structure at many different timescales

Waveforms are long: an Achilles heel for modern ML

📷: (Left) WaveNet blog post, (Right) Vaswani+ 17


32 of 64

Waveform lengths in perspective

[Figure: sequence lengths on a log scale from 10² to 10⁸, marking GPT, LRA, utterances, 1MP images, pop songs (4-minute waveforms), and symphonies (40-minute waveforms)]


33 of 64

Audio sequence lengths in perspective

Sequences considered “long” by the ML community are 3-4 orders of magnitude shorter than music waveforms.

[Same figure as the previous slide]


34 of 64

How can we tractably model music audio with LMs?

  • Modeling samples of waveforms is overkill: much of the entropy is perceptually irrelevant
  • Instead, use LMs to model compressed tokens [van den Oord+ 17]
    • Small vocabulary and low frame rate => high compression factor, often ~100×
    • Compressing the waveform makes LM modeling empirically tractable
  • Learn a discrete codec (Enc, Dec)
    • Train as an autoencoder: Dec(Enc(w)) is perceptually similar to w
  • Tokenize the dataset with Enc and model the tokens with an LM
  • Sample tokens from the LM and decode them to audio with Dec (sketched below)
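A toy end-to-end sketch of the recipe (the "codec" here is just a fixed 8-level quantizer and the LM a bigram count model; a real system learns Enc/Dec as a neural autoencoder and uses a Transformer LM):

import math, random
from collections import Counter, defaultdict

LEVELS = [i / 7 * 2 - 1 for i in range(8)]   # fixed 8-entry "codebook"

def enc(w):   # waveform -> tokens
    return [min(range(8), key=lambda i: abs(LEVELS[i] - x)) for x in w]

def dec(tokens):   # tokens -> waveform
    return [LEVELS[t] for t in tokens]

# 1. Tokenize the dataset.
data = [math.sin(2 * math.pi * n / 16) for n in range(160)]
tokens = enc(data)

# 2. Model the token distribution with an LM (here, bigram counts).
lm = defaultdict(Counter)
for a, b in zip(tokens, tokens[1:]):
    lm[a][b] += 1

# 3. Sample tokens and decode them back to audio.
seq = [tokens[0]]
for _ in range(31):
    counts = lm[seq[-1]]
    seq.append(random.choices(list(counts), weights=counts.values())[0])
audio = dec(seq)
print([round(x, 2) for x in audio[:8]])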


35 of 64

Modeling audio tokens from learned codecs

van den Oord+ NeurIPS’17, Dieleman+ NeurIPS’18, Dhariwal/Jun/Payne+ 20

1. Learn a discrete codec

[Diagram: w → Enc_φ → Enc_φ(w) = [1, 0, 6, 3, 8, 2, 1, 8, …] → Dec_θ → Dec_θ(Enc_φ(w))]

Minimize round-trip reconstruction error: L(φ, θ) = E_w[distance(w, Dec_θ(Enc_φ(w)))]


36 of 64

Modeling audio tokens from learned codecs

van den Oord+ NeurIPS’17, Dieleman+ NeurIPS’18, Dhariwal/Jun/Payne+ 20

2. Model the token distribution with an LM

[Diagram: w → Enc → Enc(w) = [1, 0, 6, 3, 8, 2, 1, 8, …]; an LM models P(Enc(w))]


37 of 64

Modeling audio tokens from learned codecs

van den Oord+ NeurIPS’17, Dieleman+ NeurIPS’18, Dhariwal/Jun/Payne+ 20

3. Generate audio using the LM and Dec

[Diagram: sample generated tokens from the LM’s P(Enc(w)), then decode them to audio with Dec]


38 of 64


LM of audio tokens

van den Oord+ NeurIPS’17, Dieleman+ NeurIPS’18, Dhariwal/Jun/Payne+ 20

[Diagram: audio (high rate, continuous) → Enc → audio “tokens” (lower rate, discrete); the LM predicts the next token, and the prediction is compared to the actual future token]


39 of 64

Multimodal, controllable music LMs enabled by tokenizers and generic seq2seq

[Diagram: a symbolic tokenizer maps “C C G G A A G, …” to token IDs (61, 15, 72, 85, …) fed to a Transformer encoder; a Transformer decoder generates acoustic tokens, which an acoustic (de)tokenizer converts to the goal audio: a generic seq2seq LM, e.g. Transformer]


40 of 64

Music LMs solved! … or not?

  • Previously, we claimed that “tokenization decouples modality-specific concerns from modeling”… true in theory, but not in current practice
  • In reality, token sequences are still too long for naive language modeling, e.g., 500 tokens per second for typical high-fidelity audio codecs
  • Domain-specific methods are still common for modeling music tokens with LMs


41 of 64

Modeling music tokens


42 of 64

Hierarchical modeling: a tractable short-term recipe

  • Dieleman+ NeurIPS’18 proposed a hierarchical approach to this problem
  • Goal: induce hierarchy and model it with standard LM architectures
    • Create a curriculum of increasingly fine-grained token stages s_1, …, s_k
    • Factorize the joint into stages: P(s_1, …, s_k) = P(s_1) · P(s_2 | s_1) ⋯ P(s_k | s_{k−1})
    • Model each stage with an autoregressive LM (sketched below)
    • This makes a conditional independence assumption: P(s_i | s_1, …, s_{i−1}) ≈ P(s_i | s_{i−1})
    • At each stage, truncate the context to the maximum length allowed by the architecture
    • Hierarchical structure enables long-term dependencies with fixed-length models
  • Where do these token stages come from?
    • Recently: a strategy called “residual vector quantization” (RVQ) [Zeghidour+ 21]
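A structural sketch of staged sampling under this factorization (the stage “LMs” are stubbed out with random generators; only the conditioning structure is the point):

import random

def sample_stage1(n):                     # P(s_1): coarse tokens
    return [random.randrange(16) for _ in range(n)]

def sample_next_stage(prev, upsample=2):  # P(s_i | s_{i-1}): finer tokens,
    out = []                              # conditioned only on the previous
    for tok in prev:                      # stage (context truncation omitted)
        for _ in range(upsample):
            out.append((tok * 31 + random.randrange(4)) % 64)
    return out

s1 = sample_stage1(8)
s2 = sample_next_stage(s1)      # never sees anything coarser than s1
s3 = sample_next_stage(s2, 4)   # never sees s1 at all
print(len(s1), len(s2), len(s3))  # 8, 16, 64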


43 of 64

Residual vector quantization (RVQ): an overview

Zeghidour+ 21

[Diagram: the encoder emits one continuous vector per frame; RVQ turns each frame into a grid of discrete codes (levels × frames)]


44 of 64

How does RVQ work?

Zeghidour+ 21

  • RVQ iteratively quantizes vectors in a coarse-to-fine fashion
  • The encoder produces a fixed-length, continuous vector e per frame
  • RVQ quantizes individual frames: e ≈ q_1 + q_2 + … + q_k
  • RVQ involves learning k codebooks C_1, …, C_k, one per level
  • Recursive quantization: q_i = argmin over c ∈ C_i of ‖r_{i−1} − c‖ (see the sketch below)
    • Where r_0 = e and r_i = r_{i−1} − q_i
  • Levels are ordinal: lower levels contribute more to the output sound
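A minimal NumPy sketch of RVQ inference with fixed (untrained) codebooks, quantizing one frame coarse-to-fine:

import numpy as np

rng = np.random.default_rng(0)
dim, k, codebook_size = 8, 4, 16
codebooks = [rng.normal(scale=1.0 / 2 ** i, size=(codebook_size, dim))
             for i in range(k)]   # later levels operate at smaller scales

def rvq(e):
    codes, residual = [], e
    for C in codebooks:
        idx = int(np.argmin(np.linalg.norm(C - residual, axis=1)))
        codes.append(idx)             # q_i = argmin_{c in C_i} ||r_{i-1} - c||
        residual = residual - C[idx]  # r_i = r_{i-1} - q_i
    return codes

def dequantize(codes):
    return sum(C[i] for C, i in zip(codebooks, codes))  # ê = q_1 + … + q_k

e = rng.normal(size=dim)
codes = rvq(e)
print(codes, float(np.linalg.norm(e - dequantize(codes))))  # error shrinks as k grows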


45 of 64

Example of simple multi-stage hierarchical LM w/ RVQ

  • Three-stage model: P(stage 1) · P(stage 2 | stage 1) · P(stage 3 | stage 2)
    • Stage 1: 50 tokens per sec; Stage 2: 100 tokens per sec; Stage 3: 400 tokens per sec
  • As an example, let’s assume an LM sequence length of 1500 per stage:
    • First stage has 30 s of context: 1500 stage-1 tokens
    • Second stage has 10 s of context: 500 stage-1 tokens + 1000 stage-2 tokens
    • Third stage has 3 s of context: 300 stage-2 tokens + 1200 stage-3 tokens
  • Early stages handle long-term dependencies, late stages handle fidelity


46 of 64

Insight: semantic representations improve efficiency

Lakhotia/Kharitonov+ ACL’21, Borsos+ 22

[Diagram: acoustic tokens [1, 0, 6, 3, 8, 2, 1, 8, …] are masked to [1, _, 6, 3, _, 2, _, 8, …], and a BERT-style model over audio x is trained to predict the masked tokens; its intermediate representations are then discretized]

Semantic tokens Sem(w): discretized representations from some intermediate layer of a BERT-style model. Acoustic tokens Enc(w): from RVQ.
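A sketch of one common way semantic tokens are derived (the HuBERT-style recipe: k-means over intermediate-layer features; the features below are random stand-ins, and the layer choice and cluster count are assumptions):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 32))  # stand-in for per-frame activations
                                        # from an intermediate model layer
kmeans = KMeans(n_clusters=64, n_init=10, random_state=0).fit(features)

def semantic_tokens(frame_features):
    return kmeans.predict(frame_features)  # one discrete token per frame

print(semantic_tokens(features[:8]))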


47 of 64

Insight: semantic representations improve efficiency

Lakhotia/Kharitonov+ ACL’21, Borsos+ 22

AudioLM (Borsos+ 22): model the joint with a proxy LM: P(Sem(w)) · P(Enc(w) | Sem(w)). 1B params (vs. 7B for Jukebox).

[Diagram: audio x → Sem → high-level semantic codes [1, 0, 6, 3, …]; audio x → Enc → low-level acoustic tokens [8, 2, 1, 8, …]; a hierarchical LM models the semantic codes, then the acoustic tokens given the semantic codes]


48 of 64

Another approach: leverage structure in tokens to parallelize prediction

  • Flattening RVQ tokens induces long sequences, and creates a heterogeneous definition of an LM timestep:
    • Sometimes a timestep advances time in the audio
    • Sometimes a timestep advances levels in the RVQ
  • Insight: can we change the definition of an LM timestep to a frame, and predict all tokens in parallel?


49 of 64

“Delay trick” from MusicGen (Copet+ 23)


  • Flattening is inefficient (4× sequence length increase)
  • Parallel prediction makes an unreasonable independence assumption: we know levels are not independent due to the recursive structure of RVQ
  • Proposed: a delay pattern, trading off between flattening and parallel (see the sketch below)
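A sketch of the three patterns over an RVQ token grid of shape [K levels × T frames] (pattern bookkeeping only; MusicGen’s actual implementation differs):

K, T, PAD = 4, 6, "_"
grid = [[f"t{t}k{k}" for t in range(T)] for k in range(K)]

# Flatten: one LM step per token -> K*T steps.
flat = [grid[k][t] for t in range(T) for k in range(K)]

# Parallel: one LM step per frame; all K levels predicted independently.
parallel = [[grid[k][t] for k in range(K)] for t in range(T)]

# Delay: one LM step per frame, but level k is delayed by k frames, so
# level k+1 at frame t is generated after level k at frame t.
delay = [[grid[k][t - k] if 0 <= t - k < T else PAD for k in range(K)]
         for t in range(T + K - 1)]

print(len(flat), len(parallel), len(delay))  # 24 vs 6 vs 9 LM steps
print(delay[1])  # ['t1k0', 't0k1', '_', '_']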


50 of 64

Long context architectures

  • The dream: forget about engineering hierarchy and token structure, model raw waveforms directly, throw long-context LMs at the problem
  • Somewhat anachronistically, the earliest music audio generative model (WaveNet) operated on raw waveforms [van den Oord+ 16]
  • More recently, this has been revisited with state space models [Goel+ 22]
  • Current drawbacks:
    • Can only model a few seconds of audio context
    • Cannot model broad music audio (only narrow distributions like piano)
    • Relatively slow to generate from


51 of 64

Long context architectures: WaveNet

van den Oord+ 16

WaveNet is an LM that uses exponentially dilated convolutions to enable tractable modeling of longer contexts (around 48k timesteps, or 3 s at 16 kHz)
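A back-of-the-envelope sketch of why dilation helps (illustrative configuration, not WaveNet’s exact one):

def receptive_field(kernel_size, dilations):
    # Each layer adds (kernel_size - 1) * dilation samples of context.
    return 1 + sum((kernel_size - 1) * d for d in dilations)

stack = [2 ** i for i in range(10)]     # dilations 1, 2, 4, ..., 512
print(receptive_field(2, stack))        # 1024 samples from just 10 layers
print(receptive_field(2, stack * 5))    # 5116 samples from 5 stacked blocks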


52 of 64

Long context architectures: SaShiMi

Goel+ 22

SaShiMi uses structured state space sequence models (S4) [Gu+ 21], a generalization of convolution, to model even longer audio contexts (8s)


53 of 64

More research directions in music LLMs


54 of 64

Demystifying hierarchical LMs: what are the optimal tradeoffs and scaling laws?

[Figure repeated from the AudioLM slide: a hierarchical LM over semantic codes Sem(w) and acoustic tokens Enc(w)]


55 of 64

Intuitive control is crucial for musical expressivity.

[Diagram: 👋 control (“An uptempo jazz song featuring a singer, a saxophone, and a washboard”) → Music LLM → 🎵 music; this requires paired data]


56 of 64

Improving music information retrieval (MIR) can help improve control.

[Diagram: an MIR model maps 🎵 music back to 👋 control text (e.g., the description above), producing the needed paired data]


57 of 64

Music LLMs can aid MIR

Castellon+ 20, Donahue+ 21

[Diagram: audio → audio tokenizer → audio tokens → music LLM → transfer to MIR tasks → prediction: “Jazz”]
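A sketch of the probing setup used in this line of work (a linear probe on pooled LM features; the features and labels below are random stand-ins, whereas the cited works probe representations from real music LMs):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))     # stand-in: LM hidden states, average-
                                   # pooled over time, one row per clip
y = rng.integers(0, 4, size=200)   # stand-in genre labels

probe = LogisticRegression(max_iter=1000).fit(X[:150], y[:150])
print("probe accuracy:", probe.score(X[150:], y[150:]))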


58 of 64

Music is inherently multimodal and crossmodal.

[Images illustrating multimodal and crossmodal aspects of music]


59 of 64

How do we build foundation models with limited paired data?

[Diagram: a multimodal music LLM built from limited multimodal and crossmodal paired data]


60 of 64

Can we adapt music LLMs to run on commodity hardware?

[Diagram: a music LLM running on commodity hardware, motivated by privacy and reliability]


61 of 64

What is the right interface for music generative models?

Generative model → Interface
GPT → ChatGPT
Music LLMs (MusicLM, AudioCraft, Riffusion, AudioLDM, …) → ?


62 of 64

Protecting artists via consent, credit, and compensation

[Diagrams: (1) training-data attribution: training data → music LLM → generated output, with $ flowing back as revenue streams; (2) retrieval for improved diversity and credit: retrieved data → music LLM → generated output]


63 of 64

Interested? Reach out!

✉️ chrisdonahue@cmu.edu

🌐 chrisdonahue.com

@chrisdonahuey


64 of 64

Quiz question

What is the primary property of audio waveforms that makes them empirically challenging to model with Transformers?


© Chris Donahue 2023