1 of 64

LLMs Beyond Text: Music

Chris Donahue

CMU 11-667 LLMs

October 12, 2023

2 of 64

Music LLMs have remarkable capabilities.


Acoustic music LLMs can generate music audio from text (examples from Copet+ 23)

Symbolic music LLMs can harmonize and infill music scores (examples from Thickstun+ 23)


3 of 64

Agenda

  • What is a music LM, and what do LMs mean for music?
  • Coercing music into tokens
  • Modeling music tokens with LMs
  • Research directions in music LLMs


4 of 64

What is a music LM, and what do LMs mean for music?


5 of 64

What is a music LM?

  • A model over sequences of musical tokens
  • Specifically, factorized autoregressively: P(x) = P(x_1) · P(x_2 | x_1) ⋯ P(x_T | x_1, …, x_{T−1})
  • So, literally just an LM… except what are musical tokens?


6 of 64

Musical tokens can be symbolic

Perhaps most intuitively, symbolic music can be represented as tokens


C, C, G, G, A, A, G, …


7 of 64

Musical tokens can be acoustic

Somewhat less intuitively, musical tokens can represent audio (more on this later)

[Waveform plot: pressure over time, amplitude from -1 to 1]

0, 1, 0, -1, 0, 1, 0, -1, 0, …


8 of 64

What is a music LM? (continued)

  • A music LM is a type of generative model of music in some data modality
  • Often, we want this model to be controllable
  • Control is often multimodal, e.g., generate acoustic audio given a symbolic score
  • Why do we want controllable music generative models?


9 of 64

Everyone is a musician… but musical expression is locked behind expertise.

[Images: composition; improvisation]


10 of 64

Conventional tools impose a barrier between intuition and expression.

[Diagram: intuition 🧠 separated from expression 🎶 by the expertise divide]


11 of 64

Generative models can bridge the expertise divide.

[Diagram: a generative model (via control) carries intuition 👋 across the expertise divide to expression 🎶]


12 of 64

A concrete example: Piano Genie

Donahue+ IUI’19

[Diagram: a generative model maps a simple interface to piano improvisation, bridging the expertise divide]


13 of 64

A concrete example: SingSong

Donahue+ 23

[Diagram: a generative model maps singing to rich music, bridging the expertise divide]


14 of 64

Large music LMs: the “Midjourney moment” for music

[Diagram: a generative model maps a text prompt ("Drake and The Weeknd collab with Metro Boomin style production") to full audio]


15 of 64

Generative AI is poised to redefine the nature of music creation over the next 5 years.


We must be responsible stewards, unlocking expression while protecting music culture and musicians.


17 of 64

Why LMs, as opposed to other generative models?

  • Music generation has bifurcated into two paradigms:
    • (1) LMs on discrete music representations (e.g. tokens)
    • (2) Diffusion on continuous music representations (e.g. audio)
  • Both are qualitatively competitive, and both leverage the breadth of other research areas (NLP / CV), but LMs offer:
    • Elegance: tokenization decouples modality-specific concerns from modeling
    • Well-lit scaling: it’s currently clearer how to scale LMs than diffusion models
    • Low delay: LMs generate tokens individually, enabling real-time interaction (e.g. Piano Genie)
  • Fun fact: LMs also have a long history in music! →

“Markov chains” (n-gram models) were popular in contemporary composition (80s-90s)


18 of 64

Coercing music into tokens

Motivation: Make music modeling methodologically equivalent to text modeling by converting music to tokens and tokens to music


19 of 64

Musical tokens can be symbolic

Perhaps most intuitively, symbolic music can be represented as tokens


C, C, G, G, A, A, G, …


20 of 64

Tokenization schemes can preserve different musical info

Just like with text, there are different ways to tokenize music


Pitch classes: C, C, G, G, A, A, G

With rhythms: ♩C, ♩C, ♩G, ♩G, ♩A, ♩A, 𝅗𝅥G

With octaves: ♩C4, ♩C4, ♩G4, ♩G4, ♩A4, ♩A4, 𝅗𝅥G4

(ordered from simpler to more complex)


Note that we can manipulate both the information we tokenize as well as the vocabulary V. Trade off sequence length against vocab size:

V = {♩C, 𝅗𝅥G, …} (joint rhythm+pitch tokens) vs. V = {♩, 𝅗𝅥, C, G, …} (separate rhythm and pitch tokens)
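A minimal Python sketch of this tradeoff (toy token lists, not from the slides; "quarter"/"half" stand in for the rhythm glyphs):

notes = [("quarter", "C4"), ("quarter", "C4"), ("quarter", "G4"),
         ("quarter", "G4"), ("quarter", "A4"), ("quarter", "A4"),
         ("half", "G4")]

# Scheme 1: joint rhythm+pitch tokens. Shorter sequence, bigger vocab
# (one token per rhythm-pitch combination).
joint = [f"{dur}_{pitch}" for dur, pitch in notes]

# Scheme 2: separate rhythm and pitch tokens. Longer sequence, smaller
# vocab (durations plus pitches, rather than their product).
separate = [tok for dur, pitch in notes for tok in (dur, pitch)]

print(len(joint), len(set(joint)))        # 7 tokens, vocab size 4
print(len(separate), len(set(separate)))  # 14 tokens, vocab size 5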


22 of 64

Dealing with note simultaneity can be complicated

Unlike in text, multiple symbolic music tokens can coincide in time. One approach is to insert “time shift” tokens to encode changes in time [Oore and Simon 17]


C3, E3, G3, C4, ♩, C4, ♩, G4, …   (♩ token = time shift by 1 beat)


23 of 64

Dealing with note simultaneity can be complicated

Time shift tokens are cumbersome, especially for infilling. Instead, we can model notes as tuples of (time, pitch) [Ippolito+ 18, Hsiao+ 21, Thickstun+ 23]


(1, C3), (1, E3), (1, G3), (1, C4), (2, C4), (3, G4), …
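A minimal sketch contrasting the two encodings on this example (toy code; token names like "SHIFT" are illustrative, not from the cited papers):

# Each note is (onset_in_beats, pitch), sorted by onset.
notes = [(1, "C3"), (1, "E3"), (1, "G3"), (1, "C4"), (2, "C4"), (3, "G4")]

# Encoding A: time-shift tokens (one "SHIFT" advances time by 1 beat).
tokens, t = [], 1
for onset, pitch in notes:
    while t < onset:           # emit a shift token per elapsed beat
        tokens.append("SHIFT")
        t += 1
    tokens.append(pitch)
print(tokens)  # ['C3', 'E3', 'G3', 'C4', 'SHIFT', 'C4', 'SHIFT', 'G4']

# Encoding B: (time, pitch) tuples; no running clock to maintain, which
# makes infilling at arbitrary positions simpler.
print(notes)   # [(1, 'C3'), (1, 'E3'), (1, 'G3'), (1, 'C4'), (2, 'C4'), (3, 'G4')]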


24 of 64

A simple tutorial! Training an LM on symbolic lead sheets
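The tutorial itself isn’t reproduced here; below is a minimal sketch of its core idea (a toy pitch corpus and a bigram LM; an actual tutorial would use real lead sheets and a stronger model):

import random
from collections import Counter, defaultdict

corpus = ["C", "C", "G", "G", "A", "A", "G", "F", "F", "E", "E", "D", "D", "C"]

# Count bigrams: P(next | prev) ≈ count(prev, next) / count(prev).
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def sample_next(prev):
    counts = bigrams[prev]
    return random.choices(list(counts), weights=counts.values())[0]

# Generate a continuation autoregressively.
seq = ["C"]
for _ in range(8):
    seq.append(sample_next(seq[-1]))
print(seq)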


25 of 64

Why don’t we just model the audio itself?

  • Symbolic music is simultaneously natural (because it’s discrete) and unnatural (because of sparsity / simultaneity) to model with LMs
  • Symbolic is great for musicians but audio is more universally appreciated
  • Symbolic music data is hard to come by but recordings are everywhere
  • Hence, why not just model audio with LMs?


26 of 64

A ⚡ primer on digital audio

Audio is a continuous measurement of fluctuating air pressure caused by sound

But continuous signals cannot be natively stored on digital media

[Plot: continuous pressure signal over time, amplitude from -1 to 1]


27 of 64

A ⚡ primer on digital audio

Digital audio involves sampling the signal at uniform intervals:

[Plot: the same signal sampled at uniform intervals]


28 of 64

A ⚡ primer on digital audio

Digital audio further quantizes samples to a set of discrete amplitudes, e.g. {-1, -½, 0, ½, 1}

[Plot: the sampled signal snapped to the five discrete amplitude levels]



30 of 64

Waveforms can be “tokens”!

Hence, we could simply treat each sample of a digital audio waveform as a token

[Plot: quantized waveform whose sample values are read off as a token sequence]

0, 1, 0, -1, 0, 1, 0, -1, …
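A runnable sketch of the idea (absurdly low sample rate and the five-level amplitude set from the previous slide, purely for display):

import math

SAMPLE_RATE = 8                          # samples per second
LEVELS = [-1.0, -0.5, 0.0, 0.5, 1.0]     # discrete amplitude set

def waveform_tokens(duration_s, freq_hz=1.0):
    tokens = []
    for n in range(int(duration_s * SAMPLE_RATE)):
        x = math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)  # sample
        tokens.append(min(range(len(LEVELS)),                  # quantize
                          key=lambda i: abs(LEVELS[i] - x)))
    return tokens

print(waveform_tokens(1.0))  # [2, 3, 4, 3, 2, 1, 0, 1]: one period of a sine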


31 of 64

But there’s a catch…

Durations are large (e.g. 180 s) and sample rates are high (e.g. 48 kHz), with structure at many different timescales

Waveforms are long: an Achilles heel for modern ML

📷: (Left) WaveNet blog post, (Right) Vaswani+ 17


32 of 64

Waveform lengths in perspective

[Figure: sequence lengths on a log scale from 10² to 10⁸, marking GPT, LRA, utterances, 1MP images, pop songs (4-minute waveforms), and symphonies (40-minute waveforms)]


33 of 64

Audio sequence lengths in perspective

Sequences considered “long” by the ML community are 3-4 orders of magnitude shorter than music waveforms.

[Same figure as the previous slide]


34 of 64

How can we tractably model music audio with LMs?

  • Modeling samples of waveforms is overkill: much of the entropy is perceptually irrelevant
  • Instead, use LMs to model compressed tokens [van den Oord+ 17]
    • Small vocabulary and low frame rate => high compression factor, often ~100×
    • Compressing the waveform makes LM modeling empirically tractable
  • Learn a discrete codec (Enc, Dec)
    • Train as an autoencoder: Dec(Enc(w)) is perceptually similar to w
  • Tokenize the dataset with Enc and model the tokens with an LM
  • Sample tokens from the LM and decode them to audio with Dec (sketched below)
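A toy end-to-end sketch of the recipe (the "codec" here is just a fixed 8-level quantizer and the LM a bigram count model; a real system learns Enc/Dec as a neural autoencoder and uses a Transformer LM):

import math, random
from collections import Counter, defaultdict

LEVELS = [i / 7 * 2 - 1 for i in range(8)]   # fixed 8-entry "codebook"

def enc(w):   # waveform -> tokens
    return [min(range(8), key=lambda i: abs(LEVELS[i] - x)) for x in w]

def dec(tokens):   # tokens -> waveform
    return [LEVELS[t] for t in tokens]

# 1. Tokenize the dataset.
data = [math.sin(2 * math.pi * n / 16) for n in range(160)]
tokens = enc(data)

# 2. Model the token distribution with an LM (here, bigram counts).
lm = defaultdict(Counter)
for a, b in zip(tokens, tokens[1:]):
    lm[a][b] += 1

# 3. Sample tokens and decode them back to audio.
seq = [tokens[0]]
for _ in range(31):
    counts = lm[seq[-1]]
    seq.append(random.choices(list(counts), weights=counts.values())[0])
audio = dec(seq)
print([round(x, 2) for x in audio[:8]])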


35 of 64

Modeling audio tokens from learned codecs

van den Oord+ NeurIPS’17, Dieleman+ NeurIPS’18, Dhariwal/Jun/Payne+ 20

1. Learn a discrete codec

[Diagram: w → Enc_φ → Enc_φ(w) = [1, 0, 6, 3, 8, 2, 1, 8, …] → Dec_θ → Dec_θ(Enc_φ(w))]

Minimize round-trip reconstruction error: L(φ, θ) = E_w[distance(w, Dec_θ(Enc_φ(w)))]


36 of 64

Modeling audio tokens from learned codecs

van den Oord+ NeurIPS’17, Dieleman+ NeurIPS’18, Dhariwal/Jun/Payne+ 20

2. Model the token distribution with an LM

[Diagram: w → Enc → Enc(w) = [1, 0, 6, 3, 8, 2, 1, 8, …]; an LM models P(Enc(w))]


37 of 64

Modeling audio tokens from learned codecs

van den Oord+ NeurIPS’17, Dieleman+ NeurIPS’18, Dhariwal/Jun/Payne+ 20

3. Generate audio using the LM and Dec

[Diagram: sample generated tokens from the LM’s P(Enc(w)), then decode them to audio with Dec]


38 of 64


LM of audio tokens

van den Oord+ NeurIPS’17, Dieleman+ NeurIPS’18, Dhariwal/Jun/Payne+ 20

[Diagram: audio (high rate, continuous) → Enc → audio “tokens” (lower rate, discrete); the LM predicts the next token, and the prediction is compared to the actual future token]


39 of 64

Multimodal, controllable music LMs enabled by tokenizers and generic seq2seq

[Diagram: a symbolic tokenizer maps “C C G G A A G, …” to token IDs (61, 15, 72, 85, …) fed to a Transformer encoder; a Transformer decoder generates acoustic tokens, which an acoustic (de)tokenizer converts to the goal audio: a generic seq2seq LM, e.g. Transformer]


40 of 64

Music LMs solved! … or not?

  • Previously, we claimed that “tokenization decouples modality-specific concerns from modeling”… true in theory, but not in current practice
  • In reality, token sequences are still too long for naive language modeling, e.g., 500 tokens per second for typical high-fidelity audio codecs
  • Domain-specific methods are still common for modeling music tokens with LMs


41 of 64

Modeling music tokens


42 of 64

Hierarchical modeling: a tractable short-term recipe

  • Dieleman+ NeurIPS’18 proposed a hierarchical approach to this problem
  • Goal: induce hierarchy and model it with standard LM architectures
    • Create a curriculum of increasingly fine-grained token stages s_1, …, s_k
    • Factorize the joint into stages: P(s_1, …, s_k) = P(s_1) · P(s_2 | s_1) ⋯ P(s_k | s_{k−1})
    • Model each stage with an autoregressive LM (sketched below)
    • This makes a conditional independence assumption: P(s_i | s_1, …, s_{i−1}) ≈ P(s_i | s_{i−1})
    • At each stage, truncate the context to the maximum length allowed by the architecture
    • Hierarchical structure enables long-term dependencies with fixed-length models
  • Where do these token stages come from?
    • Recently: a strategy called “residual vector quantization” (RVQ) [Zeghidour+ 21]
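A structural sketch of staged sampling under this factorization (the stage “LMs” are stubbed out with random generators; only the conditioning structure is the point):

import random

def sample_stage1(n):                     # P(s_1): coarse tokens
    return [random.randrange(16) for _ in range(n)]

def sample_next_stage(prev, upsample=2):  # P(s_i | s_{i-1}): finer tokens,
    out = []                              # conditioned only on the previous
    for tok in prev:                      # stage (context truncation omitted)
        for _ in range(upsample):
            out.append((tok * 31 + random.randrange(4)) % 64)
    return out

s1 = sample_stage1(8)
s2 = sample_next_stage(s1)      # never sees anything coarser than s1
s3 = sample_next_stage(s2, 4)   # never sees s1 at all
print(len(s1), len(s2), len(s3))  # 8, 16, 64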


43 of 64

Residual vector quantization (RVQ): an overview

Zeghidour+ 21

[Diagram: the encoder emits one continuous vector per frame; RVQ turns each frame into a grid of discrete codes (levels × frames)]


44 of 64

How does RVQ work?

Zeghidour+ 21

  • RVQ iteratively quantizes vectors in a coarse-to-fine fashion
  • The encoder produces a fixed-length, continuous vector e per frame
  • RVQ quantizes individual frames: e ≈ q_1 + q_2 + … + q_k
  • RVQ involves learning k codebooks C_1, …, C_k, one per level
  • Recursive quantization: q_i = argmin over c ∈ C_i of ‖r_{i−1} − c‖ (see the sketch below)
    • Where r_0 = e and r_i = r_{i−1} − q_i
  • Levels are ordinal: lower levels contribute more to the output sound
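A minimal NumPy sketch of RVQ inference with fixed (untrained) codebooks, quantizing one frame coarse-to-fine:

import numpy as np

rng = np.random.default_rng(0)
dim, k, codebook_size = 8, 4, 16
codebooks = [rng.normal(scale=1.0 / 2 ** i, size=(codebook_size, dim))
             for i in range(k)]   # later levels operate at smaller scales

def rvq(e):
    codes, residual = [], e
    for C in codebooks:
        idx = int(np.argmin(np.linalg.norm(C - residual, axis=1)))
        codes.append(idx)             # q_i = argmin_{c in C_i} ||r_{i-1} - c||
        residual = residual - C[idx]  # r_i = r_{i-1} - q_i
    return codes

def dequantize(codes):
    return sum(C[i] for C, i in zip(codebooks, codes))  # ê = q_1 + … + q_k

e = rng.normal(size=dim)
codes = rvq(e)
print(codes, float(np.linalg.norm(e - dequantize(codes))))  # error shrinks as k grows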


45 of 64

Example of simple multi-stage hierarchical LM w/ RVQ

  • Three-stage model: P(stage 1) · P(stage 2 | stage 1) · P(stage 3 | stage 2)
    • Stage 1: 50 tokens per sec; Stage 2: 100 tokens per sec; Stage 3: 400 tokens per sec
  • As an example, let’s assume an LM sequence length of 1500 per stage:
    • First stage has 30 s of context: 1500 stage-1 tokens
    • Second stage has 10 s of context: 500 stage-1 tokens + 1000 stage-2 tokens
    • Third stage has 3 s of context: 300 stage-2 tokens + 1200 stage-3 tokens
  • Early stages handle long-term dependencies, late stages handle fidelity


46 of 64

Insight: semantic representations improve efficiency

Lakhotia/Kharitonov+ ACL’21, Borsos+ 22

[Diagram: acoustic tokens [1, 0, 6, 3, 8, 2, 1, 8, …] are masked to [1, _, 6, 3, _, 2, _, 8, …], and a BERT-style model over audio x is trained to predict the masked tokens; its intermediate representations are then discretized]

Semantic tokens Sem(w): discretized representations from some intermediate layer of a BERT-style model. Acoustic tokens Enc(w): from RVQ.
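A sketch of one common way semantic tokens are derived (the HuBERT-style recipe: k-means over intermediate-layer features; the features below are random stand-ins, and the layer choice and cluster count are assumptions):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 32))  # stand-in for per-frame activations
                                        # from an intermediate model layer
kmeans = KMeans(n_clusters=64, n_init=10, random_state=0).fit(features)

def semantic_tokens(frame_features):
    return kmeans.predict(frame_features)  # one discrete token per frame

print(semantic_tokens(features[:8]))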


47 of 64

Insight: semantic representations improve efficiency

Lakhotia/Kharitonov+ ACL’21, Borsos+ 22

AudioLM (Borsos+ 22): model the joint with a proxy LM: P(Sem(w)) · P(Enc(w) | Sem(w)). 1B params (vs. 7B for Jukebox).

[Diagram: audio x → Sem → high-level semantic codes [1, 0, 6, 3, …]; audio x → Enc → low-level acoustic tokens [8, 2, 1, 8, …]; a hierarchical LM models the semantic codes, then the acoustic tokens given the semantic codes]


48 of 64

Another approach: leverage structure in tokens to parallelize prediction

  • Flattening RVQ tokens induces long sequences, and creates a heterogeneous definition of an LM timestep:
    • Sometimes a timestep advances time in the audio
    • Sometimes a timestep advances levels in the RVQ
  • Insight: can we change the definition of an LM timestep to a frame, and predict all tokens in parallel?


49 of 64

“Delay trick” from MusicGen (Copet+ 23)


  • Flattening is inefficient (4× sequence length increase)
  • Parallel prediction makes an unreasonable independence assumption: we know levels are not independent due to the recursive structure of RVQ
  • Proposed: a delay pattern, trading off between flattening and parallel (see the sketch below)
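A sketch of the three patterns over an RVQ token grid of shape [K levels × T frames] (pattern bookkeeping only; MusicGen’s actual implementation differs):

K, T, PAD = 4, 6, "_"
grid = [[f"t{t}k{k}" for t in range(T)] for k in range(K)]

# Flatten: one LM step per token -> K*T steps.
flat = [grid[k][t] for t in range(T) for k in range(K)]

# Parallel: one LM step per frame; all K levels predicted independently.
parallel = [[grid[k][t] for k in range(K)] for t in range(T)]

# Delay: one LM step per frame, but level k is delayed by k frames, so
# level k+1 at frame t is generated after level k at frame t.
delay = [[grid[k][t - k] if 0 <= t - k < T else PAD for k in range(K)]
         for t in range(T + K - 1)]

print(len(flat), len(parallel), len(delay))  # 24 vs 6 vs 9 LM steps
print(delay[1])  # ['t1k0', 't0k1', '_', '_']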


50 of 64

Long context architectures

  • The dream: forget about engineering hierarchy and token structure, model raw waveforms directly, throw long-context LMs at the problem
  • Somewhat anachronistically, the earliest music audio generative model (WaveNet) operated on raw waveforms [van den Oord+ 16]
  • More recently, this has been revisited with state space models [Goel+ 22]
  • Current drawbacks:
    • Can only model a few seconds of audio context
    • Cannot model broad music audio (only narrow distributions like piano)
    • Relatively slow to generate from


51 of 64

Long context architectures: WaveNet

van den Oord+ 16

WaveNet is an LM that uses exponentially dilated convolutions to enable tractable modeling of longer contexts (around 48k timesteps, or 3 s at 16 kHz)
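A back-of-the-envelope sketch of why dilation helps (illustrative configuration, not WaveNet’s exact one):

def receptive_field(kernel_size, dilations):
    # Each layer adds (kernel_size - 1) * dilation samples of context.
    return 1 + sum((kernel_size - 1) * d for d in dilations)

stack = [2 ** i for i in range(10)]     # dilations 1, 2, 4, ..., 512
print(receptive_field(2, stack))        # 1024 samples from just 10 layers
print(receptive_field(2, stack * 5))    # 5116 samples from 5 stacked blocks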


52 of 64

Long context architectures: SaShiMi

Goel+ 22

SaShiMi uses structured state space sequence models (S4) [Gu+ 21], a generalization of convolution, to model even longer audio contexts (8s)


53 of 64

More research directions in music LLMs


54 of 64

Demystifying hierarchical LMs: what are the optimal tradeoffs and scaling laws?

[Figure repeated from the AudioLM slide: a hierarchical LM over semantic codes Sem(w) and acoustic tokens Enc(w)]


55 of 64

Intuitive control is crucial for musical expressivity.

[Diagram: 👋 control (“An uptempo jazz song featuring a singer, a saxophone, and a washboard”) → Music LLM → 🎵 music; this requires paired data]


56 of 64

Improving music information retrieval (MIR) can help improve control.

[Diagram: an MIR model maps 🎵 music back to 👋 control text (e.g., the description above), producing the needed paired data]


57 of 64

Music LLMs can aid MIR

Castellon+ 20, Donahue+ 21

[Diagram: audio → audio tokenizer → audio tokens → music LLM → transfer to MIR tasks → prediction: “Jazz”]
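A sketch of the probing setup used in this line of work (a linear probe on pooled LM features; the features and labels below are random stand-ins, whereas the cited works probe representations from real music LMs):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))     # stand-in: LM hidden states, average-
                                   # pooled over time, one row per clip
y = rng.integers(0, 4, size=200)   # stand-in genre labels

probe = LogisticRegression(max_iter=1000).fit(X[:150], y[:150])
print("probe accuracy:", probe.score(X[150:], y[150:]))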


58 of 64

Music is inherently multimodal and crossmodal.

[Images illustrating multimodal and crossmodal aspects of music]


59 of 64

How do we build foundation models with limited paired data?

[Diagram: a multimodal music LLM built from limited multimodal and crossmodal paired data]


60 of 64

Can we adapt music LLMs to run on commodity hardware?

[Diagram: a music LLM running on commodity hardware, motivated by privacy and reliability]


61 of 64

What is the right interface for music generative models?

Generative model → Interface
GPT → ChatGPT
Music LLMs (MusicLM, AudioCraft, Riffusion, AudioLDM, …) → ?


62 of 64

Protecting artists via consent, credit, and compensation

[Diagrams: (1) training-data attribution: training data → music LLM → generated output, with $ flowing back as revenue streams; (2) retrieval for improved diversity and credit: retrieved data → music LLM → generated output]


63 of 64

Interested? Reach out!

✉️ chrisdonahue@cmu.edu

🌐 chrisdonahue.com

@chrisdonahuey


64 of 64

Quiz question

What is the primary property of audio waveforms that makes them empirically challenging to model with Transformers?


© Chris Donahue 2023