LLMs Beyond Text: Music
Chris Donahue
CMU 11-667 LLMs
Oct 12, 2023
Music LLMs have remarkable capabilities.
Acoustic music LLMs can generate music audio from text (examples from Copet+ 23)
Symbolic music LLMs can harmonize and infill music scores (examples from Thickstun+ 23)
© Chris Donahue 2023
Agenda
What is a music LM, and what do LMs mean for music?
What is a music LM?
Musical tokens can be symbolic
Perhaps most intuitively, symbolic music can be represented as tokens
C, C, G, G, A, A, G, …
Musical tokens can be acoustic
Somewhat less intuitively, musical tokens can represent audio (more on this later)
0, 1, 0, -1, 0, 1, 0, -1, 0, …
What is a music LM? (continued)
Everyone is a musician…
… but musical expression
is locked behind expertise.
Composition
Improvisation
Conventional tools impose a barrier
between intuition and expression.
Expression
Expertise divide
Intuition
🧠
🎶
Generative models can
bridge the expertise divide.
Expression
Intuition
(via control)
👋
🎶
Generative model
Expertise divide
A concrete example: Piano Genie
Donahue+ IUI’19
Expertise divide
Simple interface
Piano improvisation
Generative model
A concrete example: SingSong
Donahue+ 23
Expertise divide
Singing
Rich music
Generative model
Large music LMs: the “Midjourney moment” for music
Full audio
Generative model
“Drake and The Weeknd collab with Metro Boomin style production”
Text prompt
Generative AI is poised to redefine the nature of music creation over the next 5 years.
We must be responsible stewards, unlocking expression while protecting music culture and musicians.
Why LMs, as opposed to other generative models?
“Markov chains” (n-gram models)
Popular in contemporary composition (80s-90s)
Coercing music
into tokens
Motivation: Make music modeling methodologically equivalent to text modeling by converting music to tokens and tokens to music
Musical tokens can be symbolic
Perhaps most intuitively, symbolic music can be represented as tokens
C, C, G, G, A, A, G, …
Tokenization schemes can preserve different musical info
Just like with text, there are different ways to tokenize music
Pitch classes: C, C, G, G, A, A, G
With rhythms: ♩C, ♩C, ♩G, ♩G, ♩A, ♩A, G
With octaves: ♩C4, ♩C4, ♩G4, ♩G4, ♩A4, ♩A4, G4
Simpler
More complex
Tokenization schemes can preserve different musical info
Just like with text, there are different ways to tokenize music
Pitch classes: C, C, G, G, A, A, G
With rhythms: ♩C, ♩C, ♩G, ♩G, ♩A, ♩A, G
With octaves: ♩C4, ♩C4, ♩G4, ♩G4, ♩A4, ♩A4, G4
Simpler
More complex
Note that we can manipulate both the information we tokenize and the vocabulary 𝒱, trading off sequence length against vocab size:
𝒱 = {♩C, ♩G, …} vs. 𝒱 = {♩, C, G, …}
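The vocab-size vs. sequence-length trade-off can be made concrete with a toy sketch. All token names and vocabularies here are illustrative, not from any particular system: a "joint" vocab assigns one token per (rhythm, pitch) pair, while a "factored" vocab spells each note as two tokens.

```python
# Toy illustration of the tokenization trade-off (hypothetical vocabularies).
# Joint tokens: one token per (rhythm, pitch) pair -> shorter sequences, bigger vocab.
# Factored tokens: separate rhythm and pitch tokens -> longer sequences, smaller vocab.

notes = [("quarter", "C4"), ("quarter", "C4"), ("quarter", "G4"), ("quarter", "G4")]

rhythms = ["quarter", "half"]
pitches = ["C4", "D4", "E4", "F4", "G4", "A4", "B4"]

# Joint vocabulary: cross product of rhythms and pitches.
joint_vocab = [(r, p) for r in rhythms for p in pitches]
joint_seq = notes  # one token per note

# Factored vocabulary: union of rhythm and pitch symbols.
factored_vocab = rhythms + pitches
factored_seq = [tok for (r, p) in notes for tok in (r, p)]  # two tokens per note

print(len(joint_vocab), len(joint_seq))        # 14-token vocab, 4-token sequence
print(len(factored_vocab), len(factored_seq))  # 9-token vocab, 8-token sequence
```

Joint tokens halve the sequence length here at the cost of a vocabulary that grows multiplicatively with the number of attributes per note.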
Dealing with note simultaneity can be complicated
Unlike in text, multiple symbolic music tokens can coincide in time. One approach is to insert “time shift” tokens to encode changes in time [Oore and Simon 17]
C3, E3, G3, C4, ♩, C4, ♩, G4, …   (♩ = time shift by 1 beat)
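A minimal sketch of the time-shift idea, in the spirit of Oore and Simon 17: simultaneous notes emit no shift between them, and a shift token is inserted for each elapsed beat. The function and token names are illustrative, not from a specific library.

```python
# Sketch: encoding simultaneous notes with "time shift" tokens.
# notes: list of (onset_beat, pitch) pairs, sorted by onset.

def encode_time_shift(notes):
    tokens, current_beat = [], notes[0][0]
    for onset, pitch in notes:
        while current_beat < onset:        # one shift token per beat elapsed
            tokens.append("SHIFT_1")
            current_beat += 1
        tokens.append(pitch)
    return tokens

# C-major chord on beat 1, then C4 on beat 2, then G4 on beat 3.
notes = [(1, "C3"), (1, "E3"), (1, "G3"), (1, "C4"), (2, "C4"), (3, "G4")]
print(encode_time_shift(notes))
# ['C3', 'E3', 'G3', 'C4', 'SHIFT_1', 'C4', 'SHIFT_1', 'G4']
```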
Dealing with note simultaneity can be complicated
Time shift tokens are cumbersome, especially for infilling. Instead, we can model notes as tuples of (time, pitch) [Ippolito+ 18, Hsiao+ 21, Thickstun+ 23]
(1, C3), (1, E3), (1, G3), (1, C4), (2, C4), (3, G4), …
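A small sketch of why the tuple representation suits infilling: each note carries its own onset, so masking a span leaves the timing of every other note intact, whereas deleting a cumulative time-shift token would move all later notes.

```python
# Sketch of the (time, pitch) tuple representation [Ippolito+ 18, Hsiao+ 21,
# Thickstun+ 23]. Each note is self-locating in time.

notes = [(1, "C3"), (1, "E3"), (1, "G3"), (1, "C4"), (2, "C4"), (3, "G4")]

# Mask one note for infilling; all remaining onsets are unaffected:
masked = [n for n in notes if n != (2, "C4")]
print(masked)  # [(1, 'C3'), (1, 'E3'), (1, 'G3'), (1, 'C4'), (3, 'G4')]
```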
A simple tutorial! Training an LM on symbolic lead sheets
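The tutorial presumably trains a neural LM on lead-sheet tokens; as a self-contained stand-in, here is the simplest possible symbolic music LM, a bigram (Markov-chain) model over the pitch tokens from the earlier slides. All names are illustrative.

```python
# A minimal "LM on symbolic tokens": bigram counts over pitch-class tokens.
from collections import Counter, defaultdict

tokens = ["C", "C", "G", "G", "A", "A", "G"]

# Count bigrams over adjacent token pairs.
counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    counts[prev][nxt] += 1

def next_token_probs(prev):
    c = counts[prev]
    total = sum(c.values())
    return {tok: n / total for tok, n in c.items()}

print(next_token_probs("C"))  # {'C': 0.5, 'G': 0.5}
print(next_token_probs("A"))  # {'A': 0.5, 'G': 0.5}
```

Swapping the bigram table for a Transformer changes only the model, not the token interface, which is the point of making music modeling methodologically equivalent to text modeling.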
Why don’t we just model the audio itself?
A ⚡ primer on digital audio
Audio is a continuous measurement of fluctuating air pressure caused by sound
But continuous signals cannot be natively stored on digital media
(plot: air pressure vs. time)
A ⚡ primer on digital audio
Digital audio involves sampling the signal at uniform intervals:
(plot: the waveform sampled at uniform intervals)
A ⚡ primer on digital audio
Digital audio further quantizes signals to a set of discrete amplitudes, e.g. {-1, -½, 0, ½, 1}
(plot: the sampled waveform quantized to the discrete amplitude levels)
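The sampling-and-quantization steps from the primer can be sketched in a few lines. The sample rate, frequency, and amplitude set are toy values chosen for readability, not realistic audio parameters.

```python
# Sketch of digital audio: sample a continuous sine at uniform intervals, then
# quantize each sample to the discrete amplitude set {-1, -1/2, 0, 1/2, 1}.
import math

sample_rate = 8           # samples per second (toy value)
freq = 1.0                # 1 Hz sine
levels = [-1.0, -0.5, 0.0, 0.5, 1.0]

def quantize(x):
    return min(levels, key=lambda v: abs(v - x))  # nearest amplitude level

samples = [math.sin(2 * math.pi * freq * n / sample_rate) for n in range(sample_rate)]
quantized = [quantize(s) for s in samples]
print(quantized)   # [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]

# Each quantized sample maps to a discrete token id, i.e. a "waveform token":
token_ids = [levels.index(q) for q in quantized]
print(token_ids)   # [2, 3, 4, 3, 2, 1, 0, 1]
```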
Waveforms can be “tokens”!
Hence, we could simply treat each sample of a digital audio waveform as a token
(plot: quantized waveform)
0, 1, 0, -1, 0, 1, 0, -1, …
But there’s a catch…
Durations are long (e.g. 180 s) and sample rates are high (e.g. 48 kHz). Structure exists at many different timescales.
Waveforms are long
Achilles heel for modern ML
📷: (Left) WaveNet blog post, (Right) Vaswani+ 17
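The arithmetic behind "waveforms are long" is worth spelling out once:

```python
# A 180-second song sampled at 48 kHz, one token per sample.
duration_s = 180
sample_rate_hz = 48_000
num_samples = duration_s * sample_rate_hz
print(num_samples)  # 8,640,000 samples: roughly 10^7 tokens per song
```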
Waveform lengths in perspective
Sequence length (log scale), 10² to 10⁸:
GPT
LRA
Utterances
1MP Images
Pop songs
(4m waveforms)
Symphonies
(40m waveforms)
Audio sequence lengths in perspective
(chart repeated from previous slide)
Sequences considered “long” by the ML community are 3–4 orders of magnitude shorter than music waveforms.
How can we tractably model music audio with LMs?
Modeling audio tokens from learned codecs
van den Oord+ NeurIPS’17, Dieleman+ NeurIPS’18, Dhariwal/Jun/Payne+ 20
w → Encφ → Encφ(w) = [1, 0, 6, 3, 8, 2, 1, 8, …] → Decθ → Decθ(Encφ(w))
1. Learn a discrete codec
Minimize round-trip reconstruction error:
L(φ, θ) = E_w[ distance(w, Decθ(Encφ(w))) ]
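The round-trip objective can be sketched with a fixed random codebook standing in for a learned one. Real codecs (e.g. VQ-VAE-style models) learn the encoder, decoder, and codebook jointly; here `encode` is just nearest-neighbor lookup and all names are illustrative.

```python
# Toy sketch of step 1 (learn a discrete codec): quantize "encoder frames" to
# the nearest codebook entry and measure round-trip reconstruction error.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))        # 8 codes, 4-dim frames

def encode(frames):
    # nearest codebook entry per frame -> discrete token ids
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

def decode(ids):
    return codebook[ids]                  # token ids -> vectors

frames = rng.normal(size=(16, 4))         # stand-in for encoder activations
ids = encode(frames)                      # an [1, 0, 6, 3, ...]-style token sequence
recon = decode(ids)
loss = np.mean((frames - recon) ** 2)     # round-trip reconstruction error
print(ids.shape, float(loss))
```

Training would backpropagate this loss into the encoder, decoder, and codebook; the LM in step 2 then models the distribution of the `ids` sequences.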
Modeling audio tokens from learned codecs
van den Oord+ NeurIPS’17, Dieleman+ NeurIPS’18, Dhariwal/Jun/Payne+ 20
w → Enc → Enc(w) = [1, 0, 6, 3, 8, 2, 1, 8, …] → Dec → Dec(Enc(w))
2. Model distribution with LM
P(Enc(w))
LM
Modeling audio tokens from learned codecs
van den Oord+ NeurIPS’17, Dieleman+ NeurIPS’18, Dhariwal/Jun/Payne+ 20
w → Enc → Enc(w) = [1, 0, 6, 3, 8, 2, 1, 8, …] → Dec → Dec(Enc(w))
P(Enc(w))
Generated
Dec
3. Generate audio
using LM and Dec
LM
LM of audio tokens
van den Oord+ NeurIPS’17, Dieleman+ NeurIPS’18, Dhariwal/Jun/Payne+ 20
Audio
(high rate, continuous)
Audio “tokens”
(lower rate, discrete)
Enc
LM
Audio token sequence: 0, 6, 5, 2, 5, 2, 0, 6, …
Compare the LM’s prediction (e.g. 7) to the true future token (e.g. 0).
Multimodal, controllable music LMs
enabled by tokenizers and generic seq2seq
Symbolic tokenizer: C C G G A A G, … → 61, 15, 72, 85, …
Seq2seq LM, e.g. Transformer encoder → Transformer decoder
Acoustic (de)tokenizer
Music LMs solved! … or not?
Modeling music tokens
Hierarchical modeling: a tractable short-term recipe
Residual vector quantization (RVQ): an overview
Zeghidour+ 21
Encoder
Levels
Frames
How does RVQ work?
Zeghidour+ 21
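The RVQ mechanism can be sketched in a few lines of numpy: each level quantizes the residual left over by the previous level, giving coarse-to-fine codes. The codebooks here are random toys, not a trained codec like SoundStream, and all names are illustrative.

```python
# Sketch of residual vector quantization (RVQ, in the spirit of Zeghidour+ 21).
import numpy as np

rng = np.random.default_rng(0)
num_levels, codebook_size, dim = 3, 16, 4
codebooks = rng.normal(size=(num_levels, codebook_size, dim))
codebooks[:, 0] = 0.0   # entry 0 is the zero vector, so a level can "pass
                        # through" and reconstruction error never increases

def rvq_encode(x):
    ids, residual = [], x.copy()
    for level in range(num_levels):
        cb = codebooks[level]
        idx = int(np.linalg.norm(cb - residual, axis=-1).argmin())
        ids.append(idx)
        residual = residual - cb[idx]   # next level quantizes what is left
    return ids

def rvq_decode(ids):
    return sum(codebooks[level][idx] for level, idx in enumerate(ids))

x = rng.normal(size=dim)
ids = rvq_encode(x)                                # one id per level (coarse-to-fine)
err1 = np.linalg.norm(x - codebooks[0][ids[0]])    # error after level 1 only
err3 = np.linalg.norm(x - rvq_decode(ids))         # error after all 3 levels
print(ids, err3 <= err1)
```

The recursive structure is why the levels of an RVQ token stream are strongly dependent, a fact the "delay trick" later in the deck exploits.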
Example of simple multi-stage hierarchical LM w/ RVQ
Stage 1; 50 tokens per sec
Stage 2; 100 tokens per sec
Stage 3; 400 tokens per sec
Insight: semantic representations improve efficiency
Lakhotia/Kharitonov+ ACL’21, Borsos+ 22
[1, 0, 6, 3, 8, 2, 1, 8, …]
[1, _, 6, 3, _, 2, _, 8, …]
[1, 0, 6, 3, 8, 2, 1, 8, …]
Predict
Discretize
Mask
Semantic tokens: Sem(w)
discretized representations from an intermediate layer of a BERT-style model
Acoustic tokens: Enc(w)
from RVQ
Insight: semantic representations improve efficiency
Lakhotia/Kharitonov+ ACL’21, Borsos+ 22
AudioLM (Borsos+ 22):
Model the joint with a proxy LM: P(Sem(w)) · P(Enc(w) | Sem(w))
1B params vs. 7B for Jukebox
Enc
Sem
[1, 0, 6, 3…]
[8, 2, 1, 8…]
Semantic codes
High-level
Acoustic tokens
Low-level
Hierarchical LM
Semantic tokens: Sem(w)
discretized representations from an intermediate layer of a BERT-style model
Acoustic tokens: Enc(w)
from RVQ
Another approach: leverage structure in tokens to
parallelize prediction
“Delay trick” from MusicGen (Copet+ 23)
Flattening is inefficient (4× increase in sequence length).
Fully parallel prediction makes an unreasonable independence assumption: the levels are not independent, due to the recursive structure of RVQ.
Proposed: the delay pattern, a tradeoff between flattening and parallel prediction.
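The delay pattern can be illustrated with a toy token grid. Token names are hypothetical; MusicGen uses K = 4 RVQ codebooks, and the key property is that the sequence grows by only K − 1 steps (not a factor of K as with flattening), while level k at each frame still conditions on level k − 1 of the same frame, seen one step earlier.

```python
# Sketch of MusicGen's "delay pattern" (Copet+ 23): offset each RVQ codebook
# stream by one step, then predict one token per codebook at every LM step.
K, T, PAD = 4, 6, "_"

# tokens[k][t] = codebook-k token of frame t (toy symbolic names)
tokens = [[f"c{k}t{t}" for t in range(T)] for k in range(K)]

def delay_pattern(tokens):
    K, T = len(tokens), len(tokens[0])
    steps = T + K - 1                      # grows by K-1, vs. K*T for flattening
    grid = [[PAD] * steps for _ in range(K)]
    for k in range(K):
        for t in range(T):
            grid[k][t + k] = tokens[k][t]  # level k is delayed by k steps
    return grid

grid = delay_pattern(tokens)
for row in grid:
    print(row)
# Step 0 contains only c0t0; step 1 contains c0t1 and c1t0; and so on.
```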
Long context architectures
Long context architectures: WaveNet
van den Oord+ 16
WaveNet is an LM over waveform samples that uses dilated convolutions with exponentially increasing dilation to tractably model longer contexts (around 48k timesteps, or ~3 s)
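The effect of exponentially increasing dilation is easy to see with back-of-envelope arithmetic. The kernel size and dilation schedule below follow the common WaveNet-style setup (kernel 2, dilations doubling 1 to 512 per stack); the stack count is illustrative, and WaveNet variants differ in their exact configuration.

```python
# Receptive field of stacked dilated convolutions:
# receptive_field = 1 + (kernel - 1) * sum(dilations over all layers)
kernel = 2
dilations_per_stack = [2 ** i for i in range(10)]   # 1, 2, 4, ..., 512
num_stacks = 3

receptive_field = 1 + (kernel - 1) * num_stacks * sum(dilations_per_stack)
print(receptive_field)  # 3070 timesteps of context from only 30 conv layers
```

Context grows exponentially with depth per stack, which is what makes multi-second waveform context tractable at all.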
Long context architectures: SaShiMi
Goel+ 22
SaShiMi uses structured state space sequence models (S4) [Gu+ 21], a generalization of convolution, to model even longer audio contexts (8s)
More research directions
in music LLMs
Demystifying hierarchical LMs: what are the optimal tradeoffs and scaling laws?
Intuitive control is crucial for musical expressivity.
👋 Control
🎵 Music
“An uptempo jazz song featuring a singer, a saxophone, and a washboard”
Music LLM
Need paired data
Improving music information retrieval (MIR) can help improve control.
👋 Control
🎵 Music
“An uptempo jazz song featuring a singer, a saxophone, and a washboard”
MIR Model
Music LLMs can aid MIR
Castellon+ 20, Donahue+ 21
Audio
Audio Tokenizer
0, 6, 5, 2, 5, 2, 0, 6, …
Music LLMs
Transfer to
MIR tasks
Prediction
“Jazz”
Audio tokens
Music is inherently multimodal and crossmodal.
Multimodal
Crossmodal
How do we build foundation models with limited paired data?
Multimodal
Music LLM
Multimodal
Crossmodal
Can we adapt music LLMs to run on commodity hardware?
Music LLM
Privacy
Reliability
What is the right interface for music generative models?
Interface
Generative Model
?
GPT
ChatGPT
Music LLM
MusicLM, AudioCraft, Riffusion, AudioLDM, …
Protecting artists via consent, credit, and compensation
Training data attribution for revenue streams
Retrieval for improved diversity and credit
Music LLM
Generated
output
Training
data
$
Retrieved
data
Music LLM
Generated
output
Interested? Reach out!
Quiz question
What is the primary property of audio waveforms that makes them empirically challenging to model with Transformers?