Lonce Wyse
TRANSFORMER SYNTHESIS
Audio (music) companies
Research
ALL use subsymbolic representations
(codecs, spectrograms, latent embeddings)
AudioLM
2022
Semantic vs. acoustic tokens?
Question
MusicGen and Soundstorm
MusicGen
Q:
What is the difference?
MusicGen
"slow ambient piano"
Max: C D E F F# G G# A
Playable Transformer?
Topic: Generalist vs specialist models?
Other control-responsive models
High-level, meaningful, continuous control over the raw audio waveform remains challenging
Conditioning Strategies
Topic: Special symbols?
Conditioning Strategies
Synthformer
DAC 8-D latents + conditioning
Token stack (parallel)
Sounds: Conditioning and Parameter sensitivity
Category changes
Param sweep
More sounds
Param sweep
Mixed categories?
Training and inference times
Topic: DAC streaming?
Why a transformer?
Topic: Architectural biases
Architecture
Topic: Conditioning Strategies?
Codec Latent space
[Figure: DAC codec latent space. The waveform is encoded at 86 frames/sec; each frame's latent is projected to 8 dimensions and quantized into a stack of codebook tokens (Q1 to Q9, indices 0 to 1023). Axes: codebooks × time.]
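A minimal sketch of what the figure depicts: residual vector quantization (RVQ) turning each frame's continuous latent into a stack of codebook indices. The sizes (9 codebooks of 1024 entries, 8-D latents, 86 frames/sec) follow the figure; the codebooks here are random stand-ins, not actual DAC weights.

import numpy as np

rng = np.random.default_rng(0)
n_codebooks, codebook_size, latent_dim = 9, 1024, 8
codebooks = rng.normal(size=(n_codebooks, codebook_size, latent_dim))

def rvq_encode(latents):
    """latents: (n_frames, latent_dim) -> codes: (n_codebooks, n_frames)."""
    residual = latents.copy()
    codes = np.zeros((n_codebooks, len(latents)), dtype=np.int64)
    for q in range(n_codebooks):
        # nearest codebook entry for the current residual, per frame
        dists = ((residual[:, None, :] - codebooks[q][None, :, :]) ** 2).sum(-1)
        codes[q] = dists.argmin(axis=1)
        residual = residual - codebooks[q][codes[q]]   # quantize what is left over
    return codes

frames = rng.normal(size=(86, latent_dim))     # ~1 second of latents at 86 frames/sec
print(rvq_encode(frames).shape)                # (9, 86): one Q1..Q9 token stack per frame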
Topic: Universal Codec???
Codec sequence strategies
MusicGen
Flattened codebooks: the exact joint pdf can be modeled (but the sequence gets much longer)
Parallel codebooks: inexact – only loose dependence between codebooks
Delay pattern: a compromise?
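A hedged sketch of the "delay" interleaving in the spirit of MusicGen (Copet et al., 2023): codebook q is shifted right by q frames, so step t predicts Q1 for frame t alongside Q2 for frame t-1, and so on. The function name and padding value are illustrative.

import numpy as np

def delay_pattern(codes, pad=-1):
    """codes: (n_codebooks, n_frames); shift codebook q right by q steps."""
    K, T = codes.shape
    out = np.full((K, T + K - 1), pad, dtype=codes.dtype)
    for q in range(K):
        out[q, q:q + T] = codes[q]
    return out

codes = np.arange(12).reshape(3, 4)      # toy example: 3 codebooks, 4 frames
print(delay_pattern(codes))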
Multi-embedding
Decompress DAC codes to frozen DAC “token” embeddings.
Time Step 1: [Token A1] [Token A2] [Token A3]
Time Step 2: [Token B1] [Token B2] [Token B3]
Time Step 3: [Token C1] [Token C2] [Token C3]
Time Step 1: [Embedding A1] [Embedding A2] [Embedding A3]
Time Step 2: [Embedding B1] [Embedding B2] [Embedding B3]
Time Step 3: [Embedding C1] [Embedding C2] [Embedding C3]
Separate parallel (“multi”) embedding.
Conflated embeddings
…Transformer input
OR
Topic: Does it even matter?
Time Step 1: [Embedding (A1+A2+A3)]
Time Step 2: [Embedding (B1+B2+B3)]
Time Step 3: [Embedding (C1+C2+C3)]
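A sketch of the two input strategies above (names and sizes are illustrative; whether the per-codebook tables are learned or frozen DAC embeddings is orthogonal to the comparison): keep the codebook embeddings separate and project them to the model width, or conflate them by summation into one vector per time step.

import torch
import torch.nn as nn

n_codebooks, vocab, d_model = 3, 1024, 512
embeds = nn.ModuleList(nn.Embedding(vocab, d_model) for _ in range(n_codebooks))

codes = torch.randint(0, vocab, (1, 86, n_codebooks))      # (batch, time, codebooks)

# Separate parallel ("multi") embedding: one vector per codebook,
# concatenated (3 * d_model) and projected back down to d_model.
multi = torch.cat([embeds[q](codes[..., q]) for q in range(n_codebooks)], dim=-1)
multi = nn.Linear(n_codebooks * d_model, d_model)(multi)    # (1, 86, 512)

# Conflated embedding: sum the per-codebook embeddings into one vector per step.
conflated = sum(embeds[q](codes[..., q]) for q in range(n_codebooks))   # (1, 86, 512)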
Context window
BANDED Mask << context
Topic: Sliding window?
Masking
Training: big, to parallelize training while avoiding edge effects
Inference: smaller, appropriate for the sliding window size
Topic: Look ahead for RT audio generation?
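A minimal sketch of a banded causal mask that is much narrower than the context window: training can use a long sequence for parallelism, while each position only attends to the last `band` frames, matching a sliding window at inference. Names and sizes are illustrative.

import torch

def banded_causal_mask(seq_len, band):
    """True where position i may attend to j, i.e. 0 <= i - j < band."""
    i = torch.arange(seq_len)[:, None]
    j = torch.arange(seq_len)[None, :]
    return (i - j >= 0) & (i - j < band)

mask = banded_causal_mask(seq_len=1024, band=256)   # band << context window
print(mask.shape, mask.float().mean())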
Positional Encoding
Kinda like Binary Counting
Adding sinusoidal positional encoding
Start with a vector at (1,1) – what does it become at different positions m?
Why does this work at all?
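A short sketch of the standard sinusoidal positional encoding (Vaswani et al., 2017): each dimension pair cycles at its own frequency, which is where the "binary counting" intuition comes from. Sizes are illustrative.

import numpy as np

def sinusoidal_pe(n_positions, d_model):
    # PE[m, 2i] = sin(m / 10000**(2i/d)), PE[m, 2i+1] = cos(m / 10000**(2i/d))
    m = np.arange(n_positions)[:, None]          # positions
    i = np.arange(0, d_model, 2)[None, :]        # even dimension indices
    angles = m / (10000 ** (i / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

print(sinusoidal_pe(4, 8).round(3))   # added to token embeddings before the first layer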
Relative Positional Encoding
RoPE
Pairwise position- and dimension-dependent rotations
Su, J., et al. (2024). RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568, 127063.
m is the position in the sequence; theta also depends on the dimension index
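A sketch of the RoPE rotation: each (even, odd) dimension pair of a query or key vector is rotated by the angle m * theta_i, with theta_i = 10000**(-2i/d). Applied to both queries and keys, the dot products then depend only on relative offsets. This is a toy NumPy version, not an optimized implementation.

import numpy as np

def rope(x, m, base=10000.0):
    """x: (d,) query/key vector at sequence position m; returns the rotated vector."""
    d = x.shape[0]
    i = np.arange(d // 2)
    theta = base ** (-2 * i / d)
    cos, sin = np.cos(m * theta), np.sin(m * theta)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

q = rope(np.ones(8), m=5)
k = rope(np.ones(8), m=3)   # q.dot(k) depends only on the offset 5 - 3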
RoPE
The pig chased the dog
In the forest the pig bit the dog
ALiBi
Topic: Other positional encoding strategies?
Transformer-XL?
Other masking/positional strategies
Topic: Necessary context window length? Does context window length influence the positional coding choice?
Multi-headed attention: K, Q, V
Original Formulation
Encoder
Decoder
Cross-attention
Self-attention
Vaswani, A., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30. [1]
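A sketch of scaled dot-product attention from the original formulation: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V. In self-attention Q, K, V come from the same sequence; in cross-attention Q comes from the decoder and K, V from the encoder. Shapes here are illustrative.

import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5        # (..., n_q, n_k)
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ V

Q = K = V = torch.randn(2, 10, 64)                       # (batch, positions, d_k)
out = attention(Q, K, V)                                 # (2, 10, 64)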
MLP
How many MLPs for context window of length n?
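For reference, a sketch of how the position-wise MLP works in the standard transformer block (not necessarily this model): one feed-forward network per layer, with shared weights, applied independently at each of the n positions. Sizes are illustrative.

import torch
import torch.nn as nn

d_model, d_ff, n = 512, 2048, 1024
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

x = torch.randn(1, n, d_model)     # (batch, positions, model dim)
y = ffn(x)                         # the same weights are broadcast over the position axis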
Output layer
Output Sampling
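A hedged sketch of sampling the next codec token from the output layer's logits, with temperature and top-k as the usual knobs; the function name and default values are illustrative, not the model's actual settings.

import torch

def sample_next_token(logits, temperature=1.0, top_k=50):
    logits = logits / temperature
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

logits = torch.randn(1, 1024)            # one step, one 1024-entry codebook
print(sample_next_token(logits))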
Future work
END
Anticipation