1 of 42

Lonce Wyse

TRANSFORMER SYNTHESIS

2 of 42

Audio (music) companies

  • Suno
    • “lo-fi jazz with female vocals about rain”

  • Udio

  • Stable Audio (diffusion)
    • Royalty-free music generation for “content” creators

  • Eleven Music (ElevenLabs) (hybrid?)
    • New editing controls, voice as temporal SFX control (not real-time)

Research

  • AudioLM, MusicGen, Audio LDM (diffusion)

ALL use subsymbolic reps

(codecs, spectrograms, latent embeddings)

3 of 42

AudioLM

  • Problems:
    • “even powerful models like WaveNet [1] generate unstructured audio”
    • Neural-codec token sequences are still long for audio (8 × 75 = 600 tokens/sec) compared to language
  • Solution: different timescales & objectives
    • a) semantic (objective = masked prediction) (“w2v-BERT”) 25 Hz
    • b) acoustic (objective = reconstruction) (“soundstream”)
  • Inference – first model the sequence of semantic tokens, then use them to condition the sequence of acoustic tokens
    • “semantic”
    • “acoustic”
  • Problem – audio/music doesn’t have semantics in the same way speech does (many acoustic tokens for one semantic)

2022

4 of 42

AudioLM

  • Tokenization

5 of 42

AudioLM

  • Semantic, coarse acoustic, fine acoustic.
    • Acoustic flattened to create a longer (unstacked) sequence

    • Prompt for continuations
      • 1st prompt (from unseen examples) semantic, then predict
      • 2nd prompt coarse, then predict conditioned on semantic
      • 3rd prompt from data, condition on semantic and coarse
    • Note:
      • 3 passes, and interaction through prompt only

6 of 42

Semantic v acoustic?

  • Unconditional generation
  • Only sound tokens (2 layers)
  • -> Piano continuation examples

7 of 42

Question

  • What does “autoregressive” transformer mean?

8 of 42

Question

  • What does “autoregressive” transformer mean?
  • Inference – you run the forward pass for each time step, conditioned on all previous steps
    • Append the newly predicted token to the sequence to form the input for the next step
      • Naively, recompute over the whole past sequence each time.
    • What about all those attention matrix computations?
      • KV caching – compute attention only for the new time step (see the sketch below).
        • Reduces cost from O(T^2) to O(T)
        • Still, must process each token through the entire layer stack, and every layer still depends on all previous layers.
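
A minimal sketch of KV caching at inference (single attention head, hypothetical names; not the AudioLM implementation): only the new step’s key and value are computed and appended to a cache, and only the new query attends over it.

```python
import torch

def attend_with_cache(x_new, W_q, W_k, W_v, cache):
    """One autoregressive step. x_new: (1, d_model), the newest token only.
    cache["K"], cache["V"]: (T-1, d_k), rows already computed for past steps."""
    q = x_new @ W_q                                    # query for the new step only
    cache["K"] = torch.cat([cache["K"], x_new @ W_k])  # append new key row  -> (T, d_k)
    cache["V"] = torch.cat([cache["V"], x_new @ W_v])  # append new value row -> (T, d_k)
    scores = q @ cache["K"].T / cache["K"].shape[-1] ** 0.5  # (1, T): O(T) per step, not O(T^2)
    return torch.softmax(scores, dim=-1) @ cache["V"]        # attention output for the new step
```

Each layer keeps its own cache, which is why every new token must still pass through the full layer stack even though past attention is not recomputed.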

9 of 42

MusicGen and Soundstorm

  • AudioLM
    • Semantic tokens
    • Coarse acoustic
    • Fine acoustic

  • MusicGen (2023, Meta)
    • “Single stage” transformer
    • No semantic tokens
    • Just Encodec
  • AudioLM
    • Autoregressive

  • Soundstorm (2023, Google)
    • AudioLM “decoder”
    • Semantic tokens as conditioning, but
    • Parallel (masked & iterative) prediction of acoustic tokens

10 of 42

MusicGen

  • Text conditioned
    • T5 – Encoder is text-to-vector *sequence*
      • Far more info than an “embedding”
      • Classifier-free guidance (coming up later)

  • Melody conditioned
    • At input, one value for each sequence step

MusicGen

Q: What is the difference?

11 of 42

MusicGen

  • “Decoder only”
  • Conditioning strategy
    • Decoder only model
    • T5 (pretrained) produces text embedding
    • Chromas from chromagram appended at each timestep

"slow ambient piano"

Max: C D E F F# G G# A

12 of 42

Playable Transformer?

  • Would like to start from a sound (e.g. from a database such as freesound) as a ‘query’ and provide a model for exploring a semantic space surrounding the query.

  • Training data source
    • Exploit TTA models (AudioBox, Fugatto) or diffusion models that have understandable latent spaces but are not playable, to generate sounds for training RT synths
    • Required: Generate lowD params and sounds
    • Nice to have: novelty

  • Build a conditionally trainable synth that is
    • Playable, real-time and streamable, capable of novelty

Topic: Generalist vs specialist models?

13 of 42

Other control-responsive models

  • DDSP
    • fast training, small model, excellent quality
    • audio domain
    • strong inductive bias, hard to train on multiple sound classes

  • WaveNet
    • Good quality, arbitrarily conditionable, “babbling” ~ texture
    • Audio or frame-based input, usually audio out
    • autoregressive and dilated convolution make it expensive

  • RAVE
    • SOTA for training RT Synthesizers
    • VAE representation (for latents) and GAN training
    • No conditioning input (though see Esling 2023)

  • MusicGen
    • Autoregressive Transformer, Tokenized representation
    • Conditioning with text
    • Not real-time in speed (300M, 1.5B, 3.3B parameters)
    • Real-time controllable(?)

High-level meaningful, continuous controls over the raw audio waveform still challenging

14 of 42

Conditioning Strategies

  • Grey line between conditioning values and input values?

Topic: Special symbols?

15 of 42

Conditioning Strategies

  • MusicGen
    • Text encodings – T5 a *single vector*
      • Injected where? Cross attention at each block – but what does that mean if the text encoding is just one vector?
    • Melody (chromagram) encoding that changes over time
      • Injected where? As a “prefix” to input at each block.
      • Why “overfitting” and how do they manage it?

  • Synthformer (Wyse, 0000)
    • Class and parameter(s)
    • Concatenated AFTER normalization in each block, then projected back to model size (see the sketch below)
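
A minimal sketch of that injection point (module and dimension names are assumptions, not the released code): the class/parameter conditioning vector is concatenated to the normalized activations at every time step and projected back to the model width.

```python
import torch
import torch.nn as nn

class ConditionedNorm(nn.Module):
    """LayerNorm, then concatenate conditioning to each time step, then project back."""
    def __init__(self, d_model=512, d_cond=16):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.proj = nn.Linear(d_model + d_cond, d_model)

    def forward(self, x, cond):                # x: (B, T, d_model), cond: (B, d_cond)
        h = self.norm(x)
        c = cond.unsqueeze(1).expand(-1, x.shape[1], -1)   # repeat cond over time steps
        return self.proj(torch.cat([h, c], dim=-1))        # back to (B, T, d_model)
```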

16 of 42

Synthformer

[Diagram: DAC 8-D latents + conditioning → token stack (parallel)]

17 of 42

Sounds: Conditioning and Parameter sensitivity

Category changes

Param sweep

18 of 42

More sounds

Param sweep

Mixed categories?

19 of 42

Training and inference times

  • Base model, 10M parameters
    • CPU
      • Inference
        • Transformer: 20 seconds of sound in 10.28 s
        • DAC Decoder: 20 seconds of sound in 7.8 s
    • GPU
      • Train
        • 1 minute/epoch – usually 200 epochs
      • Inference
        • Transformer: 20 seconds of sound in 2.35 s
        • DAC Decoder: 20 seconds of sound in 0.46 s

Topic: DAC streaming?

20 of 42

Why a transformer?

  • Naturally autoregressive for sequence generation
  • Easy to condition
  • Low frame rate for high-quality (and fast) reconstruction
    • “Universal” DAC codec (could also use others)
  • Proven performance in other domains
  • Attention seems better than recurrence for capturing structured (e.g. periodic) time dependencies
  • Weaker “inductive biases” than some other nets

Topic: Architectural biases

21 of 42

Architecture

  • DAC, 4 codebooks (8D vector for each)
  • Multi embedding (one for each codebook)
  • T block
    • Norm
    • Concat (unnormalized) cond, project to model “size”
    • Multihead (8) self-attention (sketched after this list)
      • Compute K, Q, V projections
      • Apply RoPE to K, Q
      • Attention scores
      • Mask (causal)
      • Softmax, then weight V
    • Dropout + skip
    • Norm
    • Feed forward
    • Dropout + skip
  • Logits and softmax, sampling
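
A hedged sketch of the attention path listed above, single head for brevity (the model uses 8); `apply_rope` is assumed to be something like the rotation sketched on the RoPE slide later.

```python
import torch

def causal_self_attention(x, W_q, W_k, W_v, apply_rope):
    """x: (T, d_model). Project, rotate Q and K, score, mask, softmax, weight V."""
    T = x.shape[0]
    q, k, v = x @ W_q, x @ W_k, x @ W_v          # K, Q, V projections
    q, k = apply_rope(q), apply_rope(k)          # RoPE on Q and K only (not V)
    scores = q @ k.T / k.shape[-1] ** 0.5        # attention scores
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))   # causal mask
    return torch.softmax(scores, dim=-1) @ v     # softmax, then weight V
```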

Topic: Conditioning Strategies?

22 of 42

Codec Latent space

[Figure: encode → 8-D projection → quantization; codebooks Q1–Q9 of token indices over time, at 86 frames/sec]

Topic: Universal Codec???

23 of 42

Codec sequence strategies

MusicGen

Exact pdf can be expected

Inexact – loose dependence

Compromise?

24 of 42

Multi-embedding

Decompress DAC codes to frozen DAC “token” embeddings.

Time Step 1: [Token A1] [Token A2] [Token A3]

Time Step 2: [Token B1] [Token B2] [Token B3]

Time Step 3: [Token C1] [Token C2] [Token C3]

Time Step 1: [Embedding A1] [Embedding A2] [Embedding A3]

Time Step 2: [Embedding B1] [Embedding B2] [Embedding B3]

Time Step 3: [Embedding C1] [Embedding C2] [Embedding C3]

Separate parallel (“multi”) embedding.

Conflated embeddings

…Transformer input

OR

Topic: Does it even matter?

Time Step 1: [Embedding (A1+A2+A3)]

Time Step 2: [Embedding (B1+B2+B3)]

Time Step 3: [Embedding (C1+C2+C3)]
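
A sketch of the two options (assumed module names, not the exact Synthformer code): one embedding table per DAC codebook, with the per-codebook embeddings either kept as parallel entries (a longer, flattened sequence) or conflated into a single vector per time step by summing.

```python
import torch
import torch.nn as nn

class CodebookEmbeddings(nn.Module):
    """Separate ('multi') embedding table per codebook."""
    def __init__(self, n_codebooks=4, codebook_size=1024, d_model=512):
        super().__init__()
        self.tables = nn.ModuleList(
            [nn.Embedding(codebook_size, d_model) for _ in range(n_codebooks)])

    def forward(self, codes, conflate=True):   # codes: (B, T, n_codebooks) integer indices
        embs = torch.stack([tab(codes[..., i])
                            for i, tab in enumerate(self.tables)], dim=2)  # (B, T, n_cb, d)
        if conflate:
            return embs.sum(dim=2)             # one conflated vector per time step
        return embs.flatten(1, 2)              # parallel: n_cb entries per time step
```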

25 of 42

Context window

  • Sliding window for inference, “just big enough”
    • + Infinite generation
    • + fixed computational cost
    • - edge effects
  • Training, want long context for parallel training

  • Autoregressive inference,
    • sliding window
    • Inference mask << training mask
      • But the banded lengths are equal!
    • Recent elements have full masks

BANDED Mask << context

Topic: Sliding window?
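
A small sketch of a banded causal mask (plain PyTorch; the exact band width used in training is not specified here): each position may attend to itself and a limited number of earlier positions, so the training context can be long while matching the sliding window used at inference.

```python
import torch

def banded_causal_mask(T, band):
    """True where attention is allowed: causal, and at most `band` positions wide."""
    i = torch.arange(T).unsqueeze(1)   # query positions
    j = torch.arange(T).unsqueeze(0)   # key positions
    return (j <= i) & (i - j < band)   # (T, T) boolean mask

# e.g. banded_causal_mask(6, 3)[4] allows keys {2, 3, 4} for query position 4
```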

26 of 42

Masking

Training mask: big, to parallelize training while avoiding edge effects.

Inference mask: smaller, appropriate for the sliding-window size.

Topic: Look ahead for RT audio generation?

27 of 42

Positional Encoding

  • Absolute positional embeddings using sinusoids
  • Applied once before transformer blocks
  • Doesn’t extrapolate well (though some models use context of up to a million!)
  • Label-like – no sense of distance

Kinda like Binary Counting
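
The standard sinusoidal table from “Attention is All You Need”, for reference: it is added once to the input embeddings before the first block, and different dimension pairs cycle at different rates (the “binary counting” flavour).

```python
import torch

def sinusoidal_positions(n_positions, d_model):
    """pe[m, 2i] = sin(m / 10000^(2i/d_model)), pe[m, 2i+1] = cos(same). d_model even."""
    m = torch.arange(n_positions, dtype=torch.float32).unsqueeze(1)    # positions
    i = torch.arange(0, d_model, 2, dtype=torch.float32).unsqueeze(0)  # even dimension indices
    angles = m / (10000 ** (i / d_model))
    pe = torch.zeros(n_positions, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe    # added once to the input embeddings, before the first block
```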

28 of 42

Adding sin pos encoding

Start with vector at (1,1) – what is it in different positions m?

Why does this work at all?

29 of 42

Relative Positional Encoding

  • Makes sense for music and some environmental audio
  • Sensible with sliding window
  • Has to be applied to the Q and K matrices at each transformer block, not just once into the embedding at the beginning.

30 of 42

RoPE

  • Preserves some qualities of both absolute and relative positional encoding
  • Rotates embeddings by amounts depending on (absolute) positions.
  • But dot product doesn’t change if relative position is the same! Thus the K Q dot product in attention computation depends on relative position
    • This is also why the encoding “extrapolates” to longer sequences than seen in training.

31 of 42

Pairwise position- and dimension-dependent rotations

  • Rotate vectors for Q and K, (but not V) at every transformer block

Su, J., et al. (2024). RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568, 127063.

m is the position in the sequence; θ also depends on d, the dimension index (in RoFormer, θ_d = 10000^(−2d/D))
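
A minimal rotation consistent with the description above: each dimension pair (2d, 2d+1) of the vector at position m is rotated by angle m·θ_d.

```python
import torch

def apply_rope(x):
    """x: (T, D) queries or keys, D even. Rotate each dim pair by position-dependent angles."""
    T, D = x.shape
    m = torch.arange(T, dtype=torch.float32).unsqueeze(1)                 # positions
    theta = 10000.0 ** (-torch.arange(0, D, 2, dtype=torch.float32) / D)  # per-pair frequencies
    ang = m * theta                                                       # (T, D/2)
    cos, sin = torch.cos(ang), torch.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin        # 2-D rotation of each (x1, x2) pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because only the angle difference between two positions survives the Q·K dot product, applying the same rotation at every block gives the relative-position behaviour described above.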

32 of 42

RoPE

  • Preserves relative angles (and thus dot product)

  • Same distance apart, same angle between them. Thus we have (implicit) relative encoding.

The pig chased the dog

In the forest the pig bit the dog

33 of 42

ALiBi

  • Positional coding based on relative position (query–key distance)
  • Computed once
  • Applied at the same point in computation as a mask (e.g. causal) so can be combined if mask is constant.

  • softmax(q_i · K^T + m · [−(i−1), …, −2, −1, 0]), where m is a fixed, head-specific slope

Topic: other pos strategy?

Transformer-XL?
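
A sketch of the ALiBi bias as in Press et al. (2022): a fixed, head-specific slope times the query–key distance, added to the attention scores at the same point as the causal mask.

```python
import torch

def alibi_bias(T, n_heads=8):
    """(n_heads, T, T) additive bias: -slope_h * (i - j) for keys j behind query i."""
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    i = torch.arange(T).unsqueeze(1)
    j = torch.arange(T).unsqueeze(0)
    dist = (i - j).clamp(min=0).float()       # distance back in time
    return -slopes[:, None, None] * dist      # add to scores before softmax
```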

34 of 42

Other masking/positional strategies

  • KV caching (e.g. Transformer-XL)
    • Depends on positional encoding that doesn’t dynamically change Q computation
    • Good for long sequences because computation grows as O(n) rather than O(n^2)
  • RoPE
    • Probably slower than ALiBi
    • Separation of masking and positional
  • ALiBi

Topic: Necessary context window length? Does context window length influence pos coding choice?

35 of 42

Multi headed, K, Q, V

  • Multi-head attention
    • How and why does it work?
      • Wq, Wk, and Wv are used to project X onto Q, K, and V
      • If d_model is 512 and you have 8 heads, each matrix projects down onto a 512/8 = 64-d space
      • Each head works on different dimensions of the input.
      • Concatenating gets you back to the original (d_model) dimension (see the shape sketch after this list)

  • Cross-attention vs Self-attention
    • Q from ‘self’ K and V from ‘cross’ or ‘other’
      • Why does that make sense?
    • How are self and x-attention combined?
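
A shape-only sketch of the head split and the cross-attention wiring (projections omitted; names illustrative): d_model = 512 with 8 heads gives 64 dimensions per head, and in cross-attention Q comes from the “self” sequence while K and V come from the “other” (e.g. text) sequence.

```python
import torch

d_model, n_heads = 512, 8
d_head = d_model // n_heads                    # 512 / 8 = 64 dims per head

def split_heads(x):                            # (B, T, 512) -> (B, 8, T, 64)
    B, T, _ = x.shape
    return x.view(B, T, n_heads, d_head).transpose(1, 2)

def cross_attention(x_self, x_other):
    """Q from 'self', K and V from 'other' (the conditioning sequence)."""
    q, k, v = split_heads(x_self), split_heads(x_other), split_heads(x_other)
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5
    out = torch.softmax(scores, dim=-1) @ v            # (B, 8, T_self, 64)
    B, _, T, _ = out.shape
    return out.transpose(1, 2).reshape(B, T, d_model)  # concatenate heads -> d_model
```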

36 of 42

Original Formulation

[Diagram: the original encoder–decoder formulation, with self-attention in both stacks and cross-attention from decoder to encoder]

Vaswani, A., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems.

37 of 42

MLP

How many MLPs for context window of length n?

38 of 42

Output layer

  • Linear -> Softmax
    • Result: probability over all tokens
    • Loss: cross entropy for difference between distributions
  • Temperature

  • But sampling range/temperature don’t seem to make as much difference with texture audio tokens as they do with language tokens
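
A minimal temperature / top-k sampling sketch over one codebook’s logits. With a 1024-entry codebook, the “Top 1024” case on the next slide is just unrestricted sampling, and “Top 1” is greedy.

```python
import torch

def sample_token(logits, temperature=1.0, top_k=1024):
    """logits: (codebook_size,) for one codebook at one time step."""
    logits = logits / temperature                          # <1 sharpens, >1 flattens
    if top_k < logits.shape[-1]:
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))   # keep only the top k
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, 1).item()              # top_k = 1 reduces to argmax
```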

39 of 42

Output Sampling

  • Top 1

  • Top 5

  • Top 1024

40 of 42

Future work

  • Masking and sequence strategies
  • Computational efficiency (how many blocks need positional information, for example)
  • Morphing strategies such as the Mitsubishi group paper for combining attention matrices
  • Better codecs with semantic information
  • Novelty generation strategies
  • Cross modal
    • Audio (analogy)
    • Video (SuperSonic)
  • From textures to events
    • Include special tokens for start-event…
    • (or) Simply train with single events

41 of 42

END

42 of 42

Anticipation

  • Intuition: improvising musicians actually do look ahead.
  • Exactly like random masking strategies used for training (as in BERT or VampNet)
    • Masking is not causal
      • That means that at each step in an autoregressive inference, values for sequence positions already predicted are *revised* given new future values.
  • Autoregression with look-ahead:
    • n – context length, m – look-ahead
    • At each step,
      • Take n-m as the streaming output
      • Take the previous most forward prediction, concatenate it to the input sequence, and shift window.
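
A hedged sketch of one possible reading of this loop (the model call, buffer layout, and indexing are assumptions, not a specification): the window holds n tokens of which the trailing m are provisional; each step the non-causal model revises the provisional region and produces one new most-forward prediction, the token leaving the look-ahead region is committed to the output stream, and the window shifts by one.

```python
def lookahead_stream(model, window, m, steps):
    """window: list of n tokens; the trailing m are provisional look-ahead values."""
    out = []
    for _ in range(steps):
        revised, new_token = model(window)    # non-causal pass: revised values for the last m
                                              # positions, plus one new most-forward prediction
        window[-m:] = revised                 # already-predicted look-ahead values are *revised*
        window = window[1:] + [new_token]     # concatenate the most forward prediction, shift
        out.append(window[-m - 1])            # the token leaving the look-ahead region -> stream
    return out
```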