1 of 42

Lonce Wyse

TRANSFORMER SYNTHESIS

2 of 42

Audio (music) companies

  • Suno
    • “lo-fi jazz with female vocals about rain”

  • Udio

  • Stable Audio (diffusion)
    • Royalty-free music generation for “content” creators

  • Eleven Music (ElevenLabs) (hybrid?)
    • New editing controls, voice as temporal SFX control (not real-time)

Research

  • AudioLM, MusicGen, Audio LDM (diffusion)

ALL use subsymbolic reps

(codecs, spectrograms, latent embeddings)

3 of 42

AudioLM

  • Problems:
    • “even powerful models like WaveNet [1] generate unstructured audio”
    • Neural-codec token sequences are still long for audio (8 × 75 = 600 tokens/sec) compared to language
  • Solution: different timescales & objectives
    • a) semantic (objective = masked prediction) (“w2v-BERT”) 25 Hz
    • b) acoustic (objective = reconstruction) (“soundstream”)
  • Inference – first model the sequence of semantic tokens, then use them to condition the sequence of acoustic tokens
    • “semantic”
    • “acoustic”
  • Problem – audio/music doesn’t have semantics in the same way speech does (many acoustic tokens for one semantic)

2022

4 of 42

AudioLM

  • Tokenization

5 of 42

AudioLM

  • Semantic, coarse acoustic, fine acoustic.
    • Acoustic flattened to create a longer (unstacked) sequence

    • Prompt for continuations
      • 1st prompt (from unseen examples) semantic, then predict
      • 2nd prompt coarse, then predict conditioned on semantic
      • 3rd prompt from data, condition on semantic and coarse
    • Note:
      • 3 passes, and interaction through prompt only

6 of 42

Semantic v acoustic?

  • Unconditional generation
  • Only sound tokens (2 layers)
  • -> Piano continuation examples

7 of 42

Question

  • What does “autoregressive” transformer mean?

8 of 42

Question

  • What does “autoregressive” transformer mean?
  • Inference – you run the forward pass for each time step, conditioned on all previous steps
    • Append the newly predicted token to the sequence to form the input for the next step
      • Naively, recompute over the whole past sequence each time.
    • What about all those attention matrix computations?
      • KV caching – compute attention only for the new time step (see the sketch below).
        • Reduces cost from O(T^2) to O(T)
        • Still, must process each token through the entire layer stack, and every layer still depends on all previous layers.
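
A minimal sketch of KV caching at inference (single attention head, hypothetical names; not the AudioLM implementation): only the new step’s key and value are computed and appended to a cache, and only the new query attends over it.

```python
import torch

def attend_with_cache(x_new, W_q, W_k, W_v, cache):
    """One autoregressive step. x_new: (1, d_model), the newest token only.
    cache["K"], cache["V"]: (T-1, d_k), rows already computed for past steps."""
    q = x_new @ W_q                                    # query for the new step only
    cache["K"] = torch.cat([cache["K"], x_new @ W_k])  # append new key row  -> (T, d_k)
    cache["V"] = torch.cat([cache["V"], x_new @ W_v])  # append new value row -> (T, d_k)
    scores = q @ cache["K"].T / cache["K"].shape[-1] ** 0.5  # (1, T): O(T) per step, not O(T^2)
    return torch.softmax(scores, dim=-1) @ cache["V"]        # attention output for the new step
```

Each layer keeps its own cache, which is why every new token must still pass through the full layer stack even though past attention is not recomputed.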

9 of 42

MusicGen and Soundstorm

  • AudioLM
    • Semantic tokens
    • Coarse acoustic
    • Fine acoustic

  • MusicGen (2023, Meta)
    • “Single stage” transformer
    • No semantic tokens
    • Just Encodec
  • AudioLM
    • Autoregressive

  • Soundstorm (2023, Google)
    • AudioLM “decoder”
    • Semantic tokens as conditioning, but
    • Parallel (masked & iterative) prediction of acoustic tokens

10 of 42

MusicGen

  • Text conditioned
    • T5 – Encoder is text-to-vector *sequence*
      • Far more info than an “embedding”
      • Classifier-free guidance (coming up later)

  • Melody conditioned
    • At input, one value for each sequence step

MusicGen

Q: What is the difference?

11 of 42

MusicGen

  • “Decoder only”
  • Conditioning strategy
    • Decoder only model
    • T5 (pretrained) produces text embedding
    • Chromas from chromagram appended at each timestep

"slow ambient piano"

Max: C D E F F# G G# A

12 of 42

Playable Transformer?

  • Would like to start from a sound (e.g. from a database such as freesound) as a ‘query’ and provide a model for exploring a semantic space surrounding the query.

  • Training data source
    • Exploit TTA models (AudioBox, Fugatto) or diffusion models that have understandable latent spaces but are not playable, to generate sounds for training RT synths
    • Required: Generate lowD params and sounds
    • Nice to have: novelty

  • Build a conditionally trainable synth that is
    • Playable, real-time and streamable, capable of novelty

Topic: Generalist vs specialist models?

13 of 42

Other control-responsive models

  • DDSP
    • fast training, small model, excellent quality
    • audio domain
    • strong inductive bias, hard to train on multiple sound classes

  • WaveNet
    • Good quality, arbitrarily conditionable, “babbling” ~ texture
    • Audio or frame-based input, usually audio out
    • autoregressive and dilated convolution make it expensive

  • RAVE
    • SOTA for training RT Synthesizers
    • VAE representation (for latents) and GAN training
    • No conditioning input (though see Esling 2023)

  • MusicGen
    • Autoregressive Transformer, Tokenized representation
    • Conditioning with text
    • Not real-time in speed (300M, 1.5B, 3.3B parameters)
    • Real-time controllable(?)

High-level meaningful, continuous controls over the raw audio waveform still challenging

14 of 42

Conditioning Strategies

  • Grey line between conditioning values and input values?

Topic: Special symbols?

15 of 42

Conditioning Strategies

  • MusicGen
    • Text encodings – T5 a *single vector*
      • Injected where? Cross attention at each block – but what does that mean if the text encoding is just one vector?
    • Melody (chromagram) encoding that changes over time
      • Injected where? As a “prefix” to input at each block.
      • Why “overfitting” and how do they manage it?

  • Synthformer (Wyse, 0000)
    • Class and parameter(s)
    • Concatenated AFTER normalization in each block, then projected back to model size (see the sketch below)
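
A minimal sketch of that injection point (module and dimension names are assumptions, not the released code): the class/parameter conditioning vector is concatenated to the normalized activations at every time step and projected back to the model width.

```python
import torch
import torch.nn as nn

class ConditionedNorm(nn.Module):
    """LayerNorm, then concatenate conditioning to each time step, then project back."""
    def __init__(self, d_model=512, d_cond=16):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.proj = nn.Linear(d_model + d_cond, d_model)

    def forward(self, x, cond):                # x: (B, T, d_model), cond: (B, d_cond)
        h = self.norm(x)
        c = cond.unsqueeze(1).expand(-1, x.shape[1], -1)   # repeat cond over time steps
        return self.proj(torch.cat([h, c], dim=-1))        # back to (B, T, d_model)
```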

16 of 42

Synthformer

[Diagram: DAC 8-D latents + conditioning → token stack (parallel)]

17 of 42

Sounds: Conditioning and Parameter sensitivity

Category changes

Param sweep

18 of 42

More sounds

Param sweep

Mixed categories?

19 of 42

Training and inference times

  • Base model, 10M parameters
    • CPU
      • Inference
        • Transformer: 20 seconds of sound in 10.28 s
        • DAC Decoder: 20 seconds of sound in 7.8 s
    • GPU
      • Train
        • 1 minute/epoch – usually 200 epochs
      • Inference
        • Transformer: 20 seconds of sound in 2.35 s
        • DAC Decoder: 20 seconds of sound in 0.46 s

Topic: DAC streaming?

20 of 42

Why a transformer?

  • Naturally autoregressive for sequence generation
  • Easy to condition
  • Low frame rate for high-quality (and fast) reconstruction
    • “Universal” DAC codec (could also use others)
  • Proven performance in other domains
  • Attention seems better than recurrence for capturing structured (e.g. periodic) time dependencies
  • Weaker “inductive biases” than some other nets

Topic: Architectural biases

21 of 42

Architecture

  • DAC, 4 codebooks (8D vector for each)
  • Multi embedding (one for each codebook)
  • T block
    • Norm
    • Concat (unnormalized) cond, project to model “size”
    • Multihead (8) self-attention (sketched after this list)
      • Compute K, Q, V projections
      • Apply RoPE to K, Q
      • Attention scores
      • Mask (causal)
      • Softmax, then weight V
    • Dropout + skip
    • Norm
    • Feed forward
    • Dropout + skip
  • Logits and softmax, sampling
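
A hedged sketch of the attention path listed above, single head for brevity (the model uses 8); `apply_rope` is assumed to be something like the rotation sketched on the RoPE slide later.

```python
import torch

def causal_self_attention(x, W_q, W_k, W_v, apply_rope):
    """x: (T, d_model). Project, rotate Q and K, score, mask, softmax, weight V."""
    T = x.shape[0]
    q, k, v = x @ W_q, x @ W_k, x @ W_v          # K, Q, V projections
    q, k = apply_rope(q), apply_rope(k)          # RoPE on Q and K only (not V)
    scores = q @ k.T / k.shape[-1] ** 0.5        # attention scores
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))   # causal mask
    return torch.softmax(scores, dim=-1) @ v     # softmax, then weight V
```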

Topic: Conditioning Strategies?

22 of 42

Codec Latent space

[Figure: encode → 8-D projection → quantization; codebooks Q1–Q9 of token indices over time, at 86 frames/sec]

Topic: Universal Codec???

23 of 42

Codec sequence strategies

MusicGen

Exact pdf can be expected

Inexact – loose dependence

Compromise?

24 of 42

Multi-embedding

Decompress DAC codes to frozen DAC “token” embeddings.

Time Step 1: [Token A1] [Token A2] [Token A3]

Time Step 2: [Token B1] [Token B2] [Token B3]

Time Step 3: [Token C1] [Token C2] [Token C3]

Time Step 1: [Embedding A1] [Embedding A2] [Embedding A3]

Time Step 2: [Embedding B1] [Embedding B2] [Embedding B3]

Time Step 3: [Embedding C1] [Embedding C2] [Embedding C3]

Separate parallel (“multi”) embedding.

Conflated embeddings

…Transformer input

OR

Topic: Does it even matter?

Time Step 1: [Embedding (A1+A2+A3)]

Time Step 2: [Embedding (B1+B2+B3)]

Time Step 3: [Embedding (C1+C2+C3)]
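
A sketch of the two options (assumed module names, not the exact Synthformer code): one embedding table per DAC codebook, with the per-codebook embeddings either kept as parallel entries (a longer, flattened sequence) or conflated into a single vector per time step by summing.

```python
import torch
import torch.nn as nn

class CodebookEmbeddings(nn.Module):
    """Separate ('multi') embedding table per codebook."""
    def __init__(self, n_codebooks=4, codebook_size=1024, d_model=512):
        super().__init__()
        self.tables = nn.ModuleList(
            [nn.Embedding(codebook_size, d_model) for _ in range(n_codebooks)])

    def forward(self, codes, conflate=True):   # codes: (B, T, n_codebooks) integer indices
        embs = torch.stack([tab(codes[..., i])
                            for i, tab in enumerate(self.tables)], dim=2)  # (B, T, n_cb, d)
        if conflate:
            return embs.sum(dim=2)             # one conflated vector per time step
        return embs.flatten(1, 2)              # parallel: n_cb entries per time step
```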

25 of 42

Context window

  • Sliding window for inference, “just big enough”
    • + Infinite generation
    • + fixed computational cost
    • - edge effects
  • Training, want long context for parallel training

  • Autoregressive inference,
    • sliding window
    • Inference mask << training mask
      • But the banded lengths are equal!
    • Recent elements have full masks

BANDED Mask << context

Topic: Sliding window?
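
A small sketch of a banded causal mask (plain PyTorch; the exact band width used in training is not specified here): each position may attend to itself and a limited number of earlier positions, so the training context can be long while matching the sliding window used at inference.

```python
import torch

def banded_causal_mask(T, band):
    """True where attention is allowed: causal, and at most `band` positions wide."""
    i = torch.arange(T).unsqueeze(1)   # query positions
    j = torch.arange(T).unsqueeze(0)   # key positions
    return (j <= i) & (i - j < band)   # (T, T) boolean mask

# e.g. banded_causal_mask(6, 3)[4] allows keys {2, 3, 4} for query position 4
```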

26 of 42

Masking

Training mask: big, to parallelize training while avoiding edge effects.

Inference mask: smaller, appropriate for the sliding-window size.

Topic: Look ahead for RT audio generation?

27 of 42

Positional Encoding

  • Absolute positional embeddings using sinusoids
  • Applied once before transformer blocks
  • Doesn’t extrapolate well (though some models use context of up to a million!)
  • Label-like – no sense of distance

Kinda like Binary Counting
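
The standard sinusoidal table from “Attention is All You Need”, for reference: it is added once to the input embeddings before the first block, and different dimension pairs cycle at different rates (the “binary counting” flavour).

```python
import torch

def sinusoidal_positions(n_positions, d_model):
    """pe[m, 2i] = sin(m / 10000^(2i/d_model)), pe[m, 2i+1] = cos(same). d_model even."""
    m = torch.arange(n_positions, dtype=torch.float32).unsqueeze(1)    # positions
    i = torch.arange(0, d_model, 2, dtype=torch.float32).unsqueeze(0)  # even dimension indices
    angles = m / (10000 ** (i / d_model))
    pe = torch.zeros(n_positions, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe    # added once to the input embeddings, before the first block
```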

28 of 42

Adding sin pos encoding

Start with vector at (1,1) – what is it in different positions m?

Why does this work at all?

29 of 42

Relative Positional Encoding

  • Makes sense for music and some environmental audio
  • Sensible with sliding window
  • Has to be applied to the Q and K matrices at each transformer block, not just once into the embedding at the beginning.

30 of 42

RoPE

  • Preserves some qualities of both absolute and relative positional encoding
  • Rotates embeddings by amounts depending on (absolute) positions.
  • But dot product doesn’t change if relative position is the same! Thus the K Q dot product in attention computation depends on relative position
    • This is also why the encoding “extrapolates” to longer sequences than seen in training.

31 of 42

Pairwise position- and dimension-dependent rotations

  • Rotate vectors for Q and K, (but not V) at every transformer block

Su, J., et al. (2024). RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568, 127063.

m is the position in the sequence; θ also depends on d, the dimension index (in RoFormer, θ_d = 10000^(−2d/D))
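
A minimal rotation consistent with the description above: each dimension pair (2d, 2d+1) of the vector at position m is rotated by angle m·θ_d.

```python
import torch

def apply_rope(x):
    """x: (T, D) queries or keys, D even. Rotate each dim pair by position-dependent angles."""
    T, D = x.shape
    m = torch.arange(T, dtype=torch.float32).unsqueeze(1)                 # positions
    theta = 10000.0 ** (-torch.arange(0, D, 2, dtype=torch.float32) / D)  # per-pair frequencies
    ang = m * theta                                                       # (T, D/2)
    cos, sin = torch.cos(ang), torch.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin        # 2-D rotation of each (x1, x2) pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because only the angle difference between two positions survives the Q·K dot product, applying the same rotation at every block gives the relative-position behaviour described above.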

32 of 42

RoPE

  • Preserves relative angles (and thus dot product)

  • Same distance apart, same angle between them. Thus we have (implicit) relative encoding.

The pig chased the dog

In the forest the pig bit the dog

33 of 42

ALiBi

  • Positional coding based on relative position (query–key distance)
  • Computed once
  • Applied at the same point in computation as a mask (e.g. causal) so can be combined if mask is constant.

  • softmax(q_i · K^T + m · [−(i−1), …, −2, −1, 0]), where m is a fixed, head-specific slope

Topic: other pos strategy?

Transformer-XL?
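
A sketch of the ALiBi bias as in Press et al. (2022): a fixed, head-specific slope times the query–key distance, added to the attention scores at the same point as the causal mask.

```python
import torch

def alibi_bias(T, n_heads=8):
    """(n_heads, T, T) additive bias: -slope_h * (i - j) for keys j behind query i."""
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    i = torch.arange(T).unsqueeze(1)
    j = torch.arange(T).unsqueeze(0)
    dist = (i - j).clamp(min=0).float()       # distance back in time
    return -slopes[:, None, None] * dist      # add to scores before softmax
```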

34 of 42

Other masking/positional strategies

  • KV caching (e.g. Transformer-XL)
    • Depends on positional encoding that doesn’t dynamically change Q computation
    • Good for long sequences because computation grows as O(n) rather than O(n^2)
  • RoPE
    • Probably slower than ALiBi
    • Separation of masking and positional
  • ALiBi

Topic: Necessary context window length? Does context window length influence pos coding choice?

35 of 42

Multi headed, K, Q, V

  • Multi-head attention
    • How and why does it work?
      • Wq, Wk, and Wv are used to project X onto Q, K, and V
      • If d_model is 512 and you have 8 heads, each matrix projects down onto a 512/8 = 64-d space
      • Each head works on different dimensions of the input.
      • Concatenating gets you back to the original (d_model) dimension (see the shape sketch after this list)

  • Cross-attention vs Self-attention
    • Q from ‘self’ K and V from ‘cross’ or ‘other’
      • Why does that make sense?
    • How are self and x-attention combined?
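
A shape-only sketch of the head split and the cross-attention wiring (projections omitted; names illustrative): d_model = 512 with 8 heads gives 64 dimensions per head, and in cross-attention Q comes from the “self” sequence while K and V come from the “other” (e.g. text) sequence.

```python
import torch

d_model, n_heads = 512, 8
d_head = d_model // n_heads                    # 512 / 8 = 64 dims per head

def split_heads(x):                            # (B, T, 512) -> (B, 8, T, 64)
    B, T, _ = x.shape
    return x.view(B, T, n_heads, d_head).transpose(1, 2)

def cross_attention(x_self, x_other):
    """Q from 'self', K and V from 'other' (the conditioning sequence)."""
    q, k, v = split_heads(x_self), split_heads(x_other), split_heads(x_other)
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5
    out = torch.softmax(scores, dim=-1) @ v            # (B, 8, T_self, 64)
    B, _, T, _ = out.shape
    return out.transpose(1, 2).reshape(B, T, d_model)  # concatenate heads -> d_model
```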

36 of 42

Original Formulation

[Diagram: the original encoder–decoder formulation, with self-attention in both stacks and cross-attention from decoder to encoder]

Vaswani, A., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems.

37 of 42

MLP

How many MLPs for context window of length n?

38 of 42

Output layer

  • Linear -> Softmax
    • Result: probability over all tokens
    • Loss: cross entropy for difference between distributions
  • Temperature

  • But sampling range/temperature don’t seem to make as much difference with texture audio tokens as they do with language tokens
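
A minimal temperature / top-k sampling sketch over one codebook’s logits. With a 1024-entry codebook, the “Top 1024” case on the next slide is just unrestricted sampling, and “Top 1” is greedy.

```python
import torch

def sample_token(logits, temperature=1.0, top_k=1024):
    """logits: (codebook_size,) for one codebook at one time step."""
    logits = logits / temperature                          # <1 sharpens, >1 flattens
    if top_k < logits.shape[-1]:
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))   # keep only the top k
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, 1).item()              # top_k = 1 reduces to argmax
```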

39 of 42

Output Sampling

  • Top 1

  • Top 5

  • Top 1024

40 of 42

Future work

  • Masking and sequence strategies
  • Computational efficiency (how many blocks need positional information, for example)
  • Morphing strategies such as the Mitsubishi group paper for combining attention matrices
  • Better codecs with semantic information
  • Novelty generation strategies
  • Cross modal
    • Audio (analogy)
    • Video (SuperSonic)
  • From textures to events
    • Include special tokens for start-event…
    • (or) Simply train with single events

41 of 42

END

42 of 42

Anticipation

  • Intuition: improvising musicians actually do look ahead.
  • Exactly like random masking strategies used for training (as in BERT or VampNet)
    • Masking is not causal
      • That means that at each step in an autoregressive inference, values for sequence positions already predicted are *revised* given new future values.
  • Autoregression with look-ahead:
    • n – context length, m – look-ahead
    • At each step,
      • Take n-m as the streaming output
      • Take the previous most forward prediction, concatenate it to the input sequence, and shift window.
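
A hedged sketch of one possible reading of this loop (the model call, buffer layout, and indexing are assumptions, not a specification): the window holds n tokens of which the trailing m are provisional; each step the non-causal model revises the provisional region and produces one new most-forward prediction, the token leaving the look-ahead region is committed to the output stream, and the window shifts by one.

```python
def lookahead_stream(model, window, m, steps):
    """window: list of n tokens; the trailing m are provisional look-ahead values."""
    out = []
    for _ in range(steps):
        revised, new_token = model(window)    # non-causal pass: revised values for the last m
                                              # positions, plus one new most-forward prediction
        window[-m:] = revised                 # already-predicted look-ahead values are *revised*
        window = window[1:] + [new_token]     # concatenate the most forward prediction, shift
        out.append(window[-m - 1])            # the token leaving the look-ahead region -> stream
    return out
```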