1 of 65

REPRESENTATION & MODELING

Lonce Wyse

2 of 65

Lossy, Non-invertible representations?

  • HEAR "Audio Representations":
    • Goal: Learn semantically meaningful features for immediate downstream use
    • Expectation: Should work well on classification/detection tasks with minimal additional training
    • Examples: wav2vec 2.0 features, CLAP embeddings, pre-trained CNN features
    • Optimization target: Semantic understanding (through self-supervision, contrastive learning, etc.)
  • Universal Codecs (EnCodec, DAC):
    • Goal: Compress audio while preserving reconstruction quality
    • Expectation: Can represent any audio, but may need substantial training for semantic tasks
    • Optimization target: Perceptual fidelity, not semantic structure

3 of 65

Encoding, embedding, and latent spaces, tokens

  • “Encoding” is used in broader contexts than just NNs
    • Mp3, mel frequency, MFCC, vocoding, etc.

  • We transform raw data into a new space that is easier to work with
    • Almost always lower dimensional
    • Usually creates “spatial” relationships
    • Generalizability
    • Semantics
    • Sometimes invertible (enough information is represented to recreate (“decode”) the input from the embeddings)

4 of 65

Coding/decoding for other nets

  • Convert between audio “features” and raw audio, so a network can work on your representation
    • WaveNet as a “decoder” (e.g., for Tacotron 2)
    • MelGAN
    • Advantages: the core net can work on easier, lower-dimensional, more understandable data

  • AudioLDM – put signal into a latent space, process it (diffusion) expand it (decode it) at the other end.
  • TANGO – diffusion model on latent space

5 of 65

Mel Spectrogram

6 of 65

Mel Spectrogram

  • Mel Filter bank
    • Usually fewer bands than FFT bins (e.g., 64)
    • Works well for speech
    • Can be inverted (lossy) with trained networks
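A mel filter bank is easy to sketch from scratch. The following is a minimal numpy version (in practice you would use librosa or torchaudio); the sample rate, FFT size, and hop below are illustrative choices, with 64 bands as on this slide:

```python
import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=1024, n_mels=64):
    # Triangular filters, evenly spaced on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)   # falling edge
    return fb

def mel_spectrogram(x, sr=16000, n_fft=1024, hop=256, n_mels=64):
    # Power spectrogram via framed FFT, then projection onto the mel bands
    frames = [x[s:s + n_fft] * np.hanning(n_fft)
              for s in range(0, len(x) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(np.stack(frames), axis=1)) ** 2
    return mel_filterbank(sr, n_fft, n_mels) @ power.T  # (n_mels, n_frames)
```

Note the dimensionality drop: 513 FFT bins per frame become 64 mel bands, which is part of why mel spectrograms are such a popular intermediate representation.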

7 of 65

MELGAN

  • Input: compare to typical GANs like GanSynth

  • Loss function –
    • Either real/fake prediction, or Wasserstein difference (remember GanSynth)

Reconstructing Audio from Mel Spectra

[Figure: MelGAN generator takes a mel-spectrogram condition plus random input and produces audio X; the discriminator provides real/fake and feature-difference losses.]

Kumar, K., Kumar, R., De Boissiere, T., Gestin, L., Teoh, W. Z., Sotelo, J., ... & Courville, A. C. (2019). MelGAN: Generative adversarial networks for conditional waveform synthesis. Advances in Neural Information Processing Systems, 32.

8 of 65

BigVGAN

  • Introduced
    • Snake activation function
    • multi-scale and multi-period discriminators
    • A slew of “auxiliary” loss functions (pitch and amplitude functions)

Lee, S. G., Ping, W., Ginsburg, B., Catanzaro, B., & Yoon, S. (2022). BigVGAN: A universal neural vocoder with large-scale training. arXiv preprint arXiv:2206.04658.

Pretrained models available

AMP: anti-aliased multi-periodicity

9 of 65

Workflow

Audio → Encoder (E) → low-D latents → [ Your network here ] → low-D latents → Decoder (D) → Audio

10 of 65

Sequence of latents

  • nsynth

https://nsynthsuper.withgoogle.com/

Nsynth Super (hardware interface)

What does the “encoded” signal look like?

16 dimensions for every 512 samples
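The numbers work out to a large dimensionality reduction. A quick check, assuming NSynth's 16 kHz sample rate:

```python
# NSynth's WaveNet autoencoder emits one 16-D latent per 512 samples.
sr, hop, dim = 16000, 512, 16
frames_per_sec = sr / hop              # 31.25 latent frames per second
values_per_sec = frames_per_sec * dim  # 500 latent values per second
compression = sr / values_per_sec      # vs. 16000 raw samples per second
print(frames_per_sec, values_per_sec, compression)
```

So the network in the middle sees 500 numbers per second instead of 16,000: a 32× reduction.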

11 of 65

morph

Two sequences of latents, each an n × m matrix (n latent dimensions × m time frames):

A = | a₁₁ a₁₂ a₁₃ … a₁(ₘ₋₁) a₁ₘ |
    | a₂₁ a₂₂ a₂₃ … a₂(ₘ₋₁) a₂ₘ |
    |  ⋮   ⋮   ⋮  ⋱    ⋮      ⋮ |
    | aₙ₁ aₙ₂ aₙ₃ … aₙ(ₘ₋₁) aₙₘ |

B = | b₁₁ b₁₂ b₁₃ … b₁(ₘ₋₁) b₁ₘ |
    | b₂₁ b₂₂ b₂₃ … b₂(ₘ₋₁) b₂ₘ |
    |  ⋮   ⋮   ⋮  ⋱    ⋮      ⋮ |
    | bₙ₁ bₙ₂ bₙ₃ … bₙ(ₘ₋₁) bₙₘ |

Sequence of latents

Nature of this “representation”?

Some instrument morphs:
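A morph between two latent sequences is just frame-wise interpolation. A small sketch (array layout assumed to be latent dimensions × time frames, matching the matrices above):

```python
import numpy as np

def morph(A, B, alpha):
    # Frame-wise linear interpolation between two latent sequences.
    # A, B: (dim, n_frames); alpha in [0, 1] (0 -> pure A, 1 -> pure B)
    assert A.shape == B.shape
    return (1.0 - alpha) * A + alpha * B

def morph_ramp(A, B):
    # alpha ramps 0 -> 1 across the sequence, so the decoded sound
    # glides from instrument A into instrument B
    alpha = np.linspace(0.0, 1.0, A.shape[1])
    return (1.0 - alpha) * A + alpha * B
```

Decoding the interpolated sequence is what produces the instrument morphs heard in the examples.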

12 of 65

Next up: Tokenization!

13 of 65

Tokenization

  • Characteristics
    • Discrete
    • Countable
    • (usually) Carry some meaning
    • (usually) context dependent
  • Advantages
    • Function like vocabularies

Tokens have worked great for text and speech (how?). Can they do the same for audio (how?)

14 of 65

“Universal” Audio Codecs

  • Soundstream
  • Encodec
  • DAC

  • Huh? Don’t we already have them (e.g. mu-law or mp3)?

  • Trained
  • Size matters

15 of 65

codes

Q9:  823  549  983  212  514  728  849  523  332  505
Q8:  640   46  344  774  961  477  813  251  585 1023
Q7:  361  499  134  547  388   93  387  703  197  454
Q6:  628  108  620  803  384  586  305  433  966  537
Q5:  171  823  197  526   35  405  896  684  238  143
Q4:  622  512  866  244  812  214   71  177  142  344
Q3:  720  201  667  483  336  855  243  662  807  969
Q2:  420  219  740  691  616  212   70  265 1019  272
Q1:    5  174  212  160  213  764  111  604  686  542

Rows = codebooks (Q1–Q9); columns = time →

input: ~44.1k samples/s → codes: ~80ish frames/s

16 of 65

DAC bit rate Performance

Mp3

Raw

Recall LT dependencies

Let’s count…

17 of 65

DAC bit rate Performance

MP3   44.1 kHz (1 ch)   128 kbps   40ish

Raw   44.1 kHz (2 ch)   700 kbps

Recall LT dependencies

Let’s count…

18 of 65

Counting

  • Atoms in the universe: ~2^266 ≈ 10^80

And, what is the true dimensionality of the perceptually relevant audio manifold?

CD quality sampled 1-second sounds?

19 of 65

How many 1-second sounds?

  • Atoms in the universe: ~2^266 ≈ 10^80

  • 16-bit, 44.1 kHz: 65536^44100 ≈ 10^212K

  • (1024^8)^75 = 1024^600 ≈ 10^1.8K

CD-quality sampling

Encodec: 75 fps, 8 codebooks
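These astronomical exponents are quick to check by working in log10 (a small arithmetic sketch):

```python
import math

# Distinct 1-second CD-quality signals, as a log10 exponent:
cd_bits = 16 * 44100                 # bits per second of 16-bit, 44.1 kHz audio
log10_cd = cd_bits * math.log10(2)   # ~212,000 -> about 10^212K signals

# Encodec at 75 frames/s with 8 codebooks of 1024 entries:
enc_log10 = 600 * math.log10(1024)   # (1024^8)^75 = 1024^600 -> ~10^1806

# For scale: atoms in the observable universe
atoms = 266 * math.log10(2)          # 2^266 -> ~10^80
```

The token representation covers vastly fewer distinct signals than raw sampling, which is exactly the bet: the perceptually relevant audio manifold is much smaller than the raw sample space.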

20 of 65

DAC / RVQGAN

  • Convolutional autoencoder
    • Waveform in, sequence of embeddings out
    • Embeddings are then quantized
  • Trained end-to-end using both reconstruction (multiscale frequency transforms) and adversarial losses
  • Residual (a.k.a. multistage) vector quantizer, learned end-to-end with the rest of the model

Kumar, R., Seetharaman, P., Luebs, A., Kumar, I., & Kumar, K. (2024). High-fidelity audio compression with improved RVQGAN. Advances in Neural Information Processing Systems, 36.

21 of 65

Overall structure

Reconstruction

Losses

(Soundstream, Zeghidour et al 2021)

22 of 65

More detail …

  • DAC Innovations:
  • Snake activation function
    • (provides an inductive bias for periodic functions)

  • Multiscale error measures
    • Multiple discriminators

  • “Factorization” of code lookup and code embedding
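The snake nonlinearity itself is one line. A sketch (in BigVGAN/DAC, alpha is a learned per-channel parameter; here it is a fixed scalar):

```python
import numpy as np

def snake(x, alpha=1.0):
    # Snake activation: x + (1/alpha) * sin^2(alpha * x)
    # Near-identity overall (derivative 1 + sin(2*alpha*x) >= 0, so it is
    # monotone), but with a built-in ripple -- an inductive bias toward
    # the periodic structure of audio waveforms
    return x + np.sin(alpha * x) ** 2 / alpha
```

Unlike ReLU or tanh, the activation itself oscillates, so the network does not have to synthesize periodicity purely from its weights.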

23 of 65

VQ-VAE

  • Embedding space is a categorical codebook
    • Similar to K-nearest neighbor strategy
  • Want z vectors to use the codebook efficiently
  • Gradient has some “error” due to quantization

Codebook loss

Commitment loss

Reconstruction loss
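The three VQ-VAE loss terms can be sketched numerically. This is a numpy stand-in: in a real framework, stop-gradients make the codebook term update only the codebook and the commitment term update only the encoder; beta is the usual commitment weight:

```python
import numpy as np

def vq_vae_losses(z_e, codebook, x, x_hat, beta=0.25):
    # z_e: (batch, dim) encoder outputs; codebook: (K, dim) categorical codes
    # Nearest-neighbor quantization (the codebook lookup)
    d = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    z_q = codebook[d.argmin(axis=1)]

    recon = ((x - x_hat) ** 2).mean()              # reconstruction loss
    codebook_loss = ((z_q - z_e) ** 2).mean()      # moves codes toward (stopped) z_e
    commitment = beta * ((z_e - z_q) ** 2).mean()  # keeps z_e near its chosen code
    return recon + codebook_loss + commitment
```

The quantization step itself has no gradient; the straight-through estimator copies the decoder's gradient past it, which is the "error" in the gradient the slide mentions.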

24 of 65

Factoring codes for quantization

  • Code embedding: 1024-D “z” space
  • Projection (learned) to an 8-D space for quantizing (“latent”)

25 of 65

Encoder and Codebook loss

26 of 65

Multiscale objectives

  • STFT, MEL, waveforms

  • L1 loss on mel-spectrograms computed with window lengths of [32, 64, 128, 256, 512, 1024, 2048] and hop length set to window_length / 4. We especially find that using the lowest hop size of 8 improves modeling of very quick transients that are especially common in the music domain.
  • multi-period waveform discriminator (MPWD) folds waveform at different periods into a 2D structure.
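The multiscale spectral loss is simple to sketch. This rough numpy version uses plain magnitude STFTs instead of mel projections for brevity, with the DAC window set and hop = window / 4 (so the smallest hop is 8):

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    # Magnitude spectrogram from framed, Hann-windowed FFTs
    frames = [x[s:s + n_fft] * np.hanning(n_fft)
              for s in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

def multiscale_stft_l1(x, y, windows=(32, 64, 128, 256, 512, 1024, 2048)):
    # L1 distance between magnitude spectrograms at several resolutions.
    # Short windows catch fast transients; long windows catch fine pitch.
    return sum(np.abs(stft_mag(x, w, w // 4) - stft_mag(y, w, w // 4)).mean()
               for w in windows)
```

No single time-frequency resolution is right for all content, so the loss averages over many: that is the whole trick.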

27 of 65

Quantizing

  • 1024-D embedding vector, with normalization

  • 2-layer net to “project” to an 8-D latent space, with normalization

  • Quantization learns to minimize the “commitment loss” (the distance between the projected vector and its quantization point).

28 of 65

Residual quantizer

29 of 65

Residual quantizer

What if you want to choose the number of codebooks?

30 of 65

RVQ

  • Codebooks are not independent
    • Higher order codebooks depend on the residual from previous codebooks
    • Codebooks have a temporal dependency as well – not just dependent on the audio in the frame
      • Different from some other audio representations!
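The residual quantizer itself fits in a few lines. A numpy sketch of the encode/decode cycle (greedy nearest-neighbor per stage):

```python
import numpy as np

def rvq_encode(z, codebooks):
    # Each stage quantizes the residual left over by the previous stage
    codes, residual = [], z.astype(float).copy()
    for cb in codebooks:                 # cb: (codebook_size, dim)
        idx = int(((residual - cb) ** 2).sum(axis=1).argmin())
        codes.append(idx)
        residual = residual - cb[idx]    # what the next stage must explain
    return codes

def rvq_decode(codes, codebooks):
    # Reconstruction is simply the sum of the chosen entries
    return sum(cb[i] for cb, i in zip(codebooks, codes))
```

This is why the codebooks are not independent: code k only makes sense on top of the sum of codes 1..k−1, and why dropping higher-order codebooks gracefully degrades (rather than destroys) the reconstruction.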

31 of 65

DAC latent space?

[Same grid of codes as before: rows Q1–Q9 (codebooks), columns time frames.]

Projection: 1024 → 8

Quantization

Nice low-D input!(?)

32 of 65

Masked language modeling

Note: masking is for training only. Masked tokens are predicted during pretraining, but BERT is then used to produce contextual embeddings for downstream tasks (the embeddings feed logits so an error can be computed).

33 of 65

Iterative reconstruction

  • At each iteration, keep the most “confident” values and then run again

34 of 65

Using codecs: VampNet training

  • Masked *generative* modeling
  • Separate coarse and fine nets
    • Coarse + fine reconstruction?
    • Is this really necessary?

35 of 65

Using codecs: VampNet inference

Prompting strategies - https://youtu.be/3XfeWlV9Cp0?t=80

36 of 65

Non-autoregressive

Prompt and audio:

37 of 65

Meta’s Encodec

  • Défossez, A., Copet, J., Synnaeve, G., & Adi, Y. (2022). High fidelity neural audio compression. arXiv preprint arXiv:2210.13438.
  • Simpler than DAC
    • 24kHz and 48kHz model
    • Lower quality
    • Bandwidths

  • Options for coding
    • 24 kHz: “all at once”
    • 48 kHz:
      • One-second chunks, scales, padding for overlap-add
      • Optional bidirectional coding

38 of 65

Encodec architecture

39 of 65

MusicGen

  • Actually a TTM (text-to-music) model
    • 24-layer, single-stage, decoder-only transformer
    • T5 or similar creates a sequence of text embeddings
    • Cross-attention from the MusicGen transformer
    • Dataset: 20K hours
    • Output: 8 seconds of audio (antique: 2023)
    • Examples

Copet, J., Kreuk, F., Gat, I., Remez, T., Kant, D., Synnaeve, G., ... & Défossez, A. (2023). Simple and controllable music generation. Advances in Neural Information Processing Systems, 36, 47704–47720.

40 of 65

MusicGen

  • From Meta, one year after Encodec
  • A “Language Model”
    • Note the term
  • Recall RVQ token creation
    • So how to present/predict them ?

41 of 65

Possible Strategies

42 of 65

RNeNcodec

43 of 65

RNeNcodec

  • RNN
    • 3 layers
    • 128D->128D

44 of 65

RNN core

[Figure: at inference, params, latents, and data at t = n feed the RNN, which outputs the codebook tokens C#1–C#4 for t = n+1.]

Data to latents
Output to latents
Why use tokens at all?

45 of 65

RNeNcodec

  • Audio → ecdc (codes)
    • → latents
    • Why latents?
  • Embed conditioning into n < “model size” dimensions
  • Parameter: ratio of latent to conditioning embedding size
  • Concatenate the two embeddings up to “model size”

46 of 65

Codebooks out (parallel)

  • Learned projection
    • Simple, linear
  • Fast and parallel
    • Temporally parallel: All time t predictions happen at time t
    • Computationally parallel:

[Figure: the RNN state feeds four parallel heads producing C#1–C#4 logits; each is sampled independently, yielding the token stack C#1–C#4.]
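The parallel heads are just independent linear projections from the shared RNN state. A shape sketch (sizes here are illustrative, not the actual RNeNcodec dimensions; argmax stands in for sampling):

```python
import numpy as np

rng = np.random.default_rng(0)
n_codebooks, hidden, vocab = 4, 128, 1024

# One independent linear head (W, b) per codebook
heads = [(rng.normal(size=(hidden, vocab)) * 0.02, np.zeros(vocab))
         for _ in range(n_codebooks)]

def predict_all(h):
    # All codebooks at time t are predicted at once from the same RNN state
    return [h @ W + b for W, b in heads]

logits = predict_all(rng.normal(size=hidden))
tokens = [int(l.argmax()) for l in logits]   # greedy stand-in for sampling
```

Because no head waits on another, the whole token stack for time t comes out in one pass.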

47 of 65

Use lower-order code selections

  • Learned combination of RNN and previous code latent
    • Temporally parallel: All time t predictions happen at time t
    • Computationally sequential:

  • Same kind of cascading dependence as RVQ
    • Each coded depends on residual after previous code

  • Next sequence step input?

[Figure: the C#1 logits are sampled; the sampled code’s latent feeds, together with the RNN state, the C#2 head, and so on through C#4.]

48 of 65

Next sequence input

  • Because of the way RVQ latents are derived,
  • We can just sum them!

[Figure: as before, but the four sampled code latents are summed to form the next sequence input.]

49 of 65

Tokens to summary latent

50 of 65

Weighted sum for *each* token.

  • “Soft tokens”
    • Expected values
  • Avoids sampling
    • Errors could accumulate
    • Some export formats can’t include sampling

    • Next RNN input?

[Figure: each head’s logits become a weighted sum over codebook latents (a “soft token”) — no sampling step.]
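A soft token is the expected codebook latent under the softmax of the logits. A minimal sketch:

```python
import numpy as np

def soft_token(logits, codebook):
    # Expected codebook latent under softmax(logits): a "soft token".
    # Numerically stable softmax, then a weighted sum -- no sampling.
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return p @ codebook   # (dim,) expectation over codebook entries
```

Since the operation is a deterministic matrix product, it survives export formats that cannot express sampling, and it avoids compounding sampling errors across stages.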

51 of 65

Teacher forcing

  • During training
    • Compute error based on predicted vs. correct
    • Pass the *correct* code to the next logit stage
    • Alternate training with and without teacher forcing

[Figure: as in the weighted-sum diagram, but during training the teacher-forced (TF) latent of the *correct* code is passed between stages.]
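The teacher-forcing control flow is a small loop. A toy sketch (`predict` stands in for one head-plus-latent stage; names are illustrative):

```python
def run_stages(predict, correct, teacher_forcing):
    # predict(prev) -> this stage's output. With teacher forcing, the
    # *correct* code (not the model's own output) feeds the next stage,
    # so one bad prediction cannot derail the stages after it.
    outputs, prev = [], None
    for target in correct:
        y = predict(prev)
        outputs.append(y)
        prev = target if teacher_forcing else y
    return outputs
```

Alternating the `teacher_forcing` flag during training lets the model also practice recovering from its own (possibly wrong) outputs, which is what it faces at inference.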

52 of 65

Playability

  • Computation must be faster than real time (“Real-Time Factor”)
  • Many TTA (text-to-audio) models have RTF < 1
    • But they compute the whole clip in parallel, not streaming

  • Assuming responsiveness, we want low latency
    • Gesture (parameter change) → audible effect
    • For instrumental musicians?
    • For “texture-like” environmental sounds?

53 of 65

Classical RT architecture

When the read pointer catches up to the write pointer, generate a new “hop”

How do you generate a “hop” with RNeNcodec?

54 of 65

Playability

  • Keep OS supplied with buffers
    • Get a “hop” of audio at a time from the backend
  • Lowish latency
    • Take smallish “hop”
  • Sound quality
    • Keep old code frames for decoding
    • Return “hop” of audio
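The buffer-feeding loop above can be sketched with a simple FIFO (a minimal illustration, not the actual RNeNcodec player; `generate_hop` stands in for the backend's per-hop synthesis):

```python
from collections import deque

class HopStreamer:
    # The backend synthesizes one "hop" of samples at a time; the audio
    # callback drains a FIFO to keep the OS supplied with full buffers.
    def __init__(self, generate_hop):
        self.generate_hop = generate_hop   # callable -> iterable of samples
        self.fifo = deque()

    def callback(self, n_samples):
        # Generate hops until we can fill one OS buffer, then hand it over
        while len(self.fifo) < n_samples:
            self.fifo.extend(self.generate_hop())
        return [self.fifo.popleft() for _ in range(n_samples)]
```

Smaller hops mean lower gesture-to-sound latency but more frequent backend calls, so the hop size is the playability knob.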

55 of 65

Sound Sets

56 of 65

RNeNcodec examples

  • others?

57 of 65

Two notebooks

  • Encodec Explorer

https://github.com/lonce/EncodecExplorerCore

  • RNeNcodec

58 of 65

Final Projects

  • What do you want to explore, learn more about, or use creatively?
  • Conference paper/presentation
    • Paper
      • Summarize relevant literature
      • Develop some code
      • Present (video) with some sound
  • Ideas
    • Ablation studies on a generative audio network
    • Modify a Transformer-based network to use a different positional encoding
    • Compare tokenization strategies
    • Explore the latent space of a trained net
    • Creative orientation

59 of 65

Things I don’t know about RNeNcodec

  • Can it be conditioned on pitch and sound good?
  • Can it be used for timbre (or even style) transfer?
  • What about a transformer that uses the same coding trick?
  • Capacity - Does the model need to be bigger to handle a bigger data set?
    • From our collective dataset?
  • What about only class parameters, otherwise unconditioned inference?
    • Several different styles of music?
  • What do the trajectories through the encodec space look like for particular sounds?

  • Questions for the codec
  • What happens if you "down sample" the codec stream? Interesting because the RNN still uses context to recreate the audio, so it might "smooth over" artefacts.
  • 48kHz Encodec?
  • DAC encoder

60 of 65

Next week in class

  • Show Encodec exploration (using your dataset sounds)
    • Visualization, statistics, space manipulation
  • Show dataset training
    • 75–175 iterations works well
  • Discuss what worked, what didn’t, and why

61 of 65

NOT USED

62 of 65

MusicGen and tokenization

63 of 65

Curse of crust

64 of 65

Curse of high-dimension

65 of 65

Cos similarity

  • In high dimensions,
    • The curse destroys the use of magnitudes for comparing vectors
    • But… not direction (assuming there is structure in the distribution of points)
  • So… cosine similarity!
    • cos(θ) = (a · b) / (‖a‖ ‖b‖), in [−1, 1]
    • 1 = same direction, 0 = orthogonal, −1 = opposite directions
    • Cosine distance is simply 1 − cosine similarity
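The formula in code, for comparing embedding vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    # Direction-only comparison: vector magnitudes cancel out
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a, b):
    # 1 - similarity: 0 for identical direction, 2 for opposite
    return 1.0 - cosine_similarity(a, b)
```

Note that scaling either vector leaves the similarity unchanged, which is exactly why it survives the curse of dimensionality better than Euclidean distance.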