1 of 65

REPRESENTATION & MODELING

Lonce Wyse

2 of 65

Lossy, Non-invertible representations?

  • HEAR "Audio Representations":
    • Goal: Learn semantically meaningful features for immediate downstream use
    • Expectation: Should work well on classification/detection tasks with minimal additional training
    • Examples: wav2vec 2.0 features, CLAP embeddings, pre-trained CNN features
    • Optimization target: Semantic understanding (through self-supervision, contrastive learning, etc.)
  • Universal Codecs (EnCodec, DAC):
    • Goal: Compress audio while preserving reconstruction quality
    • Expectation: Can represent any audio, but may need substantial training for semantic tasks
    • Optimization target: Perceptual fidelity, not semantic structure

3 of 65

Encoding, embedding, and latent spaces, tokens

  • “Encoding” is used in broader contexts than just NNs
    • Mp3, mel frequency, MFCC, vocoding, etc.

  • We transform raw data into a new space that is easier to work with
    • Almost always lower dimensional
    • Usually creates “spatial” relationships
    • Generalizability
    • Semantics
    • Sometimes invertible (enough information is represented to recreate (“decode”) the input from the embeddings)

4 of 65

Coding/decoding for other nets

  • Convert between audio “features” and raw audio, so a network can work on your representation
    • WaveNet as a “decoder” (e.g., for Tacotron 2)
    • MelGAN
    • Advantages: the core net can work on easier, lower-dimensional, more understandable data

  • AudioLDM – put signal into a latent space, process it (diffusion) expand it (decode it) at the other end.
  • TANGO – diffusion model on latent space

5 of 65

Mel Spectrogram

6 of 65

Mel Spectrogram

  • Mel Filter bank
    • Usually fewer bands than FFT bins (e.g., 64)
    • Works well for speech
    • Can be inverted (lossy) with trained networks
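A mel filter bank is easy to sketch from scratch. The following is a minimal numpy version (in practice you would use librosa or torchaudio); the sample rate, FFT size, and hop below are illustrative choices, with 64 bands as on this slide:

```python
import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=1024, n_mels=64):
    # Triangular filters, evenly spaced on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)   # falling edge
    return fb

def mel_spectrogram(x, sr=16000, n_fft=1024, hop=256, n_mels=64):
    # Power spectrogram via framed FFT, then projection onto the mel bands
    frames = [x[s:s + n_fft] * np.hanning(n_fft)
              for s in range(0, len(x) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(np.stack(frames), axis=1)) ** 2
    return mel_filterbank(sr, n_fft, n_mels) @ power.T  # (n_mels, n_frames)
```

Note the dimensionality drop: 513 FFT bins per frame become 64 mel bands, which is part of why mel spectrograms are such a popular intermediate representation.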

7 of 65

MELGAN

  • Input: compare to typical GANs like GanSynth

  • Loss function –
    • Either real/fake prediction, or Wasserstein difference (remember GanSynth)

Reconstructing Audio from Mel Spectra

[Figure: MelGAN generator takes a mel-spectrogram condition plus random input and produces audio X; the discriminator provides real/fake and feature-difference losses.]

Kumar, K., Kumar, R., De Boissiere, T., Gestin, L., Teoh, W. Z., Sotelo, J., ... & Courville, A. C. (2019). MelGAN: Generative adversarial networks for conditional waveform synthesis. Advances in Neural Information Processing Systems, 32.

8 of 65

BigVGAN

  • Introduced
    • Snake activation function
    • multi-scale and multi-period discriminators
    • A slew of “auxiliary” loss functions (pitch and amplitude functions)

Lee, S. G., Ping, W., Ginsburg, B., Catanzaro, B., & Yoon, S. (2022). BigVGAN: A universal neural vocoder with large-scale training. arXiv preprint arXiv:2206.04658.

Pretrained models available

AMP: anti-aliased multi-periodicity

9 of 65

Workflow

Audio → Encoder (E) → low-D latents → [ Your network here ] → low-D latents → Decoder (D) → Audio

10 of 65

Sequence of latents

  • nsynth

https://nsynthsuper.withgoogle.com/

Nsynth Super (hardware interface)

What does the “encoded” signal look like?

16 dimensions for every 512 samples
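The numbers work out to a large dimensionality reduction. A quick check, assuming NSynth's 16 kHz sample rate:

```python
# NSynth's WaveNet autoencoder emits one 16-D latent per 512 samples.
sr, hop, dim = 16000, 512, 16
frames_per_sec = sr / hop              # 31.25 latent frames per second
values_per_sec = frames_per_sec * dim  # 500 latent values per second
compression = sr / values_per_sec      # vs. 16000 raw samples per second
print(frames_per_sec, values_per_sec, compression)
```

So the network in the middle sees 500 numbers per second instead of 16,000: a 32× reduction.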

11 of 65

morph

Two sequences of latents, each an n × m matrix (n latent dimensions × m time frames):

A = | a₁₁ a₁₂ a₁₃ … a₁(ₘ₋₁) a₁ₘ |
    | a₂₁ a₂₂ a₂₃ … a₂(ₘ₋₁) a₂ₘ |
    |  ⋮   ⋮   ⋮  ⋱    ⋮      ⋮ |
    | aₙ₁ aₙ₂ aₙ₃ … aₙ(ₘ₋₁) aₙₘ |

B = | b₁₁ b₁₂ b₁₃ … b₁(ₘ₋₁) b₁ₘ |
    | b₂₁ b₂₂ b₂₃ … b₂(ₘ₋₁) b₂ₘ |
    |  ⋮   ⋮   ⋮  ⋱    ⋮      ⋮ |
    | bₙ₁ bₙ₂ bₙ₃ … bₙ(ₘ₋₁) bₙₘ |

Sequence of latents

Nature of this “representation”?

Some instrument morphs:
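A morph between two latent sequences is just frame-wise interpolation. A small sketch (array layout assumed to be latent dimensions × time frames, matching the matrices above):

```python
import numpy as np

def morph(A, B, alpha):
    # Frame-wise linear interpolation between two latent sequences.
    # A, B: (dim, n_frames); alpha in [0, 1] (0 -> pure A, 1 -> pure B)
    assert A.shape == B.shape
    return (1.0 - alpha) * A + alpha * B

def morph_ramp(A, B):
    # alpha ramps 0 -> 1 across the sequence, so the decoded sound
    # glides from instrument A into instrument B
    alpha = np.linspace(0.0, 1.0, A.shape[1])
    return (1.0 - alpha) * A + alpha * B
```

Decoding the interpolated sequence is what produces the instrument morphs heard in the examples.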

12 of 65

Next up: Tokenization!

13 of 65

Tokenization

  • Characteristics
    • Discrete
    • Countable
    • (usually) Carry some meaning
    • (usually) context dependent
  • Advantages
    • Function like vocabularies

Tokens have worked great for text and speech (how?). Can they do the same for audio (how?)

14 of 65

“Universal” Audio Codecs

  • Soundstream
  • Encodec
  • DAC

  • Huh? Don’t we already have them (e.g. mu-law or mp3)?

  • Trained
  • Size matters

15 of 65

codes

Q9:  823  549  983  212  514  728  849  523  332  505
Q8:  640   46  344  774  961  477  813  251  585 1023
Q7:  361  499  134  547  388   93  387  703  197  454
Q6:  628  108  620  803  384  586  305  433  966  537
Q5:  171  823  197  526   35  405  896  684  238  143
Q4:  622  512  866  244  812  214   71  177  142  344
Q3:  720  201  667  483  336  855  243  662  807  969
Q2:  420  219  740  691  616  212   70  265 1019  272
Q1:    5  174  212  160  213  764  111  604  686  542

Rows = codebooks (Q1–Q9); columns = time →

input: ~44.1k samples/s → codes: ~80ish frames/s

16 of 65

DAC bit rate Performance

Mp3

Raw

Recall LT dependencies

Let’s count…

17 of 65

DAC bit rate Performance

MP3   44.1 kHz (1 ch)   128 kbps   40ish

Raw   44.1 kHz (2 ch)   700 kbps

Recall LT dependencies

Let’s count…

18 of 65

Counting

  • Atoms in the universe: ~2^266 ≈ 10^80

And, what is the true dimensionality of the perceptually relevant audio manifold?

CD quality sampled 1-second sounds?

19 of 65

How many 1-second sounds?

  • Atoms in the universe: ~2^266 ≈ 10^80

  • 16-bit, 44.1 kHz: 65536^44100 ≈ 10^212K

  • (1024^8)^75 = 1024^600 ≈ 10^1.8K

CD-quality sampling

Encodec: 75 fps, 8 codebooks
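These astronomical exponents are quick to check by working in log10 (a small arithmetic sketch):

```python
import math

# Distinct 1-second CD-quality signals, as a log10 exponent:
cd_bits = 16 * 44100                 # bits per second of 16-bit, 44.1 kHz audio
log10_cd = cd_bits * math.log10(2)   # ~212,000 -> about 10^212K signals

# Encodec at 75 frames/s with 8 codebooks of 1024 entries:
enc_log10 = 600 * math.log10(1024)   # (1024^8)^75 = 1024^600 -> ~10^1806

# For scale: atoms in the observable universe
atoms = 266 * math.log10(2)          # 2^266 -> ~10^80
```

The token representation covers vastly fewer distinct signals than raw sampling, which is exactly the bet: the perceptually relevant audio manifold is much smaller than the raw sample space.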

20 of 65

DAC / RVQGAN

  • Convolutional autoencoder
    • Waveform in, sequence of embeddings out
    • Embeddings are then quantized
  • Trained end-to-end using both reconstruction (multiscale frequency transforms) and adversarial losses
  • Residual (a.k.a. multistage) vector quantizer, learned end-to-end with the rest of the model

Kumar, R., Seetharaman, P., Luebs, A., Kumar, I., & Kumar, K. (2024). High-fidelity audio compression with improved RVQGAN. Advances in Neural Information Processing Systems, 36.

21 of 65

Overall structure

Reconstruction

Losses

(Soundstream, Zeghidour et al 2021)

22 of 65

More detail …

  • DAC Innovations:
  • Snake activation function
    • (provides an inductive bias for periodic functions)

  • Multiscale error measures
    • Multiple discriminators

  • “Factorization” of code lookup and code embedding
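The snake nonlinearity itself is one line. A sketch (in BigVGAN/DAC, alpha is a learned per-channel parameter; here it is a fixed scalar):

```python
import numpy as np

def snake(x, alpha=1.0):
    # Snake activation: x + (1/alpha) * sin^2(alpha * x)
    # Near-identity overall (derivative 1 + sin(2*alpha*x) >= 0, so it is
    # monotone), but with a built-in ripple -- an inductive bias toward
    # the periodic structure of audio waveforms
    return x + np.sin(alpha * x) ** 2 / alpha
```

Unlike ReLU or tanh, the activation itself oscillates, so the network does not have to synthesize periodicity purely from its weights.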

23 of 65

VQ-VAE

  • Embedding space is a categorical codebook
    • Similar to K-nearest neighbor strategy
  • Want z vectors to use the codebook efficiently
  • Gradient has some “error” due to quantization

Codebook loss

Commitment loss

Reconstruction loss
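The three VQ-VAE loss terms can be sketched numerically. This is a numpy stand-in: in a real framework, stop-gradients make the codebook term update only the codebook and the commitment term update only the encoder; beta is the usual commitment weight:

```python
import numpy as np

def vq_vae_losses(z_e, codebook, x, x_hat, beta=0.25):
    # z_e: (batch, dim) encoder outputs; codebook: (K, dim) categorical codes
    # Nearest-neighbor quantization (the codebook lookup)
    d = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    z_q = codebook[d.argmin(axis=1)]

    recon = ((x - x_hat) ** 2).mean()              # reconstruction loss
    codebook_loss = ((z_q - z_e) ** 2).mean()      # moves codes toward (stopped) z_e
    commitment = beta * ((z_e - z_q) ** 2).mean()  # keeps z_e near its chosen code
    return recon + codebook_loss + commitment
```

The quantization step itself has no gradient; the straight-through estimator copies the decoder's gradient past it, which is the "error" in the gradient the slide mentions.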

24 of 65

Factoring codes for quantization

  • Code embedding: 1024-D “z” space
  • Projection (learned) to an 8-D space for quantizing (“latent”)

25 of 65

Encoder and Codebook loss

26 of 65

Multiscale objectives

  • STFT, MEL, waveforms

  • L1 loss on mel-spectrograms computed with window lengths of [32, 64, 128, 256, 512, 1024, 2048] and hop length set to window_length / 4. We especially find that using the lowest hop size of 8 improves modeling of very quick transients that are especially common in the music domain.
  • multi-period waveform discriminator (MPWD) folds waveform at different periods into a 2D structure.
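The multiscale spectral loss is simple to sketch. This rough numpy version uses plain magnitude STFTs instead of mel projections for brevity, with the DAC window set and hop = window / 4 (so the smallest hop is 8):

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    # Magnitude spectrogram from framed, Hann-windowed FFTs
    frames = [x[s:s + n_fft] * np.hanning(n_fft)
              for s in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

def multiscale_stft_l1(x, y, windows=(32, 64, 128, 256, 512, 1024, 2048)):
    # L1 distance between magnitude spectrograms at several resolutions.
    # Short windows catch fast transients; long windows catch fine pitch.
    return sum(np.abs(stft_mag(x, w, w // 4) - stft_mag(y, w, w // 4)).mean()
               for w in windows)
```

No single time-frequency resolution is right for all content, so the loss averages over many: that is the whole trick.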

27 of 65

Quantizing

  • 1024-D embedding vector, with normalization

  • 2-layer net to “project” to an 8-D latent space, with normalization

  • Quantization learns to minimize the “commitment loss” (the distance between the projected vector and its quantization point).

28 of 65

Residual quantizer

29 of 65

Residual quantizer

What if you want to choose the number of codebooks?

30 of 65

RVQ

  • Codebooks are not independent
    • Higher order codebooks depend on the residual from previous codebooks
    • Codebooks have a temporal dependency as well – not just dependent on the audio in the frame
      • Different from some other audio representations!
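The residual quantizer itself fits in a few lines. A numpy sketch of the encode/decode cycle (greedy nearest-neighbor per stage):

```python
import numpy as np

def rvq_encode(z, codebooks):
    # Each stage quantizes the residual left over by the previous stage
    codes, residual = [], z.astype(float).copy()
    for cb in codebooks:                 # cb: (codebook_size, dim)
        idx = int(((residual - cb) ** 2).sum(axis=1).argmin())
        codes.append(idx)
        residual = residual - cb[idx]    # what the next stage must explain
    return codes

def rvq_decode(codes, codebooks):
    # Reconstruction is simply the sum of the chosen entries
    return sum(cb[i] for cb, i in zip(codebooks, codes))
```

This is why the codebooks are not independent: code k only makes sense on top of the sum of codes 1..k−1, and why dropping higher-order codebooks gracefully degrades (rather than destroys) the reconstruction.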

31 of 65

DAC latent space?

[Same grid of codes as before: rows Q1–Q9 (codebooks), columns time frames.]

Projection: 1024 → 8

Quantization

Nice low-D input!(?)

32 of 65

Masked language modeling

Note: masking is for training only. Masked tokens are predicted during pretraining, but BERT is then used to produce contextual embeddings for downstream tasks (the embeddings feed logits so an error can be computed).

33 of 65

Iterative reconstruction

  • At each iteration, keep the most “confident” values and then run again

34 of 65

Using codecs: VampNet training

  • Masked *generative* modeling
  • Separate coarse and fine nets
    • Coarse + fine reconstruction?
    • Is this really necessary?

35 of 65

Using codecs: VampNet inference

Prompting strategies - https://youtu.be/3XfeWlV9Cp0?t=80

36 of 65

Non-autoregressive

Prompt and audio:

37 of 65

Meta’s Encodec

  • Défossez, A., Copet, J., Synnaeve, G., & Adi, Y. (2022). High fidelity neural audio compression. arXiv preprint arXiv:2210.13438.
  • Simpler than DAC
    • 24kHz and 48kHz model
    • Lower quality
    • Bandwidths

  • Options for coding
    • 24 kHz: “all at once”
    • 48 kHz:
      • One-second chunks, scales, padding for overlap-add
      • Optional bidirectional coding

38 of 65

Encodec architecture

39 of 65

MusicGen

  • Actually a TTM (text-to-music) model
    • 24-layer, single-stage, decoder-only transformer
    • T5 or similar creates a sequence of text embeddings
    • Cross-attention from the MusicGen transformer
    • Dataset: 20K hours
    • Output: 8 seconds of audio (antique: 2023)
    • Examples

Copet, J., Kreuk, F., Gat, I., Remez, T., Kant, D., Synnaeve, G., ... & Défossez, A. (2023). Simple and controllable music generation. Advances in Neural Information Processing Systems, 36, 47704–47720.

40 of 65

MusicGen

  • From Meta, one year after Encodec
  • A “Language Model”
    • Note the term
  • Recall RVQ token creation
    • So how to present/predict them ?

41 of 65

Possible Strategies

42 of 65

RNeNcodec

43 of 65

RNeNcodec

  • RNN
    • 3 layers
    • 128D->128D

44 of 65

RNN core

[Figure: at inference, params, latents, and data at t = n feed the RNN, which outputs the codebook tokens C#1–C#4 for t = n+1.]

Data to latents
Output to latents
Why use tokens at all?

45 of 65

RNeNcodec

  • Audio → ecdc (codes)
    • → latents
    • Why latents?
  • Embed conditioning into n < “model size” dimensions
  • Parameter: ratio of latent to conditioning embedding size
  • Concatenate the two embeddings up to “model size”

46 of 65

Codebooks out (parallel)

  • Learned projection
    • Simple, linear
  • Fast and parallel
    • Temporally parallel: All time t predictions happen at time t
    • Computationally parallel:

[Figure: the RNN state feeds four parallel heads producing C#1–C#4 logits; each is sampled independently, yielding the token stack C#1–C#4.]
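The parallel heads are just independent linear projections from the shared RNN state. A shape sketch (sizes here are illustrative, not the actual RNeNcodec dimensions; argmax stands in for sampling):

```python
import numpy as np

rng = np.random.default_rng(0)
n_codebooks, hidden, vocab = 4, 128, 1024

# One independent linear head (W, b) per codebook
heads = [(rng.normal(size=(hidden, vocab)) * 0.02, np.zeros(vocab))
         for _ in range(n_codebooks)]

def predict_all(h):
    # All codebooks at time t are predicted at once from the same RNN state
    return [h @ W + b for W, b in heads]

logits = predict_all(rng.normal(size=hidden))
tokens = [int(l.argmax()) for l in logits]   # greedy stand-in for sampling
```

Because no head waits on another, the whole token stack for time t comes out in one pass.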

47 of 65

Use lower-order code selections

  • Learned combination of RNN and previous code latent
    • Temporally parallel: All time t predictions happen at time t
    • Computationally sequential:

  • Same kind of cascading dependence as RVQ
    • Each coded depends on residual after previous code

  • Next sequence step input?

[Figure: the C#1 logits are sampled; the sampled code’s latent feeds, together with the RNN state, the C#2 head, and so on through C#4.]

48 of 65

Next sequence input

  • Because of the way RVQ latents are derived,
  • We can just sum them!

[Figure: as before, but the four sampled code latents are summed to form the next sequence input.]

49 of 65

Tokens to summary latent

50 of 65

Weighted sum for *each* token.

  • “Soft tokens”
    • Expected values
  • Avoids sampling
    • Errors could accumulate
    • Some export formats can’t include sampling

    • Next RNN input?

[Figure: each head’s logits become a weighted sum over codebook latents (a “soft token”) — no sampling step.]
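A soft token is the expected codebook latent under the softmax of the logits. A minimal sketch:

```python
import numpy as np

def soft_token(logits, codebook):
    # Expected codebook latent under softmax(logits): a "soft token".
    # Numerically stable softmax, then a weighted sum -- no sampling.
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return p @ codebook   # (dim,) expectation over codebook entries
```

Since the operation is a deterministic matrix product, it survives export formats that cannot express sampling, and it avoids compounding sampling errors across stages.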

51 of 65

Teacher forcing

  • During training
    • Compute error based on predicted vs. correct
    • Pass the *correct* code to the next logit stage
    • Alternate training with and without teacher forcing

[Figure: as in the weighted-sum diagram, but during training the teacher-forced (TF) latent of the *correct* code is passed between stages.]
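The teacher-forcing control flow is a small loop. A toy sketch (`predict` stands in for one head-plus-latent stage; names are illustrative):

```python
def run_stages(predict, correct, teacher_forcing):
    # predict(prev) -> this stage's output. With teacher forcing, the
    # *correct* code (not the model's own output) feeds the next stage,
    # so one bad prediction cannot derail the stages after it.
    outputs, prev = [], None
    for target in correct:
        y = predict(prev)
        outputs.append(y)
        prev = target if teacher_forcing else y
    return outputs
```

Alternating the `teacher_forcing` flag during training lets the model also practice recovering from its own (possibly wrong) outputs, which is what it faces at inference.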

52 of 65

Playability

  • Computation must be faster than real time (“Real-Time Factor”)
  • Many TTA (text-to-audio) models have RTF < 1
    • But they compute the whole clip in parallel, not streaming

  • Assuming responsiveness, we want low latency
    • Gesture (parameter change) → audible effect
    • For instrumental musicians?
    • For “texture-like” environmental sounds?

53 of 65

Classical RT architecture

When the read pointer catches up to the write pointer, generate a new “hop”

How do you generate a “hop” with RNeNcodec?

54 of 65

Playability

  • Keep OS supplied with buffers
    • Get a “hop” of audio at a time from the backend
  • Lowish latency
    • Take smallish “hop”
  • Sound quality
    • Keep old code frames for decoding
    • Return “hop” of audio
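The buffer-feeding loop above can be sketched with a simple FIFO (a minimal illustration, not the actual RNeNcodec player; `generate_hop` stands in for the backend's per-hop synthesis):

```python
from collections import deque

class HopStreamer:
    # The backend synthesizes one "hop" of samples at a time; the audio
    # callback drains a FIFO to keep the OS supplied with full buffers.
    def __init__(self, generate_hop):
        self.generate_hop = generate_hop   # callable -> iterable of samples
        self.fifo = deque()

    def callback(self, n_samples):
        # Generate hops until we can fill one OS buffer, then hand it over
        while len(self.fifo) < n_samples:
            self.fifo.extend(self.generate_hop())
        return [self.fifo.popleft() for _ in range(n_samples)]
```

Smaller hops mean lower gesture-to-sound latency but more frequent backend calls, so the hop size is the playability knob.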

55 of 65

Sound Sets

56 of 65

RNeNcodec examples

  • others?

57 of 65

Two notebooks

  • Encodec Explorer

https://github.com/lonce/EncodecExplorerCore

  • RNeNcodec

58 of 65

Final Projects

  • What do you want to explore, learn more about, or use creatively?
  • Conference paper/presentation
    • Paper
      • Summarize relevant literature
      • Develop some code
      • Present (video) with some sound
  • Ideas
    • Ablation studies on a generative audio network
    • Modify a Transformer-based network to use a different positional encoding
    • Compare tokenization strategies
    • Explore the latent space of a trained net
    • Creative orientation

59 of 65

Things I don’t know about RNeNcodec

  • Can it be conditioned on pitch and sound good?
  • Can it be used for timbre (or even style) transfer?
  • What about a transformer that uses the same coding trick?
  • Capacity - Does the model need to be bigger to handle a bigger data set?
    • From our collective dataset?
  • What about only class parameters, otherwise unconditioned inference?
    • Several different styles of music?
  • What do the trajectories through the encodec space look like for particular sounds?

  • Questions for the codec
  • What happens if you "down sample" the codec stream? Interesting because the RNN still uses context to recreate the audio, so it might "smooth over" artefacts.
  • 48kHz Encodec?
  • DAC encoder

60 of 65

Next week in class

  • Show Encodec exploration (using your dataset sounds)
    • Visualization, statistics, space manipulation
  • Show dataset training
    • 75–175 iterations works well
  • Discuss what worked, what didn’t, and why

61 of 65

NOT USED

62 of 65

MusicGen and tokenization

63 of 65

Curse of crust

64 of 65

Curse of high-dimension

65 of 65

Cos similarity

  • In high dimensions,
    • The curse destroys the use of magnitudes for comparing vectors
    • But… not direction (assuming there is structure in the distribution of points)
  • So… cosine similarity!
    • cos(θ) = (a · b) / (‖a‖ ‖b‖), in [−1, 1]
    • 1 = same direction, 0 = orthogonal, −1 = opposite directions
    • Cosine distance is simply 1 − cosine similarity
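The formula in code, for comparing embedding vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    # Direction-only comparison: vector magnitudes cancel out
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a, b):
    # 1 - similarity: 0 for identical direction, 2 for opposite
    return 1.0 - cosine_similarity(a, b)
```

Note that scaling either vector leaves the similarity unchanged, which is exactly why it survives the curse of dimensionality better than Euclidean distance.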