REPRESENTATION & MODELING
Lonce Wyse
Lossy, Non-invertible representations?
Encoding, embedding, latent spaces, and tokens
Coding/decoding for other nets
Mel Spectrogram
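A mel spectrogram is the short-time Fourier magnitude projected onto a mel-spaced triangular filterbank, then log-compressed. A minimal NumPy sketch (window, hop, and filter settings here are illustrative choices, not the parameters of any particular model):

```python
import numpy as np

def hz_to_mel(f):
    """Convert Hz to mels (HTK formula)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, c):
            fb[i, k] = (k - lo) / max(c - lo, 1)   # rising edge
        for k in range(c, hi):
            fb[i, k] = (hi - k) / max(hi - c, 1)   # falling edge
    return fb

def mel_spectrogram(x, sr=22050, n_fft=1024, hop=256, n_mels=80):
    """Windowed power STFT -> mel projection -> log compression."""
    window = np.hanning(n_fft)
    frames = [x[s:s + n_fft] * window
              for s in range(0, len(x) - n_fft, hop)]
    power = np.abs(np.fft.rfft(np.array(frames), axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel + 1e-6)   # shape: (n_frames, n_mels)

x = np.sin(2 * np.pi * 440 * np.arange(22050) / 22050)  # 1 s of A4
S = mel_spectrogram(x)
print(S.shape)
```

Note the representation is lossy and non-invertible: phase is discarded and frequency bins are pooled, which is exactly why a neural vocoder is needed to get audio back.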
MelGAN
Reconstructing Audio from Mel Spectra
Figure: MelGAN training setup — the generator maps a random input, conditioned on the mel spectrogram, to a waveform X; the discriminator scores real/fake and supplies a feature-difference (feature-matching) loss.
Kumar, K., Kumar, R., de Boissiere, T., Gestin, L., Teoh, W. Z., Sotelo, J., ... & Courville, A. C. (2019). MelGAN: Generative adversarial networks for conditional waveform synthesis. Advances in Neural Information Processing Systems, 32.
BigVGAN
Lee, S. G., Ping, W., Ginsburg, B., Catanzaro, B., & Yoon, S. (2022). BigVGAN: A universal neural vocoder with large-scale training. arXiv preprint arXiv:2206.04658.
Pretrained models available
AMP: anti-aliased multi-periodicity
Workflow
Audio → E (encoder) → sequence of low-D latents → [ your network here ] → sequence of low-D latents → D (decoder) → Audio
https://nsynthsuper.withgoogle.com/
Nsynth Super (hardware interface)
What does the “encoded” signal look like?
16 dimensions for every 512 samples
morph
Sound A latents:                  Sound B latents:
| a₁₁ a₁₂ … a₁ₘ |                 | b₁₁ b₁₂ … b₁ₘ |
| a₂₁ a₂₂ … a₂ₘ |                 | b₂₁ b₂₂ … b₂ₘ |
|  ⋮         ⋮  |                 |  ⋮         ⋮  |
| aₙ₁ aₙ₂ … aₙₘ |                 | bₙ₁ bₙ₂ … bₙₘ |
morph: interpolate between the two latent sequences A and B, frame by frame.
Sequence of latents
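Morphing between two sounds then amounts to interpolating their latent sequences row by row before decoding. A minimal sketch, assuming two hypothetical NSynth-style latent matrices of equal shape (16 dimensions per 512-sample frame, as above):

```python
import numpy as np

def morph(A, B, alpha):
    """Linear interpolation between two latent sequences.
    alpha=0 -> pure A, alpha=1 -> pure B."""
    assert A.shape == B.shape
    return (1.0 - alpha) * A + alpha * B

# Two hypothetical latent sequences: 16 dims per 512 samples
# of 16 kHz audio, so a 4-second note gives 125 frames.
n_frames = 4 * 16000 // 512
A = np.random.randn(n_frames, 16)    # e.g. a flute note's latents
B = np.random.randn(n_frames, 16)    # e.g. a trumpet note's latents

C = morph(A, B, 0.5)                 # halfway morph; decode C to hear it
print(C.shape)
```

Because the interpolation happens in the latent space rather than on waveforms, the decoder renders a plausible intermediate timbre instead of a simple cross-fade.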
Nature of this "representation"?
Some instrument morphs:
Next up: Tokenization!
Tokenization
Tokens have worked well for text and speech (how?). Can they do the same for general audio (how?)
“Universal” Audio Codecs
codes
Q9 | 823 | 549 | 983 | 212 | 514 | 728 | 849 | 523 | 332 | 505 |
Q8 | 640 | 46 | 344 | 774 | 961 | 477 | 813 | 251 | 585 | 1023 |
Q7 | 361 | 499 | 134 | 547 | 388 | 93 | 387 | 703 | 197 | 454 |
Q6 | 628 | 108 | 620 | 803 | 384 | 586 | 305 | 433 | 966 | 537 |
Q5 | 171 | 823 | 197 | 526 | 35 | 405 | 896 | 684 | 238 | 143 |
Q4 | 622 | 512 | 866 | 244 | 812 | 214 | 71 | 177 | 142 | 344 |
Q3 | 720 | 201 | 667 | 483 | 336 | 855 | 243 | 662 | 807 | 969 |
Q2 | 420 | 219 | 740 | 691 | 616 | 212 | 70 | 265 | 1019 | 272 |
Q1 | 5 | 174 | 212 | 160 | 213 | 764 | 111 | 604 | 686 | 542 |
(rows: codebooks Q1–Q9; columns: time steps)
Token rate: roughly 80 frames/sec at the codec output vs. 44,100 samples/sec at the raw-audio input.
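The grid above makes the bit-rate arithmetic concrete: each frame carries one token per codebook, and a 1024-entry codebook costs 10 bits per token. A quick count, using illustrative settings in the spirit of DAC/EnCodec (the exact frame rate varies by model):

```python
import math

# Hypothetical codec settings matching the grid above.
frame_rate = 86          # codec frames per second ("80ish fps")
n_codebooks = 9          # Q1..Q9
codebook_size = 1024     # -> 10 bits per token

bits_per_token = math.log2(codebook_size)            # 10.0
bitrate = frame_rate * n_codebooks * bits_per_token  # bits per second
print(f"{bitrate / 1000:.2f} kbps")                  # 7.74 kbps

# Compare with raw CD-quality mono audio:
raw = 44100 * 16                                     # bits per second
print(f"raw mono: {raw / 1000:.1f} kbps, "
      f"compression ~{raw / bitrate:.0f}x")
```

So the tokenized stream is roughly two orders of magnitude more compact than raw samples, which is what makes long-term modeling over it tractable.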
DAC bit rate performance
  MP3: 44.1 kHz (1 ch), 128 kbps, 40ish
  Raw: 44.1 kHz (2 ch), 700 kbps
Recall long-term (LT) dependencies.
Let's count…
Counting
And, what is the true dimensionality of the perceptually relevant audio manifold?
How many distinct 1-second sounds are there at CD-quality sampling?
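A back-of-envelope count of the raw space: every distinct bit pattern is a distinct signal, so one second of 16-bit, 44.1 kHz mono audio admits 2^(16×44100) signals. Almost none of these are perceptually distinct, which is the intuition behind a much lower-dimensional audio manifold:

```python
import math

sr = 44100            # samples per second, CD quality
bit_depth = 16        # bits per sample

bits = sr * bit_depth                  # bits in one mono second
n_sounds_log10 = bits * math.log10(2)  # log10 of 2**bits

print(f"2^{bits} distinct 1-second signals "
      f"(about 10^{n_sounds_log10:.0f})")
```

For comparison, the number of atoms in the observable universe is around 10^80: the overwhelming majority of these signals are perceptually indistinguishable noise.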
EnCodec: 75 fps, 8 codebooks
DAC / RVQGAN
Kumar, R., Seetharaman, P., Luebs, A., Kumar, I., & Kumar, K. (2024). High-fidelity audio compression with improved RVQGAN. Advances in Neural Information Processing Systems, 36.
Overall structure
Reconstruction
Losses
(SoundStream, Zeghidour et al., 2021)
More detail …
VQ-VAE
Codebook loss
Commitment loss
Reconstruction loss
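The three VQ-VAE terms above can be sketched directly. This is a NumPy stand-in: the stop-gradient (sg) that separates the codebook and commitment terms only has meaning in an autodiff framework, so it appears here only as comments, and all tensors are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))   # 512 codes, 64-d latents
z_e = rng.normal(size=(10, 64))         # encoder outputs, 10 frames

# Quantization: nearest-neighbour lookup into the codebook.
d = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
idx = d.argmin(axis=1)
z_q = codebook[idx]

# The three VQ-VAE loss terms (sg = stop-gradient / .detach()):
x, x_hat = rng.normal(size=(10, 256)), rng.normal(size=(10, 256))
recon_loss = ((x - x_hat) ** 2).mean()       # reconstruction
codebook_loss = ((z_q - z_e) ** 2).mean()    # pulls codes toward sg(z_e)
commitment_loss = ((z_e - z_q) ** 2).mean()  # keeps encoder near sg(z_q)

beta = 0.25   # a commonly used commitment weight
total = recon_loss + codebook_loss + beta * commitment_loss
print(total > 0)
```

In a real implementation the quantization step also needs a straight-through estimator so reconstruction gradients reach the encoder; that detail is framework-specific and omitted here.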
Factorized codes for quantization
Encoder and Codebook loss
Multiscale objectives
Quantizing
Residual quantizer
Residual quantizer
What if you want to choose the number of codebooks?
RVQ
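Residual vector quantization in a few lines: each stage quantizes whatever error the previous stage left behind, so dropping later codebooks simply yields a coarser reconstruction — which is how a variable number of codebooks (and hence a variable bitrate) is supported at inference. A sketch with random codebooks; code 0 in each book is pinned to zero so a stage can "pass through" (trained codebooks make each stage shrink the residual on their own):

```python
import numpy as np

def rvq(z, codebooks):
    """Residual VQ: each stage quantizes what the previous left over."""
    residual = z.copy()
    codes, recon = [], np.zeros_like(z)
    for cb in codebooks:
        d = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(axis=1)          # nearest code to the residual
        codes.append(idx)
        recon = recon + cb[idx]
        residual = residual - cb[idx]
    return np.stack(codes), recon       # codes: (n_books, n_frames)

rng = np.random.default_rng(0)
z = rng.normal(size=(50, 8))
books = [np.vstack([np.zeros((1, 8)), rng.normal(size=(1023, 8))])
         for _ in range(9)]

codes, recon9 = rvq(z, books)           # all 9 codebooks
_, recon4 = rvq(z, books[:4])           # fewer books -> lower bitrate
err9 = ((z - recon9) ** 2).mean()
err4 = ((z - recon4) ** 2).mean()
print(codes.shape, err9 <= err4)        # coarser recon with 4 books
```

This is the structure behind the Q1–Q9 grid: Q1 carries the coarse content and each higher Q refines it.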
DAC latent space?
Q9 | 823 | 549 | 983 | 212 | 514 | 728 | 849 | 523 | 332 | 505 |
Q8 | 640 | 46 | 344 | 774 | 961 | 477 | 813 | 251 | 585 | 1023 |
Q7 | 361 | 499 | 134 | 547 | 388 | 93 | 387 | 703 | 197 | 454 |
Q6 | 628 | 108 | 620 | 803 | 384 | 586 | 305 | 433 | 966 | 537 |
Q5 | 171 | 823 | 197 | 526 | 35 | 405 | 896 | 684 | 238 | 143 |
Q4 | 622 | 512 | 866 | 244 | 812 | 214 | 71 | 177 | 142 | 344 |
Q3 | 720 | 201 | 667 | 483 | 336 | 855 | 243 | 662 | 807 | 969 |
Q2 | 420 | 219 | 740 | 691 | 616 | 212 | 70 | 265 | 1019 | 272 |
Q1 | 5 | 174 | 212 | 160 | 213 | 764 | 111 | 604 | 686 | 542 |
(rows: codebooks Q1–Q9; columns: time steps)
Projection & Quantization
Figure: each 1024-d latent frame is projected down to 8 dimensions, quantized there over time, then projected back up.
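The factorized-code trick in miniature: quantize in a low-dimensional projected space rather than in the full 1024-d latent space, which improves codebook usage. A sketch with random projection matrices (in the real model both projections and the codebook are learned):

```python
import numpy as np

rng = np.random.default_rng(0)
d_latent, d_code, n_codes = 1024, 8, 1024

W_down = rng.normal(size=(d_latent, d_code)) / np.sqrt(d_latent)
W_up = rng.normal(size=(d_code, d_latent)) / np.sqrt(d_code)
codebook = rng.normal(size=(n_codes, d_code))   # codes live in 8-d

z = rng.normal(size=(20, d_latent))             # 20 encoder frames
p = z @ W_down                                  # (20, 8): quantize here
idx = ((p[:, None] - codebook[None]) ** 2).sum(-1).argmin(1)
z_q = codebook[idx] @ W_up                      # back up to (20, 1024)
print(p.shape, z_q.shape)
```

Nearest-neighbour search in 8 dimensions is also far cheaper and less prone to dead codes than in 1024.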
Nice low-D input!(?)
Masked language modeling
Note: masking is used during training only; the masked tokens are what get predicted. Downstream, BERT is used to produce contextual embeddings for other tasks (during training those embeddings are mapped to logits so a prediction error can be computed).
Iterative reconstruction
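Iterative (non-autoregressive) reconstruction in the VampNet/MaskGIT style: start fully masked, let the model propose tokens for every position in parallel, keep the most confident proposals, and re-mask the rest for the next round. A sketch where a random-logit function stands in for the trained masked-token predictor:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, n_tok, MASK = 1024, 32, -1

def fake_model(tokens):
    """Stand-in for the trained masked-token predictor."""
    return rng.normal(size=(len(tokens), vocab))

tokens = np.full(n_tok, MASK)              # start fully masked
for keep_frac in [0.25, 0.5, 0.75, 1.0]:   # unmasking schedule
    logits = fake_model(tokens)
    e = np.exp(logits - logits.max(-1, keepdims=True))
    probs = e / e.sum(-1, keepdims=True)
    pred, conf = probs.argmax(-1), probs.max(-1)
    conf[tokens != MASK] = np.inf          # decided tokens stay decided
    keep = np.argsort(-conf)[: int(keep_frac * n_tok)]
    new = np.full(n_tok, MASK)             # re-mask everything else
    new[keep] = np.where(tokens[keep] != MASK, tokens[keep], pred[keep])
    tokens = new
print((tokens != MASK).all())
```

A handful of parallel refinement passes replaces thousands of sequential autoregressive steps, which is what makes this family of models fast enough for interactive use.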
Using codecs: VampNet training
Using codecs: VampNet inference
Prompting strategies - https://youtu.be/3XfeWlV9Cp0?t=80
Non-autoregressive
Prompt and audio:
Meta’s EnCodec
EnCodec architecture
MusicGen
Copet, J., Kreuk, F., Gat, I., Remez, T., Kant, D., Synnaeve, G., ... & Défossez, A. (2023). Simple and controllable music generation. Advances in Neural Information Processing Systems, 36, 47704-47720.
MusicGen
Possible Strategies
RNeNcodec
RNN core
Figure: the token stack (C#1–C#4) for time t = n is mapped through embedding params to latents and fed to the RNN, which predicts the data for t = n + 1.
Inference
Data to latents
Output to latents
Why use tokens at all?
RNeNcodec
Codebooks out (parallel)
Figure: the RNN produces logits for all four codebooks (C#1–C#4) in parallel; each set of logits is sampled independently to yield the next token stack.
Use lower-order code selections
Figure: each codebook's logits are sampled in turn, and each sampled token is mapped to its latent (embedding); lower-order code choices can then inform the higher-order ones.
Next sequence input
Figure: the four sampled tokens' latents are combined to form the RNN input for the next time step.
Tokens to summary latent
Weighted sum for *each* token.
Figure: for each codebook, a weighted sum (∑) over its embedding latents — weighted by the probabilities from that codebook's logits — replaces the hard sample-and-lookup.
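The weighted-sum idea above: instead of sampling a token and looking up its embedding (a hard, non-differentiable step), form each codebook's latent as the softmax-probability-weighted sum over that codebook's whole embedding table. A sketch with hypothetical table sizes:

```python
import numpy as np

def soft_lookup(logits, emb):
    """Softmax-weighted sum over an embedding table: a soft,
    differentiable stand-in for sample + hard lookup."""
    z = logits - logits.max(-1, keepdims=True)   # stable softmax
    p = np.exp(z) / np.exp(z).sum(-1, keepdims=True)
    return p @ emb                               # (d,)

rng = np.random.default_rng(0)
vocab, d = 1024, 64
embs = [rng.normal(size=(vocab, d)) for _ in range(4)]  # one table per codebook
logits = [rng.normal(size=(vocab,)) for _ in range(4)]  # one logit vector each

latents = [soft_lookup(l, e) for l, e in zip(logits, embs)]
summary = np.concatenate(latents)    # combined next-step input
print(summary.shape)
```

Because every embedding contributes in proportion to its probability, gradients flow through the whole table rather than through a single sampled index.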
Teacher forcing
Figure: during training, the weighted-sum latents feeding the next step are replaced by teacher-forcing (TF) latents built from the ground-truth tokens.
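Teacher forcing in miniature: during training, the next-step input is built from the ground-truth token at each position, not from whatever the model itself sampled, so early mistakes don't compound along the sequence. A sketch with a stand-in recurrent cell and hypothetical embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 1024, 32
emb = rng.normal(size=(vocab, d))        # hypothetical token embeddings

def rnn_step(h, x):
    """Stand-in for the RNN core's state update."""
    return np.tanh(h + x.mean())

truth = rng.integers(0, vocab, size=16)  # ground-truth token sequence

h = 0.0
for t in range(len(truth) - 1):
    # Teacher forcing: condition on the TRUE token at step t; at
    # inference this is replaced by the model's own sampled token.
    h = rnn_step(h, emb[truth[t]])
print(abs(h) < 1.0)
```

The mismatch between training (true inputs) and inference (own outputs) is the well-known exposure-bias trade-off of this scheme.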
Playability
Classical RT architecture
When the read pointer catches up to the write pointer, generate a new "hop" (block of samples).
How to generate a “hop” with RNeNcodec?
Playability
Sound Sets
RNeNcodec examples
Two notebooks
Final Projects
Things I don’t know about RNeNcodec
Next week in class