1 of 44

Diffusion-based generative models for audio

GLADIA Research Group

Speakers: Michele Mancusi, Giorgio Mariani

2 of 44

Generative Models

3 of 44

Generative Models

  • Generative models learn the data distribution $p(x)$ from training samples and can then generate new, unseen data points from it.

4 of 44

Generative Models: Diffusion models

  • Diffusion models define a sequence of steps to slowly add random noise to data (forward process) and then learn to reverse it (backward process).
  • This makes it possible to obtain plausible data points starting from pure random noise (see the sketch below).
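As an illustration, here is a minimal sketch of a DDPM-style forward process (the variance-preserving parameterization and the linear noise schedule are common choices, assumed here rather than taken from the slides): each step mixes the data with Gaussian noise, and the closed form lets us jump to any step t directly.

import numpy as np

# Variance-preserving forward process (DDPM-style), assumed for illustration:
# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear noise schedule (assumption)
alpha_bars = np.cumprod(1.0 - betas)    # cumulative fraction of signal kept

def forward_diffuse(x0, t, rng=np.random.default_rng(0)):
    """Sample x_t ~ q(x_t | x_0) in closed form."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return xt, noise    # a diffusion model is trained to undo this noise

# By t = T-1, x_t is essentially pure Gaussian noise; the learned backward
# process inverts these steps one at a time to generate new data.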

5 of 44

Understanding Diffusion Models: Langevin Dynamics

6 of 44

Brownian motion

Pollen grains in water

Simulation of particles moving in water

https://water.lsbu.ac.uk/water/Brownian.html


7 of 44

Langevin Dynamics

A particle suspended in a fluid obeys the Langevin equation:

$$ m \frac{d^2 x}{dt^2} = -\nabla U(x) - \gamma \frac{dx}{dt} + \sqrt{2 \gamma k_B T}\, \eta(t) $$

where $U$ is the potential energy, $\gamma$ the friction coefficient, $T$ the temperature, and $\eta(t)$ white Gaussian noise.

For very tiny particles (few microns):

$$ \frac{dx}{dt} = -\frac{1}{\gamma} \nabla U(x) + \sqrt{\frac{2 k_B T}{\gamma}}\, \eta(t) $$

The inertial term becomes negligible: this is overdamped Langevin dynamics.
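A minimal simulation sketch of this overdamped dynamics (constants and step size are assumptions; with $U = 0$ it reduces to free Brownian motion):

import numpy as np

def simulate_overdamped(grad_U, x0, dt=1e-3, n_steps=10_000,
                        gamma=1.0, kBT=1.0, rng=np.random.default_rng(0)):
    """Euler integration of dx = -grad_U(x)/gamma dt + sqrt(2 kBT/gamma) dW."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x += -grad_U(x) / gamma * dt + np.sqrt(2 * kBT / gamma * dt) * noise
    return x

# Free Brownian motion: no potential, the particle just diffuses.
endpoint = simulate_overdamped(grad_U=lambda x: np.zeros_like(x), x0=[0.0, 0.0])
print(endpoint)   # a random-walk endpoint; spread grows like 2 * kBT/gamma * t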

8 of 44

Brownian motion


9 of 44

Brownian motion

10 of 44

Brownian motion


11 of 44

Fokker-Planck equation

The probability density $p(x, t)$ of a particle following the overdamped Langevin dynamics evolves according to the Fokker-Planck equation:

$$ \frac{\partial p(x,t)}{\partial t} = \nabla \cdot \left( \frac{\nabla U(x)}{\gamma}\, p(x,t) \right) + \frac{k_B T}{\gamma} \nabla^2 p(x,t) $$

12 of 44

Steady State of Langevin Dynamics

The steady state of the Fokker-Planck equation is the Boltzmann distribution $p_{ss}(x) \propto e^{-U(x)/k_B T}$. If we want to sample from a target distribution $p(x)$, we need to set the potential energy to be

$$ U(x) = -k_B T \log p(x) $$

Our equation becomes (in units where $k_B T = \gamma = 1$):

$$ \frac{dx}{dt} = \nabla_x \log p(x) + \sqrt{2}\, \eta(t) $$

Simulating this dynamics for long enough yields samples from $p(x)$.

13 of 44

Sampling Using Langevin Dynamics

We can use the Euler-Maruyama method to discretize the dynamics with step size $\epsilon$:

$$ x_{k+1} = x_k + \epsilon\, \nabla_x \log p(x_k) + \sqrt{2 \epsilon}\, z_k, \qquad z_k \sim \mathcal{N}(0, I) $$

As $\epsilon \to 0$ and the number of steps grows, $x_k$ converges to a sample from $p(x)$.
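A minimal sketch of this sampler on a toy 2D Gaussian, where the score $\nabla_x \log p(x)$ is known in closed form (the toy target, step size, and chain length are assumptions for illustration):

import numpy as np

def langevin_sample(score, x0, eps=1e-2, n_steps=1000,
                    rng=np.random.default_rng(0)):
    """Euler-Maruyama discretization of Langevin dynamics:
    x_{k+1} = x_k + eps * score(x_k) + sqrt(2 * eps) * z_k."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + eps * score(x) + np.sqrt(2 * eps) * z
    return x

# Toy target: N(mu, I) in 2D, whose score is simply -(x - mu).
mu = np.array([3.0, -1.0])
score = lambda x: -(x - mu)

samples = np.stack([langevin_sample(score, x0=np.zeros(2)) for _ in range(500)])
print(samples.mean(axis=0))   # approaches mu as eps -> 0 and n_steps -> inf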

14 of 44

Sampling Using Langevin Dynamics


Illustration from https://yang-song.net/blog/2021/score

15 of 44

Diffusion Models

16 of 44

Score function

  • Note that Langevin dynamics never uses $p(x)$ itself; it only needs the gradient of the log-density,

$$ s(x) = \nabla_x \log p(x) $$

Score function!

17 of 44

Score-based model

We need to train a model $s_\theta(x)$ to estimate the score function $\nabla_x \log p(x)$

by minimizing this loss (a.k.a. Fisher divergence):

$$ \mathbb{E}_{p(x)} \left[ \left\| \nabla_x \log p(x) - s_\theta(x) \right\|_2^2 \right] $$
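A sketch of what this loss looks like when the true score happens to be available analytically (the standard-Gaussian toy target and the linear model are assumptions for illustration; the next slides explain why the true score is unavailable for real data):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((10_000, 2))    # samples from p = N(0, I)

true_score = lambda x: -x               # score of N(0, I), known in closed form

def model_score(x, W):
    """A toy linear score model s_theta(x) = x @ W.T."""
    return x @ W.T

W = -0.5 * np.eye(2)                    # some (bad) initial parameters
fisher_div = np.mean(np.sum((true_score(x) - model_score(x, W)) ** 2, axis=1))
print(fisher_div)   # ~0.5 here; exactly 0 when W = -I, i.e. s_theta = true score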

18 of 44

The ideal workflow…

19 of 44

Two important issues

  • Score estimation is inaccurate in regions of low data density.
  • The ground-truth data score $\nabla_x \log p(x)$ is unknown, so the Fisher divergence cannot be computed directly.

20 of 44

1st Problem: Low-density region

How can we bypass the difficulty of accurate score estimation in regions of low data density? 

21 of 44

Solution: add noise to the data

22 of 44

How much noise?

  • Larger noise can cover more low-density regions for better score estimation, but it over-corrupts the data.
  • Smaller noise causes less corruption of the data, but does not cover the low-density regions.

Solution: Multiple Levels!

23 of 44

Annealed Langevin Dynamics

Annealed Langevin dynamics combines a sequence of Langevin chains with gradually decreasing noise scales (see the sketch below).
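A minimal sketch of annealed Langevin sampling (the geometric noise schedule and the step-size rule proportional to sigma^2 follow the common NCSN recipe and are assumptions here; score(x, sigma) is a noise-conditional model as on the following slides):

import numpy as np

def annealed_langevin(score, x0, sigmas, eps0=2e-5, steps_per_level=100,
                      rng=np.random.default_rng(0)):
    """Run one Langevin chain per noise level, from large sigma to small.

    score(x, sigma): estimate of grad_x log p_sigma(x) at noise level sigma.
    sigmas: decreasing noise scales, e.g. np.geomspace(10.0, 0.01, 10).
    """
    x = np.array(x0, dtype=float)
    for sigma in sigmas:
        eps = eps0 * (sigma / sigmas[-1]) ** 2    # larger steps at larger noise
        for _ in range(steps_per_level):
            z = rng.standard_normal(x.shape)
            x = x + eps * score(x, sigma) + np.sqrt(2 * eps) * z
    return x

Each chain starts from the previous chain's endpoint, so samples are guided from heavily smoothed versions of the data distribution down to the original one.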

24 of 44

2nd Problem: Unknown data score

  • The Fisher divergence requires the ground-truth score $\nabla_x \log p(x)$, which we do not know for real data: we only have samples from $p(x)$.

25 of 44

Solution: Denoising Score Matching

  • Perturb each data point $x$ with Gaussian noise, $\tilde{x} = x + \sigma z$, and match the score of the perturbed distribution instead, minimizing

$$ \mathbb{E}_{x \sim p,\; \tilde{x} \sim q_\sigma(\tilde{x} \mid x)} \left[ \left\| s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) \right\|_2^2 \right] $$

This is tractable because the perturbation score is known in closed form (see the derivation below).
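The closed form of the perturbation score is a one-line computation for the Gaussian perturbation kernel assumed above:

$$ q_\sigma(\tilde{x} \mid x) = \mathcal{N}(\tilde{x};\, x,\, \sigma^2 I) \propto \exp\!\left( -\frac{\| \tilde{x} - x \|_2^2}{2 \sigma^2} \right) \;\;\Longrightarrow\;\; \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = -\frac{\tilde{x} - x}{\sigma^2} $$

So the regression target for the network is just the (scaled) noise that was added.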

26 of 44

Noise Conditional Score Networks (NCSN)

A single network $s_\theta(x, \sigma)$ is conditioned on the noise level $\sigma$ and trained jointly over all noise scales $\sigma_1 > \sigma_2 > \dots > \sigma_L$ with a weighted sum of denoising score matching losses (see the sketch below).
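A sketch of the NCSN training objective (the weighting $\lambda(\sigma) = \sigma^2$ matches the choice in Song & Ermon, 2019; the dummy model is a placeholder assumption standing in for a sigma-conditioned network):

import numpy as np

rng = np.random.default_rng(0)
sigmas = np.geomspace(10.0, 0.01, 10)    # decreasing noise levels

def model(x_noisy, sigma):
    """Placeholder for s_theta(x, sigma); here, the exact score when x0 ~ N(0, I)."""
    return -x_noisy / (1.0 + sigma ** 2)

def ncsn_loss(x_batch):
    """Weighted denoising score matching across noise levels."""
    loss = 0.0
    for sigma in sigmas:
        noise = rng.standard_normal(x_batch.shape)
        x_noisy = x_batch + sigma * noise
        target = -(x_noisy - x_batch) / sigma ** 2   # known perturbation score
        err = model(x_noisy, sigma) - target
        loss += sigma ** 2 * np.mean(np.sum(err ** 2, axis=1))
    return loss / len(sigmas)

print(ncsn_loss(rng.standard_normal((256, 2))))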

27 of 44

The Network

28 of 44

Forward and Backward Diffusion

Forward Diffusion Process

Backward Diffusion Process

29 of 44

Backward Diffusion Process

(Song et al., 2019; Ho et al., 2020; Song et al., 2021)

30 of 44

What about audio?

  • Audio is lagging behind:
    • Limited availability of data and copyright issues
    • Non-standardized benchmarks for comparisons
    • Ideas recycled from vision rather than developed natively for audio
    • Audio cannot be embedded in a paper
  • Peculiarities of the audio domain:
    • Multiple sources in the same mix: a compositionality problem
    • Very long contexts (GPT-4's maximum context is roughly 25,000 words, while a single second of high-quality audio is 44,100 samples)

31 of 44

Focus on music – Commercial interest

  • Music production (DAWs)
  • Mixing
  • Recommendation systems
  • Lyrics extraction
  • Immersive music

32 of 44

Data Representation

  • Symbolic
  • Continuous

33 of 44

Data Representation

  • Waveform domain:
    • 1D signal
    • Sampling rate:
      • High resolution: 44.1 kHz (CD format)
      • Deep learning models often use 16 kHz
    • Bit depth:
      • 16-bit: 2^16 (65,536) amplitude levels
      • 24-bit: 2^24 (16,777,216) amplitude levels
    • Common formats:
      • WAV (Waveform Audio File Format): no compression
      • MP3 (MPEG-1 Audio Layer III): compression; bitrate (number of bits transmitted per second) ranges from 32 to 320 kbps (see the worked example below)
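As a worked example, the numbers above determine the raw bitrate of CD audio and show why compression matters (the stereo channel count is an assumption):

# Raw bitrate of uncompressed CD audio (WAV): samples/s * bits/sample * channels
sampling_rate = 44_100      # Hz (CD format)
bit_depth = 16              # bits per sample
channels = 2                # stereo (assumption)

bitrate_bps = sampling_rate * bit_depth * channels
print(bitrate_bps)          # 1,411,200 bps, i.e. about 1411 kbps

# Compare with MP3's 32-320 kbps: even at its highest setting,
# MP3 uses roughly 4.4x fewer bits per second than raw CD audio.
print(bitrate_bps / (320 * 1000))   # ~4.41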

34 of 44

Multi-Source Diffusion Models

35 of 44

Core idea

  • Learn a single diffusion model over the joint distribution of the sources that compose a musical track, so that total generation, partial generation, and source separation all become inference procedures on the same trained score network.

36 of 44

Multi-Source Diffusion Models

37 of 44

Inference procedure: Total Generation

38 of 44

Inference procedure: Partial Generation

39 of 44

Inference procedure: Source Separation
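A rough sketch of the idea behind diffusion-based source separation under the mixture constraint (this is a generic constrained-sampling illustration, not the exact MSDM sampler; the score model interface, the schedule, and the equal-split projection step are all assumptions):

import numpy as np

def separate(score, mixture, n_sources, sigmas, eps0=2e-5, steps=100,
             rng=np.random.default_rng(0)):
    """Annealed Langevin sampling over source stems x_1..x_N, nudging them
    to stay consistent with the observed mixture y = sum_i x_i.

    score(x, sigma): joint score model over the stacked sources,
                     x of shape (n_sources, *mixture.shape).
    """
    x = np.stack([mixture / n_sources] * n_sources)   # start from an even split
    for sigma in sigmas:
        eps = eps0 * (sigma / sigmas[-1]) ** 2
        for _ in range(steps):
            z = rng.standard_normal(x.shape)
            x = x + eps * score(x, sigma) + np.sqrt(2 * eps) * z
            # Enforce sum_i x_i = mixture by spreading the residual
            # equally across sources (an illustrative projection choice).
            residual = mixture - x.sum(axis=0)
            x = x + residual / n_sources
    return x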

40 of 44

Quantitative metrics

41 of 44

42 of 44

Future work

  • Introduce text-conditioning for fine-grained control
  • Latent-space inference
  • Train on more realistic data by bootstrapping SOTA regressors (Demucs)
  • Train a symbolic version on MIDI data (which is more abundant)

43 of 44

44 of 44

Thank You!