1 of 44

Diffusion-based generative models for audio

GLADIA Research Group

Speakers: Michele Mancusi, Giorgio Mariani

2 of 44

Generative Models

3 of 44

Generative Models

  • Generative models learn the data distribution $p(x)$ from training samples and can then generate new, unseen data points from it.

4 of 44

Generative Models: Diffusion models

  • Diffusion models define a sequence of steps to slowly add random noise to data (forward process) and then learn to reverse it (backward process).
  • This makes it possible to obtain plausible data points starting from pure random noise (see the sketch below).
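As an illustration, here is a minimal sketch of a DDPM-style forward process (the variance-preserving parameterization and the linear noise schedule are common choices, assumed here rather than taken from the slides): each step mixes the data with Gaussian noise, and the closed form lets us jump to any step t directly.

import numpy as np

# Variance-preserving forward process (DDPM-style), assumed for illustration:
# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear noise schedule (assumption)
alpha_bars = np.cumprod(1.0 - betas)    # cumulative fraction of signal kept

def forward_diffuse(x0, t, rng=np.random.default_rng(0)):
    """Sample x_t ~ q(x_t | x_0) in closed form."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return xt, noise    # a diffusion model is trained to undo this noise

# By t = T-1, x_t is essentially pure Gaussian noise; the learned backward
# process inverts these steps one at a time to generate new data.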

5 of 44

Understanding Diffusion Models: Langevin Dynamics

6 of 44

Brownian motion

Pollen grains in water

Simulation of particles moving in water

https://water.lsbu.ac.uk/water/Brownian.html


7 of 44

Langevin Dynamics

A particle suspended in a fluid obeys the Langevin equation:

$$ m \frac{d^2 x}{dt^2} = -\nabla U(x) - \gamma \frac{dx}{dt} + \sqrt{2 \gamma k_B T}\, \eta(t) $$

where $U$ is the potential energy, $\gamma$ the friction coefficient, $T$ the temperature, and $\eta(t)$ white Gaussian noise.

For very tiny particles (few microns):

$$ \frac{dx}{dt} = -\frac{1}{\gamma} \nabla U(x) + \sqrt{\frac{2 k_B T}{\gamma}}\, \eta(t) $$

The inertial term becomes negligible: this is overdamped Langevin dynamics.
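A minimal simulation sketch of this overdamped dynamics (constants and step size are assumptions; with $U = 0$ it reduces to free Brownian motion):

import numpy as np

def simulate_overdamped(grad_U, x0, dt=1e-3, n_steps=10_000,
                        gamma=1.0, kBT=1.0, rng=np.random.default_rng(0)):
    """Euler integration of dx = -grad_U(x)/gamma dt + sqrt(2 kBT/gamma) dW."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x += -grad_U(x) / gamma * dt + np.sqrt(2 * kBT / gamma * dt) * noise
    return x

# Free Brownian motion: no potential, the particle just diffuses.
endpoint = simulate_overdamped(grad_U=lambda x: np.zeros_like(x), x0=[0.0, 0.0])
print(endpoint)   # a random-walk endpoint; spread grows like 2 * kBT/gamma * t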

8 of 44

Brownian motion


9 of 44

Brownian motion

10 of 44

Brownian motion


11 of 44

Fokker-Planck equation

The probability density $p(x, t)$ of a particle following the overdamped Langevin dynamics evolves according to the Fokker-Planck equation:

$$ \frac{\partial p(x,t)}{\partial t} = \nabla \cdot \left( \frac{\nabla U(x)}{\gamma}\, p(x,t) \right) + \frac{k_B T}{\gamma} \nabla^2 p(x,t) $$

12 of 44

Steady State of Langevin Dynamics

The steady state of the Fokker-Planck equation is the Boltzmann distribution $p_{ss}(x) \propto e^{-U(x)/k_B T}$. If we want to sample from a target distribution $p(x)$, we need to set the potential energy to be

$$ U(x) = -k_B T \log p(x) $$

Our equation becomes (in units where $k_B T = \gamma = 1$):

$$ \frac{dx}{dt} = \nabla_x \log p(x) + \sqrt{2}\, \eta(t) $$

Simulating this dynamics for long enough yields samples from $p(x)$.

13 of 44

Sampling Using Langevin Dynamics

We can use the Euler-Maruyama method to discretize the dynamics with step size $\epsilon$:

$$ x_{k+1} = x_k + \epsilon\, \nabla_x \log p(x_k) + \sqrt{2 \epsilon}\, z_k, \qquad z_k \sim \mathcal{N}(0, I) $$

As $\epsilon \to 0$ and the number of steps grows, $x_k$ converges to a sample from $p(x)$.
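A minimal sketch of this sampler on a toy 2D Gaussian, where the score $\nabla_x \log p(x)$ is known in closed form (the toy target, step size, and chain length are assumptions for illustration):

import numpy as np

def langevin_sample(score, x0, eps=1e-2, n_steps=1000,
                    rng=np.random.default_rng(0)):
    """Euler-Maruyama discretization of Langevin dynamics:
    x_{k+1} = x_k + eps * score(x_k) + sqrt(2 * eps) * z_k."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + eps * score(x) + np.sqrt(2 * eps) * z
    return x

# Toy target: N(mu, I) in 2D, whose score is simply -(x - mu).
mu = np.array([3.0, -1.0])
score = lambda x: -(x - mu)

samples = np.stack([langevin_sample(score, x0=np.zeros(2)) for _ in range(500)])
print(samples.mean(axis=0))   # approaches mu as eps -> 0 and n_steps -> inf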

14 of 44

Sampling Using Langevin Dynamics


Illustration from https://yang-song.net/blog/2021/score

15 of 44

Diffusion Models

16 of 44

Score function

  • Note that Langevin dynamics never uses $p(x)$ itself; it only needs the gradient of the log-density,

$$ s(x) = \nabla_x \log p(x) $$

Score function!

17 of 44

Score-based model

We need to train a model $s_\theta(x)$ to estimate the score function $\nabla_x \log p(x)$

by minimizing this loss (a.k.a. Fisher divergence):

$$ \mathbb{E}_{p(x)} \left[ \left\| \nabla_x \log p(x) - s_\theta(x) \right\|_2^2 \right] $$
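A sketch of what this loss looks like when the true score happens to be available analytically (the standard-Gaussian toy target and the linear model are assumptions for illustration; the next slides explain why the true score is unavailable for real data):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((10_000, 2))    # samples from p = N(0, I)

true_score = lambda x: -x               # score of N(0, I), known in closed form

def model_score(x, W):
    """A toy linear score model s_theta(x) = x @ W.T."""
    return x @ W.T

W = -0.5 * np.eye(2)                    # some (bad) initial parameters
fisher_div = np.mean(np.sum((true_score(x) - model_score(x, W)) ** 2, axis=1))
print(fisher_div)   # ~0.5 here; exactly 0 when W = -I, i.e. s_theta = true score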

18 of 44

The ideal workflow…

19 of 44

Two important issues

  • Score estimation is inaccurate in regions of low data density.
  • The ground-truth data score $\nabla_x \log p(x)$ is unknown, so the Fisher divergence cannot be computed directly.

20 of 44

1st Problem: Low-density region

How can we bypass the difficulty of accurate score estimation in regions of low data density? 

21 of 44

Solution: add noise to the data

22 of 44

How much noise?

  • Larger noise can cover more low-density regions for better score estimation, but it over-corrupts the data.
  • Smaller noise causes less corruption of the data, but does not cover the low-density regions.

Solution: Multiple Levels!

23 of 44

Annealed Langevin Dynamics

Annealed Langevin dynamics combines a sequence of Langevin chains with gradually decreasing noise scales (see the sketch below).
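A minimal sketch of annealed Langevin sampling (the geometric noise schedule and the step-size rule proportional to sigma^2 follow the common NCSN recipe and are assumptions here; score(x, sigma) is a noise-conditional model as on the following slides):

import numpy as np

def annealed_langevin(score, x0, sigmas, eps0=2e-5, steps_per_level=100,
                      rng=np.random.default_rng(0)):
    """Run one Langevin chain per noise level, from large sigma to small.

    score(x, sigma): estimate of grad_x log p_sigma(x) at noise level sigma.
    sigmas: decreasing noise scales, e.g. np.geomspace(10.0, 0.01, 10).
    """
    x = np.array(x0, dtype=float)
    for sigma in sigmas:
        eps = eps0 * (sigma / sigmas[-1]) ** 2    # larger steps at larger noise
        for _ in range(steps_per_level):
            z = rng.standard_normal(x.shape)
            x = x + eps * score(x, sigma) + np.sqrt(2 * eps) * z
    return x

Each chain starts from the previous chain's endpoint, so samples are guided from heavily smoothed versions of the data distribution down to the original one.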

24 of 44

2nd Problem: Unknown data score

  • The Fisher divergence requires the ground-truth score $\nabla_x \log p(x)$, which we do not know for real data: we only have samples from $p(x)$.

25 of 44

Solution: Denoising Score Matching

  • Perturb each data point $x$ with Gaussian noise, $\tilde{x} = x + \sigma z$, and match the score of the perturbed distribution instead, minimizing

$$ \mathbb{E}_{x \sim p,\; \tilde{x} \sim q_\sigma(\tilde{x} \mid x)} \left[ \left\| s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) \right\|_2^2 \right] $$

This is tractable because the perturbation score is known in closed form (see the derivation below).
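The closed form of the perturbation score is a one-line computation for the Gaussian perturbation kernel assumed above:

$$ q_\sigma(\tilde{x} \mid x) = \mathcal{N}(\tilde{x};\, x,\, \sigma^2 I) \propto \exp\!\left( -\frac{\| \tilde{x} - x \|_2^2}{2 \sigma^2} \right) \;\;\Longrightarrow\;\; \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = -\frac{\tilde{x} - x}{\sigma^2} $$

So the regression target for the network is just the (scaled) noise that was added.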

26 of 44

Noise Conditional Score Networks (NCSN)

A single network $s_\theta(x, \sigma)$ is conditioned on the noise level $\sigma$ and trained jointly over all noise scales $\sigma_1 > \sigma_2 > \dots > \sigma_L$ with a weighted sum of denoising score matching losses (see the sketch below).
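A sketch of the NCSN training objective (the weighting $\lambda(\sigma) = \sigma^2$ matches the choice in Song & Ermon, 2019; the dummy model is a placeholder assumption standing in for a sigma-conditioned network):

import numpy as np

rng = np.random.default_rng(0)
sigmas = np.geomspace(10.0, 0.01, 10)    # decreasing noise levels

def model(x_noisy, sigma):
    """Placeholder for s_theta(x, sigma); here, the exact score when x0 ~ N(0, I)."""
    return -x_noisy / (1.0 + sigma ** 2)

def ncsn_loss(x_batch):
    """Weighted denoising score matching across noise levels."""
    loss = 0.0
    for sigma in sigmas:
        noise = rng.standard_normal(x_batch.shape)
        x_noisy = x_batch + sigma * noise
        target = -(x_noisy - x_batch) / sigma ** 2   # known perturbation score
        err = model(x_noisy, sigma) - target
        loss += sigma ** 2 * np.mean(np.sum(err ** 2, axis=1))
    return loss / len(sigmas)

print(ncsn_loss(rng.standard_normal((256, 2))))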

27 of 44

The Network

28 of 44

Forward and Backward Diffusion

Forward Diffusion Process

Backward Diffusion Process

29 of 44

Backward Diffusion Process

(Song et al., 2019; Ho et al., 2020; Song et al., 2021)

30 of 44

What about audio?

  • Audio is lagging behind:
    • Limited availability of data and copyright issues
    • Non-standardized benchmarks for comparisons
    • Ideas recycled from vision rather than developed natively for audio
    • Audio cannot be embedded in a paper
  • Peculiarities of the audio domain:
    • Multiple sources in the same mix: a compositionality problem
    • Very long contexts (GPT-4's maximum context is roughly 25,000 words, while a single second of high-quality audio is 44,100 samples)

31 of 44

Focus on music – Commercial interest

  • Music production (DAWs)
  • Mixing
  • Recommendation systems
  • Lyrics extraction
  • Immersive music

32 of 44

Data Representation

  • Symbolic
  • Continuous

33 of 44

Data Representation

  • Waveform domain:
    • 1D signal
    • Sampling rate:
      • High resolution: 44.1 kHz (CD format)
      • Deep learning models often use 16 kHz
    • Bit depth:
      • 16-bit: 2^16 (65,536) amplitude levels
      • 24-bit: 2^24 (16,777,216) amplitude levels
    • Common formats:
      • WAV (Waveform Audio File Format): no compression
      • MP3 (MPEG-1 Audio Layer III): compression; bitrate (number of bits transmitted per second) ranges from 32 to 320 kbps (see the worked example below)
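As a worked example, the numbers above determine the raw bitrate of CD audio and show why compression matters (the stereo channel count is an assumption):

# Raw bitrate of uncompressed CD audio (WAV): samples/s * bits/sample * channels
sampling_rate = 44_100      # Hz (CD format)
bit_depth = 16              # bits per sample
channels = 2                # stereo (assumption)

bitrate_bps = sampling_rate * bit_depth * channels
print(bitrate_bps)          # 1,411,200 bps, i.e. about 1411 kbps

# Compare with MP3's 32-320 kbps: even at its highest setting,
# MP3 uses roughly 4.4x fewer bits per second than raw CD audio.
print(bitrate_bps / (320 * 1000))   # ~4.41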

34 of 44

Multi-Source Diffusion Models

35 of 44

Core idea

  • Learn a single diffusion model over the joint distribution of the sources that compose a musical track, so that total generation, partial generation, and source separation all become inference procedures on the same trained score network.

36 of 44

Multi-Source Diffusion Models

37 of 44

Inference procedure: Total Generation

38 of 44

Inference procedure: Partial Generation

39 of 44

Inference procedure: Source Separation
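A rough sketch of the idea behind diffusion-based source separation under the mixture constraint (this is a generic constrained-sampling illustration, not the exact MSDM sampler; the score model interface, the schedule, and the equal-split projection step are all assumptions):

import numpy as np

def separate(score, mixture, n_sources, sigmas, eps0=2e-5, steps=100,
             rng=np.random.default_rng(0)):
    """Annealed Langevin sampling over source stems x_1..x_N, nudging them
    to stay consistent with the observed mixture y = sum_i x_i.

    score(x, sigma): joint score model over the stacked sources,
                     x of shape (n_sources, *mixture.shape).
    """
    x = np.stack([mixture / n_sources] * n_sources)   # start from an even split
    for sigma in sigmas:
        eps = eps0 * (sigma / sigmas[-1]) ** 2
        for _ in range(steps):
            z = rng.standard_normal(x.shape)
            x = x + eps * score(x, sigma) + np.sqrt(2 * eps) * z
            # Enforce sum_i x_i = mixture by spreading the residual
            # equally across sources (an illustrative projection choice).
            residual = mixture - x.sum(axis=0)
            x = x + residual / n_sources
    return x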

40 of 44

Quantitative metrics

41 of 44

42 of 44

Future work

  • Introduce text-conditioning for fine-grained control
  • Latent-space inference
  • Train on more realistic data by bootstrapping SOTA regressors (Demucs)
  • Train a symbolic version on MIDI data (which is more abundant)

43 of 44

44 of 44

Thank You!