1 of 37

Harmonai • 27 September 2022

Style transfer of audio effects with differentiable signal processing

Christian J. Steinmetz1,2

@csteinmetz1

Nick J. Bryan2

Joshua D. Reiss1

1Queen Mary University of London

2Adobe Research

2 of 37

Christian Steinmetz

Queen Mary University of London

PhD in Artificial Intelligence and Music

Universitat Pompeu Fabra

Master in Sound and Music Computing

Clemson University

B.S. in Electrical Engineering

B.A. in Audio Technology

Minor in Mathematical Sciences

mixing / mastering / production

3 of 37

More people are creating audio content

Music

Podcasts

Short-form content

Sound for Video

4 of 37

Producing high-quality audio requires expertise

Demand for high-quality audio

5 of 37

6 of 37

Deep learning for audio processing

The current paradigm

Neural network

Source separation

Speech enhancement

Audio effect modeling

Stöter et al., 2019, "Open-Unmix - A reference implementation for music source separation." JOSS

Pascual et al., 2017 "SEGAN: Speech enhancement generative adversarial network." arXiv:1703.09452

Martínez Ramírez et al., 2020, "Deep learning for black-box modeling of audio effects." Applied Sciences

Audio In

Audio Out

7 of 37

Audio engineers solve problems with DSP

Controlling audio effects

Modeling acoustic spaces

Creating a mix

8 of 37

Building models that control DSP

Neural network → Control parameters → Signal processing

1. How to integrate DSP with neural nets?

2. How to convey user intention?

9 of 37

Differentiable signal processing

Backprop through DSP operations

  • Leveraging existing DSP tools and knowledge
  • High quality audio processing with few artifacts
  • Human understandable outputs that can be adjusted
  • Efficient and can easily run in real-time on CPU
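As a minimal sketch of what "backprop through DSP operations" means in practice (a hypothetical gain example, not code from the paper): the effect is written with ordinary tensor ops, so its human-readable parameter can be optimized directly against an audio-domain loss.

```python
import torch

# A trivially simple "effect": broadband gain, specified in dB.
# Because it is written with ordinary tensor ops, autograd can compute
# d(loss)/d(gain_db) and we can optimize the parameter directly.
def apply_gain(x: torch.Tensor, gain_db: torch.Tensor) -> torch.Tensor:
    return x * (10.0 ** (gain_db / 20.0))

x = torch.randn(1, 44100)                        # 1 s of audio at 44.1 kHz
target = apply_gain(x, torch.tensor(-6.0))       # reference processed at -6 dB

gain_db = torch.zeros(1, requires_grad=True)     # human-readable parameter
opt = torch.optim.Adam([gain_db], lr=0.1)

for _ in range(200):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(apply_gain(x, gain_db), target)
    loss.backward()                              # gradient flows through the DSP op
    opt.step()

print(gain_db.item())                            # should approach -6 dB
```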

10 of 37

Conveying intention

Traditional control parameters

Text-based prompt

By example (style transfer)

“Make my guitar sound bright and shiny”

11 of 37

Style transfer of audio effects

12 of 37

13 of 37

14 of 37

Audio production as a three-stage process

1. Listen: Perform an acoustic analysis of the input recording

2. Plan: Establish an acoustic goal (style) considering the context

3. Execute: Manipulate DSP controls to achieve this goal

15 of 37

Learning audio production by example

16 of 37

1 Automatic differentiation

Explicitly define signal processing operations in autodiff framework

Engel, Jesse, et al. "DDSP: Differentiable digital signal processing." ICLR (2021).
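A hedged illustration of the DDSP idea of defining effects directly with autodiff primitives: a highly simplified static compressor curve written in PyTorch. The differentiable compressor used in the paper also models attack/release ballistics, so treat this purely as a sketch.

```python
import torch

def simple_compressor(x, threshold_db, ratio, makeup_db, eps=1e-8):
    """Highly simplified static compression curve (no attack/release),
    written with differentiable ops so gradients reach its parameters."""
    level_db = 20.0 * torch.log10(torch.abs(x) + eps)
    over = torch.relu(level_db - threshold_db)          # dB above threshold
    gain_db = -over * (1.0 - 1.0 / ratio) + makeup_db   # static gain curve
    return x * (10.0 ** (gain_db / 20.0))

x = torch.randn(1, 44100) * 0.1
params = torch.tensor([-20.0, 4.0, 3.0], requires_grad=True)  # thr, ratio, makeup
y = simple_compressor(x, params[0], params[1], params[2])
y.sum().backward()        # gradients w.r.t. threshold, ratio, and makeup gain
print(params.grad)
```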

17 of 37

2 Neural proxy

(1) Pretraining

Frozen DSP neural proxy

(2) Training

(3) Inference

Steinmetz, Christian J., et al. "Automatic multitrack mixing with a differentiable mixing console of neural audio effects." ICASSP, 2021.
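Roughly, the neural proxy recipe in code (hypothetical shapes and a toy stand-in effect, not the paper's training script): pretrain a small conditional network to imitate the black-box effect, then freeze it so gradients can flow through it to a controller.

```python
import torch
import torch.nn as nn

# Stand-in for a non-differentiable black-box effect with one control (drive).
def black_box_effect(x, drive):
    return torch.tanh(drive.view(-1, 1, 1) * x)

# (1) Pretraining: a small network learns to imitate the effect, conditioned
# on the control parameter (here concatenated as an extra input channel).
proxy = nn.Sequential(
    nn.Conv1d(2, 16, 15, padding=7), nn.ReLU(),
    nn.Conv1d(16, 1, 15, padding=7),
)
opt = torch.optim.Adam(proxy.parameters(), lr=1e-3)
for _ in range(200):
    x = torch.randn(8, 1, 4096)
    drive = torch.rand(8) * 4.0 + 1.0
    cond = drive.view(-1, 1, 1).expand(-1, 1, x.shape[-1])
    loss = nn.functional.l1_loss(proxy(torch.cat([x, cond], dim=1)),
                                 black_box_effect(x, drive))
    opt.zero_grad(); loss.backward(); opt.step()

# (2) Training: freeze the proxy; a controller that predicts `drive` can now
# receive gradients from an audio-domain loss routed through the proxy.
for p in proxy.parameters():
    p.requires_grad_(False)
```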

18 of 37

3 Neural proxy hybrid

(2) Training

(3) Inference: use the original DSP in place of the proxy
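One way to realize "use the original DSP at inference" is a straight-through-style combination: the forward pass carries the true DSP output while gradients follow the proxy. This is an illustrative sketch; the half- and full-hybrid variants in the paper differ in detail.

```python
import torch

# Straight-through-style hybrid: the forward value is the real (non-differentiable)
# DSP output, while the backward pass uses the gradient of the differentiable proxy.
def hybrid_process(x, cond, proxy, real_dsp):
    y_proxy = proxy(x, cond)            # differentiable path
    with torch.no_grad():
        y_real = real_dsp(x, cond)      # true DSP output, no gradients
    return y_proxy + (y_real - y_proxy).detach()
```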

19 of 37

4 Gradient approximation

Simultaneous perturbation stochastic approximation (SPSA)

Finite differences (FD)

Martínez Ramírez, Marco A., et al. "Differentiable signal processing with black-box audio effects." ICASSP, 2021.
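A sketch of SPSA wrapped as a custom autograd Function, so a non-differentiable effect can sit inside a PyTorch graph; the perturbation size and parameter handling here are illustrative rather than the exact estimator from the cited work.

```python
import torch

class SPSAEffect(torch.autograd.Function):
    """Wrap a black-box effect; estimate parameter gradients with SPSA."""

    @staticmethod
    def forward(ctx, x, params, effect, eps):
        ctx.save_for_backward(x, params)
        ctx.effect, ctx.eps = effect, eps
        with torch.no_grad():
            return effect(x, params)

    @staticmethod
    def backward(ctx, grad_output):
        x, params = ctx.saved_tensors
        effect, eps = ctx.effect, ctx.eps
        # Single random +/-1 perturbation of all parameters simultaneously.
        delta = (torch.randint(0, 2, params.shape) * 2 - 1).to(params.dtype)
        with torch.no_grad():
            y_plus = effect(x, params + eps * delta)
            y_minus = effect(x, params - eps * delta)
        # Directional derivative of the loss along delta, then per-parameter.
        directional = (grad_output * (y_plus - y_minus)).sum() / (2.0 * eps)
        grad_params = directional / delta
        return None, grad_params, None, None   # no gradient for x, effect, eps

# Usage: y = SPSAEffect.apply(audio, params, my_black_box_effect, 1e-3)
```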

20 of 37

RECAP: Differentiable signal processing

  1. Automatic differentiation
  2. Neural proxy
  3. Neural proxy hybrid
  4. Gradient approximation

No existing comparison of these approaches in a unified setup.

21 of 37

Training details

Models
  • RB-DSP: Rule-based DSP
  • cTCN: Conditional TCN
  • NP: Neural Proxy
  • NP-HH: Neural Proxy Half-hybrid
  • NP-FH: Neural Proxy Full-hybrid
  • SPSA: Gradient approximation
  • AD: Automatic differentiation

Audio domain loss
  • Multi-resolution STFT

Training Datasets
  • Speech (LibriTTS)
  • Music (MTG-Jamendo)

Effects
  • 6-band parametric EQ
  • Dynamic range compressor
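The multi-resolution STFT audio-domain loss listed above can be written compactly; the sketch below is simplified (log-magnitude L1 only), and packaged implementations such as auraloss's MultiResolutionSTFTLoss exist as an alternative.

```python
import torch

def multi_resolution_stft_loss(pred, target,
                               fft_sizes=(512, 1024, 2048),
                               eps=1e-8):
    """Simplified multi-resolution STFT loss: L1 distance between
    log-magnitude spectrograms at several analysis resolutions."""
    loss = 0.0
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=pred.device)
        P = torch.stft(pred, n_fft, hop_length=n_fft // 4,
                       window=window, return_complex=True).abs()
        T = torch.stft(target, n_fft, hop_length=n_fft // 4,
                       window=window, return_complex=True).abs()
        loss = loss + torch.nn.functional.l1_loss(torch.log(P + eps),
                                                  torch.log(T + eps))
    return loss / len(fft_sizes)
```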

22 of 37

Evaluation metrics

General similarity (full reference)
  • PESQ: Perceptual evaluation of speech quality
  • STFT: Multi-resolution STFT error

Spectral balance (EQ) (high-level features)
  • MSD: Large-window log-mel spectrogram error
  • SCE: Spectral centroid error

Dynamics (Compression) (high-level features)
  • RMS: Root mean square energy error
  • LUFS: Perceptual loudness error
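A sketch of how the high-level features behind SCE, RMS, and LUFS might be computed (window and weighting choices here are illustrative; pyloudnorm is one option for BS.1770 loudness):

```python
import numpy as np
import pyloudnorm as pyln

def spectral_centroid(x, sr, n_fft=2048):
    mag = np.abs(np.fft.rfft(x, n=n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    return np.sum(freqs * mag) / (np.sum(mag) + 1e-8)

def rms_db(x):
    return 20.0 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-8)

def style_metrics(pred, ref, sr=44100):
    meter = pyln.Meter(sr)   # ITU-R BS.1770 loudness meter
    return {
        "SCE":  abs(spectral_centroid(pred, sr) - spectral_centroid(ref, sr)),
        "RMS":  abs(rms_db(pred) - rms_db(ref)),
        "LUFS": abs(meter.integrated_loudness(pred)
                    - meter.integrated_loudness(ref)),
    }
```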

23 of 37

Synthetic audio production style transfer

  1. Rule-based DSP baseline outperformed by learned approaches
  2. Neural proxy hybrid approaches do not perform well
  3. Gradient approximation performs second best but struggles with instability
  4. Automatic differentiation performs best overall, although the differentiable implementations only approximate the original effects

24 of 37

Building a production style dataset/task

Styles are defined by distributions in the parameter space of the parametric EQ and dynamic range compressor.

Clean audio → EQ → DRC → Style dataset
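One way to read "styles as distributions in parameter space" in code (style names and ranges below are invented for illustration): each style is a sampler over EQ and compressor settings that is applied to clean audio to build the dataset.

```python
import random

# Hypothetical style definitions: each style is a distribution over
# EQ / compressor parameters rather than a single fixed preset.
STYLES = {
    "bright":    {"high_shelf_gain_db": (4.0, 10.0), "ratio": (1.5, 3.0)},
    "broadcast": {"high_shelf_gain_db": (0.0, 3.0),  "ratio": (4.0, 8.0)},
}

def sample_style_params(style: str) -> dict:
    """Draw one concrete parameter setting from a style's distribution."""
    return {name: random.uniform(lo, hi)
            for name, (lo, hi) in STYLES[style].items()}

# e.g. sample_style_params("bright") -> {'high_shelf_gain_db': 7.2, 'ratio': 2.1}
```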

25 of 37

Realistic audio production style transfer

26 of 37

Learning audio production representations

Frozen pretrained encoder

Linear classifier
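This probing setup is the standard linear-probe recipe; a sketch with a hypothetical encoder and embedding size, since only the frozen-encoder-plus-linear-classifier structure matters here.

```python
import torch
import torch.nn as nn

# Stand-in for the pretrained encoder from the style-transfer system
# (hypothetical architecture; only its frozen role matters for the probe).
encoder = nn.Sequential(nn.Conv1d(1, 32, 1024, stride=512), nn.ReLU(),
                        nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                        nn.Linear(32, 128))
for p in encoder.parameters():          # freeze: only the probe is trained
    p.requires_grad_(False)

num_styles = 5
probe = nn.Linear(128, num_styles)      # linear classifier on embeddings
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

def probe_step(audio, labels):
    with torch.no_grad():
        emb = encoder(audio)            # (batch, 128) frozen embeddings
    loss = nn.functional.cross_entropy(probe(emb), labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# probe_step(torch.randn(8, 1, 44100), torch.randint(0, num_styles, (8,)))
```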

27 of 37

Future directions

  1. Extend this approach with more differentiable effects (e.g., reverb, distortion)
  2. Improved methods for training neural proxy (and hybrids)
  3. Methods for handling dynamic construction of the processing chain
  4. Adapt this approach for multichannel use cases (e.g. multitrack mixing)
  5. Zero-shot adaptation to a new set of audio effects (can I use the plugins in my DAW?)

28 of 37

Resources

Ready-to-go differentiable EQ and compressor

29 of 37

Differentiable IIR filters
Colonel and Steinmetz et al., 2022 "Direct design of biquad filter cascades with deep learning by sampling random polynomials." IEEE ICASSP

Differentiable reverberation
Steinmetz et al., 2021 "Filtered noise shaping for time domain room impulse response estimation from reverberant speech." IEEE WASPAA (Best Student Paper Award)

Differentiable EQ and Compression
Steinmetz et al., 2022 "Style transfer of audio effects with differentiable signal processing." Journal of the Audio Engineering Society

30 of 37

Efficient neural audio effects
Steinmetz and Reiss, 2022 "Efficient neural networks for real-time modeling of analog dynamic range compression." 152nd AES Convention

Randomized neural networks
Steinmetz and Reiss, 2020 "Randomized overdrive neural networks." NeurIPS 4th Workshop on Machine Learning for Creativity and Design

Steerable discovery
Steinmetz and Reiss, 2021 "Steerable discovery of neural audio effects." NeurIPS 5th Workshop on Machine Learning for Creativity and Design

31 of 37

Extra content

32 of 37

Experiments

  1. Synthetic production style transfer (matching input and reference)
  2. Realistic production style transfer (non-matching input and reference)
  3. Audio production representations (audio production style classification)
  4. Computational complexity

33 of 37

Audio production style transfer

Synthetic (training): matching input and reference; the system's prediction is scored with full-reference metrics.

Realistic (evaluation): non-matching input and reference; the prediction is scored with high-level metrics.

34 of 37

Automatic differentiation audio effects

This can be approximated with a FIR (frequency-domain) filter

Estimate the IIR filter response with the DFT and apply it as a frequency-domain FIR filter

Nercessian, Shahan. "Neural parametric equalizer matching using differentiable biquads." Proc. Int. Conf. Digital Audio Effects (eDAFx-20). 2020.
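A sketch of that frequency-sampling idea: evaluate the biquad response on the rFFT grid with differentiable ops and apply it multiplicatively in the frequency domain. Circular-convolution edge effects are ignored for brevity, so this only approximates the true IIR filter.

```python
import math
import torch

def biquad_freq_response(b, a, n_bins):
    """Evaluate H(e^{jw}) = B(e^{jw}) / A(e^{jw}) on n_bins rFFT frequencies."""
    w = torch.linspace(0.0, math.pi, n_bins)
    z = torch.exp(torch.complex(torch.zeros_like(w), -w))   # e^{-jw}
    B = b[0] + b[1] * z + b[2] * z ** 2
    A = a[0] + a[1] * z + a[2] * z ** 2
    return B / A

def apply_as_fir(x, b, a):
    """Apply an IIR biquad approximately, as a frequency-domain FIR filter."""
    n = x.shape[-1]
    X = torch.fft.rfft(x, n=n)
    H = biquad_freq_response(b, a, X.shape[-1])
    return torch.fft.irfft(X * H, n=n)       # gradients reach b and a

b = torch.tensor([1.0, -1.8, 0.81], requires_grad=True)
a = torch.tensor([1.0, -1.6, 0.64], requires_grad=True)
y = apply_as_fir(torch.randn(44100), b, a)
y.pow(2).mean().backward()                   # d(loss)/d(coefficients) exists
```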

35 of 37

Contributions

  1. The first audio effects style transfer method to integrate audio effects as differentiable operators, optimized end-to-end with an audio-domain loss

  2. Self-supervised training that enables automatic audio production without labeled or paired training data

  3. A benchmark of differentiation strategies for audio effects, including compute cost, engineering difficulty, and performance

  4. Novel neural proxy hybrid methods and a differentiable dynamic range compressor

36 of 37

Synthetic audio production style transfer

out-of-domain datasets

37 of 37

Computational complexity