1 of 37

Harmonai • 27 September 2022

Style transfer of audio effects with differentiable signal processing

Christian J. Steinmetz1,2

@csteinmetz1

Nick J. Bryan2

Joshua D. Reiss1

1Queen Mary University of London

2Adobe Research

2 of 37

Christian Steinmetz

Queen Mary University of London

PhD in Artificial Intelligence and Music

Universitat Pompeu Fabra

Master in Sound and Music Computing

Clemson University

B.S. in Electrical Engineering

B.A. in Audio Technology

Minor in Mathematical Sciences

mixing / mastering / production

3 of 37

More people are creating audio content

Music

Podcasts

Short-form content

Sound for Video

4 of 37

Producing high-quality audio requires expertise

Demand for high-quality audio

5 of 37

6 of 37

Deep learning for audio processing

The current paradigm

Neural network

Source separation

Speech enhancement

Audio effect modeling

Stöter et al., 2019, "Open-Unmix - A reference implementation for music source separation." JOSS

Pascual et al., 2017 "SEGAN: Speech enhancement generative adversarial network." arXiv:1703.09452

Martínez Ramírez et al., 2020, "Deep learning for black-box modeling of audio effects." Applied Sciences

Audio In

Audio Out

7 of 37

Audio engineers solve problems with DSP

Controlling audio effects

Modeling acoustic spaces

Creating a mix

8 of 37

Building models that control DSP

Neural network → Control parameters → Signal processing

1. How to integrate DSP with neural nets?

2. How to convey user intention?

9 of 37

Differentiable signal processing

Backprop through DSP operations

  • Leveraging existing DSP tools and knowledge
  • High quality audio processing with few artifacts
  • Human understandable outputs that can be adjusted
  • Efficient and can easily run in real-time on CPU
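As a minimal sketch of what "backprop through DSP operations" means in practice (a hypothetical gain example, not code from the paper): the effect is written with ordinary tensor ops, so its human-readable parameter can be optimized directly against an audio-domain loss.

```python
import torch

# A trivially simple "effect": broadband gain, specified in dB.
# Because it is written with ordinary tensor ops, autograd can compute
# d(loss)/d(gain_db) and we can optimize the parameter directly.
def apply_gain(x: torch.Tensor, gain_db: torch.Tensor) -> torch.Tensor:
    return x * (10.0 ** (gain_db / 20.0))

x = torch.randn(1, 44100)                        # 1 s of audio at 44.1 kHz
target = apply_gain(x, torch.tensor(-6.0))       # reference processed at -6 dB

gain_db = torch.zeros(1, requires_grad=True)     # human-readable parameter
opt = torch.optim.Adam([gain_db], lr=0.1)

for _ in range(200):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(apply_gain(x, gain_db), target)
    loss.backward()                              # gradient flows through the DSP op
    opt.step()

print(gain_db.item())                            # should approach -6 dB
```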

10 of 37

Conveying intention

Traditional control parameters

Text-based prompt

By example (style transfer)

“Make my guitar sound bright and shiny”

11 of 37

Style transfer of audio effects

12 of 37

13 of 37

14 of 37

Audio production as a three-stage process

1. Listen: Perform an acoustic analysis of the input recording

2. Plan: Establish an acoustic goal (style) considering the context

3. Execute: Manipulate DSP controls to achieve this goal

15 of 37

Learning audio production by example

16 of 37

1 Automatic differentiation

Explicitly define signal processing operations in autodiff framework

Engel, Jesse, et al. "DDSP: Differentiable digital signal processing." ICLR (2021).
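A hedged illustration of the DDSP idea of defining effects directly with autodiff primitives: a highly simplified static compressor curve written in PyTorch. The differentiable compressor used in the paper also models attack/release ballistics, so treat this purely as a sketch.

```python
import torch

def simple_compressor(x, threshold_db, ratio, makeup_db, eps=1e-8):
    """Highly simplified static compression curve (no attack/release),
    written with differentiable ops so gradients reach its parameters."""
    level_db = 20.0 * torch.log10(torch.abs(x) + eps)
    over = torch.relu(level_db - threshold_db)          # dB above threshold
    gain_db = -over * (1.0 - 1.0 / ratio) + makeup_db   # static gain curve
    return x * (10.0 ** (gain_db / 20.0))

x = torch.randn(1, 44100) * 0.1
params = torch.tensor([-20.0, 4.0, 3.0], requires_grad=True)  # thr, ratio, makeup
y = simple_compressor(x, params[0], params[1], params[2])
y.sum().backward()        # gradients w.r.t. threshold, ratio, and makeup gain
print(params.grad)
```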

17 of 37

2 Neural proxy

(1) Pretraining

Frozen DSP neural proxy

(2) Training

(3) Inference

Steinmetz, Christian J., et al. "Automatic multitrack mixing with a differentiable mixing console of neural audio effects." ICASSP, 2021.
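Roughly, the neural proxy recipe in code (hypothetical shapes and a toy stand-in effect, not the paper's training script): pretrain a small conditional network to imitate the black-box effect, then freeze it so gradients can flow through it to a controller.

```python
import torch
import torch.nn as nn

# Stand-in for a non-differentiable black-box effect with one control (drive).
def black_box_effect(x, drive):
    return torch.tanh(drive.view(-1, 1, 1) * x)

# (1) Pretraining: a small network learns to imitate the effect, conditioned
# on the control parameter (here concatenated as an extra input channel).
proxy = nn.Sequential(
    nn.Conv1d(2, 16, 15, padding=7), nn.ReLU(),
    nn.Conv1d(16, 1, 15, padding=7),
)
opt = torch.optim.Adam(proxy.parameters(), lr=1e-3)
for _ in range(200):
    x = torch.randn(8, 1, 4096)
    drive = torch.rand(8) * 4.0 + 1.0
    cond = drive.view(-1, 1, 1).expand(-1, 1, x.shape[-1])
    loss = nn.functional.l1_loss(proxy(torch.cat([x, cond], dim=1)),
                                 black_box_effect(x, drive))
    opt.zero_grad(); loss.backward(); opt.step()

# (2) Training: freeze the proxy; a controller that predicts `drive` can now
# receive gradients from an audio-domain loss routed through the proxy.
for p in proxy.parameters():
    p.requires_grad_(False)
```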

18 of 37

3 Neural proxy hybrid

(2) Training

(3) Inference: use the original DSP in place of the proxy
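One way to realize "use the original DSP at inference" is a straight-through-style combination: the forward pass carries the true DSP output while gradients follow the proxy. This is an illustrative sketch; the half- and full-hybrid variants in the paper differ in detail.

```python
import torch

# Straight-through-style hybrid: the forward value is the real (non-differentiable)
# DSP output, while the backward pass uses the gradient of the differentiable proxy.
def hybrid_process(x, cond, proxy, real_dsp):
    y_proxy = proxy(x, cond)            # differentiable path
    with torch.no_grad():
        y_real = real_dsp(x, cond)      # true DSP output, no gradients
    return y_proxy + (y_real - y_proxy).detach()
```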

19 of 37

4 Gradient approximation

Simultaneous perturbation stochastic approximation (SPSA)

Finite differences (FD)

Martínez Ramírez, Marco A., et al. "Differentiable signal processing with black-box audio effects." ICASSP, 2021.
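A sketch of SPSA wrapped as a custom autograd Function, so a non-differentiable effect can sit inside a PyTorch graph; the perturbation size and parameter handling here are illustrative rather than the exact estimator from the cited work.

```python
import torch

class SPSAEffect(torch.autograd.Function):
    """Wrap a black-box effect; estimate parameter gradients with SPSA."""

    @staticmethod
    def forward(ctx, x, params, effect, eps):
        ctx.save_for_backward(x, params)
        ctx.effect, ctx.eps = effect, eps
        with torch.no_grad():
            return effect(x, params)

    @staticmethod
    def backward(ctx, grad_output):
        x, params = ctx.saved_tensors
        effect, eps = ctx.effect, ctx.eps
        # Single random +/-1 perturbation of all parameters simultaneously.
        delta = (torch.randint(0, 2, params.shape) * 2 - 1).to(params.dtype)
        with torch.no_grad():
            y_plus = effect(x, params + eps * delta)
            y_minus = effect(x, params - eps * delta)
        # Directional derivative of the loss along delta, then per-parameter.
        directional = (grad_output * (y_plus - y_minus)).sum() / (2.0 * eps)
        grad_params = directional / delta
        return None, grad_params, None, None   # no gradient for x, effect, eps

# Usage: y = SPSAEffect.apply(audio, params, my_black_box_effect, 1e-3)
```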

20 of 37

RECAP: Differentiable signal processing

  1. Automatic differentiation
  2. Neural proxy
  3. Neural proxy hybrid
  4. Gradient approximation

No existing comparison of these approaches in a unified setup.

21 of 37

Training details

Models
  • RB-DSP: Rule-based DSP
  • cTCN: Conditional TCN
  • NP: Neural Proxy
  • NP-HH: Neural Proxy Half-hybrid
  • NP-FH: Neural Proxy Full-hybrid
  • SPSA: Gradient approximation
  • AD: Automatic differentiation

Audio domain loss
  • Multi-resolution STFT

Training Datasets
  • Speech (LibriTTS)
  • Music (MTG-Jamendo)

Effects
  • 6-band parametric EQ
  • Dynamic range compressor
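The multi-resolution STFT audio-domain loss listed above can be written compactly; the sketch below is simplified (log-magnitude L1 only), and packaged implementations such as auraloss's MultiResolutionSTFTLoss exist as an alternative.

```python
import torch

def multi_resolution_stft_loss(pred, target,
                               fft_sizes=(512, 1024, 2048),
                               eps=1e-8):
    """Simplified multi-resolution STFT loss: L1 distance between
    log-magnitude spectrograms at several analysis resolutions."""
    loss = 0.0
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=pred.device)
        P = torch.stft(pred, n_fft, hop_length=n_fft // 4,
                       window=window, return_complex=True).abs()
        T = torch.stft(target, n_fft, hop_length=n_fft // 4,
                       window=window, return_complex=True).abs()
        loss = loss + torch.nn.functional.l1_loss(torch.log(P + eps),
                                                  torch.log(T + eps))
    return loss / len(fft_sizes)
```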

22 of 37

Evaluation metrics

General similarity (full reference)
  • PESQ: Perceptual evaluation of speech quality
  • STFT: Multi-resolution STFT error

Spectral balance (EQ) (high-level features)
  • MSD: Large-window log-mel spectrogram error
  • SCE: Spectral centroid error

Dynamics (Compression) (high-level features)
  • RMS: Root mean square energy error
  • LUFS: Perceptual loudness error
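A sketch of how the high-level features behind SCE, RMS, and LUFS might be computed (window and weighting choices here are illustrative; pyloudnorm is one option for BS.1770 loudness):

```python
import numpy as np
import pyloudnorm as pyln

def spectral_centroid(x, sr, n_fft=2048):
    mag = np.abs(np.fft.rfft(x, n=n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    return np.sum(freqs * mag) / (np.sum(mag) + 1e-8)

def rms_db(x):
    return 20.0 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-8)

def style_metrics(pred, ref, sr=44100):
    meter = pyln.Meter(sr)   # ITU-R BS.1770 loudness meter
    return {
        "SCE":  abs(spectral_centroid(pred, sr) - spectral_centroid(ref, sr)),
        "RMS":  abs(rms_db(pred) - rms_db(ref)),
        "LUFS": abs(meter.integrated_loudness(pred)
                    - meter.integrated_loudness(ref)),
    }
```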

23 of 37

Synthetic audio production style transfer

  1. Rule-based DSP baseline outperformed by learned approaches
  2. Neural proxy hybrid approaches do not perform well
  3. Gradient approximation performs second best but struggles with instability
  4. Automatic differentiation performs best overall, although the differentiable implementations only approximate the original effects

24 of 37

Building a production style dataset/task

Styles are defined by distributions in the parameter space of the parametric EQ and dynamic range compressor.

Clean audio → EQ → DRC → Style dataset
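One way to read "styles as distributions in parameter space" in code (style names and ranges below are invented for illustration): each style is a sampler over EQ and compressor settings that is applied to clean audio to build the dataset.

```python
import random

# Hypothetical style definitions: each style is a distribution over
# EQ / compressor parameters rather than a single fixed preset.
STYLES = {
    "bright":    {"high_shelf_gain_db": (4.0, 10.0), "ratio": (1.5, 3.0)},
    "broadcast": {"high_shelf_gain_db": (0.0, 3.0),  "ratio": (4.0, 8.0)},
}

def sample_style_params(style: str) -> dict:
    """Draw one concrete parameter setting from a style's distribution."""
    return {name: random.uniform(lo, hi)
            for name, (lo, hi) in STYLES[style].items()}

# e.g. sample_style_params("bright") -> {'high_shelf_gain_db': 7.2, 'ratio': 2.1}
```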

25 of 37

Realistic audio production style transfer

26 of 37

Learning audio production representations

Frozen pretrained encoder

Linear classifier
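This probing setup is the standard linear-probe recipe; a sketch with a hypothetical encoder and embedding size, since only the frozen-encoder-plus-linear-classifier structure matters here.

```python
import torch
import torch.nn as nn

# Stand-in for the pretrained encoder from the style-transfer system
# (hypothetical architecture; only its frozen role matters for the probe).
encoder = nn.Sequential(nn.Conv1d(1, 32, 1024, stride=512), nn.ReLU(),
                        nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                        nn.Linear(32, 128))
for p in encoder.parameters():          # freeze: only the probe is trained
    p.requires_grad_(False)

num_styles = 5
probe = nn.Linear(128, num_styles)      # linear classifier on embeddings
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

def probe_step(audio, labels):
    with torch.no_grad():
        emb = encoder(audio)            # (batch, 128) frozen embeddings
    loss = nn.functional.cross_entropy(probe(emb), labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# probe_step(torch.randn(8, 1, 44100), torch.randint(0, num_styles, (8,)))
```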

27 of 37

Future directions

  1. Extend this approach with more differentiable effects (e.g., reverb, distortion)
  2. Improved methods for training neural proxy (and hybrids)
  3. Methods for handling dynamic construction of the processing chain
  4. Adapt this approach for multichannel use cases (e.g. multitrack mixing)
  5. Zero-shot adaptation to a new set of audio effects (can I use the plugins in my DAW?)

28 of 37

Resources

Ready-to-go differentiable EQ and compressor

29 of 37

Differentiable IIR filters
Colonel and Steinmetz et al., 2022 "Direct design of biquad filter cascades with deep learning by sampling random polynomials." IEEE ICASSP

Differentiable reverberation
Steinmetz et al., 2021 "Filtered noise shaping for time domain room impulse response estimation from reverberant speech." IEEE WASPAA (Best Student Paper Award)

Differentiable EQ and Compression
Steinmetz et al., 2022 "Style transfer of audio effects with differentiable signal processing." Journal of the Audio Engineering Society

30 of 37

Efficient neural audio effects
Steinmetz and Reiss, 2022 "Efficient neural networks for real-time modeling of analog dynamic range compression." 152nd AES Convention

Randomized neural networks
Steinmetz and Reiss, 2020 "Randomized overdrive neural networks." NeurIPS 4th Workshop on Machine Learning for Creativity and Design

Steerable discovery
Steinmetz and Reiss, 2021 "Steerable discovery of neural audio effects." NeurIPS 5th Workshop on Machine Learning for Creativity and Design

31 of 37

Extra content

32 of 37

Experiments

  1. Synthetic production style transfer (matching input and reference)
  2. Realistic production style transfer (non-matching input and reference)
  3. Audio production representations (audio production style classification)
  4. Computational complexity

33 of 37

Audio production style transfer

Synthetic (training): matching input and reference; the system's prediction is scored with full-reference metrics.

Realistic (evaluation): non-matching input and reference; the prediction is scored with high-level metrics.

34 of 37

Automatic differentiation audio effects

This can be approximated with a FIR (frequency-domain) filter

Estimate the IIR filter response with the DFT and apply it as a frequency-domain FIR filter

Nercessian, Shahan. "Neural parametric equalizer matching using differentiable biquads." Proc. Int. Conf. Digital Audio Effects (eDAFx-20). 2020.
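A sketch of that frequency-sampling idea: evaluate the biquad response on the rFFT grid with differentiable ops and apply it multiplicatively in the frequency domain. Circular-convolution edge effects are ignored for brevity, so this only approximates the true IIR filter.

```python
import math
import torch

def biquad_freq_response(b, a, n_bins):
    """Evaluate H(e^{jw}) = B(e^{jw}) / A(e^{jw}) on n_bins rFFT frequencies."""
    w = torch.linspace(0.0, math.pi, n_bins)
    z = torch.exp(torch.complex(torch.zeros_like(w), -w))   # e^{-jw}
    B = b[0] + b[1] * z + b[2] * z ** 2
    A = a[0] + a[1] * z + a[2] * z ** 2
    return B / A

def apply_as_fir(x, b, a):
    """Apply an IIR biquad approximately, as a frequency-domain FIR filter."""
    n = x.shape[-1]
    X = torch.fft.rfft(x, n=n)
    H = biquad_freq_response(b, a, X.shape[-1])
    return torch.fft.irfft(X * H, n=n)       # gradients reach b and a

b = torch.tensor([1.0, -1.8, 0.81], requires_grad=True)
a = torch.tensor([1.0, -1.6, 0.64], requires_grad=True)
y = apply_as_fir(torch.randn(44100), b, a)
y.pow(2).mean().backward()                   # d(loss)/d(coefficients) exists
```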

35 of 37

Contributions

  1. The first audio effects style transfer method to integrate audio effects as differentiable operators, optimized end-to-end with an audio-domain loss

  2. Self-supervised training that enables automatic audio production without labeled or paired training data

  3. A benchmark of differentiation strategies for audio effects, including compute cost, engineering difficulty, and performance

  4. Novel neural proxy hybrid methods and a differentiable dynamic range compressor

36 of 37

Synthetic audio production style transfer

out-of-domain datasets

37 of 37

Computational complexity