Harmonai • 27 September 2022
Style transfer of audio effects with differentiable signal processing
Nick J. Bryan²
Joshua D. Reiss¹
¹ Queen Mary University of London
² Adobe Research
Christian Steinmetz
Queen Mary University of London
PhD in Artificial Intelligence and Music
Universitat Pompeu Fabra
Master in Sound and Music Computing
Clemson University
B.S. in Electrical Engineering
B.A. in Audio Technology
Minor in Mathematical Sciences
mixing / mastering / production
More people are creating audio content
Music
Podcasts
Short-form content
Sound for Video
Producing high quality audio requires expertise
Demand for high quality audio
Deep learning for audio processing
The current paradigm
Neural network
Source separation
Speech enhancement
Audio effect modeling
Stöter et al., 2019, "Open-Unmix - A reference implementation for music source separation." JOSS
Pascual et al., 2017, "SEGAN: Speech enhancement generative adversarial network." arXiv:1703.09452
Martínez Ramírez et al., 2020, "Deep learning for black-box modeling of audio effects." Applied Sciences
Audio In
Audio Out
Audio engineers solve problems with DSP
Controlling audio effects
Modeling acoustic spaces
Creating a mix
Building models that control DSP
Neural network
Signal processing
Control parameters
1. How to integrate DSP with neural nets?
2. How to convey user intention?
Differentiable signal processing
Backprop through DSP operations
Conveying intention
Traditional control parameters
Text-based prompt: “Make my guitar sound bright and shiny”
By example (style transfer)
Style transfer of audio effects
Audio production as a three stage process
1. Listen: Perform an acoustic analysis of the input recording
2. Plan: Establish an acoustic goal (style) considering the context
3. Execute: Manipulate DSP controls to achieve this goal
Learning audio production by example
1. Automatic differentiation
Explicitly define signal processing operations in autodiff framework
Engel, Jesse, et al. "DDSP: Differentiable digital signal processing." ICLR (2020).
2. Neural proxy
(1) Pretraining
Frozen DSP neural proxy
(2) Training
(3) Inference
Steinmetz, Christian J., et al. "Automatic multitrack mixing with a differentiable mixing console of neural audio effects." ICASSP, 2021.
3. Neural proxy hybrid
(2) Training
(3) Inference
Use original DSP during inference
4. Gradient approximation
Simultaneous perturbation stochastic approximation (SPSA)
Finite differences (FD)
Martínez Ramírez, Marco A., et al. "Differentiable signal processing with black-box audio effects." ICASSP, 2021.
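The two gradient approximation schemes named above can be sketched in a few lines. This is a minimal illustration, not the implementation from the cited work: `toy_effect_loss` is a hypothetical stand-in for "run the black-box effect and compute a loss", and the step sizes are arbitrary. Finite differences needs two loss evaluations per parameter, while SPSA perturbs all parameters at once and needs two evaluations in total, at the cost of a noisier estimate.

```python
import random


def finite_difference_grad(f, params, eps=1e-4):
    """Central finite differences: two evaluations of f per parameter."""
    grad = []
    for i in range(len(params)):
        plus = list(params)
        minus = list(params)
        plus[i] += eps
        minus[i] -= eps
        grad.append((f(plus) - f(minus)) / (2.0 * eps))
    return grad


def spsa_grad(f, params, eps=1e-3, rng=random):
    """SPSA: two evaluations of f in total, whatever the parameter count."""
    # Perturb every parameter at once with a random +/-1 (Rademacher) vector.
    delta = [rng.choice((-1.0, 1.0)) for _ in params]
    plus = [p + eps * d for p, d in zip(params, delta)]
    minus = [p - eps * d for p, d in zip(params, delta)]
    diff = f(plus) - f(minus)
    return [diff / (2.0 * eps * d) for d in delta]


def toy_effect_loss(params):
    """Hypothetical stand-in for a black-box effect followed by a loss."""
    target = [0.5, -1.0, 2.0]
    return sum((p - t) ** 2 for p, t in zip(params, target))


if __name__ == "__main__":
    theta = [0.0, 0.0, 0.0]
    print("FD:", finite_difference_grad(toy_effect_loss, theta))
    rng = random.Random(0)
    n = 2000
    avg = [0.0] * len(theta)
    for _ in range(n):
        estimate = spsa_grad(toy_effect_loss, theta, rng=rng)
        avg = [a + e / n for a, e in zip(avg, estimate)]
    print("SPSA (averaged over", n, "draws):", avg)
```

A single SPSA draw is noisy; averaging many draws (or simply taking many optimizer steps) recovers the finite-difference direction on this toy problem.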
RECAP: Differentiable signal processing
No existing comparison of these approaches in a unified setup.
Training details
Models
RB-DSP: Rule-based DSP
cTCN: Conditional TCN
NP: Neural Proxy
NP-HH: Neural Proxy Half-hybrid
NP-FH: Neural Proxy Full-hybrid
SPSA: Gradient approximation
AD: Automatic differentiation
Audio domain loss
Multi-resolution STFT
Training datasets
Speech (LibriTTS)
Music (MTG-Jamendo)
Effects
6-band parametric EQ
Dynamic range compressor
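As an illustration of the multi-resolution STFT loss listed in the training details, here is a minimal sketch. The slides do not give the exact formulation, so this assumes one common form (spectral convergence plus a log-magnitude L1 term, averaged over several window sizes). The FFT sizes, hop size, and function names are illustrative assumptions, and a naive DFT is used so the example stays dependency-free.

```python
import cmath
import math


def stft_magnitude(signal, n_fft, hop):
    """Magnitude STFT with a Hann window, using a naive DFT for clarity."""
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_fft) for n in range(n_fft)]
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(n_fft)]
        mags = []
        for k in range(n_fft // 2 + 1):
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n in range(n_fft)
            )
            mags.append(abs(acc))
        frames.append(mags)
    return frames


def stft_loss(pred, target, n_fft, hop, eps=1e-8):
    """Spectral convergence plus mean log-magnitude L1 at one resolution."""
    pred_mag = stft_magnitude(pred, n_fft, hop)
    target_mag = stft_magnitude(target, n_fft, hop)
    sq_err = sq_ref = log_err = 0.0
    count = 0
    for p_frame, t_frame in zip(pred_mag, target_mag):
        for pm, tm in zip(p_frame, t_frame):
            sq_err += (tm - pm) ** 2
            sq_ref += tm ** 2
            log_err += abs(math.log(tm + eps) - math.log(pm + eps))
            count += 1
    spectral_convergence = math.sqrt(sq_err) / (math.sqrt(sq_ref) + eps)
    return spectral_convergence + log_err / max(count, 1)


def multi_resolution_stft_loss(pred, target, fft_sizes=(32, 64, 128)):
    """Average the single-resolution loss over several window sizes."""
    losses = [stft_loss(pred, target, n, n // 4) for n in fft_sizes]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    clean = [math.sin(2.0 * math.pi * 0.05 * n) for n in range(256)]
    quieter = [0.5 * s for s in clean]
    print(multi_resolution_stft_loss(quieter, clean))
```

In practice the same computation would be written with FFT operations inside an autodiff framework so that gradients flow back to the effect parameters; the plain-Python version above only shows what is being measured.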
Evaluation metrics
General similarity (full reference)
PESQ: Perceptual evaluation of speech quality
STFT: Multi-resolution STFT error
Spectral balance (EQ) (high-level features)
MSD: Large window log-mel spectrogram error
SCE: Spectral centroid error
Dynamics (compression) (high-level features)
RMS: Root mean square energy error
LUFS: Perceptual loudness error
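Two of the high-level metrics above are simple enough to sketch directly. The version below is an assumption about how such errors could be computed (whole-signal features, absolute difference between prediction and reference); the slides do not specify framing, units, or averaging, and the function names are made up for this example.

```python
import cmath
import math


def rms_db(signal, eps=1e-8):
    """Root mean square energy of the whole signal, in dB."""
    mean_square = sum(s * s for s in signal) / len(signal)
    return 10.0 * math.log10(mean_square + eps)


def spectral_centroid_hz(signal, sample_rate):
    """Magnitude-weighted mean frequency of the whole signal (naive DFT)."""
    n = len(signal)
    weighted = total = 0.0
    for k in range(n // 2 + 1):
        bin_value = sum(
            signal[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n)
        )
        magnitude = abs(bin_value)
        weighted += magnitude * (k * sample_rate / n)
        total += magnitude
    return weighted / total if total > 0.0 else 0.0


def rms_error(pred, ref):
    """Absolute RMS energy difference in dB (a dynamics-related feature)."""
    return abs(rms_db(pred) - rms_db(ref))


def spectral_centroid_error(pred, ref, sample_rate):
    """Absolute spectral centroid difference in Hz (a spectral balance feature)."""
    return abs(
        spectral_centroid_hz(pred, sample_rate) - spectral_centroid_hz(ref, sample_rate)
    )
```

Because these features summarize the signal rather than compare it sample by sample, they can be computed between a prediction and a style reference that contain different content.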
Synthetic audio production style transfer
Building a production style dataset/task
Styles are defined by distributions in the parameter space of the parametric EQ and dynamic range compressor.
Clean audio
Style dataset
EQ
DRC
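The idea of a style as a distribution over EQ and compressor parameters can be sketched as follows. Everything specific here is hypothetical: the style names, parameter names, ranges, and the choice of uniform sampling are placeholders for illustration, not the values used in the talk. Rendering the clean audio through the effects is left out.

```python
import random

# Hypothetical styles: each maps a parameter name to a (low, high) range.
STYLES = {
    "bright": {
        "high_shelf_gain_db": (6.0, 12.0),
        "low_shelf_gain_db": (-6.0, 0.0),
        "comp_threshold_db": (-20.0, -10.0),
        "comp_ratio": (1.5, 3.0),
    },
    "warm": {
        "high_shelf_gain_db": (-12.0, -6.0),
        "low_shelf_gain_db": (3.0, 9.0),
        "comp_threshold_db": (-30.0, -20.0),
        "comp_ratio": (2.0, 4.0),
    },
}


def sample_style_params(style, rng):
    """Draw one EQ + compressor setting from the style's parameter ranges."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in STYLES[style].items()}


def build_style_dataset(clean_clips, style, rng):
    """Pair each clean clip with its own random draw from the style.

    A full pipeline would then render each clip through the EQ and
    compressor with the sampled parameters; that step is omitted here.
    """
    return [(clip, sample_style_params(style, rng)) for clip in clean_clips]


if __name__ == "__main__":
    rng = random.Random(0)
    for clip, params in build_style_dataset(["clip_000", "clip_001"], "bright", rng):
        print(clip, params)
```

Since every clip gets its own draw, two recordings in the same style share a region of parameter space without sharing identical settings.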
Realistic audio production style transfer
Learning audio production representations
Frozen pretrained encoder
Linear classifier
Future directions
Resources
Ready-to-go differentiable EQ and compressor
Colonel and Steinmetz et al., 2022 "Direct design of biquad filter cascades with deep learning by sampling random polynomials." IEEE ICASSP
Steinmetz et al., 2021 "Filtered noise shaping for time domain room impulse response estimation from reverberant speech." IEEE WASPAA (Best Student Paper Award)
Steinmetz et al., 2022 "Style transfer of audio effects with differentiable signal processing." Journal of the Audio Engineering Society
Differentiable IIR filters
Differentiable reverberation
Differentiable EQ and Compression
Efficient neural audio effects
Randomized neural networks
Steerable discovery
Steinmetz and Reiss, 2022 "Efficient neural networks for real-time modeling of analog dynamic range compression." 152nd AES Convention
Steinmetz and Reiss, 2021 "Steerable discovery of neural audio effects." NeurIPS 5th Workshop on Machine Learning for Creativity and Design
Steinmetz and Reiss, 2020 "Randomized overdrive neural networks." NeurIPS 4th Workshop on Machine Learning for Creativity and Design
Extra content
Experiments
Audio production style transfer
Synthetic (training): Input and Reference → System → Prediction, scored with a full-reference metric
Realistic (evaluation): Input and Reference → System → Prediction, scored with high-level metrics
Automatic differentiation audio effects
This can be approximated with an FIR (frequency-domain) filter
Estimate the IIR filter response with the DFT and apply it as a frequency-domain FIR filter
Nercessian, Shahan. "Neural parametric equalizer matching using differentiable biquads." Proc. Int. Conf. Digital Audio Effects (eDAFx-20). 2020.
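The frequency-sampling idea above can be checked numerically. The sketch below evaluates a biquad's response H = B/A at the DFT frequencies, multiplies it with the zero-padded input spectrum, and compares the result with the usual recursive filter. The coefficients, padding factor, and function names are assumptions chosen for illustration, and a naive DFT keeps the example dependency-free. No gradients are computed here; the point is only that the recursion can be replaced by operations that an autodiff framework handles easily.

```python
import cmath
import math


def biquad_freq_response(b, a, n_fft):
    """Sample H(e^{jw}) = B(e^{jw}) / A(e^{jw}) at the n_fft DFT frequencies."""
    response = []
    for k in range(n_fft):
        z_inv = cmath.exp(-2j * math.pi * k / n_fft)
        num = b[0] + b[1] * z_inv + b[2] * z_inv ** 2
        den = a[0] + a[1] * z_inv + a[2] * z_inv ** 2
        response.append(num / den)
    return response


def dft(x, inverse=False):
    """Naive O(N^2) DFT; the inverse includes the 1/N scaling."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = [
        sum(x[i] * cmath.exp(sign * 2j * math.pi * k * i / n) for i in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def apply_biquad_freq_domain(x, b, a):
    """Approximate the IIR filter as a frequency-domain FIR filter."""
    n_fft = 2 * len(x)  # zero-pad to limit circular wrap-around
    padded = list(x) + [0.0] * (n_fft - len(x))
    spectrum = dft(padded)
    response = biquad_freq_response(b, a, n_fft)
    y = dft([s * h for s, h in zip(spectrum, response)], inverse=True)
    return [v.real for v in y[: len(x)]]


def apply_biquad_time_domain(x, b, a):
    """Reference recursive implementation (direct form I)."""
    y = []
    for n in range(len(x)):
        acc = b[0] * x[n]
        if n >= 1:
            acc += b[1] * x[n - 1] - a[1] * y[n - 1]
        if n >= 2:
            acc += b[2] * x[n - 2] - a[2] * y[n - 2]
        y.append(acc / a[0])
    return y


if __name__ == "__main__":
    b = [0.2, 0.3, 0.1]    # illustrative feedforward coefficients
    a = [1.0, -0.5, 0.25]  # poles at radius 0.5, so the response decays quickly
    x = [math.sin(0.3 * n) for n in range(32)]
    y_freq = apply_biquad_freq_domain(x, b, a)
    y_time = apply_biquad_time_domain(x, b, a)
    print(max(abs(u - v) for u, v in zip(y_freq, y_time)))
```

The match is only approximate: sampling the response at a finite number of frequencies time-aliases the infinite impulse response, so the error grows as the poles approach the unit circle or the padding shrinks.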
Contributions
Synthetic audio production style transfer
out-of-domain datasets
Computational complexity