Speech Processing
Episode 10
Voice Quality Enhancement
Part 1
VQE Tasks
Why do VQE?
Lecture Plan
Noise reduction. Physics.
[Figure: speech + noise = noisy mixture; acoustic signals mix by simple addition, essentially numpy.sum()]
Noise reduction
Task:
Restore the speech signal from a mixture
[Figure: noisy mixture = speech + noise]
Questions!
Reverberation. Acoustics.
The ray directions are not physically accurate; they were chosen for illustration purposes
Reverberation. Acoustics.
Clean
Reverberated
Reverberation. Acoustics.
Reverberation:
- Stronger in bigger rooms
- Stronger when the speaker is far from the mic
- Strong reverberation reduces voice quality
- Weak reverberation makes the voice sound natural
Reverberation. Acoustics.
Acoustics is typically modeled with room impulse responses (RIRs)
The RIR is defined as the signal captured by the microphone when the speaker emits a unit impulse
It is determined by the room and the positions of the mic and the speaker
It fully defines the acoustics:
signal_on_mic = signal_on_speaker * RIR (here * denotes convolution)
More on that in the next lecture
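The convolution above can be sketched in numpy. The RIR here is a made-up toy (a direct path plus two reflections), not a measured response:

```python
import numpy as np

fs = 16000                      # sample rate, Hz
# Toy RIR (illustrative, not measured): direct path + two decaying reflections
rir = np.zeros(fs // 4)         # 250 ms impulse response
rir[0] = 1.0                    # direct path
rir[400] = 0.5                  # early reflection (~25 ms)
rir[2000] = 0.2                 # late reflection (~125 ms)

speech = np.random.randn(fs)    # stand-in for one second of clean speech
mic = np.convolve(speech, rir)  # signal_on_mic = signal_on_speaker * RIR
```

Before the first reflection arrives, the mic signal is just the direct-path copy of the source.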
Partial dereverberation
Keeping early reverberation and removing late reverberation
Improves perceived speech quality
Metrics
Objective:
- Downstream task (ASR WER/WACC)
- Signal: SNR, SI-SNR, SDR
- Subjective metric modeling:
- PESQ, STOI, COVL, DNSMOS
- Intrusive / non-intrusive
Easily accessible
Subjective:
- SIG: signal quality
- BAK: background noise quality
- OVRL: overall quality
Gold standard
Costly to obtain
SIG, BAK, OVRL
Illustrative example of a crowdsourcing task
All rated 1 to 5 by human listeners.
SIG: how good is the signal quality?
BAK: how good is the absence of background noise?
OVRL: how good is the overall quality?
Each participant is required to evaluate
SIG and BAK and only after that go to OVRL evaluation (ITU-T P.835 standard)
DNSMOS
Neural network trained to approximate human ratings
Non-intrusive
Better correlated with human ratings than PESQ and POLQA
First version of DNSMOS
SNR
Input and output SNR are evaluated
SI-SNR: scale-invariant SNR
SDR: signal-to-distortion ratio, a version of SNR which is robust to small linear transforms
Intrusive
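A minimal numpy sketch of SNR and SI-SNR (function names are my own; the formulas follow the standard definitions):

```python
import numpy as np

def snr_db(ref, est):
    """Plain SNR: reference energy over error energy, in dB."""
    err = est - ref
    return 10 * np.log10(np.sum(ref ** 2) / np.sum(err ** 2))

def si_snr_db(ref, est):
    """Scale-invariant SNR: project the estimate onto the reference first,
    so rescaling the estimate does not change the score."""
    target = (np.dot(est, ref) / np.dot(ref, ref)) * ref
    err = est - target
    return 10 * np.log10(np.sum(target ** 2) / np.sum(err ** 2))
```

Scaling the estimate by any constant leaves SI-SNR unchanged, while plain SNR moves.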
AEC: Acoustic Echo Canceling. ASR Downstream task.
[Illustration: the device plays «Du hast mich» while the user says «Alice, stop, please»; the echo canceller removes the playback echo so that ASR receives only «Alice, stop, please»]
Noise reduction. Data simulation
How do we generate training data?
We don’t have natural noisy-clean pairs
Speech dataset * room acoustics (RIR) + noise dataset, passed through the mic frequency response → noisy input
Speech dataset * truncated room acoustics (early reflections only) → target for training (partial dereverberation)
Reverberation is added to match real-world conditions
Partial dereverberation improves voice quality
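A toy numpy sketch of this simulation pipeline. The RIR, truncation length, and SNR are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000
speech = rng.standard_normal(fs)   # stand-in for a clip from the speech dataset
noise = rng.standard_normal(fs)    # stand-in for a clip from the noise dataset

# Toy RIR (illustrative): direct path + early and late reflections
rir = np.zeros(4000)
rir[0], rir[300], rir[2500] = 1.0, 0.4, 0.15
rir_early = rir[: fs * 50 // 1000]   # keep only the first 50 ms (early reverb)

reverberant = np.convolve(speech, rir)[:fs]
target = np.convolve(speech, rir_early)[:fs]  # training target: early reverb kept

# Scale the noise to a chosen SNR and mix
snr_target_db = 5.0
scale = np.sqrt(np.sum(reverberant ** 2)
                / (np.sum(noise ** 2) * 10 ** (snr_target_db / 10)))
noisy = reverberant + scale * noise  # network input
```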
Question time!
Why do we love (complex) spectrograms?
[Figure: STFT(speech) + STFT(noise) = STFT(mixture)]
Linearity
For complex, not for magnitude!
Why do we love (all) spectrograms?
Signal
Noise
Sparsity
Magnitude Masking Method
Both complex and magnitude
Why do we love (complex) spectrograms?
More reasons coming next lecture, but...
Why do we love (complex) spectrograms?
STFT is invertible!
STFT
ISTFT
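The invertibility claim is easy to check with the PyTorch functions referenced below (sizes here are arbitrary):

```python
import torch

x = torch.randn(16000)              # one second of audio at 16 kHz
window = torch.hann_window(512)

# Analysis: complex spectrogram
S = torch.stft(x, n_fft=512, hop_length=128, window=window,
               return_complex=True)
# Synthesis: back to the waveform
y = torch.istft(S, n_fft=512, hop_length=128, window=window, length=16000)
```

With a Hann window and 75% overlap the round trip reconstructs the signal up to floating-point error.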
How to use complex spectrograms for VQE?
ISTFT
STFT
Processing
Earlier works proposed to enhance the magnitude and reuse the phase of the noisy input signal
ISTFT
STFT
Processing
Magnitude processing
Short-time Fourier Transform (STFT) Recap
1. Window transform → shape: T/h × win_length
2. Window function
3. Padding → shape: t × n_fft
4. Discrete Fourier Transform → shape: t × (n_fft/2 + 1), complex
https://pytorch.org/docs/stable/generated/torch.stft.html
Inverse STFT (ISTFT)
Forward (STFT): 1. Window transform → T/h × win_length; 2. Window function; 3. Padding → t × n_fft; 4. Discrete Fourier Transform → t × (n_fft/2 + 1), complex
Inverse (ISTFT): 1. Inverse DFT; 2. Crop; 3. Window function (yes, again, not its inverse!); 4. Overlap-add
https://pytorch.org/docs/stable/generated/torch.istft.html
We want to make the output continuous and emphasize the center of a window
Why the window again and not its inverse? Window multiplication in ISTFT:
ISTFT works with arbitrary inputs, not only with STFT outputs
Overlap-add of raw signals can lead to discontinuities
Inverse STFT (ISTFT)
The composition of the STFT steps (window transform, window function, padding, DFT) and the ISTFT steps (inverse DFT, crop, window function, overlap-add) must be identity!
https://pytorch.org/docs/stable/generated/torch.istft.html
Overlap-Add Operation
What is it?
Window transform example: win_size=4, hop_size=2
Overlap-add is defined for any input, not only for window-transform outputs
Corresponding cells of overlapping frames are summed and normalized
How are they normalized? So that overlap-add (together with the double window multiplication) is the exact inverse of the window transform
The composition of sliding-window transform, double window multiplication, and overlap-add should be identity!
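A minimal numpy sketch of this round trip (the slides use win_size=4, hop_size=2; here a Hann window of 512 with hop 128 is used so the interior is fully covered):

```python
import numpy as np

def window_transform(x, win, hop):
    """Sliding-window transform: overlapping frames, each multiplied by win."""
    n = (len(x) - len(win)) // hop + 1
    return np.stack([x[i * hop : i * hop + len(win)] * win for i in range(n)])

def overlap_add(frames, win, hop):
    """Multiply by the window a second time, sum overlapping cells,
    then normalize by the accumulated squared window."""
    n, w = frames.shape
    out = np.zeros((n - 1) * hop + w)
    norm = np.zeros_like(out)
    for i in range(n):
        out[i * hop : i * hop + w] += frames[i] * win   # double window multiplication
        norm[i * hop : i * hop + w] += win ** 2         # normalization denominator
    return out / np.maximum(norm, 1e-12)

x = np.random.randn(4096)
win = np.hanning(512)
y = overlap_add(window_transform(x, win, 128), win, 128)
```

Away from the edges (where the window sum tapers to zero) the composition is exactly the identity.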
Overlap-add in streaming inference
Frames arrive one at a time and overlapping tails are accumulated in a cache:
- output and dropped from the cache on the previous step
- output and dropped from the cache on this step
- stays in the cache after this step
Complex Spectrum Recap
1. Interpretable
2. Invertible
3. Questions?
Let’s train a neural network!
Wait a moment. Isn’t there a simpler method?
There is a branch of science called Digital Signal Processing (DSP)
It delivers remarkable results, and DSP methods are typically much less computationally demanding.
But DSP methods are considered to perform poorly for single-microphone noise reduction with non-stationary noises
Neural networks confidently dominate Deep Noise Suppression challenge (an open challenge for real-world noise reduction)
Let’s train a neural network!
Neural network
More data
Break!
U-Net (hourglass) structure
Encoder
Decoder
Reconstructs high-resolution details!
U-Net (hourglass) structure
https://web.cse.ohio-state.edu/~wang.77/papers/Tan-Wang.taslp20.pdf
PHASE-AWARE SPEECH ENHANCEMENT WITH DEEP COMPLEX U-NET
Cruse-v2 (loss functions)
PoCoNet
DfNet-2
Are these models suitable for real-time processing?
Some are, some are not.
Real-time processing
Real-time processing is critical for VQE applications
What does real-time mean for VQE? VQE works on short chunks with low latency.
We have 2 types of delay:
Algorithmic: how much future information is required to process current time frame?
Computational: how much time of computation do we need to process a time frame?
Real-time factor (RTF): time-to-process-one-frame / frame-duration. Should be < 1
What is frame duration for STFT?
U-Net (hourglass) structure
What does a model need for real-time processing?
No down-sampling on the time axis; the frequency axis is still ok.
Note: in vision, the horizontal and vertical directions have similar (spatial) semantics. In audio, one axis stands for time and the other for frequency.
Causal layers
Causal Convolution
PyTorch: left padding
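A minimal sketch of the left-padding recipe as a PyTorch module (the class name is my own):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Causal 1-D convolution implemented via left-only padding."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                   # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))    # pad the past, never the future
        return self.conv(x)
```

Perturbing a future input sample leaves all earlier output frames unchanged, which is exactly the causality property.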
Causal Layers
Causal: causal convolution; unidirectional (regular) RNN (LSTM, GRU); Transformer decoder
Non-causal: (regular) convolution; bidirectional RNN (LSTM, GRU); Transformer encoder
Spectral Magnitude Mapping.
A Convolutional Recurrent Neural Network for Real-Time Speech Enhancement, 2018
Contribution: CRN beats the LSTM baseline
Like many other methods, it only estimates the magnitude and does not estimate the phase
Softplus ensures the estimated magnitude is >= 0
Signal
Noise
Training target: MSE on the magnitude spectrogram
Magnitude and Phase
Given a complex spectrogram S
Magnitude: |S|
Phase: angle(S)
The noisy phase can, to some extent, be used directly with the enhanced magnitude
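A small numpy illustration of the decomposition and of magnitude-only enhancement (the 0.5 "enhancement" is a made-up mask, just to show the recombination):

```python
import numpy as np

S_noisy = np.array([[3 + 4j, -1j],
                    [2 + 0j, -2 + 2j]])   # toy complex spectrogram

mag = np.abs(S_noisy)          # magnitude: |S|
phase = np.angle(S_noisy)      # phase: angle(S)

# Magnitude and phase together fully determine the complex spectrogram
S_back = mag * np.exp(1j * phase)

# Magnitude-only enhancement: combine an "enhanced" magnitude
# with the untouched noisy phase
enhanced_mag = 0.5 * mag
S_enhanced = enhanced_mag * np.exp(1j * phase)
```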
Complex Spectral Mapping. Two heads.
From magnitude spectral mapping to complex spectral mapping
Complex Spectral Mapping:
Complex spectrogram estimated directly
Basic approach. Complex Spectral Mapping. Two heads.
Network topology
Chosen as the best trade-off between quality and model size
ICASSP 2021 DEEP NOISE SUPPRESSION CHALLENGE: DECOUPLING MAGNITUDE AND PHASE OPTIMIZATION WITH A TWO-STAGE DEEP NETWORK
TSCN-PP
Real-time
State-of-the-art in 2021
TSCN-PP
3 stages:
1. Coarse magnitude estimation (CME)
2. Complex spectrum refinement (CSR)
3. Post processing
TSCN-PP
Network Topology
U-Net:
Encoder
Engine
Decoder
TSCN-PP. TCM: Inverted Residual Convolutional Block
Pointwise Conv (channel expansion)
Depthwise Conv
Pointwise Conv (channel compression)
Residual connection
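The block above can be sketched in PyTorch. Channel sizes, activations, and the class name are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class InvertedResidualBlock(nn.Module):
    """Pointwise-depthwise-pointwise convolutions with a residual connection."""
    def __init__(self, channels, hidden, kernel_size=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(channels, hidden, 1),          # pointwise: channel expansion
            nn.PReLU(),
            nn.Conv1d(hidden, hidden, kernel_size,
                      padding=kernel_size // 2,
                      groups=hidden),                # depthwise conv
            nn.PReLU(),
            nn.Conv1d(hidden, channels, 1),          # pointwise: channel compression
        )

    def forward(self, x):                            # x: (batch, channels, time)
        return x + self.body(x)                      # residual connection
```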
TSCN-PP. TCM.
Convolutional blocks
Stacks with exponentially growing dilation factors
e.g. 1, 2, 4, 8, 16, 32, 1, 2, 4, 8, 16, 32
Handles both short-term and long-term dependencies
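A quick worked check of how much context such a stack covers, assuming kernel size 3 (an assumption for illustration; the paper's kernel size may differ):

```python
# Receptive field of a stack of convolutions (kernel size 3) with
# dilations 1, 2, 4, 8, 16, 32 repeated twice, as in the example above
kernel_size = 3
dilations = [1, 2, 4, 8, 16, 32] * 2

# Each layer adds (kernel_size - 1) * dilation frames of context
receptive_field = 1 + sum((kernel_size - 1) * d for d in dilations)
print(receptive_field)   # 253 frames
```

Doubling the dilation at every layer makes the receptive field grow exponentially with depth while the parameter count grows only linearly.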
TSCN-PP. TCM: Conv block modified.
What is new:
1. Inner convs are dilated but no longer depthwise
2. Channel dim is compressed instead of expanded
3. Gating mechanism
4. Two dilation factors in one DTCM layer: 2^r and 2^(M-r)
TSCN-PP. How to train.
Stage 1: CME-Net, loss:
Stage 2: CME-Net and CSR-Net trained jointly, loss:
MSE for complex spectrum
MSE for magnitude
Post-Processing. Inspired by RNNoise.
Combined ML-DSP (digital signal processing) approach
TSCN-PP. Finally.
Real-time
State-of-the-art in 2021:
DNS Challenge winner
What next?
Neural network architectures:
- Dual path architectures
Model representation:
- Magnitude spectrum mapping/masking
- Complex spectrum mapping/masking
- Deep Filter
- Waveform domain, TasNet (exotic)
VQE as front-end for other models:
- ASR, KWS, Biometrics
Digital Signal Processing and multi-microphone techniques:
Other tasks:
- Acoustic echo canceling (AEC): next lecture
- Blind source separation, music source separation, audio-visual source separation
- Bandwidth extension
Generative VQE
Stay tuned!
What next?
Narrowband (4 kHz)
Fullband, restored (16 kHz)
Mask representation
Magnitude GeLU
Complex Mask
Complex-As-Channels
Deep Filter
Multi-stage approach