1 of 69

Speech Processing

Episode 10

Voice Quality Enhancement

Part 1

2 of 69

VQE Tasks

  • Noise Reduction
  • De-Reverberation
  • Acoustic Echo Canceling (next lecture)
  • Source Separation
  • Others

3 of 69

Why do VQE?

  • Teleconferencing
  • Front-end for other applications
  • Offline processing

4 of 69

Lecture Plan

  1. Task setting for Noise Reduction and data simulation
  2. Metrics
  3. STFT and ISTFT
  4. Neural networks for noise reduction

5 of 69

Noise reduction. Physics.

6 of 69

Noise reduction. Physics.

7 of 69

Noise reduction. Physics.


8 of 69

Noise reduction. Physics.

[speech] + [noise] = [mixture] — in code, just elementwise addition (numpy.sum() over the stacked signals)

9 of 69

Noise reduction

Task:

Restore the speech signal from a mixture

[speech] + [noise] = [mixture]

10 of 69

Questions!

11 of 69

Reverberation. Acoustics.

Ray directions are not physically accurate; they were chosen for illustration purposes

12 of 69

Reverberation. Acoustics.

Clean

Reverberated

13 of 69

Reverberation. Acoustics.

Reverberation:

- Stronger in larger rooms

- Stronger when the speaker is far from the mic

- Strong reverberation reduces voice quality

- Mild reverberation makes the voice sound natural

14 of 69

Reverberation. Acoustics.

Acoustics is typically modeled with room impulse responses (RIR)

The RIR is defined as the signal captured by the microphone when a unit impulse is emitted at the speaker's position

It is determined by the room and the locations of the mic and the speaker

It fully defines the acoustics:

signal_on_mic = signal_on_speaker * RIR  (where * denotes convolution)

More on that in the next lecture
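The convolution identity above can be sketched in a few lines; the RIR here is synthetic (hand-picked taps), purely for illustration:

```python
import numpy as np

# Toy RIR: a direct path plus two decaying reflections (synthetic, for illustration).
rir = np.zeros(64)
rir[0] = 1.0      # direct path
rir[15] = 0.6     # early reflection
rir[40] = 0.3     # later reflection

source = np.random.randn(1000)  # signal emitted at the speaker's position

# signal_on_mic = signal_on_speaker * RIR, where * is convolution
mic = np.convolve(source, rir)  # length: 1000 + 64 - 1
```

Before the first reflection arrives (here, the first 15 samples), the mic signal is just the direct-path copy of the source.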

15 of 69

Partial de-reverberation

Keeping early reverberation and removing late reverberation

Improves perceived speech quality


16 of 69

Metrics

Objective:

- Downstream task (ASR WER / WAcc)

- Signal-based: SNR, SI-SNR, SDR

- Subjective-metric modeling: PESQ, STOI, COVL, DNSMOS

- Intrusive / non-intrusive

- Easily accessible

Subjective:

- SIG: signal quality

- BAK: background noise quality

- OVRL: overall quality

- Gold standard, but costly to obtain

17 of 69

SIG, BAK, OVRL

Illustrative example of a crowdsourcing task

All rated 1 to 5 by human listeners.

SIG: how good is the signal quality?

BAK: how good is the absence of background noise?

OVRL: how good is the overall quality?

Each participant is required to rate SIG and BAK first, and only then rate OVRL (ITU-T P.835 standard)

Non-intrusive

Gold standard, costly to obtain

18 of 69

DNSMOS

Neural network trained to approximate human ratings

Non-intrusive

Better correlated with human ratings than PESQ and POLQA

First version of DNSMOS

19 of 69

SNR

Input and output SNR are evaluated

SI-SNR: scale-invariant SNR

SDR: signal-to-distortion ratio, a version of SNR which is robust to small linear transforms

Intrusive
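A minimal sketch of SNR and SI-SNR, assuming the common formulation (SI-SNR first projects the estimate onto the clean reference, so rescaling the estimate does not change the score):

```python
import numpy as np

def snr_db(clean, est):
    """SNR: ratio of clean-signal power to residual-error power, in dB."""
    noise = est - clean
    return 10 * np.log10(np.sum(clean**2) / np.sum(noise**2))

def si_snr_db(clean, est):
    """Scale-invariant SNR: project est onto clean, so scaling est changes nothing."""
    alpha = np.dot(est, clean) / np.dot(clean, clean)
    target = alpha * clean
    noise = est - target
    return 10 * np.log10(np.sum(target**2) / np.sum(noise**2))

rng = np.random.default_rng(0)
clean = rng.standard_normal(8000)
noisy = clean + 0.1 * rng.standard_normal(8000)
```

Multiplying the estimate by any positive constant leaves SI-SNR unchanged, while plain SNR penalizes it.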

20 of 69

AEC: Acoustic Echo Canceling. ASR Downstream task.

Du

Du hast

Alice, stop, please

Du hast mich

«Alice, stop, please»

EchoCanceller

21 of 69

Noise reduction. Data simulation

How do we generate training data?

We don't have natural noisy-clean pairs, so we simulate them:

Speech dataset + Noise dataset → simulated noisy mixtures
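The simulation step can be sketched as scaling the noise to hit a target SNR; dataset loading is omitted and random arrays stand in for real speech and noise:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the mixture has the requested speech-to-noise ratio in dB."""
    noise = noise[:len(speech)]
    speech_power = np.mean(speech**2)
    noise_power = np.mean(noise**2)
    # Gain that brings the noise to the desired SNR relative to the speech.
    gain = np.sqrt(speech_power / (noise_power * 10**(snr_db / 10)))
    return speech + gain * noise

rng = np.random.default_rng(1)
speech = rng.standard_normal(16000)   # stand-in for a speech-dataset clip
noise = rng.standard_normal(16000)    # stand-in for a noise-dataset clip
mix = mix_at_snr(speech, noise, 5.0)
```

The clean `speech` is kept as the training target; `mix` is the network input.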

22 of 69

Partial Dereverberation

Speech + room acoustics (RIR) → reverberated input

Speech + truncated room acoustics (early RIR) + mic frequency response → use as target for training

Reverberation is added to match real-world conditions

Partial dereverberation improves voice quality

23 of 69

Question time!

24 of 69

Why do we love (complex) spectrograms?

STFT(speech) + STFT(noise) = STFT(speech + noise)

Linearity

For complex spectrograms, not for magnitude!

25 of 69

Why do we love (all) spectrograms?

Signal

Noise

Sparsity: speech and noise mostly occupy different time-frequency cells

Magnitude Masking Method

Applies to both complex and magnitude spectrograms

26 of 69

Why do we love (complex) spectrograms?

More reasons coming next lecture, but...

27 of 69

Why do we love (complex) spectrograms?

STFT is invertible!

STFT

ISTFT

28 of 69

How to use complex spectrograms for VQE?

ISTFT

STFT

Processing

29 of 69

Earlier works propose enhancing the magnitude and reusing the phase of the noisy input signal

ISTFT

STFT

Processing

Magnitude processing

30 of 69

Short-time Fourier Transform (STFT) Recap

1. Window transform — shape: T / h × win_length

2. Window function

3. Padding — shape: t × n_fft

4. Discrete Fourier Transform — shape: t × (n_fft / 2 + 1), complex

https://pytorch.org/docs/stable/generated/torch.stft.html
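The four steps can be sketched in numpy (parameter values are illustrative; torch.stft additionally handles centering and edge padding):

```python
import numpy as np

def stft(x, win_length=400, hop=160, n_fft=512):
    """STFT following the four slide steps: frame, window, pad, DFT."""
    # 1. Window transform: slide frames of win_length samples every hop samples.
    n_frames = 1 + (len(x) - win_length) // hop
    frames = np.stack([x[i*hop : i*hop + win_length] for i in range(n_frames)])
    # 2. Window function: taper each frame, e.g. with a Hann window.
    frames = frames * np.hanning(win_length)
    # 3. Padding: zero-pad each frame from win_length to n_fft.
    frames = np.pad(frames, ((0, 0), (0, n_fft - win_length)))
    # 4. Discrete Fourier Transform: rfft keeps n_fft // 2 + 1 complex bins.
    return np.fft.rfft(frames, n=n_fft, axis=1)

spec = stft(np.random.randn(16000))
```

The output shapes match the slide: T/h frames, each with n_fft/2 + 1 complex bins.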

31 of 69

Inverse STFT (ISTFT)

Forward STFT (recap):

1. Window transform — shape: T / h × win_length

2. Window function

3. Padding — shape: t × n_fft

4. Discrete Fourier Transform — shape: t × (n_fft / 2 + 1), complex

Inverse:

1. Inverse DFT

2. Crop

3. Window function — yes, again, not its inverse! We want to make the output continuous and emphasize the center of each window

4. Overlap-add

Why "inverse", then?

https://pytorch.org/docs/stable/generated/torch.istft.html

32 of 69

Window multiplication in ISTFT

ISTFT works with arbitrary inputs, not only with STFT outputs

Overlap-add of raw signals can lead to discontinuities


33 of 69

Inverse STFT (ISTFT)

Forward STFT (recap):

1. Window transform — shape: T / h × win_length

2. Window function

3. Padding — shape: t × n_fft

4. Discrete Fourier Transform — shape: t × (n_fft / 2 + 1), complex

Inverse:

1. Inverse DFT

2. Crop

3. Window function

4. Overlap-add

https://pytorch.org/docs/stable/generated/torch.istft.html

34 of 69

Inverse STFT (ISTFT)

1. Window transform — shape: T / h × win_length

2. Window function

3. Window function

4. Overlap-add

The composition must be the identity!

https://pytorch.org/docs/stable/generated/torch.istft.html

35 of 69

Overlap-Add Operation

What is it?

36 of 69

Window Transform

win_size=4

hop_size=2
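With the slide's parameters (win_size=4, hop_size=2), the window transform looks like this:

```python
import numpy as np

# Window transform with win_size=4, hop_size=2.
x = np.arange(8)          # [0, 1, 2, 3, 4, 5, 6, 7]
win_size, hop_size = 4, 2
frames = np.stack([x[i:i + win_size]
                   for i in range(0, len(x) - win_size + 1, hop_size)])
# Each row overlaps the next one by win_size - hop_size = 2 samples:
# [[0 1 2 3]
#  [2 3 4 5]
#  [4 5 6 7]]
```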

37 of 69

Window Transform

38 of 69

Overlap-add is defined for any input

39 of 69

Overlap-add

Corresponding cells are summed and normalized

How normalized?

So that window transform + double window multiplication + overlap-add is exactly the inverse of the window transform

40 of 69

Overlap-add


The composition of sliding window transform, double window-multiplication and overlap-add should be identity!
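A sketch of this identity, assuming a Hann window: frames are windowed once before the (I)DFT and once more before overlap-add, and the sum is normalized by the summed squared window:

```python
import numpy as np

def frames_windowed(x, win, hop):
    """Sliding window transform with the first window multiplication applied."""
    n = 1 + (len(x) - len(win)) // hop
    return np.stack([x[i*hop : i*hop + len(win)] * win for i in range(n)])

def overlap_add(frames, win, hop, length):
    """Apply the second window multiplication, sum overlapping frames, normalize."""
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, f in enumerate(frames):
        out[i*hop : i*hop + len(win)] += f * win    # second ("double") window
        norm[i*hop : i*hop + len(win)] += win**2    # squared-window normalizer
    return out / np.maximum(norm, 1e-12)

win = np.hanning(512)
hop = 128
x = np.random.randn(4096)
y = overlap_add(frames_windowed(x, win, hop), win, hop, len(x))
```

Away from the signal edges, `y` reconstructs `x` exactly: each sample is multiplied by the summed squared window and then divided by it.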

41 of 69

Overlap-add in streaming inference

+

+

Output and dropped from the cache on the previous step

Output and dropped from the cache on this step

Stays in the cache after this step
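A sketch of the streaming variant, assuming normalization is handled elsewhere: each incoming frame is added into a cache, hop_size samples are emitted and dropped, and the rest stays cached for the next step:

```python
import numpy as np

def streaming_ola(frames, win, hop):
    """Per-frame overlap-add: emit hop samples per frame, keep the tail cached."""
    cache = np.zeros(len(win))
    out = []
    for f in frames:
        cache += f * win
        out.append(cache[:hop].copy())   # output and drop from the cache
        cache = np.roll(cache, -hop)     # shift the cache left by hop
        cache[-hop:] = 0.0               # freshly exposed tail starts at zero
    out.append(cache[:len(win) - hop].copy())  # flush the remaining tail
    return np.concatenate(out)
```

This produces exactly the same samples as offline (unnormalized) overlap-add, just one hop at a time.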

42 of 69

Complex Spectrum Recap

1. Interpretable

2. Invertible

3. Questions?

43 of 69

Let’s train a neural network!

44 of 69

Let’s train a neural network!

Wait a moment. Isn’t there a simpler method?

There is a branch of science, called Digital Signal Processing (DSP)

It delivers remarkable results, and DSP methods are typically far less computationally demanding.

But DSP methods are considered to perform poorly for single-microphone noise reduction with non-stationary noise

Neural networks confidently dominate the Deep Noise Suppression challenge (an open challenge for real-world noise reduction)

45 of 69

Let’s train a neural network!

Neural network

More data

46 of 69

Break!

47 of 69

U-Net (hourglass) structure

Encoder

Decoder

Reconstructs high-resolution details!

48 of 69

U-Net (hourglass) structure

https://web.cse.ohio-state.edu/~wang.77/papers/Tan-Wang.taslp20.pdf

PHASE-AWARE SPEECH ENHANCEMENT WITH DEEP COMPLEX U-NET

Cruse-v2 (loss functions)

PoCoNet

DfNet-2

Are these models suitable for real-time processing?

Some are, some are not.

49 of 69

Real-time processing

Real-time processing is critical for VQE applications

What does real-time mean for VQE? VQE works on short chunks with low latency.

We have 2 types of delay:

Algorithmic: how much future information is required to process current time frame?

Computational: how much time of computation do we need to process a time frame?

Real-time-factor (RTF): Time-process-one-frame / frame-duration.Should be < 1

What is frame duration for STFT?
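For STFT-based streaming, the frame duration is the hop size, since each step brings hop_size fresh samples. A worked example with hypothetical timings:

```python
# Real-time factor for STFT-based processing: the "frame duration" is the hop,
# because each new frame contributes hop_size fresh samples of audio.
sample_rate = 16000
hop_size = 160                       # 10 ms of fresh audio per frame
frame_duration = hop_size / sample_rate
time_to_process_one_frame = 0.004    # hypothetical: 4 ms of compute per frame
rtf = time_to_process_one_frame / frame_duration
# rtf = 0.4 < 1, so this (hypothetical) model keeps up with real time
```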

50 of 69

U-Net (hourglass) structure

What does a model need for real-time processing?

No down-sampling on the time axis; down-sampling on the frequency axis is still OK. Note: in vision, the horizontal and vertical directions have similar (spatial) semantics. In audio, one direction stands for time and the other for frequency.

Causal layers

51 of 69

Causal Convolution

PyTorch: left padding
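A sketch of causal convolution via left padding (a plain-numpy stand-in for PyTorch's F.pad + Conv1d): padding the input on the left by kernel_size − 1 guarantees the output at time t depends only on inputs up to t.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """1-D causal convolution: left-pad by (K - 1) so output[t] uses only x[:t+1]."""
    k = len(kernel)
    xp = np.concatenate([np.zeros(k - 1), x])  # left padding, PyTorch-style
    # y[t] = sum_j kernel[j] * x[t - j]
    return np.array([np.dot(xp[t:t + k], kernel[::-1]) for t in range(len(x))])
```

Causality check: perturbing a future sample must not change any earlier output.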

52 of 69

Causal Layers

Non-Causal → Causal

Convolution → Causal Convolution

Bidirectional RNN (LSTM, GRU) → Unidirectional (regular) RNN (LSTM, GRU)

Transformer Encoder → Transformer Decoder

53 of 69

Spectral Magnitude Mapping.

A Convolutional Recurrent Neural Network for Real-Time Speech

Enhancement, 2018

Contribution: CRN beats LSTM baseline

Like many other methods, it only estimates the magnitude and does not estimate the phase

Softplus enforces that the estimated magnitude is >= 0

Training target: MSE between the estimated and clean magnitudes

54 of 69

Magnitude and Phase

Given a complex spectrogram S

Magnitude: |S|

Phase: angle(S)

Noisy phase can be used directly with enhanced magnitude to some extent
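A sketch of this reconstruction; the "enhanced" magnitude here is a dummy stand-in for a network prediction:

```python
import numpy as np

rng = np.random.default_rng(4)
noisy_spec = rng.standard_normal((100, 257)) + 1j * rng.standard_normal((100, 257))

# Suppose the network predicts an enhanced magnitude (here: a dummy stand-in).
enhanced_mag = 0.5 * np.abs(noisy_spec)

# Combine the enhanced magnitude with the noisy phase.
enhanced_spec = enhanced_mag * np.exp(1j * np.angle(noisy_spec))
```

The result keeps the predicted magnitude exactly while copying the phase of the noisy input, cell by cell.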

55 of 69

Complex Spectral Mapping. Two heads.

From magnitude spectral mapping to complex spectral mapping

Complex Spectral Mapping:

Complex spectrogram estimated directly

56 of 69

Basic approach. Complex Spectral Mapping. Two heads.

Network topology

Chosen as the best trade-off between quality and model size

57 of 69

Basic approach. Complex Spectral Mapping. Two heads.

Network topology

58 of 69

ICASSP 2021 DEEP NOISE SUPPRESSION CHALLENGE: DECOUPLING MAGNITUDE AND PHASE OPTIMIZATION WITH A TWO-STAGE DEEP NETWORK

TSCN-PP

Real-time

State-of-the-art in 2021

59 of 69

ICASSP 2021 DEEP NOISE SUPPRESSION CHALLENGE: DECOUPLING MAGNITUDE AND PHASE OPTIMIZATION WITH A TWO-STAGE DEEP NETWORK

TSCN-PP

3 stages:

1. Coarse magnitude estimation (CME)

2. Complex spectrum refinement (CSR)

3. Post processing

60 of 69

ICASSP 2021 DEEP NOISE SUPPRESSION CHALLENGE: DECOUPLING MAGNITUDE AND PHASE OPTIMIZATION WITH A TWO-STAGE DEEP NETWORK

TSCN-PP

Network Topology

U-Net:

Encoder

Engine

Decoder

61 of 69

ICASSP 2021 DEEP NOISE SUPPRESSION CHALLENGE: DECOUPLING MAGNITUDE AND PHASE OPTIMIZATION WITH A TWO-STAGE DEEP NETWORK. TCM. Inverted residual Convolutional Block

Pointwise Conv (channel expansion)

Depthwise Conv

Pointwise Conv (channel compression)

Residual connection

62 of 69

ICASSP 2021 DEEP NOISE SUPPRESSION CHALLENGE: DECOUPLING MAGNITUDE AND PHASE OPTIMIZATION WITH A TWO-STAGE DEEP NETWORK. TCM.

Convolutional blocks

Stacks with exponentially growing dilation factors

e.g. 1, 2, 4, 8, 16, 32, 1, 2, 4, 8, 16, 32

Handles both short-term and long-term dependencies
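The receptive field of such a stack can be computed directly; kernel size 3 is an assumption here:

```python
# Receptive field of stacked dilated convolutions with kernel size K:
# each layer adds (K - 1) * dilation frames of context.
kernel_size = 3
dilations = [1, 2, 4, 8, 16, 32] * 2   # two stacks, as on the slide
receptive_field = 1 + sum((kernel_size - 1) * d for d in dilations)
# 1 + 2 * (1 + 2 + 4 + 8 + 16 + 32) * 2 = 253 frames of context
```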

63 of 69

ICASSP 2021 DEEP NOISE SUPPRESSION CHALLENGE: DECOUPLING MAGNITUDE AND PHASE OPTIMIZATION WITH A TWO-STAGE DEEP NETWORK. TCM. Conv block modified.

What is new:

1. Inner convs are dilated but no longer depthwise

2. Channel dimension is compressed instead of expanded

3. Gating mechanism

4. Two dilation factors in one DTCM layer: 2^r and 2^(M−r)

64 of 69

ICASSP 2021 DEEP NOISE SUPPRESSION CHALLENGE: DECOUPLING MAGNITUDE AND PHASE OPTIMIZATION WITH A TWO-STAGE DEEP NETWORK. How to train.

Stage 1: CME-Net, loss: MSE for the magnitude

Stage 2: CME-Net and CSR-Net trained jointly, loss: MSE for the complex spectrum + MSE for the magnitude

65 of 69

Post-Processing. Inspired by RNNoise.

Combined ML-DSP (digital signal processing) approach

66 of 69

ICASSP 2021 DEEP NOISE SUPPRESSION CHALLENGE: DECOUPLING MAGNITUDE AND PHASE OPTIMIZATION WITH A TWO-STAGE DEEP NETWORK. Finally.

Real-time

State-of-the-art in 2021:

DNS Challenge winner

67 of 69

What next?

Neural network architectures:

- Dual path architectures

Model representation:

- Magnitude spectrum mapping/masking

- Complex spectrum mapping/masking

- Deep Filter

- Waveform domain, TasNet (exotic)

VQE as front-end for other models:

- ASR, KWS, Biometrics

Digital Signal Processing and multi-microphone techniques:

Other tasks:

- Acoustic echo canceling (AEC): next lecture

- Blind source separation, music source separation, audio-visual source separation

- Bandwidth extension

Generative VQE

Stay tuned!

68 of 69

What next?

Narrowband (4 kHz)

Fullband, restored (16 kHz)

69 of 69

Mask representation

Magnitude GeLU

Complex Mask

Complex-As-Channels

Deep Filter

Multi-stage approach