Speech Processing
Episode 10
Voice Quality Enhancement
Part 2
VQE Tasks
Lecture Plan
Intro to AEC
Audio feedback
What if 2 devices join a conference in a single room?
Demo
https://www.youtube.com/watch?v=7tFPAEmGijM
What if 2 devices join a conference in a single room?
This effect is not always handled by modern communication systems
Audio feedback
It is desirable to break the connection between the speaker and the microphone
Audio feedback
Then the call would proceed normally
AEC reduces audio feedback
AEC
The connection cannot be broken physically, but we can compensate for it with AEC algorithms
Noise reduction recap
Task:
Restore the speech signal from a mixture
[Illustration: speech + noise = mixture]
AEC reduces audio feedback
AEC
[Diagram: the far-end (reference) signal is played by the loudspeaker and comes back as echo; Microphone = near-end signal + echo]
Notes on AEC
Difference with noise reduction:
Noise is part of the near-end signal; AEC is not required to remove it.
However, noise reduction and AEC can be done jointly.
Equivalent setting:
Given the reference signal and the microphone signal, estimate the echo (it plays the role of the noise):
Microphone = near_end + echo
Properties:
Input SER (signal-to-echo ratio) can be very low (~ -30 dB).
Thanks to the reference channel, the performance of AEC (SER gain) can be much higher than the performance of noise reduction (SNR gain)
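For concreteness, a minimal numpy sketch of how SER could be computed from the ground-truth components (the function name and toy signals are illustrative, not from the lecture):

```python
import numpy as np

def ser_db(near_end: np.ndarray, echo: np.ndarray) -> float:
    """Signal-to-echo ratio in dB: near-end speech power over echo power."""
    eps = 1e-12
    return 10.0 * np.log10((np.mean(near_end ** 2) + eps) / (np.mean(echo ** 2) + eps))

# Toy example: echo amplitude ~30x the near-end amplitude -> SER around -30 dB
rng = np.random.default_rng(0)
near_end = 0.03 * rng.standard_normal(16000)
echo = rng.standard_normal(16000)
print(f"{ser_db(near_end, echo):.1f} dB")
```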
QUESTION
Intro to AEC
Acoustics basics
The ray direction is not accurate and was selected for illustration purposes
Sound propagates from source to microphone in an environment
A great YouTube playlist: https://www.youtube.com/watch?v=yGeXEwdNd_s&list=PL_QS1A2ZqaG7p50cd0AgLeG9Q3TN64vZJ
A room with source and microphone in it defines a transfer function
[Diagram: sources x1, x2 pass through the room transfer function h to give microphone signals y1, y2]
y1[t] = h(x1)[t]
y2[t] = h(x2)[t]
Room acoustics transfer function properties
1. Linearity
2. Time-invariance
LTI: Linear time-invariant system
Room acoustics transfer function is an LTI system.
Why? And what does LTI mean?
Linearity
1. h(ax) = ah(x) for any scalar a
2. h(x1 + x2) = h(x1) + h(x2) for signals x1, x2
[Illustration: input x[t] and scaled input a·x[t] (a = 0.5); the output scales accordingly]
h(ax)[t] = a·h(x)[t]
Linearity
1. h(ax) = ah(x) for any scalar a
2. h(x1 + x2) = h(x1) + h(x2) for signals x1, x2
[Illustration: inputs x1[t], x2[t] and their sum (x1 + x2)[t]; the outputs add up]
h(x1 + x2)[t] = h(x1)[t] + h(x2)[t]
Linearity
1. h(ax) = ah(x) for any scalar a
2. h(x1 + x2) = h(x1) + h(x2) for signals x1, x2
Why is the room acoustics transfer function linear?
Verified in experiments.
Sound propagation is described by the wave equation, which is linear: ∂²p/∂t² = c²∇²p
Time-invariance
h(Tx) = Th(x) for any time shift T
[Illustration: input x[t] and its time-shifted copy Tx[t]; the output is shifted by the same amount]
h(Tx)[t] = T(h(x))[t]
Time-invariance
h(Tx) = Th(x) for any time shift T
Why is the room acoustics transfer function time-invariant?
We assume the room is static.
Feels intuitive.
Also follows from the wave equation.
Room acoustics transfer function properties
1. Linearity
2. Time-invariance
LTI: Linear time-invariant system
Room acoustics transfer function is an LTI system.
What can we get from it?
When do LTI assumptions break?
1. Real loudspeakers and software loop-backs
2. Time variance in a room: walking around with a smartphone, people moving in the room
3. The physical model assumes that sound waves don’t affect the medium they propagate in. With high sound pressures (e.g. explosion) this assumption loses its accuracy.
Room acoustics transfer function properties
1. Linearity
2. Time-invariance
LTI: Linear time-invariant system
Room acoustics transfer function is an LTI system.
What can we get from it?
QUESTION
Convolution
LTI Theorem
LTI:
1. h(ax) = ah(x) for any scalar a
2. h(x1 + x2) = h(x1) + h(x2) for signals x1, x2
Theorem: for an LTI system, h(x)[t] = (x * h_δ)[t], where h_δ = h(δ) is the impulse response of the system.
LTI Theorem proof
What does the convolution formula mean?
Write the input as a sum of scaled, shifted impulses:
x[t] = Σ_k x[k] · δ[t − k]
By linearity the output is the same combination of the responses to the shifted impulses, and by time-invariance each of those is a shifted copy of h(δ):
h(x)[t] = Σ_k x[k] · h(δ)[t − k] = (x * h_δ)[t]
Room acoustics is fully defined by its impulse response
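A minimal numpy sketch (the decaying synthetic impulse response is just an assumption for illustration) checking that a convolution-based room model indeed behaves as an LTI system:

```python
import numpy as np

rng = np.random.default_rng(0)
rir = rng.standard_normal(512) * np.exp(-np.arange(512) / 100.0)  # toy decaying RIR

def room(x):
    """Toy static room: the output is the input convolved with the impulse response."""
    return np.convolve(x, rir)

x1, x2 = rng.standard_normal(2000), rng.standard_normal(2000)

# Linearity: h(a*x1 + b*x2) == a*h(x1) + b*h(x2)
a, b = 0.5, -2.0
assert np.allclose(room(a * x1 + b * x2), a * room(x1) + b * room(x2))

# Time-invariance: delaying the input delays the output by the same amount
shift = 37
y = room(x1)
y_shifted = room(np.concatenate([np.zeros(shift), x1]))
assert np.allclose(y_shifted[shift:shift + len(y)], y)
```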
LTI Theorem
y[t] = (x * h_δ)[t]: the room output is the input convolved with the room impulse response (RIR)
Examples
QUESTION
Break
Fast convolution. Convolution Theorem. Why else we love the STFT.
Suppose we have a signal x of length T and a convolution kernel (a.k.a. impulse response) h of length t. We want to compute y = x * h. What is the computational complexity?
Naive: O(T · t)
Fast convolution. Convolution Theorem. Why else we love the STFT.
Suppose we have a signal x of length T and a convolution kernel (a.k.a. impulse response) h of length t. We want to compute y = x * h. What is the computational complexity?
FFT: O(T log T + t log t)
X = FFT(x)
H = FFT(h)
Y = X ⋅ H
y = IFFT(Y)
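A minimal numpy sketch of the recipe above. One detail not shown on the slide: to obtain the linear (non-circular) convolution, both signals are zero-padded to length T + t − 1 before the FFT:

```python
import numpy as np

def fast_convolve(x: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Linear convolution via the convolution theorem: pad, FFT, multiply, inverse FFT."""
    n = len(x) + len(h) - 1          # length of x * h
    X = np.fft.rfft(x, n)            # rfft zero-pads the input to length n
    H = np.fft.rfft(h, n)
    return np.fft.irfft(X * H, n)    # point-wise product back to the time domain

rng = np.random.default_rng(0)
x, h = rng.standard_normal(16000), rng.standard_normal(512)
assert np.allclose(fast_convolve(x, h), np.convolve(x, h))
```

Without the zero-padding, the point-wise product of the FFTs corresponds to a circular convolution.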
Fast convolution. Convolution Theorem. Why else we love the STFT.
Question: What are the sizes of the tensors involved?
Question: What is the size of x * h?
⋅ denotes point-wise multiplication
Fast convolution. Convolution Theorem. Why else we love the STFT.
Complexity: O(N log N)
Fast convolution. Convolution Theorem. Why else we love the STFT.
Corollary: an LTI system (static room acoustics) never creates new frequencies; each frequency component is only scaled and phase-shifted
QUESTION
Waveform domain LMS
Acoustics:
mic = ref * rir + noise
Noise is uncorrelated with ref.
LMS: MSE(mic, ref * rir_est) → min
echo_est = ref * rir_est
near_end_est = mic − echo_est [this is how we will always form the output from here on]
How to minimize MSE?
Analytically or with Gradient Descent
LMS: least mean squares
Advantage: the simplest method.
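To illustrate the "analytically" option: with the linear echo model, the MSE is quadratic in rir_est, so the minimizer can be found by ordinary least squares over a Toeplitz matrix of reference samples. A minimal offline sketch (function and parameter names are mine, not the lecture's):

```python
import numpy as np
from scipy.linalg import toeplitz

def estimate_rir_ls(mic: np.ndarray, ref: np.ndarray, rir_len: int) -> np.ndarray:
    """Least-squares RIR estimate: argmin_rir ||mic - ref * rir||^2 (offline)."""
    A = toeplitz(ref, np.zeros(rir_len))     # A[t, k] = ref[t - k], so A @ rir = ref * rir
    rir_est, *_ = np.linalg.lstsq(A, mic, rcond=None)
    return rir_est

rng = np.random.default_rng(0)
ref = rng.standard_normal(4000)
rir = rng.standard_normal(64) * np.exp(-np.arange(64) / 16.0)
mic = np.convolve(ref, rir)[:len(ref)] + 0.01 * rng.standard_normal(len(ref))
print(np.max(np.abs(estimate_rir_ls(mic, ref, 64) - rir)))   # small estimation error
```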
Streaming waveform domain LMS
Acoustics:
mic = ref * rir + noise
Noise is uncorrelated with ref.
LMS: MSE(mic, ref * rir_est) → min
Gradient descent on new samples as they arrive
echo_est[t] = <rir_est^t, ref[t – T : t + 1]>
err[t] = (mic[t] – echo_est[t])^2
grad = grad(err[t], rir_est^t)
rir_est^{t+1} = rir_est^t − alpha · grad
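A minimal streaming LMS sketch following the update above; rir_len (the assumed RIR length T) and alpha are illustrative choices, and the factor 2 from the gradient is absorbed into alpha:

```python
import numpy as np

def lms_stream(mic, ref, rir_len=256, alpha=0.01):
    """Sample-by-sample LMS AEC: adapt rir_est, output the near-end estimate."""
    rir_est = np.zeros(rir_len)
    ref_pad = np.concatenate([np.zeros(rir_len - 1), ref])
    near_end_est = np.zeros(len(mic))
    for t in range(len(mic)):
        frame = ref_pad[t:t + rir_len][::-1]      # ref[t], ref[t-1], ..., ref[t-T+1]
        echo_est = rir_est @ frame                # <rir_est^t, recent reference samples>
        err = mic[t] - echo_est
        near_end_est[t] = err                     # mic - echo_est, as above
        rir_est += alpha * err * frame            # gradient step on (mic[t] - echo_est)^2
    return near_end_est, rir_est

# Toy check: pure echo through a short "RIR", no near-end speech
rng = np.random.default_rng(0)
ref = rng.standard_normal(20000)
mic = np.convolve(ref, [0.5, 0.3, -0.2])[:len(ref)]
out, _ = lms_stream(mic, ref, rir_len=3, alpha=0.05)
print(np.mean(out[-1000:] ** 2))                  # residual echo power, close to zero
```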
Streaming waveform domain NLMS
Acoustics:
mic = ref * rir + noise
Noise is uncorrelated with ref.
NLMS: normalized LMS. A modification of LMS which is invariant to multiplication by a scalar.
echo_est[t] = <rir_est^t, ref[t – T : t + 1]>
err[t] = (mic[t] – echo_est[t])^2 / mean(mic[t – T : t + 1] ^ 2)
grad = grad(err[t], rir_est^t)
rir_est^{t+1} = rir_est^t − alpha · grad
More robust with regard to learning rate.
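The same loop with the normalization from the slide (recent microphone power); a common alternative normalizes by the energy of the reference frame instead. Step size and eps are illustrative:

```python
import numpy as np

def nlms_stream(mic, ref, rir_len=256, alpha=0.1, eps=1e-8):
    """Streaming NLMS: like LMS above, but the step is divided by recent signal power."""
    rir_est = np.zeros(rir_len)
    ref_pad = np.concatenate([np.zeros(rir_len - 1), ref])
    mic_pad = np.concatenate([np.zeros(rir_len - 1), mic])
    near_end_est = np.zeros(len(mic))
    for t in range(len(mic)):
        frame = ref_pad[t:t + rir_len][::-1]
        err = mic[t] - rir_est @ frame
        near_end_est[t] = err
        power = np.mean(mic_pad[t:t + rir_len] ** 2) + eps   # mean(mic[t-T : t+1]^2)
        rir_est += alpha * err * frame / power               # scale-invariant update
    return near_end_est, rir_est
```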
Block-LMS [towards frequency-domain LMS]
Just like streaming waveform domain LMS/NLMS but on chunks of sound instead of single samples.
Advantage: speed
Disadvantage: latency
Frequency domain LMS
Note that according to the linear model frequencies do not mix.
Heuristically, with frequency-domain LMS they do.
Check it in your homework!
Neural post-filter
How to train a neural network for AEC?
The same way we did it for noise reduction, but:
1. Consider cascading your network after a linear AEC
2. For data generation, employ augmentations that introduce loudspeaker nonlinearities
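A minimal sketch of point 2: distort the far-end signal with a simple memoryless nonlinearity (soft clipping here, purely as an assumed example) before convolving it with the RIR, so that the generated echo contains components a linear AEC cannot model:

```python
import numpy as np

def make_aec_training_pair(near_end, far_end, rir, clip=0.3, snr_db=20.0, seed=0):
    """Synthesize a (mic, ref) pair with a nonlinear 'loudspeaker' in the echo path."""
    rng = np.random.default_rng(seed)
    distorted = np.tanh(far_end / clip) * clip            # soft-clipping loudspeaker model
    echo = np.convolve(distorted, rir)[:len(near_end)]    # linear room after the nonlinearity
    noise = rng.standard_normal(len(near_end))
    noise *= np.sqrt(np.mean(near_end ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    mic = near_end + echo + noise                         # what the microphone records
    return mic, far_end                                   # the reference stays the clean far-end
```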
QUESTION
Beamforming. Motivation. DAS.
https://aliexpress.ru/item/32841978358.html?sku_id=12000028573970654
Various consumer devices feature microphone arrays
Yandex Station Mini
ReSpeaker USB Mic Array V2.0
Why?
Delay and Sum (DAS) Beamforming
http://www.labbookpages.co.uk/audio/beamforming/delaySum.html
Microphone arrays enable spatial filtering!
Sound propagates through the space as waves.
The times of arrival are different for different microphones.
Delay and Sum Beamforming. Linear array.
1. We assume the speaker is much further away from the microphones than the microphones are from each other. Thus the wave is planar and we can characterize the speaker position by its angle.
2. For any angle θ the signals at the microphones will be identical up to a time shift of d·sin(θ)/c between adjacent microphones (d: spacing, c: speed of sound).
3. Given a target direction, we delay and sum the signals to compensate for the delays from the target direction. That is the beamforming.
4. For any other direction and any frequency we evaluate the gain to understand how well that direction is suppressed.
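A minimal numpy sketch of steps 2–4 for a uniform linear array, with the delays applied as phase shifts at a single frequency; n_mics, d, f, c match the example on the next slide but are otherwise arbitrary assumptions:

```python
import numpy as np

def das_gain_db(theta_deg, target_deg=0.0, n_mics=4, d=0.2, f=1000.0, c=343.0):
    """Gain of a delay-and-sum beamformer for a plane wave from angle theta (far field)."""
    m = np.arange(n_mics)
    def tau(deg):                                             # arrival delay at microphone m
        return m * d * np.sin(np.deg2rad(deg)) / c
    w = np.exp(-2j * np.pi * f * tau(target_deg)) / n_mics    # DAS weights (steered delays)
    a = np.exp(-2j * np.pi * f * tau(theta_deg))              # incoming plane-wave phases
    return 20 * np.log10(np.abs(np.conj(w) @ a) + 1e-12)

print(das_gain_db(0.0))    # target direction: 0 dB, passed unchanged
print(das_gain_db(60.0))   # off-target direction: strongly attenuated at this f and d
```

Sweeping theta_deg over −90°…90° reproduces the kind of directivity patterns shown on the following slides.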
Delay and Sum Beamforming. Directivity Pattern.
[Beam pattern plot: DAS beamforming, 4 mics, linear array, d = 0.2 m, f = 1000 Hz]
As a result, the microphone array turns into a directional microphone with a directivity pattern controlled by us.
Is it a good result?
The array is huge in size and the frequency is relatively high.
Delay and Sum Beamforming. Directivity Pattern.
[Beam pattern plot: DAS beamforming, 4 mics, linear array, d = 0.03 m, f = 400 Hz]
In a more practical setting the suppression is around 1.73 dB, which is pretty weak
Can we do better?
Frequency domain beamforming
In DAS we delayed and summed the signals in waveform domain.
Delay can be expressed as a convolution with a shifted impulse: x(t − τ) = (x * δ(· − τ))(t).
By the convolution theorem, in the time-frequency domain this becomes a per-frequency multiplication (for a pure delay: a phase factor e^(−i2πfτ) in each bin):
Typically beamforming is formulated in time-frequency domain
We can use arbitrary w_k with the only condition of preserving the target signal!
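A minimal sketch of applying per-frequency weights w_k in the STFT domain (scipy's stft/istft are used here for convenience; the weight matrix w could hold the DAS steering vectors above or the MVDR weights from the next slide):

```python
import numpy as np
from scipy.signal import stft, istft

def beamform_stft(mics, w, fs=16000, nperseg=512):
    """mics: (n_mics, n_samples); w: (n_freqs, n_mics) complex weights per frequency bin."""
    _, _, X = stft(mics, fs=fs, nperseg=nperseg)    # X: (n_mics, n_freqs, n_frames)
    Y = np.einsum('fm,mft->ft', np.conj(w), X)      # weighted sum over microphones per bin
    _, y = istft(Y, fs=fs, nperseg=nperseg)
    return y
```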
MVDR Beamforming
MVDR: Minimal Variance Distortionless Response Beamformer
Unlike DAS, which depends only on the target direction, MVDR also depends on the input signal.
MVDR minimizes the noise (or overall signal) energy under the constraint that the signal coming from the target direction stays unchanged: min_w wᴴRw subject to wᴴd = 1
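This constrained minimization has the textbook closed form w = R⁻¹d / (dᴴR⁻¹d) per frequency bin, where R is the noise (or input) spatial covariance and d the steering vector for the target direction. A minimal sketch with diagonal loading for numerical stability (the regularization value is an assumption):

```python
import numpy as np

def mvdr_weights(R, d, reg=1e-6):
    """MVDR weights for one frequency bin: min wᴴRw subject to wᴴd = 1."""
    n = len(d)
    R = R + reg * (np.trace(R).real / n) * np.eye(n)   # diagonal loading
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (np.conj(d) @ Rinv_d)
```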
A good tutorial for beamforming can be found here:
https://pysdr.org/content/doa.html
https://pysdr.org/content/doa.html#mvdr-capon-beamformer
What is left
Check your homework:
1. Spectral domain LMS AEC
Not covered in the course:
1. Advanced linear AEC techniques: Kalman filter, RLS, double-talk detection
2. Advanced array techniques and beamforming: MVDR, GEVD, DOA estimation, end-to-end approaches: https://pysdr.org/content/doa.html
3. Combination of neural networks for single channel noise reduction with GEVD beamforming: https://groups.uni-paderborn.de/nt/pubs/2016/icassp_2016_heymann_paper.pdf
4. Deep learning for AEC: https://arxiv.org/abs/2306.03177
Thank you!