Speech Processing
Episode 10
Voice Quality Enhancement
Part 2
VQE Tasks
Lecture Plan
Intro to AEC
Audio feedback
What if 2 devices join a conference in a single room?
Demo
https://www.youtube.com/watch?v=7tFPAEmGijM
What if 2 devices join a conference in a single room?
This effect is not always handled by modern communication systems
Audio feedback
It is desirable to break the connection between the speaker and the microphone
Audio feedback
Then the call would proceed normally
AEC reduces audio feedback
AEC
The connection cannot be broken physically, but we can compensate for it with AEC algorithms
Noise reduction recap
Task:
Restore the speech signal from a mixture
[Illustration: speech + noise = mixture]
AEC reduces audio feedback
AEC
[Diagram: the far-end (reference) signal is played by the loudspeaker and comes back as echo; Microphone = near-end signal + echo]
Notes on AEC
Difference with noise reduction:
Noise is part of the near-end signal; AEC is not required to remove it.
However, noise reduction and AEC can be done jointly.
Equivalent setting:
Given the reference signal and the microphone signal, estimate the echo (it plays the role of the noise):
Microphone = near_end + echo
Properties:
Input SER (signal-to-echo ratio) can be very low (~ -30 dB).
Thanks to the reference channel, the performance of AEC (SER gain) can be much higher than the performance of noise reduction (SNR gain)
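For concreteness, a minimal numpy sketch of how SER could be computed from the ground-truth components (the function name and toy signals are illustrative, not from the lecture):

```python
import numpy as np

def ser_db(near_end: np.ndarray, echo: np.ndarray) -> float:
    """Signal-to-echo ratio in dB: near-end speech power over echo power."""
    eps = 1e-12
    return 10.0 * np.log10((np.mean(near_end ** 2) + eps) / (np.mean(echo ** 2) + eps))

# Toy example: echo amplitude ~30x the near-end amplitude -> SER around -30 dB
rng = np.random.default_rng(0)
near_end = 0.03 * rng.standard_normal(16000)
echo = rng.standard_normal(16000)
print(f"{ser_db(near_end, echo):.1f} dB")
```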
QUESTION
Intro to AEC
Acoustics basics
The ray direction is not accurate and was selected for illustration purposes
Sound propagates from source to microphone in an environment
A great YouTube playlist: https://www.youtube.com/watch?v=yGeXEwdNd_s&list=PL_QS1A2ZqaG7p50cd0AgLeG9Q3TN64vZJ
A room with source and microphone in it defines a transfer function
[Diagram: sources x1, x2 pass through the room transfer function h to give microphone signals y1, y2]
y1[t] = h(x1)[t]
y2[t] = h(x2)[t]
Room acoustics transfer function properties
1. Linearity
2. Time-invariance
LTI: Linear time-invariant system
Room acoustics transfer function is an LTI system.
Why? And what does LTI mean?
Linearity
1. h(ax) = ah(x) for any scalar a
2. h(x1 + x2) = h(x1) + h(x2) for signals x1, x2
[Illustration: input x[t] and scaled input a·x[t] (a = 0.5); the output scales accordingly]
h(ax)[t] = a·h(x)[t]
Linearity
1. h(ax) = ah(x) for any scalar a
2. h(x1 + x2) = h(x1) + h(x2) for signals x1, x2
[Illustration: inputs x1[t], x2[t] and their sum (x1 + x2)[t]; the outputs add up]
h(x1 + x2)[t] = h(x1)[t] + h(x2)[t]
Linearity
1. h(ax) = ah(x) for any scalar a
2. h(x1 + x2) = h(x1) + h(x2) for signals x1, x2
Why is the room acoustics transfer function linear?
Verified in experiments.
Sound propagation is described by the wave equation, which is linear: ∂²p/∂t² = c²∇²p
Time-invariance
h(Tx) = Th(x) for any time shift T
[Illustration: input x[t] and its time-shifted copy Tx[t]; the output is shifted by the same amount]
h(Tx)[t] = T(h(x))[t]
Time-invariance
h(Tx) = Th(x) for any time shift T
Why is the room acoustics transfer function time-invariant?
We assume the room is static.
Feels intuitive.
Also follows from the wave equation.
Room acoustics transfer function properties
1. Linearity
2. Time-invariance
LTI: Linear time-invariant system
Room acoustics transfer function is an LTI system.
What can we get from it?
When do LTI assumptions break?
1. Real loudspeakers and software loop-backs
2. Time variance in a room: walking around with a smartphone, people moving in the room
3. The physical model assumes that sound waves don’t affect the medium they propagate in. With high sound pressures (e.g. explosion) this assumption loses its accuracy.
Room acoustics transfer function properties
1. Linearity
2. Time-invariance
LTI: Linear time-invariant system
Room acoustics transfer function is an LTI system.
What can we get from it?
QUESTION
Convolution
LTI Theorem
LTI:
1. h(ax) = ah(x) for any scalar a
2. h(x1 + x2) = h(x1) + h(x2) for signals x1, x2
Theorem: for an LTI system, h(x)[t] = (x * h_δ)[t], where h_δ = h(δ) is the impulse response of the system.
LTI Theorem proof
What does the convolution formula mean?
Write the input as a sum of scaled, shifted impulses:
x[t] = Σ_k x[k] · δ[t − k]
By linearity the output is the same combination of the responses to the shifted impulses, and by time-invariance each of those is a shifted copy of h(δ):
h(x)[t] = Σ_k x[k] · h(δ)[t − k] = (x * h_δ)[t]
Room acoustics is fully defined by its impulse response
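A minimal numpy sketch (the decaying synthetic impulse response is just an assumption for illustration) checking that a convolution-based room model indeed behaves as an LTI system:

```python
import numpy as np

rng = np.random.default_rng(0)
rir = rng.standard_normal(512) * np.exp(-np.arange(512) / 100.0)  # toy decaying RIR

def room(x):
    """Toy static room: the output is the input convolved with the impulse response."""
    return np.convolve(x, rir)

x1, x2 = rng.standard_normal(2000), rng.standard_normal(2000)

# Linearity: h(a*x1 + b*x2) == a*h(x1) + b*h(x2)
a, b = 0.5, -2.0
assert np.allclose(room(a * x1 + b * x2), a * room(x1) + b * room(x2))

# Time-invariance: delaying the input delays the output by the same amount
shift = 37
y = room(x1)
y_shifted = room(np.concatenate([np.zeros(shift), x1]))
assert np.allclose(y_shifted[shift:shift + len(y)], y)
```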
LTI Theorem
y[t] = (x * h_δ)[t]: the room output is the input convolved with the room impulse response (RIR)
Examples
QUESTION
Break
Fast convolution. Convolution Theorem. Why else we love the STFT.
Suppose we have a signal x of length T and a convolution kernel (a.k.a. impulse response) h of length t. We want to compute y = x * h. What is the computational complexity?
Naive: O(T · t)
Fast convolution. Convolution Theorem. Why else we love the STFT.
Suppose we have a signal x of length T and a convolution kernel (a.k.a. impulse response) h of length t. We want to compute y = x * h. What is the computational complexity?
FFT: O(T log T + t log t)
X = FFT(x)
H = FFT(h)
Y = X ⋅ H
y = IFFT(Y)
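A minimal numpy sketch of the recipe above. One detail not shown on the slide: to obtain the linear (non-circular) convolution, both signals are zero-padded to length T + t − 1 before the FFT:

```python
import numpy as np

def fast_convolve(x: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Linear convolution via the convolution theorem: pad, FFT, multiply, inverse FFT."""
    n = len(x) + len(h) - 1          # length of x * h
    X = np.fft.rfft(x, n)            # rfft zero-pads the input to length n
    H = np.fft.rfft(h, n)
    return np.fft.irfft(X * H, n)    # point-wise product back to the time domain

rng = np.random.default_rng(0)
x, h = rng.standard_normal(16000), rng.standard_normal(512)
assert np.allclose(fast_convolve(x, h), np.convolve(x, h))
```

Without the zero-padding, the point-wise product of the FFTs corresponds to a circular convolution.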
Fast convolution. Convolution Theorem. Why else we love the STFT.
Question: What are the sizes of the tensors involved?
Question: What is the size of x * h?
⋅ denotes point-wise multiplication
Fast convolution. Convolution Theorem. Why else we love the STFT.
Complexity: O(N log N)
Fast convolution. Convolution Theorem. Why else we love the STFT.
Corollary: an LTI system (static room acoustics) never creates new frequencies; each frequency component is only scaled and phase-shifted
QUESTION
Waveform domain LMS
Acoustics:
mic = ref * rir + noise
Noise is uncorrelated with ref.
LMS: MSE(mic, ref * rir_est) → min
echo_est = ref * rir_est
near_end_est = mic − echo_est [this is how we will always form the output from here on]
How to minimize MSE?
Analytically or with Gradient Descent
LMS: least mean squares
Advantage: the simplest method.
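To illustrate the "analytically" option: with the linear echo model, the MSE is quadratic in rir_est, so the minimizer can be found by ordinary least squares over a Toeplitz matrix of reference samples. A minimal offline sketch (function and parameter names are mine, not the lecture's):

```python
import numpy as np
from scipy.linalg import toeplitz

def estimate_rir_ls(mic: np.ndarray, ref: np.ndarray, rir_len: int) -> np.ndarray:
    """Least-squares RIR estimate: argmin_rir ||mic - ref * rir||^2 (offline)."""
    A = toeplitz(ref, np.zeros(rir_len))     # A[t, k] = ref[t - k], so A @ rir = ref * rir
    rir_est, *_ = np.linalg.lstsq(A, mic, rcond=None)
    return rir_est

rng = np.random.default_rng(0)
ref = rng.standard_normal(4000)
rir = rng.standard_normal(64) * np.exp(-np.arange(64) / 16.0)
mic = np.convolve(ref, rir)[:len(ref)] + 0.01 * rng.standard_normal(len(ref))
print(np.max(np.abs(estimate_rir_ls(mic, ref, 64) - rir)))   # small estimation error
```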
Streaming waveform domain LMS
Acoustics:
mic = ref * rir + noise
Noise is uncorrelated with ref.
LMS: MSE(mic, ref * rir_est) → min
Gradient descent on new samples as they arrive
echo_est[t] = <rir_est^t, ref[t – T : t + 1]>
err[t] = (mic[t] – echo_est[t])^2
grad = grad(err[t], rir_est^t)
rir_est^{t+1} = rir_est^t − alpha · grad
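A minimal streaming LMS sketch following the update above; rir_len (the assumed RIR length T) and alpha are illustrative choices, and the factor 2 from the gradient is absorbed into alpha:

```python
import numpy as np

def lms_stream(mic, ref, rir_len=256, alpha=0.01):
    """Sample-by-sample LMS AEC: adapt rir_est, output the near-end estimate."""
    rir_est = np.zeros(rir_len)
    ref_pad = np.concatenate([np.zeros(rir_len - 1), ref])
    near_end_est = np.zeros(len(mic))
    for t in range(len(mic)):
        frame = ref_pad[t:t + rir_len][::-1]      # ref[t], ref[t-1], ..., ref[t-T+1]
        echo_est = rir_est @ frame                # <rir_est^t, recent reference samples>
        err = mic[t] - echo_est
        near_end_est[t] = err                     # mic - echo_est, as above
        rir_est += alpha * err * frame            # gradient step on (mic[t] - echo_est)^2
    return near_end_est, rir_est

# Toy check: pure echo through a short "RIR", no near-end speech
rng = np.random.default_rng(0)
ref = rng.standard_normal(20000)
mic = np.convolve(ref, [0.5, 0.3, -0.2])[:len(ref)]
out, _ = lms_stream(mic, ref, rir_len=3, alpha=0.05)
print(np.mean(out[-1000:] ** 2))                  # residual echo power, close to zero
```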
Streaming waveform domain NLMS
Acoustics:
mic = ref * rir + noise
Noise is uncorrelated with ref.
NLMS: normalized LMS. A modification of LMS which is invariant to multiplication by a scalar.
echo_est[t] = <rir_est^t, ref[t – T : t + 1]>
err[t] = (mic[t] – echo_est[t])^2 / mean(mic[t – T : t + 1] ^ 2)
grad = grad(err[t], rir_est^t)
rir_est^{t+1} = rir_est^t − alpha · grad
More robust with regard to learning rate.
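The same loop with the normalization from the slide (recent microphone power); a common alternative normalizes by the energy of the reference frame instead. Step size and eps are illustrative:

```python
import numpy as np

def nlms_stream(mic, ref, rir_len=256, alpha=0.1, eps=1e-8):
    """Streaming NLMS: like LMS above, but the step is divided by recent signal power."""
    rir_est = np.zeros(rir_len)
    ref_pad = np.concatenate([np.zeros(rir_len - 1), ref])
    mic_pad = np.concatenate([np.zeros(rir_len - 1), mic])
    near_end_est = np.zeros(len(mic))
    for t in range(len(mic)):
        frame = ref_pad[t:t + rir_len][::-1]
        err = mic[t] - rir_est @ frame
        near_end_est[t] = err
        power = np.mean(mic_pad[t:t + rir_len] ** 2) + eps   # mean(mic[t-T : t+1]^2)
        rir_est += alpha * err * frame / power               # scale-invariant update
    return near_end_est, rir_est
```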
Block-LMS [towards frequency-domain LMS]
Just like streaming waveform domain LMS/NLMS but on chunks of sound instead of single samples.
Advantage: speed
Disadvantage: latency
Frequency domain LMS
Note that according to the linear model frequencies do not mix.
Heuristically, with frequency-domain LMS they do.
Check it in your homework!
Neural post-filter
How to train a neural network for AEC?
The same way we did it for noise reduction, but:
1. Consider cascading your network after a linear AEC
2. For data generation, employ augmentations that introduce loudspeaker nonlinearities
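A minimal sketch of point 2: distort the far-end signal with a simple memoryless nonlinearity (soft clipping here, purely as an assumed example) before convolving it with the RIR, so that the generated echo contains components a linear AEC cannot model:

```python
import numpy as np

def make_aec_training_pair(near_end, far_end, rir, clip=0.3, snr_db=20.0, seed=0):
    """Synthesize a (mic, ref) pair with a nonlinear 'loudspeaker' in the echo path."""
    rng = np.random.default_rng(seed)
    distorted = np.tanh(far_end / clip) * clip            # soft-clipping loudspeaker model
    echo = np.convolve(distorted, rir)[:len(near_end)]    # linear room after the nonlinearity
    noise = rng.standard_normal(len(near_end))
    noise *= np.sqrt(np.mean(near_end ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    mic = near_end + echo + noise                         # what the microphone records
    return mic, far_end                                   # the reference stays the clean far-end
```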
QUESTION
Beamforming. Motivation. DAS.
https://aliexpress.ru/item/32841978358.html?sku_id=12000028573970654
Various consumer devices feature microphone arrays
Yandex Station Mini
ReSpeaker USB Mic Array V2.0
Why?
Delay and Sum (DAS) Beamforming
http://www.labbookpages.co.uk/audio/beamforming/delaySum.html
Microphone arrays enable spatial filtering!
Sound propagates through the space as waves.
The times of arrival are different for different microphones.
Delay and Sum Beamforming. Linear array.
1. We assume the speaker is much further away from the microphones than the microphones are from each other. Thus the wave is planar and we can characterize the speaker position by its angle.
2. For any angle θ the signals at the microphones will be identical up to a time shift of d·sin(θ)/c between adjacent microphones (d: spacing, c: speed of sound).
3. Given a target direction, we delay and sum the signals to compensate for the delays from the target direction. That is the beamforming.
4. For any other direction and any frequency we evaluate the gain to understand how well that direction is suppressed.
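A minimal numpy sketch of steps 2–4 for a uniform linear array, with the delays applied as phase shifts at a single frequency; n_mics, d, f, c match the example on the next slide but are otherwise arbitrary assumptions:

```python
import numpy as np

def das_gain_db(theta_deg, target_deg=0.0, n_mics=4, d=0.2, f=1000.0, c=343.0):
    """Gain of a delay-and-sum beamformer for a plane wave from angle theta (far field)."""
    m = np.arange(n_mics)
    def tau(deg):                                             # arrival delay at microphone m
        return m * d * np.sin(np.deg2rad(deg)) / c
    w = np.exp(-2j * np.pi * f * tau(target_deg)) / n_mics    # DAS weights (steered delays)
    a = np.exp(-2j * np.pi * f * tau(theta_deg))              # incoming plane-wave phases
    return 20 * np.log10(np.abs(np.conj(w) @ a) + 1e-12)

print(das_gain_db(0.0))    # target direction: 0 dB, passed unchanged
print(das_gain_db(60.0))   # off-target direction: strongly attenuated at this f and d
```

Sweeping theta_deg over −90°…90° reproduces the kind of directivity patterns shown on the following slides.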
Delay and Sum Beamforming. Directivity Pattern.
[Beam pattern plot: DAS beamforming, 4 mics, linear array, d = 0.2 m, f = 1000 Hz]
As a result, the microphone array turns into a directional microphone with a directivity pattern controlled by us.
Is it a good result?
The array is huge in size and the frequency is relatively high.
Delay and Sum Beamforming. Directivity Pattern.
[Beam pattern plot: DAS beamforming, 4 mics, linear array, d = 0.03 m, f = 400 Hz]
In a more practical setting the suppression is around 1.73 dB, which is pretty weak
Can we do better?
Frequency domain beamforming
In DAS we delayed and summed the signals in waveform domain.
Delay can be expressed as a convolution with a shifted impulse: x(t − τ) = (x * δ(· − τ))(t).
By the convolution theorem, in the time-frequency domain this becomes a per-frequency multiplication (for a pure delay: a phase factor e^(−i2πfτ) in each bin):
Typically beamforming is formulated in time-frequency domain
We can use arbitrary w_k with the only condition of preserving the target signal!
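A minimal sketch of applying per-frequency weights w_k in the STFT domain (scipy's stft/istft are used here for convenience; the weight matrix w could hold the DAS steering vectors above or the MVDR weights from the next slide):

```python
import numpy as np
from scipy.signal import stft, istft

def beamform_stft(mics, w, fs=16000, nperseg=512):
    """mics: (n_mics, n_samples); w: (n_freqs, n_mics) complex weights per frequency bin."""
    _, _, X = stft(mics, fs=fs, nperseg=nperseg)    # X: (n_mics, n_freqs, n_frames)
    Y = np.einsum('fm,mft->ft', np.conj(w), X)      # weighted sum over microphones per bin
    _, y = istft(Y, fs=fs, nperseg=nperseg)
    return y
```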
MVDR Beamforming
MVDR: Minimal Variance Distortionless Response Beamformer
Unlike DAS, which depends only on the target direction, MVDR also depends on the input signal.
MVDR minimizes the noise (or overall signal) energy under the constraint that the signal coming from the target direction stays unchanged: min_w wᴴRw subject to wᴴd = 1
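This constrained minimization has the textbook closed form w = R⁻¹d / (dᴴR⁻¹d) per frequency bin, where R is the noise (or input) spatial covariance and d the steering vector for the target direction. A minimal sketch with diagonal loading for numerical stability (the regularization value is an assumption):

```python
import numpy as np

def mvdr_weights(R, d, reg=1e-6):
    """MVDR weights for one frequency bin: min wᴴRw subject to wᴴd = 1."""
    n = len(d)
    R = R + reg * (np.trace(R).real / n) * np.eye(n)   # diagonal loading
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (np.conj(d) @ Rinv_d)
```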
A good tutorial for beamforming can be found here:
https://pysdr.org/content/doa.html
https://pysdr.org/content/doa.html#mvdr-capon-beamformer
What is left
Check your homework:
1. Spectral domain LMS AEC
Not covered in the course:
1. Advanced linear AEC techniques: Kalman filter, RLS, double-talk detection
2. Advanced array techniques and beamforming: MVDR, GEVD, DOA estimation, end-to-end approaches: https://pysdr.org/content/doa.html
3. Combination of neural networks for single channel noise reduction with GEVD beamforming: https://groups.uni-paderborn.de/nt/pubs/2016/icassp_2016_heymann_paper.pdf
4. Deep learning for AEC: https://arxiv.org/abs/2306.03177
Thank you!