1 of 22

FACTORIAL HIDDEN RESTRICTED BOLTZMANN MACHINES FOR NOISE ROBUST SPEECH RECOGNITION

Steven J. Rennie, Petr Fousek, and Pierre L. Dognin

October 24, 2012

IBM T. J. Watson Research Center

Factorial Hidden RBMs for Noise Robust Speech Recognition

2 of 22

Motivation

Noise-robust Automatic Speech Recognition (ASR)
Noise-robust Multi-talker ASR
Signal Separation/Isolation/Analysis/Decomposition


Some Applications

mobile computing

surveillance

signal re-composition/editing

acoustic forensics

robust audio search

artificial perception

enhanced hearing


Linear-time Model-based Source Separation using Loopy Belief Propagation and the Max Interaction Model

Model-based source separation hinges on a very simple idea: use all that is known about the characteristics of the sources, and about how they interact, to estimate the sources from mixed data.

A fundamental barrier to the proliferation of model-based separation techniques, however, is that, even for the simplest of source interactions (e.g. additive mixing), the separation problem scales exponentially with the number of sources. The observed mixture fully couples the sources: if each source has K configurations, then there are K^S combinations to search over when there are S sources.
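The K^S growth described above can be made concrete with a small sketch. The function names and the toy `score` callable are hypothetical, introduced only for illustration; they are not part of the system described in the talk.

```python
from itertools import product


def num_joint_states(K, S):
    """Number of joint configurations when each of S sources has K states."""
    return K ** S


def brute_force_best(score, K, S):
    """Exhaustively search all K**S joint configurations.

    `score` is a hypothetical callable that maps a tuple of S state
    indices to a number (higher is better). The loop visits every
    combination, so its cost grows exponentially with S.
    """
    return max(product(range(K), repeat=S), key=score)


if __name__ == "__main__":
    # Joint state count for an illustrative K = 256 states per source.
    for S in (1, 2, 3, 4):
        print(S, num_joint_states(256, S))
```

Even at a modest number of states per source, the joint count quickly exceeds what an exhaustive search can visit, which is the barrier the paragraph above refers to.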

In this talk, I will present a new algorithm for multi-talker speech separation and recognition that achieves super-human recognition performance and runs in linear time. The loopy belief propagation algorithm exploits the independence structure of the sources to make the algorithm linear in language model size, and exploits the factored structure of the interaction between the sources under the max interaction model to make the acoustic likelihood computation linear in acoustic model size. The algorithm therefore scales linearly with the number of sources.
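As a rough illustration of the factored structure under the max interaction model, the sketch below assumes that each source's log-spectral value in a single frequency band is an independent Gaussian. That Gaussian form, and all function names, are assumptions made here for illustration. Under the max model, the cumulative distribution of the observation is the product of the per-source cumulative distributions, and the density is a sum of terms that each involve one source density and the remaining source cumulative distributions. This is not the talk's algorithm, only the identity it builds on.

```python
import math


def gauss_pdf(y, mu, sigma):
    """Density of a Gaussian N(mu, sigma^2) at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def gauss_cdf(y, mu, sigma):
    """Cumulative distribution of N(mu, sigma^2) at y."""
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))


def max_model_cdf(y, params):
    """P(max_i x_i <= y) = prod_i F_i(y) for independent sources.

    `params` is a list of (mu, sigma) pairs, one per source (assumed form).
    """
    out = 1.0
    for mu, sigma in params:
        out *= gauss_cdf(y, mu, sigma)
    return out


def max_model_pdf(y, params):
    """p(y) = sum_i p_i(y) * prod_{j != i} F_j(y).

    Written as a plain double loop for clarity, not for speed.
    """
    total = 0.0
    for i, (mu, sigma) in enumerate(params):
        term = gauss_pdf(y, mu, sigma)
        for j, (mu_j, sigma_j) in enumerate(params):
            if j != i:
                term *= gauss_cdf(y, mu_j, sigma_j)
        total += term
    return total
```

Each term depends on the sources only through per-source quantities evaluated at the observed value, which is the kind of factorization the abstract credits for the efficient likelihood computation.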

The combined speed, accuracy, and scalability of this system are unprecedented. The work demonstrates that the model-based approach is indeed a viable solution to robust speech recognition and to source separation problems in general. Since the interaction between sources that mix additively in the signal domain is well approximated by a max interaction in the log spectral domain, the approach can be applied to any additive mixing problem.
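The quality of the max approximation can be checked numerically. The sketch below assumes two sources whose powers add in a given frequency band (phase effects ignored, an assumption made here), so the log power of the mixture is log(exp(a) + exp(b)) for source log powers a and b. The function names are hypothetical.

```python
import math


def log_add(a, b):
    """log(exp(a) + exp(b)), computed in a numerically stable way."""
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


def max_approx_error(a, b):
    """Gap between exact log-domain mixing and the max approximation.

    Equals log(1 + exp(-|a - b|)), so it lies in (0, log 2].
    """
    return log_add(a, b) - max(a, b)


if __name__ == "__main__":
    for gap in (0.0, 1.0, 3.0, 10.0):
        print(gap, max_approx_error(gap, 0.0))
```

The error is largest, log 2, when the two sources have equal log power, and decays exponentially as one source dominates the other.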

Come on out and have a look-see.

The work has potentially far-reaching implications.

In this talk, I will present a new system for multi-source separation and speech recognition.

It is well appreciated that

Model-based source separation hinges on a very simple idea: use all that is known about the characteristics of the sources, and about how they interact, to constrain the source estimation problem.

A fundamental barrier to the proliferation of model-based separation techniques, however, is that, even for the simplest of source interactions (e.g. additive), the separation problem scales exponentially with the number of sources. The observed mixture fully couples the sources: if each source has K configurations, then there are K^S combinations to search over when there are S sources.

In this talk, I will present a model-based system for multi-talker speech separation and recognition.

To do inference we must therefore generally resort to approximate methods.

One approach is to approximate the interaction function with one that decouples. Subspace-based approaches do this

One approach is to approximate the interaction function in a manner that produces linear source coupling.

Such approaches (e.g. non-negative matrix factorization, non-negative subspace analysis) are extremely efficient, but do not perform well when the subspaces representing the sources overlap significantly.

The other main approach is to do a local search, by iteratively estimating the sources in some manner.

(e.g. using variational methods) by more conventional means such as a gradient

Approximate inference schemes generally resort to local search methods.


3 of 22

Why is Robust ASR hard?

Multiple sources of interference, including speech
Computational explosion in the number of possible “acoustic states” of the environment
This makes data acquisition difficult
This makes statistical data analysis difficult
