1 of 22

FACTORIAL HIDDEN RESTRICTED BOLTZMANN MACHINES FOR NOISE ROBUST SPEECH RECOGNITION

Steven J. Rennie, Petr Fousek, and Pierre L. Dognin

October 24, 2012

IBM T. J. Watson Research Center

© 2011 IBM Corporation


2 of 22

Motivation

  • Noise-robust Automatic Speech Recognition (ASR)
  • Noise-robust Multi-talker ASR
  • Signal Separation/Isolation/Analysis/Decomposition


Some Applications

  • mobile computing
  • surveillance
  • signal re-composition/editing
  • acoustic forensics
  • robust audio search
  • artificial perception
  • enhanced hearing


3 of 22

Why is Robust ASR hard?

  • Multiple sources of interference, including speech
  • Computational explosion in the number of possible “acoustic states” of the environment
    • This makes data acquisition difficult
    • This makes statistical data analysis difficult



4 of 22

Factorial Models of Noisy Speech

[Figure: noisy speech examples — target utterances at x dB (e.g., GRID commands such as “BIN WHITE BY Z 8 AGAIN”, “SET GREEN IN F 2 NOW”) mixed at n dB with interference from other speech (“BIN GREEN WITH A 2 SOON”, “LAY RED WITH C 1 PLEASE”), traffic noise, engine noise, speech babble, airport noise, car noise, and music.]


5 of 22

Combinatoric Considerations

[Figure: factorial graphical model — each source $n = 1, \ldots, N$ has a discrete state $s_n$ with prior $p(s_n)$ and features $x_n$ with likelihood $p(x_n \mid s_n)$; all source features connect to the observation $y$ through the interaction model.]

  • Source models: features $x_n$, states $s_n$, number of states $|s_n|$

    $$p(s_n), \quad p(x_n \mid s_n), \qquad n = 1, \ldots, N$$

  • Interaction model: functions of connected variables

    $$p(y \mid x_1, \ldots, x_N)$$

  • Inference: $O\!\left(\prod_n |s_n|\right)$
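A rough worked example of this scaling (assuming, as in the four-talker demonstration on the next slide, on the order of $10^3$ acoustic states per source):

$$\prod_{n=1}^{4} |s_n| \approx \left(10^3\right)^4 = 10^{12},$$

i.e. on the order of a trillion joint acoustic states, consistent with the figure quoted for the four-source data model.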


6 of 22

4 sources…

[Figure: separation of a four-talker mixture — “PLACE GREEN WITH B 8 SOON” (0 dB), “LAY BLUE AT P ZERO NOW” (-7 dB), “PLACE RED IN H 3 NOW” (-7 dB), “PLACE WHITE AT D ZERO SOON” (-7 dB) — into its constituent utterances.]

  • Data model has over 1 trillion acoustic states
  • Excellent separation using variational posterior with 1K modes


7 of 22

Motivation

  • Learn parts-based models
    • Distributed states
      • Compositional model
      • Better generalization

  • Leverage known interactions
    • Instead of learning the transformation again and again



8 of 22

Review: Restricted Boltzmann Machines

  • A Markov Random Field (MRF)
    • Two layers, no connections between hidden layer nodes
    • For binary hidden, Gaussian visible units:

    • Form of conditional posterior of hidden units


$$\log p(v, h) = -\sum_{i=1}^{V} \frac{(v_i - b_i)^2}{2\sigma_i^2} + \sum_{j=1}^{H} a_j h_j + \sum_{i=1}^{V} \sum_{j=1}^{H} \omega_{ij}\, v_i h_j - \log Z$$

$$p(h_j = 1 \mid v) = \frac{\exp\!\left(a_j + \sum_{i=1}^{V} \omega_{ij} v_i\right)}{1 + \exp\!\left(a_j + \sum_{i=1}^{V} \omega_{ij} v_i\right)} = \mathrm{sig}\!\left(a_j + \sum_{i=1}^{V} \omega_{ij} v_i\right)$$
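A minimal NumPy sketch of the hidden-unit posterior above for a Gaussian-visible, binary-hidden RBM; the array names (`W`, `a`, `v`) are illustrative, and the toy dimensions simply echo the 24-dimensional features used later in the deck:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_posterior(v, W, a):
    """p(h_j = 1 | v) = sig(a_j + sum_i W[i, j] * v_i) for a Gaussian-visible,
    binary-hidden RBM with weight matrix W of shape (V, H)."""
    return sigmoid(a + v @ W)

# Toy example: V = 24 visible units (e.g. log Mel bands), H = 32 hidden units.
rng = np.random.default_rng(0)
V, H = 24, 32
W = 0.01 * rng.standard_normal((V, H))
a = np.zeros(H)
v = rng.standard_normal(V)
print(hidden_posterior(v, W, a))  # vector of H Bernoulli means in (0, 1)
```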


9 of 22

Review: Restricted Boltzmann Machines (cont’d)

    • Form of the conditional prior of a visible unit

    • The marginal over the visible units can be represented as a mixture of $2^H$ Gaussians
    • It can be evaluated in time linear in the number of hidden units ($H$), since

$$p(v_i \mid h) = \frac{\exp\!\left(-\frac{(v_i - b_i)^2}{2\sigma_i^2} + \sum_{j=1}^{H} \omega_{ij} v_i h_j\right)}{\int_{v_i} \exp\!\left(-\frac{(v_i - b_i)^2}{2\sigma_i^2} + \sum_{j=1}^{H} \omega_{ij} v_i h_j\right) dv_i} = \mathcal{N}\!\left(v_i;\; b_i + \sigma_i^2 \sum_{j=1}^{H} \omega_{ij} h_j,\; \sigma_i^2\right)$$

$$p(h \mid v) = \prod_j p(h_j \mid v)$$
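Because $p(v_i \mid h)$ is Gaussian and $p(h \mid v)$ factorizes, block Gibbs sampling alternates between the two conditionals above; a minimal sketch (all names and dimensions are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_step(v, W, a, b, sigma, rng):
    """One block-Gibbs sweep for a Gaussian-visible RBM using the two
    conditionals above: h | v is Bernoulli(sig(a + W^T v)), and
    v_i | h is N(b_i + sigma_i^2 * sum_j W[i, j] h_j, sigma_i^2)."""
    p_h = sigmoid(a + v @ W)                        # p(h_j = 1 | v)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    mean_v = b + sigma**2 * (W @ h)                 # conditional mean of v | h
    v_new = mean_v + sigma * rng.standard_normal(mean_v.shape)
    return v_new, h

rng = np.random.default_rng(1)
V, H = 24, 32
W = 0.01 * rng.standard_normal((V, H))
a, b, sigma = np.zeros(H), np.zeros(V), np.ones(V)
v = rng.standard_normal(V)
v, h = gibbs_step(v, W, a, b, sigma, rng)
```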


10 of 22

Factorial Hidden Restricted Boltzmann Machines

    • The Interaction Model $p(y \mid v^x, v^n)$ describes how the visible units of multiple RBMs (two here) generate the observed data
    • Inference is now intractable due to explaining-away effects
    • One solution: variational methods

      • Choose a surrogate posterior $q$ that makes inference tractable (the bound below is tight when no structural assumptions are placed on $q$)

$$
\begin{aligned}
\log p(y) &= \log \sum_{h, v} p(h^x, v^x)\, p(h^n, v^n)\, p(y \mid v^x, v^n) \\
&\geq \sum_{h, v} q(h, v)\, \log \frac{p(h^x, v^x)\, p(h^n, v^n)\, p(y \mid v)}{q(h, v)} \\
&= \mathbb{E}_{q(v^x, v^n)}\!\left[\log p(y \mid v)\right] + \sum_{i = x, n} \mathbb{E}_{q(h^i, v^i)}\!\left[\log \frac{p(h^i, v^i)}{q(h^i, v^i)}\right] \triangleq \mathcal{L}
\end{aligned}
$$
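A brief reminder of why the bound holds (a standard step, using only the quantities already defined): applying Jensen's inequality to the log of an expectation under $q$ gives

$$\log \sum_{h, v} q(h, v)\, \frac{p(h^x, v^x)\, p(h^n, v^n)\, p(y \mid v)}{q(h, v)} \;\geq\; \sum_{h, v} q(h, v)\, \log \frac{p(h^x, v^x)\, p(h^n, v^n)\, p(y \mid v)}{q(h, v)},$$

with equality when $q(h, v)$ is the exact posterior $p(h, v \mid y)$, which is why the bound is tight in the absence of structural assumptions on $q$.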


11 of 22

FHRBM Model: Factor Graph

[Factor graph: the Speech Model contributes factors $\Psi(v^x, h^x)$ and $\Psi(h^x, l^x)$ over speech visible units $v^x$, hidden units $h^x$, and top-layer units $l^x$; the Noise Model contributes $\Psi(v^n, h^n)$ and $\Psi(h^n, l^n)$ over $v^n$, $h^n$, $l^n$; the Interaction Model factor $p(y \mid v^x, v^n)$ couples both sets of visible units to the noisy data $y$.]


12 of 22

FHRBMs for Robust ASR

    • Speech RBM: $p(v^x, h^x)$

    • Noise RBM: $p(v^n, h^n)$

    • Interaction Model (log Mel power spectrum):

      $$p(y \mid v^x, v^n) = \prod_f \mathcal{N}\!\left(y_f;\; g(v_f),\; \psi_f^2\right), \qquad g(v_f) = \log\!\left(\exp(v_f^x) + \exp(v_f^n)\right), \qquad v_f = [\, v_f^x \;\; v_f^n \,]^T$$

      [this choice ignores phase interactions]

    • Assumed form of surrogate posterior $q$:

      $$q(h^x, v^x, h^n, v^n) = \prod_f q(v_f^x, v_f^n) \prod_{j=1}^{H^x} q(h_j^x) \prod_{k=1}^{H^n} q(h_k^n) = \prod_f \mathcal{N}\!\left(v_f;\; \mu_f,\; \Phi_f\right) \prod_{s = x, n} \prod_{j=1}^{H^s} \gamma_{h_j^s}^{\,h_j^s} \left(1 - \gamma_{h_j^s}\right)^{1 - h_j^s}$$
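As a concrete illustration of the log-sum interaction above, here is a minimal NumPy sketch (function and variable names are illustrative, not from the slides); `np.logaddexp` evaluates $\log(\exp(v_f^x) + \exp(v_f^n))$ stably per band:

```python
import numpy as np

def interaction_mean(v_x, v_n):
    """Log-sum interaction on log Mel power spectra:
    g(v_f) = log(exp(v_x_f) + exp(v_n_f)), computed band by band."""
    return np.logaddexp(v_x, v_n)

# Toy 24-band frame: speech dominates the low bands, noise the high bands.
v_x = np.linspace(2.0, -3.0, 24)   # hypothetical log Mel speech frame
v_n = np.full(24, -1.0)            # hypothetical stationary noise level
y_mean = interaction_mean(v_x, v_n)
print(y_mean)                      # expected noisy observation per band
```

In the FHRBM the speech and noise features are latent, so this function is only evaluated at (and linearized around) their current variational means, as on the next slide.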


13 of 22

FHRBMs for Robust ASR

    • Iteration:
      1. Update context-dependent linear approx. of interaction

$$p(y \mid v^x, v^n) \approx \prod_f \mathcal{N}\!\left(y_f;\; g(\mu_f) + (v_f - \mu_f)^T d_f,\; \psi_f^2\right)$$

$$d_f = [\, d_{v_f^x} \;\; d_{v_f^n} \,]^T = \left.\frac{\partial g}{\partial v_f}\right|_{v_f = \mu_f}, \qquad d_{v_f^x} = \mathrm{sig}\!\left(\mu_{v_f^x} - \mu_{v_f^n}\right), \qquad d_{v_f^n} = 1 - d_{v_f^x}$$
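A minimal sketch of this linearization step, using the derivative $d_{v_f^x} = \mathrm{sig}(\mu_{v_f^x} - \mu_{v_f^n})$ as reconstructed above; all names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def linearize_interaction(mu_x, mu_n):
    """Per-band linearization of g(v) = log(exp(v_x) + exp(v_n)) around the
    current variational means (mu_x, mu_n).  Returns g(mu) and the Jacobian
    entries (d_x, d_n), which sum to one in each band."""
    g_mu = np.logaddexp(mu_x, mu_n)
    d_x = sigmoid(mu_x - mu_n)      # dg/dv_x at the expansion point
    d_n = 1.0 - d_x                 # dg/dv_n at the expansion point
    return g_mu, d_x, d_n

mu_x = np.array([2.0, 0.0, -2.0])    # hypothetical speech means (3 bands)
mu_n = np.array([-1.0, -1.0, -1.0])  # hypothetical noise means
g_mu, d_x, d_n = linearize_interaction(mu_x, mu_n)
# Linearized prediction for any (v_x, v_n) near the expansion point:
#   y_f ≈ g_mu + d_x * (v_x - mu_x) + d_n * (v_n - mu_n)
```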


14 of 22


15 of 22

FHRBMs for Robust ASR

    • Iteration (cont’d):
      2. Update the variational parameters of source $s$

      3. Toggle $s$ (between $s = x$ and $s = n$) and repeat

$$\gamma_{h_j^s} = \mathrm{sig}\!\left(a_j^s + \sum_{f=1}^{V^s} \omega_{fj}^s\, \mu_{v_f^s}\right) \qquad \text{[influence of the source's own network]}$$

$$\phi_{v_f^s}^2 = \left(\sigma_{v_f^s}^{-2} + d_{v_f^s}^2\, (\psi_f')^{-2}\right)^{-1}$$

$$\mu_{v_f^s} = \phi_{v_f^s}^2 \left( \underbrace{\sigma_{v_f^s}^{-2}\left(b_{v_f^s} + \sigma_{v_f^s}^2 \sum_{j=1}^{H^s} \omega_{fj}^s\, \gamma_{h_j^s}\right)}_{\text{influence of the source's own network}} + \underbrace{d_{v_f^s}\, (\psi_f')^{-2}\, y_f'}_{\text{influence of the other source's network and the data}} \right)$$
Factorial Hidden RBMs for Noise Robust Speech Recognition

© 2012 IBM Corporation

16 of 22

Deep FHRBMs for Robust ASR

    • Updates readily generalize to use of deep belief network (DBNs) of RBMs
    • Example: Source RBMs with two hidden layers

      • Top Layer Variables

      • Variational distribution

      • New update for first hidden layer

      • Extension to use of source RBMs with more than two hidden layers straightforward…

      Top-layer variables: $l^s = \{ l_1^s, \ldots, l_k^s, \ldots, l_{L^s}^s \}$

      Variational distribution: $q(l^s) = \prod_k q(l_k^s)$, with $q(l_k^s = 1) = \gamma_{l_k^s}$

      New update for the first hidden layer:

      $$\gamma_{h_j^s} = \mathrm{sig}\!\left( \underbrace{a_j^s + \sum_{i=1}^{V^s} \omega_{ij}^s\, \mu_{v_i^s}}_{\text{influence of the layer below}} + \underbrace{\alpha_j^s + \sum_{k=1}^{L^s} \varpi_{jk}^s\, \gamma_{l_k^s}}_{\text{influence of the layer above}} \right)$$
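A minimal sketch of the deep variant of the $\gamma$ update, adding the top-down term; `Wtop_s` and `alpha_s` play the roles of $\varpi^s$ and $\alpha^s$ in the notation above, and all names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def update_first_hidden_layer(a_s, W_s, mu_s, alpha_s, Wtop_s, gamma_l_s):
    """Deep variant of the gamma update above: bottom-up evidence from the
    visible-unit means plus top-down evidence from the layer-above means.
    W_s: (V_s, H_s) visible-to-hidden weights; Wtop_s: (H_s, L_s) weights
    to the top layer."""
    bottom_up = a_s + mu_s @ W_s             # influence of the layer below
    top_down = alpha_s + Wtop_s @ gamma_l_s  # influence of the layer above
    return sigmoid(bottom_up + top_down)
```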


17 of 22

Speech RBM: Features (24 dim) → Middle Layer (32 Units) → Top Layer (8 Units)

Noise RBM: Features (24 dim) → Middle Layer (8 Units) → Top Layer (3 Units)


18 of 22

Model-based Noise Compensation Algorithms

A model-based noise compensation system combines a Speech Model (SM), a Noise Model (NM), and an Interaction Model (IM) into a noisy speech model:

Algorithm   SM    NM                                IM
DNA (CD)    GMM   Gaussian Process                  Log-sum (VTS approx)
FHRBM       RBM   RBM                               Log-sum (VTS approx)
SNM         GMM   Fixed Gaussian (first 10 frames)  Log-sum (VTS approx)
GMM-GMM     GMM   GMM                               Log-sum (VTS approx)


19 of 22

Preliminary Results

  • Task: test-time-only noise compensation on noisy in-car speech data
  • Recognizer: IBM embedded system (eVV)
  • AM: 10K Gaussians, 865 CD states
  • LM: task-specific grammars
  • Training data: 786 hrs, ~10K speakers, C&C, dialing, navigation
  • Test data: 206k words, well matched
  • DNA outperforms the GMM-GMM system on this task (diffuse, evolving noise)
  • FHRBM outperforms DNA, but not DNA with Condition Detection (CD)
  • CD could also be used with FHRBMs…

$|\theta_x^{\mathrm{RBM}}| = |\theta_x^{\mathrm{GMM}}|$ (the RBM speech model is matched in parameter count to the GMM speech model)


20 of 22

Preliminary Results – WER vs. (biased) SNR


fMLLR off, SS off


21 of 22

Preliminary Results – WER vs. (biased) SNR


fMLLR on, SS on


22 of 22

Discussion

    • Preliminary results are promising
      • Models
        • DNA: (matched) quasi-stationary noise model, speech GMM
        • FHRBM: no dynamics yet, parts-based speech/noise RBMs
      • SNR estimates for each frequency band
        • DNA: estimated uniquely for every speech state in each frame
        • FHRBM: a single set of SNR estimates for each frame
      • Initialization
        • DNA: noise model initialized on the first 10 frames
        • FHRBM: only the state posterior (the feature layer is not yet adapted)
    • Need to evaluate FHRBMs on more general noise containing non-stationary & structured elements
    • Need to explore model/inference procedures further: e.g., FHRBM as a bootstrap for a fast feed-forward system?
