1 of 22

Rethinking Device Interaction: A Silent Speech Approach

Tanmay Srivastava, Prerna Khanna, Shijia Pan, VP Nguyen, Shubham Jain

2 of 22

Speech interactions are UBIQUITOUS…

[Cartoon: “Oh no, the mutant chicken is here!”]

  • Unreliable in noisy environments
  • Don’t allow discreet communication
  • Not catered for next-gen wearables

Speech interactions are NOT ALWAYS PRACTICAL…

3 of 22

  • Contactless USI Systems
  • Contact USI Systems
  • Other Input Modalities

(Representative systems: MobiCom’20, IUI’18, CHI’22, IMWUT’20, MobiCom’14, CHI’19, MobiSys’18)

The Search for Silent Alternatives

4 of 22

Using Silent Speech as a surrogate to speech

Contact SSI:
  • Acceptable form factor
  • Unobtrusive
  • Hands-free

Jaw Motion

5 of 22

Analogy: reconstruct the song by watching the guitarist’s hand. Jaw motion is to speech as the hand is to the song.

Guess the song! (Answer: “Blinding Lights”)

6 of 22

Is it even possible to infer silent speech from the JAW?

Accelerometer placed over the Temporomandibular Joint (TMJ)
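
To make the sensing idea concrete: jaw articulation is slow compared with typical IMU sampling rates, so its signature can be pulled out with a simple band-pass filter and an energy envelope. Below is a minimal sketch, assuming a 100 Hz accelerometer and an illustrative 1-15 Hz articulation band; neither value comes from the talk.

```python
# Sketch: isolate jaw-motion energy from an accelerometer worn near the TMJ.
# Assumptions (not from the talk): 100 Hz sampling, 1-15 Hz articulation band.
import numpy as np
from scipy.signal import butter, filtfilt

FS = 100  # Hz, assumed IMU sampling rate

def jaw_band(acc: np.ndarray, lo: float = 1.0, hi: float = 15.0) -> np.ndarray:
    """Band-pass each axis to suppress gravity/drift below and noise above."""
    b, a = butter(4, [lo / (FS / 2), hi / (FS / 2)], btype="band")
    return filtfilt(b, a, acc, axis=0)

def motion_energy(acc: np.ndarray, win: int = FS // 2) -> np.ndarray:
    """Short-window RMS of the filtered magnitude: high while the jaw moves."""
    mag = np.linalg.norm(jaw_band(acc), axis=1)
    kernel = np.ones(win) / win
    return np.sqrt(np.convolve(mag ** 2, kernel, mode="same"))

# acc = np.loadtxt("imu_near_tmj.csv", delimiter=",")  # N x 3, hypothetical file
# speaking = motion_energy(acc) > 3 * motion_energy(acc).mean()  # crude detector
```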

7 of 22

Are the signals detectable?

8 of 22

Are the signals recognizable?

JawSense (HotMobile’21)

INITIAL BREAKTHROUGH

9 of 22

Let’s recognize words…

  • Phonemes overlap in time to produce compound sounds

10 of 22

Breaking words into phonemes: /stɑːrt/ (“start”)

  • /s/ (Onset) -> Jaw moves upwards
  • /ɑː/ (Nucleus) -> Jaw moves downwards
  • /t/ (Coda) -> Plosive

[Figure: jaw-motion trace for /stɑːrt/ vs. isolated /ɑː/]
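
The onset/nucleus/coda breakdown can be read as a lookup from phonemes to coarse jaw primitives. A toy sketch follows; the mapping is illustrative, not the actual recognition model.

```python
# Sketch: map a phoneme sequence to the coarse jaw motion each phoneme implies.
# The table is illustrative; only /s/, /t/, and /ɑː/ follow the slide.
JAW_MOTION = {
    "s": "up",       # onset consonant: jaw raises
    "t": "plosive",  # coda stop: brief closure burst
    "ɑː": "down",    # low vowel nucleus: jaw opens
}

def jaw_template(phonemes: list[str]) -> list[tuple[str, str]]:
    """Return (phoneme, expected jaw primitive) pairs for a word."""
    return [(p, JAW_MOTION.get(p, "neutral")) for p in phonemes]

print(jaw_template(["s", "t", "ɑː", "r", "t"]))  # /stɑːrt/
# [('s', 'up'), ('t', 'plosive'), ('ɑː', 'down'), ('r', 'neutral'), ('t', 'plosive')]
```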

11 of 22

How well can we recognize isolated words?

Accuracy across words of different syllabic lengths is >0.9.

12 of 22

How well can we recognize isolated words?

Accuracy across words of different syllabic lengths is >0.9.

MuteIt (UbiComp’22)

WE MADE IT PRACTICAL!!

13 of 22

Moving to natural silent conversation…

~Typing speed -> ~Normal speech rate

14 of 22

Unvoiced: Jaw Motion -> 6-axis IMU -> Spectrogram -> Spectrogram-to-Text

Using Silent Speech as a surrogate for speech
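
In code, the pipeline is IMU frames in, spectrogram frames out, with any off-the-shelf recognizer handling the final spectrogram-to-text step. The model below is a hypothetical stand-in for Unvoiced’s custom translator, shown only to fix the tensor shapes.

```python
# Sketch: cross-modal translation from 6-axis IMU frames to mel-spectrogram
# frames. CrossModalTranslator is a placeholder, not the paper's architecture.
import torch
import torch.nn as nn

class CrossModalTranslator(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.encoder = nn.GRU(input_size=6, hidden_size=hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, n_mels)

    def forward(self, imu: torch.Tensor) -> torch.Tensor:
        h, _ = self.encoder(imu)  # (batch, frames, hidden)
        return self.decoder(h)    # (batch, frames, n_mels)

model = CrossModalTranslator()
imu = torch.randn(1, 200, 6)      # one phrase: 200 IMU frames, 6 axes
mel = model(imu)                  # predicted spectrogram, shape (1, 200, 80)
# text = asr_decode(mel)  # hypothetical: any spectrogram-to-text recognizer
```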

15 of 22

SenSys’24

How do we get the spectrogram from IMU?

  • Simplify the task: Segment the IMU signal for a phrase into words (a segmentation sketch follows this list).

  • Cross-modal translation model: Devise a custom model and loss function.

  • Handle variable user context: Leverage LLMs as gap-filling agents for phrases.
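
A minimal sketch of the first step, assuming the short pauses between silently mouthed words appear as low-energy gaps in the IMU stream; the threshold and minimum span length are illustrative.

```python
# Sketch: split an IMU energy envelope into per-word spans at silent gaps.
import numpy as np

def segment_words(energy: np.ndarray, thresh: float, min_len: int = 20):
    """Return (start, end) index pairs where energy stays above thresh."""
    active = energy > thresh
    spans, start = [], None
    for i, on in enumerate(active):
        if on and start is None:
            start = i                      # a word begins
        elif not on and start is not None:
            if i - start >= min_len:       # ignore spurious short blips
                spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(active)))
    return spans

# envelope = motion_energy(acc)            # from the earlier sketch
# word_spans = segment_words(envelope, thresh=float(envelope.mean()))
```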

16 of 22

How do we learn this information? Designing the loss function.

The same phrase, “Set alarm for 6 AM”, can be mouthed with different timing, so the target and the prediction contain the same activations with different spacing:

  [a1, 0, a2, 0, a3, … an]
  [a1, 0, 0, 0, a2, 0, 0, 0, … an]

Element-wise MSE is high even though the content matches, so the training objective combines MSE with a Prosody Loss.
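
A sketch of that combination: the prosody term below compares timing-insensitive summaries of the two spectrograms (sorted per-frame energies), so a shifted but otherwise correct prediction is not over-penalized. The sorted-energy form is an illustrative stand-in for the paper’s actual prosody loss.

```python
# Sketch: MSE plus a timing-insensitive prosody term on mel spectrograms
# shaped (batch, frames, mels). The prosody term here is illustrative.
import torch
import torch.nn.functional as F

def prosody_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Compare per-frame energy contours regardless of where energy falls in time."""
    pred_e = pred.pow(2).mean(dim=-1)    # (batch, frames)
    targ_e = target.pow(2).mean(dim=-1)
    # sorting along time removes the offset that inflates element-wise MSE
    return F.mse_loss(pred_e.sort(dim=-1).values, targ_e.sort(dim=-1).values)

def total_loss(pred: torch.Tensor, target: torch.Tensor, alpha: float = 0.5):
    return F.mse_loss(pred, target) + alpha * prosody_loss(pred, target)
```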

17 of 22

How well can we perform interaction tasks?

We achieve >94% accuracy.

18 of 22

Example commands, recognized word by word (Add, Apples, Banana, To, Shopping, List):

  “Add apples and bananas to shopping list.”
  “Delete apples and bananas from shopping list.”

[Figure: distinguishing one-syllable words by low vs. high vowels]

Putting the last 3.5 years together: JawSense, Unvoiced, MuteIt

THE FINAL SOLUTION…

19 of 22

Work done at MSR

What are the next steps?

1. Silent Speech on Commercial Wearables

  • Headphones
  • VR headsets
  • AR headsets
  • Earphones

20 of 22

What are the next steps?

2. Working with the Accessible Population

[Demo clip: “Afternoon” (x10)]

21 of 22

Closing remarks

  • Silent speech is an ALTERNATE interaction modality.

  • It carries limited information, but device interactions also require only limited context.

  • It needs adaptation for the accessible population.

22 of 22

SenSys’24: Quality of spectrogram generation