1 of 41

HOW DO COMPUTERS UNDERSTAND WHAT WE SAY?

UNIT 2, MODULE 2.2

2 of 41

OBJECTIVES

Students should be able to:

    • Explain how speech recognition works

3 of 41

Discussion

Do you use speech recognition on your device?

If so, which apps do you use, and how well do they work?

4 of 41

Add Video

5 of 41

TODAY WE ARE GOING TO LEARN

HOW SPEECH RECOGNITION WORKS

6 of 41

SPEECH RECOGNITION STARTS WITH SOUND

Physics tells us that sound is a rapid variation in air pressure.

In space, no one can hear you scream, because there’s no sound in a vacuum.

7 of 41

A MICROPHONE IS A SENSOR THAT TURNS SOUND INTO AN ELECTRICAL SIGNAL, OR “WAVEFORM”.

A waveform is a plot of voltage over time.

Sound waves are variations in air pressure.

When we make sounds, the air vibrates.

The microphone converts air pressure to voltage.
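To make "a plot of voltage over time" concrete, here is a minimal sketch of how a computer might hold a waveform: as a list of numbers measured at regular intervals. The sample rate, the 440 Hz tone, and the function name `sample_tone` are assumptions chosen for illustration; a real microphone signal would be far less regular than this pure tone.

```python
import math

SAMPLE_RATE = 8000  # samples per second (an assumed value for illustration)

def sample_tone(freq_hz, duration_s, amplitude=1.0, sample_rate=SAMPLE_RATE):
    """Return voltage-like samples of a pure tone, one value per time step."""
    n_samples = round(duration_s * sample_rate)
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate)
        for n in range(n_samples)
    ]

# 10 milliseconds of a 440 Hz tone standing in for a microphone signal.
waveform = sample_tone(freq_hz=440.0, duration_s=0.01)
print(len(waveform))
print(min(waveform), max(waveform))
```

Plotting these values against their sample index would give the kind of picture an oscilloscope shows.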

8 of 41

LET’S LOOK AT HOW PEOPLE CREATE SOUNDS

9 of 41

WHAT DID YOU NOTICE ABOUT HOW THE TONGUE AND LIPS WERE INVOLVED IN BEATBOXING?

Discussion

10 of 41

HUMAN ARTICULATORY APPARATUS: HOW WE PRODUCE SPEECH

Source: https://link.springer.com/referenceworkentry/10.1007%2F978-0-387-73003-5_199

11 of 41

ADDITIONAL MRI VOCAL TRACT VIDEOS

12 of 41

PLAY WITH THE VOCAL TRACT DEMO (HUMAN VOCAL TRACT)

13 of 41

WHAT DOES SOUND LOOK LIKE?

Oscilloscope: a device for displaying waveforms

[Figure: oscilloscope trace; horizontal axis: time, vertical axis: amplitude]

14 of 41

TRY IT: OSCILLOSCOPE DEMO

STEP 1: OPEN: HTTPS://ACADEMO.ORG/DEMOS/VIRTUAL-OSCILLOSCOPE/

STEP 2: SET SECONDS/DIV TO 5 MS

STEP 3: TALK INTO THE MIC TO SEE WHAT YOUR SPEECH LOOKS LIKE

15 of 41

LET’S CHOOSE A SIMPLE TYPE OF SOUND TO MAKE THE WAVEFORM UNDERSTANDABLE

STEP 3: IF YOU HAVE AN ELECTRONIC KEYBOARD, SELECT AN ELECTRONIC ORGAN VOICE. IF NOT, YOU CAN USE A VIRTUAL KEYBOARD:

STEP 4: PLAY A LOW NOTE AND THEN PLAY A HIGH NOTE. HOW DO THEY DIFFER?

STEP 5: PLAY A LOW NOTE AND A HIGH NOTE AT THE SAME TIME. HOW DO THE WAVEFORMS COMBINE?

[Figures: waveforms of low C, very high C, and low + very high C]
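One way to think about Step 5: when two notes sound at once, their waveforms add together sample by sample. The sketch below assumes that "low C" is roughly C2 (65.41 Hz) and "very high C" is roughly C7 (2093 Hz); the slides do not say which octaves were used, so those values, the sample rate, and the function names are illustrative only.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed)

def tone(freq_hz, n_samples, sample_rate=SAMPLE_RATE):
    """Samples of a pure tone with amplitude 1."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

def mix(*waves):
    """Combine waveforms by adding them sample by sample."""
    return [sum(samples) for samples in zip(*waves)]

low = tone(65.41, 400)     # assumed frequency for "low C"
high = tone(2093.0, 400)   # assumed frequency for "very high C"
combined = mix(low, high)

# The slow wave sets the overall shape; the fast wave rides on top of it.
print(combined[:5])
```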

16 of 41

TRY PLAYING A CHORD: MIDDLE C PLUS G GIVES A MORE COMPLEX WAVEFORM

[Figure: waveform of the chord; horizontal axis: time, vertical axis: amplitude]
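A short calculation suggests why this chord still looks fairly orderly. The sketch assumes standard equal-temperament tuning with A4 = 440 Hz and MIDI note numbering (middle C = 60), and it assumes the G is the one just above middle C (67). None of these details are stated on the slide.

```python
def note_frequency(midi_note, a4_hz=440.0):
    """Equal-temperament frequency in Hz for a MIDI note number (A4 = 69)."""
    return a4_hz * 2 ** ((midi_note - 69) / 12)

MIDDLE_C, G_ABOVE = 60, 67   # assumed note choices
c_hz = note_frequency(MIDDLE_C)
g_hz = note_frequency(G_ABOVE)

print(round(c_hz, 2), round(g_hz, 2), round(g_hz / c_hz, 3))
```

Under these assumptions the two frequencies are in a ratio close to 3:2, so the combined waveform nearly repeats every two cycles of C (three cycles of G).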

17 of 41

HOWEVER, HUMAN SPEECH IS A VERY COMPLEX WAVEFORM

[Figure: waveform of human speech; horizontal axis: time, vertical axis: amplitude]

18 of 41

A SPECTROGRAM SHOWS THE FREQUENCY COMPONENTS OF SOUNDS

This is human speech.

[Figure: spectrogram of human speech; horizontal axis: time, vertical axis: frequency]

Do you see any differences in the axis labels?
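A spectrogram can be sketched as follows: cut the waveform into short frames and measure how much of each frequency is present in each frame. The version below uses a deliberately simple (and slow) discrete Fourier transform from the standard library so that it has no dependencies. The frame length, hop size, sample rate, and test signal are all assumptions for illustration; real tools typically use a fast Fourier transform and a smoothing window.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second (assumed)

def dft_magnitudes(frame):
    """Strength of each frequency bin from 0 Hz up to half the sample rate."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]

def spectrogram(samples, frame_len=64, hop=32):
    """One magnitude spectrum per frame: rows are time, columns are frequency."""
    return [
        dft_magnitudes(samples[start:start + frame_len])
        for start in range(0, len(samples) - frame_len + 1, hop)
    ]

def strongest_bin(row):
    """Index of the frequency bin with the most energy."""
    return max(range(len(row)), key=row.__getitem__)

# Test signal: a 500 Hz tone followed by a 1500 Hz tone.
signal = ([math.sin(2 * math.pi * 500 * n / SAMPLE_RATE) for n in range(256)]
          + [math.sin(2 * math.pi * 1500 * n / SAMPLE_RATE) for n in range(256)])
spec = spectrogram(signal)
bin_hz = SAMPLE_RATE / 64  # width of one frequency bin in Hz

print(strongest_bin(spec[0]) * bin_hz)   # dominant frequency at the start
print(strongest_bin(spec[-1]) * bin_hz)  # dominant frequency at the end
```

Unlike the waveform, each row here answers "which frequencies are present right now?", which is why the vertical axis of a spectrogram is frequency rather than amplitude.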

19 of 41

TRY IT: SPECTROGRAM DEMO

STEP 1: OPEN HTTPS://SPECTROGRAM.SCIENCEMUSIC.ORG/

STEP 2: CLICK ON THE MICROPHONE ICON TWICE TO START THE DEMO.

STEP 3: MAKE SOUNDS AND OBSERVE THE SPECTROGRAM.

[Figure: spectrogram from the demo; horizontal axis: time, vertical axis: frequency]

20 of 41

HOW TO READ A SPECTROGRAM

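[Figure: waveform (amplitude vs. time) and spectrogram (frequency vs. time) of low C]

This spectrogram shows the frequency: low C

[Figure: waveform (amplitude vs. time) and spectrogram (frequency vs. time) of very high C]

This spectrogram shows the frequency: very high C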

21 of 41

HOW TO READ A SPECTROGRAM

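[Figure: waveform (amplitude vs. time) and spectrogram (frequency vs. time) of low C + very high C]

This spectrogram shows the chord: low C + very high C

[Figure: waveforms (amplitude vs. time) and spectrograms (frequency vs. time) of C, G, and C+G]

This spectrogram shows the chord: middle C plus G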
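Reading a spectrogram of a chord amounts to spotting one bright band per note. The sketch below looks for those bands numerically: it computes the frequency content of a single tone and of a two-note chord, then lists the peaks. The frequencies 261.63 Hz (middle C) and 392.00 Hz (the G above it) assume standard tuning; the sample rate, the 0.1 s duration, the half-height threshold, and the function names are illustrative choices.

```python
import cmath
import math

SAMPLE_RATE = 8000   # samples per second (assumed)
N_SAMPLES = 800      # 0.1 s of audio, so each frequency bin is 10 Hz wide

def magnitude_spectrum(samples, max_bin):
    """Naive DFT magnitudes for bins 0..max_bin (slow but dependency-free)."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(samples)))
        for k in range(max_bin + 1)
    ]

def peak_frequencies(samples, max_hz=1000):
    """Frequencies (Hz) of local maxima at least half as tall as the tallest."""
    bin_hz = SAMPLE_RATE / len(samples)
    mags = magnitude_spectrum(samples, int(max_hz / bin_hz))
    threshold = 0.5 * max(mags)
    return [
        k * bin_hz
        for k in range(1, len(mags) - 1)
        if mags[k] >= threshold
        and mags[k] > mags[k - 1]
        and mags[k] > mags[k + 1]
    ]

def tone(freq_hz):
    return [math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
            for n in range(N_SAMPLES)]

middle_c, g_above = 261.63, 392.00   # assumed tuning
chord = [c + g for c, g in zip(tone(middle_c), tone(g_above))]

print(peak_frequencies(tone(middle_c)))  # one peak near middle C
print(peak_frequencies(chord))           # two peaks, one per note
```

Because the bins are 10 Hz wide in this sketch, the reported peaks are only accurate to the nearest 10 Hz.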

22 of 41

HUMAN SPEECH

SPEECH SPECTROGRAM OF DAVE SAYING: “EVERY CHILD DESERVES TO LEARN ABOUT ARTIFICIAL INTELLIGENCE.”

23 of 41

READING A SPECTROGRAM OF HUMAN SPEECH: FORMANTS F1, F2, AND F3

Formants are bands of high energy in the speech signal. Different vowels and consonants have different formant patterns.

[Figure: speech spectrogram with formant bands labeled F3, F2, and F1 from top to bottom]

24 of 41

/a/ as in “block”

/u/ as in “blue”

/i/ as in “bleed”
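Since each vowel has its own formant pattern, a very rough vowel guesser can compare measured F1 and F2 values against reference values and pick the closest match. The reference numbers below are approximate textbook averages for adult male speakers and are assumptions here, not values from the slides; real formants vary a lot between speakers, so treat this as a sketch of the idea only.

```python
import math

# Approximate (F1, F2) values in Hz; assumed textbook averages, not slide data.
REFERENCE_FORMANTS = {
    "/i/": (270, 2290),   # as in "bleed"
    "/a/": (730, 1090),   # as in "block"
    "/u/": (300, 870),    # as in "blue"
}

def closest_vowel(f1_hz, f2_hz):
    """Return the reference vowel whose (F1, F2) point is nearest."""
    return min(
        REFERENCE_FORMANTS,
        key=lambda vowel: math.dist((f1_hz, f2_hz), REFERENCE_FORMANTS[vowel]),
    )

print(closest_vowel(280, 2200))
print(closest_vowel(700, 1100))
print(closest_vowel(320, 900))
```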

25 of 41

FROM SPECTROGRAMS TO SOUNDS

THIS IS A LATE 1980S CONVOLUTIONAL NEURAL NET THAT RECOGNIZES SOUNDS.

SPECTROGRAMS GO IN; SOUNDS (PHONEMES) COME OUT: IN THIS CASE /B/, /D/, OR /G/.

[Diagram: waveform -> spectrogram -> neural network (phoneme recognizer)]
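The "spectrograms go in, phonemes come out" idea can be sketched as a tiny forward pass: slide a small filter along the time axis of the spectrogram, add up each filter's response, and turn the totals into probabilities. This is not the network pictured on the slide. The layer sizes are arbitrary, the weights are random and untrained, and the input is random noise, so the answer it prints is meaningless; only the data flow is the point.

```python
import math
import random

PHONEMES = ["/b/", "/d/", "/g/"]

def conv_over_time(frames, kernels):
    """Slide each kernel (width x n_bins) along the time axis, with ReLU."""
    width = len(kernels[0])
    outputs = []
    for t in range(len(frames) - width + 1):
        window = frames[t:t + width]
        outputs.append([
            max(0.0, sum(w * x
                         for k_row, f_row in zip(kernel, window)
                         for w, x in zip(k_row, f_row)))
            for kernel in kernels
        ])
    return outputs

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(frames, kernels):
    """Return the most probable phoneme label and all class probabilities."""
    activations = conv_over_time(frames, kernels)
    scores = [sum(step[i] for step in activations) for i in range(len(kernels))]
    probs = softmax(scores)
    return PHONEMES[probs.index(max(probs))], probs

rng = random.Random(0)               # fixed seed so runs are repeatable
n_bins, n_frames, width = 8, 10, 3   # hypothetical sizes
kernels = [[[rng.uniform(-1, 1) for _ in range(n_bins)] for _ in range(width)]
           for _ in PHONEMES]        # one untrained filter per phoneme
frames = [[rng.random() for _ in range(n_bins)] for _ in range(n_frames)]

label, probs = classify(frames, kernels)
print(label, [round(p, 3) for p in probs])
```

A trained network would learn its filter weights from many labeled recordings instead of drawing them at random.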

26 of 41

Let’s return to our original question:

HOW DO COMPUTERS UNDERSTAND WHAT WE SAY?

UNDERSTANDING IMPLIES GRASPING THE MEANING OF WHAT WE SAY.

27 of 41

WHERE DOES MEANING COME FROM?

The transformation from sound to meaning takes place in stages, with increasingly abstract features and higher level knowledge applied at each stage.

The audio abstraction pipeline

28 of 41

What people understand about their own language

Every language has a lexicon consisting of all the words in that language.

All languages have a grammar: a set of rules for generating logical communication.

While every language has a different set of rules, all languages do obey rules.

You subconsciously know the rules of your language even though you probably can’t say what they are.

Major Levels of Linguistics

29 of 41

What knowledge do smart assistants need to understand human speech?

Notice any similarities to earlier slides?

Smart assistants use the same information that humans use to understand speech.

30 of 41

Meet our resident sound expert: Xenomorph from the movie Alien. She will tell us all about the stages of the audio abstraction pipeline.

“unforgettable sight”

Sound is variations in air pressure.

When we speak, the air is vibrating.

31 of 41

A microphone turns sound into an electrical signal, or “waveform”.

[Diagram: sound waves (“unforgettable sight”) -> microphone -> waveform]

32 of 41

A spectrum analyzer turns the waveform into a spectrogram.

[Diagram: sound waves (“unforgettable sight”) -> waveform -> spectrum analyzer -> spectrogram]

33 of 41

Sounds (phonemes) can be extracted from the spectrogram.

[Diagram: the pipeline so far]
sound waves (“unforgettable sight”)
-> waveform
-> spectrum analyzer -> spectrogram
-> phoneme recognizer (sounds) -> / ^ n f o r g ɜ t a b ʊ l # s aI t /

34 of 41

Morphemes (word stems and affixes) can be assembled from phonemes. But /s aI t/ is ambiguous: is it “site” or “sight” or “cite”?

[Diagram: the pipeline so far]
sound waves (“unforgettable sight”)
-> waveform
-> spectrum analyzer -> spectrogram
-> phoneme recognizer -> / ^ n f o r g ɜ t a b ʊ l # s aI t /
-> morpheme recognizer (stems and affixes) -> un + forget + able; site/sight/cite
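The morpheme step can be sketched with a few toy lookup tables: strip a known prefix and suffix, check that what remains is a known stem, and look up which spellings share a pronunciation. The tables below contain only this slide's example, the function names are hypothetical, and the rule for undoing a doubled consonant ("forgett" to "forget") is a simplification of real English spelling rules.

```python
PREFIXES = ["un"]
SUFFIXES = ["able"]
STEMS = {"forget"}
HOMOPHONES = {"s aI t": ["site", "sight", "cite"]}  # phonemes -> spellings

def split_morphemes(word):
    """Return [prefix, stem, suffix] pieces found in the toy tables."""
    for prefix in [""] + PREFIXES:
        for suffix in [""] + SUFFIXES:
            if not (word.startswith(prefix) and word.endswith(suffix)):
                continue
            middle = word[len(prefix):len(word) - len(suffix)]
            candidates = [middle]
            # English often doubles a final consonant before a suffix.
            if len(middle) >= 2 and middle[-1] == middle[-2]:
                candidates.append(middle[:-1])
            for stem in candidates:
                if stem in STEMS:
                    return [part for part in (prefix, stem, suffix) if part]
    return [word]  # no analysis found

def spellings_for(phonemes):
    """All known words that match a phoneme sequence."""
    return HOMOPHONES.get(phonemes, [])

print(" + ".join(split_morphemes("unforgettable")))
print("/".join(spellings_for("s aI t")))
```

Note that this step cannot choose among the three spellings; it can only list them for later stages.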

35 of 41

A parser assigns grammatical categories to words and groups them into phrases. “NP” means noun phrase.

[Diagram: the pipeline so far]
sound waves (“unforgettable sight”)
-> waveform
-> spectrum analyzer -> spectrogram
-> phoneme recognizer -> / ^ n f o r g ɜ t a b ʊ l # s aI t /
-> morpheme recognizer (stems and affixes) -> un + forget + able; site/sight/cite
-> parser (syntax) -> (NP (ADJ “unforgettable”) (N “site”/“sight”/“cite”))
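A parser for this one phrase can be sketched with a toy lexicon and a single grammar rule, NP -> ADJ N. The lexicon entries and the function name are assumptions for illustration. One difference from the slide is worth flagging: this sketch tags "cite" as a verb, so the grammar rule alone already rules it out, whereas the slide's parse tree keeps all three candidates and leaves the choice to the next stage.

```python
# Toy lexicon: word -> set of grammatical categories (assumed entries).
LEXICON = {
    "unforgettable": {"ADJ"},
    "site": {"N"},
    "sight": {"N"},
    "cite": {"V"},
}

def parse_noun_phrase(adjective, noun_candidates):
    """Apply the rule NP -> ADJ N; return a bracketed tree or None."""
    if "ADJ" not in LEXICON.get(adjective, set()):
        return None
    nouns = [w for w in noun_candidates if "N" in LEXICON.get(w, set())]
    if not nouns:
        return None
    noun_part = "/".join(f'"{w}"' for w in nouns)
    return f'(NP (ADJ "{adjective}") (N {noun_part}))'

print(parse_noun_phrase("unforgettable", ["site", "sight", "cite"]))
```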

36 of 41

A semantic analyzer gives the meaning of the phrase and resolves the ambiguity.

[Diagram: the full pipeline]
sound waves (“unforgettable sight”)
-> waveform
-> spectrum analyzer -> spectrogram
-> phoneme recognizer -> / ^ n f o r g ɜ t a b ʊ l # s aI t /
-> morpheme recognizer (stems and affixes) -> un + forget + able; site/sight/cite
-> parser (syntax) -> (NP (ADJ “unforgettable”) (N “site”/“sight”/“cite”))
-> semantic analyzer (meaning) -> unforgettable sight
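One simple way to picture the semantic step is as a plausibility check: which candidate word makes the most sense with "unforgettable"? In the sketch below, each candidate is given a rough sense label and each adjective-sense pairing gets a plausibility score. The sense labels and all of the numbers are invented for illustration; a real system would have to learn or derive such knowledge.

```python
# Rough sense of each candidate word (assumed labels).
NOUN_SENSE = {
    "site": "location",
    "sight": "visual experience",
    "cite": "reference",
}

# Made-up plausibility of an adjective describing each kind of thing.
PLAUSIBILITY = {
    ("unforgettable", "visual experience"): 0.9,
    ("unforgettable", "location"): 0.4,
    ("unforgettable", "reference"): 0.1,
}

def resolve(adjective, candidates):
    """Pick the candidate whose sense fits the adjective best."""
    return max(
        candidates,
        key=lambda word: PLAUSIBILITY.get((adjective, NOUN_SENSE[word]), 0.0),
    )

choice = resolve("unforgettable", ["site", "sight", "cite"])
print("unforgettable", choice)
```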

37 of 41

RECAP: THE AUDIO ABSTRACTION PIPELINE

  1. Waveform
  2. Spectrogram
  3. Sounds (Phonology): / ^ n f o r g ɜ t a b ʊ l # s aI t /
  4. Stems + Affixes (Morphology): un + forget + able; site/sight/cite
  5. Grammar (Syntax): (NP (ADJ “unforgettable”) (N “site”/“sight”/“cite”))
  6. Meaning (Semantics): “unforgettable sight”
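The recap can also be read as a chain of functions, where each stage takes the previous stage's output as its input. The sketch below wires up such a chain using lookup tables that contain only the slides' single example, so it illustrates the structure of the pipeline rather than any real recognition. The stage tables, the placeholder input string, and the function name `run_pipeline` are all hypothetical.

```python
SPECTROGRAM_INPUT = "<spectrogram of 'unforgettable sight'>"  # placeholder

PHONEMES = "/ ^ n f o r g ɜ t a b ʊ l # s aI t /"
MORPHEMES = "un + forget + able; site/sight/cite"
PARSE = '(NP (ADJ "unforgettable") (N "site"/"sight"/"cite"))'
MEANING = "unforgettable sight"

# Each stage is a stub that only knows the slide's example.
STAGES = [
    ("Sounds (Phonology)", {SPECTROGRAM_INPUT: PHONEMES}),
    ("Stems + Affixes (Morphology)", {PHONEMES: MORPHEMES}),
    ("Grammar (Syntax)", {MORPHEMES: PARSE}),
    ("Meaning (Semantics)", {PARSE: MEANING}),
]

def run_pipeline(stages, data):
    """Feed data through every stage, recording each intermediate result."""
    trace = []
    for name, table in stages:
        data = table[data]   # stage output becomes the next stage's input
        trace.append((name, data))
    return trace

for name, output in run_pipeline(STAGES, SPECTROGRAM_INPUT):
    print(f"{name}: {output}")
```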

38 of 41

OPTIONAL VIDEO SUMMARY OF THE AUDIO ABSTRACTION PIPELINE

39 of 41

HOW DO COMPUTERS RESOLVE AMBIGUITIES IN SPEECH?

Fact: Speech is ambiguous

40 of 41

GOOGLE SPEECH API: REVEALING THE AMBIGUITY

Try this Demo & Tutorial:

What do you observe about how this demo works?

41 of 41

SPEECH IS AMBIGUOUS

THE FOLLOWING PHRASES SOUND THE SAME:

  • “HOW TO RECOGNIZE SPEECH”
  • “HOW TO WRECK A NICE BEACH”

HOW DO WE COPE WITH AMBIGUITY?

  1. MULTIPLE SOURCES OF CONSTRAINT: PHONOLOGY, MORPHOLOGY, SYNTAX, SEMANTICS
  2. STATISTICAL KNOWLEDGE ABOUT WORD SEQUENCES GAINED FROM HUGE CORPORA

EXAMPLES OF HOMOPHONES

A corpus (pl. corpora) is a collection of spoken or written text.
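The second coping strategy, statistical knowledge about word sequences, can be sketched with a bigram model: count how often each word follows another in a corpus, then score a phrase by how familiar its word pairs are. The five-sentence corpus below is made up for illustration and is nowhere near "huge"; the add-one smoothing and the function name `log_score` are likewise just one simple choice among many.

```python
import math
from collections import Counter

# A tiny made-up corpus; real systems learn from vastly more text.
TOY_CORPUS = [
    "how to recognize speech",
    "computers recognize speech",
    "how to recognize a face",
    "a nice beach",
    "how to build a nice sandcastle",
]

bigram_counts = Counter()    # how often word b follows word a
context_counts = Counter()   # how often word a is followed by anything
vocabulary = set()
for sentence in TOY_CORPUS:
    words = sentence.split()
    vocabulary.update(words)
    for first, second in zip(words, words[1:]):
        bigram_counts[(first, second)] += 1
        context_counts[first] += 1

def log_score(phrase):
    """Sum of log bigram probabilities with add-one smoothing."""
    words = phrase.lower().split()
    vocab_size = len(vocabulary)
    return sum(
        math.log((bigram_counts[(a, b)] + 1) / (context_counts[a] + vocab_size))
        for a, b in zip(words, words[1:])
    )

for phrase in ("how to recognize speech", "how to wreck a nice beach"):
    print(phrase, round(log_score(phrase), 2))
```

With this toy corpus the first phrase scores higher (closer to zero), which is the kind of evidence a recognizer could use to prefer it when both phrases fit the sound equally well.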