HOW DO COMPUTERS UNDERSTAND WHAT WE SAY?
UNIT 2, MODULE 2.2
1
OBJECTIVES
2
Students should be able to:
3
Discussion
Do you use speech recognition on your device?
If so, which apps do you use and how well does it work?
4
TODAY WE ARE GOING TO LEARN
HOW SPEECH RECOGNITION WORKS
5
SPEECH RECOGNITION STARTS WITH SOUND
6
Physics tells us that sound is a rapid variation in air pressure.
In space, no one can hear you scream,
because there’s no sound in a vacuum.
A MICROPHONE IS A SENSOR THAT TURNS SOUND INTO AN ELECTRICAL SIGNAL, OR “WAVEFORM”.
7
A waveform is a plot of voltage over time.
Sound waves are variations in air pressure.
When we make sounds, the air vibrates.
The microphone converts air pressure to voltage.
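To make "a plot of voltage over time" concrete, here is a minimal sketch of how a computer might store a waveform: as a list of numbers, one per tiny time step. The sample rate, the 100 Hz tone, and the function name are assumptions chosen for illustration; a real microphone signal would be far less regular.

```python
import math

SAMPLE_RATE = 8000  # samples per second (an assumed value for this sketch)

def sine_waveform(freq_hz, duration_s, amplitude=1.0, sample_rate=SAMPLE_RATE):
    """Return voltage-like samples for a pure tone, one value per time step."""
    n_samples = round(duration_s * sample_rate)
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

# 20 milliseconds of a 100 Hz tone: two full cycles of the wave.
wave = sine_waveform(freq_hz=100, duration_s=0.02)
print(len(wave))            # number of samples stored
print(round(wave[20], 3))   # the value a quarter of the way through the first cycle
```

Plotting these values against time would give the kind of curve an oscilloscope displays.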
LET’S LOOK AT HOW PEOPLE CREATE SOUNDS
8
WHAT DID YOU NOTICE ABOUT HOW THE TONGUE AND LIPS WERE INVOLVED IN BEATBOXING?
9
Discussion
HUMAN ARTICULATORY APPARATUS: HOW WE PRODUCE SPEECH
10
Source: https://link.springer.com/referenceworkentry/10.1007%2F978-0-387-73003-5_199
ADDITIONAL MRI VOCAL TRACT VIDEOS
11
PLAY WITH THE VOCAL TRACT DEMO (HUMAN VOCAL TRACT)
12
Video of the demo: https://experiments.withgoogle.com/pink-trombone
WHAT DOES SOUND LOOK LIKE?
13
Time
Amplitude
Oscilloscope - a device for displaying waveforms
TRY IT: OSCILLOSCOPE DEMO
STEP 1: OPEN: HTTPS://ACADEMO.ORG/DEMOS/VIRTUAL-OSCILLOSCOPE/
STEP 2: SET SECONDS/DIV TO 5 MS
STEP 3: TALK INTO THE MIC TO SEE WHAT YOUR SPEECH LOOKS LIKE
14
LET’S CHOOSE A SIMPLE TYPE OF SOUND TO MAKE THE WAVEFORM UNDERSTANDABLE
STEP 3: IF YOU HAVE AN ELECTRONIC KEYBOARD YOU SHOULD SELECT AN ELECTRONIC ORGAN VOICE.
IF NOT, YOU CAN USE A VIRTUAL KEYBOARD:
STEP 4: PLAY A LOW NOTE AND THEN PLAY A HIGH NOTE. HOW DO THEY DIFFER?
STEP 5: PLAY A LOW NOTE AND A HIGH NOTE AT THE SAME TIME. HOW DO THE WAVEFORMS COMBINE?
15
low C
very high C
low + very high C
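The keyboard experiment can also be sketched in code. In this sketch a low note and a high note are modeled as pure sine tones, and playing them together is modeled as adding the two waveforms sample by sample. The specific pitches (about 65.41 Hz and 2093 Hz, the equal-tempered C2 and C7) are assumptions, since the slide does not say which octaves were played, and a real organ voice is richer than a single sine wave.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed)
LOW_C_HZ = 65.41     # assumed pitch for "low C"
HIGH_C_HZ = 2093.0   # assumed pitch for "very high C"

def tone(freq_hz, duration_s=0.1, sample_rate=SAMPLE_RATE):
    """Samples of a pure sine tone."""
    n = round(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

def mix(a, b):
    """Two sounds at once: the waveforms simply add."""
    return [x + y for x, y in zip(a, b)]

def count_upward_zero_crossings(samples):
    """How many times the waveform rises through zero (about once per cycle)."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev < 0 <= cur)

low = tone(LOW_C_HZ)
high = tone(HIGH_C_HZ)
both = mix(low, high)

print(count_upward_zero_crossings(low))   # few cycles: slow wiggle
print(count_upward_zero_crossings(high))  # many cycles: fast wiggle
```

The high note completes many more cycles in the same time, which is why its waveform looks more tightly packed. The mixed waveform is the fast wiggle riding on top of the slow one.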
TRY PLAYING A CHORD: MIDDLE C PLUS G GIVES A MORE COMPLEX WAVEFORM
16
Time
Amplitude
HOWEVER, HUMAN SPEECH IS A VERY COMPLEX WAVEFORM
17
Time
Amplitude
A SPECTROGRAM SHOWS THE FREQUENCY COMPONENTS OF SOUNDS
18
Frequency
Time
Do you see any differences in the axis labels?
This is human speech.
TRY IT: SPECTROGRAM DEMO
STEP 1: OPEN HTTPS://SPECTROGRAM.SCIENCEMUSIC.ORG/
STEP 2: CLICK ON THE MICROPHONE ICON TWICE TO START THE DEMO.
STEP 3: MAKE SOUNDS AND OBSERVE THE SPECTROGRAM.
19
Frequency
Time
HOW TO READ A SPECTROGRAM
20
Waveform: amplitude vs. time. Spectrogram: frequency vs. time.
This spectrogram shows the frequency: low C
This spectrogram shows the frequency: very high C
HOW TO READ A SPECTROGRAM
21
Waveform: amplitude vs. time. Spectrogram: frequency vs. time.
THIS SPECTROGRAM SHOWS THE CHORD: LOW C + VERY HIGH C
Labels: C, G, C+G
This spectrogram shows the chord: middle C plus G
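One way to see how a spectrogram finds the notes inside a mixed sound is the sketch below. It chops the waveform into short frames and measures how much of each frequency is present in every frame, using a plain discrete Fourier transform. This is not necessarily how the online demo works. The sample rate, frame size, and the two test tones (250 Hz and 2000 Hz, picked so they line up exactly with the frequency bins) are assumptions made for illustration.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second (assumed)
FRAME_SIZE = 256    # samples per frame, so bins are 8000 / 256 = 31.25 Hz apart

def tone(freq_hz, n_samples, sample_rate=SAMPLE_RATE):
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n_samples)]

def frame_spectrum(frame):
    """Strength of each frequency bin (0 .. N/2) in one frame, via a plain DFT."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(total))
    return mags

def spectrogram(samples, frame_size=FRAME_SIZE):
    """One spectrum per non-overlapping frame: rows are time, columns are frequency."""
    return [frame_spectrum(samples[start:start + frame_size])
            for start in range(0, len(samples) - frame_size + 1, frame_size)]

def peak_frequencies(spectrum, count, sample_rate=SAMPLE_RATE, frame_size=FRAME_SIZE):
    """Frequencies (Hz) of the `count` strongest bins, lowest first."""
    strongest = sorted(range(len(spectrum)), key=lambda k: spectrum[k], reverse=True)[:count]
    return sorted(k * sample_rate / frame_size for k in strongest)

# A low tone and a high tone played together, like the two-note chord.
low = tone(250, 2 * FRAME_SIZE)
high = tone(2000, 2 * FRAME_SIZE)
chord = [a + b for a, b in zip(low, high)]

spec = spectrogram(chord)
print(len(spec))                      # 2 frames
print(peak_frequencies(spec[0], 2))   # [250.0, 2000.0]
```

In the waveform the two tones are tangled together, but in the spectrum they show up as two separate peaks. That is the same effect as the two horizontal bands in the chord spectrogram.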
HUMAN SPEECH
SPEECH SPECTROGRAM OF DAVE SAYING: “EVERY CHILD DESERVES TO LEARN ABOUT ARTIFICIAL INTELLIGENCE.”
22
READING A SPECTROGRAM OF HUMAN SPEECH: FORMANTS F1, F2, AND F3
23
Formants are bands of high energy in the speech signal.
Different vowels and consonants have different formant patterns.
F3
F2
F1
24
/a/ as in “block”
/u/ as in “blue”
/i/ as in “bleed”
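Because each vowel has its own formant pattern, a computer can guess a vowel from the positions of F1 and F2 alone. The sketch below picks whichever of the three vowels above has the closest formants. The reference numbers are rough, commonly quoted averages for adult male speakers of American English and are an assumption here; real voices vary a lot, and real recognizers do not work from a three-row table.

```python
import math

# Approximate (F1, F2) in Hz for each vowel; assumed reference values.
TYPICAL_FORMANTS = {
    "/a/ as in block": (730, 1090),
    "/u/ as in blue": (300, 870),
    "/i/ as in bleed": (270, 2290),
}

def closest_vowel(f1, f2):
    """Return the vowel whose typical (F1, F2) is nearest to the measured pair."""
    def distance(item):
        ref_f1, ref_f2 = item[1]
        return math.hypot(f1 - ref_f1, f2 - ref_f2)
    return min(TYPICAL_FORMANTS.items(), key=distance)[0]

print(closest_vowel(700, 1100))  # /a/ as in block
print(closest_vowel(310, 900))   # /u/ as in blue
print(closest_vowel(280, 2200))  # /i/ as in bleed
```

Notice that /u/ and /i/ have nearly the same F1; it is F2 that tells them apart.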
FROM SPECTROGRAMS TO SOUNDS
THIS IS A LATE 1980S CONVOLUTIONAL NEURAL NET THAT RECOGNIZES SOUNDS.
SPECTROGRAMS GO IN; SOUNDS (PHONEMES) COME OUT: IN THIS CASE /B/, /D/, OR /G/.
25
waveform
spectrogram
neural network (phoneme recognizer)
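The "spectrogram in, phoneme out" idea can be sketched as follows. A small window of weights slides along the time axis of the spectrogram (that sliding is the convolution), the responses are averaged over time, and a softmax turns the three scores into probabilities for /b/, /d/, and /g/. Everything specific here is an assumption: the sizes are invented, the input is random numbers rather than real speech, and the weights are random and untrained, so the output probabilities are meaningless. It shows only the shape of the computation, not the network pictured on the slide.

```python
import math
import random

PHONEMES = ["b", "d", "g"]

def conv_over_time(spectrogram, kernel):
    """Slide one kernel (width x n_freq weights) along the time axis."""
    width = len(kernel)
    outputs = []
    for start in range(len(spectrogram) - width + 1):
        window = spectrogram[start:start + width]
        activation = sum(w * x
                         for kernel_row, frame in zip(kernel, window)
                         for w, x in zip(kernel_row, frame))
        outputs.append(max(0.0, activation))  # keep only positive responses
    return outputs

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def recognize(spectrogram, kernels):
    """One kernel per phoneme: average its response over time, then softmax."""
    scores = []
    for kernel in kernels:
        responses = conv_over_time(spectrogram, kernel)
        scores.append(sum(responses) / len(responses))
    return dict(zip(PHONEMES, softmax(scores)))

random.seed(0)
N_FRAMES, N_FREQ, WIDTH = 15, 16, 3  # invented sizes
fake_spectrogram = [[random.random() for _ in range(N_FREQ)] for _ in range(N_FRAMES)]
untrained_kernels = [[[random.uniform(-1, 1) for _ in range(N_FREQ)]
                      for _ in range(WIDTH)] for _ in PHONEMES]

probabilities = recognize(fake_spectrogram, untrained_kernels)
print(probabilities)
```

Training would mean adjusting the kernel weights until spectrograms of "b" sounds give /b/ the highest probability, and likewise for /d/ and /g/.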
HOW DO COMPUTERS UNDERSTAND WHAT WE SAY?
UNDERSTANDING IMPLIES GRASPING THE MEANING OF WHAT WE SAY.
26
Let’s return to our original question:
WHERE DOES MEANING COME FROM?
27
The transformation from sound to meaning takes place in stages, with increasingly abstract features and higher level knowledge applied at each stage.
The audio abstraction pipeline
What people understand about their own language
Every language has a lexicon consisting of all the words in that language.
All languages have a grammar: a set of rules for generating logical communication.
While every language has a different set of rules, all languages do obey rules.
You subconsciously know the rules of your language even though you probably can’t say what they are.
28
Major Levels of Linguistics
What knowledge do smart assistants need to understand human speech?
29
Notice any similarities to earlier slides?
Smart assistants use the same information that humans use to understand speech.
30
Sound is variations in air pressure.
“unforgettable sight”
Meet our resident sound expert - Xenomorph from the movie Alien. She will tell us all about the stages of the audio abstraction pipeline.
When we speak, the air is vibrating.
31
A microphone turns sound into an electrical signal, or “waveform”.
“unforgettable sight”
waveform
32
A spectrum analyzer turns the waveform into a spectrogram.
sound waves (“unforgettable sight”)
waveform
spectrum analyzer
spectrogram
33
Sounds (phonemes) can be extracted from the spectrogram.
sound waves (“unforgettable sight”)
waveform
spectrum analyzer
spectrogram
phoneme recognizer (sounds)
/ ^ n f o r g ɜ t a b ʊ l # s aI t /
34
Morphemes (word stems and affixes) can be assembled from phonemes. But /s aI t/ is ambiguous: is it “site” or “sight” or “cite”?
sound waves (“unforgettable sight”)
waveform
spectrum analyzer
spectrogram
phoneme recognizer
/ ^ n f o r g ɜ t a b ʊ l # s aI t /
morpheme recognizer (stems and affixes)
un + forget + able, site/sight/cite
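A toy version of the morpheme step might look like the sketch below. It strips a known prefix and suffix from a word and checks that what is left is a known stem. The tiny prefix, suffix, and stem lists and the homophone table are made up for this example, and the sketch works on spelled words rather than on phonemes to keep it short; a real morpheme recognizer would need a full lexicon.

```python
PREFIXES = ["un"]                  # made-up mini lists for illustration
SUFFIXES = ["able", "ing"]
STEMS = {"forget", "read", "play"}

# Words that share one pronunciation (written in the slide's phoneme notation).
HOMOPHONES = {"s aI t": ["site", "sight", "cite"]}

def split_morphemes(word):
    """Split a word into prefix + stem + suffix, or return None if no stem fits."""
    prefixes, suffixes = [], []
    rest = word
    for prefix in PREFIXES:
        if rest.startswith(prefix) and len(rest) > len(prefix):
            prefixes.append(prefix)
            rest = rest[len(prefix):]
            break
    for suffix in SUFFIXES:
        if rest.endswith(suffix) and len(rest) > len(suffix):
            suffixes.append(suffix)
            rest = rest[:-len(suffix)]
            break
    # English often doubles a final consonant before a suffix:
    # forget + able is spelled "forgettable", so undo the doubling.
    if rest not in STEMS and len(rest) >= 2 and rest[-1] == rest[-2] and rest[:-1] in STEMS:
        rest = rest[:-1]
    if rest not in STEMS:
        return None
    return prefixes + [rest] + suffixes

print(split_morphemes("unforgettable"))  # ['un', 'forget', 'able']
print(HOMOPHONES["s aI t"])              # ['site', 'sight', 'cite']
```

At this stage nothing tells the three candidate words apart; they are all passed along to the later stages.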
35
A parser assigns grammatical categories to words and groups them into phrases. “NP” means noun phrase.
sound waves (“unforgettable sight”)
waveform
spectrum analyzer
spectrogram
phoneme recognizer
/ ^ n f o r g ɜ t a b ʊ l # s aI t /
morpheme recognizer (stems and affixes)
un + forget + able, site/sight/cite
parser (syntax)
(NP (ADJ “unforgettable”) (N “site”/“sight”/“cite”))
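A very small parser for noun phrases could work as in the sketch below. It looks up each word's grammatical category and accepts the pattern "optional determiner, any number of adjectives, then a noun". The mini lexicon and the single rule are assumptions for illustration only; a real parser handles a full grammar.

```python
# Made-up mini lexicon: word -> grammatical category.
LEXICON = {
    "the": "DET",
    "an": "DET",
    "unforgettable": "ADJ",
    "sight": "N",
    "site": "N",
}

def parse_noun_phrase(words):
    """Accept [DET] ADJ* N and return it in bracket notation, else None."""
    tags = [LEXICON.get(word) for word in words]
    if not words or None in tags:
        return None
    i = 0
    if tags[i] == "DET":
        i += 1
    while i < len(tags) and tags[i] == "ADJ":
        i += 1
    # Exactly one word must remain, and it must be a noun.
    if i != len(tags) - 1 or tags[i] != "N":
        return None
    parts = " ".join(f'({tag} "{word}")' for tag, word in zip(tags, words))
    return f"(NP {parts})"

print(parse_noun_phrase(["unforgettable", "sight"]))
# (NP (ADJ "unforgettable") (N "sight"))
print(parse_noun_phrase(["sight", "unforgettable"]))
# None
```

The parser is equally happy with "unforgettable site", which is why a later stage is still needed to choose between the candidates.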
36
A semantic analyzer gives the meaning of the phrase and resolves the ambiguity.
sound waves (“unforgettable sight”)
waveform
spectrum analyzer
spectrogram
phoneme recognizer
/ ^ n f o r g ɜ t a b ʊ l # s aI t /
morpheme recognizer (stems and affixes)
un + forget + able, site/sight/cite
parser (syntax)
(NP (ADJ “unforgettable”) (N “site”/“sight”/“cite”))
semantic analyzer (meaning)
unforgettable sight
RECAP: THE AUDIO ABSTRACTION PIPELINE
37
Waveform
Spectrogram
Sounds (Phonology): / ^ n f o r g ɜ t a b ʊ l # s aI t /
Stems + Affixes (Morphology): un + forget + able, site/sight/cite
Grammar (Syntax): (NP (ADJ “unforgettable”) (N “site”/“sight”/“cite”))
Meaning (Semantics): “unforgettable sight”
OPTIONAL VIDEO SUMMARY OF THE AUDIO ABSTRACTION PIPELINE
38
HOW DO COMPUTERS RESOLVE AMBIGUITIES IN SPEECH?
39
Fact: Speech is ambiguous
GOOGLE SPEECH API: REVEALING THE AMBIGUITY
40
SPEECH IS AMBIGUOUS
THE FOLLOWING PHRASES SOUND THE SAME:
HOW DO WE COPE WITH AMBIGUITY?
**EXAMPLES OF HOMOPHONES
41
A corpus (pl. corpora) is a collection of spoken or written text.
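One way a corpus can help with ambiguity is by showing which words tend to follow which. The sketch below counts neighboring word pairs in a tiny corpus and uses those counts to choose between words that sound the same. The five sentences are invented for this example; a real system would learn from a very large corpus and use far more context than one preceding word.

```python
from collections import Counter

# A made-up mini corpus, for illustration only.
CORPUS = [
    "the sunset was an unforgettable sight",
    "what a beautiful sight",
    "the fireworks were an unforgettable sight",
    "we visited the construction site",
    "please cite your sources",
]

def bigram_counts(sentences):
    """Count how often each pair of neighboring words appears."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return counts

def pick_homophone(previous_word, candidates, counts):
    """Choose the candidate seen most often after previous_word.

    If none of the candidates was ever seen there, the first one is returned.
    """
    return max(candidates, key=lambda word: counts[(previous_word, word)])

counts = bigram_counts(CORPUS)
print(pick_homophone("unforgettable", ["site", "sight", "cite"], counts))  # sight
print(pick_homophone("construction", ["site", "sight", "cite"], counts))   # site
```

The same sound /s aI t/ gets a different spelling depending on the word before it, because the corpus shows which combination people actually use.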