Building speech technology for unwritten languages using visual information
Dr. Odette Scharenborg
SpeechLab/Multimedia Computing Group
Delft University of Technology
The Netherlands
Twitter: @Oscharenborg
E-mail: O.E.Scharenborg@tudelft.nl
~6900
[Scharenborg et al., IEEE TASL, 2020]
~2%
Unwritten languages
= Languages without a common writing system
~ 3000 languages in the world
[Scharenborg et al., IEEE TASL, 2020]
Building ASR without text?
… but any 4-year-old child demonstrates
Auditory information
+
Visual information
Visually-grounded speech processing
to discover word-like units from the speech signal,
using speech-image associations
[Harwath & Glass, 2015; Harwath et al., 2016; Chrupala et al., 2017]
Speech-based image retrieval
[“A brown and white dog is running through the snow”]
Example from FlickR8K + AMT recordings - 40k spoken captions [Harwath & Glass, 2015]
= whole sentence IN / image OUT
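A minimal sketch of the retrieval step, assuming the spoken caption and the images have already been encoded as vectors in a shared embedding space and that images are ranked by cosine similarity. The embedding values, file names, and the `retrieve` helper are hypothetical and only illustrate the idea; they are not taken from the models cited here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(speech_embedding, image_embeddings, k=1):
    """Return the ids of the k images most similar to the spoken query."""
    ranked = sorted(
        image_embeddings,
        key=lambda image_id: cosine(speech_embedding, image_embeddings[image_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy, hand-made embeddings (hypothetical values, not model output).
images = {
    "dog_in_snow.jpg": [0.9, 0.1, 0.0],
    "boy_on_bike.jpg": [0.1, 0.9, 0.2],
    "boat_on_lake.jpg": [0.0, 0.2, 0.9],
}
# Stands in for the encoded utterance
# "A brown and white dog is running through the snow".
spoken_caption = [0.8, 0.2, 0.1]

top = retrieve(spoken_caption, images, k=2)
print(top)
```

The same ranking applies whether the query is a whole sentence or an isolated word; only the speech input to the encoder changes.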
Research question and rationale
Successful word recognition
= Retrieval of images containing a target word’s visual referent
= Percentage of the retrieved test images that contain the target word's visual referent, according to our annotations
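The measure above can be sketched as a precision-style score over the retrieved images. This is an illustrative sketch only: the function name, the annotation format (a set of referent labels per image), and the toy data are assumptions, not the evaluation code used in the cited work.

```python
def word_recognition_score(retrieved, annotations, target_word):
    """Percentage of retrieved images annotated as containing the target referent."""
    if not retrieved:
        return 0.0
    hits = sum(
        1 for image_id in retrieved
        if target_word in annotations.get(image_id, set())
    )
    return 100.0 * hits / len(retrieved)


# Hypothetical annotations: which referents are visible in each test image.
annotations = {
    "img1": {"dog", "snow"},
    "img2": {"dog", "ball"},
    "img3": {"boy", "bike"},
    "img4": {"dog"},
}
# Hypothetical top-4 images returned for the spoken word "dog".
retrieved = ["img1", "img2", "img3", "img4"]

score = word_recognition_score(retrieved, annotations, "dog")
print(score)
```

Here three of the four retrieved images contain a dog, so the score for this toy query is 75%.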
Speech-based image retrieval
= The model learned to link the auditory signal for ‘dog’ to the visual representation of a dog
“dog”
[Havard, Chevrot, Besacier, CoNLL, 2019]
= isolated word IN / image OUT
Natural speech
“dog”
= isolated word IN / image OUT
[Scholten, Merkx, Scharenborg, ISCAS, 2021;
Merkx, Scholten, Frank, Ernestus, Scharenborg, Cognitive Computation, 2022]
50 nouns + 50 verbs
[Merkx, Scholten, Frank, Ernestus,
Scharenborg, Cognitive Computation, 2022]
VGS architecture
Image encoder Speech caption encoder
🡨 not used
🡨 MFCCs
[Merkx, Scholten, Frank, Ernestus, Scharenborg, Cognitive Computation, 2022]
Results
🡺 Generalisation to new speakers and recording conditions
Information from nouns vs. verbs
Time-course of word recognition
[Merkx, Scholten, Frank, Ernestus,
Scharenborg, Cognitive Computation, 2022]
(Psycho)linguistic findings
Generating images from synthesized speech descriptions
[Wang, Qiao, Zhu, Hanjalic, Scharenborg,
IEEE/ACM Trans. Audio, Sp and Lang Proc, 2021]
Natural speech of FlickR8K
Conclusions
Visually-grounded speech models
Acoustic models
Transcriptions are used to learn the mapping between the sounds of a language and their acoustic patterns
Transcription
Language models
Text is used to learn the order of words in a language
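As a rough illustration of how word order can be learned from text, the sketch below counts bigrams in a tiny corpus and turns the counts into next-word probabilities. The corpus, the function names, and the unsmoothed bigram estimate are assumptions chosen for brevity; real ASR language models are trained on far more text and use smoothing or neural architectures.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Sentence-boundary markers let the model learn typical first and last words.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probability(counts, prev, nxt):
    """Unsmoothed maximum-likelihood estimate of P(nxt | prev)."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0


# Toy corpus (hypothetical); the first sentence is the caption used in this talk.
corpus = [
    "a brown and white dog is running through the snow",
    "a dog is running",
    "a boy is running through the park",
]
model = train_bigram_model(corpus)

print(next_word_probability(model, "is", "running"))
print(next_word_probability(model, "the", "snow"))
```

Without any text, as in an unwritten language, there is nothing to count, which is why this component cannot be trained in the conventional way.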
Acoustic model
Transcription
[Scharenborg et al., IEEE TASL, 2020]
Language model
Discovering word-like units
Example from FlickR8K [Rashtchian et al., 2010]
Automatic speech recognition Visually grounded speech processing
“A brown and white dog is running through the snow”