1 of 26

Building speech technology for unwritten languages using visual information

Dr. Odette Scharenborg

SpeechLab/Multimedia Computing Group

Delft University of Technology

The Netherlands

Twitter: @Oscharenborg

E-mail: O.E.Scharenborg@tudelft.nl

2 of 26

~6900

[Scharenborg et al., IEEE TASL, 2020]

3 of 26

~2%

4 of 26

Unwritten languages

= Languages without a common writing system

~ 3000 languages in the world

Zero-resource approach: discover acoustic/linguistic units of the low-resource language from the raw speech data
Multi-language approach: try to create universal or cross-linguistic ASR systems by training on multiple languages
Parallel-data approach: pair speech from an unwritten language with translations in another language, often collected by field linguists

[Scharenborg et al., IEEE TASL, 2020]

Building ASR without text?

5 of 26

… but, as any 4-year-old child demonstrates, humans:

Learn a language communication system before learning to read and write

From raw sensory signals

With limited human supervision

Auditory information

+

Visual information

6 of 26

Visually-grounded speech processing

Use visual information (images)

to discover word-like units from the speech signal,

using speech-image associations

[Harwath & Glass, 2015; Harwath et al., 2016; Chrupała et al., 2017]

7 of 26

Speech-based image retrieval

[“A brown and white dog is running through the snow”]

Example from FlickR8K + AMT recordings - 40k spoken captions [Harwath & Glass, 2015]

= whole sentence IN / image OUT

8 of 26

Research question and rationale

Visually-grounded speech (VGS) models are exposed to complete speech utterances (no text) during training

But what word-like information do they actually encode?

How good are visually-grounded models at recognizing words?

To what extent do VGS models exhibit word recognition processes similar to human listeners?

9 of 26

Successful word recognition

= Retrieval of images containing a target word’s visual referent

For each target word embedding: Calculate the similarity to all test image embeddings and retrieve the ten most similar images

Multiple correct images can be found per word

Evaluation metric: Precision@10 (P@10)

= Proportion of the ten retrieved test images that contain the visual referent according to our annotations
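
The retrieval and scoring steps on this slide can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes cosine similarity as the similarity measure (the slide only says "similarity"), and the names `cosine`, `precision_at_k`, `image_embs`, and `relevant_ids` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def precision_at_k(word_emb, image_embs, relevant_ids, k=10):
    """Rank all images by similarity to the word embedding and score the top k.

    image_embs:   dict mapping image id -> embedding (hypothetical layout)
    relevant_ids: set of image ids annotated as containing the visual referent
    """
    ranked = sorted(image_embs,
                    key=lambda i: cosine(word_emb, image_embs[i]),
                    reverse=True)
    top_k = ranked[:k]
    hits = sum(1 for i in top_k if i in relevant_ids)
    # Fraction of retrieved images that show the referent.
    return hits / len(top_k) if top_k else 0.0
```

Because several test images can contain the same referent, a word can score anywhere between 0 and 1; a score of 0 means none of the ten retrieved images shows the referent.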

10 of 26

Speech-based image retrieval

MSCOCO synthetic speech: 80 nouns from object categories

Median P@10 = 0.8

The model encodes the presence of these nouns in the sentence representations
and ‘recognises’ individual words

= The model learned to link the auditory signal for ‘dog’ to the visual representation of a dog

“dog”

[Havard, Chevrot, Besacier, CoNLL, 2019]

= isolated word IN / image OUT

11 of 26

Natural speech

Does a VGS model trained on natural speech learn to recognise words, and does this generalise to isolated words?

Is the model’s word recognition process affected by word competition?

Does the model learn the difference between singular and plural nouns?

“dog”

= isolated word IN / image OUT

[Scholten, Merkx, Scharenborg, ISCAS, 2021; Merkx, Scholten, Frank, Ernestus, Scharenborg, Cognitive Computation, 2022]

12 of 26

Natural speech

Exp 1: Manually cut words from FlickR8K test sentences

With coarticulation
Continuous speech

Exp 2: Words produced in isolation
2 American English speakers (1 female, 1 male)

“Cleaner” speech
No coarticulation

“dog”

= isolated word IN / image OUT

[Scholten, Merkx, Scharenborg, ISCAS, 2021; Merkx, Scholten, Frank, Ernestus, Scharenborg, Cognitive Computation, 2022]

13 of 26

50 nouns + 50 verbs

[Merkx, Scholten, Frank, Ernestus, Scharenborg, Cognitive Computation, 2022]

14 of 26

VGS architecture

Image encoder | Speech caption encoder

🡨 not used

🡨 MFCCs

[Merkx, Scholten, Frank, Ernestus, Scharenborg, Cognitive Computation, 2022]

15 of 26

Results

Experiment 1: 50 nouns + 50 verbs manually cut from FlickR8K test sentences
P@10 = 0.44

Experiment 2: 50 nouns + 50 verbs produced in isolation
Singular words: P@10 = 0.52 / median P@10 = 0.6
Plural words: P@10 = 0.48 / median P@10 = 0.5

🡺 Generalisation to new speakers and recording conditions

16 of 26

Information from nouns vs. verbs

Recognition of verbs is overall much worse than recognition of nouns
Many verbs have a P@10 of zero
They are not recognised at all
Verb information is harder to extract from images

17 of 26

Time-course of word recognition

[Merkx, Scholten, Frank, Ernestus, Scharenborg, Cognitive Computation, 2022]

Gating experiment
Recognition scores increase when more phonemes are processed
Longer words are recognised better than shorter words
For the plural nouns, recognition scores tend to drop at the last phoneme
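
A gating experiment presents the model with progressively longer word-initial fragments and scores recognition after each one. The sketch below shows one way such fragments could be cut from a recording. It is an assumption-laden illustration, not the procedure from the cited paper: it assumes phoneme end times are already available (for example from a forced alignment), and `gate_word` and its parameters are hypothetical names.

```python
def gate_word(samples, sample_rate, phoneme_end_times):
    """Return one truncated copy of `samples` per phoneme.

    Gate n contains the audio from word onset up to the end of the n-th
    phoneme, so the last gate is (up to rounding) the whole word.

    samples:           sequence of audio samples for one isolated word
    sample_rate:       samples per second
    phoneme_end_times: end time of each phoneme in seconds (assumed given)
    """
    gates = []
    for end in phoneme_end_times:
        n = min(len(samples), round(end * sample_rate))
        gates.append(samples[:n])
    return gates
```

Each gate would then be encoded and scored with P@10 in the same way as a full word, which yields a recognition score per number of phonemes processed.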

18 of 26

(Psycho)linguistic findings

The higher the number of consonants in the input word, the higher the recognition performance

The more similar word candidates there are (i.e., a larger word-initial cohort), the lower the recognition performance
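
The word-initial cohort of a target word is the set of other words that share its first phonemes. A minimal sketch of how a cohort could be computed from a pronunciation lexicon is given below. The function name `cohort`, the lexicon layout, and the toy ARPAbet-style transcriptions are illustrative assumptions, not the materials used in the study.

```python
def cohort(target, lexicon, n_phonemes):
    """Return the words that share the first `n_phonemes` phonemes with `target`.

    lexicon: dict mapping word -> list of phoneme symbols (hypothetical layout)
    """
    prefix = lexicon[target][:n_phonemes]
    return sorted(word for word, phones in lexicon.items()
                  if word != target and phones[:n_phonemes] == prefix)

# Toy lexicon with simplified transcriptions, for illustration only.
toy_lexicon = {
    "dock": ["D", "AA", "K"],
    "doll": ["D", "AA", "L"],
    "dog":  ["D", "AO", "G"],
    "cat":  ["K", "AE", "T"],
}

print(cohort("dock", toy_lexicon, 1))  # competitors after the first phoneme
print(cohort("dock", toy_lexicon, 2))  # the cohort shrinks as more is heard
```

The cohort shrinks as more phonemes are processed, which is consistent with recognition scores rising over the course of the gating experiment.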

19 of 26

Generating images from synthesized speech descriptions

[Wang, Qiao, Zhu, Hanjalic, Scharenborg, IEEE/ACM Trans. Audio, Speech, and Language Processing, 2021]

Translating speech descriptions to photo-realistic images
Without using text information
Synthesized speech descriptions for the CUB-200 bird database

Generates high-quality photos that capture semantically consistent information, e.g., regarding color

20 of 26

Natural speech of FlickR8K

21 of 26

Conclusions

Visually-grounded speech models

Learn the mapping between spoken words and information in the images;
this works better for picturable words such as nouns than for verbs
Can generalize this ability to “new” speech
Can recognize words from only a partial phoneme sequence; thus, the model learns “phoneme” information
Learn linguistic (= morphological) information about plurals and singulars
Are affected by word competition
Also capture more fine-grained semantic information
Limitation: Performance on retrieval tasks still needs improvement

22 of 26

23 of 26

Acoustic models

Transcriptions are used to learn the mapping between the sounds of a language and their acoustic patterns

Transcription

24 of 26

Language models

Text is used to learn the order of words in a language
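
To make concrete why a conventional language model depends on written text, here is a minimal bigram sketch. A bigram model is only one classic way to model word order and is chosen here for illustration; the slide does not say which type of language model is meant. The names `train_bigram` and `bigram_prob` are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count how often each word follows each other word in written text."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def bigram_prob(counts, prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev), without smoothing."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0
```

Every count in this sketch comes from orthographic words, which is exactly the resource an unwritten language lacks.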

25 of 26

Acoustic model

Transcription

[Scharenborg et al., IEEE TASL, 2020]

Language model

26 of 26

Discovering word-like units

Example from FlickR8K [Rashtchian et al., 2010]

Automatic speech recognition | Visually grounded speech processing

“A brown and white dog is running through the snow”
