1 of 26

Building speech technology for unwritten languages using visual information

Dr. Odette Scharenborg

SpeechLab/Multimedia Computing Group

Delft University of Technology

The Netherlands

Twitter: @Oscharenborg

E-mail: O.E.Scharenborg@tudelft.nl


2 of 26

~6900

[Scharenborg et al., IEEE TASL, 2020]


3 of 26

~2%


4 of 26

Unwritten languages

= Languages without a common writing system

~ 3000 languages in the world

  • Zero-resource approach: discover acoustic/linguistic units of the low-resource language from the raw speech data
  • Multi-language approach: try to create universal or cross-linguistic ASR systems by training on multiple languages
  • Translation approach: use parallel data pairing speech in the unwritten language with translations in another language, often collected by field linguists

[Scharenborg et al., IEEE TASL, 2020]

Building ASR without text?


5 of 26

… but any 4-year-old child demonstrates that it is possible to:

  • Learn a language communication system before learning to read and write

  • From raw sensory signals

  • With limited human supervision

Auditory information

+

Visual information


6 of 26

Visually-grounded speech processing

  • Use visual information (images) to discover word-like units from the speech signal, using speech-image associations

[Harwath & Glass, 2015; Harwath et al., 2016; Chrupała et al., 2017]


7 of 26

Speech-based image retrieval

[“A brown and white dog is running through the snow”]

Example from FlickR8K + AMT recordings - 40k spoken captions [Harwath & Glass, 2015]

= whole sentence IN / image OUT
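The retrieval step above amounts to a nearest-neighbour search in a shared embedding space: encode the spoken caption, encode all candidate images, and rank images by similarity. A minimal sketch, with random vectors standing in for the (hypothetical) encoder outputs:

```python
import numpy as np

def rank_images(caption_emb, image_embs):
    """Rank images by cosine similarity to a spoken-caption embedding.

    caption_emb: (d,) embedding of the whole spoken sentence.
    image_embs:  (n, d) embeddings of the candidate images.
    Returns image indices, most similar first.
    """
    c = caption_emb / np.linalg.norm(caption_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = imgs @ c               # cosine similarity of each image to the caption
    return np.argsort(-sims)      # best match first

# Toy example: random embeddings in place of trained encoder outputs.
rng = np.random.default_rng(0)
caption = rng.normal(size=32)
images = rng.normal(size=(100, 32))
order = rank_images(caption, images)
print(order[:10])                 # indices of the ten retrieved images
```

In a trained model the caption embedding would come from the speech encoder and the image embeddings from the image encoder; the ranking logic itself stays this simple.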


8 of 26

Research question and rationale

  • Visually-grounded speech (VGS) models are exposed to complete speech utterances (no text) during training

  • But what word-like information do they actually encode?

  • How good are visually-grounded models at recognizing words?

  • To what extent do VGS models exhibit word recognition processes similar to human listeners?


9 of 26

Successful word recognition

= Retrieval of images containing a target word’s visual referent

  • For each target word embedding: Calculate the similarity to all test image embeddings and retrieve the ten most similar images

  • Multiple correct images can be found per word

  • Evaluation metric: Precision@10 (P@10)

= Percentage of those retrieved test images that contain the visual referent according to our annotations
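The evaluation described above can be written down directly. A minimal sketch with toy embeddings, where the set `relevant` stands in for the manual annotations of which images contain the referent:

```python
import numpy as np

def precision_at_10(word_emb, image_embs, relevant):
    """P@10: fraction of the ten retrieved images whose index is in
    `relevant`, i.e. that contain the word's visual referent."""
    w = word_emb / np.linalg.norm(word_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    top10 = np.argsort(-(imgs @ w))[:10]       # ten most similar images
    return sum(int(i) in relevant for i in top10) / 10

# Toy data: 4 images aligned with the word embedding, 16 orthogonal ones.
word = np.array([1.0, 0.0])
images = np.vstack([np.tile([1.0, 0.0], (4, 1)),
                    np.tile([0.0, 1.0], (16, 1))])
print(precision_at_10(word, images, relevant={0, 1, 2, 3}))  # → 0.4
```

All four relevant images land in the top ten, but only four of the ten retrieved images are relevant, hence P@10 = 0.4.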


10 of 26

Speech-based image retrieval

  • MSCOCO synthetic speech: 80 nouns from object categories

  • Median P@10 = 0.8

  • Model encodes the presence of these nouns in the sentence representations
  • And ‘recognises’ individual words

= The model learned to link the auditory signal for ‘dog’ to the visual representation of a dog

“dog”

[Havard, Chevrot, Besacier, CoNLL, 2019]

= isolated word IN / image OUT


11 of 26

Natural speech

  • Does a VGS model trained on natural speech learn to recognise words, and does this generalise to isolated words?

  • Is the model’s word recognition process affected by word competition?

  • Does the model learn the difference between singular and plural nouns?

“dog”

= isolated word IN / image OUT

[Scholten, Merkx, Scharenborg, ISCAS, 2021;

Merkx, Scholten, Frank, Ernestus, Scharenborg, Cognitive Computation, 2022]


12 of 26

Natural speech

  • Exp 1: Manually cut words from FlickR8K test sentences
    • With coarticulation
    • Continuous speech

  • Exp 2: Produced in isolation
  • Two American English speakers (1 female, 1 male)
    • “Cleaner”
    • No coarticulation

“dog”

= isolated word IN / image OUT

[Scholten, Merkx, Scharenborg, ISCAS, 2021;

Merkx, Scholten, Frank, Ernestus, Scharenborg, Cognitive Computation, 2022]


13 of 26

50 nouns + 50 verbs

[Merkx, Scholten, Frank, Ernestus,

Scharenborg, Cognitive Computation, 2022]


14 of 26

VGS architecture

  • Image encoder (text captions not used)
  • Speech caption encoder (input: MFCCs)

[Merkx, Scholten, Frank, Ernestus, Scharenborg, Cognitive Computation, 2022]
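The two encoders are trained so that matched speech-image pairs end up close together in a joint embedding space, typically with a margin-based ranking (hinge) loss over a batch. A minimal numpy sketch of such a loss; the margin value and batch layout are illustrative assumptions, not the exact training setup of the paper:

```python
import numpy as np

def batch_hinge_loss(speech, image, margin=0.2):
    """Margin-based ranking loss used in many VGS models: each matched
    speech-image pair should be more similar than any mismatched pair,
    by at least `margin`.

    speech, image: (n, d) batches of encoder outputs, row i matched.
    """
    s = speech / np.linalg.norm(speech, axis=1, keepdims=True)
    v = image / np.linalg.norm(image, axis=1, keepdims=True)
    sims = s @ v.T                 # (n, n): sims[i, j] = sim(speech_i, image_j)
    pos = np.diag(sims)            # matched pairs sit on the diagonal
    cost_im = np.maximum(0, margin + sims - pos[None, :])  # wrong speech for an image
    cost_sp = np.maximum(0, margin + sims - pos[:, None])  # wrong image for a speech
    n = len(pos)
    off = ~np.eye(n, dtype=bool)   # exclude the matched pairs themselves
    return (cost_im[off].sum() + cost_sp[off].sum()) / n
```

During training, minimising this loss pulls each spoken caption toward its own image and pushes it away from the other images in the batch, which is what makes similarity-based retrieval work at test time.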


15 of 26

Results

  • Experiment 1: 50 nouns + 50 verbs manually cut from FlickR8K test sentences
  • P@10 = 0.44

  • Experiment 2: 50 nouns + 50 verbs produced in isolation
  • Singular words: P@10 = 0.52 / median P@10 = 0.60
  • Plural words: P@10 = 0.48 / median P@10 = 0.50

🡺 Generalisation to new speakers and recording conditions


16 of 26

Information from nouns vs. verbs

  • Recognition of verbs is overall much worse than that of nouns
  • Many verbs have a P@10 of zero, i.e., they are not recognised at all
  • Verb information is harder to extract from images


17 of 26

Time-course of word recognition

[Merkx, Scholten, Frank, Ernestus,

Scharenborg, Cognitive Computation, 2022]

  • Gating experiment
  • Recognition scores increase as more phonemes are processed
  • Longer words are recognised better than shorter words
  • For the plural nouns, recognition scores tend to drop at the last phoneme
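A gating experiment like the one above can be simulated by embedding ever-longer prefixes of a word's acoustic frames and re-running retrieval at each gate. A sketch assuming a hypothetical `embed_speech` encoder (stubbed below with mean pooling) and toy image embeddings:

```python
import numpy as np

def gating_curve(frames, embed_speech, image_embs, relevant, n_gates=8):
    """Embed increasing prefixes (gates) of a word's acoustic frames
    and track retrieval precision (P@10) at each gate.

    frames:       (T, d) acoustic features (e.g. MFCCs) of the word
    embed_speech: hypothetical encoder, maps (t, d) -> embedding vector
    """
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = []
    for g in range(1, n_gates + 1):
        t = max(1, round(len(frames) * g / n_gates))
        emb = embed_speech(frames[:t])          # prefix heard up to this gate
        emb = emb / np.linalg.norm(emb)
        top10 = np.argsort(-(imgs @ emb))[:10]
        scores.append(sum(int(i) in relevant for i in top10) / 10)
    return scores

# Toy run: random features, mean pooling in place of the trained encoder.
rng = np.random.default_rng(1)
curve = gating_curve(rng.normal(size=(40, 8)),
                     lambda x: x.mean(axis=0),
                     rng.normal(size=(50, 8)),
                     relevant={0, 1})
```

With a trained VGS encoder, this per-gate P@10 curve is what should rise as more phonemes become available, mirroring the finding above.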


18 of 26

(Psycho)linguistic findings

  • The more consonants the input word contains, the higher the recognition performance

  • The more similar word candidates there are (i.e., the larger the word-initial cohort), the lower the recognition performance
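The word-initial cohort effect can be quantified by counting, for each target, how many lexicon entries share its onset. A toy sketch over orthographic strings; a real analysis would use phoneme transcriptions, and the lexicon here is invented for illustration:

```python
def cohort_size(word, lexicon, prefix_len=2):
    """Number of other lexicon entries sharing the target word's
    initial segments (its word-initial cohort); a larger cohort
    means more competition during recognition."""
    prefix = word[:prefix_len]
    return sum(w.startswith(prefix) for w in lexicon if w != word)

lexicon = ["cat", "cap", "can", "car", "dog", "dot", "bird"]
print(cohort_size("cat", lexicon))   # → 3  ("cap", "can", "car")
```

Regressing recognition scores on a predictor like this is one way to test whether the model, like human listeners, is hindered by larger cohorts.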


19 of 26

Generating images from synthesized speech descriptions

[Wang, Qiao, Zhu, Hanjalic, Scharenborg,

IEEE/ACM Trans. Audio, Sp and Lang Proc, 2021]

  • Translating speech descriptions into photo-realistic images
  • Without using any text information
  • Synthetic speech for the CUB-200 bird database

  • The model generates high-quality images with semantically consistent information, e.g., regarding colour


20 of 26

Natural speech of FlickR8K


21 of 26

Conclusions

Visually-grounded speech models

  • Learn the mapping between spoken words and information in the images
  • which works better for picturable words, such as nouns, than for verbs
  • Can generalize this ability to “new” speech
  • Can recognize words from only a partial phoneme sequence; thus, the model learns “phoneme” information
  • Learn linguistic (= morphological) information about plurals and singulars
  • Are affected by word competition
  • Also capture more fine-grained semantic information
  • Limitation: Performance on retrieval tasks still needs improvement


22 of 26


23 of 26

Acoustic models

Transcriptions are used to learn the mapping between the sounds of a language and their acoustic patterns



24 of 26

Language models

Text is used to learn the order of words in a language


25 of 26

Acoustic model

Transcription

[Scharenborg et al., IEEE TASL, 2020]

Language model


26 of 26

Discovering word-like units

Example from FlickR8K [Rashtchian et al., 2010]

Automatic speech recognition vs. visually grounded speech processing

“A brown and white dog is running through the snow”
