Project Background

  • Generate both text and speech data in South Africa’s official languages
  • Release open-access, public-domain datasets
  • Collect language data from domains beyond the government domain covered by the NCHLT Text Corpus
  • Develop a digital presence for under-resourced languages (Tshivenda, Xitsonga, etc.)
  • Empower language communities
  • Establish a point of departure for language projects and technologies

Mozilla Common Voice

  • …an initiative to help teach machines how real people speak (https://commonvoice.mozilla.org/en)
  • Language communities can donate their voices to an open-source voice database
  • Licensed under Creative Commons Zero (CC0): freely accessible to all
  • Self-driven peer evaluations for quality assurance

How does Common Voice work? (Speak)

  1. Choose your language at the top right corner of the page
  2. Click on the microphone next to “Speak”
  3. Click on the microphone on the contribution page
  4. Read the sentence on the screen while recording your voice
  5. Submit your recording to see the next sentence, or click “Skip” if you are not comfortable reading a specific sentence

How does Common Voice work? (Listen)

  1. Choose your language at the top right corner of the page
  2. Click on the Play icon next to “Listen”
  3. Read the sentence on the screen
  4. Click on the Play icon to listen to recordings of other participants
  5. Ensure that the sound quality is good and that the spoken words match the sentence on the screen
  6. Click “Yes” to verify that the voice donation is correct or “No” to discard it

For Researchers

  • Localize the Common Voice website for a specific language using Pontoon

https://pontoon.mozilla.org/

  • Contribute sentences to the dataset by uploading Creative Commons Zero (CC0) sentences via GitHub

https://github.com/common-voice/common-voice

  • Get the language community involved by creating campaigns and events

https://common-voice.github.io/community-playbook/sub_pages/mobilization.html
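
For researchers preparing sentences to contribute, it may help to clean the list before uploading it. The sketch below is only an illustration: the function name clean_sentences, the 14-word limit and the "no digits" rule are assumptions made for this example, not rules taken from these slides, so check the current Common Voice sentence guidelines before relying on them.

```python
import unicodedata

# Assumed limit for illustration only; confirm against the current guidelines.
MAX_WORDS = 14


def clean_sentences(lines, max_words=MAX_WORDS):
    """Return unique, whitespace-normalized sentences that pass simple checks."""
    seen = set()
    kept = []
    for raw in lines:
        # Normalize Unicode composition and collapse runs of whitespace.
        sentence = " ".join(unicodedata.normalize("NFC", raw).split())
        if not sentence:
            continue  # skip empty lines
        if any(ch.isdigit() for ch in sentence):
            continue  # assumed rule: digits are ambiguous to read aloud
        if len(sentence.split()) > max_words:
            continue  # assumed rule: keep sentences short enough to record
        key = sentence.casefold()
        if key in seen:
            continue  # drop duplicates, ignoring case
        seen.add(key)
        kept.append(sentence)
    return kept


if __name__ == "__main__":
    sample = ["Ndi matsheloni.", "  Ndi   matsheloni. ", "Room 12 is open.", ""]
    for sentence in clean_sentences(sample):
        print(sentence)
```

Only sentences that you are entitled to release under CC0 should go into the input list; the script checks form, not licensing.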