Project Background

Generate both text and speech data in South Africa’s official languages

Open access, public domain datasets

Collect language data from domains beyond the government domain covered by the NCHLT Text Corpus

Develop a digital presence for under-resourced languages (Tshivenda, Xitsonga, etc.)

Empower language communities

Establish a point of departure for language projects and technologies

Mozilla Common Voice

…an initiative to help teach machines how real people speak (https://commonvoice.mozilla.org/en)
Language communities can donate their voices to an open-source voice database
Creative Commons CC0 (public domain dedication) – freely accessible to all
Self-driven peer evaluations for quality assurance

How does Common Voice work? Speak (donating your voice)

Choose your language at the top right corner of the page
Click on the microphone next to “Speak”
Click on the microphone on the contribution page
Read the sentence on the screen while recording your voice
Submit your recording to see the next sentence, or click “Skip” if you are not comfortable reading a specific sentence

How does Common Voice work? Listen (validating recordings)

Choose your language at the top right corner of the page
Click on the Play icon next to “Listen”
Read the sentence on the screen
Click on the Play icon to listen to recordings of other participants
Ensure that the sound quality is good and that the spoken words match the sentence on the screen
Click “Yes” to verify that the voice donation is correct or “No” to discard it

For Researchers

Contribute sentences to the dataset by uploading sentences released under Creative Commons CC0 via GitHub
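
Before uploading a sentence file, it can help to tidy it locally. The sketch below is one possible pre-check, not an official Common Voice tool: the function name `clean_sentences`, the 14-word limit and the no-digits rule are assumptions made for illustration, so check the project's current contribution guidelines for the actual requirements.

```python
import re

# Assumed limit for illustration only; not taken from the slides.
MAX_WORDS = 14


def clean_sentences(lines, max_words=MAX_WORDS):
    """Return trimmed, de-duplicated sentences that pass simple checks."""
    seen = set()
    kept = []
    for line in lines:
        # Collapse runs of whitespace and strip the ends.
        sentence = " ".join(line.split())
        if not sentence:
            continue  # skip blank lines
        if len(sentence.split()) > max_words:
            continue  # too long to read aloud comfortably (assumed rule)
        if re.search(r"\d", sentence):
            continue  # digits can be read aloud in several ways (assumed rule)
        key = sentence.casefold()
        if key in seen:
            continue  # duplicate, ignoring case
        seen.add(key)
        kept.append(sentence)
    return kept
```

A typical use would be to read a text file with one sentence per line, pass the lines to `clean_sentences`, and write the result to a new file for upload.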