African Language Grid User Guide
Welcome to the African Language Grid, which attempts to identify all languages spoken in Africa, and the knowledge and ICT resources available for each. Please let martin at kamusi dot org know about any errors or omissions that you may find.
The current version of the Grid is only available on Google Sheets. We would like to convert the project to a user-friendly website driven by a real database, but that requires funding that is not generally forthcoming for projects involving African languages. Meanwhile, you are invited to use the spreadsheet as is, or save a copy to your own account for personal use (select File > Make a copy). You can also download the Grid and open it in a local copy of OpenOffice or Excel, which may make it easier to sort, filter, and search, especially if you have a problematic Internet connection. Also, because the master file is “live”, you might experience erratic behavior on Google as we sort, filter, and revise data on the back end.
Be aware that we continually update the contents of the Grid, so you should update your personal copy from time to time. The shortlink http://kamu.si/african-language-grid will stay current via kamfupi.link, even if we migrate to a better data platform.
The spreadsheet contains two tabs. “AllAfrica” is the primary tab. Column A contains the three-letter ISO 639-3 code for each language, and all the other columns contain information relating to the language designated by that code. About 150 languages do not have ISO codes but have been recognized as languages by organizations or researchers outside of the ISO framework; these are listed in the Grid as “unc”, for uncoded. The “CountryGrid” tab lists the languages spoken in each country, as identified in Column E of the “AllAfrica” tab. The two tabs are not fully consistent: 6 languages from AllAfrica are missing on CountryGrid, and a few late decisions about the Column B primary name in AllAfrica were not replicated to CountryGrid. These discrepancies will be ironed out if and when we can transition to a proper database; in the meantime, please alert us to any that you notice. The remainder of this User Guide pertains to the AllAfrica tab.
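If you work from a downloaded copy, the column layout described above lends itself to simple scripted filtering. The following is a minimal sketch, assuming you have exported the AllAfrica tab as CSV. The function names, the number of header rows to skip, and the sample rows are all hypothetical placeholders, not real Grid data or an official tool.

```python
import csv
import io

# Column positions as described in this guide (0-based):
# A = ISO 639-3 code, B = primary English name, E = countries.
COL_CODE, COL_NAME, COL_COUNTRIES = 0, 1, 4


def read_rows(csv_text, header_rows=1):
    """Parse CSV text and skip the leading header rows.

    The number of header rows in a real export is an assumption;
    adjust it to match your own copy.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    return [r for r in rows[header_rows:] if len(r) > COL_COUNTRIES]


def uncoded(rows):
    """Return rows whose Column A holds the Grid's "unc" marker."""
    return [r for r in rows if r[COL_CODE].strip().lower() == "unc"]


def spoken_in(rows, country):
    """Return rows whose Column E mentions the country (substring match)."""
    needle = country.lower()
    return [r for r in rows if needle in r[COL_COUNTRIES].lower()]


# Placeholder rows for illustration only; these are not real Grid entries.
SAMPLE = """code,name,alternate,french,countries
aaa,Language One,,Langue Un,Country X
unc,Language Two,,Langue Deux,"Country X, Country Y"
bbb,Language Three,,Langue Trois,Country Y
"""

rows = read_rows(SAMPLE)
print([r[COL_NAME] for r in uncoded(rows)])
print([r[COL_NAME] for r in spoken_in(rows, "Country Y")])
```

The substring match is deliberately crude: a search for one country name will also match any longer country name that contains it (for example, "Niger" inside "Nigeria"), so check the results by eye or split Column E on whatever separator your copy uses.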
Listing “every language” runs into all sorts of problems regarding language names. Some languages have many names. Some named tongues might in fact be dialects of a larger language. The Grid does not pretend to resolve these issues, but it does provide information that can lead to resolution through further research. At present, Column B lists what might be the most common name for each language when referenced in English. Column C lists known alternate names, usually in English or the language’s own name for itself (its endonym). Column D lists a French version of the language name. Wessel Poelman at KU Leuven in Belgium has kindly prepared a file that lists all of the known language names as rendered in all the languages that have been submitted to CLDR. That is too much data to include in the Grid until we can migrate from Sheets to an actual database, but you can view and download the file at http://kamu.si/name-mapping-across-languages.
Column E identifies the countries where a language is known to be spoken indigenously. Languages generally pre-date national borders and often overlap them, and speaker populations shrink, swell, and migrate, so the data will occasionally miss a country or list one that it shouldn’t.
Column F links to resource pages for 136 languages. For each of these languages, we created a spreadsheet where community members could contribute references to print or digital resources. This experiment proved too cumbersome to get very far in Sheets, but could be revived in an interactive database.
Column G holds miscellaneous notes: relevant information about a language that did not have another home.
Column H shows the Glottocode for each language that has one, and links to the appropriate Glottolog page. Some languages have Glottocodes but not ISO codes. Some have ISO codes but not Glottocodes. The Glottolog page tends to have important information for each of its languages. Its bibliography of print resources, at the bottom of each page, is invaluable, and should be scoured for in-depth research on any language. Many of the listed readings are decades or centuries old and must be requisitioned from library shelves; rather than dismissing them as “out of date”, researchers are encouraged to view them as treasures that might showcase the best available scholarship about a particular language.
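If the links in your personal copy are lost during export, a Glottolog page address can usually be rebuilt from the Glottocode alone. The sketch below assumes the commonly seen Glottocode shape (four lowercase letters or digits followed by four digits) and the Glottolog languoid URL pattern; both are assumptions that you should verify against Glottolog itself, and "abcd1234" is a placeholder rather than a real code.

```python
import re
from typing import Optional

# Assumed Glottocode shape: four lowercase alphanumerics plus four digits.
GLOTTOCODE_RE = re.compile(r"[a-z0-9]{4}[0-9]{4}")

# Assumed URL pattern for a Glottolog languoid page.
GLOTTOLOG_BASE = "https://glottolog.org/resource/languoid/id/"


def is_glottocode(value: str) -> bool:
    """Check whether a cell value has the assumed Glottocode shape."""
    return GLOTTOCODE_RE.fullmatch(value.strip()) is not None


def glottolog_url(value: str) -> Optional[str]:
    """Build a Glottolog page URL, or return None for blank or odd cells."""
    code = value.strip()
    if not is_glottocode(code):
        return None
    return GLOTTOLOG_BASE + code


print(glottolog_url("abcd1234"))  # placeholder code, not a real language
print(glottolog_url(""))          # languages without a Glottocode
```

Returning None for empty or malformed cells keeps the languages that have an ISO code but no Glottocode from producing broken links.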
Column I links to a Wikipedia page for almost every language. Some of these pages give a fair amount of useful information. Some merely note that a language exists, or existed in the past. (The Grid lists the occasional extinct language when there seem to be resources that may be deployed for digital investigation.) Some languages do not have their own Wikipedia page, but are referenced on a page for a larger language of which they may be considered a dialect.
Column J gives population estimates for many of the languages. Do not believe these numbers. They are generally broad estimates, from some point in time that could be decades ago. Africa’s population has roughly doubled in the past 25 years, so a language that was estimated to have X speakers in 2000 might have 2X today. Or it might have X/2, if it is a small language whose youth are shifting to a more powerful tongue. The number in Column J should only be seen as vaguely indicative. Column K boils the estimate down to an order of magnitude (the number of zeros attached to the number of speakers), which might be more useful for broad comparisons. Column L gives an assessment of the language’s stability according to the UNESCO Atlas of the World’s Languages in Danger. Column M indicates whether a language has some sort of recognized national or official status in any country. Column N links to the page for the official commission that ACALAN (the African Union’s African Academy of Languages) has in place for each of 22 cross-border languages or language groups.
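To make the Column K idea concrete, here is a minimal sketch of how an order of magnitude can be derived from a speaker estimate, together with the rough X/2 to 2X uncertainty band discussed above. The exact rule used to fill Column K is not documented here, so the digit-counting rule, the function names, and the sample estimates are assumptions for illustration.

```python
def order_of_magnitude(speakers: int) -> int:
    """Return the number of digits after the first, e.g. 45,000 gives 4.

    Counting digits avoids floating-point surprises from logarithms.
    """
    if speakers < 1:
        raise ValueError("speaker estimate must be a positive integer")
    return len(str(speakers)) - 1


def plausible_range(estimate: int) -> tuple:
    """Return the (X/2, 2X) band around an old estimate X."""
    return estimate // 2, estimate * 2


# Made-up estimates, chosen only to show the behavior.
for estimate in (950, 45_000, 2_000_000):
    low, high = plausible_range(estimate)
    print(
        estimate,
        order_of_magnitude(estimate),
        order_of_magnitude(low),
        order_of_magnitude(high),
    )
```

Even when the true figure lies anywhere between half and double the estimate, the order of magnitude shifts by at most one step, which is why Column K tends to be the safer basis for comparison.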
Column O gives the two-letter ISO 639-1 code that you may encounter in references to some languages. In the early days, ISO thought that 676 codes (26 × 26) would suffice for all the languages in the world that would ever have a digital presence. That is far fewer than the nearly 2400 African languages you’ll find on the Grid. ISO’s initial underestimate shows how little significance Silicon Valley and international organizations attach to most African languages.
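The arithmetic behind the two-letter limit is easy to check. The short sketch below counts the codes available with two and three letters of the 26-letter alphabet and compares them with the approximate number of languages on the Grid; the constant of 2400 is this guide's rounded figure, not an exact count, and the function name is hypothetical.

```python
import string

LETTERS = string.ascii_lowercase   # the 26 letters used in ISO 639 codes
GRID_LANGUAGES_APPROX = 2400       # "nearly 2400", per this guide


def code_space(length: int) -> int:
    """Number of distinct codes of the given length."""
    return len(LETTERS) ** length


two_letter = code_space(2)    # ISO 639-1 style codes
three_letter = code_space(3)  # ISO 639-3 style codes

print(two_letter, three_letter)
print(two_letter < GRID_LANGUAGES_APPROX < three_letter)
```

Two letters cannot cover even the African languages on the Grid, while three letters leave ample room, which is why Column A relies on the three-letter codes.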
For the moment, ignore the columns on the far right with headers like this: 🎓¹🌍. These are intended to show which languages are taught in primary, secondary, and tertiary education, in Africa and at universities abroad, or online. The research has not been conducted, and there are no present funds to do so, but we live in hope.
The remaining columns, beginning with Column P, show the digital resources that may be available for each language, or bibliographic sources for print resources. The general location of each data source is linked from Row 2 (the green row). Please visit those links for information about what the column is referencing. Many of the individual cells contain links to the precise page for a language’s data within the larger resource. Please let us know if any of those links go bad over time.
A few resources are of particular note:
• Lanfrica is an ambitious project to keep abreast of new digital resources for African languages, especially datasets that might be useful for NLP and AI.
• AfroLID (🌍🆔) contains data that can often identify the language that a text is written in. If a language is not in AfroLID, chances are extremely slim that it will appear in any language technology, including and especially AI.
• FineWeb2 is the source for most African language data for Apertus, the Swiss multilingual LLM introduced in the second half of 2025. Almost all of the languages are limited to about 30,000 tokens from New Testament biblical translations.
• MADLAD-400 (Multilingual Audited Dataset: Low-resource And Document-level) is a document-level multilingual dataset based on Common Crawl that uses all snapshots of Common Crawl available as of August 1, 2022. Of the 84 African languages that have some data, most have very low numbers of clean documents (as few as 26) and sentences (as few as 1,000), as shown in the relevant columns.
• Webonary contains online dictionaries for 156 languages and growing. Many of these are small languages. Many of these are small dictionaries. Because each is an independent lexicography endeavor, there is little consistency in the approach to data from one to the next. All of them are free, and represent serious human research.
The African Language Grid strives to include all relevant resources. If you know about something we are missing, please alert martin at kamusi dot org.