Wikimedia’s new language metrics as a resource for insights about (minority) languages
Caroline Myrick
Wikimedia Foundation
User:CMyrick-WMF
About me
Overview
Context
Currently, 334 languages have at least one HOSTED content project.
700+ additional languages have TEST projects in Wikimedia Incubator, Wikiversity Beta, or Multilingual Wikisource
Many projects + many languages = lots of data
Let’s consider the six living Celtic languages…
“Map of Celtic Nations-flag shades.svg” CC BY-SA 3.0 by QuartierLatin1968
File derived from: “Celtic Nations.svg” (uploaded by OsgoodeLawyer)
There are 17 content project editions in Celtic languages
6 Wikipedia editions
6 Wiktionary editions
2 Wikisource editions
2 Wikiquote editions
1 Wikibooks edition
There are 11 test projects* in Celtic languages
3 Wikivoyage test projects
3 Wikisource test projects
2 Wikinews test projects
2 Wikiquote test projects
1 Wikibooks test project
*test projects: pre-hosted projects located in Wikimedia Incubator, Wikiversity Beta, or Multilingual Wikisource
There are 20,000+ Commons file captions written in Celtic languages
File: CC-BY-SA-4.0 | User:Llywelyn2000
The Celtic languages are written in the Latin script.
Wikimedia projects are home to more than 35 different scripts.
Many of the Celtic languages are translatable using MinT.
Irish, Manx, Welsh, and Scottish Gaelic can be translated into and from >250 languages.
Breton can be translated into and from 2 languages.
Content Translation is available* for each Celtic language Wikipedia.
* For logged-in users. In some languages it must be enabled as a beta feature; in others it is a standard user preference, enabled by default.
Celtic language | Languages translated into | Languages translated from |
Breton | 15 | 11 |
Cornish | 6 | 10 |
Irish | 18 | 21 |
Manx | 3 | 5 |
Scottish Gaelic | 8 | 16 |
Welsh | 29 | 34 |
Where can we find data about languages (or a specific language) within or across Wikimedia projects?
Where did I find these Celtic language stats?
Language metrics project
Language-related data live in many places.
(meta-wiki)
(Incubator)
(meta-wiki)
Proposals for closing projects
(meta-wiki)
(Presto/Hive/Spark/data dumps)
(MySQL/MariaDB)
(data dumps)
(Wikimedia GitHub)
(Wikimedia Extension database)
(And many more!)
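To give a feel for what working with one of these sources can look like, here is a minimal sketch. The slides do not name a specific API; this example assumes the public MediaWiki SiteMatrix API on Meta-Wiki (action=sitematrix) as one possible source for hosted wikis. The response shape, the helper name hosted_projects_by_language, and the hand-written SAMPLE data (including the made-up "xx" entry) are illustrative assumptions, not something taken from the talk.

```python
def hosted_projects_by_language(sitematrix):
    """Map language code -> sorted list of open project codes.

    Assumes a SiteMatrix-style dict: numeric string keys hold one entry
    per language, each with a "code" and a "site" list; closed sites
    carry a "closed" key. This shape is an assumption of the sketch.
    """
    result = {}
    for key, entry in sitematrix.items():
        # Skip non-language keys such as "count" and "specials".
        if not key.isdigit():
            continue
        open_sites = [
            site["code"]
            for site in entry.get("site", [])
            if "closed" not in site
        ]
        if open_sites:
            result[entry["code"]] = sorted(open_sites)
    return result


# Hand-written sample in the assumed shape (not real API output).
SAMPLE = {
    "count": 4,
    "0": {"code": "cy", "site": [
        {"dbname": "cywiki", "code": "wiki"},
        {"dbname": "cywiktionary", "code": "wiktionary"},
    ]},
    "1": {"code": "gv", "site": [
        {"dbname": "gvwiki", "code": "wiki"},
    ]},
    # Hypothetical language whose only wiki is closed.
    "2": {"code": "xx", "site": [
        {"dbname": "xxwiki", "code": "wiki", "closed": ""},
    ]},
    "specials": [],
}

by_language = hosted_projects_by_language(SAMPLE)
print(by_language)
```

Test wikis in Incubator, Wikiversity Beta, or Multilingual Wikisource would not appear in a source like this, which is part of why the data has to be combined from several places.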
TEST WIKI DATA
HOSTED WIKI DATA
BACK-END DATA
“Funnel (PSF).png” | Public Domain: Pearson Scott Foresman
STATE OF LANGUAGES METRICS
STATE OF LANGUAGES METRICS
Language coverage across hosted content projects?
Language coverage across multilingual projects?
Language coverage in Incubator?
Language coverage in Wikiversity Beta?
Language coverage in Multilingual Wikisource?
Test project graduation rates?
Length of time spent incubating?
Scripts and directionality?
Search support?
MinT availability?
Machine translation availability?
Content translation availability?
At the Wikimedia Foundation, these metrics will be able to help us…
Wikimedia volunteers could use these metrics to…
What’s next?
There’s more work to be done
HOSTED WIKI DATA
TEST WIKI DATA
MORE INSIGHTS ABOUT STATE OF LANGUAGES
ENGINEERING DATA
PRODUCT DATA
GLOBAL LANGUAGE DATA
“Languages_world_map.svg” | GFDL CC-BY-SA-migrated-3.0: Julien1311
Q&A
Learn more
Leave feedback
Thanks for your attention!
Get in touch with me:
Caroline Myrick
Email: cmyrick@wikimedia.org
Office hours: mediawiki.org/wiki/Wikimedia_Research/Office_hours
APPENDIX
Metrics Tables
Additional language metric visualizations
Of the 333 …
Most have 1 or 2 content projects.
Thirteen have all 8 content projects.
Test wikis outnumber hosted wikis across project types
Celtic language | Wikibooks | Wikinews | Wikipedia | Wikiquote | Wikisource | Wikiversity | Wikivoyage | Wiktionary |
Breton | – | – | Hosted | Hosted | Hosted | – | Test | Hosted |
Cornish | – | – | Hosted | Test (closed) | Test | – | – | Hosted |
Irish | Test (closed) | – | Hosted | Test (closed) | Test | – | Test | Hosted |
Manx | – | – | Hosted | – | – | – | – | Hosted |
Scottish Gaelic | – | Test | Hosted | – | Test | – | – | Hosted |
Welsh | Hosted | Test | Hosted | Hosted | Hosted | – | Test | Hosted |
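The per-project counts quoted earlier for the Celtic languages (17 hosted editions, 11 test projects) can be re-derived from this table. The sketch below transcribes the table by hand, with "-" standing in for the dash cells, and assumes "Test (closed)" entries count as test projects, which is the only reading under which the totals match. The names PROJECTS, TABLE, and tally are illustrative.

```python
from collections import Counter

PROJECTS = ["Wikibooks", "Wikinews", "Wikipedia", "Wikiquote",
            "Wikisource", "Wikiversity", "Wikivoyage", "Wiktionary"]

# One row per language, in the same column order as PROJECTS.
TABLE = {
    "Breton": ["-", "-", "Hosted", "Hosted", "Hosted", "-", "Test", "Hosted"],
    "Cornish": ["-", "-", "Hosted", "Test (closed)", "Test", "-", "-", "Hosted"],
    "Irish": ["Test (closed)", "-", "Hosted", "Test (closed)", "Test", "-",
              "Test", "Hosted"],
    "Manx": ["-", "-", "Hosted", "-", "-", "-", "-", "Hosted"],
    "Scottish Gaelic": ["-", "Test", "Hosted", "-", "Test", "-", "-", "Hosted"],
    "Welsh": ["Hosted", "Test", "Hosted", "Hosted", "Hosted", "-", "Test",
              "Hosted"],
}


def tally(table, projects):
    """Count hosted and test projects per project type."""
    hosted, test = Counter(), Counter()
    for statuses in table.values():
        for project, status in zip(projects, statuses):
            if status == "Hosted":
                hosted[project] += 1
            elif status.startswith("Test"):
                # Assumption: closed test projects still count as test projects.
                test[project] += 1
    return hosted, test


hosted, test = tally(TABLE, PROJECTS)
print("hosted:", sum(hosted.values()), dict(hosted))
print("test:", sum(test.values()), dict(test))
```

The totals come out to 17 hosted and 11 test, matching the slide figures.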