1 of 53

Wikimedia’s new language metrics as a resource for insights about (minority) languages

Caroline Myrick

Wikimedia Foundation

User:CMyrick-WMF

2 of 53

About me

  • Wikimedia Foundation Research staff
  • Volunteer editor
  • Academic background
    • Sociolinguistics (MA, PhD)
    • Language variation in the Caribbean and U.S.

3 of 53

Overview

  • Language metrics (~10 min)
    • Context
    • Celtic language insights
    • Language metrics project
  • Live demo (~10 min)
  • Wrap up (~5 min)
    • What’s next
    • Q&A

4 of 53

Context

5 of 53

Currently, 334 languages have at least one HOSTED content project.

6 of 53

700+ additional languages have TEST projects in Wikimedia Incubator, Wikiversity Beta, or Multilingual Wikisource

7 of 53

Many projects + many languages = lots of data

8 of 53

Let’s consider the six living Celtic languages…

  • Gàidhlig, Scottish Gaelic
  • Gaeilge, Irish
  • Gaelg, Manx
  • Cymraeg, Welsh
  • Kernowek, Cornish
  • Brezhoneg, Breton

“Map of Celtic Nations-flag shades.svg” CC BY-SA 3.0 by QuartierLatin1968

File derived from: “Celtic Nations.svg” (uploaded by OsgoodeLawyer)

9 of 53

There are 17 content project editions in Celtic languages

6 Wikipedia editions

6 Wiktionary editions

2 Wikisource editions

2 Wikiquote editions

1 Wikibooks edition

10 of 53

There are 11 test projects* in Celtic languages

3 Wikivoyage test projects

3 Wikisource test projects

2 Wikinews test projects

2 Wikiquote test projects

1 Wikibooks test project

*Test projects: pre-hosted projects located in Wikimedia Incubator, Wikiversity Beta, or Multilingual Wikisource

11 of 53

There are 20,000+ Commons file captions written in Celtic languages

File: CC-BY-SA-4.0 | User:Llywelyn2000

12 of 53

The Celtic languages are written in the Latin script.

Wikimedia projects are home to more than 35 different scripts.

13 of 53

Many of the Celtic languages are translatable using MinT.

Irish, Manx, Welsh, and Scottish Gaelic can be translated into and from more than 250 languages.

Breton can be translated into and from 2 languages.

14 of 53

Content Translation is available* for each Celtic language Wikipedia.

*For logged-in users. In some languages it must be enabled as a beta feature; in others it is a regular user preference that is enabled by default.

15 of 53

Content Translation is available* for each Celtic language Wikipedia.

Language          Languages translated into   Languages translated from
Breton            15                          11
Cornish           6                           10
Irish             18                          21
Manx              3                           5
Scottish Gaelic   8                           16
Welsh             29                          34
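The Content Translation figures on this slide are small enough to explore directly. The sketch below is only an illustration: it copies the twelve numbers from the slide into a dictionary, then ranks the languages by either column and lists those where the "translated from" count is higher than the "translated into" count. The names `CX_COUNTS`, `ranked_by`, and `from_exceeds_into` are hypothetical and not part of any Wikimedia tool.

```python
# Counts transcribed from the slide:
# language -> (languages translated into, languages translated from).
CX_COUNTS = {
    "Breton": (15, 11),
    "Cornish": (6, 10),
    "Irish": (18, 21),
    "Manx": (3, 5),
    "Scottish Gaelic": (8, 16),
    "Welsh": (29, 34),
}


def ranked_by(column):
    """Return language names sorted by the chosen column, largest first."""
    index = {"into": 0, "from": 1}[column]
    return sorted(CX_COUNTS, key=lambda lang: CX_COUNTS[lang][index], reverse=True)


def from_exceeds_into():
    """Return languages whose 'from' count is higher than their 'into' count."""
    return [lang for lang, (into, from_) in CX_COUNTS.items() if from_ > into]


if __name__ == "__main__":
    print("Ranked by 'into':", ranked_by("into"))
    print("Ranked by 'from':", ranked_by("from"))
    print("'From' exceeds 'into':", from_exceeds_into())
```

On these figures, Breton is the only Celtic language whose "into" count is the larger of the two.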

16 of 53

Where can we find data about languages (or a specific language) within or across Wikimedia projects?

Where did I find these Celtic language stats?

17 of 53

Language metrics project

18 of 53

Language-related data live in many places.

19 of 53

20 of 53

21 of 53

Language-related data live in many places.

Analytics Data Lake

(Presto/Hive/Spark/data dumps)

MediaWiki database

(MySQL/MariaDB)

22 of 53

23 of 53

Language-related data live in many places.

language-data/..langdb.yaml

(Wikimedia GitHub)

Content Translation table

(Wikimedia Extension database)

24 of 53

Language-related data live in many places.

(And many more!)
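As one illustration of how scattered sources like these can be queried, the sketch below uses the public MediaWiki `sitematrix` API on Meta-Wiki, which is not one of the sources named on the slides. It collects the language codes that have at least one site not flagged as closed. The function names and the User-Agent string are hypothetical, and the result should not be expected to match the figures in this talk, since the metrics project may define "content project" and "language" differently.

```python
import json
import urllib.request

# Public MediaWiki API endpoint; the default JSON format marks closed
# sites with a "closed" key on the site entry.
SITEMATRIX_URL = "https://meta.wikimedia.org/w/api.php?action=sitematrix&format=json"


def fetch_sitematrix(url=SITEMATRIX_URL):
    """Download the site matrix and return its 'sitematrix' object."""
    request = urllib.request.Request(
        url,
        headers={"User-Agent": "language-metrics-sketch/0.1 (illustrative example)"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["sitematrix"]


def languages_with_open_sites(sitematrix):
    """Return the set of language codes that have at least one non-closed site."""
    codes = set()
    for key, entry in sitematrix.items():
        # Language groups use numeric keys; "count" and "specials" are skipped.
        if not key.isdigit():
            continue
        open_sites = [site for site in entry.get("site", []) if "closed" not in site]
        if open_sites:
            codes.add(entry["code"])
    return codes


if __name__ == "__main__":
    matrix = fetch_sitematrix()
    print(len(languages_with_open_sites(matrix)), "language codes with an open site")
```

Keeping the download separate from the counting makes the counting step easy to check against a small hand-written sample before running it on live data.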

25 of 53

TEST WIKI DATA

HOSTED WIKI DATA

BACK-END DATA

“Funnel (PSF).png” | Public Domain: Pearson Scott Foresman

26 of 53

TEST WIKI DATA

HOSTED WIKI DATA

STATE OF LANGUAGES METRICS

BACK-END DATA

“Funnel (PSF).png” | Public Domain: Pearson Scott Foresman

27 of 53

STATE OF LANGUAGES METRICS

  • Language coverage across hosted content projects?
  • Scripts and directionality?
  • MinT availability?
  • Search support?
  • Language coverage in Incubator?
  • Language coverage across multilingual projects?
  • Language coverage in Wikiversity Beta?
  • Test project graduation rates?
  • Language coverage in Multilingual Wikisource?
  • Length of time spent incubating?
  • Machine translation availability?
  • Content translation availability?

“Funnel (PSF).png” | Public Domain: Pearson Scott Foresman

28 of 53

At the Wikimedia Foundation, these metrics will be able to help us…

  • Track language coverage and representation

  • Identify gaps in the “sum of all knowledge” and monitor gap closure

  • Identify knowledge gaps and monitor gap closure

“Funnel (PSF).png” | Public Domain: Pearson Scott Foresman

29 of 53

At the Wikimedia Foundation, these metrics will be able to help…

30 of 53

Wikimedia volunteers could use these metrics to…

  • View language coverage trends

  • Identify gaps for individual volunteer work, or for campaigns and events

  • Track impact after interventions

“Funnel (PSF).png” | Public Domain: Pearson Scott Foresman

31 of 53

Demo

32 of 53

33 of 53

What’s next?

34 of 53

There’s more work to be done

  • Develop additional metrics

  • Productionize datasets

  • Incorporate and/or connect to additional data

  • Standardize reporting

“Funnel (PSF).png” | Public Domain: Pearson Scott Foresman

35 of 53

HOSTED WIKI DATA

TEST WIKI DATA

MORE INSIGHTS ABOUT

STATE OF LANGUAGES

ENGINEERING DATA

PRODUCT DATA

“Funnel (PSF).png” | Public Domain: Pearson Scott Foresman

36 of 53

HOSTED WIKI DATA

TEST WIKI DATA

MORE INSIGHTS ABOUT

STATE OF LANGUAGES

GLOBAL LANGUAGE DATA

“Languages_world_map.svg” | GFDL CC-BY-SA-migrated-3.0: Julien1311

“Funnel (PSF).png” | Public Domain: Pearson Scott Foresman

ENGINEERING DATA

37 of 53

Q&A

38 of 53

  • Questions or comments about these metrics?

  • How could you use these (or other) metrics?

  • Questions about Celtic language coverage that weren’t addressed?

39 of 53

Learn more

Leave feedback

40 of 53

Thanks for your attention!

Get in touch with me:

Caroline Myrick

Email: cmyrick@wikimedia.org

Office hours: mediawiki.org/wiki/Wikimedia_Research/Office_hours

41 of 53

42 of 53

APPENDIX

43 of 53

Metrics Tables

44 of 53

45 of 53

46 of 53

47 of 53

48 of 53

Additional language metric visualizations

49 of 53

Of the 333 languages with a hosted content project, most have 1 or 2 content projects.

Thirteen have all 8 content projects.

50 of 53

Most of those languages have 1 or 2 projects.

Thirteen have all 8 content projects.

51 of 53

Test wikis outnumber hosted wikis across project types

52 of 53

Celtic language   Wikibooks      Wikinews  Wikipedia  Wikiquote      Wikisource  Wikiversity  Wikivoyage  Wiktionary
Breton            -              -         Hosted     Hosted         Hosted      -            Test        Hosted
Cornish           -              -         Hosted     Test (closed)  Test        -            -           Hosted
Irish             Test (closed)  -         Hosted     Test (closed)  Test        -            Test        Hosted
Manx              -              -         Hosted     -              -           -            -           Hosted
Scottish Gaelic   -              Test      Hosted     -              Test        -            -           Hosted
Welsh             Hosted         Test      Hosted     Hosted         Hosted      -            Test        Hosted
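The coverage table can be cross-checked against the totals on the earlier slides (17 hosted editions, 11 test projects). The sketch below is a hedged illustration, not the project's own method. The text export dropped the empty cells, so each status is assigned to a column by inference from those per-project totals. Two cells, the Irish "Test (closed)" and the Scottish Gaelic "Test" in the Wikibooks and Wikinews columns, cannot be pinned down from the totals alone, so their placement here is an assumption. `COVERAGE` and `tally` are hypothetical names.

```python
from collections import Counter

# Status per language and project. Column placement is inferred from the
# per-project totals on the earlier slides; the Irish Wikibooks and
# Scottish Gaelic Wikinews cells are assumptions (they could be swapped).
COVERAGE = {
    "Breton": {"Wikipedia": "Hosted", "Wikiquote": "Hosted", "Wikisource": "Hosted",
               "Wikivoyage": "Test", "Wiktionary": "Hosted"},
    "Cornish": {"Wikipedia": "Hosted", "Wikiquote": "Test (closed)",
                "Wikisource": "Test", "Wiktionary": "Hosted"},
    "Irish": {"Wikibooks": "Test (closed)", "Wikipedia": "Hosted",
              "Wikiquote": "Test (closed)", "Wikisource": "Test",
              "Wikivoyage": "Test", "Wiktionary": "Hosted"},
    "Manx": {"Wikipedia": "Hosted", "Wiktionary": "Hosted"},
    "Scottish Gaelic": {"Wikinews": "Test", "Wikipedia": "Hosted",
                        "Wikisource": "Test", "Wiktionary": "Hosted"},
    "Welsh": {"Wikibooks": "Hosted", "Wikinews": "Test", "Wikipedia": "Hosted",
              "Wikiquote": "Hosted", "Wikisource": "Hosted",
              "Wikivoyage": "Test", "Wiktionary": "Hosted"},
}


def tally(coverage):
    """Count hosted and test projects per project type (closed tests count as tests)."""
    hosted, test = Counter(), Counter()
    for projects in coverage.values():
        for project, status in projects.items():
            if status == "Hosted":
                hosted[project] += 1
            elif status.startswith("Test"):
                test[project] += 1
    return hosted, test


if __name__ == "__main__":
    hosted, test = tally(COVERAGE)
    print("Hosted:", sum(hosted.values()), dict(hosted))
    print("Test:", sum(test.values()), dict(test))
```

The totals only come out to 11 test projects when the three "Test (closed)" entries are included, which suggests the earlier slide counts closed test projects as well.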

53 of 53