1 of 11

Wikidata Powered Language Keyboards

Andrew Tavis McAllister

Initiator and lead developer

Data Reuse Days, 14 - 24 March 2022

2 of 11

Schedule

  • Presentation (30 min)
    • Project Description
    • Demo (15 of 30 min)
    • Data and Future Plans
  • Discussion (rest of session)

* Questions always welcome!

3 of 11

Goals

  • Keyboards for second language learners
  • French, German, Italian, Portuguese, Russian, Spanish, Swedish
  • “Never leave your keyboard”
  • Open-source + open-data

Scribe: “...a person who serves as a professional copyist, especially one who made copies of manuscripts before the invention of automatic printing.” - Wikipedia

4 of 11

Data Process

  • WDQS queries for each word type of each language
  • JSON outputs saved in app
    • Fast and no internet required
  • Updates via a Python script
    • WikidataIntegrator for queries
    • Data formatting
    • Runs in about 15 minutes (agile)
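
The update flow above could be sketched roughly as follows. This is a minimal illustration, not Scribe's actual script: the query shape, the output layout, and the helper names (run_query, format_nouns, save_nouns) are assumptions, as are the Wikidata IDs used for German (Q188), noun (Q1084) and plural (Q146786). The WikidataIntegrator import is kept inside run_query so the formatting step can be tried without the library or an internet connection.

```python
import json

# Assumed Wikidata IDs: Q188 (German), Q1084 (noun), Q146786 (plural).
NOUN_QUERY = """
SELECT ?lemma ?plural WHERE {
  ?lexeme dct:language wd:Q188 ;
          wikibase:lexicalCategory wd:Q1084 ;
          wikibase:lemma ?lemma .
  OPTIONAL {
    ?lexeme ontolex:lexicalForm ?form .
    ?form ontolex:representation ?plural ;
          wikibase:grammaticalFeature wd:Q146786 .
  }
}
"""


def run_query(query):
    """Send the query to WDQS (needs internet and wikidataintegrator)."""
    from wikidataintegrator import wdi_core  # third-party, imported lazily

    return wdi_core.WDItemEngine.execute_sparql_query(query)


def format_nouns(response):
    """Flatten WDQS JSON bindings into {lemma: {"plural": ...}}."""
    nouns = {}
    for row in response["results"]["bindings"]:
        lemma = row["lemma"]["value"]
        # Lexemes without a plural form have no "plural" binding.
        plural = row.get("plural", {}).get("value", "")
        nouns[lemma] = {"plural": plural}
    return nouns


def save_nouns(nouns, path):
    """Write the formatted data to a JSON file the app can bundle."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(nouns, f, ensure_ascii=False, indent=2)
```

An update run would then be something like save_nouns(format_nouns(run_query(NOUN_QUERY)), "nouns.json"), where the file name is also hypothetical.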

5 of 11

Features

  • Use in any app
  • Annotation
    • Noun Forms
    • Preposition-cases
      • “With whom…”
  • Commands
    • Translation (single word)
    • Verb conjugation
    • Noun plurals

* No autocomplete or autocorrect as of now
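
As a rough illustration of how annotation and commands can work from locally stored data, here is a small sketch. The dictionaries and the function names (plural, annotate) are hypothetical and do not reflect Scribe's actual data layout or code; "N" is assumed here to mark a neuter noun, and the preposition cases are ordinary German grammar ("mit" takes the dative, "für" the accusative).

```python
# Hypothetical in-app data, in the shape a formatting script might produce.
NOUNS = {"Büro": {"plural": "Büros", "form": "N"}}
PREPOSITIONS = {"mit": "Dat", "für": "Acc"}


def plural(word):
    """Plural command: return the stored plural, or None if unknown."""
    entry = NOUNS.get(word)
    return entry["plural"] if entry else None


def annotate(word):
    """Annotation: tag a noun with its form or a preposition with its case."""
    if word in NOUNS:
        return f"({NOUNS[word]['form']}) {word}"
    if word.lower() in PREPOSITIONS:
        return f"({PREPOSITIONS[word.lower()]}) {word}"
    return None  # unknown words are left unannotated
```

Because every lookup is a dictionary access on bundled data, commands like these need no network access at typing time.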

6 of 11

Demo

Scribe’s German Keyboard

7 of 11

Current Data

Language    L2 Speakers    Nouns (forms)  Verbs    Translations*  Prepositions
French      194.2 million  15,788 (3)     1,246**  67,652         -
German      59.1 million   28,089 (4)     3,130    67,652         187
Italian     13.4 million   783 (3)        71**     67,652         -
Portuguese  25.2 million   4,662 (3)      189      67,652         -
Russian     104.1 million  194,394 (4)    11**     67,652         12
Spanish     73.6 million   9,452 (3)      2,062    67,652         -
Swedish     3.2 million    41,187 (3)     4,138    67,652         -

* Machine translations
** Mostly just infinitives

8 of 11

Future Plans

  • Android and desktop
    • Android: Kotlin
    • Desktop: Python
    • Download language packs
  • Autocomplete and autocorrect
  • Wikidata based translations
  • Language practice (later)
  • More languages

[Screenshot: Scribe keyboard annotating "Büro" as N in the phrase "Scribe im Büro"]

9 of 11

Available Data - 1/2

Language     L2 Speakers         Nouns (forms)  Verbs  Adjectives  Prepositions
Basque       0.4 million passive 14,493 (2)     3,968  277         4
Bengali      39.0 million        7,660 (2)      268    1,379       1
Czech        1.3 million         4,805 (4)      289    4,721       47
Danish       0.1 million         7,996 (3)      3,189  1,199       53
Estonian     -                   60,257 (2)     7,932  9,146       552*
Hausa: Boko  26.3 million        1,062 (3)      307    58          3
Hebrew       4 million           20,620 (3)     4,705  4,269       59

* Includes postpositions

10 of 11

Available Data - 2/2

Language         L2 Speakers      Nouns (forms)  Verbs  Adjectives  Prepositions
Hindustani: H/U  258/161 million  430 (3)        73     100         12*
Japanese: Hira   1 million        330 (1)        304    118         0**
Malayalam        0.7 million      58,598 (4)     3,971  86          12
Mandarin         198.7 million    322 (1)        67     54          3
Norwegian: NB    -                8,728 (4)      2,809  2,244       105
Polish           0.7 million      1,332 (4/6)    12     445         23
Ukrainian        -                124 (4)        4      15,830      0

* Postpositions
** Postpositional particles

11 of 11

Thank you for your interest and efforts!

To stay in touch or contribute:

Andrew Tavis McAllister

andrew.t.mcallister@gmail.com