1 of 13

U4U = ÚFAL for Ukraine

2 of 13

U4U Activities

  • Ukrainian-Czech machine translation models
    • T2T (Martin Popel)
    • Marian (Jindřich Libovický, Jinda Helcl)
  • uk⇔cs in WMT shared task
  • Charles Translator
    • Web frontend (http://translator.cuni.cz, David Nápravník & hackathon volunteers)
    • Android app (1k+ downloads from Google Play, Tomáš Krabač)
    • Backend server: Ondřej Košarko and Lindat team
  • Communication with media, companies and volunteers
    • Michal Novák, Ruda Rosa,...
  • Data collection and cleaning
    • Michal Novák, Lucie Poláková, Denys Boyko, Zdeněk Kasner, Jarka Hlaváčová, Mariia Anisimova, MP, JL, JH, InterCorp team,…

3 of 13

Thousands of translations per week (April-September)

4 of 13

Frontends

70 % web frontend

25 % Android app

5 % API

5 of 13

6 of 13

7 of 13

Top “organizations”

Requests (k)   days   uk⇒cs   name
12000          187    39 %    unknown
439            119    0 %     PomahejUkrajine
175            134    0 %     ArthurApp
36             143    92 %    the top user
8              127    65 %    Telegram bot
4              25     2 %     fra (since August)
4              10     0 %     NewtonTechnologies
1              5      0 %     ÚMČ Praha Řeporyje
1              18     3 %     Farnost Horní Planá
>1                            10 organizations, 5 users

8 of 13

Organizations

9 of 13

WMT general MT (former news) shared task

  • uk⇔cs test set extracted from Charles Translator logs
  • Automatically filtered (prefix duplicates, Russian,...)
  • Manually selected, pseudonymized
    • uk⇒cs 2812 segments (mostly single sentences)
    • cs⇒uk 1930 segments
  • Reference translations by a professional translator
    • Suspicion: uk⇒cs references translated via English and post-edited?
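
The automatic filtering step above (prefix duplicates, Russian) could be sketched roughly as follows. This is a minimal illustration, not the filter actually used for the test set: the function names are hypothetical, the log is assumed to be a chronological list of source segments, and Russian is detected with a simple letter heuristic (ё, ъ, ы, э occur in Russian but not in Ukrainian; ґ, є, і, ї occur in Ukrainian but not in Russian).

```python
# Hypothetical sketch of log filtering; names and heuristics are assumptions.
RUSSIAN_ONLY = set("ёъыэ")    # letters of the Russian alphabet absent from Ukrainian
UKRAINIAN_ONLY = set("ґєії")  # letters of the Ukrainian alphabet absent from Russian


def looks_russian(text: str) -> bool:
    """Heuristic: contains Russian-only letters and no Ukrainian-only letters."""
    letters = set(text.lower())
    return bool(letters & RUSSIAN_ONLY) and not (letters & UKRAINIAN_ONLY)


def drop_prefix_duplicates(segments: list[str]) -> list[str]:
    """Drop a segment when the next logged segment starts with it.

    Assumes consecutive entries come from incremental typing, so only
    the longest version of each input is kept.
    """
    kept = []
    for i, seg in enumerate(segments):
        nxt = segments[i + 1] if i + 1 < len(segments) else None
        if nxt is not None and nxt.startswith(seg):
            continue  # a longer (or identical) version follows
        kept.append(seg)
    return kept


def filter_log(segments: list[str]) -> list[str]:
    """Strip whitespace, remove prefix duplicates, then remove Russian input."""
    cleaned = [s.strip() for s in segments if s.strip()]
    deduped = drop_prefix_duplicates(cleaned)
    return [s for s in deduped if not looks_russian(s)]
```

A real pipeline would likely use a trained language identifier rather than a letter heuristic, since short Russian inputs often contain none of the four distinguishing letters.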

Starting now: manual evaluation. Volunteers needed.

10 of 13

WMT general MT (former news) shared task

11 of 13

WMT general MT (former news) shared task

12 of 13

Translation of city names (MT Marathon project, WIP)

13 of 13

Plans for the future

  • uk and cs ASR service at Lindat (Peter Polák, Ondřej Klejch)
    • Test if better than Google ASR
  • Android app
    • Dialog mode (switch translation direction easily,...)
  • Translation quality
    • Better training data filtering and other tricks from AMU submission
    • Don’t translate URLs
    • Markup translation
  • More training data
    • Books from the National library (find best Ukrainian OCR, doc alignment,...)
    • Better CCMatrix
    • Newest InterCorp, (Open)Subtitles
  • Russian-Czech, multilingual
  • Marian into production
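
One common way to approach the "Don't translate URLs" item is to mask URLs with placeholders before translation and restore them afterwards. The sketch below only illustrates that idea under stated assumptions: the placeholder format `__URL0__`, the function names, and the `translate` callable are all hypothetical, and it assumes the MT model copies the placeholder through unchanged, which would need to be verified for the production models.

```python
import re

# Simplified URL pattern (an assumption): scheme followed by non-space characters.
URL_RE = re.compile(r"https?://\S+")


def mask_urls(text):
    """Replace each URL with a numbered placeholder; return masked text and URLs."""
    urls = []

    def repl(match):
        url = match.group(0)
        stripped = url.rstrip(".,;:!?)")  # keep sentence punctuation out of the URL
        tail = url[len(stripped):]
        urls.append(stripped)
        return f"__URL{len(urls) - 1}__{tail}"

    return URL_RE.sub(repl, text), urls


def unmask_urls(text, urls):
    """Put the original URLs back in place of their placeholders."""
    for i, url in enumerate(urls):
        text = text.replace(f"__URL{i}__", url)
    return text


def translate_keeping_urls(text, translate):
    """Translate text with a caller-supplied function, leaving URLs untouched."""
    masked, urls = mask_urls(text)
    return unmask_urls(translate(masked), urls)
```

For example, `translate_keeping_urls("Visit http://translator.cuni.cz now", str.upper)` uses `str.upper` as a stand-in for a translation function and leaves the URL intact while changing the surrounding text.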