When FastText Pays Attention
Efficient Estimation of Word Representations Using Positional Weighting
Vítek Novotný, Michal Štefánik, Dávid Lupták, Eniafe F. Ayetiran, Petr Sojka
MIR research group <mir.fi.muni.cz>, Faculty of Informatics, Masaryk University
Bibliography
- KAHNEMAN, Daniel. Thinking, fast and slow. Macmillan, 2011.
- PETERS, Ellen, et al. Numeracy and decision making. Psychological science, 2006, 17.5: 407-413.
- CLARK, Kevin, et al. What does BERT look at? An analysis of BERT’s attention. arXiv:1906.04341, 2019.
- MIKOLOV, Tomáš, et al. Advances in pre-training distributed word representations. arXiv:1712.09405, 2017.
- NOVOTNÝ, Vít, et al. Towards useful word embeddings. RASLAN 2020, 2020, 37.
Introduction
- Kahneman [1] categorizes human cognition into two systems:
- System 1: fast, automatic, emotional, unconscious, …
- System 2: slow, effortful, logical, conscious, …
- Peters et al. [2] show that the two systems are mutually supportive.
- Clark et al. [3] show that ensembling shallow log-bilinear LMs (FastText) and deep attention-based LMs (BERT) significantly outperforms either model alone on dependency parsing [3, Table 3].
- Mikolov et al. [4] introduce positional weighting to FastText and achieve SOTA on the English word analogy task (85%); see the sketch below this list.
- We open-source positional weighting (pw) and evaluate it on qualitative & extrinsic tasks.
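As an illustration of positional weighting [4], the following minimal NumPy sketch combines the input vectors of a CBOW context window by multiplying each of them element-wise with a learned per-position weight vector d_p before averaging. The function and variable names are ours and do not come from any released code.

import numpy as np

def positional_context(context_vectors, positional_weights):
    """Combine context word vectors into one hidden vector using
    positional weighting: every window position p has a learned weight
    vector d_p that re-weights the context word at that position."""
    weighted = context_vectors * positional_weights  # element-wise d_p * u_p
    return weighted.mean(axis=0)

# Toy example: a window of 2 words on each side, 5-dimensional vectors.
rng = np.random.default_rng(0)
context_vectors = rng.normal(size=(4, 5))     # input vectors of the context words
positional_weights = rng.normal(size=(4, 5))  # one weight vector per window position
hidden = positional_context(context_vectors, positional_weights)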
Qualitative Evaluation
- We measure the importance of context words at each position p: words next to the masked word are the most important, and the left context matters more than the right.
- We cluster the position features (see the sketch after this list). Each cluster boosts a different kind of context word:
- Antepositional: with, under, in, of, …
- Postpositional: ago, notwithstanding, …
- Informational: fascism, tornado, …, August
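The following NumPy sketch shows one plausible way to carry out both analyses, assuming the trained positional weighting matrix D (one row per window position) is available. The importance measure (vector norms), the clustering algorithm (k-means), and all names here are our assumptions, not necessarily the exact procedure behind the results above.

import numpy as np
from sklearn.cluster import KMeans

# D is the positional weighting matrix: one row per window position p,
# one column per vector dimension ("position feature").
rng = np.random.default_rng(0)
D = rng.normal(size=(10, 300))  # placeholder instead of trained weights

# Importance of a position: how strongly it re-weights context words,
# summarized here by the L2 norm of its weight vector d_p.
position_importance = np.linalg.norm(D, axis=1)

# Each position feature is described by its profile across window positions
# (a column of D); k-means groups features that favor similar positions,
# e.g. antepositional vs. postpositional vs. informational.
feature_profiles = D.T  # shape: (number of features, number of positions)
cluster_labels = KMeans(n_clusters=3, n_init=10).fit_predict(feature_profiles)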
Text Classification
- We use the Soft Cosine Measure with a kNN classifier [5]; a sketch follows below.
Positional model consistently outperforms base FastText.
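Here is a minimal NumPy sketch of Soft Cosine Measure classification with a kNN vote. It assumes bag-of-words document vectors and a precomputed term similarity matrix S (e.g. clipped cosine similarities between word vectors); the function names and the default k are our own placeholders.

import numpy as np

def soft_cosine(x, y, S):
    """Soft Cosine Measure between bag-of-words vectors x and y, where
    S[i, j] is the similarity between terms i and j."""
    norm = np.sqrt((x @ S @ x) * (y @ S @ y))
    return (x @ S @ y) / norm if norm else 0.0

def knn_classify(query, train_bows, train_labels, S, k=3):
    """Label a query document by a majority vote of its k most similar
    training documents under the Soft Cosine Measure."""
    similarities = np.array([soft_cosine(query, doc, S) for doc in train_bows])
    nearest = similarities.argsort()[::-1][:k]
    labels, counts = np.unique(np.asarray(train_labels)[nearest], return_counts=True)
    return labels[counts.argmax()]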
Language Modeling
- We use an LSTM with its lookup table initialized to the context vectors [5]; a sketch follows below.
Positional model consistently outperforms base FastText.
(Figure labels: 8%, 11%, and 81% of features.)
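Below is a minimal PyTorch sketch of this language-modeling setup, assuming a matrix of pre-trained context vectors is available; the class name and hyperparameters such as hidden_size are placeholders rather than the configuration used in the experiments.

import torch
import torch.nn as nn

class LstmLanguageModel(nn.Module):
    """Word-level LSTM language model whose embedding lookup table is
    initialized from pre-trained context vectors."""

    def __init__(self, pretrained_vectors, hidden_size=650):
        super().__init__()
        vocab_size, dim = pretrained_vectors.shape
        self.embedding = nn.Embedding.from_pretrained(
            torch.as_tensor(pretrained_vectors, dtype=torch.float),
            freeze=False)  # initialize the lookup table, keep it trainable
        self.lstm = nn.LSTM(dim, hidden_size, batch_first=True)
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, token_ids):
        hidden, _ = self.lstm(self.embedding(token_ids))
        return self.decoder(hidden)  # logits over the next token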