When FastText Pays Attention
Efficient Estimation of Word Representations Using Positional Weighting
Vítek Novotný, Michal Štefánik, Dávid Lupták, Eniafe F. Ayetiran, Petr Sojka
MIR research group <mir.fi.muni.cz>, Faculty of Informatics, Masaryk University
Bibliography
- KAHNEMAN, Daniel. Thinking, fast and slow. Macmillan, 2011.
- PETERS, Ellen, et al. Numeracy and decision making. Psychological science, 2006, 17.5: 407-413.
- CLARK, Kevin, et al. What does BERT look at? An analysis of BERT’s attention. arXiv:1906.04341, 2019.
- MIKOLOV, Tomáš, et al. Advances in pre-training distributed word representations. arXiv:1712.09405, 2017.
- NOVOTNÝ, Vít, et al. Towards useful word embeddings. RASLAN 2020, 2020, 37.
Introduction
- Kahneman [1] categorizes human cognition into two systems:
- System 1: fast, automatic, emotional, unconscious, …
- System 2: slow, effortful, logical, conscious, …
- Peters et al. [2] show that the two systems are mutually supportive.
- Clark et al. [3] show that ensembling shallow log-bilinear LMs (FastText) and deep attention-based LMs (BERT) significantly outperforms either model alone on dependency parsing [3, Table 3].
- Mikolov et al. [4] introduce positional weighting to FastText and achieve SOTA on the English word analogy task (85%); see the sketch below this list.
- We open-source positional weighting (pw) and evaluate it on qualitative & extrinsic tasks.
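As an illustration of positional weighting [4], the following minimal NumPy sketch combines the input vectors of a CBOW context window by multiplying each of them element-wise with a learned per-position weight vector d_p before averaging. The function and variable names are ours and do not come from any released code.

import numpy as np

def positional_context(context_vectors, positional_weights):
    """Combine context word vectors into one hidden vector using
    positional weighting: every window position p has a learned weight
    vector d_p that re-weights the context word at that position."""
    weighted = context_vectors * positional_weights  # element-wise d_p * u_p
    return weighted.mean(axis=0)

# Toy example: a window of 2 words on each side, 5-dimensional vectors.
rng = np.random.default_rng(0)
context_vectors = rng.normal(size=(4, 5))     # input vectors of the context words
positional_weights = rng.normal(size=(4, 5))  # one weight vector per window position
hidden = positional_context(context_vectors, positional_weights)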
Qualitative Evaluation
- We measure the importance of context words at each position p: words next to the masked word are the most important, and the left context matters more than the right.
- We cluster the position features (see the sketch after this list). Each cluster boosts a different kind of context word:
- Antepositional: with, under, in, of, …
- Postpositional: ago, notwithstanding, …
- Informational: fascism, tornado, …, August
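The following NumPy sketch shows one plausible way to carry out both analyses, assuming the trained positional weighting matrix D (one row per window position) is available. The importance measure (vector norms), the clustering algorithm (k-means), and all names here are our assumptions, not necessarily the exact procedure behind the results above.

import numpy as np
from sklearn.cluster import KMeans

# D is the positional weighting matrix: one row per window position p,
# one column per vector dimension ("position feature").
rng = np.random.default_rng(0)
D = rng.normal(size=(10, 300))  # placeholder instead of trained weights

# Importance of a position: how strongly it re-weights context words,
# summarized here by the L2 norm of its weight vector d_p.
position_importance = np.linalg.norm(D, axis=1)

# Each position feature is described by its profile across window positions
# (a column of D); k-means groups features that favor similar positions,
# e.g. antepositional vs. postpositional vs. informational.
feature_profiles = D.T  # shape: (number of features, number of positions)
cluster_labels = KMeans(n_clusters=3, n_init=10).fit_predict(feature_profiles)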
Text Classification
- We use the Soft Cosine Measure with a kNN classifier [5]; a sketch follows below.
Positional model consistently outperforms base FastText.
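Here is a minimal NumPy sketch of Soft Cosine Measure classification with a kNN vote. It assumes bag-of-words document vectors and a precomputed term similarity matrix S (e.g. clipped cosine similarities between word vectors); the function names and the default k are our own placeholders.

import numpy as np

def soft_cosine(x, y, S):
    """Soft Cosine Measure between bag-of-words vectors x and y, where
    S[i, j] is the similarity between terms i and j."""
    norm = np.sqrt((x @ S @ x) * (y @ S @ y))
    return (x @ S @ y) / norm if norm else 0.0

def knn_classify(query, train_bows, train_labels, S, k=3):
    """Label a query document by a majority vote of its k most similar
    training documents under the Soft Cosine Measure."""
    similarities = np.array([soft_cosine(query, doc, S) for doc in train_bows])
    nearest = similarities.argsort()[::-1][:k]
    labels, counts = np.unique(np.asarray(train_labels)[nearest], return_counts=True)
    return labels[counts.argmax()]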
Language Modeling
- We use an LSTM with its lookup table initialized to the context vectors [5]; a sketch follows below.
Positional model consistently outperforms base FastText.
(Figure labels: 8%, 11%, and 81% of features.)
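Below is a minimal PyTorch sketch of this language-modeling setup, assuming a matrix of pre-trained context vectors is available; the class name and hyperparameters such as hidden_size are placeholders rather than the configuration used in the experiments.

import torch
import torch.nn as nn

class LstmLanguageModel(nn.Module):
    """Word-level LSTM language model whose embedding lookup table is
    initialized from pre-trained context vectors."""

    def __init__(self, pretrained_vectors, hidden_size=650):
        super().__init__()
        vocab_size, dim = pretrained_vectors.shape
        self.embedding = nn.Embedding.from_pretrained(
            torch.as_tensor(pretrained_vectors, dtype=torch.float),
            freeze=False)  # initialize the lookup table, keep it trainable
        self.lstm = nn.LSTM(dim, hidden_size, batch_first=True)
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, token_ids):
        hidden, _ = self.lstm(self.embedding(token_ids))
        return self.decoder(hidden)  # logits over the next token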