1 of 24

PBoS: Probabilistic Bag-of-Subwords for Generalizing Word Embedding

Zhao Jinman, Shawn Zhong, Xiaomin Zhang, Yingyu Liang

Accepted to Findings of EMNLP 2020. Presenting at SustaiNLP 2020.

2 of 24

Learn

Spelling ↦ Word vector

Why?

3 of 24

Background: word embedding

Word embedding:

Word ↦ Vector ∈ ℝⁿ

4 of 24

Background: word embedding

Word embedding:

Word ↦ Vector ∈ ℝⁿ

Word ID (within some fixed vocabulary)

(By many widely used word embedding methods*.)

*word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), etc.

5 of 24

Background: word embedding

☹️️ OOV words

🤔️ Related word forms get unrelated IDs: work ⇔ workshop, high ⇔ higher

=> Look into word spelling: subword-level models.

Word embedding:

Word ↦ Vector ∈ ℝⁿ

Word ID (within some fixed vocabulary)

6 of 24

Motivation: generalizing word embedding

  • Existing embeddings usually assume a fixed-size vocabulary.

☹️️ Out-of-vocabulary (OOV) word problem.

👍🏻️ Provide embeddings for OOV words without expensive retraining.

  • Context helps but...

🤔️ How much can we do by looking at the words themselves?

7 of 24

Learn

Spelling ↦ Word vector

How?

8 of 24

Inspirations

  • Words are often made of meaningful parts (subwords = char. n-grams).

We can often guess the meaning of an unseen word just by looking at it: “postEMNLP”.

Some subwords are more likely than others. E.g. “farm” vs “arml” in “farmland”.

Subwords segment words. E.g. “hig/her” vs “high/er”.

  • A model should be able to figure out meaningful subwords on its own.

Language speakers do this without linguistic training.

9 of 24

Previous proposal

  • Character-level RNN (Stratos, 2017; MIMICK, Pinter et al., 2017)
    • Compact, but no explicit modeling of subwords, and the embedding quality is not the best.
  • Bag-of-Subwords (Bojanowski et al., 2017; Zhao et al., 2018)
    • Simple yet effective, but assumes uniform weights: w(“farm”) = w(“arml”) in “farmland”.
  • Self-attention + subword hashing, KVQ-FH (Sasaki et al., 2019)
    • Good embedding quality, but no modeling of subword segmentation; also, is it overkill?
  • Morpheme-based models (Cotterell et al., 2016; Zhu et al., 2019; and many more)
    • Need an external morphological analyzer such as Morfessor (Virpioja et al., 2013).

10 of 24

PBoS:

subword segmentation

+

subword-based

word vector composition

11 of 24

PBoS: probabilistic bag-of-subwords

w = “higher”

Possible segmentations:

higher
hig/her
high/er
hi/gh/er
h/i/g/h/e/r

Vector composition, “just like BoS” (the subword vectors are the learnable parameters):

v(“high/er”) = v(“high”) + v(“er”)
v(“hig/her”) = v(“hig”) + v(“her”)

Segmentation likelihood, from subword frequencies:

p(“high/er” | w) ∝ p(“high”) p(“er”)
p(“hig/her” | w) ∝ p(“hig”) p(“her”)
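
Putting the two pieces together (a rough sketch in math form, not copied verbatim from the paper): the word vector averages the bag-of-subwords composition over all segmentations S(w), weighted by their likelihood, which collapses into a weight per subword.

```latex
p(\sigma \mid w) \propto \prod_{s \in \sigma} p(s),
\qquad
v(w) = \sum_{\sigma \in S(w)} p(\sigma \mid w) \sum_{s \in \sigma} v(s)
     = \sum_{s} \Big( \sum_{\sigma \in S(w):\, s \in \sigma} p(\sigma \mid w) \Big)\, v(s)
```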

12 of 24

PBoS: efficient algorithm

Naively, summing over all segmentations is exponential in the word length. ☹️️

A dynamic program computes each subword’s weight w.r.t. w directly. 👍🏻️

=> only 30% overhead compared to BoS!
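
A minimal sketch of this kind of forward/backward dynamic program (names such as `subword_weights`, `p`, and `max_len` are illustrative, not the repository’s actual API):

```python
from collections import defaultdict

def subword_weights(word, p, max_len=10):
    """Weight of each subword w.r.t. `word`: the total (normalized) likelihood
    of the segmentations that contain it. `p` maps subwords to likelihoods."""
    n = len(word)
    # forward[j]: total score of all segmentations of word[:j]
    forward = [1.0] + [0.0] * n
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            forward[j] += forward[i] * p.get(word[i:j], 0.0)
    # backward[i]: total score of all segmentations of word[i:]
    backward = [0.0] * n + [1.0]
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, min(n, i + max_len) + 1):
            backward[i] += p.get(word[i:j], 0.0) * backward[j]
    weights, Z = defaultdict(float), forward[n]
    if Z == 0.0:                      # no segmentation has positive score
        return weights
    for i in range(n):                # every occurrence word[i:j] of a subword
        for j in range(i + 1, min(n, i + max_len) + 1):
            s = word[i:j]
            if s in p:
                # mass of the segmentations that cut out s exactly at [i, j)
                weights[s] += forward[i] * p[s] * backward[j] / Z
    return weights
```

The word vector is then composed as v(w) = Σ_s weights[s] · v(s), reusing the subword vectors just as BoS does.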

13 of 24

Evaluation:

How does

subword segmentation

work?

14 of 24

Experiments: subword segmentation

Top subword segmentations for some example “words”.

=> PBoS assigns sensible likelihoods to subword segmentations.

15 of 24

Experiments: affix prediction

Task: predict the affix, e.g. “replaceable” -> “-able”, “rename” -> “re-”. Unambiguous cases dropped.

Method: PBoS takes the top-ranked affix; BoS chooses an affix at random.

=> PBoS almost doubles the scores of BoS.

Benchmark: derivational morphology dataset (Lazaridou et al., 2013).
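
As a rough illustration of the PBoS side of this method, reusing the subword weights from the earlier sketch (`predict_affix` and the affix list are hypothetical names, not the paper’s code):

```python
def predict_affix(word, weights, known_affixes):
    """Among known affixes that are a prefix or suffix of `word`,
    return the one with the largest PBoS subword weight."""
    candidates = [a for a in known_affixes
                  if (word.startswith(a) or word.endswith(a)) and a in weights]
    return max(candidates, key=lambda a: weights[a], default=None)

# e.g. predict_affix("replaceable", weights, {"re", "able"}) returns whichever
# of the two affixes carries more weight across the word's segmentations
```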

16 of 24

Evaluation:

How does

word vector composition

work?

17 of 24

Scenario: generalizing word embedding

Pre-trained word embeddings over a finite set of words:

Word ID ↦ Vector ∈ ℝⁿ    (Vocabulary ➝ ℝⁿ)

Learn a subword-level model

Spelling ↦ Vector ∈ ℝⁿ

by minimizing a loss against the pre-trained vectors.
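
A typical objective for this setup (a sketch; the paper’s exact loss may differ): fit the subword-level model f_θ so that its output matches the pre-trained vector e(w) for every word w in the known vocabulary V.

```latex
\min_{\theta} \; \sum_{w \in V} \bigl\| f_{\theta}(\mathrm{spelling}(w)) - e(w) \bigr\|_2^2
```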

  1. Generalizing towards OOV words without expensive retraining.
  2. More controlled comparison between subword-level models.

18 of 24

Baselines

  • Character-level RNN, MIMICK (Pinter et al., 2017)
  • Bag-of-Subwords, BoS (Zhao et al., 2018)
  • Self-attention + subword hashing, KVQ-FH (Sasaki et al., 2019)

19 of 24

Experiments: word similarity

Task and method: predict the similarity between a pair of words as the cosine similarity of their vectors.

Metric: correlation (Spearman’s ρ × 100) against human judgement.

=> PBoS consistently outperforms the target vectors, especially when the OOV rate is high.

Target: pre-trained word2vec vectors trained on the English Google News dump.

Benchmarks: WordSim353 (Finkelstein et al., 2001), RareWord (Luong et al., 2013), Card-660 (Pilehvar et al., 2018).
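
A minimal sketch of this evaluation protocol (function and argument names are illustrative; `vec` stands in for any spelling-to-vector model):

```python
import numpy as np
from scipy.stats import spearmanr

def word_similarity_score(pairs, human_scores, vec):
    """Spearman's rho (x100) between model cosine similarities and human ratings.
    pairs: list of (word1, word2); human_scores: gold similarity ratings."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    model_scores = [cosine(vec(w1), vec(w2)) for w1, w2 in pairs]
    rho, _ = spearmanr(model_scores, human_scores)
    return 100 * rho
```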

20 of 24

Experiments: multilingual word similarity

PBoS >≈ KVQ-FH > BoS

Target vectors: Wikipedia2Vec (Yamada et al., 2020).

Benchmarks: multilingual WordSim353 and SimLex999 (Leviant and Reichart, 2015).

21 of 24

Experiments: Part-of-Speech tagging

Task: predicting the part of speech for each word.

Method: logistic regression classifier over the vectors of neighboring words.

[Figure: example sentence “… avoid travel to EMNLP … attend … online …” with POS tags (avoid=VERB, travel=NOUN, to=PART, EMNLP=PROPN, attend=VERB, online=ADV), predicted by a logistic regression over the word vectors.]

Evaluation protocol of Kiros et al. (2015) and Li et al. (2017).
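
A minimal sketch of this protocol (window size and names are illustrative, not the exact setup of Kiros et al. or Li et al.): concatenate the vectors of each word and its neighbors, then fit a logistic regression over the POS tags.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def window_features(sentence, vec, window=1):
    """For each position, concatenate the vectors of the word and its
    neighbors, zero-padding at the sentence boundaries."""
    dim = len(vec(sentence[0]))
    padded = ([np.zeros(dim)] * window
              + [np.asarray(vec(w)) for w in sentence]
              + [np.zeros(dim)] * window)
    return [np.concatenate(padded[i:i + 2 * window + 1])
            for i in range(len(sentence))]

# X = np.stack([f for sent in train_sents for f in window_features(sent, vec)])
# y = [tag for sent_tags in train_tags for tag in sent_tags]
# clf = LogisticRegression(max_iter=1000).fit(X, y)
```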

22 of 24

=> PBoS is the best on 22/23 languages and often leads by a big margin.

Target vectors: PolyGlot (Al-Rfou’ et al., 2013). POS tagging dataset: Universal Dependencies 1.4.

23 of 24

Conclusion

  • PBoS simultaneously models subword segmentation and word vector composition.
  • PBoS efficiently considers all possible subword segmentations of a word and derives meaningful subword weights to better compose word embeddings.
  • Experiments suggest PBoS’s advantage for generalizing pre-trained word embeddings, even over a more complex attention-based model.
  • Future directions: application to learning word embeddings; effect of hashing.

24 of 24

Thank you!

Q & A

Check out our Findings paper for more details!

Code available at: https://github.com/jmzhao/pbos