PBoS: Probabilistic Bag-of-Subwords for Generalizing Word Embedding
Zhao Jinman, Shawn Zhong, Xiaomin Zhang, Yingyu Liang
Accepted to Findings of EMNLP 2020. Presenting at SustaiNLP 2020.
Learn
Spelling ↦ Word vector
Why?
Background: word embedding
Word embedding:
Word ↦ Vector ∈ ℝⁿ
Word ID (within some fixed vocabulary)
(By many widely used word embedding methods*.)
*word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), etc.
Background: word embedding
☹️️ OOV words
🤔️ workshop ⇔ high ⇔ higher
=> Look into word spelling: subword-level models.
Motivation: generalizing word embedding
☹️️ Out-of-vocabulary (OOV) word problem.
👍🏻️ Provide embeddings for OOV words without expensive retraining.
🤔️ How much can we do by looking at the words themselves?
Learn
Spelling ↦ Word vector
How?
Inspirations
We can often guess the meaning of an unseen word just by looking at it: “postEMNLP”.
Some subwords are more likely than others. E.g. “farm” vs “arml” in “farmland”.
Subwords segment words. E.g. “hig/her” vs “high/er”.
Language speakers do this without linguistic training.
Previous proposal: BoS (bag of subwords)
PBoS: subword segmentation + subword-based word vector composition
PBoS: probabilistic bag-of-subwords
w = “higher”
Segmentations:
higher
…
hig/her
high/er
…
hi/gh/er
…
h/i/g/h/e/r
Compose by summing subword vectors:
v(“high/er”) = v(“high”) + v(“er”) 😃
v(“hig/her”) = v(“hig”) + v(“her”) 🤔️
Subword vectors v(·) are the learnable parameters; subword probabilities p(·) come from subword frequency.
Segmentation likelihood:
p(“hig/her” | w) ∝ p(“hig”) p(“her”)
p(“high/er” | w) ∝ p(“high”) p(“er”)
“Just like BoS”
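A minimal sketch of this scoring (not the actual PBoS implementation): enumerate the segmentations of a word and weight each one by the product of its subword probabilities; p_sub holds toy, made-up subword probabilities.

from itertools import combinations

def segmentations(word):
    """Yield every segmentation of `word` as a tuple of subwords."""
    n = len(word)
    for k in range(n):  # number of cut points
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield tuple(word[i:j] for i, j in zip(bounds, bounds[1:]))

def segmentation_probs(word, p_sub):
    """p(seg | word) is proportional to the product of the subword probabilities in seg."""
    scores = {}
    for seg in segmentations(word):
        score = 1.0
        for sub in seg:
            score *= p_sub.get(sub, 0.0)
        if score > 0.0:
            scores[seg] = score
    total = sum(scores.values())
    return {seg: s / total for seg, s in scores.items()}

# Toy subword probabilities (made-up numbers, just for illustration).
p_sub = {"high": 0.03, "hig": 0.001, "her": 0.02, "er": 0.05, "h": 0.01}
for seg, p in sorted(segmentation_probs("higher", p_sub).items(), key=lambda kv: -kv[1]):
    print("/".join(seg), round(p, 4))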
PBoS: efficient algorithm
Naively summing over all segmentations is too expensive ☹️️
Instead, compute each “subword weight w.r.t. w” with dynamic programming 👍🏻️
=> only 30% overhead compared to BoS!
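A simplified reconstruction of such a dynamic program (a sketch, not the repository code; the max_len cap on subword length is an assumption): prefix and suffix masses give, for every subword occurrence, the total probability of segmentations that contain it, and the word vector is then a weighted bag of subwords.

def subword_weights(word, p_sub, max_len=8):
    """Weight of subword g w.r.t. `word`: total probability of segmentations
    of `word` that contain g, computed in O(len(word) * max_len)."""
    n = len(word)
    fwd = [0.0] * (n + 1)      # fwd[i]: unnormalized mass of segmentations of word[:i]
    fwd[0] = 1.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            fwd[j] += fwd[i] * p_sub.get(word[i:j], 0.0)
    bwd = [0.0] * (n + 1)      # bwd[j]: unnormalized mass of segmentations of word[j:]
    bwd[n] = 1.0
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, min(n, i + max_len) + 1):
            bwd[i] += p_sub.get(word[i:j], 0.0) * bwd[j]
    total = fwd[n]             # equals bwd[0]
    weights = {}
    if total == 0.0:
        return weights
    for i in range(n):
        for j in range(i + 1, min(n, i + max_len) + 1):
            p = p_sub.get(word[i:j], 0.0)
            if p > 0.0:        # mass of segmentations that use the occurrence word[i:j]
                weights[word[i:j]] = weights.get(word[i:j], 0.0) + fwd[i] * p * bwd[j] / total
    return weights

def compose(word, p_sub, sub_vecs, dim):
    """v(word) = sum of (subword weight w.r.t. word) * (subword vector)."""
    vec = [0.0] * dim
    for sub, w in subword_weights(word, p_sub).items():
        if sub in sub_vecs:
            vec = [a + w * b for a, b in zip(vec, sub_vecs[sub])]
    return vec

Because each weight only reuses the two prefix/suffix tables, the extra cost over plain BoS stays small, consistent with the reported ~30% overhead.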
Evaluation: How does subword segmentation work?
Experiments: subword segmentation
Top subword segmentations for some example “words”.
=> PBoS assigns sensible likelihoods to subword segmentations.
Experiments: affix prediction
Task: “replaceable” -> “-able”; “rename” -> “re-”. Unambiguous ones dropped.
Method: PBoS: take the top-ranked affix (sketched below). BoS: randomly choose an affix.
=> PBoS nearly doubles the scores of BoS.
Benchmark: derivational morphology dataset (Lazaridou et al., 2013).
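One plausible reading of “take the top-ranked affix” (an assumption; AFFIXES is a hypothetical inventory, not the one used in the paper): among the word's subwords that are known affixes, return the one with the largest PBoS weight, reusing subword_weights from the sketch above.

AFFIXES = {"able", "re", "er", "ness", "un", "ly"}  # hypothetical affix inventory

def predict_affix(word, p_sub):
    """Predict the affix of `word` as its highest-weighted subword that is a known affix."""
    weights = subword_weights(word, p_sub)          # from the dynamic-programming sketch above
    candidates = {s: w for s, w in weights.items() if s in AFFIXES}
    return max(candidates, key=candidates.get) if candidates else None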
Evaluation: How does word vector composition work?
Scenario: generalizing word embedding
Pre-trained word embeddings over a finite set of words (Vocabulary ➝ ℝⁿ):
Word ID ↦ Vector ∈ ℝⁿ
Subword-level model:
Spelling ↦ Vector ∈ ℝⁿ
Learn the subword-level model by loss minimization against the pre-trained vectors (sketched below).
Baselines: subword-level models such as BoS and KVQ-FH.
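A hedged sketch of that training loop, assuming a squared loss and plain SGD (the paper's exact objective and optimizer may differ); subword_weights_of is a stand-in that maps a word to its subword weights, e.g. the PBoS weights from the earlier sketch.

import numpy as np

def train_subword_vectors(target, subword_weights_of, dim, epochs=5, lr=0.5):
    """target: word -> pre-trained vector (np.ndarray of size dim).
    Learns one vector per subword so that the composed vector matches the target."""
    sub_vecs = {}                                   # subword -> np.ndarray
    words = list(target)
    for _ in range(epochs):
        np.random.shuffle(words)
        for word in words:
            weights = subword_weights_of(word)
            if not weights:
                continue
            for sub in weights:
                sub_vecs.setdefault(sub, np.zeros(dim))
            composed = sum(w * sub_vecs[s] for s, w in weights.items())
            grad = 2.0 * (composed - target[word])  # gradient of ||composed - target||^2
            for s, w in weights.items():
                sub_vecs[s] -= lr * w * grad        # chain rule through the weighted sum
    return sub_vecs

At test time, any spelling, including an OOV word, gets a vector by composing the learned subword vectors with its own subword weights.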
Experiments: word similarity
Task and method: predict the similarity between a pair of words as the cosine similarity of their vectors (see the sketch after this slide).
Metric: correlation (Spearman’s 𝝆⨉100) against human judgement.
=> PBoS consistently outperforms the target vectors, especially when the OOV rate is high.
Target vectors: pre-trained word2vec vectors trained on the English Google News dump.
Benchmarks: WordSim353 (Finkelstein et al., 2001), RareWord (Luong et al., 2013), Card-660 (Pilehvar et al., 2018).
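The evaluation itself is straightforward; a small sketch assuming embed(word) returns the composed vector for any spelling:

import numpy as np
from scipy.stats import spearmanr

def eval_word_similarity(pairs, embed):
    """pairs: list of (word1, word2, human_score).
    Returns Spearman's rho (x100) between model and human similarity rankings."""
    model_scores, human_scores = [], []
    for w1, w2, gold in pairs:
        v1, v2 = embed(w1), embed(w2)
        cos = float(np.dot(v1, v2)) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12)
        model_scores.append(cos)
        human_scores.append(gold)
    rho, _ = spearmanr(model_scores, human_scores)
    return 100.0 * rho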
Experiments: multilingual word similarity
PBoS >≈ KVQ-FH > BoS
Target vectors: Wikipedia2Vec (Yamada et al., 2020).
Benchmarks: multilingual WordSim353 and SimLex999 (Leviant and Reichart, 2015).
Experiments: Part-of-Speech tagging
Task: predicting the part of speech for each word.
Method: logistic regression classifier over the vectors of neighboring words (see the sketch after this slide).
[Figure: example sentence words (“avoid”/VERB, “travel”/NOUN, “to”/PART, “attend”/VERB, “EMNLP”/PROPN, “online”/ADV, …) fed to a logistic regression classifier that predicts the POS tags.]
Evaluation protocol of Kiros et al. (2015) and Li et al. (2017).
=> PBoS is the best on 22/23 languages and often leads by a big margin.
Target vectors: Polyglot (Al-Rfou’ et al., 2013). POS tagging dataset: Universal Dependencies 1.4.
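A rough sketch of the logistic-regression setup described above (the window size, padding, and scikit-learn settings are assumptions, not taken from the protocol papers):

import numpy as np
from sklearn.linear_model import LogisticRegression

def window_features(words, i, embed, dim, window=2):
    """Concatenate the vectors of words in a context window around position i."""
    feats = []
    for j in range(i - window, i + window + 1):
        feats.append(embed(words[j]) if 0 <= j < len(words) else np.zeros(dim))  # zero-pad outside the sentence
    return np.concatenate(feats)

def train_pos_tagger(tagged_sentences, embed, dim):
    """tagged_sentences: list of sentences, each a list of (word, pos_tag) pairs."""
    X, y = [], []
    for sent in tagged_sentences:
        words = [w for w, _ in sent]
        for i, (_, tag) in enumerate(sent):
            X.append(window_features(words, i, embed, dim))
            y.append(tag)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(np.array(X), y)
    return clf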
Conclusion
Thank you!
Q & A
Check out our Findings paper for more details!
Code available at: https://github.com/jmzhao/pbos