Document Embeddings
A concise literature review
by Shay Palachy
For a complementary literature review see https://towardsdatascience.com/document-embedding-techniques-fed3e7a6a25d
Agenda
Motivation
Recall: Word embeddings are mappings of words into real-numbered vectors
Modern word embedding spaces
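To make the recap concrete, here is a minimal sketch of what "words as real-numbered vectors" buys us: semantically related words end up close under cosine similarity. The 3-dimensional vectors below are hand-picked toy values, not real embeddings (real spaces such as word2vec or GloVe use 100-300 dimensions).

```python
import math

# Toy word embedding space (hand-picked 3-d vectors for illustration only).
emb = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Semantically related words lie close together in the space:
assert cosine(emb["king"], emb["queen"]) > cosine(emb["king"], emb["apple"])
```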
Applications
Approaches
unsupervised learning (i.e. self-supervised learning)
Classic techniques
A recap
Bag-of-words
Bag-of-ngrams
tf-idf weighting
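The three classic representations can be sketched in a few lines of standard-library Python (the tokenization and the log(N/df) idf variant below are one common choice among several):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [d.split() for d in docs]

# Bag-of-words: each document becomes a vector of raw term counts.
bow = [Counter(toks) for toks in tokenized]

# Bag-of-ngrams (here bigrams): keeps some local word order.
bigrams = [Counter(zip(toks, toks[1:])) for toks in tokenized]

# tf-idf weighting: down-weight terms that appear in many documents.
N = len(docs)
df = Counter(t for toks in tokenized for t in set(toks))
tfidf = [
    {t: c * math.log(N / df[t]) for t, c in counts.items()}
    for counts in bow
]

# "the" occurs in 2 of 3 docs (and twice in doc 0), yet carries less weight
# than "cat", which occurs in only 1 of 3 docs.
assert tfidf[0]["cat"] > tfidf[0]["the"]
```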
LDA (topic models)
The probabilistic model shift from bag-of-words to LDA
LDA
To generate a set of M documents of lengths {Nᵢ}, assuming a predetermined number K of topics (with topic mixtures drawn from a Dirichlet distribution):
every document is now a mixture of topics!
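The generative process above can be simulated directly with numpy (the hyperparameter values and sizes here are illustrative, not ones from the LDA paper):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, M = 3, 20, 5               # topics, vocabulary size, number of documents
lengths = [50, 40, 60, 30, 45]   # N_i, the length of each document
alpha, beta = 0.5, 0.1           # Dirichlet hyperparameters (illustrative)

# Each topic is a distribution over the vocabulary.
phi = rng.dirichlet(beta * np.ones(V), size=K)

docs = []
for N_i in lengths:
    theta = rng.dirichlet(alpha * np.ones(K))     # per-document topic mixture
    z = rng.choice(K, size=N_i, p=theta)          # a topic for each word slot
    words = [rng.choice(V, p=phi[k]) for k in z]  # a word from that topic
    docs.append(words)
```

Every document is a mixture of topics (theta), and every topic is a mixture of words (a row of phi).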
Unsupervised document embedding techniques
n-gram embeddings
Averaging word embeddings
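The simplest baseline in one sketch: a document vector is the (optionally tf-idf-weighted) average of its word vectors. The embeddings below are random stand-ins for pretrained ones, and the weights are made-up illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
# Stand-in for pretrained word embeddings (random here, for illustration).
E = {w: rng.normal(size=50) for w in vocab}

def embed(doc, weights=None):
    """Document embedding = (optionally weighted) average of word vectors."""
    vecs = np.stack([E[w] for w in doc])
    if weights is None:
        return vecs.mean(axis=0)
    w = np.array([weights[t] for t in doc])
    return (w[:, None] * vecs).sum(axis=0) / w.sum()

doc = ["the", "cat", "sat", "on", "the", "mat"]
v_plain = embed(doc)
# A common refinement: tf-idf-weight the average to mute frequent words
# (the weights below are illustrative, not computed from a corpus).
v_tfidf = embed(doc, weights={"the": 0.1, "cat": 1.2, "sat": 0.8,
                              "on": 0.3, "mat": 1.5})
```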
Sent2Vec
sent2vec = unsupervised fastText, with:
1) context = entire sentence
2) class labels = vocabulary words
Paragraph Vectors (doc2vec)
Paragraph Vectors: Distributed Memory (PV-DM)
(like word2vec's CBOW, but word-order information is preserved)
Architecture: a paragraph vector, looked up in a paragraph matrix D, is averaged or concatenated with the context word embeddings and fed to a classifier that predicts the next word.
Paragraph Vectors: Distributed Bag of Words (PV-DBOW)
(like word2vec’s skip-gram)
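In practice both variants are typically trained with gensim's Doc2Vec (dm=1 for PV-DM, dm=0 for PV-DBOW). As a self-contained illustration, here is a didactic numpy sketch of the simpler PV-DBOW objective: each document gets its own trainable vector, optimized to predict the words the document contains. The data, sizes, and full-softmax training loop are toy choices, not the paper's implementation (real systems use negative sampling or hierarchical softmax):

```python
import numpy as np

rng = np.random.default_rng(0)
docs = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "dog", "pet"]]
vocab = sorted({w for d in docs for w in d})
w2i = {w: i for i, w in enumerate(vocab)}
V, dim = len(vocab), 8

D = rng.normal(0, 0.1, (len(docs), dim))  # one trainable vector per document
W = rng.normal(0, 0.1, (dim, V))          # softmax output weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

lr = 0.1
for _ in range(100):
    for d_idx, doc in enumerate(docs):
        for word in doc:                   # PV-DBOW ignores word order
            p = softmax(D[d_idx] @ W)
            g = p.copy()
            g[w2i[word]] -= 1.0            # gradient of softmax cross-entropy
            dD, dW = W @ g, np.outer(D[d_idx], g)
            D[d_idx] -= lr * dD
            W -= lr * dW

doc_vecs = D  # rows are the learned paragraph vectors
```

PV-DM differs only in that the document vector is averaged/concatenated with context word vectors before the prediction step.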
Document vector through corruption (doc2vecC)
Represent a document as the average embedding of a random sample of its words
Corruption = randomly dropping a large fraction of the words
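The corruption step is cheap to sketch: keep each word with small probability and rescale so the corrupted representation stays an unbiased estimate of the full average. The embeddings are random stand-ins and the rate q is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim = 1000, 50
E = rng.normal(size=(V, dim))        # stand-in for pretrained word embeddings
doc = rng.integers(0, V, size=200)   # a document as word indices

q = 0.9                              # corruption rate: drop each word w.p. q
keep = rng.random(doc.shape) >= q    # only ~10% of the words survive
# Sum the surviving embeddings and divide by the *expected* number of kept
# words, so the result is unbiased for the average over all words.
doc_vec = E[doc[keep]].sum(axis=0) / (len(doc) * (1 - q))
```

Dropping most words makes each training step far cheaper while acting as a strong regularizer.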
Skip-thought vectors
shared word embedding layer
separate decoders for the previous/next sentences
Three improvements to skip-thought: [Tang et al., 2017]
([Gan et al., 2016] use a hierarchical CNN-LSTM-based encoder instead)
Quick-thought vectors
Instead of generating the neighboring sentences word by word (as skip-thought does), quick-thought merely classifies which candidate sentence is the true neighbor.
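The classification step can be sketched in numpy. In the real model the encodings come from two learned sentence encoders; here they are random stand-ins, with one candidate deliberately constructed to be the "true" neighbor, and the scoring is the plain inner-product classifier the method uses:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64
# Stand-ins for encoder outputs (in the real model, f encodes the current
# sentence and g encodes the candidate sentences).
f_s = rng.normal(size=dim)                # encoding of the current sentence
g_cands = rng.normal(size=(5, dim))       # 1 true neighbor + 4 negatives
g_cands[2] = f_s + 0.05 * rng.normal(size=dim)  # make candidate 2 the "true" one

scores = g_cands @ f_s                    # classifier = plain inner product
probs = np.exp(scores - scores.max())
probs /= probs.sum()                      # softmax over the candidate set

predicted = int(np.argmax(probs))         # training pushes this to the true index
```

Replacing word-by-word generation with this discriminative objective is what makes quick-thought much faster to train than skip-thought.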
Word Mover’s Embedding (WME)
Given a rich unsupervised word embedding space, embed each document as the vector of its Word Mover's Distances to a set of random documents.
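A runnable sketch of the idea, with two stated simplifications: the word embeddings are random stand-ins, and the exact optimal-transport Word Mover's Distance is replaced by the cheap "relaxed" lower bound (each word travels to its nearest word in the other document); R and gamma are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim, R = 500, 50, 8
E = rng.normal(size=(V, dim))            # stand-in word embedding space

def relaxed_wmd(doc_a, doc_b):
    """Cheap stand-in for Word Mover's Distance: each word in doc_a
    travels to its nearest word in doc_b (a known WMD lower bound)."""
    A, B = E[doc_a], E[doc_b]
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return d.min(axis=1).mean()

# R random short "documents" act as landmarks; the document embedding is a
# function of its distance to each landmark.
landmarks = [rng.integers(0, V, size=rng.integers(2, 6)) for _ in range(R)]
gamma = 0.1

def wme(doc):
    return np.array([np.exp(-gamma * relaxed_wmd(doc, w)) for w in landmarks])

doc = rng.integers(0, V, size=30)
vec = wme(doc)   # an R-dimensional document embedding
```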
Supervised document embedding techniques
Learning document embeddings from labeled data
(Figure: word embeddings feed into stacks of 3 bidirectional GRUs; the two branches share parameter weights.)
Task-specific supervised document embeddings
Jointly learning sentence representations
Trends and challenges
(no winners so far)
How to choose which technique to use
That’s it!
For a more thorough overview please read:
https://towardsdatascience.com/document-embedding-techniques-fed3e7a6a25d