1 of 55

Document Embeddings

A concise literature review

by Shay Palachy

2 of 55

Agenda

3 of 55

Agenda

  • Motivation
  • Approaches
  • Classic Techniques
  • Unsupervised document embedding techniques
  • Supervised document embedding techniques
  • Trends and challenges
  • How to choose which technique to use

4 of 55

Motivation

5 of 55

Motivation

Recall: Word embeddings are mappings of words to real-valued vectors

  • The building blocks of text representation for ML models (which handle only numeric data)
  • Lead to great improvements in almost every NLP task
  • Modern methods yield rich vector space representations...

6 of 55

Modern word embedding spaces

  • Axes/dimensions in the embedding space capture deep semantic or syntactic concepts and phenomena.
  • Distances in the embedding space reflect real semantic and syntactic differences between words.
  • Millions of words are represented using only dozens or hundreds of dimensions.

7 of 55

8 of 55

Motivation

  • So far, deriving a representation for a sequence of words from word embeddings has meant concatenating or averaging them
  • Is this the best we can do?
  • Can the supervised and self-supervised techniques used for word embeddings be extended to learn more meaningful embeddings for larger units of text?

9 of 55

Applications

  • doc2vec: Text classification and sentiment analysis tasks [Le & Mikolov, 2014] and document similarity tasks [Dai et al, 2015].
  • Skip-thought: Semantic relatedness, paraphrase detection, image-sentence ranking, classification and sentiment [Kiros et al, 2015]; POS tagging and dependency relations [Broere, 2017].
  • Deep Semantic Similarity Model: Information retrieval and web search ranking, ad selection/relevance, contextual entity search and interestingness tasks, question answering, knowledge inference, image captioning, and machine translation tasks.

10 of 55

Approaches

11 of 55

Approaches

  • Unsupervised learning (incl. self-supervised approaches)
    • Summarizing word vectors
    • Topic modelling
    • Encoder-decoder models
  • Supervised representation learning
    • Learning document embeddings from labeled data
    • Task-specific supervised document embeddings
    • Jointly learning sentence representations

12 of 55

Classic techniques

A recap

13 of 55

Bag-of-words

  • Choose a vocabulary size K
  • Take the K most important words (how? typically by frequency)
  • 1 where a vocabulary word appears in the document, 0 otherwise
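
A minimal sketch of such a binary bag-of-words representation with scikit-learn (the toy corpus and K = 1000 are illustrative assumptions; max_features keeps the most frequent words):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog ate my homework"]   # toy corpus (illustrative)
vectorizer = CountVectorizer(max_features=1000, binary=True)   # K = 1000 words; 1/0 per word
bow = vectorizer.fit_transform(docs)                           # sparse (n_docs, vocab_size) matrix
```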

14 of 55

Bag-of-ngrams

  • Gain back some of the word order information
  • Encode short phrases as “words”
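
The same sketch extends to bags of n-grams by counting short phrases as tokens, e.g. unigrams and bigrams in scikit-learn (parameters are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

ngram_vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=2000, binary=True)
X = ngram_vectorizer.fit_transform(["the cat sat on the mat", "the dog ate my homework"])
```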

15 of 55

tf-idf weighting

  • TF - count the number of times a word appears in the document
  • The TF term grows as the word appears more often, while the IDF term increases with the word’s rarity across the corpus
  • Adjusts raw frequency scores for the fact that some words appear more (or less) frequently in general
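
A sketch of tf-idf weighting with scikit-learn; each document becomes a vector of tf(w, d) · idf(w) scores instead of raw counts (the corpus is illustrative; scikit-learn uses a smoothed idf variant by default):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog ate my homework"]
tfidf = TfidfVectorizer()                  # idf(w) ≈ log(N / df(w)), smoothed by default
doc_vectors = tfidf.fit_transform(docs)    # rows are tf-idf-weighted document vectors
```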

16 of 55

LDA (topic models)

  • Bag-of-words = A simple probabilistic model of documents as distributions over words
  • Add a latent (hidden) intermediate layer of K topics
  • Topics are now characterized by distributions over words, while documents are distributions over topics

17 of 55

The probabilistic model shift from bag-of-words to LDA

18 of 55

LDA

To generate a set of M documents of lengths {Nᵢ}, assuming a predetermined number K of topics:

[Diagram of the LDA generative process; topic and document-topic distributions are drawn from Dirichlet distributions]
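
A toy numpy sketch of this generative story (the priors, K, vocabulary size and document lengths are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
K, vocab_size, alpha, beta = 5, 1000, 0.1, 0.01
doc_lengths = [50, 80, 120]                                     # N_i for M = 3 documents

phi = rng.dirichlet([beta] * vocab_size, size=K)                # K topic -> word distributions
docs = []
for N_i in doc_lengths:
    theta = rng.dirichlet([alpha] * K)                          # document -> topic distribution
    z = rng.choice(K, size=N_i, p=theta)                        # a topic for every word slot
    docs.append([rng.choice(vocab_size, p=phi[k]) for k in z])  # each word drawn from its topic
```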

19 of 55

20 of 55

LDA

  • Inferring the model also yields a vector space of dimension K
  • It captures, in some way, the topics or themes in our corpus and how they are shared between its documents
  • This is in effect an embedding space for these documents
    • Depending on the choice of K, it can be of a significantly smaller dimension than the vocabulary

every document is now a mixture of topics!
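
A sketch of using the inferred document-topic distributions as K-dimensional document embeddings, here with scikit-learn (the corpus and K = 2 are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["cats and dogs are pets", "stocks and bonds are investments"]
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)   # K = 2 topics
doc_embeddings = lda.fit_transform(counts)     # each row: a distribution over the K topics
```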

21 of 55

Unsupervised document embedding techniques

22 of 55

Unsupervised document embedding techniques

  • n-gram embeddings
  • Averaging word embeddings
  • Sent2Vec
  • Paragraph vectors (doc2vec)
  • Doc2VecC
  • Skip-thought vectors
  • FastSent
  • Quick-thought vectors
  • Word Mover’s Embedding (WME)

23 of 55

n-gram embeddings

  • Identify many short phrases using a data-driven approach
  • Treat them as individual tokens during the training of the word2vec model
  • Less suitable for learning longer phrases
  • Does not generalize well to unseen phrases
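
One common data-driven way to detect such phrases is gensim's Phrases model, which merges frequent collocations (e.g. new_york) into single tokens before word2vec training; a sketch on a toy corpus:

```python
from gensim.models import Phrases, Word2Vec
from gensim.models.phrases import Phraser

corpus = [["new", "york", "is", "big"], ["i", "love", "new", "york"]]   # toy corpus
bigrams = Phraser(Phrases(corpus, min_count=1, threshold=0.1))          # detect collocations
phrased = [bigrams[sentence] for sentence in corpus]                    # "new_york" becomes one token
model = Word2Vec(phrased, vector_size=50, min_count=1)                  # phrases get their own vectors
```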

24 of 55

Averaging word embeddings

  • Use a fixed (unlearnable) vector summarization operator (e.g. avg, sum, concat)
  • Learn word embeddings in a preceding layer, using a learning target tailored for document embedding
    • e.g. using a sentence to predict context sentences
  • Main advantage: Word embeddings are optimized for averaging into document representations!
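
A minimal sketch of the fixed averaging operator over pre-trained word vectors (assuming kv is a gensim KeyedVectors object loaded elsewhere; out-of-vocabulary words are simply skipped):

```python
import numpy as np

def average_embedding(tokens, kv):
    """Document vector = unweighted mean of its in-vocabulary word vectors."""
    vecs = [kv[w] for w in tokens if w in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)
```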

25 of 55

Averaging word embeddings

26 of 55

Averaging word embeddings

  • [Arora et al, 2016] showed it to be a very good baseline by adding two small variations:
    • Use a smooth inverse frequency weighting scheme
    • Remove the common discourse component (found via PCA) from the resulting sentence vectors, presumably related to syntax
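
A sketch of that baseline (smooth inverse frequency weighting followed by common-component removal), assuming word_vecs maps words to numpy vectors, word_freq gives corpus word probabilities, and a ≈ 1e-3 as in the paper:

```python
import numpy as np

def sif_embeddings(docs, word_vecs, word_freq, a=1e-3):
    dim = len(next(iter(word_vecs.values())))
    X = np.zeros((len(docs), dim))
    for i, doc in enumerate(docs):
        words = [w for w in doc if w in word_vecs]
        if words:   # weighted average with weight(w) = a / (a + p(w))
            X[i] = np.mean([a / (a + word_freq.get(w, 0.0)) * word_vecs[w]
                            for w in words], axis=0)
    u, _, _ = np.linalg.svd(X.T @ X)     # first principal direction = common component
    pc = u[:, :1]
    return X - X @ pc @ pc.T             # remove its projection from every document vector
```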

27 of 55

Sent2Vec

  • Combines the two methods above: word2vec’s CBOW with:
    • Word n-grams
    • Word/n-gram embeddings optimized for averaging into document vectors
  • No frequent-word subsampling (so n-gram features are preserved)
  • No dynamic context windows; the context is the whole sentence

28 of 55

sent2vec = unsupervised fastText, with: (1) context = the entire sentence; (2) class labels = vocabulary words

29 of 55

Paragraph Vectors (doc2vec)

  • Add a memory vector, capturing paragraph topic/context
  • Map paragraphs to unique vectors
  • At prediction time, vectors for new paragraphs are inferred by gradient descent, with the rest of the model fixed

30 of 55

Paragraph Vectors: Distributed Memory (PV-DM)

  • Predict a word from the m previous words + the paragraph vector (like word2vec’s CBOW)
  • Concatenate the paragraph and word vectors (concatenation preserves order information)

[Figure: PV-DM architecture - the paragraph matrix D and the word embeddings are averaged/concatenated and fed to a classifier that predicts the next word]

31 of 55

Paragraph Vectors: Distributed Bag of Words (PV-DBOW)

  • Predict context words from paragraph

(like word2vec’s skip-gram)
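
Both variants are available in gensim's Doc2Vec implementation (dm=1 for PV-DM, dm=0 for PV-DBOW); a minimal sketch on a toy corpus, including inference for an unseen paragraph:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words=["the", "cat", "sat", "on", "the", "mat"], tags=[0]),
          TaggedDocument(words=["the", "dog", "ate", "my", "homework"], tags=[1])]
model = Doc2Vec(corpus, vector_size=50, dm=1, min_count=1, epochs=40)   # dm=1 -> PV-DM
new_vec = model.infer_vector(["a", "new", "unseen", "paragraph"])       # gradient steps on a fresh vector
```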

32 of 55

Document vector through corruption (doc2vecC)

Represent documents as the average of the embeddings of randomly sampled words

Corrupt = randomly drop many of the words during training

  • Training speed-up
  • Regularization
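
A rough sketch of the representation and the corruption step (not the full doc2vecC training objective); word_vecs and keep_prob are illustrative assumptions:

```python
import numpy as np

def corrupted_doc_vector(tokens, word_vecs, keep_prob=0.1, rng=np.random.default_rng()):
    """Average the embeddings of a random subsample of the document's words."""
    kept = [w for w in tokens if w in word_vecs and rng.random() < keep_prob]
    if not kept:                                    # fall back to all in-vocabulary words
        kept = [w for w in tokens if w in word_vecs]
    dim = len(next(iter(word_vecs.values())))
    return np.mean([word_vecs[w] for w in kept], axis=0) if kept else np.zeros(dim)
```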

33 of 55

Skip-thought vectors

  • Predict previous/next sentence from current one
  • Learn vector representations w/ RNN encoder-decoder
  • Learn word embeddings for a small vocabulary (~20,000 words)
    • Then learn a mapping from word2vec space into this vector space, to cover unseen words
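
A highly simplified PyTorch sketch of the skip-thought objective (a single GRU encoder and two GRU decoders trained with teacher forcing; the real model uses conditional GRUs, a much larger setup, and vocabulary expansion):

```python
import torch
import torch.nn as nn

class SkipThoughtSketch(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hid_dim=200):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)               # shared word embedding layer
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.dec_prev = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.dec_next = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def encode(self, sent):                                        # sentence vector = final hidden state
        _, h = self.encoder(self.emb(sent))
        return h

    def forward(self, sent, prev_sent, next_sent):
        h = self.encode(sent)
        loss = 0.0
        for dec, target in ((self.dec_prev, prev_sent), (self.dec_next, next_sent)):
            dec_out, _ = dec(self.emb(target[:, :-1]), h)          # decoder conditioned on the encoding
            logits = self.out(dec_out)
            loss = loss + nn.functional.cross_entropy(
                logits.reshape(-1, logits.size(-1)), target[:, 1:].reshape(-1))
        return loss
```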

34 of 55

[Figure: skip-thought architecture - a shared word embedding layer feeds the encoder; separate decoders generate the previous and next sentences]

35 of 55

Skip-thought vectors

Three improvements to skip-thought: [Tang et al, 2017]

  • Only learn to decode the next sentence
  • Add avg+max connections between encoder and decoder
    • Allows non-linear non-parametric feature engineering
  • Good word embedding initialization

([Gan et al, 2016] use a hierarchical CNN-LSTM-based encoder instead)

36 of 55

Quick-thought vectors

  • Reformulate sentence context prediction as a supervised classification problem

37 of 55

[Figure: comparison of the quick-thought and skip-thought training objectives]

38 of 55

Quick-thought vectors

  • Reformulate sentence context prediction as a supervised classification problem
  • Learn two (RNN) encoders: f for input, g for candidates
  • Candidate set built from (1) valid context sentences and (2) other, non-context sentences
  • Training objective maximizes the probability of identifying the correct context sentences
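
A compact PyTorch sketch of this classification objective (two GRU encoders f and g; candidates are scored by inner product and the correct context sentence is picked with a softmax; shapes are illustrative):

```python
import torch
import torch.nn as nn

class QuickThoughtSketch(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hid_dim=200):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.f = nn.GRU(emb_dim, hid_dim, batch_first=True)   # encoder for the input sentence
        self.g = nn.GRU(emb_dim, hid_dim, batch_first=True)   # encoder for candidate sentences

    def forward(self, sent, candidates, target_idx):
        # sent: (batch, len); candidates: (batch, n_cand, len); target_idx: (batch,)
        _, h_f = self.f(self.emb(sent))                        # (1, batch, hid)
        b, n, l = candidates.shape
        _, h_g = self.g(self.emb(candidates.reshape(b * n, l)))
        h_g = h_g.squeeze(0).reshape(b, n, -1)                 # (batch, n_cand, hid)
        scores = torch.bmm(h_g, h_f.squeeze(0).unsqueeze(2)).squeeze(2)   # inner products
        return nn.functional.cross_entropy(scores, target_idx)  # identify the true context sentence
```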

39 of 55

Word Mover’s Embedding (WME)

Given a rich unsupervised word embedding space

  • Build doc distance metric w/ Word Mover’s Distance (WMD)
  • Derive from it a positive-definite kernel w/ D2KE
  • Derive doc embedding w/ random features approx of kernel
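
A rough sketch of the random-features approximation, assuming kv is a gensim KeyedVectors model (its wmdistance method computes WMD) and the random documents omega_j are, say, short documents sampled from the training corpus; gamma and R are illustrative:

```python
import numpy as np

def wme_embedding(doc_tokens, random_docs, kv, gamma=1.0):
    """phi(x)_j = exp(-gamma * WMD(x, omega_j)), scaled by 1/sqrt(R)."""
    feats = [np.exp(-gamma * kv.wmdistance(doc_tokens, omega)) for omega in random_docs]
    return np.array(feats) / np.sqrt(len(random_docs))
```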

40 of 55

41 of 55

Supervised document embedding techniques

42 of 55

Learning document embeddings from labeled data

  • Explicitly learn phrase embeddings w/ a parallel corpus of phrases for statistical machine translation [Cho et al, 2014a]
  • Maximize the cosine similarity between embeddings of paraphrase pairs [Wieting et al, 2015]
  • Train neural language models to map dictionary definitions to pre-trained word embeddings of the words defined by those definitions [Hill et al, 2015]

43 of 55

[Figure: siamese architecture - word embeddings feed into stacks of 3 bidirectional GRUs; both branches share parameter weights]

44 of 55

Task-specific supervised document embeddings

  • Train a neural network on some supervised NLP task
  • Word embeddings are fed in as input
  • Each hidden layer of the network can be seen as producing a vector embedding of the input document
  • Produces task-specific document embeddings
  • Less robust across tasks than unsupervised ones
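
A minimal PyTorch sketch of the idea: train a small classifier on some labeled task and reuse its hidden layer as a task-specific document embedding (the 100-dimensional averaged-word-vector input and the 3-class task are illustrative assumptions):

```python
import torch
import torch.nn as nn

class TaskClassifier(nn.Module):
    def __init__(self, in_dim=100, hid_dim=64, n_classes=3):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Tanh())
        self.out = nn.Linear(hid_dim, n_classes)

    def forward(self, x):                  # used during supervised training
        return self.out(self.hidden(x))

    def embed(self, x):                    # hidden activations reused as the document embedding
        return self.hidden(x)
```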

45 of 55

46 of 55

Jointly learning sentence representations

  • Learn sentence representations from multiple text classification tasks
  • Combine them with pre-trained word-level and sentence-level encoders
  • Get robust sentence representations that are useful for transfer learning

47 of 55

48 of 55

Trends and challenges

49 of 55

Trends and challenges

  • Encoder-Decoder Optimization
    • Architecture: NN/CNN/RNN
    • Hyperparams: n-grams, projection functions, weighting
    • Goal: Improve success metrics
    • Goal: Train models over larger corpora / in shorter time

50 of 55

Trends and challenges

  • Learning objective design
    • Quick-thought
    • Word Mover’s Distance
    • Innovations might be applicable to the problem of word embedding

51 of 55

Trends and challenges

  • Benchmarking
  • Open-sourcing
    • Reproducibility & application to real-world problems
  • Cross-task applicability
    • More of a problem for supervised methods
  • Lack of large labeled corpora
    • Only a problem for supervised methods

(no winners so far)

52 of 55

How to choose which technique to use

53 of 55

How to choose which technique to use

  • Averaging word vectors is a strong baseline
    • Focus on good word embeddings
    • Compare different techniques
    • Try some tricks (like in [Arora et al, 2016])
  • Performance can be a key consideration
    • Fast: Word vector avg, sent2vec and FastSent
    • Slower: doc2vec

54 of 55

How to choose which technique to use

  • Consider how well the learning objective fits your task
    • skip-thought and quick-thought model sentences as strongly related to their neighbors in a document
  • Many open-source implementations mean you can compare different solutions
  • There are no clear task-specific leaders. :(

55 of 55

That’s it!