1 of 55

Document Embeddings

A concise literature review

by Shay Palachy

2 of 55

Agenda

3 of 55

Agenda

  • Motivation
  • Approaches
  • Classic Techniques
  • Unsupervised document embedding techniques
  • Supervised document embedding techniques
  • Trends and challenges
  • How to choose which technique to use

4 of 55

Motivation

5 of 55

Motivation

Recall: Word embeddings are mappings of words to real-valued vectors

  • The building blocks of text representation for ML models (which handle only numeric data)
  • Lead to great improvements in almost every NLP task
  • Modern methods yield rich vector space representations...

6 of 55

Modern word embedding spaces

  • Axes/dimensions in the embedding space capture deep semantic or syntactic concepts and phenomena.
  • Distances in the embedding space reflect real semantic and syntactic differences between words.
  • Millions of words are represented using only dozens or hundreds of dimensions.

7 of 55

8 of 55

Motivation

  • So far, deriving a representation for a sequence of words from word embeddings has meant concatenating or averaging them
  • Is this the best we can do?
  • Can the supervised and self-supervised techniques used for word embeddings be extended to learn more meaningful embeddings for larger units of text?

9 of 55

Applications

  • doc2vec: Text classification and sentiment analysis tasks [Le & Mikolov, 2014] and document similarity tasks [Dai et al, 2015].
  • Skip-thought: Semantic relatedness, paraphrase detection, image-sentence ranking, classification and sentiment [Kiros et al, 2015]; POS tagging and dependency relations [Broere, 2017].
  • Deep Semantic Similarity Model: Information retrieval and web search ranking, ad selection/relevance, contextual entity search and interestingness tasks, question answering, knowledge inference, image captioning, and machine translation tasks.

10 of 55

Approaches

11 of 55

Approaches

  • Unsupervised learning (incl. self-supervised approaches)
    • Summarizing word vectors
    • Topic modelling
    • Encoder-decoder models
  • Supervised representation learning
    • Learning document embeddings from labeled data
    • Task-specific supervised document embeddings
    • Jointly learning sentence representations

12 of 55

Classic techniques

A recap

13 of 55

Bag-of-words

  • Choose a vocabulary size K
  • Take the K most important words (how? typically by frequency)
  • 1 where a vocabulary word appears in the document, 0 otherwise
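
A minimal sketch of such a binary bag-of-words representation with scikit-learn (the toy corpus and K = 1000 are illustrative assumptions; max_features keeps the most frequent words):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog ate my homework"]   # toy corpus (illustrative)
vectorizer = CountVectorizer(max_features=1000, binary=True)   # K = 1000 words; 1/0 per word
bow = vectorizer.fit_transform(docs)                           # sparse (n_docs, vocab_size) matrix
```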

14 of 55

Bag-of-ngrams

  • Gain back some of the word order information
  • Encode short phrases as “words”
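
The same sketch extends to bags of n-grams by counting short phrases as tokens, e.g. unigrams and bigrams in scikit-learn (parameters are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

ngram_vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=2000, binary=True)
X = ngram_vectorizer.fit_transform(["the cat sat on the mat", "the dog ate my homework"])
```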

15 of 55

tf-idf weighting

  • TF - count the number of times a word appears in the document
  • The TF term grows as the word appears more often, while the IDF term increases with the word’s rarity across the corpus
  • Adjusts raw frequency scores for the fact that some words appear more (or less) frequently in general
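
A sketch of tf-idf weighting with scikit-learn; each document becomes a vector of tf(w, d) · idf(w) scores instead of raw counts (the corpus is illustrative; scikit-learn uses a smoothed idf variant by default):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog ate my homework"]
tfidf = TfidfVectorizer()                  # idf(w) ≈ log(N / df(w)), smoothed by default
doc_vectors = tfidf.fit_transform(docs)    # rows are tf-idf-weighted document vectors
```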

16 of 55

LDA (topic models)

  • Bag-of-words = A simple probabilistic model of documents as distributions over words
  • Add a latent (hidden) intermediate layer of K topics
  • Topics are now characterized by distributions over words, while documents are distributions over topics

17 of 55

The probabilistic model shift from bag-of-words to LDA

18 of 55

LDA

To generate a set of M documents of lengths {Nᵢ}, assuming a predetermined number K of topics:

[Diagram of the LDA generative process; topic and document-topic distributions are drawn from Dirichlet distributions]
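
A toy numpy sketch of this generative story (the priors, K, vocabulary size and document lengths are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
K, vocab_size, alpha, beta = 5, 1000, 0.1, 0.01
doc_lengths = [50, 80, 120]                                     # N_i for M = 3 documents

phi = rng.dirichlet([beta] * vocab_size, size=K)                # K topic -> word distributions
docs = []
for N_i in doc_lengths:
    theta = rng.dirichlet([alpha] * K)                          # document -> topic distribution
    z = rng.choice(K, size=N_i, p=theta)                        # a topic for every word slot
    docs.append([rng.choice(vocab_size, p=phi[k]) for k in z])  # each word drawn from its topic
```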

19 of 55

20 of 55

LDA

  • Inferring the model also yields a vector space of dimension K
  • It captures, in some way, the topics or themes in our corpus and how they are shared between its documents
  • This is in effect an embedding space for these documents
    • Depending on the choice of K, it can be of a significantly smaller dimension than the vocabulary

every document is now a mixture of topics!
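
A sketch of using the inferred document-topic distributions as K-dimensional document embeddings, here with scikit-learn (the corpus and K = 2 are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["cats and dogs are pets", "stocks and bonds are investments"]
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)   # K = 2 topics
doc_embeddings = lda.fit_transform(counts)     # each row: a distribution over the K topics
```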

21 of 55

Unsupervised document embedding techniques

22 of 55

Unsupervised document embedding techniques

  • n-gram embeddings
  • Averaging word embeddings
  • Sent2Vec
  • Paragraph vectors (doc2vec)
  • Doc2VecC
  • Skip-thought vectors
  • FastSent
  • Quick-thought vectors
  • Word Mover’s Embedding (WME)

23 of 55

n-gram embeddings

  • Identify many short phrases using a data-driven approach
  • Treat them as individual tokens during the training of the word2vec model
  • Less suitable for learning longer phrases
  • Does not generalize well to unseen phrases
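
One common data-driven way to detect such phrases is gensim's Phrases model, which merges frequent collocations (e.g. new_york) into single tokens before word2vec training; a sketch on a toy corpus:

```python
from gensim.models import Phrases, Word2Vec
from gensim.models.phrases import Phraser

corpus = [["new", "york", "is", "big"], ["i", "love", "new", "york"]]   # toy corpus
bigrams = Phraser(Phrases(corpus, min_count=1, threshold=0.1))          # detect collocations
phrased = [bigrams[sentence] for sentence in corpus]                    # "new_york" becomes one token
model = Word2Vec(phrased, vector_size=50, min_count=1)                  # phrases get their own vectors
```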

24 of 55

Averaging word embeddings

  • Use a fixed (unlearnable) vector summarization operator (e.g. avg, sum, concat)
  • Learn word embeddings in a preceding layer, using a learning target tailored for document embedding
    • e.g. using a sentence to predict context sentences
  • Main advantage: Word embeddings are optimized for averaging into document representations!
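
A minimal sketch of the fixed averaging operator over pre-trained word vectors (assuming kv is a gensim KeyedVectors object loaded elsewhere; out-of-vocabulary words are simply skipped):

```python
import numpy as np

def average_embedding(tokens, kv):
    """Document vector = unweighted mean of its in-vocabulary word vectors."""
    vecs = [kv[w] for w in tokens if w in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)
```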

25 of 55

Averaging word embeddings

26 of 55

Averaging word embeddings

  • [Arora et al, 2016] showed it to be a very good baseline by adding two small variations:
    • Use a smooth inverse frequency weighting scheme
    • Remove the common discourse component (found via PCA) from the resulting sentence vectors, presumably related to syntax
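
A sketch of that baseline (smooth inverse frequency weighting followed by common-component removal), assuming word_vecs maps words to numpy vectors, word_freq gives corpus word probabilities, and a ≈ 1e-3 as in the paper:

```python
import numpy as np

def sif_embeddings(docs, word_vecs, word_freq, a=1e-3):
    dim = len(next(iter(word_vecs.values())))
    X = np.zeros((len(docs), dim))
    for i, doc in enumerate(docs):
        words = [w for w in doc if w in word_vecs]
        if words:   # weighted average with weight(w) = a / (a + p(w))
            X[i] = np.mean([a / (a + word_freq.get(w, 0.0)) * word_vecs[w]
                            for w in words], axis=0)
    u, _, _ = np.linalg.svd(X.T @ X)     # first principal direction = common component
    pc = u[:, :1]
    return X - X @ pc @ pc.T             # remove its projection from every document vector
```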

27 of 55

Sent2Vec

  • Combines the two methods above: word2vec’s CBOW with:
    • Word n-grams
    • Word/n-gram embeddings optimized for averaging into document vectors
  • No frequent-word subsampling (so n-gram features are preserved)
  • No dynamic context windows; the context is the whole sentence

28 of 55

sent2vec = unsupervised fastText, with: (1) context = the entire sentence; (2) class labels = vocabulary words

29 of 55

Paragraph Vectors (doc2vec)

  • Add a memory vector, capturing paragraph topic/context
  • Map paragraphs to unique vectors
  • At prediction time, vectors for new paragraphs are inferred by gradient descent, with the rest of the model fixed

30 of 55

Paragraph Vectors: Distributed Memory (PV-DM)

  • Predict a word from the m previous words + the paragraph vector (like word2vec’s CBOW)
  • Concatenate the paragraph and word vectors (concatenation preserves order information)

[Figure: PV-DM architecture - the paragraph matrix D and the word embeddings are averaged/concatenated and fed to a classifier that predicts the next word]

31 of 55

Paragraph Vectors: Distributed Bag of Words (PV-DBOW)

  • Predict context words from paragraph

(like word2vec’s skip-gram)
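
Both variants are available in gensim's Doc2Vec implementation (dm=1 for PV-DM, dm=0 for PV-DBOW); a minimal sketch on a toy corpus, including inference for an unseen paragraph:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words=["the", "cat", "sat", "on", "the", "mat"], tags=[0]),
          TaggedDocument(words=["the", "dog", "ate", "my", "homework"], tags=[1])]
model = Doc2Vec(corpus, vector_size=50, dm=1, min_count=1, epochs=40)   # dm=1 -> PV-DM
new_vec = model.infer_vector(["a", "new", "unseen", "paragraph"])       # gradient steps on a fresh vector
```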

32 of 55

Document vector through corruption (doc2vecC)

Represent documents as the average of the embeddings of randomly sampled words

Corrupt = randomly drop many of the words during training

  • Training speed-up
  • Regularization
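
A rough sketch of the representation and the corruption step (not the full doc2vecC training objective); word_vecs and keep_prob are illustrative assumptions:

```python
import numpy as np

def corrupted_doc_vector(tokens, word_vecs, keep_prob=0.1, rng=np.random.default_rng()):
    """Average the embeddings of a random subsample of the document's words."""
    kept = [w for w in tokens if w in word_vecs and rng.random() < keep_prob]
    if not kept:                                    # fall back to all in-vocabulary words
        kept = [w for w in tokens if w in word_vecs]
    dim = len(next(iter(word_vecs.values())))
    return np.mean([word_vecs[w] for w in kept], axis=0) if kept else np.zeros(dim)
```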

33 of 55

Skip-thought vectors

  • Predict previous/next sentence from current one
  • Learn vector representations w/ RNN encoder-decoder
  • Learn word embeddings for a small vocabulary (~20,000 words)
    • Then learn a mapping from word2vec space into this vector space, to cover unseen words
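
A highly simplified PyTorch sketch of the skip-thought objective (a single GRU encoder and two GRU decoders trained with teacher forcing; the real model uses conditional GRUs, a much larger setup, and vocabulary expansion):

```python
import torch
import torch.nn as nn

class SkipThoughtSketch(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hid_dim=200):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)               # shared word embedding layer
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.dec_prev = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.dec_next = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def encode(self, sent):                                        # sentence vector = final hidden state
        _, h = self.encoder(self.emb(sent))
        return h

    def forward(self, sent, prev_sent, next_sent):
        h = self.encode(sent)
        loss = 0.0
        for dec, target in ((self.dec_prev, prev_sent), (self.dec_next, next_sent)):
            dec_out, _ = dec(self.emb(target[:, :-1]), h)          # decoder conditioned on the encoding
            logits = self.out(dec_out)
            loss = loss + nn.functional.cross_entropy(
                logits.reshape(-1, logits.size(-1)), target[:, 1:].reshape(-1))
        return loss
```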

34 of 55

[Figure: skip-thought architecture - a shared word embedding layer feeds the encoder; separate decoders generate the previous and next sentences]

35 of 55

Skip-thought vectors

Three improvements to skip-thought: [Tang et al, 2017]

  • Only learn to decode the next sentence
  • Add avg+max connections between encoder and decoder
    • Allows non-linear non-parametric feature engineering
  • Good word embedding initialization

([Gan et al, 2016] use a hierarchical CNN-LSTM-based encoder instead)

36 of 55

Quick-thought vectors

  • Reformulate sentence context prediction as a supervised classification problem

37 of 55

[Figure: comparison of the quick-thought and skip-thought training objectives]

38 of 55

Quick-thought vectors

  • Reformulate sentence context prediction as a supervised classification problem
  • Learn two (RNN) encoders: f for input, g for candidates
  • Candidate set built from (1) valid context sentences and (2) other, non-context sentences
  • Training objective maximizes the probability of identifying the correct context sentences
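
A compact PyTorch sketch of this classification objective (two GRU encoders f and g; candidates are scored by inner product and the correct context sentence is picked with a softmax; shapes are illustrative):

```python
import torch
import torch.nn as nn

class QuickThoughtSketch(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hid_dim=200):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.f = nn.GRU(emb_dim, hid_dim, batch_first=True)   # encoder for the input sentence
        self.g = nn.GRU(emb_dim, hid_dim, batch_first=True)   # encoder for candidate sentences

    def forward(self, sent, candidates, target_idx):
        # sent: (batch, len); candidates: (batch, n_cand, len); target_idx: (batch,)
        _, h_f = self.f(self.emb(sent))                        # (1, batch, hid)
        b, n, l = candidates.shape
        _, h_g = self.g(self.emb(candidates.reshape(b * n, l)))
        h_g = h_g.squeeze(0).reshape(b, n, -1)                 # (batch, n_cand, hid)
        scores = torch.bmm(h_g, h_f.squeeze(0).unsqueeze(2)).squeeze(2)   # inner products
        return nn.functional.cross_entropy(scores, target_idx)  # identify the true context sentence
```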

39 of 55

Word Mover’s Embedding (WME)

Given a rich unsupervised word embedding space

  • Build doc distance metric w/ Word Mover’s Distance (WMD)
  • Derive from it a positive-definite kernel w/ D2KE
  • Derive doc embedding w/ random features approx of kernel
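
A rough sketch of the random-features approximation, assuming kv is a gensim KeyedVectors model (its wmdistance method computes WMD) and the random documents omega_j are, say, short documents sampled from the training corpus; gamma and R are illustrative:

```python
import numpy as np

def wme_embedding(doc_tokens, random_docs, kv, gamma=1.0):
    """phi(x)_j = exp(-gamma * WMD(x, omega_j)), scaled by 1/sqrt(R)."""
    feats = [np.exp(-gamma * kv.wmdistance(doc_tokens, omega)) for omega in random_docs]
    return np.array(feats) / np.sqrt(len(random_docs))
```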

40 of 55

41 of 55

Supervised document embedding techniques

42 of 55

Learning document embeddings from labeled data

  • Explicitly learn phrase embeddings w/ a parallel corpus of phrases for statistical machine translation [Cho et al, 2014a]
  • Maximize the cosine similarity between embeddings of paraphrase pairs [Wieting et al, 2015]
  • Train neural language models to map dictionary definitions to pre-trained word embeddings of the words defined by those definitions [Hill et al, 2015]

43 of 55

[Figure: siamese architecture - word embeddings feed into stacks of 3 bidirectional GRUs; both branches share parameter weights]

44 of 55

Task-specific supervised document embeddings

  • Train a neural network on some supervised NLP task
  • Word embeddings are fed in as input
  • Each hidden layer of the network can be seen as producing a vector embedding of the input document
  • Produces task-specific document embeddings
  • Less robust across tasks than unsupervised ones
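
A minimal PyTorch sketch of the idea: train a small classifier on some labeled task and reuse its hidden layer as a task-specific document embedding (the 100-dimensional averaged-word-vector input and the 3-class task are illustrative assumptions):

```python
import torch
import torch.nn as nn

class TaskClassifier(nn.Module):
    def __init__(self, in_dim=100, hid_dim=64, n_classes=3):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Tanh())
        self.out = nn.Linear(hid_dim, n_classes)

    def forward(self, x):                  # used during supervised training
        return self.out(self.hidden(x))

    def embed(self, x):                    # hidden activations reused as the document embedding
        return self.hidden(x)
```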

45 of 55

46 of 55

Jointly learning sentence representations

  • Learn sentence representations from multiple text classification tasks
  • Combine them with pre-trained word-level and sentence-level encoders
  • Get robust sentence representations that are useful for transfer learning

47 of 55

48 of 55

Trends and challenges

49 of 55

Trends and challenges

  • Encoder-Decoder Optimization
    • Architecture: NN/CNN/RNN
    • Hyperparams: n-grams, projection functions, weighting
    • Goal: Improve success metrics
    • Goal: Train models over larger corpora / in shorter time

50 of 55

Trends and challenges

  • Learning objective design
    • Quick-thought
    • Word Mover’s Distance
    • Innovations might be applicable to the problem of word embedding

51 of 55

Trends and challenges

  • Benchmarking
  • Open-sourcing
    • Reproducibility & application to real-world problems
  • Cross-task applicability
    • More of a problem for supervised methods
  • Lack of large labeled corpora
    • Only a problem for supervised methods

(no winners so far)

52 of 55

How to choose which technique to use

53 of 55

How to choose which technique to use

  • Averaging word vectors is a strong baseline
    • Focus on good word embeddings
    • Compare different techniques
    • Try some tricks (like in [Arora et al, 2016])
  • Performance can be a key consideration
    • Fast: Word vector avg, sent2vec and FastSent
    • Slower: doc2vec

54 of 55

How to choose which technique to use

  • Consider how well the learning objective fits your task
    • skip-thought and quick-thought model sentences as strongly related to their neighbors in a document
  • Many open-source implementations mean you can compare different solutions
  • There are no clear task-specific leaders. :(

55 of 55

That’s it!