1 of 41

Assignment 3 and Vector Embeddings

CSE 447 / 517

Feb 6th, 2025 (Week 5)

2 of 41

Logistics

  • Project Checkpoint 2 is due on Monday, 2/10
  • Assignment 3 (A3) is due on Wednesday, 2/12

3 of 41

Agenda

  • Assignment 3 Preview
  • Vector Embeddings
    • Static word embeddings
    • Contextualized word embeddings
  • Interactive Word Embeddings

4 of 41

Assignment 3 Preview

  • Geometry of Word Embeddings
  • Analogies
  • Bias in Word Embeddings
  • From Word Embeddings to Sentence Level Embeddings
  • KNN Classifier Using GloVe-Based Sentence Embeddings

5 of 41

Geometry of Word Embeddings

  • Cosine similarity
    • cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
  • Find synonym
    • Cosine metric: take the choice_embedding closest to the word embedding
    • Note when to use argmax (cosine similarity, where larger means closer) versus argmin (distance metrics, where smaller means closer); see the sketch below
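A minimal PyTorch sketch of this step. The names (cosine_similarity, find_synonym, choice_embeddings) are illustrative rather than the assignment's exact API; the point is the argmax-vs-argmin distinction.

    import torch
    import torch.nn.functional as F

    def cosine_similarity(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # cos(u, v) = (u . v) / (||u|| * ||v||)
        return torch.dot(u, v) / (torch.norm(u) * torch.norm(v))

    def find_synonym(word_embedding: torch.Tensor,
                     choice_embeddings: torch.Tensor,
                     metric: str = "cosine") -> int:
        # choice_embeddings: (num_choices, dim), one row per candidate word.
        if metric == "cosine":
            # Higher cosine similarity means closer, so take the argmax.
            sims = F.cosine_similarity(choice_embeddings,
                                       word_embedding.unsqueeze(0), dim=1)
            return int(torch.argmax(sims).item())
        # For a distance metric (e.g. Euclidean), smaller means closer: argmin.
        dists = torch.norm(choice_embeddings - word_embedding, dim=1)
        return int(torch.argmin(dists).item())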

6 of 41

Analogies

  • Find analogy word (a : aa :: b : ?)
    • The linear direction is the offset from the embedding of a to the embedding of aa
    • The analogy_vector offsets the embedding of b by that linear direction; see the sketch below
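A sketch of the offset trick, assuming the embeddings are PyTorch tensors and the candidate vocabulary is a (vocab_size, dim) matrix (names are illustrative):

    import torch
    import torch.nn.functional as F

    def find_analogy_word(a: torch.Tensor, aa: torch.Tensor, b: torch.Tensor,
                          vocab_embeddings: torch.Tensor) -> int:
        # Linear direction of the relation: from a to aa.
        direction = aa - a
        # Offset b's embedding by that direction to get the analogy vector.
        analogy_vector = b + direction
        # Return the candidate closest to the analogy vector (cosine, so argmax).
        sims = F.cosine_similarity(vocab_embeddings,
                                   analogy_vector.unsqueeze(0), dim=1)
        return int(torch.argmax(sims).item())

In practice you would usually also exclude a, aa, and b themselves from the candidate set.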

7 of 41

Bias in Word Embeddings

  • Word association with an attribute
    • Similarity of a word with attribute set A: mean(cos_similarity(word_embedding, a_embedding)) over all a_embeddings
    • Association is the difference between the A similarity and the B similarity; see the sketch below
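A sketch of the association computation (tensor shapes and function names are assumptions of this sketch):

    import torch
    import torch.nn.functional as F

    def attribute_similarity(word_embedding: torch.Tensor,
                             attribute_embeddings: torch.Tensor) -> torch.Tensor:
        # Mean cosine similarity between the word and every attribute word.
        sims = F.cosine_similarity(attribute_embeddings,
                                   word_embedding.unsqueeze(0), dim=1)
        return sims.mean()

    def association(word_embedding: torch.Tensor,
                    a_embeddings: torch.Tensor,
                    b_embeddings: torch.Tensor) -> torch.Tensor:
        # Difference between the word's mean similarity to set A and to set B.
        return (attribute_similarity(word_embedding, a_embeddings)
                - attribute_similarity(word_embedding, b_embeddings))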

8 of 41

Bias in Word Embeddings

  • Word Embedding Association Test (WEAT)
    • Target word sets
      • X: flower names
      • Y: insect names
    • Attribute word sets
      • A: pleasant terms
      • B: unpleasant terms
    • Differential association: (association of X – association of Y).item(); see the sketch below
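A sketch of the differential (X vs. Y) association; the helper mirrors the association function above, and the exact normalization of the real WEAT statistic may differ from this simplified version:

    import torch
    import torch.nn.functional as F

    def _association(w, a_emb, b_emb):
        # Mean cosine similarity with attribute set A minus attribute set B.
        sim_a = F.cosine_similarity(a_emb, w.unsqueeze(0), dim=1).mean()
        sim_b = F.cosine_similarity(b_emb, w.unsqueeze(0), dim=1).mean()
        return sim_a - sim_b

    def differential_association(x_emb, y_emb, a_emb, b_emb) -> float:
        # X, Y: target sets (e.g. flower vs. insect names);
        # A, B: attribute sets (pleasant vs. unpleasant terms).
        assoc_x = torch.stack([_association(x, a_emb, b_emb) for x in x_emb]).mean()
        assoc_y = torch.stack([_association(y, a_emb, b_emb) for y in y_emb]).mean()
        # .item() converts the 0-dim tensor to a plain Python float.
        return (assoc_x - assoc_y).item()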

9 of 41

From Word Embeddings to Sentence Level Embeddings

  • Get sentence embedding
    • Tokens: word_tokenize(sentence), keeping only tokens that are in the embeddings
    • Use_POS
      • pos_tags = nltk.pos_tag(tokens); each entry is a (token, tag) pair, so use the tag at index 1
      • The sentence embedding is the weighted sum of the token embeddings
      • Note the shape of pos_weights: use pos_weights.view(-1, 1) so it broadcasts correctly; see the sketch below
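A sketch of a POS-weighted sentence embedding; the embeddings dict, the pos_weight_map, and the unweighted-mean fallback are assumptions of this sketch, not necessarily A3's exact specification:

    import torch
    import nltk
    from nltk.tokenize import word_tokenize

    # Requires: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

    def get_sentence_embedding(sentence, embeddings, pos_weight_map=None):
        # Keep only tokens that actually have an embedding.
        tokens = [t for t in word_tokenize(sentence) if t in embeddings]
        token_embeddings = torch.stack([embeddings[t] for t in tokens])   # (n, dim)

        if pos_weight_map is None:
            return token_embeddings.mean(dim=0)

        # Each pos_tag entry is a (token, tag) pair, so the tag is at index 1.
        pos_tags = nltk.pos_tag(tokens)
        pos_weights = torch.tensor([pos_weight_map.get(tag, 1.0)
                                    for _, tag in pos_tags])
        # Reshape to (n, 1) so the weights broadcast across the embedding dim.
        weighted = token_embeddings * pos_weights.view(-1, 1)
        # Weighted sum of the token embeddings.
        return weighted.sum(dim=0)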

10 of 41

KNN Classifier Using GloVe-Based Sentence Embeddings

  • Fit
    • Embeddings of train X: stack the sentence embedding of every training example
  • Predict
    • Similarity is computed between the embeddings of test X and train X
    • Indices of the k nearest neighbors: topk(similarity, k); remember to use dim=1
    • Labels of the k nearest neighbors: y[indices]
    • Most common label: use torch.mode; see the sketch below
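A minimal k-NN sketch over precomputed sentence embeddings (class and variable names are illustrative; the assignment may use cosine similarity rather than the raw dot product shown here):

    import torch

    class KNNClassifier:
        def __init__(self, k: int = 5):
            self.k = k

        def fit(self, train_embeddings: torch.Tensor, train_y: torch.Tensor):
            # train_embeddings: (n_train, dim), e.g. stacked sentence embeddings.
            self.train_X = train_embeddings
            self.train_y = train_y

        def predict(self, test_embeddings: torch.Tensor) -> torch.Tensor:
            # Similarity between every test sentence and every training sentence.
            sims = test_embeddings @ self.train_X.T        # (n_test, n_train)
            # Indices of the k most similar training examples, per row (dim=1).
            _, knn_idx = torch.topk(sims, self.k, dim=1)   # (n_test, k)
            # Labels of those neighbors, then the most common label per row.
            knn_labels = self.train_y[knn_idx]             # (n_test, k)
            preds, _ = torch.mode(knn_labels, dim=1)
            return preds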

11 of 41

Word Embeddings: A Quick Review

  • Motivation:
    • Represent words in a computationally efficient and semantically meaningful way
  • Evaluation:
    • Intrinsic: word similarities, TOEFL-like synonyms, analogies, etc.
    • Extrinsic: do the embeddings improve system performance?
  • Using embeddings in your model (see the sketch below):
    • Freeze embeddings and use as-is in your model
    • Fine-tune embeddings, updating them as you train
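A quick illustration of the two options in PyTorch, assuming you already have a (vocab_size, dim) tensor of pretrained vectors:

    import torch
    import torch.nn as nn

    pretrained = torch.randn(10000, 300)   # stand-in for real GloVe/word2vec vectors

    # Option 1: freeze the embeddings and use them as-is (no gradient updates).
    frozen_emb = nn.Embedding.from_pretrained(pretrained, freeze=True)

    # Option 2: fine-tune them, letting gradients update the vectors during training.
    tuned_emb = nn.Embedding.from_pretrained(pretrained, freeze=False)

    token_ids = torch.tensor([[1, 5, 42]])
    vectors = tuned_emb(token_ids)          # (1, 3, 300)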

12 of 41

Man-woman relations in embeddings

13 of 41

Comparative-superlative relations in embeddings

14 of 41

Distributional Hypothesis, again

  • A word’s meaning is given by words that appear frequently close by
  • When a word w appears in text, its context is the set of words that appear nearby (in some window).
  • Dense Vectors From 10,000 feet:
    • Find a bunch of times that w occurs in text.
    • Use the many contexts of w to build a vector.

These context words define "banking".

15 of 41

Dense Word Vectors

  • Let’s assign each word a dense word vector
  • But each word’s vector should be similar to vectors of words that appear in similar contexts.
  • Example:

    U.S.       = [ 0.281,  0.129,  0.312, -1.29, -0.21]
    Washington = [ 0.271,  0.110,  0.311, -1.33, -0.11]
    grass      = [-0.121,  0.930,  0.121,  1.53, -0.51]

If words appear in similar contexts, they have similar vectors!

16 of 41

“U.S.” and “Washington” occur in similar contexts!

17 of 41

"Static" Word Embeddings

Each word maps to a single vector, based on its co-occurrence with other words in a large corpus.

Connects to LSA/LSI; parallels to LMs

Examples of popular pretrained word embeddings:

  • word2vec: Trained on Google News
  • GloVe: Trained on Wikipedia, Gigaword, Common Crawl, or Twitter
  • FastText: Trained on Wikipedia or Common Crawl

18 of 41

Word2Vec: Overview

  • Word2Vec is a framework for learning word vectors. Basic idea:
    • We have a large corpus of text.
    • Every word in a fixed vocabulary is assigned a vector.
    • Go through each position t in the text, which has a center word c and outside (context) words o.
    • Use the similarity of the word vectors for c and o to calculate the probability of o given c.
    • Training: continuously adjust the word vectors to maximize this probability.

19 of 41

Word2Vec: Overview

  • Example for computing P(w_{t+j} | w_t)

20 of 41

Word2Vec: Overview

  • Example for computing P(w_{t+j} | w_t)

21 of 41

Word2Vec: Loss Function

  • For each position t = 1 … T, predict the context words within a fixed window of size m, given the center word w_t
  • Likelihood (θ = the parameters of the model, i.e. the things we want to optimize):

    L(\theta) = \prod_{t=1}^{T} \prod_{-m \le j \le m,\ j \ne 0} P(w_{t+j} \mid w_t; \theta)

    • The outer product runs over each position in the text
    • The inner product runs over each word within the window
    • P(w_{t+j} \mid w_t; \theta) is the probability of a word in the window given the center word

22 of 41

Word2Vec: Loss Function

  • Loss function J: the averaged negative log-likelihood (written out below)
    • Work in log space!
    • The negative sign turns the maximization problem into a minimization problem
  • If we minimize the loss function J, then we maximize the predictive accuracy!
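Written out, the averaged negative log-likelihood for a corpus of length T and window size m (the standard skip-gram objective the bullets describe) is:

    J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\ j \ne 0} \log P(w_{t+j} \mid w_t; \theta)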

23 of 41

Word2Vec: Loss Function

  • Question: how do we calculate P(w_{t+j} | w_t)?
  • Answer: use two vectors per word w.
    • Use the vector v_w when w is the center word.
    • Use the vector u_w when w is the context word.
  • Thus, for a center word c and a context word o:

    P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}

  • Look familiar? It is a softmax over dot-product scores (see the sketch below).
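A tiny PyTorch sketch of this prediction function, with U and V as the (vocab_size, dim) matrices of context and center vectors (the names and layout are illustrative):

    import torch

    def p_context_given_center(o: int, c: int,
                               U: torch.Tensor,   # context vectors u_w, (vocab, dim)
                               V: torch.Tensor    # center vectors v_w, (vocab, dim)
                               ) -> torch.Tensor:
        # Scores are dot products between the center vector v_c and every u_w.
        scores = U @ V[c]                          # (vocab,)
        # Softmax over the whole vocabulary, then pick out word o's probability.
        return torch.softmax(scores, dim=0)[o]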

24 of 41

Word2Vec: Now with Vectors!

  • Example for computing P(w_{t+j} | w_t)

25 of 41

Word2Vec: Now with Vectors!

  • Example for computing P(w_{t+j} | w_t)

26 of 41

Word2Vec: Why this prediction function?

  • Softmax shows up again.
  • We can train this with gradient descent.
  • This model puts words that frequently co-occur nearby in vector space (to maximize the dot product).

27 of 41

Clusters of dense word vectors

28 of 41

Why separate center and context vectors?

  • Why use two vectors (one for when the word is a context word, one for when it is the center word)?
    • It makes optimization/training easier in practice.
    • The final vector for a word is traditionally the average of its context and center vectors.

29 of 41

Why separate center and context vectors?

  • Another angle:

30 of 41

Two Variants of Word2Vec

  1. SkipGram (what we’ve seen so far): predict the context (outside) words given the center word.
  2. CBOW: predict the center word from the sum of the surrounding words’ vectors. (See the sketch below.)
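A minimal SkipGram training sketch in PyTorch (this is the full-softmax version; real word2vec uses negative sampling or hierarchical softmax for efficiency, and the sizes and IDs here are placeholders):

    import torch
    import torch.nn as nn

    class SkipGram(nn.Module):
        # Predict each context (outside) word from the center word.
        def __init__(self, vocab_size: int, dim: int):
            super().__init__()
            self.center = nn.Embedding(vocab_size, dim)    # v_w
            self.context = nn.Embedding(vocab_size, dim)   # u_w

        def forward(self, center_ids: torch.Tensor) -> torch.Tensor:
            v_c = self.center(center_ids)                  # (batch, dim)
            # Scores over the whole vocabulary for each center word.
            return v_c @ self.context.weight.T             # (batch, vocab)

    model = SkipGram(vocab_size=10000, dim=100)
    loss_fn = nn.CrossEntropyLoss()
    center_ids = torch.tensor([3, 3, 7])       # toy (center, context) training pairs
    context_ids = torch.tensor([5, 9, 2])
    loss = loss_fn(model(center_ids), context_ids)
    loss.backward()

CBOW reverses this: sum (or average) the context words' vectors and score the center word.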

31 of 41

CBOW in practice

32 of 41

Skipgram is like the reverse of CBOW?

33 of 41

Okay, okay just kidding, here's the real SkipGram diagram:

34 of 41

Contextualized Word Embeddings

Premise: define a vector for each token based on its context in the data

  • How do we get context? RNN-based neural LMs
    • The hidden state h_i at timestep i represents the left context of token x_i
    • Compute an analogous right context by training a right-to-left LM
    • Simplest approach: concatenate the two contexts to get an embedding (see the sketch below)
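A minimal sketch of the concatenation idea, using a single bidirectional LSTM layer over static word vectors (ELMo itself trains separate multi-layer forward and backward LMs; this only illustrates the "concatenate left and right context" step, and the sizes are placeholders):

    import torch
    import torch.nn as nn

    dim, hidden = 100, 64
    bilstm = nn.LSTM(input_size=dim, hidden_size=hidden,
                     batch_first=True, bidirectional=True)

    sentence = torch.randn(1, 7, dim)     # (batch=1, 7 tokens, dim) static embeddings
    outputs, _ = bilstm(sentence)         # (1, 7, 2 * hidden)
    # outputs[:, i, :hidden] is the left-to-right state for token i,
    # outputs[:, i, hidden:] is the right-to-left state; concatenated, they give
    # one contextualized embedding per token.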

35 of 41

Contextualized Word Embeddings

ELMo (Peters et al., 2018)

  • Used a multi-layer, bidirectional LSTM
  • Using ELMo instead of static vectors: instant SOTA on many benchmark tasks

36 of 41

ELMo, visually

37 of 41

BERT

BERT (Devlin et al., 2019) :

  • Instead of RNNs, it uses Transformers.
  • Learning objectives:
    • Masked Language Model (MLM): randomly mask out words for the model to predict (see the sketch below).
    • Next Sentence Prediction (NSP): given a pair of sentences, does the second sentence follow the first? Helpful for understanding the relationship between sentences (for QA, NLI, etc.).
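A rough sketch of MLM-style masking. BERT's full recipe also leaves some selected tokens unchanged or swaps in random tokens; the mask_id argument is an assumption of this sketch, and only the basic masking idea is shown:

    import torch

    def mask_tokens(token_ids: torch.Tensor, mask_id: int, mask_prob: float = 0.15):
        labels = token_ids.clone()
        # Choose which positions to mask.
        mask = torch.rand(token_ids.shape) < mask_prob
        # Only masked positions contribute to the MLM loss; -100 is the default
        # ignore_index of nn.CrossEntropyLoss.
        labels[~mask] = -100
        inputs = token_ids.clone()
        inputs[mask] = mask_id
        return inputs, labels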

38 of 41

BERT

Pretrain + finetune like we discussed!

BERT’s Performance on GLUE tasks (Devlin et al., 2019)

39 of 41

BERTology

  • Many, many ideas build on BERT:

  • Multilingual BERT (Devlin et al., 2019):
    • pretrained on 104 languages.
  • RoBERTa (Liu et al., 2019):
    • removed the NSP objective;
    • trained with larger mini-batches;
    • larger learning rates;
    • more data;
    • longer pretraining time.
  • Overview: Rogers et al. (2020)
  • T5 (Raffel et al., 2019): a text-to-text model that systematically explored many pretraining and transfer choices

40 of 41

Interactive Word Embeddings

41 of 41

Questions?

  • Thank you!