1 of 238

Transfer Learning in

Natural Language Processing

June 2, 2019

NAACL-HLT 2019

1

Sebastian Ruder

Matthew Peters

Swabha

Swayamdipta

Thomas Wolf

2 of 238

Transfer Learning in Natural Language Processing

Transfer Learning in NLP

2

Follow along with the tutorial:

Questions:

  • Twitter: #NAACLTransfer during the tutorial
  • Whova: “Questions for the tutorial on Transfer Learning in NLP” topic
  • Ask us during the break or after the tutorial

3 of 238

What is transfer learning?

3

4 of 238

Why transfer learning in NLP?

  • Many NLP tasks share common knowledge about language (e.g. linguistic representations, structural similarities)
  • Tasks can inform each other—e.g. syntax and semantics
  • Annotated data is rare, so make use of as much supervision as is available.

  • Empirically, transfer learning has resulted in SOTA for many supervised NLP tasks (e.g. classification, information extraction, Q&A, etc).

4

5 of 238

Why transfer learning in NLP? (Empirically)

5

Performance on Named Entity Recognition (NER) on CoNLL-2003 (English) over time

6 of 238

Types of transfer learning in NLP

6

We will focus on this

7 of 238

What this tutorial is about and what it’s not about

  • Goal: provide a broad overview of transfer methods in NLP, focusing on the most empirically successful methods as of today (mid-2019)
  • Provide practical, hands-on advice → by the end of the tutorial, everyone will be able to apply recent advances to a text classification task

  • What this is not: Comprehensive (it’s impossible to cover all related papers in one tutorial!)
  • (Bender Rule: This tutorial mostly covers work done in English; extensibility to other languages depends on the availability of data and resources.)

7

8 of 238

Agenda

8

9 of 238

  1. Introduction

9

10 of 238

Sequential transfer learning

Learn on one task / dataset, then transfer to another task / dataset

10

word2vec

GloVe

skip-thought

InferSent

ELMo

ULMFiT

GPT

BERT

classification

sequence labeling

Q&A

....

Pretraining

Adaptation

11 of 238

Pretraining tasks and datasets

  • Unlabeled data and self-supervision

    • Easy to gather very large corpora: Wikipedia, news, web crawl, social media, etc.
    • Training takes advantage of the distributional hypothesis: “You shall know a word by the company it keeps” (Firth, 1957), often formalized as training some variant of language model
    • Focus on efficient algorithms to make use of plentiful data

  • Supervised pretraining

    • Very common in vision, less so in NLP due to the lack of large supervised datasets
    • Machine translation
    • NLI for sentence representations
    • Task-specific—transfer from one Q&A dataset to another

11

12 of 238

Target tasks and datasets

Target tasks are typically supervised and span a range of common NLP tasks:

  • Sentence or document classification (e.g. sentiment)
  • Sentence pair classification (e.g. NLI, paraphrase)
  • Word level (e.g. sequence labeling, extractive Q&A)
  • Structured prediction (e.g. parsing)
  • Generation (e.g. dialogue, summarization)

12

13 of 238

Concrete example—word vectors

Word embedding methods (e.g. word2vec) learn one vector per word:

13

cat = [0.1, -0.2, 0.4, …]

dog = [0.2, -0.1, 0.7, …]

14 of 238

Concrete example—word vectors

Word embedding methods (e.g. word2vec) learn one vector per word:

14

cat = [0.1, -0.2, 0.4, …]

dog = [0.2, -0.1, 0.7, …]

PRP VBP PRP NN CC NN .

I love my cat and dog .

15 of 238

Concrete example—word vectors

Word embedding methods (e.g. word2vec) learn one vector per word:

15

cat = [0.1, -0.2, 0.4, …]

dog = [0.2, -0.1, 0.7, …]

PRP VBP PRP NN CC NN .

I love my cat and dog .

I love my cat and dog . → "positive"

16 of 238

Major Themes

16

17 of 238

Major themes: From words to words-in-context

Word vectors: cats = [0.2, -0.3, …], dogs = [0.4, -0.5, …]

Sentence / doc vectors: “It’s raining cats and dogs.” → [0.8, 0.9, …]; “We have two cats.” → [-1.2, 0.0, …]

Word-in-context vectors (a different vector for the same word in each context): “We have two cats.” → [1.2, -0.3, …]; “It’s raining cats and dogs.” → [-0.4, 0.9, …]

17

18 of 238

Major themes: LM pretraining

  • Many successful pretraining approaches are based on language modeling
  • Informally, an LM learns Pϴ(text) or Pϴ(text | some other text)
  • Doesn’t require human annotation
  • Many languages have enough text to learn a high-capacity model
  • Versatile—can learn both sentence and word representations with a variety of objective functions

18

19 of 238

Major themes: From shallow to deep

From shallow pretrained representations (1 layer) to deep ones (24 layers)

19

20 of 238

Major themes: pretraining vs target task

The choice of pretraining and target tasks is coupled. In general, similar pretraining and target tasks → best results.

  • Sentence / document representations are not useful for word-level predictions
  • Word vectors can be pooled across contexts, but are often outperformed by other methods
  • For contextual word vectors, bidirectional context is important

20

21 of 238

Agenda

21

22 of 238

2. Pretraining

22

Image credit: Creative Stall

23 of 238

Overview

  • Language model pretraining
  • Word vectors
  • Sentence and document vectors
  • Contextual word vectors
  • Interesting properties of pretraining
  • Cross-lingual pretraining

23

24 of 238

Word Type Representation

[Figure: pretraining objectives illustrated on an example sentence: predicting a word from its context window (word2vec, Mikolov et al., 2013), masked LM (“We have a MASK and three dogs”), next-word LM (“We have a ???”), and predicting a neighbouring sentence (Skip-Thought, Kiros et al., 2015)]

24

25 of 238

Word vectors

25

26 of 238

Why embed words?

  • Embeddings are themselves parameters—can be learned
  • Sharing representations across tasks
  • Lower dimensional space
    • Better for computation—difficult to handle sparse vectors.

26

27 of 238

Word Type Representation

Unsupervised pretraining : Pre-Neural

Latent Semantic Analysis (LSA)—SVD of the term-document matrix (Deerwester et al., 1990)

Latent Dirichlet Allocation (LDA)—documents are mixtures of topics and topics are mixtures of words (Blei et al., 2003)

Brown clusters—hard hierarchical clustering based on n-gram LMs (Brown et al., 1992)

27

28 of 238

Word Type Representation

n-gram neural language model (Bengio et al. 2003)

Supervised multitask word embeddings (Collobert and Weston, 2008)

Word vector pretraining

28

29 of 238

word2vec (Mikolov et al., 2013)

Efficient algorithm + large-scale training → high-quality word vectors

29

30 of 238

Sentence and document vectors

30

31 of 238

Doc2vec

Paragraph vector

Unsupervised paragraph embeddings (Le & Mikolov, 2014)

SOTA classification (IMDB, SST)

31

32 of 238

Skip-Thought Vectors

Predict previous / next sentence with seq2seq model (Kiros et al., 2015)

Hidden state of encoder transfers to sentence tasks (classification, semantic similarity)

32

33 of 238

Autoencoder pretraining

Dai & Le (2015): Pretrain a sequence autoencoder (SA) and a generative LM

SOTA classification (IMDB)

33

34 of 238

Supervised sentence embeddings

Also possible to train sentence embeddings with supervised objective

  • Paragram-phrase: uses paraphrase database for supervision, best for paraphrase and semantic similarity (Wieting et al. 2016)
  • InferSent: bi-LSTM trained on SNLI + MNLI (Conneau et al. 2017)
  • GenSen: multitask training (skip-thought, machine translation, NLI, parsing) (Subramanian et al. 2018)

34

35 of 238

Contextual word vectors

35

36 of 238

Contextual word vectors - Motivation

Word vectors compress all contexts into a single vector

Nearest neighbor GloVe vectors to “play”

36

[Figure: the nearest neighbours of “play” mix several senses and parts of speech (VERB / NOUN / ADJ): playing, played, plays, Play, game, games, players, football, multiplayer]

37 of 238

Contextual word vectors - Key Idea

Instead of learning one vector per word, learn a vector that depends on context

Many approaches based on language models

37

f(play | The kids play a game in the park.)  !=  f(play | The Broadway play premiered yesterday.)

38 of 238

context2vec

Use a bidirectional LSTM and a cloze prediction objective (a 1-layer masked LM)

Learn representations for both words and contexts (minus the word)

Applications: sentence completion, lexical substitution, WSD

38

39 of 238

TagLM

Pretrain two LMs (forward and backward) and add them to a sequence tagger.

SOTA NER and chunking results

39

40 of 238

Unsupervised Pretraining for Seq2Seq

Pretrain encoder and decoder with LMs (everything shaded is pretrained).

Large boost for MT.

40

41 of 238

CoVe

Pretrain a bidirectional encoder with MT supervision, extract LSTM states.

Adding CoVe together with GloVe gives improvements for classification, NLI, Q&A.

41

42 of 238

ELMo

Pretrain a deep bidirectional LM, extract contextual word vectors as a learned linear combination of hidden states.

SOTA for 6 diverse tasks.

42

43 of 238

ULMFiT

Pretrain an AWD-LSTM LM, fine-tune the LM in two stages with different adaptation techniques.

SOTA for six classification datasets.

43

44 of 238

GPT

Pretrain a large 12-layer left-to-right Transformer, fine-tune for sentence, sentence-pair and multiple-choice questions.

SOTA results for 9 tasks.

44

45 of 238

BERT

45

BERT pretrains both sentence and contextual word representations, using masked LM and next sentence prediction.

BERT-large has 340M parameters, 24 layers!

46 of 238

BERT

46

SOTA GLUE benchmark results (sentence pair classification).

47 of 238

BERT

47

SOTA SQuAD v1.1 (and v2.0) Q&A

48 of 238

Other pretraining objectives

48

  • Contextual string representations (Akbik et al., COLING 2018)—SOTA NER results
  • Cross-view training (Clark et al. EMNLP 2018)—improve supervised tasks with unlabeled data
  • Cloze-driven pretraining (Baevski et al., 2019)—SOTA NER and constituency parsing

49 of 238

Why does language modeling work so well?

49

  • Language modeling is a very difficult task, even for humans.
  • Language models are expected to compress any possible context into a vector that generalizes over possible completions.
    • “They walked down the street to ???”
  • To have any chance at solving this task, a model is forced to learn syntax, semantics, encode facts about the world, etc.
  • Given enough data, a huge model, and enough compute, can do a reasonable job!
  • Empirically works better than translation, autoencoding: “Language Modeling Teaches You More Syntax than Translation Does” (Zhang et al. 2018)

50 of 238

Sample efficiency

50

51 of 238

Pretraining reduces need for annotated data

51

52 of 238

Pretraining reduces need for annotated data

52

53 of 238

Pretraining reduces need for annotated data

53

54 of 238

Scaling up pretraining

54

55 of 238

More data → better word vectors

(Pennington et al 2014)

Scaling up pretraining

55

56 of 238

Pretrained Language Models: More Data

Scaling up pretraining

56

57 of 238

Bigger model → better results

(Devlin et al 2019)

Scaling up pretraining

57

58 of 238

Cross-lingual pretraining

58

59 of 238

Cross-lingual pretraining

59

60 of 238

Cross-lingual Polyglot Pretraining

Key idea: Share vocabulary and representations across languages by training one model on many languages.

Advantages: Easy to implement, enables cross-lingual pretraining by itself

Disadvantages: Leads to under-representation of low-resource languages

60

61 of 238

Hands-on #1:

Pretraining a Transformer Language Model

61

Image credit: Chanaky

62 of 238

Hands-on: Overview

  • Goals:
    • Let’s make these recent works “uncool again” i.e. as accessible as possible
    • Expose all the details in a simple, concise and self-contained code-base
    • Show that transfer learning can be simple (less hand-engineering) & fast (pretrained model)
  • Plan
    • Build a GPT-2 / BERT model
    • Pretrain it on a rather large corpus with ~100M words
    • Adapt it for a target task to get SOTA performances
  • Material:

62

Current developments in Transfer Learning combine new approaches for training schemes (sequential training) as well as models (transformers) ⇨ can look intimidating and complex

63 of 238

Hands-on pre-training

63

64 of 238

Hands-on pre-training

64

Our core model will be a Transformer. Large-scale transformer architectures (GPT-2, BERT, XLM…) are very similar to each other and consist of:

  • summing word and position embeddings
  • applying a succession of transformer blocks with:
    • layer normalisation
    • a self-attention module
    • dropout and a residual connection
    • another layer normalisation
    • a feed-forward module with one hidden layer and a non-linearity: Linear ⇨ ReLU/gelu ⇨ Linear
    • dropout and a residual connection

Main differences between GPT/GPT-2/BERT are the objective functions:

  • causal language modeling for GPT
  • masked language modeling for BERT (+ next sentence prediction)

We’ll play with both.
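To make this concrete, here is a minimal sketch of such a backbone in PyTorch (an illustration, not the exact tutorial code; the class names and hyper-parameter values are our own):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One block: LayerNorm -> self-attention -> dropout + residual,
    then LayerNorm -> feed-forward (Linear -> ReLU -> Linear) -> dropout + residual."""
    def __init__(self, d_model=410, n_heads=10, d_ff=2100, dropout=0.1):
        super().__init__()
        self.ln_1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout)
        self.ln_2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None, padding_mask=None):
        # x: (seq_len, batch, d_model)
        h = self.ln_1(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, key_padding_mask=padding_mask)
        x = x + self.dropout(a)
        h = self.ln_2(x)
        return x + self.dropout(self.ff(h))

class Transformer(nn.Module):
    """Word + position embeddings followed by a stack of transformer blocks."""
    def __init__(self, vocab_size, max_len=256, d_model=410, n_layers=16, **block_kwargs):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList(
            TransformerBlock(d_model, **block_kwargs) for _ in range(n_layers))

    def forward(self, ids, attn_mask=None, padding_mask=None):
        # ids: (seq_len, batch) of token indices
        positions = torch.arange(ids.size(0), device=ids.device).unsqueeze(1)
        x = self.tok_emb(ids) + self.pos_emb(positions)      # broadcast over the batch
        for block in self.blocks:
            x = block(x, attn_mask, padding_mask)
        return x
```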

65 of 238

Let’s code the backbone of our model!

PyTorch 1.1 now has an nn.MultiheadAttention module: it lets us encapsulate the self-attention logic while still controlling the internals of the Transformer.

Hands-on pre-training

65

66 of 238

Two attention masks?

  • padding_mask masks the padding tokens. It is specific to each sample in the batch:
  • attn_mask is the same for all samples in the batch. It masks future positions so that causal transformers can only attend to previous tokens:
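One possible way to build these two masks for nn.MultiheadAttention (a sketch; the pad_token_id argument and tensor layout are assumptions on our side):

```python
import torch

def build_masks(ids, pad_token_id, causal=True):
    """ids: (batch, seq_len) token indices."""
    # padding_mask: per-sample, True where the position is a padding token
    padding_mask = ids.eq(pad_token_id)                              # (batch, seq_len)
    attn_mask = None
    if causal:
        # attn_mask: shared by all samples, -inf above the diagonal so that a position
        # cannot attend to future tokens
        seq_len = ids.size(1)
        attn_mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
    return padding_mask, attn_mask
```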

Hands-on pre-training

66

67 of 238

Hands-on pre-training

67

To pretrain our model, we need to add a few elements: a head, a loss, and weight initialization. We add these elements with a pretraining model encapsulating our core model:

1. A pretraining head on top of our core model: we choose a language modeling head with tied weights

2. Initialize the weights

3. Define a loss function: we choose a cross-entropy loss on current (or next) token predictions
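A sketch of such a pretraining wrapper around the backbone defined above (illustrative only; the 0.02 initialization scale and the -1 ignore index are our choices):

```python
import torch.nn as nn
import torch.nn.functional as F

class TransformerWithLMHead(nn.Module):
    """Core model + language-modeling head with weights tied to the token embeddings."""
    def __init__(self, transformer):
        super().__init__()
        self.transformer = transformer
        vocab_size, d_model = transformer.tok_emb.weight.shape
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)   # 1. pretraining head
        self.apply(self.init_weights)                               # 2. initialize the weights
        self.lm_head.weight = transformer.tok_emb.weight            # tie head and embeddings

    @staticmethod
    def init_weights(module):
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=0.02)

    def forward(self, ids, labels=None, **mask_kwargs):
        hidden = self.transformer(ids, **mask_kwargs)      # (seq_len, batch, d_model)
        logits = self.lm_head(hidden)
        if labels is None:
            return logits
        # 3. cross-entropy on current (masked LM) or next (causal LM) token predictions
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1),
                               ignore_index=-1)
        return logits, loss
```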

68 of 238

Hands-on pre-training

68

Now let’s take care of our data and configuration:

  • Hyper-parameters taken from Dai et al., 2018 (Transformer-XL) ⇨ a ~50M-parameter causal model.
  • We'll use a pre-defined open-vocabulary tokenizer: BERT’s cased tokenizer.
  • Use a large dataset for pre-training: WikiText-103 with 103M tokens (Merity et al., 2017).
  • Instantiate our model and optimizer (Adam).
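Putting these pieces together might look roughly like this, building on the sketches above (the learning rate and model sizes are illustrative values, not the exact tutorial configuration):

```python
import torch
from pytorch_pretrained_bert import BertTokenizer

# Open-vocabulary tokenization with BERT's cased WordPiece tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# A ~50M-parameter causal model in the spirit of the Transformer-XL hyper-parameters
model = TransformerWithLMHead(
    Transformer(vocab_size=len(tokenizer.vocab), max_len=256,
                d_model=410, n_layers=16, n_heads=10, d_ff=2100, dropout=0.1))
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)
```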

69 of 238

Hands-on pre-training

69

A simple update loop.

We use gradient accumulation to get a large batch size (>64) even on 1 GPU.

Learning rate schedule:

– linear warm-up to start
– then cosine or inverse square root decrease

And we’re done: let’s train!

[Plot: training curves with and without learning-rate warm-up]
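A minimal sketch of such an update loop with gradient accumulation and a warm-up/cosine schedule (the dataloader, step counts and values are placeholders):

```python
import math
from torch.nn.utils import clip_grad_norm_

accumulation_steps, warmup_steps, total_steps, peak_lr = 4, 1000, 100_000, 2.5e-4

for step, (inputs, labels) in enumerate(train_dataloader):        # hypothetical dataloader
    _, loss = model(inputs, labels=labels)
    (loss / accumulation_steps).backward()                        # gradient accumulation
    if (step + 1) % accumulation_steps == 0:
        clip_grad_norm_(model.parameters(), max_norm=0.25)
        # linear warm-up, then cosine decrease
        if step < warmup_steps:
            lr = peak_lr * step / warmup_steps
        else:
            progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
            lr = 0.5 * peak_lr * (1 + math.cos(math.pi * min(1.0, progress)))
        for group in optimizer.param_groups:
            group['lr'] = lr
        optimizer.step()
        optimizer.zero_grad()
```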

70 of 238

Hands-on pre-training — Concluding remarks

70

  • On pretraining
    • Intensive: in our case 5h–20h on 8 V100 GPUs (few days w. 1 V100) to reach a good perplexity ⇨ share your pretrained models
    • Robust to the choice of hyper-parameters (apart from needing a warm-up for transformers)
    • Language modeling is a hard task: your model should not have enough capacity to overfit if your dataset is large enough ⇨ you can just start the training and let it run.
    • Masked language modeling is typically 2–4 times slower to train than causal LM: we only mask 15% of the tokens ⇨ smaller training signal.

  • For the rest of this tutorial: we don’t have enough time to do a full pretraining ⇨ we pretrained two models for you before the tutorial

71 of 238

Hands-on pre-training — Concluding remarks

71

  • First model:
    • exactly the one we built together ⇨ a 50M parameters causal Transformer
    • Trained 15h on 8 V100
    • Reached a word-level perplexity of 29 on wikitext-103 validation set (quite competitive)
  • Second model:
    • Same model but trained with a masked-language modeling objective (see the repo)
    • Trained 30h on 8 V100
    • Reached a “masked-word” perplexity of 8.3 on wikitext-103 validation set

Wikitext-103 Validation/Test PPL

72 of 238

Agenda

72

73 of 238

3. What is in a Representation?

73

Image credit: Caique Lima

74 of 238

Why care about what is in a representation?

  • Extrinsic evaluation with downstream tasks
    • Complex, diverse with task-specific quirks
  • Interpretability!
    • Are we getting our results because of the right reasons?
    • Uncovering biases...
  • Language-aware representations
    • To generalize to other tasks, new inputs
    • As intermediates for possible improvements to pretraining

74

75 of 238

What to analyze?

  • Variations
    • Architecture (RNN / Transformer)
    • Layers
    • Pretraining Objectives

75

  • Embeddings
    • Word
    • Contextualized
  • Network Activations

76 of 238

Analysis Method 1: Visualization

76

Hold the embeddings / network activations static or frozen

77 of 238

Visualizing Embedding Geometries

  • Plotting embeddings in a lower dimensional (2D/3D) space
    • t-SNE (van der Maaten & Hinton, 2008)
    • PCA projections

  • Visualizing word analogies (Mikolov et al., 2013)
    • Spatial relations
    • w_king - w_man + w_woman ≈ w_queen

  • High-level view of lexical semantics
    • Only a limited number of examples
    • Connection to other tasks is unclear (Goldberg, 2017)

77

Image: Tensorflow

78 of 238

Visualizing Neuron Activations

  • Neuron activation values correlate with features / labels
  • Indicates learning of recognizable features
    • How to select which neuron? Hard to scale!
    • Interpretable != Important (Morcos et al., 2018)

78

79 of 238

Visualizing Layer-Importance Weights

Layer-wise analysis (static)

  • How important is each layer for a given performance on a downstream task?
    • Weighted average of layers

79

  • Task and architecture specific!

80 of 238

Visualizing Attention Weights

Visualization: Attention Weights

  • Popular in machine translation, or other seq2seq architectures:
    • Alignment between words of source and target.
    • Long-distance word-word dependencies (intra-sentence attention)

80

  • Sheds light on architectures
    • Having sophisticated attention mechanisms can be a good thing!
    • Layer-specific
  • Interpretation can be tricky
    • Few examples only - cherry picking?
    • Robust corpus-wide trends? Next!

81 of 238

Analysis Method 2: Behavioral Probes

81

  • RNN-based language models
    • number agreement in subject-verb dependencies
    • natural and nonce or ungrammatical sentences
    • evaluate on output perplexity
  • RNNs outperform other non-neural baselines.

  • Performance improves when trained explicitly with syntax (Kuncoro et al. 2018)

82 of 238

Analysis Method 2: Behavioral Probes

82

  • RNN-based language models
    • number agreement in subject-verb dependencies
    • For natural and nonce/ungrammatical sentences
    • LM perplexity differences
  • RNNs outperform other non-neural baselines.

  • Performance improves when trained explicitly with syntax (Kuncoro et al. 2018)

  • Probe: Might be vulnerable to co-occurrence biases
    • “dogs in the neighborhood bark(s)”
    • Nonce sentences might be too different from original...

83 of 238

Analysis Method 3: Classifier Probes

83

Hold the embeddings / network activations static and

train a simple supervised model on top

Probe classification task (Linear / MLP)

84 of 238

Probing Surface-level Features

  • Given a sentence, predict properties such as
    • Length
    • Is a word in the sentence?

84

  • Given a word in a sentence predict properties such as:
    • Previously seen words, contrast with language model
    • Position of word in the sentence
  • Checks ability to memorize
    • Well-trained, richer architectures tend to fare better
    • Training on linguistic data memorizes better

85 of 238

Probing Morphology, Syntax, Semantics

  • Morphology

  • Word-level syntax
    • POS tags, CCG supertags
    • Constituent parent, grandparent…

  • Partial syntax
    • Dependency relations

  • Sentence-level syntax
    • Tree depth, top constituents
    • Tense of main clause verb
    • Subject-verb agreement, long-distance number agreement
    • # objects

  • Partial semantics
    • Entity relations
    • Coreference
    • Roles

85

86 of 238

Probing classifier findings

86

87 of 238

Probing classifier findings

87

  • Contextualized > non-contextualized
    • Especially on syntactic tasks
    • Closer performance on semantic tasks
    • Bidirectional context is important

  • BERT (large) almost always gets the highest performance
    • Grain of salt: Different contextualized representations were trained on different data, using different architectures...

88 of 238

Probing: Layers of the network

Layer-wise analysis (dynamic)

  • RNN layers: General linguistic properties
    • Lowest layers: morphology
    • Middle layers: syntax
    • Highest layers: Task-specific semantics
  • Transformer layers:
    • Different trends for different tasks; middle-heavy
    • Also see Tenney et al., 2019

88

89 of 238

Probing: Pretraining Objectives

  • Language modeling outperforms other unsupervised and supervised objectives.
    • Machine Translation
    • Dependency Parsing
    • Skip-thought

  • Low-resource settings (size of training data) might result in opposite trends.

89

90 of 238

What have we learnt so far?

  • Representations are predictive of certain linguistic phenomena:
    • Alignments in translation, Syntactic hierarchies

90

  • Pretraining with and without syntax:
    • Better performance with syntax
    • But even without it, models learn at least some notion of syntax (Williams et al., 2018)
  • Network architectures determine what is in a representation
    • Syntax and BERT Transformer (Tenney et al., 2019; Goldberg, 2019)
    • Different layer-wise trends across architectures

91 of 238

Open questions about probes

  • What information should a good probe look for?
    • Probing a probe!

91

  • What does probing performance tell us?
    • Hard to synthesize results across a variety of baselines...
  • Probes can introduce some complexity themselves
    • linear or non-linear classification
    • behavioral probes: design of input sentences
  • Should we be using probes as evaluation metrics?
    • might defeat the purpose...

92 of 238

Analysis Method 4: Model Alterations

  • Progressively erase or mask network components
    • Word embedding dimensions
    • Hidden units
    • Input - words / phrases

92

93 of 238

So, what is in a representation?

  • Depends on how you look at it!
    • Visualization:
      • bird’s eye view
      • few samples -- might call to mind cherry-picking
    • Probes:
      • discover corpus-wide specific properties
      • may introduce own biases...
    • Network ablations:
      • great for improving modeling,
      • could be task specific

93

  • Analysis methods as tools to aid model development!

94 of 238

Very current and ongoing!

94

First column for citations in and before 2015

95 of 238

What’s next?

  • Linguistic awareness
  • Interpretability
  • Correlation of probes to downstream tasks

Interpretability + transferability to downstream tasks is key → up next!

96 of 238

Some Pointers

  • Structural Probes: Hewitt & Manning, 2019 (9E Machine Learning)
  • Overview of probes: Belinkov & Glass, 2019 (7F Poster #18)

96

97 of 238

Break

97

Image credit: Andrejs Kirma

98 of 238

Transfer Learning in Natural Language Processing

Transfer Learning in NLP

98

Follow along with the tutorial:

Questions:

  • Twitter: #NAACLTransfer during the tutorial
  • Whova: “Questions for the tutorial on Transfer Learning in NLP” topic
  • Ask us during the break or after the tutorial

99 of 238

Agenda

99

100 of 238

4. Adaptation

100

Image credit: Ben Didier

101 of 238

4 – How to adapt the pretrained model

Several orthogonal directions we can make decisions on:

  1. Architectural modifications?
     How much to change the pretrained model architecture for adaptation

  2. Optimization schemes?
     Which weights to train during adaptation and following what schedule

  3. More signal: weak supervision, multi-tasking & ensembling
     How to get more supervision signal for the target task

102 of 238

4.1 – Architecture

Two general options:

  1. Keep pretrained model internals unchanged:
     Add classifiers on top, embeddings at the bottom, use outputs as features

  2. Modify pretrained model internal architecture:
     Initialize encoder-decoders, task-specific modifications, adapters

102

Image credit: Darmawansyah

103 of 238

4.1.A – Architecture: Keep model unchanged

General workflow:

  • Remove pretraining task head if not useful for target task
    1. Example: remove softmax classifier from pretrained LM
    2. Not always needed: some adaptation schemes re-use the pretraining objective/task, e.g. for multi-task learning

103

104 of 238

4.1.A – Architecture: Keep model unchanged

General workflow:

Task-specific, randomly initialized

General, pretrained

  • Add target task-specific layers on top/bottom of pretrained model
    • Simple: adding linear layer(s) on top of the pretrained model

104

105 of 238

4.1.A – Architecture: Keep model unchanged

General workflow:

  • Add target task-specific layers on top/bottom of pretrained model
    • Simple: adding linear layer(s) on top of the pretrained model
    • More complex: model output as input for a separate model
    • Often beneficial when target task requires interactions that are not available in pretrained embedding

105

106 of 238

4.1.B – Architecture: Modifying model internals

Various reasons:

  • Adapting to a structurally different target task
    • Ex: Pretraining with a single input sequence (ex: language modeling) but adapting to a task with several input sequences (ex: translation, conditional generation...)
    • Use the pretrained model weights to initialize as much as possible of a structurally different target task model
    • Ex: Use monolingual LMs to initialize encoder and decoder parameters for MT (Ramachandran et al., EMNLP 2017; Lample & Conneau, 2019)

106

107 of 238

4.1.B – Architecture: Modifying model internals

Various reasons:

  • Task-specific modifications
    • Provide pretrained model with capabilities that are useful for the target task
    • Ex: Adding skip/residual connections, attention (Ramachandran et al., EMNLP 2017)

107

108 of 238

4.1.B – Architecture: Modifying model internals

  • Using less parameters for adaptation:
    • Less parameters to fine-tune
    • Can be very useful given the increasing size of model parameters
    • Ex: add bottleneck modules (“adapters”) between the layers of the pretrained model (Rebuffi et al., NIPS 2017; CVPR 2018)

Various reasons:

108

109 of 238

4.1.B – Architecture: Modifying model internals

Adapters

  • Commonly connected with a residual connection in parallel to an existing layer
  • Most effective when placed at every layer (smaller effect at bottom layers)
  • Different operations (convolutions, self-attention) possible
  • Particularly suitable for modular architectures like Transformers (Houlsby et al., ICML 2019; Stickland and Murray, ICML 2019)

109

Image credit: Caique Lima

110 of 238

4.1.B – Architecture: Modifying model internals

  • Multi-head attention (MH; shared across layers) is used in parallel with self-attention (SA) layer of BERT
  • Both are added together and fed into a layer-norm (LN)

110

111 of 238

Hands-on #2:

Adapting our pretrained model

111

Image credit: Chanaky

112 of 238

Hands-on: Model adaptation

  • Plan
    • Start from our Transformer language model
    • Adapt the model to a target task:
      • keep the model core unchanged, load the pretrained weights
      • add a linear layer on top, newly initialized
      • use additional embeddings at the bottom, newly initialized
  • Reminder — material is here:

112

Let’s see how a simple fine-tuning scheme can be implemented with our pretrained model:

113 of 238

Adaptation task

  • We select a text classification task as the downstream task

  • TREC-6: The Text REtrieval Conference (TREC) Question Classification dataset (Li et al., COLING 2002)
  • TREC consists of open-domain, fact-based questions divided into broad semantic categories. It contains 5,500 labeled training questions & 500 testing questions with 6 labels: NUM, LOC, HUM, DESC, ENTY, ABBR

Hands-on: Model adaptation

113

Ex:

  • How did serfdom develop in and then leave Russia ? —> DESC
  • What films featured the character Popeye Doyle ? —> ENTY

Transfer learning models shine on this type of low-resource task

114 of 238

  • Modifications:
    • Keep model internals unchanged
    • Add a linear layer on top
    • Add an additional embedding (classification token) at the bottom
  • Computation flow:
    • Model input: the tokenized question with a classification token at the end
    • Extract the last hidden-state associated to the classification token
    • Pass the hidden-state in a linear layer and softmax to obtain class probabilities

Hands-on: Model adaptation

114

First adaptation scheme
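A minimal sketch of this first adaptation scheme, reusing the hypothetical Transformer backbone from the pretraining section (names and defaults are illustrative):

```python
import torch.nn as nn
import torch.nn.functional as F

class TransformerWithClfHead(nn.Module):
    """Pretrained backbone kept unchanged + a newly initialized linear classification head."""
    def __init__(self, transformer, num_classes=6, dropout=0.1):
        super().__init__()
        self.transformer = transformer                 # load pretrained weights into this
        d_model = transformer.tok_emb.weight.size(1)
        self.dropout = nn.Dropout(dropout)
        self.clf_head = nn.Linear(d_model, num_classes)

    def forward(self, ids, clf_token_mask, labels=None):
        hidden = self.transformer(ids)                 # (seq_len, batch, d_model)
        hidden = hidden.transpose(0, 1)                # (batch, seq_len, d_model)
        clf_h = hidden[clf_token_mask]                 # hidden state of the classification token
        logits = self.clf_head(self.dropout(clf_h))    # (batch, num_classes)
        if labels is None:
            return logits
        return logits, F.cross_entropy(logits, labels)
```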

115 of 238

Let’s load and prepare our dataset:

  – trim to the transformer input size & add a classification token at the end of each sample,
  – pad to the left,
  – convert to tensors,
  – extract a validation set.

Fine-tuning hyper-parameters:

  – 6 classes in TREC-6
  – Use fine-tuning hyper-parameters from Radford et al., 2018:
    • learning rate from 6.5e-5 to 0.0
    • fine-tune for 3 epochs

Hands-on: Model adaptation

115
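A rough sketch of this preprocessing with the BERT tokenizer (using '[CLS]' as the classification token and '[PAD]' for padding is our assumption):

```python
import torch

def encode(question, max_len=256):
    """Tokenize, trim, append a classification token, left-pad and convert to ids."""
    tokens = tokenizer.tokenize(question)[:max_len - 1] + ['[CLS]']
    ids = tokenizer.convert_tokens_to_ids(tokens)
    padding = [tokenizer.vocab['[PAD]']] * (max_len - len(ids))
    return torch.tensor(padding + ids)                 # pad to the left
```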

116 of 238

Adapt our model architecture

Replace the pre-training head (language modeling) with the classification head:

A linear layer, which takes as input the hidden-state of the [CLF] token (using a mask)

Keep our pretrained model unchanged as the backbone.

* Initialize all the weights of the model.
* Reload the common weights from the pretrained model.

Hands-on: Model adaptation

116

117 of 238

Our fine-tuning code:

A simple training update function:

* prepare inputs: transpose and build padding & classification token masks
* we have options to clip and accumulate gradients

Learning rate schedule:

* linearly increasing to lr
* linearly decreasing to 0.0

We will evaluate on our validation and test sets:

* validation: after each epoch
* test: at the end

Hands-on: Model adaptation

117

118 of 238

We can now fine-tune our model on TREC:

We are at the state-of-the-art

(ULMFiT)

Remarks:

  • The error rate goes down quickly! After one epoch we already have >90% accuracy.
    ⇨ Fine-tuning is highly data-efficient in transfer learning
  • We took our pre-training & fine-tuning hyper-parameters straight from the literature on related models.
    ⇨ Fine-tuning is often robust to the exact choice of hyper-parameters

Hands-on: Model adaptation – Results

118

119 of 238

Let’s conclude this hands-on with a few additional words on robustness & variance.

  • Large pretrained models (e.g. BERT large) are prone to degenerate performance when fine-tuned on tasks with small training sets.
  • Observed behavior is often “on-off”: it either works very well or doesn’t work at all.
  • Understanding the conditions and causes of this behavior (models, adaptation schemes) is an open research question.

Hands-on: Model adaptation – Results

119

120 of 238

4.2 – Optimization

Several directions when it comes to the optimization itself:

  1. Choose which weights we should update
     Feature extraction, fine-tuning, adapters

  2. Choose how and when to update the weights
     From top to bottom, gradual unfreezing, discriminative fine-tuning

  3. Consider practical trade-offs
     Space and time complexity, performance

120

Image credit: ProSymbols, purplestudio, Markus, Alfredo

121 of 238

4.2.A – Optimization: Which weights?

The main question: To tune or not to tune (the pretrained weights)?

  • Do not change the pretrained weights
    Feature extraction, adapters

  • Change the pretrained weights
    Fine-tuning

121

Image credit: purplestudio

122 of 238

4.2.A – Optimization: Which weights?

Don’t touch the pretrained weights!

Feature extraction:

  • Weights are frozen

122

❄️

123 of 238

4.2.A – Optimization: Which weights?

Don’t touch the pretrained weights!

Feature extraction:

  • Weights are frozen
  • A linear classifier is trained on top of the pretrained representations

123

❄️

124 of 238

4.2.A – Optimization: Which weights?

Don’t touch the pretrained weights!

Feature extraction:

  • Weights are frozen
  • A linear classifier is trained on top of the pretrained representations
  • Don’t just use features of the top layer!
  • Learn a linear combination of layers (Peters et al., NAACL 2018, Ruder et al., AAAI 2019)

124

❄️

125 of 238

4.2.A – Optimization: Which weights?

Don’t touch the pretrained weights!

Feature extraction:

  • Alternatively, pretrained representations are used as features in downstream model

125

126 of 238

4.2.A – Optimization: Which weights?

Don’t touch the pretrained weights!

Adapters

  • Task-specific modules that are added in between existing layers

126

127 of 238

4.2.A – Optimization: Which weights?

Don’t touch the pretrained weights!

Adapters

  • Task-specific modules that are added in between existing layers
  • Only adapters are trained

127

128 of 238

4.2.A – Optimization: Which weights?

Yes, change the pretrained weights!

Fine-tuning:

  • Pretrained weights are used as initialization for parameters of the downstream model
  • The whole pretrained architecture is trained during the adaptation phase

128

129 of 238

Hands-on #3:

Using Adapters and freezing

129

Image credit: Chanaky

130 of 238

  • Modifications:
    • add Adapters inside the backbone model: Linear ⇨ ReLU ⇨ Linear, with a skip-connection
  • As previously:
    • add a linear layer on top
    • use an additional embedding (classification token) at the bottom

Hands-on: Model adaptation

130

Second adaptation scheme: Using Adapters

  • Houlsby et al., ICML 2019

We will only train the adapters, the added linear layer and the embeddings. The other parameters of the model will be frozen.

131 of 238

Let’s adapt our model architecture

Add the adapter modules:

Bottleneck layers with 2 linear layers and a non-linear activation function (ReLU)

Hidden dimension is small: e.g. 32, 64, 256

Inherit from our pretrained model to have all the modules.

The Adapters are inserted inside skip-connections after:

  • the attention module
  • the feed-forward module

Hands-on: Model adaptation

131
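A minimal sketch of such a bottleneck module (the hidden size and names are illustrative):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project -> ReLU -> up-project, wrapped in a skip-connection."""
    def __init__(self, d_model, bottleneck_dim=32):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, d_model)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))   # skip-connection around the bottleneck
```

In the hands-on, modules like this would be inserted after the attention and feed-forward sub-layers of each transformer block.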

132 of 238

Now we need to freeze the portions of our model we don’t want to train.

We just indicate that no gradient is needed for the frozen parameters by setting param.requires_grad to False:

In our case we will train 25% of the parameters. The model is small & deep (many adapters) and we need to train the embeddings, so the ratio stays quite high. For a larger model this ratio would be a lot lower.

Hands-on: Model adaptation

132
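In code, the freezing step can be as simple as the following sketch (the name patterns are illustrative and depend on how the model is defined):

```python
trainable_patterns = ('adapter', 'clf_head', 'tok_emb', 'pos_emb')   # what we keep trainable

for name, param in model.named_parameters():
    param.requires_grad = any(pattern in name for pattern in trainable_patterns)

n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in model.parameters())
print(f"Training {n_trainable / n_total:.0%} of the parameters")
```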

133 of 238

Results are similar to the full fine-tuning case, with the advantage of training only 25% of the full model parameters.

For a small 50M-parameter model this method is overkill; it is intended for 300M–1.5B-parameter models.

We use a hidden dimension of 32 for the adapters and a learning rate ten times higher for the fine-tuning (we have added quite a lot of newly initialized parameters to train from scratch).

Hands-on: Model adaptation

133

134 of 238

4.2.B – Optimization: What schedule?

We have decided which weights to update, but in which order and how should we update them?

Motivation: We want to avoid overwriting useful pretrained information and maximize positive transfer.

Related concept: Catastrophic forgetting (McCloskey & Cohen, 1989; French, 1999): when a model forgets the task it was originally trained on.

134

Image credit: Markus

135 of 238

4.2.B – Optimization: What schedule?

A guiding principle: update from top to bottom

  • Progressively in time: freezing
  • Progressively in intensity: Varying the learning rates
  • Progressively vs. the pretrained model: Regularization

135

136 of 238

4.2.B – Optimization: Freezing

Main intuition: Training all layers at the same time on data of a different distribution and task may lead to instability and poor solutions.

Solution: Train layers individually to give them time to adapt to new task and data.

Goes back to layer-wise training of early deep neural networks (Hinton et al., 2006; Bengio et al., 2007).

136

137 of 238

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)

137

138 of 238

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)
  • Chain-thaw (Felbo et al., EMNLP 2017): training one layer at a time
    1. Train new layer

138

139 of 238

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)
  • Chain-thaw (Felbo et al., EMNLP 2017): training one layer at a time
    • Train new layer
    • Train one layer at a time

139

140 of 238

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)
  • Chain-thaw (Felbo et al., EMNLP 2017): training one layer at a time
    • Train new layer
    • Train one layer at a time

140

141 of 238

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)
  • Chain-thaw (Felbo et al., EMNLP 2017): training one layer at a time
    • Train new layer
    • Train one layer at a time

141

142 of 238

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)
  • Chain-thaw (Felbo et al., EMNLP 2017): training one layer at a time
    • Train new layer
    • Train one layer at a time
    • Train all layers

142

143 of 238

4.2.B – Optimization: Freezing

143

144 of 238

4.2.B – Optimization: Freezing

144

145 of 238

4.2.B – Optimization: Freezing

145

146 of 238

4.2.B – Optimization: Freezing

146

147 of 238

4.2.B – Optimization: Freezing

147

148 of 238

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)
  • Chain-thaw (Felbo et al., EMNLP 2017): training one layer at a time
  • Gradually unfreezing (Howard & Ruder, ACL 2018): unfreeze one layer after another
  • Sequential unfreezing (Chronopoulou et al., NAACL 2019): hyper-parameters determine the length of each fine-tuning stage
    • Fine-tune the additional (newly added) parameters for a number of epochs
    • Fine-tune the pretrained parameters without the embedding layer for a number of epochs

148

149 of 238

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)
  • Chain-thaw (Felbo et al., EMNLP 2017): training one layer at a time
  • Gradually unfreezing (Howard & Ruder, ACL 2018): unfreeze one layer after another
  • Sequential unfreezing (Chronopoulou et al., NAACL 2019): hyper-parameters determine the length of each fine-tuning stage
    • Fine-tune the additional (newly added) parameters for a number of epochs
    • Fine-tune the pretrained parameters without the embedding layer for a number of epochs
    • Train all layers until convergence

149

150 of 238

4.2.B – Optimization: Freezing

Commonality: Train all parameters jointly in the end

150

151 of 238

Hands-on #4:

Using gradual unfreezing

151

Image credit: Chanaky

152 of 238

Gradual unfreezing is similar to our previous freezing process. We start by freezing the whole model except the newly added parameters:

We then gradually unfreeze an additional block as training progresses, so that we train the full model at the end:

Find index of layer to unfreeze

Name pattern matching

Unfreezing interval

Hands-on: Adaptation

152
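One possible implementation of this unfreezing schedule (a sketch; the name pattern and interval are assumptions tied to the hypothetical model defined earlier):

```python
import re

UNFREEZING_INTERVAL = 500            # unfreeze one more block every 500 updates (illustrative)

def apply_unfreezing_schedule(model, step):
    """Gradually unfreeze transformer blocks from top to bottom as training progresses."""
    n_blocks = len(model.transformer.blocks)
    n_unfrozen = min(n_blocks, step // UNFREEZING_INTERVAL)
    for name, param in model.transformer.named_parameters():
        match = re.match(r"blocks\.(\d+)\.", name)
        if match:                                     # parameters inside a transformer block
            block_index = int(match.group(1))         # blocks are numbered bottom-up
            param.requires_grad = block_index >= n_blocks - n_unfrozen
```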

153 of 238

Gradual unfreezing has not been investigated in detail for Transformer models ⇨ no specific hyper-parameters are advocated in the literature.

Residual connections may have an impact on the method ⇨ hyper-parameters developed for LSTMs should probably be adapted.

Hands-on: Adaptation

153

We show simple experiments in the Colab. Better hyper-parameters settings can probably be found.

154 of 238

4.2.B – Optimization: Learning rates

Main idea: Use lower learning rates to avoid overwriting useful information.

Where and when?

  • Lower layers (capture general information)
  • Early in training (model still needs to adapt to target distribution)
  • Late in training (model is close to convergence)

154

155 of 238

4.2.B – Optimization: Learning rates

  • Discriminative fine-tuning (Howard & Ruder, ACL 2018)
    • Lower layers capture general information → use lower learning rates for lower layers

155

156 of 238

4.2.B – Optimization: Learning rates

  • Discriminative fine-tuning
  • Triangular learning rates (Howard & Ruder, ACL 2018)
    • Quickly move to a suitable region, then slowly converge over time

156

157 of 238

4.2.B – Optimization: Learning rates

  • Discriminative fine-tuning
  • Triangular learning rates (Howard & Ruder, ACL 2018)
    • Quickly move to a suitable region, then slowly converge over time
    • Also known as “learning rate warm-up”
    • Used e.g. in Transformer (Vaswani et al., NIPS 2017) and Transformer-based methods (BERT, GPT)
    • Facilitates optimization; easier to escape suboptimal local minima

157

158 of 238

4.2.B – Optimization: Regularization

Main idea: minimize catastrophic forgetting by encouraging target model parameters to stay close to the pretrained model parameters using a regularization term.

158

159 of 238

4.2.B – Optimization: Regularization

  • Simple method: regularize new parameters not to deviate too much from the pretrained ones (Wiese et al., CoNLL 2017):
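One common instantiation of such a penalty (a sketch, not necessarily the exact formula on the slide) is an L2 distance between the fine-tuned and pretrained parameters:

$$\Omega = \lambda \,\lVert \theta_{\text{target}} - \theta_{\text{pretrained}} \rVert_2^2$$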

159

160 of 238

4.2.B – Optimization: Regularization

  • More advanced (elastic weight consolidation, EWC): focus on parameters that are important for the pretrained task based on the Fisher information matrix (Kirkpatrick et al., PNAS 2017):
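The EWC penalty weights each squared parameter change by its (diagonal) Fisher information $F_i$, so that parameters important for the pretraining task move less:

$$\Omega = \sum_i \frac{\lambda}{2}\, F_i \,\big(\theta_i - \theta_{\text{pretrained},i}\big)^2$$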

160

161 of 238

4.2.B – Optimization: Regularization

EWC has downsides in continual learning:

  • May over-constrain parameters
  • Computational cost is linear in the number of tasks (Schwarz et al., ICML 2018)

161

162 of 238

4.2.B – Optimization: Regularization

  • If tasks are similar, we may also encourage source and target predictions to be close based on cross-entropy, similar to distillation:
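One way to write such a term (a sketch) is a cross-entropy between the source and target model output distributions on the same input:

$$\Omega = -\sum_{c} p_{\text{source}}(c \mid x)\, \log p_{\text{target}}(c \mid x)$$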

162

163 of 238

Hands-on #5:

Using discriminative learning

163

Image credit: Chanaky

164 of 238

Discriminative learning rates can be implemented in two steps in our example:

First, we organize the parameters of the various layers into labelled parameter groups in the optimizer.

Then, at each training iteration, we compute the learning rate of each group depending on its label (the per-layer decrease factor is a hyper-parameter).

Hands-on: Model adaptation

164
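A sketch of these two steps with our hypothetical model (the decrease factor of 2.6 follows the ULMFiT recipe; the other names and values are illustrative):

```python
import torch

base_lr, decrease_factor = 6.5e-5, 2.6
blocks = list(model.transformer.blocks)

# Step 1: labelled parameter groups, one per layer depth (extra keys are kept by the optimizer)
param_groups = [{'params': block.parameters(), 'depth': i} for i, block in enumerate(blocks)]
optimizer = torch.optim.Adam(param_groups, lr=base_lr)

# Step 2: at each iteration, lower layers get exponentially smaller learning rates
for group in optimizer.param_groups:
    group['lr'] = base_lr / (decrease_factor ** (len(blocks) - 1 - group['depth']))
```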

165 of 238

4.2.C – Optimization: Trade-offs

Several trade-offs when choosing which weights to update:

  1. Space complexity
     Task-specific modifications, additional parameters, parameter reuse

  2. Time complexity
     Training time

  3. Performance

165

Image credit: Alfredo

166 of 238

4.2.C – Optimization trade-offs: Space

166

[Figure: feature extraction, fine-tuning and adapters compared on three scales: task-specific modifications (many ↔ few), additional parameters (many ↔ few), and parameter reuse (all ↔ none)]

167 of 238

4.2.C – Optimization trade-offs: Time

Training time

167

[Figure: feature extraction, fine-tuning and adapters placed on a training-time scale from slow to fast]

168 of 238

4.2.C – Optimization trade-offs: Performance

  • Rule of thumb: If the source and target tasks are dissimilar*, use feature extraction (Peters et al., 2019)
  • Otherwise, feature extraction and fine-tuning often perform similarly
  • Fine-tuning BERT on textual similarity tasks works significantly better
  • Adapters achieve performance competitive with fine-tuning
  • Anecdotally, Transformers are easier to fine-tune (less sensitive to hyper-parameters) than LSTMs

*dissimilar: certain capabilities (e.g. modelling inter-sentence relations) are beneficial for target task, but pretrained model lacks them (see more later)

168

169 of 238

4.3 – Getting more signal

The target task is often a low-resource task. We can often improve the performance of transfer learning by combining a diverse set of signals:

  • From fine-tuning a single model on a single adaptation task…
    The basics: fine-tuning the model with a simple classification objective
  • … to gathering signal from other datasets and related tasks …
    Fine-tuning with weak supervision, multi-tasking and sequential adaptation
  • … to ensembling models
    Combining the predictions of several fine-tuned models

169

Image credit: Naveen

170 of 238

4.3.A – Getting more signal: Basic fine-tuning

Simple example of fine-tuning on a text classification task:

  • Extract a single fixed-length vector from the model:
    the hidden state of the first/last token, or the mean/max of the hidden states
  • Project to the classification space with an additional classifier
  • Train with a classification objective

170

171 of 238

4.3.B – Getting more signal: Related datasets/tasks

  • Sequential adaptation
    Intermediate fine-tuning on related datasets and tasks
  • Multi-task fine-tuning with related tasks
    Such as NLI tasks in GLUE
  • Dataset slicing
    When the model consistently underperforms on particular slices of the data
  • Semi-supervised learning
    Use unlabelled data to improve model consistency

171

172 of 238

4.3.B – Getting more signal: Sequential adaptation

Fine-tuning on related high-resource dataset

  1. Fine-tune model on related task with more data

172

173 of 238

4.3.B – Getting more signal: Sequential adaptation

Fine-tuning on related high-resource dataset

  1. Fine-tune model on related task with more data
  2. Fine-tune model on target task
  • Helps particularly for tasks with limited data and similar tasks (Phang et al., 2018)
  • Improves sample complexity on target task (Yogatama et al., 2019)

173

174 of 238

4.3.B – Getting more signal: Multi-task fine-tuning

Fine-tune the model jointly on related tasks

  • For each optimization step, sample a task and a batch for training.
  • Train via multi-task learning for a couple of epochs.

174

175 of 238

4.3.B – Getting more signal: Multi-task fine-tuning

Fine-tune the model jointly on related tasks

  • For each optimization step, sample a task and a batch for training.
  • Train via multi-task learning for a couple of epochs.
  • Fine-tune on the target task only for a few epochs at the end.

175

176 of 238

4.3.B – Getting more signal: Multi-task fine-tuning

Fine-tune the model with an unsupervised auxiliary task

  • Language modelling is a related task!
  • Fine-tuning the LM helps adapting the pretrained parameters to the target dataset.
  • Helps even without pretraining (Rei et al., ACL 2017)
  • Can optionally anneal ratio (Chronopoulou et al., NAACL 2019)
  • Used as a separate step in ULMFiT

176

177 of 238

4.3.B – Getting more signal: Dataset slicing

Use auxiliary heads that are trained only on particular subsets of the data

  • Analyze errors of the model
  • Use heuristics to automatically identify challenging subsets of the training data
  • Train auxiliary heads jointly with main head

See also Massive Multi-task Learning with Snorkel MeTaL

177

178 of 238

4.3.B – Getting more signal: Semi-supervised learning

Can be used to make model predictions more consistent using unlabelled data

  • Main idea: Minimize distance between predictions on original input and perturbed input

178

179 of 238

4.3.B – Getting more signal: Semi-supervised learning

Can be used to make model predictions more consistent using unlabelled data

  • Perturbation can be noise, masking (Clark et al., EMNLP 2018), data augmentation, e.g. back-translation (Xie et al., 2019)

179

180 of 238

4.3.C – Getting more signal: Ensembling

Reaching the state-of-the-art by ensembling independently fine-tuned models

  • Ensembling models
    Combining the predictions of models fine-tuned with various hyper-parameters
  • Knowledge distillation
    Distill an ensemble of fine-tuned models into a single smaller model

180

181 of 238

4.3.C – Getting more signal: Ensembling

Models fine-tuned...

  • on different tasks
  • on different dataset splits
  • with different parameters (dropout, initializations…)
  • from variants of pre-trained models (e.g. cased/uncased)

181

Combining the predictions of models fine-tuned with various hyper-parameters.

182 of 238

4.3.C – Getting more signal: Distilling

  • Knowledge distillation: train a student model on soft targets produced by the teacher (the ensemble)
  • Relative probabilities of the teacher labels contain information about how the teacher generalizes

182

Distilling ensembles of large models back in a single model
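A common form of the soft-target objective (a sketch; the temperature value is a typical choice, not from the slides):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the softened teacher and student distributions."""
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction='batchmean') * temperature ** 2
```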

183 of 238

Hands-on #6:

Using multi-task learning

183

Image credit: Chanaky

184 of 238

Multitasking with a classification loss + language modeling loss.

Create two heads:

– language modeling head

– classification head

Total loss is a weighted sum of

– language modeling loss and

– classification loss

Hands-on: Multi-task learning

184
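A sketch of the two-headed model and the weighted loss, reusing the hypothetical backbone from earlier (the 1.0 / 0.5 coefficients are the ones mentioned on the next slide):

```python
import torch.nn as nn
import torch.nn.functional as F

class TransformerWithClfAndLMHead(nn.Module):
    """Shared backbone with a tied-weight LM head and a classification head."""
    def __init__(self, transformer, num_classes=6, clf_coef=1.0, lm_coef=0.5):
        super().__init__()
        self.transformer = transformer
        vocab_size, d_model = transformer.tok_emb.weight.shape
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = transformer.tok_emb.weight      # weight tying
        self.clf_head = nn.Linear(d_model, num_classes)
        self.clf_coef, self.lm_coef = clf_coef, lm_coef

    def forward(self, ids, clf_token_mask, clf_labels, lm_labels):
        hidden = self.transformer(ids)                        # (seq_len, batch, d_model)
        lm_logits = self.lm_head(hidden)
        lm_loss = F.cross_entropy(lm_logits.view(-1, lm_logits.size(-1)),
                                  lm_labels.view(-1), ignore_index=-1)
        clf_logits = self.clf_head(hidden.transpose(0, 1)[clf_token_mask])
        clf_loss = F.cross_entropy(clf_logits, clf_labels)
        # total loss: weighted sum of the two objectives
        return self.clf_coef * clf_loss + self.lm_coef * lm_loss
```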

185 of 238

Multi-tasking helped us improve over single-task full-model fine-tuning!

We use a coefficient of 1.0 for the classification loss and 0.5 for the language modeling loss, and fine-tune a little longer (6 epochs instead of 3; the validation loss was still decreasing).

Hands-on: Multi-task learning

185

186 of 238

Agenda

186

187 of 238

5. Downstream applications - Hands-on examples

187

Image credit: Fahmi

188 of 238

5. Downstream applications - Hands-on examples

In this section we will explore downstream applications and practical considerations along two orthogonal directions:

  1. What are the various applications of transfer learning in NLP?
     Document/sequence classification, token-level classification, structured prediction and language generation
  2. How to leverage several frameworks & libraries for practical applications
     Tensorflow, PyTorch, Keras and third-party libraries like fast.ai, HuggingFace...

188

189 of 238

Practical considerations

Frameworks & libraries: practical considerations

189

  • Pretraining large-scale models is costly
    Use open-source models
    Share your pretrained models

“Energy and Policy Considerations for Deep Learning in NLP” - Strubell, Ganesh, McCallum - ACL 2019

  • Sharing/accessing pretrained models
    • Hubs: Tensorflow Hub, PyTorch Hub
    • Author released checkpoints: ex BERT, GPT...
    • Third-party libraries: AllenNLP, fast.ai, HuggingFace
  • Design considerations
    • Hubs/libraries:
      • Simple to use but can be difficult to modify model internal architecture
    • Author released checkpoints:
      • More difficult to use but you have full control over the model internals

190 of 238

5. Downstream applications - Hands-on examples

  • Sequence and document level classification
    Hands-on: Document level classification (fast.ai)
  • Token level classification
    Hands-on: Question answering (Google BERT & Tensorflow/TF Hub)
  • Language generation
    Hands-on: Dialog generation (OpenAI GPT & HuggingFace/PyTorch Hub)

190

Icons credits: David, Susannanova, Flatart, ProSymbols

191 of 238

5.A – Sequence & document level classification

Transfer learning for document classification using the fast.ai library.

  • Target task:
    IMDB: a binary sentiment classification dataset containing 25k highly polar movie reviews for training, 25k for testing and additional unlabeled data.
    http://ai.stanford.edu/~amaas/data/sentiment/
  • Fast.ai has in particular:
    • a pre-trained English model available for download
    • a standardized data block API
    • easy access to standard datasets like IMDB
  • Fast.ai is based on PyTorch

191

192 of 238

5.A – Document level classification using fast.ai

192

fast.ai gives access to many high-level APIs out-of-the-box for vision, text, tabular data and collaborative filtering.

The library is designed for speed of experimentation, e.g. by importing all necessary modules at once in interactive computing environments.

fast.ai then comprises all the high-level modules needed to quickly set up a transfer learning experiment:

  • Load the IMDB dataset & inspect it.
  • Create a DataBunch for the language model and the classifier.
  • Load an AWD-LSTM (Merity et al., 2017) pretrained on WikiText-103 & fine-tune it on IMDB using the language modeling loss.

193 of 238

Once we have a fine-tuned language model (AWD-LSTM), we can create a text classifier by adding a classification head with:

– A layer to concatenate the final outputs of the RNN with the maximum and average of all the intermediate outputs (along the sequence length)

– Two blocks of nn.BatchNorm1d ⇨ nn.Dropout ⇨ nn.Linear ⇨ nn.ReLU with a hidden dimension of 50.

Now we fine-tune in two steps:

1. train the classification head only while keeping the language model frozen, and

2. fine-tune the whole architecture.

5.A – Document level classification using fast.ai

193
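With the fast.ai v1 text API, the whole workflow can be sketched roughly as follows (learning rates and epoch counts are illustrative, not the exact tutorial settings):

```python
from fastai.text import *

path = untar_data(URLs.IMDB)
data_lm = TextLMDataBunch.from_folder(path)                      # DataBunch for the LM
data_clas = TextClasDataBunch.from_folder(path, vocab=data_lm.vocab, bs=32)

# AWD-LSTM pretrained on WikiText-103, fine-tuned on IMDB with the LM loss
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn_lm.fit_one_cycle(1, 1e-2)
learn_lm.unfreeze()
learn_lm.fit_one_cycle(1, 1e-3)
learn_lm.save_encoder('ft_enc')

# Classifier with the pooled classification head on top of the fine-tuned encoder
learn_clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_clf.load_encoder('ft_enc')
learn_clf.fit_one_cycle(1, 2e-2)                                 # 1. train the head only
learn_clf.unfreeze()
learn_clf.fit_one_cycle(1, slice(1e-3 / 100, 1e-3))              # 2. fine-tune everything
```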

194 of 238

5.B – Token level classification: BERT & Tensorflow

Transfer learning for token level classification: Google’s BERT in TensorFlow.

  • Target task:
    SQuAD: a question answering dataset.
    https://rajpurkar.github.io/SQuAD-explorer/
  • In this example we will directly use a Tensorflow checkpoint
    • Example: https://github.com/google-research/bert
    • We use the usual Tensorflow workflow: create model graph comprising the core model and the added/modified elements
    • Take care of variable assignments when loading the checkpoint

194

195 of 238

Let’s adapt BERT to the target task.

Replace the pre-training head (language modeling) with a classification head:

a linear projection layer to estimate 2 probabilities for each token:

– being the start of an answer

– being the end of an answer.

Keep our core model unchanged.

5.B – SQuAD with BERT & Tensorflow

195
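In TF1-style code, the added head is just a small linear projection over the sequence output (a sketch close in spirit to the released BERT code; the variable names are illustrative):

```python
import tensorflow as tf

def add_squad_head(sequence_output):
    """sequence_output: [batch, seq_len, hidden] from the BERT core model.
    Returns per-token start and end logits."""
    hidden_size = sequence_output.shape[-1].value
    output_weights = tf.get_variable(
        "squad/output_weights", [2, hidden_size],
        initializer=tf.truncated_normal_initializer(stddev=0.02))
    output_bias = tf.get_variable(
        "squad/output_bias", [2], initializer=tf.zeros_initializer())
    logits = tf.einsum("bsh,oh->bso", sequence_output, output_weights) + output_bias
    start_logits, end_logits = tf.unstack(logits, axis=-1)
    return start_logits, end_logits
```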

196 of 238

Load our pretrained checkpoint

To load our checkpoint, we just need to set up an assignment_map from the variables of the checkpoint to the model variables, keeping only the variables that are present in the model.

We can then use tf.train.init_from_checkpoint.

5.B – SQuAD with BERT & Tensorflow

196

197 of 238

TensorFlow-Hub

TensorFlow Hub is a library for sharing machine learning models as self-contained pieces of TensorFlow graph with their weights and assets.

Working directly with TensorFlow requires having access to (and including in your code) the full code of the pretrained model.

Modules are automatically downloaded and cached when instantiated.

Each time a module m is called e.g. y = m(x), it adds operations to the current TensorFlow graph to compute y from x.

5.B – SQuAD with BERT & Tensorflow

197

198 of 238

Tensorflow Hub hosts a nice selection of pretrained models for NLP

Tensorflow Hub can also be used with Keras, exactly as we saw in the BERT example

The main limitations of Hubs are:

  • No access to the source code of the model (black-box)
  • Not possible to modify the internals of the model (e.g. to add Adapters)

5.B – SQuAD with BERT & Tensorflow

198

199 of 238

5.C – Language Generation: OpenAI GPT & PyTorch

Transfer learning for language generation: OpenAI GPT and HuggingFace library.

  • Target task:
    ConvAI2 – The 2nd Conversational Intelligence Challenge for training and evaluating models for non-goal-oriented dialogue systems, i.e. chit-chat
    http://convai.io
  • HuggingFace library of pretrained models
    • a repository of large scale pre-trained models with BERT, GPT, GPT-2, Transformer-XL
    • provide an easy way to download, instantiate and train pre-trained models in PyTorch
  • HuggingFace’s models are now also accessible using PyTorch Hub

199

200 of 238

A dialog generation task:

5.C – Chit-chat with OpenAI GPT & PyTorch

200

Language generation tasks are close to the language modeling pre-training objective, but:

  • Language modeling pre-training involves a single input: a sequence of words.
  • In a dialog setting: several types of context are provided to generate an output sequence:
    • knowledge base: persona sentences,
    • history of the dialog: at least the last utterance from the user,
    • tokens of the output sequence that have already been generated.

How should we adapt the model?

201 of 238

Golovanov, Kurbanov, Nikolenko, Truskovskyi, Tselousov and Wolf, ACL 2019

5.C – Chit-chat with OpenAI GPT & PyTorch

201

Several options:

  • Duplicate the model to initialize an encoder-decoder structure
    e.g. Lample & Conneau, 2019
  • Use a single model with concatenated inputs
    see e.g. Wolf et al., 2019; Khandelwal et al., 2019

Concatenate the various contexts, separated by delimiters, and add position and segment embeddings

202 of 238

5.C – Chit-chat with OpenAI GPT & PyTorch

202

Let’s import pretrained versions of the OpenAI GPT tokenizer and model, and add a few new tokens to the vocabulary.

Now most of the work is about preparing the inputs for the model:

  • organize the contexts in segments,
  • add delimiters at the extremities of the segments,
  • and build our word, position and segment inputs for the model.

We then train our model using the pretraining language modeling objective.
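A rough sketch of this input preparation (the special-token names and the build_inputs helper are our own simplification of the approach described above; adding the new tokens to the vocabulary is omitted):

```python
from pytorch_pretrained_bert import OpenAIGPTTokenizer, OpenAIGPTLMHeadModel

tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')

# Hypothetical delimiter / segment tokens (they would be added to the vocabulary in practice)
BOS, EOS, SPEAKER_USER, SPEAKER_BOT = '<bos>', '<eos>', '<speaker1>', '<speaker2>'

def build_inputs(persona, history, reply):
    """persona / history / reply are lists of token lists; returns word, segment, position inputs."""
    segments = [[BOS] + sum(persona, [])] + history + [reply + [EOS]]
    words, segment_ids = [], []
    for i, segment in enumerate(segments):
        speaker = SPEAKER_BOT if i % 2 else SPEAKER_USER     # alternate speakers over the history
        words.extend(segment)
        segment_ids.extend([speaker] * len(segment))
    position_ids = list(range(len(words)))
    return words, segment_ids, position_ids
```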

203 of 238

5.C – Chit-chat with OpenAI GPT & PyTorch

203

PyTorch Hub

Last Friday, the PyTorch team soft-launched a beta version of PyTorch Hub. Let’s have a quick look.

  • PyTorch Hub is based on GitHub repositories
  • A model is shared by adding a hubconf.py script to the root of a GitHub repository
  • Both model definitions and pre-trained weights can be shared
  • More details: https://pytorch.org/hub and https://pytorch.org/docs/stable/hub.html

In our case, to use torch.hub instead of pytorch-pretrained-bert, we can simply call torch.hub.load with the path to the pytorch-pretrained-bert GitHub repository, for example:
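(torch.hub.load itself is the real API, but the 'bertTokenizer' and 'bertModel' entry-point names and their arguments below are my recollection of that repository's hubconf.py and should be treated as assumptions)

import torch

# Entry-point names assumed from the repository's hubconf.py.
tokenizer = torch.hub.load('huggingface/pytorch-pretrained-BERT',
                           'bertTokenizer', 'bert-base-cased', do_lower_case=False)
model = torch.hub.load('huggingface/pytorch-pretrained-BERT',
                       'bertModel', 'bert-base-cased')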

PyTorch Hub will fetch the model from the master branch on GitHub. This means that you don’t need to package your model for pip, and users will always get the most recent version (from the master branch).

204 of 238

Agenda

204

205 of 238

6. Open problems and future directions

205

Image credit: Yazmin Alanis

206 of 238

6. Open problems and future directions

  • Shortcomings of pretrained language models
  • Pretraining tasks
  • Tasks and task similarity
  • Continual learning and meta-learning
  • Bias

206

Image credit: Yazmin Alanis

207 of 238

Shortcomings of pretrained language models

  • Recap: LM can be seen as a general pretraining task; with enough data, compute, and capacity an LM can learn a lot.
  • In practice, many things that are less represented in text are harder to learn.
  • Pretrained language models are bad at
    • fine-grained linguistic tasks (Liu et al., NAACL 2019)
    • common sense (when you actually make it difficult; Zellers et al., ACL 2019)
    • natural language generation (maintaining long-term dependencies, relations, coherence, etc.)
    • ...
  • They tend to overfit to surface-form information when fine-tuned; ‘rapid surface learners’

207

208 of 238

Shortcomings of pretrained language models

Large, pretrained language models can be difficult to optimize.

  • Fine-tuning is often unstable and has a high variance, particularly if the target datasets are very small
  • Devlin et al. (NAACL 2019) note that the large (24-layer) version of BERT is particularly prone to degenerate performance; multiple random restarts are sometimes necessary, as also investigated in detail by Phang et al. (2018)

208

209 of 238

Shortcomings of pretrained language models

Current pretrained language models are very large.

  • Do we really need all these parameters?
  • Recent work shows that only a few of the attention heads in BERT are required (Voita et al., ACL 2019).
  • More work needed to understand model parameters.
  • Pruning and distillation are two ways to deal with this.
  • See also: the lottery ticket hypothesis (Frankle & Carbin, ICLR 2019).

209

210 of 238

Pretraining tasks

Shortcomings of the language modeling objective:

  • Not appropriate for all models
    • If we condition on more inputs, need to pretrain those parts
    • E.g. the decoder in sequence-to-sequence learning (Song et al., ICML 2019)
  • A left-to-right bias may not always be best
    • Objectives that take more context into account (such as masking) seem useful (though less sample-efficient)
    • Possible to combine different LM variants (Dong et al., 2019)
  • Weak signal for semantics and long-term context vs. strong signal for syntax and short-term word co-occurrences
    • Need incentives that promote encoding what we care about, e.g. semantics

210

211 of 238

More diverse self-supervised objectives

  • Taking inspiration from computer vision

Sampling a patch and a neighbour and predicting their spatial configuration (Doersch et al., ICCV 2015)

Image colorization (Zhang et al., ECCV 2016)

  • Self-supervision in language mostly based on word co-occurrence (Ando and Zhang, 2005)
  • Supervision on different levels of meaning
    • Discourse, document, sentence, etc.
    • Using other signals, e.g. meta-data
  • Emphasizing different qualities of language

Pretraining tasks

211

212 of 238

Pretraining tasks

Specialized pretraining tasks that teach what our model is missing

  • Develop specialized pretraining tasks that explicitly learn such relationships
  • Other pretraining tasks could explicitly learn reasoning or understanding
    • Arithmetic, temporal, causal, etc.; discourse, narrative, conversation, etc.
  • Pretrained representations could be connected in a sparse and modular way

212

213 of 238

Pretraining tasks

Need for grounded representations

  • Limits of distributional hypothesis—difficult to learn certain types of information from raw text
    • Human reporting bias: not stating the obvious (Gordon and Van Durme, AKBC 2013)
    • Common sense isn’t written down
    • Facts about named entities
    • No grounding to other modalities
  • Possible solutions:
    • Incorporate other structured knowledge (e.g. knowledge bases like ERNIE, Zhang et al 2019)
    • Multimodal learning (e.g. with visual representations like VideoBERT, Sun et al. 2019)
    • Interactive/human-in-the-loop approaches (e.g. dialog, Hancock et al. 2018)

213

214 of 238

Tasks and task similarity

Many tasks can be expressed as variants of language modeling

  • Language itself can directly be used to specify tasks, inputs, and outputs, e.g. by framing as QA (McCann et al., 2018)
  • Dialog-based learning without supervision by forward prediction (Weston, NIPS 2016)
  • NLP tasks formulated as a cloze prediction objective (Children’s Book Test, LAMBADA, Winograd, ...)
  • Triggering task behaviors via prompts, e.g. “TL;DR:” or a translation prompt (Radford, Wu et al., 2019); enables zero-shot adaptation
  • Questioning the notion of a “task” in NLP

214

215 of 238

Tasks and task similarity

  • Intuitive similarity of pretraining and target tasks (NLI, classification) correlates with better downstream performance
  • Do not have a clear understanding of when and how two tasks are similar and relate to each other
  • One way to gain more understanding: Large-scale empirical studies of transfer such as Taskonomy (Zamir et al., CVPR 2018)
  • Should be helpful for designing better and specialized pretraining tasks

215

216 of 238

Continual and meta-learning

  • Current transfer learning performs adaptation once.
  • Ultimately, we’d like to have models that continue to retain and accumulate knowledge across many tasks (Yogatama et al., 2019).
  • No distinction between pretraining and adaptation; just one stream of tasks.
  • Main challenge towards this: Catastrophic forgetting.
  • Different approaches from the literature:
    • Memory, regularization, task-specific weights, etc.

216

217 of 238

Continual and meta-learning

  • Objective of transfer learning: Learn a representation that is general and useful for many tasks.
  • This objective does not incentivize ease of adaptation (fine-tuning is often unstable) and does not teach the model how to adapt its representation.
  • Meta-learning combined with transfer learning could make this more feasible.
  • However, most existing approaches are restricted to the few-shot setting and only learn a few steps of adaptation.

217

218 of 238

Bias

  • Bias has been shown to be pervasive in word embeddings and neural models in general
  • Large pretrained models necessarily have their own sets of biases
  • There is a blurry boundary between common-sense and bias
  • We need ways to remove such biases during adaptation
  • A small fine-tuned model should be harder to misuse

218

219 of 238

Conclusion

  • Themes: words-in-context, LM pretraining, deep models
  • Pretraining gives better sample-efficiency, can be scaled up
  • Predictive of certain features—depends on how you look at it
  • Performance trade-offs, from top to bottom
  • Transfer learning is simple to implement, practically useful
  • Still many shortcomings and open problems

219

220 of 238

Questions?

If you found these slides helpful, consider citing the tutorial as:

@inproceedings{ruder2019transfer,
  title={Transfer Learning in Natural Language Processing},
  author={Ruder, Sebastian and Peters, Matthew E and Swayamdipta, Swabha and Wolf, Thomas},
  booktitle={Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials},
  pages={15--18},
  year={2019}
}

  • Twitter: #NAACLTransfer
  • Whova: “Questions for the tutorial on Transfer Learning in NLP” topic

220

221 of 238

Extra slides

221

222 of 238

Why transfer learning in NLP? (Empirically)

222

BERT + X

223 of 238

GLUE* performance over time

223

*General Language Understanding Evaluation (GLUE; Wang et al., 2019): includes 11 diverse NLP tasks

224 of 238

Pretrained Language Models: More Parameters

224

225 of 238

More word vectors

225

  • GloVe: very large scale (840B tokens), co-occurrence based. Learns linear relationships (SOTA on word analogy) (Pennington et al., 2014)
  • fastText: incorporates subword information (Bojanowski et al., 2017)

(Figure: fastText vs. skipgram)

226 of 238

SOTA sequence modeling results

Semi-supervised Sequence Modeling with Cross-View Training

226

227 of 238

Pretrain a bidirectional character-level model, extract embeddings from the first/last character

SOTA CoNLL 2003 NER results

Contextual String Embeddings

227

228 of 238

Cloze-driven Pretraining of Self-attention Networks

228

Pretraining

Fine-tuning

SOTA NER and PTB constituency parsing, ~3.3% less than BERT-large for GLUE

229 of 238

Model is jointly pretrained on three variants of LM (bidirectional, left-to-right, seq-to-seq)

SOTA on three natural language generation tasks

UniLM - Dong et al., 2019

229

230 of 238

Pretrain encoder-decoder

Masked Sequence to Sequence Pretraining (MASS)

230

231 of 238

What matters: Pretraining Objective, Encoder

Probing tasks for sentential features:

  • Bag-of-Vectors is surprisingly good at capturing sentence-level properties, thanks to redundancies in natural linguistic input.
  • BiLSTM-based models are better than CNN-based models at capturing interesting linguistic knowledge, with the same objective
  • Objective matters: training on NLI is bad. Most tasks are structured, so a seq2tree objective works best.
  • Supervised objectives for sentence embeddings do better than unsupervised ones, like SkipThought (Kiros et al., 2015)

231

232 of 238

An inspiration from Computer Vision

From lower to higher layers, information goes from general to task-specific.

232

Image credit: Distill

233 of 238

Other methods for analysis

Other analyses

  • Textual omission and multi-modal: Kadar et al., 2016
  • Adversarial Approaches
    • Adversary: an input which differs from the original just enough to change the desired prediction
      • SQuAD: Jia & Liang, 2017
      • NLI: Glockner et al., 2018; Minervini & Riedel, 2018
      • Machine Translation: Belinkov & Bisk, 2018
    • Requires identification (manual or automatic) of inputs to modify.

233

Adversarial methods

234 of 238

Analysis: Inputs and Outputs

What to analyze?

  • Embeddings
    • Word types and tokens
    • Sentence
    • Document
  • Network Activations
    • RNNs
    • CNNs
    • Feed-forward nets
  • Layers
  • Pretraining Objectives

What to look for?

  • Surface-level features
  • Lexical features
    • E.g. POS tags
  • Morphology
  • Syntactic Structure
    • Word-level
    • Sentence-level
  • Semantic Structure
    • E.g. Roles, Coreference

234

Belinkov et al. (2019)—More details in Table 1.

235 of 238

Analysis: Methods

  • Visualization:
    • 2-D plots
    • Attention mechanisms
    • Network activations

  • Model Alterations:
    • Network Erasure
    • Perturbations
  • Model Probes:
    • Surface-level features
    • Syntactic features
    • Semantic features

(Diagram: Visualization, Model Alterations, Model Probes; * not hard and fast categories)

236 of 238

Analysis / Evaluation : Adversarial Methods

Adversarial Approaches

  • How does this say what’s in a representation?
    • Roundabout: what’s wrong with a representation...

236

Credits: Jia & Liang (2017) and Percy Liang. AI Frontiers. 2018

237 of 238

Probes are simple linear / neural layers

Liu et al., NAACL 2019
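As a rough illustration (not necessarily the exact setup of Liu et al.), a linear probe in PyTorch trains only a small classifier on top of frozen pretrained features:

import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    def __init__(self, hidden_size, num_labels):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, frozen_features):           # [batch, hidden_size], no gradient to the encoder
        return self.classifier(frozen_features)   # [batch, num_labels]

# Only the probe's parameters are trained; the pretrained representations are fixed inputs.
probe = LinearProbe(hidden_size=768, num_labels=45)   # sizes are illustrative
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)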

237

238 of 238

What is still left unanswered?

  • Interpretability is difficult (Lipton et al., 2016)
    • Many variables make synthesis challenging
    • The choice of model architecture and pretraining objective determines the informativeness of representations

Transferability to downstream tasks

Interpretability is important, but not enough on its own.

Interpretability + transferability to downstream tasks is key - that’s next!