1 of 47

Transfer Learning in NLP

Daniel Pressel

Interactions LLC

International Summer School on Deep Learning 2019

2 of 47

Deep Learning has transformed NLP

  • Deep Learning successes in NLP (a non-exhaustive list):
    • Named Entity Recognition
    • Part-of-speech Tagging
    • Machine Translation
    • Parsing
    • Document Classification
    • Question Answering
    • Coreference Resolution
    • Recognizing Textual Entailment

3 of 47

Dependency Parsing

  • Parse a sentence into a graph of arcs between dependents and their heads
  • Transition Parsing (e.g. Arc Standard):
    • Define a set of valid moves that, applied in sequence, yield a parse graph
    • Start with an initial “configuration”
    • At each step, ask a “guide” to predict a transition, until the whole sentence is covered (see the sketch below)
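As a rough illustration, here is a minimal, unlabeled arc-standard sketch in Python. The function names are illustrative, and the `guide` argument stands in for whatever classifier predicts the next transition.

```python
# Minimal, unlabeled arc-standard sketch; the "guide" is a stub for a trained classifier.
def shift(stack, buffer, arcs):
    stack.append(buffer.pop(0))          # move the next word onto the stack

def left_arc(stack, buffer, arcs):
    dep = stack.pop(-2)                  # second item becomes a dependent...
    arcs.append((stack[-1], dep))        # ...of the item on top: (head, dependent)

def right_arc(stack, buffer, arcs):
    dep = stack.pop()                    # top item becomes a dependent...
    arcs.append((stack[-1], dep))        # ...of the item below it

def parse(words, guide):
    stack, buffer, arcs = [0], list(range(1, len(words) + 1)), []   # 0 is ROOT
    moves = {"SHIFT": shift, "LEFT-ARC": left_arc, "RIGHT-ARC": right_arc}
    while buffer or len(stack) > 1:
        action = guide(stack, buffer, arcs)   # the "guide" predicts the next transition
        moves[action](stack, buffer, arcs)
    return arcs                               # the parse as (head, dependent) arcs
```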

4 of 47

Arc-Standard Parsing

5 of 47

DNNs improve transition parsing!

  • Prior to 2014
    • Use SVM or Perceptron as guide
  • Chen and Manning 2014:
    • MLP guide
  • Kiperwasser and Goldberg 2016
    • BiLSTM + MLP (pictured)
  • Stack pointer network from Ma et al., 2018
  • Fernandez-Gonzalez and Gomez-Rodriguez, 2019
    • pointer network
    • eliminate stack and buffer
    • only 1 transition type!

6 of 47

The Changing Landscape of Dependency Parsing

Before Chen and Manning 2014*

After Chen and Manning 2014**

* Chen and Manning, 2014

**Fernandez-Gonzalez and Gomez-Rodriguez, 2019

7 of 47

Transfer Learning has Transformed Deep Learning for NLP!

*Devlin et al. 2019

8 of 47

Named Entity Recognition (NER)

  • Named Entity Recognition is the task of spotting phrases that are entities and labeling the entity type

My  name  is  Dan    Pressel  and  I  live  in  the  US
O   O     O   B-PER  I-PER    O    O  O     O   O    B-LOC
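As a concrete illustration of the tagging scheme above, here is a small, hypothetical helper (not from the slides) that decodes BIO tags into labeled spans:

```python
# Decode BIO tags into (entity_type, start, end) token spans -- illustrative helper.
def bio_to_spans(tokens, tags):
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel "O" flushes the last span
        if tag.startswith("B-") or tag == "O":
            if etype is not None:
                spans.append((etype, start, i))     # end index is exclusive
            start, etype = (i, tag[2:]) if tag.startswith("B-") else (None, None)
        # "I-" tags simply extend the current span
    return spans

tokens = "My name is Dan Pressel and I live in the US".split()
tags = "O O O B-PER I-PER O O O O O B-LOC".split()
print(bio_to_spans(tokens, tags))   # [('PER', 3, 5), ('LOC', 10, 11)]
```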

9 of 47

Named Entity Recognition (NER) before DNNs

  • Define a set of features to help identify named entities
    • Word shape
    • Gazetteers
  • Use a structured classifier that predicts the most likely label sequence through a sentence
    • MaxEnt Markov Model (MEMM)
    • Conditional Random Field (CRF)

10 of 47

Named Entity Recognition (NER) after DNNs

11 of 47

DNNs and Transfer Learning are Helping!

*Ruder, Peters, Swayamdipta & Wolf, NAACL, 2019

12 of 47

SoTA in NLP 2019

  • Many State-of-the-Art models are built using transfer learning
  • Most successful technique is generative pre-training of a language model
    • First, learn to predict words
    • Train on a large corpus of text, transfer to downstream application

13 of 47

First Some Background

  • Use of DNNs has changed the starting point for NLP problems a bit
    • Convert sparse representations to dense continuous ones
  • Often use a pre-training technique like word2vec to create distributed representations and plug those into the model

“You shall know a word by the company it keeps” - J.R. Firth, 1957

14 of 47

Word2vec objectives

  • CBOW: Given fixed surrounding window context, predict the middle word
  • Skip-gram: Given middle word, predict fixed surrounding window
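A minimal sketch of how the two objectives carve training pairs out of running text; the function name, window size, and toy sentence are all illustrative:

```python
# Build (input, target) training pairs for the two word2vec objectives.
def training_pairs(tokens, window=2, mode="skipgram"):
    pairs = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        if mode == "cbow":
            pairs.append((context, center))              # context window -> middle word
        else:
            pairs.extend((center, c) for c in context)   # middle word -> each context word
    return pairs

toks = "the quick brown fox jumps".split()
print(training_pairs(toks, mode="cbow")[0])   # (['quick', 'brown'], 'the')
```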

15 of 47

One-hot vectors

  • One-hot: a vector of length |V| (the vocabulary size) with only one entry “on” (1) and the rest “off” (0)
  • Represents the word at temporal position t in the sequence T
  • A |T|x|V| array represents a sentence

16 of 47

Lookup table-based Word embeddings

  • Multiplying a one-hot vector by the weight matrix yields a single row of that matrix
  • Equivalent to looking up the row by index (see the sketch below)
    • Efficient: the input tensor only needs to hold the indices of the “on” values
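A tiny numpy sketch (toy sizes) showing that the matrix product and the index lookup pick out the same row:

```python
import numpy as np

V, D = 8, 4                     # |V| vocabulary size, D embedding dimension (toy values)
W = np.random.randn(V, D)       # embedding weight matrix

idx = 3
one_hot = np.zeros(V)
one_hot[idx] = 1.0

assert np.allclose(one_hot @ W, W[idx])   # matrix product == table lookup by index
```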

17 of 47

Word embeddings in a classification architecture

  • Embeddings make up the lowest layer and feed into some pooling mechanism (sketched below)
    • LSTM final hidden state
    • Convolutional Net followed by Max pooling
    • Max/Mean pooling
  • Some optional stacking followed by a projection to the number of classes

*Collobert et al., 2011
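A minimal PyTorch sketch of this embed, pool, project pattern, using a ConvNet plus max-over-time pooling as the pooler; the class name and sizes are illustrative, not from the slides:

```python
import torch
import torch.nn as nn

class ConvClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden=200, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)     # lowest layer: word embeddings
        self.conv = nn.Conv1d(embed_dim, hidden, kernel_size=3, padding=1)
        self.proj = nn.Linear(hidden, num_classes)           # projection to number of classes

    def forward(self, token_ids):                            # token_ids: [batch, time]
        x = self.embed(token_ids).transpose(1, 2)            # -> [batch, embed_dim, time]
        x = torch.relu(self.conv(x))                         # -> [batch, hidden, time]
        x = x.max(dim=2).values                              # max-over-time pooling
        return self.proj(x)                                  # class logits
```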

18 of 47

Motivation for Contextualized representations

  • Pre-trained embeddings caused a breakthrough in NLP
    • E.g. Classification and NER started to rely heavily on these features
    • Linear and deep models started to use these features
  • For any surface form, there is only one word vector
    • It seems like the same surface word should have different representations when the context differs
    • How can we learn contextual word vectors?

19 of 47

Causal language modeling

I’d like an Italian sub with everything, light ________ .

  • Can you guess the next word?
    • It probably is not “toothbrush” or “sandbox”
    • Maybe “oil?”
  • Can we teach a model to predict it?
    • Intuitively, we’d like a low probability on “toothbrush” and a high probability on “oil”
  • IRL, the vocabulary is huge
    • How to handle unknown words?
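A small PyTorch sketch of what “teaching a model to predict the next word” means in practice: the model scores every vocabulary item at each position, and the target is simply the following token. The tensors here are random stand-ins for real model outputs and text:

```python
import torch
import torch.nn.functional as F

batch, time, vocab = 2, 10, 1000
logits = torch.randn(batch, time, vocab)        # scores an LM would produce per position
tokens = torch.randint(vocab, (batch, time))    # the observed text

probs = F.softmax(logits[:, -1], dim=-1)        # distribution over the *next* word

# Causal LM loss: scores at position t are trained to predict the token at t+1.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab),
                       tokens[:, 1:].reshape(-1))
```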

20 of 47

Why Language Models For Pretraining?

  • The previous slide foreshadows how difficult this task can be
  • The model is forced to learn some syntax, semantics, coreference resolution and dependency structure to try and solve it
  • Unlike for other tasks we might use, the training data is effectively unlimited

21 of 47

An LSTM language model with characters

  • Replace word lookups with character lookups over each word
  • Convolution followed by max-over-time pooling
  • One or more highway layers
  • One or more LSTM layers
  • Projection to vocabulary size
  • Softmax
  • Can train left-to-right and right-to-left and sum the losses for a biLM (sketched below)
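A condensed PyTorch sketch of the pipeline above (character embeddings, convolution with max-over-time pooling, a highway layer, LSTM layers, and a projection to the vocabulary). Layer sizes and the class name are illustrative; the biLM variant would run this in both directions and sum the losses.

```python
import torch
import torch.nn as nn

class CharWordLM(nn.Module):
    def __init__(self, char_vocab, word_vocab, char_dim=16, word_dim=128, hidden=256):
        super().__init__()
        self.char_embed = nn.Embedding(char_vocab, char_dim)           # char lookups
        self.char_conv = nn.Conv1d(char_dim, word_dim, kernel_size=3, padding=1)
        self.highway = nn.Linear(word_dim, 2 * word_dim)               # one highway layer
        self.lstm = nn.LSTM(word_dim, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, word_vocab)                      # projection to vocab

    def forward(self, char_ids):                       # [batch, time, chars_per_word]
        B, T, C = char_ids.shape
        x = self.char_embed(char_ids.view(B * T, C)).transpose(1, 2)   # [B*T, char_dim, C]
        x = torch.relu(self.char_conv(x)).max(dim=2).values            # max-over-time pooling
        t, g = self.highway(x).chunk(2, dim=-1)                        # transform / gate
        x = torch.sigmoid(g) * torch.relu(t) + (1 - torch.sigmoid(g)) * x
        x = x.view(B, T, -1)                                           # word representations
        out, _ = self.lstm(x)
        return self.proj(out)                          # word logits; softmax + loss at train time
```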

22 of 47

Ok, so we trained a language model, now what?

  • ELMo-style biLM encoder
    • Character-word embeddings at layer 1
    • biLSTM layer 2
    • biLSTM layer 3

23 of 47

Some options for downstream use

  • Transform each input into a contextualized representation
    • Freeze them or fine-tune them? Maybe slow down their gradients?
    • Pool them and fine-tune the whole model
    • ELMo objective
  • According to Peters et al., 2019, use them as features when the downstream task is very different

24 of 47

LSTM-based LMs

  • Learn different representations at different layers, just like in CV
    • As layers get higher, the representation moves from syntax toward meaning
  • There are many tasks in NLP, and each requires a different degree of knowledge
    • Implies that different layer contributions are desirable depending on the downstream task
    • Train a linear combination of layers (sketched below)
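A minimal sketch of such a learned linear combination, in the spirit of ELMo’s scalar mix; the class name and shapes are assumptions for illustration:

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    def __init__(self, num_layers):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # one learned scalar per layer
        self.gamma = nn.Parameter(torch.ones(1))               # overall scale

    def forward(self, layers):            # list of [batch, time, dim] tensors, one per layer
        w = torch.softmax(self.weights, dim=0)
        return self.gamma * sum(wi * layer for wi, layer in zip(w, layers))

# e.g. mix = ScalarMix(3); rep = mix([char_word_embeddings, lstm1_out, lstm2_out])
```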

25 of 47

But I heard Attention is all you Need??

  • Goal: eliminate LSTMs
    • Hard to parallelize due to their sequential, recurrent nature
    • Even with LSTMs, long-distance dependencies are challenging
  • But LSTMs have been shown to be useful for language, so how do we get around them?
    • Seq2seq already uses attention; can we just use that?

26 of 47

Background: Vanilla Seq2Seq

  • Translates but doesn’t perform well on long contexts

27 of 47

Background: Seq2Seq with Attention

28 of 47

Background: Seq2Seq with Attention

  • A linear combination of the input states informs each output token (sketched below)
  • Works incredibly well
    • Every seq2seq model today uses attention
    • What if we replace every LSTM with attention?
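A small sketch of (scaled) dot-product attention over encoder states, showing the “linear combination of the input” directly; all tensors here are random stand-ins:

```python
import torch

batch, src_len, tgt_len, dim = 2, 7, 5, 64
encoder_states = torch.randn(batch, src_len, dim)
decoder_states = torch.randn(batch, tgt_len, dim)

scores = decoder_states @ encoder_states.transpose(1, 2)      # [batch, tgt_len, src_len]
weights = torch.softmax(scores / dim ** 0.5, dim=-1)          # attention distribution
context = weights @ encoder_states                            # linear combination of the inputs
```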

29 of 47

The Transformer

30 of 47

Transformer innovations

  • Multi-head attention
  • Lots of layer normalization
  • Self-attention in encoder and decoder
  • Pyramidal (triangular) mask to hide future positions
  • Linear learning-rate warm-up in the training regime
  • Need some way to distinguish the same word at offsets 6 and 14
    • Use positional embeddings (sketched below)
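A short sketch of two of these pieces: the triangular mask over future positions and positional embeddings added to the word embeddings. The learned positional embeddings here are one illustrative option; the original paper used fixed sinusoidal encodings.

```python
import torch

T, dim, vocab = 6, 64, 1000
# Position t may only attend to positions <= t; False entries are set to -inf before softmax.
mask = torch.tril(torch.ones(T, T)).bool()

word_embed = torch.nn.Embedding(vocab, dim)
pos_embed = torch.nn.Embedding(T, dim)          # distinguishes the same word at offsets 6 vs 14

tokens = torch.randint(vocab, (1, T))
positions = torch.arange(T).unsqueeze(0)
x = word_embed(tokens) + pos_embed(positions)   # input to the first Transformer layer
```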

31 of 47

GPT: Transformers are cool! Let's use them for Pre-training!

32 of 47

Pre-training Architectures: GPT

  • Method
    • Train a causal Transformer encoder (a left-to-right LM)
    • For downstream tasks, remove the LM head and replace it with a downstream head (sketched below)
    • Use BPE instead of character-level modeling
  • Strengths
    • Can parallelize; a good fit for GPU hardware
    • High capacity pre-trained LM yields strong results on downstream tasks
    • BPE is much faster than character-level modeling
    • Trained on a much larger corpus than ELMo's, one containing long-distance dependencies (LDD)
    • Large context window (512 BPE tokens)
  • Weaknesses
    • BPE is not ideal for tasks that need morphological features
    • Unidirectional LM
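The “remove the LM head, attach a downstream head” step above might look like this minimal sketch; `pretrained_transformer`, the class name, and classifying from the last position are illustrative assumptions, not GPT’s exact implementation:

```python
import torch.nn as nn

class DownstreamModel(nn.Module):
    def __init__(self, pretrained_transformer, hidden_dim, num_classes):
        super().__init__()
        self.body = pretrained_transformer              # reused, pre-trained weights
        self.head = nn.Linear(hidden_dim, num_classes)  # new, randomly initialized task head

    def forward(self, token_ids):
        hidden = self.body(token_ids)                   # assumed shape [batch, time, hidden_dim]
        return self.head(hidden[:, -1])                 # e.g. classify from the last position
```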

33 of 47

BERT: GPT is cool but BiLM is important!

34 of 47

Pre-training Architectures: BERT

  • Method
    • Train 12-24 Layer Transformer with Next Sentence Prediction (NSP) Task and Masked Language Model (MLM) Task
    • For downstream tasks, remove the LM head and replace it with a downstream head
    • Use WordPiece subwords instead of character-level modeling
  • Strengths
    • Optimized for downstream tasks, not LM
      • SoTA on many tasks; researchers are still discovering new strengths
    • Subword modeling is much faster than character-level modeling
    • Trained on a massive corpus
  • Weaknesses
    • Subwords are not ideal for tasks that need morphological features
    • Cannot easily compare LM performance
    • MLM objective is slow to train (only the masked positions are predicted each step; sketched below)
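A simplified sketch of the MLM corruption step; the exact masking rate and the keep/randomize refinements from the paper are omitted, and the names are illustrative:

```python
import torch

def mask_tokens(tokens, mask_id, mask_prob=0.15):
    labels = tokens.clone()
    masked = torch.rand(tokens.shape) < mask_prob      # choose positions to corrupt
    corrupted = tokens.masked_fill(masked, mask_id)    # the model sees [MASK] here
    labels[~masked] = -100                             # ignore un-masked positions in the loss
                                                       # (-100 is cross_entropy's default ignore_index)
    return corrupted, labels
```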

35 of 47

GPT-2: No seriously, GPT IS cool

  • Scale up GPT
    • Massive context (1024)
    • Larger vocab (~50k)
    • Moves layer norm around
    • Changes initialization
  • Zero-shot gets SoTA on well-studied datasets!
  • Generates long, relatively coherent statements

36 of 47

GPT-2: Sample

37 of 47

XLM: Multilingual Pre-Training

*Lample et al., 2019

38 of 47

What are we learning though?

  • How can we understand what these models are doing?
    • What LM objective will help me for downstream task X?
  • Look at the Neurons
  • Probe
  • Attention Weights
    • We will cover this in the tutorial!

39 of 47

Looking at Neuron Activations?

*Karpathy et al., 2016

40 of 47

Probing

  • Fix (freeze) our contextual representations and train a single layer on a downstream task (sketched below)
  • YMMV: Does not perform well on NER, grammatical error detection, and conjunct identification (Liu et al., 2019)

*Liu et al., 2019
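A minimal sketch of the probing setup: the encoder’s parameters are frozen and only a single linear layer is trained. `encoder` and the hyperparameters are illustrative placeholders:

```python
import torch
import torch.nn as nn

def build_probe(encoder, hidden_dim, num_labels):
    for p in encoder.parameters():
        p.requires_grad = False                 # contextual representations stay fixed
    probe = nn.Linear(hidden_dim, num_labels)   # the only trainable parameters
    optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
    return probe, optimizer
```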

41 of 47

Attention Weight Visualization

  • For Transformers, we can take a look at the attention heads individually
    • These weights do inform us of what the model is learning!
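A small sketch of what such a visualization looks like in code; the attention weights here are random stand-ins for whatever a real model exposes:

```python
import torch
import matplotlib.pyplot as plt

# Plot one attention head as a heatmap over token pairs (weights are fake here).
tokens = ["the", "dog", "chased", "the", "cat"]
attn = torch.softmax(torch.randn(8, len(tokens), len(tokens)), dim=-1)  # [heads, T, T]

plt.imshow(attn[0].numpy(), cmap="viridis")          # inspect head 0
plt.xticks(range(len(tokens)), tokens, rotation=90)  # keys (attended-to tokens)
plt.yticks(range(len(tokens)), tokens)               # queries (attending tokens)
plt.colorbar()
plt.show()
```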

42 of 47

Catastrophic Forgetting

  • Problem: as we learn the in-domain (downstream) task, we are “forgetting” what we learned on the general-purpose one
    • Most applications may not care
    • If we do care, use multi-task learning to prevent forgetting
      • How to select?

43 of 47

Moving Forward

  • Exciting time in NLP!
  • Generative Pre-Training and Transfer Learning are powerful tools that have transformed NLP
    • They require a significant amount of time to train
    • The pre-training objectives are important to downstream tasks
  • Transformer-type architectures are trending up
    • LSTM-based models like ELMo are less popular
    • Transformers are computationally efficient to train
    • Slightly easier to interpret than LSTMs
  • We are only scratching the surface of downstream performance but…
    • these models already show significant power on benchmarks like GLUE

44 of 47

References

45 of 47

References

46 of 47

References

47 of 47

References