1 of 47

Transfer Learning in NLP

Daniel Pressel

Interactions LLC

International Summer School on Deep Learning 2019

2 of 47

Deep Learning has transformed NLP

  • Deep Learning successes in NLP (a non-exhaustive list):
    • Named Entity Recognition
    • Part-of-speech Tagging
    • Machine Translation
    • Parsing
    • Document Classification
    • Question Answering
    • Coreference Resolution
    • Recognizing Textual Entailment

3 of 47

Dependency Parsing

  • Parse a sentence into a graph of arcs between dependents and their heads
  • Transition Parsing (e.g. Arc Standard):
    • Define a set of valid moves that, applied in sequence, yield a parse graph
    • Start with an initial “configuration”
    • At each step, ask a “guide” to predict a transition, until the whole sentence is covered (see the sketch below)
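As a rough illustration, here is a minimal, unlabeled arc-standard sketch in Python. The function names are illustrative, and the `guide` argument stands in for whatever classifier predicts the next transition.

```python
# Minimal, unlabeled arc-standard sketch; the "guide" is a stub for a trained classifier.
def shift(stack, buffer, arcs):
    stack.append(buffer.pop(0))          # move the next word onto the stack

def left_arc(stack, buffer, arcs):
    dep = stack.pop(-2)                  # second item becomes a dependent...
    arcs.append((stack[-1], dep))        # ...of the item on top: (head, dependent)

def right_arc(stack, buffer, arcs):
    dep = stack.pop()                    # top item becomes a dependent...
    arcs.append((stack[-1], dep))        # ...of the item below it

def parse(words, guide):
    stack, buffer, arcs = [0], list(range(1, len(words) + 1)), []   # 0 is ROOT
    moves = {"SHIFT": shift, "LEFT-ARC": left_arc, "RIGHT-ARC": right_arc}
    while buffer or len(stack) > 1:
        action = guide(stack, buffer, arcs)   # the "guide" predicts the next transition
        moves[action](stack, buffer, arcs)
    return arcs                               # the parse as (head, dependent) arcs
```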

4 of 47

Arc-Standard Parsing

5 of 47

DNNs improve transition parsing!

  • Prior to 2014
    • Use SVM or Perceptron as guide
  • Chen and Manning 2014:
    • MLP guide
  • Kiperwasser and Goldberg 2016
    • BiLSTM + MLP (pictured)
  • Stack pointer network from Ma et al., 2018
  • Fernandez-Gonzalez and Gomez-Rodriguez, 2019
    • pointer network
    • eliminate stack and buffer
    • only 1 transition type!

6 of 47

The Changing Landscape of Dependency Parsing

Before Chen and Manning 2014*

After Chen and Manning 2014**

* Chen and Manning, 2014

**Fernandez-Gonzalez and Gomez-Rodriguez, 2019

7 of 47

Transfer Learning has Transformed Deep Learning for NLP!

*Devlin et al. 2019

8 of 47

Named Entity Recognition (NER)

  • Named Entity Recognition is the task of spotting phrases that are entities and labeling the entity type

My  name  is  Dan    Pressel  and  I  live  in  the  US
O   O     O   B-PER  I-PER    O    O  O     O   O    B-LOC
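As a concrete illustration of the tagging scheme above, here is a small, hypothetical helper (not from the slides) that decodes BIO tags into labeled spans:

```python
# Decode BIO tags into (entity_type, start, end) token spans -- illustrative helper.
def bio_to_spans(tokens, tags):
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel "O" flushes the last span
        if tag.startswith("B-") or tag == "O":
            if etype is not None:
                spans.append((etype, start, i))     # end index is exclusive
            start, etype = (i, tag[2:]) if tag.startswith("B-") else (None, None)
        # "I-" tags simply extend the current span
    return spans

tokens = "My name is Dan Pressel and I live in the US".split()
tags = "O O O B-PER I-PER O O O O O B-LOC".split()
print(bio_to_spans(tokens, tags))   # [('PER', 3, 5), ('LOC', 10, 11)]
```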

9 of 47

Named Entity Recognition (NER) before DNNs

  • Define a set of features to help identify named entities
    • Word shape
    • Gazetteers
  • Use a structured classifier that predicts the most likely label sequence through a sentence
    • MaxEnt Markov Model (MEMM)
    • Conditional Random Field (CRF)

10 of 47

Named Entity Recognition (NER) after DNNs

11 of 47

DNNs and Transfer Learning are Helping!

*Ruder, Peters, Swayamdipta & Wolf, NAACL, 2019

12 of 47

SoTA in NLP 2019

  • Many State-of-the-Art models are built using transfer learning
  • Most successful technique is generative pre-training of a language model
    • First, learn to predict words
    • Train on a large corpus of text, transfer to downstream application

13 of 47

First Some Background

  • Use of DNNs has changed the starting point for NLP problems a bit
    • Convert sparse representations to dense continuous ones
  • Often use a pre-training technique like word2vec to create distributed representations and plug those into the model

“You shall know a word by the company it keeps” - J.R. Firth, 1957

14 of 47

Word2vec objectives

  • CBOW: Given fixed surrounding window context, predict the middle word
  • Skip-gram: Given middle word, predict fixed surrounding window
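A minimal sketch of how the two objectives carve training pairs out of running text; the function name, window size, and toy sentence are all illustrative:

```python
# Build (input, target) training pairs for the two word2vec objectives.
def training_pairs(tokens, window=2, mode="skipgram"):
    pairs = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        if mode == "cbow":
            pairs.append((context, center))              # context window -> middle word
        else:
            pairs.extend((center, c) for c in context)   # middle word -> each context word
    return pairs

toks = "the quick brown fox jumps".split()
print(training_pairs(toks, mode="cbow")[0])   # (['quick', 'brown'], 'the')
```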

15 of 47

One-hot vectors

  • One-hot: a vector of length |V| (the vocabulary size) with only one entry “on” (1) and the rest “off” (0)
  • Represents the word at temporal position t in the sequence T
  • A |T|x|V| array represents a sentence

16 of 47

Lookup table-based Word embeddings

  • Multiplying a one-hot vector by the weight matrix yields a single row of that matrix
  • Equivalent to looking up the row by index (see the sketch below)
    • Efficient: the input tensor only needs to hold the indices of the “on” values
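A tiny numpy sketch (toy sizes) showing that the matrix product and the index lookup pick out the same row:

```python
import numpy as np

V, D = 8, 4                     # |V| vocabulary size, D embedding dimension (toy values)
W = np.random.randn(V, D)       # embedding weight matrix

idx = 3
one_hot = np.zeros(V)
one_hot[idx] = 1.0

assert np.allclose(one_hot @ W, W[idx])   # matrix product == table lookup by index
```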

17 of 47

Word embeddings in a classification architecture

  • Embeddings make up the lowest layer and feed into some pooling mechanism (sketched below)
    • LSTM final hidden state
    • Convolutional Net followed by Max pooling
    • Max/Mean pooling
  • Some optional stacking followed by a projection to the number of classes

*Collobert et al., 2011
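A minimal PyTorch sketch of this embed, pool, project pattern, using a ConvNet plus max-over-time pooling as the pooler; the class name and sizes are illustrative, not from the slides:

```python
import torch
import torch.nn as nn

class ConvClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden=200, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)     # lowest layer: word embeddings
        self.conv = nn.Conv1d(embed_dim, hidden, kernel_size=3, padding=1)
        self.proj = nn.Linear(hidden, num_classes)           # projection to number of classes

    def forward(self, token_ids):                            # token_ids: [batch, time]
        x = self.embed(token_ids).transpose(1, 2)            # -> [batch, embed_dim, time]
        x = torch.relu(self.conv(x))                         # -> [batch, hidden, time]
        x = x.max(dim=2).values                              # max-over-time pooling
        return self.proj(x)                                  # class logits
```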

18 of 47

Motivation for Contextualized representations

  • Pre-trained embeddings caused a breakthrough in NLP
    • E.g. Classification and NER started to rely heavily on these features
    • Linear and deep models started to use these features
  • For any surface form, there is only one word vector
    • It seems like the same surface word should have different representations when the context differs
    • How can we learn contextual word vectors?

19 of 47

Causal language modeling

I’d like an Italian sub with everything, light ________ .

  • Can you guess the next word?
    • It probably is not “toothbrush” or “sandbox”
    • Maybe “oil?”
  • Can we teach a model to predict it?
    • Intuitively, we’d like a low probability on “toothbrush” and a high probability on “oil”
  • IRL, the vocabulary is huge
    • How to handle unknown words?
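A small PyTorch sketch of what “teaching a model to predict the next word” means in practice: the model scores every vocabulary item at each position, and the target is simply the following token. The tensors here are random stand-ins for real model outputs and text:

```python
import torch
import torch.nn.functional as F

batch, time, vocab = 2, 10, 1000
logits = torch.randn(batch, time, vocab)        # scores an LM would produce per position
tokens = torch.randint(vocab, (batch, time))    # the observed text

probs = F.softmax(logits[:, -1], dim=-1)        # distribution over the *next* word

# Causal LM loss: scores at position t are trained to predict the token at t+1.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab),
                       tokens[:, 1:].reshape(-1))
```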

20 of 47

Why Language Models For Pretraining?

  • The previous slide foreshadows how difficult this task can be
  • The model is forced to learn some syntax, semantics, coreference resolution and dependency structure to try and solve it
  • Unlike for other tasks we might use, the training data is effectively unlimited

21 of 47

An LSTM language model with characters

  • Replace word lookups with character lookups over each word
  • Convolution followed by max-over-time pooling
  • One or more highway layers
  • One or more LSTM layers
  • Projection to vocabulary size
  • Softmax
  • Can train left-to-right and right-to-left and sum the losses for a biLM (sketched below)
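A condensed PyTorch sketch of the pipeline above (character embeddings, convolution with max-over-time pooling, a highway layer, LSTM layers, and a projection to the vocabulary). Layer sizes and the class name are illustrative; the biLM variant would run this in both directions and sum the losses.

```python
import torch
import torch.nn as nn

class CharWordLM(nn.Module):
    def __init__(self, char_vocab, word_vocab, char_dim=16, word_dim=128, hidden=256):
        super().__init__()
        self.char_embed = nn.Embedding(char_vocab, char_dim)           # char lookups
        self.char_conv = nn.Conv1d(char_dim, word_dim, kernel_size=3, padding=1)
        self.highway = nn.Linear(word_dim, 2 * word_dim)               # one highway layer
        self.lstm = nn.LSTM(word_dim, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, word_vocab)                      # projection to vocab

    def forward(self, char_ids):                       # [batch, time, chars_per_word]
        B, T, C = char_ids.shape
        x = self.char_embed(char_ids.view(B * T, C)).transpose(1, 2)   # [B*T, char_dim, C]
        x = torch.relu(self.char_conv(x)).max(dim=2).values            # max-over-time pooling
        t, g = self.highway(x).chunk(2, dim=-1)                        # transform / gate
        x = torch.sigmoid(g) * torch.relu(t) + (1 - torch.sigmoid(g)) * x
        x = x.view(B, T, -1)                                           # word representations
        out, _ = self.lstm(x)
        return self.proj(out)                          # word logits; softmax + loss at train time
```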

22 of 47

Ok, so we trained a language model, now what?

  • ELMo-style biLM encoder
    • Character-word embeddings at layer 1
    • biLSTM layer 2
    • biLSTM layer 3

23 of 47

Some options for downstream use

  • Transform each input into a contextualized representation
    • Freeze them or fine-tune them? Maybe slow down their gradients?
    • Pool them and fine-tune the whole model
    • ELMo objective
  • According to Peters et al., 2019, use them as features when the downstream task is very different

24 of 47

LSTM-based LMs

  • Learn different representations at different layers, just like in CV
    • As layers get higher, the representation moves from syntax toward meaning
  • There are many tasks in NLP, and each requires a different degree of knowledge
    • Implies that different layer contributions are desirable depending on the downstream task
    • Train a linear combination of layers (sketched below)
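A minimal sketch of such a learned linear combination, in the spirit of ELMo’s scalar mix; the class name and shapes are assumptions for illustration:

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    def __init__(self, num_layers):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # one learned scalar per layer
        self.gamma = nn.Parameter(torch.ones(1))               # overall scale

    def forward(self, layers):            # list of [batch, time, dim] tensors, one per layer
        w = torch.softmax(self.weights, dim=0)
        return self.gamma * sum(wi * layer for wi, layer in zip(w, layers))

# e.g. mix = ScalarMix(3); rep = mix([char_word_embeddings, lstm1_out, lstm2_out])
```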

25 of 47

But I heard Attention is all you Need??

  • Goal: eliminate LSTMs
    • Hard to parallelize due to their sequential, recurrent nature
    • Even with LSTMs, long-distance dependencies are challenging
  • But LSTMs have been shown to be useful for language, so how do we get around them?
    • Seq2seq already uses attention; can we just use that?

26 of 47

Background: Vanilla Seq2Seq

  • Translates but doesn’t perform well on long contexts

27 of 47

Background: Seq2Seq with Attention

28 of 47

Background: Seq2Seq with Attention

  • A linear combination of the input states informs each output token (sketched below)
  • Works incredibly well
    • Every seq2seq model today uses attention
    • What if we replace every LSTM with attention?
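A small sketch of (scaled) dot-product attention over encoder states, showing the “linear combination of the input” directly; all tensors here are random stand-ins:

```python
import torch

batch, src_len, tgt_len, dim = 2, 7, 5, 64
encoder_states = torch.randn(batch, src_len, dim)
decoder_states = torch.randn(batch, tgt_len, dim)

scores = decoder_states @ encoder_states.transpose(1, 2)      # [batch, tgt_len, src_len]
weights = torch.softmax(scores / dim ** 0.5, dim=-1)          # attention distribution
context = weights @ encoder_states                            # linear combination of the inputs
```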

29 of 47

The Transformer

30 of 47

Transformer innovations

  • Multi-head attention
  • Lots of layer normalization
  • Self-attention in encoder and decoder
  • Pyramidal (triangular) mask to hide future positions
  • Linear learning-rate warm-up in the training regime
  • Need some way to distinguish the same word at offsets 6 and 14
    • Use positional embeddings (sketched below)
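A short sketch of two of these pieces: the triangular mask over future positions and positional embeddings added to the word embeddings. The learned positional embeddings here are one illustrative option; the original paper used fixed sinusoidal encodings.

```python
import torch

T, dim, vocab = 6, 64, 1000
# Position t may only attend to positions <= t; False entries are set to -inf before softmax.
mask = torch.tril(torch.ones(T, T)).bool()

word_embed = torch.nn.Embedding(vocab, dim)
pos_embed = torch.nn.Embedding(T, dim)          # distinguishes the same word at offsets 6 vs 14

tokens = torch.randint(vocab, (1, T))
positions = torch.arange(T).unsqueeze(0)
x = word_embed(tokens) + pos_embed(positions)   # input to the first Transformer layer
```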

31 of 47

GPT: Transformers are cool! Let's use them for Pre-training!

32 of 47

Pre-training Architectures: GPT

  • Method
    • Train a causal Transformer encoder (a left-to-right LM)
    • For downstream tasks, remove the LM head and replace it with a downstream head (sketched below)
    • Use BPE instead of character-level modeling
  • Strengths
    • Can parallelize; a good fit for GPU hardware
    • High capacity pre-trained LM yields strong results on downstream tasks
    • BPE is much faster than character-level modeling
    • Trained on a much larger corpus than ELMo's, one containing long-distance dependencies (LDD)
    • Large context window (512 BPE tokens)
  • Weaknesses
    • BPE is not ideal for tasks that need morphological features
    • Unidirectional LM
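The “remove the LM head, attach a downstream head” step above might look like this minimal sketch; `pretrained_transformer`, the class name, and classifying from the last position are illustrative assumptions, not GPT’s exact implementation:

```python
import torch.nn as nn

class DownstreamModel(nn.Module):
    def __init__(self, pretrained_transformer, hidden_dim, num_classes):
        super().__init__()
        self.body = pretrained_transformer              # reused, pre-trained weights
        self.head = nn.Linear(hidden_dim, num_classes)  # new, randomly initialized task head

    def forward(self, token_ids):
        hidden = self.body(token_ids)                   # assumed shape [batch, time, hidden_dim]
        return self.head(hidden[:, -1])                 # e.g. classify from the last position
```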

33 of 47

BERT: GPT is cool but BiLM is important!

34 of 47

Pre-training Architectures: BERT

  • Method
    • Train 12-24 Layer Transformer with Next Sentence Prediction (NSP) Task and Masked Language Model (MLM) Task
    • For downstream tasks, remove the LM head and replace it with a downstream head
    • Use WordPiece subwords instead of character-level modeling
  • Strengths
    • Optimized for downstream tasks, not LM
      • SoTA on many tasks; researchers are still discovering new strengths
    • Subword modeling is much faster than character-level modeling
    • Trained on a massive corpus
  • Weaknesses
    • Subwords are not ideal for tasks that need morphological features
    • Cannot easily compare LM performance
    • MLM objective is slow to train (only the masked positions are predicted each step; sketched below)
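A simplified sketch of the MLM corruption step; the exact masking rate and the keep/randomize refinements from the paper are omitted, and the names are illustrative:

```python
import torch

def mask_tokens(tokens, mask_id, mask_prob=0.15):
    labels = tokens.clone()
    masked = torch.rand(tokens.shape) < mask_prob      # choose positions to corrupt
    corrupted = tokens.masked_fill(masked, mask_id)    # the model sees [MASK] here
    labels[~masked] = -100                             # ignore un-masked positions in the loss
                                                       # (-100 is cross_entropy's default ignore_index)
    return corrupted, labels
```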

35 of 47

GPT-2: No seriously, GPT IS cool

  • Scale up GPT
    • Massive context (1024)
    • Larger vocab (~50k)
    • Moves layer norm around
    • Changes initialization
  • Zero-shot gets SoTA on well-studied datasets!
  • Generates long, relatively coherent statements

36 of 47

GPT-2: Sample

37 of 47

XLM: Multilingual Pre-Training

*Lample et al., 2019

38 of 47

What are we learning though?

  • How can we understand what these models are doing?
    • What LM objective will help me for downstream task X?
  • Look at the Neurons
  • Probe
  • Attention Weights
    • We will cover this in the tutorial!

39 of 47

Looking at Neuron Activations?

*Karpathy et al., 2016

40 of 47

Probing

  • Fix (freeze) our contextual representations and train a single layer on a downstream task (sketched below)
  • YMMV: Does not perform well on NER, grammatical error detection, and conjunct identification (Liu et al., 2019)

*Liu et al., 2019
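A minimal sketch of the probing setup: the encoder’s parameters are frozen and only a single linear layer is trained. `encoder` and the hyperparameters are illustrative placeholders:

```python
import torch
import torch.nn as nn

def build_probe(encoder, hidden_dim, num_labels):
    for p in encoder.parameters():
        p.requires_grad = False                 # contextual representations stay fixed
    probe = nn.Linear(hidden_dim, num_labels)   # the only trainable parameters
    optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
    return probe, optimizer
```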

41 of 47

Attention Weight Visualization

  • For Transformers, we can take a look at the attention heads individually
    • These weights do inform us of what the model is learning!
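A small sketch of what such a visualization looks like in code; the attention weights here are random stand-ins for whatever a real model exposes:

```python
import torch
import matplotlib.pyplot as plt

# Plot one attention head as a heatmap over token pairs (weights are fake here).
tokens = ["the", "dog", "chased", "the", "cat"]
attn = torch.softmax(torch.randn(8, len(tokens), len(tokens)), dim=-1)  # [heads, T, T]

plt.imshow(attn[0].numpy(), cmap="viridis")          # inspect head 0
plt.xticks(range(len(tokens)), tokens, rotation=90)  # keys (attended-to tokens)
plt.yticks(range(len(tokens)), tokens)               # queries (attending tokens)
plt.colorbar()
plt.show()
```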

42 of 47

Catastrophic Forgetting

  • Problem: as we learn the in-domain (downstream) task, we are “forgetting” what we learned on the general-purpose one
    • Most applications may not care
    • If we do care, use multi-task learning to prevent forgetting
      • How to select?

43 of 47

Moving Forward

  • Exciting time in NLP!
  • Generative Pre-Training and Transfer Learning are powerful tools that have transformed NLP
    • They require a significant amount of time to train
    • The pre-training objectives are important to downstream tasks
  • Transformer-type architectures are trending up
    • LSTM-based models like ELMo are less popular
    • Transformers are computationally efficient to train
    • Slightly easier to interpret than LSTMs
  • We are only scratching the surface of downstream performance but…
    • these models already show significant power on benchmarks like GLUE

44 of 47

References

45 of 47

References

46 of 47

References

47 of 47

References