1 of 238

Transfer Learning in

Natural Language Processing

June 2, 2019

NAACL-HLT 2019

1

Sebastian Ruder

Matthew Peters

Swabha

Swayamdipta

Thomas Wolf

2 of 238

Transfer Learning in Natural Language Processing

Transfer Learning in NLP

2

Follow along with the tutorial:

Questions:

  • Twitter: #NAACLTransfer during the tutorial
  • Whova: “Questions for the tutorial on Transfer Learning in NLP” topic
  • Ask us during the break or after the tutorial

3 of 238

What is transfer learning?

3

4 of 238

Why transfer learning in NLP?

  • Many NLP tasks share common knowledge about language (e.g. linguistic representations, structural similarities)
  • Tasks can inform each other—e.g. syntax and semantics
  • Annotated data is rare, so make use of as much supervision as is available.

  • Empirically, transfer learning has resulted in SOTA for many supervised NLP tasks (e.g. classification, information extraction, Q&A, etc).

4

5 of 238

Why transfer learning in NLP? (Empirically)

5

Performance on Named Entity Recognition (NER) on CoNLL-2003 (English) over time

6 of 238

Types of transfer learning in NLP

6

We will focus on this

7 of 238

What this tutorial is about and what it’s not about

  • Goal: provide a broad overview of transfer methods in NLP, focusing on the most empirically successful methods as of today (mid-2019)
  • Provide practical, hands-on advice → by the end of the tutorial, everyone will be able to apply recent advances to a text classification task

  • What this is not: Comprehensive (it’s impossible to cover all related papers in one tutorial!)
  • (Bender Rule: This tutorial mostly covers work done in English; extensibility to other languages depends on the availability of data and resources.)

7

8 of 238

Agenda

8

9 of 238

  1. Introduction

9

10 of 238

Sequential transfer learning

Learn on one task / dataset, then transfer to another task / dataset

10

word2vec

GloVe

skip-thought

InferSent

ELMo

ULMFiT

GPT

BERT

classification

sequence labeling

Q&A

....

Pretraining

Adaptation

11 of 238

Pretraining tasks and datasets

  • Unlabeled data and self-supervision

    • Easy to gather very large corpora: Wikipedia, news, web crawl, social media, etc.
    • Training takes advantage of the distributional hypothesis: “You shall know a word by the company it keeps” (Firth, 1957), often formalized as training some variant of language model
    • Focus on efficient algorithms to make use of plentiful data

  • Supervised pretraining

    • Very common in vision, less so in NLP due to the lack of large supervised datasets
    • Machine translation
    • NLI for sentence representations
    • Task-specific—transfer from one Q&A dataset to another

11

12 of 238

Target tasks and datasets

Target tasks are typically supervised and span a range of common NLP tasks:

  • Sentence or document classification (e.g. sentiment)
  • Sentence pair classification (e.g. NLI, paraphrase)
  • Word level (e.g. sequence labeling, extractive Q&A)
  • Structured prediction (e.g. parsing)
  • Generation (e.g. dialogue, summarization)

12

13 of 238

Concrete example—word vectors

Word embedding methods (e.g. word2vec) learn one vector per word:

13

cat = [0.1, -0.2, 0.4, …]

dog = [0.2, -0.1, 0.7, …]

14 of 238

Concrete example—word vectors

Word embedding methods (e.g. word2vec) learn one vector per word:

14

cat = [0.1, -0.2, 0.4, …]

dog = [0.2, -0.1, 0.7, …]

PRP VBP PRP NN CC NN .

I love my cat and dog .

15 of 238

Concrete example—word vectors

Word embedding methods (e.g. word2vec) learn one vector per word:

15

cat = [0.1, -0.2, 0.4, …]

dog = [0.2, -0.1, 0.7, …]

PRP VBP PRP NN CC NN .

I love my cat and dog .

I love my cat and dog . → "positive"

16 of 238

Major Themes

16

17 of 238

Major themes: From words to words-in-context

Word vectors: cats = [0.2, -0.3, …], dogs = [0.4, -0.5, …]

Sentence / doc vectors: “It’s raining cats and dogs.” → [0.8, 0.9, …]; “We have two cats.” → [-1.2, 0.0, …]

Word-in-context vectors (a different vector for the same word in each context): “We have two cats.” → [1.2, -0.3, …]; “It’s raining cats and dogs.” → [-0.4, 0.9, …]

17

18 of 238

Major themes: LM pretraining

  • Many successful pretraining approaches are based on language modeling
  • Informally, an LM learns Pϴ(text) or Pϴ(text | some other text)
  • Doesn’t require human annotation
  • Many languages have enough text to learn a high-capacity model
  • Versatile—can learn both sentence and word representations with a variety of objective functions

18

19 of 238

Major themes: From shallow to deep

From shallow pretrained representations (1 layer) to deep ones (24 layers)

19

20 of 238

Major themes: pretraining vs target task

The choice of pretraining and target tasks is coupled. In general, similar pretraining and target tasks → best results.

  • Sentence / document representations are not useful for word-level predictions
  • Word vectors can be pooled across contexts, but are often outperformed by other methods
  • For contextual word vectors, bidirectional context is important

20

21 of 238

Agenda

21

22 of 238

2. Pretraining

22

Image credit: Creative Stall

23 of 238

Overview

  • Language model pretraining
  • Word vectors
  • Sentence and document vectors
  • Contextual word vectors
  • Interesting properties of pretraining
  • Cross-lingual pretraining

23

24 of 238

Word Type Representation

[Figure: pretraining objectives illustrated on an example sentence: predicting a word from its context window (word2vec, Mikolov et al., 2013), masked LM (“We have a MASK and three dogs”), next-word LM (“We have a ???”), and predicting a neighbouring sentence (Skip-Thought, Kiros et al., 2015)]

24

25 of 238

Word vectors

25

26 of 238

Why embed words?

  • Embeddings are themselves parameters—can be learned
  • Sharing representations across tasks
  • Lower dimensional space
    • Better for computation—difficult to handle sparse vectors.

26

27 of 238

Word Type Representation

Unsupervised pretraining : Pre-Neural

Latent Semantic Analysis (LSA)—SVD of the term-document matrix (Deerwester et al., 1990)

Latent Dirichlet Allocation (LDA)—documents are mixtures of topics and topics are mixtures of words (Blei et al., 2003)

Brown clusters—hard hierarchical clustering based on n-gram LMs (Brown et al., 1992)

27

28 of 238

Word Type Representation

n-gram neural language model (Bengio et al. 2003)

Supervised multitask word embeddings (Collobert and Weston, 2008)

Word vector pretraining

28

29 of 238

word2vec (Mikolov et al., 2013)

Efficient algorithm + large-scale training → high-quality word vectors

29

30 of 238

Sentence and document vectors

30

31 of 238

Doc2vec

Paragraph vector

Unsupervised paragraph embeddings (Le & Mikolov, 2014)

SOTA classification (IMDB, SST)

31

32 of 238

Skip-Thought Vectors

Predict previous / next sentence with seq2seq model (Kiros et al., 2015)

Hidden state of encoder transfers to sentence tasks (classification, semantic similarity)

32

33 of 238

Autoencoder pretraining

Dai & Le (2015): Pretrain a sequence autoencoder (SA) and a generative LM

SOTA classification (IMDB)

33

34 of 238

Supervised sentence embeddings

Also possible to train sentence embeddings with supervised objective

  • Paragram-phrase: uses paraphrase database for supervision, best for paraphrase and semantic similarity (Wieting et al. 2016)
  • InferSent: bi-LSTM trained on SNLI + MNLI (Conneau et al. 2017)
  • GenSen: multitask training (skip-thought, machine translation, NLI, parsing) (Subramanian et al. 2018)

34

35 of 238

Contextual word vectors

35

36 of 238

Contextual word vectors - Motivation

Word vectors compress all contexts into a single vector

Nearest neighbor GloVe vectors to “play”

36

[Figure: the nearest neighbours of “play” mix several senses and parts of speech (VERB / NOUN / ADJ): playing, played, plays, Play, game, games, players, football, multiplayer]

37 of 238

Contextual word vectors - Key Idea

Instead of learning one vector per word, learn a vector that depends on context

Many approaches based on language models

37

f(play | The kids play a game in the park.)  !=  f(play | The Broadway play premiered yesterday.)

38 of 238

context2vec

Use a bidirectional LSTM and a cloze prediction objective (a 1-layer masked LM)

Learn representations for both words and contexts (minus the word)

Applications: sentence completion, lexical substitution, WSD

38

39 of 238

TagLM

Pretrain two LMs (forward and backward) and add them to a sequence tagger.

SOTA NER and chunking results

39

40 of 238

Unsupervised Pretraining for Seq2Seq

Pretrain encoder and decoder with LMs (everything shaded is pretrained).

Large boost for MT.

40

41 of 238

CoVe

Pretrain a bidirectional encoder with MT supervision, extract LSTM states.

Adding CoVe together with GloVe gives improvements for classification, NLI, Q&A.

41

42 of 238

ELMo

Pretrain a deep bidirectional LM, extract contextual word vectors as a learned linear combination of hidden states.

SOTA for 6 diverse tasks.

42

43 of 238

ULMFiT

Pretrain an AWD-LSTM LM, fine-tune the LM in two stages with different adaptation techniques.

SOTA for six classification datasets.

43

44 of 238

GPT

Pretrain a large 12-layer left-to-right Transformer, fine-tune for sentence, sentence-pair and multiple-choice questions.

SOTA results for 9 tasks.

44

45 of 238

BERT

45

BERT pretrains both sentence and contextual word representations, using masked LM and next sentence prediction.

BERT-large has 340M parameters, 24 layers!

46 of 238

BERT

46

SOTA GLUE benchmark results (sentence pair classification).

47 of 238

BERT

47

SOTA SQuAD v1.1 (and v2.0) Q&A

48 of 238

Other pretraining objectives

48

  • Contextual string representations (Akbik et al., COLING 2018)—SOTA NER results
  • Cross-view training (Clark et al. EMNLP 2018)—improve supervised tasks with unlabeled data
  • Cloze-driven pretraining (Baevski et al., 2019)—SOTA NER and constituency parsing

49 of 238

Why does language modeling work so well?

49

  • Language modeling is a very difficult task, even for humans.
  • Language models are expected to compress any possible context into a vector that generalizes over possible completions.
    • “They walked down the street to ???”
  • To have any chance at solving this task, a model is forced to learn syntax, semantics, encode facts about the world, etc.
  • Given enough data, a huge model, and enough compute, can do a reasonable job!
  • Empirically works better than translation, autoencoding: “Language Modeling Teaches You More Syntax than Translation Does” (Zhang et al. 2018)

50 of 238

Sample efficiency

50

51 of 238

Pretraining reduces need for annotated data

51

52 of 238

Pretraining reduces need for annotated data

52

53 of 238

Pretraining reduces need for annotated data

53

54 of 238

Scaling up pretraining

54

55 of 238

More data → better word vectors

(Pennington et al 2014)

Scaling up pretraining

55

56 of 238

Pretrained Language Models: More Data

Scaling up pretraining

56

57 of 238

Bigger model → better results

(Devlin et al 2019)

Scaling up pretraining

57

58 of 238

Cross-lingual pretraining

58

59 of 238

Cross-lingual pretraining

59

60 of 238

Cross-lingual Polyglot Pretraining

Key idea: Share vocabulary and representations across languages by training one model on many languages.

Advantages: Easy to implement, enables cross-lingual pretraining by itself

Disadvantages: Leads to under-representation of low-resource languages

60

61 of 238

Hands-on #1:

Pretraining a Transformer Language Model

61

Image credit: Chanaky

62 of 238

Hands-on: Overview

  • Goals:
    • Let’s make these recent works “uncool again” i.e. as accessible as possible
    • Expose all the details in a simple, concise and self-contained code-base
    • Show that transfer learning can be simple (less hand-engineering) & fast (pretrained model)
  • Plan
    • Build a GPT-2 / BERT model
    • Pretrain it on a rather large corpus with ~100M words
    • Adapt it for a target task to get SOTA performances
  • Material:

62

Current developments in Transfer Learning combine new approaches for training schemes (sequential training) as well as models (transformers) ⇨ can look intimidating and complex

63 of 238

Hands-on pre-training

63

64 of 238

Hands-on pre-training

64

Our core model will be a Transformer. Large-scale transformer architectures (GPT-2, BERT, XLM…) are very similar to each other and consist of:

  • summing word and position embeddings
  • applying a succession of transformer blocks with:
    • layer normalisation
    • a self-attention module
    • dropout and a residual connection
    • another layer normalisation
    • a feed-forward module with one hidden layer and a non-linearity: Linear ⇨ ReLU/gelu ⇨ Linear
    • dropout and a residual connection

Main differences between GPT/GPT-2/BERT are the objective functions:

  • causal language modeling for GPT
  • masked language modeling for BERT (+ next sentence prediction)

We’ll play with both.
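To make this concrete, here is a minimal sketch of such a backbone in PyTorch (an illustration, not the exact tutorial code; the class names and hyper-parameter values are our own):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One block: LayerNorm -> self-attention -> dropout + residual,
    then LayerNorm -> feed-forward (Linear -> ReLU -> Linear) -> dropout + residual."""
    def __init__(self, d_model=410, n_heads=10, d_ff=2100, dropout=0.1):
        super().__init__()
        self.ln_1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout)
        self.ln_2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None, padding_mask=None):
        # x: (seq_len, batch, d_model)
        h = self.ln_1(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, key_padding_mask=padding_mask)
        x = x + self.dropout(a)
        h = self.ln_2(x)
        return x + self.dropout(self.ff(h))

class Transformer(nn.Module):
    """Word + position embeddings followed by a stack of transformer blocks."""
    def __init__(self, vocab_size, max_len=256, d_model=410, n_layers=16, **block_kwargs):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList(
            TransformerBlock(d_model, **block_kwargs) for _ in range(n_layers))

    def forward(self, ids, attn_mask=None, padding_mask=None):
        # ids: (seq_len, batch) of token indices
        positions = torch.arange(ids.size(0), device=ids.device).unsqueeze(1)
        x = self.tok_emb(ids) + self.pos_emb(positions)      # broadcast over the batch
        for block in self.blocks:
            x = block(x, attn_mask, padding_mask)
        return x
```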

65 of 238

Let’s code the backbone of our model!

PyTorch 1.1 now has an nn.MultiheadAttention module: it lets us encapsulate the self-attention logic while still controlling the internals of the Transformer.

Hands-on pre-training

65

66 of 238

Two attention masks?

  • padding_mask masks the padding tokens. It is specific to each sample in the batch:
  • attn_mask is the same for all samples in the batch. It masks future positions so that causal transformers can only attend to previous tokens:
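One possible way to build these two masks for nn.MultiheadAttention (a sketch; the pad_token_id argument and tensor layout are assumptions on our side):

```python
import torch

def build_masks(ids, pad_token_id, causal=True):
    """ids: (batch, seq_len) token indices."""
    # padding_mask: per-sample, True where the position is a padding token
    padding_mask = ids.eq(pad_token_id)                              # (batch, seq_len)
    attn_mask = None
    if causal:
        # attn_mask: shared by all samples, -inf above the diagonal so that a position
        # cannot attend to future tokens
        seq_len = ids.size(1)
        attn_mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
    return padding_mask, attn_mask
```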

Hands-on pre-training

66

67 of 238

Hands-on pre-training

67

To pretrain our model, we need to add a few elements: a head, a loss, and weight initialization. We add these elements with a pretraining model encapsulating our core model:

1. A pretraining head on top of our core model: we choose a language modeling head with tied weights

2. Initialize the weights

3. Define a loss function: we choose a cross-entropy loss on current (or next) token predictions
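A sketch of such a pretraining wrapper around the backbone defined above (illustrative only; the 0.02 initialization scale and the -1 ignore index are our choices):

```python
import torch.nn as nn
import torch.nn.functional as F

class TransformerWithLMHead(nn.Module):
    """Core model + language-modeling head with weights tied to the token embeddings."""
    def __init__(self, transformer):
        super().__init__()
        self.transformer = transformer
        vocab_size, d_model = transformer.tok_emb.weight.shape
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)   # 1. pretraining head
        self.apply(self.init_weights)                               # 2. initialize the weights
        self.lm_head.weight = transformer.tok_emb.weight            # tie head and embeddings

    @staticmethod
    def init_weights(module):
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=0.02)

    def forward(self, ids, labels=None, **mask_kwargs):
        hidden = self.transformer(ids, **mask_kwargs)      # (seq_len, batch, d_model)
        logits = self.lm_head(hidden)
        if labels is None:
            return logits
        # 3. cross-entropy on current (masked LM) or next (causal LM) token predictions
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1),
                               ignore_index=-1)
        return logits, loss
```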

68 of 238

Hands-on pre-training

68

Now let’s take care of our data and configuration:

  • Hyper-parameters taken from Dai et al., 2018 (Transformer-XL) ⇨ a ~50M-parameter causal model.
  • We'll use a pre-defined open-vocabulary tokenizer: BERT’s cased tokenizer.
  • Use a large dataset for pre-training: WikiText-103 with 103M tokens (Merity et al., 2017).
  • Instantiate our model and optimizer (Adam).
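Putting these pieces together might look roughly like this, building on the sketches above (the learning rate and model sizes are illustrative values, not the exact tutorial configuration):

```python
import torch
from pytorch_pretrained_bert import BertTokenizer

# Open-vocabulary tokenization with BERT's cased WordPiece tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# A ~50M-parameter causal model in the spirit of the Transformer-XL hyper-parameters
model = TransformerWithLMHead(
    Transformer(vocab_size=len(tokenizer.vocab), max_len=256,
                d_model=410, n_layers=16, n_heads=10, d_ff=2100, dropout=0.1))
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)
```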

69 of 238

Hands-on pre-training

69

A simple update loop.

We use gradient accumulation to get a large batch size (>64) even on 1 GPU.

Learning rate schedule:

– linear warm-up to start
– then cosine or inverse square root decrease

And we’re done: let’s train!

[Plot: training curves with and without learning-rate warm-up]
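A minimal sketch of such an update loop with gradient accumulation and a warm-up/cosine schedule (the dataloader, step counts and values are placeholders):

```python
import math
from torch.nn.utils import clip_grad_norm_

accumulation_steps, warmup_steps, total_steps, peak_lr = 4, 1000, 100_000, 2.5e-4

for step, (inputs, labels) in enumerate(train_dataloader):        # hypothetical dataloader
    _, loss = model(inputs, labels=labels)
    (loss / accumulation_steps).backward()                        # gradient accumulation
    if (step + 1) % accumulation_steps == 0:
        clip_grad_norm_(model.parameters(), max_norm=0.25)
        # linear warm-up, then cosine decrease
        if step < warmup_steps:
            lr = peak_lr * step / warmup_steps
        else:
            progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
            lr = 0.5 * peak_lr * (1 + math.cos(math.pi * min(1.0, progress)))
        for group in optimizer.param_groups:
            group['lr'] = lr
        optimizer.step()
        optimizer.zero_grad()
```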

70 of 238

Hands-on pre-training — Concluding remarks

70

  • On pretraining
    • Intensive: in our case 5h–20h on 8 V100 GPUs (few days w. 1 V100) to reach a good perplexity ⇨ share your pretrained models
    • Robust to the choice of hyper-parameters (apart from needing a warm-up for transformers)
    • Language modeling is a hard task: your model should not have enough capacity to overfit if your dataset is large enough ⇨ you can just start the training and let it run.
    • Masked language modeling is typically 2–4 times slower to train than causal LM: we only mask 15% of the tokens ⇨ smaller training signal.

  • For the rest of this tutorial: we don’t have enough time to do a full pretraining ⇨ we pretrained two models for you before the tutorial

71 of 238

Hands-on pre-training — Concluding remarks

71

  • First model:
    • exactly the one we built together ⇨ a 50M parameters causal Transformer
    • Trained 15h on 8 V100
    • Reached a word-level perplexity of 29 on wikitext-103 validation set (quite competitive)
  • Second model:
    • Same model but trained with a masked-language modeling objective (see the repo)
    • Trained 30h on 8 V100
    • Reached a “masked-word” perplexity of 8.3 on wikitext-103 validation set

Wikitext-103 Validation/Test PPL

72 of 238

Agenda

72

73 of 238

3. What is in a Representation?

73

Image credit: Caique Lima

74 of 238

Why care about what is in a representation?

  • Extrinsic evaluation with downstream tasks
    • Complex, diverse with task-specific quirks
  • Interpretability!
    • Are we getting our results because of the right reasons?
    • Uncovering biases...
  • Language-aware representations
    • To generalize to other tasks, new inputs
    • As intermediates for possible improvements to pretraining

74

75 of 238

What to analyze?

  • Variations
    • Architecture (RNN / Transformer)
    • Layers
    • Pretraining Objectives

75

  • Embeddings
    • Word
    • Contextualized
  • Network Activations

76 of 238

Analysis Method 1: Visualization

76

Hold the embeddings / network activations static or frozen

77 of 238

Visualizing Embedding Geometries

  • Plotting embeddings in a lower dimensional (2D/3D) space
    • t-SNE (van der Maaten & Hinton, 2008)
    • PCA projections

  • Visualizing word analogies (Mikolov et al., 2013)
    • Spatial relations
    • w_king - w_man + w_woman ≈ w_queen

  • High-level view of lexical semantics
    • Only a limited number of examples
    • Connection to other tasks is unclear (Goldberg, 2017)

77

Image: Tensorflow

78 of 238

Visualizing Neuron Activations

  • Neuron activation values correlate with features / labels
  • Indicates learning of recognizable features
    • How to select which neuron? Hard to scale!
    • Interpretable != Important (Morcos et al., 2018)

78

79 of 238

Visualizing Layer-Importance Weights

Layer-wise analysis (static)

  • How important is each layer for a given performance on a downstream task?
    • Weighted average of layers

79

  • Task and architecture specific!

80 of 238

Visualizing Attention Weights

Visualization: Attention Weights

  • Popular in machine translation, or other seq2seq architectures:
    • Alignment between words of source and target.
    • Long-distance word-word dependencies (intra-sentence attention)

80

  • Sheds light on architectures
    • Having sophisticated attention mechanisms can be a good thing!
    • Layer-specific
  • Interpretation can be tricky
    • Few examples only - cherry picking?
    • Robust corpus-wide trends? Next!

81 of 238

Analysis Method 2: Behavioral Probes

81

  • RNN-based language models
    • number agreement in subject-verb dependencies
    • natural and nonce or ungrammatical sentences
    • evaluate on output perplexity
  • RNNs outperform other non-neural baselines.

  • Performance improves when trained explicitly with syntax (Kuncoro et al. 2018)

82 of 238

Analysis Method 2: Behavioral Probes

82

  • RNN-based language models
    • number agreement in subject-verb dependencies
    • For natural and nonce/ungrammatical sentences
    • LM perplexity differences
  • RNNs outperform other non-neural baselines.

  • Performance improves when trained explicitly with syntax (Kuncoro et al. 2018)

  • Probe: Might be vulnerable to co-occurrence biases
    • “dogs in the neighborhood bark(s)”
    • Nonce sentences might be too different from original...

83 of 238

Analysis Method 3: Classifier Probes

83

Hold the embeddings / network activations static and

train a simple supervised model on top

Probe classification task (Linear / MLP)

84 of 238

Probing Surface-level Features

  • Given a sentence, predict properties such as
    • Length
    • Is a word in the sentence?

84

  • Given a word in a sentence predict properties such as:
    • Previously seen words, contrast with language model
    • Position of word in the sentence
  • Checks ability to memorize
    • Well-trained, richer architectures tend to fare better
    • Training on linguistic data memorizes better

85 of 238

Probing Morphology, Syntax, Semantics

  • Morphology

  • Word-level syntax
    • POS tags, CCG supertags
    • Constituent parent, grandparent…

  • Partial syntax
    • Dependency relations

  • Sentence-level syntax
    • Tree depth, top constituents
    • Tense of main clause verb
    • Subject-verb agreement, long-distance number agreement
    • # objects

  • Partial semantics
    • Entity relations
    • Coreference
    • Roles

85

86 of 238

Probing classifier findings

86

87 of 238

Probing classifier findings

87

  • Contextualized > non-contextualized
    • Especially on syntactic tasks
    • Closer performance on semantic tasks
    • Bidirectional context is important

  • BERT (large) almost always gets the highest performance
    • Grain of salt: Different contextualized representations were trained on different data, using different architectures...

88 of 238

Probing: Layers of the network

Layer-wise analysis (dynamic)

  • RNN layers: General linguistic properties
    • Lowest layers: morphology
    • Middle layers: syntax
    • Highest layers: Task-specific semantics
  • Transformer layers:
    • Different trends for different tasks; middle-heavy
    • Also see Tenney et al., 2019

88

89 of 238

Probing: Pretraining Objectives

  • Language modeling outperforms other unsupervised and supervised objectives.
    • Machine Translation
    • Dependency Parsing
    • Skip-thought

  • Low-resource settings (size of training data) might result in opposite trends.

89

90 of 238

What have we learnt so far?

  • Representations are predictive of certain linguistic phenomena:
    • Alignments in translation, Syntactic hierarchies

90

  • Pretraining with and without syntax:
    • Better performance with syntax
    • But even without it, models learn at least some notion of syntax (Williams et al., 2018)
  • Network architectures determine what is in a representation
    • Syntax and BERT Transformer (Tenney et al., 2019; Goldberg, 2019)
    • Different layer-wise trends across architectures

91 of 238

Open questions about probes

  • What information should a good probe look for?
    • Probing a probe!

91

  • What does probing performance tell us?
    • Hard to synthesize results across a variety of baselines...
  • Probes can introduce some complexity themselves
    • linear or non-linear classification
    • behavioral probes: design of input sentences
  • Should we be using probes as evaluation metrics?
    • might defeat the purpose...

92 of 238

Analysis Method 4: Model Alterations

  • Progressively erase or mask network components
    • Word embedding dimensions
    • Hidden units
    • Input - words / phrases

92

93 of 238

So, what is in a representation?

  • Depends on how you look at it!
    • Visualization:
      • bird’s eye view
      • few samples -- might call to mind cherry-picking
    • Probes:
      • discover corpus-wide specific properties
      • may introduce own biases...
    • Network ablations:
      • great for improving modeling,
      • could be task specific

93

  • Analysis methods as tools to aid model development!

94 of 238

Very current and ongoing!

94

First column for citations in and before 2015

95 of 238

What’s next?

  • Linguistic awareness
  • Interpretability
  • Correlation of probes to downstream tasks

Interpretability + transferability to downstream tasks is key → up next!

96 of 238

Some Pointers

  • Structural Probes: Hewitt & Manning, 2019 (9E Machine Learning)
  • Overview of probes: Belinkov & Glass, 2019 (7F Poster #18)

96

97 of 238

Break

97

Image credit: Andrejs Kirma

98 of 238

Transfer Learning in Natural Language Processing

Transfer Learning in NLP

98

Follow along with the tutorial:

Questions:

  • Twitter: #NAACLTransfer during the tutorial
  • Whova: “Questions for the tutorial on Transfer Learning in NLP” topic
  • Ask us during the break or after the tutorial

99 of 238

Agenda

99

100 of 238

4. Adaptation

100

Image credit: Ben Didier

101 of 238

4 – How to adapt the pretrained model

Several orthogonal directions we can make decisions on:

  1. Architectural modifications?
     How much to change the pretrained model architecture for adaptation

  2. Optimization schemes?
     Which weights to train during adaptation and following what schedule

  3. More signal: weak supervision, multi-tasking & ensembling
     How to get more supervision signal for the target task

102 of 238

4.1 – Architecture

Two general options:

  1. Keep pretrained model internals unchanged:
     Add classifiers on top, embeddings at the bottom, use outputs as features

  2. Modify pretrained model internal architecture:
     Initialize encoder-decoders, task-specific modifications, adapters

102

Image credit: Darmawansyah

103 of 238

4.1.A – Architecture: Keep model unchanged

General workflow:

  • Remove pretraining task head if not useful for target task
    1. Example: remove softmax classifier from pretrained LM
    2. Not always needed: some adaptation schemes re-use the pretraining objective/task, e.g. for multi-task learning

103

104 of 238

4.1.A – Architecture: Keep model unchanged

General workflow:

Task-specific, randomly initialized

General, pretrained

  • Add target task-specific layers on top/bottom of pretrained model
    • Simple: adding linear layer(s) on top of the pretrained model

104

105 of 238

4.1.A – Architecture: Keep model unchanged

General workflow:

  • Add target task-specific layers on top/bottom of pretrained model
    • Simple: adding linear layer(s) on top of the pretrained model
    • More complex: model output as input for a separate model
    • Often beneficial when target task requires interactions that are not available in pretrained embedding

105

106 of 238

4.1.B – Architecture: Modifying model internals

Various reasons:

  • Adapting to a structurally different target task
    • Ex: Pretraining with a single input sequence (ex: language modeling) but adapting to a task with several input sequences (ex: translation, conditional generation...)
    • Use the pretrained model weights to initialize as much as possible of a structurally different target task model
    • Ex: Use monolingual LMs to initialize encoder and decoder parameters for MT (Ramachandran et al., EMNLP 2017; Lample & Conneau, 2019)

106

107 of 238

4.1.B – Architecture: Modifying model internals

Various reasons:

  • Task-specific modifications
    • Provide pretrained model with capabilities that are useful for the target task
    • Ex: Adding skip/residual connections, attention (Ramachandran et al., EMNLP 2017)

107

108 of 238

4.1.B – Architecture: Modifying model internals

  • Using less parameters for adaptation:
    • Less parameters to fine-tune
    • Can be very useful given the increasing size of model parameters
    • Ex: add bottleneck modules (“adapters”) between the layers of the pretrained model (Rebuffi et al., NIPS 2017; CVPR 2018)

Various reasons:

108

109 of 238

4.1.B – Architecture: Modifying model internals

Adapters

  • Commonly connected with a residual connection in parallel to an existing layer
  • Most effective when placed at every layer (smaller effect at bottom layers)
  • Different operations (convolutions, self-attention) possible
  • Particularly suitable for modular architectures like Transformers (Houlsby et al., ICML 2019; Stickland and Murray, ICML 2019)

109

Image credit: Caique Lima

110 of 238

4.1.B – Architecture: Modifying model internals

  • Multi-head attention (MH; shared across layers) is used in parallel with self-attention (SA) layer of BERT
  • Both are added together and fed into a layer-norm (LN)

110

111 of 238

Hands-on #2:

Adapting our pretrained model

111

Image credit: Chanaky

112 of 238

Hands-on: Model adaptation

  • Plan
    • Start from our Transformer language model
    • Adapt the model to a target task:
      • keep the model core unchanged, load the pretrained weights
      • add a linear layer on top, newly initialized
      • use additional embeddings at the bottom, newly initialized
  • Reminder — material is here:

112

Let’s see how a simple fine-tuning scheme can be implemented with our pretrained model:

113 of 238

Adaptation task

  • We select a text classification task as the downstream task

  • TREC-6: The Text REtrieval Conference (TREC) Question Classification dataset (Li et al., COLING 2002)
  • TREC consists of open-domain, fact-based questions divided into broad semantic categories. It contains 5,500 labeled training questions & 500 testing questions with 6 labels: NUM, LOC, HUM, DESC, ENTY, ABBR

Hands-on: Model adaptation

113

Ex:

  • How did serfdom develop in and then leave Russia ? —> DESC
  • What films featured the character Popeye Doyle ? —> ENTY

Transfer learning models shine on this type of low-resource task

114 of 238

  • Modifications:
    • Keep model internals unchanged
    • Add a linear layer on top
    • Add an additional embedding (classification token) at the bottom
  • Computation flow:
    • Model input: the tokenized question with a classification token at the end
    • Extract the last hidden-state associated to the classification token
    • Pass the hidden-state in a linear layer and softmax to obtain class probabilities

Hands-on: Model adaptation

114

First adaptation scheme
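A minimal sketch of this first adaptation scheme, reusing the hypothetical Transformer backbone from the pretraining section (names and defaults are illustrative):

```python
import torch.nn as nn
import torch.nn.functional as F

class TransformerWithClfHead(nn.Module):
    """Pretrained backbone kept unchanged + a newly initialized linear classification head."""
    def __init__(self, transformer, num_classes=6, dropout=0.1):
        super().__init__()
        self.transformer = transformer                 # load pretrained weights into this
        d_model = transformer.tok_emb.weight.size(1)
        self.dropout = nn.Dropout(dropout)
        self.clf_head = nn.Linear(d_model, num_classes)

    def forward(self, ids, clf_token_mask, labels=None):
        hidden = self.transformer(ids)                 # (seq_len, batch, d_model)
        hidden = hidden.transpose(0, 1)                # (batch, seq_len, d_model)
        clf_h = hidden[clf_token_mask]                 # hidden state of the classification token
        logits = self.clf_head(self.dropout(clf_h))    # (batch, num_classes)
        if labels is None:
            return logits
        return logits, F.cross_entropy(logits, labels)
```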

115 of 238

Let’s load and prepare our dataset:

  – trim to the transformer input size & add a classification token at the end of each sample,
  – pad to the left,
  – convert to tensors,
  – extract a validation set.

Fine-tuning hyper-parameters:

  – 6 classes in TREC-6
  – Use fine-tuning hyper-parameters from Radford et al., 2018:
    • learning rate from 6.5e-5 to 0.0
    • fine-tune for 3 epochs

Hands-on: Model adaptation

115
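A rough sketch of this preprocessing with the BERT tokenizer (using '[CLS]' as the classification token and '[PAD]' for padding is our assumption):

```python
import torch

def encode(question, max_len=256):
    """Tokenize, trim, append a classification token, left-pad and convert to ids."""
    tokens = tokenizer.tokenize(question)[:max_len - 1] + ['[CLS]']
    ids = tokenizer.convert_tokens_to_ids(tokens)
    padding = [tokenizer.vocab['[PAD]']] * (max_len - len(ids))
    return torch.tensor(padding + ids)                 # pad to the left
```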

116 of 238

Adapt our model architecture

Replace the pre-training head (language modeling) with the classification head:

A linear layer, which takes as input the hidden-state of the [CLF] token (using a mask)

Keep our pretrained model unchanged as the backbone.

* Initialize all the weights of the model.
* Reload the common weights from the pretrained model.

Hands-on: Model adaptation

116

117 of 238

Our fine-tuning code:

A simple training update function:

* prepare inputs: transpose and build padding & classification token masks
* we have options to clip and accumulate gradients

Learning rate schedule:

* linearly increasing to lr
* linearly decreasing to 0.0

We will evaluate on our validation and test sets:

* validation: after each epoch
* test: at the end

Hands-on: Model adaptation

117

118 of 238

We can now fine-tune our model on TREC:

We are at the state-of-the-art

(ULMFiT)

Remarks:

  • The error rate goes down quickly! After one epoch we already have >90% accuracy.
    ⇨ Fine-tuning is highly data-efficient in transfer learning
  • We took our pre-training & fine-tuning hyper-parameters straight from the literature on related models.
    ⇨ Fine-tuning is often robust to the exact choice of hyper-parameters

Hands-on: Model adaptation – Results

118

119 of 238

Let’s conclude this hands-on with a few additional words on robustness & variance.

  • Large pretrained models (e.g. BERT large) are prone to degenerate performance when fine-tuned on tasks with small training sets.
  • Observed behavior is often “on-off”: it either works very well or doesn’t work at all.
  • Understanding the conditions and causes of this behavior (models, adaptation schemes) is an open research question.

Hands-on: Model adaptation – Results

119

120 of 238

4.2 – Optimization

Several directions when it comes to the optimization itself:

  1. Choose which weights we should update
     Feature extraction, fine-tuning, adapters

  2. Choose how and when to update the weights
     From top to bottom, gradual unfreezing, discriminative fine-tuning

  3. Consider practical trade-offs
     Space and time complexity, performance

120

Image credit: ProSymbols, purplestudio, Markus, Alfredo

121 of 238

4.2.A – Optimization: Which weights?

The main question: To tune or not to tune (the pretrained weights)?

  • Do not change the pretrained weights
    Feature extraction, adapters

  • Change the pretrained weights
    Fine-tuning

121

Image credit: purplestudio

122 of 238

4.2.A – Optimization: Which weights?

Don’t touch the pretrained weights!

Feature extraction:

  • Weights are frozen

122

❄️

123 of 238

4.2.A – Optimization: Which weights?

Don’t touch the pretrained weights!

Feature extraction:

  • Weights are frozen
  • A linear classifier is trained on top of the pretrained representations

123

❄️

124 of 238

4.2.A – Optimization: Which weights?

Don’t touch the pretrained weights!

Feature extraction:

  • Weights are frozen
  • A linear classifier is trained on top of the pretrained representations
  • Don’t just use features of the top layer!
  • Learn a linear combination of layers (Peters et al., NAACL 2018, Ruder et al., AAAI 2019)

124

❄️

125 of 238

4.2.A – Optimization: Which weights?

Don’t touch the pretrained weights!

Feature extraction:

  • Alternatively, pretrained representations are used as features in downstream model

125

126 of 238

4.2.A – Optimization: Which weights?

Don’t touch the pretrained weights!

Adapters

  • Task-specific modules that are added in between existing layers

126

127 of 238

4.2.A – Optimization: Which weights?

Don’t touch the pretrained weights!

Adapters

  • Task-specific modules that are added in between existing layers
  • Only adapters are trained

127

128 of 238

4.2.A – Optimization: Which weights?

Yes, change the pretrained weights!

Fine-tuning:

  • Pretrained weights are used as initialization for parameters of the downstream model
  • The whole pretrained architecture is trained during the adaptation phase

128

129 of 238

Hands-on #3:

Using Adapters and freezing

129

Image credit: Chanaky

130 of 238

  • Modifications:
    • add Adapters inside the backbone model: Linear ⇨ ReLU ⇨ Linear, with a skip-connection
  • As previously:
    • add a linear layer on top
    • use an additional embedding (classification token) at the bottom

Hands-on: Model adaptation

130

Second adaptation scheme: Using Adapters

  • Houlsby et al., ICML 2019

We will only train the adapters, the added linear layer and the embeddings. The other parameters of the model will be frozen.

131 of 238

Let’s adapt our model architecture

Add the adapter modules:

Bottleneck layers with 2 linear layers and a non-linear activation function (ReLU)

Hidden dimension is small: e.g. 32, 64, 256

Inherit from our pretrained model to have all the modules.

The Adapters are inserted inside skip-connections after:

  • the attention module
  • the feed-forward module

Hands-on: Model adaptation

131
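A minimal sketch of such a bottleneck module (the hidden size and names are illustrative):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project -> ReLU -> up-project, wrapped in a skip-connection."""
    def __init__(self, d_model, bottleneck_dim=32):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, d_model)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))   # skip-connection around the bottleneck
```

In the hands-on, modules like this would be inserted after the attention and feed-forward sub-layers of each transformer block.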

132 of 238

Now we need to freeze the portions of our model we don’t want to train.

We just indicate that no gradient is needed for the frozen parameters by setting param.requires_grad to False:

In our case we will train 25% of the parameters. The model is small & deep (many adapters) and we need to train the embeddings, so the ratio stays quite high. For a larger model this ratio would be a lot lower.

Hands-on: Model adaptation

132
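In code, the freezing step can be as simple as the following sketch (the name patterns are illustrative and depend on how the model is defined):

```python
trainable_patterns = ('adapter', 'clf_head', 'tok_emb', 'pos_emb')   # what we keep trainable

for name, param in model.named_parameters():
    param.requires_grad = any(pattern in name for pattern in trainable_patterns)

n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in model.parameters())
print(f"Training {n_trainable / n_total:.0%} of the parameters")
```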

133 of 238

Results are similar to the full fine-tuning case, with the advantage of training only 25% of the full model parameters.

For a small 50M-parameter model this method is overkill; it is intended for 300M–1.5B-parameter models.

We use a hidden dimension of 32 for the adapters and a learning rate ten times higher for the fine-tuning (we have added quite a lot of newly initialized parameters to train from scratch).

Hands-on: Model adaptation

133

134 of 238

4.2.B – Optimization: What schedule?

We have decided which weights to update, but in which order and how should we update them?

Motivation: We want to avoid overwriting useful pretrained information and maximize positive transfer.

Related concept: Catastrophic forgetting (McCloskey & Cohen, 1989; French, 1999): when a model forgets the task it was originally trained on.

134

Image credit: Markus

135 of 238

4.2.B – Optimization: What schedule?

A guiding principle: update from top to bottom

  • Progressively in time: freezing
  • Progressively in intensity: Varying the learning rates
  • Progressively vs. the pretrained model: Regularization

135

136 of 238

4.2.B – Optimization: Freezing

Main intuition: Training all layers at the same time on data of a different distribution and task may lead to instability and poor solutions.

Solution: Train layers individually to give them time to adapt to new task and data.

Goes back to layer-wise training of early deep neural networks (Hinton et al., 2006; Bengio et al., 2007).

136

137 of 238

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)

137

138 of 238

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)
  • Chain-thaw (Felbo et al., EMNLP 2017): training one layer at a time
    1. Train new layer

138

139 of 238

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)
  • Chain-thaw (Felbo et al., EMNLP 2017): training one layer at a time
    • Train new layer
    • Train one layer at a time

139

140 of 238

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)
  • Chain-thaw (Felbo et al., EMNLP 2017): training one layer at a time
    • Train new layer
    • Train one layer at a time

140

141 of 238

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)
  • Chain-thaw (Felbo et al., EMNLP 2017): training one layer at a time
    • Train new layer
    • Train one layer at a time

141

142 of 238

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)
  • Chain-thaw (Felbo et al., EMNLP 2017): training one layer at a time
    • Train new layer
    • Train one layer at a time
    • Train all layers

142

143 of 238

4.2.B – Optimization: Freezing

143

144 of 238

4.2.B – Optimization: Freezing

144

145 of 238

4.2.B – Optimization: Freezing

145

146 of 238

4.2.B – Optimization: Freezing

146

147 of 238

4.2.B – Optimization: Freezing

147

148 of 238

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)
  • Chain-thaw (Felbo et al., EMNLP 2017): training one layer at a time
  • Gradually unfreezing (Howard & Ruder, ACL 2018): unfreeze one layer after another
  • Sequential unfreezing (Chronopoulou et al., NAACL 2019): hyper-parameters determine the length of each fine-tuning stage
    • Fine-tune the additional (newly added) parameters for a number of epochs
    • Fine-tune the pretrained parameters without the embedding layer for a number of epochs

148

149 of 238

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)
  • Chain-thaw (Felbo et al., EMNLP 2017): training one layer at a time
  • Gradually unfreezing (Howard & Ruder, ACL 2018): unfreeze one layer after another
  • Sequential unfreezing (Chronopoulou et al., NAACL 2019): hyper-parameters determine the length of each fine-tuning stage
    • Fine-tune the additional (newly added) parameters for a number of epochs
    • Fine-tune the pretrained parameters without the embedding layer for a number of epochs
    • Train all layers until convergence

149

150 of 238

4.2.B – Optimization: Freezing

Commonality: Train all parameters jointly in the end

150

151 of 238

Hands-on #4:

Using gradual unfreezing

151

Image credit: Chanaky

152 of 238

Gradual unfreezing is similar to our previous freezing process. We start by freezing the whole model except the newly added parameters:

We then gradually unfreeze an additional block as training progresses, so that we train the full model at the end:

Find index of layer to unfreeze

Name pattern matching

Unfreezing interval

Hands-on: Adaptation

152
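One possible implementation of this unfreezing schedule (a sketch; the name pattern and interval are assumptions tied to the hypothetical model defined earlier):

```python
import re

UNFREEZING_INTERVAL = 500            # unfreeze one more block every 500 updates (illustrative)

def apply_unfreezing_schedule(model, step):
    """Gradually unfreeze transformer blocks from top to bottom as training progresses."""
    n_blocks = len(model.transformer.blocks)
    n_unfrozen = min(n_blocks, step // UNFREEZING_INTERVAL)
    for name, param in model.transformer.named_parameters():
        match = re.match(r"blocks\.(\d+)\.", name)
        if match:                                     # parameters inside a transformer block
            block_index = int(match.group(1))         # blocks are numbered bottom-up
            param.requires_grad = block_index >= n_blocks - n_unfrozen
```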

153 of 238

Gradual unfreezing has not been investigated in detail for Transformer models ⇨ no specific hyper-parameters are advocated in the literature.

Residual connections may have an impact on the method ⇨ hyper-parameters developed for LSTMs should probably be adapted.

Hands-on: Adaptation

153

We show simple experiments in the Colab. Better hyper-parameters settings can probably be found.

154 of 238

4.2.B – Optimization: Learning rates

Main idea: Use lower learning rates to avoid overwriting useful information.

Where and when?

  • Lower layers (capture general information)
  • Early in training (model still needs to adapt to target distribution)
  • Late in training (model is close to convergence)

154

155 of 238

4.2.B – Optimization: Learning rates

  • Discriminative fine-tuning (Howard & Ruder, ACL 2018)
    • Lower layers capture general information → use lower learning rates for lower layers

155

156 of 238

4.2.B – Optimization: Learning rates

  • Discriminative fine-tuning
  • Triangular learning rates (Howard & Ruder, ACL 2018)
    • Quickly move to a suitable region, then slowly converge over time

156

157 of 238

4.2.B – Optimization: Learning rates

  • Discriminative fine-tuning
  • Triangular learning rates (Howard & Ruder, ACL 2018)
    • Quickly move to a suitable region, then slowly converge over time
    • Also known as “learning rate warm-up”
    • Used e.g. in Transformer (Vaswani et al., NIPS 2017) and Transformer-based methods (BERT, GPT)
    • Facilitates optimization; easier to escape suboptimal local minima

157

158 of 238

4.2.B – Optimization: Regularization

Main idea: minimize catastrophic forgetting by encouraging target model parameters to stay close to the pretrained model parameters using a regularization term.

158

159 of 238

4.2.B – Optimization: Regularization

  • Simple method: regularize new parameters not to deviate too much from the pretrained ones (Wiese et al., CoNLL 2017):
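One common instantiation of such a penalty (a sketch, not necessarily the exact formula on the slide) is an L2 distance between the fine-tuned and pretrained parameters:

$$\Omega = \lambda \,\lVert \theta_{\text{target}} - \theta_{\text{pretrained}} \rVert_2^2$$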

159

160 of 238

4.2.B – Optimization: Regularization

  • More advanced (elastic weight consolidation, EWC): focus on parameters that are important for the pretrained task based on the Fisher information matrix (Kirkpatrick et al., PNAS 2017):
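The EWC penalty weights each squared parameter change by its (diagonal) Fisher information $F_i$, so that parameters important for the pretraining task move less:

$$\Omega = \sum_i \frac{\lambda}{2}\, F_i \,\big(\theta_i - \theta_{\text{pretrained},i}\big)^2$$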

160

161 of 238

4.2.B – Optimization: Regularization

EWC has downsides in continual learning:

  • May over-constrain parameters
  • Computational cost is linear in the number of tasks (Schwarz et al., ICML 2018)

161

162 of 238

4.2.B – Optimization: Regularization

  • If tasks are similar, we may also encourage source and target predictions to be close based on cross-entropy, similar to distillation:
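One way to write such a term (a sketch) is a cross-entropy between the source and target model output distributions on the same input:

$$\Omega = -\sum_{c} p_{\text{source}}(c \mid x)\, \log p_{\text{target}}(c \mid x)$$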

162

163 of 238

Hands-on #5:

Using discriminative learning

163

Image credit: Chanaky

164 of 238

Discriminative learning rates can be implemented in two steps in our example:

First, we organize the parameters of the various layers into labelled parameter groups in the optimizer.

Then, at each training iteration, we compute the learning rate of each group depending on its label (the per-layer decrease factor is a hyper-parameter).

Hands-on: Model adaptation

164
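A sketch of these two steps with our hypothetical model (the decrease factor of 2.6 follows the ULMFiT recipe; the other names and values are illustrative):

```python
import torch

base_lr, decrease_factor = 6.5e-5, 2.6
blocks = list(model.transformer.blocks)

# Step 1: labelled parameter groups, one per layer depth (extra keys are kept by the optimizer)
param_groups = [{'params': block.parameters(), 'depth': i} for i, block in enumerate(blocks)]
optimizer = torch.optim.Adam(param_groups, lr=base_lr)

# Step 2: at each iteration, lower layers get exponentially smaller learning rates
for group in optimizer.param_groups:
    group['lr'] = base_lr / (decrease_factor ** (len(blocks) - 1 - group['depth']))
```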

165 of 238

4.2.C – Optimization: Trade-offs

Several trade-offs when choosing which weights to update:

  1. Space complexity
     Task-specific modifications, additional parameters, parameter reuse

  2. Time complexity
     Training time

  3. Performance

165

Image credit: Alfredo

166 of 238

4.2.C – Optimization trade-offs: Space

166

[Figure: feature extraction, fine-tuning and adapters compared on three scales: task-specific modifications (many ↔ few), additional parameters (many ↔ few), and parameter reuse (all ↔ none)]

167 of 238

4.2.C – Optimization trade-offs: Time

Training time

167

[Figure: feature extraction, fine-tuning and adapters placed on a training-time scale from slow to fast]

168 of 238

4.2.C – Optimization trade-offs: Performance

  • Rule of thumb: If the source and target tasks are dissimilar*, use feature extraction (Peters et al., 2019)
  • Otherwise, feature extraction and fine-tuning often perform similarly
  • Fine-tuning BERT on textual similarity tasks works significantly better
  • Adapters achieve performance competitive with fine-tuning
  • Anecdotally, Transformers are easier to fine-tune (less sensitive to hyper-parameters) than LSTMs

*dissimilar: certain capabilities (e.g. modelling inter-sentence relations) are beneficial for target task, but pretrained model lacks them (see more later)

168

169 of 238

4.3 – Getting more signal

The target task is often a low-resource task. We can often improve the performance of transfer learning by combining a diverse set of signals:

  • From fine-tuning a single model on a single adaptation task…
    The basics: fine-tuning the model with a simple classification objective
  • … to gathering signal from other datasets and related tasks …
    Fine-tuning with weak supervision, multi-tasking and sequential adaptation
  • … to ensembling models
    Combining the predictions of several fine-tuned models

169

Image credit: Naveen

170 of 238

4.3.A – Getting more signal: Basic fine-tuning

Simple example of fine-tuning on a text classification task:

  • Extract a single fixed-length vector from the model:
    the hidden state of the first/last token, or the mean/max of the hidden states
  • Project to the classification space with an additional classifier
  • Train with a classification objective

170

171 of 238

4.3.B – Getting more signal: Related datasets/tasks

  • Sequential adaptation
    Intermediate fine-tuning on related datasets and tasks
  • Multi-task fine-tuning with related tasks
    Such as NLI tasks in GLUE
  • Dataset slicing
    When the model consistently underperforms on particular slices of the data
  • Semi-supervised learning
    Use unlabelled data to improve model consistency

171

172 of 238

4.3.B – Getting more signal: Sequential adaptation

Fine-tuning on related high-resource dataset

  1. Fine-tune model on related task with more data

172

173 of 238

4.3.B – Getting more signal: Sequential adaptation

Fine-tuning on related high-resource dataset

  1. Fine-tune model on related task with more data
  2. Fine-tune model on target task
  • Helps particularly for tasks with limited data and similar tasks (Phang et al., 2018)
  • Improves sample complexity on target task (Yogatama et al., 2019)

173

174 of 238

4.3.B – Getting more signal: Multi-task fine-tuning

Fine-tune the model jointly on related tasks

  • For each optimization step, sample a task and a batch for training.
  • Train via multi-task learning for a couple of epochs.

174

175 of 238

4.3.B – Getting more signal: Multi-task fine-tuning

Fine-tune the model jointly on related tasks

  • For each optimization step, sample a task and a batch for training.
  • Train via multi-task learning for a couple of epochs.
  • Fine-tune on the target task only for a few epochs at the end.

175

176 of 238

4.3.B – Getting more signal: Multi-task fine-tuning

Fine-tune the model with an unsupervised auxiliary task

  • Language modelling is a related task!
  • Fine-tuning the LM helps adapting the pretrained parameters to the target dataset.
  • Helps even without pretraining (Rei et al., ACL 2017)
  • Can optionally anneal ratio (Chronopoulou et al., NAACL 2019)
  • Used as a separate step in ULMFiT

176

177 of 238

4.3.B – Getting more signal: Dataset slicing

Use auxiliary heads that are trained only on particular subsets of the data

  • Analyze errors of the model
  • Use heuristics to automatically identify challenging subsets of the training data
  • Train auxiliary heads jointly with main head

See also Massive Multi-task Learning with Snorkel MeTaL

177

178 of 238

4.3.B – Getting more signal: Semi-supervised learning

Can be used to make model predictions more consistent using unlabelled data

  • Main idea: Minimize distance between predictions on original input and perturbed input

178

179 of 238

4.3.B – Getting more signal: Semi-supervised learning

Can be used to make model predictions more consistent using unlabelled data

  • Perturbation can be noise, masking (Clark et al., EMNLP 2018), data augmentation, e.g. back-translation (Xie et al., 2019)

179

180 of 238

4.3.C – Getting more signal: Ensembling

Reaching the state-of-the-art by ensembling independently fine-tuned models

  • Ensembling models
    Combining the predictions of models fine-tuned with various hyper-parameters
  • Knowledge distillation
    Distill an ensemble of fine-tuned models into a single smaller model

180

181 of 238

4.3.C – Getting more signal: Ensembling

Models fine-tuned...

  • on different tasks
  • on different dataset splits
  • with different parameters (dropout, initializations…)
  • from variants of pre-trained models (e.g. cased/uncased)

181

Combining the predictions of models fine-tuned with various hyper-parameters.

182 of 238

4.3.C – Getting more signal: Distilling

  • Knowledge distillation: train a student model on soft targets produced by the teacher (the ensemble)
  • Relative probabilities of the teacher labels contain information about how the teacher generalizes

182

Distilling ensembles of large models back in a single model
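A common form of the soft-target objective (a sketch; the temperature value is a typical choice, not from the slides):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the softened teacher and student distributions."""
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction='batchmean') * temperature ** 2
```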

183 of 238

Hands-on #6:

Using multi-task learning

183

Image credit: Chanaky

184 of 238

Multitasking with a classification loss + language modeling loss.

Create two heads:

– language modeling head

– classification head

Total loss is a weighted sum of

– language modeling loss and

– classification loss

Hands-on: Multi-task learning

184
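A sketch of the two-headed model and the weighted loss, reusing the hypothetical backbone from earlier (the 1.0 / 0.5 coefficients are the ones mentioned on the next slide):

```python
import torch.nn as nn
import torch.nn.functional as F

class TransformerWithClfAndLMHead(nn.Module):
    """Shared backbone with a tied-weight LM head and a classification head."""
    def __init__(self, transformer, num_classes=6, clf_coef=1.0, lm_coef=0.5):
        super().__init__()
        self.transformer = transformer
        vocab_size, d_model = transformer.tok_emb.weight.shape
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = transformer.tok_emb.weight      # weight tying
        self.clf_head = nn.Linear(d_model, num_classes)
        self.clf_coef, self.lm_coef = clf_coef, lm_coef

    def forward(self, ids, clf_token_mask, clf_labels, lm_labels):
        hidden = self.transformer(ids)                        # (seq_len, batch, d_model)
        lm_logits = self.lm_head(hidden)
        lm_loss = F.cross_entropy(lm_logits.view(-1, lm_logits.size(-1)),
                                  lm_labels.view(-1), ignore_index=-1)
        clf_logits = self.clf_head(hidden.transpose(0, 1)[clf_token_mask])
        clf_loss = F.cross_entropy(clf_logits, clf_labels)
        # total loss: weighted sum of the two objectives
        return self.clf_coef * clf_loss + self.lm_coef * lm_loss
```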

185 of 238

Multi-tasking helped us improve over single-task full-model fine-tuning!

We use a coefficient of 1.0 for the classification loss and 0.5 for the language modeling loss, and fine-tune a little longer (6 epochs instead of 3; the validation loss was still decreasing).

Hands-on: Multi-task learning

185

186 of 238

Agenda

186

187 of 238

5. Downstream applications - Hands-on examples

187

Image credit: Fahmi

188 of 238

5. Downstream applications - Hands-on examples

In this section we will explore downstream applications and practical considerations along two orthogonal directions:

  1. What are the various applications of transfer learning in NLP?
     Document/sequence classification, token-level classification, structured prediction and language generation
  2. How to leverage several frameworks & libraries for practical applications
     Tensorflow, PyTorch, Keras and third-party libraries like fast.ai, HuggingFace...

188

189 of 238

Practical considerations

Frameworks & libraries: practical considerations

189

  • Pretraining large-scale models is costly
    Use open-source models
    Share your pretrained models

“Energy and Policy Considerations for Deep Learning in NLP” - Strubell, Ganesh, McCallum - ACL 2019

  • Sharing/accessing pretrained models
    • Hubs: Tensorflow Hub, PyTorch Hub
    • Author released checkpoints: ex BERT, GPT...
    • Third-party libraries: AllenNLP, fast.ai, HuggingFace
  • Design considerations
    • Hubs/libraries:
      • Simple to use but can be difficult to modify model internal architecture
    • Author released checkpoints:
      • More difficult to use but you have full control over the model internals

190 of 238

5. Downstream applications - Hands-on examples

  • Sequence and document level classification
    Hands-on: Document level classification (fast.ai)
  • Token level classification
    Hands-on: Question answering (Google BERT & Tensorflow/TF Hub)
  • Language generation
    Hands-on: Dialog generation (OpenAI GPT & HuggingFace/PyTorch Hub)

190

Icons credits: David, Susannanova, Flatart, ProSymbols

191 of 238

5.A – Sequence & document level classification

Transfer learning for document classification using the fast.ai library.

  • Target task:
    IMDB: a binary sentiment classification dataset containing 25k highly polar movie reviews for training, 25k for testing and additional unlabeled data.
    http://ai.stanford.edu/~amaas/data/sentiment/
  • Fast.ai has in particular:
    • a pre-trained English model available for download
    • a standardized data block API
    • easy access to standard datasets like IMDB
  • Fast.ai is based on PyTorch

191

192 of 238

5.A – Document level classification using fast.ai

192

fast.ai gives access to many high-level APIs out-of-the-box for vision, text, tabular data and collaborative filtering.

The library is designed for speed of experimentation, e.g. by importing all necessary modules at once in interactive computing environments.

fast.ai then comprises all the high-level modules needed to quickly set up a transfer learning experiment:

  • Load the IMDB dataset & inspect it.
  • Create a DataBunch for the language model and the classifier.
  • Load an AWD-LSTM (Merity et al., 2017) pretrained on WikiText-103 & fine-tune it on IMDB using the language modeling loss.

193 of 238

Once we have a fine-tuned language model (AWD-LSTM), we can create a text classifier by adding a classification head with:

– A layer to concatenate the final outputs of the RNN with the maximum and average of all the intermediate outputs (along the sequence length)

– Two blocks of nn.BatchNorm1d ⇨ nn.Dropout ⇨ nn.Linear ⇨ nn.ReLU with a hidden dimension of 50.

Now we fine-tune in two steps:

1. train the classification head only while keeping the language model frozen, and

2. fine-tune the whole architecture.

5.A – Document level classification using fast.ai

193
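With the fast.ai v1 text API, the whole workflow can be sketched roughly as follows (learning rates and epoch counts are illustrative, not the exact tutorial settings):

```python
from fastai.text import *

path = untar_data(URLs.IMDB)
data_lm = TextLMDataBunch.from_folder(path)                      # DataBunch for the LM
data_clas = TextClasDataBunch.from_folder(path, vocab=data_lm.vocab, bs=32)

# AWD-LSTM pretrained on WikiText-103, fine-tuned on IMDB with the LM loss
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn_lm.fit_one_cycle(1, 1e-2)
learn_lm.unfreeze()
learn_lm.fit_one_cycle(1, 1e-3)
learn_lm.save_encoder('ft_enc')

# Classifier with the pooled classification head on top of the fine-tuned encoder
learn_clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_clf.load_encoder('ft_enc')
learn_clf.fit_one_cycle(1, 2e-2)                                 # 1. train the head only
learn_clf.unfreeze()
learn_clf.fit_one_cycle(1, slice(1e-3 / 100, 1e-3))              # 2. fine-tune everything
```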

194 of 238

5.B – Token level classification: BERT & Tensorflow

Transfer learning for token level classification: Google’s BERT in TensorFlow.

  • Target task:
    SQuAD: a question answering dataset.
    https://rajpurkar.github.io/SQuAD-explorer/
  • In this example we will directly use a Tensorflow checkpoint
    • Example: https://github.com/google-research/bert
    • We use the usual Tensorflow workflow: create model graph comprising the core model and the added/modified elements
    • Take care of variable assignments when loading the checkpoint

194

195 of 238

Let’s adapt BERT to the target task.

Replace the pre-training head (language modeling) with a classification head:

a linear projection layer to estimate 2 probabilities for each token:

– being the start of an answer

– being the end of an answer.

Keep our core model unchanged.

5.B – SQuAD with BERT & Tensorflow

195
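In TF1-style code, the added head is just a small linear projection over the sequence output (a sketch close in spirit to the released BERT code; the variable names are illustrative):

```python
import tensorflow as tf

def add_squad_head(sequence_output):
    """sequence_output: [batch, seq_len, hidden] from the BERT core model.
    Returns per-token start and end logits."""
    hidden_size = sequence_output.shape[-1].value
    output_weights = tf.get_variable(
        "squad/output_weights", [2, hidden_size],
        initializer=tf.truncated_normal_initializer(stddev=0.02))
    output_bias = tf.get_variable(
        "squad/output_bias", [2], initializer=tf.zeros_initializer())
    logits = tf.einsum("bsh,oh->bso", sequence_output, output_weights) + output_bias
    start_logits, end_logits = tf.unstack(logits, axis=-1)
    return start_logits, end_logits
```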

196 of 238

Load our pretrained checkpoint

To load our checkpoint, we just need to set up an assignment_map from the variables of the checkpoint to the model variables, keeping only the variables that are present in the model.

We can then use tf.train.init_from_checkpoint.

5.B – SQuAD with BERT & Tensorflow

196

197 of 238

TensorFlow-Hub

TensorFlow Hub is a library for sharing machine learning models as self-contained pieces of TensorFlow graph with their weights and assets.

Working directly with TensorFlow requires having access to (and including in your code) the full code of the pretrained model.

Modules are automatically downloaded and cached when instantiated.

Each time a module m is called e.g. y = m(x), it adds operations to the current TensorFlow graph to compute y from x.

5.B – SQuAD with BERT & Tensorflow

197

198 of 238

Tensorflow Hub hosts a nice selection of pretrained models for NLP

Tensorflow Hub can also be used with Keras, exactly as we saw in the BERT example

The main limitations of Hubs are:

  • No access to the source code of the model (black-box)
  • Not possible to modify the internals of the model (e.g. to add Adapters)

5.B – SQuAD with BERT & Tensorflow

198

199 of 238

5.C – Language Generation: OpenAI GPT & PyTorch

Transfer learning for language generation: OpenAI GPT and HuggingFace library.

  • Target task:
    ConvAI2 – The 2nd Conversational Intelligence Challenge for training and evaluating models for non-goal-oriented dialogue systems, i.e. chit-chat
    http://convai.io
  • HuggingFace library of pretrained models
    • a repository of large scale pre-trained models with BERT, GPT, GPT-2, Transformer-XL
    • provide an easy way to download, instantiate and train pre-trained models in PyTorch
  • HuggingFace’s models are now also accessible using PyTorch Hub

199

200 of 238

A dialog generation task:

5.C – Chit-chat with OpenAI GPT & PyTorch

200

Language generation tasks are close to the language modeling pre-training objective, but:

  • Language modeling pre-training involves a single input: a sequence of words.
  • In a dialog setting: several types of context are provided to generate an output sequence:
    • knowledge base: persona sentences,
    • history of the dialog: at least the last utterance from the user,
    • tokens of the output sequence that have already been generated.

How should we adapt the model?

201 of 238

Golovanov, Kurbanov, Nikolenko, Truskovskyi, Tselousov and Wolf, ACL 2019

5.C – Chit-chat with OpenAI GPT & PyTorch

201

Several options:

  • Duplicate the model to initialize an encoder-decoder structure
    e.g. Lample & Conneau, 2019
  • Use a single model with concatenated inputs
    see e.g. Wolf et al., 2019; Khandelwal et al., 2019

Concatenate the various contexts, separated by delimiters, and add position and segment embeddings

202 of 238

5.C – Chit-chat with OpenAI GPT & PyTorch

202

Let’s import pretrained versions of the OpenAI GPT tokenizer and model, and add a few new tokens to the vocabulary.

Now most of the work is about preparing the inputs for the model:

  • organize the contexts in segments,
  • add delimiters at the extremities of the segments,
  • and build our word, position and segment inputs for the model.

We then train our model using the pretraining language modeling objective.
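A rough sketch of this input preparation (the special-token names and the build_inputs helper are our own simplification of the approach described above; adding the new tokens to the vocabulary is omitted):

```python
from pytorch_pretrained_bert import OpenAIGPTTokenizer, OpenAIGPTLMHeadModel

tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')

# Hypothetical delimiter / segment tokens (they would be added to the vocabulary in practice)
BOS, EOS, SPEAKER_USER, SPEAKER_BOT = '<bos>', '<eos>', '<speaker1>', '<speaker2>'

def build_inputs(persona, history, reply):
    """persona / history / reply are lists of token lists; returns word, segment, position inputs."""
    segments = [[BOS] + sum(persona, [])] + history + [reply + [EOS]]
    words, segment_ids = [], []
    for i, segment in enumerate(segments):
        speaker = SPEAKER_BOT if i % 2 else SPEAKER_USER     # alternate speakers over the history
        words.extend(segment)
        segment_ids.extend([speaker] * len(segment))
    position_ids = list(range(len(words)))
    return words, segment_ids, position_ids
```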

203 of 238

5.C – Chit-chat with OpenAI GPT & PyTorch

203

PyTorch Hub

Last Friday, the PyTorch team soft-launched a beta version of PyTorch Hub. Let’s have a quick look.

  • PyTorch Hub is based on GitHub repositories
  • A model is shared by adding a hubconf.py script to the root of a GitHub repository
  • Both model definitions and pre-trained weights can be shared
  • More details: https://pytorch.org/hub and https://pytorch.org/docs/stable/hub.html

In our case, to use torch.hub instead of pytorch-pretrained-bert, we can simply call torch.hub.load with the path to the pytorch-pretrained-bert GitHub repository, for example:
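(torch.hub.load itself is the real API, but the 'bertTokenizer' and 'bertModel' entry-point names and their arguments below are my recollection of that repository's hubconf.py and should be treated as assumptions)

import torch

# Entry-point names assumed from the repository's hubconf.py.
tokenizer = torch.hub.load('huggingface/pytorch-pretrained-BERT',
                           'bertTokenizer', 'bert-base-cased', do_lower_case=False)
model = torch.hub.load('huggingface/pytorch-pretrained-BERT',
                       'bertModel', 'bert-base-cased')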

PyTorch Hub will fetch the model from the master branch on GitHub. This means that you don’t need to package your model for pip, and users will always get the most recent version (from the master branch).

204 of 238

Agenda

204

205 of 238

6. Open problems and future directions

205

Image credit: Yazmin Alanis

206 of 238

6. Open problems and future directions

  • Shortcomings of pretrained language models
  • Pretraining tasks
  • Tasks and task similarity
  • Continual learning and meta-learning
  • Bias

206

Image credit: Yazmin Alanis

207 of 238

Shortcomings of pretrained language models

  • Recap: LM can be seen as a general pretraining task; with enough data, compute, and capacity an LM can learn a lot.
  • In practice, many things that are less represented in text are harder to learn.
  • Pretrained language models are bad at
    • fine-grained linguistic tasks (Liu et al., NAACL 2019)
    • common sense (when you actually make it difficult; Zellers et al., ACL 2019)
    • natural language generation (maintaining long-term dependencies, relations, coherence, etc.)
    • ...
  • They tend to overfit to surface-form information when fine-tuned; ‘rapid surface learners’

207

208 of 238

Shortcomings of pretrained language models

Large, pretrained language models can be difficult to optimize.

  • Fine-tuning is often unstable and has a high variance, particularly if the target datasets are very small
  • Devlin et al. (NAACL 2019) note that the large (24-layer) version of BERT is particularly prone to degenerate performance; multiple random restarts are sometimes necessary, as also investigated in detail by Phang et al. (2018)

208

209 of 238

Shortcomings of pretrained language models

Current pretrained language models are very large.

  • Do we really need all these parameters?
  • Recent work shows that only a few of the attention heads in BERT are required (Voita et al., ACL 2019).
  • More work needed to understand model parameters.
  • Pruning and distillation are two ways to deal with this.
  • See also: the lottery ticket hypothesis (Frankle & Carbin, ICLR 2019).

209

210 of 238

Pretraining tasks

Shortcomings of the language modeling objective:

  • Not appropriate for all models
    • If we condition on more inputs, need to pretrain those parts
    • E.g. the decoder in sequence-to-sequence learning (Song et al., ICML 2019)
  • A left-to-right bias may not always be best
    • Objectives that take more context into account (such as masking) seem useful (though less sample-efficient)
    • Possible to combine different LM variants (Dong et al., 2019)
  • Weak signal for semantics and long-term context vs. strong signal for syntax and short-term word co-occurrences
    • Need incentives that promote encoding what we care about, e.g. semantics

210

211 of 238

More diverse self-supervised objectives

  • Taking inspiration from computer vision

Sampling a patch and a neighbour and predicting their spatial configuration (Doersch et al., ICCV 2015)

Image colorization (Zhang et al., ECCV 2016)

  • Self-supervision in language mostly based on word co-occurrence (Ando and Zhang, 2005)
  • Supervision on different levels of meaning
    • Discourse, document, sentence, etc.
    • Using other signals, e.g. meta-data
  • Emphasizing different qualities of language

Pretraining tasks

211

212 of 238

Pretraining tasks

Specialized pretraining tasks that teach what our model is missing

  • Develop specialized pretraining tasks that explicitly learn such relationships
  • Other pretraining tasks could explicitly learn reasoning or understanding
    • Arithmetic, temporal, causal, etc.; discourse, narrative, conversation, etc.
  • Pretrained representations could be connected in a sparse and modular way

212

213 of 238

Pretraining tasks

Need for grounded representations

  • Limits of distributional hypothesis—difficult to learn certain types of information from raw text
    • Human reporting bias: not stating the obvious (Gordon and Van Durme, AKBC 2013)
    • Common sense isn’t written down
    • Facts about named entities
    • No grounding to other modalities
  • Possible solutions:
    • Incorporate other structured knowledge (e.g. knowledge bases like ERNIE, Zhang et al 2019)
    • Multimodal learning (e.g. with visual representations like VideoBERT, Sun et al. 2019)
    • Interactive/human-in-the-loop approaches (e.g. dialog, Hancock et al. 2018)

213

214 of 238

Tasks and task similarity

Many tasks can be expressed as variants of language modeling

  • Language itself can directly be used to specify tasks, inputs, and outputs, e.g. by framing as QA (McCann et al., 2018)
  • Dialog-based learning without supervision by forward prediction (Weston, NIPS 2016)
  • NLP tasks formulated as a cloze prediction objective (Children’s Book Test, LAMBADA, Winograd, ...)
  • Triggering task behaviors via prompts, e.g. “TL;DR:” or a translation prompt (Radford, Wu et al., 2019); enables zero-shot adaptation
  • Questioning the notion of a “task” in NLP

214

215 of 238

Tasks and task similarity

  • Intuitive similarity of pretraining and target tasks (NLI, classification) correlates with better downstream performance
  • Do not have a clear understanding of when and how two tasks are similar and relate to each other
  • One way to gain more understanding: Large-scale empirical studies of transfer such as Taskonomy (Zamir et al., CVPR 2018)
  • Should be helpful for designing better and specialized pretraining tasks

215

216 of 238

Continual and meta-learning

  • Current transfer learning performs adaptation once.
  • Ultimately, we’d like to have models that continue to retain and accumulate knowledge across many tasks (Yogatama et al., 2019).
  • No distinction between pretraining and adaptation; just one stream of tasks.
  • Main challenge towards this: Catastrophic forgetting.
  • Different approaches from the literature:
    • Memory, regularization, task-specific weights, etc.

216

217 of 238

Continual and meta-learning

  • Objective of transfer learning: Learn a representation that is general and useful for many tasks.
  • This objective does not incentivize ease of adaptation (fine-tuning is often unstable) and does not teach the model how to adapt its representation.
  • Meta-learning combined with transfer learning could make this more feasible.
  • However, most existing approaches are restricted to the few-shot setting and only learn a few steps of adaptation.

217

218 of 238

Bias

  • Bias has been shown to be pervasive in word embeddings and neural models in general
  • Large pretrained models necessarily have their own sets of biases
  • There is a blurry boundary between common-sense and bias
  • We need ways to remove such biases during adaptation
  • A small fine-tuned model should be harder to misuse

218

219 of 238

Conclusion

  • Themes: words-in-context, LM pretraining, deep models
  • Pretraining gives better sample-efficiency, can be scaled up
  • Predictive of certain features—depends on how you look at it
  • Performance trade-offs, from top to bottom
  • Transfer learning is simple to implement, practically useful
  • Still many shortcomings and open problems

219

220 of 238

Questions?

If you found these slides helpful, consider citing the tutorial as:

@inproceedings{ruder2019transfer,
  title={Transfer Learning in Natural Language Processing},
  author={Ruder, Sebastian and Peters, Matthew E and Swayamdipta, Swabha and Wolf, Thomas},
  booktitle={Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials},
  pages={15--18},
  year={2019}
}

  • Twitter: #NAACLTransfer
  • Whova: “Questions for the tutorial on Transfer Learning in NLP” topic

220

221 of 238

Extra slides

221

222 of 238

Why transfer learning in NLP? (Empirically)

222

BERT + X

223 of 238

GLUE* performance over time

223

*General Language Understanding Evaluation (GLUE; Wang et al., 2019): includes 11 diverse NLP tasks

224 of 238

Pretrained Language Models: More Parameters

224

225 of 238

More word vectors

225

  • GloVe: very large scale (840B tokens), co-occurrence based. Learns linear relationships (SOTA on word analogy) (Pennington et al., 2014)
  • fastText: incorporates subword information (Bojanowski et al., 2017)

(Figure: fastText vs. skipgram)

226 of 238

SOTA sequence modeling results

Semi-supervised Sequence Modeling with Cross-View Training

226

227 of 238

Pretrain a bidirectional character-level model, extract embeddings from the first/last character

SOTA CoNLL 2003 NER results

Contextual String Embeddings

227

228 of 238

Cloze-driven Pretraining of Self-attention Networks

228

Pretraining

Fine-tuning

SOTA NER and PTB constituency parsing, ~3.3% less than BERT-large for GLUE

229 of 238

Model is jointly pretrained on three variants of LM (bidirectional, left-to-right, seq-to-seq)

SOTA on three natural language generation tasks

UniLM - Dong et al., 2019

229

230 of 238

Pretrain encoder-decoder

Masked Sequence to Sequence Pretraining (MASS)

230

231 of 238

What matters: Pretraining Objective, Encoder

Probing tasks for sentential features:

  • Bag-of-Vectors is surprisingly good at capturing sentence-level properties, thanks to redundancies in natural linguistic input.
  • BiLSTM-based models are better than CNN-based models at capturing interesting linguistic knowledge, with the same objective
  • Objective matters: training on NLI is bad. Most tasks are structured, so a seq2tree objective works best.
  • Supervised objectives for sentence embeddings do better than unsupervised ones, like SkipThought (Kiros et al., 2015)

231

232 of 238

An inspiration from Computer Vision

From lower to higher layers, information goes from general to task-specific.

232

Image credit: Distill

233 of 238

Other methods for analysis

Other analyses

  • Textual omission and multi-modal: Kadar et al., 2016
  • Adversarial Approaches
    • Adversary: an input which differs from the original just enough to change the desired prediction
      • SQuAD: Jia & Liang, 2017
      • NLI: Glockner et al., 2018; Minervini & Riedel, 2018
      • Machine Translation: Belinkov & Bisk, 2018
    • Requires identification (manual or automatic) of inputs to modify.

233

Adversarial methods

234 of 238

Analysis: Inputs and Outputs

What to analyze?

  • Embeddings
    • Word types and tokens
    • Sentence
    • Document
  • Network Activations
    • RNNs
    • CNNs
    • Feed-forward nets
  • Layers
  • Pretraining Objectives

What to look for?

  • Surface-level features
  • Lexical features
    • E.g. POS tags
  • Morphology
  • Syntactic Structure
    • Word-level
    • Sentence-level
  • Semantic Structure
    • E.g. Roles, Coreference

234

Belinkov et al. (2019)—More details in Table 1.

235 of 238

Analysis: Methods

  • Visualization:
    • 2-D plots
    • Attention mechanisms
    • Network activations

  • Model Alterations:
    • Network Erasure
    • Perturbations
  • Model Probes:
    • Surface-level features
    • Syntactic features
    • Semantic features

(Diagram: Visualization, Model Alterations, Model Probes; * not hard and fast categories)

236 of 238

Analysis / Evaluation : Adversarial Methods

Adversarial Approaches

  • How does this say what’s in a representation?
    • Roundabout: what’s wrong with a representation...

236

Credits: Jia & Liang (2017) and Percy Liang. AI Frontiers. 2018

237 of 238

Probes are simple linear / neural layers

Liu et al., NAACL 2019
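As a rough illustration (not necessarily the exact setup of Liu et al.), a linear probe in PyTorch trains only a small classifier on top of frozen pretrained features:

import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    def __init__(self, hidden_size, num_labels):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, frozen_features):           # [batch, hidden_size], no gradient to the encoder
        return self.classifier(frozen_features)   # [batch, num_labels]

# Only the probe's parameters are trained; the pretrained representations are fixed inputs.
probe = LinearProbe(hidden_size=768, num_labels=45)   # sizes are illustrative
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)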

237

238 of 238

What is still left unanswered?

  • Interpretability is difficult (Lipton et al., 2016)
    • Many variables make synthesis challenging
    • The choice of model architecture and pretraining objective determines the informativeness of representations

Transferability to downstream tasks

Interpretability is important, but not enough on its own.

Interpretability + transferability to downstream tasks is key - that’s next!