1 of 110

Transfer Learning in NLP

NLPL Winter School

Thomas Wolf - HuggingFace Inc.

1

2 of 110

Overview

  • Session 1: Transfer Learning - Pretraining and representations
  • Session 2: Transfer Learning - Adaptation and downstream tasks
  • Session 3: Transfer Learning - Limitations, open-questions, future directions

Many slides are adapted from a Tutorial on Transfer Learning in NLP I gave at NAACL 2019 with my amazing collaborators


2

Sebastian Ruder

Matthew Peters

Swabha Swayamdipta

3 of 110

Transfer Learning in NLP

NLPL Winter School

Session 2

3

4 of 110

Transfer Learning in Natural Language Processing

Transfer Learning in NLP

4

Follow along with the tutorial:

5 of 110

Agenda

5

6 of 110

4. Adaptation

6

Image credit: Ben Didier

7 of 110

4 – How to adapt the pretrained model

Several orthogonal directions we can make decisions on:

  • Architectural modifications? How much to change the pretrained model architecture for adaptation
  • Optimization schemes? Which weights to train during adaptation and following what schedule
  • More signal: weak supervision, multi-tasking & ensembling. How to get more supervision signal for the target task

8 of 110

4.1 – Architecture

Two general options:

  • Keep pretrained model internals unchanged: add classifiers on top, embeddings at the bottom, use outputs as features
  • Modify pretrained model internal architecture: initialize encoder-decoders, task-specific modifications, adapters

8

Image credit: Darmawansyah

9 of 110

4.1.A – Architecture: Keep model unchanged

General workflow:

  • Remove pretraining task head if not useful for target task
    • Example: remove softmax classifier from pretrained LM
    • Not always needed: some adaptation schemes re-use the pretraining objective/task, e.g. for multi-task learning

9

10 of 110

4.1.A – Architecture: Keep model unchanged

General workflow:

Task-specific, randomly initialized

General, pretrained

  • Add target task-specific layers on top/bottom of pretrained model
    • Simple: adding linear layer(s) on top of the pretrained model

10

11 of 110

4.1.A – Architecture: Keep model unchanged

General workflow:

  • Add target task-specific layers on top/bottom of pretrained model
    • Simple: adding linear layer(s) on top of the pretrained model
    • More complex: model output as input for a separate model
    • Often beneficial when the target task requires interactions that are not available in the pretrained embeddings

11

12 of 110

4.1.B – Architecture: Modifying model internals

Various reasons:

  • Adapting to a structurally different target task
    • Ex: Pretraining with a single input sequence (ex: language modeling) but adapting to a task with several input sequences (ex: translation, conditional generation...)
    • Use the pretrained model weights to initialize as much as possible of a structurally different target task model
    • Ex: Use monolingual LMs to initialize encoder and decoder parameters for MT (Ramachandran et al., EMNLP 2017; Lample & Conneau, 2019)

12

13 of 110

4.1.B – Architecture: Modifying model internals

Various reasons:

  • Task-specific modifications
    • Provide pretrained model with capabilities that are useful for the target task
    • Ex: Adding skip/residual connections, attention (Ramachandran et al., EMNLP 2017)

13

14 of 110

4.1.B – Architecture: Modifying model internals

  • Using fewer parameters for adaptation:
    • Fewer parameters to fine-tune
    • Can be very useful given the increasing number of model parameters
    • Ex: add bottleneck modules (“adapters”) between the layers of the pretrained model (Rebuffi et al., NIPS 2017; CVPR 2018)

Various reasons:

14

15 of 110

4.1.B – Architecture: Modifying model internals

Adapters

  • Commonly connected with a residual connection in parallel to an existing layer
  • Most effective when placed at every layer (smaller effect at bottom layers)
  • Different operations (convolutions, self-attention) possible
  • Particularly suitable for modular architectures like Transformers (Houlsby et al., ICML 2019; Stickland and Murray, ICML 2019)

15

Image credit: Caique Lima

16 of 110

4.1.B – Architecture: Modifying model internals

  • Multi-head attention (MH; shared across layers) is used in parallel with the self-attention (SA) layer of BERT
  • Both are added together and fed into a layer-norm (LN)

16

17 of 110

Hands-on #2:

Adapting our pretrained model

17

Image credit: Chanaky

18 of 110

Hands-on: Model adaptation

  • Plan
    • Start from our Transformer language model
    • Adapt the model to a target task:
      • keep the model core unchanged, load the pretrained weights
      • add a linear layer on top, newly initialized
      • use additional embeddings at the bottom, newly initialized
  • Reminder — material is here:

18

Let’s see how a simple fine-tuning scheme can be implemented with our pretrained model:

19 of 110

Adaptation task

  • We select a text classification task as the downstream task

  • TREC-6: The Text REtrieval Conference (TREC) Question Classification (Li et al., COLING 2002)
  • TREC consists of open-domain, fact-based questions divided into broad semantic categories. It contains 5,500 labeled training questions and 500 test questions with 6 labels: NUM, LOC, HUM, DESC, ENTY, ABBR

Hands-on: Model adaptation

19

Ex:

  • How did serfdom develop in and then leave Russia ? —> DESC
  • What films featured the character Popeye Doyle ? —> ENTY

Transfer learning models shine on this type of low-resource task

20 of 110

  • Modifications:
    • Keep model internals unchanged
    • Add a linear layer on top
    • Add an additional embedding (classification token) at the bottom
  • Computation flow:
    • Model input: the tokenized question with a classification token at the end
    • Extract the last hidden-state associated to the classification token
    • Pass the hidden-state in a linear layer and softmax to obtain class probabilities

Hands-on: Model adaptation

20

First adaptation scheme
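As a rough sketch of this first adaptation scheme (the pretrained `transformer` backbone, its hidden size and the classification-token mask are stand-ins for the objects defined in the Colab):

```python
import torch
import torch.nn as nn

class TransformerWithClfHead(nn.Module):
    """Pretrained Transformer backbone + newly initialized classification head."""
    def __init__(self, transformer, hidden_dim, num_classes=6):        # 6 TREC-6 classes
        super().__init__()
        self.transformer = transformer                                  # kept unchanged, pretrained
        self.classification_head = nn.Linear(hidden_dim, num_classes)   # newly initialized

    def forward(self, input_ids, clf_token_mask):
        hidden_states = self.transformer(input_ids)      # (batch, seq_len, hidden_dim)
        clf_hidden = hidden_states[clf_token_mask]       # hidden state at the classification token
        logits = self.classification_head(clf_hidden)    # softmax of these gives class probabilities
        return logits
```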

21 of 110

Let’s load and prepare our dataset:

Fine-tuning hyper-parameters:

– 6 classes in TREC-6

– Use the fine-tuning hyper-parameters from Radford et al., 2018:

  • learning rate from 6.5e-5 to 0.0
  • fine-tune for 3 epochs

Hands-on: Model adaptation

21

- trim to the transformer input size & add a classification token at the end of each sample,

- pad to the left,

- convert to tensors,

- extract a validation set.
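A minimal sketch of this preparation, assuming a hypothetical `encode` function and the token ids we added to the vocabulary (`clf_token_id`, `pad_token_id`):

```python
import torch
from torch.utils.data import TensorDataset, random_split

def build_dataset(questions, labels, encode, clf_token_id, pad_token_id, max_len=256):
    rows = []
    for q in questions:
        ids = encode(q)[:max_len - 1] + [clf_token_id]      # trim & add the classification token
        ids = [pad_token_id] * (max_len - len(ids)) + ids   # pad to the left
        rows.append(ids)
    dataset = TensorDataset(torch.tensor(rows, dtype=torch.long),
                            torch.tensor(labels, dtype=torch.long))
    n_valid = len(dataset) // 10                            # extract a validation set (10% here)
    return random_split(dataset, [len(dataset) - n_valid, n_valid])
```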

22 of 110

Adapt our model architecture

Replace the pre-training head (language modeling) with the classification head:

A linear layer, which takes as input the hidden-state of the [CLF] token (using a mask)

Keep our pretrained model unchanged as the backbone.

* Initialize all the weights of the model.

Hands-on: Model adaptation

22

* Reload common weights from the pretrained model.
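Reloading the common weights can be sketched as follows, assuming the fine-tuning model wraps the same backbone under the same parameter names:

```python
import torch

def reload_common_weights(fine_tuning_model: torch.nn.Module,
                          pretrained_model: torch.nn.Module):
    """Copy every parameter whose name matches; leave the new layers randomly initialized."""
    # strict=False skips the parameters of the new head/embeddings that have
    # no counterpart in the pretrained model.
    incompatible = fine_tuning_model.load_state_dict(pretrained_model.state_dict(),
                                                     strict=False)
    return incompatible.missing_keys   # the parameters that stay randomly initialized
```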

23 of 110

Our fine-tuning code:

We will evaluate on our validation and test sets:

* validation: after each epoch

* test: at the end

A simple training update function:

* prepare inputs: transpose and build padding & classification token masks

* we have options to clip and accumulate gradients

Schedule:

* linearly increasing to lr

* linearly decreasing to 0.0
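In spirit, the update function and schedule look like this sketch (the model, optimizer, loss and hyper-parameters are placeholders for the ones defined in the Colab):

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def make_scheduler(optimizer, warmup_steps, total_steps):
    def lr_lambda(step):
        if step < warmup_steps:                           # linearly increasing to lr
            return step / max(1, warmup_steps)
        # then linearly decreasing to 0.0
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return LambdaLR(optimizer, lr_lambda)

def update(model, optimizer, scheduler, batch, labels, loss_fn,
           step, accumulation_steps=1, max_norm=1.0):
    logits = model(batch)                                 # inputs prepared (transposed, masked) upstream
    loss = loss_fn(logits, labels) / accumulation_steps   # option: accumulate gradients
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)   # option: clip gradients
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
    return loss.item()
```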

Hands-on: Model adaptation

23

24 of 110

We can now fine-tune our model on TREC:

We are at the state-of-the-art

(ULMFiT)

Remarks:

  • The error rate goes down quickly! After one epoch we already have >90% accuracy. ⇨ Fine-tuning is highly data efficient in Transfer Learning
  • We took our pre-training & fine-tuning hyper-parameters straight from the literature on related models. ⇨ Fine-tuning is often robust to the exact choice of hyper-parameters

Hands-on: Model adaptation – Results

24

25 of 110

Let’s conclude this hands-on with a few additional words on robustness & variance.

  • Large pretrained models (e.g. BERT large) are prone to degenerate performance when fine-tuned on tasks with small training sets.
  • Observed behavior is often “on-off”: it either works very well or doesn’t work at all.
  • Understanding the conditions and causes of this behavior (models, adaptation schemes) is an open research question.

Hands-on: Model adaptation – Results

25

26 of 110

4.2 – Optimization

Several directions when it comes to the optimization itself:

  • Choose which weights we should update: feature extraction, fine-tuning, adapters
  • Choose how and when to update the weights: from top to bottom, gradual unfreezing, discriminative fine-tuning
  • Consider practical trade-offs: space and time complexity, performance

26

Image credit: ProSymbols, purplestudio, Markus, Alfredo

27 of 110

4.2.A – Optimization: Which weights?

The main question: To tune or not to tune (the pretrained weights)?

  • Do not change the pretrained weights: feature extraction, adapters
  • Change the pretrained weights: fine-tuning

27

Image credit: purplestudio

28 of 110

4.2.A – Optimization: Which weights?

Don’t touch the pretrained weights!

Feature extraction:

  • Weights are frozen

28

❄️

29 of 110

4.2.A – Optimization: Which weights?

Don’t touch the pretrained weights!

Feature extraction:

  • Weights are frozen
  • A linear classifier is trained on top of the pretrained representations

29

❄️

30 of 110

4.2.A – Optimization: Which weights?

Don’t touch the pretrained weights!

Feature extraction:

  • Weights are frozen
  • A linear classifier is trained on top of the pretrained representations
  • Don’t just use features of the top layer!
  • Learn a linear combination of layers (Peters et al., NAACL 2018, Ruder et al., AAAI 2019)

30

❄️

31 of 110

4.2.A – Optimization: Which weights?

Don’t touch the pretrained weights!

Feature extraction:

  • Alternatively, pretrained representations are used as features in a downstream model

31

32 of 110

4.2.A – Optimization: Which weights?

Don’t touch the pretrained weights!

Adapters

  • Task-specific modules that are added in between existing layers

32

33 of 110

4.2.A – Optimization: Which weights?

Don’t touch the pretrained weights!

Adapters

  • Task-specific modules that are added in between existing layers
  • Only adapters are trained

33

34 of 110

4.2.A – Optimization: Which weights?

Yes, change the pretrained weights!

Fine-tuning:

  • Pretrained weights are used as initialization for parameters of the downstream model
  • The whole pretrained architecture is trained during the adaptation phase

34

35 of 110

Hands-on #3:

Using Adapters and freezing

35

Image credit: Chanaky

36 of 110

  • Modifications:
    • add Adapters inside the backbone model: Linear ⇨ ReLU ⇨ Linear, with a skip-connection
  • As previously:
    • add a linear layer on top
    • use an additional embedding (classification token) at the bottom

Hands-on: Model adaptation

36

Second adaptation scheme: Using Adapters

  • Houlsby et al., ICML 2019

We will only train the adapters, the added linear layer and the embeddings. The other parameters of the model will be frozen.

37 of 110

Let’s adapt our model architecture

Add the adapter modules:

Bottleneck layers with 2 linear layers and a non-linear activation function (ReLU)

Hidden dimension is small: e.g. 32, 64, 256

Inherit from our pretrained model to have all the modules.

The Adapters are inserted inside skip-connections after:

  • the attention module
  • the feed-forward module
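A minimal adapter module in this spirit (the hidden and bottleneck dimensions are examples):

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-projection, ReLU, up-projection, inside a skip-connection."""
    def __init__(self, hidden_dim, bottleneck_dim=32):     # small hidden dimension, e.g. 32/64/256
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.activation = nn.ReLU()

    def forward(self, x):
        # Skip-connection: the adapter only learns a small residual correction.
        return x + self.up(self.activation(self.down(x)))
```

In the hands-on, such a module is applied inside the skip-connections that follow the attention and feed-forward blocks.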

Hands-on: Model adaptation

37

38 of 110

Now we need to freeze the portions of our model we don’t want to train.

We just indicate that no gradient is needed by setting param.requires_grad to False for the frozen parameters:

In our case we will train 25% of the parameters. The model is small and deep (many adapters) and we need to train the embeddings, so the ratio stays quite high. For a larger model this ratio would be much lower.
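A sketch of the freezing step; the substrings used to identify the trainable modules are assumptions about the parameter names:

```python
def freeze_except(model, trainable_patterns=("adapter", "classification_head", "embed")):
    """Set requires_grad=False for every parameter whose name matches none of the patterns."""
    for name, param in model.named_parameters():
        param.requires_grad = any(p in name for p in trainable_patterns)
    n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    n_total = sum(p.numel() for p in model.parameters())
    return n_trainable / n_total        # ratio of trained parameters (~25% in our case)
```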

Hands-on: Model adaptation

38

39 of 110

Results are similar to the full fine-tuning case, with the advantage of training only 25% of the full model parameters.

For a small 50M-parameter model this method is overkill ⇨ it is intended for 300M–1.5B parameter models.

We use a hidden dimension of 32 for the adapters and a learning rate ten times higher for the fine-tuning (we have added quite a lot of newly initialized parameters to train from scratch).

Hands-on: Model adaptation

39

40 of 110

4.2.B – Optimization: What schedule?

We have decided which weights to update, but in which order and how should we update them?

Motivation: We want to avoid overwriting useful pretrained information and maximize positive transfer.

Related concept: Catastrophic forgetting (McCloskey & Cohen, 1989; French, 1999): when a model forgets the task it was originally trained on.

40

Image credit: Markus

41 of 110

4.2.B – Optimization: What schedule?

A guiding principle: update from top to bottom

  • Progressively in time: freezing
  • Progressively in intensity: Varying the learning rates
  • Progressively vs. the pretrained model: Regularization

41

42 of 110

4.2.B – Optimization: Freezing

Main intuition: Training all layers at the same time on data of a different distribution and task may lead to instability and poor solutions.

Solution: Train layers individually to give them time to adapt to the new task and data.

Goes back to layer-wise training of early deep neural networks (Hinton et al., 2006; Bengio et al., 2007).

42

43 of 110

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)

43

44 of 110

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)
  • Chain-thaw (Felbo et al., EMNLP 2017): training one layer at a time
    • Train new layer

44

45 of 110

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)
  • Chain-thaw (Felbo et al., EMNLP 2017): training one layer at a time
    • Train new layer
    • Train one layer at a time

45

46 of 110

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)
  • Chain-thaw (Felbo et al., EMNLP 2017): training one layer at a time
    • Train new layer
    • Train one layer at a time

46

47 of 110

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)
  • Chain-thaw (Felbo et al., EMNLP 2017): training one layer at a time
    • Train new layer
    • Train one layer at a time

47

48 of 110

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)
  • Chain-thaw (Felbo et al., EMNLP 2017): training one layer at a time
    • Train new layer
    • Train one layer at a time
    • Train all layers

48

49 of 110

4.2.B – Optimization: Freezing

49

50 of 110

4.2.B – Optimization: Freezing

50

51 of 110

4.2.B – Optimization: Freezing

51

52 of 110

4.2.B – Optimization: Freezing

52

53 of 110

4.2.B – Optimization: Freezing

53

54 of 110

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)
  • Chain-thaw (Felbo et al., EMNLP 2017): training one layer at a time
  • Gradually unfreezing (Howard & Ruder, ACL 2018): unfreeze one layer after another
  • Sequential unfreezing (Chronopoulou et al., NAACL 2019): hyper-parameters that determine length of fine-tuning
    • Fine-tune the additional parameters for a number of epochs
    • Fine-tune the pretrained parameters without the embedding layer for a number of epochs

54

55 of 110

4.2.B – Optimization: Freezing

  • Freezing all but the top layer (Long et al., ICML 2015)
  • Chain-thaw (Felbo et al., EMNLP 2017): training one layer at a time
  • Gradually unfreezing (Howard & Ruder, ACL 2018): unfreeze one layer after another
  • Sequential unfreezing (Chronopoulou et al., NAACL 2019): hyper-parameters that determine length of fine-tuning
    • Fine-tune the additional parameters for a number of epochs
    • Fine-tune the pretrained parameters without the embedding layer for a number of epochs
    • Train all layers until convergence

55

56 of 110

4.2.B – Optimization: Freezing

Commonality: Train all parameters jointly in the end

56

57 of 110

Hands-on #4:

Using gradual unfreezing

57

Image credit: Chanaky

58 of 110

Gradual unfreezing is similar to our previous freezing process. We start by freezing the entire model except the newly added parameters:

We then gradually unfreeze one additional block as training progresses, so that the full model is trained at the end:

Find index of layer to unfreeze

Name pattern matching

Unfreezing interval
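A sketch of the unfreezing logic, with an assumed naming pattern for the Transformer blocks:

```python
def unfreeze_next_block(model, epoch, n_blocks, unfreezing_interval=1, pattern="layers.{}."):
    """Every `unfreezing_interval` epochs, unfreeze one more block, from top to bottom.

    `pattern` is an assumed naming scheme for the blocks (parameter names
    containing "layers.11.", "layers.10.", ... for a 12-block model).
    """
    n_unfrozen = min(n_blocks, epoch // unfreezing_interval + 1)
    # Indices of the blocks to unfreeze, starting from the top layer.
    to_unfreeze = {pattern.format(i) for i in range(n_blocks - n_unfrozen, n_blocks)}
    for name, param in model.named_parameters():
        if any(block in name for block in to_unfreeze):
            param.requires_grad = True
```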

Hands-on: Adaptation

58

59 of 110

Gradual unfreezing has not been investigated in detail for Transformer models ⇨ no specific hyper-parameters are advocated in the literature

Residual connections may have an impact on the method

⇨ should probably adapt LSTM hyper-parameters

Hands-on: Adaptation

59

We show simple experiments in the Colab. Better hyper-parameter settings can probably be found.

60 of 110

4.2.B – Optimization: Learning rates

Main idea: Use lower learning rates to avoid overwriting useful information.

Where and when?

  • Lower layers (capture general information)
  • Early in training (model still needs to adapt to target distribution)
  • Late in training (model is close to convergence)

60

61 of 110

4.2.B – Optimization: Learning rates

  • Discriminative fine-tuning (Howard & Ruder, ACL 2018)
    • Lower layers capture general information → use lower learning rates for lower layers

61

62 of 110

4.2.B – Optimization: Learning rates

  • Discriminative fine-tuning
  • Triangular learning rates (Howard & Ruder, ACL 2018)
    • Quickly move to a suitable region, then slowly converge over time

62

63 of 110

4.2.B – Optimization: Learning rates

  • Discriminative fine-tuning
  • Triangular learning rates (Howard & Ruder, ACL 2018)
    • Quickly move to a suitable region, then slowly converge over time
    • Also known as “learning rate warm-up”
    • Used e.g. in Transformer (Vaswani et al., NIPS 2017) and Transformer-based methods (BERT, GPT)
    • Facilitates optimization; easier to escape suboptimal local minima

63

64 of 110

4.2.B – Optimization: Regularization

Main idea: minimize catastrophic forgetting by encouraging the target model parameters to stay close to the pretrained model parameters using a regularization term.

64

65 of 110

4.2.B – Optimization: Regularization

  • Simple method: regularize the new parameters not to deviate too much from the pretrained ones (Wiese et al., CoNLL 2017):
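The exact formula on the slide is not reproduced here; one common form of this penalty (an assumption), writing $\theta$ for the target model parameters and $\theta_{\text{pretrained}}$ for the pretrained ones, is a simple L2 term:

$$\Omega = \lambda \,\lVert \theta - \theta_{\text{pretrained}} \rVert_2^2$$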

65

66 of 110

4.2.B – Optimization: Regularization

  • More advanced (elastic weight consolidation; EWC): focus on parameters that are important for the pretrained task based on the Fisher information matrix (Kirkpatrick et al., PNAS 2017):
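In its usual form (an assumption about the exact expression shown), the EWC penalty weights each parameter by its diagonal Fisher information $F_i$ estimated on the pretraining task:

$$\Omega = \sum_i \frac{\lambda}{2}\, F_i \,\big(\theta_i - \theta^{*}_{\text{pretrained},\,i}\big)^2$$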

66

67 of 110

4.2.B – Optimization: Regularization

EWC has downsides in continual learning:

  • May over-constrain parameters
  • Computational cost is linear in the number of tasks (Schwarz et al., ICML 2018)

67

68 of 110

4.2.B – Optimization: Regularization

  • If tasks are similar, we may also encourage source and target predictions to be close based on cross-entropy, similar to distillation:
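One way to write this regularizer (an assumption about the exact formula shown), using the source model's predicted distribution as a soft target for the target model:

$$\Omega = - \sum_{c} p_{\text{source}}(c \mid x)\, \log p_{\text{target}}(c \mid x)$$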

68

69 of 110

Hands-on #5:

Using discriminative learning

69

Image credit: Chanaky

70 of 110

Discriminative learning rates can be implemented in two steps in our example:

First we organize the parameters of the various layers into labelled parameter groups in the optimizer:

We can then compute the learning rate of each group depending on its label (at each training iteration):

Hyper-parameter
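A sketch of both steps (the group names and the per-layer decay factor are illustrative choices, not the Colab's exact values):

```python
import torch

def build_param_groups(model, base_lr=6.5e-5):
    """Step 1: one labelled parameter group per top-level module (e.g. 'embeddings', 'layer_0', ...)."""
    return [{"params": module.parameters(), "lr": base_lr, "name": name}
            for name, module in model.named_children()]

def set_discriminative_lrs(optimizer, base_lr, decay=2.6, n_layers=12):
    """Step 2: at each iteration, lower layers get a learning rate divided by `decay` per layer below the top."""
    for group in optimizer.param_groups:
        if group["name"].startswith("layer_"):
            depth_from_top = n_layers - 1 - int(group["name"].split("_")[-1])
            group["lr"] = base_lr / (decay ** depth_from_top)

# Usage: optimizer = torch.optim.Adam(build_param_groups(model))
```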

Hands-on: Model adaptation

70

71 of 110

4.2.C – Optimization: Trade-offs

Several trade-offs when choosing which weights to update:

  • Space complexity: task-specific modifications, additional parameters, parameter reuse
  • Time complexity: training time
  • Performance

71

Image credit: Alfredo

72 of 110

4.2.C – Optimization trade-offs: Space

Feature extraction, fine-tuning and adapters are placed on three spectra: task-specific modifications (many ↔ few), additional parameters (many ↔ few) and parameter reuse (all ↔ none).

72

73 of 110

4.2.C – Optimization trade-offs: Time

Training time: feature extraction, fine-tuning and adapters compared on a slow ↔ fast spectrum.

73

74 of 110

4.2.C – Optimization trade-offs: Performance

  • Rule of thumb: if source and target tasks are dissimilar*, use feature extraction (Peters et al., 2019)
  • Otherwise, feature extraction and fine-tuning often perform similarly
  • Fine-tuning BERT on textual similarity tasks works significantly better than feature extraction
  • Adapters achieve performance competitive with fine-tuning
  • Anecdotally, Transformers are easier to fine-tune (less sensitive to hyper-parameters) than LSTMs

*dissimilar: certain capabilities (e.g. modelling inter-sentence relations) are beneficial for the target task, but the pretrained model lacks them (more on this later)

74

75 of 110

4.3 – Getting more signal

The target task is often a low-resource task. We can often improve the performance of transfer learning by combining a diverse set of signals:

  • From fine-tuning a single model on a single adaptation task… The basics: fine-tuning the model with a simple classification objective
  • … to gathering signal from other datasets and related tasks: fine-tuning with weak supervision, multi-tasking and sequential adaptation
  • … to ensembling models: combining the predictions of several fine-tuned models

75

Image credit: Naveen

76 of 110

4.3.A – Getting more signal: Basic fine-tuning

Simple example of fine-tuning on a text classification task:

  • Extract a single fixed-length vector from the model: hidden state of the first/last token or mean/max of the hidden states
  • Project to the classification space with an additional classifier
  • Train with a classification objective

76

77 of 110

4.3.B – Getting more signal: Related datasets/tasks

  • Sequential adaptation: intermediate fine-tuning on related datasets and tasks
  • Multi-task fine-tuning with related tasks: such as NLI tasks in GLUE
  • Dataset slicing: when the model consistently underperforms on particular slices of the data
  • Semi-supervised learning: use unlabelled data to improve model consistency

77

78 of 110

4.3.B – Getting more signal: Sequential adaptation

Fine-tuning on related high-resource dataset

  • Fine-tune model on related task with more data

78

79 of 110

4.3.B – Getting more signal: Sequential adaptation

Fine-tuning on related high-resource dataset

  • Fine-tune model on related task with more data
  • Fine-tune model on target task
  • Helps particularly for tasks with limited data and similar tasks (Phang et al., 2018)
  • Improves sample complexity on target task (Yogatama et al., 2019)

79

80 of 110

4.3.B – Getting more signal: Multi-task fine-tuning

Fine-tune the model jointly on related tasks

  • For each optimization step, sample a task and a batch for training.
  • Train via multi-task learning for a couple of epochs.

80

81 of 110

4.3.B – Getting more signal: Multi-task fine-tuning

Fine-tune the model jointly on related tasks

  • For each optimization step, sample a task and a batch for training.
  • Train via multi-task learning for a couple of epochs.
  • Fine-tune on the target task only for a few epochs at the end.

81

82 of 110

4.3.B – Getting more signal: Multi-task fine-tuning

Fine-tune the model with an unsupervised auxiliary task

  • Language modelling is a related task!
  • Fine-tuning the LM helps adapt the pretrained parameters to the target dataset.
  • Helps even without pretraining (Rei et al., ACL 2017)
  • Can optionally anneal ratio (Chronopoulou et al., NAACL 2019)
  • Used as a separate step in ULMFiT

82

83 of 110

4.3.B – Getting more signal: Dataset slicing

Use auxiliary heads that are trained only on particular subsets of the data

  • Analyze errors of the model
  • Use heuristics to automatically identify challenging subsets of the training data
  • Train auxiliary heads jointly with main head

See also Massive Multi-task Learning with Snorkel MeTaL

83

84 of 110

4.3.B – Getting more signal: Semi-supervised learning

Can be used to make model predictions more consistent using unlabelled data

  • Main idea: Minimize distance between predictions on original input and perturbed input
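A minimal consistency-loss sketch (assuming `model` returns class logits and `perturb` is any perturbation function):

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, x_unlabeled, perturb):
    """Minimize the distance between predictions on the original and the perturbed input."""
    with torch.no_grad():                                   # original prediction used as the target
        p_original = F.softmax(model(x_unlabeled), dim=-1)
    log_p_perturbed = F.log_softmax(model(perturb(x_unlabeled)), dim=-1)
    # KL divergence is one possible distance between the two predictions.
    return F.kl_div(log_p_perturbed, p_original, reduction="batchmean")
```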

84

85 of 110

4.3.B – Getting more signal: Semi-supervised learning

Can be used to make model predictions more consistent using unlabelled data

  • Perturbation can be noise, masking (Clark et al., EMNLP 2018), data augmentation, e.g. back-translation (Xie et al., 2019)

85

86 of 110

4.3.C – Getting more signal: Ensembling

Reaching the state-of-the-art by ensembling independently fine-tuned models

  • Ensembling models: combining the predictions of models fine-tuned with various hyper-parameters
  • Knowledge distillation: distill an ensemble of fine-tuned models into a single smaller model

86

87 of 110

4.3.C – Getting more signal: Ensembling

Models fine-tuned...

  • on different tasks
  • on different dataset-splits
  • with different parameters (dropout, initializations…)
  • from variants of pre-trained models (e.g. cased/uncased)

87

Combining the predictions of models fine-tuned with various hyper-parameters.

88 of 110

4.3.C – Getting more signal: Distilling

  • Knowledge distillation: train a student model on soft targets produced by the teacher (the ensemble)
  • Relative probabilities of the teacher labels contain information about how the teacher generalizes

88

Distilling ensembles of large models back in a single model
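A standard form of the distillation objective (following Hinton et al., 2015; the temperature is a hyper-parameter):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Train the student on the teacher's soft targets."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # Cross-entropy between teacher and student distributions,
    # scaled by T^2 to keep gradient magnitudes comparable (Hinton et al., 2015).
    return -(soft_targets * log_probs).sum(dim=-1).mean() * temperature ** 2
```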

89 of 110

Hands-on #6:

Using multi-task learning

89

Image credit: Chanaky

90 of 110

Multitasking with a classification loss + language modeling loss.

Create two heads:

– language modeling head

– classification head

Total loss is a weighted sum of

– language modeling loss and

– classification loss
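A sketch of the combined objective (the criteria are ordinary cross-entropy losses; the coefficients match the ones mentioned on the next slide):

```python
import torch

def multitask_loss(lm_logits, lm_labels, clf_logits, clf_labels,
                   lm_criterion, clf_criterion, lm_coef=0.5, clf_coef=1.0):
    lm_loss = lm_criterion(lm_logits.view(-1, lm_logits.size(-1)), lm_labels.view(-1))
    clf_loss = clf_criterion(clf_logits, clf_labels)
    # Total loss is a weighted sum of the language modeling and classification losses.
    return clf_coef * clf_loss + lm_coef * lm_loss
```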

Hands-on: Multi-task learning

90

91 of 110

Multi-tasking helped us improve over single-task full-model fine-tuning!

We use a coefficient of 1.0 for the classification loss and 0.5 for the language modeling loss, and fine-tune a little longer (6 epochs instead of 3, as the validation loss was still decreasing).

Hands-on: Multi-task learning

91

92 of 110

Agenda

92

93 of 110

5. Downstream applications

Hands-on examples

93

Image credit: Fahmi

94 of 110

5. Downstream applications - Hands-on examples

In this section we will explore downstream applications and practical considerations along two orthogonal directions:

  • What are the various applications of transfer learning in NLP: document/sequence classification, token-level classification, structured prediction and language generation
  • How to leverage several frameworks & libraries for practical applications: TensorFlow, PyTorch, Keras and third-party libraries like fast.ai, HuggingFace...

94

95 of 110

Practical considerations

Frameworks & libraries: practical considerations

95

  • Pretraining large-scale models is costly: use open-source models and share your pretrained models

“Energy and Policy Considerations for Deep Learning in NLP” - Strubell, Ganesh, McCallum - ACL 2019

  • Sharing/accessing pretrained models
    • Hubs: Tensorflow Hub, PyTorch Hub
    • Author-released checkpoints: e.g. BERT, GPT...
    • Third-party libraries: AllenNLP, fast.ai, HuggingFace
  • Design considerations
    • Hubs/libraries:
      • Simple to use but can be difficult to modify model internal architecture
    • Author released checkpoints:
      • More difficult to use but you have full control over the model internals

96 of 110

5. Downstream applications - Hands-on examples

  • Sequence and document level classification. Hands-on: document level classification (fast.ai)
  • Token level classification. Hands-on: question answering (Google BERT & TensorFlow/TF Hub)
  • Language generation. Hands-on: dialog generation (OpenAI GPT & HuggingFace/PyTorch Hub)

96

Icons credits: David, Susannanova, Flatart, ProSymbols

97 of 110

5.A – Sequence & document level classification

Transfer learning for document classification using the fast.ai library.

  • Target task: IMDB, a binary sentiment classification dataset containing 25k highly polar movie reviews for training, 25k for testing and additional unlabeled data. http://ai.stanford.edu/~amaas/data/sentiment/
  • Fast.ai has in particular:
    • a pre-trained English model available for download
    • a standardized data block API
    • easy access to standard datasets like IMDB
  • Fast.ai is based on PyTorch

97

98 of 110

fast.ai provides many high-level APIs out of the box for vision, text, tabular data and collaborative filtering.

DataBunch for the language model and the classifier

Load IMDB dataset & inspect it.

Load an AWD-LSTM (Merity et al., 2017) pretrained on WikiText-103 & fine-tune it on IMDB using the language modeling loss.

fast.ai then provides all the high-level modules needed to quickly set up a transfer learning experiment.

5.A – Document level classification using fast.ai

98

The library is designed for speed of experimentation, e.g. by importing all necessary modules at once in interactive computing environments, like:

99 of 110

Now we fine-tune in two steps:

Once we have a fine-tuned language model (AWD-LSTM), we can create a text classifier by adding a classification head with:

– a layer to concatenate the final outputs of the RNN with the maximum and average of all the intermediate outputs (along the sequence length)

– two blocks of nn.BatchNorm1d ⇨ nn.Dropout ⇨ nn.Linear ⇨ nn.ReLU with a hidden dimension of 50.

5.A – Document level classification using fast.ai

99

1. train the classification head only while keeping the language model frozen, and

2. fine-tune the whole architecture.
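The whole recipe, sketched with the fastai v1 text API of the time (the dataset helper and hyper-parameters here are placeholders, not the notebook's exact values):

```python
from fastai.text import *   # fastai v1-style star import

path = untar_data(URLs.IMDB_SAMPLE)              # small IMDB sample for a quick run
data_lm = TextLMDataBunch.from_csv(path, 'texts.csv')
data_clas = TextClasDataBunch.from_csv(path, 'texts.csv', vocab=data_lm.train_ds.vocab)

# Fine-tune the WikiText-103 pretrained AWD-LSTM on IMDB with the LM objective.
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)
learn_lm.fit_one_cycle(1, 1e-2)
learn_lm.save_encoder('ft_enc')

# 1. Train the classification head only (the encoder stays frozen by default).
learn_clas = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_clas.load_encoder('ft_enc')
learn_clas.fit_one_cycle(1, 1e-2)

# 2. Fine-tune the whole architecture, with discriminative learning rates.
learn_clas.unfreeze()
learn_clas.fit_one_cycle(1, slice(1e-5, 1e-3))
```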

100 of 110

5.B – Token level classification: BERT & Tensorflow

Transfer learning for token level classification: Google’s BERT in TensorFlow.

  • Target task: SQuAD, a question answering dataset. https://rajpurkar.github.io/SQuAD-explorer/
  • In this example we will directly use a Tensorflow checkpoint
    • Example: https://github.com/google-research/bert
    • We use the usual Tensorflow workflow: create model graph comprising the core model and the added/modified elements
    • Take care of variable assignments when loading the checkpoint

100

101 of 110

Let’s adapt BERT to the target task.

Replace the pre-training head (language modeling) with a classification head:

a linear projection layer to estimate 2 probabilities for each token:

– being the start of an answer

– being the end of an answer.

Keep our core model unchanged.

5.B – SQuAD with BERT & Tensorflow

101

102 of 110

Load our pretrained checkpoint

To load our checkpoint, we just need to set up an assignment_map from the variables of the checkpoint to the model variables, keeping only the variables present in the model.

And we can use

tf.train.init_from_checkpoint
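A sketch of this checkpoint-loading step with the TF 1.x API (the checkpoint path is a placeholder, and the graph is assumed to already contain the BERT variables plus our new SQuAD head):

```python
import tensorflow as tf   # TF 1.x, as in the original BERT repository

init_checkpoint = "/path/to/bert_model.ckpt"      # placeholder path

tvars = tf.trainable_variables()
ckpt_var_names = {name for name, _ in tf.train.list_variables(init_checkpoint)}

# Map each checkpoint variable to the graph variable with the same name,
# keeping only the variables that actually exist in the model.
assignment_map = {name: name
                  for name in (v.name.split(":")[0] for v in tvars)
                  if name in ckpt_var_names}

tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
```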

5.B – SQuAD with BERT & Tensorflow

102

103 of 110

TensorFlow-Hub

TensorFlow Hub is a library for sharing machine learning models as self-contained pieces of TensorFlow graph with their weights and assets.

Working directly with TensorFlow requires you to have access to, and include in your code, the full code of the pretrained model.

Modules are automatically downloaded and cached when instantiated.

Each time a module m is called e.g. y = m(x), it adds operations to the current TensorFlow graph to compute y from x.
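A minimal TF Hub example in this TF 1.x style (the module handle below is just one example of a text-embedding module):

```python
import tensorflow as tf
import tensorflow_hub as hub

# The module is downloaded and cached when instantiated.
m = hub.Module("https://tfhub.dev/google/nnlm-en-dim128/1")

# Calling the module adds ops to the current graph to compute y from x.
x = tf.constant(["The quick brown fox", "jumped over the lazy dog"])
y = m(x)

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    print(sess.run(y).shape)    # (2, 128)
```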

5.B – SQuAD with BERT & Tensorflow

103

104 of 110

TensorFlow Hub hosts a nice selection of pretrained models for NLP.

TensorFlow Hub can also be used with Keras, in the same way we saw in the BERT example.

The main limitations of Hubs are:

  • No access to the source code of the model (black-box)
  • Not possible to modify the internals of the model (e.g. to add Adapters)

5.B – SQuAD with BERT & Tensorflow

104

105 of 110

5.C – Language Generation: OpenAI GPT & PyTorch

Transfer learning for language generation: OpenAI GPT and HuggingFace library.

  • Target task: ConvAI2 – The 2nd Conversational Intelligence Challenge for training and evaluating models for non-goal-oriented dialogue systems, i.e. chit-chat. http://convai.io
  • HuggingFace library of pretrained models
    • a repository of large scale pre-trained models with BERT, GPT, GPT-2, Transformer-XL
    • provides an easy way to download, instantiate and train pre-trained models in PyTorch
  • HuggingFace’s models are now also accessible using PyTorch Hub

105

106 of 110

A dialog generation task:

5.C – Chit-chat with OpenAI GPT & PyTorch

106

Language generation tasks are close to the language modeling pre-training objective, but:

  • Language modeling pre-training involves a single input: a sequence of words.
  • In a dialog setting, several types of context are provided to generate an output sequence:
    • knowledge base: persona sentences,
    • history of the dialog: at least the last utterance from the user,
    • tokens of the output sequence that have already been generated.

How should we adapt the model?

107 of 110

Golovanov, Kurbanov, Nikolenko, Truskovskyi, Tselousov and Wolf, ACL 2019

5.C – Chit-chat with OpenAI GPT & PyTorch

107

Several options:

  • Duplicate the model to initialize an encoder-decoder structure (e.g. Lample & Conneau, 2019)
  • Use a single model with concatenated inputs (see e.g. Wolf et al., 2019; Khandelwal et al., 2019)

Concatenate the various contexts separated by delimiters and add position and segment embeddings.

108 of 110

5.C – Chit-chat with OpenAI GPT & PyTorch

108

Let’s import pretrained versions of OpenAI GPT tokenizer and model.

Now most of the work is about preparing the inputs for the model.

Then train our model using the pretraining language modeling objective.

And add a few new tokens to the vocabulary

We organize the contexts in segments

Add delimiter at the extremities of the segments

And build our word, position and segment inputs for the model.
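A toy sketch of this input construction (the token strings, delimiters and segment assignment are illustrative; the Colab works with token ids):

```python
# Hypothetical contexts for one training example.
persona = ["i", "like", "playing", "football", "."]
history = ["hello", "how", "are", "you", "?"]
reply   = ["i", "am", "fine", "thanks", "."]

bos, eos, speaker1, speaker2 = "<bos>", "<eos>", "<speaker1>", "<speaker2>"

# Concatenate the contexts, separated by delimiter tokens.
words = [bos] + persona + [speaker1] + history + [speaker2] + reply + [eos]
# One segment token per input token, marking which context it belongs to
# (the exact segment scheme here is an illustrative choice).
segments = ([speaker2] * (len(persona) + 1)
            + [speaker1] * (len(history) + 1)
            + [speaker2] * (len(reply) + 2))
positions = list(range(len(words)))
assert len(words) == len(segments) == len(positions)
```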

109 of 110

5.C – Chit-chat with OpenAI GPT & PyTorch

109

PyTorch Hub

Last Friday, the PyTorch team soft-launched a beta version of PyTorch Hub. Let’s have a quick look.

  • PyTorch Hub is based on GitHub repositories
  • A model is shared by adding a hubconf.py script to the root of a GitHub repository
  • Both model definitions and pre-trained weights can be shared
  • More details: https://pytorch.org/hub and https://pytorch.org/docs/stable/hub.html

In our case, to use torch.hub instead of pytorch-pretrained-bert, we can simply call torch.hub.load with the path to the pytorch-pretrained-bert GitHub repository:
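For example (the entry-point names below are assumptions; torch.hub.list shows the ones actually exposed by the repository's hubconf.py):

```python
import torch

# List the entry points exposed by the repository.
print(torch.hub.list('huggingface/pytorch-pretrained-BERT'))

# Load a tokenizer and a model through those entry points
# (replace 'bertTokenizer' / 'bertModel' with the ones you need).
tokenizer = torch.hub.load('huggingface/pytorch-pretrained-BERT',
                           'bertTokenizer', 'bert-base-cased', do_lower_case=False)
model = torch.hub.load('huggingface/pytorch-pretrained-BERT',
                       'bertModel', 'bert-base-cased')
```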

PyTorch Hub will fetch the model from the master branch on GitHub. This means that you don’t need to package your model (pip) & users will always access the most recent version (master).

110 of 110

That’s all for this time

110

Image credit: Andrejs Kirma