1 of 44

Transfer Learning and tools for Conversational Agents

Thomas Wolf - HuggingFace Inc.


2 of 44

Hugging Face Inc.


3 of 44

Hugging Face: Democratizing NLP

  • Core research goals:
    • For many: intelligence as making sense of data
    • For us: intelligence as creativity, interaction, adaptability
  • Started with Conversational AI (text/image/sound interaction):
    • Neural Language Generation in a Conversational AI game
    • Product used by more than 3M users, 600M+ messages exchanged
    • With a science team that:
      • Conducted research in dialog systems
      • Built open-source tools


4 of 44

Transfer Learning for Language Generation

A dialog generation task:


More complex adaptation:

5 of 44

Transfer Learning for Language Generation

The Conversational Intelligence Challenge 2

« ConvAI2 »

(NIPS 2018 competition)


Final Automatic Evaluation Leaderboard (hidden test set)

6 of 44

Hugging Face: Democratizing NLP

  • Develop & open-source tools for Transfer Learning in NLP
  • We want to accelerate, catalyse and democratize research-level work in Natural Language Understanding as well as Natural Language Generation


7 of 44

Democratizing NLP – sharing knowledge, code, data

  • Knowledge sharing
    • NAACL 2019 / EMNLP 2020 Tutorial (Transfer Learning / Neural Language Generation)
    • Workshop NeuralGen 2019 (Language Generation with Neural Networks)
    • Workshop SustaiNLP 2020 (environmentally/computationally friendly NLP)
    • EurNLP Summit (European NLP summit)
  • Code & model sharing: Open-sourcing the “right way”
    • Two extremes: 1,000-command research code ⟺ 1-command production code
      • To reach the widest community, our goal is to sit right in the middle
    • Breaking barriers
      • Researchers / Practitioners
      • PyTorch / TensorFlow
    • Speeding up and fueling research in Natural Language Processing
      • Make people stand on the shoulders of giants


8 of 44

Libraries


9 of 44

Transformers library

We’ve built an opinionated framework providing state-of-the-art general-purpose tools for Natural Language Understanding and Generation.

Features:

  • Super easy to use – fast to onboard
  • For everyone – NLP researchers, practitioners, educators
  • State-of-the-art performance – on both NLU and NLG tasks
  • Reduced compute costs/footprint – 30+ pretrained models in 100+ languages
  • Deep interoperability between TensorFlow 2.0 and PyTorch


10 of 44

Transformers library


11 of 44

Transformers library: code example
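A minimal sketch of typical usage (assuming a recent version of the library and bert-base-uncased as an example checkpoint, not necessarily the snippet shown on the slide):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load a pretrained tokenizer and model from the model hub
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize a sentence and run it through the model
inputs = tokenizer("Jim Henson was a puppeteer", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# First output: hidden states of shape (batch size, sequence length, hidden size)
last_hidden_states = outputs[0]
print(last_hidden_states.shape)
```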


💥 Check it out at 💥 https://github.com/huggingface/transformers

12 of 44

Transformers: model hub


💥 Check it out at 💥 huggingface.co

13 of 44


14 of 44

Tokenizers library

Now that neural nets have fast implementations, a bottleneck in Deep-Learning based NLP pipelines is often tokenization: converting strings ➡️ model inputs.

We have just released 🤗Tokenizers: ultra-fast & versatile tokenization

Features:

  • Encode 1 GB of text in about 20 seconds
  • BPE / byte-level BPE / WordPiece / SentencePiece...
  • Bindings in Python/JS/Rust…
  • Link: https://github.com/huggingface/tokenizers
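As an illustrative sketch (not taken from the slides; it assumes a recent version of the library and a local corpus.txt file), training a byte-pair-encoding tokenizer from scratch and encoding a string could look like this:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build and train a BPE tokenizer from scratch on a local text file
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Encode a string into tokens and vocabulary indices
encoding = tokenizer.encode("Jim Henson was a puppeteer")
print(encoding.tokens)
print(encoding.ids)
```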


15 of 44

Datasets library

The full data-processing pipeline goes beyond tokenization and models to include data access and preprocessing at the beginning and model evaluation at the end.

We have recently released a new library 🤗Datasets to improve the situation on both ends of the pipeline.


Pipeline diagram: Data → Tokenization → Prediction → Metrics, covered respectively by the Datasets, Tokenizers, Transformers and Datasets libraries.

16 of 44

Datasets library

Datasets is a lightweight and extensible library to easily access and process datasets and evaluation metrics for Natural Language Processing (NLP).

Features:

  • One-line access to 150+ datasets and metrics – open/collaborative hub
  • Built-in interoperability with NumPy, pandas, PyTorch and TensorFlow 2
  • Lightweight and fast with a transparent and Pythonic API
  • Thrives on large datasets: Wikipedia (18 GB) takes only 9 MB of RAM when memory-mapped
  • Smart caching: never wait for your data to be processed several times
  • Link: https://github.com/huggingface/datasets


17 of 44

Datasets: code example
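A minimal sketch of loading a dataset and a metric in one line each (assuming GLUE/MRPC as an example; the exact snippet on the slide may differ):

```python
from datasets import load_dataset, load_metric

# One-line download, caching and memory-mapping of a dataset from the hub
dataset = load_dataset("glue", "mrpc", split="train")
print(dataset[0])        # a single example as a plain Python dict
print(dataset.features)  # the typed schema of the dataset

# One-line access to the matching evaluation metric
metric = load_metric("glue", "mrpc")
```

Note that load_metric reflects the library's API at the time of the talk; metric loading later moved to a separate library.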


💥 Check it out at 💥 https://github.com/huggingface/datasets

18 of 44

Datasets: datasets hub


💥 Check it out at 💥 huggingface.co

19 of 44

Tools for generation


20 of 44

Decoding methods for language generation with Transformers

Since February 2020 (v2.4.0), the Transformers library includes a method to generate from any model that has output embeddings, supporting many decoding methods.
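As a rough illustration (assuming a recent library version and GPT-2 as an example model; the prompt is made up), the main decoding strategies are selected through keyword arguments of this generation method:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer.encode("The conversational agent replied that", return_tensors="pt")

# Greedy decoding: always pick the single most likely next token
greedy_output = model.generate(input_ids, max_length=40)

# Beam search: keep the num_beams most likely hypotheses at each step
beam_output = model.generate(input_ids, max_length=40, num_beams=5,
                             no_repeat_ngram_size=2, early_stopping=True)

# Top-k / nucleus (top-p) sampling: sample from a truncated next-token distribution
sample_output = model.generate(input_ids, max_length=40, do_sample=True,
                               top_k=50, top_p=0.95)

print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
```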


21 of 44

Decoding methods for language generation with Transformers

Beam search reduces the risk of missing hidden high-probability word sequences by keeping the num_beams most likely hypotheses at each time step and eventually choosing the hypothesis with the highest overall probability.


22 of 44

Decoding methods for language generation with Transformers

Beam search reduces the risk of missing hidden high-probability word sequences.


23 of 44

Decoding methods for language generation with Transformers

In open-ended generation, beam search might not be the best option:

  • Beam search works well when the length of the desired generation is more or less predictable (machine translation, summarization…)
  • Beam search suffers from repetitive generation. This is hard to control in story generation: there is a trade-off between forcing "no repetition" and repeating cycles of identical n-grams
  • Ari Holtzman et al. (2019): high-quality human language does not follow a distribution of high-probability next words. Humans want generated text to surprise them, not to be boring/predictable.


24 of 44

Decoding methods for language generation with Transformers


25 of 44

Decoding methods for language generation


26 of 44

Decoding methods for language generation

For more information:


27 of 44

Thanks for listening!


28 of 44

Concepts

What is Transfer Learning?


29 of 44

What is Transfer Learning?

Adapted from NAACL 2019 Tutorial: https://tinyurl.com/NAACLTransfer


30 of 44

Sequential Transfer Learning

Learn on one task/dataset, transfer to another task/dataset


Diagram: Pretraining (the computationally intensive step) yields a general-purpose model – word2vec, GloVe, skip-thought, InferSent, ELMo, ULMFiT, GPT, BERT, DistilBERT, … – which is then adapted to downstream tasks such as text classification, word labeling, question answering, ...

31 of 44

Training: The rise of language modeling pretraining

Many currently successful pretraining approaches are based on language modeling: learning to predict Pϴ(text) or Pϴ(text | other text)
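Concretely, for left-to-right language modeling, the model with parameters θ is trained to maximize the probability it assigns to observed text, factorized token by token:

```latex
P_\theta(\text{text}) = \prod_{t=1}^{T} P_\theta(w_t \mid w_1, \dots, w_{t-1})
```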

Advantages:

  • Doesn’t require human annotation – self-supervised
  • Many languages have enough text to learn high-capacity models
  • Versatile – can be used to learn both sentence and word representations with a variety of objective functions


32 of 44

Pretraining Transformers models (BERT, GPT…)


33 of 44

Sequential Transfer Learning

Learn on one task/dataset, transfer to another task/dataset


Diagram (continued): Pretraining (the computationally intensive step) yields a general-purpose model – word2vec, GloVe, skip-thought, InferSent, ELMo, ULMFiT, GPT, BERT, DistilBERT, … – and adaptation is the data-efficient step that turns it into a task-specific, high-performance model for text classification, word labeling, question answering, ...

34 of 44

Model: Adapting for target task

General workflow:

  1. Remove the pretraining task head (if not used for the target task)
  2. Add target task-specific elements on top/bottom:
     - simple: linear layer(s)
     - complex: a full LSTM on top

Sometimes very complex: Adapting to a structurally different task

Ex: Pretraining with a single input sequence and adapting to a task with several input sequences (e.g. translation, conditional generation...)
  ➯ Use the pretrained model to initialize as much as possible of the target model
  ➯ Ramachandran et al., EMNLP 2017; Lample & Conneau, 2019
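Going back to the simple case, a minimal sketch of adding a linear task head on top of a pretrained encoder (assuming a recent version of Transformers and bert-base-uncased as an example; the class name is made up for illustration):

```python
import torch.nn as nn
from transformers import AutoModel

class TextClassifier(nn.Module):
    """Pretrained encoder body + a new, randomly initialized task-specific head."""
    def __init__(self, model_name="bert-base-uncased", num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)  # reused pretrained weights
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)  # new head

    def forward(self, input_ids, attention_mask=None):
        outputs = self.encoder(input_ids, attention_mask=attention_mask)
        cls_vector = outputs[0][:, 0]  # hidden state of the first token
        return self.head(cls_vector)   # one score per class
```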


35 of 44

Downstream Tasks and Model Adaptation: Quick Examples


36 of 44

A – Transfer Learning for text classification


Diagram: the tokenizer splits "Jim Henson was a puppeteer" into word pieces (Jim, Henson, was, a, puppet, ##eer) and converts them to vocabulary indices; the pretrained model maps these to hidden-state vectors; an adaptation head (classifier) turns them into output scores, e.g. True: 0.7886, False: -0.223.
37 of 44

A – Transfer Learning for text classification

Remarks:

  • The error rate goes down quickly! After one epoch we already have >90% accuracy.
    ⇨ Fine-tuning is highly data-efficient in transfer learning
  • We took our pre-training & fine-tuning hyper-parameters straight from the literature on related models.
    ⇨ Fine-tuning is often robust to the exact choice of hyper-parameters
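For reference, a fine-tuning run of this kind could be sketched as follows (illustrative only: the dataset, model name and hyper-parameters are assumptions, not the ones behind the numbers above):

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Tokenize a small slice of a classification dataset
dataset = load_dataset("imdb", split="train[:2000]")
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length"),
                      batched=True)

# One epoch of fine-tuning with hyper-parameters taken from the literature
args = TrainingArguments(output_dir="finetuned", num_train_epochs=1,
                         per_device_train_batch_size=16, learning_rate=2e-5)
trainer = Trainer(model=model, args=args, train_dataset=dataset)
trainer.train()
```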


38 of 44

Trends and limits of Transfer Learning in NLP

38

39 of 44

Model size and Computational efficiency

  • Recent trends
    • Going big on model size: over 1 billion parameters has become the norm for SOTA models


Examples: Google GShard (600B parameters), GPT-3 (175B parameters).

40 of 44

Model size and Computational efficiency

Why is this a problem?

  • Narrowing the research competition field
    • What is the place of academia in today's NLP? Fine-tuning? Analysis and BERTology? Critique?
  • Environmental costs

  • Is bigger-is-better a scientific research program?


“Energy and Policy Considerations for Deep Learning in NLP” - Strubell, Ganesh, McCallum - ACL 2019

41 of 44

Model size and Computational efficiency

Reducing the size of a pretrained model

Three main techniques are currently being investigated:

  • Distillation – DistilBERT: 95% of BERT's performance in a model 40% smaller and 60% faster
  • Pruning
  • Quantization – from FP32 to INT8
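As a concrete illustration of the last point, dynamic quantization of a Transformer's linear layers with PyTorch can be sketched as follows (a sketch under assumed settings, not the exact recipe behind the numbers above):

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

# Dynamic quantization: Linear-layer weights are stored in INT8,
# activations are quantized on the fly at inference time
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```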


42 of 44

The generalization problem:

  • Models are brittle: fail when text is modified, even with meaning preserved
  • Models are spurious: memorize artifacts and biases instead of truly learning

(Figure: side-by-side examples of brittle and spurious model behavior)

Robin Jia and Percy Liang, “Adversarial Examples for Evaluating Reading Comprehension Systems,” arXiv:1707.07328, 2017, http://arxiv.org/abs/1707.07328

R. Thomas McCoy, Junghyun Min, and Tal Linzen, “BERTs of a Feather Do Not Generalize Together: Large Variability in Generalization across Models with Similar Test Set Performance,” arXiv:1911.02969, 2019, http://arxiv.org/abs/1911.02969

  • The inductive bias question


43 of 44

Shortcomings of language modeling in general

Need for grounded representations

  • Limits of the distributional hypothesis: certain types of information are difficult to learn from raw text
    • Human reporting bias: not stating the obvious (Gordon and Van Durme, 2013)
    • Common sense isn’t written down
    • Facts about named entities
    • No connection to other modalities (image, audio…)
  • Possible solutions:
    • Incorporate structured knowledge (e.g. databases - ERNIE: Zhang et al 2019)
    • Multimodal learning (e.g. visual representations - VideoBERT: Sun et al. 2019)
    • Interactive/human-in-the-loop approaches (e.g. dialog: Hancock et al. 2018)


44 of 44

Current transfer learning performs adaptation once.

  • Ultimately, we’d like to have models that continue to retain and accumulate knowledge across many tasks (Yogatama et al., 2019).
  • No distinction between pretraining and adaptation; just one stream of tasks.

Main challenge:

Catastrophic forgetting.

Different approaches from the literature:

  • Memory
  • Regularization
  • Task-specific weights, etc.

  • Continual and meta-learning
