Transfer Learning in NLP
Daniel Pressel
Interactions LLC
International Summer School on Deep Learning 2019
Deep Learning has transformed NLP
Dependency Parsing
Arc-Standard Parsing
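A minimal sketch of the arc-standard transition system: a stack, a buffer, and three actions (SHIFT, LEFT-ARC, RIGHT-ARC) that assemble a dependency tree. The toy sentence and the hand-written transition sequence are purely illustrative.

```python
# Minimal sketch of the arc-standard transition system (illustrative only).
def arc_standard(words, transitions):
    """Apply a sequence of SHIFT / LEFT_ARC / RIGHT_ARC transitions."""
    stack, buffer, arcs = [], list(range(len(words))), []
    for t in transitions:
        if t == "SHIFT":
            stack.append(buffer.pop(0))
        elif t == "LEFT_ARC":            # top of stack becomes head of second-from-top
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif t == "RIGHT_ARC":           # second-from-top becomes head of top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs

# Toy sentence and a hand-written (hypothetical) oracle transition sequence.
words = ["She", "ate", "fish"]
trans = ["SHIFT", "SHIFT", "LEFT_ARC", "SHIFT", "RIGHT_ARC"]
print(arc_standard(words, trans))   # [(1, 0), (1, 2)]: ate->She, ate->fish
```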
DNNs improve transition parsing!
The Changing Landscape of Dependency Parsing
Before Chen and Manning 2014*
After Chen and Manning 2014**
* Chen and Manning, 2014
** Fernández-González and Gómez-Rodríguez, 2019
Transfer Learning has Transformed Deep Learning for NLP!
*Devlin et al., 2019
Named Entity Recognition (NER)
My name is Dan Pressel and I live in the US
O O O B-PER I-PER O O O O O B-LOC
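A small sketch of how BIO tags like the ones above are decoded back into entity spans; the helper `bio_to_spans` is illustrative, not taken from any particular library.

```python
# Sketch: recover (type, start, end) spans from BIO tags.
def bio_to_spans(tags):
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel to flush the last span
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                spans.append((etype, start, i))
                start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans

tokens = "My name is Dan Pressel and I live in the US".split()
tags = "O O O B-PER I-PER O O O O O B-LOC".split()
print([(t, tokens[s:e]) for t, s, e in bio_to_spans(tags)])
# [('PER', ['Dan', 'Pressel']), ('LOC', ['US'])]
```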
Named Entity Recognition (NER) before DNNs
Named Entity Recognition (NER) after DNNs
DNNs and Transfer Learning are Helping!
*Ruder, Peters, Swayamdipta & Wolf, NAACL, 2019
SotA in NLP 2019
First Some Background
“You shall know a word by the company it keeps” - J.R. Firth, 1957
Word2vec objectives
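A rough sketch of the skip-gram negative-sampling objective for one (center, context) pair, assuming hypothetical input and output embedding matrices `W_in` and `W_out`; real word2vec training adds subsampling and frequency-based negative sampling omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 1000, 50
W_in = rng.normal(scale=0.1, size=(V, D))    # hypothetical "input" word vectors
W_out = rng.normal(scale=0.1, size=(V, D))   # hypothetical "output" word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(center, context, negatives):
    """-log sigma(v_c . u_o) - sum_k log sigma(-v_c . u_k)"""
    v = W_in[center]
    pos = -np.log(sigmoid(v @ W_out[context]))
    neg = -np.log(sigmoid(-W_out[negatives] @ v)).sum()
    return pos + neg

print(sgns_loss(center=5, context=42, negatives=rng.integers(0, V, size=5)))
```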
One-hot vectors
Lookup table-based word embeddings
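A tiny sketch of why embedding layers are implemented as lookup tables: multiplying a one-hot vector by the embedding matrix just selects one row, so in practice the multiplication is skipped and the row is indexed directly.

```python
import numpy as np

# One-hot vector times an embedding matrix == row lookup.
vocab = {"the": 0, "cat": 1, "sat": 2}
E = np.random.randn(len(vocab), 4)            # |V| x d embedding table

one_hot = np.zeros(len(vocab))
one_hot[vocab["cat"]] = 1.0
assert np.allclose(one_hot @ E, E[vocab["cat"]])
```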
Word embeddings in a classification architecture
*Collobert et al., 2011
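A minimal PyTorch sketch of word embeddings feeding a convolution-plus-max-pooling classifier, loosely in the spirit of Collobert et al.'s architecture; the class name and all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class ConvTextClassifier(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=100, num_filters=100, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)       # lookup-table embeddings
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size=3, padding=1)
        self.proj = nn.Linear(num_filters, num_classes)

    def forward(self, token_ids):                              # [B, T] integer ids
        x = self.embed(token_ids).transpose(1, 2)              # [B, D, T]
        h = torch.relu(self.conv(x)).max(dim=2).values         # max-pool over time
        return self.proj(h)                                    # [B, num_classes]

logits = ConvTextClassifier()(torch.randint(0, 10000, (8, 20)))
print(logits.shape)   # torch.Size([8, 2])
```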
Motivation for Contextualized representations
Causal language modeling
I’d like an Italian sub with everything, light ________ .
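A short sketch of the causal LM objective behind the fill-in example above: the model's prediction at position t is scored against the token at position t+1, so inputs and targets are the same sequence shifted by one. The shapes and random logits are placeholders.

```python
import torch
import torch.nn.functional as F

vocab_size, B, T = 100, 2, 8
logits = torch.randn(B, T, vocab_size)          # model outputs, one per position
tokens = torch.randint(0, vocab_size, (B, T))

# Predictions for positions 0..T-2 are scored against tokens 1..T-1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
print(loss.item())
```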
Why Language Models For Pretraining?
An LSTM language model with characters
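A rough PyTorch sketch of a language model whose word representations are built from characters (a character CNN pooled into a word vector, then a word-level LSTM); the class name and sizes are illustrative, not the exact model on the slide.

```python
import torch
import torch.nn as nn

class CharLSTMLanguageModel(nn.Module):
    def __init__(self, num_chars=100, char_dim=16, word_dim=128, vocab_size=10000):
        super().__init__()
        self.char_embed = nn.Embedding(num_chars, char_dim)
        self.char_conv = nn.Conv1d(char_dim, word_dim, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(word_dim, word_dim, batch_first=True)
        self.proj = nn.Linear(word_dim, vocab_size)

    def forward(self, char_ids):                         # [B, T, C] character ids per word
        B, T, C = char_ids.shape
        x = self.char_embed(char_ids.view(B * T, C))     # [B*T, C, char_dim]
        x = self.char_conv(x.transpose(1, 2)).max(dim=2).values   # pool over characters
        hidden, _ = self.lstm(x.view(B, T, -1))          # contextualize over words
        return self.proj(hidden)                         # next-word logits per position

logits = CharLSTMLanguageModel()(torch.randint(0, 100, (2, 5, 12)))
print(logits.shape)   # torch.Size([2, 5, 10000])
```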
Ok, so we trained a language model, now what?
Some options for downstream use
LSTM-based LMs
But I heard Attention Is All You Need??
Background: Vanilla Seq2Seq
Background: Seq2Seq with Attention
Background: Seq2Seq with Attention
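A compact sketch of the attention step in attentional seq2seq: the current decoder state scores every encoder state, and the context vector is the softmax-weighted sum. Dot-product scoring is used here for brevity; other scoring functions follow the same pattern.

```python
import torch
import torch.nn.functional as F

B, T_src, D = 2, 7, 64
encoder_states = torch.randn(B, T_src, D)          # one vector per source token
decoder_state = torch.randn(B, D)                   # current decoder hidden state

scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2)).squeeze(2)  # [B, T_src]
weights = F.softmax(scores, dim=1)                  # attention distribution over source
context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)       # [B, D]
print(context.shape)   # torch.Size([2, 64])
```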
The Transformer
Transformer innovations
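A sketch of the Transformer's core operation, scaled dot-product self-attention, softmax(QK^T / sqrt(d)) V, shown for a single head with randomly initialized projections.

```python
import math
import torch
import torch.nn.functional as F

B, T, D = 2, 6, 64
x = torch.randn(B, T, D)
W_q, W_k, W_v = (torch.randn(D, D) for _ in range(3))   # illustrative projections

Q, K, V = x @ W_q, x @ W_k, x @ W_v
attn = F.softmax(Q @ K.transpose(1, 2) / math.sqrt(D), dim=-1)   # [B, T, T]
out = attn @ V                                                    # [B, T, D]
print(out.shape)   # torch.Size([2, 6, 64])
```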
GPT: Transformers are cool! Let's use them for pre-training!
Pre-training Architectures: GPT
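A tiny sketch of what makes GPT's Transformer causal: a lower-triangular mask so that each position can attend only to itself and earlier positions, which is what lets it be trained as a left-to-right language model.

```python
import torch

T = 5
causal_mask = torch.tril(torch.ones(T, T)).bool()   # True where attention is allowed
print(causal_mask)
# Applied to attention scores before the softmax, e.g.:
# scores = scores.masked_fill(~causal_mask, float("-inf"))
```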
BERT: GPT is cool but BiLM is important!
Pre-training Architectures: BERT
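A sketch of BERT's masked-LM corruption scheme: roughly 15% of tokens are selected; of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. The whitespace tokenization and toy vocabulary below are simplifications.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    inputs, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok                       # the model must predict the original
            r = random.random()
            if r < 0.8:
                inputs[i] = "[MASK]"
            elif r < 0.9:
                inputs[i] = random.choice(vocab)   # random replacement
            # else: keep the original token unchanged
    return inputs, targets

print(mask_tokens("my dog is hairy".split(), vocab=["cat", "runs", "blue"]))
```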
GPT-2: No, seriously, GPT IS cool
GPT-2: Sample
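A sketch of top-k sampling, a decoding strategy often used to draw samples from GPT-2-style models: keep only the k most likely next tokens, renormalize, and sample. The logits here are random placeholders rather than real model outputs.

```python
import torch
import torch.nn.functional as F

def top_k_sample(logits, k=40):
    topk_logits, topk_ids = torch.topk(logits, k)   # keep the k best-scoring tokens
    probs = F.softmax(topk_logits, dim=-1)          # renormalize over the top k
    choice = torch.multinomial(probs, num_samples=1)
    return topk_ids[choice]

next_token = top_k_sample(torch.randn(1000))        # fake logits over a 1000-word vocab
print(next_token.item())
```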
XLM: Multilingual Pre-Training
*Lample et al., 2019
What are we learning though?
Looking at Neuron Activations?
*Karpathy et al., 2016
Probing
*Liu et al., 2019
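A sketch of the probing setup: freeze a pretrained encoder and train only a small (here linear) classifier on its representations, to see what a task can recover from them. `build_probe` and the dummy LSTM encoder are illustrative stand-ins, not any paper's exact recipe.

```python
import torch
import torch.nn as nn

def build_probe(pretrained_encoder, hidden_dim, num_labels):
    for p in pretrained_encoder.parameters():
        p.requires_grad = False                       # representations stay fixed
    probe = nn.Linear(hidden_dim, num_labels)         # only this layer is trained
    optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
    return probe, optimizer

# Dummy stand-in for a pretrained contextual encoder.
probe, opt = build_probe(nn.LSTM(50, 50), hidden_dim=50, num_labels=17)
print(probe)
```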
Attention Weight Visualization
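A sketch of the usual way attention weights are visualized: a heatmap of the target-by-source weight matrix with the tokens as axis labels. The weights below are random rather than from a trained model.

```python
import numpy as np
import matplotlib.pyplot as plt

src = "the cat sat".split()
tgt = "le chat".split()
weights = np.random.dirichlet(np.ones(len(src)), size=len(tgt))  # each row sums to 1

fig, ax = plt.subplots()
ax.imshow(weights, cmap="viridis")
ax.set_xticks(range(len(src)))
ax.set_xticklabels(src)
ax.set_yticks(range(len(tgt)))
ax.set_yticklabels(tgt)
plt.show()
```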
Catastrophic Forgetting
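One common way to limit catastrophic forgetting during fine-tuning is ULMFiT-style discriminative fine-tuning: lower layers get smaller learning rates so pretrained knowledge changes slowly. A sketch using PyTorch parameter groups, with an illustrative stack of linear layers standing in for the pretrained model.

```python
import torch
import torch.nn as nn

layers = nn.ModuleList([nn.Linear(64, 64) for _ in range(4)])   # stand-in for an LM
base_lr, decay = 1e-3, 2.6                                       # per-layer decay factor

param_groups = [
    {"params": layer.parameters(), "lr": base_lr / (decay ** (len(layers) - 1 - i))}
    for i, layer in enumerate(layers)
]
optimizer = torch.optim.Adam(param_groups)
print([g["lr"] for g in optimizer.param_groups])   # smallest lr for the lowest layer
```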
Moving Forward
References
References
References
References