Decoding Domain Adaptation & Adapting Pretrained Models
with wing.nus
Deep transfer RL for text summarization
Reference: Keneshloo, Yaser, Naren Ramakrishnan, and Chandan Reddy. "Deep Transfer Reinforcement Learning for Text Summarization." 2018.
Text summarization and ROUGE scores
The text summarization task
long text → short text containing the essential information
That’s about all there is to it.
What is a good summary?
⇒ automatically scoring summaries is really hard.
The ROUGE-n scores
Given a reference summary R and a machine-generated summary M, ROUGE-n measures the n-gram overlap between M and R (see below).
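A sketch in LaTeX of the usual recall-oriented definition, following Lin (2004):

\mathrm{ROUGE\text{-}n}(M, R) = \frac{\sum_{g \in \text{n-grams}(R)} \min\big(\mathrm{Count}_M(g),\, \mathrm{Count}_R(g)\big)}{\sum_{g \in \text{n-grams}(R)} \mathrm{Count}_R(g)}

i.e., the fraction of the reference's n-grams that also appear (with clipped counts) in the machine summary.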
Lin, Chin-Yew. 2004. ROUGE: a Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), Barcelona, Spain, July 25 - 26, 2004.
Extractive and abstractive summarization
Extractive
⊕ If the spans are sentences, the result will be syntactically correct.
⊕ Can’t totally muck up the ROUGE scores.
⊖ It “only” needs to be intelligent enough to pay attention to the right spans.
⊖ Can’t improve on bland, verbose input
Abstractive
⊕ A better approach towards truly modelling semantics than the extractive one
⊖ Very hard. ROUGE is too primitive to capture the subtleties of paraphrasing
State of the art of summarization models
Basic sequence to sequence
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. “Sequence to sequence learning with neural networks”. In Neural Information Processing Systems, 2014.
[Figure: sequence-to-sequence encoder-decoder unrolled over timesteps]
Pointer networks
Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." Advances in neural information processing systems. 2015.
Pointer-generator model
Nallapati, Ramesh, et al. "Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond." Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning. 2016.
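A sketch of the copy/generate mixture at each decoding step, in the notation of See et al. (2017): p_gen is the generation probability, P_vocab the softmax over the vocabulary, and a_i the attention weight on source token x_i:

P(w) = p_{\mathrm{gen}} \, P_{\mathrm{vocab}}(w) + (1 - p_{\mathrm{gen}}) \sum_{i : x_i = w} a_i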
Pointer-generator model - Coverage mechanism
A. See, P. J. Liu, and C. D. Manning. “Get to the point: Summarization with pointer-generator networks”. In ACL, volume 1, pages 1073–1083, 2017
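A sketch of the coverage mechanism from See et al. (2017): the coverage vector accumulates past attention, and a penalty discourages re-attending to already-covered source tokens:

c_t = \sum_{t'=0}^{t-1} a_{t'}, \qquad \mathrm{covloss}_t = \sum_i \min(a_{t,i},\, c_{t,i})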
Classical transfer learning
Knowledge distillation
[Figure: knowledge from a trained teacher model is distilled into a student model]
Generalized models
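A minimal PyTorch-style sketch of the distillation loss (soft teacher targets at temperature T mixed with hard-label cross-entropy); the function name, temperature, and mixing weight are illustrative assumptions, not the paper's setup:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Mix a soft-target KL term (teacher -> student) with the usual hard-label CE."""
    # Soft targets: match the student's tempered distribution to the teacher's.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitudes stay comparable across temperatures
    # Hard targets: standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Usage sketch: the teacher is frozen, only the student is updated.
# with torch.no_grad():
#     teacher_logits = teacher(batch)
# loss = distillation_loss(student(batch), teacher_logits, batch_labels)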
Transfer network layers and fine-tune
[Figure: layers are copied from a trained neural net into another model, which is then fine-tuned]
Transfer network used as baseline in the paper
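A minimal PyTorch-style sketch of this copy-then-fine-tune baseline; the attribute names (encoder, layers) and the number of frozen layers are illustrative assumptions, not the paper's actual architecture:

import torch.nn as nn

def transfer_and_freeze(trained_encoder: nn.Module, new_model: nn.Module, n_frozen: int = 2):
    """Copy encoder weights from a trained model into a new model and freeze the lower layers."""
    # Copy all matching parameters from the trained encoder (assumes identical shapes/names).
    new_model.encoder.load_state_dict(trained_encoder.state_dict())
    # Freeze the first n_frozen layers; the remaining layers stay trainable for fine-tuning.
    for i, layer in enumerate(new_model.encoder.layers):
        if i < n_frozen:
            for p in layer.parameters():
                p.requires_grad = False
    return new_model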
Transfer Reinforcement Learning
The Cross-Entropy loss
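A standard way to write the teacher-forced cross-entropy loss for an input x and reference summary y* = (y*_1, ..., y*_T):

L_{\mathrm{CE}}(\Theta) = -\sum_{t=1}^{T} \log p_\Theta\big(y^*_t \mid y^*_1, \ldots, y^*_{t-1},\, x\big)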
The above equation reads as follows: “Given some input text, the CE loss for a model is the negative log-probability that this model assigns to the words of the reference summary.” Note that this implies teacher forcing.
Exposure bias and inflexibility in Cross-Entropy loss
① The model is directly trained from the ground truth, which follows the source distribution ⇒ poor generalization to other target distributions
② The ground truth is not an absolute, unique correct summary, yet this loss takes it very literally ⇒ further unnecessary hindrance to generalization
RL objective
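In symbols (a sketch; y^s is a summary sampled from the model, and the reward is its ROUGE score):

\max_{\Theta} \; \mathbb{E}_{y^s \sim p_\Theta(\cdot \mid x)} \big[ \mathrm{ROUGE}(y^s) \big]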
The above equation reads as follows: “the objective of the reinforcement learning problem here is to maximize the expected ROUGE score when we sample the summary from the proposed hybrid (pointer-generator / seq2seq) network parametrized by Θ (where Θ = all weights and biases in the model)”
REINFORCE algorithm for policy gradient descent
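A sketch of the self-critical REINFORCE gradient used in this line of work (cf. Paulus et al. 2017), where ŷ is the greedy decode used as baseline:

\nabla_\Theta L_{\mathrm{RL}} \approx -\big( r(y^s) - r(\hat{y}) \big) \, \nabla_\Theta \log p_\Theta(y^s \mid x)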
The reward of Ŷ, the greedy output from the model, is subtracted from the sampled trajectory’s reward, so that trajectories scoring below this baseline become less likely and those scoring above it become more likely.
RL objective revisited for transfer learning
It’s simpler than it looks: it is mostly about sampling from both datasets and striking a balance ζ between the loss on one and the loss on the other.
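A sketch of the resulting objective, assuming ζ simply interpolates between the RL losses computed on source-domain and target-domain samples:

L_{\mathrm{TRL}} = \zeta \, L_{\mathrm{RL}}^{\mathrm{source}} + (1 - \zeta) \, L_{\mathrm{RL}}^{\mathrm{target}}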
Final RL loss function
The actual loss function used for RL is another parametrized mix of CE and TRL loss.
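A sketch of that mix, writing the mixing weight as γ by analogy with Paulus et al. (2017):

L = \gamma \, L_{\mathrm{TRL}} + (1 - \gamma) \, L_{\mathrm{CE}}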
Note that this mix is used only for fine-tuning, after normal MLE training on the source dataset.
Paulus, Romain, Caiming Xiong, and Richard Socher. "A deep reinforced model for abstractive summarization." arXiv preprint arXiv:1705.04304 (2017). ← useful paper to understand the one we’re presenting
Experimental results
ROUGE-{1,2,L} F1 scores
Universal Language Model Fine-tuning for Text Classification
Domain adaptation vs Transfer Learning
In domain adaptation, the source and target domains share the same feature space (but have different distributions); in contrast, transfer learning includes cases where the target domain's feature space differs from the source feature space(s).
The standard classification setting has an input distribution p(X) and a conditional label distribution p(Y|X). Domain adaptation: p(X) changes between training and test. Transfer learning: p(Y|X) changes between training and test.
In other words, in domain adaptation the input distribution changes but the labels remain the same; in transfer learning, the input distribution stays the same, but the labels change.
Universal Language Model Fine-tuning (ULMFiT)
Pretrains a language model (LM) on a large general-domain corpus and fine-tunes it on the target task. The method is universal in the sense that it meets these practical criteria:
- it works across tasks varying in document size, number, and label type;
- it uses a single architecture and training process;
- it requires no custom feature engineering or preprocessing;
- it does not require additional in-domain documents or labels.
ULMFiT: Three approaches
ULMFiT: Three approaches (cont)
ULMFiT: Three approaches (cont)
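A minimal PyTorch-style sketch, assuming the approaches referred to above include ULMFiT's discriminative (per-layer) learning rates and gradual unfreezing; the layer stack, learning rates, decay factor, and unfreezing schedule here are illustrative assumptions, not the paper's exact recipe:

import torch
import torch.nn as nn

def discriminative_param_groups(layers, base_lr=1e-3, decay=2.6):
    """Give each deeper layer a smaller learning rate than the layer above it."""
    groups = []
    for depth, layer in enumerate(reversed(list(layers))):  # topmost layer gets base_lr
        groups.append({"params": layer.parameters(), "lr": base_lr / (decay ** depth)})
    return groups

def gradually_unfreeze(layers, epoch):
    """Unfreeze one more layer (starting from the top) at each epoch."""
    layers = list(layers)
    for i, layer in enumerate(layers):
        trainable = i >= len(layers) - 1 - epoch  # top layers become trainable first
        for p in layer.parameters():
            p.requires_grad = trainable

# Usage sketch with a toy layer stack standing in for the LM encoder plus classifier head:
layers = [nn.Embedding(10000, 64), nn.LSTM(64, 64, batch_first=True), nn.Linear(64, 2)]
optimizer = torch.optim.Adam(discriminative_param_groups(layers, base_lr=1e-3))
gradually_unfreeze(layers, epoch=0)  # epoch 0: only the top (classifier) layer is trainable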