1 of 70

Decoding, Domain Adaptation & Adapting Pretrained Models

with wing.nus


35 of 70

Deep transfer RL for text summarization

Reference: Keneshloo, Yaser & Ramakrishnan, Naren & Reddy, Chandan. (2018). Deep Transfer Reinforcement Learning for Text Summarization.

36 of 70

Text summarization and ROUGE scores

37 of 70

The text summarization task

long text → short text containing the essential information

That’s about all there is to it.

What is a good summary?

  • Syntactically / semantically correct
  • Contains all the relevant information while maintaining brevity
  • Subjective human judgement is obviously central to properly assessing these criteria.

⇒ automatically scoring summaries is really hard.

38 of 70

The ROUGE-n scores

Given a reference summary R and a machine generated summary M,

  • Let O be the collection of all overlapping n-grams between M and R (one match suffices for repetitions in R)
  • Let |R|, |M|, |O| be the number of n-grams in R, M, O
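The precision, recall, and F1 formulas appeared on the slide as an image. A minimal sketch of how ROUGE-n can be computed from the counts above, assuming simple whitespace tokenization (an illustration, not the official ROUGE toolkit):

```python
# ROUGE-n precision / recall / F1 from clipped n-gram overlap counts.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference, candidate, n=1):
    ref, cand = ngrams(reference.split(), n), ngrams(candidate.split(), n)
    overlap = sum((ref & cand).values())              # |O| (clipped overlap)
    recall = overlap / max(sum(ref.values()), 1)      # |O| / |R|
    precision = overlap / max(sum(cand.values()), 1)  # |O| / |M|
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(rouge_n("the cat sat on the mat", "the cat lay on the mat", n=2))
```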

Lin, Chin-Yew. 2004. ROUGE: a Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), Barcelona, Spain, July 25 - 26, 2004.

39 of 70

Extractive and abstractive summarization

Extractive

  • Pick important spans from the source text and reuse them in the summary.

⊕ If the spans are sentences, the result will be syntactically correct.

⊕ Can’t totally muck up the ROUGE scores.

⊖ It “only” needs to be intelligent enough to pay attention to the right spans.

⊖ Can’t improve on bland, verbose input.

Abstractive

  • Using a combination of LSTM/GRU, transformers and intelligent tricks, create an abstract representation of the input text, and output a short sequence that paraphrases the important information.

⊕ A better approach towards truly modelling semantics than the extractive one

⊖ Very hard. ROUGE is too primitive to capture the subtleties of paraphrasing.

40 of 70

State of the art of summarization models

41 of 70

Basic sequence to sequence

  • DNNs: powerful, but typically require fixed input and output dimensionality.
  • But NLP tasks require processing sequences of varying length.
  • Seq2seq is an LSTM-based model that can handle such sequences

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. “Sequence to sequence learning with neural networks”. In Neural Information Processing Systems, 2014.

[Figure: encoder-decoder sequence unrolled over timesteps 1-8]
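A minimal sketch of the encoder-decoder idea, assuming PyTorch; the class name, vocabulary size, and dimensions are illustrative placeholders, not the paper's configuration:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab_size=10_000, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # Encode the variable-length source into a final (h, c) state ...
        _, state = self.encoder(self.embed(src_ids))
        # ... and decode the target conditioned on that state (teacher forcing).
        dec_out, _ = self.decoder(self.embed(tgt_ids), state)
        return self.out(dec_out)  # logits over the output vocabulary

model = Seq2Seq()
logits = model(torch.randint(0, 10_000, (2, 12)), torch.randint(0, 10_000, (2, 6)))
print(logits.shape)  # torch.Size([2, 6, 10000])
```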

42 of 70

Pointer networks

  • This model is not specific to summarization but we’ll use the pointer concept
  • Seq2seq limitation: trained with a fixed vocabulary size
  • In this example we find the convex hull of a set of points on a 2d plane
  • Attention mechanism generates a probability distribution over input
  • The model was shown to be successful on the convex hull, travelling salesman, and Delaunay triangulation problems
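The pointing mechanism itself was shown as a figure. In the notation of Vinyals et al. (2015), the attention scores over the encoder states e_i at decoder step j are used directly as the output distribution over input positions:

u^{j}_{i} = v^{\top}\tanh(W_{1} e_{i} + W_{2} d_{j}), \qquad p(C_{j} \mid C_{1}, \dots, C_{j-1}, \mathcal{P}) = \mathrm{softmax}(u^{j})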

Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." Advances in neural information processing systems. 2015.

43 of 70

Pointer-generator model

  • Ability to handle out-of-vocabulary words: important for transfer learning
  • Soft switch mechanism σj ∈ [0, 1]
    • Generate a word from the vocabulary distribution pvocab,
    • or copy a word w from the source, based on the attention placed on its occurrences: Σ{i : xi = w} αij
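Putting the switch and the two distributions together (a hedged reconstruction in the notation of the pointer-generator literature, e.g. See et al. 2017; the slide showed this as a figure), the probability of emitting word w at decoder step j is

P(w) = \sigma_{j}\, p_{\mathrm{vocab}}(w) + (1 - \sigma_{j}) \sum_{i\,:\,x_{i} = w} \alpha_{ij}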

Nallapati, Ramesh, et al. "Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond." Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning. 2016.

44 of 70

Pointer-generator model

  • σj is learned at each decoder step j (step 2 in the example figure), based on:
    • current input word (y2)
    • decoder state (s2)
    • context vector (c2)
  • Context vector cj = Σi αij hi is the attention-weighted sum of the encoder hidden states hi
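In equations, a hedged reconstruction consistent with the bullets above (the slide showed these as a figure):

c_{j} = \sum_{i} \alpha_{ij} h_{i}, \qquad \sigma_{j} = \mathrm{sigmoid}\big(w_{c}^{\top} c_{j} + w_{s}^{\top} s_{j} + w_{y}^{\top} y_{j} + b\big)

where h_i are the encoder hidden states and w_c, w_s, w_y, b are learned parameters.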

Nallapati, Ramesh, et al. "Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond." Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning. 2016.

45 of 70

Pointer-generator model - Coverage mechanism

  • Repetition is a common problem of seq2seq models and is particularly damaging for summarization, where a repetition adds no information.
  • To reduce repetitions, a “coverage vector” ct = Σt'=0..t-1 at' is added to the attention mechanism (where t is the decoder timestep and at' is the attention distribution at step t')
  • Intuitively, it is a distribution that represents the degree of coverage received by the source words so far.
  • The loss function is modified to penalize the overlap between each attention distribution and the previous coverage: λ Σi min(ait, cit)
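A minimal sketch of the coverage bookkeeping and penalty, assuming PyTorch; the tensor shapes and the averaging over decoder steps are illustrative choices, not the authors' exact code:

```python
import torch

def coverage_loss(attn, lam=1.0):
    """attn: (dec_steps, src_len) attention distributions a^t over the source."""
    coverage = torch.zeros(attn.size(1))                  # c^0 = 0
    loss = torch.zeros(())
    for a_t in attn:                                      # one decoder step at a time
        loss = loss + torch.minimum(a_t, coverage).sum()  # sum_i min(a_i^t, c_i^t)
        coverage = coverage + a_t                         # c^{t+1} = c^t + a^t
    return lam * loss / attn.size(0)

attn = torch.softmax(torch.randn(5, 8), dim=-1)           # 5 decoder steps, 8 source words
print(coverage_loss(attn))
```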

A. See, P. J. Liu, and C. D. Manning. “Get to the point: Summarization with pointer-generator networks”. In ACL, volume 1, pages 1073–1083, 2017

46 of 70

Classical transfer learning

47 of 70

Knowledge distillation

Teacher (trained) model

Student model

  • The trained (teacher) model is treated as ground truth and probed (possibly with generated data) to create training data
  • The student model learns from this data; a common loss is sketched below
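A minimal sketch of one common way to implement this: a softened cross-entropy / KL objective in the style of Hinton et al.'s distillation; the temperature and reduction choices here are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # The student is trained to match the teacher's softened output distribution.
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

print(distillation_loss(torch.randn(4, 10), torch.randn(4, 10)))
```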

48 of 70

Generalized models

  • Chimera that attempts many tasks (many outputs)
  • Forced common architecture
  • Implicit knowledge transfer between tasks

49 of 70

Transfer network layers and fine-tune

[Figure: layers are copied from a trained neural net into another model, which is then fine-tuned]

50 of 70

Transfer network used as baseline in the paper

  • Copy the full architecture (all LSTM hidden-layer weights) from the network trained on the source dataset S
  • Fine-tune all layers on the target dataset G, without clamping (freezing) any of them; a minimal sketch follows
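A minimal sketch of this baseline, assuming PyTorch; the LSTM stack is a stand-in for the actual summarization architecture:

```python
import torch
import torch.nn as nn

def make_model():
    # Stand-in for the full pointer-generator / seq2seq summarization network.
    return nn.LSTM(input_size=128, hidden_size=256, num_layers=2)

source_model = make_model()                                # assume: trained on source dataset S
target_model = make_model()
target_model.load_state_dict(source_model.state_dict())   # copy all layer weights

# Fine-tune every layer on the target dataset G: nothing is frozen / clamped.
optimizer = torch.optim.Adam(target_model.parameters(), lr=1e-4)
print(sum(p.numel() for p in target_model.parameters() if p.requires_grad), "trainable parameters")
```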

51 of 70

Transfer Reinforcement Learning

52 of 70

The Cross-Entropy loss

  • Similar to the usual seq2seq formulation — we are building a conditional language model; here, the “language of summarization” given an input text
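A hedged reconstruction of the equation (shown on the slide as an image), in standard maximum-likelihood notation, where y* is the reference summary and X the input text:

\mathcal{L}_{CE}(\Theta) = -\sum_{t=1}^{T} \log p\big(y^{*}_{t} \mid y^{*}_{1}, \dots, y^{*}_{t-1}, X; \Theta\big)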

The above equation reads as follows: given some input text, the CE loss is the negative log-probability that the model assigns to the reference summary, one word at a time. Note that each word is predicted from the ground-truth prefix, i.e. this implies teacher forcing.

53 of 70

Exposure bias and inflexibility in Cross-Entropy loss

① The model is directly trained from the ground truth, which follows the source distribution ⇒ poor generalization to other target distributions

② The ground truth is not an absolute, unique correct summary, yet this score takes it very literally ⇒ further unnecessary hindrance to generalization

54 of 70

RL objective

  • Generally, RL tries to maximize the expected reward over time.
  • Here, luckily, we are in a fully observed setting (so we deal with states, not observations ⇒ FOMDP), and we don’t need to fiddle with our reward function: r(state) = ROUGE
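A hedged reconstruction of the objective (shown on the slide as an image), with the summary Y sampled from the network's output distribution:

\max_{\Theta}\; \mathbb{E}_{Y \sim p_{\Theta}(\cdot \mid X)}\big[\mathrm{ROUGE}(Y)\big] \quad\text{i.e.}\quad \mathcal{L}_{RL}(\Theta) = -\,\mathbb{E}_{Y \sim p_{\Theta}(\cdot \mid X)}\big[r(Y)\big]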

The above equation reads as follows: “the objective of the reinforcement learning problem here is to maximize the expected ROUGE score when we sample the summary from the proposed hybrid (pointer-generator / seq2seq) network parametrized by Θ (where Θ = all weights and biases in the model).”

55 of 70

REINFORCE algorithm for policy gradient descent

  • To minimize this loss, we descend along its gradient with respect to Θ,
  • We use samples of the output of the network to estimate this gradient and to adjust its parameters:
  • We add a baseline term b to compensate for the fact that the gradient is sensitive to a constant shift in samples (a common instability problem with policy gradients)
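A hedged sketch of the resulting single-sample gradient estimate, with the reward of the greedy output Ŷ used as the baseline b (the self-critical form used by Paulus et al., 2017):

\nabla_{\Theta} \mathcal{L}_{RL} \approx -\big(r(Y^{s}) - r(\hat{Y})\big)\, \nabla_{\Theta} \log p_{\Theta}(Y^{s} \mid X), \qquad Y^{s} \sim p_{\Theta}(\cdot \mid X)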

r(Ŷ), the reward of the greedy output Ŷ from the model, is subtracted from the sampled trajectory’s reward, so that trajectories scoring below this baseline are made less likely and those scoring above it are made more likely.

56 of 70

RL objective revisited for transfer learning

It’s simpler than it looks: it is mostly about sampling from both datasets and striking a balance ζ between the loss on one and the loss on the other.
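One plausible form consistent with this description (the exact equation was shown as an image and may differ in the paper): sample summaries on both the source data D_S and the target data D_G, and mix the two RL losses with the balance parameter ζ:

\mathcal{L}_{TRL}(\Theta) = \zeta\, \mathcal{L}_{RL}^{D_S}(\Theta) + (1 - \zeta)\, \mathcal{L}_{RL}^{D_G}(\Theta)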

57 of 70

Final RL loss function

The actual loss function used for RL is another parametrized mix of CE and TRL loss.

Note that this mix is used only for fine-tuning, after normal MLE training on the source dataset.
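In the style of Paulus et al. (2017), whom the slide cites below, such a mix can be written with a second balance parameter γ (a hedged sketch; the paper's exact weighting may differ):

\mathcal{L}_{mixed}(\Theta) = \gamma\, \mathcal{L}_{TRL}(\Theta) + (1 - \gamma)\, \mathcal{L}_{CE}(\Theta)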

Paulus, Romain, Caiming Xiong, and Richard Socher. "A deep reinforced model for abstractive summarization." arXiv preprint arXiv:1705.04304 (2017). ← useful paper to understand the one we’re presenting

58 of 70

Experimental results

59 of 70

ROUGE-{1,2,L} F1 scores

60 of 70

Universal Language Model Fine-tuning for Text Classification

61 of 70

Domain adaptation vs Transfer Learning

In domain adaptation, the source and target domains all have the same feature space (but different distributions); in contrast, transfer learning includes cases where the target domain's feature space is different from the source feature space or spaces.

62 of 70

The standard classification setting is an input distribution p(X) and a label distribution p(Y|X). Domain adaptation: when p(X) changes between training and test. Transfer learning: when p(Y|X) changes between training and test.

In other words, in Domain Adaptation the input distribution changes but the labels remain the same; in Transfer Learning, the input distribution stays the same, but the labels change.

63 of 70

Universal Language Model Fine-tuning (ULMFiT)

  1. Universal Language Model Fine-tuning (ULMFiT), a method that can be used to achieve CV-like transfer learning for any NLP task.
  2. Discriminative fine-tuning, slanted triangular learning rates, and gradual unfreezing: novel techniques to retain previous knowledge and avoid catastrophic forgetting during fine-tuning.
  3. Significantly outperforms the state of the art on six representative text classification datasets, with an error reduction of 18-24% on the majority of datasets.

64 of 70

Universal Language Model Fine-tuning (ULMFiT)

Pretrains a language model (LM) on a large general-domain corpus and fine-tunes it on the target task. The method is universal in the sense that it meets these practical criteria:

  1. works across tasks varying in document size, number, and label type;
  2. uses a single architecture and training process;
  3. requires no custom feature engineering or preprocessing; and
  4. does not require additional in-domain documents or labels.


66 of 70

ULMFiT: Three approaches

  • General domain LM pre-training
    1. Wikitext-103
    2. 28,595 preprocessed Wikipedia articles and 103 million words

67 of 70

ULMFiT: Three approaches (cont.)

  • Target task LM fine-tuning
    1. fine-tune the LM on data of the target task
    2. Discriminative fine-tuning (separate learning rates for each layer)
    3. If x is the learning rate of the last layer, lower layers use the rate of the layer above divided by 2.6 (x, x/2.6, x/2.6², …)
    4. Slanted triangular learning rates, a modification of triangular learning rates (Smith, 2017); see the sketch after this list
      • Short initial increase of the learning rate, then a long decrease period
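A sketch of the slanted triangular schedule, following the formula in Howard & Ruder (2018); the hyperparameter values are illustrative:

```python
def stlr(t, T, eta_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at step t of T total steps: short increase, long decay."""
    cut = int(T * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return eta_max * (1 + p * (ratio - 1)) / ratio

print([round(stlr(t, 100), 5) for t in (0, 5, 10, 50, 99)])
```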

68 of 70

ULMFiT: Three approaches (cont.)

  • Target task classifier fine-tuning
    1. Two additional linear blocks on top of the language model
    2. Recipes: batch normalization, dropout, and ReLU activation
    3. Gradual unfreezing: gradually unfreeze the model starting from the last layer (a minimal sketch follows this list)
      • This is similar to ‘chain-thaw’ (Felbo et al., 2017), except that we add a layer at a time to the set of ‘thawed’ layers, rather than only training a single layer at a time
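A minimal sketch of gradual unfreezing (a hypothetical layer stack; in practice these would be the LM's layers plus the classifier blocks):

```python
import torch.nn as nn

layers = [nn.Linear(8, 8) for _ in range(4)]   # stand-in for the model's layer stack
for layer in layers:
    for p in layer.parameters():
        p.requires_grad = False                # start with everything frozen

for epoch in range(len(layers)):
    # Thaw one more layer per epoch, starting from the output end of the stack.
    for p in layers[len(layers) - 1 - epoch].parameters():
        p.requires_grad = True
    # ... run one epoch of fine-tuning on all currently unfrozen layers here ...
    trainable = sum(p.requires_grad for layer in layers for p in layer.parameters())
    print(f"epoch {epoch}: {trainable} trainable parameter tensors")
```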
