1 of 13

Rethinking Why Intermediate-Task Fine-Tuning Works

Ting-Yun Chang

tingyun@usc.edu.tw

Chi-Jen Lu

cjlu@iis.sinica.edu.tw

2 of 13

Intermediate-Task Fine-Tuning

  • Fine-tuning on an intermediate task first can improve the pretrained LM's final performance on the target task
    • ELMo, GPT, BERT, BART, RoBERTa...
  • Vanilla fine-tuning is sometimes very brittle -> intermediate-task training can help avoid degenerate runs (see the sketch below)
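As a rough illustration of the two-stage recipe (intermediate task first, then the target task), here is a minimal sketch using the Hugging Face transformers Trainer. The dataset choices (MNLI as intermediate, RTE as target), hyperparameters, and tokenization helper are assumptions for the sketch, not the exact setup from these slides.

```python
# Minimal two-stage (intermediate -> target) fine-tuning sketch with Hugging Face
# transformers. Dataset choices, hyperparameters, and field names are illustrative
# assumptions, not the exact configuration used in the slides.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "roberta-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_pairs(dataset, key1, key2):
    # Tokenize sentence-pair inputs; field names differ per dataset.
    return dataset.map(lambda b: tokenizer(b[key1], b[key2], truncation=True),
                       batched=True)

def fine_tune(model, train_dataset, output_dir):
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=3,
                             per_device_train_batch_size=16, learning_rate=1e-5)
    Trainer(model=model, args=args, train_dataset=train_dataset,
            tokenizer=tokenizer).train()
    return model

# Stage 1: fine-tune on the intermediate task (e.g., MNLI).
intermediate = tokenize_pairs(load_dataset("multi_nli", split="train"),
                              "premise", "hypothesis")
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
fine_tune(model, intermediate, "out/intermediate")
model.save_pretrained("out/intermediate")

# Stage 2: reload the intermediately trained encoder with a fresh classification
# head, then fine-tune on the target task (e.g., RTE).
target_model = AutoModelForSequenceClassification.from_pretrained(
    "out/intermediate", num_labels=2, ignore_mismatched_sizes=True)
target = tokenize_pairs(load_dataset("glue", "rte", split="train"),
                        "sentence1", "sentence2")
fine_tune(target_model, target, "out/target")
```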

3 of 13

What kinds of intermediate tasks work well?

  • In many cases, intermediate tasks are not helpful
  • Generally, intermediate tasks requiring commonsense reasoning abilities can help many different target tasks
    • MNLI, HellaSwag, CosmosQA

Intermediate-Task Transfer Learning with Pretrained Models for Natural Language Understanding: When and Why Does It Work? (Pruksachatkun et al., 2020)

(Figure legend: blue = helpful intermediate tasks; red = hurtful)

4 of 13

But HellaSwag is a synthetic dataset...

Previous work has found that RoBERTa tends to exploit artifacts in HellaSwag to make predictions

5 of 13

Ablating Common Sense: Two Simple Baselines (1)

HellaSwag-p

6 of 13

Ablating Common Sense: Two Simple Baselines (2)

SynthesisGPT2

Context (from a Wikipedia paragraph): The topography of the city center was also changed by the construction of a seawall

  1. and the artificial harbor island ( completed 1909 ) at the mouth of the city's industrial Duwamish ✔ (real next sentence from Wikipedia)
  2. designed to take the advantage of the low density of the region in order to reach sea level (generated by GPT2)
  3. (first completed by Hermann I at the time of Napoleon III), and by the creation (generated by GPT2)
  4. .... (generated by GPT2)

  • Source: Wikipedia paragraph
  • Distractors generated by GPT2-medium (see the sketch below)

=> We do not introduce extra commonsense information
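To make the construction concrete, here is a rough sketch of how one such synthetic multiple-choice example could be assembled: the real next sentence of the Wikipedia paragraph is the correct option, and GPT2-medium samples the distractors. The sampling parameters, continuation length, and helper function are assumptions, not the exact recipe behind SynthesisGPT2.

```python
# Sketch of building one SynthesisGPT2-style example: the real next sentence of a
# Wikipedia paragraph is the correct option; GPT2-medium samples the distractors.
# Sampling parameters and truncation details are assumptions, not the exact recipe.
import random
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

def make_example(context, real_next_sentence, num_distractors=3):
    inputs = tokenizer(context, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,                       # sample diverse continuations
        top_p=0.95,
        max_new_tokens=40,
        num_return_sequences=num_distractors,
        pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad warning
    )
    prompt_len = inputs["input_ids"].shape[1]
    # Keep only the generated continuations, dropping the prompt tokens.
    distractors = [tokenizer.decode(seq[prompt_len:], skip_special_tokens=True).strip()
                   for seq in outputs]
    options = distractors + [real_next_sentence]
    random.shuffle(options)
    return {"context": context, "options": options,
            "label": options.index(real_next_sentence)}

example = make_example(
    "The topography of the city center was also changed by the construction of a seawall",
    "and the artificial harbor island (completed 1909) at the mouth of the "
    "city's industrial Duwamish",
)
```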

7 of 13

Tasks

Target

We intentionally include target tasks that require specific knowledge

8 of 13

A Good Intermediate Task

  1. improving the target tasks' best performance
  2. stabilizing fine-tuning on the target tasks, i.e., reducing degenerate fine-tuning runs (a sketch of how to quantify both criteria follows)
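A small illustration of how the two criteria could be quantified over repeated fine-tuning runs; the accuracies and the 0.6 threshold for calling a run degenerate are made-up values for the sketch.

```python
# Illustrative way to quantify the two criteria over repeated fine-tuning runs:
# the best dev accuracy, the mean, and the share of degenerate runs. The 0.6
# degenerate-run threshold and the example accuracies are arbitrary assumptions.
from statistics import mean

def summarize_runs(dev_accuracies, degenerate_threshold=0.6):
    return {
        "best": max(dev_accuracies),
        "mean": mean(dev_accuracies),
        "degenerate_rate": sum(a < degenerate_threshold for a in dev_accuracies)
                           / len(dev_accuracies),
    }

# Example: a brittle set of vanilla runs vs. more stable runs after an intermediate task.
print(summarize_runs([0.88, 0.87, 0.52, 0.86, 0.50]))
print(summarize_runs([0.89, 0.88, 0.88, 0.87, 0.88]))
```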

9 of 13

36 hyperparameter trials for each intermediate-target task pair (each blue violin in the plot)

  • sweeping over reasonable learning rates, warm-up steps, batch sizes, and random seeds (see the hypothetical grid below)
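One hypothetical grid that yields exactly 36 trials per intermediate-target pair (3 learning rates x 2 warm-up settings x 2 batch sizes x 3 seeds); the concrete values are placeholders, not the grid actually used in these experiments.

```python
# Hypothetical hyperparameter sweep producing 36 trials per intermediate-target
# pair (3 x 2 x 2 x 3 = 36). The concrete values are placeholders, not the grid
# actually used in the slides' experiments.
from itertools import product

learning_rates = [1e-5, 2e-5, 3e-5]
warmup_steps = [0, 500]
batch_sizes = [16, 32]
seeds = [13, 42, 87]

trials = list(product(learning_rates, warmup_steps, batch_sizes, seeds))
assert len(trials) == 36

for lr, warmup, bs, seed in trials:
    config = {"learning_rate": lr, "warmup_steps": warmup,
              "per_device_train_batch_size": bs, "seed": seed}
    print(config)  # in practice: launch one fine-tuning run per config
```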

10 of 13

Generally, Syn_GPT2 & HellaSwag-p can slightly improve the best performance on dev sets

11 of 13

Syn_GPT2 & HellaSwag-p can greatly improve the average performance on dev sets

12 of 13

Using only 2k intermediate training examples from HellaSwag-p already helps!

(Plot: improvement in accuracy over vanilla fine-tuning)

13 of 13

Contribution

  • We find that a widely beneficial intermediate task need not provide specific linguistic or reasoning skills
  • We highlight the improvement in fine-tuning stability, supported by more than 1,000 experimental observations on RoBERTa-large
  • We study different contributing factors, suggesting a rethinking of why intermediate-task fine-tuning works (please refer to the paper for details)