1 of 13

Rethinking Why Intermediate-Task Fine-Tuning Works

Ting-Yun Chang

tingyun@usc.edu.tw

Chi-Jen Lu

cjlu@iis.sinica.edu.tw

2 of 13

Intermediate-Task Fine-Tuning

  • Fine-tuning on an intermediate task first can improve the pretrained LM's final performance on the target task
    • ELMo, GPT, BERT, BART, RoBERTa...
  • Vanilla fine-tuning is sometimes very brittle -> intermediate-task training can help avoid degenerate runs (see the sketch below)
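As a rough illustration of the two-stage recipe (intermediate task first, then the target task), here is a minimal sketch using the Hugging Face transformers Trainer. The dataset choices (MNLI as intermediate, RTE as target), hyperparameters, and tokenization helper are assumptions for the sketch, not the exact setup from these slides.

```python
# Minimal two-stage (intermediate -> target) fine-tuning sketch with Hugging Face
# transformers. Dataset choices, hyperparameters, and field names are illustrative
# assumptions, not the exact configuration used in the slides.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "roberta-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_pairs(dataset, key1, key2):
    # Tokenize sentence-pair inputs; field names differ per dataset.
    return dataset.map(lambda b: tokenizer(b[key1], b[key2], truncation=True),
                       batched=True)

def fine_tune(model, train_dataset, output_dir):
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=3,
                             per_device_train_batch_size=16, learning_rate=1e-5)
    Trainer(model=model, args=args, train_dataset=train_dataset,
            tokenizer=tokenizer).train()
    return model

# Stage 1: fine-tune on the intermediate task (e.g., MNLI).
intermediate = tokenize_pairs(load_dataset("multi_nli", split="train"),
                              "premise", "hypothesis")
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
fine_tune(model, intermediate, "out/intermediate")
model.save_pretrained("out/intermediate")

# Stage 2: reload the intermediately trained encoder with a fresh classification
# head, then fine-tune on the target task (e.g., RTE).
target_model = AutoModelForSequenceClassification.from_pretrained(
    "out/intermediate", num_labels=2, ignore_mismatched_sizes=True)
target = tokenize_pairs(load_dataset("glue", "rte", split="train"),
                        "sentence1", "sentence2")
fine_tune(target_model, target, "out/target")
```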

3 of 13

What kinds of intermediate tasks work well?

  • In many cases, intermediate tasks are not helpful
  • Generally, intermediate tasks requiring commonsense reasoning abilities can help many different target tasks
    • MNLI, HellaSwag, CosmosQA

Intermediate-Task Transfer Learning with Pretrained Models for Natural Language Understanding: When and Why Does It Work? (Pruksachatkun et al., 2020)

(Figure legend: blue = helpful intermediate tasks; red = hurtful)

4 of 13

But HellaSwag is a synthetic dataset...

Previous work has found that RoBERTa tends to exploit artifacts in HellaSwag to make predictions

5 of 13

Ablating Common Sense: Two Simple Baselines (1)

HellaSwag-p

6 of 13

Ablating Common Sense: Two Simple Baselines (2)

SynthesisGPT2

Context (from a Wikipedia paragraph): The topography of the city center was also changed by the construction of a seawall

  1. and the artificial harbor island ( completed 1909 ) at the mouth of the city's industrial Duwamish ✔ (real next sentence from Wikipedia)
  2. designed to take the advantage of the low density of the region in order to reach sea level (generated by GPT2)
  3. (first completed by Hermann I at the time of Napoleon III), and by the creation (generated by GPT2)
  4. .... (generated by GPT2)

  • Source: Wikipedia paragraph
  • Distractors generated by GPT2-medium (see the sketch below)

=> We do not introduce extra commonsense information
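To make the construction concrete, here is a rough sketch of how one such synthetic multiple-choice example could be assembled: the real next sentence of the Wikipedia paragraph is the correct option, and GPT2-medium samples the distractors. The sampling parameters, continuation length, and helper function are assumptions, not the exact recipe behind SynthesisGPT2.

```python
# Sketch of building one SynthesisGPT2-style example: the real next sentence of a
# Wikipedia paragraph is the correct option; GPT2-medium samples the distractors.
# Sampling parameters and truncation details are assumptions, not the exact recipe.
import random
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

def make_example(context, real_next_sentence, num_distractors=3):
    inputs = tokenizer(context, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,                       # sample diverse continuations
        top_p=0.95,
        max_new_tokens=40,
        num_return_sequences=num_distractors,
        pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad warning
    )
    prompt_len = inputs["input_ids"].shape[1]
    # Keep only the generated continuations, dropping the prompt tokens.
    distractors = [tokenizer.decode(seq[prompt_len:], skip_special_tokens=True).strip()
                   for seq in outputs]
    options = distractors + [real_next_sentence]
    random.shuffle(options)
    return {"context": context, "options": options,
            "label": options.index(real_next_sentence)}

example = make_example(
    "The topography of the city center was also changed by the construction of a seawall",
    "and the artificial harbor island (completed 1909) at the mouth of the "
    "city's industrial Duwamish",
)
```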

7 of 13

Tasks

Target

We intentionally include target tasks that require specific knowledge

8 of 13

A Good Intermediate Task

  1. improving the target tasks' best performance
  2. stabilizing fine-tuning on the target tasks, i.e., reducing degenerate fine-tuning runs (a sketch of how to quantify both criteria follows)
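A small illustration of how the two criteria could be quantified over repeated fine-tuning runs; the accuracies and the 0.6 threshold for calling a run degenerate are made-up values for the sketch.

```python
# Illustrative way to quantify the two criteria over repeated fine-tuning runs:
# the best dev accuracy, the mean, and the share of degenerate runs. The 0.6
# degenerate-run threshold and the example accuracies are arbitrary assumptions.
from statistics import mean

def summarize_runs(dev_accuracies, degenerate_threshold=0.6):
    return {
        "best": max(dev_accuracies),
        "mean": mean(dev_accuracies),
        "degenerate_rate": sum(a < degenerate_threshold for a in dev_accuracies)
                           / len(dev_accuracies),
    }

# Example: a brittle set of vanilla runs vs. more stable runs after an intermediate task.
print(summarize_runs([0.88, 0.87, 0.52, 0.86, 0.50]))
print(summarize_runs([0.89, 0.88, 0.88, 0.87, 0.88]))
```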

9 of 13

36 hyperparameter trials for each intermediate-target task pair (each blue violin in the plot)

  • sweeping over reasonable learning rates, warm-up steps, batch sizes, and random seeds (see the hypothetical grid below)
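One hypothetical grid that yields exactly 36 trials per intermediate-target pair (3 learning rates x 2 warm-up settings x 2 batch sizes x 3 seeds); the concrete values are placeholders, not the grid actually used in these experiments.

```python
# Hypothetical hyperparameter sweep producing 36 trials per intermediate-target
# pair (3 x 2 x 2 x 3 = 36). The concrete values are placeholders, not the grid
# actually used in the slides' experiments.
from itertools import product

learning_rates = [1e-5, 2e-5, 3e-5]
warmup_steps = [0, 500]
batch_sizes = [16, 32]
seeds = [13, 42, 87]

trials = list(product(learning_rates, warmup_steps, batch_sizes, seeds))
assert len(trials) == 36

for lr, warmup, bs, seed in trials:
    config = {"learning_rate": lr, "warmup_steps": warmup,
              "per_device_train_batch_size": bs, "seed": seed}
    print(config)  # in practice: launch one fine-tuning run per config
```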

10 of 13

Generally, Syn_GPT2 & HellaSwag-p can slightly improve the best performance on dev sets

11 of 13

Syn_GPT2 & HellaSwag-p can greatly improve the average performance on dev sets

12 of 13

Using only 2k intermediate training examples from HellaSwag-p already helps!

(Plot: improvement in accuracy over vanilla fine-tuning)

13 of 13

Contribution

  • We find that a widely beneficial intermediate task need not provide specific linguistic or reasoning skills
  • We highlight the improvement in fine-tuning stability, supported by more than 1,000 experimental observations on RoBERTa-large
  • We study different contributing factors, suggesting a rethinking of why intermediate-task fine-tuning works (please refer to the paper for details)