1 of 37

Active Instruction Tuning: Improving Cross-Task Generalization by Training on Prompt Sensitive Tasks

Po-Nien Kung, Fan Yin, Di Wu, Kai-Wei Chang, Nanyun Peng


COMP790-158

Rui Shan Teo

09/30

2 of 37

Summary

Background:

  • Instruction tuning (IT) achieves impressive zero-shot generalization results by training LLMs on diverse tasks with instructions

Problem:

  • How to select new tasks to improve the performance and generalizability of IT models

Solution:

  • Active instruction tuning based on prompt uncertainty


3 of 37

Content

  1. Background
  2. Problem: data selection ≠ task selection
  3. Solution
    1. Active Instruction Tuning
    2. Prompt Uncertainty
    3. Task Map
  4. Method
    • Active Instruction Tuning
    • Prompt Uncertainty
  5. Experiment setting
    • Datasets
  6. Results
  7. Task Map
  8. Conclusion


4 of 37

Background: instruction tuning has shown great success

  • Models can perform well on unseen tasks when trained on a wide range of tasks
    • e.g., T0, FLAN, Tk-Instruct, InstructGPT, Alpaca, and Vicuna
  • Performance can be boosted by increasing the number of diverse training tasks
  • However, the scale of these datasets grows rapidly → need to select data to train the model on
    • Self-Instruct (Wang et al., 2022a) and Unnatural Instructions (Honovich et al., 2022) prompt LLMs to generate over 50K instruction tuning instances
    • Dynosaur (Yin et al., 2023a) dynamically curates over 80K instruction tuning instances from Huggingface Datasets (Lhoest et al., 2021)


5 of 37

Problem: data selection ≠ task selection

  • Existing methods focus on selecting the most useful instances within a single task
    • Using uncertainty-based intuitions such as entropy, Monte Carlo dropout, and ensemble disagreement
  • These uncertainty measurements only capture uncertainty at the instance level


[Figure: a training pool of Task 1, Task 2, and Task 3, each consisting of multiple instances]

6 of 37

Problem: data selection ≠ task selection

  • Previous research (Ivison et al., 2022; Poth et al., 2021; Kung et al., 2021) has explored measuring task usefulness by assessing its similarity to the target task
  • This can enhance performance when the target tasks are known in advance
  • But it is not suitable for instruction tuning, which aims to improve overall generalization to arbitrary unseen tasks


7 of 37

Solution: Active Instruction Tuning

  • A framework that actively identifies informative new tasks for an IT model, to continuously improve its cross-task generalization ability


“The main challenge lies in identifying useful tasks, for which we propose to select prompt-sensitive tasks.”

8 of 37

Solution: prompt uncertainty

  • Task-level uncertainty metric that measures the sensitivity of an IT model to instruction perturbations for a new task
  • To measure this:
    • Task Instruction: a specific task instruction is given to the model
    • Unlabeled Instances: task examples without predefined answers or labels
    • Disagreement Assessment: the model's prediction likelihoods under the original instruction are compared with its likelihoods under perturbed versions of that instruction
    • Multiple Instances: this comparison is repeated across several instances to get a broader view of the model's sensitivity
    • Average Disagreement Score: finally, the disagreements are averaged to quantify how much the model's predicted likelihoods shift between the original and perturbed prompts


9 of 37

Solution: Task Map (task diagnosing tool)

  • A task-diagnosing tool that categorizes tasks based on prompt uncertainty and prediction probability
  • Tasks are categorized as Ambiguous, Easy, or Difficult


10 of 37

Method: Active Instruction Tuning

  • When the number of tasks is large and continuously expanding, training on all existing tasks is impractical → task selection is needed

  1. Start from a large training task pool of fixed size
  2. In the 1st iteration, a small number of tasks is randomly sampled to train a weak instruction-tuned model
  3. Subsequent iterations select the most useful tasks based on the previous model and train a new model on the expanded task set
  4. Different task selection strategies are evaluated by testing the model on unseen tasks at each iteration (see the sketch below)
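
A minimal sketch of this loop, with train_model, select_tasks, and evaluate as user-supplied placeholder callbacks (these names are illustrative, not functions from the authors' code):

```python
import random

def active_instruction_tuning(task_pool, test_tasks, train_model, select_tasks,
                              evaluate, n_init=68, n_select=68, n_iters=5, seed=0):
    """Sketch of the active instruction tuning loop described above."""
    rng = random.Random(seed)
    # Iteration 1: randomly sample a small task set and train a weak IT model
    train_tasks = rng.sample(task_pool, n_init)
    remaining = [t for t in task_pool if t not in train_tasks]
    model = train_model(train_tasks)

    history = []
    for _ in range(n_iters):
        # Later iterations: pick the most useful tasks w.r.t. the previous model
        # (e.g. the most prompt-uncertain ones) and retrain on the expanded set
        selected = select_tasks(model, remaining, n_select)
        train_tasks = train_tasks + selected
        remaining = [t for t in remaining if t not in selected]
        model = train_model(train_tasks)
        # Track zero-shot generalization on held-out unseen tasks
        history.append(evaluate(model, test_tasks))
    return model, history
```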


11 of 37

Method: prompt uncertainty

  • Select highly prompt-uncertain tasks as the most informative ones to train on at each stage
  • Prompt uncertainty measurement
    • Motivated by Bayesian Active Learning by Disagreement (BALD) (Houlsby et al., 2011)
    • Measures the disagreement between the likelihoods of the model's original prediction under perturbed prompts and under the original prompt of a task


12 of 37

Method: prompt uncertainty

  • Measuring the prompt uncertainty of model W on a task


13 of 37

Method: prompt uncertainty

  • n: the number of task instances x_i^t
  • k: the number of perturbations applied to the task instruction I^t, with j = 0 corresponding to the unperturbed instruction
  • P: the model's predicted likelihood of the output y given the input instance and the instruction
  • W: the model weights (parameters)
  • I_j^t: the j-th version of the instruction for task t, where j ≥ 1 are the perturbed versions (combined in the formula below)
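
Combining these symbols, the prompt uncertainty of model W on task t can be written as follows (a reconstruction from the definitions above and the term-by-term description on the next slides, not a verbatim copy of the paper's equation):

U_t(W) = \frac{1}{n\,k} \sum_{i=1}^{n} \sum_{j=1}^{k} \left| P\!\left(y_i^t \mid x_i^t, I_j^t, W\right) - P\!\left(y_i^t \mid x_i^t, I_0^t, W\right) \right|

where I_0^t is the original instruction and I_1^t, …, I_k^t are its perturbed versions.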


14 of 37

Method: prompt uncertainty


Goal: the average disagreement in prediction likelihood between the perturbed and original instructions

15 of 37

Method: prompt uncertainty


  • P(y_i^t | x_i^t, I_j^t, W): the likelihood of predicting the output y_i^t given the task instance x_i^t, the task instruction I_j^t, and the model weights W
  • |P(y_i^t | x_i^t, I_j^t, W) − P(y_i^t | x_i^t, I_0^t, W)|: the absolute difference between the prediction likelihoods under the perturbed and unperturbed instructions
  • The inner sum averages this difference over the k perturbations applied to the task instruction
  • The outer sum averages the result over the n task instances (see the code sketch below)
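
A minimal sketch of this computation, assuming a likelihood(instruction, instance) callback that returns the model's predicted probability of its original output under the given instruction (the names are illustrative, not the authors' code):

```python
def prompt_uncertainty(likelihood, original_instruction, perturbed_instructions, instances):
    """Average |P(y|x, perturbed I, W) - P(y|x, original I, W)| over all
    task instances and all perturbed instructions of one task."""
    total = 0.0
    for x in instances:                                 # n unlabeled task instances
        p_orig = likelihood(original_instruction, x)    # likelihood under the original prompt
        for perturbed in perturbed_instructions:        # k perturbed prompts
            total += abs(likelihood(perturbed, x) - p_orig)   # per-pair disagreement
    return total / (len(instances) * len(perturbed_instructions))
```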

16 of 37

Method: prompt uncertainty

  • To perturb task instructions while preserving their meaning:
    • Employ paraphrasing techniques
    • Add extraneous tokens
    • Randomly omit words
  • A 0.2 drop rate is assigned to each word in the instruction to create perturbed instructions (see the sketch below)
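
A sketch of the random word omission step, assuming whitespace tokenization and a guard that keeps at least one word (the 0.2 drop rate comes from the slide; everything else is an assumption):

```python
import random

def perturb_instruction(instruction, drop_rate=0.2, seed=None):
    """Create a perturbed instruction by dropping each word with probability drop_rate."""
    rng = random.Random(seed)
    kept = [word for word in instruction.split() if rng.random() >= drop_rate]
    # Fall back to the original instruction if every word happened to be dropped
    return " ".join(kept) if kept else instruction

# Example: generate k = 5 perturbed versions of one task instruction
instruction = "Classify the sentiment of the given review as positive or negative."
perturbed = [perturb_instruction(instruction, seed=j) for j in range(1, 6)]
```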


17 of 37

Method: prompt uncertainty (underlying hypothesis)

  • Targets the epistemic uncertainty of a model in response to variations in prompts
    • Refers to a model’s uncertainty due to its lack of knowledge or insufficient training on a specific task
  • Usually, epistemic uncertainty is measured using an ensemble of models
    • Different models are trained on the same data, and their predictions are compared to capture the range of uncertainties
  • Here, perturbed prompts are used as the differing conditions instead
  • The likelihood of the model's original prediction stands in for the ensemble prediction (averaging over multiple perturbed prompts instead of over multiple models)


18 of 37

Method: prompt uncertainty (underlying hypothesis)

  • If a model cannot robustly map task instructions (prompts) to the underlying latent concepts (tasks it should perform), it will struggle to generalize (Xie et al., 2021; Pan et al., 2023)
  • Sensitivity to prompt perturbations is used as an indicator of robustness

  • Hypothesis: training the model on tasks whose prompts it is uncertain about (i.e., prompt-uncertain tasks) could improve its ability to associate prompts with specific latent concepts (tasks) → better zero-shot performance on unseen instructions


19 of 37

Experiment Setting: datasets used

  • NIV2 is the largest IT dataset, with 1,600+ cross-lingual tasks
    • It focuses on improving model generalization to unseen tasks
  • Self-Instruct is the dataset used to train the Alpaca model (Taori et al., 2023)
    • It aims to enhance the model's instruction-following ability, following the categorization in prior work (Kung and Peng, 2023)


20 of 37

Experiment Setting: active instruction tuning setting

Natural Instructions V2 (NIV2) dataset, English tasks split

  • 756 training tasks and 119 testing tasks, 5 random seeds
  • For each random seed, 68 tasks are randomly selected as initial training tasks and 68 as validation tasks
  • The remaining 620 tasks form the task pool
  • In each active learning iteration, the model selects the most informative tasks from the pool to improve its performance
    • A fixed ratio of classification to generative tasks is maintained (24 classification, 44 generative)
  • After each round of sampling, the newly selected tasks are added to the existing training set, and a new model is trained on it


21 of 37

Experiment Setting: active instruction tuning setting

Self-Instruct

  • The task pool consists of 52K tasks from the Self-Instruct dataset; 500 tasks are randomly sampled
  • Model performance is compared at 1K, 2K, 4K, 8K, and 16K training tasks
  • Task selection
    • All tasks are divided into 13 chunks by output sequence length
    • Task selection methods are applied to choose the most informative tasks from each chunk
    • The number of tasks selected from each chunk follows the proportion of tasks in that chunk relative to the entire task pool (see the sketch below)
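
A sketch of this proportional per-chunk selection; representing each task as a dict with an "output" field and scoring tasks with a generic score_fn are assumptions, not the paper's implementation:

```python
def select_by_chunk(tasks, score_fn, n_select, n_chunks=13):
    """Split tasks into n_chunks by output sequence length, then take the
    top-scoring tasks from each chunk in proportion to the chunk's size."""
    ordered = sorted(tasks, key=lambda t: len(t["output"]))      # order by output length
    chunk_size = -(-len(ordered) // n_chunks)                    # ceiling division
    chunks = [ordered[i:i + chunk_size] for i in range(0, len(ordered), chunk_size)]
    selected = []
    for chunk in chunks:
        # Quota proportional to the chunk's share of the task pool
        quota = round(n_select * len(chunk) / len(ordered))
        selected += sorted(chunk, key=score_fn, reverse=True)[:quota]
    return selected  # note: rounding may leave the total slightly off n_select
```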


22 of 37

Experiment Setting: task selection strategies

  • Baseline task selection strategies:
    • Random Sampling
    • High Perplexity and Low Perplexity: aim to select difficult/easy tasks by measuring the predicted sentence perplexity for generation tasks or the prediction entropy for classification tasks
  • The uncertainty scores of multiple instances (10 for NIV2, 1 for Self-Instruct) are aggregated to estimate task-level uncertainty (see the sketch below)
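
A sketch of how the perplexity baselines could turn per-token log-probabilities of a predicted output into a task-level score (the input format is an assumption about what the model exposes):

```python
import math

def sentence_perplexity(token_logprobs):
    """Perplexity of one predicted output: exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def task_perplexity_score(per_instance_logprobs):
    """Average instance-level perplexities (10 instances for NIV2, 1 for Self-Instruct)
    into a single task-level uncertainty score."""
    scores = [sentence_perplexity(lp) for lp in per_instance_logprobs]
    return sum(scores) / len(scores)

# High Perplexity selects the tasks with the largest scores, Low Perplexity the smallest.
```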


23 of 37

Experiment Setting: training

  • For the NIV2 dataset,
    • Train the T5-770M model following the settings of the Tk-Instruct model (the state-of-the-art, SOTA, at the time)
  • Performance is reported on Classification, Generative, and Overall tasks using the Rouge-L score
    • Rouge-L evaluates the quality of generated text against a reference based on their longest common subsequence (a minimal sketch follows below)
  • For the Self-Instruct dataset,
    • Train the LLaMA-7B model following the Alpaca setting
  • Performance is reported via blind pairwise comparison against Random Sampling on a test set of 252 user-oriented instances
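
For intuition only, a minimal Rouge-L sketch (LCS-based F1 over whitespace tokens); the reported scores use the standard implementation rather than this simplification:

```python
def rouge_l(candidate, reference):
    """Rouge-L F1: harmonic mean of LCS-based precision and recall over tokens."""
    c, r = candidate.split(), reference.split()
    # Dynamic-programming table for the longest common subsequence length
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, cw in enumerate(c, 1):
        for j, rw in enumerate(r, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if cw == rw else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(c)][len(r)]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

print(rouge_l("the cat sat on the mat", "the cat lay on the mat"))  # ≈ 0.833
```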


24 of 37

Experiment Setting: evaluation

Follows the evaluation protocol of Vicuna (Chiang et al., 2023), reporting GPT-4, ChatGPT (GPT-3.5), and human evaluation scores

  • Human evaluation
    • Human annotators judge whether one model "wins", "loses", or whether the two outputs "tie"
    • Majority voting is used to aggregate the final decision
  • GPT-4 and ChatGPT (GPT-3.5) evaluation: blind pairwise comparison
    • Model predictions are labeled (1) and (2), and the GPT judge is prompted to decide which is better or to declare them equal


25 of 37

Experiment Setting: evaluation


26 of 37

Results: NIV2 results

When selecting fewer than 340 tasks (half of the task pool), the Prompt Uncertainty method consistently outperforms the other baselines in terms of Overall scores on both the validation and testing sets


27 of 37

Results: NIV2 results

The proposed method is highly effective on Classification tasks, surpassing all other baselines.


28 of 37

Results: NIV2 results

For Generative tasks, the Low Perplexity method performs well on the testing tasks in early iterations but poorly on the validation set → the model's generalizability was not actually enhanced


29 of 37

Results: NIV2 results

The proposed method achieves consistently good performance on both testing and validation tasks, outperforming Random Sampling on the testing tasks and performing comparably on validation.


30 of 37

Results: NIV2 results

When the number of training tasks is increased further:

  • Uncertainty scores start to deviate → there are fewer uncertain tasks left to select
  • For example, the High Perplexity method starts selecting tasks with relatively low perplexity scores


31 of 37

Results: Self-Instruct results

The proposed method is compared with Random Sampling at each active instruction tuning iteration


32 of 37

Results: Self-Instruct results

In both the GPT-4 and ChatGPT evaluations, the Fully Trained model outperforms Random Sampling. With more training tasks, the performance gap diminishes → IT performance in the Alpaca setting scales with the number of training tasks


33 of 37

Results: Self-Instruct results

Low Perplexity and High Perplexity generally underperform Random Sampling, indicating that an inadequate task selection strategy can hurt the model's performance


34 of 37

Results: Self-Instruct results

The Prompt Uncertainty method is almost consistently preferred by GPT-4, ChatGPT, and human assessors when selecting 8,000 or fewer tasks


35 of 37

Task Map

  • A model-based diagnostic tool for understanding the contributions of different groups of tasks to instruction tuning
  • Uses two metrics:
    • Prediction probability
    • Prompt uncertainty
  • Three categories:
    • Ambiguous tasks: the model fails to recognize them and shows high prompt uncertainty
    • Easy and Difficult tasks: the model can map the prompts to certain latent task knowledge (low prompt uncertainty) and performs the task with high or low confidence (sentence probability), respectively (see the sketch below)
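
A sketch of how the two metrics could map a task onto the Task Map; the threshold values are illustrative placeholders, since the exact cut-offs are not given on the slide:

```python
def task_map_category(prompt_uncertainty, prediction_probability,
                      uncertainty_threshold=0.1, probability_threshold=0.5):
    """Place a task on the Task Map using prompt uncertainty and prediction probability.
    Thresholds are hypothetical, not values from the paper."""
    if prompt_uncertainty >= uncertainty_threshold:
        return "Ambiguous"    # model cannot robustly map the prompt to a latent task
    if prediction_probability >= probability_threshold:
        return "Easy"         # low prompt uncertainty, high-confidence predictions
    return "Difficult"        # low prompt uncertainty, low-confidence predictions
```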


36 of 37

Task Map

Training on Ambiguous tasks improves IT generalization, while training on Easy or Difficult tasks is worse than randomly selecting tasks


37 of 37

Conclusion

  • Active Instruction Tuning with prompt uncertainty: a framework to enhance the generalization ability of IT models in large-scale instruction tuning
  • Experiments demonstrate that training on prompt-uncertain tasks consistently outperforms random sampling and other uncertainty baselines
  • The Task Map tool shows that training on Ambiguous tasks improves generalization, while some Difficult tasks offer no benefit
