1 of 37

Active Instruction Tuning: Improving Cross-Task Generalization by Training on Prompt Sensitive Tasks

Po-Nien Kung, Fan Yin, Di Wu, Kai-Wei Chang, Nanyun Peng


COMP790-158

Rui Shan Teo

09/30

2 of 37

Summary

Background:

  • Instruction tuning (IT) achieves impressive zero-shot generalization results by training LLMs on diverse tasks with instructions

Problem:

  • How to select new tasks to improve the performance and generalizability of IT models

Solution:

  • Active instruction tuning based on prompt uncertainty


3 of 37

Content

  1. Background
  2. Problem: data selection ≠ task selection
  3. Solution
    1. Active Instruction Tuning
    2. Prompt Uncertainty
    3. Task Map
  4. Method
    • Active Instruction Tuning
    • Prompt Uncertainty
  5. Experiment setting
    • Datasets
  6. Results
  7. Task Map
  8. Conclusion


4 of 37

Background: instruction tuning has shown great success

  • Models can perform well on unseen tasks when trained on a wide range of tasks
    • e.g., T0, FLAN, Tk-Instruct, InstructGPT, Alpaca, and Vicuna
  • Performance can be boosted by increasing the number of diverse training tasks
  • However, the scale of these datasets grows rapidly → need to select data to train the model on
    • Self-Instruct (Wang et al., 2022a) and Unnatural Instructions (Honovich et al., 2022) prompt LLMs to generate over 50K instruction tuning instances
    • Dynosaur (Yin et al., 2023a) dynamically curates over 80K instruction tuning instances from Huggingface Datasets (Lhoest et al., 2021)


5 of 37

Problem: data selection ≠ task selection

  • Existing methods focus on selecting the most useful instances within a single task
    • Using uncertainty-based intuitions such as entropy, Monte Carlo dropout, and ensemble disagreement
  • These uncertainty measurements only capture uncertainty at the instance level


[Figure: a training pool of Task 1, Task 2, and Task 3, each consisting of multiple instances]

6 of 37

Problem: data selection ≠ task selection

  • Previous research (Ivison et al., 2022; Poth et al., 2021; Kung et al., 2021) has explored measuring task usefulness by assessing its similarity to the target task
  • This can enhance performance when the target tasks are known in advance
  • But it is not suitable for instruction tuning, which aims to improve overall generalization to arbitrary unseen tasks


7 of 37

Solution: Active Instruction Tuning

  • A framework that actively identifies informative new tasks for an IT model, to continuously improve its cross-task generalization ability


“The main challenge lies in identifying useful tasks, for which we propose to select prompt-sensitive tasks.”

8 of 37

Solution: prompt uncertainty

  • Task-level uncertainty metric that measures the sensitivity of an IT model to instruction perturbations for a new task
  • To measure this:
    • Task Instruction: a specific task instruction is given to the model
    • Unlabeled Instances: task examples without predefined answers or labels
    • Disagreement Assessment: the model's prediction likelihoods under the original instruction are compared with its likelihoods under perturbed versions of that instruction
    • Multiple Instances: this comparison is repeated across several instances to get a broader view of the model's sensitivity
    • Average Disagreement Score: finally, the disagreements are averaged to quantify how much the model's predicted likelihoods shift between the original and perturbed prompts


9 of 37

Solution: Task Map (task diagnosing tool)

  • A task-diagnosing tool that categorizes tasks based on prompt uncertainty and prediction probability
  • Tasks are categorized as Ambiguous, Easy, or Difficult


10 of 37

Method: Active Instruction Tuning

  • When the number of tasks is large and continuously expanding, training on all existing tasks is impractical → task selection is needed

  1. Start from a large training task pool of fixed size
  2. In the 1st iteration, a small number of tasks is randomly sampled to train a weak instruction-tuned model
  3. Subsequent iterations select the most useful tasks based on the previous model and train a new model on the expanded task set
  4. Different task selection strategies are evaluated by testing the model on unseen tasks at each iteration (see the sketch below)
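
A minimal sketch of this loop, with train_model, select_tasks, and evaluate as user-supplied placeholder callbacks (these names are illustrative, not functions from the authors' code):

```python
import random

def active_instruction_tuning(task_pool, test_tasks, train_model, select_tasks,
                              evaluate, n_init=68, n_select=68, n_iters=5, seed=0):
    """Sketch of the active instruction tuning loop described above."""
    rng = random.Random(seed)
    # Iteration 1: randomly sample a small task set and train a weak IT model
    train_tasks = rng.sample(task_pool, n_init)
    remaining = [t for t in task_pool if t not in train_tasks]
    model = train_model(train_tasks)

    history = []
    for _ in range(n_iters):
        # Later iterations: pick the most useful tasks w.r.t. the previous model
        # (e.g. the most prompt-uncertain ones) and retrain on the expanded set
        selected = select_tasks(model, remaining, n_select)
        train_tasks = train_tasks + selected
        remaining = [t for t in remaining if t not in selected]
        model = train_model(train_tasks)
        # Track zero-shot generalization on held-out unseen tasks
        history.append(evaluate(model, test_tasks))
    return model, history
```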


11 of 37

Method: prompt uncertainty

  • Select highly prompt-uncertain tasks as the most informative ones to train on at each stage
  • Prompt uncertainty measurement
    • Motivated by Bayesian Active Learning by Disagreement (BALD) (Houlsby et al., 2011)
    • Measures the disagreement between the likelihoods of the model's original prediction under perturbed prompts and under the original prompt of a task


12 of 37

Method: prompt uncertainty

  • Measuring the prompt uncertainty of model W on a task


13 of 37

Method: prompt uncertainty

  • n: the number of task instances x_i^t
  • k: the number of perturbations applied to the task instruction I^t, with j = 0 corresponding to the unperturbed instruction
  • P: the model's predicted likelihood of the output y given the input instance and the instruction
  • W: the model weights (parameters)
  • I_j^t: the j-th version of the instruction for task t, where j ≥ 1 are the perturbed versions (combined in the formula below)
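
Combining these symbols, the prompt uncertainty of model W on task t can be written as follows (a reconstruction from the definitions above and the term-by-term description on the next slides, not a verbatim copy of the paper's equation):

U_t(W) = \frac{1}{n\,k} \sum_{i=1}^{n} \sum_{j=1}^{k} \left| P\!\left(y_i^t \mid x_i^t, I_j^t, W\right) - P\!\left(y_i^t \mid x_i^t, I_0^t, W\right) \right|

where I_0^t is the original instruction and I_1^t, …, I_k^t are its perturbed versions.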


14 of 37

Method: prompt uncertainty


Goal: the average disagreement in prediction likelihood between the perturbed and original instructions

15 of 37

Method: prompt uncertainty


  • P(y_i^t | x_i^t, I_j^t, W): the likelihood of predicting the output y_i^t given the task instance x_i^t, the task instruction I_j^t, and the model weights W
  • |P(y_i^t | x_i^t, I_j^t, W) − P(y_i^t | x_i^t, I_0^t, W)|: the absolute difference between the prediction likelihoods under the perturbed and unperturbed instructions
  • The inner sum averages this difference over the k perturbations applied to the task instruction
  • The outer sum averages the result over the n task instances (see the code sketch below)
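
A minimal sketch of this computation, assuming a likelihood(instruction, instance) callback that returns the model's predicted probability of its original output under the given instruction (the names are illustrative, not the authors' code):

```python
def prompt_uncertainty(likelihood, original_instruction, perturbed_instructions, instances):
    """Average |P(y|x, perturbed I, W) - P(y|x, original I, W)| over all
    task instances and all perturbed instructions of one task."""
    total = 0.0
    for x in instances:                                 # n unlabeled task instances
        p_orig = likelihood(original_instruction, x)    # likelihood under the original prompt
        for perturbed in perturbed_instructions:        # k perturbed prompts
            total += abs(likelihood(perturbed, x) - p_orig)   # per-pair disagreement
    return total / (len(instances) * len(perturbed_instructions))
```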

16 of 37

Method: prompt uncertainty

  • To perturb task instructions while preserving their meaning:
    • Employ paraphrasing techniques
    • Add extraneous tokens
    • Randomly omit words
  • A 0.2 drop rate is assigned to each word in the instruction to create perturbed instructions (see the sketch below)
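
A sketch of the random word omission step, assuming whitespace tokenization and a guard that keeps at least one word (the 0.2 drop rate comes from the slide; everything else is an assumption):

```python
import random

def perturb_instruction(instruction, drop_rate=0.2, seed=None):
    """Create a perturbed instruction by dropping each word with probability drop_rate."""
    rng = random.Random(seed)
    kept = [word for word in instruction.split() if rng.random() >= drop_rate]
    # Fall back to the original instruction if every word happened to be dropped
    return " ".join(kept) if kept else instruction

# Example: generate k = 5 perturbed versions of one task instruction
instruction = "Classify the sentiment of the given review as positive or negative."
perturbed = [perturb_instruction(instruction, seed=j) for j in range(1, 6)]
```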


17 of 37

Method: prompt uncertainty (underlying hypothesis)

  • Targets the epistemic uncertainty of a model in response to variations in prompts
    • Refers to a model’s uncertainty due to its lack of knowledge or insufficient training on a specific task
  • Usually, epistemic uncertainty is measured using an ensemble of models
    • Different models are trained on the same data, and their predictions are compared to capture the range of uncertainties
  • Here, perturbed prompts are used as the differing conditions instead
  • The likelihood of the model's original prediction stands in for the ensemble prediction (averaging over multiple perturbed prompts instead of over multiple models)


18 of 37

Method: prompt uncertainty (underlying hypothesis)

  • If a model cannot robustly map task instructions (prompts) to the underlying latent concepts (tasks it should perform), it will struggle to generalize (Xie et al., 2021; Pan et al., 2023)
  • Sensitivity to prompt perturbations is used as an indicator of robustness

  • Hypothesis: training the model on tasks whose prompts it is uncertain about (i.e., prompt-uncertain tasks) could improve its ability to associate prompts with specific latent concepts (tasks) → better zero-shot performance on unseen instructions


19 of 37

Experiment Setting: datasets used

  • NIV2 is the largest IT dataset, with 1,600+ cross-lingual tasks
    • It focuses on improving model generalization to unseen tasks
  • Self-Instruct is the dataset used to train the Alpaca model (Taori et al., 2023)
    • It aims to enhance the model's instruction-following ability, following the categorization in prior work (Kung and Peng, 2023)


20 of 37

Experiment Setting: active instruction tuning setting

Natural Instructions V2 (NIV2) dataset, English tasks split

  • 756 training tasks and 119 testing tasks, 5 random seeds
  • For each random seed, 68 tasks are randomly selected as initial training tasks and 68 as validation tasks
  • The remaining 620 tasks form the task pool
  • In each active learning iteration, the model selects the most informative tasks from the pool to improve its performance
    • A fixed ratio of classification to generative tasks is maintained (24 classification, 44 generative)
  • After each round of sampling, the newly selected tasks are added to the existing training set, and a new model is trained on it


21 of 37

Experiment Setting: active instruction tuning setting

Self-Instruct

  • The task pool consists of 52K tasks from the Self-Instruct dataset; 500 tasks are randomly sampled
  • Model performance is compared at 1K, 2K, 4K, 8K, and 16K training tasks
  • Task selection
    • All tasks are divided into 13 chunks by output sequence length
    • Task selection methods are applied to choose the most informative tasks from each chunk
    • The number of tasks selected from each chunk follows the proportion of tasks in that chunk relative to the entire task pool (see the sketch below)
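
A sketch of this proportional per-chunk selection; representing each task as a dict with an "output" field and scoring tasks with a generic score_fn are assumptions, not the paper's implementation:

```python
def select_by_chunk(tasks, score_fn, n_select, n_chunks=13):
    """Split tasks into n_chunks by output sequence length, then take the
    top-scoring tasks from each chunk in proportion to the chunk's size."""
    ordered = sorted(tasks, key=lambda t: len(t["output"]))      # order by output length
    chunk_size = -(-len(ordered) // n_chunks)                    # ceiling division
    chunks = [ordered[i:i + chunk_size] for i in range(0, len(ordered), chunk_size)]
    selected = []
    for chunk in chunks:
        # Quota proportional to the chunk's share of the task pool
        quota = round(n_select * len(chunk) / len(ordered))
        selected += sorted(chunk, key=score_fn, reverse=True)[:quota]
    return selected  # note: rounding may leave the total slightly off n_select
```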


22 of 37

Experiment Setting: task selection strategies

  • Baseline task selection strategies:
    • Random Sampling
    • High Perplexity and Low Perplexity: aim to select difficult/easy tasks by measuring the predicted sentence perplexity for generation tasks or the prediction entropy for classification tasks
  • The uncertainty scores of multiple instances (10 for NIV2, 1 for Self-Instruct) are aggregated to estimate task-level uncertainty (see the sketch below)
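
A sketch of how the perplexity baselines could turn per-token log-probabilities of a predicted output into a task-level score (the input format is an assumption about what the model exposes):

```python
import math

def sentence_perplexity(token_logprobs):
    """Perplexity of one predicted output: exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def task_perplexity_score(per_instance_logprobs):
    """Average instance-level perplexities (10 instances for NIV2, 1 for Self-Instruct)
    into a single task-level uncertainty score."""
    scores = [sentence_perplexity(lp) for lp in per_instance_logprobs]
    return sum(scores) / len(scores)

# High Perplexity selects the tasks with the largest scores, Low Perplexity the smallest.
```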


23 of 37

Experiment Setting: training

  • For the NIV2 dataset,
    • Train the T5-770M model following the settings of the Tk-Instruct model (the state-of-the-art, SOTA, at the time)
  • Performance is reported on Classification, Generative, and Overall tasks using the Rouge-L score
    • Rouge-L evaluates the quality of generated text against a reference based on their longest common subsequence (a minimal sketch follows below)
  • For the Self-Instruct dataset,
    • Train the LLaMA-7B model following the Alpaca setting
  • Performance is reported via blind pairwise comparison against Random Sampling on a test set of 252 user-oriented instances
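
For intuition only, a minimal Rouge-L sketch (LCS-based F1 over whitespace tokens); the reported scores use the standard implementation rather than this simplification:

```python
def rouge_l(candidate, reference):
    """Rouge-L F1: harmonic mean of LCS-based precision and recall over tokens."""
    c, r = candidate.split(), reference.split()
    # Dynamic-programming table for the longest common subsequence length
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, cw in enumerate(c, 1):
        for j, rw in enumerate(r, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if cw == rw else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(c)][len(r)]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

print(rouge_l("the cat sat on the mat", "the cat lay on the mat"))  # ≈ 0.833
```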


24 of 37

Experiment Setting: evaluation

Follows the evaluation protocol of Vicuna (Chiang et al., 2023), reporting GPT-4, ChatGPT (GPT-3.5), and human evaluation scores

  • Human evaluation
    • Human annotators judge whether one model "wins", "loses", or whether the two outputs "tie"
    • Majority voting is used to aggregate the final decision
  • GPT-4 and ChatGPT (GPT-3.5) evaluation: blind pairwise comparison
    • Model predictions are labeled (1) and (2), and the GPT judge is prompted to decide which is better or to declare them equal


25 of 37

Experiment Setting: evaluation


26 of 37

Results: NIV2 results

When selecting fewer than 340 tasks (half of the task pool), the Prompt Uncertainty method consistently outperforms the other baselines in terms of Overall scores on both the validation and testing sets


27 of 37

Results: NIV2 results

The proposed method is highly effective on Classification tasks, surpassing all other baselines.


28 of 37

Results: NIV2 results

For Generative tasks, the Low Perplexity method performs well on the testing tasks in early iterations but poorly on the validation set → the model's generalizability was not actually enhanced


29 of 37

Results: NIV2 results

The proposed method achieves consistently good performance on both testing and validation tasks, outperforming Random Sampling on the testing tasks and performing comparably on validation.


30 of 37

Results: NIV2 results

When the number of training tasks is increased further:

  • Uncertainty scores start to deviate → there are fewer uncertain tasks left to select
  • For example, the High Perplexity method starts selecting tasks with relatively low perplexity scores


31 of 37

Results: Self-Instruct results

The proposed method is compared with Random Sampling at each active instruction tuning iteration


32 of 37

Results: Self-Instruct results

In both the GPT-4 and ChatGPT evaluations, the Fully Trained model outperforms Random Sampling. With more training tasks, the performance gap diminishes → IT performance in the Alpaca setting scales with the number of training tasks


33 of 37

Results: Self-Instruct results

Low Perplexity and High Perplexity generally underperform Random Sampling, indicating that an inadequate task selection strategy can hurt the model's performance


34 of 37

Results: Self-Instruct results

The Prompt Uncertainty method is almost consistently preferred by GPT-4, ChatGPT, and human assessors when selecting 8,000 or fewer tasks


35 of 37

Task Map

  • A model-based diagnostic tool for understanding the contributions of different groups of tasks to instruction tuning
  • Uses two metrics:
    • Prediction probability
    • Prompt uncertainty
  • Three categories:
    • Ambiguous tasks: the model fails to recognize them and shows high prompt uncertainty
    • Easy and Difficult tasks: the model can map the prompts to certain latent task knowledge (low prompt uncertainty) and performs the task with high or low confidence (sentence probability), respectively (see the sketch below)
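
A sketch of how the two metrics could map a task onto the Task Map; the threshold values are illustrative placeholders, since the exact cut-offs are not given on the slide:

```python
def task_map_category(prompt_uncertainty, prediction_probability,
                      uncertainty_threshold=0.1, probability_threshold=0.5):
    """Place a task on the Task Map using prompt uncertainty and prediction probability.
    Thresholds are hypothetical, not values from the paper."""
    if prompt_uncertainty >= uncertainty_threshold:
        return "Ambiguous"    # model cannot robustly map the prompt to a latent task
    if prediction_probability >= probability_threshold:
        return "Easy"         # low prompt uncertainty, high-confidence predictions
    return "Difficult"        # low prompt uncertainty, low-confidence predictions
```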


36 of 37

Task Map

Training on Ambiguous tasks improves IT generalization, while training on Easy or Difficult tasks is worse than randomly selecting tasks


37 of 37

Conclusion

  • Active Instruction Tuning with prompt uncertainty: a framework to enhance the generalization ability of IT models in large-scale instruction tuning
  • Experiments demonstrate that training on prompt-uncertain tasks consistently outperforms random sampling and other uncertainty baselines
  • The Task Map tool shows that training on Ambiguous tasks improves generalization, while some Difficult tasks offer no benefit
