Active Instruction Tuning: Improving Cross-Task Generalization by Training on Prompt Sensitive Tasks
Po-Nien Kung, Fan Yin, Di Wu, Kai-Wei Chang, Nanyun Peng
1
COMP790-158
Rui Shan Teo
09/30
Summary
Background:
Problem:
Solution:
2
Content
3
Background: instruction tuning has shown great success
4
Problem: data selection ≠ task selection
5
[Figure: a task pool with Task 1, Task 2, and Task 3, each consisting of multiple instances]
Problem: data selection ≠ task selection
6
Solution: Active Instruction Tuning
7
“The main challenge lies in identifying useful tasks, for which we propose to select prompt-sensitive tasks.”
Solution: prompt uncertainty
Average Disagreement Score: Finally, the likelihood differences are averaged to quantify how much the model's prediction probabilities shift between the original and perturbed prompts.
8
Solution: Task Map (task diagnosing tool)
9
Method: Active Instruction Tuning
10
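Roughly, the method grows the training task set iteratively: score the remaining task pool with the current model, add the most prompt-uncertain tasks, and retrain. A minimal sketch, where `train_model` and `prompt_uncertainty` are hypothetical callables supplied by the caller and the per-iteration budget is an illustrative choice, not the paper's exact setting:

```python
def active_instruction_tuning(task_pool, initial_tasks, train_model,
                              prompt_uncertainty, n_iterations, tasks_per_iter):
    """Grow the training task set by repeatedly adding the most
    prompt-uncertain tasks from the remaining pool and retraining.

    `train_model(tasks)` and `prompt_uncertainty(model, task)` are
    hypothetical callables supplied by the caller.
    """
    selected = list(initial_tasks)
    remaining = [t for t in task_pool if t not in selected]

    model = train_model(selected)              # instruction-tune on the initial tasks
    for _ in range(n_iterations):
        # Score every candidate task with the current model.
        scores = {t: prompt_uncertainty(model, t) for t in remaining}
        # Select the tasks whose outputs are most sensitive to prompt perturbations.
        new_tasks = sorted(remaining, key=scores.get, reverse=True)[:tasks_per_iter]
        selected += new_tasks
        remaining = [t for t in remaining if t not in new_tasks]
        model = train_model(selected)          # retrain on the enlarged task set
    return model, selected
```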
Method: prompt uncertainty
11
Method: prompt uncertainty
12
Method: prompt uncertainty
13
Method: prompt uncertainty
14
Goal: measure the average disagreement in likelihood between the perturbed and the original instruction
Method: prompt uncertainty
15
Prompt uncertainty of a task $t$:
$$U_t = \frac{1}{n\,k} \sum_{i=1}^{n} \sum_{j=1}^{k} \bigl|\, p(y_i^t \mid x_i^t, I_j^t, W) - p(y_i^t \mid x_i^t, I_0^t, W) \,\bigr|$$
where $p(y_i^t \mid x_i^t, I_j^t, W)$ is the likelihood of predicting the output $y_i^t$ given the task instance $x_i^t$, the task instruction $I_j^t$, and the model weights $W$; $I_0^t$ is the original instruction and $I_1^t, \dots, I_k^t$ are its perturbed versions. The absolute difference between the prediction probabilities under the perturbed and unperturbed instruction is averaged over the $k$ perturbations applied to the task and over its $n$ task instances.
Method: prompt uncertainty
16
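A minimal sketch of the score defined above. It assumes a `likelihood(instruction, x, y)` callable that returns $p(y \mid x, I, W)$ under the current model; the random word-dropping perturbation is an illustrative assumption, not necessarily the paper's exact perturbation procedure:

```python
import random

def prompt_uncertainty(likelihood, instruction, instances, k=20, drop_rate=0.2, seed=0):
    """Average absolute disagreement in output likelihood between the original
    and k perturbed instructions, averaged over the task's n instances.

    `likelihood(instruction, x, y)` is assumed to return p(y | x, I, W) under
    the current model; dropping words at random is one simple way to perturb
    the instruction (an illustrative assumption).
    """
    rng = random.Random(seed)

    def perturb(text):
        words = text.split()
        kept = [w for w in words if rng.random() > drop_rate]
        return " ".join(kept) if kept else text

    total, count = 0.0, 0
    for x, y in instances:                       # n task instances (x_i, y_i)
        p_orig = likelihood(instruction, x, y)   # p(y_i | x_i, I_0, W)
        for _ in range(k):                       # k perturbed instructions I_j
            p_pert = likelihood(perturb(instruction), x, y)
            total += abs(p_orig - p_pert)
            count += 1
    return total / count                         # U_t = (1 / (n * k)) * sum of |differences|
```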
Method: prompt uncertainty (underlying hypothesis)
17
Method: prompt uncertainty (underlying hypothesis)
18
Experiment Setting: datasets used
19
Experiment Setting: active instruction tuning setting
Natural Instructions V2 (NIV2) dataset, English tasks split
20
Experiment Setting: active instruction tuning setting
Self-Instruct
21
Experiment Setting: task selection strategies
22
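All of the strategies compared in the experiments (Random Sampling, Low/High Perplexity, and the proposed Prompt Uncertainty) can be viewed as ranking the remaining task pool by a different per-task score. A hedged sketch, assuming `perplexity` and `uncertainty` are dicts of per-task scores already computed with the current model:

```python
import random

def select_tasks(remaining, budget, strategy, perplexity=None, uncertainty=None, seed=0):
    """Pick `budget` tasks from the remaining pool under one of the
    task selection strategies compared in the experiments."""
    if strategy == "random":              # Random Sampling baseline
        return random.Random(seed).sample(remaining, budget)
    if strategy == "low_perplexity":      # tasks the model already predicts fluently
        return sorted(remaining, key=perplexity.get)[:budget]
    if strategy == "high_perplexity":     # tasks the model finds hardest to predict
        return sorted(remaining, key=perplexity.get, reverse=True)[:budget]
    if strategy == "prompt_uncertainty":  # proposed: most prompt-sensitive tasks
        return sorted(remaining, key=uncertainty.get, reverse=True)[:budget]
    raise ValueError(f"unknown strategy: {strategy}")
```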
Experiment Setting: training
23
Experiment Setting: evaluation
Follows the evaluation protocol of Vicuna (Chiang et al., 2023) to report GPT-4, ChatGPT (GPT-3.5), and human evaluation scores
24
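As a rough illustration of how Vicuna-style pairwise judgments can be turned into a single score: the ratio-of-sums aggregation below is an illustrative assumption, not necessarily the exact scoring used in the paper.

```python
def vicuna_style_score(judgments):
    """Aggregate pairwise judgments into one relative score.

    `judgments` is assumed to be a list of (model_score, reference_score)
    pairs assigned by a judge (GPT-4, ChatGPT, or a human annotator) to the
    evaluated model and a reference model on the same prompts; the
    ratio-of-sums aggregation is an illustrative assumption.
    """
    model_total = sum(m for m, _ in judgments)
    reference_total = sum(r for _, r in judgments)
    return model_total / reference_total


# Example: three prompts judged on a 1-10 scale (hypothetical numbers).
print(vicuna_style_score([(8, 9), (7, 7), (9, 6)]))  # -> ~1.09
```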
Experiment Setting: evaluation
25
Results: NIV2 results
When selecting fewer than 340 tasks (half of the task pool), the Prompt Uncertainty method consistently outperforms the other baselines in terms of Overall scores on both the validation and testing sets
26
Results: NIV2 results
The proposed method is highly effective for Classification tasks, surpassing all other baselines.
27
Results: NIV2 results
For Generative tasks, the Low Perplexity method performs well on testing tasks at early iterations but poorly on the validation set → the model's generalizability was not enhanced
28
Results: NIV2 results
The proposed method achieves consistently good performance on both testing and validation tasks, outperforming Random Sampling on testing tasks and exhibiting similar performance on validation.
29
Results: NIV2 results
Effect of further increasing the number of training tasks
30
Results: Self-Instruct results
Compared with Random Sampling at each active instruction tuning iteration
31
Results: Self-Instruct results
For both GPT-4 and ChatGPT evaluation, the Fully Trained model outperforms Random Sampling. With more training tasks, the performance gap diminishes → instruction tuning performance in the Alpaca setting scales with the number of training tasks
32
Results: Self-Instruct results
Low Perplexity and High Perplexity generally underperform Random Sampling, indicating that an inadequate task selection strategy can hurt the model's performance
33
Results: Self-Instruct results
The Prompt Uncertainty method is almost consistently preferred by GPT-4, ChatGPT, and human assessors when selecting 8,000 or fewer tasks
34
Task Map
35
Task Map
Training on Ambiguous tasks improves IT generalization, while training on Easy or Difficult tasks is worse than randomly selecting tasks
36
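A minimal sketch of how the Task Map categories above could be assigned from two per-task statistics (prompt uncertainty and prediction probability); the median-based cut-offs are an illustrative assumption, not the paper's exact thresholds:

```python
from statistics import median

def task_map_categories(tasks, uncertainty, probability):
    """Assign each task to Ambiguous / Easy / Difficult on the task map.

    `uncertainty` and `probability` are dicts of per-task prompt uncertainty
    and average prediction probability; splitting at the medians is an
    illustrative assumption rather than the paper's exact thresholds.
    """
    u_cut = median(uncertainty[t] for t in tasks)
    p_cut = median(probability[t] for t in tasks)

    categories = {}
    for t in tasks:
        if uncertainty[t] >= u_cut:
            categories[t] = "Ambiguous"   # prompt-sensitive tasks
        elif probability[t] >= p_cut:
            categories[t] = "Easy"        # prompt-robust, high confidence
        else:
            categories[t] = "Difficult"   # prompt-robust, low confidence
    return categories
```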
Conclusion
37