Learnings from Efficiency Challenge
NOTE: Ranked 2nd overall in the A100 category.
Anmol Agarwal, Shashank Shet, Ajinkya Deshpande, Arun Iyer, Suresh Parthasarathy
Microsoft Research India
Code: https://github.com/anmolagarwal999/neurips_effeciency_challenge_2023_submission
Constraints/Rules
Datasets
Dataset #1: (MMLU)
Aim: To measure skills acquired during pre-training by evaluating models exclusively in zero-shot and few-shot settings.
Dataset #2: (Truthful QA)
Aim: To measure whether a language model is truthful in generating answers to questions.
Generic few-shot examples
Few-shot examples meant to invoke conspiracy-following behaviour from the model
Dataset #3: (Bias Benchmark for QA)
Aim: To test the model's resistance to biases along nine different social dimensions.
Dataset #4: (CNN/DM)
Aim: Measure text summarization capabilities
Dataset #5: (GSM8k)
Aim: To test multi-step mathematical reasoning
Dataset #6: (BigBench)
Aim: A very diverse (and still expanding) collection of tasks.
METRICS (from HELM)
(1) Accuracy
(2) Robustness: Tests model performance after perturbing inputs, e.g., paraphrasing text into pirate-style/native American English, adding spelling mistakes, randomly adding extra spaces between words, or replacing words with synonyms.
(3) Fairness (across gender and race): Measures how changing gender- and race-specific terms (John → Iwobe) changes model performance.
(4) Bias: Measures gender bias and racial bias in model generations.
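For intuition, here is a minimal sketch (not HELM's actual implementation) of the kinds of input perturbations behind the robustness and fairness numbers; the function names, probabilities, and term mapping are illustrative assumptions.

```python
import random

def add_typos(text: str, p: float = 0.05) -> str:
    """Randomly swap adjacent letters to simulate spelling mistakes (robustness)."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and random.random() < p:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def add_extra_spaces(text: str, p: float = 0.1) -> str:
    """Randomly insert extra spaces between words (robustness)."""
    words = text.split(" ")
    return " ".join(w + " " if random.random() < p else w for w in words)

def swap_fairness_terms(text: str, mapping: dict) -> str:
    """Replace gender/race-specific terms, e.g. names, to probe fairness."""
    for src, dst in mapping.items():
        text = text.replace(src, dst)
    return text

question = "John went to the store. What did John buy?"
robust_q = add_extra_spaces(add_typos(question))           # robustness perturbation
fair_q = swap_fairness_terms(question, {"John": "Iwobe"})  # fairness perturbation
```

The metric is then the model's performance on these perturbed inputs relative to the originals.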
Performance of our best submission
| Model | Accuracy - MMLU | Robustness - MMLU | Fairness - MMLU | Accuracy - TruthfulQA | Accuracy - BBQ | Accuracy - CNN/DailyMail - ROUGE-2 | Bias - CNN/DailyMail - Stereotypes (race) [LOWER IS BETTER] | Bias - CNN/DailyMail - Stereotypes (gender) [LOWER IS BETTER] | Bias - CNN/DailyMail - Representation (race) [LOWER IS BETTER] | Bias - CNN/DailyMail - Representation (gender) [LOWER IS BETTER] |
|---|---|---|---|---|---|---|---|---|---|---|
| gpt-3.5-turbo-0301 | 0.59 | 0.525 | 0.53 | 0.609 | NA | NA | NA | NA | NA | NA |
| text-davinci-003 | 0.569 | 0.517 | 0.537 | 0.593 | 0.862 | 0.156 | 0.646 | 0.414 | 0.274 | 0.083 |
| Anthropic-LM v4-s3 (52B) | 0.481 | 0.434 | 0.447 | 0.368 | 0.551 | 0.154 | 0.616 | 0.412 | 0.252 | 0.093 |
| Turing NLG (530B) | 0.469 | 0.403 | 0.418 | 0.251 | 0.479 | 0.161 | 0.629 | 0.398 | 0.227 | 0.12 |
| LLAMA-2-7B | 0.4498 | 0.3904 | 0.3997 | 0.3 | 0.38 | 0.1427 | 0.6667 | 0.5 | 0.4667 | 0.3246 |
| LLAMA-2-13B | 0.58 | 0.54 | 0.52 | 0.36 | 0.64 | 0.14 | 0.67 | 0.39 | 0.5 | 0.25 |
| LLAMA-2-70B | 0.582 | 0.545 | 0.557 | 0.554 | NA | NA | NA | NA | NA | NA |
| QWEN-14B-Base | 0.6961 | 0.6569 | 0.6765 | 0.5 | 0.8 | 0.0363 | 0 | 0.5 | 0.4167 | 0.5 |
| MISTRAL-7B base | 0.6373 | 0.6176 | 0.598 | 0.6 | 0.82 | 0.0999 | 0 | 0.3464 | 0.3939 | 0.1575 |
| Our Best Submission | 0.64 | 0.6 | 0.59 | 0.9 | 1 | 0.18 | 0.67 | 0.5 | 0.47 | 0.21 |
Approach and Learnings
55% of the time, they gave the same wrong answer
52% of the time, they gave the same wrong answer
(2) Use the same permutation-invariance approach as above at inference time. On receiving an input, we created all possible permutations of the option order and did max-voting / averaging of logit scores over the model's responses to each permuted input, taking the option with the most votes as the final answer. A sketch of this follows below.
We did not include this in the final submission because of the inference-time overhead it added.
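A minimal sketch of the inference-time permutation voting described above, not our exact implementation; `score_options` stands in for whatever per-option scoring the model provides and is an assumed helper.

```python
from collections import Counter
from itertools import permutations

def permutation_vote(question, options, score_options):
    """Max-vote over every permutation of the option order.

    `score_options(question, ordered_options)` is an assumed helper that
    returns the index (within `ordered_options`) the model prefers.
    """
    votes = Counter()
    for perm in permutations(options):
        picked_idx = score_options(question, list(perm))
        votes[perm[picked_idx]] += 1  # vote for the option text, not its position
    return votes.most_common(1)[0][0]

# Toy usage with a dummy scorer that always picks the first listed option:
answer = permutation_vote("2 + 2 = ?", ["3", "4", "5", "6"], lambda q, opts: 0)
```

Averaging logit scores per option across permutations works the same way, with the Counter replaced by a running sum of scores.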
Original ordering:
Question Q: <Question text>
Option A: <option 1>
Option B: <option 2>
Option C: <option 3>
Option D: <option 4>

Permuted ordering:
Question Q: <Question text>
Option A: <option 3>
Option B: <option 4>
Option C: <option 2>
Option D: <option 1>

We created permutations of the option positions and kept the permuted instances in the training dataset; a sketch of this augmentation follows below.
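A sketch of the training-time augmentation under the assumption that each example is a dict with a question, a list of options, and the index of the correct answer (field names are illustrative, not our exact schema).

```python
import random

def augment_with_permutations(example, n_perms=3, seed=0):
    """Create extra training instances by shuffling the option positions.

    `example` is assumed to look like:
      {"question": str, "options": [str, ...], "answer_idx": int}
    """
    rng = random.Random(seed)
    augmented = [example]
    for _ in range(n_perms):
        order = list(range(len(example["options"])))
        rng.shuffle(order)
        augmented.append({
            "question": example["question"],
            "options": [example["options"][i] for i in order],
            # track where the gold option moved so the label stays correct
            "answer_idx": order.index(example["answer_idx"]),
        })
    return augmented
```

Training on these permuted copies pushes the model to pick the correct option text rather than a favoured option position.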
Approach and Learnings
| Model | Accuracy - TruthfulQA - EM | Accuracy - BBQ - EM |
|---|---|---|
| LLAMA-2-7B finetuned on BBQ dataset | 0.46 | 0.9 |
| LLAMA-2-7B finetuned on TruthfulQA dataset | 0.34 | 0.5 |
| LLAMA-2-7B finetuned on combination of BBQ+TruthfulQA+MMLU+BigBench | 0.64 | 0.96 |
On TruthfulQA, finetuning on the combined dataset (0.64) clearly outperformed finetuning on TruthfulQA alone (0.34).
On BBQ, finetuning on the combined dataset (0.96) also edged out finetuning on BBQ alone (0.9).
Learnings: finetuning on a mixture of datasets generalized better than finetuning on any single dataset.
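A rough sketch of how such a mixed finetuning set can be assembled, assuming each source has already been rendered into a shared prompt/answer format (function and field names are illustrative, not our exact pipeline).

```python
import random

def build_mixed_finetuning_set(sources, seed=42):
    """Concatenate and shuffle examples from several QA datasets.

    `sources` maps a dataset name to a list of {"prompt": str, "answer": str}
    examples that already share a common format.
    """
    mixed = []
    for name, examples in sources.items():
        for ex in examples:
            mixed.append({**ex, "source": name})  # keep provenance for later analysis
    random.Random(seed).shuffle(mixed)
    return mixed

# e.g. build_mixed_finetuning_set({"bbq": bbq_examples, "truthfulqa": tqa_examples,
#                                  "mmlu": mmlu_examples, "bigbench": bb_examples})
```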
Approach and Learnings
| Model | Accuracy - MMLU - EM | Robustness - MMLU - EM | Fairness - MMLU - EM | Accuracy - TruthfulQA - EM | Accuracy - CNN/DailyMail - ROUGE-2 | Bias - CNN/DailyMail - Stereotypes (race) [LOWER IS BETTER] | Bias - CNN/DailyMail - Stereotypes (gender) [LOWER IS BETTER] | Bias - CNN/DailyMail - Representation (race) [LOWER IS BETTER] | Bias - CNN/DailyMail - Representation (gender) [LOWER IS BETTER] |
|---|---|---|---|---|---|---|---|---|---|
| Mistral finetuned to the end of epoch 2 | 0.6448 | 0.598 | 0.593 | 0.84 | 0.176 | 0.666 | 0.5 | 0.4666 | 0.2111 |
| Mistral finetuned to the end of epoch 3 | 0.618262 | 0.578 | 0.583 | 0.9 | 0.174 | 0.66 | 0.4 | 0.5 | 0.12 |