1 of 16

Learnings from Efficiency Challenge

  • Constraints on the Submissions
  • Datasets tested on (slides 3-9)
  • Metrics measured (slide 10)
  • Learnings (slides 11-16)

NOTE:

  • The challenge began on 25th July.
  • We started participating in the challenge on: 6th Oct.
  • The deadline was 27th Oct.

Ranked 2nd overall in the A100 category.

Anmol Agarwal, Shashank Shet, Ajinkya Deshpande, Arun Iyer, Suresh Parthasarathy

Microsoft Research India

Code: https://github.com/anmolagarwal999/neurips_effeciency_challenge_2023_submission

2 of 16

Constraints/Rules

  • Fine-tuning had to complete within 1 day on a single A100 (40 GB)
  • Base models must not already be instruction fine-tuned (e.g., the chat versions of LLAMA-2, Mistral, etc. were not allowed)
  • Existing AI-generated datasets could not be used
  • Closed-source LLMs could not be used to generate one's own datasets
  • For applications such as RAG, only the Wikipedia corpus was allowed
  • Regarding datasets:
    • There was a public leaderboard based on a few datasets.
    • The final leaderboard was based on a separate, undisclosed (hidden) set of datasets.

3 of 16

Datasets

  • Several datasets were initially included but later removed. By the time we started, the public datasets were:
    • MMLU
    • TruthfulQA
    • BBQ
    • CNN/Daily Summarization
    • GSM8k
    • BigBench

4 of 16

Dataset #1: (MMLU)

Aim: To measure skills acquired during pre-training by evaluating models exclusively in zero-shot and few-shot settings.

  • Covers 57 subjects spanning STEM, the humanities, the social sciences, and more (including law, biology, physics, maths, European history, etc.)
  • Difficulty ranges from elementary to advanced professional level
  • Tests both world knowledge and problem solving ability
  • MCQ-based evaluation

5 of 16

Dataset #2: (Truthful QA)

Aim: To measure whether a language model is truthful in generating answers to questions.

  • Contains questions that some humans would answer falsely due to a false belief or misconception
  • Questions span 38 categories, including health, law, finance and politics.
  • MCQ-based evaluation

[Figure: two example prompt styles, generic few-shot examples and few-shot examples meant to evoke conspiracy-following behaviour from the model]

6 of 16

Dataset #3: (Bias Benchmark for QA)

Aim: To test the model's resistance to biases along nine different social dimensions

  • MCQ-based evaluation

7 of 16

Dataset #4: (CNN/DM)

Aim: Measure text summarization capabilities

  • Input: A news story from the CNN and Daily Mail websites
  • Expected output: a summary of the article in at most 3 sentences
  • ROUGE-score-based matching (see the example below)
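
A minimal sketch of ROUGE-2-based matching, assuming the open-source `rouge_score` package (pip install rouge-score); HELM's exact scoring pipeline may differ, this only illustrates the metric itself. The reference and prediction strings below are made up.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)

# Hypothetical reference summary and model output.
reference = ("The council approved the new budget. "
             "Spending rises by 3 percent. Schools receive most of the increase.")
prediction = "The council approved a budget with a 3 percent spending rise, mostly for schools."

# score(target, prediction) returns precision/recall/F1 for each requested ROUGE variant.
scores = scorer.score(reference, prediction)
print(scores["rouge2"].fmeasure)  # bigram-overlap F1 between prediction and reference
```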

8 of 16

Dataset #5: (GSM8k)

Aim: To test multi-step mathematical reasoning

  • Linguistically diverse grade-school math word problems created by human problem writers
  • Problems are solvable by a middle-school student
  • Text-based matching (NOT MCQ style)

9 of 16

Dataset #6: (BigBench)

Aim: To cover a very diverse (and still expanding) set of tasks

  • Includes more than 200 tasks as of now
  • Crowdsourced effort
  • Includes both multiple-choice and text-based queries

10 of 16

METRICS (from HELM)

(1) Accuracy

  • Exact string match for MCQ-based tasks
  • Text-similarity-based metrics for non-MCQ tasks

(2) Robustness: Tests model performance after perturbing the input, e.g. paraphrasing text into pirate-style or Native American English, adding spelling mistakes, randomly inserting extra spaces between words, or replacing words with synonyms (a sketch of such perturbations appears at the end of this slide).

(3) Fairness (across gender and race): Measures how changing gender- and race-specific terms (John → Iwobe) changes model performance.

(4) Bias: Measures gender bias and racial bias in model generations.
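
For illustration, a minimal sketch of the kinds of input perturbations listed above; the helper functions, rates, and example sentence are our own simplifications, not HELM's actual perturbation code.

```python
import random

def add_typos(text: str, rate: float = 0.05) -> str:
    """Simulate spelling mistakes by randomly swapping adjacent letters."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def add_extra_spaces(text: str, rate: float = 0.1) -> str:
    """Randomly insert an extra space after some words."""
    return " ".join(w + " " if random.random() < rate else w for w in text.split(" "))

def swap_terms(text: str, mapping: dict) -> str:
    """Replace gender/race-specific terms (e.g. John -> Iwobe) for fairness checks."""
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text

question = "John went to the store and bought 3 apples for 2 dollars each."
print(add_typos(add_extra_spaces(question)))   # robustness-style perturbation
print(swap_terms(question, {"John": "Iwobe"})) # fairness-style perturbation
```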

11 of 16

Performance of our best submission

(For the four CNN/DailyMail bias columns, lower is better.)

| Model | Accuracy (MMLU) | Robustness (MMLU) | Fairness (MMLU) | Accuracy (TruthfulQA) | Accuracy (BBQ) | ROUGE-2 (CNN/DailyMail) | Stereotype bias, race (CNN/DM) | Stereotype bias, gender (CNN/DM) | Representation bias, race (CNN/DM) | Representation bias, gender (CNN/DM) |
|---|---|---|---|---|---|---|---|---|---|---|
| gpt-3.5-turbo-0301 | 0.59 | 0.525 | 0.53 | 0.609 | NA | NA | NA | NA | NA | NA |
| text-davinci-003 | 0.569 | 0.517 | 0.537 | 0.593 | 0.862 | 0.156 | 0.646 | 0.414 | 0.274 | 0.083 |
| Anthropic-LM v4-s3 (52B) | 0.481 | 0.434 | 0.447 | 0.368 | 0.551 | 0.154 | 0.616 | 0.412 | 0.252 | 0.093 |
| Turing NLG (530B) | 0.469 | 0.403 | 0.418 | 0.251 | 0.479 | 0.161 | 0.629 | 0.398 | 0.227 | 0.12 |
| LLAMA-2-7B | 0.4498 | 0.3904 | 0.3997 | 0.3 | 0.38 | 0.1427 | 0.6667 | 0.5 | 0.4667 | 0.3246 |
| LLAMA-2-13B | 0.58 | 0.54 | 0.52 | 0.36 | 0.64 | 0.14 | 0.67 | 0.39 | 0.5 | 0.25 |
| LLAMA-2-70B | 0.582 | 0.545 | 0.557 | 0.554 | NA | NA | NA | NA | NA | NA |
| QWEN-14B-Base | 0.6961 | 0.6569 | 0.6765 | 0.5 | 0.8 | 0.0363 | 0 | 0.5 | 0.4167 | 0.5 |
| MISTRAL-7B base | 0.6373 | 0.6176 | 0.598 | 0.6 | 0.82 | 0.0999 | 0 | 0.3464 | 0.3939 | 0.1575 |
| Our Best Submission | 0.64 | 0.6 | 0.59 | 0.9 | 1 | 0.18 | 0.67 | 0.5 | 0.47 | 0.21 |

12 of 16

Approach and Learnings

  • Observation: When the models (LLAMA-2-7B, fine-tuned versions of LLAMA-2, Mistral) gave wrong answers, in many cases they gave the same wrong answer.
    • Was this a side-effect of the way the questions were phrased, such that both models were nudged towards the same wrong choice? (A sketch of how we measured this agreement follows.)

In one pairwise comparison, the models gave the same wrong answer 55% of the time; in another, 52% of the time.
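
A minimal sketch of how such an agreement rate can be computed; the prediction lists below are made up for illustration, and the slides do not state whether the 55%/52% figures use exactly this denominator (questions both models got wrong).

```python
def same_wrong_answer_rate(preds_a, preds_b, gold):
    """Of the questions both models get wrong, how often do they pick the same option?"""
    both_wrong = [(a, b) for a, b, g in zip(preds_a, preds_b, gold) if a != g and b != g]
    if not both_wrong:
        return 0.0
    return sum(a == b for a, b in both_wrong) / len(both_wrong)

# Example: gold MCQ answers and two models' picks over 5 questions.
gold    = ["A", "C", "B", "D", "A"]
model_1 = ["A", "B", "D", "C", "B"]
model_2 = ["A", "B", "D", "A", "C"]
print(same_wrong_answer_rate(model_1, model_2, gold))  # 0.5
```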

13 of 16

  • What we tried:
    • (1) Shuffled the options in the dataset we used for fine-tuning, with the intention of removing any bias (linking a certain style of text to a particular answer position) present in the initial model weights.
    • (2) Used the same permutation-invariance idea at inference time: on receiving an input, we created all possible permutations of the options and did majority voting (or averaged logit scores) over the model's responses to each permuted input. The option with the most votes was the final answer (see the sketch after the prompt example below).

  • Observation:
    • This made the model more confident on answers it already got right.
    • It made the model less confident on answers it originally got wrong.

We did not include this in the final submission because of the added inference-time overhead.

Original prompt:

  Question Q: <Question text>
  Option A: <option 1>
  Option B: <option 2>
  Option C: <option 3>
  Option D: <option 4>

Permuted prompt:

  Question Q: <Question text>
  Option A: <option 3>
  Option B: <option 4>
  Option C: <option 2>
  Option D: <option 1>

We created permutations of the option positions and kept the permuted instances in the dataset.
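
A minimal sketch of the inference-time variant (majority voting shown; the averaged-logit-scores variant is analogous). The `answer_mcq` callable is a hypothetical wrapper around the model that returns the text of the chosen option.

```python
from collections import Counter
from itertools import permutations

def vote_over_permutations(answer_mcq, question, options):
    """Ask the model once per option ordering and majority-vote the chosen option text."""
    votes = Counter()
    for perm in permutations(options):   # 4 options -> 24 model calls
        chosen_text = answer_mcq(question, list(perm))
        votes[chosen_text] += 1           # answers are compared by option text,
    return votes.most_common(1)[0][0]     # so the ordering does not matter
```

The 24 calls per 4-option question are exactly the inference-time overhead that kept this out of the final submission.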

14 of 16

Approach:

  • We tried multiple approaches, such as:
    • Fine-tuning the model separately on each dataset and then using an ensemble of the individual fine-tuned models at inference time
    • Fine-tuning the model separately for “knowledge-based tasks” and “reasoning-based tasks”
    • Many others

  • Our winning submission:
    • Dataset related:
      • Fine-tuned the model on a dataset D containing examples from all the tasks
      • Introduced synthetic perturbations (of our own design) into D, to make the model more immune to the prompts HELM uses to test robustness, fairness and bias
      • Included permutations of the options for the same prompt (as described earlier)
    • Inference related:
      • Different models do well on different tasks.
      • Hence, we used an ensemble of models.
      • Based on the type of query, we routed it to one of the models in the ensemble (see the routing sketch below).
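
A minimal sketch of the routing idea; the `classify_task` heuristic and the model names are hypothetical, not the actual logic of our submission.

```python
def classify_task(prompt: str) -> str:
    """Very rough heuristic: guess the task type from the prompt's shape."""
    if "Summarize" in prompt or "Article:" in prompt:
        return "summarization"
    if all(marker in prompt for marker in ("A.", "B.", "C.", "D.")):
        return "mcq"
    return "reasoning"

class EnsembleRouter:
    """Hold one fine-tuned model per task type and dispatch each query to it."""
    def __init__(self, models):
        # e.g. {"mcq": mcq_model, "summarization": summ_model, "reasoning": math_model}
        self.models = models

    def generate(self, prompt: str) -> str:
        task = classify_task(prompt)
        return self.models[task].generate(prompt)  # route to the model that did best on this task type
```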

15 of 16

Approach and Learnings

  • Our final submission was an ensemble of Mistral models.
  • Initially, we chose LLAMA-2-7B for all our experiments (because of its small size and the extensive existing research on it)
  • We first wanted to reach the best scores on the individual datasets (why? explained verbally)

| Model | Accuracy (TruthfulQA, EM) | Accuracy (BBQ, EM) |
|---|---|---|
| LLAMA-2-7B finetuned on BBQ dataset | 0.46 | 0.9 |
| LLAMA-2-7B finetuned on TruthfulQA dataset | 0.34 | 0.5 |
| LLAMA-2-7B finetuned on combination of BBQ+TruthfulQA+MMLU+BigBench | 0.64 | 0.96 |

On TruthfulQA,

  • Model does better when finetuned on BBQ than when finetuned on TruthfulQA (#)
  • Model does best when finetuned on a combination

On BBQ,

  • Model performance improves when finetuned on TruthfulQA
  • Model does best when finetuned on a combination

  • The skills learned by the model on one dataset seem to be transferable to another dataset.
  • (#) → the models do not necessarily learn the same way humans do

Learnings:

  • Although next-word prediction is an extremely simple task, when combined with massive datasets it forces the model to learn a large number of tasks.
  • Fine-tuning on one dataset can lead to significant gains on other datasets, even when the tasks are not very similar. LLMs do not necessarily learn the same way humans do.

16 of 16

Approach and Learnings

  • Learning 2: Across epochs, performance can be brittle, i.e. the accuracy on some datasets may go up while on others it goes down.

(For the four CNN/DailyMail bias columns, lower is better.)

| Model | Accuracy (MMLU, EM) | Robustness (MMLU, EM) | Fairness (MMLU, EM) | Accuracy (TruthfulQA, EM) | ROUGE-2 (CNN/DailyMail) | Stereotype bias, race (CNN/DM) | Stereotype bias, gender (CNN/DM) | Representation bias, race (CNN/DM) | Representation bias, gender (CNN/DM) |
|---|---|---|---|---|---|---|---|---|---|
| Mistral finetuned till end of epoch 2 | 0.6448 | 0.598 | 0.593 | 0.84 | 0.176 | 0.666 | 0.5 | 0.4666 | 0.2111 |
| Mistral finetuned till end of epoch 3 | 0.618262 | 0.578 | 0.583 | 0.9 | 0.174 | 0.66 | 0.4 | 0.5 | 0.12 |

  • Since next-word prediction is massive multi-task learning, one can view the loss as a weighted sum of many individual task losses (see the sketch below).
  • When we decrease the loss, it is unlikely that all individual tasks improve uniformly. The loss for some tasks might be saturated (a larger model no longer improving in grammar because it already has near-perfect grammar), while other tasks might improve in a more sudden fashion (to push the loss lower, the larger model has to figure out how to write proper summaries without bias).
  • The loss for MCQ datasets saturated after only 2-3 epochs, but for long-generation tasks such as CNN/DailyMail it had not saturated even by the last epoch.
  • If the loss goes from 4 to 3, do all tasks get better uniformly? Probably not.
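
A minimal sketch of this "weighted sum of per-task losses" view, with made-up logged values and weights; tracking the per-task averages is what reveals which tasks saturate early and which keep improving.

```python
from collections import defaultdict

def per_task_losses(logged):
    """logged: iterable of (task_name, loss) pairs, e.g. one entry per training batch."""
    sums, counts = defaultdict(float), defaultdict(int)
    for task, loss in logged:
        sums[task] += loss
        counts[task] += 1
    return {task: sums[task] / counts[task] for task in sums}

def total_loss(per_task, weights):
    """Overall loss as a weighted sum of per-task losses; lowering it does not
    force every term to shrink uniformly."""
    return sum(weights[t] * loss for t, loss in per_task.items())

# Made-up per-batch losses: the MCQ-style task is near saturation,
# while the summarization task still has a lot of headroom.
logged = [("mmlu", 0.40), ("mmlu", 0.38), ("cnn_dm", 1.9), ("cnn_dm", 1.7), ("gsm8k", 0.9)]
per_task = per_task_losses(logged)
print(per_task)
print(total_loss(per_task, {"mmlu": 0.3, "cnn_dm": 0.5, "gsm8k": 0.2}))
```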