1 of 16

Learnings from Efficiency Challenge

  • Constraints on the Submissions
  • Datasets tested on (slides 3-9)
  • Metrics measured (slide 10)
  • Learnings (slides 11-16)

NOTE:

  • The challenge began on 25th July.
  • We started participating in the challenge on: 6th Oct.
  • The deadline was 27th Oct.

Ranked 2nd overall in the A100 category.

Anmol Agarwal, Shashank Shet, Ajinkya Deshpande, Arun Iyer, Suresh Parthasarathy

Microsoft Research India

Code: https://github.com/anmolagarwal999/neurips_effeciency_challenge_2023_submission

2 of 16

Constraints/Rules

  • Fine-tuning had to complete within 1 day on a single A100 (40 GB)
  • Base models must not already be instruction fine-tuned (e.g., the chat versions of LLAMA-2, Mistral, etc. were not allowed)
  • Existing AI-generated datasets could not be used
  • Closed-source LLMs could not be used to generate one's own datasets
  • For applications such as RAG, only the Wikipedia corpus was allowed
  • Regarding datasets:
    • There was a public leaderboard based on a few datasets.
    • The final leaderboard was based on a separate, undisclosed (hidden) set of datasets.

3 of 16

Datasets

  • Several datasets were initially included but later removed. By the time we started, the public datasets were:
    • MMLU
    • TruthfulQA
    • BBQ
    • CNN/Daily Summarization
    • GSM8k
    • BigBench

4 of 16

Dataset #1: (MMLU)

Aim: To measure skills acquired during pre-training by evaluating models exclusively in zero-shot and few-shot settings.

  • Covers 57 subjects spanning STEM, the humanities, the social sciences, and more (including law, biology, physics, maths, European history, etc.)
  • Difficulty ranges from elementary to advanced professional level
  • Tests both world knowledge and problem solving ability
  • MCQ-based evaluation

5 of 16

Dataset #2: (Truthful QA)

Aim: To measure whether a language model is truthful in generating answers to questions.

  • Contains questions that some humans would answer falsely due to a false belief or misconception
  • Questions span 38 categories, including health, law, finance and politics.
  • MCQ-based evaluation

[Figure: two example prompt styles, generic few-shot examples and few-shot examples meant to evoke conspiracy-following behaviour from the model]

6 of 16

Dataset #3: (Bias Benchmark for QA)

Aim: To test the model's resistance to biases along nine different social dimensions

  • MCQ-based evaluation

7 of 16

Dataset #4: (CNN/DM)

Aim: Measure text summarization capabilities

  • Input: A news story from the CNN and Daily Mail websites
  • Expected output: a summary of the article in at most 3 sentences
  • ROUGE-score-based matching (see the example below)
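
A minimal sketch of ROUGE-2-based matching, assuming the open-source `rouge_score` package (pip install rouge-score); HELM's exact scoring pipeline may differ, this only illustrates the metric itself. The reference and prediction strings below are made up.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)

# Hypothetical reference summary and model output.
reference = ("The council approved the new budget. "
             "Spending rises by 3 percent. Schools receive most of the increase.")
prediction = "The council approved a budget with a 3 percent spending rise, mostly for schools."

# score(target, prediction) returns precision/recall/F1 for each requested ROUGE variant.
scores = scorer.score(reference, prediction)
print(scores["rouge2"].fmeasure)  # bigram-overlap F1 between prediction and reference
```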

8 of 16

Dataset #5: (GSM8k)

Aim: To test multi-step mathematical reasoning

  • Linguistically diverse grade-school math word problems created by human problem writers
  • Problems are solvable by a middle-school student
  • Text-based matching (NOT MCQ style)

9 of 16

Dataset #6: (BigBench)

Aim: To cover a very diverse (and still expanding) set of tasks

  • Includes more than 200 tasks as of now
  • Crowdsourced effort
  • Includes both multiple-choice and text-based queries

10 of 16

METRICS (from HELM)

(1) Accuracy

  • Exact string match for MCQ-based tasks
  • Text-similarity-based metrics for non-MCQ tasks

(2) Robustness: Tests model performance after perturbing the input, e.g. paraphrasing text into pirate-style or Native American English, adding spelling mistakes, randomly inserting extra spaces between words, or replacing words with synonyms (a sketch of such perturbations appears at the end of this slide).

(3) Fairness (across gender and race): Measures how changing gender- and race-specific terms (John → Iwobe) changes model performance.

(4) Bias: Measures gender bias and racial bias in model generations.
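
For illustration, a minimal sketch of the kinds of input perturbations listed above; the helper functions, rates, and example sentence are our own simplifications, not HELM's actual perturbation code.

```python
import random

def add_typos(text: str, rate: float = 0.05) -> str:
    """Simulate spelling mistakes by randomly swapping adjacent letters."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def add_extra_spaces(text: str, rate: float = 0.1) -> str:
    """Randomly insert an extra space after some words."""
    return " ".join(w + " " if random.random() < rate else w for w in text.split(" "))

def swap_terms(text: str, mapping: dict) -> str:
    """Replace gender/race-specific terms (e.g. John -> Iwobe) for fairness checks."""
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text

question = "John went to the store and bought 3 apples for 2 dollars each."
print(add_typos(add_extra_spaces(question)))   # robustness-style perturbation
print(swap_terms(question, {"John": "Iwobe"})) # fairness-style perturbation
```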

11 of 16

Performance of our best submission

(For the four CNN/DailyMail bias columns, lower is better.)

| Model | Accuracy (MMLU) | Robustness (MMLU) | Fairness (MMLU) | Accuracy (TruthfulQA) | Accuracy (BBQ) | ROUGE-2 (CNN/DailyMail) | Stereotype bias, race (CNN/DM) | Stereotype bias, gender (CNN/DM) | Representation bias, race (CNN/DM) | Representation bias, gender (CNN/DM) |
|---|---|---|---|---|---|---|---|---|---|---|
| gpt-3.5-turbo-0301 | 0.59 | 0.525 | 0.53 | 0.609 | NA | NA | NA | NA | NA | NA |
| text-davinci-003 | 0.569 | 0.517 | 0.537 | 0.593 | 0.862 | 0.156 | 0.646 | 0.414 | 0.274 | 0.083 |
| Anthropic-LM v4-s3 (52B) | 0.481 | 0.434 | 0.447 | 0.368 | 0.551 | 0.154 | 0.616 | 0.412 | 0.252 | 0.093 |
| Turing NLG (530B) | 0.469 | 0.403 | 0.418 | 0.251 | 0.479 | 0.161 | 0.629 | 0.398 | 0.227 | 0.12 |
| LLAMA-2-7B | 0.4498 | 0.3904 | 0.3997 | 0.3 | 0.38 | 0.1427 | 0.6667 | 0.5 | 0.4667 | 0.3246 |
| LLAMA-2-13B | 0.58 | 0.54 | 0.52 | 0.36 | 0.64 | 0.14 | 0.67 | 0.39 | 0.5 | 0.25 |
| LLAMA-2-70B | 0.582 | 0.545 | 0.557 | 0.554 | NA | NA | NA | NA | NA | NA |
| QWEN-14B-Base | 0.6961 | 0.6569 | 0.6765 | 0.5 | 0.8 | 0.0363 | 0 | 0.5 | 0.4167 | 0.5 |
| MISTRAL-7B base | 0.6373 | 0.6176 | 0.598 | 0.6 | 0.82 | 0.0999 | 0 | 0.3464 | 0.3939 | 0.1575 |
| Our Best Submission | 0.64 | 0.6 | 0.59 | 0.9 | 1 | 0.18 | 0.67 | 0.5 | 0.47 | 0.21 |

12 of 16

Approach and Learnings

  • Observation: When the models (LLAMA-2-7B, fine-tuned versions of LLAMA-2, Mistral) gave wrong answers, in many cases they gave the same wrong answer.
    • Was this a side-effect of the way the questions were phrased, such that both models were nudged towards the same wrong choice? (A sketch of how we measured this agreement follows.)

In one pairwise comparison, the models gave the same wrong answer 55% of the time; in another, 52% of the time.
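
A minimal sketch of how such an agreement rate can be computed; the prediction lists below are made up for illustration, and the slides do not state whether the 55%/52% figures use exactly this denominator (questions both models got wrong).

```python
def same_wrong_answer_rate(preds_a, preds_b, gold):
    """Of the questions both models get wrong, how often do they pick the same option?"""
    both_wrong = [(a, b) for a, b, g in zip(preds_a, preds_b, gold) if a != g and b != g]
    if not both_wrong:
        return 0.0
    return sum(a == b for a, b in both_wrong) / len(both_wrong)

# Example: gold MCQ answers and two models' picks over 5 questions.
gold    = ["A", "C", "B", "D", "A"]
model_1 = ["A", "B", "D", "C", "B"]
model_2 = ["A", "B", "D", "A", "C"]
print(same_wrong_answer_rate(model_1, model_2, gold))  # 0.5
```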

13 of 16

  • What we tried:
    • (1) Shuffled the options in the dataset we used for fine-tuning, with the intention of removing any bias (linking a certain style of text to a particular answer position) present in the initial model weights.
    • (2) Used the same permutation-invariance idea at inference time: on receiving an input, we created all possible permutations of the options and did majority voting (or averaged logit scores) over the model's responses to each permuted input. The option with the most votes was the final answer (see the sketch after the prompt example below).

  • Observation:
    • This made the model more confident on answers it already got right.
    • It made the model less confident on answers it originally got wrong.

We did not include this in the final submission because of the added inference-time overhead.

Original prompt:

  Question Q: <Question text>
  Option A: <option 1>
  Option B: <option 2>
  Option C: <option 3>
  Option D: <option 4>

Permuted prompt:

  Question Q: <Question text>
  Option A: <option 3>
  Option B: <option 4>
  Option C: <option 2>
  Option D: <option 1>

We created permutations of the option positions and kept the permuted instances in the dataset.
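
A minimal sketch of the inference-time variant (majority voting shown; the averaged-logit-scores variant is analogous). The `answer_mcq` callable is a hypothetical wrapper around the model that returns the text of the chosen option.

```python
from collections import Counter
from itertools import permutations

def vote_over_permutations(answer_mcq, question, options):
    """Ask the model once per option ordering and majority-vote the chosen option text."""
    votes = Counter()
    for perm in permutations(options):   # 4 options -> 24 model calls
        chosen_text = answer_mcq(question, list(perm))
        votes[chosen_text] += 1           # answers are compared by option text,
    return votes.most_common(1)[0][0]     # so the ordering does not matter
```

The 24 calls per 4-option question are exactly the inference-time overhead that kept this out of the final submission.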

14 of 16

Approach:

  • We tried multiple approaches, such as:
    • Fine-tuning the model separately on each dataset and then using an ensemble of the individual fine-tuned models at inference time
    • Fine-tuning the model separately for “knowledge-based tasks” and “reasoning-based tasks”
    • Many others

  • Our winning submission:
    • Dataset related:
      • Fine-tuned the model on a dataset D containing examples from all the tasks
      • Introduced synthetic perturbations (of our own design) into D, to make the model more immune to the prompts HELM uses to test robustness, fairness and bias
      • Included permutations of the options for the same prompt (as described earlier)
    • Inference related:
      • Different models do well on different tasks.
      • Hence, we used an ensemble of models.
      • Based on the type of query, we routed it to one of the models in the ensemble (see the routing sketch below).
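
A minimal sketch of the routing idea; the `classify_task` heuristic and the model names are hypothetical, not the actual logic of our submission.

```python
def classify_task(prompt: str) -> str:
    """Very rough heuristic: guess the task type from the prompt's shape."""
    if "Summarize" in prompt or "Article:" in prompt:
        return "summarization"
    if all(marker in prompt for marker in ("A.", "B.", "C.", "D.")):
        return "mcq"
    return "reasoning"

class EnsembleRouter:
    """Hold one fine-tuned model per task type and dispatch each query to it."""
    def __init__(self, models):
        # e.g. {"mcq": mcq_model, "summarization": summ_model, "reasoning": math_model}
        self.models = models

    def generate(self, prompt: str) -> str:
        task = classify_task(prompt)
        return self.models[task].generate(prompt)  # route to the model that did best on this task type
```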

15 of 16

Approach and Learnings

  • Our final submission was an ensemble of Mistral models.
  • Initially, we chose LLAMA-2-7B for all our experiments (because of its small size and the extensive existing research on it)
  • We first wanted to reach the best scores on the individual datasets (why? explained verbally)

| Model | Accuracy (TruthfulQA, EM) | Accuracy (BBQ, EM) |
|---|---|---|
| LLAMA-2-7B finetuned on BBQ dataset | 0.46 | 0.9 |
| LLAMA-2-7B finetuned on TruthfulQA dataset | 0.34 | 0.5 |
| LLAMA-2-7B finetuned on combination of BBQ+TruthfulQA+MMLU+BigBench | 0.64 | 0.96 |

On TruthfulQA,

  • Model does better when finetuned on BBQ than when finetuned on TruthfulQA (#)
  • Model does best when finetuned on a combination

On BBQ,

  • Model performance improves when finetuned on TruthfulQA
  • Model does best when finetuned on a combination

  • The skills learned by the model on one dataset seem to be transferable to another dataset.
  • (#) → the models do not necessarily learn the same way humans do

Learnings:

  • Although next-word prediction is an extremely simple task, when combined with massive datasets it forces the model to learn a large number of tasks.
  • Fine-tuning on one dataset can lead to significant gains on other datasets, even when the tasks are not very similar. LLMs do not necessarily learn the same way humans do.

16 of 16

Approach and Learnings

  • Learning 2: Across epochs, performance can be brittle, i.e. the accuracy on some datasets may go up while on others it goes down.

(For the four CNN/DailyMail bias columns, lower is better.)

| Model | Accuracy (MMLU, EM) | Robustness (MMLU, EM) | Fairness (MMLU, EM) | Accuracy (TruthfulQA, EM) | ROUGE-2 (CNN/DailyMail) | Stereotype bias, race (CNN/DM) | Stereotype bias, gender (CNN/DM) | Representation bias, race (CNN/DM) | Representation bias, gender (CNN/DM) |
|---|---|---|---|---|---|---|---|---|---|
| Mistral finetuned till end of epoch 2 | 0.6448 | 0.598 | 0.593 | 0.84 | 0.176 | 0.666 | 0.5 | 0.4666 | 0.2111 |
| Mistral finetuned till end of epoch 3 | 0.618262 | 0.578 | 0.583 | 0.9 | 0.174 | 0.66 | 0.4 | 0.5 | 0.12 |

  • Since next-word prediction is massive multi-task learning, one can view the loss as a weighted sum of many individual task losses (see the sketch below).
  • When we decrease the loss, it is unlikely that all individual tasks improve uniformly. The loss for some tasks might be saturated (a larger model no longer improving in grammar because it already has near-perfect grammar), while other tasks might improve in a more sudden fashion (to push the loss lower, the larger model has to figure out how to write proper summaries without bias).
  • The loss for MCQ datasets saturated after only 2-3 epochs, but for long-generation tasks such as CNN/DailyMail it had not saturated even by the last epoch.
  • If the loss goes from 4 to 3, do all tasks get better uniformly? Probably not.
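
A minimal sketch of this "weighted sum of per-task losses" view, with made-up logged values and weights; tracking the per-task averages is what reveals which tasks saturate early and which keep improving.

```python
from collections import defaultdict

def per_task_losses(logged):
    """logged: iterable of (task_name, loss) pairs, e.g. one entry per training batch."""
    sums, counts = defaultdict(float), defaultdict(int)
    for task, loss in logged:
        sums[task] += loss
        counts[task] += 1
    return {task: sums[task] / counts[task] for task in sums}

def total_loss(per_task, weights):
    """Overall loss as a weighted sum of per-task losses; lowering it does not
    force every term to shrink uniformly."""
    return sum(weights[t] * loss for t, loss in per_task.items())

# Made-up per-batch losses: the MCQ-style task is near saturation,
# while the summarization task still has a lot of headroom.
logged = [("mmlu", 0.40), ("mmlu", 0.38), ("cnn_dm", 1.9), ("cnn_dm", 1.7), ("gsm8k", 0.9)]
per_task = per_task_losses(logged)
print(per_task)
print(total_loss(per_task, {"mmlu": 0.3, "cnn_dm": 0.5, "gsm8k": 0.2}))
```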