The current scenario of LLM evaluation
Anindyadeep Sannigrahi
Data science engineering at PremAI-io
November 2023
Scenarios, Tasks, Benchmark Datasets and Metrics
A Scenario is a specific context, setting, or condition under which an LLM's performance is assessed and tested.
Example: question answering, summarization, or sentiment analysis, as covered by HELM's core scenarios.
A Task is a more granular form of a scenario. It specifies exactly what the LLM is evaluated on.
Example: multiple-choice science questions (ARC) or commonsense sentence completion (HellaSwag), as defined in the LM Evaluation Harness.
Scenarios, Tasks, Benchmark Datasets and Metrics
Tasks from LM Evaluation Harness
Scenarios from HELM
Another taxonomy, organised by evaluation dimensions, from OpenCompass
A Metric is a quantitative or qualitative measure used to evaluate the performance of a language model on a given task or scenario.
Scenarios, Tasks, Benchmark Datasets and Metrics
A metric can be either a simple statistical measure (accuracy, exact match, BLEU) or a more involved model-based or human-judged measure.
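For illustration, here is a minimal Python sketch of one such simple metric, exact-match accuracy; the predictions and references are made-up toy data:

```python
# A toy illustration of a "simple" metric: exact-match accuracy.
# The predictions/references below are made-up examples.
predictions = ["Paris", "4", "blue whale"]
references = ["Paris", "5", "Blue Whale"]

def exact_match(pred: str, ref: str) -> bool:
    # Normalise case and surrounding whitespace before comparing.
    return pred.strip().lower() == ref.strip().lower()

accuracy = sum(exact_match(p, r) for p, r in zip(predictions, references)) / len(references)
print(f"Exact-match accuracy: {accuracy:.2f}")  # 0.67
```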
A brief overview of metrics
A benchmark dataset is a standardised collection of test examples used to evaluate LLMs on a given task or scenario. Example: ARC, HellaSwag, MMLU, TruthfulQA.
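As a quick sketch, such a benchmark can be pulled with the HuggingFace datasets library; the dataset id and field names below follow the HellaSwag entry on the Hub and may differ for other benchmarks:

```python
from datasets import load_dataset

# Load the HellaSwag validation split, the split typically used for evaluation.
ds = load_dataset("hellaswag", split="validation")

print(len(ds))           # number of evaluation examples
print(ds[0]["ctx"])      # context the model must complete
print(ds[0]["endings"])  # candidate endings; the "label" field marks the correct one
```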
Scenarios, Tasks, Benchmark Datasets and Metrics
Currently popular LLM evaluation frameworks
LM Evaluation Harness
BigCode Evaluation Harness
Evaluation libraries/platforms for LLM applications and systems
Paid platforms for LLM applications and systems
GetScoreCard AI
LM Evaluation Harness
An LLM evaluation package developed by EleutherAI. It provides a single framework for evaluating and reporting on auto-regressive language models across a wide range of NLU tasks.
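A minimal sketch of driving the harness from Python (assuming `pip install lm-eval`; the model-type string and some keyword arguments have changed across harness versions, so treat the exact names below as assumptions and check the project README):

```python
from lm_eval import evaluator

# Evaluate a HuggingFace causal LM on a few standard tasks.
results = evaluator.simple_evaluate(
    model="hf-causal",                                   # HF backend (named "hf" in newer versions)
    model_args="pretrained=mistralai/Mistral-7B-v0.1",   # any model id from the Hub
    tasks=["arc_challenge", "hellaswag"],                # task names as registered in the harness
    num_fewshot=0,
    batch_size=8,
    limit=100,  # evaluate on a small subset first to sanity-check the setup
)

print(results["results"])  # per-task metrics, e.g. accuracy / normalised accuracy
```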
What is Lit-GPT?
Lit-GPT by Lightning AI is a hackable implementation of state-of-the-art LLMs, built on PyTorch Lightning and Lightning Fabric.
Getting Started with quick evaluation using Lit-GPT and LM Evaluation Harness
Let's take a look at some results
Model   | Size (B) | Average | ARC   | HellaSwag | MMLU  | TruthfulQA
Llama 2 | 7        | 54.31   | 53.16 | 78.48     | 46.63 | 38.98
Mistral | 7        | 62.40   | 59.98 | 83.31     | 64.16 | 42.15
Falcon  | 180      | 68.74   | 69.80 | 88.95     | 70.54 | 45.67
For more results and comparisons between models, check out hf.co/open_llm_leaderboard
Some other popular leaderboard platforms
References and some awesome resources on LLM eval
Anindyadeep Sannigrahi
discord/UpBeatCode