
The current scenario of LLM evaluation

Anindyadeep Sannigrahi

Data science engineering at PremAI-io

November 2023


Scenarios, Tasks, Benchmark Datasets and Metrics

A Scenario is a specific context, setting, or condition under which an LLM’s performance is assessed and tested.

Examples:

  • Question Answering
  • Reasoning
  • Machine Translation
  • Text Generation and Natural Language Understanding


A Task is a more granular form of a scenario: it specifies exactly what the LLM is evaluated on.

Examples:

  • Multiple-choice math questions in algebra
  • Newsletter summarization


Tasks from LM Evaluation Harness
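
The harness ships with a large task registry that can also be inspected from Python. A minimal sketch, assuming a 2023-era (v0.3.x) install of lm-evaluation-harness where lm_eval.tasks exposes ALL_TASKS (later releases reorganised this API):

    # List the tasks registered in EleutherAI's lm-evaluation-harness.
    # Assumes a v0.3.x-era install; newer versions moved the registry elsewhere.
    from lm_eval import tasks

    print(f"{len(tasks.ALL_TASKS)} registered tasks")
    print(tasks.ALL_TASKS[:10])  # first few registered task names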


Scenarios from HELM


Another type of taxonomy (dimensions) by OpenCompass


A Metric is a quantitative measure used to evaluate the performance of a language model on a given task or scenario.


A metric can be (a short code sketch follows this list):

  • a simple deterministic statistical function (e.g., Accuracy),
  • a score from an ML/DL model (e.g., BERTScore), or
  • an evaluation performed by a GPT-like LLM (e.g., G-Eval).
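
As an illustration (not from the slides), here is a minimal sketch of the first two metric types using the Hugging Face evaluate library; the toy predictions and references are made up:

    # A deterministic metric (accuracy) vs. a model-based one (BERTScore),
    # computed with the Hugging Face `evaluate` library
    # (pip install evaluate bert_score).
    import evaluate

    # Deterministic statistical metric: fraction of correct label predictions.
    accuracy = evaluate.load("accuracy")
    print(accuracy.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0]))
    # -> {'accuracy': 0.75}

    # Model-based metric: BERTScore compares generated text against a reference
    # using contextual embeddings from a pretrained encoder.
    bertscore = evaluate.load("bertscore")
    print(bertscore.compute(
        predictions=["The cat sat on the mat."],
        references=["A cat was sitting on the mat."],
        lang="en",
    ))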


A brief overview of metrics


A Benchmark Dataset is a standardised collection of test examples used to evaluate LLMs on a given task or scenario. Examples (a loading sketch follows this list):

  • SQuAD for question answering
  • GLUE for natural language understanding and Q&A
  • IMDB for sentiment analysis
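
These benchmark datasets can be pulled directly from the Hugging Face Hub; a minimal sketch, assuming the standard Hub dataset IDs ("squad", "glue", "imdb"):

    # Load the benchmark datasets mentioned above from the Hugging Face Hub
    # (pip install datasets).
    from datasets import load_dataset

    squad = load_dataset("squad")        # question answering
    sst2 = load_dataset("glue", "sst2")  # one GLUE task (sentiment / NLU)
    imdb = load_dataset("imdb")          # sentiment analysis

    print(squad)                               # DatasetDict with train/validation splits
    print(squad["validation"][0]["question"])  # one QA example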


Current Popular LLM evaluation frameworks

  • LM Evaluation Harness
  • BigCode Evaluation Harness


Evaluation libraries/platforms for LLM applications and systems


Paid platforms for LLM applications and systems

GetScoreCard AI


LM Evaluation Harness

An LLM evaluation package developed by EleutherAI. It provides a single framework for evaluating and reporting autoregressive language models on various NLU tasks; a minimal usage sketch follows.
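
A minimal sketch of calling the harness from Python, assuming the 2023-era v0.3.x API where evaluator.simple_evaluate takes a backend name and a model_args string; the GPT-2 checkpoint and HellaSwag task are chosen purely for illustration:

    # Evaluate a Hugging Face causal LM on HellaSwag with lm-evaluation-harness.
    # Assumes a v0.3.x-era install; model and task names are illustrative.
    from lm_eval import evaluator

    results = evaluator.simple_evaluate(
        model="hf-causal",             # Hugging Face AutoModelForCausalLM backend
        model_args="pretrained=gpt2",  # any HF checkpoint id works here
        tasks=["hellaswag"],
        num_fewshot=0,
        batch_size=8,
        device="cuda:0",               # or "cpu"
    )
    print(results["results"]["hellaswag"])  # e.g. acc / acc_norm for the task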


Getting Started with quick evaluation using Lit-GPT and LM Evaluation Harness

What is Lit-GPT?

Lit-GPT by Lightning AI is a hackable implementation of SoTA LLMs built with PyTorch Lightning and Lightning Fabric.


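This slide walked through the flow live; as a rough sketch of the same idea (the script paths, flag names, and checkpoint id below are assumptions based on the late-2023 Lit-GPT repository layout and its eval/lm_eval_harness.py wrapper, so verify against the repo's evaluation tutorial before running):

    # Rough sketch: download a checkpoint, convert it to Lit-GPT format,
    # then run the lm-eval-harness wrapper bundled with Lit-GPT.
    # All script paths and flags below are assumptions from the late-2023 repo layout.
    import subprocess

    repo_id = "mistralai/Mistral-7B-v0.1"   # illustrative; any Lit-GPT-supported checkpoint
    ckpt_dir = f"checkpoints/{repo_id}"

    # 1) Download the Hugging Face weights into checkpoints/<repo_id>
    subprocess.run(["python", "scripts/download.py", "--repo_id", repo_id], check=True)

    # 2) Convert them into Lit-GPT's checkpoint format
    subprocess.run(
        ["python", "scripts/convert_hf_checkpoint.py", "--checkpoint_dir", ckpt_dir],
        check=True,
    )

    # 3) Evaluate with the harness wrapper on a couple of tasks
    subprocess.run(
        [
            "python", "eval/lm_eval_harness.py",
            "--checkpoint_dir", ckpt_dir,
            "--precision", "bf16-true",
            "--eval_tasks", "[hellaswag,arc_easy]",
            "--save_filepath", "results.json",
        ],
        check=True,
    )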


Let’s take a look at some results:

Model    | Size (in B) | Average | ARC   | HellaSwag | MMLU  | TruthfulQA
Llama 2  | 7           | 54.31   | 53.16 | 78.48     | 46.63 | 38.98
Mistral  | 7           | 62.4    | 59.98 | 83.31     | 64.16 | 42.15
Falcon   | 180         | 68.74   | 69.8  | 88.95     | 70.54 | 45.67

For more results and comparisons between models, check out hf.co/open_llm_leaderboard


Some other popular leaderboard platforms


References and some awesome resources on LLM eval


Anindyadeep Sannigrahi

discord/UpBeatCode
