
The current scenario of LLM evaluation

Anindyadeep Sannigrahi

Data science engineering at PremAI-io

November 2023


Scenarios, Tasks, Benchmark Datasets and Metrics

A Scenario is a specific context, setting, or condition under which an LLM’s performance is assessed and tested.

Examples:

  • Question Answering
  • Reasoning
  • Machine Translation
  • Text Generation and Natural Language Understanding


A Task is a more granular form of a scenario: it specifies exactly what the LLM is evaluated on.

Examples:

  • Multiple-choice math questions in algebra
  • Newsletter summarization


Tasks from LM Evaluation Harness
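
The harness ships with a large task registry that can also be inspected from Python. A minimal sketch, assuming a 2023-era (v0.3.x) install of lm-evaluation-harness where lm_eval.tasks exposes ALL_TASKS (later releases reorganised this API):

    # List the tasks registered in EleutherAI's lm-evaluation-harness.
    # Assumes a v0.3.x-era install; newer versions moved the registry elsewhere.
    from lm_eval import tasks

    print(f"{len(tasks.ALL_TASKS)} registered tasks")
    print(tasks.ALL_TASKS[:10])  # first few registered task names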


Scenarios from HELM


Another type of taxonomy (dimensions) by OpenCompass


A Metric is a quantitative measure used to evaluate the performance of a language model on a given task or scenario.


A metric can be (a short code sketch follows this list):

  • a simple deterministic statistical function (e.g., Accuracy),
  • a score from an ML/DL model (e.g., BERTScore), or
  • an evaluation performed by a GPT-like LLM (e.g., G-Eval).
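
As an illustration (not from the slides), here is a minimal sketch of the first two metric types using the Hugging Face evaluate library; the toy predictions and references are made up:

    # A deterministic metric (accuracy) vs. a model-based one (BERTScore),
    # computed with the Hugging Face `evaluate` library
    # (pip install evaluate bert_score).
    import evaluate

    # Deterministic statistical metric: fraction of correct label predictions.
    accuracy = evaluate.load("accuracy")
    print(accuracy.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0]))
    # -> {'accuracy': 0.75}

    # Model-based metric: BERTScore compares generated text against a reference
    # using contextual embeddings from a pretrained encoder.
    bertscore = evaluate.load("bertscore")
    print(bertscore.compute(
        predictions=["The cat sat on the mat."],
        references=["A cat was sitting on the mat."],
        lang="en",
    ))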


A brief overview of metrics


A Benchmark Dataset is a standardised collection of test examples used to evaluate LLMs on a given task or scenario. Examples (a loading sketch follows this list):

  • SQuAD for question answering
  • GLUE for natural language understanding and Q&A
  • IMDB for sentiment analysis
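
These benchmark datasets can be pulled directly from the Hugging Face Hub; a minimal sketch, assuming the standard Hub dataset IDs ("squad", "glue", "imdb"):

    # Load the benchmark datasets mentioned above from the Hugging Face Hub
    # (pip install datasets).
    from datasets import load_dataset

    squad = load_dataset("squad")        # question answering
    sst2 = load_dataset("glue", "sst2")  # one GLUE task (sentiment / NLU)
    imdb = load_dataset("imdb")          # sentiment analysis

    print(squad)                               # DatasetDict with train/validation splits
    print(squad["validation"][0]["question"])  # one QA example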


Current Popular LLM evaluation frameworks

  • LM Evaluation Harness
  • BigCode Evaluation Harness


Evaluation libraries/platforms for LLM applications and systems


Paid platforms for LLM applications and systems

GetScoreCard AI


LM Evaluation Harness

An LLM evaluation package developed by EleutherAI. It provides a single framework for evaluating and reporting autoregressive language models on various NLU tasks; a minimal usage sketch follows.
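
A minimal sketch of calling the harness from Python, assuming the 2023-era v0.3.x API where evaluator.simple_evaluate takes a backend name and a model_args string; the GPT-2 checkpoint and HellaSwag task are chosen purely for illustration:

    # Evaluate a Hugging Face causal LM on HellaSwag with lm-evaluation-harness.
    # Assumes a v0.3.x-era install; model and task names are illustrative.
    from lm_eval import evaluator

    results = evaluator.simple_evaluate(
        model="hf-causal",             # Hugging Face AutoModelForCausalLM backend
        model_args="pretrained=gpt2",  # any HF checkpoint id works here
        tasks=["hellaswag"],
        num_fewshot=0,
        batch_size=8,
        device="cuda:0",               # or "cpu"
    )
    print(results["results"]["hellaswag"])  # e.g. acc / acc_norm for the task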


Getting Started with quick evaluation using Lit-GPT and LM Evaluation Harness

What is Lit-GPT?

Lit-GPT by Lightning AI is a hackable implementation of SoTA LLMs built with PyTorch Lightning and Lightning Fabric.


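This slide walked through the flow live; as a rough sketch of the same idea (the script paths, flag names, and checkpoint id below are assumptions based on the late-2023 Lit-GPT repository layout and its eval/lm_eval_harness.py wrapper, so verify against the repo's evaluation tutorial before running):

    # Rough sketch: download a checkpoint, convert it to Lit-GPT format,
    # then run the lm-eval-harness wrapper bundled with Lit-GPT.
    # All script paths and flags below are assumptions from the late-2023 repo layout.
    import subprocess

    repo_id = "mistralai/Mistral-7B-v0.1"   # illustrative; any Lit-GPT-supported checkpoint
    ckpt_dir = f"checkpoints/{repo_id}"

    # 1) Download the Hugging Face weights into checkpoints/<repo_id>
    subprocess.run(["python", "scripts/download.py", "--repo_id", repo_id], check=True)

    # 2) Convert them into Lit-GPT's checkpoint format
    subprocess.run(
        ["python", "scripts/convert_hf_checkpoint.py", "--checkpoint_dir", ckpt_dir],
        check=True,
    )

    # 3) Evaluate with the harness wrapper on a couple of tasks
    subprocess.run(
        [
            "python", "eval/lm_eval_harness.py",
            "--checkpoint_dir", ckpt_dir,
            "--precision", "bf16-true",
            "--eval_tasks", "[hellaswag,arc_easy]",
            "--save_filepath", "results.json",
        ],
        check=True,
    )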


Let’s take a look at some results:

Model    | Size (in B) | Average | ARC   | HellaSwag | MMLU  | TruthfulQA
Llama 2  | 7           | 54.31   | 53.16 | 78.48     | 46.63 | 38.98
Mistral  | 7           | 62.4    | 59.98 | 83.31     | 64.16 | 42.15
Falcon   | 180         | 68.74   | 69.8  | 88.95     | 70.54 | 45.67

For more results and comparisons between models, check out hf.co/open_llm_leaderboard


Some other popular leaderboard platforms


References and some awesome resources on LLM eval


Anindyadeep Sannigrahi

discord/UpBeatCode
