
TIGERSCORE: TOWARDS BUILDING EXPLAINABLE METRIC FOR ALL TEXT GENERATION TASKS

Presenter: Dongfu Jiang
First-year PhD student advised by Prof. Wenhu Chen
David R. Cheriton School of Computer Science

2/16/2024

Authors: Dongfu Jiang, Yishan Li, Ge Zhang, Wenhao Huang, Bill Yuchen Lin, Wenhu Chen


Outline

  • Introduction
  • TIGERScore
  • MetricInstruct
  • Experiments
  • Community Usage
  • Conclusions


Introduction: Motivation

  • How well can LLMs evaluate LLMs?
  • GPT-4 has proven to be a strong judge/evaluator for various tasks
  • Open-source LLMs lag behind in evaluation ability
  • Existing metrics like BERTScore and BARTScore are not suitable for evaluating LLMs


Introduction: Current Issues

  • Dependency on references:
    • ROUGE, BLEU, COMET, InstructScore
  • Limited to specific domains:
    • COMET for translation
    • BLEURT for translation, data2text
  • Lack of Attributions:
    • BARTScore, BERTScore, GPTScore, etc. only output an opaque number as the score


Introduction: Universal Metric

What is a good universal metric?


[Figure: what makes a good universal metric. A conventional metric maps (input, output, reference) to eval results. The desired metric is reference-free, handles multiple tasks (summarization, translation, data2text, long-form QA, instruction-following, math), and is explainable, producing an overall rating plus natural-language explanations.]


TIGERScore

Design Principles:

  • Reference-Free
  • Driven by Instructions (Multiple Tasks)
  • Self-explainable
  • Multi-aspect Evaluation
  • Penalty-scoring System
  • Structured Analysis Output


TIGERScore: Reference-Free


No reference required
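
A minimal sketch of reference-free scoring via Hugging Face transformers; the checkpoint id and the prompt wording below are assumptions paraphrasing the released format, not the exact training template:

# Hedged sketch: score a system output with no reference at all.
# The model id and prompt wording are assumptions, not the exact template.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "TIGER-Lab/TIGERScore-7B"  # assumed Hugging Face model id
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")

prompt = (
    "Instruction: Summarize the following article.\n"
    "Input: <source article>\n"
    "System output: <candidate summary>\n"
    "Analyze the errors in the system output and assign a penalty to each."
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(tok.decode(out[0], skip_special_tokens=True))  # error analysis + score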


TIGERScore: Driven by Instructions


Each instruction represents a different task


TIGERScore: Self-Explainable


Error analysis and explanations


TIGERScore: Multi-aspect Evaluation





TIGERScore: Penalty-Scoring System


The final overall score is the sum of all penalty scores.
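
A toy illustration of the summation, with made-up aspects and penalty values:

# Toy penalty-scoring example; aspects and values are illustrative.
errors = [
    {"aspect": "Accuracy", "penalty": -4.0},  # major factual error
    {"aspect": "Fluency", "penalty": -0.5},   # minor grammatical slip
]
overall_score = sum(e["penalty"] for e in errors)
print(overall_score)  # -4.5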


TIGERScore: Structured Analysis Output


Each error consists of: Location, Aspect, Explanation, Penalty Score

The overall analysis output is: multiple identified errors + final score
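
Pictured as a small schema (illustrative, not the model's literal output format):

from dataclasses import dataclass

@dataclass
class Error:
    location: str     # the erroneous span in the system output
    aspect: str       # e.g. "Accuracy", "Fluency"
    explanation: str  # natural-language justification
    penalty: float    # score reduction for this error

@dataclass
class Analysis:
    errors: list[Error]
    final_score: float  # sum of all penalties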


MetricInstruct: Curation Pipeline

Where does the training data come from?


[Figure: MetricInstruct curation pipeline. Inputs are sampled and run through many systems to collect outputs; a prompt template with pre-defined aspects is used to query GPT-4 for error analysis; the analyses are collected and filtered, and the filtered data is used to fine-tune TIGERScore.]


MetricInstruct: Properties

  • Dataset diversity: 23 distinct datasets across 6 general NLG tasks
  • Error coverage: outputs from 50+ systems are collected


MetricInstruct: Properties

  • Quality assurance: various heuristics filter out low-quality data (sketched below):
    • Error location mismatch
    • Illogical severity labels
    • Excessively long outputs
    • Reference-leaking explanations such as “based on the reference, it should be … instead of …”
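
A hedged sketch of what these filters might look like; the field names and thresholds are assumptions, not the released filtering code:

def keep(example: dict) -> bool:
    # Illustrative filters; field names and thresholds are assumptions.
    output, errors = example["output"], example["errors"]
    if any(err["location"] not in output for err in errors):
        return False  # error location mismatch
    if any(err["penalty"] > 0 for err in errors):
        return False  # illogical severity label (penalties should be negative)
    if len(output.split()) > 1024:
        return False  # excessively long output
    if "based on the reference" in example["analysis"].lower():
        return False  # explanation leaks the reference
    return True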


MetricInstruct: Properties

  • Two-channel curation:
    • Real-world channel:
      • system outputs generated by 50+ models
    • Synthetic channel:
      • system outputs synthesized by GPT-4
    • Ensures the system outputs contain more diverse error types
    • 32k examples in total after filtering


MetricInstruct: Properties

  • Free-form aspects:
    • Another 10k examples sampled from alpaca-eval
    • Encourages free-form aspect generation
    • Avoids overfitting to a fixed set of aspects


MetricInstruct: Data distributions

  • Content length capped at 1024 tokens
  • Balanced number of instances with k errors (see the sketch below)
  • Balanced error severities
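
One plausible balancing step, sketched under an assumed "errors" field:

import random
from collections import defaultdict

def balance_by_error_count(examples, per_bucket):
    # Downsample so each k-error bucket contributes at most per_bucket examples.
    buckets = defaultdict(list)
    for ex in examples:
        buckets[len(ex["errors"])].append(ex)  # "errors" field is assumed
    balanced = []
    for bucket in buckets.values():
        random.shuffle(bucket)
        balanced.extend(bucket[:per_bucket])
    return balanced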


Experiments: Training settings

  • Backbone: Llama-2-7B and Llama-2-13B
  • Context length: 1024; batch size: 128 (see the config sketch below)
  • 7B:
    • 4× 80GB A100 GPUs, 3 epochs, learning rate 2e-5
  • 13B:
    • 8× 80GB A100 GPUs, 2 epochs, learning rate 1e-5
  • Inference:
    • vLLM used for acceleration during testing
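
The stated 7B settings, collected as a hypothetical Python config; the micro-batch/gradient-accumulation split is an assumption:

# Hypothetical config mirroring the stated 7B settings.
train_config_7b = {
    "base_model": "meta-llama/Llama-2-7b-hf",
    "context_length": 1024,
    "global_batch_size": 128,  # e.g. 4 GPUs x micro-batch 4 x grad-accum 8 (assumed split)
    "num_gpus": 4,             # 80GB A100s
    "num_epochs": 3,
    "learning_rate": 2e-5,
}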


Experiments: Testing datasets

  • 5 held-in and 2 held-out test datasets
  • Correlation metrics (computed as in the sketch below):
    • Pearson
    • Spearman
    • Kendall
  • Gold ratings come from:
    • Official human ratings of each dataset
    • GPT-4 scores
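
These coefficients can be computed directly with scipy.stats; the numbers below are toy values for illustration:

from scipy.stats import kendalltau, pearsonr, spearmanr

metric_scores = [-4.5, -1.0, 0.0, -7.5]  # toy metric outputs
human_ratings = [2, 4, 5, 1]             # toy gold ratings

r, _ = pearsonr(metric_scores, human_ratings)
rho, _ = spearmanr(metric_scores, human_ratings)
tau, _ = kendalltau(metric_scores, human_ratings)
print(r, rho, tau)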


Experiments: Correlation Results

  • Best overall correlation across all compared metrics
  • Best on all tasks among reference-free metrics
  • Best on 5 of 7 tasks compared with reference-based metrics
  • Performance approaching zero-shot GPT-4


Experiments: Human Evaluation

  • 50 samples from each test dataset
  • 70.8% of error analyses received positive ratings (3–5)
  • 64.3% of the explanations are correct


Experiments: Ablation Study


Does mixing the two data channels help?

Does multi-task learning bring benefits?


Community Usage


Conclusions

  • Takeaways:
    • We release the TIGERScore models and the MetricInstruct dataset
    • A new pipeline and a series of heuristics for curating MetricInstruct
    • Overall best correlation performance compared with other metrics
  • Limitations:
    • Still sometimes generates hallucinated errors or incorrect explanations
    • Not good at evaluating reasoning tasks like MathQA
  • Potential use cases:
    • Large-scale, explainable real-world evaluation
    • Serving as a fine-grained reward model to improve LLMs in turn (future work)


Thanks! Questions?
