
TIGERSCORE: TOWARDS BUILDING EXPLAINABLE METRIC FOR ALL TEXT GENERATION TASKS

Presenter: Dongfu Jiang
First-year PhD student advised by Prof. Wenhu Chen
David R. Cheriton School of Computer Science

2/16/2024

Authors: Dongfu Jiang, Yishan Li, Ge Zhang, Wenhao Huang, Bill Yuchen Lin, Wenhu Chen


Outline

  • Introduction
  • TIGERScore
  • MetricInstruct
  • Experiments
  • Community Usage
  • Conclusions


Introduction: Motivation

  • How well can LLMs evaluate LLMs?
  • GPT-4 has proven to be a strong judge/evaluator for various tasks
  • Open-source LLMs lag behind in evaluation ability
  • Existing metrics like BERTScore and BARTScore are not suitable for evaluating LLMs


Introduction: Current Issues

  • Dependency on references:
    • ROUGE, BLEU, COMET, InstructScore
  • Limited to specific domains:
    • COMET for translation
    • BLEURT for translation, data2text
  • Lack of Attributions:
    • BARTScore, BERTScore, GPTScore, etc. only output an opaque number as the score


Introduction: Universal Metric

What is a good universal metric?


[Figure: what makes a good universal metric. A conventional metric maps (input, output, reference) to eval results. The desired metric is reference-free, handles multiple tasks (summarization, translation, data2text, long-form QA, instruction-following, math), and is explainable, producing an overall rating plus natural-language explanations.]


TIGERScore

Design Principles:

  • Reference-Free
  • Driven by Instructions (Multiple Tasks)
  • Self-explainable
  • Multi-aspect Evaluation
  • Penalty-scoring System
  • Structured Analysis Output


TIGERScore: Reference-Free


No reference required
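
A minimal sketch of reference-free scoring via Hugging Face transformers; the checkpoint id and the prompt wording below are assumptions paraphrasing the released format, not the exact training template:

# Hedged sketch: score a system output with no reference at all.
# The model id and prompt wording are assumptions, not the exact template.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "TIGER-Lab/TIGERScore-7B"  # assumed Hugging Face model id
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")

prompt = (
    "Instruction: Summarize the following article.\n"
    "Input: <source article>\n"
    "System output: <candidate summary>\n"
    "Analyze the errors in the system output and assign a penalty to each."
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(tok.decode(out[0], skip_special_tokens=True))  # error analysis + score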


TIGERScore: Driven by Instructions


Each instruction represents a different task


TIGERScore: Self-Explainable


Error analysis and explanations


TIGERScore: Multi-aspect Evaluation





TIGERScore: Penalty-Scoring System


The final overall score is the sum of all penalty scores.
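
A toy illustration of the summation, with made-up aspects and penalty values:

# Toy penalty-scoring example; aspects and values are illustrative.
errors = [
    {"aspect": "Accuracy", "penalty": -4.0},  # major factual error
    {"aspect": "Fluency", "penalty": -0.5},   # minor grammatical slip
]
overall_score = sum(e["penalty"] for e in errors)
print(overall_score)  # -4.5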


TIGERScore: Structured Analysis Output


Each error consists of: Location, Aspect, Explanation, Penalty Score

The overall analysis output is: multiple identified errors + final score
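
Pictured as a small schema (illustrative, not the model's literal output format):

from dataclasses import dataclass

@dataclass
class Error:
    location: str     # the erroneous span in the system output
    aspect: str       # e.g. "Accuracy", "Fluency"
    explanation: str  # natural-language justification
    penalty: float    # score reduction for this error

@dataclass
class Analysis:
    errors: list[Error]
    final_score: float  # sum of all penalties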


MetricInstruct: Curation Pipeline

Where does the training data come from?


[Figure: MetricInstruct curation pipeline. Inputs are sampled and run through many systems to collect outputs; a prompt template with pre-defined aspects is used to query GPT-4 for error analysis; the analyses are collected and filtered, and the filtered data is used to fine-tune TIGERScore.]


MetricInstruct: Properties

  • Dataset diversity: 23 distinct datasets across 6 general NLG tasks
  • Error coverage: outputs from 50+ systems are collected


MetricInstruct: Properties

  • Quality assurance: various heuristics filter out low-quality data (sketched below):
    • Error location mismatch
    • Illogical severity labels
    • Excessively long outputs
    • Reference-leaking explanations such as “based on the reference, it should be … instead of …”
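
A hedged sketch of what these filters might look like; the field names and thresholds are assumptions, not the released filtering code:

def keep(example: dict) -> bool:
    # Illustrative filters; field names and thresholds are assumptions.
    output, errors = example["output"], example["errors"]
    if any(err["location"] not in output for err in errors):
        return False  # error location mismatch
    if any(err["penalty"] > 0 for err in errors):
        return False  # illogical severity label (penalties should be negative)
    if len(output.split()) > 1024:
        return False  # excessively long output
    if "based on the reference" in example["analysis"].lower():
        return False  # explanation leaks the reference
    return True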


MetricInstruct: Properties

  • Two-channel curation:
    • Real-world channel:
      • system outputs generated by 50+ models
    • Synthetic channel:
      • system outputs synthesized by GPT-4
    • Ensures the system outputs contain more diverse error types
    • 32k examples in total after filtering


MetricInstruct: Properties

  • Free-form aspects:
    • Another 10k examples sampled from alpaca-eval
    • Encourages free-form aspect generation
    • Avoids overfitting to a fixed set of aspects


MetricInstruct: Data distributions

  • Content length capped at 1024 tokens
  • Balanced number of instances with k errors (see the sketch below)
  • Balanced error severities
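
One plausible balancing step, sketched under an assumed "errors" field:

import random
from collections import defaultdict

def balance_by_error_count(examples, per_bucket):
    # Downsample so each k-error bucket contributes at most per_bucket examples.
    buckets = defaultdict(list)
    for ex in examples:
        buckets[len(ex["errors"])].append(ex)  # "errors" field is assumed
    balanced = []
    for bucket in buckets.values():
        random.shuffle(bucket)
        balanced.extend(bucket[:per_bucket])
    return balanced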


Experiments: Training settings

  • Backbone: Llama-2-7B and Llama-2-13B
  • Context length: 1024; batch size: 128 (see the config sketch below)
  • 7B:
    • 4× 80GB A100 GPUs, 3 epochs, learning rate 2e-5
  • 13B:
    • 8× 80GB A100 GPUs, 2 epochs, learning rate 1e-5
  • Inference:
    • vLLM used for acceleration during testing
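
The stated 7B settings, collected as a hypothetical Python config; the micro-batch/gradient-accumulation split is an assumption:

# Hypothetical config mirroring the stated 7B settings.
train_config_7b = {
    "base_model": "meta-llama/Llama-2-7b-hf",
    "context_length": 1024,
    "global_batch_size": 128,  # e.g. 4 GPUs x micro-batch 4 x grad-accum 8 (assumed split)
    "num_gpus": 4,             # 80GB A100s
    "num_epochs": 3,
    "learning_rate": 2e-5,
}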


Experiments: Testing datasets

  • 5 held-in and 2 held-out test datasets
  • Correlation metrics (computed as in the sketch below):
    • Pearson
    • Spearman
    • Kendall
  • Gold ratings come from:
    • Official human ratings of each dataset
    • GPT-4 scores
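
These coefficients can be computed directly with scipy.stats; the numbers below are toy values for illustration:

from scipy.stats import kendalltau, pearsonr, spearmanr

metric_scores = [-4.5, -1.0, 0.0, -7.5]  # toy metric outputs
human_ratings = [2, 4, 5, 1]             # toy gold ratings

r, _ = pearsonr(metric_scores, human_ratings)
rho, _ = spearmanr(metric_scores, human_ratings)
tau, _ = kendalltau(metric_scores, human_ratings)
print(r, rho, tau)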


Experiments: Correlation Results

  • Best overall correlation across all compared metrics
  • Best on all tasks among reference-free metrics
  • Best on 5 of 7 tasks compared with reference-based metrics
  • Performance approaching zero-shot GPT-4


Experiments: Human Evaluation

  • 50 samples from each test dataset
  • 70.8% of error analyses received positive ratings (3–5)
  • 64.3% of the explanations are correct


Experiments: Ablation Study


Does mixing the two data channels help?

Does multi-task learning bring benefits?


Community Usage


Conclusions

  • Takeaways:
    • We release the TIGERScore models and the MetricInstruct dataset
    • A new pipeline and a series of heuristics for curating MetricInstruct
    • Overall best correlation performance compared with other metrics
  • Limitations:
    • Still sometimes generates hallucinated errors or incorrect explanations
    • Not good at evaluating reasoning tasks like MathQA
  • Potential use cases:
    • Large-scale, explainable real-world evaluation
    • Serving as a fine-grained reward model to improve LLMs in turn (future work)


Thanks! Questions?
