TIGERSCORE: TOWARDS BUILDING EXPLAINABLE METRIC FOR ALL TEXT GENERATION TASKS
Presenter: Dongfu Jiang
First-year PhD student advised by Prof. Wenhu Chen
David R. Cheriton School of Computer Science
2/16/2024
Authors: Dongfu Jiang, Yishan Li, Ge Zhang, Wenhao Huang, Bill Yuchen Lin, Wenhu Chen
Outline
TIGERScore: Towards building Explainable Metric for All Text Generation Tasks
PAGE 2
Introduction: Motivation
Introduction: Current Issues
Introduction: Universal Metric
What is a good universal metric?
[Diagram] Metric: (input, output, reference) -> eval results
Tasks: summarization, translation, data2text, long-form QA, instruction-following, math
Properties: Reference-free, Multiple tasks, Explainability
Eval results: overall rating, natural language explanations
TIGERScore
Design Principles:
TIGERScore: Reference-Free
No reference required
TIGERScore: Driven by Instructions
Each instruction represents a different task
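A minimal sketch of what "driven by instructions" and "no reference required" could look like in practice: the metric's input is assembled from the task instruction, the task input, and the system output only. The function name `build_metric_prompt`, its field labels, and the closing request line are hypothetical and are not the template used by TIGERScore.

```python
def build_metric_prompt(instruction: str, task_input: str, system_output: str) -> str:
    """Assemble a reference-free evaluation prompt (hypothetical template).

    The instruction identifies the task, so the same function serves
    summarization, translation, data2text, and so on. No gold reference
    is passed in anywhere.
    """
    return (
        f"Instruction: {instruction}\n"
        f"Input: {task_input}\n"
        f"Model output: {system_output}\n"
        "List the errors found in the model output."
    )


# Two different tasks differ only in the instruction they supply.
summarization_prompt = build_metric_prompt(
    "Summarize the following article.",
    "The city council met on Monday to discuss the new budget.",
    "The council discussed the budget.",
)
translation_prompt = build_metric_prompt(
    "Translate the following sentence into German.",
    "Good morning.",
    "Guten Morgen.",
)
```

Switching tasks changes only the instruction string, which is the point of the design principle on this slide.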
TIGERScore: Self-Explainable
Error analysis and explanations
TIGERScore: Multi-aspect Evaluation
Error analysis and explanations
TIGERScore: Penalty-Scoring Systems
The final overall score is the sum of all the penalties
TIGERScore: Structured Analysis Output
Each error consists of: Location, Aspect, Explanation, Penalty Score
The overall analysis output is: multiple identified errors + final score
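The structure described on these slides can be sketched as a small data model: each identified error carries a location, aspect, explanation, and penalty score, and the final score is the sum of the penalties. The class and field names are illustrative, and the sign convention (penalties stored as negative numbers, so an error-free output scores 0) is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IdentifiedError:
    location: str     # where in the output the error occurs
    aspect: str       # which evaluation aspect the error violates
    explanation: str  # natural language explanation of the error
    penalty: float    # penalty score (assumed negative in this sketch)


@dataclass
class Analysis:
    errors: List[IdentifiedError] = field(default_factory=list)

    @property
    def final_score(self) -> float:
        # The final overall score is the sum of all the penalties.
        return sum(e.penalty for e in self.errors)


# Hypothetical example values, for illustration only.
example = Analysis(
    errors=[
        IdentifiedError("sentence 1", "Accuracy", "A date is stated incorrectly.", -2.0),
        IdentifiedError("sentence 3", "Fluency", "The phrasing is ungrammatical.", -0.5),
    ]
)
print(example.final_score)
```

With no identified errors the sum is empty, so the final score is 0.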
MetricInstruct: Curation Pipeline
Where does the training data come from?
[Pipeline diagram] Labels: Inputs, Sampling, Systems, Outputs, Pre-defined aspects, Prompt Template, Query, GPT-4, Error Analysis, Collect and Filter, Fine-tune, TIGERScore
MetricInstruct: Properties
MetricInstruct: Data distributions
Experiments: Training settings
Experiments: Testing datasets
Experiments: Correlation Results
Experiments: Human Evaluation
Experiments: Ablation Study
Does two-channel mixing work?
Does multi-task learning help?
Community Usage
Conclusions
Thanks! Questions?