Evaluation of Information Access Systems in the Generative Era
Negar Arabzadeh
University of Waterloo
University of California, Berkeley
Summer 2024
Narabzad@uwaterloo.ca
https://www.negara.me/
Introduction
Recommender Systems
Social Media
Search Engines
Generative Models
Evaluation of Information Access Systems
How do we provide environments where correct information is available, identifiable, and accessible?
Outline
Evaluation of understanding grounded language
Evaluation of generative IR systems
Evaluation of LLM-based applications
Evaluation of LLM robustness in open-domain QA
Future directions
Evaluation of Grounded Language Understanding
Self-Evaluation through Conversations
Dataset: https://github.com/microsoft/iglu-datasets
Data Collection Tool: https://github.com/iglu-contest/dataset-collection-and-evaluation
Online vs offline evaluation
Greenlands: https://github.com/microsoft/greenlands
Offline assessments must be aligned with human preferences, especially on more complex and less well-defined tasks.
Evaluation of Generative Information Retrieval
From Traditional to Generative Information Retrieval
Differences from traditional IR evaluation:
LLM for Relevance Judgements (LLM-as-a-Judge?)
TREC 2021 Deep Learning Data
Findings:
Paul Héroult Charles Hall
Human judgments, once rare and costly, can now be supplemented by LLMs, dramatically lowering costs and unlocking new evaluation opportunities.
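One common way to check how closely LLM-produced relevance labels track human labels is a chance-corrected agreement statistic such as Cohen's kappa. The sketch below is an illustration only: the slides do not state which agreement measure was used, and the two label lists are made-up toy data on a 0 to 3 graded scale.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two equally long label lists."""
    n = len(labels_a)
    # Fraction of items on which both annotators give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both annotators labelled independently at random
    # with their own label frequencies.
    expected = sum(count_a[l] * count_b[l] for l in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators always use the same single label
    return (observed - expected) / (1.0 - expected)


# Toy labels (assumption, not data from the talk).
human_labels = [0, 1, 2, 3, 1, 0, 2, 1]
llm_labels = [0, 1, 2, 2, 1, 0, 3, 1]
kappa = cohen_kappa(human_labels, llm_labels)
print(round(kappa, 3))
```

A kappa near 1 means the LLM judge agrees with the human assessors far more often than chance would predict; a value near 0 means the agreement is no better than chance.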
GenIR Evaluation
Non-relevant
Highly relevant
Gold is still Precious!
Evaluation through similarity to sparsely labeled data
Impractical to reassess the generated results
What evaluation strategies are used in other generative tasks, e.g., image generation?
Comparing generated output with good examples!
Sparse labels
Traditional Retrieval Evaluation: a response that is not the target document gets Gain = 0
Similarity with “good” example: a response similar (~) to the target document gets Gain != 0
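The contrast between the two gain definitions can be sketched in a few lines. This is a minimal illustration under stated assumptions: cosine similarity is used as the similarity function, the embedding vectors are made-up toy values, and the function names are hypothetical rather than taken from the talk.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def exact_match_gain(response_id, target_id):
    """Traditional view: only the judged target document earns gain."""
    return 1.0 if response_id == target_id else 0.0


def similarity_gain(response_vec, target_vec):
    """Similarity view: gain grows with closeness to the 'good' example."""
    return max(0.0, cosine(response_vec, target_vec))


# Toy embeddings (assumption): a generated response close to the target.
target_vec = [0.6, 0.8, 0.0]
generated_vec = [0.5, 0.7, 0.1]

traditional = exact_match_gain("generated-1", "doc-42")
similarity = similarity_gain(generated_vec, target_vec)
print(traditional, round(similarity, 3))
```

The generated response has no identifier in the judged pool, so the exact-match gain is zero, while the similarity-based gain is high because the response sits close to the target in embedding space.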
Retrieved Response ~ Target Document
Generated Response ~ Target Document
Comparing with good examples:
Fréchet Distance
Embedding Similarity
Fréchet Distance
Fréchet Distance for TTI
Caption → Text-to-Image generative model → Generated Images → Embedding Model → embeddings (0.6 0.8 …, 0.1 0.3 …) → Fréchet Distance
Fréchet Distance for IR
Documents + Retrieved or generated documents → Embedding Model → embeddings (0.6 0.7 …, 0.5 0.1 …) → Fréchet Distance
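The Fréchet distance between two sets of embeddings is usually computed by fitting a Gaussian to each set and comparing means and covariances: d² = ||μ₁ − μ₂||² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^½). The sketch below is a simplified illustration, not the setup from the talk: it assumes diagonal covariances, in which case the trace term reduces to the sum of squared differences of per-dimension standard deviations. The embedding values are made-up toy data.

```python
import math


def mean_and_std(vectors):
    """Per-dimension mean and (population) standard deviation."""
    n, dim = len(vectors), len(vectors[0])
    means = [sum(v[i] for v in vectors) / n for i in range(dim)]
    stds = [
        math.sqrt(sum((v[i] - means[i]) ** 2 for v in vectors) / n)
        for i in range(dim)
    ]
    return means, stds


def frechet_distance_diag(set_a, set_b):
    """Squared Fréchet distance between two embedding sets.

    Assumption: each set is modelled as a Gaussian with DIAGONAL covariance,
    so Tr(S1 + S2 - 2 (S1 S2)^(1/2)) becomes sum((sd1 - sd2)^2).
    """
    mu_a, sd_a = mean_and_std(set_a)
    mu_b, sd_b = mean_and_std(set_b)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    cov_term = sum((a - b) ** 2 for a, b in zip(sd_a, sd_b))
    return mean_term + cov_term


# Toy 2-d embeddings (assumption): judged relevant documents versus
# retrieved or generated documents shifted by 1.0 along the first axis.
relevant_embs = [[0.0, 0.0], [2.0, 2.0]]
system_embs = [[1.0, 0.0], [3.0, 2.0]]
distance = frechet_distance_diag(relevant_embs, system_embs)
print(distance)
```

A lower distance means the system output is distributed more like the judged relevant documents. With real embedding models the full-covariance form is normally used, which needs a matrix square root from a numerical library.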
Fréchet Distance for evaluation of GenIR
Experimental Setup
Fréchet distance Experiments
Can Fréchet Distance effectively evaluate IR systems with sparse labels?
Can the Fréchet Distance effectively evaluate IR systems when the retrieved results are not labelled?
Initial retrieved list: Unjudged, Judged, Unjudged, Unjudged, …
Traditional IR evaluation metrics such as nDCG assess performance based on where the relevant judged documents are ranked.
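The dependence on judged positions can be seen in a small nDCG sketch. It is an illustration under stated assumptions: linear gains, unjudged documents treated as gain 0 (the usual convention), and hypothetical document identifiers.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2 rank discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_ids, judgments, k=10):
    """nDCG@k where unjudged documents are assumed to have gain 0."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal_gains = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal_gains)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Sparse labels (toy assumption): a single judged relevant document.
judgments = {"d2": 1}

with_judged = ndcg_at_k(["d1", "d2", "d3", "d4"], judgments)
all_unjudged = ndcg_at_k(["d5", "d6", "d7", "d8"], judgments)
print(round(with_judged, 3), all_unjudged)
```

The second list scores zero no matter how good its documents are, because none of them appear in the judgments. That is the gap a similarity-based measure tries to close.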
Can we evaluate this list?
Unjudged, Unjudged, Unjudged, Unjudged
Correlation between MRR and Unlabelled retrieved results
Correlation between MRR and original retrieved results
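A correlation of this kind is typically computed over a set of retrieval systems: each system gets a score from both measures, and a rank correlation tells how similarly the two measures order the systems. The sketch below uses Kendall's tau (tau-a, no tie correction) as one possible choice; the slides do not name the coefficient, and all scores are made-up toy values.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a between two equally long score lists."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(xs) * (len(xs) - 1) / 2
    return (concordant - discordant) / pairs if pairs else 0.0


# Toy scores for four hypothetical systems (assumption, not results).
mrr_scores = [0.35, 0.30, 0.22, 0.18]      # higher is better
frechet_distances = [1.2, 1.5, 1.4, 2.0]   # lower is better

# Negate the distances so that both lists are "higher is better".
tau = kendall_tau(mrr_scores, [-d for d in frechet_distances])
print(round(tau, 3))
```

A tau close to 1 would indicate that ranking systems by Fréchet distance gives nearly the same ordering as ranking them by MRR.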
Evaluation of GenIR
Evaluation of GenIR with Retrieval Benchmark
Experimental Setup
Do goldfish grow?
+ their Liar version
Sanity check
Creative Ability
Validation through Cross-Grade Relevance Similarities
Experiment:
TREC DL 2019
Findings:
Retrieved vs. Generated answers
Findings:
nDCG@10
Challenges in Evaluation of GenIR
What if we have no “good” example?
Assessing Responses without Relevance Judgments
Evaluation of LLM-based Applications
Turn on the lamp
Brainstorming on the paper title
Give me a recipe with mushroom and chicken
https://microsoft.github.io/autogen/blog/2023/11/20/AgentEval/
https://github.com/microsoft/autogen/blob/main/notebook/agenteval_cq_math.ipynb
Clarity Efficiency … completeness
Error Analysis
How to validate the LLM-based evaluation?
AgentEval Results
Findings:
Findings on quantification robustness:
Robustness
Hallucination in Generative models
RAG → GARAGE
Retrieval Augmented Generation
Generate an Answer, Retrieve, Augment, Generate w/ Evidence
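The four GARAGE steps can be read as a small pipeline. The sketch below is only one hedged interpretation of the slide: it assumes the draft answer is appended to the question to form the retrieval query, and `generate`, `retrieve`, the toy corpus, and the stub functions are all hypothetical placeholders rather than components from the talk.

```python
def garage(question, generate, retrieve, top_k=2):
    """Generate an Answer, Retrieve, Augment, Generate w/ Evidence."""
    draft = generate(question)                           # 1. generate an answer
    evidence = retrieve(question + " " + draft, top_k)   # 2. retrieve (assumed query form)
    augmented = (                                        # 3. augment the prompt
        "Question: " + question + "\nEvidence:\n"
        + "\n".join("- " + passage for passage in evidence)
    )
    return generate(augmented), evidence                 # 4. generate with evidence


# Placeholder corpus and stubs so the sketch runs without any model.
CORPUS = [
    "passage about goldfish growth",
    "passage about lamps",
    "passage about goldfish tank size",
]


def toy_retrieve(query, top_k):
    """Rank passages by word overlap with the query (stand-in retriever)."""
    terms = set(query.lower().split())
    ranked = sorted(
        CORPUS, key=lambda p: len(terms & set(p.lower().split())), reverse=True
    )
    return ranked[:top_k]


def toy_generate(prompt):
    """Stand-in for an LLM call: echoes the evidence if any was supplied."""
    if "Evidence:" in prompt:
        return "Answer grounded in: " + prompt.split("Evidence:\n", 1)[1]
    return "goldfish grow"


answer, evidence = garage("Do goldfish grow?", toy_generate, toy_retrieve)
print(answer)
```

In a real system the two stubs would be replaced by a retriever over an actual collection and a call to a generative model; the control flow stays the same.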
Self-detecting hallucination:
Stepped classification of QA pairs:
Fact-checking in Generative models
Fact-based self-detecting hallucination:
Future Directions
Robustness, Fairness, Quality, Accuracy, Efficiency, Explainability, Personalization
LLM-based Applications
Efficiency w/ Constraints
Focusing on evaluation of complex and not well-defined tasks:
Thanks!
Any questions?
Negar Arabzadeh
Narabzad@uwaterloo.ca, https://www.negara.me/
@NegarEmpr, Narabzad
Assessing Responses with Relevance Judgments
Fairness
Query-Document pairs
Query: how important is a governor
Document: Governor is important because he is the chief executive of the state. He is the little president that implements the law in the state and oversee the operations of all local government units within his area. The Governor is like the president of the state. He makes decisions for his state and makes opinions to the ppl of the state where he is president of the state that he controls.... It's important to a specific state. Not important for Congress. a governor is like a president of the state.
Query: is a supervisor considered a manager?
Document: It becomes clear that the core of the role and responsibility of a supervisor lies in overlooking the activities of others to the satisfaction of laid standards in an organization. The position of a supervisor in a company is considered to be at the lowest rung of management. A supervisor in any department has more or less the same work experience as the other members in his team, but he is considered to be the leader of the group. The word manager comes from the word management, and a manager is a person who manages men. To manage is to control and to organize things, men, and events. Managers do just that. They ensure smooth running of the day to day functioning of a workplace, whether it is business, hospital, or a factory.
ARaB: Bias Inclination toward
Do Neural Ranking Models Intensify Gender Bias?
Depth of ranking
Evaluation in terms of Fairness
Investigating Gender Biases in IR
Step 1: Quantifying Gender Biases
Step 2: Finding the Source of Biases
Step 3: Mitigating the Biases
Accuracy
Self-Evaluation through QPP
Information Need → Query → Retrieval System → Relevant documents
Query
Routing
Query Reformulation
Feedback to the system
Efficient Multi Staging