1 of 102

Evaluation of Information Access Systems in the Generative Era

Negar Arabzadeh
University of Waterloo / University of California, Berkeley

Summer 2024

Narabzad@uwaterloo.ca

https://www.negara.me/

2 of 102

Introduction

2

Recommender Systems

Social Media

Search Engines

Generative Models

3 of 102

Introduction

3

4 of 102

Introduction

4

5 of 102

Introduction

5


6 of 102

Introduction

6



7 of 102

Evaluation of Information Access Systems

  • Rapid advancement of technologies 🡪 evaluation becomes more challenging
    • Developing new metrics aligned with the unique characteristics of new systems

7

8 of 102

Evaluation of Information Access Systems

  • Rapid advancement of technologies 🡪 evaluation becomes more challenging
    • Developing new metrics aligned with the unique characteristics of new systems
  • Evaluation requires data

  • Limitations of extensive evaluations 🡪 biased or incomplete assessments

8

9 of 102

Introduction

9

How do we provide environments where correct information is:

Available, Identifiable, and Accessible?

10 of 102

Outline

10

Evaluation of grounded language understanding

Evaluation of generative IR systems

Evaluation of LLM-based applications

Evaluation of LLM robustness in open-domain QA

Future directions

11 of 102

Evaluation of Grounded Language Understanding

11

12 of 102

Self-Evaluation through Conversations

      • A system predicts its own failure and uses it as feedback:
      • Conversation with the user
        • Asking the user to clarify their intent
        • Announcing the system’s failure

12

13 of 102

Self-Evaluation through Conversations

      • NeurIPS 2022 IGLU Challenge: Interactive Grounded Language Understanding in a Collaborative Environment
        • Build interactive agents
        • in a Minecraft environment
        • that learn to solve a task while provided with grounded natural language instructions
        • in a collaborative environment.

13

14 of 102

Self-Evaluation through Conversations

      • Asking clarifying questions is essential for developing human-like dialogue systems.
      • “When” and “What” to ask as clarifying questions.

14

15 of 102

Self-Evaluation through Conversations

      • Asking clarifying questions is essential for developing human-like dialogue systems.
      • “When” and “What” to ask as clarifying questions.
        • Collecting a dataset
        • Providing data collection tool
        • Providing the training environment
        • Designing the evaluation methodology

15

16 of 102

Online vs offline evaluation

      • Comparing human preferences with automatic offline metrics.
      • Collecting human explanations of why they prefer a certain agent.

16

Greenlands: https://github.com/microsoft/greenlands

17 of 102

Online vs offline evaluation

      • Comparing human preferences with automatic offline metrics.
      • Collecting human explanations of why they prefer a certain agent.
      • Findings:

17

        • Human preferences do not align with automatic offline evaluation metrics, e.g., F1.

Greenlands: https://github.com/microsoft/greenlands

18 of 102

Online vs offline evaluation

      • Comparing human preferences with automatic offline metrics.
      • Collecting human explanations of why they prefer a certain agent.
      • Findings:
        • Human preferences do not align with automatic offline evaluation metrics, e.g., F1.
        • Human criteria are not necessarily captured by offline evaluation metrics.

18

Offline assessments must be aligned with human preferences, especially on more complex and less well-defined tasks.

19 of 102

Self-Evaluation through Conversations

      • Asking clarifying questions is essential for developing human-like dialogue systems.
      • “When” and “What” to ask as clarifying questions.
        • Collecting a dataset
        • Providing data collection tool
        • Providing the training environment
        • Designing the evaluation methodology

19

20 of 102

Evaluation of Generative Information Retrieval

20

21 of 102

From Traditional to Generative Information Retrieval

21

Differences from traditional IR evaluation:

  1. Non-deterministic results
  2. No more “ten blue links”
  3. No more traditional browsing models
  4. No more traditional evaluation metrics, e.g., nDCG
  • Evaluation anchor: based on implicit or explicit human signals, e.g., judged relevant documents.
  • Reward mechanism: based on the rank position of relevant documents retrieved by the ranker.

22 of 102

LLM for Relevance Judgements (LLM-as-a-Judge?)

  • Automatic Evaluation of GenIR and RAG systems
  • Using LLM for relevance judgements in IR
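Below is a minimal sketch of how an LLM can be prompted to act as a relevance judge. The prompt wording, the grade scale, and the `complete` helper (standing in for any chat-completion call) are illustrative assumptions, not the exact setup used in the work discussed here.

```python
# Minimal sketch: an LLM as a relevance judge (illustrative assumptions only).
# `complete(prompt) -> str` stands in for any chat-completion call.

PROMPT = """You are a relevance assessor.
Query: {query}
Passage: {passage}
On a scale from 0 (not relevant) to 2 (highly relevant), how well does the
passage answer the query? Reply with a single digit."""

def judge(query: str, passage: str, complete) -> int:
    """Return an LLM-assigned relevance grade for a query-passage pair."""
    reply = complete(PROMPT.format(query=query, passage=passage))
    digits = [c for c in reply if c.isdigit()]
    return int(digits[0]) if digits else 0  # default to "not relevant"
```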

22

23 of 102

LLM for Relevance Judgements (LLM-as-a-Judge?)

  • LLM vs. Human Judgments:
  • High correlation between the rankings of different retrieval systems when evaluated with traditional metrics (e.g., nDCG@10) computed from human judgements vs. LLM judgements

23

TREC 2021 Deep Learning Data
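As a concrete sketch of how this agreement can be measured: rank the systems by nDCG@10 under human qrels and under LLM qrels, then compute a rank correlation between the two orderings. The per-system scores below are made-up placeholders, not numbers from this deck.

```python
# Sketch: do human and LLM judgements order retrieval systems the same way?
# The nDCG@10 values below are illustrative placeholders.
from scipy.stats import kendalltau

ndcg_human = {"bm25": 0.48, "dpr": 0.62, "monot5": 0.71, "colbert": 0.69}
ndcg_llm   = {"bm25": 0.51, "dpr": 0.60, "monot5": 0.74, "colbert": 0.70}

systems = sorted(ndcg_human)
tau, p = kendalltau([ndcg_human[s] for s in systems],
                    [ndcg_llm[s] for s in systems])
print(f"Kendall's tau between the two system orderings: {tau:.2f} (p = {p:.3f})")
```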

24 of 102

LLM for Relevance Judgements (LLM-as-a-Judge?)

24

25 of 102

LLM for Relevance Judgements (LLM-as-a-Judge?)

Findings:

25

  • Traditional relevance assessment for offline evaluation is no longer needed. LLMs are better than non-expert humans at judging basic relevance.

26 of 102

LLM for Relevance Judgements (LLM-as-a-Judge?)

26

27 of 102

LLM for Relevance Judgements (LLM-as-a-Judge?)

27

28 of 102

LLM for Relevance Judgements (LLM-as-a-Judge?)

28

Paul Héroult and Charles Hall

Human judgments, once rare and costly, can now be supplemented by LLMs, dramatically lowering costs and unlocking new evaluation opportunities.

29 of 102

GenIR Evaluation

  • Before, human judgements were too expensive. Now that we have cheap LLM judgments, what can we do with them?
    • Nugget-based evaluation

29

30 of 102

GenIR Evaluation

  • Before, human judgements were too expensive. Now that we have cheap LLM judgments, what can we do with them?
    • Nugget-based evaluation
    • Pairwise preference judgments

30

31 of 102

GenIR Evaluation

  • Before, human judgements were too expensive. Now that we have cheap LLM judgments, what can we do with them?
    • Nugget-based evaluation
    • Preference Judgments
    • Binary/graded relevance assessment

31

  • Binary: relevant / non-relevant
  • Graded: from non-relevant to highly relevant

32 of 102

GenIR Evaluation

  • Before, human judgements were too expensive. Now that we have cheap LLM judgments, what can we do with them?
    • Nugget-based evaluation
    • Preference Judgments
    • Binary/graded relevance assessment
    • Similarity to sparsely labeled data

32

33 of 102

GenIR Evaluation

  • Before, human judgements were too expensive. Now that we have cheap LLM judgments, we can run evaluations that were not practical before.

  • We still need human feedback in information-seeking systems, but what is the optimal role of humans in the evaluation loop?
    • The alignment between LLMs and real users needs thorough validation.
    • Humans provide the objective function to optimize.

33

Gold is still Precious!

34 of 102

Evaluation through similarity to sparsely labeled data

  • Challenges in generative-based tasks

Impractical to reassess the generated results

34

35 of 102

Evaluation through similarity to sparsely labeled data

  • Challenges in generative-based tasks

Impractical to reassess the generated results

35

What evaluation strategies are used in other generative tasks, e.g., image generation?

Comparing generated output with good examples!

36 of 102

Evaluation through similarity to sparsely labeled data

  • Challenges in generative-based tasks

Impractical to reassess the generated results

36

What evaluation strategies are used in other generative tasks, e.g., image generation?

Comparing generated output with good examples!

Sparse labels

37 of 102

Evaluation through similarity to sparsely labeled data

37

  • Advantage 1: Unlike traditional retrieval assessment, there is no need to retrieve exactly the labeled good examples to gain reward.

[Figure: In traditional retrieval evaluation, missing the target document yields gain = 0; under similarity with a “good” example, the gain is non-zero even without an exact match.]

38 of 102

Evaluation through similarity to sparsely labeled data

38

  • Advantage 2: Allows evaluating both retrieved and generated answers in the same space.

[Figure: Both the retrieved response and the generated response are compared (~) against the target document in the same embedding space.]

39 of 102

Evaluation through similarity to sparsely labeled data

39

Comparing with good examples:

Fréchet Distance

Embedding Similarity

40 of 102

Evaluation through similarity to sparsely labeled data

40

Comparing with good examples:

Fréchet Distance

Embedding Similarity

41 of 102

Fréchet Distance

 

41

42 of 102

Fréchet Distance

  •  
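For reference, the Fréchet distance between two Gaussians fitted to the compared embedding sets (the closed form used in FID-style evaluation) is:

d_F^2\big(\mathcal{N}(\mu_1,\Sigma_1),\,\mathcal{N}(\mu_2,\Sigma_2)\big) \;=\; \lVert \mu_1 - \mu_2 \rVert_2^2 \;+\; \mathrm{Tr}\!\big(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}\big)

where \mu_i and \Sigma_i are the mean and covariance of each set of embeddings.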

42

43 of 102

Fréchet Distance for TTI

  • Evaluation of Text-to-Image (TTI) generative models with Fréchet distance:

43

[Figure: Captions are fed to a text-to-image generative model; the generated images are passed through an embedding model, and the Fréchet distance is computed between the embedding distributions.]

44 of 102

Fréchet Distance for IR

  • Evaluation of IR systems with Fréchet distance:

44

[Figure: Judged relevant documents and retrieved or generated documents are each passed through an embedding model; the Fréchet distance is computed between the two embedding distributions.]
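A minimal numpy/scipy sketch of this FID-style computation, assuming the documents have already been embedded into matrices X and Y; the embedding model itself is abstracted away:

```python
# Sketch: Fréchet distance between two sets of document embeddings,
# e.g., judged-relevant documents vs. retrieved or generated documents.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(X: np.ndarray, Y: np.ndarray) -> float:
    """X, Y: (n_docs, dim) embedding matrices; a Gaussian is fitted to each set."""
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    cov_x = np.cov(X, rowvar=False)
    cov_y = np.cov(Y, rowvar=False)
    covmean = sqrtm(cov_x @ cov_y)
    if np.iscomplexobj(covmean):  # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x + cov_y - 2.0 * covmean))
```

FD@10 in the experiments that follow would then presumably be this distance computed over the embeddings of the top-10 results.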

45 of 102

Fréchet Distance for evaluation of GenIR

  •  

45

46 of 102

Experimental Setup

    • A classic information retrieval test collection of items, e.g., documents.
    • Query sets with sparsely labeled documents
    • A set of different retrievers/generative models to assess

46

47 of 102

Fréchet distance Experiments

Can Fréchet Distance effectively evaluate IR systems with sparse labels?

  • Experiment:
    • MRR@10 vs. FD@10 on ordering rankers

  • Findings:
    • FD can effectively pick out the better retriever, particularly when there is a significant difference between their performances.

47

48 of 102

Fréchet distance Experiments

Can the Fréchet Distance effectively evaluate IR systems when the retrieved results are not labelled?

48

[Figure: An initial retrieved list containing a mix of judged and unjudged documents.]

Traditional IR evaluation metrics like nDCG assess performance based on where the relevant judged documents are placed.

49 of 102

Fréchet distance Experiments

Can the Fréchet Distance effectively evaluate IR systems when the retrieved results are not labelled?

49

Can we evaluate a list in which none of the retrieved documents are judged?

[Figure: The initial retrieved list (judged and unjudged documents) vs. a fully unjudged retrieved list.]

Traditional IR evaluation metrics like nDCG assess performance based on where the relevant judged documents are placed.

50 of 102

Fréchet distance Experiments

50

  • The Fréchet Distance can assess unlabeled data.
  • When ranking a set of retrievers, it shows a statistically significant rank-based correlation with their ordering under traditional IR metrics.
  • In contrast, traditional IR metrics cannot provide any insight without the labeled documents being retrieved.

Can the Fréchet Distance effectively evaluate IR systems when the retrieved results are not labelled?

Correlation between MRR and Unlabelled retrieved results

Correlation between MRR and original retrieved results

51 of 102

Evaluation of GenIR

51

Comparing with good examples:

Fréchet Distance

Embedding Similarity

52 of 102

Evaluation of GenIR with Retrieval Benchmark

 

52

 

53 of 102

Experimental Setup

  • LLMs used in the experiments:
    • Llama-2-7b-chat
    • Llama-2-13b-chat
    • GPT-3.5-turbo
    • GPT-4

53

Do goldfish grow?

+ their Liar version

Sanity check

Creative Ability

54 of 102

Validation through Cross-Grade Relevance Similarities

Experiment:

  • Selecting one “target response” for each query.
  • Measuring the similarity between the representation of the target response and documents with different grades of relevance.

54

TREC DL 2019

55 of 102

Validation through Cross-Grade Relevance Similarities

Experiment:

  • Selecting one “target response” for each query.
  • Measuring the similarity between the representation of the target response and documents with different grades of relevance.

Findings:

  • Validates the use of similarity for evaluating the QA task.

55

TREC DL 2019

  • Robust w.r.t. the choice of document embedding.
  • Passages with lower relevance levels show lower similarity scores, reflecting their lesser relevance to the information need.
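A minimal sketch of this measurement, with `embed` as an assumed stand-in for the document-embedding model; `judged` maps each relevance grade to its judged passages:

```python
# Sketch: cosine similarity between a target response and judged passages,
# grouped by relevance grade. `embed(text) -> np.ndarray` is an assumed helper.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def similarity_by_grade(target_response: str, judged: dict[int, list[str]], embed) -> dict[int, float]:
    """Average similarity of the target response to passages at each relevance grade."""
    t = embed(target_response)
    return {grade: float(np.mean([cosine(t, embed(p)) for p in passages]))
            for grade, passages in judged.items() if passages}
```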

56 of 102

Retrieved vs. Generated answers

Findings:

  • We can assess both generative and retrieval models in a uniform context.
  • Statistically significant Kendall’s 𝜏 correlation with nDCG@10.

56

nDCG@10

57 of 102

Challenges in Evaluation of GenIR

57

What if we have no “good” example?

58 of 102

Assessing Responses without Relevance Judgments

  • Challenge: No human judgments are available.
  • Solution: Use the top passage returned by different retrieval methods as an evaluation anchor and compare it with the generated answers (sketched below).
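A minimal sketch of this anchor idea, with `retrieve_top1` and `embed` as assumed helpers rather than any specific library API:

```python
# Sketch: judgment-free evaluation using the top retrieved passage as an anchor.
# `retrieve_top1(query) -> str` and `embed(text) -> np.ndarray` are assumed helpers.
import numpy as np

def anchored_score(query: str, generated_answer: str, retrieve_top1, embed) -> float:
    anchor = retrieve_top1(query)  # top passage acts as a proxy relevant passage
    a, g = embed(anchor), embed(generated_answer)
    return float(a @ g / (np.linalg.norm(a) * np.linalg.norm(g) + 1e-12))
```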

58

59 of 102

Assessing Responses without Relevance Judgments

  • Findings:
    • Regardless of the choice of retriever, cheap or expensive, the top retrieved passage consistently emerges as a strong indicator of relevance.
    • The relative performance of the models remains nearly unchanged when using different retrieval methods.

59

60 of 102

Evaluation of GenIR

60

61 of 102

Evaluation of LLM-based Applications

61

62 of 102

Evaluation of LLM-based Applications

62

63 of 102

Evaluation of LLM-based Applications

63

Turn on the lamp

Brainstorming on the paper title

64 of 102

Evaluation of LLM-based Applications

64

Give me a recipe with mushroom and chicken

Turn on the lamp

65 of 102

Evaluation of LLM-based Applications

65

Give me a recipe with mushroom and chicken

Turn on the lamp

66 of 102

Evaluation of LLM-based Applications

  • AgentEval: Assessing the Task Utility of LLM-powered Applications for Their End-Users

  • Math problem solving as an example of going beyond the binary success/failure of the method.

66

https://microsoft.github.io/autogen/blog/2023/11/20/AgentEval/

67 of 102

Evaluation of LLM-based Applications

  • AgentEval: Assessing the Task Utility of LLM-powered Applications for Their End-Users

  • Math problem solving as an example of going beyond the binary success/failure of the method.

67

https://microsoft.github.io/autogen/blog/2023/11/20/AgentEval/

68 of 102

Evaluation of LLM-based Applications

68

https://github.com/microsoft/autogen/blob/main/notebook/agenteval_cq_math.ipynb

[Figure: Criteria derived by AgentEval, e.g., clarity, efficiency, …, completeness, and error analysis.]

69 of 102

Evaluation of LLM-based Applications

  • AgentEval: Assessing the Task Utility of LLM-powered Applications for Their End-Users

69

How to validate the LLM-based evaluation?

https://microsoft.github.io/autogen/blog/2023/11/20/AgentEval/

70 of 102

Evaluation of LLM-based Applications

70

  • Integrated into the public AutoGen library, which has more than 29k stars.
  • Adopted internally by Microsoft product teams.
  • Merge similar criteria
  • Avoid redundant criteria
  • Avoid unstable criteria
  • Avoid non-distinguishable criteria
  • Noise-injected vs. original samples
  • Ensuring the evaluation criteria effectively assert that original samples are superior to perturbed samples (see the sketch below).
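A minimal sketch of that verification step, keeping only criteria that reliably rate original samples above their noise-injected counterparts; `quantify(criterion, sample) -> float` stands in for the LLM-based quantifier and is an assumption, not the AgentEval API.

```python
# Sketch: keep only criteria that score original samples above perturbed ones.
# `quantify(criterion, sample) -> float` is an assumed stand-in for the LLM-based quantifier.
import statistics

def is_distinguishing(criterion, originals, perturbed, quantify, margin=0.0) -> bool:
    """True if the criterion rates original samples higher than noise-injected ones."""
    orig_mean = statistics.mean(quantify(criterion, s) for s in originals)
    pert_mean = statistics.mean(quantify(criterion, s) for s in perturbed)
    return orig_mean - pert_mean > margin

def filter_criteria(criteria, originals, perturbed, quantify):
    return [c for c in criteria
            if is_distinguishing(c, originals, perturbed, quantify)]
```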

71 of 102

AgentEval Results

Findings:

  • Validation of the evaluation: a sanity check on whether AgentEval can distinguish between failed and successful cases.
  • Not all failed cases are the same.

71

72 of 102

AgentEval Results

Findings on quantification robustness:

  • Studying reliability and reproducibility of assessments over 50 repeats.
  • Some criteria lack clear distinguishability between failed and successful cases.
  • The narrower the distribution, the more robust the criteria.

72

73 of 102

Robustness

73

74 of 102

Hallucination in Generative models

74

RAG → GARAGE

Retrieval Augmented Generation

Generate an Answer, Retrieve, Augment, Generate w/ Evidence

75 of 102

Hallucination in Generative models

75

Self-detecting hallucination:

  • The ability of LLMs to self-detect hallucinations by confirming their generated responses against an external corpus containing known correct answers.
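A sketch of this loop: retrieve evidence for the question from the external corpus and ask the model to verify its own earlier answer against it. `retrieve`, `complete`, and the prompt wording are illustrative assumptions.

```python
# Sketch: LLM self-detection of hallucinations against an external corpus.
# `retrieve(question, k) -> list[str]` and `complete(prompt) -> str` are assumed helpers.

VERIFY_PROMPT = """Question: {q}
Your earlier answer: {a}
Evidence from a trusted corpus:
{evidence}
Does the evidence support your earlier answer? Reply SUPPORTED or HALLUCINATED."""

def self_check(question: str, answer: str, retrieve, complete, k: int = 3) -> bool:
    """Return True if the model flags its own earlier answer as a hallucination."""
    evidence = "\n".join(retrieve(question, k))
    verdict = complete(VERIFY_PROMPT.format(q=question, a=answer, evidence=evidence))
    return "HALLUCINATED" in verdict.upper()
```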

76 of 102

Hallucination in Generative models

76

  • Strengths:
  • This approach is straightforward – minimal prompt engineering.

  • Challenges:
  • The reliance on retrieval.

77 of 102

Hallucination in Generative models

Stepped classification of QA pairs:

  • The LLM can correctly detect its own hallucinations in the majority of cases (accuracy over 80%), with the help of retrieval methods.

77

78 of 102

Hallucination in Generative models

78

  • Strengths:
  • Ensuring that both the generated and retrieved answers are directly related to the question.

  • Challenges:
  • Misclassifying answers due to excessive detail in the retrieved passages.
  • Struggling with false negatives, especially when the retrieved evidence is too detailed or indirect.

79 of 102

Fact-checking in Generative models

  • The false positive rate is much higher than the false negative rate.

79

Fact-based self-detecting hallucination:

80 of 102

Fact-checking in Generative models

  • Strengths:
  • More granular verification of generated content.
  • Identifying hallucinations at a finer level.
  • By validating individual statements, it can correct specific parts of an answer while maintaining overall coherence.

80

Fact-based self-detecting hallucination:

  • Challenges:
  • Over-generating factual statements, leading to unnecessary checks.
  • Over-detailed or redundant extraction of statements.
  • Incorrectly categorizing evidence as contradictory.

81 of 102

Future Directions

81

82 of 102

Future Directions

82

[Figure: Dimensions for evaluating LLM-based applications: robustness, fairness, quality, accuracy, efficiency (with constraints), explainability, and personalization.]

83 of 102

Future Directions

Focusing on the evaluation of complex and less well-defined tasks:

  • Open-ended
  • With dynamic test sets
  • Non-definitive ground truth

83

84 of 102

Future Directions

  • Finding out what works for people?
    • Should we set human performance as our ground truth, or aim to surpass it?
  • To what extent can generative models replace human efforts in evaluation?
  • What is the optimal role of humans in the evaluation process during the generative era?

84

85 of 102

85

86 of 102

Thanks!

Any questions?

Negar Arabzadeh

Narabzad@uwaterloo.ca | https://www.negara.me/

@NegarEmpr | Narabzad

86

87 of 102

Assessing Responses with Relevance Judgments

  • Experiment:
    • Measuring the average similarity between each generated answer and the judged passages from different levels of relevance.
  • Findings:
    • The average similarity between generated answers and passages decreases as the passages’ relevance level decreases.
    • GPT-4 appears to be a more “convincing liar” than GPT-3.5-turbo, since it consistently yields lower similarity scores to the judged relevant passages.

87

88 of 102

Fairness

88

89 of 102

Fairness

      • A fair ranker provides a balanced representation of the protected attributes.
        • For example: gender, race, ethnicity, and age.

89

90 of 102

Fairness

      • A fair ranker provides a balanced representation of the protected attributes.
        • For example: gender, race, ethnicity, and age.

      • Ideal: given a gender-neutral query:
        • The retrieved documents are relevant to the query
        • The retrieved documents do not show inclination towards a specific gender
        • Between two equally relevant documents, the one with a lower degree of gender bias is ranked higher

90

91 of 102

Fairness

91

Query-Document pairs

Query: how important is a governor

Governor is important because he is the chief executive of the state. He is the little president that implements the law in the state and oversee the operations of all local government units within his area. The Governor is like the president of the state. He makes decisions for his state and makes opinions to the ppl of the state where he is president of the state that he controls.... It's important to a specific state. Not important for Congress. a governor is like a president of the state.

Query-Document pairs

Query: is a supervisor considered a manager?

It becomes clear that the core of the role and responsibility of a supervisor lies in overlooking the activities of others to the satisfaction of laid standards in an organization. The position of a supervisor in a company is considered to be at the lowest rung of management. A supervisor in any department has more or less the same work experience as the other members in his team, but he is considered to be the leader of the group. The word manager comes from the word management, and a manager is a person who manages men. To manage is to control and to organize things, men, and events. Managers do just that. They ensure smooth running of the day to day functioning of a workplace, whether it is business, hospital, or a factory.

92 of 102

Fairness

92

 

 

93 of 102

      • Gender Bias Metric: Average Rank Bias (ARaB)

[Figure: ARaB bias inclination toward male and toward female terms, plotted against the depth of ranking (from “Do Neural Ranking Models Intensify Gender Bias?”).]

93

94 of 102

      • Gender Bias Metric: Average Rank Bias (ARaB)

[Figure: ARaB bias inclination toward male and toward female terms, at increasing depths of ranking (from “Do Neural Ranking Models Intensify Gender Bias?”).]

      • Only the red bar is the unsupervised approach (BM25); the rest are supervised neural rankers.
      • All of the rankers show male-inclined bias.
      • Neural rankers intensify gender bias toward male inclination more than the unsupervised ranker.

94
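A hedged sketch of a rank-bias measure in the spirit of ARaB, not the exact formulation of the cited paper: document-level bias is the normalized difference between counts of male and female terms, averaged over the top-t documents and then over cutoffs. The word lists are illustrative.

```python
# Sketch in the spirit of ARaB (Average Rank Bias); an approximation, not the
# exact formulation from "Do Neural Ranking Models Intensify Gender Bias?".

MALE_TERMS = {"he", "him", "his", "man", "men"}        # illustrative word lists
FEMALE_TERMS = {"she", "her", "hers", "woman", "women"}

def doc_bias(text: str) -> float:
    """> 0 leans male, < 0 leans female, 0 if no gendered terms occur."""
    tokens = text.lower().split()
    m = sum(tok in MALE_TERMS for tok in tokens)
    f = sum(tok in FEMALE_TERMS for tok in tokens)
    return (m - f) / (m + f) if (m + f) else 0.0

def average_rank_bias(ranked_docs: list[str], depth: int = 10) -> float:
    """Mean bias of the top-t documents, averaged over cutoffs t = 1..depth."""
    cutoffs = range(1, min(depth, len(ranked_docs)) + 1)
    rab = [sum(doc_bias(d) for d in ranked_docs[:t]) / t for t in cutoffs]
    return sum(rab) / len(rab) if rab else 0.0
```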

95 of 102

Evaluation in terms of Fairness

95

Step 3

Mitigating the Biases

Step 1

Quantifying Gender Biases

Step 2

Finding the Source of Biases

Investigating Gender Biases in IR

96 of 102

Evaluation in terms of Fairness

96

Step 3

Mitigating the Biases

Step 1

Quantifying Gender Biases

Step 2

Finding the Source of Biases

Investigating Gender Biases in IR

97 of 102

Accuracy

97

98 of 102

Self-Evaluation through QPP

      • Query Performance Prediction (QPP): predicting the quality of the retrieved documents in satisfying the information need behind the query.

98

 

[Figure: An information need is expressed as a query and submitted to a retrieval system; QPP predicts how well the retrieved documents cover the relevant documents.]
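As one concrete illustration (not one of the methods proposed in this deck), a classic post-retrieval predictor such as Normalized Query Commitment (NQC) estimates query difficulty from the spread of the top-k retrieval scores; the sketch below assumes those scores and a corpus-level score for the query are already available.

```python
# Sketch: Normalized Query Commitment (NQC), a classic post-retrieval QPP
# predictor. Higher NQC roughly indicates higher predicted retrieval quality.
import math

def nqc(top_scores: list[float], corpus_score: float) -> float:
    """Standard deviation of the top-k retrieval scores, normalized by the corpus score."""
    if not top_scores or corpus_score == 0:
        return 0.0
    mu = sum(top_scores) / len(top_scores)
    var = sum((s - mu) ** 2 for s in top_scores) / len(top_scores)
    return math.sqrt(var) / corpus_score
```

QPP quality itself is then typically reported as the correlation (e.g., Kendall's τ or Pearson's r) between the predicted scores and the actual per-query effectiveness.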

99 of 102

Self-Evaluation through QPP

      • Query Performance Prediction (QPP): predicting the quality of the retrieved documents in satisfying the information need behind the query.

      • How good is the predicted quality?

99

 

[Figure: An information need is expressed as a query and submitted to a retrieval system; QPP predicts how well the retrieved documents cover the relevant documents.]

100 of 102

Self-Evaluation through QPP

      • Applications:

100

[Figure: QPP applications: query routing, query reformulation, feedback to the system, and efficient multi-stage ranking.]

      • Query Performance Prediction (QPP) has been shown to be highly correlated with retrieval performance.

101 of 102

Self-Evaluation through QPP

      • Applications:
      • Proposed Methods:
        • Neural embedding-based QPP
        • QPP based on perturbations in query representation
        • QPP based on document coherency
        • Contextualized transformer-based QPP

101

[Figure: QPP applications: query routing, query reformulation, feedback to the system, and efficient multi-stage ranking.]

      • Query Performance Prediction (QPP) has been shown to be highly correlated with retrieval performance.

102 of 102

Self-Evaluation through QPP

      • Applications:
      • Proposed Methods:
        • Neural embedding-based QPP
        • QPP based on perturbations in query representation
        • QPP based on document coherency
        • Contextualized transformer-based QPP

102

[Figure: QPP applications: query routing, query reformulation, feedback to the system, and efficient multi-stage ranking.]

      • Query Performance Prediction (QPP) has been shown to be highly correlated with retrieval performance.