A Framework for Categorising AI Evaluation Instruments
Anthony G Cohn (1)
José Hernández-Orallo (2)
Julius Sechang Mboli (3)
Yael Moros-Daval (2)
Zhiliang Xiang (4)
Lexin Zhou (2)
EBeM Workshop 2022 @IJCAI/ECAI 2022 24th July, Vienna
(1) School of Computing, University of Leeds, UK; and the Turing Institute, UK
(2) VRAIN, Universitat Politècnica de València, Spain
(3) Faculty of Engineering and Informatics, University of Bradford, UK
(4) IROHMS, School of Computer Science and Informatics, Cardiff University, UK
The background
18 Facets to characterise EIs
Validity group
Consistency group
Fairness group
Why these 18? It was a long process of iteratively refining the facets and their possible values as the first EIs were evaluated.
AERA, APA, NCME, et al., Standards for educational and psychological testing,
Facet Values (Validity, Consistency, Fairness)
23 Selected EIs: some examples
Acronym | Type | Domain | Aim | Year |
WSC [7] | test, benchmark & competition | LU, CS, reasoning | It was specifically targeted to evaluate common sense reasoning, as an alternative to the Turing test, arguing conceptual and practical advantages | 2016 |
ALE [8] | benchmark | VG; navigation; perception | The original goal was to evaluate “general, domain-independent AI technology”, by using a diversity of video games, although what it measures more specifically is unclear | 2013 |
GLUE [9] | benchmark | LU; text retrieval; world knowledge | The goal of GLUE and SuperGLUE (a modified, improved version of GLUE) is to measure the performance (e.g. accuracy, F1-score) of an AI system on natural language understanding tasks in English (Single-Sentence Tasks, Similarity and Paraphrase Tasks, and Inference Tasks). | 2018 |
AIBIRDS [12] | competition | CV, VG, KRRP | Measures the planning capability of an agent in a large action space, without knowledge of the objects' physical parameters, in situations provided by the Angry Birds game. | 2010 |
… … … …
Example EI: Winograd Schema Challenge (WSC)
Introduced by Levesque et al.
A recognising textual entailment task to test commonsense reasoning
(a replacement for the Turing Test?)
The city councilmen refused the women a permit because they [feared/advocated] violence
The trophy would not fit into the case because it was too [big/small]
Purpose | RESEARCH | RESEARCH | RESEARCH | RESEARCH | n | maj |
Capability | CAPABILITY (specify) | CAPABILITY (specify) | TASK-PERFORMANCE (specify) | CAPABILITY (specify) | y | notmaj |
Reference | ABSOLUTE | ABSOLUTE | ABSOLUTE | ABSOLUTE | n | maj |
Coverage | BIASED (specify) | BIASED (specify) | BIASED (specify) | REPRESENTATIVE | y | notmaj |
Specificity | CONTAMINATED | CONTAMINATED | SPECIFIC | CONTAMINATED | y | notmaj |
Realism | REALISTIC | TOY | REAL-LIFE | REALISTIC | y | notmaj |
Judgeability | AUTOMATED | AUTOMATED | AUTOMATED | AUTOMATED | n | maj |
Containedness | FULLY-CONTAINED | FULLY-CONTAINED | FULLY-CONTAINED | FULLY-CONTAINED | n | maj |
Reproducibility | EXACT | EXACT | EXACT | EXACT | n | maj |
Reliability | RELIABLE | N/A | RELIABLE | RELIABLE | y | notmaj |
Variation | FIXED | FIXED | PROCEDURAL | FIXED | y | notmaj |
Adjustability | UNSTRUCTURED | UNSTRUCTURED | ADAPTIVE | UNSTRUCTURED | y | notmaj |
Antecedents | CREATED | CREATED | CREATED | CREATED | n | maj |
Ambition | LONG-TERM | LONG-TERM | LONG-TERM | LONG-TERM | n | maj |
Partiality | IMPARTIAL | IMPARTIAL | IMPARTIAL | IMPARTIAL | n | maj |
Objectivity | FULLY-INDEPENDENT | FULLY-INDEPENDENT | FULLY-INDEPENDENT | FULLY-INDEPENDENT | n | maj |
Progression | STATIC | STATIC | STATIC | STATIC | n | maj |
Autonomy | AUTONOMOUS | AUTONOMOUS | AUTONOMOUS | AUTONOMOUS | n | maj |
Analysis of Rater Consistency
Facets Agreements for all 23 EIs
Analysis of Results
Summary
Proposed a framework for categorising EIs
Potential future work
Thank you!
Analysis of Results for 36 EIs
Purpose | RESEARCH | RESEARCH | RESEARCH | RESEARCH | n | maj |
Capability | CAPABILITY (specify) | CAPABILITY (specify) | TASK-PERFORMANCE (specify) | CAPABILITY (specify) | y | maj |
Reference | ABSOLUTE | ABSOLUTE | ABSOLUTE | ABSOLUTE | n | maj |
Coverage | BIASED (specify) | BIASED (specify) | BIASED (specify) | REPRESENTATIVE | y | maj |
Specificity | CONTAMINATED | CONTAMINATED | SPECIFIC | CONTAMINATED | y | maj |
Realism | REALISTIC | TOY | REAL-LIFE | REALISTIC | y | notmaj |
Judgeability | AUTOMATED | AUTOMATED | AUTOMATED | AUTOMATED | n | maj |
Containedness | FULLY-CONTAINED | FULLY-CONTAINED | FULLY-CONTAINED | FULLY-CONTAINED | n | maj |
Reproducibility | EXACT | EXACT | EXACT | EXACT | n | maj |
Reliability | RELIABLE | N/A | RELIABLE | RELIABLE | y | maj |
Variation | FIXED | FIXED | PROCEDURAL | FIXED | y | maj |
Adjustability | UNSTRUCTURED | UNSTRUCTURED | ADAPTIVE | UNSTRUCTURED | y | maj |
Antecedents | CREATED | CREATED | CREATED | CREATED | n | maj |
Ambition | LONG-TERM | LONG-TERM | LONG-TERM | LONG-TERM | n | maj |
Partiality | IMPARTIAL | IMPARTIAL | IMPARTIAL | IMPARTIAL | n | maj |
Objectivity | FULLY-INDEPENDENT | FULLY-INDEPENDENT | FULLY-INDEPENDENT | FULLY-INDEPENDENT | n | maj |
Progression | STATIC | STATIC | STATIC | STATIC | n | maj |
Autonomy | AUTONOMOUS | AUTONOMOUS | AUTONOMOUS | AUTONOMOUS | n | maj |
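A minimal sketch of how the last two columns of the table above could be derived. The slides do not define these columns, so this assumes that the first flag (y/n) marks whether the four raters disagree and that the second (maj/notmaj) marks whether a strict majority of raters chose the same value. The names `summarise_ratings` and `example_ratings` are illustrative; the three rows are copied from the table.

```python
from collections import Counter


def summarise_ratings(values):
    """Return (disagreement, majority, modal_value) for one facet.

    Assumed semantics (not confirmed by the slides):
      disagreement: "y" if the raters gave more than one distinct value
      majority:     "maj" if more than half of the raters gave the same value
    """
    counts = Counter(values)
    modal_value, modal_count = counts.most_common(1)[0]
    disagreement = "y" if len(counts) > 1 else "n"
    majority = "maj" if modal_count > len(values) / 2 else "notmaj"
    return disagreement, majority, modal_value


# Three rows copied from the table, one value per rater.
example_ratings = {
    "Capability": [
        "CAPABILITY (specify)",
        "CAPABILITY (specify)",
        "TASK-PERFORMANCE (specify)",
        "CAPABILITY (specify)",
    ],
    "Realism": ["REALISTIC", "TOY", "REAL-LIFE", "REALISTIC"],
    "Judgeability": ["AUTOMATED"] * 4,
}

for facet, values in example_ratings.items():
    disagreement, majority, modal_value = summarise_ratings(values)
    print(f"{facet} | {disagreement} | {majority} | {modal_value}")
```

Under these assumptions the sketch gives `y | maj` for Capability (three of four raters agree), `y | notmaj` for Realism (only two of four agree) and `n | maj` for Judgeability, matching the flags in the table above.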