1 of 15

A Framework for Categorising AI Evaluation Instruments

Anthony G Cohn¹

José Hernández-Orallo²

Julius Sechang Mboli³

Yael Moros-Daval²

Zhiliang Xiang⁴

Lexin Zhou²

EBeM Workshop 2022 @IJCAI/ECAI 2022 24th July, Vienna

¹ School of Computing, University of Leeds, UK; and the Turing Institute, UK
² VRAIN, Universitat Politècnica de València, Spain
³ Faculty of Engineering and Informatics, University of Bradford, UK
⁴ IROHMS, School of Computer Science and Informatics, Cardiff University, UK

2 of 15

The background

Approached by the OECD for their “Artificial Intelligence and the Future of Skills (AIFS) Project”

Understand the potential as well as the limits of AI capabilities at a detailed task level to describe more precisely how humans and AI are complementary.

Subsequently invited and funded to work on a framework to analyse the landscape of competitions and benchmarks

Characterise evaluation instruments (EIs) to capture their strengths and weaknesses

Not based on domains but on the values of a set of facets.

3 of 15

18 Facets to characterise EIs

Validity group

Does it measure what we want to measure?
Purpose, Capability, Reference, Coverage, Specificity, Realism

Consistency group

Does it measure it effectively and verifiably?
Judgeability, Containedness, Reproducibility, Reliability, Variation, Adjustability

Fairness group

Does it treat all test takers equally?
Antecedents, Ambition, Partiality, Objectivity, Progression, Autonomy

Why these 18? The facets and their possible values were refined iteratively over a long process, as the first EIs were evaluated.

AERA, APA, NCME, et al., Standards for Educational and Psychological Testing

4 of 15

Facet Values (Validity, Consistency, Fairness)

Purpose [RESEARCH, CONFORMITY, OTHER (specify)]
Capability [TASK-PERFORMANCE (specify), CAPABILITY (specify)]
Reference [ABSOLUTE, RELATIVE (specify)]
Coverage [BIASED (specify), REPRESENTATIVE]
Antecedents [CREATED, ADAPTED (specify)]
Specificity [SPECIFIC, CONTAMINATED]
Judgeability [MANUAL, AUTOMATED]
Containedness [FULLY-CONTAINED, PARTIAL-INFERENCE (specify), NOT-CONTAINED (specify)]
Reproducibility [NON-REPRODUCIBLE, STOCHASTIC, EXACT]
Reliability [RELIABLE, NON-RELIABLE, N/A]
Variation [FIXED, ALTERED, PROCEDURAL]
Adjustability [UNSTRUCTURED, ABLATABLE, ADAPTIVE]
Antecedents [CREATED, RETROFITTED (specify)]
Ambition [SHORT-TERM, LONG-TERM]
Partiality [PARTIAL (specify), IMPARTIAL]
Objectivity [LOOSE, CUSTOMISED, FULLY-INDEPENDENT]
Progression [STATIC, DEVELOPMENTAL]
Autonomy [AUTONOMOUS, COUPLED (specify), COMPONENT]
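
The facet groups and value lists above can be held in a small lookup structure so that an EI rating can be checked mechanically. The following is a minimal sketch, not the authors' tooling: the names `FACET_GROUPS`, `FACET_VALUES`, `normalise` and `validate` are hypothetical. Three points are assumptions: the Realism values are only those that appear in the ratings table (the value list does not give them), Antecedents accepts both second values that the list shows, and Ambition uses the `-TERM` spelling found in the ratings table.

```python
import re

# Facet groups as listed on the "18 Facets" slide.
FACET_GROUPS = {
    "Validity": ["Purpose", "Capability", "Reference",
                 "Coverage", "Specificity", "Realism"],
    "Consistency": ["Judgeability", "Containedness", "Reproducibility",
                    "Reliability", "Variation", "Adjustability"],
    "Fairness": ["Antecedents", "Ambition", "Partiality",
                 "Objectivity", "Progression", "Autonomy"],
}

# Allowed values per facet, with any "(specify)" qualifier dropped.
FACET_VALUES = {
    "Purpose": {"RESEARCH", "CONFORMITY", "OTHER"},
    "Capability": {"TASK-PERFORMANCE", "CAPABILITY"},
    "Reference": {"ABSOLUTE", "RELATIVE"},
    "Coverage": {"BIASED", "REPRESENTATIVE"},
    "Specificity": {"SPECIFIC", "CONTAMINATED"},
    # Assumption: only the values observed in the ratings table.
    "Realism": {"TOY", "REALISTIC", "REAL-LIFE"},
    "Judgeability": {"MANUAL", "AUTOMATED"},
    "Containedness": {"FULLY-CONTAINED", "PARTIAL-INFERENCE", "NOT-CONTAINED"},
    "Reproducibility": {"NON-REPRODUCIBLE", "STOCHASTIC", "EXACT"},
    "Reliability": {"RELIABLE", "NON-RELIABLE", "N/A"},
    "Variation": {"FIXED", "ALTERED", "PROCEDURAL"},
    "Adjustability": {"UNSTRUCTURED", "ABLATABLE", "ADAPTIVE"},
    # Assumption: both second values shown in the value list are accepted.
    "Antecedents": {"CREATED", "ADAPTED", "RETROFITTED"},
    # Assumption: spelled as in the ratings table (LONG-TERM).
    "Ambition": {"SHORT-TERM", "LONG-TERM"},
    "Partiality": {"PARTIAL", "IMPARTIAL"},
    "Objectivity": {"LOOSE", "CUSTOMISED", "FULLY-INDEPENDENT"},
    "Progression": {"STATIC", "DEVELOPMENTAL"},
    "Autonomy": {"AUTONOMOUS", "COUPLED", "COMPONENT"},
}


def normalise(value):
    """Strip a trailing parenthesised qualifier such as '(specify)'."""
    return re.sub(r"\s*\(.*\)\s*$", "", value.strip())


def validate(rating):
    """Return a list of problems found in one rater's facet -> value mapping."""
    problems = []
    for facets in FACET_GROUPS.values():
        for facet in facets:
            if facet not in rating:
                problems.append(f"missing facet: {facet}")
            elif normalise(rating[facet]) not in FACET_VALUES[facet]:
                problems.append(f"unknown value for {facet}: {rating[facet]}")
    return problems
```

An empty list from `validate` means the rating names all 18 facets and uses only listed values; it says nothing about whether the chosen values are the right ones for the EI.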

5 of 15

23 Selected EIs : some examples

Acronym	Type	Domain	Aim	Year
WSC [7]	test, benchmark & competition	LU, CS, reasoning	It was specifically targeted to evaluate common sense reasoning, as an alternative to the Turing test, arguing conceptual and practical advantages	2016
ALE [8]	benchmark	VG; navigation; perception	The original goal was to evaluate “general, domain-independent AI technology”, by using a diversity of video games, although what it measures more specifically is unclear	2013
GLUE [9]	benchmark	LU; text retrieval; world knowledge	The goal of GLUE and SuperGLUE (an improved, modified version of GLUE) is to measure the performance (e.g. accuracy, F1-score) of an AI system in natural language understanding tasks (Single-Sentence Tasks, Similarity and Paraphrase Tasks, and Inference Tasks) in English.	2018
AIBIRDS [12]	competition	CV, VG, KRRP	Measures the planning capability of an agent in a large action space, without knowledge of the physical parameters of objects, in situations taken from the Angry Birds game.	2010

… … … …

HRIC = Human-Robot-Interaction and Cooperation; NMDE = Navigation and Mapping in dynamic environments; CV = Computer Vision; ABP = Adaptive Behaviors, planning; AA = abstract argumentation; CL = computational logic; VG = video games; KRRP = knowledge representation, reasoning, planning; RCRPVMASS = robotics, cooperation, real-time planning, vision, multiagent systems, strategy; LU = Language understanding; CS = common sense; RM = Robotics in Manufacturing; ARH = Adaptive Robot hands; MPLT = Manipulation planning based on learning techniques; DiHM = Dexterous in-hand manipulation; RGVELO = Robust grasping with various everyday life objects; SI = social interaction; SIn = social intelligence; EI = emotional intelligence; IR = inferential reasoning; CR = commonsense reasoning; VR = visual recognition; PN = planning and navigation; SG = Smart Grids; PG = Power Grids; PN = Power networks; PCU = physical commonsense understanding; NLI = natural language inference; RV = Robotic vision; RGM = Robotic Grasping and Manipulation; OD = Object Detection

6 of 15

Example EI: Winograd Schema Challenge (WSC)

Introduced by Levesque et al.

A recognising textual entailment task to test commonsense reasoning

(a replacement for the Turing Test?)

The city councilmen refused the women a permit because they [feared/advocated] violence

The trophy would not fit into the case because it was too [big/small]
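
The bracketed word pairs above are what make these items schemas: swapping the special word flips which noun the pronoun refers to. A minimal sketch of that structure follows, assuming a hypothetical `WinogradSchema` class and `accuracy` helper that are not part of the challenge itself. The referents recorded for the trophy example are the conventional reading and are not stated on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WinogradSchema:
    template: str      # sentence with a "{word}" slot for the special word
    pronoun: str       # the ambiguous pronoun
    candidates: tuple  # the two possible referents
    answers: dict      # special word -> intended referent (assumed reading)

    def instances(self):
        """Yield (sentence, intended referent) for each special word."""
        for word, referent in self.answers.items():
            yield self.template.format(word=word), referent


trophy = WinogradSchema(
    template="The trophy would not fit into the case because it was too {word}",
    pronoun="it",
    candidates=("the trophy", "the case"),
    answers={"big": "the trophy", "small": "the case"},
)


def accuracy(resolver, schemas):
    """Fraction of instances for which resolver(sentence, schema) is correct."""
    results = [
        resolver(sentence, schema) == referent
        for schema in schemas
        for sentence, referent in schema.instances()
    ]
    return sum(results) / len(results)


def always_first(sentence, schema):
    """Baseline that ignores the sentence and picks the first candidate."""
    return schema.candidates[0]
```

Because each schema contributes one instance per referent, a resolver that ignores the special word, such as `always_first`, gets exactly half of the instances right on this schema.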

7 of 15

Purpose	RESEARCH	RESEARCH	RESEARCH	RESEARCH	n	maj
Capability	CAPABILITY (specify)	CAPABILITY (specify)	TASK-PERFORMANCE (specify)	CAPABILITY (specify)	y	notmaj
Reference	ABSOLUTE	ABSOLUTE	ABSOLUTE	ABSOLUTE	n	maj
Coverage	BIASED (specify)	BIASED (specify)	BIASED (specify)	REPRESENTATIVE	y	notmaj
Specificity	CONTAMINATED	CONTAMINATED	SPECIFIC	CONTAMINATED	y	notmaj
Realism	REALISTIC	TOY	REAL-LIFE	REALISTIC	y	notmaj
Judgeability	AUTOMATED	AUTOMATED	AUTOMATED	AUTOMATED	n	maj
Containedness	FULLY-CONTAINED	FULLY-CONTAINED	FULLY-CONTAINED	FULLY-CONTAINED	n	maj
Reproducibility	EXACT	EXACT	EXACT	EXACT	n	maj
Reliability	RELIABLE	N/A	RELIABLE	RELIABLE	y	notmaj
Variation	FIXED	FIXED	PROCEDURAL	FIXED	y	notmaj
Adjustability	UNSTRUCTURED	UNSTRUCTURED	ADAPTIVE	UNSTRUCTURED	y	notmaj
Antecedents	CREATED	CREATED	CREATED	CREATED	n	maj
Ambition	LONG-TERM	LONG-TERM	LONG-TERM	LONG-TERM	n	maj
Partiality	IMPARTIAL	IMPARTIAL	IMPARTIAL	IMPARTIAL	n	maj
Objectivity	FULLY-INDEPENDENT	FULLY-INDEPENDENT	FULLY-INDEPENDENT	FULLY-INDEPENDENT	n	maj
Progression	STATIC	STATIC	STATIC	STATIC	n	maj
Autonomy	AUTONOMOUS	AUTONOMOUS	AUTONOMOUS	AUTONOMOUS	n	maj

8 of 15

Analysis of Rater Consistency

Facet agreements for all 23 EIs

9 of 15

Analysis of Results

10 of 15

Analysis of Results

11 of 15

Summary

Proposed a framework for categorising EIs
Initial analysis of 23 EIs
Framework can be applied fairly reliably, but some inter-rater disagreements remain

Framework could be used by:

EI designers
AI system developers
Multiple organisations

Potential future work

Apply the framework to further EIs (now 36)
Track the evolution of AI evaluation

12 of 15

Thank you!

13 of 15

Analysis of Results for 36 EIs

14 of 15

Analysis of Results for 36 EIs

15 of 15

Purpose	RESEARCH	RESEARCH	RESEARCH	RESEARCH	n	maj
Capability	CAPABILITY (specify)	CAPABILITY (specify)	TASK-PERFORMANCE (specify)	CAPABILITY (specify)	y	maj
Reference	ABSOLUTE	ABSOLUTE	ABSOLUTE	ABSOLUTE	n	maj
Coverage	BIASED (specify)	BIASED (specify)	BIASED (specify)	REPRESENTATIVE	y	maj
Specificity	CONTAMINATED	CONTAMINATED	SPECIFIC	CONTAMINATED	y	maj
Realism	REALISTIC	TOY	REAL-LIFE	REALISTIC	y	notmaj
Judgeability	AUTOMATED	AUTOMATED	AUTOMATED	AUTOMATED	n	maj
Containedness	FULLY-CONTAINED	FULLY-CONTAINED	FULLY-CONTAINED	FULLY-CONTAINED	n	maj
Reproducibility	EXACT	EXACT	EXACT	EXACT	n	maj
Reliability	RELIABLE	N/A	RELIABLE	RELIABLE	y	maj
Variation	FIXED	FIXED	PROCEDURAL	FIXED	y	maj
Adjustability	UNSTRUCTURED	UNSTRUCTURED	ADAPTIVE	UNSTRUCTURED	y	maj
Antecedents	CREATED	CREATED	CREATED	CREATED	n	maj
Ambition	LONG-TERM	LONG-TERM	LONG-TERM	LONG-TERM	n	maj
Partiality	IMPARTIAL	IMPARTIAL	IMPARTIAL	IMPARTIAL	n	maj
Objectivity	FULLY-INDEPENDENT	FULLY-INDEPENDENT	FULLY-INDEPENDENT	FULLY-INDEPENDENT	n	maj
Progression	STATIC	STATIC	STATIC	STATIC	n	maj
Autonomy	AUTONOMOUS	AUTONOMOUS	AUTONOMOUS	AUTONOMOUS	n	maj
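
The last two columns of the table above can be derived from the four value columns. The sketch below rests on an assumption about the unlabelled columns: that columns 2 to 5 hold one value per rater, that the `y`/`n` column flags any disagreement, and that `maj` means a strict majority of raters chose the same value. The function name `agreement_flags` is hypothetical.

```python
from collections import Counter


def agreement_flags(values):
    """Return (disagreement, majority) flags for one facet's rater values.

    Assumed semantics: 'y' if the raters did not all agree, and 'maj'
    if more than half of them chose the same value.
    """
    counts = Counter(values)
    top_count = counts.most_common(1)[0][1]
    disagreement = "y" if len(counts) > 1 else "n"
    majority = "maj" if top_count > len(values) / 2 else "notmaj"
    return disagreement, majority


# Rows taken from the table above.
rows = {
    "Purpose": ["RESEARCH", "RESEARCH", "RESEARCH", "RESEARCH"],
    "Capability": ["CAPABILITY (specify)", "CAPABILITY (specify)",
                   "TASK-PERFORMANCE (specify)", "CAPABILITY (specify)"],
    "Realism": ["REALISTIC", "TOY", "REAL-LIFE", "REALISTIC"],
}

for facet, values in rows.items():
    print(facet, *agreement_flags(values))
```

Under these assumptions a 3-of-4 split counts as `maj`, while the 2-of-4 split on Realism does not, which matches the flags shown in the table above.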