1 of 15

A Framework for Categorising AI Evaluation Instruments

Anthony G Cohn1

José Hernández-Orallo2

Julius Sechang Mboli3

Yael Moros-Daval2

Zhiliang Xiang4

Lexin Zhou2

EBeM Workshop 2022 @IJCAI/ECAI 2022 24th July, Vienna

1 School of Computing, University of Leeds, UK; and the Turing Institute, UK
2 VRAIN, Universitat Politècnica de València, Spain
3 Faculty of Engineering and Informatics, University of Bradford, UK
4 IROHMS, School of Computer Science and Informatics, Cardiff University, UK

2 of 15

The background

  • Approached by the OECD for their “Artificial Intelligence and the Future of Skills (AIFS) Project”
    • Understand the potential as well as the limits of AI capabilities at a detailed task level to describe more precisely how humans and AI are complementary.

  • Subsequently invited and funded to work on a framework to analyse the landscape of competitions and benchmarks
    • Characterise evaluation instruments (EIs) to capture their strengths and weaknesses
      • Not based on domains but on the values of a set of facets.

3 of 15

18 Facets to characterise EIs

Validity group

  • Does it measure what we want to measure?
  • Purpose, Capability, Reference, Coverage, Specificity, Realism

Consistency group

  • Does it measure it effectively and verifiably?
  • Judgeability, Containedness, Reproducibility, Reliability, Variation, Adjustability

Fairness group

  • Does it treat all test takers equally?
  • Antecedents, Ambition, Partiality, Objectivity, Progression, Autonomy

Why these 18? It was a long process: the facets and their possible values were refined iteratively as the first EIs were evaluated.

AERA, APA, NCME, et al., Standards for educational and psychological testing

4 of 15

Facet Values (Validity, Consistency, Fairness)

  1. Purpose [RESEARCH, CONFORMITY, OTHER (specify)]
  2. Capability [TASK-PERFORMANCE (specify), CAPABILITY (specify)]
  3. Reference [ABSOLUTE, RELATIVE (specify)]
  4. Coverage [BIASED (specify), REPRESENTATIVE]
  5. Realism [TOY, REALISTIC, REAL-LIFE]
  6. Specificity [SPECIFIC, CONTAMINATED]
  7. Judgeability [MANUAL, AUTOMATED]
  8. Containedness [FULLY-CONTAINED, PARTIAL-INFERENCE (specify), NOT-CONTAINED (specify)]
  9. Reproducibility [NON-REPRODUCIBLE, STOCHASTIC, EXACT]
  10. Reliability [RELIABLE, NON-RELIABLE, N/A]
  11. Variation [FIXED, ALTERED, PROCEDURAL]
  12. Adjustability [UNSTRUCTURED, ABLATABLE, ADAPTIVE]
  13. Antecedents [CREATED, RETROFITTED (specify)]
  14. Ambition [SHORT-TERM, LONG-TERM]
  15. Partiality [PARTIAL (specify), IMPARTIAL]
  16. Objectivity [LOOSE, CUSTOMISED, FULLY-INDEPENDENT]
  17. Progression [STATIC, DEVELOPMENTAL]
  18. Autonomy [AUTONOMOUS, COUPLED (specify), COMPONENT]
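
The value lists above lend themselves to a machine-checkable schema. The following is only a minimal sketch, not part of the framework itself: it transcribes a handful of the facets, and the names `FACET_VALUES` and `check_rating` are hypothetical. A value annotated with "(specify)" is assumed to require a free-text detail.

```python
from typing import List, Optional

# Allowed values for a few facets, transcribed from the list above.
# True marks a value that the list annotates with "(specify)".
# (Hypothetical structure, for illustration only.)
FACET_VALUES = {
    "Reference": {"ABSOLUTE": False, "RELATIVE": True},
    "Coverage": {"BIASED": True, "REPRESENTATIVE": False},
    "Judgeability": {"MANUAL": False, "AUTOMATED": False},
    "Reproducibility": {"NON-REPRODUCIBLE": False, "STOCHASTIC": False, "EXACT": False},
    "Variation": {"FIXED": False, "ALTERED": False, "PROCEDURAL": False},
    "Progression": {"STATIC": False, "DEVELOPMENTAL": False},
}


def check_rating(facet: str, value: str, detail: Optional[str] = None) -> List[str]:
    """Return a list of problems with one facet rating (empty if it is fine)."""
    if facet not in FACET_VALUES:
        return [f"unknown facet: {facet}"]
    allowed = FACET_VALUES[facet]
    if value not in allowed:
        return [f"{facet}: {value!r} is not one of {sorted(allowed)}"]
    if allowed[value] and not detail:
        return [f"{facet}: {value} needs a free-text specification"]
    return []


print(check_rating("Reference", "ABSOLUTE"))  # []
print(check_rating("Coverage", "BIASED"))     # one problem: specification missing
```

Keeping the "(specify)" requirement in the schema makes incomplete ratings visible before raters are compared.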

5 of 15

23 Selected EIs: some examples

| Acronym | Type | Domain | Aim | Year |
|---|---|---|---|---|
| WSC [7] | test, benchmark & competition | LU, CS, reasoning | Specifically targeted at evaluating common sense reasoning, as an alternative to the Turing test, arguing conceptual and practical advantages | 2016 |
| ALE [8] | benchmark | VG; navigation; perception | The original goal was to evaluate “general, domain-independent AI technology” using a diversity of video games, although what it measures more specifically is unclear | 2013 |
| GLUE [9] | benchmark | LU; text retrieval; world knowledge | The goal of GLUE and SuperGLUE (an improved/modified version of GLUE) is to measure the performance (e.g. accuracy, F1-score) of an AI system on natural language understanding tasks (Single-Sentence Tasks, Similarity and Paraphrase Tasks, and Inference Tasks) in English | 2018 |
| AIBIRDS [12] | competition | CV, VG, KRRP | Measures the planning capability of an agent in a large action space, without knowledge of the physical parameters of objects, in situations taken from Angry Birds | 2010 |
| … | … | … | … | … |

6 of 15

Example EI: Winograd Schema Challenge (WSC)

Introduced by Levesque et al.

A recognising textual entailment task to test commonsense reasoning

(a replacement for the Turing Test?)

The city councilmen refused the women a permit because they [feared/advocated] violence.

The trophy would not fit into the case because it was too [big/small].
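
Each schema above is a sentence pair that differs in one special word, and that word flips the referent of the pronoun. As a rough illustration, the two examples can be encoded as data and scored. This is a sketch under stated assumptions, not the official WSC format or scorer: `WinogradSchema`, `accuracy` and `first_candidate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Tuple


@dataclass
class WinogradSchema:
    template: str                 # sentence with a {word} slot for the special word
    pronoun: str                  # the ambiguous pronoun to resolve
    candidates: Tuple[str, str]   # the two possible referents
    answers: Dict[str, str]       # special word -> correct referent

    def variants(self) -> Iterator[Tuple[str, str]]:
        """Yield (sentence, correct referent) for each special word."""
        for word, referent in self.answers.items():
            yield self.template.format(word=word), referent


SCHEMAS = [
    WinogradSchema(
        template="The city councilmen refused the women a permit because they {word} violence.",
        pronoun="they",
        candidates=("the city councilmen", "the women"),
        answers={"feared": "the city councilmen", "advocated": "the women"},
    ),
    WinogradSchema(
        template="The trophy would not fit into the case because it was too {word}.",
        pronoun="it",
        candidates=("the trophy", "the case"),
        answers={"big": "the trophy", "small": "the case"},
    ),
]

# A resolver maps (sentence, pronoun, candidates) to the chosen referent.
Resolver = Callable[[str, str, Tuple[str, str]], str]


def accuracy(schemas: List[WinogradSchema], resolver: Resolver) -> float:
    """Fraction of sentence variants for which the resolver picks the right referent."""
    results = [
        resolver(sentence, schema.pronoun, schema.candidates) == referent
        for schema in schemas
        for sentence, referent in schema.variants()
    ]
    return sum(results) / len(results)


def first_candidate(sentence: str, pronoun: str, candidates: Tuple[str, str]) -> str:
    """Trivial baseline: always pick the first candidate."""
    return candidates[0]


print(accuracy(SCHEMAS, first_candidate))  # 0.5
```

Because the two variants of a schema have opposite answers, any resolver that ignores the special word scores exactly 50% on it.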

7 of 15

| Facet | Rating 1 | Rating 2 | Rating 3 | Rating 4 | Disagreement | Majority |
|---|---|---|---|---|---|---|
| Purpose | RESEARCH | RESEARCH | RESEARCH | RESEARCH | n | maj |
| Capability | CAPABILITY (specify) | CAPABILITY (specify) | TASK-PERFORMANCE (specify) | CAPABILITY (specify) | y | notmaj |
| Reference | ABSOLUTE | ABSOLUTE | ABSOLUTE | ABSOLUTE | n | maj |
| Coverage | BIASED (specify) | BIASED (specify) | BIASED (specify) | REPRESENTATIVE | y | notmaj |
| Specificity | CONTAMINATED | CONTAMINATED | SPECIFIC | CONTAMINATED | y | notmaj |
| Realism | REALISTIC | TOY | REAL-LIFE | REALISTIC | y | notmaj |
| Judgeability | AUTOMATED | AUTOMATED | AUTOMATED | AUTOMATED | n | maj |
| Containedness | FULLY-CONTAINED | FULLY-CONTAINED | FULLY-CONTAINED | FULLY-CONTAINED | n | maj |
| Reproducibility | EXACT | EXACT | EXACT | EXACT | n | maj |
| Reliability | RELIABLE | N/A | RELIABLE | RELIABLE | y | notmaj |
| Variation | FIXED | FIXED | PROCEDURAL | FIXED | y | notmaj |
| Adjustability | UNSTRUCTURED | UNSTRUCTURED | ADAPTIVE | UNSTRUCTURED | y | notmaj |
| Antecedents | CREATED | CREATED | CREATED | CREATED | n | maj |
| Ambition | LONG-TERM | LONG-TERM | LONG-TERM | LONG-TERM | n | maj |
| Partiality | IMPARTIAL | IMPARTIAL | IMPARTIAL | IMPARTIAL | n | maj |
| Objectivity | FULLY-INDEPENDENT | FULLY-INDEPENDENT | FULLY-INDEPENDENT | FULLY-INDEPENDENT | n | maj |
| Progression | STATIC | STATIC | STATIC | STATIC | n | maj |
| Autonomy | AUTONOMOUS | AUTONOMOUS | AUTONOMOUS | AUTONOMOUS | n | maj |

8 of 15

Analysis of Rater Consistency

Facet agreement for all 23 EIs

9 of 15

Analysis of Results

10 of 15

Analysis of Results

11 of 15

Summary

Proposed a framework for categorising EIs

  • Initial analysis of 23 EIs
  • Framework can be applied fairly reliably, but still some inter-rater disagreements
  • Framework could be used by:
    • EI designers
    • AI system developers
    • Multiple organisations

Potential future work

  • Apply the framework to further EIs (now 36)
  • Track the evolution of AI evaluation

12 of 15

Thank you!

13 of 15

Analysis of Results for 36 EIs

14 of 15

Analysis of Results for 36 EIs

15 of 15

| Facet | Rating 1 | Rating 2 | Rating 3 | Rating 4 | Disagreement | Majority |
|---|---|---|---|---|---|---|
| Purpose | RESEARCH | RESEARCH | RESEARCH | RESEARCH | n | maj |
| Capability | CAPABILITY (specify) | CAPABILITY (specify) | TASK-PERFORMANCE (specify) | CAPABILITY (specify) | y | maj |
| Reference | ABSOLUTE | ABSOLUTE | ABSOLUTE | ABSOLUTE | n | maj |
| Coverage | BIASED (specify) | BIASED (specify) | BIASED (specify) | REPRESENTATIVE | y | maj |
| Specificity | CONTAMINATED | CONTAMINATED | SPECIFIC | CONTAMINATED | y | maj |
| Realism | REALISTIC | TOY | REAL-LIFE | REALISTIC | y | notmaj |
| Judgeability | AUTOMATED | AUTOMATED | AUTOMATED | AUTOMATED | n | maj |
| Containedness | FULLY-CONTAINED | FULLY-CONTAINED | FULLY-CONTAINED | FULLY-CONTAINED | n | maj |
| Reproducibility | EXACT | EXACT | EXACT | EXACT | n | maj |
| Reliability | RELIABLE | N/A | RELIABLE | RELIABLE | y | maj |
| Variation | FIXED | FIXED | PROCEDURAL | FIXED | y | maj |
| Adjustability | UNSTRUCTURED | UNSTRUCTURED | ADAPTIVE | UNSTRUCTURED | y | maj |
| Antecedents | CREATED | CREATED | CREATED | CREATED | n | maj |
| Ambition | LONG-TERM | LONG-TERM | LONG-TERM | LONG-TERM | n | maj |
| Partiality | IMPARTIAL | IMPARTIAL | IMPARTIAL | IMPARTIAL | n | maj |
| Objectivity | FULLY-INDEPENDENT | FULLY-INDEPENDENT | FULLY-INDEPENDENT | FULLY-INDEPENDENT | n | maj |
| Progression | STATIC | STATIC | STATIC | STATIC | n | maj |
| Autonomy | AUTONOMOUS | AUTONOMOUS | AUTONOMOUS | AUTONOMOUS | n | maj |
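
The last two entries for each facet above can be derived mechanically from the four ratings. The sketch below rests on assumptions the slides do not state: that the four values per facet are independent ratings of one EI, that "y"/"n" flags whether any ratings differ, and that "maj" means more than half of the ratings agree. These assumptions reproduce the rows above. The function name `summarise` is hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def summarise(ratings: Sequence[str]) -> Tuple[str, str]:
    """Return (disagreement flag, majority flag) for one facet's ratings."""
    counts = Counter(ratings)
    _, top_count = counts.most_common(1)[0]
    # "y" if at least two distinct values were given, else "n"
    disagreement = "y" if len(counts) > 1 else "n"
    # "maj" only if the most common value has strictly more than half the votes
    majority = "maj" if top_count > len(ratings) / 2 else "notmaj"
    return disagreement, majority


print(summarise(["EXACT", "EXACT", "EXACT", "EXACT"]))            # ('n', 'maj')
print(summarise(["FIXED", "FIXED", "PROCEDURAL", "FIXED"]))       # ('y', 'maj')
print(summarise(["REALISTIC", "TOY", "REAL-LIFE", "REALISTIC"]))  # ('y', 'notmaj')
```

Under the strict-majority assumption, Realism is the only facet above with no majority, since its most common value covers just two of four ratings.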