1 of 12

FOUNDATIONS OF ARTIFICIAL INTELLIGENCE

Towards a working understanding of AI and its relationship to evaluation

[Image: an oil painting by Matisse of a humanoid robot playing chess, generated by AI using DALL-E]

2 of 12

Machine Learning Systems

At the core of AI:

  • Data
  • Pattern detection
  • Extensible model

3 of 12

How Machine Learning Systems are Trained

Machine learning systems may be supervised (trained on labeled examples) or unsupervised (left to find structure in unlabeled data)

Models can only be as strong as the data on which they are trained

The patterns these models detect and rely on are not always accessible to humans
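The supervised/unsupervised distinction can be sketched in a few lines of plain Python. The 1-D points and labels below are invented for illustration: a nearest-neighbor rule predicts from known labels (supervised), while a two-centroid clustering routine groups the same points with no labels at all (unsupervised).

```python
def nearest_neighbor_predict(train, label_of, x):
    """Supervised: labels are known; predict the label of the closest training point."""
    closest = min(train, key=lambda t: abs(t - x))
    return label_of[closest]

def two_means_cluster(points, iterations=10):
    """Unsupervised: no labels; group points around two centroids."""
    a, b = min(points), max(points)          # crude initial centroids
    for _ in range(iterations):
        group_a = [p for p in points if abs(p - a) <= abs(p - b)]
        group_b = [p for p in points if abs(p - a) > abs(p - b)]
        a = sum(group_a) / len(group_a)      # move centroids to group means
        b = sum(group_b) / len(group_b)
    return group_a, group_b

train = [1.0, 1.2, 8.0, 8.3]
labels = {1.0: "low", 1.2: "low", 8.0: "high", 8.3: "high"}
print(nearest_neighbor_predict(train, labels, 1.1))  # → low
print(two_means_cluster(train))                      # → ([1.0, 1.2], [8.0, 8.3])
```

Both routines detect the same pattern (two clusters of points); only the supervised one can name it, because only it was given labels.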

4 of 12

5 of 12

6 of 12

Random Forests: an example

https://datahacker.rs/012-machine-learning-introduction-to-random-forest/
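To make the linked example concrete, here is a toy random forest in plain Python: each "tree" is a one-split decision stump trained on a bootstrap (sampled-with-replacement) copy of the data, and the forest predicts by majority vote. The 2-D points, labels, and tree count are invented; real implementations grow full trees and randomly subsample features as well.

```python
import random
from collections import Counter

def train_stump(sample):
    """Find the single feature/threshold split with the best accuracy on the sample."""
    best = None
    for f in range(len(sample[0][0])):                 # try each feature index
        for x0, _ in sample:                           # candidate thresholds from the data
            t = x0[f]
            left = [y for x, y in sample if x[f] <= t]
            right = [y for x, y in sample if x[f] > t]
            if not left or not right:
                continue
            l_lab = Counter(left).most_common(1)[0][0]   # majority label on each side
            r_lab = Counter(right).most_common(1)[0][0]
            acc = sum((l_lab if x[f] <= t else r_lab) == y for x, y in sample)
            if best is None or acc > best[0]:
                best = (acc, f, t, l_lab, r_lab)
    if best is None:                                   # degenerate bootstrap: majority label
        maj = Counter(y for _, y in sample).most_common(1)[0][0]
        return lambda x: maj
    _, f, t, l_lab, r_lab = best
    return lambda x: l_lab if x[f] <= t else r_lab

def train_forest(data, n_trees=9, seed=0):
    """Train each stump on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(data) for _ in data]) for _ in range(n_trees)]

def forest_predict(forest, x):
    """Majority vote across the trees."""
    return Counter(tree(x) for tree in forest).most_common(1)[0][0]

data = [((1, 5), "red"), ((2, 4), "red"), ((1, 4), "red"),
        ((8, 1), "blue"), ((9, 2), "blue"), ((8, 2), "blue")]
forest = train_forest(data)
print(forest_predict(forest, (1, 4.5)))   # votes for "red"
```

The bootstrap sampling is what makes the trees differ from one another; averaging their votes is what makes the forest more robust than any single tree.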

7 of 12

Example ML System Types

Natural language processing

  • Used to convert human language into machine-readable signals

Neural networks/deep learning

  • Approximates "cognition" by looking for patterns and yielding outputs that match those patterns
  • Neural networks and humans may come to the same conclusions but using different rationales

Generative AI

  • Creates novel content in response to a prompt
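As a concrete illustration of the first system type, converting human language into machine-readable signals, here is a minimal bag-of-words vectorizer in plain Python. The tiny corpus is invented, and production NLP systems use far richer representations (subword tokens, embeddings), but the core move is the same: text in, numbers out.

```python
def build_vocabulary(corpus):
    """Map each distinct lowercase word to a fixed column index."""
    words = sorted({w for doc in corpus for w in doc.lower().split()})
    return {w: i for i, w in enumerate(words)}

def vectorize(doc, vocab):
    """Count how often each vocabulary word appears in the document."""
    vec = [0] * len(vocab)
    for w in doc.lower().split():
        if w in vocab:
            vec[vocab[w]] += 1
    return vec

corpus = ["the model detects patterns", "patterns in the data"]
vocab = build_vocabulary(corpus)
print(vectorize("the data the model", vocab))  # → [1, 0, 0, 1, 0, 2]
```

Once every document is a vector of counts, downstream models can compare, cluster, or classify texts with ordinary arithmetic.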

[Image: angle view of a circuit shaped like a brain, recommended by AI based on text in this slide]

8 of 12

AI AS BOTH SUBJECT AND OBJECT IN EVALUATION

Potential uses for AI in an evaluation space

vs

Evaluating work produced by AI

[Image: an abstract painting of artificial intelligence, generated by AI using DALL-E]

9 of 12

Evaluation using AI

Non-generative AI

  • Summarization
      • https://quillbot.com/summarize
  • Image analysis
      • https://cloud.google.com/vision
  • Text analysis

Generative AI

  • Code generation and translation
  • Annotated bibliography and background narrative
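As a rough sketch of what a summarization tool like the one linked above does under the hood, here is a frequency-based extractive summarizer in plain Python: score each sentence by how often its words appear in the whole text, and keep the top-scoring sentences. The sample text and stopword list are invented, and real services use far more sophisticated models.

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "are"}

def summarize(text, n_sentences=1):
    """Keep the n highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    words = [w for s in sentences for w in s.lower().split() if w not in STOPWORDS]
    freq = Counter(words)                     # word frequencies across the whole text
    def score(s):
        return sum(freq[w] for w in s.lower().split() if w not in STOPWORDS)
    kept = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return ". ".join(s for s in sentences if s in kept) + "."

text = ("Evaluators collect data from many sources. "
        "Data quality shapes every evaluation. "
        "Good evaluation depends on good data.")
print(summarize(text, n_sentences=1))  # → Good evaluation depends on good data.
```

Because this approach only extracts existing sentences, it is non-generative: unlike generative AI, it can never produce a claim that was not already in the source text.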

[Image: computer script on a screen, recommended by AI based on text in this slide]

10 of 12

Evaluation of AI

Qualitative

    • Explanation Goodness
    • User satisfaction
    • User Curiosity/Attention Engagement
    • User Trust/Reliance
    • User understanding
    • Productivity of use
    • System Controllability/Interactivity

Mixed Methods

    • User trust
    • User preference
    • User confidence
    • Level of disagreement

    • Galvanic Skin Response
    • Blood Volume Pulse
    • Number of operations depending on model size
    • Interaction strength

Quantitative

    • D, the performance difference between models and XAI method execution
    • R, the number of rules in the model explanation (rule-based)
    • F, the number of features used to construct the explanation
    • S, the stability of the models' explanations

Hsiao, J.H.-W.; Ngai, H.H.T.; Qiu, L.; Yang, Y.; Cao, C.C. Roadmap of Designing Cognitive Metrics for Explainable Artificial Intelligence (XAI). arXiv 2021, arXiv:2108.01737

Lin, Y.-S.; Lee, W.-C.; Berkay Celik, Z. What Do You See? Evaluation of Explainable Artificial Intelligence (XAI) Interpretability through Neural Backdoors. arXiv 2020, arXiv:2009.10639

Rosenfeld, A. Better Metrics for Evaluating Explainable Artificial Intelligence. In Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, Virtual Event, UK, 3–7 May 2021

Measures in this field are emergent; consistent evaluation metrics have yet to crystallize
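As one illustration of how such a metric might be operationalized, the stability measure S listed above could be approximated as the average Jaccard overlap between the feature sets an explainer returns on repeated runs. This is a sketch under invented data, not a formula taken from the cited papers.

```python
def jaccard(a, b):
    """Overlap between two feature sets: 1.0 = identical, 0.0 = disjoint."""
    return len(a & b) / len(a | b)

def stability(runs):
    """Average pairwise Jaccard similarity across explanation runs."""
    pairs = [(runs[i], runs[j])
             for i in range(len(runs)) for j in range(i + 1, len(runs))]
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# three hypothetical runs of the same explainer on the same input
runs = [{"age", "income", "tenure"},
        {"age", "income", "tenure"},
        {"age", "income", "region"}]
print(round(stability(runs), 3))  # → 0.667
```

A score near 1.0 would suggest the explainer surfaces the same features each time; a low score would be a warning sign for evaluators relying on any single explanation.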

11 of 12

Limitations to consider

  • Algorithmic discrimination
  • Validity and reliability concerns
  • Intellectual property
  • Labor concerns
  • Barriers to access

12 of 12

A Meta Conclusion:

What AI thinks evaluators need to know about AI