Framing Issues of Measurement in AI
Luca Mari
Università Cattolica
Milan Network of Logic and Philosophy of Science
Università degli Studi di Milano, 15 January 2026
The context
Framing the analysis
The relation between AI and Measurement Science (MS) is twofold:
We focus here on the latter, and specifically on the evaluation of language model (LM) behavior:
Today we say Artificial Intelligence (AI) and mean Machine Learning (ML)
Framing the analysis
About the evaluation of LM behavior, two main issues need to be considered:
In an evaluation-oriented framework, three classes of ML systems can be identified
Class A
Well-known statistical / data mining techniques: distinction between features and targets; training vs test set split; bias vs variance (underfitting vs overfitting); …
Well-known statistical / data mining quality parameters: precision, recall, accuracy, …
Traditional ML systems, for classification or regression tasks
Two main measurement-related problems:
Tools: k-nearest neighbors, logistic regression, decision trees, neural networks, …
Examples: handwritten character recognition, antispam filtering, recommendation systems, sentiment analysis, time series forecasting, …
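As an illustration of the quality parameters mentioned above (precision, recall, accuracy), a minimal sketch in plain Python for a binary classifier; the labels below are made up for the example:

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, and accuracy for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = correct / len(pairs)
    return precision, recall, accuracy

# e.g. antispam filtering: 1 = spam, 0 = ham
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
p, r, a = classification_metrics(y_true, y_pred)  # p = 0.75, r = 0.75, a = 0.75
```

These are exactly the "well-known quality parameters" of Class A: given a labeled test set, they are computable by anyone, with no disagreement about the result.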
Class B
Tools: RNNs, Transformers
Examples: translation, summarization, …
All common benchmarks for language models assume single Q&A
– MMLU (Massive Multitask Language Understanding): performance on a wide range of tasks (“the SAT for chatbots”)
– HellaSwag: commonsense reasoning
– PIQA (Physical Interaction Question Answering): comprehension of physical interactions
– WinoGrande: common sense reasoning, complex pronoun disambiguation
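Benchmarks of this kind reduce evaluation to a Class A-style metric: each item is a question with fixed options and one marked answer, and the score is plain accuracy over items. A hypothetical sketch (the items and the toy "model" below are invented for illustration):

```python
# Hypothetical sketch of how single Q&A multiple-choice benchmarks
# (MMLU, HellaSwag, PIQA, WinoGrande, ...) are typically scored:
# the model picks one option per item; the score is plain accuracy.
items = [
    {"question": "2 + 2 = ?", "options": ["3", "4", "5"], "answer": 1},
    {"question": "Capital of Italy?", "options": ["Rome", "Milan"], "answer": 0},
]

def benchmark_accuracy(items, model):
    """model(question, options) -> index of the option the model chooses."""
    correct = sum(1 for it in items
                  if model(it["question"], it["options"]) == it["answer"])
    return correct / len(items)

def always_first(question, options):
    """A toy 'model' that always picks the first option."""
    return 0

score = benchmark_accuracy(items, always_first)  # 0.5 on this toy set
```

The single Q&A assumption is visible in the code: each item is scored in isolation, with no conversational context carried across items.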
Single Q&A GenAI systems, for context-free tasks
In addition to what was mentioned for Class A systems, the key measurement-related problem:
there may be no intersubjective criteria to assess the quality of inference results
(see the case of BLEU: “the closer a machine translation is to a professional human translation, the better it is”)
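BLEU operationalizes that criterion as n-gram overlap with reference translations. A minimal sketch of its core ingredient, clipped unigram precision (the full metric combines n-gram precisions up to 4-grams with a brevity penalty):

```python
from collections import Counter

def clipped_unigram_precision(candidate, reference):
    """Core ingredient of BLEU: the fraction of candidate words that appear
    in the reference, clipping each word's count at its reference count."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    clipped = sum(min(n, ref[w]) for w, n in cand.items())
    return clipped / sum(cand.values())

p = clipped_unigram_precision("the cat sat on the mat",
                              "the cat is on the mat")
# 5 of the 6 candidate words are matched ("sat" is not): p = 5/6
```

Note how the intersubjectivity problem is only displaced, not solved: the number is objective once the reference is fixed, but the choice of the "professional human translation" used as reference remains a judgment.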
Class C
Conversational GenAI systems, for context-sensitive tasks
Tools: Transformers
Examples: as for Class B, plus conversations
We are not aware of any benchmark / metric specifically devoted to context-sensitive tasks
(the LMArena Leaderboard (https://lmarena.ai/leaderboard) is based on direct pairwise comparisons, scored with the Elo rating system)
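The Elo system, borrowed from chess, updates two ratings after each pairwise comparison; a minimal sketch of the standard update rule (the starting ratings and K-factor below are illustrative, not LMArena's actual parameters):

```python
def elo_update(r_a, r_b, score_a, k=32):
    """Standard Elo update after one pairwise comparison.
    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# two models start at 1000; model A wins one human preference vote
r_a, r_b = elo_update(1000, 1000, 1.0)  # A gains 16 points, B loses 16
```

This makes the measurement-theoretic point concrete: Elo aggregates human preferences into an ordinal ranking, but the "measurand" is whatever the voters happen to prefer, not an intersubjectively defined quality.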
In addition to what was mentioned for Class A and B systems, the key measurement-related problem:
there may be no intersubjective criteria to assess the quality of inference results
In summary
Classes of systems and their measurement-related problems:
– A: traditional ML systems: solved or well-known
– B: single Q&A GenAI systems: partially solved, hard
– C: conversational GenAI systems: unsolved, very hard
Open issues / Why this is important
The evaluation of the quality of behavior of Class C systems (chatbots…), as a context-sensitive task, is still an open issue (in particular: how to avoid the “learn to test” effect?)
Moreover:
How to evaluate the quality of behavior of these systems is still an open issue
Finally, the training process of chatbots today includes not only self-supervised learning and supervised fine-tuning (both token-oriented) but also reinforcement learning (task-oriented): in particular, the version with human feedback (RLHF) is unavoidably ideological, and therefore very hard to evaluate objectively (see the case of the OpenAI Model Spec: https://model-spec.openai.com)
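Where the ideological component enters is easy to locate: RLHF typically starts from a reward model trained on human preference pairs with a Bradley-Terry loss. A sketch of that loss (an illustration of the standard technique, not any lab's actual implementation):

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry negative log-likelihood used to train reward models
    on human preference pairs: the loss is small when the answer humans
    marked as 'chosen' scores higher than the 'rejected' one."""
    margin = reward_chosen - reward_rejected
    return -math.log(1 / (1 + math.exp(-margin)))

# the annotators' (value-laden) judgments enter here: whichever answer
# humans prefer becomes 'chosen', and the reward model learns to agree
loss_agree = preference_loss(2.0, 0.5)     # model agrees with humans: low loss
loss_disagree = preference_loss(0.5, 2.0)  # model disagrees: high loss
```

The reward model, and hence the chatbot, is optimized to reproduce the annotators' preferences, which is why objective evaluation of the result is so hard: the "ground truth" is itself a normative choice.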
Thanks for your attention