EVALUATION IN AI
Rishi Bommasani
Deep Learning: Classics and Trends
ML Collective
Societal Impact of Foundation Models
Bommasani 2
TRANSPARENCY
CONCEPTS
CHANGE
Outline
LMs are important
Adept, AI21, Anthropic, Character, Cohere, Hugging Face, Inflection, OpenAI, …
Yet we don’t understand them
Benchmarking
Benchmarks orient AI. They set priorities and codify values.
Benchmarks are mechanisms for change.
"proper evaluation is a complex and challenging business"
- Karen Spärck Jones (ACL Lifetime Achievement Award, 2005)
Spärck Jones and Galliers (1995), Liberman (2010), Ethayarajh and Jurafsky (2020), Bowman and Dahl (2021), Raji et al. (2021), Birhane et al. (2022), Bommasani (2022) inter alia
Language model:
Black box: no assumptions about how it is built, etc.
Inputs: text
Outputs: text with probabilities (likelihood)
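The black-box view above can be sketched as a small interface. This is a minimal illustration, not HELM's actual API: the names `Completion`, `LanguageModel`, `EchoModel`, and `likelihood` are hypothetical, and the toy model's fixed probability of 0.5 is made up for the example.

```python
import math
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Completion:
    """Hypothetical container: output text plus its log-probability."""
    text: str
    logprob: float  # natural-log probability the model assigns to `text`


class LanguageModel(Protocol):
    """Black box: text in, text with a probability out. Nothing else is assumed."""

    def complete(self, prompt: str) -> Completion: ...


class EchoModel:
    """Toy stand-in that repeats the last word of the prompt with probability 0.5."""

    def complete(self, prompt: str) -> Completion:
        words = prompt.split()
        text = words[-1] if words else ""
        return Completion(text=text, logprob=math.log(0.5))


def likelihood(model: LanguageModel, prompt: str) -> float:
    """Convert the returned log-probability back into a probability."""
    return math.exp(model.complete(prompt).logprob)
```

Because the evaluation code depends only on `complete`, any model that returns text with a probability could be dropped in without changes.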
HELM design principles
Principle 1: Broad coverage
First taxonomize, then select
Principle 2: Multi-metric
Measure all metrics simultaneously to expose relationships/tradeoffs
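As a rough illustration of the multi-metric idea, the sketch below scores the same set of model outputs on several metrics at once, so tradeoffs show up side by side. The metric names (`accuracy`, `toxicity_rate`, `mean_seconds`), the record fields, and the two models' data are all invented for the example; they are not taken from HELM.

```python
from statistics import mean


def evaluate(records):
    """Score one model's outputs on every metric in a single pass.

    Each record is a dict with hypothetical fields:
    'correct' (bool), 'toxic' (bool), and 'seconds' (float).
    """
    return {
        "accuracy": mean(float(r["correct"]) for r in records),
        "toxicity_rate": mean(float(r["toxic"]) for r in records),
        "mean_seconds": mean(r["seconds"] for r in records),
    }


# Made-up outputs for two hypothetical models on the same two instances.
runs = {
    "model_a": [
        {"correct": True, "toxic": False, "seconds": 2.0},
        {"correct": True, "toxic": True, "seconds": 4.0},
    ],
    "model_b": [
        {"correct": True, "toxic": False, "seconds": 1.0},
        {"correct": False, "toxic": False, "seconds": 1.0},
    ],
}

# One row per model, one column per metric.
report = {name: evaluate(records) for name, records in runs.items()}
```

In this toy data, `model_a` is more accurate but slower and more toxic than `model_b`, a tradeoff that reporting accuracy alone would hide.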
Benchmarking paradigms
Accuracy, 1 dataset
Accuracy, several datasets
Many metrics, many datasets
Principle 3: Standardization
Important considerations
Evaluation at scale
Costs
Primitives
Scenario
Adaptation
Metrics
Scenario Taxonomy
Task selection
Example scenario: RAFT
Desiderata/Metrics
Desiderata/Metric Selection
Example metric: Calibration
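The slide names calibration without giving a formula. One common way to quantify it, assumed here for illustration only, is expected calibration error (ECE) with equal-width confidence bins: group predictions by confidence, then average the gap between mean confidence and accuracy in each bin, weighted by bin size. The function name and the default of 10 bins are choices made for this sketch.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins over [0, 1] (an assumed formulation).

    confidences: predicted probabilities for the chosen answers.
    correct: booleans, True where the chosen answer was right.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1.0 for _, ok in members if ok) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of examples.
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece
```

A model that reports 90% confidence on four answers but gets only three right has a gap of 0.15 in that bin; a perfectly calibrated model scores 0.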
Scenarios x metrics
Targeted Evaluations
Models
Hardware (public models)
Adaptation via prompting
Accuracy vs X
Metric relationships
Accuracy as a function of time
Accuracy as a function of access
Variance across seeds
In-context examples
Multiple-choice method
Robustness (contrast sets)
Summarization
Disinformation
Next steps
Trust
Lots of bias metrics, little trust
Bommasani, Davis, Cardie (ACL 2020)
Testing Protocol to Accrue Trust
Face validity
Predictive validity
Hypothesis validity
Evaluation for Change
Policy
References
Reach out at nlprishi@stanford.edu
HALIE
Centering interaction
Interactive tasks
Coverage of design space
Social Dialogue
Interactive QA
Crossword Puzzles
Harms that arose in practice
Discussion