
EVALUATION IN AI

Rishi Bommasani

Deep Learning: Classics and Trends

ML Collective


Societal Impact of Foundation Models


TRANSPARENCY

CONCEPTS

CHANGE


Outline

  • Transparency
    • HELM
    • HALIE (Mina Lee et al., 2022)
  • Concepts
    • Trust (Bommasani and Liang, 2022)
    • Homogenization (Bommasani et al., 2022)
    • Emergence (Jason Wei et al., 2022)
  • Change
    • Power (Bommasani, 2022)
    • Policy (Bommasani et al., 2023, next week!)


LMs are important

  • Research
    • NLP: almost every NLP paper uses an LM
    • Other AI subareas
      • New trends (e.g. framing RL as “language modeling”)
    • Other disciplines (protein language models)
  • Deployment
    • Flagship products with billions of users (Bing, Google Translate, Microsoft Word)
    • Emerging tech (ChatGPT, Copilot)
    • Focus of some of the newest and most aggressively funded AI startups in history

Adept, AI21, Anthropic, Character, Cohere, Hugging Face, Inflection, OpenAI, …


Yet we don’t understand them


  • 300+ researchers, 40+ faculty
  • 10+ academic departments


Benchmarking

Benchmarks orient AI. They set priorities and codify values.

Benchmarks are mechanisms for change.

"proper evaluation is a complex and challenging business"

- Karen Spärck Jones (ACL Lifetime Achievement Award, 2005)

Spärck Jones and Galliers (1995), Liberman (2010), Ethayarajh and Jurafsky (2020), Bowman and Dahl (2021), Raji et al. (2021), Birhane et al. (2022), Bommasani (2022) inter alia


Language model:

  • Black box: no assumptions about how it is built
  • Inputs: text
  • Outputs: text with probabilities (likelihoods)
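A minimal sketch of this black-box contract (the class and method names are illustrative, not HELM's actual API): text goes in, text plus per-token likelihoods come out, and nothing about the model's internals is assumed.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class LMResult:
    """What one query to a black-box language model returns."""
    completion: str              # generated text
    token_logprobs: List[float]  # log-likelihood of each generated token


class LanguageModel(Protocol):
    """Black-box LM: no assumptions about architecture, training data, or size."""

    def query(self, prompt: str, max_tokens: int = 100) -> LMResult:
        """Map input text to output text with probabilities."""
        ...
```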


HELM design principles

  • Broad coverage and recognition of incompleteness
  • Multi-metric
  • Standardization


Principle 1: Broad coverage

First taxonomize, then select


Principle 2: Multi-metric

Measure all metrics simultaneously to expose relationships/tradeoffs
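As a toy illustration of what measuring multiple metrics in one run looks like (the metric suite and the query interface here are placeholders, not HELM's implementation), every scenario gets a full row of metric scores, so tradeoffs such as accuracy vs. calibration sit side by side:

```python
def exact_match(preds, refs):
    """Fraction of predictions that exactly match the reference output."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


# Placeholder metric suite; a full multi-metric benchmark would also score
# calibration, robustness, fairness, bias, toxicity, and efficiency here.
METRICS = {"accuracy": exact_match}


def evaluate(model, scenarios):
    """Produce a scenario -> metric -> score table from a single run."""
    results = {}
    for name, (inputs, refs) in scenarios.items():
        preds = [model.query(x).completion for x in inputs]
        results[name] = {metric: fn(preds, refs) for metric, fn in METRICS.items()}
    return results
```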


Benchmarking paradigms

  • Accuracy, 1 dataset
  • Accuracy, several datasets
  • Many metrics, many datasets


Principle 3: Standardization


Important considerations

  • How you adapt the LM (e.g. prompting, probing, fine-tuning) matters
  • Different LMs might work in different regimes
  • Hard to ensure models are not contaminated (exposed to test data/distribution)
  • We don’t evaluate all models, and models are constantly being built (e.g. ChatGPT)


Evaluation at scale

  • 40+ scenarios across 6 tasks (e.g. QA) + 7 targeted evals (e.g. reasoning)
  • 7 metrics (e.g. robustness, bias)
  • 30+ models (e.g. BLOOM) from 12 organizations (e.g. OpenAI)

Costs

  • 5k runs
  • 12B tokens, 17M queries
  • $38k USD for commercial APIs, 20k A100 GPU hours for public models


Primitives


Scenario


Adaptation


Metrics
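Reading these three slides together: a scenario fixes what the model is evaluated on, adaptation fixes how the general-purpose LM is turned into a system for that scenario (e.g. a prompting strategy), and metrics fix what is measured on the outputs. A rough sketch of the primitives, with field names chosen for illustration rather than taken from the HELM codebase:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Instance:
    input: str
    references: List[str]          # acceptable outputs


@dataclass
class Scenario:
    name: str                      # e.g. "RAFT"
    task: str                      # e.g. "text classification"
    instances: List[Instance]


@dataclass
class Adaptation:
    instructions: str              # natural-language description of the task
    num_in_context_examples: int   # few-shot examples prepended to each prompt


@dataclass
class Metric:
    name: str                      # e.g. "accuracy", "calibration", "toxicity"
    score: Callable[[str, List[str]], float]   # (prediction, references) -> value
```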


Scenario Taxonomy


Task selection

  • Monolingual (English)
  • Unimodal (text)
  • User-facing
    • Question Answering
    • Summarization
    • Information Retrieval
    • Sentiment Analysis
    • Toxicity Detection
    • Miscellaneous Text Classification


Example scenario: RAFT


Desiderata/Metrics


Desiderata/Metric Selection


Example metric: Calibration
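Calibration asks whether the model's stated confidence matches how often it is actually right. A common way to quantify this is expected calibration error (ECE); the sketch below is a generic binned ECE, offered as an illustration rather than the exact variant reported in HELM.

```python
import numpy as np


def expected_calibration_error(confidences, correct, num_bins=10):
    """Binned ECE: |accuracy - confidence| gap per bin, weighted by bin size.

    confidences: model probability assigned to its chosen answer, per example
    correct:     1.0 if that answer was right, else 0.0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece
```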


Scenarios x metrics


Targeted Evaluations

  • Language
    • Language modeling
    • Minimal pairs
  • Knowledge
    • Knowledge-intensive QA
    • Fact completion
  • Reasoning
    • Synthetic/purer reasoning
      • Ampliative
      • Non-ampliative
      • Recursive hierarchy
      • State tracking
    • Realistic/situated reasoning
  • Copyright
  • Disinformation
  • Bias/Stereotypes
  • Toxicity


Models


Hardware (public models)


Adaptation via prompting
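Adaptation via prompting means assembling a prompt from task instructions, a handful of in-context examples, and the test input, then letting the LM complete it. A simplified sketch; the prompt format and the toy sentiment task are illustrative, and HELM's actual formats vary by scenario.

```python
def build_prompt(instructions, in_context_examples, test_input):
    """Assemble a few-shot prompt: instructions, solved examples, then the query."""
    parts = [instructions, ""]
    for example_input, example_output in in_context_examples:
        parts += [f"Input: {example_input}", f"Output: {example_output}", ""]
    parts += [f"Input: {test_input}", "Output:"]
    return "\n".join(parts)


# Example usage on a toy sentiment task:
prompt = build_prompt(
    "Classify the sentiment of each review as Positive or Negative.",
    [("Great acting and a moving story.", "Positive"),
     ("Two hours I will never get back.", "Negative")],
    "The soundtrack was wonderful.",
)
```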


Accuracy vs X


Metric relationships


Accuracy as a function of time


Accuracy as a function of access


Variance across seeds


In-context examples


Multiple-choice method
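Multiple-choice scenarios can be posed to an LM in more than one way, e.g. presenting all the choices jointly and asking for a letter, or scoring each choice separately by its likelihood given the question. A sketch of the separate, likelihood-based method, assuming a hypothetical model.logprob(context, continuation) helper that returns log p(continuation | context):

```python
def answer_multiple_choice(model, question, choices):
    """Return the index of the answer choice the model assigns the highest likelihood.

    Assumes a hypothetical model.logprob(context, continuation) returning
    log p(continuation | context); real APIs expose this in different ways.
    """
    scores = [model.logprob(question, choice) for choice in choices]
    return max(range(len(choices)), key=lambda i: scores[i])
```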


Robustness (contrast sets)


Summarization


Disinformation


Next steps

  • Add scenarios, models, metrics we missed
    • Already added text-davinci-003, new AI21 and Cohere models
    • Adding FLAN-T5, OPT-IML this month
    • Some progress on other closed models (Google, DeepMind)
    • Some progress on ChatGPT (hard with rate limits/no API)
  • Monolingual (non-English) + Multilingual
    • Some support in progress for various MT and multilingual/cross-lingual datasets
  • Dialogue/assistant-type models
  • Vision, vision + text models
  • Other foundation models


Trust


Lots of bias metrics, little trust


Bommasani, Davis, Cardie (ACL 2020)


Testing Protocol to Accrue Trust

  • Measurement modeling (Loevinger, 1957; Messick, 1987; Jackman, 2008; …)
    • Widespread use in many social sciences
  • Specific criteria to ensure measures are valid and reliable


Face validity


Predictive validity


Hypothesis validity


Evaluation for Change

  • Evaluation is a force
    • Power comes from adoption
    • Reified as standards (ImageNet)
  • Other forces
    • Resources > Evaluation for FMs
      • Scaling laws (efficient allocation mindset)
    • Evaluation more pluralistic
  • Power
    • Evaluation’s power is legitimate
    • Evaluation’s power is unevenly distributed
  • Time is ripe to use evaluation to drive change
    • Less costly (few-shot)
    • Community-driven (BIG-bench, EleutherAI, GEM, UD)
    • More recognition


Policy

  • Ground policy decisions in concrete evaluations
    • i.e., public discourse on AI is often untethered from actual results
  • Transparency on models not released at all (e.g. PaLM)
  • Multidimensional, standardizing
  • Access mediates evaluation and transparency


References

Reach out at nlprishi@stanford.edu


HALIE


Centering interaction


Interactive tasks


Coverage of design space


Social Dialogue


Interactive QA


Crossword Puzzles


Harms that arose in practice


Discussion

  • Low latency is very important for the human experience
  • Interactive study design is much harder (e.g. user adaptation)
  • How does human-human and human-machine language change over time?
