
EVALUATION IN AI

Rishi Bommasani

Deep Learning: Classics and Trends

ML Collective


Societal Impact of Foundation Models


TRANSPARENCY

CONCEPTS

CHANGE


Outline

  • Transparency
    • HELM
    • HALIE (Mina Lee et al., 2022)
  • Concepts
    • Trust (Bommasani and Liang, 2022)
    • Homogenization (Bommasani et al., 2022)
    • Emergence (Jason Wei et al., 2022)
  • Change
    • Power (Bommasani, 2022)
    • Policy (Bommasani et al., 2023, next week!)


LMs are important

  • Research
    • NLP: almost every NLP paper uses an LM
    • Other AI subareas
      • New trends (e.g. framing RL as “language modeling”)
    • Other disciplines (protein language models)
  • Deployment
    • Flagship products with billions of users (Bing, Google Translate, Microsoft Word)
    • Emerging tech (ChatGPT, Copilot)
    • Focus of some of the newest and most aggressively funded AI startups in history

Adept, AI21, Anthropic, Character, Cohere, Hugging Face, Inflection, OpenAI, …


Yet we don’t understand them


  • 300+ researchers, 40+ faculty
  • 10+ academic departments


Benchmarking

Benchmarks orient AI. They set priorities and codify values.

Benchmarks are mechanisms for change.

"proper evaluation is a complex and challenging business"

- Karen Spärck Jones (ACL Lifetime Achievement Award, 2005)

Spärck Jones and Galliers (1995), Liberman (2010), Ethayarajh and Jurafsky (2020), Bowman and Dahl (2021), Raji et al. (2021), Birhane et al. (2022), Bommasani (2022) inter alia


Language model:

  • Black box: no assumptions about how it is built
  • Inputs: text
  • Outputs: text with probabilities (likelihoods)
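A minimal sketch of this black-box contract (the class and method names are illustrative, not HELM's actual API): text goes in, text plus per-token likelihoods come out, and nothing about the model's internals is assumed.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class LMResult:
    """What one query to a black-box language model returns."""
    completion: str              # generated text
    token_logprobs: List[float]  # log-likelihood of each generated token


class LanguageModel(Protocol):
    """Black-box LM: no assumptions about architecture, training data, or size."""

    def query(self, prompt: str, max_tokens: int = 100) -> LMResult:
        """Map input text to output text with probabilities."""
        ...
```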


HELM design principles

  • Broad coverage and recognition of incompleteness
  • Multi-metric
  • Standardization


Principle 1: Broad coverage

First taxonomize, then select


Principle 2: Multi-metric

Measure all metrics simultaneously to expose relationships/tradeoffs
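As a toy illustration of what measuring multiple metrics in one run looks like (the metric suite and the query interface here are placeholders, not HELM's implementation), every scenario gets a full row of metric scores, so tradeoffs such as accuracy vs. calibration sit side by side:

```python
def exact_match(preds, refs):
    """Fraction of predictions that exactly match the reference output."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


# Placeholder metric suite; a full multi-metric benchmark would also score
# calibration, robustness, fairness, bias, toxicity, and efficiency here.
METRICS = {"accuracy": exact_match}


def evaluate(model, scenarios):
    """Produce a scenario -> metric -> score table from a single run."""
    results = {}
    for name, (inputs, refs) in scenarios.items():
        preds = [model.query(x).completion for x in inputs]
        results[name] = {metric: fn(preds, refs) for metric, fn in METRICS.items()}
    return results
```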


Benchmarking paradigms

  • Accuracy, 1 dataset
  • Accuracy, several datasets
  • Many metrics, many datasets


Principle 3: Standardization


Important considerations

  • How you adapt the LM (e.g. prompting, probing, fine-tuning) matters
  • Different LMs might work in different regimes
  • Hard to ensure models are not contaminated (exposed to test data/distribution)
  • We don’t evaluate all models, and models are constantly being built (e.g. ChatGPT)


Evaluation at scale

  • 40+ scenarios across 6 tasks (e.g. QA) + 7 targeted evals (e.g. reasoning)
  • 7 metrics (e.g. robustness, bias)
  • 30+ models (e.g. BLOOM) from 12 organizations (e.g. OpenAI)

Costs

  • 5k runs
  • 12B tokens, 17M queries
  • $38k USD for commercial APIs, 20k A100 GPU hours for public models


Primitives


Scenario


Adaptation


Metrics
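Reading these three slides together: a scenario fixes what the model is evaluated on, adaptation fixes how the general-purpose LM is turned into a system for that scenario (e.g. a prompting strategy), and metrics fix what is measured on the outputs. A rough sketch of the primitives, with field names chosen for illustration rather than taken from the HELM codebase:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Instance:
    input: str
    references: List[str]          # acceptable outputs


@dataclass
class Scenario:
    name: str                      # e.g. "RAFT"
    task: str                      # e.g. "text classification"
    instances: List[Instance]


@dataclass
class Adaptation:
    instructions: str              # natural-language description of the task
    num_in_context_examples: int   # few-shot examples prepended to each prompt


@dataclass
class Metric:
    name: str                      # e.g. "accuracy", "calibration", "toxicity"
    score: Callable[[str, List[str]], float]   # (prediction, references) -> value
```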


Scenario Taxonomy


Task selection

  • Monolingual (English)
  • Unimodal (text)
  • User-facing
    • Question Answering
    • Summarization
    • Information Retrieval
    • Sentiment Analysis
    • Toxicity Detection
    • Miscellaneous Text Classification


Example scenario: RAFT


Desiderata/Metrics


Desiderata/Metric Selection


Example metric: Calibration
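Calibration asks whether the model's stated confidence matches how often it is actually right. A common way to quantify this is expected calibration error (ECE); the sketch below is a generic binned ECE, offered as an illustration rather than the exact variant reported in HELM.

```python
import numpy as np


def expected_calibration_error(confidences, correct, num_bins=10):
    """Binned ECE: |accuracy - confidence| gap per bin, weighted by bin size.

    confidences: model probability assigned to its chosen answer, per example
    correct:     1.0 if that answer was right, else 0.0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece
```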


Scenarios x metrics


Targeted Evaluations

  • Language
    • Language modeling
    • Minimal pairs
  • Knowledge
    • Knowledge-intensive QA
    • Fact completion
  • Reasoning
    • Synthetic/purer reasoning
      • Ampliative
      • Non-ampliative
      • Recursive hierarchy
      • State tracking
    • Realistic/situated reasoning
  • Copyright
  • Disinformation
  • Bias/Stereotypes
  • Toxicity


Models


Hardware (public models)


Adaptation via prompting
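Adaptation via prompting means assembling a prompt from task instructions, a handful of in-context examples, and the test input, then letting the LM complete it. A simplified sketch; the prompt format and the toy sentiment task are illustrative, and HELM's actual formats vary by scenario.

```python
def build_prompt(instructions, in_context_examples, test_input):
    """Assemble a few-shot prompt: instructions, solved examples, then the query."""
    parts = [instructions, ""]
    for example_input, example_output in in_context_examples:
        parts += [f"Input: {example_input}", f"Output: {example_output}", ""]
    parts += [f"Input: {test_input}", "Output:"]
    return "\n".join(parts)


# Example usage on a toy sentiment task:
prompt = build_prompt(
    "Classify the sentiment of each review as Positive or Negative.",
    [("Great acting and a moving story.", "Positive"),
     ("Two hours I will never get back.", "Negative")],
    "The soundtrack was wonderful.",
)
```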


Accuracy vs X


Metric relationships


Accuracy as a function of time


Accuracy as a function of access


Variance across seeds


In-context examples


Multiple-choice method
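Multiple-choice scenarios can be posed to an LM in more than one way, e.g. presenting all the choices jointly and asking for a letter, or scoring each choice separately by its likelihood given the question. A sketch of the separate, likelihood-based method, assuming a hypothetical model.logprob(context, continuation) helper that returns log p(continuation | context):

```python
def answer_multiple_choice(model, question, choices):
    """Return the index of the answer choice the model assigns the highest likelihood.

    Assumes a hypothetical model.logprob(context, continuation) returning
    log p(continuation | context); real APIs expose this in different ways.
    """
    scores = [model.logprob(question, choice) for choice in choices]
    return max(range(len(choices)), key=lambda i: scores[i])
```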


Robustness (contrast sets)


Summarization


Disinformation


Next steps

  • Add scenarios, models, metrics we missed
    • Already added text-davinci-003, new AI21 and Cohere models
    • Adding FLAN-T5, OPT-IML this month
    • Some progress on other closed models (Google, DeepMind)
    • Some progress on ChatGPT (hard with rate limits/no API)
  • Monolingual (non-English) + Multilingual
    • Some support in progress for various MT and multilingual/cross-lingual datasets
  • Dialogue/assistant-type models
  • Vision, vision + text models
  • Other foundation models


Trust


Lots of bias metrics, little trust


Bommasani, Davis, Cardie (ACL 2020)


Testing Protocol to Accrue Trust

  • Measurement modeling (Loevinger, 1957; Messick, 1987; Jackman, 2008; …)
    • Widespread use in many social sciences
  • Specific criteria to ensure measures are valid and reliable


Face validity


Predictive validity


Hypothesis validity


Evaluation for Change

  • Evaluation is a force
    • Power comes from adoption
    • Reified as standards (ImageNet)
  • Other forces
    • Resources > Evaluation for FMs
      • Scaling laws (efficient allocation mindset)
    • Evaluation more pluralistic
  • Power
    • Evaluation’s power is legitimate
    • Evaluation’s power is unevenly distributed
  • Time is ripe to use evaluation to drive change
    • Less costly (few-shot)
    • Community-driven (BIG-bench, EleutherAI, GEM, UD)
    • More recognition


Policy

  • Ground policy decisions in concrete evaluations
    • i.e., public discourse on AI is often untethered from actual results
  • Transparency on models not released at all (e.g. PaLM)
  • Multidimensional, standardizing
  • Access mediates evaluation and transparency


References

Reach out at nlprishi@stanford.edu


HALIE


Centering interaction


Interactive tasks


Coverage of design space


Social Dialogue


Interactive QA


Crossword Puzzles


Harms that arose in practice


Discussion

  • Low latency is very important for the human experience
  • Interactive study design is much harder (e.g. user adaptation)
  • How does human-human and human-machine language change over time?
