1 of 58

Lecture 7

CS 263:

Advanced NLP

Saadia Gabriel

2 of 58

Announcements

  • Our next guest lecture is on Monday:

Alisa Liu (UW)

OpenAI Superalignment and NSF fellow

Between Language and Models: Rethinking Algorithms for Tokenization

Language models operate over real numbers, while users of language models interface with human-readable text. This is made possible by tokenization, which encodes text as a sequence of embeddings and decodes real-valued predictions back into generated text. In this talk, I will discuss our recent work in improving algorithms for tokenization. The first half presents SuperBPE, a superword tokenizer that extends traditional subword tokenization to include tokens that span multiple words. We motivate superword tokens from a linguistic perspective, and demonstrate empirically that models pretrained from scratch with SuperBPE achieve stronger performance on downstream tasks while also being significantly more efficient at inference-time. The second half revisits a fundamental limitation of tokenizer-based LMs: models trained over sequences of tokens cannot, out of the box, model the probability of arbitrary strings. I discuss the practical implications of this in domains such as Chinese and code, and then present an inference-time solution that converts LM-predicted probabilities over tokens into probabilities over characters.

3 of 58

Announcements

  • Wednesday will be peer review of mid-project reports

(assignment just released on Bruin Learn and due 2/3)

  • Feedback will be individual, based on in-class assignments
  • Guidelines have been updated
  • Give feedback on 3 reports other than your own

4 of 58

Quiz Recap

Q1: Grice's Maxims were introduced in the guest lecture. Explain what this concept is and how it is relevant to digital privacy. 

In linguistics, principles of cooperative communication introduced by Grice (1975)

In the guest lecture, we discussed how LLMs’ adherence to these principles in user-LLM interactions affects users’ personal disclosure behavior

5 of 58

Quiz Recap

Q2: Why do we typically use supervised finetuning to initially train LLMs to follow instructions, instead of just using RLHF or DPO? 

Supervised finetuning helps initialize the model by teaching it an expected format through examples. Outputs can then be refined to adhere to certain desired qualities through RLHF or DPO, but it is far more challenging and inefficient to learn format through reward signals alone.

6 of 58

Quiz Recap

Q3: Datasheets are only meant to explain dataset statistics and attributes. True or False.

False.

They can contain many other metadata details, such as who created the dataset and what its intended purpose was.

7 of 58

Quiz Recap

Q5: What key finding about model scaling was revealed by the release of the Chinchilla model? 

For a fixed compute budget, model size and the amount of training data should be scaled in balance; many earlier LLMs were undertrained relative to their parameter count.

Hoffmann et al., 2022
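A common back-of-envelope version of the Chinchilla result is "about 20 training tokens per parameter," combined with the approximation that training FLOPs C ≈ 6·N·D for N parameters and D tokens. The sketch below uses those two approximations (the exact ratio depends on Hoffmann et al.'s fitted constants):

```python
# Back-of-envelope Chinchilla-style allocation: split a FLOP budget C between
# parameters N and training tokens D, assuming C ~= 6*N*D and D/N ~= 20.
# The "20 tokens per parameter" ratio is an approximation, not the exact fit.
def compute_optimal_allocation(flops, tokens_per_param=20):
    """Return (n_params, n_tokens) for a given training FLOP budget."""
    n_params = (flops / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens
```

For reference, Chinchilla itself was ~70B parameters trained on ~1.4T tokens, a ratio of 20.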

8 of 58

Last Time

We explored various sampling- and search-based approaches for constructing actual text sequences from our language model’s token probability distributions


9 of 58

Today

The following slide examples are partially from Daphne Ippolito and Chenyan Xiong’s CMU 11-667 slides, as well as Yann Dubois’ Stanford CS224N slides.

How do we decide who built the “best” chatbot?

Alice

Bob

🤖

🤖

A

B

10 of 58

What is a benchmark?

https://www.ruder.io/nlp-benchmarking/

11 of 58

Historical Perspective on Benchmarking

Government agencies like DARPA and NIST funded early large-scale benchmark creation efforts

MNIST, LeCun et al. (1998)

12 of 58

Historical Perspective on Benchmarking

13 of 58

What are the metrics?

Accuracy

Recall

Precision

F1
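These four metrics can all be computed from the confusion-matrix counts. A minimal sketch (for the binary case, with tp/fp/fn/tn as the confusion counts):

```python
# Sketch: the four standard classification metrics from confusion counts.
def classification_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of predicted positives, how many were correct
    recall = tp / (tp + fn)      # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1
```

F1 is the harmonic mean of precision and recall, so it punishes a model that trades one off heavily against the other.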

14 of 58

What if we have open-ended tasks

(e.g. summarization)?

15 of 58

Let’s assume for now we have good-quality reference texts…

16 of 58

Ngram-based Metrics
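The core ingredient of n-gram metrics like BLEU and ROUGE is n-gram overlap between a candidate and a reference. A minimal sketch of BLEU-style modified (clipped) n-gram precision, ignoring BLEU's brevity penalty and multi-n averaging:

```python
from collections import Counter

def ngram_precision(candidate, reference, n=2):
    """Fraction of candidate n-grams that also appear in the reference,
    with counts clipped to the reference (as in BLEU's modified precision)."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return overlap / total if total else 0.0
```

BLEU is precision-oriented (candidate n-grams found in the reference); ROUGE flips the direction and is recall-oriented (reference n-grams found in the candidate).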

17 of 58

Lack of Semantic Understanding

Do you prefer Alice’s model to Bob’s model?

🤖

GPT-5

Heck yes

Heck no

Yup!

No n-gram overlap, but same meaning

N-gram overlap, but opposite meaning

How can we fix this?

18 of 58

Challenges in Benchmarking

19 of 58

Model-based Metrics

20 of 58

Should we trust our references?

21 of 58

All these metrics assume we even have a reference…

22 of 58

LLM-as-a-judge

Do you prefer Alice’s model to Bob’s model?

🤖

GPT-5

Heck yes

🤖

A

🤖

B
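A minimal sketch of the pairwise LLM-as-a-judge setup. Everything here (the template wording, the position-swapping convention) is illustrative, not a specific system's API; the actual call to the judge model is left out:

```python
# Sketch of pairwise LLM-as-a-judge prompt construction. The judge model call
# itself is omitted; this only builds the two prompts. Swapping the positions
# of the responses and requiring both orders to agree is a common mitigation
# for the judge's position bias.
JUDGE_TEMPLATE = """You are judging two chatbot responses to the same user prompt.

User prompt: {prompt}

Response A: {a}
Response B: {b}

Which response is better? Answer with exactly "A" or "B"."""

def build_judge_prompts(prompt, response_1, response_2):
    """Return the judge prompt in both orderings of the two responses."""
    return (
        JUDGE_TEMPLATE.format(prompt=prompt, a=response_1, b=response_2),
        JUDGE_TEMPLATE.format(prompt=prompt, a=response_2, b=response_1),
    )
```

A win is then typically counted only when the judge prefers the same underlying response in both orderings.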

23 of 58

LLM Self-Bias

Suppose Alice’s model is also GPT-5

(or derived from GPT-5)…

🤖

GPT-5

Heck yes

Panickssery et al. (2024)

24 of 58

The Gold Standard

25 of 58

Human Evaluation

26 of 58

Human Evaluation

27 of 58

Human Evaluation

Model

Training

Model

Evals

28 of 58

Ensuring Quality in

Human Evaluation

29 of 58

Inter-annotator Agreement

Do annotators make consistent judgements?

If not, this could be a sign of annotator error or bad protocol design*

* It could also be a sign of task subjectivity, so this is not inherently bad.

A good rule of thumb is checking agreement scores from published work doing the same or similar tasks.

Alice’s model is better

Bob’s model is better

30 of 58

Common IAA metrics

  • Raw percentage agreement
    • What % of the time does annotator #1 agree with annotator #2?
    • Agreement can happen by chance
  • Cohen’s Kappa (Cohen, 1960)
    • Corrects for chance agreement
  • Fleiss’ Kappa (Fleiss, 1971)
    • Generalization of Cohen’s Kappa for n > 2 annotators (k items with n judgements per item)
  • Krippendorff’s Alpha (Krippendorff, 1980)
    • Even more robust; handles missing data

31 of 58

Cohen’s Kappa (κ)

32 of 58

Cohen’s Kappa (κ)

[Worked example from the slide: marginal probabilities of .34 and .34 for a category, so .34 × .34 = .1156 ≈ .116; summing over categories gives the expected chance agreement]

33 of 58

Cohen’s Kappa (κ)

[Worked example continued: expected chance agreement pₑ = .339, yielding κ = .359]
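Cohen's kappa compares observed agreement p_o to the agreement p_e expected by chance from each annotator's label marginals: κ = (p_o − p_e) / (1 − p_e). A minimal sketch:

```python
def cohens_kappa(labels_1, labels_2):
    """Cohen's kappa for two annotators over the same items:
    observed agreement corrected for chance agreement, where chance
    agreement comes from each annotator's own label marginals."""
    assert len(labels_1) == len(labels_2)
    n = len(labels_1)
    categories = set(labels_1) | set(labels_2)
    # Observed agreement: fraction of items the two annotators label the same.
    p_o = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    # Chance agreement: product of the two annotators' marginals per category.
    p_e = sum(
        (labels_1.count(c) / n) * (labels_2.count(c) / n) for c in categories
    )
    return (p_o - p_e) / (1 - p_e)
```

κ = 1 means perfect agreement, κ = 0 means agreement no better than chance, and κ < 0 means worse than chance.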

34 of 58

Fleiss’ Kappa (κ)

35 of 58

Fleiss’ Kappa (κ)

36 of 58

Fleiss’ Kappa (κ)

37 of 58

Fleiss’ Kappa (κ)
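The slides' derivation can be sketched in code. Fleiss' kappa takes, for each item, the counts of annotators per category; observed agreement averages the fraction of agreeing annotator pairs per item, and chance agreement comes from the pooled category proportions:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa. `ratings[i][c]` is the number of annotators who put
    item i in category c; every item must have the same number of ratings."""
    N = len(ratings)            # number of items (k in the slide notation)
    n = sum(ratings[0])         # judgements per item
    k = len(ratings[0])         # number of categories
    # Per-item agreement: fraction of annotator pairs agreeing on item i.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N        # mean observed agreement
    # Chance agreement from the pooled category proportions.
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)
```

With three annotators who agree perfectly on every item, the per-item agreement is 1 and κ = 1; as labels scatter across categories, κ falls toward (and can go below) 0.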

38 of 58

What these scores mean

39 of 58

Other methods

Multiple choice?

Likert scale?

Free-text?

40 of 58

Other Issues

Order Bias

Randomize positions of questions and examples

Inattentive Annotators

Put in simple attention checks (e.g. how many questions have you completed)

AI Annotators????

Spamming Annotators

Time checks, evaluate multiple responses for random guessing, quality of free-text responses…
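The first two mitigations above are straightforward to implement when assembling annotation batches. A minimal sketch (the item format and attention-check wording are illustrative, not from any particular platform):

```python
import random

# Sketch: randomize example order per annotator (against order bias) and
# insert a simple attention-check item (against inattentive annotators).
def build_annotation_batch(examples, rng=None):
    rng = rng or random.Random()
    batch = list(examples)
    rng.shuffle(batch)  # each annotator sees a different order
    check = {"text": "To show you are reading, select 'Agree' for this item.",
             "is_attention_check": True}
    batch.insert(rng.randrange(len(batch) + 1), check)  # random position
    return batch
```

Attention-check items are scored separately; annotators who fail them are typically excluded and (on most platforms) re-collected.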

41 of 58

AI and Crowdwork Platforms

33–46% of crowd workers were estimated to be using LLMs when completing a summarization task on Amazon Mechanical Turk (Veselovsky et al., 2023)

42 of 58

Automatic Evaluations Generally Are Reproducible…

43 of 58

Is Human Evaluation Reproducible?

44 of 58

Variability across Evaluations

45 of 58

Minimum Reporting

Requirements

46 of 58

Benchmarking

Ecosystem

47 of 58

Model Leaderboards

Liang et al., 2022

48 of 58

Dynamic

Benchmarking

49 of 58

Take a couple minutes to discuss:

Consider a chatbot web app (e.g. ChatGPT, Gemini).

What information should we collect to assess the LM?

50 of 58

Example Suggestions

51 of 58

Dynamic

Benchmarking

Chiang et al., 2024

52 of 58

Dynamic

Benchmarking

53 of 58

Benchmarking Monoculture

54 of 58

Multilingual Benchmarking

55 of 58

Benchmark Contamination

56 of 58

Preventing Contamination

57 of 58

Preventing Contamination

58 of 58

After Next Week:

Model Interpretability