Lecture 7
CS 263:
Advanced NLP
Saadia Gabriel
Announcements
Alisa Liu (UW)
OpenAI Superalignment and NSF fellow
Between Language and Models: Rethinking Algorithms for Tokenization
Language models operate over real numbers, while users of language models interface with human-readable text. This is made possible by tokenization, which encodes text as a sequence of embeddings and decodes real-valued predictions back into generated text. In this talk, I will discuss our recent work in improving algorithms for tokenization. The first half presents SuperBPE, a superword tokenizer that extends traditional subword tokenization to include tokens that span multiple words. We motivate superword tokens from a linguistic perspective, and demonstrate empirically that models pretrained from scratch with SuperBPE achieve stronger performance on downstream tasks while also being significantly more efficient at inference-time. The second half revisits a fundamental limitation of tokenizer-based LMs: models trained over sequences of tokens cannot, out of the box, model the probability of arbitrary strings. I discuss the practical implications of this in domains such as Chinese and code, and then present an inference-time solution that converts LM-predicted probabilities over tokens into probabilities over characters.
Announcements
(assignment just released on Bruin Learn and due 2/3)
Quiz Recap
Q1: Grice's Maxims were introduced in the guest lecture. Explain what this concept is and how it is relevant to digital privacy.
In linguistics, principles of cooperative communication introduced by Grice (1975)
In the guest lecture, we discussed how LLMs’ adherence to these principles in user-LLM interactions affects users’ personal disclosure behavior
Quiz Recap
Q2: Why do we typically use supervised finetuning to initially train LLMs to follow instructions, instead of just using RLHF or DPO?
Supervised finetuning helps initialize the model by teaching it an expected format through examples. Outputs can then be refined to adhere to certain desired qualities through RLHF or DPO, but it is far more challenging and inefficient to learn format through reward signals alone.
Quiz Recap
Q3: Datasheets are only meant to explain dataset statistics and attributes. True or False.
False.
They can contain many other metadata details, such as who created the dataset and what its intended purpose was.
Quiz Recap
Q5: What key finding about model scaling was revealed by the release of the Chinchilla model?
To optimally use model parameters, scaling parameters should be balanced with scaling training data.
Hoffmann et al., 2022
Last Time
We explored various sampling- and search-based approaches for constructing text sequences from a language model’s token probability distributions
fun?
Today
The following slide examples are partially from Daphne Ippolito and Chenyan Xiong’s CMU 11-667 slides,
as well as Yann Dubois’ Stanford CS224N slides
How do we decide who built the “best” chatbot?
Alice
Bob
🤖
🤖
A
B
What is a benchmark?
https://www.ruder.io/nlp-benchmarking/
Historical Perspective on Benchmarking
Government agencies like DARPA and NIST funded early large-scale benchmark creation efforts
MNIST, LeCun et al. (1998)
Historical Perspective on Benchmarking
What are the metrics?
Accuracy
Recall
Precision
F1
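A minimal sketch of these four metrics for a binary classification task (the function name and example labels below are illustrative, not from the lecture):

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary task."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```

Recall measures coverage of true positives, precision measures purity of predicted positives, and F1 is their harmonic mean.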
What if we have open-ended tasks
(e.g. summarization)?
Let’s assume for now we have good-quality reference texts…
N-gram-based Metrics
Lack of Semantic Understanding
Do you prefer Alice’s model to Bob’s model?
🤖
GPT-5
Heck yes
Heck no
Yup!
No n-gram overlap, but same meaning
N-gram overlap, but opposite meaning
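The failure above can be reproduced in a few lines. The sketch below is a simplified n-gram precision in the spirit of BLEU (the function name is illustrative), scored on the slide’s two candidate replies against the reference “Heck yes”:

```python
def ngram_overlap(candidate, reference, n=1):
    """Fraction of candidate n-grams that also appear in the reference
    (a simplified n-gram precision, in the spirit of BLEU)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams = [tuple(cand[i:i + n]) for i in range(len(cand) - n + 1)]
    ref_ngrams = {tuple(ref[i:i + n]) for i in range(len(ref) - n + 1)}
    if not cand_ngrams:
        return 0.0
    return sum(g in ref_ngrams for g in cand_ngrams) / len(cand_ngrams)

reference = "Heck yes"
print(ngram_overlap("Yup!", reference))     # 0.0 -- same meaning, no overlap
print(ngram_overlap("Heck no", reference))  # 0.5 -- opposite meaning, high overlap
```

The metric rewards the contradictory answer and gives the correct paraphrase a score of zero.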
How can we fix this?
Challenges in Benchmarking
Model-based Metrics
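One family of model-based metrics (e.g., BERTScore) matches candidate and reference tokens by embedding similarity rather than exact string overlap. A toy sketch, with hand-made 2-d vectors standing in for real contextual embeddings (all vectors, names, and numbers here are illustrative):

```python
import math

# Toy word vectors standing in for contextual embeddings; in practice
# these would come from a pretrained encoder such as BERT.
TOY_EMBED = {
    "heck": [0.9, 0.1], "yes": [0.1, 0.9],
    "yup": [0.15, 0.95], "no": [0.2, -0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def embedding_recall(candidate, reference):
    """BERTScore-style greedy matching: each reference token is matched
    to its most similar candidate token, then similarities are averaged."""
    cand = [TOY_EMBED[w] for w in candidate.lower().replace("!", "").split()]
    ref = [TOY_EMBED[w] for w in reference.lower().split()]
    return sum(max(cosine(r, c) for c in cand) for r in ref) / len(ref)
```

On the slide’s example, “Yup!” now scores higher against “Heck yes” than “Heck no” does, reversing the n-gram verdict.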
Should we trust our references?
All these metrics assume we even have a reference…
LLM-as-a-judge
Do you prefer Alice’s model to Bob’s model?
🤖
GPT-5
Heck yes
🤖
A
🤖
B
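A pairwise LLM-as-a-judge setup can be sketched as below. The prompt wording is illustrative and the call to the judge model itself is left out; randomizing which model’s response appears first helps counter position bias:

```python
import random

JUDGE_PROMPT = """You are an impartial judge. Which response better answers
the user's question? Reply with exactly "A" or "B".

Question: {question}

Response A: {resp_a}

Response B: {resp_b}
"""

def build_judgment(question, alice_resp, bob_resp, rng=random):
    """Build a pairwise judge prompt, randomizing which model is shown
    first. Returns the prompt and a mapping from the judge's letter
    verdict back to the model names."""
    if rng.random() < 0.5:
        first, second = ("alice", alice_resp), ("bob", bob_resp)
    else:
        first, second = ("bob", bob_resp), ("alice", alice_resp)
    prompt = JUDGE_PROMPT.format(question=question,
                                 resp_a=first[1], resp_b=second[1])
    mapping = {"A": first[0], "B": second[0]}
    return prompt, mapping
```

The prompt would then be sent to the judge LLM, and its “A”/“B” verdict looked up in `mapping` to decide the winner.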
LLM Self-Bias
Suppose Alice’s model is also GPT-5
(or derived from GPT-5)…
🤖
GPT-5
Heck yes
Panickssery et al. (2024)
The Gold Standard
Human Evaluation
Model
Training
Model
Evals
Ensuring Quality in
Human Evaluation
Inter-annotator Agreement
Do annotators make consistent judgements?
If not, this could be a sign of annotator error or bad protocol design*
* It could also be a sign of task subjectivity, so this is not inherently bad.
A good rule of thumb is checking agreement scores from published work doing the same or similar tasks.
Alice’s model is better
Bob’s model is better
Common IAA metrics
(k items with n judgements per item)
Correcting for chance agreement:
Cohen’s Kappa (κ)
Worked example: a label with marginal probability .34 for each annotator contributes .34 × .34 = .1156 ≈ .116 to the chance agreement; summing these contributions across labels gives pₑ = .339.
Cohen’s Kappa (κ)
.339
.339
.359
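Cohen’s kappa is straightforward to compute from two annotators’ label sequences; a minimal sketch (function name is illustrative):

```python
from collections import Counter

def cohens_kappa(ann1, ann2):
    """Cohen's kappa for two annotators labeling the same items:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance, computed from each
    annotator's label marginals."""
    n = len(ann1)
    p_o = sum(a == b for a, b in zip(ann1, ann2)) / n
    c1, c2 = Counter(ann1), Counter(ann2)
    p_e = sum((c1[lab] / n) * (c2[lab] / n) for lab in c1.keys() | c2.keys())
    return (p_o - p_e) / (1 - p_e)
```

Perfect agreement yields κ = 1, while agreement at exactly the chance rate yields κ = 0.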
Fleiss’ Kappa (κ)
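Fleiss’ kappa generalizes this to n annotators per item; a minimal sketch, where each item’s ratings are given as category counts (the input format is illustrative):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for k items, each rated by the same number n of
    annotators. `ratings` is a list of dicts mapping category -> count,
    with counts summing to n for every item."""
    k = len(ratings)
    n = sum(ratings[0].values())
    categories = {c for item in ratings for c in item}
    # Per-item agreement: fraction of annotator pairs that agree.
    p_bar = sum(
        (sum(c * c for c in item.values()) - n) / (n * (n - 1))
        for item in ratings
    ) / k
    # Chance agreement from overall category proportions.
    p_e = sum(
        (sum(item.get(cat, 0) for item in ratings) / (k * n)) ** 2
        for cat in categories
    )
    return (p_bar - p_e) / (1 - p_e)
```

Unlike Cohen’s kappa, this does not require the same two annotators to rate every item, which suits crowdsourced evaluation.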
What these scores mean
Other methods
Multiple choice?
Likert scale?
Free-text?
Other Issues
Order Bias
Randomize positions of questions and examples
Inattentive Annotators
Include simple attention checks (e.g. asking how many questions they have completed so far)
AI Annotators????
Spamming Annotators
Use time checks, screen repeated responses for random guessing, assess the quality of free-text responses…
AI and Crowdwork Platforms
33–46% of crowd workers were estimated to be using LLMs when completing a summarization task on Amazon Mechanical Turk (Veselovsky et al., 2023)
Automatic Evaluations Generally Are Reproducible…
Is Human Evaluation Reproducible?
Variability across Evaluations
Minimum Reporting
Requirements
Benchmarking
Ecosystem
Model Leaderboards
Liang et al., 2022
Dynamic
Benchmarking
Take a couple minutes to discuss:
Consider a chatbot web app (e.g. ChatGPT, Gemini).
What information should we collect to assess the LM?
Example Suggestions
Dynamic
Benchmarking
Chiang et al., 2024
Dynamic
Benchmarking
Benchmarking Monoculture
Multilingual Benchmarking
Benchmark Contamination
Preventing Contamination
Preventing Contamination
After Next Week:
Model Interpretability