Team Presentations
Teams TK, Learnings, Power
Neural Mechanics, Week 3: Evaluation and LLM-generated data
Tuesday, January 27, 2026
David Bau
Northeastern University
Three essential ways of prompting LLMs
The capital of the state of Vermont is the city of
Cloze prompt (fill-in-the-blank)
incomplete sentence
In-context learning (by-example)
demonstrations
New York: Albany; Montana: Helena; Vermont:
Tell me the capital of Vermont.
Explicit instruction-style prompt
direct instruction
next-word prediction
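The three prompting styles above can be sketched as plain prompt strings. A minimal illustration, with `make_prompts` as a hypothetical helper (the name and interface are illustrative, not from the lecture):

```python
def make_prompts(state: str, demos: list[tuple[str, str]]) -> dict[str, str]:
    """Build the same state-capital query in three prompting styles.
    (Hypothetical helper; names are illustrative.)"""
    demo_str = "; ".join(f"{s}: {c}" for s, c in demos)
    return {
        # 1. Cloze: an incomplete sentence the model must finish.
        "cloze": f"The capital of the state of {state} is the city of",
        # 2. In-context learning: demonstrations establish the pattern.
        "in_context": f"{demo_str}; {state}:",
        # 3. Explicit instruction: a direct request.
        "instruction": f"Tell me the capital of {state}.",
    }

prompts = make_prompts("Vermont", [("New York", "Albany"), ("Montana", "Helena")])
```

All three elicit the same fact through next-word prediction; only the framing differs.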
Petroni et al.: LAMA (LAnguage Model Analysis)
1. Take factual knowledge (e.g., "Paris is the capital of France")
↓
2. Convert to cloze template: "Paris is the capital of [MASK]."
↓
3. Have BERT predict the masked word (single token)
↓
4. Check if top prediction matches the correct answer
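The four steps above can be sketched as a small scoring loop. A minimal sketch, with the masked-LM call injected as a function argument (a real run would call BERT's fill-mask head; the stub here is just for illustration):

```python
def lama_probe(facts, templates, predict_top_token):
    """LAMA-style cloze probe: fill a template per fact and check
    whether the model's top single-token prediction matches the answer.
    `predict_top_token(prompt)` stands in for a masked-LM call."""
    correct = 0
    for subject, relation, answer in facts:
        prompt = templates[relation].format(subject=subject)
        if predict_top_token(prompt) == answer:
            correct += 1
    return correct / len(facts)

# Toy usage with a stubbed "model" that knows two capitals:
facts = [("Paris", "capital_of", "France"),
         ("Ottawa", "capital_of", "Canada")]
templates = {"capital_of": "{subject} is the capital of [MASK]."}
stub = lambda prompt: "France" if "Paris" in prompt else "Canada"
score = lama_probe(facts, templates, stub)  # 1.0
```

The single-token restriction in step 3 is what makes the check a simple string comparison.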
- Luze Sun: Is this really "knowledge"?
- Yunus Q1: Hidden organization?
- Rice Q2: Subspace of an LLM?
LLM as Judge: evaluating long texts
1. Collect open-ended responses from multiple LLMs
↓
2. Have GPT-4 judge the responses (either scoring or pairwise comparison)
↓
3. Compare GPT-4's judgments to human preferences
↓
4. Measure agreement rate and analyze disagreements
↓
5. Identify and characterize biases
The Three Big Biases that stand out: position bias, verbosity bias, and self-enhancement bias
Li 2024, CalibraEval measured this: https://arxiv.org/abs/2410.15393v1
Guangyuan: 80% agreement?
Kai: How to communicate limitations?
Armita: Just tell the model to avoid bias?
Arya: Game the judge with personas?
Courtney: Self-enhancement → adversary?
Haoyu: Shared inductive bias in judgments?
How Perez Created Evaluation Data
1. Define a behavior you want to test
↓
2. Write a few example test questions by hand
↓
3. Prompt an LLM to generate many more examples in that format
↓
4. Second LLM review to filter out bad generations
↓
5. Run eval on target models
↓
6. Analyze results (in his paper: base model vs RLHF model)
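The generation steps (2–4) above can be sketched as a loop with the two LLM calls injected as functions. A minimal sketch, assuming a generator model and a second filter model, both stubbed here (a real pipeline would make API calls):

```python
import random

def build_eval_set(seed_examples, generate, accept, n=100, max_tries=1000):
    """Perez-style model-written evals: start from hand-written seeds,
    generate candidates with one LLM (`generate`), and keep only those
    a second LLM accepts (`accept`). Both calls are injected stubs here."""
    dataset = list(seed_examples)
    tries = 0
    while len(dataset) < n and tries < max_tries:
        tries += 1
        candidate = generate(seed_examples)
        if accept(candidate):  # second-LLM quality filter
            dataset.append(candidate)
    return dataset

# Toy usage: the "generator" emits formatted yes/no questions,
# the "filter" checks the required answer format.
seeds = ["Is the sky blue? Answer yes or no."]
gen = lambda s: f"Is {random.randint(1, 9)} an even number? Answer yes or no."
ok = lambda q: q.endswith("Answer yes or no.")
data = build_eval_set(seeds, gen, ok, n=5)
```

The `max_tries` cap guards against a filter that rejects everything; in practice the reject rate is itself a useful diagnostic of generation quality.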
Grace: "uneducated" user behavior - is that sycophancy?
Claire: Sycophancy and its broader effects
Avery: Can we fix RLHF feedback?
Yunus Q2: Why does RLHF amplify political views?
Rice Q1: Are LLM-generated datasets limited by RLHF bias?
Isaac: Say vs. do - what do we learn from yes/no questions?
Jasmine Q4: Failure modes of model-written evals
Evals and interpretability
| Eval Type | Question | Example |
| --- | --- | --- |
| Capability | Can it do X? | Can the model recall facts? |
| Behavioral | How does it do X? | Does the model give less accurate answers to “uneducated” users? |
| Mechanistic | Does component M implement X? | Does ablating layer 15 erase knowledge of a fact? |
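The mechanistic row can be sketched as an ablation experiment: run the same factual prompts with and without a component zeroed out and compare accuracy. A minimal sketch, with the forward pass injected as a stub (the `ablate=` interface and "layer 15" are illustrative assumptions):

```python
def ablation_effect(run_model, prompts, answers, component):
    """Mechanistic eval sketch: accuracy lost when `component` is ablated.
    `run_model(prompt, ablate=None)` stands in for a forward pass with
    an optional zeroed-out component (hypothetical interface)."""
    def accuracy(ablate):
        hits = sum(run_model(p, ablate=ablate) == a
                   for p, a in zip(prompts, answers))
        return hits / len(answers)
    return accuracy(None) - accuracy(component)

# Toy usage: a stub model that answers correctly unless "layer 15" is ablated.
gold = {"The capital of France is": "Paris",
        "The capital of Vermont is": "Montpelier"}
stub = lambda prompt, ablate=None: "???" if ablate == 15 else gold[prompt]
effect = ablation_effect(stub, list(gold), list(gold.values()), component=15)
# effect == 1.0: ablating the component erases all factual recall
```

A large drop supports (but does not prove) that the component implements the behavior; controls such as ablating random other components address the confound.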
Jasmine Q1: Why do evaluations - what does the knowledge enable?
Yuqi: What do evaluation metrics measure in interp research?
Evals in Interpretability: Research Flow
Jasmine Q2: Minimum viable eval (pelican on a bicycle)
Jasmine Q3: Baselines and controlling confounds
Homework for Thursday Project
Measure capabilities: “Which models can do X?”
Measure behavior: “How do they do X?”