1 of 12

Team Presentations

Teams TK, Learnings, Power

2 of 12

Neural Mechanics, Week 3: Evaluation and LLM-generated data

Tuesday, January 27, 2026

David Bau

Northeastern University

3 of 12

Three essential ways of prompting LLMs

1. Cloze prompt (fill-in-the-blank): an incomplete sentence the model completes, e.g., "The capital of the state of Vermont is the city of"

2. In-context learning (by-example): demonstrations establish the pattern, e.g., "New York: Albany; Montana: Helena; Vermont:"

3. Explicit instruction-style prompt: a direct instruction, e.g., "Tell me the capital of Vermont."

All three are answered by the same underlying mechanism: next-word prediction.

4 of 12

Petroni: LAMA

1. Take factual knowledge (e.g., "Paris is the capital of France")

2. Convert to cloze template: "Paris is the capital of [MASK]."

3. Have BERT predict the masked word (single token)

4. Check if top prediction matches the correct answer
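The four steps above can be sketched as follows. Here `predict_mask` is a hypothetical stand-in for a real masked language model (Petroni et al. used BERT); a real implementation would return the model's top single-token prediction for `[MASK]`.

```python
def to_cloze(subject: str, template: str) -> str:
    """Convert a fact's subject into a cloze query, e.g.
    ("Paris", "[X] is the capital of [MASK].") ->
    "Paris is the capital of [MASK]."
    """
    return template.replace("[X]", subject)

def predict_mask(prompt: str) -> str:
    """Stand-in for a masked LM's top prediction. A real version
    would call BERT and read off the highest-probability token
    for the [MASK] position."""
    toy_memory = {"Paris is the capital of [MASK].": "France"}
    return toy_memory.get(prompt, "?")

def lama_accuracy(facts: list[tuple[str, str]], template: str) -> float:
    """Fraction of facts where the top prediction matches the answer."""
    hits = sum(predict_mask(to_cloze(subj, template)) == ans
               for subj, ans in facts)
    return hits / len(facts)
```

Note that restricting answers to a single token (step 3) is what makes this scoring trivial; multi-token answers need a different evaluation scheme.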

5 of 12

- Luze Sun: Is this really "knowledge"?

- Yunus Q1: Hidden organization?

- Rice Q2: Subspace of an LLM?

6 of 12

LLM as Judge: evaluating long texts

1. Collect open-ended responses from multiple LLMs

2. Have GPT-4 judge the responses (either scoring or pairwise comparison)

3. Compare GPT-4's judgments to human preferences

4. Measure agreement rate and analyze disagreements

5. Identify and characterize biases
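Steps 2-4 can be sketched as below. `llm_judge` is a hypothetical stand-in for a GPT-4 pairwise judgment call; the toy rule deliberately prefers longer answers, mimicking the length bias discussed on the next slide.

```python
def llm_judge(answer_a: str, answer_b: str) -> str:
    """Stand-in for a pairwise LLM judgment: return 'A' or 'B'.
    Toy rule: prefer the longer answer (a known judge bias)."""
    return "A" if len(answer_a) >= len(answer_b) else "B"

def agreement_rate(pairs, human_verdicts):
    """Fraction of pairwise comparisons where the LLM judge
    agrees with the human preference label."""
    matches = sum(llm_judge(a, b) == h
                  for (a, b), h in zip(pairs, human_verdicts))
    return matches / len(pairs)
```

Disagreement analysis (step 4) then amounts to inspecting the pairs where the judge and the humans diverge.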

7 of 12

Three big biases stand out:

  • Positional and label bias, e.g., (A) vs (B)
    • Mitigate by shuffling.
  • Length bias
    • LLM judges strongly prefer longer answers.
    • Can be somewhat mitigated by rubrics.
  • “Self-enhancement bias”
    • 5-7% boost in scores for same-family outputs
    • Standard mitigation: judge with a different model family

Li et al. 2024 (CalibraEval) measured this: https://arxiv.org/abs/2410.15393v1
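One common mitigation for the positional bias above can be sketched as: judge each pair in both orders and keep a verdict only when it is consistent, counting the rest as ties. The `judge` stub is hypothetical; its toy rule always prefers whichever answer is shown first, an extreme version of the bias.

```python
def judge(first: str, second: str) -> str:
    """Stub pairwise judge returning 'first' or 'second'.
    Toy rule: always prefer the answer presented first,
    mimicking a fully position-biased judge."""
    return "first"

def debias_by_swapping(answer_a: str, answer_b: str) -> str:
    """Judge in both presentation orders; keep the winner only
    if both orders agree, otherwise report a tie."""
    v1 = judge(answer_a, answer_b)   # A shown in position 1
    v2 = judge(answer_b, answer_a)   # B shown in position 1
    winner1 = "A" if v1 == "first" else "B"
    winner2 = "B" if v2 == "first" else "A"
    return winner1 if winner1 == winner2 else "tie"
```

With this fully biased stub, every comparison correctly collapses to a tie, which is exactly the behavior the order-swap check is designed to expose.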

Guangyuan: 80% agreement?

Kai: How to communicate limitations?

Armita: Just tell the model to avoid bias?

Arya: Game the judge with personas?

Courtney: Self-enhancement → adversary?

Haoyu: Shared inductive bias in judgments?

8 of 12

How Perez Created Evaluation Data

1. Define a behavior you want to test

2. Write a few example test questions by hand

3. Prompt an LLM to generate many more examples in that format

4. Have a second LLM review and filter out bad generations

5. Run eval on target models

6. Analyze results (in his paper: base model vs RLHF model)
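Steps 2-4 of the pipeline can be sketched as below. Both `generate_examples` and `passes_filter` are hypothetical stand-ins: the first for a generator LLM prompted with hand-written seeds, the second for the reviewing LLM that discards bad generations.

```python
def generate_examples(seed_examples: list[str], n: int) -> list[str]:
    """Stand-in for prompting a generator LLM with hand-written
    seed questions and asking for n more in the same format."""
    return [f"Generated question {i} in the style of: {seed_examples[0]}"
            for i in range(n)]

def passes_filter(example: str) -> bool:
    """Stand-in for a second-LLM review that rejects malformed
    or off-topic generations. Toy rule: must look like a question."""
    return "question" in example.lower()

def build_eval_set(seed_examples: list[str], n: int) -> list[str]:
    """Generate candidates, then keep only those the filter approves."""
    return [ex for ex in generate_examples(seed_examples, n)
            if passes_filter(ex)]
```

The key design choice, echoed in the homework slide, is that generation and filtering use different model calls, so one model's failure modes do not silently pass its own review.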

9 of 12

Grace: "uneducated" user behavior - is that sycophancy?

Claire: Sycophancy and its broader effects

Avery: Can we fix RLHF feedback?

Yunus Q2: Why does RLHF amplify political views?

Rice Q1: Are LLM-generated datasets limited by RLHF bias?

Isaac: Say vs. do - what do we learn from yes/no questions?

Jasmine Q4: Failure modes of model-written evals

10 of 12

Evals and interpretability

Eval Type   | Question                      | Example
Capability  | Can it do X?                  | Can the model recall facts?
Behavioral  | How does it do X?             | Does the model give less accurate answers to “uneducated” users?
Mechanistic | Does component M implement X? | Does ablating layer 15 erase knowledge of a fact?

Jasmine Q1: Why do evaluations - what does the knowledge enable?

Yuqi: What do evaluation metrics measure in interp research?

11 of 12

Evals in Interpretability: Research Flow

  • Starting point, establishing feasibility.
    • Cannot study “where does the model do X” until you know “it can do X”.
    • Capability eval to identify models that can perform the task.

  • Finding an interesting hypothesis with specific behavior:
    • “The model understands the relationship between X and Y”
    • Create minimal pairs that show awareness of the relationship

  • Validation:
    • “I found the feature that controls X”
    • Datasets will give you a way to measure X.
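The minimal-pairs idea above can be sketched as: score the same prompt with two continuations that differ only in the critical fact, and check that the model prefers the correct one. `score` is a hypothetical stand-in for a model log-probability; a real version would sum token log-probs from the model under study.

```python
def score(prompt: str, continuation: str) -> float:
    """Stand-in for a model's log-probability of `continuation`
    given `prompt`. Toy lookup table for illustration only."""
    toy = {("The capital of Vermont is", "Montpelier"): -1.0,
           ("The capital of Vermont is", "Albany"): -5.0}
    return toy.get((prompt, continuation), -10.0)

def prefers_correct(prompt: str, correct: str, foil: str) -> bool:
    """Minimal-pair check: identical prompt, two continuations
    differing only in the fact under test."""
    return score(prompt, correct) > score(prompt, foil)
```

Aggregating `prefers_correct` over a dataset of such pairs gives exactly the kind of measurable X that the validation step calls for.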

Jasmine Q2: Minimum viable eval (pelican on a bicycle)

Jasmine Q3: Baselines and controlling confounds

12 of 12

Homework for Thursday Project

  • Try using LLMs to create one or more datasets for your problem
  • Be inspired by Perez: use a different LM to judge, filter and check
  • Always monitor samples manually. Human eval means: count!

Measure capabilities: “Which models can do X?”

Measure behavior: “How do they do X?”