Predictable Noise in LLM Benchmarks [10/27]

Based on the motivation presented, why should we be cautious about a paper claiming a 2.5% improvement on the HumanEval benchmark?
1 point
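
As a rough, illustrative sanity check (not taken from the lecture): HumanEval contains 164 problems, so a model's pass@1 score is a proportion estimated from only 164 pass/fail outcomes. The sketch below, using a made-up baseline accuracy of 0.60, shows that the resulting sampling standard error is already larger than 2.5 percentage points.

```python
import math

n_problems = 164   # number of problems in HumanEval
score = 0.60       # hypothetical baseline pass@1, chosen only for illustration

# Standard error of a proportion estimated from n independent pass/fail outcomes.
se = math.sqrt(score * (1 - score) / n_problems)
print(f"standard error ~ {se:.3f}")            # ~0.038, i.e. ~3.8 percentage points

# A claimed 2.5-point gain is smaller than one standard error in this setting,
# so it could plausibly be explained by sampling noise alone.
print(f"2.5-point gain = {0.025 / se:.2f} standard errors")
```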
The lecture raised the idea that a few "hard" generative problems might be more informative than many easy ones. What did the exploratory data analysis (the heatmaps) reveal about this idea?
1 point
According to the lecture, what was the key finding about a benchmark's statistical noise from analyzing millions of prompt results?
1 point
The speaker introduced the "signal-to-noise ratio" to measure a benchmark's quality. What was the main conclusion from this analysis?
1 point
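
For readers unfamiliar with the term, here is a minimal sketch of one common way to estimate a benchmark's signal-to-noise ratio; it is not necessarily the lecture's exact definition. "Signal" is taken as the spread of mean scores across models, and "noise" as how much a single model's score moves when the benchmark's problems are bootstrap-resampled. All data below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_problems = 5, 164

# Synthetic per-problem outcomes: results[m, i] = True if model m solves problem i.
true_skill = np.linspace(0.45, 0.65, n_models)      # hypothetical model abilities
results = rng.random((n_models, n_problems)) < true_skill[:, None]

# Signal: how far apart the models' benchmark scores are.
scores = results.mean(axis=1)
signal = scores.std(ddof=1)

# Noise: typical movement of one model's score under bootstrap resampling of problems.
def bootstrap_std(row, n_boot=2000):
    idx = rng.integers(0, n_problems, size=(n_boot, n_problems))
    return row[idx].mean(axis=1).std(ddof=1)

noise = np.mean([bootstrap_std(results[m]) for m in range(n_models)])
print(f"signal-to-noise ratio ~ {signal / noise:.2f}")
```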
The attempt to "boost the signal" by reweighting or filtering for harder problems failed. What was the main reason for this failure?
1 point