Pre-read: Benchmarks

Limits to Prediction (Spring 2024)

Arvind Narayanan

The use of widely shared benchmarks has been key to the rapid progress of machine learning. Computational linguist Mark Liberman calls this practice the Common Task Framework. In his lecture (about 35 minutes long), he describes how this approach originated in a DARPA program decades ago in response to a credibility crisis in the field of language technology (now called NLP). Statistician David Donoho observed that “the Common Task Framework is the single idea from machine learning and data science that is most lacking attention in today’s statistical training”. This is from his essay 50 Years of Data Science, which is not on the reading list but is very much worth reading if you have time. Think about how the Common Task Framework helps address some of the pitfalls we discussed on Monday.

In the Datasets chapter of our book on fairness and machine learning, we take a broader look at the role of datasets and benchmarking in machine learning, addressing both the advantages and disadvantages. Most of the chapter is not specifically about fairness.

One pitfall we didn’t discuss on Monday is performance degradation due to distribution shift, simply because I didn’t want to put too many readings on the list (there’s an almost endless variety of pitfalls!). See this investigation for why distribution shift can have life-and-death consequences in real systems. Distribution shift is actively studied in fields such as computer vision and NLP, and there are many learning approaches that aim for robustness to distribution shift. But it is less well understood in social domains. TableShift is a recently introduced benchmark that seeks to change this.
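To make the failure mode concrete, here is a minimal toy sketch of my own (not from any of the readings; it assumes Python with NumPy and scikit-learn). A classifier trained where a “shortcut” feature happens to track the label can look excellent on held-out data from the same distribution, yet degrade badly when that correlation flips at deployment time.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_data(n, shortcut_flipped=False):
    # One stable but noisy feature, plus a "shortcut" feature that tracks the
    # label in the training distribution but flips under distribution shift.
    y = rng.integers(0, 2, size=n)
    stable = y + rng.normal(scale=0.8, size=n)
    shortcut_target = 1 - y if shortcut_flipped else y
    shortcut = shortcut_target + rng.normal(scale=0.1, size=n)
    return np.column_stack([stable, shortcut]), y

X_train, y_train = make_data(5000)
model = LogisticRegression().fit(X_train, y_train)

X_iid, y_iid = make_data(2000)                             # same distribution as training
X_shift, y_shift = make_data(2000, shortcut_flipped=True)  # shifted distribution

print("in-distribution accuracy:", accuracy_score(y_iid, model.predict(X_iid)))
print("shifted accuracy:", accuracy_score(y_shift, model.predict(X_shift)))

In this toy setup the shifted accuracy will likely fall far below the in-distribution number, because the model leans on the shortcut feature. Benchmarks like TableShift institutionalize exactly this kind of train-on-one-distribution, evaluate-on-another protocol.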

In explanatory modeling, pre-registration is an important recent intervention in response to credibility concerns. Could Pre-registration for Predictive Modeling make sense? And how would it interact with the Common Task Framework and other existing best practices for predictive modeling?