Pre-read: Ethics / Limits to prediction (Spring 2024)

Pre-read: Ethics

Limits to prediction (Spring 2024)

Arvind Narayanan

Note: citations in bold are items on the reading list.

The ethics of (predictive) decision making algorithms are studied in many fields including computer science, political philosophy, and law. Here, we’ll take a sociological perspective. Although I am an outsider to sociology, it is the perspective that I’ve found most useful in my own research on this topic.

Sociologists organize discrimination into three levels: structural, organizational, and interpersonal. Structural discrimination arises from the ways in which society is organized, both through relatively hard constraints such as discriminatory laws and through softer ones such as norms and customs. Organizational factors operate at the level of organizations or other decision-making units, such as a company making hiring decisions. Interpersonal factors refer to the attitudes and beliefs that result in discriminatory behavior by individuals. (Pager & Shepherd)

Each of these can shed light on the ethics of predictive decision making.

One of the main selling points of predictive decision making systems is that they can help reduce discrimination by replacing the discretion and prejudices of individual human decision makers (the interpersonal level) with data-driven decisions, data being assumed to be objective. A large body of work has critically examined this claim and identified reasons why predictive systems might nonetheless be discriminatory. For example, criminal risk prediction systems that predict the risk of reoffending are actually trained to predict rearrests, because crime is not directly observable. Arrests, of course, are not an objective measurement of crime. Arguably, in effect, the system replaces the biases of judges with those of the police.
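To make the rearrest point concrete, here is a minimal arithmetic sketch. It assumes two hypothetical groups with the same true reoffense rate but different chances of being arrested after an offense, and it ignores arrests of people who did not reoffend. All numbers and names are invented for illustration; none come from COMPAS or any real dataset.

```python
def observed_rearrest_rate(reoffense_rate, arrest_prob_given_offense):
    """Share of people who end up with a rearrest label, assuming an arrest
    can only follow an actual offense (a simplifying assumption)."""
    return reoffense_rate * arrest_prob_given_offense


# Hypothetical: both groups reoffend at the same true rate.
true_reoffense_rate = 0.30

# Hypothetical: one group is policed twice as heavily as the other.
rate_lightly_policed = observed_rearrest_rate(true_reoffense_rate, 0.2)
rate_heavily_policed = observed_rearrest_rate(true_reoffense_rate, 0.4)

print(f"lightly policed group: {rate_lightly_policed:.2f} rearrest rate")
print(f"heavily policed group: {rate_heavily_policed:.2f} rearrest rate")
```

Under these assumptions, a model trained on rearrest labels would see the heavily policed group as roughly twice as risky, even though the underlying behavior is identical by construction.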

Racial disparities in criminal risk prediction were famously examined by a team of investigative journalists from ProPublica in 2016 (Angwin, Larson, Mattu, Kirchner). They were able to use a Freedom of Information Act request to obtain data on risk scores assigned to defendants in Broward County, Florida by a system called COMPAS.

There are more layers to the analysis of disparities in criminal risk prediction, and we will return to it below.

The interpersonal level is limited in how much it can tell us about the ethics of predictive decision making. As we (Barocas, Hardt, Narayanan) explained in our textbook on fairness and machine learning:

While machine learning is sometimes used to automate the tasks performed inside a human’s head, many of the high-stakes decisions that are the focus of the work on fairness and machine learning are those that have been traditionally performed by bureaucracies. For example, hiring, credit, and admissions decisions are rarely left to one person to make on their own as they see fit. Instead, these decisions are guided by formal rules and procedures, involving many actors with prescribed roles and responsibilities. Bureaucracy arose in part as a response to the subjectivity, arbitrariness, and inconsistency of human decision making; its institutionalized rules and procedures aim to minimize the effects of humans’ frailties as individual decision makers. (Weber).

Of course, bureaucracies aren’t perfect. The very term bureaucracy tends to have a negative connotation — a needlessly convoluted process that is difficult or impossible to navigate. And despite their overly formalistic (one might say cold) approach to decision making, bureaucracies rarely succeed in fully disciplining the individual decision makers that occupy their ranks. Bureaucracies risk being equally capricious and inscrutable as humans, but far more dehumanizing. (Weber).

That’s why bureaucracies often incorporate procedural protections: mechanisms that ensure that decisions are made transparently, on the basis of the right and relevant information, and with the opportunity for challenge and correction. Once we realize that machine learning is being used to automate bureaucratic rather than individual decisions, asserting that humans don’t need to — or simply cannot — account for their everyday decisions does not excuse machine learning from these expectations. As Katherine Strandburg has argued, “[r]eason giving is a core requirement in conventional decision systems precisely because human decision makers are inscrutable and prone to bias and error, not because of any expectation that they will, or even can, provide accurate and detailed descriptions of their thought processes”. (Strandburg)

In other words, it’s important to look at how predictive systems fit into, or change, decision making in organizations. In a sense, the introduction of automated decision making can be seen as a continuation of the process of increasing rationalization that Weber identified a century ago (i.e. decision making by bureaucracy replacing customs and traditions or charismatic individuals). On the other hand, automated decision making might represent a step backwards by stripping away the procedural protections that bureaucracies have in place, while also worsening some of the drawbacks of bureaucracies (such as their “coldness”).

I view this lack of procedural protection as the pre-eminent ethical challenge of predictive decision making and automated decision making in general. These protections are often called “due process”, especially in the government context. Many people argue that machine-learning-based predictive systems are black boxes, and so inherently frustrate such protections. But I think this is wrong, for two reasons. As Strandburg notes above, we have due process for human decision making despite humans being black boxes. And we know very well how to design machine learning systems that are not black boxes from the perspective of designing procedural protections; whether or not we can explain the weights of a neural network turns out to be mostly irrelevant. (Kroll)

Empirically, though, the failure of due process in automated decision making is a consistent pattern. Exactly why is unclear. My hypothesis is that in many cases, designing adequate procedural protections (hearings, appeals, etc.) would negate the putative efficiency benefits of automated decision making. Danielle Citron gives a detailed account of how due process fails in automated decision making by the administrative state, and attributes it to the failure to update the legal concept of due process to keep up with technological developments. (Citron)

Bureaucratic decision making consists of two complementary processes: policymaking and adjudication. The former is about deciding the rules, criteria, or policy for making decisions, and the latter is about applying it to specific cases. For example, a college admissions body needs to first decide what kind of student body they want, what constitutes academic merit, and so forth. Then they have to debate each applicant based on those established criteria. It is critical to recognize that machine learning automates both of these processes (via training and inference, respectively), and so two independent sets of procedural protections are potentially stripped away.

In addition, so-called street-level bureaucrats perform the crucial work of interpreting high-level policy in specific situations. The concept originates from Michael Lipsky. “Typical street-level bureaucrats are teachers, police officers and other law enforcement personnel, social workers, judges, public lawyers and other court officers, health workers, and many other public employees who grant access to government programs and provide services within them.” (Lipsky) He explains why discretion, although sometimes problematic, is an indispensable feature of this work.

... street-level bureaucrats work in situations that often require responses to the human dimensions of situations. They have discretion because the accepted definitions of their tasks call for sensitive observation and judgment, which are not reducible to programmed formats. It may be that uniform sentencing would reduce inequities in the criminal justice system. But we also want the law to be responsive to the unique circumstances of individual transgressions. We want teachers to perceive the unique potential of children. In short, to a degree the society seeks not only impartiality from its public agencies but also compassion for special circumstances and flexibility in dealing with them.

When automated decision making or decision support systems are introduced in these contexts, even if they do not eliminate the role of the street-level bureaucrat, they are almost always in direct tension with the discretion afforded to them. Decreasing this discretion has both benefits and drawbacks.

All of this sets the stage for the next pair of readings. In “What is the bureaucratic counterfactual?” Rebecca Johnson and Simone Zhang analyze the policymaking process in traditional social service provision. They introduce a model called categorical prioritization to describe how this process works. More clearly understanding the traditional process allows for a clearer comparison of automated decision making with the counterfactual of bureaucratic decision making. (Johnson & Zhang)

That brings us to the Against Predictive Optimization paper (Wang, Kapoor, Barocas, Narayanan), which is partly a response to two papers: Johnson & Zhang’s paper and the Prediction Policy Problems paper (they are both great papers, but we present a different perspective). The origin of this paper was my struggle during the previous edition of this course to articulate why (1) descriptively, predictive decision making keeps going wrong and (2) normatively, there is something fundamentally dubious about it. A lot of empirical work was needed to answer these questions, led by graduate students Angelina Wang and Sayash Kapoor. Some things worth keeping in mind:

So far we’ve mainly talked about organizations and bureaucracy. But predictive decision making can have structural effects. If adopted pervasively within an institution, it can change the nature of that institution. In criminal justice, rehabilitation of offenders is an intervention problem (e.g. who will benefit most from rehabilitative efforts), whereas incarceration can be formulated as a pure prediction problem (who presents the most risk). In other words, the automatability of risk prediction morphs the system in a more punitive direction. In our fairness and machine learning textbook, we briefly talk about machine learning and structural discrimination. (Barocas, Hardt, Narayanan)

Now let’s go back to the ProPublica investigation of the COMPAS algorithm. It was historically significant as it led to the algorithmic fairness research community coalescing. One early observation in this community was that there are many ways to define fairness in a statistical sense, and in essentially any realistic situation, you can’t achieve them simultaneously (Angwin & Larson). Even if you eliminated the biases of policing, it would seem like an algorithm like COMPAS can’t possibly be unbiased. What’s going on?

A sociological perspective again clarifies the situation. What these “impossibility theorems” are fundamentally getting at is the difference between structural discrimination and organizational discrimination. In a society with deep background conditions of injustice, both historical and present, people from different social classes will appear very different on average to a decision maker, even if they are equally morally deserving. Thus, there is an inescapable tension between an organization practicing nondiscrimination in a narrow, formal sense, and actually accounting for these structural factors. This tension has nothing to do with the nature of the decision making; it equally exists whether it is an automated system or a bureaucracy or an individual decision maker. (This tension roughly parallels ongoing societal debates around equity.)
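The statistical side of this tension can be illustrated with a small worked example. The sketch below uses an identity that follows directly from the definitions of the confusion-matrix quantities: once a group's base rate, positive predictive value (PPV), and false negative rate (FNR) are fixed, its false positive rate (FPR) is determined. The two groups and all numbers are hypothetical, chosen only to show the arithmetic; this is one common formulation of the incompatibility, not a reproduction of any particular paper's analysis.

```python
def implied_fpr(base_rate, ppv, fnr):
    """False positive rate forced by the other three quantities.

    Derivation: TP = (1 - fnr) * base_rate and FP = fpr * (1 - base_rate)
    (as fractions of the group), and PPV = TP / (TP + FP). Solving for fpr
    gives the expression below.
    """
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)


# Hypothetical groups that differ only in how often the outcome is observed.
# The score is equally "trustworthy" (same PPV) and misses positives equally
# often (same FNR) in both groups.
fpr_group_a = implied_fpr(base_rate=0.5, ppv=0.6, fnr=0.4)
fpr_group_b = implied_fpr(base_rate=0.3, ppv=0.6, fnr=0.4)

print(f"group A false positive rate: {fpr_group_a:.3f}")
print(f"group B false positive rate: {fpr_group_b:.3f}")
```

With equal PPV and equal FNR, the group with the higher base rate necessarily gets the higher false positive rate. Equalizing all three quantities at once would require equal base rates or a perfect predictor, which is why the disagreement is about which criterion matters rather than about a fixable bug.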

The next reading puts this tension in sharp focus. It is a study that reveals bias in a predictive model used in healthcare (Obermeyer et al.). The problem, in short, is that the system used healthcare costs as a proxy for needs. Since the healthcare system has a history of spending less on Black patients, the system picks up and perpetuates this inequality. In response to the study, the company in question, Optum, said that the model did its job: to predict cost. That is correct. But in deploying this model to prioritize patient care, hospitals end up discriminating along racial lines. This leads to some interesting questions (that we can’t necessarily answer based on the study):

Highlighting structural discrimination and its perpetuation is not an excuse for passivity by treating racism as something that simply exists and reproduces itself without an agent. In the final reading, a commentary on the Obermeyer et al. paper, Ruha Benjamin makes this powerfully clear (Benjamin):

Labels matter greatly, not only in algorithm design but also in algorithm analysis. Black patients do not “cost less,” so much as they are valued less (9). It is not “something about the interactions that Black patients have with the healthcare system” that leads to poor care, but the persistence of structural and interpersonal racism. Even health care providers hold racist ideas, which are passed down to medical students despite an oath to “do no harm” (10). The trope of the “noncompliant (Black) patient” is yet another way that hospital staff stigmatize those who have reason to question medical authority (11, 12). But a “lack of trust” on the part of Black patients is not the issue; instead, it is a lack of trustworthiness on the part of the medical industry (13). The very designation “Tuskegee study” rather than the official name, U.S. Public Health Service Syphilis Study at Tuskegee, continues to hide the agents of harm. Obermeyer et al. mention some of this context, but passive and sanitized descriptions continue to hide the very social processes that make their study consequential. Labels matter.

A final note. An unusual number of the readings and citations above have Princeton connections. Devah Pager and Hana Shepherd were at Princeton sociology when they wrote their paper, as were Rebecca Johnson and Simone Zhang (Johnson is also an author of the predictive estimands paper we discussed in class). Joshua Kroll is a Princeton PhD, and Ruha Benjamin is a professor (and director of the Ida B. Wells JUST Data Lab). Surya Mattu, one of the ProPublica data journalists, is now at Princeton (at the Center for Information Technology Policy, where he heads the Digital Witness Lab). The Against Predictive Optimization paper and parts of the fairness and machine learning textbook owe their existence to the previous edition of this course.

I’m probably biased in my selection of readings, but only slightly — I do think these are some of the most influential works in this area. I feel fortunate to have benefited from the company and collaboration of so many extraordinary scholars, and look forward to contributing to the Princeton community through this course.