Designing Equitable Algorithms for Criminal Justice Reform

Blog Post
Nov. 28, 2018

This is part of The Ethical Machine: Big ideas for designing fairer AI and algorithms, an ongoing series about AI and ethics, curated by Dipayan Ghosh, a former Public Interest Technology fellow. You can see the full series on the Harvard Shorenstein Center website.

SAM CORBETT-DAVIES
RESEARCHER, STANFORD COMPUTATIONAL POLICY LAB

SHARAD GOEL
ASSISTANT PROFESSOR, DEPARTMENT OF MANAGEMENT SCIENCE & ENGINEERING, STANFORD UNIVERSITY AND EXECUTIVE DIRECTOR, STANFORD COMPUTATIONAL POLICY LAB

We frequently rely on professionals to make predictions about human behavior, informing decisions from medicine to criminal justice. But research over the last several decades shows that intuitive judgments are often inferior to statistically informed assessments [1-3]. As a result, decision-makers are turning to algorithmic assessments to guide important choices.

One of the most consequential and controversial uses of algorithms is in the criminal justice system. When deciding which defendants to release before their trials, judges in many jurisdictions now consult risk assessment algorithms that aim to quantify the likelihood that a defendant will engage in violent crime or fail to appear in court if released.

Some might hope that such algorithmic decision aids can eliminate troubling human biases. Others might worry that these risk assessments simply add a veneer of objectivity, and may even exacerbate historical disparities. The reality is in between. When risk assessment tools are developed with scientific rigor and public oversight, they have an important role to play in a broader movement to reform the criminal justice system.

Assessing the Efficacy of Risk Assessments

Research suggests that risk assessment algorithms are, at least in theory, better than judges at determining which defendants pose a flight risk or a threat to public safety, allowing jurisdictions to reduce jail populations without increasing the frequency of adverse outcomes. In our own work, we estimate that following a simple, statistically informed rubric—based only on a defendant’s age and number of previously missed court dates—would allow judges to require bail from 30 percent fewer defendants, without increasing the rate at which defendants fail to appear in court [4].
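
To illustrate what such a rubric might look like in practice, here is a minimal sketch in Python. The point weights and release cutoff below are hypothetical placeholders chosen for illustration; they are not the values estimated in our work, which would be fit to a jurisdiction’s own historical data.

    def rubric_score(age, prior_failures_to_appear):
        """Hypothetical 'simple rules' score: younger defendants and those with
        more missed court dates receive more points (weights are illustrative)."""
        score = 0
        if age < 25:
            score += 2
        elif age < 35:
            score += 1
        score += min(prior_failures_to_appear, 3)  # cap the contribution of prior misses
        return score

    def recommend_release(age, prior_failures_to_appear, cutoff=3):
        """Recommend release when the score falls below a cutoff that, in practice,
        would be chosen from historical data to hold failure-to-appear rates steady."""
        return rubric_score(age, prior_failures_to_appear) < cutoff

    # Example: a 40-year-old defendant with one prior missed court date
    print(recommend_release(age=40, prior_failures_to_appear=1))  # True under these illustrative weights

Because a rule this short can be applied by hand, judges, attorneys, and defendants can see exactly how a recommendation was reached.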

Risk assessment tools also have the potential to bring consistency to what has typically been a haphazard process. Without such tools, judges must rely on intuition to make these high-stakes decisions, typically basing their judgments on short meetings with defendants that are often conducted via video conference. That process can produce arbitrary results, with some judges more than twice as likely to demand bail as others [5].


Despite these theoretical advantages, it’s hard to evaluate the effects of risk assessments in practice. In part this is because risk assessments are often introduced in conjunction with other changes to the bail system—such as limiting the use of money bail altogether—making it difficult to isolate the effects of the risk assessments themselves [6]. Further, simply giving judges access to such tools doesn’t mean that they’ll follow the recommendations, diminishing the potential for positive impact [7].

Nevertheless, there’s reason to be optimistic that risk assessments can improve outcomes. For example, an experiment in Virginia found that when randomly selected pretrial services agencies were given a risk assessment tool, defendants were 60 percent more likely to be released than under the old system, with no apparent increase in pretrial crime [8].

Fairness and False Positives

Despite such benefits, there are concerns that these tools, designed by fallible humans with their own prejudices, might entrench the biases embedded in the historical data used to build them.

One of the most prominent critiques of risk assessments was raised by journalists at ProPublica who investigated the use of the COMPAS algorithm in Broward County, Florida [9]. They found that, on average, black defendants received higher risk scores than white defendants. As the journalists noted, this pattern does not necessarily imply that the scores are biased—it may simply reflect the fact that black defendants in Broward County are rearrested more often than whites [10].

But after tracking defendants for two years following their COMPAS assessments and recording who was arrested for new crimes, the journalists found that the tool erred in different ways for blacks and whites. In particular, black defendants faced a higher false positive rate: among defendants who were not rearrested, a greater share of blacks than whites had been rated high-risk by the algorithm.

This finding was widely interpreted as evidence that the risk assessment tool was biased against blacks. Surprisingly, however, differing error rates do not imply that an algorithm is biased, a point we have made repeatedly in our own research [11, 12]. The false positive rate depends not only on how the algorithm scores individuals, but also on the distribution of risk within each group. Indeed, one could reduce the false positive rate for black defendants, and so make the algorithm look less “biased,” by arresting more low-risk black individuals: adding low-risk people to the pool of black defendants mechanically lowers the group’s false positive rate without changing how anyone is scored. Such paradoxical results are one reason that error rates are a problematic measure of fairness.
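
To make this concrete, the short simulation below is a minimal sketch in Python; the beta-distributed risk profiles, the 0.5 detention threshold, and the group labels are all made-up illustrations, not parameters of any real tool. A single, perfectly calibrated scoring rule is applied to two hypothetical groups: the group with the higher overall risk ends up with a higher false positive rate, and adding low-risk defendants to that group lowers its measured rate, even though no individual is treated differently.

    import numpy as np

    rng = np.random.default_rng(0)

    def false_positive_rate(draw_risk, threshold=0.5, n=100_000):
        """Simulate defendants whose true reoffense probabilities come from
        `draw_risk`, score them with a perfectly calibrated model (score equals
        true probability), and label everyone at or above `threshold` high risk.
        Returns the share of non-reoffenders who were labeled high risk."""
        p = draw_risk(n)                          # true reoffense probability per defendant
        reoffends = rng.random(p.shape) < p       # realized outcomes
        high_risk = p >= threshold                # one threshold for everyone
        fp = np.sum(high_risk & ~reoffends)
        tn = np.sum(~high_risk & ~reoffends)
        return fp / (fp + tn)

    # Two hypothetical groups scored by the SAME calibrated rule, differing only
    # in their distribution of underlying risk (group B has the higher base rate).
    group_a = lambda n: rng.beta(2, 5, n)         # mean risk roughly 0.29
    group_b = lambda n: rng.beta(4, 5, n)         # mean risk roughly 0.44

    print("FPR, group A:", round(false_positive_rate(group_a), 3))
    print("FPR, group B:", round(false_positive_rate(group_b), 3))

    # Adding many low-risk defendants to group B (for example, through arrests
    # for minor offenses) lowers its measured FPR without changing how any
    # individual is scored.
    group_b_padded = lambda n: np.concatenate([rng.beta(4, 5, n), rng.beta(1, 9, n)])
    print("FPR, group B with added low-risk defendants:",
          round(false_positive_rate(group_b_padded), 3))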

Designing Fair Algorithms

Unfortunately, there’s no single metric for determining whether an algorithm is fair. But there are at least two important considerations to keep in mind when designing equitable risk assessment tools.

First, risk factors might not be equally predictive for all groups. For example, women in Broward County are significantly less likely to be arrested for future crimes than men with similar characteristics. An algorithm that evaluates women as if they were men can systematically overestimate the recidivism risk of women, exposing them to unnecessary detention [13]. Some jurisdictions have addressed this issue by using gender-specific risk assessments, but many still use tools plagued by such problems.
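
One way to see the remedy is to fit and calibrate the model separately for each group, so that a given set of risk factors is converted into a probability using that group’s own outcomes. The sketch below uses scikit-learn on synthetic data; the features, coefficients, and group labels are made up for illustration and do not come from any real pretrial dataset.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Synthetic training data: X holds risk factors (say, age and prior arrests),
    # y holds observed rearrest outcomes, and `group` marks each defendant's group.
    rng = np.random.default_rng(1)
    n = 5000
    X = rng.normal(size=(n, 2))
    group = rng.integers(0, 2, size=n)                    # 0 and 1 are illustrative groups
    logit = 0.8 * X[:, 0] + 0.5 * X[:, 1] - 0.7 * group   # same factors, lower risk for group 1
    y = rng.random(n) < 1 / (1 + np.exp(-logit))

    # Group-specific models: each group's risk factors are mapped to probabilities
    # using only that group's outcomes, so one group's base rate can't distort the other's.
    models = {g: LogisticRegression().fit(X[group == g], y[group == g]) for g in (0, 1)}

    def predicted_risk(x, g):
        return models[g].predict_proba(np.asarray(x).reshape(1, -1))[0, 1]

    x_new = [0.3, -0.1]
    print("risk if scored with group 0's model:", round(predicted_risk(x_new, 0), 3))
    print("risk if scored with group 1's model:", round(predicted_risk(x_new, 1), 3))

An equivalent approach keeps a single model but adds interaction terms between group membership and each risk factor; either way, the point is that women’s outcomes, not men’s, determine how women’s risk factors are interpreted.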

Second, we must be wary of biased measurements of what we’re trying to predict [14]. For example, we only observe whether defendants are rearrested before their trial, but we really want to know whether they’ve committed a new crime. Because of America’s racially disparate policing patterns, in some cases blacks are more likely to be arrested than whites who’ve committed the same offense—most notably for minor drug crimes [15].

As a result, an algorithm designed to predict drug arrests (rather than drug crimes) would overestimate the probability that black defendants will commit a further offense. Some jurisdictions, including New York City, avoid this problem by focusing on whether a defendant fails to appear in court, an outcome that can be perfectly observed and is therefore immune to such measurement bias. Other risk assessment tools focus on predicting violent crimes since, unlike drug crimes, they appear to be less susceptible to this kind of racial bias [16]. But there are still many jurisdictions that use algorithmic tools to predict arrests for any offense, including petty street crimes, potentially introducing racial bias into their assessments.
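
The size of the distortion follows directly from the arrest probabilities. The toy simulation below uses hypothetical numbers chosen only to illustrate the mechanism: two groups with identical offense rates but different chances that an offense leads to an arrest. A model trained on arrest labels would rank one group as roughly twice as risky as the other.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 100_000

    # Hypothetical setup: both groups commit new (drug) offenses at the same rate,
    # but an offense by a member of group B is twice as likely to lead to an arrest.
    offense_rate = 0.30
    arrest_given_offense = {"A": 0.25, "B": 0.50}   # illustrative policing disparity

    for g, p_arrest in arrest_given_offense.items():
        offends = rng.random(n) < offense_rate
        arrested = offends & (rng.random(n) < p_arrest)
        print(f"group {g}: true offense rate = {offends.mean():.2f}, "
              f"observed arrest rate = {arrested.mean():.2f}")

    # A model trained with arrests as the outcome label learns the observed arrest
    # rates (about 0.075 vs. 0.15), so it treats group B as roughly twice as risky
    # even though the underlying offense rates are identical.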

Moving Forward

The use of risk assessments demands transparency. Policymakers, researchers, and the public at large should be given enough information to understand exactly how such algorithms are created and how they are used. Such openness not only fosters trust, but also helps to ensure that the best methods are used to build and evaluate these consequential tools.

When weighing the promise and perils of risk assessments, it’s important to remember that the algorithms themselves are but one piece in a complex criminal justice system. Even if we can predict—accurately and without bias—who will fail to appear in court, that doesn’t mean we should necessarily demand money bail as a condition of releasing “high-risk” defendants. Defendants, their families, and society more generally may be better served by alternative interventions, such as text reminders, transportation vouchers, or electronic monitoring. These policy decisions remain important regardless of whether algorithms play a greater role in the pretrial system.

Alongside more comprehensive reforms, algorithmic risk assessments have the potential to make the criminal justice system more equitable. But they must be developed, evaluated, and implemented with care that reflects the seriousness of the decisions they help guide.

References

  1. Robyn M. Dawes, David Faust, and Paul E. Meehl, “Clinical versus Actuarial Judgment,” Science 243, no. 4899 (1989): 1668–1674.
  2. William M. Grove et al., “Clinical versus Mechanical Prediction: A Meta-Analysis,” Psychological Assessment 12, no. 1 (2000): 19–30.
  3. Stefanía Ægisdóttir et al., “The Meta-Analysis of Clinical Judgment Project: Fifty-Six Years of Accumulated Research on Clinical versus Statistical Prediction,” The Counseling Psychologist 34, no. 3 (2006): 341–382.
  4. Jongbin Jung et al., “Simple Rules for Complex Decisions” (2017), available at https://arxiv.org/abs/1702.04690.
  5. Jon Kleinberg et al., “Human Decisions and Machine Predictions,” The Quarterly Journal of Economics 133, no. 1 (2018): 237–293.
  6. Lisa Foderaro, “New Jersey Alters Its Bail System and Upends Legal Landscape,” The New York Times, February 6, 2017, https://www.nytimes.com/2017/02/06/nyregion/new-jersey-bail-system.html.
  7. Megan T. Stevenson, “Assessing Risk Assessment in Action,” Minnesota Law Review 103 (2017).
  8. Mona Danner, Marie VanNostrand, and Lisa Spruance, “Risk-Based Pretrial Release Recommendation and Supervision Guidelines” (2015).
  9. Julia Angwin et al., “Machine Bias,” ProPublica, May 23, 2016, https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.
  10. Jeff Larson et al., “How We Analyzed the COMPAS Recidivism Algorithm,” ProPublica, May 23, 2016, https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm.
  11. Sam Corbett-Davies and Sharad Goel, “The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning” (2018), available at https://arxiv.org/abs/1808.00023.
  12. Sam Corbett-Davies et al., “Algorithmic Decision Making and the Cost of Fairness,” Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2017).
  13. Jennifer Skeem, John Monahan, and Christopher Lowenkamp, “Gender, Risk Assessment, and Sanctioning: The Cost of Treating Women Like Men,” Law and Human Behavior 40, no. 5 (2016): 580–593.
  14. Kristian Lum, “Limitations of Mitigating Judicial Bias with Machine Learning,” Nature Human Behaviour 1 (2017).
  15. Ian Urbina, “Blacks Are Singled Out for Marijuana Arrests, Federal Data Suggests,” The New York Times, June 3, 2013, https://www.nytimes.com/2013/06/04/us/marijuana-arrests-four-times-as-likely-for-blacks.html.
  16. Jennifer Skeem and Christopher Lowenkamp, “Risk, Race, and Recidivism: Predictive Bias and Disparate Impact,” Criminology 54, no. 4 (2016): 680–712.