Does Predictive Policing Lead to Biased Arrests? Results From a Randomized Controlled Trial

ABSTRACT Racial bias in predictive policing algorithms has been the focus of a number of recent news articles, statements of concern by several national organizations (e.g., the ACLU and NAACP), and simulation-based research. There is reasonable concern that predictive algorithms encourage directed police patrols to target minority communities with discriminatory consequences for minority individuals. However, to date there have been no empirical studies on the bias of predictive algorithms used for police patrol. Here, we test for such biases using arrest data from the Los Angeles predictive policing experiments. We find that there were no significant differences in the proportion of arrests by racial-ethnic group between control and treatment conditions. We find that the total numbers of arrests at the division level declined or remained unchanged during predictive policing deployments. Arrests were numerically higher at the algorithmically predicted locations. When adjusted for the higher overall crime rate at algorithmically predicted locations, however, arrests were lower or unchanged.


Introduction
Place-based predictive policing is based on two core ideas: (1) mathematical forecasting methods can be used to anticipate future crime risk in narrowly prescribed geographic areas; and (2) the delivery of police resources to those prediction locations disrupts the opportunity for crime (Bowers, Johnson, and Pease 2004; Mohler et al. 2011). Randomized controlled experiments of predictive policing conducted in Los Angeles provided evidence that algorithmic methods not only predict twice as much crime as existing best practice, but also double the amount of crime prevented (Mohler et al. 2015). While this treatment effect can be measured in the field, the specific mechanisms by which predictive policing delivers greater crime reduction are not immediately obvious.
The prevailing view, derived from experiments in hot spot policing (Sherman and Eck 2002; Braga and Bond 2008), is that the presence of police in a given place removes opportunities for crime even without any direct contact with potential offenders (Sherman and Weisburd 1995; Weisburd 2008; Loughran et al. 2011). This general deterrent effect persists for some time after police have departed (Koper 1995; Sherman and Weisburd 1995) and appears to diffuse into nearby areas where the police were not concentrating their efforts (Clarke and Weisburd 1994; Weisburd et al. 2006; Telep et al. 2014). General deterrence is not the only mechanism by which crime might be prevented by police patrol, however. Direct interference via stops, searches, detentions short of arrest, and arrest may prevent crime by physically incapacitating potential offenders (Sherman and Eck 2002; Weisburd and Eck 2004). This use of selective incapacitation may have immediate effects on crime (Wyant et al. 2012), especially if prolific offenders are the ones being arrested. Incapacitation may have longer term effects if those prolific offenders are subsequently removed from the community.
Considerable evidence suggests that explicit and implicit bias can have a major impact on who gets stopped, searched, and detained. Reasonable concern therefore exists that predictive policing can exacerbate such biases and reinforce any tendency for police to target minority individuals and communities (Ferguson in press). Such concern exists even if the forecasting methods used to drive predictive policing refrain from incorporating data that would be an explicit source of bias. If predictive policing indirectly exacerbates bias, any crime control benefits would need to be weighed in terms of their discriminatory costs. In the worst case, documented benefits might be derived solely from bias induced by predictions. In other words, predictions absent such bias would yield no crime control benefits at all. Here, we seek to evaluate whether predictive policing leads to patterns of arrest biased against minority individuals.
Racial bias of predictive policing algorithms has been the focus of a number of recent news articles (https://www.sciencenews.org/blog/science-public/data-driven-crime-prediction-fails-erase-human-bias; https://www.pbs.org/newshour/nation/column-big-data-analysis-police-activity-inherently-biased), and concerns have been raised by several national organizations (for example the ACLU and NAACP) (https://www.aclu.org/other/statement-concern-about-predictive-policing-aclu-and-16-civil-rights-privacy-racial-justice) and recent research articles (Brayne 2017; Jefferson 2017; Ferguson in press). With regard to place-based predictive policing methods that forecast a time and location where a crime may occur, the concern is that racially biased police practices may be directed toward some areas rather than others (Ferguson in press). In addition, knowing that they are in a prediction area may heighten the awareness of police officers in ways that amplify biases (Ferguson 2012). That is, a minority individual observed in a prediction area may be more likely to be subject to biased police actions than the same individual observed outside of a prediction area. Lum and Isaac (2016) conducted a simulation study of predictive policing focused on drug arrests in Oakland, CA. Their goal was to ascertain whether racially biased outcomes are possible, or even amplified, with place-based predictive policing methods. The algorithm Lum and Isaac analyze is a space-time Hawkes process, the method used in the Los Angeles Predictive Policing Experiment (Mohler et al. 2015). In particular, given a city grid indexed by $n$, the probabilistic rate $\lambda_n(t)$ of events in cell $n$ at time $t$ is determined by
$$\lambda_n(t) = \mu_n + \sum_{t_n^i < t} \theta \omega e^{-\omega (t - t_n^i)},$$
where $t_n^i$ are the times of events in cell $n$ in the history of the process, $\mu_n$ is a baseline rate of events, and $\theta \omega e^{-\omega t}$ reflects the increase in risk following a recent crime.
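As a rough illustration of how this rate behaves, the following minimal sketch evaluates $\lambda_n(t)$ for a single grid cell. The function name `hawkes_intensity`, the parameter values, and the event times are hypothetical choices made for illustration; they are not the fitted values from the Los Angeles experiment.

```python
import math

def hawkes_intensity(t, event_times, mu, theta, omega):
    """Rate lambda_n(t) for one grid cell, given that cell's past event times."""
    # Each past event adds a contribution that decays exponentially with age.
    triggered = sum(
        theta * omega * math.exp(-omega * (t - t_i))
        for t_i in event_times
        if t_i < t
    )
    return mu + triggered

# Hypothetical values, for illustration only.
mu, theta, omega = 0.1, 0.5, 0.2
events = [1.0, 3.0]  # times of past events in the cell

quiet = hawkes_intensity(0.5, events, mu, theta, omega)       # before any event
just_after = hawkes_intensity(3.01, events, mu, theta, omega)  # right after an event
much_later = hawkes_intensity(60.0, events, mu, theta, omega)  # long after both events
```

Under these assumed values the rate equals the baseline before any event, jumps immediately after an event, and decays back toward the baseline as the event ages.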
Lum and Isaac showed that if the events $t_n^i$ correspond to racially biased drug arrests, then $\lambda_n(t)$ will increase after an arrest, leading to more police resources deployed to cell $n$ in the future. Thus, a feedback loop may be possible where more arrests then occur in cell $n$, leading to a further increase in $\lambda_n$. A similar concern is raised by Ferguson (2017), who notes that an arrest in a prediction area "memorializes" that location as "hot," which guarantees that it will show up again as a prediction area producing further arrests. Selbst (2017a) also warns that racially biased outcomes may become entrenched in place-based predictive policing given data collected via discriminatory policing practices. Constitutional policing tenets such as reasonable suspicion and probable cause may offer little protection against bias embedded in the data (Degeling and Berendt 2017; Selbst 2017b). Though all of these studies deal with hypothetical scenarios or thought experiments, they succeed in demonstrating that careful attention needs to be paid to whether predictive policing produces biased arrests (Moses and Chan 2016; Brayne 2017; Degeling and Berendt 2017; Jefferson 2017).
In practice, the majority of hotspot and place-based predictive policing algorithms focus not on arrests, but on crimes predominantly reported to the police by the public (e.g., robbery, burglary, assault) (Johnson n.d.; Black 1970; Mohler et al. 2015; Ferguson 2017). The goal is to send police resources to areas where crimes have been reported by victims, thereby preventing future crimes in those areas. While a feedback loop for reported crime may be possible, in this case the self-reinforcement is toward places where citizens are placing calls for service. Therefore, we focus on the question of whether predictive policing produces arrests biased against minorities when the inputs to the system are reported crime incidents, rather than arrests. We run a set of hypothesis tests on empirical arrests recorded during the Los Angeles predictive policing experiment. We ask three related questions: (1) Did arrest of minority individuals differ between control and treatment conditions in test divisions? (2) Did arrest rates overall differ between control and treatment conditions in test divisions? and (3) Did the rate of arrests per crime differ across treatment and control conditions?

Predictive Policing Experiments in Los Angeles
A randomized controlled trial of predictive policing was conducted in three divisions of the Los Angeles Police Department (LAPD) between November 2011 and January 2013. The three participating LAPD divisions were Foothill (FH), North Hollywood (NH), and Southwest (SW). Only a brief outline of the experiment is presented here. Additional details of the algorithmic procedures, experimental design, and main effects are presented in Mohler et al. (2015).
Each day of the experiment police patrol officers were handed patrol maps with 20 target areas marked as 500 × 500 foot boxes. Officers were informed that the target areas were locations where the risk of crime was highest for their shift. They were encouraged to patrol target areas during any available discretionary time. What officers did not know was that the mission maps distributed to them each day were designed either by an algorithmic forecasting method (see Mohler et al. 2011, 2015) or by an analyst from within the division using all of the technological and intelligence assets at their disposal. Which mission map officers received on any given day was randomized, creating a treatment condition (algorithmic forecast) and a control condition (analyst forecast). In this repeated-measures experimental design, treatment days were considered exchangeable with control days (Mohler et al. 2015).
The outcome of interest was the difference in reported crime between control and treatment days. The crime types targeted were burglary, car theft, and burglary theft from vehicle (BTFV). Historically, these crime types account for as much as 60% of the serious crime in the City of Los Angeles. In addition to this outcome measure, we collected information on the amount of time police officers spent in prediction areas under each of the experimental conditions (Mohler et al. 2015). Officers used their in-car computer terminals to register when they were entering and exiting prediction locations. This "dosage" was aggregated by day for a total amount of time (in minutes) spent in prediction locations.
Across the three test divisions, patrol officers using the algorithmic predictions produced an average 7.4% drop in crime at mean patrol dosage. By contrast, use of the best-practice predictions produced an average 3.5% drop in crime at mean patrol dosage. Individually, the decrease in crime associated with algorithmic predictions was statistically significant, while that with best-practice predictions was not. The evidence presented in Mohler et al. (2015) suggests that police patrol, when influenced by accurate predictions about the timing and location of crime, may provide some additional crime deterrence value. However, the standard errors of the estimates are relatively large making the slopes difficult to distinguish statistically and the precision on these rates is insufficient to conclude that the algorithmic forecasting crime rate drop is greater than the analyst crime rate drop.

Defining Control and Treatment Days
Control and treatment missions were designed independently, but in parallel, each day of the experiment. Recall that treatment missions were based on algorithmic forecasting, while the control missions were based on existing best practice of analysts. Once mission designs were finalized, a control or treatment mission was chosen randomly for deployment. This randomization was done independently each day for each division taking part in the experiment. On occasion, the analyst was not present on a randomly designated control day and therefore control missions were not available for those days. We exclude these days from the analysis to ensure a fair comparison. In Foothill Division, there were a total of 124 test days with successful random assignment, after discarding days on which the analyst was not present to design control missions. The 124 test days were evenly divided with 62 control and 62 treatment days. There were 152 total test days in North Hollywood Division. These included 82 control and 70 treatment days. In Southwest Division, there were 234 total days, including 117 control and 117 treatment days.

Defining Arrests
An arrest is generally understood to mean the taking into custody of an individual by the police given probable cause that a violation of the law has occurred. An arrest, as recorded by the LAPD, should not be conflated with other down-stream processes of the criminal justice system. An arrest does not imply booking or continued detention, nor does it indicate whether the individual is ultimately prosecuted for a crime. Arrests also should not be conflated with contacts between the public and police that did not result in arrest, even if such contacts were contentious. In general, police can exercise many alternatives to arrest in seeking to enforce laws and ensure order, including behavioral directives, warnings, and brief detention without arrest. On average, the LAPD makes about 1.5 million public contacts per year; about 24,000 of these contacts (1.6%) are arrests (Beck 2016). Here, arrests are taken at face value, without considering anything beyond the official record that an individual was taken into custody.
We do not distinguish between arrests for different types of crimes. In 2012, the LAPD made arrests under 520 different criminal codes representing 25 broad classes of crimes such as aggravated assault, robbery, burglary and larceny. Our primary focus is on whether the practice of predictive policing introduces new biases into arrest patterns, not whether bias might be differentially present in arrests for different types of crimes.

Defining Racial-Ethnic Groups
The LAPD collects demographic information as part of the arrest process including age, sex, and race-ethnicity of the individuals arrested. This information may be elicited from the individual or inferred by the arresting officer. The LAPD recognizes the categories Asian, black, Latino, white and other, which combined constitute 97.7% of all arrests on average. Occasionally, other categories such as Filipino, Korean, and Pacific Islander appear within the data. Given the sometimes fraught history between the LAPD and communities of color (see Herbert 1997; Muniz 2015; Martinez 2016), we focus on patterns in the arrest of black and Latino individuals and therefore report results for these two groups and for white individuals.

Defining Crimes
We define crimes as those incidents reported to the LAPD that are classified by the LAPD crime coding system into 226 recognized crime types. In a typical year, the LAPD collects reports on approximately 180,000 crimes. Again, we do not distinguish between different types of crimes.

Results
We now turn to a consideration of potential biases induced by predictive policing. We test three null hypotheses: (1) arrest of minority individuals did not differ between control and treatment conditions in test divisions; (2) arrest rates overall did not differ between control and treatment conditions in test divisions; (3) the rate of arrests per crime was unchanged across treatment and control conditions. The LAPD experiment was designed to test for differences in predictive accuracy and impact on crime between control and treatment (predictive policing) conditions. Here we examine arrest patterns on control and treatment days (Table 1). Because North Hollywood had fewer treatment days (n = 70) than control days (n = 82) in the experiment, we adjust North Hollywood treatment counts to a rate per 82 days to be comparable to control. We conduct a Cochran-Mantel-Haenszel (CMH) test (Agresti and Kateri 2011) to examine whether ethnicity is independent of the treatment condition. The CMH test is a generalization of the chi-square test in which the test is repeated across strata, in this case the three divisions. Here, we do not find evidence to reject the null hypothesis that treatment arrests are independent of ethnicity; the CMH test p-value is 0.6957 (Table 1). The CMH test assumes that the treatment effect is homogeneous across the three divisions. We test this assumption using a Woolf test and do not find evidence to reject the null hypothesis of homogeneity (p-value 0.94337).
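To make the mechanics of a stratified test concrete, the following minimal sketch implements the CMH statistic for the simpler case of one 2 × 2 table per division (control vs treatment days by one group of interest vs all other arrests). This is a simplification: the analysis above compares several racial-ethnic groups at once. The function name `cmh_test`, the optional continuity correction, and all counts are hypothetical and are not the values in Table 1.

```python
import math

def cmh_test(strata, correct=True):
    """Cochran-Mantel-Haenszel test for a list of 2x2 tables.

    Each stratum is ((a, b), (c, d)): rows are control and treatment days,
    columns are arrests of the group of interest and all other arrests.
    """
    delta = 0.0     # sum over strata of observed minus expected for cell a
    variance = 0.0  # sum over strata of the hypergeometric variance of cell a
    for (a, b), (c, d) in strata:
        n = a + b + c + d
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        delta += a - row1 * col1 / n
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    # Continuity correction, applied only when |delta| is at least 0.5.
    yates = 0.5 if correct and abs(delta) >= 0.5 else 0.0
    stat = (abs(delta) - yates) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts for three divisions, for illustration only.
divisions = [
    ((120, 40), (115, 42)),
    ((90, 60), (85, 58)),
    ((200, 50), (190, 52)),
]
stat, p_value = cmh_test(divisions)
```

With these made-up counts the group's share of arrests is nearly the same under both conditions in every division, so the test gives no reason to reject independence.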
In Table 2, we test whether the total number of arrests was higher or lower on treatment days at the division level. Here, we find that arrests were unchanged in Foothill and North Hollywood (using a chi-square goodness-of-fit test for equal rates) and arrests were slightly lower in Southwest on treatment days (marginally significant at the 0.06 level).
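The rate adjustment and the equal-rates test can be sketched as follows. This is a minimal illustration under assumed inputs: the helper names `adjust_to_days` and `equal_rate_test` and the arrest counts are hypothetical and are not the values in Table 2. Only the 82 control and 70 treatment days for North Hollywood come from the text.

```python
import math

def adjust_to_days(count, observed_days, target_days):
    """Rescale a count observed over observed_days to a rate per target_days."""
    return count * target_days / observed_days

def equal_rate_test(control, treatment):
    """Chi-square goodness-of-fit test that two counts share the same rate."""
    expected = (control + treatment) / 2  # expected count under equal rates
    stat = ((control - expected) ** 2 + (treatment - expected) ** 2) / expected
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical arrest counts, for illustration only.
control_arrests = 600       # assumed count over 82 control days
treatment_arrests = 525     # assumed count over 70 treatment days

adjusted = adjust_to_days(treatment_arrests, observed_days=70, target_days=82)
stat, p_value = equal_rate_test(control_arrests, adjusted)
```

In this made-up case the raw treatment count looks lower only because there were fewer treatment days. After rescaling to 82 days the two counts are close, and the test does not reject equal rates.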
We also examine arrest patterns in control and treatment boxes (Table 3). Recall that each participating division of the LAPD was allocated 20 prediction boxes, each 500 × 500 feet.
Table 2. Total arrests on control vs treatment days aggregated by division. p-value is given for a chi-square goodness-of-fit test for equal rates. † indicates counts adjusted for treatment days to a rate (per 82 days) to match control.
In Table 4, we test whether arrests were higher or lower on treatment days within predicted boxes. Here, we find a statistically significant increase in arrests (approximately double) in treatment boxes in all three divisions. To understand why arrests are higher in treatment boxes, in Table 5 we adjust the arrest rate to control for the higher overall rate of crime in treatment boxes. Our null hypothesis is that if predictive policing is "fair," then the percentage of crimes in an area leading to an arrest should be equal across treatment and control conditions. Here we find this to be the case in North Hollywood (arrests per crime 7.0% control, 5.6% treatment, p-value 0.34) and Southwest (arrests per crime 10.3% control, 10.7% treatment, p-value 0.75). The arrest rate per crime is lower in treatment boxes in Foothill (arrests per crime 14.9% control, 8.5% treatment, p-value 0.009).
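The comparison of arrests per crime can be illustrated with a pooled two-proportion test. This is a sketch under stated assumptions: the text does not name the test behind these p-values, so the choice of test, the function name `arrests_per_crime_test`, and all counts are hypothetical. The sketch also treats arrests per crime as a simple proportion of crimes.

```python
import math

def arrests_per_crime_test(arrests_c, crimes_c, arrests_t, crimes_t):
    """Pooled two-proportion z-test comparing arrests per crime."""
    rate_c = arrests_c / crimes_c
    rate_t = arrests_t / crimes_t
    # Pooled rate under the null hypothesis of equal arrests per crime.
    pooled = (arrests_c + arrests_t) / (crimes_c + crimes_t)
    se = math.sqrt(pooled * (1 - pooled) * (1 / crimes_c + 1 / crimes_t))
    z = (rate_t - rate_c) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_c, rate_t, z, p_value

# Hypothetical counts, for illustration only: treatment boxes contain
# twice the crime and roughly twice the arrests of control boxes.
rate_c, rate_t, z, p_value = arrests_per_crime_test(
    arrests_c=30, crimes_c=300,
    arrests_t=63, crimes_t=600,
)
```

In this made-up example the raw number of arrests roughly doubles in treatment boxes, yet the arrests-per-crime rates (10.0% and 10.5%) are statistically indistinguishable, which is the pattern a higher-crime location would produce with no change in arrest propensity.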

Discussion and Conclusions
The stated goal of the analyses presented above was to assess the degree to which arrest rates were impacted by the introduction of predictive policing in three divisions patrolled by the LAPD. Special attention was paid to arrest rates by the race-ethnicity of the individuals detained. Our null hypotheses were: (1) arrest of minority individuals did not differ between control and treatment conditions in test divisions; (2) arrest rates overall did not differ between control and treatment conditions in test divisions; (3) the rate of arrests per crime was unchanged across treatment and control conditions. The evidence presented does not allow us to reject null hypothesis (1). There is no significant difference in the arrest proportions of minority individuals between treatment and control conditions. We also cannot reject hypothesis (2) at the division level. Arrest rates overall are the same on control and treatment days within the test divisions as a whole. However, we do reject null hypothesis (2) at the box level. Arrests were higher overall in treatment prediction boxes. We therefore tested hypothesis (3) to see if the higher arrest rate in treatment boxes is explained by an overall higher crime rate in treatment boxes. We fail to reject the null hypothesis (3). Arrest rates per crime do not differ across treatment and control conditions.
Clearly, arrests are a common part of day-to-day police operations. The introduction of predictive policing did not increase arrests overall, though treatment prediction boxes did see significantly more arrests than control prediction boxes. The increased arrests in treatment prediction boxes are perhaps understandable given that algorithmic crime predictions are more accurate than those produced by existing best practice (Mohler et al. 2015).
The present study has several important limitations. First, arrests are an imperfect proxy for other types of police contacts including stops, searches and detentions short of arrest. It is possible that predictive policing induced increases in these other categories of police contacts, without a concomitant impact on arrests. For this to hold true, it would have to be the case that the rate of arrest actually declined as these other precursor contacts increased, leaving overall arrest numbers unchanged. This hypothetical downward adjustment in arrests would have to hold not only for the experimental deployment period overall, but also for randomly assigned treatment days. We do not have sufficient data to exclude such dynamics, but they seem improbable on the face of it.
Second, the analyses do not provide any guidance on whether arrests are themselves systemically biased. Such could be the case, for example, if black and Latino individuals experienced arrest at a rate disproportionate to their share of offending (Rosenfeld and Fornango 2014). The current study is only able to ascertain that arrest rates for black and Latino individuals were not impacted, positively or negatively, by using predictive policing. Future research could seek to test whether the situational conditions surrounding arrests and final dispositions differ in the presence of predictive policing.
Finally, the results reported herein pertain to the narrowly defined place-based predictive policing model used in the Los Angeles predictive policing experiment (Mohler et al. 2015). This model focused on reported crime data for a limited set of crimes including burglary, car theft and burglary from vehicle and used only information on crime location and time. Predictions were made for small 500 × 500 foot boxes and changed every day. Under those conditions, we can conclude that predictive policing did not result in biased arrests. Whether the same outcomes would hold given changes in implementation is uncertain. If the exact same data types and methods are applied in a different location there may be reason to be optimistic. However, if the data types change, for example to focus on discretionary crimes or arrests, or if personal information is incorporated into predictions, then pessimism may be warranted. At the same time, we should ask whether police would be negligent if they had data or information that led to accurate forecasts of crime risk but failed to act on it for fear of potential bias (Ferguson 2017). Continued empirical scrutiny along with careful policy development will be needed to guard against bias in predictive policing and ensure fairness in outcomes.