Crime in Philadelphia: Bayesian Clustering with Particle Optimization

Accurate estimation of the change in crime over time is a critical first step towards better understanding of public safety in large urban environments. Bayesian hierarchical modeling is a natural way to study spatial variation in urban crime dynamics at the neighborhood level, since it facilitates principled ``sharing of information'' between spatially adjacent neighborhoods. Typically, however, cities contain many physical and social boundaries that may manifest as spatial discontinuities in crime patterns. In this situation, standard prior choices often yield overly-smooth parameter estimates, which can ultimately produce mis-calibrated forecasts. To prevent potential over-smoothing, we introduce a prior that partitions the set of neighborhoods into several clusters and encourages spatial smoothness within each cluster. In terms of model implementation, conventional stochastic search techniques are computationally prohibitive, as they must traverse a combinatorially vast space of partitions. We introduce an ensemble optimization procedure that simultaneously identifies several high probability partitions by solving one optimization problem using a new local search strategy. We then use the identified partitions to estimate crime trends in Philadelphia between 2006 and 2017. On simulated and real data, our proposed method demonstrates good estimation and partition selection performance.


Introduction
Beginning in the late 2000's and continuing through the 2010's, Philadelphia experienced population growth, underwent a rapid evolution in its built environment, and observed a generally decreasing trend in the total number of violent crimes reported in the city. Although the total number of violent crimes decreased city-wide, the decreasing trend was not experienced uniformly in all of the local neighborhoods making up the city. In this paper, we carefully estimate both the neighborhood-level baseline levels and temporal trends in crime and separately cluster these estimates. This enables us to identify clusters of neighborhoods whose trends in crime deviated markedly from the overall city-wide decrease as well as clusters of neighborhoods displaying substantially higher or lower baseline levels than surrounding neighborhoods.
Accurate modeling of the neighborhood-level crime patterns benefits many stakeholders: urban planners can better understand how socioeconomic factors and the built environment affect crime, city officials can develop community programs and interventions to improve the overall quality of life for all residents, and law enforcement can more thoughtfully deploy resources to increase public safety. Accurate modeling of neighborhood trends is, however, complicated by several factors. First, simple comparisons of the temporal trajectories of neighborhood-level counts fail to account for important differences in size, land use & zoning, population, and the socio-demographic makeup of each neighborhood. Consequently, in order to facilitate accurate comparisons across neighborhoods, we directly model the crime density -defined as the number of violent crimes divided by the land area (in square miles) -of each neighborhood over time. As we detail in Section 2, area is a more principled normalization than population for small urban areas.
The second complication lies in the spatial modeling of crime density. Bayesian hierarchical modeling is an attractive way to study crime at the neighborhood level as it allows us to "borrow strength" between spatially adjacent neighborhoods. Recently, Balocchi and Jensen (2019) demonstrated that Bayesian models that encourage spatial smoothing between neighborhoods with conditionally auto-regressive (CAR; Besag, 1974) priors yield more accurate neighborhood-level forecasts than fitting independent models to each neighborhood. Though CAR models are an intuitive and popular way to introduce spatial dependence, they can produce overly smooth parameter estimates and forecasts that are at odds with the realities of complex urban environments. In fact, as we will see in Section 2, while crime in Philadelphia displays considerable spatial correlation, there are also sharp discontinuities.
While some discontinuities coincide with known landmarks like major streets, parks, and bodies of water, several occur along less obvious boundaries that may result from differences in socioeconomic conditions and disparate effects of public policy initiatives like community crime watches.
Specifying a CAR prior without accounting for potential discontinuities can lead to poor estimation of crime around these geographic or socioeconomic barriers. Directly modeling the locations of individual discontinuities between pairs of neighborhoods typically relies on heavily over-parametrized models; in fact, many such models introduce one latent parameter for each pair of adjacent neighborhoods. We instead aim to identify clusters of neighborhoods with similar crime dynamics.
In this paper, we propose a "CAR-within-clusters" model where we introduce two latent spatial partitions of neighborhoods in Philadelphia, one for the baseline levels of crime and one for the time trends. We then specify separate CAR priors on the neighborhood-specific parameters within each cluster of each partition. Unlike many classical approaches to Bayesian spatial clustering, our proposed model allows (i) the spatial distribution of the baseline levels of crime to differ from that of the time trends and (ii) the parameters to vary both within and between clusters.
Because we allow different model parameters to cluster differently, the combinatorial vastness of the underlying product space of partitions renders stochastic search techniques like Markov Chain Monte Carlo (MCMC) computationally prohibitive. We instead focus on posterior optimization. However, rather than simply finding the maximum a posteriori (MAP) partitions, we propose an extension of Ročková (2018)'s ensemble optimization framework that simultaneously identifies multiple partitions with high posterior probability by solving a single optimization problem. Solving this problem is formally equivalent to finding a variational approximation of the discrete posterior distribution over the partitions. We introduce a new local search strategy that, at a high level, runs several greedy searches that are made "mutually aware" through an entropy penalty. This penalty promotes diversity among the estimated partitions by discouraging different search paths from visiting the same point in the latent discrete space.
Here is an outline for the rest of the paper. In Section 2 we describe our crime data and introduce our "CAR-within-clusters" model. Then, in Section 2.1, we highlight important similarities and differences between our proposed model and existing Bayesian spatial clustering methods. We introduce our variational approximation and local optimization strategy in Section 3 before demonstrating its use on synthetic data in Section 4. We then apply our method to the Philadelphia crime data in Section 5. Section 6 concludes with a discussion of our results and an outline of potential future directions. An R package implementing our method and all code and data to replicate the results in this paper are available at github.com/cecilia-balocchi/particle-optimization.

Data and the "CAR-within-clusters" Model
Our crime data comes from opendataphilly.org, where the Philadelphia Police Department publicly releases the location, time, and type of each reported crime in the city. We focus on violent crimes, which include homicides, rapes, robberies, and aggravated assaults (FBI, 2011), aggregated at the census tract level. Philadelphia is divided into N = 384 census tracts, which range in size from 0.04 to 6.65 square miles and whose populations range from 0 to 8300, with mean of 4000, according to the 2010 Census.
When comparing crime incidence across heterogeneous regions, it is very common for criminologists to work with per-capita crime rates (e.g. crime rates per 100,000 inhabitants). However, for small geographic regions within urban environments, Zhang and Peterson (2007) point out that frequently neither criminals nor victims are from the same neighborhood as the crime location. They instead recommend normalizing crime counts by the area of the neighborhood rather than the population.
To this end, let c i,t be the total number of violent crimes reported in neighborhood i in year t, with t = 0 corresponding to the year 2006 and t = 11 corresponding to the year 2017. Additionally, let A i be the area of neighborhood i in square miles. Since the distribution of crime density c i,t /A i is quite skewed, we transform the densities using the inverse hyperbolic sine transformation: y i,t = log c i,t /A i + (1 + (c i,t /A i ) 2 )) 1/2 − log(2). This transformation behaves quite similarly to the logarithmic transform but is well-defined at zero (Burbidge et al., 1988).
At a high-level, we could model, independently for each census tract i = 1, . . . , N and t = 0, . . . , 11, y i,t ∼ N (f i,0 (x t ), σ 2 ), where x t is the time index standardized to have mean zero and unit variance and the unknown regression function f i,0 measures the expected transformed crime density in neighborhood i. Rather than attempt to estimate each f i,0 exactly -a task made complicated by the fact that we only have 12 observations per census tract -we instead focus on approximating f 0 up to the first order. That is, we introduce parameters α i and β i for each census tract and fit the approximate model y i,t ∼ N (α i + β i x t , σ 2 ). (1) The parameters α i and β i respectively approximate the mean level and time trend of the transformed crime density in census tract i. Although the approximation in Equation (1) may not perfectly characterize the temporal evolution of crime, such linear approximations are routinely used in limited data settings like ours, which includes only 12 observations per census tract (Bernardinelli et al., 1995;Anderson et al., 2017). Moreover, an exploratory analysis (see Section S4.1 in the Supplementary Materials) did not reveal strong suggestions of non-linearity in the y i,t . While such a first order approximation is justified in our setting, we note that the methods we develop in Section 3 to estimate and cluster the α i 's and β i 's may be extended to higher-order approximations of the f 0,i 's that can capture non-linearities; we defer discussion of such extensions of Section S4.2 in the Supplementary Materials.
Although the raw crime counts c i,t were not of primary interest in our analysis, we can, nevertheless, elaborate the model in (1) to directly model the observed crime counts. For instance, following Kowal and Canale (2020), we can model crime counts as a suitably rounded and transformed latent Gaussian random variable. Specifically, we can model c i,t = A i sinh (log(2) + y it ) where u is the nearest integer to u and y i,t ∼ N (α i + β i x t , σ 2 ) is a latent variable. Recent work (Kowal and Canale, 2020;Kowal and Wu, 2021) demonstrates that such transformed and truncated latent Gaussian models are more accurate and flexible than conventional Poisson and negative binomial models of counts.

Related work
The model in Equation (1) is a linear model with spatially-varying coefficients. Geographically weighted regression (GWR; Fotheringham et al. (2002)) is a popular method that estimates the parameters (α i , β i ) in region i using data from region i and weighted data from nearby regions, with regions closer to i receiving higher weights. Within the Bayesian literature, Gaussian processes (GPs) are a common prior choice to induce spatial dependence between the parameters estimated at different locations. In the presence of a priori unknown spatial discontinuities, however, both GWR and GP-based methods can introduce an inappropriate amount of smoothness, resulting in substantial bias in the parameter estimates and downstream crime forecasts.
There is a vast literature on data-adaptive strategies for detecting such discontinuities. One approach involves first fitting a simple model that does not account for potential discontinuities and then identifying jumps in the fitted values (see, e.g., Boots (2001), Li et al. (2011), Banerjee et al. (2012, Lu and Carlin (2005), and Lee and Mitchell (2013)). Alternatively, many authors directly model uncertainty about which borders correspond to sharp discontinuities within larger Bayesian hierarchical models (see, e.g., Lee and Mitchell (2012), Lu et al. (2007), and Balocchi and Jensen (2019)). Despite their intuitive appeal, these latter models are heavily over-parametrized, often introducing one latent parameter for every pair of adjacent neighborhoods.
Rather than looking for individual discontinuities between pairs of neighborhoods, several authors instead look for clusters of neighborhoods with similar parameter values. Such clustered models are much more parsimonious and readily interpreted than models of pairwise discontinuities. Our goal in this work is to cluster and estimate the mean-levels α i and time-trends β i in Equation (1). Because there is no a priori reason to expect the spatial distribution of the α i 's to match the spatial distribution of the β i 's, we require a method that can estimate separate clusters for both sets of parameters. We further require a method that can recover spatial partitions, whose clusters consist of neighborhoods which are spatially adjacent. Unfortunately, most existing methods for spatially clustered regression fail to meet both criteria.
Two important early works on Bayesian spatial clustering are Knorr-Held and Raßer (2000) and Denison and Holmes (2001). Both fit intercept-only Poisson regression models in which the intercepts are constant within clusters. In both methods, clusters are defined with respect to carefully chosen distance metrics. Gangnon and Clayton (2000) and Green and Richardson (2002) also fit Poisson regression models with clustered intercepts. They also assumed that parameters were constant within clusters but specified priors on the partitions that respectively depended on the cluster's shape and number of adjacent units within each cluster. In general, neither of their procedures recover spatial partitions. Feng et al. (2016) substantially expand Knorr-Held and Raßer (2000)'s model by incorporating additional covariates with spatially clustered effects. Despite producing spatial clusters, their extension assumes that (i) covariate effects are constant within clusters and (ii) the effects of every covariate are clustered identically. Page and Quintana (2016)'s spatial product partition model makes the same assumptions but is further limited by the fact that it places positive prior probability on non-spatial partitions. Anderson et al. (2017) study the spatiotemporal variation in respiratory disease risk by fitting a Poisson regression model in each areal unit. Like us, their goal is to estimate an intercept and time-trend within each areal unit and, unlike Feng et al. (2016), they do not assume that the intercepts and time-trends are clustered identically. However, they do not restrict attention to spatial clusters and their computational procedure requires the user to specify, a priori, the maximum number of clusters for both the intercept and time-trend.
Recently Teixeira et al. (2019) and Li and Sang (2019) proposed elegant methods for spatial clustering based on spanning tree representations of the adjacency structure of the neighborhoods. Whereas Li and Sang (2019)'s method is based on solving a fused lasso problem corresponding to a fixed spanning tree, Teixeira et al. (2019) instead developed a Gibbs sampler that sequentially samples a spanning tree, spatial partition, and regression parameters. Despite the elegance of their framework, Teixeira et al. (2019) noted that sampling a spanning tree from its exact posterior conditional distribution is computationally prohibitive. They instead sampled from an approximate conditional and the extent to which their approximation affected their downstream inference is unclear.

Model
To motivate our proposed CAR-within-cluster model, we first note that the tract-specific maximum likelihood estimates (MLEs) of α i and β i suggest that the crime may not have decreased uniformly across the city (Figure 1). In fact, in a small number of neighborhoods, crime has actually increased over the last decade.
We further observe that, with a few notable exceptions, spatially adjacent neighborhoods tend to have similar MLEs, suggesting a high degree of spatial correlation in the neighborhoodlevel crime dynamics. We take a hierarchical Bayesian approach in order to "borrow strength" between neighborhoods that involves specifying a prior distribution on the parameters α = (α 1 , . . . , α N ) and β = (β 1 , . . . , β N ). Because we expect the tract-specific parameters to display some spatial continuity, we use priors that explicitly introduce dependence between parameters from neighboring tracts.
Conditionally autoregressive (CAR) models are a popular class of such priors and we use a version introduced in Leroux et al. (2000). Letting W = (w i,j ) be a binary adjacency matrix with w i,j = 1 if and only if neighborhoods i and j share a border, we say that the vector θ = (θ 1 , . . . , θ n ) follows a CAR model with grand mean θ and variance scale τ 2 if and only In this CAR model, the conditional mean of θ i | θ −i is a weighted average of the grand mean θ and the average of the θ j 's from the neighborhoods that border neighborhood i. The degree to which θ i is shrunk toward either of these targets is governed by a parameter ρ, which is typically set by the analyst, and the number of neighbors. These full conditionals uniquely determine the joint distribution θ ∼ N (θ1 n , τ 2 Σ CAR ) where , 1 n is the n-vector of ones, and W is the unweighted graph Laplacian of the adjacency matrix W . For compactness, we will write θ | θ, τ 2 ∼ CAR(θ, τ 2 , W ).
Cities typically contain many geographic and social barriers like rivers and highways that manifest in sharp spatial discontinuities. In the presence of these discontinuities, a naïvely specified CAR model can induce a level of spatial smoothness among the parameters at odds with the data. To avoid this behavior, we seek clusters of parameters that demonstrate considerable spatial continuity within but not between clusters. We introduce two latent partitions of the neighborhoods, γ (α) and γ (β) , where γ (·) = {S (·) 1 , . . . , S (·) K (·) }. We refer to the sets S (·) k as clusters and restrict attention to partitions consisting of clusters of spatially connected neighborhoods. We denote the set of all such partitions by SP and let γ := (γ (α) , γ (β) ) be the pair of latent spatial partitions underlying the mean levels and the time trends of crime across neighborhoods. In what follows, we will refer to γ as a particle.
To simplify our presentation, we describe only the prior over the mean levels of crime α; we place an analogous prior on the time trends β. We place independent CAR priors on the collections α k = {α i : i ∈ S (α) k }, so that the joint prior density π(α | γ (α) , σ 2 ) factorizes over the collection of all clusters: π(α | γ (α) , σ 2 ) = K (α) k=1 π(α k | σ 2 ). To this end, we introduce a collection of grand cluster means α = {α 1 , . . . , α K (α) } and model is the sub-matrix of W whose rows and columns are indexed by the cluster S (α) k . We further place independent N (0, a 2 σ 2 ) priors on the grand cluster means α k .
We place independent truncated Ewens-Pitman priors on the two latent spatial partitions γ α and γ β . The Ewens-Pitman distribution on partitions is equivalent to the exchangeable partition probability function (EPPF) of a Chinese Restaurant Process (Aldous, 1985;Pitman et al., 2002). It is characterized by favoring a small number of large clusters and asymptotically it induces an average of η log(N ) clusters. To recover partitions with connected clusters, we truncate this distribution to the set of spatial partitions SP. The probability mass function of this prior is given by and we denote it by T -EP. We complete our hierarchical prior with an Inverse Gamma prior on the residual variance σ 2 ∼ IG νσ 2 , νσλσ 2 . To summarize, our model is The high degree of conditional conjugacy in (3) enables us to derive analytic expressions for quantities such as the marginal likelihood p(y | γ) as well as the conditional posterior expectations E[α, β | γ, y]. The availability of these expressions will be crucial for the posterior exploration strategy we develop below.
Given the residual variance σ 2 and latent partitions γ (α) and γ (β) , parameters in different clusters are conditionally independent. In other words, our model falls with the class of conditional product partition models (PPMs) that have been widely used in Bayesian spatial statistics (see, e.g., Knorr-Held and Raßer (2000), Denison and Holmes (2001), and Feng et al. (2016)). Unlike these papers, however, we are interested in recovering two latent partitions, one each for the mean levels and time-trends within each census tract. Our model is perhaps most similar to Anderson et al. (2017), who also seek to estimate two distinct partitions, one for the intercepts and one for the slopes. However, they limit attention to partitions containing five or fewer clusters for computational simplicity, whereas we will not need to impose any a priori restriction on the number of clusters.
Our CAR-within-cluster model depends on several hyperparameters, including ρ, which regulates the amount of within-cluster spatial autocorrelation, and η, which regulates the expected number of clusters. Throughout our analysis, we fix these hyperparameters and set the remaining hyperparameters a 1 , a 2 , b 1 , b 2 , ν σ and λ σ in a data-dependent fashion, as described later in Section 3.2.

Posterior Exploration and Summarization
In fitting our model, we have three simultaneous aims: (i) identify the latent pair of partitions γ = (γ (α) , γ (β) ), (ii) estimate the approximate baseline levels α and time trends β of crime in each neighborhood, and (iii) make predictions about future incidents of crime in each neighborhood. These latter two tasks can generally be expressed as evaluating posterior expectations of the form E[g(α, β) | y] where g is a functional of interest. For instance, we can obtain point estimates of individual mean levels with g(α, β) = α i and future crime forecasts at some future time with g(α, β) = α i + β i x . The combinatorial vastness of the space SP 2 , which contains all possible pairs of spatial partitions, renders it impossible to enumerate all γ's for even small values of N. As a result, we cannot compute the posterior probability π(γ | y) exactly.
It is tempting to resort to Markov Chain Monte Carlo (MCMC) simulations to approximate expectations E[g(α, β) | y]. Unfortunately, due to the vastness of SP 2 , such MCMC simulations may require a prohibitive amount of time to mix. To get around this difficulty, Anderson et al. (2017) arbitrarily restricted attention to partitions with no more than five clusters each. Even with such a restriction, which we will not impose, it is still quite difficult to distill the thousands of resulting draws of γ into a single point estimate and to quantify parameter and partition uncertainty in an interpretable fashion (see, e.g., Wade and Ghahramani, 2018).
A popular alternative approach is posterior optimization, which usually focuses on identifying the maximum a posteriori (MAP) pair of partitions γ (1) or some other decision-theoretic optimal point estimate (see, e.g., Lau and Green (2007) and Rastelli and Friel (2018)). One then estimates the marginal expectation E[g(α, β) | y] with a "plug-in" estimator E[g(α, β) | y, γ (1) ]. Though this procedure might be substantially faster than MCMC, it completely eschews exploration of the uncertainty about γ. As a result, the MAP plug-in estimator may result in over-confident inference about the functional. Essentially, using the MAP plug-in amounts to approximating the full posterior distribution over SP 2 with a single point mass.
Instead of identifying only a single γ with high posterior probability, what if we could identify Γ L = {γ (1) , . . . , γ (L) }, the set of L pairs of partitions with largest posterior mass? Given Γ L , a natural approximation is whereπ L (·|y) is the truncation of π(γ|y) to the set Γ L . Because this approximation averages over more of the posterior uncertainty about γ, we might reasonably expect it to be more accurate than the MAP plug-in.

A Variational Approximation
It turns out that we can identify Γ L by finding a variational approximation to the full posterior distribution. Before proceeding, we introduce some more notation. For any collection of L particles Γ = {γ 1 , . . . , γ L } and vector w = (w 1 , . . . , w L ) in the L-dimensional simplex, let q(· | Γ, w) be the discrete distribution that places probability w on the particle γ . Following Ročková (2018), we will refer to individual pairs of partitions γ as particles, the collection Γ as a particle set and w as importance weights. Let Q L be the collection of all such distributions supported on at most L particles. Finally, for each λ > 0, let Π λ be the tempered marginal posterior with mass function π λ (γ) ∝ π(γ | y) 1 λ . Note that the particles in Γ L , which are the L particles with largest posterior mass, are also the L particles with largest tempered posterior mass for all λ.
The following proposition provides the foundation for estimating Γ L .
In light of Proposition 1, we can recover Γ L by finding an approximation of any tempered posterior Π λ . This is equivalent to solving where H(Γ, w) = −E q [log q(·|Γ, w)] is the entropy of the approximating distribution q(·|Γ, w).
Before proceeding, we stress that we are not finding a variational approximation of π(α, β, σ 2 |y), the marginal posterior distribution of the continuous parameters of interest. Instead, we are approximating the discrete posterior distribution π(γ | y), which places positive probability over all particles γ = (γ α , γ β ), with another discrete distribution q that places positive probability on only L particles.
We pause briefly to reflect on the two terms in Equation (4). The first term is, up to an additive constant depending only on y, the w-weighted average of the height of the logposterior at each particle in the particle set Γ. This term is clearly maximized when all of the particles in Γ are equal to the MAP. On the other hand, the entropy H(Γ, w) of the approximating distribution is maximized when all of the particles in Γ are distinct and each w = L −1 . The penalty term λ balances these two opposing forces.
Finally, we note that Ročková (2018) introduced essentially the same family of optimization problems to identify sparse high-dimensional linear regression models and described a coordinate ascent strategy that iteratively updated w and Γ. In that work, γ was a binary vector indicating which variables to include in the model.

Particle Optimization
Finding the global optimum of (4) exactly is practically impossible, given the enormous size of the set of all possible particle sets Γ. Instead, like Ročková (2018), we deploy a coordinate ascent strategy: starting from an initial particle set Γ and initial weight vector w, we iteratively update one of w and Γ until we reach a stationary point.
To update the particle set Γ, we sweep over the individual particles, sequentially updating each one while holding the others fixed. Finding the globally optimum single particle update is practically impossible, since doing so would require searching over the space of all possible pairs of spatial partitions. We instead update each γ by updating each of its constituent partitions one at a time. Even updating a single partition within a single particle is practically impossible, given the vastness of the space of spatial partition SP. For that reason, we must rely on a local algorithm to search over SP. That is, we update a single partition by first enumerating a large number of candidate partitions and moving to the candidate that maximizes the objective function Equation (4).
Perhaps the simplest candidate set consists of all partitions that can be formed by reallocating a single neighborhood to a new or existing cluster. Such one-neighborhood updates directly parallel conventional Gibbs samplers for Dirichlet process mixture models. In our optimization setting, such a restrictive search strategy results in premature termination at a sub-optimal particle set Γ. Instead, a more promising strategy for navigating the space of partitions is to allow multiple neighborhoods to be reallocated at once.
We build a candidate set using both fine transitions, which reallocate a single neighborhood to a new or existing cluster, and coarse transitions, which simultaneously reallocate multiple neighborhoods. There are two types of fine transitions, "island" moves, in which a neighborhood is remove from its current cluster and reallocated to a new singleton cluster, and "border" moves, which move a single neighborhood located at the boundary between two clusters across the boundary. We also consider three types of coarse transitions: (i) "merge" moves, which combine two adjacent clusters into a single clusters; (ii) "split" moves, which divide an existing cluster into multiple smaller sub-clusters; and (iii) "split-and-merge" moves, which first split an existing clusters into multiple pieces and then merge some or all of those pieces with other existing clusters. Figure 2 illustrates these moves. We note that sometimes, removing a single neighborhood from a cluster leaves the resulting cluster spatially disconnected. When this happens, we treat the resulting components as individual clusters.
Figure 2: Two broad types of transitions that we consider. An "island" move (top) removes a single neighborhood from an existing cluster (the lower left orange cluster) and creates a new singleton cluster. A "split and merge" move (bottom) first splits an existing cluster (the bottom left cluster) into multiple sub-clusters and then merges some or all of the new sub-clusters into existing clusters Local search heuristics. In our local search algorithm, it is not practical to enumerate all possible coarse and fine transitions. We instead use a number of heuristics to limit the number of possible moves considered in each step. For brevity, we describe these heuristics for transitions for γ (α) ; we use exactly the same heuristics for γ (β) .
The conditional conjugacy of our "CAR-within-cluster" model allows us to quickly compute We use these conditional means as running estimates with which to propose transitions. Rather than attempting to merge an existing cluster k with each one of its neighboring clusters, we only attempt to merge k with its neighbor k whose grand cluster mean α k is closest to α k . Moreover, when attempting split-and-merge moves, we cap the number of new sub-clusters at five. Finally, for island moves, we initially only attempt to remove neighborhood i from its current cluster and move it to a new singleton if the estimated α i is in the top or bottom 5% of the distribution of estimates within the cluster. During our coordinate ascent algorithm, if we find that none of these transitions are accepted, we try all N island moves. This last check ensures that our algorithm converges locally in the sense that no one-neighborhood update to an individual partition will result in a higher objective.
These heuristics are admittedly ad hoc but we have found that they strike a good balance between solution optimality and computational speed. We found, for instance, that merge moves that tried to merge clusters with dissimilar estimated grand means were rarely accepted. We also found that if we restricted split moves to create only two new sub-clusters at a time, it was more difficult to identify partitions containing many small clusters.
Choice of hyperparameters. Our prior depends on several hyperparameters, some of which we recommend fixing and some of which we set in a data-dependent manner. Specifically, we fix ρ = 0.9 so that our prior concentrates on clusters with relatively high levels of spatial autocorrelation. This choice of ρ also ensures prior propriety and results in somewhat more numerically stable posterior computations than values of ρ much closer to 1. Although the posterior distribution over γ is somewhat sensitive to ρ, resulting estimates of α and β are somewhat less sensitive to misspecification of ρ (see Section S3.4 of the Supplementary Materials for a detailed sensitivity analysis).
In our data analysis and simulations, the number of spatial units N is approximately 400. For a non-truncated Ewens-Pitman prior, the expected number of clusters grows at the rate η log (N ). To us log (400) ≈ 6 seemed like a reasonably adequate level of summarization of crime patterns; accordingly, we fixed η = 1 throughout.
The remaining hyperparameters are set in a data-dependent fashion. At a high-level, to set ν σ and λ σ , we first fit separate linear models to the data in each neighborhood to get an empirical distribution of the residual variance. We then set the hyperparameters of the Inv. Gamma(ν σ /2, ν σ λ σ /2) prior on σ 2 to match the first and second moments of this empirical distribution. To set a 1 , a 2 , b 1 , and b 2 , we use an empirical Bayes approach based on computing the MAP estimate of γ. We provide full details of this strategy to Section S2 of the Supplementary Materials. In our experiments, we have found this strategy to work well and so have automated its implementation as a default option in our R package.
Initialization and choice of λ. To run our local search algorithm, one must additionally specify (i) the number of particles L, (ii) the initial locations of the particles, and (iii) the inverse temperature λ > 0. The choice of L is largely dependent on the computational budget, as the per-iteration complexity of our algorithm scales linearly in L. For problems of our size, with around N = 400 spatial units, we have found that setting L = 10 or L = 20 strikes a good balance between posterior exploration and computational effort.
We initialize the particle set by randomly drawing particles (γ K is the partition obtained by running k-means on the maximum likelihood estimates of α with k = K clusters. We let K, K = 1, . . . , log(N ) . In this initialization, the probability of drawing particle (γ is proportional to its marginal posterior probability. Our initialization allows our algorithm to pursue several search directions simultaneously but also allows for some redundancy in the initial particle set. In regions of high posterior probability, such redundancy allows multiple particles to search around a dominant mode, providing a measure of local uncertainty. Finally, it is important to stress that λ is not a model parameter; neither the data likelihood nor the prior distribution depends on it. Moreover, although λ plays the role of a penalty parameter in Equation (4), Proposition 1 guarantees that the global solution to that optimization problem is the same for all λ > 0. This is in marked contrast to most penalized likelihood procedures, whose solutions are highly dependent on the choice of penalty parameter. In practice, however, the solution obtained by our local search algorithm is somewhat sensitive to the choice of λ.
In our experiments, we have found that setting λ = 1 often results in all L particles collapsing onto a single point. On further inspection, we found that changes in the entropy, being upper bounded by log (L), are often an order of magnitude smaller than changes in the w-weighted log-posterior. To offset this, we recommend running our local search algorithm with large λ. Typically, with larger values of λ, particles are encouraged to drift to regions of lower posterior probability more forcefully than with lower values of λ. In our experiments, we have found that λ = 10 and λ = 100 often produce good results.

Synthetic Data Evaluation
To investigate the behavior of our proposed particle optimization procedure, which we will refer to as PartOpt throughout this section, we conducted several experiments using synthetic data generated over a 20×20 grid of equally-sized spatial units. Throughout our experiments, we fix the true spatial partitionsγ α , which contains 10 clusters ranging in size from one unit to 237 units, andγ β , which contains four clusters of sizes 188, 100, 100, and 12. Without losing any generality, we assumed throughout our simulation study that each region had unit area A i = 1. Given the true spatial partitions, we generated synthetic data according to the model in Equation (1) where the α i 's and β i 's were drawn from CAR-within-cluster distributions. We considered three different settings, corresponding to varying levels of separation between the means α k and β k associated with each cluster ofγ α andγ β . This allowed us to investigate how PartOpt works in settings where the cluster structure is readily apparent and in settings where the cluster structure is less well-defined. For simplicity, althoughγ α contains ten distinct clusters, we fixed the corresponding cluster means α k in such a way that there were only five distinct cluster means. Figure 3 shows the true spatial partitions we used as well as a single draw of α and β in each of the three settings we considered. Full details for generating our synthetic data may be found in Section S3.1 of the Supplementary Materials.
We compared PartOpt method with the following competitors: (i) Anderson et al. (2017)'s method for finding clusters of slopes and intercepts in spatial Poisson regression (hereafter And); (ii) Li and Sang (2019)'s spatially clustered coefficient procedure (hereafter SCC); (iii) separately running K-means on the MLEs of α and β (hereafter KM); and (iv) separately running spectral clustering (Ng et al., 2002) on the MLEs of α and β (hereafter SC).
In our experiments and data analysis, we placed independent truncated Ewens-Pitman prior (2) on the latent partitions γ α and γ β with η = 1. We ran PartOpt with L = 10 particles and three different penalties λ ∈ {1, 10, 100}. We further fixed ρ = 0.9 in the CAR-withinclusters prior and defer a sensitivity analysis of this choice of ρ to Section S3.4 of the Supplementary Materials. We moreover set the remaining hyperparameters a 1 , a 2 , b 1 , and b 2 , together with ν σ and λ σ , in a data-dependent way. At a high level, we first use a heuristic based on the MLE estimate of α and β and on the expected number of clusters to estimate Figure 3: The spatial partitions used to generate our synthetic data and three different settings of α and β values. The color of each square corresponds to the value of the parameter. Different color scales have been used for α and β. Going from left to right, the distances between the cluster means gets progressively small and the overall cluster structure is less visibly apparent. temporary values for the hyperparameters, with which we find the MAP; we then find an empirical Bayes estimate of the hyperparameters given the MAP and run our full procedure to recover the particle set. For further details, see Section S2 of the Supplementary Materials.
Note that our synthetic transformed crime densities y i,t are generated from the normal model in Equation (1), while And is designed for count-valued data. In order to facilitate a comparison between PartOpt and And, we first transformed our y i,t 's into crime counts c i,t using the formula c i,t = sinh(y i,t + log 2) where x is the nearest integer to x. As noted earlier, Anderson et al. (2017) restricts attention to partitions with a pre-specified number of clusters. Following their recommendations we set the maximum number of clusters per partition to be five when we ran And,. In Section S3.3 of the Supplementary Materials, we consider an additional comparison between PartOpt and And in which we first generate crime counts c i,t from a Poisson regression model whose slopes and intercepts are drawn from a CARwithin-clusters prior, and then compute the transformed crime densities y i,t using the inverse hyperbolic sine transformation.
When running KM on the MLEs of α and β, we varied the initial number of clusters from one to five and selected the final number of clusters using the silhouette method (Aranganayagi and Thangavel, 2007). Since KM and And do not generally return spatially connected clusters, we post-processed the identified clusters by splitting them into their connected components. For SC, we varied the number of clusters from one to 10 and used the silhouette method to select the number of clusters. Unlike PartOpt, And, and SCC, which all attempt to learn the latent partitions jointly, KM and SC estimate γ α and γ β essentially independently. For both KM and SC, we estimated the tract-level parameters α i and β i using the posterior mean conditional on the estimated partitions.
We compared the parameter estimation and predictive performance of these methods as well as their abilities to recover the true spatial partitions. We assessed parameter estimation using the root mean square error (RMSE) for estimating the concatenated vector of parameters (α , β ) . We assessed predictive performance using the RMSE for predicting the vector of one-step-ahead out-of-sample observations . SCC, KM, and SC all estimate a single pair of partitions. We measured these methods' ability to recover the latent spatial partitions by computing the adjusted Rand indices (Hubert and Arabie, 1985) between (i) the estimated intercept partitionγ α and the true partitionγ α and (ii) the estimated slope partitionγ β and the true partitionγ β . The Rand index (Rand, 1971) between two partitions is the proportion of pairs of items that are clustered together in both partitions. Values close to one indicate a high a degree of similarity between partitions. The adjusted Rand index corrects the Rand index for chance agreement.
For And, we first computed the adjusted Rand indices between each sampled partition and the corresponding true partition. We then approximated the posterior mean adjusted Rand index for each of γ α and γ β . We similarly approximated the posterior mean adjusted Rand indices using the particle set identified by PartOpt. Large values of the (approximate) posterior mean adjusted Rand index indicates that the (approximate) posterior places more of its mass near the true partitions that generated the data.
For each of the three settings of cluster separation, we generated 100 synthetic datasets. Figure 4 shows the overall performance of the methods considered in the medium separation setting (analogous figures for the high and low separation setting may be found in Section S3.2 of the Supplementary Materials).
Across our simulations, PartOpt obtained consistently lower RMSE and prediction error than the competitors in each of the high, medium, and low separation settings. In all of our experiments, the estimation and prediction error of SCC was several times larger than that of every other method; indeed, in the medium separation setting, the estimation error Figure 4: The estimation and partition selection performance of our method run with λ = 100 and several competitors in the moderate cluster separation setting. Across all simulations, the estimation and prediction error of SCC was substantially greater than those of other methods and are not shown.
of SCC was between 2.7 and 3.0 while the prediction error was between 4.3 and 5.4. On further inspection, we found that the clusters recovered by SCC were quite different from the clusters identified by any of the other methods. Often, SCC would cluster together spatial units with very different true parameter values, introducing substantial bias in the parameter estimates. Interestingly, while KM achieved smaller estimation and prediction than SC in both the high and medium separation setting, SC performed better in the low separation setting, when the cluster structure was much less distinctive. As the data in our main simulation experiments was generated from a Gaussian likelihood, the Poisson regression model fit by And is mis-specified. As a result, its parameter estimates and one-step-ahead predictions are somewhat worse than those of PartOpt, which is well-specified. Interestingly, in our second set of experiments, where we generated crime counts from a Poisson regression model and then ran PartOpt on the transformed densities, PartOpt often outperformed the correctly specified And (see Section S3.3 of the Supplementary Materials).
In the high separation setting, each of And, KM, and PartOpt consistently recovered the true partitions. This is, in-and-of-itself, not especially surprising: in the high separation setting, the cluster structure is obvious upon visual inspection of the MLEs of α and β. Nevertheless, it is interesting to note that both SC and SCC perform quite poorly in terms of recovering the true partitions in this setting (see Figure S2 in the Supplementary Materials). In the medium and low separation setting, when the true cluster structure is less distinctive, the approximate posterior distribution over γ identified by PartOpt tends not to place much mass near the true partitions. On further inspection, we found that in these two settings, the true partitions that generated the data had considerably less posterior probability than the majority of the particles identified by PartOpt. This is also not especially surprising: when the data-generating parameters formed clusters which were all quite similar, the posterior favored forming a single big cluster rather than many smaller clusters with similar parameter values.

Clustering Crime Dynamics in Philadelphia
As described in Section 2, we model the transformed density of violent crimes y i,t in neighborhood i at time t as y i,t ∼ N (α i + β i x t , σ 2 ).We further wish to identify two partitions of neighborhoods: one, γ (α) , that clusters together neighborhoods with similar mean levels of crime α i , and the other, γ (β) , that clusters together neighborhoods with similar time trends β i . For our analysis of the Philadelphia crime data, we ran PartOpt using L = 20 particles and with λ = 100. Like in our simulation experiments, we fixed the hyperparameter η = 1 in the truncated Ewens-Pitman priors on γ (α) and γ (β) .
Identified particles. Figure 5 shows the top particle recovered by PartOpt. We highlighted the borders between clusters with thick black lines and each neighborhood's color correspond to the conditional parameter estimate given the partition.
Immediately we notice that while the top partition for the mean level of crime density γ (α) contains many clusters, the top partition for the time trends γ (β) contains only one cluster. We observe that it contains many neighborhoods which experienced small decreases in crime density and a handful of neighborhoods which experienced small increases in crime density. We observe that the overall range of β i 's is quite small, suggesting that the within-cluster Figure 5: Top particle identified by PartOpt. The thick black lines delineate the borders between the clusters, and the color represents the posterior mean of α and β given the top particle. Regions highlighted in red have different cluster assignment in the particle set.
variation is not large enough to overcome our prior's regularization towards a small number of large clusters. Indeed, as we discuss in more detail in Section S4.2 of the Supplementary Materials, the posterior distribution of γ is somewhat sensitive to the choice of prior.
The remaining nineteen particles identified by PartOpt are quite similar to the top particle shown in Figure 5. In fact, the twelve particles with next highest importance weights had identical γ (β) 's as the top particle. The particles with second through fifth highest importance weight differs from the first particle in terms of the cluster assignment of the sets of neighborhoods labelled (a), (b) and (c) in γ (α) , and the sixth through twelfth particles differ from the first in terms of the other neighborhoods highlighted in red.
The thirteenth through the twentieth particles have identical γ (α) 's, which differ from the partition shown in Figure 5 in the assignment of (a), but have different γ (β) 's. In the thirteenth particle, the neighborhood labelled (A) is removed from the large cluster and moved into a singleton cluster on its own. The remaining particles respectively separate the neighborhoods (B), (C), (D), (E), and (F) into their own clusters. Analysis of the top particle. In Figure 5, we can also immediately recognize many interesting features of Philadelphia's built environment and geography. The clusters labelled Wissahickon creek and Pennypack creek respectively correspond to the parks surrounding these two rivers. In light of this, it is not especially surprising that these areas have somewhat lower baseline levels of crime than their surrounding neighborhoods. We also observe several cluster borders coinciding along with sections of the major arterial road Broad Street (labelled Broad Street).  figure) with three notable exceptions on the east and southeast areas of the region. These three clusters, which are colored in lighter shades, are adjacent to university buildings and student housing. We also see a generally decreasing trend in crime in the vicinity of the universities and a slightly increasing trend further away from the universities. This finding aligns with previous reports of the positive impact of the University of Pennsylvania's West Philadelphia Initiatives aimed at improving the social and economic landscape around the university campus (Ehlenz, 2016).
Predictive performance. Table 1 reports the root mean square error for predicting crime density in each neighborhood in 2018. We compare the performance of PartOpt, which averages over all of identified particles, with the MAP plug-in estimator. We compare it with the forecasts made by a method that does not impose any shrinkage or clustering and instead makes predictions based only on the maximum likelihood estimates of α and β; we also compare it with the competitors considered in Section 4. Recall that the twenty identified particles were all quite similar to the top particle shown in Figure 5. Thus, it is not entirely surprising to see that MAP plug-in predictions are quite similar to the model averaged predictions made by PartOpt. Nevertheless, we find that by averaging over more uncertainty about the latent partitions, PartOpt achieved slightly better predictive performance. We find that And and SC perform worse than the simple MLE predictions. Interestingly, SCC recovers a single cluster of time trends β i , like the top particle shown in Figure 5. However, SCC's estimates of the mean levels α i are substantially smaller than both the MLE and PartOpt's estimates, yielding rather poor predictions.

Discussion
Accurate estimation of the change in crime over time is a critical first step towards a better understanding of public safety in large urban environments. An especially important challenge to such estimation is the potential presence of sharp discontinuities, which may be smoothed over by naive spatial shrinkage procedures. Focusing on the city of Philadelphia, we introduced a Bayesian hierarchical model that naturally identifies these discontinuities by partitioning the city into several clusters of neighborhoods and introduces spatial smoothness within but not between clusters. In particular, we focused on recovering two latent spatial partitions, one for the approximate baseline level of crime over the twelve year period 2006-2017 and one for the approximate time-trend.
Rather than use a computationally prohibitive stochastic search, we identified partitions with highest posterior probability by solving a single optimization problem. We showed that optimizing the proposed objective function is formally equivalent to finding a particular variational objective and introduced a local search strategy for solving this problem. While our primary focus has been on crime in the city of Philadelphia, our ensemble optimization framework is more general and there are a number of areas of future development, which we discuss below.
It is possible to run our particle optimization procedure at both higher and lower spatial resolutions. It would be interesting to run our procedure with data aggregated at the census block group level to reveal heterogeneity in crime dynamics within census tracts. Doing so would require considerably more computational resources, as there are 1336 block groups.
Though we have not done so in this paper, it is also possible to adjust for important neighborhood-level covariates within our framework. For instance, one may extend Equation (1) and model y i,t ∼ N (α i + β i x t + z i,t θ, σ 2 ) where z i,t is a vector of possibly timevarying covariates. Modifying our implementation of PartOpt for this purpose is relatively straightforward with a conditionally conjugate prior on θ; it essentially amounts to writing a new function to compute the marginal likelihood p(y|γ). Alternatively, one could replace the Ewens-Pitman priors on the latent partitions with priors that encourage neighborhoods with similar covariates to cluster together. Many such covariate-dependent clustering priors have been introduced in the past (Park and Dunson, 2010;Müller et al., 2011;Wade et al., 2014) and it would be straightforward to include support for these priors in our implementation of PartOpt. Profile regression (Molitor et al., 2010) incorporates covariates via direct regression adjustment and through the prior on clusters. It does not, however, allow different regression parameters to cluster differently nor does it permit parameters to vary within clusters. In spatial contexts, profile regression may not produce spatial clusters.
Although we have focused on identifying separate partitions of the mean levels and time trends, we believe that there are situations in which it is more desirable to identify clusters defined by both parameters. For instance, policymakers may be interested in identifying clusters of neighborhoods which have both high mean levels of crime and increasing time trends. In Section S3.5 of the Supplementary Materials, we discuss how we can accomplish this with a small modification to our implementation of PartOpt that constrains γ (α) = γ (β) .
Although linear approximations to the true regression functions f i,0 were reasonable in our dataset, in datasets with stronger suggestions of non-linearity and more observations per census tract, it is reasonable to consider higher-order approximations of f i,0 . With conditionally conjugate Gaussian priors on the tract-level parameters, it is still feasible to compute the marginal likelihood p(y | γ) required by particle optimization. We sketch such extensions in Section S4.1 of the Supplementary Materials Finally, while our analysis has focused on modeling crime densities, it is natural to wonder whether PartOpt can be extended to modeling crime counts with Poisson or negative binomial regressions. Our local search algorithm involves repeated calculation of the marginal likelihood p(y | γ). Unfortunately, for Poisson and negative binomial regression with CARwithin-clusters prior, these likelihoods are not available in closed form. Nevertheless, in small-scale experiments, we have found that running PartOpt with approximate marginal likelihoods computed using Laplace approximations works rather well. We discuss these approximations and report the preliminary results of an approximate PartOpt for Poisson regression in Section S6 of the Supplementary Materials.

Supplementary Materials for "Crime in Philadelphia: Bayesian Clustering with
Particle Optimization" 1 Proof of Proposition 1 In this Section 3.1 we state that we can find the set of L particles with largest posterior by finding a variational approximation of the tempered posterior Π λ . Here we restate Proposition 1 and provide the proof.
Remember that we denote with Γ L = {γ (1) , . . . , γ (L) } the set of L particles with largest posterior mass, with q(· | Γ, w) the discrete distribution that places probability w on the particle γ and with Q L the collection of all such distributions supported on at most L particles. Moreover, for each λ > 0, let π λ be the mass function of the tempered marginal posterior Π λ , where π λ (γ) ∝ π(γ | y) 1 λ .

Various hyper-parameter choices
The main model described in Section 2 depends on several hyper-parameters, which need to be fixed by the practitioner: the parameters for the prior for σ 2 (ν σ and λ σ ) and the multiplicative constants to specify within and between cluster variance (a 1 , a 2 , b 1 and b 2 ). We will now describe the data-dependent approach to specify such values.
Let us consider each neighborhood separately and fit a linear regression model for each one: letα i andβ i be the least square estimates andσ 2 i be the estimated residual variance for neighborhood i.
We can use the collection ofσ 2 i 's to specify the prior for σ 2 : by matching mean and variance, we can recover ν σ = 2 m 2 v + 4 and λ σ = m(1 − 2 νσ ), where we denote with m and v the empirical mean and variance of theσ 2 i 's.
To specify the within and between cluster variance parameters we use a two-step heuristic: at a high level, we first find a temporary estimate of the hyperparameters a 1 , a 2 , b 1 and b 2 , based on the MLE estimate of α i and β i and on the expected number of clusters; we then recover the maximum a posteriori (MAP) partition under these values and find the empirical Bayes estimate of a 1 and b 1 given the MAP. These values are finally used to run our full Particle Optimization procedure, which can be initialized from the MAP partition recovered in the first step of the heuristic.
Specifically, we consider the least square estimatesα i ,β i , which can be thought of as an approximation of α i , β i given the partition with N clusters γ N , since they do not incorporate any prior information or sharing of information; in fact under such configuration the coefficients are exchangeable and the only shrinkage induced is through the common variance parameter. Given this, one heuristic desideratum is that the marginal prior on α | γ = γ N should assign substantial probability to range of theα i , assuming symmetry around zero. Specifically, we will make sure that this conditional prior places 95% of its probability over the range of theα i 's. Since α | γ = γ N ∼ N (0, σ 2 (a 1 /(1 − ρ) + a 2 )I n ), we constrain a 1 and a 2 so that When the MLE's are not symmetric around zero the prior probability on the range ofα i 's will be smaller than 95%, but at least each point in the range has prior density higher than φ(2).
In order to determine each of a 1 and a 2 , we need a second constraint. To this end, consider the highly stylized setting in which we have K overlapping clusters with equal variance σ 2 cl whose means are equally spaced at distance 2σ cl . The idea of this second heuristic is to match such a stylized description to the observed distribution ofα i . In essence, this involves covering the range ofα i with K + 1 "chunks" of length 2σ cl . While the exact value of σ cl is unknown, we have found it useful to approximate it a 1 σ 2 /(1 − ρ). This approximation tends to produce smaller values of a 1 , which in turn encourages a relatively larger number of clusters.
With these two constraints we can find the temporary values: Similarly for theβ i 's we find: In order to operationalize these heuristics, we must specify an initial guess at K. We have found in our experiments, setting K = log N works quite well. It, moreover, accords with the general behavior of the Ewens-Pitman prior.
We now use these values to find the MAP partition γ (1) (we can run our Particle Optimization procedure with L = 1) and find the Empirical Bayes estimates of a 1 and b 1 given γ (1) and the other hyperparameter estimates, i.e. we find using a numerical optimization algorithm.
Note that finding the MAP as part of our heuristic procedure does not increase the computational burden. In fact, even though it requires us to run the Particle Optimization algorithm twice (the first time with L = 1), the output from the first run can be used as a starting point in the initialization of the second run -together with the partitions recovered by running k-means on the maximum likelihood estimates. We empirically see that the MAP recovered under the temporary hyperparameters has always higher posterior probability than other initializing partitions. Consequently, when we run our Particle Optimization procedure for the second time, we start from a point with high posterior probability, harnessing the work of the initial MAP search.

Synthetic data description
In Section 4, we generated several synthetic datasets based on a 20 grid of spatial units, given the true partitionsγ α andγ β and with various levels of cluster separations, as displayed in Figure 3.
Since the values of α k and β k were artificially chosen and not sampled, there are no exact values for a 2 and b 2 , but we can compute a lower bound for these hyperparameters so that the prior distributions of α k and β k cover the values artificially chosen. If we require that the values of α k and β k fall within the 0.025 and 0.975 quantiles of the prior distribution, we find that a 2 ≥ 9 and b 2 ≥ 2 for the high separation setting. In the moderate separation setting we find a 2 ≥ 8 and b 2 ≥ 1, while for the low separation setting a 2 ≥ 7 and b 2 ≥ 0. Note that the larger values of a 2 are due to the shift of 3.5 in the construction of α k , which is important to generate data that mimics the real data which has a positive mean level of crime.
For each cluster separation setting we generated 100 pairs of vectors (α, β). For each set of parameters we generated the outcomes We also considered a second data generating process, in which c i,t ∼ Pois(exp(α i + β i x t )).

Additional cluster separation settings
In Section 4, we compared the estimation, prediction and partition selection performance of our method to that of several competitors and we displayed results for the moderate separation setting. Specifically, for the estimation and prediction performance we displayed the root mean squared error (RMSE) for estimating the concatenated vector of parameters (α, β) and the RMSE for the vector of one-step-ahead observations y T +1 generated from the same model. For the partition selection performance we compared the adjusted Rand index between the true partitionsγ α andγ β and the recovered one.
We now report the same measures for the additional cluster separation settings in Figure S1 and Figure S2. As in Section 4, we do not show the RMSE and Prediction error for the SCC method, since they were substantially greater than those of other methods (RMSE ranged between 3.9 and 4.3 in the high separation setting and between 2.4 and 3.7 in the low separation setting; prediction error ranged between 7.7 and 8.8 in the former and between 3.2 and 3.7 in the latter). Figure S1: The estimation and partition selection performance in the high cluster separation setting. Figure S2: The estimation and partition selection performance in the low cluster separation setting.
Similarly to the moderate cluster separation setting, we see that PartOpt performs better in terms of estimation and predictive performance, with no other method consistently achieving second best. In fact KM performs almost as well as PartOpt in the high separation setting, while SC achieves second best in the low separation setting. In terms of selection performance, in the high separation setting, PartOpt recovers the true partitions almost always exactly, but And and KM also perform quite well; SC and SCC instead recover partitions that are quite different from the true one. In the low separation setting instead, all method have low values for the adjusted Rand index; in fact, when there is no difference between the cluster means the true partitionsγ α andγ β lose meaning, and we expect the methods to recover the partitions with only one cluster.
To better investigate some of these issues, we report in Figures  It's clear from these figures that even when the adjusted Rand index is low, the log-posterior of the partitions recovered by PartOpt is higher than the one of the true partitions. By examining the number of clusters of the recovered partitions, we see that SC always underestimate the number of clusters, suggesting the reason of the poor performance we had previously noticed. SCC and And instead often overestimate the number of clusters, but for different reasons. And in fact does not target spatial partitions, and we have to manually find the connected components, artificially inflating the number of clusters; however, while not spatial, the partitions recovered by And are not so distant from the true partitions and have relatively high log-posterior values. SCC instead targets spatial partitions, so no manual post-processing is necessary, but it's highly sensitive to the choice of spanning tree, resulting in low values of log-posterior.

Second data generating process
So far we have compared the behavior of PartOpt, And and the other competitors under the data generating process suggested by our model in Section 2. However, the method by Anderson et al. (2017) is developed for count data, generated from a Poisson distribution. So we also considered a second data generating process, in which the count data c i,t is generated from a Poisson distribution with mean λ i,t = exp(µ i,t ) and µ i,t = α i + β i x t , and the counts are transformed, as in equation (1): (2). In Figure S4 we report the results for the simulations under this second data generating process.
In high separation settings, And has better estimation performance than PartOpt (lower RMSE and higher log-posterior), even though the latter recovers the true partitions almost exactly. However, in moderate to low separation settings, PartOpt performs better than And, both in terms of estimation performance measures (lower RMSE and prediction error) and of log-posterior of the selected partitions, suggesting that PartOpt is robust to misspecification Figure S3: Additional partition selection measures in several cluster separation settings. Top row: high cluster separation. Medium row: moderate cluster separation. Bottom row: low cluster separation. Figure S4: The estimation and partition selection performance under the second data generating process, for several cluster separation settings. Top row: high cluster separation. Medium row: moderate cluster separation. Bottom row: low cluster separation.
of the model.

Sensitivity to hyperparameter choice
In all our simulation settings and data analysis the spatial autocorrelation hyper-parameter ρ is fixed and equal to 0.9. This choice is motivated by our search for clusters that display a large spatial autocorrelation, without having to choose an improper prior for α and β.
We now explore how the results from our synthetic analysis change when we fix a different value for the hyperparameter ρ. In particular, we consider the moderate cluster separation setting and we separately fit our model with ρ = 0.1, 0.5, 0.75, 0.95 together with ρ = 0.9 which is the value we used in our main synthetic analysis. Figure S5 shows the estimation and partition selection performance of PartOpt under the various values of ρ. We first notice that there is quite some heterogeneity in the partition selection for different values of the hyperparameter ρ; in fact, for smaller values of ρ, such as ρ = 0.1 and ρ = 0.5, PartOpt recovers partitions that are very close to the true partitions (with high adjusted Rand index values), while for larger values of ρ it recovers partitions that are quite distant from the truth, similarly to what we discovered in our main synthetic analysis. Remember in fact that for each value of the hyperparameter the posterior distribution changes, and the particles that have largest posterior under different values of the hyperparameter will most likely not coincide. However, it is reassuring that the particle recovered by PartOpt under each value of the hyperparameter has almost always larger posterior probability (under such value of ρ) than the particles recovered under different values of ρ (results not shown). Moreover, for all values of the hyperparameter, PartOpt achieves similarly good estimation and prediction performance, suggesting robustness with respect to the choice of ρ. Figure S5: The estimation and partition selection performance under the moderate cluster separation setting, for several specification of the hyperparameter ρ in PartOpt (the labels report the value of ρ for each specification). The performance of And and KM is also reported for reference.

Recovering equal partitions mean levels and time trends
It is possible to adapt our procedure to the case where one is interested in recovering the same partition for the mean levels and the time trends, i.e. when γ α = γ β = γ. In such case, our method approximates the posterior distribution π(γ|y) of one random partition that affects the distribution of both α and β. We have implemented this version of PartOpt that constrains the two partitions to be equal (hereafter referred to as "Equal Partition" or EqualPart) and we have compared its performance to the unconstrained method that we have presented in the main manuscript, under the same synthetic data simulation described in Section 3.1. Note that for this simulation study, there two true partitions used to generate the data were different. Figure S6: The estimation and partition selection performance of equalPart under the high cluster separation setting. The performance of PartOpt is also reported for reference. The panel "Rand Index for γ overlay" plots the adjusted Rand index between the partition returned by EqualPart and the overlay of the two true latent partitions. Figure S6 reports the estimation and partition selection performance of EqualPart compared with the unconstrained PartOpt under the high cluster separation setting.
Interestingly, despite "Equal Partitions" being misspecified, we see that it has comparable estimation and prediction error as PartOpt. In terms of partition recovery, the adjusted Rand index between the partition estimated by EqualPart and the two partitions used to generate the data appears to be quite small. This is somewhat unsurprising: our data was generated from a model with two latent partitions and "Equal Partitions" looks only for one. However, in virtually all of our simulation replications, the top partition recovered by EqualPart is quite close to the partition formed by "overlaying" the partitions in the top particle recovered by PartOpt. More specifically, this is the partition whose clusters are found as the pairwise intersection of clusters in the true partitionsγ α andγ β . In the combinatorics literature this is known as the meet of the two partitions, which corresponds to the greatest lower bound of these partitions, under the partial order defined by the "finer than" relation. The panel "Rand Index for γ overlay" plots the adjusted Rand index between the partition returned by "Equal Partitions" and the overlay of the two true latent partitions. We see that the "Equal Partitions" routinely identified partitions close to the true overlay.

Linearity of crime trends
In our analysis of crime in Philadelphia, we model the change of crime over time for years between 2006 and 2017 with a linear trend (see Equation (2) of the main manuscript). While linearity might not perfectly characterize the trend over time, it is the most practical and common choice when using a relatively small number of time points (see Bernardelli et al., 1995 andAnderson et al., 2017). In fact, this simple model allows us to detect the general trend, i.e. whether crime is overall increasing or decreasing in a neighborhood. However, the careful reader might worry about the validity of such assumption. We analyze here a representative sample of neighborhoods and their trend over time to check for strong nonlinearities.
To analyze the linearity of crime trends we computed the Pearson correlation coefficient between time and log crime density and examined the absolute value of their correlation. Neighborhoods characterized by a correlation coefficient close to 1 (in absolute value) have trends that are very close to linear. The ones with smaller values of the correlation coefficient (again, in absolute value) are either not changing over time, or could display non-linear trend. We note that more than 50% of the neighborhoods have correlation greater than 0.6 (in absolute value) and more than 75% greater than 0.4. Figure S7 shows the trends of the 30 neighborhoods with lowest correlation in absolute value (left panel) and the 30 neighborhoods with highest correlation in absolute value (right panel). While the low correlation neighborhoods display much more nonlinear variation than the high correlation ones, we still believe a linear trend is insightful in describing the time trend in such neighborhoods.  To study the neighborhoods with low correlation is more detail, Figure S8 displays the time trends individually for the six neighborhoods with lowest correlation in absolute value, together with the least square line. Most of these neighborhoods do not display clear nonlinear patters. The only neighborhood for which crime density seems to decrease and then increase is Census tract 299, where a quadratic term might better describe the trend.

Extending PartOpt to accomodate non-linearities
In our particular application, a first-order expansion of the expected transformed crime density was sufficient to characterize the general neighborhood-level trends in crime. However, for datasets displaying strong non-linearities, such an approximate may not be appropriate.
We now describe how one might extend PartOpt to accommodate more flexible non-linear models. To this end, consider a D th order expansion of the model in Equation 2 of the main text: where once again x t is the standardized time index. We now must estimate the base-line transformed crime density α i and a D-vector of regression coefficientsβ i = (β i,1 , . . . , β i,D ) for each neighborhood i.
Doing so, however, requires us to make more modeling decision than we had to make with the first-order model in the main text. Namely, we must decide how much heterogeneity we would like to allow in the spatial distributions of the coefficients in the expanded model. At one extreme, we can introduce D underlying partitions, one for each collection β (d) = (β 1,d , . . . , β N,d ), that allows different spatial distributions for each coefficient. We could then place conditionally independent CAR-within-clusters priors on each of these collections in a manner analogous to the priors on α and β in the main text. At the other extreme, we can introduce a single underlying partition, implicitly assuming that the spatial variability in, say, the quadratic coefficients is identical to the spatial variability in the linear coefficients. We could then place conditionally independent multivariate CAR (Gelfand and Vounatsou, 2003) priors on the vectors of regression coefficient for the neighborhoods in each cluster.
Although there are many possibilities for specifying the latent cluster structure of the intercepts and regression coefficients in the D th order expansion, we may still run PartOpt so long as marginalize out the intercepts and regression coefficients. Essentially, so long as we adopt conditionally conjugate priors within each cluster, we can still compute the log marginal likelihoods and conditional posterior expectations needed by PartOpt. We could, in fact, extend the model even further by using an alternative basis expansion in Equation (S1): where φ 1 , . . . , φ D are pre-specified basis functions. Once again, so long as we maintain conditional conjugacy within cluster, we would be able to run PartOpt.

Sensitivity to prior choice
In Section 5, we analyzed the data on crime density in Philadelphia's census tracts, by running our Particle Optimization procedure on the model described in Section 2. The choice of the prior distribution and of some of the hyperparameters could affect the posterior estimates recovered. In this section, we will analyze sensitivity to prior and hyper-parameter choices. In particular, we will compare results recovered under the Ewens-Pitman prior with different η parameters and under the Uniform prior on the space of spatial partitions SP.
We additionally study the sensitivity under different values of the spatial autocorrelation parameter ρ.
We start by reporting the top particle recovered by our procedure under the Uniform prior in Figure S9. We first notice that under this prior, the partition of the time trends γ β presents several clusters, identifying many moderately sized clusters that display a range of time trends, both increasing and decreasing. Moreover, the parameter estimates under the uniform prior are almost constant within cluster, in contrast to the ones under Ewens-Pitman prior that showed much larger levels of within-cluster variation. Interestingly, though we recover more clusters in the mean level partition γ (α) with a uniform prior, the estimates of α i arising from both priors show little substantive difference. While it might not be obvious from visually comparing the two plots, we can easily check by analyzing the linear correlation between the estimates of the vector of crime trends α under the two priors, which is equal to 0.999. The same correlation measure on the estimates of β under the two models is instead 0.931, which suggests the estimates are somewhat different.
While the partition under this prior can be seen as more interpretable, it is associated with worse predictive performances: it's out of sample predictive error is 0.2344, which is larger than the error under the Ewens-Pitman prior, but smaller than the one achieved by running And or by finding separate MLE's (see Table 1). Figure S9: Top particle identified by our procedure under the Uniform prior on SP. The thick lines highlight the border between the clusters, and the color represents the posterior mean of the parameters α and β conditional on the displayed particle.
We now analyze the particles recovered by our procedure under the Ewens-Pitman prior with different values for the concentration hyperparameter η. This hyperparameter regulates the probability of a units joining a new cluster, with higher values inducing a larger number of clusters in expectation. Specifically we compared values η = 1, 3, 5. We find that the partition of the mean level γ α in the top particle differs for different values of η, with the partition recovered under η = 5 displayed in Figure S10 and the one under η = 3 being very similar to it. We notice instead that the partition of the time trends γ β does not change substantially, still showing one cluster but also recovering several singleton clusters for neighborhoods with extreme values of the time trend β i . The change in the partition γ α is likely caused by the local nature of our algorithm, which can get stuck in local modes.
In fact, we found that the particle recovered under η = 5 has a larger posterior probability under the model with η = 1, compared to the top particle recovered under η = 1.
Finally, we considered sensitivity to the choice of the spatial autocorrelation hyperparameter ρ. In our analysis and simulation, it was chosen equal to 0.9, which induces a strong degree of spatial autocorrelation, without causing the prior distributions of α and β to be improper (this happens when ρ = 1). To test the sensitivity of our procedure to this value, we ran our method with different values of ρ, ranging from low prior autocorrelations with ρ = 0 Figure S10: Top particle identified by our procedure under the Ewens-Pitman prior with concentration hyperparameter η = 5.
to very strong prior autocorrelation with ρ = 0.99. We analyzed the top particle recovered by our procedure. For the mean levels of crime, our procedure recovered two sets of similar partitions for different values of ρ. One set corresponds to partitions that look like the top particle displayed in Figure 5 of our main manuscript, and it was recovered for ρ = 0.9 and for ρ = 0.85. The other set of partitions (recovered for ρ = 0. 99, 0.95, 0.80, 0.75, 0.50, 0.25, 0) is similar to the partition reported in Figure S11, which corresponds to the top particle for ρ = 0.99. Similarly to what we discussed previously, the difference between these two sets is likely caused by the local nature of our algorithm, which can get stuck in local modes. This seems to be the case in this example, as the partition recovered by our procedure with ρ = 0.99 has a higher posterior probability under the model with ρ = 0.9, suggesting that the algorithm with ρ = 0.9 and ρ = 0.85 got stuck in a local mode. While this is not ideal, the positive aspect is that the estimate for the α values is robust to these changes, and does not show differences; in fact, the linear correlation between any pair of estimates under different ρ values is always greater than 0.99.
Instead, the partition of the time trends recovered in the top particle is always equal to the partition with one large cluster (with one exception where a singleton cluster is recovered too). In this case the estimate of the β values changes slightly, but not substantially: the linear correlation between the estimate found under ρ = 0 and the one under ρ = 0.99 is Figure S11: Top particle identified by our procedure when the spatial autocorrelation hyperparameter ρ = 0.99. Many of the top particles recovered in our sensitivity analysis for other values of ρ looked similar to this.
0.95 but is greater than 0.98 for any two pairs of parameters found with positive values of ρ. This effect is not surprising, as we expect the CAR prior to have stronger effects when the partition recovered is formed by only one large cluster, rather than when it is formed by many smaller clusters, as it's the case for γ α .

Derivation of Closed Form Expressions
Recall from Section 2 that our full model is: We exploit the conditional conjugacy present in this model in several places. First, we have closed form expressions for the conditional posterior means E[α | y, γ] and E[β | y, γ], which we use in our particle optimization procedure to propose new transitions. Second, we can compute the marginal likelihood p(y | γ) in closed form, which we use to evaluate the optimization objective and pick between multiple transitions. Below, we carefully derive these closed form expressions, noting that in several places, we can avoid potentially expensive matrix inversions. In particular, the choice to center the time variable, thereby ensuring an orthogonal design matrix within each neighborhood, facilitates rapid likelihood evaluations.
Distribution of α k Let us first consider the vector of parameters α k in cluster S (α) k given σ 2 : by marginalizing the distribution of the grand cluster mean α k , we find that its distribution is a multivariate normal with covariance matrix σ 2 Σ +a 2 11 . Note that its precision matrix can be computed using Woodbury's formula without having to invert any matrix: where Ω (α) k ) * + (1 − ρ)I; the second line follows from noticing that 1 is both a left and right eigenvector of Ω (α) k,CAR with eigenvalue 1 − ρ. Similarly this holds for the distribution of β k .
Distribution of α Next, we can write the distribution of the whole vector α given σ 2 and γ (α) : by combining the distributions of the cluster specific parameters α k 's, and using the independence between different clusters, we find that the distrubution of α given σ 2 and γ (α) is a multivariate normal with mean zero and covariance matrix that can be found by combining the Σ (α) k 's. Because of the independence between clusters, there exists an ordering of the indices of α so that the covariance matrix of α|γ α , σ 2 has a block-diagonal structure. We denote such permutation of the indices with π (α) , and it can be constructed by mapping the first n 1 elements to the indices in the first cluster ({π (α) (1), . . . , π (α) (n 1 )} = S (α) 1 ), the following n 2 elements to the indices in the second cluster ({π (α) (n 1 + 1), . . . , π (α) (n 1 + n 2 )} = S (α) 2 ), and so on. With such ordering, the kth diagonal block of the covariance matrix is k . Similarly, we can find a (potentially different) permutation π (β) for β and derive the distribution of β π |σ 2 , γ (β) .
Notation To describe the distributions of interest we can represent our model in the form of a unique linear model, by combining all the observations in a vector Y , combining the reodered coefficients in a unique vector θ = (α π , β π ) and appropriately constructing the covariate matrix X. In the next paragraphs we will provide with the details on how we constructed such vectors and matrix.
To build the column vector Y we stack the vectors y i with i = 1, . . . , N : Y is a vector of length N ·T and each block of T rows corresponds to a particular neighborhood; in particular, the ((i − 1)T + t)th entry of Y corresponds to y i,t .
Its precision matrix can be computed using Woodbury's formula again: Σ −1 Y = I − X(Σ −1 θ + X X) −1 X . Note that X X is a diagonal matrix, and we derive its form towards the end of this section.
The marginal likelihood can now be derived by integrating out σ 2 : Note that if λ σ = 1, this is multivariate t-distribution with ν σ degrees of freedom.
For this we need to compute the quadratic form Because of the block diagonal structure of Σ −1 θ + X X we can write this as a sum over the clusters of the two partitions. Consider the column vector X Y of length 2N : the first N elements correspond to the summary statistics related to the α π(i) 's and we will denote the ones corresponding to cluster S k , while the second N elements are for the β i 's and we denote with (X Y ) (β) k the ones for cluster S (β) k . Now we can write k ; each of them can be inverted using methods for symmetric positive definite matrices.
To compute the marginal likelihood we are left we calculating the determinant of Σ Y , where we can use the reciprocal of the determinant of its inverse where the last equality is given by Sylvester's formula, and allows us to compute the determinant of a smaller dimensional matrix. Moreover, because of its block diagonal structure, we can compute the determinant block-wise.
Posterior mean of α, β The calculations for the posterior mean of α, β are very similar: using the same notation and the results for linear regression, we can find E θ|Y, γ (α) , γ (β) , σ −1 = X X + Σ −1 θ −1 X Y and since this does not depend on σ 2 , it coincides with E θ|Y, γ (α) , γ (β) . Because of the block diagonal structure of the matrices involved, we can compute the estimate of the parameter for each cluster independently. Moreover, note that the inverse of X X +Σ −1 θ is computed in the likelihood calculation, so it can be stored and does not need to be computed two times.
Derivation of X X Since in our formulation the covariates are orthogonal, i.e. T t=1 x it = 0 for all i, X X is a diagonal matrix. Note that column X (π (α) ) −1 (i ) contains T 1's in rows t + (i − 1)×T and zeros elsewhere; similarly column X N +(π (β) ) −1 (i ) contains elements (x i t ) in rows t + (i − 1) × T and zero's elsewhere. Thus, when we compute (X X) ij we consider the cross product of columns X i and X j . Depending on the value of i and j, we have the following cases: • if i = j ≤ N , then (X X) ij = T , • if i = j ≥ N , then (X X) ij = t x 2 π (β) (j−N ),t , • if i ≤ N and j = N + i, then (X X) ij = t x π (β) (i),t = 0, • if j ≤ N and i = N + j, then (X X) ij = t x π (β) (j),t = 0, • for any other i, j, (X X) ij = 0.
Thus the matrix X X is a diagonal matrix: the first n × n diagonal block is T I, and the second diagonal block is a diagonal matrix whose entries are T t=1 x 2 it ; when we have fixed design, x it = x t = (t − t)/sd(t), then T t=1 x 2 it = T t=1 ((t − t)/sd(t)) 2 is constant, so the second diagonal block is x 2 it I. Because of the orthogonality of the covariates, the upperright and lower-left blocks are zero matrices, since T t=1 x it = 0.
Note on cluster-wise update of calculations. In our greedy search when we perform a move only one or two clusters in only one partition is changed: in a split move for γ (·) , a cluster is divided into two sub-clusters, and the original cluster replaced by the first, while the second creates an additional cluster; in a merge move, one of two clusters is deleted and the other is replaced to the merge of the two original clusters. In each case, we need to update the value of the marginal likelihood, of the prior for γ (·) and of the estimate of the parameters.
Because of the block structure given by orthogonality of covariates and by the reordering of the parameters, changing the structure of some clusters does not affect the parameter estimates for other clusters that are not involved in the move. This implies that updates for updates to S (α) k do not affect the parameter estimates α h for h = k or β k for any k . Similarly, since the quadratic form Y Σ −1 Y Y can be written as sum of cluster-specific quadratic forms, we can update only the quadratic form of the clusters affected and we can compute the determinant of the blocks of Σ Y corresponding to the modified clusters.
This allows us to invert matrices that scale like the size of the clusters, reducing the computational costs dramatically.

Extension to Non-Conjugate Models
The proposed particle optimization strategy relies on the ability to compute the marginal likelihood π(y|γ). This is often straightforward with conjugate models, such as the one considered for our application. However, it may be challenging or impossible to compute π(y|γ) in more complicated non-conjugate settings. We can, nevertheless, approximate the marginal likelihood using, e.g., a Laplace approximation and deploy an approximate particle optimization strategy. Below, we outline how this works for Poisson regression.

Illustration on synthetic data
To understand the performance of our approximate PartOpt, we performed a simulation study very similar to that described in Section 3.1. Specifically, we generated observations where the α i 's and β i 's were drawn from a CAR-within-cluster distribution. We considered three settings of the α i 's and β i 's, one where the grand cluster means were highly separated (first row of Figure S12), one where the grand cluster means were moderately separated (second row of Figure S12), and one where the grand cluster means were very close in value (third row of Figure S12). We ran our approximate PartOpt with L = 10 particles and penalties λ = 1, 10, and 100.
For all values of λ, the particle set contained the true partitions γ (α) and γ (β) shown in Figure S12. Like the simulations reported in the main manuscript for the Gaussian regression setting, we found that when λ = 1, many particles collapsed to the same point. However, we recovered a much more diverse particle set when we set λ = 10 and λ = 100. Figure S13 shows the L = 10 particles recovered when we ran the approximate PartOpt on data generated from (S2) with the α i 's and β i 's from the high separation setting, shown in the top panel Figure S12. Similarly, Figure S14 shows the recovered particles for the moderate separation setting, when λ = 10. Finally, Figure S15 shows the recovered particles for the low separation setting, when λ = 10. In the high and moderate cluster separation settings, all of the estimated partitions are extremely close to the true partitions and the corresponding conditional posterior means of α i and β i are quite close to the true values. In the low separation setting, the particle set recovered partitions quite different from the ones used to generate the data. The behavior of the approximate PartOpt in the low separation Figure S12: The true partition γ (α) and γ (β) used to generate the synthetic data to test the extension of Particle Optimization to non-conjugate models. First row: high cluster separation configuration. Second row: moderate cluster separation configuration. Third row: low cluster separation configuration.
setting is not surprising: when there is little between-cluster variation in parameter values, the posterior strongly favors a single large cluster instead of several smaller clusters that all containing similar parameter values. Figure S13: The estimate particles (γ (α) , γ (β) ) recovered for λ = 10 in the high separation setting, and the weight associated to each one. Figure S14: The estimate particles (γ (α) , γ (β) ) recovered for λ = 10 in the moderate separation setting, and the weight associated to each one. Figure S15: The estimate particles (γ (α) , γ (β) ) recovered for λ = 10 in the low separation setting, and the weight associated to each one.