Modeling of spatial stratified heterogeneity

ABSTRACT Spatial stratified heterogeneity (SSH) refers to geographical phenomena in which attributes are more similar within strata than between strata; it is ubiquitous in the real world and offers information implying causation in nature. Stratification, the primary approach to SSH, generates strata using a priori knowledge or a wide range of supervised and unsupervised learning methods. Selecting a reasonable stratification method for spatial analysis in a specific domain without prior knowledge is challenging because no method is optimal in a general sense. However, a systematic review of the large number of stratification methods is still lacking. In this article, we review the methods for stratification, categorize the existing typical stratification methods into four classes (univariate stratification, cluster-based stratification, multicriteria stratification, and supervised stratification), and construct their taxonomy. Finally, we present a summary of the software and tools used to compare and perform stratification methods. Given that different stratification methods reflect distinct human understandings of spatial distributions and associations, we suggest that further studies are needed to reveal the nature of geographical attributes by integrating SSH, advanced algorithms, and interdisciplinary methods.


Introduction
Spatial heterogeneity (SH) is of great importance in the study of earth science, ecology, and public health. SH is defined either as the variation in space in the distribution of a point pattern or as the variation in the qualitative or quantitative value of a surface pattern (Dutilleul and Legendre 1993). Spatial heterogeneity disrupts the stationarity assumptions underpinning many spatial models (Gaetan and Guyon 2010); therefore, it is crucial to summarize the models used to deal with it. On a spatial scale, SH can be classified as spatial local heterogeneity (SLH) or spatial stratified heterogeneity (SSH) (Wang, Zhang, and Fu 2016). SLH concerns local spatial clusters with similar geographical attributes. For instance, many indicators detect hot spots, such as local indicators of spatial association (LISA) (Anselin 1995), Getis-Ord Gi (Getis and Ord 1992), and spatial scan statistics (Kulldorff 1997), and local effects can be modeled by geographically weighted regression (GWR) and its extended models (Brunsdon, Fotheringham, and Charlton 1998; Fotheringham, Yang, and Kang 2017; Huang, Wu, and Barry 2010). SSH emphasizes the heterogeneity between strata, where each stratum is composed of a number of spatial objects. For study variables in different strata, the heterogeneity may appear as significantly different means and dominant factors, as well as varying relationships with the explanatory variables. For example, precipitation has a dome-shaped effect on human plague in northern China and a U-shaped effect in southern China (Xu et al. 2011); that is, human plague exhibits spatial stratified heterogeneity between northern and southern China. Strata can be constructed through a stratification process based on prior knowledge or with the aid of quantitative methods, such as clustering analysis (Jain 2010; Jain, Murty, and Flynn 1999) and spatially constrained clustering (Lankford 2010; Assunção et al. 2006). The geographical detector model (Wang, Zhang, and Fu 2016; Wang et al. 2010) is a primary and popular tool for SSH, widely used for measuring SSH and evaluating the determinant power of potential explanatory variables (Yin et al. 2019; Hu et al. 2020; Xu et al. 2021). Other main models based on SSH are the Sandwich model, a mapping method for strong SSH, and the mean of the surface with non-homogeneity (MSN), which considers both spatial autocorrelation and SSH in mean estimation or spatial interpolation (Wang, Li, and Christakos 2009; Gao et al. 2020). Unlike local models such as GWR, model parameters based on SSH vary only across strata rather than continuously with spatial location. These models achieve a good trade-off between producing general results (i.e. model parameters do not vary with spatial location) and capturing local variations (i.e. parameters are allowed to vary across strata).
Stratification is one of the vital steps in research on SSH and is conducted mainly on the basis of prior knowledge (e.g. the Köppen climate classification system (Köppen 1931) and the ecoregions of the United States (Bailey 1995)) or with quantitative methods, such as clustering analysis (Jain 2010; Jain, Murty, and Flynn 1999), spatially constrained clustering (Lankford 2010; Assunção et al. 2006), and regression trees (Breiman et al. 1984). In explaining spatial heterogeneity, stratification has advantages in four major aspects. First, stratification results (i.e. strata) are qualitative data, and learning from qualitative data is often more effective and more efficient than learning from quantitative data or data of a mixed type. Second, stratification converts continuous explanatory variables into strata to find complex nonlinear correlations with the study variable (Luo, Song, and Wu 2021; Xu et al. 2011). Third, stratification can identify confounding factors by replacing unobserved data with strata based on expert experience (Wu, Li, and Guo 2021). Finally, stratification has been proven an effective way to improve the accuracy of spatial sampling and inference in several models, such as stratified kriging (Stein, Hoogerwerf, and Bouma 1988; Weber and Englund 1992), geographical detector-based stratified regression kriging (Liu et al. 2021), and Biased Sample Hospital-based Area Disease Estimation (B-SHADE) (Xu, Wang, and Li 2018).
Although some studies have conducted stratification based on SSH (Cao, Ge, and Wang 2013; Song et al. 2020; Meng et al. 2021), the selection of stratification methods usually hinges on individual preference, which often leads to unconvincing results. Therefore, attention still needs to be paid to the differences between these stratification methods, and a more systematic study of stratification is urgently required for scientific instruction. The main contributions of this work are as follows:
(1) Summarize the general criteria for stratification based on the statistical characteristics of the data and the study context.
(2) Construct a taxonomy of stratification algorithms according to the number of variables, exploratory or predictive task, and single or multiple criteria, together with the relationships among algorithms.
(3) Provide guidelines for the selection of stratification algorithms based on input parameters, robustness, arbitrary shape, and scalability to large data, along with available software packages for stratification in R, Python, Geoda, ArcGIS, and QGIS.
(4) Discuss the assessment of stratification results and future recommendations.
The remainder of this paper is organized as follows. Sections 2 and 3 describe the definition, processing, and criteria of stratification. Section 4 covers the main algorithms of stratification. Section 5 summarizes the stratification methods and gives common software packages for stratification. Sections 6 and 7 present the discussion and conclusions of the study, respectively.

Definition
By "stratification," we mean organizing geographical objects into subsets (called strata) based on the similarity of their attributes or spatial relationships. Stratification in spatial analysis is usually undertaken on a geographical basis, for example, by dividing the study area into subregions by variables, especially for the purpose of sampling (Cochran 1977) or statistical inference Wang, Gao, and Stein 2020). There are many terms similar to stratification, such as "clustering," "classification," and "regionalization." Although the three terms aim to group the study population into different subgroups, there are some minor differences among them. Clustering and regionalization always involve unlabeled data (unsupervised learning), and regionalization methods are also known as spatially constrained aggregation methods, zone design, and constrained clustering (Duque, Ramos, and Surinach 2007); while classification involves labeled data (supervised learning) (Duda, Hart, and Stork 2000). In regionalization, only adjacent zones may be merged to form regions, which is the distinction between regionalization and classification or clustering. In this paper, we refer to them collectively as "stratification" in the domains of SSH. Strata can be either attribute groups (result of clustering or classification) or spatially connected zones (result of regionalization).

Processing of stratification
Stratification processing consists of four steps: selection of explanatory variables, determination of stratification criteria, stratification, and assessment of results. Figure 1 depicts a typical sequencing of these steps.
(1) Selection of explanatory variables chooses distinguishing features from a set of candidates according to the study variables. The explanatory variables may be nominal, ordinal, interval-scaled, or of mixed types.
(2) Determination of stratification criteria refers to the abstraction of the research problem based on the study task (classified as exploratory or predictive), the criteria of stratification, and whether stratification is applied to each explanatory variable separately or to all variables jointly.
(3) The stratification step chooses the algorithm based on the pattern representation, the available software packages, and the user's understanding of the algorithms.
(4) Assessment of results can be performed with the q-statistic (Wang, Zhang, and Fu 2016; Wang et al. 2010). If the stratification result is too far from the expected one, the explanatory variables as well as the stratification algorithm can be reselected.
The outputs of stratification need to be interpreted with experimental evidence and analysis so that convincing conclusions can be drawn and the strata can be used for sampling or modeling.

Stratification criteria
Stratification is often driven by whether the study task is exploratory or predictive, by the number of variables, and by the criteria of the algorithm. A predictive task is solved by supervised learning methods and focuses on performance on new data; an exploratory task is solved by unsupervised learning methods and focuses on the patterns reflected by the current data. The number of variables in each stratification determines whether a univariate or multivariate approach is used. The criterion of the algorithm reflects our understanding of the problem and is the most central issue in stratification. In this section, we focus only on several general and popular criteria, which are summarized below:

Homogeneity
Spatial objects in the same stratum should be as similar as possible in terms of attributes relevant to the analysis. Similarity can be measured using within-strata variability, distance matrices, or probability models, as described below.
(1) Within-strata variability. This measures the variation of objects around the representative point of each stratum; the search for strata of maximum homogeneity leads to minimum within-strata variability. Let $X = \{x_i\}$, $i = 1, \ldots, n$, be a set of $m$-dimensional points grouped into strata $s_1, \ldots, s_K$. The within-strata variability can be calculated as

$$J(S) = \sum_{k=1}^{K} \sum_{x_i \in s_k} d(x_i, c_k),$$

where $c_k$ is the representative point of stratum $k$ and $d(\cdot, \cdot)$ is a dissimilarity. If $c_k$ is the mean value in stratum $k$ and $d$ is the squared Euclidean distance, $J(S)$ is the objective function of K-means (Ball and Hall 1965; MacQueen 1967) and Jenks natural breaks (JNB) (Fisher 1958; Jenks 1977). If $c_k$ is the medoid, $J(S)$ is the objective function of partitioning around medoids (PAM) (Kaufman and Rousseeuw 1990). Algorithms based on within-strata variability optimize an objective function that describes how well the data are grouped around representative points.
(2) Distance matrix. The elements of the distance matrix measure the dissimilarity (or similarity) between objects. The distance matrix is the input data for hierarchical clustering (e.g. single-linkage (Sneath 1957; Florek et al. 1951) and complete-linkage (McQuitty 1960)) and spectral clustering (e.g. the normalized cuts algorithm (Shi and Malik 2000)). For numerical variables, dissimilarity can be measured by the Minkowski metric,

$$d(x_i, x_j) = \left( \sum_{l=1}^{m} \lvert x_{il} - x_{jl} \rvert^{p} \right)^{1/p}.$$

If $p = 2$, $d(x_i, x_j)$ is the Euclidean distance, which is the most popular metric for evaluating the proximity of objects. If $p = 1$ or $p = \infty$, $d(x_i, x_j)$ represents the Manhattan or Chebyshev distance, respectively. Additionally, there are many other distance metrics, such as the cosine distance, and other relevant distances for numerical, categorical, or mixed-type data (Murtagh and Contreras 2012; Deza and Deza 2016). Methods based on the distance matrix formalize the idea that objects within strata should be similar and that objects in different strata should be dissimilar.
(3) Probability models. The density of the data is modeled as the sum of $K$ component densities $f_k(x)$, $k = 1, \ldots, K$, in some unknown proportions $\pi_k$; each stratum is represented by a parametric probability model. The general form of a mixture model, under the assumption that the data are independent, is

$$f(x) = \sum_{k=1}^{K} \pi_k f_k(x; \theta_k),$$

where $\sum_{k=1}^{K} \pi_k = 1$ and $\pi_k \geq 0$, and $\theta_k$ are the parameters of the density function of the $k$-th component. Generally, $f_k(x; \theta_k)$ is a normal distribution, with $\theta_k$ its mean vector and covariance matrix; $\pi_k$ and $\theta_k$ can be fitted by maximum likelihood using the expectation-maximization algorithm (Dempster, Laird, and Rubin 1977). A combined sketch of these three homogeneity measures follows below.
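All three measures can be computed with standard scientific Python libraries. The following sketch is illustrative only: the helper name `within_strata_variability` and the toy data are ours, not from any package.

```python
# Sketch of the three homogeneity measures: within-strata variability,
# Minkowski distance matrices, and a Gaussian mixture fitted by EM.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                    # toy m-dimensional points

# (1) Within-strata variability J(S): squared distances to stratum means.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

def within_strata_variability(X, labels):
    J = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        c_k = members.mean(axis=0)               # mean as representative point
        J += ((members - c_k) ** 2).sum()        # squared Euclidean dissimilarity
    return J

# (2) Distance matrices: p=2 Euclidean, p=1 Manhattan, p=inf Chebyshev.
D_euclid = squareform(pdist(X, metric="minkowski", p=2))
D_manhat = squareform(pdist(X, metric="cityblock"))
D_cheby = squareform(pdist(X, metric="chebyshev"))

# (3) Probability model: each mixture component plays the role of a stratum.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
strata = gmm.predict(X)                          # hard assignment from posteriors
print(within_strata_variability(X, labels), gmm.weights_)
```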

Balance of strata size
Balance of strata size means that all strata should have roughly the same size. When objects are equal in number across all strata, the analysis of variance has the highest statistical power and is less sensitive to violations of the equal-variance assumption (Wickens and Keppel 2004). This is important when strata are used in models involving significance tests. In practice, however, units are often unevenly distributed across strata. The quantile and geometrical interval methods essentially minimize the imbalance of strata size. Sun, Wong, and Kronenfeld (2017) employed the standard deviation of the number of units in each stratum as a measure of the degree of imbalance. In other methods, the degree of imbalance is treated as a constraint that imposes a minimum number or ratio on each group (Hothorn, Hornik, and Zeileis 2006; Guo and Wang 2011).
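As a concrete illustration, the imbalance measure of Sun, Wong, and Kronenfeld (2017) reduces to a one-line computation; the function name below is ours, not from any package.

```python
# Minimal sketch of the strata-size imbalance measure: the standard
# deviation of the number of units per stratum (0 = perfectly balanced).
import numpy as np

def strata_size_imbalance(labels):
    _, counts = np.unique(labels, return_counts=True)
    return counts.std()

labels = np.array([0, 0, 0, 1, 1, 2])   # toy strata labels
print(strata_size_imbalance(labels))    # larger value = greater imbalance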

Equality
Equality requires minimizing the difference between the total value of a certain attribute (e.g. administrative area or population) in a stratum and the mean of this total over all strata. It is an important criterion in the study of socioeconomic stratification (Wise, Haining, and Ma 1997; Kaiser 1966). In map classification, Armstrong, Xiao, and Bennett (2003) used the optimized Gini coefficient as the areal equality measure. Equality can also serve as a constraint, imposing a minimum threshold on the total attribute of each stratum (Duque, Anselin, and Rey 2012).

Spatial contiguity
Spatial contiguity requires that spatial connectivity or spatial tightness be met. It is the distinction between regionalization and conventional clustering or classification algorithms. The main criteria for spatial contiguity are hard spatial contiguity and soft spatial contiguity. The former means that the objects of a class are geographically connected and can be easily recognized. In contrast, the latter indicates that within-stratum units do not necessarily meet explicit spatial connectivity and can be measured by the similarity between geographical coordinates (Traun and Loidl 2012; Chavent et al. 2018). Spatial contiguity and homogeneity are the basic criteria underpinning regionalization (Assunção et al. 2006; Guo 2008; Duque, Anselin, and Rey 2012).

Spatial compactness
Spatial compactness is a stricter criterion than spatial contiguity, requiring the shape of a spatial partition to be nearly circular (Kaiser 1966). It has been an important characteristic in political districting (Kaiser 1966) and sales districting (Hess and Samuels 1971) and can be measured by the ratio of the perimeter to the area of a region or by the distance between an object's centroid and the region centroid (Wise, Haining, and Ma 1997).

Stratification methods
According to the study task (i.e. exploratory or predictive), the number of variables (univariate or multivariate), and the criteria of the algorithm (single criterion or multicriteria), we classify the stratification methods into four categories (Table 1): univariate stratification, cluster-based stratification, multicriteria stratification, and supervised stratification. Figure 2 shows the taxonomy of typical stratification methods under each category. In this section, we provide detailed descriptions of the algorithms in each class.

Univariate stratification
Univariate stratification methods are unsupervised and divide continuous data directly into strata according to user-defined parameters. The following is a description of the most commonly used univariate stratification methods (Cao, Ge, and Wang 2013; Song et al. 2020; Meng et al. 2021).
The equal interval (EI) method divides the entire range of data values into equally sized intervals. The cut points depend only on the start and end points and on the number of groups, ignoring the distribution of the data.

Figure 2. Taxonomy of stratification methods (WCSC, Ward's clustering with spatial constraints; BSSC, binarized spatially constrained spectral clustering; AZP-SA, AZP based on simulated annealing; ARiSel, automatic regionalization with initial seed location; SKATER, spatial "K"luster analysis by tree edge removal; REDCAP, regionalization with dynamically constrained agglomerative clustering and partitioning). A line with arrows between two algorithms means that one is a variation of the other.
The quantile (QU) method divides the data into a specified number of groups of equal size. It is particularly useful for ordinal-level data. However, it may assign objects with the same value to different strata, and values within a stratum may differ greatly.
The geometrical interval (GI) method creates class breaks based on a geometric series by minimizing the sum of squares of the number of elements per class; the geometric coefficient in GI can change once (to its inverse) to optimize the class ranges, ensuring that each interval has approximately the same number of values. It is particularly suitable for data with a skewed distribution, but the minimum value must be greater than zero.
The standard deviation (SD) method is one of the stratification methods that consider the data distribution. The breaks are formed by repeatedly adding or subtracting the standard deviation from the mean of the data. It works well only with symmetrically distributed data.
JNB is designed to determine the best arrangement of values into different intervals by minimizing the within-class variance and maximizing the between-class variance (Fisher 1958; Jenks 1977). It is a good "general" method and has been the default classification method of ArcGIS, a popular platform for geographical information processing.
Head/tail breaks (HTB) (Jiang 2013) recursively splits the data into a head group (values greater than the mean) and a tail group (values less than the mean), and then divides the head into secondary heads and tails, stopping when the proportion of the head is no longer below a threshold, such as 40%. The number of groups is thus determined by the threshold. HTB is a stratification scheme for data with a heavy-tailed distribution.
The break points of EI and GI are determined only by the start and end points (maximum and minimum values), and those of SD by the mean and variance; therefore, in practice, it is necessary to carefully avoid creating invalid intervals when using EI, GI, or SD. Apart from special requirements (e.g. the quantile), the choice among these methods rests on the data distribution. Figure 3 summarizes guidelines for choosing a method according to the data distribution, and a usage sketch follows below.
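The Python package mapclassify (see the Software packages section) implements most of these schemes; the sketch below is illustrative, uses skewed toy data of our own, and omits GI, which to our knowledge is specific to ArcGIS.

```python
# Applying several univariate stratification schemes with mapclassify.
import numpy as np
import mapclassify

y = np.random.default_rng(0).lognormal(mean=0, sigma=1, size=500)  # skewed, > 0

schemes = {
    "EI": mapclassify.EqualInterval(y, k=5),
    "QU": mapclassify.Quantiles(y, k=5),
    "SD": mapclassify.StdMean(y),           # breaks at mean +/- multiples of sd
    "JNB": mapclassify.NaturalBreaks(y, k=5),
    "HTB": mapclassify.HeadTailBreaks(y),   # number of strata set by the data
}
for name, classifier in schemes.items():
    print(name, classifier.bins)            # upper bound of each stratum
```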

Cluster-based stratification
Cluster-based stratification is aimed at grouping objects into strata based on their similarities. A large number of reviews of cluster analysis have been published, including general studies (Jain 2010; Xu and Wunsch 2005; Jain, Murty, and Flynn 1999) and studies on specific subjects, such as the robustness of clustering (Angel Garcia-Escudero et al. 2010), spatiotemporal data clustering (Zhicheng and Pun-Cheng 2019), clustering for high-dimensional and large datasets (Pandove, Goel, and Rani 2018), and clustering based on neural networks (Du 2010). Clustering algorithms can generally be classified as hierarchy-based or partitioning-based, according to the resultant partitions (Leung 2010; Jain, Murty, and Flynn 1999). Hierarchy-based methods generate a nested series of groups, while partitioning-based methods produce only one grouping.

Hierarchy-based clustering
Hierarchical clustering algorithms recursively find nested clusters in an agglomerative or divisive manner. Well-known agglomerative (bottom-up) algorithms include single-linkage clustering (SLC), complete-linkage clustering, and Ward's linkage clustering (Ward 1963). For agglomerative hierarchical clustering, Lance and Williams (1967) proved a unified formula for dissimilarity updates:

$$d(s_k, s_i \cup s_j) = \alpha_i \, d(s_k, s_i) + \alpha_j \, d(s_k, s_j) + \beta \, d(s_i, s_j) + \gamma \, \lvert d(s_k, s_i) - d(s_k, s_j) \rvert,$$

where $d(\cdot, \cdot)$ denotes the dissimilarity between clusters (for single points $x_i$ and $x_j$, their distance $d(x_i, x_j)$), and $\alpha_i$, $\alpha_j$, $\beta$, and $\gamma$ define the agglomerative criterion; the settings for the above methods are shown in Table 2.
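For illustration, SciPy's hierarchical clustering, which applies the Lance-Williams updates internally for the standard linkages, can produce strata under several agglomerative criteria; a minimal sketch with toy data:

```python
# Agglomerative hierarchical clustering with different linkages via SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(100, 3))
for method in ("single", "complete", "ward"):        # SLC, CLC, Ward's linkage
    Z = linkage(X, method=method)                    # stepwise merge record
    strata = fcluster(Z, t=4, criterion="maxclust")  # cut the tree into 4 strata
    print(method, np.bincount(strata)[1:])           # stratum sizes
```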
In contrast to agglomerative algorithms, divisive (top-down) algorithms split the data rather than aggregate it and are generally much more computationally complex (Macnaughton-Smith et al. 1964); thus, they are rarely used for stratification. To manage the time complexity (i.e. $O(n^2 \log n)$, where $n$ is the number of objects) and space complexity (i.e. $O(n^2)$) of agglomerative hierarchical clustering, scholars have proposed several improved methods, such as BIRCH (Zhang, Ramakrishnan, and Livny 1996), CURE (Guha, Rastogi, and Shim 1998), and Chameleon (Karypis, Han, and Kumar 1999).
BIRCH compresses the data objects into many small subclusters and then clusters them. Since the number of subclusters after compression is much smaller than the initial number of data objects, clustering can be executed with little memory, and the database is scanned only once.
In CURE, random sampling and partitioning are used to reliably find clusters of arbitrary shapes and sizes. The cluster distance is the minimum distance among the chosen representative points, which means that the technique incorporates both single-linkage and average-linkage methodologies.
The Chameleon algorithm groups objects into a number of small subclusters by a graph partitioning algorithm, given the minimum number of objects in a subcluster, and then finds the real clusters by repeatedly combining these subclusters with a hierarchical clustering algorithm. In Chameleon, cluster similarity is assessed by how well connected objects are within a cluster and by the closeness of clusters.

Partitioning-based clustering
Partitioning clustering is a grouping process that minimizes (or maximizes) an objective function (e.g. the sum of the squared errors of all clusters); distinct from the hierarchical approach, it gives only a single partition of the dataset. K-means is the most famous partitioning clustering algorithm (Ball and Hall 1965; MacQueen 1967) and is also one of the most popular algorithms for geospatial clustering (Polczynski and Polczynski 2014; Huang et al. 2020). In K-means, a partition is identified by minimizing the sum of the squared errors between each centroid and the other points in its cluster, with the centroid serving as the prototype.
Currently, most partitioning methods can be seen as extensions of the basic K-means, for example, PAM (Kaufman and Rousseeuw 1990) and K-means++ (Arthur and Vassilvitskii 2007). K-means converges to local minima, and different initializations can lead to different final clusterings. To overcome this weakness, K-means++ assigns a different weight each time a new initial point is selected, with the weights depending on the distance from the previously chosen initial points.
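A minimal scikit-learn sketch of K-means with the k-means++ initialization follows; the toy data are illustrative, and the `n_init` restarts further mitigate sensitivity to initialization.

```python
# K-means with k-means++ seeding and multiple restarts (scikit-learn).
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))
km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.inertia_)          # within-strata sum of squared errors J(S)
print(km.cluster_centers_)  # the representative points c_k
```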
The PAM algorithm is more robust than K-means; each cluster is represented by its medoid, defined as the point with the least mean dissimilarity to all other points in the cluster. However, PAM cannot handle large datasets due to its high computational complexity. To compensate, CLARA (Clustering LARge Applications) was proposed (Kaufman and Rousseeuw 1990). In CLARA, the medoids are obtained from a sample of the objects, and the remaining points are assigned to the cluster of the nearest medoid; the final clustering is the one that performs best across different sample sets. CLARANS (Clustering Large Applications based on RANdomized Search) (Ng and Han 2002) improves on both PAM and CLARA. In CLARANS, the grouping process is presented as searching a graph in which every node is a potential solution, that is, a set of medoids. It is more efficient than PAM because it checks only a sample of nodes; unlike CLARA, it draws a sample of neighbors at each step of the search, which has the benefit of not confining the search to a local area.
By introducing a kernel distance function, kernel K-means (Dhillon, Guan, and Kulis 2004; Schölkopf, Smola, and Müller 1998) addresses the limitation that K-means cannot capture nonconvex structures in the data. Spectral clustering is another nonlinear approach to the same limitation, acting on the eigenvectors of a matrix derived from the data. Inquiring into the relationship between these two methods, Dhillon, Guan, and Kulis (2004) found that two classical spectral clustering methods, normalized cuts (Shi and Malik 2000) and NJW spectral clustering (Ng, Jordan, and Weiss 2001), are special cases of weighted kernel K-means. The Gaussian mixture model (Rasmussen 1999) is a well-known probability-model-based algorithm. It assumes that all data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. The mixture model can be regarded as a generalized K-means that incorporates information about the data covariance structure and the potential Gaussian centers, which are usually estimated by the expectation-maximization algorithm (Dempster, Laird, and Rubin 1977).
In K-means and the variations mentioned in the previous paragraphs, each object belongs to one and only one group. Such crisp assignment of data to clusters can be inadequate in the presence of data points that are almost equally distant from two or more clusters. Fuzzy clustering allows objects to belong to more than one cluster simultaneously, with different degrees of membership in different clusters (for more details, see D'Urso (2016)), and it is widely used in spatial analysis (D'Urso et al. 2019b; Coppi, D'Urso, and Giordani 2010). The best-known fuzzy clustering method is fuzzy c-means (FCM) (Bezdek 1981), which is very similar to K-means except that a degree of fuzziness $df$ ($df > 1$) is needed. Fuzzy c-medoids is a variation of the PAM approach in a fuzzy framework; with respect to FCM, it is more appealing in that its medoid prototypes make the results of the final partition easier to interpret (D'Urso et al. 2019a).
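For concreteness, FCM can be implemented in a few lines of NumPy; the following sketch uses our own function and variable names and a fixed iteration count rather than a convergence test.

```python
# Minimal fuzzy c-means sketch: df is the degree of fuzziness (df > 1).
import numpy as np

def fuzzy_c_means(X, K, df=2.0, n_iter=100, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), K))
    U /= U.sum(axis=1, keepdims=True)            # memberships sum to 1 per object
    for _ in range(n_iter):
        W = U ** df
        C = (W.T @ X) / W.sum(axis=0)[:, None]   # fuzzy-weighted centroids
        D = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + eps
        U = 1.0 / (D ** (2.0 / (df - 1.0)))      # standard FCM membership update
        U /= U.sum(axis=1, keepdims=True)
    return U, C

X = np.random.default_rng(1).normal(size=(200, 2))
U, C = fuzzy_c_means(X, K=3)
print(U[:3])   # each object's degree of membership in the three strata
```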

Multicriteria stratification
Univariate methods and cluster-based methods mostly satisfy only one stratification criterion, such as homogeneity (e.g. K-means) or balanced strata size (e.g. QU); thus, they sometimes do not suit the stratification of spatial data, which are described by both location and attributes (Cromley 1996; Jenks and Caspall 1971; Wise, Haining, and Ma 1997). Multicriteria stratification is desired in special application scenarios, such as climate zones (Fovell and Fovell 1993) and socioeconomic districts (Assunção et al. 2006). Besides homogeneity, the most commonly used additional criterion is spatial contiguity (or spatial compactness). This section focuses mainly on regionalization methods, which are based on both homogeneity and spatial connectivity (Lankford 2010; Openshaw 1977; Gordon 1996b; Duque, Ramos, and Surinach 2007). According to whether an explicit spatial connectivity constraint is satisfied, these methods can be categorized into two types: hard spatial contiguity-constrained stratification and soft spatial contiguity-constrained stratification.

Hard spatial contiguity-constrained stratification
Hard spatial contiguity entails that the objects of a stratum must be strictly geographically connected. In stratification, the spatial relationships of the data are represented by a spatial adjacency matrix, usually denoted by $W = (w_{ij})$, in which $w_{ij} = 1$ if the $i$-th and $j$-th units are contiguous and $w_{ij} = 0$ otherwise. Hard spatial contiguity-constrained stratification always maximizes the homogeneity of attributes under the set constraints. As with cluster-based stratification, algorithms for hard spatial contiguity-constrained stratification can be divided into hierarchy-based and partitioning-based methods.
Commonly used hierarchy-based constrained stratification methods are spatially constrained hierarchical clustering (SCHC) (Guo 2009), spatial "K"luster analysis by tree edge removal (SKATER) (Assunção et al. 2006), and regionalization with dynamically constrained agglomerative clustering and partitioning (REDCAP) (Guo 2008). SCHC is a special form of constrained clustering whose constraint is spatial contiguity: merged objects must be spatially contiguous. SKATER is based on a minimum spanning tree (MST) and is similar to single-linkage clustering (Gower and Ross 1969). In SKATER, a full graph is first created, with observations as nodes and contiguity relations as edges, and reduced to an MST; the MST is then partitioned into regions (see the sketch below). Like SLC, SKATER exhibits a "chaining effect." REDCAP builds the spanning tree in different ways based on three linkages (single, average, and complete linkage) and two spatial constraining strategies (first-order constraining and full-order constraining); first-order single linkage and SKATER are identical.
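The MST idea behind SKATER can be sketched compactly. The greedy cut below, which removes the heaviest MST edges, is a simplification: SKATER itself selects cuts that minimize within-region dissimilarity. Production implementations are available in PySAL and Geoda (Table 3); all names and toy data here are ours.

```python
# Simplified MST-based regionalization: contiguity graph weighted by
# attribute dissimilarity -> MST -> cut edges -> connected components.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_regions(attrs, edges, n_regions):
    n = len(attrs)
    rows, cols = zip(*edges)                        # contiguity pairs (i, j)
    w = [np.linalg.norm(attrs[i] - attrs[j]) for i, j in edges]
    G = csr_matrix((w, (rows, cols)), shape=(n, n))
    mst = minimum_spanning_tree(G).tocoo()
    order = np.argsort(mst.data)[::-1]              # heaviest MST edges first
    keep = order[n_regions - 1:]                    # cut the n_regions-1 heaviest
    pruned = csr_matrix((mst.data[keep], (mst.row[keep], mst.col[keep])),
                        shape=(n, n))
    _, labels = connected_components(pruned, directed=False)
    return labels

# Toy example: a chain of 6 units with one sharp attribute break.
attrs = np.array([[1.0], [1.1], [1.2], [5.0], [5.1], [5.2]])
edges = [(i, i + 1) for i in range(5)]              # chain contiguity
print(mst_regions(attrs, edges, n_regions=2))       # e.g. [0 0 0 1 1 1]
```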
For partitioning-based methods, the common stratification methods are the automatic zoning procedure (AZP) (Openshaw and Rao 1995; Openshaw 1977) and Max-p regionalization (Duque, Anselin, and Rey 2012). In AZP, strata are formed by minimizing the sum of the squared errors of continuous spatial variable data. AZP originally (Openshaw 1977) used a slow hill-climbing heuristic to find the stratification solution, which can become trapped in a local solution. To improve on this, Openshaw and Rao (1995) proposed two alternatives: AZP-SA and AZP-Tabu. The former employs simulated annealing (Aarts and Korst 1989) as the optimization method, while the latter uses the Tabu search algorithm (Glover 1986). Automatic regionalization with initial seed location (ARiSel) is a variant of AZP that chooses the best initial points among many starting points based on minimizing the sum of squared errors (Duque and Church 2004). The main disadvantage of AZP is its high computational cost and inability to process large datasets (Guo 2008). Max-p regionalization consists of three key phases: growth, assignment of enclaves, and spatial search. First, in the growth phase, neighbors are merged into start objects under contiguity constraints until a minimum bound on equality is met. Second, areas that do not belong to the grown regions are stored as enclaves. Lastly, these enclaves are assigned to existing regions based on homogeneity so that the overall within-group sum of squares is minimized. Max-p can automatically determine the number of partitions, but in it, equality has a greater impact than homogeneity.

Soft spatial contiguity-constrained stratification
Soft spatial contiguity implies that within-strata objects need not meet spatial connectivity, although geographical location or connectivity is taken into account. There are two main solutions. The first takes the geographical coordinates (or centers of polygons) as additional attributes for clustering, with weights different from those of the other attributes (Webster and Burrough 1972; Oliver and Webster 1989); a sketch follows below. Depending on the weights given to the geographical coordinates and attributes, the results will have more or less spatially contiguous regions. However, setting the weights is challenging in applications. Ward's clustering with spatial constraints, proposed by Chavent et al. (2018), determines the weight of geographical similarity by increasing spatial contiguity without greatly decreasing the homogeneity of the feature space.
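A minimal sketch of this coordinate-weighting solution, with an illustrative user-chosen weight `lam` (larger values yield more spatially contiguous strata):

```python
# Soft contiguity by appending scaled coordinates to the attribute matrix.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
coords = rng.uniform(size=(200, 2))     # locations (or polygon centers)
attrs = rng.normal(size=(200, 3))       # thematic attributes

lam = 0.5                               # weight on geographical similarity
Z = np.hstack([StandardScaler().fit_transform(attrs),
               lam * StandardScaler().fit_transform(coords)])
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(Z)
```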
The other solution is to act on spatial connectivity through must-link or cannot-link constraints, resulting in a trade-off between spatial contiguity and homogeneity. Representative approaches include spectral constraint modeling (SCM) (Shi, Fan, and Philip 2010), spatially constrained spectral clustering (SSC), and binarized spatially constrained spectral clustering (BSSC) (Yuan et al. 2015). Yuan et al. (2015) evaluated the effectiveness of SCM, SSC, and BSSC on a large-scale terrestrial ecology dataset and found that SSC has the highest spatial connectivity (measured by the percentage of must-link constraints preserved within the regions) and a moderate within-group sum-of-squares error.

Supervised stratification
Supervised stratification methods are characterized by considering the study variable or existing labels during stratification, which yields more useful predictions than unsupervised methods. When objects have been labeled, many supervised learning algorithms can be used for stratification (Ramírez-Gallego et al. 2015; Yang, Webb, and Wu 2009). In this section, we introduce regression decision trees and methods based on the q-statistic of the geographical detector.

Decision trees
Decision trees are nonparametric supervised learning methods used for classification and regression. Classification and regression trees (CART) (Breiman et al. 1984) require no data scaling or normalization, and variables can be either continuous or discrete. In CART, explanatory variables are recursively partitioned into strata that are as homogeneous as possible with respect to the study variable; CART grows a large tree and then prunes it to a size based on cross-validation error. The conditional inference tree (CTREE) (Hothorn, Hornik, and Zeileis 2006), a regression method, recursively splits the space of explanatory variables based on the p-value, derived from permutation distributions, of the difference of the study variable between partitions. The tree size (i.e. the number of strata) of CTREE is controlled by the threshold of the given significance level. Many other regression tree models are also applicable to stratification; a relatively well-rounded overview can be found in Loh (2014).
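As an illustration, scikit-learn's DecisionTreeRegressor can play the role of a CART-style stratifier: leaf membership defines the strata, and the internal split thresholds are the learned break points. Parameter values and data below are illustrative only.

```python
# CART-style supervised stratification of an explanatory variable.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(500, 1))           # explanatory variable
y = np.sin(x[:, 0]) + rng.normal(0, 0.2, 500)   # nonlinear study variable

tree = DecisionTreeRegressor(max_leaf_nodes=6,      # at most 6 strata
                             min_samples_leaf=20,   # size constraint per stratum
                             random_state=0).fit(x, y)
strata = tree.apply(x)                              # leaf id = stratum label
cuts = np.sort(tree.tree_.threshold[tree.tree_.feature >= 0])
print(cuts)                                         # the learned break points
```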

Methods based on the q-statistic
The q-statistic of the geographical detector can be used to measure SSH (Wang, Zhang, and Fu 2016). Consider a study space composed of $n$ spatial objects stratified into $L$ strata, with stratum $h$ composed of $n_h$ spatial objects; let $y_i$ and $y_{hi}$ denote the study variable value of a unit in the population and in stratum $h$, respectively. The objective function is expressed as

$$q = 1 - \frac{\sum_{h=1}^{L} \sum_{i=1}^{n_h} (y_{hi} - \bar{y}_h)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},$$

where $\bar{y}_h$ and $\bar{y}$ are the means of the study variable in stratum $h$ and in the whole population, respectively; a minimal implementation follows below. Methods of q-statistic-based stratification are the optimal parameter-based geographical detector (OPGD) (Song et al. 2020), multi-scale discretization (MSD) (Meng et al. 2021), the interactive detector for spatial associations (IDSA), the geographically optimal zone-based heterogeneity (GOZH) algorithm (Luo et al. 2022), and the robust geographical detector (RGD) (Zhang, Song, and Wu 2022). OPGD must be given candidate numbers of strata and grouping methods; it then selects, among all combinations of interval numbers and methods, the result with the highest q-statistic as the final stratification scheme. MSD uses upscaling and downscaling strategies to obtain the optimal cut points that maximize the q-statistic, with the advantage that the result depends on the characteristics of both the explanatory variables and the study variable. IDSA employs spatial fuzzy overlay, a process for overlapping geographical variables, to derive spatial zones in terms of fuzzy relations across space; compared with OPGD, stratification in IDSA accounts for the correlations between stratification variables. GOZH employs CART to generate interactive strata, in which the number of strata is smaller than that produced by spatial overlays in the geographical detector. RGD is motivated by the sensitivity of the q-statistic to the stratification of explanatory variables; it obtains a robust q-statistic through an optimization algorithm for variance-based change-point detection.
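The q-statistic above translates directly into NumPy; the function name is ours.

```python
# Direct implementation of the q-statistic from the formula above.
import numpy as np

def q_statistic(y, strata):
    y, strata = np.asarray(y, float), np.asarray(strata)
    sst = ((y - y.mean()) ** 2).sum()               # total sum of squares
    ssw = sum(((y[strata == h] - y[strata == h].mean()) ** 2).sum()
              for h in np.unique(strata))           # within-strata sum of squares
    return 1.0 - ssw / sst                          # q lies in [0, 1]

y = np.array([1.0, 1.2, 0.9, 5.0, 5.3, 4.8])
strata = np.array([0, 0, 0, 1, 1, 1])
print(q_statistic(y, strata))                       # close to 1: strong SSH
```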

Stratification algorithm comparison
As shown in Figure 2, some algorithms inherit and improve upon the characteristics of their root algorithms. The improvements mainly concern robustness (the ability to resist the effects of outliers), handling of arbitrary shapes, spatial contiguity, stability, and scalability to large data. Stability means that the results of stratification are rarely influenced by the initial conditions, as in the enhancement of K-means by K-means++. Scalability to large data refers to the requirement that run time and memory should not explode for large datasets (Andreopoulos et al. 2009).
In Table 3, we summarize the general characteristics of the stratification methods in terms of input data, input parameters, robustness, arbitrary shape, and scalability to large data. A good stratification algorithm should have few input parameters, be insensitive to outliers, support arbitrary shapes, and scale to large data.
The information in Figure 2 and Table 3 will help users select the appropriate method for an application. Although an improved method may perform better in some respects than the original, it may also lose simplicity, i.e. require more input parameters; so, in practical applications, it is necessary not only to meet the problem requirements but also to consider the difficulty of parameter selection. For example, although the Chameleon algorithm can identify clusters better, it requires four input parameters, and the parameter selection process has a significant impact on the results.

The number of strata
Defining an appropriate number of strata is one of the most difficult issues for unsupervised stratification. Although some algorithms, such as head/tail breaks and Max-p regionalization, do not need the number of strata to be set, they still require other parameters that determine it. A reasonable solution is to first set an interval for the number of strata (e.g. to match the human memory limit, the number should not exceed nine (Miller 1994)) and then select the "optimal" number from that interval with the help of statistical indices or visualization. Many indices can be used to determine the number of strata (Milligan and Cooper 1985; Gordon 1996a; Liu et al. 2010), and they fall into two classes. One class comprises indices based on the compactness of within-stratum objects (or the separability of strata), for example, the Calinski-Harabasz (CH) index (Caliński and Harabasz 1974), the silhouette coefficient (SC) (Rousseeuw 1987), and the gap statistic (Tibshirani, Walther, and Hastie 2001); the other comprises indices based on cluster stability (Wang 2010; Ben-David, von Luxburg, and Pál 2006), which are usually computed by cross-validation or bootstrapping (Fang and Wang 2012). In Table 4, we list four well-known statistical metrics for choosing the best number of strata. The CH index is computationally efficient and the most commonly used unsupervised evaluation metric (Milligan and Cooper 1985). The gap statistic and clustering-stability-based methods are more general than CH and SC.
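A sketch of the index-based selection just described: scan a bounded range of candidate numbers of strata (here 2 to 9, matching the memory-limit suggestion) and compare the CH index and silhouette coefficient computed by scikit-learn. The data are illustrative.

```python
# Scanning candidate numbers of strata with two compactness indices.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score

X = np.random.default_rng(0).normal(size=(300, 2))
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, calinski_harabasz_score(X, labels), silhouette_score(X, labels))
# choose the k with, e.g., the highest CH index (higher is better for both)
```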
Visualization of the data is also an important means of determining the number of strata; the most common visualizations are histograms for one-dimensional data and scatterplots for two-dimensional data. Meanwhile, there are many techniques for mapping high-dimensional data to lower dimensions, including linear projection methods, such as principal component analysis (Wold, Esbensen, and Geladi 1987) and multidimensional scaling (MDS) (Sammon 1969), and nonlinear projection methods, such as self-organizing maps (Kohonen 1990). In addition to projection methods, visual assessment of tendency (VAT) can also be used to identify the number of strata by reordering the dissimilarity matrix so that similar samples are placed close to each other (Bezdek and Hathaway 2002). A detailed overview of the VAT family of algorithms can be found in Dheeraj and Bezdek (2020).
Note (Table 3): $X = \{x_i\}$, $i = 1, \ldots, n$, is the set of $d$-dimensional points, where $n$ is the number of objects and $d$ the number of variables; $y = \{y_i\}$ is the study variable; NS, number of strata; SW, spatial weights; DM, distance (dissimilarity) matrix; numlocal, number of local minima; maxneighbor, maximum number of neighbors; maxCF, maximum number of CF subclusters in each node; numrep, number of representative points for each stratum; minsize, threshold on the number of vertices; a, indicator of the relative closeness and relative inter-connectivity; minsplit, the minimum number of observations that must exist in a node; classInt (R) means classInt is an R package; mapclassify (Python) means mapclassify is a Python package.

Software packages
Many stratification software packages for spatial data are available in R (R Core Team 2021), Python, ArcGIS, QGIS, and Geoda (Anselin, Syabri, and Kho 2010). For univariate stratification, the base packages are classInt (Bivand et al. 2013) in R and mapclassify in Python; mapclassify is a subpackage of the Python Spatial Analysis Library (PySAL) (Rey and Anselin 2010). The cluster-based stratification packages are stats and cluster (Maechler et al. 2012) in R, and scikit-learn (Pedregosa et al. 2011) and pyclustering (Novikov 2019) in Python. Other R cluster analysis packages not mentioned in this study can be found in the Comprehensive R Archive Network task views (https://cran.r-project.org/web/views/Cluster.html). Generally, most spatially constrained stratification methods can be implemented with PySAL and Geoda. The software packages are summarized in Table 3.

Assessment of strata
Assessment of the results is critical, especially when the strata are used for later sampling and inference. For supervised stratification methods, the assessment of output is simple: we can use the sum of squared errors (SSE) of the study variable or the q-statistic defined above. The advantage of the q-statistic is that its value lies within [0, 1], and it can also serve as an index of association between a continuous study variable and a categorical explanatory variable. To avoid overfitting, the SSE or q-statistic can be calculated by cross-validation. For unsupervised stratification methods, assessment is not as trivial as for supervised algorithms because the context must be taken into account; that is, the assessment of unsupervised stratification should be based on the study context as well as statistical indicators. The classical statistical indicators quantifying intra-strata similarity or inter-strata dissimilarity are the SSE of the explanatory variables (i.e. the within-strata variability defined above) and the SH index described in Table 5. Other metrics can be found in the literature, such as Milligan and Cooper (1985), Gordon (1996a), and Liu et al. (2010).

Limitations of this study
All the methods listed in this paper are general-purpose and do not focus on special domains, in which superior stratification should be based on a priori knowledge to ensure that the results have good interpretive power. However, in practical applications, prior knowledge does not always exist, so it is necessary to use algorithms based on the data distribution and given constraints. In this paper, we introduce the characteristics of different methods but do not prescribe a strict usage scenario, because in practice it is difficult to have a reliable evaluation index for unsupervised methods; more conventional methods and ease of implementation are often preferred. For example, in applications of the geographical detector model, a popular tool for SSH analysis, the three most popular stratification algorithms are JNB, K-means, and quantile (Meng et al. 2021).

Future recommendations
Although SSH has been widely used in many fields, few studies have addressed new theoretical approaches to SSH. Future research should be furthered in the following four aspects. The first is how to synthesize the stratification of multiple variables in practical applications. The available methods are clustering and spatial overlays of the strata of each variable. The former ignores the importance of variables, and the latter may cause a "curse of dimensionality" (Bellman 1966): the number of strata expands exponentially with increasing variables, and many strata end up devoid of samples and thus incapable of providing any probability estimates. The second is how to evaluate the uncertainty of stratification in subsequent sampling or modeling. The third is developing new measures of SSH: the q-statistic and the spatial association detector (a variation of the q-statistic) (Cang and Luo 2018) are applicable only to univariate analysis and crisp strata, so research is urgently needed to develop SSH measures for spatial multivariate data and/or fuzzy strata. The last is building new stratification algorithms or prediction models based on both spatial autocorrelation and SSH, the two main ways of describing spatial variation.

Conclusions
SSH, a fundamental assumption for understanding geographical objects, simplifies complex objects, explains nonlinear relationships between variables, and eliminates confounding factors in modeling through stratification. In stratification, different algorithms reflect distinct human understandings of spatial distributions and associations. The choice of stratification depends on the task (e.g. an exploratory task should use unsupervised stratification and a predictive task supervised stratification) and on the study context, which may involve specific constraints. No method is optimal in a general sense, and knowledge of the different characteristics of an algorithm is important for making the required decisions. We have summarized typical existing stratification methods and constructed a taxonomy for selecting a more reasonable method. If the number of strata is not otherwise known, statistical metrics and visualization approaches have proved helpful in selecting an appropriate number of strata. The scale dependence of SSH remains a challenge in earth science and requires further study. Future research is recommended to reveal the nature of geographical attributes through the integration of SSH, advanced measures for multivariate analysis and/or fuzzy strata, and new prediction models based on spatial autocorrelation and SSH.

Data availability statement
The simulation data that support the findings of this study are available at https://github.com/geogxy/modelling_ssh

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
This work was supported by the National Natural Science Foundation of China [42071375].