Understanding and extending the geographical detector model under a linear regression framework

Abstract The Geographical Detector Model (GDM) is a popular statistical toolkit for geographical attribution analysis. Despite the striking resemblance of the q-statistic in GDM to the R-squared in linear regression models, their explicit connection has not yet been established. This study proves that the q-statistic reduces to the R-squared under a linear regression framework. Under linear regression and moderate-to-strong spatial autocorrelation, Monte Carlo simulation results show that the GDM tends to underestimate the importance of variables. In addition, an almost perfect power law relationship is present between the percentage bias and the degree of spatial autocorrelation, indicating that the bias rises quickly as spatial autocorrelation strengthens. We propose an integrated approach for variable importance quantification by bringing together spatial econometrics models and the game theory-based Shapley value method. By applying our proposed methodology to a case study of land desertification in Africa, we find that human activity tends to affect land desertification both directly and indirectly. However, such effects appear to be underestimated or undistinguished in the classic GDM.


CONTACT Guanpeng Dong gpdong@vip.henu.edu.cn

Introduction
The Geographical Detector Model (GDM), proposed by Wang et al. (2010), serves as a statistical toolkit for geographical attribution analysis by quantifying the extent to which the spatial variance of an outcome variable can be explained by a set of independent variables and their interactions (Wang et al. 2016). The distribution of most physical and human geographical variables often exhibits stark stratifications (Haining 2003, Banerjee et al. 2014, Dong et al. 2020), implying the potential existence of distinct mechanisms operating across strata (Davies et al. 2005). The fundamental concept of GDM is spatially stratified heterogeneity, which gauges the proportion of overall heterogeneity in an outcome variable attributed to between-strata heterogeneity, with strata delineated based on classifications of potential influencing factors (Wang et al. 2010, 2016). Should between-strata heterogeneity predominantly govern the total heterogeneity of an outcome variable, it is plausible to infer that the variable used to define strata is likely to be a driving factor of this outcome variable. Mathematically, spatially stratified heterogeneity is quantified by the q-statistic, formulated as in Equation (1) (Wang et al. 2016),

$$q = 1 - \frac{\sum_{h=1}^{L}\sum_{i=1}^{N_h}\left(y_{hi}-\bar{y}_h\right)^2}{\sum_{i=1}^{N}\left(y_i-\bar{y}\right)^2} = 1 - \frac{\sum_{h=1}^{L} N_h\sigma_h^2}{N\sigma^2} \qquad (1)$$

where $h$ represents the $h$-th stratum ($h = 1, \ldots, L$) predefined by the categories of one or more independent variables; $y_i$ is the outcome value of the $i$-th sample, and $\bar{y}$ is the overall mean; $y_{hi}$ is the outcome value of the $i$-th sample belonging to the $h$-th stratum and $\bar{y}_h$ is the sample mean of the $h$-th stratum; $N_h$ and $\sigma_h^2$ are the sample size and variance of $y_{hi}$ for the $h$-th stratum, respectively; and $N$ and $\sigma^2$ represent the size and variance of the full sample. The q-statistic lies within [0, 1], and its monotonic transformation, $\frac{N-L}{L-1}\cdot\frac{q}{1-q}$, is a classic F-statistic: the ratio of between-strata variance to within-strata variance of $y$, adjusted by the corresponding degrees of freedom. This statistic conforms to a noncentral F-distribution, thereby enabling significance inference on the q-statistic. Detailed mathematical derivations of the statistical significance test are given in Wang et al. (2016). A q-statistic of substantial magnitude and statistical significance obtained for a given variable strongly suggests the potential role of that variable as a driving force behind the observed outcome variable (see Note 1).

Due to its intuitive constructionist logic and computational simplicity, GDM has been extensively applied to a wide range of social and environmental disciplines including, amongst others, urban studies (Feng et al. 2021, Sapena et al. 2021), ecology (e.g. Sannigrahi et al. 2020), environmental pollution (e.g. Ding et al. 2019, Zhang et al. 2019), and climate change studies (e.g. Yin et al. 2019, Fan et al. 2021). For instance, Zhang et al. (2019) categorized air pollution exposure time and intensity into a small number of bands, and investigated their main and interaction effects on the peak bilirubin level of a newborn using GDM. Wang et al. (2023) employed the K-means clustering method to classify a set of factors into categories, and used GDM to quantify the impacts of those factors on vegetation optical depth.
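As a minimal illustration of Equation (1) and its F-transformation, the following Python sketch computes both quantities directly from stratum labels. The function names and the toy data are hypothetical and introduced here only for illustration; they are not part of any GDM software package.

```python
import numpy as np

def q_statistic(y, strata):
    """q = 1 - (within-strata sum of squares) / (total sum of squares)."""
    y = np.asarray(y, dtype=float)
    strata = np.asarray(strata)
    sst = np.sum((y - y.mean()) ** 2)            # equals N * sigma^2
    ssw = 0.0
    for h in np.unique(strata):
        y_h = y[strata == h]
        ssw += np.sum((y_h - y_h.mean()) ** 2)   # equals N_h * sigma_h^2
    return 1.0 - ssw / sst

def f_statistic(q, n, n_strata):
    """Monotonic transformation (N - L) / (L - 1) * q / (1 - q)."""
    return (n - n_strata) / (n_strata - 1) * q / (1.0 - q)

# Toy data (hypothetical): three strata with clearly separated means.
y = np.array([1.0, 2.0, 3.0, 11.0, 12.0, 13.0, 21.0, 22.0, 23.0])
strata = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
q = q_statistic(y, strata)
F = f_statistic(q, n=len(y), n_strata=3)
```

For these toy data the within-strata sum of squares is 6 and the total sum of squares is 606, so q is 600/606 and the F-transformation equals 300.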
Alongside the burgeoning applications of GDM, methodological advances have been proposed to address pragmatic issues when GDMs are applied to different types of spatial data. Cang and Luo (2018) used spatial variance to correct the bias of GDMs when spatially autocorrelated data are processed. They incorporated spatial weights into the GDM, which can cause the resulting values to exceed the range of [0, 1] and blur the physical interpretation of the value, although statistical significance can still be tested. To deal with the issue that users need to predefine (often arbitrarily) the discretization of a continuous variable (the number of categories as well as the cutting points), Meng et al. (2021) developed an optimal discretization scheme based on an exhaustive search method. It has the advantage that it accounts for the characteristics of both independent and dependent variables, as opposed to focusing solely on independent variables.
Although the q-statistic bears a striking resemblance to the R-squared in linear regression, their explicit connection has not hitherto been established. This motivates one of the key objectives and contributions of this study. Proving the equality that exists between the q-statistic and the R-squared model fit statistic under a linear regression framework is crucial to understanding the mathematical nature of GDM and extending the methodology so that it can be applied to additional application contexts. To provide a proof of concept, two types of extensions are discussed in this study. First, spatial autocorrelation affects the effective sample size (the amount of information that would be provided by independent or random geographic samples), and this in turn influences the calculation of both the overall and the stratum-wise variance parameters. It has been established that the effective sample size in the presence of spatial autocorrelation is smaller than the actual geographic sample size (Griffith 2005, 2013). Using the equivalence between the q-statistic and R-squared, state-of-the-art spatial econometrics (or spatial statistics) models can be specified to deal with spatial autocorrelation, whilst retaining the logic inherent in the definition and measurement of variable contribution in GDM. Second, many theoretically and mathematically sound variable importance decomposition methods, such as the game theory-based Shapley value method (Shapley 1953, Shorrocks 2013), could be naturally incorporated into spatial econometrics models. This would, in turn, offer informative interpretation of both the main and the interaction effects exerted by two or more independent variables on an outcome variable under investigation. It is important to note that establishing the equality between the q-statistic and R-squared permits the extension advanced in this study to be readily applied to panel data models, an important research avenue to be explored in future studies.
The equivalence between the q-statistic and R-squared can first be discerned through the interpretation of these two statistics. At its heart, the q-statistic measures the extent to which independent variables explain the variability (or spatial pattern) of an outcome variable (Wang et al. 2010, 2016). Under the linear regression framework, the R-squared, also known as the coefficient of determination, measures the proportion of variability in a dependent variable that can be explained by the independent variables included in a linear regression model (Kvalseth 1985, Freedman 2009) (see Note 2). When all independent variables are categorical in nature or have been categorized before entering a linear regression model, the R-squared measures exactly the same quantity as the q-statistic (mathematical details provided below).
One key implication of this equality is that a more accurate q-statistic for spatial reasoning could be achieved in situations where data are not independent, that is, where they present spatial or group dependence. For instance, both classic and advanced spatial and multi-level extensions to the linear regression model have been well established and can be flexibly implemented with existing open-source software packages (e.g. Bates et al. 2015, Dong and Harris 2015, Dong et al. 2016, Bivand et al. 2021, Ma and Dong 2023). Such regression models, combined with the game theory-based Shapley value method for variable importance decomposition, can yield great benefits for spatial reasoning. To demonstrate this, the study first derives the mathematical equivalence that exists between the q-statistic and the R-squared under a linear regression framework. Then, Monte Carlo simulation experiments are undertaken to assess the extent of the bias of the q-statistic in GDMs when processing data with varying degrees of spatial autocorrelation. One key result indicates that the q-statistic tends to underestimate the importance of factors; this downward bias grows quickly in response to increasing levels of spatial autocorrelation. In addition, the empirical relationship between the extent of bias and the strength of spatial autocorrelation follows a power law. Thereafter, the game theory-based Shapley value method, originally applied in non-spatial linear regression models, is introduced to show how it could be adapted to spatial econometrics models. Finally, the developed methodology is applied to identify factor importance in the context of land desertification in Africa.

Proving the equivalence between the q-statistic and R-squared
The modelling starts with a classic linear regression model specified by Equation (2) as

$$y_i = \alpha + \sum_{h=2}^{L}\gamma_h \ddot{X}_{h,i} + \varepsilon_i \qquad (2)$$

where $y_i$ is a dependent variable; $\alpha$ is the intercept term; $\gamma_h$ is the regression coefficient of covariate $\ddot{X}_{h,i}$; and $\varepsilon_i$ is the residual term following a normal distribution with mean zero and variance $\sigma_{\varepsilon}^2$. It is useful to note that $\ddot{X}_{h,i}$ may be a set of dummy variables generated by encoding a categorical variable $X$. Accordingly, if $X$ has or can be discretized into $L$ strata, then as many as $L-1$ dummy variables, $\ddot{X}_h$, $h \in \{2, 3, \ldots, L\}$, need to be defined by

$$\ddot{X}_{h,i} = \begin{cases} 1, & \text{if sample } i \text{ belongs to stratum } h \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

The above treats the first stratum of $X$ as the reference (i.e. baseline) group or category. Of course, any of the $L$ strata can be selected as the reference group without affecting the fitted values of the model. Among the many assumptions imposed on the model residual term (an extensive list is given by Wooldridge (2010)), one is the independence of samples or, technically, that the off-diagonal elements of the covariance matrix of $\varepsilon$ must be equal to zero. We turn our attention to the R-squared statistic, or the coefficient of determination, formulated by Freedman (2009) as

$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i-\hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i-\bar{y}\right)^2} \qquad (4)$$

where $\hat{y}_i$ is the conditional expectation of $y_i$ given $\ddot{X}_i$ as in Equation (2), and $\bar{y}$ is the overall mean of $y$. This expression of R-squared possesses most of the desirable properties that make a good statistic for model fit, and is easily extended to a model fit statistic that is resistant to extreme sample values (Kvalseth 1985). The numerator (the sum of squared residuals) can be further expressed stratum-wise as

$$\sum_{i=1}^{N}\left(y_i-\hat{y}_i\right)^2 = \sum_{h=1}^{L}\sum_{i=1}^{N_h}\left(y_{hi}-\hat{y}_{hi}\right)^2 \qquad (5)$$

where $N_h$ is the same as in Equation (1). Comparing Equation (5) with the q-statistic expressed in Equation (1), it only remains to prove the equality of $\hat{y}_{hi}$ and $\bar{y}_h$ in order to establish equality between the q-statistic and the R-squared. Recalling the coding rules for variable $X$ noted above, for the samples belonging to the $h$-th stratum only $\ddot{X}_{h,i}$ is equal to 1; the other dummy variables are equal to 0. Consequently, the predicted value, $\hat{y}_{hi}$, is

$$\hat{y}_{hi} = \hat{\alpha} + \hat{\gamma}_h \qquad (6)$$

where $\hat{\alpha}$ is the estimated intercept and $\hat{\gamma}_h$ is the estimated regression coefficient of $\ddot{X}_h$ (with $\hat{\gamma}_1 = 0$ for the reference stratum). A well-known result is that the conditional expectation for samples belonging to the same group or stratum equals the group mean in dummy variable regression (e.g. Powers and Xie 2008, Freedman 2009):

$$\hat{y}_{hi} = \hat{\alpha} + \hat{\gamma}_h = \bar{y}_h \qquad (7)$$

Derivations of this equality are provided in the Appendix. Finally, the equality between the q-statistic and R-squared can be readily shown as

$$R^2 = 1 - \frac{\sum_{h=1}^{L}\sum_{i=1}^{N_h}\left(y_{hi}-\bar{y}_h\right)^2}{\sum_{i=1}^{N}\left(y_i-\bar{y}\right)^2} = q \qquad (8)$$

For multi-factor detection models, the above equivalence can be derived via a multivariate linear regression model. For example, consider two factors $X$ and $Z$, each with three categories or strata. In order to maintain the equality of R-squared and the q-statistic, it is necessary to add interaction terms between the sets of dummy variables to the baseline regression model, leading to

$$y_i = \alpha + \sum_{j=2}^{3}\gamma_j x_{j,i} + \sum_{k=2}^{3}\beta_k z_{k,i} + \sum_{j=2}^{3}\sum_{k=2}^{3}\delta_{jk}\, x_{j,i} z_{k,i} + \varepsilon_i \qquad (9)$$

where $\gamma$ and $\beta$ are regression coefficients for the two sets of dummy variables, $x$ and $z$; and $\delta$ is a vector of coefficients for the interactions between $x$ and $z$. By Equation (7) again, the conditional expectation for samples belonging to groups $j$ and $k$, $\hat{y}_{jk}$, equals the group mean $\bar{y}_{jk}$ (Abadie 2005, Athey and Imbens 2006), where $j$ and $k$ are the factor levels of each variable. From this, it can be concluded that in a multi-factor interaction detection model, the equivalence between the q-statistic and R-squared also holds.
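The single-factor equivalence can be checked numerically. The sketch below is an illustration on simulated data, not the authors' code: the sample size, number of strata, and data-generating rule are assumptions made here. It fits the dummy-variable regression by ordinary least squares and compares the resulting R-squared with the q-statistic computed from stratum means.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 300, 4
strata = rng.integers(0, L, size=n)        # categorical X with L strata
y = 2.0 * strata + rng.normal(size=n)      # hypothetical outcome

# Design matrix: intercept plus L - 1 dummies (stratum 0 is the reference).
X = np.column_stack(
    [np.ones(n)] + [(strata == h).astype(float) for h in range(1, L)]
)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ coef
sst = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - np.sum((y - y_hat) ** 2) / sst

# q-statistic computed directly from the stratum means.
group_means = np.array([y[strata == h].mean() for h in range(L)])
q = 1.0 - np.sum((y - group_means[strata]) ** 2) / sst
```

The fitted values coincide with the stratum means, the intercept equals the mean of the reference stratum, and `q` and `r2` agree up to floating-point error.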
After establishing the mathematical equivalence between the q-statistic and R-squared under a linear regression framework, a simple verification with empirical data was carried out using data in the R package of GDM (Wang et al. 2010). In a single-factor detection model, with incidence as the dependent variable and elevation as the independent variable, the result of $q = R^2 = 0.6067$ was obtained. In a two-factor detection model with incidence as the dependent variable, and soil type and elevation as the independent variables, the same equivalence of $q = R^2 = 0.6635$ was obtained. In addition, a series of Monte Carlo simulation experiments were carried out to verify the result. With correct linear regression model specifications, the R-squared obtained is equal to the q-statistic in GDM under all scenarios.
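The two-factor case can be sketched in the same way. The example below uses simulated data with assumed coefficients rather than the dataset mentioned above. It shows that the R-squared matches the q-statistic of the cross-classified strata only when the interaction dummies are included; the main-effects-only model gives a smaller value.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 600
x = rng.integers(0, 3, size=n)     # factor X with three strata
z = rng.integers(0, 3, size=n)     # factor Z with three strata
y = 1.0 * x - 0.5 * z + 0.8 * x * z + rng.normal(size=n)   # hypothetical outcome

def dummies(codes, levels):
    """Dummy columns for levels 1..levels-1 (level 0 is the reference)."""
    return np.column_stack([(codes == k).astype(float) for k in range(1, levels)])

def r_squared(design, y):
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

dx, dz = dummies(x, 3), dummies(z, 3)
inter = np.column_stack([dx[:, j] * dz[:, k] for j in range(2) for k in range(2)])
ones = np.ones((n, 1))
r2_full = r_squared(np.hstack([ones, dx, dz, inter]), y)   # with interactions
r2_main = r_squared(np.hstack([ones, dx, dz]), y)          # main effects only

# q-statistic with strata defined by the 3 x 3 cross-classification.
cell = 3 * x + z
cell_means = np.array([y[cell == c].mean() for c in range(9)])
q = 1.0 - np.sum((y - cell_means[cell]) ** 2) / np.sum((y - y.mean()) ** 2)
```

Here `r2_full` equals `q` up to floating-point error, whereas `r2_main` is strictly smaller because the simulated outcome contains an interaction.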

A Monte Carlo simulation experiment for data with spatial autocorrelation
After establishing the equivalence between the q-statistic and R-squared in the linear regression model, it was natural to study whether the q-statistic in GDM performs well in the presence of spatial autocorrelation. The rationale was based on the established result that estimates of regression coefficients are biased in a linear regression model when applied to data with spatial autocorrelation (e.g. Anselin 1988, Banerjee et al. 2014). The biased estimates of regression coefficients for the dummy variables discussed above could render Equation (7) invalid, thereby yielding bias in estimates of variable importance from the q-statistic because of the equivalence. To test this conjecture and assess the degree of bias of the q-statistic, a Monte Carlo simulation experiment was conducted. The specific steps of the experiment were as follows.
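Step (1): Taking an open-source dataset, the North Carolina dataset associated with the spatialreg (Bivand et al. 2021) R package, as the experimental geography, the spatial adjacency-based weights matrix $W$ was constructed, with elements $w_{i,j}$ defined on the basis of geographical contiguity using Equation (11):

$$w_{i,j} = \begin{cases} 1, & \text{if areas } i \text{ and } j \text{ are contiguous} \\ 0, & \text{otherwise} \end{cases} \qquad (11)$$

Afterwards, the weights matrix was row-normalized (Figure 1).

Step (2): An independent variable $X$ with three categories was generated, and the dependent variable $y$ was then generated separately by a spatial lag model (SLM), a spatial error model (SEM), and a spatial Durbin model (SDM):

$$\text{SLM:}\quad y = \rho W y + \alpha + \ddot{X}\gamma + \varepsilon$$

$$\text{SEM:}\quad y = \alpha + \ddot{X}\gamma + u, \qquad u = \rho W u + \varepsilon$$

$$\text{SDM:}\quad y = \rho W y + \alpha + \ddot{X}\gamma + W\ddot{X}\theta + \varepsilon$$

where the strength of spatial autocorrelation increases with the parameter $\rho$; $\varepsilon$ is a normally distributed model residual term, with mean zero and variance $\sigma_{\varepsilon}^2$; $\alpha$ is the intercept term; $\ddot{X}$ collects the dummy variables $\ddot{X}_h$ recoded from $X$, with coefficients $\gamma$; $u$ is the spatially autocorrelated error term of the SEM; and $\theta$ holds the coefficients of the spatially lagged dummy variables in the SDM. In each simulation, the dependent variable $y$ is generated based on different spatial autocorrelation mechanisms (SLM, SEM and SDM), a randomly generated independent variable $X$, a randomly generated model residual term $\varepsilon$, and pre-set parameters. Without loss of generality, the simulation parameters were set as: [parameter values not recoverable from the source].

Step (3): The true variable importance quantities were calculated. With known values of the spatial autocorrelation parameter $\rho$, already set in Step (2), the spatial econometrics models reduce to a linear regression model similar to Equation (2) with a new transformed dependent variable: $\tilde{y} = y - \rho W y$ for SLM and SDM, and the analogous transformation that removes the spatially autocorrelated error component for SEM. From this, the true variable importance quantity was calculated as

$$R^2_{\text{true}} = 1 - \frac{\sum_{i=1}^{N}\left(\tilde{y}_i-\hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(\tilde{y}_i-\bar{\tilde{y}}\right)^2}$$

where $\hat{y}_i$ is the conditional expectation of $\tilde{y}_i$ in a linear regression model and $\bar{\tilde{y}}$ is the mean of $\tilde{y}$.

Step (4): The percentage bias of the q-statistic in GDM was calculated as

$$\text{Percentage bias} = \frac{q - R^2_{\text{true}}}{R^2_{\text{true}}} \times 100\%$$

For each value of $\rho$, 1,000 samples were randomly generated and Steps (1) to (4) were implemented for each sample.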
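The steps above can be sketched for the spatial lag case as follows. This is a simplified illustration under several assumptions that differ from the experiment described: a 15 by 15 regular grid with rook contiguity replaces the North Carolina geography, the intercept, stratum effects, and residual variance are hypothetical values, only 50 replicates are drawn, and the spatial autocorrelation parameter is written `rho`.

```python
import numpy as np

def rook_weights(side):
    """Row-normalised rook-contiguity weights for a side x side grid."""
    n = side * side
    W = np.zeros((n, n))
    for r in range(side):
        for c in range(side):
            i = r * side + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < side and 0 <= cc < side:
                    W[i, rr * side + cc] = 1.0
    return W / W.sum(axis=1, keepdims=True)

def r_squared(y, strata, levels=3):
    """R^2 of the dummy-variable regression (equal to the q-statistic)."""
    X = np.column_stack(
        [np.ones(len(y))] + [(strata == h).astype(float) for h in range(1, levels)]
    )
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def mean_pct_bias(rho, W, n_rep=50, seed=0):
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    gamma = np.array([0.0, 1.0, 2.0])        # hypothetical stratum effects
    biases = []
    for _ in range(n_rep):
        strata = rng.integers(0, 3, size=n)  # Step (2): random three-category X
        eps = rng.normal(size=n)
        rhs = 1.0 + gamma[strata] + eps      # alpha + X gamma + eps
        y = np.linalg.solve(np.eye(n) - rho * W, rhs)   # SLM reduced form
        q = r_squared(y, strata)                        # q-statistic on raw y
        r2_true = r_squared(y - rho * W @ y, strata)    # Step (3): filtered y
        biases.append(100.0 * (q - r2_true) / r2_true)  # Step (4)
    return float(np.mean(biases))

W = rook_weights(15)
bias_none = mean_pct_bias(0.0, W)     # no spatial autocorrelation
bias_strong = mean_pct_bias(0.8, W)   # strong spatial autocorrelation
```

With `rho = 0` the filtered and raw outcomes coincide, so the bias vanishes; with `rho = 0.8` the average percentage bias in this sketch is negative, in line with the direction of bias reported below.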
Monte Carlo simulation results are summarized in Figure 2(a) and (b). It is important to notice that the q-statistic in GDM tends to underestimate variable importance, and that the extent of this downward bias is positively correlated with the strength of spatial autocorrelation. For instance, when the spatial autocorrelation parameter $\rho$ equals 0.55 (a medium level of spatial autocorrelation), the q-statistic in GDM underestimates variable importance by approximately 3% for SEM, 10% for SLM, and 20% for SDM. These biases quickly increase to 30% for SEM, 40% for SLM, and 50% for SDM in the presence of relatively strong spatial autocorrelation ($\rho$ equals 0.85). Another intriguing point is that an almost perfect power law relationship between the percentage bias and the degree of spatial autocorrelation tends to hold, with a Pearson correlation coefficient consistently exceeding 0.994. When the spatial autocorrelation parameter $\rho$ is replaced by the more commonly used Moran's I statistic, the above findings still hold.
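For readers less familiar with Moran's I, the sketch below shows one common way of computing it from a weights matrix. The small 4 by 4 grid, the binary rook weights, and the two test patterns are assumptions chosen for illustration; the study's own computation may differ in its choice of weights.

```python
import numpy as np

def morans_i(values, W):
    """Moran's I = (n / S0) * (z' W z) / (z' z), with z the centred values."""
    z = np.asarray(values, dtype=float)
    z = z - z.mean()
    return (len(z) / W.sum()) * (z @ W @ z) / (z @ z)

# Binary rook-contiguity weights on a 4 x 4 grid (hypothetical geography).
side = 4
n = side * side
W = np.zeros((n, n))
for r in range(side):
    for c in range(side):
        i = r * side + c
        if c + 1 < side:
            W[i, i + 1] = W[i + 1, i] = 1.0
        if r + 1 < side:
            W[i, i + side] = W[i + side, i] = 1.0

# Perfectly alternating pattern versus two homogeneous halves.
checkerboard = np.array(
    [(r + c) % 2 for r in range(side) for c in range(side)], dtype=float
)
halves = np.array(
    [0.0 if c < side // 2 else 1.0 for r in range(side) for c in range(side)]
)
i_checker = morans_i(checkerboard, W)
i_halves = morans_i(halves, W)
```

The alternating pattern gives the most negative value attainable on this grid (-1), while the clustered pattern gives a clearly positive value (2/3).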
This empirical power law functional form could be used to adjust the q-statistic if the GDM were chosen for identifying variable contributions. However, this power law relationship should not be over-interpreted, as it might depend on geographic topology and on the mechanisms that generate spatial autocorrelation (see Note 3). On the other hand, the variable importance calculated from the SLM model estimates shows a slight positive bias, and the percentage bias curve is almost indistinguishable from the zero line, with a range of [-0.5%, 0.2%]. This was expected, as the true data generating process followed an SLM model. The implication, however, is that spatial econometrics models together with an adapted Shapley value method (discussed below) could serve as a useful alternative in spatial reasoning.
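A power law can be fitted by linear regression on the log-log scale, and the fitted curve can then be used to rescale a q-statistic. The sketch below illustrates the mechanics only: the bias magnitudes are synthetic and noise-free, generated from assumed constants `a_true` and `b_true`, and are not results from this study. The adjustment function `adjust_q` is a hypothetical helper that assumes the fitted quantity is the percentage shortfall of q relative to the true importance.

```python
import numpy as np

rho = np.linspace(0.1, 0.9, 9)

# Synthetic, noise-free bias magnitudes in percent (NOT results from the study).
a_true, b_true = 60.0, 2.5
bias_pct = a_true * rho ** b_true

# Fit log(bias) = log(a) + b * log(rho) by least squares.
slope, intercept = np.polyfit(np.log(rho), np.log(bias_pct), 1)
b_hat, a_hat = slope, np.exp(intercept)
r = np.corrcoef(np.log(rho), np.log(bias_pct))[0, 1]   # Pearson r on log-log scale

def adjust_q(q, rho_value):
    """Rescale q, assuming q = true * (1 - fitted_bias_pct / 100)."""
    return q / (1.0 - a_hat * rho_value ** b_hat / 100.0)

q_adjusted = adjust_q(0.40, 0.55)
```

Because the fitted constants depend on the geography and on the data-generating mechanism, any such adjustment should be treated as indicative rather than exact.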

Adapting the game theory-based Shapley value method in spatial econometrics models
As demonstrated above, spatial econometrics models offer, compared to the GDM, greater accuracy for calculating variable importance with data that exhibit moderate to strong spatial dependency. In this section, a mathematically sound variable importance decomposition method, the game theory-based Shapley value method (Shapley 1953, Shorrocks 2013), is introduced into spatial econometrics models to offer flexible and intuitive interpretations of variable importance. In essence, the Shapley value method conceptualizes variables as 'players' in a collaborative game in which the objective is to maximize 'scores' with respect to whether or not each player enters the game. Specifically, in a collaborative game with $N_p$ players, $p_i$ denotes the $i$-th player (the $i$-th variable in a regression context). When $p_i$ participates in the game, the marginal contribution of $p_i$ is defined as follows (Shorrocks 2013):

$$M\left(p_i, P_{-i,j}\right) = g\left(P_{-i,j} \cup \{p_i\}\right) - g\left(P_{-i,j}\right)$$

where $g(\cdot)$ is a score function or gain function and $P_{-i,j}$ is the $j$-th player-combination without $p_i$. For example, in a game with 3 players, all player-combinations are:

$$\left\{\emptyset, \{p_1\}, \{p_2\}, \{p_3\}, \{p_1, p_2\}, \{p_1, p_3\}, \{p_2, p_3\}, \{p_1, p_2, p_3\}\right\}$$

It follows that player-combinations without $p_3$ take the form:

$$\left\{\emptyset, \{p_1\}, \{p_2\}, \{p_1, p_2\}\right\}$$

Naturally, when no players are involved in the game, the score $g(\emptyset)$ is equal to 0. Next, the Shapley value method calculates the 'expected value' of a player's contribution to the game from the perspective of probabilities:

$$\phi_i = \sum_{j} \frac{\left|P_{-i,j}\right|!\,\left(N_p - \left|P_{-i,j}\right| - 1\right)!}{N_p!}\; M\left(p_i, P_{-i,j}\right)$$

where $\left|P_{-i,j}\right|$ denotes the number of players in the $j$-th player-combination. For linear regression models, the R-squared value serves as the gain function, and as such, the marginal contribution of player $p_i$ in the SLM is expressed as follows:

$$M\left(p_i, P_{-i,j}\right) = R^2\left(P_{-i,j} \cup \{p_i\}\right) - R^2\left(P_{-i,j}\right)$$

where $R^2(\cdot)$ indicates R-squared values calculated based on parameter estimates from the SLM. We note that the Shapley value is a general method for the quantification of variable importance, and is independent of model estimation methods (Shorrocks 2013). All it requires is a proper model fit statistic that can measure scores or gains from combinations of independent variables after model estimation. The Shapley value method possesses several favorable attributes concerning the assessment of variable importance (Nandlall and Millard 2020). The first is non-discrimination, meaning that a variable's importance remains unaffected by the order in which variables enter the model. Secondly, the Shapley value method measures the marginal contribution of each individual variable: variables and interaction terms that contribute more are assigned higher Shapley values. Finally, the sum of the Shapley values of all variables is equal to the R-squared obtained when all variables participate, and the contribution share of each variable can be derived as

$$\text{Share}_i = \frac{\phi_i}{\sum_{k=1}^{N_p}\phi_k} \times 100\% = \frac{\phi_i}{R^2} \times 100\%$$
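The decomposition described above can be sketched in a few lines. To keep the example self-contained, the gain function below is the R-squared of an ordinary least squares fit rather than of a spatial lag model, so it illustrates the Shapley mechanics only; the simulated data, the function names, and the treatment of each set of dummy columns as one 'player' are assumptions made here.

```python
import numpy as np
from itertools import combinations
from math import factorial

def r_squared(y, blocks):
    """Gain function: OLS R^2 for an intercept plus the given column blocks."""
    if not blocks:
        return 0.0                                   # g(empty set) = 0
    X = np.column_stack([np.ones(len(y))] + list(blocks))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def shapley_r2(y, players):
    """players: dict mapping a name to a 2-D array of columns (one 'player')."""
    names = list(players)
    n_p = len(names)
    values = {}
    for name in names:
        others = [m for m in names if m != name]
        phi = 0.0
        for size in range(n_p):
            weight = factorial(size) * factorial(n_p - size - 1) / factorial(n_p)
            for subset in combinations(others, size):
                without = [players[m] for m in subset]
                gain = r_squared(y, without + [players[name]]) - r_squared(y, without)
                phi += weight * gain
        values[name] = phi
    return values

# Hypothetical data: two three-category factors and their interaction.
rng = np.random.default_rng(2)
n = 500
x = rng.integers(0, 3, size=n)
z = rng.integers(0, 3, size=n)
y = 1.0 * x + 0.5 * z + 0.5 * x * z + rng.normal(size=n)
dx = np.column_stack([(x == k).astype(float) for k in (1, 2)])
dz = np.column_stack([(z == k).astype(float) for k in (1, 2)])
dxz = np.column_stack([dx[:, j] * dz[:, k] for j in range(2) for k in range(2)])

players = {"x": dx, "z": dz, "x:z": dxz}
phi = shapley_r2(y, players)
r2_full = r_squared(y, list(players.values()))
shares = {name: value / r2_full for name, value in phi.items()}
```

The Shapley values in `phi` add up to `r2_full`, so the entries of `shares` sum to one. Replacing `r_squared` with a fit statistic computed from a spatial econometrics model would, in principle, give the spatial version of the decomposition.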

A Case study of land desertification in Africa
Desertification is a process of land degradation which occurs under the combined actions of natural and human factors, primarily in arid, semi-arid, and dry sub-humid areas. It significantly affects both the quality of local ecosystems and human life (Reynolds et al. 2007). The Sahel region represents a classic example: the continuing desertification of land there has been attributed to various climatic elements such as drought and strong winds, as well as to anthropogenic activities including deforestation and overgrazing. Despite the introduction of measures such as the Great Green Wall of Africa by Sahel nations to combat desertification and reverse its effects by 2030, progress is inadequate, largely due to ongoing deforestation and unsustainable grazing practices by local residents (Zucca et al. 2022). Under the influence of global climate change, the Sahel region has experienced rising precipitation over the past 30 years, which is a crucial opportunity, all else being equal, for the region to regreen (Brandt et al. 2020). While consensus has been reached that desertification is affected by both climatic and human activity factors, their relative contributions remain poorly quantified. In this case study, we aim to narrow this gap by applying the developed method to assess the impacts of climatic factors, human activities, and their interplay on desertification in the Great Green Wall region of Africa.
Table 1 provides an overview of the variables and data sources used in this investigation. Our outcome variable is a desertification index calculated for each grid cell (with a resolution of 0.5° × 0.5°) for the year 2020 in the study area. The index is extracted by leveraging the albedo-Modified Soil-Adjusted Vegetation Index (MSAVI) feature space (Wu et al. 2019) through the Google Earth Engine. Population density and nighttime light intensity are included as proxy measures of human activities (Levin et al. 2020, Zucca et al. 2022). For climatic factors, we extract total precipitation and soil moisture variables from the European Centre for Medium-Range Weather Forecasts (ECMWF) Reanalysis V5 data (ERA5). All data have been resampled to a resolution of 0.5° × 0.5° under the WGS-84 coordinate system, and the independent variables are discretized into three categories using the natural breaks method. The final dataset is visualized in Figure 3.
The high value of Moran's I (0.91 with a p-value <0.001) indicates the existence of strong spatial autocorrelation in the desertification index. This suggests that GDM might lead to biased estimates of variable importance. To ensure an appropriate form of spatial econometrics model, the commonly used Lagrange Multiplier test (LM-test) was employed for model selection (Breusch and Pagan 1980). Significant LM-lag and LM-error values indicated that the spatial Durbin model (SDM) was a suitable choice (Table 2). To avoid the risk of model overfitting, the Akaike information criterion (AIC) is used to select a parsimonious model specification that yields a good balance between model fit and model complexity (Akaike 1974). As shown in Table 3, Models 1 and 2 give similar AIC values that are significantly lower than those from other model specifications. As the $R^2$ value of Model 1 is higher than that of Model 2, we choose Model 1 as our preferred model for discussion. Table 4 presents estimation results on the relative importance and contribution share of variables from the preferred model, the SDM with soil moisture and nighttime light intensity as independent variables. To facilitate comparison, estimation results from the OLS and SLM models are also reported in Table 4. Overall, 73.7% of the variability in desertification is accounted for by the model, highlighting the substantial role of soil moisture and human activity in driving desertification in the study area. Turning to the contribution shares of each variable, it is unsurprising to find that soil moisture alone contributes the most to desertification, accounting for over 50% of the model's explanatory power. Human activity, measured by nighttime light intensity, also exhibits considerable influence on desertification, which is in accordance with the conclusions of previous studies on desertification (e.g. Wang et al. 2006, Jahelnabi et al. 2016).
As the Shapley value method treats each interaction term as a distinct variable that operates independently from the main effects, an interaction term possessing a positive marginal contribution indicates an enhancement effect, whereas a negative marginal contribution suggests a trade-off effect. As shown in Table 4, the interaction effect between soil moisture and human activity accounts for approximately 25% of the model's explanatory power, which is even slightly larger than the main effect of human activity on desertification. This suggests a significant anthropogenic enhancement effect on desertification, emphasizing the imperative to incorporate human activity intensity as an integral component when developing policies that aim to preserve land sustainability. To assess whether our empirical results are sensitive to different discretization methods, we once again cut the independent variables into three groups, this time using a quantile discretization method, representing the upper, middle, and lower thirds of their respective distributions. Encouragingly, the results exhibit a robust concordance between the two discretization methods (Table 4).
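The quantile discretization used in this robustness check can be sketched as below. The simulated values stand in for a continuous covariate such as soil moisture and are purely hypothetical; the helper name `quantile_discretize` is introduced here for illustration. The natural breaks method used in the main analysis would require a different cut-point search and is not shown.

```python
import numpy as np

def quantile_discretize(values, n_groups=3):
    """Assign each value to one of n_groups equal-frequency categories."""
    values = np.asarray(values, dtype=float)
    probs = np.linspace(0.0, 1.0, n_groups + 1)[1:-1]   # interior quantiles
    cuts = np.quantile(values, probs)
    # right=True puts values equal to a cut point into the lower category
    return np.digitize(values, cuts, right=True)

rng = np.random.default_rng(3)
soil_moisture = rng.gamma(shape=2.0, scale=1.0, size=900)   # hypothetical values
groups = quantile_discretize(soil_moisture)    # 0 = lower, 1 = middle, 2 = upper third
counts = np.bincount(groups, minlength=3)
```

The resulting labels can be passed directly as stratum identifiers to a q-statistic or dummy-variable regression routine.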

Conclusion
This study has established an explicit connection between the q-statistic in GDM and the R-squared in a linear regression model. By proving this equivalence, state-of-the-art spatial econometrics models can be specified to deal with bias introduced by spatial autocorrelation in GDM, whilst retaining the logic inherent in the definition and measurement of variable contributions. The research combined spatial econometrics models with a theoretically and mathematically sound variable importance decomposition method, the game theory-based Shapley value method, so that informative interpretations of the main and interaction effects exerted by two or more independent variables on an outcome variable could be ascertained. Through Monte Carlo simulation experiments, the study demonstrated that GDM tended to underestimate variable importance, with the degree of downward bias being positively correlated with the strength of the spatial autocorrelation. In addition, an almost perfect power law relationship between the percentage bias and the degree of spatial autocorrelation tended to hold, indicating rapidly increasing bias in response to increasing levels of spatial autocorrelation. In contrast, variable importance calculated based on spatial econometrics model estimates presented minimal positive bias. This highlights the benefits of bringing together spatial econometrics models and the Shapley value method when it comes to spatial reasoning. By applying this study's proposed methodology to a case study of land desertification in Africa, it was found that human activity tended to affect land desertification both directly (as indicated by a statistically significant main effect), and indirectly through enhancing the effects of climatic factors such as soil moisture. These effects appeared to be underestimated or indistinguishable in the classic GDM.

Despite this study's advances, some limitations remain. First, the present study focuses on cross-sectional spatial econometrics models, thereby leaving more advanced spatio-temporal econometrics models untested. Accordingly, this study's key findings should be interpreted in a cross-sectional setting, and their extension to spatio-temporal models is left for future work. Secondly, the causal identification capabilities of GDM were not extended explicitly. This is primarily because causal identification relies more on research design than on specific models.

Table note: OLS refers to the ordinary least squares model, and the R-squared (0.627) from the OLS model is equivalent to the q-statistic from the GDM.

Notes
1. We underscore that such associations are not supposed to be interpreted as causality without further examination. Establishing causality requires a robust research design, such as controlled laboratory experiments or quasi-natural experiments, which leverages strictly exogenous variations in an independent variable and links these variations to variability in an outcome variable under investigation (Angrist and Pischke 2010).
2. It is useful to note that the R-squared of a linear regression model can be calculated in various ways, which differ in how many of the desirable properties of a good model fit statistic they possess (Kvalseth 1985, Freedman 2009). The interpretation of R-squared differs under a linear regression model and a generalized linear regression model, and so does its calculation. Detailed treatment of the R-squared is presented in McCullagh and Nelder (1989).
3. This study also carried out the simulation experiment on a regular grid topology with 50 by 50 cells. In addition, different forms of spatial weights matrix (adjacency- and distance-based rules) were also tried. Most of the results showed a power law relationship between the percentage bias from the q-statistic and the degree of spatial autocorrelation, but with slightly different degrees of model fit, ranging from 0.91 to 0.998.

Appendix
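This appendix provides the details of the derivation of the result used in the main text, namely that the conditional expectations of outcomes for samples belonging to the same group or stratum equal the group mean in a dummy variable OLS regression model. Without loss of generality, and assuming that the variable $X$ has or can be discretized into 4 categories, the corresponding dummy variables, $\ddot{X}$, are arranged in the order of categories (with the first column for the intercept and the first category as the reference):

$$\ddot{X} = \begin{pmatrix} \mathbf{1}_{n_1} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\ \mathbf{1}_{n_2} & \mathbf{1}_{n_2} & \mathbf{0} & \mathbf{0} \\ \mathbf{1}_{n_3} & \mathbf{0} & \mathbf{1}_{n_3} & \mathbf{0} \\ \mathbf{1}_{n_4} & \mathbf{0} & \mathbf{0} & \mathbf{1}_{n_4} \end{pmatrix} \qquad (A1)$$

where $n_h$ is the number of samples in category $h$ and $\mathbf{1}_{n_h}$ is a column of $n_h$ ones. The least squares estimate of $\beta$ is

$$\hat{\beta} = \left(\ddot{X}^{\top}\ddot{X}\right)^{-1}\ddot{X}^{\top}y = \begin{pmatrix} \bar{y}_1 \\ \bar{y}_2 - \bar{y}_1 \\ \bar{y}_3 - \bar{y}_1 \\ \bar{y}_4 - \bar{y}_1 \end{pmatrix}, \qquad \bar{y}_h = \frac{1}{n_h}\sum_{i=1}^{n_h} y_{hi} \qquad (A2)$$

where $y_{hi}$ is the outcome of the $i$-th sample in category $h$. Then, the conditional expectation of $y_i$ for a sample in category $h$ is

$$\hat{y}_{hi} = \hat{\beta}_1 + \hat{\beta}_h = \bar{y}_1 + \left(\bar{y}_h - \bar{y}_1\right) = \bar{y}_h, \quad h = 2, 3, 4, \qquad \hat{y}_{1i} = \hat{\beta}_1 = \bar{y}_1 \qquad (A3)$$

which is the group mean.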
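As an intermediate step that may help in following the derivation, the normal equations for the four-category case can be written out explicitly. The notation below ($S_h$ for the stratum sums, $\hat{\beta}_1$ for the intercept, and $\hat{\beta}_h$ for the coefficient of the $h$-th dummy, with the first category as the reference) is introduced here for illustration and is an assumption about how the original derivation was laid out.

```latex
% Normal equations  X'X \hat{\beta} = X'y  for an intercept plus three dummies
\ddot{X}^{\top}\ddot{X} =
\begin{pmatrix}
n   & n_2 & n_3 & n_4 \\
n_2 & n_2 & 0   & 0   \\
n_3 & 0   & n_3 & 0   \\
n_4 & 0   & 0   & n_4
\end{pmatrix},
\qquad
\ddot{X}^{\top}y =
\begin{pmatrix}
S_1 + S_2 + S_3 + S_4 \\ S_2 \\ S_3 \\ S_4
\end{pmatrix},
\qquad
S_h = \sum_{i=1}^{n_h} y_{hi}, \quad n = \sum_{h=1}^{4} n_h .

% Rows 2 to 4:  n_h (\hat{\beta}_1 + \hat{\beta}_h) = S_h
\hat{\beta}_1 + \hat{\beta}_h = \frac{S_h}{n_h} = \bar{y}_h, \qquad h = 2, 3, 4 .

% Row 1, after substituting \hat{\beta}_h = \bar{y}_h - \hat{\beta}_1 :
n\hat{\beta}_1 + \sum_{h=2}^{4} n_h\left(\bar{y}_h - \hat{\beta}_1\right)
  = \sum_{h=1}^{4} S_h
\;\Longrightarrow\;
n_1 \hat{\beta}_1 = S_1
\;\Longrightarrow\;
\hat{\beta}_1 = \bar{y}_1 .
```

The same argument carries over to any number of categories $L$, since each dummy column is orthogonal to every other dummy column.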

Figure 1. Topology of counties in North Carolina.

Figure 2. (a) The empirical relationship between the strength of spatial autocorrelation ($\rho$) and the percentage bias of the q-statistic in GDM under three spatial econometrics models (true data generating processes). To avoid overlapping error bars, the curve of SEM is shifted 0.01 units to the left and the curve of SDM is shifted 0.01 units to the right; (b) The empirical relationship between Moran's I and the percentage bias of the q-statistic in GDM under three spatial econometrics models. The same value of $\rho$ corresponds to different values of Moran's I under the three models. As a result, the starting and ending points of the three curves exhibit tiny differences.

Table 1. Statistical summaries of variables used in this study's models.
Note: The mean and variance of variables are calculated before the discretization procedure; NOAA-NGDC refers to the National Oceanic and Atmospheric Administration's National Geophysical Data Center.

Table 2. Estimation results on the LM-test.

Table 3. Estimation results on the Akaike information criterion and variable importance.

Table 4. Decomposition results for variable importance.