Exploration of spatial and temporal characteristics of PM2.5 concentration in Guangzhou, China using wavelet analysis and modified land use regression model

ABSTRACT This article attempts to detail time series characteristics of PM2.5 concentration in Guangzhou (China) from 1 June 2012 to 31 May 2013 based on wavelet analysis tools, and discuss its spatial distribution using geographic information system software and a modified land use regression model. In this modified model, an important variable (land use data) is substituted for impervious surface area, which can be obtained conveniently from remote sensing imagery through the linear spectral mixture analysis method. Impervious surface has higher precision than land use data because of its sub-pixel level. Seasonal concentration pattern and day-by-day change feature of PM2.5 in Guangzhou with a micro-perspective are discussed and understood. Results include: (1) the highest concentration of PM2.5 occurs in October and the lowest in July, respectively; (2) average concentration of PM2.5 in winter is higher than in other seasons; and (3) there are two high concentration zones in winter and one zone in spring.


Introduction
The air quality has always been a big concern and attracts tremendous attention in China's mega-cities, especially the more international ones, such as Beijing (Zhao et al. 2009), Shanghai (Huang et al. 2013), Hong Kong (Shi et al. 2012), and Guangzhou (Garland et al. 2008). PM2.5, one class of PM, is the designation for particulars that are less than 2.5 µm in aerodynamic diameter, making it small enough to be inhaled. In recent decades, the focus has shifted more toward the effects of long-term exposure to PM2.5 at moderate levels typically seen in inhabited regions (Samet et al. 2000;Leiva et al. 2013). In addition, some studies show that PM2.5 in the city has a trend to exacerbate weather conditions; PM2.5 pollution in the Pearl River Delta (PRD) region is closely related with the continuously frequent occurrence of haze.
Due to the temporal and spatial independence of the pollutants, the characteristics of PM2.5 resolved in one region cannot be replicated to another region (Shi et al. 2012). Some studies in PRD show that the daily concentration of PM is higher in winter than in summer. Some studies also have found that the concentration of PM2.5 is higher in winter than that in other periods both in China (Zhao et al. 2009), middle eastern (Elbayoumi et al. 2013), USA (Lee et al. 2011) and Europe (Marcazzan et al. 2001). Other studies have confirmed the truth of significant differences both in temporal and spatial distribution whether in China (Li et al. 2008(Li et al. , 2017a or in other countries, such as Philadelphia (Burton, Suh, and Koutrakis 1996), Europe ). On the other hand, the spatial differences between cities are also analyzed (Yu 2010). It has been observed that PM2.5 concentrations vary to differing degrees in the urban areas examined in the United States and considerable spatial variation is found in other regions, especially in the West (Pinto, Lefohn, and Shadwick 2004). Although there are some existing studies involving PM2.5 in Guangzhou or PRD, they focus mainly on the chemical composition (Wang et al. 2006;Tao et al. 2009). Recently, some important studies on PM2.5 or on other subject, such as land surface temperature (Wu et al. 2015), using Earth observation satellite data are finished, which expands the research scope of PM2.5 or other subject largely. These important studies mostly refers to fusing (Shen et al. 2016) between satellite data and station data to estimate ground-level PM2.5 (Li et al. 2017b).
Wavelet analysis is a powerful method of time series analysis, especially helpful for long time series change because of its ability to provide localized information on a time series in both the time and the frequency domains (Vadrevu and Choi 2011). Wavelet analysis generates an estimate of the local frequency content of a signal by representing the data using a family of wavelet functions that vary in scale and position (Chen et al. 2013). Wavelet analysis has been successfully applied in the time series analysis for climate (Kumar and Foufoula 1997), micrometeorology (Sang 2013) and atmospheric environment (Strunin and Hiyama 2005). In this study, Daubechies (DB) wavelets, an important function of wavelet analysis, are used to obtain the representation of time series of PM2.5 concentrations due to its compact support and orthogonality.
Land use regression (LUR) model is a robust technique used to predict concentrations of air pollutants at a given site based on surrounding physical, land use, and traffic characteristics (Jerrett et al. 2005). LUR model usually combines air pollutants concentrations (at a number of sites) and stochastic models using predictor variables obtained through geographic information system (GIS) (Jerrett et al. 2005). The model has been widely applied to simulate the spatial distribution (Hochadel et al. 2006) of PM2.5, NO x (Wang et al. 2012 and Volatile Organic Compounds (VOCs) ). Significant predictor variables in LUR model include: traffic, population, land use and physical geography .
This paper is aimed to explore temporal and spatial variations of PM2.5 levels in Guangzhou. The objectives include: (1) detailing the time series changes of PM2.5 using wavelet analysis; (2) seeking the characteristics of PM2.5 with a seasonal view; and (3) discussing the spatial distribution of PM2.5 using GIS tools and LUR model.

Impervious surface area
Impervious surface area is used to rebuild the LUR model. Impervious surface area can be obtained using linear spectral mixture analysis method (LSMA). Impervious surface area is the artificial construction that prevents water entering into the soil (Weng and Lu 2008). The roof, square and the land for transportation are the main components of the impervious surface area. In this study, impervious surface area is extracted based on Landsat ETM+ image (11 November 2010) (Figure 2). At the same time, four buffers of each monitoring site in different radiuses (1, 2, 3 and 4 km) are built and the areas of impervious surface in each buffer are calculated.

Vector traffic network and population density
Road network information is important data for the LUR model, which is provided by Guangzhou Urban Planning Department with vector layout (can be handled by GIS software). Attributes of the roads include road name, length, district, code number and road classification. The vector traffic network is used to calculate the total length of all roads within several buffers (1, 2, 3 and 4 km). The length of all roads within each buffer of 13 monitoring sites is calculated and summed using GIS tools. The population densities data are obtained from the Guangzhou Statistic Yearbook (2013) and are applied as the population variable of the LUR model.

Wind index
Wind index is another important variable of LUR model. According to the meteorological statistics of 1995-2005, Guangzhou is a monsoonal climate city with two prevailing winds. The prevailing wind is demonstrated in Figure 3. Northerly (0°) and southeasterly (135°) are two prevailing wind directions. To account for the location of monitoring site relative to the wind direction and the nearest traffic source, the Euclidean direction from the nearest source (0-360°) is derived. The wind index can be calculated using following Equations (1) and (2).
where θ is the Euclidean direction from the nearest traffic source to the monitoring site in degrees east of north. The value of wind index ranges from 0 to 1.0 means the monitoring site is on the upwind of its nearest road and 1 means on the downwind.

Wavelet analysis and Daubechies (DB) wavelet
Wavelet analysis is becoming a common tool for analyzing localized variations of power within a time series. Through this approach, one can track how the different scales related to the periodic components of the signal change overtime (Torrence and Compo 1997). There are various types of wavelet transforms and they are generally grouped as continuous or discrete (Torrence and Compo 1997). Based on the discrete wavelet transform (DWT), the signal is decomposed into a set of approximations and details, and the approximation is the low-frequency component of the signal and the detail is the high-frequency components of the signal (Lee and Tarng 2000). Observed PM2.5 concentration series in nature are usually discrete signals; DWT is often conducted as: where a 0 and b 0 are constants, integer j is decomposition level, and k is time translation factor. In order to understand the change rule of signal, the signal will be decomposed and then be reconstructed using DWT at a certain level. DB wavelet is one of the important wavelets because of the compact support and the perfect performance of time-frequency analysis; moreover, it can  describe the details of the signal conveniently and accurately so that it is successfully applied for signal processing (Ismail and Asfour 1999). Hence, DB wavelet is applied in this research to find the law of daily PM2.5 concentration.

LSMA and impervious surface area
LSMA is utilized to obtain impervious surface area in this paper, which is put into the LUR model as an alternative variable for land use data. The major assumption of the LSMA is that the spectral signature of a given pixel is proportion-weighted and linear combination of the end-member spectra (Roberts, Smith, and Adams 1993). An end-member is a "pure" land cover type or surface material that is assumed to have a unique spectral signature. Mathematically, the formula of the LSMA is: where R i is the reflectance value of band i; f k is the fraction of end-member k; R ik is end-member k's reflectance in band i; ER i is the un-modeled residual.
Associated with determining R i , P n k¼1 f k ¼ 1 and f k ≥ 0 are required. Wu and Murray (2003) built a model to extract the impervious surface and this idea is accepted widely. The model is shown as follow: where R imp,b , R low,b and R high,b are reflectivity of impervious surface, low albedo and high albedo of band b. Subsequently, f low and f high are density of low albedo and high albedo, respectively. e b is the value of what remains.

Modified LUR model using impervious surface area
We developed a modified LUR model using impervious surface area, in which the land use data is replaced by impervious surface area. The detail process can be found in Figure 4.
Variables of this modified model are chosen because of these following reasons: the length of roads implies the traffic condition; impervious surface area implies an air pollutant source; population density and meteorological conditions (wind index) can affect the spatial distribution of air pollutant concentration. The potential predictors are selected for model building and multiple linear regression models are created using stepwise regression in Statistical Product and Service Solutions software tool (SPSS) by involving the most significant predictors.
The measured concentrations are divided into two parts in random selection: data from nine monitoring sites are used to establish the multiple linear regression equations while the other four are selected as validating sites. We use the hold-out-validation for the model evaluation; the evaluation approach applies the model developed for modeling sites to estimate concentrations at the validating sites. Finally, we use the leave-one-out cross-validation (LOOCV) to evaluate the LUR modes once more. LOOCV, which successively leaves out one site from the training dataset and estimates models based on the remaining N-1 sites, is also used for LUR model evaluation in order to test certainty of multiple linear regression models in this research. LOOCV is simply K-fold cross-validation with K set to n, the number of instance in the dataset. This means that the test set only consists of a single instance, which will be classified either correctly or incorrectly.

Descriptive statistical analysis
Descriptive statistical analysis is fundamental and studies aiming at describing the patterns and general trends in a dataset are necessary. The indicators used in this paper are: central tendency (mean, median and mode), dispersion (maximum, minimum, range, coefficient of variation, standard deviation, inferior quartile, superior quartile) and distribution (kurtosis, skewness). All details of those indicators are shown in Table 1  However, the value of spring (0.37) is positive, thus implying that measured records are more decentralized than normal distribution. At the same time, the coefficient of skewness is close to zero, namely, the data tends toward symmetrical distribution and demonstrates no significant seasonal differences in regard to the skewness of the data.

Temporal variation (daily average concentration of PM2.5)
For understanding the variation of PM2.5 concentration during the period of June 2012 to May 2013, DB wavelet of order 4 (DB4) is applied to decompose the time series of PM2.5 daily concentration into three levels. The original signal is decomposed into a hierarchy set of orthogonal approximation and detail functions. Finally, diurnal variation of PM2.5 average concentration is clearly displayed ( Figure 6). As shown in Figure 6 fall (59.59 μg/m 3 ) > spring (52.63 μg/m 3 ) > summer (37.65 μg/m 3 ). It is nearly twofold higher in winter than in summer. In comparison to the low temperatures and more stable atmospheric stratification on spring mornings, the factors of high temperatures, unstable atmospheric and strong air convections in summer contribute to diffusion and migration of PM2.5. Furthermore, abundant rainfall in summer can purify the air by removing PM2.5 in the way of wet deposition.
To demonstrate the seasonal diversification, PM2.5 concentrations for each season were further studied with the finding that the change curve of summer is decidedly different from the others (Figure 7). It is apparent that most of the daily concentration is lower than 35 μg/m 3 (in green) in summer. The highest and lowest concentrations of summer are 67.92 and 6.92 μg/ m 3 , respectively. Secondly, in addition to two peaks, it is quite smooth with a similar serrated trend in the range of 35-75 μg/m 3 (the yellow part) for spring. Lastly, the fluctuation in fall and winter is worth noting because many of the concentrations are so high (>75 μg/m 3 , the orange) that will make immediate impact on air quality.

Mutation characteristics of PM2.5 concentration
Identifying the mutation characteristics of PM2.5 concentration is significant to protecting environment. DB1 wavelet is considered because of its good consistency. Firstly, the time series of PM2.5 concentration are decomposed at the scale of three. Secondly, after the reconstructions of the first and the second high-frequency coefficients, signal charts are shown in Figure 8. There are seven mutated points, which are located on the 264, 289, 331, 356, 360, 433, 434 days, respectively (20 September 2012: 94.15

Spatial distribution of PM2.5 levels
To evaluate the effect of variables on different seasons, the LUR model is applied separately, instead of an overall LUR model in this study. The most frequent wind in winter is the northerly and in summer, the southeasterly wind.
In order to select potential predictor variables, the correlations between predictor variables and dependent variable are calculated using unit linear regression function in SPSS software ( Table 2). The correlation with southeasterly wind is stronger than that with northerly, so the southeasterly wind may be regarded as the most frequent wind for spring and fall. In winter, the strongest correlation with PM2.5 concentration is the total length of road within the buffer of 2 km (R 2 = 0.60). In spring, the strongest correlation variable is the total length of road within the buffer of 1 km (R 2 = 0.37). Traffic emissions play an important role on PM2.5 concentration. Differing from winter and spring, PM2.5 concentrations of summer and fall have weak correlation with all variables. It may suggest that other factors are likely to influence PM2.5 levels for the two seasons. As a result, the LUR model is able to explain the spatial variation of PM2.5 concentration in each area of Guangzhou, both in winter and spring.
It is necessary to standardize those variables. The formula for standardization is introduced in Equation (6).
Several multiple linear regression equations for winter and spring are established, respectively, using stepwise regression for selecting the best one (Equations (7)-(11) and Table 3). The test indicates that Equation (7) (R 2 = 0.75) is better and could be used to obtain insight into the distribution of PM2.5 concentration for winter; Equation (11) (R 2 = 0.85) could be used for spring. The error rate of 17.91% for Equation (10) is higher than that for Equation (11) with 9.31%. It suggests that wind direction plays an important role in affecting the PM2.5 concentration in spring. Therefore, Equation (11) will be the best to simulate the spatial distribution.
Due to the small number of monitoring sites, we test a grid of 5 km × 5 km instead of higher spatial resolution grid by the mean of ArcGIS. PM2.5 concentrations are calculated at each intersection (the intersect points of rows and columns in 5 km × 5 km grid) using multiple linear regression equations. Then the seasonal spatial concentration distributions of PM2.5 are interpolated by the ordinary Kriging approach. The simulating distributions are described in Figure 9.
In Figure 9(a), spatial characteristics of two distinct high-value centers get much attention while the surrounding area gradually reduced obviously. Those highvalue centers are mainly in the northern and central areas of Guangzhou. A high-value center in the northern part covers the southeast of Huadu, northwest of Baiyun. The central high-value center includes the whole of Yuexiu, north of Liwan, and most of Tianhe and Haizhu. The suburbs of Zengcheng, Conghua and Nansha have better air quality with the low PM2.5 concentration. On the other hand, there is a similar phenomenon happening in spring. One high-value center exists in all of Yuexiu, most of Tianhe, the north of Liwan and south of Baiyun. Conghua, Zengcheng and Nansha are also the region with low PM2.5 concentration. To examine the accuracy of simulation, the results are compared to the measured data. According to the data, monitoring sites of LWXC, LH, GYQ, HD, PY and HZBG are those with high PM2.5 concentration in both of winter and spring. On the contrary, Tianhe is with low PM2.5 concentration. As a whole, the simulation results roughly correspond to the actual values. Without measured data in Zengcheng and Nansha, we cannot determine whether the values of Zengcheng and Nansha are correct or not.

Conclusions
In this paper, temporal and spatial variations of PM2.5 average levels are analyzed in Guangzhou. Wavelet analysis is used to judge the day-to-day changes in PM2.5 concentrations, and this approach is confirmed as a reliable method for investigating temporal variations in air pollutant research. Temporal variations will be well expressed and the mutated signal of PM2.5 level can be identified by wavelet analysis. To map the spatial differentiation of PM2.5 levels, the LUR model achieves satisfactory results. To confirm seasonal differences, seasonal multiple linear regression models are naturally considered. We can determine the characteristics of space in seasons. Deciding upon the dependent variable and the independent variables correlation, we find that a weak relationship between impervious surface area and PM2.5 levels exist, whereas there is a strong correlation with traffic, particularly in winter. The measured concentration of PM2.5 and variables are significant when associated just in winter/spring and not in summer/fall. Perhaps additional factors contribute to the air pollutants in the two seasons. Hence, for Guangzhou, the LUR model is only available in winter and spring. Wind direction has immediate impact on PM2.5 levels in spring. Even other independent variables do not affect the spatial distribution of PM2.5 in the study domain; it does not suggest those variables has no relationship to PM2.5 but rather the small number of monitoring sites used to build models and test. Further research is necessary because of the existence of deficiencies in this study, such as the sparse monitoring sites. If there are more monitoring sites, the accuracy of the simulation could potentially be higher.