
Sparse representation-based correlation analysis of non-stationary spatiotemporal big data

Pages 892-913 | Received 08 Nov 2015, Accepted 21 Feb 2016, Published online: 20 Jun 2016

ABSTRACT

As the basic data of digital city and smart city research, spatiotemporal series data contain rich geographic information. Alongside the accumulation of spatial time-series data, we are also encountering new challenges in analyzing and mining the correlations among the data. Because traditional analysis methods impose suitability restrictions that the new data features do not satisfy, we propose a new analytical framework based on sparse representation to describe time, spatial, and spatial-time correlation. First, before analyzing the correlation, we discuss sparse representation based on the K-singular value decomposition (K-SVD) algorithm to ensure that the sparse coefficients lie in the same sparse domain. We then present new computing methods for the time, spatial, and spatial-time correlation coefficients in the sparse domain and discuss the functions' properties. Finally, we analyze change regulations for the gross domestic product (GDP), population, and Normalized Difference Vegetation Index (NDVI) spatial time-series data in China's Jing-Jin-Ji region to confirm the effectiveness and adaptability of the new methods.

1. Introduction

As the basic data of digital city and smart city research, spatial data describe information on a variety of spatial factors, including spatial location, shape, size, and distribution. Spatial data combined with a time-series component constitute spatiotemporal series data. The analysis, modeling, and forecasting of time-space series data are known as spatiotemporal series analysis (Curry Citation1970). Spatial data currently include geographical data (obtained by various statistical and measurement methods) and remote sensing data (obtained by aerospace methods); in particular, geographical data comprise natural geographic data, economic conditional data, and social development data. Spatiotemporal series data thus contain rich geographic space–time information. Given the development of earth-observation and site-observation technologies, we are faced with increasingly diverse spatial data types. Due to a lack of effective spatial and time-data analysis methods, however, it is difficult to analyze and process these data (Curry Citation1970). It is therefore important to analyze spatiotemporal series data.

As the dataset scale continues to increase at a rapid rate, researchers have searched for new methods of extracting the information implied in spatiotemporal datasets. As a typical representative of big data, spatiotemporal data mining is one new method for obtaining deep and general information for scientific research (Zhang et al. Citation2014; Khan et al. Citation2016; Zhang et al. Citation2015). In addition to the general features of big data, such as volume, variety, velocity, and value (Beyer and Laney Citation2012), spatial big data exhibit new external and internal features that require intensive study. Massiveness, heterogeneity, and multiple sources (as external features) and high dimensionality, multiple scales, and non-stationarity (as internal features) represent the essential differences between spatial big data and other data types (Guo et al. Citation2014). Thus, these new features require a new analytical method.

Spatiotemporal series data analysis is used to research the relationships between large-scale and long time-series geographic factors. A variety of applications use spatiotemporal data, especially by analyzing the relationships between various geographical factors to provide constructive advice for urban planning, environmental protection, and government decision-making. Spatiotemporal series data analysis thus has instructive significance. Most traditional analyses calculate the correlation coefficient and forecast the development trend from the perspective of time through statistical methods in the time domain, however, while ignoring the internal structures (Glaeser, Kolko, and Saiz Citation2001; Kitov Citation2009; Putterman and Weil Citation2010; Pan et al. Citation2013). The results of such data analysis thus represent a macro-analysis rather than specific guidance. Spatiotemporal series data provide a new perspective on analyzing spatial data and obtaining new knowledge and rules; because of the complex interrelationships among various factors, they also raise new challenges for processing and analyzing spatial big data.

1.1. The volume of spatiotemporal series data is increasing rapidly

Alongside the development of earth- and site-observation technologies and statistical measurement methods, spatial data types have become rich and varied, with an ever-increasing volume (Star and Estes Citation1991). For example, with the time resolution spanning from once a month to several times a day, the spatial resolution from kilometers to tenths of meters, and the spectral resolution from several bands to several hundred bands, the volume of remote sensing data is increasing rapidly each year. The rapidly increasing data volumes and multiple data types add to the complexity and difficulty of data processing based on traditional methods. We thus need a new method for reducing data volume while retaining the essential information.

1.2. Spatial data exhibit non-stationary characteristics

Most current signal analysis theories and methods are based on the stationarity hypothesis (Kamarianakis and Prastacos Citation2005; Mennis and Liu Citation2005). Most signals in the real world do not satisfy this hypothesis, however. First, the total signal length may be too short to satisfy it; when the length of the signal is too short relative to the range of non-stationary signal fluctuation, for instance, the signal will present non-stationary characteristics. Second, the processed signal may be essentially non-stationary and thus present non-stationary features. Finally, due to nonlinear factors such as probing and quantization, most signals obtained using scientific instruments, reflecting the law of natural change, are nonlinear and non-stationary. Few methods are suitable for processing such signals; for example, the Fourier transform is based on the hypothesis that the system must be linear and that the data should be periodic and generally stationary. Spatial data, which present spatial object information, also exhibit non-stationary characteristics. In order to accurately analyze spatial data, we need new methods for treating non-stationary features.

1.3. The complex relationships of spatiotemporal data increase analytical difficulty

Tobler's first law of geography points out that all attribute values on a geographic surface are related to one another (Tobler Citation1970); the spatial data neighborhood thus clearly exhibits cross-correlation. Moreover, because spatial series data generally describe different aspects of spatial geographic elements at the same position, most spatial data have different degrees of relevance; for example, population data are relevant to gross domestic product (GDP) data, and there is a relationship between population data and digital elevation data. In addition, because spatial data also present clustered, random, and regular distributions (Star and Estes Citation1991), there are redundancies among the data. In order to analyze and mine the knowledge contained in spatial data, we need to clarify the correlations between the data.

Consequently, we first propose a new analytical framework for non-stationary spatiotemporal big data based on sparse representation. Then, in order to reduce data volume, decrease the internal redundancy of spatial big data, and represent the non-stationary characteristics of spatiotemporal series data, we improve the K-singular value decomposition (K-SVD) algorithm for correlation analysis. Finally, to measure the complex relationships among spatiotemporal series data, we propose a new method for computing the autocorrelation coefficient in the sparse domain to measure relativity, and we use various correlation analysis methods based on sparse representation to analyze the spatial time-series data of China's Beijing, Tianjin, and Hebei provinces (Jing-Jin-Ji, for short).

The remainder of the paper is organized as follows. We discuss related work in Section 2 and define the problem in Section 3. We then present a theoretical analysis in Section 4 to demonstrate the properties of our correlation analysis framework. Section 5 describes the basic processing steps, and the subsequent sections demonstrate the experimental results, provide a discussion, and conclude the paper with directions for future work.

2. Related work

Most spatiotemporal data exhibit spatiotemporal correlation and heterogeneity, which can obscure internal spatiotemporal correlations, thus making the discovery of knowledge about spatial data more difficult. Two main analytical approaches are involved: one based on the original data and one based on transform coefficients.

2.1. Traditional methods based on original data

For the sake of analyzing the relationships among different processes, spatiotemporal analysis based on original data can mainly be divided into data preprocessing, feature selection, modeling, validation, and results interpretation. Data preprocessing methods generally involve resampling (used for data format transformation), regularization (used to eliminate measuring-scale differences), de-noising, interpolation (used to estimate absent values), and missing-values analysis (used for image completion). Feature selection uses correlation analyses to distinguish relevance among different variables, such as histograms, scatter diagrams, time-correlation analyses, and spatial-correlation analyses. The general methods of spatial correlation include Moran's I, Geary's C, and the semi-variogram. Each of these methods has advantages and limitations; for instance, the Pearson time correlation coefficient only represents linear correlation, and histograms represent variation trends directly but cannot provide quantitative analysis.

Modeling methods may be divided into stationary and non-stationary spatiotemporal series modeling. As a typical example of the former, the spatiotemporal autoregressive moving average model (STARMA; Martin and Oeppen Citation1975) creatively uses a space–time delay operator to express the influences of time and space delay. For example, Han et al. (Citation2007) used STARMA to establish a spatiotemporal relationship among traffic intersections that is used for short-term forecasts and analysis of space–time regional traffic flow; Halim et al. (Citation2009) presented a new parameter estimation of the space–time model based on a genetic algorithm, although the parameters are difficult to estimate and the space weight only reflects a linear spatial relationship. The STARMA model still cannot address the spatial non-stationarity problem, however. In addition, while a few artificial intelligence methods such as neural networks, genetic algorithms, support-vector machines, and Bayesian networks have been used to analyze non-stationary signals, convergence problems and the need for complex calculations limit their application.

Validation is used to recognize whether the model is appropriate for data analysis. Three methods are frequently used to analyze the residual error distribution: exploratory spatial data analysis (ESDA), time-series analysis, and the use of spatiotemporal correlation coefficients. ESDA describes the spatial distribution via associated measures and recognized spatial modes; this is common to most spatial analysis methods, such as the spatial and time correlation analysis methods mentioned above.

2.2. Intelligent methods based on transform coefficients

The new features of spatiotemporal series data pose new challenges to acquiring useful information. Several other analytical methods have been proposed for spatiotemporal series data based on transform functions, including the short-time Fourier transform, time–frequency analysis, the wavelet transform, and the Gabor transform, among others. A few methods based on the discrete Fourier transform (DFT) have been used to process spatiotemporal series data for change detection (Primechaev, Frolov, and Simak Citation2007; Salmon et al. Citation2010, Citation2011; Kleynhans et al. Citation2011). Other methods, based on the wavelet transform, have been used to analyze time-series spatial data with extracted features (Celik and Ma Citation2010; Piao et al. Citation2012; Chabira et al. Citation2013); all of these publications focus on one kind of data, however, and they rely on the hypothesis that the signal is stationary (or piecewise-stationary), with poor local representational ability. Another typical transform method is sparse representation. The traditional transform methods belong to the non-adaptive category, whereas adaptive dictionary methods, such as the method of optimal directions (MOD), the aforementioned K-SVD, restricted least squares, and online dictionary learning, generally represent the data sparsely, without fixed analytical forms. Several papers have attempted to use a reference image as auxiliary information based on sparse representation (Wang, Lu, and Liu Citation2015), which overall yields better results than direct methods. Because the authors did not analyze the differences among the various reference images, however, they could not clarify which reference images were better than others.

In order to reduce the dimensionality of spatiotemporal data, many methods have been proposed to extract essential features and low-dimensional structures, including principal component analysis (PCA), independent component analysis (ICA), the Fourier transform, and the wavelet transform. These traditional dimension-reduction methods are not suitable for highly nonlinear distributions, however. Tenenbaum and Roweis first proposed manifold learning in Science in 2000, describing high-dimensional manifold structures via isometric mapping (Isomap) and locally linear embedding. Many new manifold learning methods were proposed following this work, including Laplacian eigenmaps, Hessian-based locally linear embedding, local tangent space alignment, and others. Most of these approaches are only appropriate for local data and small sample data, however, and the results are easily influenced by sample noise. The existing manifold learning methods are thus unsuitable for spatiotemporal data series.

The typical processing algorithm for non-stationary signals is the Hilbert–Huang transform (HHT), which primarily comprises empirical mode decomposition and Hilbert spectrum analysis. HHT is used in many applications and achieves effective results. For example, Lin et al. (Citation2012) analyzed heart sounds based on HHT to extract a series of parameters from the time and frequency domains; Song et al. (Citation2011) used HHT to analyze seismic signals; Jia et al. (Citation2012) explored the diagnosis of gas transport pipelines based on HHT; and Kong, Xu, and Zhou (Citation2010) used HHT for ultrasonic flaw inspection. HHT is suitable for non-stationary signal processing because it is self-adaptive, nonlinear, and not limited by Heisenberg's uncertainty principle. Because solving for the upper and lower envelope curves of the signal is slow and complex, however, it exhibits poor timeliness.

2.3. Correlation representation methods

Correlation analysis methods for representing the similarity between two different series can be divided into several types, based on assorted definitions of similarity. The classical factor for representing correlation is the Pearson correlation coefficient, which assumes that two different signals correlate if they have a linear relationship; it thus only measures and reflects linear relations. The general factor for representing similarity within the signal processing field is mutual information, which reflects the dependency relationship between two different signals: the greater the mutual information value, the more obvious the dependency between them. Another factor is relative entropy (or KL-divergence), which reaches its minimum value of zero when the probability distributions of two different signals are equal. The other significant measurement factor is distance, including Euclidean distance, Mahalanobis distance, and PAP distance: the smaller the distance between two signals, the more similar the signals will be. These factors also have obvious limitations, however; for example, complex calculations must be carried out when computing the mutual information between different signals, and the distance factors only reflect spatial correlation based on Tobler's first law of geography. When we try to describe the difference between two series, we therefore need to analyze which method best represents the similarity between them.
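As a concrete illustration of the linearity restriction noted above, the following sketch (plain Python, with made-up toy series) contrasts the Pearson coefficient, which is the cosine of the angle between the centered vectors, with the raw cosine similarity used later in the sparse domain:

```python
import math

def pearson(p, q):
    # Pearson correlation = cosine of the angle between the *centered* vectors.
    mp, mq = sum(p) / len(p), sum(q) / len(q)
    cp = [a - mp for a in p]
    cq = [b - mq for b in q]
    num = sum(a * b for a, b in zip(cp, cq))
    den = math.sqrt(sum(a * a for a in cp) * sum(b * b for b in cq))
    return num / den

def cosine(p, q):
    # Cosine similarity = angle between the raw, non-centered vectors.
    num = sum(a * b for a, b in zip(p, q))
    den = math.sqrt(sum(a * a for a in p) * sum(b * b for b in q))
    return num / den

p = [1.0, 2.0, 3.0, 4.0]
q = [3.0, 5.0, 7.0, 9.0]   # q = 2p + 1: an exact linear relationship
print(pearson(p, q))       # 1.0: Pearson sees only the linear relation
print(cosine(p, q))        # slightly below 1: the offset changes the angle
```

Pearson is thus blind to the constant offset that the raw angle detects, which is exactly the centralization difference revisited in Section 4.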

3. Problem definition

Although spatiotemporal series data provide a new perspective for analyzing spatial data and obtaining new knowledge and rules, they also raise new challenges for analyzing spatial big data, since the volume of spatial big data increases the complexity and difficulty of data processing. In order to describe these challenges, we first assume that the two different spatiotemporal series follow the models \(\{S_t^p\}\) and \(\{S_t^q\}\), where p and q represent different kinds of spatial data and \(S_t^p\) represents the spatial data p at time t. The time correlation can therefore be presented by the following function:

\[ R^p(t_1, t_2) = E\big[S_{t_1}^p S_{t_2}^p\big] \]

If the spatiotemporal series is stationary, the time correlation function will not change with time; it correlates only with the time interval between the two points and can be simplified as \(R^p(k)\), where \(k = t_2 - t_1\). Most current time correlation analysis methods are based on stationarity hypotheses about the time-series data, however, such as the autoregressive moving average (ARMA) and STARMA models. These methods only reflect differences in the time interval, and they use average values to represent one fixed time correlation. Because spatiotemporal series are non-stationary, however, the traditional time correlation methods alone cannot present the time correlation of spatiotemporal series.
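The averaging behavior under stationarity can be made explicit with a short sketch (plain Python, toy series): the sample autocorrelation at lag k uses one global mean and variance, so a single number summarizes the correlation for every time t.

```python
def lag_autocorr(x, k):
    # Sample autocorrelation at lag k under the stationarity assumption:
    # one mean and one variance are estimated from the whole series, so
    # every time t is collapsed into a single average value per lag.
    n = len(x)
    m = sum(x) / n
    var = sum((v - m) ** 2 for v in x)
    cov = sum((x[t] - m) * (x[t + k] - m) for t in range(n - k))
    return cov / var

x = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # alternating toy series
print(lag_autocorr(x, 0))   # 1.0 by construction
print(lag_autocorr(x, 1))   # -0.875: adjacent values move oppositely
```

If the mean or variance drifts over time, this single estimate misrepresents the local correlation, which is the motivation for the sparse-domain alternative developed below in Section 4.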

Several spatial correlation methods have been proposed to describe the spatial correlation, including Moran's I and Geary's C; most of these methods rely on a spatial weight matrix to represent spatial distance. We may represent the spatial correlation as the following function (here in the Moran's I form), where \(s_i\) are the values of the spatial data, \(\bar{s}\) is their mean, and n is the number of spatial units:

\[ I(h) = \frac{n}{\sum_i \sum_j w_{ij}(h)} \cdot \frac{\sum_i \sum_j w_{ij}(h)\,(s_i - \bar{s})(s_j - \bar{s})}{\sum_i (s_i - \bar{s})^2} \]

The weight is a function of h, where h represents the space delay. The different spatial correlation measures therefore differ in the weight matrix, while the convolution with the values of the spatial data is common to them. Spatiotemporal series data generally describe a large spatial range, however, which makes the convolution expensive to compute when calculating the spatial correlation. We thus need new methods for reducing the computational load. As a result, we propose new time and spatial correlation methods to represent the spatiotemporal series correlation.

The new analysis framework for computing spatial big data correlations is described in Figure 1. In order to reduce the volume of spatiotemporal series data and to self-adaptively represent non-stationary characteristics, thus simplifying the computation, we propose a new analysis framework that consists of spatiotemporal series data preprocessing, sparse representation, and correlation analysis. For the sake of comparing the sparse coefficients, the basic rule of preprocessing and sparse representation is to unify the datasets in the same measurement space. Because the K-SVD method is self-adaptive and sparsely represents big data, we use it to represent spatiotemporal series data. Dictionaries trained by the K-SVD method are redundant and self-adaptive, however; they transform analytical data into different sparse spaces, so that the sparse coefficients cannot be compared with one another. Because of this redundancy, one dictionary may represent one piece of data with different sparse coefficients by following different computing orders. We thus need to improve the present sparse representation methods in order to analyze the correlation between the data (Figure 1).

Figure 1. The data processing framework.


4. Theoretical analysis

4.1. Sparse representation of spatiotemporal series data

We make use of sparse representation to reduce the spatial data volume and to address the non-stationarity problem. Sparse analysis originated in harmonic analysis; as an important branch of mathematical research, harmonic analysis represents functions by a series of basic waveforms. The basic model of sparse representation is generally given by the following formula:

\[ y = D\alpha + \omega \]

Here, \(y\) stands for the high-dimensional data, \(D\) is a dictionary in which each column is called an atom, \(\alpha\) is the sparse coefficient vector, and \(\omega\) represents noise.

The K-SVD algorithm is a typical example of adaptive sparse representation: it computes the sparse coefficients and refreshes the dictionary alternately until the relevant constraints are met. Although sparse representation and K-SVD are established methods that have been widely used in the remote-sensing field for classification, change detection, and target identification, sparse representation is rarely used to analyze spatiotemporal time-series data. Generally, in order to obtain a sparser coefficient, each datum can have its own sparse coefficient so as to be better represented. If we train the dictionaries separately, however, each dictionary is only adaptive to its own data. In order to unify the transformation space, we jointly train the dictionary, with the two spatial datasets acting as the training data for a unified dictionary; for instance, \(y^p \approx D\alpha^p\) and \(y^q \approx D\alpha^q\), where D is the joint dictionary. The sparse coefficients are consequently calculated in the same space with the unified dictionary.
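The alternation described above can be sketched as follows. This is a minimal, generic K-SVD dictionary-update sweep in Python/NumPy, not the authors' implementation; the toy data, dictionary size, and fixed sparse supports are illustrative assumptions. Each atom is replaced by the leading singular pair of the residual restricted to the signals that use it, which cannot increase the reconstruction error:

```python
import numpy as np

def ksvd_dictionary_update(Y, D, X):
    # One K-SVD sweep: for each atom d_j, gather the signals whose sparse
    # codes use it, form the residual with d_j's contribution removed, and
    # replace the atom and its coefficients by the rank-1 SVD of that residual.
    for j in range(D.shape[1]):
        users = np.nonzero(X[j, :])[0]
        if users.size == 0:
            continue
        E = Y[:, users] - D @ X[:, users] + np.outer(D[:, j], X[j, users])
        U, s, Vt = np.linalg.svd(E, full_matrices=False)
        D[:, j] = U[:, 0]            # refreshed unit-norm atom
        X[j, users] = s[0] * Vt[0]   # refreshed coefficients on the support
    return D, X

# Illustrative joint training data: vectorized blocks stacked as columns.
rng = np.random.default_rng(1)
D = rng.normal(size=(8, 12))
D /= np.linalg.norm(D, axis=0)                                 # unit-norm atoms
X = rng.normal(size=(12, 40)) * (rng.random((12, 40)) < 0.25)  # sparse codes
Y = D @ X + 0.01 * rng.normal(size=(8, 40))                    # noisy observations
before = np.linalg.norm(Y - D @ X)
D, X = ksvd_dictionary_update(Y, D, X)
after = np.linalg.norm(Y - D @ X)   # never larger than `before`
```

In a full K-SVD loop this sweep alternates with a sparse-coding step (e.g. OMP) until the sparsity or error constraint is met.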

When computing the sparse coefficients, because the dictionary is redundant, different computing orders may lead to the same data having different sparse coefficients. The calculating order of the sparse coefficients is determined by the atoms and the residuals; the requirement of maximum sparsity alone therefore does not fix the coefficient calculating order. The dictionary computed using the K-SVD algorithm is both redundant and non-orthogonal, so the same data may have different coefficient vectors under different computing orders. We thus need to adjust the computing order to avoid this problem. For example, assume that one datum y can be represented as a linear combination of atoms a, b, c, and d, where these column vectors are linearly dependent; y can then also be represented using only the vectors b, c, and d, and the resulting sparse coefficients differ from one another. This illustrates that y can be represented by different sparse coefficients under different computing orders based on the same redundant dictionary. Two specific redundant dictionaries learned by the K-SVD method are shown in the corresponding figure. Consequently, the final sparse representation is fixed by adjusting the computing order, which determines the final sparse coefficients.

To conveniently analyze the spatial-time series data correlation, we first construct the spatial data model \(S_{t,h}^p\), where S represents spatial data whose mean value equals zero; the superscript p represents the kind of spatial data, t represents the time, and h is the space-adjacent domain. The sparse representation of the two spatial data \(S^p\) and \(S^q\) is as follows:

\[ S^p \approx D\alpha^p, \qquad S^q \approx D\alpha^q \]

Here, \(D\) is the joint dictionary, and \(\alpha^p\) and \(\alpha^q\) represent the respective sparse coefficients of the two spatial data. In order to ensure that the analytical data series are in the same sparse space, we combine two corresponding analytical data as one datum to train a joint dictionary based on the K-SVD method. When computing sparse coefficients, we improve the orthogonal matching pursuit (OMP) method by selecting the maximum product of the residual and the dictionary atoms to adjust the computing orders; this step avoids the influence of dictionary redundancy. (Details on the computing process are discussed in Section 5.) The joint dictionary is represented as follows:

\[ D = [d_1, d_2, \ldots, d_n], \qquad d_i \in \mathbb{R}^m \]

Each \(d_i\) is called an atom, m is the number of elements in each atom, and n is the number of atoms. Therefore, the spatial data can be represented as \(S^p \approx D\alpha^p\) and \(S^q \approx D\alpha^q\).
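A sketch of the adjusted coefficient computation: a plain OMP loop in Python/NumPy in which each step deterministically selects the atom with the largest absolute inner product with the current residual, then re-fits the selected atoms by least squares. The tiny dictionary (standard basis vectors plus one redundant atom) is a made-up example illustrating that a redundant dictionary admits several representations of the same y, so fixing the selection rule fixes the coefficients:

```python
import numpy as np

def omp(D, y, k):
    # Orthogonal matching pursuit with a fixed, deterministic selection rule:
    # always take the atom most correlated with the current residual, then
    # re-fit all selected atoms jointly by least squares.
    residual = y.astype(float).copy()
    support = []
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x

# A redundant dictionary in R^3: the standard basis plus one extra atom.
e1, e2, e3 = np.eye(3)
d4 = (e1 + e2) / np.sqrt(2.0)
D = np.column_stack([e1, e2, e3, d4])

y = 2.0 * e1 + 0.5 * e3
x = omp(D, y, k=2)   # recovers the 2-sparse coefficients [2, 0, 0.5, 0]
# The same y could also be written with atoms d4, e2, e3 (three atoms),
# which is exactly the ambiguity that the fixed computing order removes.
```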

4.2. The time-series autocorrelation analysis

Typically, the time autocorrelation coefficient computing function is based on the hypothesis that the data are stationary (Shumway and Stoffer Citation2010). The k time-delay correlation coefficient can then be reduced to Formula 4:

\[ \rho(k) = \frac{\sum_{t=1}^{n-k} (x_t - \bar{x})(x_{t+k} - \bar{x})}{\sum_{t=1}^{n} (x_t - \bar{x})^2} \qquad (4) \]

Because spatiotemporal series data are not stationary signals, however, and their variances and means change with time, the time-series correlation formula cannot be simplified to Formula 4; it is thus unsuitable for non-stationary data. We need to define a new method to compute autocorrelation coefficients in sparse space.

Traditional transform coefficients may be used to analyze correlation thanks to the orthogonality of the basis functions. Because the sparse dictionary is redundant and non-orthogonal, however, the sparse coefficients are affected by the representation order of the dictionary atoms. In order to avoid this influence, we must adjust the computing order so that the coefficients represent the data as sparsely as possible; we can then use the coefficients directly to measure time correlation. For the sake of measuring time autocorrelation in sparse space, we use the coefficients directly to represent the time autocorrelation based on the above sparse spatial data representation model; the formula is as follows:

\[ \rho(t, k) = \frac{\langle \alpha_t, \alpha_{t+k} \rangle}{\|\alpha_t\|\,\|\alpha_{t+k}\|} \]

Here, \(\alpha_t\) and \(\alpha_{t+k}\) are the sparse coefficients of the spatial data at t and t+k. When k=0, \(\rho(t,k)\) equals the maximum value 1. To analyze the range of the correlation, we treat the sparse coefficients as points in a high-dimensional space, as shown in Figure 2. The figure illustrates that the correlation coefficient equals the cosine of the angle between two vectors. Therefore, the range of the time correlation is \([-1, 1]\).
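The sparse-domain time autocorrelation is thus simply the cosine of the angle between the two coefficient vectors; a minimal sketch (Python/NumPy, with hypothetical coefficient vectors) checks the k = 0 case and the bounds:

```python
import numpy as np

def sparse_time_autocorr(a_t, a_tk):
    # Correlation of two sparse coefficient vectors: the cosine of the
    # angle between them, which always lies in [-1, 1].
    return float(a_t @ a_tk / (np.linalg.norm(a_t) * np.linalg.norm(a_tk)))

a_t  = np.array([0.0, 2.0, 0.0, -1.0, 0.0])   # hypothetical coefficients at t
a_tk = np.array([0.0, 1.5, 0.5, -1.0, 0.0])   # hypothetical coefficients at t+k
print(sparse_time_autocorr(a_t, a_t))    # k = 0 gives the maximum value
print(sparse_time_autocorr(a_t, a_tk))   # a value within [-1, 1]
```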

Figure 2. The correlation representation in high-dimensional space.


Previous studies have generally used transform coefficients to compute time correlations; we thus need to analyze what differs from methods that use the data directly and what remains the same. The general formula for directly calculating the time correlation is shown below, where \(x_i\) and \(y_i\) represent the different coefficients obtained by different transform methods at a particular place and a certain time:

\[ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}} \]

The difference between the formulas has to do with centralization. When we transfer this difference into a high-dimensional space, it corresponds to using different angle cosines to measure the similarity between two space points. Because the sparse dictionary obtained by the K-SVD method is redundant and non-orthogonal, it is better to directly use the angle between the sparse coefficients, for which the original data have already been centralized.

P and Q are different series in two-dimensional and three-dimensional space. The centered data series are contrasted with \(P_O\) and \(Q_O\), the non-centered series. From the figure on the left, we can see that the centered correlation is identical to 1, while the other correlation is not. The angles θ and γ indicated in the figure on the right represent the centered and non-centered vectorial angles, respectively. When P, Q and the line where x=y, y=z lie in the same plane, the correlation equals 1, which shows that it simply measures the linear correlation.

When the two sparse coefficient vectors represent different kinds of spatial data, the time correlation analysis between them takes a form similar to the autocorrelation formula, with the same range of \([-1, 1]\):

\[ \rho^{pq}(t, k) = \frac{\langle \alpha_t^p, \alpha_{t+k}^q \rangle}{\|\alpha_t^p\|\,\|\alpha_{t+k}^q\|} \]

4.3. Spatial autocorrelation analysis

Spatial autocorrelation differs from the other correlation analysis methods: it describes the relationship between one spatial unit and the units around it. Spatial autocorrelation is based on spatial adjacency; in other words, all attribute values on a geographic surface are related to one another, but closer values are more strongly related than distant ones (Tobler's first law of geography). In order to analyze the spatial distribution, we need to research the computing method for the spatial autocorrelation coefficient.

Correlation between time and space is one of the primary reasons for the order, pattern, and diversity of the natural world (Goodchild Citation1996). The existence of spatial autocorrelation means that traditional statistical methods are not suitable for researching spatial character. The purpose of spatial statistics is to present the distribution and to determine the effect of spatial autocorrelation on spatial distribution (Cressie and Cassie Citation1993).

When spatial data are transformed into a sparse space, each dictionary atom represents a feature of the spatial data, and the coefficients describe how those features build up the spatial data. In addition, because the spatial features reflect the distribution of spatial data, the dictionary contains spatial-position-associated information, and the sparse coefficients describe the spatial features' linear combination. Thus, the combination of each atom's spatial correlation can be used to measure the spatial correlation. We therefore combine the sparse coefficients and the dictionary spatial autocorrelation to calculate the spatial data autocorrelation in sparse space. The spatial autocorrelation coefficient I can be calculated as follows:

\[ I(h) = \sum_{i=1}^{n} \hat{\alpha}_i\, r_i(h) \]

Here, h represents the space delay, while \(\hat{\alpha}_i\) is the coefficient normalization, expressed as follows:

\[ \hat{\alpha}_i = \frac{|\alpha_i|}{\sum_{j=1}^{n} |\alpha_j|} \]

\(r_i(h)\) represents the dictionary spatial autocorrelation coefficient; it is determined by the dictionary and the spatial weight matrix through the following method:

\[ r_i(h) = \frac{m}{\sum_{u}\sum_{v} w_{uv}(h)} \cdot \frac{\sum_{u}\sum_{v} w_{uv}(h)\,(d_{iu} - \bar{d}_i)(d_{iv} - \bar{d}_i)}{\sum_{u} (d_{iu} - \bar{d}_i)^2} \]

Here, \(d_{iu}\) represents an element of the atom \(d_i\); meanwhile, w represents the spatial contiguity, with values that correlate with h, as follows:

\[ w_{uv}(h) = \begin{cases} 1, & v \text{ is an } h\text{th-order neighbor of } u \\ 0, & \text{otherwise} \end{cases} \]
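To make the weight-matrix machinery concrete, here is a self-contained sketch (Python/NumPy; a generic Moran's-I-style computation on a small grid, not the paper's exact sparse-domain formula) using a first-order rook contiguity matrix. An alternating 0/1 checkerboard yields a value of -1, the extreme of the range:

```python
import numpy as np

def rook_weights(rows, cols):
    # Binary first-order (rook) contiguity matrix for a rows x cols grid:
    # w[i, j] = 1 when cells i and j share an edge.
    n = rows * cols
    W = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    W[r * cols + c, rr * cols + cc] = 1.0
    return W

def morans_i(x, W):
    # Moran's I: spatially weighted cross-products of deviations from the
    # mean, normalized by the variance and the total weight.
    z = x - x.mean()
    return float(len(x) / W.sum() * (z @ W @ z) / (z @ z))

W = rook_weights(4, 4)
checker = np.indices((4, 4)).sum(axis=0) % 2        # alternating 0/1 image
print(morans_i(checker.ravel().astype(float), W))   # ~ -1: perfect dispersion
gradient = np.arange(16, dtype=float)               # smooth trend surface
print(morans_i(gradient, W) > 0)                    # positive: neighbors cluster
```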

We defined the spatial neighborhood to represent the relationship between h and w; the value of the weight reflects the spatial order of closeness. The specific definition is shown in Figure 3.

Figure 3. Definitions of spatial close neighborhoods.


The black block is our analysis object, and the blue blocks stand for different orders of close neighborhoods: (a) shows first-order spatial neighbors, (b) shows second-order spatial neighbors, and (c) shows third-order spatial neighbors.

Because the dictionary atoms' spatial autocorrelation coefficients lie in \([-1, 1]\) and the normalized sparse coefficients lie in \([0, 1]\), the spatial correlation coefficient has a range of \(-1\) to 1. When I is greater than 0, the image has a positive spatial correlation; the larger the spatial correlation coefficient, the stronger the correlation of the spatial distribution. In contrast, the image has a negative correlation when I is smaller than 0. Based on the spatial autocorrelation formula, we provide two examples to illustrate the range of the spatial autocorrelation. In the first example, we create an image using only the value 1, shown in the first line of Figure 4. Subgraph (a) represents the original image, while subgraph (b) represents the dictionary of subgraph (a), based on the K-SVD method. The spatial autocorrelation coefficient of subgraph (a) equals 1, based on Formula 8. In the second example, we create an image alternately using 0 and 1; this is shown in the second line of Figure 4. Subgraphs (c) and (d) have the same meaning as subgraphs (a) and (b); the spatial correlation coefficient of subgraph (c) equals \(-1\). The values of the spatial autocorrelation thus range from \(-1\) to 1.

According to the formulas of spatial autocorrelation and time autocorrelation, we can then derive the space–time autocorrelation formula. The space–time autocorrelation coefficient describes the overall spatiotemporal changes of spatiotemporal series data: it is related not only to the sparse coefficients but also to the sparse dictionary. The space–time autocorrelation formula is thus based on the time autocorrelation formula, augmented with the dictionary spatial autocorrelation coefficients, and its values range from −1 to 1.

5. The correlation coefficient computing procedure

Given the above discussion, the whole processing flow can be summarized in the steps listed below (Figure 5). First, before analyzing the different kinds of correlation, we need to sparsely represent the spatial-temporal series data. Second, we analyze the time-series autocorrelation and the cross-correlation between different series of spatial data. Third, in order to measure the spatial distribution, we compute the spatial autocorrelation coefficient. Finally, we analyze the whole space–time correlation.

Figure 4. Images in which spatial autocorrelation equals 1 and −1.


Figure 5. The general processing steps.


The general procedure is as follows:

Input parameters: two preprocessed spatial datasets, sparsity K, number of atoms in the dictionary m, and number of iterations l;

Output parameters: the different kinds of correlation coefficients between the two spatial data;

  1. Preprocessing the data: for convenient analysis, we need to ensure that the two spatial datasets are registered, and then decentralize (mean-centre) them.

  2. Joining the two spatial datasets, dividing them into blocks, and transforming the blocks into vectors. After these steps, each datum can be organized as appended columns.

  3. Using the K-SVD algorithm to calculate the joint dictionary D.

  4. Calculating the sparse coefficients of the two datasets in the joint sparse space through the corOMP algorithm.

  5. Computing the time autocorrelation coefficient in sparse space through the timeCorr algorithm, based on the time-correlation coefficient formula.

  6. Computing the space autocorrelation coefficient in sparse space through the spaCorr algorithm.

  7. Computing the space–time correlation coefficient based on the space–time correlation computing formula using the spatimeCorr function.
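The seven steps above can be sketched end to end, using scikit-learn's `DictionaryLearning` as a stand-in for K-SVD and its OMP transform in place of corOMP; the patch size, dictionary size, synthetic inputs, and the final correlation measure are illustrative assumptions, not the authors' exact algorithms:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)

def patches(img, p=8):
    """Split an image into non-overlapping p x p blocks, one row each."""
    h, w = img.shape
    return np.array([img[r:r + p, c:c + p].ravel()
                     for r in range(0, h - p + 1, p)
                     for c in range(0, w - p + 1, p)])

# Steps 1-2: two registered, mean-centred datasets (synthetic stand-ins),
# blocked and vectorised, then stacked into one joint training set.
x1 = rng.standard_normal((64, 64))
x2 = x1 + 0.1 * rng.standard_normal((64, 64))
P = np.vstack([patches(x1 - x1.mean()), patches(x2 - x2.mean())])

# Steps 3-4: learn a joint dictionary (K-SVD stand-in) and sparse-code
# both datasets in it with OMP (corOMP stand-in).
dl = DictionaryLearning(n_components=64, transform_algorithm='omp',
                        transform_n_nonzero_coefs=5, max_iter=5,
                        random_state=0)
codes = dl.fit_transform(P)
a1, a2 = codes[:64], codes[64:]   # sparse coefficients of each dataset

# Steps 5-7: correlate the two coefficient sets in the sparse domain
# (a plain Pearson correlation here, not the paper's exact formulas).
r = float(np.corrcoef(a1.ravel(), a2.ravel())[0, 1])
```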

The corOMP algorithm is an improved version of OMP. It computes the sparse coefficients of the two inputs jointly and reduces deviation in the results. The main computing steps are as follows:

Input parameters: preprocessed vectors, joint dictionary, and sparsity K;

Output parameters: the sparse representation coefficients of the two inputs;

  1. Defining the index matrix and initializing it to 0; its number of rows is K and its number of columns equals that of the sparse signal. The residuals, support sets, and the counter t = 1 are then initialized.

  2. Computing, for each input, the maximum inner product between its residual and the dictionary atoms and recording the corresponding indices; comparing the two maxima, then recording the index of the larger one in index.

  3. Updating the collection of selected dictionary atoms.

  4. Solving the coefficients by computing the pseudoinverse of the selected atoms.

  5. Updating the residual errors and setting t = t + 1.

  6. Assessing the iteration condition: if t > K, stop computing; otherwise, return to step 2.
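corOMP extends classical OMP to a pair of signals; the single-signal OMP that steps 1–6 generalize can be sketched as:

```python
import numpy as np

def omp(D, x, K):
    """Classical orthogonal matching pursuit: pick the atom most
    correlated with the residual (step 2), add it to the selected
    collection (step 3), re-fit all selected atoms by least squares
    via the pseudoinverse (step 4), update the residual (step 5), and
    repeat for K iterations (step 6)."""
    residual = x.astype(float).copy()
    support = []
    coef = np.zeros(D.shape[1])
    for _ in range(K):
        idx = int(np.argmax(np.abs(D.T @ residual)))   # step 2
        if idx not in support:
            support.append(idx)                        # step 3
        sub = D[:, support]
        sol = np.linalg.pinv(sub) @ x                  # step 4
        residual = x - sub @ sol                       # step 5
        coef = np.zeros(D.shape[1])
        coef[support] = sol
    return coef

# For an orthonormal dictionary, OMP recovers a K-sparse signal exactly.
D = np.eye(4)
x = np.array([0.0, 3.0, 0.0, -2.0])
est = omp(D, x, K=2)   # -> array([ 0.,  3.,  0., -2.])
```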

The time-series autocorrelation and the spatial-temporal autocorrelation involve a large amount of spatial data; before computing these correlation coefficients, we need to train a joint dictionary and then transform the series data into the sparse domain with the corOMP function. The spatial autocorrelation coefficient, by contrast, is computed on a single datum at a given time, so it only needs to be sparsified with the K-SVD and OMP methods. The detailed calculations follow the formulas above. Following these steps, we calculated the correlation coefficients in order to analyze the relationships in the spatiotemporal series data.

6. Experiments

As typical geographic spatiotemporal data, GDP and population series are often used as examples for verifying applications or testing new methods, so we chose these two datasets. Normalized Difference Vegetation Index (NDVI) data are used to detect vegetation growth status and coverage; vegetation coverage reaches its maximum in August in the Jing-Jin-Ji zone, which is typical of many urban regions. In addition, NDVI series data are computed from remote-sensing data and thus represent one type of spatiotemporal series data. Finally, the spatiotemporal data that are traditionally analyzed are time-series data. We therefore chose these three types of data (GDP, population, and NDVI) as examples for analyzing the correlations among them. In particular, the NDVI series data are based on moderate resolution imaging spectroradiometer (MODIS) monthly vegetation indices (1 km products).

In the following experiments, we use GDP and population social-statistics data for 1995, 2000, 2005, and 2010, and the NDVI time-series data for each August from 2000 to 2014 in the Jing-Jin-Ji region, in order to analyze the change regulations. The NDVI data are based on remote-sensing data. Figure 6 shows the research zone and examples of the experimental correlation data, including the GDP, population, and NDVI images for 2000. Because different types of spatial data have different spatial projections and spatial resolutions, before analyzing the relationships between them we need to preprocess the data so that they share the same spatial scale; this includes projection transformation, mosaicking, image subsetting, resampling, and image registration.

Figure 6. The GDP, population, and NDVI distribution of the Jing-Jin-Ji region in 2000.


Before analyzing these three types of data with our new methods, we need to verify whether they present the properties of spatiotemporal big data. We first compute the time and spatial correlations of the three time series in order to illustrate the data's internal features. Although the traditional time autocorrelation method is based on the assumption of stationarity, it can detect whether or not the data are stationary. The time and spatial autocorrelation results are shown in Figures 7 and 8, respectively.

Figure 7. Time autocorrelation of GDP, population, and NDVI series data. Subgraph (a) represents the GDP series time autocorrelation, subgraph (b) represents the population series time autocorrelation, and subgraph (c) represents the NDVI series time autocorrelation.


Figure 8. The first-order, second-order, third-order, and fourth-order spatial auto-correlation of GDP, population, and NDVI series data.


The first subgraph in Figure 7 shows the GDP time autocorrelation. The autocorrelation coefficients decline linearly, not exponentially, illustrating that the GDP series data are non-stationary. The population and NDVI time autocorrelation coefficients follow the same pattern as GDP. Thus, none of the three types of data is time-stationary in the Jing-Jin-Ji region.

Strict stationarity is too restrictive an assumption for a spatial variable, and thus is only theoretically feasible (Zhang Citation2005). We compute spatial autocorrelation coefficients with the Moran I method, as shown in Figure 8. The subgraphs on the first line represent four different spatial correlation coefficients for GDP, all of which are greater than 0.5. The subgraphs on the second and third lines represent the population and NDVI spatial correlation coefficients, respectively, which are greater than 0.4. These three datasets are thus spatially non-stationary signals, both in theory and in practice. Because volume is the defining feature of spatiotemporal big data, we chose sparse representation methods to reduce data volume. The three datasets are therefore non-stationary series, suitable as spatiotemporal big-data examples for analyzing time and spatial correlations.

We used three types of spatial data in order to analyze their time autocorrelation coefficients, and then used the GDP and population data as examples for computing the cross-correlation. We next computed their spatial autocorrelation coefficients and analyzed the spatial distribution; finally, we analyzed the spatial-time series correlation. Based on the correlation coefficients, we were able to obtain several time and spatial variations related to these three types of spatial data.

Before analyzing the time autocorrelation of the data, we needed to transform them into the sparse domain. Using the time-series sparse representation as an example, Figure 9 illustrates the dictionary for each kind of data based on the K-SVD method; the representation errors are shown in Figure 10. All of the representation errors were less than 1e−4.

Figure 9. Dictionary of three kinds of data series. We used an image block to train the dictionary; the dictionary's size is 64, based on the sparsity mode for computing the sparse coefficients. The first image (left) is the GDP series sparse representation dictionary, the second (middle) is the population series sparse representation dictionary, and the third (right) is the NDVI series sparse representation dictionary.


Figure 10. Different types of image sparse representation errors based on the above dictionaries.


We first analyzed the time autocorrelation of the GDP, population (pop), and NDVI data at different time delays, as shown in Figure 12. Because the GDP and population time interval was 5 years, the unit of time delay was also 5 years, and the maximum GDP and population time delay was 15 years. NDVI time autocorrelation was analyzed from 2000 to 2014, with a maximum time delay of 14 years. The GDP time autocorrelation is not significant, which mainly illustrates that GDP development in the Jing-Jin-Ji region is not balanced: in other words, from 1995 to 2010 some areas' development rates increased rapidly while others remained slow. The population time autocorrelation at the first 5-year delay is greater than at the other delays, showing that the population change trends in the Jing-Jin-Ji region are similar. The time autocorrelation at the third 5-year delay tends to increase, which also reflects a changing trend in the population pyramid. The NDVI time autocorrelation is stable as the time delay increases. The NDVI describes the vegetation rate, mostly reflecting the difference between towns and farmland in August in the Jing-Jin-Ji region. The intensity of the NDVI time autocorrelation describes the expansion of cities and towns, which accompanies economic development and urbanization. Most of the development of cities and towns was concentrated, which means that the NDVI time autocorrelation could be significant. In addition, the vegetation types remained broadly unchanged in August in the Jing-Jin-Ji region from 2000 to 2014.

Figure 11. Two time correlation analysis examples.


Although traditional methods are not suitable for non-stationary signal correlation analysis, we can still briefly analyze those experimental results. Comparing the time autocorrelation coefficients of the three series (shown in Figures 7 and 12), the fluctuation trends of the correlation coefficients are the same, illustrating that the methods we propose are reliable. In addition, the coefficients in the sparse domain are smaller than those in the time-space domain, because the traditional correlation methods ignore the effects of spatial correlation for two-dimensional signals. For example, we used two random matrices to generate two different images by sequential filling, shown in subfigures (a) and (b) of Figure 11. The correlation coefficient based on the Pearson method is small, while the time correlation in the sparse domain equals 1. Because the same construction method was used for these two images, this result shows that the time correlation method we propose can describe structural (or spatial) similarity between two images.

Figure 12. GDP, population, and NDVI time-series autocorrelation at different time delays.


Figure 13. GDP and population time-series cross-correlation.


Based on the time-series autocorrelation coefficient computing formula, we used the cross-correlation coefficient to measure the relationship between GDP and population, as shown in Figure 13. The correlation coefficient declines with time delay, reaching its maximum value at the first delay. GDP and population thus have some correlation, and the maximum correlation coefficient is obtained within 5 years.
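The delay-by-delay cross-correlation underlying this analysis can be illustrated with ordinary Pearson coefficients at each lag (the paper computes the analogous quantity on sparse coefficients instead); the series below are synthetic stand-ins for GDP and population, not the experimental data:

```python
import numpy as np

def lagged_xcorr(a, b, max_lag):
    """Pearson correlation of series a with b delayed by 0..max_lag steps."""
    return [float(np.corrcoef(a[:len(a) - h], b[h:])[0, 1])
            for h in range(max_lag + 1)]

# Two co-trending synthetic series standing in for GDP and population.
t = np.arange(50, dtype=float)
gdp = t + np.random.default_rng(2).normal(0.0, 1.0, 50)
pop = t
r = lagged_xcorr(gdp, pop, 3)   # correlation at delays 0, 1, 2, 3
```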

Based on the spatial autocorrelation computing method, we discuss below the spatial correlations of the three kinds of data with different spatial neighborhood domains. Although the spatial correlation differs with various spatial weights, the variations are similar. First, most of the spatial correlation values are larger than 0.5, so the three types of spatial data at different spatial orders are clearly spatially correlated. Second, the same kind of spatial data in different years has different spatial correlation coefficients; the change in spatial data is thus not uniform. For example, the spatial correlation of GDP in 2005 was larger than in the other years at all three spatial orders, and it reached its minimum value in 1995; the NDVI spatial correlation reached its maximum value in 2008. Third, the first-order spatial correlation coefficient was larger than the second- and third-order coefficients; moreover, the spatial correlation coefficients tended to become smaller as the spatial order increased. This illustrates that spatial elements correlate with their neighboring elements, and that the correlation between an element and its neighbors becomes more obvious as the distance becomes shorter. Finally, despite the differences in spatial correlation order, the analyzed images exhibited similar trends at the same order; in other words, different spatial orders maintained the same spatial-correlation change rule. Taking the population spatial correlation for instance, the population spatial correlation coefficients initially show a declining tendency, followed by an increase. This shows that the population of the Jing-Jin-Ji region was quite transient during the first 10 years, but that population mobility was reduced in the last 5 years. It also demonstrates that the spatial weight influences the values of the spatial correlation coefficient, but has little influence on the spatial change tendency.

Compared with the spatial autocorrelation analysis results based on the Moran I method, the spatial autocorrelation coefficients of the three types of data under both methods decrease order by order, which agrees with common sense. In addition, the discrimination of the spatial coefficients in the sparse domain is larger than in the time-space domain, as shown in Figures 8 and 14–16. Correlation analysis based on sparse representation coefficients is therefore suitable for analyzing spatial autocorrelation in the sparse domain.

Figure 14. GDP spatial autocorrelations from 1995 to 2010, every 5 years. From left to right, the orders proceed from first to third. The spatial correlation coefficient is shown in descending order with increases in spatial order.


Figure 15. NDVI spatial autocorrelations in August, from 2000 to 2014. Because most of the spatial correlation coefficients are larger than 0.6, the NDVI has more obvious spatial correlation. Both of the spatial correlation analyses (in different spatial orders) acquired their maximum values in 2008.


Figure 16. Population spatial autocorrelation from 1995 to 2010, shown every 5 years. The maximum values were obtained in 1995, and the minimum in 2005. The change regulation of the population spatial autocorrelation underlying this situation illustrates that population distribution was not uniform during this time span, especially in 2000 and 2005.


The spatiotemporal correlation coefficient reflects the complex relationship of spatial data across time and spatial changes. Figure 17 shows the NDVI spatiotemporal correlation coefficients at different spatial orders and time delays. All of the results lie in the range [0.9, 1]; most therefore exhibit highly significant autocorrelation. This also illustrates that the spatiotemporal series data have complex spatial and time correlations among themselves. The spatial-time correlation coefficients also decrease as the time delay increases, as is the case for the spatial correlation.

Figure 17. Spatial-time correlation coefficients of NDVI in three different spatial orders and time delays; all are in the range of [0.9, 1].


7. Conclusion

Spatiotemporal series data contain rich geographic information. Alongside the recent accumulation of spatial time-series data, we are encountering new challenges in analyzing and mining the correlations among the data. Because traditional analytical methods are restricted in processing and analyzing such data series, we have presented a new framework based on sparse representation for analyzing the temporal, spatial, and spatial-temporal correlations of spatial time-series data. First, we analyzed the conditions of the K-SVD algorithm so that we could sparsely represent spatiotemporal series data for the subsequent correlation analysis. Second, we proposed computation methods for calculating the temporal, spatial, and spatial-temporal correlation coefficients in the sparse domain, and discussed the properties of those functions. Finally, we employed the new methods to evaluate the spatial and temporal change regulations of GDP, population, and NDVI. From these experiments, we determined that GDP and population have poor time autocorrelation, and that the NDVI series data have significant temporal and spatial autocorrelation, reflecting the variety in time and spatial distribution.

We used sparse-representation-based correlation analysis to analyze spatiotemporal series data that are non-stationary and have complex relationships. The experiments confirmed the effectiveness and adaptability of the new methodology for correlation analysis. The correlation analysis methods in the sparse domain lay the foundation of spatiotemporal big-data analysis and will provide new methods for researching the smart, digital city.

Disclosure

No potential conflict of interest was reported by the authors.

Additional information

Funding

This work is supported by the National Natural Science Foundation of China [No. 41471368 and No. 41571413].

References

  • Beyer Mark A., and Douglas Laney. 2012. The Importance of Big Data: A Definition. Stamford, CT: Gartner. https://www.gartner.com/doc/2057415.
  • Celik Turgay, and Kai-Kuang Ma. 2010. “Unsupervised change detection for satellite images using dual-tree complex wavelet transform.” IEEE Transactions on Geoscience and Remote Sensing 48 (3): 1199–1210. doi:10.1109/TGRS.2009.2029095.
  • Chabira Boulerbah, and Takieddine Skanderietal. 2013. “Unsupervised Change Detection From Multitemporal Multichannel SAR Images Based on Stationary Wavelet Transform.” In MultiTemp 2013: 7th International Workshop on the Analysis of Multi-temporal Remote Sensing Images, 1–4. IEEE. doi:10.1109/Multi-Temp.2013.6866025.
  • Cressie Noel A.C.. 1993. Statistics for Spatial Data. Vol. 900. New York: Wiley.
  • Curry Leslie. 1970. “Univariate Spatial Forecasting.” Economic Geography 46: 241–258. doi:10.2307/143142.
  • Glaeser Edward L., Jed Kolko, and Albert Saiz. 2001. “Consumer City.” Journal of Economic Geography 1 (1): 27–50. doi:10.3386/w7790.
  • Goodchild Michael F.. 1996. “The Application of Advanced Information Technology in Assessing Environmental Impacts.” SSSA Special Publication 48: 1–18. http://dx.doi.org/10.2136/sssaspecpub48.c1.
  • Guo Huadong, Lizhe Wang, Fang Chen, and Dong Liang. 2014. “Scientific Big Data and Digital Earth.” Chinese Science Bulletin 59 (12): 1047–1054. doi:10.1360/972013-1054.
  • Halim S., I.N. Bisono, D. Sunyoto, and I. Gendo. 2009. “Parameter Estimation of Space–Time Model Using Genetic Algorithm.” In IEEE international conference on industrial engineering and engineering management, 2009. IEEM 2009, 1371–1375. IEEE. doi:10.1109/IEEM.2009.5373039.
  • Han Wei-guo, Wang Jin-feng, Gao Yi-ge, and Hu Jian-jun. 2007. “Forecasting and Analysis of Regional Traffic Flow in Space and Time.” Journal of Highway and Transportation Research and Development 24 (6): 92–96.
  • Jia Ying, Bingkun Gao, Chunlei Jiang, and Sihan Chen. 2012. “Leak Diagnosis of Gas Transport Pipelines Based on Hilbert-Huang Transform.” In International conference on measurement, information and control (MIC), 2012, Vol. 2, 614–617. IEEE. doi:10.1109/MIC.2012.6273368.
  • Kamarianakis Yiannis, and Poulicos Prastacos. 2005. “Space–Time Modeling of Traffic Flow.” Computers & Geosciences 31 (2): 119–133. doi:10.1016/j.cageo.2004.05.012.
  • Khan M., Yong Jin, Maozhen Li, Yang Xiang, and Changjun Jiang. 2016. “Hadoop Performance Modeling for Job Estimation and Resource Provisioning.” IEEE Transactions on Parallel and Distributed Systems, 27 (2): 441–454. doi:10.1109/TPDS.2015.2405552.
  • Kitov Ivan O.. 2009. “The Evolution of Real GDP Per Capita in Developed Countries.” Journal of Applied Economic Sciences 4 (2). http://www.jaes.reprograph.ro/articles/summer2009/KitovI.pdf.
  • Kleynhans Waldo, Brian P. Salmon, Jan C. Olivier, Konrad J. Wessels, and Frans van den Bergh. 2011. “A Comparison of Feature Extraction Methods Within a Spatio-Temporal Land Cover Change Detection Framework.” In Geoscience and remote sensing symposium (IGARSS), 2011 IEEE International, 688–691. IEEE. doi:10.1109/IGARSS.2011.6049223.
  • Kong Tao, Chunguang Xu, and Shiyuan Zhou. 2010. “A Time–Frequency Method for Ultrasonic Flaw Inspection Based on HHT.” In 3rd International congress on image and signal processing (CISP), 2010, Vol. 8, 3988–3991. IEEE. doi:10.1109/CISP.2010.5647817.
  • Lin Li, Dejun Guan, Dongrui Zhang, Jinhan Feng, and Lisheng Xu. 2012. “Refined Analysis of Heart Sound Based on Hilbert-Huang Transform.” In Paper presented at the 2012 International conference on information and Automation (ICIA), 100–105. IEEE. doi:10.1109/ICInfA.2012.6246790.
  • Martin Russell L., and J. E. Oeppen. 1975. “The Identification of Regional Forecasting Models Using Space: Time Correlation Functions.” Transactions of the Institute of British Geographers 66, 95–118. doi:10.2307/621623.
  • Mennis Jeremy, and Jun Wei Liu. 2005. “Mining Association Rules in Spatio-Temporal Data: An Analysis of Urban Socioeconomic and Land Cover Change.” Transactions in GIS 9 (1): 5–17. doi:10.1111/j.1467-9671.2005.00202.x.
  • Pan Wei, Gourab Ghoshal, Coco Krumme, Manuel Cebrian, and Alex Pentland. 2013. “Urban Characteristics Attributable to Density-Driven Tie Formation.” Nature Communications 4. doi:10.1038/ncomms2961.
  • Piao Yingchao, Baoping Yan, Shan Guo, Yanning Guan, Jianhui Li, and Danlu Cai. 2012. “Change Detection of MODIS Time Series Using a Wavelet Transform.” In Paper presented at the 2012 International conference on systems and informatics (ICSAI), 2093–2097. IEEE. doi:10.1109/ICSAI.2012.6223465.
  • Primechaev S., A. Frolov, and B. Simak. 2007. “Scene Change Detection Using DCT Features in Transform Domain Video Indexing.” In Paper presented at the 14th international workshop on systems, signals and image processing, 2007 and 6th EURASIP conference focused on speech and image processing, multimedia communications and services, 369–372. IEEE. doi:10.1109/IWSSIP.2007.4381118.
  • Putterman Louis, and David N. Weil. 2010. “Post-1500 Population Flows and the Long Run Determinants of Economic Growth and Inequality.” The Quarterly Journal of Economics 125 (4): 1627–1682. doi:10.3386/w14448.
  • Salmon Brian P., Jan C. Olivier, Waldo Kleynhans, Konrad J. Wessels, and Frans van den Bergh. 2010. “Automated Land Cover Change Detection: the Quest for Meaningful High Temporal Time Series Extraction.” In Geoscience and remote sensing symposium (IGARSS), 2010 IEEE International, 1968–1971. IEEE. doi:10.1109/IGARSS.2010.5653723.
  • Salmon Brian P., Jan Corne Olivier, Konrad J. Wessels, Waldo Kleynhans, Frans Van den Bergh, and Karen C. Steenkamp. 2011. “Unsupervised Land Cover Change Detection: Meaningful Sequential Time Series Analysis.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 4 (2): 327–335. doi:10.1109/JSTARS.2010.2053918.
  • Shumway Robert H., and David S. Stoffer. 2010. Time series analysis and its applications: with R examples. New York: Springer Science & Business Media.
  • Song Jian-guo, Song-hui Lin, Cui-xia Zhao, and Hao-jie Liu. 2011. “Decomposition of Seismic Signal Based on Hilbert-Huang Transform.” In 2011 International conference on business management and electronic information (BMEI), Vol. 1, 813–816. IEEE. doi:10.1109/ICBMEI.2011.5917060.
  • Star Jeffrey, and John Estes. 1991. “Geographic Information Systems: An Introduction.” Geocarto International 6 (1): 46–46. doi:10.1080/10106049109354297.
  • Tobler Waldo R.. 1970. “A Computer Movie Simulating Urban Growth in the Detroit Region.” Economic Geography 46, 234–240. doi:10.2307/143141.
  • Wang Lizhe, Ke Lu, and Peng Liu. 2015. “Compressed Sensing of a Remote Sensing Image Based on the Priors of the Reference Image.” Geoscience and Remote Sensing Letters, IEEE 12 (4): 736–740. doi:10.1109/LGRS.2014.2360457.
  • Zhang Renyi. 2005. Spatial Variability Theory and Applications. Vol. 188. Beijing: Science Press.
  • Zhang Xuyun, Wanchun Dou, Jian Pei, S. Nepal, Chi Yang, Chang Liu, and Jinjun Chen. 2015. “Proximity-Aware Local-Recoding Anonymization with MapReduce for Scalable Big Data Privacy Preservation in Cloud.” IEEE Transactions on Computers, 64 (8): 2293–2307. doi:10.1109/TC.2014.2360516.
  • Zhang Xuyun, L.T. Yang, Chang Liu, and Jinjun Chen. 2014. “A Scalable Two-Phase Top-Down Specialization Approach for Data Anonymization Using MapReduce on Cloud.” IEEE Transactions on Parallel and Distributed Systems, 25 (2): 363–373. doi:10.1109/TPDS.2013.48.
