Graph learning-based spatial-temporal graph convolutional neural networks for traffic forecasting

Traffic forecasting is highly challenging due to the complex spatial and temporal dependencies in the traffic network. Graph Convolutional Neural Networks (GCNs) have been used effectively for traffic forecasting due to their excellent performance in modelling spatial dependencies. In most existing approaches, GCN models spatial dependencies in the traffic network with a fixed adjacency matrix. However, in practice the spatial dependencies change over time. In this paper, we propose a graph learning-based spatial-temporal graph convolutional neural network (GLSTGCN) for traffic forecasting. To capture the dynamic spatial dependencies, we design a graph learning module that learns the dynamic spatial relationships in the traffic network. To save training time and computation resources, we adopt dilated causal convolution networks with a gating mechanism to capture long-term temporal correlations in traffic data. We conducted extensive experiments using two real-world traffic datasets. Experimental results demonstrate that the proposed GLSTGCN outperforms all state-of-the-art baselines.


Introduction
Traffic forecasting is a fundamental component of Intelligent Transportation Systems (ITS). Accurate traffic forecasting is of great significance for traffic control, traffic management, and traffic planning. However, traffic forecasting is a challenging task due to its complex spatial and temporal dependencies:
• Spatial dependency: The traffic conditions on a roadway are affected by the topological structure of the traffic network. Traffic conditions on downstream roadways are affected by those on upstream roadways through the transfer effect, and traffic conditions on upstream roadways are affected by those on downstream roadways through the feedback effect (Dong et al., 2012). At the same time, spatial dependencies among roadways change over time. Figure 1 shows an example. As shown in Figure 1(b), the bold lines between two nodes represent their spatial correlation strength: the deeper the colour, the closer the correlation. We can observe that the correlation between node A and node C changes as time goes by.
• Temporal dependency: Traffic conditions change dynamically over time and are affected by the traffic conditions of previous moments and even of much earlier periods.
There has been a lot of work on traffic forecasting. Statistical models such as the autoregressive integrated moving average (ARIMA) (Ahmed & Cook, 1979) and its variants (Lee & Fambro, 1999; Voort et al., 1996; Williams & Hoel, 2003) rely on an ideal stationarity assumption that does not hold in complicated real traffic dynamics. Traditional machine learning approaches, such as the k-nearest neighbour (k-NN) method (L. Zhang et al., 2013) and support vector regression (SVR) (Chen et al., 2015), rely on human-crafted features, which require domain knowledge and are labour-intensive to design.
With the development of deep learning, many deep learning-based approaches have been proposed. Recurrent neural networks (RNNs) and their variants (long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997) and the gated recurrent unit (GRU)) are widely used for traffic forecasting due to their excellent ability to learn temporal dependencies (Cui et al., 2018; Liao et al., 2018). Although RNN-based approaches can learn temporal dependencies, their structures are over-complex. In addition, RNN-based approaches suffer from time-consuming iterations and the gradient vanishing problem in long-term time-series modelling. There are also methods that apply convolutional neural networks (CNNs) for traffic forecasting (Ma et al., 2017; J. B. Zhang et al., 2017). However, these single-type neural network models cannot deal with both spatial and temporal dependencies.
In order to capture both spatial and temporal dependencies, many hybrid neural network models have been proposed that apply CNNs for spatial feature learning and RNNs for temporal feature learning (Z. J. Lv et al., 2018; H. Yao et al., 2018). However, conventional CNNs are suited to Euclidean data, such as images and regular grids, whereas the traffic network has a complex topological structure and is essentially a graph.
Graph convolutional neural networks (GCNs) have become a natural choice for traffic forecasting, as GCNs can efficiently learn spatial features of data with irregular graph relationships based on spectral theory (Chung & Graham, 1997). Many approaches combine GCNs with CNNs (B. Yu et al., 2018) or RNNs (Geng et al., 2019; H. Yao et al., 2018; Zhao et al., 2019) to model spatial-temporal dependencies in traffic data. Although these GCN-based approaches have significantly improved traffic forecasting accuracy, they still have the following limitations.
First, these approaches adopt a fixed adjacency matrix in the graph convolutional network and lack the ability to model the dynamic change of the spatial relationships between roadways.
Second, these approaches are inefficient at capturing long-range temporal dependencies. RNN-based approaches are time-consuming to train and suffer from gradient vanishing when capturing long-range dependencies. CNN-based approaches can capture long-range dependencies but need to stack many layers, because their receptive field size grows only linearly with the number of stacked layers.
To overcome these problems, we propose a graph learning-based spatial-temporal graph convolutional neural network (GLSTGCN) to more accurately predict network-wide traffic speed. Specifically, we design a graph learning module to learn the dynamic spatial relationships in the traffic network. At each time step, the graph learning module learns a new adjacency matrix according to the current input data. Afterwards, we apply the learned adjacency matrix in the graph convolutional layer to extract dynamic spatial features of traffic data. Inspired by WaveNet (Oord et al., 2016), we choose dilated causal convolutions to extract temporal features of traffic data. As the number of hidden layers and the dilation factor increase, the receptive field size of dilated causal convolution networks grows exponentially. With the aforementioned components, GLSTGCN can efficiently deal with the long-range temporal and complex spatial dependencies in traffic forecasting. The main contributions of this work are summarised as follows:
• We propose a graph learning module to capture dynamic spatial dependencies in the traffic network. Specifically, the graph learning module learns an adjacency matrix according to the current inputs, and the adjacency matrix changes dynamically as the inputs change.
• We present an effective and efficient framework that combines the graph learning module and GCNs with dilated causal convolution networks to capture temporal and spatial dependencies simultaneously. The core idea is to combine the graph learning module with GCN to capture dynamic spatial dependencies, and to adopt a dilated causal convolution network to capture long-range temporal dependencies, which reduces training time compared with RNN-based approaches and saves computation resources compared with CNN-based approaches.
• We evaluate the proposed model on two real-world traffic datasets. The experimental results show that the proposed model outperforms all baselines with low computation costs.
The rest of the paper is organised as follows. We introduce the related work in Section 2 and background knowledge in Section 3. The details of the proposed model are described in Section 4. We evaluate the performance of the proposed model in Section 5. Finally, we summarise our work and outline future work in Section 6.

Related work
Existing traffic forecasting approaches can be roughly classified into parametric approaches and nonparametric approaches. Parametric approaches include the time-series model, the Kalman filtering model (Hinsbergen et al., 2012; Ojeda et al., 2013), and so on. The Autoregressive Integrated Moving Average (ARIMA) model (Hamed et al., 1995) is a popular time-series model. To improve prediction accuracy, different variants have been proposed, such as subset ARIMA (Lee & Fambro, 1999), Kohonen ARIMA (Voort et al., 1996), seasonal ARIMA (Williams & Hoel, 2003), etc. These parametric approaches have simple algorithms and are convenient for computation. However, they rely on an ideal stationarity assumption and cannot deal with the nonlinearity and uncertainty of traffic data.
Nonparametric approaches can automatically learn statistical regularities from sufficient historical traffic data. Early nonparametric approaches include the k-nearest neighbour (k-NN) method (L. Zhang et al., 2013), support vector regression (SVR) (Chen et al., 2015), the Bayesian network model (Sun et al., 2006) and so on. These early nonparametric approaches are traditional machine learning methods that require human-crafted features (Dai & Wang, 2021; Daid et al., 2021; Liu et al., 2021; Revathy et al., 2021). They need domain experts to extract features manually, which is labour-intensive and cannot fully reflect traffic characteristics.
With deep learning becoming a hot research topic (Liang, Li, et al., 2021; Liang et al., 2020), deep learning-based approaches have recently received more and more attention. Huang et al. (2014) proposed a traffic flow prediction method that combines a deep belief network with a multitask learning layer. Y. Lv et al. (2015) proposed a stacked autoencoder model to predict traffic flow. RNN and its variant LSTM have been applied successfully in speech recognition and machine translation owing to their superior capability in modelling long sequences. They have also achieved good performance in traffic forecasting. Cui et al. (2018) proposed a deep-stacked bidirectional and unidirectional LSTM neural network architecture that can capture bidirectional temporal dependencies. Liao et al. (2018) proposed an LSTM-based encoder-decoder sequence learning framework; this approach took auxiliary information (e.g. social events, online crowd queries, etc.) into account to facilitate traffic speed prediction. However, these RNN-based approaches involve time-consuming iterations and do not learn spatial dependencies well. Different from the above RNN-based approaches, we adopt dilated causal convolutional networks to deal with temporal dependencies in the traffic network and thus avoid time-consuming iterations.
CNNs are widely used to process images (Neena & Geetha, 2021; Srivastava & Biswas, 2020). There are also many approaches that adopt CNNs to capture spatial dependencies in the traffic network. J. B. Zhang et al. (2017) employed three residual convolution networks to model the temporal closeness, period, and trend properties of crowd traffic and then combined the outputs of the three networks with external factors to predict the crowd traffic in each region. Ma et al. (2017) proposed a CNN-based method that learned traffic as images. Spatial-temporal dynamics were converted to images, and a CNN was applied to these images to extract spatial and temporal features of traffic speed data. However, these single-type neural network models cannot deal with both spatial and temporal dependencies well.
To capture both spatial and temporal dependencies, many studies have combined RNN and CNN. Z. J. Lv et al. (2018) combined RNN and CNN to learn meaningful time-series patterns that reflect the traffic dynamics of the surrounding area. A look-up CNN was designed to learn spatial features, and an RNN was used to learn temporal features. H. Yao et al. (2018) proposed a multi-view spatial-temporal network framework for demand prediction, which adopted an LSTM to capture temporal dynamics, and a local CNN and semantic network embedding to capture spatial dependencies. To capture the irregular flows of urban traffic lines more effectively, Du et al. designed an irregular CNN to learn the spatial features of traffic passenger flows and utilised LSTMs to learn the temporal features. The advantages of the CNN model in learning spatial features have enabled great progress in traffic prediction tasks. However, CNNs are more suitable for Euclidean data, such as images and grids, than for non-Euclidean data, such as graph-structured data. Different from CNN-based approaches, we adopt GCN in this work.
The traffic network is essentially a graph structure, and GCN is an appealing choice (Cui et al., 2019; Diao et al., 2019). B. Yu et al. (2018) proposed a novel deep learning framework that adopted GCN to extract spatial features and 1D CNN to extract temporal features of traffic data. Cui et al. (2019) proposed a traffic graph convolutional long short-term memory neural network to forecast the network-wide traffic state, which combined GCN and LSTM to learn the interactions between roadways in the traffic network. B. Yu et al. (2019) adopted a 3D graph convolutional network model to jointly learn spatial-temporal dependencies. Zhao et al. proposed a traffic prediction model which utilised GCN to capture spatial dependencies and GRU to capture temporal dependencies (Zhao et al., 2019). Wu et al. proposed a traffic speed prediction model which combined a self-adaptive adjacency matrix generation module with a predefined adjacency matrix to capture spatial dependencies and a dilated convolutional network to capture temporal dependencies (Wu et al., 2019). However, these GCN-based approaches mostly adopt a fixed adjacency matrix, which does not consider the dynamic change of the spatial relationships between roadways. To capture the dynamic spatial dependencies, Peng et al. proposed to use incident dynamic graph structures to model the dynamic station relationships. To overcome data defects in traffic prediction, Peng et al. proposed to generate dynamic graphs via reinforcement learning, and GCN was applied on the dynamic graphs to learn dynamic spatial features (Peng et al., 2021). There are also attention-based traffic forecasting models which utilise the attention mechanism (Vaswani et al., 2017) to learn dynamic spatial dependencies. Guo et al. proposed an attention-based spatial-temporal forecasting model, which introduced attention into the STGCN model to learn dynamic spatial and temporal dependencies (Guo et al., 2019).
Tang et al. proposed a dynamic spatial and temporal graph attention network that utilised multi-head attention to capture the dynamic spatial dependencies (Tang et al., 2020). However, attention-based approaches have a large number of trainable parameters, which makes model training time-consuming and memory-intensive. In this work, we propose a graph learning module to model the dynamic spatial relationships in the traffic network. The proposed graph learning module has a simple algorithm and introduces few parameters.

Graph convolutional neural networks
GCN is an extension of CNN from regular grids to irregular graphs and has achieved great performance in various graph-based tasks. Existing GCNs can be classified into two categories: spectral domain-based graph convolution (Defferrard et al., 2016; Kipf & Welling, 2017) and spatial domain-based graph convolution (Atwood & Towsley, 2016; Hamilton et al., 2017). Spectral graph convolution smoothes the features of nodes based on spectral graph theory (Chung & Graham, 1997). Spatial graph convolution extracts the high-level features of a node by aggregating the features of its neighbour nodes. For a signal x ∈ R^N and a filter g_θ, the spectral graph convolution operator can be defined as

x *_G g_θ = U g_θ(Λ) U^T x,

where U ∈ R^{N×N} is the matrix of eigenvectors of the normalised graph Laplacian L = I_N − D^{−1/2} A D^{−1/2} = U Λ U^T, with D the degree matrix and Λ the diagonal matrix of eigenvalues. To reduce the number of parameters and make the filter K-localised, the filter g_θ(Λ) can be designed as a truncated Chebyshev polynomial expansion, g_θ(Λ) ≈ Σ_{k=0}^{K−1} θ_k T_k(Λ̃), where Λ̃ = 2Λ/λ_max − I_N (Defferrard et al., 2016).
To further reduce the number of parameters and address overfitting, a first-order approximation of the graph convolution was proposed (Kipf & Welling, 2017), in which the graph convolution can be defined as

Z = D̃^{−1/2} Ã D̃^{−1/2} X W,

where Ã = A + I_N is the adjacency matrix with self-loops, D̃ is its degree matrix, X is the input signal, and W is a learnable parameter matrix.
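The K-localised Chebyshev filter mentioned above can be evaluated without an eigendecomposition via the Chebyshev recurrence. The sketch below is a minimal numpy illustration, assuming the common simplification λ_max = 2 (so the scaled Laplacian is L − I); the toy graph and coefficients are made up for demonstration.

```python
import numpy as np

def chebyshev_filter(A, X, thetas):
    """K-localised spectral filter: sum_k theta_k T_k(L_scaled) X,
    assuming lambda_max = 2 so L_scaled = L - I."""
    N = A.shape[0]
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(N) - D_inv_sqrt @ A @ D_inv_sqrt      # normalised graph Laplacian
    L_s = L - np.eye(N)                              # scaled Laplacian (lambda_max ~ 2)
    T_prev, T_curr = np.eye(N), L_s                  # T_0 = I, T_1 = L_s
    out = thetas[0] * (T_prev @ X)
    if len(thetas) > 1:
        out = out + thetas[1] * (T_curr @ X)
    for theta in thetas[2:]:
        # Chebyshev recurrence: T_k = 2 L_s T_{k-1} - T_{k-2}
        T_prev, T_curr = T_curr, 2 * L_s @ T_curr - T_prev
        out = out + theta * (T_curr @ X)
    return out

# toy path graph on 3 nodes, 2-channel signal, K = 3
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.ones((3, 2))
out = chebyshev_filter(A, X, thetas=[1.0, 0.5, 0.25])
```

With K = 1 the filter reduces to θ_0 · X, which is one way to sanity-check the recurrence.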

Abbreviation list
For ease of reading, we have listed all the frequently used abbreviations in Table 1.

Problem definition
Traffic forecasting is a time-series prediction problem: we intend to forecast future traffic speed by leveraging historical traffic speed data. The traffic network can be treated as a graph G = (V, E), where V is a set of N = |V| nodes, each node representing a roadway, and E is a set of edges, each representing a connection between two nodes. These connections can also be described by a weighted adjacency matrix A ∈ R^{N×N}, where A_{i,j} represents the strength of the relationship between nodes i and j. The historical traffic data can be represented as a time series X = [x_1, x_2, . . . , x_t], where x_t ∈ R^{N×1} represents the traffic speed of the N roadways at time t. With the aforementioned notations, the traffic forecasting problem can be defined as follows: given a graph G and the traffic speed of N roadways over the previous h time steps, [x_{t−h+1}, . . . , x_t], we aim to learn a function f : R^{N×h} → R^{N×p} that forecasts the traffic speed of the next p time steps:

[x̂_{t+1}, . . . , x̂_{t+p}] = f(G; [x_{t−h+1}, . . . , x_t]).

Overview of the proposed model
The architecture of the proposed GLSTGCN is shown in Figure 2. It includes a graph learning module, two graph learning-based spatial-temporal convolutional blocks (GLST-Conv Blocks), and an output layer. The graph learning module is used to model the dynamic spatial relationships in the traffic network; its structure is presented in Figure 3. The spatial-temporal convolutional block learns complex spatial and temporal features in traffic data. It is a "sandwich" structure that includes two temporal convolutional layers (TCL) with a graph convolutional layer (GCL) in between. The "sandwich" structure helps to reduce the model parameters and training time. The TCL is used to capture temporal dependencies and the GCL is used to capture spatial dependencies. The output layer includes a temporal convolutional layer and a fully connected layer. The output layer maps the output of the last spatial-temporal convolutional block to the final output Y ∈ R^{N×p}, i.e. the traffic speed of the N roadways over the next p time steps. Next, we give a detailed description of the graph learning module and the spatial-temporal convolutional block.

Graph learning module
The adjacency matrix is crucial for graph convolutional neural networks: it determines the receptive field of the graph convolution operation and the importance of each neighbour. An accurate adjacency matrix will improve traffic forecasting accuracy. In most existing approaches, the adjacency matrix of the traffic network is fixed. However, the spatial relationships between roadways change over time. Thus, a fixed adjacency matrix cannot accurately reflect the spatial dependencies in the traffic network. How to capture the dynamic spatial dependencies is therefore a crucial problem.
Inspired by Jiang et al. (2019), this work proposes a graph learning module to learn the dynamic spatial relationships in the traffic network. Based on the previous traffic speed data, the proposed graph learning module learns an adjacency matrix that reflects the current spatial relationships in the traffic network. We describe the computation of the dynamic adjacency matrix in the following section.
Given the traffic data of N roadways for the previous h time steps, X = (x_1, x_2, . . . , x_N)^T ∈ R^{N×h}, where x_i ∈ R^h is the speed sequence of roadway i, our purpose is to learn a non-negative function A_{ij} = f(x_i, x_j, S) that represents the current spatial relationship between roadways i and j, where S ∈ R^{N×N} is the fixed adjacency matrix computed from geographical distance. Specifically, we learn f(x_i, x_j, S) via a single-layer neural network, which is parameterised by a weight vector k = (k_1, k_2, . . . , k_d)^T ∈ R^{d×1}. Firstly, we multiply the input data X by a projection matrix P ∈ R^{h×d} to preprocess the input data: x̃_i = x_i P, for i = 1, 2, . . . , N.
Then, the preprocessed data and the fixed adjacency matrix S are input to a single-layer neural network to learn the current adjacency matrix. The intuition behind the graph learning module is that if two roadways have close spatial dependencies, their speed sequences are similar. Thus, the dynamic adjacency matrix A output by the graph learning module is defined as

A_{ij} = S_{ij} exp(ReLU(k^T |x̃_i − x̃_j|)) / Σ_{j=1}^{N} S_{ij} exp(ReLU(k^T |x̃_i − x̃_j|)),

where ReLU(x) = max(0, x) is an activation function that guarantees A_{ij} ≥ 0, and the elements of the weight vector k are learnable parameters. The more similar the speed sequences of roadways i and j are, the bigger the value of A_{ij}. The softmax-style normalisation guarantees that each row of A sums to 1. The fixed adjacency matrix S ∈ R^{N×N} is computed from geographical distance as

S_{ij} = exp(−d_{ij}^2 / σ^2) if exp(−d_{ij}^2 / σ^2) ≥ ε, and S_{ij} = 0 otherwise,

where S_{ij} is the weight of the edge between roadways i and j, d_{ij} is the geographical distance between roadways i and j, and σ^2 and ε are thresholds that control the distribution and sparsity of the adjacency matrix.
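One plausible instantiation of the graph learning module, following the graph-learning formulation of Jiang et al. (2019), can be sketched in a few lines of numpy. The exact scoring function (|x̃_i − x̃_j| fed to k through a ReLU, weighted by the distance prior S, then row-normalised) is an assumption here, as are all the toy shapes and values.

```python
import numpy as np

def distance_adjacency(D, sigma2, eps):
    """Thresholded Gaussian kernel: S_ij = exp(-d_ij^2 / sigma^2), zeroed below eps."""
    S = np.exp(-D ** 2 / sigma2)
    S[S < eps] = 0.0
    return S

def learn_adjacency(X, P, k, S):
    """Hypothetical graph learning module.
    X: (N, h) speed sequences; P: (h, d) projection; k: (d,) weight vector;
    S: (N, N) distance-based adjacency acting as a similarity prior."""
    Xp = X @ P                                          # x_tilde_i = x_i P
    diff = np.abs(Xp[:, None, :] - Xp[None, :, :])      # |x_tilde_i - x_tilde_j|, (N, N, d)
    scores = S * np.exp(np.maximum(diff @ k, 0.0))      # S_ij * exp(ReLU(k^T |.|)) >= 0
    return scores / scores.sum(axis=1, keepdims=True)   # each row of A sums to 1

rng = np.random.default_rng(0)
N, h, d = 4, 12, 8
X, P, k = rng.normal(size=(N, h)), rng.normal(size=(h, d)), rng.normal(size=d)
D = np.abs(rng.normal(size=(N, N))); D = (D + D.T) / 2; np.fill_diagonal(D, 0.0)
A = learn_adjacency(X, P, k, distance_adjacency(D, sigma2=10.0, eps=0.1))
```

The diagonal term S_ii · exp(ReLU(0)) = 1 keeps every row sum strictly positive, so the normalisation is always well defined.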

Graph learning-based spatial-temporal convolutional block
In our framework, we adopt a graph learning-based spatial-temporal convolutional block to process graph-structured time series and jointly capture long-range temporal dependencies and dynamic spatial dependencies in the traffic network. To extract high-level temporal features across all time steps and high-level spatial features of nodes in the multi-hop range, we stack two GLST-Conv Blocks. The GLST-Conv Block includes two temporal convolutional layers and a graph convolutional layer, as shown in Figure 2.

Temporal convolutional layer
To learn the temporal features of traffic data, we propose a temporal convolutional layer. As opposed to RNN-based approaches, dilated causal convolution networks have no recurrent connections, which alleviates the gradient vanishing problem and saves training time. Compared with CNN-based approaches, dilated causal convolution networks can capture long sequences with fewer stacked layers, which saves computation resources. As shown in Figure 4, we adopt a dilated causal convolution network (Oord et al., 2016) combined with a gating mechanism (Dauphin et al., 2017) in our temporal convolutional layer to capture the long-range temporal dependencies in the traffic network. The dilated causal convolution network increases the receptive field exponentially, as shown in Figure 5. A dilated convolution can be viewed as a filter sliding over an input sequence while skipping input values with a certain step. Given an input x ∈ R^N and a filter f ∈ R^m, the dilated causal convolution of x with f at step t can be expressed as

(x *_d f)(t) = Σ_{i=0}^{m−1} f(i) · x(t − d × i),

where d is the dilation factor that controls the skipping step. Given the input of the temporal convolutional layer X_in ∈ R^{t×s}, where t and s denote the temporal and spatial dimensions, respectively, the output of the temporal convolutional layer can be defined as

X_out = (X_in *_d K_1) ⊙ σ(X_in *_d K_2),

where K_1 and K_2 are temporal convolution kernels and ⊙ denotes the element-wise product. The sigmoid function σ(·) controls which inputs of the current status are relevant for discovering compositional structure and dynamic variance in the time series.
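The gated dilated causal convolution above can be sketched in numpy for the single-channel case. This is a didactic sketch, not the trained layer: the kernels, input, and single-channel simplification are all assumptions for illustration.

```python
import numpy as np

def dilated_causal_conv(x, f, d):
    """y[t] = sum_i f[i] * x[t - d*i]; positions without full history are dropped."""
    m, T = len(f), len(x)
    recep = d * (m - 1)                    # history needed before the first valid output
    y = np.zeros(T - recep)
    for t in range(recep, T):
        y[t - recep] = sum(f[i] * x[t - d * i] for i in range(m))
    return y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_temporal_layer(x, K1, K2, d):
    """Gated unit: (x *_d K1) ⊙ sigmoid(x *_d K2)."""
    return dilated_causal_conv(x, K1, d) * sigmoid(dilated_causal_conv(x, K2, d))

x = np.arange(12, dtype=float)             # toy speed sequence of 12 steps
K1 = np.array([0.5, 0.5])                  # filter branch
K2 = np.array([0.1, -0.1])                 # gate branch
y = gated_temporal_layer(x, K1, K2, d=2)   # 12 - 2*(2-1) = 10 valid outputs
```

Stacking such layers with dilations 1, 2, 4, ... grows the receptive field exponentially: for kernel size m, three layers already cover 1 + (m − 1)(1 + 2 + 4) input steps.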

Graph convolutional layer
To learn the spatial features of traffic data, we propose a graph convolutional layer. The graph convolutional layer is a graph convolutional neural network that adopts the first-order approximation (Kipf & Welling, 2017) of the Chebyshev spectral filter (Defferrard et al., 2016). Let Ã ∈ R^{N×N} denote the normalised version, with self-loops, of the learned dynamic adjacency matrix, X ∈ R^{N×h} the input data, Z ∈ R^{N×p} the output, and W ∈ R^{h×p} the model parameter matrix. The output of the graph convolutional layer can be defined as

Z = Ã X W.
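A minimal numpy sketch of this layer, assuming the standard Kipf-Welling normalisation (add self-loops, then symmetric degree normalisation) for Ã; the random inputs merely stand in for the learned adjacency matrix and features.

```python
import numpy as np

def graph_conv_layer(A_learned, X, W):
    """Z = A_hat X W, where A_hat is the symmetrically normalised,
    self-looped version of the learned dynamic adjacency matrix."""
    N = A_learned.shape[0]
    A_tilde = A_learned + np.eye(N)           # add self-loops
    d = A_tilde.sum(axis=1)                   # degree vector of A_tilde
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt
    return A_hat @ X @ W

rng = np.random.default_rng(1)
N, h, p = 4, 12, 9
A_learned = rng.random((N, N))                # stand-in for the graph learning output
X, W = rng.normal(size=(N, h)), rng.normal(size=(h, p))
Z = graph_conv_layer(A_learned, X, W)         # (N, p) output features
```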

Loss function
Existing approaches mostly generate multi-step predictions via step-by-step iteration, where small errors in earlier time steps lead to large errors in later time steps. To reduce long-term traffic forecasting error, we choose to generate the predictions for all time steps at once. As the L1 loss is robust during training, we use it as the training objective of the traffic forecasting task. Thus, the loss function of traffic forecasting can be written as

L = (1 / (p · N)) Σ_{i=1}^{p} Σ_{j=1}^{N} |x̂_{t+i,j} − x_{t+i,j}|,

where x̂_{t+i,j} denotes the predicted value and x_{t+i,j} is the ground truth.
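The L1 objective is simply the mean absolute error averaged over all p horizons and N roadways; a tiny worked example (with made-up speeds) makes the averaging explicit.

```python
import numpy as np

def l1_loss(pred, truth):
    """Mean absolute error over all horizons and roadways."""
    return np.mean(np.abs(pred - truth))

# rows = 2 prediction horizons, columns = 2 roadways (toy values)
pred = np.array([[60.0, 58.0],
                 [55.0, 54.0]])
truth = np.array([[61.0, 57.0],
                  [55.0, 50.0]])
loss = l1_loss(pred, truth)   # (1 + 1 + 0 + 4) / 4 = 1.5
```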

Algorithm 1 Graph learning-based spatial and temporal prediction algorithm
8: Generate the prediction result X̂ = [x̂_{t+1}, . . . , x̂_{t+p}] using the output layer
9: Compute the model loss according to Equation (12)
10: Update the model parameters by back propagation
11: end while

Algorithm of the proposed model
According to the above operations, Algorithm 1 shows the overall process to train the proposed model.

Experiments
In this section, we compare the proposed GLSTGCN with other state-of-the-art baselines for traffic speed forecasting. We conduct experiments on two real-world datasets and report the results comprehensively.

Datasets
We verify our model on two real-world large-scale datasets, PEMS (B. Yu et al., 2018) and NYC (Diao et al., 2019), which were collected from traffic monitoring in California and New York City, respectively. PEMS: This traffic dataset was collected from the Caltrans Performance Measurement System (PEMS) by over 39,000 sensor stations. We select 228/142 stations in District 7 of California and collect data on weekdays of May and June 2012. NYC: This traffic dataset was collected from traffic speed detectors deployed on roads of the Manhattan District in New York. We select 50 sensors and 36 days of data ranging from 5 December 2017 to 9 January 2018.
Missing data are preprocessed by linear interpolation. All data are aggregated into 5-min intervals; thus there are 288 data points for each roadway per day. Both datasets are divided into training, validation, and test sets, and the input data are normalised by Z-score normalisation.
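The two preprocessing steps above can be sketched in a few lines of numpy. Computing the Z-score statistics on the training split and reusing them elsewhere is an assumption here (the paper does not state which split the statistics come from), and the toy values are made up.

```python
import numpy as np

def fill_missing(x):
    """Linear interpolation over NaN gaps, as used for missing readings."""
    idx = np.arange(len(x))
    mask = np.isnan(x)
    out = x.copy()
    out[mask] = np.interp(idx[mask], idx[~mask], x[~mask])
    return out

def z_score(x, mu, sigma):
    """Standardise with externally supplied statistics."""
    return (x - mu) / sigma

speeds = fill_missing(np.array([50.0, np.nan, 70.0, 60.0]))  # NaN -> 60.0
mu, sigma = speeds.mean(), speeds.std()   # assumed: training-split statistics
normed = z_score(speeds, mu, sigma)
```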

Experimental setup
The experiments were conducted on a Linux server with one Intel(R) Core(TM) i9-9900KS CPU @ 4 GHz and one NVIDIA GeForce GTX 2080 GPU. All tests use 60 min of historical data to forecast the traffic speed in the next 15, 30, and 45 min.

Evaluation metrics and baselines
To compare the performance of different methods, we adopt Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) to evaluate the difference between the real traffic speed x_{t+i,j} and the predicted speed x̂_{t+i,j}.
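Both metrics have the standard definitions; a short numpy sketch with made-up speeds shows how each is computed.

```python
import numpy as np

def mae(pred, truth):
    """Mean Absolute Error: average of |prediction - truth|."""
    return np.mean(np.abs(pred - truth))

def rmse(pred, truth):
    """Root Mean Square Error: sqrt of the average squared error."""
    return np.sqrt(np.mean((pred - truth) ** 2))

pred = np.array([62.0, 58.0, 55.0])
truth = np.array([60.0, 58.0, 51.0])
# MAE = (2 + 0 + 4) / 3 = 2.0; RMSE = sqrt((4 + 0 + 16) / 3)
```

RMSE penalises large errors more heavily than MAE, which is why the two metrics can rank methods differently.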
We compare our method with the following baselines:
• HA: Historical Average, which models the traffic speed as a seasonal process. We use the average of the last 12 seasons as the prediction result.
• FNN: Feed-Forward Neural Network.
• DCRNN: Diffusion Convolutional Recurrent Neural Network applies graph convolution networks within a recurrent neural network and adopts a sequence-to-sequence architecture for traffic forecasting (Li et al., 2018).
• STGCN: Spatial-Temporal Graph Convolutional Networks utilises graph convolution layers to extract spatial features and 1D convolution layers to extract temporal features (B. Yu et al., 2018).
• Graph WaveNet: Graph WaveNet adopts graph convolutional networks to learn spatial features and dilated causal convolution networks to learn temporal features of traffic data (Wu et al., 2019).
• ASTGCN: Attention-based Spatial-Temporal Graph Convolutional Networks introduce an attention mechanism into the STGCN model. Only the recent component is used to ensure fairness (Guo et al., 2019).
All deep learning models are trained for 50 epochs with a batch size of 50, using the hyperparameters provided by the authors. Our model uses an Adam optimiser. The initial learning rate is 0.001 with a decay rate of 0.7 after every 5 epochs. The graph convolution kernel size is 3. The temporal convolution kernel sizes of the two spatial-temporal convolution blocks are 3 and 2, respectively. The dilation factors of the two temporal convolution layers in each spatial-temporal convolution block are 1 and 2, respectively. The output channels of the three layers in the spatial-temporal convolutional block are 64, 16, and 64, respectively. Table 2 shows the comparison of the proposed method and the baselines for 15, 30, and 45 min ahead prediction on the PEMS and NYC datasets. Our proposed method outperforms all baselines, achieving the lowest MAE and RMSE on both datasets.
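The step-decay learning-rate schedule described above (initial rate 0.001, multiplied by 0.7 every 5 epochs) works out as follows; the helper name is ours, not from the paper.

```python
def lr_at_epoch(epoch, base=1e-3, decay=0.7, every=5):
    """Step decay: multiply the learning rate by `decay` every `every` epochs."""
    return base * decay ** (epoch // every)

# epochs 0-4 use 0.001, epochs 5-9 use 0.0007,
# and by the final epoch (49) the rate has decayed to 0.001 * 0.7**9
```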

Experiment results
The traditional linear prediction method HA performs worst due to its simple algorithm and its inability to deal with the nonlinear temporal dynamics and complex spatial dependencies in the traffic data. The neural network method FNN achieves better results than the traditional linear method, but FNN cannot model spatial dependencies, which prevents it from reaching satisfactory prediction accuracy. The graph convolutional neural network methods (DCRNN, STGCN, Graph WaveNet, ASTGCN and GLSTGCN) all achieve better performance, which demonstrates the importance of simultaneously learning the complex temporal and spatial dependencies in the traffic network. DCRNN and STGCN use a predefined adjacency matrix to model spatial dependencies in the traffic network, so they cannot capture the dynamic spatial relationships between roadways. Although Graph WaveNet uses an adaptive adjacency matrix combined with a predefined adjacency matrix to model spatial relationships between roadways, it still cannot model dynamic spatial relationships. ASTGCN learns dynamic spatial dependencies with an attention mechanism, but it does not perform well. The proposed GLSTGCN achieves the best performance among all methods, as it efficiently captures both the dynamic spatial dependencies and the temporal dependencies in the traffic network.
Figures 6 and 7 further show the prediction results at each horizon on the PEMS and NYC datasets. As the prediction horizon becomes longer, the prediction performance degrades for all methods. Compared with the baselines, the prediction performance of the proposed method degrades more slowly as the horizon increases. GLSTGCN performs well for both short-term and long-term prediction and obtains better MAE and RMSE at almost all horizons.

Effect of graph learning module
We conduct experiments with two different configurations on the PEMS and NYC datasets to verify the effectiveness of the proposed graph learning module. The first configuration uses the predefined adjacency matrix to model spatial dependencies and does not include the graph learning module; the second is the proposed GLSTGCN. The experimental results show that the proposed GLSTGCN obtains better prediction accuracy, which indicates the importance of learning dynamic spatial dependencies for traffic prediction.

Parameter selection
To evaluate the effect of different numbers of hidden units in the graph learning module, we conduct experiments on the PEMS and NYC datasets, as shown in Tables 4, 5 and 6. We obtain the worst prediction accuracy when the number of hidden units is set to 5, and the prediction accuracy changes only slightly once the number of hidden units exceeds 10. A possible reason is that the feature dimension of the input to the graph learning module is 12, so choosing many more hidden units brings little benefit. Considering prediction performance and computation cost, we set the number of hidden units to 10 in this paper. To validate the rationality of the hyperparameter settings, we conduct experiments with different settings on the PEMS228 dataset. The prediction results under different hyperparameter settings are shown in Figures 8 and 9. From Figures 8(a) and 9(a), we find that the lowest MAE and RMSE are achieved when the learning rate is 0.001. From Figures 8(b) and 9(b), the MAE and RMSE change slightly with the decay rate and are lowest when the decay rate is 0.7. From Figures 8(c) and 9(c), the MAE and RMSE also change slightly with the decay frequency (i.e. decay after how many epochs) and are lowest when the decay frequency is 5.

Computation cost
We compare the computation cost of the proposed method with GCGRU and STGCN on the PEMS and NYC datasets. As shown in Table 7, during training GLSTGCN runs at least 7 times faster than GCGRU and slightly faster than STGCN. For testing, we measure the total time of each model on the test data. GLSTGCN has the lowest computation time. This is because we generate 9 predictions in one run, while GCGRU has to produce results step by step. Besides, GCGRU has time-consuming recurrent connections.

Conclusion and future work
In this paper, we propose a graph learning-based spatial-temporal graph convolutional neural network (GLSTGCN) for traffic forecasting. We propose a graph learning module that learns the dynamic spatial relationships in the traffic network according to the current input. The proposed GLSTGCN can capture long-range temporal and complex spatial dependencies efficiently and effectively. Experiments show that our model achieves state-of-the-art results on both real-world datasets. In future work, we will further optimise the network structure and combine it with other deep learning methods for learning dynamic spatial dependencies to improve prediction accuracy.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
The work was supported by the National Natural Science Foundation of China [grant number 61976087] and the National Natural Science Fund for Distinguished Young Scholars [grant number 62025201].