Traffic flow prediction models – A review of deep learning techniques

Abstract Traffic flow prediction is an essential part of the intelligent transport system. It is the accurate estimation of traffic flow in a given region at a particular interval of time in the future. The study of traffic forecasting is useful in mitigating congestion and making travel safer and more cost-efficient. While traditional models use shallow networks, the number of vehicles has grown exponentially in recent times, and these traditional machine learning models fail in current scenarios. In this paper, we review some of the latest work in deep learning for traffic flow prediction. Deep learning architectures include the Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), Restricted Boltzmann Machine (RBM), and Stacked Auto-Encoder (SAE). These deep learning models use multiple layers to progressively extract higher-level features from raw input. We review the latest deep learning models developed to tackle this problem and, given the complexity of transport networks, describe how various factors influence these models and which models work best in different scenarios.


PUBLIC INTEREST STATEMENT
The intelligent transport system helps estimate road network capacity, alleviate traffic congestion, and guide traffic participants through traffic signal control, real-time traffic information collection, and effective traffic data dissemination. Through this paper, we therefore aim to give readers and researchers a clear picture of the latest deep learning models that exist to predict traffic flow. Different architectures have been developed to solve this problem, from Convolutional Neural Networks that capture spatial data, to hybrid CNN-LSTM models that work very well because they capture both the temporal and spatial aspects of traffic flow, to semi-supervised networks such as SAEs. It is very difficult to choose among all these networks; hence, through this paper we aim to explain these complex networks, their pros and cons, and the environments in which they are trained. This review gives the reader information about how various factors influence these models and which models work best in different scenarios.

Introduction
Traffic flow prediction is an essential part of the Intelligent Transport System (ITS). It helps traffic stakeholders make safer and smarter use of transport networks (Cen Chen et al., 2020). Traffic stakeholders include individual travelers, traffic managers, policy makers, and road users; various traffic stakeholders are shown in Figure 1. The effectiveness of these systems is determined by the quality of the traffic data; only with good data can an ITS succeed. The World Health Organization (WHO) Global Status Report on Road Safety 2018 reported that road traffic deaths continue to increase, with 1.35 million deaths recorded in 2016, making the study of traffic forecasting a useful way of mitigating congestion and making travel safer and more cost-efficient (Makaba et al., 2020; World Health Organization, 2018). The benefits of traffic forecasting are illustrated in Figure 2.
The big question is whether traffic patterns, queuing patterns, and travel times can be accurately predicted. Can predicting traffic flow enable decisions that proactively resolve potential congestion? These questions have therefore become a critical area of research, especially in the urban context. In recent times, the urban fleet of vehicles has evolved to the point where vehicles can upload filtered sensor data directly to the cloud (Gerla et al., 2014). A vehicular cloud exists that provides services to autonomous vehicles. This further enables traffic flow prediction, but the amalgamation of features from different road segments and time-varying traffic patterns makes it difficult for prediction models to forecast traffic accurately. Accurate prediction of traffic parameters is difficult because of the dynamic and stochastic nature of traffic (Vijayalakshmi et al., 2021). Physical infrastructure and traffic rules impose strong constraints on prediction models, and external factors such as weather, accidents, and road closures also heavily influence them. Therefore, it is important to understand the various architectures that have been developed to tackle this problem.
In the last few years, we have entered the era of big data in transportation (Lv, Duan, Kang, Li, Wang et al., 2015). Traffic congestion forecasting is heavily dependent on sensors and other equipment to acquire relevant data, such as traffic speed, weather, and accident data. While traditional models use shallow networks, the number of vehicles has grown exponentially in recent times, and these traditional machine learning models fail in current scenarios. Therefore, this paper reviews deep learning models in depth and also explores the realm of unsupervised learning with respect to traffic flow prediction.
Traditional models used parametric methods (Zheng et al., 2020). The Auto-Regressive Integrated Moving Average (ARIMA) model is a well-known framework and benchmark for short-term traffic flow prediction (Van Der Voort et al., 1996). Many changes to the ARIMA model were introduced, and results indicated improved performance (Lee & Fambro, 1999; Williams, 2001; Williams & Hoel, 2003). However, due to the stochastic and non-linear nature of traffic flow, parametric models failed to provide accurate results (Ken Chen et al., 2020). Therefore, nonparametric models were preferred, and Neural Networks became popular for traffic flow prediction. A shallow Back-Propagation Neural Network (BPNN) (Smith et al., 1994) showed promising results, but it failed to work in the big data era. Thus came the emergence of deep learning, which uses multiple layers to progressively extract higher-level features from raw input. Deep learning architectures include the Convolutional Neural Network (CNN) (Simonyan & Zisserman, 2015), Recurrent Neural Network (RNN) (Graves et al., 2013), Long Short-Term Memory (LSTM) (Sainath et al., 2015), Restricted Boltzmann Machine (RBM) (Goodfellow et al., 2013), Deep Belief Network (DBN) (Sarikaya et al., 2014), and Stacked Auto-Encoder (SAE) (Gehring et al., 2013). This paper presents a review of these architectures. Due to the complexity of transport networks, it also gives the reader information about how various factors influence these models and which models work best in different scenarios. The work also reviews some unsupervised learning approaches and how they fare against supervised approaches in deep learning.

Background study
Traffic congestion and its resultant issues are the banes of our times, more so in developing countries with inadequate infrastructure. The impacts of traffic congestion, in terms of lost productivity, energy costs, health costs through pollution, safety costs, and more, are grossly underestimated and present a pressing opportunity to improve the standard of living and quality of life in general (Ahmed Yasser & Kader, 2017). The problem of traffic flow prediction is the accurate estimation of traffic flow in a given region at a particular interval of time in the future. The problem has complex non-linear spatio-temporal dependencies as well as dependencies on external factors such as weekends, holidays, weather, events, road conditions, and more. Traditionally, traffic prediction models used statistical methods (Lee & Fambro, 1999; Van Der Voort et al., 1996; Williams, 2001; Williams & Hoel, 2003). However, due to the complex dependencies of traffic flow data, researchers have moved from statistical methods to machine learning (Dong et al., 2018; L. Liu et al., 2019; Lu et al., 2020) and deep learning architectures (Sun, Wu, Xiang et al., 2020; Tampubolon & Hsiung, 2018; Yi et al., 2017) to solve the problem. Previous review papers addressed this shift from shallow Neural Network (NN) architectures to Deep Neural Network (DNN) architectures for traffic flow prediction (Ali & Mahmood, 2018; Nguyen et al., 2018; Yin et al., 2020).
Deep learning models have shown promising results to represent the non-linearity of traffic flow prediction. For example, popular deep learning architectures such as CNNs have proved to adapt to spatial dependencies of traffic (Deng et al., 2019;Sun, Wu, Xiang et al., 2020). RNNs, especially LSTM architectures, adjust to long-term and short-term temporal dependencies of traffic flow (Osipov et al., 2020;B. Yang et al., 2019).
While there are several advantages to using the mentioned deep learning models individually to predict traffic flow, there are also significant disadvantages. According to (Bae et al., 2018), as modern transportation systems rely on accurate data, any missing data profoundly impacts the accuracy of a deep learning model and can lead to suboptimal or ineffective performance. In that paper, an attempt is made to solve the issue of missing data using tensor completion methods for data imputation, which are unsupervised.
Another limitation of deep learning architectures such as CNNs and RNNs is that they require large amounts of historical data for training. Meeting this requirement is relatively easy, as we live in the era of big data. However, large amounts of historical data can cause over-fitting of the model due to high fluctuations in traffic flow over small time intervals.
Thus, researchers have recently started to move from plain deep learning architectures to hybrid and unsupervised methods (Bhatia et al., 2020; Feng et al., 2020). Recent review papers have mainly focused on machine learning based models, statistical models, and urban flow prediction models for traffic forecasting (Sun, Aljeri et al., 2020; Wang & Boukerche, 2020; Xie et al., 2020; Y. Zhang, 2020). In this review, we aim to critically address the various existing deep learning architectures used for traffic flow prediction and the rising popularity of hybrid methods (Z. Duan et al., 2018; Y. Liu et al., 2017; Petersen et al., 2019). We also discuss the potential of unsupervised learning methods for traffic flow prediction.

Deep Neural Networks (DNN)
In this section, we review some of the Multilayer Neural Network (MLNN) ideas used in traffic flow prediction, as shown in Figure 3, and briefly describe various techniques to further optimize these networks. These NNs are computational models composed of multiple processing layers that learn representations of data with multiple levels of abstraction (Lecun et al., 2015).
As traffic flow is affected by a variety of factors such as weather, accidents, and holidays, a deep architecture is suitable for the complex task of traffic flow prediction. This became feasible after Hinton proposed a breakthrough deep network that converted high-dimensional data to low-dimensional representations and outperformed PCA (Hinton & Salakhutdinov, 2006).
The authors in (Tampubolon & Hsiung, 2018) demonstrated hyperparameter tuning techniques to optimize such networks and address the generalization and over-fitting problem. They used the backpropagation algorithm to calculate the gradient and updated the weights using Stochastic Gradient Descent (SGD). A dropout layer was added, and batch normalization was applied to ensure that the input distribution is standardized before the activations are passed to the nonlinear layers.
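As a rough illustration of those regularization steps, the sketch below (plain NumPy, not the authors' code; the layer sizes and the 0.5 dropout rate are illustrative assumptions) standardizes a batch of activations and applies inverted dropout around a nonlinearity:

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_norm(x, eps=1e-5):
    """Standardize each feature over the batch before the nonlinearity."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def dropout(x, rate, training=True):
    """Randomly zero activations during training; rescale so the
    expected activation is unchanged (inverted dropout)."""
    if not training:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

# One hidden layer: linear -> batch norm -> ReLU -> dropout
x = rng.normal(size=(32, 8))            # batch of 32 samples, 8 features
W = rng.normal(scale=0.1, size=(8, 16))
h = dropout(np.maximum(batch_norm(x @ W), 0.0), rate=0.5)
```

At inference time the dropout mask is disabled, which the `training` flag models; frameworks such as Keras and PyTorch handle this switch automatically.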
A Cascaded Artificial Neural Network (CANN) was proposed in (S. Zhang et al., 2018) that consists of three Artificial Neural Networks (ANNs): long-term, medium-term, and short-term. The long-term network captures weekly recurrences in traffic flow, the medium-term network captures daily periodicity, and the short-term network captures numeric variations in traffic flow trends. The CANN model showed better results when using historical data.
A Wavelet Neural Network (WNN) introduced in (Q. Chen et al., 2021) uses the Morlet function as the wavelet basis function; the Morlet wavelet is a complex exponential multiplied by a Gaussian window.
The proposed model uses one hidden layer with the Morlet function as the transfer function of the WNN. The input layer takes normalized traffic flow inputs from 4 points, the hidden layer has 6 neurons, and the output layer has 1 neuron. The model was further improved using the particle swarm optimization (PSO) algorithm. The swarm iteratively updates locations in a d-dimensional search space using V_i(t+1) = W·V_i(t) + c1·r1·(P_i − X_i(t)) + c2·r2·(G − X_i(t)) and X_i(t+1) = X_i(t) + V_i(t+1), where V is the flight speed, X is the location, P_i is the particle's best-known location, and G is the swarm's best-known location. The improved algorithm offers an update mechanism for the inertia weight W: W_i = W_max − (W_max − W_min)·i / I_max, where W_i is the i-th inertial weight and I_max is the maximum number of iterations. The algorithm uses these updates to iteratively reach an optimal solution. Figure 4 shows a representation of this model.
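The update rules above can be sketched as follows (a generic PSO with a linearly decreasing inertia weight; the search range, particle count, and acceleration constants c1 = c2 = 2 are conventional assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def pso_minimize(f, dim, n_particles=20, i_max=100,
                 w_max=0.9, w_min=0.4, c1=2.0, c2=2.0):
    """Minimize f over R^dim with velocity/position updates and a
    linearly decreasing inertia weight W_i."""
    x = rng.uniform(-5, 5, size=(n_particles, dim))   # locations X
    v = np.zeros_like(x)                              # flight speeds V
    pbest = x.copy()                                  # per-particle best P_i
    pbest_val = np.apply_along_axis(f, 1, x)
    gbest = pbest[pbest_val.argmin()].copy()          # swarm best G
    for i in range(i_max):
        w = w_max - (w_max - w_min) * i / i_max       # inertia weight W_i
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = x + v
        val = np.apply_along_axis(f, 1, x)
        improved = val < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], val[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest

# Sanity check on a simple convex objective (sphere function)
best = pso_minimize(lambda z: float(np.sum(z * z)), dim=2)
```

In the paper's setting, the objective would instead be the WNN's prediction error as a function of its weights.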

CNN
These are deep learning algorithms that take multi-dimensional inputs and carry out complex computations with relative ease compared to a conventional DNN. ConvNets prove to be an effective tool for capturing spatial-temporal features with the help of self-learned filters, and the pooling layers within a ConvNet enhance the generalization ability of the model and reduce computation time. These features of CNNs motivate their application to traffic flow prediction. To establish spatial correlations in the data, the authors of (Deng et al., 2019) introduced a Random Subspace learning based deep CNN (RSCNN) model. The dataset was obtained from the Performance Measurement System (PeMS). For each location, three random subspaces are constructed and fed into three deep convolutional neural networks. Each random subspace includes a batch of locations chosen from a candidate matrix comprising the locations with maximum spatial correlation. The model effectively considers correlations in traffic flow and produces promising results in high-traffic situations. However, the model fails to differentiate between weekdays and holidays, and misses other long-term periodicity.

To overcome these shortcomings, the authors' next step was to represent locations using vectors and extract the similarities among them using the CBOW model. Negative sampling was adopted to reduce the training cost. Real data, average data, and mode data are fed to the CBOW model to generate a feature matrix, which is then fed to the convolutional neural network. The model also uses pooling layers that serve the purpose of grouping. The Data Grouping CNN (DGCNN) introduced in (Xia et al., 2016) showed high accuracy for short-term prediction when evaluated on Root Mean Square Error (RMSE) and Mean Relative Error (MRE), but the network did not achieve long-term prediction.
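The subspace idea of grouping each target location with its most spatially correlated candidates can be illustrated roughly as below (using absolute Pearson correlation as the similarity measure is an assumption for illustration, not necessarily the paper's exact criterion):

```python
import numpy as np

rng = np.random.default_rng(2)

def top_k_correlated(flows, target, k):
    """Return the indices of the k locations whose flow series correlate
    most strongly (in absolute value) with the target location's series."""
    corr = np.corrcoef(flows)          # locations x locations
    scores = np.abs(corr[target])
    scores[target] = -np.inf           # exclude the target itself
    return np.argsort(scores)[::-1][:k]

# 6 locations x 100 time steps; location 3 mirrors location 0 plus noise
flows = rng.normal(size=(6, 100))
flows[3] = flows[0] + 0.05 * rng.normal(size=100)
subspace = top_k_correlated(flows, target=0, k=3)
```

Each such subspace would then be fed to its own CNN, as in the RSCNN setup.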
Researchers have developed several methods to achieve higher accuracy for both short-term and long-term prediction. In one study, the authors fed real-time data, daily periodicity data, weekly periodicity data, and data on external factors such as events and holidays separately to the CNN. The outputs were then merged and flattened using a logistic regression layer. However, the Multi-feature Fusion CNN (MF-CNN), illustrated in Figure 5, was accurate only for short-term prediction. The model's run time was not optimal, and it did not account for unprecedented changes due to accidents or other unforeseen circumstances.
Another interesting idea, reported in (P. Liu et al., 2019), extracts spatio-temporal features for better accuracy. The improved spatio-temporal Residual Network (ResNet), having four components, three describing spatio-temporal relationships and one for scenario patterns, accurately captured both spatio-temporal correlations and scenario patterns by acquiring data from smart cards. The improved ResNet used a shortcut every three layers, compared to the traditional ResNet, which uses a shortcut every two layers. The image-like tensor was divided into three factors: closeness, period, and trend. All these components had the same structure: one convolutional layer followed by two improved residual blocks, which were then followed by a final convolutional layer. This defined the spatial relationships. The scenario patterns were a summation of the boarding flow and alighting flow for all time intervals, which was normalized and encoded into a 1D matrix. This matrix was input into two fully connected layers to produce the output.
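The residual units mentioned above all follow the same shortcut pattern, y = x + F(x), which lets identity information and gradients bypass the stacked transformation; a minimal sketch:

```python
import numpy as np

def residual_block(x, f):
    """y = x + F(x): the shortcut carries x past the transformation F."""
    return x + f(x)

# Chaining blocks deepens the network while keeping the identity path open;
# the linear F here is only a stand-in for the convolutional layers
x = np.arange(4.0)
y = residual_block(residual_block(x, lambda t: 0.1 * t), lambda t: 0.1 * t)
```

In the improved ResNet, F spans three convolutional layers per shortcut rather than the usual two.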
A 3D CNN (C. Chen et al., 2018) was used to learn the spatio-temporal correlation features jointly, from low-level to high-level layers, for traffic data. Compared to a 2D CNN, a 3D CNN can model three-dimensional information owing to its 3D convolution and 3D pooling operations. By preserving the temporal dependencies of the volumetric data in the output volume and sharing the same kernel across the space and time dimensions, the model could take full advantage of spatial and temporal dependencies. This model did not rely on historical data to predict future values, nor did it overlook spatial and external features. It also outperformed the ST-ResNet and ConvLSTM models.
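A toy sketch of the 3D convolution operation itself (single channel, single kernel, valid padding; a generic illustration of kernel sharing over time, height, and width, not the paper's implementation):

```python
import numpy as np

def conv3d(volume, kernel):
    """Valid 3D convolution: one kernel slides over time, height and width,
    so spatial and temporal dependencies share the same weights."""
    T, H, W = volume.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(volume[i:i+t, j:j+h, k:k+w] * kernel)
    return out

vol = np.ones((4, 5, 5))              # 4 time steps of a 5x5 traffic grid
out = conv3d(vol, np.ones((2, 3, 3))) # 2x3x3 spatio-temporal kernel
```

In practice, frameworks provide vectorized equivalents (e.g., 3D convolution layers) rather than explicit loops.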

SAE
One of the earliest definitions of auto-encoders was given by (Ballard, 1987), who defined auto-encoders, or auto-associative configurations, as models whose output is constrained to be identical to their input. (Palm, 2012) explains that an auto-encoder or auto-associator is a discriminative graphical model that attempts to reconstruct its input signals. An SAE model is, as the name suggests, created by stacking auto-encoders to form a deep network, as shown in Figure 6. Stacking refers to taking the auto-encoder output of the layer below as the input of the current layer. SAEs are known to show promising results on traffic flow data as they inherently consider spatial and temporal correlations (Lv, Duan, Kang, Li, Wang et al., 2015b), and according to (Erhan et al., 2010), unsupervised pre-training helps prevent overfitting. Furthermore, SAEs do not require a labeled training set, being an unsupervised learning method, and thus they can easily handle mislabeled data. However, they have a significant disadvantage when the hidden layer's dimension is the same as or larger than the input layer's: in such cases, the model can end up copying the input without extracting more meaningful features (Bengio et al., 2007; Erhan et al., 2010). Currently, there are two main approaches to tackling this issue: denoising auto-encoders (Erhan et al., 2010; Xu et al., 2018) and sparse auto-encoders (Palm, 2012). An auto-encoder with sparsity constraints is called a sparse auto-encoder. All papers discussed in this section use sparse auto-encoders in their solutions.
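The greedy stacking idea can be sketched in a few lines: train one auto-encoder, then train the next on its hidden codes. The toy example below (tied weights, sigmoid units, plain gradient descent on squared reconstruction error; all sizes and learning rates are illustrative assumptions, not any paper's configuration) shows the mechanics:

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(x, n_hidden, lr=0.5, epochs=300):
    """Tied-weight auto-encoder trained by gradient descent on squared
    reconstruction error; returns weights, hidden codes, and loss curve."""
    w = rng.normal(scale=0.1, size=(x.shape[1], n_hidden))
    losses = []
    for _ in range(epochs):
        h = sigmoid(x @ w)                   # encode
        xr = sigmoid(h @ w.T)                # decode with tied weights
        err = xr - x
        losses.append(float(np.mean(err ** 2)))
        d_out = err * xr * (1 - xr)          # decoder delta
        d_hid = (d_out @ w) * h * (1 - h)    # encoder delta
        grad = x.T @ d_hid + d_out.T @ h     # both paths share w
        w -= lr * grad / x.shape[0]
    return w, sigmoid(x @ w), losses

# Greedy stacking: each layer's codes become the next layer's input
x = rng.random((64, 10))
w1, h1, loss1 = train_autoencoder(x, 6)
w2, h2, loss2 = train_autoencoder(h1, 3)
```

After such layer-wise pre-training, the stack is typically fine-tuned end to end with a supervised output layer, as the papers below describe.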
The first paper to use SAEs to predict traffic flow, to the best of our knowledge, is (Lv, Duan, Kang, Li, Wang et al., 2015). The model's training and test data were from the Caltrans PeMS database: the traffic flow data for the first two months of 2013 comprised the training data, and the third month of 2013 was used as test data. The paper uses a traditional SAE model trained in a greedy layer-wise fashion. The Kullback-Leibler (KL) divergence provides the sparsity constraint on the coding, KL(ρ ∥ ρ̂_j) = ρ log(ρ/ρ̂_j) + (1 − ρ) log((1 − ρ)/(1 − ρ̂_j)), which equals zero when ρ̂_j = ρ. Logistic regression is added as the last layer after the SAE layers for supervised traffic flow prediction. The SAE model, together with the logistic regression unit, comprises the architecture proposed by the paper.
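The KL sparsity penalty can be written as a small helper; this is the standard sparse auto-encoder formulation rather than the paper's exact code, with ρ the target sparsity and ρ̂_j the mean activation of hidden unit j:

```python
import numpy as np

def kl_sparsity(rho, rho_hat):
    """KL(rho || rho_hat_j) summed over hidden units; zero when every
    unit's mean activation equals the target sparsity rho."""
    rho_hat = np.asarray(rho_hat, dtype=float)
    return float(np.sum(rho * np.log(rho / rho_hat)
                        + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))))

# Two units deviate from the 0.05 target, so the penalty is positive
penalty = kl_sparsity(0.05, [0.05, 0.2, 0.01])
```

The penalty is added to the reconstruction loss during pre-training, pushing hidden units toward low average activation.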
Although it is straightforward to train the deep network by applying the back-propagation (BP) algorithm, networks trained with the BP method alone were inaccurate. Thus, following (Hinton & Salakhutdinov, 2006), a greedy layer-wise unsupervised learning algorithm is used to pre-train the deep network from the bottom up. Once pre-training is complete, the authors carried out fine-tuning using the BP algorithm with gradient-based optimization to tune the model's parameters in a top-down fashion and obtain even more accurate results. The model was then used to predict traffic flow in intervals of 15, 30, 45, and 60 minutes. The paper concluded that the SAE model could discover nonlinear spatio-temporal correlations in traffic data and performed better than the Back-Propagation Neural Network (BPNN), Random Walk (RW), Support Vector Machine (SVM), and Radial Basis Function Neural Network (RBFNN) models.
Considering the model mentioned previously, another paper evaluated the proposed SAE model through 250 experimental tasks (Y. Duan et al., 2016). The model's performance was evaluated at different times of the day, both on weekdays and weekends. Further, the best combination of hyper-parameters was proposed for the same. Unlike the previous paper, data was collected from original vehicle detector stations rather than the processed global flow. Experimental results showed that the Mean Absolute Error (MAE) and RMSE during the day are larger than at night, while the MRE during the day is smaller than at night.
Although these studies tackled the main disadvantage of SAEs and obtained promising results, there were other causes of inaccuracy. For example, the low dimensionality of traffic flow data contributed to networks getting stuck in poor local minima, due to the large numbers of saddle points with zero gradient scattered across the loss landscape (Dauphin et al., 2014). In addition, large fluctuations and uncertainties in traffic flow data over hours can lead to overfitting of the model.
Thus, to combat these issues, (Zhou et al., 2017) introduced a boosting scheme for the SAE network to improve traffic flow forecasting accuracy. The boosting scheme uses prediction error, a measurement of the network's generalization ability, to retrain the SAEs by rearranging the training data, similar to (Huang et al., 2014). The authors use an adaptive ensemble scheme based on δ-agree AdaBoost regression to boost the model and tackle uncertainties. Accuracy is also improved by constructing an ensemble of SAEs. Experiments used traffic flow data from four motorways, A1, A2, A4, and A8, ending on Amsterdam's ring road (the A10 motorway). The data covered May 20th, 2010 to June 24th, 2010, with one-minute aggregation collected by the MONICA sensor. Any zero values in the data due to incorrect measurements were filled by averaging the same moments of other weeks. The first 4 weeks were used for training, and the rest were used for testing. The training set was then divided further into 10 parts: the first nine were used for training the SAE, and the remaining one was used for cross-validation after every training epoch. Dropout regularization was used to prevent overfitting, and the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm (Liu & Nocedal, 1989), a quasi-Newton optimization method, was used for optimization. The model outperformed methods such as SVM Regression (SVR), Historical Average (HA), RW, autoregression, ANN, Kalman Filtering (KF), LSBoost, and the traditional SAE. Although the proposed model could accurately predict heavy traffic scenarios, accuracy dropped when traffic flow was low, such as early in the morning or late at night.
Building on the concept of using SAEs for traffic flow prediction, another novel method was proposed in 2017 by (H. F. Yang et al., 2017). The authors used a stacked auto-encoder Levenberg-Marquardt (SAE-LM) architecture. The model was designed using the Taguchi method to optimize its structure and to learn traffic flow features through layer-by-layer feature granulation with a greedy layer-wise supervised learning algorithm. The model was trained and tested on traffic flow data collected from the M6 freeway in the UK on working days between March 3rd, 2014, and March 28th, 2014, sensed by induction loops located at junctions J6, J7, J17, J18, J34, and J35. The dataset was divided into six road link data sets, where the first 15 days of each were used as the training set and the remainder as the test set. The LM algorithm provides a numerical solution to the nonlinear problem of traffic flow prediction. The model uses a sigmoid activation function and the KL divergence penalty as an extra term providing the sparsity constraint on the coding. LM is used as the last layer, as it is a supervised learning algorithm that can fine-tune all parameters of the deep architecture. The evaluation results showed that the SAE-LM model with an optimized structure achieved superior performance (about a 90% accuracy rate). This result was achieved with five hidden layers (four auto-encoders) and limited computational time. The model's main disadvantage is that it does not perform well if the observed traffic data has a highly smooth distribution; as traffic data is rarely smooth, the model works well for real-life applications. Further improvements could be made with an effective imputation method for missing data, allowing application to other datasets.
In 2018, (Jin et al., 2018) proposed a similar SAE model that instead used the Rectified Linear Unit (ReLU) as the activation function and dropout regularization to prevent over-fitting. Furthermore, dropout prevents units from co-adapting. Each SAE layer is pre-trained with the greedy unsupervised learning algorithm to optimize the layer's weights. The model is then fine-tuned using the BP algorithm. The authors trained the model using traffic flow data from the city ring way of Xi'an, Shaanxi, China, where data was collected every 5 minutes from the vehicle detector K27 + 000 from September to December 2017. The first three months were used as the training set, and the last month was used as the test set. The paper concluded that the SAE model performed better than SVM, DNNs, LSTM, and DBNs for 15- and 30-minute traffic flow prediction. Further refinement of the network could provide even more accurate results.
Recently, (Abdollahi et al., 2020) used the New York City Taxi and Limousine Commission trip records from January 2016 to June 2016 to predict traffic travel time, a concept closely related to traffic flow. The authors used SAEs with dropout regularization for feature extraction of traffic data, as they found that this prevented overfitting. They used data augmentation and feature engineering techniques such as geospatial feature analysis, Principal Component Analysis (PCA), and k-means clustering. After obtaining a better feature set, deep SAEs were used to represent features in lower dimensions to decrease the chance of overfitting and increase accuracy. Finally, a deep multi-layer perceptron was used to make the architecture more robust. The algorithm was able to capture traffic flow dynamics; however, rare events such as heavy snow decreased the prediction accuracy of the model. Although SAEs are found to work relatively accurately on their own, the trend seems to be towards hybrid models that exploit the advantages of LSTM and SAE. For instance, in recent times (Essien et al., 2021; Lin et al., 2019) used LSTM- and SAE-based architectures to effectively capture spatio-temporal relations in traffic data with a decreased risk of overfitting.

RNN
The failure of ANNs and CNNs in traffic flow prediction is mainly due to the spatio-temporal nature of traffic. Therefore, the RNN and its variants, which are illustrated in Figure 7, are widely used in traffic flow prediction. However, standard RNNs fail to solve problems that require long-term temporal dependencies due to the vanishing and exploding gradient problems. The Long Short-Term Memory Recurrent Neural Network (LSTM RNN) was introduced to solve this problem (Hochreiter & Schmidhuber, 1997). The authors in (Tian & Pan, 2015) used the LSTM RNN to solve the traffic flow prediction problem. The primary objectives of the LSTM RNN are to model long-term dependencies and determine the optimal time lags for time series problems. The dataset from Caltrans PeMS was used for building the model and comparing the LSTM RNN with several well-known models. The generalization capability of the LSTM RNN was validated by comparing it with other machine learning models, namely RW, SVM, a single-layer Feed Forward Neural Network (FFNN), and SAE. The study did not account for the spatial impact of neighboring observation stations or weather factors. Experimental results show that both MAPE and RMSE were lowest across different prediction intervals, which proved that the proposed model achieved high accuracy.
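For reference, a single LSTM step with its forget, input, and output gates can be sketched in NumPy as below (the generic textbook formulation with illustrative sizes, not the paper's model):

```python
import numpy as np

rng = np.random.default_rng(5)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step: forget, input and output gates control what the
    cell state drops, adds and exposes, easing long-term dependencies."""
    z = x @ W + h @ U + b                    # all four gate pre-activations
    f, i, o, g = np.split(z, 4, axis=-1)
    f, i, o, g = sigmoid(f), sigmoid(i), sigmoid(o), np.tanh(g)
    c_new = f * c + i * g                    # updated cell state
    h_new = o * np.tanh(c_new)               # new hidden state
    return h_new, c_new

n_in, n_hid = 3, 4
W = rng.normal(scale=0.1, size=(n_in, 4 * n_hid))
U = rng.normal(scale=0.1, size=(n_hid, 4 * n_hid))
b = np.zeros(4 * n_hid)
h = c = np.zeros(n_hid)
for x in rng.normal(size=(10, n_in)):        # run over a short sequence
    h, c = lstm_step(x, h, c, W, U, b)
```

The additive cell-state update (f·c + i·g) is what lets gradients survive over long sequences, unlike the repeated matrix multiplications of a plain RNN.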
However, exceedingly long-term dependencies remain difficult to solve, possibly because LSTM errors increase as the sequence length increases. An LSTM+ model was proposed that can sense both long short-term memory and remarkably long distances. The LSTM+ model consists of four layers, namely the mixed input layer, the attention mechanism layer, the hidden layer, and the output layer. The sequence before the time step t in consideration is given to the attention layer, which captures most features of the long sequence; it is then passed to a softmax layer. The output of this softmax layer is given as the input to the LSTM model. In this way, the model aspires to capture most of the features of the past long sequence. In the attention mechanism layer, the input sequence is x = (x1, x2, ..., xn), where it was experimentally shown that values near the t-th time interval of each cycle have a high impact on the prediction value. Thus, only the data near the predicted time of each cycle t is given as input to the attention layer, to reduce redundancy. The long-distance problem arises in the prediction of traffic flow for 5-minute and 1-minute time intervals. The LSTM+ solved this problem and fared better than other models such as the BPNN, SVM, and RBFNN.
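The attention-then-softmax step that feeds the LSTM can be illustrated as a weighted summary of past readings (a generic softmax attention sketch; the scores here are placeholders, since the actual scoring is learned by the model):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # shift for numerical stability
    return e / e.sum()

def attend(seq, scores):
    """Weight past observations by softmax-normalized attention scores,
    so the downstream LSTM input emphasizes the most relevant steps."""
    w = softmax(scores)
    return w, (w[:, None] * seq).sum(axis=0)

seq = np.array([[1.0], [2.0], [3.0]])        # three past flow readings
weights, context = attend(seq, np.array([0.1, 0.1, 5.0]))
```

Because the third score dominates, the context vector is pulled almost entirely toward the third reading.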
Similar to LSTM, the Gated Recurrent Unit (GRU) was introduced to solve the vanishing gradient problem (Chung et al., 2015). Unlike the LSTM, the GRU does not have a separate memory block, which makes it easier and more efficient to train, but it consists of two gates, the reset gate and the update gate, which decide what should go to the output. The reset gate decides whether past information should be retained or deleted, and the update gate decides how much past information to keep around. In recent years, authors have used GRUs to predict traffic (Fu et al., 2016). The PeMS dataset was used to train the model. In this experiment, both LSTM and GRU models were trained; both performed better than ARIMA models, and the GRU performed slightly better than the LSTM. On 84% of the total time series, GRU NNs performed better than LSTM NNs.

For many sequence processing tasks, it is useful to analyze the future as well as the past of a given point in the series. However, conventional recurrent neural networks are designed to analyze data in one direction only: the past. A better approach is provided by the bidirectional networks pioneered by Schuster and Baldi (Baldi et al., 1999; Schuster & Paliwal, 1997). A Deep Bidirectional LSTM (DBL) was introduced in (Essien et al., 2021) to solve the traffic flow problem. A total of 6 BiLSTM layers were used in this model. A deep hierarchy can lead to the vanishing gradient problem; therefore, residual connections were introduced among the DBL layers in a stack. The prediction architecture consists of four parts: the embedding layer, the DBL, the mean pooling layer, and the logistic regression layer. The data is mapped into a 64-dimensional vector space. After the DBL encodes the time-aware traffic flow information, a sequence h(0), h(1), ..., h(n-1) is produced. Then, the mean pooling layer extracts the mean values of the sequence over time intervals.
In addition, the mean pooling layer encodes the features into a vector h, which is fed into the logistic regression layer at the top of the prediction architecture. The model was trained on the PeMS dataset, and its results showed that the DBL obtains high prediction accuracy, with a MAPE of 4.83% and an RMSE of 46.01%. The model outperformed ARIMA, SVM, DBN, SAE, and LSTM.
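A minimal sketch of the tail of this architecture: mean pooling over the DBL's hidden-state sequence followed by a logistic regression layer. The sequence length and the random weights are illustrative assumptions; only the 64-dimensional state size comes from the description above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical encoder output: n hidden states h(0)..h(n-1), each 64-dim,
# matching the 64-dimensional embedding space described above.
n, d = 10, 64
h = np.random.rand(n, d)            # stand-in for the DBL encoder output

pooled = h.mean(axis=0)             # mean pooling over time -> vector h
w = 0.01 * np.random.rand(d)        # logistic regression weights (illustrative)
b = 0.1                             # bias (illustrative)
prediction = sigmoid(pooled @ w + b)
```

The pooling collapses the time axis, so the regression layer sees a single fixed-size vector regardless of sequence length.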

Hybrid networks
The field of traffic flow prediction has seen the inception of hybrid models, which aim to overcome the shortcomings of state-of-the-art models while capitalizing on their advantages.
A Traffic Flow Forecasting Network (TFFNet) was proposed (Selvaraju et al., 2020) to predict short-term traffic flow. The TFFNet is made of two components: one for spatiotemporal dependencies and the other for external influences. The traffic flow matrices are concatenated and fed into the first convolutional layer, whose output is fed into a series of residual units to obtain an output denoting spatiotemporal dependencies, XRes. A shortcut connection is added to the framework, and its output is concatenated with the output of the residual layers. A two-layer fully connected NN is used to embed external factors, with XExt denoting its output. XRes and XExt are fused to obtain the final output XFinal. TFFNet_16, which consists of 16 residual units, had an RMSE of 14.07; the low RMSE and relatively low training time make it the best version of the model. The concept of a Conv-LSTM model, a combination of convolutional neural networks and a Long Short-Term Memory network, has seen several renditions in attempts to capture the spatio-temporal features of traffic flow effectively. One such model (Y. Liu et al., 2017) uses a Conv-LSTM to extract spatial-temporal features and a bidirectional LSTM to capture periodicity in the trends. The Conv-LSTM network receives a 1D time series vector as input. The vector undergoes 1D convolution, followed by pooling. This feature vector is then fed to an LSTM network to identify temporal features. The input to the Bi-LSTM includes time series information both before and after the forecast time. The output of the Bi-LSTM is concatenated with that of the Conv-LSTM with the help of fully connected layers. This Conv-LSTM hybrid does not predict short-term fluctuations effectively.
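The two-branch design described above can be sketched in Keras roughly as follows. The filter counts, the 12-step input window, and the layer sizes are illustrative assumptions, not the values used in the cited paper.

```python
# Sketch of the Conv-LSTM + Bi-LSTM hybrid: one branch does 1D convolution,
# pooling, and an LSTM; the other is a bidirectional LSTM for periodicity.
from tensorflow.keras import layers, models

timesteps, features = 12, 1        # illustrative window length and feature count

conv_in = layers.Input(shape=(timesteps, features))
x = layers.Conv1D(32, kernel_size=3, activation="relu", padding="same")(conv_in)
x = layers.MaxPooling1D(pool_size=2)(x)
x = layers.LSTM(64)(x)             # temporal features from the conv output

bi_in = layers.Input(shape=(timesteps, features))
y = layers.Bidirectional(layers.LSTM(64))(bi_in)   # periodicity branch

merged = layers.concatenate([x, y])                # fuse the two branches
out = layers.Dense(1)(layers.Dense(32, activation="relu")(merged))
model = models.Model([conv_in, bi_in], out)
```

The two inputs let the periodicity branch receive series data surrounding the forecast time, separate from the near-term series fed to the convolutional branch.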
The authors of (Petersen et al., 2019) propose an intelligent model for bus travel time prediction that exploits spatio-temporal correlations in urban bus traffic. CNNs are deployed for detecting spatial patterns across the links. The model applies convolutional filters in the input-to-state transitions as well as the state-to-state transitions of the LSTM. Here, the use of CNNs significantly reduces the number of parameters in comparison to a fully connected network. The model uses the encoder-decoder LSTM architecture along with a fully connected layer at the end, allowing predictions of future time steps with complex patterns. The 1D convolution can be extended to 2D to predict across multiple bus routes. The model also uses RMSprop as an optimizer for efficient training. The model proves to be computationally heavy and does not effectively take into consideration the frequency of accidents, holidays, traffic signals at link intersections, or weather conditions. Several architectures also adopt fragmented approaches with CNNs and LSTMs, combining their properties to reach optimum results. The hybrid deep neural network proposed in (Z. Duan et al., 2018) is fragmented into two heterogeneous sub-deep neural networks. One sub-network uses typical convolutional layers without pooling, as no data should be lost in the process of accurately determining the spatial features. The other sub-network uses convolutional layers along with LSTMs to deduce the temporal features of traffic flow. Both sub-networks are merged and averaged in the output layer of the network. The epsilon-greedy policy is used in the process of learning. While implementing a residual network would be beneficial in reducing training time, it compromises prediction accuracy because the ResNet tends to skip layers, whereas the epsilon-greedy policy explores and garners enough substance from the available data to generate satisfactory results.
An attention model is introduced in (Wu et al., 2018) that uses a speed matrix S spanning space and time to learn an attention weight matrix A of the same size as S. The traffic flow matrix is multiplied element-wise with the attention matrix to form the weighted traffic flow matrix. The model uses convolutional layers for spatial feature mining and GRUs for temporal feature mining; the GRUs prove to be better than conventional LSTMs and are less expensive. The model also uses three 3-layer CNNs to extract near-term spatial features (CNN-n), daily periodicity (CNN-d), and weekly periodicity (CNN-w). The input data contains information regarding average speeds at specific time points during the previous day and the previous week, and includes spatial information. This is given directly to CNN-d and CNN-w, and through the attention model to CNN-n, forcing it to focus on near-term values. The outputs of these CNNs are combined and fed into a ReLU-activated regression layer.
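The element-wise weighting of the traffic matrix by the learned attention matrix reduces to a Hadamard product. In this sketch both matrices are random stand-ins for the learned quantities, and the dimensions are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# S: speed matrix over (road segments x time intervals); A: attention weights
# of the same size. Both are random stand-ins for the learned quantities.
segments, intervals = 5, 8
S = np.random.rand(segments, intervals)
A = softmax(np.random.rand(segments * intervals)).reshape(segments, intervals)

weighted_flow = S * A      # element-wise (Hadamard) weighting, same shape as S
```

Because the attention weights form a distribution over all cells, the product emphasizes the near-term cells with high weights and suppresses the rest before the matrix reaches CNN-n.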
A rather unconventional hybrid model (Yu & Liu, 2019) uses four sections: a local predictor, a global predictor, a prediction integrator, and an outlier predictor. This is shown in Figure 8. The local predictor considers the temporal factors by adapting the method of linear regression. The global predictor accounts for spatial factors by using an unsupervised Bayesian network to predict traffic volume using data from neighboring roads. Conditional probability equations are used on the given data to iteratively calculate the values of parameters such as time, road type, observed data, and points-of-interest features that determine traffic volume. A deep learning model is constructed with the help of these parameters by forming a stacked autoencoder using 3-layer sparse autoencoders along with a regression layer. The prediction integrator integrates the outputs of the local and global predictors using a regression tree. The outlier predictor takes into account the dynamic fluctuations in traffic flow via the measure inflection = Σ_{k=0}^{q−1} (f_{i,j−k} − f_{i,t−k})², where the vector f denotes the features of road segment i at the given time step. Taking account of minimum inflection allows the model to eliminate recurring trends in traffic. A regression tree algorithm is used to obtain abnormal paths, and the output of this section is added to that of the prediction integrator. The proposed model gives a prediction accuracy above 90% for over 88% of the predictions. In recent times, a graph convolutional neural network (Hu, 2021) has been developed to obtain the sensor node vector set, and recurrent models are used to forecast the result. Nodes are connected to their neighbor nodes to represent the spatial correlation of traffic flow, and different moments of the same node are taken as a time series. A gated recurrent unit and an attention mechanism process the time dynamics of the traffic flow.
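Read literally, the inflection measure sums squared differences between two length-q windows of a road segment's feature values. A small sketch under that reading, with illustrative indices and data:

```python
import numpy as np

def inflection(f, i, j, t, q):
    """Sum of squared differences between the last q feature values of road
    segment i ending at time steps j and t."""
    return sum((f[i, j - k] - f[i, t - k]) ** 2 for k in range(q))

f = np.array([[10.0, 12.0, 11.0, 15.0, 14.0, 13.0]])  # one road segment (illustrative)
val = inflection(f, i=0, j=5, t=2, q=3)
# (13-11)^2 + (14-12)^2 + (15-10)^2 = 4 + 4 + 25 = 33
```

A small inflection value means the two windows follow a similar shape, which is how the model identifies and eliminates recurring trends.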
The spatial and temporal dependencies of traffic flow are thus captured by the graph convolutional network and the attention mechanism. Another paper (Zhang & Jiao, 2020) uses the attention mechanism with deep convolutions for short-term traffic flow prediction. The inception network is the base layer, and each layer obtains additional input from all the previous layers in addition to its own features. These dense connections help mitigate the vanishing gradient problem. The output of the base unit of the last layer is plugged into the regression layer to get the final prediction. In order to reduce the complexity of the 3 × 3 and 5 × 5 convolution operations, 1 × 1 convolution operations are added and feature maps are obtained. Each feature map is plugged into a channel attention module to enhance the information of useful channels in the feature maps and suppress the information of redundant channels. The authors of this paper achieved better RMSE results compared to conventional models like DCNN, ResNet, DNN, and DenseNet.
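The channel attention module described above resembles a squeeze-and-excitation block: pool each channel to a scalar, pass the pooled vector through a small bottleneck, and rescale the channels. The dimensions and weights below are illustrative assumptions, not values from the cited paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(fmap, w1, w2):
    """Squeeze-and-excitation style channel attention: global-average-pool
    each channel, pass through a bottleneck, and rescale the channels."""
    squeeze = fmap.mean(axis=(0, 1))                     # (C,): one value per channel
    excite = sigmoid(np.maximum(squeeze @ w1, 0) @ w2)   # channel weights in (0, 1)
    return fmap * excite                                 # reweight channels

C, r = 8, 2                          # channels and bottleneck reduction (illustrative)
fmap = np.random.rand(16, 16, C)     # H x W x C feature map
w1 = np.random.rand(C, C // r)       # bottleneck reduce weights
w2 = np.random.rand(C // r, C)       # bottleneck expand weights
out = channel_attention(fmap, w1, w2)
```

Channels with low excitation weights are suppressed, which is the "redundant channel" filtering the paper describes.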
A hybrid CNN-LSTM model, shown in Figure 9, is used to predict traffic flow at different time points in the future (Mihaita et al., 2019). Temporal and spatial features are incorporated and modeled by connecting the output of the CNN to the input of each LSTM unit, and the final prediction is made using a fully connected layer. Data profiling and outlier identification were performed before the model was trained. The CNN-LSTM model manages to outperform traditional methods, but the individual LSTM model outperformed the hybrid CNN-LSTM, with the hybrid's performance fluctuating; this indicates that complex deep learning models do not necessarily improve the prediction accuracy for this motorway flow prediction study. A comparison of various traffic flow prediction models is shown in Table 1.

Kashyap et al., Cogent Engineering (2022)

Table 1 (excerpt):
• Data Grouping CNN (DGCNN) (Xia et al., 2016). Data: real vehicle passage records from the surveillance system of a major metropolitan area. Technique: a CBOW model that represented locations as vectors. Scope: spatio-relations, short term. Findings: high accuracy for short-term prediction when evaluated on RMSE and MRE, but the network did not achieve long-term prediction.
• 3D CNN (C. Chen et al., 2018). Data: NYC Bike dataset, BJTaxi dataset. Technique: preserves the temporal dependencies of the volumetric data, resulting in an output volume, and learns spatio-temporal correlation features jointly from low-level to high-level layers. Findings: sharing the same kernel across the space and time dimensions allowed the model to take full advantage of spatial and temporal dependencies; it did not rely on historical data to predict future values and overlooked spatial and external features. Outperformed the ST-ResNet and ConvLSTM models.

Simulation
Data was obtained from the Caltrans PeMS. It was collected in real time from individual detectors spanning the freeway system across all major metropolitan areas of the State of California.
The first five rows of the dataset are shown in Table 2.
The model was trained using code from (GitHub, 2021). Different optimizers were used to see how well the model fares under different hyperparameters. Both LSTM and GRU architectures were employed and compared in this study. The optimizers used were Stochastic Gradient Descent (SGD), Adam, and RMSprop, and training was done on an NVIDIA GeForce 1660 Ti GPU. This simulation was done to understand the deviation of the model's predictions from the actual data when trained with different optimizers.
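The setup amounts to compiling the same recurrent model several times with different optimizers. A hedged Keras sketch follows; the layer sizes and the 12-step input window are assumptions, and this is not the cited GitHub code.

```python
# Build the same LSTM regressor three times, once per optimizer string.
from tensorflow.keras import layers, models

def build_model(optimizer):
    model = models.Sequential([
        layers.Input(shape=(12, 1)),   # illustrative 12-step, 1-feature window
        layers.LSTM(64),
        layers.Dense(1),
    ])
    model.compile(optimizer=optimizer, loss="mse")
    return model

for opt in ("sgd", "adam", "rmsprop"):  # same architecture, three optimizers
    model = build_model(opt)
```

Keeping the architecture fixed isolates the optimizer as the only varying factor, which is what the comparison in the following subsections requires.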

RMSprop
To deal with the vanishing and exploding gradient problems, RMSprop was developed as a stochastic technique for mini-batch learning. RMSprop uses a moving average of squared gradients to normalize the gradient. This normalization balances the step size, decreasing the step for large gradients to avoid exploding and increasing the step for small gradients to avoid vanishing. Figure 10 shows the comparison of traffic flow prediction for a duration of 20 minutes. Both LSTM and GRU were able to predict the traffic efficiently.
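The moving-average normalization can be written out in a few lines. This is a generic RMSprop step with illustrative values, not the framework implementation.

```python
import numpy as np

def rmsprop_step(w, grad, cache, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSprop update: keep a moving average of squared gradients and
    divide the step by its root, normalizing large and small gradients."""
    cache = decay * cache + (1 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

w, cache = np.array([1.0, -2.0]), np.zeros(2)
grad = np.array([0.5, -0.1])
w, cache = rmsprop_step(w, grad, cache)
```

Dividing by the root of the running average makes the effective step size roughly independent of the raw gradient magnitude, which is the balancing behavior described above.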

Adam
Adam adapts the parameter learning rates in real time based on estimates of the first and second moments of the gradient. It calculates an exponential moving average of the gradient as well as the squared gradient, and the parameters beta1 and beta2 control the decay rates of these moving averages. Finally, bias-corrected estimates are computed before updating the parameters. Tests were conducted to study the deviation of LSTM and GRU from the true data using the Adam optimizer.

Figure 10. Traffic flow prediction with RMSprop optimizer for 20 data points.

Figure 11 shows the comparison of traffic flow prediction using the Adam optimizer for a short duration of 20 minutes.
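The moment estimates and bias correction described above can be sketched as a single generic Adam step; the parameter values and gradients are illustrative, and this is not the framework implementation.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moving averages of the gradient (m) and squared
    gradient (v), with bias correction before the parameter update."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)       # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)       # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, -2.0])
m, v = np.zeros(2), np.zeros(2)
w, m, v = adam_step(w, np.array([0.5, -0.1]), m, v, t=1)
```

The bias correction matters most early in training (small t), when the zero-initialized moving averages would otherwise underestimate both moments.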

SGD
Batch gradient descent performs redundant computations for large datasets, as it re-computes gradients for similar examples before each parameter update. SGD does away with this redundancy by performing one update at a time. It is therefore usually much faster and can also be used to learn online. Figure 12 shows the comparison of traffic flow prediction for a duration of 20 minutes. In our tests, we observed a higher deviation of the predicted data from the true data for the SGD optimizer over a short duration.
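For contrast with the two adaptive methods above, a plain SGD step is a single unscaled move along the gradient; the values below are illustrative.

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    """Plain SGD: one update per example, w <- w - lr * grad."""
    return w - lr * grad

w = np.array([1.0, -2.0])
w = sgd_step(w, np.array([0.5, -0.1]))
# 1.0 - 0.01*0.5 = 0.995;  -2.0 - 0.01*(-0.1) = -1.999
```

With no per-parameter scaling, the step size depends directly on the raw gradient, which is one plausible reason for the larger short-duration deviations reported above.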

Comparison discussion of optimizers
RMSProp was developed to overcome the shortcomings of earlier optimizers like AdaGrad. In AdaGrad, the learning rate decays very aggressively; as a result, after a while, the frequently updated parameters receive very small updates because of the decayed learning rate. To avoid this, RMSProp decays the denominator and prevents its rapid growth. The difference between the two is that AdaGrad gets stuck when it is close to convergence and can no longer move in the vertical direction, whereas RMSProp avoids this by being less aggressive with the decay. Adam works similarly to RMSProp with respect to how the decay takes place; in addition, its update rule also looks at the cumulative history of gradients. Here, the rate of gradient descent is controlled so that there is minimal oscillation near the global minimum while taking large enough steps to pass local minima along the way. In the tests conducted above, both the RMSProp and Adam optimizers were able to predict the traffic flow efficiently. However, we observed higher traffic flow prediction errors using the SGD optimizer compared to RMSProp and Adam.
A comparison of the results of Adam, RMSprop, and SGD optimizers on the different models is provided in Table 3, and a comparative analysis of various architectures used in traffic flow prediction is provided in Table 4.

Conclusion
The movement of goods and humans is an integral part of existence. With the increase in population and the necessity of social wellbeing, travel is growing exponentially. As technology evolves day by day, so does the number of vehicles. With this rapid rate of increase in vehicles, managing their movement is critical. Vehicular management helps in optimizing travel time and the cost of travel. Developing a precise vehicular management system requires accurate background information, and traffic flow is among the most important data needed for it. This paper presents a review of recent deep learning approaches in the field of traffic flow prediction. Most of the contributions are application based, while very few articles make a strong contribution to theory. Deep learning models for traffic forecasting have shown promising results in representing the non-linearity of traffic flow. While there are several advantages to using individual deep learning models to predict traffic flow, there are significant disadvantages as well. Thus, researchers have recently started to move from standalone deep learning architectures to hybrid and unsupervised methods. This review addressed the various existing deep learning architectures used for traffic flow prediction and the rising popularity of hybrid methods.

SAE
• Reduces the dimensionality of the data being used.
• Great for feature extraction.
• An autoencoder learns to capture as much information as possible rather than as much relevant information as possible.
• High time complexity because of multiple forward passes.

HYBRID
• Best at handling spatio-temporal correlations because it combines the advantages of both CNN and LSTM.
• Although hybrid networks combining the advantages of both CNN and LSTM seem to be the most effective approach to traffic flow prediction, in some cases, as shown in our paper, hybrid networks have failed to outperform LSTM networks.