A comparative analysis of LSTM and ARIMA for enhanced real-time air pollutant levels forecasting using sensor fusion with ground station data

ABSTRACT Air is one of the most basic needs for human survival. Unfortunately, it is being polluted by natural factors such as volcanic eruptions and forest fires and by man-induced factors such as transportation emissions. Unpolluted air is now an ideal that can never be achieved, so pollution levels should be monitored continuously. However, monitoring alone will not fix the environment. Forecasting pollution levels can make society more aware of the environment and help prepare safety measures. This research paper aims to forecast air pollutant levels by comparing two machine learning models, Long Short-Term Memory (LSTM) and Auto-Regressive Integrated Moving Average (ARIMA), applied to the ground station data from the Central Pollution Control Board (CPCB), the data collected from a low-cost IoT hardware setup, and the fused data. The main novelty of this research is to show that sensor fusion increases the accuracy of the dataset. The outputs obtained with the LSTM and ARIMA models on the fused data are more accurate than those obtained on the ground station data and on the IoT sensor data.


PUBLIC INTEREST STATEMENT
Air is one of the most basic needs for human survival. Unfortunately, it is being polluted by natural factors such as volcanic eruptions and forest fires and by man-induced factors such as transportation emissions and mining operations. All of these factors have deprived us and future generations of truly pure air. Unpolluted air is now an ideal that can never be achieved. This research paper aims to forecast air pollutant levels by comparing two machine learning models, Long Short-Term Memory (LSTM) and Auto-Regressive Integrated Moving Average (ARIMA), applied to fused data obtained from the ground station data of the CPCB (Central Pollution Control Board) and the data collected from a low-cost hardware setup of sensors. Furthermore, the concept of sensor fusion, which is the main technical innovation and novelty of the paper, is discussed. The LSTM and ARIMA models are used to forecast future air pollutant levels.

Introduction
Air is one of the most important elements for all life on earth. The quality of the air we breathe determines the quality of the life we live and how long we live it. Several cases of ischemic heart disease, stroke, chronic obstructive pulmonary disease (COPD, a group of progressive lung diseases), and acute lower respiratory infections have been reported in adults and children who breathe polluted air. Apart from its effect on human beings, polluted air has also been shown to contribute to global warming (Sankar Ganesh et al., 2017). Air pollution is a major contributor to some of the deadliest diseases, which are being studied every day. Continuous exposure to air pollutants affects human health and leads to morbidity and mortality (World Health Organization, 2016). All these reasons make the monitoring of air pollution a crucial task.
The main motivation behind this research work comes from the fact that air pollution can be easily accounted for as one of the biggest disasters in the 21st century. It has caused great harm to not just human beings but also the entire flora and fauna. Moreover, air pollution has always been a side product of the greatest inventions of human beings. Hence it is also our responsibility to rid the earth of this disaster. Air pollution has been on the radar of several stations for a very long time. Hence, a lot of data have been generated and stored, but most of these data come from stations that have been poorly maintained and hence are not reliable. With all of these data at our disposal, it is our objective to pick the best working machine-learning algorithm to help us forecast air pollution levels. However, this forecast is of no use if the data that we base our research on is not reliable, so the main novelty of this research work is that we introduced sensor fusion of the ground station data and the IoT data from the sensors to get more accurate data to which we can apply machine learning models.
In the literature, there has been significant progress in forecasting air pollutant levels using machine learning algorithms. The researchers in (D.  proposed a hybrid CEEMD-VMD-DE-ELM model, which relies on a two-stage decomposition method and an ELM model optimized by the DE algorithm, to improve the forecasting accuracy of daily Air Quality Index (AQI) series. To improve modeling accuracy and reliability, a Hampel filter has been used in (Li et al., 2019) to reduce outliers, and a feature selection approach based on CEEMD and BCCSA was discussed to determine the best input variable structure. A modified LSSVM based on the MOMVO algorithm has also been discussed for obtaining high accuracy and strong stability. The authors in (Peng et al., 2017) showed the potential of nonlinear machine learning methods to improve air quality forecasts. An innovative hybrid-Garch model is proposed in (P.  to address the issue of heteroscedasticity. An ARIMA model is compared with an ARIMA-XGBoost hybrid model in (Li & Zhang, 2018), which concludes that the hybrid model forecasts better because of its higher accuracy. Moreover, a VMD-SE-LSTM model is proposed in (Wu & Lin, 2019), showing a superior capacity for AQI forecasting by capturing the characteristics of the original AQI series comprehensively. The ARIMA model is used in (Kulkarni et al., 2018) to forecast air pollution levels in Nanded city, Maharashtra, India. The authors in (Ha et al., 2020) proposed a fractional-order Kalman filter to predict indoor air pollutants, taking into account their nonlinear and stochastic nature and thereby improving accuracy.
This paper discusses the implementation of LSTM and ARIMA models and compares both these techniques to forecast the air pollutant level. Both the techniques are applied on the ground station dataset, the IoT dataset from sensors, and the fused dataset obtained after fusing the IoT and ground station data. The main contribution of this research is to show that the concept of sensor fusion increases the accuracy of the dataset. The outputs obtained after implementing LSTM and ARIMA models show more accurate results when compared with ground station data and the IoT data from sensors.
The research manuscript is divided into five sections. The first section walks us through the literature study and discusses the importance of the forecasting of air pollutant levels. The second section describes the methodology that was followed in this work. The major topics that were dealt with are time series analysis and their background information, performance indices used in time series analysis, a hardware description of the circuit that was made for collecting data, and background information of the sensor fusion. In the fourth section, experimental results have been shown in the form of actual versus predicted graphs and also by computing the RMSE. The results of sensor fusion and its effect on the forecasted values have also been discussed in this section. Finally, the paper is concluded indicating the main focus of work, results, and also the future work that can be done to make this research even more productive.

Proposed methodology
This section explains the proposed method as shown in Figure 1. In this work, ARIMA and LSTM models have been implemented on the data collected from the ground station, sensor output, and the fused data. Furthermore, the sensor fusion technique has been claimed to be better for forecasting.
The whole methodology of our research paper can be divided into four parts:

Time series analysis
A time series can be explained as a sequence of observations, which have been taken sequentially in time. We can see countless batches of data that appear as time series like daily rainfall amounts, heights of ocean tides, monitoring of a person's heart rate, etc. Therefore, it can be seen as a method of forecasting to predict the future depending on previously obtained information. The two most popularly used techniques for the purpose of time series analysis are LSTM and ARIMA. The original datasets have been divided into 80% (training) dataset and 20% (testing) dataset. Also, the number of epochs has been set to 150.
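The 80%/20% division described above can be illustrated with a minimal sketch. It assumes the split is chronological (no shuffling), which is the usual choice for time series but is not stated explicitly in the text. The function name `chronological_split` and the sample readings are hypothetical.

```python
def chronological_split(series, train_fraction=0.8):
    """Split a time-ordered sequence into train and test parts without shuffling."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    cut = int(len(series) * train_fraction)
    # Earlier observations train the model; the most recent ones are held out.
    return series[:cut], series[cut:]


# Hypothetical pollutant readings in time order (values are made up).
readings = [0.41, 0.44, 0.39, 0.47, 0.52, 0.50, 0.48, 0.55, 0.58, 0.61]
train, test = chronological_split(readings)
print(len(train), len(test))
```

Keeping the test samples strictly after the training samples avoids leaking future information into the model during training.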

Auto-Regressive Integrated Moving Average (ARIMA)
The ARIMA model was introduced by Box and Jenkins in 1970. It is also referred to as the Box-Jenkins methodology, which consists of a set of activities for identifying, estimating, and diagnosing ARIMA models with time series data (Ariyo et al., 2014). It combines three models: the autoregressive, integrated, and moving average models. The ARIMA model can be applied to non-stationary data, that is, data whose means, variances, and covariances change over time (W. Wang et al., 2015), because differencing is used to make the series stationary first.
ARIMA models are represented by three parameters (p, d, q):
p: the number of time lags in the autoregressive model
d: the degree of differencing
q: the order of the moving average model
The autoregressive and moving average parts assume a stationary series, so it is important to convert a non-stationary dataset into a stationary one. The ARIMA model is widely used for forecasting and is known for its ability to produce short-term forecasts. In this model, the future value of a variable is a linear combination of past values and past errors (Ariyo et al., 2014).
It can be represented mathematically as shown in equation (1):
$$ y_t = \phi_0 + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \dots - \theta_q \varepsilon_{t-q} \quad (1) $$
where $y_t$ is the value of the (differenced) series at time t, $\varepsilon_t$ is the random error at time t, and $\phi_i$ and $\theta_j$ are the autoregressive and moving average coefficients.
If any trend is identified in the data, it is removed using differencing and power transformation. The important steps in building ARIMA models are model identification, parameter estimation, and model utilization. The development of the ARIMA model is shown as a flowchart in Figure 2.
The time series analysis of the air pollutant data is done with both the ARIMA and the LSTM models. The ARIMA model is made of three components: the autoregressive model (p), the differencing model (d), and the moving average model (q). The orders of the autoregressive and moving average models are determined from the Partial Auto-Correlation Function (PACF) and the Auto-Correlation Function (ACF), respectively. The number of times the data are differenced determines the order of differencing. The Dickey-Fuller test determines whether the data are stationary and therefore how many times they have to be differenced. Here, the Dickey-Fuller test confirms that the data are non-stationary, so the data are differenced twice to make them stationary, and the coefficients are given in Table 1. The ARIMA model is built on the training data using the coefficients from the table. The results are then validated against the test dataset, which has been kept completely separate.
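The differencing step (the "I" in ARIMA) can be sketched in a few lines. The sketch below only illustrates differencing twice (d = 2, as stated above) and mapping forecasts of the differenced series back to the original scale. It does not estimate the autoregressive or moving average coefficients and is not the authors' implementation. The function names and the example series are hypothetical.

```python
def difference(series, d=1):
    """Apply first-order differencing d times."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def integrate_forecasts(history, diff_forecasts, d=2):
    """Map forecasts of the d-times-differenced series back to the original scale."""
    if d < 1:
        raise ValueError("d must be at least 1")
    if len(history) <= d:
        raise ValueError("history must contain more than d points")
    # last[k] holds the most recent value of the k-times-differenced history.
    last = []
    current = list(history)
    for _ in range(d):
        last.append(current[-1])
        current = [b - a for a, b in zip(current, current[1:])]
    restored = []
    for step in diff_forecasts:
        # Add the forecast to the highest-order difference, then cumulate downwards.
        last[d - 1] += step
        for k in range(d - 2, -1, -1):
            last[k] += last[k + 1]
        restored.append(last[0])
    return restored


# Made-up series with a quadratic trend: its second difference is constant.
history = [1.0, 4.0, 9.0, 16.0, 25.0]
print(difference(history, d=2))                      # [2.0, 2.0, 2.0]
print(integrate_forecasts(history, [2.0, 2.0], d=2))  # [36.0, 49.0]
```

In a full ARIMA workflow, the values passed as `diff_forecasts` would come from the fitted ARMA(p, q) part of the model.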

Deep Long short-term memory (LSTM)
In a time-series dataset, dependencies among the input variables are observed. The recurrent neural network (RNN) is known to handle such dependencies reliably. LSTM is a special kind of RNN with additional features for learning sequences of data. Each LSTM consists of a set of cells (modules) in which the information streams are stored. The cell state acts like a transport line running from one module to the next: it relays information from the past and gathers it for the present. Gates in each cell allow data to be filtered, discarded, or added. These gates are based on sigmoidal neural network layers, which let the cell either pass the data on or discard it (Siami-Namini et al., 2018). Three types of gates are used in each LSTM to control the state of each cell. The forget gate outputs a number between 0 (i.e. discard the information) and 1 (i.e. keep it).
The memory (input) gate picks the new information that is kept in the cell. A sigmoid layer selects the values to be modified, followed by a 'tanh' layer that creates a vector of new candidate values that could be added to the state. The output gate controls the output of each cell. The output value depends on the cell state and on the filtered and newly added data. The parameters of the LSTM can be seen in Table 2.
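An LSTM cell consists of the following components. In the original LSTM model, let the input at time step t be $X_t$ (Siami-Namini et al., 2018). To calculate the hidden state $S_t$ from the previous state $S_{t-1}$ (Sagheer & Kotb, 2019), the first step is to decide which information is discarded from the cell state. This is decided by the forget gate $f_t$:
$$ f_t = \sigma(X_t U^f + S_{t-1} W^f + b_f) $$
Next, it has to be decided which information is kept in the cell state. This step has two parts. First, the input gate layer $I_t$ decides which values should be updated. Second, the tanh layer creates a vector of new candidate values $\tilde{C}_t$. They can be represented as
$$ I_t = \sigma(X_t U^i + S_{t-1} W^i + b_i) $$
$$ \tilde{C}_t = \tanh(X_t U^c + S_{t-1} W^c + b_c) $$
Then, the old cell state $C_{t-1}$ is updated to the new cell state $C_t$. It can be represented as
$$ C_t = f_t \otimes C_{t-1} + I_t \otimes \tilde{C}_t $$
The output depends on the cell state, and the output gate $o_t$ decides which part of the cell state is produced as the output. The cell state is passed through the 'tanh' layer and then multiplied by the output gate:
$$ o_t = \sigma(X_t U^o + S_{t-1} W^o + b_o) $$
$$ S_t = o_t \otimes \tanh(C_t) $$
Here $\sigma$ is the sigmoid function, $\otimes$ denotes element-wise multiplication, $U$ are the input weights, $W$ are the recurrent weights, and $b$ are the biases.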
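The gate computations described above can be sketched as a single forward step of an LSTM cell. For readability the sketch assumes a single-unit cell with scalar input and state, so every weight is one number. The parameter values are hypothetical and are not the trained values of this work, and the function name `lstm_step` is an illustration only.

```python
import math


def sigmoid(z):
    """Logistic function used by the forget, input, and output gates."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, s_prev, c_prev, params):
    """One forward step of a single-unit LSTM cell (scalar input and state)."""
    def layer(name, activation):
        u, w, b = params[name]  # input weight, recurrent weight, bias
        return activation(u * x_t + w * s_prev + b)

    f_t = layer("forget", sigmoid)          # how much of the old cell state to keep
    i_t = layer("input", sigmoid)           # how much of the candidate to add
    c_hat = layer("candidate", math.tanh)   # new candidate value
    o_t = layer("output", sigmoid)          # how much of the cell state to expose
    c_t = f_t * c_prev + i_t * c_hat        # updated cell state
    s_t = o_t * math.tanh(c_t)              # updated hidden state
    return s_t, c_t


# Hypothetical parameters (input weight, recurrent weight, bias) for each layer.
params = {
    "forget": (0.5, 0.1, 0.0),
    "input": (0.6, 0.2, 0.0),
    "candidate": (0.9, 0.3, 0.0),
    "output": (0.4, 0.1, 0.0),
}
state, cell = 0.0, 0.0
for x in [0.2, 0.4, 0.6]:  # made-up input sequence
    state, cell = lstm_step(x, state, cell, params)
print(state, cell)
```

In practice the inputs and states are vectors and the weights are matrices, but the order of operations is the same.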
The LSTM has three groups of parameters: 1. Weights that are inputs:

Performance indices
To compare the accomplishments of LSTM and ARIMA models, certain measures of performance indices are required. Root Mean Square Error (RMSE) comes in the form of statistical criteria and can be used to evaluate and compare the efficiency of both models.

Root mean square error (RMSE)
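The performance of a model is considered high when the RMSE is low. The RMSE indicates how close the predicted values are to the actual values. It can be mathematically expressed as shown in equation (7):
$$ \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2} \quad (7) $$
where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and N is the number of samples.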

Mean absolute error (MAE)
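MAE is the mean of the absolute values of the errors. It can be computed using equation (8):
$$ \mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right| \quad (8) $$
where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and N is the number of samples.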
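Both performance indices can be computed directly from paired actual and predicted values. The following is a minimal sketch using the standard definitions of RMSE and MAE. The sample values are made up for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up actual and forecasted values.
actual = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 5.0]
print(rmse(actual, predicted), mae(actual, predicted))
```

Because RMSE squares the errors before averaging, a single large error raises it more than it raises MAE, which is why the two indices are often reported together.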

Hardware specifications
Three types of datasets are used to compare the LSTM and ARIMA models. Among them, the IoT data are obtained from a hardware setup consisting of an Arduino UNO, an MQ-131 ozone sensor module, an MQ-135 air quality sensor module, and an MQ-7 carbon monoxide sensor module. This section discusses the hardware specifications of each module and the sensitivity adjustment/calibration of each sensor. Figure 3 shows the IoT node architecture used for collecting the real-time sensor outputs.

MQ-7 Carbon monoxide sensor module
This sensor in Figure 4 is used for the detection of the concentration of carbon monoxide (CO). The standard measuring circuit of the sensor consists of a heating circuit, which has a control function, and a signal output circuit that responds to changes in the surface resistance of the sensor. The sensor also has a potentiometer to increase and decrease its sensitivity. (https://www.sparkfun.com/datasheets/Sensors/Biometric/MQ-7.pdf)
Figure 11. Sensitivity characteristic of MQ131.

MQ-131 Ozone sensor module
The MQ-131 in Figure 5 is used to detect ozone concentration. It is a semiconductor sensor. It has four pins, which serve as voltage, ground, and output pins. For proper operation, it has a built-in heater and a potentiometer to adjust the sensitivity of the sensor.

MQ-135 air quality sensor module
This sensor in Figure 6 detects the concentrations of several gases at the same time: NOx, NH3, alcohol, benzene, CO2, and smoke can be detected. It has four pins for outputs, voltage, and ground. The input voltage is 5 V.

Arduino uno
The Arduino UNO shown in Figure 7 is based on the ATmega328P microcontroller. The board provides analog, digital, and Pulse Width Modulation (PWM) pins: six analog input pins and 14 digital pins, six of which can be used for PWM output. The Arduino is programmed with the Arduino Integrated Development Environment (IDE). It is primarily powered through a USB cable.
The circuit board with the connections as shown in Figure 8 collects data from sensors.

Sensor fusion
Sensor fusion is used to obtain more accurate values by fusing the ground station data and the IoT data from the sensors. For sensor fusion, the Kalman filter algorithm is used. In its simplest form, the Kalman filter is a linear weighted average of two sensor values. A basic block diagram of sensor fusion is shown in Figure 9. Given a time series of data points $x_1, x_2, x_3, \dots, x_n$, a forecaster computes an estimate of the next point $x_{n+1}$. A smoother looks back at the data and calculates the best estimate of $x_i$ by considering the points before and after $x_i$. A filter corrects the estimate of $x_{n+1}$ by considering $x_1, x_2, x_3, \dots, x_n$ together with an inexact measurement of $x_{n+1}$.
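For example, consider a time series $x_1, x_2, x_3, \dots, x_n$. The running average of the time series is given by equation (9):
$$ \bar{x}_n = \frac{1}{n}\sum_{i=1}^{n} x_i \quad (9) $$
When a new data point $x_{n+1}$ is taken, the average could either be computed again from scratch, or the old value could be reused with a small correction based on $x_{n+1}$. This correction can be represented by:
$$ \bar{x}_{n+1} = \frac{n}{n+1}\,\bar{x}_n + \frac{1}{n+1}\,x_{n+1} $$
The gain factor is $K = \frac{1}{n+1}$, so the estimate of the average is
$$ \bar{x}_{n+1} = \bar{x}_n + K\,(x_{n+1} - \bar{x}_n) $$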

Figure 15. LSTM and ARIMA implementation on a training dataset of Fused CO.
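An estimate of the new standard deviation $\sigma_{n+1}$ is given by
$$ \sigma_{n+1}^2 = (1-K)\left(\sigma_n^2 + K\,(x_{n+1} - \bar{x}_n)^2\right) $$
where K is the gain factor and $\bar{x}_n$ is the running average of the first n points. Now consider a situation in which two instruments measure the same quantity x. The reading from the first instrument is $x_1$ and that from the second instrument is $x_2$. It is known that the first instrument has an error that is modeled by a zero-mean Gaussian with a standard deviation $\sigma_1$ of one. The second instrument has an error that is normally distributed around zero with a standard deviation $\sigma_2$ of two. The objective is to combine both readings into a single estimate. If both instruments are considered equally good, the average of the two readings is used. If one instrument is considered far superior to the other, the reading of the superior instrument is kept.
The goal is to obtain a weighted average of the readings as an estimate of x, which is denoted by $\hat{x}$:
$$ \hat{x} = \frac{x_1/\sigma_1^2 + x_2/\sigma_2^2}{1/\sigma_1^2 + 1/\sigma_2^2} $$
It can be simplified to
$$ \hat{x} = \frac{\sigma_2^2\,x_1 + \sigma_1^2\,x_2}{\sigma_1^2 + \sigma_2^2} $$
It can be expressed as
$$ \hat{x} = x_1 + K\,(x_2 - x_1) $$
where the gain is
$$ K = \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2} $$
Since the error curve is Gaussian, the probability of x being the right measure is given by
$$ p_1(x) = \frac{1}{\sqrt{2\pi}\,\sigma_1}\, e^{-\frac{(x - x_1)^2}{2\sigma_1^2}} $$
for the first instrument and
$$ p_2(x) = \frac{1}{\sqrt{2\pi}\,\sigma_2}\, e^{-\frac{(x - x_2)^2}{2\sigma_2^2}} $$
for the second instrument. Both measurements are independent of each other, so they can be merged by multiplying their probability distributions and normalizing:
$$ p(x) = p_1(x)\,p_2(x) = C\, e^{-\frac{(x - x_1)^2}{2\sigma_1^2} - \frac{(x - x_2)^2}{2\sigma_2^2}} $$
where C is the constant obtained after multiplying the probability distributions.
The center of the distribution can be expressed as
$$ \hat{x} = \frac{\sigma_2^2\,x_1 + \sigma_1^2\,x_2}{\sigma_1^2 + \sigma_2^2} $$
The variance can be expressed as
$$ \hat{\sigma}^2 = \frac{\sigma_1^2\,\sigma_2^2}{\sigma_1^2 + \sigma_2^2} $$
If the gain factor K is included, the best estimate can be written as
$$ \hat{x} = x_1 + K\,(x_2 - x_1) $$
Hence, the general form of the Kalman filter can be seen as this update of the estimate together with the update of the variance:
$$ \hat{\sigma}^2 = (1 - K)\,\sigma_1^2 \quad (25) $$
Figure 9 gives a holistic view of the overall system.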
The sensor fusion algorithm has been extensively explained and derived in this section. The tuning parameters of Kalman filter are given in Table 3.
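The two building blocks of this derivation, the recursive update of a running average and the variance-weighted fusion of two readings, can be sketched as follows. The sketch follows the standard one-dimensional Kalman update and uses the standard deviations of one and two from the example above. The readings and function names are hypothetical, and the tuning values of Table 3 are not used here.

```python
def update_mean(mean_n, n, x_new):
    """Correct the running average of n points with one new point."""
    gain = 1.0 / (n + 1)
    return mean_n + gain * (x_new - mean_n)


def fuse(x1, var1, x2, var2):
    """Fuse two independent Gaussian readings of the same quantity.

    Returns the fused estimate and its variance.
    """
    gain = var1 / (var1 + var2)
    estimate = x1 + gain * (x2 - x1)
    variance = (1.0 - gain) * var1
    return estimate, variance


# Running average: three points with mean 2.0, then a new point 6.0.
print(update_mean(2.0, 3, 6.0))

# Two made-up readings: instrument 1 has sigma = 1, instrument 2 has sigma = 2.
estimate, variance = fuse(10.0, 1.0 ** 2, 14.0, 2.0 ** 2)
print(estimate, variance)
```

The fused estimate lies closer to the reading of the more precise instrument, and the fused variance is smaller than either input variance, which is the motivation for fusing the two data sources.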

Sensor calibration
Sensors have to be calibrated before they can be used to collect data from the environment. Calibration is necessary because it improves the accuracy of the sensor by reducing errors in its output. Steps for calibration:
(1) Note down the load resistance value and check the sensitivity characteristic of the sensor in the datasheet provided by the manufacturer. The details are given in Appendix -A.
(2) The datasheet graph plots the ratio Rs/R0 against the gas concentration in Parts Per Million (ppm or PPM). The plot shows how the sensitivity of the sensor changes for each gas it is made to detect.
(3) Take two points from the sensor graph and use them to form a line that approximates the original curve.
(4) The sensor resistance in clean air (R0) is found, and the change in the sensor resistance Rs in fresh air is also calculated.
(5) The idea behind the calibration and reading script is to construct this line and use it to calculate the amount of gas. Two points are required to calculate the slope of the line.
(6) Once the R0 value is found, the value of Rs can also be calculated. The desired ratio Rs/R0 is then calculated with the help of Arduino code.
(7) The ratio Rs/R0 and the PPM are then plotted together with a linear regression to show the relation between them.
(8) The sensitivity characteristics for each sensor are plotted individually and shown in Figures 10, 11, and 12.
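The calibration steps above can be sketched in Python for illustration (the text states that the actual computation runs as Arduino code, which is not reproduced here). The sketch assumes the usual MQ-series wiring, in which the sensor is in series with the load resistor and the output voltage is read across the load by a 10-bit ADC. It also assumes that the sensitivity curve is approximately a straight line on log-log axes. The two curve points, the load resistance, and the R0 value are hypothetical and are not taken from any datasheet.

```python
import math


def sensor_resistance(adc_value, load_resistance, v_supply=5.0, adc_max=1023):
    """Sensor resistance Rs from a raw ADC reading taken across the load resistor."""
    if not 0 < adc_value <= adc_max:
        raise ValueError("adc_value must be in the range 1..adc_max")
    v_out = adc_value * v_supply / adc_max
    return load_resistance * (v_supply - v_out) / v_out


def fit_log_line(point_a, point_b):
    """Slope and intercept of log10(Rs/R0) versus log10(ppm) through two points."""
    (ppm_a, ratio_a), (ppm_b, ratio_b) = point_a, point_b
    slope = (math.log10(ratio_b) - math.log10(ratio_a)) / (
        math.log10(ppm_b) - math.log10(ppm_a)
    )
    intercept = math.log10(ratio_a) - slope * math.log10(ppm_a)
    return slope, intercept


def ppm_from_ratio(ratio, slope, intercept):
    """Invert the fitted line to obtain the gas concentration in ppm."""
    return 10 ** ((math.log10(ratio) - intercept) / slope)


# Hypothetical (ppm, Rs/R0) points read off a sensitivity curve.
slope, intercept = fit_log_line((10.0, 2.0), (1000.0, 0.2))

r0 = 10000.0  # hypothetical clean-air resistance in ohms
rs = sensor_resistance(adc_value=341, load_resistance=10000.0)
print(ppm_from_ratio(rs / r0, slope, intercept))
```

A real calibration would replace the two points with values read from the manufacturer's curve for the target gas and would measure R0 in clean air after the sensor has warmed up.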
From the characteristic curves in Figures 10, 11, and 12, it can be seen that the sensors have been calibrated, as the sensitivity characteristics match the datasheet sensitivity characteristics given in Appendix -B. Hence, the calibration of the sensors is confirmed. The comparison of the IoT node measurements with the air pollutant data from the Central Pollution Control Board (CPCB) (https://www.sparkfun.com/datasheets/Sensors/Biometric/MQ-7.pdf), Authority of Vijayawada, is presented in Appendix -C. The prototype cost analysis is also given in Appendix -C.

Data pre-processing
Each of the air pollutants has its own sub-indices, which are set by the CPCB, India. Hence, the data have to be processed before the machine learning techniques are applied. In some situations, air pollutant readings are missing or zero. This is handled by removing those values from the dataset. The CPCB data have been collected from the ground station of Vijayawada. The test data have 15 samples, while the training data have 520 samples. The samples have been taken 15 minutes apart. For forecasting, 15 steps into the future have been predicted.
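The removal of missing or zero readings can be sketched as a simple filter. The sketch assumes that each sample is a (timestamp, value) pair and that missing values appear as `None` or NaN. The actual data layout used in this work is not described, so these are assumptions, and the sample values are made up.

```python
import math


def clean_readings(samples):
    """Keep only (timestamp, value) pairs with a usable positive value."""
    cleaned = []
    for timestamp, value in samples:
        # Skip missing, NaN, zero, or negative readings.
        if value is None or math.isnan(value) or value <= 0:
            continue
        cleaned.append((timestamp, value))
    return cleaned


# Made-up samples taken 15 minutes apart.
samples = [
    ("10:00", 0.42),
    ("10:15", 0.0),
    ("10:30", None),
    ("10:45", 0.47),
]
print(clean_readings(samples))
```

Dropping samples leaves gaps in the 15-minute spacing, so keeping the timestamps alongside the values makes those gaps visible to later processing steps.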

Sensor fusion data
A general algorithm of sensor fusion can be used to combine the IoT sensor readings and the ground station data. In the general situation, a process undergoes state transitions $x_1, x_2, x_3, \dots$. The propagator matrix A permits us to forecast the state at time k + 1, given the best past estimate $\hat{x}_k$ of the state at time k:
$$ \hat{x}^-_{k+1} = A\,\hat{x}_k + v $$
The error in the forecast is given by v, a variable that has a Gaussian distribution centered at 0. If the covariance matrix of the system state estimate at time k is named $P_k$, the covariance matrix of $\hat{x}^-_{k+1}$ can be expressed as
$$ P^-_{k+1} = A\,P_k\,A^T + Q $$
where Q is the covariance matrix of the noise v.
The measurement $z_{k+1}$ is prone to errors. The state and the measurement are linearly related by
$$ z_{k+1} = H\,x_{k+1} + w $$
where H is a matrix and w is the measurement error, with covariance matrix R.
The Kalman gain:
$$ K_{k+1} = P^-_{k+1} H^T \left(H\,P^-_{k+1} H^T + R\right)^{-1} $$
State update:
$$ \hat{x}_{k+1} = \hat{x}^-_{k+1} + K_{k+1}\left(z_{k+1} - H\,\hat{x}^-_{k+1}\right) $$
Covariance update:
$$ P_{k+1} = \left(I - K_{k+1} H\right) P^-_{k+1} $$
The ground station data and the sensor data have been fused using the Kalman filter, and the estimated value is the fused data. The mean and variance of the fused data, the ground station data, and the IoT data have been computed and are tabulated in Table 4. It can be seen from Table 4 that the mean and variance of the fused data are much lower than those of both the ground station and the IoT data. It can be inferred from the table that the fused data have less uncertainty associated with the measurements and are more accurate than both the ground station and IoT data. Because of the lower variance, the fused data can be used for accurate forecasting.
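The predict and update cycle can be sketched for a single pollutant as a scalar Kalman filter that fuses a ground station stream and an IoT stream. The sketch assumes A = 1 and H = 1 (the state is the pollutant level itself and both sources measure it directly) and applies the two measurements one after the other at each time step. The noise values, the initial state, and the readings are hypothetical. The tuning actually used in this work is the one in Table 3, which is not reproduced here.

```python
def kalman_fuse_series(ground, iot, r_ground, r_iot, q=0.01, x0=0.0, p0=1.0):
    """Fuse two measurement streams of the same quantity with a scalar Kalman filter.

    Assumes A = 1 and H = 1. r_ground and r_iot are measurement noise variances,
    q is the process noise variance, and (x0, p0) is the initial estimate.
    """
    x, p = x0, p0
    fused = []
    for z_ground, z_iot in zip(ground, iot):
        # Predict: the state is carried over and the uncertainty grows by q.
        p = p + q
        # Update sequentially with each measurement source.
        for z, r in ((z_ground, r_ground), (z_iot, r_iot)):
            gain = p / (p + r)
            x = x + gain * (z - x)
            p = (1.0 - gain) * p
        fused.append(x)
    return fused


# Made-up readings of one pollutant from the two sources.
ground = [10.0, 10.4, 10.1, 10.6]
iot = [11.2, 9.8, 10.9, 10.2]
print(kalman_fuse_series(ground, iot, r_ground=1.0, r_iot=4.0, x0=10.0))
```

With a smaller measurement variance assigned to one source, the fused value stays closer to that source, so the choice of the two variances controls how much each sensor is trusted.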

ARIMA and LSTM implementation
In general, ARIMA and LSTM have been used very popularly for forecasting and predictions; therefore, both ARIMA and LSTM models have been compared to know which model is better for this use case. Furthermore, the results of ARIMA and LSTM on fused data are compared with the results from the sensor and the ground station data. Figure 13 shows the implementation of the LSTM and ARIMA model on the training dataset after the sensor fusion of O3 sensor data and ground station O3 data. Figure 14 shows the implementation of the LSTM and ARIMA model on the training dataset after the fusion of NH3 sensor data and NH3 actual data from CPCB. Figure 15 shows the implementation of the LSTM and ARIMA model on the training dataset after the sensor fusion of CO sensor data and CO ground station data.
Both the LSTM and ARIMA models have been validated against the unseen test dataset. Moreover, MAE and RMSE have been used to evaluate the model accuracy. Figures 16, 17, and 18 show the forecasting results of ARIMA and LSTM on a fused test dataset of CO, O3, and NH3, respectively. The RMSE and MAE of each implementation can be seen in Table 5.
The lowest MAE and RMSE values have been observed for LSTM in all cases; hence, it can be concluded that LSTM is a better fit for the dataset than the ARIMA model. It can also be observed that both the LSTM and ARIMA models have the lowest MAE and RMSE on the fused datasets of O3 and NH3 compared with the sensor data and the ground station data. Therefore, sensor fusion has resulted in less uncertainty and more accuracy compared with the CPCB data and the sensor data.

Conclusion
This paper has dealt with forecasting the levels of three air pollutants: CO, NH3, and O3. This has been done by implementing the LSTM model and the ARIMA model and then comparing them to see which gives the best results. Both the ARIMA and the LSTM models are widely used for time series analysis, but in this case, the LSTM model has proven to be the better model for forecasting air pollution levels. Also, to enhance the accuracy of the forecasting results, sensor fusion has been applied by implementing the Kalman filter. The fused output has shown much lower variance and mean than the sensor data and the ground station data. Future work can explore better algorithms and deploy the model on a cloud platform such as Microsoft Azure. A dashboard could also be included for better visualization of the results.