Groundwater level prediction in arid areas using wavelet analysis and Gaussian process regression

Utilizing new approaches to accurately predict groundwater level (GWL) in arid regions is of vital importance. In this study, support vector regression (SVR), Gaussian process regression (GPR), and their combination with wavelet transformation (named wavelet-support vector regression (W-SVR) and wavelet-Gaussian process regression (W-GPR)) are used to forecast groundwater level in Semnan plain (arid area) for the next month. Three different wavelet transformations, namely Haar, db4, and Symlet, are tested. Four statistical metrics, namely root mean square error (RMSE), mean absolute percentage error (MAPE), coefficient of determination (R 2), and Nah-Sutcliffe efficiency (NS), are used to evaluate performance of different methods. The results reveal that SVR with RMSE of 0.04790 (m), MAPE of 0.00199%, R 2 of 0.99995, and NS of 0.99988 significantly outperforms GPR with RMSE of 0.55439 (m), MAPE of 0.04363%, R2 of 0.99264, and NS of 0.98413. Besides, the hybrid W-GPR-1 model (i.e. GPR with Harr wavelet) remarkably improves the accuracy of GWL prediction compared to GPR. Finally, the hybrid W-SVR-3 model (i.e. SVR with Symlet) provides the best GWL prediction with RMSE, MAPE, R2, and NS of 0.01290 (m), 0.00079%, 0.99999, and 0.99999, respectively. Overall, the findings indicate that hybrid models can accurately predict GWL in arid regions.


Introduction
The sharp decline of groundwater level (GWL) in arid areas has attracted the attention of water resources engineers. By increasing the demand for water resources in industrial, residential, and agricultural sectors, forecasting the groundwater level has gained its particular importance in groundwater resources management (Nayak et al., 2006;Suryanarayana et al., 2014). Moreover, GWL predictions have been used to avoid the side effects of groundwater level decline including the loss of water pumping capacity from wells and ground subsidence (Moosavi et al., 2013;Shirmohammadi et al., 2013a).
Numerical methods are often computationally expensive and need many input variables in predicting GWL (Anderson & Woessner, 1992). Hence, in recent years, a number of studies used machine learning algorithms to predict GWL (Taormina et al., 2012;Porte et al., 2018;Lal & Datta, 2018;Su et al., 2020;Niu & Feng, 2021; Alcalá García et al., 2021). By using air temperature, precipitation and GWL in preceding months, Chang et al. (2015) employed the feedforward neural network model to predict monthly GWL. Gholami et al. (2015) applied a multilayer perceptron (MLP) model to forecast GWL. Their results indicated that MLP model had a high capability for GWL forecasting. Rezaie-balf et al. (2017) utilized Multivariate Adaptive Regression Splines (MARS) and M5 Model Trees (MT) to simulate GWL fluctuation. Ghose et al. (2018) used radial basis neural network (RNN) to predict GWL. Rainfall, air temperature and humidity, runoff, and evapotranspiration were used as inputs in their model.  attempted to predict GWL by hybrid radial basis neural network-whale algorithm (RNN-WA), hybrid MLP-WA, and genetic programming (GP) approaches. The results indicated that MLP-WA was more accurate than the other two approaches for 3, 6, and 9 months ahead forecast scenarios.
Recently, Support Vector Regression (SVR) and Gaussian Process Regression (GPR) were utilized to predict different hydrologic variables such as water level in lakes and dams (Hipni et al., 2013;Lan, 2014), river flow forecasting (Esmaeilzadeh et al., 2017;Samadianfard et al., 2019a), global solar radiation (Samadianfard et al., 2019b), daily dew point temperature (Qasem et al., 2019), longitudinal dispersion coefficient in natural streams , pan evaporation (Shabani et al., 2020) and standardized streamflow index . SVR is a type of support vector machine (SVM) algorithm, which was introduced by Vapnik (1995). It belongs to supervised learning algorithms and can be used to solve regression problems (Cortes & Vapnik, 1995;Vapnik, 1999). SVR is able to achieve good prediction of variables of interest based on a small number of statistical samples. Compared to fuzzy logic and neural network approaches, SVR has a shorter training time, lower data dependence, and easier implementation. Gaussian Process Regression (GPR) is a Bayesian learning approach, which assumes the output probability distribution be Gaussian since it does not change the probability of inputs (O'Hagan, 2013;Sun et al., 2013). GPR is able to model complex relationships between inputs and outputs and cope with a broad range of simulation behaviors of alternative models (Karami et al., 2018;Liu & West, 2009;O'Hagan, 2013;Rasmussen & Nickisch, 2010). It deals with certain aspects of regression and prediction problems (Reis et al., 2005;Stedinger & Tasker, 1985;Vogel et al., 1999).
Using single techniques may be risky for water resources management. Hence, hydrologists usually take advantage of hybrid approaches to make better decisions for managing water resources. Various studies indicated that combining different techniques and developing hybrid methods could improve the prediction. Graf et al. (2019) applied 4 mother wavelets, namely Daubechies, Symlet, discrete Meyer and Haar, coupled with ANN to predict daily water temperature at a river in Poland. Zhu et al. (2020) applied a hybrid wavelet-multi layer perceptron approach to forecast daily surface water temperature at eight lakes in Polish. Campozano et al. (2020) investigated the capabilities of ANN to predict river discharge using wavelet analysis.
Wavelet transfer (WT) can analyze changes in time-series to provide the time scale of non-stationary signals (Adamowski, 2008;Adamowski & Chan, 2011;Karami et al., 2018;Kisi & Cimen, 2011;Samadianfard et al., 2018;Zhang & Dong, 2001;Zheng et al., 2000). In contrast to classical Fourier analysis (FA), which is a trigonometric polynomial, WT propagates functions in waves. This form of translations and dilations is generated using a fixed function called mother wavelet (Zhu et al., 2020). It decomposes time series into subcomponents and supplies more data at different resolutions, ultimately increasing the veracity of prediction models (Tikhamarine et al., 2019). WT has been widely used in water resources management studies (Karthika & Deka, 2015;Shafaei & Kisi, 2015;Shirmohammadi et al., 2013b). Also recently Rezaie-balf et al. (2017) coupled WT with MARS and MT to increase the accuracy of GWL forecast. Consequently, the goal of this study is to predict one-month ahead groundwater level in Semnan plain (Iran) by SVR, GPR, wavelet-support vector regression (W-SVR) and wavelet-Gaussian processes regression (W-GPR) approaches. Semnan plain has an arid climate with a mean annual rainfall of 140 (mm). Groundwater is the main source of water supply for residential and agricultural sectors in Semnan plain. However, its improper use has caused severe problems (e.g. sharp decline in groundwater level, land subsidence, and shortage of irrigation water) in Semnan province. Hence, the results of this study can help decision makers and stakeholders to manage groundwater resources more efficiently. Three different mother wavelets, namely Haar, db4, and Symlet, are tested in W-SVR and W-GPR approaches. Monthly air temperature, precipitation, relative humidity, and groundwater level for a period of 20 years  are used as inputs.

Study area
Semnan plain is located in the east of Tehran and has an area of about 1207 km 2 . The study area spans over the longitude of 53°3 E -53°40 E, and latitude of 35°18 N -35°43 N. The plain has an altitude of 874-1857 m above the sea level. It has an arid climate with an average annual precipitation of about 140 mm. More than 80% of precipitation in the Semnan plain evaporates. The remaining 20% is converted into the runoff and groundwater. The location of Semnan plain in Iran and its observation wells are shown in Figure 1.

Data
Various factors such as air temperature, humidity, precipitation, evapotranspiration, aquifer storage capacity, harvesting rate from wells, and recharging aquifers affect GWL (Adeleke et al., 2015). However, only a subset of these variables is available in Semnan plain. Based on literature (Banadkooki, Ehteram, Ahmed, Teo, Fai, Afan, Sapitang, et al., 2020;Emamgholizadeh et al., 2014;Khoshand, 2021) and correlation analyses between GWL and available micrometeorological variables in Semnan plain, mean minimum air temperature (T min ), minimum relative humidity (RH min ), precipitation (P) and GWL in the current and previous three months are used as inputs.

Support vector regression (SVR)
Support vector regression (SVR) is a nonparametric method, which was developed by Vapnik (1995). It has been used as a machine learning tool for classification and regression (Mirarabi et al., 2019). The performance of SVR is highly dependent on its kernel. Consider a dataset where x i is the input vector, d i is the target value, and N is the number of data samples. The SVR function, f (x), is given by, where ω is the weight vector, φ is the non-linear transfer function and b is the bias. To achieve a proper performance of f (x), the regression problem is estimated by minimizing the structural risk function: where ξ i and ξ i * are the slack variables, c is the error penalty term, which denotes the trade-off of the risk term    and regularization observed so that the deviation ε can be tolerated, and ε is the error tolerance (Suryanarayana et al., 2014). By making an appropriate change to the initial formula of the target function, we can turn it into a dual problem for quadratic programming. Quadratic programming is then used to solve the SVR problem: where α i and α i * are dual Lagrange multipliers, k (x i , x) is the kernel function, which is equal to φ(x i )φ(x). The above equation is used for both linear and non-linear regression analyses (Isazadeh et al., 2017). It is important to choose the right parameters to achieve a reliable predictive accuracy (Zounemat-Kermani et al., 2016). One of the widely used kernel function is the radial basic function, which is given below: where γ is the kernel width (Yu et al., 2018).

Gaussian process regression (GPR)
GPR is a non-parametric and stochastic approach that can solve nonlinear regression problems (Grbic et al., 2013). GPR learns the relationship between the predictor and target. According to linear regression, the response variable y can be defined as a function of p-dimensional variables x = [x 1 , . . . , x p ] T parameterized by w = [w 1 , . . . , w p ] T as shown in Equation (6) (Wang & Chen, 2015): where n is the number of data points, x id is the dth covariate of x i , and ε i is the Gaussian noise with zero mean and variance σ 2 ε (Wang & Chen, 2015). The output value y = [y 1 , . . . , y n ] T , as a linear function of w and ε i , is also Gaussian with zero mean and covariance matrix C. The Gaussian process can be defined as p(y) = G (0, C), where G denotes the Gaussian distribution. The equation for C is given by Wang and Chen (2015): where δ ij is equal to 1 for i = j, and 0 for i = j. Equation (8) presents the Gaussian variance (Wang & Chen, 2015): where x refers to the input values and k * = [C(x * , x 1 ), . . . , C(x * , x n )] T . Accordingly, the log-likelihood of the training data can be defined by Equation (9) to take hyper-parameters in the presence of maximum likelihood computation (Wang & Chen, 2015): Equation (10) presents the derivative of the loglikelihood with respect to each hyper parameter (Wang & Chen, 2015):

Wavelet transform (WT)
The wavelet transform (WT) has been successful in the field of signal processing (Starck & Murtagh, 2001). In general, WT replaces local oscillating base functions (called wavelets) with infinitely oscillating base functions of Fourier transform. In fact, wavelets are displaced versions of a stem wavelet with a value of ψ(t) (Selesnick et al., 2005). These wavelets, by selecting and combining with a low-pass scale function with a value of ϕ(t), can form a balanced base shape for signals of a onedimensional time series (1-D). Accordingly, any finite energy analog signal x(t) can be separated in terms of wavelets and scale functions by Equation (11) (Selesnick et al., 2005): where ψ(t) is the mother wavelet function, j is the scale factor, and n is the time shift. It should be noted that in WT, the basic functions are translations of a function named the mother wavelet. Besides, the scaling coefficients c(n) and wavelet coefficients d(j, n) are computed via the inner products as shown in Equations (12) and (13) (Selesnick et al., 2005): c(n) and d(j, n) provide a time-frequency analysis of the signal by measuring its frequency content (controlled by the scale factor j) at different times (controlled by the time shift n). This study uses three different mother wavelets, namely Haar, db4, and Symlet, to compare their performance to improve groundwater level modeling when applying hybrid W-SVR and W-GPR models.

Performance indicators
Root mean square error (RMSE), mean absolute percentage error (MAPE), coefficient of determination (R 2 ), and Nah-Sutcliffe efficiency (NS) are used to measure the accuracy of predictions. The expression for each of these metrics is shown below (Faizollahzadeh Ardabili et al., 2018): where x i and y i are the ith observed and predicted groundwater levels, respectively.x,ȳ are the average of observed and predicted groundwater level data, respectively, and n is the number of total data.
The correlation coefficient (R 2 ) indicates the degree of correlation between the observed and predicted values. The closer it is to 1, the more the observed and predicted data are correlated. RMSE and MAPE denote the error of predictions. The closer they are to 0, the more accurate the prediction is. The closer NS is to 1, the more accurate the prediction is.

Results and discussion
In this study, the hybrid Haar-, db4-, and Symlet-SVR models are named W-SVR-1, W-SVR-2, and W-SVR-3, respectively. Similarly, the Haar, db4, and Symlet mother wavelets combined with GPR are called W-GPR-1, W-GPR-2, and W-GPR-3, respectively. The results of SVR, GPR, W-SVR, and W-GPR models are compared in Figure 6 and Table 2. As can be seen, SVR significantly outperforms GPR. RMSE and MAPE of GWL predictions from SVR are 91.3% and 95.4% less than those of GPR ( Figure 6(a)). Besides, all three hybrid W-SVR models perform better than their corresponding hybrid W-GPR approaches (Figure 6(b-d)). GWL predictions from all models can capture the continuous decline in GWL observations. W-GPR-1, as the most accurate W-GPR model, provides GWL predictions with lower RMSE and MAPE values (0.29406 (m) and 0.02274%) and higher R 2 and NS values (0.99866 and 0.99554) in comparison with standalone GPR with RMSE, MAPE, R 2 and NS values of 0.55439, 0.04363, 0.99264 and 0.98413, respectively. Besides, W-GPR-1 reduces RMSE and MAPE of standalone GPR by 47.0% and 47.9%, respectively. The statistical metrics indicate that W-SVR-3 yield the most precise GWL predictions with RMSE, MAPE, R 2 and NS values of 0.01290 (m), 0.00079%, 0.99999 and 0.99999, respectively. W-SVR-3 reduces RMSE and MAPE of SVR by 73.1% and 60.3%, respectively. The overall comparison of GPR, SVR, W-GPR, and W-SVR models illustrates that hybrid models have higher capabilities in predicting GWL compared to standalone GPR and SVR models. The findings indicates that Haar and Symlet mother wavelets have higher positive effects on improving error metrics of standalone GPR and SVR models, respectively. Thus, W-GPR-1 and W-SVR-3 may be recommended as the most accurate models for GWL prediction.
Scatterplots of observed and predicted GWL values from different models are shown in Figure 7. As can be seen, the predictions from GPR and W-GPR fall around the 45-degree line (left column in Figure 7), implying that these models can reliably predict GWL. GWL predictions from SVR and W-SVR mostly fall on the 1:1 line, denoting these models can forecast GWL very precisely. Overall, a comparison of scatterplots in the left and right columns shows that SVR-based models are superior to GPR-based ones. Besides, among the three tested mother wavelets, Haar and Symlet have the best efficacy when combined with GPR and SVR, respectively.
Comparing the results of GPR and SVR (first row) with those of W-GPR and W-SVR (second-fourth rows) illustrates that WT, as a data preprocessing technique, is able to map the relationship between GWL and its influential variables more robustly. This happens because WT decomposes complex time series by catching information on different levels, which results in a more effective structure for machine learning methods (Adamowski & Sun, 2010;Nourani et al., 2009;Zhou et al., 2020). In fact, the decomposed input variables give GPR and SVR models more flexibility in simulating the relationship between input and output variables. As a result, WT shows a significant contribution in explaining the hidden relationship between meteorological variables and GWL in Semnan plain. Taylor diagram is designed to graphically indicate which of several prediction models are more realistic. This diagram is based on the geometric linkage between the correlation coefficient and standard deviation. It can be presented in two forms: semicircle (showing negative and positive correlation) and quadrant (indicating positive correlation only). In both forms, the values of the correlation coefficients are expressed on the arc of the circle. The graph evaluation method is tailored such that any model whose location is in the graph closer to the reference point is more accurate and therefore more appropriate (Taylor, 2001). The impact of different mother wavelets on the accuracy of groundwater level predictions has not been fully investigated in the literature. The Taylor diagram in Figure 8 shows a comparison of all models used in this research. The results indicate that applying different mother wavelets, namely Haar, db4, Symlet, for developing hybrid W-SVR and W-GPR models can provide different performances. The green point (observed on the horizontal axis) indicates the position of the reference point according to standard deviation. The pink and orange points (W-SVR-3 and W-SVR-2) are closer to the  reference point and provide better results. The blue point (GPR) falls outside the semicircle and is far from the reference point. However, by adding wavelet analysis to this model, the results are improved and the yellow point (W-GPR-1) is closer to the observed point. Yet, at the end, the best model is W-SVR-3 with the Symlet mother wavelet. These results show that the effectiveness of different mother wavelets should be explored for predicting GWL. These findings may be used by local authorities for monitoring and managing groundwater level decline of Semnan plain.

Conclusion
Groundwater level (GWL) prediction is of vital importance for residential, agricultural and industrial sectors, especially in arid regions. In this study, support vector regression (SVR), Gaussian process regression (GPR), hybrid wavelet-SVR (W-SVR), and hybrid wavelet-GPR (W-GPR) methods with three different mother wavelets (i.e. Haar, db4, Symlet) are used to predict GWL in Semnan plain, which is an arid region. Performance of these approaches is compared based on the root mean square error (RMSE), mean absolute percentage error (MAPE), coefficient of determination (R 2 ), and Nah-Sutcliffe efficiency (NS) metrics. Minimum air temperature, minimum relative humidity, precipitation and groundwater level of three preceding and current months are employed as inputs to predict the one-month ahead GWL. All approaches can reliably capture the continuous decline in GWL during the testing period (2012-2017). However, SVR can predict GWL more accurately than GPR. RMSE and MAPE of GWL predictions from SVR are respectively 0.04790 (m) and 0.00199%, which are 91.3% and 95.4% less than those of GPR. Besides, adding any of the three mother wavelets to GPR and SVR improves their performance. It is found that Harr and Symlet mother wavelets are the most effective ones when combined with GRP and SVR, respectively. The hybrid W-SVR-3 model (that uses Symlet mother wavelet) reduces RMSE and MAPE of GWL predictions by 73.1% and 60.3% compared to SVR. Similarly, the hybrid W-GPR-1 model (that used Harr wavelet) decreases RMSE and MAPE of predicted GWL values by 46.9% and 47.9% relative to GPR. Overall, the results show that both GPR and SVR and specifically their combination with the wavelet analysis can accurately predict GWL, ultimately allowing water resources planners to create effective measures for conserving groundwater resources in Semnan plain.