Time series-based groundwater level forecasting using gated recurrent unit deep neural networks

ABSTRACT In this research, the mean monthly groundwater level with a range of 3.78 m in Qoşaçay plain, Iran, is forecast. Regarding three different layers of gated recurrent unit (GRU) structures and a hybrid of variational mode decomposition with gated recurrent unit (VMD-GRU), deep learning-based neural network models are developed. As the base model for performance comparison, the general single-long short-term memory-layer network model is developed. In all models, the module of sequence-to-one is used because of the lack of meteorological variables recorded in the study area. For modeling, 216 monthly datasets of the mean monthly water table depth of 33 different monitoring piezometers in the period April 2002–March 2020 are utilized. To boost the performance of the models and reduce the overfitting problem, an algorithm tuning process using different types of hyperparameter accompanied by a trial-and-error procedure is applied. Based on performance evaluation metrics, the total learnable parameters value and especially the model grading process, the new double-GRU model coupled with multiplication layer (×) (GRU2× model) is chosen as the best model. Under the optimal hyperparameters, the GRU2× model results in an R 2 of 0.86, a root mean square error (RMSE) of 0.18 m, a corrected Akaike’s information criterion (AICc) of −280.75, a running time for model training of 87 s and a total grade (TG) of 6.21 in the validation stage; and the hybrid VMD-GRU model yields an RMSE of 0.16 m, an R 2 of 0.92, an AICc of −310.52, a running time of 185 s and a TG of 3.34.


Introduction
In groundwater hydrology, the hydraulic head is regarded as one of the most significant criteria, which has been chiefly evaluated as groundwater level (GWL) with natural and unnatural bases. It determines the potentiometric condition of water recharge, discharge and storage of aquifers (Conlon et al., 2005). Continuous monitoring of GWL over time provides a valuable information source for hydrologists to understand the evolution and geometry of aquifers, particularly spatial hydraulic heterogeneity and anisotropy, by analyzing diverse spatiotemporal attributes, including the hydrometeorological cycle and groundwater resources development under global longstanding climate unpredictability and human-caused influences, and applying rational policies on logical water CONTACT Qian Zhang 20200420@wzu.edu.cn; Shahab S. Band shamshirbands@yuntech.edu.tw; Amir Mosavi amir.mosavi@kvk.uni-obuda.hu resources management (Asadi et al., 2020;Bekesi et al., 2009;Valdes et al., 2014). Assessing long-term groundwater resources in a potential agronomic site requires specific knowledge, high expenditure on the installation of observation piezometers, and the gathering of diverse data, such as hydrogeological and meteorological parameters, over time. It is unreliable to estimate GWL through conventional linear and deterministic statistical models since GWL is neither a continuous stationary multifactor hydrological time series nor a recurrent physical procedure with a lagged response (Post & von Asmuth, 2013;Schuurmans et al., 2007). Consequently, it is vital to develop robust approaches that take into consideration the dynamic behavior of observations such as heteroscedasticity, non-stationarity and autoregression (Fathian et al., 2016). It can thus be inferred that estimating monthly scale GWL for irrigation purposes is often thought of as a burdensome task compared to surface water computations (Favero et al., 2007).
In the past few decades, different methods have been developed to assess GWL fluctuations, which are generally divided into four subgroups: physical, statistical, artificial intelligence (AI) and hybrid methods. Because of the extreme randomness of GWL fluctuations, it is difficult to determine which model has the best estimation performance. Each model has its strengths and weaknesses, and many models are site dependent. Hence, it is hard to forecast all GWL fluctuation series with a definite model. Since the late 1960s, numerous studies have been carried out on the numerical emulation of groundwater quality and quantity (e.g. solute and contaminant transport) in porous media by different methods (Anderson et al., 1992;Dalkiliç & Gharehbaghi, 2021;Gharehbaghi, 2021aGharehbaghi, , 2021bHenriksen et al., 2008). This research contemplates flexible and reliable techniques for solving groundwater flow problems in complicated field conditions (Hemker & Bakker, 2006). Nevertheless, emulating the groundwater flow by numerical methods for various real-world problems is too difficult owing to the uncertainty and nonlinearity corresponding to influential parameters such as conductivity and hydraulic gradient, depth of saturated zone, long computational time for emulation, rate of discharge and recharge, and hydraulic head (Gómez-Hernández & Gorelick, 1989;Kovács et al., 2015).
To avoid the restrictions and disadvantages of physically based methods, several empirical techniques have been developed as a significant and renowned branch of machine learning models (MLMs). They are viewed as dependable methods with acceptable precision and a lower computational time for forecasting complex phenomena in different sciences by making connections between input/output without requiring a basic knowledge of physical relationships (Ahmadi et al., 2020;Choubin et al., 2019;Ghalandari et al., 2019;Golestani Kermani et al., 2019;Joloudari et al., 2020;Mahmoudi et al., 2021;Qasem et al., 2019;Shabani et al., 2020;Taherei Ghazvinei et al., 2018;Torabi et al., 2019).
The application of empirical techniques in geosciences is increasing rapidly, mainly in the prediction of extreme events such as GWL fluctuations . Many kinds of research have been performed using traditional MLMs for the prediction of GWL in different places. Shiri et al. (2013) utilized several classical soft computing methods to predict GWL fluctuations with meteorological impact implications in Hongcheon Well station, South Korea. The results of the prediction showed that for the 1-day-ahead prediction, gene expression programming (GEP) and adaptive neuro-fuzzy inference system (ANFIS) models presented small testing values (t-statistic) of 0.86 and 0.126, respectively, and high significance levels of 0.932 and 0.9, respectively. Sharafati et al. (2020) developed a novel AI model called gradient boosting regression (GBR) for forecasting GWL over Rafsanjan aquifer in Iran. Using correlation analysis, depending upon the lead periods, the R 2 value changed from 0.66 to 0.94. MLM-based deep neural network (DNN) models, which are an advanced type of recurrent neural network (RNN), are receiving more attention in various sciences Nabipour et al., 2020). Deep learning (DL) algorithms are black-box methods in which the input/output data are directly connected through a huge-dimension matrix of biases and weights in the hidden layers. In the training process, through the intrinsic ability of this method, the network feedback to the output of calibration data is matched using an optimization algorithm Rajaee et al., 2019). The emulation results rely on the quantity and quality of calibration data and also the construction of the network. The increased use of DL models, with gated recurrent unit (GRU) and long short-term memory (LSTM) neural networks as improvised architecture, provides vast opportunities for improvements in the fields of water resources and hydrology (Shen, 2018). Although a time scale was embedded into the RNN, theoretical and practical experiments confirmed that conventional gradient-based networks cannot dependably operate with information from more than the last 10 time steps (Gers et al., 1999). To permit the networks to recall inputs for a long period, an explicit memory, called LSTM, was formulated to enhance their ability (Hochreiter & Schmidhuber, 1997). Consequently, the exploding and vanishing gradients that arise in RNNs could be circumvented by substituting the traditional neuron in the hidden layer with an LSTM cell, which has a periodically self-connected linear component. The GRU, according to Cho et al. (2014), has an inherent capability for learning long period dependencies using raw time-series data. One benefit of such a model is that it can efficiently capture nonlinear relationships in historical GWL datasets .
So far, LSTM-based models have been effectively used for modeling in the field of hydrological science (Bowes et al., 2019;Jeong et al., 2020;Vu et al., 2021;Yin et al., 2020;Zhang et al., 2018). Nevertheless, inadequate research has been conducted on different applications of GRU-based models for complex hydrological processes such as GWL prediction. Ghasemlounia et al. (2021) predicted GWL fluctuations of four observation piezometers in an agrarian region through the structure of the novel bi-directional long short-term memory (BiLSTM)-based DNN model, and found the BiLSTM2× to be the best model. This model resulted in a root mean square error (RMSE) of 0.17 m and an R 2 of 0.89 in piezometer 4, with a range of 4.49 m. Gao et al. (2020) forecast short-time runoff with an artificial neural network (ANN), GRU and LSTM neural networks without using time-step optimization in sample generation. The results confirmed that DNN models outperformed the typical ANN and their estimation precision was raised by increasing the time step. Finally, GRU was chosen as the best method as it needed less running time for model training. Chen et al. (2021) developed a surrogate method for groundwater emulation using GRU to enhance the efficacy of parameter autocalibration and global sensitivity analysis. The results showed that the GRU surrogate combined with the particle swarm optimization (PSO) algorithm could be used to adopt high-dimensionality parameter calibration tasks with outstanding ability. Pan et al. (2020) forecast the water level in the Yangtze River using different models. They reported that for predicting the 8 o'clock water levels for the next 5 days, the ability of the hybrid convolutional neural network-gated recurrent unit (CNN-GRU) model with three water stations was greater than the other applied models. This model resulted in an RMSE of 0.13, a mean relative error of 3.13% and a Nash-Sutcliffe efficiency (NSE) coefficient of 0.97 as the average value of three seasons. Jeong and Park (2019) applied GRU, LSTM, autoregressive exogenous (ARX) and nonlinear autoregressive exogenous (NARX) models to forecast GWL. They found that LSTM and NARX neural network models worked better than ARX and GRU for monitoring GWL.
In recent years, many scientists have investigated hybrid models. These models can utilize the benefits of diverse models by incorporating their functions. Typical hybrid models are data preprocessing and parameter choice and optimization. The decomposition technique is one of the generally employed data preprocessing models. Commonly used decomposition methods include singular spectrum analysis (SSA), variational mode decomposition (VMD), empirical mode decomposition (EMD) and wavelet decomposition. VMD (Dragomiretskiy and Zosso, 2013) is a groundbreaking non-recessive technique that was developed to disentangle nonlinear eurhythmics from intricate data signals with extreme computational efficacy. In VMD, a variational model tries to find a sequence of elements with an assorted rate of recurrence bands and time resolutions (Fang et al., 2019;. In contrast to EMD, VMD can successfully improve the procedure of the integration problem, where one element encompasses two or further subsignals with distinct variances or one subsignal is separated into two or more elements with comparable traits (Fosso and Molinas, 2018;Naik et al., 2018). Several investigators have developed hybrid decomposition-based models for forecasting the time-series GWL (Bahmani et al. 2020;Gehrels et al. 1994;Nourani & Mousavi, 2016;Rezaiebalf et al. 2017). Zare and Koch (2018) estimated GWL fluctuations by ANFIS and hybrid wavelet-ANFIS/fuzzy C-means (FCM) clustering approaches in Miandarband plain, Iran. They concluded that the developed methods could be used with satisfactory precision; the hybrid wavelet-ANFIS approach with a Symlet mother wavelet outperformed the other models, with RMSE and R 2 values of 0.17 and 0.984, respectively, in the validation stage. Bahmani and Ouarda (2021) developed different hybrid decomposition models for predicting GWL. The results confirmed the hybrid M5 model tree with wavelet transform (WT-M5) model as the best technique, with an R 2 value of 0.91 and RMSE of 0.26 in the validation stage. Azari et al. (2021) used a coupling preprocessing method with linear stochastic models (i.e. a linear model with one autoregressive and one seasonal moving average parameter) to develop a novel model for GWL prediction. The statistical evaluation of this model confirmed its acceptable performance for GWL estimation, with a scatter index (SI) of 0.0004, root mean squared relative error (RMSRE) of 0.0004, corrected Akaike's information criterion (AICc) of 151 and mean absolute percentage error (MAPE) of 0.0003. Wu et al. (2021) predicted GWL using different standalone and hybrid models. They found that the hybrid wavelet transform-multivariate long short-term memory (WT-MLSTM) technique, which integrates exterior correlated parameters into forecasts, worked better than the traditional methods. Azizpour et al. (2021) estimated GWL using the self-adaptive extreme learning machine (SAELM) and the hybrid SAELM with WT models. They reported that the hybrid (WT-SAELM) model was the best model, with variance accounted for (VAF), NSE and correlation coefficient (R) values of 97.450, 0.973 and 0.988, respectively. Tao et al. (2022) reviewed advanced MLMs developed to predict GWL in together with accomplishments in this field from 2008 to 2020 (138 papers). They evaluated the types of methods, data extent, time scale, input/output variables, performance metrics applied and the excellent models recognized.
In this study, since the nature of modeling consists of temporal dependencies, GRU neural networks as emulators are investigated as a suitable possibility to predict GWL in Qoşaçay plain, Iran. Thus, several GRU-based models with the sequence-to-one regression module are developed because of the lack of meteorological variables recorded in the study area. The developed models are a general single-gated recurrent unit-layer network, a simple double-gated recurrent unitlayers network (GRU2), a novel proposed double-gated recurrent unit model coupled with multiplication layer (×) (GRU2×)-layers network and a hybrid variational mode decomposition-gated recurrent unit (VMD-GRU) model. GRU2× is a state-of-the-art coupling version of the GRU2-layers network model with a multiplication layer (×), and the hybrid VMD-GRU model is a new combination of VMD with a general single-GRU-layer network model. Of these models, the hybrid VMD-GRU and GRU2× models are considered as the newly combined version models among the traditional DL models, which improve the accuracy of prediction. The novelty of the present research lies in developing an advanced new architecture of the GRU2× neural network and hybrid VMD-GRU models with the sequence-to-one regression module for predicting elaborate natural events such as the mean monthly time-series GWL. To the best of our knowledge, despite many studies having been performed for estimating GWL using the conventional architecture of DL-based models, no studies have been reported in the literature aimed at the application of the exceptional and newly proposed GRU2× and hybrid VMD-GRU models.
The scope and main purposes of the present research are: (1) To develop different layer structures of GRU-based and hybrid VMD-GRU models for the prediction of GWL using a long-term water table depth with 33 different monitoring piezometers with time-series traits without operating meteorological variables in Qoşaçay plain.
(2) To determine the ideal numbers of different hyperparameters in each model for better configuration of the developed models, to reduce the adverse influences of the overfitting problem, using a trial-anderror process. (3) To assess and match the outcomes of the emulation to determine the perfect model through statistical metrics.

Site description and data collection
Qoşaçay plain is located to the south of Lake Urmia in Qoşaçay town, West Azerbaijan Province, Northwest Iran, with an area of approximately 1100 km 2 . It has coordinates of 46°06 10 E latitude and 36°58 10 N longitude, lies 1314 m above sea level, comprises cultivable land, and has semi-arid and cold weather. Figure 1 depicts the geographical location of the study district.
Since the real evapotranspiration ratio (ET a ) is high in summer, groundwater stores are used as a supplementary or alternative source. The irrigation period is from April to September and the average annual potential evapotranspiration (ET Pot ) is about 900-1500 mm (Jalilvand et al., 2019). Peak rainfall generally occurs from January to May. The mean annual temperature and precipitation were reported to be 12°C and 284 mm, respectively (EARWO, 2020). Since this area is very vulnerable to drought, groundwater plays an important role. The stream direction of the groundwater is from the south-east to the north-west of the plain (Norouzi & Moghaddam, 2020).
To predict the average monthly GWL in the study region, the current investigation used 216 average monthly measured water table depths from 33 monitoring piezometers with different statistical characteristics in the period of April 2002-March 2020. Figure 2 shows the sites of the observation piezometers in Qoşaçay plain. Figure 3 presents the long period time series of average monthly GWL between April 2002 and March 2020 in the study area.
Descriptive statistics of the average monthly GWL in Qoşaçay plain are presented in Table 1, where CV and STDV are the coefficient of variation and standard deviation, respectively.
Based on Table 1, it can be seen that the variation trend in GWL is very unsteady, with an extremely complicated basic nature and nonlinearity, owing to the high range, skewness and kurtosis. Therefore, robust accurate techniques are required for modeling.

Methodology
In the current study, to forecast the time-series GWL, different GRU-based neural network models are developed. In this regard, first, all datasets are normalized to zero average and unit variance, as suggested by Lawrence et al. (1997). Then, the average monthly recorded GWL datasets are divided into two subsets: the first 70% of the total datasets (150 months), between April 2002 and September 2014, is used for calibration of models; and the remaining 30% of the total datasets (66 months), between October 2014 and March 2020, is used for validation.

Gated recurrent unit (GRU) cell structure
The RNN, which is a specific type of neural network model, is very appropriate for time-series and serial data modeling. In RNNs, an unfurled loop cell induces to inflow the previous data. Nevertheless, because of the explosion gradient dilemma, the accuracy of the typical RNN model diminishes during the learning process of a long period sequence during backpropagation (Rangapuram et al., 2018).  GRU, as a special kind of RNN, has been developed to preclude the fundamental difficulties in the common RNN using a specific infrastructure (Sak et al., 2014). It unites three gates of LSTM i.e. input, forget and output, into two control gates: Z t (update gate) and r t (reset gate). The interior structure of a GRU memory cell is depicted as a sequence diagram in Figure 4.
As described in Figure 4, there is an input layer comprised of several neurons, where the number of neurons is controlled through the dimensions of feature space. Correspondingly, the number of neurons in the output layer corresponds to the output space (Cho et al., 2014). The structure of the GRU can be formulated using Equations (1) and (2)

2014):
where h t , h t−1 and h t are the hidden layer of step t, the hidden layer of step t − 1 and the present new state of step t, which are considered as a summation of the information at h t−1 and the input information at step t, x t . The network weight matrices and bias vectors are signified as U, W and b, which are attained in the training phase. The hidden layer contains memory cells that take into consideration the principal functions of the GRU network.
GRU uses control gates to adjust the retaining, forgetting and updating of sequence data. z t determines the volume of ignored previous information and newly added data, while r t controls the volume of the above information which is passed into the new state. z t and r t can be formulated by the following equations (Cho et al., 2014): where x t denotes the input at step t, and σ and tanh are the logistic sigmoid and hyperbolic tangent functions, respectively. In contrast to LSTM, GRU has no remote memory cells; instead, it operates a single hidden state h t to send information over time steps. Both GRU and LSTM perform in nearly the same way; nonetheless, as GRU has a lower number of learnable parameters and a more solid structure, it can converge readily, when the volume of data is not too big (R. Fu et al., 2016;Greff et al., 2016).

Variational mode decomposition (VMD)
As an advanced data decomposition technique, the VMD algorithm takes advantage of variational modes to generate a certain number of distinct elements for determining the variational problem. Based on the set modes number, the VMD algorithm separates a signal into certain bandwidths with diverse center frequencies and upgrades progressively by operating the alternating direction multiplier optimization method (ADMM) (Bai et al., 2021;Wang et al., 2015).
In VMD, it is presumed that every model has a bandwidth with a given center rate of recurrence. In more detail, it is supposed that a basic signal f (t) is broken into K dissimilar separate modes u k (t) to minimize the total computed bandwidth of every mode, where every mode is accumulated around a center frequency w k (He et al., 2019;Nechikkat et al., 2015). A constrained variational problem, depicted by Equation (5), is applied to answer the decomposition problem for a time series f (t): where K is the number of modes and t is time. u k (t) and w k (t) are the decomposed modes and relevant center frequencies, respectively. δ(t) is the Dirac distribution and j is an imaginary unit, so that j 2 = −1. ⊗ is the convolution operator and f (t) is the tth data of the basic signal (Wang et al., 2015).
To find the optimal resolution to the constrained variational problem and progress the performance efficiency, a quadratic penalty parameter (a) and Lagrange multiplier (λ) are applied to obtain an unconstrained version of Equation (5) (J. Niu et al., 2018). The augmented Lagrangian L is expressed by the following formula: The ADMM algorithm, which is a well-known dividing technique for discrete optimization problems, is applied to search for the saddle point of the augmented Lagrange function and solve Equation (6) in the VMD algorithm (Dragomiretskiy and Zosso, 2013; W. Fu et al., 2019). In ADMM, decision variables ( u n+1 k , w n+1 k and λ n+1 ) are divided into numerous blocks on the basis of their characteristics, and the main problem is separated into several lesser subproblems of a single block, which will be optimized reiteratively by affixing all other blocks in every internal sequence (Chang et al., 2016;W. Fu et al., 2020). Equations (7)-(9) are used to upgrade the mode u k (ω) in the frequency sphere, center frequencies ω k and concurrently λ, respectively. In the time sphere, the mode u k (t) is the real section of the inverse Fourier transform of u k (ω), described by Equation (7).
where n is the number of repetitions and τ is the time step of the dual ascent (iterative parameter) signifying the noise tolerance of VMD. f (w), u i (w), λ(w) and u n+1 k (w) are Fourier transforms of f (t), u i (t), λ(t) and u n+1 k (t), respectively (Wang et al., 2015).
Once the stopping conditions have been fulfilled, the whole reiterative optimization procedure in Equation (8) is stopped; otherwise, the calculation will be repeated with Equations (7) and (8) until Equation (10) In Equation (10), ε is the convergence tolerance that affects the reformation error of VMD.

Model development
As recommended by Maier et al. (2010), the initial architectural construction of layer network models should be designed according to the intentions of the investigation. Therefore, the general single-LSTM layer network model (model 1) was first developed in MATLAB ® release 2021a (MATLAB 2021). This is taken as our base model and is used to match the performance of other developed models. Then, to tune the network topology pattern (NTP) hyperparameter in the general single-GRU layer network model (model 2), two deeper structures (models 3 and 4) are developed. Finally, model 2 is combined with the VMD technique to generate the hybrid VMD-GRU model. Figure 5 illustrates the structure of the designed models. As depicted in Figure 5, model 3 is a simple double-GRU model (GRU2), whereas the newly proposed model (model 4) is a double-GRU coupled with multiplication layer (×) with the sequence output mode. The number of layers in the construction of models 1, 2, 3 and 4 is 5, 5, 6 and 7, respectively. In the hybrid VMD-GRU model, the VMD algorithm is applied to decompose the original time series of GWL into a series of elements to reduce the complexity and non-stationary of GWL (decomposition phase). Next, the general single-GRU-layer network model (model 2) is trained to estimate the complex relationships among the input/output variables and create the ultimate outcome by uniting the information from all extracted subsequences (ensemble phase).
In this research, to achieve a suitable configuration to reduce the impact of overfitting and improve the capability of the developed models, the algorithm tuning process, using different types of hyperparameters, is operated. Because there are no effective instructions to determine the appropriate hyperparameters in a certain model and dataset, the current procedure is considered a time-consuming and arduous undertaking. In this respect, various scenarios are characterized to determine the appropriate hyperparameters, in combination with a trial-and-error procedure by the user.
In the building of the developed models, the input layer intakes time-series datasets with a size of 1. Since GWL has a time dimension, time steps on the monthly scale should be used as an independent variable. To tune the state activation function (SAF), diverse unions of tanh and softsign, including softsign-tanh, softsign-softsign, tanh-softsign and tanh-tanh, in each GRU layer with the sigmoid gate activation function are used separately. Moreover, to tune the number of hidden units (NHU), diverse values in each model are examined.
The multiplication layer multiplies inputs from multiple neural network layers element-wise. The number of inputs in this layer is determined characteristically by the operated program; nonetheless, it has a sole output (MATLAB, 2021).
DNNs are suitable for assessing large datasets; however, a very large dataset will cause overfitting problems. Because a dropout layer provides a functioning regularization system, it has been used to reduce overfitting (Hinton et al., 2012). Theoretically, it automatically ignores or drops some neurons with a certain probability rate of P in a layer allocated at each training reiteration (Srivastava et al., 2014). In this way, it breaks up positions where network layers coadjust to errors from previous layers and, consequently, makes the model more robust. In this emulation, values of 0.3, 0.5 and 0.7 are used to tune the learning dropout rate (P-rate).
Although LSTM and GRU neural networks have a high capability to learn long-term time-series datasets, their fitting ability can be unsatisfactory (Zhang et al., 2018), and thus a fully connected layer must be operated.
The fully connected layer multiplies the inputs via a matrix of weights and then adds a vector of bias to progress the fitting ability of the developed models (MATLAB, 2021). In this emulation, the input size of this layer in all developed models in the training phase is set to 'auto' to control its size automatically via the operated program, and its output size is set to 1. The final layer of all developed networks is set as a regression output layer to estimate the half-mean-squared-error loss for regression via the loss function, as follows: where y p and y m are the predicted and measured values of GWL at time i, respectively. To update the weights and bias of each network in the developed models, the calibration options are set as in the Adam optimization algorithm, with a maximum iteration number of 2000, a learn rate drop period of 125, an initial learning rate of 0.001 and a learn rate drop factor of 0.2. In addition, to avoid the vanishing gradients problem, the initial batch size and gradient threshold are set to 128 and 1, respectively. These parameters are fundamental components of the developed structures, which have a noticeable influence on their operation. On the whole, in the hybrid VMD-GRU model, K, α, τ and ε factors affect the decomposition performance of VMD, but it is difficult to specify their optimal values. Hence, in this study, K is deliberated as a key hyperparameter and is tuned with different values, while the other factors are set to the default values.

Performance evaluation criteria
In this study, the RMSE and coefficient of determination (R 2 ) were applied to assess the effectiveness and accurateness of the designed model for predicting the time-series GWL. These statistical metrics are formulated by the following equations: where N is the number of datasets, x i and y i are the measured and estimated GWL at the time i, σ x and σ y are the standard deviation of the measured and estimated GWL, respectively, and μ x and μ y are the average of the measured and estimated GWL, respectively. The optimal values for these metrics are 0 and 1, respectively.

Validation of general single-LSTM and single-GRU models
After numerous experiments, the optimal values of the Prate and NHU for the general single-LSTM model (model 1) are determined as 0.5 and 100, respectively, and for the general single-GRU model (model 2) as 0.5 and 30, respectively. Moreover, the appropriate kind of SAF for both models is identified as tanh. In the training phase of the optimal scenario of both models, after 2000 reiterations, the final loss, RMSE and learning rate values are approximately 1E-4, 1E-3 and 1.638E-13, respectively. After calibration, the model is tested through the testing datasets. The results of both models with respect to SAF, P-rate and NHU in the training and testing phases for emulating GWL in Qoşaçay plain are presented in Tables  2 and 3, respectively, where the bold values signify the attributes of the optimal scenarios. Based on Table 2, for both models, with the same NHU, by increasing the P-rate, the value of the running time for model training decreases; and with the same Prate, by increasing the value of NHU, the running time increases. Moreover, with the same NHU and P-rate, scenarios with an SAF of tanh are faster than those with softsign. Besides, in accordance with Table 3, it can be discerned that in the single-GRU model (model 2), too high a value of NHU (50) causes to an increase in the RMSE for the overfitting problem, whereas a smaller value of NHU (15) reduces the learning ability of the network for the underfitting problem.
All in all, both models are more precise in the calibration stage than in the validation stage. Although model 1 is very slightly more accurate than model 2, as it needs more NHU for convergence, it takes too long in comparison with model 2.

Validation of double-GRU models
Various combinations of activation function were used in the SAFs of GRU layers in the hidden layer of the simple double-GRU (GRU2) model (model 3) and the newly proposed GRU2× model (model 4). In this respect, after abundant experiments, the softsign-tanh union qualified as the best choice in both models. Likewise, the optimal values for P-rate and NHU in both models are determined as 0.5 and 25 (each layer), respectively.
Applying different activation functions in the hidden layers of both models results in more complex and nonlinear functions being learnt. In this regard, the softsign-tanh combination encourages the models to be less susceptible to the overfitting problem in the training phase. The softsign activation function can increase the rapidity of model calibration, while tanh can suitably capture complex relationships in long-period time series (Yin et al., 2020). Nevertheless, in this case, the type of data has a substantial impact the model quality.
In the training stage of the optimal scenario after 2000 reiterations, the values of RMSE and loss obtained for model 3 are approximately 1E-3 and 1E-4, respectively, and for model 4 are approximately 1E-4 and 1E-5, respectively. The value of the learning rate for both models is 1.638E-13. The results of models 3 and 4 with respect to P-rate and NHU (each layer) in the training and testing phases are presented in Table 4, where the expression softsign-tanh indicates that the kind of SAF in layers 1 and 2 of the GRU is softsign and tanh, respectively. The bold values indicate the characteristics of the optimal scenario. Consistent with Table 4, it can be noticed that, in both models, too high a value of NHU (50) leads to an increase in RMSE because of the overfitting problem, but a smaller value of NHU (15) causes the underfitting problem. Furthermore, for the same NHU, by increasing the P-rate value, the running time decreases, and at the same Prate, as the value of NHU increases, the running time increases. In summary, both models are more precise in the calibration stage than in the validation stage, but the GRU2× model is noticeably faster than the GRU2 model.

Results of the hybrid VMD-GRU model
In the hybrid VMD-GRU model, after several tests, the optimal values of P-rate, NHU and K are 0.5, 128 and 3, respectively, and the appropriate sort of SAF is determined to be tanh. In the validation stage, the hybrid VMD-GRU model under optimal hyperparameters results in an R 2 of 0.92 and an RMSE of 0.16 m. Figure  6 presents the values of RMSE against NHU under the optimal value of hyperparameters for the K values used in the validation stage.

Performance comparison
In the current research, the authors emphasize the substructure and capacity of the developed models (models 1-4) to assess their performance. In this regard, MAT-LAB 2021a (MATLAB 2021) offers a noteworthy parameter, namely, total learnable parameters (TLP), as a vital measure to judge the operational capacity of the models. A model with optimal NHU and NTP results in an appropriate TLP, and, accordingly, encourages a reduction in the impacts of the overfitting condition. In this direction, Table 5 presents the TLP values of the developed models (models 1-4) under the optimal scenario.
According to Table 5, model 3, owing to the maximal value of TLP, is seen as the most exact and best of the  developed GRU-based models. However, based on the values of RMSE and R 2 , it can be differentiated that for the same hyperparameters the novel proposed model 4 works better than model 3. This incongruity and incompatibility can mainly be explained by the excessing the TLP value in model 3 compared to models 4 and 2. In other words, as a result of overcapacity, model 3 has memorized the calibration dataset and has overfitted and, consequently, has lost out in the optimization procedure.
All things considered, model 3 has poor performance and cannot be considered as an appropriate model to forecast the average monthly time-series GWL in the study region.
From the TLP values in Table 5, it can be inferred that for the same hyperparameters, adding a multiplication layer yields a fit NTP accompanied by a suitable TLP in comparison with model 3. Model 4 decreases the value of RMSE by 25% and 18%, respectively, compared with models 3 and 2. Overall, it cannot be unconditionally concluded that only by increasing the number of layers and NHU will the precision and ability of GRU-based models definitively improve. In particular, as a significant outcome, to accomplish an effective GRU-based model, a fitting structural NTP with suitable TLP and an appropriate NHU should be designed and operated.   In connection with the number of iterations and running time for model training in the optimal scenarios, model 2, because of the lowest TLP value, can learn suboptimal weight sets more quickly (2000 reiterations in 71 s). Nonetheless, model 3, for the maximum TLP value, can learn further and even globally optimum weight sets, and, consequently, it takes too long to train (2000 reiterations in 102 s).

Selection of the best model
In this research, to select the best model, taking into account the effect of running time for the model training scale, the AICc and model grading process are performed.
Because the applied models have different layers, their tuned parameters are not similar. Accordingly, it is necessary to equate the performance of the developed models not only in terms of accuracy but also in terms of simplicity. In this regard, the AICc is usually used to scrutinize and compare the developed models by considering the precision and intricacy of the model concurrently (Zeynoddin et al., 2019). The AICc index is defined as follows : where n and k are the numbers of months and parameters, respectively, and σ ε is the standard deviation of the residuals.
The model grading procedure presented by Vaheddoost et al. (2016) is used for a complementary statistical judgment. In this method, the success grade (SG) and failure grade (FG) of performance significant parameters are expressed as follows, along with the NSE: The total grade (TG) of each model is obtained by adding the SG and FG of each model separately; the TG ranges between −20 and +20, and is defined as The values of AICc and TG obtained by Equations (14)-(20) are given in Table 6. Figure 7 shows a graphical comparison of the observed and forecast time-series GWL by model 4 and the hybrid VMD-GRU model in the testing phase. Examination of Figure 7 validates that the hybrid VMD-GRU model can remember the previous measured time-series GWL and capture the variational trend, especially in the bottom and peak points. Note: VMD-GRU = variational mode decomposition-gated recurrent unit; AICc = corrected Akaike's information criterion; NSE = Nash-Sutcliffe efficiency; RMSE = root mean square error; TG = total grade.
According to Table 6, where the hybrid VMD-GRU model has the lowest AICc and RMSE, combined with the highest R 2 and NSE values, by discarding the effect of running time criteria in the selection process of the best method, this hybrid model may be considered the most precise technique for predicting the mean monthly time-series GWL. However, considering the effect of the running time, because the highest TG value is found for the GRU2× model (model 4), it is considered to be the best method. All in all, based on Table 6 and Figure 7, since the hybrid VMD-GRU model is slightly more accurate than model 4, the authors prefer model 4 as the optimum model for forecasting GWL. Despite the superiority of model 4, it has some limitations, such as the need for continuous measurement of different water table depths and the high number of monitoring piezometers required over a long period to forecast exactly the average monthly GWL in the study region. Furthermore, as GWL is not an unremitting stationary multifactor time series with a lagged response, forecasting the GWL for farming purposes is often expected to be time-consuming.

Conclusions
In this study, the time-series GWL was predicted in Qoşaçay plain, an arable region in Qoşaçay city, West Azerbaijan Province, Northwest Iran. For this purpose, 216 monthly measured water table depths of 33 monitoring piezometers in the long period between April 2002 and March 2020 were employed. For modeling, three different-layer GRU-based neural network models, along with the hybrid VMD-GRU model, were developed. To compare the performance of these models, the general single-LSTM-layer network model was developed as the base model. For better configuration and reduced overfitting effects, algorithm tuning was performed using various types of activation function combinations, P-rate and NHU as hyperparameters, along with a trial-anderror procedure. The main conclusions of this study are as follows: (1) The general single-LSTM-layer network model (model 1) was very slightly more precise than the general GRU-layer network model (model 2), but, as it required more NHU, it took too long to reach convergence in comparison with model 2. Also, in both models 1 and 2, tanh was selected as the most suitable and fastest type of SAF.
(2) In the calibration phase, all developed models were more precise than in the validation stage. As well, for the same NHU, by increasing P-rate values, the running time for model training decreased. However, for the same P-rate, by increasing the value of NHU, the running time increased. (3) After several experiments, the softsign-tanh combination in SAF was determined to be an optimal assemblage in both the GRU2 model (model 3) and GRU2× model (model 4). In addition, model 4 was noticeably faster than model 3. (4) The hybrid VMD-GRU model under the optimal hyperparameter values (P-rate = 0.5, NHU = 128, K = 3) and tanh as the most suitable type of SAF resulted in an R 2 of 0.92 and an RMSE of 0.16 m. This hybrid model was a little more precise than the GRU2× model, but it took too long to reach convergence compared to all of the developed models. (5) Despite the maximum value of TLP being obtained by model 3, because of the highest value of TG in the GRU2× model, the latter is preferred as the best model. As an important consequence, it can be deduced that to achieve an operative model, the optimal NTP, NHU and TLP must be used in the emulation process.
The newly proposed GRU-based structure (model 4) is different from typical neural network models and has not been previously utilized in the field of hydrology. It can be viewed as an accurate DL-based structure and a robust intelligent model since its performance was verified by statistical metrics. In contrast to other models, it can significantly reduce the computational costs and is therefore cheaper to run. Moreover, owing to the innovative framework and construction of the model (model 4), which was developed in environments lacking records on meteorological parameters, it can be straightforwardly used in various climatic regions around the world. Also, for the same hyperparameters, its innovative and suitable layer network structure resulted in a suitable TLP value among the designed GRU-based models. Although the current study evaluated the ability of different structures of the GRU-based model and hybrid VMD-GRU model in forecasting GWL, future investigations could be performed using other types of models by hybridizing DNN methods with optimization algorithms such as the genetic algorithm or PSO. The relevant outcomes could be equated with the results of this research so that the best AI approaches can be determined. The authors also strongly recommend predicting GWL in an environment in which meteorological and hydrological data records are available, to investigate the effect of these parameters on the ability and precision of the designed models. Finally, the authors advocate using the TLP factor as a useful measure to judge the predictive ability, overfitting conditions and performance of DL-based models, since it can specify the functioning dimensions.