Prediction of evaporation in arid and semi-arid regions: a comparative study using different machine learning models

Evaporation, one of the fundamental components of the hydrologic cycle, is influenced differently by various meteorological variables in different climatic regions. The accurate prediction of evaporation is essential for multiple water resources engineering applications, particularly in developing countries like Iraq, where meteorological stations are not sustained and operated appropriately for in situ estimations. This is where advanced methodologies such as machine learning (ML) models can make valuable contributions. In this research, evaporation is predicted at two different meteorological stations located in arid and semi-arid regions of Iraq. Four different ML models for the prediction of evaporation – the classification and regression tree (CART), the cascade correlation neural network (CCNN), gene expression programming (GEP), and the support vector machine (SVM) – were developed and constructed using various input combinations of meteorological variables. The results reveal that the best predictions are achieved by incorporating sunshine hours, wind speed, relative humidity, rainfall, and the minimum, mean, and maximum temperatures. The SVM was found to show the best performance with wind speed, rainfall, and relative humidity as inputs at Station I (R² = .92), and with all variables as inputs at Station II (R² = .97). All the ML models performed well in predicting evaporation at the investigated locations.


Introduction
Evaporation is a key process in the hydrologic cycle with a direct effect on the planning and operation of water resources (Penman, 1948; Stewart, 1984). Therefore, the accurate quantification of evaporation is very important to water managers and engineers (Qasem et al., 2019). The rate of evaporation is extremely high in arid and semi-arid environments, such as in Iraq (Sayl, Muhammad, Yaseen, & El-shafie, 2016). High evaporation rates cause substantial volumes of water in reservoirs, natural lakes, and river basins to vaporize into the atmosphere (Boers, De Graaf, Feddes, & Ben-Asher, 1986; Khan et al., 2019). Hence, the rate of water loss from surface water bodies must be considered when designing and operating dams and other hydraulic structures for irrigation and water resources management (Moazenzadeh, Mohammadi, Shamshirband, & Chau, 2018). The effect of climate change on evaporation also highlights its paramount importance in the surface water balance (Sartori, 2000); evaporative losses have made this issue increasingly significant for water resources management in recent years (Eames, Marr, & Sabir, 1997; Priestley & Taylor, 1972). Evaporation estimation can be accomplished using direct or indirect approaches (Moran et al., 1996; Penman, 1948). One direct method involves measuring the evaporation rate from a Class A pan (1.22 m in diameter and 0.25 m deep) positioned 0.15 m above the soil surface (Stanhill, 2002). This approach provides accurate evaporation estimates over time and is relatively simple and inexpensive compared with methods that require extensive instrumentation at meteorological stations (Ali Ghorbani, Kazempour, Chau, Shamshirband, & Taherei Ghazvinei, 2018). However, the general applicability of the Class A pan approach is restricted across climate regions, because climate characteristics vary from one region to another.
The development of indirect estimation methods based on different meteorological variables such as sunshine hours, wind speed, relative humidity, rainfall, and the minimum, mean, and maximum temperatures is often suggested for the estimation of evaporation, especially when working with empirical and semi-empirical models (Ali Ghorbani et al., 2018; Lu et al., 2018). However, the major problem with this form of evaporation estimation is the dynamic nature of the applied meteorological variables, owing to their nonlinearity, non-stationarity, and stochastic features. It is thus imperative to develop reliable and robust intelligent predictive models of evaporation. The development of such models has become a major focus in water resources management and engineering (Khan, Shahid, Ismail, & Wang, 2018).
ML models such as the classification and regression tree (CART), the cascade correlation neural network (CCNN), gene expression programming (GEP), and the support vector machine (SVM) have achieved significant advancements in hydrologic modeling (Danandeh Mehr et al., 2018; Fahimi et al., 2016; Jing et al., 2019; Yaseen et al., 2019). These models can efficiently mimic and resolve the stochasticity of complex hydro-climatological processes. Recent evaporation prediction studies have demonstrated noticeable progress toward better, more reliable, and more generalizable predictive models. This has also been the aim when developing and implementing new evaporation prediction methods, as the target is to achieve low prediction errors.
In this context, the aim of this study is to investigate the feasibility of the four different ML models listed above for modeling the evaporation at two Iraqi meteorological stations located in Mosul and Baghdad. The performances of the four applied models are compared in order to assess their prediction efficiencies and evaluate the role of the various climatic factors in the prediction of evaporation in arid and semi-arid regions.

Case study and data description
Iraq is mostly characterized by an arid to semi-arid climate, ranging from semi-humid in the north to semi-arid in the south (Chenoweth et al., 2011). Iraq experiences a lack of water resources and is susceptible to drought (Al-Ansari, Ali, & Knutsson, 2014; Lelieveld et al., 2012). This severely affects the socioecological system of the Tigris basin, which has a population of over 18 million. Rising temperatures are associated with an increasing scarcity of surface water and declining water tables in aquifers, which indicates that the current drought conditions may intensify in the coming years. Temperatures and droughts in the region are forecast to increase steadily until they reach unsustainable levels (Abbas, Wasimi, Al-Ansari, & Sultana, 2018). Currently, Iraq loses about 61% of its total precipitation to evaporation (Al-Taai & Hadi, 2018), which has a significant impact on the country's hydrological cycle. Hence, it is necessary to make accurate estimations of evaporation under various climates, especially those of drought-prone Iraq (Abdullah, Malek, Abdullah, Kisi, & Yap, 2015).
Drought is usually a direct outcome of the balance between precipitation and temperature (Moazenzadeh et al., 2018). The Tigris basin experiences an annual average rainfall ranging between 400 and 600 mm, but it can be as low as 150 mm downstream and as high as 800 mm upstream in some years. Based on data from the Iraqi meteorological stations, both Baghdad and Mosul recorded high temperatures over the period from 1999 to 2009. The mean July temperature in Baghdad ranged from 23.5 to 44.0°C, while the annual rainfall and evaporation rates for the period were 244 and 3200 mm, respectively. Mosul experienced a mean July temperature range of 24.8 to 43.0°C, with annual precipitation and evaporation rates of 729 and 3900 mm, respectively. Figure 1 shows the locations of the meteorological stations.

CART model
The CART is an ML approach that has been widely used to develop regression and classification models (Breiman, Friedman, Olshen, & Stone, 1984). The CART structure consists of a set of nodes, and each node contains a portion of the data (Kumar, Pandey, Sharma, & Flügel, 2016). A CART model is constructed via recursive partitioning: starting with a single root node, that node is split into left and right child nodes (Pham, Bui, & Prakash, 2018). These child nodes can additionally be split in turn, becoming parents of their own child nodes, and so on. Three types of node are used: the root node, the inner nodes, and the terminal nodes. The root node, known as the 'first parent', is a parent only; the inner nodes are both children and parents; and the terminal nodes, being the last on their branches, are children only, hence they are also referred to as leaf nodes. All data are included in the root node. Two steps are carried out for each split in the tree: the variable to be used for splitting is selected, then the sets of variable values to be inherited by the left child node and the right child node are defined. The partitioning can then be drawn diagrammatically as a tree (Figure 2(a)).
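The recursive partitioning described above can be sketched with a standard library implementation. This is a minimal, hypothetical example using scikit-learn's DecisionTreeRegressor as a stand-in for CART, with synthetic data in place of the paper's meteorological records; the variable roles and coefficients are illustrative assumptions only.

```python
# Minimal sketch (assumed setup): fitting a regression tree to a toy
# evaporation-like target, using scikit-learn's DecisionTreeRegressor
# as a stand-in for the CART algorithm described in the text.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic stand-ins for [wind speed, relative humidity, T_max]
X = rng.uniform([0, 10, 20], [10, 90, 50], size=(200, 3))
# Hypothetical response: rises with wind and temperature, falls with
# humidity (illustrative only, not the paper's data).
y = 2.0 * X[:, 0] - 0.5 * X[:, 1] + 3.0 * X[:, 2] + rng.normal(0, 1, 200)

# max_depth / min_samples_split mirror the CART controls mentioned later
# in the text (maximum tree levels, minimum size of node that can split).
tree = DecisionTreeRegressor(max_depth=5, min_samples_split=10, random_state=0)
tree.fit(X, y)
print(round(tree.score(X, y), 3))  # R^2 on the training data
```

Each internal node of the fitted tree stores the chosen splitting variable and threshold, exactly as in the two-step split procedure outlined above.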

CCNN model
The CCNN (Fahlman & Lebiere, 1990) is a self-organizing network which begins with only input and output neurons. Every input is linked to every output, and each connection is defined by an adjustable weight. From this starting point the network is trained, a process through which neurons are selected from a pool of candidates and added to a hidden layer. One neuron at a time is added to the hidden layer, and once added these neurons do not change. Because the CCNN is self-organizing and grows during the training process, it is not necessary to define the numbers of layers and neurons in advance. Variables are fed into the input layer, which is connected to the output layer through adjustable weights together with a constant parallel input known as a bias value; the neurons in the hidden layer transform these inputs and pass their values on to the output layer. The network thus uses feedback from the outputs, in conjunction with the neurons in the hidden layer, to maximize the correlations while minimizing the residual error. Figure 2(b) presents the cascade architecture.
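The growth loop above can be illustrated with a highly simplified sketch. This is not Fahlman and Lebiere's full algorithm (there is no candidate pool, and the candidate training is a bare-bones gradient step on the residual correlation); it only demonstrates the core idea that units are added one at a time, frozen, and followed by a refit of the output layer. All data and settings are hypothetical.

```python
# Simplified, hypothetical sketch of cascade-correlation growth:
# start with a direct input-to-output linear map, then add hidden tanh
# units one at a time, each trained to correlate with the current
# residual error; after each addition the output layer is refit.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = np.tanh(X[:, 0] + 2 * X[:, 1]) + 0.1 * rng.normal(size=300)  # toy target

def fit_output(features, y):
    """Least-squares fit of the output layer (with bias term)."""
    A = np.column_stack([features, np.ones(len(features))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w, A @ w

features = X.copy()
w, pred = fit_output(features, y)
errors = [np.mean((y - pred) ** 2)]

for _ in range(3):  # grow the network by three hidden units
    resid = y - pred
    # Train one candidate unit by gradient ascent on its covariance
    # with the residual (stand-in for the full candidate pool).
    v = rng.normal(size=features.shape[1] + 1) * 0.1
    inp = np.column_stack([features, np.ones(len(features))])
    for _ in range(200):
        h = np.tanh(inp @ v)
        cov = np.mean(h * (resid - resid.mean()))
        grad = inp.T @ ((resid - resid.mean()) * (1 - h ** 2)) / len(y)
        v += 0.5 * np.sign(cov) * grad
    h = np.tanh(inp @ v)
    features = np.column_stack([features, h])  # the unit is frozen once added
    w, pred = fit_output(features, y)
    errors.append(np.mean((y - pred) ** 2))

print(errors[0] > errors[-1])  # training error shrinks as the network grows
```

Because the output layer is refit by least squares over a strictly larger feature set after each addition, the training error can only decrease as the cascade grows, which is the behavior the text describes.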

GEP model
Gene expression programming (GEP) is an evolutionary learning model introduced by Ferreira (2001) that is made up of chromosomes. Each gene consists of two parts, the head and the tail, which are encoded in a linear series of fixed-length chromosomes or genomes and then represented as non-linear entities (expression trees) of various shapes and sizes (Ferreira, 2006). Mathematical expressions are automatically represented by an expression tree that consists of nodes containing functions and leaves containing variables and constants. The generated candidates are assessed using the root relative squared error as a fitness function. The best candidates are then reassessed by applying a modification-evaluation cycle until the best solution is achieved. The Karva language is used to translate the expression tree by reading it from left to right and from top to bottom. A GEP model is developed in five major steps: (1) identifying the set of independent variables to be utilized in the individual programs; (2) defining a set of functions and arithmetic operations; (3) selecting the fitness measure; (4) selecting the head length, the number of genes, and the linking function; and (5) selecting the genetic operators to be used (Ferreira, 2006).
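The left-to-right, top-to-bottom Karva translation described above can be made concrete with a small decoder. This is a hypothetical illustration (the gene string and variable values are invented for the example, not taken from the paper): the gene is read symbol by symbol and the tree is filled level by level.

```python
# Illustrative sketch: decoding a Karva-language gene (K-expression)
# into an expression tree, reading left to right and filling the tree
# top to bottom, then evaluating it.
import operator

FUNCS = {'+': operator.add, '-': operator.sub, '*': operator.mul}

def eval_karva(gene, env):
    """Build the tree breadth-first, then evaluate it against env."""
    root = {'sym': gene[0], 'kids': []}
    queue, pos = [root], 1
    while queue:
        node = queue.pop(0)
        arity = 2 if node['sym'] in FUNCS else 0  # terminals take no children
        for _ in range(arity):
            child = {'sym': gene[pos], 'kids': []}
            pos += 1
            node['kids'].append(child)
            queue.append(child)

    def ev(n):
        if n['sym'] in FUNCS:
            return FUNCS[n['sym']](ev(n['kids'][0]), ev(n['kids'][1]))
        return env[n['sym']]  # leaf: look up the variable's value

    return ev(root)

# '+*abc' decodes to (b * c) + a: the root '+' takes the next two
# symbols '*' and 'a' as children, then '*' takes 'b' and 'c'.
print(eval_karva('+*abc', {'a': 1, 'b': 2, 'c': 3}))  # prints 7
```

The same linear gene always decodes to a valid tree regardless of its contents, which is what allows GEP's genetic operators to act on fixed-length strings while the expressed trees vary in shape and size.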

SVM model
The SVM is a relatively new data-driven method rooted in statistical learning theory that can be used to solve regression and classification problems (Vapnik, 1995). Classified as a new-generation learning machine, the SVM uses a hyperplane to separate the data after mapping them from the original dimension into a high-dimensional space, and then solves the regression problem using the following equation (Raghavendra & Deka, 2014):

$$f(x) = \sum_{i=1}^{N} w_i \, k(x_i, x) + b \qquad (1)$$

where $w$ is the weight vector, $b$ is the bias value, and $k(x_i, x)$ is the kernel function. The values of the internal parameters are determined using the least squares method by minimizing the sum of the squared deviations. The most common regression modeling method in the SVM is called $\varepsilon$-SVM, the cost function for which is formulated as follows:

$$L_{\varepsilon}(y, f(x)) = \begin{cases} 0, & |y - f(x)| \le \varepsilon \\ |y - f(x)| - \varepsilon, & \text{otherwise} \end{cases} \qquad (2)$$

The goal of the cost function is to ignore deviations smaller than $\varepsilon$ and penalize only those that exceed it. The minimization problem can be expressed as:

$$\min \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \left(\varepsilon_i + \varepsilon_i^{*}\right) \qquad (3)$$

which is subject to

$$y_i - f(x_i) \le \varepsilon + \varepsilon_i, \quad f(x_i) - y_i \le \varepsilon + \varepsilon_i^{*}, \quad \varepsilon_i, \varepsilon_i^{*} \ge 0 \qquad (4)$$

where $C$ is the cost factor and $\varepsilon_i$ and $\varepsilon_i^{*}$ are the slack variables. Linear, polynomial, radial basis, and sigmoid kernel functions can be used in the SVM algorithm. A trial and error technique is employed to select the best kernel function for the specified problem according to the results. In the SVM model the predictors are called attributes, and a hyperplane known as a feature space is used to transform these attributes. The task of selecting the optimum representation is called feature selection, and the set of features that describes a row of predictor values is called a vector. The vectors that lie closest to the hyperplane are known as support vectors. The model accuracy depends on the choice of optimal parameters for the kernel operation, such as $C$ and $\gamma$. $C$ is a trade-off between the estimation error and the weight of the vector. The SVM uses the cross-validation method to mitigate overfitting. An example of a traditional SVM network is shown in Figure 2(c).
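The parameter-selection process described above (choosing the kernel and tuning C and epsilon, with cross-validation guarding against overfitting) can be sketched as follows. This is a hypothetical setup using scikit-learn's SVR on synthetic data; the parameter grid is an illustrative assumption, not the grid used in the study.

```python
# Minimal sketch (assumed setup): epsilon-SVR with an RBF kernel, with
# C and epsilon chosen by cross-validated grid search, mirroring the
# trial-and-error parameter selection described in the text.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(150, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.05 * rng.normal(size=150)  # toy target

pipe = make_pipeline(StandardScaler(), SVR(kernel='rbf'))
grid = GridSearchCV(
    pipe,
    {'svr__C': [1, 10, 100], 'svr__epsilon': [0.01, 0.1]},  # assumed grid
    cv=5,  # cross-validation guards against overfitting
)
grid.fit(X, y)
print(round(grid.best_score_, 2))  # mean cross-validated R^2
```

The fitted model's support vectors are exactly the training points lying on or outside the epsilon-tube, i.e. those with nonzero slack in equations (3)-(4).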

Model development
Choosing the appropriate predictors is one of the most important steps in building a robust predictive model (Yaseen, El-shafie, Jaafar, Afan, & Sayl, 2015). In the present study, four different ML models (the CART, the CCNN, GEP, and the SVM) were selected for predicting monthly evaporation. The suggested models were developed using five different input combinations, Model 1 (M1) through Model 5 (M5), constructed from the following variables: ETo, the evaporation; WS, the wind speed; RF, the rainfall; RH, the relative humidity; Tmin, the minimum temperature; Tmax, the maximum temperature; and Sh, the number of sunshine hours. The hydrometeorological data were divided into two phases, consisting of 80% for training and 20% for testing. The data division was performed based on a trial and error procedure through which the best prediction performance was attained. There was no way to anticipate which settings each proposed model would need to obtain the optimum prediction, so several models were developed using a trial and error process and their results were compared in order to select the optimal settings in each case. The GEP model requires the selection of the function set, population size, number of genes per chromosome, gene head length, fitness function, and linking function; its genetic operators include the mutation rate, inversion rate, gene transposition rate, one-point recombination rate, two-point recombination rate, gene recombination rate, IS transposition rate, and RIS transposition rate. The SVM model requires the selection of the parameters of the proposed kernel function. The CCNN model requires the selection of the kernel function and the number of hidden neurons created during the training process. The CART model requires the selection of the minimum number of rows per node, the minimum size of node that can be split, and the maximum number of tree levels.
The predictive modeling software DTREG was used to develop the models in this study (Sherrod, 2003).
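The 80/20 data division and side-by-side model comparison described above can be sketched in code. The study itself used DTREG; this hypothetical sketch substitutes scikit-learn equivalents for two of the four model families and uses synthetic data, so all names, coefficients, and settings are assumptions.

```python
# Hypothetical sketch (the paper used the DTREG package): an 80/20
# train/test split and a side-by-side comparison of two of the four
# model families using scikit-learn stand-ins on synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(250, 5))            # stand-in meteorological inputs
y = X @ np.array([1.5, -0.8, 0.3, 0.0, 2.0]) + 0.1 * rng.normal(size=250)

# 80% training / 20% testing, as in the study's data division
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    'CART': DecisionTreeRegressor(max_depth=6, random_state=0),
    'SVM': SVR(kernel='rbf', C=10.0),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te)
          for name, m in models.items()}
print(scores)  # test-set R^2 for each model
```

In the study the same comparison was repeated for each of the five input combinations M1-M5, so that both the model family and the predictor set could be selected by trial and error.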

Prediction performance metrics
The performance of the predictive models was assessed using the following statistical metrics: the root mean square error (RMSE), the mean absolute error (MAE), the Nash-Sutcliffe coefficient (NSE), Willmott's index (WI), Legates and McCabe's index (LMI), and the determination coefficient (R²) (Chai & Draxler, 2014; Legates & McCabe, 1999). The mathematical expressions of the performance metrics are as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(ETo_{obs,i} - ETo_{pred,i}\right)^2}$$

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|ETo_{obs,i} - ETo_{pred,i}\right|$$

$$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{N}\left(ETo_{obs,i} - ETo_{pred,i}\right)^2}{\sum_{i=1}^{N}\left(ETo_{obs,i} - \overline{ETo}_{obs}\right)^2}$$

$$\mathrm{WI} = 1 - \frac{\sum_{i=1}^{N}\left(ETo_{obs,i} - ETo_{pred,i}\right)^2}{\sum_{i=1}^{N}\left(\left|ETo_{pred,i} - \overline{ETo}_{obs}\right| + \left|ETo_{obs,i} - \overline{ETo}_{obs}\right|\right)^2}$$

$$\mathrm{LMI} = 1 - \frac{\sum_{i=1}^{N}\left|ETo_{obs,i} - ETo_{pred,i}\right|}{\sum_{i=1}^{N}\left|ETo_{obs,i} - \overline{ETo}_{obs}\right|}$$

$$R^2 = \left(\frac{\sum_{i=1}^{N}\left(ETo_{obs,i} - \overline{ETo}_{obs}\right)\left(ETo_{pred,i} - \overline{ETo}_{pred}\right)}{\sqrt{\sum_{i=1}^{N}\left(ETo_{obs,i} - \overline{ETo}_{obs}\right)^2}\sqrt{\sum_{i=1}^{N}\left(ETo_{pred,i} - \overline{ETo}_{pred}\right)^2}}\right)^2$$

where $ETo_{obs}$ and $ETo_{pred}$ are the observed and predicted evaporation values, and $\overline{ETo}_{obs}$ and $\overline{ETo}_{pred}$ are the mean values of the observed and predicted evaporation.
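The six metrics can be computed in a few lines. This is a compact implementation following the standard definitions of the metrics named above; the sample observed/predicted values are invented for illustration.

```python
# Compact implementation (standard definitions) of the six performance
# metrics used to assess the models: RMSE, MAE, NSE, WI, LMI, and R^2.
import numpy as np

def metrics(obs, pred):
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    err = obs - pred
    ob, pb = obs.mean(), pred.mean()
    return {
        'RMSE': np.sqrt(np.mean(err ** 2)),
        'MAE': np.mean(np.abs(err)),
        'NSE': 1 - np.sum(err ** 2) / np.sum((obs - ob) ** 2),
        'WI': 1 - np.sum(err ** 2)
              / np.sum((np.abs(pred - ob) + np.abs(obs - ob)) ** 2),
        'LMI': 1 - np.sum(np.abs(err)) / np.sum(np.abs(obs - ob)),
        'R2': (np.sum((obs - ob) * (pred - pb)) ** 2
               / (np.sum((obs - ob) ** 2) * np.sum((pred - pb) ** 2))),
    }

# Illustrative values only, not from the paper's data
m = metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print({k: round(v, 3) for k, v in m.items()})
```

Note that RMSE and MAE are error measures (lower is better), while NSE, WI, LMI, and R² are agreement measures that all equal 1 for a perfect prediction.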

Application and analysis
The performance of the four constructed ML models was tested by predicting evaporation at two different meteorological stations in Iraq. Station I (in Mosul) is located in a semi-arid environment, whereas Station II (in Baghdad) is located in an arid environment. Evaporation was predicted using five different input combinations of related meteorological variables (M1, M2, M3, M4, and M5). The performance of the four ML models during the training and testing phases at Station I in Mosul is given in Tables 1-4. The results show that during training the CCNN model produced the most accurate prediction, with the lowest RMSE (23.83 mm d−1) and the highest R² (.97), using only the three input variables of wind speed, rainfall, and relative humidity. In general, however, the best prediction capacity is attained when all the input variables are used for the prediction matrix (the M5 input combination). This reflects the high correlation of evaporation with the different climate variables, namely wind speed, rainfall, sunshine hours, humidity, and air temperature. The second-best model during the training process was the CART model, which attained its lowest RMSE (33.33 mm d−1) and highest R² (.94) for the M5 input combination. During the testing phase, all four models attained their best performances for the M2 input combination, wherein just the wind speed, rainfall, and relative humidity are used. The results also show that the CCNN model performed best on all the metrics during the training phase, while the other three models produced performances that were relatively similar to each other. However, the CART, GEP, and SVM models produced very similar results for all the performance metrics during the testing phase for the M2 input combination, and their performance was found to be better than that of the CCNN model in this phase.
For Station II in Baghdad, all the applied ML models achieved their best performance during both the training and testing phases when the M5 input combination was used (Tables 5-8). The results show that the CART performed better than the other three models during the training phase, whereas the SVM performed the best during the testing phase. In terms of the statistical metrics, the SVM yielded the lowest RMSE (32.27 mm d−1) and the highest R² (.97).
Overall, the results from the two stations indicate that the applied ML models achieved their best performance in terms of the six metrics when all the climatological information was incorporated, i.e. for the M5 input combination. In other words, the results demonstrate that the performance of these models improves as the number of inputs increases. Moreover, a comparison of the performance metrics from the two stations shows that the ML models tend to produce different results at different stations, which indicates that model performance also depends on the climate of the investigated area. Figures 3 and 4 present box plots of the relative errors computed for Stations I and II, respectively. The figures show that the performances differ based on the employed ML model, the input combination, and the location. Higher relative errors are obtained when the minimum number of inputs is used for constructing the predictive models. The error decreases exponentially as the number of inputs increases, which further supports the tabulated results presented above.
Scatter plots are also used to compare the performance of the ML models (Figures 5 and 6). The plots show the agreement between the observed and predicted evaporation using the determination coefficient (R²) and the slope of the regression model. Generally, the results show that all the models have the highest R² for the M5 input combination, and the fitted regression lines agree best with the observations for that combination.
Figures 7 and 8 exhibit the performance of the ML models at Stations I and II, respectively, through Taylor diagrams. The figures show a statistical summary of the predicted and observed evaporation in accordance with several statistical metrics, including the RMSE, standard deviations, and correlation coefficients. The results here demonstrate the superiority of the SVM model over the other applied models.
Finally, the observed and predicted evaporation are plotted as time series in Figures 9 and 10 for Stations I and II, respectively. The black and dark blue lines in the figures denote the observed and predicted time series, respectively. It can be clearly seen that the predicted series fluctuate more than the observed series when fewer input variables are used to develop the models, whereas the predicted series completely overlap the observed series when all the meteorological variables are used as inputs.
Overall, these findings demonstrate that the applied ML models produce more accurate results when the full set of meteorological information is used. The predictive performance varies with the method used to develop the model, with the SVM achieving the most accurate predictions in most cases. It is also important to note that the performance of the models differs according to the climate of the investigated region.

Conclusion
The prediction of evaporation is one of the most complex tasks in hydrological engineering. In nature, evaporation is associated with multiple climate variables and is thus characterized by high non-linearity and stochasticity. For a developing country like Iraq, evaporation monitoring and measurement are limited because meteorological stations are not well maintained. Thus, the introduction of ML technologies can contribute greatly to the better modeling of this hydrological process. Four different ML models – the CART, the CCNN, GEP, and the SVM – were developed for the prediction of evaporation at two meteorological stations in Iraq located in different climates, Station I in Mosul and Station II in Baghdad. Overall, the applied ML models demonstrate high performance in simulating evaporation for both arid and semi-arid climates. Among the four ML models, the SVM exhibits superior prediction performance. In quantitative terms, at Station I, M2 was the best input combination for the SVM, yielding an RMSE of 35.76 mm d−1, an MAE of 27.71 mm d−1, and an R² of .92 during testing.
At Station II, M5 was the best input combination for the SVM, yielding an RMSE of 32.27 mm d−1, an MAE of 24.75 mm d−1, and an R² of .97 during testing. Based on the attained prediction results, the predictive models demonstrated better accuracy at the Baghdad station, which can be attributed to the higher climatic stochasticity at the Mosul station. Further work is needed to assess the uncertainty in prediction arising from the model structures and input combinations.