Comparative analysis of artificial intelligence techniques for the prediction of infiltration process

ABSTRACT Knowledge of the infiltration process is beneficial in designing and planning irrigation networks and in soil erosion, hydrologic design, and watershed management. In this study, the infiltration process was analyzed using four predictive models, artificial neural network (ANN), multi-linear regression (MLR), Random Forest regression (RF), and M5P tree, and their performances were compared with an empirical model, the Kostiakov model. Field experimental data were used for training and testing the models, and their outcomes were assessed with suitable performance assessment parameters. The field dataset contained 340 observations, of which 70% were used for training and the remaining 30% for testing. The RF-based model performed better than the other models, with Nash-Sutcliffe model efficiency (NSE) equal to 0.9963 and 0.9904 for the training and testing stages, respectively. The ANN, MLR, and M5P models also gave good prediction performance, but the performance of the Kostiakov model was inferior. The sensitivity investigation suggests that cumulative time and soil moisture content are the most influential parameters for assessing the cumulative infiltration of soil.


Introduction
Infiltration is the movement of water into the subsurface from surface sources such as snowfall, irrigation, and precipitation. The soil-water relationship plays a crucial role in modeling for water management, control of droughts and floods, rainfall-runoff, risk evaluation, design and scheduling of irrigation systems, development of water resources, and drainage design. Various physical properties of the soil affect its infiltration characteristics. Soil texture, soil moisture, and density have considerable influence on the infiltration process (Angelaki et al., 2013). Soil texture in particular is one of the most crucial factors influencing the infiltration process. Water accessibility in the soil depends on the soil's water-holding ability, which is affected by the texture and structure of the soil (Al-Azawi, 1985). The infiltration rate is high in unsaturated soil; it reduces gradually and finally reaches a constant infiltration rate. Knowledge of infiltration is necessary for any valuable and durable water resources management project (Sihag et al., 2018a). The design and scheduling of an irrigation system rely on the soil's infiltration because it affects various design considerations of agriculture and canal systems.
Infiltration characteristics vary with scale owing to variation in soil texture, soil type, and other soil conditions. The experimental measurement of the infiltration process is laborious, tedious, and time-consuming (Vand et al., 2018). Assessment of the infiltration process is very complex due to its spatial and temporal variation (Pandey & Pandey, 2018).
Numerous studies (Mishra et al., 2003; Singh et al., 2018) proposed implementing conventional infiltration models as a substitute for experimental observation. The use of any specific model requires complete knowledge of the boundary conditions and assumptions of that model. Soil water scientists have introduced several infiltration models, such as Kostiakov, Horton, Philip, Holtan, Green-Ampt, Novel, and Modified Kostiakov, for estimating infiltration (Richards, 1931; Philips, 1957; Singh & Yu, 1990; Mishra et al., 2003; Sihag et al., 2017a). Mishra et al. (2003) divided these models into three groups: physical, semi-empirical, and empirical models. Most of these models are based on the basic assumptions of homogeneous water absorption, ponding head, and constant infiltration rate. These assumptions are hardly ever met under real field conditions, which may lead to inaccurate prediction of the infiltration process.
Some researchers have used an alternative method for estimating the infiltration process: soft computing based infiltration models built on soil properties. Several successful applications of soft computing based infiltration prediction have been reported in the literature (Tiwari et al., 2017; Singh et al., 2017; Sihag et al., 2017b). These studies concluded that soil physical properties and elapsed time are effective inputs for estimating the infiltration process with higher precision.
In recent years, soft computing approaches such as Random Forest, M5P, SVM, GEP, Gaussian process, ANFIS, and many others have been successfully implemented in water resources problems (Azamathulla et al., 2016; Parsaie & Haghiabi, 2014, 2015; Parsaie et al., 2017; Sihag, 2018; Sihag et al., 2017c). This paper uses a model based on RF, as proposed by Breiman (2001). It is a powerful tool for nonlinear regression and classification. Examples using the RF capability include infiltration process modeling (Singh et al., 2017). ANN works on the principle of the nerve cells of the brain. ANN has been widely applied in engineering and has been observed to perform better than conventional models, e.g., Sihag (2018), Haghiabi et al. (2017), and Ghorbani et al. (2016). M5P tree, initially proposed by Quinlan (1992), is a decision tree learner for regression problems. The M5P tree-based model places linear regression functions at the terminal nodes and fits a multivariate linear regression model to each subspace by classifying or separating the entire data space into multiple subspaces. M5P has been successfully used in water resources related studies, e.g., Sattari et al. (2018), Sattari et al. (2013), Pal et al. (2012), and Pal and Deswal (2009). RF, M5P, and ANN-based models extract knowledge from the data itself. The best performing model is identified using appropriate performance assessment parameters such as Nash-Sutcliffe model efficiency, root mean square error, mean square error, and correlation coefficient. In this study, models are developed to predict the cumulative infiltration of soil, and the performances of soft computing-based models are compared with those of empirical models (Kostiakov model and multilinear regression (MLR)).

Study area
Kurukshetra district lies in the north-east part of Haryana state, India, and is bounded by North latitudes 29°53ʹ00" and 30°15ʹ02" and East longitudes 76°26ʹ27" and 77°07ʹ57". Thanesar Tehsil of Kurukshetra district was selected as the study area. The total area of the Kurukshetra district is 1530 km². The site map of the study area is given in Figure 1. The study area (Thanesar) is a division of the Ghaggar basin. A total of 20 different sites were selected for experimentation in the study area. The coordinates of all sampling sites are listed in Table 1. The texture of the soil is listed in Table 2.

Dataset
The whole dataset containing 340 observations from field infiltration experiments was separated into two groups: training and testing. Training data involves 70% of the total data chosen randomly from the whole data set, while testing data consists of the remaining 30% of the entire data. The features of training and testing data sets are represented in Table 3. Time, sand, clay, silt, bulk density, and moisture content are input parameters, and soil's cumulative infiltration is the target.
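The random 70/30 partition described above can be sketched as follows. This is only an illustration: the function name, the fixed seed, and the use of index lists are assumptions, and the source does not state how the random selection was actually carried out.

```python
import random


def split_indices(n_obs, train_fraction=0.7, seed=42):
    """Randomly partition observation indices into training and testing sets.

    The seed is an assumption, used here only to make the sketch reproducible.
    """
    indices = list(range(n_obs))
    rng = random.Random(seed)
    rng.shuffle(indices)
    n_train = round(n_obs * train_fraction)
    return indices[:n_train], indices[n_train:]


# 340 field observations, 70% for training and the remaining 30% for testing
train_idx, test_idx = split_indices(340)
print(len(train_idx), len(test_idx))  # 238 102
```

Each index would point to one observation with its six inputs (time, sand, clay, silt, bulk density, moisture content) and its measured cumulative infiltration.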

Observation of cumulative infiltration
Experiments were performed to measure the cumulative infiltration of soil at the study area's locations using a mini-disk infiltrometer (Decagon Devices Inc., Devices, 2014). A mini-disk infiltrometer has two chambers. One is a water reservoir, and the other is a bubble chamber. Both are connected via a Mariotte tube. This tube is used to provide a steady water pressure head of 0.05 to 0.7 kPa. The instrument's bottom part contains a porous sintered steel disk with a diameter of 4.5 cm and a thickness of 3 mm. Both chambers are filled with water and the instrument is placed on the flat surface of the soil (Figure 2), resulting in water moving into the soil. During the measurement, the quantity of water in the reservoir chamber was recorded at specific intervals. The flowchart of the current investigation is shown in Figure 3. The first step was designing the experiments, followed by the collection and analysis of data, comparison of the artificial intelligence techniques and empirical models, selection of the best-fitted model for prediction of the infiltration process, and conclusion.

Artificial neural networks (ANN)
ANN is a data mining approach widely implemented in several engineering fields. The idea behind the ANN model is inspired by the nerve cells of the human brain. ANN is a parallel information processing system containing sets of neurons arranged in layers. In this study, the ANN model includes three layers: input, hidden, and output. The input layer receives the data, the hidden layer processes them, and the output layer delivers the model's target output. Each input to a neuron in the hidden and output layers is multiplied by a corresponding interconnection weight ($X_{ij}$), and the products are summed together with a constant threshold value called the bias ($y_i$). The addition and multiplication functions in every neuron are shown in equation 1. $P_j$ is the output produced by the activation function for unit j. Complete information about ANN is provided by Haykin (1994).
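The weighted sum and activation performed by a single neuron can be sketched as below. The sigmoid activation, the function name, and the example numbers are assumptions for illustration; the source does not state which activation function was used.

```python
import math


def neuron_output(inputs, weights, bias):
    """Output of one neuron: weighted sum of inputs plus bias, then activation.

    A logistic (sigmoid) activation is assumed here for illustration only.
    """
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-net))


# Hypothetical example: two scaled inputs feeding one hidden neuron
example = neuron_output(inputs=[0.2, 0.8], weights=[0.5, -0.3], bias=0.1)
print(example)
```

In a three-layer network, the outputs of all hidden neurons computed this way become the inputs to the output neuron.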

M5P model (M5P)
M5P tree, proposed by Quinlan (1992), is a decision tree learner for regression problems. This tree algorithm assigns linear regression functions at the terminal nodes. It fits a multivariate linear regression model to each subspace by classifying or dividing the whole data space into several subspaces. The M5 tree model develops conditional linear models for the nonlinear behavior of the data set. The splitting criterion for the M5 tree model is based on a measure of the error at every node. The error is calculated as the standard deviation of the class values that arrive at a node. The standard deviation reduction (SDR) is defined as follows:
$$\mathrm{SDR} = sd(Z) - \sum_{i} \frac{|Z_i|}{|Z|}\, sd(Z_i)$$
where Z is the set of instances that arrive at the node, $Z_i$ is the subset of instances with the i-th outcome of the possible set, and sd is the standard deviation. The splitting process ends when the target values of all instances that arrive at the node differ only slightly.
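As a small illustration of the splitting measure, the sketch below computes the standard deviation reduction for one candidate split, taking it as the parent standard deviation minus the size-weighted standard deviations of the subsets. The use of the population standard deviation, the function name, and the toy target values are assumptions, not details from the study.

```python
from statistics import pstdev


def standard_deviation_reduction(parent, subsets):
    """SDR = sd(Z) - sum(|Z_i| / |Z| * sd(Z_i)) for one candidate split.

    parent  : target values of all instances reaching the node (Z)
    subsets : list of target-value lists produced by the split (Z_i)
    Population standard deviation is assumed here.
    """
    n = len(parent)
    weighted = sum(len(s) / n * pstdev(s) for s in subsets)
    return pstdev(parent) - weighted


# Toy example: a split that separates low and high target values perfectly
parent_values = [1.0, 1.0, 5.0, 5.0]
good_split = standard_deviation_reduction(parent_values, [[1.0, 1.0], [5.0, 5.0]])
poor_split = standard_deviation_reduction(parent_values, [[1.0, 5.0], [1.0, 5.0]])
print(good_split, poor_split)
```

The split with the larger reduction would be preferred, since it leaves the subsets more homogeneous in their target values.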

Random forest regression (RF)
Random forest, introduced by Breiman (2001), is a classification and regression method that comprises a collection of regression trees trained using various bootstrap samples (bagging) of the training data. Each tree acts as a regression function on its own, and the final target is taken as the average of the individual tree outputs (Adusumilli et al., 2013). In the case of bagging, the training set for each tree contains about 67% of the data from the actual training set; thus, about one-third of the data are left out from each tree developed (Singh et al., 2017). These left-out training data, termed out-of-bag (out of the bootstrap sampling), were used to estimate the prediction error and variable importance. The number of trees to be grown (k) in the forest and the number of features or variables selected (m) at each node to generate a tree are the two standard primary parameters necessary for random forest regression (Breiman, 2001).
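The bagging step and the averaging of tree outputs can be sketched as follows. This shows only the sampling and aggregation logic, not a full random forest; the function names, the seed, and the sample size of 238 are illustrative assumptions.

```python
import random


def bootstrap_sample(n_obs, rng):
    """Draw n_obs indices with replacement and list the out-of-bag indices."""
    in_bag = [rng.randrange(n_obs) for _ in range(n_obs)]
    out_of_bag = sorted(set(range(n_obs)) - set(in_bag))
    return in_bag, out_of_bag


def forest_predict(tree_predictions):
    """Final regression output: the average of the individual tree outputs."""
    return sum(tree_predictions) / len(tree_predictions)


rng = random.Random(0)  # seed assumed, for reproducibility of the sketch
in_bag, out_of_bag = bootstrap_sample(238, rng)
oob_fraction = len(out_of_bag) / 238
print(f"out-of-bag fraction for this tree: {oob_fraction:.2f}")

# Hypothetical outputs of three trees for one observation
print(forest_predict([1.0, 2.0, 3.0]))  # 2.0
```

Each tree would be trained on its own in-bag sample, and its out-of-bag observations would serve as an internal check of prediction error.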

Empirical models
Kostiakov and multi-linear regression models are empirical models. The least-squares technique was used to derive the regression coefficients of the Kostiakov and MLR models from the training data set.
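The Kostiakov model is commonly written as F(t) = a·t^b, where a and b are fitted coefficients. The sketch below fits this form by ordinary least squares on log-transformed values. The log transformation, the function name, and the synthetic data are assumptions; the source states only that least squares was used.

```python
import math


def fit_kostiakov(times, cumulative_infiltration):
    """Fit F(t) = a * t**b by least squares on ln F = ln a + b * ln t."""
    x = [math.log(t) for t in times]
    y = [math.log(f) for f in cumulative_infiltration]
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = math.exp(y_mean - b * x_mean)
    return a, b


# Synthetic data generated from a = 0.5, b = 0.6 (not field measurements)
times = [1.0, 2.0, 5.0, 10.0, 20.0]
observed = [0.5 * t ** 0.6 for t in times]
a, b = fit_kostiakov(times, observed)
print(a, b)
```

Because time is the only input, this model cannot account for differences in texture, bulk density, or moisture content between sites.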

Multi-linear regression (MLR)
A general form of the multilinear regression model is considered to develop a nonlinear regression model, in which F(t) is the dependent variable representing the cumulative infiltration of soil; t (time), S (sand), C (clay), Si (silt), ρ (bulk density), and MC (moisture content) are the explanatory variables; a is the constant; and the regression coefficients $b_1$, $b_2$, $b_3$, $b_4$, $b_5$, and $b_6$ are estimated by minimizing the sum of squared prediction errors (least squares). Based on this general form, a relationship was developed from the training data set.
In the performance assessment parameters, H denotes the actual values, F the estimated/forecasted values, $\bar{H}$ the mean of the actual values, and n the number of observations.
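The four performance assessment parameters used in this study (R, MSE, RMSE, and NSE) can be computed as in the sketch below. It uses the standard textbook definitions, which are assumed to match those of the study; the function name and the toy values are illustrative.

```python
import math


def performance_metrics(actual, predicted):
    """Return R, MSE, RMSE, and NSE for paired actual (H) and predicted (F) values."""
    n = len(actual)
    h_mean = sum(actual) / n
    f_mean = sum(predicted) / n
    sse = sum((h - f) ** 2 for h, f in zip(actual, predicted))
    ss_h = sum((h - h_mean) ** 2 for h in actual)
    ss_f = sum((f - f_mean) ** 2 for f in predicted)
    cov = sum((h - h_mean) * (f - f_mean) for h, f in zip(actual, predicted))
    mse = sse / n
    return {
        "R": cov / math.sqrt(ss_h * ss_f),  # correlation coefficient
        "MSE": mse,                         # mean square error
        "RMSE": math.sqrt(mse),             # root mean square error
        "NSE": 1.0 - sse / ss_h,            # Nash-Sutcliffe model efficiency
    }


# Toy values: predictions offset from the actual values by a constant 1.0
metrics = performance_metrics([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])
print(metrics)
```

The toy case shows why R alone can mislead: a constant offset leaves R at 1 while NSE drops well below 1, so the measures are best read together.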

Implementation of machine learning methods
Four standard statistical measures, R, MSE, RMSE, and NSE, were selected as performance evaluation parameters to judge the accuracy of the machine learning models and the Kostiakov model. Several manual trials were carried out to find the optimum values of the primary parameters. Higher values of R and NSE and lower values of MSE and RMSE indicate better prediction accuracy. The number of trees to be grown (k) in the forest and the number of features or variables selected (m) at every node to generate a tree are the two standard primary parameters essential for random forest regression. In M5P, the models were calibrated by changing the number of instances allowed at each node (m). In ANN, the number of hidden layers, the number of neurons, iterations, learning rate, and momentum are the primary parameters. The selected primary parameters of the modeling methods are presented in Table 4.

Results and discussion
Performance of empirical models
Table 5 ... and 0.3301 & R = 0.8698 and 0.5757 were found for the MLR and Kostiakov models, respectively. The scatter plot of the actual and predicted F(t) values is shown in Figure 4 for the MLR and Kostiakov models for the training and testing stages. Figure 4 gives a closer comparison of F(t) between the two models and indicates that the performance of the MLR model is slightly better than that of the Kostiakov model for all data.

Performance of soft computing based models
The preparation of M5P, ANN, and RF-based models is a trial and error process. A number of manual trials were carried out to find the optimum values of the primary parameters. The optimum values of the user-defined parameters of M5P, ANN, and RF are listed in Table 4. The accuracy of the predicted F(t) for M5P, ANN, and RF in terms of R, MSE, RMSE, and NSE is reported in Table 5. The best model was taken to be the one with higher values of R and NSE and lower values of MSE and RMSE. Taken together, R, MSE, RMSE, and NSE indicate that RF is more accurate than the ANN and M5P models in predicting the cumulative infiltration of soil. The scatter plots of the actual and predicted F(t) values are shown in Figures 5, 6, and 7.
Figure 4. Scatter plot of empirical models for both training and testing stages.

Comparison of empirical and soft computing based models
The F(t) values predicted using RF, M5P, and ANN were compared with those predicted using the empirical models. The same data set used for the RF, M5P, and ANN-based models was used to assess the empirical models. Figure 8 shows the scatter plot and performance plot for the soft computing based models and empirical models for both training and testing stages. A performance plot is also shown in Figure 8 using the testing data set for the Kostiakov, MLR, M5P, ANN, and RF models. The statistical parameters obtained from the soft computing based models and empirical models are listed in Table 5. Overall, RF is the better prediction method, with the highest NSE of 0.9904. The values of F(t) predicted using RF (as shown in Figure 8) lie closer to the perfect fit line and follow the path of the actual values more closely than those estimated using the empirical models.

Sensitivity investigation
A sensitivity investigation was carried out to find the most influential input factor in F(t) prediction. Table 5 suggests that RF is more suitable than the other soft computing and empirical models, so the RF method was used. Seven sets of training data were developed, one including all input parameters and six obtained by eliminating one input factor at a time, and the results were reported in terms of R and RMSE with the testing data set. The outcomes in Table 6 indicate that elapsed time and moisture content were the most significant factors for predicting the cumulative infiltration of soil.
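The seven input combinations used in this kind of leave-one-out sensitivity check can be generated as sketched below. The variable names and the function name are illustrative assumptions; training and scoring the RF model on each combination is not shown.

```python
INPUTS = ["time", "sand", "clay", "silt", "bulk_density", "moisture_content"]


def sensitivity_input_sets(inputs):
    """Return (removed_input, remaining_inputs) pairs for a sensitivity study.

    The first pair keeps every input; each later pair drops exactly one input.
    """
    sets = [(None, list(inputs))]
    for removed in inputs:
        remaining = [name for name in inputs if name != removed]
        sets.append((removed, remaining))
    return sets


input_sets = sensitivity_input_sets(INPUTS)
for removed, remaining in input_sets:
    print(removed, remaining)
```

The inputs whose removal degrades R and RMSE on the testing data the most would be read as the most influential.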

Conclusion
Infiltration plays a crucial role in rainfall-runoff modeling and in the design and scheduling of irrigation systems. The performance of soft computing-based models in predicting the cumulative infiltration of soil over varying sand, silt, clay, density, and moisture content was investigated. Three popular techniques, RF, M5P, and ANN, were utilized as soft computing models. The cumulative infiltration values predicted by RF were superior to those from the ANN, M5P, MLR, and Kostiakov models. At the same time, the ANN model also showed improvement compared to the M5P and empirical models. The sensitivity results indicate that elapsed time and moisture content of the soil samples are the most significant factors when the RF-based modeling method is used to predict the cumulative infiltration of soil for the given dataset. This investigation extends the use and capabilities of artificial intelligence techniques and provides a trained general model for predicting the infiltration process that could be implemented in other study areas.

Disclosure statement
No potential conflict of interest was reported by the authors.