Assessment of infiltration models developed using soft computing techniques

ABSTRACT In this study, the predictive ability of support vector machines (SVM), Gaussian process (GP), artificial neural network (ANN), and random forest (RF) based regression approaches was tested on infiltration data from soil samples having different compositions of sand, silt, clay, and fly ash. Their performance was also compared with the Kostiakov model (KM) and Philip's model (PM). A dataset containing a total of 392 observations was collected from experimental measurements of soil infiltration rate on different soil samples. Of these, 272 observations were randomly selected for training and the remaining 120 were used for validation of the developed models. Standard statistical parameters were used to measure the predictive ability of the developed models. The results suggest that the best performance is achieved by polynomial kernel function-based GP regression (GP_Poly), with coefficient of correlation values of 0.9824 and 0.9863, Bias values of 0.0006 and −2.3542, root-mean-square error values of 47.7336 and 40.3026, and Nash-Sutcliffe model efficiency values of 0.9655 and 0.9727 for the training and testing datasets, respectively. Furthermore, time is found to be the most influential input variable when the GP_Poly-based model is used to predict the infiltration rate.


Introduction
Infiltration plays a very important part in surface and subsurface hydrology as well as in irrigation, and it has received considerable attention from hydrologists. A large number of infiltration models have been developed for the estimation of infiltration rate, and they can be classified into three categories (Mishra, Tyagi, & Singh, 2003): (i) physical models, (ii) semi-empirical models, and (iii) empirical models. Physical and semi-empirical models depend on derived laws and equations; Smith (1972), Green and Ampt (1911), Horton (1938), and Holtan (1961) are examples. Empirical models depend on field and laboratory experimental data; Modified Kostiakov (Smith, 1972) and Soil Conservation Service (1972) are examples. Although numerous infiltration models exist, their suitability for real-world conditions is still not clear.
Many researchers have compared infiltration models. Skaggs, Huggins, Monke, and Foster (1969) compared the Green-Ampt, Philip, Horton, and Holtan models for different soils with different surface conditions and moisture contents; the Horton and Holtan models gave an accurate steady-state infiltration rate. Whisler and Bower (1970) analysed Philip's model, the Green-Ampt model, and a numerical model for estimating the infiltration rate of different soil profiles. Rawls, Brakensiek, and Miller (1983) focused on the development of methods to estimate infiltration model parameters. Ogbe, Jayeoba, and Ode (2011) used the statistical tool GENSTAT to fit four infiltration models. A novel model for the estimation of infiltration rate through soil has also been proposed. This study includes the two most popular infiltration models: the Kostiakov model (KM) and Philip's model (PM).

KM
The Kostiakov model (KM) proposed by Kostiakov (1932) is as follows:
$$f(t) = x\,t^{-y} \tag{1}$$
where $f(t)$ is the infiltration rate (mm/hr) at time t, t is the time of infiltration (sec), and x and y are constants in this equation.

PM
Philip's model (PM) is as follows:
$$f(t) = \frac{1}{2}\,S\,t^{-0.5} + A \tag{2}$$
where $f(t)$ is the infiltration rate (mm/hr), t is time (sec), S is the sorptivity (mm/hr$^{0.5}$), and A is the rate factor (mm/hr).
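As an illustration of how the constants of the two conventional models can be estimated, the following minimal Python sketch assumes the commonly used rate forms f(t) = x t^(-y) for KM and f(t) = 0.5 S t^(-0.5) + A for PM. The function names and the synthetic, noise-free data are hypothetical and are not taken from this study; note also that the KM fit is a least-squares fit in log-log space, which is not identical to a least-squares fit on the untransformed rates.

```python
import math

def fit_line(u, v):
    """Ordinary least squares for v = slope * u + intercept."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    sxx = sum((a - mean_u) ** 2 for a in u)
    sxy = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_u

def fit_kostiakov(t, f):
    """Fit f = x * t**(-y) by linear regression of log(f) on log(t)."""
    slope, intercept = fit_line([math.log(a) for a in t],
                                [math.log(b) for b in f])
    return math.exp(intercept), -slope  # x, y

def fit_philip(t, f):
    """Fit f = 0.5 * S * t**(-0.5) + A, which is linear in t**(-0.5)."""
    slope, intercept = fit_line([a ** -0.5 for a in t], f)
    return 2.0 * slope, intercept  # S, A

# Synthetic example data (illustrative values only, not measurements)
times = [30.0, 60.0, 120.0, 240.0, 480.0]
rates_km = [500.0 * a ** -0.4 for a in times]
rates_pm = [0.5 * 900.0 * a ** -0.5 + 20.0 for a in times]

x, y = fit_kostiakov(times, rates_km)
S, A = fit_philip(times, rates_pm)
```

Because the synthetic rates are generated from the assumed model forms, the fitted constants recover the generating values up to rounding.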
In this paper, modelling methods, viz., GP, SVM, ANN, and RF-based regression are investigated and the abilities of these strategies in modelling the infiltration rate of water through sandy soil are explored. Further, the performances of these modelling methods are compared with KM and PM infiltration models.

Support vector machines (SVM)
This method is a regression and classification approach which originates from statistical learning theory (Cortes & Vapnik, 1995). SVM classification is based on the principle of optimal separation of classes. If the classes are separable, this method selects, from among the infinite number of linear classifiers, the one with the minimum generalization error. The selected hyperplane is thus the one that leaves the maximum margin between the two classes, where the margin is defined as the sum of the distances of the hyperplane from the nearest points of the two classes. The approach can be extended to non-linear decision surfaces by projecting the original set of variables into a higher-dimensional feature space and formulating a linear classification problem in that feature space (Aghelpour, Mohammadi, & Biazar, 2019; Smola, 1996; Vapnik, 1998).
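The margin definition used above can be made concrete with a short Python sketch. It only evaluates the margin of a given hyperplane for two small, hypothetical point sets; it does not perform SVM training, and all names and values are illustrative assumptions.

```python
import math

def signed_distance(w, b, point):
    """Signed distance of a point from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(c * c for c in w))
    return (sum(c * p for c, p in zip(w, point)) + b) / norm

def margin(w, b, class_pos, class_neg):
    """Sum of the distances from the hyperplane to the nearest point of each class."""
    d_pos = min(abs(signed_distance(w, b, p)) for p in class_pos)
    d_neg = min(abs(signed_distance(w, b, p)) for p in class_neg)
    return d_pos + d_neg

# Hypothetical, linearly separable classes
class_pos = [(2.0, 2.0), (3.0, 3.0)]
class_neg = [(0.0, 0.0), (-1.0, 0.0)]

m_diag = margin((1.0, 1.0), -2.0, class_pos, class_neg)  # hyperplane x1 + x2 = 2
m_vert = margin((1.0, 0.0), -1.0, class_pos, class_neg)  # hyperplane x1 = 1
```

Both hyperplanes separate the two classes, but the diagonal one leaves the larger margin and would therefore be preferred under the maximum-margin principle.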

Gaussian processes regression (GP)
Gaussian process (GP) models are based on the assumption that adjacent observations should convey information about each other. They specify a prior directly over function space. The GP is therefore a natural generalization of the Gaussian distribution, whose mean is a vector and whose covariance is a matrix: the Gaussian distribution is over vectors, whereas the Gaussian process is over functions. Because prior knowledge about the function is built into the model, validation is not necessary for generalization, and a GP regression model can provide the predictive distribution corresponding to a test input (Rasmussen, 2006).
A Gaussian process is defined as a collection of random variables, any finite number of which has a joint multivariate Gaussian distribution. Let $\chi \times \gamma$ denote the domain of inputs and outputs, respectively, from which n pairs $(x_i, y_i)$ are drawn. It is assumed that $y \subseteq \mathbb{R}$; the GP on $\chi$ is then defined by a mean function $\mu: \chi \to \mathbb{R}$ and a covariance function $\kappa: \chi \times \chi \to \mathbb{R}$.
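To show how a predictive distribution follows from a mean and covariance function, the sketch below computes the posterior mean and variance of a zero-mean GP with a polynomial covariance function in plain Python. The zero mean function, the noise level, the helper names, and the toy data are assumptions made for illustration and do not reproduce the configuration used in this study.

```python
def poly_kernel(a, b, d=2):
    """Polynomial covariance function ((a.b) + 1)^d."""
    return (sum(p * q for p, q in zip(a, b)) + 1.0) ** d

def solve(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z

def gp_predict(X_train, y_train, x_new, kernel, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP at x_new."""
    n = len(X_train)
    # Covariance matrix of the training inputs plus observation noise
    K = [[kernel(X_train[i], X_train[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [kernel(x, x_new) for x in X_train]
    alpha = solve(K, y_train)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve(K, k_star)
    var = kernel(x_new, x_new) - sum(k * w for k, w in zip(k_star, v))
    return mean, var

# Toy data generated from y = x^2 (illustrative only)
X_train = [(0.0,), (1.0,), (2.0,), (3.0,)]
y_train = [0.0, 1.0, 4.0, 9.0]
mean, var = gp_predict(X_train, y_train, (1.5,), poly_kernel)
```

Since a second-degree polynomial kernel can represent y = x^2, the posterior mean at x = 1.5 is close to 2.25 and the posterior variance is close to zero.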

Artificial neural networks (ANN)
The artificial neural network (ANN) is widely used for numerical prediction and classification (Aggarwal, Goel, & Singh, 2012; Jahani & Mohammadi, 2019; Kia et al., 2012; Moazenzadeh & Mohammadi, 2019). It is built from a number of processing elements arranged in three basic layers: the input layer, the hidden layer, and the output layer. The connections between the layers carry the weights that link the nodes. Each node is similar to a biological neuron and performs two main tasks: it sums the input values multiplied by the weights associated with each connection, and it passes this sum through an activation function to produce its output. By adjusting the weights, the network produces an output that is close to the observed target. Complete details about ANN are given by Haykin (1999). In the current investigation, one hidden layer is used for model development.
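The two tasks performed by each node, the weighted sum and the activation function, can be sketched as a forward pass through a network with one hidden layer. The sigmoid activation, the bias terms, the linear output node, and all weight values below are assumptions chosen for illustration; the source does not specify them.

```python
import math

def sigmoid(z):
    """Logistic activation function (an assumed choice)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """One-hidden-layer network: sigmoid hidden nodes and a linear output node."""
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
              for ws, b in zip(hidden_weights, hidden_biases)]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias

# Hypothetical weights for a network with 2 inputs and 2 hidden nodes
y_hat = forward(
    inputs=[1.0, 2.0],
    hidden_weights=[[0.5, -0.25], [0.0, 0.0]],
    hidden_biases=[0.0, 0.0],
    output_weights=[2.0, 4.0],
    output_bias=1.0,
)
```

Training would adjust the weights so that the output approaches the observed target; only the forward computation is shown here.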

Random forest regression (RF)
Random forest is one of the more recent techniques used for classification and regression. In this technique, a number of different trees are combined for forecasting or estimation. In regression, the tree predictors take on numerical values, as opposed to the class labels used by the random forest classifier (Breiman, 1999). The random forest regression used in this study uses a combination of input variables, or randomly selected variables, at each node to grow a tree. The RF technique requires only two user-defined parameters: the number of variables used at each node and the number of trees (Breiman, 1999).
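A compact sketch of this idea is given below: each tree is grown on a bootstrap sample and tries only m randomly chosen input variables at each node, and the forest prediction is the average of the tree predictions. The bootstrap step, the squared-error split criterion, the minimum node size, and the toy data are assumptions made for illustration rather than the settings used in this study.

```python
import random

def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def grow_tree(X, y, m, rng, min_size=2):
    """Grow a regression tree, trying m randomly chosen variables at each node."""
    if len(y) < 2 * min_size or len(set(y)) == 1:
        return mean(y)  # leaf: numerical prediction
    best = None
    for j in rng.sample(range(len(X[0])), m):
        for threshold in sorted(set(row[j] for row in X))[1:]:
            left = [i for i, row in enumerate(X) if row[j] < threshold]
            right = [i for i, row in enumerate(X) if row[j] >= threshold]
            if len(left) < min_size or len(right) < min_size:
                continue
            cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, j, threshold, left, right)
    if best is None:
        return mean(y)
    _, j, threshold, left, right = best
    return (j, threshold,
            grow_tree([X[i] for i in left], [y[i] for i in left], m, rng, min_size),
            grow_tree([X[i] for i in right], [y[i] for i in right], m, rng, min_size))

def predict_tree(node, row):
    """Follow the splits until a leaf value is reached."""
    while isinstance(node, tuple):
        j, threshold, left, right = node
        node = left if row[j] < threshold else right
    return node

def random_forest(X, y, n_trees, m, seed=0):
    """Grow n_trees trees, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(y)) for _ in range(len(y))]
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], m, rng))
    return forest

def predict_forest(forest, row):
    """Average the numerical predictions of all trees."""
    return mean([predict_tree(tree, row) for tree in forest])

# Toy data: the output depends only on the first variable (illustrative only)
X = [(float(i), float(i % 3), float(i % 2)) for i in range(20)]
y = [10.0 if i < 10 else 50.0 for i in range(20)]
forest = random_forest(X, y, n_trees=25, m=2)
low = predict_forest(forest, (1.0, 1.0, 1.0))
high = predict_forest(forest, (19.0, 1.0, 1.0))
```

Here n_trees and m correspond to the two user-defined parameters named in the text.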

Methodology and dataset
An experimental investigation was performed using a mini-disk infiltrometer (Decagon Devices Inc., 2006) in the laboratory at the National Institute of Technology, Kurukshetra, India. The materials chosen for this study were sand, rice husk ash, and fly ash. All observations were taken at predetermined mixing proportions and initial conditions such as moisture content and bulk density. The characteristics of the materials are given in Table 1.
A total of 392 observations were collected from the experiments; 272 observations, randomly selected from the total data, were used for training the models, and the remaining 120 were used to validate/test the performance of the developed models. Time (T), sand content (S), rice husk ash (R), fly ash (Fa), suction head (s), bulk density (B), and moisture content (w) were chosen as input variables in this study, whereas infiltration rate was taken as the output variable. Table 2 lists the statistical features of the training and testing datasets used in this investigation.
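The random partition of the 392 observations into 272 training and 120 testing records could be carried out as in the following sketch. The fixed seed and the placeholder records are assumptions for illustration; the source does not state how the random selection was implemented.

```python
import random

def split_dataset(records, n_train, seed=42):
    """Randomly partition records into a training and a testing subset."""
    rng = random.Random(seed)  # fixed seed so that the split is reproducible
    indices = list(range(len(records)))
    rng.shuffle(indices)
    train = [records[i] for i in indices[:n_train]]
    test = [records[i] for i in indices[n_train:]]
    return train, test

# Placeholder for the 392 observations (each would hold the inputs and the output)
records = list(range(392))
train, test = split_dataset(records, n_train=272)
```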

Detail of kernel functions and user-defined parameters of soft computing techniques
In this investigation, the two most popular kernel functions, i.e., the polynomial kernel, $K(x, x') = ((x \cdot x') + 1)^{d}$, and the radial basis kernel, $K(x, x') = e^{-\gamma \lVert x - x' \rVert^{2}}$, were implemented in GP and SVM, where d and γ are the parameters of the polynomial and radial basis kernel functions, respectively. In ANN, the number of hidden layer neurons (H), learning rate (L), number of iterations, and momentum (M) are adjusted during model training and testing, while RF regression requires the setting of two user-defined parameters: the number of trees (I) and the number of features allowed at each node (m).
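For reference, the two kernel functions can be evaluated as in the short sketch below. The polynomial form follows the expression given in the text; the radial basis form exp(-γ‖x - x'‖²) is the commonly used definition and is assumed here, and the input vectors and parameter values are hypothetical.

```python
import math

def polynomial_kernel(x, x_prime, d):
    """K(x, x') = ((x . x') + 1)^d"""
    dot = sum(a * b for a, b in zip(x, x_prime))
    return (dot + 1.0) ** d

def rbf_kernel(x, x_prime, gamma):
    """K(x, x') = exp(-gamma * ||x - x'||^2), assumed standard form."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * sq_dist)

k_poly = polynomial_kernel((1.0, 2.0), (3.0, 0.5), d=2)  # (1*3 + 2*0.5 + 1)^2 = 25
k_rbf = rbf_kernel((1.0, 2.0), (1.0, 2.0), gamma=0.5)    # identical points give 1
```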
The user-defined parameters of the soft computing techniques were tuned manually over several trials, with the aim of minimizing the error between the predicted and actual outputs. The optimal values of the primary parameters are listed in Table 3 for the SVM, GP, ANN, and RF-based models.

Performance criteria
Coefficient of correlation (C.C), Bias, root-mean-square error (RMSE), and Nash-Sutcliffe model efficiency coefficient (E) are the performance evaluation criteria used in the current study to evaluate the fitting ability of the models.

Coefficient of correlation
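C.C is used to evaluate the performance of a model with numeric outputs. The C.C is given as:
$$\mathrm{C.C} = \frac{n\sum O_i P_i - \sum O_i \sum P_i}{\sqrt{n\sum O_i^{2} - \left(\sum O_i\right)^{2}}\;\sqrt{n\sum P_i^{2} - \left(\sum P_i\right)^{2}}} \tag{3}$$
where $O_i$ and $P_i$ are the actual and predicted values, respectively, and n is the number of observations. The correlation coefficient ranges from −1 to +1. A value of zero indicates that there is no linear relation between the actual and predicted data.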

Bias
The bias is the average difference between the actual and predicted values, as defined by Equation (4).

Root-mean-square error (RMSE)
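Root-mean-square error is commonly used to assess numeric predictions. RMSE is calculated as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(O_i - P_i\right)^{2}} \tag{5}$$
where $O_i$ and $P_i$ are the actual and predicted values, respectively, and n is the number of observations.

Nash-Sutcliffe model efficiency coefficient (E)
It is used to examine the predictive power of the models. It is expressed as (Nash and Sutcliffe, 1970):
$$E = 1 - \frac{\sum_{i=1}^{n}\left(O_i - P_i\right)^{2}}{\sum_{i=1}^{n}\left(O_i - \bar{O}\right)^{2}} \tag{6}$$
where $\bar{O}$ is the mean of the actual values. A value of E ≥ 90% indicates satisfactory performance, a value between 80% and 90% indicates fairly good performance, and a value below 80% indicates unsatisfactory performance.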
The combined use of C.C, Bias, RMSE, and E provides a sufficient evaluation of each model's performance and supports a judgment of the precision of the six modelling approaches implemented in the current study.
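The four criteria can be computed as in the following sketch, which uses the standard definitions of the correlation coefficient, RMSE, and Nash-Sutcliffe efficiency. The sign convention of the bias (predicted minus actual) is an assumption, since the text only describes it as an average difference; the example values are hypothetical.

```python
import math

def correlation_coefficient(actual, predicted):
    """Pearson correlation between actual and predicted values."""
    n = len(actual)
    sa, sp = sum(actual), sum(predicted)
    sap = sum(a * p for a, p in zip(actual, predicted))
    saa = sum(a * a for a in actual)
    spp = sum(p * p for p in predicted)
    return (n * sap - sa * sp) / (
        math.sqrt(n * saa - sa ** 2) * math.sqrt(n * spp - sp ** 2))

def bias(actual, predicted):
    """Average difference; the sign (predicted minus actual) is an assumption."""
    return sum(p - a for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root-mean-square error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def nash_sutcliffe(actual, predicted):
    """Nash-Sutcliffe model efficiency coefficient."""
    mean_a = sum(actual) / len(actual)
    num = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    den = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - num / den

# Hypothetical values for illustration
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 41.0]

cc = correlation_coefficient(actual, predicted)
b = bias(actual, predicted)
r = rmse(actual, predicted)
e = nash_sutcliffe(actual, predicted)
```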

Results
The coefficients of the Kostiakov (empirical) model and Philip's (physical) model were derived using the least-squares method.
The performance of the Kostiakov and Philip's models is compiled in Table 4, which indicates that the performance of both models is unsatisfactory. The Nash-Sutcliffe model efficiency coefficient was 0.3417 for the Kostiakov model and 0.3401 for Philip's model. As shown in Table 5, the single factor ANOVA results indicate that the F-value (4.397386) was greater than F-critical (3.880827) and the P-value (0.03705) was less than 0.05, which suggests that the difference between the values predicted by the Kostiakov model and the actual values is significant. Figure 1 shows the flowchart of the methodology. Figure 2 illustrates the agreement diagram between the actual and predicted infiltration rates for both models.
Figure 3 displays the agreement diagram between the actual and predicted infiltration rates of sand obtained using GP regression with the Radial basis and Polynomial kernel functions on the testing dataset. Table 4 suggests that the Polynomial kernel function-based GP regression is superior to the Radial basis kernel function. The Polynomial kernel function-based GP regression attained higher values of C.C (0.9863) and E (0.9727), and lower values of Bias (−2.3524) and RMSE (40.3026). The Nash-Sutcliffe model efficiency coefficient suggests that the RBF kernel has a fairly good performance, whereas the Polynomial kernel-based GP regression has a satisfactory performance. From Table 5, the single factor ANOVA results suggest that the difference between the values predicted by GP_RBF and GP_Poly and the actual values is not significant.
Figure 4 displays the agreement between the actual and predicted infiltration rates of sand obtained using SVM regression with the Radial basis and Polynomial kernel functions on the testing dataset. The SVM-based Polynomial kernel function shows closer agreement with the line of perfect prediction. Table 4 suggests that the Polynomial kernel function-based SVM regression performs better than the Radial basis kernel function. The Polynomial kernel function-based SVM regression attained higher values of C.C (0.9629) and E (0.9229), and lower values of Bias (4.5753) and RMSE (67.7070). The Nash-Sutcliffe model efficiency coefficient suggests that the RBF kernel has a fairly good performance, whereas the Polynomial kernel-based SVM regression has a satisfactory performance. According to Table 5, the single factor ANOVA suggests that the difference between the values predicted by SVM_RBF and SVM_Poly and the actual values is not significant.
Figures 5 and 6 display the agreement between the actual and predicted infiltration rates of sand obtained using the ANN and Random Forest models on the testing dataset. The infiltration data predicted by the RF model (Figure 6) lie closer to the agreement line than those predicted by ANN (Figure 5). Table 4 suggests that the Random Forest model works better than ANN, and the higher value of the Nash-Sutcliffe model efficiency coefficient (0.9565) attained by RF indicates its satisfactory performance relative to ANN. Table 5 shows that, according to the single factor ANOVA, the difference between the values predicted by ANN and RF and the actual values is not significant.
Figure 7 displays the agreement between the actual and predicted infiltration rates of sand obtained by the Kostiakov, Philip, GP, SVM, ANN, and Random Forest regression approaches on the testing dataset. Table 4 suggests that the GP-based Polynomial kernel function works better than the other approaches for this dataset. Figure 7 indicates that the Polynomial kernel-based (GP and SVM), ANN, and Random Forest approaches work better than the Kostiakov and Philip's models. The Nash-Sutcliffe model efficiency coefficient suggests that GP, SVM, ANN, and Random Forest have satisfactory performances. The single factor ANOVA also suggests that there is little difference between the actual and predicted values for these approaches. Figure 8 shows the residuals plot for the testing period for the six modelling approaches. This figure suggests that the GP, SVM, ANN, and Random Forest approaches have smaller residuals than the Kostiakov and Philip's models.
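The single factor ANOVA comparison of actual and predicted values used throughout this section rests on the F statistic, which can be computed as in the sketch below. The function name and the small example groups are hypothetical; the critical value against which F is compared has to be taken from the F distribution for the resulting degrees of freedom and is not computed here.

```python
def one_way_anova_f(groups):
    """F statistic and degrees of freedom of a single factor ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean
    ss_within = sum(sum((v - m) ** 2 for v in g)
                    for g, m in zip(groups, group_means))
    df_between, df_within = k - 1, n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Hypothetical actual and predicted values
actual = [1.0, 2.0, 3.0]
predicted = [2.0, 3.0, 4.0]
f_value, df_between, df_within = one_way_anova_f([actual, predicted])
```

An F-value below the critical value indicates that the difference between the group means is not statistically significant at the chosen level.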

Discussion
Infiltration prediction models are considered important for the management of stormwater and groundwater recharge and are used to predict the nature of subsurface recharge as well as surface runoff. Soil infiltration characteristics depend largely on the properties of the soil and the availability of moisture in the soil. Table 6 lists modelling studies addressing the performance of various soft computing models in simulating infiltration through soil. A lot of research has been conducted in the field as well as in the laboratory on the estimation of infiltration characteristics, viz., hydraulic conductivity, cumulative infiltration and infiltration rate of soil, recharging rate, and permeability of soil. Most of the modelling studies applied ANN to laboratory as well as field data, and a substantial number of researchers obtained appreciable results with this technique (Anari et al., 2011; Esmaeelnejad et al., 2015; Schaap & Leij, 1998; Sedaghat et al., 2016; Sihag, 2018; Sy, 2006). In some recent field-based studies, listed in Table 6, RF regression was found to be the superior modelling method for predicting the infiltration rate and hydraulic conductivity of soil in the region of Kurukshetra, India (Kumar & Sihag, 2019; Singh et al., 2017).
Regarding infiltration characteristics, the suitability of the SVM and GP regression methods cannot be ignored, as some studies suggest that they estimate infiltration properties more accurately than some of the other popular soft computing approaches (Das et al., 2011; Elbisy, 2015). These data-mining-based studies show that ANN, SVM, GP, and RF regression techniques are strong modelling tools for determining the infiltration characteristics of soil in the field as well as in the laboratory (Table 6). The authors therefore used these modelling tools together to compare their prediction performance in simulating the infiltration rate through mixed soil made of materials with different characteristics (sand, rice husk ash, and fly ash) with basic soil properties. The comparison yielded the highest estimation accuracy for polynomial kernel-based GP regression, followed by RF regression. The results of the GP and RF regression methods are considerably better than those of the other applied modelling methods, as well as those of the two popular conventional models (Kostiakov and Philip). The Polynomial kernel performs better than the RBF kernel with both SVM and GP regression. This study and some past studies (Kumar & Sihag, 2019; Sihag et al., 2019; Singh et al., 2017) thus affirm that the RF model is a reasonably good predictor of infiltration characteristics in the field as well as in the laboratory and can be successfully employed in both cases.

Sensitivity study
A sensitivity study was carried out to find the most influential input variable for the infiltration rate of soil. The best performing model (GP_Poly) was selected for this investigation. A number of training datasets were prepared by removing one input variable at a time, and the outcomes were listed in terms of C.C and RMSE for the test dataset. The largest change in RMSE observed in Table 7 indicates that time is the most influential variable for estimating the infiltration rate of soil.
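The procedure of removing one input variable at a time can be sketched as follows. A 1-nearest-neighbour regressor is used purely as a stand-in so that the example stays short and self-contained; it is not the GP_Poly model of this study, and the variable names and toy data are hypothetical.

```python
import math

def knn_predict(X_train, y_train, row):
    """Stand-in model: 1-nearest-neighbour regression (not the GP_Poly model)."""
    best = min(range(len(X_train)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(X_train[i], row)))
    return y_train[best]

def rmse(actual, predicted):
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def drop_column(X, j):
    """Remove input variable j from every record."""
    return [row[:j] + row[j + 1:] for row in X]

def sensitivity(X_train, y_train, X_test, y_test, names):
    """Test RMSE obtained after removing each input variable in turn."""
    results = {}
    for j, name in enumerate(names):
        X_tr, X_te = drop_column(X_train, j), drop_column(X_test, j)
        preds = [knn_predict(X_tr, y_train, row) for row in X_te]
        results[name] = rmse(y_test, preds)
    return results

# Toy data in which the output depends only on the first variable "T"
X_train = [(float(t), float(s)) for t in range(10) for s in (0, 1)]
y_train = [100.0 - 10.0 * t for t, s in X_train]
X_test = [(3.0, 1.0), (7.0, 0.0)]
y_test = [70.0, 30.0]

results = sensitivity(X_train, y_train, X_test, y_test, names=["T", "S"])
```

In this toy case removing "T" raises the test RMSE sharply while removing "S" leaves it at zero, which is the pattern that identifies the most influential variable.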

Conclusion
This paper compares the performance of the Kostiakov model, Philip's model, and the GP, SVM, ANN, and RF approaches in approximating the infiltration rate of water through soil. The performance assessment parameters suggest that the Polynomial kernel-based GP approach is superior to the conventional models and to the SVM, ANN, and RF models, with a coefficient of correlation of 0.9863, a Bias of −2.3542, a root-mean-square error of 40.3026, and a Nash-Sutcliffe model efficiency of 0.9727 on the testing dataset. One of the most important conclusions is that the Polynomial kernel function works better than the Radial basis kernel function with both the GP and SVM approaches. RF regression is found to be the second-best modelling alternative after GP_Poly regression. All the artificial intelligence techniques perform better than the conventional models. The single factor ANOVA results suggest that the difference between the actual values and the values predicted by the artificial intelligence techniques is not significant. The sensitivity study concludes that time is the most influential input variable for estimating the infiltration rate of soil for this dataset.

Compliance with Ethical Standards
Conflict of interest: Parveen Sihag, Munish Kumar, and Balraj Singh declare that they have no conflict of interest. Ethical approval: This article does not contain any studies with human participants or animals performed by any of the authors.

Disclosure statement
No potential conflict of interest was reported by the authors.