Resampling in neural networks with application to spatial analysis

ABSTRACT In developing Artificial Neural Networks (ANNs), the available dataset is split into three categories: training, validation and testing. However, an important problem arises: How to trust the prediction provided by a particular ANN? Due to the randomness related to the network itself (architecture, initialization and learning procedure), there is usually no best choice. Considering this issue, we provide a framework, which captures the randomness related to the network itself. The idea is to perform several training and test trials based on the Jackknife resampling method. Jackknife consists of iteratively deleting a single observation each time from the sample and recomputing the ANN on the rest of the sample data. Consequently, interval prediction is available instead of point prediction. The proposed method was applied and tested using pH, Ca and P data obtained by analyzing 118 georeferenced soil points. The results, based on the dataset size simulation, showed that 60% reduction in available dataset offers compatible accuracy in relation to full dataset, and therefore a higher cost of sampling in the field would not be necessary. The re-sampling method spatially characterizes the points of greater or lesser accuracy and uncertainty. The re-sampling method increased the success rate by using interval prediction instead of using the mean as the most probable value. Although we restrict it to the regression neural network model, the resampling method proposed can also be extended to other modern statistical tools, such as Kriging, Least Squares Collocation (LSC), Convolutional Neural Network (CNN), and so on.


Introduction
One of the most important tasks in developing a neural network is related to partitioning the dataset. Often, most researchers randomly and uniformly split a known dataset into three categories: training, validation and testing. The training set is often used to estimate the unknown parameters of the net (e.g. weights and biases in a regression feed-forward neural network). The validation set helps one in making a decision on how to train the network and how to stop it in order to prevent oneself from overfitting. In machine learning languages, test sets are unseen data, which play an important role in the evaluation of the generalized performance of the network. This method of splitting data randomly is known as Hold-out.
This method in Artificial Neural Network (ANN) has been used in geo-spatial applications, especially when the data set is large, as can be seen here Aboufazeli, Jahani, and Farahpour (2021), Jahani et al. (2021a), Jahani and Saffariha (2021b), and Aslam et al. (2021). There are, however, significant limitations of using this method, as pointed out by Ziggah et al. (2019) as follow: (i) the results produced is based on the uncontrolled chosen split sets; (ii) improper split of the data set could have an adverse effect on the model performance; (iii) depends on having large dataset making it unsuitable to be applied in data-insufficient situations (i.e. sparse dataset).
To overcome these problems, the K-fold crossvalidation has been recommended as an alternative method (Burman 1989;Reitermanová 2010). In this case, the data is separated into K disjoint subsets (K-fold) of approximately equal size, so that each subset will be in a test set just once. Although the K-fold present advantages, it is still not clear how to choose the K disjoint subsets and in some cases the subsets are not of equal sizes, which does not guarantee a balanced version of cross-validation. Furthermore, the two methods above provide only a single-point prediction. Therefore, splitting data methods still seems to be challenging. Thus, an important issue arises here: how to split the data properly so that the prediction is as reliable as possible?
Within the context of neural network modeling, the best choice is one in which the test set is as small as possible. In other words, if the training set is close to the full sample size, the more accurate the neural network model, as can be seen in the recent work of Balmer, Weibel, and Huang (2021). For this, one can use the Delete-1 Jackknife resampling method (Quenouille 1949). The Delete-1 Jackknife is a special case of the K-fold method by choosing K to be equal to the size of the available dataset (sample), say n. In this approach, each observation is used to validate the performance of the neural network trained with the remaining (n À 1) observations. In the context of neural networks, the Delete-1 Jackknife method is a balanced version of cross-validation.
However, the uncertainties related to the network itself (architecture, initialization and learning procedure) are still questionable. Therefore, there is usually no best choice (Pan 1998). In light of this issue, here we extend the Delete-1 Jackknife resampling method to capture the randomness related to the network itself. The basic idea consists of repeating the Delete-1 Jackknife process a large number of times. Hence, each test subset, which in this case is composed of only one single observation, predicts a desired number of times. Consequently, we obtain an interval prediction for each test subset, and not only a single prediction as in classical methods. In this way, we are able to describe the uncertainty of each predicted point, which was not possible using traditional Hold-Out and K-fold methods.

Extension of delete-1 Jackknife resampling method in neural networks
In general, the number of possible combinations for independent cross-validation, denoted by C d n , and the number of repetitions of each point prediction R d n À � are given respectively by where n is the size of the available sample (or the size of the dataset at hand), is the number of sample points, which are removed from the available sample for validation. This means that we validate the subset of size d and train the remaining n À d ð Þ observations at a time. The subset size d is selected from all of the observations without replacement. For instance, if n ¼ 10 and d ¼ 2, we would have C 2 10 ¼ 45 independent cross-validation subsets, with each output predicted R 2 10 ¼ 9 times. This method is known as Jackknife-d (Efron 1980(Efron , 1992Wang and Yu 2020).
The Delete-1 Jackknife is a special case by taking d ¼ 1. In this case, each observation is used to validate the performance of the neural network trained with the remaining (n À 1) observations ( Figure 1). In the context of neural networks, the Delete-1 Jackknife method is a balanced version of cross-validation. In machine learning languages, the Delete-1 Jackknife method is commonly referred to as the leave-one-out procedure (Efron 1982).
For instance, if n ¼ 10, the application of the Delete-1 Jackknife method would provide 10 crossvalidation, with each observation predicted only once, i.e. C 1 10 ¼ 10 and R 1 10 ¼ 1. Here, on the other hand, we extend the Delete-1 Jackknife method by drawing many trials in order to capture the randomness related to the network itself (architecture, initialization and learning procedure). We call this method Delete-1 Jackknife Trials (or simply, Jack-1 T). Therefore, instead of having only one single particular neural network predictor, we will have hundreds or even thousands of predictors. This means that instead of making a single-point prediction, we can describe its empirical distribution. In other words, this gives us the opportunity to do interval prediction instead of just one-point prediction.
The accuracy of the Jackknife re-sampling method depends on the choice of the number of groups that will be deleted from the neural network training process (i.e. d). In general, the more observations in the training set, the better the network in terms of learning, which justifies the idea of the Jack-1 T. Similar idea can be found in (Miller 1974;Pan 1998).

Outcomes from Jack-1 T
The most probable estimate (expected value) of an output quantity is not based on only one single prediction, but rather on an interval basis. For this, we sort the N predicted values of a point into strictly increasing order. The sorted predicted values provide an empirical distribution function for each output point (denoted by). Then, for a stipulated coverage probability (denoted by α) we are able to compute the desired percentiles p as follows: where : ½ � denotes rounding down to the next integer, which indicates the position of the selected elements in the ascending order of G. This position corresponds to a prediction for a stipulated overall probability α. This can be done for any α. Indeed, a confidence interval for the predictions may also be constructed. This procedure is very similar to that found for the computation of critical values of outlier statistical tests (Rofatto et al. 2020).
The re-sampling methods presented here also allow us to measure the prediction performance for the entire sample set. This is due to the fact that all sampling points are replicated in the test set. The Root Mean Squared Error (RMSE) on the test set can be used to measure the accuracy of the prediction of each point "i" as follows ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffiffi 1 N where Ô k are the N predicted values for each sample point i and T k are the N true output values. The RMSE has been used as standard statistical parameters to measure model performance in several applications. The RMSE indicates how far the output responses of the neural network are from the true outputs (targets) for each i. Note that RMSE is computed for each of the n available sample points. Although this expression represents the RMSEs for a single output, it can be applied to the case where one more output is available. Consequently, accuracy and uncertainty maps can also be provided.
To make it clearer, the procedure associated with Jack-1 T is summarized step-by-step as follows: (1) The dataset available of size n is divided into two parts: training and test sets.
(2) Train the model with (n -1) observations and evaluate it using the remaining observations. For instance, given x as input and y as output variables in a neural network, we would have Þg as training set, and the remaining observation x k ; y k ð Þ would be used to evaluate the model.
(3) Repeat the procedure in (2) n times, excluding at each moment a different observation until all the observations have been deleted ( Figure 1). (4) The procedures in (2) and (3) are repeated a certain number of times (number of trials t). For example, if n ¼ 10 and the number of trials (denoted by t) is t ¼ 500, one will get 10 � 500 ¼ 5000 neural network predictors, being 500 predictions for each single test point sample. In this case, the number of neural network predictors will be equal to n � t. (5) Summarize the results using the repeated predictions by constructing the empirical distribution function for each variable in play, as can be seen in Equation (3). The RMSEs can also be obtained for each testing point (Equation 4).
In practical applications, the user will be interested in applying the neural networks generated by the resampling process to new input variables where the outputs are unknown. In this case, the RMSE will not be available. However, other statistics can be computed for each predicted point, such as mean, standard deviation (uncertainty), quartiles and coefficient of variation. Consequently, an uncertainty map can be provided.

Material and methods
The proposed method was tested using soil chemistry spatial data. We also investigate whether the proposed method is able to provide good predictions under conditions of low data density.

Available datasets
The data set used consists of soil chemical attributes, as follows: potential of hydrogen (pH); exchangeable calcium (Ca); and phosphorus concentration (P). They were obtained by analyzing soils from 118 georeferenced points following standard laboratory methods ( Figure 2). The coordinates are given in the UTM system. These chemical attributes play an important role in the soil fertility, soil life and soil management practices. Furthermore, the chemical attributes have different variability as can be seen in Table 1, which makes it an interesting problem for the application of the proposed method.

Spatial neural network development
A feed-forward neural network with back-propagation learning procedure was developed by using the UTM coordinates as input data set and the soil chemical attributes (pH, Ca and P) as the targets. The command-line functionality of the MATLAB's Neural Network toolbox (R2019b) was used for training and validation of the neural network. After several trials, the optimum neural network architecture found was defined by three hidden layers, with the first layer composed of three neurons, the second of 14, and the last of a single neuron, i.e. [3 14 1] (Figure 3). Since the input data is only composed of UTM coordinates (E, N), the neural network is called a spatial neural network. This architecture was also found in Cagliari, Veronez, and Alves (2011).
Among the best practices for training a neural network is to normalize the dataset to speed up the learning and to get a faster convergence (Huang et al. 2020). Hence, both input and target values were normalized as follows: being: y n the inputs or targets mapped to interval [−1, 1], y the original inputs or target data, y min and y max the minimum and maximum values obtained from the original inputs or targets, respectively. In this case, the normalized function gives the input/ target values between [−1, 1]. The log-sigmoid transfer function (activation function) was applied in the intermediary layers (hidden layers) in order to compute a layer's output from its net input. The logistic function (denoted by φ) adopts values between 0 and 1, as follows where x i is a value coming from the linear combination of the inputs, weights and bias for a given neuron "i," as follows with w being the weights parameters of the network, I the input data, and θ the bias parameter for 1, . . ., n input sample data. During the learning stage, it is customary and convenient to minimize the Mean Squared Error (MSE) between the targets (t i Þ and estimated output (o i ), as follows (Haykin 1999): The weights and biases of the neural network were estimated by minimizing the function in (4) in order to produce the desired outcome, i.e. the chemical attributes. Here, Levenberg-Marquardt algorithm (LM), also known as the Damped Least-Squares (DLS) method, was employed to estimate the unknown parameters of the neural network, i.e. weights and biases (Levenberg 1944;Madsen, Nielsen, and Tinglef 2004;Marquardt 1963;Gavin 2020). The Nguyen-Widrow algorithm (LW) was used to initialize the weights and biases (Nguyen and Widrow 1990).

Experiments
At first, we used the 100% of the available data (118 points) for the application of the Jack-1 T re-sampling method ( Figure 2). In this stage, the cross-validation is based only on the test set, which consists of the internal validation of the resampling procedure. The number of trials was adopted as t ¼ 100, which provided 11,800 neural network predictors for the case where we have used 100% of the available dataset. The size of the validation set is fixed by randomly taking 10 points for each trial. The aim of the validation set is to improve the generalization of the network learning.  When the network starts to over-fit the training set, the performance on the validation set typically begins to decrease. When the validation error increases for a specified number of epochs, the training is stopped, and the weights and biases at the minimum of the validation error are returned. This method for avoiding over/under-fitting is called early stopping (Reitermanová 2010). The learning process is configured to stop when the network performance on the validation set fails to improve or remains the same for six epochs (iterations). Since the validation set has an effect on how to train the network and how to stop it, it was considered as part of the training process. Therefore, the training was taken as being the set without the test set. Still, at this stage, we investigate to what extent Jack-1 T is able to provide good predictions under conditions where the samples are smaller than the original dataset. For this purpose, the original dataset (n ¼ 118 points) was intentionally and randomly reduced to ~40% (47 points). The number of trials was taken as t ¼ 100, which provided 4700 neural network predictors for the case where we have used 40% of the available dataset. Thus, 100 predictions were provided for each sample test point, which allowed an internal validation with respect to the original dataset available.
The remaining (71 points) which have not participated in the Jack-1 T procedure, were used to evaluate the performance of the Jack-1 T in terms of external validation. They were totally left out of network's training and testing set and therefore were taken as being out-of-sample sets of the resampling procedure. This guarantees an unbiased evaluation of the neural network-based predictions. Thus, 4700 neural predictors generated from the resampling of the 40% of the data (47 points) were applied to each of the 71 points, which were left out of the resampling procedure (i.e. 60% remaining that were part of the out-of-sample set). Consequently, each sample point was predicted 4700 times, which allowed for interval prediction instead of the classical point prediction one.
The analysis was performed from two perspectives: (i) quantitatively by comparing the descriptive statistics between the original attributes and those predicted using the Jack-1 T-based neural network; (ii) qualitatively by comparing the class to which each individual prediction belongs with respect to its original class from classification tables of soil chemical attributes.

Quantitative evaluation
On average, there were no differences, according to the levels of significance, between the predictions from different sample sizes and the original dataset in terms of internal validation (Figure 4). The maximum and minimum values (shown as solid black lines in Figure 4) for the predictions are the maximum of the 97.5th percentile values (α = 0.975 computed from Equation 7) and the minimum of the 2.5th percentile values (α = 0.025 computed from Equation 7), respectively. These percentiles correspond to the lower (α = 0.025) and upper (α = 0.975) boundaries of the 95% empirical confidence interval, respectively. This means that the predictions that fall outside this confidence interval are considered outliers and are not considered in the solutions. The use of lower and upper bounds was suggested by Maldaner, Molin, and Spekken (2020). Regardless of the data size, we observed that the predictions for pH and Ca showed a slightly greater variability than the original data (ref). On the other hand, the maximum values of the predictions for P were underestimated. This means that it would be necessary to develop a specific neural network architecture for the variable P. However, here we are not concerned with the issue of improving the architecture by adding new parameters (more hidden layers and/or more nodes) or by tuning the hyperparameters on the training set. Instead, we try to adjust the prediction interval with respect to the interval of the original dataset. In fact, the variability is better described by adjusting the upper bound to the maximum values instead of the   97.5th percentile values ( Figure 5). In terms of internal validation (Figure 6), 95% of RMSE's were below ~ 0.7 (pH), ~ 1 cmol c dm −3 (Ca) and ~ 100 mg dm −3 (P).
In terms of external validation (Figure 7), 90% of RMSE's were below ~0.7 (pH), ~0.9 cmol c dm −3 (Ca) and ~70 mg dm −3 (P). These results are consistent with those from the internal validation. This means that the neural network model is able to generalize the results. Consequently, the use of 47 sampling points to generate a prediction model based on neural networks is sufficient, and a higher cost of sampling in the field would not be necessary.

Qualitative evaluation based on the out-ofsample set
The predictions for pH, Ca and P (71 points: 60% outof-sample set) and their original values (true values) are classified according to Table 2. The success rate of the predictions was computed as being the ratio of the number of predictions correctly identified to their class and the total number of sampling points. The most probable value of each prediction is given by the mean value, which is used to classify the attributes according to Table 2. We recall that 4700 neural predictors generated from the resampling of the 40% of Figure 7. RMSE spatial distribution (left) and overall proportion of the RMSE (right) of the pH, Ca and P computed from the application of 4700 neural predictors generated from the resampling of the 40% of the data (47 points) into the 71 out-of-sample points (60%). the data (47 points) were applied to each of the 71 points (60%) which were left out of the resampling procedure (out-of-sample set).
The results are displayed in Figure 8. The success rates were 72% (pH), 87% (Ca) and 89% (P), respectively. All means of the predictions were classified within the same class. This means that the resampling-based regression feed-forward neural network is shown not to be sensitive to abrupt changes, which can be interpreted as outliers. The original dataset also had a predominance of the same class although there are some anomalous points.
Since we have at our disposal the uncertainty of each sampling point, it is also possible to analyze it in terms of the confidence interval instead of just considering the mean of each prediction. The spatial distribution of the both standard-deviation (σ) and its distribution for the predictions are displayed in (Figure 9). In this case, the success rates are based on the confidence interval given by mean Probable Error Table 2. Classification of the pH, Ca (cmol c dm −3 ) and P (mg dm −3 ). Classification of pH and Ca from Alvarez et al. (1999), and P from Raij et al. (1996). (± PE). Here, PE was defined as being ±2σ for pH, ±3σ for Ca and ±4.4σ for P, where σ is the standarddeviation of the predictions. These PE choices were based on the variability behavior described by the results of the 4700 neural predictors generated from the resampling of the 40% of the data (47 points), as can be seen in the previous section ( Figures 5 and 6). The success rate occurs when the original field value is within the predicted confidence interval. For this case, the success rates were 92%, 93% and 92% for pH, Ca, and P, respectively. The confidence interval stayed exactly between [6, 7.2] for all pH predictions, which corresponds to the classes (Table 2) "Suitable" (lower bound of −2σ) and "High" (upper bound of +2σ), respectively. On the other hand, all upper bounds of Ca predictions (+3σ) were classified as "Very High," but the lower bounds (−3σ) were classified as "Medium" for 30 points (42%) and the others 41 points (58%) as "High." The upper bounds (+4.4σ) for P were classified as "Very High," whereas the lower bounds (−4.4σ) were classified as "High" for 58 points (82%), "Medium" for six points (8%) and "Very Low" for seven points (10%). The proposed method makes it possible to describe the uncertainties in the predictions of each sampling point. This can help neural network developers in Figure 9. Uncertainties of the pH, Ca and P predictions based on ±2σ, ±3σ and ±4.4σ probability, respectively, for the 71 sampling points (60% out-of-sample set).
decision-making, such as detecting regions that need more samples to reduce errors and improve the model. For instance, Qi et al. (2020) could use the Jack-1 T idea in their convolutional neural network for regression as both the study area and available dataset of their work are relatively small. However, the limitation of the Jack-1 T method is due to the sample size. If the sample size is too large, the method may be unfeasible in terms of computational time and hardware resources. For instance, if we had a dataset of n = 10,000, and if we took the number of trials t = 100, the total number of neural networks would be one million so each prediction would be repeated 100 times. These large sample set scenarios are often encountered in many remote-sensing imagery classification by deep learning algorithms, as can be seen here Liao et al. (2020) and Hong et al. (2021). In this case, it is very common to use Hold-out or K-fold methods. Thus, the proposed method may be extended to the case of Hold-out and K-fold by repeating these processes by resampling a certain number of times.

Conclusion
Here, we describe a data-based re-sampling method called Jack-1 T in the case of cross-validation backpropagation associated with neural networks. The proposed method can be used to construct inferential procedures for modern statistical spatial data analysis. The method replaces the theoretical derivations required in applying traditional methods by repeatedly re-sampling the original data and making inferences from the resamples. The advantage of these re-sampling techniques is that they allow us to determine the empirical probability distribution for the predictor rather than simply providing a point prediction. Consequently, it allows us to measure the prediction performance for each individual sampling point.
The results, based on the dataset size simulation, show that a 60% reduction in available data offers compatible accuracy in relation to the full dataset, and therefore reduces the cost of sampling in the field. Therefore, the proposed method supports in decisionmaking to define the sample size in the field. The resampling method spatially characterizes the points of greater or lesser accuracy and uncertainty. The resampling method increases the success rate by using interval prediction instead of using the mean. Although we restrict it to the neural network model, the resampling method proposed can also be extended to other modern statistics tools such as Kriging, Least Squares Collocation (LSC), Convolutional Neural Network (CNN), and so on. Although we restrict the Jack-1 T resampling method to a neural network for a regression problem, it can be applied to classification problems and for the case where deep learning algorithms is in play. Furthermore, the proposed method can be extended to other resampling methods such as Hold-out and K-fold.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Notes on contributors
Bruno Póvoa Rodrigues received his MSc degree in Agriculture and Geospatial Information from the Federal University of Uberlandia, Monte Carmelo city, Minas Gerais state, Brazil, in 2021. He has developed his research in the field of artificial neural networks applied in agriculture.

Data availability statement
For those interested in the dataset of this work, please send their request to the corresponding author by e-mail or access https://data.mendeley.com/datasets/z3399p5h9f/.