Ensemble Feature Selection Framework for Paddy Yield Prediction in Cauvery Basin using Machine Learning Classifiers

Abstract Machine learning techniques can require a significant amount of training time, and model performance degrades on multi-dimensional datasets because of redundant features. Feature selection is a significant step in machine learning that selects a subset of relevant features from a larger feature set, enhancing model performance through simplification. Crop yield prediction is an important application of machine learning, and feature selection plays a crucial role in it. To predict paddy yield accurately, weather, soil, and crop attributes are essential factors. Therefore, feature selection techniques are employed to identify relevant and non-redundant attributes from a larger dataset, which simplifies the prediction model. In this study, an ensemble feature selection method is proposed that selects an optimized subset of attributes by combining various attribute subsets based on the mutual information between attributes and between attributes and the class. The proposed ensemble approach is validated using five classification techniques: K-nearest neighbor, Random Forest, Support Vector Machine, Naive Bayes, and Bagging. Several evaluation metrics, including Accuracy, Error rate, Kappa, Precision, Recall, Specificity, and F1 score, are used to assess the performance of the ensemble approach and compare it with other feature selection techniques. The experimental results indicate that the proposed ensemble approach with the Random Forest classifier outperforms the other classifiers, with Accuracy and Error rate values of 0.9491 and 0.0509, respectively.


PUBLIC INTEREST STATEMENT
Feature selection is an important aspect of any prediction-related implementation using machine learning and deep learning techniques. It enables us to narrow down the factors affecting the prediction results in real-world applications. The proposed Ensemble Feature Selection technique guides researchers on an optimized method for performing feature selection. This article clearly illustrates the proposed methodology with an adequate comparison of existing feature selection methods. A comparison of the proposed method with the available literature also gives a holistic view of the different feature selection techniques. This technique is applied to agricultural yield prediction but can be further extended to other applications, including weather, stock market, fraudulent bank transactions, cybersecurity, FMCG goods stocking, and other retail prediction problems.

Introduction
In order to simplify the prediction model, it is necessary to select important, non-redundant attributes from a large dataset. This is achieved through the use of feature selection techniques. These techniques are used in many machine learning applications, such as agriculture, pattern recognition, and health care, to choose an optimized data subset from the larger available dataset. When working with large datasets, repeated data occurs, and identifying and removing it eliminates redundant values from the original dataset during the learning process. The objective of feature selection is to improve prediction accuracy, eliminate unnecessary attributes, and reduce analysis time. Performing feature selection on the original dataset can enhance the performance of the classification model. However, using a single feature selection method may result in a suboptimal or locally optimal feature subset, which can negatively impact the effectiveness of the learning technique (Rodríguez et al., 2007). On the other hand, an ensemble-based feature selection method employs multiple feature subsets and combines attribute rankings to identify the best attribute subset, resulting in improved classification accuracy. The first phase of the ensemble involves selecting a variety of feature selectors, each of which offers a sorted order of features. The second phase employs several aggregation methods to aggregate the chosen subsets of features (Wang et al., 2010). Previous studies have presented numerous feature selection techniques, each of which uses a different set of metrics and operates differently, and it has been observed that feature selection techniques reduce data dimensionality (Abdullah et al., 2014; Bhattacharyya & Kalita, 2013; Hoque et al., 2014, 2016; Hu et al., 2013; Pudil et al., 1994; Swiniarski & Skowron, 2003; Xuhua et al., 1996). Multiple feature selection algorithms may choose different subsets of attributes for a particular dataset, and as a result, the final results may differ in accuracy. These studies found that classification accuracy increases with the help of ensemble-based methods, which address the data diversity problem in an efficient manner.

Related Work
The primary aim of feature selection is to identify a subset of characteristics that can reduce prediction errors made by classifiers. Predicting paddy yield is a challenging task due to the existence of several complex factors. Factors like weather, soil quality, pest infestations, landscape, crop genotype, availability and quality of water, as well as harvest planning, all play a vital role in crop production. In existing studies, multiple research activities have concentrated on ensemble feature selection techniques. Boosting (Schapire, 1999) and bagging (Breiman, 1996) are the most widely used techniques, operating on bootstrap samples of the training set. N instances are randomly chosen to construct a bootstrap sample, which is a replication of the dataset drawn with replacement from the training set. Each replication of the data is passed into a filter individually, and a simple voting strategy is used to aggregate each classifier's prediction. In contrast, the boosting approach selects instances proportionally to their weights, giving more weight to instances that were misclassified by the prior model. For text classification, common ranker methods like information gain, chi-square, and frequency threshold are often combined (Olsson & Oard, 2006), while Opitz (1999) proposed genetic ensemble feature selection for neural networks. Wang et al. (2010) presented an ensemble of six commonly used filter-based rankers, and Lee (2002) and Rokach et al. (2006) combined results from several non-ranker filter-based feature subset selection strategies. Guyon and Elisseeff (2003) suggested several key strategies for attribute selection, including feature construction, feature ranking, multivariate feature selection, effective search techniques, and feature validity assessment techniques. Hall et al. (2009) investigated six methods for ranking lists of features and applied them to various datasets from the UCI machine learning library.
Feature selection has also been employed in classification problems related to biometrics and signal processing (Zhang et al., 2017). A common method used to maximize classification performance and reduce feature-related costs is cost-based feature selection, which is a multi-objective optimization problem. Zhang et al. (2017) proposed a multi-objective particle swarm optimization approach for cost-based feature selection. According to Bay (1998), using an ensemble of several classifiers can improve classification accuracy; Bay suggests a combined method for nearest neighbor classifiers that uses various attribute subsets. In existing crop yield prediction models, researchers use different kinds of agriculture and weather parameters; not all researchers use the same parameters. Some use environmental parameters, agriculture datasets, satellite data, and images, while others use individual parameters. May Gopal (2019) developed a hybrid crop yield prediction model using area, tanks, canal length, open wells, and maximum temperature features. Sathya and Gnanasekaran (2023b) proposed a forecasting system using weather parameters such as rainfall, temperature, and humidity. In our previous work (Sathya & Gnanasekaran, 2023a), we developed a hybrid model for crop yield prediction using six parameters: pH range, wind speed, rainfall, maximum temperature, minimum temperature, and block. An existing feature selection model (Suruliandi et al., 2021) used sixteen soil and environmental parameters to develop a framework that achieved 92.72% accuracy. Fei et al. (2022) proposed a feature selection model using hyperspectral vegetation indices for crop yield prediction; the model achieved R² values ranging from 0.648 to 0.679. In current research, most researchers do not consider the repeated nature of the chosen parameters. The use of a single feature selection technique may result in inconsistent classification prediction accuracy, as it can be biased towards a specific feature subset. To address this issue, an ensemble of feature selection techniques is employed to select the best non-redundant and important feature subset, which improves the prediction accuracy of the subsequent classification.

Dataset description
This research concentrates on paddy cultivation data from Thanjavur district, Tamil Nadu. It uses a dataset from the agriculture sector comprising information about previous years' crop details, soil, and environmental conditions, collected from the following sources: the Joint Director of Agriculture Office (JDA, 2022), Kattuthottam, Thanjavur; the Directorate of Economics and Statistics (DES, 2023), Chennai; and Giovanni (2022), an open-source NASA website that provides free geophysical and weather satellite data. The dataset contains 1562 samples and 21 features. Table 1 shows a description of the parameters used for this crop yield prediction framework.

Feature selection
Repetitive information in datasets makes it difficult to classify objects. In data analytics, feature selection is a crucial process because datasets have numerous features. Feature selection techniques allow the classification algorithm to train the model more efficiently; they minimize the model's complexity and improve interpretability. Feature selection is a crucial aspect of developing accurate machine learning models, as it helps reduce the number of irrelevant or redundant attributes, leading to better performance and reduced overfitting. There are three types of feature selection models, namely filter, wrapper, and embedded, that can be used to identify the best subset of attributes. In this study, an ensemble feature selection framework was developed using three different techniques: RReliefF, RFE, and Boruta. Each of these techniques has shown promising results in previous studies and follows a unique set of principles. The proposed model employs ensemble learning to combine the results of these three techniques and improve the accuracy of the feature selection process (Bhadra et al., 2020; Li et al., 2020; Tatsumi et al., 2021; Zare et al., 2013).

RReliefF
RReliefF is an improvement over the Relief algorithm that addresses several challenges, including missing data, noisy multi-class problems, and regression problems. This enhancement introduces probabilities that are used to calculate the relative distance between the predicted values of two observations. Using these probabilities, RReliefF can assign weights to features, enabling effective feature selection.
Algorithm 1 (RReliefF)
Input: training instances s_k with S parameters; the number of samples N; the number of nearest neighbors K
Output: the quality weight vector L for all variables

Initialize M_rQ and all elements of M_rP, M_rQ^rP, and L to 0
for i = 1 to N do
    select instance W_i randomly
    select the K nearest instances H_j to W_i
    for j = 1 to K do
        M_rQ ← M_rQ + diff(0, H_j, W_i) / K
        for P = 1 to D do
            M_rP(P) ← M_rP(P) + diff(P, H_j, W_i) / K
            M_rQ^rP(P) ← M_rQ^rP(P) + diff(0, H_j, W_i) · diff(P, H_j, W_i) / K
        end for
    end for
end for
for P = 1 to D do
    L(P) ← M_rQ^rP(P) / M_rQ − (M_rP(P) − M_rQ^rP(P)) / (N − M_rQ)
end for
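To make the update rule concrete, the following Python sketch implements Algorithm 1 under stated assumptions: features scaled to [0, 1], Manhattan distance for the neighbor search, and absolute difference as the diff function. The helper name rrelieff and its parameters are illustrative, not the exact implementation used in this study.

```python
import numpy as np

def rrelieff(X, y, n_samples=100, k=10, seed=0):
    """Minimal RReliefF sketch (Algorithm 1). Assumes X is scaled to [0, 1].

    Returns a quality weight vector L of length d; higher = more relevant.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    y_range = (y.max() - y.min()) or 1.0
    M_rQ = 0.0              # accumulated P(different prediction | nearest)
    M_rP = np.zeros(d)      # accumulated P(different attribute | nearest)
    M_rQrP = np.zeros(d)    # accumulated joint term
    for _ in range(n_samples):
        i = rng.integers(n)                       # select instance W_i randomly
        dist = np.abs(X - X[i]).sum(axis=1)       # Manhattan distances to W_i
        dist[i] = np.inf
        for j in np.argsort(dist)[:k]:            # K nearest instances H_j
            diff_y = abs(y[j] - y[i]) / y_range   # diff(0, H_j, W_i)
            diff_x = np.abs(X[j] - X[i])          # diff(P, H_j, W_i) for all P
            M_rQ += diff_y / k
            M_rP += diff_x / k
            M_rQrP += diff_y * diff_x / k
    # final weights, as in the last loop of Algorithm 1
    return M_rQrP / M_rQ - (M_rP - M_rQrP) / (n_samples - M_rQ)

rng = np.random.default_rng(1)
X = rng.random((200, 5))
y = 3 * X[:, 0] + rng.normal(0, 0.1, 200)   # only feature 0 drives the target
print(rrelieff(X, y))                        # weight of feature 0 is largest
```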

Recursive feature elimination
When dealing with small training samples and high dimensionality, feature selection plays a crucial role in preventing overfitting and enhancing classification accuracy, so it is essential to incorporate feature selection techniques in such scenarios. For small-sample situations, the RFE feature selection method is frequently employed. Starting from a fitted model, the weakest attributes are eliminated iteratively until the required number of features remains. RFE primarily uses the Gini coefficient to rank attributes according to their significance: the importance of each attribute is determined by counting the number of splits it is involved in when the data are partitioned. To utilize Recursive Feature Elimination (RFE), a machine learning model and the desired number of features to retain must be provided as input.

Algorithm 2 (Recursive Feature Elimination)
Input: E ← dataset, K ← training set, set of F parameters S = {s_1, ..., s_F}, ranking method UV(K, S)
Output: final ranking R

for i = 1 to F do
    rank S using UV(K, S)
    s* ← last (lowest-ranked) parameter in S
    R(F − i + 1) ← s*
    S ← S − {s*}
end for
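As a minimal sketch, scikit-learn's RFE can realize this procedure with a random-forest base estimator, whose impurity-based (Gini) importances provide the attribute ranking. The synthetic data below stands in for the paddy dataset, and keeping eight features is an assumption chosen for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Illustrative stand-in data with the same shape as the study's dataset.
X, y = make_classification(n_samples=1562, n_features=21, n_informative=8,
                           random_state=42)

# RFE needs a base model and the desired number of features to retain;
# the weakest attribute is dropped at each step until 8 remain.
selector = RFE(estimator=RandomForestClassifier(random_state=42),
               n_features_to_select=8, step=1)
selector.fit(X, y)
print("kept features :", selector.support_)   # boolean mask of retained attributes
print("attribute rank:", selector.ranking_)   # 1 = retained, higher = dropped earlier
```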

Boruta
The Boruta algorithm is a wrapper method for feature selection that can be used with any classification technique, although it is mainly designed for random forest. It is efficient in identifying highly significant features in a dataset and requires minimal analysis time. In this study, the relevance of each parameter was assessed based on its Z score, evaluated using the mean decrease in accuracy and its standard deviation; Boruta treats the Z score as the most critical factor in the feature selection process.
A random forest classifier is built on the merged dataset (the original attributes together with their randomized shadow copies) to determine the attribute values. The importance of each attribute is measured by comparing its value with that of the randomized attributes: attributes that score higher than the randomized ones are considered significant.
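For illustration, the boruta Python package (BorutaPy, a port of the original R implementation) applies this shadow-attribute scheme with a random-forest estimator. The data and parameter choices below are assumptions for the sketch, not those used in this study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy  # pip install Boruta

X, y = make_classification(n_samples=1562, n_features=21, n_informative=8,
                           random_state=42)

# Boruta merges the real attributes with shuffled "shadow" copies and keeps
# attributes whose importance Z score beats the best shadow attribute.
rf = RandomForestClassifier(n_jobs=-1, max_depth=5, random_state=42)
boruta = BorutaPy(rf, n_estimators='auto', alpha=0.05, random_state=42)
boruta.fit(X, y)
print("confirmed attributes:", np.where(boruta.support_)[0])
print("tentative attributes:", np.where(boruta.support_weak_)[0])
```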

Classification techniques
Once the most significant parameters have been identified using feature selection techniques, a classification algorithm is employed on the processed data. Popular classification algorithms such as K-Nearest Neighbor, Naive Bayes, Support Vector Machine, Random Forest, and Bagging are commonly used. The optimized dataset is divided into training and testing subsets before the classifier is applied. The training dataset is used to train the classifier algorithm, which is then evaluated on the testing set. The results of this process determine the most important features for predicting crop yield from the crop dataset; a sketch of this pipeline is shown below.
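A minimal scikit-learn sketch of the pipeline follows; the synthetic X_opt stands in for the optimized feature subset, and sklearn's BaggingClassifier stands in for the AdaBag classifier described later.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the optimized paddy feature subset and yield classes.
X_opt, y = make_classification(n_samples=1562, n_features=8, n_informative=6,
                               random_state=42)

# 70%/30% train-test split, as used in this study.
X_train, X_test, y_train, y_test = train_test_split(
    X_opt, y, test_size=0.30, random_state=42)

classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Bagging": BaggingClassifier(random_state=42),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: accuracy = {acc:.4f}")
```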

K-Nearest neighbor
The KNN algorithm is a nonparametric supervised learning method that leverages a training set to classify data points into specific categories, without making assumptions about the underlying distribution of the data. Predictions of the KNN classification model are based only on neighboring data values. The letter "K" in the KNN algorithm indicates the number of nearest neighbor data values: the method classifies a given data point based on the classes of its K nearest neighbors.

Naive Bayes
The Naive Bayes classifier is a probabilistic classification technique commonly utilized in machine learning. It applies the Bayes theorem to classify data points, predicting each class label based on the likelihood of a particular instance. Naive Bayes classifiers assume that the value of one attribute is unrelated to the value of any other attribute, given the class attribute.

Support vector machine
The SVM algorithm separates the data with decision surfaces (hyperplanes) that divide it into two groups. The hyperplane is determined by the training points closest to it, which form the support vectors. The hyperplane with the largest separation from the closest training data points is preferred because it yields stronger margins and fewer errors, allowing for better classifier generalization.

Random forest
The Random Forest algorithm creates and combines multiple decision trees to generate accurate predictions. Before splitting each node, the RF algorithm searches for the most important parameter among a set of randomly selected attributes, producing the best possible forecast.

Bagging
Bagging, also called Bootstrap Aggregating, is an ensemble machine learning technique that combines weak learners trained independently in parallel. The samples used for training each learner are drawn from the training dataset, and bagging collects votes across the learners for each sample to improve prediction accuracy. Unlike boosting, bagging does not update instance weights, so no weight-update equation is needed. In this study, we utilized the Adaptive Bagging (AdaBag) classifier for prediction, which is highly effective when the predictor's bias exceeds its variance.

Ensemble feature selection technique
Ensemble learning is a technique that combines multiple learners to achieve the learning objective, rather than relying on a single machine learning method. The proposed ensemble feature selection technique demonstrated superior model performance compared to individual feature selection methods during evaluation. The aim of this study was to apply the concept of ensemble learning by integrating the models generated by several feature selection methods. To achieve this, the original dataset was divided into training and testing sets with a 70%-30% split. Following data pre-processing, the complete raw dataset was supplied to the various feature selection algorithms. Separate optimized subsets were obtained from the feature selection algorithms, and the average was calculated to determine the most important features for crop yield prediction. Finally, the combined optimized subset was applied to a classification algorithm to evaluate performance.
The proposed ensemble learning approach combines the optimized subsets from the various feature selection algorithms based on mutual information, as shown in Figure 1. When a specific feature is chosen by every selection algorithm, it is added to the optimized subset directly, without any further strategy.
Mutual information is a measure of the amount of information shared by two random attributes; it indicates how much information about one attribute can be obtained by knowing the other.
More formally, the mutual information between two random attributes J and K is defined as

I(J; K) = H(J) − H(J|K)

where H(J) is the entropy of J and H(J|K) is the conditional entropy of J given K. Intuitively, mutual information measures how much knowledge of one variable decreases the uncertainty of the other, computed as the difference between the entropy of the first variable and its conditional entropy given the second. Mutual information is often utilized in techniques like feature selection and dimensionality reduction, where it helps identify the most relevant variables for a given task.
The proposed method selects an attribute when its attribute-attribute mutual information is minimal and its attribute-class mutual information is maximal. It eliminates the bias introduced by individual attribute selection approaches by using attribute-class mutual information to identify significant attributes and attribute-attribute mutual information to identify repeated attributes.
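The sketch below computes this quantity directly from the definition I(J; K) = H(J) − H(J|K) using empirical counts over discrete samples; the helper names are illustrative.

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Empirical Shannon entropy H(J) in bits."""
    n = len(values)
    return -sum((c / n) * np.log2(c / n) for c in Counter(values).values())

def mutual_information(j, k):
    """I(J; K) = H(J) - H(J|K) from paired discrete samples."""
    j, k = np.asarray(j), np.asarray(k)
    # H(J|K): entropy of J within each value of K, weighted by P(K = v)
    h_j_given_k = sum((np.sum(k == v) / len(j)) * entropy(j[k == v])
                      for v in np.unique(k))
    return entropy(j) - h_j_given_k

# An attribute that perfectly determines the class shares maximal information.
feature = [0, 0, 1, 1]
label = [1, 1, 0, 0]
print(mutual_information(label, feature))  # 1.0 bit
```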

Performance of combiner
The combiner uses attribute-class and attribute-attribute mutual information to build the optimized subset of attributes. From each selected subset, the combiner considers the first-ranked attributes. If all the selected subsets contain the same attributes, the common attributes are added directly to the optimized subset. If a selected subset contains a different attribute, the attribute-class mutual information is computed and the attribute with the maximum value is considered. After the attribute-class selection, attribute-attribute mutual information is computed between the candidate attribute and the attributes already chosen as optimal features. The candidate is added only if its attribute-attribute mutual information with every selected attribute is smaller than the user-defined threshold; attributes whose mutual information is greater than or equal to the threshold are discarded as redundant.

Proposed algorithm
Algorithm 4 (Ensemble feature selection)
Input: D ← dataset, R ← number of attributes to include in the subset, α ← threshold value
Result: E, an optimized subset of input attributes

S_1, S_2, S_3, ..., S_n are the subsets selected by the various attribute selection models
Initialize E ← ∅ and a counter i ← 1
while i ≤ R do
    select the feature a with maximum attribute-class (AC) mutual information
    if E = ∅ then
        E ← E ∪ {a}
    else
        compute AA(a, a_s) for a against every selected attribute a_s ∈ E
        if the mutual information computed against all chosen attributes in E is lower than α then
            E ← E ∪ {a}
        end if
    end if
    i ← i + 1
end while

In the proposed Algorithm 4, the value of R represents the number of attributes selected for the optimized dataset. A rank-based strategy chooses the attributes for the optimal dataset from each subset. The proposed method first uses the highest-ranked attribute and applies a classifier to calculate the classification accuracy. In each following iteration, a larger subset of attributes is used, starting with the first-ranked attribute, then adding the second-ranked attribute, and so on through the ranking. The process calculates the classification accuracy for every subset of attributes. If a subset containing the R highest-ranked attributes yields better classification accuracy than the subset containing the R + 1 highest-ranked attributes, the subset with the R high-ranked attributes is deemed the optimal attribute subset.
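One possible reading of Algorithm 4 in Python is sketched below. It assumes the attribute-class mutual information values and an attribute-attribute mutual information function have already been computed (for instance with the estimator above); the names ensemble_select, ac_mi, and aa_mi are hypothetical.

```python
def ensemble_select(subsets, ac_mi, aa_mi, r, alpha):
    """Sketch of Algorithm 4: combine selected subsets via mutual information.

    subsets : feature-index lists from RReliefF, RFE and Boruta
    ac_mi   : dict, feature index -> attribute-class mutual information
    aa_mi   : function (a, b) -> attribute-attribute mutual information
    r       : number of attributes R to place in the optimized subset E
    alpha   : redundancy threshold
    """
    candidates = set().union(*subsets)            # pool of selected attributes
    selected = []                                 # the optimized subset E
    # visit candidates in decreasing attribute-class relevance
    for a in sorted(candidates, key=lambda f: ac_mi[f], reverse=True):
        if len(selected) >= r:
            break
        if not selected:
            selected.append(a)                    # first attribute: most relevant
        elif all(aa_mi(a, s) < alpha for s in selected):
            selected.append(a)                    # keep only non-redundant ones
    return selected

# Toy demo with hypothetical mutual-information values for five attributes.
subsets = [[0, 1, 2], [0, 2, 3], [0, 1, 4]]
ac = {0: 0.9, 1: 0.7, 2: 0.65, 3: 0.4, 4: 0.3}
aa = lambda a, b: 0.8 if {a, b} == {1, 2} else 0.1   # 1 and 2 are redundant
print(ensemble_select(subsets, ac, aa, r=3, alpha=0.5))  # -> [0, 1, 3]
```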

Evaluation metrics
In this study, different evaluation metrics are used to evaluate the performance of the proposed model and other techniques. Accuracy, Error rate, F1 score, Recall, Kappa, Specificity, and Precision metrics are used in this work.

Accuracy
Accuracy measures how well the model performs, enabling us to understand whether the predicted model is close to the desired value:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP denotes true positives, TN true negatives, FP false positives, and FN false negatives.

Precision
Precision is an evaluation metric commonly used to measure the accuracy of the positive predictions made by a classifier:

Precision = TP / (TP + FP)

Recall
Recall measures the ability of a classifier to correctly identify positive instances out of the total number of actual positive instances:

Recall = TP / (TP + FN)

F1 score
The F1 score provides a balanced measure of a classifier's precision and recall by taking their harmonic mean:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Kappa
The Kappa evaluation metric quantifies the level of agreement beyond chance between classifiers:

Kappa = (P_a − P_c) / (1 − P_c)

where P_a is the observed agreement and P_c is the expected (chance) agreement.

Specificity
Specificity measures the ability of a classifier to correctly identify negative instances out of all actual negative instances in the dataset:

Specificity = TN / (TN + FP)
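For reference, a small helper can compute all of the metrics above from the four confusion-matrix counts. This is an illustrative sketch; the chance agreement P_c for Kappa is derived here from the confusion-matrix marginals.

```python
def binary_metrics(tp, tn, fp, fn):
    """Evaluation metrics from binary confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                       # sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    # Cohen's kappa: agreement beyond chance, P_c from the marginals
    p_a = accuracy
    p_c = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total**2
    kappa = (p_a - p_c) / (1 - p_c)
    return {"Accuracy": accuracy, "Error rate": 1 - accuracy,
            "Precision": precision, "Recall": recall, "F1 score": f1,
            "Kappa": kappa, "Specificity": specificity}

print(binary_metrics(tp=90, tn=95, fp=5, fn=10))
```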

Experimental Results and Analysis
The various feature selection methods used and the attributes they selected are shown in Table 2. This study utilizes three feature selection techniques to identify subsets of attributes and evaluates the performance of the proposed ensemble method, which uses mutual information as a combiner. The combiner selects the best subset based on attribute-attribute and attribute-class relationships. As in ensemble learning, using multiple feature selection techniques may lead to subsets that are local optima in the feature space; however, the proposed method can combine these local optima to improve model performance. The proposed method selects the top-ranked features, based on a threshold value, from the various feature selection techniques to enhance prediction accuracy. Time and accuracy are the most important aspects of the result analysis, as discussed below. The study proposes an ensemble approach based on mutual information and demonstrates that it improves prediction accuracy and yields accurate predictions for paddy crop. The proposed EFS model also takes less computational time than the other approaches.
The prediction accuracy of the ensemble approach and the other feature selection techniques was analyzed to validate the model. As shown in Figures 2, 3, 4, and 5, the ensemble approach with Random Forest performed effectively compared with the other classifiers. For the optimized subset, the ensemble approach with random forest yielded the highest accuracy (acc = 0.9491, error rate = 0.0509), followed by the ensemble approach with the bagging method (acc = 0.89, error rate = 0.11) and the support vector machine (acc = 0.849, error rate = 0.151). In the full-feature validation analysis, random forest achieved a prediction accuracy of 0.9092 (error rate = 0.0908). Among the individual feature selection techniques, recursive feature elimination performed most efficiently, with the RReliefF method slightly behind it. Table 3 shows the comparison of computational time for the different methods; the proposed method takes less computational time than the traditional ones, making it the most efficient. Table 4 shows the accuracy and error rate of the individual and proposed ensemble feature selection models. Figure 6 and Table 5 show the comparison of evaluation metrics for the individual and proposed ensemble feature selection models.

Conclusion
In this paper, we introduce a novel ensemble feature selection framework that utilizes mutual information to combine the feature subsets selected by different algorithms, namely Boruta, RReliefF, and RFE. The aim of this framework is to identify an optimal feature subset. To evaluate the performance of our proposed method, we use various machine learning classifiers, including Random Forest, Support Vector Machine, K-nearest neighbor, Naive Bayes, and bagging. Our study demonstrates that the proposed ensemble feature selection framework effectively overcomes the issue of high dimensionality in large datasets by selecting relevant and non-redundant features. To avoid any bias introduced by a single classifier, we developed an ensemble approach. The results indicate that the proposed approach achieves superior performance compared to individual feature selection methods and single classifiers, thereby providing a robust and efficient solution for feature selection in large datasets.
With the observed outcomes, the following conclusions are made:
(i) Based on the outcomes of the evaluation measures, the proposed ensemble feature selection model achieved higher accuracy than the other algorithms.
(ii) The proposed ensemble model was applied with five different classifier algorithms.
(iii) Accuracy, Error rate, Recall, F1 score, Kappa, Precision, and Specificity metrics were used for the evaluation, with the best Accuracy and Error rate reaching 0.9491 and 0.0509, respectively.
(iv) The accuracy of the proposed ensemble model improved compared with the existing models reported in the literature.
(v) In this analysis, the most important features were selected using mutual information as a combiner, and the selected features were applied to the classifiers. The ensemble feature selection method with random forest works better than the others.

Figure 1. Framework of this study.

Figure 5. Performance of ensemble feature selection technique.

Table 6 shows the various existing feature selection techniques used in yield prediction for different types of crops. From the table, it is evident that the accuracy of the feature selection model using the proposed EFS model is the highest among the existing models.