Comparative Study of AutoML Approach, Conventional Ensemble Learning Method, and K-Nearest Oracle-AutoML Model for Predicting Student Dropouts in Sub-Saharan African Countries

ABSTRACT Student dropout in secondary schools is a major issue in developing countries, particularly in Sub-Saharan Africa. Sub-Saharan African countries had the highest dropout rate (37.5%), followed by South Asia (15.5%), the Middle East (11%), East Asia (9.5%), Latin America (7%), and Central Asia (3.5%). Various initiatives, such as the Big Results Now initiative, No Child Left Behind, and the Secondary Education Development Programme, as well as machine learning prediction models, have been used to reduce the severity of the problem in Sub-Saharan countries. The persisting dropout problem, particularly in secondary schools, is ascribed to improper root-cause identification and the absence of formal procedures that can be used to estimate the severity of the issue. This study compares the AutoML model, the ensemble learning approach, and KNORA-AutoML for predicting student dropout. The KNORA-AutoML model achieved an accuracy of 97%, precision of 71%, and AUC of 87%, compared with the conventional ensemble of optimized ML models with an accuracy of 96%, precision of 70%, and AUC of 78%. The KNORA-AutoML model's performance increased by 0.6% in accuracy, 0.8% in precision, and 8.7% in AUC. A well-optimized model draws considerable attention to findings related to student dropout rates in developing countries.


Introduction
Student dropout in Sub-Saharan Africa continues to be a challenging problem. In 2018, Sub-Saharan countries had a 37.5% dropout rate in secondary schools, followed by South Asia at 15.5% and the Middle East at 11% (Statista 2022). In Tanzania, student dropout in secondary schools increased from 3.8% in 2018 to 4.2% in 2019 (PO-RALG 2019). Student dropout has been addressed through different interventions to reduce its negative impact in secondary schools. Several measures have been proposed and implemented by education stakeholders to reduce student dropout (UNESCO-Tanzania 2015). Faruk (2015) proposed teachers' training, seminars, and workshops as measures to keep students from dropping out of school. Similarly, Bibi (2018) showed that parent-teacher meetings contribute 93.5%, and parent workshops 82.3%, to controlling the student dropout rate. Likewise, the authority in the United Republic of Tanzania (URT) established the Secondary Education Development Programme (SEDP) in 2005, which aimed to introduce at least one secondary school in every administrative ward so as to increase the number of secondary schools and reduce the distance from students' homes as a measure against school dropout (URT 2008). Later on, the United Republic of Tanzania adopted the Big Results Now initiative (World Bank 2014) to fast-track quality improvement of education in secondary schools and address student dropout. Furthermore, the government's fee-free basic education policy contributed to a significant decrease in student dropout in 2016 (UNICEF-Tanzania 2018).
The studies by Mduma, Kalegele, and Machuve (2019), Lee and Chung (2019), Chareonrat (2016), Aguiar (2015), and Sara et al. (2015) have focused on establishing machine learning (ML) prediction models as measures to fight student dropout in secondary schools, but the dropout problem still persists. The persisting dropout problem, especially in secondary schools, is attributed to a lack of proper identification of root causes and the unavailability of formal methods that can be used to project the severity of the problem. The difficulty stems from the fact that traditional machine learning algorithms suffer from feature processing and algorithm selection (Feurer et al. 2015). Moreover, no single machine learning algorithm/classifier or ensemble of classifiers can perfectly adapt to every test set (Azizi and Farah 2012). This is a significant drawback of most existing machine learning algorithms and ensemble learning techniques, as it compromises proper feature processing as well as the prediction accuracy of the machine learning models. However, automated machine learning (AutoML) methods provide better prediction results in different classification tasks. On the other hand, AutoML faces the challenge of selecting one of the optimal prediction models generated by the optimization methods for the various subsets of samples in the dataset (Tsiakmaki et al. 2020; Waring, Lindvall, and Umeton 2020). The challenge of selecting optimal prediction models among the pool of predictive methods is addressed by static and dynamic ensemble selection schemes (Ko, Sabourin, and Britto 2008). The static ensemble selection technique selects the optimal-performing subsets of classifiers for the whole test set (Azizi and Farah 2012). The method determines an ensemble of classifiers (EoC) for all test samples; however, not every selected classifier in the AutoML pool is an expert in classifying all known training samples (Ko, Sabourin, and Britto 2008).
Dynamic ensemble techniques work by estimating the competence level of each classifier in a pool of classifiers (Vriesmann et al. 2012). The estimated competence of the ensemble of classifiers is based on a local region of the feature space where the query sample is located (Zhu, Wu, and Yang 2004). The local region can be defined by different methods such as overall local accuracy, local class accuracy, A priori, A posteriori, and K-Nearest Oracle (KNORA) (Ko, Sabourin, and Britto 2008).

Literature Review
This study has examined traditional machine learning algorithms, ensemble learning methods, and automated machine learning methods. Researchers reviewed Random Forest (RF), Decision Tree (DT), K-Nearest Neighbors (KNN), AdaBoost, Multilayer Perceptron (MLP), Logistic Regression (LR), Stochastic Gradient Descent (SGD), Linear Discriminant Analysis (LDA), artificial neural network (ANN), support vector machine (SVM), and Naïve Bayes (NB). It is difficult to determine which technique is superior to another because each has its own set of merits and demerits, and implementation issues (Soofi and Awan 2017). DT selects features in a top-down approach, beginning with the attribute that offers the highest degree of information gain with the lowest entropy (Berens et al. 2018). A lower value of entropy gives higher information purity of the node (Azad et al. 2021). LR is a statistical method that makes categorizations according to the rules of probability by estimating the value assumptions of dependent variables as probabilities (Korkmaz, Güney, and Yiğîter 2012). The probabilities describing the possible outcomes of each feature vector are modeled using the logistic function (Rovira, Puertas, and Igual 2017). NB uses Bayes' Rule together with the assumption that the attributes are conditionally independent for the given class (Webb 2011). A random forest can be defined as {h(x; θ_k), k = 1, ...}, where the {θ_k} are independent, identically distributed random vectors and each tree casts a unit vote for the most popular class at input x. Significant improvements in classification accuracy have resulted from growing an ensemble of trees and letting them vote for the most popular class (Breiman 2001). ANNs involve nonlinear relationships among different datasets that are difficult to identify using conventional analyses (Kriesel 2007).
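The entropy-based split criterion described above can be sketched as follows; the tiny dropout/retention labels and the "long commute" split are illustrative assumptions, not the study's data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy of the parent node minus the weighted entropy of its splits."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

# Toy example: 1 = dropout, 0 = retained, split on a "long commute" attribute
labels = [1, 1, 1, 0, 0, 0, 0, 0]
long_commute = [1, 1, 1, 0]   # subset of labels with a long commute
short_commute = [0, 0, 0, 0]  # subset of labels with a short commute

print(round(entropy(labels), 3))                                        # ≈ 0.954
print(round(information_gain(labels, [long_commute, short_commute]), 3))  # ≈ 0.549
```

A decision tree would prefer this attribute over one whose split leaves the subsets as mixed as the parent, since its information gain is high.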
Neural networks can learn non-linear and complex relationships and do not restrict input variables that represent dropout determinants (Kalegele 2020).

Ensemble Learning Approach
Ensemble learning brings the practice of running multiple learner classifiers on a dataset to extract a number of predictions, which are combined into one composite prediction (Steinki and Mohammad 2015). Ensembling different machine learning algorithms is an effective method for acquiring a high level of predictive accuracy, and predictions made through ensembles are usually more accurate than predictions made by a single model (Shanthini, Vinodhini, and Chandrasekaran 2018). Ensemble learning improves classification accuracy through static and dynamic ensemble techniques (Mousavi and Eftekhari 2015). The static ensemble selection approach searches a pool of candidate classifiers for the best ensemble classifier structure based on a specific criterion (Karlos, Kostopoulos, and Kotsiantis 2020). The dynamic ensemble selection automatically selects a competent subset of ensemble classifiers when making predictions for different test sets of training samples (Cruz et al. 2020). The competence of an ensemble of classifiers can be assessed by various methods such as overall local accuracy, local class accuracy, A priori, A posteriori, and K-Nearest Oracle (KNORA) (Ko, Sabourin, and Britto 2008). KNORA selects all classifiers that correctly classify all subsets of samples in the region of competence (Oliveira et al. 2018). For any test sample, KNORA simply finds its k nearest neighbors in the validation set to determine the optimal classifiers (Ko, Sabourin, and Britto 2008). The KNORA-Eliminate and KNORA-Union DES techniques estimate classifier competence over a test sample's region of competence using a given criterion (Oliveira et al. 2018). KNORA-Eliminate selects all classifiers that correctly classify all samples in the test sample's region of competence, only if such a classifier exists. Otherwise, it removes the sample that is farthest away from the test sample from the region of competence.
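A minimal majority-voting ensemble of the kind described above can be sketched with scikit-learn; the base learners and the synthetic data here are illustrative assumptions, not the study's configuration.

```python
# Hard-voting ensemble sketch: each base learner casts one vote per sample,
# and the majority class wins. Data is synthetic for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("nb", GaussianNB()),
    ],
    voting="hard",  # majority vote over predicted labels
)
ensemble.fit(X_tr, y_tr)
print(round(ensemble.score(X_te, y_te), 3))
```

Switching `voting="hard"` to `"soft"` would average predicted probabilities instead, which often helps when base learners are well calibrated.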
As a result, KNORA-Eliminate can lead to the selection of incompetent optimized machine learning models by reducing the region of competence (Janardhan and Kumaresh 2022). The study adapts the KNORA-Union method proposed by Ko, Sabourin, and Britto (2008) to compose the ensemble of optimized machine learning models using the AutoML approach presented in Figure 14. KNORA-Union selects all classifiers/optimized ML models that correctly classify at least one sample in the competence region. The more neighbors a classifier correctly classifies, the more votes it receives for a test sample (Junior et al. 2020). Hutagaol's (2019) results revealed the performance of ML algorithms individually (KNN = 98.20%, NB = 98.24%, and DT = 97.91%); combining them into an ensemble classifier yielded a performance of 98.82%, greater than that of the individual classifiers. Adejo and Connolly (2018) used three algorithms, DT, ANN, and SVM, to build an ensemble classifier with an accuracy of 81.67%. Satyanarayana and Nuckowski (2016) applied DT (J48), NB, and RF to build an ensemble classifier with an accuracy of 95%, which was higher than that of the individual classifiers. Ayyappan and Sivakumar (2018) used Bayes-, Rules-, and Trees-based classifiers to build ensemble models using boosting, bagging, and dagging methods. Results for Bayesian-based classifiers with the boosting method were as follows: BayesNet 72.18%, Complement Naïve Bayes 52.96%, and Naïve Bayes 69.98%. Results for Rules-based classifiers were as follows: Conjunctive Rule 51.38%, Decision Table 71.68%, and JRip 70.66%. Trees-based classifiers performed as follows: BTTree = 72.03%, Decision Stump = 51.38%, and J48 = 71.73%. Their study used residence, medium of study, sex, attendance, subjects taught, and student grade to determine student performance and dropout. Unfortunately, the study did not show the impact of each factor on the dropout prediction models.
Moreover, some of these factors are not relevant to developing countries. Satyanarayana and Nuckowski (2016) applied DT (J48), NB, and RF to build an ensemble classifier. The ensemble of DT (J48) and Bagging was modeled on mathematics and Portuguese grades. For the mathematics grades, the individual results were DT (J48) = 78% and Bagging = 82%, while the ensemble model reached 95%. Similarly, for the Portuguese grades, DT (J48) = 71%, Bagging = 79%, and the ensemble model = 94%. The obtained results indicate the power of ensemble learning techniques compared to individual machine learning algorithms; prediction accuracy improved significantly with the ensemble model. Moreover, their results showed that extra educational support, daily alcohol consumption, large family size, and internet access at home highly correlated with student dropout.

Automated Machine Learning (AutoML)
Automated Machine Learning (AutoML) is a technique to derive the best classification model and corresponding hyper-parameters for a given decision-making problem (Feurer et al. 2015). AutoML selects the best combination of hyper-parameters and features for the optimal prediction model (Kotthoff et al. 2017). The process of building such actionable machine learning models can generate added value for the existing problem (Tuggener et al. 2019). Several studies have predicted student performance and dropout rates in higher education using traditional machine learning models, and a few studies have used AutoML models (Zeineddine, Braendle, and Farah 2021). The KNORA-AutoML method for improving student dropout prediction models has not yet been applied to secondary schools in developing countries. Prabaharan, Mehta, and Chauhan (2020) and Waring, Lindvall, and Umeton (2020) evidenced the significance of using the AutoML method in healthcare, demonstrating the influential features and the best combination of classification methods to improve health outcomes. Krauß et al. (2020) applied the AutoML model using hyper-parameter optimization techniques in production quality. Their preliminary results showed F1 scores of 37% for the untuned model and 42% for the tuned model using Random Forest. The automatic generation of influential features and machine learning algorithms leads to the optimal prediction model implemented by hyper-parameter optimization techniques (Feurer et al. 2015). Mnyawami, Maziku, and Mushi (2022) developed an improved model to predict student dropout in developing countries, particularly in Tanzania. The model was developed using AutoML to improve prediction results for secondary schools in Sub-Saharan countries. The results indicate that Grid search, Randomized search, and Bayesian optimization techniques perform better than conventional machine learning models.
The average accuracies of the AutoML model using the grid search method were DT = 99.9%, RF = 99.9%, LR = 98%, NB = 97%, and SGD = 95%. Results of the randomized search method were DT = 97%, RF = 97%, LR = 97%, NB = 97%, and SGD = 96%. Again, their study demonstrated that students' performance/marks, age, school distance, mother's education, number of children in the family, parents' occupation, father's education, gender, mother-tongue language, and means of transport used to school have a high impact on student dropouts. Mduma, Kalegele, and Machuve (2019) improved prediction accuracy after applying grid search parameter tuning to conventional machine learning algorithms. Their results showed LR = 89.7%, MLP = 86.5%, NB = 78.4%, and RF = 88.8% for the tuned models; for traditional ML algorithms, LR = 75%, MLP = 76%, RF = 75%, and KNN = 73% with the under-sampling technique, and LR = 78%, MLP = 64%, RF = 50%, and KNN = 55% with the over-sampling technique. AutoML contributes significantly to improving the prediction results of machine learning and ensemble learning approaches. Moreover, the authors revealed the main source of household income, the boys' pupil-to-latrine ratio, the presence/absence of a girls' privacy room, student gender, and whether a parent checks the child's exercise book at least once a week, among others, as factors that lead students to drop out of school. Their research employed the SMOTE technique to address the under-fitting and over-fitting issues in machine learning classification and prediction.
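Grid search tuning of the kind reported above can be sketched as follows; the parameter grid, the decision-tree estimator, and the synthetic data are illustrative assumptions rather than the cited studies' configurations.

```python
# Exhaustive grid search over a small hyper-parameter grid with 5-fold CV.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

param_grid = {"max_depth": [3, 5, 10, None], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print(search.best_params_)                 # tuned hyper-parameters
print(round(search.score(X_te, y_te), 3))  # accuracy of the tuned model
```

Randomized search (`RandomizedSearchCV`) would sample from the same grid instead of enumerating it, which scales better when the grid is large.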

Bayesian Hyper-Parameter Optimization
Hyperparameter optimization can be achieved by grid search, random search, and Bayesian optimization (Yang and Shami 2020). Grid search suffers from high-dimensional data and is computationally expensive (Bergstra and Bengio 2012). Randomized search solves large-scale problems efficiently in a way that is impossible for grid search (Zabinsky 2011). On the other hand, random search does not employ a search technique to forecast the subsequent trial and does not use data from past trials to choose the next set (Tsiakmaki et al. 2020). Therefore, this study selected the Bayesian optimization method due to its superiority over random and grid search (Turner et al. 2021). When performing Bayesian optimization, prior information is established for the optimization function, which is gathered from the previous dataset to update the posterior of the optimization function using Bayes' theorem (Wu et al. 2019).
The theorem states that the posterior probability of a model M given data D, P(M|D), is proportional to the likelihood of D given M, P(D|M), multiplied by the prior probability P(M). As for the hyperparameter optimization problem, model M should not be mistaken for the output model of the machine learning algorithms (Tsiakmaki et al. 2020). The function generates the ideal value deduced from the posterior knowledge. Bayesian optimization is a useful approach for minimizing or maximizing objective functions that are expensive to evaluate for machine learning models (Wu et al. 2019). Due to the tractability of its posterior and predictive distributions, Bayesian optimization adopts the Gaussian process method to choose prior information (Rasmussen 2004). The Gaussian process has desirable properties such as uncertainty estimates over function values, resistance to overfitting, and principled approaches to hyperparameter optimization (Gal 2015). The Gaussian process is efficient and is applied when very little information about the objective function is available, making it helpful in optimizing costly black-box functions (Brochu 2010). The Gaussian process is specified by its mean function m(x) and covariance function k(x, x′). Therefore, the Gaussian process is defined in Equations (1) and (2).
Given observations D = (X, f), the function f is distributed as a Gaussian process with mean function m and covariance function k, i.e., f(x) ~ GP(m(x), k(x, x′)).

Due to the limitations of the previous research on conventional machine learning, ensemble learning methods, and automated machine learning approaches, this study employs the KNORA-AutoML prediction model to address the challenges of conventional ensemble learning models and AutoML approaches and to compose a dynamic ensemble of optimized machine learning models.

Dataset Processing Methods
The dataset was obtained from the Twaweza Uwezo data repository. Students who dropped out of school in Tanzania, Uganda, and Kenya were included in these datasets. The datasets were in Stata (.dta) file format; Jupyter Notebook was used to read and merge the files, which were then converted to CSV file format. The datasets contained 385,634 records with 37 features before Scikit-learn data analysis and classification. The remaining dataset contained 206,885 samples and 15 features after removing inconsistent rows and features. The univariate feature processing method imputed values in the i-th feature dimension using only non-missing values in that feature dimension (Emmanuel et al. 2021). The missing values were handled by an imputation technique using the mean of each column in which the missing values were found (Rezaie et al. 2010). The mean imputation method is the simplest and most widely used method in generic imputation practice to replace missing data, and it has good robustness (Fu, Liao, and Lv 2021). Then, 36,723 records with outliers were removed using the interquartile range (IQR) (Whaley 2005). The study adapts the features presented by Mnyawami, Maziku, and Mushi (2022) in Figure 1. Therefore, the study adopted the DT and Chi-squared methods to select the important features used in classification and prediction (Mengash et al. 2022). The Chi-squared method selects the minimum number of features needed to represent the data accurately (Liu and Setiono 1995). Figure 1 shows that each feature has an impact on student dropout prediction: school distance (28%), household size (17%), means to school (16%), household children (12%), age (7.5%), parents' occupation (4.5%), mother's education (4%), father's education (3.5%), gender (3%), and mother-tongue language (2.5%). A higher feature score indicates a significant contribution to predicting student dropouts.
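The two preprocessing steps described above, mean imputation followed by IQR-based outlier removal, can be sketched on a toy column; the column name and values are illustrative assumptions, not the Uwezo data.

```python
# Mean imputation, then the 1.5*IQR rule for outlier removal, on a toy column.
import numpy as np
import pandas as pd

df = pd.DataFrame({"school_distance_km": [1.0, 2.0, np.nan, 3.0, 2.5, 40.0]})

# Mean imputation: replace missing values with the column mean.
df["school_distance_km"] = df["school_distance_km"].fillna(
    df["school_distance_km"].mean())

# IQR rule: keep rows inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1 = df["school_distance_km"].quantile(0.25)
q3 = df["school_distance_km"].quantile(0.75)
iqr = q3 - q1
mask = df["school_distance_km"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = df[mask]

print(len(df), len(clean))  # the 40 km outlier row is removed
```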
Moreover, Figure 2 presents the permutation method for selecting important features. Important features are selected from the training and testing sets of the acquired dataset (Ganz and Konukoglu 2017). If the decreases in accuracy scores on the training and testing sets are very similar, the model accurately demonstrates the important features for prediction; otherwise, the model is overfitting (Ojala and Garriga 2010). Therefore, the selected features are appropriate for predicting student dropouts in secondary schools in developing countries. Figure 3 shows the reuse of the prediction model presented by Mnyawami, Maziku, and Mushi (2022). The input dataset is passed to data preprocessing. A feature engineering technique is selected and applied to the dataset. The list of models is then passed, and the method is run for all of them. The hyperparameter optimization technique is then chosen, and the optimal hyperparameters and models are returned. Finally, the models are trained, and the model evaluation metrics are used to determine the best classification models, as described in pseudocode 1. Again, the static ensemble and KNORA dynamic ensemble selection strategies are used to combine optimized machine learning models to produce a single optimized prediction model, as presented in Figures 5 and 6. The study divided the 206,885 samples into 70% training data and 30% testing data for the various experiments. The best results are obtained when 20-30% of the data is used for testing and 70-80% for training (Gholamy, Kreinovich, and Kosheleva 2018). The training/testing ratio of 70/30 was the best for training and validating ML models (Saroj et al. 2022). The testing set is used to assess the accuracy of the resulting model (Gholamy, Kreinovich, and Kosheleva 2018). The 70/30 training/testing ratio helps avoid model overfitting and underestimating the obtained prediction accuracy of the model (Nguyen et al. 2021).
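The permutation-importance check and the 70/30 split described above can be sketched as follows; the synthetic features stand in for the study's dropout features, so the setup is illustrative only.

```python
# Permutation importance: shuffle one feature at a time and measure the drop
# in test accuracy; a large drop marks an important feature.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=6, n_informative=3,
                           random_state=1)
# 70/30 training/testing split, as used in the study.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)

model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)

result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=1)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```

Running the same computation on the training set and comparing the two rankings is one way to perform the overfitting check mentioned above.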

Implementation of KNORA-AutoML
This section shows the implementation of a conventional ensemble learning model, a static ensemble of optimized machine learning models using the AutoML approach, and the KNORA-AutoML prediction model. Figure 4 presents the use of the majority voting method to combine un-optimized machine learning algorithms via the traditional ensemble selection strategy.
Similarly, Figure 5 demonstrates the ensemble of the optimized MLAs built by the traditional ensemble selection strategy, where C represents the ensemble of optimized machine learning models. The static ensemble selection strategy selects the optimal prediction model from an ensemble of optimized machine learning models (Dos Santos, Sabourin, and Maupin 2008). The subset of the ensemble of optimized models is selected for all test samples acquired from the dataset (Ko, Sabourin, and Britto 2008), as presented in pseudocode 2. Therefore, selecting a set of optimized models that is less susceptible to resource consumption and response time than the full set significantly improved performance (Zhou, Wu, and Tang 2002).
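The static selection strategy above can be sketched in a few lines: score every pool member once on a validation set and keep the single best model for the entire test set. The pool members and data are illustrative assumptions, not the study's optimized models.

```python
# Static ensemble selection sketch: one validation pass, one winner for all
# test samples (in contrast to dynamic selection, which re-selects per sample).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1200, n_features=10, random_state=7)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.5, random_state=7)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5,
                                            random_state=7)

pool = [LogisticRegression(max_iter=1000),
        DecisionTreeClassifier(random_state=7),
        GaussianNB()]
for model in pool:
    model.fit(X_tr, y_tr)

# Select the best model on the validation set; use it for every test sample.
best = max(pool, key=lambda m: m.score(X_val, y_val))
print(type(best).__name__, round(best.score(X_te, y_te), 3))
```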

Pseudocode 2: Static Ensemble Selection of Optimized Machine Learning Models

Method:
for all test samples S_t in dataset D do
    find Q as the optimal model using S_t in training set T
    for each sample Sample_i in S_t do
        for every optimized model optm_i in optm do
            if optm_i classifies Sample_i correctly then
                extract optm' from optm using S_t in D
            end if
        end for
    end for
    return the optimal model for the given S_t
end for

Figure 6 shows the implementation of KNORA to select the optimal prediction model among the optimized models. The dynamic ensemble strategy automatically selects a competent subset of an ensemble of optimized models for different test sets of training samples (Vriesmann et al. 2012). The method chooses an expert ensemble of optimized models and labels each unknown instance x_{t+1}. The expert is chosen as the ensemble with the highest prediction accuracy up to instance x_t (Albuquerque et al. 2019). The optimized model's performance depends on a local region defined by the k-nearest neighbors in the validation set of test samples (Zhou, Wu, and Tang 2002). KNORA applies the K hyper-parameter to find the nearest neighbors of the optimized models for a given set of test samples (Oliveira et al. 2018), as demonstrated in pseudocode 3.

Pseudocode 3: KNORA-U Optimized Machine Learning Model

Input: pool of optimized machine learning models optm; validation set S_val; testing sample S_t; nearest neighborhood size K.
Output: ensemble of optimized machine learning models optm' selected using S_t.

Table 1 describes the features in the AutoML method, ensemble learning model, and KNORA-AutoML prediction model. Data was transformed and coded before being used in the compared methods.
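A from-scratch sketch of the KNORA-Union rule in pseudocode 3: for each test sample, every pool model receives one vote per correctly classified neighbor in the validation set, and the votes weight the pooled predictions. The pool members and synthetic data are illustrative; libraries such as DESlib ship production KNORA-U implementations.

```python
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier

def knora_u_predict(pool, X_val, y_val, X_test, k=7):
    """KNORA-Union: weight each model's vote by the number of its correctly
    classified neighbors in the test sample's region of competence."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_val)
    # correct[m, j] is True if pool model m classifies validation sample j correctly
    correct = np.array([m.predict(X_val) == y_val for m in pool])
    preds = np.array([m.predict(X_test) for m in pool])
    out = []
    for i, neighbors in enumerate(nn.kneighbors(X_test, return_distance=False)):
        votes = Counter()
        for m, model_preds in enumerate(preds):
            weight = int(correct[m, neighbors].sum())  # one vote per correct neighbor
            votes[model_preds[i]] += weight
        out.append(votes.most_common(1)[0][0])
    return np.array(out)

X, y = make_classification(n_samples=1200, n_features=10, random_state=3)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.5, random_state=3)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5,
                                            random_state=3)
pool = [m.fit(X_tr, y_tr) for m in (LogisticRegression(max_iter=1000),
                                    DecisionTreeClassifier(random_state=3),
                                    GaussianNB())]
y_pred = knora_u_predict(pool, X_val, y_val, X_te, k=7)
print(round(float((y_pred == y_te).mean()), 3))
```

In the paper's pipeline the pool would be the AutoML-optimized models rather than these default classifiers.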

Model Evaluation Metrics
The confusion matrix is an N × N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the ML model (Novaković et al. 2017). The confusion matrix is used to measure the classifier's accuracy, that is, the ratio between correctly predicted results and the total number of samples (Hutagaol 2019). Figure 7 shows the model evaluation metrics determined by the True Positive Rate, False Positive Rate, True Negative Rate, and False Negative Rate (Deepa and Bharathi 2013). Therefore, the confusion matrix is used to evaluate the precision, accuracy, recall, F-measure, AUC, and ROC curve of the ML prediction models (Hossin and Sulaiman 2015). Additionally, the ROC curve is a method for evaluating, comparing, and selecting classification models.
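Deriving the metrics above from a confusion matrix can be sketched as follows; the eight labels are a small illustrative example, with 1 marking dropout and 0 marking retention.

```python
# Confusion-matrix-derived metrics on a tiny illustrative prediction set.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = dropout, 0 = retained
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# sklearn's 2x2 confusion matrix flattens as [TN, FP, FN, TP].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)                   # 3 1 1 3
print(accuracy_score(y_true, y_pred))   # (TP+TN)/N   = 0.75
print(precision_score(y_true, y_pred))  # TP/(TP+FP)  = 0.75
print(recall_score(y_true, y_pred))     # TP/(TP+FN)  = 0.75
print(f1_score(y_true, y_pred))         # harmonic mean of the two = 0.75
```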

Results and Discussions
Several machine learning models fail to provide appropriate prediction results due to improper selection of training/testing features and of the corresponding hyperparameter values for the optimal prediction model (Vaccaro, Sansonetti, and Micarelli 2021). Figure 8 presents experimental prediction results using basic/default training and testing: Naïve Bayes (NB) = 94%, Linear Discriminant Analysis (LDA) = 93%, Stochastic Gradient Descent (SGD) = 93%, Decision Tree (DT) = 97.5%, Random Forest (RF) = 98%, Logistic Regression (LR) = 96.5%, K-Nearest Neighbors (KNN) = 95.5%, and Adaptive Boosting (AdaBoost) = 97%. AdaBoost uses an ensemble learning method to learn from the mistakes of weak classifiers to create strong prediction models (Wang 2012). Figure 9 presents improved prediction accuracy over the results in Table 1. The prediction accuracy results indicate that NB = 95%, LDA = 94%, SGD = 96%, DT = 97%, RF = 97%, LR = 96%, KNN = 97%, and AdaBoost = 97%. Other metrics show slight changes for some machine learning algorithms, e.g., precision for NB = 62%, SGD = 63%, DT = 70%, LR = 68%, and KNN = 70%. It is proven that there is a trade-off between precision and recall in a given dataset, where an increase in one causes a decrease in the other (Alvarez 2002). Thus, increasing precision through optimized methods often results in a decrease in recall and vice versa. Further results show that SGD = 78%, LDA = 60%, DT = 98%, KNN = 85%, LR = 99%, RF = 99%, NB = 60%, and AdaBoost = 99%. Table 2 shows the performance of the AutoML and ensemble learning models. The AutoML model using the Bayesian search strategy outperformed the ensemble learning model. The AutoML model achieved accuracies of LDA = 95%, NB = 97%, and SGD = 97%, while the ensemble prediction model achieved an accuracy of 94.7%. Ensemble learning models and traditionally trained machine learning algorithms suffer from feature processing and the choice of corresponding hyperparameters for an optimal machine learning algorithm (Tuggener et al. 2019).
Figure 12 shows the ROC curves for the AutoML and ensemble learning models obtained on the same acquired datasets. A classifier whose curve is closer to the top-left corner indicates better performance, while a curve closer to the 45-degree diagonal indicates poor performance (Zou, O'Malley, and Mauri 2007). The results suggest that NB = 100% performed better than the ensemble prediction model at approximately 98%, followed by SGD = 96% and LDA = 95%. Furthermore, the accuracy of the ROC curve prediction is determined by the diagonal line that separates correctly predicted (true positive) and incorrectly predicted (false negative) values (Vujović 2021).
Moreover, Figure 13 shows the prediction results of the machine learning algorithms optimized by the Bayesian optimization technique. The ensemble of optimized machine learning models applied a traditional selection strategy using the majority-voting rule. The classifiers were voted on by members to obtain the best-performing single classifier for predicting student dropouts in developing countries. Similarly, Figure 14 shows that the KNORA optimized model outperforms the static ensemble of optimized models. The KNORA-AutoML model achieved an accuracy of 97%, precision = 71%, recall = 76%, F1 = 74%, and AUC = 87%, compared with the conventional ensemble of optimized ML models with accuracy = 96%, precision = 70%, recall = 58%, F1 = 64%, and AUC = 78%. The KNORA-AutoML model's performance increased by 0.6% in accuracy, 0.8% in precision, 17.8% in recall, 9.9% in F1, and 8.7% in AUC. The proposed approach shows better prediction results compared to the traditional ensemble of optimized machine learning algorithms/models. Furthermore, age, gender, attendance, truancy, early marriage, parents' occupation, class activities, repetition, family size, school cost, distance, disability, extra-curricular activities, student performance, and student grades are used in the majority of existing prediction models. Students who drop out are more likely to face difficult situations in meeting their basic needs, which drives them to commit crimes by 76.3%, drug abuse by 73.2%, poverty by 66.1%, unemployment by 59.3%, and early marriage by 54.2% (Chuwa 2018). Figure 1 depicts significant features influencing student dropouts in secondary schools in Sub-Saharan countries. These features were discovered using statistical and machine learning analysis in various experiments. The Bayesian optimization model finds better search points with fewer function evaluations and is robust to noisy datasets (Frazier 2018).
KNORA was integrated into AutoML using the Bayesian optimization technique to develop a better prediction model for identifying the root causes of student dropouts in Sub-Saharan African countries. In comparison to Figure 13, the KNORA-AutoML model achieved 97% accuracy, 71% precision, 76% recall, 74% F1, and 87% AUC in Figure 14.
Previous research (Mirza and Hassan 2020) reports related findings. Table 3 shows the optimized values for some machine learning algorithms used in the different experiments. The optimized algorithms affect prediction results significantly more than traditional machine learning modeling.
Additionally, Table 4 depicts student dropout feature combinations and their prediction accuracies. Each feature makes a significant contribution to predicting secondary school dropouts. Home language, age, mother's education, school distance, and mode of transportation are the most important features, influencing student dropouts by 95%. Similarly, household size, gender, school distance, and household children together contribute 94% to student dropout. A combination of five (5) or more features influences student dropouts more than combinations of four (4) or fewer features.

Conclusion and Future Directions
In many secondary schools in developing countries, children leave school for various reasons that are hard to pinpoint precisely. The severity of the issue has been lessened in developing countries through several programs, including the Big Results Now projects, Free Education for All, No Child Left Behind, the Secondary Education Development Programme, and machine learning prediction models. Due to incorrect root-cause identification and a lack of formal protocols that may be used to gauge the severity of the problem, dropout rates continue to be a problem, especially in secondary schools. The KNORA-AutoML model outperforms the conventional ensemble learning model and the static ensemble of optimized models using AutoML. The performance of the KNORA-AutoML model is evaluated by the average accuracy, precision, recall, F1, AUC, and ROC curve. Performance measures show that the KNORA-AutoML model improved by 0.6% in accuracy, 0.8% in precision, 17.8% in recall, 9.9% in F1, and 8.7% in AUC. Moreover, the findings show that each feature has a significant impact on student dropout prediction; for example, school distance is 28%, household size is 17%, means to school is 16%, household children is 12%, age is 7.5%, parents' occupation is 4.5%, mother's education is 4%, father's education is 3.5%, gender is 3%, and mother-tongue language is 2.5%. The various experiments and analyses indicate that the KNORA-AutoML model outperforms conventional machine learning models/algorithms and static ensemble learning models. The results pertaining to student dropout rates in developing countries receive considerable attention when a well-optimized technique is applied. Furthermore, the study suggests using other dynamic ensemble selection schemes to combine the optimized machine learning models produced by AutoML for predicting student dropouts. Therefore, the proposed prediction model emphasizes closely following up on the suggested features and proper planning for early intervention.
Table 4 lists feature combinations and their prediction accuracies (%); one such combination is (age, school distance, means of transport, father's education, home language, parent's occupation, gender, mother's education, household size).