Enhanced Model for Predicting Student Dropouts in Developing Countries Using an Automated Machine Learning Approach: A Case of Tanzania's Secondary Schools

ABSTRACT In 2018, Sub-Saharan countries led secondary-school dropout rates at 37.5%, followed by South Asia at 15.5% and the Middle East at 11%. In Tanzania, student dropout in secondary schools increased from 3.8% in 2018 to 4.2% in 2019. Initiatives such as parent workshops, parent-teacher meetings, community empowerment programs, school feeding programs, and the Secondary Education Development Program (SEDP) have been used to address student dropout, yet the problem persists. The persistence of dropout, especially in secondary schools, is attributed to poor identification of root causes and the unavailability of formal methods for projecting the severity of the problem. Machine learning (ML) techniques have performed well in predicting secondary school dropouts; however, most ML models suffer from weak feature processing and hyper-parameter tuning, which leads to poor prediction accuracy in identifying the root causes of student dropout. In this study, an AutoML model is used to improve prediction accuracy by selecting the appropriate hyper-parameters, features, and ML algorithm for the acquired dataset. The proposed model achieved better prediction accuracy: DT = 99.8%, KNN = 99.6%, MLP = 99%, and NB = 97%. The improved prediction score indicates an accurate selection of the features that cause student dropout, which should be monitored closely during the learning process for early intervention.


Introduction
School dropout is an untimely withdrawal from school that leaves students prematurely without obtaining a minimal credential (Witte et al. 2013). By 2017, an estimated 5.1 million children aged 7-17 had dropped out of school, including nearly 1.5 million secondary-school-aged children (HRW 2017). Sub-Saharan countries lead in secondary-school dropout rates at 37.5%, followed by South Asia at 15.5%. This motivates the research question: how can an enhanced machine learning model for predicting student dropouts in secondary schools in Tanzania be developed? The rest of the paper is organized as follows. Section 2 discusses the related works. Section 3 presents the methodology and Section 4 presents the results. Last, Section 5 presents the conclusion and future scope of the work. Mirza and Hassan (2020) conducted student dropout intervention using socioeconomic and school factors such as sex, age, disability, marriage, number of siblings, income, residence, distance from school, transport facility, toilet facility, and drinking water. Their study reported RF = 96%, SVM = 93%, NB = 94%, DT = 89%, and GLM = 98%. The study provided general results from machine learning algorithms but did not identify the factors contributing most to student dropout. Mduma, Kalegele, and Machuve (2019) explored factors that reduce secondary student dropout, namely the main source of household income, the boys' pupil-latrine ratio, whether the school has a girls' privacy room, student gender, whether a parent checks his/her child's exercise book once a week, and so on. Their prediction enhancement was achieved by an ensemble classifier that combined Logistic Regression and Multilayer Perceptron to predict secondary students' dropout. Moreover, Mduma, Kalegele, and Machuve (2019) evidenced an improvement of prediction accuracy after parameter tuning.
Their results showed LR = 89.7%, MLP = 86.5%, NB = 78.4%, and RF = 88.8%, compared with traditional ML training under the under-sampling technique (LR = 75%, MLP = 76%, RF = 75%, and KNN = 73%) and over-sampling (LR = 78%, MLP = 64%, RF = 50%, and KNN = 55%), to avoid the under-fitting and over-fitting problems of machine learning prediction. Hutagaol's (2019) results revealed the performance of ML algorithms individually, K-Nearest Neighbor (98.2%), Naïve Bayes (98.2%), and Decision Tree (97.9%), and later combined them into an ensemble classifier that achieved 98.8%, outperforming the individual classifiers. That study used student grade, student location, parents' income, parents' education, student gender, age, homework, and attendance as the student dropout factors; student attendance and homework were identified as the most contributing dropout factors, followed by the mid-test and final test. Sara et al. (2015) performed prediction based on gender, absence, missing assignments, education history, average income, school size, class size, travel time to school, and teacher-pupil ratio. However, their study predicted only the first three months of studies, even though some factors persist over a long education tenure. These dropout factors were applied to conventional machine learning algorithms, with individual performances of Random Forest 93.5%, SVM 90.5%, CART 89.8%, and Naïve Bayes 85.6%. Prediction accuracy can, however, be improved by the averaged output of an ensemble learning prediction model. This study applied machine learning algorithms including Decision Tree, Naïve Bayes, Random Forest, Support Vector Machines, Multilayer Perceptron, Logistic Regression, and K-Nearest Neighbors to build the AutoML prediction model.
The identification of the optimal prediction model depends on a voting scheme over the majority of the constituent methods implemented in the AutoML model (Zeineddine, Braendle, and Farah 2021). Therefore, the study used the proposed AutoML model to identify the optimal ML algorithms and the features that improved prediction accuracy. These optimal features warrant close observation to reduce student dropout in secondary schools and to enable proactive strategies against the risk of dropping out of school.

Methodology
The AutoML model for predicting student dropout in secondary schools was developed by selecting the best related model through a literature reduction process (Page et al. 2021). AutoML refers to the large-scale automation of the techniques that produce an end ML model: data preparation, feature engineering, model generation, and model evaluation (Nagarajah and Poravi 2019), as shown in Figure 3. Model generation is divided into the search space and optimization techniques such as hyper-parameter optimization (He, Zhao, and Chu 2021). This paper adapted the methodology of Zeineddine, Braendle, and Farah (2021), who developed a model to predict student performance in higher learning institutions, to develop the improved model for predicting student dropouts. The methodology was selected for its relevant model components, which improve prediction accuracy and obtain influential features for the optimal prediction model using hyper-parameter optimization techniques such as grid and randomized search. Choosing the hyper-parameter values that correspond to the best features and machine learning algorithm greatly improves prediction accuracy (Wu et al. 2019). Automation of feature processing and hyper-parameter tuning is important for ML, since these directly control the behavior of training algorithms and have a significant impact on the performance of ML models (Wu et al. 2019).
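The model-search stage described above can be condensed into a short sketch: each candidate algorithm is scored with cross-validation and the best scorer is retained. The candidate list and synthetic data below are illustrative stand-ins, not the study's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the preprocessed student dataset
X, y = make_classification(n_samples=300, n_features=15, random_state=0)

# Candidate algorithms considered by the automated model search
candidates = {
    "DT": DecisionTreeClassifier(random_state=0),
    "NB": GaussianNB(),
    "KNN": KNeighborsClassifier(),
}

# Score each candidate with 5-fold cross-validation and keep the best
scores = {name: cross_val_score(est, X, y, cv=5).mean()
          for name, est in candidates.items()}
best = max(scores, key=scores.get)
```

In a full AutoML loop this selection would be nested inside hyper-parameter optimization, so that both the algorithm and its configuration are chosen jointly.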

Data Preprocessing Methods
This study extracted the dataset from the Twaweza Uwezo information repository. The datasets cover students dropping out of school in Tanzania, Uganda, and Kenya. The datasets were in Stata (.dta) file format; a Jupyter Notebook was used to read and merge the files, which were then converted to CSV format. The datasets contained 385,634 records with 37 features before Scikit-learn data analysis and classification. After removing inconsistent rows and features, 206,885 samples and 15 features remained; a univariate feature processing method imputed values in the i-th feature dimension using only the non-missing values in that dimension (Emmanuel et al. 2021). Missing values were handled by imputation using the mean of the column in which they were found (Rezaie et al. 2010). Then, 36,723 records with outliers were removed using the interquartile range (IQR). The IQR identifies outliers as data lying beyond the range of the dataset (Whaley 2005) and is computed as IQR = Q3 - Q1, where Q3 and Q1 are the upper and lower quartiles, respectively. The lower limit was set at the 25th percentile and the upper limit at the 75th percentile, with Q1 = dataset.quantile(0.25) and Q3 = dataset.quantile(0.75), to handle outliers. Last, the data were converted from categorical to numerical values so that the machine learning algorithms could read the file, using the Jupyter Notebook tool. After data conversion, missing values and outliers were properly handled to improve the predictive accuracy of the model. Table 1 shows the 15 features, out of the original 37, that are most relevant to the task of student dropout prediction.
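The two cleaning steps above, mean imputation and IQR-based outlier removal, can be sketched with pandas as follows. The toy data and the `preprocess` function name are illustrative, not taken from the study's dataset; the 1.5 × IQR fence is the conventional outlier rule assumed here.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Impute missing values with each column's mean, then drop IQR outliers."""
    # Mean imputation: fill NaNs with the per-column mean of the non-missing values
    df = df.fillna(df.mean(numeric_only=True))
    # IQR fences: keep only rows inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] on every column
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    inliers = ~((df < (q1 - 1.5 * iqr)) | (df > (q3 + 1.5 * iqr))).any(axis=1)
    return df[inliers]

# Toy example: one missing age and one extreme age (60) among secondary-school rows
raw = pd.DataFrame({"age": [13, 14, None, 15, 60],
                    "marks": [50, 55, 52, 53, 54]})
clean = preprocess(raw)
```

On this toy frame, the missing age is replaced by the column mean and the row with age 60 falls outside the IQR fences and is dropped.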
| Feature | Description | Type | Values |
| --- | --- | --- | --- |
| | Parents' occupation | Nominal | 1 = Unemployed, 2 = Agriculture, 3 = Self-employed, 4 = Public sector, 5 = Private sector, 6 = Housewife |
| childNo | Number of children | Numeric | 0 = None, 1 = Two children, 2 = Three children, 3 = Four or more |
| mothers_edu | Mother's educational status | Nominal | 0 = None, 1 = Primary, 2 = Secondary, 3 = Post-secondary |
| fathers_edu | Father's educational status | Nominal | 0 = None, 1 = Primary, 2 = Secondary, 3 = Post-secondary |
| school_distance | Distance | Numeric | 1 = 0-0.5 km, 2 = 0.5-1 km, 3 = 1-2 km, 4 = 2-3 km, 5 = 4-5 km, 6 = 6-7 km, 7 = >7 km |
| MeansToSchool | Means to school | Nominal | 1 = Walk, 2 = Bicycle/motorbike, 3 = Public transport, 4 = Private car |
| house_lighting | House lighting | Nominal | 1 = Electricity, 2 = Solar, 3 = Gas, 4 = Paraffin, 5 = Other |
| school-infra | School infrastructure | Nominal | 1 = Toilet, 2 = Water, 3 = Teaching facilities, 4 = Electricity |
| SchoolMealPerDay | School meals taken per day | Nominal | 0 = None, 1 = Once, 2 = Two times, 3 = Three times or more |
| schoolcost | School cost | Binary | 1 = Yes, 0 = No |
| stu_marks | Student marks | Numeric | 1 = Math, 2 = English, 3 = Kiswahili, 4 = History, 5 = Geography, 6 = Civics, 7 = Biology |
| familyincomesource | Sources of income | Numeric | 1 = Formal wage, 2 = Transfers, 3 = Own business, 4 = Farming, 5 = Casual wage, 6 = Home maker, 7 = Pension, 8 = None |
| Class label | Dropout | Binary | 1 = Yes, 0 = No |

Feature Engineering Techniques
This paper analyzed the relationship between each feature/variable and the target variable, assigning each feature a test score; all test scores were then compared to obtain the features with the top scores. There are two types of feature selection methods: filter methods and wrapper methods (Zhao et al. 2020). Filter methods evaluate all features except the target feature before the data are applied to the machine learning algorithm (Nnamoko et al. 2014). Each feature is ranked by its score, using information gain, chi-squared, or the Gini index (Guyon and Elisseeff 2003). The chi-squared method selects the minimum number of features needed to represent the data accurately (Liu and Setiono 1995); therefore, selecting influential features with the chi-squared method affects the performance of the ML algorithms (Nurhayati et al. 2019). Information gain measures the usefulness of a feature in a given dataset, and the impurity of a feature in the student dataset is measured by entropy (Tangirala 2020); a lower entropy value gives higher information purity at the node (Azad et al. 2021). The Gini index checks the purity of a specific class after splitting along a particular feature, and the feature with the lower Gini index is chosen for a split (Zaman, Kaul, and Ahmed 2020). Wrapper methods evaluate feature subsets based on the performance of the learning algorithm (Venkatesh and Anuradha 2019). Wrapper methods such as recursive feature elimination, sequential feature selection, and genetic algorithms are computationally more expensive than filter methods because they use repeated learning steps and cross-validation (Zhao et al. 2020). Figure 1 shows the adaptation of the feature selection methodology presented by Aissaoui et al. (2020). The selection criterion of the methodology is the relevance of the proposed model components, which demand feature enhancement methods to improve prediction accuracy.
Figure 2 shows the 15 features selected out of 37 after Scikit-learn analysis. This paper adopted the DT and chi-squared methods to select the features important for classification. Ten (10) features were selected; the remaining five (5) were not considered due to their small contribution to predicting student dropout. The experiment shows that student marks (57%), student age (18%), distance (7%), and number of children (5%) are the most statistically significant for student dropout, compared to father's education (3%), student gender (3%), and means to school (2.5%).
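A minimal sketch of the chi-squared filter step with scikit-learn's `SelectKBest` follows; the synthetic data and the assumption that the target depends mainly on the first column are illustrative, not the study's actual features.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
# Non-negative categorical codes, as the chi2 scorer requires
X = rng.integers(0, 5, size=(200, 4))
# Toy target driven mostly by feature 0 (analogous to a dominant dropout factor)
y = (X[:, 0] > 2).astype(int)

# Score every feature against the target and keep the two best
selector = SelectKBest(chi2, k=2).fit(X, y)
top = selector.get_support(indices=True)   # indices of the selected features
```

Because the toy target is a function of feature 0, that feature receives the highest chi-squared score and survives the selection, mirroring how the paper ranks and retains the ten most influential dropout features.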

Application of Machine Learning Algorithms to Student Features
The DT selects features in a top-down approach, beginning with the attribute that offers the highest information gain with the lowest entropy (Berens et al. 2018); a lower entropy value gives higher information purity at the node (Azad et al. 2021). The probabilities describing the possible outcomes of each feature vector are modeled using the logistic (sigmoid) function (Rovira, Puertas, and Igual 2017). LR suffers with small datasets and is taken as input 0 when recall and precision are applied to evaluate the performance of the model (Rovira, Puertas, and Igual 2017). In NB, each feature is assumed to contribute independently to the probability that a student drops out or not (Aguiar 2015). Random Forest works well with large datasets (Kemper, Vorhoff, and Wigger 2020); significant improvements in classification accuracy result from growing an ensemble of trees and letting them vote for the most popular class (Breiman 2001). Figure 3 shows that a lower entropy value for a feature indicates higher information purity at the splitting node.
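How a decision tree ranks features by information gain can be seen directly through scikit-learn's `feature_importances_`, which reflects the impurity reduction each feature contributes. The data below are synthetic, with the class deliberately determined by the second feature for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.random((500, 3))
# Toy class label determined entirely by feature 1 (e.g. "marks below threshold")
y = (X[:, 1] > 0.5).astype(int)

# Entropy criterion mirrors the information-gain splitting described above
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Rank features by their share of total impurity reduction, best first
ranking = np.argsort(tree.feature_importances_)[::-1]
```

Since only feature 1 carries information about the label, the root split lands on it and it absorbs essentially all of the importance, the same mechanism by which the paper's DT surfaced student marks as the dominant dropout factor.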

Hyper-parameter Optimization Techniques
HPO definition: an ML algorithm A has N hyper-parameters to be tuned. The domain of the n-th hyper-parameter is denoted Λ_n, so the overall hyper-parameter configuration space is Λ = Λ_1 × Λ_2 × ... × Λ_N. Given a dataset D, the goal is the hyper-parameter configuration λ* that minimizes the loss L of the model generated by algorithm A with hyper-parameters λ, trained on D_train and evaluated on validation data D_valid (Zahedi et al. 2021). The optimization can be formulated as λ* = argmin_{λ ∈ Λ} L(A_λ, D_train, D_valid). This paper deployed grid search and randomized search to select the proper features/variables and machine learning algorithms for better predictive accuracy. Grid search exhaustively evaluates the specified hyper-parameter space of the training algorithm for the acquired dataset (Liashchynskyi and Liashchynskyi 2019); its advantage is that it always finds the best set among the given hyper-parameters (Gada et al. 2021). This paper applied grid search to optimize the performance of the ML algorithms, since it explores all regions of the defined search space (Schaer, Müller, and Depeursinge 2016). Random search replaces the exhaustive enumeration of all combinations with random selection; the method was originally applied to discrete settings but generalizes to continuous and mixed spaces (Liashchynskyi and Liashchynskyi 2019). Random search handles high-dimensional spaces better than grid search, which is slow and computationally expensive (Bergstra and Bengio 2012). Figure 4 presents the prediction model, which loops through different predictive models and corresponding hyper-parameter values to identify the optimal model with the best prediction accuracy. The best prediction accuracy is derived by the following steps: the input dataset is passed to data preprocessing, and the list of ML algorithms is passed to the feature engineering techniques to obtain the influential features.
Then, a hyper-parameter optimization technique is selected and returns the optimal hyper-parameter values for the model. Last, the models were trained, and the analysis was done with the model evaluation metrics to obtain the best classification models. Data preprocessing aimed to obtain a clean dataset and to avoid the curse of dimensionality before applying the data to the ML model. Likewise, the feature engineering phase extracted the important features and the corresponding hyper-parameter values to derive the optimal prediction model.
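The two HPO techniques can be contrasted in a few lines with scikit-learn, here applied to KNN; the parameter grid is an illustrative assumption, not the study's exact search space.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the cleaned student dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Illustrative search space: 5 x 2 = 10 candidate configurations
param_grid = {"n_neighbors": [3, 5, 7, 9, 11],
              "metric": ["euclidean", "manhattan"]}

# Grid search: exhaustively evaluates all 10 combinations with 5-fold CV
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5).fit(X, y)

# Randomized search: samples only 5 of the 10 combinations
rand = RandomizedSearchCV(KNeighborsClassifier(), param_grid,
                          n_iter=5, cv=5, random_state=0).fit(X, y)
```

Both searchers expose `best_params_` and `best_score_`, which corresponds to the λ* of the HPO formulation above; the trade-off is exactly the one the paper cites, exhaustive coverage versus lower cost in high-dimensional spaces.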

Results and Discussion
The results of this study are divided into two sections. The first experiment was training and testing with default/conventional machine learning algorithms (Table 2 and Figure 5), and the second experiment deployed the hyper-parameter optimization techniques (Table 3 and Figure 7). Both experiments applied the 15 features described in Table 1. Grid search performed better than the randomized search hyper-parameter optimization technique in Figure 6. Furthermore, Table 2 shows that a recall < 0.5 indicates that the classifier has a high number of false negatives, which can be an outcome of imbalanced classes or untuned model hyper-parameters, while a recall of 1.0 shows that the classifier predicted accurately for the given features in Figure 2. Moreover, when the Area Under the Curve (AUC) equals or approaches 1, the classifier distinguishes all the positive and negative class points correctly; if AUC = 0, the classifier predicts all negatives as positives and all positives as negatives.
The Receiver Operating Characteristic (ROC) curve plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings (Vujović 2021). Figure 6 shows that classifiers whose curves lie closer to the top-left corner perform better. Similarly, Figure 7 shows improved accuracies of the ML algorithms compared to Figure 5, which has lower prediction accuracy. This confirms that the AutoML model improves prediction accuracy when hyper-parameter optimization techniques are applied to the training algorithms, feature selection, and hyper-parameter tuning. Table 3 shows better performance across six (6) evaluation metrics compared to Table 2 with default parameter settings. Experimental results indicate that RFC and NB have 96% AUC, followed by SGD = 94% and MLP = 90%; AdaBoost and DT have 87%, and LDA has the lowest value of 76%. Moreover, LDA has proven to work well on the other metrics when grid search HPO tuning is applied; hyper-parameter tuning in LDA increases performance compared to default training (Muhajir et al. 2022). Table 4 shows RF = 87.5%, DT = 86.9%, and KNN = 82.5%, a better improvement of the AUC compared with Table 2's 86% for RF, DT = 81%, and KNN = 72%. Other ML algorithms such as AdaBoost, MLP, and LDA have slight improvements of 87%, 90%, and 77%, respectively. Likewise, DT shows a precision of 70.9%, MLP = 71.4%, NB = 51%, and SGD = 71%, compared to 65%, 61%, 48%, and 45%, respectively, without tuning. The randomized search in Figure 8 shows an improvement of accuracy compared with Figure 4, which did not deploy parameter tuning. Grid search performs better when a small dataset is used by the model but suffers from the curse of dimensionality; as the data grow, it performs worse, increasing computational cost and wasting space (Verleysen and François 2005). This paper applied the Manhattan metric and 11 neighbors as the best parameters for KNN.
The Manhattan distance metric is consistently preferable to the Euclidean distance metric for high-dimensional datasets (Aggarwal, Hinneburg, and Keim 2001). The KNN hyper-parameter optimization in Figure 7 performed better than the grid search in Figure 6. Data are classified by a majority vote of their neighbors, with each point assigned to a class based on the distance function measured between two data points x and y (Jawthari and Stoffová 2021). Tables 5-8 show the hyper-parameter techniques, the corresponding average accuracies, and the sets of best hyper-parameter values that represent the model configuration obtained by each technique.
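Evaluating the tuned KNN configuration (11 neighbours, Manhattan metric, the paper's reported best parameters) with ROC-AUC can be sketched as follows; the dataset here is synthetic, so the score is only illustrative of the evaluation procedure.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

# Synthetic binary dropout-style dataset
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# KNN with the paper's reported best hyper-parameters
knn = KNeighborsClassifier(n_neighbors=11, metric="manhattan").fit(X_tr, y_tr)

# AUC summarizes the ROC curve: 1.0 is a perfect ranking, 0.5 is chance level
auc = roc_auc_score(y_te, knn.predict_proba(X_te)[:, 1])
```

Using `predict_proba` rather than hard labels is what lets the ROC curve sweep over all classification thresholds, matching the TPR-versus-FPR interpretation given above.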
This paper builds on the model performances previously reported by various authors to improve prediction results. Lee and Chung (2019) evidenced DT = 89.4%, and Gil, Delima, and Vilchez (2020) and Bibi (2018) evidenced that family financial constraints or poverty in developing economies lead to student dropout by 100%. Similarly, student truancy causes 43% of students to leave school (Bridgeland, Dilulio, and Morison 2006). Said (2020) revealed that distance contributed 53.7% to students not persevering, and that time spent by students in school contributed 46.5%. Student attendance and homework have been identified as the most contributing dropout factors, followed by the mid-test and final tests (Hutagaol 2019), and poor academic performance contributes 51.2% to student dropout (Lee and Chung 2019). The proposed prediction model maximizes the chance of supporting students' successful learning by considering the impacts of the identified features and planning school resources appropriately.

Conclusion and Future Research Directions
Machine learning algorithms have contributed greatly to student dropout prediction in secondary schools. However, predicting student dropout with conventional machine learning algorithms has led to inappropriate selection of the significant features and algorithms needed for problem intervention. The improvement of prediction accuracy is driven by influential features and machine learning algorithms with outstanding performance. This study's main contribution is the improvement of prediction accuracy for secondary-school dropout prediction. Results show that Random Forest, Decision Tree, K-Nearest Neighbors, AdaBoost, Multilayer Perceptron, and Logistic Regression outperformed Stochastic Gradient Descent, Linear Discriminant Analysis, and Naïve Bayes. Results also show that student marks (57%), student age (18%), distance (7%), and number of children (5%) are the most statistically significant for student dropout, compared to father's education (3%), student gender (3%), and means to school (2.5%). This study offers a comprehensive evaluation by comparing the performance of machine learning algorithms with and without the randomized and grid search hyper-parameter optimization techniques; the grid search and randomized search perform better than the default settings of the machine learning algorithms. The improved prediction score indicates an accurate selection of the features that cause student dropout, which should be monitored closely in the learning process for early intervention. Furthermore, the study recommends other computational approaches, such as Bayesian Optimization and Genetic Algorithms, to accurately predict student dropouts in developing countries.