An intelligent system to predict academic performance based on different factors during adolescence

ABSTRACT Students need an effective education to take advantage of all the latest tools available. Yet even with access to a proper education, many students fail to reap its benefits; the reasons involve social, economic and psychological factors a student faces during adolescence. Our research is directed towards this particular problem of educational effectiveness. We surveyed a large number of students across different districts in Bangladesh. Pre-processing was done thoroughly; using data balancing, dimensionality reduction, discretization and normalization in combination has allowed us to derive the best model for predicting academic performance based on different factors during adolescence.


Introduction
Analysing the failure rate of college students is a new area of research in Bangladesh. From 1991 to 2001, dropout rates were reduced from 59.3% to 35% in primary and secondary schools through policy-making; however, problems persist at the undergraduate level (Islam & Pavel, 2001). We aim to investigate socioeconomic and psychological factors that have an influential impact on the academic attainment of students. Ineffective education can be remedied by adopting critical policies; identifying these policies is what our study aims to do.
Predicting the Higher Secondary grade can provide a proper idea of what a student will be able to achieve in life. Students undergo adolescence before their HSC exam and during this phase they are more susceptible to their surroundings. If we could ensure that they receive the proper care and guidance during this time, an effective education will help them pursue a successful career later.
Educational institutions are the main source of data, along with some coaching centres. Our target students are those who have just passed 12th grade. We pre-processed the raw data by applying data balancing, normalization and optimal equal width binning (OWB) for discretization in order to design the predictive model. We also applied three different dimensionality reduction algorithms, which reduce the number of attributes used during training. Since we used several pre-processing techniques, we composed our comparative study from combinations of different pre-processing models to derive the best data model for our system.

Related work
Neural networks were used by Gedeon and Turner (1993) in a comparative study across several types of networks, including feed-forward and back-propagation networks set up to forecast students' final grades. Another system was developed by Wang and Mitrovic (2002), wherein they used back-propagation and feed-forward neural network techniques to predict the number of mistakes a student would make. The likelihood of being accepted by a university was predicted by Oladokun, Adebanjo, and Charles-Owaba (2008) using a multilayer perceptron topology. Jishan, Rashu, Mahmood, Billah, and Rahman (2015) worked on prediction of educational performance using the optimal width binning technique for discretization of continuous-valued attributes. Yadav, Bharadwaj, and Pal (2012) used students' attendance and seminar, class test and assignment marks as attributes for predicting students' final grade for a semester. Ayers, Nugent, and Dean (2009) used model-based clustering, K-means and other clustering algorithms to understand and visualize the skill level of individual students and tried to group the students according to their skill sets. Pal and Pal (2013) used several classification techniques to cluster students who need special care, such as counselling from the teacher.
In our earlier study (Ahamed, Mahmood, & Rahman, 2016), we designed a predictive model in which we applied several techniques, e.g. one-R and gain ratio, to identify the attributes most highly correlated with the target class, in order to reduce the classification error and address the overfitting problem in prediction. We used multi-interval discretization as the pre-processing technique to discretize our data set and then applied Random forest and Naïve Bayes to predict the academic attainment of students. As the data were unbalanced, in a later study (Ahamed, Mahmood, & Rahman, 2017) we applied an advanced balancing technique, used principal component analysis (PCA) on the balanced data and reported the performance of the classifiers. Due to the continuous nature of our target class, we had to discretize it before applying machine learning techniques. We extend our work in this study with a comparative study of different dimensionality reduction techniques, namely PCA, the self-organizing map (SOM) and the generalized Hebbian algorithm (GHA). We also inspect our key findings thoroughly by discovering key rules for three categories based on students' academic performance (the first being 'Good').

Designing the survey questionnaire and the collection of data
Designing a predictive model requires a data set that has the essential attributes to predict future academic performance. After careful consultation with relevant experts, we designed a questionnaire containing the attributes most relevant to the academic attainments of the students. Our survey consists of 38 questions, amongst which 5 are psychological questions that increase the likelihood of students answering truthfully. We interviewed students of different economic statuses, cultures and locations to maintain proper sampling and versatility in the data set.
We have surveyed 423 students, amongst whom 8 were excluded from the data set due to missing information.
Socioeconomic factors
Father's and mother's education denotes the social status of a family and is a strong predictor of academic performance (Coleman et al., 1966).
Father's and mother's jobs impact academics of the students (Usaini & Abubakar, 2015). Average monthly family income indicates the economic status of that family, which has strong correlation with academic performance (Adler et al., 1994).
Place of education has a notable effect on academic outcomes (Cheers, 1990). Private tuition significantly augmented the academic performance (Atta et al., 2011).

Psychological factors
Family size denotes the level of involvement of parents in education (Odok, 2013).
Age difference with immediate sibling: birth spacing between children might have a long-term effect on academic performance (Buckles & Munnich, 2012).
Parents' involvement in education: Christian, Morrison, and Bryant (1998) have observed that responsive parenting in education has a notable impact on academic achievements.
Parents' marital status: Hatzichristou (1993) has noted that the marital status of parents has a psychological effect on students, leading to an impact on academic outcomes amongst adolescents.
Absence/death of parent(s) has a sobering effect on mental health and causes impacts on academic outcomes (Jeynes, 2002).
Involvement in physical activity: Keeley and Fox (2009) stated that physical activities help to augment cognitive performance, leading to better academic achievement amongst students.
A motivating teacher helps students to be confident in learning, which plays a significant role in academic achievement (Adeyemo, Oladipipo, & Omisore, 2013).
Motivating parents pass their motivation on to the students, making them confident and helping them become better academic performers (Aatta & Jamil, 2012).
Academic ambition: Schoon and Parsons (2002) have stated that ambition in education can help students to achieve academic attainments.
Time spent on social media: Flad (2010) has observed through a test amongst students that using social media network has an impact on academic performance.
Romantic relation: Halpern, Joyner, Udry, and Suchindran (2000) have indicated that romantic involvement is correlated to academic attainments.
Time spent with friends: a correlation between a student's performance and peers' influence has been proven by Kirk (2000) through specific studies.
Alcohol/drugs consumption: Aertgeerts and Buntinx (2002) have noted that alcohol consumption has an impact on academic outcomes.
Health status of participant: Behrman and Lavy (1998) have noted that the status of health has an impact on the cognitive performance of students.
Health status of family members: Parental involvement (Christian et al., 1998) in the education of a student is notably correlated with their health status.

Academic factors
Early childhood education has a long-term effect on further educational performance throughout a student's life (Johnson, 1996).
Weekly study time: McFadden and Dart (1992) have observed through some studies that the amount of study time has an appreciable impact on academic achievement.
Number of absence(s): Kirby and McElroy (2003) state that a student's attendance is a significant predictor of academic performance.
Extracurricular activities help students gather life skills, which are directly or indirectly correlated to academic attainments (Massoni, 2011).
Accessibility of the Internet helps students to connect with the latest information and can significantly broaden their knowledge, influencing academic performance (Ellore, Niranjan, & Brown, 2014).
SSC (Secondary School Certificate) 10th standard result: Previous academic record has an influence on future academic performance (Anderson, Benjamin, & Fuss, 1994).
HSC result: the result of the HSC (Higher Secondary Certificate) exam, which is the target attribute our model predicts.
Designing the predictive model
Figure 1 shows the overview of the predictive model. The first steps of pre-processing involve data balancing, followed by a comparative study of three different dimensionality reduction techniques to determine the best performer(s) amongst them. Due to the continuous nature of our target class, we had to discretize it. Lastly, normalization was used to ensure an equal contribution from all features in the predictive model. After applying all the pre-processing techniques, we compared three different classification algorithms for predicting the grade of the Higher Secondary result.

Data balancing
SMOTEBoost (Chawla, Lazarevic, Hall, & Bowyer, 2003) is an effective technique that combines SMOTE (Synthetic Minority Oversampling Technique), which fixes skewed data sets by balancing the minority and majority classes, with a boosting procedure. Figures 2 and 3 show the data set before and after using SMOTEBoost.
SMOTEBoost runs a series of T boosting rounds to balance the data set. In each iteration t it calls a weak learner on a distribution D_t, which is altered by the particular training instances.
The distribution is updated by assigning a higher weight to misclassified instances than to correctly classified ones. SMOTE is used in each round of boosting to oversample the minority class for every learner. To obtain the weak hypothesis h_t, the whole weighted training set is provided to the weak learner. At the end, a final hypothesis is computed from a combination of the different hypotheses.
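The interpolation step at the heart of SMOTE can be sketched as follows. This is a minimal illustration of the oversampling idea only, not the SMOTEBoost implementation used in the study; the function name and parameters are ours:

```python
import random

def smote_oversample(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by interpolating between a
    point and one of its k nearest neighbours (SMOTE's core idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class (squared distance)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nn = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nn)))
    return synthetic
```

In SMOTEBoost this oversampling is repeated inside every boosting round before the weak learner is trained.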

Dimensionality reduction
In a data set, two or more different attributes could relate to the target attribute very similarly; so similarly in fact that one increases accuracy and the other creates noise. Sometimes it is observed that some attribute(s) have a very weak or even non-existent relationship with the target attribute. Classifier performance improves by removing unwarranted properties from the high-dimensional data set (Jimenez & Landgrebe, 1998). We have used three algorithms, namely, PCA, GHA and SOM.

Principal component analysis
PCA converts redundant dimensions into a set of mutually uncorrelated attributes using an orthogonal transformation (Jolliffe, 2002). The first attribute generated by PCA has the highest variance and each successive attribute the next highest. Given a data set X, the kth component of an instance x_{(i)} is found as

t_{k(i)} = x_{(i)} \cdot w_{(k)}

PCA also needs to calculate the loading vector w_{(k)}, chosen to capture the highest achievable remaining variance:

w_{(k)} = \arg\max_{\|w\| = 1} \|\hat{X}_{k} w\|^{2}

where \hat{X}_k is the data matrix with the first k − 1 components subtracted. For our PCA setup we used a variance threshold of 0.9: components beyond the point where the accumulated variance exceeds 0.9 are removed.
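The variance-threshold setup described above can be sketched in a few lines. This is an illustrative NumPy version, not the RapidMiner operator used in the study; the function name is ours:

```python
import numpy as np

def pca_reduce(X, var_threshold=0.9):
    """Project X onto the leading principal components that together
    explain up to `var_threshold` of the total variance."""
    Xc = X - X.mean(axis=0)                      # centre each attribute
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenpairs, ascending order
    order = np.argsort(eigvals)[::-1]            # sort by variance, descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cumvar = np.cumsum(eigvals) / eigvals.sum()  # accumulated variance ratio
    k = int(np.searchsorted(cumvar, var_threshold)) + 1  # smallest k reaching threshold
    return Xc @ eigvecs[:, :k]
```

Components past the 0.9 cumulative-variance mark are simply not retained in the projection.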

Self-organizing map
This algorithm implements learning by competition; the use of a neighbourhood approach retains the topological characteristics of the data set. When the data are passed through the network, the Euclidean distance to all weight vectors is calculated. The weight vector that best matches the input is known as the best matching unit (BMU) (Kohonen & Honkela, 2007). The BMU's weights, and those of surrounding neurons in the SOM structure, are moved towards the input vector; the amount of change decreases with time and with distance from the BMU inside the lattice. Equation (3) describes the update of the weight vector W_v(s) of neuron v:

W_{v}(s + 1) = W_{v}(s) + \theta(u, v, s)\, \alpha(s)\, (D(t) - W_{v}(s))

where s is the step index, t is the index into the training sample, u is the index of the BMU for D(t), \alpha(s) is the learning coefficient, D(t) is the input vector and \theta(u, v, s) is the neighbourhood function, which depends on the lattice distance between neurons u and v at step s. With a net size of 25, a starting learning rate of 0.5 and an adaptation radius of 20, SOM trains over 100 cycles on the data set. The learning rate and adaptation radius are reduced in each cycle until they reach their final values of 0.01 and 1, respectively.
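The update rule in Equation (3) can be sketched as below. This is an illustrative version for a small 1-D lattice with a Gaussian neighbourhood (the study used a net size of 25 and adaptation radius of 20; the function name and the 1-D lattice are our simplifications):

```python
import numpy as np

def train_som(data, net_size=5, cycles=100, lr0=0.5, radius0=3.0, seed=0):
    """Train a 1-D self-organizing map: find the BMU for each input and
    pull it (and its lattice neighbours) towards the input vector."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(net_size, data.shape[1]))        # weight vectors
    for s in range(cycles):
        frac = s / cycles
        lr = lr0 * (1 - frac) + 0.01 * frac               # decay learning rate to 0.01
        radius = radius0 * (1 - frac) + 1.0 * frac        # decay radius to 1
        for x in data:
            bmu = np.argmin(((W - x) ** 2).sum(axis=1))   # best matching unit
            dist = np.abs(np.arange(net_size) - bmu)      # lattice distance to BMU
            theta = np.exp(-(dist ** 2) / (2 * radius ** 2))  # neighbourhood function
            W += lr * theta[:, None] * (x - W)            # Equation (3) update
    return W
```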

Generalized Hebbian algorithm
This particular algorithm is based on an acyclic, linear neural network used for unsupervised learning. It involves just one layer of learning based on the weights of the neural nodes, updated by

\Delta w_{ij} = \eta\, y_{i} \left( x_{j} - \sum_{k \le i} w_{kj}\, y_{k} \right)

This weight depends on the correlation between the input attributes and the output attribute (Gorrel, 2006), where w_{ij} is the synaptic weight between the jth input neuron and the ith output neuron, X is the input vector, Y is the output vector and \eta is the learning rate. A simple and predictable balance between the effectiveness of learning and the precision of convergence can be set using just one parameter. We have configured GHA to cap the maximum number of attributes at 15 over 30 iterations with a learning rate of 0.07.
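A single GHA update step in matrix form might look as follows (an illustrative sketch; in the study this would be wrapped in a 30-iteration training loop over the data, and the function name is ours):

```python
import numpy as np

def gha_step(W, x, eta=0.07):
    """One generalized Hebbian algorithm update:
    dW = eta * (y x^T - lower_triangular(y y^T) W),
    which is the matrix form of the per-weight rule above."""
    y = W @ x                                   # output vector
    LT = np.tril(np.outer(y, y))                # lower-triangular part of y y^T
    return W + eta * (np.outer(y, x) - LT @ W)
```

The lower-triangular term makes each output neuron subtract the contributions of the neurons before it, which is what drives the rows of W towards successive principal directions.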

Optimal equal width binning
OWB has been used to discretize the continuous data into intervals of equal width, with error minimization as stated in Kayah (2008). For our system, an optimal bin width is obtained by iterating the search dynamically until the optimal value is found. Using dynamic width values over several iterations, and the Naïve Bayes classifier to compare accuracies, the best number of bins was 8 and the best bin width was 0.5 for our data set.
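The width search can be sketched as below. This is illustrative only: in the study the `score` callback would be Naïve Bayes accuracy on the discretized data, and the function names are ours:

```python
def equal_width_bins(values, width):
    """Discretize continuous values into equal-width bins; returns bin indices."""
    lo = min(values)
    return [int((v - lo) // width) for v in values]

def best_bin_width(values, candidate_widths, score):
    """Iterate over candidate widths and keep the one whose discretization
    scores highest (in the paper, score = classifier accuracy)."""
    best_w, best_s = None, float('-inf')
    for w in candidate_widths:
        s = score(equal_width_bins(values, w))
        if s > best_s:
            best_w, best_s = w, s
    return best_w
```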

Normalization
Our data set contains attributes with different units and scales; therefore it requires normalization, and in our study we applied min-max normalization to bring the attributes into a comparable range. According to Jayalakshmi and Santhakumaran (2011), the min-max method rescales attributes by mapping their values to a new range, typically 0 to 1 or −1 to 1. For the 0-to-1 case, the rescaling is implemented by

x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}

When x_{\max} − x_{\min} = 0 for a feature, that feature holds a constant value in the data set. When a feature with a constant value is found, it is removed immediately as it does not carry any information useful to the learner.
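The rescaling and the constant-feature removal described above amount to the following (a minimal sketch; the function name is ours):

```python
def min_max_normalize(column, new_min=0.0, new_max=1.0):
    """Rescale a feature column to [new_min, new_max]. A constant column
    (max == min) carries no information, so None signals it should be dropped."""
    lo, hi = min(column), max(column)
    if hi - lo == 0:
        return None  # constant feature: remove it
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in column]
```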

Learning algorithms
We have used stratified sampling with a 70:30 split for training and testing. Stratified sampling retains the class distribution in each split, which ensures proper validation after training.
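The 70:30 stratified split can be sketched as follows (illustrative; the study performed this in RapidMiner, and the names here are ours):

```python
import random
from collections import defaultdict

def stratified_split(rows, labels, train_frac=0.7, seed=0):
    """70:30 split that preserves the class distribution: shuffle and
    split each class separately, then pool the parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append((row, label))
    train, test = [], []
    for group in by_class.values():
        rng.shuffle(group)
        cut = round(len(group) * train_frac)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test
```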

Artificial neural network
Neurons discharge complex chemical substances to send information to other neurons across synapses, increasing or decreasing the electrical potential within the receiving neuron; when a certain electrical potential is reached, the neuron fires. This is the behaviour that artificial neurons try to replicate. After setting up the network, the neurons are numbered so that they are distinguishable. The artificial neuron computes the activation

y_{i} = f\!\left( \sum_{j} W_{ji}\, x_{j} - u_{i} \right)

where x_j is the user input or the output of a predecessor neuron, u_i is the threshold and W_{ji} is the weight. We have used a 'feed forward' neural network: the input of each layer is fed from the output generated by the previous layer, and the output of each layer is used as the input of the subsequent layer. One constraint of the 'feed forward' architecture is that a neuron may not connect to any neuron on the same or a prior layer. The layers situated between the input layer (the first layer) and the output layer (the last layer) are called hidden layers (Kar, n.d.).
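A single artificial neuron as described above can be sketched in a few lines. We assume a sigmoid squashing function f here, which the text does not specify; the function name is ours:

```python
import math

def neuron_output(inputs, weights, threshold):
    """Weighted sum of inputs minus the threshold, squashed by a sigmoid:
    the activation of a single artificial neuron."""
    activation = sum(w * x for w, x in zip(weights, inputs)) - threshold
    return 1.0 / (1.0 + math.exp(-activation))
```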

K-nearest neighbour
In K-nearest neighbour (K-NN), all the attributes and the target attribute are mapped into a vector space where every vector represents the N dimensions of the attributes. Finding the number of neighbours (K) is crucial, because a very low K might lead to a noisy classification whereas a high K might lead to overfitting. We have used cross-validation to get an optimized value of K: the data set is divided into V folds of samples; for a fixed K we run K-NN and compute the sum of squared errors on the vth fold; we repeat the process for different values of K over the V segments and pick the K with the lowest cross-validation error (Mullin & Sukthankar, 2002). With the optimized K, Euclidean distance is used as the similarity metric for numeric attributes:

d_{E} = \sqrt{ \sum_{i} (x_{i} - y_{i})^{2} }

where d_E is the Euclidean distance between instances x and y. For categorical attributes, Hamming distance is used:

D_{H} = \sum_{i} \mathbb{1}[x_{i} \ne y_{i}]

where D_H is the Hamming distance between categorical instances x and y. Then a majority vote amongst the K neighbours assigns a classification to the target outcome (Alkhatib, Najadat, Hemeidi, & Ali Shatnawl, 2013). K-NN is applied to our data set as follows:
(1) Determine the appropriate value of K for selecting the neighbours used to classify.
(2) Compute the distance between the training and testing instances.
(3) Sort the samples by their distance.
(4) Predict the HSC result by majority vote of the K nearest neighbours.
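Steps (2)-(4) above, for numeric attributes, can be sketched as follows (a minimal illustration; the function names are ours):

```python
import math
from collections import Counter

def euclidean(x, y):
    """d_E: straight-line distance between two numeric instances."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(train, query, k):
    """Classify `query` by majority vote amongst its k nearest training
    instances; `train` is a list of (features, label) pairs."""
    neighbours = sorted(train, key=lambda t: euclidean(t[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```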

Support vector machine
Support vector machine (SVM) constructs hyperplanes in a multidimensional space to perform classification, separating the class labels into different cases. An iterative training algorithm is used in SVM to construct the optimal hyperplane by iteratively minimizing an error function. We have used c-SVM to train our model, in which the error function is characterized by (Min & Lee, 2005)

\min_{w, b, \xi} \; \frac{1}{2} w^{T} w + C \sum_{i=1}^{N} \xi_{i}

subject to:

y_{i} \left( w^{T} \phi(x_{i}) + b \right) \ge 1 - \xi_{i}, \qquad \xi_{i} \ge 0

where C is the capacity constant, w is the vector of coefficients, b is a constant, \xi_i is the slack parameter that handles non-separable data, i indexes the N training cases, y_i \in \{+1, -1\} are the class labels, x_i is the independent variable and the kernel \phi transforms data from the input space to the feature space. As C grows, errors are penalized more heavily; it is crucial to choose C carefully to avoid the over-fitting problem. We have used the dot kernel in designing our system, which can be characterized as

K(x_{i}, x_{j}) = x_{i} \cdot x_{j}

In Equation (10), the kernel function uses the dot product, so the feature space coincides with the input space.
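The dot kernel and the resulting SVM decision value can be sketched as follows (illustrative only; a real c-SVM solver finds the alphas and b, and all names here are ours):

```python
def dot_kernel(x, z):
    """Linear (dot-product) kernel: K(x, z) = x . z, so the feature
    space is the input space itself."""
    return sum(a * b for a, b in zip(x, z))

def decision(alphas, labels, support_vectors, b, x):
    """SVM decision value: sum_i alpha_i * y_i * K(sv_i, x) + b;
    the predicted class is the sign of this value."""
    return sum(a * y * dot_kernel(sv, x)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b
```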

Implementation of the system
All the training and testing models we developed were built in RapidMiner, software that allows proper integration of all the important tools we need for training and testing different predictive models. SMOTEBoost was implemented in WEKA; the dimensionality reduction algorithms, namely PCA, SOM and GHA, were all implemented in RapidMiner. We used OWB to discretize the target class, followed by normalization, both implemented in RapidMiner. After all the pre-processing, we applied three learning algorithms, namely artificial neural network (ANN), K-NN and SVM, in RapidMiner.

Key findings
In our earlier studies (Ahamed et al., 2016, 2017) we reported some of our initial and general findings. In this research we have made an in-depth analysis by discovering rules for specific groups formed by different intervals of CGPA; in this way, the reasons behind the academic performance of each CGPA group can be discovered. In most cases, students perform similarly in both the HSC and SSC exams. When teachers or parents motivate students, they tend to perform better, especially when they have previous failures. Private tuition helps academically when a student has suffered from previous failures. We observe once again that a family with three or fewer members offers students more attention, resulting in a better HSC result. A further subtree shows that a romantic relationship reduces academic performance for students who go to urban schools, while academic performance improves if the student goes to a rural school.

Other patterns found through our study:
(1) When a student spends a lot of time with friends and goes to an urban school, they perform worse than students who go to rural schools.
(2) When a student without private tuition is part of a family with more than three members, they fail to perform well academically.
(3) A romantic relationship's effect on academics depends on other factors: (a) if parents do not live together, students perform worse; (b) if parents live together, students perform better.
(4) There is a reduction in academic performance for students who go to urban schools.
(5) Academic performance improves if the student goes to a rural school.
(6) Students who want to pursue higher studies generally perform better than those who do not.
(7) Those who do not want to pursue higher studies require a higher number of study hours per week to perform better.
(8) SSC results provide a good estimate of what is to be expected in the HSC exam.

Comparison of dimensionality reduction methods
We have compared the dimensionality reduction techniques PCA, SOM and GHA on each data model. Tables 1-4 contain the performances of these techniques.
No pre-processing
Without any pre-processing, SVM with PCA and with no dimensionality reduction scores accuracies of 77.88% and 76.50%, respectively, as shown in Table 1. K-NN performs the worst for this data model; the problem of overfitting plagues K-NN, capping its accuracy at 50.23%. ANN performs equally well for PCA and for no dimensionality reduction, with an accuracy of 70.05% for each.

SMOTEBoost
As soon as SMOTEBoost is applied, accuracy for ANN increases by 1% for PCA and by almost 8% for GHA, as shown in Table 2. K-NN is again the worst performer here; its only significant accuracy boost, of about 3%, comes from using SOM. For every other model K-NN has lower accuracy. SVM also has reduced accuracy for PCA and for no dimensionality reduction, by about 1% and 7%, respectively. However, SVM gains about 10% higher accuracy for the data model pre-processed using GHA.

SMOTEBoost and optimal width binning
As soon as optimal width binning is applied, the problem of overfitting in K-NN disappears, and this is reflected in its results: it achieves 68.49%, 65.83% and 63.74% for SOM, no dimensionality reduction and PCA, respectively, as shown in Table 3, a much higher accuracy than its previous models. SVM performs well and produces reliable results, with high accuracies of 78.48% for no dimensionality reduction and about 76% for both PCA and GHA. ANN is the best performer for this model, achieving an accuracy of about 85% for no dimensionality reduction with very high precision, recall and F measure, meaning the quality of its predictions is higher.

SMOTEBoost with optimal width binning and normalization
For this data model, K-NN manages to reach the 70% accuracy mark, producing a considerably better model than those it generated with less pre-processed data. SVM performs well again, achieving 75.93% and 75.71% for PCA and no dimensionality reduction, respectively, as shown in Table 4. Here it is observed again that ANN is the best performer, improving by over 2% on the previous model to achieve the highest accuracy so far: 86.11%. It can be inferred that for SOM, reducing the data to a mere two dimensions destroys the integrity and quality of the data set and in turn does not favour any of the learners. The same could be said of GHA, since it uses unsupervised learning in a similar way to SOM, but it performs slightly better because it behaves more like PCA. We discard SOM and GHA because PCA performed significantly better at predicting the HSC result. Thus, we use the models with PCA and without any dimensionality reduction for the remainder of our result analysis.

Deriving the best models
We have derived the best models through various pre-processing techniques and identified the best pre-processing technique for each model; the comparative study is shown in Tables 5 and 6. After performing all the comparisons, we incorporated the best results of all the models in Table 7.

Artificial neural network
ANN achieves accuracies of over 70% for each data model when PCA is applied, as shown in Table 5; with no pre-processing we got 70.05% accuracy, and applying the SMOTEBoost technique to balance our data set along with OWB boosts the accuracy to 77.01%. It achieves its highest accuracy, 77.78%, after pre-processing with SMOTEBoost + OWB + Normalization. From Table 6 it can be observed that ANN performs best without any dimensionality reduction, achieving 70.05% with no pre-processing. When we apply SMOTEBoost + OWB, the accuracy climbs to 84.97%. Lastly, the accuracy rises to 86.11% when normalization is introduced along with SMOTEBoost + OWB.
K-nearest neighbour
Table 5 shows that K-NN is the least reliable at generating the predictive model when used with PCA; however, from the first data model to the last, K-NN has seen a steady increase in accuracy with PCA: its accuracy started at about 30%, rose to 35% after SMOTEBoost and 63% after SMOTEBoost + OWB, and after adding Normalization it reached 70%. This rapid increase in accuracy, and the resolution of K-NN's overfitting problem, shows the effectiveness of our pre-processing techniques. K-NN still performs unimpressively, scoring just 65.83% and 69.31% with SMOTEBoost + OWB only and with SMOTEBoost + OWB + Normalization, respectively, for no dimensionality reduction, as shown in Table 6.

Support vector machine
In Table 5 it can be observed that SVM achieves accuracies of over 70% for each data model when used in conjunction with PCA, achieving its highest accuracy of 77.88% with no pre-processing. Notwithstanding this, its F measure notably increased from 62.69% to 72.86% as we introduced the different pre-processing techniques; the increase in F measure indicates that the quality of the predictions significantly improved. It achieves a high accuracy, topping off at 78.48%, when used with SMOTEBoost + OWB and no dimensionality reduction, as shown in Table 6 (Figure 4). Our study suggests SVM is the second best learner for generating our predictive model; considering that the target class is continuous, accuracies above 70% are very good. SVM performs better with PCA, as shown in Table 7, except for the data models pre-processed by SMOTEBoost + OWB + Normalization and by SMOTEBoost + OWB only, where the differences are 0.22% and 2.91%, respectively. In the other data models SVM shows accuracy improvements with PCA over no dimensionality reduction: 1.38% for no pre-processing and 6.91% for the SMOTEBoost data model (Figure 5).

Conclusion
From the final models in Tables 5 and 6 and the breakdown of the dimensionality reduction comparison in Table 7, it can be observed that ANN performs best, with an accuracy of over 86%. In most cases dimensionality reduction reduces the integrity of the data and in turn reduces accuracy, except in some cases such as SVM without any pre-processing and SVM with SMOTEBoost when used with PCA; from this it can be inferred that the learner performs better because PCA generates orthogonal attributes.
K-NN shows how data pre-processing can help improve the quality of the data; it started off with about 30% accuracy and moved up to 70% as more data pre-processing was applied. While it may not be the best fit to produce a predictive model for this particular study, it demonstrates the importance of data pre-processing when building predictive systems. Parents and the responsible authorities can gather knowledge from the patterns of the HSC performance through our study.
One major limitation of our work is that there is no way to validate the responses from the survey. Another is that the study only targets predicting the overall HSC grade. These limitations provide groundwork for future research: survey response validation and subject-level grade prediction. Knowing a likely grade beforehand could raise a student's awareness of the relevant course(s) and so improve their chance of performing better in them. Survey response validation is trickier to solve, but if a data set's integrity could be proved, it would provide reliable predictions.

Disclosure statement
No potential conflict of interest was reported by the authors.

Notes on contributors
A. T. M. Shakil Ahamed is working as a part-time Lab Officer at the Electrical and Computer Engineering Department, North South University, Dhaka, Bangladesh. He received his BSc in

Figure 5. The effect of pre-processing on each of the learners without using any dimensionality reduction techniques.