A Novel Minimization Approximation Cost Classification Method to Minimize Misclassification Rate for Dichotomous and Homogeneous Classes

ABSTRACT Dependence of linear discriminant analysis on location and scale weakens its class-prediction performance in the presence of homogeneous covariance matrices for the candidate classes. Further, outlying samples cause the method to suffer from higher misclassification rates. In this study, we propose the minimization approximation cost classification (MACC) method, which accounts for a specific cost function. A theoretical derivation is given for an optimal linear hyperplane that yields maximum separation between the dichotomous groups. Real-life data and simulations were used to validate the method against standard classifiers. Results show that the proposed method is more efficient and outperforms the standard methods when the data are crowded at the class boundaries.


Introduction
The idea of optimizing a cost function has been of interest since the second half of the twentieth century (Gibra, 1967). For instance, engineers in a factory may need to control and optimize the total production cost of goods while maintaining high quantity and quality in a short time (Zavvar Sabegh et al., 2016). Minimizing cost is important and has several applications. For example, in the health sector, misclassifying a person infected with a very contagious disease such as COVID-19 or influenza can be disastrous, as many more people could become infected. Researchers in health science may need to minimize the cost of misclassifying patients, especially when allocating them to wards, so as to minimize undesired outcomes.
Models such as the multilevel logistic model have been used in medicine and engineering to predict class membership (Dey & Raheem, 2016). However, classification when data are crowded around the separating hyperplane remains a major statistical research problem. The popular standard linear discriminant analysis, as well as the quadratic discriminant analysis, is often characterized by high misclassification rates (Young & Raudys, 2004). The dependence of these methods on location and covariance weakens their class-prediction performance under the assumption of homogeneity of the covariance matrix.
Besides, the presence of outlying samples may also render these methods prone to high misclassification rates. Therefore, the major contribution of our study is a suitable cost function that can be used in the classification problem so as to minimize misclassification rates.

Defining the classification problem
The multivariate classification problem involves assigning feature vectors X_i in R^p to one of the group memberships y_i ∈ R. The general form of a linear classification function for binary outcomes is y_i = sign{f(X_i)}, where y_i ∈ {−1, 1} and f(X_i) = θᵀX_i, i = 1, 2, …, N. The linear discriminant analysis (LDA), sometimes called Fisher's approach, is the most basic linear classifier. As indicated in other studies, this method does not require the normality assumption to be satisfied (Liong & Foo, 2013; Tillmanns & Krafft, 2017). Its main assumption is homogeneity of the group covariance matrices, as is the case for two-group classification (Puntanen, 2013). The general idea of the LDA is to construct a linear hyperplane that separates the two groups as much as possible. Suppose we have random vectors X_i (i = 1, 2, …, N), each from one of the two groups y_i ∈ {−1, 1}, with X_1 ~ φ(μ_1, Σ_1) and X_2 ~ φ(μ_2, Σ_2), where φ is any multivariate distribution, not necessarily the normal. We wish to classify each p × 1 data vector X_i into the binary group membership {−1, +1}, where the number of groups is k = 2. The objective is to maximize the standardized squared distance between the projected group means,

    max_θ (θᵀ S_B θ) / (θᵀ S_W θ),

where the maximized ratio is the Fisher–Rao criterion, S_B is the between-class sample covariance matrix, S_X is the total-class sample covariance matrix and S_W is the within-class sample covariance matrix; all these estimates are maximum likelihood estimators. Hence, maximizing the standardized squared distance between groups involves minimizing the within-group sample covariance. In our search for an optimal estimate of θ we adopted the lemma of Johnson and Wichern (see Puntanen, 2013), which uses the extended Cauchy–Schwarz inequality for optimization: for any p × 1 vectors x, d and positive definite matrix B,

    (xᵀd)² / (xᵀBx) ≤ dᵀB⁻¹d,    (1)

with the maximum of the left-hand side attained at x = cB⁻¹d for any scalar c ≠ 0.
After matching the vector x with the right-hand side of Equation (1), we found that B = S_p (the pooled covariance matrix) and d = x̄_1 − x̄_2. Taking x to be a normalized vector gives c = 1.
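To make this concrete, the Fisher direction implied by the extended Cauchy–Schwarz lemma, θ = S_p⁻¹(x̄_1 − x̄_2), can be sketched in a few lines of numpy. This is a minimal illustration on simulated two-group data, not the authors' code; all variable names are ours.

```python
import numpy as np

# Simulate two groups with a common covariance but shifted means.
rng = np.random.default_rng(0)
X1 = rng.normal(loc=0.0, scale=1.0, size=(50, 3))
X2 = rng.normal(loc=1.0, scale=1.0, size=(50, 3))

xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
n1, n2 = len(X1), len(X2)

# Pooled covariance S_p plays the role of B in the lemma.
S_p = ((n1 - 1) * np.cov(X1, rowvar=False)
       + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)

# Maximizer of the Cauchy-Schwarz bound with c = 1: theta = S_p^{-1} d.
theta = np.linalg.solve(S_p, xbar1 - xbar2)

# Classify by projecting onto theta and thresholding at the midpoint.
midpoint = theta @ (xbar1 + xbar2) / 2
pred1 = X1 @ theta > midpoint   # mostly True for group 1
pred2 = X2 @ theta > midpoint   # mostly False for group 2
```

Projecting each sample onto θ and cutting at the midpoint of the projected means recovers the usual two-group linear discriminant rule.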
The new version of this estimated hyperplane results from an iterative method that attempts to minimize the within-group covariance S_W. We used the cost function C(θ) to shrink the data points towards their corresponding centroids and consequently minimize the denominator containing S_W.

A motivating example
The search for another method that minimizes misclassification between groups has been motivated by a number of studies (Croux & Joossens, 2005; Shen et al., 2011; Velilla & Hernndez, 2005; Zhang, 2004). Further motivation came from our exploratory analysis of simulation results from different data distributions, which revealed the effect of dispersion on the MCR. It was observed that as more data concentrate around the boundary (the separating hyperplane), separation becomes very difficult, as seen in Figures 1 and 2. Many data points lie close to the linear separating hyperplane, such as points 10, 17, 34 and 56 from the squares group, which are highly likely to be misclassified. In other words, the risk of losing information when estimating the optimal hyperplane, R(x_i) = E[L(x_i)], is expected to be higher for points near the hyperplane than for the other points in the same group (Mengyi et al., 2012).
Moreover, among the circular points, there are also data points such as 113, 172 and 174 that are highly likely to be misclassified. It is unreasonable, therefore, to fix the same misclassification cost for all data points. This leads to our idea that a suitable cost function built with the MM-principle (Mairal, 2013; Shen et al., 2011; Wang & Zou, 2018) would vary the cost of each data point according to how far it lies from the class mean vector. Hence, we will refer to the misclassification rate of the new method as the minimal cost classification rate (MCCR), while the new method itself will be referred to as the minimization approximation cost classifier (MACC).
Thus, the aim of our study was to develop an optimal separating linear hyperplane using the MM-principle with a cost function (discussed in section 2) that minimizes the misclassification rate (MCR). In the next section we also show how the algorithm for obtaining the updated separating hyperplane from the current one was derived. In section 3, we validate the proposed method by simulating datasets and comparing it with the classical approaches in terms of misclassification rate. Further, real-life datasets were used to assess the proposed method under various train–test schemes: SLDA, BSM, LOOCV and KFCV, all discussed in detail in section 4. Ultimately, simulated datasets were used to explore the asymptotic behaviour of the proposed method and to compare its MCCR to the MCR of SLDA.

Developing the MACC based on the loss function
To achieve the study objectives, we applied a loss function to map values of one or more observed variables onto a real number representing some "cost" associated with the training item in the data (Shen et al., 2011). The total information lost can be represented by the cost function. In fact, the history of minimizing the MCR through loss functions has motivated many researchers, for example, those who obtained optimum estimators of the precision matrix Σ⁻¹ under a quadratic loss function (Mengyi et al., 2012). Cost may be taken as the average of the losses. We explored a quadratic loss function based on (x_i − μ_i)², where μ_i is the expected value of x_i; it measures how much information is lost between each observed value and its prediction. A specific form of this cost function is the mean square error, abbreviated MSE.
MSE = (1/n) Σ_{i=1}^{n} (x_i − μ_i)²,

where μ_i is the corresponding expected value of x_i. In this study we used the quadratic loss function. Therefore, for the linear discriminant analysis (LDA), our cost function is

    C(θ) = Σ_{j=1}^{2} Σ_{i=1}^{n_j} (z_ij − z̄_j)²,

where z_ij = θᵀx_ij is the transformed value of the vector item x_ij in the j-th group and z̄_j = θᵀx̄_j is the transformed value of the mean vector of the j-th group.
The quadratic cost function was chosen because it is easy to show that the total cost for both groups, in terms of the hyperplane θ, can be written as C(θ) = θᵀ(W_1 + W_2)θ, where W_j is the within-group scatter matrix of group j. Therefore, minimizing the total cost requires minimizing the within-class variance of both groups by choosing an optimal value of θ. This mechanism is similar to the Fisher–Rao criterion, which attempts to project the data points towards the centres of their groups, especially when many points are concentrated near the marginal boundaries (Ahn & Marron, 2010). In some studies this is called data piling: projecting high-dimensional data (p > n) into a low dimension, thereby maximizing the marginal distance between groups and pulling the data values concentrated on the boundaries towards the group centres. Our proposed method, by contrast, varies the cost to be minimized according to the location of each data point relative to its class centre. The main difference is that data piling uses a kernel trick, whereas our method uses the cost function; our method therefore works alongside existing classifiers, making it easy to validate against the classical methods of classification.
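As a numerical check on this identity, the sketch below computes the total quadratic cost once directly from the projected deviations, C(θ) = Σ_j Σ_i (z_ij − z̄_j)², and once through the within-group scatter matrices, θᵀ(W_1 + W_2)θ. The data and names are illustrative, not from the paper.

```python
import numpy as np

def total_cost(theta, groups):
    """Quadratic cost: sum over groups of squared deviations of the
    projected points z = theta^T x from the projected group mean."""
    cost = 0.0
    for X in groups:
        z = X @ theta
        cost += np.sum((z - z.mean()) ** 2)
    return cost

rng = np.random.default_rng(1)
X1 = rng.normal(0.0, 1.0, size=(40, 2))
X2 = rng.normal(2.0, 1.0, size=(40, 2))
theta = np.array([1.0, -1.0])

# Same cost via within-group scatter: C(theta) = theta^T (W1 + W2) theta.
W = sum((X - X.mean(axis=0)).T @ (X - X.mean(axis=0)) for X in (X1, X2))
c_direct = total_cost(theta, [X1, X2])
c_scatter = theta @ W @ theta
```

The two quantities agree up to floating-point error, confirming that minimizing C(θ) amounts to shrinking the within-group scatter along θ.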

Overview of the Proposed MACC Method
We applied the majorization-minimization (MM) principle to find an expression for the iteratively updated separating hyperplane θ̃ in terms of the current solution θ, such that θ̃ = f(θ). After some iterations, the updated θ̃ gives an optimum at which the total cost C(θ) is minimal. One difficulty is obtaining a single closed form for θ by direct differentiation of the cost function C(θ): direct differentiation does not always lead to a closed form, nor does it always produce the desired solution. Indeed, it can be shown that expressing cost function (7) in terms of θ and differentiating partially yields θ = 0, which is logically impossible. Generally, the MM principle operates in two steps. The first step searches for a majorization function D(θ|θ_k) such that C(θ) ≤ D(θ|θ_k) for any θ ≠ θ_k, with equality at θ = θ_k. The second step differentiates D(θ|θ_k) with respect to θ, sets the derivative to 0, and iteratively finds an expression relating θ̃ and θ (Lange & Wu, 2008; Wang & Zou, 2018). Based on Fisher's approach (Shin, 2008), θ takes the form θ = (β_1, β_2, …, β_p)ᵀ for p predictors. Finding the convex supremum majorization function D(θ̃|θ) for any θ̃ ≠ θ is quite difficult; since our construction proceeds by approximation, the principle may here be called an approximation-minimization principle. The majorizer is reached algebraically through a quadratic convex approximation of the cost function (3), such that for any θ̃ ≠ θ,

    f(h(θ̃)) ≤ f(h(θ)) + f′(h(θ))[h(θ̃) − h(θ)] + (δ/2)[h(θ̃) − h(θ)]²,

where f(·) is the cost function and h(θ) is the corresponding value of the linear classification function with unknown parameter θ. The right-hand side of this inequality is a Taylor-series approximation (Wu et al., 2019). In addition, θ = (β̂_1, …, β̂_p)ᵀ (p × 1) is the current solution and θ̃ = (β_1, β_2, …, β_p)ᵀ is the updated solution (Mairal, 2013; Wang & Zou, 2018).
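A minimal, generic illustration of the two MM steps, on a problem unrelated to the paper's cost, may help fix ideas: estimating a median by majorizing the absolute loss with a quadratic. The data and tolerances are our own choices.

```python
import numpy as np

# Minimize C(t) = sum_i |x_i - t|, whose minimizer is the median of x.
# Step 1 (majorize): |x_i - t| <= (x_i - t)^2 / (2|x_i - t_k|) + |x_i - t_k|/2,
#   a quadratic D(t | t_k) that touches C at the current iterate t_k.
# Step 2 (minimize): setting dD/dt = 0 gives a weighted-mean update.
x = np.array([1.0, 2.0, 7.0])
t = x.mean()                                      # initial guess
for _ in range(100):
    w = 1.0 / np.maximum(np.abs(x - t), 1e-12)    # majorizer weights
    t = np.sum(w * x) / np.sum(w)                 # argmin of D(t | t_k)
# t now numerically approximates the median of x (here, 2.0)
```

Each update provably decreases C because the quadratic sits above C everywhere and touches it at the current iterate; the same descent guarantee underlies the MACC iteration.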

Deriving the minimization approximation cost classification (MACC)
Given the data matrix X (n × p) with i-th row x_iᵀ, let b be an n × 1 vector with i-th element f′(θᵀx_i), let θ be the current solution and θ̃ the updated solution. Summing the quadratic approximation over the data gives the majorization function

    D(θ̃|θ) = C(θ) + bᵀX(θ̃ − θ) + (δ/2)(θ̃ − θ)ᵀ(Σ_{i=1}^{n} x_i x_iᵀ)(θ̃ − θ) ≥ C(θ̃).

Now, in order to find the iterative equation for θ̃ in terms of θ, we differentiate the majorization (approximation) function D(θ̃|θ) with respect to θ̃ᵀ and set it equal to 0_{p×1}:

    Xᵀb + δ(Σ_{i=1}^{n} x_i x_iᵀ)(θ̃ − θ) = 0_{p×1}.    (5)

Solving Equation (5) for θ̃ gives Equation (6):

    θ̃ = θ − (1/δ)(Σ_{i=1}^{n} x_i x_iᵀ)⁻¹ Xᵀb.    (6)

RMS: RESEARCH IN MATHEMATICS & STATISTICS
Equation (6) can be iterated a number of times to update the solution θ̃ (the hyperplane) until a desired minimum misclassification rate is reached. A threshold P > 0, the desired minimum misclassification rate, can be set by the researcher; convergence can also be monitored through ‖θ̃ − θ‖², where θ̃ is the updated solution and θ the previous one. Moreover, if overfitting arises, it can be addressed through cross-validation. In Algorithm (1) we show how the estimate of θ can be determined iteratively such that the misclassification rate (MCR) does not exceed P. The proposed method performs well under the assumptions of homogeneity of the groups and nonsingularity of the matrix (Σ_{i=1}^{n} x_i x_iᵀ). Also, using the Taylor approximation in the majorization function and putting the parameter θ in the explicit form of Fisher's approach, θ = (μ_1 − μ_2)ᵀA, gives one property of this majorization function D(θ̃|θ) based on its partial derivatives with respect to the two mean vectors.
Further, we let θ = (μ_1 − μ_2)ᵀA in our majorization function D(θ̃|θ) and take partial derivatives with respect to μ_1 and μ_2 to obtain the expressions in Equations (7) and (12), respectively; consequently,

    ∂D(θ̃|θ)/∂μ_1 ≈ −∂D(θ̃|θ)/∂μ_2.

The last partial differential equation implies that minimizing the majorization function D(θ̃|θ), and consequently the cost function C(θ), leads to a minimum misclassification rate (MCR). Note that the rate of change of D with respect to μ_1 should be approximately the same as the rate of change of D with respect to μ_2, but in the opposite direction, while preserving homogeneity within groups.

A pseudocode for the updated MACC hyperplane θ̃
To illustrate the application of the proposed minimal cost classification rate (MCCR), the pseudocode in Algorithm (1) describes the procedure for updating the hyperplane θ. It is necessary to set the desired misclassification rate P, the sample size n, and the number of iterations, iter, representing the maximum number of iterations allowed for updating the hyperplane θ. Then the parameters α and β are set to positive values so as to control the variances in the covariance matrices αI and βI, where I is an identity matrix. We then simulate n/2 samples for each group and estimate the covariance matrix and population mean (Σ, μ) of both groups from these samples; if a real dataset is available, the simulation step is not required. Next, we suggest testing for homogeneity between the groups, H_0: Σ_1 = Σ_2, as well as for their separation, H_0: μ_1 = μ_2, to ascertain that the meaningful classification and separation required by this method hold. Finally, we update θ at each iterative step τ using Equation (6) to find the minimum misclassification rate (MCCR) corresponding to the optimum θ.
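Since Algorithm (1) is described here only in prose, the following is a hypothetical sketch of its loop structure. The `update_theta` step stands in for Equation (6): we assume an MM update of the form θ̃ = θ − (1/δ)(Σ x_i x_iᵀ)⁻¹Xᵀb with b_i = f′(θᵀx_i), and we substitute a smooth squared-hinge surrogate for f. These are our assumptions, not the paper's exact equations.

```python
import numpy as np

def update_theta(theta, X, y, delta=2.0):
    """Assumed MM step: theta - (1/delta) (X^T X)^{-1} X^T b, with
    b_i the derivative of a squared-hinge surrogate loss (delta = 2
    matches that loss's curvature)."""
    m = y * (X @ theta)                          # margins y_i * theta^T x_i
    b = -2.0 * y * np.maximum(0.0, 1.0 - m)      # b_i = f'(theta^T x_i)
    G = X.T @ X + 1e-8 * np.eye(X.shape[1])      # sum x_i x_i^T, kept nonsingular
    return theta - np.linalg.solve(G, X.T @ b) / delta

def macc_fit(X, y, P=0.05, max_iter=50):
    """Loop structure of Algorithm (1): iterate the update until the
    misclassification rate drops to the desired threshold P."""
    theta = np.zeros(X.shape[1])
    mcr = 1.0
    for _ in range(max_iter):
        theta = update_theta(theta, X, y)
        mcr = float(np.mean(np.sign(X @ theta) != y))
        if mcr <= P:
            break
    return theta, mcr

# Demo on well-separated synthetic data (labels coded -1 / +1).
rng = np.random.default_rng(2)
Xp = rng.normal(+2.0, 0.5, size=(40, 2))
Xn = rng.normal(-2.0, 0.5, size=(40, 2))
X = np.vstack([Xp, Xn])
y = np.concatenate([np.ones(40), -np.ones(40)])
theta_hat, mcr = macc_fit(X, y)
```

On clearly separated groups the loop terminates early because the stopping rule mcr ≤ P fires; on overlapping groups it runs to max_iter and returns the last iterate.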

Validation of the proposed MACC method by monitoring the MCCR
The efficiency of our method is validated by comparing its misclassification rate, the MCCR, against those of four classification approaches: the standard linear discriminant analysis (SLDA), the bootstrap sampling method (BSM), leave-one-out cross-validation (LOOCV) and k-fold cross-validation (KFCV). Performance is assessed through the resulting misclassification rates (MCRs). A brief description of each method follows: (1) The SLDA is Fisher's approach to classification (Puntanen, 2013; Shin, 2008).
(2) In the BSM, samples of a specified size are drawn randomly from the dataset; the linear model is fitted to them and used to predict the group memberships of the remaining unselected samples, from which the MCR is computed. This process is repeated many times and the average MCR is calculated (Shao, 1993).
(3) The LOOCV splits the dataset into two parts: the "train" frame, which contains all samples except one subject, and the "test" frame. The train set is used to fit the linear classifier, which then predicts the group membership of the left-out subject. The process continues until every subject in the data frame has been left out once, giving the final MCR (Xu & Goodacre, 2018).
(4) Under the KFCV, the data are divided into k parts of roughly equal size. It is an extension of the LOOCV in which the test set contains more than one subject. Each fold in turn is treated as the test frame while the remaining folds are used to fit the model; at the end, the k MCRs are averaged (Xu & Goodacre, 2018).
(5) The MCCR is the misclassification rate of the proposed method, which uses the cost function based on the MM-principle.
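The cross-validation schemes in items (3) and (4) can be sketched with a simple pooled-covariance linear classifier standing in for the fitted discriminant functions. All names and data here are illustrative; LOOCV is obtained as the k = n special case of k-fold.

```python
import numpy as np

def lda_predict(Xtr, ytr, Xte):
    """Fit a pooled-covariance linear rule on the training fold and
    predict +1/-1 labels for the test fold."""
    X1, X2 = Xtr[ytr == 1], Xtr[ytr == -1]
    Sp = ((len(X1) - 1) * np.cov(X1, rowvar=False)
          + (len(X2) - 1) * np.cov(X2, rowvar=False)) / (len(Xtr) - 2)
    theta = np.linalg.solve(Sp, X1.mean(axis=0) - X2.mean(axis=0))
    cut = theta @ (X1.mean(axis=0) + X2.mean(axis=0)) / 2
    return np.where(Xte @ theta > cut, 1, -1)

def kfold_mcr(X, y, k=5, seed=0):
    """Average misclassification rate over k folds (KFCV);
    k = len(X) reduces to LOOCV."""
    idx = np.random.default_rng(seed).permutation(len(X))
    mcrs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        mcrs.append(np.mean(lda_predict(X[train], y[train], X[fold]) != y[fold]))
    return float(np.mean(mcrs))

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, (40, 2)), rng.normal(3.0, 1.0, (40, 2))])
y = np.concatenate([np.ones(40), -np.ones(40)])
mcr_kfold = kfold_mcr(X, y, k=5)        # KFCV estimate
mcr_loocv = kfold_mcr(X, y, k=len(X))   # LOOCV as the k = n special case
```

A bootstrap variant (item (2)) would replace the fold split with repeated random draws with replacement and average the resulting MCRs.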

Validation of the MACC method using simulation study
Datasets of different sizes N, with N/2 data values in each group and p = 7 predictors, were generated from two multivariate normal distributions with known parameters μ_1, μ_2, Σ_1 and Σ_2. In each iteration, the covariance matrices were tested for homogeneity using the Box M test, so as to check the validity of linear discrimination. Moreover, the hypothesis on the population mean vectors, H_0: μ_1 = μ_2, was also tested to check for sufficient separation between the groups, as required for meaningful classification. The simulation was conducted with a different seed for each dataset. Table 1 presents the results of these calculations; note that each reported MCR is an average over 100 iterations per dataset.
It can be seen from Table 1 that in most cases the differences in misclassification between the standard classical LDA and our proposed MACC are small. More specifically, when separation between the groups becomes more difficult, as indicated by an increasing p-value, the MACC method performs more efficiently than the standard linear discriminant analysis (LDA).
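A hedged sketch of this simulation design follows: N/2 multivariate normal observations per group, p = 7 predictors, equal covariances, and the MCR averaged over replicates. The helper `simulate_mcr`, the mean shift `delta` and the use of the training MCR are our simplifications; the paper's exact parameter settings are in Table 1.

```python
import numpy as np

def simulate_mcr(delta, N=200, p=7, reps=100, seed=0):
    """Average MCR of the pooled-covariance linear rule for two
    N/2-sized multivariate normal groups whose means differ by
    `delta` in every coordinate (identity covariance in both)."""
    rng = np.random.default_rng(seed)
    mcrs = []
    for _ in range(reps):
        X1 = rng.normal(0.0, 1.0, size=(N // 2, p))
        X2 = rng.normal(delta, 1.0, size=(N // 2, p))
        Sp = (np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)) / 2
        theta = np.linalg.solve(Sp, X1.mean(axis=0) - X2.mean(axis=0))
        cut = theta @ (X1.mean(axis=0) + X2.mean(axis=0)) / 2
        err1 = np.mean(X1 @ theta <= cut)   # group-1 errors
        err2 = np.mean(X2 @ theta > cut)    # group-2 errors
        mcrs.append((err1 + err2) / 2)
    return float(np.mean(mcrs))

# Weaker separation (smaller delta) should raise the misclassification rate.
mcr_far = simulate_mcr(delta=1.5)
mcr_near = simulate_mcr(delta=0.3)
```

Running this confirms the qualitative pattern the table reports: as the mean separation shrinks, the linear rule's MCR climbs towards the overlap-dominated regime.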

Validation of the MACC method using data from real life studies
In this section, we present a validation analysis of the MACC method based on five real-life datasets. These may not necessarily meet the assumptions of the new method (MACC), but they are useful for exploring its performance. The first dataset includes 12 predictors and a sample of n = 872 students. We performed logistic regression to select the 5 most significant predictors for use in the discrimination stage. The group memberships for the discriminant function were on-time or late graduation. Before conducting standard linear classification, equality of the two mean vectors was tested using Hotelling's T² test; a significant difference implied that the two groups could potentially be separated by classification methods. In addition, equality of the covariance matrices, H_0: Σ_1 = Σ_2, was tested using the Box M statistic; a small p-value (p < 0.05) indicated that linear classification was not the appropriate method. Notwithstanding this indication, SLDA was conducted and gave MCR = 16.5%. After 100 iterations using Equation (6), the MACC method's minimal cost classification rate was also MCCR = 16.5%, the same misclassification rate as that of the standard linear classifier.
The second dataset analysed was studentsperformance, with a sample of 604 students. The aim was to predict group membership, success (n_1 = 226) or fail (n_2 = 378), using five predictors. The Box M test gave a high p-value (p = 0.818), so homogeneity of the data values in the two groups was acceptable. On the other hand, the p-value for testing equality of the mean vectors was small (p = 0.0024), indicating a significant difference between μ_1 and μ_2 and hence possible separation between the two groups. The SLDA and MACC methods produced very close misclassification rates, MCR = 38.2% and MCCR = 38.7%.
The third dataset was collected from the Department of Psychology at Sultan Qaboos University Hospital (SQUH). It contained information on eighty patients across five features: age of patient (x_1), gender (x_2), initial weight of patient (x_3), age group (x_4) and drug group (x_5). The response had two levels: overweight, if the weight increased by eight kilograms or more after taking the drug, and no significant difference otherwise. Equality of the covariance matrices was not rejected (p = 0.121), allowing the use of linear classification. The possibility of separation between the groups was also tested (p = 0.3045), reflecting the difficulty of separating these data values, since the group centres lie in approximately the same location. The standard linear discriminant method yielded an MCR of 41.2%; by contrast, the MACC method gave, on average over 100 iterations, a minimal cost classification rate (MCCR) of 38.6%, a substantial improvement for this particular dataset. Table 2 summarises the analysis of these three datasets together with two others, Bullying and Purchased, which were collected via questionnaires administered as mini-projects among students of the College of Nursing and the College of Economics and Political Science, respectively. Because of the marginally significant difference between centroids, and the extremely significant difference between the covariances of the dichotomous groups, the proposed MACC method performed poorly on the Bullying dataset.
Furthermore, using the train–test approach, we validated the efficiency of our proposed MACC method (Xu & Goodacre, 2018). Findings from these analyses continue to confirm the superiority of the proposed MACC method over the standard LDA and QDA, particularly when data points are crowded at the boundaries and neither the covariance differences nor the separation between groups is significant. We provide a detailed discussion in the next section.

Discussion
The aim of our study was to propose a new method to improve the classification performance for data that are often clustered around the linear separating hyperplane. Referring to Table 2, it is clear that the performance of the proposed method varies from one dataset to another: it depends on the degree of overlap between the groups as well as on the significance of differences in their homogeneity (Calabrese, 2014; Naranjo et al., 2019). For instance, for the first validation dataset, grad-students, the classification performance of both methods (SLDA and MACC) was approximately the same; given the low overlap between groups and the significant difference in homogeneity, these results are reasonable. For the second dataset, stud-perform, the performance of SLDA was relatively better than MACC's: marginally significant mean vectors still indicate that not much overlap existed between the groups, so no crowding of data along the boundaries was expected, and consequently little to no contribution of the cost function to minimizing the misclassification rate. On the other hand, applying the proposed MACC method to the drug data improved the MCCR (38.6%) relative to the MCR (41.2%). This was due to the overlap between the groups together with equivalent group covariances, which gave the quadratic cost function more influence over the data points contributing to the estimate of the hyperplane θ. Finally, the poor performance of the MCCR on the fourth dataset, Bullying, can be explained in the same way: marginally significant separation combined with differences in homogeneity.
It has been shown in other studies that splitting datasets into two parts, train and test, can improve the performance of classification methods (Shao, 1993; Xu & Goodacre, 2018). In this section we discuss the effect of splitting on the MCR. The splitting methods considered were the bootstrap splitting method (BSM), the k-fold splitting method (KFSM, k = 5, 10) and leave-one-out cross-validation (LOOCV). We also tested some of them using the chi-square goodness-of-fit statistic and compared their performance by averaging the MCR over a number of iterations; comparing their performance was important for drawing conclusions. We utilized the real-life datasets presented in Table 2 and developed R functions to fit five linear discriminant functions using the three splitting methods for each dataset. This process was repeated 100 times, and the final p-values (averages of 100 p-values) corresponding to each fitted model are presented in Table 3. Although some of these models gave good classification performance, most did not fit the data well, in that the hypothesized goodness of fit was rejected. Further, we applied the proposed MACC method to the five real datasets using the train–test approach with 100 repetitions and calculated the average MCCR, which showed a greater efficiency improvement than the classical LDA. We therefore concluded that using different splitting methods improves neither the MCR nor the goodness of fit. The train–test split (train = 60%, test = 40%) was the relatively most appropriate choice for the MACC to mitigate overfitting.
Furthermore, we compared the effect of increasing crowdedness of data points around the boundaries on misclassification under both methods. To verify this, we simulated 20 distinct two-class datasets from multivariate normal distributions with equal covariance matrices, increasing the centroid separation from one dataset to the next. The resulting relationship is presented in Figure 2. As the separation between groups decreases (the p-value increases), the MCCR decreases; on the other hand, the trend of the blue dots shows that decreasing the separation (greater overlap, larger p-value) yields poorer (higher) misclassification rates for the standard LDA.
The main challenge for any classification problem is the existence of overlap between groups, especially where there is no clear separation, which often results in poor classifier performance (Naranjo et al., 2019; Pires et al., 2020). This phenomenon occurs when the centroids of the two groups are too close to each other, identified by a very large p-value in Hotelling's test. For this reason, we suggest testing the separation of the groups and their homogeneity before using the proposed MACC method.

Conclusion
Our study sought to develop a method, based on the quadratic cost function and the majorization-minimization principle, to improve the classification of data that are concentrated at the boundaries and infused into another group. The proposed method, MACC, has been validated against the standard methods through simulations and real-life data. The findings show that the proposed method gives a minimal misclassification rate compared to the standard classification methods. The method outperforms the linear discriminant analysis for more homogeneous groups, when data are crowded at the boundaries.
To address overfitting, we showed numerically that distinct splitting methods such as the bootstrap and k-fold algorithms did not improve the classification performance of SLDA, whereas reduced misclassification rates were realised with the proposed method. We therefore recommend the proposed MACC method for classification in the presence of group homogeneity.

Public interest statement
There are many real-life applications in which classification is difficult due to similarities between the prior classes. Failure to classify correctly can be dangerous and its cost prohibitive. The misclassification cost could take the form of financial loss, the death of a misdiagnosed patient, or simply sending a student abroad to study a major incompatible with his abilities. One application is correctly classifying a patient as having either influenza or COVID-19 based on signs and symptoms. To overcome this problem, our study introduces a classification method that achieves minimal cost compared to current classifiers.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
The authors received no direct funding for this research.

Notes on contributors
Mubarak Al-Shukeili holds an MSc in Statistics and is currently a final-year PhD student. His area of research is methods for minimizing misclassification rates. He is also interested in research in medical science, mathematical modelling and computational statistics.
Ronald Wesonga holds a PhD in Statistics; he is a professional statistician with vast knowledge, skills and experience gained over years through collaborative networks with other professionals across the world. As a university professor, he has published widely in high-impact journals, inspired many students and groomed junior staff, and is currently enthusiastic about estimation-error minimization as well as creating deeper understanding and new knowledge in data, computing and statistics.