An Edgeworth-type expansion for the distribution of a likelihood-based discriminant function

ABSTRACT The exact distribution of a classification function is often too complicated to allow easy numerical calculation of misclassification errors. The use of expansions is one way of dealing with this difficulty. In this paper, approximate probabilities of misclassification of the maximum likelihood-based discriminant function are established via an Edgeworth-type expansion based on the standard normal distribution, for discriminating between two multivariate normal populations.


Introduction
The concept of classification has interested researchers for a long time; see, for example, [1][2][3][4][5]. It consists of developing a classification rule by examining the characteristics of individuals that best distinguish between the populations. There exist several approaches to deriving such a rule, for example a plug-in approach [6] and a likelihood approach [7,8], as well as others.
Let $y_i \sim N_p(\mu_1, \Sigma)$, $i \in \{1, \ldots, n_1\}$, be a sample from population $\pi_1$ collected in $Y = (y_1, \ldots, y_{n_1})$, and let $z_j \sim N_p(\mu_2, \Sigma)$, $j \in \{1, \ldots, n_2\}$, be a sample from population $\pi_2$ collected in $Z = (z_1, \ldots, z_{n_2})$. Let $\bar{y} = \frac{1}{n_1}\sum_{i=1}^{n_1} y_i$ and $\bar{z} = \frac{1}{n_2}\sum_{j=1}^{n_2} z_j$. Assume that an observation $x$ is to be classified. Anderson's [6] classification rule, which he labelled the W-rule, is based on the statistic $W$ given in (1), which is a linear function of $x$; the new observation $x$ is classified to $\pi_1$ if $W \ge 0$ and to $\pi_2$ if $W < 0$. The maximum likelihood-based classification rule, the Z-rule proposed by Kudo [7,9] as an alternative to the W-rule, is based on the statistic $Z$ given in (2), which is a quadratic function of $x$. The rule is to classify $x$ as coming from $\pi_1$ if $Z$ is positive and from $\pi_2$ if $Z$ is negative. Alternatively, Gasana et al. [10] proposed two likelihood-based discriminant functions: $D$, given in (3), for the case when $\Sigma$ is known, and $\widehat{D}$, given in (4), for the case when $\Sigma$ is unknown. If $D \ge 0$, $x$ is classified into $\pi_1$, and if $D < 0$ then $x$ is classified into $\pi_2$; the rule based on $\widehat{D}$ is analogous. Whenever classifying an object, there is a chance that it is wrongly classified. Therefore, in statistics, we are interested in the probability of such an error. Fujikoshi et al. [11] define the probability of misclassification as a measure of the goodness of a classification rule. Generally, misclassification errors involve unknown parameters and thus need to be estimated. Anderson [12], Sitgreaves [13,14] and Wald [15] noted that the distribution of the classification function is too complicated to be used numerically. Several authors, such as Anderson [6,12] and Okamoto [16], have investigated the distribution of the linear discriminant function. One way of handling this problem is to approximate the distribution of the discriminant function using expansions. Okamoto [16] and Memon and Okamoto [17] derived asymptotic expansions for the distribution of the discriminant function up to terms of order $n^{-2}$, $n = n_1 + n_2 - 2$, and [12,18,19] extended the expansions to terms of $O(n^{-3})$; these are asymptotic expansions of misclassification errors for the distribution of the discriminant functions when $n_1$ and $n_2$ grow to infinity and the ratio $n_1/n_2$ approaches a positive limit.
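To make the classical rule concrete, the following sketch implements a plug-in W-rule in the classical form $W(x) = (\bar{y} - \bar{z})' S^{-1}\big(x - \tfrac{1}{2}(\bar{y} + \bar{z})\big)$ with $S$ the pooled covariance estimator. Since the display (1) is not reproduced in this version, this form of $W$, as well as all sample sizes and parameter values below, should be read as illustrative assumptions rather than the paper's exact definitions.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n1, n2 = 3, 30, 40                      # hypothetical dimension and sample sizes
mu1, mu2 = np.zeros(p), np.ones(p)
Sigma = np.eye(p)

# Training samples from the two normal populations pi_1 and pi_2.
Y = rng.multivariate_normal(mu1, Sigma, size=n1)
Z = rng.multivariate_normal(mu2, Sigma, size=n2)
ybar, zbar = Y.mean(axis=0), Z.mean(axis=0)

# Pooled covariance estimator S on n = n1 + n2 - 2 degrees of freedom.
S = ((n1 - 1) * np.cov(Y, rowvar=False)
     + (n2 - 1) * np.cov(Z, rowvar=False)) / (n1 + n2 - 2)

def W(x):
    # Classical linear classification statistic (assumed form of (1)).
    return (ybar - zbar) @ np.linalg.solve(S, x - 0.5 * (ybar + zbar))

x_new = rng.multivariate_normal(mu1, Sigma)
print("assign to pi_1" if W(x_new) >= 0 else "assign to pi_2")
```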
In the early nineteenth century there was a general idea of approximating a frequency function by a series involving an uncomplicated density function; for instance, Laplace [20] used the normal density, Hermite polynomials and their expectations. On the basis of Laplace's results, several other researchers developed density approximations in terms of other quantities, such as moments [21] and cumulants [22][23][24][25][26]; for details, see [27][28][29]. Popular density approximations are the Gram-Charlier expansion introduced by Thiele, the Edgeworth expansion, the saddlepoint approximation and the Cornish-Fisher expansion. The Gram-Charlier and Edgeworth expansions are quite similar but differ in how the terms are organized. In addition, both expansions suffer from the fact that the approximations may not be densities. Gupta and Panchapakesan [30] reviewed Edgeworth expansions in statistics. The concept has attracted applications in many areas of statistics: Ranga Rao [31] obtained Edgeworth expansions and Davis [32] introduced a multivariate Edgeworth expansion. In addition, Fujikoshi et al. [11] investigated the Edgeworth expansion and its validity. Furthermore, Kollo and von Rosen [33] described Edgeworth-type expansions, where a complicated density function is expressed through a simpler one. In this paper, we use an Edgeworth-type expansion to approximate the distribution of the likelihood-based discriminant function through a standard normal distribution and derive approximate probabilities of misclassification.

Proof preparation
Several researchers, such as Anderson [6,12] and Okamoto [16], to mention a few, have investigated the distribution of the linear discriminant function. The exact distributions of the discriminant functions $D$ and $\widehat{D}$, given in (3) and (4), respectively, are very difficult to use for numerical calculations, as mentioned earlier. Therefore, it is practical to get around the problem by providing reasonable approximations of the distributions of these discriminant functions. With the help of the expansions, misclassification probabilities can be obtained. However, it is important to note that an approximation based on an Edgeworth expansion does not have to be a density.

Definition 2.1:
The $j$th-degree Hermite polynomial, $H_j(x, m, \sigma^2)$, for the mean $m$ and the variance $\sigma^2 > 0$, is given by

$$H_j(x, m, \sigma^2) = (-1)^j \frac{1}{f_x(x)} \frac{d^j}{dx^j} f_x(x),$$

where $f_x(x)$ is the density function of the normal distribution $N(m, \sigma^2)$, i.e. $f_x(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\big(-\frac{(x-m)^2}{2\sigma^2}\big)$.
Theorem 2.2: Consider the case of a standard normal distribution, $N(0, 1)$. The $j$th univariate Hermite polynomials $H_j(x, 1)$, $j \in \{0, 1, 2, 3\}$, equal

$$H_0(x, 1) = 1, \quad H_1(x, 1) = x, \quad H_2(x, 1) = x^2 - 1, \quad H_3(x, 1) = x^3 - 3x.$$

The Edgeworth-type expansion of a density presented by Kollo and von Rosen [33] is shown in the following theorem.

Theorem 2.3: Let $d$ be a random variable with finite first three cumulants $c_i[d]$, $i \in \{1, 2, 3\}$. Then its density $f_d(x)$ can be approximated through the density $f_N(x)$ of the standard normal distribution, $N(0, 1)$, by the Edgeworth-type expansion

$$f_d(x) \approx f_N(x)\Big(1 + c_1[d]H_1(x, 1) + \tfrac{1}{2}\big(c_2[d] - 1 + c_1[d]^2\big)H_2(x, 1) + \tfrac{1}{6}\big(c_3[d] + 3c_1[d](c_2[d] - 1) + c_1[d]^3\big)H_3(x, 1)\Big).$$

Theorem 2.4: Suppose the term involving the third cumulant is omitted and let $d$ be a random variable with finite first two cumulants. Then its density $f_d(x)$ can be approximated through the density $f_N(x)$ of the standard normal distribution, $N(0, 1)$, by the Edgeworth-type expansion

$$f_d(x) \approx f_N(x)\Big(1 + c_1[d]H_1(x, 1) + \tfrac{1}{2}\nu_2 H_2(x, 1) + \tfrac{1}{6}\nu_3 H_3(x, 1)\Big), \qquad (8)$$

where $\nu_2 = c_2[d] - 1 + c_1[d]^2$ and $\nu_3 = 3c_1[d](c_2[d] - 1) + c_1[d]^3$.

Proof: Inserting the Hermite polynomials $H_i(x, 1)$, $i \in \{1, 2, 3\}$, and the density of $N(0, 1)$ by their values, (8) can be rewritten as

$$f_d(x) \approx f_N(x)\big(a_0 + a_1 x + a_2 x^2 + a_3 x^3\big), \qquad (9)$$

where

$$a_0 = 1 - \tfrac{1}{2}\nu_2, \quad a_1 = c_1[d] - \tfrac{1}{2}\nu_3, \quad a_2 = \tfrac{1}{2}\nu_2, \quad a_3 = \tfrac{1}{6}\nu_3. \qquad (10)$$

Therefore, $f_d(x)$ is approximated by a polynomial function in $x$, weighted by the standard normal density, whose coefficients are expressed in terms of the first and second cumulants of $d$. Gasana et al. [10] found expressions for the first two cumulants of the discriminant functions. Hence, an Edgeworth-type expansion of the discriminant function, based on Theorem 2.4, can be expressed up to three terms when the term including $c_3[\cdot]$ is excluded.

Theorem 2.5: Let $D$ be defined in (3). Then, if $x \in \pi_1$, the first two cumulants of $D$ are those derived by Gasana et al. [10].

Theorem 2.6: Consider the discriminant function $\widehat{D}$ in (4). If $x \in \pi_1$, the expected value and variance of the discriminant function are those derived by Gasana et al. [10]; they are expressed in terms of the Mahalanobis squared distance $\Delta^2$ given in (13),

$$\Delta^2 = (\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2). \qquad (13)$$

Proofs of Theorems 2.5 and 2.6 are given by Gasana et al. [10].
Since the discriminant function is one-dimensional, the Edgeworth-type expansions of the densities of $D$ and $\widehat{D}$ through the density of $N(0, 1)$ in the next sections will take the form of (9), where $a_i$, $i \in \{0, 1, 2, 3\}$, will be expressed in terms of the expectation and variance of the corresponding discriminant function, as in (10).
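As a quick illustration of how the polynomial form (9) with coefficients (10) arises from Theorem 2.4, the sketch below evaluates the Hermite-polynomial form and verifies numerically that it coincides with the polynomial form. The cumulant values are illustrative and do not come from the paper.

```python
import numpy as np

def hermite(j, x):
    # Univariate Hermite polynomials H_j(x, 1) of Theorem 2.2.
    return [np.ones_like(x), x, x**2 - 1, x**3 - 3*x][j]

def f_d(x, c1, c2):
    # Theorem 2.4 with the third cumulant omitted:
    # f_N(x) * (1 + c1 H_1 + nu2/2 H_2 + nu3/6 H_3), as in (8).
    nu2 = c2 - 1 + c1**2           # second moment difference from N(0, 1)
    nu3 = 3*c1*(c2 - 1) + c1**3    # third term, c_3[d] set to zero
    phi = np.exp(-x**2 / 2) / np.sqrt(2*np.pi)
    return phi * (1 + c1*hermite(1, x) + nu2/2*hermite(2, x) + nu3/6*hermite(3, x))

def coeffs(c1, c2):
    # Collecting powers of x gives the coefficients a_0, ..., a_3 of (10).
    nu2, nu3 = c2 - 1 + c1**2, 3*c1*(c2 - 1) + c1**3
    return 1 - nu2/2, c1 - nu3/2, nu2/2, nu3/6

x = np.linspace(-4, 4, 9)
a0, a1, a2, a3 = coeffs(0.2, 1.1)             # illustrative cumulants c_1[d], c_2[d]
phi = np.exp(-x**2 / 2) / np.sqrt(2*np.pi)
print(np.allclose(f_d(x, 0.2, 1.1), phi*(a0 + a1*x + a2*x**2 + a3*x**3)))  # True
```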
Theorem 2.7: Let $D$ be defined in (3) and $\Delta^2$ in (13). For large $n_1$ and $n_2$, $D$ is approximately normally distributed. More precisely, as $n_i \to \infty$, $i \in \{1, 2\}$, then, for $x \in \pi_1$,

$$D \xrightarrow{d} N\big(\tfrac{1}{2}\Delta^2, \Delta^2\big).$$

Similarly, for $x \in \pi_2$, $D \xrightarrow{d} N\big(-\tfrac{1}{2}\Delta^2, \Delta^2\big)$, with the same variance as when $x \in \pi_1$.
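A standard consequence of this limiting law is that the rule "assign $x$ to $\pi_1$ iff $D \ge 0$" has limiting misclassification error $P(D < 0 \mid x \in \pi_1) = \Phi(-\Delta/2)$, the benchmark that the expansions below refine. A quick evaluation:

```python
from scipy.stats import norm

# Limiting misclassification error Phi(-Delta/2) implied by Theorem 2.7,
# i.e. P(D < 0 | x in pi_1) with D ~ N(Delta^2/2, Delta^2).
for Delta in (0.5, 1.0, 2.0, 3.0):
    print(f"Delta = {Delta}: error ~ {norm.cdf(-Delta / 2):.4f}")
```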
Theorem 2.8: Let $\widehat{D}$ be given by (4) and $D$ by (3), with $\bar{y}$ the mean of a sample of size $n_1$ from $N_p(\mu_1, \Sigma)$ and $\bar{z}$ the mean of a sample of size $n_2$ from $N_p(\mu_2, \Sigma)$. The limiting distributions of both $D$ and $\widehat{D}$, as $n_i \to \infty$, $i \in \{1, 2\}$, are the same as the limiting distribution given in Theorem 2.7.

Proof: The limiting distribution of $D$ is already established in Theorem 2.7. Moreover, as $n_i \to \infty$, $i \in \{1, 2\}$, $mS^{-1} \to \Sigma^{-1}$. Thus the proof for $\widehat{D}$ follows from the proof of Theorem 2.7.
Hence, it follows that the limiting distributions of both classification functions $D$ and $\widehat{D}$ coincide with the distribution of a linear discriminant function. This result will be used for the standardization of the two classification rules in the next section.

Edgeworth-type expansion of D
The basic properties of the discriminant function $D$ defined in (3), such as its expectation and variance, are given in Theorem 2.5. In the theorem below, an Edgeworth-type expansion of $D$ is presented in terms of its mean and variance via the standard normal density.

Theorem 3.1: Let $D$ be the discriminant function defined in (3), let $\Delta^2$ be as in (13), and let $q$ be given by (12). Then the Edgeworth-type expansion $f_D(x)$, for $x \in \pi_1$, takes the form (9), with coefficients obtained by inserting the cumulants of Theorem 2.5 into (10). When $x \in \pi_2$, the expansion is the same as when $x \in \pi_1$, except that $n_1$ and $n_2$ are interchanged and the coefficients of $x$ and $x^3$ have opposite signs.
The background, according to the results of Theorem 2.7, is that the limiting distribution of $D$ is normal. In the following corollary we evaluate the Edgeworth-type expansion of the standardized discriminant function with respect to its limiting distribution. This means that we consider the standardized variable $(D - \tfrac{1}{2}\Delta^2)/\Delta$ when $x \in \pi_1$, and $(D + \tfrac{1}{2}\Delta^2)/\Delta$ when $x \in \pi_2$.

Corollary 3.2:
Let $D$ be the discriminant function defined in (3), let $\Delta^2$ be as in (13), and let $q$ be given by (12). Then the Edgeworth-type expansion of the standardized variable takes the form (9), with coefficients $a_i$, $i \in \{0, 1, 2, 3\}$, given by (20). When $x \in \pi_2$, we have the same values as when $x \in \pi_1$, except that $n_1$ and $n_2$ are interchanged and the coefficients of $x$ and $x^3$ have opposite signs.
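Although the coefficients (20) are not reproduced here, any expansion of the polynomial form (9) with coefficients (10) integrates to one over the real line, because $a_0 + a_2 = 1$ and the odd powers integrate to zero against $\phi$. A numerical check under illustrative cumulant values:

```python
import numpy as np
from scipy.integrate import quad

c1, c2 = 0.3, 1.4                                  # illustrative cumulants
nu2, nu3 = c2 - 1 + c1**2, 3*c1*(c2 - 1) + c1**3
a = (1 - nu2/2, c1 - nu3/2, nu2/2, nu3/6)          # coefficients as in (10)

def f(x):
    # Edgeworth-type approximation of the form (9).
    phi = np.exp(-x**2 / 2) / np.sqrt(2*np.pi)
    return phi * sum(ai * x**i for i, ai in enumerate(a))

print(quad(f, -np.inf, np.inf)[0])                 # ~ 1.0
```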

Edgeworth-type expansion of $\widehat{D}$
The classification function $\widehat{D}$ given in (4) follows a complicated distribution, since it is a difference between two non-independent non-central $F$-distributed quantities. In this section, an Edgeworth-type expansion of the density of $\widehat{D}$ is provided in terms of its expected value and variance via the standard normal density. Let $f_{\widehat{D}}(x)$ be such an expansion. Then, since $\widehat{D}$ is one-dimensional, $f_{\widehat{D}}$ will be a polynomial function of the form (9). In the following theorem the expansion is presented.

Theorem 3.3: Let $\widehat{D}$ be the discriminant function defined in (4) and let $\Delta^2$ be as in (13). Then, for $m = n_1 + n_2 - p - 3$, $q$ in (12) and $c_0 = \frac{m+p}{m(m-2)(m+1)}$, the Edgeworth-type expansion $f_{\widehat{D}}(x)$, when the observation $x$ is classified into population $\pi_1$, takes the form (9) with coefficients $v_i$, $i \in \{0, 1, 2, 3\}$. Moreover, when $x \in \pi_2$, the coefficients $v_i$, $i \in \{0, 1, 2, 3\}$, in $f_{\widehat{D}}(x)$ are similar to those for $x \in \pi_1$, except that $v_1$ and $v_3$ have opposite signs and $n_1$ and $n_2$ are interchanged.
Proof: Consider the form (9) of the Edgeworth-type expansion, where $\widehat{D}$ is defined in (4) with expectation and variance given by Theorem 2.6. Substituting the cumulants into $a_i$, $i \in \{0, 1, 2, 3\}$, given by (10), yields the result.
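For a feeling of the magnitudes involved in Theorem 3.3, the constants $m$ and $c_0$ can be evaluated directly; the sample sizes and dimension below are hypothetical, chosen only for illustration.

```python
# m = n1 + n2 - p - 3 and c0 = (m + p) / (m (m - 2) (m + 1)) from Theorem 3.3.
n1, n2, p = 25, 25, 5
m = n1 + n2 - p - 3
c0 = (m + p) / (m * (m - 2) * (m + 1))
print(m, c0)   # 42, about 6.5e-4: the correction constant is small for moderate n
```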
Moreover, from Theorem 2.8, the limiting distribution of $\widehat{D}$ coincides with that of $D$. The following theorem presents the standardized Edgeworth-type expansion of $\widehat{D}$.

Theorem 3.4: Let $\widehat{D}$ be the discriminant function defined in (4), let $\Delta^2$ be as in (13), and let $z$ be the standardization of $\widehat{D}$ with respect to its limiting distribution in Theorem 2.8. Then, for $m = n_1 + n_2 - p - 3$, $q$ in (12) and $c_0 = \frac{m+p}{m(m-2)(m+1)}$, the Edgeworth-type expansion $f_z(x)$, when the observation $x$ is classified into population $\pi_1$, takes the form (9) with coefficients given by (24).

Proof: Consider $E[\widehat{D}]$ and $\mathrm{Var}[\widehat{D}]$ given by Theorem 2.6, from which the expected value and variance of $z$ follow for $x \in \pi_1$. Now consider the Edgeworth-type expansion of the form (9). Hence, if $x \in \pi_1$, replacing the expected value and variance of $z$ in $a_i$, $i \in \{0, 1, 2, 3\}$, given by (10), by their values yields the result. Similar calculations for $x \in \pi_2$ conclude the proof.

Misclassification errors
It is not enough to make a decision based on a classification rule alone. In addition to a classification rule, it is important to study probabilities of misclassification in order to know whether a classification approach is reliable. The probability of misclassification is a measure of the goodness of the proposed classification rule, as stated by Fujikoshi et al. [11]. The classification of the new observation $x$ depends on whether the discriminant function is greater than zero or not. Denote by $P(x \to \pi_i)$ the probability that an observation $x$ is classified into population $\pi_i$. Let $i, j \in \{1, 2\}$; then $P(x \to \pi_i \mid x \in \pi_i)$ is the probability of correctly classifying the observation $x$ into population $\pi_i$, and $P(x \to \pi_j \mid x \in \pi_i)$ is the probability that the observation $x$ comes from population $\pi_i$ but is classified as coming from $\pi_j$. In this section, we investigate misclassification errors of the quadratic classifiers $D$ in (3) and $\widehat{D}$ given in (4). The probabilities of misclassification for the classification functions $D$ and $\widehat{D}$ can be estimated by integrating the Edgeworth-type expansions of the discriminant functions through the density of the standard normal distribution. As observed before, the exact distributions of $D$ and $\widehat{D}$ are complicated to calculate. Therefore, one would like to approximate the misclassification errors in order to know how good these classification rules are. According to Theorem 2.8, we take the asymptotic distribution of both $D$ and $\widehat{D}$ to be the normal distribution given in Theorem 2.7.

Lemma 4.1: Let $a_i$, $i \in \{0, 1, 2, 3\}$, be given by (10) and $\Delta^2$ by (13). Let $\Phi(\cdot)$ and $\phi(\cdot)$ be the cumulative distribution function and density of $N(0, 1)$, respectively. For a discriminant function $D$ with density approximated by the expansion (9),

$$P\{D \le u\} = (a_0 + a_2)\Phi(u) - \big(a_1 + a_2 u + a_3(u^2 + 2)\big)\phi(u), \qquad (26)$$

and $P\{D \ge u\}$ is given by the same expression evaluated at $-u$ but with opposite signs on the coefficients $a_1$ and $a_3$, where $D$ can either be $D$ defined in (3) or $\widehat{D}$ defined in (4).

Proof: Consider the Edgeworth-type expansion of the form (9). Then, using integration by parts, the proof follows since

$$\int_{-\infty}^{u} x\phi(x)\,dx = -\phi(u), \quad \int_{-\infty}^{u} x^2\phi(x)\,dx = \Phi(u) - u\phi(u), \quad \int_{-\infty}^{u} x^3\phi(x)\,dx = -(u^2 + 2)\phi(u).$$

Remark 4.1: Note that the values of $a_1$, $a_2$ and $a_3$ depend on whether it is $D$ or $\widehat{D}$ that is approximated.

The classification rule using $D$ is to assign $x$ to $\pi_1$ if $D \ge 0$ and to $\pi_2$ if $D < 0$. The probabilities of misclassification are given by Theorem 4.2 below.

Theorem 4.2: Consider the classification function $D$ given by (3). Let $\Delta^2$ be given by (13), and let $\Phi(\cdot)$ and $\phi(\cdot)$ be the cumulative distribution function and density of $N(0, 1)$, respectively. Then, with $q$ given by (12), the misclassification probabilities $P(x \to \pi_2 \mid x \in \pi_1)$ and $P(x \to \pi_1 \mid x \in \pi_2)$ are given by (27) and (28), respectively. The difference between the two expressions in (27) and (28) is that $n_1$ and $n_2$ are interchanged.
Proof: Consider the misclassification probability in Lemma 4.1 with the constants $a_i$, $i \in \{1, 2, 3\}$, given in (20) and the cumulants of $D$ from Theorem 2.5. If $x \in \pi_1$, inserting these quantities yields the expression in (27). Furthermore, similar calculations when $x \in \pi_2$ conclude the proof.
The classification rule using $\widehat{D}$ is to classify $x$ to $\pi_1$ if $\widehat{D} \ge 0$ and to $\pi_2$ if $\widehat{D} < 0$. The theorem below gives the misclassification errors for this rule.

Theorem 4.3: Consider the classification function $\widehat{D}$ given by (4) and $\Delta^2$ given by (13). Then, for $m = n_1 + n_2 - p - 3$, $q$ in (12) and $c_0 = \frac{m+p}{m(m-2)(m+1)}$, let $\Phi(\cdot)$ and $\phi(\cdot)$ be the cumulative distribution function and density of $N(0, 1)$, respectively. Then the misclassification probabilities $P(x \to \pi_2 \mid x \in \pi_1)$ and $P(x \to \pi_1 \mid x \in \pi_2)$ are obtained from Lemma 4.1 with the coefficients in (24).

Proof: Consider $a_i$, $i \in \{1, 2, 3\}$, given by (24). If $x \in \pi_1$, then Lemma 4.1 gives the required probability; substituting the resulting expression into (26), the proof follows. To prove the corresponding result when $x \in \pi_2$, the same approach as for $x \in \pi_1$ can be used, and the result is established.
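The closed form (26) of Lemma 4.1, used in both proofs above, can be verified against direct quadrature of the polynomial form (9); the coefficients below are illustrative, not the $a_i$ of (20) or (24).

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Verify P{d <= u} = (a0 + a2) Phi(u) - (a1 + a2 u + a3 (u^2 + 2)) phi(u),
# which follows from int x phi = -phi, int x^2 phi = Phi - x phi, and
# int x^3 phi = -(x^2 + 2) phi (integration by parts, as in Lemma 4.1).
a0, a1, a2, a3 = 0.9, 0.2, 0.1, 0.05   # illustrative coefficients
u = -0.7
lhs = quad(lambda x: norm.pdf(x) * (a0 + a1*x + a2*x**2 + a3*x**3), -np.inf, u)[0]
rhs = (a0 + a2)*norm.cdf(u) - (a1 + a2*u + a3*(u**2 + 2))*norm.pdf(u)
print(np.isclose(lhs, rhs))            # True
```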

Conclusion
We have observed that it is difficult to evaluate the exact distributions of $D$ and $\widehat{D}$. Therefore, we handled this problem by deriving Edgeworth-type expansions, from which misclassification probabilities were obtained. Note that an Edgeworth-type expansion does not have to be a density. In our case, the integrals of both Edgeworth-type expansions of the discriminant functions over the entire space equal one, that is,

$$\int_{-\infty}^{\infty} f_D(x)\,dx = \int_{-\infty}^{\infty} f_{\widehat{D}}(x)\,dx = 1.$$

However, we have not shown nonnegativity of the expansions. The Edgeworth-type expansions were performed in terms of the cumulants of the discriminant functions $D$ and $\widehat{D}$ through a standard normal distribution, up to three terms. Each expansion is a polynomial function in $x$, weighted by the standard normal density, of the form (9).
In addition, in order to estimate misclassification probabilities for the classification rules based on whether $D \ge 0$ or $D < 0$, and whether $\widehat{D} \ge 0$ or $\widehat{D} < 0$, we used the Edgeworth-type expansions $f_D(x)$ and $f_{\widehat{D}}(x)$ of the standardized variables, with their limiting distribution given in Theorem 2.8. The misclassification errors from Theorems 4.2 and 4.3 are functions of the Mahalanobis distance. Figure 1 shows the change in misclassification probabilities using $D$ and $\widehat{D}$ with respect to the Mahalanobis distance $\Delta$, where $P(x \to \pi_2 \mid x \in \pi_1)$ is the probability that an observation $x$ is wrongly classified into population $\pi_2$. The curves for the discrimination rules based on $D$ and $\widehat{D}$ are more or less similar.
Now, let us discuss our results in comparison with well-known results. Consider the discriminant functions $W$ given by (1) and $Z$ in (2). Like the discriminant function $D$, they both involve a known covariance matrix. However, the W-rule is a linear function whereas the Z-rule is a quadratic function, similarly to the $D$ and $\widehat{D}$ discriminant functions. Note that when $n_1 = n_2 \equiv n$, $\widehat{D} = -\frac{mn}{n+1}W = mZ$, where $m = 2n - p - 3$. The following two theorems from Anderson [12] give misclassification errors with respect to the $W$ and $Z$ classification functions, respectively.

Theorem 5.1: The asymptotic expansion of the distribution of the standardized $W$, given $x \in \pi_1$, is given by (30); the corresponding probability $P\{\cdot \le u \mid x \in \pi_2\}$ is similar to (30) (only $n_1$ and $n_2$ have to be interchanged).

Theorem 5.2: The asymptotic expansion of the distribution of the standardized $Z$, given $x \in \pi_1$, is given by (31); the corresponding probability $P\{\cdot \le u \mid x \in \pi_2\}$ equals (31) with $n_1$ and $n_2$ interchanged.
Proofs of Theorems 5.1 and 5.2 are provided by Anderson [12]. Figure 2(a,b) displays the change in misclassification probabilities using the rules $D$, $\widehat{D}$, $W$ and $Z$ with respect to $\Delta$. We can see from both figures that the rules based on $D$ and $\widehat{D}$ behave similarly to the $W$- and $Z$-rules. However, Figure 1 also shows that the rule based on $D$ may have a negative misclassification error. Moreover, for small $n$ ($n < 10$), see Figure 2(b), misclassification errors using $\widehat{D}$ can be negative. This is due to the fact that the Edgeworth-type expansion is not always a density. Hence, the misclassification errors are not true probabilities and may take irrelevant values. However, a negative value can be interpreted as the misclassification error being small.
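The negativity observed in the figures is easy to reproduce: for suitable cumulants the cubic factor in (9) is negative on part of the line, so the approximate "density", and hence the approximate error, can dip below zero. A small demonstration with illustrative cumulant values, using the coefficients (10):

```python
import numpy as np

c1, c2 = 1.2, 2.5                                   # illustrative cumulants
nu2, nu3 = c2 - 1 + c1**2, 3*c1*(c2 - 1) + c1**3
poly = np.polynomial.Polynomial([1 - nu2/2, c1 - nu3/2, nu2/2, nu3/6])
xs = np.linspace(-4, 4, 801)
neg = xs[poly(xs) < 0]
print(neg.min(), neg.max())   # range of x where the expansion (9) goes negative
```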
We consider classification meaningful only if the misclassification error is less than 50%. Therefore, it is not meaningful to classify with $D$ when $\Delta < 0.4$. On the other hand, the misclassification error using $\widehat{D}$ is always less than 50%.
