A new exact p-value approach for testing variance homogeneity

To test variance homogeneity, various likelihood-ratio based tests such as Bartlett's test have been proposed. The null distributions of these tests have generally been derived asymptotically or approximately. We re-examine the restrictive maximum likelihood ratio (RELR) statistic and suggest a Monte Carlo algorithm to compute its exact null distribution, and hence its p-value. The algorithm is much easier to implement than most existing methods. Simulation studies indicate that the proposed procedure is also superior to its competitors in terms of type I error and power. We analyse an environmental dataset for illustration.


Introduction
Homogeneity of variances among populations or factor levels plays a fundamental role in analysis of variance (ANOVA) and many other statistical analysis approaches. For example, ANOVA inferences are generally only slightly affected by unequal variances if the model contains only fixed factors and has equal or almost equal sample sizes. On the other hand, inference based on ANOVA models with random effects or unequal sample sizes can be substantially affected by inequality of variances. Bartlett (1937) developed a modified likelihood-ratio test and derived its asymptotic distribution, which controls the type I error under the normality assumption. However, its performance in small samples is not attractive, as pointed out by Bishop and Nair (1939) and Hartley (1940). Since then, various efforts have been made to improve Bartlett's test. Representative work includes (Boos & Brownie, 1989; Box, 1953; Brown & Forsythe, 1974; Cochran, 1951; Hartley, 1950; Levene, 1960; Pardo et al., 1997). Recently, it has been recognized that variability itself can be a major issue. For instance, Teschendorff and Widschwendter (2012) argued that in cancer genomics, differential variability can be as important as differential means for predicting disease phenotypes, indicating that understanding heterogeneity can be crucial.
Since the commonly used critical values are based on a chi-squared approximation, many variants relying on large-sample or numerical approximations have been proposed. These tests generally work well in large samples, but they are not exact tests in the frequentist sense.
To obtain an exact (or nearly exact) test of homogeneity of variances under normality, further efforts have been made in several directions. For example, Wu and Wong (2003) provided a critical value approximation through the saddlepoint approximation. Bhandary and Dai (2009) proposed a test (BDT) based on a Bonferroni-type adjustment procedure applied to the ordered p-values. Liu and Xu (2010) proposed a generalized p-value test (GPT) by employing generalized inference (Tian, 2005, 2007; Weerahandi, 2004). Ma et al. (2015) suggested an adjusted Bartlett's test (ABT) on the basis of the equal mean principle. Gokpinar and Gokpinar (2017) re-examined the computational approach test (CAT), which was originally introduced by Pal et al. (2007). Each of these methods has its own merits under certain favourable circumstances. Gokpinar and Gokpinar (2017) compared four tests, Bartlett's test (BAR), BDT, GPT and CAT, in terms of type I error rate and power. They concluded that CAT appears to be more powerful than the other three tests when the group size is small or moderate, and further confirmed that BAR could not maintain the type I error rate and could be conservative in small samples.
In this paper, we develop a practically useful procedure to calculate the null distribution, and hence the p-value, of the restrictive maximum likelihood-ratio (RELR) statistic. In large samples, the procedure shares the desirable statistical properties of the aforementioned Bartlett-type tests. Its small-sample performance is attractive and superior to that of its competitors in most situations. Most importantly, it is very easy to implement and computationally inexpensive in practice.
The paper is organized as follows. Section 2 briefly describes the framework and introduces Bartlett's test. In Section 3, we re-examine the RELR statistic and suggest a Monte Carlo algorithm for computing its p-value. Section 4 presents simulation results to evaluate the small-sample performance of the proposed test and to compare it with some existing methods. In Section 5, we analyse a real dataset with the six tests to illustrate the utility of the proposed test. Section 6 concludes the paper with a discussion.

Framework and Bartlett test
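The test of variance homogeneity can be formulated as
\[
H_0: \sigma_1^2 = \cdots = \sigma_k^2 \quad \text{versus} \quad H_1: \text{not all } \sigma_i^2 \text{ are equal}, \tag{1}
\]
where $\sigma_i^2$ is the variance of the $i$th normal population. Let $X_{i1}, \ldots, X_{in_i}$ be a random sample from the $i$th population, let
\[
\bar{X}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} X_{ij}, \qquad S_i^2 = \frac{1}{n_i - 1}\sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2
\]
be the sample mean and variance of the $i$th population, $i = 1, \ldots, k$, and let $N = \sum_{i=1}^{k} n_i$ be the total sample size. It is well known that the restrictive maximum likelihood-ratio (RELR) test statistic for the hypothesis (1) is
\[
T_n = (N - k)\log\left\{\frac{\sum_{i=1}^{k}(n_i - 1)S_i^2}{N - k}\right\} - \sum_{i=1}^{k}(n_i - 1)\log S_i^2, \tag{2}
\]
and the p-value of the test is given by
\[
p = P_{H_0}\{T_n \ge t\}, \qquad t = (N - k)\log\left\{\frac{\sum_{i=1}^{k}(n_i - 1)s_i^2}{N - k}\right\} - \sum_{i=1}^{k}(n_i - 1)\log s_i^2,
\]
with $s_i^2$ being the observed $S_i^2$ based on the data. Since it is generally impossible to derive the exact distribution of $T_n$, Bartlett (1937) modified $T_n$ to
\[
T_{n,B} = \frac{T_n}{1 + \dfrac{1}{3(k - 1)}\left\{\sum_{i=1}^{k}\dfrac{1}{n_i - 1} - \dfrac{1}{N - k}\right\}}
\]
and showed that $T_{n,B}$ is asymptotically chi-squared with $k - 1$ degrees of freedom as $\min_i n_i \to \infty$, though this approximation is not necessary when $k = 2$ because the corresponding RELR statistic is a monotonic function of the F-ratio. Consequently, the null hypothesis is rejected at significance level $\alpha$ if $T_{n,B} > \chi^2_{k-1,1-\alpha}$, the $(1-\alpha)$ quantile of the chi-squared distribution with $k - 1$ degrees of freedom.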
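As a rough numerical illustration of the two statistics in this section, the following Python sketch computes a RELR-type statistic and its Bartlett-corrected version. The explicit pooled-variance form of $T_n$ with weights $n_i - 1$, the function names `relr_statistic` and `bartlett_corrected`, and the simulated data are illustrative assumptions; `scipy.stats.bartlett` is called only as a cross-check of the corrected statistic.

```python
import numpy as np
from scipy import stats


def relr_statistic(samples):
    """RELR-type statistic (assumed form):
    (N - k) * log(pooled variance) - sum_i (n_i - 1) * log(S_i^2)."""
    df = np.array([len(x) - 1 for x in samples], dtype=float)
    s2 = np.array([np.var(x, ddof=1) for x in samples])
    pooled = np.sum(df * s2) / np.sum(df)
    return float(np.sum(df) * np.log(pooled) - np.sum(df * np.log(s2)))


def bartlett_corrected(samples):
    """Bartlett's statistic T_{n,B} and its chi-squared(k - 1) p-value."""
    k = len(samples)
    df = np.array([len(x) - 1 for x in samples], dtype=float)
    correction = 1.0 + (np.sum(1.0 / df) - 1.0 / np.sum(df)) / (3.0 * (k - 1))
    t_b = relr_statistic(samples) / correction
    return t_b, float(stats.chi2.sf(t_b, k - 1))


# Hypothetical data: three normal samples sharing a common variance.
rng = np.random.default_rng(0)
samples = [rng.normal(size=m) for m in (8, 12, 15)]
t_b, p_b = bartlett_corrected(samples)
ref = stats.bartlett(*samples)  # SciPy's implementation, for comparison
print(t_b, p_b, ref.statistic, ref.pvalue)
```

The statistic depends on the data only through the sample variances and is unchanged when all observations are multiplied by a common constant, which is the property the Monte Carlo approach of the next section relies on.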
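Under normality, $U_i = (n_i - 1)S_i^2/\sigma_i^2$, $i = 1, \ldots, k$, are independent chi-squared random variables with $n_i - 1$ degrees of freedom, so that $T_n$ can be written as
\[
T_n = (N - k)\log\left\{\frac{\sum_{i=1}^{k}\sigma_i^2 U_i}{N - k}\right\} - \sum_{i=1}^{k}(n_i - 1)\log\left\{\frac{\sigma_i^2 U_i}{n_i - 1}\right\}.
\]
This expression motivates us to introduce a new quantity $T_{n,NEW}$, defined as the right-hand side above viewed as a function of the $U_i$ and the $\sigma_i^2$. $T_{n,NEW}$ is not a statistic any more because it contains the unknown $\sigma_i^2$. Under $H_0$, however, the common variance cancels and
\[
T_{n,NEW} = (N - k)\log\left\{\frac{\sum_{i=1}^{k} U_i}{N - k}\right\} - \sum_{i=1}^{k}(n_i - 1)\log\left\{\frac{U_i}{n_i - 1}\right\},
\]
which is independent of all unknown $\sigma_i^2$'s. Therefore, we may derive the distribution of $T_{n,NEW}$, equivalently the distribution of $T_n$, under $H_0$. Consequently, we can calculate the p-value of the test (1) as $p = P_{H_0}\{T_{n,NEW} \ge t\}$. Hence, the power function of the test at significance level $\alpha$ is the probability, under the alternative, that this p-value does not exceed $\alpha$. Since it may not be easy to derive the distribution of $T_{n,NEW}$ in closed form, we calculate the p-value by Monte Carlo simulation instead. Specifically, we generate $M$ independent copies of $T_{n,NEW}$ under $H_0$ and estimate the p-value by the proportion of copies that are at least as large as $t$. When $M$ is large enough, the numerical estimate is sufficiently accurate.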
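The Monte Carlo computation of the p-value can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than a definitive implementation: it takes the RELR statistic in its pooled-variance form with weights $n_i - 1$, simulates the null distribution from independent chi-squared draws with $n_i - 1$ degrees of freedom, and the names `relr_from_variances` and `monte_carlo_pvalue` are hypothetical.

```python
import numpy as np


def relr_from_variances(s2, df):
    """RELR-type statistic from (rows of) sample variances s2 and their df."""
    pooled = np.sum(df * s2, axis=-1) / np.sum(df)
    return np.sum(df) * np.log(pooled) - np.sum(df * np.log(s2), axis=-1)


def monte_carlo_pvalue(samples, M=10000, seed=None):
    """Estimate P_{H0}(T >= t_obs) from M simulated null copies of T."""
    rng = np.random.default_rng(seed)
    df = np.array([len(x) - 1 for x in samples], dtype=float)
    s2 = np.array([np.var(x, ddof=1) for x in samples])
    t_obs = relr_from_variances(s2, df)
    # Under H0 the common variance cancels, so each sample variance can be
    # simulated as U_i / (n_i - 1) with U_i ~ chi-squared(n_i - 1).
    u = rng.chisquare(df, size=(M, len(df)))
    t_sim = relr_from_variances(u / df, df)
    return float(np.mean(t_sim >= t_obs))


# Hypothetical data: the third group has a much larger variance.
rng = np.random.default_rng(1)
data = [rng.normal(scale=s, size=30) for s in (1.0, 1.0, 10.0)]
print(monte_carlo_pvalue(data, M=10000, seed=7))
```

Because the simulated null copies depend only on the sample sizes, the same draws can be reused for any dataset with the same $n_i$, and the Monte Carlo error of the estimate shrinks at the usual rate $1/\sqrt{M}$.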

Simulation studies
In this section, we report simulation results to evaluate the performance of the proposed testing procedure. For comparison, we examine the following tests: Bartlett's test (BAR, Bartlett, 1937), the adjusted Bartlett's test (ABT, Ma et al., 2015), the generalized p-value test (GPT, Liu & Xu, 2010), Bhandary and Dai's test (BDT, Bhandary & Dai, 2009), the computational approach test (CAT, Pal et al., 2007), and the RELR test. The methods are compared in terms of their type I error and power.
The parameter settings of the simulation studies are as follows: (1) The number of samples equals 2, 5, 10, 15, 30, 50; (2) different combinations of group size $k$ and sample sizes $n$ are given in the first two columns of Table 1; (3) we set $\sigma_i^2 \equiv 1$ for $i = 1, \ldots, k$ when calculating the type I errors, and consider various degrees of variance heterogeneity, listed in the following box, for the power comparison.
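A simulation of this kind can be sketched in Python as below. The sketch makes several simplifying assumptions that are not part of the study design above: it uses the pooled-variance form of the RELR statistic, rejects when the statistic exceeds a simulated $(1-\alpha)$ critical value (a shortcut for comparing a Monte Carlo p-value with $\alpha$), and draws the sample variances of normal data directly from scaled chi-squared distributions. The function names and the example settings are hypothetical.

```python
import numpy as np


def relr(s2, df):
    """RELR-type statistic for each row of sample variances s2."""
    pooled = np.sum(df * s2, axis=-1) / np.sum(df)
    return np.sum(df) * np.log(pooled) - np.sum(df * np.log(s2), axis=-1)


def rejection_rate(n, sigma2, alpha=0.05, reps=2000, M=10000, seed=0):
    """Estimated rejection probability: type I error if all sigma2 are
    equal, power otherwise."""
    rng = np.random.default_rng(seed)
    df = np.asarray(n, dtype=float) - 1.0
    sigma2 = np.asarray(sigma2, dtype=float)
    # The null distribution depends only on the sample sizes: simulate once.
    t_null = relr(rng.chisquare(df, size=(M, len(df))) / df, df)
    crit = np.quantile(t_null, 1.0 - alpha)
    # Sample variances of normal data: sigma_i^2 * chi2(n_i - 1) / (n_i - 1).
    s2 = sigma2 * rng.chisquare(df, size=(reps, len(df))) / df
    return float(np.mean(relr(s2, df) > crit))


# Hypothetical settings: k = 3 groups.
print(rejection_rate([5, 5, 5], [1.0, 1.0, 1.0]))      # type I error
print(rejection_rate([20, 20, 20], [1.0, 1.0, 25.0]))  # power
```

Simulating the null reference distribution once per configuration of sample sizes keeps the cost of the study low, since every replicate with the same $n_i$ shares the same critical value.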

Real data example
In this section, we analyse the detrended particulate matter (PM$_{10}$) dataset for Maryland in 1990, using the six tests to investigate the seasonal effect on PM$_{10}$ variability. After removing missing observations, we have 88, 88, 97, and 74 observations in Spring, Summer, Fall, and Winter, respectively. Let $\sigma_i^2$, $i = 1, 2, 3, 4$, be the corresponding variances. The question can then be formulated as the null hypothesis $H_0: \sigma_1^2 = \sigma_2^2 = \sigma_3^2 = \sigma_4^2$. We compute the p-values of the six tests with $M = 10000$. The p-values for BAR, ABT, GPT, CAT and RELR are $p_{BAR} = 0.0505$, $p_{ABT} = 0.0505$, $p_{GPT} = 0.0552$, $p_{CAT} = 0.0567$ and $p_{RELR} = 0.0495$, and BDT fails to reject the null hypothesis at the 5% significance level. Thus all tests except the proposed RELR fail to reject the null hypothesis, and only RELR suggests rejection, although the p-values differ only slightly. In view of our simulation results, we prefer the result based on RELR and conclude that the variances of the four seasons are not homogeneous.

Concluding remarks
In this paper, we have proposed a procedure for calculating the p-value of the restrictive maximum likelihood ratio (RELR) test for variance homogeneity. The procedure is very easy to implement and its performance is promising. Given the optimality of the likelihood ratio principle, we conjecture that the test could be the most efficient one, which warrants further investigation. This paper provides a means to calculate the p-value when it is difficult, if not impossible, to derive the (asymptotic) distribution of the proposed test statistic. However, there is no general guideline for reformulating $T_n$ in (2), so a quantity similar to $T_{n,NEW}$ may have to be derived case by case. Whether the proposed procedure can be applied to high-dimensional situations (in the sense of diverging with the sample size) is unclear and also warrants further research.

Disclosure statement
No potential conflict of interest was reported by the author(s).