High-dimensional proportionality test of two covariance matrices and its application to gene expression data

With the development of modern science and technology, more and more high-dimensional data appear in the application fields. Since the high dimension can potentially increase the complexity of the covariance structure, comparing the covariance matrices among populations is strongly motivated in high-dimensional data analysis. In this article, we consider the proportionality test of two high-dimensional covariance matrices, where the data dimension is potentially much larger than the sample sizes, or even larger than the squares of the sample sizes. We devise a novel high-dimensional spatial rank test that has much-improved power than many existing popular tests, especially for the data generated from some heavy-tailed distributions. The asymptotic normality of the proposed test statistics is established under the family of elliptically symmetric distributions, which is a more general distribution family than the normal distribution family, including numerous commonly used heavy-tailed distributions. Extensive numerical experiments demonstrate the superiority of the proposed test in terms of both empirical size and power. Then, a real data analysis demonstrates the practicability of the proposed test for high-dimensional gene expression data.


Introduction
High-dimensional data are nowadays more and more common in bioinformatics, material science, astronomy and other application fields, as data collection technology rapidly evolves (Bühlmann & van de Geer, 2011). However, due to limited resources available to replicate observations, the sample sizes are usually much smaller than the dimension, which makes most traditional statistical approaches no longer appropriate. Under such an embarrassing background, scientists in many application fields urgently need powerful approaches to gather the greatest scientific insight from data. Testing equality of the distributions of two populations is a crucial problem in high-dimensional statistics, which is extremely complex and far more challenging than that for fixed-dimensional data. Due to this extreme complexity, it is usually replaced by a simpler problem, i.e. testing equality of some numerical characteristics, such as means and covariances, of the two populations, which is very useful but much easier to implement.
There is already a large number of literature on detecting the difference between the means of two highdimensional populations, such as Bai and Saranadasa (1996), Chen and Qinm (2010), , to name just a few. In contrast, there are much fewer studies on high-dimensional covariance matrix test of two high-dimensional populations. Hence, in this article, we focus on comparing the covariance matrices among two populations, which is strongly motivated for high-dimensional data, as high data dimensions can potentially increase the complexity of the covariance structure (Li & Chen, 2012). In particular, we consider the testing problem of the proportionality of two high-dimensional covariance matrices, which investigates the simplest heteroscedasticity of the population covariance matrices . It is often a preparation procedure before the case-control analysis of genomic data. Let X and Y be two p-dimensional populations with the mean vectors μ 1 , μ 2 and the covariance matrices 1 , 2 , respectively. The proportionality test of two population covariance matrices is formulated as follows: H 0 : 1 = c 2 versus H 1 : 1 = c 2 , (1) where c is an unknown scalar.
To alleviate such difficulties, Xu et al. (2014) proposed to use a pseudo-likelihood ratio test by extending the traditional likelihood ratio test with the statistic p log(p −1 tr(ˆ 1ˆ −1 2 )) − log |ˆ 1ˆ −1 2 |, which allows the dimension to increase proportionally with each sample size; furthermore, Liu et al. (2014) proposed an improved method, which allows the dimension to be larger than one of the sample sizes. In addition, for the special case of c ≡ 1 in (1), Li and Chen (2012) proposed a test statistic As mentioned in Li and Chen (2012), T LC is an unbiased estimation of tr{( 1 − 2 ) 2 }. Despite some progress, there are also drawbacks: first, these methods may have extremely poor performance for heavy-tailed distributions; second, the sample covariance matrices, which need to be inverted in the construction of the test statistic, are singular when the dimension is larger than both of the sample sizes.
To overcome these two drawbacks, more attention has been paid to nonparametric testing methods based on the multivariate sign or rank. Just recently, for testing the proportionality of two high-dimensional covariance matrices, Cheng et al. (2018) proposed to use a test procedure based on the multivariate sign and demonstrated its good performance in high-dimensional data analysis, especially for the heavy-tailed distributions. Recall that for fixed-dimensional data, the multivariate sign and rank are widely used to construct robust tests (Oja, 2010). However, most of these tests cannot be effective for high-dimensional data. Therefore, many researches extend the traditional multivariate sign-or rank-based testing methods to the highdimension data, such as Feng and Sun (2016) and Wang et al. (2015) for one-sample problems;  for two-sample problems; Feng and Liu (2017) and Zou et al. (2014) for sphericity testing problems. These researches clearly demonstrate the advantages of the high-dimensional multivariate sign-or rank-based methods in high-dimensional and heavy-tailed cases.
Unfortunately, due to the bias caused by estimating the location parameters, the test procedure based on the multivariate sign can only allow the dimension to be the squares of the sample sizes at most (Cheng et al., 2018), which makes the test procedure too restrictive for various practical applications, hence greatly affects the validity of the test procedure. For example, in genomic data analysis, genomic data typically carry thousands of dimensions for measurements on the genome, where the dimension can be much larger than the squares of the sample sizes. Therefore, it is very urgent to develop a new method to deal with the proportionality testing problem in (1) for the high-dimensional data, where the dimension is much higher than the squares of the sample sizes. This is the motivation and intention of this article.
The rest of the article is organized as follows. In Section 2, we introduce the proposed high-dimensional spatial rank test and establish its asymptotic normality under the elliptically symmetric populations. Then, we demonstrate the numerical performance of the proposed test in Sections 3, followed by a real data analysis in Section 4. Finally, we conclude this article in Section 5 and relegate the technical proofs to Appendix.

The proposed test
A p-dimensional random vector Z is said to follow an elliptically symmetric distribution, denoted by E p (μ, , F ξ ), if it has the following stochastic representation: where μ is the p-dimensional mean vector, ξ is a nonnegative random variable, F ξ is the cumulative distribution function of ξ , U is independent of ξ and is uniformly distributed on the unit sphere R p and A is a deterministic p × p-dimensional matrix satisfying AA T = with tr( ) = 1. It is known that the covariance matrix and shape matrix of the elliptical symmetric population Z will satisfy the equation = p −1 E(ξ 2 ) .
Let X 1 , . . . , X n 1 and Y 1 , . . . , Y n 2 denote the samples of two p-dimensional random vectors X and Y, which are generated from the two independent elliptically symmetric populations E p (μ 1 , 1 , F ξ 1 ) and E p (μ 2 , 2 , F ξ 2 ), respectively. From Section 3.1 in Magyar and Tyler (2014), it is known that i and § i have the same eigenvectors for each i ∈ {1, 2} under the assumption of elliptically symmetric distribution. Also, from Equation 3.9 in Magyar and Tyler (2014), it is known that when the eigenvalues of the covariance matrices 1 and 2 are proportional, the spatial sign covariance matrices § 1 and § 2 have the same eigenvalues. Theorem 1 in Cheng et al. (2018) showed that when § 1 and § 2 have the same eigenvalues, the eigenvalues of 1 and 2 are proportional. Hence, the hypotheses in (1) are equivalent to the following hypotheses: where are the spatial sign covariance matrices of X, Y, respectively, and U(z) = z z I(z = 0) for each z ∈ R p is the spatial sign function with · denoting the L 2 -norm and I(·) denoting the indicator function. On this ground, Cheng et al. (2018) suggested to use a test statistics based on the square Frobenius norm of § 1 − § 2 , i.e. tr{( § 1 − § 2 ) 2 }.
In deriving the asymptotic properties of T HT , we impose the following two conditions used in Cheng et al. (2018): Note that: (1) Condition (C1) is a commonly used condition in high-dimensional two sample testing problems; (2) Condition (C2) is similar to Condition (A2) in Li and Chen (2012); (3) If all the eigenvalues of 1 and 2 are bounded, Condition (C2) holds.
Remark 2.1: Note that the above Conditions (C1) and (C2) do not contain any restriction on p and n 1 , n 2 , since such restriction is not needed to control the following terms: which have been removed from T HT . That is to say, we remove all the items that include at least one pair of identical vectors, such as and so on. Such type of strategy was previously used in Chen and Qinm (2010). By removing the terms i X T i X i and k Y T k Y k from the test statistic proposed by Chen and Qinm (2010), no restriction on p, n 1 and n 2 is needed.
Under the above two conditions, the limiting null distribution of T HT is given in the following theorem.

Relationship with the test proposed in Cheng et al. (2018)
The proposed spatial rank test seems to be more complex than the existing ones, such as the spatial sign test proposed by Cheng et al. (2018). This is a price that we have to pay for making the proposed method powerful in testing the high-dimensional data, where the data dimension is potentially much larger than the squares of the sample sizes, especially for the data generated from heavy-tailed distributions. Below we will explain the motivation of the proposed method in detail. First, we recall Lemma B.1 in Han and Liu (2018).
By Lemma 2.3, we have that (Oja, 2010). Similarly, suggests that for each of the two populations, the population multivariate Kendall's tau matrix is the same as the spatial sign covariance matrix. As a result, testing equality of the two spatial sign covariance matrices is identical to testing equality of the two population multivariate Kendall's tau matrices. Moreover, it can be seen that the three components of the Frobenius norm of the difference between § 1 and the following equivalent representations: for each i, j, k, l ∈ {1, . . . , n 1 }, where i, j, k, l are not equal to each other; for each i, j ∈ {1, . . . , n 1 } with i = j and each k, l ∈ {1, . . . , n 2 } with k = l. These representations finally enlighten us to construct T HT as that in the above subsection, which is actually a consistent estimator of Unlike the spatial sign covariance matrix, to estimate the multivariate Kendall's tau matrix, it is not necessary to estimate the spatial medians, whose estimators may bring a bias hence strengthens the condition imposed on the dimension p. That is the reason why we propose to use a new test procedure based on the multivariate Kendall's tau matrix rather than the spatial sign covariance matrix. Therefore, the condition imposed on the dimension p can be released to some extent, which makes the proposed test procedure powerful in highdimensional data, even with the dimension much larger than the sample sizes.
In fact, in the spatial sign test proposed by Cheng et al. (2018), to test the equality of the two spatial sign covariance matrices § 1 and § 2 , the test statistic is Here,μ 1 andμ 2 are the spatial median estimators of X and Y, respectively, obtained by using the estimation method proposed in Mottonen and Oja (1995).
δ n 1 ,n 2 = 0, due to the spatial median estimatorsμ 1 andμ 2 (see Lemma 2 in Cheng et al., 2018). To obtain a consistent estimator of the bias δ n 1 ,n 2 , the condition p = O{(n 1 + n 2 ) 2 } was imposed in Cheng et al. (2018), which limits the application of T SS for the high-dimensional data where the dimension is much larger than the squares of sample sizes.

Simulation study
In this section, we will present some numerical results to demonstrate the performance of the proposed test (abbreviated as HT) in high-dimensional cases, in comparison with two existing popular tests, the test proposed by Li and Chen (2012) (abbreviated as LZ) and the spatial sign test proposed by Cheng et al. (2018) (abbreviated as SS). The following three scenarios are considered.
First, to observe the influence of the dimension p to the potential bias of the methods involved, we summarize the results of the mean-standard deviationratio E(T)/ √ var(T) and the variance estimator ratio var(T)/var(T) under the null hypothesis in Table 1 for each T ∈ {T HT , T SS , T LZ } with n 1 = n 2 = 15 and p = 100, 200, 400, 800, 1200, where T LZ is the test statistic proposed in Li and Chen (2012). Since the exact value of E(T) and var(T) are difficult to calculate, we replace them with their Monte-Carlo estimators respectively, using 1000 repeated samplings. Table 1 indicates that SS has worse mean-standard deviation-ratio results than the other two methods in high-dimensional situations, particularly when p > (n 1 + n 2 ) 2 . This is most likely due to the fact that in T SS the bias correction process is limited by the condition that p = O{(n 1 + n 2 ) 2 }. On the other hand, suggested by the variance estimator ratio results of Table 1, the estimated variances of LZ are eventually larger than the real ones, particularly in non-normal situations. In contrast, HT has better performance in these two aspects.
Then, we will compare the performance of the three methods in empirical size and empirical power. Let n 1 = n 2 = 15, 20, 30 and p = 100, 200, 400, 800, 1200. Tables 2-4 summarize the empirical size and power results of the three methods. First, the empirical size results in Tables 2-4, corresponding to the setting of ρ = 0.3, suggest that LZ fails to control the empirical size in the non-normal cases. Moreover, when comparing HT with SS, we find that their performance is very similar, except in the cases where the dimension is comparable to or larger than the squares of the sample sizes, i.e. 1200 > (15 + 15) 2 . In such cases, SS may lose control of the empirical size, which is consistent Table 1. Comparison of the mean-standard deviation-ratio and the variance estimator ratio at the 5% level with n 1 = n 2 = 15 and p = 100, 200, 400, 800, 1200.    with the conclusion made by analysing Table 1. In the above results about the empirical size, in a few cases, the empirical size is slightly larger than 5%, but still within a reasonable range. To comprehensively compare the empirical size and power of the three tests, in Figure 1, we present the receiver operating characteristic curves (ROCs) for the three tests with (n 1 , n 2 , p) = (15, 15, 800). Suggested by Figure 1, these tests have similar performance under the multivariate normal distributions, while under the remaining heavy-tailed distributions, the area under ROC (AUC) of the proposed HT test is larger than the AUCs of its competitors. This further demonstrates the advantages of the proposed test. Next, we consider an alternative structure of the covariance matrices, i.e. i = (a ikl ) for each i ∈ {1, 2}, . . , p} and the remaining entries of i are all zeros. Note that i is the corresponding covariance matrix of x i following the MA(2) model: where z it 's are i.i.d. random variables with mean zero and variance 1 1+2ρ 2 i . Under the null hypothesis, we set ρ 1 = ρ 2 = 0.7, while under the alternative hypothesis, we set ρ 1 = 0.7 and ρ 2 = 0.1 for instance. The other settings are all the same as the above. Tables 5 and 6 report the empirical sizes and power of these three Table 5. Empirical size comparison at the 5% level with the MA(2) covariance matrices with n 1 = n 2 = 15, 20, 30 and p = 100, 200, 400, 800, 1200. n 1 = n 2 = 15 n 1 = n 2 = 20 n 1 = n 2 = 30  100  42  40  40  65  62  56  91  92  67  200  42  40  42  64  63  50  91  92  67  400  43  40  46  64  66  51  92  92  65  800  44  45  46  64  64  51  91  94  65  1200  45  45  46  62  61  54  92  94  68  Multivariate mixture normal distribution  100  43  45  42  64  61  57  90  92  68  200  44  44  44  66  66  56  91  93  73  400  45  48  43  66  68  55  93  93  69  800  43  55  43  65  71  53  92  95  70  1200  43  61  46  62  74  53  92  94  66 methods, respectively. Although Table 6 suggests that the performance of empirical power of the three methods is similar, Table 5 suggests that the abilities of LZ and SS to control the empirical size are weakening much more quickly than HT with the increase of p for fixed n 1 and n 2 , especially when the dimension is comparable to or larger than the squares of the sample sizes. Overall, the comprehensive numerical results suggest that the proposed HT test has obvious advantages in terms of controlling empirical size over the existing two methods. Such gain is especially clear when the original distribution deviates from normality, and when the dimension is larger than the squares of sample sizes.

Application
In this section, we apply the proposed testing method to a gene dataset, which contains the expression of the 2000 genes with the highest minimal intensity across the 62 tissues. Each entry in the dataset is a gene intensity derived using the filtering process proposed in Alon et al. (1999). The dataset was previously studied by Alon et al. (1999), and now can be freely downloaded at the following website: http://genomicspubs.princeton.edu/oncology/affydata/index.html.
Among the 62 tissues, there are 22 normal tissues and 40 tumour colon tissues. We aim to test the hypothesis that the tissues in the tumour group and those in the normal group have the proportional covariance matrices in terms of the expression levels of the 2000 genes, where the dimension 2000 is larger than the squares of the sample sizes, 484 and 1600.
First, the normal distribution was tested for the expression data of each gene, using the Shapiro-Wilk test. The top two panels of Figure 1 present the histograms of the p-values of the normality tests for the tumour group and the normal group, respectively, which indicate that for a large number of genes the expression data are non-normal. In fact, under the significance level of 0.05, the overall rejection rates of all the normality tests are 93.55% and 37.75% for the tumour group and the normal group, respectively. This motivates us to use a nonparametric approach for testing the above hypothesis, which can deal with the high-dimensional data from non-normal distributions.
The bottom two panels of Figure 2 indicate that there exist some genes with very high values of sample mean in terms of expression. We see that the sample means vary largely for each of the two groups and recall that the dimension is larger than the squares of the sample sizes, which raises a concern that using a spatial signbased approach may lead to an uncontrollable bias. Hence, in theory, a spatial rank-based approach is more appropriate for this dataset.
Based on the above reasons, we apply the proposed HT test to this dataset. The test statistic and p-value of the HT test are 4.823 and 0.000, respectively, hence the null hypothesis is rejected, which suggests that the covariance matrix of the gene expression levels of the tumour group is significantly not proportional to that of the normal group. This result can also be intuitively verified by comparing the sample correlation matrices of the two groups. As a convenience and for demonstration purposes, in Figure 3, we only plot the heatmaps of the sample correlation matrices of the two groups as well as the difference of the two matrices using the first 100 genes in the original data. The heatmaps demonstrate that there are some intuitive differences between the two sample correlation matrices, which tends to support our result of rejecting the null hypothesis.

Conclusion
We have proposed the HT test, a new high-dimensional spatial rank test, for the proportionality testing problem of two high-dimensional covariance matrices, which is a high-dimensional extension of Kendall's tau test. It inherits the robustness advantage of the traditional spatial rank-based methods, and also has strong potential in dealing with the high-dimensional data, where the dimension can be potentially much larger than the squares of the sample sizes. We establish the asymptotic distributions of the proposed method rigorously. In comparison with some existing test procedures, the gain in empirical power and empirical size of HT is especially clear in high-dimensional and heavy-tailed data, shown by many numerical evidence. The real data analysis shows the applicability and pertinence of the proposed method to high-dimensional gene expression data.

Funding
(2) for any p × p symmetric matrix W, In Lemma A.2, the first statement has been proved in Section 3.1 of Fang et al. (1990) and the second statement has been proved in Zou et al. (2014). Now, we are ready to present the proof of Theorem 2.2. Then, the proof of Theorem 2.1 can be directly obtained.
Proof of Theorem 2.2: Define Hence we have that E(V T X,i V X,j ) = 0 and E(V T X,i W X,ij ) = 0. According to Lemma 1 in Feng and Liu (2017), we have that E(W T X,ij W X,ij ) → 0 as p goes to infinity and B 1 = 0.5 § 1 {1 + o(1)}. The same goes for W Y,ij and B 2 . On this ground, by Lemma A.1, we have that . As a result, the first part of T HT has the following decomposition: According to Lemma A.2 and the fact that E(W T X,ij W X,ij ) → 0 as p goes to infinity, we similarly have that E(J 2 2 ) = o{p 2 n −3 tr( § 2 1 )} = o(σ 2 n ) and E(J 2 3 ) = o(p 2 n −4 ) = o(σ 2 n ). Using the similar techniques, we can decompose the rest two parts of T HT , hence conclude that = pA n 1 + pB n 2 − 2pC n 1 ,n 2 + o p (σ n ).