Discussion on the paper ‘A review of distributed statistical inference’

Distributed statistical inference has attracted increasing attention in recent years with the emergence of massive data. We are grateful to the authors for their excellent review of the literature in this active area. Beyond the progress covered by the authors, we would like to discuss some additional developments in this interesting area. Specifically, we focus on the trade-off between communication cost and statistical efficiency of divide-and-conquer (DC) type estimators in linear discriminant analysis and hypothesis testing. The DC approach behaves differently in these problems than it does in estimation problems. Furthermore, we discuss some issues concerning statistical inference under restricted communication budgets.


Linear discriminant analysis
Linear discriminant analysis (LDA) is a classical classification method (Anderson, 2003). For simplicity, we consider the two-sample problem, assuming that
$$X \sim N(\mu_1, \Sigma), \quad Y \sim N(\mu_2, \Sigma),$$
where $\mu_i \in \mathbb{R}^p$, $i = 1, 2$, are the mean vectors with $\mu_1 \neq \mu_2$, and $\Sigma \in \mathbb{R}^{p \times p}$ is the common covariance matrix. Furthermore, assume that observations come either from $X$ with probability $\pi_1$ or from $Y$ with probability $\pi_2$ such that $\pi_1 + \pi_2 = 1$. For a new observation $Z$, Fisher's linear discriminant rule is defined as follows:
$$\psi(Z) = 1\{(Z - \mu_a)^\top \Omega\, \mu_d + \log(\pi_1/\pi_2) > 0\}, \tag{1}$$
where $\mu_a = (\mu_1 + \mu_2)/2$, $\mu_d = \mu_1 - \mu_2$, $\Omega = \Sigma^{-1}$ represents the precision matrix, and $1\{\cdot\}$ is the indicator function. Suppose that $\{X_i, i = 1, \ldots, N_1\}$ and $\{Y_i, i = 1, \ldots, N_2\}$ are independent and identically distributed copies of $X$ and $Y$, respectively. Let $N = N_1 + N_2$ be the total sample size and suppose that $N > p$. For $i = 1, 2$, denote by $\hat\mu_i$ the sample means, and by $\hat\Sigma_1$ and $\hat\Sigma_2$ the sample covariance matrices based on the $X_i$'s and $Y_i$'s, respectively. The estimators of $\mu_a$, $\mu_d$ and $\Omega$ can then be defined respectively as
$$\hat\mu_a = (\hat\mu_1 + \hat\mu_2)/2, \quad \hat\mu_d = \hat\mu_1 - \hat\mu_2, \quad \hat\Omega = \hat\Sigma_{\mathrm{pool}}^{-1},$$
where $\hat\Sigma_{\mathrm{pool}} = (N_1/N)\hat\Sigma_1 + (N_2/N)\hat\Sigma_2$ denotes the pooled sample covariance matrix. The empirical version of $\psi(Z)$, denoted $\hat\psi(Z)$, is obtained by plugging these estimators into (1).
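As a concrete illustration, the following minimal Python sketch implements the plug-in rule $\hat\psi(Z)$ above. The function name and the numpy-based structure are ours; it assumes the pooled covariance is invertible ($N > p$), consistent with the setting of this section.

```python
import numpy as np

def fisher_lda_rule(Z, X, Y, pi1=0.5, pi2=0.5):
    """Plug-in version of Fisher's linear discriminant rule psi(Z).

    Returns 1 if Z is assigned to class 1 (the X population), else 2.
    Assumes the pooled sample covariance is invertible (N > p).
    """
    N1, N2 = X.shape[0], Y.shape[0]
    N = N1 + N2
    mu1_hat, mu2_hat = X.mean(axis=0), Y.mean(axis=0)
    mu_a = (mu1_hat + mu2_hat) / 2.0   # estimated midpoint
    mu_d = mu1_hat - mu2_hat           # estimated mean difference
    # Pooled sample covariance, weighted as in the text: (N1/N)S1 + (N2/N)S2.
    S1 = np.cov(X, rowvar=False)
    S2 = np.cov(Y, rowvar=False)
    S_pool = (N1 / N) * S1 + (N2 / N) * S2
    # Compute Omega_hat @ mu_d via a linear solve, avoiding an explicit inverse.
    omega_mu_d = np.linalg.solve(S_pool, mu_d)
    score = (Z - mu_a) @ omega_mu_d + np.log(pi1 / pi2)
    return 1 if score > 0 else 2
```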
In a distributed setting, one has a central machine (or hub) and many local machines. Suppose that the data are split randomly and evenly, and are stored on $K$ local machines. Denote by $\{X_i^{(k)}, i = 1, \ldots, N_1/K\}$ and $\{Y_i^{(k)}, i = 1, \ldots, N_2/K\}$ the samples from the two classes on the $k$-th local machine, $k = 1, \ldots, K$. Tian and Gu (2017) considered sparse LDA in the high dimensional regime in the case of $\pi_1 = \pi_2 = 1/2$, under the assumption that $\beta = \Omega\mu_d$ is a sparse vector. They proposed a one-shot estimator, which is communication efficient and attains the same convergence rate as the global estimator if $K = O(\sqrt{N/\log p}/\max\{s, s'\})$, where $s$ and $s'$ stand for the sparsity of some parameters. Li and Zhao (2021) considered distributed LDA without a sparsity assumption under the settings where $p/N \to 0$ and $Kp/N \to r \in [0, 1)$. Note that to compute $\hat\Omega = \hat\Sigma_{\mathrm{pool}}^{-1}$ directly, one needs to transfer $p \times p$ matrices to the central machine, whose communication cost can be expensive. Li and Zhao (2021) therefore proposed a two-round estimator and a one-shot estimator, defined as follows.
Denote by $\hat\mu_i^{(k)}$ the estimator of $\mu_i$ based on the data at the $k$-th machine, for $i = 1, 2$ and $k = 1, \ldots, K$. The one-shot estimator uses the following decision rule,
$$\hat\psi_{\mathrm{one}}(Z) = 1\Big\{\frac{1}{K}\sum_{k=1}^{K}(Z - \hat\mu_a^{(k)})^\top \hat\Omega^{(k)}\hat\mu_d^{(k)} + \log(\pi_1/\pi_2) > 0\Big\},$$
where $\hat\Omega^{(k)} = (\hat\Sigma^{(k)}_{\mathrm{pool}})^{-1}$ is the inverse of the pooled sample covariance matrix using the data at the $k$-th machine, $\hat\mu_a^{(k)} = (\hat\mu_1^{(k)} + \hat\mu_2^{(k)})/2$, and $\hat\mu_d^{(k)} = \hat\mu_1^{(k)} - \hat\mu_2^{(k)}$. Note that $\hat\Omega^{(k)}$ and $\hat\mu_i^{(k)}$ can be computed with the data only at the $k$-th machine, and that it is sufficient to transmit the vectors $\hat\Omega^{(k)}\hat\mu_d^{(k)} \in \mathbb{R}^p$ and the scalars $(\hat\mu_a^{(k)})^\top\hat\Omega^{(k)}\hat\mu_d^{(k)}$ for all $k$ to the hub. The two-round estimator is an improved version of $\hat\psi_{\mathrm{one}}(Z)$, which simply replaces the local estimators $\hat\mu_a^{(k)}, \hat\mu_d^{(k)}$ by the global ones $\hat\mu_a, \hat\mu_d$ at the cost of an additional round of communication: by transferring the $\hat\mu_i^{(k)}$'s to the central hub, we can obtain $\hat\mu_i = K^{-1}\sum_{k=1}^{K}\hat\mu_i^{(k)}$ for $i = 1, 2$. Li and Zhao (2021) compared the classification accuracy of the global estimator with those of the distributed ones. They showed that when $K = o(N/p)$, both the two-round estimator and the one-shot estimator can be as good as the global one under mild conditions. Moreover, they found that if $Kp/N \to r \in [0, 1)$ and $\pi_1 \neq \pi_2$, the two-round estimator can be as good as the global one, but the one-shot estimator is inferior to it. This is an interesting result, since when $Kp/N \to r > 0$, $\hat\Sigma^{(k)}_{\mathrm{pool}}$ is not a consistent estimator of $\Sigma$ by random matrix theory. Therefore, at the price of more communication cost, the two-round estimator achieves better statistical efficiency.
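To fix ideas, here is a minimal Python sketch of the two decision rules described above, written for evenly split data. The helper names are ours, and the aggregation steps are a simplification of the procedure in Li and Zhao (2021) rather than their exact implementation; in particular, each machine is assumed to hold more observations than dimensions so that its local pooled covariance is invertible.

```python
import numpy as np

def local_summaries(Xk, Yk):
    """Quantities computable on machine k: local means and Omega_hat^(k)."""
    n1, n2 = Xk.shape[0], Yk.shape[0]
    n = n1 + n2
    mu1, mu2 = Xk.mean(axis=0), Yk.mean(axis=0)
    S_pool = (n1 / n) * np.cov(Xk, rowvar=False) + (n2 / n) * np.cov(Yk, rowvar=False)
    omega = np.linalg.inv(S_pool)  # requires n1 + n2 > p on each machine
    return mu1, mu2, omega

def one_shot_rule(Z, machines, pi1=0.5, pi2=0.5):
    """One-shot rule: average the K local discriminant scores.

    Each machine needs to ship only Omega^(k) mu_d^(k) (a p-vector)
    and one scalar, not a p-by-p matrix."""
    scores = []
    for Xk, Yk in machines:
        mu1, mu2, omega = local_summaries(Xk, Yk)
        mu_a, mu_d = (mu1 + mu2) / 2.0, mu1 - mu2
        scores.append((Z - mu_a) @ omega @ mu_d)
    return 1 if np.mean(scores) + np.log(pi1 / pi2) > 0 else 2

def two_round_rule(Z, machines, pi1=0.5, pi2=0.5):
    """Two-round rule: first aggregate the global means, then reuse Omega^(k)."""
    stats = [local_summaries(Xk, Yk) for Xk, Yk in machines]
    mu1 = np.mean([s[0] for s in stats], axis=0)  # round 1: global means
    mu2 = np.mean([s[1] for s in stats], axis=0)  # (valid for even splits)
    mu_a, mu_d = (mu1 + mu2) / 2.0, mu1 - mu2
    scores = [(Z - mu_a) @ omega @ mu_d for _, _, omega in stats]
    return 1 if np.mean(scores) + np.log(pi1 / pi2) > 0 else 2
```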

Hypothesis testing of the mean vectors
In this section, we discuss the DC approach to the one-sample testing problem in a distributed system. We observe that DC type test statistics always incur a loss of power, in contrast to point estimation, where the DC type estimator can be as good as the global one.
Suppose that $X \in \mathbb{R}^p$ is a random vector with $E(X) = \mu$. For a given vector $\mu_0$, consider the hypothesis testing problem
$$H_0: \mu = \mu_0 \quad \text{versus} \quad H_1: \mu \neq \mu_0.$$
Suppose that $X$ follows the normal distribution $N(\mu, \Sigma)$ with unknown covariance matrix $\Sigma$, and let $\{X_i, i = 1, \ldots, n\}$ be independent and identically distributed copies of $X$. In the setting of $p < n$, the classical test statistic is Hotelling's $T^2$ (Anderson, 2003), defined as follows,
$$T^2 = n(\bar X - \mu_0)^\top\hat\Omega(\bar X - \mu_0),$$
where $\bar X$ denotes the sample mean and $\hat\Omega = \hat\Sigma^{-1}$ with $\hat\Sigma$ being the sample covariance matrix. In high dimensional cases with $p > n$, the sample covariance matrix is singular and the Hotelling $T^2$ statistic is not well defined. Many works have been developed to extend the Hotelling $T^2$ to large or high dimensional regimes (Bai & Saranadasa, 1996; Srivastava & Du, 2008; Wang et al., 2015, etc.). Du and Zhao (2021) considered distributed versions of these test statistics. Specifically, based on the DC approach, they extended the Hotelling $T^2$ statistic to the setting $Kp/n \to r \in [0, 1)$, and the nonparametric test statistic of Wang et al. (2015) to high dimensional settings. The ratio of the communication cost of computing the global test statistic to that of the distributed test statistic is of order $O(p^2)$ in the case of $Kp/n \to r \in [0, 1)$, and $O(p)$ in high dimensional regimes.
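The following Python sketch computes the classical Hotelling $T^2$ statistic together with a naive DC counterpart that simply averages the $K$ local statistics. The averaged form and the function names are our simplification for illustration only; Du and Zhao (2021) derive the proper standardization and limiting distribution for their distributed statistics.

```python
import numpy as np
from scipy import stats

def hotelling_T2(X, mu0):
    """Classical Hotelling T^2: n (Xbar - mu0)' Sigma_hat^{-1} (Xbar - mu0)."""
    n, p = X.shape
    diff = X.mean(axis=0) - mu0
    S = np.cov(X, rowvar=False)
    T2 = n * diff @ np.linalg.solve(S, diff)
    # Under H0 with normal data, (n - p) / (p * (n - 1)) * T^2 ~ F(p, n - p).
    F = (n - p) / (p * (n - 1)) * T2
    pval = stats.f.sf(F, p, n - p)
    return T2, pval

def dc_T2(X_parts, mu0):
    """A naive divide-and-conquer construction: average the K local T^2 values.

    Only one scalar per machine is communicated, versus the O(p^2) entries
    needed to rebuild the global statistic at the hub."""
    return np.mean([hotelling_T2(Xk, mu0)[0] for Xk in X_parts])
```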
They compared the power of the distributed statistics with that of the global ones, showing that the distributed test statistics are less efficient than the global ones whenever $K > 1$. Denote by $\beta_d(n)$ and $\beta_g(n)$ the powers of the distributed and global test statistics as functions of the sample size $n$, respectively, and define the relative efficiency as $n_g/n_d$, where $n_d$ and $n_g$ satisfy $\beta_d(n_d) = \beta_g(n_g)$. The asymptotic relative efficiency of the distributed test statistics is of order $1/\sqrt{K}$; a toy numerical illustration of this power gap is sketched below. Hence, the story of the DC approach in the hypothesis testing problem above is quite different from that of point estimation, where the mean squared error (MSE) of DC estimators can be as good as that of global ones (Lee et al., 2017; Volgushev et al., 2019; Zhang et al., 2013, etc.). On the other hand, Shi et al. (2018) and Banerjee et al. (2019) showed that, in some nonstandard problems where the global estimator converges at a cube-root rate, the DC approach can even exhibit a super-efficiency phenomenon. However, how to handle statistical problems with a restricted communication budget in other settings is an interesting problem for future work. For example, for the hypothesis testing problem discussed in this section, how to design test statistics that achieve good statistical efficiency under a given communication budget needs further investigation.
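To make the power comparison concrete, the toy Monte Carlo sketch below (entirely our own construction, with hypothetical parameter choices) estimates the empirical power of the global $T^2$ statistic and the averaged DC statistic at the same size $\alpha$; in such experiments the DC test typically displays the power loss described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def T2(X):
    """Hotelling T^2 for H0: mu = 0."""
    n, p = X.shape
    d = X.mean(axis=0)
    return n * d @ np.linalg.solve(np.cov(X, rowvar=False), d)

def empirical_power(n=200, p=5, K=4, delta=0.25, B=2000, alpha=0.05):
    """Monte Carlo power of the global T^2 versus the averaged DC T^2.

    Both tests use critical values simulated under H0, so they are
    compared at the same size alpha. Toy model: X_i ~ N(mu, I_p) with
    mu = delta * 1_p / sqrt(p) under the alternative."""
    def stats_pair(shift):
        X = rng.standard_normal((n, p)) + shift
        g = T2(X)                                             # global statistic
        d = np.mean([T2(Xk) for Xk in np.array_split(X, K)])  # DC statistic
        return g, d
    null = np.array([stats_pair(0.0) for _ in range(B)])
    crit = np.quantile(null, 1 - alpha, axis=0)
    alt = np.array([stats_pair(delta / np.sqrt(p)) for _ in range(B)])
    return (alt > crit).mean(axis=0)  # (power of global, power of DC)
```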

Disclosure statement
No potential conflict of interest was reported by the author(s).