A review of distributed statistical inference

The rapid emergence of massive datasets in various fields poses a serious challenge to traditional statistical methods. Meanwhile, it provides opportunities for researchers to develop novel algorithms. Inspired by the idea of divide-and-conquer, various distributed frameworks for statistical estimation and inference have been proposed to deal with large-scale statistical optimization problems. This paper provides a comprehensive review of the related literature, covering parametric models, nonparametric models, and other frequently used models. Their key ideas and theoretical properties are summarized. The trade-off between communication cost and estimation precision, together with other practical concerns, is also discussed.


Introduction
With the rapid development of information technology, datasets of massive size have become increasingly available. E-commerce companies like Amazon have to analyze billions of transaction records for personalized recommendation. Bioinformatics researchers need to locate relevant genes corresponding to a specific phenotype or disease from massive SNP data. For Internet related companies, large amounts of text, image, voice, and even video data are in urgent need of effective analysis. Due to the accelerated growth of data sizes, the computing power and memory of one single computer are no longer sufficient. Constraints on network bandwidth as well as privacy and security considerations also make it difficult to process the whole dataset on one central machine. Accordingly, distributed computing systems have become increasingly popular.
Similar to parallel computing executed on a single computer, distributed computing is closely related to the idea of divide-and-conquer. Simply speaking, for some statistical problems, we can divide a complicated large task into many small pieces so that they can be tackled simultaneously on multiple CPUs or machines, and their outcomes are then aggregated to obtain the final result. It is conceivable that this procedure can save computing time substantially if the algorithm can be executed in a parallel way. The main difference between a traditional parallel computing system and a distributed computing system is the way they access memory. In parallel computing, different processors can share the same memory, so they can exchange information with each other very efficiently. In distributed computing, by contrast, distinct machines are physically separated and are often connected by a network. Accordingly, each machine can only access its own memory directly. Therefore, the inter-machine communication cost in terms of time spent can be significant and thus should be prudently considered.
The rest of this article is organized as follows. Section 2 studies parametric models.
Section 3 focuses on nonparametric methods. Section 4 reviews some other related methods. The article is concluded with a short discussion in Section 5.

Parametric models
Assume a total of $N$ observations denoted as $Z_i = (X_i^\top, Y_i)^\top \in \mathbb{R}^{p+1}$ with $1 \le i \le N$, where $X_i \in \mathbb{R}^p$ is the covariate vector and $Y_i \in \mathbb{R}$ is the corresponding scalar response.
Define $\{P_\theta : \theta \in \Theta\}$ to be a family of statistical models parameterized by $\theta \in \Theta \subset \mathbb{R}^p$.
We further assume that the $Z_i$'s are independent and identically distributed with distribution $P_{\theta^*}$, where $\theta^* = (\theta^*_1, \ldots, \theta^*_p)^\top$ is the true parameter. Consider a distributed setting, where the $N$ sample units are allocated randomly and evenly to $K$ local machines $M_k$, $1 \le k \le K$, such that each machine has $n$ observations; obviously, $N = nK$. Write $S = \{1, \ldots, N\}$ for the index set of the whole sample, and let $S_k$ denote the index set of the local sample on $M_k$, with $S_{k_1} \cap S_{k_2} = \emptyset$ for any $k_1 \ne k_2$. In addition to the local machines, there is a central machine denoted by $M_{\mathrm{center}}$. A standard architecture has $M_{\mathrm{center}}$ connected with every $M_k$.
Let $L : \Theta \times \mathbb{R}^{p+1} \to \mathbb{R}$ be the loss function, and assume that the true parameter $\theta^*$ minimizes the population risk $\mathcal{L}^*(\theta) = E[L(\theta; Z)]$, where $E$ stands for expectation with respect to $P_{\theta^*}$. Define the local loss on the $k$th machine as $\mathcal{L}_k(\theta) = n^{-1}\sum_{i\in S_k} L(\theta; Z_i)$. Correspondingly, define the global loss function based on the whole sample as $\mathcal{L}(\theta) = N^{-1}\sum_{i\in S} L(\theta; Z_i) = K^{-1}\sum_{k=1}^K \mathcal{L}_k(\theta)$, whose minimizer is $\hat\theta = \arg\min_{\theta\in\Theta}\mathcal{L}(\theta)$. In most cases, the whole sample estimator $\hat\theta$ is $\sqrt{N}$-consistent and asymptotically normal (Lehmann and Casella, 2006). If $N$ is small enough so that the whole sample $S$ can be read into the memory of one single computer, then $\hat\theta$ can be easily computed, and the entire computation can be executed in the memory of that computer. On the other hand, if $N$ is so large that the whole sample $S$ cannot be placed on any single computer, then a distributed system must be used. In this case, the whole sample estimator $\hat\theta$ is no longer computable (or at least very difficult to compute) in practice. Then, how to develop novel statistical methods for distributed systems becomes a problem of great importance.
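To make this setup concrete, the following minimal sketch (Python with NumPy; a squared-error loss and simulated data, with all names ours) partitions a sample across K machines and verifies numerically that the global loss equals the average of the K local losses:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, K = 10_000, 5, 10          # whole sample size, dimension, number of machines
n = N // K                        # local sample size, N = n * K
theta_star = rng.normal(size=p)   # true parameter

X = rng.normal(size=(N, p))
Y = X @ theta_star + rng.normal(size=N)

# Randomly and evenly allocate the N observations to the K local machines.
perm = rng.permutation(N)
shards = [perm[k * n:(k + 1) * n] for k in range(K)]

def local_loss(theta, idx):
    """L_k(theta): average squared-error loss over one machine's sample."""
    r = Y[idx] - X[idx] @ theta
    return 0.5 * np.mean(r ** 2)

def global_loss(theta):
    """L(theta): average loss over the whole sample = mean of the K local losses."""
    r = Y - X @ theta
    return 0.5 * np.mean(r ** 2)

theta0 = np.zeros(p)
print(global_loss(theta0))
print(np.mean([local_loss(theta0, idx) for idx in shards]))   # identical up to rounding
```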

One-shot approach
To solve these problems, various methods have been proposed. They can be roughly divided into two classes. The first class contains the so-called one-shot methods, which are reviewed in this subsection. The other class contains various iterative methods, which are reviewed in the next subsection.
The basic idea of the one-shot approach is to calculate some relevant statistics on each local machine. These statistics are subsequently sent to the central machine, where they are assembled into the final estimator. The most popular and direct way of aggregation is simple averaging. Specifically, for each $1 \le k \le K$, machine $M_k$ uses the local sample $S_k$ to compute the local empirical minimizer $\hat\theta_k = \arg\min_{\theta\in\Theta}\mathcal{L}_k(\theta)$.
These local estimates (i.e., the $\hat\theta_k$'s) are then transferred to the central machine $M_{\mathrm{center}}$, where they are averaged as $\bar\theta = K^{-1}\sum_{k=1}^K\hat\theta_k$. This leads to the final simple averaging estimator $\bar\theta$ (see Figure 1(a)).
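As a minimal sketch of this procedure (a least-squares loss is used so that each local minimizer is an ordinary least squares fit; variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, K = 10_000, 5, 10
n = N // K
theta_star = rng.normal(size=p)
X = rng.normal(size=(N, p))
Y = X @ theta_star + rng.normal(size=N)

# Each machine M_k computes its local minimizer (here an OLS fit),
# and the central machine simply averages the K local estimates.
local_estimates = []
for k in range(K):
    Xk, Yk = X[k * n:(k + 1) * n], Y[k * n:(k + 1) * n]
    theta_k, *_ = np.linalg.lstsq(Xk, Yk, rcond=None)
    local_estimates.append(theta_k)

theta_bar = np.mean(local_estimates, axis=0)   # simple averaging estimator
```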
Obviously, the one-shot style of distributed framework is highly communication-efficient, because it requires only a single round of communication between each $M_k$ and $M_{\mathrm{center}}$. Hence, the communication cost is of the order $O(Kp)$, where $p$ is the dimension of each local estimate $\hat\theta_k$. Theoretical properties of the simple averaging estimator have also been studied in the literature. For example, it was shown in Zhang et al. (2013, Corollary 2) that, under appropriate regularity conditions,
$$E\big\|\bar\theta - \theta^*\big\|_2^2 \le \frac{C_1}{N} + \frac{C_2}{n^2}, \qquad (2.1)$$
where $C_1, C_2$ are some positive constants. If $n$ is sufficiently large such that $n^{-2} = o(N^{-1})$, then the dominant term in (2.1) becomes $C_1/N$, which is of the order $O(N^{-1})$.
This is the same as that of the whole sample estimator. It also implies that, in order to obtain the global convergence rate, we should not divide the whole sample into too many parts. Further improved theoretical results were obtained by Rosenblatt and Nadler (2016), who showed that the one-shot estimator is first-order equivalent to the whole sample estimator. However, the second-order error terms of $\bar\theta$ can be non-negligible for nonlinear models. A similar observation was also made by Huang and Huo (2015). The work of Duchi et al. (2014) revealed that the minimal communication budget to attain the global estimation error for linear regression is $O(Kp)$ bits up to a logarithmic factor under some conditions. This result matches the simple averaging procedure and confirms the sharpness of the bound in (2.1). To further reduce the bias, a novel subsampling method was developed by Zhang et al. (2013). With this technique, the error bound is improved to $O(N^{-1} + n^{-3})$, which relaxes the restriction on the number of machines.
Instead of a linear combination of the local maximum likelihood estimates (MLEs) as in simple averaging, Liu and Ihler (2014) proposed a KL-divergence based combination,
$$\hat\theta_{\mathrm{KL}} = \arg\min_{\theta\in\Theta}\sum_{k=1}^K \mathrm{KL}\big(p(x\,|\,\hat\theta_k)\,\big\|\,p(x\,|\,\theta)\big),$$
where $p(x\,|\,\theta)$ is the probability density of $P_\theta$ with respect to some proper measure $\mu$ and the KL-divergence is defined by $\mathrm{KL}\big(p(x)\,\|\,q(x)\big) = \int_{\mathcal X} p(x)\log\{p(x)/q(x)\}\,d\mu(x)$. It was shown that $\hat\theta_{\mathrm{KL}}$ is exactly the global MLE $\hat\theta$ if $\{P_\theta : \theta\in\Theta\}$ is a full exponential family (as defined in their paper). This sheds light on the inference for generalized linear models (GLMs) based on exponential likelihoods.
In many cases, some local machines might suffer from data of poor quality. This could lead to abnormal local estimates, which further degrade the statistical efficiency of the final estimator. To fix the problem, Minsker et al. (2019) devised a robust assembling method, which leads to the estimator $\hat\theta_{\mathrm{robust}} = \arg\min_{\theta\in\Theta}\sum_{k=1}^K\rho(\|\theta - \hat\theta_k\|)$, where $\rho(\cdot)$ is a robust loss function satisfying certain conditions. For example, when $\rho(u) = u$ and $p = 1$ (the univariate case), $\hat\theta_{\mathrm{robust}}$ is the median of the $\hat\theta_k$'s. It should be more robust against outliers as compared with the simple average. Under some regularity conditions, they showed that $\hat\theta_{\mathrm{robust}}$ achieves the same convergence rate as the whole sample estimator.

[Figure 1(b): Illustration of the iterative approach.]

Iterative approach
Although the one-shot approach involves little communication cost, it suffers from several disadvantages. First, the local machines need to have a sufficient amount of data (e.g., $n \gg \sqrt{N}$). Otherwise the aggregated estimator cannot attain the same convergence rate as the global estimator. This prevents us from utilizing many machines to speed up the computation process (Wang et al., 2017; Jordan et al., 2019). Second, the simple averaging estimator often performs poorly for nonlinear models (Rosenblatt and Nadler, 2016; Huang and Huo, 2015; Jordan et al., 2019). Last, when $p$ is diverging with $N$, the situation could be even worse (Rosenblatt and Nadler, 2016; Lee et al., 2017). This suggests that carefully designed algorithms allowing a reasonable number of iterations should be useful for a distributed system.
Inspired by the one-step method in M-estimator theory, Huang and Huo (2015) proposed a one-step refinement of the simple averaging estimator. Recall that $\bar\theta$ is the one-shot averaging estimator. To further improve its statistical efficiency, it is first broadcast to each local machine. Next, the local gradient $\nabla\mathcal{L}_k(\bar\theta)$ and local Hessian $\nabla^2\mathcal{L}_k(\bar\theta)$ are computed on each $M_k$. They are then reported to $M_{\mathrm{center}}$ to form the central gradient $\nabla\mathcal{L}(\bar\theta) = K^{-1}\sum_{k=1}^K\nabla\mathcal{L}_k(\bar\theta)$ and Hessian $\nabla^2\mathcal{L}(\bar\theta) = K^{-1}\sum_{k=1}^K\nabla^2\mathcal{L}_k(\bar\theta)$. Thus a one-step updated estimator can be constructed on $M_{\mathrm{center}}$ as
$$\hat\theta^{(1)} = \bar\theta - \big\{\nabla^2\mathcal{L}(\bar\theta)\big\}^{-1}\nabla\mathcal{L}(\bar\theta). \qquad (2.2)$$
Compared with the one-shot estimator, $\hat\theta^{(1)}$ involves one more round of communication. Nevertheless, the statistical efficiency of the resulting estimator can be substantially improved. In fact, Huang and Huo (2015) showed a bound of the form
$$E\big\|\hat\theta^{(1)} - \theta^*\big\|_2^2 \le \frac{C_1}{N} + O(n^{-4}),$$
where $C_1 > 0$ is some constant. Obviously, this is a sharper upper bound on the mean squared error than that in (2.1). To attain the global convergence rate, the local sample size only needs to satisfy $n^{-4} = o(N^{-1})$, which is a much milder condition. Furthermore, they showed that $\hat\theta^{(1)}$ has the same asymptotic efficiency as the whole sample estimator $\hat\theta$ under some regularity conditions.
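A sketch of the one-step update (2.2) for a generic smooth loss (an illustrative logistic loss is used here; `shards` is assumed to be a list of local (X_k, Y_k) pairs and `theta_bar` the simple averaging estimator):

```python
import numpy as np

def local_grad_hess(theta, Xk, Yk):
    # Gradient and Hessian of the average logistic loss on one machine.
    prob = 1.0 / (1.0 + np.exp(-Xk @ theta))
    grad = Xk.T @ (prob - Yk) / len(Yk)
    hess = (Xk * (prob * (1.0 - prob))[:, None]).T @ Xk / len(Yk)
    return grad, hess

def one_step_update(theta_bar, shards):
    # Aggregate the K local gradients and Hessians at theta_bar,
    # then perform a single Newton step as in (2.2).
    parts = [local_grad_hess(theta_bar, Xk, Yk) for Xk, Yk in shards]
    g = np.mean([g_k for g_k, _ in parts], axis=0)
    H = np.mean([H_k for _, H_k in parts], axis=0)
    return theta_bar - np.linalg.solve(H, g)
```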
A natural idea to further extend the one-step estimator is to allow the iteration (2.2) to be executed many times. Specifically, let $\hat\theta^{(t)}$ be the estimator of the $t$th iteration. Then, we can use (2.2) with $\bar\theta$ replaced by $\hat\theta^{(t)}$ to generate the next-step estimator $\hat\theta^{(t+1)}$ (see Figure 1(b)). However, this requires a large number of Hessian matrices to be computed and transferred. If the parameter dimension $p$ is relatively high, this leads to a significant communication cost of the order $O(Kp^2)$.
To fix the problem, Shamir et al. (2014) proposed an approximate Newton method, which conducts Newton-type iterations distributedly without transferring the Hessian matrices. Following this strategy, Jordan et al. (2019) developed an approximate likelihood approach. Their key idea is to update the Hessian matrix on one single machine (e.g., $M_{\mathrm{center}}$) only. Then, (2.2) can be revised to
$$\hat\theta^{(t+1)} = \hat\theta^{(t)} - \big\{\nabla^2\mathcal{L}_{\mathrm{center}}(\hat\theta^{(t)})\big\}^{-1}\nabla\mathcal{L}(\hat\theta^{(t)}), \qquad (2.3)$$
where $\nabla^2\mathcal{L}_{\mathrm{center}}$ is the Hessian matrix computed on the central machine. By doing so, the communication cost due to the transmission of Hessian matrices is saved. Under some conditions, they showed that an error contraction of the form
$$\big\|\hat\theta^{(t+1)} - \hat\theta\big\| \le C_1\, n^{-1/2}\,\big\|\hat\theta^{(t)} - \hat\theta\big\|\ \ \text{(up to higher-order terms)}$$
holds with high probability, where $C_1 > 0$ is some constant. By the linear convergence of the iterates in (2.3), it requires about $\lceil\log K/\log n\rceil$ iterations to achieve the same $\sqrt{N}$-consistency as the whole sample estimator $\hat\theta$, provided the initial value $\hat\theta^{(0)}$ is $\sqrt{n}$-consistent.
Note that if $n = K = \sqrt{N}$, one iteration suffices to attain the optimal convergence rate. However, the satisfactory performance of this method relies on a good choice of the machine on which the Hessian is updated (Fan et al., 2019a). To fix this problem, Fan et al. (2019a) added an extra regularization term to the approximate likelihood used in Jordan et al. (2019). With this modification, the performance of the resulting estimator can be substantially improved. Theoretically, they established a similar linear convergence rate under more general conditions, which require no strict homogeneity of the local loss functions.
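For contrast with the one-step sketch above, the following sketch implements the Newton-type update (2.3), in which only gradients are communicated each round and the Hessian comes from the central machine alone (again with an illustrative logistic loss; the surrogate-likelihood formulation of Jordan et al. (2019) is reduced here to its update equation, and all names are ours):

```python
import numpy as np

def logistic_grad(theta, Xk, Yk):
    prob = 1.0 / (1.0 + np.exp(-Xk @ theta))
    return Xk.T @ (prob - Yk) / len(Yk)

def logistic_hess(theta, Xk, Yk):
    prob = 1.0 / (1.0 + np.exp(-Xk @ theta))
    return (Xk * (prob * (1.0 - prob))[:, None]).T @ Xk / len(Yk)

def distributed_approx_newton(theta0, shards, n_rounds=5):
    """Each round: broadcast theta, collect only the K local gradients
    (O(Kp) communication), and invert a Hessian computed on the central
    machine alone, as in (2.3)."""
    theta = theta0
    X_center, Y_center = shards[0]          # machine 0 plays the role of M_center
    for _ in range(n_rounds):
        g = np.mean([logistic_grad(theta, Xk, Yk) for Xk, Yk in shards], axis=0)
        H_center = logistic_hess(theta, X_center, Y_center)
        theta = theta - np.linalg.solve(H_center, g)
    return theta
```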

Shrinkage methods
We study various shrinkage methods for sparse estimation in this subsection. For a high-dimensional problem, especially when the dimension of $\theta^*$ is larger than the sample size $N$, it is difficult to estimate $\theta^*$ without any additional assumptions (Hastie et al., 2015). A popular assumption for tackling such problems is sparsity, which requires that only a subset of the entries in $\theta^*$ be non-zero. The index set of the non-zero entries is called the support of $\theta^*$, that is, $\mathrm{supp}(\theta^*) = \{1\le j\le p : \theta^*_j \ne 0\}$. To induce a sparse solution, an additional regularization term on $\theta$ is often added to the loss function. Specifically, we solve the shrinkage regression problem
$$\hat\theta_\lambda = \arg\min_{\theta\in\Theta}\big\{\mathcal{L}(\theta) + P_\lambda(\theta)\big\},$$
where $P_\lambda(\cdot)$ is a penalty function with a regularization parameter $\lambda > 0$. Popular choices are the LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001) and others discussed in Zhang et al. (2012). For simplicity, we consider the LASSO estimator in the framework of the linear model. Specifically, the whole sample estimator is computed as
$$\hat\theta_\lambda = \arg\min_{\theta}\Big\{(2N)^{-1}\big\|Y - X\theta\big\|_2^2 + \lambda\|\theta\|_1\Big\},$$
where $X = (X_1, \ldots, X_N)^\top \in \mathbb{R}^{N\times p}$ is the design matrix, $Y = (Y_1, \ldots, Y_N)^\top$, and $\|\theta\|_1 = \sum_{j=1}^p|\theta_j|$ denotes the $\ell_1$-norm of $\theta$. It is known that the LASSO procedure produces biased estimates of the large coefficients. This is undesirable for simple averaging procedures, since averaging cannot eliminate a systematic bias. To reduce the bias, Javanmard and Montanari (2014) proposed a debiasing technique for the lasso estimator, that is,
$$\hat\theta^{d}_{\lambda} = \hat\theta_{\lambda} + N^{-1} M X^\top\big(Y - X\hat\theta_{\lambda}\big), \qquad (2.4)$$
where $M \in \mathbb{R}^{p\times p}$ is an approximation to the inverse of $\hat\Sigma = X^\top X/N$. Note that when $\hat\Sigma$ is invertible (e.g., when $N \gg p$), setting $M = \hat\Sigma^{-1}$ recovers the ordinary least squares estimator. In the distributed setting, each machine can compute a debiased local lasso estimator $\hat\theta^{d}_{k,\lambda}$ in the same fashion, and these are averaged on the central machine. Unfortunately, the sparsity level can be seriously degraded by averaging. For this reason, a hard thresholding step often comes as a remedy. It was also noticed that the debiasing step is computationally expensive; hence an improved algorithm was proposed to alleviate the computational cost of this step. Under certain conditions, the resulting estimator was shown to have the same convergence rate as the whole sample estimator. Battey et al. (2018) investigated the same problem with an additional study on hypothesis testing. Furthermore, a refitted estimation procedure was used to preserve the global oracle property of the distributed estimator.
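The following sketch (Python, using scikit-learn's Lasso solver; illustrative only) mimics this pipeline: each machine fits a local lasso, debiases it in the spirit of (2.4) using a crude pseudo-inverse of its local Gram matrix in place of a carefully constructed M (node-wise lasso estimates of M would be used in practice, and the pseudo-inverse is only sensible when n is not far below p), and the central machine averages and hard-thresholds.

```python
import numpy as np
from sklearn.linear_model import Lasso

def local_debiased_lasso(Xk, Yk, lam):
    n = len(Yk)
    theta_k = Lasso(alpha=lam, fit_intercept=False).fit(Xk, Yk).coef_
    M = np.linalg.pinv(Xk.T @ Xk / n)     # crude proxy for an approximate inverse of Sigma
    # Debiasing step in the spirit of (2.4), applied to the local sample.
    return theta_k + M @ Xk.T @ (Yk - Xk @ theta_k) / n

def distributed_sparse_estimate(shards, lam, threshold):
    # One-shot averaging of the debiased local estimates, then hard thresholding
    # to restore sparsity after averaging.
    theta_bar = np.mean([local_debiased_lasso(Xk, Yk, lam) for Xk, Yk in shards], axis=0)
    return np.where(np.abs(theta_bar) > threshold, theta_bar, 0.0)
```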
An extension to high dimensional GLMs can also be found in Lee et al. (2017) and Battey et al. (2018). For this model, Chen and Xie (2014) implemented a majority voting method to combine the regularized local estimates without a debiasing step.
For the low dimensional sparse problem with a smooth loss function (e.g., GLMs, the Cox model), Wang et al. (2017) showed that the one-step estimator $\hat\theta^{(1)}$ suffices to achieve the global convergence rate if $n \ge O(Ks^2\log p)$, where $s$ denotes the sparsity level of $\theta^*$ (the condition used in Lee et al., 2017). Furthermore, if multi-round communication is allowed, $\hat\theta^{(t+1)}$ (i.e., the estimator of the $(t+1)$th iteration) can match the estimator based on the whole sample as long as $n \ge O(s^2\log p)$ and $t \ge O(\log K)$, under certain conditions.

Non-smooth loss based models
The methods described above typically require the loss function $L$ to be sufficiently smooth, although a non-smooth regularization term is permitted (see, e.g., Zhang et al., 2013; Wang et al., 2017; Jordan et al., 2019; Zhu et al., 2019). However, there are also some useful models involving non-smooth loss functions, such as quantile regression and the support vector machine. It is then of great interest to develop distributed methods for these models.
We first focus on the quantile regression (QR) model. The QR model is widely used in statistics and econometrics, and is more robust to outliers than the ordinary quadratic loss (Koenker, 2005). Specifically, a QR model assumes $Y_i = X_i^\top\theta^*_\tau + \varepsilon_i$, $i \in S$, where $X_i \in \mathbb{R}^p$ is the covariate vector, $Y_i$ is the corresponding response, $\theta^*_\tau \in \mathbb{R}^p$ is the true regression coefficient at quantile level $\tau \in (0,1)$, and $\varepsilon_i$ is the random noise satisfying $P(\varepsilon_i \le 0\,|\,X_i) = \tau$. The associated check loss is $\rho_\tau(u) = u\{\tau - 1(u < 0)\}$, where $1\{\cdot\}$ is the indicator function. When the data size $N$ is moderate, we can estimate $\theta^*_\tau$ by $\hat\theta_\tau = \arg\min_{\theta\in\Theta} N^{-1}\sum_{i\in S}\rho_\tau(Y_i - X_i^\top\theta)$ on one single machine. However, when $N$ is very large, a distributed system has to be used, and distributed estimators for the QR model have been developed accordingly.

We next consider the support vector machine (SVM), which is one of the most successful statistical learning methods (Vapnik, 2013). The classical SVM is aimed at the binary classification problem, i.e., the response variable $Y_i \in \{-1, 1\}$. Formally, a standard linear SVM solves the problem $\hat\theta_\lambda = \arg\min_{\theta\in\Theta}\big\{N^{-1}\sum_{i\in S}(1 - Y_i X_i^\top\theta)_+ + \lambda\|\theta\|_2^2\big\}$, where $(u)_+ = u\,1(u > 0)$ is the hinge loss and $\lambda > 0$ is the regularization parameter. Distributed estimators for the linear SVM have also been studied; theoretically, the asymptotic normality of the resulting estimator was established, which can be used to construct confidence intervals. For the ultra-high dimensional SVM problem, Lian and Fan (2018) studied the one-shot averaging method with a debiasing procedure similar to (2.4).
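As a toy illustration of the one-shot idea for one of these non-smooth problems, the sketch below fits local quantile regressions by direct numerical minimization of the check loss and averages them (Python/SciPy; Nelder-Mead is used only because it tolerates non-smooth objectives in small dimensions, whereas practical QR solvers rely on linear programming, and, as discussed above, naive averaging need not be the best strategy):

```python
import numpy as np
from scipy.optimize import minimize

def check_loss(u, tau):
    # rho_tau(u) = u * (tau - 1{u < 0})
    return u * (tau - (u < 0))

def local_quantile_fit(Xk, Yk, tau, theta0):
    # Minimize the (non-smooth) local check loss numerically.
    obj = lambda theta: np.mean(check_loss(Yk - Xk @ theta, tau))
    return minimize(obj, theta0, method="Nelder-Mead").x

def one_shot_quantile(shards, tau):
    # One-shot averaging of the K local quantile regression estimates.
    theta0 = np.zeros(shards[0][0].shape[1])
    return np.mean([local_quantile_fit(Xk, Yk, tau, theta0) for Xk, Yk in shards], axis=0)
```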

Nonparametric models
Different from parametric models, a nonparametric model typically involves infinite-dimensional parameters. In this section, we focus mainly on nonparametric regression problems. Specifically, consider a general regression model $Y_i = f^*(X_i) + \varepsilon_i$, $i \in S$, where $f^*(\cdot)$ is an unknown but sufficiently smooth function and $\varepsilon_i$ is the random noise with zero mean. The aim of nonparametric regression is to estimate the function $f^* \in \mathcal{F}$, where $\mathcal{F}$ is a given nonparametric class of functions.

Local smoothing
One way to estimate $f^*(\cdot)$ is to fit a locally constant model by kernel smoothing (Fan and Gijbels, 1996). More concretely, the whole sample estimator is given by $\hat f_h(x) = \sum_{i\in S} W_{h,X_i}(x)\,Y_i$, where $W_{h,X_i}(x) \ge 0$ is the local weight at $X = x$ satisfying $\sum_{i\in S} W_{h,X_i}(x) = 1$. Specifically, for the Nadaraya-Watson kernel estimator, we have $W_{h,X_i}(x) = K_h(X_i - x)\big/\sum_{j\in S}K_h(X_j - x)$, where $K_h(u) = h^{-1}K(u/h)$, $K(\cdot)$ is a kernel function and $h > 0$ is the bandwidth. In the univariate case ($p = 1$), classical theory states that the mean squared error of $\hat f_h(x)$ is of the order $O(h^4 + (Nh)^{-1})$ (Fan and Gijbels, 1996). Thus, the optimal rate $O(N^{-4/5})$ can be achieved by choosing bandwidth $h = O(N^{-1/5})$.
For distributed kernel smoothing, a one-shot estimator can also be constructed.
Let $\hat f_{k,h}(x)$ be the local estimator computed on $M_k$. Then an averaging estimator can be obtained as $\bar f_h(x) = K^{-1}\sum_{k=1}^K \hat f_{k,h}(x)$. Chang et al. (2017a) studied the theoretical properties of $\bar f_h(x)$ in a specific function space $\mathcal{F}$. They established the same minimax convergence rate of $\bar f_h(x)$ as that of the whole sample estimator. However, they found that a strict restriction on the number of machines $K$ is needed to achieve this optimal rate. To fix the problem, two solutions were provided: a data-dependent bandwidth selection algorithm and an algorithm with a qualification step.
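A minimal sketch of the distributed Nadaraya-Watson estimator in the univariate case (Python/NumPy with a Gaussian kernel; `shards` is assumed to be a list of local (X_k, Y_k) pairs, and the common bandwidth h is an input to be tuned against the whole sample size N, as discussed above):

```python
import numpy as np

def nw_estimate(x, Xk, Yk, h):
    # Local Nadaraya-Watson estimate at the point x with a Gaussian kernel.
    w = np.exp(-0.5 * ((Xk - x) / h) ** 2)
    return np.sum(w * Yk) / np.sum(w)

def distributed_nw(x, shards, h):
    # One-shot averaging of the K local kernel estimates.
    return np.mean([nw_estimate(x, Xk, Yk, h) for Xk, Yk in shards])
```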
The nearest neighbors method can be regarded as another local smoothing method; its distributed version has also been studied, where the optimal number of neighbors needed to achieve the optimal rate of convergence was derived. Li et al. (2013) discussed the problem of density estimation for scattered datasets. Kaplan (2019) focused on the choice of bandwidth for nonparametric smoothing techniques. All the works discussed in this subsection indicate that the bandwidth (or local smoothing parameter) used in the distributed setting should be adjusted according to the whole sample size $N$, rather than the local sample size $n$.

RKHS methods
We next discuss another popular nonparametric regression method. This is reproducing kernel Hilbert space (RKHS) method. An RKHS H can be induced by a continuous, symmetric and positive semi-definite kernel function K(·, ·) : R p × R p → R.
Two typical examples are the polynomial kernel $K(x_1, x_2) = (x_1^\top x_2 + 1)^d$ with an integer $d \ge 1$, and the radial kernel $K(x_1, x_2) = \exp(-\gamma\|x_1 - x_2\|_2^2)$ with $\gamma > 0$. Refer to, for example, Wahba (1990) and Berlinet and Thomas-Agnan (2011) for more details about RKHS. Our target is then to find an $\hat f \in \mathcal{H}$ minimizing a penalized empirical loss, which leads to the whole sample estimator
$$\hat f_\lambda = \arg\min_{f\in\mathcal{H}}\Big\{N^{-1}\sum_{i\in S}\big(Y_i - f(X_i)\big)^2 + \lambda\|f\|_{\mathcal{H}}^2\Big\}, \qquad (3.1)$$
where $\|\cdot\|_{\mathcal{H}}$ is the norm associated with the RKHS $\mathcal{H}$ and $\lambda > 0$ is the regularization parameter. This problem is also known as kernel ridge regression (KRR). By the representer theorem for the RKHS (Wahba, 1990), any solution to the problem (3.1) must have the linear form $\hat f_\lambda(x) = \sum_{i\in S}\alpha_i K(X_i, x)$, where $\alpha_i \in \mathbb{R}$ for each $i \in S$. By this property, we can treat the KRR as a parametric problem with unknown parameter $\alpha = (\alpha_1, \ldots, \alpha_N)^\top \in \mathbb{R}^N$. The error bounds of the whole sample estimator $\hat f_\lambda$ have been well established in the existing literature (e.g., Zhang, 2005; Steinwart et al., 2009).
However, a standard implementation of the KRR involves inverting a kernel matrix in R N ×N (Saunders et al., 1998). Therefore, when N is extremely large, it is time consuming or even computationally infeasible to process the whole sample on a single machine. Thus, we should consider a distributed system.
In this regard, Zhang et al. (2015) studied the distributed KRR by taking the one-shot averaging approach. Specifically, each machine $M_k$ computes a local KRR estimator $\hat f_{k,\lambda}$ by solving the analogue of (3.1) on its own local sample, and the central machine averages them as $\bar f_\lambda(x) = K^{-1}\sum_{k=1}^K\hat f_{k,\lambda}(x)$. Under appropriate conditions on the number of machines, the averaged estimator was shown to attain the same convergence rate as the whole sample estimator, and this one-shot strategy was further studied and extended in a series of subsequent works. It was noted that these one-shot approaches require the number of machines to diverge at a relatively slow rate in order to meet the global convergence rate. To fix the problem, Chang et al. (2017b) proposed a semi-supervised learning framework by utilizing additional unlabeled data (i.e., observations without the response $Y_i$).
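A sketch of this one-shot distributed KRR (Python/NumPy with a Gaussian kernel; the local solve uses the representer-theorem form with coefficients $(K_k + n\lambda I)^{-1}Y_k$, and all function names are ours):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Radial (Gaussian) kernel matrix between the rows of A and B.
    d2 = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * d2)

def local_krr_fit(Xk, Yk, lam, gamma=1.0):
    # Representer theorem: f_k(x) = sum_i alpha_i K(X_i, x),
    # with alpha = (K_k + n * lam * I)^{-1} Y_k on machine M_k.
    n = len(Yk)
    alpha = np.linalg.solve(rbf_kernel(Xk, Xk, gamma) + n * lam * np.eye(n), Yk)
    return Xk, alpha

def distributed_krr_predict(x_new, local_fits, gamma=1.0):
    # One-shot averaging of the K local KRR predictions at the new points.
    preds = [rbf_kernel(x_new, Xk, gamma) @ alpha for Xk, alpha in local_fits]
    return np.mean(preds, axis=0)
```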
More recent work by Lin et al. (2020) allowed communication between machines for the distributed KRR problem to improve the performance. In order to choose an optimal tuning parameter $\lambda$ in (3.1), Xu et al. (2018) proposed a distributed generalized cross-validation method.
For semiparametric models, Zhao et al. (2016) considered a partially linear model with heterogeneous data in a distributed setting. Specifically, they assumed the model
$$Y_i = X_i^\top\theta^*_{(k)} + f^*(W_i) + \varepsilon_i, \quad i \in S_k,\ 1 \le k \le K, \qquad (3.2)$$
where $W_i \in \mathbb{R}$ is an additional covariate, $f^*(\cdot)$ is the unknown function, and $\theta^*_{(k)} \in \mathbb{R}^p$ is the true linear coefficient associated with the data on $M_k$ for $1 \le k \le K$. In other words, the local data on different machines are assumed to share the same nonparametric part, but are allowed to have different linear coefficients. To estimate the unknown function and coefficients, they extended the classical RKHS theory to cope with the partially linear function space. Under some regularity conditions, the resulting estimator of the nonparametric part is shown to be as efficient as the whole sample estimator, provided the number of machines does not grow too fast. The case of a high dimensional linear part was also investigated. For example, under the homogeneity assumption (i.e., the linear coefficients $\theta^*_{(k)}$'s are assumed to be identical to $\theta^*$ across different machines), Lv and Lian (2017) adopted the one-shot averaging approach with a debiasing technique analogous to (2.4) to estimate the linear coefficient. Lian et al. (2019) considered the same heterogeneous model as in (3.2), but with the linear part in a high dimensional setting (i.e., $p > N$).


Principal component analysis

Principal component analysis (PCA) is another widely used tool for analyzing massive datasets. In the distributed setting, Fan et al. (2019b) studied a one-shot algorithm for estimating the first $d$ principal component directions, which can be summarized as follows (a code sketch is given after the two steps below).

1. For each $k = 1, \cdots, K$, machine $M_k$ computes the $d$ leading eigenvectors of the local sample covariance matrix $\hat\Sigma_k = n^{-1}\sum_{i\in S_k} X_i X_i^\top$, denoted by $\hat v_{1,k}, \cdots, \hat v_{d,k} \in \mathbb{R}^p$. Next, they are arranged by columns in $\hat V_k = (\hat v_{1,k}, \cdots, \hat v_{d,k}) \in \mathbb{R}^{p\times d}$, which is then sent to the central machine $M_{\mathrm{center}}$.
2. The central machine $M_{\mathrm{center}}$ averages the $K$ local projection matrices to obtain $\tilde\Sigma = K^{-1}\sum_{k=1}^K \hat V_k\hat V_k^\top$. Then it computes the $d$ leading eigenvectors of $\tilde\Sigma$, denoted by $\tilde v_1, \cdots, \tilde v_d \in \mathbb{R}^p$. These $\tilde v_1, \cdots, \tilde v_d$ are the estimators of the first $d$ principal component directions that we need.
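A minimal sketch of these two steps (Python/NumPy; `shards` is assumed to be a list of the local design matrices X_k, and all names are ours):

```python
import numpy as np

def local_eigenvectors(Xk, d):
    # Step 1 on machine M_k: top-d eigenvectors of the local covariance matrix.
    Sigma_k = Xk.T @ Xk / len(Xk)
    _, eigvecs = np.linalg.eigh(Sigma_k)        # eigenvalues in ascending order
    return eigvecs[:, -d:]                      # p x d matrix V_k

def distributed_pca(shards, d):
    # Step 2 on M_center: average the projection matrices V_k V_k^T,
    # then extract the top-d eigenvectors of the average.
    projections = [V @ V.T for V in (local_eigenvectors(Xk, d) for Xk in shards)]
    Sigma_tilde = np.mean(projections, axis=0)
    _, eigvecs = np.linalg.eigh(Sigma_tilde)
    return eigvecs[:, -d:]                      # estimated principal directions
```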
It is noticeable that the communication cost of the above one-shot algorithm is of the order $O(Kdp)$. This can be considered communication-efficient since $d$ is usually much smaller than $p$ in practice. Fan et al. (2019b) showed that, under some appropriate conditions, the distributed estimator achieves the same convergence rate as the global estimator. The case of heterogeneous local data was also investigated in their work.
To further remove the restriction on the number of machines, Chen et al. (2021a) proposed a communication-efficient multi-round algorithm based on the approximate Newton method (Shamir et al., 2014).

Feature screening
Massive datasets often involve ultrahigh dimensional data, for which feature screening is critically important (Fan and Lv, 2008). To fix the idea, consider a standard linear regression model $Y_i = X_i^\top\theta^* + \varepsilon_i$, $i \in S$, where $X_i \in \mathbb{R}^p$ is the covariate vector, $Y_i$ is the corresponding response, $\theta^* \in \mathbb{R}^p$ is the true parameter, and $\varepsilon_i$ is the random noise.
To screen for the most promising features, the seminal method of sure independence screening (SIS) was proposed by Fan and Lv (2008). Specifically, let $\mathcal{A}^* = \{1 \le j \le p : \theta^*_j \ne 0\}$ be the true sparse model, and let $\omega_j$ be the Pearson correlation between the $j$th feature and the response $Y$. Then, SIS screens features by a hard thresholding procedure, $\hat{\mathcal{A}}_\gamma = \{1 \le j \le p : |\hat\omega_j| > \gamma\}$, where $\gamma$ is a prespecified threshold and $\hat\omega_j$ is the whole sample estimator of $\omega_j$. Under some specific conditions, Fan and Lv (2008) established the sure screening property for SIS, that is, $P(\mathcal{A}^* \subset \hat{\mathcal{A}}_\gamma) \to 1$ as $N \to \infty$. However, the estimator $\hat\omega_j$ is usually biased for many correlation measures. This indicates that a direct one-shot averaging approach is unlikely to be the best practice for the distributed system. To fix the problem, Li et al. (2020) proposed a novel debiasing technique. They found that many correlation measures can be expressed as $\omega_j = g(\nu_1, \ldots, \nu_s)$ for a known function $g$ and component parameters $\nu_1, \ldots, \nu_s$, including the Pearson correlation used above, the Kendall $\tau$ rank correlation, the SIRS correlation (Zhu et al., 2011), etc. Therefore, they used U-statistics to estimate the components $\nu_q$, $1 \le q \le s$, on the local machines. These unbiased estimators of the $\nu_q$'s are then averaged on the central machine $M_{\mathrm{center}}$.
Consequently, $M_{\mathrm{center}}$ can construct the distributed estimator $\tilde\omega_j$ by plugging the averaged component estimators into the known function $g$. Finally, they established the sure screening property of $\hat{\mathcal{A}}_\gamma$ based on the distributed estimators under some regularity conditions. The case where the feature dimension is much larger than the sample size (i.e., $p \gg N$) was also considered.
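A sketch of this averaging-then-plug-in idea for the Pearson correlation (Python/NumPy; for brevity the component statistics are plain plug-in moments rather than the U-statistics used by Li et al. (2020), and `shards` is assumed to be a list of local (X_k, Y_k) pairs):

```python
import numpy as np

def local_moments(Xk, Yk):
    # Component statistics whose across-machine averages identify the
    # Pearson correlations omega_j = g(nu_1, ..., nu_s).
    p = Xk.shape[1]
    return np.vstack([Xk.mean(0), (Xk ** 2).mean(0), (Xk * Yk[:, None]).mean(0),
                      np.full(p, Yk.mean()), np.full(p, (Yk ** 2).mean())])

def distributed_screening(shards, gamma):
    # Average the local component estimates, plug them into g, then threshold.
    comp = np.mean([local_moments(Xk, Yk) for Xk, Yk in shards], axis=0)
    mx, mx2, mxy, my, my2 = comp
    omega = (mxy - mx * my) / np.sqrt((mx2 - mx ** 2) * (my2 - my ** 2))
    return np.where(np.abs(omega) > gamma)[0]   # indices of retained features
```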

Bootstrap
Bootstrap and related resampling techniques provide a general and easily implemented procedure for automatic statistical inference. However, these methods are usually computationally expensive. When the sample size $N$ is very large, they may even be practically infeasible to conduct. To mitigate this computational issue, various alternative methods have been proposed, such as the subsampling approach (Politis et al., 1999) and the "m-out-of-n" bootstrap (Bickel et al., 2012). Their key idea is to reduce the resample size. However, due to the difference between the sizes of the whole sample and the resample, an additional correction step is generally required to rescale the result. This makes these methods less automatic.
To solve this problem, Kleiner et al. (2014) proposed the bag of little bootstraps (BLB) method. It integrates the idea of subsampling and can be computed distributedly without a correction step. Suppose that the $N$ sample units have been randomly and evenly partitioned across the $K$ machines, and that we want to assess the accuracy of a point estimator of some parameter $\theta$. Their algorithm can be summarized as follows.
1. For each $1 \le k \le K$, machine $M_k$ draws $r$ resamples of size $N$ (instead of $n$) from $S_k$ with replacement. It then computes $r$ estimates of $\theta$ separately based on the $r$ resamples drawn above. After that, each $M_k$ computes some accuracy measure, denoted by $\hat\xi_k$ (e.g., a confidence region), from the $r$ estimates. Finally, all local machines send the $\hat\xi_k$'s to the central machine $M_{\mathrm{center}}$.
2. The central machine $M_{\mathrm{center}}$ aggregates these $\hat\xi_k$'s by $\bar\xi = K^{-1}\sum_{k=1}^K\hat\xi_k$. This $\bar\xi$ is the final accuracy measure that we need.
It is remarkable that one does not actually need to process datasets of size $N$ on the local machines, although the nominal size of each resample is $N$; each machine only holds $n$ sample units. In fact, randomly generating appropriate weight vectors of length $n$ (e.g., multinomial counts summing to $N$) suffices to represent the resampling process.
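A sketch of the local BLB step using this weight-vector representation (Python/NumPy, for a univariate mean with a percentile interval as the accuracy measure; all choices are illustrative):

```python
import numpy as np

def blb_accuracy_on_machine(local_data, N, r=100, alpha=0.05, seed=0):
    """One machine's contribution: r resamples of nominal size N are represented
    by multinomial weight vectors of length n, and the r weighted estimates
    (here, means) are summarized by a percentile interval (accuracy measure xi_k)."""
    rng = np.random.default_rng(seed)
    n = len(local_data)
    estimates = []
    for _ in range(r):
        w = rng.multinomial(N, np.full(n, 1.0 / n))     # resample counts summing to N
        estimates.append(np.sum(w * local_data) / N)    # weighted estimate of the mean
    return np.quantile(estimates, [alpha / 2, 1 - alpha / 2])

# The central machine would average the K intervals returned by the local machines.
```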

Future study
To conclude the article, we would like to discuss a number of interesting topics for future study. First, for datasets of truly massive size, a distributed system is definitely needed, since no single machine can store the whole dataset. On the other hand, for datasets of sufficiently small size, traditional in-memory statistical methods can be used directly. This leaves a big gap between big and small datasets.
Those middle-sized datasets are often much larger than the computer memory but smaller than the hard drive. Consequently, they can be comfortably placed on a personal computer, but can hardly be processed in memory as a whole. Their sizes are not large enough to justify an expensive distributed system, yet not small enough to be handled by traditional statistical methods. How to analyze datasets of this size seems to be a topic worth studying. Second, when the whole dataset is allocated to local machines randomly and evenly, the data on different machines are independent, identically distributed, and balanced. Then, all of the methods discussed above can be safely implemented. However, when the data on different machines are collected from (for example) different regions, the homogeneity of the local data would normally be hard to satisfy. The situation could be even worse if the sample sizes allocated to different local machines are very different. How to cope with such heterogeneous and unbalanced local data is a problem of great importance. The idea of meta analysis may be applicable in these situations (Liu et al., 2015; Zhou and Song, 2017; Xu and Shao, 2020). Finally, in the era of big data, personal privacy is under unprecedented threat. How to protect users' private information during the learning process deserves urgent attention. In this regard, differential privacy (DP) provides a theoretical framework for privacy-preserving data analysis (Dwork, 2008), and several related works on privacy-preserving distributed learning have appeared. Thus, it is of great interest to study the privacy-preserving distributed statistical learning problem both practically and theoretically.