Communication-efficient distributed statistical inference on zero-inflated Poisson models

Zero-inflated count outcomes are common in many studies, such as claim-frequency counts in the insurance industry, where identifying and understanding excessive zeros is of interest. Moreover, with advances in data collection and storage techniques, datasets are often too massive to be stored or processed by a single node or branch, so distributed data analysis is flourishing. In this paper, several communication-efficient distributed zero-inflated Poisson regression algorithms are developed to analyse such large-scale zero-inflated data. Asymptotic properties of the proposed estimators and the complexities of the algorithms are studied in detail. Various simulation studies demonstrate that the proposed methods and algorithms work well and efficiently. Finally, in a case study, we apply the proposed algorithms to car insurance data from Kaggle.


Introduction
For the analysis of count data, many zero-inflated regression models have been developed. These models are designed to deal with situations where there is an 'excessive' number of individuals with a count of 0. For example, in a car insurance study where the dependent variable is the number of claims a policyholder makes in an insurance period, the vast majority of policyholders may have a value of 0 ('zero' meaning that the policyholder files no insurance claims). The reasons for zero-inflation are twofold. First, a large proportion of insurance policies are subject to a deductible excess, meaning the insurance company pays a claim only if the loss exceeds some given amount. Second, car insurance often includes a 'no claim discount (NCD)': if the applicant makes no claim in the current insurance period, he enjoys a discount in the next period, which leads some policyholders to forgo claims in order to keep the discount. In this situation, it is of great practical significance to identify and understand those inflated zeros. Commonly used count distributions are problematic here, and appropriate model assumptions are required to characterize such a zero-inflation structure. The zero-inflated Poisson (ZIP) distribution is a popular tool for such problems. It is a special case of a finite mixture distribution: it places probability p at 0 and probability 1 − p on a Poisson distribution with parameter λ (denoted Poisson(λ)). When p is greater than 0, the data contain more zeros than under an ordinary Poisson distribution, exhibiting zero-inflation.
The ZIP distribution was first presented by Cohen (1963) and Johnson and Kotz (1970). A ZIP regression model was then developed by Lambert (1992) to study a manufacturing defect problem. The EM algorithm was used to estimate the parameters, avoiding the computational problems caused by simultaneously estimating the parameters related to p and λ. In addition, the algorithm and its theoretical properties were discussed according to whether p and λ were dependent or not. Hall (2000) proposed ZIP and zero-inflated binomial (ZIB) regressions with random effects and derived the corresponding estimation algorithms. To deal with zero-inflated outcomes with more complex correlations, Lee et al. (2006) presented a multi-level ZIP model and studied its parameter estimation. Tang et al. (2014) discussed the application of ZIP regression to risk factor selection in the insurance industry, developing an adaptive-lasso-based EM algorithm for parameter estimation. Hall (2000), Lee et al. (2006) and Tang et al. (2014) gave applications of ZIP regression under different scenarios together with corresponding algorithms, but these algorithms assume the data are stored in a single institution. With the growth of data volume and the emergence of distributed data, ZIP regression algorithms for distributed data are increasingly needed, but the relevant research is still deficient.
With the progress of technology, a distributed structure is common in data storage. There are two main reasons for this. First, the amount of data may exceed the storage limit of a single institution, so it has to be stored distributedly. Second, when data collection is carried out by different countries and regions, the data naturally form a distributed structure. In the first case, it is difficult to consolidate data from different institutions. In the second case, the data of different institutions sometimes cannot be directly exchanged or combined, for example for privacy protection. In these cases, distributed algorithms are needed so that the goals of data analysis can be achieved without merging the data. In evaluating a distributed algorithm, communication cost is an important criterion, and reducing the communication cost of distributed algorithms is therefore an active topic.
In recent years, there have been a number of studies on communication-efficient distributed algorithms and their theoretical properties. Shamir et al. (2014) presented a distributed Newton-type optimization method with linear convergence when the objective function has a quadratic structure. Mota et al. (2013) presented a distributed ADMM algorithm for separable optimization on node network structures; the algorithm converges when the network is bipartite or the loss function is strongly convex. The theoretical properties of these two algorithms require strong convexity of the objective function, which is too harsh in some cases. Jordan et al. (2018) presented a communication-efficient distributed algorithm applicable to a wide range of scenarios with relatively weak convergence conditions; in low-dimensional M-estimation, the objective function only needs to be locally convex, which motivates our proposed Algorithm 2. Zhu et al. (2021) presented a distributed least squares approximation (DLSA) method for a class of regression problems. By approximating each local objective function with a local quadratic form, they obtain a combined estimator as a weighted average of local estimators, with the same statistical efficiency as the global estimator. In this paper, this idea is borrowed and extended to construct the communication-efficient distributed Algorithm 3.
In this paper, to reduce the computational difficulty of maximizing the likelihood function, latent variables are introduced to separate the parameters into two parts. The EM algorithm is an effective method for optimizing objective functions with latent variables. Dempster et al. (1977) applied the EM algorithm to compute the MLE with incomplete data, giving the general forms of the E step and the M step, the theoretical properties of the EM algorithm, and the scenarios in which it can be applied. Wu (1983) pointed out a problem in the proof of Dempster et al. (1977) and studied two more general convergence forms of the EM algorithm. In addition to theoretical properties, there is also plenty of research on EM algorithm applications. Redner and Walker (1984) investigated the implementation of the EM algorithm for parameter estimation in mixture densities and its theoretical properties, especially when the mixture components come from an exponential family. Distributed EM algorithms are often studied in the context of sensor networks, for example Nowak (2003) and Gu (2008). Such distributed EM algorithms usually have special properties stemming from the sensor network setting, such as decentralization and connectivity only between adjacent nodes, which is not consistent with the research scenario in this paper. There is relatively little research on EM algorithms applied to traditional distributed scenarios.
According to the derivations in this paper, the E step of the EM algorithm can be completed with the data of the local institution alone, so no distributed algorithm is needed there. The M step, however, requires the data of all institutions, so a distributed algorithm must be proposed. In view of this, we first present a distributed EM algorithm (Algorithm 1) for the M step and analyse its communication cost. Then, motivated by two ideas, we decrease the communication cost of Algorithm 1 and give communication-efficient distributed algorithms. Specifically, Algorithm 2 reduces the communication cost of each internal iteration of the M step, and Algorithm 3 reduces the number of internal iterations of the M step to one-shot. We also present an asymptotic analysis of Algorithm 2. In general, our contributions are threefold. First, we present three distributed algorithms for ZIP regression, two of which are communication efficient. Second, we give the theoretical properties of Algorithm 2 and compare the time complexity and transmission cost of the three algorithms. Third, we fully compare the three algorithms in simulations and summarize the applicable scenarios of each algorithm, which is of practical significance.
The rest of the paper is organized as follows. Section 2 gives the main algorithms of this paper: the EM algorithm for ZIP regression, the distributed EM algorithm (Algorithm 1) and the communication-efficient distributed EM algorithm motivated by Jordan et al. (2018) (Algorithm 2). Section 3 presents the theoretical results for Algorithm 2. Section 4 presents the second communication-efficient distributed algorithm (Algorithm 3), motivated by Zhu et al. (2021). In Section 5, various simulations are provided to verify the proposed Algorithms 1-3. The algorithms are then applied to car insurance data in Section 6. The article concludes with a short discussion in Section 7.

Model and method
This section introduces the ZIP regression model, the EM algorithm that solves it and a distributed version of the algorithm. Combining these with the characteristics of distributed computing, we then propose an improved algorithm with more efficient communication.

Model
Assume that the independent responses {Y_i : i = 1, ..., n} follow the ZIP distribution: Y_i = 0 with probability p_i, and Y_i ~ Poisson(λ_i) with probability 1 − p_i. (1) A simple calculation yields the probability mass function of Y_i: P(Y_i = 0) = p_i + (1 − p_i)e^{−λ_i}, and P(Y_i = y) = (1 − p_i)e^{−λ_i}λ_i^y / y! for y ≥ 1. We see that P(Y_i = 0) > e^{−λ_i} when p_i > 0, which indicates zero-inflation.
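This mixture form is easy to sketch in code. The following is a minimal illustration (not the paper's code) of the standard ZIP probability mass function, confirming the extra mass at zero:

```python
import math

def zip_pmf(y, p, lam):
    """P(Y = y) under a ZIP(p, lam) mixture: zero state with probability p,
    Poisson(lam) with probability 1 - p."""
    pois = math.exp(-lam) * lam ** y / math.factorial(y)
    return p + (1 - p) * pois if y == 0 else (1 - p) * pois

# With p > 0 the mass at zero exceeds the pure Poisson's e^{-lam}.
print(zip_pmf(0, 0.3, 2.0) > math.exp(-2.0))  # prints True
```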
Based on early work (Lambert, 1992), the parameters p_i and λ_i can be modelled by a logistic regression model and a log-linear model: logit(p_i) = z_i'γ and log(λ_i) = x_i'β, where {z_i} and {x_i}, i = 1, ..., n, are covariate vectors for observation i, of dimensions p and q, respectively. These two vectors can be the same or different. γ = (γ_1, ..., γ_p)' and β = (β_1, ..., β_q)' are the corresponding coefficient vectors.
According to the above assumptions, the likelihood function is L(γ, β) = ∏_{y_i=0} {p_i + (1 − p_i)e^{−λ_i}} ∏_{y_i>0} (1 − p_i)e^{−λ_i}λ_i^{y_i}/y_i!. (4) Optimizing this function directly is troublesome, especially since the first part of the likelihood involves both γ and β. First, the responses come from a mixture distribution involving the parameters p_i and λ_i. Second, regression models are imposed on both parameters. The computation is therefore quite challenging. We use the EM algorithm to optimize it; the next section gives the specific implementation.
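For concreteness, the observed-data log-likelihood under the logistic and log-linear links can be written in a numerically stable form. This is a sketch of the standard ZIP likelihood (Lambert, 1992), not the paper's implementation:

```python
import math
import numpy as np

def zip_loglik(gamma, beta, Z, X, y):
    """Observed-data log-likelihood of ZIP regression with
    logit(p_i) = z_i' gamma and log(lambda_i) = x_i' beta."""
    eta = Z @ gamma          # logistic linear predictor
    lam = np.exp(X @ beta)   # Poisson mean
    zero = y == 0
    # y_i = 0 terms: log(p_i + (1-p_i)e^{-lam_i}) = logaddexp(eta, -lam) - log(1+e^eta)
    ll = np.sum(np.logaddexp(eta[zero], -lam[zero]))
    # y_i > 0 terms: log((1-p_i) * Poisson pmf) shares the -log(1+e^eta) part
    pos = ~zero
    ll += np.sum(y[pos] * (X[pos] @ beta) - lam[pos]
                 - np.array([math.lgamma(v + 1.0) for v in y[pos]]))
    ll -= np.sum(np.logaddexp(0.0, eta))  # -sum_i log(1 + e^{eta_i})
    return ll
```

The `logaddexp` form avoids overflow when the linear predictors are large, which matters once the optimizer starts taking aggressive steps.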

EM algorithm
The EM algorithm for the ZIP regression model was first introduced by Lambert (1992). It is based on the latent variable U = I(Y from the zero state), which indicates whether the response comes from the zero state or the Poisson state. The distribution of U_i is U_i = 1 with probability p_i and U_i = 0 with probability 1 − p_i. (5) Taking the conditional expectation, the complete-data log-likelihood based on (Y, U) is obtained (7). The parts of this formula concerning the parameters γ and β are, up to constants, L_{c,1}(γ) = Σ_i [U_i z_i'γ − log(1 + e^{z_i'γ})] (8) and L_{c,2}(β) = Σ_i (1 − U_i)(y_i x_i'β − e^{x_i'β}). (9) We can now optimize these two functions with respect to γ and β, respectively. Based on the objective function (7), the (k + 1)th step of the EM algorithm is as follows.
E step: Based on γ^(k) and β^(k), estimate U_i^(k) by its posterior mean: U_i^(k) = p_i^(k) / (p_i^(k) + (1 − p_i^(k)) e^{−λ_i^(k)}) if y_i = 0, and U_i^(k) = 0 otherwise. (10)
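In code, the E step is a purely local computation. A minimal sketch using the standard ZIP posterior mean (illustrative, not the paper's code):

```python
import numpy as np

def e_step(gamma, beta, Z, X, y):
    """Posterior mean of the zero-state indicator U_i given the current
    parameters; computable from each institution's own data alone."""
    p = 1.0 / (1.0 + np.exp(-(Z @ gamma)))   # p_i from the logistic model
    lam = np.exp(X @ beta)                   # lambda_i from the log-linear model
    # U_i can be positive only where y_i = 0; otherwise the zero state is ruled out.
    return np.where(y == 0, p / (p + (1.0 - p) * np.exp(-lam)), 0.0)
```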
The computation of Equation (10) does not need to cross different institutions, meaning that it can be completed directly in each local institution. However, the computation of Equations (13) and (14) is cross-institutional, and a distributed algorithm needs to be proposed. The specific algorithm is as follows.
In the algorithm, δ is the tolerance in the stopping criterion, which can be adjusted according to the accuracy requirement of the result.
To discuss the computational time complexity of the algorithm, consider one EM iteration. For a local institution, matrix multiplication is the dominant operation: the E step has time complexity O(n_j(p + q)) and the M step O(T n_j(p² + q²)). The central institution mainly performs matrix inversion and multiplication, with time complexity O(T(Jp² + Jq² + p³ + q³)).
As for the communication cost, the main contribution is each local institution passing its Hessian matrix in every internal iteration of the M step, giving a communication cost of O(TJ(p² + q²)) per EM iteration. The central institution sends the updated parameters to the local institutions, and the local institutions send back statistics computed from the updated parameters rather than the original data (x_ji, z_ji), so the algorithm has a privacy-protection property. However, when the covariate dimensions p and q are large, the communication cost of Algorithm 1 increases sharply. Since communication cost is one of the important criteria for evaluating distributed algorithms, an algorithm with higher communication efficiency is proposed below.

Communication-efficient distributed EM algorithm
Based on Equations (13) and (14), an equivalent expression can be obtained in which the inverse of the Hessian matrix computed on all the data in Equations (15) and (16) is replaced by an operation based on the data of 'Institution 1' (just choose one institution), giving Equations (17) and (18). At this point, a parameter update requires at most one institution to transmit its Hessian matrix, and if the institution that computes the Hessian is chosen as the central institution, the whole algorithm does not need to transmit a Hessian at all. After the iteration of Equations (17) and (18) reaches the stopping requirement (for example, after T iterations), the final iterate is recorded as the updated parameter. Algorithm 2 proceeds as follows.
1: Initialize U_ji = I(y_ji = 0), and let every institution compute Equations (11) and (12) on its own data; take the average on the central institution to obtain γ^(0) and β^(0) and transmit them to institutions j = 1, ..., J.
2: E step: for institution j = 1, ..., J, compute U_ji^(k) by Equation (10), i = 1, ..., n_j.
3: M step: for t = 1, ..., T, each institution j computes its local statistics T_{1j}^(k,t), ..., T_{4j}^(k,t) and transmits them to the central institution.
4: The central institution computes the updates of Equations (17) and (18) and transmits them to institutions j = 1, ..., J, until t = T.
5: Update γ^(k+1), β^(k+1) and transmit them to institutions j = 1, ..., J.
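The core update idea, aggregating gradients across all institutions while forming the Hessian from one institution's data only, can be sketched for the γ sub-problem (a weighted logistic likelihood) as follows. This is an illustrative sketch of the technique, not the paper's exact Equations (17) and (18):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def surrogate_newton_step(gamma, Z_blocks, u_blocks):
    """One inner M-step iteration in the spirit of Algorithm 2: the gradient
    is summed over all institutions (O(p) communication each), while the
    Hessian uses institution 1's data only, rescaled by n / n_1."""
    # global gradient of the negative weighted logistic log-likelihood
    grad = sum(Zj.T @ (sigmoid(Zj @ gamma) - uj)
               for Zj, uj in zip(Z_blocks, u_blocks))
    # Hessian approximated from institution 1 alone
    Z1, n1 = Z_blocks[0], len(u_blocks[0])
    w1 = sigmoid(Z1 @ gamma) * (1.0 - sigmoid(Z1 @ gamma))
    H1 = (Z1 * w1[:, None]).T @ Z1
    n = sum(len(uj) for uj in u_blocks)
    # H1 * (n / n1) approximates the full-data Hessian under homogeneity
    return gamma - np.linalg.solve(H1 * (n / n1), grad)
```

Because only O(p)-dimensional gradient summaries cross institutions each round, the per-iteration communication drops from O(p²) to O(p) once the Hessian-holding institution coincides with the centre.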
Motivated by Jordan et al. (2018), we can interpret Equations (17) and (18) in terms of the likelihood. For convenience, rewrite Equations (8) and (9) in the following distributed form and add the factor 1/n (this is equivalent from the perspective of the optimal solution; the reason for adding this factor is explained below): L_{c,1}(γ) = (1/n) Σ_j Σ_i L_{c,1,ji}(γ) (19), and similarly for L_{c,2}(β) (20). Based on the E step, take a Taylor expansion of Equations (19) and (20) at γ^(k) or β^(k). The following discussion treats L_{c,1}(γ) as an example; L_{c,2}(β) is similar.
Replace the higher-order derivative parts (second order and above) of the expansion, computed on all the data, with those computed from a single local institution's data (if the factor 1/n were not added in Equations (19) and (20), this approximate substitution would be unreasonable).
Substituting Equation (23) into Equation (22) and ignoring the constant term gives Equation (24). Equation (24) is solved by Newton's method, with t indexing the Newton iterations, yielding the iteration expression (25). When the specific expressions are substituted in, this is the same as Equations (17) and (18).

Main theoretic results
This section introduces the theoretical results for Algorithm 2. We focus on the low-dimensional situation. To fully reduce the communication cost, T = 1 is taken, and the theory in this part is obtained under this setting. The conclusions include the asymptotic normality of the parameter estimator obtained by Algorithm 2 and a consistent estimator of its asymptotic variance.
We impose the following assumptions. (1) [a condition on the parameter space Θ]. (2) The level set Θ(θ^(0)) := {θ ∈ Θ : L(θ) ≥ L(θ^(0))} is compact for any θ^(0) satisfying L(θ^(0)) > −∞, and Θ(θ^(0)) lies in the interior of Θ, where θ^(0) ∈ Θ is the initial value of Algorithm 2. (3) [regularity conditions (3)(a)-(3)(c) on the M-step objective]. (4) [a requirement on the initial values].
These assumptions concern the Jordan-type approximation of the M step of the EM algorithm and the convergence of the EM algorithm. Specifically, Assumption (1) is a condition on the parameter space; Assumption (2) relates to the boundedness of the EM sequences; Assumption (3) concerns the objective function of the M step. Conditions (3)(a)-(3)(c) are used in the proofs of Propositions A.4 and A.5 in the appendix: (3)(a) concerns local convexity and is used to derive an upper bound on the error of the M-step objective after the approximation; (3)(b) is a global identifiability condition, a basic requirement for the consistency of estimators; (3)(c) concerns the smoothness of the M-step objective and is used to derive the order of error reduction after one M-step iteration. Assumption (4) is a requirement on the initial values of the algorithm. We then have the following two theorems.
Theorem 3.1: Denote the final result of Algorithm 2 by (γ̃, β̃), and the maximizer of Equation (4), which is the MLE of the original problem, by (γ̂, β̂). Under the above assumptions, as n → ∞, √n(γ̃ − γ*) and √n(β̃ − β*) are asymptotically normal with covariance matrices Σ_11 and Σ_22, the upper-left p × p and lower-right q × q blocks of Σ, respectively, where Σ := I(θ*)^{-1} = {E(∇² l(θ*; y, z, x))}^{-1} and θ* = [γ*, β*] is the true parameter, the maximizer of the expectation of Equation (4).

One shot communication-efficient distributed EM algorithm
To reduce the communication cost of a distributed algorithm, besides reducing the cost of each transmission (Algorithm 2), the same goal can be achieved by reducing the number of transmissions in the M step. For a full comparison with Algorithm 2, this section proposes another algorithm, in which only one round of transmission is carried out between institutions. The main idea is motivated by Zhu et al. (2021).
The distributed data structure is consistent with that described in Section 2.3; the difference from the previous algorithms lies mainly in the M step of the EM algorithm. We illustrate with the M step for γ. According to Equations (19) and (20), L_{c,1}(γ) = (1/n) Σ_j Σ_i L_{c,1,ji}(γ). Take a Taylor expansion of Σ_{i=1}^{n_j} L_{c,1,ji}(γ) at γ̂_j, where γ̂_j = argmin_γ(−L_{c,1,j}(γ)) is the optimization result for one local institution. Ignoring the higher-order terms of the Taylor expansion (so this algorithm requires a sufficient sample size in every local institution) and the constant terms independent of γ, we obtain a quadratic objective weighted by the local Hessians ∇²(−L_{c,1,j}(γ̂_j)). Treating this as the optimization objective, the weighted least squares solution is the Hessian-weighted average of the local estimators γ̂_j. Similarly to the derivation for γ, the M step for β yields the analogous update, where β̂_j = argmin_β(−L_{c,2,j}(β)) is the optimization result of one local institution. We thus have Algorithm 3.
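The one-shot combination can be sketched as a Hessian-weighted average of the local optimizers, in the spirit of DLSA (Zhu et al., 2021). The function name is illustrative and the snippet is a sketch of the combination step only:

```python
import numpy as np

def one_shot_combine(local_estimates, local_hessians):
    """Combine local M-step solutions gamma_j with their local Hessians H_j
    via the weighted-least-squares solution (sum_j H_j)^{-1} sum_j H_j gamma_j."""
    H_sum = sum(local_hessians)
    rhs = sum(H @ g for H, g in zip(local_hessians, local_estimates))
    return np.linalg.solve(H_sum, rhs)
```

Each institution transmits its pair (γ̂_j, H_j) once, so the M step needs a single communication round regardless of how many local Newton iterations produced γ̂_j.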
If we solve for γ_j^(k), β_j^(k) using Newton's method, let C be the number of Newton iterations. For the time complexity of the algorithm, consider one EM iteration. A local institution mainly performs matrix multiplication and inversion: the E step has time complexity O(n_j(p + q)) and the M step O(C(n_j(p² + q²) + p³ + q³)). The central institution mainly performs matrix inversion, with time complexity O(Jp² + Jq² + p³ + q³).
To reduce the computing cost of a single local institution, the optimization problems for γ_j^(k), β_j^(k) in Algorithm 3 can be replaced by a one-step approximation. After this approximation, the effect on the computational time complexity appears in the M step of a single local institution, which becomes O(n_j(p² + q²) + p³ + q³).
Table 1. Computing complexity and communication cost of each EM iteration for Algorithms 1-3. Note: 'single' stands for the computing complexity of a single local institution; 'central' stands for the computing complexity of the central institution; 'communication' stands for the communication cost.
Table 2. Computing complexity and communication cost of each EM iteration for Algorithms 1-3 with T = C = 1. Note: 'single' stands for the computing complexity of a single local institution; 'central' stands for the computing complexity of the central institution; 'communication' stands for the communication cost.
The communication cost of one EM iteration is O(J(p² + q²)). The central institution passes the updated parameters to each machine, and each local institution transmits statistics computed from the updated parameters rather than the original data (x_ji, z_ji), so this algorithm also has the privacy-protection property. Compared with the previous two algorithms, the iterations of the optimization algorithm within the M step (that is, solving γ^(k), β^(k)) are completed within each local institution, and no cross-institution communication occurs; the EM algorithm carries out only one transmission per iteration, independent of the number of internal optimization iterations of the M step. In Algorithms 1 and 2, by contrast, the M-step iterations are completed across institutions and require cross-institution communication, so the communication cost grows with the number of M-step iterations, and the number of communications per EM iteration is tied to the number of internal M-step iterations. Table 1 compares the computing complexity and communication cost of each EM iteration for Algorithms 1-3. Compared with Algorithm 1, both the computing complexity and the communication cost of Algorithm 2 are reduced. For Algorithm 3, the computing complexity of a single local institution is related to C, and the complexity and communication cost of the central institution decrease. The computing complexity and communication cost of Algorithms 2 and 3 depend on T and C, respectively.
To fully reduce communication and computing costs, set T = C = 1, as shown in Table 2. The computing complexity of Algorithm 2 is lower than that of Algorithm 1. For Algorithm 3, the time complexity of a single local institution is higher than that of Algorithm 1, because a single local institution has to solve the optimization problem itself, while their computing complexities at the central institution are the same. In terms of communication cost, Algorithm 2 is the lowest. Since C only affects the computing cost of a single local institution, the communication cost of Algorithm 3 is not improved.

Simulation study
We present the performance of the proposed algorithms under a variety of settings and varying sample sizes. All simulation results are based on 500 replications. Section 5.1 studies performance with homogeneous data, meaning that the observations of all institutions come from the same distribution. Section 5.2 studies heterogeneous data, where different institutions have different data distributions.
To measure estimation efficiency, we calculate the empirical mean squared error (mse) as Σ_{s=1}^{500} ‖x̂_s − x*‖²/500, together with the empirical bias and variance. In Table 3, the performances of the proposed Algorithms 1-3 are all better than that of Sub, and improve as the number of institutions (i.e., the total sample size) increases. The results of Algorithms 1-2 are close to those of Global, and the small difference between T = 1 and T = 3 means that similar estimators can be obtained at lower communication cost (T = 1). Algorithm 1 is equivalent to the Global method based on Newton optimization except for the choice of initial value and the number of internal M-step iterations, so its results are similar to those of the Global method. The stability of Algorithm 2 is relatively poor: there are some non-convergent runs among the repeated experiments, which is related to the sample size used to compute the Hessian matrix in the M step, so this phenomenon does not improve as the number of institutions increases. There is a gap between the results of Algorithm 3 and the Global ones, and increasing the number of institutions does not close it. However, when C increases, the results of Algorithm 3 improve significantly. The reason is that the M-step result of Algorithm 3 is a weighted average of the local institutions' results: when C and the sample size of a single institution are fixed, increasing the number of institutions reduces the variance of the result but not its bias, whereas a larger C improves the result of each local institution in the M step and hence the final result.
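These replication metrics are straightforward to compute. The sketch below uses one standard bias/variance convention, which may differ in detail from the paper's exact definitions:

```python
import numpy as np

def replication_metrics(estimates, truth):
    """Empirical mse = (1/S) sum_s ||x_hat_s - x*||^2 over S replications,
    with the decomposition mse = bias^2 + var."""
    est = np.asarray(estimates, dtype=float)
    truth = np.asarray(truth, dtype=float)
    mse = np.mean(np.sum((est - truth) ** 2, axis=1))
    bias = np.linalg.norm(est.mean(axis=0) - truth)          # norm of mean error
    var = np.mean(np.sum((est - est.mean(axis=0)) ** 2, axis=1))
    return mse, bias, var
```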
In summary, for distributed data with small local sample sizes, Algorithm 1 is recommended if sufficient communication resources are available. Algorithm 2 with T = 1 is recommended under tight communication and computation constraints. Algorithm 3 with large C is a good choice when each institution has strong computing power and communication resources are relatively limited.
Figure 1 shows the empirical coverage probabilities (CPs) of the 95% confidence interval for the first component of γ, i.e., γ_1. The confidence interval is constructed as γ̂_1 ± 1.96(Σ̂_11/n)^{1/2}, where γ̂ comes from the different algorithms and Σ̂_11 comes from Theorem 3.2. The empirical CPs for the other parameters are similar, and the details are omitted. The CPs of Algorithms 1 and 2 and the Global method approach 0.95 (the nominal level) as the total sample size increases. Since the sample size of a single institution does not change, the Sub method traces a horizontal line. By the construction of Algorithm 3, increasing the number of institutions reduces the variance but not the bias, which makes its CPs worse.
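Empirical coverage itself is a one-line computation; the following hedged sketch (function name illustrative) makes the Wald-interval construction explicit:

```python
import numpy as np

def empirical_coverage(estimates, std_errors, truth, z=1.96):
    """Fraction of replications whose Wald interval est +/- z * se covers the
    true value; compared against the nominal 95% level."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    return float(np.mean(np.abs(est - truth) <= z * se))
```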
Case 2: Fix the total sample size n = 100,000. Data generation is the same as in Case 1; we fix the total sample size n but change the number of institutions.
In Case 2, we only change the number of institutions, i.e., the degree of data aggregation. All empirical mse, bias and variance values are listed in Table 4. The results of Algorithms 1-2 change little, because the total sample size n does not change. However, the results of Algorithm 3 improve as the sample size of a single institution increases. The performance of Algorithms 1-2 remains similar to the Global one, with little difference between T = 1 and T = 3. As the sample size of a single institution grows, the number of non-convergent runs of Algorithm 2 decreases significantly, consistent with the analysis in Case 1. Because the data volume of a single institution increases, the local results used by Algorithm 3 improve, and so does the corresponding weighted average; increasing C improves Algorithm 3's performance further. In particular, the simulation results show that the three algorithms perform similarly when the sample size of a single institution is large enough. Therefore, for a distributed data structure with a large single-institution sample size, Algorithm 3 with C = 1 is recommended to save calculation and communication costs; otherwise, the recommendations from Case 1 apply.
Similar to Case 1, Figure 2 shows the empirical coverage probabilities of the 95% confidence interval for γ_1. In summary, the CPs of all algorithms approach 0.95 as the sample size of a single institution increases. For Algorithm 3, as the number of institutions decreases, the bias of the parameter estimate decreases significantly while the variance increases slightly; at the same time, the sample size used to calculate the variance increases, improving the CPs.

Heterogeneous data
To compare the algorithms fully across scenarios, we also set up a simulation scenario that is inconsistent with the theoretical conditions. This section shows the performance of the proposed algorithms when the data distribution in each institution is slightly different. Specifically, {x_ji, z_ji} in different institutions (i.e., j = 1, ..., J) come from different distributions: x_ji ~ N_q(10^{-3} × j, I_q + diag(10^{-3} × j)) and z_ji ~ N_p(10^{-3} × j, I_p + diag(10^{-3} × j)), where diag(a) denotes the square matrix with a on the diagonal and 0 elsewhere.
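The heterogeneous design above can be generated as follows; the function name and seeding are illustrative, and the covariance I + diag(10⁻³ j) is diagonal, so each coordinate is drawn independently:

```python
import numpy as np

def hetero_covariates(J, n_j, p, q, seed=None):
    """Case 3 covariates: institution j draws each coordinate from
    N(1e-3 * j, 1 + 1e-3 * j), so distributions drift slightly with j."""
    rng = np.random.default_rng(seed)
    data = []
    for j in range(1, J + 1):
        mu = 1e-3 * j
        sd = np.sqrt(1.0 + mu)                 # sqrt of the diagonal variance
        x = rng.normal(mu, sd, size=(n_j, q))  # x_ji in R^q
        z = rng.normal(mu, sd, size=(n_j, p))  # z_ji in R^p
        data.append((x, z))
    return data
```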
Case 3: Fix the total sample size n = 100,000 with heterogeneous data. Except for the generation of {x_ji, z_ji}, all settings are the same as in Case 2. All empirical mse, bias and variance values are listed in Table 5. We find that Algorithms 1-2 can still achieve results similar to the Global one when the data in different institutions come from different distributions. Unlike in Case 2, Algorithm 2 is less stable: when a = 100, most of the 500 repetitions do not converge. The reason is that when the sample size of a single institution is small and the data in each institution are distributed differently, the approximation of the Hessian matrix adopted in the M step of Algorithm 2 becomes very poor. However, as the sample size in a single institution increases, the stability of Algorithm 2 becomes acceptable, indicating that it performs well if a is large enough, even when its theoretical assumptions are not satisfied.
When a = 5000, the numbers of non-convergent runs for Algorithms 1-2 are the same, while Algorithm 3 has none, showing that Algorithm 3 is more stable under the same conditions with heterogeneous data. Therefore, if the sample size of a single institution is small, Algorithm 1 is suggested. Under limited communication and computing budgets, if the sample size of a single institution is not too small, Algorithm 3 with large C is a better choice; if the local sample size is large enough, Algorithm 2 is an alternative.
Figure 3 shows the coverage probability of the 95% confidence interval for the first component of γ, i.e., γ_1. The trends of the different algorithms are similar to those in Figure 2.

Real case study
In this section, we apply the proposed algorithms to car insurance data. The data are collected from Kaggle, with total sample size n = 10,300, and were randomly divided into K groups, with K = 20, 10, 5. The conclusions are similar across K, so only the results for K = 10 are presented here; the others appear in the appendix. CLM_FREQ is the response variable of interest, representing the number of past policy claims by the insured person; it is zero-inflated, with 61% of claims being zero. All candidate covariates are shown in Table 6.
We set X = Z, and Tables 7-8 show the parameter and standard deviation estimates of the different algorithms with T/C = 3, 1 and K = 10. The zero-state probability p_i, governed by γ, tends to be larger for vehicles in rural areas, where policyholders are more likely to have good driving habits, resulting in a lower likelihood of major accidents. Furthermore, insurance companies often offer attractive NCD policies for private cars, which would also result in a larger p_i for this type of policyholder.
The mean of the Poisson distribution, λ_i, which represents the average number of claims of a policyholder, is determined by the parameter β. According to the estimation results, factors such as having few children at home, being unmarried, having a clean driving record, having a high number of motor vehicle record points, residing in a city, and using the vehicle for commercial purposes can all lead to a higher λ_i, indicating a greater average number of claims. Living in an urban area, using the vehicle for commercial purposes, and not owning a private car can all increase the likelihood of being involved in road accidents. Additionally, commercial policyholders may not be attracted by the insurance company's NCD policy, further contributing to a higher number of claims.
As can be seen from Tables 7-8, the results of Sub differ most from Global, while the results of Algorithms 1-3 are relatively close to Global, with Algorithms 1 and 2 closer than Algorithm 3. Whether T or C is 1 or 3, the estimation results of Algorithms 1 and 2 are essentially the same, and the results of Algorithm 3 change only slightly. These findings agree well with the theoretical and simulation results.
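For concreteness, the model being fit in this case study, with λ_i = exp(x_i'β) for the Poisson mean, p_i = logit^{-1}(z_i'γ) for the zero-inflation probability, and X = Z, can be estimated on a single machine by an EM loop with one Newton update per M step (an ECM-style sketch). The links and the simulated stand-in data below are assumptions for illustration; this is not the Kaggle insurance data.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
n = 20000
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one covariate
Z = X                                                  # X = Z, as in the case study
gamma_true, beta_true = np.array([-1.0, 0.8]), np.array([0.5, -0.4])
p_true = sigmoid(Z @ gamma_true)
lam_true = np.exp(X @ beta_true)
y = np.where(rng.random(n) < p_true, 0, rng.poisson(lam_true))

gamma, beta = np.zeros(2), np.zeros(2)
for _ in range(300):
    p_i = sigmoid(Z @ gamma)
    lam_i = np.exp(X @ beta)
    # E step: posterior probability that an observed zero is structural
    w = np.where(y == 0, p_i / (p_i + (1 - p_i) * np.exp(-lam_i)), 0.0)
    # M step, part 1: one Newton step of weighted logistic regression for gamma
    g_g = Z.T @ (w - p_i)
    H_g = -(Z * (p_i * (1 - p_i))[:, None]).T @ Z
    gamma = gamma - np.linalg.solve(H_g, g_g)
    # M step, part 2: one Newton step of Poisson regression weighted by (1 - w)
    g_b = X.T @ ((1 - w) * (y - lam_i))
    H_b = -(X * ((1 - w) * lam_i)[:, None]).T @ X
    beta = beta - np.linalg.solve(H_b, g_b)
```

A single Newton update per M step keeps each iteration cheap; running the inner Newton to convergence instead would recover the exact M step of the EM algorithm.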

Conclusion and discussion
The study of data with zero-inflated structure is crucial in practical applications. In particular, when the data have a distributed structure, it becomes imperative to develop communication-efficient algorithms. This paper presents three distributed algorithms, two of which are improved through various techniques to enhance communication efficiency. Several simulation scenarios were then designed to compare the algorithms and identify their respective strengths and limitations. The performance of the algorithms on a real case is also reported.
The results of the study indicate that while Algorithm 2 has the lowest communication cost, it is less stable when the sample size of a single institution is small, especially for heterogeneous data. Algorithms 1 and 3 are comparatively stable but consume more resources. The development of a more robust algorithm remains an open challenge. Additionally, this paper focuses only on the ZIP model for zero-inflated data and does not cover the ZIB model or models with random effect terms. These limitations provide opportunities for further research and improvement.

A.1 Convergence of EM algorithm
Assumptions: (1) Θ ⊂ R^r: the parameter space Θ is a subset of the r-dimensional Euclidean space. (2) Θ_{θ^(0)} = {θ ∈ Θ : L(θ) ≥ L(θ^(0))} is compact and lies in the interior of Θ, where θ^(0) ∈ Θ is the initial value of the EM algorithm.
(3) L is continuous in Θ and differentiable in the interior of Θ.
The convergence of the EM algorithm can be established by the following two proofs.
Definition A.1: If an iterative algorithm θ^(k+1) = M(θ^(k)) satisfies L(M(θ)) ≥ L(θ) for all θ ∈ Θ, it is called a GEM algorithm (generalized EM algorithm, Dempster et al. (1977)).
Based on this definition, the proof is as follows. Proof (1): By construction of the EM update, L(θ^(k+1)) = L(M(θ^(k))) ≥ L(θ^(k)); according to the definition, the EM algorithm is a GEM algorithm. Therefore the monotonicity of L(θ^(k)) is proved.
Proof (2): From Θ_{θ^(0)} ⊂ Θ, we have Θ_{θ^(0)} ⊂ R^r. Since Θ_{θ^(0)} is compact, it is a bounded closed set. Since L is continuous in Θ, L is bounded above on Θ_{θ^(0)}. Since {θ^(k)} ⊂ Θ_{θ^(0)}, L is bounded above on {θ^(k)} for any θ^(0). Based on Proofs (1)-(2), L(θ^(k)) is monotone and bounded above, so it converges. It remains to show that the convergence point is the maximum point of L(θ). Since L(θ) is unimodal on Θ and has a unique maximum, it suffices to prove that the convergence point is a stationary point of L(θ).
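The monotone ascent established in Proofs (1)-(2) can be checked numerically. Below is a minimal single-machine EM sketch for a ZIP model without covariates (constant p and λ; the data are simulated stand-ins), asserting at every iteration that the log-likelihood never decreases.

```python
import math
import numpy as np

def zip_loglik(y, lgf, p, lam):
    """Log-likelihood of ZIP(p, lam); lgf caches log(y!)."""
    zero = y == 0
    ll_zero = math.log(p + (1 - p) * math.exp(-lam)) * zero.sum()
    ll_pos = (math.log(1 - p) - lam + y * math.log(lam) - lgf)[~zero].sum()
    return ll_zero + ll_pos

rng = np.random.default_rng(0)
n = 50000
y = np.where(rng.random(n) < 0.3, 0, rng.poisson(2.0, n)).astype(float)
lgf = np.array([math.lgamma(k + 1.0) for k in y])

p, lam = 0.5, y.mean()              # crude starting values
prev = -np.inf
for _ in range(500):
    # E step: posterior probability that an observed zero is structural
    w = np.where(y == 0, p / (p + (1 - p) * math.exp(-lam)), 0.0)
    # M step: closed-form weighted updates
    p, lam = w.mean(), ((1 - w) * y).sum() / (1 - w).sum()
    cur = zip_loglik(y, lgf, p, lam)
    assert cur > prev - 1e-6        # GEM property: L never decreases
    if cur - prev < 1e-10:
        break
    prev = cur
```

The assertion inside the loop is exactly Proof (1): each E/M cycle is a GEM step, so L(θ^(k)) is nondecreasing up to floating-point error.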

A.2 The convergence point of {θ^(k)} is a stationary point of L(θ)
Let us start with the Global Convergence Theorem (GCT); the details of its proof can be found on page 91 of Zangwill (1969). The subsequent results follow Zhang et al. (2013). This conclusion means that after one iteration of the M step, that is, T = 1, the error of the estimator with respect to θ drops by a factor of 1/√a. For completeness of the elaboration and to avoid ambiguity, the proofs of Propositions A.4-A.5 are omitted; the details can be derived based on Lemmas 6-8 in Zhang et al. (2013).

A.4 Consistency and asymptotic normality of the estimator from Algorithm 2
The proof in A.3 ensures that the Jordan-type approximation in the M step of each EM iteration converges to the result before approximation. Combined with the results of A.1 and A.2, the EM algorithm with the Jordan-type approximation converges to the MLE of the original objective function; that is, the result of Algorithm 2 converges to the MLE of the original objective function. Hence, the asymptotic properties of the proposed estimator are obtained.
This completes the proofs of Theorems 3.1-3.2.
Appendix 2. Results of real case study with K = 20, 5
This section shows the results of the real case study with K = 20, 5; the conclusions are similar to those for K = 10. Comparing the results for K = 20, 10, 5 under the same T/C, the results of Algorithm 1 differ slightly due to the change of initial value. Algorithms 2 and 3 improve as the sample size of a single institution increases, owing respectively to the better Hessian matrix approximation and the better optimization results within a single institution. The Global method remains unchanged because it uses the same data, and Sub also improves due to the increased amount of data used.
The empirical bias is computed as (1/500) Σ_{s=1}^{500} x̂_s − x*, and the empirical variance (var) as (1/500) Σ_{s=1}^{500} ( x̂_s − (1/500) Σ_{s=1}^{500} x̂_s )², where x* stands for γ* or β* and x̂_s stands for the estimator of γ or β in the s-th replicate. T/C stands for the number of internal iterations of the M step. 'Sub' stands for applying the EM algorithm to one institution's data; 'Global' stands for applying the EM algorithm to the whole data.
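The footnote's formulas amount to the following computation over the 500 replicate estimates of a single scalar parameter (x_hat and x_star below are simulated stand-ins, not the paper's actual replicates):

```python
import numpy as np

rng = np.random.default_rng(2)
x_star = 1.5                                      # true parameter value (hypothetical)
x_hat = x_star + rng.normal(0.0, 0.1, size=500)   # stand-in replicate estimates

bias = x_hat.mean() - x_star                      # (1/500) sum x_hat_s - x*
var = np.mean((x_hat - x_hat.mean()) ** 2)        # divides by 500, as in the footnote
mse = np.mean((x_hat - x_star) ** 2)              # satisfies mse = bias**2 + var
```

With the variance divided by the number of replicates (rather than 500 − 1), the reported quantities satisfy the exact decomposition mse = bias² + var.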

Figure 1. Coverage probabilities (CPs) of the 95% confidence intervals for γ1 in the setting of Case 1.

Figure 2. Coverage probabilities (CPs) of the 95% confidence intervals for γ1 in the setting of Case 2.

Table 1. Computing complexity and communication cost of each EM iteration for Algorithms 1-3.

Table 3. Empirical performances of the proposed algorithms under different J with fixed total sample size.

Table 4. Empirical performances of the proposed algorithms under different a with fixed total sample size.

Table 5. Empirical performances of the proposed algorithms for heterogeneous data with fixed total sample size.

Table 6. Covariate details in the real case study.