Bayesian inference for generalized linear mixed models: A comparison of different statistical software procedures

ABSTRACT Bayesian inference for generalized linear mixed models (GLMM) is appealing, but its widespread use has been hampered by the lack of a fast implementation tool and the difficulty in specifying prior distributions. In this paper, we conduct an extensive simulation study to evaluate the performance of INLA for estimation of the hierarchical Poisson regression models with overdispersion in comparison with JAGS and Stan while assuming a variety of prior specifications for variance components. Further, we analysed the influence of different factors such as small number of observations per cluster, different values of the cluster variance and estimation from a misspecified model. A simulation study has shown that the approximation strategy employed by INLA is accurate in general and that all software leads to similar results for most of the cases considered. Estimation of the variance components, however, is difficult when their true value is small for all estimation methods and prior specifications. The estimates obtained for all software tend to be biased downward or upward depending on the assumed priors.


Introduction
The Integrated Nested Laplace Approximation (INLA) by Rue et al. (2009) is a Bayesian estimation method which is computationally faster than its predecessors. Since its introduction, its performance compared to other software for Bayesian analysis has been widely reported in the literature. Taylor and Diggle (2013) compared INLA and Markov chain Monte Carlo (MCMC) in the context of spatial log-Gaussian Cox processes. Their simulation study confirms the advantage of INLA in terms of computational time, but shows that INLA has a lower predictive accuracy in certain scenarios. The use of INLA for Bayesian inference for generalized linear mixed models (GLMMs) was investigated by Fong et al. (2010) making a comparison with the maximum likelihood approach via Penalized Quasi Likelihood (Breslow & Clayton, 1993). Fong et al. (2010) showed that the approximations were inaccurate for binary data with few or no replications. Further, the performance of Bayesian inference using INLA for a random intercept logit model was investigated by Grilli et al. (2015), who presented a comparison with Bayesian MCMC Gibbs sampling and maximum likelihood with Adaptive Gaussian Quadrature (AGQ) approximation. In contrast to the previous two studies, Grilli et al. (2015) showed that the specification of the prior distribution is more relevant than the choice of the estimation method. Following Fong et al. (2010), INLA's developers addressed the inaccuracy of INLA for binary data with few or no replications by introducing a new correction term for INLA (Ferkingstad & Rue et al., 2015) and claim their correction has significantly improved the accuracy.
In this paper, we extend the investigation of Grilli et al. (2015) to longitudinal count data with overdispersion and evaluate the performance of INLA in comparison with JAGS (Plummer, 2003) and Stan (Stan Development Team, 2015) while assuming a variety of prior specifications for variance components. In contrast with the comparison presented in Ferkingstad and Rue et al. (2015), based on the difference between the results obtained by INLA after the correction to the results obtained by MCMC simulation, we base our comparison on the difference between the parameter estimates obtained for all methods and the true parameter values. Further, we investigate the influence of small sample sizes and true values on the cluster variance and overdispersion.
We proceed as follows. In Section 2, the two case studies used for illustration are presented. In section 3, we review the generalized linear mixed effects model, Bayesian estimation approaches and prior specifications. Section 4 presents results obtained for the two studies where we apply the three estimation methods to the two longitudinal count data sets. The simulation design and the main results of the simulation study are commented in Section 5. Finally, Section 6 gives a discussion and concluding remarks.

Case studies
Two data sets with longitudinal count outcomes were used to illustrate the methodological aspects discussed in this paper.

Anopheles mosquitoes count data
A longitudinal entomological study was conducted between June 2013 and November 2013 in Jimma town, south-western Ethiopia, to investigate whether the ecological transformation due to resettlement has an influence on the abundance and species composition of Anopheles mosquitoes. This was carried out by comparing villages who are at the centre of the town with those newly emerged villages located at the suburb of the town. The study design and rationale are presented in Degefa et al. (2015).
The study consists of a longitudinal count data where adult Anopheles mosquitoes resting inside human habitations were collected monthly (June 2013-November 2013) from 40 selected houses using pyrethrum spray catches (PSCs). Half of the households belong to resettled villages and are considered to be at risk of Malaria infection. The second half of the household belongs to villages located in low-risk areas (control). Figure 1 (left panel) presents the monthly female Anopheles mosquito count while the right panel presents the mean evolution over time.

The epilepsy data
The epilepsy data set contains information about 59 epileptics' patients that are randomized into two treatment arms, a placebo or a new drug in a randomized clinical trial of anticonvulsant therapy (Thall & Vail, 1990). The response variable, the number of epilepsy seizures, was measured at four visits over time. The data was presented and analysed by Breslow and Clayton (1993) and used as an illustrating example in Fong et al. (2010) who evaluated the performance of INLA in comparison with penalized quasi-likelihood (PQL). The epilepsy data set is available in R (R Core Team, 2016) as part of the R package INLA (Rue et al., 2009). Figure 2 shows the individual and average profiles of the epilepsy data for both treatment groups.

Hierarchical poisson-normal models
Generalized linear mixed models (GLMMs) extend the generalized linear model with a subject-specific random effect, usually of Gaussian type, added to the linear predictor to give a rich family of models that have been used in a wide variety of applications (McCulloch & Neuhaus, 2001;Molenberghs & Verbeke, 2005). The analysis presented in this paper is focused on count variables (number of female Anopheles mosquitoes and number of epilepsy seizures) which were repeatedly measured over time. Hence, we formulated a hierarchical Poisson model for both case studies. Let Y ij represent the response variable of the ith subject measured at time j, i ¼ 1; 2; . . . ; I and j ¼ 1; 2; . . . ; J. A Poisson-Normal hierarchical (HPN) model can be formulated as where x ij is a p-dimensional design matrix of the fixed effect parameters β and b 0i is a random intercept typically used to model repeated count responses. To complete the specification, we used non-informative prior distributions, namely, a normal distribution with mean zero and variance 10,000 for β's and considered various non-informative prior distributions for σ 2 b 0 (see Section 3.2).
The HPN model specified in (1) assumes that all sources of variability in the data can be captured by the random effects and the Poisson variability. A commonly encountered problem related to count data is overdispersion (or underdispersion), which may cause serious flaws in precision estimation and inference (Breslow, 1990) if not appropriately accounted for. A number of extensions of the HPN models have been proposed to account for extra-dispersion in the setting of longitudinal count data (2015; Aregay et al., 2013;Molenberghs et al., 2007). Aregay et al. (2015) extend the HPN model in (1) by considering two separate random effects added to the linear predictor. The mean structure for their model (HPNOD) is given by where b 0i is the random intercept used to account for possible clustering effect as before and the random effect u ij included to accommodate the overdispersion not captured by the normal random effect b 0i . The assumed priors for σ 2 u are discussed in section 3.2. We choose the additive overdispersion model above rather than the multiplicative overdispersion model (Aregay et al., 2013;Molenberghs et al., 2007) to keep parametrization of the model to be the same across softwares. Further, a comparison of the additive model and the multiplicative model has been described by Aregay et al. (2015) and they reported the performance of the two approaches to be the same.

Model formulation for Anopheles mosquitoes count data
Let Y ij represent the number of female Anopheles mosquito counts for household i during month j of the follow-up period, let t ij be the time point (months) at which Y ij has been measured, t ij ¼ 1; . . . ; 6 for all households, and let x i be an indicator variable, which denotes the village type of the ith household, which takes the value of one if the household is located in a resettled (risk) village and zero if the household is located in a control village. We assumed Y ij jβ; b 0i ,Poissonðλ ij Þ; i ¼ 1; . . . ; 40; j ¼ 1; . . . ; 6, and that the pattern of Anopheles mosquito abundance over time is log-linear and possibly different between the control and at-risk villages. Thus, the HPN model has a linear predictor of the form and the mean structure for the HPNOD is given by Breslow and Clayton (1993) analyzed the epilepsy data set presented in Section 2.2 using likelihood-based inference via penalized quasi-likelihood (PQL). Fong et al. (2010) used this data set to illustrate how Bayesian inference may be performed using INLA. We concentrate on the two random-effects models fit by Breslow and Clayton (1993) and Fong et al. (2010). Let Y ij be the number of seizures for patient i at visit j, i ¼ 1; 2; . . . ; 59 and j ¼ 1; 2; . . . ; 4, assumed to be conditionally independent Poisson variables with mean expðλ ij Þ, where the linear predictor for the HPN model is given by

Model formulation for epilepsy data
and the linear predictor for the HPNOD model is given by Here, Baseline i =4 denote the one-fourth of the baseline seizure count for the ith patient, Trt i is an indicator variable for the treatment arm of the ith patient, which takes the value of one if the patient is assigned to the new drug and zero if the patient is assigned to the placebo group, and V 4i is an indicator variable for the fourth visit. To aid convergence when fitting the HPN and HPNOD models (5) and (6), respectively, the covariates logðBaseline i =4Þ, logðAge i Þ, and Trt i � logðBaseline i =4Þ were centred about their mean.

Priors for the variance components of the hierarchical Poisson models
A Bayesian approach is attractive for modelling complex longitudinal count data, but requires the specification of prior distributions for all the random elements of the model. For the hierarchical models in Section 3.1, this involves choosing priors for the regression coefficients and the hyperparameters σ 2 b 0 and σ 2 u of subject and observation-specific random effects, respectively. Two classes of prior distributions, informative and noninformative priors, are used in Bayesian modelling. One can use informative priors when substantial prior information is available, for instance, from previous studies relevant to the current data set. Noninformative prior distributions are intended to allow Bayesian inference for parameters about which little is known beyond the data included in the analysis at hand (Gelman, 2006). In this paper, we focus on the choose of non-informative prior.
The specification of a non-informative prior to express the absence of prior information about the cluster variance and overdispersion is often difficult (Gelman, 2006). Various non-informative prior distributions have been suggested in the Bayesian literature, including an improper uniform density on the scale of standard deviation (Gelman et al., 2003), proper distributions such as the inverse À gammað2; 2Þ with small positive 2 for the variance (Lunn et al., 2012), and conditionally conjugate folded-non-central-t family of prior distributions for the standard deviation (Gelman, 2006). In this paper, we concentrate on the latter two approaches. We considered three specifications based on inverse À gammað2; 2Þ for the variance and half-Cauchy prior (a special case of the conditionally conjugate folded-non-central-t family of prior distributions) for the standard deviation (see Figure 1 in the supplementary material): 1. pðσ À 2 Þ,Γð1; 0:0005Þ-the default choice of the INLA software (Rue et al., 2009); 2. pðσ À 2 Þ,Γð0:001; 0:001Þ-the default choice of the BUGS software (Lunn et al., 2012) and the most popular choice in Bayesian analysis; 3. pðσ À 2 Þ,Γð0:5; 0:0164Þ-a specification proposed by Fong et al. (2010); 4. Half-Cauchy prior with scale 25 on σ-a specification proposed by (Gelman, 2006).

Analysis of the Anopheles mosquitoes count data
In this subsection, we present the analysis of the Anopheles mosquitoes count data introduced in Section 2.1. We considered three estimation routines (namely: Stan, JAGS, and INLA) where all of them are accessed through R software version 3.3.2 (R Core Team, 2016) to fit the HPN and HPNOD models presented in section 3.1.1 and 3.1.2. For the Bayesian inference using runjags (Denwood, 2016) we run 3 chains with 30,000 MCMC iterations per chain from which the first 10,000 iterations are considered burn-in period, while for RStan (Stan Development Team, 2016) the results are based on 4 chains with 4000 MCMC iterations per chain (with the first 2000 the burn-in period). Note that the models can be fitted in Stan using brms (Bürkner, 2017) as well. An elaborate discussion about two packages, RStan and brms, is given in Section 6 of the supplementary appendix. Model selection was made using the Deviance Information Criteria (DIC, Spiegelhalter et al., 2002) for INLA and JAGS and the widely applicable information criteria (WAIC, Vehtari et al., 2017) for Stan.
The posterior means and standard deviations for the parameters of the HPN and HPNOD models estimated using INLA, JAGS, and Stan using the four prior specifications are presented in Table 1. For all priors considered, the DIC and WAIC values of the HPNOD model are slightly higher than those of the HPN model, suggesting the latter to be a preferred model for this dataset.
All prior distributions and all methods of estimations we examined yielded similar results for the regression coefficients. However, the results in Table 1 revealed the influence of the software used to fit the models and the prior specification on the parameter estimates corresponding to the random effect terms. The three estimation methods give similar point estimates for the variance of the random intercept under the Γð0:001; 0:001Þ and Γð0:5; 0:0164Þ priors in both the HPN and HPNOD models. On the other hand, the difference among estimation methods is noticeable under the Γð1; 0:0005Þ and half Cauchy prior, where the posterior mean for σ 2 b 0 obtained from Stan is about 6% larger than INLA under Γð1; 0:0005Þ prior and about 13% larger under the half Cauchy prior. Figure 3 shows the posterior density of the precision of the random intercept (σ À 2 b 0 ) obtained from JAGS, Stan and INLA using the four prior distributions. All estimation methods lead to a similar result under Γð0:001; 0:001Þ and Γð0:5; 0:0164Þ priors, but the posterior density of INLA for σ À 2 b 0 deviates from that of Stan under the Γð1; 0:0005Þ and half Cauchy prior. For INLA, we observed differences in the point estimates of σ 2 b 0 and σ 2 u across priors, where a profound variation across priors is observed in the posterior density of σ 2 u (Figure 4).

Analysis of the epilepsy data
The HPN and HPNOD models formulated in equations (5) and (6) were fitted using the three software and four priors discussed in the previous section. Table 2 presents the posterior means obtained for all fitted models. The DIC (WAIC) values of the HPNOD model obtained for all estimation methods are smaller than those of the HPN model for all prior settings, indicating that the first model is preferred. The three estimation methods lead to virtually identical results for the HPNOD model under priors 2 and 3. However, under the half Cauchy prior (prior 4), the posterior means for the random intercept (σ 2 b 0 ) and overdispersion (σ 2 u ) obtained for Stan and JAGS are slightly higher than the posterior mean obtained for INLA. Figures 5 and6 show the posterior density of the precision of the random intercept (σ À 2 b 0 ) and the precision of the overdispersion parameter (σ À 2 u ) obtained from JAGS, Stan and INLA using the four prior distributions. We notice a small difference between JAGS and INLA as well as Stan and INLA for the two prior specifications (priors 1 and 2). Differences between estimation methods are clearly seen when the half Cauchy prior is used.

Simulation study
A simulation study was conducted to compare the performance of the three Bayesian software and the four prior specifications for σ b 0 and σ u presented in Section 3.2.

Simulation setting
The simulation represents a longitudinal study where the data are Poisson distributed. The steps for the  simulation study are as follows: For i ¼ 1; . . . ; N clusters (subjects), j ¼ 1; . . . ; J observations per cluster observed at equally spaced sampling times with b 0i ,Nð0; σ 2 b 0 Þ; u ij ,Nð0; σ 2 u Þ, and x i is an indicator variable such that x i ¼ 0 for i � N=2 and x i ¼ 1 otherwise. The model formulated in (7) is the HPNOD model discussed in (3.1.1). Low, moderate, and high levels of overdispersion were considered corresponding to σ u ¼ 0:2; 1 and 2, respectively (Aregay et al., 2015). A model without overdispersion (HPN) was formulated by omitting u ij from the mean structure in (7). The true values of the fixed effect parameters for both the HPN and HPNOD models are β ¼ ðβ 00 ; β 01 ; β 10 ; β 11 Þ 0 ¼ ð2; À 2; 0:05; 0:2Þ 0 . To study how the performance of the estimation methods and the role of the prior distribution are affected by the value of the cluster variance, we used four different values for the standard deviation parameter of the random intercept term, i.e σ b 0 ¼ 0:1; 0:5; 1; 1:5. Further, in order to study the effect of small number of observations per cluster, we considered four balanced designs with J ¼ 2; 5 observations per cluster and the number of clusters equal to N ¼ 20 and 40. In total 4 � 4 � 2 � 2 ¼ 64 simulation setting was considered. For each setting 500 datasets were generated. For each dataset, the HPN and HPNOD models were fitted with INLA, JAGS, and Stan using a flat prior distribution Nð0; 1000Þ for the fixed effect parameters and using the four prior distributions listed in Section 3.2 for σ 2 b 0 and σ 2 u . For JAGS and Stan, we ran 3 chains with 20,000 iterations per chain from which the first 10,000 iterations were considered as burn-in period. For each parameter of interest θ, the relative bias was calculated by and the mean squared error (MSE) by where θ i is the parameter estimate for θ obtained for the ith simulation. A simulation study for unbalanced longitudinal data was conducted as well. The simulation setting and result are discussed in detail in Section 5 of the supplementary appendix.  Figure 6. Epilepsy data: posterior density for the precision of the overdispersion parameter (σ À 2 u ) obtained for the HPNOD model. Figure 7 (upper left panel) presents the simulation results obtained for the datasets generated without overdispersion. For all prior specifications, and for all software, differences were observed for σ b 0 ¼ 0:1. For INLA and JAGS, prior 1 led to an underestimation of σ b 0 while prior 2 À 4 led to an over estimation. For Stan, prior 4 led to over estimation, while priors 1-3 led to an underestimation of σ b 0 . Note that this pattern was observed for both HPN and HPNOD, which in this case, misspecified the mean structure (by including the over dispersion parameter). The upper right and the lower panels in Figure 7 present the simulation results obtained for the datasets generated with varying levels of overdispersion (σ u ¼ 0:2, σ u ¼ 1, and σ u ¼ 2). In this case, the HPN model mis-specified the mean structure of the underlying model used to generate the data (by omitting the overdispersion parameter). For datasets generated with low levels of overdispersion (σ u ¼ 0:2), we observe a similar pattern to that of no overdispersion. When the data are generated with a moderate to high level of overdispersion, for the HPN model, all prior specifications and software have a tendency to overestimate the value of σ b 0 with the magnitude of overestimation decreasing as the value of σ b 0 increases. For the HPNOD model (which specified the mean structure correctly), for σ b0 ¼ 0:1, prior 1 consistently led to an under estimation of σ b 0 across all software but with the smallest magnitude compared to the overestimation observed for the other priors.

Estimation of σ b 0
Similar patterns were observed for the second simulation study of unbalanced longitudinal data (see Section 5 of the supplementary material). When the true values for σ b are small, substantial variation is observed among software and prior specifications and this variation vanishes when the true values are relatively large. Figure 8 shows the results for the estimation of σ u across the levels of σ b 0 for the HPNOD model. Differences between the priors can be observed for σ u ¼ 0:2. For INLA and Stan, all priors led to an underestimation of σ u (prior 1 with the highest magnitude). For JAGS, prior 4 led to an overestimation of σ u while the other priors to an underestimation.

Estimation of β's
The relative biases for the regression coefficients for given values of σ b 0 and σ u are shown in Figure 9. For INLA and Stan, all priors lead to similar relative bias for all regression coefficients under all settings. For JAGS, we observed a different pattern for β 10 when the data are generated with a high level of overdispersion (σ u ¼ 2) and the HPNOD model used to fit the data. The relative bias for this parameter increases in a linear fashion with the value of the standard deviation of the random intercept.

Effect of cluster size
A simulation study also investigated the effect of a small number of observations per cluster on the estimation. Two and five observations per cluster were considered. Figure 10, presents the relative bias for all the parameters of the HPNOD model. The results obtained for N ¼ 20, σ b 0 ¼ 1, and σ u ¼ 1 indicate that, as expected, the relative bias decreases as the sample size increases. Note that this pattern was observed for all software and priors considered. The results obtained for the HPN model are similar and presented in Section 3 of the supplementary appendix.

Random intercept and slope model
We extend the simulation setting in Section 5.1 to a random intercept and slope model. For i ¼ 1; . . . ; N clusters (subjects), j ¼ 1; . . . ; J observations per cluster observed at equally spaced sampling times with b 0i ,Nð0; σ 2 b 0 Þ, b 1i ,Nð0; σ 2 b 1 Þ, u ij ,Nð0; σ 2 u Þ, and x i is an indicator variable such that x i ¼ 0 for i � N=2 and low, moderate, and high levels of overdispersion were considered corresponding to σ u ¼ 0:2; 1 and 2, respectively. A model without overdispersion (HPN) was formulated by omitting u ij from the mean structure in (8).
The true values of the fixed effect parameters are β ¼ ðβ 00 ; β 01 ; β 10 ; β 11 Þ 0 ¼ ð2; À 2; 0:05; 0:2Þ 0 . For each setting, we made 100 simulated data sets consisting of 20 subjects with 5 measurements per subject and fit the HPN and HPNOD models with INLA, JAGS, and Stan using the four prior distributions listed in Section 3.2 for σ 2 b 0 , σ 2 b 1 and σ 2 u . For JAGS and Stan, we run 3 chains with 20,000 iterations per chain from which the first 10,000 iterations were considered as burn-in period. Then, we computed the relative bias and mean squared for each parameter (see Section 5.1).

Estimation of σ b 0 and σ b 1
The relative bias for σ b 0 obtained for the datasets generated with varying level of overdispersion (σ 2 u ¼ 0; 0:2; 1; and 2) and different values of σ b 1 (σ b 1 ¼ 0:1 and σ b 1 ¼ 1:5) are presented in Figure 11. We observe substantial variation among priors and software when σ b 0 ¼ 0:1. When σ b 0 ¼ 0:1 and the data generated without overdispersion, Prior 1 (Γð1; 0:0005Þ) leads to underestimation while prior 2-4 leads to overestimation for all software. The relative bias decreases as the true value for σ b 0 increases. We also observed the influence of the level of variation in the random slope (σ b 1 ) on the estimate of σ b 0 especially for JAGS. Figure 12 presents the relative bias for σ b 1 obtained for the datasets generated with varying levels of overdispersion (σ 2 u ¼ 0; 0:2; 1; and 2) and different values of σ b 0 (σ b 0 ¼ 0:1 and σ b 0 ¼ 1:5). As before, variation among priors and software is observed when the true value for σ b 1 is small (σ b 1 ¼ 0:1) and the true value for σ b 0 is large (σ b 1 ¼ 1:5). Overall, INLA performs better than JAGS and Stan under all scenarios. Figure 13 presents the relative bias for σ u obtained for the datasets generated with varying levels of overdispersion. The variation among priors decreases as the true value of overdispersion (σ u ) increases. However, a substantial difference is observed among software. INLA performs better than JAGS and Stan under all scenarios. JAGS and Stan consistently underestimate and Stan overestimate σ u when σ u ¼ 0:2 and underestimate σ u when σ u ¼ 1 or σ u ¼ 2.

Concluding remarks
In this paper, we performed a Monte Carlo simulation study in order to simultaneously evaluate the performance of the Bayesian estimation methods and prior specifications for variance components in the context of longitudinal count data. We compared the results obtained with INLA to the results obtained with JAGS which uses Gibbs sampling and Stan which uses Hamiltonian Monte Carlo while assuming a variety of prior specifications for variance components. We analysed the influence of different factors such as small number of observations per cluster, different values of the random effect variance and estimation from a misspecified Red lines for prior 1 (Γð1; 0:0005Þ), green for prior 2 (Γð0:001; 0:001Þ), light blue for prior 3 (Γð0:5; 0:0164Þ) and purple for prior 4 (half-Cauchyð0; 25Þ). N ¼ 20 and J ¼ 5.
model on the bias and mean squared errors of the parameter estimates.
A simulation study has shown that the approximation strategy employed by INLA is accurate in general and all software leads to similar results for most of the cases considered. Estimation of the variance components, however, is difficult when their true value is small for all estimation methods and prior specifications. The estimates obtained for all software tend to be biased downward or upward depending on the prior. The results of the simulation study also show that there is an effect of cluster size. For all software and prior specifications, the relative bias for all parameters decrease as cluster size increases. For the random intercept and slope model, INLA performs better than JAGS and Stan under all scenarios. In our simulation study of the random intercept and slope model, we only considered independent prior for the random intercept and slope. Future research should focus on the comparison of different software assuming general priors for variance-covariance.

Disclosure statement
No potential conflict of interest was reported by the authors.

Funding
The authors received no direct funding for this research.