Bayesian Estimation of Single-Test Reliability Coefficients

Abstract Popular measures of reliability for a single test administration include coefficient α, coefficient λ2, the greatest lower bound (glb), and coefficient ω. First, we show how these measures can be easily estimated within a Bayesian framework. Specifically, the posterior distribution for these measures can be obtained through Gibbs sampling – for coefficients α, λ2, and the glb one can sample the covariance matrix from an inverse Wishart distribution; for coefficient ω one samples the conditional posterior distributions from a single-factor CFA model. Simulations show that – under relatively uninformative priors – the 95% Bayesian credible intervals are highly similar to the 95% frequentist bootstrap confidence intervals. In addition, the posterior distribution can be used to address practically relevant questions, such as “what is the probability that the reliability of this test is between .70 and .90?”, or, “how likely is it that the reliability of this test is higher than .80?” In general, the use of a posterior distribution highlights the inherent uncertainty with respect to the estimation of reliability measures.

Reliability analysis aims to disentangle the amount of variance of a test score that is due to systematic influences (i.e., true-score variance) from the variance that is due to random influences (i.e., error-score variance; Lord & Novick, 1968). The most straightforward way to quantify the proportion of true-score variance is by correlating two administrations of the same test to the same group of people under exactly the same conditions. By definition, this correlation equals the test-score reliability. Thus, reliability provides an idea as to what happens if a test is readministered. That is, what results would be obtained if one could replicate the measurement procedure in the same group as if nothing had changed compared to the first measurement, except for changes in random error.
Reliability is a general concept that is relevant across a range of different designs, such as test-retest, inter-rater, and single-test designs. In practice, researchers often have data from only a single test administration rather than multiple administrations, and thus have to estimate reliability from the data collected by means of a single test version.
A reliability analysis serves three purposes. First, researchers wish to confirm that their measures are reliable. To do so, practitioners usually choose an estimator, commonly coefficient α (Cronbach, 1951), obtain a point estimate, and compare it to a cutoff value to determine bad, sufficient, or good reliability (Oosterwijk et al., 2019). Second, reliability, by itself, does not inform us about the precision with which an individual is measured. For that, one needs the standard error of the unobservable individual's distribution of test-score replications. In practice, one uses the standard error of measurement (sem) for this purpose. The sem indicates the standard deviation of the measurement error in the group of interest and is often interpreted as an estimate of measurement precision (Mellenbergh, 1996). The sem is calculated as sem = SD(X)√(1 − ρ), where SD(X) is the standard deviation of the test score X and ρ is the reliability of the test score. Third, when assessing the correlation between two variables that are measured with error, researchers may estimate the correlation between the true scores by correcting for the unreliability of the test scores. This so-called correction for attenuation (Spearman, 1904) is common in validity research and is often used in meta-analyses (Schmidt & Hunter, 1999).
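The sem formula and Spearman's correction for attenuation are simple to apply in practice. A minimal sketch in Python (the function names are ours, not from the paper):

```python
import math

def sem(sd_x, reliability):
    """Standard error of measurement: sem = SD(X) * sqrt(1 - rho)."""
    return sd_x * math.sqrt(1.0 - reliability)

def disattenuate(r_xy, rel_x, rel_y):
    """Spearman's (1904) correction for attenuation:
    r_true = r_observed / sqrt(rel_x * rel_y)."""
    return r_xy / math.sqrt(rel_x * rel_y)

# An IQ-style scale with SD = 15 and reliability .91 yields sem = 4.5,
# so an observed score of 100 is accompanied by substantial uncertainty.
print(sem(15.0, 0.91))
```

Note how both quantities depend directly on the reliability estimate, so any uncertainty in the reliability propagates into them.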
When one estimates a parameter, the point estimate can be accompanied by an uncertainty interval. Both a Bayesian credible interval (or credible interval for short) and a frequentist confidence interval (or confidence interval for short) are representatives of an uncertainty interval, but stem from competing statistical frameworks (see Appendix A for a comprehensive description of the interval types used in this study). Unfortunately, substantive researchers almost always ignore these intervals and present only point estimates in the context of reliability analysis (e.g., Oosterwijk et al., 2019). This common practice disregards sampling error and the associated estimation uncertainty and should be seen as highly problematic. In this study, we show how the Bayesian credible interval can provide researchers with a flexible and straightforward method to quantify the uncertainty of point estimates in a reliability analysis.
We wish to demonstrate the benefits of conducting a reliability analysis in the Bayesian statistical framework. We show that the Bayesian reliability coefficients (a) perform as well as their frequentist counterparts, and (b) provide an intuitive interpretation of uncertainty, allowing researchers to address questions that are beyond the scope of a frequentist analysis. We employ Bayesian estimation procedures for some of the most popular single-test reliability coefficients: coefficient α (Cronbach, 1951), coefficient λ2 (Guttman, 1945), the greatest lower bound (glb; Woodhouse & Jackson, 1977), and coefficient ω (McDonald, 2013). Coefficient α provides a lower bound to the reliability (Cronbach, 1951) and is by far the most frequently used reliability coefficient (Barry et al., 2014; Flake et al., 2017; Hogan et al., 2000). Coefficient λ2 is another lower bound and is at least as high as α (Guttman, 1945); the glb is at least as high as coefficient λ2 and comes closest to the reliability among the lower bounds (Sijtsma, 2009). Finally, we consider coefficient ω as the most prominent representative of the factor-analytic approach to reliability (Revelle & Zinbarg, 2009). To obtain a parsimonious and clear study design, we leave out other reliability estimators such as Revelle's beta (Revelle, 1979) and Guttman's other λ-coefficients (Guttman, 1945).[1] For a Bayesian solution for coefficients α, λ2, and the glb we use and extend the methodology that Padilla and Zhang (2011) described for the Bayesian estimation of coefficient α; we develop the first Bayesian version of coefficient ω. To promote their use, we implement the coefficients in an easy-to-use R-package.[2] The outline of this paper is as follows. First, we discuss the purpose of Bayesian reliability estimation and how the present approach relates to previous work. Second, we provide an overview of the current state of uncertainty estimation in reliability analysis.
Third, we discuss reliability estimation in general and more specifically with respect to the different measurement models relevant to the different reliability coefficients. This results in the implementation of the Bayesian single-test reliability coefficients. Fourth, we apply the Bayesian coefficients both in a simulation study and to an example data set. Finally, we discuss our findings and their implications for future work on the topic of Bayesian reliability estimation.

Bayesian theory
In Bayesian inference, estimation uncertainty is quantified by a posterior distribution that represents the relative plausibility of different parameter values after the data have been observed. In order to obtain the posterior distribution we must first choose a prior distribution. The prior distribution represents the relative plausibility of different parameter values before the data have been observed. The prior distribution is then updated by means of the likelihood to yield the posterior distribution. Based on the posterior distribution we can report credible intervals that indicate the precision with which the parameter has been estimated, we can make statements about the probability that the parameter has a value larger or smaller than some threshold, and we can obtain a starting point for the analysis of additional data.

Previous work
Previous efforts to develop Bayesian reliability estimates focused on coefficient α. Li and Woodruff (2002) detailed a procedure to obtain the posterior distribution for α directly by using the distribution of the maximum likelihood estimator of the coefficient (Van Zyl et al., 2000). A similar approach was taken by Najafabadi and Najafabadi (2016), who employed an exact distribution function of coefficient α (Kistner & Muller, 2004) to estimate the posterior distribution of the coefficient.
In contrast, Padilla and Zhang (2011) used the posterior distribution of the covariance matrix of multivariate observations to calculate a Bayesian coefficient α, including credible intervals. Padilla and Zhang noticed that coefficient α is fully determined by the covariance matrix, and that the posterior distribution of the covariance matrix takes a convenient form when one assumes the data to be multivariate normal. They validated the method in a small simulation study by measuring the bias of the Bayesian point estimate and the coverage of the credible intervals.

[1] Inclusion of coefficient λ4, the maximum split-half reliability, was considered but dismissed due to its strong positive bias and its uneconomic computational effort when the number of items exceeds ten.

[2] The R-package Bayesrel can be installed from CRAN, or the latest version can be downloaded from https://github.com/juliuspf/Bayesrel.

Novel work
In this study we use and extend the approach of Padilla and Zhang (2011). Their procedure of sampling the posterior covariance matrix can be extended to estimate a series of well-known coefficients in addition to coefficient α. Specifically, we use Padilla and Zhang's approach to obtain Bayesian estimates of coefficient λ2 and the glb.
Moreover, to the best of our knowledge, our development of a Bayesian coefficient ω is novel. The setup of ω is more complex than sampling from the multivariate normal, since one first needs to fit a factor model and then calculate ω from the model parameters. Fortunately, Lee (2007) detailed a sampling scheme to estimate a Bayesian single-factor model. Our work is the first to employ Lee's procedure to obtain a Bayesian estimate of coefficient ω, which is currently a popular reliability measure.
With this work we want to achieve two things. First, we wish to demonstrate the adequacy of Bayesian single-test reliability estimates by comparing them to their frequentist counterparts in a simulation study, and by illustrating their benefits with an example real-data set. Second, a long-standing problem in methodological research is the gap between theory and practice (Sharpe, 2013). We attempt to bridge this gap for Bayesian reliability analysis by introducing an R-package that contains the proposed methodology. This gives researchers all they need to conduct a Bayesian reliability analysis with multiple estimators and thus make more informed and intuitive inferences about reliability.

Current state of uncertainty estimation in reliability analysis
Researchers usually report a point estimate as an indicator of the quality of a research instrument but rarely report uncertainty intervals (Flake et al., 2017; Moshagen et al., 2019; Oosterwijk et al., 2019).[3] The almost complete absence of uncertainty interval reporting is surprising, especially in light of multiple calls to improve the quality of psychological research by making increased use of confidence intervals (e.g., American Psychological Association, 2010; Association for Psychological Science, 2018; Cumming, 2014; Task Force on Reporting of Research Methods in AERA Publications, 2006; Thompson, 2002; Wilkinson, 1999).
The practice of reporting only reliability point estimates can hardly result from a lack of available methods. For example, bootstrapping is a mathematically simple resampling procedure that can be used to estimate a confidence interval when the distributional properties of an estimator are largely unknown (e.g., DiCiccio & Efron, 1996; Efron, 1979). The so-called bootstrap confidence interval offers a well-known approach to uncertainty estimation in reliability analysis. Several studies have examined bootstrap confidence intervals for coefficient α and coefficient ω and found that the intervals performed satisfactorily under general conditions (Kelley & Pornprasertmanit, 2016; Padilla et al., 2012; Padilla & Divers, 2016; Raykov & Shrout, 2002). In addition to bootstrapping, analytic approaches to constructing confidence intervals for coefficients α and ω have been developed and are accessible in standard software packages (for α: Bonett & Wright, 2015; Feldt et al., 1987; Revelle, 2019; for ω: Kelley, 2018; Kelley & Cheng, 2012; Padilla & Divers, 2016; Raykov, 2002). Indeed, an analytic confidence interval for coefficient α is available in SPSS (v25), although disguised as a confidence interval for an intraclass correlation coefficient that equals coefficient α under certain conditions. The common use of cutoff values might be one explanation why methods for estimating confidence intervals are not commonly employed in reliability analysis. Cutoff values such as .70 or .80 are frequently applied to determine "sufficient" or "good" reliability (Nunnally & Bernstein, 1994; Schmitt, 1996). Adding a confidence interval clashes with the idea of a simple cutoff value. For example, suppose the reliability point estimate of a single test is .73, the cutoff value for comparison is .70, and the confidence interval around the estimate is [.67, .79]; by evaluating the point estimate alone, the cutoff is exceeded and a clear conclusion is reached.
By evaluating the confidence interval, however, one cannot say with "complete" certainty that the reliability is greater than .70.
Moreover, the interpretation of confidence intervals is subject to a common and troubling misconception. Many practitioners assume that a specific confidence interval contains the true parameter value with X% probability, or that one can be X% certain that the interval contains the true parameter value. However, a confidence interval is an interval generated by a procedure that in repeated sampling has at least an X% probability of containing the true value, for all possible values of that parameter (e.g., Morey et al., 2016; Neyman, 1937). Accordingly, it is incorrect to infer that the population value of a reliability parameter from a single test administration lies within the confidence interval limits with a probability of X%. Thus, confidence intervals do not answer practically relevant questions about the confidence one can have in a reliability estimate. Credible intervals offer a solution to this predicament: an X% credible interval encloses the parameter of interest with X% probability.

[3] Personal communications with the authors of the first two references.
In sum, researchers almost never report interval estimates for their reliability coefficients. In addition, the questions they may have (e.g., "what is the probability that α > .80?", "what is the probability that α ∈ [.70, .90]?", "what is the relative support for different values of α provided by the data?") fall outside the purview of frequentist inference. To remedy these issues, we adopted and developed Bayesian estimation procedures for the most common single-test reliability coefficients.
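Once posterior draws of a reliability coefficient are available (however they were obtained), such questions reduce to simply counting draws. A minimal sketch (the function name and cutoffs are illustrative):

```python
import numpy as np

def posterior_probs(samples, cutoff=0.80, lo=0.70, hi=0.90):
    """Estimate P(reliability > cutoff) and P(lo <= reliability <= hi)
    from an array of posterior draws of a reliability coefficient."""
    s = np.asarray(samples)
    p_above = float(np.mean(s > cutoff))
    p_within = float(np.mean((s >= lo) & (s <= hi)))
    return p_above, p_within
```

With the draws [.65, .75, .85, .95], for example, both probabilities equal .5; with a real posterior sample the answers directly address the questions listed above.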

Reliability estimation
According to classical test theory (CTT) a test score consists of a true score (systematic influences) and an error score (random influences), which are assumed to be uncorrelated (Spearman, 1904). In CTT, an individual's true score is defined as the expected value of a distribution of test scores for that person, known as the propensity distribution. The test scores are obtained in a hypothetical experiment in which the same test is administered to the person under the same test administration conditions an infinite number of times. The crucial assumption is that the person remains the same from administration to administration. Consequently, the only source of variation in a person's test performance is random error. Whereas the true score is defined for an individual, the true-score variance as well as the reliability are group characteristics. Reliability is traditionally defined as the product-moment correlation between two parallel measures (Lord & Novick, 1968). Two test scores X and X′ are parallel if: (1) the true scores T and T′ are equal for the i-th individual, that is, T_i = T′_i; and (2) σ²_X = σ²_X′ in the group of interest. Two test administrations are assumed parallel when the same test instrument is administered to the same sample of participants at two times while, hypothetically, the participants have no recollection of the previous administration. Then, the product-moment correlation between the two parallel test administrations equals the reliability, since both the true scores for individuals, and thus the true-score variances, as well as the test-score variances for the group are equal across administrations. It can easily be shown that for either of the parallel test administrations the proportion of test-score variance that is true-score variance equals the correlation between the test scores, which is the reliability (e.g., Lord & Novick, 1968):

ρ_XX′ = σ²_T / σ²_X = 1 − σ²_E / σ²_X,

with σ²_E denoting the error-score variance.
Since strictly parallel tests and parallel test administrations are unavailable in practice, and repeated test administrations often require substantial resources, a data matrix from a single test administration is commonly used to approximate the true-score variance (or complementary the error-score variance). Thus, reliability is calculated not as the product-moment correlation between scores across parallel test repetitions, but as a function of item covariances based on a single test administration. This makes it virtually impossible to accurately retrieve the true reliability, except when the items satisfy unrealistic conditions. Consequently, the following single-test reliability coefficients are only approximations to the true reliability.

CTT-coefficients
Coefficient α, coefficient λ2, and the glb are based on CTT and are lower bounds to the reliability (e.g., Guttman, 1945; Ten Berge & Zegers, 1978). To determine the error-score variance of a test, the coefficients estimate an upper bound for the error variances of the items. The estimators differ in the way they estimate this upper bound. The basis for the estimation is the covariance matrix Σ of the multivariate observations. The CTT-coefficients extract information about error-score variance (or, complementarily, true-score variance) from Σ̂, which denotes the sample estimate of Σ, and then weigh the estimated error-score variance against the total variance and subtract the result from 1 (or, complementarily, weigh the estimated true-score variance against the total variance). The CTT-coefficients estimate error-score variance from the variances of the items and true-score variance from the covariances of the items. The following paragraphs elaborate on this.

Coefficient α
Let Σ̂ be the covariance matrix estimated from a data set consisting of k items; then

α = k / (k − 1) × (1 − tr(Σ̂) / S),

where tr(·) is the sum of the diagonal elements and S is obtained by summing all elements in Σ̂. Cronbach (1951) left his mark on coefficient α by discussing the measure in his famous article and deriving some well-known and often repeated properties for it. Coefficient α equals the reliability when the items are essentially tau-equivalent (Novick & Lewis, 1967).[4] When essential tau-equivalence does not hold, coefficient α is smaller than the true reliability, hence the lower-bound property (Sijtsma, 2009; Zinbarg et al., 2005). Generally, coefficient α is closer to the reliability when a test is closer to unidimensionality (Dunn et al., 2014; Sijtsma, 2009).
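The formula translates directly into code. A minimal sketch (the function name is ours):

```python
import numpy as np

def coefficient_alpha(cov):
    """alpha = k/(k-1) * (1 - tr(cov)/S), with S the sum of all
    elements of the covariance matrix."""
    cov = np.asarray(cov, dtype=float)
    k = cov.shape[0]
    return k / (k - 1) * (1.0 - np.trace(cov) / cov.sum())

# Three items with unit variances and covariances of .5 give alpha = .75.
print(coefficient_alpha([[1.0, 0.5, 0.5],
                         [0.5, 1.0, 0.5],
                         [0.5, 0.5, 1.0]]))
```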

Coefficient λ2
Guttman proposed six lower bounds to the reliability, one of which, λ3, equals coefficient α. These six coefficients take different approaches to the approximation of the error-score variance and share the property that the reliability is never smaller than the largest of the six bounds (Guttman, 1945). The second lower bound is calculated as follows: let C₂ denote the sum of the squared off-diagonal elements of Σ̂ and let S be defined as before; then

λ2 = (S − tr(Σ̂) + √( k/(k − 1) × C₂ )) / S.

Coefficient λ2 is always at least as large as coefficient α (Guttman, 1945; Sijtsma, 2009) and has performed better than Guttman's other lower bounds.
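Coefficient λ2 is equally easy to compute from the sample covariance matrix. A sketch under the same conventions as above (function name is ours):

```python
import numpy as np

def guttman_lambda2(cov):
    """lambda2 = (S - tr(cov) + sqrt(k/(k-1) * C2)) / S, where C2 is the
    sum of squared off-diagonal elements of the covariance matrix."""
    cov = np.asarray(cov, dtype=float)
    k = cov.shape[0]
    s = cov.sum()
    c2 = np.sum((cov - np.diag(np.diag(cov))) ** 2)
    return (s - np.trace(cov) + np.sqrt(k / (k - 1) * c2)) / s
```

For a matrix with equal covariances λ2 coincides with α; with unequal covariances λ2 is slightly larger, consistent with its lower-bound ordering.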

Greatest lower bound
As its name implies, the glb is always at least as large as the other lower bounds. It follows directly from the definition of the glb that the reliability lies in the interval [glb, 1]. The glb is calculated as follows: let Σ̂ = C_T + C_E be the split of the sample covariance matrix into a matrix C_T that contains the true-score variances and a diagonal matrix C_E that contains the error-score variances. The proper estimates of C_T and C_E are found by maximizing the trace of C_E under the sole condition that C_T and C_E are positive semidefinite (e.g., Jackson & Agunwamba, 1977). Again, let S be the sum of all elements in Σ̂; then

glb = 1 − tr(C_E) / S.

Finding the matrix C_E with maximum trace is not trivial. Various iterative matrix-decomposition algorithms attempt to solve this so-called "educational testing problem", all of which require a non-negligible amount of computational power (Bentler & Woodward, 1980; Ten Berge et al., 1981; Woodhouse & Jackson, 1977). Whereas among the lower bounds the glb comes closest to the true reliability, its sample estimate is prone to capitalization on chance (Ten Berge & Sočan, 2004). This leads to a considerable positive bias of the estimate, even up to sample sizes of n = 1,000, with the bias being smaller for fewer than ten items (Ten Berge & Sočan, 2004).

Factor model coefficient ω
Whereas coefficients α, λ2, and the glb are rooted in CTT, coefficient ω is based on the single-factor model (McDonald, 2013). Specifically, the single-factor model assumes that one factor explains the covariances between the items (Spearman, 1904). The single-factor model corresponds to the congeneric measurement model, where all items load on one factor with varying loadings, as opposed to the tau-equivalent model, where all loadings are assumed to be equal (e.g., Jöreskog, 1971).
The single-factor model can be considered a special case of a CTT-model when one assumes that the common factor is a valid replacement for the true score. In that case, the common-factor variance replaces the true-score variance and the residual variances replace the error-score variance. Note that without this strong assumption coefficient ω could not be considered a reliability estimate. However, if the assumption holds, coefficient ω is a reliability measure for item sets. Please note that coefficient ω is not a measure of the fit of the single-factor model; it merely expresses reliability assuming unidimensionality. Let

X_ij = λ_j f_i + E_ij,

where X_ij is the i-th examinee's score on item j, λ_j is the loading of item j on the common factor, f_i is the examinee's factor score, and E_ij is the i-th examinee's error on item j that cannot be explained by the common factor. Further, let Ψ be the diagonal variance matrix of E with diagonal elements ψ_j representing the residual variances of the items; then

ω = (Σ_j λ_j)² / [ (Σ_j λ_j)² + Σ_j ψ_j ].

This equation coincides with Equation (6.20b) from McDonald (2013). Coefficient ω equals coefficient α (and the other lower-bound coefficients) when the condition of tau-equivalence holds (Zinbarg et al., 2005). Coefficient ω represents a well-studied alternative to coefficient α that shows good statistical properties (Kelley & Cheng, 2012; Zinbarg et al., 2005). To obtain the factor loadings and residual variances of the single-factor model one can apply a variety of estimation methods from the family of structural equation modeling (SEM). The most common are exploratory factor analysis (EFA), principal component analysis (PCA), principal factor analysis (PFA), and confirmatory factor analysis (CFA) (Revelle & Zinbarg, 2009; Zinbarg et al., 2006).
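Given estimated loadings and residual variances, computing ω is a one-liner. A sketch assuming a unit-variance common factor (the function name is ours):

```python
import numpy as np

def mcdonald_omega(loadings, resid_vars):
    """omega = (sum of loadings)^2 / ((sum of loadings)^2 + sum of
    residual variances), assuming factor variance 1."""
    num = np.sum(loadings) ** 2
    return num / (num + np.sum(resid_vars))
```

For instance, six items that all load .7 with residual variances .51 give ω = 17.64 / 20.7 ≈ .85.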
In the R-package, we implement both CFA and PFA to obtain coefficient ω. CFA models attempt to minimize the discrepancy between the model-implied covariance matrix and the sample covariance matrix using, for example, maximum likelihood or generalized least squares (Bollen, 1989). In a PFA, which is highly similar to the principal factor method or the principal axis method, the factor loadings are calculated from an altered sample covariance matrix by means of an eigendecomposition. The altered covariance matrix equals the sample covariance matrix with respect to the off-diagonal elements, but differs with respect to the diagonal, which contains the item communalities. The communalities are found in an iterative procedure that starts with the squared multiple correlations of the items (Rencher, 2002, p. 421 ff.).
Note that coefficient ω is sometimes split into ω_h (h for hierarchical) and ω_t (t for total) (Revelle & Zinbarg, 2009). In the calculation of these two coefficients a hierarchical multi-factor model replaces the single-factor model, and additional factors account for group-specific variance. Consequently, coefficient ω, as implemented in this work, can only be interpreted as a measure of reliability when the test is unidimensional and the single-factor model fully explains the true-score variance. The value of coefficient ω cannot address the question of dimensionality; to determine dimensionality one needs to apply more sophisticated factor-analytic methods.
To summarize, the covariance matrix is sufficient for the CTT-coefficients, and the single-factor model parameters determine the value of coefficient ω. Thus, the key task in Bayesian single-test reliability analysis becomes the estimation of a posterior distribution for the covariance matrix and of the posterior distributions of the factor model parameters. Note that the formulas for the calculation of the coefficients remain unchanged in the Bayesian paradigm.

Bayesian single-test reliability estimation
Bayesian CTT-coefficients

We employ an approach to Bayesian reliability estimation that was described by Padilla and Zhang (2011), who provided a Bayesian version of coefficient α. Their approach forms the basis of the procedure followed in this article and can be generalized to a series of different estimators.
Coefficients α, λ2, and the glb are calculated on the basis of the sample covariance matrix. Thus, a straightforward way to obtain a posterior distribution of a CTT-coefficient is to estimate the posterior distribution of the covariance matrix and use it to calculate the coefficient. In this procedure, we follow the methods detailed in Murphy (2007), who presents a simple and straightforward way to obtain a posterior distribution of the covariance matrix by choosing a conjugate prior distribution.
To facilitate the Bayesian estimation of the reliability coefficients, we assume the data to be normally distributed with means μ and covariance matrix Σ for items j = 1, …, k. Thus, the normal-inverse-Wishart distribution (NW⁻¹) is an obvious choice for a conjugate prior distribution:

(μ, Σ) ~ NW⁻¹(μ₀, κ₀, W, ν₀),

where μ₀ denotes the prior means, κ₀ is an inverse scaling parameter for the covariance matrix Σ, W = (κ₀⁻¹ W₀)⁻¹ is a positive-definite inverse scale matrix, and ν₀ are its degrees of freedom. The number of observations is n and the number of items is k. The prior hyperparameters are chosen as μ₀ = 0, κ₀ = 10⁻¹⁰, W₀ = I_k, and ν₀ = k. This yields a relatively uninformative prior. The posterior distribution of the multivariate normal parameters is available in closed form:

(μ, Σ) | X ~ NW⁻¹(μ_n, κ_n, W_n, ν_n).

Since we are only interested in the covariance matrix, which is the basis for the CTT-coefficients, we sample the posterior covariance matrices from an inverse Wishart distribution:

Σ | X ~ W⁻¹(W_n, ν₀ + n), with W_n = W + T + (κ₀ n)/(κ₀ + n) (x̄ − μ₀)(x̄ − μ₀)ᵀ,

where T = Σᵢ (xᵢ − x̄)(xᵢ − x̄)ᵀ is the sum-of-squares matrix, xᵢ are the item scores of the i-th examinee, and x̄ are the item means. The posterior distribution is obtained by drawing a sufficient number of random samples from this inverse Wishart distribution. By calculating any of the CTT-coefficients for each of the covariance matrices in the posterior sample, one obtains posterior samples for the reliability coefficients.
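Under these diffuse hyperparameters (κ₀ ≈ 0, W₀ = I_k, ν₀ = k) the mean-correction term essentially vanishes and the posterior scale matrix reduces to I_k plus the sum-of-squares matrix. A sketch of the sampling scheme for coefficient α using SciPy (the function name is ours, not the package's):

```python
import numpy as np
from scipy.stats import invwishart

def posterior_alpha(x, n_samples=1000, seed=0):
    """Draw covariance matrices from the inverse-Wishart posterior under
    the diffuse prior described above, and compute alpha for each draw."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    xc = x - x.mean(axis=0)
    t = xc.T @ xc                  # sum-of-squares matrix T
    w_n = np.eye(k) + t            # kappa0 -> 0 removes the mean term
    nu_n = k + n                   # nu0 = k, plus n observations
    sigmas = invwishart.rvs(df=nu_n, scale=w_n, size=n_samples,
                            random_state=seed)
    return np.array([k / (k - 1) * (1.0 - np.trace(s) / s.sum())
                     for s in sigmas])
```

A 95% credible interval then follows from, e.g., `np.quantile(draws, [.025, .975])`, and posterior probabilities such as P(α > .80) from simple counting.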

Bayesian factor model coefficient x
Since coefficient ω is based on a factor model, its setup is more challenging. To obtain the posterior distributions of the parameters in the single-factor model, we need to sample from their conditional distributions. A comprehensive account of how to do so is given by Lee (2007), who describes how to obtain the posterior distributions of the quantities in the single-factor model. We designed a Gibbs sampling algorithm to sample from the conjugate conditional distributions of the parameters in accordance with Lee (2007).
Recall the single-factor model with ψ_j as the diagonal elements of Ψ, the diagonal variance matrix of E. We assume the following relatively uninformative conjugate prior distributions, similar to Lee (2007, p. 71 ff.), for i = 1, …, n observations and j = 1, …, k items: f_i ~ N(0, φ); φ ~ W⁻¹(R₀, p₀) with scale matrix R₀ (in fact a scalar, as we assume only one factor) and degrees of freedom p₀; ψ_j⁻¹ ~ Γ(a₀, b₀) with shape parameter a₀ and rate parameter b₀, where ψ_j = 1/ψ_j⁻¹; and λ_j ~ N(0, ψ_j h₀). The prior hyperparameters are fixed at relatively uninformative values. In SEM, the latent factor needs to be assigned a metric, which is commonly done by fixing either the loading of one indicator to 1 or the factor variance to 1. Since we are interested in estimating all factor loadings, we employ the latter procedure by dividing the sampled factor scores by their standard deviation in every iteration. Subsequently, we multiply the means of the posterior loadings by the square root of the unstandardized factor variance (Bollen, 1989). The iterative Gibbs sampling algorithm then cycles through the conditional distributions: in each iteration it draws, in turn, the factor scores, the factor variance, the residual variances, and the loadings, each conditional on the current values of all other parameters.
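The conditional draws described above can be sketched compactly. The following is an illustrative Python sketch, not the package's implementation: hyperparameter values are our own choices, and instead of standardizing the factor scores we exploit that ω depends only on the products λ_j√φ, so it is invariant to the factor's metric:

```python
import numpy as np

def gibbs_omega(x, n_iter=600, burn=100, a0=2.0, b0=1.0, h0=10.0,
                ap=2.0, bp=1.0, seed=0):
    """Gibbs sampler for the single-factor model, returning posterior
    draws of coefficient omega."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    x = x - x.mean(axis=0)          # center items; no intercepts modeled
    n, k = x.shape
    lam = np.ones(k)                # loadings
    psi = np.ones(k)                # residual variances
    phi = 1.0                       # factor variance
    omegas = []
    for it in range(n_iter):
        # factor scores f_i | rest: normal with common variance
        var_f = 1.0 / (1.0 / phi + np.sum(lam ** 2 / psi))
        mean_f = var_f * (x @ (lam / psi))
        f = mean_f + rng.normal(scale=np.sqrt(var_f), size=n)
        # factor variance phi | f: inverse-gamma conditional
        phi = 1.0 / rng.gamma(ap + n / 2.0,
                              1.0 / (bp + 0.5 * np.sum(f ** 2)))
        # per item: conjugate normal-inverse-gamma regression update
        ff = np.sum(f ** 2)
        for j in range(k):
            v = 1.0 / (ff + 1.0 / h0)
            m = v * np.sum(f * x[:, j])
            rate = b0 + 0.5 * (np.sum(x[:, j] ** 2) - m ** 2 / v)
            psi[j] = 1.0 / rng.gamma(a0 + n / 2.0, 1.0 / rate)
            lam[j] = rng.normal(m, np.sqrt(psi[j] * v))
        if it >= burn:
            s2 = np.sum(lam) ** 2 * phi
            omegas.append(s2 / (s2 + np.sum(psi)))
    return np.array(omegas)
```

Because (Σλ)²φ is invariant to sign flips and rescalings of the latent factor, the retained ω draws are identified even though the individual loadings are not.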

Simulation study
We conducted a simulation study to evaluate the Bayesian single-test reliability coefficients across a variety of conditions and data sets. In this study, we compared the Bayesian coefficients with their frequentist counterparts and their population value.

Method
We constructed data-generating covariance matrices for which we varied the average correlation between items (ρ̄ = 0, .3, .7) and the number of items (k = 5, 20). The covariance matrices were based on the parameters of the single-factor model. Specifically, loadings and residual variances were combined to create the matrices by means of the equation Σ = λΦλ′ + Ψ, with loadings λ, factor variance Φ = 1, and diagonal residual variance matrix Ψ. During this process the loadings and residual variances were sampled randomly until the correlation between items reached the desired average value. Consequently, the correlations between items varied in the medium- and high-correlation conditions. In the high-correlation condition, the standard deviations of the average correlations ranged from .04 to .07; in the medium-correlation condition, they ranged from .17 to .21. In the zero-correlation condition the correlations varied slightly but were all very close to zero (ranging from .0002 to .0005). The resulting covariance matrices were used to generate multivariate normal data with means of zero in different sample sizes (n = 50, 100, 500). We chose the zero-correlation condition to represent a state of very noisy data. The condition can be viewed as a baseline that is unlikely to appear in empirical research. However, by pushing the coefficients to their limits, we could examine how estimates behave for a wide variety of data sets. Also, in an extreme scenario like the zero-correlation condition, differences between the Bayesian and frequentist coefficients could emerge that we might otherwise not be able to detect.
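A sketch of how such a data-generating matrix can be constructed (a simplified stand-in for the resampling scheme described above; we draw loadings uniformly once rather than iterating to a target average correlation):

```python
import numpy as np

def make_sigma(k, lam_low, lam_high, seed=0):
    """Build Sigma = lam Phi lam' + Psi with Phi = 1 and unit item
    variances, so the off-diagonal entries are the item correlations."""
    rng = np.random.default_rng(seed)
    lam = rng.uniform(lam_low, lam_high, size=k)
    psi = 1.0 - lam ** 2            # residual variances for unit variances
    return np.outer(lam, lam) + np.diag(psi)

# Generate multivariate normal data with means of zero from the matrix.
sigma = make_sigma(5, 0.5, 0.8)
data = np.random.default_rng(1).multivariate_normal(np.zeros(5), sigma,
                                                    size=100)
```

Fixing the item variances to 1 makes the average off-diagonal element equal the average inter-item correlation, which is convenient for targeting the ρ̄ conditions.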
The setup resulted in a total of 18 conditions, with each condition replicated 1,000 times. The population values of the CTT coefficients were calculated from the data-generating covariance matrices. The population value of coefficient ω was computed from the factor loadings and residual variances that were used to construct the data-generating matrices. For the CTT coefficients the data generation from a single-factor model can be considered proper, although one does not need to assume the data to be unidimensional to compute the coefficients.
For all analyses we used the R-package Bayesrel that we developed to facilitate the calculations and to make the Bayesian estimators accessible to other researchers. A detailed description of the package can be found in Appendix B. Although analytic solutions of confidence intervals for coefficients α and ω are available (e.g., Bonett & Wright, 2015; Kelley & Cheng, 2012), we used the non-parametric bootstrap (DiCiccio & Efron, 1996; Padilla et al., 2012; Padilla & Divers, 2016) to estimate percentile-type confidence intervals for reasons of consistency and comparability (see Table A1). To compute the confidence intervals, the number of bootstrap samples was set to 1,000. The number of iterations in the Bayesian solution was 1,000 with a burn-in of 50, because the sampling reached convergence very quickly. The credible intervals were highest posterior density (HPD) intervals (Box & Tiao, 1973; see Table A1). When discussing the results from the simulation study, we refer to "percentile-type non-parametric bootstrap confidence intervals" as "confidence intervals" and to "HPD credible intervals" as "credible intervals".
The frequentist calculation of coefficient ω was based on loadings and residual variances from a PFA instead of a CFA, because in several simulation runs the CFA model-fitting did not converge or resulted in negative variances. Note that CFA and PFA yield virtually identical parameter estimates, and thus nearly identical values of coefficient ω, when data are unidimensional.
We combined the posterior distributions of the simulation runs for each condition by calculating the quantiles of those distributions and averaging each quantile. The analysis of the simulation results included a visual assessment of the consistency and deviation of the estimates by means of summary plots, the coverage of the uncertainty intervals, the root mean-square error (RMSE) of the estimates, and the probability that a coefficient overestimated its associated population value (risk).
Coverage was computed as the percentage of simulation runs in which the interval contained the population value. When calculating 95% intervals, we expected a well-calibrated method to cover the population value in 95% of the cases. The RMSE quantifies the deviation of the posterior and bootstrapped samples of the estimators from their associated population values, averaged over simulation runs. The RMSE can be interpreted on the same scale as the reliability and indicates how much an estimate spreads around its population value. Smaller RMSE values are associated with a less biased estimator. The "risk" was calculated by determining the quantile of the population coefficient value in the posterior distribution of the estimator. Let F_X(x) be the cumulative distribution function of the posterior distribution of X evaluated at the point x; then the risk of overestimation is r = 1 − F_X(x). F_X can be substituted by the posterior distribution of any one of the Bayesian reliability coefficients; x then becomes the associated population value of that coefficient. One would consider a coefficient to perform satisfactorily if the risk of overestimation is close to .50, meaning the risk of underestimation is close to .50 as well. A more conservative approach would also render values below .50 acceptable, since underestimating the reliability is more acceptable than overestimating it.
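The three performance measures can be sketched for a single simulation run as follows (Python for illustration; the population value and the sampled distribution below are made up).

```python
import numpy as np

rng = np.random.default_rng(2)
pop_value = 0.75  # hypothetical population reliability

# Stand-in for posterior (or bootstrap) samples from one simulation run.
samples = rng.normal(0.73, 0.04, size=1000)

# Risk of overestimation: r = 1 - F_X(x), with F_X the empirical CDF of the
# sampled distribution evaluated at the population value.
risk = np.mean(samples > pop_value)

# RMSE of the sampled values around the population value.
rmse = np.sqrt(np.mean((samples - pop_value) ** 2))

# Coverage check for a single 95% interval.
lo, hi = np.quantile(samples, [0.025, 0.975])
covered = lo <= pop_value <= hi
print(f"risk: {risk:.3f}, RMSE: {rmse:.3f}, covered: {covered}")
```

Averaging `covered` over simulation runs yields the coverage percentage reported below.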
All results relate the estimated coefficients to each other or their corresponding population values. Thus, when we use the term bias, it reflects differences between the estimates and their population values.

Comparison of frameworks
The results of the Bayesian coefficients were highly similar to those of the frequentist coefficients in almost all conditions. A summary of the results for the medium-correlation condition can be found in Figure 1. The figures summarizing the zero- and high-correlation conditions can be found in Appendix C. Figure 1 shows the results separated by the number of items (the columns) and the four coefficients (the rows). Each coefficient block is divided into the different sample sizes (rows) and statistical frameworks (colors). The bars in the figure show the average 95% uncertainty interval limits, the vertical line segments in the bars indicate the average 25% and 75% quartiles, the dashed lines display the population values of the coefficients, and the average point estimates are denoted by rhombuses. We may notice, for example, that the average Bayesian point estimate for coefficient α with five items and 50 observations is slightly smaller than the frequentist one, and, for the same condition, the average confidence interval is shifted slightly more to the left and is wider than the credible interval.
Except for λ2 in the medium- and high-correlation conditions, the Bayesian point estimates over- or underestimated the associated population coefficient values slightly more than the frequentist ones when the sample size was small to medium, in almost all conditions. This indicates an influence of the relatively uninformative prior distribution in small-sample settings. The point-estimated Bayesian glb showed larger positive bias than the point-estimated frequentist glb across all conditions. In general, the Bayesian estimators converged to the associated population values with increasing sample size.
Both methodological approaches displayed similar interval coverage performance (see Table 1). The 95% intervals of both frameworks performed well in the medium- and high-correlation conditions. The credible intervals for λ2 performed slightly worse than the confidence intervals in the zero-correlation condition; the difference vanished with increasing associations between items. Furthermore, the credible intervals of coefficient ω performed satisfactorily. In the zero-correlation condition the credible intervals for ω displayed better coverage than the confidence intervals with a small number of items. In the zero-correlation condition, the RMSEs of the Bayesian ω were slightly smaller than those of the frequentist ω (see Tables C1 and C2). Other than that, we regard the differences in RMSEs between the Bayesian and frequentist coefficients as negligible.

Evaluation of coefficients
As expected based on CTT results, an increase in the number of items led to better point estimation and narrower covering uncertainty intervals in the majority of the conditions. An exception was the glb, which displayed a large positive bias in both frameworks when the number of items was large, a finding consistent with previous research (e.g., Oosterwijk et al., 2017).
The coverage information underscored the poor performance of the glb. The uncertainty intervals of the glb, both confidence and credible intervals, did not cover the population value in any simulation run with 20 items. The bias of the estimator was so large that, in the frequentist approach, none of the 95% confidence interval limits enclosed the point estimate when the sample size was small or medium. This indicates that the bootstrap re-sampling of the data set led to even more biased estimates for the glb. This happened to different degrees in all three correlation conditions and illustrates the capitalization-on-chance mechanism that affects the measure. In addition to the glb, coefficient λ2 and the frequentist ω displayed unsatisfactory coverage in the zero-correlation condition. The other estimators showed satisfactory coverage performance. Except for the glb, the coefficients reached a coverage level close to 95% with increasing correlations. With the exception of coefficient α, all estimators displayed worse coverage results in the zero-correlation condition than in the other conditions. The RMSEs of coefficient ω were smallest among the estimators in the zero-correlation condition. In the other correlation conditions coefficients α, λ2, and ω reached similar RMSEs. Similar to the coverage results, the glb had the poorest RMSE values among the coefficients across all conditions. We did not compute risk results for the zero-correlation condition because the population values of the coefficients were very close to zero (see Figure C1), and thus a reliability coefficient was in itself inclined to overestimate its true value, as negative values were deemed invalid. Thus, the risk results were only calculated for the medium- and high-correlation conditions; Table 2 contains the exact values. Again, the glb was at high risk of overestimating the population coefficient value.
Coefficients α, λ2, and ω reached satisfactory results, with ω showing a more conservative risk.

Example: eight-item questionnaire data
To illustrate the advantages of the Bayesian single-test reliability coefficients, we used a data set by Cavalini (1992) measuring coping style in response to malodorous environments. Both Sijtsma (2009) and Revelle and Zinbarg (2009) used the data to discuss several reliability coefficients. The data set consists of eight-item questionnaire data filled out by 828 participants. The questionnaire is Likert-scaled with scores from 0 to 3, but is usually treated as quasi-continuous. The data covariance matrix can be found in Appendix D. The data set is included in the R-package under the name "cavalini". A table containing the exact values of the point estimates and the uncertainty interval limits for both the frequentist and Bayesian framework together with the R-code to reproduce the results for the example data set can be found in Appendix D.
A simple research question for the reliability analysis of this data set might be: "Does the reliability of the test exceed a cutoff value of .80; hence, is the reliability good?" In the current state of substantive research, the common way to answer this question would be to use the point estimate of coefficient α: α̂_freq = .778. Based on this result, we would conclude that the scale is associated with a sufficient but not a good reliability, depending on the specific cutoff criteria applied. Suppose we are in doubt about using the proper coefficient, because we are aware of some critiques of α and know that λ2 is generally a better coefficient. Also, we heed recommendations about good research practices and decide to provide a confidence interval. Thus, in a more sophisticated analysis we calculate λ̂2_freq = .785, 95% CI [.758, .809]. The correct interpretation of the 95% confidence interval is: if we were to repeatedly draw samples from the same population and calculate the 95% confidence interval for λ2 in the same way each time, the true value would fall inside the interval in 95% of the cases. Thus, we are unable to make any inference about the reliability lying in the particular confidence interval we just obtained.
In contrast, from the posterior distribution we can conclude that the specific credible interval contains 95% of the posterior mass. Since λ̂2_Bayes = .784, 95% HPD [.761, .806], we are 95% certain that λ2 lies between .761 and .806. Yet, how certain are we that the reliability is larger than .80? Using the posterior distribution of coefficient λ2, we can calculate the probability that it exceeds the cutoff of .80: p(λ2 > .80 | data) = .075. We can also determine the posterior probability of coefficient λ2 being larger than .70, another common cutoff value, and smaller than .80, which in this example is p(.70 < λ2 < .80 | data) = p(λ2 > .70 | data) − p(λ2 > .80 | data) ≈ 1 − .075 = .925. In addition to making probabilistic statements about the values of the reliability coefficients, the Bayesian approach allows prior knowledge to be incorporated into the analysis. For example, the data covariance matrix of a previous study that worked with the same test instrument can be used to update the prior distribution and enhance the precision of the subsequent reliability analysis. Figure 2 displays the Bayesian prior and posterior distributions of the coefficients for the Cavalini data set with the original sample size of 828 and a reduced sample size of 100 randomly drawn observations. The plots show that the point estimates generally concur, but the posterior distributions and the widths of the intervals are quite different for the smaller sample size. While the Bayesian posterior distribution can be used to calculate probabilities of interest, its simple graphical display encourages the user's appreciation of the uncertainty of reliability estimates.
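Such probability statements can be read directly off posterior draws. The Python sketch below mimics the posterior with a normal distribution matched to the reported point estimate and interval, so the numbers are only approximate and the variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(3)
# Stand-in for posterior draws of lambda2; in the example the posterior had
# mean .784 with a narrow 95% HPD interval, mimicked here by a normal draw.
lambda2_post = rng.normal(0.784, 0.0115, size=10000)

# Posterior probability of exceeding common cutoffs.
p_above_80 = np.mean(lambda2_post > 0.80)
p_above_70 = np.mean(lambda2_post > 0.70)

# p(.70 < lambda2 < .80 | data) as a difference of exceedance probabilities.
p_between = p_above_70 - p_above_80
print(f"p(lambda2 > .80) = {p_above_80:.3f}, p(.70 < lambda2 < .80) = {p_between:.3f}")
```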
Finally, the R-package Bayesrel allows the display of a graphical posterior predictive check (PPC) for the single-factor model. The idea is to use the posterior predictive distributions of the model parameters to check model fit. In the PPC, which is similar to a scree plot, we compare the posterior model-implied covariance matrices with the covariance matrix of the data to see if the parameters sampled under the single-factor model appear similar to the parameters observed in the data (Gelman et al., 2004, chapter 6.3). Since the model-implied covariance matrix is fully determined by the loadings and the residual variances, we can use the 95% limits of the loadings and residual variances to construct a lower and upper bound for the model-implied covariance matrix. We calculate the eigenvalues of these matrices and plot them (gray bars in Figure 3) together with the eigenvalues of the observed covariance matrix (black dots in Figure 3). If all gray bars enclose the black dots, we assume that the values sampled under the single-factor model are similar to the values of the observed data; hence the single-factor model fits the data. Inspecting the PPC for the full Cavalini data set, we assess model fit to be mediocre at best, since not all of the observed eigenvalues are enclosed in the bars of the model-predicted eigenvalues (see Figure 3).
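The eigenvalue comparison behind the PPC can be sketched as follows. This is a Python illustration, not Bayesrel code: the loading and residual-variance limits and the data-generating values are invented and do not reproduce Figure 3.

```python
import numpy as np

rng = np.random.default_rng(4)
k = 8

# Hypothetical 95% posterior limits of loadings and residual variances.
lam_lo, lam_hi = np.full(k, 0.4), np.full(k, 0.7)
psi_lo, psi_hi = np.full(k, 0.3), np.full(k, 0.6)

# Lower and upper model-implied covariance matrices from the posterior limits.
sigma_lo = np.outer(lam_lo, lam_lo) + np.diag(psi_lo)
sigma_hi = np.outer(lam_hi, lam_hi) + np.diag(psi_hi)

# Observed covariance matrix from simulated single-factor data.
true_sigma = np.outer(np.full(k, 0.55), np.full(k, 0.55)) + np.diag(np.full(k, 0.45))
data = rng.multivariate_normal(np.zeros(k), true_sigma, size=828)
s_obs = np.cov(data, rowvar=False)

# Compare sorted eigenvalues: the single-factor model is deemed to fit when
# the observed eigenvalues (black dots) fall within the model-implied bounds
# (the gray bars of the PPC plot).
ev_lo = np.sort(np.linalg.eigvalsh(sigma_lo))[::-1]
ev_hi = np.sort(np.linalg.eigvalsh(sigma_hi))[::-1]
ev_obs = np.sort(np.linalg.eigvalsh(s_obs))[::-1]
enclosed = np.all((ev_obs >= ev_lo) & (ev_obs <= ev_hi))
print("all observed eigenvalues enclosed:", enclosed)
```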
In this instance, we are unable to interpret coefficient ω as a measure of reliability. This demonstrates that an analysis of dimensionality should always precede a reliability analysis, and that the PPC should only be used as a post-hoc check of model fit. If one had analyzed the dimensionality of the Cavalini data before conducting the reliability analysis, the results might have pointed toward a multidimensional scale. Thus, one could have divided the test into subscales and computed the reliability for each subscale separately. This way, the coefficients α, λ2, and ω would have been better approximations of the reliability of the total scores based on each of the item subsets.

Discussion
When estimating reliability coefficients, researchers (a) predominantly report coefficient α; (b) almost always report only point estimates; and (c) rarely if ever report Bayesian inference. Here, we addressed these issues by implementing Bayesian inference procedures for some popular single-test reliability coefficients. Using the posterior distribution, researchers obtain a direct and intuitive appreciation of the uncertainty in their inference. Moreover, the posterior distribution allows them to obtain answers to questions that fall outside of the standard frequentist framework. Specifically, researchers can obtain the probability that a reliability coefficient falls in any particular interval of interest and update this probability continuously as more data become available.

Figure 2. Cavalini (1992) data with eight items and sample sizes of n = 828 and n = 100 randomly chosen observations: posterior distributions of the estimators with dotted prior density and 95% credible interval bars.

We tackled the issue of lacking uncertainty estimation and reporting in psychological research by offering an easy-to-use statistical method readily available to empirical researchers. The credible intervals implemented in this study offer a simple and intuitive solution to account for the sampling error in reliability estimation. We presented Bayesian procedures to estimate coefficient α, coefficient λ2, the glb, and coefficient ω, and implemented them in an openly accessible R-package. A simulation study and the analysis of an example data set demonstrated the adequacy and benefits of the Bayesian estimators. In addition, the study validated bootstrap confidence intervals for coefficient λ2 and the glb.
In particular, the simulation study showed that the Bayesian coefficients converge to their population values and concur with the frequentist estimates. As expected, performance of the Bayesian coefficients improved with increasing sample size. Whereas the frequentist and Bayesian methods differed slightly with respect to point estimation, they generally coincided with respect to interval estimation in terms of coverage performance.
In contrast to the frequentist coefficients, the Bayesian reliability estimates allow for more intuitive and simple statements about reliability. The covariance sampling procedure introduced here can easily be extended to other estimators, as long as they are calculated from the covariance matrix. Measures that are based on the repeated administration or the repeated rating of the same test can be the subject of future work on Bayesian test-retest or inter-rater reliability.
Except for λ2, the performance of the Bayesian coefficients appears unsatisfactory for small samples. However, we do not consider this problematic. In Bayesian statistics, the initial confidence in different values of a parameter is expressed by means of a prior distribution. The observed data then cause this confidence to be reallocated following Bayes' rule: confidence is gained for parameter values that accord well with the data, and confidence is lost for values that accord poorly. The more informative the data, the larger the shift in confidence. Small samples generally result in modest changes to the prior distribution. Consequently, in practical applications with small-sample data, a visual inspection of the posterior distribution would reveal that it is relatively wide, and the primary concern will be the large posterior uncertainty, not how close the posterior mean may be to the true value. In addition, with small sample sizes the relatively uninformative priors for coefficients α, λ2, and ω (e.g., see Figure 2) yield underestimates of the population values. In other words, the bias is conservative, and researchers are safeguarded from overestimating reliability when the evidence in the data is weak. When the sample size is small, we urge researchers to try to collect more observations. With more incoming evidence, the influence of the prior distribution will diminish, and the Bayesian sample estimates will move closer to their respective population values.
More generally, the results for Bayesian parameter estimation are relatively robust to changes in the prior distribution, a regularity captured by the maxim "the data overwhelm the prior" (e.g., Wrinch & Jeffreys, 1919). Exceptions to this rule occur when the prior distribution is highly informative or when the sample size is very small. In the former case one may wonder why, with such firm knowledge already in hand, one would seek to collect additional data at all; in the latter case, it is prudent to interpret any strong conclusions with considerable caution. If researchers decide to employ informed prior distributions, these should be well motivated, for instance by an analysis of previous empirical findings or by a systematic elicitation effort (e.g., Stefan, Evans, & Wagenmakers, in press, and references therein). Even in these cases we see value in reporting the results for default priors as well, because these provide a reference against which to assess the conclusions from informed priors. Specifically, it is always possible to conduct a sensitivity analysis, where one examines the extent to which the conclusions vary as a result of reasonable changes in the prior distribution.
Some researchers may want to capitalize on the benefits of the Bayesian framework and incorporate informed prior knowledge into their analysis. This is not a trivial task, and we caution readers against using more informative priors until more research into the topic has been conducted. For academic purposes, however, we briefly outline a potential process. One could use the posterior distribution of the covariance matrix from a previous, similar data analysis as the prior distribution for the current analysis. For example, suppose one wants to conduct a reliability analysis for a German translation of the Cavalini questionnaire. The same questionnaire was administered in Dutch in a previous study, and the authors of that study provided the sample covariance matrix. To obtain more precise CTT estimators for the administration of the German questionnaire, one could alter the parameters of the prior inverse Wishart distribution so that its mean equals the sample covariance matrix of the Dutch questionnaire. To obtain coefficient ω for the German questionnaire, one could adjust the hyperparameters of the prior distributions of the loadings and residual variances to better resemble the loadings and residual variances from a single-factor model fit to the sample covariance matrix of the Dutch questionnaire. With multiple similar questionnaires at hand, a more sophisticated approach is to apply a hierarchical model (e.g., Shiffrin et al., 2008), such that the estimate of reliability for a single small-sample questionnaire is informed by knowledge of the reliability of the other questionnaires, moderated by the extent to which the questionnaires are similar.
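Recentering the inverse-Wishart prior on a previous study's covariance matrix can be sketched as follows. This is an illustrative Python fragment, not Bayesrel code; the covariance matrix, the degrees of freedom, and all names are assumptions.

```python
import numpy as np
from scipy.stats import invwishart

# Hypothetical sample covariance matrix from a previous (e.g., Dutch) study.
k = 4
s_prev = np.full((k, k), 0.4) + np.diag(np.full(k, 0.6))

# Choose the inverse-Wishart scale so that the prior mean equals s_prev:
# E[IW(scale, df)] = scale / (df - k - 1), hence scale = s_prev * (df - k - 1).
df = 20  # larger df expresses more confidence in the previous study
scale = s_prev * (df - k - 1)
prior = invwishart(df=df, scale=scale)

prior_mean = scale / (df - k - 1)  # equals s_prev by construction
draws = prior.rvs(size=500, random_state=np.random.default_rng(7))
print(np.allclose(prior_mean, s_prev), draws.shape)
```

The single hyperparameter `df` then controls how strongly the previous study's covariance structure constrains the new analysis.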

The Bayesian reliability coefficients
In view of common misconceptions about coefficient α (Cho, 2016; Cho & Kim, 2015; Hoekstra et al., 2019), we appeal to researchers to interpret α only as a lower bound to reliability. The Bayesian version of λ2 performed as well in the simulation study as coefficient α. We suggest that in the future researchers take a closer look at coefficient λ2, a greater lower bound to reliability than coefficient α (Guttman, 1945; Oosterwijk et al., 2017). Although the glb is the coefficient that comes closest to the true reliability at the population level (Ten Berge & Zegers, 1978), we recommend against using the glb in practice, based on its performance in this study and others (Ten Berge & Sočan, 2004).
This study chose the same prior for the covariance matrices for all coefficients. Being sampled from an inverse Wishart distribution with the identity as scaling matrix, the prior covariance matrices can be assumed to be relatively uninformative. This yielded acceptably flat priors for coefficients α and λ2, with most values between 0 and 1 being approximately equally likely a priori. However, the prior of the glb was heavily skewed to the left. Specifically, the choice of a relatively uninformative prior for the covariance matrix led to a prior distribution of the glb that put an unwanted amount of weight on values of .9 and above (see, e.g., Figure 2). This became more severe as test length increased. To establish the glb as a more popular measure of reliability, future work could investigate different prior distributions and their effect on the estimator.
The simulation results suggest that the Bayesian coefficient ω proved its usefulness as a reliability coefficient for unidimensional tests. Multidimensional tests, however, require a more sophisticated model, for example, a bi-factor model. For this, one needs to apply a more complex posterior sampling procedure than for the single-factor model (see, e.g., Lee, 2007; Muthén & Asparouhov, 2012). Estimating coefficient ωh (Zinbarg et al., 2005) from a Bayesian perspective may be a promising approach to indicate the measurement of a common factor in a multidimensional scale.
To summarize, the Bayesian reliability estimates displayed statistical properties similar to those of the frequentist estimators. Furthermore, the shape of the prior distribution may influence the outcome of a Bayesian reliability analysis when the sample size is small.

Simulation
Our simulation study covered a limited range of conditions. First, we assumed data to be continuous and normally distributed. A Bayesian solution for data that are multivariate non-normal or even categorical is offered by Gaussian copula graphical models (Mohammadi, Abegaz, Heuvel, & Wit, 2017). Kelley and Pornprasertmanit (2016) discussed the implications of categorical data for different types of ω. Second, we only simulated data that were unidimensional. An account of the behavior of frequentist reliability coefficients with multidimensional data is given by Cho (2016), and in the case of coefficient ω by Kelley and Pornprasertmanit (2016). Third, we did not evaluate the coefficients with respect to the population reliability but with respect to their population values. The coefficients are approximations of the population reliability; thus their population values differ from the population reliability under realistic conditions. If one assumes that the factor model can substitute for CTT in reliability analysis, and thus that the factor score of the single-factor model represents the true score, the population value of coefficient ω equals the population reliability. Since the purpose of our study was to compare the reliability coefficients of one statistical framework to those of another, we refrain from comparing the coefficients to the population reliability.

Bayesian estimation
The Bayesian estimation procedure starts with the choice of adequate priors for the parameters. Given the sparsity of research on Bayesian reliability estimation, we conservatively decided on relatively uninformative priors for the covariance matrices and relatively uninformative priors for the factor model parameters. For future research, an alternative approach would be to specify priors such that the resulting prior distribution of the reliability coefficients is uninformative, instead of the prior on the covariance matrix or the factor model parameters.

Note: The two most common errors among researchers regarding coefficient α are the failure to interpret α as a lower bound and the false assumption that α is an indicator of unidimensional data (Hoekstra et al., 2019).

Conclusion
The posterior distribution answers practically relevant questions about the confidence one can have in reliability estimates. As such, Bayesian estimation adds an essential measure of uncertainty to simple point-estimated coefficients. Adequate credible intervals for single-test reliability estimates can be obtained easily by applying the procedure described in this article, as implemented in the R-package Bayesrel. Whereas the R-package addresses substantive researchers who have some experience in programming, we admit that it will probably not reach scientists whose software experience is limited to graphical-user-interface programs such as SPSS. For this reason, we are currently implementing the Bayesian reliability coefficients in the open-source statistical software JASP (JASP Team, 2020).
Whereas we cannot stress the importance of reporting uncertainty enough, the question of the appropriateness of certain reliability measures cannot be answered by the Bayesian approach. No single reliability estimate can be generally recommended over all others. Nonetheless, practitioners are faced with the decision of which reliability estimates to compute and report. Based on a single test administration, the procedure should involve an assessment of dimensionality. Ideally, practitioners report multiple reliability coefficients with an accompanying measure of uncertainty that is based on the posterior distribution.

Table A1. Uncertainty intervals and their interpretation

Frequentist confidence interval

Definition
An X% confidence interval for a parameter θ is an interval with limits (L, U) generated by a procedure that, in repeated sampling, has at least an X% probability of containing the true value of θ, for all possible values of θ (Morey et al., 2016; Neyman, 1937).

Procedure
One constructs a confidence interval around a sample estimate θ̂ by adding an estimate of uncertainty to θ̂. The estimate of uncertainty is obtained by means of the sampling distribution of the parameter θ. This can be done in several ways; we distinguish between two cases: one where the sampling distribution is known (analytic confidence intervals) and one where it is unknown (bootstrap confidence intervals).

Analytic confidence interval
For example, assume a participant gets 9 out of 12 true/false questions correct. One calculates an analytic confidence interval for the parameter θ (the probability of answering any one question correctly) on the basis of the known sampling distribution of θ, i.e., the binomial distribution. Commonly, one approximates the binomial distribution of θ with a normal distribution on the basis of the central limit theorem. Hence, the 95% analytic confidence interval is: [(9/12) ± 1.96 × √(.75 × .25/12)] = [.505, .995].
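This normal-approximation interval is easy to verify numerically; a minimal Python check:

```python
import math

# Analytic (normal-approximation) 95% CI for a binomial proportion:
# 9 correct answers out of 12 true/false questions.
p_hat = 9 / 12
se = math.sqrt(p_hat * (1 - p_hat) / 12)  # standard error of the proportion
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")  # matches the [.505, .995] in the text
```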

Bootstrap confidence interval
For some models, the sampling distribution may not be available or applicable. In these situations, bootstrapping can be used to obtain an empirical sampling distribution of the parameter θ by resampling the data with replacement and computing θ for each resampled data set (Efron, 1979). The "bootstrap" confidence interval can be constructed from this empirical distribution in a number of ways.

Percentile-type confidence interval
To obtain a percentile-type interval we discard 100(α/2)% of the mass in each tail of the empirical sampling distribution. Thus, the percentile-type interval is given by the quantiles of the empirical sampling distribution: [θ*(α/2), θ*(1−α/2)], with θ* being the bootstrapped parameter estimates. For the binomial example, the 95% percentile-type bootstrap confidence interval for θ equals [.5, 1]. Note that the percentile-type interval can be constructed for any sampled distribution (also a posterior distribution) and is not exclusively relevant to bootstrapped sampling distributions.
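The percentile-type interval for the same 9-out-of-12 example can be sketched in Python; the number of bootstrap samples and the seed are arbitrary choices, so the limits reproduce [.5, 1] only approximately.

```python
import numpy as np

rng = np.random.default_rng(5)
# The binomial example: 9 successes out of 12, coded as 0/1 responses.
data = np.array([1] * 9 + [0] * 3)

# Resample with replacement and compute theta* for each bootstrap sample.
boot = np.array(
    [rng.choice(data, size=12, replace=True).mean() for _ in range(1000)]
)

# Percentile-type 95% interval: the alpha/2 and 1 - alpha/2 quantiles.
lo, hi = np.quantile(boot, [0.025, 0.975])
print(f"95% percentile bootstrap CI: [{lo:.3f}, {hi:.3f}]")
```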

Bayesian credible interval

Definition
An X% credible interval for a parameter θ is an interval with limits (L, U) that encloses X% of the posterior mass.

Procedure
One constructs a credible interval for a parameter θ by summarizing the posterior distribution of θ by means of an interval; we distinguish two types.

Central credible interval
Consider again the example of answering true/false questions, where a participant gets 9 out of 12 questions correct. One can obtain the posterior distribution of θ (the probability of answering any one question correctly) in closed form if one uses a Beta(1, 1) distribution as the prior for θ. The posterior distribution of θ is then a Beta(10, 4) distribution. One can compute a central credible interval from the Beta(10, 4) distribution by using the α/2 and 1 − α/2 quantiles of the posterior distribution. This results in a 95% central credible interval of [.462, .909].
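This posterior updating and quantile computation can be reproduced directly in Python using scipy:

```python
from scipy.stats import beta

# Posterior for the binomial rate with a Beta(1, 1) prior and 9/12 correct:
# Beta(1 + 9, 1 + 3) = Beta(10, 4).
posterior = beta(10, 4)

# Central 95% credible interval: the .025 and .975 quantiles.
lo, hi = posterior.ppf([0.025, 0.975])
print(f"95% central credible interval: [{lo:.3f}, {hi:.3f}]")
```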

HPD interval
Another popular way to construct a credible interval is to use the highest posterior density (HPD) region. The HPD interval is the smallest possible interval that contains 100(1 − α)% of the posterior mass (Gelman et al., 2004, chapter 2.3). For example, to obtain the limits of an HPD interval for the posterior probability of the binomial rate θ, one could sample a sufficient number of values (e.g., 10,000) from the posterior distribution, i.e., the Beta(10, 4), and compute the HPD interval with standard software packages. The 95% HPD interval then equals [.486, .926].
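A simple sample-based HPD computation for the same Beta(10, 4) posterior can be sketched as follows; the sliding-window approach below is one common way to find the narrowest interval, and the exact limits vary slightly with the random draws.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(6)
# Draw from the Beta(10, 4) posterior and sort the draws.
draws = np.sort(beta(10, 4).rvs(10000, random_state=rng))

# Find the narrowest window containing 95% of the sorted draws.
n_in = int(np.floor(0.95 * len(draws)))
widths = draws[n_in:] - draws[: len(draws) - n_in]
i = int(np.argmin(widths))
lo, hi = draws[i], draws[i + n_in]
print(f"95% HPD interval: [{lo:.3f}, {hi:.3f}]")
```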
Below we report the root mean square errors (RMSE) of the values in the sampled distributions, posterior and bootstrap, relative to their associated population values, averaged over simulation runs. Thus, the RMSEs quantify the bias of the sampled coefficients in both frameworks. The tables are separated by the number of items.

Figure C2. Simulation results for the high-correlation condition. The endpoints of the bars are the 95% uncertainty interval limits. The 25%- and 75%-quartiles are indicated with vertical line segments.