On the Null Distribution of Bayes Factors in Linear Regression

ABSTRACT We show that under the null, the logarithm of the Bayes factor is asymptotically distributed as a weighted sum of chi-squared random variables with a shifted mean. This claim holds for Bayesian multi-linear regression with a family of conjugate priors, namely, the normal-inverse-gamma prior, the g-prior, and the normal prior. Our results have three immediate impacts. First, we can compute analytically a p-value associated with a Bayes factor without the need for permutation. We provide a software package that can evaluate the p-value associated with a Bayes factor efficiently and accurately. Second, the null distribution illuminates some intrinsic properties of the Bayes factor, namely, how the Bayes factor depends quantitatively on the prior, and the genesis of Bartlett's paradox. Third, guided by the null distribution of the Bayes factor, we formulate a novel scaled Bayes factor that depends less on the prior and is immune to Bartlett's paradox. When two tests have an identical p-value, the test with the larger power tends to have a larger scaled Bayes factor, a desirable property that is missing for the (unscaled) Bayes factor. Supplementary materials for this article are available online.


Introduction
Bayesian methods have been sidelined by most practitioners in genetic association studies. The main reason is that the p-value, although often misinterpreted, is entrenched among practitioners (Sellke, Bayarri, and Berger 2001; Nuzzo 2014). A Bayesian method that performs genetic association tests, such as that of Guan and Stephens (2008), often faces an inconvenient demand to produce a p-value associated with an extraordinarily large Bayes factor. Because the null distribution of the Bayes factor is unknown, a previous solution has been to obtain the p-value through permutation (Servin and Stephens 2007). In genome-wide association studies, however, the significance threshold for p-values is exceedingly small, owing to the burden of multiple testing; thus, the number of permutations required is often prohibitively large and hence impractical. This motivates us to quantify the null distribution of Bayes factors, from which we can compute a p-value associated with a Bayes factor analytically, without the need for permutation.
Our second motivation is to understand the Bayes factor itself: first and foremost, to understand quantitatively the prior-dependent nature of Bayes factors, which often turns away naive practitioners; second, to investigate the root of the inconsistency of the Bayes factor, namely, Bartlett's paradox (Bartlett 1957). Bartlett discovered that a diffuse prior tends to favor, unintentionally, the null model. In other words, if one were uncertain about the prior effects, one would automatically favor the null. (On the other hand, if one were too certain about the prior effects, one would risk prior misspecification, which also unintentionally favors the null.) We identified the dominant term in the Bayes factor that is affected by the prior, which motivates us to scale the Bayes factor systematically. The scaled Bayes factor depends weakly on the prior and no longer suffers from Bartlett's paradox.
Our third motivation is to emphasize the necessity of taking power into account when interpreting p-values. The probability that a positive report is false depends on both the p-value and the power of a test (Wacholder et al. 2004). A plainer reiteration of this insightful observation is that a small p-value alone cannot provide strong evidence for a true association; it has to be interpreted in light of the statistical power (Burton et al. 2007). A large Bayes factor, however, by itself provides strong evidence for a true association (Stephens and Balding 2009), and Sawcer (2010) related the Bayes factor to the ratio between the power and the p-value. It is therefore beneficial for a study to report both Bayes factors and their associated p-values. The idea of computing a p-value associated with a Bayes factor dates back to Good (1957), as an instance of the "Bayes/non-Bayes compromise" (Good 1992). The p-values satisfy the practical mandate imposed by the research community, and the companion Bayes factors remind one to account for power when interpreting p-values. For example, tests may be ranked differently by their p-values than by their Bayes factors, and two identical p-values may be associated with different Bayes factors. Both examples highlight the necessity of taking the statistical power into account when interpreting p-values. To this end, our scaled Bayes factor becomes more appealing: when two tests produce identical p-values, the scaled Bayes factor tends to assign a larger value to the test with the larger power, while the (unscaled) Bayes factor does not.
Our main result states that, under the null, the logarithm of the Bayes factor has the same distribution as a weighted sum of chi-squared random variables with a shifted mean. The results hold asymptotically for Bayesian multi-linear regression. For simple linear regression, we have 2 log(Bayes factor) = λ χ²_1 + log(1 − λ), where λ is a quantity related to the prior and the data, and χ²_1 denotes a chi-squared random variable with one degree of freedom (d.f.). The undesirable properties of the Bayes factor, namely, its over-dependence on the prior and Bartlett's paradox, find their roots in the shift term log(1 − λ). Our scaled Bayes factor effectively substitutes this term with −λ to achieve 2 log(scaled Bayes factor) = λ(χ²_1 − 1). For simple linear regression, the p-value associated with a Bayes factor is the same as the p-value of the likelihood ratio test. For multi-linear regression, computing the p-value associated with a Bayes factor requires evaluating the distribution function of a weighted sum of chi-squared random variables. Based on a recently published polynomial algorithm (Bausch 2013), we developed a software package to evaluate these p-values analytically, efficiently, and to arbitrary precision.
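For simple linear regression, the relationship above can be inverted directly: given 2 log BF and λ, the implied χ²_1 statistic yields the tail probability. A minimal sketch (our own illustration and naming, not the paper's BACH package):

```python
# Sketch: invert 2 log BF = lam*Q + log(1 - lam) to get the p-value
# associated with a Bayes factor in simple linear regression (p = 1).
import numpy as np
from scipy.stats import chi2

def p_value_of_bf(log10_bf, lam):
    """p_B given log10(BF) and the weight lam in (0, 1)."""
    two_log_bf = 2.0 * np.log(10.0) * log10_bf
    q = (two_log_bf - np.log1p(-lam)) / lam  # implied chi^2_1 statistic
    return chi2.sf(q, df=1)                  # upper-tail probability

# BF = 1 (log10 BF = 0) with lam = 0.9 implies Q ~ 2.56
print(p_value_of_bf(0.0, 0.9))
```

Because the map from BF to Q is monotone, larger Bayes factors always yield smaller p-values for a fixed λ.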
The article is structured as follows. In Section 2, we formulate the model and the priors and provide our main result on the null distribution of Bayes factors. In Section 3, we demonstrate how to compute a p-value associated with a Bayes factor. In Section 4, we introduce the scaled Bayes factor and demonstrate its benefits. In Section 5, we analyze a real dataset to compute and compare the Bayes factor, the scaled Bayes factor, and the p-values associated with Bayes factors. In the last section, we summarize our findings and discuss relevant (future) topics. All proofs are in the online supplementary materials.

The Null Distribution of Bayes Factor
We consider the standard hypothesis testing problem in linear regression with independent normal errors:

H_1: y = Wa + Lb + ε, ε ∼ MVN(0, τ⁻¹ I_n), versus H_0: b = 0,     (1)

where MVN stands for the multivariate normal distribution, I_n is an n × n identity matrix, W is a full-rank n × q matrix representing the nuisance covariates, including a column of 1s, a is a q-vector, L is an n × p matrix representing the covariates of interest, b is a p-vector, and τ⁻¹ is the error variance. The two models H_0 and H_1 are nested, and the null model H_0 represents no effect of L.
The Bayesian linear regression comes with three forms of conjugate priors in the literature. The first is the normal-inverse-gamma (NIG) prior (O'Hagan and Forster 2004, chap. 9):

a | τ ∼ MVN(0, τ⁻¹ V_a), b | τ ∼ MVN(0, τ⁻¹ V_b), τ ∼ Gamma(κ_1, κ_2),     (2)

where V_a and V_b are positive definite matrices, and the gamma distribution is in the shape-rate parameterization. Following the standard treatment (see Servin and Stephens 2007) of letting V_a⁻¹ → 0 and κ_1, κ_2 → 0, we can compute the Bayes factor in closed form:

BF = |I_p + V_b X^t X|^{−1/2} (1 − y_W^t X (X^t X + V_b⁻¹)⁻¹ X^t y_W / (y_W^t y_W))^{−(n−q)/2},     (3)

where

X = (I_n − W (W^t W)⁻¹ W^t) L     (4)

is the matrix of residuals of L after regressing out W, y_W is the analogous residual of y, and | · | denotes the determinant. Since W is assumed to contain a column of 1s, each column of X is centered. The Bayes factor in (3) uses the null as the base model and is thus called the null-based Bayes factor (see Liang et al. 2008), which has been widely used in genetic association studies (Balding 2006; Marchini et al. 2007; Guan and Stephens 2008; Xu and Guan 2014).
The use of the improper prior V_a⁻¹ → 0, κ_1, κ_2 → 0 has two merits. First, the limiting prior distribution for a and τ is equivalent to Jeffreys' prior (Ibrahim and Laud 1991; O'Hagan and Forster 2004), p(a, τ) ∝ τ^{(q−2)/2}, which is the standard choice of noninformative prior for the nuisance parameters in the literature. Moreover, the posterior distributions for a and τ are proper. Second, the Bayes factor in (3), which can be written as the limit of a sequence of Bayes factors with proper priors (see proof in the supplementary materials), is invariant to shifting and scaling of y (or, equivalently, is independent of a and τ). To see this, replace y with the standardized random variable

z = (y − ȳ 1_n) / ŝ,     (5)

where ȳ and ŝ are the sample mean and standard deviation of y, and one can check that (3) still holds. We assume a priori that the expectation of b is zero so that the direction of the effect has no influence on the Bayes factor. This prior for b is commonly adopted in practice (see Jeffreys 1961, chap. 5). For the NIG prior, we further assume independence between the effects and the covariates. Henceforth, when we refer to the NIG prior, we mean

V_b = σ_b² I_p,     (6)

unless otherwise stated. The NIG prior and the corresponding Bayes factor will be the primary focus of this article. The second conjugate prior is Zellner's g-prior (Zellner 1986; Liang et al. 2008):

b | τ ∼ MVN(0, (g/τ)(X^t X)⁻¹).     (7)

This can be seen as a special case of the NIG prior with V_b = g (X^t X)⁻¹, and thus the Bayes factor under the g-prior can be derived straightforwardly from (3). The third conjugate prior, the normal prior, can also be viewed as a special case of the NIG prior, when the error variance τ⁻¹ is assumed known:

b ∼ MVN(0, τ⁻¹ V_b), with τ known.     (8)

Under this prior, and letting V_a⁻¹ → 0, the Bayes factor can also be computed analytically:

BF = |I_p + V_b X^t X|^{−1/2} exp((1/2) z^t X (X^t X + V_b⁻¹)⁻¹ X^t z),     (9)

where X is defined in (4) and z is defined in (5). Because the error variance in most applications is unknown, the normal prior is mostly of theoretical interest. But, as we will see shortly, the Bayes factor under the normal prior is approximately equal to that under the NIG prior.
This is not too surprising because the data are very informative on the error variance.
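To make the closed form concrete, here is a hedged sketch of a null-based Bayes factor of this limiting-NIG type; the function `log10_bf_nig` and its arguments are our own naming, and the expression follows the standard form in Servin and Stephens (2007) rather than being copied from the paper. It also demonstrates the shift-and-scale invariance noted above:

```python
# Hedged sketch of a limiting-NIG null-based Bayes factor (our reconstruction,
# following the standard form in Servin and Stephens 2007).
import numpy as np

def log10_bf_nig(y, W, L, sigma_b=0.2):
    n, q = W.shape
    p = L.shape[1]
    P = np.eye(n) - W @ np.linalg.solve(W.T @ W, W.T)  # projects out W
    X, y_w = P @ L, P @ y        # residuals of L and y after regressing out W
    A = X.T @ X + np.eye(p) / sigma_b**2
    fit = (y_w @ X) @ np.linalg.solve(A, X.T @ y_w)
    _, logdet = np.linalg.slogdet(np.eye(p) + sigma_b**2 * (X.T @ X))
    two_log_bf = -logdet - (n - q) * np.log1p(-fit / (y_w @ y_w))
    return two_log_bf / (2.0 * np.log(10.0))

rng = np.random.default_rng(0)
n = 200
W = np.column_stack([np.ones(n), rng.normal(size=n)])  # includes a column of 1s
L = rng.normal(size=(n, 3))
y = rng.normal(size=n)
# invariance to shifting and scaling of y
print(np.isclose(log10_bf_nig(y, W, L), log10_bf_nig(3.0 * y + 7.0, W, L)))
```

The invariance holds because the intercept column in W absorbs the shift and the ratio fit/(y_W^t y_W) is unchanged by rescaling.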
Theorem 1. For multi-linear regression (1) with the NIG prior (2), the g-prior (7), or the normal prior (8), under the null the Bayes factors (BF) given in (3) and (9) follow

2 log BF = Σ_{i=1}^p λ_i Q_i + Σ_{i=1}^p log(1 − λ_i) + ε,     (10)

where Q_1, . . . , Q_p are independent χ²_1 random variables and λ_i = γ_i/(1 + γ_i), with γ_1, . . . , γ_p the eigenvalues of V_b X^t X and X defined in (4). For the NIG prior and the g-prior, ε = o_P(1) vanishes in probability as the sample size n → ∞. For the normal prior, ε = 0.
Theorem 1 states that under the null, 2 log BF is distributed as a weighted sum of chi-squared random variables with a shifted mean, and both the weights and the mean shift can be computed. By evaluating the distribution function, we can obtain a p-value associated with the Bayes factor. For simple linear regression, Q_1 is asymptotically equal to the test statistic of the likelihood ratio test; thus, the p-value associated with the Bayes factor is the same as the p-value of the likelihood ratio test.
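The null distribution in Theorem 1 can be sanity-checked by direct simulation of the weighted sum before any analytic machinery is invoked (a sketch with our own function names; the paper's software evaluates this tail probability analytically):

```python
# Monte Carlo approximation of
# p_B = P( sum_i lam_i*Q_i + sum_i log(1 - lam_i) >= 2 log BF ).
import numpy as np

def p_b_monte_carlo(two_log_bf, lams, n_draws=200_000, seed=1):
    rng = np.random.default_rng(seed)
    lams = np.asarray(lams, dtype=float)
    q = rng.chisquare(df=1, size=(n_draws, lams.size))  # iid chi^2_1 draws
    null_stat = q @ lams + np.log1p(-lams).sum()        # Theorem 1 null statistic
    return float((null_stat >= two_log_bf).mean())

# p = 1, lam = 0.9, observed BF = 1: the tail probability is about 0.11
print(p_b_monte_carlo(0.0, [0.9]))
```

For p = 1 this agrees with the likelihood ratio test tail probability, consistent with Corollary 1 below; for p > 1 the Monte Carlo error decays only as the square root of the number of draws, which is why an analytic evaluation is needed for genome-wide significance levels.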
It is easy to see that λ_i ∈ [0, 1). When the leading eigenvalue approaches 1, Σ_{i=1}^p log(1 − λ_i) goes to negative infinity, and so does 2 log BF. The leading eigenvalue approaches 1 under two scenarios: the sample size goes to infinity, or the prior diffuses indefinitely. Thus, the prior-dependent nature of the Bayes factor and Bartlett's paradox both find their roots in the term Σ_{i=1}^p log(1 − λ_i). Moreover, when the sample size gets extraordinarily large, every λ_i approaches 1 and Σ_{i=1}^p λ_i Q_i behaves like the likelihood ratio test statistic, which is distributed as a chi-squared random variable with p degrees of freedom, a special case of Wilks's (1938) theorem.

The p-Value Associated with Bayes Factor
Using Theorem 1, we can compute a p-value associated with the Bayes factor given in (3), which henceforth is denoted by p_B. Since the Bayes factor is a test statistic, p_B is naturally a frequentist p-value. We point out that p_B is also a Bayesian p-value. The p-values, or tail probabilities, are frequently computed in a Bayesian context to check whether the model provides a good fit to the data. Bayesian p-values can be computed by comparing the observed test statistic to a predictive distribution obtained by integrating out the nuisance parameters over a reference distribution. Different Bayesian p-values can be computed using different reference distributions (Robins, van der Vaart, and Ventura 2000, Table 1). Two well-known examples are the prior predictive p-values (Box 1980) and the posterior predictive p-values (Rubin 1984; Meng 1994), which use the prior and the posterior as the reference distributions, respectively. In our case, the Bayes factors in (3) and (9) are independent of the nuisance parameters; thus, p_B can be viewed as a posterior predictive p-value. This convenience can be viewed as a bonus from the improper prior we used.
Corollary 1. Denote by p_F the p-value from the likelihood ratio test; then for simple linear regression, we have asymptotically p_B = p_F.
When p = 1, the right-hand side of (10) contains a single chi-squared random variable Q_1, which is asymptotically equal to the likelihood ratio test statistic, and therefore the two p-values are equal. In addition, for simple linear regression, (10) explains the linear relationship between log BF and the likelihood ratio test statistic observed in Wakefield (2008) and Guan and Stephens (2008).

Weighted Sum of χ²_1
For multi-linear regression, the right-hand side of (10) contains a weighted sum of chi-squared variables. The weights λ_1, . . . , λ_p are functions of the prior effect size σ_b and the eigenvalues of the matrix X defined in (4). In contrast, the likelihood ratio test statistic is asymptotically equal to Σ_{i=1}^p Q_i and distributed as χ²_p, which does not take into account the eigenvalues of X. Consequently, p_B no longer equals p_F in general. One exception, however, is the Bayes factor under the g-prior, where λ_i = g/(g + 1) for every i.
To compute p B for multi-linear regression requires evaluating the distribution function of a weighted sum of chi-squared random variables, a challenging problem. Fortunately, a recent polynomial method by Bausch (2013) provides an efficient solution. Our contribution here is its implementation. We have implemented Bausch's method in C++, which allows one to compute p-values (tail probabilities) to an arbitrary precision efficiently.
First, we briefly summarize Bausch's method and then provide more details of our implementation. Bausch pairs the chi-squared variables to take advantage of the identity

f_{λ_1 Q_1 + λ_2 Q_2}(x) = (2 √(λ_1 λ_2))⁻¹ exp(−(λ_1 + λ_2) x / (4 λ_1 λ_2)) I_0((λ_1 − λ_2) x / (4 λ_1 λ_2)),

where Q_1 and Q_2 are independent χ²_1 random variables and I_0 is the modified Bessel function of the first kind. I_0 can be approximated, to arbitrary precision, by its Taylor expansion, and the resulting series can be integrated algebraically in the subsequent convolutions. The error of this algorithm depends only on the remainder terms of the Taylor expansions and thus can be quantified. Bausch showed that the complexity of this algorithm is polynomial in p.
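The pairing identity can be checked numerically. The sketch below (our own illustration; `pair_density` is our naming) verifies that the closed-form density integrates to one and has the correct mean λ_1 + λ_2:

```python
# Numerical check of the pairing identity for the density of a*Q1 + b*Q2,
# with Q1, Q2 iid chi^2_1 and weights a, b > 0.
import numpy as np
from scipy.special import i0  # modified Bessel function of the first kind, order 0

def pair_density(x, a, b):
    """Closed-form density of a*Q1 + b*Q2 (Bausch's pairing identity)."""
    return (np.exp(-x * (a + b) / (4 * a * b))
            * i0(x * (a - b) / (4 * a * b))
            / (2 * np.sqrt(a * b)))

x = np.linspace(0.0, 80.0, 20001)   # grid covering essentially all the mass
dx = x[1] - x[0]
total = float(np.sum(pair_density(x, 0.9, 0.5)) * dx)
print(total)  # should be close to 1
```

The density is finite at zero (unlike the χ²_1 density itself), which is part of what makes the paired representation convenient for subsequent algebraic convolutions.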
In our implementation, the weights (λ_1, . . . , λ_p) are sorted in descending order and the chi-squared variables are then paired consecutively. If p is odd, the term with the smallest weight is retained for a numerical convolution in the last step. This pairing strategy aims to minimize the number of terms needed in the Taylor expansions for a prespecified precision. After Taylor expansion, we are faced with convolving gamma density functions (up to a normalizing constant). The order of the convolutions is determined by single-linkage hierarchical clustering (Murtagh and Contreras 2012) on the rate parameters of the gamma densities, because convolving two gamma densities with similar rates is computationally more efficient.
Our program has four outstanding features. First, we adopted the GNU Multiple Precision Arithmetic Library so that our program can produce an arbitrarily small p-value without suffering underflow or overflow. Second, for an even number of chi-squared variables, p_B can be computed to arbitrary precision; for an odd number of chi-squared variables, the error introduced at the last step of numerical convolution can be made arbitrarily small. Third, the number of terms in each Taylor expansion is determined by a prespecified precision, and a strict error bound is provided. Last, since the program is written in C++, it is fast and suitable for studies that evaluate millions of tests, such as genetic association studies. Figure 1 demonstrates the efficiency of our program; for example, when p = 10 our program can evaluate ≈ 150 p-values per second, and the running time appears to grow quadratically in p. Weighted sums of chi-squared variables occur frequently in statistical applications, such as ridge regression, the variance component model, and, recently, association testing for rare variants (Wu et al. 2011; Epstein et al. 2015). We therefore believe our program has a wide range of applications. The source code and executables of our program BACH (Bausch's Algorithm for CHisquare weighted sum) are freely available at http://haplotype.org.

Accuracy and Calibration of p B for Finite Sample Sizes
Using Theorem 1, we can evaluate extremely small p-values, an important advantage in applications such as genome-wide association studies (GWAS) compared to the permutation method described in Servin and Stephens (2007). Since p B is quantified asymptotically, we are compelled to evaluate the accuracy and calibration of p B for small sample sizes. We also computed the likelihood ratio test p-value p F as a comparison because of its intrinsic connection to Bayes factor (and hence p B ) shown in Theorem 1.
We used a GWAS dataset to perform our simulation studies; the details of the dataset are provided in Section 5. For given n and p, we randomly sampled a subset of genotypes of n individuals and p SNPs. We then simulated y under the null, that is, y ∼ MVN(0, I_n). For each pair of sampled genotypes and simulated phenotypes, we computed p_B, using σ_b = 0.2, and p_F. We chose n = 100, 300 and p = 1, 5, 10, 20. For every combination of n and p, we repeated the simulations 10^7 times. For p = 1, we can compute the exact p-value associated with the Bayes factor using the F-test (see proof in the supplementary materials). The comparison between the exact values of p_B and the p_B obtained from the asymptotic approximation is shown in the top row of Figure 2. For p > 1, true values of p_B cannot be obtained analytically; we thus compared our asymptotic results against the theoretical uniform distribution. The two bottom rows show that for n = 100, the asymptotic results are conservative for small p-values, but for n = 300, the asymptotic results are well aligned with the theoretical prediction. Overall, Figure 2 demonstrates that p_B is well calibrated, and its calibration at the tail is better than that of p_F. We thus conclude that our asymptotic method for obtaining p_B is accurate and well calibrated when the sample size is more than a few hundred.

Scaled Bayes Factors
Bayes factors are sensitive to the specification of priors. Consider the NIG prior with V_b = σ_b² I_p and denote the singular values of X by δ_i for i = 1, . . . , p. Then λ_i in (10) becomes λ_i = δ_i²/(δ_i² + 1/σ_b²), and thus

2 log BF = Σ_{i=1}^p λ_i Q_i − Σ_{i=1}^p log(1 + δ_i² σ_b²).     (11)

Here, we assume the sample size is sufficiently large that the o_P(1) error term can be safely omitted. The term λ_i Q_i is bounded by Q_i (because λ_i < 1), but the term log(1 + δ_i² σ_b²) is monotonically increasing in both δ_i and σ_b. When the sample size goes to infinity, δ_i goes to infinity; when the prior becomes more diffuse, σ_b goes to infinity. These properties give rise to the prior-dependent nature of the Bayes factor and to Bartlett's paradox.
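The weights are easy to compute once the singular values of X are in hand (a sketch with our own function naming):

```python
# Compute the weights lam_i = delta_i^2 / (delta_i^2 + 1/sigma_b^2) from X.
import numpy as np

def weights(X, sigma_b):
    delta2 = np.linalg.svd(X, compute_uv=False) ** 2  # delta_i^2 = eigenvalues of X^t X
    return delta2 / (delta2 + 1.0 / sigma_b**2)

X = np.diag([3.0, 2.0])   # toy X with singular values 3 and 2
print(weights(X, sigma_b=0.5))  # -> approximately [0.692, 0.5]
```

As σ_b grows, each weight climbs toward 1 while the shift term −Σ log(1 + δ_i² σ_b²) in (11) decreases without bound, which is exactly the mechanism behind Bartlett's paradox described above.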
By (10),

E_0[2 log BF] = Σ_{i=1}^p λ_i + Σ_{i=1}^p log(1 − λ_i),

where E_0 is the expectation evaluated under the null. Centering 2 log BF, we obtain

2 log sBF = 2 log BF − E_0[2 log BF] = Σ_{i=1}^p λ_i (Q_i − 1).     (12)

We call the quantity log BF − E_0[log BF] the logarithm of the scaled Bayes factor (sBF). By definition, evaluating sBF requires computing E_0[2 log BF]. In addition to direct computation, E_0[2 log BF] can also be evaluated by simulating y under the null. A valid and convenient approach to simulating under the null was proposed by Kennedy (1995): permute y_W, the residuals of y after regressing out the covariates W. Since 2 log BF is a weighted sum of chi-squared random variables, a modest number of permutations of y_W provides an accurate estimate of its mean under the null. The permutation approach may be advantageous over the analytical computation when the model is misspecified.

Figure 2. Accuracy and calibration of p_B. The top row is for simple linear regression; the "true values" of p_B are obtained from F-tests, the y-axis is the asymptotic p_B, and the line y = x is marked in gray. The two bottom rows are for multi-linear regression; red dots represent p_F (from likelihood ratio tests) and blue dots p_B, and the gray region represents a 95% prediction interval for uniformly distributed p-values.

Comparing (11) and (12), the scaling removes from sBF the over-dependence on the prior and Bartlett's paradox observed in BF (Figure 3). The scaling is a function of λ, which takes values in [0, 1)^p. If there is a gap between λ_i and 1, then the ith component contributes modestly to the scaling. For example, when p = 1, the scaling factor sBF/BF is 1.5 when λ_1 = 0.8 and 2.0 when λ_1 = 0.9. When all λ_i → 0, the scaling factor approaches 1 and meanwhile sBF → 1, as expected; when all λ_i → 1, although the scaling factor blows up (sBF/BF → ∞), 2 log sBF is stable and 2 log sBF → χ²_p − p. Consider a multiple testing problem that tests H_1, H_2, . . . against H_0.
Each alternative model is concerned with testing a single covariate in association with the response variable, and each covariate has the same λ_1. Then sBF and BF produce the same ranking for all tests, because the scaling coefficient is determined solely by λ_1. In light of the connection between BF and p_F (Wakefield 2008; Guan and Stephens 2008), we have that, asymptotically, sBF and p_F produce the same ranking for all tests that have the same λ_1. When λ_1 differs across tests, however, the three statistics BF, sBF, and p_F (or p_B) can produce different rankings.
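The centering in (12) is a one-liner once the weights are known. The sketch below (our own naming) also reproduces the scaling factors quoted above, 1.5 at λ_1 = 0.8 and 2.0 at λ_1 = 0.9:

```python
# Scaled Bayes factor from (12): log sBF = log BF - E_0[log BF], where
# E_0[2 log BF] = sum_i (lam_i + log(1 - lam_i)) because E[Q_i] = 1 under the null.
import numpy as np

def log10_sbf(log10_bf, lams):
    lams = np.asarray(lams, dtype=float)
    e0 = (lams + np.log1p(-lams)).sum()        # E_0[2 log BF]
    return log10_bf - e0 / (2.0 * np.log(10.0))

def scaling_factor(lams):
    """sBF / BF, which depends only on the weights, not on the data."""
    lams = np.asarray(lams, dtype=float)
    return float(np.exp(-(lams + np.log1p(-lams)).sum() / 2.0))

print(scaling_factor([0.8]), scaling_factor([0.9]))  # roughly 1.5 and 2.0
```

Because the scaling factor is data-free, sBF and BF always agree on the ranking of tests that share the same weights, as stated above.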

sBF Disregards Informativeness of Covariates Under the Null
Let us focus on simple linear regression; the treatment of multi-linear regression can be found in the supplementary materials. Let V_b = σ_b² and X^t X = δ_1². Then we have λ_1 = δ_1²/(δ_1² + 1/σ_b²), so λ_1 can be taken as a measure of the informativeness of a covariate. In genetic association studies, an SNP's λ_1 is determined by its minor allele frequency and the prior effect size; for a fixed prior effect size, the larger the minor allele frequency, the larger the λ_1.
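To illustrate, under the simplifying assumption of Hardy-Weinberg equilibrium (our illustration, not an assumption stated in the paper), a centered genotype column with minor allele frequency f has X^t X ≈ 2f(1 − f)n, so λ_1 increases with the MAF:

```python
# Illustrative lambda_1 for a SNP (assumes Hardy-Weinberg equilibrium, so the
# variance of an allele count is 2*maf*(1-maf) and X^t X ~ 2*maf*(1-maf)*n).
import numpy as np

def snp_lambda1(maf, n, sigma_b):
    delta2 = 2.0 * maf * (1.0 - maf) * n
    return delta2 / (delta2 + 1.0 / sigma_b**2)

for maf in (0.01, 0.05, 0.2, 0.5):
    print(maf, snp_lambda1(maf, n=3000, sigma_b=0.2))
```

The function `snp_lambda1` is our own naming; the monotone increase in MAF is the property used in the propositions that follow.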
Proposition 2. Suppose two models H_1 and H_2 are each concerned with a single but different covariate, and H_1 is associated with the larger λ_1. Denote by BF_j and sBF_j the BF and sBF for model H_j (j = 1, 2). We have E_0[log BF_1] < E_0[log BF_2] and E_0[log sBF_1] = E_0[log sBF_2] = 0.
Although Proposition 2 is rudimentary to prove, the result is illuminating with respect to the difference between BF and sBF. Under the null, BF has a propensity to assign a larger value to a less informative covariate; in other words, BF penalizes a more informative covariate more heavily. On the other hand, sBF disregards the informativeness of a covariate under the null. This indifference of sBF to informativeness is advantageous under the alternative model (next section), because, loosely speaking, the over-penalty of BF on more informative covariates applies not just under the null but also under the alternative.

BF and sBF Under the Local Alternatives
The local alternatives are a sequence of alternatives that scale down the effect size as the sample size increases, so that the test statistic converges for large samples (see Ferguson 1996, chap. 22). The following theorem quantifies BF (and hence sBF) under the local alternatives.
Theorem 2. For multi-linear regression (1) with the NIG prior (2), the g-prior (7), or the normal prior (8), under the local alternatives b = β/√(nτ), and assuming either L^t L/n converges or L is entry-wise bounded, the Bayes factors given in (3) and (9) follow

2 log BF = Σ_{i=1}^p λ_i Q̃_i + Σ_{i=1}^p log(1 − λ_i) + ε,

where ε = o_P(1) and Q̃_i is a noncentral chi-squared random variable with d.f. = 1 and noncentrality parameter ρ_i. For the normal prior, ε = 0.
Note that in the above theorem, 2 log BF has the same form as in (10). The only difference is that under the local alternatives Q̃_1, . . . , Q̃_p become noncentral chi-squared random variables instead of the central chi-squared variables under the null. The assumptions on L guarantee the stochastic boundedness of ρ_i, permitting a Taylor expansion that leads to the linear approximation. These two assumptions are usually satisfied in practice, particularly in genetic association studies, where each entry of L is the allele count of an individual at a marker and is thus bounded by two.
Let us assume that the sample size is large enough that the error term in Theorem 2 can be safely ignored. For simple linear regression, we have 2 log BF = λ_1 Q_1 + log(1 − λ_1) and 2 log sBF = λ_1(Q_1 − 1), where Q_1 ∼ χ²_1(ρ_1) is a noncentral chi-squared random variable. Because E[Q_1] = ρ_1 + 1, we have E[2 log sBF] = λ_1 ρ_1, which is proportional to λ_1 for a fixed ρ_1. In other words, under the local alternatives, sBF tends to assign larger values to more informative covariates. On the other hand, E[2 log BF] = λ_1(ρ_1 + 1) + log(1 − λ_1) is not a monotonic function of λ_1 for a fixed ρ_1. Thus, the (unscaled) Bayes factor does not respect the informativeness of covariates under the alternative model.
The above proposition says that under the local alternatives the distribution of Q_1 (and hence of BF and sBF) is determined by λ_1. In (a), the alternative is evaluated at a fixed point, while in (b) it is averaged over the prior distribution of b. Because the power of a test is determined by the alternative distribution of Q_1 (for a fixed null), Proposition 3 suggests that the statistical power is positively correlated with λ_1. This result is simple yet profound. Suppose we are faced with two tests with equal p-values that suggest the null should be rejected. Without knowing the powers of the tests, we cannot decide which rejection is more reliable or carries more evidence (Stephens and Balding 2009). Now suppose two tests have different λ_1's but the same Q_1's; then the two tests have the same p-value. From 2 log sBF = λ_1(Q_1 − 1), the scaled Bayes factor has a propensity to assign a larger value to the test with the larger power (or the larger λ_1), a desirable property. This desirable property is missing for the unscaled Bayes factor.
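A small simulation (our own illustration and naming) confirms the relation E[2 log sBF] = λ_1 ρ_1 under the local alternatives:

```python
# Check E[2 log sBF] = lam1 * rho1 when Q1 ~ noncentral chi^2_1(rho1),
# using 2 log sBF = lam1 * (Q1 - 1).
import numpy as np

def mean_two_log_sbf(lam1, rho1, n_draws=400_000, seed=7):
    rng = np.random.default_rng(seed)
    q = rng.noncentral_chisquare(df=1, nonc=rho1, size=n_draws)
    return float((lam1 * (q - 1.0)).mean())

print(mean_two_log_sbf(0.8, 4.0))  # theory predicts 0.8 * 4.0 = 3.2
```

Holding ρ_1 fixed, the mean grows linearly in λ_1, which is the sense in which sBF rewards the more informative (higher-power) covariate.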

Applying to Ocular Hypertension GWAS Datasets
To illustrate how the scaled Bayes factor performs in real data analysis, we applied for access and downloaded two GWAS datasets from the database of Genotypes and Phenotypes (dbGaP). Both studies were funded by the National Eye Institute: one is the Ocular Hypertension Treatment Study (Kass et al. 2002) (henceforth OHTS, dbGaP accession number: phs000240.v1.p1), the other is National Eye Institute Human Genetics Collaboration Consortium Glaucoma Genome-Wide Association Study (Ulmer et al. 2012) (henceforth NEIGHBOR, dbGaP accession number: phs000238.v1.p1).
The phenotype of interest is intraocular pressure (IOP). The OHTS dataset contains only individuals with high IOP (> 21). The NEIGHBOR dataset is a case-control design for glaucoma (Ulmer et al. 2012; Weinreb, Aung, and Medeiros 2014), in which many samples have IOP measurements because high IOP is considered a major risk factor and a precursor phenotype for glaucoma. The NEIGHBOR dataset, however, contains case samples with small IOP and control samples with large IOP; since IOP and glaucoma evidently have different genetic bases, albeit with much overlap, we removed those samples. We also removed samples whose IOP measurements differ by more than 10 between the two eyes, since such a large difference is likely caused by a different mechanism, for example, physical accidents. The average IOP of the two eyes was used as the raw phenotype. We then performed routine quality control of the genotypes using the same procedure described by Xu and Guan (2014). OHTS and NEIGHBOR were genotyped on different SNP arrays, and 301,143 autosomal SNPs genotyped in both datasets passed QC. We then performed principal component analysis to remove outliers and extracted 3226 subjects (740 from OHTS and 2486 from NEIGHBOR) that clustered with the European samples in HapMap3 (The International HapMap Consortium 2010). We regressed out age, sex, and the six leading principal components from the raw phenotypes, quantile-normalized the residuals, and used them as the phenotypes for single-SNP analysis. We computed BF, sBF, and p_B using the prior σ_b = 0.2.
We first compared BF and sBF in different minor allele frequency (MAF) bins. Different MAF bins correspond to different bins of the informativeness (λ_1) of SNPs. Figure 4 shows that in each bin the scatter of log_10 sBF versus log_10 BF is roughly parallel to the line y = x, and, more importantly, the larger the MAF, the further the points lie from y = x; in other words, log_10 sBF − log_10 BF is larger, which agrees well with the definition of sBF (Figure 3) and fits the theoretical predictions (Propositions 2 and 3). Another noticeable feature of Figure 4 is that the minimum value of sBF is larger than that of BF, which is consistent with Proposition 1(a).
Next, we examined the ranking of SNPs by the different test statistics. Table 1 contains the top 20 SNPs ranked by BF; rows were then sorted according to each SNP's chromosome and position. Incidentally, the top two hits (rs7518099 and rs4656461) are the same for all three test statistics. The rankings by the three test statistics BF, sBF, and p_B are largely similar to one another, particularly so between BF and p_B. There is, however, the noticeable exception of SNP rs7696626, whose ranking by sBF is much worse than its rankings by BF and p_B. Not surprisingly, this SNP has the smallest MAF (0.023) among the 20 SNPs in Table 1. This example fits the theoretical observations made in Proposition 2. We permuted the phenotypes once and recomputed the test statistics; the permutation simulates under the null to confirm that E_0[log sBF] = 0. Evidently sBF is more invariant under permutation than BF, in the sense that log(sBF(y)/sBF(ỹ)) ≈ log sBF(y).
Our choice of σ_b = 0.2 represents the prior belief of a small but noticeable effect size in the context of GWAS (see Burton et al. 2007). To check how sensitive our results are to this choice, we redid the analysis using σ_b = 0.5. As predicted by our theory (Figure 3), we observed that with σ_b = 0.5, BF tends to be smaller and sBF tends to be larger, while p_B remains unchanged. Moreover, the rankings of the SNPs remained mostly unchanged between the two choices of σ_b (Table S1 in the supplementary materials).
Lastly, although it was not our main objective, we examined the top hits in the association result. Our analysis reproduced three known genetic associations for IOP. Namely, the TMCO1 gene on chromosome 1 (163.9M-164.0M), which was reported by van Koolwijk et al. (2012); a single hit rs2025751 in the PKHD1 gene on chromosome 6 (Hysi et al. 2014); and a single hit rs12150284 in the GAS7 gene on chromosome 17 (Ozel et al. 2014). A noticeable potentially novel finding is the gene PEX14 on chromosome 1. Two SNPs, rs12120962 and rs12127400, have modest association signals from BF and p B , but their scaled Bayes factors are noteworthy. PEX14 encodes an essential component of the peroxisomal import machinery. The protein interacts with the cytosolic receptor for proteins containing a PTS1 peroxisomal targeting signal. Incidentally, PTS1 is known to elevate the intraocular pressure (Shepard et al. 2007). In addition, a mutation in PEX14 results in one form of Zellweger syndrome, and for children who suffer from Zellweger syndrome, congenital glaucoma is a typical neonatal-infantile presentation (Klouwer et al. 2015).

Discussion
In this article, we quantified the null distribution of Bayes factors in the context of multi-linear regression. We showed that under the null, 2 log BF is distributed as a weighted sum of chi-squared random variables. This null distribution allows us to compute the p-value associated with a Bayes factor analytically, and we have developed a software package to do so efficiently. The package can be used in a wide range of applications, such as ridge regression, variance component models, and genetic association studies. The null distribution also allows us to study the properties of Bayes factors: we identified the dominant term in the Bayes factor that leads to its excessive prior-dependence and to Bartlett's paradox. We proposed the scaled Bayes factor, which depends less on the prior and is immune to Bartlett's paradox, and we studied its properties under the null and under local alternatives. Compared to BF, sBF is more responsive to the informativeness of the data.
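For intuition, the tail probability of a weighted sum of chi-squared random variables is easy to approximate by straightforward Monte Carlo. The following Python sketch is illustrative only (the function name is ours); the software package itself evaluates this tail probability with exact numerical methods.

```python
import math
import random

def pvalue_weighted_chisq(t, weights, shift=0.0, n_sim=200_000, seed=1):
    """Monte Carlo estimate of P(shift + sum_i w_i * X_i >= t),
    where the X_i are iid chi-squared(1) variables (squares of
    standard normals).  A simulation-based stand-in for the exact
    numerical methods used by the software package."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        s = shift + sum(w * rng.gauss(0.0, 1.0) ** 2 for w in weights)
        if s >= t:
            hits += 1
    return hits / n_sim

# Sanity check: with a single weight of 1 and no shift, the statistic
# is chi-squared(1), whose 0.95 quantile is about 3.841.
p = pvalue_weighted_chisq(3.841, [1.0])
```

Monte Carlo suffices for moderate significance levels, but the exceedingly small thresholds of genomewide studies require the exact methods, which is precisely why the software package exists.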
Very often the covariates L are inferred from a statistical model, for example, the imputed allele dosages of Guan and Stephens (2008) or the haplotype loading matrix of Xu and Guan (2014). One would like to take into account the uncertainty of the inferred L. In imputation-based association studies, one may compute the posterior mean of L and then perform the test. In haplotype association analysis (Xu and Guan 2014), however, using the posterior mean of L is impractical, not least because one realization of L may be a column permutation of another. A natural solution is to compute a Bayes factor for each realization of L and use the averaged Bayes factor as the test statistic. How, then, does one evaluate the p-value associated with the averaged Bayes factor? The same question arises after obtaining an averaged Bayes factor from multiple choices of σ_b.
Specifically, for the jth p-value p_j we set BF_j = (1 - λ)^{1/2} exp{λ F^{-1}(1 - p_j)/2}, where λ is the weight determined by the prior and F is the cumulative distribution function of χ_1^2. In essence, we converted each p-value to its associated Bayes factor, averaged the Bayes factors, and computed p_B of the averaged Bayes factor. Since averaging over Bayes factors is always valid and Theorem 1 provides the necessary connection between p-values and Bayes factors, our approach to combining correlated p-values appears to work well, at least for the aforementioned two examples, where the existing methods surely fail.
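To make the conversion concrete, the following Python sketch assumes the single-covariate form 2 log BF = λ z^2 + log(1 - λ), with λ in (0, 1) determined by the prior; the function names are ours, and mapping the averaged Bayes factor back through the same single-covariate null is only an approximation when the p-values are correlated.

```python
import math

def chi2_1_sf(t):
    # Survival function of chi-squared(1): P(X >= t) = erfc(sqrt(t/2)).
    return math.erfc(math.sqrt(t / 2.0))

def chi2_1_quantile(p):
    # Value q with P(X >= q) = p, found by bisection on the survival function.
    lo, hi = 0.0, 200.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_1_sf(mid) > p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def bf_from_pvalue(p, lam):
    # p-value -> Bayes factor under 2*log(BF) = lam * z^2 + log(1 - lam).
    z2 = chi2_1_quantile(p)
    return math.sqrt(1.0 - lam) * math.exp(lam * z2 / 2.0)

def pB_of_average_bf(pvals, lam):
    # Average the Bayes factors, then map the average back to a p-value
    # through the same single-covariate null distribution.
    bf_bar = sum(bf_from_pvalue(p, lam) for p in pvals) / len(pvals)
    t = 2.0 * math.log(bf_bar)
    return chi2_1_sf((t - math.log(1.0 - lam)) / lam)
```

With a single p-value the conversion round-trips exactly (p_B recovers p); with several correlated p-values, the averaged Bayes factor borrows strength across them.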
By definition, E_0[BF] = 1, a seemingly nice property: it suggests that BF favors neither the null nor the alternative when the data are simulated under the null. A careful examination of Proposition 2, however, reveals that this property effectively imposes a greater penalty on more informative covariates. The scaled Bayes factor instead satisfies E_0[log sBF] = 0, trading away the property E_0[BF] = 1 of the (unscaled) Bayes factor. An immediate consequence of this trade is that sBF favors the alternative over the null (by Jensen's inequality, or simply by Proposition 1(a)). We argue that the trade brings several benefits to sBF: it depends less on the prior; it is immune to Bartlett's paradox; and, more importantly, it is well calibrated with respect to permutation. Suppose we permute the response variable y once to obtain ỹ and compute the test statistic (either BF or sBF) with ỹ, treating it as the test statistic under the null. Then 2 log(sBF(y)/sBF(ỹ)) is expected to have the same mean as 2 log sBF(y) (Proposition 1(c)), whereas 2 log(BF(y)/BF(ỹ)) is expected to have a mean shifted from that of 2 log BF(y). We believe this better calibration with respect to permutation will make sBF a better test statistic for Bayesian variable selection regression (Guan and Stephens 2011).
In genetic association studies, one routinely performs millions of simple linear regressions to test for association between each genetic variant and the phenotype. In general, sBF and p_B produce different rankings of the variants, because their corresponding λ_1's differ. When a special prior, σ_b ∝ 1/δ_1, is used, however, sBF (and BF) produces the same ranking as p_B (Wakefield 2008; Guan and Stephens 2008). This prior, which produces the same λ_1 for all covariates, somewhat defeats the purpose of specifying a prior, because it practically eliminates the effect of a variant's variance on its test statistic. In multi-linear regression, such a special prior is the g-prior, which sets every λ_i to g/(g + 1). In some sense, ranking variants using sBF is more "informative" than using p_B, and we would like to control the false discovery rate (FDR) for sBF. One approach is to specify the prior odds, multiply the prior odds by BF or sBF to obtain the posterior odds, and then compute the posterior probability of association (PPA) for each variant. But the specification of the prior odds is somewhat arbitrary, and it unfortunately has a strong influence on the PPA. An alternative is to develop a procedure similar to that of Benjamini and Hochberg (1995). The Benjamini-Hochberg procedure relies on the null distribution of the p-values (which is uniform) to control FDR, although the p-value may not be the optimal statistic for this purpose (Sun and Cai 2009). Since we now know the null distribution of sBF (and BF), we can estimate the expected FDR for sBF (and BF) directly. Such an FDR-controlling procedure will provide an alternative solution to "calibrating" Bayes factors (either scaled or unscaled), and it will strengthen the "Bayes/non-Bayes compromise," which is likely to attract more practitioners to apply Bayesian methods in their studies.
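A minimal sketch of such an FDR estimate follows (in Python; the function names are ours, and the analytic null tail probability is assumed to be available once the null distribution of the statistic is known).

```python
import math

def estimate_fdr(stats_obs, null_tail):
    """Estimated FDR when each observed statistic is used as a threshold:
    FDR(t) ~ m * P0(S >= t) / #{j : S_j >= t}, where null_tail gives the
    analytic null tail probability P0(S >= t).  The Benjamini-Hochberg
    step-up monotonization is omitted for brevity."""
    m = len(stats_obs)
    order = sorted(range(m), key=lambda j: -stats_obs[j])
    fdr = [0.0] * m
    for rank, j in enumerate(order, start=1):
        fdr[j] = min(1.0, m * null_tail(stats_obs[j]) / rank)
    return fdr

# Illustration with a chi-squared(1) null tail: a very large statistic
# receives a tiny estimated FDR.
tail = lambda t: math.erfc(math.sqrt(t / 2.0))
fdr = estimate_fdr([25.0, 1.0, 0.5], tail)
```

The same routine applies to 2 log sBF (or 2 log BF) once null_tail is replaced by the tail probability of the appropriate weighted sum of chi-squared variables.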

Supplementary Materials
The online supplementary materials include proofs of the theorems, R code for computing Bayes factors and scaled Bayes factors, and a table summarizing the top single-SNP association results using σ_b = 0.5 for the IOP dataset presented in Section 5. Our software package for computing p-values for a weighted sum of chi-squared random variables is freely available at https://github.com/haplotype/BACH and http://www.haplotype.org.