Evidential Calibration of Confidence Intervals
S. Pawel, A. Ly, E.-J. Wagenmakers

We present a novel and easy-to-use method for calibrating error-rate based confidence intervals to evidence-based support intervals. Support intervals are obtained from inverting Bayes factors based on a parameter estimate and its standard error. A $k$ support interval can be interpreted as "the observed data are at least $k$ times more likely under the included parameter values than under a specified alternative". Support intervals depend on the specification of prior distributions for the parameter under the alternative, and we present several types that allow different forms of external knowledge to be encoded. We also show how prior specification can to some extent be avoided by considering a class of prior distributions and then computing so-called minimum support intervals which, for a given class of priors, have a one-to-one mapping with confidence intervals. We also illustrate how the sample size of a future study can be determined based on the concept of support. Finally, we show how the bound for the type I error rate of Bayes factors leads to a bound for the coverage of support intervals. An application to data from a clinical trial illustrates how support intervals can lead to inferences that are both intuitive and informative.


Introduction
A pervasive problem in data analysis is to draw inferences about unknown parameters of statistical models. For instance, data analysts are often interested in identifying a set of parameter values which are relatively compatible with the observed data. Here we focus on a particular method for doing so, the support set, that arguably represents a natural evidential answer to the problem both from a likelihoodist (Edwards, 1971; Royall, 1997; Blume, 2002) and a Bayesian (Wagenmakers et al., 2022) point of view. In either paradigm, statistical evidence may be defined via the Law of Likelihood (Hacking, 1965), that is, data constitute evidence for one parameter value over an alternative parameter value if the likelihood of the data under that parameter value is larger than under the alternative parameter value. The likelihood ratio (or Bayes factor) measures the strength of evidence, and it also plays a central role in the construction of support sets, as we will explain in the following.
Let $f(x \mid \theta)$ denote the likelihood of the observed data $x$. Let $\theta$ be an unknown parameter and denote by
$$\mathrm{BF}_{01}(x; \theta_0) = \frac{f(x \mid H_0)}{f(x \mid H_1)} = \frac{f(x \mid \theta_0)}{\int f(x \mid \theta)\, f(\theta \mid H_1)\, \mathrm{d}\theta} \tag{1}$$
the Bayes factor quantifying the strength of evidence which the observed data $x$ provide for the simple null hypothesis $H_0\colon \theta = \theta_0$ relative to a (possibly composite) alternative hypothesis $H_1\colon \theta \neq \theta_0$, with $f(x \mid H_1)$ the marginal likelihood of $x$ obtained from integrating the likelihood $f(x \mid \theta)$ with respect to the prior density of the parameter $f(\theta \mid H_1)$ under the alternative $H_1$ (Jeffreys, 1961; Kass and Raftery, 1995). For constructing a support interval, one views the Bayes factor (1) as a function of the null value $\theta_0$ for fixed data $x$. A $k$ support set for $\theta$ is then given by the set of parameter values under which the data are at least $k$ times more likely than under the alternative hypothesis $H_1$ (Wagenmakers et al., 2022), that is,
$$\mathrm{SI}_k = \{\theta_0 : \mathrm{BF}_{01}(x; \theta_0) \geq k\}. \tag{2}$$
The support set thus includes the parameter values for which the observed data provide statistical evidence of at least level $k$.
Figure 1 illustrates different support sets (in this case intervals) for a log hazard ratio parameter $\theta$ quantifying the effect of the drug dexamethasone on the mortality of hospitalized patients with Covid-19 enrolled in the RECOVERY trial (RECOVERY Collaborative Group, 2021). Also shown is the Bayes factor for testing $H_0\colon \theta = \theta_0$ versus $H_1\colon \theta \neq \theta_0$ viewed as a function of the null value $\theta_0$. A $k$ support set is obtained from "cutting" this function at height $k$, and taking the parameter values with a Bayes factor of at least $k$ as part of the set. In practice, it is not clear which value of $k$ should be chosen. One possibility is to select $k$ based on conventional classifications of Bayes factors or likelihood ratios. Table 1 lists three of them. For instance, using the classification from Jeffreys (1961, Appendix B), the $k = 10$ support interval ranging from $-0.27$ to $-0.1$ can be interpreted to contain log hazard ratios that are strongly supported by the data, whereas the $k = 1/10$ support interval ranging from $-0.37$ to $0$ can be interpreted to contain log hazard ratios that are at least not strongly contradicted by the data. The construction of support sets thus parallels the construction of frequentist confidence sets: a $(1 - \alpha)100\%$ confidence set corresponds to the set of parameter values which are not rejected by a null hypothesis significance test at level $\alpha$. It can equally be displayed and obtained from a so-called p-value function, which is the p-value of the data viewed as a function of the null value (Fraser, 2019; Rafi and Greenland, 2020). Despite these similarities, the interpretation of support and confidence sets is rather different; support sets contain parameter values for which there is at least a certain amount of statistical evidence, whereas confidence sets are defined through the long-run frequency of including the unknown parameter $\theta$ with probability equal to their confidence level. The parameter values in a confidence set are typically interpreted as being "compatible" with a particular data set, but this is debatable as the confidence level is concerned with the confidence set as a procedure over multiple replications.
Although support sets are conceptually simple and intuitive, they have not been applied to many problems. It is also unclear how they relate to the more widely used confidence sets. In this article we thus shed light on the connection between support and confidence sets. Specifically, we provide methods for calibrating approximate confidence sets to approximate support sets and vice versa in the important case when the data consist of an estimate of a univariate parameter $\theta$ with approximate normal likelihood (Section 2). To do so, we derive novel and easy-to-use formulas for computing support intervals that only require summary statistics typically reported in research articles, e.g., point estimates, standard errors, or confidence intervals. This scenario is highly relevant as a large part of commonly used estimators satisfy the approximate normality assumption, and also because one often does not have access to the raw data but only the summary statistics. Computing a support interval requires the specification of a prior distribution for $\theta$ under the alternative $H_1$, and we compare several classes of distributions. We also show how bounding the evidence against the null hypothesis for a certain class of prior distributions leads to the novel concept of a minimum support set. Our minimum support sets are directly related to well-known bounds of Bayes factors (Berger and Sellke, 1987; Sellke et al., 2001; Held and Ott, 2018). In Section 3, we show how minimum support sets provide confidence sets with an evidential interpretation with respect to certain classes of priors. We then illustrate how the sample size of a future study can be determined based on support, which provides a novel alternative to the conventional approaches based on either power or precision of an interval estimator (Section 5). Finally, we show how the universal bound for the type I error rate of Bayes factors can be used for bounding the coverage of support sets, even under sequential analyses with optional stopping (Section 6). As a running example, we use data from the RECOVERY trial (RECOVERY Collaborative Group, 2021), as already introduced in Figure 1.

Support intervals under normality
Denote by $\hat{\theta}$ an asymptotically normal estimator of an unknown univariate parameter $\theta$, possibly the maximum likelihood estimator (MLE). Suppose its squared standard error $\sigma^2$ is an estimate of the asymptotic variance of $\hat{\theta}$, so that an approximate normal likelihood $\hat{\theta} \mid \theta \sim \mathrm{N}(\theta, \sigma^2)$ is justifiable. For example, $\hat{\theta}$ could be an estimated regression coefficient from a generalized linear model and $\sigma$ its standard error. In many simple settings, the standard error is of the form $\sigma = \lambda/\sqrt{n}$, where $\lambda^2$ is the variance corresponding to one effective unit and $n$ is the effective sample size, for example, the number of measurements or the number of events (Spiegelhalter et al., 2004, Section 2.4); see also Berger et al. (2013) for a generalization of effective sample size to more complex settings with dependent data. An approximate $(1 - \alpha)100\%$ confidence interval for $\theta$ is given by
$$\hat{\theta} \pm \sigma \times \Phi^{-1}(1 - \alpha/2) \tag{3}$$
with $\Phi^{-1}(\cdot)$ the quantile function of the standard normal distribution. The confidence level $(1 - \alpha)100\%$ represents the long-run frequency with which the true parameter is included in the confidence interval (assuming that the sampling model is correct). Note that the interval (3) also corresponds to the $(1 - \alpha)100\%$ posterior credible interval based on an (improper) uniform prior for $\theta$, corresponding to Jeffreys's transformation invariant prior (Jeffreys, 1961; Ly et al., 2017) and thus also representing the default interval estimate for $\theta$ from a Bayesian estimation perspective. We will now contrast the confidence interval (3) with several types of support intervals.

Normal prior under the alternative
To construct a support interval for $\theta$ using the data summary $\hat{\theta}$ with $\hat{\theta} \mid \theta \sim \mathrm{N}(\theta, \sigma^2)$, specification of a prior for $\theta$ under the alternative $H_1$ is required. Specifying a normal prior $\theta \mid H_1 \sim \mathrm{N}(\mu_\theta, \sigma^2_\theta)$ leads to the Bayes factor
$$\mathrm{BF}_{01}(\hat{\theta}; \theta_0) = \sqrt{1 + \frac{\sigma^2_\theta}{\sigma^2}}\, \exp\left[-\frac{1}{2}\left\{\frac{(\hat{\theta} - \theta_0)^2}{\sigma^2} - \frac{(\hat{\theta} - \mu_\theta)^2}{\sigma^2 + \sigma^2_\theta}\right\}\right]. \tag{4}$$
Equating (4) to $k$ and solving for $\theta_0$ gives the $k$ support interval
$$\hat{\theta} \pm \sigma \times \sqrt{\log\left(1 + \frac{\sigma^2_\theta}{\sigma^2}\right) + \frac{(\hat{\theta} - \mu_\theta)^2}{\sigma^2 + \sigma^2_\theta} - \log k^2}. \tag{5}$$
Similar to the confidence interval (3), the support interval (5) is centered around the parameter estimate $\hat{\theta}$. However, while the width of the confidence interval is only determined through the confidence level $(1 - \alpha)100\%$ and standard error $\sigma$, the width of the support interval also depends on the specified prior for $\theta$ under $H_1$. Moreover, for $k > 1$ it may happen that the support interval is empty, as the term below the square root in (5) becomes negative when $k$ is too large. This means that in order to find the desired level of support $k > 1$, the data have to be sufficiently informative (relative to the prior), i.e., the squared standard error $\sigma^2$ has to be sufficiently small relative to the prior variance $\sigma^2_\theta$. In the following, we will discuss how different prior means $\mu_\theta$ and variances $\sigma^2_\theta$ affect the resulting support intervals. When the prior variance decreases ($\sigma^2_\theta \downarrow 0$), the prior approaches a point mass at $\mu_\theta$. The width of the support interval is then fully determined by the difference between the parameter estimate $\hat{\theta}$ and the prior mean $\mu_\theta$ divided by the standard error $\sigma$. A smaller difference between $\hat{\theta}$ and $\mu_\theta$ leads to a tighter support interval. In contrast, for priors that become increasingly diffuse ($\sigma^2_\theta \to \infty$), the $k \geq 1$ support interval (5) extends to the entire real line, indicating that all values $\theta \in \mathbb{R}$ receive more support from the data than the diffuse alternative, regardless of the data, i.e., the observed estimate $\hat{\theta}$, standard error $\sigma$, and the location of the prior mean $\mu_\theta$. This particular behavior provides another perspective on the well-known Jeffreys-Lindley paradox (Wagenmakers and Ly, 2023); the confidence interval from (3) only spans a finite range around the parameter estimate $\hat{\theta}$, so that the corresponding null hypothesis significance tests would reject the parameter values outside, whereas for the same values the Bayes factor would indicate evidence for the null hypothesis. Finally, centering the prior around the parameter estimate ($\mu_\theta = \hat{\theta}$) and setting the prior variance equal to the variance of one effective observation ($\sigma^2_\theta = n \times \sigma^2$ with $n$ the effective sample size) produces the support interval for Jeffreys's approximate Bayes factor (Wagenmakers, 2022), which is equal to the well-known approximation of the Bayes factor based on the Bayesian information criterion (Raftery, 1999). In this case, the standard error multiplier has the particularly simple form $M = \sqrt{\log(1 + n) - \log k^2}$, showing that at least $n \geq k^2 - 1$ effective observations are required for the respective support interval with $k \geq 1$ to be non-empty.
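The computation of the support interval (5) from summary statistics can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the authors' ciCalibrate implementation; the function names and the numerical inputs (estimate, standard error) are hypothetical and chosen for illustration only.

```python
import math

def bf01_normal(est, se, null, prior_mean, prior_var):
    """Bayes factor (4) for H0: theta = null against H1: theta ~ N(prior_mean, prior_var)."""
    log_bf = 0.5 * math.log(1 + prior_var / se**2) - 0.5 * (
        (est - null) ** 2 / se**2 - (est - prior_mean) ** 2 / (se**2 + prior_var)
    )
    return math.exp(log_bf)

def support_interval_normal(est, se, k, prior_mean, prior_var):
    """k support interval (5) under a normal prior; None if the interval is empty."""
    m_squared = (
        math.log(1 + prior_var / se**2)
        + (est - prior_mean) ** 2 / (se**2 + prior_var)
        - 2 * math.log(k)
    )
    if m_squared < 0:
        return None  # data not informative enough for this support level
    m = math.sqrt(m_squared)  # standard error multiplier
    return est - m * se, est + m * se

# Hypothetical summary statistics (illustrative values, not taken from the paper)
est, se = -0.19, 0.05
si = support_interval_normal(est, se, k=10, prior_mean=-0.22, prior_var=4.0)
print(si)

# Prior of Jeffreys's approximate Bayes factor: mean = estimate, variance = n * se^2.
# With k = 10 the interval is empty below n = k^2 - 1 = 99 effective observations.
lam = 2.0  # assumed unit standard deviation
for n in (98, 120):
    se_n = lam / math.sqrt(n)
    print(n, support_interval_normal(0.0, se_n, k=10, prior_mean=0.0, prior_var=n * se_n**2))
```

By construction, the Bayes factor (4) evaluated at either interval limit equals the support level $k$, which provides a simple check of the implementation.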

Local normal prior under the alternative
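The support interval based on the normal prior (5) depends on the specification of a prior mean and prior variance. A different approach is to use a so-called local prior, that is, a unimodal and symmetric prior centered around the null value $\theta_0$. Specifying the local normal prior $\theta \mid H_1 \sim \mathrm{N}(\theta_0, \sigma^2_\theta)$ leads to the Bayes factor
$$\mathrm{BF}_{01}(\hat{\theta}; \theta_0) = \sqrt{1 + \frac{\sigma^2_\theta}{\sigma^2}}\, \exp\left\{-\frac{1}{2}\, \frac{(\hat{\theta} - \theta_0)^2}{\sigma^2 (1 + \sigma^2/\sigma^2_\theta)}\right\} \tag{6}$$
and the corresponding $k$ support interval
$$\hat{\theta} \pm \sigma \times \sqrt{\left\{\log\left(1 + \frac{\sigma^2_\theta}{\sigma^2}\right) - \log k^2\right\}\left(1 + \frac{\sigma^2}{\sigma^2_\theta}\right)}. \tag{7}$$
Although the Bayes factor (6) is the special case of the Bayes factor (4) with $\mu_\theta = \theta_0$, the support interval (7) is not a special case of the support interval (5). This is because the prior for $\theta$ under $H_1$ is different for each null value $\theta_0$, whereas it is always the same under the two-parameter normal prior approach. To fully specify the support interval (7), the prior variance $\sigma^2_\theta$ needs to be chosen. One standard choice is to set it equal to the variance of a single observation ($\sigma^2_\theta = n \times \sigma^2$), known as the unit-information prior (Kass and Wasserman, 1995). This approach leads to the $k$ support interval
$$\hat{\theta} \pm \sigma \times \sqrt{\left\{\log(1 + n) - \log k^2\right\}\left(1 + \frac{1}{n}\right)}. \tag{8}$$
For this type of support interval, the standard error multiplier $M = \sqrt{\{\log(1 + n) - \log k^2\}(1 + 1/n)}$ is larger than for Jeffreys's approximate Bayes factor by a factor of $\sqrt{1 + 1/n}$, but the condition $n \geq k^2 - 1$ for the $k \geq 1$ support interval to be non-empty is the same.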
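As a brief sketch under the same normality assumptions, the unit-information support interval can be computed from an estimate, its standard error, and the effective sample size alone. The function names and the input values below are hypothetical; the Bayes factor function is included only to check that it equals $k$ at the interval limits.

```python
import math

def bf01_local_normal(est, se, null, prior_var):
    """Bayes factor for H0: theta = null against the local prior theta ~ N(null, prior_var)."""
    g = prior_var / se**2  # relative prior variance
    z_squared = (est - null) ** 2 / se**2
    return math.sqrt(1 + g) * math.exp(-0.5 * z_squared / (1 + 1 / g))

def support_interval_unit_information(est, se, k, n):
    """k support interval under a local normal prior with variance n * se^2."""
    m_squared = (math.log(1 + n) - 2 * math.log(k)) * (1 + 1 / n)
    if m_squared < 0:
        return None  # requires n >= k^2 - 1 for k >= 1
    m = math.sqrt(m_squared)
    return est - m * se, est + m * se

# Hypothetical inputs: se = lambda / sqrt(n) with lambda = 2 and n = 1600
est, se, n = -0.19, 0.05, 1600
si = support_interval_unit_information(est, se, k=1, n=n)
print(si)
print(bf01_local_normal(est, se, si[0], prior_var=n * se**2))  # close to k = 1
```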

Nonlocal normal moment prior under the alternative
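Another attractive class of priors for $\theta$ under the alternative is given by so-called nonlocal priors. These priors are characterized by having zero density at the null value $\theta_0$, thereby leading to a faster accumulation of evidence than local priors when the null hypothesis is actually true (Johnson and Rossell, 2010). One popular type of nonlocal priors is given by normal moment priors $\theta \sim \mathrm{NM}(\theta_0, \sigma_\theta)$, with symmetry point $\theta_0$ and spread $\sigma_\theta$, which have density $f(\theta \mid H_1) = \mathrm{N}(\theta \mid \theta_0, \sigma^2_\theta) \times (\theta - \theta_0)^2/\sigma^2_\theta$, where $\mathrm{N}(\cdot \mid \theta_0, \sigma^2_\theta)$ denotes the density function of a normal distribution with mean $\theta_0$ and variance $\sigma^2_\theta$. The Bayes factor employing a normal moment prior under the alternative is
$$\mathrm{BF}_{01}(\hat{\theta}; \theta_0) = \left(1 + \frac{\sigma^2_\theta}{\sigma^2}\right)^{3/2} \exp\left\{-\frac{1}{2}\, \frac{(\hat{\theta} - \theta_0)^2}{\sigma^2 (1 + \sigma^2/\sigma^2_\theta)}\right\} \left\{1 + \frac{(\hat{\theta} - \theta_0)^2}{\sigma^2 (1 + \sigma^2/\sigma^2_\theta)}\right\}^{-1}$$
from which the corresponding $k$ support interval can be derived to be
$$\hat{\theta} \pm \sigma \times \sqrt{\left\{2\, W_0\left(\frac{(1 + \sigma^2_\theta/\sigma^2)^{3/2} \sqrt{e}}{2k}\right) - 1\right\}\left(1 + \frac{\sigma^2}{\sigma^2_\theta}\right)} \tag{9}$$
with $W_0(\cdot)$ denoting the principal branch of the Lambert W function. The Lambert W function is the (complex) multivalued function $W(\cdot)$ satisfying $W(x) \exp\{W(x)\} = x$. For real $x$, it is defined for $x \in [-1/e, \infty)$. For $x \geq 0$ the function has a unique value, whereas in the interval $x \in (-1/e, 0)$, the function has two branches: $W_0(x) > -1$ for all $x \in (-1/e, 0)$, termed the principal branch, and $W_{-1}(x) < -1$ for all $x \in (-1/e, 0)$; see Corless et al. (1996) for more details. It is possible that the support interval (9) is empty, as for the other two types of support intervals. This happens when the Lambert W term is smaller than one half, so that the term under the square root is negative. Since $W_0(0.82) \approx 1/2$, this situation occurs when $(1 + \sigma^2_\theta/\sigma^2)^{3/2} \sqrt{e}/(2k) < \sqrt{e}/2 \approx 0.82$, that is, when $(1 + \sigma^2_\theta/\sigma^2)^{3/2} < k$, meaning that the standard error $\sigma$ has to be sufficiently small relative to the prior spread parameter $\sigma_\theta$ and the support level $k$ for the interval to be non-empty.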
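The Lambert W based interval can be sketched as follows. To keep the example free of third-party dependencies, the principal branch is computed with a simple bisection helper (a stand-in for a library routine such as `scipy.special.lambertw`); the helper, the function names, and the numerical inputs are assumptions made for illustration.

```python
import math

def lambert_w0(x):
    """Principal branch W0(x) for x >= 0, by bisection on w * exp(w) = x."""
    lo, hi = 0.0, math.log1p(x)  # W0(x) <= log(1 + x) for x >= 0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid * math.exp(mid) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def bf01_moment(est, se, null, spread):
    """Bayes factor for H0: theta = null against a normal moment prior NM(null, spread)."""
    g = spread**2 / se**2
    u = (est - null) ** 2 / se**2 / (1 + 1 / g)
    return (1 + g) ** 1.5 * math.exp(-0.5 * u) / (1 + u)

def support_interval_moment(est, se, k, spread):
    """k support interval under a normal moment prior; None if empty."""
    g = spread**2 / se**2
    w = lambert_w0((1 + g) ** 1.5 * math.sqrt(math.e) / (2 * k))
    m_squared = (2 * w - 1) * (1 + 1 / g)
    if m_squared < 0:
        return None  # happens when (1 + g)^(3/2) < k
    m = math.sqrt(m_squared)
    return est - m * se, est + m * se

# Hypothetical inputs (illustrative values)
est, se, spread = -0.19, 0.05, 0.28
si = support_interval_moment(est, se, k=10, spread=spread)
print(si)
print(bf01_moment(est, se, si[0], spread))  # close to k = 10
```

Evaluating the Bayes factor at the interval limits and recovering $k$ is a useful sanity check, since the closed-form inversion is easy to get wrong.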
Figure 2: [...] assumed in all cases. The prior scale/spread parameter is set to $\sigma_\theta = 2$. The normal prior (correct mean) has a mean equal to the parameter estimate $\hat{\theta}$, while the normal prior (wrong mean) has a mean one standard deviation $\lambda = 2$ away from $\hat{\theta}$.

Comparison of priors
To better understand the advantages and disadvantages of the previously discussed priors, the resulting support intervals can be compared in terms of their width as a function of the sample size $n$ (Figure 2 top). For small sample sizes, the normal prior with mean equal to the observed parameter estimate produces the narrowest $k = 1$ support intervals, followed by the local normal prior, the normal prior with mean one standard deviation away from the observed estimate, and lastly the nonlocal normal moment prior. Thus, a well-chosen normal prior can increase the precision of support inference, whereas a poorly chosen normal prior can decrease precision.
However, the differences in width between the priors mostly disappear with increasing sample size. In the realistic range between ten and a few hundred samples, the local normal prior seems to be a reasonable default choice, as it leads to support intervals almost as narrow as the normal (correct mean) prior, without the need to specify a mean.
Another aspect in which the priors can be compared is the highest support level $k$ for which the resulting support intervals are non-empty (Figure 2 bottom). We see that for the same sample size $n$, the highest support levels from the normal and local normal priors are similar and show the same growth rates. In contrast, the highest support level from the nonlocal moment prior is higher and grows much faster. This is expected because nonlocal priors are designed to produce Bayes factors with faster accumulation of evidence for the null hypothesis. Thus, although nonlocal moment priors result in wider support intervals than the other priors, for small sample sizes they may be the only type of prior that can produce a support interval at, say, Jeffreys's strong evidence level $k = 10$.

Support intervals based on Bayes factor bounds
In some situations it is clear which prior for $\theta$ should be chosen under the alternative $H_1$, e.g., when a parameter estimate from a previous data set is available. In other situations it is less clear and different priors may produce drastically different results. To provide a more objective assessment of evidence in the latter situation, several authors have proposed to instead specify only a class of prior distributions and then select the one prior among them that leads to the Bayes factor providing the strongest possible evidence against the null hypothesis $H_0$ (Edwards et al., 1963; Berger and Sellke, 1987; Sellke et al., 2001; Held and Ott, 2018). Here we refer to these Bayes factor bounds as minimum Bayes factors for the null $H_0$ over the alternative $H_1$, as we are interested in the support for null values $\theta_0$.
We will now show how minimum Bayes factors can be used for obtaining so-called minimum support sets. Specifically, a $k$ minimum support set is given by
$$\mathrm{minSI}_k = \{\theta_0 : \mathrm{minBF}_{01}(x; \theta_0) \geq k\} \tag{10}$$
where $\mathrm{minBF}_{01}(x; \theta_0)$ is the smallest possible Bayes factor for testing $H_0\colon \theta = \theta_0$ versus $H_1\colon \theta \neq \theta_0$ that can be obtained from a class of prior distributions for $\theta$ under the alternative $H_1$. That is, given the data, for each $\theta_0$ the prior for $\theta$ under $H_1$ is cherry-picked from a class of priors to obtain the lowest evidence for $H_0\colon \theta = \theta_0$ possible. Minimum support intervals thus provide a Bayes/non-Bayes compromise (Good, 1992) as they do not require specification of a specific prior distribution but still allow for an evidential interpretation of the resulting interval.
One property of minimum Bayes factors is that they can only be used to assess the maximum evidence against the null hypothesis but not for it. Minimum support sets inherit this property, meaning that they can only be obtained for support levels $k \leq 1$. For instance, a $k = 1/3$ minimum support set includes the parameter values under which the observed data are at most 3 times less likely than under any prior from the specified class of alternatives. Being unable to obtain support intervals with $k > 1$ is the price that needs to be paid for having to specify only a class of prior distributions but not a specific prior itself. We will now discuss minimum support intervals from several important classes of distributions.

Class of all distributions under the alternative
Among the class of all possible priors under $H_1$, the prior which is most favorable towards the alternative is a point mass at the observed effect estimate, $H_1\colon \theta = \hat{\theta}$ (Edwards et al., 1963). The resulting minimum Bayes factor is given by
$$\mathrm{minBF}_{01}(\hat{\theta}; \theta_0) = \exp\left\{-\frac{(\hat{\theta} - \theta_0)^2}{2\sigma^2}\right\} \tag{11}$$
for which twice the negative log equals the standard likelihood ratio test statistic when $\hat{\theta}$ is the MLE. Inverting (11) for $\theta_0$ leads to the $k$ minimum support interval
$$\hat{\theta} \pm \sigma \times \sqrt{-2 \log k}. \tag{12}$$
Interestingly, defining a support interval relative to the likelihood of the data under the MLE has already been suggested by Fisher (1956). Table 1 shows Fisher's classification of evidence for this type of interval. Royall also made use of the minimum support interval (12), usually with support levels $k = 1/8$ and $k = 1/32$. He noted: "The 1/8 and 1/32 likelihood intervals are not confidence intervals, in general, but they truly represent what confidence intervals are often mistaken to represent, namely parameter values that the sample does not represent evidence against, that is, values that are 'consistent with the observations'. We can speak in this way, asserting that there is not strong evidence against a point inside the interval, without reference to an alternative value, because the statement is true for all alternatives. Every point inside the 1/8 interval is consistent with the observations in the strong sense that there is no other possible value of the parameter that is better supported by a factor as large as 8" (Royall, 1997, p. 101).
While we agree that the support interval (12) is a useful bound, it is important to note that from a Bayesian perspective it represents the most blatantly biased assessment of support in the sense that assigning a point prior at the observed parameter estimate hardly reflects prior knowledge about $\theta$ but can rather be considered cheating (Berger and Sellke, 1987). This is reflected by the fact that for a given estimate (i.e., data set) and fixed support level $k$, the interval represents the narrowest support interval among all possible support intervals. When minimizing over the class of all two-parameter normal priors, i.e., the Bayes factor (4), we also obtain the same minimum Bayes factor (11) and consequently the same minimum support interval (12).

Class of local normal alternatives
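When the class of priors for $\theta$ under the alternative $H_1$ is given by normal distributions centered around the null value $\theta_0$, choosing its variance to be $\sigma^2_\theta = \max\{(\hat{\theta} - \theta_0)^2 - \sigma^2, 0\}$ maximizes the marginal likelihood of the data under $H_1$. Plugging this variance into the Bayes factor (6) leads to the minimum Bayes factor over the class of local normal priors
$$\mathrm{minBF}_{01}(\hat{\theta}; \theta_0) = \begin{cases} |z| \exp(-z^2/2) \sqrt{e} & \text{if } |z| > 1 \\ 1 & \text{otherwise} \end{cases} \qquad \text{with } z = \frac{\hat{\theta} - \theta_0}{\sigma}, \tag{13}$$
as first shown by Edwards et al. (1963). Equating (13) to $k$ and solving for $\theta_0$ then leads to the $k$ minimum support interval
$$\hat{\theta} \pm \sigma \times \sqrt{-W_{-1}\left(-\frac{k^2}{e}\right)} \tag{14}$$
with $W_{-1}(\cdot)$ the branch of the Lambert W function that satisfies $W(y) < -1$ for $y \in (-e^{-1}, 0)$.
For $k = 1$, the standard error multiplier becomes $M = \sqrt{-W_{-1}(-1/e)} = 1$. Hence, the data provide support for all parameter values within one standard error around the observed parameter estimate $\hat{\theta}$ when the class of priors for the parameter is given by local normal alternatives.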
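The inversion step that produces the Lambert W term can be spelled out as follows. This is a worked derivation under the assumption that the minimum Bayes factor over local normal priors has the form $|z| \exp(-z^2/2) \sqrt{e}$ for $|z| > 1$ (Edwards et al., 1963), with $z = (\hat{\theta} - \theta_0)/\sigma$ introduced here as shorthand.

```latex
\begin{align*}
|z| \exp(-z^2/2) \sqrt{e} &= k
  && \text{(boundary of the $k$ minimum support interval)} \\
z^2 \exp(-z^2) &= k^2 / e
  && \text{(square both sides)} \\
(-z^2) \exp(-z^2) &= -k^2 / e
  && \text{(multiply by $-1$)} \\
-z^2 &= W_{-1}(-k^2 / e)
  && \text{(definition of $W$; branch $-1$ since $-z^2 \leq -1$)} \\
|z| &= \sqrt{-W_{-1}(-k^2 / e)}
  && \text{(standard error multiplier $M$)}
\end{align*}
```

The branch $W_{-1}$ is the relevant one because the bound is only informative for $|z| \geq 1$, where $-z^2 \leq -1$; for $k = 1$ the argument is $-1/e$, both branches coincide at $-1$, and the multiplier equals one.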

Class of p-based alternatives
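The $-e\,p \log p$ calibration of Sellke et al. (2001) bounds the Bayes factor from below by
$$\mathrm{minBF}_{01}(\hat{\theta}; \theta_0) = \begin{cases} -e\, p \log p & \text{if } p \leq 1/e \\ 1 & \text{otherwise} \end{cases} \qquad \text{with } p = 2\left\{1 - \Phi\left(\frac{|\hat{\theta} - \theta_0|}{\sigma}\right)\right\} \tag{15}$$
the two-sided p-value for testing $H_0\colon \theta = \theta_0$. Equating (15) to $k$ and solving for $\theta_0$ leads to the $k$ minimum support interval
$$\hat{\theta} \pm \sigma \times \Phi^{-1}\left[1 - \frac{\exp\{W_{-1}(-k/e)\}}{2}\right]. \tag{16}$$
For $k = 1$, the standard error multiplier is given by $M = \Phi^{-1}[1 - \exp\{W_{-1}(-1/e)\}/2] = \Phi^{-1}\{1 - 1/(2e)\} \approx 0.90$, so the $k = 1$ minimum support interval is just slightly tighter than the one based on local normal alternatives.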

Mapping between confidence and minimum support levels
For all types of minimum support intervals discussed so far, there is a one-to-one mapping between their minimum support level $k$ and the confidence level $(1 - \alpha)100\%$ of the approximate confidence interval (3), see Figure 3. The conventional default level of 95% corresponds to a $k = 1/6.8$ support level for the class of all priors under the alternative, a $k = 1/2.5$ support level for the $-e\,p \log p$ calibration, and a $k = 1/2.1$ support level for the local normal prior calibration. Conversely, the $k = 1/10$ minimum support interval corresponds to the 96.81% confidence interval for the class of all priors, the 99.25% confidence interval for the $-e\,p \log p$ calibration, and the 99.43% confidence interval for the local normal prior calibration. Similar to the mappings between Bayes factor bounds and p-values (Held and Ott, 2018), the mappings displayed in Figure 3 provide confidence intervals with an evidential interpretation. Specifically, they enhance their long-term frequency interpretation with an interpretation that directly relates to the minimum support that the observed data provide for the parameter values in the interval.
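The mapping between confidence levels and minimum support levels can be reproduced with a short script. The sketch below uses only the Python standard library; the bisection helper for the $W_{-1}$ branch, the function names, and the class labels are assumptions made for illustration, and the formulas are the standard bounds $\exp(-z^2/2)$, $|z| \exp(-z^2/2) \sqrt{e}$, and $-e\,p \log p$.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def lambert_wm1(x):
    """Branch W_{-1}(x) for -1/e <= x < 0; w * exp(w) is decreasing on w <= -1."""
    lo, hi = -700.0, -1.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid * math.exp(mid) < x:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def min_support_level(conf, prior_class):
    """Minimum support level k of a confidence interval with level `conf`."""
    z = STD_NORMAL.inv_cdf((1 + conf) / 2)
    if prior_class == "all":
        return math.exp(-0.5 * z**2)
    if prior_class == "local-normal":
        return z * math.exp(-0.5 * z**2) * math.sqrt(math.e) if z > 1 else 1.0
    if prior_class == "p-based":
        p = 1 - conf
        return -math.e * p * math.log(p) if p < 1 / math.e else 1.0
    raise ValueError(f"unknown prior class: {prior_class}")

def confidence_level(k, prior_class):
    """Confidence level of the k minimum support interval (k <= 1)."""
    if prior_class == "all":
        m = math.sqrt(-2 * math.log(k))
    elif prior_class == "local-normal":
        m = math.sqrt(-lambert_wm1(-(k**2) / math.e))
    elif prior_class == "p-based":
        m = STD_NORMAL.inv_cdf(1 - math.exp(lambert_wm1(-k / math.e)) / 2)
    else:
        raise ValueError(f"unknown prior class: {prior_class}")
    return 2 * STD_NORMAL.cdf(m) - 1

for cls in ("all", "p-based", "local-normal"):
    print(cls, 1 / min_support_level(0.95, cls), 100 * confidence_level(1 / 10, cls))
```

The printed values can be compared with the levels quoted in the text (1/6.8, 1/2.5, 1/2.1 and 96.81%, 99.25%, 99.43%).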

Example RECOVERY trial
We now compute the above (minimum) support intervals for the data from the RECOVERY trial (RECOVERY Collaborative Group, 2021). With the standard error $\sigma$ known, the minimum support intervals are fully specified and can be readily computed. For the normal, local normal, and the nonlocal normal moment prior we choose their parameters as follows. The trial steering committee determined the sample size of the trial based on an assumed clinically relevant log hazard ratio of $\log 0.8 = -0.22$. This effect size can be used to inform the normal prior under the alternative $H_1$, i.e., we specify the mean $\mu_\theta = -0.22$ along with the unit-information variance $\sigma^2_\theta = 4$ for a log hazard ratio (Spiegelhalter et al., 2004, Section 2.4.2). Likewise, we use the unit-information variance $\sigma^2_\theta = 4$ as the variance of the local normal prior. The spread parameter of the nonlocal moment prior $\sigma_\theta$ is elicited with a similar approach as in Pramanik and Johnson (2022): the value $\sigma_\theta = 0.28$ is selected so that 90% probability mass is assigned to log hazard ratios between $\theta_0 - \log 2$ and $\theta_0 + \log 2$, representing effect sizes that at most halve or double the mortality hazards relative to the null value $\theta_0$.
[...] $-e\,p \log p$, yet all of them are narrower than the ordinary support intervals. This illustrates that minimum support intervals provide an overly pessimistic assessment of support for parameter values, in the same way that Bayes factor bounds provide an overly pessimistic quantification of evidence for the null hypothesis.
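The elicitation of the spread parameter can be checked numerically. The sketch below assumes the normal moment density $\mathrm{N}(\theta \mid \theta_0, \sigma^2_\theta) \times (\theta - \theta_0)^2/\sigma^2_\theta$, for which the probability mass within $\theta_0 \pm c$ equals $\{2\Phi(x) - 1\} - 2x\,\phi(x)$ with $x = c/\sigma_\theta$ (obtained by integration by parts); the function names are hypothetical.

```python
import math
from statistics import NormalDist

def moment_prior_mass(c, spread):
    """Probability that |theta - theta0| <= c under a NM(theta0, spread) prior."""
    x = c / spread
    std_normal = NormalDist()
    return (2 * std_normal.cdf(x) - 1) - 2 * x * std_normal.pdf(x)

def elicit_spread(c, mass, lo=1e-3, hi=10.0):
    """Bisection for the spread that assigns probability `mass` to theta0 +/- c."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if moment_prior_mass(c, mid) > mass:
            lo = mid  # too much mass inside the range: the spread must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

# 90% mass on log hazard ratios that at most halve or double the hazard
spread = elicit_spread(c=math.log(2), mass=0.9)
print(round(spread, 2))
```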

Design of new studies based on support
The sample size of a future study is typically derived to achieve (i) a targeted power of a hypothesis test, or (ii) a targeted precision of a future confidence/credible interval. Here, we provide an alternative where the sample size of a future study is determined to achieve a desired level of support.
Assume we wish to conduct a study and analyze the resulting parameter estimate $\hat{\theta}$ using the support interval based on a normal prior (5). Further assume that we either specify a reasonable prior from existing knowledge or use the prior for Jeffreys's approximate Bayes factor. The goal is now to determine the sample size $n$ such that we can identify the parameter values which are strongly supported by the future data, for instance, with a support level $k = 10$ representing "strong" support in the classification from Jeffreys (1961). In order for the $k > 1$ support interval (5) to be non-empty, the standard error $\sigma$ of the parameter estimate $\hat{\theta}$ needs to be sufficiently small so that the term in the square root becomes non-negative, i.e., it must hold that
$$\log\left(1 + \frac{\sigma^2_\theta}{\sigma^2}\right) + \frac{(\hat{\theta} - \mu_\theta)^2}{\sigma^2 + \sigma^2_\theta} \geq \log k^2. \tag{17}$$
The sample size $n$ can now be determined such that the standard error $\sigma$ is small enough for (17) to hold. The resulting sample size then guarantees that parameter values with the desired level of support will be identified. In general, this needs to be done numerically, but for the Jeffreys's approximate Bayes factor prior ($\mu_\theta = \hat{\theta}$ and $\sigma^2_\theta = n\sigma^2$), the simple expression $n \geq k^2 - 1$ mentioned earlier exists. For instance, if we want a $k = 10$ support interval to be non-empty, we must take at least $10^2 - 1 = 99$ samples.
While the previously described approach guarantees that a $k > 1$ support interval is non-empty and includes at least one parameter value $\theta$, one may want to guarantee that the resulting $k$ support interval will span a desired width $\ell$, that is,
$$\ell = 2 \times M_k \times \sigma \tag{18}$$
with $M_k$ the standard error multiplier of a $k$ support interval. In general, numerical methods are required for computing the $n$ such that (18) is satisfied, yet again for the support interval based on Jeffreys's approximate Bayes factor there are explicit solutions available
$$n = -\frac{4\lambda^2}{\ell^2}\, W\left(-\frac{k^2 \ell^2}{4\lambda^2}\right) \tag{19}$$
with $\lambda^2$ the variance of one (effective) observation and assuming $\log(1 + n)/\log(n) \approx 1$. From (19) two things are apparent: (i) the argument to $W(\cdot)$ has to be at least $-1/e$ for the function value to be defined, meaning that the possible width is limited by $\ell^2 \leq (4\lambda^2)/(e k^2)$, and (ii) since the argument to $W(\cdot)$ is negative, there are always two solutions given by the two real branches of the Lambert W function, if any exist at all. For instance, for a standard error of $\sigma = \lambda/\sqrt{n}$ with $\lambda = 2$, a support level $k = 10$, and a desired width $\ell = 0.2$, equation (19) leads to the sample sizes $n_1 = 143$ and $n_2 = 862$ (when rounded to the next larger integer). Both lead to the $k = 10$ support interval spanning the desired width $\ell = 0.2$, yet for the study employing the larger sample size $n_2$, support intervals with higher support levels $k$ can be computed than for a study employing the smaller sample size $n_1$.
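The two sample sizes in the example can be reproduced with a small script. This sketch assumes the closed form $n = -(4\lambda^2/\ell^2)\, W(-k^2\ell^2/(4\lambda^2))$, which follows from setting the width $2\lambda\sqrt{\{\log n - \log k^2\}/n}$ equal to $\ell$; the symbol $\ell$ for the width, the bisection helper for the Lambert W function, and the function names are assumptions made for illustration.

```python
import math

def lambert_w(x, branch=0):
    """Real Lambert W for -1/e <= x < 0 by bisection (branch 0 or -1)."""
    if not -1 / math.e <= x < 0:
        raise ValueError("argument must lie in [-1/e, 0)")
    if branch == 0:
        lo, hi, increasing = -1.0, 0.0, True    # w * exp(w) increases on [-1, 0]
    else:
        lo, hi, increasing = -700.0, -1.0, False  # and decreases on w <= -1
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        below = mid * math.exp(mid) < x
        if below == increasing:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def sample_sizes_for_width(width, k, lam):
    """Both sample sizes for which the k support interval has the desired width."""
    a = width**2 / (4 * lam**2)
    arg = -a * k**2
    n_small = -lambert_w(arg, branch=0) / a
    n_large = -lambert_w(arg, branch=-1) / a
    return math.ceil(n_small), math.ceil(n_large)

def exact_width(n, k, lam):
    """Width without the approximation log(1 + n) ~ log(n)."""
    return 2 * lam / math.sqrt(n) * math.sqrt(math.log(1 + n) - 2 * math.log(k))

n1, n2 = sample_sizes_for_width(width=0.2, k=10, lam=2.0)
print(n1, n2)  # 143 862, as in the example above
print(exact_width(n1, 10, 2.0), exact_width(n2, 10, 2.0))
```

The second print statement shows how close the exact widths are to the target, which gives a sense of the error introduced by the approximation $\log(1 + n) \approx \log n$.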

Error control via the universal bound
The universal bound (Royall, 1997, Section 1.4) ensures that for $k < 1$ and when the null hypothesis $H_0\colon \theta = \theta_0$ is true, the probability for finding evidence at most of level $k$ for $H_0$ cannot be larger than $k$, that is,
$$\Pr\{\mathrm{BF}_{01}(x; \theta_0) \leq k \mid H_0\} \leq k \tag{20}$$
for any prior of $\theta$ under the alternative $H_1$. Remarkably, the universal bound is also valid under sequential analyses with optional stopping as soon as a Bayes factor smaller than $k$ is obtained (Robbins, 1970; Pace and Salvan, 2020). In contrast, frequentist tests and confidence sets typically have to be adjusted for sequential analyses to guarantee appropriate error rates, and the theory and applicability can become quite involved. Lindon and Malek (2020) proved that $k$ support sets with $k < 1$ are also valid $(1 - k)100\%$ confidence sets. Their proof and the related "safe and anytime valid inference" theory (see e.g., Grünwald et al., 2019) are based on relatively technical results from martingale theory. We now briefly show how the universal bound can also be used to derive error rate guarantees for support intervals. Assume there is a true parameter $\theta = \theta_*$. For any (data-independent) prior for $\theta$ under the alternative hypothesis $H_1$, the coverage of the corresponding $k$ support set $\mathrm{SI}_k$ with $k < 1$ is bounded by
$$\Pr(\theta_* \in \mathrm{SI}_k \mid \theta = \theta_*) = 1 - \Pr\{\mathrm{BF}_{01}(x; \theta_*) < k \mid \theta = \theta_*\} \geq 1 - k \tag{21}$$
where the first equality follows from the definition of a $k$ support set (2), whereas the inequality follows from the universal bound (20). This shows that a $k$ support set with $k < 1$ is also a valid $(1 - k)100\%$ confidence set.

To transform an interval from type A to type B, first subtract $\hat{\theta}$ from the boundaries of the interval, multiply by the ratio of the standard error multipliers $M_B/M_A$, and add again $\hat{\theta}$ to the boundaries of the interval. The standard error multipliers $M$ depend on either the confidence level $(1 - \alpha)$ or the support level $k$. For the support intervals, the standard error multipliers $M$ additionally depend on the parameters of the prior for $\theta$ under the alternative hypothesis: mean $\mu_\theta$ and variance $\sigma^2_\theta$ for the normal prior, variance $\sigma^2_\theta$ for the local normal prior, and spread $\sigma_\theta$ for the nonlocal normal moment prior. The quantile function of the standard normal distribution is denoted by $\Phi^{-1}(\cdot)$, $W_0(\cdot)$ denotes the principal branch of the Lambert W function, and $W_{-1}(\cdot)$ denotes the branch that satisfies $W(y) < -1$ for $y \in (-1/e, 0)$. (Minimum) support intervals are only non-empty for support levels $k$ for which the standard error multiplier is real-valued, i.e., the term in the square root must be non-negative and/or the argument for $W_{-1}(\cdot)$ must be in $[-1/e, 0)$. All interval types can be computed with the R package ciCalibrate (Appendix A).

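Interval type: Standard error multiplier $M$
Confidence interval: $M = \Phi^{-1}(1 - \alpha/2)$
Support interval, normal prior: $M = \sqrt{\log(1 + \sigma^2_\theta/\sigma^2) + (\hat{\theta} - \mu_\theta)^2/(\sigma^2 + \sigma^2_\theta) - \log k^2}$
Support interval, local normal prior: $M = \sqrt{\{\log(1 + \sigma^2_\theta/\sigma^2) - \log k^2\}(1 + \sigma^2/\sigma^2_\theta)}$
Support interval, nonlocal normal moment prior: $M = \sqrt{[2\, W_0\{(1 + \sigma^2_\theta/\sigma^2)^{3/2} \sqrt{e}/(2k)\} - 1](1 + \sigma^2/\sigma^2_\theta)}$
Minimum support interval, class of all priors: $M = \sqrt{-2 \log k}$
Minimum support interval, class of local normal priors: $M = \sqrt{-W_{-1}(-k^2/e)}$
Minimum support interval, $-e\,p \log p$ calibration: $M = \Phi^{-1}[1 - \exp\{W_{-1}(-k/e)\}/2]$

Discussion
Which type of support interval should data analysts use in practice? We believe that the support interval based on a normal prior distribution is the most intuitive for encoding external knowledge. This type should therefore be preferred whenever external knowledge is available. At the same time, the support interval based on a local normal prior with unit-information variance (Kass and Wasserman, 1995) seems to be a reasonable "default" choice in cases where no external knowledge is available. Finally, we believe that minimum support intervals are mostly useful for giving confidence intervals an evidential interpretation due to the one-to-one mapping between the two.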
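The interval transformation described above (subtract the estimate, rescale by the ratio of standard error multipliers, add the estimate back) can be sketched as follows. The example converts a 95% confidence interval into a $k = 1/8$ minimum support interval for the class of all priors, whose multiplier is $\sqrt{-2 \log k}$; the function name and the input values are hypothetical.

```python
import math
from statistics import NormalDist

def transform_interval(lower, upper, est, m_from, m_to):
    """Rescale an interval centered at `est` from multiplier m_from to m_to."""
    ratio = m_to / m_from
    return est + (lower - est) * ratio, est + (upper - est) * ratio

# Hypothetical estimate and standard error (illustrative values)
est, se = -0.19, 0.05

# Type A: 95% confidence interval
m_ci = NormalDist().inv_cdf(0.975)
ci = (est - m_ci * se, est + m_ci * se)

# Type B: k = 1/8 minimum support interval (class of all priors)
k = 1 / 8
m_min_si = math.sqrt(-2 * math.log(k))
min_si = transform_interval(ci[0], ci[1], est, m_ci, m_min_si)
print(ci, min_si)
```

Only the reported interval, the point estimate, and the two multipliers are needed, so the conversion can be applied to intervals taken from published articles without access to the raw data.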
It is also not clear which support level k should be used for computing support intervals.If space permits, we recommend to visualize the Bayes factor as a function of the null value as in Figure 1.A similar approach has also been proposed by Grünwald (2023) under the name of E-posterior.The Bayes factor visualization provides readers with a more gradual assessment of support, and any desired k support interval can be read off from it.If there are space constraints, a compromise is to report support intervals for different levels (e. g., k ∈ {1/10, 1, 10}) or to present a forest plot with "telescope" style support intervals with ascending support levels stacked on top of each other, as in Figure 4. We are hesitant to recommend a "default" support level because any classification of support is arbitrary, just like the 95% confidence level convention.
We believe that k = 1 is perhaps the least arbitrary default level, as it represents the tipping point at which the included parameter values begin to receive support from the data (although not necessarily strong support).
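To make the reporting of several support levels concrete, the following Python sketch computes k support intervals for k ∈ {1/10, 1, 10} under a normal prior. It assumes the normal likelihood and normal prior described in the Figure 1 caption, for which the Bayes factor is the normal density of the estimate under the null value divided by its marginal density under the prior; solving BF01 = k for the null value then gives a closed-form standard error multiplier. The function names and the derivation in the comments are illustrative assumptions rather than the ciCalibrate implementation, and the inputs are the rounded RECOVERY values from the Figure 1 caption.

```python
import math
from statistics import NormalDist


def bayes_factor_01(null_value, estimate, se, prior_mean, prior_sd):
    """Bayes factor for H0: theta = null_value versus H1: theta ~ N(prior_mean, prior_sd^2).

    Assumes a normal likelihood: estimate | theta ~ N(theta, se^2).
    """
    density_h0 = NormalDist(null_value, se).pdf(estimate)
    # Marginal density of the estimate under H1 (prior integrated out).
    density_h1 = NormalDist(prior_mean, math.sqrt(se**2 + prior_sd**2)).pdf(estimate)
    return density_h0 / density_h1


def support_interval(estimate, se, prior_mean, prior_sd, k):
    """k support interval {null_value : BF01 >= k}, or None if it is empty.

    Solving BF01 = k for the null value gives estimate +/- se * M with
    M^2 = log(1 + prior_sd^2 / se^2)
          + (estimate - prior_mean)^2 / (se^2 + prior_sd^2)
          - 2 * log(k).
    """
    m_squared = (
        math.log(1 + prior_sd**2 / se**2)
        + (estimate - prior_mean) ** 2 / (se**2 + prior_sd**2)
        - 2 * math.log(k)
    )
    if m_squared < 0:
        return None  # multiplier not real-valued: no null value reaches support level k
    half_width = se * math.sqrt(m_squared)
    return (estimate - half_width, estimate + half_width)


# Rounded inputs from the Figure 1 caption (assumed here for illustration).
estimate, se = -0.19, 0.05
prior_mean, prior_sd = -0.22, 2.0

intervals = {k: support_interval(estimate, se, prior_mean, prior_sd, k)
             for k in (1 / 10, 1, 10)}
for k, interval in intervals.items():
    print(f"k = {k:g}: {interval}")
```

With these rounded inputs, the k = 10 interval comes out at roughly −0.27 to −0.11, and support levels above about 40 give an empty interval because the term under the square root becomes negative.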
Other approaches for reinterpreting confidence intervals have been proposed. For instance, Rafi and Greenland (2020) propose to rename confidence intervals to "compatibility" intervals and to give their confidence level an information-theoretic interpretation. For example, a 95% confidence interval contains parameter values with at most 4.3 bits of refutational "surprisal". This notion of compatibility is logically weaker than the notion of support considered in this paper, as a failure to refute a parameter value cannot establish that this parameter value is supported without reference to alternatives (Greenland, 2023). Compatibility intervals are in this sense similar to minimum support intervals; without a specified prior under the alternative hypothesis, only the maximum surprisal/evidence against the included parameter values can be quantified.
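The 4.3 bits quoted for a 95% confidence interval can be reproduced with a short calculation. The sketch below assumes the usual definition of surprisal as the negative base-2 logarithm of the p-value, which this section does not spell out; the function name is hypothetical.

```python
import math


def surprisal_bits(p_value):
    """Surprisal (S-value) in bits, assumed here to be -log2(p)."""
    return -math.log2(p_value)


# Parameter values inside a (1 - alpha) confidence interval have p >= alpha,
# so their surprisal is at most -log2(alpha).
alpha = 0.05
max_surprisal = surprisal_bits(alpha)
print(f"95% interval: at most {max_surprisal:.1f} bits")
```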
We also showed how the coverage of k support intervals with k < 1 is bounded from below by (1 − k)100%, which holds even under sequential analyses with optional stopping. For instance, a k = 1/20 support interval has valid 95% coverage. Of course, such error rate guarantees rest on the assumption that the data model has been correctly specified, which in most real-world applications will be violated to some extent. We do not see this as a problem for the evidential interpretation of support intervals, which is usually of more concern to data analysts. Evidential inference does not rely on a statistical model being "true" in some abstract sense. Bayes factors and support intervals simply quantify the relative predictive performance that the combination of data model and parameter distribution yields on out-of-sample data (Kass and Raftery, 1995; O'Hagan and Forster, 2004; Gneiting and Raftery, 2007; Fong and Holmes, 2020). Such "descriptive inferential statistics" are especially important for the analysis of convenience data samples which typically violate assumptions of the underlying statistical model (Amrhein et al., 2019; Shafer, 2021).
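The coverage bound can be checked informally by simulation. The Python sketch below draws estimates from a correctly specified normal model with a fixed true parameter value, computes the k = 1/20 support interval under a normal prior for each draw, and records how often the true value is covered. It covers only the fixed-sample case (no optional stopping), and the closed-form interval, the function names, and the simulation settings are assumptions made for illustration.

```python
import math
import random


def support_interval(estimate, se, prior_mean, prior_sd, k):
    """k support interval under a normal prior (normal likelihood assumed); None if empty."""
    m_squared = (
        math.log(1 + prior_sd**2 / se**2)
        + (estimate - prior_mean) ** 2 / (se**2 + prior_sd**2)
        - 2 * math.log(k)
    )
    if m_squared < 0:
        return None
    half_width = se * math.sqrt(m_squared)
    return (estimate - half_width, estimate + half_width)


def simulate_coverage(true_value, se, prior_mean, prior_sd, k, n_sim=20000, seed=1):
    """Monte Carlo estimate of how often the k support interval covers true_value."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(n_sim):
        estimate = rng.gauss(true_value, se)  # data model is correctly specified
        interval = support_interval(estimate, se, prior_mean, prior_sd, k)
        if interval is not None and interval[0] <= true_value <= interval[1]:
            covered += 1
    return covered / n_sim


coverage = simulate_coverage(true_value=0.0, se=0.05, prior_mean=-0.22,
                             prior_sd=2.0, k=1 / 20)
print(f"estimated coverage: {coverage:.4f} (bound: {1 - 1 / 20:.2f})")
```

In this setting the simulated coverage lies well above 95%, which is consistent with the bound being a guarantee rather than an exact coverage level.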
In fact, even one of the best-known proponents of p-values, R. A. Fisher, noted: "For all purposes, and more particularly for the communication of the relevant evidence supplied by a body of data, the values of the Mathematical Likelihood are better fitted to analyse, summarize, and communicate statistical evidence of types too weak to supply true probability statements" (Fisher, 1956, p. 70), clearly recognizing the importance of inferential tools based on relative likelihood for making sense out of data.

Software and data
The point estimate and 95% confidence interval of the adjusted log hazard ratio were extracted from the abstract of RECOVERY Collaborative Group (2021). All analyses were conducted in the R programming language version 4.3.0 (R Core Team, 2023).

Figure 1: The RECOVERY trial (RECOVERY Collaborative Group, 2021) found that dexamethasone treatment reduced mortality compared to usual care in hospitalized Covid-19 patients (estimated log hazard ratio $\hat{\theta} = -0.19$ with standard error $\sigma = 0.05$ and 95% confidence interval from −0.29 to −0.07). Assuming a normal likelihood $\hat{\theta} \mid \theta \sim \mathrm{N}(\theta, \sigma^2)$, the Bayes factor for contrasting $H_0\colon \theta = \theta_0$ to $H_1\colon \theta \neq \theta_0$ is shown as a function of the null value $\theta_0$. A unit-information normal distribution $\theta \mid H_1 \sim \mathrm{N}(\mu_\theta = -0.22, \sigma^2_\theta = 4)$ centered around the clinically relevant log hazard ratio is used as prior for $\theta$ under $H_1$. Support intervals for different support levels k indicate the range of log hazard ratios supported by the data.

Figure 2: Comparison of prior distributions for the parameter $\theta$ under the alternative $H_1$ in terms of the resulting support interval width and the highest level for which it is non-empty. A data model $\hat{\theta} \mid \theta \sim \mathrm{N}(\theta, \lambda^2/n = 4/n)$ is assumed in all cases. The prior scale/spread parameter is set to $\sigma_\theta = 2$. The

Vovk (1993) and Sellke et al. (2001) proposed a minimum Bayes factor where the data are summarized through a p-value. The idea is that under the null hypothesis $H_0\colon \theta = \theta_0$, a p-value should be uniformly distributed, whereas under the alternative it should have a monotonically decreasing density characterized by the class of Beta(ξ, 1) distributions (with ξ ≤ 1). Choosing ξ such that the marginal likelihood of the data under $H_1$ is maximized leads to the well-known "−ep log p" minimum Bayes factor

$$\mathrm{minBF}_{01}(p; \theta_0) = \begin{cases} -e\,p \log p & \text{if } p \leq 1/e \\ 1 & \text{otherwise,} \end{cases} \qquad (15)$$

where $p = 2\{1 - \Phi(|\hat{\theta} - \theta_0|/\sigma)\}$. Equating (15) to k and solving for $\theta_0$ leads to the k minimum support interval $\hat{\theta}$
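As a hedged illustration of inverting the "−ep log p" bound: setting −e p log p = k and substituting u = log p gives u exp(u) = −k/e, so that p = exp{W_{-1}(−k/e)} on the branch with u ≤ −1, and the interval limits then follow from |θ̂ − θ0|/σ = Φ^{-1}(1 − p/2). The Python sketch below implements these steps with a simple bisection for W_{-1}. The function names are hypothetical, and the derivation is a reconstruction from the text rather than the ciCalibrate code.

```python
import math
from statistics import NormalDist


def lambert_w_minus1(y, iterations=200):
    """Branch W_{-1} of the Lambert W function for y in [-1/e, 0), found by bisection.

    On (-inf, -1] the map w -> w * exp(w) decreases from 0 to -1/e, so the
    solution of w * exp(w) = y with w <= -1 is unique.
    """
    if not (-1 / math.e <= y < 0):
        raise ValueError("W_{-1} is real-valued only for y in [-1/e, 0)")
    hi = -1.0
    lo = -2.0
    while lo * math.exp(lo) <= y:  # move lo left until it brackets the root
        lo *= 2.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if mid * math.exp(mid) > y:
            lo = mid  # root lies to the right of mid
        else:
            hi = mid
    return (lo + hi) / 2


def min_support_interval(estimate, se, k):
    """k minimum support interval from the -e p log p bound, or None if k > 1.

    Returns (lower, upper, p), where p is the two-sided p-value at the limits.
    Very small k can push 1 - p/2 to 1.0 in floating point; not handled here.
    """
    if k <= 0:
        raise ValueError("support level k must be positive")
    if k > 1:
        return None  # argument -k/e falls below -1/e: interval is empty
    p = math.exp(lambert_w_minus1(-k / math.e))
    multiplier = NormalDist().inv_cdf(1 - p / 2)
    return (estimate - se * multiplier, estimate + se * multiplier, p)


# Rounded RECOVERY inputs from the Figure 1 caption (assumed for illustration).
lower, upper, p = min_support_interval(estimate=-0.19, se=0.05, k=1 / 10)
print(f"k = 1/10 minimum support interval: ({lower:.3f}, {upper:.3f}), p = {p:.4f}")
```

Because the p-value at the interval limits is returned as well, the same call also gives the confidence level 1 − p of the confidence interval that coincides with the k minimum support interval.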

Figure 3: Mapping between confidence level (1 − α)100% and minimum support level k for different types of minimum support intervals.
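The mapping between confidence level and minimum support level can be sketched for two classes of priors. For the class of all priors, the sketch assumes the standard result that the minimum Bayes factor under a normal likelihood is exp(−z²/2), with z the standardized distance between estimate and null value; this formula is not stated in this section. For the "−ep log p" class, it uses the bound with p set to α. Function names are hypothetical.

```python
import math
from statistics import NormalDist


def min_support_level_all_priors(conf_level):
    """Minimum support level k matching a confidence interval, class of all priors.

    Assumes minBF01 = exp(-z^2 / 2), evaluated at the limits of the interval.
    """
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    return math.exp(-z**2 / 2)


def min_support_level_ep_log_p(conf_level):
    """Minimum support level k matching a confidence interval, -e p log p bound."""
    alpha = 1 - conf_level
    if alpha > 1 / math.e:
        return 1.0  # the bound equals 1 for p-values above 1/e
    return -math.e * alpha * math.log(alpha)


for level in (0.90, 0.95, 0.99):
    k_all = min_support_level_all_priors(level)
    k_ep = min_support_level_ep_log_p(level)
    print(f"{level:.0%} CI: k = {k_all:.3f} (all priors), k = {k_ep:.3f} (-e p log p)")
```

Under these assumptions, a 95% confidence interval corresponds to a minimum support level of about 0.15 for the class of all priors and about 0.41 for the "−ep log p" class.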

Figure 4: Different support intervals for the data from the RECOVERY trial. The normal prior is centered around $\mu_\theta = -0.22$ and has unit-information variance $\sigma^2_\theta = 4$. The local normal prior also has unit-information variance $\sigma^2_\theta = 4$. The spread parameter of the nonlocal normal moment prior is $\sigma_\theta = 0.28$.

Figure 4 shows the corresponding k support intervals for different values of k. The support intervals based on the normal prior (second row) and the local normal prior (third row) mostly coincide for all considered support levels k. The k = 10 support intervals (blue) from both types indicate that log hazard ratios between −0.27 and −0.1 receive strong support from the data compared to alternative parameter values. In contrast, the k = 10 support interval (blue) based on the nonlocal normal moment prior (fourth row) is slightly wider, indicating that values between −0.28 and −0.09 are strongly supported by the data. For smaller support levels (k < 10) this trend reverses, and the normal and local normal prior support intervals are wider than the one based on the nonlocal normal moment prior. Finally, each parameter value not included in a k support interval corresponds to a point-null hypothesis for which the respective Bayes factor is smaller than k, similar to the relationship between confidence intervals and p-values. For instance, one can immediately see that the Bayes factor based on nonlocal moment priors indicates strong evidence ($\mathrm{BF}_{01} < 1/10$) against $H_0\colon \theta = 0$ as the value is not included in the interval, whereas this is not the case for the Bayes factors based on normal and local normal priors.

The three bottom rows in Figure 4 show different types of k minimum support intervals computed for the data from the RECOVERY trial. Since minimum support intervals are only non-empty for k ≤ 1, only such support levels are shown. The (yellow) k = 1 minimum support interval for the class of all priors (fifth row) is just a point at the observed effect estimate $\hat{\theta} = -0.19$. In contrast, the (yellow) k = 1 minimum support intervals based on local normal

Code and data for reproducing the results in this manuscript (produced with R version 4.3.0; R Core Team, 2023) are available at https://github.com/SamCH93/ECoCI. A snapshot of the GitHub repository at the time of writing this article is archived at https://doi.org/10.5281/zenodo.6723249. An R package for calibration of confidence intervals to (minimum) support intervals is available at https://CRAN.R-project.org/package=ciCalibrate; see Appendix A for an illustration.

Table 2: Summary of confidence intervals (CI), support intervals (SI), and minimum support intervals (minSI) for an unknown parameter $\theta$ based on a parameter estimate $\hat{\theta}$ with standard error $\sigma$. All intervals are of the form $\hat{\theta} \pm \sigma \times M$.