A Review of Bayesian Perspectives on Sample Size Derivation for Confirmatory Trials

Sample size derivation is a crucial element of planning any confirmatory trial. The required sample size is typically derived based on constraints on the maximal acceptable Type I error rate and minimal desired power. Power depends on the unknown true effect and tends to be calculated either for the smallest relevant effect or a likely point alternative. The former might be problematic if the minimal relevant effect is close to the null, thus requiring an excessively large sample size, while the latter is dubious since it does not account for the a priori uncertainty about the likely alternative effect. A Bayesian perspective on sample size derivation for a frequentist trial can reconcile arguments about the relative a priori plausibility of alternative effects with ideas based on the relevance of effect sizes. Many suggestions as to how such “hybrid” approaches could be implemented in practice have been put forward. However, key quantities are often defined in subtly different ways in the literature. Starting from the traditional entirely frequentist approach to sample size derivation, we derive consistent definitions for the most commonly used hybrid quantities and highlight connections, before discussing and demonstrating their use in sample size derivation for clinical trials.

A Sensitivity of Probability of Success with respect to the definition of "success"

The degree to which PoS(n) and the marginal probability to reject H_0 differ numerically is visualized in Figure 1. It depicts the proportions of the individual components of the marginal probability to reject H_0 for varying prior means and standard deviations. The sample size is fixed at n = 150, θ_0 = 0, the maximal type I error rate is α = 0.025, and the minimal clinically important difference is θ_MCID = 0.1. A normal prior truncated to [−1, 1] with varying mean and standard deviation was used. The contribution of type I errors (component "A" in Figure 1) to the marginal probability to reject H_0 is mostly negligible unless the prior is sharply peaked at an effect size slightly smaller than the null. In these cases, the a priori probability of a relevant effect size is close to zero, and so is the marginal probability to reject H_0.

Figure 1: Components of the marginal probability to reject H_0 for n = 150, θ_0 = 0, α = 0.025, θ_MCID = 0.1, and varying prior mean and standard deviation; numbers correspond to the overall marginal probability to reject H_0; proportions in the individual pie charts correspond to: A = probability to reject and null effect (type I error), B = probability to reject and irrelevant but non-null effect, C = probability to reject and relevant effect (PoS(n)).
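The decomposition described above can be reproduced numerically. The following sketch assumes a one-sided one-sample z-test with known unit variance (the underlying test statistic is not spelled out in this excerpt); `mu` and `sd` are example prior choices, not values from the paper:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Figure 1 settings; the test is assumed to be a one-sided one-sample
# z-test with known unit variance (an assumption, not stated here).
n, alpha, theta_0, theta_mcid = 150, 0.025, 0.0, 0.1
z_crit = stats.norm.ppf(1 - alpha)

def prob_reject(theta):
    """Probability to reject H_0 given the true effect theta."""
    return stats.norm.sf(z_crit - np.sqrt(n) * (theta - theta_0))

def components(mu, sd, lo=-1.0, hi=1.0):
    """Components A, B, C of the marginal probability to reject H_0
    under a normal prior with mean mu and sd, truncated to [lo, hi]."""
    prior = stats.truncnorm((lo - mu) / sd, (hi - mu) / sd, loc=mu, scale=sd)
    f = lambda t: prob_reject(t) * prior.pdf(t)
    A = quad(f, lo, theta_0)[0]          # reject and null effect (type I error)
    B = quad(f, theta_0, theta_mcid)[0]  # reject and irrelevant, non-null effect
    C = quad(f, theta_mcid, hi)[0]       # reject and relevant effect: PoS(n)
    return A, B, C

A, B, C = components(mu=0.3, sd=0.2)
print(f"A = {A:.4f}, B = {B:.4f}, C = {C:.4f}, marginal = {A + B + C:.4f}")
```

For a moderately optimistic prior such as this one, component A is negligible and the marginal probability to reject H_0 is dominated by C, in line with the pie charts of Figure 1.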

B Literature review of terminology
A structured overview of the literature on "hybrid" Bayesian sample size derivation in the context of clinical trials is given in Table 1. The table relates publications in the field to the terms defined in Figure 2 of the main text. Publications with a similar take on the matter are grouped. In the following, we highlight a few particularly interesting contributions and how they relate to the definitions used in this manuscript.
The majority of the manuscripts only consider the marginal probability to reject H_0. Many publications refer to O'Hagan and Stevens (2001) or O'Hagan et al. (2005), where this quantity was introduced as "assurance". The range of names for what we call the "marginal probability to reject H_0" is, however, quite diverse: "assurance", "probability of success", "predictive probability of success", "average probability of success", "probability of statistical success", "probability of study success", "predictive power", "predictive frequentist power", "expected power", "average power", "strength", "extended Bayesian expected power 1", and "hybrid Neyman-Pearson-Bayesian probability".
However, only a handful of authors elaborate on the intricacies of defining what exactly constitutes a "success" and whether to consider an unconditional measure of success or to condition on the presence of a relevant effect for sample size derivation (Spiegelhalter and Freedman, 1986; Brown et al., 1987; Shao et al., 2008; Liu, 2010; Ciarleglio et al., 2015). Most publications do not define "success" explicitly. The use of the marginal probability to reject H_0, however, implies that rejection of the null hypothesis, irrespective of its truth, must be considered a success. Our analysis confirms the statement in Spiegelhalter et al. (2004) that the marginal probability to reject H_0 can be used as a practical approximation to PoS(n) in many situations.
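The quality of this approximation can be checked numerically. The sketch below again assumes a one-sided one-sample z-test with known unit variance and a normal prior truncated to [−1, 1] (assumptions for illustration; the prior means and standard deviations are hypothetical):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Assumed setting: one-sided one-sample z-test, known unit variance.
n, alpha, theta_0, theta_mcid = 150, 0.025, 0.0, 0.1
z_crit = stats.norm.ppf(1 - alpha)
prob_reject = lambda t: stats.norm.sf(z_crit - np.sqrt(n) * (t - theta_0))

def marginal_and_pos(mu, sd, lo=-1.0, hi=1.0):
    """Marginal probability to reject H_0 and the joint PoS(n)."""
    prior = stats.truncnorm((lo - mu) / sd, (hi - mu) / sd, loc=mu, scale=sd)
    f = lambda t: prob_reject(t) * prior.pdf(t)
    return quad(f, lo, hi)[0], quad(f, theta_mcid, hi)[0]

# Enthusiastic prior, almost all mass on relevant effects: close agreement.
m1, p1 = marginal_and_pos(mu=0.4, sd=0.1)
# Sceptical prior centred below theta_MCID: the approximation deteriorates.
m2, p2 = marginal_and_pos(mu=0.05, sd=0.1)
print(m1 - p1, m2 - p2)
```

The gap between the two quantities is exactly the prior-weighted probability of rejecting despite a non-relevant effect, so the approximation is good precisely when little prior mass lies below θ_MCID.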
The exact definition of "probability of success" becomes more interesting when allowing for θ_MCID > θ_0, a potential extension rarely considered in the literature (see, e.g., Brown et al., 1987, for the binary case).
The exact choice of wording should not be given too much weight. However, we feel that any notion of power in the "hybrid" Bayesian/frequentist setting should be conditional on a relevant effect (or at least a non-null effect) to preserve the conditional nature of the purely frequentist power. Using the term "power" for a joint probability, as with the "expected power" of Brown et al. (1987) and Ciarleglio et al. (2015) (our PoS(n)), or for the "average/expected power" of Spiegelhalter et al. (2004) (our marginal probability to reject H_0), is potentially misleading. Others suggest "conditional expected power" for EP(n) to distinguish it from "expected power" (our marginal probability to reject H_0) (Brown et al., 1987; Ciarleglio et al., 2015). This wording, however, may lead to confusion when also considering interim analyses, where "conditional power" is a well-established term for the probability of rejecting the null hypothesis given θ_alt and partially observed data (Bauer et al., 2016).
A particularly interesting publication is Liu (2010), which extends hybrid sample size derivation in the normal case to also incorporate uncertainty about the variance and clearly distinguishes between the marginal probability to reject H_0 ("extended Bayesian expected power 1"), PoS(n) ("extended Bayesian expected power 2"), and EP(n) ("extended Bayesian expected power 3"). Apart from nomenclature, our definitions of these three quantities differ only in that they assume the standard deviation to be fixed and in that we accommodate the optional notion of a relevant effect via θ_MCID. The former makes explicit formulas more manageable; the latter is important to keep sample sizes small in situations with vague or conservative prior information but substantial relevance thresholds. Liu (2010) and Rufibach et al. (2016) are also the only publications we found that study the distribution of the quantities that are averaged over. In Ciarleglio et al. (2015), the distinction between all three quantities is also made explicit ("expected power" is our marginal probability to reject H_0, "prior-adjusted power" is our PoS(n), and "conditional expected power" is our EP(n)).
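In the fixed-variance normal setting, the relationship between the three quantities can be illustrated directly: PoS(n) is the joint probability to reject and have a relevant effect, and EP(n) is PoS(n) divided by the prior probability of a relevant effect. The sketch below assumes a one-sided one-sample z-test with known unit variance and a hypothetical truncated normal prior:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Hypothetical design values; fixed-variance normal setting.
n, alpha, theta_0, theta_mcid = 150, 0.025, 0.0, 0.1
z_crit = stats.norm.ppf(1 - alpha)
prob_reject = lambda t: stats.norm.sf(z_crit - np.sqrt(n) * (t - theta_0))

mu, sd, lo, hi = 0.2, 0.2, -1.0, 1.0
prior = stats.truncnorm((lo - mu) / sd, (hi - mu) / sd, loc=mu, scale=sd)
f = lambda t: prob_reject(t) * prior.pdf(t)

marginal = quad(f, lo, hi)[0]     # marginal probability to reject H_0 ("1")
pos = quad(f, theta_mcid, hi)[0]  # PoS(n): reject AND relevant effect ("2")
ep = pos / prior.sf(theta_mcid)   # EP(n): conditional on a relevant effect ("3")
print(f"marginal = {marginal:.3f}, PoS(n) = {pos:.3f}, EP(n) = {ep:.3f}")
```

By construction, PoS(n) can never exceed either of the other two quantities: the marginal probability adds the rejections under non-relevant effects, while EP(n) rescales PoS(n) by a probability smaller than one.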

Table 1: Structured overview of the literature on "hybrid" Bayesian sample size derivation.

References | Notes
O'Hagan and Stevens (2001) | Termed 'assurance' or 'expected power'; different from our notion of expected power, which is conditional on a relevant effect; see also O'Hagan et al. (2005).

References | Notes
Chuang-Stein | …

References | Notes
Carroll | Termed 'average probability of success'; discussed in context of historical data integration.
Walley et al. | Termed 'assurance' or 'probability of success'; extension to multi-parameter situations.
… | Termed 'assurance' or 'probability of success'; in-depth discussion of the distribution of the probability to reject the null hypothesis.

References | Notes
Brown et al. | Termed 'expected power'; also discusses 'conditional expected power', which corresponds to our definition of EP(n).
Shao et al. | Termed 'adjusted power'; application of the ideas of Spiegelhalter et al. (2004) to the binary setting; defines probability of success but approximates it with the marginal probability to reject H_0.
… | Termed 'prior-adjusted power'; also considers EP(n) and the marginal probability to reject H_0; very similar settings considered in Ciarleglio et al.