Eight predictive powers with historical and interim data for futility and efficacy analysis

ABSTRACT When the historical data of the early phase trial and the interim data of the Phase III trial are available, we should use them to give a more accurate prediction in both futility and efficacy analysis. The predictive power is an important measure of the practical utility of a proposed trial, and it is better than the classical statistical power in indicating the probability that the trial will demonstrate a positive or statistically significant outcome. In addition to the four predictive powers with historical and interim data available in the literature and summarized in Table 1, we discover and calculate another four predictive powers, also summarized in Table 1, for one-sided hypotheses. Moreover, we calculate eight predictive powers, summarized in Table 2, for the reversed hypotheses. The combination of the two tables gives a complete picture of the predictive powers with historical and interim data for futility and efficacy analysis. Furthermore, the eight predictive powers with historical and interim data are used to guide the futility analysis in the tamoxifen example. Finally, extensive simulations are conducted to investigate the sensitivity of the different predictive powers to the priors, sample sizes, interim result and interim time.


Introduction
The predictive power, which is the prior expectation of the power (that is, the power averaged over the prior distribution of the unknown true treatment effect), is an important measure of the practical utility of a proposed trial, and it is better than the power in indicating the probability that the trial will demonstrate a positive or statistically significant outcome. The power may take very different values at different treatment effects (for instance, a treatment effect under the alternative hypothesis or an observed treatment effect in the interim analysis), which makes it difficult to interpret. The predictive power has been investigated intensively in the literature (Choi et al., 1985; Schmidli et al., 2007; Spiegelhalter et al., 1986; Zhang & Ting, 2018). Moreover, the predictive power is also known as assurance (Kirby et al., 2012; O'Hagan et al., 2005; Wang et al., 2006), Probability Of Success (POS) (Ibrahim et al., 2015; Jiang, 2011; Trzaskoma & Sashegyi, 2007), Average Success Probability (ASP) (Chuang-Stein, 2006; Zhang & Ting, 2020) or Contemplated Average Success Probability (CASP) (Zhang et al., 2020a).
The 'predictive power' is the central matter of our methodological development. Therefore, we present a general formal expression of it. The predictive power is an average power with respect to some prior, that is,

$$\mathrm{PP} = \int \mathrm{power}(\delta)\,\mathrm{prior}(\delta)\,d\delta,$$

where $\delta$ is the true treatment effect of the early phase and Phase III trials. There are eight predictive powers with historical and interim data, because we have four choices for $\mathrm{power}(\delta)$, that is, the classical power that does not use any data, the classical conditional power that uses the interim data once, the Bayesian power that uses the historical data once, and the Bayesian conditional power that uses the historical data once and the interim data once; and we have two choices for $\mathrm{prior}(\delta)$, that is, $\pi(\delta \mid d_0)$ that uses the historical data once, and $\pi(\delta \mid d_0, d_1)$ that uses the historical data once and the interim data once, where $d_0$ is the historical data and $d_1$ is the interim data. Spiegelhalter et al. (2004) have calculated the rejection region, the power or the conditional power, and the predictive power or the conditional predictive power of the hypotheses $H_0: \delta \le 0$ versus $H_1: \delta > 0$ for five different scenarios in Sections 6.5 and 6.6, namely non-sequential trials with classical power and Bayesian power, and sequential trials with hybrid predictions, Bayesian predictions, and classical predictions. They also gave the adjusting formulae for different hypotheses, which include a nonzero threshold and a reversal of hypotheses, in Section 6.5.4. In their book, they did not explicitly mention that the predictive powers of the five different scenarios use different combinations of historical and interim data.

CONTACT Ying-Ying Zhang robertzhangyying@qq.com; robertzhang@cqu.edu.cn https://zhangyingying319.wordpress.com
Supplemental data for this article can be accessed here. https://doi.org/10.1080/24754269.2021.1991557
In this article, we state explicitly that different predictive powers use different combinations of historical and interim data. Moreover, we expand the four predictive powers in Spiegelhalter et al. (2004) (the predictive power corresponding to the sequential trials with classical predictions is excluded) to eight predictive powers for the hypotheses $H_0: \delta \le \delta_0$ versus $H_1: \delta > \delta_0$ and the reversed hypotheses $H_0: \delta \ge \delta_0$ versus $H_1: \delta < \delta_0$, which can be seen in Tables 1 and 2, where $\delta_0$ is a threshold value for $\delta$. In other words, we have discovered four predictive powers with historical and interim data for the hypotheses and the reversed hypotheses. Finally, the eight predictive powers are used to guide the futility analysis in the tamoxifen example, in which long-term tamoxifen therapy is used for the prevention of recurrence of breast cancer. The tamoxifen example is a Phase III trial, and the predictive powers suggest stopping the trial for futility.
The rest of the paper is organized as follows. In Section 2, we provide two tables, which list the eight predictive powers with historical and interim data, their analytical expressions, the predictive distributions, and the data used.

Table 1. The eight predictive powers with historical and interim data, their analytical expressions, the predictive distributions, the data used, and the references for the hypotheses $H_0: \delta \le \delta_0$ versus $H_1: \delta > \delta_0$.

Eight predictive powers with historical and interim data
Similar to Dmitrienko and Wang (2006) and Jiang (2011), a go/no-go decision rule can be defined at the end of the early phase trial or at the interim of the Phase III trial. In our notation,

$$\text{Decision} = \begin{cases} \text{Stop for futility}, & \mathrm{PP} \le \gamma_f, \\ \text{Conditional-Go}, & \gamma_f < \mathrm{PP} < \gamma_g, \\ \text{Go}, & \gamma_g \le \mathrm{PP} < \gamma_e, \\ \text{Stop for efficacy}, & \mathrm{PP} \ge \gamma_e, \end{cases} \tag{1}$$

where PP is the predictive power, while $\gamma_f$, $\gamma_g$ and $\gamma_e$ are pre-specified thresholds for futility, go and efficacy, respectively. The thresholds should satisfy the constraints $0 < \gamma_f < \gamma_g < \gamma_e < 1$. Jiang (2011) suggests $\gamma_f \ge 0.5$, with $\gamma_f = 0.5$ meaning that a stop-for-futility decision is taken if $\mathrm{PP} \le 0.5$, that is, if the risk of failure is greater than or equal to the chance of success. The threshold $\gamma_e$ can be set at a relatively high value such as 0.9, so that when the PP reaches or exceeds this threshold, a stop-for-efficacy decision can be made. Finally, the threshold $\gamma_g$ can be set at a value such as 0.8, so that if $\gamma_g \le \mathrm{PP} < \gamma_e$, a go decision can be made, where 'Go' means moving on without adjusting the sample size of the future data $m_2$; if $\gamma_f < \mathrm{PP} < \gamma_g$, a conditional-go decision can be made, where 'Conditional-Go' means moving on either with $m_2$ increased to improve the PP (so that it is equal or close to $\gamma_g$) or with $m_2$ unchanged while acknowledging a reduced PP, that is, an increased risk of failure. Note that there are two no-go decisions in our decision criteria (1), that is, stop for futility and stop for efficacy.

The data structures of the historical data, interim data and future data are described in Figure 1. In the figure, H means historical data, I means interim data and F means future data. The historical data could be the Phase II data, or the previous Phase III data, as long as the outcome variable and patient populations are the same between the historical data and the upcoming Phase III data.
Moreover, the historical data could also be a fictitious dataset corresponding to a sceptical or optimistic prior, and in this case $d_0$ and $m_0$ of the historical data are determined to satisfy the requirements of the sceptical or optimistic prior. Note that $d_0$, $d_1$ and $d_2$ are the observed treatment differences between the treatment group and the control (or placebo) group in the historical data, interim data and future data, respectively, and $m_0$, $m_1$ and $m_2$ are the per-group numbers of patients of the historical data, interim data and future data, respectively. In the upper plot, only historical data are available. The upper plot also depicts the data structure for (7). In the upper plot, the sample size of the future data $m_2$ is the whole sample size of the Phase III trial, and the present time of the program (termed now) is at the end of the early phase and before the start of Phase III. At that time, only two predictive powers can be calculated to facilitate the go/no-go decision according to the decision criteria (1), that is, the first and fifth predictive powers in Tables 1 and 2. If the PP results in a 'Go' or 'Conditional-Go' decision according to the decision criteria (1), then the Phase III trial is launched. However, if the PP results in a no-go decision (either stop for futility or stop for efficacy), then the Phase III trial will not be launched. If the Phase III trial is launched and the interim data of the Phase III trial are available, the data structure of the program can be described by the lower plot of Figure 1, in which the present time of the program (termed now) is at the interim of the Phase III trial. At the interim, six predictive powers can be calculated to facilitate the go/no-go decision according to the decision criteria (1), that is, the second, third, fourth, sixth, seventh and eighth predictive powers in Tables 1 and 2.
In the lower plot, both historical data and interim data are available. Moreover, the lower plot also depicts the data structure for (4) and (5).
Note that F in the figure denotes the data after the interim analysis in the lower plot, and the full Phase III data in the upper plot. This meaning of F is justified as follows. First, the future data are the data after the present time (termed now in the upper and lower plots). Second, in the lower plot, as the information time increases, the interim data grow and the future data shrink. Conversely, as the information time decreases, the future data grow and the interim data shrink. When the information time is 0, the future data are the full Phase III data.
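The go/no-go rule (1) described at the start of this section can be sketched as a small function. This is only an illustration under the thresholds quoted in the text (0.5, 0.8 and 0.9); the function name `go_no_go` and the treatment of the boundary values are our assumptions, chosen to match the inequalities stated above.

```python
def go_no_go(pp, gamma_f=0.5, gamma_g=0.8, gamma_e=0.9):
    """Map a predictive power PP in [0, 1] to one of the four decisions in (1).

    The default thresholds are the example values quoted in the text;
    they are not recommendations.
    """
    if not 0.0 < gamma_f < gamma_g < gamma_e < 1.0:
        raise ValueError("thresholds must satisfy 0 < gamma_f < gamma_g < gamma_e < 1")
    if not 0.0 <= pp <= 1.0:
        raise ValueError("pp must be a probability")
    if pp <= gamma_f:          # risk of failure at least as large as chance of success
        return "Stop for futility"
    if pp < gamma_g:           # gamma_f < PP < gamma_g
        return "Conditional-Go"
    if pp < gamma_e:           # gamma_g <= PP < gamma_e
        return "Go"
    return "Stop for efficacy"  # PP >= gamma_e


for pp in (0.321, 0.656, 0.85, 0.95):
    print(pp, go_no_go(pp))
```

Both no-go outcomes are returned as distinct strings, so a caller can tell a futility stop from an efficacy stop.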
Suppose that the interim analysis of a randomized controlled Phase III trial is to be conducted with patients randomized to one of two treatments, with $m_1$ patients allocated to treatment $i$ ($i = 1, 2$), where treatment 2 is the test drug and treatment 1 is the control (or placebo). Moreover, suppose that the $j$-th patient receiving treatment $i$ for the interim data yields a continuous response $x_{ij1}$, which we assume is normally distributed with an unknown mean $\mu_{i1}$ and a common known variance $\sigma^2$. The third subscript '1' in $x_{ij1}$ means that the responses are for the interim data. Moreover, assume that the data from the two treatments are independent. Thus the model of the interim data of the Phase III trial is

$$x_{ij1} \mid \mu_{i1} \sim N(\mu_{i1}, \sigma^2) \text{ independently}, \quad i = 1, 2, \; j = 1, \ldots, m_1.$$

It is easy to derive the sampling distributions of the sufficient statistics

$$\bar{x}_{i1} = \frac{1}{m_1} \sum_{j=1}^{m_1} x_{ij1}, \quad i = 1, 2.$$

More specifically, $\bar{x}_{21} \mid \mu_{21} \sim N(\mu_{21}, \sigma^2/m_1)$ and $\bar{x}_{11} \mid \mu_{11} \sim N(\mu_{11}, \sigma^2/m_1)$. Therefore,

$$d_1 \mid \delta \sim N\left(\delta, \frac{2\sigma^2}{m_1}\right),$$

where $d_1 = \bar{x}_{21} - \bar{x}_{11}$ is the sample mean difference based on the interim data of the Phase III trial, and $\delta = \mu_{21} - \mu_{11}$ is the true treatment effect based on the interim data of the Phase III trial.
Similarly, suppose that the future data of a randomized controlled Phase III trial are to be collected with patients randomized to one of two treatments, with $m_2$ patients allocated to each treatment. After derivations similar to those for the interim analysis of the Phase III trial, we have

$$d_2 \mid \delta \sim N\left(\delta, \frac{2\sigma^2}{m_2}\right),$$

where $d_2 = \bar{x}_{22} - \bar{x}_{12}$ is the sample mean difference based on the future data of the Phase III trial, $\delta = \mu_{22} - \mu_{12}$ is the true treatment effect based on the future data of the Phase III trial, $\bar{x}_{i2} = \frac{1}{m_2} \sum_{j=1}^{m_2} x_{ij2}$ ($i = 1, 2$) is the sample mean of $x_{ij2}$, which is the continuous response of the $j$-th patient receiving treatment $i$ for the future data, and $\mu_{i2}$ ($i = 1, 2$) is the unknown mean of $x_{ij2}$. The third subscript '2' in $x_{ij2}$ means that the responses are for the future data. Note that we have assumed that the true treatment effects based on the interim data and the future data of the Phase III trial are the same. This assumption has also been used in the literature; see, for instance, Spiegelhalter et al. (2004). Note also that the assumption can easily be violated in clinical trials, for example in an enrichment design, which changes the population. Therefore, our discussion is not suitable for enrichment designs.
Suppose that we have some prior knowledge about $\delta$ through the historical data corresponding to $m_0$ patients per group in two treatments, and the prior mean of $\delta$ is estimated to be $d_0$. We remark that the historical data with $m_0$ patients refer to Phase II patients specifically, and thus the treatment effect $\delta$ in Phase II could differ from that in Phase III. However, in many disease areas where the main clinical outcomes can be observed in a relatively short duration (such as acute pain, allergy, asthma, depression, hypertension, and so on), Phase II and Phase III trials often have the same trial design, including the same outcome variable and the same patient population. In these disease areas, the treatment effect $\delta$ in the Phase II and Phase III trials can be assumed to be the same. For simplicity, we assume a normal model for the prior. That is,

$$\delta \mid d_0 \sim N\left(d_0, \frac{2\sigma^2}{m_0}\right). \tag{2}$$

Note that this prior incorporating the historical data can be obtained as follows. For the historical data $d_0$, assume that

$$d_0 \mid \delta \sim N\left(\delta, \frac{2\sigma^2}{m_0}\right).$$

Suppose that we have no prior knowledge about $\delta$ before the historical data $d_0$, and thus we assume that $\delta$ has an improper uniform prior over $(-\infty, \infty)$, that is, $\pi(\delta) \propto 1$. Then the posterior distribution of $\delta$ given $d_0$ is easily found to be given by (2).
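Therefore, when the interim data $d_1$ is available, the model and the prior are given by

$$d_1 \mid \delta \sim N\left(\delta, \frac{2\sigma^2}{m_1}\right), \quad \delta \mid d_0 \sim N\left(d_0, \frac{2\sigma^2}{m_0}\right). \tag{3}$$

Given the likelihood of $d_1 \mid \delta$ and the prior of $\delta \mid d_0$ in (3), standard Bayesian calculus yields the posterior distribution of $\delta$ given $d_0, d_1$ and the conditional distribution of $d_1$ given $d_0$, that is,

$$\delta \mid d_0, d_1 \sim N\left(\frac{m_0 d_0 + m_1 d_1}{m_0 + m_1}, \frac{2\sigma^2}{m_0 + m_1}\right), \quad d_1 \mid d_0 \sim N\left(d_0, 2\sigma^2\left(\frac{1}{m_0} + \frac{1}{m_1}\right)\right). \tag{4}$$

Then, using the posterior distribution $\pi(\delta \mid d_0, d_1)$ as a new prior for our future data $d_2$, standard Bayesian calculus yields the posterior distribution of $\delta$ given $d_0, d_1, d_2$ and the conditional distribution of $d_2$ given $d_0, d_1$, that is,

$$\delta \mid d_0, d_1, d_2 \sim N\left(\frac{m_0 d_0 + m_1 d_1 + m_2 d_2}{m_0 + m_1 + m_2}, \frac{2\sigma^2}{m_0 + m_1 + m_2}\right), \quad d_2 \mid d_0, d_1 \sim N\left(\frac{m_0 d_0 + m_1 d_1}{m_0 + m_1}, 2\sigma^2\left(\frac{1}{m_0 + m_1} + \frac{1}{m_2}\right)\right). \tag{5}$$

The data structure of (4) and (5) is depicted in the lower plot of Figure 1. Note that the posterior distribution $\pi(\delta \mid d_0, d_1, d_2)$ is used in the calculations of the Bayesian rejection regions with $d_0, d_1, d_2$, $S^{B,d_0,d_1,d_2}_{\alpha,\delta_0}$. The conditional distribution $\pi(d_2 \mid d_0, d_1)$ is the predictive distribution used in the calculations of the even-numbered predictive powers in Table 1.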
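The normal updating described above can be sketched in a few lines of Python. The helper names `update` and `predictive` and the numbers in the usage lines are ours and purely illustrative; the only model assumptions are those of the text, namely a normal prior with variance $2\sigma^2/n$ and a normal likelihood with variance $2\sigma^2/m$.

```python
from math import sqrt


def update(prior_mean, prior_n, d, m):
    """Posterior (mean, n) of delta after observing a difference d from m patients per group.

    The prior is N(prior_mean, v / prior_n) and d | delta ~ N(delta, v / m) with
    v = 2 * sigma^2, so sample sizes add and the means are weighted by sample size.
    """
    post_n = prior_n + m
    post_mean = (prior_n * prior_mean + m * d) / post_n
    return post_mean, post_n


def predictive(prior_mean, prior_n, m, v):
    """Mean and standard deviation of the next observed difference based on m patients."""
    return prior_mean, sqrt(v * (1.0 / prior_n + 1.0 / m))


# Illustrative values only (not taken from the paper): sigma = 1, so v = 2.
v = 2.0
mean_1, n_1 = update(0.2, 50, d=0.1, m=40)        # prior from d0, then interim d1
pred_mean, pred_sd = predictive(mean_1, n_1, m=60, v=v)   # predictive for d2
mean_2, n_2 = update(mean_1, n_1, d=0.3, m=60)    # after the future data d2
print(mean_1, pred_sd, mean_2, sqrt(v / n_2))
```

Updating with $d_1$ and then with $d_2$ gives the same posterior as pooling all three data sources at once, which is why the posterior after the interim can serve as the new prior for the future data.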
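Similarly, when the interim data $d_1$ is not available, the model and the prior are given by

$$d_2 \mid \delta \sim N\left(\delta, \frac{2\sigma^2}{m_2}\right), \quad \delta \mid d_0 \sim N\left(d_0, \frac{2\sigma^2}{m_0}\right). \tag{6}$$

Given the likelihood of $d_2 \mid \delta$ and the prior of $\delta \mid d_0$ in (6), standard Bayesian calculus yields the posterior distribution of $\delta$ given $d_0, d_2$ and the conditional distribution of $d_2$ given $d_0$, that is,

$$\delta \mid d_0, d_2 \sim N\left(\frac{m_0 d_0 + m_2 d_2}{m_0 + m_2}, \frac{2\sigma^2}{m_0 + m_2}\right), \quad d_2 \mid d_0 \sim N\left(d_0, 2\sigma^2\left(\frac{1}{m_0} + \frac{1}{m_2}\right)\right). \tag{7}$$

The data structure of (7) is depicted in the upper plot of Figure 1. Note that the posterior distribution $\pi(\delta \mid d_0, d_2)$ is used in the calculations of the Bayesian rejection regions with $d_0, d_2$, $S^{B,d_0,d_2}_{\alpha,\delta_0}$. The conditional distribution $\pi(d_2 \mid d_0)$ is the predictive distribution used in the calculations of the odd-numbered predictive powers in Table 1.

For clarity, we define the Classical Power (CP), Classical Conditional Power (CCP), Bayesian Power (BP), and Bayesian Conditional Power (BCP). The CP is the probability of the classical rejection region with $d_2$, $S^{C,d_2}_{\alpha,\delta_0}$, given a value for $\delta$, $P(S^{C,d_2}_{\alpha,\delta_0} \mid \delta)$, where $S$ is for 'Success' and the success region is the rejection region, $C$ is for 'Classical', $\alpha$ is the significance level, and $\delta_0$ is a threshold value for $\delta$. The CCP is the probability of the classical rejection region with $d_1$ and $d_2$, $S^{C,d_1,d_2}_{\alpha,\delta_0}$, given values of $\delta$ and the interim result $d_1$, $P(S^{C,d_1,d_2}_{\alpha,\delta_0} \mid \delta, d_1)$. Analogously, the BP is the probability of the Bayesian rejection region with $d_0$ and $d_2$, $P(S^{B,d_0,d_2}_{\alpha,\delta_0} \mid \delta, d_0)$, and the BCP is the probability of the Bayesian rejection region with $d_0$, $d_1$ and $d_2$, $P(S^{B,d_0,d_1,d_2}_{\alpha,\delta_0} \mid \delta, d_0, d_1)$, where $B$ is for 'Bayesian'. Under normality assumptions for the priors and the likelihoods, it is easy to obtain closed-form expressions of the rejection regions and the powers, referred to as (8)–(11). The detailed derivations of these expressions can be found in the supplement.

Suppose that we are interested in testing the hypotheses $H_0: \delta \le \delta_0$ versus $H_1: \delta > \delta_0$. Hypotheses of this kind arise when a larger value of the population mean of the normal distribution means an improvement in the disease condition. Hence, a positive value of $\delta$ means better.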
The eight predictive powers with historical and interim data, their analytical expressions, the predictive distributions, the data used, and the references for the hypotheses $H_0: \delta \le \delta_0$ versus $H_1: \delta > \delta_0$ are given in Table 1. Note that the definitions of the eight predictive powers for the hypotheses are given in Table 1 under the column name 'Predictive Power'. In the table:
• For the predictive power column, we add a capital letter P (short for Predictive) to the nomenclatures to indicate that they are predictive powers. Moreover, we add a capital letter I (short for Interim) to the nomenclatures to indicate that the prior $\pi(\delta \mid d_0, d_1)$ uses the interim data.
• The analytical expressions are given as follows. In the table, the analytical expressions $E_1$, $E_3$, $E_5$ and $E_7$ share a common form, in which the Expression is $A$, $B(d_1)$, $C(d_0)$ and $D(d_0, d_1)$, given by (8), (9), (10) and (11), respectively. Similarly, $E_2$, $E_4$, $E_6$ and $E_8$ share a common form. The tedious calculations of the analytical expressions of the eight predictive powers in Table 1 can be found in the supplement. It is worth noting that calculating the predictive powers by directly calculating the expectations requires an important expectation identity (Zhang et al., 2014, 2020b).
• Note that in the table, there are only two predictive distributions, that is, $\pi(d_2 \mid d_0)$ and $\pi(d_2 \mid d_0, d_1)$.
• For the data used column, H means that the historical data are used, and I means that the interim data are used. HI means that the historical data are used once and the interim data are also used once. $HI^2$ means that the historical data are used once and the interim data are used twice. $H^2$ means that the historical data are used twice. $H^2I$ means that the historical data are used twice and the interim data are used once. $H^2I^2$ means that the historical data are used twice and the interim data are also used twice. Now we explain why the eight predictive powers use different combinations of historical and interim data. Note that the predictive power is an average power with respect to some prior. Only two priors are exploited for the eight predictive powers, that is, $\pi(\delta \mid d_0)$ and $\pi(\delta \mid d_0, d_1)$. The prior $\pi(\delta \mid d_0)$ uses the historical data ($d_0$) once, whereas the prior $\pi(\delta \mid d_0, d_1)$ uses the historical data ($d_0$) once and the interim data ($d_1$) once.
Four powers are used in the eight predictive powers, that is, the classical power $P(S^{C,d_2}_{\alpha,\delta_0} \mid \delta)$ that does not use any data, the classical conditional power $P(S^{C,d_1,d_2}_{\alpha,\delta_0} \mid \delta, d_1)$ that uses the interim data once, the Bayesian power $P(S^{B,d_0,d_2}_{\alpha,\delta_0} \mid \delta, d_0)$ that uses the historical data once, and the Bayesian conditional power $P(S^{B,d_0,d_1,d_2}_{\alpha,\delta_0} \mid \delta, d_0, d_1)$ that uses the historical data once and the interim data once. Therefore, the predictive power $I_1$ uses the historical data once, since it is an average of the classical power $P(S^{C,d_2}_{\alpha,\delta_0} \mid \delta)$ with respect to the prior $\pi(\delta \mid d_0)$. Likewise, the predictive power $I_8$ uses the historical data twice and the interim data twice, since it is an average of the Bayesian conditional power $P(S^{B,d_0,d_1,d_2}_{\alpha,\delta_0} \mid \delta, d_0, d_1)$ with respect to the prior $\pi(\delta \mid d_0, d_1)$. The data used for the other predictive powers can be explained in the same way.
• For $I_1$, $I_4$, $I_5$ and $I_8$, we can find a similar formula in Spiegelhalter et al. (2004). Note that in Spiegelhalter et al. (2004), the variance is $\sigma^2$, which corresponds to a one-arm trial, while in our article the variance is $2\sigma^2$, which corresponds to a two-arm trial. Moreover, Spiegelhalter et al. (2004) use a $z$ that is a lower quantile, that is, $\Phi(z_\epsilon) = \epsilon$. The other four predictive powers ($I_2$, $I_3$, $I_6$ and $I_7$) are discovered by us. Consequently, Table 1 gives a complete picture of the predictive powers with historical and interim data for futility and efficacy analysis for the hypotheses $H_0: \delta \le \delta_0$ versus $H_1: \delta > \delta_0$.

Now suppose that we are interested in testing the reversed hypotheses $H_0: \delta \ge \delta_0$ versus $H_1: \delta < \delta_0$. Hypotheses of this kind arise when a smaller value of the population mean of the normal distribution means an improvement in the disease condition. Hence, a negative value of $\delta$ means better. We use a '−' sign to indicate that the respective quantities are calculated for the reversed hypotheses. The eight predictive powers with historical and interim data, their analytical expressions, the predictive distributions, and the data used for the reversed hypotheses $H_0: \delta \ge \delta_0$ versus $H_1: \delta < \delta_0$ are given in Table 2. Note that the definitions of the eight predictive powers for the reversed hypotheses are given in Table 2 under the column name 'Predictive Power'. In the table:
• For the predictive power column, the nomenclatures are the same as in Table 1, with a '−' sign to indicate that the respective nomenclatures are for the reversed hypotheses.
• The analytical expressions are given as follows. In the table, $E^-_1$, $E^-_3$, $E^-_5$ and $E^-_7$ are of the same form as their counterparts in Table 1, with each Expression replaced by its counterpart for the reversed hypotheses. The same holds for $E^-_2$, $E^-_4$, $E^-_6$ and $E^-_8$.
The tedious calculations of the analytical expressions of the eight predictive powers in Table 2 can be found in the supplement.
• Note that in the table, there are only two predictive distributions, that is, $\pi(d_2 \mid d_0)$ and $\pi(d_2 \mid d_0, d_1)$.
• The data used column can be explained in the same way as in Table 1.
• To the best of our knowledge, there are no references available for the reversed hypotheses $H_0: \delta \ge \delta_0$ versus $H_1: \delta < \delta_0$.
Comparing Tables 1 and 2, we find that for each predictive power, the predictive distribution and the data used are the same. From the two tables we see that the analytical expressions for the hypotheses $H_0: \delta \ge \delta_0$ versus $H_1: \delta < \delta_0$ are just those for the hypotheses $H_0: \delta \le \delta_0$ versus $H_1: \delta > \delta_0$ with a negative sign added to the terms involving $d_0 - \delta_0$ and $d_1 - \delta_0$, and vice versa.
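The statement that a predictive power is a power averaged over a prior, and the sign-flip relation between the two sets of hypotheses, can be checked numerically. The sketch below does this for the simplest case only, the classical power averaged over $\pi(\delta \mid d_0)$. The closed form in `closed_form` is our own derivation under the normal model of this section (it is not copied from Table 1), and the function names and numerical values are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def classical_power(delta, m2, v, alpha, delta0=0.0, reverse=False):
    """Power of the one-sided test based on d2 ~ N(delta, v / m2), with v = 2 * sigma^2."""
    sign = -1.0 if reverse else 1.0   # reversed hypotheses mirror delta about delta0
    z = Z.inv_cdf(1.0 - alpha)        # upper alpha quantile
    return Z.cdf(sign * (delta - delta0) / sqrt(v / m2) - z)


def averaged_power(d0, m0, m2, v, alpha, delta0=0.0, reverse=False, n_grid=4001):
    """Trapezoidal average of the classical power over the prior N(d0, v / m0)."""
    prior = NormalDist(d0, sqrt(v / m0))
    lo, hi = d0 - 8 * prior.stdev, d0 + 8 * prior.stdev
    h = (hi - lo) / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        delta = lo + k * h
        weight = 0.5 if k in (0, n_grid - 1) else 1.0
        total += weight * classical_power(delta, m2, v, alpha, delta0, reverse) * prior.pdf(delta)
    return total * h


def closed_form(d0, m0, m2, v, alpha, delta0=0.0, reverse=False):
    """P(d2 falls in the rejection region) under d2 | d0 ~ N(d0, v * (1/m0 + 1/m2))."""
    sign = -1.0 if reverse else 1.0
    z = Z.inv_cdf(1.0 - alpha)
    return Z.cdf((sign * (d0 - delta0) - z * sqrt(v / m2)) / sqrt(v * (1.0 / m0 + 1.0 / m2)))


# Illustrative values only: sigma = 1 (v = 2), m0 = 40, m2 = 100, alpha = 0.025.
args = dict(m0=40, m2=100, v=2.0, alpha=0.025)
print(averaged_power(0.3, **args), closed_form(0.3, **args))
print(closed_form(-0.3, reverse=True, **args))  # mirror image of the first case
```

The numerical average and the closed form should agree to many decimal places, and the reversed-hypotheses value at $-d_0$ equals the original value at $d_0$ when $\delta_0 = 0$.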

A real data example
Long-term tamoxifen therapy is used for the prevention of recurrence of breast cancer (see Dignam et al., 1998; Example 6.7 in Spiegelhalter et al., 2004). The aim of the study is to estimate the disease-free survival benefit of tamoxifen over placebo in patients who have already taken tamoxifen for 5 years without a recurrence. That is, patients were randomized to either continuation of tamoxifen therapy or continuation with placebo after having survived recurrence-free under tamoxifen for 5 years. To detect a 40% reduction in annual risk associated with tamoxifen (hazard ratio = 0.6), with 85% power and a one-sided tail area of 5%, 115 events were required. The statistical model is the proportional hazards regression model, summarized using the approximate hazard ratio analysis. If there are $O_T$ events on treatment and $O_C$ events on control, then $d_1 = 2(O_T - O_C)/m_1$ is an approximate estimate of the log(hazard ratio) $\delta$, with mean $\delta$ and variance $4/m_1$, as shown in Tsiatis (1981). Prior distributions: an optimistic prior was centred on a 40% hazard reduction with a 5% chance of a negative effect (i.e., HR > 1), equivalent on the log(HR) scale to a normal prior with mean $\mu_o = \log(0.6) \approx -0.51$ and standard deviation 0.31 ($\sigma = \sqrt{2}$, $m_0 \approx 41.4$). Note that in Spiegelhalter et al. (2004), the variance is $\sigma^2 = 4$, while in our article the variance is $2\sigma^2 = 4$, and thus $\sigma = \sqrt{2}$ in our article. Moreover, $m_0 \approx 41.4$ is used to guarantee that 'an optimistic prior was centred on a 40% hazard reduction and a 5% chance of a negative effect'. A sceptical prior was also adopted, with the same standard deviation as the optimistic prior but centred on $\mu_s = 0$. The estimated log(HR) after the first interim analysis in 1993 is $d_1 = 0.435$. At that time $m_1 = 46$ events had been observed, and a further $m_2 = 115 - 46 = 69$ events were to be observed.
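The calibration of the optimistic prior and the approximate log(HR) estimate described above can be reproduced with a short sketch. The function names are ours. The split of 28 versus 18 events in the usage lines is a hypothetical choice that happens to give roughly 0.435 with 46 events; the text above does not report the actual split.

```python
from math import log, sqrt
from statistics import NormalDist


def optimistic_prior(hr=0.6, tail=0.05, unit_var=4.0):
    """Return (mean, sd, m0) of a normal prior on log(HR).

    The prior is centred at log(hr) and puts probability `tail` on log(HR) > 0.
    Its variance is written as unit_var / m0, with unit_var = 2 * sigma^2 = 4.
    """
    mean = log(hr)
    sd = -mean / NormalDist().inv_cdf(1.0 - tail)
    return mean, sd, unit_var / sd ** 2


def log_hr_estimate(events_treatment, events_control):
    """Approximate log(HR) estimate 2 * (O_T - O_C) / m and its standard error sqrt(4 / m)."""
    m = events_treatment + events_control
    return 2.0 * (events_treatment - events_control) / m, sqrt(4.0 / m)


mean, sd, m0 = optimistic_prior()
print(round(mean, 2), round(sd, 2), round(m0, 1))

# Hypothetical event split, used only to illustrate the formula.
d1, se1 = log_hr_estimate(28, 18)
print(round(d1, 3), round(se1, 3))
```

The implied number of fictitious prior events comes out close to the $m_0 \approx 41.4$ quoted in the text; the small difference is due to rounding of the standard deviation.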
In the tamoxifen example, let $h_1$ and $h_2$ be the hazard rates corresponding to tamoxifen (treatment) and placebo (control), respectively. Therefore, $\delta = \log(\mathrm{HR}) = \log(h_1/h_2)$, so that $\delta < 0$ favours tamoxifen and $\delta > 0$ favours control. Consequently, for $j = 1, \ldots, 8$, the $j$-th predictive power $I_j$ is for control superior, the $j$-th predictive power $I^-_j$ is for tamoxifen superior, and $1 - I_j - I^-_j$ is for equivocal.
The eight predictive powers with historical and interim data of the eventual conclusions for the B-14 trial after the first interim analysis in 1993 are reported in Table 3. In the table, the conclusion is: 'Tamoxifen superior', defined as a $1 - \alpha$ confidence interval or credible interval for $\delta = \log(\mathrm{HR})$ lying wholly below 0; 'Equivocal', defined as a $1 - 2\alpha$ confidence interval or credible interval for $\delta = \log(\mathrm{HR})$ including 0; and 'Control superior', defined as a $1 - \alpha$ confidence interval or credible interval for $\delta = \log(\mathrm{HR})$ lying wholly above 0. The significance level $\alpha$ is chosen to be 0.025 in all cases. For the first and fifth predictive powers, the number of events of the future data $m_2$ is the total number of events of the Phase III trial, 115, and not 69 (the further number of events to be observed). In Table 3, we observe the following facts.
• The sum of the three predictive powers in each row corresponding to the sceptical prior (or the optimistic prior) should be equal to 1. However, in some cases the sum is equal to 0.999, due to rounding error.
• The fourth predictive powers in Table 3 are the same as those under the column 'When not using prior in analysis', which can be calculated by (6.15), in Table 6.7 of Spiegelhalter et al. (2004). Moreover, the eighth predictive powers in Table 3 are the same as those under the column 'When using prior in analysis', which can be calculated by (6.18), in Table 6.7 of Spiegelhalter et al. (2004).
• All the predictive powers under the 'Tamoxifen superior' column are less than 0.85, the designed power. Note that these predictive powers are calculated with the significance level $\alpha$ chosen to be 0.025, while the designed power 0.85 is calculated with $\alpha$ chosen to be 0.05. When $\alpha$ is raised to 0.05 in the calculation of the predictive powers, the predictive powers also rise, as they are increasing functions of $\alpha$. However, they are still less than 0.85. This phenomenon has been observed in the literature; see, for instance, Chuang-Stein (2006), Chuang-Stein and Kirby (2017) and Spiegelhalter et al. (2004).
• For the eight predictive powers, the optimistic prior has a greater tendency to draw a 'Tamoxifen superior' conclusion than the sceptical prior, and this is reflected in the predictive powers. In contrast, the sceptical prior has a greater tendency to draw a 'Control superior' conclusion than the optimistic prior, and this is also reflected in the predictive powers.
• Now let us focus on the 'Tamoxifen superior' column. The first predictive power under the optimistic prior is 0.656, which is fairly high, because the first predictive power uses only the historical data (once) and not the interim data, and the historical data (a fictitious dataset corresponding to the optimistic prior) favour the tamoxifen treatment. The fifth predictive power under the optimistic prior is 0.771, which is even higher, because the fifth predictive power uses the historical data twice and does not use the interim data, and the historical data favour the tamoxifen treatment. Note that the time point of the first and fifth predictive powers is before the launch of the Phase III trial. Since the first and fifth predictive powers are between $\gamma_f = 0.5$ and $\gamma_g = 0.8$ in the decision criteria (1), a 'Conditional-Go' decision is made and the Phase III trial is launched. When the first interim data are available in 1993, we can calculate the other six predictive powers, which use both the historical data and the interim data. Intuitively, when the interim data are available, they should be used to give a more accurate prediction. The interim result $d_1 = 0.435 > 0$ favours the control treatment. The combination of the historical data and the interim data produces the six predictive powers 0.077, 0.161, 0.003, 0.195, 0.321 and 0.017. The largest of the six is 0.321, corresponding to the seventh predictive power, which uses the historical data twice and the interim data once. At the same time, the complementary probability of a 'Control superior' or 'Equivocal' conclusion under the seventh predictive power is as high as 0.679. The predictive powers in favour of tamoxifen under the sceptical prior are much lower than 0.321. Since the six predictive powers with interim data under the optimistic prior or the sceptical prior are all less than $\gamma_f = 0.5$, according to the decision criteria (1), we should stop the trial for futility.
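The 'Tamoxifen superior' values under the optimistic prior quoted above can be reproduced, to within rounding, with the following sketch. It is not the authors' code. It rests on our own reading of the setup: each predictive power is computed as the probability, under the relevant predictive distribution of $d_2$, that $d_2$ falls in the relevant one-sided rejection region, with all variances of the form $4/m$. The function names, the pairing of regions with priors, and the cutoff formula are our assumptions and have been checked only against the eight numbers quoted in the text.

```python
from math import log, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def predictive_power(region_data, prior_data, m2, v=4.0, alpha=0.025, delta0=0.0):
    """P(d2 lies in the rejection region for H1: delta > delta0).

    region_data: (m, d) pairs pooled with d2 in the test (empty for the classical test).
    prior_data:  (m, d) pairs pooled into the normal prior for delta.
    v is the unit variance 2 * sigma^2, so that Var(d | delta) = v / m.
    """
    z = Z.inv_cdf(1.0 - alpha)                    # upper alpha quantile
    n = m2 + sum(m for m, _ in region_data)       # total information used by the test
    # Smallest d2 for which the pooled estimate exceeds delta0 + z * standard error.
    cutoff = (n * (delta0 + z * sqrt(v / n)) - sum(m * d for m, d in region_data)) / m2
    n_prior = sum(m for m, _ in prior_data)
    mean = sum(m * d for m, d in prior_data) / n_prior   # predictive mean of d2
    sd = sqrt(v * (1.0 / n_prior + 1.0 / m2))            # predictive sd of d2
    return 1.0 - Z.cdf((cutoff - mean) / sd)


def eight_predictive_powers(d0, d1, m0, m1, m_total, reverse=False, **kw):
    """Predictive powers 1 to 8; powers 1 and 5 are pre-trial and use m2 = m_total."""
    if reverse:  # H1: delta < delta0 is the mirror image of H1: delta > delta0
        d0, d1 = -d0, -d1
        kw["delta0"] = -kw.get("delta0", 0.0)
    hist, interim = (m0, d0), (m1, d1)
    rest = m_total - m1
    specs = [
        ([], [hist], m_total),                       # 1: CP,  prior pi(delta | d0)
        ([], [hist, interim], rest),                 # 2: CP,  prior pi(delta | d0, d1)
        ([interim], [hist], rest),                   # 3: CCP, prior pi(delta | d0)
        ([interim], [hist, interim], rest),          # 4: CCP, prior pi(delta | d0, d1)
        ([hist], [hist], m_total),                   # 5: BP,  prior pi(delta | d0)
        ([hist], [hist, interim], rest),             # 6: BP,  prior pi(delta | d0, d1)
        ([hist, interim], [hist], rest),             # 7: BCP, prior pi(delta | d0)
        ([hist, interim], [hist, interim], rest),    # 8: BCP, prior pi(delta | d0, d1)
    ]
    return [predictive_power(r, p, m2, **kw) for r, p, m2 in specs]


# Optimistic prior, 'Tamoxifen superior' (reversed hypotheses, delta0 = 0).
optimistic = eight_predictive_powers(
    d0=log(0.6), d1=0.435, m0=41.4, m1=46, m_total=115, reverse=True
)
print([round(p, 3) for p in optimistic])
```

The printed values should land close to the 0.656, 0.077, 0.161, 0.003, 0.771, 0.195, 0.321 and 0.017 reported in the text; small differences in the third decimal can arise from the rounding of $m_0$ and $d_1$. Calling the function with `d0=0.0` gives the corresponding sketch for the sceptical prior.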

Numerical simulations
In this section, we conduct extensive simulations to investigate the sensitivity of the eight predictive powers to the prior ($d_0$), the sample sizes ($m_0$, $m_1$, $m_2$), the interim result ($d_1$) and the interim time ($t$). The baseline values are taken from the real data of the tamoxifen example, namely $m_0^r \approx 41.4$, $m_1^r = 46$, $m_2^r = 69$ and $d_1^r = 0.435$, together with the corresponding interim time $t^r$, where $m_0^r$ is calculated to ensure that an optimistic prior is centred on a 40% hazard reduction with a 5% chance of a negative effect (i.e., HR > 1), equivalent on the log(HR) scale to a normal prior with mean $\mu_o = \log(0.6) \approx -0.51$ and standard deviation 0.31 ($\sigma = \sqrt{2}$, $m_0^r \approx 41.4$). We add a superscript 'r' in $m_0^r$, $m_1^r$, $m_2^r$, $d_1^r$ and $t^r$ to indicate that they are from the real data.

Now let us explain the reason for choosing $\sigma = \sqrt{2}$ in the simulations. As described in Section 2.4.2 of Spiegelhalter et al. (2004), suppose that the first intervention corresponds to an active treatment T, and the second to a control C. Often the results of a survival analysis are given in terms of an observed log-rank test statistic $L_m$, which is defined as the excess of events under T compared with that expected were there no treatment effect, where $m$ is the total number of events observed. $L_m$ is often denoted as O−E (observed minus expected). Assuming proportional hazards, we have the following approximation in the particular case of equal allocation and follow-up. If there have been $O_T$ events on treatment and $O_C$ events on control, then the expected number of events in the treatment group under the null hypothesis is approximately $m/2$, and hence the log-rank statistic is

$$L_m = O_T - \frac{m}{2} = \frac{O_T - O_C}{2}.$$

It is shown in Tsiatis (1981) that, for large trials,

$$\frac{4 L_m}{m} = \frac{2(O_T - O_C)}{m}$$

is an approximate estimate of the log(hazard ratio) $\theta$, with mean $\theta$ and variance $4/m$. Hence we can set $\sigma = 2$ and adopt a normal likelihood. Note that in Spiegelhalter et al. (2004), the variance is $\sigma^2 = 4$, while in our article the variance is $2\sigma^2 = 4$, and thus $\sigma = \sqrt{2}$ in our article, as $\mathrm{Var}(d \mid \delta) = 2\sigma^2/m = 4/m$. Let us introduce some notations used in this section.
In these notations, the superscript 's' is for the sceptical prior, which corresponds to $d_0 = \mu_s$; the superscript 'o' is for the optimistic prior, which corresponds to $d_0 = \mu_o$; the subscript 'i' is for the $i$-th predictive power; $I^-$ is for tamoxifen superior, $I$ is for control superior, and $E$ is for equivocal.
The sensitivity analysis of d 0 on the eight predictive powers is displayed in Figure 2. In the figure, we note the following issues.
• The first and second predictive powers are related to the CP, the third and fourth predictive powers are related to the CCP, the fifth and sixth predictive powers are related to the BP, and the seventh and eighth predictive powers are related to the BCP.
• The corresponding values in Table 3 are labelled in the figure by six markers.
• In the first plot, a different $d_0$ value corresponds to a different prior, with $d_0 = \mu_s$ corresponding to the sceptical prior, and $d_0 = \mu_o$ corresponding to the optimistic prior.
• From the first plot, we see that as $d_0$ moves from $\mu_s = 0$ to $\mu_o = \log(0.6) \approx -0.51$ and to below $\mu_o$, the $d_0$ values favour tamoxifen more and more, and the predictive powers for tamoxifen superior ($I_1^-$) become larger and larger, while the predictive powers for control superior ($I_1$) and equivocal ($1 - I_1^- - I_1$) become smaller and smaller. Conversely, as $d_0$ moves from $\mu_s = 0$ to above $\mu_s$, the $d_0$ values favour control more and more, and the predictive powers for control superior ($I_1$) become larger and larger, while the predictive powers for tamoxifen superior ($I_1^-$) and equivocal ($1 - I_1^- - I_1$) become smaller and smaller.
• The other seven plots can be explained similarly to the first plot.
• It is interesting to note that for the first and fifth predictive powers, the predictive powers for equivocal are symmetric around $d_0 = 0$, and thus when $d_0$ moves from $\mu_s = 0$ to $\mu_o = \log(0.6) \approx -0.51$, the predictive powers for equivocal become smaller and smaller. For the other six predictive powers, by contrast, the predictive powers for equivocal are symmetric around a negative $d_0$, and thus when $d_0$ moves from $\mu_s = 0$ to $\mu_o = \log(0.6) \approx -0.51$, the predictive powers for equivocal may become larger and larger (e.g., the second, fourth and eighth predictive powers), or may become larger and then smaller (e.g., the third, sixth and seventh predictive powers).
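The qualitative behaviour described for the first plot can be reproduced with a simple prior-predictive calculation. The Python sketch below is only in the spirit of the first predictive power and is not the analytical expression of Table 1: it assumes a prior $\delta \sim N(d_0, 2\sigma^2/m_0)$, future data $d_2 \mid \delta \sim N(\delta, 2\sigma^2/m_2)$, and a classical test that declares superiority when $|d_2|$ exceeds the two-sided 5% critical value. The value $m_2 = 100$, the critical value and the function name `outcome_probabilities` are assumptions made for illustration.

```python
from math import log, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()
SIGMA = sqrt(2.0)                    # sigma = sqrt(2), as in the text
M0 = 41.4                            # m_0^r from the text
M2 = 100.0                           # hypothetical number of future events
Z_CRIT = STD_NORMAL.inv_cdf(0.975)   # assumed significance threshold


def outcome_probabilities(d0, m0=M0, m2=M2, sigma=SIGMA):
    """Return prior-predictive probabilities (tamoxifen, control, equivocal).

    Marginally, d2 ~ N(d0, 2 sigma^2 (1/m0 + 1/m2)); negative values of d2
    favour tamoxifen and positive values favour control.
    """
    se_future = sqrt(2.0 * sigma ** 2 / m2)
    predictive = NormalDist(d0, sqrt(2.0 * sigma ** 2 * (1.0 / m0 + 1.0 / m2)))
    cutoff = Z_CRIT * se_future
    tamoxifen = predictive.cdf(-cutoff)
    control = 1.0 - predictive.cdf(cutoff)
    return tamoxifen, control, 1.0 - tamoxifen - control


# Move d0 from favouring control, through mu_s = 0, to mu_o and below.
for d0 in (0.3, 0.0, log(0.6), -0.8):
    tam, con, equ = outcome_probabilities(d0)
    print(f"d0 = {d0:6.2f}: tamoxifen = {tam:.3f}, control = {con:.3f}, equivocal = {equ:.3f}")
```

Under these assumptions, the probability of tamoxifen superior grows as $d_0$ decreases, and the probability of equivocal is symmetric around $d_0 = 0$, in line with the remarks on the first predictive power.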
The sensitivity analysis of m 0 on the eight predictive powers is displayed in Figure 3. In the figure, we note the following issues.
• For the $i$-th ($i = 1, \ldots, 8$) predictive power, there are six markers (among them ∘, + and ×), which mark the values at $m_0 = m_0^r$; the corresponding values in Table 3 are labelled in the figure by these six markers.
• Note that $\mathrm{Var}(d_0 \mid \delta) = 2\sigma^2/m_0$, and thus when $m_0$ is large, the variance of $d_0 \mid \delta$ will be small.
The sensitivity analysis of $d_1$ on the eight predictive powers is displayed in Figure 4. In the figure, we note the following issues.
• Note that $d_1$ is the observed difference between the treatment group and control (or placebo) group means in the interim data. The first and fifth predictive powers do not use the interim data, and thus they are missing in the figure.
• Additionally, the sceptical prior favours control more than the optimistic prior does, and thus $I_i^s$ are consistently higher than $I_i^o$, for $i = 2, 3, 4, 6, 7, 8$.
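The sensitivity analysis of $m_1$ on the eight predictive powers is displayed in Figure 5. In the figure, we note the following issues.
• Note that $m_1$ is the per group number of patients of the interim data. The first and fifth predictive powers do not use the interim data, and thus they are missing in the figure.
• The predictive power values for $i = 2, 3, 4, 6, 7, 8$ in Table 3 are labelled in the figure by six markers (among them ∘, + and ×).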
• Note that $\mathrm{Var}(d_1 \mid \delta) = 2\sigma^2/m_1$, and thus when $m_1$ is large, the variance of $d_1 \mid \delta$ will be small.
• When $m_1 \to s = 115$, the predictive powers tend to 1 or 0.
The sensitivity analysis of $m_2$ on the eight predictive powers is displayed in Figure 6. In the figure, we note the following issues.
• For the $i$-th ($i = 1, \ldots, 8$) predictive power, there are six markers (among them ∘, + and ×), which mark the values at $m_2 = m_2^r$; these are the values reported in Table 3.
• The increase-decrease characteristics observed from Figure 6 are summarized in Table 6. From the table, we observe that as $m_2$ increases, $I_i^{s-}$ and $I_i^{o-}$ increase for all eight predictive powers. $I_i^s$ are increasing functions of $m_2$ for the first and fifth predictive powers, increasing and then decreasing functions of $m_2$ for the second and sixth predictive powers, and decreasing functions of $m_2$ for the third, fourth, seventh and eighth predictive powers. $1 - I_i^{s-} - I_i^s$ are decreasing functions of $m_2$ for the first, second, fifth and sixth predictive powers, and increasing and then decreasing functions of $m_2$ for the third, fourth, seventh and eighth predictive powers. $I_i^o$ are constantly zero for the first, fifth, sixth, seventh and eighth predictive powers, an increasing and then decreasing function of $m_2$ for the second predictive power, and decreasing functions of $m_2$ for the third and fourth predictive powers. $1 - I_i^{o-} - I_i^o$ are decreasing functions of $m_2$ for the first, second, fifth, sixth, seventh and eighth predictive powers, and increasing and then decreasing functions of $m_2$ for the third and fourth predictive powers. Note that some predictive powers display the same increase-decrease characteristics, namely the first and fifth predictive powers, the third and fourth predictive powers, and the seventh and eighth predictive powers.
The sensitivity analysis of $t$ on the eight predictive powers is displayed in Figure 7. In the figure, we note the following issues.
• Note that $t$ is the information time of the interim data. The first and fifth predictive powers do not use the interim data, and thus they are missing in the figure.
• The predictive power values for $i = 2, 3, 4, 6, 7, 8$ in Table 3 are labelled in the figure by six markers (among them ∘, + and ×).
• When t → 1, the predictive powers tend to 1 or 0.
• The increase-decrease characteristics of the predictive powers for $i = 2, 3, 4, 6, 7, 8$ observed from Figure 7 are the same as those observed from Figure 5, which are summarized in Table 5.
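The limiting behaviour noted above, where the quantities tend to 1 or 0 as $m_1 \to s$ (equivalently $t \to 1$), can be illustrated with a generic classical conditional power. The Python sketch below rests on assumptions of our own and is not necessarily the CCP defined in this article: the final test is assumed to pool the interim and future data over $s = m_1 + m_2$, the information time is taken as $t = m_1/s$, the critical value is the one-sided 2.5% value, and the interim results $d_1 = 0.5$ and $d_1 = -0.2$ as well as the function name `conditional_power` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()
SIGMA = sqrt(2.0)                    # sigma = sqrt(2), as in the text
S_TOTAL = 115.0                      # s = 115 from the text
Z_CRIT = STD_NORMAL.inv_cdf(0.975)   # assumed one-sided 2.5% critical value


def conditional_power(d1, m1, delta, s=S_TOTAL, sigma=SIGMA):
    """P(pooled final estimate exceeds its critical value | d1, delta).

    The pooled estimate is (m1 * d1 + m2 * d2) / s with m2 = s - m1 and
    d2 | delta ~ N(delta, 2 sigma^2 / m2).
    """
    m2 = s - m1
    if m2 <= 0:
        raise ValueError("m1 must be smaller than s")
    critical = Z_CRIT * sqrt(2.0 * sigma ** 2 / s)
    needed_d2 = (s * critical - m1 * d1) / m2   # future result needed for rejection
    se_d2 = sqrt(2.0 * sigma ** 2 / m2)
    return 1.0 - STD_NORMAL.cdf((needed_d2 - delta) / se_d2)


# Hypothetical interim results, evaluated under the current trend delta = d1.
for d1 in (0.5, -0.2):
    for m1 in (30.0, 60.0, 90.0, 110.0, 114.9):
        cp = conditional_power(d1, m1, delta=d1)
        print(f"d1 = {d1:5.2f}, t = {m1 / S_TOTAL:.3f}: conditional power = {cp:.4f}")
```

As $m_1$ approaches $s$, the future data can no longer change the final decision, so the conditional power is driven to 1 when the interim result already exceeds the final critical value and to 0 otherwise.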

Conclusion and discussion
For the randomized controlled early phase and Phase III trials, suppose that the model and the prior are given by (3). We provide two tables in this article. The eight predictive powers with historical and interim data, their analytical expressions, the predictive distributions, the data used, and the references for the hypotheses H 0 : δ ≤ δ 0 versus H 1 : δ > δ 0 are given in Table 1. The eight predictive powers with historical and interim data, their analytical expressions, the predictive distributions and the data used for the reversed hypotheses H 0 : δ ≥ δ 0 versus H 1 : δ < δ 0 are given in Table 2. Moreover, the data structures of the historical data, interim data and future data are described in Figure 1. Furthermore, the eight predictive powers with historical and interim data for the hypotheses and the reversed hypotheses are utilized to guide the futility analysis in the tamoxifen example. Finally, extensive simulations are conducted to investigate the sensitivity analysis of priors (d 0 ), sample sizes (m 0 , m 1 , m 2 ), interim result (d 1 ) and interim time (t) on the eight predictive powers.
In addition to the four predictive powers (I 1 , I 4 , I 5 , I 8 ) summarized in Table 1, we discover and calculate another four predictive powers (I 2 , I 3 , I 6 , I 7 ) also summarized in Table 1, for the hypotheses H 0 : δ ≤ δ 0 versus H 1 : δ > δ 0 . Moreover, we calculate eight predictive powers (I − 1 to I − 8 ) summarized in Table 2, for the reversed hypotheses H 0 : δ ≥ δ 0 versus H 1 : δ < δ 0 . The combination of Tables 1 and 2 gives us a complete picture of the predictive powers with historical and interim data for futility and efficacy analysis, as illustrated in Table 3.
Comparing the eight predictive power calculations, one main difference among them is how many times the historical data and the interim data are utilized: each could be used once or twice. It may be confusing to the reader why the historical data or interim data could be used twice. For example, if the predictive power is calculated at the time when the required interim data are collected, why would one incorporate the interim data into the prior specification, given that the interim data have already contributed to the likelihood? This is the case for the fourth and eighth predictive powers in Tables 1 and 2. Note that in Table 1, the fourth predictive power is (6.15) in Spiegelhalter et al. (2004), and it is the average classical conditional power with respect to the updated new prior π(δ|d 0 , d 1 ); the eighth predictive power is (6.18) in Spiegelhalter et al. (2004), and it is the average Bayesian conditional power with respect to the updated new prior π(δ|d 0 , d 1 ). If one wishes to use the historical data and interim data only once, then one could use the second and third predictive powers in the two tables, both of which are derived in this article. Another possible solution to the problem of using the data twice is to use external data.
Two sets of one-sided hypotheses are considered throughout the paper, and they are both needed. That is, both Tables 1 and 2 are needed. As discussed in the real data example, for j = 1, . . . , 8, the j-th predictive power I j (see Table 1) is for control superior, the j-th predictive power I − j (see Table 2) is for tamoxifen superior, and 1 − I j − I − j is for equivocal. We have assumed a known variance (σ 2 ), which is unrealistic. However, in the literature and real applications (see for instance Chuang-Stein, 2006; Kirby et al., 2012; Lan & Wittes, 2012; O'Hagan et al., 2005; Spiegelhalter et al., 2004; Wang et al., 2006), it is common practice to assume that the variance σ 2 is known in order to obtain analytical solutions, such as those involving Φ(·) for powers and average powers. When the variance is unknown, one might use the historical data to specify a sampling prior for σ 2 (Chen et al., 2011). Alternatively, one might utilize a t statistic. As stated in O'Hagan et al. (2005), the sampling distribution of t is a non-central t distribution (which only becomes an ordinary Student t distribution if δ = 0). Nevertheless, the estimate of σ 2 based on previous Phase II trials or publications is often good enough to give practitioners some assurance that a prior for σ 2 is probably unnecessary when designing the Phase III trial. Furthermore, in practice and in publications, it is not common to place a prior on σ 2 in calculations within the frequentist framework or the mixed frequentist and Bayesian framework. However, it is very common to include a prior on σ 2 in a pure Bayesian framework.
We have assumed equal variances for the normally distributed responses of the two treatments of the Phase III trial. The equal variances assumption can be reasonably met in reality by exploiting the randomized controlled Phase III trial. This statement needs further justification. Consider a well-designed (patient-masked and outcome observer-blinded) placebo-controlled trial in which patients in the control group demonstrate (approximately) the same outcome before and after treatment exposure. If the study drug is effective in a certain portion of patients in the treatment arm, the outcome for these patients will be different (shifted by a certain magnitude) before and after treatment. Hence, the variance in the treatment arm is expected to be higher than that in the control arm, unless the study drug is similarly effective in every patient who received it. On the other hand, if the study drug leads to an elevation (or decrease) of the outcome to a certain boundary value, the variance in the treatment group may be even smaller than that in the control group. Since the difference could go in either direction, for simplicity we assume equal variances for the normally distributed responses of the two treatments. However, it is not uncommon to assume unequal variances in a pure Bayesian framework. The method demonstrated in Section 2 assumes equal allocation to the treatment arms for illustration purposes, but the method can be easily adapted when the randomization ratios are not balanced. See the Conclusions and Discussion section in Deng et al. (2020) for details.
For simplicity, we assume that outcome measurements are available for all individuals in the study and that everyone in the treatment arm and the control arm is fully adherent to the treatment they are allocated to, i.e., no non-compliance or treatment arm cross-over. In other words, the meaning of the effect parameter we are going to identify from the observed data is the true average treatment effect.
For simplicity, we have assumed the true treatment effects based on the historical data of the early phase trial, the interim data, and the future data of the Phase III trial are the same. This assumption has also been used in the literature. For example, Chuang-Stein (2006) has assumed that the true treatment effects based on the Phase II trial and the Phase III trial are the same. Spiegelhalter et al. (2004) have assumed that the true treatment effects based on the interim data and the future data of the Phase III trial are the same.
The analytical derivations in Section 2 are based on normal likelihoods. As explained in Section 2.4 of Spiegelhalter et al. (2004), normal likelihoods can be used for binary data, survival data, count responses and continuous responses. In the real data example, survival data (disease-free survival time) is the primary outcome variable. Note that, in general, effect estimates such as log hazard ratios approximately follow a normal distribution. It is important to stress that m 0 , m 1 and m 2 represent numbers of events, not sample sizes, in this context.
Intuitively, when the historical and interim data are available, they should be used to give a more accurate prediction, as with the predictive powers shown in Table 3. Therefore, we recommend reporting all eight predictive powers in practice to have a complete picture for futility and efficacy analysis.
If one is interested in evaluating whether the incorporation of the historical data or interim data can improve the estimation of treatment effects for futility analysis, a real data example is not enough. One may need to conduct simulation studies to evaluate estimation accuracy or correct stopping rates with and without the historical data (or interim data). Alternatively, one may use the Receiver Operating Characteristic (ROC) curve as a tool to evaluate and compare operating characteristics with and without the historical data (or interim data). In fact, we are currently working on the analytical ROC analyses of the eight predictive powers, and the elaborated version deserves another publication.
Table 3 summarizes the predictive power values for the example data under three predefined scenarios (tamoxifen superior, equivocal, and control superior) considering sceptical and optimistic priors. Note that the three scenarios are based on the notion of 'statistical significance', i.e., whether or not 0 is included in the 95% posterior interval for the target parameter δ. One could instead specify these scenarios using clinically relevant equivalence margins for δ (say ±5% or ±10%). The statement 'equivocal' would then hold only if both credible interval limits fall within these margins.
The way the results are currently presented suggests stopping the trial for futility, but this may in fact be an imprecision issue due to a small m 2 (or a limited overall number of events). This claim is supported by the fact that even for very low optimistic predictive power values under scenario 'Tamoxifen superior', the sceptical predictive power values under scenario 'Control superior' remain relatively low. This means that the confidence intervals or credible intervals of δ are often too wide to exclude 0 for the target parameter δ. The lengths of the confidence intervals or credible intervals of δ and the lengths of the intervals of d 2 of equivocal are decreasing functions of m 2 . That is, when m 2 is small (imprecision), the lengths of the intervals of d 2 of equivocal are large. Hence, the probabilities of equivocal for the powers and predictive powers will probably be large. It is worth noting that the imprecision issues due to a small m 2 (or a limited overall number of events) affect all four powers (CP, CCP, BP and BCP) and all eight predictive powers. We are currently working on the imprecision issue, and the elaborated version deserves another publication.
Assuming a flat prior with infinite tails (π(δ) ∝ 1) seems overly conservative. In practice, the uniform prior interval would rather be [a, b] with |b| > |a| and a ≤ 0 < b for the hypotheses H 0 : δ ≤ 0 versus H 1 : δ > 0, expressing the optimism of the drug developer, since the drug has already made it beyond laboratory and animal testing. That is, it is useful to allow for the incorporation of a proper uniform prior for δ when estimating the posterior δ|d 0 in formula (3) and the expressions that follow. However, in this situation, one may not obtain analytical solutions, and the predictive powers would then have to be derived numerically.
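To indicate how such a numerical derivation might proceed, the Python sketch below averages a classical power over the posterior of δ given d 0 under a proper uniform prior on [a, b], using a midpoint rule. It is a sketch under our own assumptions rather than an expression from Table 1: the likelihood d 0 |δ ∼ N(δ, 2σ 2 /m 0 ), the future data d 2 |δ ∼ N(δ, 2σ 2 /m 2 ), the one-sided 2.5% critical value, the inputs d 0 = 0.2, m 2 = 100, a = −0.1 and b = 1, and all function names are hypothetical choices for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()
SIGMA = sqrt(2.0)                    # sigma = sqrt(2), as in the text
Z_CRIT = STD_NORMAL.inv_cdf(0.975)   # assumed one-sided 2.5% critical value


def power(delta, m2, sigma=SIGMA):
    """Classical power of the one-sided test of H0: delta <= 0 based on d2."""
    se = sqrt(2.0 * sigma ** 2 / m2)
    return 1.0 - STD_NORMAL.cdf(Z_CRIT - delta / se)


def predictive_power_flat_prior(d0, m0, m2, sigma=SIGMA):
    """Closed form under pi(delta) proportional to 1: d2 ~ N(d0, 2 sigma^2 (1/m0 + 1/m2))."""
    se = sqrt(2.0 * sigma ** 2 / m2)
    sd = sqrt(2.0 * sigma ** 2 * (1.0 / m0 + 1.0 / m2))
    return 1.0 - STD_NORMAL.cdf((Z_CRIT * se - d0) / sd)


def predictive_power_uniform_prior(d0, m0, m2, a, b, n_grid=20000, sigma=SIGMA):
    """Average power over the posterior of delta given d0 under a U[a, b] prior.

    The posterior is proportional to the normal likelihood of d0 restricted to
    [a, b]; the integrals are approximated by a midpoint rule on n_grid cells.
    """
    likelihood = NormalDist(d0, sqrt(2.0 * sigma ** 2 / m0))
    width = (b - a) / n_grid
    numerator = 0.0
    denominator = 0.0
    for k in range(n_grid):
        delta = a + (k + 0.5) * width
        weight = likelihood.pdf(delta)   # unnormalised posterior density at delta
        numerator += weight * power(delta, m2)
        denominator += weight
    return numerator / denominator


# Hypothetical inputs: d0 = 0.2, m0 = 41.4, m2 = 100, prior interval [-0.1, 1.0].
flat = predictive_power_flat_prior(0.2, 41.4, 100.0)
restricted = predictive_power_uniform_prior(0.2, 41.4, 100.0, a=-0.1, b=1.0)
print(f"flat prior: {flat:.4f}, uniform prior on [-0.1, 1.0]: {restricted:.4f}")
```

With a very wide interval the numerical result should agree with the flat-prior closed form, which gives a simple check of the integration before narrower, more optimistic intervals are considered.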