Bayesian adaptive bandit-based designs using the Gittins index for multi-armed trials with normally distributed endpoints

ABSTRACT Adaptive designs for multi-armed clinical trials have become increasingly popular recently because of their potential to shorten development times and to increase patient response. However, developing response-adaptive designs that offer patient-benefit while ensuring the resulting trial provides a statistically rigorous and unbiased comparison of the different treatments included is highly challenging. In this paper, the theory of Multi-Armed Bandit Problems is used to define near optimal adaptive designs in the context of a clinical trial with a normally distributed endpoint with known variance. We report the operating characteristics (type I error, power, bias) and patient-benefit of these approaches and alternative designs using simulation studies based on an ongoing trial. These results are then compared to those recently published in the context of Bernoulli endpoints. Many limitations and advantages are similar in both cases but there are also important differences, specially with respect to type I error control. This paper proposes a simulation-based testing procedure to correct for the observed type I error inflation that bandit-based and adaptive rules can induce.


Introduction
The medical and statistical communities have long held as a 'gold standard' for clinical trials the so-called randomised controlled trial (RCT), where patients are allocated to a treatment arm with a fixed probability which is equal across all arms and for all patients. This scheme ensures the trial is well balanced, eliminates possible sources of bias, and makes the results as sound as possible. However, this design makes no concession to the wellbeing of patients in the trial: in a K-arm RCT, on average (K − 1)/K of the patients will be assigned to a treatment other than the most effective one (if it exists).
This creates one of the foremost ethical concerns inherent in any clinical trial: the conflict between learning (ensuring the selection of the best treatment) and earning (treating most patients effectively). The scientific aim of a traditional RCT is to learn about new treatments and identify the most effective one. It is inevitable under this paradigm, however, that a fixed number of patients will be given an inferior treatment. The research on adaptive methods for trial designs, such as response-adaptive randomisation methods, has developed as a response to this ethical dilemma, seeking to improve the earning resulting from a trial while preserving its learning. The challenge is to find response-adaptive methods which improve patient welfare during the trial, but do not allow extreme imbalance or bias to hinder the statistical validity of the trial, and are conclusive enough truly to influence future medical practice.
The need to consider patients' wellbeing during the trial is particularly acute in the case of a treatment for a rare disease. In this situation the trial patients represent a high proportion of all those with the disease, and a trial aiming solely to identify the most effective treatment will benefit only the small number of patients remaining to be treated after the end of the trial. The ethical concerns with randomising patients onto an inferior treatment are most severe in the case of a serious or life-threatening disease. Thus, the motivation for an adaptive trial design is arguably strongest in the case of life-threatening rare diseases such as the new types of rare cancers identified by the advances of genetics. However, the challenges of maintaining statistical rigour are even more acute when recruitable patients are sparse and sample sizes are small.
The majority of the response-adaptive randomisation methods proposed in the literature use Bayesian learning and a binary endpoint, with information on the effectiveness of the treatments gained throughout the trial deployed immediately, to increase the chances of patients in the trial receiving a better performing treatment (see e.g. [22]). A limitation of these approaches is that they are myopic (they only make use of past information to alter treatment allocation probabilities) and hence they are not influenced at all by the number of patients that remain to be treated in the trial (nor by the expected number of patients outside the trial). An approach recently proposed and modified for addressing this limitation and developing 'forward looking algorithms' is to consider clinical trial design within the framework of the Multi-Armed Bandit Problem (MABP). The optimal solution to the classic MABP has been known since the 1970s [8], and those responsible for its solution saw clinical trials as the 'chief practical motivation' for their work [9, p. 561]; despite this, it has never been applied to a real life clinical trial. Villar et al. [23, pp. 2-3] In Villar et al. [23] some of the benefits and limitations of applying the MABP solution to clinical trials are explored, considering in particular the case where the trial's primary endpoint is dichotomous (i.e. the treatment arms are modelled as reward processes by Bernoulli random variables). The objective of the paper is two-fold. The first is to apply some of the considerations and techniques of Villar et al. [23] to define a response-adaptive Bayesian design for a clinical trial whose primary endpoint is normally distributed with known variance, a case that has been less commonly studied in the response-adaptive literature. Specifically, we investigate whether the same conclusions in terms of patient-benefit and operating characteristics hold as in the case of trials with binary endpoints and, since many trials do have normally distributed endpoints, in this way we hope further to bridge the gap between MABP theory and clinical trial practice. The second objective is to identify and address issues that may limit the use in practice of the MABP-based designs considered in this paper. Specifically, we consider in detail the level of bias and type I error rates observed under this setting and further suggest appropriate procedures to control them.
Results are illustrated by simulations in the context of a currently ongoing clinical trial: TAILoR trial, described in Wason et al. [25].
The structure of this paper is as follows: In Section 2 an overview of the general MABP with a continuous state variable and its solution for the special case of a normally distributed reward is provided together with an adaptive patient allocation rule based on it. Then, Section 3 presents some simulations of two-armed and multi-armed trials implementing bandit strategies for normally distributed endpoints and comparing them to alternative trial designs. Section 4 concludes with a discussion of our findings and lines of further research.

The classic Bayesian MABP with a continuous state variable and known variance
Let K ∈ N and consider a collection {X k,t : k = 0, 1, . . . , K, t = 0, 1, 2, . . .} of independent (real-valued) random variables, where for each fixed k the distributions of X k,0 , X k,1 , X k,2 , . . . are identical and parametrised by some unknown θ k ∈ R p . At each time t = 0, 1, 2, . . . , T − 1 we obtain a reward by choosing some distribution k ∈ {0, . . . , K} and sampling from X k,t . In the context of a clinical trial, this corresponds to choosing the treatment allocation of the tth patient, and X k,t corresponds to the endpoint observation for patient t on treatment k. In order to incorporate the adaptive learning element into the model, we take a Bayesian viewpoint and assume k is a random variable taking the value θ k . We assign k a prior distribution π (0) k , which is assumed to be a density function with respect to Lebesgue measure. By Bayes' Theorem, the posterior density of k , having observed values x k,i 1 , . . . x k,i n in n independent samples from X k,i 1 , . . . , X k,i n after having treated t patients, is where f k (·|θ) is the density of X k,t (with respect to Lebesgue measure) [14]. Note that we have used the subscript i j (for j = 1, . . . , n) to emphasise that the sample of n observations from distribution k is a subset of the total number of sampling observations possible at time t. Formally, the classic Bayesian MABP within this general setting is defined by formulating a Markov decision process as follows. Consider each distribution (or arm) k as a Markov process B k with a Borel state space (E k , E k ), by taking the state ξ k (t) ∈ E k of B k at time t to be the valuex of some chosen sufficient statisticX k for the posterior density of k , and updating the state every time we sample from this arm. At each time t = 0, . . . , T − 1 a decision variable a k,t ∈ {0, 1} is chosen for process B k for each 0 ≤ k ≤ K, such that exactly one arm receives action 1 (is sampled) and all others receive action 0 (their posterior density remains frozen). If a k,t = 0 then B k is frozen (and so is its associated value for the sufficient statistics), thus ξ k (t + 1) = ξ k (t) with probability 1. If a k,t = 1 then ξ k evolves according to a Markovian transition kernel P k , i.e. for any A ∈ E k andx 0 ,x 1 , . . . ,x t−1 ,x ∈ E k we have The transition kernel P k is a density p k (with respect to Lebesgue measure on R): wherex y denotes the updated value ofX k if y is the next value sampled. If the process B k is sampled at time t we earn the random reward In the classic MABP this function is given by R k (x,x y) = y, that is, the value of R k is the value taken by X k,t . We define r k : E k → R as the expected reward from the process in a given state [18], given by Let E = E 0 × · · · × E K be the joint state space of the MABP, and ξ(t) = (ξ 0 (t), . . . , ξ K (t)) ∈ E the joint state vector of the MABP at time t. Let be the set of all feasible sampling policies, that is, those in which the decision at time t depends only on past information and only sample one arm (or distribution) at a time. Writing a π k,t for the sequence of sampling decisions chosen by policy π, the value function for the classic MABP with a continuous state variable is Thus, the MABP is the problem of finding a policy π ∈ which maximises the value of the expected total discounted reward of the sampling process. Notice that d is a discount factor (i.e. 0 ≤ d < 1) introduced for reasons of tractability, so that the infinite horizon problem (T = ∞) can be considered. One approach to solve the MABP in (3) would be via the dynamic programming equation Standard theory on Markov processes ensures that there is an optimal solution to (3), and approximations to it may be obtained using value iteration on (4) [7], but such an approach is computationally expensive, exploding with the truncation horizon T even for a small number of arms K > 3 [23]. For the infinite horizon MABP Gittins and Jones [8] provided a theorem by which there exists a function ν = ν(B k ,x k ) such that at any time the optimal strategy is to sample the process which has the highest value of ν. There is a clear computational advantage to this approach: if we can compute a grid of values of ν for each bandit process, then the policy can be followed any number of times by looking up values of ν for each process at each decision time. Several proofs of the Index Theorem are given in Gittins et al. [7]. Gittins and Jones referred to ν as a dynamic allocation index, but this is now known widely as the Gittins index.
In the case where the MABP state space is discrete, as in the Bernoulli case, values of ν can be looked up from a matrix. With a continuous state space it is less clear that all the necessary calculations can be performed in advance. However, in most useful cases, including the normally distributed case, the function ν(B k ,x k ) is a linear function of B k andx k under some discrete boundary conditions; the discrete part can thus be calculated in advance as a matrix of values [7].
For the finite horizon problem, which is the relevant case in the clinical trial context, [23] suggested an index-based solution to the finite horizon MABP based on the Whittle index [26]. However, the Whittle index is omitted from the studies in this paper since in most trials its performance was near identical to that of the Gittins index (calibrated through the choice of the discount factor d) which further has a lower computational cost.

The Gittins index as a Bayesian adaptive patient allocation rule for the normally distributed endpoint (with known variance)
In this paper we consider clinical trials for which the endpoint of each treatment arm k is assumed to be normally distributed with unknown mean μ k and known variance σ 2 k . We therefore consider the rewards from arm k to be an independent identically distributed (iid) sequence X k,t iid ∼ N(μ k ; σ 2 k ), and μ k is given a prior distribution π (0) k , which we will take to be the improper uniform distribution on the whole real line. The uniform prior distribution assumption will allow us to isolate the effect on patient welfare and other relevant statistical properties of the MAB adaptive design alone, that is, without the use of prior (historical) data. Let f (·|μ; σ 2 ) denote the density of a N(μ; σ 2 ) distribution. If we have observed n independent samples x k,i 1 , . . . , x k,i n from X k,i 1 , . . . , X k,i n , then, writinḡ x k,n = (1/n) n j=1 x k,i j for the sample mean, by Bayes' Theorem the posterior density of μ k at time t is π (t) k (μ k |x k,i 1 , . . . , x k,i n ) ∼ N(x k,n ; σ 2 k /n). A sufficient statistic for the posterior distribution of μ k is (x k,n , n), thus the state vector of process B k in this case will be (x, n) where n is the number of observations so far sampled from arm k after having treated t patients, andx is the mean of these observations. As explained in Gittins et al. [7] for the MABP with normally distributed rewards with known variance, the indices ν(x, n; σ 2 k , d) can be written as follows: Therefore, to implement the Gittins index policy at very low computational cost it suffices to calculate in advance the values of ν(0, n; 1, d). This can be done to a good accuracy using value iteration on (4) in the case of the two-armed bandit calibration setup. Details are given in do Amaral [1, pp. 131-162] and Gittins et al. [7,Chapters 7 and 8]. Computational results for this case were first computed in Jones [12]. The values of the indices ν(0, n; 1, d) used in this paper have been interpolated from the tables printed in Gittins et al. [7, pp. 261-262]. Figure 1 shows the values of the indices ν(0, n; 1, d) for a range of discount factors d. In Gittins and Wang [10], the learning component of the index is defined as the difference between the index value and the expected immediate reward, which for this MABP corresponds to the reward from sampling an arm with posterior meanx from previous samples, that is, simplyx. Therefore, σ k ν(0, n; 1, d) can be interpreted as a measure of the learning reward associated with continuing sampling an arm which has already been sampled n times. Figure 1 illustrates clearly that ν(0, n; 1, d) increases with d, since a larger discount factor puts greater value on future rewards and increases the value of learning. However, for any choice of d, the value of learning drops very quickly as n increases; in the limit as n tends to infinity, the value of learning tends to 0 and the sample mean converges by the Law of Large Numbers, so the index tends to the true value of the parameter μ k .
The Gittins index solution for the case of both μ k and σ k being unknown exists and is similar to that in (5). The difference is that the model requires a joint prior distribution on both parameters and the known variances in (5) are replaced by sample variances.

Some considerations specific to the use of bandit strategies in a clinical trial context
In Villar et al. [23] simulation results comparing a number of alternative patient allocation rules to index-based solutions for trial scenarios with dichotomous endpoints were provided. The authors conclude that, alongside the clear advantages, there are a number of limitations to the use of the Gittins index as an allocation mechanism for clinical trials. Some of these disadvantages are still going to be an issue in the normally distributed case. The endpoint needs to be immediately observable so that index rules can be applied. This means that a patient in the trial cannot be treated until all previous outcomes have been observed. This is a strong limitation that affects all adaptive designs in general and not only MAB-led designs. In practice, this limits the speed at which new patients can be recruited to the trial; however, this may be less problematic in a rare disease context, where the rate of patient recruitment is likely to be slow already. Applying the adaptive algorithms in batches of patients rather than patient after patient is a way of acknowledging and partially addressing this issue [17,24].
Another limitation still present is that the allocation of treatments in a Gittins indexbased design is highly deterministic, which can lead to the introduction of different sources of bias. As explained in Atkinson and Biswas [2], randomisation prevents the trial results from being influenced by 'secular trends in the population's health and our ability to measure it, in the quality of recruits to the trial and in the virulence of a disease' . In trials where the clinician can influence which patient receives the next treatment, so-called selection bias (the ability of the experimenter to predict which treatment will be allocated next) can influence the results. These extrinsic bias effects are absent from the simulations next presented and from those in Villar et al. [23], but could have a significant impact when deterministic rules are used on trials with real populations. Recent work [24] addresses this particular limitation proposing a simple modification of the Gittins index rule for the Bernoulli case that is randomised. Notice that the lack of randomisation of the resulting patient allocations is a limitation shared with most bandit-based algorithms, even those that introduce random terms in their definitions as, for example, [3] or [11].
For other limitations and also for the patient-benefit advantages of index-based designs reported in Villar et al. [23], the magnitude or even their existence requires careful consideration. This is the case for the possibility of introducing intrinsic sources of bias. Response-adaptive trials in general can result in biased estimate of a treatment's outcomes. For example, in a two-arm trial scenario [23] found that the use of the Gittins index introduced a significant negative bias in the estimate of treatment outcomes; the magnitude of the bias is greatest for inferior treatments (since they are more likely to be dropped early in the trial) and the treatment effects are likely to be overestimated. Similar considerations apply with respect to the resulting rates of type I error (a false positive result, i.e. incorrectly rejecting the null hypothesis H 0 ) and of type II error (failing to detect that an experimental treatment is effective, i.e. incorrectly accepting H 0 ). Villar et al. [23] reported that the index-based designs achieved a level of statistical power that was far below the level of an RCT with the same number of patients T and also that control of the type I error rate required adjusting the statistical test to correct for its conservativeness (i.e. moderate deflation).
An important contribution of this paper is to assess the extent to which further considerations different from the ones mentioned above apply to the normally distributed endpoint. In particular, assessing how important the bias and statistical error levels are in the normally distributed case, and suggesting how to control for the type I error rate at a desired level, are two of the main contributions of this work.

Simulation studies
In this section we evaluate the performance of a range of patient allocation rules in a clinical trial context, including bandit-based solutions using the Gittins index. As a case study for simulations we shall use a generalisation of the currently ongoing TelmisArtan and InsuLin Resistance in HIV trial (TAILoR trial), which is described and also used as a case study in Wason et al. [25]. See also [15] for discussion of the design of the TAILoR trial.
The TAILoR trial is a one-sided test of K experimental treatments against a control treatment (i.e. testing for superiority). Treatment k is assumed to have endpoint outcomes X k,t iid ∼ N(μ k ; σ 2 ), for k = 0, 1, . . . , K (where k = 0 is the control treatment), and σ 2 is known and common to all treatments. Setting δ k = μ k − μ 0 , the global null hypothesis is H 0,G : δ 1 , . . . , δ K ≤ 0 and the alternative hypotheses are H 1,k : We focus on the following: statistical power (1 − β); type I error rate (α); expected proportion of patients in the trial assigned to the best treatment (Ep * ); the Expected Outcome (EO) defined as the mean patient outcome across the trial realisations; and, for the twoarm case, bias in the maximum likelihood estimate of treatment effect associated with each decision rule.
For testing these hypotheses we shall use the following test statistics: where n k is the number of sample observations taken from arm k andX k is the sample mean of arm k. Under the assumption that the n k 's are independent and identically distributed samples, these k test statistics will follow a normal distribution with mean δ k /(σ √ 1/n k + 1/n 0 ) and variance 1. In the case of a two-arm trial with one experimental treatment to be tested against a control, this simplifies to the case of a standard z-test using a univariate normal distribution. For the multi-armed case we will consider the joint distribution of Z 1 , . . . , Z k and use a critical value C α that controls the Family- For each scenario we set the size of the trial T to ensure that an RCT with equal randomisation achieves a specified power (1 − β) to detect a specified effective treatment difference δ (1) between each arm and the control, while controlling the FWER within α. Because we are interested in the marginal type II error rate in a single test, rather than a family-wise error rate, we consider the marginal distributions rather than a joint distribution to determine a required sample size per arm for an RCT. Following this rationale it can be computed that the total required size of the RCT trial (i.e. across all arms) is where z β is the 100(1 − β)th-percentile of a standard N(0, 1) distribution. See Appendix 1 for details of how C α is determined and, for example, [28] for a review of sample size calculation in RCTs. Following [25], we shall assume the variance in the outcomes is σ 2 = 1, and specify the treatment difference to be detected as δ (1) = 0.545 (chosen such that the probability of a patient given a treatment k with δ k = δ (1) having a better outcome than a patient on the control treatment is 0.65). We will consider the usual error rates of α = 0.05 and β = 0.10.
In every scenario we consider the following patient allocation procedures: • Fixed Randomised (FR): For each patient, treatments are allocated randomly with fixed probability 1/(K + 1) across all treatments; • Thompson Sampling (TS) [21]: For each patient, treatments are allocated randomly, where the probability π k,t of allocating treatment k to patient t is proportional to the posterior probability that treatment k is the best, i.e.
where c is a tuning parameter defined as t 2T introduced to stabilise the resulting allocation probabilities [20]. The probabilities in the fraction are estimated by simulation at each t; is allocated the treatment k with the highest value of the indexx k + σ 2lnt/n k , as proposed in [3,13].
is allocated the treatment k with the highest value of the indexx k + σ 2(lnt + 3(ln lnt))/n k . This variant of UCB was shown in [6] to have improved asymptotic regret bounds compared to UCB. • Current Belief (CB): The next patient is allocated the treatment with the highest posterior meanx k . • Gittins Index (GI): The next patient is allocated the treatment with the highest value of the Gittins Index ν(x k , n k ; σ 2 , d), where d is the value of the discount factor;. • Randomised Gittins Index (RGI): as first suggested in Glazebrook [11], the next patient is allocated the treatment with the highest value of the semi-randomised index where Y t is a random variable sampled from the exponential distribution with mean 1/(K + 1) (this choice of randomisation element is the same as that used by Villar et al. [23]). • Randomised Belief Index (RBI): As first suggested in Bather [4], the next patient is allocated the treatment with the highest value of the semi-randomised indexx k + ( where Y t is a random variable sampled from the exponential distribution with mean 1/(K + 1). For the multi-armed scenarios we have additionally considered the following rules: • Trippa et al. Procedure (TP): For each patient, treatments are allocated randomly, where the probability π k,t of allocating treatment k to patient t is defined by in these simulations we have considered γ t = 3(t/T) 1.75 and η t = 0.25(t/T), as in Trippa et al. [22]. Note that this procedure is only considered for multi-arm trials because by design its allocation to the control arm will closely follow that of the best experimental arm.
• Controlled Gittins (CG): Each patient is randomly assigned the control treatment with (fixed) probability 1/K. If the patient is not randomly assigned to the control group in this way, then she is assigned to the treatment with the greatest Gittins index. Although CG deviates from the optimality of GI, it was still found in Villar et al. [23] to offer a significant improvement in patient welfare over FR; moreover, it largely combatted the issue of reduced power. In fact, when there existed a clear superior treatment among the K arms it was found to achieve even higher power than FR. • Controlled UCB (CUC): A variant of UCB with the control allocation protected as in CG above.
In all scenarios we also include 'batched' versions of the Bayesian rules in which the allocation probabilities are updated after a block of b patients are treated instead of after every patient. This idea was implemented in [17,24] as a means of overcoming the practical limitations imposed by the assumption of immediate outcome observability of these algorithms. Their inclusion is intended to more closely replicate the constraints of a real life clinical trial without fully sequential design. Specifically, we consider: In every scenario considered and for every procedure we assumed that the prior for the μ k parameters is the improper uniform distribution on the whole real line. Notice that a fully Bayesian approach to the design could make use of historical data existing before the trial through appropriate choice of the prior distribution. In this paper we have chosen to use an uninformative prior to make results comparable to the case study in [25] and to isolate the effects of the adaptive designs in the different performance measures.
For the rules that are based on the Gittins index values there is an obvious ethical concern around the choice of a discounting factor when calculating the indices: clearly current and future patients' wellbeing should be valued equally. So for scenarios with large sample sizes we will take d close to 1, usually d = 0.995. In the case of a rare disease, if it is known that not more than N patients will ever be treated, the value of d could be chosen so that d N ≈ 0, to ensure that the possibility of treating patients beyond N has little impact on current decisions. By doing this the current estimation of the patient population could be used to indirectly affect the choice of trial design. In all trial designs and in all simulations, ties among index values are broken randomly.

Two-arm trial
We first simulate the TAILoR trial with one experimental arm to be compared with a control treatment, that is, K = 1. The trial is implemented under H 0 with δ 1 = μ 0 = μ 1 = 0 and under H 1 with μ 0 = 0, μ 1 = 0.545 (i.e. δ 1 = 0.545). In both scenarios the common variance is σ 2 = 1 and d = 0.995. The sample size considered is of T = 116 patients. This size ensures a 5% type I error rate (using C α = 1.645 as a critical value) and(1 − β) = 90% power to detect a difference of δ (1) = 0.545 through a FR design.

Type I error control for adaptive designs
For the adaptive allocation mechanisms, including the Gittins index-based, the use of a critical value of C α = 1.645 is found in simulations to generate a type I error rate inflated above 5%. This is in stark difference to the type I error deflation reported in [23,24]. We will next explain this phenomenon in detail in terms of the GI rule but a similar logic applies for other adaptive rules.
In each realisation of the trial, if one arm performs badly early on and is dropped (or allocated with a very low probability), then the sample mean from this arm will not have a chance to regress upwards to the mean (or do so more slowly), being therefore negatively biased. Figure 2 illustrates this in a typical GI trial run under H 0 , displaying the posterior meanx k,t of the outcomes for each treatment arm k after t patients have been allocated that arm. In this example, the control arm k = 0 performs badly early on in the trial, so is dropped with just n 0 = 19 patients, leaving the trial's estimate of this treatment's outcomes negatively biased. The experimental arm performs better early on, so is continued and regresses to its mean; thus the trial's final estimate of this treatment's effect is close to the true value of 0. The result is that the test statistic takes the value 1.81 > 1.645, so a hypothesis test using the normal cut-off value of 1.645 would generate a type I error, incorrectly concluding the superiority of the experimental arm.
In order to choose a more suitable critical value for the hypothesis tests when using adaptive designs, we estimate the distribution of the test statistic Z under each trial design by a Monte Carlo simulation with 10 4 repeats of the trial under H 0 . Figure 3 shows the observed empirical distributions of Z for the GI trials, implemented under H 0 and under H 1 . In each case, as well as a histogram of the observed empirical distribution, also displayed is a curve of the standard normal distribution which the test statistic is expected to follow in a FR trial, for comparison.
In Figure 3(a) we see that, in the GI trial under H 0 , the distribution of the test statistic is starkly different from a normal distribution. The sample standard deviation of 1.37 is  much greater than the standard deviation of 1.00 in the FR case, and the heavier tails than a normal distribution correspond to an inflated type I error rate when hypothesis testing is carried out with the normal critical value of 1.645. Notice that because both left and right tails are heavier than the normal tails, the inflation of type I error rate when testing hypotheses at the normal cut-off value would be even greater in a two-tailed test. The empirical cumulative distribution function evaluated at 1.645 isF GI,H 0 (1.645) = 0.89, indicating that we might expect a type I error rate of 11% if hypothesis testing was carried out with this critical value. Instead, the empirical 95th-percentile of the distribution is C 0.05 =F −1 GI,H 0 (0.95) = 1.951, marked on the histogram by a vertical dotted line. We will therefore use this as the critical value for hypothesis testing in the GI trials to control the type I error rate within 5%.
Notice that the two peaks in the frequency density arise from the two situations in which the estimate of one arm's outcomes is negatively biased, and the other is unbiased: the right-hand peak corresponds to an incorrect conclusion that the experimental treatment is superior to the control treatment (the situation illustrated in Figure 2), and the left-hand peak corresponds to an incorrect conclusion of the opposite. Figure 3(b) illustrates that if the GI trial is implemented under H 1 the bimodality of the distribution of the test statistic is greatly reduced, but still present to some extent. F GI,H 1 (1.951) = 0.77, i.e. 77% of the distribution still lies to the left of the empirical critical value of 1.951, marked by a vertical dotted line; thus we expect to observe greatly reduced power of around 23% in the GI trials. The (small) left-hand peak has a weight ofF GI,H 1 (−0.5) = 5%, indicating that in 5% of trials the superior arm is dropped early on due to poor initial performance, and the trial has ended up favouring the wrong arm.
Following the same procedure, 95th-percentiles of the test statistic distribution are estimated for the other adaptive trial designs. Histograms for the distributions of the test statistics in the other trial designs are displayed in Appendix II in Figure A1. Notably, TS, RGI, UCB and KLU are the only ones of the adaptive designs which appear unimodal in both scenarios. The unimodality of TS, RGI, UCB and KLU under H 1 (Figure A1(b)) indicates that, in almost all realisations of these trials under H 1 , the trial is correctly favouring the superior experimental arm by the end of the trial.

Results and discussion
We now present results of 10 4 repetitions of each trial design using the estimated values described before as a priori critical values. The results of the simulation are displayed in Table 1. Ep * under H 0 is computed as the proportion of patients receiving the control treatment, and under H 1 as the proportion of patients receiving the experimental treatment. The (s.d) values are the standard deviations associated with each measurement. The Upper Bound (UB) row displays a theoretical optimum for each measurement based on a design which assigns every patient to the best treatment (i.e. p * = 1) in every trial.
All the adaptive rules achieve better patient welfare than the FR design under H 1 . In this scenario RBI, RGI, UCB and GI all perform similarly well in patient welfare, with EO values between 0.47 and 0.48, the closest values to the theoretical UB of 0.545. Note that this contrasts with the findings in Villar et al. [23] for the Bernoulli case, where GI was found to achieve much better patient welfare than either of the semi-randomised designs. However, the results for these rules are in line with their poorer performance in terms of power when compared to the findings in Villar et al. [23]. The TS trial is outperformed by the other adaptive designs in terms of patient welfare; this is explained by the tuning parameter c in the TS mechanism which stabilises the randomisation probabilities.
The high standard deviations in p * for all the adaptive designs under H 0 indicate that p * has a broad distribution across the realisations of the trial, so the trials are not consistent and are frequently unbalanced. The standard deviation of 0.48 for CB is close to the limiting case where, in each trial, p * ∼ Bernoulli( 1 2 ), that is, all patients within a trial are assigned to the same treatment, which would give a standard deviation of 1 2 (1 − 1 2 ) = 0.50 in p * across the trials. This indicates that most trials under CB (and to some extent also GI Table 1. Comparison in 10 4 trial replicates of operating characteristics of different two-arm trial designs of size T = 116, under both hypotheses. and RBI) are highly unbalanced, with one arm being dropped early on and most patients receiving the same treatment). Under H 1 , the high standard deviation in p * under GI arises from the bimodality observed in Figure 3(b): in a small proportion of realisations of the trial, the control arm is incorrectly favoured and p * < 1 2 . The lower standard deviation in p * for RGI confirms that RGI trials are more consistent in correctly favouring the superior treatment arm. As expected, the GI trial has greatly reduced statistical power (just 24%) compared to the value of 90% achieved by the FR trial. Reduced power is also evident in the other trial designs (c.f. Figure A1); CB has the lowest power (17%).
Note that UCB outperforms KLU in patient welfare, but KLU offers significantly higher power (78%) than UCB (56%). Interestingly, despite the improved regret bounds for KLU proved in Cappé et al. [6] KLU only begins to dominate UCB under both power and patient-benefit when the number of patients is very large. Nevertheless, KLU seems to achieve the best compromise between patient welfare and statistical inference out of other modifications to the UCB algorithm designed to improve regret bounds reviewed for this paper, with power of 78% and EO of 0.45 (only slightly below the 0.48 achieved by GI). The low standard deviation of 0.10 in expected outcome indicates that the welfare benefit is more consistent than in the GI trial.
The results for the batched TS (TSB) illustrate the effects of a blocked implementation of the algorithm to deal with a moderate delay: a marginal increase in power and a considerable decrease of the patient welfare benefits. However, the patient-benefit advantages of TSB over FR are considerably large even if assuming a moderate delay in patient recruitment. Figure 4 shows the mean (across the trial realisations) of the bias (x (t) k − μ k ) in the estimated outcome of each treatment after a total of t patients have been treated across both arms in the trial, under each scenario. Figure 4(a,b) shows the GI design introducing a negative bias into estimates of both treatment's effects; within each trial realisation this bias will be restricted to one of the two arms, corresponding to the two modes of the test statistic distribution in Figure 3(a). In all scenarios, the deterministic designs GI and CB exhibit larger bias than the semi-randomised designs RBI and RGI.

Four-arm trial scenario
This scenario uses the TAILoR trial but now considers K = 3 experimental treatments to be compared with a control treatment. To achieve a type I error rate of 5%, the critical value is C α = 2.0621 for the FR trials. Once again we take σ 2 = 1, and we assume a trial size is of T = 302 patients since this is the total required trial size for FR to achieve (1 − β) = 90% power to detect a difference of δ (1) = 0.545 in treatment outcome. The trial is implemented under H 0 with μ 0 = μ 1 = μ 2 = μ 3 = 0 and under H 1 with μ 0 = 0, μ 1 = μ 2 = 0.178, μ 3 = 0.545. These values are chosen to give the Least Favourable Configuration (LFC) for the trial, with μ 1 = μ 2 = δ (0) and μ 3 = δ (1) , where, as [25] explain: 'δ (1) is a prespecified clinically relevant effect, and δ (0) is some threshold below which a treatment is considered uninteresting. The configuration is called least favourable as it minimises the probability of recommending a treatment with effect greater than or equal to δ (1) amongst all configurations where at least one treatment has a treatment effect of δ (1) or higher and no treatment effects lie in the interval (δ (0) , δ (1) ).' Following [25], δ (0) = 0.178 is chosen so that the probability of a patient on a treatment with this treatment effect achieving a better outcome than a patient on the control treatment is 0.55 and the corresponding probability for δ (1) = 0.545 is 0.65.
We will compare all the trial designs, including now the Controlled Gittins (CG) design, in which each patient is allocated the control treatment with probability 1/(K + 1) = 0.25, and otherwise allocated the drug with the highest value of the Gittins Index. We compare CG design against similar procedures: the Trippa Procedure (TP) and Controlled UCB (CUC) designs. We also include the Batched Trippa Procedure (TPB) to assess the effects of delays in outcome observability. The Gittins Indices used are again based on discount factor d = 0.995. To calculate critical values for the trial designs other than FR, Monte Carlo simulations were run as explained in Section 3.1.1. Critical values are found by calculating the empirical 95th-percentile of the distribution of Z max := max j=1,2,3 Z j , in order to control the FWER. Trial simulations are then run using the computed quantiles as critical values; for each design the trial is run 10 4 times. Results are displayed in Table 2.  As in the two-arm scenario, all the adaptive rules outperform the FR design under H 1 in terms of patient welfare, although TP only improves marginally over FR in this case. The greatest EO values are achieved by RGI, RBI, UCB and GI, but these designs and CB exhibit a greatly reduced power level compared with FR, rendering them less useful as trial designs from a frequentist point of view. In particular, CB, which is essentially the simplest myopic approach, exhibits the worst performance in terms of power and variability. As in the two-arm trial, KLU achieves considerably greater power than UCB and the welfare benefit is only slightly reduced, offering a very good compromise between the two conflictive objectives.
As in the two-armed case TSB (90%) achieves marginally higher power than TS (88%) in return for slightly lower patient welfare (EO of 0.33 compared to 0.34). Conversely, TPB results in a slightly reduced power than TP while the patient wlefare is practically identical. In both cases, the difference caused by the moderate 'batching' of patients' outcomes is small, indicating that these adaptive designs could offer patient-benefit advantages even if applied without a fully sequential design.
As expected, the family of 'protected control' designs: TP, CG and CUC, offer a compromise between learning (power) and earning (patient welfare). Whilst CG does not perform as well as GI in patient welfare, its EO value of 0.3392 is still a significant improvement on FR's 0.2252, and the 87% power attained by CG is greater than that of many of the other adaptive designs, and only marginally lower than FR's 90%. CUC compares similarly to UCB and dominates over TP by offering a significantly increased patient welfare with a slight increase in power over TP. Just as found in Villar et al. [23] for the Bernoulli case, fixing the control allocation in this way is a simple heuristic modification of adaptive allocation rules that results in good trial designs in terms of both patient welfare and frequentist operating characteristics. Figure 5 shows the bias in the estimates of treatment outcomes in this scenario, for the control treatment and the best experimental treatment (arm 3). For the designs which were included in the two-arm simulation, the results here are similar. CG significantly lowers the bias in the estimates of control treatment outcomes, but it does not improve the issue of negatively biased estimates of unselected experimental treatment outcomes, where it performs almost identically to the original GI.

Four-arm rare disease trial scenario
The final simulation scenario is the same as in Section 3.2 but with the trial size reduced to T = 64 to imitate a rare disease setting where the number of patients who can be recruited is limited. Notice that for T = 64 the FR trial will achieve a power of 30% while controlling the FWER within 5%. The same critical values are used for hypothesis testing as in the large trials in Section 3.2, based on the assumption that (especially in a trial where patients are recruited sequentially) the experimenter might not know at the start of the trial the total number of patients she will be able to recruit, so more appropriate critical values cannot be estimated a priori. Based on the same reasoning we continue to use the original choice of d = 0.995. Table 3 shows the full results of the simulations. Due to the greatly reduced sample sizes, all designs now achieve much lower power, a common situation in drug development for rare diseases. In a situation where N T, statistical power is important, and CUC and CG  offer the best compromise. Both perform similarly well, achieving higher power than FR, and offering a marked improvement in patient welfare compared with FR. However, if the trial subjects comprise most of the total population to be treated (T/N ≈ 1), then GI and RBI provide the best patient outcome throughout the trial. The results in the table for the batched approaches show that, as expected, as the delay in recruitment is more severe the advantages of TSB and TPB over FR are significantly reduced (though both designs still offer important patient welfare advantages). Noticeably, the effect on power and patient welfare of a severe delay in the controlled version (i.e. TPB) differs to that of the uncontrolled variant (TSB). The controlled version has its power levels reduced as the delay increases (while the opposite happens to TSB). TP improves power over FR by matching the allocation of the control arm to that of the best performing arm, therefore increasing the allocation to these two arms over the other arms. With a larger delay TBP will allocate larger number of patients to all arms which therefore reduces its marginal power levels compared to TP. For TSB the power improvement is explained because the design cannot skew allocation to the best arm as fast as with TS, thus allocating more patients to all arms when compared to TS.
One distinctive feature of the results is that the Type I error rate α in the UCB trial is lower than the expected 5%, at just 4.4%. As explained above, the same critical values for hypothesis testing have been used as in Section 3.2, since the experimenter might not known in advance the total number of patients to be recruited. Figure 6 shows how the appropriate critical value C 0.05 for hypothesis testing with a Type I error rate α = 5% varies according to the size T of the trial. For most trial designs, there is little variation in C 0.05 as T increases. However, for the UCB trial, C 0.05 increases significantly with T; as a result, the appropriate critical value to ensure a 5% Type I error rate is lower for the smaller 64 person trial, at C 0.05 = 2.10, compared to C 0.05 = 2.22 for the 302 person trial. Therefore, the 64 person trial conducted at the higher critical value of 2.22 generates a low Type I error rate, and the power is even lower than it could be if the test was relaxed by lowering the critical value to 2.10. As a result, the UCB mechanism may be unsuitable for trials where the total number of patients to be recruited is not known in advance. This effect is less pronounced in the KLU variant making it more suitable in that case.
Since the trial size was much smaller than expected, there is a motivation to consider if using a smaller value for d would affect results, as a smaller discounting factor corresponds to putting less value on learning for the future. Note that, when varying the discount factor, we might expect the distribution of the test statistic Z to change, and so critical values for the hypothesis tests would have to be recalculated for each discount factor for the Gittins index designs, via a Monte Carlo simulation as in Section 3.2. In simulations not included here we found that for this trial setting in all of GI, RGI and CG there is no significant variation in patient outcome between discount factors in {0.9, 0.95, 0.995, 0.99}.

Conclusions and discussion
The simulation results provided by this paper illustrate how the index-based responseadaptive design derived from the MABP can lead to significant improvements in patient welfare also with a normally distributed endpoint. In all situations, designs based on the Gittins index achieved the largest patient welfare gain over FR trials or myopic designs currently in use in drug development such as TP. However, there are a number of limitations to the effectiveness of the purely deterministic Gittins index design that still prevail. As in the binary case, the Gittins index rule exhibits considerably lower power than FR, and whilst the loss of power can be alleviated to some extent by the introduction of random perturbations to the indices (RGI), in the two-arm trial the power achieved is still not sufficient for most clinical trials unless the exploration term is correctly calibrated.
In a multi-armed case, the patient welfare advantages of adaptive designs, and GI-based particularly, over FR are the largest. Moreover, there are adaptive designs that can offer more power than FR together with a patient-benefit advantage, making them suitable for drug development for common conditions. In the four-arm case based on a real trial we studied, a small deviation from optimality by protecting the allocation of the control treatment (CG and CUC) offers a power close to (or even above) FR's while still providing considerable patient-benefit. In contexts where power is relatively less important (if there are very few disease sufferers outside the trial), GI, RGI or UCB offer even better patient welfare at the expense of a power reduction.
There are designs that increase power levels of the UCB algorithm by introducing modifications to improve its asymptotic regret bounds, as shown for KLU in Cappé et al. [6]. However, such power gains require a very large number of patients in the trial to be also accompanied by similar patient welfare advantages. For example, KLU dominates over UCB under both criteria only in scenarios where trials had more than thousands of patients. For smaller (and more realistic) trial sizes, as the ones considered in this paper, UCB had better patient welfare and less power than KLU. Nevertheless, rules like KLU offer a good trade-off between the two objectives and can be suitable designs for common diseases.
An important observation drawn from the simulations provided by this paper is that the type I error deflation of the GI observed for the Bernoulli case does not hold in the normally distributed case. Actually, if no correction is introduced using a standard test will result in an important type I error inflation. In this work we have outlined a simulation-based procedure that can be used to prevent this inflation.
As pointed out in Berry [5], trying to shoehorn trials employing an adaptive design from a Bayesian viewpoint into traditional frequentist hypothesis tests may not be the most appropriate method of inference. Alongside the statistical community's faith in randomisation is a trust in frequentist inference, so this is generally used even in Bayesian trials to make the results as persuasive as possible. But, the inferential power and the potential patient-benefit from adaptive trials could be improved by applying Bayesian inference methods combined with the use of prior data. Further research could seek an appropriate method of Bayesian inference based on index-based adaptive trials, for example, by considering which arm the adaptive design is favouring most at the end of the trial, or by incorporating information derived from historical data.
None of the Bayesian allocation mechanisms considered here manages to completely eliminate the statistical bias phenomenon; further research is needed to seek an alternative mechanism or a means of accounting for the bias introduced. Moreover, they all carry a level of selection bias which, while not studied in the simulations included in this paper, could lead to much greater bias in clinical trials on a real population. Further research is needed to investigate whether significant practical problems will arise from selection bias in real trials, and whether random perturbations to the indices are sufficient to eliminate these problems. Alternatively, to overcome this limitation further research could repeat the idea introduced in Villar et al. [24] to randomise group of patients based on probabilities determined by the Gittins Indices for trials with continuous endpoints. For the procedures that protect allocation to the control arm we recommend a randomised implementation (where a patient is randomised to control or experimental arms with probabilities 1/(K + 1), 1 − 1/(K + 1) respectively and then allocated to experimental arms according to the index rule). A systematic allocation to control arm (i.e., 1 in every K+1 patient is allocated to control) while in theory is equivalent to its randomised counterpart in practice is subject to a very high degree of selection bias.
Some adaptive trials are designed to take account of covariates in the trial population (e.g. age, weight, blood pressure) which might affect the treatment response, by ensuring the allocations are balanced across the covariate factors [2]. Other trials incorporating covariate information combined with response-adaptive procedures with the aim of identifying superior treatments more quickly, mainly treatments that work better within subgroups, is an essential requirement to make personalised medicine possible. Some work has been done on incorporating covariates into the one-armed bandit problem, yet further research is needed to extend the approach to multi-armed bandits used in this work to clinical trials with biomarkers. See [16,19,27].
None of the adaptive designs considered formally accounts for the estimated population size. The index-based approaches indirectly can consider that by appropriate selection of the discount factor. However, the results in this paper suggest that the choice between implementing a traditional FR design or an adaptive design should depend on the current belief of how large the population of patients outside the trial is.
Finally, the results presented in this paper have highlighted that further analogous research is needed to extend these results and address potential specific issues to trials with other endpoints, such as continuous endpoints that are not normally distributed. by the independence of Z j and Z k . Using the fact that Var(X 0 ) = σ 2 /n 0 , cov(Z j , Z k ) = 1 + n 0 n j 1 + n 0 n k −1/2 . Error rates are lowest when the variance of the sample means is minimised, which corresponds to the trial being well balanced: in a RCT trial with fixed equal randomisation all the sample sizes are (asymptotically) equal, so we will have a good approximation for n 0 ≈ n 1 ≈ · · · ≈ n K and cov(Z j , Z k ) ≈ 1 2 . Hence, under H 0,G , δ 1 = · · · = δ K = 0 and Z = (Z 1 , . . . , where K is the K × K matrix given by So we would expect a RCT trial to control the FWER at level α by using critical value C α satisfying where φ K is the probability density function of a multi-variate normal N K (0, K ) distribution, i.e. ensuring that P max