Group sequential crossover trial designs with strong control of the familywise error rate

Abstract Crossover designs are an extremely useful tool to investigators, and group sequential methods have proven highly proficient at improving the efficiency of parallel group trials. Yet, group sequential methods and crossover designs have rarely been paired together. One possible explanation for this could be the absence of a formal proof of how to strongly control the familywise error rate in the case when multiple comparisons will be made. Here, we provide this proof, valid for any number of initial experimental treatments and any number of stages, when results are analyzed using a linear mixed model. We then establish formulae for the expected sample size and expected number of observations of such a trial, given any choice of stopping boundaries. Finally, utilizing the four-treatment, four-period TOMADO trial as an example, we demonstrate that group sequential methods in this setting could have reduced the trial's expected number of observations under the global null hypothesis by over 33%.


Introduction
The efficiency of crossover trials often makes them the best design for a clinical trial. Administering multiple treatments to patients reduces the standard error of the estimated treatment effects compared to a parallel trial design with an equal number of patients. Therefore, though restrictions to their use exist, such as a requirement for patients to begin each new treatment period in a state comparable to those completed, crossover trials are the design of choice in many settings (Jones and Kenward, 2014; Senn, 2002), resulting in them accounting for 22% of all published trials in December 2000, for example (Mills et al., 2009).
In a parallel design setting, group sequential methods are frequently utilized to improve a clinical trial's efficiency (Jennison and Turnbull, 2000). These designs incorporate interim analyses that allow for early rejection of null hypotheses (efficacy stopping) or early stopping for lack of benefit (futility stopping). This way, the expected sample size required can be reduced compared to the more classical single-stage approach. Moreover, multi-arm multi-stage designs, which allow multiple experimental treatments to share a control group, can increase efficiency even further (Parmar et al., 2014).
Group sequential methods are not frequently used in crossover trial settings, however, particularly ones with multiple experimental treatments. Hauck et al. (1997) investigated the performance of group sequential trials for average bioequivalence employing an AB/BA crossover design, and Jennison and Turnbull (2000) provided one possible analysis method for a group sequential AB/BA crossover with a normally distributed endpoint. To the best of our knowledge, no study has explored group sequential theory for crossover trials with more than one experimental treatment being compared to a shared control.
Thus, one possible explanation for the lack of group sequential crossover trials may be that there is not yet available a formal proof of how to strongly control the familywise error rate of such a trial with multiple experimental treatments, because such a proof is usually required for regulatory approval (Wason et al., 2014). In comparison to a proof for a parallel multi-arm multi-stage design (Magirr et al., 2012), proving strong control of the familywise error rate is complicated here due to difficulties associated with the covariance structure implied by mixed model analysis. As has been remarked, multiple testing corrections for mixed models are only presently available for certain specific circumstances (Bender and Lange, 2001). Extension to this setting is particularly significant, though, given the noted advantages of comparing multiple experimental treatments to a shared control, in terms of both trial management and sample size (Parmar et al., 2014).
Potential exists, given a proof, for the efficiency of crossover trial designs to be improved. In this work, we begin by providing such a proof for a linear mixed model using period and treatment as fixed effects and individuals as random effects. Following this, using the four-treatment, four-period TOMADO trial (Quinnell et al., 2014) as an example, we explore and discuss the efficiency gains that group sequential designs could bring in a crossover setting.

Notation, hypotheses, and analysis
The trial is assumed to have $D \ge 2$ treatments initially, indexed $d = 0, \ldots, D-1$. Treatments $d = 1, \ldots, D-1$ are experimental, to be compared to the control, $d = 0$. A maximum of $L$ stages are planned for the trial. At each stage, patients are allocated to each of a set of treatment sequences, which specify an order in which a patient receives treatments. The sequences used at each stage are determined by the number of treatments remaining in the trial at that stage. Without loss of generality, we will assume that if a treatment or treatments are dropped, treatment $D-1$ is dropped first, then $D-2$, and so on, because treatments can always be relabeled at each interim analysis. Then, we denote by $S_r = \{s_{ri} : i = 1, \ldots, |S_r|\}$, $r = 2, \ldots, D$, the set of sequences for patient treatment allocation when $r$ treatments remain in the trial, with each $S_r$ written in a form assuming that it is exactly treatments $d = 0, \ldots, r-1$ that remain. We further constrain each $S_r$ to contain only complete block sequences that are balanced for period. Specifically, complete block allocation requires all sequences to contain each treatment remaining in the trial exactly once, and period balance requires an equal number of patients to receive each treatment remaining in the trial in each period. These constraints allow the use of the popular Latin and Williams squares (Jones and Kenward, 2014).
A fixed group size $n$ is used for each stage of the trial and is chosen such that, at every stage, each sequence is used an equal number of times. Thus, $n$ must be divisible by the lowest common multiple of $|S_2|, \ldots, |S_D|$. Designing the trial in this manner ensures that each treatment is considered equally.
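This divisibility requirement is simple to compute. The following Python sketch (our own illustration, not part of the paper's software) assumes Williams squares are used for every $S_r$, so that $|S_r| = r$ for even $r$ and $2r$ for odd $r$:

```python
from math import lcm

def min_group_size(D):
    """Smallest valid per-stage group size n, assuming Williams squares are
    used for every S_r, so that |S_r| = r for even r and 2r for odd r (this
    choice of sequence sets is our assumption); any valid n must be a
    multiple of this lowest common multiple."""
    return lcm(*(r if r % 2 == 0 else 2 * r for r in range(2, D + 1)))
```

For $D = 4$ this gives 12, matching the divisibility requirement quoted for the TOMADO-based designs later in the paper.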
Outcome data are assumed to be normally distributed, and a linear mixed model is used for analysis, given by

$$Y = X\beta + Zb + \varepsilon,$$

where

- $Y$ is the vector of responses, containing the values of the $y_{ijkl}$: the response for individual $i$, in period $j$, on sequence $k$, in stage $l$;
- $\beta$ is the vector of fixed effects, of length $2D-1$, consisting of $\mu_0$, the mean response on treatment 0 in period 1 (an intercept term); the $\pi_j$, the fixed period effect for period $j$, with the identifiability constraint $\pi_1 = 0$; and the $\tau_{d[j,k,l]}$, the fixed direct treatment effect for an individual in period $j$, on sequence $k$, in stage $l$, with the identifiability constraint $\tau_0 = 0$;
- $X$ is the matrix linking the fixed effects to the vector of responses;
- $b$ is the vector of random effects, consisting of the $s_{ikl}$: the random effect for individual $i$, on sequence $k$, in stage $l$;
- $Z$ is the matrix linking the random effects to the vector of responses;
- $\varepsilon$ is the vector of residuals, consisting of the $\epsilon_{ijkl}$: the residual for individual $i$, in period $j$, on sequence $k$, in stage $l$.

Note that the period is reset to 1 for each new stage of the trial. That is, the first period of stage 2 is treated as period 1 rather than period $D+1$, and similarly in later stages. Thus, we have exactly $D-1$ non-zero period effects, given our restriction to complete block sequences.
Additionally, denoting by $\sigma_b^2 > 0$ and $\sigma_e^2 > 0$ the between- and within-subject variances, respectively, we take

$$\mathrm{cov}(s_{ikl}, s_{i'k'l'}) = \sigma_b^2\,\delta_{ii'}\,\delta_{kk'}\,\delta_{ll'}, \qquad \mathrm{cov}(\epsilon_{ijkl}, \epsilon_{i'j'k'l'}) = \sigma_e^2\,\delta_{ii'}\,\delta_{jj'}\,\delta_{kk'}\,\delta_{ll'},$$

where $\delta_{ij}$ is the Kronecker delta function. The incorporation of fixed effects for period and treatment only, together with our chosen covariance structure above, are the conventional choices for a crossover trial (Jones and Kenward, 2014). We test $D-1$ hypotheses. Because we are interested in testing the efficacy of the experimental treatments in comparison to the control, we consider the case of one-sided alternative hypotheses: $H_{0d}: \tau_d \le 0$; $H_{1d}: \tau_d > 0$; for $d = 1, \ldots, D-1$.
At each interim analysis, the above model is used to compute an estimate, $\hat{\beta}_l$ ($l = 1, \ldots, L$), of $\beta$ through the standard maximum likelihood estimator for a linear mixed model,

$$\hat{\beta}_l = (X^T R^{-1} X)^{-1} X^T R^{-1} Y,$$

where $R = Z\,\mathrm{cov}(b, b)Z^T + \mathrm{cov}(\varepsilon, \varepsilon)$ (Fitzmaurice et al., 2011). From this we acquire $\hat{\tau}_l = (\hat{\tau}_{1l}, \ldots, \hat{\tau}_{(D-1)l})^T$, which consists of the maximum likelihood estimates of each $\tau_d$.
Then, each $\hat{\tau}_{dl}$ is standardized to give $D-1$ test statistics $Z_{dl} = \hat{\tau}_{dl} I_{dl}^{1/2}$, $d = 1, \ldots, D-1$, with $I_{dl} = \{\mathrm{var}(\hat{\tau}_{dl})\}^{-1}$ the information level for treatment $d$ at interim analysis $l$. Since $\hat{\tau}_l$ is estimated via a normal linear model, we know that $E(Z_{dl}) = \tau_d I_{dl}^{1/2}$ (Jennison and Turnbull, 2000).
Given fixed futility boundaries, $f_{dl}$, and efficacy boundaries, $e_{dl}$, the following stopping rules are used at each analysis $l = 1, \ldots, L$, for each experimental treatment $d = 1, \ldots, D-1$ satisfying $f_{dm} \le Z_{dm} < e_{dm}$ for $m = 1, \ldots, l-1$:

- if $Z_{dl} < f_{dl}$, treatment $d$ is dropped without rejecting $H_{0d}$;
- if $f_{dl} \le Z_{dl} < e_{dl}$, the trial is continued with treatment $d$ still present; and
- if $e_{dl} \le Z_{dl}$, treatment $d$ is dropped and $H_{0d}$ rejected.
The control treatment, $d = 0$, remains present at every undertaken stage, and we only proceed to an additional stage if there is at least one experimental treatment remaining in the trial. It is convenient to take $f_{dl} = f_l$ and $e_{dl} = e_l$ for all $d$ and $l$, as well as $f_L = e_L$, in order to ensure that the trial conforms to the desired maximum number of stages and so that a conclusion is made for each $H_{0d}$. Note that rejection of one treatment's null hypothesis does not end the trial. Furthermore, with this formulation, once a treatment is dropped from the trial, its standardized treatment effect is not tested in any future analyses.
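The stopping rules can be stated compactly in code. The following Python sketch (function and argument names are our own, with the common boundaries $f_{dl} = f_l$ and $e_{dl} = e_l$ assumed) applies one interim analysis:

```python
def apply_stopping_rules(Z, f, e, active):
    """Apply one interim analysis of the stopping rules above. Z maps each
    active experimental treatment d to its standardized statistic Z_dl; f and
    e are this analysis's shared futility and efficacy bounds. Returns the
    treatments still active and those whose null hypothesis is rejected;
    treatments dropped for futility appear in neither set."""
    still_active, rejected = set(), set()
    for d in active:
        if Z[d] >= e:            # efficacy: drop d and reject H_0d
            rejected.add(d)
        elif Z[d] >= f:          # continue: f <= Z_dl < e
            still_active.add(d)
        # else futility: drop d without rejecting H_0d
    return still_active, rejected
```

For example, with bounds $f = 0.5$ and $e = 2.5$, a treatment with $Z = 3.0$ is dropped with its null hypothesis rejected, one with $Z = 0.0$ is dropped for futility, and one with $Z = 1.0$ continues to the next stage.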
In what follows, we will make use of the vectors $x_R = (x_{R1}, \ldots, x_{R(D-1)})^T$ and $w_R = (w_{R1}, \ldots, w_{R(D-1)})^T$. Here, $x_{Rd} \in \{1, \ldots, L\}$ is the analysis at which experimental treatment $d$ was dropped from the trial. Moreover, $w_{Rd} \in \{0, 1\}$, with $w_{Rd} = 1$ if experimental treatment $d$ was dropped for efficacy and 0 otherwise. Prior to a trial's commencement, $x_R$ and $w_R$ are unknown random variables. However, the probability that the trial progresses according to some particular $x = (x_1, \ldots, x_{D-1})^T$ and $w = (w_1, \ldots, w_{D-1})^T$, given a vector of true response rates $\tau = (\tau_1, \ldots, \tau_{D-1})^T$, can be computed using multivariate normal integration. More specifically, given this particular $(x, w)$ pair, the covariance between, and the information levels of, the test statistics can be computed and the following integral evaluated (see Jennison and Turnbull [2000] or Wason [2015] for further details):

$$\mathrm{pr}(x_R = x, w_R = w \mid \tau) = \int_{l(1, x_1, w_1)}^{u(1, x_1, w_1)} \cdots \int_{l(L, x_{D-1}, w_{D-1})}^{u(L, x_{D-1}, w_{D-1})} \phi\{z;\ r(\tau, L) \circ I_{(x,w)}^{1/2},\ K_{(x,w)}\}\,\mathrm{d}z,$$

where

- $z = (z_{11}, \ldots, z_{1(D-1)}, \ldots, z_{L1}, \ldots, z_{L(D-1)})^T$;
- $\phi\{z; \mu, K\}$ is the probability density function of a multivariate normal distribution with mean $\mu$ and covariance matrix $K$, evaluated at the vector $z$;
- $r(\tau, L)$ is the vector formed by repeating $\tau$ $L$ times;
- $I_{(x,w)} = (I_{1,(x,w)}^T, \ldots, I_{L,(x,w)}^T)^T$, where $I_{l,(x,w)} = (I_{1l}, \ldots, I_{(D-1)l})^T_{(x,w)}$ is the vector of information levels for the estimated treatment effects at interim analysis $l$, conditional on the particular $(x, w)$ being considered;
- $\circ$ denotes the Hadamard product of two vectors, with the square root of the vector $I_{(x,w)}$ taken in an element-wise manner;
- $l$ and $u$ are functions that give the lower and upper limits of integration for the test statistic $Z_{dl}$, given values for $l$, $x_d$, and $w_d$. For example, $l(1, 2, 1) = f_1$ and $u(1, 2, 1) = e_1$, while $l(2, 2, 1) = e_2$ and $u(2, 2, 1) = \infty$, and then $l(l, 2, 1) = -\infty$ and $u(l, 2, 1) = \infty$ for $l > 2$;
- $K_{(x,w)}$ is the covariance matrix between the standardized test statistics at and across each interim analysis, according to $(x, w)$.
Thus, using $Z_l = (Z_{1l}, \ldots, Z_{(D-1)l})^T$, we have that the blocks of $K_{(x,w)}$ are the matrices $\mathrm{cov}(Z_{l_1}, Z_{l_2} \mid x, w)$. However, $Z_l = \hat{\tau}_l \circ I_{l,(x,w)}^{1/2}$, and by the properties of normal linear models $\mathrm{cov}(\hat{\tau}_{l_1}, \hat{\tau}_{l_2} \mid x, w) = \mathrm{cov}(\hat{\tau}_{l_2}, \hat{\tau}_{l_2} \mid x, w)$ ($l_1, l_2 = 1, \ldots, L$; $l_1 \le l_2$) (Jennison and Turnbull, 2000), giving

$$\mathrm{cov}(Z_{l_1}, Z_{l_2} \mid x, w) = \mathrm{diag}(I_{l_1,(x,w)}^{1/2})\, \mathrm{cov}(\hat{\tau}_{l_2}, \hat{\tau}_{l_2} \mid x, w)\, \mathrm{diag}(I_{l_2,(x,w)}^{1/2}), \quad (2.1)$$

for $l_1, l_2 = 1, \ldots, L$, $l_1 \le l_2$, where $\mathrm{diag}(v)$ is the matrix formed by placing the elements of vector $v$ along the leading diagonal. Note that equation (2.1), in conjunction with the expectations of our standardized test statistics and the observation that $(Z_1^T, \ldots, Z_L^T)^T$ is multivariate normal, can be restated simply as the fact that our test statistics follow the canonical joint distribution (Jennison and Turnbull, 2000).
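The within-treatment implication of the canonical joint distribution is that $\mathrm{cov}(Z_{dl_1}, Z_{dl_2}) = (I_{dl_1}/I_{dl_2})^{1/2}$ for $l_1 \le l_2$. A short Python sketch (our own illustration) builds this covariance matrix for one treatment from its information levels:

```python
import numpy as np

def canonical_cov(info):
    """Covariance matrix of (Z_1, ..., Z_L) for a single treatment under the
    canonical joint distribution: cov(Z_l1, Z_l2) = (I_l1 / I_l2)^{1/2} for
    l1 <= l2, where info[l - 1] is the information level at analysis l."""
    info = np.asarray(info, dtype=float)
    L = len(info)
    K = np.empty((L, L))
    for a in range(L):
        for b in range(L):
            lo, hi = min(a, b), max(a, b)
            K[a, b] = np.sqrt(info[lo] / info[hi])
    return K
```

With equal information increments per stage, for instance, the correlation between the first and second analyses is $1/\sqrt{2}$.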

Familywise error rate control
It is a common requirement of clinical trial designs that the probability of one or more false rejections within the family of null hypotheses is not greater than some $\alpha$. This is known as strong control of the familywise error rate. In this section, we establish strong control for our considered trial design.
To evaluate the familywise error rate of a design, for any $\tau$, the above integral can be evaluated for all $x$ and $w$ that would imply that a type-I error is made, and the results summed. In order to demonstrate how to strongly control it, though, it is essential to know the forms of the $I_{l,(x,w)}$ and $K_{(x,w)}$ for each $(x, w)$. However, by equation (2.1), the $I_{l,(x,w)}$ and $K_{(x,w)}$ can be determined if $\mathrm{cov}(\hat{\beta}_l, \hat{\beta}_l \mid x, w)$ is known for $l = 1, \ldots, L$.
Thus, consider the matrix $\mathrm{cov}(\hat{\beta}_l, \hat{\beta}_l \mid x, w)$ for some $l \le L$ and any $(x, w)$. We compute values for $L_{lr}$ ($r = 1, \ldots, D$), the number of stages of the trial, up to analysis $l$, in which $r$ treatments were remaining. Because we do not continue the trial unless at least one experimental treatment remains, $L_{l1} = 0$ always. It will be convenient, however, to still include this value. Moreover, it is clear that the $L_{lr}$ are uniquely determined given $(x, w)$. Now, $\mathrm{cov}(\hat{\beta}_l, \hat{\beta}_l \mid x, w)$ can always be decomposed as a sum over the determined $L_{lr}$ and the prespecified sequences $S_r$ (see Fitzmaurice et al. [2011] for details):

$$\mathrm{cov}(\hat{\beta}_l, \hat{\beta}_l \mid x, w) = \mathrm{cov}(\hat{\beta}_l, \hat{\beta}_l \mid L_{l1}, \ldots, L_{lD}) = \left\{ \sum_{r=2}^{D} L_{lr} \sum_{i=1}^{|S_r|} \frac{n}{|S_r|}\, X_{s_{ri}}^T R_r^{-1} X_{s_{ri}} \right\}^{-1}.$$

Here $X_{s_{ri}}$ is the uniquely defined $r \times (2D-1)$ design matrix for a single patient allocated to sequence $s_{ri}$, and $R_r$ is the easily computed $r \times r$ covariance matrix of the responses for a single patient allocated $r$ treatments in total. The factor $n/|S_r|$ arises from the number of patients allocated to each sequence $s_{ri}$, by our choice of period balance.
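The decomposition can be checked numerically. The following Python sketch (our own illustration) builds the per-patient design matrices for an assumed Williams square with $D = 4$, forms the information sum above with $R_r = \sigma_b^2 J + \sigma_e^2 I$, and inverts it; it also previews the familiar complete block result, noted after Theorem 2.1 below, that the treatment effect block of the covariance does not depend on $\sigma_b^2$:

```python
import numpy as np

# A Williams square for D = 4 treatments: rows are the sequences in S_4,
# columns are periods (treatment labels 0, ..., 3, with 0 the control).
WILLIAMS_4 = [[0, 1, 3, 2], [1, 2, 0, 3], [2, 3, 1, 0], [3, 0, 2, 1]]

def single_patient_X(seq, D=4):
    """Design matrix X_{s_ri} (r x (2D - 1)) for one patient on `seq`, with
    columns ordering the fixed effects (mu_0, pi_2..pi_D, tau_1..tau_{D-1})."""
    r = len(seq)
    X = np.zeros((r, 2 * D - 1))
    X[:, 0] = 1.0                      # intercept mu_0
    for j, d in enumerate(seq):
        if j >= 1:
            X[j, j] = 1.0              # period effect pi_{j+1} (pi_1 = 0)
        if d >= 1:
            X[j, D - 1 + d] = 1.0      # treatment effect tau_d (tau_0 = 0)
    return X

def cov_beta(sigma2_b, sigma2_e, n_per_seq=1, D=4):
    """cov(beta_hat) = {sum over sequences of (n/|S_r|) X^T R_r^{-1} X}^{-1},
    with R_r = sigma2_b * J + sigma2_e * I for a complete block sequence."""
    R = sigma2_b * np.ones((D, D)) + sigma2_e * np.eye(D)
    Rinv = np.linalg.inv(R)
    info = sum(n_per_seq * single_patient_X(s, D).T @ Rinv @ single_patient_X(s, D)
               for s in WILLIAMS_4)
    return np.linalg.inv(info)

def tau_block(C, D=4):
    """Treatment effect (tau) block: the last D - 1 rows and columns."""
    return C[-(D - 1):, -(D - 1):]
```

With one patient per sequence, classical Latin square theory gives $\mathrm{var}(\hat{\tau}_d) = 2\sigma_e^2/4$ here, and changing $\sigma_b^2$ leaves the whole $\tau$ block unaltered.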
We now establish two key results about $\mathrm{cov}(\hat{\beta}_l, \hat{\beta}_l \mid L_{l1}, \ldots, L_{lD})$. Following this, we provide a proof detailing how to strongly control the familywise error rate.
Theorem 2.1. Let $\beta = (\mu_0, \pi_2, \ldots, \pi_D, \tau_1, \ldots, \tau_{D-1})^T$. Consider an analysis to be performed after some number of stages $l$. Then:

1. $\mathrm{cov}(\hat{\beta}_l, \hat{\beta}_l \mid L_{l1}, \ldots, L_{lD})$ takes an explicit closed form in which, up to the scaling determined by $n$ and the $L_{lr}$, the element corresponding to $\mathrm{var}(\hat{\mu}_{0l})$ involves the factor $\frac{2D-1}{D}\sigma_e^2$, the elements linking $\hat{\mu}_{0l}$ to the remaining fixed effect estimates are $G_{pq} = -\sigma_e^2$ ($p = 1, \ldots, D-1$; $q = 1$), and the treatment effect block is $H_{pq} = \sigma_e^2(1 + \delta_{pq})$ ($p, q = 1, \ldots, D-1$).

2. If $q \ge 2$ is the largest integer such that $L_{lr} = 0$ for $r = 1, \ldots, q-1$, then the covariance of the estimates of the fixed effects $\hat{\pi}_{2l}, \ldots, \hat{\pi}_{ql}, \hat{\tau}_{1l}, \ldots, \hat{\tau}_{(q-1)l}$ is identical to what it would be for $L_{l1} = \cdots = L_{l(D-1)} = 0$. Moreover, the covariance between the estimates of $\hat{\pi}_{2l}, \ldots, \hat{\pi}_{ql}, \hat{\tau}_{1l}, \ldots, \hat{\tau}_{(q-1)l}$ and the estimates of $\hat{\pi}_{(q+1)l}, \ldots, \hat{\pi}_{Dl}, \hat{\tau}_{ql}, \ldots, \hat{\tau}_{(D-1)l}$ is also identical to what it would be for $L_{l1} = \cdots = L_{l(D-1)} = 0$.

Note that part (1) of the above theorem implies

$$\mathrm{cov}(\hat{\tau}_{d_1 l}, \hat{\tau}_{d_2 l}) \propto \sigma_e^2 (1 + \delta_{d_1 d_2}), \quad d_1, d_2 \in \{1, \ldots, D-1\}.$$

This is the familiar result for complete block sequences that there is no dependence upon the between-patient variance $\sigma_b^2$ (Jones and Kenward, 2014). □

Theorem 2.2. A group sequential crossover trial of the type considered, with $D \ge 2$, testing the $D-1$ hypotheses $H_{0d}: \tau_d \le 0$, $H_{1d}: \tau_d > 0$, attains the maximal value of its familywise error rate for $\tau = (\tau_1, \ldots, \tau_{D-1})^T = (0, \ldots, 0)^T$.

Proof. Theorem 2.1 implies that the elements of the covariance matrix $\mathrm{cov}(\hat{\tau}_l, \hat{\tau}_l)$ that differ from the case where no treatments have been dropped are exactly those corresponding to unstandardized test statistics no longer of importance. Consequently, the values of $I_{l,(x,w)}$ and $K_{(x,w)}$ that differ from the case $x = (L, \ldots, L)^T$ are only ever those corresponding to limits of integration given by $(-\infty, \infty)$ in our computation of $\mathrm{pr}(x_R = x, w_R = w \mid \tau)$. By the marginal distribution properties of the multivariate normal distribution, we therefore need only consider one matrix $K_{(x,w)}$ and one set of vectors $I_{l,(x,w)}$ ($l = 1, \ldots, L$), exactly those given by the case $x = (L, \ldots, L)^T$. Denote these by $K$ and $I_l$, and set $I = (I_1^T, \ldots, I_L^T)^T$. For more information on this, see Appendix A.
We now have

$$\mathrm{pr}(x_R = x, w_R = w \mid \tau) = \int_{l(1, x_1, w_1)}^{u(1, x_1, w_1)} \cdots \int_{l(L, x_{D-1}, w_{D-1})}^{u(L, x_{D-1}, w_{D-1})} \phi\{z;\ r(\tau, L) \circ I^{1/2},\ K\}\,\mathrm{d}z.$$

Now, consider without loss of generality the probability that we reject $H_{01}$, and denote by $X$ and $W$ the sets of all possible $x$ and $w$, respectively. By integrating over all possible values of $x_2, \ldots, x_{D-1}$ and $w_2, \ldots, w_{D-1}$, we have that the probability that we reject $H_{01}$ does not depend on the values of $\tau_2, \ldots, \tau_{D-1}$; that is, on the other treatments tested:

$$\mathrm{pr}(\text{Reject } H_{01} \mid \tau) = \sum_{x \in X} \sum_{\{w \in W :\ w_1 = 1\}} \mathrm{pr}(x_R = x, w_R = w \mid \tau) = \sum_{x_1 = 1}^{L} \int_{l(1, x_1, 1)}^{u(1, x_1, 1)} \cdots \int_{l(L, x_1, 1)}^{u(L, x_1, 1)} \phi\{z_1;\ r(\tau_1, L) \circ I_{\tau_1}^{1/2},\ K_{\tau_1}\}\,\mathrm{d}z_1,$$

where $I_{\tau_1}$ and $K_{\tau_1}$ are the restrictions of $I$ and $K$ to rows and columns corresponding to experimental treatment $d = 1$, respectively. This final form for $\mathrm{pr}(\text{Reject } H_{01} \mid \tau)$ is identical to what it would be in the case $D = 2$. Therefore, to ascertain the $\tau$ giving the maximal familywise error rate of a trial with $D \ge 2$, it suffices to consider which $\tau^* \le 0$ maximizes the probability that $H_{01}$ is rejected in a trial with $D = 2$ initial treatments. For then, $\tau = (\tau^*, \ldots, \tau^*)^T$ using this $\tau^*$ will provide the maximum probability of rejecting at least one true $H_{0d}$ for some $d$; that is, the maximum familywise error rate. To see this, consider the familywise error rate for $\tau = (\tau^*, \ldots, \tau^*)^T$. If one changes some individual element $\tau_{d_1}$ of this vector, this does not affect the probability that $H_{0d_2}$ is rejected for $d_2 \ne d_1$, and it can only decrease the probability that $H_{0d_1}$ is incorrectly rejected. Thus, overall, straying from this $\tau = (\tau^*, \ldots, \tau^*)^T$ can only decrease the familywise error rate.
Thus, now consider all possible realizations of the test statistics of a trial with $D = 2$ and their associated values of $(x, w) = (x_1, w_1)$. We have $Z = (Z_{11}, \ldots, Z_{1L})^T \in \mathbb{R}^L$, with $Z_{1L} = \cdots = Z_{1x_1}$ if the trial was stopped at stage $x_1$. Now consider increasing the value of the test statistics by some $\eta > 0$. In every instance in which $H_{01}$ was previously rejected, the efficacy bound of that stage, or an earlier one, will still be exceeded, and so $H_{01}$ will still be rejected. Therefore, the probability of rejecting $H_{01}$ is at least as large as before. Thus, increasing the value of $\tau_1 \le 0$ causes a non-decreasing change in the value of the type-I error rate. Therefore, the probability of rejecting $H_{01}$ is maximized over $\tau_1 \le 0$ by $\tau_1 = 0$, implying in turn that the maximal familywise error rate of a trial with $D \ge 2$ is given by $\tau = (\tau_1, \ldots, \tau_{D-1})^T = (0, \ldots, 0)^T$. □

Design characteristics
A trial will now be fully specified given values for $D$, $L$, $\sigma_e^2$, and $n$, as well as choices for the $S_r$ and the futility and efficacy boundaries $f_1, \ldots, f_L$ and $e_1, \ldots, e_L$, respectively. Given these, $K$ and $I$ can be computed using the results above. Then, by Theorem 2.2, we can strongly control the familywise error rate to $\alpha$ for this design by constraining the following sum of integrals, the familywise error rate under the global null hypothesis:

$$\sum_{x \in X} \sum_{\{w \in W :\ w \ne (0, \ldots, 0)^T\}} \mathrm{pr}\{x_R = x, w_R = w \mid \tau = (0, \ldots, 0)^T\} \le \alpha.$$

Additionally, suppose that we wish to power this trial to reject a particular null hypothesis, without loss of generality $H_{01}$, at some clinically relevant difference $\tau_1 = \delta$.
The type-II error rate $\beta$ for $H_{11}$ is then given by

$$\beta = 1 - \mathrm{pr}(\text{Reject } H_{01} \mid \tau_1 = \delta),$$

which, by the argument in the proof of Theorem 2.2, does not depend on $\tau_2, \ldots, \tau_{D-1}$. Moreover, denoting by $N$ and $O$ the total number of patients and observations required by the trial, respectively, we can compute the expected sample size, $E(N \mid \tau)$, or expected number of observations, $E(O \mid \tau)$, for any $\tau$, according to

$$E(N \mid \tau) = \sum_{x \in X} \sum_{w \in W} N(x, w)\, \mathrm{pr}(x_R = x, w_R = w \mid \tau), \qquad E(O \mid \tau) = \sum_{x \in X} \sum_{w \in W} O(x, w)\, \mathrm{pr}(x_R = x, w_R = w \mid \tau).$$

Here, $N(x, w)$ and $O(x, w)$ are functions that give the number of patients and observations, respectively, required by a trial that progresses according to $(x, w)$. Specifically,

$$N(x, w) = n \max_{d = 1, \ldots, D-1} x_d,$$

with $O(x, w)$ computed analogously by additionally accounting for the number of treatments, and hence the number of observations per patient, remaining in each stage undertaken.
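When exact multivariate normal integration is unavailable, $E(N \mid \tau)$ can be approximated by Monte Carlo. The following Python sketch handles only the simplest, $D = 2$, case, and rests on two assumptions of ours: each stage adds information $n/(2\sigma_e^2)$ (the standard complete block result), and the score statistic has independent normal increments, so the $Z_l$ follow the canonical joint distribution. Function and parameter names are hypothetical:

```python
import numpy as np

def expected_sample_size(f, e, tau, n=12, sigma2_e=6.51, reps=100000, seed=7):
    """Monte Carlo sketch of E(N | tau) for a D = 2 group sequential
    crossover trial with futility bounds f and efficacy bounds e (one per
    stage). Simulates score statistic paths, applies the stopping rules,
    and averages n times the stopping stage."""
    rng = np.random.default_rng(seed)
    L = len(f)
    i_stage = n / (2.0 * sigma2_e)            # information added per stage
    total = 0.0
    for _ in range(reps):
        score, info = 0.0, 0.0
        for l in range(L):
            info += i_stage
            score += rng.normal(tau * i_stage, np.sqrt(i_stage))
            Z = score / np.sqrt(info)
            if Z < f[l] or Z >= e[l]:         # futility or efficacy stop
                break
        total += n * (l + 1)
    return total / reps
```

As sanity checks, boundaries that force stopping at the first analysis give $E(N) = n$ exactly, and boundaries that can never be crossed give $E(N) = nL$.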

Example: TOMADO
As an example of how to design a group sequential crossover trial with strong control of the familywise error rate, we will make use of the TOMADO crossover randomized controlled trial (Quinnell et al., 2014). This open-label trial compared three experimental treatments to a single control for the treatment of sleep apnea-hypopnea using a four-treatment, four-period crossover design. The normally distributed secondary endpoint, the Epworth Sleepiness Scale, improves as it decreases, which would yield negative test statistics; we therefore take the decrease in this scale as the endpoint in order to retain the same hypothesis tests as before: $H_{0d}: \tau_d \le 0$, $H_{1d}: \tau_d > 0$, $d = 1, 2, 3$. The trial planned to recruit 90 patients and, utilizing restricted error maximum likelihood estimation, the final analysis calculated that $\sigma_e^2 = 6.51$. Taking this variance as the truth, the trial had a familywise error rate $\alpha = 0.05$ for $\tau = (\tau_1, \tau_2, \tau_3)^T = (0, 0, 0)^T$, and $\beta = 0.2$ for $H_{11}$ at $\tau_1 = 1.11$.
Many methods exist for determining boundaries for a one-sided group sequential trial with parallel treatment arms. Here, we consider analogues of the power family boundaries of Pampallona and Tsiatis (1994). For this, values for the desired type-I and type-II error rates, a clinically relevant difference $\delta$, the maximum number of stages $L$, the within-person variance $\sigma_e^2$, and a shape parameter $\Delta$ must be specified. A two-dimensional grid search is then used to find the exact required maximal sample size. From this, a suitable value of $n$ is identified by rounding up to the nearest integer such that $n$ is, as required, divisible by the lowest common multiple of $|S_2|, \ldots, |S_D|$. Utilizing Williams squares for our designs, $n$ was forced to be divisible by 12.
Taking $\alpha = 0.05$, $\beta = 0.2$, $\delta = 1.11$, $\sigma_e^2 = 6.51$, $L = 3$, and $\Delta = -0.25$, 0, 0.25, 0.5 as examples, group sequential crossover trial designs were determined and compared to the single-stage design used by TOMADO. All computations were done in R (R Core Team, 2016) using the package groupSeqCrossover, available from https://github.com/mjg211/article_code. Matlab (The Mathworks Inc., 2016) code employing symbolic algebra is also available to return the matrices given by several of the equations in the text. Use of both the R and Matlab code is detailed in Appendix D.
A summary of the performance of the designs is provided in Table 1, and their computed boundaries are displayed in Figure 1. We can see that, as is the case for two-arm parallel trial designs, there is a trend for larger values of $\Delta$ to result in larger maximum sample sizes and lower expected sample sizes, due to their larger stopping regions. However, this is not the case for $\Delta = 0.25$, because of the requirement to round to a suitable integer value of $n$.
Plots of the probability of rejecting $H_{01}$, and of rejecting $H_{0d}$ for some $d = 1, 2, 3$, are provided for a range of values of $\theta$ when $\tau = (\theta, \theta, \theta)^T$ in Figure 2. The power curves are similar for all of the designs, with the only differences a result of rounding in the group sequential designs to achieve suitable values of $n$. As is to be expected for group sequential designs, the maximum sample size and maximum number of observations are larger than those for the single-stage design. However, the group sequential designs have lower expected sample sizes under the global null hypothesis ($\tau = 0 = (0, 0, 0)^T$), by up to a maximum of 23% for $\Delta = 0.5$. This does, however, come at the expense of an increased expected sample size under the global alternative hypothesis ($\tau = \delta = (\delta, \delta, \delta)^T$).
From Figure 3, the expected sample sizes of the group sequential designs can be seen to be far lower than those of the single-stage design for the more extreme values of $\theta$. A similar statement holds for the expected number of observations. However, in this instance, for $\Delta = 0$ and 0.5, the performance of the group sequential designs is better than that of the single-stage design across all values of $\theta$.

Discussion
There is a long history of group sequential clinical trials. Very few, however, utilize a crossover design. This may be at least in part because no formal proof has existed for how to strongly control the familywise error rate of such a trial. Here, we provided such a proof and then explored the performance of several group sequential designs for the TOMADO trial.
The expected sample size of the sequential designs was observed to be far lower than that of the single-stage design for a large range of values of the true response rate, including all those between 0 and $\delta$, which may be the more realistically observed treatment effects. For some considered designs, the region in which this is not the case is very small and does not include values near 0, which is notable for ethical reasons. This advantage could be increased further by utilizing optimal stopping boundaries, as has been proposed for parallel arm designs (Wason and Jaki, 2012). Importantly, several of the designs always performed better than the single-stage design in terms of the expected number of observations required, which could be a significant factor in the cost and length of a trial. Consequently, we can conclude that a group sequential approach to a crossover trial improves efficiency in some circumstances.

Several possible extensions to our work present themselves. For example, we assumed that the period was reset in each trial stage. This could reflect a scenario where it is believed that being enrolled in the trial will alter a patient's behavior. However, in some cases, such as to deal with seasonal effects, it would be preferable to have different period effects in each stage.
One simple extension would be to employ non-inferiority tests, in place of our present superiority testing framework. Non-inferiority tests, which seek to determine whether new treatments are not clinically worse than an established control, would have hypotheses shifted by some factor from the ones presented here. Theorem 2.2 could easily be altered to accommodate this, and popular methods for boundary determination in this setting could then be applied.
Here, we have worked under an idealized scenario, assuming the within-patient variance to be known prior to trial commencement. Though this is a common assumption in group sequential theory, it has limitations, because often a good estimate of the key variance parameter cannot be provided at the design stage. In this instance, group sequential t-tests would almost certainly be required. Furthermore, simulation is required to quantify error rates accurately in the case of small sample sizes. To explore this scenario, we analyzed the true familywise error rate under the global null hypothesis of a particular design, motivated again by the TOMADO trial but with $L = 2$ and $n = 12$. We found that, provided restricted error maximum likelihood estimation was utilized, there was very little inflation in the familywise error rate over the nominal level $\alpha$. Details of this are provided in Appendix B.
Moreover, we have only explored designing group sequential crossover trials. It is well known that if a final analysis of data acquired in a sequential trial is performed without taking the sequential nature of the data into account, biased treatment effect estimates will be acquired. Extending established methodology for parameter estimation to our scenario will thus be important.
Finally, we have implicitly assumed that there will be no patient dropout and have not discussed the issue of patient recruitment rates. Though these are problems for all adaptive designs, it is important to give them note. Due to our need for one stage's data to be analyzed before the commencement of the following stage, it is likely that the length of a trial using our approach would be greater for certain recruitment rates. It could be that recruitment is paused at each interim analysis, or that patients are continually recruited under the old allocation scheme until results are available, which would lead to overrun and an increase in the expected number of observations and sample size. Thus, this would be an important factor to consider when choosing an appropriate design for a trial.
Nevertheless, for future crossover trials, consideration should be given to a group sequential approach. This may substantially assist in the efficient prioritization of efficacious treatments.
Moreover, using the above along with equation (2.1), in conjunction with part (2) of Theorem 2.1, we have that if $f_p \le Z_{d_1 p} < e_p$ for $p = 0, \ldots, l_1 - 1$ (i.e., if treatment $d_1$ is present up to stage $l_1$) and $f_q \le Z_{d_2 q} < e_q$ for $q = 0, \ldots, l_2 - 1$ (i.e., if treatment $d_2$ is present up to stage $l_2$), with $l_1 \le l_2$, $l_1, l_2 \in \{1, \ldots, L\}$, and $d_1, d_2 \in \{1, \ldots, D-1\}$ (taking $f_0 = -\infty$ and $e_0 = \infty$), then

$$\mathrm{cov}(Z_{d_1 a}, Z_{d_2 b} \mid x, w) = I_{d_1 a}^{1/2}\, \mathrm{cov}(\hat{\tau}_{d_1 b}, \hat{\tau}_{d_2 b} \mid L_{b1}, \ldots, L_{bD})\, I_{d_2 b}^{1/2},$$

for $a \le b$, $a = 1, \ldots, l_1$, and $b = 1, \ldots, l_2$. For further clarity, as an example, consider the case $D = 3$, $L = 2$, and the associated value of $\mathrm{pr}(x_R = x, w_R = w \mid \tau)$ when $x = (2, 1)^T$ and $w = (1, 0)^T$. Using the above, we can write down the elements of the matrix $K_{(x,w)}$ and the vector $I_{(x,w)} = (I_{1,(x,w)}^T, I_{2,(x,w)}^T)^T$, using $*$ to signify an element whose value we do not know. Now consider our computation of $\mathrm{pr}(x_R = x, w_R = w \mid \tau)$. As we have seen, we do not know the values of the final row and column of the matrix $K_{(x,w)}$ or the final element of the vector $I_{(x,w)}$. But the fact mentioned in Theorem 2.2 becomes clear: this does not matter, because the limits of integration corresponding to this variable are $(-\infty, \infty)$. Indeed, by the marginal distribution properties of the multivariate normal distribution, we need only consider one matrix $K_{(x,w)}$ and one set of vectors $I_{l,(x,w)}$, exactly those given by the case $x = (L, \ldots, L)^T$. We denote these by $K$ and $I_l$ and set $I = (I_1^T, \ldots, I_L^T)^T$. Explicitly, we have

$$I_{dl} = I_{d + (D-1)(l-1)}, \qquad \mathrm{cov}(Z_{d_1 l_1}, Z_{d_2 l_2}) = K_{d_1 + (D-1)(l_1 - 1),\, d_2 + (D-1)(l_2 - 1)},$$

for any $d, d_1, d_2 \in \{1, \ldots, D-1\}$ and $l_1 \le l_2$, $l, l_1, l_2 \in \{1, \ldots, L\}$.

Appendix B: Small sample size performance
For small sample sizes, simulation is required to accurately determine a design's performance. Because crossover trials are routinely conducted with small sample sizes, here we explore the impact this has upon the familywise error rate under the global null hypothesis. We determined a design corresponding to the TOMADO example that would require only 12 patients in each of two stages, the smallest allowable maximum sample size for a group sequential crossover trial with $D = 4$ treatments initially, given our restrictions on $n$. Taking $\Delta = 0$ as an example, a trial with $n = 12$ and $L = 2$, with $f_1 = 0.768$, $f_2 = 2.036$, and $e_1 = 2.879$ to three decimal places, would, using our multivariate normal calculations, have a maximal familywise error rate of $\alpha = 0.05$ under the global null hypothesis, and $\beta = 0.2$ for $\delta = 2.2$ when $\sigma_e^2 = 6.51$. Ten thousand such trials were simulated in order to ascertain the true probability of rejecting $H_{0d}$ for some $d = 1, 2, 3$ when $\tau = (0, 0, 0)^T$. For simplicity, $\pi_j$ was set to 0 for $j = 2, \ldots, D$, and $\mu_0$ was set to 0. Incorporating non-zero period effects, however, would not be expected to greatly affect the results. Whitehead et al. (2009) proposed a quantile substitution procedure for adapting the boundaries of a sequential trial to be more suitable to the case of unknown variance, and we additionally considered employing this procedure. Given that there is no consensus on how to determine the degrees of freedom when analyzing with linear mixed models, we took the degrees of freedom at any analysis to be those from the classical decomposition of degrees of freedom in balanced, multilevel analysis of variance designs (Pinheiro and Bates, 2009). Moreover, we assessed the performance of the sequential design when the linear mixed model was fitted through either maximum likelihood or restricted error maximum likelihood estimation.
Therefore, in total, these simulations were performed for each of four possible analysis procedures: maximum likelihood or restricted error maximum likelihood estimation, with or without boundary adjustment through quantile substitution.
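The quantile substitution idea referred to above can be sketched as follows, assuming (as we understand the approach of Whitehead et al. [2009]) that each normal-based boundary is mapped to the t-distribution quantile with the same tail probability; the exact procedure may differ in detail, and the function name is ours:

```python
from scipy.stats import norm, t

def substitute_quantile(bound, df):
    """Quantile substitution sketch: map a stopping boundary chosen under a
    normality assumption to the t distribution with `df` degrees of freedom,
    preserving the tail probability norm.cdf(bound)."""
    return float(t.ppf(norm.cdf(bound), df))
```

For example, the first-stage efficacy bound $e_1 = 2.879$ of the design above becomes more stringent under a t distribution with few degrees of freedom, and the adjustment vanishes as the degrees of freedom grow.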
Thus, for each simulated trial, patient response data for each stage $l$ were randomly generated according to the distribution implied by the allocated treatment sequence (assigned according to the rules of the trial design), using the function rmvnorm (Genz et al., 2016) in R. The between-person variance was set to $\sigma_b^2 = 10.12$, the value ascertained in the final analysis of the TOMADO trial data. Following this, our linear mixed model was fitted to all accumulated data (with either maximum likelihood or restricted error maximum likelihood estimation, according to the particular analysis procedure being considered) and $Z_{dl} = \hat{\tau}_{dl} \hat{I}_{dl}^{1/2}$ determined for $d = 1, 2, 3$, where $\hat{I}_{dl}$ is the observed Fisher information for $\hat{\tau}_{dl}$. Then, each $Z_{dl}$ was compared to $e_l$ and $f_l$ and our stopping rules were applied (with $e_l$ and $f_l$ adjusted using quantile substitution if the analysis procedure under consideration so dictated). If, for some $d = 1, 2, 3$, $f_l \le Z_{dl} < e_l$, the trial proceeded to the following stage and the process was repeated. In each instance, simulations in which $H_{0d}$ was rejected for some $d = 1, 2, 3$ were recorded in order to ascertain the true rejection rates.
The performance of these procedures is displayed in Table B.1. We observe that when maximum likelihood estimation is utilized and the boundaries are not adjusted using the procedure of Whitehead et al. (2009), the familywise error rate under the global null hypothesis is substantially inflated, to 0.077. However, when restricted error maximum likelihood estimation is used, there is only negligible inflation, provided adjustment of the boundaries is also employed. A program to perform this analysis is available; its use is detailed in Appendix D.
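Quantile substitution, as employed here, replaces each normal-theory boundary by the t-distribution quantile with the same cumulative probability. A minimal sketch, with an illustrative degrees-of-freedom value (the paper instead derives the degrees of freedom from the classical ANOVA decomposition):

```python
from scipy.stats import norm, t

def quantile_substitute(boundary, df):
    """Map a standard normal boundary to the t quantile with the same
    cumulative probability: t_df^{-1}(Phi(boundary))."""
    return t.ppf(norm.cdf(boundary), df)

# Example: the stage-1 efficacy boundary with an illustrative 30 degrees
# of freedom; the heavier t tails push the boundary outwards.
e1_adj = quantile_substitute(2.879, 30)
print(e1_adj)
```

As the degrees of freedom grow, the adjusted boundary converges back to the normal-theory value, so the adjustment matters most for exactly the small-sample designs considered here.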
By the definition of being in period j, the vth element of P_jr is given by

(P_jr)_v = 1 if observation v is taken in period j, and 0 otherwise.

Because complete block design sequences are used, the vth element of T_{j s_ri} is given by (T_{j s_ri})_v = I_v^{j s_ri}, where

I_v^{j s_ri} = 1 if an individual on sequence s_ri receives treatment j in period v, and 0 otherwise.
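For concreteness, these indicator vectors are easy to construct explicitly. The sketch below does so in Python for one valid 4-treatment Williams square, with treatments labelled 0–3; the function names are illustrative only.

```python
import numpy as np

# One 4-treatment Williams square: rows are sequences, columns are periods,
# entries are treatment labels. Each treatment appears once per period and
# each ordered treatment pair appears exactly once in consecutive periods.
williams = np.array([[0, 1, 3, 2],
                     [1, 2, 0, 3],
                     [2, 3, 1, 0],
                     [3, 0, 2, 1]])

def treatment_indicator(sequence, j):
    """(T_{j s})_v = 1 iff the subject on sequence s receives treatment j
    in period v, 0 otherwise."""
    return (sequence == j).astype(int)

def period_indicator(n_periods, j):
    """(P_j)_v = 1 iff observation v is taken in period j (0-indexed)."""
    p = np.zeros(n_periods, dtype=int)
    p[j] = 1
    return p

print(treatment_indicator(williams[0], 3))  # -> [0 0 1 0]
```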

We denote this non-zero element, if it exists, by t_j. Now, from the symmetry present in R_r^{-1}, and using Lemma C.1, the relevant sums take the same value for all j and k. Therefore, we can determine the form of each of the scalar elements in the matrix above. As a final step, we must compute the sum across sequences; that is, over the index i. We can see immediately that the elements proposed to be 0 in Σ_{i=1}^{|S_r|} n X_{s_ri}^T R_r^{-1} X_{s_ri} / |S_r| are indeed so, and we therefore need only concentrate on the non-zero terms suggested: A, B, C, and E. However, other than P_{jr}^T R_r^{-1} T_{k s_ri}, all of the elements above have been identified as independent of the sequence s_ri. Therefore, the sum over the s_ri ∈ S_r can be computed easily, and gives the forms proposed for A, B, and C in the statement of the lemma immediately, upon multiplying through by n/|S_r|. But, by our imposed constraint that sequences be balanced for period, we can also sum over the P_{jr}^T R_r^{-1} T_{k s_ri}

H_{pq} = … (p = 1, …, D − 1; q = 1); H_{pq} = σ²_e(1 + δ_{pq}) (p = 1, …, D − 1; q = 1, …, D − 1).

1. If q ≥ 2 is the largest integer such that L_{lr} = 0 for r = 1, …, q − 1, then the covariance of the estimates of the fixed effects π̂_{2l}, …, π̂_{ql}, τ̂_{1l}, …, τ̂_{q−1,l} is identical to what it would be for L_{l1} = … = L_{l,D−1} = 0. Moreover, the covariance between the estimates of π̂_{2l}, …, π̂_{ql}, τ̂_{1l}, …, τ̂_{q−1,l} and the estimates of π̂_{q+1,l}, …, π̂_{Dl}, τ̂_{ql}, …, τ̂_{D−1,l} is identical to what it would be for L_{l1} = … = L_{l,D−1} = 0.
2. For the next part of the theorem, we re-order our vector of fixed effects such that β = (μ_0, π_2, τ_1, π_3, τ_2, …, π_D, τ_{D−1})^T, and thus the ordering of the columns of the X_{s_ri} is now

X_{s_ri} = (1_r  P_{2r}  T_{1 s_ri}  P_{3r}  T_{2 s_ri}  …  P_{Dr}  T_{D−1, s_ri}).

We proceed by induction over the number of stages completed, l, for general D. Assume that at the lth interim analysis the statement of the theorem is true; that is, that the covariance at this lth analysis, cov(β̂_l, β̂_l | L_{l1} = … = L_{l,q−1} = 0, L_{lq}, …, L_{lD}), satisfies the required condition. Additionally, we assume that this covariance matrix can be computed; that is, that A is invertible. We show that if we conduct another stage of the trial with t treatments remaining, 2 ≤ t ≤ q, then the new covariance matrix cov(β̂_{l+1}, β̂_{l+1} | L_{l+1,1}, …, L_{l+1,D}) has the required property for π̂_{2l}, …, π̂_{tl}, τ̂_{1l}, …, τ̂_{t−1,l}, as well as between π̂_{2l}, …, π̂_{tl}, τ̂_{1l}, …, τ̂_{t−1,l} and π̂_{t+1,l}, …, π̂_{Dl}, τ̂_{tl}, …, τ̂_{D−1,l}. Let

T_l = (μ̂_{0l}, π̂_{2l}, τ̂_{1l}, …, π̂_{tl}, τ̂_{t−1,l})^T,
T′_l = (π̂_{t+1,l}, τ̂_{tl}, …, π̂_{Dl}, τ̂_{D−1,l})^T,
W_l = (π̂_{2l}, τ̂_{1l}, …, π̂_{tl}, τ̂_{t−1,l})^T.

Denote dim(T_l) = |T_l|, and similarly for T′_l and W_l. By our assumptions, we can write the covariance in block form where, for example, A^{−1}_{T_l T_l} = cov(T_l, T_l | L_{l1}, …, L_{lD}), and with A^{−1}_{T_l T_l} and A^{−1}_{T′_l T_l} = (A^{−1}_{T_l T′_l})^T satisfying the required conditions for the fixed effects. Finally, as part of our inductive hypothesis we also assume that det(A^{−1}_{T_l T_l}) > 0.

= A^{−1}_{T_l T_l}(I_{|T_l|} − …).

Now

det{cov(T_{l+1}, T_{l+1} | L_{l+1,1}, …, L_{l+1,D})} = det(A^{−1}_{T_l T_l}) det{1 − (1 + l V_1^{−1})^{−1}} × det{I_{|W_l|} − (1 + l)^{−1} I_{|W_l|}} > 0,

thus det(A^{−1}_{T_{l+1} T_{l+1}}) > 0 as required, and

cov(W_{l+1}, W_{l+1}) = A^{−1}_{W_l W_l}{I_{|W_l|} − (1 + l)^{−1} I_{|W_l|}}.

Similarly,

cov(W′_{l+1}, W_{l+1}) = A^{−1}_{W′_l W_l}{I_{|W_l|} − (1 + l)^{−1} I_{|W_l|}}.

Thus, the covariance of the fixed effects at the (l + 1)th analysis has the desired property. Now, as the base case, consider having completed one stage of the trial and proceeding to complete another with any number of treatments t = 2, …, D remaining. Then, in this instance, by the previous result of this theorem, the covariance matrix is indeed invertible and has the desired property; moreover, by Lemma C.3, det(A^{−1}_{T_l T_l}) > 0. The proof is then complete.

D.2. R
The R package groupSeqCrossover allows the determination and exploration of group sequential power family crossover trial designs. The function gsco is used to determine a design, taking inputs for the values of L, α, β, δ, σ²_e, and D, and for the sequence type ("latin" or "williams"). The function plot can then be used, through S3 methods, to plot power curves, the expected sample size, and the expected number of observations for varying true treatment effects. Moreover, the function simulategsco can be used to simulate group sequential crossover trials in order to assess their operating characteristics; this is especially useful in the case of small sample size designs. The code below, for example, identifies the discussed design for Δ = 0 and then plots the expected sample size curve (the argument names shown are indicative):

# Identify the design
design <- gsco(L = 2, alpha = 0.05, beta = 0.2, delta = 2.2,
               sigmae = 6.51, D = 4, sequence = "williams")
# Plot the expected sample size curve
plot(design)

Note that we have to remove the rows and columns of zeroes from the R_r for r < D, and that we use b and e for σ_b and σ_e, respectively.