Exact group sequential designs for two-arm experiments with Poisson distributed outcome variables

We describe and compare two methods for the group sequential design of two-arm experiments with Poisson distributed data, which are based on a normal approximation and exact calculations respectively. A framework to determine near-optimal stopping boundaries is also presented. Using this framework, for a considered example, we demonstrate that a group sequential design could reduce the expected sample size under the null hypothesis by as much as 44% compared to a fixed sample approach. We conclude with a discussion of the advantages and disadvantages of the two presented procedures.


Introduction
It is desirable that experiments be designed and analyzed to ensure control of their type-I and type-II error-rates. For, within the context of a clinical trial, for example, an inflated type-I error-rate reduces our confidence that a significant health benefit has not been observed by chance, whilst under-powering a study raises the risk of failing to identify an efficacious treatment. It is for this reason that much research has been conducted to develop methodology for type-I error-rate control and sample size determination, in studies with many possible types of data, and many possible choices of null hypothesis. Within the context of experiments with discrete outcome variables, this research has often focused on the establishment of methods that are able to determine operating characteristics exactly. This allows us to reduce our reliance on approximate techniques that typically depend on asymptotic results. See, for example, Jung (2012), which summarizes exact methods for experiments with binary outcomes.
In contrast, for randomized two-arm experiments with Poisson distributed outcome variables, several authors have proposed approximate procedures for study design (Thode 1997;Shiue and Bain 1982;Mathews 2010), but it was only comparatively recently that Menon et al. (2011) described an alternative approach based on the exact distribution of the difference between two Poisson variables. They demonstrated that when the anticipated difference between the Poisson rates in the two arms was large, the sample size required by their approach was typically 5-10% smaller than that based on a normal approximation.
This result is a valuable one, as count outcomes occur in many experimental design settings of practical interest. In particular, within the context of clinical research, they are common in the form of the number of observed adverse events. More specifically, count outcomes occur as an endpoint of interest in epilepsy studies through seizure counts, in multiple sclerosis or Parkinson's trials via relapse counts, and in migraine treatment studies as the number of attacks. So to are they useful in cardiovascular trials as the number of hospitalizations observed over a particular time-frame, or in alcohol treatment studies as the number of drinks over a recent period of time. Whilst in asthma trials, the number of exacerbations is often of interest, and as we will discuss later, the number of apnea and hypopnea events per hour is a common primary outcome in sleep-apnea studies.
Regardless of the now available efficient exact method noted above though, it is always of interest to be able to develop methods for reducing requisite sample sizes further. One widely applicable approach to achieving this is to employ a group sequential design, through which an experiment's expected sample size (ESS) can be reduced by permitting early stopping following the repeated testing of a hypothesis of interest. Depending on the context, this can provide valuable savings in terms of time or money. For further details on group sequential methods see Jennison and Turnbull (2000).
Unfortunately, compared to settings with normally or Bernoulli distributed data, there is relatively little methodology available pertaining to the group sequential design of trials with Poisson distributed outcome variables. In fact, this is true more generally of count outcomes. Notable exceptions include the work of Cook and Lawless (1996), who described a nonparametric approach to interim monitoring in comparative studies with recurrent event outcomes. Moreover, Jiang (1999) considered the group sequential design of experiments with heterogeneous recurrent event endpoints. Xia and Hoover (2007) described a group sequential framework for Poisson outcome variables when interim analyses were timed after landmark numbers of events had occurred in each arm. They based their testing framework on the exact distribution of the number of events from one arm, conditional on the total observed number of events. Cook et al. (2010) presented methods for the sequential analysis of data from experiments with recurrent event responses observed over two periods, where one is a baseline period of observation. Recently, Mutze et al. (2018a) also considered recurrent event responses, presenting group sequential procedures for a robust semiparametric analysis of such data. Finally, Mutze et al. (2018b) described methodology for the group sequential design of trials with negative-binomial outcomes based on Wald test statistics using maximum likelihood estimators.
Here, we focus on a different design scenario to these articles, in which each outcome is assumed to be a single Poisson distributed variable, with its precise distribution dependent only on which arm it was accrued from. First, we describe how established group sequential design theory can be applied in this setting, allowing widely available software for design determination to be employed, and operating characteristics to be approximately controlled. Following this, we detail a novel design that allows for the exact computation of a design's operating characteristics based on the extension of the approach of Menon et al. (2011) to group sequential experiments. To allow maximal efficiency gains to be attained from the group sequential tests, we then describe an effective means of choosing stopping boundaries in a near-optimal manner. Next, we expound on the potential gains a group sequential design could bring through two hypothetical examples, before concluding with a discussion of the advantages and disadvantages of the two described procedures.

Notation and hypotheses
We consider an experiment in which data is to be accrued from two arms, which we index by j ∈ {1, 2}. In addition, we suppose that our group sequential experiment will have at most K ∈ ℕ + stages (explicitly permitting K = 1, which corresponds to a fixed sample design), and we index the stages by k ∈ {1, …, K}. Thus, from here, unless stated otherwise, it can be assumed that j ∈ {1, 2} and k ∈ {1, …, K}.
We then denote our outcomes variables by X ijk : the number of observed events for sample i, in arm j, in stage k. And, we assume that each outcome variable is accrued following observation over an equal fixed positive time, and that then X ijk ~ Po(λ j ), for λ j ∈ ℝ + the mean event rate on arm j. Furthermore, for simplicity, we assume that analyses are performed after kn, n ∈ ℕ + , outcome variables have been gathered on each of the two arms (implying i ∈ {1, …, n}). Note however that the methods which follow could readily be extended to consider unequal allocation to the arms in and across the stages.
Our null hypothesis of interest will be H 0 : λ 1 = λ 2 ∈ Λ 0 ⊂ (0, ∞). That is, we test a composite null hypothesis, to expressly allow for design in scenarios (see Example 1 below) in which a range of possible values of the null mean response are anticipated. Moreover, we power our experiment for a scenario in which λ 1 = λ 2 + δ ∈ Λ 1 ⊂ (0, ∞) for δ ∈ ℝ + , as we are principally motivated by a clinical scenario in which we hope the novel treatment (arm k = 2) will reduce the mean response. However, note that a design for δ ∈ ℝ − could be identified similarly.
We allow early stopping both to reject and to not reject H 0 , with our stopping rules dependent on vectors of boundaries that we denote by a = (a 1 , …, a K ) and r = (r 1 , …, r K ). For now, we assume that a, r ∈ ℝ K . More precisely, we will specify test statistics T k , which will be used with the following stopping rules

•
For k ∈ {1, … , K−1} -If T k ≥ r k stop the experiment, rejecting H 0 ; -If T k < a k stop the experiment, not rejecting H 0 ; -If T k ∈ [a k , r k ) continue the experiment to stage k + 1.
• For k = K -If T k ≥ r k stop the experiment, rejecting H 0 ; -If T k < a k stop the experiment, not rejecting H 0 .
To ensure that our stopping rules make sense (i.e., that we do not simultaneously recommend to both reject and not reject H 0 ), and to guarantee that a conclusion is drawn on H 0 , we enforce that a k < r k for k ∈ {1, …, K−1} and set a K = r K . We will also make use of the notation a k = (a 1 , …, a k ), and similarly for r.
With the above, we can compute the maximal type-I error-rate of our test as α′ n, a, r = max λ ∈ Λ 0 ∑ k = 1 K R k λ, λ n, a k , r k Furthermore, the maximal type-II error-rate is β′ n, a, r = max λ ∈ Λ 1 ∑ k = 1 K A k λ, λ − δ n, a k , r k Consequently, our goal in what follows will be to choose values for n, a, and r such that α′ (n, a, r) ≤ α, and β′(n, a, r) ≤ β, for specified α, β ∈ (0, 1).

Design based on a normal approximation
First, we describe how we can compute group sequential designs based on the asymptotic distribution of Wald-type test statistics. To this end, denote the maximum likelihood estimate of λ j , at analysis k, by λ jk . We have Next, note that the expected Fisher information of the parameter λ j at analysis k is I jk = kn/ λ j . Then, key to design determination is the notion of the information level at analysis k, which serves a measure of the knowledge available about the difference of the means of the two treatment arms. We denote this information by ℐ k with it given by Unfortunately, the information level ℐ k depends on the unknown means, λ 1 and λ 2 . Therefore, to test H 0 using asymptotic normality we replace ℐ k by a consistent estimator, ℐ k , obtaining the Wald-type test statistic for analysis k as Now, because ℐ k is estimated using a consistent estimator, Slutsky's theorem tells us that T Wk is still asymptotically normally distributed. Moreover, by standard results in group sequential design theory, the vector of Wald-type test statistics T W = (T W1 , …, T WK ) follows what has been referred to as the 'canonical joint distribution' (Scharfstein et al. 1997;Jennison and Turnbull 2000). That is, T W is asymptotically multivariate normal with mean vector λ 1 − λ 2 ℐ K 1/2 = λ 1 − λ 2 ( ℐ 1 1/2 , …, ℐ K 1/2 ) ⊤ and K × K covariance matrix Σ = {Σ k 1 k 2 } with Σ k 1 k 2 = Σ k 2 k 1 = ℐ k 1 / ℐ k 2 = k 1 /k 2 , for k 2 ∈ {1, …, K} and k 1 ∈ {1, …, Accordingly, we can compute the A k (·) and R k (·) via A k λ 1 , λ 2 n, a k , r k = ∫ −∞ a 1 ϕ x, λ 1 − λ 2 ℐ 1 1/2 , 1 dx : k = 1, ∫ a 1 r 1 … ∫ −∞ a k ϕ x, λ 1 − λ 2 ℐ k 1/2 , Σ k dx k …dx 1 : k ∈ 2, …, K and R k λ 1 , λ 2 n, a k , r k = ∫ r 1 ∞ ϕ x, λ 1 − λ 2 ℐ 1 1/2 , 1 dx : k = 1, ∫ a 1 r 1 … ∫ r k ∞ ϕ x, λ 1 − λ 2 ℐ k 1/2 , Σ k dx k …dx 1 : k ∈ 2, …, K respectively. Here, ϕ(x, μ, Ω) is the probability density function of a multivariate normal distribution with mean vector μ and covariance matrix Ω, and Σ k signifies the restriction of Σ to its first k rows and columns.

Design based on exact calculations
Given that the methodology described in Section 2.2 relies upon asymptotic theory, the operating characteristics computed using multivariate normal distribution functions could be a poor approximation to their empirical values. Therefore, to permit strict error control we now extend the results of Menon et al. (2011) to group sequential designs. To achieve this, note that the sum of the outcomes in arm j in stage k, Y jk say, has the following distribution based on the familiar result that the sum of independent Poisson random variables is itself Poisson Then, the difference in the sum of the outcomes on each arm, T k = Y 1k − Y 2k , as a difference between two independent Poisson random variables, has a Skellam distribution (Skellam 1946). We signify this by T k ∼ Skellam nλ 1 , nλ 2 , and denote the probability mass and cumulative distribution functions of T k on its support ℤ as follows g t n, λ 1 , λ 2 = ℙ T k = t n, λ 1 , λ 2 = e −n λ 1 + λ 2 λ 1 λ 2 t 2 I t 2n λ 1 λ 2 , G t n, λ 1 , λ 2 = ℙ T k ≤ t n, λ 1 , λ 2 = ∑ s ∈ ℤ : s ≤ t g s n, λ 1 , λ 2 where I ν (·) is the modified Bessel function of the first kind, and we make use of the particular representation presented in Menon et al. (2011).
Our test statistic at analysis k for the exact design is then T Sk = T 1 + ⋯ + T k . Now, within the context of exact group sequential designs for Bernoulli distributed outcome variables, it is well understood that the interim test statistics do not in general have a simple distribution (Jung 2012). Similarly, it is important to note that T Sk ≁ Skellam(knλ 1 , knλ 2 ). What is more, unlike the case for Bernoulli outcomes, we are in fact unable to compute the probability mass function of T Sk across its entire support, ℤ. This is a consequence specifically of the fact that the support of each T k is infinite. However, we can compute a part of the probability mass function, which we denote by h k (t|n, λ 1 , λ 2 , a k , r k ) = ℙ(T Sk = t|n, λ 1 , λ 2 , a k , r k ). Explicitly ℎ k t n, λ 1 , λ 2, a k, r k = g t n, λ 1 , λ 2 : k = 1, ∑ s = a k − 1 r k − 1 − 1 ℎ k − 1 s n, λ 1 , λ 2 , a k − 1, r k − 1 g t − s n, λ 1 , λ 2 : k ∈ 2, …, K − 1 , t ∈ a k , r k That is, we are able to evaluate the probability mass function for T Sk ∈ [a k , r k ). This result is in essence an extension of that of Schultz et al. (1973) for single-arm experiments with Bernoulli outcome variables.

Boundary determination
A variety of methods have been presented in the literature through which stopping boundaries and sample sizes for group sequential designs can be determined. For the method of Section 2.2, the normal approximation approach, many of these could be applied in our considered Poisson-outcome experimental design scenario. This includes comparatively simpler procedures such as that of Pampallona and Tsiatis (1994), through to more complex methods that allow design optimization . For our framework from Section 2.3, however, there are fewer methods available that could be directly applied to determine suitable designs. The primary reason for this is that in this setting the stopping boundaries are best treated as discrete parameters (i.e., it is best to apply the restriction a, r ∈ ℤ K ), otherwise a redundancy will exist in the design space (e.g., two designs that are otherwise equal apart from their values of a 1 being 2.1 and 2.2 respectively would have the same operating characteristics).
This makes a brute-force approach that searches across possible designs tempting, similar to that which has been applied to both non-randomized and randomized trials with binary outcome variables. Unfortunately, such a solution will for many K be computationally expensive given that the design space is 2 K-dimensional. Consequently, here, we propose an approach to determining 'near-optimal' values for these discrete bounds based on the error spending approach to group sequential design (Lan and DeMets 1983). We also utilize this method for determining designs based on the normal approximation procedure, in order to allow for a fairer comparison to the performance of the exact procedure, and as in our experience it offers an advantageous tradeoff between computational run-time and design efficiency.
However, for the exact approach with K = 2, we do also contrast the performance of these near-optimal designs to 'optimal' designs identified via an extensive search over possible values for n, a and r. Through this, we aim to demonstrate that a brute-force style approach may often provide little advantage over the comparatively easy-to-identify near-optimal designs. For this search, note that the infinite support of the T Sk implies that an exhaustive assessment of all possible boundary values is impossible. Accordingly, for any n, a method is required for limiting the considered a and r in a logical manner. Note that these restrictions should take in to account the chosen values for Λ 0 , Λ 1 , and δ. Here, for any n our approach is to identify the solutions of the following equations a * = argmax a ∈ ℤ max λ ∈ Λ 0 A 1 λ, λ n, a, r ≤ ϵ , r * = argmin r ∈ ℤ max λ ∈ Λ 1 R 1 λ, λ − δ n, a, r ≤ ϵ That is, we choose r * as the minimal integer that ensures the probability of stopping the trial after stage one to reject H 0 , under λ 1 = λ 2 + δ ∈ Λ 1 , is at most ϵ, and similarly for a * . We then search over boundaries such that a 1 ∈ [a * , r * −2], r 1 ∈ [a 1 + 2, r * ], and a 2 ∈ [a 1 + a * , r 1 + r * ]. This ensures that having continued to stage two, the conditional probability of rejecting H 0 is at most ϵ when λ 1 = λ 2 + δ ∈ Λ 1 , with a similar statement holding true for the probability of not rejecting H 0 at the end of stage two when λ 1 = λ 2 ∈ Λ 0 . Thus, we allow values for r that provide conditional probabilities of stopping after each stage to reject H 0 of at least ϵ under some scenario for which we wish to power the trial, and similarly for the values of a. For, if ϵ ≪ 1 it is reasonable to expect that little could be gained from designs that are not captured as part of this search procedure. Accordingly, in our search we set ϵ = 10 −7 . Then, we perform our evaluations over group-sizes n such that n ∈ {1, …, 1.5n fixed }, where n fixed is the group-size required by a corresponding fixed-sample (K = 1) design. Note that the maximal type-I and type-II error-rates for any considered design, across the sets Λ 0 and Λ 1 , are identified using Brent's algorithm (Brent 1973).
In contrast, using the error-spending approach, we specify error spending vectors π A = (π A1 , …, π AK ) and π R = (π R1 , …, π RK ), with ∑ k = 1 K π Rk = α, ∑ k = 1 K π Ak = β, π Rk, π Ak ≥ 0, k ∈ 1, …, K that then imply particular stopping boundaries, and a particular required group size. First, we describe how the boundaries are chosen for a fixed n ∈ ℕ + , as well as fixed π A and π R conforming to the requirements above. We then describe how n can subsequently be chosen for these π A and π R , before describing how the spending vectors themselves can be specified.
Put simply, for a given n, π A and π R , we recursively identify the boundaries, beginning with r 1 . The precise nature of the calculations differ for the exact and normal approximation based approaches. We proceed by describing the differences before expanding on the reasons that they are present.
First, for the exact procedure, for k ∈ {1, …, K−1}, we iterate between finding r k and a k as the solutions to argmin r k ∈ ℤ max λ ∈ Λ 0 R k λ, λ n, a k − 1 , r k ≤ π Rk and argmax a k ∈ ℤ max λ ∈ Λ 1 A k λ, λ − δ n, a k , r k − 1 ≤ π Ak respectively, where we may arbitrary take a 0 = r 0 = 0 since these are not used in our evaluations of A 1 (·) and R 1 (·).
Whilst, for the normal approximation approach, for k ∈ {1, …, K−1}, we iterate between finding r k and a k as the solutions to argmin r k ∈ ℝ R k λ, λ n, a k − 1 , r k ≤ π Rk for some arbitrarily chosen λ ∈ Λ 0 , and argmax a k ∈ ℝ A k sup Λ 1 , sup Λ 1 − δ n, a k , r k − 1 ≤ π Ak In either case, we then utilize the formula to specify r K , but not a K , setting it instead as a K = r K . This ensures that we control the theoretical type-I error-rate to the desired level α, and as discussed earlier it guarantees that a decision is made on H 0 during the course of the experiment. Now, we note the reasons for the differences in the approaches utilized for the two methods. They arise because of our desire to control the type-I and type-II error-rates over the sets Λ 0 and Λ 1 respectively. Specifically, for the normal approximation approach, note that the distribution of the T W when λ 1 = λ 2 ∈ Λ 0 does not depend on the specific shared value of λ 1 and λ 2 . Thus, the type-I error-rate can be found for an arbitrarily chosen value of these parameters, and in turn we need only base our values r k on controlling the R k (·) to below π Rk for this arbitrary value. Similarly, the theoretical type-II error-rate for the normal approximation approach when λ 1 = λ 2 + δ ∈ Λ 1 is maximized when λ 1 = sup(Λ 1 ), since this provides the minimal possible information over the set Λ 1 . Thus, if we wish to control our type-II error-rate to at most β over all possible scenarios λ 1 = λ 2 + δ ∈ Λ 1 , we can simply choose our a k to constrain the A k (·) to at most π Ak when λ 1 = λ 2 + δ = sup(Λ 1 ).
In contrast, for the exact test the maximal type-I and type-II error-rates are not easy to identify. For, we may hope that an analytical formulae for the location of the maximal errorrates could be derived (e.g., by proving monotonicity of rejection probabilities across the sets Λ 0 and Λ 1 ). However, as we discuss further in the Supplementary Material, we believe it is unlikely that this could be achieved, given that such a result was only recently derived for the comparatively simple case of a single-arm trial with Bernoulli outcomes (Shan et al. 2017), and at least for the type-I error-rate our explorations suggest there is no simple pattern to the location of the maxima. For this reason our determination of the a k and r k retains a one-dimensional numerical search for the maximal values of A k (·) and R k (·) respectively. With this, however, we are guaranteed to control the error-rates to the desired level, whilst the empirical error-rates of normal approximation designs may be above their nominal levels.
The above completes our computation of the stopping boundaries for fixed n, π A and π R . For any such error spending vectors, the value of n that provides the desired power can then be determined as argmin n ∈ ℕ + β′(n, a, r) ≤ β where the a and r here are those specifically derived for the particular n under assessment to control the type-I error-rate, using the methods above.
Thus, by the above we are able to determine the n, a, and r that correspond to any choice of π A and π R . Near-optimal designs are then identified by searching over possible choices for these two error spending vectors: by searching over a large number of such vectors, an approximately exhaustive search over possible designs can be executed.
All that is additionally required to choose a design is an optimality criteria to choose amongst potential designs. Here, we use the following, which has been considered extensively in the past for similar group sequential design settings (Mander et al. 2012;Wason et al. 2012) w 1 ESS λ ESS , λ ESS n, a, r + w 2 ESS λ ESS , λ ESS − δ n, a, r + w 3 2Kn where ESS(·) is a function that computes the ESS, given by ESS λ 1 , λ 2 n, a, r = 2n ∑ k = 1 K k A k λ 1, λ 2 n, a k , r k + R k λ 1 , λ 2 n, a k , r k and λ ESS is a specified mean response in arm 1. Furthermore, the w l ∈ [0, ∞), l ∈ {1, 2, 3}, are then weights given to the different components of the optimality function. Note we should generally ensure that w 1 + w 2 ∈ (0, ∞) as multiple designs will often have the same minimal maximal sample size, 2Kn. For brevity in what follows, we set w = (w 1 , w 2 , w 3 ).

Example group-sequential designs
To demonstrate the potential efficiency gains from utilizing a group-sequential design, and to compare the exact and normal approximation approaches, we consider two example trial design scenarios. The results for Example 1 are presented below, whilst those for Example 2 are included in the Supplementary Material.
We motivate the design parameters for Example 1 by considering a hypothetical obstructive sleep apnea-hypopnea (OSAH) trial. Specifically, continuous positive airway pressure is typically the first line treatment for severe sufferers of OSAH (National Institute for Health and Care Excellence 2008). However, its benefits in milder disease are less certain (Patel et al. 2003;Weaver et al. 2012), and intolerance of continuous positive airway pressure is also common (Weaver and Grunstein 2008). Therefore, we consider the design of a trial that aims to examine the efficacy of an alternative treatment option for OSAH in more moderate disease cases, such as for example the oral mandibular advancement devices that were examined by Quinnell et al. (2014) amongst others. Thus arm 1 will correspond to no treatment, and arm 2 to the new treatment option of interest.
We assume that, as is common in OSAH studies, the apnea-hypopnea index (AHI, the combined average number of apneas and hypopneas that occur per hour of sleep) will be the primary endpoint of interest. To account for variability in the mean AHI of enrolled patients on the control arm, we specify Λ 0 = Λ 1 = [15,30] (corresponding to the established definition of moderate OSAH disease). Finally, we assume that it is desired to control the type-I error-rate to α = 0.05 and have power of 1−β = 0.8 when δ = 2.25 (corresponding to at least a 15% reduction in AHI for moderate disease sufferers).
We observe that, as would be expected, utilizing a group sequential approach reduces the ESS when λ 1 = λ 2 = λ ESS and when λ 1 = λ 2 + δ = λ ESS relative to using a single-stage design. In particular, the ESS when λ 1 = λ 2 = λ ESS can be reduced by as much as 41% and 44% when using the normal approximation or exact approaches respectively (for K = 3, compared to their respective required sample sizes when K = 1).
Moreover, as is typical for group sequential designs, increasing the value of K allows us to increase efficiency further in terms of the ESS, but this comes at a cost to the maximal possible sample sizes. Interestingly though, using either design approach, there do exist designs that require only very minor increases to the maximal possible sample size that bring sizeable reductions to the considered ESSs. In particular, the optimal designs for w 6 , which place a non-zero weight on each of the three components of the optimality criteria, appear to perform particularly well in comparison to a single-stage approach under both λ 1 = λ 2 = λ ESS and λ 1 = λ 2 + δ = λ ESS , without requiring a substantial increase to the maximal possible sample size.
Finally, observe that the optimal design search is able to identify for each considered value of w a design that out-performs the corresponding near-optimal design. However, the difference between the operating characteristics of these designs is typically small. For example, for w 1 , the optimal design has ESS(λ ESS , λ ESS ) = 92.8, whilst the near-optimal design has ESS(λ ESS , λ ESS ) = 94.6, an increase of only 2%.

Empirical performance of normal approximation designs
Here, we expand on the potential problems associated with use of the normal approximation approach, utilizing simulation to assess the difference between the theoretical (estimated) error-rates of normal approximation designs in comparison to their empirical values. Specifically, to examine in detail the potential differences between the estimated and empirical values, we identified the normal-approximation error-spending designs for Λ 0 = Λ 1 = λ 0 ∈ {1, 1.5, …, 9.5, 10}, when δ ∈ {0.25λ, 0.275λ, …, 0.725λ, 0.75λ}, π A = (0.1, 0.1), and π R = (0.025, 0.025). For each of these 399 designs, 100,000 simulations were then used to evaluate the empirical type-I and type-II error-rates for comparison to their values estimated via the formulae of Section 2.2.
The difference between the estimated and empirical type-I and type-II error-rates of the designs are presented in Figures 1 and 2. In general, the differences are small on the depicted raw scale, with the maximal absolute difference in type-I and type-II error-rates being 0.0029 and 0.0319 to 4 dp respectively. However, when considered as a percentage difference from the estimated value, the maximal differences are 5.8% and 24.4%. These figures, along with other considerations, impact the applicability of the normal approximation approach.

Discussion
Here, we have described how we can use well established methodology to design group sequential experiments with Poisson distributed outcomes. Furthermore, in order to provide a method that is not dependent on asymptotic theory, we also utilized the Skellam distribution to determine operating characteristics exactly. This exact design required, in particular, careful consideration of how to evaluate the probability of stopping at each interim analysis.
Furthermore, to permit efficient group sequential designs to be identified for any value of K, we presented a method for design determination based on the error spending approach to group sequential design. Specifically, we searched over possible error spending vectors to find which amongst these minimized a particular optimality function. Whilst this is unlikely to identify the best possible design, we believe it is an effective means of finding efficient designs, particularly if parallelization is used to search over a large number of possible spending vectors. Indeed, from Table 1 we observed that at least for Example 1, the nearoptimal designs for K = 2 where only slightly less efficient than the optimal designs identified for an extensive search over possible values of n, a, and r.
Overall, for our considered example, we demonstrated that using a group sequential design had the potential to improve efficiency substantially; with the ESS for λ 1 = λ 2 = λ ESS in Example 1 reducible by as much as 44%. This unfortunately came at a cost of an increase to the maximal possible sample size of 23%. However, we did identify several designs that required only minimal increases to the maximal possible sample size, whilst still reducing the ESS when λ 1 = λ 2 = λ ESS and when λ 1 = λ 2 + δ = λ ESS notably. Statistically speaking, such designs have almost no disadvantages. Furthermore, as is typical with group sequential designs, the identified efficiency gains increased with the value of K. Note that we presented results here for K ∈ {2, 3}, as our investigations have suggested the value of setting K > 3, in terms of reducing the ESS, is often small. This should not be interpreted as design determination for K > 3 being intractable.
Note that one limitation of our sequential methodology, as is typical for adaptive designs, is that it is most effective in practice when outcome accruement following inclusion in the experiment is fast relative to the entire length of the study (i.e., the enrollment rate), as we make the underlying assumption that outcome accruement completes between interim analyses, removing potential issues caused by delayed outcomes. That is, the theoretical efficiency gains are most likely to be realized in reality in this case. This is not to say, however, that a group-sequential approach cannot be useful when the study duration is short, as accruement could be paused in each stage when the required sample size has been achieved. This though would likely come at a cost to the overall time taken to complete the experiment. Thus, the utility of a group-sequential approach may often depend on the willingness to trade an increased study length for a smaller required sample size.
Moreover, we here assume that all observations are accrued following a fixed and common period of observation. This may make application in fields such as manufacturing easier than many clinical research settings where observation time may be more difficult to standardize in this manner. For scenarios with highly variable observation times, alternative methodology for group sequential design (e.g., Xia and Hoover 2007) would likely be more appropriate. At the design stage of an experiment, however, provided only small variability in the observation times is anticipated, our methods could provide a simple means of determining the approximately required sample size.
We specified our null hypothesis in a composite form, and also powered the trial across a range of possible mean event rates. We made these choices as in many settings, in particular in clinical research, it would likely difficult to nominate single points at which to control the error-rates. This came at a cost, however, in that a method to control the maximal error-rates across the sets Λ 0 and Λ 1 was required. Whilst this was readily achieved, at least theoretically, for the normal approximation design, this was not the case for the exact approach. We thus retained one-dimensional searches for maximal stopping probabilities in our specification of the a k and r k . It is important to acknowledge that this approach may in general lead to conservative designs as, for example, the value of λ ∈ Λ 0 that maximizes R k (·) may not be equal across k ∈ {1, …, K}. In practice, this appears to not be the case, as we were able to identify highly efficient designs with maximal error-rates close to their desired level. Nonetheless, one possible solution to reducing this conservatism could be to conduct a test conditional on the number of observed events, similar to the approaches of Xia and Hoover (2007) and Grayling et al. (2017). This remains as a possible avenue for future research.
We finish with a discussion of an important point: the inherent pros and cons of the two described design determination procedures. The exact procedure is, as discussed, preferable precisely because it is exact. That is, it guarantees control of the type-I and type-II errorrates. In contrast, the normal approximation approach may not provide such error-rate control in practice, and verifying that it does via simulation may be particularly time consuming. However, our presented simulation study does suggest that in many settings the error-rates of normal approximation designs may often be close to their nominal levels. Furthermore, from Table 1 we observe that in certain cases, for a given w, the normal approximation designs have lower ESSs than the corresponding exact designs. This is a consequence of the fact that a, r ∈ ℝ K , rather than ℤ K for the normal approximation design, and so can be more precisely tailored to each possible value of n. In addition, provided additional simulations are not required to confirm the theoretical operating characteristics, the optimal normal approximation designs are typically faster to identify than their exact counterparts. This may lead us to conclude that the normal approximation approach may be preferred when only approximate error-rate control is desired, or when a search for an exact design would be time consuming. In contrast, if one wishes to be certain that the error-rates are controlled, or wishes to know for sure that the correct optimal or near-optimal design has been chosen, the exact approach may be preferred. Finally, in particular, whilst there does not appear to be a simple rule through which we can specify the estimated error-rates of a normal approximation design will begin to deviate from their empirical values, similar to design for Bernoulli outcome data, utilization of an exact approach would be most expedient when the requisite sample size is small (i.e., when the anticipated mean response is small and/or δ is large). Heeding this advice, group sequential tests for Poisson distributed outcome variables can then be determined which substantially reduce the requisite sample size compared to single-stage approaches. The difference between the theoretical and empirical type-I (λ 1 = λ 2 ) error-rate of exact designs is shown.  The difference between the theoretical and empirical type-II (λ 1 = λ 2 + δ) error-rate of exact designs is shown. The optimal two and three-stage designs for Example 1 are shown, based on the exact (optimal and near-optimal) and normal approximation approaches. For comparison, the single-stage designs are also given. Note that for brevity we write λ ESS -δ). The type-I error-rates and power figures are given to 3 dp, whilst ESSs are given to 1 dp.