Binomial confidence intervals for rare events: importance of defining margin of error relative to magnitude of proportion

Confidence interval performance is typically assessed in terms of two criteria: coverage probability and interval width (or margin of error). In this paper, we assess the performance of four common proportion interval estimators: the Wald, Clopper-Pearson (exact), Wilson and Agresti-Coull, in the context of rare-event probabilities. We define the interval precision in terms of a relative margin of error which ensures consistency with the magnitude of the proportion. Thus, confidence interval estimators are assessed in terms of achieving a desired coverage probability whilst simultaneously satisfying the specified relative margin of error. We illustrate the importance of considering both coverage probability and relative margin of error when estimating rare-event proportions, and show that within this framework, all four interval estimators perform somewhat similarly for a given sample size and confidence level. We identify relative margin of error values that result in satisfactory coverage whilst being conservative in terms of sample size requirements, and hence suggest a range of values that can be adopted in practice. The proposed relative margin of error scheme is evaluated analytically, by simulation, and by application to a number of recent studies from the literature.


Introduction
A fundamental problem in applied statistics is the construction of a confidence interval (CI) for a binomial proportion, p. In many applications, one deals with a large population within which an event of interest is rare. For example, in clinical statistics, p could represent the proportion of patients exhibiting treatment side effects; such a scenario arose in the context of COVID-19 vaccination (Polack et al., 2020). In manufacturing, the number of defective components is often very small relative to the large number of components produced. Indeed, many manufacturers now achieve a defect rate of 3.4 in one million (Evans and Lindsay, 2015; Woodall and Montgomery, 2014). In the aviation industry, strict regulations ensure that safety incidents are a rare occurrence; for example, Boeing (2022) shows that very few incidents occur within a large sample of flights. (Note: we revisit the COVID-19 and aviation examples as case studies in Section 8, along with an ADHD medication example.) Ascertaining the order of magnitude of p is important in "large populations" such as the aforementioned. Indeed, with population sizes in the millions (or billions), there is a big practical difference between p = 10⁻⁴ and p = 10⁻⁶ (but such differences are much less important/detectable in smaller populations). For example, in high-throughput manufacturing, a difference in the order of magnitude in the failure rate has significant implications for the number of defects and/or product returns. From a purely pragmatic perspective, note that ten thousand observations are needed to obtain, on average, one event when p = 10⁻⁴. However, while it is expected that relatively large samples will be required to adequately estimate the order of magnitude of a small proportion, p, practitioners will need more specific guidance on the sample size requirements; this is not well covered by the existing literature.
The problem of constructing a CI for p has a wide literature, including several comparative studies, for example, Gonçalves et al. (2012), Leemis and Trivedi (1996), Newcombe (1998) and Pires and Amado (2008). These works assess various proportion estimators, for example, Pires and Amado (2008) compare twenty different methods. However, these works, and the literature in general, focus primarily on situations where p is moderately large. As such, there is much less guidance in the existing literature regarding the scenario where p is small. Furthermore, there is little discussion of relative margin of error, which is needed in this small p setting. Whereas relative margin of error is not a prominent feature of CI assessment for moderately large proportions, it is essential that the margin of error scales with the magnitude of p for rare events. Therefore, we consider a valid CI estimator as one that achieves a desired coverage probability whilst also maintaining a specified relative margin of error, and, in contrast to much of the existing literature, we focus on the small p regime of p ∈ [10⁻⁶, 10⁻¹], where relative margin of error is especially important.
For our analysis, we consider the most widely used binomial confidence interval, the Wald interval, along with three other common intervals: Clopper-Pearson (exact), Wilson (score) and Agresti-Coull (adjusted Wald) (Agresti and Coull, 1998; Clopper and Pearson, 1934; Wilson, 1927). Despite its widespread use, the Wald interval is known to produce inadequate coverage when p is near 0 or 1, and/or the sample size, n, is small. It has also been well documented that this interval suffers from erratic coverage, even when p is moderate (Agresti and Coull, 1998; Blyth and Still, 1983; Böhning, 1994; Vollset, 1993). Brown et al. (2001) show that this coverage fluctuation occurs for large n and recommend against using the Wald interval in practice. Newcombe (1998) also discourages the use of the Wald interval and suggests that its use be restricted to sample size planning. In recent work, Andersson (2023) discusses the deficiencies of the Wald interval and examines its coverage and noncoverage performance relative to the Wilson interval. Whilst the criticisms of the Wald interval can be justified, particularly when n is small, it is worth noting that the issue of erratic coverage is not unique to the Wald interval; this behaviour is related to the binomial distribution and we illustrate (in Section 5) that it occurs for all four interval estimators.
A common approach in determining sample sizes is to set the (Wald) CI margin of error equal to a specified value, ϵ, and then solve for n. In order to maintain consistency between ϵ and p, we consider the relative margin of error, ϵ_R = ϵ/p, and obtain sample sizes by setting ϵ_R to a specified value and solving each interval equation for n. Lwanga and Lemeshow (1991) provide (Wald) sample size calculations for fixed and relative margins of error in the range [0.01, 0.5] for p ∈ [0.05, 0.95]. However, in our work, we focus on the small-p regime of p ∈ [10⁻⁶, 10⁻¹] and provide computed coverage probabilities relating to ϵ_R ∈ [0.05, 0.75]. In this regime, it is important to consider relative precision over fixed precision (fixed ϵ value). For example, ϵ = 0.1 might be considered as reasonable precision for p = 0.4, but could equally be considered reasonable for p = 0.2. However, where ϵ = 0.05 could be considered as a valid margin of error for p = 10⁻¹, it is far too large for a success probability of the order p = 10⁻³. Ultimately, we find that ϵ_R ∈ [0.1, 0.5] yields a good compromise between estimation precision, coverage performance, and sample size requirements.
In this work, we illustrate the importance of using relative margin of error in a small p regime, recommend practical tolerances for both relative margin of error and coverage probability, and provide comparisons in terms of sample size requirements. We show that when CI performance is assessed in terms of both coverage probability and relative margin of error, the four CI estimators perform similarly in many cases. Although the differences between the estimators are less pronounced when considering the relative margin of error, we show that the Wilson (score) interval provides the best overall performance. In addition, we provide practical guidance on the sample sizes required to attain reasonable CI performance for various (small) values of p. We anticipate that this guidance will be useful to researchers working in real-world small-p applications.
The remainder of this article is organised as follows. In Section 2, we review some of the proportion estimators proposed in the literature, focusing, in particular, on the Wald, Clopper-Pearson, Wilson and Agresti-Coull intervals. Section 3 provides details of the CI evaluation criteria used in the work. Section 4 briefly discusses the initial estimation of p for sample size planning, then moves on to using relative margin of error in such planning, and CI performance evaluation. Sections 5 and 6 illustrate the importance of employing a relative margin of error scheme in CI assessment. Section 7 presents the challenge of estimating a rare-event proportion using a small sample size. In Section 8, we present a number of case studies to demonstrate the relative margin of error schemes in assessing the validity of estimated intervals. Finally, the article concludes in Section 9 with a discussion.

Binomial Proportion Interval Estimators
Several methods have been devised to estimate a binomial proportion, p, including the Wald, Clopper-Pearson, Wilson, Agresti-Coull, Jeffreys, arcsine transformation, Jeffreys' Prior and the likelihood ratio interval. A range of ensemble approaches have also been considered, for example, Kabaila et al. (2016), Park and Leemis (2019) and Turek and Fletcher (2012). In this work we assess the performance of the Wald, Clopper-Pearson, Wilson and Agresti-Coull intervals in standard form, i.e., without modification or application of continuity correction.

Details of the Intervals Considered in our Work
The Wald interval is included in this study as it is the most widely known and used estimator. We assess the Clopper and Pearson (1934) interval as it is an exact method, i.e., unlike the other estimators used in this work, it is based directly on the cumulative probabilities of the binomial distribution. For this reason it is often regarded as the "gold standard" in binomial proportion estimation (Agresti and Coull, 1998; Gonçalves et al., 2012; Newcombe, 1998). Where the Wald interval is known to produce inadequate coverage for small n or p, the Clopper-Pearson interval is generally regarded as being overly conservative, unless n is quite large (Agresti and Coull, 1998; Brown et al., 2001; Newcombe, 1998; Thulin, 2014). The intervals proposed by Wilson (1927) and Agresti and Coull (1998) both offer a compromise between the (liberal) Wald interval, and the (conservative) Clopper-Pearson interval.
We assess the performance of the Wilson and Agresti-Coull intervals in this work given their popularity within the literature. For example, Brown et al. (2001) recommend the Wilson or Jeffreys interval for small n. For larger n, they recommend the Wilson, Jeffreys or Agresti-Coull intervals, preferring the Agresti-Coull method for its simpler presentation. Agresti and Coull (1998) recommend the Wilson interval, and add that the 95% Wilson interval has similar performance to their method. Newcombe (1998) remarks on the mid-p method (an exact method closely related to the Clopper-Pearson method), and the Wilson method, noting the Wilson method's advantage of having a simple closed form. Vollset (1993) too recommends the mid-p and Wilson (uncorrected and continuity corrected) intervals, along with the Clopper-Pearson interval, stating that those four intervals can be safely used at all times. Pires and Amado (2008) recommend the continuity corrected arcsine transformation or the Agresti-Coull method. Krishnamoorthy and Peng (2007) show that when controlling for Type I and Type II error rates in one-sided hypotheses, the Wilson (score) and exact (Clopper-Pearson) tests require the same sample size. However, for two-sided hypothesis applications, and for constructing confidence intervals, they recommend the Wilson method.
Listed below are the Wald (W), Clopper-Pearson (CP), Wilson (WS) and Agresti-Coull (AC) interval formulas, where z_{α/2} denotes the 1 − α/2 quantile of the standard normal distribution and x is the number of successes in a sample of size n.
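For reference, the standard forms of these four intervals, as commonly given in the literature, can be written as follows. The symbols p̂ = x/n, ñ and p̃, and the beta-quantile form of the Clopper-Pearson limits (with B(q; a, b) denoting the q quantile of a Beta(a, b) distribution), are notation assumed here rather than taken from the authors' display.

```latex
\begin{align*}
\text{W:}  \quad & \hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},
  \qquad \hat{p} = \frac{x}{n}, \\[4pt]
\text{CP:} \quad & \bigl[\, B(\alpha/2;\; x,\; n-x+1),\;\; B(1-\alpha/2;\; x+1,\; n-x) \,\bigr],
  \quad \text{lower limit } 0 \text{ if } x = 0,\ \text{upper limit } 1 \text{ if } x = n, \\[4pt]
\text{WS:} \quad & \frac{\hat{p} + \dfrac{z_{\alpha/2}^{2}}{2n}}{1 + \dfrac{z_{\alpha/2}^{2}}{n}}
  \;\pm\; \frac{z_{\alpha/2}}{1 + \dfrac{z_{\alpha/2}^{2}}{n}}
  \sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z_{\alpha/2}^{2}}{4n^{2}}}, \\[4pt]
\text{AC:} \quad & \tilde{p} \pm z_{\alpha/2}\sqrt{\frac{\tilde{p}(1-\tilde{p})}{\tilde{n}}},
  \qquad \tilde{n} = n + z_{\alpha/2}^{2}, \quad \tilde{p} = \frac{x + z_{\alpha/2}^{2}/2}{\tilde{n}}.
\end{align*}
```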

Estimating Proportions when No Events are Observed
To provide an initial flavour of the challenging nature of estimating rare-event probabilities, we first consider the situation where no events are observed in the sample. Letting x denote the number of observed events, for very small p, it will be quite likely in small samples that x = 0, and, hence, p̂ = 0/n = 0. For example, consider an event that occurs with probability p = 10⁻³, and consider the case where the sample size is n ≤ 100 such that Pr(X = 0) > 0.9, i.e., it is very likely that no events will be observed in this sample.
In this scenario where no events are observed, the resulting intervals are very conservative for small n. For example, with a sample of size n = 100, the following intervals are obtained: W: [0, 0], CP: [0, 0.0362], WS: [0, 0.0370], AC: [−0.0074, 0.0444]. Clearly the Wald interval is degenerate, and the remaining intervals are far too wide to be useful in small p settings where a good estimate of magnitude is required; the issue is more acute for smaller proportions, e.g., the probability of observing no events is 0.99 if p = 10⁻⁴ and 0.999 if p = 10⁻⁵. As it is clear that a sample size of n = 100 will not suffice, further guidance is required in these scenarios, which we provide in the sequel. Of course, the required precision is analysis specific: the above intervals would be perfectly adequate if one was only interested in assessing if p < 0.1, for example. However, for situations where it is important to determine the order of magnitude, such as assessing the failure rate in high-volume manufacturing, sufficiently large samples will be required to gain a reasonable estimate of the order of magnitude of p.
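The zero-event intervals quoted above can be reproduced with a short script. This is a minimal sketch using only the Python standard library; the function names are illustrative, and the Clopper-Pearson limits are found by bisection on the binomial CDF (an implementation choice made here, suitable for modest n only).

```python
from math import comb, sqrt
from statistics import NormalDist


def z_value(alpha):
    """Return the 1 - alpha/2 quantile of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def wald(x, n, alpha=0.05):
    z, p_hat = z_value(alpha), x / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def wilson(x, n, alpha=0.05):
    z, p_hat = z_value(alpha), x / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z / denom * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return centre - half, centre + half


def agresti_coull(x, n, alpha=0.05):
    z = z_value(alpha)
    n_tilde = n + z**2
    p_tilde = (x + z**2 / 2) / n_tilde
    half = z * sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    return p_tilde - half, p_tilde + half


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); direct summation, modest n only."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def solve_decreasing(f, target, iters=100):
    """Bisection on [0, 1] for f(p) = target, where f is decreasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x, n, alpha=0.05):
    # Lower limit solves P(X >= x | p) = alpha/2; upper solves P(X <= x | p) = alpha/2.
    lower = 0.0 if x == 0 else solve_decreasing(
        lambda p: binom_cdf(x - 1, n, p), 1 - alpha / 2)
    upper = 1.0 if x == n else solve_decreasing(
        lambda p: binom_cdf(x, n, p), alpha / 2)
    return lower, upper


for name, method in [("W", wald), ("CP", clopper_pearson),
                     ("WS", wilson), ("AC", agresti_coull)]:
    lower, upper = method(0, 100)
    print(f"{name}: [{lower:.4f}, {upper:.4f}]")
```

Note that the Agresti-Coull lower limit is negative here; in practice it would be truncated at zero.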

Evaluation Criteria
The most commonly used CI evaluation criteria are coverage probability and expected width (Gonçalves et al., 2012), and these are the criteria that we consider in this work. However, other performance metrics have been proposed. For example, Vos and Hudson (2005) interpret an interval as the non-rejected parameter values in a hypothesis test and discuss the p-confidence and p-bias criteria; Newcombe (1998) presents a criterion using noncoverage as an indicator of location; and Park and Leemis (2019) adopt an ensemble approach and use root mean squared error and mean absolute deviation to measure CI performance.

Coverage Probability
The coverage probability can be interpreted as the computed interval's long-run percentage inclusion of the unknown parameter. Denoting L_x and U_x as the lower and upper CI bounds formed with x successes (suppressing the dependence on p and the significance level, α), the expected coverage probability, which we denote CPr, for a fixed parameter p, is given by

\[ \mathrm{CPr}(n, p) = \sum_{x=0}^{n} \mathbb{1}(L_x \le p \le U_x) \binom{n}{x} p^{x} (1-p)^{n-x}, \qquad (1) \]

where 1(·) is an indicator function that takes the value 1 when its argument is true, and 0 otherwise.
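The coverage probability defined above is a finite sum, so it can be evaluated exactly for any given n and p. The sketch below does this for the Wald and Wilson intervals at n = 100 and p = 10⁻³, the small-sample setting discussed earlier; the function names are illustrative and the binomial probabilities are computed on the log scale, which is an implementation choice made here.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist


def binom_pmf(x, n, p):
    """Binomial pmf computed on the log scale (requires 0 < p < 1)."""
    log_pmf = (lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
               + x * log(p) + (n - x) * log(1 - p))
    return exp(log_pmf)


def wald(x, n, z):
    p_hat = x / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def wilson(x, n, z):
    p_hat = x / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z / denom * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return centre - half, centre + half


def coverage(interval, n, p, alpha=0.05):
    """Sum the binomial probabilities of every x whose interval contains p."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    total = 0.0
    for x in range(n + 1):
        lower, upper = interval(x, n, z)
        if lower <= p <= upper:
            total += binom_pmf(x, n, p)
    return total


for name, interval in [("Wald", wald), ("Wilson", wilson)]:
    print(name, round(coverage(interval, 100, 1e-3), 4))
```

In this setting the Wald coverage is below 10% because the degenerate interval at x = 0 never contains p, whereas the Wilson coverage is roughly 90% and comes entirely from the x = 0 interval.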

Expected Width
The expected width, which we denote EW, is given by

\[ \mathrm{EW}(n, p) = \sum_{x=0}^{n} (U_x - L_x) \binom{n}{x} p^{x} (1-p)^{n-x}, \qquad (2) \]

and the expected margin of error, EMoE, is then given as EMoE(n, p) = EW(n, p)/2.

Calculating Sample Size
The first problem in CI estimation is determining the sample size required to achieve a desired estimation precision. There have been a range of sample size determination methods discussed in the literature, e.g., Gonçalves et al. (2012), Korn (1986) and Liu and Bailey (2002), but here we adopt the common approach of deriving the sample size from the CI formula with fixed ϵ_R = ϵ/p. Used in conjunction with the Wald margin of error, one obtains

\[ n = \left\lceil \frac{z_{\alpha/2}^{2}\,(1 - p^{*})}{\epsilon_R^{2}\, p^{*}} \right\rceil, \qquad (3) \]

where ⌈·⌉ denotes the ceiling function, and p* denotes an anticipated value of p, i.e., an initial estimate. (Sample size formulas for the Clopper-Pearson, Wilson and Agresti-Coull intervals are given in Appendix A.)
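As an illustration of the Wald calculation, setting the margin of error z_{α/2}·sqrt(p*(1 − p*)/n) equal to ϵ_R·p* and solving for n gives n = z²_{α/2}(1 − p*)/(ϵ_R² p*), rounded up. A minimal sketch follows; the function name is hypothetical and the values of p* are chosen only for illustration.

```python
from math import ceil
from statistics import NormalDist


def wald_sample_size(p_star, eps_rel, alpha=0.05):
    """Smallest n for which the Wald margin of error is at most eps_rel * p_star."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil(z**2 * (1 - p_star) / (eps_rel**2 * p_star))


# Required n grows roughly tenfold for each tenfold reduction in p*.
for exponent in range(1, 7):
    p_star = 10.0 ** -exponent
    print(f"p* = 1e-{exponent}: n = {wald_sample_size(p_star, 0.4):,}")
```

The unrounded values printed here can differ slightly from tabulated sample sizes that are rounded to two significant digits.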

Initial Estimate of p
Selecting a value for p* is required to make equation (3) operational, and this is an inherent practical challenge in any such sample size calculation. In some situations, it might be possible to overcome this problem by utilizing subject matter knowledge or results from a previous study. If no previous information is available, a common approach is to consider the value of p* = 0.5, but we do not adopt this approach here given that the focus of this work is on small/rare-event probabilities.
In some situations one might be able to gain a reasonable estimate of p. Consider a manufacturing environment where, for example, past experience or consultation with process experts could provide an analyst with information on the order of magnitude of p. One may be able to deduce that the true proportion is more likely to be of the order 1 in 10,000 rather than 1 in 1,000. Such insight could be sufficient in setting a reasonable value for p* and subsequently determining an appropriate sample size. Such initial estimation of an unknown parameter is a topic worthy of further discussion but is beyond the scope of this work. Here the focus is on the performance of the intervals after an initial estimate has been obtained.

Margin of Error Relative to p
The required precision is largely analysis dependent; what could be considered reasonable accuracy in one setting might be completely inappropriate in another.Although a relative margin of error scheme cannot be rigidly prescribed, we suggest a general scheme to avoid intervals that are too wide to be practically useful, or indeed intervals that are too narrow, in the sense that a reasonable estimate could have been obtained using fewer resources.
To ensure that the margin of error is not larger than the order of magnitude of p*, we impose ϵ_R ≤ 1. Thus, one could consider 0 ≤ ϵ_R ≤ 1 as a plausible margin of error scheme. However, considering ϵ_R values too close to the bound of 1 results in very wide intervals, whilst considering ϵ_R values too close to the bound of 0 results in very narrow intervals. As ϵ_R approaches zero, an increasingly large sample size is required, and such high precision is not likely to be required for many studies. Therefore, we suggest ϵ_R ∈ [0.1, 0.5] as a reasonable scheme. This scheme ensures that the interval is not impractically wide, nor excessively narrow (in terms of demanding very large sample sizes), and we show in Section 6 that acceptable coverage is achieved for this range of ϵ_R values.
A comparison of calculated sample sizes corresponding to ϵ_R = 0.4 is provided in Table 1. (Sample size values in this and subsequent tables are rounded to 2 significant digits.) We can see from Table 1 that the Wald, Wilson and Agresti-Coull sample sizes are similar across the p* range, but the Clopper-Pearson sample sizes are approximately 40% larger, which is indicative of this method's conservatism. Note that Table 1 is only intended to provide an initial sense of the sample size requirements. Final results can be found in Tables 10 and 11, and these account for the fact that empirical coverage for binomial proportion interval estimators is non-monotonic in the sample size (see Sections 5 and 6).
Given that ϵ_R is a function of p*, the above calculated sample sizes are only applicable to each specific p*. Even if the true proportion p equals p* (i.e., the initial estimate is perfect), the estimate p̂ will of course vary from sample to sample and may not equal p; hence, the realized relative margin of error will typically differ from ϵ_R.

ϵ-p * Compatibility
Next we illustrate the importance of defining the margin of error in relation to the magnitude of the proportion. Consider four fixed margin of error schemes, Schemes 1 to 4, in which ϵ is held fixed regardless of the magnitude of p. Table 2 displays the calculated Wald sample sizes and coverage probabilities corresponding to these margin of error schemes. (A comparison of Wald, Clopper-Pearson, Wilson and Agresti-Coull coverage probabilities for these ϵ schemes is given in Appendix B.) Referring to Table 2, fixing ϵ = 4·10⁻² and considering p* = p (Scheme 1) creates sample sizes that reduce dramatically as p decreases. This results in coverage probabilities that are completely inadequate for p ≤ 10⁻². In Scheme 2, both ϵ and p* are fixed and this creates a constant sample size of n = 6·10². This sample size is reasonable for p = 10⁻¹, but is insufficient for the remaining p values, which is reflected in the poor coverage performance.
Scheme 3 is similar to Scheme 1, but here, ϵ is reduced to 4·10⁻⁴. This ϵ-p combination produces sufficient coverage for p ≥ 10⁻², but deteriorates for the smaller p values. In Scheme 4, p* is fixed at 0.5 and ϵ = 4·10⁻⁴; this results in a constant sample size of n = 6·10⁶, which produces good coverage throughout the p range, particularly for p ≥ 10⁻⁵. Whilst the coverage is satisfactory in this scheme, the magnitude of ϵ is not compatible with all p values, particularly p = 10⁻¹ and p ≤ 10⁻⁵. For p = 10⁻¹ the resulting interval is [0.0996, 0.1004], which is too narrow in the sense that a reasonable interval could be obtained with a significantly reduced sample size. For p = 10⁻⁶ the interval is truncated at [0, 0.000401]. Here, even though the coverage is reasonable, the interval is too wide to be practically useful since its width is two orders of magnitude larger than p.
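The sample sizes behind these fixed-ϵ schemes follow from the usual Wald calculation n = ⌈z²_{α/2} p*(1 − p*)/ϵ²⌉. The sketch below assumes a 95% confidence level throughout, and assumes that Scheme 2 uses ϵ = 4·10⁻² with p* = 0.5; that pairing is an inference from the reported n = 6·10², not a value restated in the text above. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def wald_sample_size_fixed(p_star, eps, alpha=0.05):
    """Smallest n for which the Wald margin of error is at most the fixed value eps."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil(z**2 * p_star * (1 - p_star) / eps**2)


# Scheme 1 style: eps = 0.04 with p* = p, so n shrinks as p decreases.
for p in (1e-1, 1e-2, 1e-3):
    print(f"eps = 0.04, p* = {p:g}: n = {wald_sample_size_fixed(p, 0.04)}")

# Constant-n schemes with p* fixed at 0.5 (eps = 0.04 is assumed for Scheme 2).
print("eps = 0.04, p* = 0.5:", wald_sample_size_fixed(0.5, 0.04))
print("eps = 4e-4, p* = 0.5:", wald_sample_size_fixed(0.5, 4e-4))
```

With p* = p = 10⁻³ and ϵ = 0.04 the formula asks for only a handful of observations, which makes plain why the coverage collapses in that scheme.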
From Table 3, we see that by considering ϵ in relation to the magnitude of p, the coverage probabilities are reasonable across the p range, but now, the analyst must choose

Suitability of ϵ R Scheme
A range of qualifications/criteria are often used to check the validity of using approximate CI estimators.Fleiss et al. (2003) state that the normal distribution provides excellent approximations to exact binomial procedures when np ≥ 5 and n(1 − p) ≥ 5. Leemis and Trivedi (1996) also discuss the np ≥ 5 (or 10) and n(1 − p) ≥ 5 (or 10) qualification.
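To connect this rule of thumb with the relative margin of error, one can multiply the Wald sample size expression by p. The following is a sketch under the assumption that n is chosen from the Wald formula with p* = p and that rounding up is ignored; the numerical values use z ≈ 1.96 (95% confidence) and p ≤ 0.1.

```latex
\begin{align*}
n \approx \frac{z_{\alpha/2}^{2}\,(1-p)}{\epsilon_R^{2}\,p}
\quad &\Longrightarrow \quad
np \approx \frac{z_{\alpha/2}^{2}\,(1-p)}{\epsilon_R^{2}}, \\[4pt]
\epsilon_R = 0.5:\quad & np \approx 15.4\,(1-p) \ge 13.8, \\
\epsilon_R = 1:\quad   & np \approx 3.84\,(1-p) < 5 .
\end{align*}
```

Under these assumptions, any ϵ_R ≤ 0.5 implies np ≥ 5 automatically, whereas ϵ_R = 1 does not.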

Tolerances for Assessing CI Performance
In this section we suggest suitable tolerances for assessing interval performance in terms of coverage probability and relative margin of error. In relation to achieving a desired coverage probability, one usually considers (1 − α)100 ± ϵ*%, where ϵ* denotes a predefined coverage tolerance. The definition of such a tolerance is dependent on the individual researcher and particular study, and is thus difficult to quantify. In one study (1 − α)100 ± 4% might be acceptable, whilst in another, one might require (1 − α)100 ± 0.5%. We suggest that ϵ* ∈ {1, 2, 3} would be reasonable tolerances for most analyses, and as such, consider acceptable expected coverage probabilities as CPr ∈ (1 − α)100 ± 3%, where CPr is described in equation (1).
A tolerance is also necessary with regard to the relative margin of error. As with the coverage, the desired margin of error is dependent on the particular research question and hence cannot be rigidly prescribed. However, as previously discussed, it is important that the magnitude of the margin of error reflect the magnitude of the estimated proportion.
Table 5 provides suggested tolerances for the assessment of ϵ_R and CPr for (1 − α)100% confidence intervals which could be considered reasonable in most settings.

Relative Margin of Error Central to Performance
Next we illustrate how the relative margin of error is fundamental to CI performance evaluation. We show that when a valid confidence interval is defined as achieving a desired coverage probability whilst simultaneously satisfying a minimum relative margin of error, the four interval estimators perform similarly for a given n-p*-α combination.
To demonstrate CI performance we consider the expected relative margin of error, which we define as εR = EMoE/p, where EMoE is half of the expected width (see equation (2)). As discussed in Section 4.2, a relative margin of error exceeding 1 is not acceptable from a practical perspective, and, as per Section 4.4, we suggest that it should not exceed 0.5. Figure 1 provides a 99% CI performance comparison for p* = p = 10⁻¹, and shows that the Wilson, Agresti-Coull and Clopper-Pearson intervals all achieve satisfactory coverage across the sample size range, whereas for n ≤ 260, the coverage of the Wald interval oscillates around the lower limit of 98%. For example, the coverage is satisfactory at n = 240, but then falls below 98% at n = 260. This phenomenon of coverage oscillation relates to the discreteness of the binomial distribution and has been previously discussed in the literature, e.g., Agresti and Coull (1998), Andersson (2023), Blyth and Still (1983), Brown et al. (2001), and Vollset (1993). For a given p, the empirical coverage does converge to the (1 − α)100% level with n as one would expect, but it does so in an oscillatory fashion for neighbouring values of n. Figures 1 and 2 show that all four estimators suffer from this erratic behaviour.
Whilst the coverage performance of the Wald interval is inferior to the other three intervals for n < 240, none of the intervals satisfy the εR ≤ 0.5 requirement at these lower sample sizes. Thus, by stipulating a minimum requirement for εR, the poor coverage at small n is rendered irrelevant and the performance of the Wald interval is more comparable to the other three intervals when εR ≤ 0.5.

[Figure caption: sample size range is from n = 1,000 to n = 8,000, in steps of 200. Labels shown adjacent to data points depict the first n where both CPr and εR requirements are satisfied. Additional data labels are referred to within the text.]
A comparison of the performance of a 95% CI for p* = p = 10⁻² is given in Figure 2, which shows that the Wald interval achieves εR ≤ 0.5 for n ≥ 1,600. For n ≥ 1,600 the Wald interval encounters five sample sizes where the coverage drops below the lower limit of 94%. The Clopper-Pearson interval requires a sample size of n = 2,000 to satisfy both εR ≤ 0.5 and CPr ∈ [94%, 96%], with the coverage exceeding the upper limit of 96% on seven occasions for n > 2,000. The Agresti-Coull interval performs very well for n ≥ 1,800, with just one value (n = 2,600) failing to satisfy both CPr and εR thereafter. The Wilson interval provides the best performance, satisfying both CPr and εR requirements ∀ n ≥ 1,600.
Figures 1 and 2 highlight the similarities in performance when one considers εR. In general, moderate-to-large sample sizes are required to satisfy both CPr and εR criteria, and at these sample sizes the performance across the four intervals is reasonably comparable.
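The εR threshold behaviour described for the 99% Wald interval at p = 10⁻¹ can be checked by sweeping over n. The sketch below computes the expected relative margin of error exactly and reports the first n on a grid at which it is at most 0.5. The grid (n = 100 to 400 in steps of 20) and the use of untruncated Wald limits are assumptions made here for illustration, and the function names are hypothetical.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist


def binom_pmf(x, n, p):
    """Binomial pmf computed on the log scale (requires 0 < p < 1)."""
    log_pmf = (lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
               + x * log(p) + (n - x) * log(1 - p))
    return exp(log_pmf)


def wald_expected_rel_moe(n, p, alpha):
    """Expected Wald margin of error divided by p (limits not truncated to [0, 1])."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    emoe = sum(z * sqrt((x / n) * (1 - x / n) / n) * binom_pmf(x, n, p)
               for x in range(n + 1))
    return emoe / p


def first_n_meeting(p, alpha, limit, n_values):
    """First n in n_values whose expected relative margin of error is <= limit."""
    for n in n_values:
        if wald_expected_rel_moe(n, p, alpha) <= limit:
            return n
    return None


first_n = first_n_meeting(p=0.1, alpha=0.01, limit=0.5, n_values=range(100, 401, 20))
print("first n with expected relative margin of error <= 0.5:", first_n)
```

The same sweep could be paired with an exact coverage calculation to locate the first n at which both criteria hold.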

CI Performance Tables
Tables 6 through 9 provide a 95% CI comparison for p* = p = 10⁻¹ and p* = p = 10⁻⁶, across a range of sample sizes, and further illustrate the performance similarities among the estimators. The table cells are colour coded according to the tolerances discussed in Table 5: target (green), acceptable (yellow), minimally acceptable (orange) and unacceptable (red).
We first consider p* = p = 10⁻¹ and 10 ≤ n ≤ 140, with Table 6 showing that none of the intervals satisfy the desired CPr and εR requirements simultaneously. The importance of considering the relative margin of error in CI evaluation is clearly evident. In several cases the coverage probability lies within the desired tolerance but the excessive relative margin of error renders the estimate impractical. For example, referring to the Wilson interval, CPr(20, 10⁻¹) = 95.7%; however, εR = 1.32, which is not acceptable. We see from Table 7 that by increasing n, εR ≤ 0.5 is satisfied (at n = 150, the Clopper-Pearson interval marginally exceeds 0.5, but is less than 0.5 thereafter). Each interval encounters sample sizes where CPr exceeds the bounds of ±1%, but overall, CPr is satisfactory. As per Table 8, none of the intervals satisfy CPr ∈ [94%, 96%] and εR ≤ 0.5 for p* = p = 10⁻⁶ and n ≤ 14·10⁶. The Wilson method performs best in this scheme, and if the tolerances of CPr ∈ [93%, 97%] and εR ≤ 0.75 were considered, it would produce a valid interval ∀ n. Referring to Table 9, for 15·10⁶ ≤ n ≤ 25·10⁶ and p* = p = 10⁻⁶, the Wald interval has the worst coverage, with four CPr values exceeding 95 ± 2%. The Wilson and Agresti-Coull intervals perform the best, but overall, all four intervals perform well in this large sample size scheme, particularly if the coverage tolerance was considered as CPr ∈ 95 ± 2%.
The CPr and εR performance across the p* range is given in Tables 10 and 11. Shown is the performance of each CI estimator at a selection of sample sizes of interest. The Wald, Wilson and Agresti-Coull methods perform similarly for a given n-p* combination, as shown in Table 10. The CPr and εR values of the Clopper-Pearson interval slightly exceed the desired limits, but overall, the performance is quite reasonable. Table 11 gives the sample sizes required to maintain a desired level of CPr and εR. Three performance schemes were investigated: (i) CPr ∈ 95 ± 3%, εR ≤ 1, (ii) CPr ∈ 95 ± 2%, εR ≤ 0.75 and (iii) CPr ∈ 95 ± 1%, εR ≤ 0.5. For each scheme, the Wilson interval required the smallest sample size to achieve (and maintain) the desired performance, thus providing further evidence of its overall superiority among the four estimators. Table 12 shows the εR values pertaining to the sample sizes displayed in Table 11 (ϵ_R values were found to be very similar to the given εR values). It can be seen that in relation to a 95% CI, to ensure that the coverage remains within ±2%, the Wald interval requires εR ≤ 0.37, whereas the Wilson interval requires εR = 0.75. The εR values corresponding to maintaining the coverage within 95 ± 1% (green table cells) are in close agreement with our recommendation to use εR ∈ [0.1, 0.5].
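The three performance schemes can be expressed as a simple classification rule. In the sketch below, the pairing of scheme (iii) with the "target" label follows the green-cell remark above, while the pairing of schemes (ii) and (i) with "acceptable" and "minimally acceptable", and the joint treatment of CPr and εR in a single label, are assumptions made for illustration. The function and constant names are hypothetical.

```python
# Each entry: (label, coverage tolerance in percentage points, maximum relative margin of error).
# Ordered from strictest to loosest so the first match is the best label.
SCHEMES = [
    ("target", 1.0, 0.5),                # scheme (iii)
    ("acceptable", 2.0, 0.75),           # scheme (ii), label assumed
    ("minimally acceptable", 3.0, 1.0),  # scheme (i), label assumed
]


def classify(cpr_percent, eps_rel, nominal_percent=95.0):
    """Return the strictest scheme label satisfied by both coverage and relative margin."""
    deviation = abs(cpr_percent - nominal_percent)
    for label, coverage_tol, eps_max in SCHEMES:
        if deviation <= coverage_tol and eps_rel <= eps_max:
            return label
    return "unacceptable"


# Wilson interval at n = 20, p = 0.1 (values quoted in the text above):
# coverage is within tolerance, but the relative margin of error is far too large.
print(classify(95.7, 1.32))
```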
Table 12: εR values corresponding to maintained coverage

CPr        W            CP           WS           AC
95 ± 3%    0.46-0.50    0.80-0.87    0.98-1.00    0.97-1.00
95 ± 2%    0.34-0.37    0.45-0.52    0.75-0.75    0.60-0.61
95 ± 1%    0.22-0.23    0.17-0.20    0.31-0.39    0.29-0.33

εR values observed across the p* range, displayed as minimum-maximum.

Estimating a Rare Event with a Small Sample Size
It is clear from the above results that, as expected a priori, quite large sample sizes are required to accurately estimate rare-event probabilities. Achieving accuracy on the order of magnitude for a small p is usually most relevant in large populations, where it will also be possible to collect large samples. For example, a quality engineer may have little problem in obtaining high-throughput process data of order n = 10⁶ or greater, and, in this large-scale production setting, it will be critical to know whether the defect rate is, say, one in one thousand, or one in ten thousand. In Section 8, we assess three data-rich scenarios from the literature, i.e., cases that involve estimating a small proportion whilst utilizing large samples. Notwithstanding the fact that accurate estimation of a small p is most important in large populations, an analyst may be faced with the challenge of estimating a small p with a limited sample size. We touched on this problem in Section 2.2, but now consider CI performance in more detail using the approach of Section 6.
Assume that the true proportion is of order 10⁻². As we have seen previously in Table 11, sample sizes of order n = 10³ will be needed to accurately estimate p. However, here, we assume that the analyst is dealing with a hard-to-reach population where n ≤ 100; the performance of the four intervals is displayed in Table 13. It is clear that all four intervals perform quite poorly in this scenario, both in terms of coverage and relative margin of error. The coverage of the Wilson interval is notably better than the others, and is reasonable for some sample sizes, albeit still somewhat erratic. This interval does achieve excellent coverage for n = 80, for example, but the relative margin of error is εR ≈ 3, i.e., the margin of error of 0.03 is much larger than p = 0.01. If the analyst only requires a rough estimate of p, for example, to answer the question of whether or not it is less than 0.1, then such a large margin of error will be acceptable. On the other hand, if the aim is to accurately estimate the order of magnitude of p, this will not be achievable for such small sample sizes (and, clearly, performance will degrade further for even smaller p). This again highlights the importance of considering relative margin of error in the small p setting, and our suggestion is to use εR ∈ [0.1, 0.5].
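The n = 80 example can be checked by evaluating the coverage and expected relative margin of error of the 95% Wilson interval exactly at p = 0.01. This is a minimal sketch with illustrative function names; the binomial probabilities are computed on the log scale as an implementation choice.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist


def binom_pmf(x, n, p):
    """Binomial pmf computed on the log scale (requires 0 < p < 1)."""
    log_pmf = (lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
               + x * log(p) + (n - x) * log(1 - p))
    return exp(log_pmf)


def wilson(x, n, z):
    p_hat = x / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z / denom * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return centre - half, centre + half


def wilson_performance(n, p, alpha=0.05):
    """Return (coverage probability, expected relative margin of error)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    cov = emoe = 0.0
    for x in range(n + 1):
        lower, upper = wilson(x, n, z)
        prob = binom_pmf(x, n, p)
        emoe += (upper - lower) / 2 * prob   # half-width weighted by P(X = x)
        if lower <= p <= upper:
            cov += prob
    return cov, emoe / p


cov, rel_moe = wilson_performance(80, 0.01)
print(f"coverage = {cov:.3f}, expected relative margin of error = {rel_moe:.2f}")
```

The coverage is close to the nominal 95%, yet the expected margin of error is about three times p, which is the point made above.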
An anonymous reviewer advised us of two modern CI estimation approaches: an asymptotic method based on generalized fiducial inference (GFI) (Hannig, 2009), and a recently developed exact method known as the "repro samples" method (Xie and Wang, 2022). The GFI method is an extension of the fiducial argument proposed by Fisher (1930), and the repro method is a simulation-based method that provides a finite-sample CI coverage guarantee, which is particularly useful in small samples. We have tested both of these more modern methods (see Appendix C), and have found that they provide reasonable coverage (starting from a conservative position akin to the exact Clopper-Pearson method). However, when p and n are small, the methods experience the same issues as the classical methods we have considered; in particular, the relative margin of error is too large to be used in settings where the order of magnitude of a small p is of interest. (It is noteworthy, however, that the GFI and repro sampling methods are general inference procedures that provide good finite-sample performance in a wide range of problems beyond proportion estimation.) Ultimately, all of our work points to the fact that large samples are required in this small-p setting, and we have provided guidelines in Section 6.

Case Studies
We now consider the use of the relative margin of error in the estimation of small/rare-event proportions using data from the literature. More specifically, we consider: (i) a study on the prevalence of ADHD prescriptions in adolescents, (ii) a clinical trial relating to COVID-19 vaccine efficacy, and (iii) accident data from commercial jet aircraft records.
Using the values of n and p reported in each of the aforementioned studies, we evaluate the validity of a 95% Wald CI in terms of the interval's relative margin of error. We discuss the Wald interval as it is the most commonly used interval estimator and, for each of these case studies, it produces similar results to the Clopper-Pearson, Wilson and Agresti-Coull intervals. We also refer to our sample size calculations and CI performance analyses to assess the suitability of the sample size in terms of achieving the desired coverage.

Assessing Prevalence of ADHD Medication
The first study we consider was conducted by Sawyer et al. (2017) to assess the prevalence of stimulant and antidepressant medication in Australian children and adolescents with symptoms of ADHD (Attention-Deficit/Hyperactivity Disorder) and major depressive disorder (MDD). A nationally representative sample of n = 6,310 children between the ages of 4 and 17 was obtained, which found that 13.7% of those with symptoms meeting the criteria of ADHD had used stimulant medications.
For a sample size of n = 6,310 and an estimated proportion of magnitude p = 0.137, the 95% Wald CI is [0.129, 0.145], with a realized relative margin of error of ϵ_R = ϵ/p ≈ 0.062. Using the Delta method (see Appendix D), a 95% CI for ϵ_R is given as [0.060, 0.064]. (For each case study, a CI for εR was obtained using Monte Carlo simulation in conjunction with the Delta method, and each was found to be in agreement with the corresponding CI estimate for ϵ_R.) For this study, the ϵ_R CI values fall below our recommended range of ϵ_R ∈ [0.1, 0.5], meaning that the interval is somewhat narrower than what we recommend, i.e., one could achieve an acceptable result with fewer observations. Indeed, Table 14 displays the sample sizes for a selection of ϵ_R values in this range; note, for example, that ϵ_R = 0.4 leads to a sample size approximately forty times smaller than the sample size used in the study. It can also be seen from Table 14 that a relative margin of error of ϵ_R = 0.2 corresponds to a CI which is similar to that computed at ϵ_R = 0.06, but uses a sample size that is approximately ten times smaller. Had the order of magnitude of p been known in advance (e.g., if it was known that p ≈ 1/10, rather than p ≈ 1/100), then a smaller sample size would have sufficed. When p is of the order 10⁻², very good coverage is achieved for n = 1,800 (see Table 10); hence the study sample size of n = 6,310 is more than adequate for the scenarios where p = 10⁻¹ and p = 10⁻².
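The arithmetic behind these figures can be sketched as follows. This is an illustrative calculation, not the authors' code: it assumes the standard Wald forms ϵ = z·sqrt(p(1 − p)/n) and, substituting ϵ = ϵ_R·p, n = z²(1 − p)/(ϵ_R²·p). The helper names `wald_summary` and `wald_sample_size` are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def wald_summary(p_hat, n, alpha=0.05):
    """Wald interval limits and realized relative margin of error eps / p_hat."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    eps = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - eps, p_hat + eps, eps / p_hat


def wald_sample_size(p, eps_r, alpha=0.05):
    """Smallest n whose Wald margin of error is at most eps_r * p.

    Assumption: n = z^2 (1 - p) / (eps_r^2 p), i.e. the usual Wald
    sample size formula with the absolute margin set to eps_r * p.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil(z**2 * (1 - p) / (eps_r**2 * p))


lower, upper, eps_r = wald_summary(p_hat=0.137, n=6310)
print(f"Wald CI: [{lower:.3f}, {upper:.3f}], relative margin = {eps_r:.3f}")

n_study = 6310
for target in (0.2, 0.4):
    n_req = wald_sample_size(p=0.137, eps_r=target)
    print(f"eps_R = {target}: n = {n_req} (study n is {n_study / n_req:.0f}x larger)")
```

Under these assumptions the required n falls by roughly a factor of ten at ϵ_R = 0.2 and a factor of forty at ϵ_R = 0.4, consistent with the comparison drawn above; the exact entries of Table 14 may differ slightly depending on rounding.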

COVID-19 Vaccine Efficacy
An efficacy trial of the BNT162b2 mRNA COVID-19 vaccine was conducted by Polack et al. (2020). In this placebo-controlled, observer-blind trial, 43,548 participants were randomly assigned either the BNT162b2 vaccine or a placebo treatment. Of the 21,720 participants who received the vaccine, there were 8 cases of COVID-19 after the second dose. This leads to an estimated proportion of p = x/n = 8/21,720 ≈ 3.7 × 10⁻⁴, and, therefore, the 95% Wald CI is given by [1.1 × 10⁻⁴, 6.2 × 10⁻⁴], with ϵ_R ≈ 0.69, and a 95% CI for ϵ_R of [0.421, 0.875]. As per Section 4.2, we recommend ϵ_R ∈ [0.1, 0.5]. This range overlaps the above interval, but only because the CI lower bound (0.421) falls just below our recommended ϵ_R upper bound of 0.5. Thus, in our suggested scheme, the computed interval could be questioned with regard to its width.
Aside from having a somewhat large margin of error (relative to p), we need to consider the sample size in relation to the order of magnitude of p. That is, we must assess whether the sample size is large enough to provide acceptable coverage. For example, if p were 10⁻⁴, Table 11 indicates that a sample size of the order 10⁵ is required, whereas, here, the sample size is of the order 10⁴. Indeed, we have calculated that, with p = 3.7 × 10⁻⁴ and n = 21,720, the expected coverage of the Wald interval is just 89.2%. The Clopper-Pearson, Wilson and Agresti-Coull intervals perform better for this n-p combination, achieving coverage of 96.9%, 95.2% and 95.2%, respectively. However, for this n-p combination, all three intervals have ϵ_R > 0.72. Thus, to obtain a CI estimate where the margin of error is more consistent with p, and/or to enhance the coverage probability, a larger sample size would be required.
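Coverage at a fixed n-p combination can be computed exactly by summing binomial probabilities over the outcomes whose interval contains p. The sketch below does this for the Wald and Wilson intervals only. It assumes that p is set to the unrounded estimate 8/21,720 (our reading of how the quoted figures were obtained), and the function names are hypothetical.

```python
from math import exp, lgamma, log, log1p, sqrt
from statistics import NormalDist


def binom_pmf(x, n, p):
    """Binomial pmf evaluated on the log scale to avoid overflow for large n."""
    log_pmf = (lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
               + x * log(p) + (n - x) * log1p(-p))
    return exp(log_pmf)


def wald(x, n, z):
    p_hat = x / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def wilson(x, n, z):
    p_hat = x / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def exact_coverage(interval, n, p, alpha=0.05):
    """Sum P(X = x) over all x whose interval contains the true p."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    total = 0.0
    for x in range(n + 1):
        lower, upper = interval(x, n, z)
        if lower <= p <= upper:
            total += binom_pmf(x, n, p)
    return total


n = 21720
p = 8 / n  # assumption: the unrounded estimate, approximately 3.7e-4
wald_cov = exact_coverage(wald, n, p)
wilson_cov = exact_coverage(wilson, n, p)
print(f"Wald coverage:   {wald_cov:.3f}")
print(f"Wilson coverage: {wilson_cov:.3f}")
```

With these inputs the two values should land close to the 89.2% and 95.2% quoted above. The Clopper-Pearson interval is omitted because it requires beta quantiles, which are not available in the Python standard library.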

Commercial Aircraft Accidents
A summary of annual commercial jet aircraft flight hours, departures and accidents is provided by Boeing (2022). In the year 2021, there were 21.6 million aircraft departures with a total of 23 recorded incidents/accidents. Although all aircraft departures and accidents are recorded here, we may still view this as a sample from a larger population of flights that might have taken place (had demand been higher), or indeed as representative of flights in upcoming years (provided that conditions such as aviation regulations and the composition of aircraft fleets remain similar). Therefore, it is still of interest to compute a confidence interval in this scenario, and, irrespective of the specific target population, the data still suffice for the purpose of demonstrating our proposed scheme.
In the context of estimating the proportion of aircraft accidents, the analyst has no control over the sample size. That is to say, had a larger sample size been required, one would simply have to wait for more aircraft departures to occur. However, our analysis provides us with reassurance that our computed interval will perform satisfactorily.
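As an illustration of how the relative margin of error can be checked for these data, the sketch below applies the Wald, Wilson and Agresti-Coull half-widths to the counts quoted above (23 events in 21.6 million departures). The resulting values are computed here and are not taken from the paper. Dividing each half-width by the observed proportion is an assumption about how a realized relative margin would be defined for the non-Wald intervals.

```python
from math import sqrt
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)  # 95% confidence level
x, n = 23, 21_600_000            # accidents and departures quoted for 2021
p_hat = x / n

# Wald half-width
wald_half = z * sqrt(p_hat * (1 - p_hat) / n)

# Wilson half-width
wilson_half = (z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
               / (1 + z**2 / n))

# Agresti-Coull half-width, using the adjusted estimate and sample size
n_tilde = n + z**2
p_tilde = (x + z**2 / 2) / n_tilde
ac_half = z * sqrt(p_tilde * (1 - p_tilde) / n_tilde)

# Assumption: relative margin = half-width / observed proportion
relative_margins = {
    "Wald": wald_half / p_hat,
    "Wilson": wilson_half / p_hat,
    "Agresti-Coull": ac_half / p_hat,
}
for name, value in relative_margins.items():
    print(f"{name}: relative margin of error = {value:.3f}")
```

For a proportion this small the Wald value is approximately z/√x, so it is driven almost entirely by the number of observed events. Under these formulas all three values fall inside the recommended range [0.1, 0.5].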

Discussion
When constructing confidence intervals for small success probabilities, it is important that the margin of error, ϵ, be considered relative to the magnitude of the proportion, p. Incompatibilities between ϵ and p can lead to completely unsatisfactory coverage or unnecessarily narrow intervals that require extremely large sample sizes. When dealing with moderate success probabilities, say p ≥ 0.2, this is less important, but in the context of small or rare-event success probabilities, the consideration of ϵ relative to p is crucial to reduce the possibility of substantial mismatching between ϵ and p. For example, ϵ = 0.05 might be considered as valid precision for p = 10⁻¹, but such a margin of error is not compatible with a proportion of the order p = 10⁻³.
To ensure ϵ is compatible with the order of magnitude of p, we recommend using a relative margin of error scheme, ϵ_R. We suggest restricting the range of values to ϵ_R ∈ [0.1, 0.5], as higher values lead to imprecision and poor interval coverage, whereas lower values lead to sample sizes that are likely to be impractically large for many studies. Our recommendation of ϵ_R ∈ [0.1, 0.5] avoids intervals that are impractically wide or restrictively narrow in terms of sample size requirements, and we show that adequate performance is achieved within this range. In contrast to the existing literature, we have highlighted the importance of the relative margin of error, ϵ_R, in conjunction with the empirical coverage, when assessing CI performance in the small-p setting. When both criteria are considered simultaneously, the Wald, Clopper-Pearson (exact), Wilson and Agresti-Coull intervals perform similarly in many cases. In general, all four intervals fail to satisfy both criteria when the sample size is small, with improved performance at larger sample sizes, as expected. For example, for a 95% confidence interval when p = 10⁻¹, none of the methods produce a satisfactory interval for 10 ≤ n ≤ 140. Each interval achieves the nominal coverage of 95% at some (albeit not all) sample sizes in this range, but in each case the desired limit of ϵ_R ≤ 0.5 is exceeded. Once the sample size is increased (n ≥ 150), and the ϵ_R requirement is satisfied, all four intervals perform well in terms of coverage.
The coverage probabilities of the Wald and Clopper-Pearson intervals for small n are generally poor, particularly in comparison to the Wilson and Agresti-Coull intervals. However, the considerable difference in coverage in such situations is rendered immaterial once the (we believe reasonable) requirement that ϵ_R ≤ 0.5 is considered. When satisfactory performance is defined as achieving a desired CP_r and ϵ_R, the performance across these commonly used intervals is much more comparable, particularly if one considers empirical coverage in the range (1 − α)100 ± 2%. In this relative margin of error framework, the criticisms of inadequate coverage for the Wald interval, and excessive conservatism for the Clopper-Pearson interval, are somewhat alleviated, and all four intervals perform quite similarly. Although there are performance similarities, the Wilson and Agresti-Coull intervals are generally superior to the intervals of Wald and Clopper-Pearson. The Wilson and Agresti-Coull intervals achieve similar CP_r and ϵ_R values for given n-p-α combinations; however, the Wilson interval is narrower and achieves favourable performance at lower sample sizes.
When the success probability is small, failure to consider the margin of error relative to the order of magnitude of the estimated proportion can result in poor coverage, and/or intervals which are unnecessarily narrow or excessively wide. As shown in the case studies presented in Section 8, the relative margin of error criterion provides a simple and effective assessment of the validity of an estimated interval in terms of its width/margin of error. For example, we have shown that all of the interval estimators considered in this paper performed poorly for the COVID-19 study (Section 8.2) in terms of the relative margin of error, meaning that the confidence intervals were all impractically wide, and the Wald interval also had notably poor coverage. It is important to ensure that the interval precision is compatible with the order of magnitude of p. The relative margin of error serves as a useful evaluation criterion in this regard, and as such, we suggest that it should be considered when planning statistical studies.

where z is the (1 − α/2) quantile of the standard normal distribution, and α is the significance level pertaining to ϵ.

Figure 1: CP_r versus εR for p* = p = 10⁻¹, α = 0.01. Dashed (grey) line represents the nominal CP_r value. Dot-dashed (red) lines represent the target CP_r and εR tolerances from Table 5. Sample size range is from n = 100 to n = 800, in steps of 20. Labels shown adjacent to data points depict the first n where both CP_r and εR requirements are satisfied. Additional data labels are referred to within the text.

Figure 2: CP_r versus εR for p* = p = 10⁻², α = 0.05. Dashed (grey) line represents the nominal CP_r value. Dot-dashed (red) lines represent the target CP_r and εR tolerances from Table 5. Sample size range is from n = 1,000 to n = 8,000, in steps of 200. Labels shown adjacent to data points depict the first n where both CP_r and εR requirements are satisfied. Additional data labels are referred to within the text.

Table 3: Wald-based sample size comparison - variable ϵ
Wald 95% CI coverage shown in parentheses

a scheme such that the resulting interval's width is appropriate. For example, consider p* = p = 10⁻¹ and ϵ_R = 0.05, where the resulting interval is [0.095, 0.105]. This interval is very narrow and the large sample size of 1.4 × 10⁴ reflects this quite stringent margin of error. Moving to ϵ_R = 0.75 has the advantage of significantly reducing the sample size, but, of course, the interval is significantly wider at [0.025, 0.175]. To obtain intervals that are neither too liberal nor too conservative, that are reasonable in terms of coverage performance, and which avoid excessively large sample sizes, we recommend ϵ_R ∈ [0.1, 0.5] as a reasonable scheme. The coverage values for the Clopper-Pearson, Wilson and Agresti-Coull intervals are similar to those shown in Table 3 for ϵ_R ≤ 0.5. A comparison of coverage probabilities for ϵ_R = 0.75 is given in Appendix B.

Table 5: Coverage & relative margin of error tolerances

Table 10: 95% CI performance comparison - n
n_A: First n where at least one interval satisfies CP_r ∈ 95 ± 1% and εR ≤ 0.5
n_B: First n where all intervals satisfy CP_r ∈ 95 ± 1% and εR ≤ 0.5

Table 11: 95% CI sample size comparison