Blending Bayesian and Classical Tools to Define Optimal Sample-Size-Dependent Significance Levels

ABSTRACT This article argues that researchers do not need to completely abandon the p-value, the best-known significance index, but should instead stop using significance levels that do not depend on sample sizes. A testing procedure is developed using a mixture of frequentist and Bayesian tools, with a significance level that is a function of sample size, obtained from a generalized form of the Neyman–Pearson Lemma that minimizes a linear combination of α, the probability of rejecting a true null hypothesis, and β, the probability of failing to reject a false null, instead of fixing α and minimizing β. The resulting hypothesis tests do not violate the Likelihood Principle and do not require any constraints on the dimensionalities of the sample space and parameter space. The procedure includes an ordering of the entire sample space and uses predictive probability (density) functions, allowing for testing of both simple and compound hypotheses. Accessible examples are presented to highlight specific characteristics of the new tests.


Introduction
It has become clear that the tests performed by comparing a p-value to 0.05 are not adequate for science in the 21st century. Classical p-values have multiple problems that can make the results of hypothesis tests difficult to understand, multiple kinds of "p-hacking" can be used to push a p-value below the "magic" number of 0.05, and as many researchers have discovered, tests using standard p-values simply stop being useful when the sample size gets large, because they end up rejecting virtually any null hypothesis. Jeffreys's tests using Bayes factors with fixed cutoffs also tend to have problems with large samples. However, hypothesis testing is a useful tool, a more-than-familiar way of thinking about how to do experiments, and a very broadly used way of understanding and reporting experimental results. What is a researcher to do?
The present article introduces one solution for researchers who recognize that the hypothesis tests of the 20th Century are inadequate, but don't want to have to change the way they think about experiments and the ways they interpret and report experimental results. Section 2 delves into some of the problems that arise with the most widely used hypothesis tests and why those problems occur. Section 3 presents a new kind of hypothesis test that avoids the problems described, but doesn't "throw out the baby with the bath water," retaining the useful concept of statistical significance and the same operational procedures as currently used tests, whether frequentist (Neyman-Pearson p-value tests) or Bayesian (Jeffreys's Bayes-factor tests). Section 4 presents examples that highlight some of the advantages of the new tests and show researchers operational details of how the new tests can be used. Final considerations are given in Section 5. The article is written for researchers who are interested in a more modern tool for hypothesis testing, but who are not necessarily statisticians. It therefore does not go into deep theoretical detail, focusing instead on issues of more direct relevance to would-be users of the new tests. However, references are provided for those interested in the details of the theory behind the tests.

Context and Motivation: What's Wrong With the Tests People Have Been Using for so Many Years?
Hypothesis testing and the problems that arise with it have been the subject of vigorous debate for several decades. Because frequentist tests using p-values are the most widely used, p-values have drawn the most frequent and harshest criticism. The journal Basic and Applied Social Psychology even went so far as to prohibit the use of p-values in articles; see Trafimow and Marks (2015). The controversy over the use of p-values has been so great that the American Statistical Association issued an official statement on p-values: Wasserstein and Lazar (2016). It is worth saying that hypothesis tests based on p-values are not the only tools subject to valid criticism. Hypothesis testing in general has been criticized, for example, in Cohen (1994), Levine et al. (2008), and Tukey (1969), and some of the alternative methods of hypothesis testing, such as the Bayes-factor hypothesis tests created by Jeffreys, have been the subject of specific criticisms like those of Gelman and Rubin (1995) and Weakliem (1999). Further, hypothesis tests are not the only statistical tools that can be criticized. Every statistical tool relies on a model of some kind and ultimately seeks to do something that cannot be done in an exact and perfect way: make inferences based on limited quantities of data. Even when a great many data are available, the sample is always finite, and so the inference must be imperfect. As a result, there are valid criticisms of every statistical tool. Tests based on p-values get special attention because they are so widely used, and their misuse has contributed greatly to the "reproducibility crisis" in science and medicine.
The way p-values are used in statistical tests is based on the work of Fisher and of Neyman and Pearson in the 1930s. Fisher produced "significance tests," in which a "null hypothesis" about a parameter in a statistical distribution is tested without any consideration of what the alternative or alternatives are. Neyman and Pearson created "hypothesis tests," in which a null hypothesis is tested against a specific alternative or alternatives. In both cases, the null hypothesis is usually something like "there is no effect."1 When the null is rejected, the inference is that there is some interesting effect to be reported. In both cases, the null is rejected based on the comparison of a p-value to a "significance level." A p-value has multiple possible definitions. It is not always defined as a probability (see, for example, Schervish (2012)), but its calculation is always a probability calculation. Given an experimental result x_0, a p-value is calculated as the probability, if the null hypothesis were true, of an observation x that supports the null hypothesis as much as or less than the actual experimental result x_0 does. When the p-value is sufficiently small, the researcher decides that the actual observation x_0 is sufficiently unlikely under the null hypothesis to conclude that the null is false.
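To make the calculation concrete, here is a minimal sketch (ours, not from the article) of a two-sided binomial p-value. It uses one common convention for "as extreme or more extreme": ordering sample points by how probable they are under H.

```python
from math import comb

def binom_pmf(x, n, theta):
    """Binomial probability of x successes in n trials."""
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

def p_value_two_sided(x0, n, theta0=0.5):
    """p-value in the textbook sense: total null probability of all
    outcomes no more probable under H than the observed x0.
    (One convention among several; see the discussion in the text.)"""
    p0 = binom_pmf(x0, n, theta0)
    return sum(binom_pmf(x, n, theta0) for x in range(n + 1)
               if binom_pmf(x, n, theta0) <= p0 + 1e-12)

# e.g., 9 heads in 12 flips of a supposedly fair coin
print(round(p_value_two_sided(9, 12), 4))  # -> 0.146
```

With 9 heads in 12 flips, the outcomes no more probable than the observed one are {0, 1, 2, 3, 9, 10, 11, 12}, whose total null probability is 598/4096, roughly 0.146.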
How small is "sufficiently small"? This is where the concept of "statistical significance" comes into the story. The p-value is generally compared to some small number that ends up being the probability of incorrectly rejecting a null hypothesis that is true. In his original work on significance testing, Fisher mentioned a one-in-twenty chance (5%, or 0.05) of rejecting a correct null as a convenient cutoff for declaring a statistically significant result, but did not intend for this number to be used universally. He states in his 1956 work "Statistical Methods and Scientific Inference" that the significance level should be set according to the circumstances. In the seminal work on statistical hypothesis testing, Neyman and Pearson (1933), an attempt is made to explicitly control errors in finding the best kind of test to choose between a null hypothesis and one or more specific alternative hypotheses. Two kinds of possible errors are considered: rejecting a true null hypothesis, called "errors of the first kind" (here, "type-I errors"), and accepting a false null hypothesis, an "error of the second kind" (here, "type-II error"). Neyman and Pearson write "The use of these statistical tools in any given case, in determining just how the balance should be struck, must be left to the investigator," where "the balance" refers to the balance between the probabilities of the two types of error. Ironically enough, the very mathematical approach adopted by Neyman and Pearson makes it close to impossible for a researcher to determine "just how the balance should be struck."
Neyman and Pearson fix the probability of a type-I error, which is denoted here and in many statistics texts as α, and then prove that the test that minimizes the probability of a type-II error, here and elsewhere denoted as β,2 is a test based on comparing the ratio of the likelihoods3 of the competing hypotheses to some cutoff value, with the cutoff chosen so that the probability of incorrectly rejecting a correct null hypothesis is α. As shown in Pericchi and Pereira (2016), adopting this approach can lead to imbalances so great that the probabilities of the two types of error can differ by orders of magnitude.
In addition to the imbalance between the error probabilities of the two types described above, fixing α can lead to an even more serious problem. Because p-values tend to decrease with increasing sample size, the easiest and most common form of "p-hacking" is to keep taking data until the p-value falls below 0.05 or whatever other level of significance is chosen and fixed. No matter what level is chosen, a large-enough sample will almost inevitably lead to rejection of any null hypothesis. In the words of Berger and Delampady (1987), "In real life, the null hypothesis will always be rejected if enough data are taken because there will be inevitably uncontrolled sources of bias." According to Pericchi and Pereira (2016), more information paradoxically ends up being "a bad thing" for Neyman-Pearson hypothesis testing (and Fisherian significance testing). Fixing a lower level for declaring a statistically significant result, say, 0.005 instead of 0.05, as suggested recently by a group of 72 researchers, many of them quite prominent and some of them highly respected statisticians (Benjamin et al. (2018)), will only postpone the problem to larger sample sizes. At the dawn of the era of "Big Data," this is not a good solution. More data will still be "a bad thing" for tests with a cutoff that is lower but still fixed, in direct conflict with the commonsense idea that a result based on a larger sample should be more believable.
A definition of the p-value for a null hypothesis H and observation x_0 commonly used in statistics textbooks is represented in Pereira and Wechsler (1993) as follows4:

Definition 1 (PW2.2). The p-value is the probability, under H, of the event composed by all sample points that are at least as extreme as x_0 is.

"[S]ample points" here refers to possible experimental observations. Note that this definition does not in any way take into account the alternative hypothesis A. One problem with this definition is that in some cases, a p-value defined this way cannot even be calculated. For example, when the range of possible observations (the "sample space") consists of separate intervals, or when it is multidimensional, it can be unclear what "extreme" means. Throughout this article, "p-value," with a lower-case "p," refers to a quantity calculated using Definition 1. A contrasting definition of a quantity analogous to the small-p p-value was presented in Pereira and Wechsler (1993) and is reproduced in the next section, along with an explanation of its advantages over traditional p-values.
Another problem with commonly used hypothesis tests is that the results can be hard to interpret, and their correct interpretation depends on the intent of the experimenter. An issue that has been the subject of vigorous debate among statisticians for decades is the Likelihood Principle (LP). While it still has the status of a principle that one either accepts or does not,5 and there are greatly respected statisticians who do not accept the LP, it is still true that LP-compliant tests have intuitive appeal and can be easier to interpret. For example, Cornfield (1966), Lindley and Phillips (1976), and Berger and Wolpert (1988) show that while frequentist tests that are not LP-compliant can give conflicting results for the same data, depending on what "stopping rule" is chosen for an experiment, the results of LP-compliant tests do not depend on the stopping rule and permit a single, unique inference based on the data without needing to know the intent of the experimenters. This brings the added advantage of allowing the collected data to be used while permitting researchers to stop an experiment early. Ethics can demand an early end to an experiment, for example, in the case of a medical study in which it becomes very clear that the patients receiving a new treatment are recovering, while those in the control group are not, or when patients receiving a new treatment are suffering serious side effects. Non-LP-compliant tests would require the researchers either to carry out the experiments to their pre-planned end, possibly in conflict with medical ethics, or to throw away all the collected data upon being forced by ethical concerns to stop the experiments earlier than planned.
There are Bayesian alternatives to frequentist hypothesis tests: the tests based on Bayes factors created by Jeffreys in the 1930s (see Jeffreys (1935, 1939)), reviewed 60 years later in Kass and Raftery (1995). They are not as widely used as frequentist tests, but they have gained some acceptance in certain areas of research and are sometimes held up as a better alternative. As noted earlier, no tool is perfect, and Bayes-factor tests have also been criticized by both Bayesian and frequentist statisticians. For the purposes of this article, it is worth noting that Bayes factors, once calculated, are also compared to fixed cutoffs in Jeffreys's tests. The calculation of Bayes factors is described in the next section. For now, it is enough to know that the Bayes factor BF_HA is a measure of the evidence favoring H over A. Jeffreys proposed an initial table of grades of evidence against a null hypothesis H, with cutoffs at half-integer powers of 10, and Kass and Raftery updated the table by reducing the number of grades of evidence, noting that the Bayes factor measuring evidence against a null hypothesis H in favor of an alternative A, BF_AH, is 1/BF_HA, and compiling a table of cutoff values of BF_AH. No justification is given for the cutoffs, other than Kass and Raftery stating "From our experience, these categories seem to furnish appropriate guidelines." Jeffreys's Bayes-factor tests do not take experimental error into account, and the cutoffs, whether those proposed by Jeffreys or those proposed by Kass and Raftery, do not take sample size into account. As with frequentist hypothesis tests with fixed cutoffs, inconvenient behavior with large samples is to be expected. This manifests itself in multiple ways, including a tendency of Bayes factors to favor null hypotheses strongly for large samples, especially in cases of small effect sizes.

5 A proof of the equivalence of the LP to the combination of two much less controversial principles, the Conditionality Principle (CP) and Sufficiency Principle (SP), appears in Birnbaum (1962), and a modified version in Wechsler, Pereira, and Marques (2008), but the validity of this kind of proof has been questioned by, for example, Evans (2013) and Mayo (2014). Gandenberger (2015) presents a proof of the same equivalence designed to resist the kinds of attacks brought against Birnbaum's proof, but the controversy continues. It is worth noting that even if both proofs were incorrect, that would not mean that the LP is not equivalent to the combination of the CP and SP. Further, even if that equivalence were not valid, that would not mean that the LP is not true. Even so, the LP is still not considered proved, which is why it remains a principle that one chooses either to accept or not.

Solving Some of the Problems in Hypothesis Testing
Most readers of this article already knew before starting to read it that there are some problems with the kinds of hypothesis testing done in many fields of research. In the previous section, some of those problems have been described in a bit more detail than a news article can usually dedicate to the subject. So now what? What can a researcher do? In this section, one solution is presented, and some of its advantages over currently used hypothesis tests are described. The approach uses both frequentist and Bayesian methods and results in hypothesis tests that are operationally very similar to the commonly used tests that compare a p-value to a significance level α. The authors of this article take the position taken over 60 years ago by both a Bayesian statistician, Lindley (1957), and a frequentist statistician, Bartlett (1957): a major part of the problem with p-value-based tests is the fixing of a significance level that does not depend on the sample size. That is, the problem is not so much in the use of p-values themselves as in comparing p-values to fixed significance levels. The solution to this issue is surprisingly simple and is rooted in the presentation of Neyman and Pearson's lemma in the widely used textbook of DeGroot (1986), an author who was perhaps the greatest bridge between the frequentist and Bayesian "schools" of statistics. Instead of starting with a fixed α and determining the tests that minimize β as Neyman and Pearson did, DeGroot presented a generalized form of the Neyman-Pearson Lemma in which a linear combination of α and β is minimized. When α is then fixed, the result is the same as the one presented by Neyman and Pearson in 1933.
In its full generality, the version presented by DeGroot has a major advantage: by minimizing a linear combination of α and β, it allows the probabilities of both types of error to vary, avoiding the kind of drastic imbalance between the probabilities of the two types of errors described in Pericchi and Pereira (2016), because instead of α being fixed and β tending to decrease with increasing sample size, both error probabilities depend on the sample size. By controlling the ratio of the coefficients of α and β in the linear combination that is minimized, a researcher can actually determine "just how the balance should be struck, " realizing the vision of Neyman and Pearson in a way Neyman-Pearson tests simply cannot. Cornfield (1966) suggested optimizing tests by minimizing a linear combination of α and β.
How should the coefficients a and b in the linear combination aα+bβ be chosen? The coefficients a and b represent the relative seriousness of errors of the two types or, equivalently, relative prior preferences for the competing hypotheses. If, for example, a > b, that means type-I errors are considered more serious than type-II errors. That means incorrectly rejecting H in favor of A is considered more serious than incorrectly rejecting A in favor of H, which indicates a prior preference for H. Here is a concrete example: imagine a state in which there have been more cases of meningitis than usual, and where the governor is very budget-conscious. Take H to be the hypothesis that there is not a meningitis epidemic in the state, and A to be the competing hypothesis that there is an epidemic. The governor may consider the unnecessary spending from an incorrect rejection of H to be more serious than the consequences of not declaring an epidemic, or equivalently, favor hypothesis H over A, and so would set a > b. Decision theory allows for the underlying assumptions to be made even more explicit by going more deeply into the meaning of a and b. The details can be found in the section on "Bayes test procedures" in DeGroot (1986), but the important point here is that if the losses due to incorrect rejection of each hypothesis can be quantified, and the prior probability that H is true can be estimated, then a and b can be calculated from those numbers. It is worth mentioning that the absolute scale of a and b, and therefore the absolute scale of the losses from the two possible types of errors, do not matter; only the ratio of a and b affects the actual decision whether to reject a hypothesis.
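A minimal sketch of this bookkeeping (the function and argument names are ours): given the two losses and the prior probability of H, the implied Bayes-factor cutoff K = b/a can be computed, and rescaling both losses leaves it unchanged.

```python
def cutoff_K(w_H, w_A, prior_H):
    """Cutoff implied by minimizing a*alpha + b*beta with
    a = w_H * prior_H and b = w_A * (1 - prior_H):
    reject H when the Bayes factor BF_HA falls at or below K = b/a."""
    a = w_H * prior_H
    b = w_A * (1.0 - prior_H)
    return b / a

# governor example: incorrectly rejecting H (declaring an epidemic in
# error) judged twice as costly as the reverse error, equal prior odds
print(cutoff_K(w_H=2.0, w_A=1.0, prior_H=0.5))  # -> 0.5

# only the ratio of the losses matters, not their absolute scale
assert cutoff_K(2.0, 1.0, 0.5) == cutoff_K(20.0, 10.0, 0.5)
```

Note that a > b yields K < 1, so stronger evidence against H is demanded before rejecting it, exactly the prior preference for H described above.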
The Neyman-Pearson Lemma and the extended version of it presented by DeGroot are proved for simple-vs.-simple hypotheses, that is, for comparing specific values of a parameter. For example, for a normal (Gaussian) distribution with mean θ and variance 1, N(θ, 1), one might compare the hypotheses H : θ = 0 and A : θ = 0.7. For a given observation x, the ratio of likelihoods L(θ = 0|x)/L(θ = 0.7|x) would be compared to a cutoff chosen so that the probability under hypothesis H (that is, if x obeyed a N(0, 1) distribution) of an observation falling in the rejection region would be some fixed value α, like the commonly used 0.05 or the recently suggested 0.005. For certain types of distributions and certain types of hypotheses (see DeGroot (1986) or other statistics textbooks for details), the Neyman-Pearson Lemma can be extended to find the best tests for composite hypotheses, that is, hypotheses involving multiple values or continuous ranges of values of the parameter of interest. A Bayesian approach to extending beyond optimal simple-vs.-simple hypothesis tests offers a simple and obvious way to obtain tests in which either or both of the hypotheses may be composite, and arbitrarily complex.
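As an illustration of minimizing aα + bβ instead of fixing α, the following sketch (ours, with a = b = 1) recovers the optimal rejection cutoff for the simple-vs.-simple normal example by brute-force grid search; theory says the two likelihoods cross at x = 0.35, so rejecting H when x > 0.35 minimizes α + β.

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def risk(c, a=1.0, b=1.0):
    """a*alpha + b*beta for the test that rejects H: theta = 0
    in favor of A: theta = 0.7 when a single observation x > c."""
    alpha = 1.0 - norm_cdf(c)   # P(X > c | theta = 0)
    beta = norm_cdf(c - 0.7)    # P(X <= c | theta = 0.7)
    return a * alpha + b * beta

# grid search over candidate cutoffs c in [-2, 3)
cs = [i / 1000.0 for i in range(-2000, 3000)]
c_opt = min(cs, key=risk)
print(round(c_opt, 3))  # -> 0.35
```

The same search with a ≠ b shifts the optimal cutoff, which is exactly how the relative seriousness of the two error types enters the test.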
As usual, consider a random vector x representing experimental results, with a probability (density) function6 f(x|θ) having a parameter vector θ, with x and θ elements of the sample space X and parameter space Θ, respectively, each space having some positive integer dimensionality. The competing hypotheses H and A must partition the parameter space, that is, divide it into nonoverlapping pieces Θ_H and Θ_A such that the hypotheses can be expressed as

H : θ ∈ Θ_H and A : θ ∈ Θ_A. (1)

As long as the two pieces make up the entire space (Θ = Θ_H ∪ Θ_A) and do not overlap (Θ_H ∩ Θ_A = ∅), the hypotheses can be of any dimensionality and arbitrarily complex.

Define the binary parametric function λ(θ) as follows:

λ(θ) = 0 if θ ∈ Θ_H, and λ(θ) = 1 if θ ∈ Θ_A.

Because λ is a function of θ, one can write f(x|θ, λ) = f(x|θ). Now treat the original parameter θ as a "nuisance parameter" and remove it the Bayesian way: by taking averages of f(x|θ, λ), weighted by a prior g(θ), over the two pieces of the parameter space Θ_H and Θ_A. The result is two predictive probability (density) functions,

f_H(x) = f(x|λ = 0) = ∫_{Θ_H} f(x|θ) g(θ) dθ / ∫_{Θ_H} g(θ) dθ and
f_A(x) = f(x|λ = 1) = ∫_{Θ_A} f(x|θ) g(θ) dθ / ∫_{Θ_A} g(θ) dθ.

Using the approach based on the generalized form of the Neyman-Pearson Lemma presented by DeGroot and previously used by Cornfield, as described earlier in this section, but now with the likelihoods averaged over Θ_H and Θ_A to produce f_H and f_A, one obtains averaged error probabilities α and β that are optimal in the sense of the generalized Neyman-Pearson Lemma, and the optimal averaged α can be used as a significance level that depends strongly on the sample size.
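A numerical sketch of this averaging step, under illustrative assumptions of our choosing (a Bernoulli model with n = 10 trials, a uniform prior, and the composite hypotheses H: θ ≤ 0.5 vs. A: θ > 0.5):

```python
from math import comb

def predictive(x, n, lo, hi, steps=4000):
    """Average the binomial likelihood over theta in (lo, hi) with a
    uniform prior, normalized over that piece (midpoint rule)."""
    h = (hi - lo) / steps
    total = sum(comb(n, x) * t**x * (1 - t)**(n - x)
                for i in range(steps)
                for t in [lo + (i + 0.5) * h]) * h
    return total / (hi - lo)

n = 10
f_H = [predictive(x, n, 0.0, 0.5) for x in range(n + 1)]  # H: theta <= 0.5
f_A = [predictive(x, n, 0.5, 1.0) for x in range(n + 1)]  # A: theta > 0.5
# each predictive is itself a probability function over x: sums to 1
print(round(sum(f_H), 4), round(sum(f_A), 4))  # -> 1.0 1.0
```

Once f_H and f_A are in hand, the composite problem has been reduced to a simple-vs.-simple one, and the generalized lemma applies to them directly.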
As stated in the previous section, one problem with hypothesis tests comes from the use of definitions like Definition 1 to calculate p-values. A second definition, which, unlike Definition 1, takes into account the alternative hypothesis A, is presented in Pereira and Wechsler (1993). As in the previous definition, x_0 represents the observation and H is the null hypothesis.

Definition 2 (PW2.1). The P-value is the probability, under H, of the event composed by all sample points that favor A (against H) at least as much as x_0 does.
Note that the quantity defined here is a capital-P "P-value," to distinguish it from the small-p "p-values" defined by Definition 1. The P-value has the advantage that, unlike a small-p p-value, it can be calculated for arbitrarily complex hypotheses that lead to arbitrarily complex rejection regions (regions where p < α, where the null hypothesis would be rejected if the experimental observations were to occur there). However, to do so, it requires an ordering of the space of possible observations, the sample space, according to how much each possible observation favors one of the hypotheses over the other. The new approach, like Montoya-Delgado et al. (2001), uses the Bayes factor to order the sample space according to how much each point x favors H over A. Then, if the experiment yields a result x_0, the P-value is calculated as the probability, given probability (density) function f_H(x), of a point in the sample space favoring A as much as or more than x_0 does. That is, it is the sum or integral of the probability (density) function f_H(x) over the part of the sample space where the Bayes factor BF_HA(x) is less than or equal to the Bayes factor calculated at x_0. Calling that part of the sample space Ψ(x_0), it is defined as

Ψ(x_0) = {x ∈ X : BF_HA(x) ≤ BF_HA(x_0)},

and the P-value is

P = P(x ∈ Ψ(x_0) | H) = ∫_{Ψ(x_0)} f_H(x) dx,

with the integral replaced by a sum for discrete sample spaces. There is a one-to-one correspondence between the P-values used in this approach and the Bayes factors used to calculate them, so once a cutoff for P (a significance level α) is determined, a corresponding cutoff for Bayes factors at the same significance level can be determined. As a result, researchers who are more comfortable with Bayes-factor tests can continue to use them, but with a cutoff determined by this method, which takes experimental error into account and depends on the sample size, rather than with arbitrarily defined cutoffs like those of Jeffreys (1935) or Kass and Raftery (1995).
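In a discrete sample space the recipe reduces to a few lines. The sketch below (ours; the three-point sample space and its probabilities are purely illustrative) orders the space by the Bayes factor and sums f_H over the points favoring A at least as much as the observed one:

```python
def p_value(f_H, f_A, x0):
    """Capital-P P-value: sum f_H over the sample points whose Bayes
    factor BF_HA = f_H/f_A is at or below that of the observed x0."""
    bf = {x: f_H[x] / f_A[x] for x in f_H}
    psi = [x for x in f_H if bf[x] <= bf[x0]]  # favor A at least as much as x0
    return sum(f_H[x] for x in psi)

# toy three-point sample space with made-up predictive probabilities
f_H = {"a": 0.7, "b": 0.2, "c": 0.1}
f_A = {"a": 0.2, "b": 0.3, "c": 0.5}
# BF: a -> 3.5, b -> 0.667, c -> 0.2; observing "b" includes {"b", "c"}
print(round(p_value(f_H, f_A, "b"), 4))  # -> 0.3
```

Because the ordering is by Bayes factor, a cutoff on P translates directly into a cutoff on BF_HA, which is the one-to-one correspondence mentioned above.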

Examples
Here some specific examples of applications of the new tests are presented. An important point emphasized throughout is the way the significance level varies with sample size.

Bernoulli ("Yes-No") Examples
Though they are the simplest models, Bernoulli models, which are used for experiments with a "yes-no" result, like "heads-or-tails" coin flips, are very important. This is because many medical studies, among other kinds of experiments, fall into this category, with "yes" indicating, for example, a recovery from an illness or injury. The unknown parameter is the proportion or probability of success, and hypothesis tests are applied to that parameter in different ways, depending on the purpose of the study.

Comparing Two Proportions
A physician wants to show that the incorporation of a new technology in a treatment can produce better results than the conventional treatment. He plans a clinical trial with two arms, case and control, each with eight patients. The case arm receives the new treatment and the control arm receives the conventional one. The observed results in this example are that only one of the patients in the control arm responded positively, but in the case arm there were four positive outcomes. The most common classical significance tests result in the following p-values: the Pearson χ2 p-value is 0.106, changed to 0.281 with the Yates continuity correction applied, and Fisher's exact p-value is 0.282. Traditional analysts would conclude that there were no statistically significant differences between the two treatments, using any of the canonical significance levels. Note that these procedures were for testing a sharp hypothesis against a composite alternative, H : θ_0 = θ_1 vs. A : θ_0 ≠ θ_1, comparing the proportions of success for the two treatments. The hypothesis H is precise or "sharp," representing a line in the two-dimensional (θ_0, θ_1) parameter space. Here, the proposed P-value and the optimal significance level α(δ*) are calculated to choose one of the hypotheses using the new tests.
To be fair in our comparisons, we consider independent uniform (noninformative) prior distributions for θ_0 and θ_1. With these suppositions and the likelihoods being binomials with sample sizes n = 8, the predictive probability functions under the two hypotheses are

f_H(x, y) = C(8, x) C(8, y) B(x + y + 1, 17 - x - y) and
f_A(x, y) = [C(8, x) B(x + 1, 9 - x)] [C(8, y) B(y + 1, 9 - y)] = 1/81,

where C(n, k) is the binomial coefficient, B(·, ·) is the Beta function, and (x, y) ∈ {0, 1, . . . , 8} × {0, 1, . . . , 8} represents the possible observed values of the number of positive outcomes in the two arms of the study. Table 1 shows the Bayes factors for all possible results.
To obtain the P-value, define the set Ψ_obs of sample points (x, y) for which the Bayes factors are smaller than or equal to the Bayes factor of the observed sample point, that is,

Ψ_obs = {(x, y) ∈ {0, 1, . . . , 8} × {0, 1, . . . , 8} : BF(x, y) ≤ BF_obs}, (9)

and then the P-value is the sum of the prior predictive probabilities under H in Ψ_obs:

P = Σ_{(x,y) ∈ Ψ_obs} f_H(x, y). (10)

Recalling the observed result of the clinical trial, (x, y) = (1, 4), the observed Bayes factor is BF_obs = 0.661. Based on this, the P-value is P = 0.0923. The test δ* minimizes the linear combination aα(δ) + bβ(δ). The Bayes factor is compared to the constant K, the ratio of the coefficients: K = b/a. Then, define the set

Ψ* = {(x, y) ∈ {0, 1, . . . , 8} × {0, 1, . . . , 8} : BF(x, y) ≤ K},

and the optimal averaged error probabilities from the generalized Neyman-Pearson Lemma are

α(δ*) = Σ_{(x,y) ∈ Ψ*} f_H(x, y) and β(δ*) = Σ_{(x,y) ∉ Ψ*} f_A(x, y).

In DeGroot (1986), it is shown using decision theory that minimizing the linear combination w_H π α(δ) + w_A (1 - π) β(δ), where w_H is the expected loss from choosing to accept hypothesis A when H is true, w_A is the corresponding loss from accepting H when A is true, and π is the prior probability that H is true, leads to the cutoff K = w_A (1 - π)/(w_H π). Taking the hypotheses to be equally likely a priori, π = 1/2, and representing equal severity of type-I and type-II errors by taking w_H = w_A = 1, the result is K = 1. The set Ψ* is identified by the cells with boldface numbers in Table 1. The observed Bayes factor is in boldface italics. The optimal significance level is α(δ*) = 0.1245 and the optimal averaged type-II error probability is β(δ*) = 0.4815. The high type-II error probability is completely expected for small samples. Contrary to the classical results, the conclusion is now the most intuitive one: the null hypothesis is rejected because P < α(δ*). However, the rejection is only at the 12.45% level of significance. So what sample size would be necessary to obtain some better (lower) significance level, say 10%?
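The whole calculation can be reproduced in a few lines. The sketch below (ours) rebuilds f_H, f_A, and the Bayes-factor ordering for the 8-patient arms; the printed values should track P, α(δ*), and β(δ*) as reported in the text, though exact figures can depend on conventions chosen for the priors and the Bayes factor.

```python
from math import comb, factorial

def beta_fn(a, b):
    """Beta function B(a, b) for positive integer arguments."""
    return factorial(a - 1) * factorial(b - 1) / factorial(a + b - 1)

n = 8
grid = [(x, y) for x in range(n + 1) for y in range(n + 1)]
# under H (theta0 = theta1, uniform prior on the common theta)
f_H = {(x, y): comb(n, x) * comb(n, y)
       * beta_fn(x + y + 1, 2 * n - x - y + 1) for (x, y) in grid}
# under A (independent uniform priors): each arm averages to 1/(n+1)
f_A = {xy: 1.0 / (n + 1) ** 2 for xy in grid}
bf = {xy: f_H[xy] / f_A[xy] for xy in grid}

obs = (1, 4)                       # observed trial result
P = sum(f_H[xy] for xy in grid if bf[xy] <= bf[obs])
K = 1.0                            # a = b: equally serious errors
alpha = sum(f_H[xy] for xy in grid if bf[xy] <= K)
beta = sum(f_A[xy] for xy in grid if bf[xy] > K)
print(round(P, 4), round(alpha, 4), round(beta, 4))
```

Note how α(δ*) falls out of the same ordering as P, so checking P < α(δ*) is equivalent to checking BF_obs ≤ K.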

Comparing Two Proportions With Varying Sample Sizes
Consider first a clinical trial just like the one in the previous example, but now with arms of size n = 20. The observed result is (x, y) = (4, 10), that is, four patients had a positive result in the control arm, while 10 had a positive result in the arm receiving the new treatment. The same calculations done for the previous section are repeated, but with the appropriate expressions for f_H and f_A for a trial with two 20-patient arms. The observed Bayes factor in this case is BF_obs = 0.415, which leads to significance index P = 0.02901, optimal significance level α(δ*) = 0.0995, and type-II error probability β(δ*) = 0.3651. The classical χ2 p-value is p = 0.0467, indicating rejection of the null hypothesis at the canonical 5% significance level. The new test also rejects H, because P < α(δ*). The same analysis can be done to calculate the optimal significance level and type-II error probability for any sample size. Figure 1 shows the optimal adaptive significance level α and optimal adaptive type-II error probability, plus the minimized linear combination α + β, as functions of the size of each arm in a study. Table 2 shows the optimal adaptive averaged error probabilities α and β for various arm sizes without the restriction that the two arms have equal size. For a given total sample size, an unbalanced sample can have higher probabilities of both type-I and type-II errors than a balanced sample. For example, the error probabilities for an unbalanced sample with n_1 = 60 and n_2 = 20 are larger than those for a balanced sample with n_1 = n_2 = 40, even though both experiments would have the same total sample size, n_1 + n_2 = 80. The effect of unbalanced samples can be as important as the effect of total sample size.
For example, the error probabilities of an unbalanced sample with n 1 = 60 and n 2 = 10 are larger than those of a balanced sample with n 1 = n 2 = 20, even though the unbalanced sample has a total size of n 1 + n 2 = 70 and the balanced sample just 40.
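The decrease of the optimal significance level with sample size can be seen directly by repeating the calculation for several balanced arm sizes (a sketch under the same assumptions as the previous example: uniform priors and K = 1):

```python
from math import comb, factorial

def beta_fn(a, b):
    """Beta function B(a, b) for positive integer arguments."""
    return factorial(a - 1) * factorial(b - 1) / factorial(a + b - 1)

def optimal_alpha(n):
    """Optimal adaptive level alpha(delta*) for two balanced arms of
    size n, uniform priors, K = 1 (f_A is constant, so BF <= 1 is
    equivalent to f_H <= f_A)."""
    grid = [(x, y) for x in range(n + 1) for y in range(n + 1)]
    f_H = {(x, y): comb(n, x) * comb(n, y)
           * beta_fn(x + y + 1, 2 * n - x - y + 1) for (x, y) in grid}
    f_A = 1.0 / (n + 1) ** 2
    return sum(p for p in f_H.values() if p / f_A <= 1.0)

for n in (8, 20, 40):
    print(n, round(optimal_alpha(n), 4))
```

Unlike a fixed 0.05 or 0.005, the level printed here shrinks as the arms grow, which is the adaptive behavior the article advocates.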

Test for One Proportion and the Likelihood Principle
A common example in which the Likelihood Principle can be violated is the comparison of binomials to negative binomials in a coin-flipping experiment with a coin that may or may not have the expected 50-50 chance of coming up "heads," as described in Lindley and Phillips (1976).7 For the same values of x, the number of successes in n independent coin flips, the two distributions produce different p-values, which can lead to different decisions at a given level of significance. That is, the inference can actually be different for exactly the same data, depending on the intent of the researcher before starting the experiment. If a result is nine heads and three tails, did the researcher start the experiment planning to flip the coin exactly 12 times? Was the intent to flip until the 9th occurrence of heads? Until the third occurrence of tails? Was it some other stopping rule?

Table 2. Optimal averaged error probabilities α(δ*) and β(δ*) for comparison of two proportions for various arm sizes n_1 and n_2 in a two-arm medical study. Calculations were performed with a = b.

In this example, the new tests are applied to show that they do not violate the Likelihood Principle. The reason the inference (decision to accept or reject a hypothesis about the parameter θ) ends up being the same for different models is that although the P-values for the two models are different from each other, the adaptive significance levels α for the two models are also different, and the decision about rejecting one hypothesis in favor of the other ends up being the same. Using different notation from the previous example, let the sample vector consist of the number of successes x and the number of failures y, and let the corresponding vector of probabilities be (θ, 1 - θ). Take H : θ = 1/2 vs. A : θ ≠ 1/2, that is, a fair coin vs. an unbalanced coin, as the hypotheses to be tested.
Taking a uniform prior for θ, taking the two hypotheses to be equally probable a priori (π = 1/2), and considering the two types of error equally severe, the predictive probabilities for the tests are as follows.
For a binomial, f_H(x, y) = C(x+y, x) (1/2)^(x+y) and f_A(x, y) = C(x+y, x) B(x+1, y+1), and for a negative binomial (stopping at the xth success), f_H(x, y) = C(x+y−1, y) (1/2)^(x+y) and f_A(x, y) = C(x+y−1, y) B(x+1, y+1), where B(·, ·) denotes the beta function. The Bayes factors, (1/2)^(x+y) / B(x+1, y+1), are equal for the two models, and since using the lemma will lead to comparing them to the same constant, the decisions about the hypothesis H : θ = 1/2 end up being the same. The P-values and significance levels α are different for the two models, but the inference ends up being the same. Considering the observations (x, y) = (3, 10) and (x, y) = (10, 3) for a binomial, both samples yield the same results: P = 0.02, where the optimal error probabilities are α = 0.09 and β = 0.43. For a negative binomial, the stopping rules for the two samples differ: for the first (second) sample, one stops observing when the number of successes reaches 3 (reaches 10). For the first sample point, the P-value is P = 0.01, and the relevant error probabilities are α = 0.18 and β = 0.48. For the second sample, P = 0.01, and the error probabilities are α = 0.12 and β = 0.33. The decisions made for binomials are the same as those for negative binomials with the same (x, y). This behavior is much more general than this specific example; in fact, it is proven in Pereira et al. (2017) that the new tests comply with the Likelihood Principle for any discrete sample space.
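The equality of the Bayes factors can be checked directly. Below is a minimal sketch (the function names are illustrative, not from the paper), using the identity B(x+1, y+1) = x! y! / (x+y+1)! for the uniform-prior predictive:

```python
from math import comb, factorial

def beta_fn(a, b):
    # Beta function for integer arguments: B(a, b) = (a-1)!(b-1)!/(a+b-1)!
    return factorial(a - 1) * factorial(b - 1) / factorial(a + b - 1)

def bayes_factor_binomial(x, y):
    f_H = comb(x + y, x) * 0.5 ** (x + y)         # n = x + y fixed in advance
    f_A = comb(x + y, x) * beta_fn(x + 1, y + 1)  # uniform prior on theta
    return f_H / f_A

def bayes_factor_negbinomial(x, y):
    f_H = comb(x + y - 1, y) * 0.5 ** (x + y)     # stop at the x-th success
    f_A = comb(x + y - 1, y) * beta_fn(x + 1, y + 1)
    return f_H / f_A

# The combinatorial factor cancels, so the two Bayes factors coincide
print(bayes_factor_binomial(3, 10), bayes_factor_negbinomial(3, 10))
```

The cancellation of the model-specific combinatorial factor is exactly why the decision, which compares the Bayes factor to a constant, is the same under both stopping rules.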

An Important Note About "Yes-No" Experiments and the New Tests
The predictive densities under many common hypotheses in Bernoulli experiments can be calculated analytically for binomial and negative binomial models when Beta priors are used.
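For instance, under a Beta(a, b) prior the binomial predictive takes the closed beta-binomial form. A small illustrative sketch (the function names are ours, not the paper's):

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    # log B(a, b) via log-gamma, numerically stable for non-integer a, b
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def predictive_binomial(x, n, a, b):
    # Beta-binomial predictive: P(x | n) = C(n, x) B(x + a, n - x + b) / B(a, b)
    return comb(n, x) * exp(log_beta(x + a, n - x + b) - log_beta(a, b))

# Under a uniform Beta(1, 1) prior, every x in 0..n is equally likely: 1/(n+1)
probs = [predictive_binomial(x, 12, 1.0, 1.0) for x in range(13)]
print(probs)
```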

Tests in Normal Distributions
Normal distributions are very widely used because the sample means of large random samples tend to behave like normally distributed variables. This is the familiar tendency of large-sample distributions to look like "bell curves," described mathematically by the central limit theorem. Details can be found in DeGroot (1986).

Test of a Mean in a Normal Distribution
Consider iid variables X1, . . . , Xn | μ that obey a normal distribution with mean μ and variance 1. Take as a prior for μ a normal distribution with mean m and variance v. Instead of working with the full n-dimensional sample space, the minimal sufficient statistic X̄n, the sample mean, can be used. The sampling distribution of X̄n | μ is normal with mean μ and variance 1/n. For hypotheses H : μ = 0 and A : μ ≠ 0, the predictive distributions are X̄n | H ∼ N(0, 1/n) and X̄n | A ∼ N(m, v + 1/n).
To calculate the probability of a type-I error, define the region X_A = {x̄n : f_H(x̄n)/f_A(x̄n) ≤ 1} and evaluate α = ∫_{X_A} f_H(x̄n) dx̄n. For the probability of a type-II error, define the region X_H = {x̄n : f_H(x̄n)/f_A(x̄n) > 1} and evaluate β = ∫_{X_H} f_A(x̄n) dx̄n. These error probabilities are plotted in Figure 2 for v = 100 and two values of the prior mean: m = 0 and m = 10. The plot for m = −10 would be identical to the plot for m = 10.
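Because both predictive densities are normal, the boundary points where f_H = f_A solve a quadratic equation, and α and β reduce to normal tail probabilities. A sketch of the computation, with v = 100 as in Figure 2 (illustrative code, not the authors'):

```python
import math

def norm_cdf(x, mu, var):
    # Normal CDF via the standard-library error function
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2.0 * var)))

def error_probs(n, m, v):
    # Predictive variances: X_bar | H ~ N(0, 1/n), X_bar | A ~ N(m, v + 1/n)
    s0, s1 = 1.0 / n, v + 1.0 / n
    # Points where f_H(x) = f_A(x) solve a*x^2 + b*x + c = 0
    a = 1.0 / s1 - 1.0 / s0
    b = -2.0 * m / s1
    c = m * m / s1 + math.log(s1 / s0)
    disc = math.sqrt(b * b - 4.0 * a * c)
    r1, r2 = sorted([(-b - disc) / (2.0 * a), (-b + disc) / (2.0 * a)])
    # f_H > f_A on (r1, r2): accept H inside, reject outside
    alpha = norm_cdf(r1, 0.0, s0) + 1.0 - norm_cdf(r2, 0.0, s0)  # type I
    beta = norm_cdf(r2, m, s1) - norm_cdf(r1, m, s1)             # type II
    return alpha, beta

for n in (10, 100, 1000):
    print(n, error_probs(n, 0.0, 100.0))  # both errors shrink as n grows
```

Unlike a fixed-α test, both error probabilities decrease together as n grows, which is the sample-size dependence the article advocates.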

Test of a Variance in a Normal Distribution
This is an example used by Pereira and Wechsler (1993), showing that the region of the space of possible results where a test rejects the null hypothesis is not always the tails of the null distribution; it can be a union of disjoint intervals. In such cases, it can be impossible to calculate a classical p-value defined as in Definition 1, but the ordering of the entire sample space by Bayes factors allows for an unambiguous definition and calculation of the new index, a capital-P P-value in the sense of Definition 2. Consider a normally distributed random variable X with mean zero and unknown variance σ². The hypotheses considered here are H : σ² = 2 and A : σ² ≠ 2. A χ²₁ (chi-squared with one degree of freedom) distribution is used as a prior density for σ². The predictive densities under hypotheses H and A are

f_H(x) = (1/√(4π)) exp(−x²/4)  and  f_A(x) = 1/[π(1 + x²)],   (15)

a normal density with mean zero and variance 2, and a Cauchy density, respectively. Figure 3 shows a plot of the Bayes factor, using a value of 1.1 as a cutoff for a decision about the hypotheses. The sample points that do not favor H lie in three separate regions: a central interval and the heavy tails of the Cauchy density. The set that favors H is the complement, made up of two intervals between the central interval and the tails. The set that favors A over H thus includes, in addition to the tails, the central region (−0.6, 0.6). Even with this or more complex divisions of the sample space into regions that favor one hypothesis over another, the new method allows for calculation of a P-value.
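A numerical sketch of this example (illustrative code, with approximate boundary values): scanning the Bayes factor f_H(x)/f_A(x), where f_H is the N(0, 2) density and f_A the standard Cauchy density, against the 1.1 cutoff recovers the two disjoint intervals that favor H.

```python
import math

def f_H(x):
    # Predictive density under H: normal with mean 0, variance 2
    return math.exp(-x * x / 4.0) / math.sqrt(4.0 * math.pi)

def f_A(x):
    # Predictive density under A: standard Cauchy
    return 1.0 / (math.pi * (1.0 + x * x))

def region_boundaries(cutoff=1.1, lo=-10.0, hi=10.0, step=1e-4):
    # Scan for sign changes of BF(x) - cutoff to locate region boundaries
    boundaries = []
    x = lo
    prev = f_H(x) / f_A(x) - cutoff
    while x < hi:
        x += step
        cur = f_H(x) / f_A(x) - cutoff
        if prev * cur < 0:
            boundaries.append(round(x, 3))
        prev = cur
    return boundaries

# Four boundary points, roughly (-2.8, -0.6, 0.6, 2.8): H is favored on the
# two intervals between the central region and the heavy Cauchy tails.
print(region_boundaries())
```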

Test of Hardy-Weinberg Equilibrium
"Hardy-Weinberg equilibrium" refers to the principle, proven by Hardy (1908) and Weinberg (1908), that allele and genotype frequencies in a population will remain constant from generation to generation, given certain assumptions about the absence of external evolutionary influences. An individual's genotype is determined by a combination of alleles. If there are two possible alleles for some characteristic (say A and a), the possible genotypes are AA, Aa, and aa. Let x1, x2, x3 be the observed frequencies of the genotypes AA, Aa, and aa, respectively, and θ1, θ2, θ3 the corresponding probabilities. Under a few premises, described by Hartl and Clark (1989), the principle says that the allele probabilities in a population do not change from generation to generation; it is fundamental for the Mendelian mating allelic model. If the probabilities of the alleles are θ for allele A and 1 − θ for allele a, the expected genotype probabilities are (θ², 2θ(1 − θ), (1 − θ)²), 0 ≤ θ ≤ 1.
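As a small illustration of the principle (the helper names are ours): the genotype probabilities follow from the allele probability θ, and the allele frequency implied for the next generation is again θ.

```python
def genotype_probs(theta):
    # Hardy-Weinberg expected genotype probabilities (AA, Aa, aa)
    return theta ** 2, 2 * theta * (1 - theta), (1 - theta) ** 2

def next_gen_allele_freq(theta):
    # Frequency of allele A among offspring: all of AA plus half of Aa
    p_AA, p_Aa, p_aa = genotype_probs(theta)
    return p_AA + 0.5 * p_Aa

theta = 0.3
print(genotype_probs(theta))        # approximately (0.09, 0.42, 0.49)
print(next_gen_allele_freq(theta))  # equals theta, up to float rounding
```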

Final Considerations
Using the hypothesis-testing procedure described here, the sample-size dependence of the optimal averaged α can be used to determine the best significance level for a given n, which can be relevant in studies where a limited number of trials can be done, or to determine the necessary sample size to achieve a desired α.
The procedure can be applied to other models and hypotheses, without restrictions on the dimensionality of the parameter space or the sample space. The sample space, regardless of its dimensionality, is ordered by a single real number, the Bayes factor. In all the examples used in this article, analytic (closed-form mathematical) solutions are available, but this is not the case for all distributions and hypotheses. For example, hypotheses about a normal distribution with unknown mean and variance involve calculations of significantly greater complexity and require analytic approximations or numerical methods for evaluating the necessary integrals, such as Monte Carlo methods. Note that the method requires the calculation of five integrals: two over the nuisance parameter(s) to find the predictive distributions f_H(x) and f_A(x), and three of predictive distributions over specific regions of the sample space to obtain α, β, and the capital-P P-value.
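As a sketch of how a Monte Carlo evaluation of one of these integrals might look (illustrative code, not the authors'): here α is estimated for the variance-test example above, where f_H is a N(0, 2) density, by sampling from f_H and counting the fraction of draws whose Bayes factor falls below the equal-weights cutoff of 1.

```python
import math
import random

random.seed(0)

def f_H(x):
    # Predictive density under H: normal with mean 0, variance 2
    return math.exp(-x * x / 4.0) / math.sqrt(4.0 * math.pi)

def f_A(x):
    # Predictive density under A: standard Cauchy
    return 1.0 / (math.pi * (1.0 + x * x))

# Estimate alpha = P(reject H | H): sample from f_H and count how often
# the Bayes factor f_H(x)/f_A(x) falls below the cutoff of 1.
N = 200_000
count = 0
for _ in range(N):
    x = random.gauss(0.0, math.sqrt(2.0))
    if f_H(x) / f_A(x) < 1.0:
        count += 1
alpha_hat = count / N
print(alpha_hat)
```

The same recipe, sampling from the appropriate predictive distribution and counting membership in the relevant region, applies to β and to the P-value when closed-form integration is unavailable.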
It is worth noting that the approach described here is compatible with the guidelines in the ASA's statement on p-values (Wasserstein and Lazar 2016). Specifically, because the significance level depends on the sample size, and therefore is not the kind of predefined "bright-line" rule the ASA recommends avoiding, the approach is compatible with point 3: "Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold."

Funding
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Finance Code 001. C.A. de B.P. thanks the Conselho Nacional de Desenvolvimento Científico e Tecnológico, CNPq, for support under grant number 308776/2014-3.