Strong-Form Frequentist Testing in Communication Science: Principles, Opportunities, and Challenges

ABSTRACT This paper discusses ‘strong-form’ frequentist testing as a useful complement to null hypothesis testing in communication science. In a ‘strong-form’ set-up a researcher defines a hypothetical effect size of (minimal) theoretical interest and assesses to what extent her findings falsify or corroborate that particular hypothesis. We argue that the idea of ‘strong-form’ testing aligns closely with the ideals of the movements for scientific reform, discuss its technical application within the context of the General Linear Model, and show how the relevant P-value-like quantities can be calculated and interpreted. We also provide examples and a simulation to illustrate how a strong-form set-up requires more nuanced reflections about research findings. In addition, we discuss some pitfalls that might still hold back strong-form tests from widespread adoption.

There are two well-known issues with this approach, both of which hamper its value as a means for stringent hypothesis testing. First, NHST leaves much of the actual statistical logic to conventional decision rules and the behavior of software. Researchers choose a statistical model for the data at hand and report the results as expected by APA standards, but they rarely confront the inferential meaning of their results. They stick to a binary conclusion of a finding being 'significant' or not, which ignores more fundamental reflections about the strength of a finding or its theoretical relevance. As a consequence, the evidence gleaned from NHST, in particular the P-value, is known to be often misrepresented and overblown (e.g., Greenland, Senn, Rothman, Carlin, Poole, Goodman, & Altman, 2016; Nickerson, 2000; Vermeulen et al., 2015). A second, and even more fundamental, problem with NHST is that it is generally unable to say anything useful about hypotheses of substantive interest. This is true even when the P-values derived from the test are, technically speaking, correctly interpreted: the nil hypothesis is set up as a straw man (i.e., the claim that not even the tiniest effect or relation exists between the variables of interest), and its rejection (or lack thereof) tells us little about the types of substantive hypotheses researchers are typically interested in (e.g., De Groot, 1956; Meehl, 1967, 1990). Indeed, when pressed, communication researchers would probably not claim to be interested in infinitesimally small effects. Much of communication science is conducted within an aura of social relevance. Time and money are spent on investigating the impact of communication styles and technology because it is believed that there might, in fact, be substantial and replicable effects that require regulation and/or intervention. P-values derived from NHST seem of little help in such a scenario.
In light of these issues, many voices have been calling for statistical reforms that move away from NHST, the use of P-values, and even frequentist statistics in general (for an overview, see the special issue of the American Statistician: Wasserstein et al., 2019). In this paper, we will not take such a fundamental stance. Rather, we will try to familiarize communication scientists with one line of thought within the reform movement that remains close to the familiar P-value concept but aligns it with the falsificationist ideal of stringent hypothesis testing. After Meehl (1967, 1990), we label this approach 'strong-form' frequentist testing, although its core principles reside under different names as well (e.g., 'severe testing': Mayo, 2018; Mayo & Spanos, 2006; 'minimal effect testing': Murphy & Myors, 1999; '(non-)equivalency testing': Weber & Popova, 2012; Wellek, 2010). The basic idea of strong-form testing is straightforward: instead of testing a nil-null hypothesis, we simply set up a test for a statistical hypothesis of (minimal) theoretical interest. This provides us with a different kind of P-value, which, in this paper, we choose to denote by Ṗ, that may be more directly interpreted as a measure of falsifying or corroborating evidence (see also Greenland, 2019; Haig, 2020): it provides information on how much evidence the data offer to refute (or corroborate) the hypothesized effect size of (minimal) interest.
In what follows, we first provide a refresher of key concepts within the frequentist hypothesis testing framework, in particular sampling distributions, test statistics, P-values, and statistical power. Readers who are knowledgeable about frequentist inference will find the information in this section familiar and may decide to skip it entirely; the value of this part mainly lies in (1) gradually introducing the logic and terminology of frequentist statistics (which seems useful for non-expert audiences), (2) clarifying some terminological choices made in the paper, and (3) showing that the 'strong-form' construal flows naturally from a properly conceptualized frequentist logic. In the second part of the paper we formulate general guidelines for the application of strong-form testing within the context of the General Linear Model: we show how Ṗ can be calculated, discuss some important considerations when interpreting Ṗ, and illustrate its behavior through a simulation. In the last part of the paper, we consider some technical challenges for strong-form testing in more complicated use cases, and we suggest pathways for future developments.

The meaning of P
While the P-value lies at the heart of statistical inference in communication science, the concept is notoriously misunderstood (Vermeulen et al., 2015). Some widespread misconceptions are, for instance, that P-values represent the probability that the null hypothesis is true, that they reflect the probability of results being replicable, or that small P-values imply practically important effects (e.g., Greenland et al., 2016; Nickerson, 2000). If we want to understand what P-values represent, we first need to be familiar with two key concepts from frequentist statistics: sampling distributions and test statistics. Sampling distributions form the building blocks of frequentism; they represent the theoretical distributions of sample statistics (e.g., means, regression coefficients, correlation coefficients, . . .) that would, hypothetically, arise from drawing infinitely many random samples of size n from a given population characterized by a population parameter δ (e.g., a population mean or regression coefficient). To illustrate what this means, assume we are investigating video game play among teenagers: we randomly sample, say, 1,000 10- to 19-year-olds and find that mean daily game play is 42.545 minutes. Of course, we know that this estimate only represents the mean in our particular sample; had we drawn a different sample, we would most likely have gotten a different estimate, for instance, 43.123 minutes. Thus, if we were to randomly draw 1,000 10- to 19-year-olds from the population an infinite number of times, we would end up with a continuous distribution representing all possible values for the sample mean; this distribution is called the sampling distribution. It can be shown that the sampling distribution of a mean can be expressed mathematically by the curve, called the probability density function, of a normal distribution centered around the true value of the population mean.
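To make the idea of a sampling distribution concrete, the game-play example above can be simulated. The following is a minimal Python sketch; the population (a right-skewed gamma distribution with a mean of 42.5 minutes and a standard deviation of about 30) is our own illustrative assumption, not a figure from the paper.

```python
# Simulating the sampling distribution of a mean (illustrative sketch;
# the population values below are assumptions, not data from the paper).
import numpy as np

rng = np.random.default_rng(1)

n = 1000            # sample size, as in the running example
n_samples = 10_000  # a large stand-in for "infinitely many" samples

# A right-skewed population of daily game-play minutes:
# gamma(shape=2, scale=21.25) has mean 42.5 and sd of about 30.
shape, scale = 2.0, 21.25

# Draw many random samples of size n and record each sample mean.
sample_means = np.array(
    [rng.gamma(shape, scale, size=n).mean() for _ in range(n_samples)]
)

# Even though the population itself is skewed, the sample means pile up
# symmetrically around the population mean (Central Limit Theorem),
# with a spread close to the standard error sd / sqrt(n).
print(round(sample_means.mean(), 1))  # close to 42.5
print(round(sample_means.std(), 2))   # close to 30 / sqrt(1000), i.e. about 0.95
```

Plotting a histogram of `sample_means` would reproduce the bell shape described in the text, even though no individual teenager's game play is normally distributed.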
This result, known as the Central Limit Theorem, plays a crucial role in frequentist statistics. The reason for this is that it can be referenced to derive the sampling distributions of various other types of statistics; examples are differences between means (normal), regression coefficients (normal), variances (scaled chi-square), and all of their related test statistics. Test statistics are standardized summaries of data, constructed as a ratio of a statistic and its (estimated) variance or 'error.' In traditional frequentist applications, we will often be working with sampling distributions of these test statistics, and not with those of the 'raw' statistics on which the test statistics were initially based.
Once we have a proper understanding of test statistics and sampling distributions, we naturally arrive at the set-up of a frequentist hypothesis test. That is, if we can derive that test statistics follow a mathematically defined sampling distribution parametrized by a given population parameter (denoted by δ), then we may (1) hypothesize a value d for that population parameter (e.g., the mean standardized difference in aggression scores between the violent and nonviolent video game group is a Cohen's d of, say, 0 or 0.30), (2) derive the sampling distribution of test statistics if δ = d, and (3) evaluate where the observed test statistic (t) lies within that sampling distribution. The position of the observed test statistic t within the sampling distribution is typically expressed through the probability of observing an even more extreme test statistic than the one observed, assuming that the tested hypothesis were true. This quantity is known as the P-value. Symbolically, then, a typical (right-sided) P-value can be written as¹

P = Pr(T > t; δ = d),   (1)

where Pr(T > t; δ = d)² is read as 'the probability (Pr) of randomly drawing a test statistic T larger than the observed test statistic t, assuming that the population parameter δ is equal to d.' If the observed test statistic is relatively extreme within the sampling distribution under δ = d, P will be small. This can be interpreted as t being evidence against the hypothesis that δ = d. If the observed test statistic is not extreme within the sampling distribution under δ = d, P will be large. In this case, we may not consider t as evidence against the hypothesis that δ = d.

¹ For the sake of simplicity of these expressions we assume d > 0 throughout the paper, unless stated otherwise.
² Most texts would prefer Pr(T ≥ t; δ = d). When talking about continuous distributions (as we will be doing throughout the paper) this is the same thing: the probability at any single point in a continuous distribution is zero.
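A right-sided P-value in the sense of Expression (1) can be computed directly from the relevant sampling distribution. Below is a minimal Python sketch for a two-sample t test with the tested hypothesis δ = 0 (the familiar nil-null case); the simulated groups and their sizes are illustrative assumptions.

```python
# Right-sided P-value, Pr(T > t; delta = 0), for a two-sample t test.
# The simulated groups are illustrative assumptions, not data from the paper.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
violent = rng.normal(0.3, 1.0, size=200)     # hypothetical aggression scores
nonviolent = rng.normal(0.0, 1.0, size=200)

t_obs = stats.ttest_ind(violent, nonviolent).statistic  # observed test statistic t
df = len(violent) + len(nonviolent) - 2

# Extremity of t within the central sampling distribution under delta = 0:
p_right = stats.t.sf(t_obs, df)  # survival function = Pr(T > t; delta = 0)
print(p_right)
```

The survival function `stats.t.sf` gives the upper-tail area of the (here, central) Student t-distribution; hypothesizing a nonzero d instead would replace this central distribution with a non-central one, as discussed later in the paper.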
Notice how this description of P-values alludes to the intimate relationship between frequentist hypothesis testing and a (neo-)Popperian, falsificationist epistemology (Mayo, 2018; Meehl, 1967, 1990; Popper, 1963). The key point of falsificationism is that science logically progresses through critical tests of theories; that is, what sets science apart from pseudo-science in the falsificationist sense is that the latter typically tries to gather observations that confirm theories, whereas the former actively tries to challenge theories and find counterevidence for them. Only if observations are inconsistent with a theoretical hypothesis does a falsificationist conclude that the hypothesis, or at least a set of background assumptions guiding the test, has been refuted. If observations are not inconsistent with the hypothesis, the hypothesis is said to be temporarily corroborated, that is, not refuted, but not literally confirmed either. Clearly, frequentist inference offers a probabilistic toolbox for this type of reasoning: if a P-value is low, there is a low probability of observing a more extreme test statistic if the tested hypothesis were true. In falsificationist terms, this means that test statistic t entails a high level of refutational information against the hypothesis δ = d (or a low level of corroborating information). If a P-value is high, there is a high probability of observing a more extreme test statistic under the tested hypothesis δ = d. This implies low refutational information against δ = d in t (or a high level of corroborating information), but not a literal 'confirmation' in any sense.
Unfortunately, the way in which frequentist hypothesis testing is typically used in communication science ignores its falsificationist roots, and it defeats much of its promise for scientific reasoning: usually, communication researchers (implicitly) set their tested hypothesis to a nil-null hypothesis in which δ = 0. As a consequence, a P-value as it is typically reported will only encode (a lack of) refutational information against a statistical claim of 'no difference between group means' or 'no linear relationship between variables x and y.' This is not completely without merit, of course: any study hypothesizing some effect or difference should aspire to refute, at the very least, the hypothesis that its results could have been caused by random noise alone (that is, without even the smallest systematic relationship or difference in the population). That said, it is also obvious that evaluating refutational information against a nil-null hypothesis is rarely the actual purpose of a study. In practice, researchers often use small P-values to infer some type of 'meaningful' effect (i.e., 'significant' in the non-statistical sense of 'important,' Nickerson, 2000), but such a conclusion ignores the logical asymmetry between falsifying or corroborating a nil-null and corroborating or falsifying any particular hypothesis of interest: it is not because a finding is extreme in light of δ = 0 that it is therefore able to corroborate a theoretically viable effect δ = d (the 'fallacy of rejection,' Mayo & Spanos, 2006; Spanos, 2014). Likewise, the absence of counterevidence against δ = 0 does not necessarily imply counterevidence against a meaningful alternative δ = d (the 'fallacy of acceptance,' Mayo & Spanos, 2006; Spanos, 2014).
In short, there is very little of value in a test of the nil-null; at best, it will only ever provide very weak (counter)evidence against theoretically meaningful claims. This is also why Meehl (1967, 1990) dubbed nil-null testing the 'weak' form of frequentist testing. Meehl contrasted this with the 'strong' use of frequentist testing as it is typically applied in domains such as particle physics (see also Cousins, 2020). In a 'strong-form' test, a researcher directly compares an observed test statistic t against a point prediction δ = d that has been derived from theory (e.g., the standard model of physics), literally, as per Expression (1). A P-value generated in this manner provides directly informative (i.e., 'strong') information with regard to the falsification or corroboration of an actual theoretical claim, and it does not take an unnecessary detour by first setting up an easily refutable straw-man nil-null hypothesis.
There appears to be no fundamental reason why communication scientists could not also turn to an application of frequentist testing in a 'strong' (or at least 'stronger') sense. A main hurdle seems to be that it is not the standard in undergraduate textbooks or statistical software and, therefore, requires some technical involvement on the part of the researcher: in order to conduct a strong-form test a researcher will need to (1) specify the effect size that is theoretically hypothesized (instead of just assuming a standard nil-null across all studies), and (2) determine the sampling distribution of the test statistics under this alternative hypothesis. Upon reflection, however, it will become clear that these steps do not actually require that much of a change to existing principles, at least not when researchers already abide by 'best practices' in their tests of the nil-null. As has been stressed in various papers (e.g., Dienlin et al., 2020), communication scientists relying on NHST should always calculate a quantity known as statistical power, which, technically speaking, already requires them to go through most of the necessary steps for a strong-form test. Indeed, as we will see shortly, strong-form testing can be conceptualized as a relatively straightforward extension of power analyses, at least for the case of General Linear Models with fixed effects.
Before fleshing out the relationship between power analyses and strong-form testing in more detail, it seems useful to clarify four issues that are tangential to our further discussion. First, for the remainder of the paper we will simply refer to the 'nil-null' hypothesis as the 'null hypothesis'. We do so despite knowing that the concept of the null hypothesis was never meant to reflect only the situation where δ = 0. In fact, as Fisher (1955) noted, the null hypothesis could refer to any working hypothesis to be 'nullified' (see also Gigerenzer, 2004), which, again, attests to the intimate relationship between frequentist hypothesis testing and a falsificationist epistemology. However, we believe that using the concept of 'null hypothesis' to reflect the situation where δ = 0 aids the readability of the paper, and it also corresponds to its typical usage in the communication literature. When we refer to the situation where we are testing an alternative, theoretically interesting hypothesis δ = d, we will speak about the 'substantive hypothesis', 'alternative hypothesis', or 'hypothesis of interest'.
Second, the paper will be discussing P-values as continuous measures of refutational evidence against the null hypothesis. It is useful to note that there is some controversy about the interpretation of P-values in such evidential terms: some scholars deny this 'neo-Fisherian' interpretation of P-values and stick only to a decision-theoretical interpretation along the lines of Neyman and Pearson (e.g., Lakens, 2021). Others do, in fact, interpret P-values in similar ways to the current paper (e.g., Greenland, 2019). We believe that the preference for either interpretation is largely a philosophical matter and that both can be justified assuming that existing concerns about the preferred interpretation are adequately considered.³

Third, the paper will only be discussing P-values as measures of refutational evidence against a hypothesis. We will not engage with the question of how small a P-value should actually be to falsify a hypothesis, or how this should be weighed against the probability of a false refutation (a Type-I error). Within the frequentist framework, this type of question is addressed by the significance level α, not by P-values per se.⁴ In the debates on statistical reform many different opinions have been raised about the specification of α: whether it should always be set to a very stringent level (e.g., α = .005; Benjamin et al., 2018), whether it should be justified on a case-by-case basis (Lakens et al., 2018), or whether it should not be used at all (Amrhein & Greenland, 2017; McShane et al., 2019). Although we agree that pre-registering significance thresholds is important to guarantee an honest and critical interpretation after observing results (see Mayo, 2019), we remain largely agnostic about the existence of such thresholds here: whether or not a researcher specifies a significance level, interpreting the P-value in continuous terms will always provide more specific information about the refutational information entailed in an observed test statistic, thereby allowing for, even requiring, more nuanced conclusions.

Fourth, the paper will not go into detail on the deeply rooted philosophical debates between Bayesians and frequentists on what would be the 'superior' approach to statistical testing (for discussions, see Mayo, 2018). While the current paper is clearly embedded in frequentism, this is not to argue that frequentism is necessarily to be preferred. We consider a choice for either side to be mostly a matter of epistemological belief (such as one's vision on subjective and objective probability, inductive logic, and the role of falsification in science) and of dominant practice in a field. In communication science, the focus is mostly on frequentist hypothesis testing using P-values, which suggests the relevance of the approach taken in this paper. However, researchers accepting the philosophical foundations of Bayesianism may reasonably prefer other types of metrics, such as Highest Density Intervals to summarize the posterior probability distribution for an estimated parameter, or Bayes Factors to compare the relative strength of confirmatory evidence for two distinct hypotheses (e.g., a nil-null and a minimal effect size of interest).⁵

³ Important arguments against interpreting P-like quantities in terms of (counter)evidence are (1) that P does not directly say anything about the truth or falsity of a hypothesis; (2) that all P-values are equally likely under the null hypothesis; (3) that, under conditions of low power, relatively high P-values do not provide any evidence against a null hypothesis; and (4) that, under conditions of high power, relatively low P-values may actually provide more evidence in favor of the null rather than an alternative. The current paper takes note of these concerns: for (1), it will be stressed that P-values provide information about the extremity of an observation assuming the truth of a hypothesis. Thus, P is only used to quantify the observation; when extreme, the observation is considered to be a falsifier of the hypothesis (but this does not mean the hypothesis has been shown to be false; see also Footnote 4). For (2), it will be noted that, under any alternative hypothesis δ > 0, P-values nearing 0 are more likely (assuming power > α). This is reflected in our discussion of the skewed distribution of P- and Ṗ-values; the skew, then, is taken to be suggestive of the direction of δ compared to 0 or, more generally, d (i.e., whether the finding is a falsifier or corroborator). For (3) and (4), it will be stressed that any interpretation of a P- or Ṗ-value requires an explicit consideration of other metrics such as statistical power (i.e., the overlap between null and alternative distributions).
⁴ Note that we are actually cutting a terminological corner in this paper in order not to overcomplicate our discussion: it is very typical to say, as we have said here, that when P < α, the tested hypothesis can be rejected. However, there is a subtle problem in this wording, as it conflates a property of the observation (i.e., t is extreme enough to be called a falsifier of a hypothesis) with a property of the hypothesis (i.e., the hypothesis is falsified). Generally, in a frequentist framework, and in contrast with a Bayesian approach, we do not actually try to falsify or confirm any given hypothesis. That is, the focus lies on calculating the probability of observations assuming the truth of a hypothesis, Pr(data; H), not on the probability of a hypothesis conditioned on the observations, Pr(H | data). Strictly speaking, then, in frequentist terms P < α does not allow us to say that "the tested hypothesis can be rejected". Rather, it suggests that the test statistic is extreme enough to be called "a falsifying observation with regard to the hypothesis". Similarly, α should not be defined as "the probability of a false refutation", but rather as "the probability of incorrectly labelling an observation as a falsifier".
⁵ An in-depth treatment of Bayesian inference is not within the scope of this paper; interested readers are referred to Gelman et al. (2013).

Statistical power, and the transition from a 'weak' to a 'strong' construal

In any proper application of a frequentist null hypothesis test, researchers do not only specify a null hypothesis but also put forward a particular alternative a priori that defines a population effect size of theoretical interest (d). This is done to calculate the sample size needed to attain a reasonable level of statistical power for the null hypothesis test: statistical power is the probability of correctly rejecting the null if, in fact, the alternative of interest were true (Cohen, 1988; Neyman, 1942). Assuming a one-sided test and d > 0, this translates into the following symbolism:

Power = Pr(T > t_crit; δ = d),   (2)

where t_crit refers to the 'critical' value of the test statistic, that is, the value that is considered "extreme enough" (by whatever standard) to refute the null. Notice the δ = d in this expression: it shows that the calculation of statistical power does not assume test statistics to come from the null distribution; rather, it assumes that they are drawn from an alternative distribution defined by a population effect size δ = d. In statistical jargon, these alternative sampling distributions are called 'non-central' sampling distributions, whose probability density functions are parametrized by a non-centrality parameter λ. We discuss the meaning of λ in more detail later in the paper and in the Technical Appendix (available from OSF, see https://osf.io/sdu9m/), but for now it will be enough to say that λ is some increasing function of the population effect size δ and sample size n (i.e., λ = f(δ)f(n)). When δ = 0 (under the null hypothesis) non-central distributions reduce to so-called 'central' sampling distributions, with λ = 0. As such, central sampling distributions are the ones we typically use in null hypothesis tests, whereas non-central distributions are the ones needed to conduct a power analysis. To visualize what all of this means, Figure 1 depicts three Student t-distributions: one central t-distribution assuming that the null hypothesis is true, and two non-central distributions assuming that a given alternative hypothesis is true. Statistical power is represented by the area under the non-central distributions beyond t_crit.
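As a minimal numerical illustration of Expression (2), power for a one-sided two-sample t test can be computed from a non-central t-distribution. The sample size, effect size, and α below are our own illustrative assumptions; λ = d·sqrt(n/2) is the standard non-centrality parameter for two independent groups of equal size n.

```python
# Power as the area beyond t_crit under a non-central t-distribution
# (Expression (2)); all inputs are illustrative assumptions.
import numpy as np
from scipy import stats

n = 100       # per-group size for two independent groups
d = 0.30      # hypothesized population effect size (Cohen's d)
alpha = 0.05  # one-sided significance level

df = 2 * n - 2
lam = d * np.sqrt(n / 2)              # non-centrality parameter lambda
t_crit = stats.t.ppf(1 - alpha, df)   # critical value under the central (null) distribution

power = stats.nct.sf(t_crit, df, lam) # Pr(T > t_crit; delta = d)
print(round(power, 2))                # around 0.68 for these inputs
```

Note the two distributions involved: t_crit is fixed by the central distribution (λ = 0), while power is an area under the non-central distribution (λ > 0), exactly as depicted in Figure 1.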
There are good reasons why statistical power is considered essential for a null hypothesis test. First, before conducting a study, we want to ensure a sample size that would give us a reasonable probability of rejecting the null if, in fact, the hypothesis of interest were true. If we know that there will be only a low probability of correctly rejecting the null hypothesis, there is little use in running the study at all! Second, high statistical power ensures that published null refutations will, on average, be more easily replicable and less inflated. This means that a literature relying on null hypothesis tests will be of more value if it is highly powered (Asendorpf et al., 2013). That being said, it should also be apparent that a consideration of statistical power still fails to address the most fundamental logical issue with tests of the null hypothesis: even though the sampling distribution under the hypothesis of interest δ = d is derived, the null hypothesis δ = 0 is still the only one actually used for the statistical test. Fortunately, if we look at the statistical definition of power in Expression (2), it is quite straightforward to arrive back at the definition of P-values as in Expression (1): we just need to plug the observed test statistic t in for t_crit in Expression (2) to arrive at P-values in a strong-form set-up.
Practically speaking, though, this 'simple' extension will still raise questions: how could a communication scientist ever be able to define a substantive hypothesis δ = d for which a literal falsification would be theoretically meaningful? In contrast to physics, we do not have theories that allow us to derive point predictions, so setting up a test to falsify exactly δ = d, as Meehl advocated, might seem out of reach. Fortunately, this problem has a practical solution: within the context of a power analysis, we will also rarely use a hypothesis derived from theory (even though that would be optimal); often, we will define the smallest effect size of interest (SESOI; Lakens et al., 2018) for which we want a high probability of refuting the null (denoted by d_min). Of course, testing against a minimal effect size d_min does not exactly deliver the 'strongest' possible test in Meehl's sense, which requires an exact theoretical point prediction δ = d. However, it does give way to what at least appears to be a 'stronger' test than a typical nil-null test. That is, if we are able to define a meaningful d_min (ideally, based on a close reading of prior research findings or meta-analysis) we may also assess to what extent our findings serve as corroborating evidence that, in fact, δ ≥ d_min. To do so, we can evaluate the extremity of an observed test statistic t within the sampling distribution under the population parameter of minimal interest, δ = d_min. This requires the calculation of a one-sided P-value, which we will denote by Ṗ ('P-dot'):⁶

Ṗ = Pr(T > t; δ = d_min).   (3)

As Figure 2 visualizes, Ṗ represents the refutational information against the hypothesis of minimal interest that δ ≥ d_min (assuming d_min > 0).
When Ṗ is small, there is little refutational information against δ ≥ d_min (i.e., t is a corroborator of δ ≥ d_min); a high value for Ṗ implies a test statistic with considerable refutational information against the hypothesis that δ ≥ d_min (i.e., t is a falsifier of δ ≥ d_min). Synonymously, one could say that, in the former case, t entails falsifying information against δ < d_min; in the latter case, t corroborates δ < d_min. The way in which Ṗ is defined here would traditionally be said to correspond to a one-sided test of the form δ ≤ d_min rather than a test of the form δ ≥ d_min. Therefore, this definition might not appear to be formally consistent with our requirement of putting the actual hypothesis δ ≥ d_min to the test. However, we prefer to stick with this definition for two reasons. (1) The tests of δ ≤ d_min and δ ≥ d_min are statistically identical for continuous distributions of T, in the sense that both tests evaluate t at δ = d_min. This means that the P-values arising from both set-ups provide the exact same information (they are just each other's complement).
(2) If we had defined Ṗ as formally expected, a researcher would have needed to use inverse logic when interpreting P and Ṗ. That is, a low P-value typically means falsifying δ = 0 (i.e., it is in line with a substantive claim; 'good'!), but a low Ṗ would have meant falsifying δ ≥ d_min (i.e., not in line with the substantive claim; 'bad'!). In our definition, P- and Ṗ-values maintain the same evidential rank-order. Yet another possibility would have been to also define traditional P-values as testing the hypothesis δ ≥ 0. This would also have maintained the rank-order compared to testing the claim δ ≥ d_min, but it would have inverted the interpretation of traditional P-values: instead of 'smaller is better' it would be 'larger is better'. This, we think, would have been unnecessarily confusing for researchers used to working with P-values.
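To illustrate, Ṗ can be computed with exactly the machinery of a power analysis: evaluate the observed t within the non-central sampling distribution under δ = d_min. The observed statistic, group sizes, and d_min below are illustrative assumptions.

```python
# P (under delta = 0) versus P-dot (under delta = d_min) for one observed
# two-sample t statistic; all inputs are illustrative assumptions.
import numpy as np
from scipy import stats

n = 100        # per-group sample size
t_obs = 2.40   # hypothetical observed test statistic
d_min = 0.30   # smallest effect size of interest (Cohen's d)

df = 2 * n - 2
lam = d_min * np.sqrt(n / 2)  # non-centrality parameter under delta = d_min

p = stats.t.sf(t_obs, df)             # Pr(T > t; delta = 0)
p_dot = stats.nct.sf(t_obs, df, lam)  # Pr(T > t; delta = d_min)

# Here p is small (considerable refutational information against the null),
# while p_dot is middling: t neither strongly falsifies nor strongly
# corroborates the claim that delta >= d_min.
print(round(p, 4), round(p_dot, 2))
```

The contrast is the whole point of the strong-form set-up: the same observed t that looks decisive against the null can be quite uninformative about the hypothesis of minimal interest.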
Given this relatively straightforward extension from a typical P (under the null) to Ṗ (under a hypothesis of minimal interest), it seems natural to ask: why has this approach not been adopted as a standard practice in the social-scientific literature? One reason could be that frequentist statistics has long been ritualized: the null routine has often been presented as "statistics per se" (Gigerenzer, 2004, p. 589), and even foundational principles such as statistical power took decades to be adequately considered (Cohen, 1992). A second reason could be that writings on strong-form testing have remained relatively scattered, operating under differing terminologies and, sometimes, lacking much technical and/or conceptual guidance. For instance, papers in the statistical reform movements have often promoted the use of P-values under alternative, non-null distributions, but they have not always attached conceptual or practical recommendations to this idea (Amrhein et al., 2019; Greenland, 2019; Nickerson, 2000). Others have provided more in-depth discussions of the relevant concepts and technicalities but discussed them within seemingly disparate frameworks. Some examples are minimal effects testing (Murphy & Myors, 1999), equivalency testing (Lakens et al., 2018; Weber & Popova, 2012; Wellek, 2010), magnitude-based inference (Aisbett et al., 2020), and severe testing (Mayo, 2018).
As the name suggests, minimal effects testing recommends an identical set-up to the one suggested in this paper (Murphy & Myors, 1999): define a minimal hypothesis of interest, derive the corresponding sampling distribution, and assess the extremity of an observed test statistic within this sampling distribution, that is, a quantity such as Ṗ. Tests of equivalency are also technically similar, although the conceptual idea is somewhat different: within an equivalency test, researchers substantively predict the null value, which requires them to (1) define the set of effect sizes that are 'practically equivalent' to the null and (2) refute the hypothesis that the population effect size is larger in absolute value than the upper and lower bounds of practical equivalence (Lakens et al., 2018; Weber & Popova, 2012; Wellek, 2010). This can be achieved on the basis of confidence intervals, or by calculating P-values from two one-sided tests against the largest effect size of practical equivalence ('TOST': Lakens et al., 2018). It should be clear that the latter is the same as setting up δ = d_min, deriving the non-central sampling distribution, and calculating Ṗ. Similar principles apply to the framework of magnitude-based inference, although one should be cautioned about the flawed, pseudo-Bayesian interpretations that have been circulating in this corner of the literature (see Aisbett et al., 2020, for a discussion).
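The correspondence between TOST and Ṗ can be made concrete in a few lines: in the non-central formulation just described, each one-sided test evaluates the observed t against a non-central distribution located at one equivalence bound. All inputs below are illustrative assumptions.

```python
# TOST-style equivalence sketch: two one-sided tests against the bounds
# -d_min and +d_min; all inputs are illustrative assumptions.
import numpy as np
from scipy import stats

n = 100        # per-group sample size
t_obs = 0.50   # hypothetical observed two-sample t statistic
d_min = 0.30   # bound of 'practical equivalence' (Cohen's d)

df = 2 * n - 2
lam = d_min * np.sqrt(n / 2)

# Test 1: refute delta >= +d_min (lower-tail extremity under delta = +d_min).
p_upper = stats.nct.cdf(t_obs, df, lam)
# Test 2: refute delta <= -d_min (upper-tail extremity under delta = -d_min).
p_lower = stats.nct.sf(t_obs, df, -lam)

# Equivalence is claimed only if both one-sided tests succeed, so the
# larger of the two P-values is decisive.
p_tost = max(p_upper, p_lower)
print(round(p_tost, 3))
```

Note that `p_upper` is simply the complement of the right-sided Ṗ evaluated at +d_min, which is the sense in which TOST "is the same as" deriving the non-central sampling distribution and calculating Ṗ.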
The most advanced philosophical treatment of strong-form testing has been delivered by Mayo (2018). In Mayo's terminology, 'strong testing' is referred to as 'severe testing' (more generally, 'error statistics'), but her basic argument is similar to Meehl's (1967). Mayo also emphasizes the falsificationist rationale of frequentist testing: as she puts it, the purpose of a statistical test should be to find out how severely the test probed a statistical hypothesis H (e.g., δ > d). To evaluate whether the hypothesis is severely probed, we need to find out, first, how capable the test would have been of generating an observed test statistic as extreme as t if H had not been true. Technically, Mayo's logic also boils down to (1) setting up the substantive hypothesis δ > d, and (2) calculating the one-sided extremity of a test statistic t within a non-central sampling distribution assuming δ = d. Mayo uses the concept of severity (SEV) to refer to the P-value-like quantity that arises from this, and SEV corresponds to what we have defined as _P above: when SEV is low, there is a fairly low probability that t would have been larger than it is if the population effect size were smaller than or equal to d. By the Frequentist Principle of Evidence (Mayo & Spanos, 2006), this implies that the data are evidence that δ > d. In contrast, when SEV is high, there is a fairly high probability that t would have been larger than it is if the population effect size had been at least d; by the Frequentist Principle of Evidence, this serves as counterevidence against δ > d.
In this paper, we choose to stick to the symbol _P rather than SEV because severe testing, as discussed by Mayo (2018), entails much more than just the calculation of P-value-like quantities; for a claim to be severely tested, researchers also need to probe the full chain of assumptions and auxiliaries underlying their tests (see also, Mayo, 2018; Scheel, Tiokhin, Isager, & Lakens, 2020; Spanos & Mayo, 2015). Here, we are only concerned with the calculation and interpretation of values such as P and _P per se, so we prefer not to introduce the connotations attributed to the general concept of severe testing. Importantly, the fact that we are not concerned with probing model assumptions also entails an important caveat: throughout our discussion, we assume that all statistical assumptions for General Linear Models are met. If the model is incorrectly specified, P and _P naturally lose their interpretability. That said, in all situations where we consider a regular P-value to be reasonably meaningful, _P should also apply.

Strong-form frequentist testing: calculating _P in the General Linear Model
Let us now turn to considering what communication scientists should do, specifically, to apply the principles of strong-form testing (aka minimal effects testing, severe testing, etc.). To begin with, it should be clear from our previous discussion that the technical application is relatively straightforward for General Linear Models with fixed effects: much of the set-up of a strong-form test remains identical to the typical null-hypothesis case. The type (i.e., Z, T, χ², F) and observed value of the test statistic will be the same, and the degrees of freedom will remain constant as well. This means that the information typically used and reported in a null-hypothesis test can be recycled into the calculation of _P. The only technical challenge arises in deriving the non-centrality parameter λ, which is needed to determine the sampling distribution of test statistics under the substantive hypothesis.

The concept of the non-centrality parameter is rarely discussed in much detail in the literature on applied statistics. This is surprising, especially given its fundamental role in power analyses; in fact, even Cohen's (1988) widely cited reference manual on statistical power does not provide much detail on the meaning or calculation of λ (see also, Liu & Raudenbush, 2004). Conceptually, λ can be understood as a mathematical expression of "the degree to which the null hypothesis is false" (Kirk, 2013, p. 139). More technically, it is a parameter that arises in distributions of random variables that have been derived from other, normally distributed, random variables with non-zero means (Liu, 2014). It is these non-zero means of the initial variables that define the value of λ for the newly defined random variable; λ, in turn, can be factored into the expected value (the mean) of that new variable.
The exact formula for λ will depend on the underlying transformation, but it generally holds that λ can be expressed as an increasing function of sample size n and population effect size δ; that is, λ = f(n)·f(δ), as noted above. We will not go into further detail on the derivation of λ here; interested readers are referred to Liu (2014) and the Technical Appendix (available from OSF, see https://osf.io/sdu9m/), where the origins of λ are discussed further. Table 1 provides an overview of formulas for λ in common applications of the General Linear Model. Alongside the non-centrality parameter, Table 1 also reports the formulas for test statistics, degrees of freedom, and the functions that can be used to calculate _P in R (R Core Team, 2020). Importantly, researchers do not necessarily need to calculate λ manually through the formulas in Table 1; any software application conducting power analyses for the General Linear Model should be able to determine λ (as it is used behind the scenes anyway). G*Power (Faul, Erdfelder, Lang, & Buchner, 2007), for instance, routinely outputs λ whenever a researcher chooses a post-hoc power analysis procedure. To avoid confusion, this does not mean that a researcher using the post-hoc command in G*Power to calculate λ will be conducting a post-hoc power analysis! In a post-hoc power analysis, one would use the observed effect size to calculate statistical power assuming that the observed effect size equals the population effect size. In contrast, when using G*Power to calculate λ, we only enter the smallest effect size of interest d_min into the calculation to obtain the value of λ under the assumption that the population effect size equals the smallest effect size of interest. This set-up is clearly different, so there is no need to worry about the logical problems of post-hoc power analyses in the context of calculating _P.
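For readers who prefer to compute these quantities outside G*Power, the logic can be sketched in a few lines of code. The snippet below is a minimal illustration in Python using scipy's non-central T distribution (the paper's own code, available from OSF, is written in R); the λ formula is the standard one for an independent two-sample T-test, and all input numbers are purely hypothetical:

```python
import math
from scipy.stats import nct

def lam_two_sample(d_min, n1, n2):
    # Non-centrality parameter for an independent two-sample T-test:
    # lambda = f(n) * f(delta) = sqrt(n1*n2/(n1+n2)) * d_min.
    return d_min * math.sqrt(n1 * n2 / (n1 + n2))

def p_bar(t_obs, df, lam):
    # _P: right-tail extremity of t_obs under the non-central T
    # distribution implied by delta = d_min (i.e., 1 - CDF).
    return nct.sf(t_obs, df, lam)

# Hypothetical inputs: d_min = .30, two groups of 100, observed t = 2.50.
lam = lam_two_sample(0.30, 100, 100)
print(round(lam, 3), round(p_bar(2.50, 198, lam), 3))
```

Conceptually, `p_bar()` is identical to `1 - pt(t, df, ncp)` in R.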
[Table 1 near here. For common applications of the General Linear Model (difference of two independent sample means with known and equal variances; difference of two independent sample means with unknown variances, equal population variances assumed; the Welch T-test; difference of more than two dependent sample means ("repeated measures ANOVA"), sphericity assumed; etc.), the table lists the formulas for the test statistic, the degrees of freedom, the non-centrality parameter λ, and the R functions that can be used to calculate _P. In the repeated-measures rows, k is the number of (repeated) groups, n is the number of participants (in each group), and ρ̄ is the average correlation in scores between the groups.]

Note. Except for the Welch T-test, the formulas reported in Table 1 have been restricted to cases where λ can be straightforwardly expressed as a function of a common (standardized) effect size. See the Technical Appendix for more details. Formulas are also reported in Liu (2014), Cohen (1988), the G*Power manual, and the manual for the (2020) software. Suggestive rules-of-thumb for the size of standardized effects r, R², d, w, and η² are mentioned by Cohen (1992, p. 157). d in the paired-samples formula corresponds to d as calculated by G*Power, with the standardizer in the denominator reflecting the standard deviation of the difference. It has been noted that the standard deviation of the difference is not very easy to interpret substantively, and it is sometimes recommended to use the standard deviation of the baseline to make d more interpretable (Cumming, 2013). However, the formulation given in the table most easily translates into λ, which serves the purpose of the current paper.

In sum, the practical workflow for calculating _P in General Linear Models will look something like this:
(1) Define the minimal effect size of interest prior to data collection, based on general insight into the research topic, previous research, and a review of the literature. Optionally, define (and pre-register) the cutoffs for the interpretation of statistics such as _P (see step 5).
(2) Conduct the nil-null hypothesis test (as one would usually do). The analysis outputs an observed test statistic and, when applicable, degrees of freedom, correlations between (repeated) measures, etc.
(3) Calculate the non-centrality parameter for the sampling distribution of test statistics. This can be done either through the formulas in Table 1 or by choosing the 'post-hoc' procedure within G*Power and providing the necessary information (degrees of freedom, sample size, variances, minimal effect size of interest, correlations between repeated measures, ...). Note that any value specified for α in software such as G*Power will be irrelevant for the calculation of λ.
(4) Use Table 1 to choose the appropriate formula for calculating the non-central Cumulative Distribution Function of T. Enter the observed test statistic t, the non-centrality parameter λ, and (when appropriate) the degrees of freedom ν as provided by steps 2 and 3. The outcome value, or its complement, gives _P (see Table 1).
(5) Evaluate _P in terms of how much corroborating or falsifying evidence it provides for or against the claim δ ≥ d_min. When _P is small, it offers (relatively speaking) more corroborating than falsifying information about δ ≥ d_min. When _P is large, it offers (relatively speaking) more falsifying than corroborating information about δ ≥ d_min. How much falsifying or corroborating information one needs before considering a result "convincing enough" will depend on the researcher's testing philosophy and pre-registered cutoffs.
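Put together, steps 1 through 5 can be sketched end-to-end. This is our illustrative Python/scipy translation with simulated data (the paper's workflow relies on G*Power and R); the group sizes, means, and seed are hypothetical:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# (1) Smallest effect size of interest, fixed before data collection.
d_min = 0.30

# Hypothetical data for an independent two-group design.
g1 = rng.normal(0.5, 1.0, 180)   # treatment group
g2 = rng.normal(0.0, 1.0, 172)   # control group

# (2) The usual nil-null test: observed t and right-sided P.
t_obs, p_null = stats.ttest_ind(g1, g2, alternative="greater")

# (3) Non-centrality parameter under delta = d_min.
n1, n2 = len(g1), len(g2)
lam = d_min * np.sqrt(n1 * n2 / (n1 + n2))

# (4) _P: right-tail extremity of t_obs in the non-central T distribution.
df = n1 + n2 - 2
p_bar = stats.nct.sf(t_obs, df, lam)

# (5) Small _P corroborates delta >= d_min; large _P falsifies it.
print(f"P = {p_null:.4f}, _P = {p_bar:.4f}")
```

Note that _P can never be smaller than the regular P for the same data (as long as λ > 0), reflecting that δ ≥ d_min is a stronger claim than δ > 0.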
While this procedure is not overly complicated, steps 3 and 4 still require some manual operations. For this reason, a web application is being developed that allows researchers to automatically calculate _P based on the relevant inputs. A link to the application will be available on the OSF page associated with this project (https://osf.io/sdu9m).

Illustrative examples
We now turn to a discussion of three hypothetical examples to show how _P may be calculated and interpreted. The code associated with the examples is also available from OSF (https://osf.io/sdu9m/).

Example 1. Difference between two independent group means, unknown variances (equal variances assumed).
Imagine a researcher planning to conduct a simple advertising experiment with an independent two-group design: in one group, she will expose participants to an advertisement with celebrity endorsement; in the second group, she will expose them to an advertisement with an unknown endorser. After exposure, she measures attitudes toward the product using a validated measurement scale. Say that she will be using a Student T-test (assuming similar population variances in both groups) and predicts an effect size of minimal interest of at least Cohen's d = .30 (a 'small' to 'medium' effect by Cohen's standards). She wants to achieve reasonably high power (say, power = .80) at a one-sided significance level of α = .025 (corresponding to the critical value of a two-sided α = .05), which requires a total sample size n = 352. She conducts the study and observes the mean attitude in the celebrity endorsement group to be 2.70 (SD = 0.82; n = 180); in the control group it is 2.33 (SD = 0.85; n = 172). This corresponds to an observed Cohen's d = 0.4430423, with t(350) = 4.153327 and P = .00002 (right-sided). This means (1) that the observation entails relatively strong falsifying evidence against the null hypothesis, and (2) that the observed effect size is larger than the effect size specified as being of minimal interest.
From the latter observation, it seems straightforward to conclude that the findings do, in fact, corroborate δ ≥ d_min. However, an important additional question is: how strong a corroborator is the finding? This can be evaluated on the basis of _P. Plugging in the formulas from Table 1 gives λ = .30 × √(180 × 172/352) = 2.8135, and thus _P = 1 − pt(4.153327, df = 350, ncp = 2.8135) ≈ .093. This can be interpreted in three equivalent ways. First, in terms of probability, this result means that there would be a probability of at least 9.3% of observing a larger test statistic than the one observed if the true population difference were .30 or larger. Second, and more informally, the results suggest relatively strong counterevidence against δ < d_min: if δ had been any smaller than d_min, the probability of finding a test statistic as large as the one observed would have been even smaller than _P ≈ .09. This is identical to a third way of interpreting the results: the finding t serves as a relatively strong corroboration that δ ≥ d_min (see also Figure 3).
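The calculation above can be reproduced with any software that exposes the non-central T CDF; for instance, in Python (our sketch, mirroring R's `pt()` call):

```python
import math
from scipy.stats import nct

# Example 1: lambda from the two-sample formula, then right-sided _P.
lam = 0.30 * math.sqrt(180 * 172 / (180 + 172))   # ~2.81
p_bar = nct.sf(4.153327, 350, lam)                # = 1 - pt(t, df, ncp) in R
print(round(lam, 4), round(p_bar, 3))
```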

Example 2. A correlation coefficient.
Now consider a media researcher testing a traditional 'cultivation effects' hypothesis, suggesting a bivariate correlation between total television exposure and the expression of beliefs in a just world (Figure 3, second column). The researcher uses a large-scale survey of a random sample of 1,900 adults, and observes r = −.08, t(1898) = −3.49649, P = .0002411313. She concludes that the relationship is statistically significant at α = .05 and therefore corroborates the cultivation hypothesis. However, a reviewer posits that this is not at all convincing: while the finding provides refutational information against the null hypothesis, a theoretically meaningful manifestation of a cultivation effect should have a minimal effect size of at least R² ≥ .01, or |r| ≥ .10. He notes that the observed effect size r = −.08 already falsifies this claim, but how strong a falsifier is it? The minimal effect size R² ≥ .01 implies r ≤ −.10, which, according to Table 1, corresponds to non-centrality parameter λ = −.10 × √1900/√(1 − .10²) = −4.380858. Calculating _P from this gives pt(q = −3.49649, df = 1898, ncp = −4.380858) = 0.811492. This means that if the actual population effect size were no smaller (in absolute value) than a minimum effect size of interest |r| = .10, there would have been a probability of at least _P ≈ .81 of observing a larger test statistic (in absolute value) than the one we have observed. In other words, the observation serves as a relatively substantive falsifier with regard to a theoretically meaningful manifestation of a cultivation effect, even though the relationship is statistically significant by common standards.
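Again, the reported numbers can be reproduced directly (our Python/scipy sketch of the correlation-test formula from Table 1; the paper's own calculation uses R's `pt()`):

```python
import math
from scipy.stats import nct

# Example 2: correlation test, minimal effect |r| = .10 (negative direction).
n, r_min, t_obs = 1900, -0.10, -3.49649
lam = r_min * math.sqrt(n) / math.sqrt(1 - r_min**2)   # ~ -4.3809
p_bar = nct.cdf(t_obs, n - 2, lam)                     # left-sided _P
print(round(lam, 6), round(p_bar, 6))
```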

Example 3. R 2 -change in multiple regression.
Consider the scenario where a student conducts a relatively small-scale survey (n = 120) to assess the relationship between social media use and depressive feelings, measured using a validated measurement scale. The model also contains 4 socio-demographic and psychological variables as controls. The student hypothesizes that social media use contributes substantively to the explanation of variance in scores on the depression scale, reflected in a minimal change of ΔR² = .02. Using a hierarchical regression, she finds that the four control variables (added as a first block) generate R² = .15. In a second block, she adds social media use, which increases R² by ΔR² = .03; F(1, 114) = 4.170732, P = .04343538. This means that, by the typical standards of a null hypothesis test, we would reject the null hypothesis and find an effect size of ΔR² = .03, larger than the one proposed as the minimal effect size of interest. Now, do we consider this finding a strong corroborator of the hypothesis ΔR² ≥ .02? According to the calculations from Table 1, there would be a probability of at least _P ≈ .37 of observing a larger test statistic than the one observed if the population effect size were at the absolute minimum value of interest ΔR² = .02; that is, if the population effect size were any smaller than ΔR² = .02, the probability of observing a larger test statistic would be even smaller than _P ≈ .37. Hence, the results lie in the direction of corroboration, albeit not convincingly.
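Under our reading of the Table 1 formula for regression (λ = n·f², with f² = ΔR²_min/(1 − R²_full); the exact denominator used here is our assumption), the _P value above can be reproduced as follows (Python/scipy sketch):

```python
from scipy.stats import ncf

# Example 3: lambda = n * f^2, with f^2 = dR2_min / (1 - R2_full).
# Using the hypothesized full-model R^2 (.15 + .02 = .17) in the
# denominator is our assumption about the Table 1 formula.
n, dR2_min, R2_full = 120, 0.02, 0.17
lam = n * dR2_min / (1 - R2_full)       # ~2.89
p_bar = ncf.sf(4.170732, 1, 114, lam)   # right-tail _P for the F statistic
print(round(p_bar, 2))
```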
Note from this example that the formula for λ used in multiple regression (i.e., based on f²) is generalizable across tests within the General Linear Model: any relationship that can be expressed in terms of explained variance (i.e., R² or η²) can be reformulated into the form λ = n·f². For ANOVA-type analyses, calculating λ in this way will generally be easier than having to express all group-specific differences separately (see Table 1).

The importance of interpreting overlap with the null distribution
Looking at the specifics of our third example, and comparing the curves from the three examples in Figure 3, it should be clear that _P should not be interpreted on the basis of the curve of the non-central distribution alone. That is, when the null and alternative distributions overlap heavily, _P values will often point in the direction of corroboration even when the test statistics were drawn from the null distribution. To put it in falsificationist terms, a test with large overlap between the null and alternative distributions is highly unfalsifiable: even under the absolute null scenario, it would be relatively likely to find a test statistic corroborating the hypothesis δ ≥ d_min. If the null and alternative distributions overlap completely, finding a corroborator for δ ≥ d_min is equally probable under δ = 0 and δ = d_min, which means that finding a corroboration is completely uninformative. This is a reminder that the interpretability of test results always depends, at the very least, on the results the test is expected to generate under the absolute null scenario! A general rule is this: for any given _P, a test with more overlap with the null distribution implies weaker corroboration of the claim δ ≥ d_min, as observations from the null distribution will be expected to generate relatively low _P values with high probability (the claim δ ≥ d_min is less falsifiable; cf., Mayo, 2018). This shows why evaluating _P also always requires a consideration of the overlap, distance, or divergence between the null distribution and the alternative distribution. There are several pieces of information that could be used as part of this evaluation.
A first piece of relevant information can be found by simply comparing the observed P-value with _P. When the two sampling distributions overlap completely, P and _P for any given test statistic t will be identical. The less the two distributions overlap, the larger _P should be compared to P for any given test statistic t. While this method is straightforward, it is also informal and subjective. More formal approaches will require the interpretation of statistics quantifying the overlap, distance, or divergence between the null and alternative distributions (see Figure 3 for a visualization). There are many options available, but we will focus our discussion on statistics with intuitive bounds between 0 and 1 (thereby excluding widely used statistics such as the Kolmogorov-Smirnov distance, the Kullback-Leibler divergence, and the Jensen-Shannon divergence).
One option is to evaluate the statistical power of the null hypothesis test. Power has the advantage of being familiar to researchers, and it can be used as an indication of overlap in the sense that it is the probability of exceeding a critical (tail) value of the null distribution if the alternative is true. This may then lead one to conclude that, conceptually, higher power reflects higher distance and less overlap between the null and alternative distributions. The main problem with using power in this way is that it requires a significance level to be defined, which means that (1) it forces researchers to put a (possibly arbitrary) significance level in place, and (2) it is not comparable across different significance levels. These issues can be bypassed by extending the concept of power to what one could call "falsifiability." "Falsifiability" can thus be defined as the probability of finding a test statistic that provides at least some falsifying information against δ ≥ d_min if the null hypothesis were true. In other words, it corresponds to the probability of observing test statistics corresponding to δ < d_min if the null hypothesis were true. Symbolically, φ = Pr(T < t_{δ=d_min}; δ = 0), where φ ("phi") stands for "falsifiability" (i.e., the falsifiability of δ ≥ d_min when δ = 0), and t_{δ=d_min} refers to the value of the test statistic corresponding to δ = d_min. Falsifiability is conceptually similar to power, but instead of using probabilities under the alternative based on a critical value of T, it uses probabilities under the null based on the value of T corresponding to d_min. Hence, the value of φ can easily be calculated through a minor extension of the test statistic formulas in Table 1: the effect size of interest (d_min) can be plugged into the formula for the test statistic, together with the observed sample size(s) n and (in case d_min is unstandardized) standard deviation(s) σ.
This will provide the test statistic corresponding to d_min, t_{δ=d_min}. From this, we can calculate the probability of observing test statistics smaller than t_{δ=d_min} assuming that the null hypothesis is true. Ideally, this probability should be 1, as this means that δ ≥ d_min cannot be corroborated under the null hypothesis. Note that tests with reasonable power will already tend to spawn φ very close to 1; this is important to keep in mind, because it means that one may set very high standards for φ (with values nearing 1 being required).
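As a sketch, φ for the design of example 1 can be computed from the central T distribution (our Python/scipy illustration; the formula for the test statistic at d_min is the standard two-sample one):

```python
import math
from scipy.stats import t as t_dist

# phi = Pr(T < t_{delta = d_min}; delta = 0), for example 1's design.
d_min, n1, n2 = 0.30, 180, 172
t_dmin = d_min * math.sqrt(n1 * n2 / (n1 + n2))  # test statistic at d_min
phi = t_dist.cdf(t_dmin, n1 + n2 - 2)            # evaluated under the null
print(round(phi, 4))
```

As expected for a well-powered design, φ lands very close to 1.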
When comparing null and alternative distributions, one may also resort to measures that have been specifically designed to quantify distributional (dis)similarity. One notable example is the so-called Overlapping Coefficient (OVL), which is used as a similarity measure in various areas of statistics. OVL represents the proportion of overlap between two distributions, meaning that it ranges from 0 (0% overlap) to 1 (100% overlap), with values close to 0 being preferred. OVL can be implemented in R using the overlapping package (Pastore, 2018). As yet another alternative, one could calculate the Total Variation Distance (TVD), which is implemented in R through the distrEx package (Ruckdeschel et al., 2006). TVD also ranges from 0 to 1, and it represents the largest possible difference between the probabilities assigned to the same event by two probability distributions. If two distributions are identical, the largest possible difference in assigned probabilities is 0; if they are completely different (i.e., if they do not overlap), the largest possible difference is 1. Hence, values close to 1 are to be preferred here. Figure 4 illustrates the relationship between all statistics mentioned (power, φ, OVL, and TVD) in the case of a two-sample T-test.

Now let us briefly consider how the overlap or distance between the alternative and null distributions would affect the interpretation of our three examples. Eyeballing Figure 3, it is clear that the curves from the third example overlap heavily, whereas the curves of the other two examples do not: in the first and second example, there is relatively high power, little overlap, and high falsifiability. In the third example, however, the distributions under the null and substantive hypothesis overlap heavily, which is also reflected in the values of power, φ, OVL, and TVD. Considering φ specifically, the third example shows a φ = .87 probability of finding a test statistic in the direction of δ < d_min if the null hypothesis is true.
In other words, this still leaves a .13 probability of corroborating δ ≥ d_min if, in reality, δ = 0. The values of φ are close to 1 for the other two examples, showing that, comparatively, they provide more informative and adequate tests. Note that the web application associated with this article will also include an option to output these measures of overlap and similarity alongside _P (see https://osf.io/sdu9m).

What about the overlap with other, non-null-distributions?
There is an infinite set of possible parameter values different from d_min, all of which would spawn a different value for λ, a different sampling distribution, and a different value for _P. One could ask: should a researcher not evaluate all of these parameters? Our answer would be: no, not necessarily. We know that testing any effect below d_min would always suggest more corroborating information in the test statistic t, but given that we were calculating _P at the minimal effect size of interest, any δ < d_min is not of relevance. Likewise, a given t will always generate less corroborating information for effect sizes larger than d_min, but this is also not much of an issue for the initial hypothesis test: the only claim we are making is that δ > d_min, so finding less corroborating information for any larger alternative δ′ > d_min has no direct implications for our claim.
Importantly, this is not to say that it is irrelevant to evaluate the falsifying or corroborating information of an observed test statistic t in light of parameter values other than d_min. However, doing so seems of most use after the initial hypothesis test of interest, that is, as a follow-up, 'inductive' step. We see two ways in which a researcher could proceed at this stage. A first approach is to evaluate confidence intervals for an observed effect size. In recent years, the reporting of confidence intervals for effect sizes has been widely promoted, even to the extent that it has been dubbed "the new statistics," ready to replace P-values (Cumming, 2014, p. 7). From the viewpoints expressed in this paper, confidence intervals should not be seen as a replacement of P-like quantities. Rather, we believe they are mutually compatible, of use at subsequent steps of the testing-and-inference cycle: we first test a hypothesis of minimal interest, and then we may use confidence intervals to infer the range of parameter values that could not be rejected (at a fixed α level). That is, confidence intervals represent intervals of non-refutation (corroboration) given an observed test statistic t and a dichotomous decision rule (i.e., corroborate/falsify) based on some threshold α ("compatibility intervals": Amrhein et al., 2019; Greenland, 2019; Hawkins & Samuels, 2021). A second approach, and one that does not require fixed α levels, is to eyeball so-called "severity curves" (Mayo, 2018). Severity curves represent the value of SEV, or _P, for a wide range of parameter values. Plotting severity curves can be done iteratively, through a simple application of the formulas in Table 1. The steps are as follows: (1) specify a range of values for the hypothesized effect size, (2) calculate _P given each of these effect sizes, and (3) plot the values of _P/SEV against their respective effect sizes. Figure 5 shows severity curves for the three examples described above.
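A severity curve for example 1 can be traced with a simple vectorized call over hypothesized effect sizes; the sketch below (ours, in Python/scipy) implements the three steps minus the actual plotting:

```python
import numpy as np
from scipy.stats import nct

# Severity curve for example 1: _P across hypothesized effect sizes.
t_obs, df, n1, n2 = 4.153327, 350, 180, 172
d_grid = np.linspace(0.0, 0.8, 81)                 # (1) range of effect sizes
lam_grid = d_grid * np.sqrt(n1 * n2 / (n1 + n2))
sev = nct.sf(t_obs, df, lam_grid)                  # (2) _P / SEV at each d
# (3) Plotting sev against d_grid yields the severity curve; at d = 0
# the curve starts at the regular P-value and rises toward 1.
print(round(float(sev[0]), 5), round(float(sev[-1]), 4))
```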
The accompanying web application will also include the option to plot severity curves automatically (see https://osf.io/sdu9m).

Summary: the steps in a 'strong-form' test
In sum, there will typically be five practical steps associated with a 'strong-form' test:

Figure 4. Graph representing the relationship between power, φ ("falsifiability"), the overlapping coefficient, and the total variation distance. This graph clarifies that φ is very high even at low levels of statistical power, which implies that, in general, there is a small probability of corroborating δ ≥ d_min if δ = 0.

(1) Define the minimal effect size of interest d_min and (possibly) significance thresholds.
(2) Calculate a traditional P-value. Critically evaluate the extent to which it corroborates or falsifies δ ≤ 0. Use thresholds (when those were defined) to label findings as "corroborators" or "falsifiers" of the null hypothesis.
(3) Calculate _P. Critically evaluate to what extent it corroborates or falsifies δ ≥ d_min. Use thresholds (when those were defined) to label findings as "corroborators" or "falsifiers" of the hypothesis of (minimal) interest.
(4) Evaluate the overlap, distance, or divergence between the alternative and null distributions.
A corroboration should be considered (relatively speaking) less valuable when there is large overlap, and (relatively speaking) more valuable when there is little overlap.
(5) Assess the corroborating/falsifying information in t with regard to other possible values of the parameter δ. This can be achieved by interpreting a confidence interval for the effect size (i.e., an interval of compatibility/non-refutation) and/or by plotting severity curves.

The behavior of _ P: a simulation
We now turn to a simulation to illustrate our previous claims and to portray the long-term behavior of P and _P in more detail. The simulations revolve around the example of a correlation coefficient, but the definition of standardized quantities such as P and _P does not depend on the type of statistic being used. As a result, the general pattern of findings shown here will extend to other types of statistics as well, at least as long as the assumptions of the underlying statistical models are met.

Set-up
In our simulation, we begin by defining five population effect sizes for the correlation coefficient, ρ = {0, 0.05, 0.1, 0.2, 0.3}; six sample sizes, n = {30, 50, 100, 200, 300, 500}; five hypothesized (i.e., d_min) correlations that are put to the test, ρ_min = {0, 0.05, 0.1, 0.2, 0.3}; and one significance level of α = .025 (one-tailed). For all combinations, we simulate 10,000 data sets and store the observed test statistics, P-values, _P-values, non-centrality parameters (λ), and measures of similarity or overlap. This allows us to examine the behavior of P and _P (1) under various combinations of true effect sizes, tested effect sizes, and sample sizes, and (2) under various degrees of overlap between the null and alternative distributions. The annotated code of the simulation is available from the paper's OSF repository (https://osf.io/sdu9m/).
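A scaled-down version of this set-up can be sketched as follows (our Python translation of one cell of the design; the full annotated R code is on OSF). To keep things short, test statistics are drawn directly from their (non-)central T sampling distribution rather than computed from raw data:

```python
import numpy as np
from scipy.stats import nct, t as t_dist

rng = np.random.default_rng(2020)

def simulate(rho_true, d_min, n, n_sims=2000):
    # Draw correlation-test statistics directly from their (non-)central T
    # sampling distribution, then score each draw with P and _P.
    df = n - 2
    lam_true = rho_true * np.sqrt(n) / np.sqrt(1 - rho_true**2)
    lam_hyp = d_min * np.sqrt(n) / np.sqrt(1 - d_min**2)
    t_sim = nct.rvs(df, lam_true, size=n_sims, random_state=rng)
    p = t_dist.sf(t_sim, df)             # regular P (null distribution)
    p_bar = nct.sf(t_sim, df, lam_hyp)   # _P (distribution under d_min)
    return p, p_bar

# One cell of the design: true rho = .05, tested d_min = .10, n = 300.
p, p_bar = simulate(rho_true=0.05, d_min=0.10, n=300)
print(round(p.mean(), 2), round(p_bar.mean(), 2))
```

With the true effect between 0 and d_min, P should skew toward small values while _P skews toward large values, anticipating the pattern discussed below.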

Illustration of key points
The non-centrality parameter λ is an increasing function of sample size and population effect size, so the distribution of simulated test statistics should shift further away from zero the higher (1) the population value of the correlation coefficient and (2) the sample size. As visualized in Figure 6, this is reflected in the simulations: when the population correlation coefficient ρ = 0, the simulated distribution of test statistics follows a central distribution with λ = 0. When the population coefficient is non-zero, the distribution of observed test statistics shifts to a non-central T distribution with λ = ρ√n/√(1 − ρ²). Of course, in reality, researchers will always be working with sampling distributions under a hypothesized effect size, not the true effect size. This means that when we are calculating P or _P (i.e., under a hypothesized effect of 0 or d_min), we are assuming a distribution that may or may not differ from the true distribution. The distribution of P and _P values will therefore also depend on the difference between the hypothesized and actual sampling distributions of test statistics. To illustrate, consider the case of a regular P-value, where the hypothesized effect size is always δ = 0. It is a well-known fact that, if the null hypothesis is true (if λ = 0), P-values will be uniformly distributed between 0 and 1, regardless of sample size. This follows directly from the definition of a P-value, P = Pr(T > t; H0), which can be interpreted as stating that, if the null hypothesis were true, the observed test statistic t would rank among the highest P × 100% of test statistics. From this it follows that if the null hypothesis is actually true, exactly a fraction P of all test statistics will result in a P-value smaller than P; that is, Pr(P < P_observed; H0) = P_observed, which implies a uniform distribution. In contrast, if the true population effect size is any greater than 0, P will have a right-skewed distribution.
This also makes sense: if we are drawing test statistics from a distribution that, on average, spawns large test statistics, then we will, on average, find small P-values. This phenomenon will be more and more pronounced the more the true non-centrality parameter λ moves away from 0 (that is, when the effect size and/or sample size increases). This behavior of P-values is visualized in Figure 7. Unsurprisingly, the behavior of P also translates into _P: _P will also be uniformly distributed if the tested hypothesis d_min is true in reality, and if the true effect size is larger (in absolute value) than the tested parameter d_min, the sampling distribution of _P will be right-skewed as well. However, as Figure 7 shows, there is another interesting case to consider for _P: if the true effect size is smaller than the hypothesized effect d_min, the sampling distribution will be left-skewed. This means that for a population parameter greater than 0 but smaller than d_min, the distribution of P will be right-skewed, and the distribution of _P will be left-skewed. The skew in the distribution of _P will also be more pronounced when the true non-centrality parameter diverges to a greater extent from the one assumed in the hypothesis test (e.g., when the population effect size differs greatly from the tested effect size).
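To illustrate how such a _P value can be computed in practice, the sketch below gives one possible operationalization for a one-sample t-test (the function name `strong_form_p`, the data, and all numerical choices are our own illustrative assumptions, not code from the paper): it evaluates the tail probability of the observed t statistic under the non-central distribution implied by a hypothesized d_min.

```python
# Sketch: a strong-form P-value as the tail probability of the observed
# statistic under the non-central t distribution implied by d_min.
import numpy as np
from scipy import stats

def strong_form_p(t_obs, n, d_min):
    """_P = Pr(T >= t_obs | delta = d_min) for a one-sample t-test:
    small values corroborate delta > d_min; values near 1 act as falsifiers."""
    ncp = d_min * np.sqrt(n)                 # non-centrality parameter lambda
    return stats.nct.sf(t_obs, df=n - 1, nc=ncp)

rng = np.random.default_rng(7)
n, d_min = 60, 0.2
x = rng.normal(loc=0.5, scale=1.0, size=n)   # true effect (0.5) exceeds d_min
t_obs = x.mean() / (x.std(ddof=1) / np.sqrt(n))

print(strong_form_p(t_obs, n, d_min))  # _P against the hypothesized d_min
print(strong_form_p(t_obs, n, 0.0))    # ordinary one-sided P (d_min = 0)
```

Because the reference distribution sits further to the right under d_min than under 0, the same observed statistic always yields a larger _P than ordinary P; this is exactly what makes the strong-form test more demanding.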
The notion of skew in _P distributions illustrates two fundamental points about the interpretation of strong-form tests. First, it illustrates the importance of sample size: for any fixed population effect size and tested effect size, it will be true that with increasing sample sizes, it will be easier to find falsifiers against the hypothesis δ ≥ d_min if the population effect size were any smaller than the tested effect d_min. This means that higher sample sizes increase Popperian 'risk' (Meehl, 1968): there is a higher probability of refutation of a hypothesis δ ≥ d_min if, in reality, δ is somewhat smaller than d_min. Likewise, higher sample sizes will also make it easier to detect strong corroborators if the actual population effect size were any greater than d_min. Both of these points indicate why researchers would want to aim for large samples: (1) larger samples increase falsifiability and should therefore be lauded by reviewers; (2) when δ is any greater than d_min, higher sample sizes will, on average, spawn smaller values of _P (i.e., the results will tend to be more convincing). A second crucial point illustrated by the skew in the distribution of _P is the relevance of (dis)similarity and overlap between the null and alternative distributions: when sample sizes and/or effect sizes are very small (i.e., when non-centrality parameters are very small), the null and alternative distributions overlap heavily. When this is the case, _P will, on average, be smaller under the null hypothesis, and, as a result, become less informative: even if the absolute null situation were true, it would be relatively likely to find strong corroborating information that δ > d_min. This is also reflected in the distribution of _P-values: the more the distributions overlap, the less the distribution of _P will be skewed if the null hypothesis is true; that is, the higher the probability becomes of finding small _P-values under the null.
To illustrate, Figure 8 visualizes the distribution of _P-values at various levels of overlap with the null distribution, quantified by (1) power, (2) φ, (3) OVL, and (4) TVD.
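Overlap quantities of the kind used for Figure 8 can be computed numerically. The sketch below (an illustration of our own for a one-sample design, under assumed values of n and d_min) evaluates the overlapping coefficient (OVL) and total variation distance (TVD) between the central and non-central t densities:

```python
# Sketch: numerical OVL and TVD between the null (central t) density and the
# alternative (non-central t) density implied by a hypothesized d_min.
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

def overlap_measures(n, d_min):
    """OVL = integral of min(f0, f1); TVD = 1 - OVL for two densities."""
    df, ncp = n - 1, d_min * np.sqrt(n)
    grid = np.linspace(-10.0, 10.0 + ncp, 20_000)
    f0 = stats.t.pdf(grid, df=df)             # null (central) density
    f1 = stats.nct.pdf(grid, df=df, nc=ncp)   # alternative density at d_min
    ovl = trapezoid(np.minimum(f0, f1), grid)
    return ovl, 1.0 - ovl

# Overlap shrinks (and TVD grows) as n increases for a fixed d_min:
for n in (20, 80, 320):
    ovl, tvd = overlap_measures(n, d_min=0.3)
    print(n, round(ovl, 3), round(tvd, 3))
```

As the text notes, heavy overlap makes small _P values likely even under the null; measures like these quantify how quickly that problem recedes as n grows.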

The added value of the strong-form construal
In our view, the construal of strong-form frequentist testing (severe testing, minimal effects testing, etc.) offers a promising addition to the statistical toolbox of communication scientists. It is a relatively straightforward extension of widespread practice, and it adds to the convergent evolution of the field toward critical theory assessment in a falsificationist sense: a strong-form test allows for a more direct evaluation of a statistical hypothesis, requires a more critical reading of findings that are "statistically significant," and underlines the importance of making thoughtful methodological decisions before conducting a study. That is, if we require strong-form tests (or some related conceptualization), researchers will need to pre-register a reasonable value for d_min on the basis of their knowledge of the literature (see also, Dienlin et al., 2020). Fixing one general value of d_min across studies will not be very useful, as the effect size of interest will be highly contingent on the type of hypothesis under investigation. For instance, when a researcher is conducting a large-scale survey to assess the overall relationship between media use and well-being, it seems reasonable to set d_min to a relatively low value. In contrast, when the purpose of a study is to test the behavioral effects of a costly intervention campaign, it will be necessary to specify a much larger effect size as being of minimal interest. For this reason, the specification of a given d_min should be considered a central topic of both a study's preregistration plan and its literature review. In general, and when feasible, larger values of d_min are to be preferred, as they represent more meaningful effects in a theoretical and practical sense; methodologically, they generate less overlap between the null and alternative distributions given a fixed sample size (i.e., higher falsifiability).
Another advantage of the strong-form construal is that it incentivizes researchers to gather large samples. This incentive is much more apparent than under the traditional null hypothesis P-value, because it literally creeps into our interpretation of 'evidence': if we only gather small samples, we know that, when evaluating the results, we will find larger overlap between the null and alternative distributions. As a result, a corroborator (if we find one at all) will be considered of less value than it would have been with a larger sample. Also, the larger the sample size, the smaller _P will be for relatively small deviations δ > d_min. This means that larger sample sizes will make it easier to find corroborating information in favor of δ > d_min if, in reality, the population parameter is any greater than d_min. Hence, gathering larger samples will generally be more interesting for researchers and, at the same time, ensure a riskier test of hypotheses. In other words, in a strong-form set-up, gathering large samples will be a win-win situation for both science and the scientist (which is not necessarily the case for null hypothesis tests, where scientifically meaningless deviations from 0 already spawn low P-values with large samples). A third advantage of strong-form testing is that it underlines the value of oft-maligned distance measures such as P, over and above confidence intervals (Cumming, 2014): when assessing the falsifying information of a finding in light of a fixed hypothesis, confidence intervals only provide binary information about a predicted parameter; it either lies within the interval or it does not. Distance measures are more specifically informative, as they quantify the extremity of an observation against a specific prediction.
As we have mentioned earlier, measures such as P or _P and confidence intervals can be seen as complementary: we first test predictions by evaluating falsifying information as expressed through a distance measure; next, we may use an interval to (inductively) reason through the set of hypothetical parameter values for which the observation would not rank among the α·100% most extreme observations. Alternatively, if we are unwilling to specify a given significance level, we may use severity curves to obtain an overview of _P values across hypothetical parameter values.
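A severity curve of this kind is straightforward to trace once _P can be computed for arbitrary hypothesized effects. The sketch below (our own illustration for a one-sample t-test, with an assumed design and observed statistic) evaluates Pr(T ≥ t_obs | δ = d) across a grid of hypothetical effects d:

```python
# Sketch: a severity curve traces the strong-form value _P across a grid of
# hypothetical minimal effects, for one fixed observed test statistic.
import numpy as np
from scipy import stats

def severity_curve(t_obs, n, d_grid):
    """Return Pr(T >= t_obs | delta = d) for each hypothetical effect d."""
    ncp = np.asarray(d_grid, dtype=float) * np.sqrt(n)
    return stats.nct.sf(t_obs, df=n - 1, nc=ncp)

n, t_obs = 100, 3.0                      # illustrative design and statistic
d_grid = np.linspace(0.0, 0.6, 7)
for d, p in zip(d_grid, severity_curve(t_obs, n, d_grid)):
    print(f"d = {d:.1f}  _P = {p:.3f}")
```

Reading off where the curve crosses a chosen threshold shows which hypothesized effect sizes the data can, and cannot, speak against, without committing to a single significance level.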

Challenges, limitations and future directions
Despite its potential for strengthening the statistical inferences of communication scientists, the strong-form framework poses various challenges. Most crucially, it is in need of a generally established set of definitions and best practices: as there are still many ways to define or conceptualize similar sets of ideas, researchers will need to add some technical detail to their statistical definitions and calculations when resorting to strong-form tests. While this might raise the bar for practically oriented scholars, we hope that the principles laid out in this paper are formulated clearly enough to be used as a reference guide. A second challenge is finding a way to generalize the principles of strong-form testing to more complicated analyses. By this, we mainly refer to situations where the non-centrality parameter is not easily expressed in terms of standardized effect sizes, and/or the sampling distribution cannot be derived analytically. With regard to the first situation, it might happen that we are unable (or unwilling) to express effect sizes in standardized form: some methodologists have argued against the use of standardized metrics (e.g., Baguley, 2009) and, in some situations, there is no consensus on how a standardized effect should be defined at all (e.g., when variances are not assumed to be homogeneous, as in a Welch ANOVA). Under these circumstances, calculating the non-centrality parameter will often require us to separately define (1) raw effect sizes as well as (2) population variances. Of course, population variances are typically unknown, so the only way to proceed seems to be by using the sample variance to calculate the non-centrality parameter. This is not unreasonable (the sample variance is, after all, an unbiased estimator) but it does introduce an additional layer of variability: the sample variance has its own sampling distribution and may therefore generate both an under- or overestimation of the true population variance.
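The skew of the sample variance's sampling distribution is easy to verify by simulation. The following sketch (normal errors; sample sizes and replication counts are illustrative choices of our own) estimates how often s² falls below the true σ²:

```python
# Sketch: the sampling distribution of the unbiased sample variance s^2 is
# right-skewed, so s^2 underestimates sigma^2 in more than half of all
# samples, most markedly at small n.
import numpy as np

rng = np.random.default_rng(42)
sigma2 = 1.0  # true population variance

for n in (5, 20, 200):
    x = rng.normal(0.0, np.sqrt(sigma2), size=(100_000, n))
    s2 = x.var(axis=1, ddof=1)         # unbiased sample variance
    under = np.mean(s2 < sigma2)       # share of samples that underestimate
    print(n, round(under, 3))          # above 0.5, shrinking toward 0.5
```

Even though s² is unbiased in expectation, underestimates outnumber overestimates, which is the asymmetry that carries over into non-centrality parameters computed from sample variances.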
The key question therefore is as follows: what would this use of a sample variance imply, exactly, for the conclusions drawn from a statistic such as _P? Detailing the answer to this question goes beyond the scope of the current paper, and we leave a full discussion to future studies. However, it seems that there are at least two competing dynamics to be considered: on the one hand, the sample variance will underestimate the population variance more than 50% of the time (its distribution is right-skewed). This means that, more often than not, the non-centrality parameter calculated through the sample variance will be greater than the non-centrality parameter obtained through the population variance. As a consequence, the overlap between the null and alternative distributions will be underestimated more than 50% of the time when using the sample variance, and quantities such as statistical power will be overestimated more than 50% of the time. On the other hand, as we know that _P will be calculated on the basis of a non-centrality parameter that is, on average, overestimated, _P should typically be overly conservative: that is, a given test statistic will, relatively speaking, generate a higher _P value (less corroborative information for δ ≥ d_min) than it would have if we had used the population variance to calculate the non-centrality parameter. This would appear to provide yet another incentive for researchers to obtain high sample sizes: with higher sample sizes, _P as calculated from sample variances would be underestimated less often. 7 Another complication for strong-form testing arises when non-central distributions are not easily expressed in analytical form. This occurs, for instance, for the widely used cases of logistic regression, multilevel models, and structural equation models.
For logistic regression, which is a non-linear model, we need to consider that alternative distributions of the test statistic are contingent on the exact distribution of the independent variables (Demidenko, 2007); for multilevel models we need to specify separate variance components (Snijders, 2005); and for SEM we naturally need to take measurement reliabilities into account (Liu, 2014). While there are workarounds to approximate these non-central distributions without changing much about the principles for the General Linear Model (see Faul et al., 2009, for logistic regression, or Liu, 2014, for multilevel models and SEM), it seems useful to think about developing a framework for strong-form testing that does not require analytical solutions. For instance, one could consider extending the strong-form testing framework to simulation-based methods, which is, in fact, how frequentist testing proceeds in physics and how power analyses are set up for complicated use cases. A detailed treatment or validation of simulation-based applications is far beyond the scope of the present paper, so future studies should flesh out those principles in more detail. We believe, however, that this exercise would be particularly useful: a generalizable framework for strong-form testing would not just add to the statistical practices of communication scientists, but could help all of the social sciences evolve toward a common standard for critical hypothesis testing.
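As a proof of concept for such a simulation-based approach, the sketch below (entirely our own illustration, using a two-group comparison of proportions; all counts and parameters are assumptions) builds the reference distribution of a test statistic under the hypothesized minimal effect by Monte Carlo and computes _P as the share of simulated statistics at least as large as the observed one:

```python
# Sketch: a simulation-based strong-form test. Instead of an analytical
# non-central distribution, we simulate the test statistic under a model in
# which exactly the minimal effect of interest holds.
import numpy as np

rng = np.random.default_rng(3)

def z_stat(x1, x2, n):
    """Two-proportion z statistic with a pooled standard error."""
    p1, p2 = x1 / n, x2 / n
    p = (x1 + x2) / (2 * n)
    se = np.sqrt(2 * p * (1 - p) / n)
    return (p1 - p2) / se

n, p_base, min_diff = 400, 0.30, 0.05    # illustrative design and d_min
reps = 50_000

# Reference distribution: data generated under exactly the minimal effect.
x1 = rng.binomial(n, p_base + min_diff, size=reps)
x2 = rng.binomial(n, p_base, size=reps)
z_ref = z_stat(x1, x2, n)

z_obs = z_stat(150, 118, n)              # hypothetical observed counts
p_strong = np.mean(z_ref >= z_obs)       # simulation-based _P
print(round(z_obs, 2), round(p_strong, 3))
```

The same recipe (simulate under the minimal-effect model, locate the observed statistic in the simulated reference distribution) extends in principle to logistic regression, multilevel models, or SEM, at the cost of refitting the model in every replication.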
With all of this said, there is still one important caveat to be stressed: conducting what we called strong-form tests should not be conflated with 'having strong theories.' Strong-form testing makes a more stringent assessment of theoretical claims possible, at least compared to nil-null testing, but it provides no safeguard against testing theoretically meaningless (or methodologically contrived) claims. Hence, communication scientists should be wary not to overinterpret the results of a strong-form test (or any test, for that matter). A qualitative appraisal of theoretical arguments and methodological choices remains essential, regardless of the value obtained for any summary statistic such as P or _P.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Notes on contributors
Lennert Coenen (PhD, KU Leuven) is an assistant professor at the Tilburg Center for Cognition & Communication (Tilburg University, The Netherlands) and a visiting professor at the Leuven School for Mass Communication Research (KU Leuven, Belgium). His research mainly aims to develop theory, metatheory, and methodology in the area of media effects on public opinion.

Tim Smits (PhD, KU Leuven) is a full professor in Persuasion and Marketing Communication at the Institute for Media Studies, KU Leuven, Belgium. Tim has published on various topics within these fields, but his main research focus pertains to persuasion and marketing communication dynamics that involve health and/or consumer empowerment and how these are affected by situational differences or manipulations. He also has a more methodological line of research on science replicability.

Funding
Lennert Coenen's contribution to the paper was made possible (in part) by his postdoctoral fellowship at the Research Foundation Flanders [12J7619N].

7. All of this being said, initial simulations suggest that over- and underestimation of the population variance will occur with almost equal probability when errors are normal and sample sizes are reasonably large.