On the term"randomization test"

There exists no consensus on the meaning of the term"randomization test". Contradicting uses of the term are leading to confusion, misunderstandings and indeed invalid data analyses. As we point out, a main source of the confusion is that the term was not explicitly defined when it was first used in the 1930's. Later authors made clear proposals to reach a consensus regarding the term. This resulted in some level of agreement around the 1970's. However, in the last few decades, the term has often been used in ways that contradict these proposals. This paper provides an overview of the history of the term per se, for the first time tracing it back to 1937. This will hopefully lead to more agreement on terminology and less confusion on the related fundamental concepts.


Introduction
Nonparametric tests, involving permutations or other rearrangements of data, have a fascinating history, going back to Fisher (1935a) and earlier if one includes e.g.Fisher's exact test (Fisher 1934(Fisher , 1935b) ) or Eden & Yates (1933).The most well-known benefit of nonparametric tests is that they do not require distributional assumptions (Fisher 1935b, Pitman 1937, Welch 1937, Good 2013).Another benefit is that in multi-and high-dimensional settings, simultaneous permutation or transformation may be used to account for complex, unknown dependencies in data (Westfall & Young 1993, Westfall & Troendle 2008, Pesarin & Salmaso 2010, Hemerik & Goeman 2018b, Hemerik et al. 2019, Blain et al. 2022, Andreella et al. 2023).Nonparametric tests often require permuting or rearranging a dataset many times, each time computing some statistic based on the transformed dataset.Until the arrival of modern computers, this was often computationally infeasible.This, together with the current importance of high-dimensional data, explains why the last few decades have seen increasing interest in nonparametric inference.
Nonparametric tests come in various flavours and the term "randomization tests" is typically used to refer to some of them.However, beyond that, there is no consensus on the meaning of that term.As e.g.Arndt et al. (1996Arndt et al. ( , p.1272)), Ernst et al. (2004, p.676), Berry et al. (2014, p.121), Onghena (2018) and Rosenberger et al. (2019, p.6) note, some authors use the term interchangeably with "permutation tests".Other authors will use the term for a very broad class that contains permutation-based tests as a special case, while yet others will only consider some permutation-based tests to be randomization tests.For example, Romano (1989), Kennedy (1995), Manly & Navarro Alberto (2020) and Lehmann & Romano (2022, ch.17) use the term "randomization test" in a very general sense, while some other statisticians have explicitly argued to use the term in a much more narrow sense (Box & Andersen 1953, Kempthorne & Doerfler 1969, Gabriel & Hall 1983, Ernst et al. 2004, Edgington & Onghena 2007, Onghena 2018, Rosenberger et al. 2019, Ramdas et al. 2022, Zhang & Zhao 2023, Liao et al. 2023).These authors propose reserving the term for nonparametric tests based on data from randomized experiments.Such tests have unique properties.In particular, if we are not willing to assume that the observations are independent -e.g., when considering a specific school class -then such tests still provide valid inference on the sample at hand (Pitman 1937, Edgington 1966, Kempthorne & Doerfler 1969, Onghena 2018).For example, randomization tests can be validly applied in randomized single-subject experiments (Edgington & Onghena 2007).Further, randomization tests do not always require using a set of transformations that has an algebraic group structure.This leads to some added flexibility in experimental design (Hemerik & Goeman 2021).
Due to the inconsistent use of the term "randomization test", there has been much confusion regarding differences between types of nonparametric tests.This has first of all led to widespread misunderstandings regarding model assumptions and the conclusions that tests allow (Edgington 1966, Kempthorne & Doerfler 1969, Edgington 1980, Ernst et al. 2004, Onghena 2018, Rosenberger et al. 2019).Further, such confusion has led to the use of invalid statistical methods (Hemerik & Goeman 2021).Recently statisticians have increasingly felt the need to be explicit about their definition of the term "randomization test" and to give some explanation for their choice (in addition to the above authors, see e.g.Dobriban 2022, Ramdas et al. 2022, Nair & Janson 2023).This is done to acknowledge the fact that there exist different definitions and to reduce confusion among readers.Some authors, in particular Onghena (2018), have investigated the history of the term "randomization test" per se, to gain insight into how and why the term was used in the past.Onghena (2018) traces the term itself back to two publications from 1952 and notes that the term may likely have been coined by an earlier source.However, to the best of our knowledge, no recent author traces the term farther back than 1952 (David & Edwards 2001, David 2008, Berry et al. 2014, Onghena 2018).
In this paper, we trace the term back to multiple publications from the 1930's.Our aim is to understand the histories of the existing interpretations of the term "randomization test".In particular, we give an overview of the different suggestions that have been made in history on how to use the term.The first usage of the term that we have found is by E.S. Pearson, who uses the term once in Pearson (1937).We also discuss the emergence of the related term "permutation test", which only happened in the early 1950's.To the best of our knowledge, the early history of that term per se has not been described elsewhere either.For example, Berry et al. (2014) provides an extensive history of permutation tests, but not of the terminology itself.
This paper is built up as follows.In Section 2, we discuss the early history of the term "randomization test".In particular we discuss the first use of the term in 1937 and the years that follow.In Section 3 we focus on the emergence of the term "permutation test" in the 1950's and early discussions on terminology.In particular, we discuss works that explicitly contrast the terms "randomization test" and "permutation test".Section 4 discusses the recent history of these terms.

The early history of the term
The written history of the term "randomization test" starts with Pearson (1937), to the best of our knowledge.Perhaps the term was used earlier in some private communications.E.S. Pearson's paper appeared in Biometrika, of which he had recently become editor, following his father K. Pearson's death (Cox 2001).Although Pearson (1937) uses the term only once and does not explicitly provide arguments for using it, his article does provide clues as to why he did.We first discuss the history leading up to Pearson (1937), to better understand Pearson's choice of words.

Before Pearson's 1937 article
As is well known, prior to 1937, the term "randomization" was commonly used by statisticians in the context of design of experiments (Fisher 1925(Fisher , 1935a)).Further, from 1935 onwards, the knowledge was becoming widespread that this principle of randomization of treatments can be used to construct tests that do not rely on normality (Eden & Yates 1933, Fisher 1935a).For example, Welch (1937) considers nonparametric tests based on what he and several others call "randomization theory".
Note that the term "permutation test" was not in use in the 1930's.Another important observation to keep in mind is that around 1937, by many statisticians, the prime way to guarantee valid statistical inference was seen to be a randomized experiment, i.e., an experiment involving randomization of treatments, such as fertilizer.Indeed, Fisher (1935a, p.51) went so far as to say [...] the physical act of randomisation [...] is necessary for the validity of any test of significance [...] Such statements probably strengthened readers' association of nonparametric tests with experimental randomization (also see the quote from Box & Andersen 1953 below).Finally, it should be noted that a large part of the data that statisticians faced, came from randomized experiments.These insights will help to understand why authors in the late 1930's started to use the term "randomization tests".

Pearson (1937)
The aim of Pearson (1937) is to provide an overview of various nonparametric tests based on data from randomized experiments.He calls the nonparametric test from Fisher (1935a, §21) a "test based on randomization".Further in the paper he discusses a different nonparametric test and refers to it as an "application of this principle of randomization".These phrases are not surprising, since he considers tests that rely on experimental randomization.
Keep in mind that the tests in Pearson (1937) do not rely on randomly rearranging the dataset, but rearrange the dataset deterministically in all allowed ways, just like other early papers that use the term "randomization test".In other words, these papers do not randomly shuffle data, although this would later become a popular strategy in practice for reducing computation times (Hemerik & Goeman 2018a).Note that Eden & Yates (1933) do use random shuffling, but do not use the term "randomization test" and are not cited by Pearson (1937).
One probable reason why Pearson (1937) chose the term is as one might guess: the test relies on data from experiments involving randomization.However, to better understand his choice for the term, we must consider the other ways in which he uses the word "randomization".For example, consider his usage of the phrase "randomization of yields" (Pearson 1937, p.63).There, he is referring to a method in Welch (1937), which appears in the same number of Biometrika.Welch (1937) discusses a nonparametric randomization test for Latin square designs.This test, like other randomization tests, requires rearranging the treatment assignments in all allowed ways.This deterministic rearrangement is what Pearson calls "randomization of yields" and several authors would follow this manner of wording.Part of the explanation for this choice of words may be that the rearrangements that he considers, are analogous to the possible randomization patterns during the physical field experiment.Such an analogy is always present in randomization tests.For example, consider a standard randomized clinical trial, where 5 individuals receive treatment A and 5 treatment B. Then there are 10 5 possible treatment randomization patterns, i.e., 10 5 ways to administer the treatments.Analogously, the randomization test then requires rearranging the data in the 10 5 corresponding ways.
Another clue to understanding Pearson's choice of words, is his use of the term "randomization distribution", which has also been used by several later authors.By this term, he means the distribution obtained by re-computing the test statistic for all allowed rearrangements of the data.The test compares the observed test statistic with a quantile of this "randomization distribution".
To conclude, Pearson (1937) may have chosen the term "randomization test" for some of the following four reasons: 1. the fact that he considered data from randomized experiments; 2. the fact that according to e.g.Fisher (1935a), experiments involving randomization were considered the prime way of guaranteeing valid statistical inference; 3. the analogy between the rearrangements considered by the test and the experimental randomization; 4. the fact that the test compares an observed statistic with the "randomization distribution", in Pearson's words.

After 1937
In the years immediately following 1937, "randomization test" is used by a few other authors.Both Cochran (1938) and McCarthy (1939) call the nonparametric test in Welch (1937) a "randomization test".Pearson (1938) does not use the term, but he does say "the distribution [...] under randomization", which is similar to the term "randomization distribution" used in Pearson (1937).Nair (1940) writes "test by randomization".Thus, the term "randomization test" was by 1940 already somewhat established, although it should be added that in the 1940's the term was hardly used.Instead, authors preferred to say e.g."tests based on the principle of randomization".
A main question is whether authors saying "randomization test" or "test based on the randomization principle" associated this terminology with randomized experiments exclusively.This is difficult to pinpoint, but several early authors did not explictly assume that the data come from a randomized experiment.Here are some examples.The first one is Neyman (1942), which was written in 1939 already.Here, Neyman considers a class of nonparametric tests, that he does not explicitly assume to be based on data from randomized experiments.Instead, he simply assumes "symmetry of the probability law" of the variables, which guarantees exactness.He nevertheless says that the methodology is "commonly known as the method of randomization" and also uses the term "randomization tests".As further examples, Craig (1941), Lehmann & Stein (1949), Wolfowitz (1949) and Hunt (1951) consider nonparametric tests based on two samples from two populations.They do not explicitly assume that the two samples come from an experiment with randomized treatments.They do not use the term "randomization test", but, they do all use phrases like "test based on the method of randomization".The real-life example provided in Lehmann & Stein (1949, p.29) is about a randomized experiment, but of course this does not prove that they exclusively had such applications in mind.Dwass (1957) considers tests for two samples from two populations and calls them "randomization tests" in the title of his paper.Apart from the title, he does not mention "randomization", except when he writes "randomization tests" when referring to the methodology in Lehmann & Stein (1949).Note that there were certainly also articles on nonparametric testing that did not mention "randomization" at all (Hotelling & Pabst 1936, Wald & Wolfowitz 1943, 1944, Hoeffding 1952).
The works by Kempthorne (1952Kempthorne ( , 1955) ) are the first to frequently use the term "randomization test".Importantly, unlike the authors just mentioned, Kempthorne does explicitly restrict his attention to data from randomized experiments.Later, in Kempthorne & Doerfler (1969, §4) he strongly emphasizes that by "randomization tests" he only means tests from such randomized experiments; see §3.Other early users of "randomization test" are Anscombe (1948), Kempthorne & Barclay (1953), Mood (1954), Wilk (1955), Siegel (1956), Napier et al. (1956), Watson (1957).Most of these works are focused on randomized experiments.Some of these works are more mathematical and do not specify whether randomized experiments are considered.Siegel (1956) uses the term for a general class of tests that is not restricted to randomized experiments, as is clear from the examples he provides.Chung & Fraser (1958) do not consider the term to be limited to randomized experiments either.Indeed, they also use the term in the context of observational studies, for example a comparison of alcoholics and non-alcoholics.Moreover, they suggest that some others also use the term in such a general way: Tests based on permutations of observations are non-parametric tests.For the two-sample problem they require no more than the basic assumption of the problem -that under the null hypothesis the two samples behave as a single sample from a population.Actually, they require less -that, under the null hypothesis, the probability distribution is symmetric under all permutations of the observations.If the populations correspond to different "treatments," this symmetry can be assured by randomly assigning the treatments to the experimental units.As a result, these tests are often called randomization tests.
Thus, in the interpretation of Chung & Fraser (1958), experimental randomization is optional and not necessary for a method to be called a randomization test.On the other hand, they do claim that randomization tests got their name because they are sometimes based on data from randomized experiments.
3 Emergence of the term "permutation tests" and early discussions on terminology (1950's and 1960's) To the best our knowledge, the term "permutation test" is first used in Cox (1952) and soon after in Kruskal & Wallis (1952), Box & Andersen (1953) and Box & Andersen (1954).Before 1952, sometimes formulations such as "tests based on permutations of the observations" were used.For example, Wald & Wolfowitz (1944) and Hoeffding (1952) used this exact wording.It is of course not surprising that at some point this was shortened to "permutation tests".Cox (1952) is a review of the book by Kempthorne (1952).Cox (1952) uses "permutation test" interchangeably with "randomization test", although the book under review in fact consistently uses "randomization test".Since the randomization tests in Kempthorne were indeed based on permutations, Cox's use of "permutation test" was perhaps not considered objectionable.
Around this time, some authors felt that the term "randomization test" was strongly associated with experimental randomization.Some thought that this was limiting and that a name was needed for general permutation-based tests that were not necessarily based on randomized experiments.In this vein, none other than E. S. Pearson' former PhD student G. E. P. Box, together with S. L. Anderson, wrote in Box & Andersen (1953): An approach to the problem of hypothesis testing which does not involve assumptions concerning the form of the parent distribution was given by R.A. Fisher in 1937 [sic].In this method, the argument has come to be associated with the concept of randomization (i.e. the concept of arranging an experimental program in a randomized design).It should be noted, however, that, in fact, this type of test, known as the randomization test, might more pertinently be called a permutation test.In fact it is only necessary to assume that the samples are drawn from the same distribution.The only restriction on the distribution is that the likelihood is unchanged when the observations are rearranged.For example independent observations from any populations, satisfy this condition.
Their point was that permutation-based testing is not restricted to experimental data, but can also be applied to certain observational data.This idea was not new, see e.g. the test in Fisher (1936, p.58) comparing Englishmen and Frenchmen.However, Box & Andersen (1953) felt that permutation-based tests in general had become too much "associated with the concept of randomization".They felt a term was needed for general permutation-based tests that were not necessarily based on randomized experiments.They proposed to use the term "permutation tests" for this general class of methods.
From 1952 onwards, the term "permutation test" became increasingly common (Box & Andersen 1955, Ruist 1955, Hack 1958, Wallace 1959, Lehmann 1959), with some authors writing "permutation or randomization tests" (Box & Andersen 1955, Dempster 1960, Cox & Kempthorne 1963).Kempthorne & Doerfler (1969, §4) provides an insightful discussion on the terminology "permutation test" versus "randomization test".They argue (like e.g.Edgington 1966) that nonparametric tests based on randomized experiments have some unique properties that other permutation-based tests do not; these differences were briefly discussed in the Introduction of this paper.For these other permutation-based tests they propose to use the term "permutation tests".They propose to reserve the term "randomization tests" for tests based on randomized experiments.Similar suggestions are later done in Gabriel & Hall (1983, p.828), Ernst et al. (2004, p.676), Edgington & Onghena (2007), Onghena (2018) and Rosenberger et al. (2019, p.6). Lehmann (1975) always uses "permutation tests" and not "randomization tests", but does clearly distinguish between the "population model" and the "randomization model" (see also e.g.Onghena 2018).The former is the well-known model involving independent, uniform sampling from populations.The latter model does not make such assumptions and requires randomization of treatments.Nonparametric tests based on that model are precisely what e.g.Kempthorne calls randomization tests.
Currently, many authors consider "randomization tests" to be a quite general term, that does not necessarily refer to tests involving data from randomized experiments.As a recent example, the already classic paper Candes et al. (2018, p.557) introduces the term "conditional randomization tests", which has often been used since.This term refers to a Monte Carlo test that repeatedly randomly samples from a conditional distribution, given certain covariates.Such tests do not presuppose experimental randomization and hence are not randomization tests in the sense of e.g.Edgington and Kempthorne.

Discussion
We have seen that the term "randomization test" is older than "permutation test".The latter term was introduced in the early 1950's for general tests based on permutations.This was done to avoid associations with experimental randomization, since permutation-based tests do not necessarily rely on such randomization.Such distinctions can be important, because tests based on experimental randomization have unique properties, that other permutation methods lack, as discussed in the Introduction.Unfortunately, there has never been a clear, universally respected consensus regarding the term "randomization test" -although there almost was one around the 1970's.In the last few decades, the term "randomization test" has increasingly been used for general classes of tests, not necessarily based on experimental randomization.The consequence of such contradicting uses of terms is confusion (Onghena 2018, Hemerik & Goeman 2021).Schopenhauer (1859) said the following on attaching meanings to words: The words, therefore, are no longer unappropriated, and to read into them a meaning entirely different from that which they have had hitherto is to misuse them, to introduce a licence according to which anyone could use any word in any sense he chose, in which way endless confusion would inevitably result.Locke has already shown at length that most disagreements in philosophy arise from a false use of words.
(translation by Payne 1957).If we agree with this, then the meaning that early authors like Pearson (1937), Cochran (1938) and Neyman (1942) gave to the term "randomization test", is worth studying.What makes this challenging, however, is that these authors mostly dealt with data from randomized experiments.It is not clear whether some considered certain tests outside this domain to be randomization tests.In any case, what we can do is take note of works from the 1950's and 1960's that explicitly define what is meant and not meant by the term.While Chung & Fraser (1958) do not mind using the term in a very general way, Box & Andersen (1953) and Kempthorne & Doerfler (1969, §4) object to this.Kempthorne & Doerfler (1969, §4) explain why they consider it important to reserve the term for tests based on experimental randomization.Likewise, Box & Andersen (1953) propose to call general permutation-based tests "permutation tests".Most later authors who discuss the term "randomization test" in detail, roughly agree with the definition of Kempthorne & Doerfler (1969, §4).Those later authors also emphasize the unique reasoning and properties of tests based on experimental randomization (Edgington 1980, Gabriel & Hall 1983, Ernst et al. 2004, Onghena 2018, Rosenberger et al. 2019).Thus, rather than calling a wide range of approaches "randomization tests", using other names can be less confusing.Here are some suggestions.Rather than calling all tests based on randomly permuting observational data "randomization tests", one could say "random permutation test".As another example, general tests of invariance under a group of transformations (Hoeffding 1952, Romano 1989, 1990) may be called "group invariance tests", in line with the existing notion of "group invariance" (Eaton 1989, Giri 1996, Romano 1990, Lehmann & Romano 2022).