Will the ASA's Efforts to Improve Statistical Practice be Successful? Some Evidence to the Contrary

ABSTRACT Recent efforts by the American Statistical Association to improve statistical practice, especially in countering the misuse and abuse of null hypothesis significance testing (NHST) and p-values, are to be welcomed. But will they be successful? The present study offers compelling evidence that this will be an extraordinarily difficult task. Dramatic citation-count data on 25 articles and books severely critical of NHST's negative impact on good science, underlining that this issue was/is well known, did nothing to stem its usage over the period 1960–2007. On the contrary, employment of NHST increased during this time. To be successful in this endeavor, as well as restoring the relevance of the statistics profession to the scientific community in the 21st century, the ASA must be prepared to dispense detailed advice. This includes specifying those situations, if they can be identified, in which the p-value plays a clearly valuable role in data analysis and interpretation. The ASA might also consider a statement that recommends abandoning the use of p-values.


Introduction
The American Statistical Association recently launched a number of unprecedented initiatives aimed at improving statistical practice. These include the ASA statement on p-values (Wasserstein and Lazar 2016) and the October 11-13, 2017 Symposium on Statistical Inference in Bethesda, MD, which in turn spawned this TAS Special Issue, "Statistical Inference in the 21st Century: A World Beyond p < 0.05," an online, permanently open access issue of The American Statistician. For these efforts, the ASA is to be warmly congratulated. But will they be successful?
Leaving aside the tricky question of what constitutes "successful," and the time horizon this might involve, it is clear that the ASA faces a daunting uphill battle. As Berry's (2017) introductory quotation spells out, previous attempts to tackle this problem, especially the rampant misuse and abuse of null hypothesis significance testing (NHST) and its ubiquitous p-value, have been totally ineffective.
The present contribution provides empirical support for this claim. More specifically, it is shown how a series of highly cited articles and books exposing the fundamental weaknesses associated with NHST, and its detrimental effects on knowledge development, nevertheless were unable to prevent its inexorable rise in the social and management sciences over the period 1960-2007. This does not bode well for the ASA's initiatives.

Data on the Spread of NHST, 1960-2007
Data on the spread of NHST in the social and management sciences were obtained from Hubbard (2016, pp. 16-30). He content-analyzed a randomly selected issue of leading journals in the social (geography, political science, psychology, and sociology) and management (accounting, economics, finance, management, and marketing) sciences for every year from 1960 through 2007 to estimate the incidence of empirical research using NHST in these areas. Leading journals were targeted since they would be expected to feature best research practices. For each social science discipline, these journals are given in parentheses: Geography ( This examination resulted in a total of 5541 empirical articles employing NHST across the four disciplines. The percentage of empirical articles employing NHST in these four disciplines, on a decade-by-decade basis (the final "decade" being 2000-2007), is given in column 10 of Table 1. The above analysis was repeated for the management disciplines. Being younger disciplines, some journals did not span  Review, Economic Journal, Journal of Political Economy, Quarterly Journal of Economics, Review of Economics and Statistics); Finance (Journal of Finance, Journal of Financial Economics, 1974, Journal of Financial and Quantitative Analysis, 1966, Journal of Money, Credit and Banking, 1969); Management (Academy of Management Journal, Administrative Science Quarterly, Human Relations, Journal of Management, 1975, Journal of Management Studies, 1964, Organizational Behavior and Human Decision Processes, 1966, Strategic Management Journal, 1980); and Marketing (Journal of Consumer Research, 1974, Journal of Marketing, Journal of Marketing Research, 1964).
All told, some 7762 empirical articles in the management sciences relied on NHST. Column 11 of Table 1 shows this as a percentage on a decade-by-decade basis. Table 1 presents the Google Scholar citations earned by 25 works (18 articles, 7 books) from the dates of their initial publication through December 10, 2017. It reveals that these ostensibly influential publications highly critical of the uses and abuses of NHST were nevertheless unable to reduce its prevalence. On the contrary, every decade from 1960 to 2007 saw an increase, or no diminution, in its adoption.

Decade-by-Decade Citation Analysis
During the 1960s, when approximately 52%-56% of empirical studies in the management and social sciences employed NHST, four of the earliest and most damning indictments of how its use retards good science appeared in the literature. The articles by Rozeboom (1960), Bakan (1966), Meehl (1967), and Lykken (1968) ignited a debate that continues to this day. Between them they garnered 72 citations in this decade.
The beginning of the 1970s saw the publication of Morrison and Henkel's (1970) anthology, The Significance Test Controversy, which included the four articles mentioned above and drew further attention to this issue. Later that decade, Greenwald (1975), Carver (1978), and Meehl (1978) expressed concerns over the damaging effects of NHST. These four efforts attracted 191 citations for 1970-1979. The cumulative citations earned by the eight works listed in Table 1 for 1960-1979 now stood at a respectable 688. Yet 1970-1979 revealed a substantial increase over 1960-1969 in the proportion of empirical research using NHST in both the social (from 56% to 72%) and management (from 52% to 80%) sciences (see Table 1).
Six further prominent critiques of NHST appeared in the 1980s, most of them in the latter half. During this time they gathered 281 citations, with Leamer (1983), at 204, responsible for the lion's share of these. For 1980-1989, a total of 1603 citations accrued to the 14 contributions in Table 1 decrying NHST. Meanwhile, the cumulative citations of these reached 2291. Despite this, the percentage of empirical research based on NHST forged ahead to 84% in the social, and 89% in the management, sciences.
The 1990s witnessed the arrival of two juggernauts by Cohen (1990, 1994) on the damage to scientific progress caused by NHST. At 543, his 1994 article gained the most citations for this decade (or any before it), while his 1990 publication, at 469, took third place. Sandwiched between these, on 475, was Meehl (1978). A total of 4737 citations of research challenging the ubiquity of NHST occurred during 1990-1999. Cumulatively, this total rose to 7028. By now, almost predictably, the incidence of NHST featured in empirical research moved ahead in both the social, 92%, and management, 92%, sciences.
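The "increase, or no diminution" pattern can be checked directly from the percentages quoted in this article. The following is a minimal sketch, not a reproduction of Table 1: the dictionary and helper names are illustrative, and the 2000-2007 row rests on an assumption, since the Conclusions quote "92% and 93%" for that period without stating which figure belongs to which field.

```python
# NHST prevalence (% of empirical articles), as quoted in the text of this article.
prevalence = {
    "1960-1969": {"social": 56, "management": 52},
    "1970-1979": {"social": 72, "management": 80},
    "1980-1989": {"social": 84, "management": 89},
    "1990-1999": {"social": 92, "management": 92},
    # Assumption: the Conclusions give "92% and 93%" for 2000-2007;
    # assigning 92 to social and 93 to management is a guess at the order.
    "2000-2007": {"social": 92, "management": 93},
}


def series(field):
    """Return the percentages for one field in chronological order."""
    return [prevalence[decade][field] for decade in prevalence]


def is_non_decreasing(values):
    """True if no value is lower than the one before it."""
    return all(later >= earlier for earlier, later in zip(values, values[1:]))


for field in ("social", "management"):
    print(field, series(field), "non-decreasing:", is_non_decreasing(series(field)))
```

Swapping the two assumed 2000-2007 values would not change the outcome of the check, because either assignment leaves both series without a decline.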
The largest number of citations attracted by the 25 studies in Table 1 is for the (partial) decade 2010-2017. Over this period, these works were cited 14,448 times. Cohen (1994) and Cohen (1990) took the first and third places with 2070 and 890, respectively. Wilkinson et al. (1999), on 1590, are runners-up. A recent newcomer to this debate warrants special attention. This is Wasserstein and Lazar's (2016) "The ASA's Statement on p-Values: Context, Process, and Purpose." In less than 2 years this statement has gone viral, from a citations-generated perspective, with 762 citations. Incredibly, the cumulative total number of citations won by the 25 articles and books in Table 1 for the period 1960-2017 is 32,360.
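The running totals quoted above can be cross-checked with simple arithmetic. The sketch below uses only figures stated in the text; the variable names are illustrative, and the two "implied" values are derived by subtraction rather than reported here, so they should be read as inferences and not as entries from Table 1.

```python
# Per-decade citation totals stated in the text (all works listed at the time).
reported_decade = {
    "1960-1969": 72,
    "1980-1989": 1603,
    "1990-1999": 4737,
    "2010-2017": 14448,
}

# Cumulative totals stated in the text, keyed by the last year covered.
reported_cumulative = {1979: 688, 1989: 2291, 1999: 7028, 2017: 32360}

# Consistency checks on consecutive cumulative totals.
assert reported_cumulative[1979] + reported_decade["1980-1989"] == reported_cumulative[1989]
assert reported_cumulative[1989] + reported_decade["1990-1999"] == reported_cumulative[1999]

# Implied (not reported in the text): citations to the eight works during 1970-1979.
implied_1970s = reported_cumulative[1979] - reported_decade["1960-1969"]

# Implied (not reported in the text): citations during 2000-2009.
implied_2000s = (
    reported_cumulative[2017]
    - reported_cumulative[1999]
    - reported_decade["2010-2017"]
)

print("Implied 1970-1979 total:", implied_1970s)
print("Implied 2000-2009 total:", implied_2000s)
```

On these figures the 1970-1979 total for all eight works comes to 616 (of which 191 went to the four works first published in that decade), and the 2000-2009 total comes to 10,884.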
In closing this section, it is instructive to note that over these same decades very few articles were published attempting to defend the practice of NHST. A notable exception is Hagen (1997). Another is Wainer (1999), although he could only muster "One cheer for null hypothesis significance testing." And while virtually all of the 25 works listed in Table 1 are severely critical of NHST and its baneful consequences for scientific progress, some, for example, Nickerson (2000, p. 241), remark "that when applied with good judgment it [NHST] can be an effective aid to the interpretation of experimental data." The only problem is that, in a 60-page journal article, Nickerson never provides us with examples showing how.

Further Insights from Table 1
Other patterns in the data in Table 1 deserve notice. The first is that some 14 of the 24 (58%) entries show monotonic increases in their citation rates over the decades (Wasserstein and Lazar, 2016, were excluded because there is no prior decade for comparison). This is especially impressive for the earlier-published articles, such as Rozeboom (1960), Lykken (1968), Carver (1978), and Meehl (1978).
Second, for 17 of the 24 (71%) entries, the period 2010-2017 is the one recording their maximum number of citations. This latest surge in numbers, which is evident also in the years preceding 2010-2017, possibly could reflect greater concern with bad research practices. On the other hand, it is far more likely that this growth in citations is attributable to the profusion of journals, the development of more extensive and sophisticated citation-tracking systems, and manipulations to inflate them, over recent decades.

Discussion
Before dispensing advice to the scientific community on p-values and related matters, it appears that the statistics profession must first get aspects of its own house in order. This is because it is not just practitioners who can misinterpret p-values, but also statisticians (Hubbard 2016, pp. 207-208; McShane and Gal 2017). Indeed, Gelman and Carlin (2017, p. 900) sympathize with those who may question the merits of recommendations by statisticians to improve practice given "the mess we have helped to create." The proliferation of p-values is central to this mess. Here is Berry's (2017, p. 896) take: "We have saddled ourselves with perversions of logic – p-values … [which] are fundamentally un-understandable … We created a monster … The only reasonable route forward is to kill it." This sentiment is echoed, in more understated fashion, by Matthews (2017, p. 40). Or consider Briggs's (2017, p. 897) view that "There are no good reasons nor good ways to use p-values. They should be retired forthwith." It is in this context that I applaud the courage of David Trafimow to actually ban the use of p-values in Basic and Applied Social Psychology, a journal he edits (Trafimow and Marks 2015). This is a policy that Cumming (2014), whose article "The New Statistics: Why and How" has already gained an impressive 1180 Google Scholar citations, would surely approve.
The ASA statement on p-values (Wasserstein and Lazar 2016), of course, had to be of a general nature. Subsequent publications on the topic of the appropriate and inappropriate uses/interpretations of p-values, whether from the ASA or elsewhere, must be specific; the more specific the better. They must amount to a list of Do's and Don'ts concerning p-values. A good place to start would be for the ASA to articulate those circumstances, if they exist, in which use of NHST clearly is beneficial. At the same time this will serve to illustrate that its rank and file usage is little more than scientist window dressing.

Conclusions
This article has furnished overwhelming evidence supporting Berry's (2017) introductory quotation regarding the abject failure to rein in the use and abuse of NHST and p-values. With 32,360 citations between them, the kind of publicity only dreamed of in academic circles, 25 publications severely critical of NHST have not been able to arrest, never mind reverse, its growing popularity in the social and management sciences. In fact, I would not be at all surprised to learn that its percentage usage in empirical work in these areas has actually inched ahead of the 92% and 93% highs for 2000-2007 in more recent years.
If the ASA's late intervention in this decades-long saga to improve statistical practice is to have any hope of success, its advice must be both specific and possibly radical. By specific I mean identifying those situations, if any, where p-values make a genuine contribution. As amply demonstrated, general statements have been spectacularly unsuccessful. By radical, I mean that the ASA might contemplate issuing a statement banning the use of p-values. Such actions will be difficult, but necessary, if the statistics profession is to be relevant both in the classroom and in scientific method in the 21st century.