Publication bias, statistical power and reporting practices in the Journal of Sports Sciences: potential barriers to replicability

ABSTRACT Two factors that decrease the replicability of studies in the scientific literature are publication bias and studies with underpowered designs. One way to ensure that studies have adequate statistical power to detect the effect size of interest is to conduct a-priori power analyses. Yet, a previous editorial published in the Journal of Sports Sciences reported a median sample size of 19 and scarce usage of a-priori power analyses. We meta-analysed 89 studies from the same journal to assess the presence and extent of publication bias, as well as the average statistical power, by conducting a z-curve analysis. In a larger sample of 174 studies, we also examined a) the usage, reporting practices and reproducibility of a-priori power analyses; and b) the prevalence of reporting practices for the t-statistic or F-ratio, degrees of freedom, exact p-values, effect sizes and confidence intervals. Our results provide some indication of publication bias, and the average observed power was low (53% for significant and non-significant findings combined and 61% for significant findings only). Finally, the usage and reporting of a-priori power analyses, as well as of statistical results including test statistics, effect sizes and confidence intervals, were suboptimal.


Introduction
Replicability refers to testing an effect observed in a prior finding using the same study design and data analysis but collecting new data (Nosek et al., 2022). When a study finding can be replicated, researchers can therefore be more confident the original finding is not a false positive or a false negative. Replication projects across several scientific disciplines such as psychology (Open Science Collaboration, 2015), the social sciences (Camerer et al., 2018) and, more recently, cancer biology (Errington et al., 2021) have attempted to replicate original studies. A common outcome of these replication projects was that original effects were often difficult to replicate even when larger sample sizes were collected, and if detected, effect sizes were smaller than in the original report (i.e., overestimated effect sizes). These results have sparked renewed interest in research practices that hinder the replicability of prior findings (Button et al., 2013; Carter & McCullough, 2014; Errington et al., 2021; Francis, 2012; Simmons et al., 2011; Wicherts et al., 2016). In the Neyman-Pearson approach to null hypothesis significance testing or frequentist statistics, three issues that are known to lower the replicability of prior findings are studies with underpowered designs, p-hacking, and a scientific literature that suffers from publication bias (Bakker et al., 2016; Button et al., 2013; Fraley et al., 2014; Francis, 2012; Franco et al., 2014; Stefan & Schönbrodt, 2023).
In frequentist statistics, statistical power is the probability of rejecting the null hypothesis when it is false (i.e., the probability of finding a significant effect when there is one to be found) and depends on the effect size of interest, the sample size, the statistical test and the Type I error rate (Cohen, 1962; Maxwell et al., 2017). For example, studies investigating small and medium effects with small samples are likely to be underpowered, and therefore they have a higher probability of yielding a false negative result. Interestingly, Abt et al. (2020) reported that the Journal of Sports Sciences published studies with a median sample size of 19 participants. Depending on the design and the effect size, a study using a sample size of 19 participants may not have sufficient power, particularly when effects are relatively small and between-participant designs are used (Maxwell et al., 2017). For example, a within-participant design with a sample size of 20 participants and an effect of interest, d_z, of 0.5 would have 56% power for a two-sided test with an alpha of 5%. A between-participant design with a sample size of 10 in each condition and an effect of interest, d_s, of 0.5 would have a power of 19% for a two-sided test with an alpha of 5%. These two studies would require a total sample size of 44 and 172, respectively, to detect their respective effect sizes (d_z and d_s) of 0.5 with a statistical power of 90%. Consequently, it is important to examine if the designs of studies published in the Journal of Sports Sciences are sufficiently powered for effects of interest despite the small sample sizes previously reported (Abt et al., 2020).
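For illustration, the power calculations above can be reproduced with the pwr package in R; this is only a sketch for readers, as the same numbers can equally be obtained with software such as G*Power.

```r
# Illustrative reproduction of the power calculations above using the pwr
# package (shown only as a sketch; G*Power gives the same results).
library(pwr)

# Within-participant design: 20 pairs, d_z = 0.5, two-sided alpha = 0.05
pwr.t.test(n = 20, d = 0.5, sig.level = 0.05, type = "paired")        # power ~ 0.56

# Between-participant design: 10 participants per condition, d_s = 0.5
pwr.t.test(n = 10, d = 0.5, sig.level = 0.05, type = "two.sample")    # power ~ 0.19

# Sample sizes required for 90% power
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.90, type = "paired")      # ~44 pairs
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.90, type = "two.sample")  # ~86 per group (172 in total)
```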
Publication bias occurs when studies with statistically significant findings have a higher chance of being published than studies with statistically non-significant findings. This phenomenon includes editors and reviewers selectively publishing studies with significant findings (i.e., review bias; Mahoney, 1977) and researchers deciding not to submit studies with non-significant results (i.e., the file-drawer problem; Rosenthal, 1979). This is especially problematic when studies have underpowered designs because such studies suffer from large sampling error, which leads to substantial uncertainty about the true effect size (Cumming, 2013). For instance, when a study with a between-participant design investigates a true Cohen's d_s effect size of 0.5 and there are only 20 participants per condition, it is not possible to obtain p < 0.05 unless the observed effect size overestimates the true effect size (Cumming, 2013), as the minimal detectable effect size with an alpha of 0.05 is d_s = 0.64 (Lakens, 2022). Thus, publication bias based on statistical significance, combined with studies with underpowered designs, leads to overestimated effect size estimates (Bakker et al., 2012; Button et al., 2013; Anderson et al., 2017). Furthermore, publication bias increases the false positive report probability (Wacholder et al., 2004), that is, the probability that a published significant finding is actually a Type I error. Despite the relevance of publication bias to the non-replication of studies and cumulative research (Carter & McCullough, 2014; Francis, 2012; Franco et al., 2014), it has been overlooked in the field of sports and exercise science. The presence of publication bias and studies with underpowered designs in a body of literature can be examined using a z-curve analysis (Bartoš & Schimmack, 2022; Brunner & Schimmack, 2020; see also Simonsohn et al., 2014a, 2014b for p-curve). The z-curve method converts the significant and non-significant p-values in a literature into z-scores and uses the distribution of these z-scores to determine the presence of publication bias. It also estimates the average statistical power of the studies conducted and provides an estimate of their replicability.
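The minimal detectable effect size mentioned above can be derived directly from the critical t-value; a minimal sketch in R, assuming a two-sample design with 20 participants per condition and a two-sided alpha of 0.05.

```r
# Smallest observed effect size that can reach p < 0.05 in a two-sample design
# with 20 participants per condition (two-sided alpha = 0.05).
n_per_group <- 20
df <- 2 * n_per_group - 2
t_crit <- qt(0.975, df)                  # critical t for the two-sided test
d_min <- t_crit * sqrt(2 / n_per_group)  # minimal detectable Cohen's d_s
d_min                                    # ~0.64
```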
In frequentist statistics, researchers interested in performing a hypothesis test should ensure their studies are adequately powered to observe the effect size of interest by conducting an a-priori power analysis (Lakens, 2022; see Kovacs et al., 2022; Lakens, 2014; Maxwell et al., 2008 for other approaches to sample size estimation). However, despite the importance of providing an adequate sample size justification, Abt et al. (2020) reported that only 10% of articles (12 out of 120) published in the Journal of Sports Sciences included a-priori power analyses. The lack of an a-priori power analysis may indicate that researchers rely on intuition, rules of thumb, or prior practices (a.k.a. heuristics) to determine study sample sizes, such as "20 participants per condition", or simply use the sample sizes typically reported in their field of research (Anderson et al., 2017; Bakker et al., 2016; Lakens, 2022). Alternatively, it may indicate that some researchers determine the sample size based on the questionable research practice of "optional stopping", which involves repeatedly performing a hypothesis test during data collection and stopping data collection when the observed p-value has reached statistical significance, without controlling the Type I error rate (Simmons et al., 2011; Stefan & Schönbrodt, 2023). "Optional stopping" should not be confused with the good practice of "sequential testing", which also involves repeatedly analysing data but applies appropriate methodological procedures to control the Type I error rate (Lakens, 2014). Furthermore, Abt et al. (2020) also reported that all studies (12 out of 12) that included a-priori power analyses failed to disclose the statistical test to be conducted to detect the chosen effect size. This prevents other researchers from evaluating the adequacy of a-priori power analyses and makes it impossible to assess their reproducibility. Yet, no study has examined the reporting practices, including the magnitude of the effect size of interest, the statistical test and the intended power, that are required at a minimum to enable the reproducibility of a-priori power analyses.
Given that the presence of publication bias and studies with underpowered designs are a threat to the replicability of original findings, one response to these issues is the replication of original studies with well-powered designs (e.g., Open Science Collaboration, 2015). To facilitate the replicability of original studies, studies should provide a complete description of statistical results. Several current approaches within null hypothesis significance testing require the original effect size when assessing the replicability of original studies (Camerer et al., 2018; Errington et al., 2021; Open Science Collaboration, 2015; Simonsohn, 2015). Furthermore, effect sizes from published studies can be used to conduct a-priori power analyses for sample size planning in follow-up studies and to draw meta-analytic conclusions by comparing effect sizes across studies (i.e., in a meta-analysis). Finally, the reporting of effect size estimates allows researchers to discuss the magnitude or practical significance of the studied effect (Kelley & Preacher, 2012; see also Götz et al., 2022; Primbs et al., 2022). However, reporting only the effect size estimate might not be sufficient. The American Psychological Association's (APA) recommendations for best reporting practices include the effect size, confidence intervals (CI), and exact p-value (see Appelbaum et al., 2018). Studies with underpowered designs increase the uncertainty around the effect size estimate, which is reflected in the width of the CI for the effect size estimate (Asendorpf et al., 2013). However, the extent to which these recommended best practices are implemented in sport science journals is unknown.
Our first aim in this study was to assess the presence of publication bias and studies with underpowered designs in a set of studies published in the Journal of Sports Sciences. The rationale for selecting the Journal of Sports Sciences was the use of small samples (median n = 19) and the scarce use of a-priori power analyses in studies published in this journal (Abt et al., 2020). The second aim was to examine the usage, reporting practices and reproducibility of a-priori power analyses. Third, we sought to investigate the prevalence of reporting practices for t-statistics or F-ratios, degrees of freedom, exact p-values, and effect sizes and their CIs.

Methods
The materials, including the study selection protocol, generated dataset, disclosure table and R code for the z-curve analysis, are available at https://osf.io/e3rab/. This study was exploratory with an observational and retrospective design.

Selection protocol
The selection protocol for the studies to be included in the z-curve analysis is based on the Selection Protocol for Replication in Sports and Exercise Science (Murphy et al., 2022). Hence, only applied sports and exercise science studies in the subdisciplines of biomechanics, injury prevention, nutrition, physical activity, physiology, psychology and sports performance published in the Journal of Sports Sciences (from Volume 39 (Issue 12) to Volume 37 (Issue 16)) were manually searched. Furthermore, applied studies had to use either an experimental or quasi-experimental design. Studies were selected if they included a research hypothesis, which was defined as a verbal statement (prediction) about some testable relationship or causal effect between variables, that was either corroborated or falsified by conducting a hypothesis test such as a t-test or F-test. Studies that test a hypothesis are especially sensitive to publication bias, compared to studies that only report descriptive statistics or effect size estimates, as both authors and scientific journals value significant results more than non-significant results (Greenwald, 1975). The z-curve method uses all p-values regardless of whether the p-value is yielded by a non-parametric test (e.g., Wilcoxon Rank-Sum tests, Mann-Whitney U-tests or the Kruskal-Wallis one-way ANOVA). Therefore, p-values derived from these non-parametric tests were also included. A total of 523 studies were screened, of which 349 were excluded for not meeting the above criteria. Eighty-nine studies met the above criteria and were included in the z-curve analysis (Figure 1).

Extracting p-values
After study selection, only one p-value per independent experiment was extracted in order to meet the independence criterion (Bartoš & Schimmack, 2022). The extracted p-value corresponded to the first or primary dependent variable stated in the research hypothesis. In cases where there were multiple research hypotheses, the first or primary hypothesis was considered. If the selected hypothesis included multiple dependent variables, the first or primary dependent variable was considered. If the selected dependent variable was operationalized using several outcome measures of the same construct (i.e., measured in several alternative ways), the first outcome measure reported was selected. Extracted p-values were recomputed when sufficient information was available (i.e., degrees of freedom and F-ratio or t-statistic) using the functions T.DIST.2T or F.DIST.RT for t-tests and F-tests in Microsoft Excel for Mac (Version 16.45). P-values were discarded under five circumstances: a) when the p-value was reported relative to a threshold (e.g., p < 0.05) and could not be recomputed due to lack of sufficient information; b) when studies tested a research hypothesis for non-significance; c) when the statistical test described in the methods did not match the statistical test reported in the results section of the study; d) when the study did not report the effect of interest given the research hypothesis stated in the introduction; and e) when the study expected to find a significant difference in one direction but observed an effect in the other direction; the inclusion of this category of significant p-values in z-curve would be problematic because it could create bias in favour of statistical significance (for a more detailed explanation, see supplemental material at https://osf.io/e3rab/). A disclosure table containing all extracted information for the z-curve analysis can be found at https://osf.io/e3rab/. A total of 174 studies were screened, of which 85 did not meet the above criteria. Thus, 89 studies were included in the z-curve analysis. A secondary z-curve analysis, performed on 119 p-values obtained from studies that aimed to test a hypothesis (n = 89) and studies that did not (n = 30), can be found in the supplemental material at https://osf.io/e3rab/.
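The Excel functions above correspond to standard distribution functions in R; a sketch of the equivalent recomputation (the test statistics shown are purely illustrative).

```r
# Recomputing two-sided p-values from reported test statistics and degrees of
# freedom (equivalent to Excel's T.DIST.2T and F.DIST.RT).
p_from_t <- function(t, df) 2 * pt(abs(t), df, lower.tail = FALSE)      # t-test
p_from_f <- function(f, df1, df2) pf(f, df1, df2, lower.tail = FALSE)   # F-test

p_from_t(2.86, 85)      # illustrative values: t(85) = 2.86 -> p ~ 0.005
p_from_f(5.45, 1, 35)   # illustrative values: F(1,35) = 5.45 -> p ~ 0.025
```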

Publication bias and statistical power
Z-curve is based on the idea that the average power of a set of studies can be derived from the distribution of z-scores (Bartoš & Schimmack, 2022; Brunner & Schimmack, 2020). Z-curve converts significant and non-significant p-values reported in a literature into z-scores, and uses the distribution of z-scores within the range of 0 to 6 to calculate two estimates of average statistical power. First, the conditional mean power is computed using only the significant results in the published studies. From this estimate of average power, it is possible to calculate the Expected Replication Rate, that is, the expected success rate (in the long run) if these studies were exactly replicated. If there is no true effect, the Expected Replication Rate equals the Type I error rate, and if there is a true effect, it equals the average power estimate. Second, the unconditional average power is computed, which estimates the average power of all studies that were conducted, including those that were not published because they yielded statistically non-significant findings and remained in the file drawer. The presence of publication bias can be examined by comparing the Observed Discovery Rate to the Expected Discovery Rate. If the point estimate of the Observed Discovery Rate lies within the 95% CI of the Expected Discovery Rate, there is no evidence of publication bias. The z-curve method also provides other estimates of publication bias such as the file-drawer ratio, which is the ratio between the Expected Discovery Rate and the Observed Discovery Rate and is expressed as the number of unpublished studies that are predicted to exist for every published study. However, one should note that the file-drawer ratio is simply a transformation of the Expected Discovery Rate.
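In practice, the analysis can be run with the zcurve R package used in this study (see Statistical analysis); a minimal sketch, assuming p_values holds the vector of extracted independent two-sided p-values and that the package's default fitting options are acceptable.

```r
# Minimal z-curve sketch with the zcurve package; p_values is assumed to be
# the vector of independent two-sided p-values extracted from the literature.
library(zcurve)

fit <- zcurve(p = p_values)  # converts the p-values to z-scores and fits the model
summary(fit)                 # Expected Replication Rate and Expected Discovery Rate with CIs
plot(fit)                    # histogram of z-scores with the fitted density
```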

A-priori power analyses and their reporting practices
To investigate the frequency of usage of a-priori power analyses and their reporting practices, the sample of studies was expanded to include those studies that did not meet the criteria for the z-curve analysis (as identified in Figure 1 under "Eligibility"). Thus, a total sample of 174 studies was used for the second aim of this study. Two strategies were used to detect the use of a-priori power analyses. First, a visual inspection was performed. The author C.M. searched for any mention of an a-priori power analysis or implicit suggestions of power reported within the methods section (i.e., Participants and Statistical analysis) of an article. If the first strategy was unsuccessful, the article was then downloaded as a PDF and a search was conducted using the keywords "power", "sample", "size" and "participants" or "subjects". In case the study reported the use of a power analysis, the following information was retrieved when available: type of power analysis (i.e., a-priori, sensitivity, compromise or post-hoc), software, statistical test, variable of interest, magnitude of the effect size and its type (e.g., Cohen's d, Hedge's g, Cohen's f), effect size justification (i.e., previous study, pilot study, Cohen's d benchmarks, smallest effect size of interest (SESOI) or meta-analysis), alpha level and its justification, intended power and its justification, and the sample size required to achieve the intended power. For clarity, an a-priori power analysis yields the sample size required to achieve the desired error rates given the effect size of interest. A sensitivity power analysis fixes the sample size and desired error rates, and yields a curve that can be used to evaluate the power across a range of effect sizes of interest. A compromise power analysis fixes the sample size and the effect size and yields the error rates based on a desired ratio between the Type I and Type II error rates. Finally, a post-hoc power analysis yields the observed power of the study given the observed effect size, the alpha level and the actual sample size; however, this type of power analysis is considered bad practice (Christogiannis et al., 2022; Yuan & Maxwell, 2005). Unlike for the first three types of power analysis described (i.e., a-priori, sensitivity and compromise), no information regarding the reporting practices of post-hoc power analyses was retrieved. Once the above information was retrieved, each category was scored dichotomously as either one or zero (1 = present, 0 = not present). Moreover, the author C.M. coded whether each of the 174 sampled studies that tested a research hypothesis included an a-priori power analysis, because studies that have the goal to test a hypothesis (compared to studies that have a descriptive or estimation goal) should be designed to explicitly control the Type II error rate by collecting sufficient data (Lakens & Evers, 2014). We also attempted to reproduce the sample size obtained from a-priori power analyses that reported the effect size magnitude and type, the statistical test and the intended statistical power, using the original statistical software. All studies that included this information had used G*Power.
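As an illustration of such a reproduction attempt (the checks in this study were carried out in G*Power itself), a reported a-priori power analysis can be recomputed from the disclosed inputs; the inputs below are hypothetical.

```r
# Hypothetical reproduction of a reported a-priori power analysis; the inputs
# (paired t-test, d_z = 0.6, alpha = 0.05, power = 0.80) are made up for
# illustration, and the reported sample size would be compared against n.
library(pwr)

reproduced <- pwr.t.test(d = 0.6, sig.level = 0.05, power = 0.80, type = "paired")
ceiling(reproduced$n)  # required number of pairs implied by the reported inputs
```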

Reporting practices of statistical results
To investigate the reporting practices of statistical results, the same sample of studies as described above was used (n = 174).
To select the statistical result, the same procedures applied to extract the p-value for the z-curve analysis were followed. Thus, the statistical result selected was chosen in relation to the first or primary study hypothesis/aim as well as the first or primary dependent variable stated within it. The following statistics were retrieved from the results section of an article when available: mean ± standard deviation (SD) or mean ± standard error of the mean (SEM), t- or F-statistic, degrees of freedom, p-value, standardized effect sizes (e.g., eta squared (η²) and the Cohen's d family) and their CIs. For the purpose of this study, only standardized effect sizes were considered because such effect sizes allow researchers to directly conduct a-priori power analyses for follow-up studies (software such as G*Power requires a standardized effect size as input). For studies in which the study hypothesis was linked to a factorial analysis, we only considered the effect size (e.g., partial eta squared (ηp²), eta squared (η²)) for the omnibus effect of interest (i.e., main or interaction effect). For instance, if a study using a one-way between-participant ANOVA with four levels only reported a pairwise effect size but not the omnibus effect, the pairwise effect size was not considered. A pairwise effect size was only considered if the omnibus effect of interest was a main effect with only two levels. This is because a main effect with only one degree of freedom is equivalent to a statistical test of mean differences (e.g., one-sample or two-sample t-test), and therefore the correct effect size to report would be part of the Cohen's d family. Once the above information was retrieved, each category was scored dichotomously as either one or zero (1 = present, 0 = not present).

Open science practices
At reviewers' request, we also investigated the proportion of studies that were preregistered and made their data publicly available in the full sample of articles (n = 174). This information could be of interest because a) the only way to verify that a-priori power analyses were indeed conducted before the commencement of data collection is through preregistration, and b) suboptimal reporting practices can be offset by researchers making study data publicly available. To assess whether a study had been preregistered, the leading author C.M. first performed a visual inspection for any mention or implicit suggestion of preregistration. If the first strategy was unsuccessful, the study was downloaded as a PDF and a search using "preregist" as the keyword was conducted. Similarly, to assess whether study data had been made publicly available, C.M. first checked both the "data availability" and "data deposition" statements in the manuscript. If no statement was provided, the "supplemental information" in the online version of the article was then checked, if available. If the first strategy was unsuccessful, a search in the PDF document was conducted using the keywords "data", "open" and "public".

Statistical analysis
The R package zcurve 2.0 was used to conduct the z-curve analysis (Bartoš & Schimmack, 2022). Descriptive statistics in the form of counts and frequencies (%) were used to evaluate the prevalence of any type of power analysis and the reporting practices for both power analyses and statistical results. A simple Poisson regression with power analysis as a predictor was used to determine whether a) studies that performed an a-priori power analysis had different sample sizes compared to studies without an a-priori power analysis, and b) among studies that tested a hypothesis, studies that performed an a-priori power analysis had different sample sizes compared to those that did not. The alpha level was set to 0.05. Statistical tests were conducted using R (Version 4.1.2; R Core Team, 2021). To reproduce the a-priori power analyses reported in the set of studies, we used G*Power (Version 3.1.9.6).
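A sketch of the simple Poisson regression described above, assuming the coded data are stored in a data frame named studies with hypothetical columns sample_size and apriori (whether an a-priori power analysis was reported).

```r
# Simple Poisson regression of sample size on the presence of an a-priori
# power analysis; 'studies', 'sample_size' and 'apriori' are hypothetical names.
fit <- glm(sample_size ~ apriori, family = poisson(link = "log"), data = studies)
summary(fit)
exp(cbind(rate_ratio = coef(fit), confint.default(fit)))  # rate ratios with Wald 95% CIs
```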

Results
A total of 89 independent p-values (65 significant and 24 non-significant) were converted into z-scores to fit the z-curve model. The Expected Discovery Rate was 0.53, 95% CI [0.13; 0.71], indicating an average power of 53% for studies reporting both significant and non-significant results (see Figure 2). The Expected Replication Rate was 0.61, 95% CI [0.42; 0.75], indicating that studies reporting significant results have an average power of 61%. This suggests that if we were to conduct direct replications (with the same design, sample size and, therefore, statistical power) of the studies reporting significant findings, only 61% of these studies would be expected to yield another significant effect. Publication bias can be examined by comparing the Observed Discovery Rate (the percentage of significant results in the set of studies) to the Expected Discovery Rate (the proportion of the area under the curve on the right side of the significance criterion). The point estimate of the Observed Discovery Rate (0.73) lies outside the 95% CI of the Expected Discovery Rate of 0.53 [0.13; 0.71], suggesting that we can statistically reject the null hypothesis that there is no publication bias. This conclusion is also supported by a visual inspection of the obtained results, which suggests a potential indication of publication bias (see Figure 2): there is a steep drop in frequency from just statistically significant z-scores (i.e., z just above 1.96) to just non-significant z-scores (i.e., z just below 1.96). This figure suggests that, even though publication bias might not be extreme (i.e., a reasonable proportion of non-significant findings are published in this literature), there are still relatively fewer p-values just above the traditional alpha level of 5% (i.e., z = 1.96) than just below this threshold.
Out of 174 sampled studies, only 44 (25%) included an a-priori power analysis, 2 (1%) included a sensitivity power analysis and 10 (6%) reported a post-hoc power analysis. The result of the simple Poisson regression indicated that the inclusion of an a-priori power analysis (29 ± 18 participants) did not significantly increase the sample size (95% CI [1; 2]; p = 0.46) in comparison to the absence of an a-priori power analysis (26 ± 23 participants). Out of the 174 studies, 129 (74%) tested a hypothesis. Of those, only 38 studies (29%) included an a-priori power analysis, 1 study (0.78%) included a sensitivity power analysis and 8 studies (6%) included a post-hoc power analysis. Among the studies that tested a hypothesis, the result of the simple Poisson regression indicated that the inclusion of an a-priori power analysis (27 ± 15 participants) did not significantly increase the sample size (95% CI [1; 2]; p = 0.78) in comparison to the absence of an a-priori power analysis (26 ± 23 participants). Table 1 presents the frequency of reporting practices in studies with a-priori power analyses. Of the 26 studies that reported the software used, 25 (96%) used G*Power. The results indicate that most studies did not report all components required to allow a full assessment of the reported a-priori power analyses. The minimum components required to computationally reproduce an a-priori power analysis are the statistical test, the magnitude and type of effect size and the intended power, which, with the exception of the latter, were often unreported. Thus, only 8 out of 44 (18%) studies that reported an a-priori power analysis could be computationally reproduced. We could fully reproduce the sample size reported in 7 out of these 8 a-priori power analyses. The a-priori power analysis that could not be fully reproduced reported a sample size of 61, whereas our analysis yielded a sample size of 58.
The types of justification for the effect size estimate used to conduct the pre-study power analyses are presented in Table 2. The most frequently used justification for the selected effect size of interest was a previous study, followed by Cohen's d benchmarks and a pilot study. The two justifications considered best practice, a meta-analytic effect size and the SESOI, were almost never used.
The reporting practices for inferential tests are presented in Table 3. The most frequently reported components were mean ± SD or mean ± SEM for both inferential tests. Other components such as test statistics and degrees of freedom were usually unreported, with the frequency of reporting lower for t-tests. In contrast, effect sizes were reported more often for F-tests than for t-tests. The CI for the effect size was not reported in studies using F-tests, whereas in studies using t-tests the CI was seldom reported.
Regarding the Open Science practice of preregistration, out of 174 studies, only 1 study was registered (as a randomized clinical trial) and 173 studies were not preregistered. The registration of this randomized clinical trial did not include the a-priori power analysis reported in the published article. None of the 174 sampled studies made their data publicly available.

Discussion
The first aim of this study was to investigate the presence of publication bias and studies with underpowered designs in a set of studies published in the Journal of Sports Sciences. The statistical power estimates observed in our sample of studies are not as low as in other disciplines such as psychology and neuroscience (Bakker et al., 2012; Button et al., 2013; Stanley et al., 2018; Szucs & Ioannidis, 2017). For instance, Stanley et al. (2018) reported an average power of 36% in studies included in a sample of 200 meta-analyses. The observed 73% of studies reporting a significant finding is in agreement with Twomey et al. (2021), who similarly observed that approximately 70% of the studies published in three flagship sports science journals reported significant findings. The percentage of non-significant results is slightly higher than in many other disciplines (Fanelli & Scalas, 2010; Scheel et al., 2021). For instance, Scheel et al. (2021) compared the number of significant findings reported in a sample of registered reports with a sample of standard studies in psychology, and found 96% significant findings in standard studies but only 44% in registered reports. The extent of publication bias in sports and exercise science is unknown. However, one estimate can be derived from investigating the difference between the percentage of significant findings and the statistical power. Assuming an average power of 61%, only 61% of the studies investigated in our sample should be expected to detect the investigated effect as statistically significant. Yet, if we consider our study sample, we find that 73% of studies report statistically significant findings, which is at least 12 percentage points more than we should expect, suggesting the presence of a biased literature. However, it is theoretically possible that the estimate of 73% significant results emerges when all studies that are performed are submitted for publication and published, or in other words, when there is no publication bias. To explain the 73% of significant results (the Positive Result Rate, PRR), we must assume some combination of statistical power and proportion of true hypotheses that researchers test: PRR = t(1 − β) + (1 − t)α, where α is the Type I error rate, t is the proportion of true hypotheses and 1 − β is the power of a test (Scheel et al., 2021). Assuming no publication bias and fixing the alpha level to 0.05, a PRR of 0.73 can be achieved with, for example, a statistical power of 96% when 75% of the hypotheses that are tested are true hypotheses. However, we observed relatively low power estimates in the sampled studies (i.e., 53% for both significant and non-significant studies and 61% for significant studies). If we assume the upper bound (75%) of the 95% CI [0.42; 0.75] for significant findings as the true power estimate, researchers would need to test almost exclusively true hypotheses (>95%) to observe 73% significant findings. Yet, these estimates of power and the proportion of true hypotheses seem overly optimistic and might not be supported by empirical evidence (Szucs & Ioannidis, 2017; Wilson & Wixted, 2018). Altogether, our results indicate the presence of some publication bias and studies with underpowered designs, which are likely to increase the number of false positives in a literature body (Ioannidis, 2005) and produce overestimated effect sizes (Bakker & Wicherts, 2011; Button et al., 2013; Kvarven et al., 2020).
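The back-of-the-envelope calculation above can be made explicit with the PRR formula; a short sketch.

```r
# Positive result rate under no publication bias: PRR = t*(1 - beta) + (1 - t)*alpha
prr <- function(t, power, alpha = 0.05) t * power + (1 - t) * alpha

prr(t = 0.75, power = 0.96)      # ~0.73: the no-bias scenario described above
# Proportion of true hypotheses needed for PRR = 0.73 at 75% power:
(0.73 - 0.05) / (0.75 - 0.05)    # ~0.97, i.e., researchers would need >95% true hypotheses
```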
The second aim was to examine the frequency of reported a-priori power analyses and their reporting practices. The fact that only 29% (38 out of 129) of studies testing a hypothesis conducted an a-priori power analysis is concerning because researchers should aim to perform studies that yield informative results when they test hypotheses (as was the goal in 129 out of the 174 studies we examined). An a-priori power analysis is an important way to design studies that have a high probability of yielding informative results (Lakens, 2022; see Kovacs et al., 2022; Lakens, 2014; Maxwell et al., 2008 for other approaches to sample size justification). First, a study with an underpowered design that reports a non-significant effect is barely informative because it lacked the power to find a significant effect if there was one to be found. This makes it especially difficult to publish null findings, which contributes to publication bias. Second, studies with high-powered designs yield more precise effect size estimates and reduce the uncertainty reflected in the width of the CI. Therefore, the adoption of a-priori power analyses is one way to move the field forward. Surprisingly, there was no significant difference in sample size between studies that included an a-priori power analysis and studies that did not. It is possible that this is a Type II error, but it also raises the possibility that a-priori power analyses were performed following the "sample size samba", where researchers choose an "expected" effect size for their power analysis that yields the sample size they wanted to collect to begin with (Schulz & Grimes, 2005). Another possibility, raised by a reviewer, is that power analyses reported as "a-priori" were in fact conducted post-hoc using the observed (overestimated) effect size. Of course, there is no way to verify that a-priori power analyses were indeed conducted before the commencement of data collection without preregistration. Furthermore, the similar mean sample sizes observed (n = 29 and n = 26 for studies with and without an a-priori power analysis that tested a hypothesis, respectively) might indicate that the effect size estimates included in a-priori power analyses are overestimated; all else being equal, an overestimated effect size makes the sample size required to achieve the intended power appear smaller than it actually is (Anderson et al., 2017).
We found that some studies included a post-hoc or retrospective power analysis. As mentioned above, this form of power analysis uses the observed effect size, the alpha level and the actual sample size to evaluate the power of the study after it has been completed. However, this is not good practice because, when the observed effect size is treated as the true effect size, the resulting power estimate is simply a transformation of the observed p-value (Hoenig & Heisey, 2001; Yuan & Maxwell, 2005; see Christogiannis et al., 2022 for a non-technical explanation). For a t-test, whenever the p-value is 0.05, post-hoc power will be approximately 50%, regardless of the combination of sample size and study effect size (Yuan & Maxwell, 2005). If a non-significant p-value is observed, retrospective power will always be low, regardless of the true (always unknown) power of the study (Yuan & Maxwell, 2005). These reasons render post-hoc power analyses uninformative, and it is better to interpret non-significant results with equivalence tests.
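The dependence of post-hoc power on the p-value can be illustrated with a short sketch for a two-sample t-test: whatever the sample size, an exactly significant result (p = 0.05) implies an observed power of roughly 50%.

```r
# "Observed power" of a two-sample t-test whose p-value is exactly 0.05,
# computed by (incorrectly) treating the observed effect size as the true one.
posthoc_power_at_p05 <- function(n_per_group) {
  df <- 2 * n_per_group - 2
  t_crit <- qt(0.975, df)                   # t-value that yields p = 0.05 exactly
  d_obs <- t_crit * sqrt(2 / n_per_group)   # observed d corresponding to p = 0.05
  ncp <- d_obs * sqrt(n_per_group / 2)      # noncentrality if d_obs were the true effect
  pt(t_crit, df, ncp = ncp, lower.tail = FALSE) + pt(-t_crit, df, ncp = ncp)
}
sapply(c(10, 20, 50, 200), posthoc_power_at_p05)  # all close to 0.5
```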
When a-priori power analyses were reported, the reporting practices were often suboptimal. Effect size type and magnitude, the statistical test and the intended statistical power are key components for ensuring the reproducibility of a-priori power analyses because, without them, any attempt to reproduce the analysis requires a large amount of guesswork. For instance, omitting the statistical test is problematic because studies often perform multiple statistical tests, and researchers might therefore not be able to evaluate which statistical test the power analysis was conducted for. Furthermore, power is affected by the study design and the statistical test used (Maxwell et al., 2017). For example, within-participant statistical tests such as a paired t-test and a one-way within-participant ANOVA achieve higher power than their between-participant counterparts (Maxwell et al., 2017). The omission of the dependent variable would not be problematic if studies tested only one single hypothesis that predicted the effect of a treatment or intervention on one dependent variable. However, this is far from reality because studies often test a multitude of hypotheses, and a multitude of dependent variables are measured. The non-reporting of the magnitude of the effect size of interest prevents other researchers and reviewers from reproducing and evaluating a-priori power analyses. Reporting the type of effect size is important because there are several effect sizes within the same family (Goulet-Pelletier & Cousineau, 2018; Lakens, 2013; Morris & DeShon, 2002). For example, considering the simple case of a within-participant (paired) design, Cohen's d can be computed as d_z, d_rm, or d_av (see Lakens, 2013). Finally, none of the studies provided a justification for the alpha level or the desired power. Especially when sample sizes are limited, the error rates in a study are an important consideration (Maier & Lakens, 2022). As outlined in the editorial of Abt et al. (2020), researchers should include a detailed description and justification of the steps followed to conduct an a-priori power analysis, allowing other researchers and reviewers to reproduce its content and ultimately evaluate the validity of the analysis. Most of our sampled studies were published between 2019 and 2020, so we are not in a position to assess whether the suggestions of Abt et al. (2020) have had an impact on the reporting practices of a-priori power analyses. Nevertheless, the Journal of Sports Sciences should consider making it mandatory to report all the information that is required to reproduce and evaluate the validity of a-priori power analyses.
The process of planning the study sample size based on an effect size estimate is not as straightforward as it might seem (Bakker et al., 2016; Collins & Watt, 2021). Researchers are faced with the dilemma of justifying the effect size estimate they are interested in. This is a critical step because the magnitude of the effect size determines the sample size given an intended power. However, despite its importance for an a-priori power analysis, there is empirical evidence suggesting researchers have difficulties in justifying the selected effect size estimate (Bakker et al., 2016; Collins & Watt, 2021). When the effect size estimate is obtained from a previous underpowered study, it is likely that the original effect size estimate is overestimated (Bakker et al., 2012; Button et al., 2013; Simmons et al., 2011). Similarly, pilot studies are also likely to provide overestimated effect sizes (Albers & Lakens, 2018). This is problematic because the use of overestimated effect sizes in a-priori power analyses will result in studies with underpowered designs unless adjustment methods are used (see Anderson et al., 2017). The use of fixed effect sizes based on Cohen's benchmarks may not match the typical effect sizes observed in another research area, because Cohen's benchmarks were derived from effects observed in behavioural science (Cohen, 1988). For instance, Swinton et al. (2022) conducted a Bayesian hierarchical meta-analysis to identify specific effect size benchmarks in strength and conditioning interventions and reported benchmarks for small, medium and large effect sizes of 0.12, 0.43 and 0.78, respectively. A better practice would be to obtain the effect size of interest from a meta-analysis, which can provide more accurate effect size estimates than single studies (see Lakens, 2022 for recommendations when justifying the use of a meta-analytic effect size estimate for an a-priori power analysis). However, to further compound the problem, some caution is needed, as publication bias and flexibility in data analysis can result in overestimated meta-analytic effect sizes (e.g., Kvarven et al., 2020) or even in cases where the meta-analytic effect size turns out to be zero or negligible (e.g., Hagger et al., 2016; Maier et al., 2022). Suggestions for how to deal with the possibility that effect size estimates from the literature are overestimated are offered in Lakens (2022). Best practice would be to power a study design based on the smallest effect size of interest (SESOI; see Anvari & Lakens, 2021; Lakens, 2022). Thus, instead of conducting an a-priori power analysis based on the effect size estimate that the researcher expects to observe, researchers should rely on the smallest effect considered to be theoretically or practically meaningful (see Carlson et al., 2022 for an example). However, none of the sampled studies did so. Researchers might benefit from consulting a statistician if they find it challenging to determine the required sample size for a future study, and researchers in sports and exercise science might want to start a discussion about which effect sizes are deemed large enough to matter, so that future studies can be designed to detect the presence or absence of the smallest effect size of interest. Yet, there are situations where the effect size of interest is uncertain or unknown (for instance, in a new line of research) or where it is plausible that the true effect size is larger than the SESOI that the study design has power to detect. In such cases, researchers should consider another approach to sample size estimation known as sequential testing (Lakens, 2014; see Schönbrodt et al., 2017 for hypothesis testing with Bayes factors). In comparison to an a-priori power analysis (Lakens, 2014), sequential testing can be more efficient because it allows researchers to terminate data collection earlier than planned when the observed effect size is larger than the SESOI or when the presence of the SESOI can be rejected.
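Where a SESOI can be agreed upon, the corresponding a-priori power analysis is straightforward; a minimal sketch, assuming a hypothetical SESOI of d = 0.4 for a two-group comparison.

```r
# A-priori power analysis based on a smallest effect size of interest (SESOI);
# the SESOI of d = 0.4 is hypothetical and must be justified substantively.
library(pwr)
pwr.t.test(d = 0.4, sig.level = 0.05, power = 0.90, type = "two.sample")
# returns the per-group sample size needed to detect effects at least as large
# as the SESOI with 90% power
```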
The third aim was to investigate the reporting practices of inferential tests. Overall, reporting practices of statistical results were suboptimal, and journals and researchers should adopt the journal article reporting standards recommended by the APA (Appelbaum et al., 2018), although others exist, such as the Bayesian Analysis Reporting Guidelines, also known as BARG (Kruschke, 2021). For instance, following APA standards, the results of inferential tests should be reported in the following order: the F-ratio or t-statistic and degrees of freedom (in parentheses) followed by the exact p-value (e.g., F(1,35) = 5.45, p = 0.025 or t(85) = 2.86, p = 0.005). This would be beneficial for a few reasons. First, reporting the F-ratio or t-statistic and degrees of freedom allows readers to recompute and therefore verify the reported p-value. This, together with data sharing, is important given evidence that one in eight papers contained errors in the reported p-value that may have affected the statistical conclusion of the study (Nuijten et al., 2016; see also Artner et al., 2021 for a summary of studies on this topic). From an epistemological point of view, reproducibility should be assessed before replicability because it makes little sense to try to replicate a prior finding if the results supporting the finding are numerically incorrect. Second, both the F-ratio and t-statistic can be used to compute the effect size estimate (see Lakens, 2013). For instance, reporting the F-ratio and degrees of freedom allows computation of partial eta squared (ηp²; e.g., for F(1,35) = 5.45, ηp² = (5.45 × 1)/(5.45 × 1 + 35) ≈ 0.13). Third, it would facilitate machine readability and data usability, enabling the analysis of large sets of data containing p-values. Methods such as p-curve and z-curve that can be used to address meta-scientific questions require the input of exact p-values, which are not always reported. Regardless of the reporting guidelines used, researchers should fully report the statistical results of inferential tests with the goal of facilitating computational reproducibility, allowing other researchers to assess the veracity of prior findings and facilitating cumulative science. Suboptimal reporting practices could be offset by making data and code publicly available, as researchers would then be able to compute statistics of interest that were not reported in the article. However, none of the studies included in our sample (n = 174) made their data publicly available. Therefore, as long as researchers do not make their data, code and methods publicly available, they should fully report the statistical results of inferential tests.
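For instance, the conversions referred to above can be scripted directly from the reported test statistics; a sketch, where the group sizes in the second call are hypothetical.

```r
# Standardized effect sizes recomputed from reported test statistics.
partial_eta_sq <- function(f, df1, df2) (f * df1) / (f * df1 + df2)
cohens_d_two_sample <- function(t, n1, n2) t * sqrt(1 / n1 + 1 / n2)

partial_eta_sq(5.45, 1, 35)         # ~0.13, matching the worked example above
cohens_d_two_sample(2.86, 43, 44)   # hypothetical group sizes giving df = 85
```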
The omission of (standardized) effect size estimates and their CIs is concerning for a few reasons. First, both standardized and non-standardized effect size estimates allow researchers to make a judgement on the practical significance of the magnitude of the studied effect (Asendorpf et al., 2013; Kelley & Preacher, 2012; Schäfer & Schwarz, 2019). Second, standardized effect size estimates can be used to directly conduct a-priori power analyses for follow-up studies (Cohen, 1988; Lakens, 2022; Schäfer & Schwarz, 2019). Third, standardized effect size estimates permit direct comparison across similar studies that collected dependent variables on different raw scales, and can be used in a meta-analysis to draw meta-analytic conclusions. Fourth, when researchers report (standardized) effect size estimates, they should acknowledge and quantify the uncertainty in these estimates. CIs provide information on how accurately the true effect size was estimated (Asendorpf et al., 2013; Kelley & Preacher, 2012). This is especially of interest when studies have small sample sizes, because such studies suffer from large sampling error, which leads to substantial uncertainty around the true effect size. For instance, imagine a researcher who conducted a study with a two-cell design with 10 participants per condition and reported a significant Cohen's d_s of 0.5 while omitting its 95% CI [0.05; 1.05]. Although the observed effect size and p-value were reported, the uncertainty around the estimate makes clear that the test was not very informative about the true effect size. Therefore, researchers should follow the journal article reporting standards recommended by the APA (Appelbaum et al., 2018) and report both effect size estimates and their CIs.
Our investigation has a few limitations that should be addressed here. First, our selection is a pilot sample of original studies published in only one sports science journal. Therefore, our findings are far from a complete picture of the field of sports and exercise science, and should be considered a pilot study for a more comprehensive examination in the future. Furthermore, the small sample of studies included (n = 89) increases the uncertainty around the parameter estimates (Brunner & Schimmack, 2020). Second, the z-curve analysis included only studies that tested a research hypothesis, regardless of the type of hypothesis. Ideally, researchers should explicitly distinguish between primary, secondary and exploratory hypotheses or exploratory results (Cooper, 2020). Whilst researchers should control both the Type I and Type II error rates when testing a primary hypothesis, they only control the Type I error rate when testing a secondary hypothesis. When a study is exploratory, researchers test research hypotheses without a-priori predictions and error rates are uncontrolled. Although the distinction between these types of hypotheses is relevant, the distinction was sometimes ambiguous. This could be resolved if authors stated explicitly the type of research hypothesis tested, allowing other researchers to know whether the study was intended to be hypothesis-testing or hypothesis-generating. Third, the protocol followed to select p-values for the z-curve required us to make multiple subjective decisions because the selected studies often: a) tested vague and multiple hypotheses, b) measured dependent variables that were operationalized using additional measures of the same construct, and c) used dependent variables that were measured in several alternative ways (see Wicherts et al., 2016 for researchers' degrees of freedom). Fourth, although two secondary authors undertook some random verification of the selected data (D.L. verified some coded data for the z-curve analysis and J.W. verified some coded data for the reporting practices and reproducibility of a-priori power analyses), only the primary author extracted and coded the data. This, and the fact that data extraction was often difficult due to researchers' degrees of freedom, might have been a source of bias. Finally, we might come across as cynical when we criticize the lack of preregistration and (sound) a-priori power analyses in our study sample.
Preregistration itself does not make a study better than a non-preregistered study (Lakens, 2019). The goal of preregistration is to allow other researchers to transparently evaluate the severity of a test, that is, to evaluate whether any opportunistic post-hoc decisions have been made given the observed data. We acknowledge the lack of preregistration as a limitation; however, we make both data and code available so that any researcher can reproduce our results and evaluate our claims. Regarding the lack of an a-priori power analysis for our own study, data collection was terminated based on time constraints: the leading author attempted to code as many articles as possible over a 3-month period. This exploratory study was conducted to help in the sample size estimation for a future project in which we attempt to assess the prevalence and reproducibility of a-priori power analyses in a larger sample of studies (https://osf.io/mqbr2/).
Overall, our results suggest that there are substantial barriers that would hinder both computational reproducibility and replicability. First, the point estimate of the Observed Discovery Rate (0.73) lies outside the 95% CI of the Expected Discovery Rate [0.13; 0.71], suggesting the presence of publication bias. Second, the two power estimates indicate that the sampled studies had, on average, inadequately powered designs (as a Type II error rate of around 40% should be considered too high). Third, the low usage of a-priori power analyses as well as the use of effect size estimates obtained from previous studies or pilot studies is problematic given the small samples observed in the field of sport and exercise science (Abt et al., 2020) and the resulting issues with overestimated effect sizes (Albers & Lakens, 2018; Anderson et al., 2017). Fourth, the reporting practices of a-priori power analyses and inferential tests were often suboptimal, preventing researchers from assessing the validity of the results. Finally, the absence of preregistered studies and of studies with publicly available data is concerning because these Open Science practices aim to make research more transparent and reproducible. Therefore, there is substantial opportunity to improve researchers' behaviours through the adoption of Open Science practices such as preregistration, sample size planning based on an a-priori power analysis or sequential testing, and full reporting of statistical results, if the scientific community is to improve these factors in the future.

Figure 1. PRISMA flow diagram for inclusion of studies in the z-curve analysis.

Figure 2. Distribution of z-scores over the [0-6] interval. The vertical red line refers to a z-score of 1.96, the critical value for statistical significance when using a two-tailed alpha of 0.05. The dark blue line is the density distribution for the inputted p-values (represented in the histogram as z-scores). The dotted lines represent the 95% CI for the density distribution. Range represents the minimum and maximum values of z-scores used to fit the z-curve.

Table 2. Justifications of the selected effect size used in a-priori power analyses (n = 44).
SESOI = smallest effect size of interest

Table 3. Frequency of reporting practices for both F-tests and t-tests.