The methodological rigour of systematic reviews in environmental health

Abstract While systematic reviews (SRs) are often perceived as a "gold standard" for evidence synthesis in environmental health and toxicology, the methodological rigour with which they are currently being conducted is unclear. The objectives of this study are (1) to provide up-to-date information about the methodological rigour of environmental health SRs and (2) to test the hypotheses that reference to a pre-published protocol, use of a reporting checklist, or publication in a journal with a higher impact factor is associated with increased methodological rigour of a SR. A purposive sample of 75 contemporary SRs was assessed for how many of 11 recommended SR practices they implemented. Information including search strategies, study appraisal tools, and certainty assessment methods was extracted to contextualise the results. The included SRs implemented a median of 6 out of 11 recommended practices. Use of a framework for assessing certainty in the evidence of a SR, reference to a pre-published protocol, and characterisation of research objectives as a complete Population-Exposure-Comparator-Outcome statement were the least common recommended practices. Reviews that referenced a pre-published protocol scored a mean of 7.77 out of 10, against 5.39 for those that did not. Neither use of a reporting checklist nor journal impact factor was significantly associated with increased methodological rigour of a SR. Our study shows that environmental health SRs omit a range of methodological components that are important for rigour. Improving this situation will require more complex, comprehensive interventions than simple use of reporting standards.


Introduction
Systematic review (SR) is a study design that aims to minimise bias and maximise transparency when answering research questions using existing evidence (Whaley, Edwards, et al. 2020). To achieve its aims, SR methodology is rigorous, consisting of a standardised series of steps including: defining specific research objectives; prespecifying a detailed protocol; conducting comprehensive searches for relevant literature; screening and extracting data for analysis; assessing the potential for systematic error in the included studies; and using quantitative, qualitative, and/or narrative methods to synthesise the included evidence and determine the level of certainty with which the research question has been answered (IOM 2011; Hoffmann et al. 2017; Higgins, Lasserson, et al. 2019; Whaley, Aiassa, et al. 2020). When conducted well, SRs provide trustworthy summaries of evidence, helping to identify knowledge gaps, providing insight into the validity of study methods in a given area of research, preventing unnecessary primary research from being conducted when answers are already known, and cutting through uncertainty when studies are individually inconclusive (Chalmers et al. 2002). SRs also play an important role in improving study design and reporting, the translation of findings from experimental to target contexts, and informing decision-making (Ritskes-Hoitinga et al. 2014; Ritskes-Hoitinga and van Luijk 2019).
The value of systematic review in healthcare, as championed by organisations such as Cochrane, is well-established (Dickersin 2010). The potential value of SR methods for the toxicological and environmental health sciences (henceforth "environmental health") was arguably first raised in the mid-2000s (Guzelian et al. 2005; Hoffmann and Hartung 2006). While there were precursors to SR in elements of the IARC Monographs (Samet et al. 2020) and in disciplines that overlap with environmental health, such as public and occupational health, few environmental health publications that explicitly describe themselves as SRs were published before 2005 (Figure 1). In 2010, the first agency-level guidance on using SR in regulatory risk assessment was issued (EFSA 2010) and publication rates of environmental health SRs were beginning to see a significant increase (Figure 1). In 2014, the first formal SR frameworks for chemical risk assessment and environmental health questions were published (Rooney et al. 2014; Woodruff and Sutton 2014). The increase in the rate of publication of SRs since 2014 has been rapid, with regional, national, and international agencies and research groups developing guidance and implementing varying interpretations of systematic review methodology for environmental health questions (for example, see Vandenberg et al. 2016; NASEM 2017; Schaefer and Myers 2017; Radke et al. 2020; Pega et al. 2021; Whaley, Aiassa, et al. 2020; WHO 2021). In 2020, approximately 1750 environmental health SRs were published (see Figure 1). While there has been a large increase in publications of SRs in environmental health, there is a lack of up-to-date empirical information about their methodological rigour. To our knowledge, three previous studies have directly investigated this. Sheehan and Lam (2015) analysed 48 environmental health SRs and meta-analyses published between 2001 and 2013. Sheehan et al.
(2016) analysed 43 SRs and meta-analyses of ambient air pollution epidemiology published between 2009 and 2015. Sutton et al. (2021) analysed 13 SRs published between 2011 and 2017 in the course of a broader assessment of the extent to which reviews of three exposure-outcome pairs (air pollution and autism spectrum disorder, polybrominated diphenyl ethers and attention deficit hyperactivity disorder, and formaldehyde and asthma) can be characterised as systematic. All three studies found a high prevalence of a range of important shortcomings in adherence to accepted systematic review methodology, including ambiguous formulation of review objectives, lack of pre-published protocols, and lack of critical appraisal of the included evidence using valid instruments. The findings of these studies echo empirical evidence of widespread limitations in the conduct and reporting of SRs in healthcare (Ioannidis 2016; Page et al. 2016; Pussegoda et al. 2017; Gao et al. 2020) and socioeconomic research (Wang et al. 2021).
In this study, we aimed to provide up-to-date information about the methodological rigour of environmental health SRs. To do this, we conducted a survey of recently published, self-described systematic reviews that sought to quantify associations between environmental exposures and health outcomes. We assessed the SRs for methodological rigour, analysing them for frequency of implementation of methodological features that are generally considered to contribute to the transparency, utility, and validity of SRs. We tested three hypotheses: that reference to a pre-published protocol, use of a reporting checklist, or higher journal impact factor is associated with increased methodological rigour of a SR. We chose these hypotheses because reporting standards and pre-published protocols are intended to improve the rigour of SRs, and higher impact-factor journals may be assumed to be publishing more rigorous research. We also conducted some exploratory analyses of relationships in the data to help us interpret our results and develop hypotheses for potential future testing.
For ease of reading, SR-related tools, instruments, and frameworks are identified throughout the manuscript and supplemental materials as their abbreviation. A guide to abbreviations is provided in Appendix A. Supplemental materials for this manuscript are extensive and have been deposited in a structured Open Science Framework (OSF) archive (https://osf.io/j2d5v/). Supplemental material is hyperlinked throughout the manuscript, and a complete hyperlinked index of supplemental materials is presented in Appendix B.

Materials and methods
This study was conducted according to a pre-published protocol registered 19 November 2020 on the Open Science Framework (registration DOI: 10.17605/OSF.IO/MNPVS). The protocol was not externally peer-reviewed; however, registration allowed changes to planned analyses that were made after data had been collected to be transparently identified. The methodology consists of the following components: the selection of a purposive sample of environmental health SRs; the development of a tool, CREST_Basic, for rapid assessment of the methodological rigour of a SR; assessment of the SRs using CREST_Basic and the gathering of additional data to inform the discussion; testing of the primary hypotheses and conduct of the exploratory analyses; and analysis and interpretation of the results. Each is discussed below.

Search strategy
We designed a search strategy to retrieve environmental health SRs. The strategy had three components: a general environmental health concept; a systematic review concept; and a date filter to capture an 18-month period between 1 January 2019 and 30 June 2020. The search was performed in PubMed on 19 November 2020 and yielded 4882 results. The search strategy was validated by checking recall against a set of 20 SRs eligible for inclusion in our study that had been manually selected from a range of environmental health journals. The search retrieved 100% of the validation set. The search strategy and its results are available in the online supplemental materials SM01 (https://osf.io/85c6d/) and SM03 (https://osf.io/j59ku/) respectively.
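The recall check amounts to measuring what fraction of a set of known-eligible records the search retrieves. A minimal sketch in Python, using invented placeholder PubMed IDs rather than the actual validation set:

```python
# Sketch of validating a search strategy by recall:
# recall = |validation set found in search results| / |validation set|.
# The PubMed IDs below are invented placeholders, not the study's records.

def recall(search_results: set, validation_set: set) -> float:
    """Fraction of the known-eligible SRs retrieved by the search."""
    if not validation_set:
        raise ValueError("validation set is empty")
    return len(validation_set & search_results) / len(validation_set)

search_results = {"31000001", "31000002", "31000003", "31000004"}
validation_set = {"31000002", "31000003"}

print(recall(search_results, validation_set))  # 1.0 -> 100% recall
```

A recall below 1.0 would indicate that the search strategy misses known-eligible records and needs broadening.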

Eligibility criteria and screening methodology
We employed a purposive SR sampling strategy aimed at covering a range of environmental health topics in a narrow, recent time window. We defined environmental health as "the investigation of associations between exposures to exogenous environmental challenges and health outcomes," including toxicology and environmental epidemiology, as per our protocol (Menon et al. 2020). To be eligible for inclusion in the SR sample, documents had to fulfil the following criteria:
1. identify explicitly as a "systematic review" in their title;
2. assess the effect of a non-acute, non-communicable, environmental exposure on a health outcome;
3. include studies in people or mammalian models;
4. be available in HTML format;
5. be published between 1 January 2019 and 30 June 2020.
We excluded umbrella reviews and SR protocols; SRs of the wrong population (e.g. plants); SRs of exposure only (e.g. biomonitoring); and SRs of the following exposures: pathogens; poisons; smoking and tobacco-related products, except for exposure to second-hand smoke (to prevent our sample being dominated by studies of direct tobacco exposure); social, psychological, or behavioural risk factors; and housing conditions. We did include diet and addiction as exposures, as these are considered by some researchers to be environmental exposures (Bazzano et al. 2008). SRs had to be available in HTML format to allow copy-pasting of manuscripts into a uniform format stripped of visual cues, and to allow easy removal of identifying information, as per our blinding strategy.
Title and abstract screening was performed independently by two reviewers (PW, JM) in Rayyan QCRI (https://rayyan.qcri.org/). SRs were sorted in reverse chronological order using Rayyan. Eligible studies were screened sequentially for inclusion by JM and PW until a sample of 75 SRs was reached; this was the maximum number of SRs we could evaluate in our budgeted time of 80 investigator hours. Discrepancies in decisions were resolved by discussion. In total, 902 SRs were screened to generate the sample. The final set of 75 SRs is available in SM06 (https://osf.io/mgxwj/) and the excluded SRs in SM13 (https://osf.io/y83tu/).

Development of the SR assessment instrument
In order to achieve our research objectives with the resources we had available, we determined that we required a SR assessment instrument that: could be applied by two Masters-level investigators who are not experts in environmental health (JM and FS); would take less than 30 min to apply, allowing 50-75 SRs to be assessed and the results analysed in our budgeted time of 80 investigator hours; and would generate a score representing the methodological rigour of a SR without falling foul of the issues with scores and scales presented by some appraisal instruments (see e.g. Greenland and O'Rourke 2001; Frampton et al. 2022).
Due to the high number of SR appraisal tools that have been published, we determined it would not be feasible to comprehensively review existing tools for applicability to our task. Instead, we decided to design a new tool, "CREST_Basic," for our specific objectives and capacity. We summarise key information about CREST_Basic below. A complete description of the development of CREST_Basic, including pilot testing, is in the "Data Collection" section of our study protocol SM00 (https://osf.io/c8khr/).

Sources for CREST_Basic
We based CREST_Basic on three sources familiar to us and judged to be informative for our task: CREST_Triage, a rapid appraisal tool that supports editorial triage decisions for environmental health SRs and has been in use at the journal Environment International since 2018 (Whaley and Isalski 2019); AMSTAR-2, the updated version of an established SR appraisal tool that has been cited over 2000 times (Shea et al. 2017); and COSTER, the first set of good practice recommendations published specifically for environmental health SRs (Whaley, Aiassa, et al. 2020). We felt the sources and consensus processes behind COSTER and AMSTAR-2, combined with our experience of evaluating the rigour of hundreds of SRs from CREST_Triage (PW) and PROSPERO registrations (JM), would provide sufficient topic coverage and experience in structured appraisal of SRs from which to derive an instrument suited to our research objectives.

Questions and scoring method
CREST_Basic consists of 11 questions, each asking whether a methodological component associated with SR rigour is present. The questions have either yes/no or yes/no/unclear answer options, designed to be quickly and objectively answered by evaluators. Each question can be flagged by an assessor for "unusual practices" to facilitate follow-up analysis and interpretation. Answers of "yes" are awarded one point, and "no" and "unclear" are awarded zero points. A CREST_Basic score thereby represents how many of 11 methodological components are unambiguously present in a single SR, and for a set of SRs allows the prevalence of each of these 11 components to be calculated. Table 1 shows the questions, the reason for each question, criteria for the answers to each question, and data extracted for additional analysis.
The questions were presented to assessors as a Google Form, of which a PDF copy is available in SM02 (https://osf.io/y7w29/). Some questions have "unclear" as an answer option to allow evaluators to indicate uncertainty when an answer depends on the testimony of the authors rather than being directly observable as a feature of a SR manuscript. For example, a missing element of a Population-Exposure-Comparator-Outcome (PECO) statement, or the presence of a complete search string, can be directly observed. In contrast, whether 100% of search results were screened in duplicate may be stated by SR authors but is not easily observed and may be reported ambiguously.

Interpreting a CREST_Basic score
CREST_Basic provides a measure of the methodological rigour of a SR, defined as the number of a given set of recommended methodological components that are present in a SR. Methodological rigour is a different concept to validity: the presence of a component associated with rigour does not guarantee the validity of a SR, as the component can be implemented to varying degrees of validity; nor does the absence of a component invalidate a SR. However, the absence of components associated with methodological rigour does indicate a potential threat to the validity or utility of a SR. For example, not searching in at least two databases increases risk of selection bias in a SR, and not providing a complete electronic search string means a database search strategy cannot be validated.

SR assessment and data extraction
Blinding the assessment
A blinding strategy was applied to minimise the potential for knowledge of, for example, the publishing journal or author team to influence the way in which CREST_Basic was applied by the investigators. SRs were retrieved by PW and anonymised by copy-pasting article text from the HTML web version of a manuscript into a Microsoft Word document, clearing all formatting to a default of 11 pt Arial text, and removing identifiers for authors (names, affiliations, and CRediT statement) and publication (e.g. journal title, article title, document formatting, links to the article online). Any supplemental material needed for answering CREST_Basic questions was screen-grabbed and pasted at the end of the Word document. Each Word document was allocated a random number generated by Random.org (https://www.random.org/) as a unique identifier. Investigators were instructed not to conduct internet searches on text strings in the Word documents but to request additional anonymised materials from PW. An example of a blinded document is shown in SM08 (https://osf.io/62tpq/), and the text selection from HTML in SM09 (https://osf.io/e8s7c/). Due to probable prior knowledge of included manuscripts and potential conflicts of interest arising from their position as a specialist SR editor, PW did not conduct data extraction and did not view extracted data until JM and FS had completed disagreement resolution. None of the investigators had published a SR eligible for inclusion in this study, so measures to prevent assessment of their own work were unnecessary.

Assessment and data extraction
The anonymised manuscripts were assessed independently by two investigators (FS, JM) using the Google Forms version of CREST_Basic. The investigators were trained on a set of three SRs that met the eligibility criteria for this study, except for being outside the target date range. To reduce the risk of learning effects influencing manuscript assessments, one investigator assessed SRs in ascending order of numerical identifier while the other worked in descending order. Discrepancies in assessments were resolved by discussion. No third-party arbitration was required. After the assessment stage was completed, one investigator (JM) extracted additional information including journal title, impact factor (taken from the journal website), topics of the SRs, included populations of the SRs, and the presence of a CRediT statement or equivalent. Raw data from the data extraction process is available in SM05 (https://osf.io/uv5dg/).

Summary of SR performance against CREST_Basic
Included SRs were allocated 1 point per affirmative answer for each CREST_Basic question. To summarise the performance of each individual SR against CREST_Basic, the points were summed to give a score out of 11. To summarise the performance of all SRs against each CREST_Basic question, we counted the frequency of affirmative answers across the set of included SRs. Because these are counts of document features, we avoid the issues with scores and scales that would arise if, for example, the validity of the findings of individual SRs were represented by a score (Greenland and O'Rourke 2001).
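Both summaries reduce to counting affirmative answers along two axes of an SR-by-question grid. A minimal sketch, using invented answer data rather than the study's actual extraction:

```python
# Sketch of CREST_Basic scoring: 1 point per "yes"; "no"/"unclear" score 0.
# The answers below are invented examples, not data from the study.

QUESTIONS = [f"Q{i}" for i in range(1, 12)]  # the 11 CREST_Basic questions

answers = {  # per-SR answers to each question (invented)
    "SR_A": {q: "yes" for q in QUESTIONS},                       # scores 11
    "SR_B": {q: ("yes" if i < 6 else "unclear")                  # scores 6
             for i, q in enumerate(QUESTIONS)},
}

# Per-SR score: number of components unambiguously present (out of 11).
scores = {sr: sum(a[q] == "yes" for q in QUESTIONS)
          for sr, a in answers.items()}

# Per-question prevalence: affirmative answers across the set of SRs.
prevalence = {q: sum(a[q] == "yes" for a in answers.values())
              for q in QUESTIONS}

print(scores)            # {'SR_A': 11, 'SR_B': 6}
print(prevalence["Q1"])  # 2
```

Summing booleans this way makes explicit that "unclear" is treated identically to "no" for scoring purposes.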

Hypothesis tests
We assessed the distributional assumptions of our data, given groups of unequal size and variance, using the Shapiro-Wilk test (Shapiro and Wilk 1965; Field 2009), the Anderson-Darling test (Anderson and Darling 1952), and probability plots. Informed by these checks, we used the Mann-Whitney U test (Mann and Whitney 1947; Field 2009), computed in XLSTAT for Microsoft Excel (Addinsoft 2021), to test for an association between CREST_Basic score and the presence of a pre-existing protocol, and between score and the use of a reporting checklist; Student's t-test was used as a secondary test (Field 2009). To test for an association between score and journal impact factor, we used Pearson's correlation coefficient (Field 2009).
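The test battery can be sketched in Python with SciPy (the study itself used XLSTAT and probability plots); the scores and impact factors below are simulated for illustration only, not the study data:

```python
# Sketch of the statistical tests described above, on simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated CREST_Basic scores for two groups (invented, not study data)
with_protocol = rng.normal(7.8, 1.9, size=13).round()
without_protocol = rng.normal(5.4, 1.2, size=62).round()

# Distributional checks that inform the choice of tests
w, p_sw = stats.shapiro(without_protocol)           # Shapiro-Wilk
ad = stats.anderson(without_protocol, dist="norm")  # Anderson-Darling

# Primary test: Mann-Whitney U; secondary: Student's t-test
u, p_mw = stats.mannwhitneyu(with_protocol, without_protocol,
                             alternative="two-sided")
t, p_t = stats.ttest_ind(with_protocol, without_protocol)

# Association between score and journal impact factor: Pearson's r
impact_factor = rng.uniform(0.7, 12.4, size=62)     # simulated IFs
r, p_r = stats.pearsonr(without_protocol, impact_factor)

print(round(p_mw, 4), round(p_t, 4))
```

With a between-group difference this large, both tests return very small p-values on the simulated data; on real data the nonparametric and parametric results can diverge.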

Exploratory analyses
To explore our data, we created 2 × 2 contingency tables to identify correlations between characteristics of the included studies (e.g. the presence of a protocol and the use of a reporting checklist). Correlations were evaluated using the chi-square test in IBM SPSS version 27 (IBM 2020). When the expected count for a cell was less than 5, we used Fisher's exact test.
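The decision rule (chi-square by default, Fisher's exact test when an expected cell count falls below 5) can be sketched as follows; the counts in the table are invented for illustration and the study itself used SPSS:

```python
# Sketch of the exploratory 2x2 analyses with the expected-count rule.
import numpy as np
from scipy import stats

# Rows: protocol referenced yes/no; columns: checklist used yes/no.
# Counts are invented for illustration.
table = np.array([[11, 2],
                  [41, 21]])

# chi2_contingency also returns the table of expected frequencies.
chi2, p, dof, expected = stats.chi2_contingency(table)

# When any expected cell count falls below 5, fall back to Fisher's exact test.
if (expected < 5).any():
    odds_ratio, p = stats.fisher_exact(table)

print(round(p, 3))
```

For these invented counts the smallest expected frequency is below 5 (13 × 23 / 75 ≈ 3.99), so the Fisher branch is taken.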

Deviations from protocol
The protocol for this study is available from the original OSF registration (https://osf.io/mkg9f/) and duplicated as SM00 on the OSF archive for this project (https://osf.io/c8khr/). We did not conduct a search using the Web of Science platform (see SM00 lines 95-96) due to it returning an excessively high number of irrelevant articles that we did not have capacity to screen. In our analysis we pooled rather than kept separate the "testimony" and "direct" question types that we defined in the protocol (see SM00 lines 151-156 and lines 268-274). We decided this was an artificial and potentially confusing distinction that would be better handled through subgroup analysis; however, in the supplemental materials the distinction between the two types is preserved, via colour coding, to facilitate separate analysis if needed, and testimony questions are marked with an asterisk in Table 1.

General characteristics of the included SRs
We included 75 SRs in our study (see SM06 for the full list, https://osf.io/mgxwj/). The 75 SRs were published in 50 different journals, with impact factors ranging from 0.66 to 12.38 (median 3.32). The most prevalent journals were Environmental Research (n = 10), the International Journal of Environmental Research and Public Health (n = 6), and Science of the Total Environment (n = 4). Sixty-seven SRs included exclusively human evidence, 3 included exclusively animal evidence, and 5 included both human and animal evidence. Four SRs included in vitro studies. Sixty-four SRs were published within the target date range, 10 in 2019, and 1 in 2018 (see section 4.7 for a discussion of this discrepancy). One hundred percent of the SRs we screened were available as HTML. Thirty-five environmental exposures and 25 health outcomes were covered. The three most commonly studied exposures were air pollution, pesticides, and occupational risk factors. The three most commonly studied health outcomes were female reproductive outcomes (primarily infant birthweight), neurodevelopment, and cancer. The data tables and analyses for this study are in SM10, SM11, and SM12 of the project OSF archive (https://osf.io/9krhz/).

Table 1 (excerpt). CREST_Basic questions, rationale, answer criteria, and data extracted.

1. Objectives: The Population-Exposure-Comparator-Outcome ("PECO") framework is fundamental to structuring the objectives of a SR of exposure-outcome relationships. Absence of any element of the PECO framework has a number of consequences, including impeding assessment of the comprehensiveness of a SR and the applicability of its findings. The PECO statement should be explicitly stated and not have to be inferred, for example, from the eligibility criteria of the SR.
To answer "yes," all four PECO elements must be explicit and stated as objectives independently of the eligibility criteria. A complete, stand-alone PECO statement is not necessary for a "yes," so long as all PECO elements are explicit in the text.
The investigators extracted the full PECO statement as presented in the manuscript and noted any missing elements.

2. Protocol: Is there reference to a pre-published protocol? Derived from: COSTER recommendation 1.5.3; AMSTAR-2 question 2. A pre-published SR protocol specifies the methods according to which a SR will be conducted. Pre-publication should involve posting a version-controlled, read-only document to a third-party preprint repository. Pre-publication of a protocol allows potential for bias via changes in methods and/or selective reporting based on knowledge of results to be audited. Self-publication of a protocol without third-party version control does not allow the pre-publication status of a protocol to be fully audited (NASEM 2022).
To answer "yes," the SR must provide a unique identifier for a protocol on a third-party website that the authors cannot directly edit (e.g. PROSPERO or a preprint repository).
The investigators extracted the location, registration number and/or DOI of the protocol, and noted if a protocol was referenced but no location was provided.

3. Databases: Were at least two databases or platforms searched? Derived from: COSTER recommendation 2.1; AMSTAR-2 question 4; CREST_Triage question 2. Two databases or research platforms is a minimum for a comprehensive search. This helps ensure that no evidence relevant to the SR is overlooked due to differences in literature indexed by the databases.
To answer "yes," at least two research databases or platforms must have been searched.

7. Duplicate data extraction: Duplicate data extraction reduces the risk of systematic and random error in the data extraction process.
To answer "yes," the SR must state that 100% of data extraction was conducted in duplicate.
The investigators extracted relevant text describing whether duplicate data extraction was conducted.

8. Use of critical appraisal tool: Was study quality assessed using a critical appraisal tool? Derived from: COSTER recommendation 5.1; AMSTAR-2 question 9; CREST_Triage question 4. Critical appraisal helps contextualise the findings of a SR with the limitations of the included studies. Critical appraisal should be conducted using an appraisal instrument or tool, so the same criteria are consistently applied to each category of study included in a SR.
To answer "yes," the SR must appraise 100% of included studies using a critical appraisal tool.
The investigators extracted the names of any tools used for critical appraisal. If appropriate, the investigators classified tools as "home-made" (if created from scratch) or "modified" (if changes were made to a tool beyond those specified in the instructions for using the tool).

General performance of included SRs against CREST_Basic
The included SRs scored a mean of 5.97 points out of 11 on the CREST_Basic questionnaire. Of the 75 included SRs, only one scored the maximum 11 points; three SRs scored 10 points, and six scored 9 points. The median and mode scores were both 6. Eighty-seven percent (65 of 75) of the included SRs omitted three or more components associated with methodological rigour. Figure 2 shows the overall scores and their distribution across the sample. The most commonly omitted SR components were a complete PECO statement (57 of 75 SRs), a link to a pre-published protocol (62 of 75 SRs), and use of a certainty assessment framework (65 of 75 SRs). The three questions against which the included SRs performed best were use of at least two databases for the search (67 of 75 SRs), provision of a complete search string (57 of 75 SRs), and use of a critical appraisal tool to assess individual included studies (57 of 75 SRs). Figure 3 shows the performance of the included SRs against each individual CREST_Basic question.

Objectives
Seventy-six percent (57 of 75) of SRs had incomplete PECO statements. Of those that were incomplete, 100% (57 of 57) omitted a comparator, and 65% (37 of 57) omitted the population of interest. The exposure of interest was omitted once, and the outcome of interest was omitted once. This is shown in Figure 4.

Protocols
Seventeen percent (13 of 75) of SRs provided the location of a pre-published protocol (which we define as a read-only document posted on a third-party-controlled website before data collection). Of these, 100% (13 of 13) were registered in PROSPERO (https://www.crd.york.ac.uk/prospero/); no other protocol or preprint registries were mentioned. Five of the included SRs referred to the existence of a protocol but provided no location for it.

Databases
Eighty-nine percent (67 of 75) of SRs searched at least two literature databases or platforms. A total of 40 different databases or platforms were used. The most commonly used was PubMed/Medline (n = 68), followed by Web of Science (n = 44) and Embase (n = 33). Chinese databases were used 13 times. Registries and specialist grey literature databases were used 15 times. A full list of the databases and platforms, and the number of times each was used by the included SRs, is in the "Data Set and Charts" sheet of SM10 in the project OSF archive (https://osf.io/6emp8/).

Table 1 (excerpt, continued). CREST_Basic questions, rationale, answer criteria, and data extracted.

9. Duplicate critical appraisal: Duplicate critical appraisal reduces the risk of systematic and random error in the critical appraisal process.
To answer "yes," the SR must state that 100% of critical appraisal was conducted in duplicate.
The investigators extracted relevant text describing whether duplicate critical appraisal was conducted.

10. Certainty assessment: Did the authors assess certainty in the evidence using a tool or framework? Derived from: COSTER section 7; AMSTAR-2 questions 12, 13, 14; CREST_Triage question 6. A body of evidence should be assessed for overall properties that affect certainty in the findings of a SR. As for critical appraisal, certainty should be assessed using a tool or framework which specifies a set of formal criteria against which certainty of the evidence is assessed.
To answer "yes," the SR must apply a tool or framework in assessing certainty of the evidence, or a construct similar to certainty such as "strength" of or "confidence" in the evidence.
The investigators extracted relevant text describing the certainty assessment, and the name of the tool or framework used.

11. Reference to reporting standards*: Do the authors claim to have followed or used a reporting standard or checklist? Implicit in COSTER, AMSTAR-2, and CREST_Triage, as comprehensive reporting is needed for full evaluation of SR methods. In theory, the use of a reporting checklist improves the quality of reporting, and potentially the conduct, of a SR.
To answer "yes," the authors must explicitly state that they used or applied a reporting checklist either in conducting or reporting the SR.
The investigators extracted relevant text describing the use of a reporting checklist, including the names of any standards or checklists used.

Search strategies
Seventy-six percent (57 of 75) of SRs provided at least one complete electronic search string. However, 56% of the reported search strings (32 of 57) were flagged by the assessors as potentially problematic, with concerns including lack of use of MeSH terms for PubMed, incomplete coverage of topic concepts, and inappropriate database syntax. Thus, while search strings were generally provided, there are concerns about the comprehensiveness of the searches in at least 67% (50 of 75) of the included SRs (18 without a search string plus 32 with a questionable search strategy).

Eligibility criteria
Seventy-one percent (53 of 75) of SRs reported their eligibility criteria in terms of population, exposure and outcome. Fifteen SRs omitted population, 8 omitted exposure, and 8 omitted outcomes. PECO elements were more widely implemented for describing the eligibility criteria of a SR than for framing its objectives.

Critical appraisal
The most prevalent critical appraisal instrument was the Newcastle-Ottawa Scale (n = 24) (Wells et al. 2009), followed by home-made or modified instruments (n = 8), and either ROBINS-I or ROBINS-E (n = 5) (Sterne et al. 2016). Otherwise, choice of instrument was heterogeneous, with 31 instruments used across the 75 included SRs (see Figure 5). There were occasional examples of clearly inappropriate study appraisal methods. These included use of the reporting checklist STROBE, even though it instructs authors not to use STROBE as an appraisal tool (Vandenbroucke et al. 2007), and use of the GRADE certainty assessment framework, which applies to bodies of evidence, not individual studies (Guyatt et al. 2008).

Use of reporting checklists
Sixty-nine percent (52 of 75) of SRs claimed to use a reporting checklist. Of these, only 60% (31 of 52) actually provided a completed checklist. Five checklists were referenced, the most common being PRISMA (n = 45) (Moher et al. 2009). The NTP OHAT Framework was referenced once, even though it is not a reporting checklist (NTP-OHAT 2019).

Results of hypothesis tests
The tests for distribution of our data gave mixed results regarding normality (Shapiro-Wilk test W = 0.970, p = .05; Anderson-Darling test A² = 0.834, p = .030). Based on histogram, P-P and Q-Q plots (see SM11), we assumed that our overall scores were normally distributed.
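The normality checks described above can be reproduced with standard statistical tools. The sketch below uses illustrative scores, not the study's actual CREST_Basic data, and is only an assumption about how such checks might be run:

```python
# Sketch of Shapiro-Wilk and Anderson-Darling normality checks, using
# illustrative scores rather than the study's actual CREST_Basic data.
from scipy import stats

scores = [4, 5, 5, 6, 6, 6, 7, 7, 8, 8, 9, 3, 5, 7, 6, 8]

# Shapiro-Wilk: null hypothesis is that the data are normally distributed
w_stat, w_p = stats.shapiro(scores)

# Anderson-Darling against a normal distribution; compare the statistic
# with the critical value at the 5% significance level
ad = stats.anderson(scores, dist='norm')
crit_5pct = ad.critical_values[list(ad.significance_level).index(5.0)]

print(f"Shapiro-Wilk W = {w_stat:.3f}, p = {w_p:.3f}")
print(f"Anderson-Darling A2 = {ad.statistic:.3f}, 5% critical value = {crit_5pct:.3f}")
```

Because the two tests answer subtly different questions, it is unsurprising that they can disagree near the significance threshold, which is why visual checks (histogram, P-P, Q-Q plots) remain useful tie-breakers.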

Pre-published protocol and higher number of recommended SR practices
Thirteen SRs provided the location of a pre-published protocol. The mean CREST_Basic score for SRs with a pre-published protocol was 7.769 (SD 1.928) out of 10 against 5.387 (SD 1.166) without (shown in Figure 6). The median score was 8 for SRs with a protocol against 6 without. Significance was confirmed with a two-tailed t-test (p < .0001) and a Mann-Whitney test (p < .0001), although the standard deviations overlap. Therefore, we could reject the null hypothesis that providing a pre-published protocol is not associated with increased methodological rigour of a SR, as represented by a CREST_Basic score.
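The two-group comparison reported here pairs a parametric and a non-parametric test. A minimal sketch, using placeholder scores rather than the study's data, might look like:

```python
# Sketch of the between-group comparison: a two-tailed t-test and a
# Mann-Whitney U test on CREST_Basic-style scores. The values below are
# illustrative placeholders, not the study's actual data.
from scipy import stats

with_protocol = [8, 9, 7, 8, 10, 6, 8, 9, 7, 8, 6, 9, 8]                # n = 13
without_protocol = [5, 6, 4, 5, 6, 5, 7, 4, 6, 5,
                    5, 6, 4, 5, 6, 5, 6, 5, 4, 6]                       # n = 20

t_stat, t_p = stats.ttest_ind(with_protocol, without_protocol)
u_stat, u_p = stats.mannwhitneyu(with_protocol, without_protocol,
                                 alternative='two-sided')

print(f"t-test p = {t_p:.4f}; Mann-Whitney p = {u_p:.4f}")
```

Reporting both tests hedges against the mixed normality evidence noted above: the t-test assumes approximately normal scores, while Mann-Whitney does not.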

Reporting checklist and higher number of recommended SR practices
Fifty-two SRs referred to a reporting checklist. Reviews were split into two groups: those that referred to a reporting checklist (n = 52) and those that did not (n = 23). The mean CREST_Basic score for SRs that referred to a reporting checklist was 5.5 (SD 1.93) out of 10 against 4.78 (SD 2.43) (p = .323) for those that did not (shown in Figure 7). The median was 5 against 5. Overlapping standard deviations and a two-tailed t-test (p = .18) indicate the association is not statistically significant. Therefore, the null hypothesis that reference to a reporting checklist is not associated with increased methodological rigour of a SR, as represented by a CREST_Basic score, could not be rejected.

Higher impact factor of publishing journal and higher number of recommended SR practices
Reviews were ranked by impact factor and divided into three tertiles: low (IF ≤ 2.849, n = 27); intermediate (2.849 < IF ≤ 5.248, n = 21); and high (IF > 5.248, n = 26). One review (ID 443725738) was removed from the analysis as the publishing journal (Current Developments in Nutrition) had no impact factor. The mean CREST_Basic score for the low tertile against the high tertile was 5.44 (SD 1.99) against 6.58 (SD 2.37) (p = .085), shown in Figure 8. The median score was 6 against 6.5. Overlapping standard deviations and a two-tailed t-test (p = .065) indicate the association is not statistically significant. Pearson's correlation coefficient was also non-significant (p = .185). Therefore, the null hypothesis that higher journal impact factor is not associated with increased methodological rigour of a SR, as represented by a CREST_Basic score, could not be rejected.
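The tertile-splitting and correlation analysis described above can be sketched as follows. The impact factors and scores below are randomly generated illustrations, not the study's data, and the cut-offs are computed rather than fixed at the study's 2.849/5.248 values:

```python
# Sketch of splitting reviews into impact-factor tertiles, comparing the
# low and high tertiles, and testing for a score/impact-factor
# correlation. All data here are illustrative, not the study's.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
impact_factor = rng.uniform(1.0, 9.0, size=74)          # 74 reviews with an IF
crest_scores = rng.integers(2, 11, size=74).astype(float)

# Tertile boundaries, analogous to the study's IF cut-offs
low_cut, high_cut = np.quantile(impact_factor, [1 / 3, 2 / 3])
low = crest_scores[impact_factor <= low_cut]
high = crest_scores[impact_factor > high_cut]

t_stat, t_p = stats.ttest_ind(low, high)                # low vs high tertile
r, r_p = stats.pearsonr(impact_factor, crest_scores)    # continuous association

print(f"low vs high tertile t-test p = {t_p:.3f}; Pearson r = {r:.3f} (p = {r_p:.3f})")
```

Running both a tertile comparison and a continuous correlation is a reasonable design choice here, since tertile splits discard information that a correlation retains.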

Exploratory analyses
Exploratory analyses of answers to CREST_Basic questions, conducted using 2 × 2 contingency tables, found a number of recommended practices to be significantly associated. The three practices carrying the most associations were reference to a pre-published protocol (associated with four other practices), use of a critical appraisal instrument (associated with four other practices), and duplicate conduct of critical appraisal (associated with three other practices). All statistically significant associations are shown in Table 2. The contingency tables are in Supplementary Materials 12.

Discussion
Our work raises a number of thematic issues relating to the rigour of systematic reviews in environmental health. We discuss six of these, speculate as to their implications for eventually improving the rigour of published SRs, and describe some of the strengths and limitations of our study.

Figure 6. Comparison of mean CREST_Basic score for SRs that provide the location of a pre-published protocol (n = 13) against those that do not (n = 62) (p < .0001). Error bars show 1 standard deviation.

Potential for unclear objectives and ambiguity about eligible evidence
The value of PECO statements in characterising objectives and supporting structured analysis of evidence in SRs was recently reinforced by the US National Academies of Sciences, Engineering, and Medicine (NASEM) reviews of the US EPA TSCA SR protocols (NASEM 2021) and US EPA IRIS Handbook (NASEM 2022), and the World Health Organisation (WHO) framework for use of SR in chemical risk assessment (WHO 2021). Of the three studies directly comparable to ours, Sheehan and Lam (2015) found that 54% of environmental health SRs included all four elements of a PECO statement. In our sample, the populations, exposures, comparators, and outcomes of interest that form the objectives of a SR were only reported in 24% of the included SRs. In the incomplete PECO statements, comparators were always overlooked, and populations were overlooked 65% of the time. Population, exposure, and outcome characteristics were better reported in eligibility criteria, where populations were reported in 75% of the included SRs and exposures and outcomes were each reported 90% of the time. Given the relative frequency of providing PECO information, we believe that SR authors may often be conflating eligibility criteria with the objectives of a review.
PECO statements are a formalism that facilitates unambiguous characterisation of the knowledge goal of a SR, translating a research question into an exact statement of the variables that are to be investigated (Morgan et al. 2018). Eligibility criteria, while also characterised in terms of PECO elements, are importantly different: rather than characterising the question being asked by a SR, they characterise the evidence that is considered informative for answering it. As such, although eligibility criteria need to be consistent with the PECO, they should not necessarily match it. Such divergence is common in environmental health research: evidence from human studies is often limited, and the eligibility criteria of a SR may therefore need to be extended to include mammalian and other study models as informative surrogates for the specific PECO elements that characterise the SR's objectives (Whaley et al. 2022). Conflating eligibility criteria and research objectives reduces the clarity with which SR objectives are communicated to the reader, and neutralises a potentially valuable interpretative framework that can facilitate the conduct of the SR as a whole.

Potential selectivity in retrieval of relevant evidence
That SR methods may reduce selectivity in use of evidence has been an important driver in their uptake in environmental health research and risk assessment (e.g. Radke et al. 2020;Pega et al. 2021). Agency-led assessments of health risks posed by environmental exposures have often been criticised for being insufficiently comprehensive in their coverage of relevant evidence (e.g. Buonsante et al. 2014;Robinson et al. 2020;NASEM 2021). Sheehan and Lam (2015) found that 48% of SRs "fully described the screening, text review, and selection processes"; Sheehan et al. (2016) found 51% of SRs provided transparent search methods and 63% clear study selection criteria and procedures; Sutton et al. (2021) found 85% of their sample provided a "satisfactory" search strategy and 61% a "satisfactory" selection process.
In our sample, most SRs searched at least two databases (only 8 of 75 did not), and only three SRs explicitly stated that they did not conduct screening in duplicate (although 28 of 75 did not report this clearly, and we did not validate claims of duplicate screening). Fifty-seven of 75 SRs provided their search strategy, enabling validation of their approach.
Superficially, this seems encouraging. However, if we count SRs that either searched only one database or platform, did not present a full electronic search string, were unclear about screening evidence in duplicate, or were flagged for potential issues in their search strategy, then 76% (57 of 75) of the included SRs may be vulnerable to partial inclusion of evidence relevant to their objectives. This is without attempting to validate in detail the search strategies and screening approaches of the remaining 24%. We believe this raises serious questions about whether the majority of published environmental health SRs have sufficiently comprehensive coverage of the evidence they should be including.

Potential for mischaracterisation of validity of studies and overall certainty of evidence
Critical appraisal of the included evidence is considered to be an essential component of SR, because it provides the information that allows the credibility of the findings of a SR to be characterised. It should occur at two levels: the individual study, and the body of evidence on which the findings of the SR are based. The SRs in our sample were seeking to quantify associations between exposures and outcomes. Critical appraisal at the level of individual study would therefore be expected to at least address internal validity, i.e. the potential for the way in which a study is conducted and reported to introduce systematic error into its results (Frampton et al. 2022). Many SR frameworks recommend that appraisal of internal validity be conducted using an instrument that assesses risk of bias (Hoffmann et al. 2017). Appraisal at the level of the body of evidence concerns features of the evidence base as a whole that may affect certainty in the findings of a SR. This would include the results of the internal validity assessment and other features of the evidence base such as unexplained heterogeneity, the precision of the pooled results, and the generalisability of the evidence to the target question, among others (Morgan et al. 2016;Whaley, Aiassa, et al. 2020). Contemporary approaches to SR in agency-led assessments increasingly include some kind of study appraisal and certainty assessment method. The specific methods vary, with subtle but important differences for both processes between, for example, the US EPA TSCA SR protocol (NASEM 2021), US EPA IRIS Handbook (NASEM 2022), US NTP OHAT Handbook (Rooney et al. 2014), and the Navigation Guide (Woodruff and Sutton 2014). For individual study appraisal there is a much greater diversity of instruments, with 31 used in our sample alone.
Agency approaches tend to converge on risk of bias assessment, though implemented in different ways (NASEM 2021, 2022; WHO 2021), and there is some principled opposition to assessment of risk of bias at all (e.g. Steenland et al. 2020). Sheehan and Lam (2015), Sheehan et al. (2016), and Sutton et al. (2021) found that 38%, 28%, and 38% of their samples of SRs, respectively, conducted some form of individual study appraisal. None of these studies assessed the use of a certainty appraisal framework in a way that is comparable to our approach.
In our sample, there was a high prevalence (76%) of the use of a critical appraisal instrument to assess individual included studies. However, we noted that many of the instruments were of unclear validity for use in quantitative SRs of associations. For example, the Newcastle-Ottawa Scale was the most prevalent instrument (used by 24 SRs), but as an approach to assessing internal validity it presents several issues. These include: incomplete coverage of important factors affecting the internal validity of a study (e.g. selective reporting is not assessed); the treatment of each shortcoming as being of equal weight regardless of its true impact on internal validity (all shortcomings are scored equally); and a lack of transparency (users do not report reasons for their judgements) (Frampton et al. 2022). Of the 23 tools cited (8 were home-made), the six that in our opinion show the most validity (the Cochrane Risk of Bias instrument; ROBINS-I and ROBINS-E; SYRCLE; OHAT; and the Navigation Guide) were used in only 11 of the included SRs.
Our survey also shows that, while using instruments for assessing individual studies is at least relatively common, a framework for appraisal of the overall body of evidence is currently employed in only a small minority of environmental health SRs (13%, 10 of 75). It was surprising that the Bradford Hill criteria were only used once, and then in modified form, although it should be noted that the GRADE Framework, and the Navigation Guide and NTP OHAT Handbook that are adapted from it, are directly derived from the Bradford Hill criteria (Schünemann et al. 2011). The reasons for applying a framework to assessing the body of evidence are the same as those for using an instrument for appraising individual studies: it promotes transparency and consistency of judgements, and discourages ad-hoc interpretation of the evidence that may be vulnerable to inconsistency of judgement and confirmation bias (NHMRC 2019). The low prevalence of formal approaches to assessing certainty is therefore an important limitation in current practices in conducting environmental health SRs.
Overall, we believe our survey shows that systematic review teams are challenged in selecting appropriate study appraisal instruments and certainty assessment frameworks. It is not clear to us if this is a cause or an effect of a lack of consensus on best practices. Either way, guidance on how to choose and/or modify study appraisal instruments that are valid and useful for a given SR context would be valuable. Frampton et al. (2022) recently proposed the FEAT mnemonic for this purpose, breaking down the evaluation of study appraisal instruments into four criteria of Focus (target of appraisal), Extent (comprehensiveness of appraisal), Application (generation of valid appraisal results that can be integrated into review findings), and Transparency (clear documentation of judgements made during the appraisal process). Incorporation of this kind of overarching guidance into academic training and the SR planning process may support SR authors in selecting appraisal tools, assist peer-reviewers and editors involved in checking the validity of methods used by SR authors, and also help move towards an improved consensus view of which tools are best used for a given research context. Similar work could be done for certainty assessment. A recent protocol for a WHO SR of the effect of electromagnetic frequency radiation on cancer in animal studies was developed by a mix of authors favouring and sceptical of risk of bias assessment and GRADE-style approaches to certainty assessment (Mevissen et al. 2022). This may provide some indication of where future consensus may reside.

The unclear role of protocol pre-publication in improving the rigour of SRs
Pre-publication of a protocol for a SR has long been a common recommendation in healthcare and social science, with protocol repositories such as PROSPERO established to provide a means for this to happen. Environmental health journals first began to accept SR protocols from around 2016 (Whaley and Halsall 2016). Agency-level assessments such as those by the European Food Safety Authority, US NTP OHAT, US EPA IRIS Program, US EPA TSCA, and WHO now pre-publish protocols for SRs. To facilitate this overall trend, PROSPERO has been extended to environmental health topics and SRs of animal studies, so long as they have a direct health outcome (https://www.crd.york.ac.uk/prospero/#aboutpage). In absolute terms, however, protocol publication is still rare: in our sample, 17% of SRs linked to a pre-published protocol; Sheehan and Lam (2015) found 8% referred to any kind of protocol; Sheehan et al. (2016) did not look at this; and Sutton et al. (2021) rated 22% of their sample as "satisfactory" for use of a protocol.
According to our analysis of 2 × 2 contingency tables, SRs that provided a link to a pre-published protocol performed statistically significantly better than those that did not in four areas: duplicate screening; duplicate appraisal; the use of a critical appraisal tool; and use of a reporting checklist. Our results also showed that only one SR that linked to a protocol failed to provide a complete search string, and only one SR that linked to a protocol failed to provide complete eligibility criteria (see Figure 9); however, this potential association between reference to protocol and complete search strategy or eligibility criteria did not reach statistical significance, potentially due to sample size.
Issues with incomplete PECO statements and a lack of certainty appraisal framework were still prevalent among the protocol-linked SRs. Of the 13 SRs that referenced a protocol and provided a complete search strategy, 6 were flagged for potential issues with the validity of their search strategy. While all the SRs that referenced a protocol used a critical appraisal tool to assess individual studies, 7 of the 13 used a tool of unclear validity for assessing risk of bias in a SR, such as the Newcastle-Ottawa Scale. Therefore, while a link to a protocol is associated with improved methodological rigour, there are still important shortcomings in the absolute rigour of protocol-based SRs, and there are questions about the validity of the methods that are being used.
Because our study is cross-sectional in design, we cannot determine the direction of causation between link to a protocol and CREST_Basic score, i.e. whether registration of a protocol causes authors to conduct more components of a SR, or whether a more comprehensive awareness of the multiple components of a SR leads to recognition of the importance of registering review protocols. While we would speculate it is the latter, we also acknowledge the possibility that the act of deliberate planning implied by protocol registration may increase the number of CREST_Basic domains that are covered by a SR team. This would be especially likely if the protocol registration system prompts for text for each major step of the SR (as PROSPERO does, for example). We note, however, that registration alone does not offer an obvious, active mechanism for improvement, as improvement in planned methods would be due to internal actions on the part of the project team. This is in contrast to journal peer-review, where there is an external process for identifying issues in a manuscript and an incentive for addressing them. We therefore speculate that external peer-review may provide a mechanism by which protocol development could cause improvement in SR rigour, but our sample did not include enough SRs that used this approach to test for an association.
Given the potential for improvement in the methods of planned SRs, and the potential costs (to both researchers and funders) of implementing a general policy of peer-reviewing SR protocols, it may be justified to conduct a randomised controlled trial of the efficacy of peer-review of protocols in improving SR methods. Alternative observational study designs, that may nonetheless provide evidence of sufficient certainty to support a policy of peer-reviewing protocols, could be before-after comparison of the quality of submitted protocols with accepted protocols, or a cross-sectional comparison of submitted or accepted protocols with protocols that have only been registered.

The unclear role of reporting checklists in improving the rigour of SRs
SRs which claimed to have followed reporting checklists were slightly, but not significantly, associated with better CREST_Basic scores. Among the SRs that did refer to a reporting checklist, a majority did not provide a complete PECO statement, a link to a protocol, or use a certainty framework. A large minority did not report duplicate screening, duplicate data extraction, or duplicate critical appraisal (see Figure 10), even though these are key elements of reporting checklists such as PRISMA. In general, while authors were claiming to have used reporting checklists, they were selective in their compliance with them. This is consistent with the findings of Page et al. (2016) in healthcare research. Sheehan and Lam (2015), Sheehan et al. (2016), and Sutton et al. (2021) also indicate that SRs often cite reporting standards while exhibiting poor compliance with them, though none of these studies assessed SRs citing reporting standards as a distinct subgroup. There is evidence from some reviews that SRs published in journals that recommend or require adherence to a reporting checklist are better reported, though the effect size may be small (Agha et al. 2016; Cullis et al. 2017; Nawijn et al. 2019). In other areas, the effectiveness of reporting checklists for improving research has been shown to be limited. For example, in preclinical research, Hair et al. (2019) showed that asking for a completed checklist of the ARRIVE guidelines did not alter the reporting quality of studies. Researchers have highlighted a number of challenges in adhering to reporting checklists, finding that lack of awareness of checklists, checklist complexity, and time or word limitations can reduce compliance (Burford et al. 2013). Overall, it remains unclear how effective reporting checklists are for improving the quality of conduct or reporting of SRs in environmental health. If they are helpful, we speculate that it would be when they are included in a broader strategy for improving publishing standards for SRs, rather than being applied in isolation, through e.g. editorial enforcement of standards and rigorous gatekeeping. This is discussed in more detail in Whaley et al. (2021), in the context of interventions that journal editors might make to improve SR publishing standards.

Figure 9. Question-by-question performance against CREST_Basic of SRs that linked to a pre-published protocol.

Impact factor and publication quality
Higher impact factor of a journal is in some contexts taken as a measure of the higher quality of the research it publishes. While the link between the two is tenuous (for discussion, see e.g. Haustein and Larivière 2015;McKiernan et al. 2019), it could be speculated that higher impact journals may be more likely to impose quality control processes that increase the rigour of the SRs they are publishing. We followed other studies of comparable type to ours in testing for an association between impact factor and some measure of research quality (see e.g. Fleming et al. 2014); in our case, the number of recommended methodological practices being implemented in published SRs. In general, environmental health journals are publishing SRs that omit a number of important methodological components, and higher impact journals seem to perform about as well as lower impact journals in this regard. The use of journal impact factor as an indicator of the rigour of a systematic review would at this time seem unsupported.

Strengths and limitations of our study
A limitation of our study lies in its sample of SRs. We aimed for a comprehensive sample in a time window of several months, counting back from June 2020. However, because of apparent inconsistencies in recording of publication dates in PubMed and RIS exports, and subsequent handling in RAYYAN, we have a non-random sample of SRs spread across a wider time window. This may be due to ambiguity around online vs. print publication, which is why we did not exclude one SR (ID 551204262) that appears to fall outside our date range (this SR was e-published and indexed in PubMed in 2018 but only listed as published in the journal in 2020). Due to our blinding strategy, the issue only became apparent after we had completed the additional data extraction step of the study, and we did not have resources to resample and reconduct data extraction. If time of publication is associated with methodological rigour or affects prevalence of topics covered within a given time period, then random sampling from a specified time period would improve the generalisability of our results, though it would not avoid the issue of some SRs being outside the target time window. Searching PubMed only may also present a generalisability issue, if there are significant differences between SRs that are indexed in PubMed and SRs that are indexed in other databases or literature platforms. Our sample did consist mainly of SRs of human epidemiology studies; this may reflect their predominance in the literature, but we do not know the ratio of epidemiology to non-epidemiology SRs in environmental health. We could arguably have excluded SRs of drug exposure, but only two papers on this topic were included. We did not provide data on SRs that did not declare themselves as such. Since we excluded SRs of direct tobacco exposure, our findings do not generalise to that topic.
Nonetheless, we believe we have presented the largest evaluation of environmental health SRs to date, with more generalisability than the topic-focused studies of Sutton et al. (2021) and Sheehan et al. (2016), and presenting a more contemporary picture of SR practices than either of these two studies or Sheehan and Lam (2015). Our assessment directly covers a set of well-established characteristics that a SR ought to have, updating the Literature Review Appraisal Tool (LRAT) applied by Sutton et al. (2021) based on contemporary guidance for conduct of environmental health SRs that was unavailable to Sheehan and Lam (2015) and the original LRAT authors (Whaley and Halsall 2014).
Our methodology has prioritised replicability over potentially subjective or controversial judgements about the validity of the methods of SRs. While this puts limits on detailed insight into the impact of methodological rigour on the validity of the findings of included SRs (e.g. we can only comment on choice of study appraisal instruments, not the validity of their application), CREST_Basic is quick to apply (10 min per evaluation) and requires only Masters-level expertise and approximately 3 h of training. Because the questions of CREST_Basic have been designed to be unambiguous and have objective answers, they should be applicable by other researchers to generate results directly comparable with our study here (we observed 107 discrepancies, or 13% raw disagreement, between reviewers in our data set; these primarily related to issues such as confusing GRADE with risk of bias assessment and were easily resolved via consensus between evaluators; see SM07, https://osf.io/dz4v8/). This may help with reproducing our findings in a more representative sample of SRs, a particular subset under-represented in our study (e.g. SRs of animal or in vitro studies), assessing changes in SR quality over time, and extending in other ways our initial dataset of 75 SRs. To facilitate this process, we have made our entire project available as a template on the Open Science Framework (https://osf.io/j2d5v/).

Conclusions
Our study shows that a large number of SRs on environmental health topics are being published in spite of important shortcomings in methodological rigour. Reporting checklists as currently applied seem ineffective for addressing this, and while pre-publication of a protocol is associated with improved rigour, it is arguably not effective enough. Higher journal impact factor does not guarantee higher methodological rigour of SRs. Our analysis also indicates the following issues are prevalent in environmental health SRs: PECO statements are under-utilised, with eligibility criteria often being conflated with SR objectives. There is significant potential for selective inclusion of evidence in SRs via limitations in search strategy and literature screening methods. There is frequent use of instruments that are inappropriate for critical appraisal of individual studies included in a SR, although there is an encouraging general recognition of the importance of critical appraisal as a step of the SR process. There is a lack of recognition of the importance of systematic assessment of the certainty of the evidence as part of the findings of a SR. Pre-published protocols are under-utilised, but to be fully effective for improving the rigour of SRs they likely need to be accompanied by additional interventions, such as peer-review of protocols and editorial enforcement of SR standards. The same appears to be true for reporting checklists.
Improving the rigour of SRs in environmental health is likely to be a serious challenge, requiring a coordinated, multi-intervention response. We would be cautious about stepping beyond what our immediate study has shown in recommending a response. However, general education of authors, editors, and peer-reviewers would seem sensible, as manuscripts with important limitations in rigour are routinely making it through journal review processes. Instruments that can support peer-review (such as AMSTAR-2) and editorial gatekeeping (such as CREST_Triage) are available but not widely used in SR manuscript handling workflows. Enforcement seems key: PRISMA reports are regularly used, but it seems they are rarely checked. Peer-review of protocols may be especially effective for increasing methodological rigour of SRs and could be tested. Overall, the range of potential interventions is wide; we have considered these in previous work. If it could be established, a Cochrane-like organisation for environmental health that can research, develop, and set SR standards, while building consensus and educating people, might be able to make an important difference. These are exciting times for SR, but steps need to be taken to ensure their rigour if they are to serve as the gold standard in evidence synthesis.

Author contributions
FS participated in writing the protocol, piloted CREST_Basic, performed data extraction, and reviewed versions of the manuscript. PW supervised the project, wrote the protocol, designed the search strategy, performed the screening, anonymised the reviews, handled the uploading of documents to the Open Science Framework, and reviewed and revised the manuscript. Dr. Katya Tsaioun (KT) and SH were involved in planning the study. SH reviewed the study protocol and KT reviewed the final manuscript and approved it for submission. Lancaster University provided library access and software for PW.
The authors are extremely grateful to five anonymous peer-reviewers for extensive, thoughtful comments that resulted in significant improvements to the framing and contextualisation of the manuscript. The authors would also like to thank Dr. Roger O. McClellan and Dr. Jesse Denson Hesch for their editing of the manuscript.

Declaration of interest
This study is an undertaking of the Evidence-based Toxicology Collaboration (EBTC) at Johns Hopkins Bloomberg School of Public Health, https://www.ebtox.org/. EBTC is a research organisation that works in the toxicological and environmental health sciences on raising research standards, promoting the use of systematic methods in synthesising evidence, improving access to research, and advocating evidence-based decision-making. EBTC receives the majority of its income from private donations to Johns Hopkins University.
This study was self-funded from EBTC's discretionary funds. The study was initiated by staff in strategic support of the objectives of EBTC. It was conducted to gain insight into current methods being used for conducting systematic reviews (SRs) in environmental health, and will inform EBTC's education and advocacy work around raising SR standards. PW (Research Fellow at EBTC) led the study. Radboud University Medical Centre (JM and FS) was engaged by EBTC on a consultancy contract to conduct the research. Dr Katya Tsaioun (Executive Director of EBTC) and Dr Sebastian Hoffmann (Research Fellow at EBTC) were involved in planning the study. SH reviewed the study protocol and KT reviewed and approved the submission of the final manuscript.
JM declares they are an editor of submissions to the animal studies section of the PROSPERO SR protocol registry. FS declares that they have no conflict of interest. JM was a Research Associate and FS a Masters student at Radboudumc at the time of conduct of this study. PW declares they are Systematic Reviews Editor for the journal Environment International, for which they receive a financial honorarium, and is an Honorary Researcher at Lancaster University, which provides PW with software and library access. As a consultant, PW specialises in the development and promotion of SR methods in environmental health, providing research, scientific training, and editorial support services. https://www.whaleyresearch.uk/

Supplemental material
Supplemental material for this article is available online.