Evaluation of Noninvasive Prenatal Testing (NIPT) guidelines using the AGREE II instrument.

Abstract Objective: The rapid increase of cell-free fetal DNA analysis for Down syndrome screening requires evidence-based clinical practice guidelines for noninvasive prenatal testing (NIPT). Several studies show that the quality of many guidelines is low and there are still many health areas where this quality is not systematically evaluated. Given the absence of research in the NIPT field, we used an internationally validated tool to evaluate a set of three NIPT practice guidelines and to identify dimensions that can be improved. Methods: Four appraisers, experts in prenatal screening, evaluated three main NIPT guidelines published in the last 2 years using AGREE II (Appraisal of Guidelines for Research and Evaluation II), a tool specifically designed for guideline quality appraisal. Results: Guidelines scored higher in domains related to scope, purpose, and clarity of presentation, and lower in stakeholder involvement and rigor of development. Evaluation of items within domains showed asymmetries between guidelines. The UK-NSC was the guideline with the best scores. Discussion: Several areas of NIPT guidelines, such as stakeholder involvement, selection of supporting evidence, external reviews, updating processes, and competing interests disclosure, can be improved. Appraisers recommend modifications to all NIPT guidelines that can lead to substantial improvements in their methodological quality and subsequently contribute to improving prenatal screening.


Introduction
The recent use of cell-free fetal DNA (cffDNA) in maternal peripheral blood showed a significantly higher detection rate and lower false positive rate for detection of trisomy 21 in the general population of pregnant women with singleton pregnancies [1,2]. After its identification in 1997 [3], the rapid increase in the use of cffDNA analysis, and the asymmetries in its use, both across and within countries, emphasized the need to develop and implement, at national and international levels, better quality control mechanisms, including evidence-based guidelines for noninvasive prenatal testing (NIPT) [4]. In the past few years, several expert groups have also issued documents addressing specific concerns, although most documents were not systematically developed as clinical practice guidelines [5-8].
Practice guidelines (PGs) are documents built to help professionals in making decisions about diagnosis and care, according to the best scientific evidence available [9]. The use of PGs is now widespread across all health contexts [10], mainly due to the exponential growth in the number of published clinical practice guidelines (CPGs) in recent decades [11,12].
This growth poses a set of challenges [13], and several studies show that the quality of many guidelines is low [12,14,15], the methodology is modest and there are many variations across guidelines dedicated to the same subject [16]. Even in WHO guidelines there is infrequent use of systematic reviews, an absence of systematic guideline development methodology and an overestimation of expert opinion [17,18].
These issues led several groups to develop tools to fulfill an emerging need and to tackle associated challenges [19]. One of these groups developed and validated an instrument named AGREE (Appraisal of Guidelines for Research and Evaluation), to assess the quality of CPGs [20]. The AGREE Instrument evaluates the process of PG development and the quality of reporting [21]. After being applied and studied in a broad range of health areas [14,22-25], it is now in its second version [26]. However, despite the validation, dissemination and use of tools like AGREE, there are still many health areas where the quality of guidelines is not systematically evaluated [12,27].
In the Laboratory Medicine field, CPGs have also become increasingly important to improve the effectiveness of disease diagnosis and monitoring [12,27,28], although only a few guidelines have been evaluated [29-31]. In the NIPT field the picture is even sparser, with only one nonstructured study on NIPT guideline appraisal published [32].
Given the absence of studies in this area and the exponential development of NIPT in recent years, we believe that evaluating NIPT guidelines can be key to improving existing guidelines and to promoting the correct use of prenatal screening.
This study was designed to evaluate three NIPT practice guidelines and to look at dimensions that can be improved, when a validated appraisal tool is used. Additionally, the study can contribute to the design of future guidelines in this field.

Guideline search
The authors searched the main health databases (PubMed, CrossRef, ScienceDirect and Web of Science), using the following terms: (guidelines OR practice guidelines OR clinical practice guidelines OR recommendations OR policy statement OR consensus statement OR position paper) AND (NIPT OR NIPS (noninvasive prenatal screening) OR cffDNA OR cfDNA). We also looked at relevant agencies that produce CPGs in the NIPT field.

Assessment instrument
AGREE II is a tool used to assess the methodological quality of PGs. It has been tested for validity and reliability [41,42] and also provides a strategy for guideline development [18]. Each appraiser answers 23 questions assessing guideline components, using a 7-point Likert scale (one for "strongly disagree" to seven for "strongly agree"), based on the instructions, detailed criteria and examples for each item available in the manual. The tool has six domains: I - Scope and purpose, II - Stakeholder involvement, III - Rigour of development, IV - Clarity of presentation, V - Applicability and VI - Editorial independence. Scores given by each expert are used to calculate aggregated scores per domain, translated into percentages using the formula described below.
The appraisers are additionally asked to provide comments to justify their ratings and also to rate the guideline globally (Overall Assessment), using the same Likert scale. Finally, they are asked to state if they would: (a) recommend the guideline, (b) recommend it with modifications, or (c) not recommend it at all.

Appraisers
The group of four appraisers included experts in prenatal screening (two Clinical Pathologists, MDs and two Biochemists), all with extensive laboratory experience. In order to assure the correct use of AGREE II across appraisers, all experts attended training sessions supported by the Tutorial and Practice Exercises modules, available online (www.agreetrust.org). The selected NIPT practice guidelines were evaluated by all experts participating in the study (MJS, MA, GC, RR), using the instrument and the User's Manual.

Scores and data analysis
Domain scores were determined according to AGREE II instructions. Percentages for each domain were calculated by summing all four appraiser ratings per item, within each domain (obtained score), and converting the final score into a percentage, using the following formula:

\[
\text{Domain score (\%)} = \frac{\text{Obtained score} - \text{Minimum possible score}}{\text{Maximum possible score} - \text{Minimum possible score}} \times 100
\]

where:
Obtained score = sum of all item scores from all four appraisers, by domain
Maximum possible score = 7 (strongly agree) × Y (items within domain) × 4 (appraisers)
Minimum possible score = 1 (strongly disagree) × Y (items within domain) × 4 (appraisers)

Moreover, final scores were extracted per item, enabling a detailed analysis of strengths and weaknesses within domains, and a more comprehensive comparison between guidelines.
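The domain score calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' spreadsheet: the function name `scaled_domain_score` and the example ratings are hypothetical, and the ratings table is assumed to hold one row per appraiser and one column per item.

```python
from typing import Sequence

MIN_RATING = 1  # "strongly disagree"
MAX_RATING = 7  # "strongly agree"


def scaled_domain_score(ratings: Sequence[Sequence[int]]) -> float:
    """Return the scaled score (0-100) for one domain.

    `ratings` holds one row per appraiser and one column per item
    (hypothetical layout chosen for this sketch).
    """
    if not ratings or not ratings[0]:
        raise ValueError("at least one appraiser and one item are required")
    n_appraisers = len(ratings)
    n_items = len(ratings[0])
    if any(len(row) != n_items for row in ratings):
        raise ValueError("every appraiser must rate every item")
    if any(not MIN_RATING <= r <= MAX_RATING for row in ratings for r in row):
        raise ValueError("ratings must be between 1 and 7")

    obtained = sum(sum(row) for row in ratings)
    minimum = MIN_RATING * n_items * n_appraisers
    maximum = MAX_RATING * n_items * n_appraisers
    return (obtained - minimum) / (maximum - minimum) * 100


# Hypothetical ratings (not study data): 4 appraisers x 3 items.
example = [
    [5, 6, 6],
    [6, 6, 7],
    [4, 5, 6],
    [5, 6, 6],
]
print(round(scaled_domain_score(example)))  # 78
```

Passing a single-column table (Y = 1) gives a per-item percentage with the same formula, which is one way the per-item scores could be obtained.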
To define a high-quality guideline, we used a three-step system with thresholds similar to other AGREE studies (low quality, below 50%; medium quality, between 50 and 74%; and high quality, between 75 and 100%). A high-quality guideline is here defined as any guideline where the majority of domain scores are above 74% and the overall score is also 75% or higher (four raters combined).
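The three-step classification and the high-quality rule can be expressed as a short sketch. The function names and the example scores are hypothetical. The sketch also assumes that scores are compared as whole percentages, so a value between 74 and 75 falls in the medium band; the text does not specify how such values were handled.

```python
from typing import Mapping


def quality_tier(score: float) -> str:
    """Map a percentage score to the three-step system."""
    if score < 50:
        return "low"
    if score < 75:  # assumption: anything below 75 is not "high"
        return "medium"
    return "high"


def is_high_quality(domain_scores: Mapping[str, float], overall: float) -> bool:
    """High quality: most domains in the high band and overall score >= 75."""
    n_high = sum(1 for s in domain_scores.values() if quality_tier(s) == "high")
    majority_high = n_high > len(domain_scores) / 2
    return majority_high and overall >= 75


# Hypothetical domain scores (not study data).
scores = {
    "Scope and purpose": 90,
    "Stakeholder involvement": 60,
    "Rigour of development": 80,
    "Clarity of presentation": 95,
    "Applicability": 78,
    "Editorial independence": 70,
}
print(is_high_quality(scores, overall=83))  # True: 4 of 6 domains are high
```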
The reliability between appraisers was calculated through intraclass correlation coefficients (ICC). Tabulation and analyses were performed using Microsoft Excel, version 15 and SPSS Statistics, version 21.
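For readers who want to reproduce an ICC outside SPSS, the sketch below computes it from a ratings table. The text does not state which ICC model was used, so the two-way random-effects, absolute-agreement form (single-measure ICC(2,1) and average-measure ICC(2,k)) is an assumption, as are the function name, the choice of AGREE II items as rows, and the example ratings.

```python
from typing import Sequence, Tuple


def icc_two_way_random(ratings: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (ICC(2,1), ICC(2,k)) for a table of rated items x raters.

    Assumed model: two-way random effects, absolute agreement.
    """
    n = len(ratings)                      # rated units (e.g., AGREE II items)
    if n < 2:
        raise ValueError("at least two rated units are required")
    k = len(ratings[0])                   # raters (appraisers)
    if k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("each unit needs ratings from the same >= 2 raters")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    single = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
    average = (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)
    return single, average


# Hypothetical 1-7 ratings (not study data): 5 items x 4 appraisers.
ratings = [
    [7, 6, 7, 6],
    [5, 5, 6, 4],
    [2, 1, 2, 2],
    [6, 6, 5, 6],
    [3, 2, 3, 3],
]
single, average = icc_two_way_random(ratings)
print(f"ICC(2,1) = {single:.3f}, ICC(2,k) = {average:.3f}")
```

Other ICC forms (one-way, or two-way mixed with consistency) use different denominators and can give noticeably different values for the same table, so the model should be reported alongside the coefficient.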

Results
The overall assessment scores show that the UK-NSC guideline has a considerably higher result (88%) when all four appraisers' assessments are combined (Table 1). The other two guidelines have lower overall scores, ACOG with 50% and ACMG with 58%.
In addition to the domain assessment, the results per item, within each domain, provide a supplementary reading of guideline attributes. These results (Table 2), divided into three categories labeled + (below 50%), ++ (between 50 and 74%), and +++ (above 74%), show that Domain 4 - Clarity of Presentation is the only domain where the three guidelines have all items rated 75% or more. These higher scores are followed by Domain 1 - Scope and Purpose, where only the ACOG target population description is rated below 75%.
In Domain 2 - Stakeholder Involvement, all guidelines have one item rated below 50%. Additionally, the ACOG and ACMG guidelines have no items rated 75% or above.
Regarding Domain 6 - Editorial Independence, all items from the UK-NSC guideline obtained scores of 75% or above. The ACMG guideline scored below 75% in the item referring to the influence of the funding body's views on the guideline content, and above 75% in recording and addressing the competing interests of the guideline development group. Still in Domain 6, the ACOG guideline scored below 50% in both items described.
Particularly complex and pivotal domains, such as Domains 3 and 5, require a more detailed look at item scores. Full item scores for Rigour of Development (Figure 1) show diverse results between guidelines, regarding: (a) the methods used to search for supporting evidence (UK-NSC 100%, ACOG 58% and ACMG 17%), (b) the description of the criteria for selecting the evidence (UK-NSC 96%, ACOG 58% and ACMG 38%), and (c) the methods used to formulate the recommendations (UK-NSC 88%, ACOG 54% and ACMG 42%).
For the description of strengths and limitations of the body of evidence (UK-NSC 92%, ACOG 58% and ACMG 54%), and the link between this evidence and the issued recommendations (UK-NSC 92%, ACOG 50% and ACMG 58%), the UK-NSC guideline clearly has higher scores, with ACOG and ACMG results ranging from 50 to 58% for the same two items.
For the group of items with lower scores in this domain, specifically the consideration of health benefits, side effects and risks when formulating recommendations, the highest score is 58% (UK-NSC), followed by ACMG (50%) and ACOG (42%). The item on external review of the guideline prior to publication also has low scores (UK-NSC 58%, ACOG 8% and ACMG 21%), with two guidelines scoring noticeably below 50%. The last item in this domain, on the existence of a procedure for updating the guideline, has the lowest scores for all guidelines (UK-NSC 46%, ACOG 8%, and ACMG 8%), all below 50%.
Table 2. Results per item within each domain (+ below 50%, ++ between 50 and 74%, +++ above 74%).

Item | UK-NSC | ACOG | ACMG
Domain 1. Scope and purpose
1. The overall objective(s) of the guideline is (are) specifically described. | +++ | +++ | +++
2. The health question(s) covered by the guideline is (are) specifically described. | +++ | +++ | +++
3. The population (patients, public, etc.) to whom the guideline is meant to apply is specifically described. | +++ | ++ | +++
Domain 2. Stakeholder involvement
4. The guideline development group includes individuals from all relevant professional groups. | +++ | ++ | ++
5. The views and preferences of the target population (patients, public, etc.) have been sought. | +++ | + | +
6. The target users of the guideline are clearly defined. | + | ++ | ++
Domain 3. Rigour of development
7. Systematic methods were used to search for evidence. | +++ | ++ | +
8. The criteria for selecting the evidence are clearly described. | +++ | ++ | +
9. The strengths and limitations of the body of evidence are clearly described. | +++ | ++ | ++
10. The methods for formulating the recommendations are clearly described. | +++ | ++ | +
11. The health benefits, side effects, and risks have been considered in formulating the recommendations. | ++ | + | ++
12. There is an explicit link between the recommendations and the supporting evidence. | +++ | ++ | ++
13. The guideline has been externally reviewed by experts prior to its publication. | ++ | + | +
14. A procedure for updating the guideline is provided. | + | + | +
Domain 4. Clarity of presentation
15. The recommendations are specific and unambiguous. | +++ | +++ | +++
16. The different options for management of the condition or health issue are clearly presented. | +++ | +++ | +++
17. Key recommendations are easily identifiable. | +++ | +++ | +++
Domain 5. Applicability
18. The guideline describes facilitators and barriers to its application. | +++ | + | ++
19. The guideline provides advice and/or tools on how the recommendations can be put into practice. | +++ | + | +++
20. The potential resource implications of applying the recommendations have been considered. | +++ | + | +
21. The guideline presents monitoring and/or auditing criteria. | + | + | +
Domain 6. Editorial independence
22. The views of the funding body have not influenced the content of the guideline. | +++ | + | ++
23. Competing interests of guideline development group members have been recorded and addressed. | +++ | + | +++
Overall guideline assessment
1. Rate the overall quality of this guideline. | +++ | ++ | ++

In Domain 5 - Applicability (Figure 2), the UK-NSC guideline clearly has higher scores for most items, except for the one on providing tools or advice on how to put the recommendations into practice (UK-NSC 88%, ACOG 46%, and ACMG 92%). The scores differ more between guidelines for the item on the description of facilitators and barriers to implementation (UK-NSC 100%, ACOG 38%, and ACMG 54%). The item on the resource implications of applying the recommendations has the widest range in this domain: the UK-NSC guideline scored 100%, while the ACOG and ACMG guidelines scored 29% and 25%, respectively.
The lowest scores are found in the item on monitoring and audit criteria, where all guidelines scored below 50% (UK-NSC 46%, ACOG 8%, and ACMG 8%).
Table 3 shows that three out of four experts recommend the UK-NSC guideline without modifications and one recommends it with modifications. Regarding the ACMG guideline, all experts recommend it with modifications. The ACOG guideline is not recommended by one expert, while the remaining three recommend it with modifications.
Inter-rater reliability between all raters was calculated by means of intraclass correlation coefficients (ICC). The results for the three guidelines were: UK-NSC ICC = 0.811, ACOG ICC = 0.757 and ACMG ICC = 0.794.

Discussion
Clinical practice guidelines quality assessment represents one of the most important components of healthcare quality improvement processes. It highlights the need to understand that specific tools must be used to make contextualized assessments, and the need to accept that our expertise is only one of the key prerequisites to promote sustained quality improvement of care.
The domain assessing Scope and Purpose was one of the domains with the highest scores. Whilst we can find high scores for this domain in single guidelines studies [23,43,44] and systematic reviews [12], NIPT guidelines topped most evaluations made with AGREE II, suggesting that the potential health impact on patients and the description of the health questions [45,46] were highly considered.
The overall scores for stakeholder involvement, focused on broad participation and users' views [47], were low. The UK-NSC guideline scored very high in the inclusion of individuals from all relevant professional groups [37,38], in accordance with current consensus [48], but very low in the definition of target users, with only scant mention of this key component [9,49].
As in similar studies [9,18], the views and preferences of the target population remain a challenge and a weakness in many guidelines, but patient dimensions must be factored into decisions, as there is no one better qualified than families and patients to describe difficulties or disabilities. This view [36] accounts for both stakeholder independence and patient participation, as recommended by WHO [18].
In the domain Rigour of Development, dedicated to methodology and use of evidence [47], the biggest differences were found in the methods used to search for evidence, where vulnerabilities can compromise a guideline's main goal of improving the quality of care [48]. The UK-NSC was the CPG that performed best in this item, confirming the NHS's strong methodological tradition in research [50], but the assessment shows further weaknesses related to the criteria for selecting the evidence and the methodology for formulating recommendations, influencing the variability of recommendations [16]. The results follow a similar pattern for the strengths and limitations of the evidence and the link between evidence and recommendations, suggesting a less structured guideline development methodology [48].
All guidelines inadequately considered health benefits, adverse effects and risks, key factors to increase the odds of better implementation [11] and vital in the prenatal screening area, given the variability in professionals' judgement calls [51].
The absence of a thorough external review process, prior to publication, was also an important finding, as it represents a further vulnerability to implementation. Additionally, only one guideline provided information about updating procedures, a shortcoming identified before [17] that hinders guideline revision according to a predefined timeline [11], through a standing panel that reviews new literature and updates the guideline accordingly [47].
The domain dedicated to Clarity of Presentation was the one where all three guidelines scored highest. Though somewhat expected, based on previous results from both national societies [52] and international organizations [18], these results suggest that all NIPT guidelines gave special attention to this domain.
Regarding Applicability, the mean score was one of the lowest overall, suggesting difficulties in advising on how to apply recommendations. Further developments should consider this to improve adoption [53]. ACOG and ACMG guidelines also scored poorly in the description of barriers to implementation and cost implications, similarly to other studies [9]. In contrast, the UK-NSC guideline addresses several barriers (e.g., resources and financial costs), which is necessary to deal with contextual factors that may hinder applicability [18].
Pilot testing is key to ensuring that guidelines can be put into practice, but only the UK-NSC and the ACMG reported results of a pilot test, scoring equivalently to previous evaluations [9]. Assessing applicability improves guideline uptake and sustained use, establishing its quality in real-world settings [14]; however, none of the three guidelines reported outcome measures for implementation or monitoring criteria. Our results show that, as in other comparable studies, all evaluated guidelines can considerably improve their quality regarding applicability [9].
Regarding Editorial Independence, it is worth highlighting that the ACMG guideline achieved a top score by explicitly disclosing competing interests. In 2000, Grilli evaluated 431 practice guidelines and found that 67% did not report the type of professionals involved in the guideline development [54], similar to other studies [55]. Furthermore, the ACMG guideline scored moderately in the item referring to the funding body's influence, and the ACOG guideline scored below 50% in both items. Disclosure is key to assessing the potential influence of conflicts of interest and to the reputation of institutions that want their guidelines to be considered [49].
Limitations of the study included difficulties in defining the concept of guideline and the characteristics of the instrument. AGREE II is an internationally accepted gold standard for guideline appraisal [56], and the most comprehensively validated tool [57,58], but we must assume some degree of subjectivity in the results, namely in the overall assessment and recommendations [59]. Additionally, the instrument does not include information about how to implement new or updated CPGs, a process critical to improving quality of care that would help implementers and clinicians. Another limitation was that all appraisers work in the same laboratory and might be influenced by the organizational culture, despite all measures taken to keep the process blind between appraisers.
This study allowed us to evaluate guidelines in an important and rapidly evolving health area that had not been evaluated in a structured way before. The results showed that several areas of NIPT guidelines can be improved significantly, such as stakeholder involvement, selection of supporting evidence, external reviews and updating processes. Additionally, it is pivotal to improve applicability, which is necessary for implementation and sustainability.
The study also highlighted specific vulnerable areas within domains. This allowed us to conclude that there are key items where scores vary notably between NIPT guidelines and that the UK-NSC is the guideline with the highest quality. Professional associations should adopt systematic procedures for guideline development according to known evidence and with the participation of a broad range of stakeholders.
There is a need for further improvement, not only in traditional core components, already highlighted, but also in dimensions such as editorial independence, including competing interests disclosure.
Practice guidelines aim to provide a valuable aid in making complex clinical decisions and, when rigorously developed, have the potential to enhance those decisions and healthcare quality. Most appraisers recommended the ACOG and ACMG guidelines with modifications, and one did so for the UK-NSC guideline; these modifications can lead to a substantial improvement in methodological quality and subsequently contribute to improving prenatal screening. Actions should be taken to review and improve these important guidelines for NIPT, involving all key stakeholders and using validated quality appraisal tools.