Estimated fetal weight standards of the INTERGROWTH-21st project for the prediction of adverse outcomes: a systematic review with meta-analysis

Abstract
Objective: To systematically review and assess the risk of bias in the literature evaluating the performance of INTERGROWTH-21st estimated fetal weight (EFW) standards to predict maternal, fetal and neonatal adverse outcomes.
Methods: Searches were performed in seven electronic databases (Scopus, Web of Science, Medline, Embase, Lilacs, Scielo and Google Scholar) using citation tools and keywords (intergrowth AND (standard OR reference OR formula OR model OR curve)), covering 2014 to the last search on April 16th, 2021. We included full-text articles investigating the ability of INTERGROWTH-21st EFW standards to predict maternal, fetal or neonatal adverse outcomes in women with a singleton pregnancy who gave birth to infants with no congenital abnormalities. The study was registered on PROSPERO under the number CRD42020115462. Risk of bias was assessed with a customized instrument based on the CHARMS checklist and composed of 9 domains. Meta-analysis was performed using relative risk (RR [95%CI]) and summary ROC curves on outcomes reported by two or more methodologically homogeneous studies.
Results: Sixteen studies evaluating fifteen different outcomes were selected. The risk of bias was high (>50% of studies with high risk) for two domains: blindness of assessment (81.3%) and calibration assessment (93.8%). Across all outcomes investigated, specificity was above 73.0% for 95% of the results, whereas sensitivity was below 64.1% for 95% of the results. Pooled results demonstrated a higher RR of neonatal small for gestational age (6.71 [5.51–8.17]), Apgar <7 at 5 min (2.17 [1.48–3.18]), and neonatal intensive care unit admission (2.22 [1.76–2.79]) for fetuses classified <10th percentile when compared to those classified above this limit. A limitation of the study is that heterogeneity and publication bias could not be explored, since no outcome was evaluated by more than five studies.
Conclusions: The IG-21 EFW standard has low sensitivity and high specificity for adverse events of pregnancy. Classification <10th percentile identifies a high-risk group for developing maternal, fetal and neonatal adverse outcomes, especially neonatal small for gestational age, Apgar <7 at 5 min, and neonatal intensive care unit admission. Future studies should include blind assessment of outcomes, perform calibration analysis with continuous data, and evaluate alternative cutoff points.


Introduction
Estimated fetal weight (EFW) is routinely used during antenatal care to screen for fetuses at risk of adverse outcomes at birth, such as being small (SGA) or large (LGA) for gestational age. The prenatal surveillance of fetal growth may lead to interventions to reduce stillbirth, morbidity, and postnatal mortality [1].
length (FL). Hammami et al. [2] identified 70 different formulas for EFW derived from local studies with small samples in a systematic literature review (SLR) of 45 studies. Formulas incorporating three or more biometric measurements have been shown to be more accurate than those with fewer parameters, with a particularly good performance of the Hadlock [3] formula based on measurements of HC, BPD, AC and FL. However, there is no consensus about which is the most accurate formula for EFW. An international standard to predict EFW may enable a valid comparison between and within populations. In 2017, the INTERGROWTH-21st Project (IG-21) published the first international, multicenter, population-based EFW formula and proposed international standards for fetuses at 22-40 gestational weeks. Standardized data were obtained from eight geographically diverse countries and populations. The IG-21 EFW formula is a function of AC and HC based on 2,404 newborns who underwent the last ultrasound scan within 14 days before birth [4].
Several studies have evaluated the ability of the IG-21 EFW standards to predict the occurrence of adverse outcomes, with conflicting results [5-8]. This SLR aims to synthesize evidence regarding the performance of IG-21 EFW standards to predict maternal, fetal and neonatal adverse outcomes.

Methods
This SLR is part of a larger study registered on PROSPERO under the number CRD42020115462, which aims to establish what is known about the predictive ability of all IG-21 standards, including newborn size, fetal growth, gestational weight gain, symphysis-fundal height, and EFW. The search strategy was developed and performed for the main study. This paper presents the findings for studies validating EFW standards. The study design consisted of an SLR followed by a meta-analysis (MA), both conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [9].
The literature search comprised three steps: (1) forward search using citation analysis tools of the Scopus, Web of Science (WoS), Medline, and Google Scholar databases to identify studies that cited the five articles related to the IG-21 standards [4,10-13]; (2) automatic search in Scopus, WoS, Medline, Embase, Lilacs, and Scielo using free-text terms; (3) backward search, manually checking the reference lists of the eligible studies. The complete search strategy and its results are described in Appendix S1.
Studies were eligible if they investigated the performance of the EFW standards for predicting maternal, fetal and neonatal adverse outcomes in an external dataset (i.e. not IG-21 data) and included women with singleton pregnancies who gave birth to infants with no congenital abnormalities. Inclusion was restricted to original full-text articles published in peer-reviewed journals in English, Spanish, or Portuguese since 2014.
We excluded studies that repeated the original modeling process in the validation data or refitted the models on new data, studies restricted to overweight and/or obese subjects, studies restricted to preterm births and studies that did not evaluate the association with adverse outcomes.

Study selection
The selection process started with title and abstract screening, followed by a full-text reading of potentially eligible publications. Both steps were performed by two independent reviewers. Disagreements were resolved by a third reviewer. When multiple reports of the same study/database were identified, the more detailed report was selected. Study selection was managed using Covidence [14].

Data extraction
Data extraction was performed by one author using a structured questionnaire (Appendix S2). A second author reviewed the data, and any disagreement was resolved by consensus. The following information was extracted: year of publication, country and city where the study was conducted, sample size, study design, inclusion and exclusion criteria, sample characteristics (degree of risk, age, BMI, parity and gestational age at delivery), predictors and cutoff points used in the analysis, outcomes evaluated and their respective incidence, the method used to estimate fetal weight, time points of predictor assessment, the proportion of fetuses classified below the 10th percentile, statistical methods used and presentation of results. Data were synthesized and presented in tables.
Whenever the numerical estimate of interest was presented graphically, without the exact value, it was estimated using the software "GetData Graph Digitizer" (1 study [5]) [15]. Where necessary, the corresponding author was contacted and asked for supplementary data, with a reminder thirty days after the first contact. Two authors were contacted, but neither provided the requested information [16,17].

Assessment of risk of bias
Risk of bias was assessed with a customized instrument based on the CHARMS checklist [18]. The following domains were considered: study design, recruitment method, missing data, outcome definition and measurement, blindness of assessment, similarity with IG-21 methods (same ultrasound methodology [19] to obtain the parameters and IG-21 formula to obtain EFW), calibration assessment, performance of discrimination and report of strengths and weaknesses.
Two authors independently performed the risk of bias assessment. Differences were resolved through consensus. The instrument consisted of nine questions, one for each domain. For each question, answers were classified as low, high, or unclear risk of bias (Appendix S3).

Data synthesis
MA was performed to report the pooled relative risk (RR) and a summary receiver operating characteristic (SROC) curve for the outcomes investigated by two or more studies with similar cutoff points. To avoid methodological heterogeneity, studies using birth weight (BW) as a predictor and/or with the evaluation before 32 weeks of pregnancy were excluded from the MA. To estimate the effect size, we used a random-effects model weighted by the inverse of the variance. The proportion of the observed variance reflecting variance in true effects, rather than sampling error, was evaluated with the I² statistic [20].
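As a rough illustration of the pooling described above, the following Python sketch computes a log relative risk per study from its 2x2 table, combines studies with inverse-variance weights under a random-effects model, and reports I². The DerSimonian-Laird estimator of the between-study variance is an assumption (the text does not name the estimator), the function names `log_rr` and `pool_random_effects` are hypothetical, and the 0.5 continuity correction follows the note in the Figure 2 legend for tables with a zero cell. The counts in the usage lines are made up and do not come from any included study.

```python
import math


def log_rr(tp, fp, fn, tn):
    """Return (log RR, variance) for one study's 2x2 table.

    Screen-positive fetuses (tp + fp) are the exposed group and
    screen-negative fetuses (fn + tn) the comparison group.
    """
    # Assumed continuity correction: add 0.5 to every cell if any cell is zero.
    if 0 in (tp, fp, fn, tn):
        tp, fp, fn, tn = (x + 0.5 for x in (tp, fp, fn, tn))
    risk_pos = tp / (tp + fp)
    risk_neg = fn / (fn + tn)
    variance = 1 / tp - 1 / (tp + fp) + 1 / fn - 1 / (fn + tn)
    return math.log(risk_pos / risk_neg), variance


def pool_random_effects(tables, z=1.96):
    """Pool log RRs with inverse-variance weights (DerSimonian-Laird, assumed)."""
    effects = [log_rr(*t) for t in tables]
    y = [e[0] for e in effects]
    v = [e[1] for e in effects]

    # Fixed-effect weights and Cochran's Q.
    w = [1 / vi for vi in v]
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    df = len(y) - 1

    # Between-study variance (tau^2), truncated at zero.
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # I^2: percentage of observed variance attributed to true heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    # Random-effects weights add tau^2 to each within-study variance.
    w_re = [1 / (vi + tau2) for vi in v]
    pooled = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    return {
        "rr": math.exp(pooled),
        "ci_low": math.exp(pooled - z * se),
        "ci_high": math.exp(pooled + z * se),
        "tau2": tau2,
        "i2": i2,
    }


# Made-up counts (tp, fp, fn, tn) for three hypothetical studies.
example = [(30, 70, 20, 880), (12, 48, 15, 425), (8, 52, 9, 431)]
print(pool_random_effects(example))
```

Pooling on the log scale and exponentiating at the end keeps the confidence interval asymmetric around the RR, which matches the form of the intervals reported in this review.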

Results

Study selection
Excluding duplicates, 1,621 studies were identified from the initial search, and 16 were included after reading the full text. The study selection process and the reasons for exclusions are outlined in Figure S1. The backward search did not retrieve any additional studies. The list of excluded papers after full-text reading is presented in Table S1. In addition to the predetermined exclusion criteria, we further excluded one study in which the sample was exclusively composed of gestations with ultrasonography (USG) evidence of AC < 3% or EFW < 10% [23].
The most frequent cutoff of the IG-21 EFW standards used to predict adverse outcomes was the 10th percentile (n = 11), followed by the 5th (n = 5) and 90th percentiles (n = 5). Only Hua et al. [5] and Sovio et al. [34] used repeated measures. To maximize predictive power, subsequent analyses focused on the last USG for these studies (Table 1).
Most studies (9/12 with available data) described a low proportion of fetuses classified below the 10th percentile, ranging between 3.2 and 14.6%. All studies with proportions <6% of fetuses below the 10th percentile (n = 5) used BW or unclear methodologies to estimate the predictor (Table 1).
In general, IG-21 presented low sensitivity and high specificity for the prediction of outcomes. Considering all studies, outcomes and cutoffs, 95% of the investigations presented a specificity above 73.0%, while for 95% of the investigations, the sensitivity was below 64.1% (Tables S2 and S3).
The positive likelihood ratio (+LR) was usually greater than 1 for all outcomes and cutoffs, while the negative likelihood ratio (-LR) was lower than or equal to 1. The positive predictive value (PPV) was generally lower than 50%, while the negative predictive value (NPV) was higher than 90%. Higher PPV values were observed for neonatal SGA, although they were still lower than 86.9% (Tables S2 and S3).
The area under the curve (AUC) varied between 0.53 and 0.62 for the composite outcome. The exception was the study of Zhu et al. (2019), which presented an AUC of 0.90. Mixed results were expected for the composite outcome because of its heterogeneous definition. The highest AUC estimates were found for the prediction of neonatal SGA (between 0.83 and 0.90) (Table S5).
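For readers less familiar with these measures, the sketch below shows how sensitivity, specificity, PPV, NPV, +LR and -LR are derived from a single 2x2 table of true positives (tp), false positives (fp), false negatives (fn) and true negatives (tn). The function name `diagnostic_metrics` is hypothetical and the counts are invented for illustration; they are not taken from any included study.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Screening accuracy measures from one 2x2 table of counts."""
    sens = tp / (tp + fn)  # share of outcome cases flagged by the cutoff
    spec = tn / (tn + fp)  # share of non-cases correctly not flagged
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": tp / (tp + fp),  # probability of the outcome given a positive screen
        "npv": tn / (tn + fn),  # probability of no outcome given a negative screen
        "lr_pos": sens / (1 - spec) if spec < 1 else float("inf"),
        "lr_neg": (1 - sens) / spec if spec > 0 else float("inf"),
    }


# Invented counts: 50 outcome cases among 1,000 pregnancies.
metrics = diagnostic_metrics(tp=30, fp=70, fn=20, tn=880)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this invented table the specificity is high and the +LR is well above 1, yet the PPV stays at 0.30 because the outcome is uncommon, which is the same pattern reported across the included studies.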
A detailed description of the eligibility criteria and population characteristics for each study is available in Table S6.
Pooled results for the cutoff <p10 demonstrated a higher RR of neonatal SGA (6.71 [95%CI: 5.51-8.17]), Apgar <7 at 5 min (2.17 [95%CI: 1.48-3.18]) and NICU admission (2.22 [95%CI: 1.76-2.79]) for fetuses classified below the 10th percentile by the IG-21 EFW standard when compared to those classified above this limit. The RRs of neonatal hypoglycemia and cord blood pH <7.1 were not statistically significant. Pooled results for the cutoff >p90 demonstrated a higher RR of neonatal LGA (6.15 [95%CI: 3.72-10.14]) for fetuses classified above the 90th percentile by the IG-21 EFW standard (Figure 2). The outcomes of neonatal hypoglycemia and LGA presented substantial heterogeneity.
The SROC curves showed great variability in sensitivity. Specificity was greater than 0.9 for all outcomes analyzed. The diagnostic accuracy of IG-21 EFW varied according to the outcome studied, in the following descending order: neonatal SGA, neonatal LGA, Apgar <7 at 5 min, neonatal hypoglycemia, NICU admission and cord blood pH <7.1 (Figure 3).
Further investigation of the heterogeneity was not possible due to the small number of studies. For the same reason, sensitivity analysis and an evaluation of publication bias were not performed. According to the Cochrane Handbook, the minimum number of studies needed to apply tests for funnel plot asymmetry is 10, because with fewer studies the power of the tests to distinguish chance from real asymmetry is too low [20].

Discussion
This study synthesizes the evidence regarding the ability of the IG-21 EFW standard to predict adverse outcomes. In summary, we observed that the diagnostic accuracy of IG-21 is limited by its low sensitivity, -LR, and PPV. However, if used as a screening tool, it has good performance, with high specificity. The classification below the 10th percentile was associated with a higher risk for adverse outcomes, with significantly higher risks of neonatal SGA, Apgar <7 at 5 min and NICU admission.
Our results show high specificity and low sensitivity for the most commonly used cutoff points (10th and 90th percentiles). This means that the risk of developing adverse outcomes is increased among those classified at the highest/lowest percentiles and the chance of false-positive screening is low. On the other hand, low sensitivity indicates a high rate of false-negative screening, which means that many individuals who are classified as having adequate EFW may still develop adverse outcomes.
The advantages of a screening tool with high specificity are sparing families the emotional impact of a false-positive result and decreasing rates of unnecessary procedures, interventions, and iatrogenic preterm births, additionally saving valuable resources and, potentially, lives. However, the lack of sensitivity translates into many missed cases of adverse outcomes that do not receive timely interventions [36].
The use of alternative cutoff points can improve discrimination power. More restrictive cutoffs increase the rates of false-negative results and must be avoided. More inclusive cutoffs are more sensitive, which could improve IG-21 EFW standard performance. Only three studies reported the best cutoff according to receiver operating characteristic (ROC) curve analysis. Blue et al. [6], Zhu et al. [31] and Kato et al. [30] identified the 22, 11.61 and 40.9 percentiles, respectively, as those optimizing the balance of true-positive, false-positive and false-negative results. The best cutoff point may vary according to the outcome of interest and its prevalence in each geographic region [37].
PPV and NPV must be interpreted with caution since they depend on the outcome incidence and are not fixed characteristics of the test [38]. PPV is directly influenced by the prevalence, while NPV is inversely affected by it. Thus, the low prevalence of the outcomes in this SLR can partially explain the high NPV and low PPV observed.
Some included papers have unique characteristics that need to be highlighted. Nahirney et al. [24], Choi et al. [33], Hiersch et al. [28] and Melamed et al. [29] did not estimate the fetal weight but used the BW to categorize infants with SGA or LGA according to the IG-21 EFW standard [4]. Lorusso et al. [17], Vikraman & Elayedatt [32], and Kato et al. [30] estimated the fetal weight using Hadlock formulas. IG-21 EFW charts were not originally developed to classify BW or EFW derived by other formulas, and these strategies can bias the results in unexpected ways.
We did not expect to find studies using BW in IG-21 EFW centiles in our literature search; therefore, it was not anticipated in our eligibility criteria. However, we decided to keep these studies in the SLR to emphasize the recurrent presence of this approach in the literature and discourage further authors from following this methodology. The use of BW rather than EFW is expected to yield overly optimistic assessments of predictive performance. These studies were not included in our MA, and the exclusion of their results from the SLR does not change our main conclusions.
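The dependence of PPV and NPV on prevalence described at the start of this passage can be made concrete with Bayes' theorem. The sketch below holds sensitivity and specificity fixed and varies only the prevalence; the function name `predictive_values` and the values 0.60 and 0.90 are illustrative assumptions, not estimates from the included studies.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given outcome prevalence."""
    tp = sensitivity * prevalence              # true-positive fraction
    fp = (1 - specificity) * (1 - prevalence)  # false-positive fraction
    fn = (1 - sensitivity) * prevalence        # false-negative fraction
    tn = specificity * (1 - prevalence)        # true-negative fraction
    return tp / (tp + fp), tn / (tn + fn)


# Same hypothetical test, three hypothetical prevalences.
for prevalence in (0.02, 0.10, 0.30):
    ppv, npv = predictive_values(0.60, 0.90, prevalence)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

With the test characteristics unchanged, PPV rises and NPV falls as the outcome becomes more common, which is why predictive values from one population should not be carried over to another with a different outcome incidence.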
The risk of bias assessment indicated that important strategies to avoid bias and ensure transparency were not implemented or reported in the included studies. The lack of blindness in the outcome assessment may overestimate the method's predictive ability, especially when the outcome requires subjective interpretation (e.g. Apgar score) [18,39]. In turn, the methods used to measure the predictor may influence the results in several ways. Studies do not clearly state the ultrasonographic procedures and measures of the parameters that compose the EFW formula, which is essential for readers to contextualize the results [18]. Finally, calibration assessment is essential when predictions are used for clinical decisions [40]. Especially for those studies investigating the accuracy of the IG-21 formula to predict BW, the calibration plot would provide a better idea of how it performs in each population and whether it produces valid measures for the outcome of interest.
It is expected that any growth standard would find an association with SGA-related outcomes when comparing below and above a particular cutoff. This aspect may be enriched by comparing the results of various standards. All papers included in this SLR compared the performance of the IG-21 standard with other methods. However, our literature review was not designed to accomplish this objective. Thus, it must be explored in future investigations.
This is the first study to synthesize data on the ability of IG-21 EFW standards to predict adverse maternal, fetal and neonatal outcomes worldwide, considering different cutoff points. The search process was extremely sensitive, with three steps in multiple databases. Moreover, we extracted and systematized quantitative data to perform an MA for some of the outcomes. As limitations, we could not investigate the source of the heterogeneity in the MA results or the possibility of publication bias, since no outcome was evaluated by more than five studies.
EFW formulas and curves are considered important tools for predicting adverse pregnancy outcomes worldwide, especially SGA and LGA. The idea proposed by the IG-21 consortium of a population-based international standard derived prospectively, using a prescriptive approach, has a biological basis and makes sense in today's multicultural societies [4]. The findings of the present SLR provide a better understanding of this standard's discrimination characteristics for its use in clinical practice.
Recommendations for future studies include using prospective data, standardized methods, and blinded assessment of outcomes. When investigating the predictive value of any tool, researchers should assess and report its performance in terms of both calibration and discrimination. Studies aiming to confirm the usefulness of the IG-21 standards for predicting adverse outcomes can be improved by the evaluation of predefined and biologically plausible alternative cutoff points by ROC curve analysis and the use of the IG-21 formula to obtain EFW instead of other formulas or BW. On the other hand, dichotomizing continuous predictors always means losing information or, even worse, might entail biased findings based on the data-driven selection of a "best cutoff point" [41]. Thus, authors should consider analyzing EFW as a continuous variable, mainly when it is included in multivariable regression models, as suggested by Stirnemann et al. [4].

Conclusion
Being classified as having adequate EFW by IG-21 is insufficient to rule out the need for closer follow-up, due to the high false-negative rates. However, the evidence indicates that patients classified in extreme percentiles should be monitored more closely since this is a highly specific tool, meaning these individuals have a significantly increased risk of developing maternal, fetal and neonatal adverse outcomes, especially neonatal SGA, Apgar <7 at 5 min, and NICU admission. More standardized and methodologically robust research would benefit this area, as evidenced by the high or unclear risk of bias observed in some of the individual studies included in our review.

Figure 1. Quality assessment evaluation. (A) Risk of bias according to selected domains for each study; (B) Proportion of studies with low, high, and unclear risk of bias for each selected domain.

Figure 2. Estimated effects and pooled results for outcomes evaluated by two or more studies with similar cutoffs. Only studies in which fetal weight was estimated by formulas based on USG measurements after 32 weeks of pregnancy were included. tp: true positives; fp: false positives; fn: false negatives; tn: true negatives; RR: relative risk; SGA: small for gestational age; NICU: neonatal intensive care unit; LGA: large for gestational age. For studies with a zero cell in the contingency table, the Stata command [metan] automatically adds 0.5 to all cells.

Figure 3. SROC curve for outcomes evaluated by two or more studies with similar cutoffs. Only studies in which fetal weight was estimated by formulas based on USG measurements after 32 weeks of pregnancy were included. SGA: small for gestational age; NICU: neonatal intensive care unit; LGA: large for gestational age.

Table 1. Summary of study characteristics.
AUC: Area under the curve; BW: Birth weight; EFW: Estimated fetal weight; FPR: False positive rate; HR: Hazard ratio; LGA: Large for gestational age; NA: Not available; OR: Odds ratio; RR: Risk ratio; p10: 10th percentile; Sens: Sensitivity; SGA: Small for gestational age; Spec: Specificity; USA: United States of America; UK: United Kingdom; % <p10: proportion of fetuses classified below the 10th percentile using the IG-21 standard.
a Used the IG-21 EFW chart to define percentiles of BW measures.
b e The methods describe relative risk, but the figures present odds ratios.