Wide range of applications for machine-learning prediction models in orthopedic surgical outcome: a systematic review

Background and purpose — Advancements in software and hardware have enabled the rise of clinical prediction models based on machine learning (ML) in orthopedic surgery. Given their growing popularity and their likely implementation in clinical practice, we evaluated which outcomes these new models have focused on and what methodologies are being employed. Material and methods — We performed a systematic search in PubMed, Embase, and Cochrane Library for studies published up to June 18, 2020. Studies reporting on non-ML prediction models or non-orthopedic outcomes were excluded. After screening 7,138 studies, 59 studies reporting on 77 prediction models were included. We extracted data regarding outcome, study design, and reported performance metrics. Results — Of the 77 identified ML prediction models the most commonly reported outcome domain was medical management (17/77). Spinal surgery was the most commonly involved orthopedic subspecialty (28/77). The most frequently employed algorithm was neural networks (42/77). Median size of datasets was 5,507 (IQR 635–26,364). The median area under the curve (AUC) was 0.80 (IQR 0.73–0.86). Calibration was reported for 26 of the models and 14 provided decision-curve analysis. Interpretation — ML prediction models have been developed for a wide variety of topics in orthopedics. Topics regarding medical management were the most commonly studied. Heterogeneity between studies is based on study size, algorithm, and time-point of outcome. Calibration and decision-curve analysis were generally poorly reported.

Surgical decision-making in orthopedic surgery involves weighing the benefits of an intervention against its inherent risks. Prognostic scoring tools have been devised to individualize risk prediction and thus improve surgical decision-making (Janssen et al. 2015, Pereira et al. 2016, Shah et al. 2018). Although clinical prediction models are not new, recent advancements in artificial intelligence have created a host of prediction models based on machine learning (ML) (Cabitza et al. 2018).
ML is a branch of artificial intelligence that enables computer algorithms to learn from large datasets without being explicitly programmed. Figure 1 shows 3 commonly employed algorithms. Existing reviews of machine learning studies have provided a broad overview of applications ranging from vision to natural language processing and predictive analytics (Cabitza et al. 2018). To our knowledge, no study has critically assessed the body of studies focused on ML prediction models for surgical outcome in orthopedics. These types of prediction models are most likely the first branch of artificial intelligence to be employed in clinical practice (Staartjes et al. 2020). Therefore, familiarizing practicing orthopedic surgeons with ML's concepts and the topics these new methods have focused on can optimize their implementation in the clinic.
As such, the purpose of this systematic review is to (1) evaluate which surgical outcomes orthopedic clinical prediction models have focused on, and (2) determine which techniques current prediction models use for development and validation.

Material and methods
Systematic literature search
Adhering to the 2009 PRISMA guidelines, a systematic search was performed in PubMed, Embase, and the Cochrane Library for articles published up to June 18, 2020. 2 different domains of medical subject headings (MeSH) terms and keywords were combined with "AND" and within the 2 domains the terms were combined with "OR." The 1st domain included words related to ML and the 2nd domain related to possible orthopedic specialties (Appendix 1, see Supplementary data). Terms were restricted to MeSH, title, abstract, and keywords. Two reviewers (PTO, OQG) independently screened all titles and abstracts for eligible articles based on predefined criteria. Eligible full-text articles were evaluated and cross-referenced for potentially relevant articles not identified by the initial search (Figure 2). Discrepancies between the 2 reviewers were adjudicated by the senior author (JHS).

Eligibility criteria
Studies reporting on ML-based prediction models addressing orthopedic surgical outcomes were included, as were all intraoperative and postoperative outcomes. The surgical orthopedic population was defined as disorders of the bones, joints, ligaments, tendons, or muscles treated by any type of operation. Excluded were (1) studies that did not include at least 1 ML-based prediction model for surgical outcome (e.g., studies with only logistic regression-based models), (2) non-English studies, (3) studies without available full text, and (4) non-relevant study types such as animal studies, letters to the editor, and case reports.

Assessment of methodological quality
Quality assessment was performed based on a modified nine-item Methodological Index for Non-Randomized Studies (MINORS) checklist (Slim et al. 2003). We made it applicable to our systematic review by including disclosure, study aim, input feature, output feature, validation method, dataset distribution, performance metric, and explanation of the used AI model (Langerhuizen et al. 2019). These 9 items were scored on a binary scale: 0 (not reported or unclear) and 1 (reported and adequate). Table 1 lists the data we extracted from each study. For this review, 6 main orthopedic surgical outcome domains were identified, consisting of (1) intraoperative complications (e.g., blood transfusion, prolonged operative time), (2) postoperative complications (e.g., venous thromboembolism), (3) survival, (4) patient-reported outcome measures (PROMs), (5) medical management (e.g., hospitalization), and (6) other. For studies reporting the performance of multiple ML models, the best performing ML model was used. 13 studies provided multiple models for multiple surgical outcomes; these were extracted separately, resulting in more ML models than studies. Only the 2 performance measures AUC and accuracy were extracted, as they were the most commonly reported results.
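The binary appraisal described above can be sketched as a short Python helper. The item names follow the 8 items listed in the text (the modified checklist comprises 9 items in total); the example scores are invented for illustration:

```python
def minors_score(item_scores: dict) -> int:
    """Sum binary item scores: 0 = not reported or unclear, 1 = reported and adequate."""
    assert all(v in (0, 1) for v in item_scores.values())
    return sum(item_scores.values())

# Hypothetical appraisal of a single study; item names follow the text above.
example = {
    "disclosure": 1,
    "study aim": 1,
    "input feature": 1,
    "output feature": 1,
    "validation method": 0,   # e.g., internal validation not described
    "dataset distribution": 1,
    "performance metric": 1,
    "explanation of AI model": 1,
}
print(minors_score(example))  # 7 of the 8 items shown here
```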

Study characteristics
After screening titles and abstracts, 758 full-text articles were assessed for eligibility and ultimately 59 articles were included, reporting on 77 ML prediction models (Table 1). Median sample size was 5,818. Using the MINORS criteria, all 59 articles were found to be of similar quality. All included a minimum of 8 out of 9 appraisal items (Appendix 2, see Supplementary data).

Statistics
AUC scores and accuracies in tables are expressed as they were originally reported. For studies that reported multiple results within a single outcome domain (e.g., multiple different postoperative PROMs, each with an independent AUC), averages were taken. The sizes of the training, validation, and test sets are reported as percentages of the total dataset. No meta-analysis was performed because of obvious heterogeneity between studies and in orthopedic applications. However, to summarize the findings in some quantitative form, the median AUC and accuracy of the prediction performance were calculated for all studies. We used Microsoft Excel (Version 16.31; Microsoft Inc, Redmond, WA, USA) for standardized forms for data extraction and quality assessment, and Mendeley as reference management software.

Figure 1. Three commonly employed ML algorithms. (A) Decision trees: each node performs a test on the input value with the subsequent branches representing the outcomes. Their graphical representation as seen here makes them easy to understand and interpret. However, they are prone to overfitting. (B) Neural networks are based on interconnected nodes. The input features are represented by the first (blue) layer. The designated outcome is represented by the final (green) layer. The middle, hidden layers (blue and orange) base their output on the input they get from prior layers. Neural networks have been around for a long time and offer good discriminative abilities, but interpretation of the relationships between the different layers remains difficult. (C) Support vector machines (SVMs) perform classification by determining the optimal separating hyperplane between datapoints, which maximizes the distance between the 2 closest points of either group. They can be used for both linear and nonlinear relationships. While they remain effective in data with a great number of features, they do not work well in larger datasets.
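The quantitative summary described under Statistics — averaging multiple AUCs reported within a single outcome domain, then taking the median and IQR across models — can be sketched with the Python standard library; the AUC values below are invented for illustration:

```python
from statistics import mean, median, quantiles

# Hypothetical AUCs: each inner list holds the AUCs one model reported
# within a single outcome domain (e.g., several postoperative PROMs).
reported_aucs = [[0.78, 0.82], [0.91], [0.74, 0.70, 0.72], [0.86]]
per_model = [mean(aucs) for aucs in reported_aucs]  # average within domain first

med = median(per_model)
q1, _, q3 = quantiles(per_model, n=4, method="inclusive")
print(f"median AUC {med:.2f} (IQR {q1:.2f}-{q3:.2f})")  # median AUC 0.83 (IQR 0.78-0.87)
```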
Ethics, funding, and potential conflicts of interest
Institutional review board approval was not required for this systematic review. No external funding was received. The authors have no conflicts of interest to declare.

Table 2 lists the characteristics of all included studies. More than half of the 77 models were developed with data from national databases or registries (42) (Table 3). The median number of predictor variables used in the ML models was 10 (IQR 8-15). Models using national data did not include more variables: 10 (IQR 8-13). 68 of the models had a binary distribution of the outcome variable. The most frequently employed algorithms were neural networks (42) and random forests (30). 36 of the neural networks were single-layer, 5 deep learning, and 1 convolutional. The median number of patients used was 5,507 (IQR 635-26,364). Median AUC was 0.80 (IQR 0.73-0.86) and median accuracy was 79% (IQR 75-88). Calibration was reported for 26 of the models and 23 provided Brier scores. Decision-curve analysis was employed in 14 studies. 18 provided a digital application for their prediction model.

Outcome
The most commonly reported outcome domains were medical management (17) and survival (16). Medical management mostly focused on discharge destination (7) and hospitalization (4). The studies on survival all addressed patient survival. 6 survival studies were in orthopedic oncology and 5 in orthopedic trauma. Both medical management and survival had a higher median AUC (0.82 and 0.84, respectively) than the overall median AUC. Spinal surgery was the most commonly involved subspecialty (28).

Table 1. Data extracted from each study
1. Year of publication
2. First author
3. Disease condition
4. Type of surgery
5. Input feature
6. Number of features in final model
7. Type of outcome
8. Time points of outcome
9. Number of output classes
10. ML algorithm used
11. Number of patients
12. Distribution between training, validation, and test set
13. Validation method
14. AUC and accuracy of model
15. Reporting of calibration and Brier score
16. Decision-curve analysis
17. Digital application of the model

Discussion
Recent years have seen an increasing interest in artificial intelligence and ML in orthopedics (Bini 2018, Jayakumar et al. 2019). With this systematic review we aimed to provide an introduction to the main concepts of developing ML models for orthopedic surgeons and analyze the current application and design of these models in orthopedic surgery. We found a wide range of potential applications, ranging from predicting survival in spinal metastases to clinical outcome after shoulder arthroplasty and hospitalization after hip fracture surgery.

This systematic review has a number of limitations. 1st, due to the relative novelty of this field of research in orthopedic surgery, the variety in study designs renders comparisons and comprehensive quantitative analysis difficult. We therefore opted to perform a qualitative analysis of the current publications. Hopefully, the increasing familiarity with these types of studies will lead to better reporting and open up the possibility to perform quantitative analyses. 2nd, this review is likely influenced by publication bias. ML prediction models with good performance are more likely to be published than models with mediocre or poor performance. This positive publication bias has been shown both in medicine and in the computational sciences (Boulesteix et al. 2015). The performance measures presented here are therefore likely more favorable than those of all developed models. 3rd, despite our efforts to perform a search across multiple online libraries, we may have missed a number of studies reporting ML prediction models. Whilst unfortunate, we do not think these omissions will significantly alter our findings on research topics or the most utilized methodology, as this review included nearly 60 studies.
This systematic review shows that ML models have been developed for a wide variety of topics across all subspecialties within orthopedics. Perhaps surprisingly, medical management was the most studied domain, with the majority of models focusing on readmissions and discharge placement. Both readmissions and discharge delays impose a heavy burden on healthcare costs (Wan et al. 2016). Healthcare expenditure has risen steadily throughout the developed world in recent decades (OECD 2019). While there is enormous variation in healthcare systems, government institutions in virtually all countries have looked at improving medical management to help curb costs (Schwierz 2016). Papanicolas et al. (2018) found that activities relating to planning, regulating, and managing health services were a major factor in the difference in healthcare expenditure between the United States and 10 other high-income countries. Shrank et al. (2019) concluded that failure of care coordination, leading among other things to unnecessary readmissions, amounts to $78 billion of waste in the United States. To address this problem the Centers for Medicare and Medicaid Services started the Hospital Readmissions Reduction Program in 2012, incentivizing hospitals to lower readmission rates. Knowing in advance which patients are at risk of being readmitted within 30 days after discharge is crucial, which is a possible explanation as to why so many prediction models focus on this topic. Similarly, knowing in advance where patients are likely to be discharged to makes preventing delayed discharge a lot easier than the other interventions tried over the years (Bryan 2010, Ou et al. 2011). Furthermore, the databases available in the studies on medical management appear to be larger, enabling researchers to include more variables and create better performing prediction models. These models are more likely to be published, as evidenced by the higher AUC for medical management compared with the overall AUC.
Survival was the other commonly studied outcome domain. Accurately estimating remaining life-expectancy is an important feature in medical decision-making in orthopedic oncology (Pereira et al. 2016). In a patient group with only limited life-span remaining, the aim of treatment is to preserve quality of life. Accurate survival estimations can guide decisionmaking on whether or not to perform surgery and if so, which operative treatment should be opted for (Quinn et al. 2014).
With an ageing population and cancer patients surviving longer, the incidence of bone metastases will continue to rise and prediction models will likely play an increasing role in this field (Quinn et al. 2014).
The AAOS Census 2018 showed that the spine was the primary specialty area of only 8.3% of orthopedic surgeons, while one-third of the prediction models were linked to spinal surgery (AAOS Department of Clinical Quality and Value 2019). Cost reduction may also be the driving factor in the overrepresentation of spinal surgery prediction models; the economic cost of spinal surgery is large and growing, with spinal fusions alone costing $30 billion annually in the United States (Johnson and Seifi 2018). Prediction models could play a role in curbing costs by improving patient selection and surgical decision-making, although this could be said for all other subspecialties. Another possible explanation for the disproportionate number is the overlap with neurosurgery. The neurosurgical field was relatively quicker to use ML to develop prediction models and had developed several models in spinal surgery earlier on (Senders et al. 2018). Finally, the field of prediction models is expanding but still small. A significant proportion of the prediction models are developed by a few research groups that happen to focus on spine surgery. With the field expanding as fast as it is, with new prediction models being published every month, we expect the overrepresentation of spine surgery in this field in its infancy to be temporary.
While there is wide variation in study design, certain study design elements are fairly similar across most studies. The most common designs comprise binary outcomes; either a 70:30 or 80:20 split between training and test set; and 10-fold cross-validation (10-FCV) as the method of internal validation. Wide variety exists in study size, time-point of outcome, and choice of ML algorithms. Study size is mostly determined by whether a national database or registry was used for model development. These quality improvement databases offer a large number of datapoints with a variety of variables from a diverse group of hospitals, enabling the creation of prediction models. However, these databases are sometimes flawed by errors and their generalizability is also yet to be assessed (Rolston et al. 2017). External validation remains crucial, considering generalizability outside the geographical origin of the database is not ensured (Janssen et al. 2018). Institutional databases offer the advantage of more veracious data, for instance including PROM data, which can extend over longer periods of time, but often lack adequate size.
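The common validation designs mentioned above — a fixed 80:20 holdout split and 10-fold cross-validation — can be sketched in plain Python. The fold logic below is a generic illustration, not taken from any included study:

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]            # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

n = 100                                              # hypothetical cohort size
idx = list(range(n))
random.Random(0).shuffle(idx)
train, test = idx[:int(0.8 * n)], idx[int(0.8 * n):]  # 80:20 holdout split

folds = list(k_fold_indices(n, k=10))
print(len(folds), len(folds[0][1]))                  # 10 folds, 10 test samples each
```

Each sample appears in exactly one test fold, so every datapoint is used for both training and testing across the 10 iterations.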
The choice of ML algorithm seems largely arbitrary. While studies do list the pros and cons of certain algorithms, no study elaborates on why those algorithms were specifically chosen. A potential reason neural networks and random forests are selected so often is familiarity with these algorithms. Neural networks have been around for decades, but were limited by lagging computational power (Hopfield 1988). The increase in computational power has led to a significant expansion of what neural networks can process and scientists have been able to build on the work of previous decades (Schmidhuber 2015). Future research should report on multiple ML algorithms and provide the performance measures of all models, thus enabling comparison between different approaches.
Despite the importance of performance metrics, a mere one-third of prediction models included information on calibration, similar to prior studies assessing prediction models in multiple medical domains (Bouwmeester et al. 2012, Heus et al. 2018). Calibration is important to evaluate whether the model is under- or overestimating the risk, regardless of its discriminative abilities. Systematically underestimating risk can lead to undertreatment, while overestimating risk can cause overtreatment (Van Calster and Vickers 2015, Van Calster et al. 2019). To improve the quality of reporting of clinical prediction models, Collins et al. (2015) published the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement. While not tailored for ML prediction models, this guideline can provide a framework for researchers to use during development. Hopefully, a more widespread adoption of the TRIPOD statement can lead to less variation in study designs and better reporting of performance metrics.
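Calibration measures such as these are straightforward to compute. A minimal sketch of the Brier score, and of calibration-in-the-large (mean predicted risk versus observed event rate), using invented predictions:

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and binary outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

# Hypothetical predicted risks and observed binary outcomes
probs = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
outcomes = [1, 1, 0, 0, 0, 0]

print(round(brier_score(probs, outcomes), 3))  # 0.113 (0 = perfect calibration and discrimination)

# Calibration-in-the-large: compare mean predicted risk with the event rate.
mean_pred = sum(probs) / len(probs)            # 0.50
event_rate = sum(outcomes) / len(outcomes)     # ~0.33: this model overestimates risk
```

A model can discriminate well (high AUC) while still systematically over- or underestimating risk, which is exactly what the Brier score and calibration plots expose.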
Only one-fifth of prediction models have a digital application available. The purpose of prediction models is to aid clinicians and patients in decision-making, which can be achieved only if the models are available for use. Otherwise, predictive analytics based on ML will remain a mere theoretical exercise. Furthermore, researchers should be encouraged not only to provide a digital application of their prediction model, but to share their code as well. In a field in its infancy, code provided by more experienced researchers can guide beginning research groups in their endeavors. Additionally, this can greatly increase the small number of external validation studies being performed.
In conclusion, ML prediction models have been developed for a wide variety of topics in orthopedic surgery. Topics regarding medical management and survival were the most commonly studied and spine surgery was the most involved subspecialty. Heterogeneity between studies is mostly based on study size, choice of ML algorithm, and time-point of outcome. Most published prediction models showed fair to good discriminative abilities, while calibration was poorly reported. Future studies should preferably include more multi-institutional, prospective databases and develop multiple models enabling comparison between different ML approaches. Also, important performance measures such as calibration should be reported to evaluate the prediction model accurately.

Table 2 and Appendices 1 and 2 are available as supplementary data in the online version of this article, http://dx.doi.org/10.1080/17453674.2021.1932928

All authors made a substantial contribution to the study. PTO, OQG, CO, JJV, and JHS contributed to the conception of the study. PTO and OQG screened all the titles and abstracts. PTO, OQG, AVK, and MB participated in data collection. PTO and OQG conducted the statistical analyses and prepared the manuscript. All authors contributed to interpretation of the data and participated in revision of the manuscript.

Supplementary data
Acta thanks Max Gordon and Christoph Hubertus Lohmann for help with peer review of this study.