Clinical and inflammatory features based machine learning model for fatal risk prediction of hospitalized COVID-19 patients: results from a retrospective cohort study

Abstract
Objectives: To appraise effective predictors for COVID-19 mortality in a retrospective cohort study.
Methods: A total of 1270 COVID-19 patients, including 984 admitted to the Sino French New City Branch (training and internal validation sets, randomly split at a 7:3 ratio) and 286 admitted to the Optical Valley Branch (external validation set) of Wuhan Tongji Hospital, were included in this study. Forty-eight clinical and laboratory features were screened with the LASSO method. A multi-tree extreme gradient boosting (XGBoost) machine learning model was then used to rank the importance of the features selected by LASSO, and a death risk prediction model was subsequently constructed with a simple-tree XGBoost model. Model performance was evaluated by AUC, prediction accuracy, precision, and F1 scores.
Results: Six features, including disease severity, age, and levels of high-sensitivity C-reactive protein (hs-CRP), lactate dehydrogenase (LDH), ferritin, and interleukin-10 (IL-10), were selected as predictors for COVID-19 mortality. The simple-tree XGBoost model built on these features predicted death risk accurately, with >90% precision and >85% sensitivity, as well as F1 scores >0.90 in the training and validation sets.
Conclusion: We propose disease severity, age, and serum levels of hs-CRP, LDH, ferritin, and IL-10 as significant predictors for death risk of COVID-19, which may help to identify high-risk COVID-19 cases.
KEY MESSAGES
A machine learning method is used to build a death risk model for COVID-19 patients.
Disease severity, age, hs-CRP, LDH, ferritin, and IL-10 are death risk factors.
These findings may help to identify high-risk COVID-19 cases.


Introduction
The ongoing pandemic of coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), poses an unprecedented challenge to the global health system. As of 15 November 2020, COVID-19 had infected more than 53 million individuals and caused more than 1.3 million deaths globally; the U.S. and European countries, the most affected regions with increasing mortality, jointly accounted for over 47% of cases and 44% of deaths worldwide [1]. The largest report to date, conducted by the Chinese Centre for Disease Control and Prevention with 72,314 cases, revealed an average case-fatality rate of 2.3% [2], while a retrospective study of 379 critically ill adult patients observed a striking day-28 mortality of 27%, rising sharply with older age and in patients with underlying comorbidities [3]. The death risk associated with severe COVID-19 has placed great pressure on medical services, resulting in a shortage of critical care resources and a heavy disease burden.
To optimize the treatment and recovery of patients with limited medical resources, it is of great significance to identify early prognostic biomarkers that discriminate COVID-19 patients likely to develop critical illness and to evaluate their mortality risk during the global pandemic. A prior study, conducted in a development cohort of 1590 patients and a validation cohort of 710 patients, established a traditional risk score model using 10 independent predictors including age, onset symptoms, comorbidities, and several laboratory findings [4]. During the recent global emergency, various machine learning (ML) and artificial intelligence (AI) technologies have been widely applied in patient tracing, vaccine development, and patient screening because of their scalability and processing speed [5]. However, few studies have focussed on the use of ML and AI in discerning patients' disease progression and estimating death risk.
Under these circumstances, we conducted a retrospective cohort study among 1270 COVID-19 patients admitted in Wuhan, China, aiming to use interpretable ML algorithms to help identify significant death risk factors for COVID-19.

Study design
After excluding pregnant women and subjects with missing information about comorbidities, 1270 COVID-19 patients admitted to two branches of Tongji Hospital in Wuhan between 27 January and 5 April 2020 were enrolled in our study. COVID-19 was confirmed in these patients according to the diagnostic criteria of the WHO interim guidance, by positive RT-PCR detection in nasal or throat-swab specimens. Among these patients, 984 were admitted to the Sino French New City Branch of Tongji Hospital between 27 January and 5 April 2020, and 286 were admitted to the Optical Valley Branch of Tongji Hospital between 3 February and 26 March 2020. The illness severity of COVID-19 patients was classified as mild, moderate, severe, or critical according to the Diagnosis and Treatment of COVID-19 guidelines published by the National Health Commission of China [6]. In this study, we defined the mild and moderate types as the non-severe group, while the severe and critical types were categorized as the severe group.

Data sources and processing
Electronic medical records of all patients were reviewed to collect demographic information, clinical characteristics (onset symptoms, disease severity, and comorbidities), laboratory examinations (blood routine examination, cytokines and infection-related factors, blood coagulation factors, and serum biochemical indices), and chest CT scan findings on admission. Because some laboratory markers were below the limits of detection (LOD) in >15% of subjects, we treated these biomarkers as binary variables, using the normal reference range as the cut-off value in the subsequent analysis. The primary outcomes were discharge or death. A computerised database was used to organise the collected original data and to cross-check it. Ethics approval for the collection and analysis of all data from these patients was granted by the Ethics Committee of Tongji Hospital of Tongji Medical College, Huazhong University of Science and Technology (TJ-IRB20200201).

Statistical analysis
Descriptive data were expressed as frequencies (%) for categorical variables and as medians with interquartile ranges (IQR) for continuous variables. Mann-Whitney U and Chi-square tests were conducted to estimate the differences in continuous and categorical data, respectively, between the surviving and non-surviving groups. To reduce collinearity and avoid over-fitting of variables, least absolute shrinkage and selection operator (LASSO) regression analysis was carried out in the training set of 984 patients to select the most significant clinical characteristics for mortality risk of COVID-19, using the R software package "glmnet". As described herein, 48 clinical features with missing values <20% were entered into the variable shrinkage process. The LASSO regression was established using a Cox proportional hazards model, whose optimal value of λ, giving the minimum partial-likelihood deviance, was selected by 10-fold cross-validation [7]. Subsequently, variables selected by LASSO regression were entered into a high-performance ML prediction model, namely XGBoost. The importance of a candidate feature in XGBoost is determined by its accumulated use in the decision steps of the trees [8]. To avoid model over-fitting, we split the 984 cases randomly into training and internal validation sets at a ratio of 7:3, and a grid search based on the "caret" package in R was then conducted in the training set to tune the XGBoost hyperparameters, including the number of trees (nrounds), the learning rate (eta), the minimum loss reduction required to split a leaf node (gamma), the maximum tree depth (max_depth), the minimum sum of instance weight needed in a child node (min_child_weight), and the subsampling proportion (subsample) [9]. The optimal hyperparameters were selected according to the minimum root mean square error (RMSE) in the grid search, using 10 repeats of 10-fold cross-validation.
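The 7:3 split and the repeated 10-fold cross-validation described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the R/caret workflow the study used. The function names, the fixed seeds, and the use of simple (unstratified) random shuffling are assumptions made for the example; the text does not say whether the split was stratified by outcome.

```python
import random

def split_7_3(n_cases, seed=0):
    """Randomly split case indices into training (70%) and internal validation (30%)."""
    idx = list(range(n_cases))
    random.Random(seed).shuffle(idx)   # seed is an arbitrary, hypothetical choice
    cut = round(0.7 * n_cases)
    return idx[:cut], idx[cut:]

def repeated_kfold(indices, k=10, repeats=10, seed=0):
    """Yield (train, held_out) index lists for `repeats` rounds of k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]   # k nearly equal folds
        for i, held_out in enumerate(folds):
            train = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train, held_out

train_idx, valid_idx = split_7_3(984)
splits = list(repeated_kfold(train_idx))
print(len(train_idx), len(valid_idx), len(splits))
```

With 984 cases this gives 689 training and 295 internal validation cases, and 10 repeats of 10-fold cross-validation produce 100 train/held-out pairs on which each candidate hyperparameter combination would be scored.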
The XGBoost model was finally trained in the training set with the following hyperparameter settings: max_depth = 3, eta = 0.1, gamma = 0.2, min_child_weight = 4, subsample = 1, and nrounds = 145, while all other hyperparameters were kept at their default values. We defined this model as "multi-tree XGBoost", and the ranks of feature importance were then obtained.
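For readers working outside R, the reported settings can be written down as a small configuration mapping. This is only a sketch: the values are those stated above, while the renaming to `learning_rate` and `n_estimators` assumes the scikit-learn style wrapper of the Python xgboost package, which the study itself did not use.

```python
# Settings reported in the text, keyed by the caret/R names used there.
reported = {
    "max_depth": 3,
    "eta": 0.1,
    "gamma": 0.2,
    "min_child_weight": 4,
    "subsample": 1,
    "nrounds": 145,
}

# Assumed renaming for a scikit-learn style Python wrapper (not used in the study).
RENAME = {"eta": "learning_rate", "nrounds": "n_estimators"}
sklearn_style = {RENAME.get(name, name): value for name, value in reported.items()}
print(sklearn_style)
```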
To further select the most significant features related to mortality risk, 100-round 5-fold cross-validation was conducted in the training and internal validation sets. Features were added to the model in order of importance, and the key features were those retained up to the point where adding a further feature improved the area under the curve (AUC) by <0.5%, which occurred with the seventh feature. Finally, six features were selected as significant predictors. Based on the importance of these six features, we established a simplified and portable decision model defined as "simple-tree XGBoost". Since 197 subjects had missing values for at least one of the six critical features, the remaining 787 cases were randomly split into training and internal validation sets at a ratio of 7:3, as previously reported [10,11]. The simple-tree XGBoost model was then re-trained with the same hyperparameters as described above, except that min_child_weight was set to 1 [11]. The performance of the simplified XGBoost model was evaluated in these training and internal validation sets, and also in COVID-19 patients admitted to the Optical Valley Branch of Tongji Hospital (the external validation set), by assessing accuracy, precision, recall, and F1 scores as described [11]. All participants in the external validation set were included in the XGBoost model, since it handles missing values by applying the sparsity-aware algorithm [8]. The AUC scores of the simple-tree (six features) and multi-tree (all features) XGBoost models and of multivariable logistic regression models with all or six features were also evaluated separately in the three datasets.
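The stopping rule for feature selection can be illustrated with a short sketch. It assumes that "<0.5%" means an absolute AUC gain below 0.005 (the text does not say whether the threshold is absolute or relative). The function name, the placeholder feature names `f1` to `f8`, and the AUC values are hypothetical and are not study results.

```python
def select_by_auc_gain(ranked_features, cv_auc, min_gain=0.005):
    """Add features in rank order; stop when the next one improves AUC by < min_gain."""
    selected = [ranked_features[0]]
    best = cv_auc(selected)
    for feature in ranked_features[1:]:
        candidate = cv_auc(selected + [feature])
        if candidate - best < min_gain:
            break                      # gain too small: keep the current subset
        selected.append(feature)
        best = candidate
    return selected, best

# Hypothetical mean cross-validated AUC by number of features (illustrative only).
toy_auc = {1: 0.80, 2: 0.85, 3: 0.88, 4: 0.90, 5: 0.91, 6: 0.92, 7: 0.922, 8: 0.95}
ranked = [f"f{i}" for i in range(1, 9)]
selected, auc = select_by_auc_gain(ranked, lambda feats: toy_auc[len(feats)])
print(selected, auc)
```

In this toy run the seventh feature adds only 0.002 to the AUC, so the procedure stops with six features. Note that a greedy rule of this kind never evaluates features ranked after the stopping point.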

Characteristics of patients from two branches in Wuhan Tongji hospital
As shown in Table 1  , and 146 (51.05%) of them were males. In both populations, fever was the most frequent onset symptom, followed by cough (65.14%) (Table 1). The top three comorbidities were hypertension, diabetes, and cardio-cerebral-vascular disease (CCVD) for patients admitted to both branches. Compared with the survivors, the non-survivors were older, more often male, and had a higher proportion of severe symptoms (all p < .05) (Table 1). Laboratory testing results for patients from the two branches are shown in Table 2, while the detectable rates and variable types of the laboratory findings used in our analysis are presented in Supplementary Table 1.

Clinical features selection in LASSO regression analysis
A total of 48 clinical features detected at hospital admission were entered into the LASSO regression analysis, and 19 were significantly associated with COVID-19 death risk, including age, gender, disease severity, number of symptoms, comorbidities of CCVD, hypertension, diabetes, and chronic kidney disease, number of comorbidities, blood count of neutrophils, activated partial thromboplastin time (APTT), levels of high-sensitivity C-reactive protein (hs-CRP), lactate dehydrogenase (LDH), and serum ferritin, and abnormal levels of hypersensitive troponin I (hs-cTnI), interleukin-6 (IL-6), IL-8, IL-10, and IL-1β (Supplementary Figure 1).

Features importance for an operable decision model
The aforementioned 19 features were entered into the multi-tree XGBoost model, and the top 10 clinical features were ranked by this model according to their importance values (Supplementary Figure 2). Subsequently, we added the ranked features one by one to the XGBoost model until the AUC score improved by less than 0.5%. Six features, namely disease severity, hs-CRP, age, LDH, serum ferritin, and IL-10, were selected as the significant factors (Table 3). Application of the multi-tree XGBoost algorithm with these six features resulted in a mean AUC (SD) of 0.921 (0.038) and 0.891 (0.053) in the training and internal validation sets, respectively, suggesting that this model was accurate enough to discriminate the fatal outcome of patients (Table 3).
The number of comorbidities was counted across seven common diseases: hypertension, diabetes, cardio-cerebral-vascular disease, malignancy, pulmonary disease, chronic kidney disease, and digestive system disease.

Construction and evaluation of simple-tree XGBoost model
A simple-tree XGBoost model was then constructed based on the above six key features. The performance of the simple-tree XGBoost model among COVID-19 patients is presented in Table 4. In the training set, we observed a survival prediction precision of 99.2% and a death prediction precision of 100%, and the recalls of survival and death prediction were 100% and 90.2%, respectively; in the internal validation set, the precisions of survival and death prediction were 99.1% and 100%, respectively, while the survival and death prediction recalls were 100% and 87.5%, respectively. Similar results were observed in the external validation set, with 99.6% and 100% prediction precision for survival and death, respectively, as well as 100% survival prediction recall and 85.7% death prediction recall. In general, the F1 scores for survival and death prediction, the accuracy, and the weighted and macro averages were all >0.90 among COVID-19 patients in the three sets (Table 4). Moreover, the structure of one of the decision trees built with the aforementioned six features is presented in Figure 1.
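As a quick consistency check, the F1 score is the harmonic mean of precision and recall, so it can be recomputed from the percentages quoted above. The sketch below does only that arithmetic; because it starts from rounded percentages, its output may differ from Table 4 in the last digit, and the variable names are ours.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs as quoted in the text, expressed as proportions.
reported = {
    "training": {"survival": (0.992, 1.000), "death": (1.000, 0.902)},
    "internal validation": {"survival": (0.991, 1.000), "death": (1.000, 0.875)},
    "external validation": {"survival": (0.996, 1.000), "death": (1.000, 0.857)},
}

f1 = {
    dataset: {cls: round(f1_score(p, r), 3) for cls, (p, r) in classes.items()}
    for dataset, classes in reported.items()
}
for dataset, scores in f1.items():
    print(dataset, scores)
```

The recomputed death-class F1 scores are about 0.948, 0.933, and 0.923 in the training, internal validation, and external validation sets, which agrees with the statement that all F1 scores exceed 0.90.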
For benchmarking purposes, we also compared the performance of the XGBoost model with that of the conventional multivariable logistic regression model. In the training set, the simple-tree XGBoost model with 6 selected features showed superior performance compared with logistic regression with all 19 features (AUC: 0.999 vs. 0.970, p = .008) or 6 features (AUC: 0.999 vs. 0.931, p = .003) (Figure 2(A)), while no significant difference in AUC score was observed between the simple-tree and multi-tree models (AUC: 0.999 vs. 0.995, p = .056) (Figure 2(A)). Similarly, in the internal validation set, the simple-tree XGBoost model exhibited better performance than logistic regression with all 19 features (AUC: 1.000 vs. 0.941, p = .026) or the six selected features (AUC: 1.000 vs. 0.883, p < .001), and showed a marginally higher AUC than the multi-tree XGBoost model (AUC: 1.000 vs. 0.977, p = .049) (Figure 2(B)). In the external validation set, both the simple-tree XGBoost model with six selected features and the logistic regression model with 19 features showed excellent performance (both AUC = 1.000, Figure 2(C)). Briefly, the above results suggest that the simple-tree XGBoost model had more precise and stable prediction performance than multivariable logistic regression in identifying the fatal outcome of patients.
Table 3. Performance of the multi-tree XGBoost classification in discriminating death outcomes by using 100-round fivefold cross-validation among COVID-19 patients admitted in the Sino French New City Branch of Wuhan Tongji Hospital.
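To make the AUC values above easier to interpret, recall that the AUC equals the probability that a randomly chosen non-survivor receives a higher predicted risk than a randomly chosen survivor. The sketch below computes it directly from that definition. The function name and the toy scores are hypothetical, and the code does not reproduce the significance tests behind the reported p values, whose method is not stated in the text.

```python
def auc(scores, labels):
    """AUC as P(score of a death case > score of a survivor); ties count as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]   # deaths
    neg = [s for s, y in zip(scores, labels) if y == 0]   # survivors
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted risks: one death case (0.4) is ranked below one survivor (0.5).
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.1]
print(auc(scores, labels))                              # 11 of 12 pairs ordered correctly
print(auc([0.9, 0.8, 0.7, 0.3, 0.2], [1, 1, 1, 0, 0]))  # perfect separation
```

Under this definition, an AUC of 1.000 means that every non-survivor was assigned a higher risk than every survivor in that dataset.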

Discussion
This retrospective cohort study of COVID-19 patients hospitalized in two branches of Wuhan Tongji Hospital demonstrated that six features, namely disease severity, age, and serum levels of hs-CRP, LDH, ferritin, and IL-10, were significant predictors for death risk of COVID-19 patients. The simple-tree XGBoost model constructed with these six features showed satisfactory performance, with AUC scores higher than 0.99 in the training and internal validation sets, and prediction precisions of survival and death both >95% in the external validation set. Judging from the feature importance ranked by the XGBoost model, disease severity is the most crucial factor for death risk prediction. Recently, several studies have reported a hospital fatality rate of 41.1-61.5% among critical patients, which is significantly higher than the fatality rate of 1.1-1.7% among mild and moderate patients [2,12-14]. Patients with severe manifestations should receive close attention so that they get appropriate treatment, thereby reducing their death risk. Our study also emphasised that older patients had a higher death risk, which is consistent with many previous studies [15,16].
Using laboratory biomarkers to construct prediction models is a comprehensive and efficient method for identifying progression towards severe illness and fatal outcomes of COVID-19. In our study, four biomarkers, namely hs-CRP, LDH, serum ferritin, and IL-10, were selected as risk factors for death prediction. hs-CRP, a crucial biomarker described in prior COVID-19 studies as indicating poor prognosis in acute respiratory distress syndrome (ARDS), reflects a persistent state of inflammation [17,18], which might interact closely with the inflammatory storm, causing lung damage and pulmonary oedema in patients with COVID-19 [19,20]. LDH is regarded as an important and ubiquitous cellular enzyme, and the serum level of LDH has been identified as a significant biomarker for lung fibrosis and infection [21,22]. Previous studies have also reported a relationship between increasing LDH and higher death risk of COVID-19 [4,11,23]. Ji et al. also observed that COVID-19 patients with a serum LDH level higher than 500 U/L showed a significant hazard ratio for illness progression of 9.8 (p < .001) in a multivariate Cox analysis, compared with the group with an LDH level <250 U/L [24].
One prior meta-analysis has recommended serum ferritin and IL-10 as candidate biomarkers for predicting COVID-19 progression to critical illness [25]. However, few epidemiological studies have provided direct support for associations of serum ferritin and IL-10 with COVID-19 fatal risk. An early case-control study reported that assessing serum ferritin levels in subjects at risk for and with ARDS may help to predict progression of ARDS and thereby improve the relevant treatment approach [26]. Wu et al. conducted a retrospective cohort study among 201 COVID-19 patients and found that elevated serum ferritin was an independent risk factor for ARDS development, but a similar association was not observed for the death outcome, possibly owing to the limited sample size [27]. Another case-control study conducted among 144 COVID-19 patients in Italy reported that patients who died during the hospital stay had significantly higher levels of serum ferritin, but a multiple regression analysis incorporating clinical and laboratory variables abolished the significant association of serum ferritin with in-hospital death [28]. Interestingly, one study (n = 174) that examined ferritin levels in relation to existing comorbidities, such as diabetes, found that diabetic patients with confirmed COVID-19 had a higher median level of ferritin than non-diabetic COVID-19 patients (764.8 vs. 128.9 mg/L, p < .001), suggesting that diabetics with COVID-19 may be more prone to inflammation and might experience serious complications from COVID-19 [29]. An in vitro study conducted in the human hepatoblastoma cell line HepG2 reported that inflammation-related cytokines (e.g. IL-1β and IL-6) might elevate ferritin synthesis [30].
Therefore, cytokines induced by COVID-19, which are generally increased in infected patients, might stimulate serum ferritin production in early inflammation and further result in a worse prognosis, but the underlying mechanisms need to be further elucidated. IL-10 is now described as a complex anti-inflammatory cytokine generated by different cell types, with a vital role in regulating immune and inflammatory responses [31]. Han et al. performed a case-control study using a series of inflammation markers (e.g. IL-6, IL-10, and TNF-α) among 102 COVID-19 patients and highlighted the significance of IL-6 and IL-10 as predictors of illness severity [32]. Another longitudinal analysis conducted in 71 COVID-19 patients revealed that a combination of IL-10, RANTES, and IL-1 receptor antagonist in the first week of follow-up might be a useful set of prediction biomarkers for patients' outcomes [33]. Increased IL-10 in severe patients might be related to a compensatory anti-inflammatory response, which may lead to a higher proportion of subsequent infections and sepsis, further raising the death risk [34]. Although possible biological relationships exist between the above biomarkers (hs-CRP, LDH, serum ferritin, and IL-10) and the severity of COVID-19, the roles of these biomarkers in COVID-19 pathogenesis still need validation and further in-depth investigation.
Despite the common use of ML methods in business analysis, mainstream medicine has still fallen behind in studying and applying ML methods for real-time risk prediction. In this study, the XGBoost method exhibited superior and stable performance in COVID-19 mortality risk prediction. Prior comparison studies have revealed that ML methods can be more accurate and efficient than traditional logistic regression analysis, especially when the sample size is limited [35]. A prospective study of 38 women with breast cancer used several ML methods with multiparametric magnetic resonance imaging to make early predictions of pathological complete response (pCR) to neoadjuvant chemotherapy and of survival outcomes; among all ML classifiers, the XGBoost model outperformed the others, such as linear support vector machine, logistic regression, and random forests, in the prediction of pCR (with a mean AUC of 0.8577 and a best AUC of 0.9430) [36].
Our study provides portable and intuitive clinical evidence for accurately identifying the death risk of patients with COVID-19 using an efficient ML method. A prior study conducted in Wuhan COVID-19 patients also used the XGBoost model to explore death risk factors, but the 375 patients included in its training set had a higher proportion of critical clinical symptoms (40.3%) [11], probably because it mainly included COVID-19 patients admitted to the Department of Critical Care Medicine; this rate was much higher than that in the Wuhan report (critical rate 3.0%) [37]. That study reported a mortality rate of 46.4% and identified only three features (LDH, lymphocytes, and hs-CRP) as significant death risk factors, while our present study, in a larger number of COVID-19 patients with a case-fatality rate of 5.67% (close to the mortality rate to date in Wuhan: 7.68%) [38], may be more representative of general patients. Additionally, by using both the LASSO model and the XGBoost ML algorithm, we were able to confirm previous findings that age, disease severity, hs-CRP, and LDH are significant death risk markers for COVID-19, and to provide further direct evidence for the association of serum ferritin and IL-10 with fatal outcome. However, several limitations should also be noted. Firstly, we constructed the XGBoost model with a modest sample size, and the sample size for external validation was comparatively small. Nevertheless, the death risk prediction performance of the XGBoost model was strong in both populations. Secondly, the proposed ML method is entirely data-driven and might be influenced by class imbalance resulting from the low fatality rate, which could in turn affect prediction accuracy and sensitivity. Hence, larger-scale, multicentre validation studies with improved data balance should be completed to obtain stable prediction performance and to extend the model to varied datasets.
Thirdly, the datasets for model construction and validation are entirely from China, which might restrict the generalizability of the ML model to other areas of the world. Finally, the classification capacity of the ML method still needs to be improved by balancing model interpretability against prediction accuracy. Clinicians tend to prefer comprehensible methods such as the logistic model, but a black-box model might offer better performance.

Conclusions
In this study, we identified six candidate features, namely disease severity, age, and serum levels of hs-CRP, LDH, ferritin, and IL-10 measured at hospital admission, as critical death risk predictors for COVID-19 patients. The simple-tree XGBoost ML model built on these six features can help to predict the death risk of hospitalized COVID-19 patients accurately, with >90% precision and >85% sensitivity. Since the six key features were generally detectable at hospital admission, early monitoring of these features might help to prioritise high-risk COVID-19 patients and optimise limited medical resources during the pandemic period.

Disclosure statement
The authors declare that they do not have any conflicts of interest.

Data availability statement
Data sharing is not applicable to this paper as the datasets generated need to remain confidential.