Performance of prediction models for short-term outcome in COVID-19 patients in the emergency department: a retrospective study

Abstract Introduction Coronavirus disease 2019 (COVID-19) has a high burden on the healthcare system. Prediction models may assist in triaging patients. We aimed to assess the value of several prediction models in COVID-19 patients in the emergency department (ED). Methods In this retrospective study, ED patients with COVID-19 were included. Prediction models were selected based on their feasibility. Primary outcome was 30-day mortality, secondary outcomes were 14-day mortality and a composite outcome of 30-day mortality and admission to medium care unit (MCU) or intensive care unit (ICU). The discriminatory performance of the prediction models was assessed using an area under the receiver operating characteristic curve (AUC). Results We included 403 patients. Thirty-day mortality was 23.6%, 14-day mortality was 19.1%, 66 patients (16.4%) were admitted to ICU, 48 patients (11.9%) to MCU, and 152 patients (37.7%) met the composite endpoint. Eleven prediction models were included. The RISE UP score and 4 C mortality scores showed very good discriminatory performance for 30-day mortality (AUC 0.83 and 0.84, 95% CI 0.79-0.88 for both), significantly higher than that of the other models. Conclusion The RISE UP score and 4 C mortality score can be used to recognise patients at high risk for poor outcome and may assist in guiding decision-making and allocating resources.


Background
To mitigate the burden on the healthcare system caused by the Coronavirus disease 2019 (COVID- 19) pandemic, it is necessary to identify patients who are at high risk of poor outcomes early in the course of the disease [1][2][3]. Although most patients with COVID-19 develop only mild symptoms, some develop severe and potentially fatal complications [1,2,4,5]. Prediction models could help forecast outcomes when patients present to the emergency department (ED) and may assist in triaging patients when allocating healthcare resources.
Several triage and prediction models have been developed to identify ED patients with a high risk of adverse outcome [6][7][8][9][10]. Some of these models were specifically designed for patients with pneumonia (CURB-65) and sepsis (abbreviated Mortality Emergency Department Sepsis (abbMEDS) and sepsisrelated organ failure assessment (SOFA)) or for older patients (Risk Stratification in the Emergency Department in Acutely Ill Older Patients (RISE UP)) [6][7][8][9]. These models may be useful in patients with COVID-19 as well, as they often present with pneumonia and sepsis, and patients older than 65 years have a higher risk of poor outcome [11][12][13][14]. A recent systematic review reported on several new prediction models specifically designed for patients with COVID-19 [15]. Some models were found to have a good discriminatory performance with an area under the receiver operating characteristic (ROC) curve (AUC) of 0.84.

Study design and setting
This retrospective cohort study was performed at the ED of the Maastricht University Medical Centre þ (MUMCþ). This is a combined secondary/tertiary care centre in the Netherlands, with 22,000 ED visits every year. The medical ethics committee of the MUMC þ approved this study (METC 2020(METC -1572. Informed consent was obtained from all individual participants. This study was conducted and reported in line with the Strengthening the Reporting of Observational studies in Epidemiology (STROBE) guidelines [19].

Study sample
The study sample consisted of consecutive adult (18 years or older) medical ED patients diagnosed with COVID-19 during the first wave of the COVID-19 pandemic in the period from March 11th until May 8th 2020. Patients were included if they met the following criteria: (1) symptoms compatible with COVID-19 (i.e. coughing, common cold, sore throat, dyspnoea, acute diarrhoea, vomiting, fever or an unexpectedly discovered oxygen saturation below 92%); and (2) positive result of the polymerase chain reaction (PCR) for SARS-CoV-2 in respiratory specimens or (3) (very) high suspicion of COVID-19 according to the chest computed tomography (CT) scan (CO-RADS 4 or CO-RADS 5) [20]. We excluded patients who revisited the ED after an earlier ED presentation during the study period. In order to perform external validation of prediction models in our sample, we aimed to comply with the rule of thumb to include approximately 100 patients who met the primary outcome, similar to other studies [21].

Data collection
Data collection was performed by medical students and resident doctors, who were blinded to the study hypotheses. We collected data on age, sex and information regarding comorbidity according to the Charlson Comorbidity Index (CCI) from electronic medical records [22]. We also retrieved the following vital signs: heart rate (HR), systolic blood pressure (SBP), mean arterial blood pressure (MAP), respiratory rate (RR), oxygen saturation, temperature and Glasgow Coma Scale (GCS). For each vital sign, we used the initial (i.e. first recorded) value during the ED visit. The Alert Verbal Pain Unresponsive (AVPU) scale was derived from the GCS [23]. If RR or GCS were missing, we used paCO 2 and descriptions in the medical records to deduce these values, similar to other studies [6,18,24]. In addition, we collected routinely assessed laboratory tests: haemoglobin, haematocrit, leukocytes, thrombocytes, lymphocytes, D-dimer, blood gas analysis, bicarbonate, sodium, potassium, blood urea nitrogen (BUN), creatinine, lactate dehydrogenase (LDH), bilirubin, albumin and C-reactive protein (CRP). If haematocrit and pO 2 values were missing, we used haemoglobin and oxygen saturation to calculate these values, similar to other studies [25,26].
Furthermore, we collected the results of the PCR for SARS-CoV-2 in respiratory specimens and the results of the chest CT scan [20]. The results of the chest CT scan were determined by a radiologist. Finally, we retrieved data on length of hospital stay, admission to the medium care unit (MCU) or intensive care unit (ICU), and 30-day and 14-day mortality. Data on mortality were verified using the medical records. In the Netherlands, all deaths are registered by the municipal administration office, and these data are linked to the medical records.

Prediction models
We searched PubMed for studies on prediction models focussing on patients with COVID-19 using a combination of methodological search terms (prognostic, prediction model, score, regression) and COVID-19 search terms (COVID-19, SARS-CoV-2, coronavirus). In addition, we checked reference lists of manuscripts we identified this way. The search was performed on June 17th and repeated on September 11th to check for more recent publications.
We selected prediction models based on the inclusion of readily available variables in the ED and the aim to predict the risk of mortality or progression to severe illness (i.e. tachypnoea, hypoxia and ICU admission with shock, mechanical ventilation, or organ failure). We excluded models that were not clearly described or were not feasible in our ED setting. Prediction models were also excluded if the included variables or the risk calculation were unclear. Models developed using machine learning techniques other than regression and radiologic models were excluded because these could not be reproduced in our setting.

Outcomes
The primary outcome was all-cause mortality within 30 days of ED presentation. The secondary outcomes were all-cause mortality within 14 days and a composite outcome of 30-daymortality and admission to the MCU/ICU. In our hospital, all patients admitted to the ICU were mechanically ventilated.

Statistical analysis
Baseline characteristics were analysed using descriptive statistics on the observed data. For each patient, we completed variables of the included prediction models. When the score could be completed in less than 95% of patients due to missing values, data were imputed using stochastic regression imputation. We calculated the AUC under the ROC curve to quantify the discriminatory performance of the included prediction models. An AUC of 0.5 corresponds with very poor discriminatory performance, whereas an AUC of 1.0 means perfect accuracy. We compared the AUCs of the included models using the method of DeLong. All data were analysed using IBM SPPS Statistics for Windows, IBM Corporation, Armonk NY, version 25.0.

Study sample
During the study period, 415 ED patients met the inclusion criteria. After the exclusion of 12 patients because of refusal of informed consent, we included 403 patients for analysis ( Table 1). The median age of patients was 71 years (IQR 60-78), and 255 patients (63.2%) were older than 65 years. Most patients (66.0%) were male. The PCR for SARS-CoV-2 was positive in 323 patients (80.1%) and the chest CT scan was positive in 325 patients (80.6%). A total of 307 patients (76.2%) were admitted to the hospital, whereas the other patients were discharged home for further recovery. The median length of hospital stay was 6 days (IQR 3-12).
In our sample,66 patients (16.4%) were admitted to the ICU, 48 patients (11.9%) to the MCU, and 95 patients died during follow up, yielding a 30-day mortality of 23.6% and a 14-day mortality of 19.1%. The survival curve is shown in Figure 1. A total of 152 patients (37.7%) met the composite endpoint of 30day mortality and admission to MCU/ICU.

Prediction models
We included 11 prediction models (Table 2), of which seven prediction models were not explicitly developed for patients with COVID-19: RISE UP, CURB-65, Modified Early Warning Score (MEWS), Rapid Emergency Medicine Score (REMS), abbMEDS, SOFA and Acute Physiology And Chronic Health Evaluation II (APACHE II) [6][7][8][9][10][16][17][18]. Furthermore, in a recent systematic review, 16 prediction models specifically designed for patients with COVID-19 were identified [15]. Of these models, eight estimated mortality risk in patients with suspected or confirmed COVID-19, five aimed to predict progression to severe disease, and three estimated length of hospital stay. We excluded 14 of these models for the following reasons: no clear description of the variables or risk calculation (n ¼ 5), not compatible with our setting because of the use of machine learning (n ¼ 5), or inclusion of radiologic characteristics (n ¼ 4). We included two prognostic models from the systematic review (ACP score and Host risk factor score) [29,30]. Additionally, we included two more recently published prediction models (CALL score and the Coronavirus Clinical Characterisation Consortium (4 C) mortality score) not included in the systematic review [27,28].
A total of six prediction models (RISE UP, 4 C mortality, CURB-65, SOFA, APACHE II and CALL) could be calculated in less than 95%of the patients because of missing values (vital signs and laboratory tests, Supplementary Table 2). Therefore, missing data were imputed using stochastic regression imputation.

Validation of the prediction models
The prediction models were used to calculate the risk of an adverse outcome (Table 3; Figure 2). The RISE  In comparison, the CURB-65, MEWS, REMS, abbMEDS, SOFA, APACHE II, CALL, ACP and Host risk factor score yielded AUCs ranging from 0.64 to 0.76 for 30-day mortality, AUCs ranging from 0.62 to 0.76 for 14-day mortality, and AUCs ranging from 0.68 to 0.76 for the composite endpoint. The discriminatory performance of the RISE UP score and 4 C mortality  score was significantly higher than that of the other models using the DeLong method.

Discussion
In this retrospective study, we externally validated 11 prediction models for their ability to predict mortality or admission to MCU/ICU in ED patients with COVID-19. We found that both the RISE UP score and 4 C mortality score had very good discriminatory performance, which was the highest of the models we analysed. The models yielded high AUCs for both 14-day mortality (both AUC of 0.83) and 30-day mortality (AUC of 0.83 and 0.84). The nine other models showed significantly lower discriminatory performance. The CURB-65, REMS, abbMEDS, SOFA, APACHE II and CALL score had a good discriminatory performance (AUC ranging from 0.71 to 0.76). In contrast, the ACP index and Host risk factor score had a moderate to poor performance (AUC of 0.67 and 0.64, respectively). Most prediction models had a higher discriminatory performance for predicting mortality than for predicting the composite outcome of mortality and MCU/ ICU admission.

Prediction models
The RISE UP score was recently developed to predict 30-day all-cause mortality in older medical ED patients and consists of easily and readily available items during the ED visit [6]. It is not unexpected that the model works well for admitted patients with COVID-19, since many of these patients in our cohort (63.2%) were 65 years or older. High mortality in older patients with COVID-19 was shown previously [11][12][13][14]. The 4 C mortality score was recently developed to predict inhospital mortality in a very large cohort of COVID-19 patients in the UK [27]. The good discriminatory performance of both the RISE UP and 4 C mortality scores can be explained because these models include items that reflect the severity of illness in ED patients and are indicative of sepsis, organ failure and/or shock (i.e. abnormal vital signs, LDH, BUN, Bilirubin). The items of the RISE UP and 4 C mortality score are quite similar. Elevated levels of LDH were found to predict adverse outcomes in patients with COVID-19 [31]. The prognosis of ED patients is reflected by the presentation of the patients at the ED, which results from both the severity of the current disease and pre-existing factors (i.e. age and comorbidities) [1,4]. Regarding feasibility, the probability of a poor outcome can be predicted in the first two hours of the ED visit by both models. One disadvantage of the 4 C mortality score may be that it contains the number of comorbidities of the ED patients, which is not always available in the ED. This is a disadvantage compared to the RISE UP score, which consists of six items readily available in the ED. Moreover, the RISE UP score can easily be implemented with an online calculator (https://jscalc.io/calc/ o1vzp36bIDGQUCYl). To guide clinical decision-making, prediction models that can be computed easily and quickly are of great value. The CURB-65 is commonly used to assess the severity and mortality in patients with community-acquired pneumonia [7]. In our cohort, we found that the score had a moderate to good ability to discriminate between mortality and survival (AUC of 0.75). In other studies in patients with COVID-19, the CURB-65 score was found to have a very good discriminatory performance for mortality and progression to severe disease with AUCs ranging from 0.81 to 0.88 [32][33][34][35]. The highest AUC (0.88) was found in a Turkish study [33]. Their high AUC may be explained by the inclusion of patients with less severe COVID-19 (more often lower CURB scores and lower mortality) compared to our patients. The MEWS and REMS were designed for early detection of high-risk patients by assigning points to vital signs and can both be easily applied in the ED. In our cohort, the MEWS score showed only reasonable discriminatory performance for 30-day mortality (AUC of 0.64), while the REMS score yielded moderate to good performance (AUC of 0.73). In one Chinese study, the MEWS score and REMS score were analysed in 138 patients with COVID-19 [36]. The MEWS showed an AUC of 0.68, similar to the AUC in our sample. The REMS score was found to have an AUC of 0.84. Our patients were older than the patients in the Chinese study (median 71 versus 58 years), which probably explains the higher AUC, as the AUC was 0.77 in the 50 Chinese patients older than 65 years.
APACHE II and SOFA scores are used to predict mortality in ICU patients. The discriminatory performance for 30-day mortality of these scores in our cohort was moderate to good (AUC of 0.71 and 0.72, respectively). These findings were comparable to those reported in other studies with patients with COVID-19 [31,32,37]. In one Chinese study in ICU patients with COVID-19, the AUC of the APACHE II score was 0.97, and the AUC of the SOFA score was 0.87, which is much higher than the AUCs we found [32]. However, our patients were less frequently admitted to the ICU (only 16.4%). Consequently, our population is more heterogeneous and mortality is probably more difficult to predict. The APACHE II score was less feasible in an ED setting, because in our ED, an arterial blood gas is measured on indication only (in 37.5% of our patients, no arterial blood gas was measured).
The three other prediction models specifically designed for patients with COVID-19 had varying predictive performances in our cohort. The CALL score had a good predictive value (AUC of 0.76). This CALL score was developed to predict progression to severe disease in the first 5 to 10 days in a cohort of 208 Chinese patients with COVID-19 [28]. The AUC in the Chinese study was 0.86, which was higher than the AUC we found. Application of a new model in an independent cohort usually results in a lower AUC. In addition, the patients in the Chinese cohort were much younger than our patients (mean 44 versus 71 years), and their follow-up period was shorter. The ACP index was developed to predict 12-day mortality in patients with COVID-19 in Wuhan [29]. The Host risk factor score was developed to predict mortality or progression to severe disease [30]. The discriminatory performance of these two scores was not reported by the authors. In our external validation, both scores had poor discriminatory performance (AUC of 0.67 (ACP index) and 0.64 (host risk factor score)). In a recent Spanish study in nursing home residents, the ACP and host risk factor score yielded comparable low AUCs (AUC of 0.60 and 0.55, respectively) [35]. The difference between our study and the original Chinese studies may also be explained by the different phase of the COVID-19 pandemic in which the studies took place, as in Europe, physicians were already slightly more prepared, and outcomes may therefore differ.

Study limitations
Our study had several limitations. First, our study was performed in a single medical centre, limiting the generalizability of the results. However, our cohort of patients with COVID-19 was relatively large and has been recruited in one of the most heavily affected areas of the Netherlands. Furthermore, by validating all prediction models in the same cohort, there were no differences in the patient sample, and we could truly compare the scores [38]. Second, the process of selecting prediction models for our analysis might have been incomplete. We chose prediction models that were feasible in our ED setting, which may be different for other EDs. Last, in a subgroup of patients with pre-existing frailty or severe comorbidity, it was decided to initiate conservative care only (35.2% had treatment restrictions). As these decisions affect mortality and likelihood of going to the ICU on the one hand, and may differ in other countries on the other hand, we decided to study MCU/ICU admissions as a composite outcome only. In addition, we decided to perform a subgroup analysis in the 261 patients without treatment restrictions (Supplementary Table 1). We found comparable AUCs for 30-day mortality, 14-day mortality and the composite outcome (AUC of 0.84, 0.82 and 0.81 for the RISE UP, respectively). We found some differences in the performance of the models between patients with and without treatment restrictions, which may be due to the smaller number of patients and the smaller number of events.

Conclusion
In conclusion, the RISE UP and 4 C mortality score had the highest discriminatory performance for short term mortality in ED patients with COVID-19. Prediction models like the RISE UP and 4 C mortality score are useful for identifying patients at high risk for adverse outcomes and may be a first step in guiding clinical decision-making and allocating healthcare resources in this pandemic in which we have to deal with scarcity of clinical facilities and materials. However, this needs to be the subject of further investigation.