Development of Multivariable Prediction Models for the Identification of Patients Admitted to Hospital with an Exacerbation of COPD and the Prediction of Risk of Readmission: A Retrospective Cohort Study using Electronic Medical Record Data

Abstract
Background: Approximately 20% of patients who are discharged from hospital for an acute exacerbation of COPD (AECOPD) are readmitted within 30 days. To reduce this, it is important both to identify all individuals admitted with AECOPD and to predict those who are at higher risk for readmission.
Objectives: To develop two clinical prediction models using data available in electronic medical records: 1) identifying patients admitted with AECOPD and 2) predicting 30-day readmission in patients discharged after AECOPD.
Methods: Two datasets were created using all admissions to General Internal Medicine from 2012 to 2018 at two hospitals: one cohort to identify AECOPD and a second cohort to predict 30-day readmissions. We fit and internally validated models with four algorithms.
Results: Of the 64,609 admissions, 3,620 (5.6%) were diagnosed with an AECOPD. Of those discharged, 518 (15.4%) had a readmission to hospital within 30 days. For identification of patients with a diagnosis of an AECOPD, the top-performing models were LASSO and a four-variable regression model that consisted of specific medications ordered within the first 72 hours of admission. For 30-day readmission prediction, a two-variable regression model was the top-performing model, consisting of the number of COPD admissions in the previous year and the number of non-COPD admissions in the previous year.
Conclusion: We generated clinical prediction models to identify AECOPDs during hospitalization and to predict 30-day readmissions after an acute exacerbation from a dataset derived from available EMR data. Further work is needed to improve and externally validate these models.


Introduction
About one in five patients who are discharged from hospital after an acute exacerbation of COPD (AECOPD) are readmitted within 30 days. COPD is the condition with the highest 30-day readmission rate in Canada and carries a significant financial burden [1]. Early identification and treatment of individuals with an AECOPD may improve outcomes, including reduced readmissions, through comprehensive multimodal COPD interventions such as case management [2,3]. Resource limitations often preclude offering intensive case management to all patients. Thus, there is a need for tools that promptly identify individuals admitted with an AECOPD as well as those COPD patients who are most at risk of readmission.
To initiate case management of acute exacerbations, prompt identification near the time of admission is needed not just by the clinicians but also by the larger interprofessional team, including pharmacists, respiratory therapists, educators, and discharge coordinators. From a systems perspective, it is often a challenge for the team to be aware of all patients hospitalized with an AECOPD. Presenting symptoms, laboratory investigations, and radiographic findings are often nonspecific and can occur in many other conditions. Additionally, other chronic conditions can similarly present with an acute decompensation. While clinicians may initiate treatment for an acute exacerbation of COPD, this may be a provisional diagnosis, and it may be one of several active medical issues being treated. Some Electronic Medical Records (EMRs) may not have diagnoses as coded elements in a 'problem list', or these 'problem lists' may not be reliably updated. A systematic review of methods to identify patients with an AECOPD using routinely available data found that ICD coding was most frequently used [4]. However, ICD coding is typically not available during hospitalization and occurs after discharge. This systematic review found some evidence supporting the identification of COPD patients using pharmacotherapy data, such as inhaler prescriptions, when ICD codes are unavailable, but the evidence was limited to two studies that focused on outpatient identification [4].
A second challenge is knowing which patients admitted with an acute exacerbation are likely to be readmitted after discharge. Patients who are at high risk can be provided increased resources to meet their needs. General prediction rules for 30-day readmission for all hospitalized patients as well as COPD-specific prediction rules exist. The general prediction rules, such as the HOSPITAL score and LACE index, perform only modestly for patients with COPD exacerbations [5][6][7][8]. COPD-specific prediction scores such as the CODEX, CORE, PEARL, and RACE scores are predictive of all-cause readmission [8][9][10][11][12]. These COPD-specific scores are also limited by modest performance and by items that are either not generalizable, such as U.S. medical insurance status, or not readily available in EMRs, such as the MRC Dyspnea scale.
For a model to be clinically useful and effective, it should be highly predictive, be parsimonious, and use variables that are readily available in the electronic medical record.

Aim
Our primary aim was to develop two clinical prediction models using data available in electronic medical records: 1) to identify, early in their admission, patients with an acute exacerbation of COPD and 2) to predict 30-day all-cause readmission in patients discharged after an AECOPD.

Dataset
To create both datasets, we used retrospective data from two University Health Network (UHN) affiliated hospitals: Toronto Western Hospital and Toronto General Hospital. We created a data set of all admissions to General Internal Medicine from 2012 to 2018 inclusive. The data set contained structured coded data including age, sex, medications ordered, co-morbidities, and laboratory values. The data set included ICD-10 diagnosis codes, which were typically documented several months after each admission. We received Research Ethics Board approval at UHN.

Study population
Given the two aims, there were two population cohorts used to create the two corresponding datasets.

Cohort to identify COPD exacerbations
Inclusion criteria:
All admissions to General Internal Medicine from 2012 to 2018 inclusive at the Toronto Western Hospital and the Toronto General Hospital, which are tertiary teaching hospitals within the UHN.

Exclusion criteria
None.

Cohort to predict 30-day readmissions
Inclusion criteria:
All admissions from the first cohort that were due to AECOPD. To determine if an admission was due to AECOPD, we used the Health Quality Ontario Quality Based Procedures ICD-10 definition of an AECOPD admission [13] (Table S1).

Exclusion criteria:
We excluded any patients who died during the index admission.

Predictor variables
We collected only data that were obtainable from the electronic medical record for each patient's current and prior admissions. These data included demographic information, clinical data such as diagnoses, index admission characteristics, laboratory values, and medication orders. These variables were selected from the literature (Tables S2 and S3). For the first dataset, only data within 72 h of admission were included. For the second dataset, data from throughout the hospitalization were included.

Outcomes
For the outcome of whether a patient had a COPD exacerbation, the inclusion criteria for the AECOPD cohort described above were used. ICD-10 coding of diagnoses is typically performed one to three months after discharge and was the "gold standard" outcome for model prediction. For the outcome of readmissions, we included all-cause medical readmissions within 30 days after discharge from the index AECOPD admission.

Analysis
In the development of both models, variables with 15% or more missing data were eliminated. Where values were missing for the remaining variables, we imputed the value with the average of other values in the same category. We tuned and fit models with the following four algorithms: logistic regression, Least Absolute Shrinkage and Selection Operator (LASSO) [14], recursive partitioning [15], and linear discriminant analysis (LDA) [16]. For the non-logistic regression models, all available variables were included. For logistic regression, full models were fitted, and an ordered list of variable importance was generated. We built parsimonious models using forward variable selection based on order of importance [17]. For each added variable, we used cross-validation to produce a sample of C-statistics for each model. We then used the Wilcoxon Signed Rank test [18] to evaluate whether there was a change in C-statistic compared to the model with one fewer variable. If the model with the added variable had a significantly improved C-statistic, we included this variable in the model. We then compared the performance of the four algorithms using C-statistics and Brier Scores [19].
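As an illustration of the selection step described above, the following is a minimal Python sketch using only the standard library. It is not the study's code. The function names, the example fold values, and the 0.05 significance level are assumptions made for illustration; the signed-rank p-value is computed exactly by enumerating sign assignments, which is practical only for a small number of folds.

```python
from itertools import product


def c_statistic(y_true, y_prob):
    """Share of (positive, negative) pairs in which the positive case has the
    higher predicted probability; tied pairs count as one half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def signed_rank_p_value(before, after):
    """Exact two-sided Wilcoxon signed-rank p-value for paired samples.
    Zero differences are dropped and tied magnitudes share an average rank."""
    diffs = [a - b for a, b in zip(after, before) if a != b]
    if not diffs:
        return 1.0
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        shared_rank = (i + j) / 2 + 1  # 1-based average rank for a tie group
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = abs(w_plus - total / 2)
    # Under the null hypothesis every sign assignment is equally likely.
    extreme = 0
    assignments = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        assignments += 1
        if abs(w - total / 2) >= observed - 1e-12:
            extreme += 1
    return extreme / assignments


# Hypothetical fold-wise C-statistics for a model with k variables and with k + 1.
c_with_k = [0.951, 0.948, 0.955, 0.950, 0.947, 0.953, 0.949, 0.952, 0.946, 0.954]
c_with_k_plus_1 = [0.962, 0.957, 0.961, 0.963, 0.955, 0.960, 0.958, 0.964, 0.956, 0.959]

p_value = signed_rank_p_value(c_with_k, c_with_k_plus_1)
improved = sum(c_with_k_plus_1) > sum(c_with_k)
keep_variable = improved and p_value < 0.05  # 0.05 is an assumed cut-off
print(f"p = {p_value:.4f}, keep added variable: {keep_variable}")
```

In this sketch, forward selection would stop at the first added variable for which `keep_variable` is false.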
Recognizing that the performance of the models also depends on the probability threshold that defines positive and negative cases for a specific predicted probability, we assessed the top performing models at two thresholds: 50% and 20% predicted probabilities of the outcome [20].While the arbitrary default to define a positive or true case with logistic models is typically 50% or 0.5, we also assessed the models at 20% to determine the effect on sensitivity, specificity, and positive predictive value.
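The effect of the decision threshold can be sketched with a small Python example. The outcomes and predicted probabilities below are made up for illustration and are not study data. Treating a predicted probability equal to the threshold as a positive case is also an assumption, since the text does not specify the tie rule.

```python
def threshold_metrics(y_true, y_prob, threshold):
    """Sensitivity, specificity, and positive predictive value when a case is
    called positive if its predicted probability is >= threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted_positive = p >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif not predicted_positive and y == 0:
            tn += 1
        else:
            fn += 1

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
    }


# Illustrative outcomes (1 = event) and predicted probabilities.
outcomes = [1, 1, 1, 0, 0, 0, 0, 0]
probabilities = [0.70, 0.30, 0.10, 0.60, 0.25, 0.10, 0.05, 0.02]

at_50 = threshold_metrics(outcomes, probabilities, 0.50)
at_20 = threshold_metrics(outcomes, probabilities, 0.20)
print("50% threshold:", at_50)
print("20% threshold:", at_20)
```

On this toy data, lowering the threshold from 50% to 20% raises sensitivity and lowers specificity, which is the trade-off the two thresholds are meant to expose.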
For AECOPD identification, data were split into 11 folds, and 10-fold cross-validation was performed with one fold kept as a hold-out data set. This hold-out data set was used for 'pseudo-external validation' [21,22]. The hold-out data set sample size was selected to achieve a power of 80% to detect a 3% difference in sensitivity using McNemar's test [23].
For 30-day readmission risk prediction, the sample size was much smaller, and there was not sufficient data for a hold-out data set. An 8-fold cross-validation was used for model building, producing only internally validated performance metrics on all four models. We compared the performance of readmission prediction to the HOSPITAL score using C-statistics [7]. The HOSPITAL score is a validated tool for predicting 30-day potentially avoidable readmissions, consisting of the following predictors: Hemoglobin, discharge from an Oncology service, Sodium level, Procedure during the index admission, Index Type of admission (urgent), number of Admissions during the last 12 months, and Length of stay.
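One possible reading of the 11-fold scheme is sketched below in Python: the admissions are shuffled and divided into 11 parts, one part is set aside as the hold-out set, and the remaining 10 parts drive the cross-validation. The function names, the random seed, and the use of simple unstratified shuffling are assumptions; the text does not state how folds were assigned.

```python
import random


def split_holdout_and_folds(n_rows, n_parts=11, seed=0):
    """Shuffle row indices, cut them into n_parts near-equal parts, and return
    the first part as the hold-out set and the rest as cross-validation folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    parts = [indices[i::n_parts] for i in range(n_parts)]
    return parts[0], parts[1:]


def cross_validation_rounds(cv_folds):
    """Yield (training, validation) index lists, one round per fold.
    The hold-out set never appears in any round."""
    for k, validation in enumerate(cv_folds):
        training = [i for j, fold in enumerate(cv_folds) if j != k for i in fold]
        yield training, validation


holdout, folds = split_holdout_and_folds(64609)  # cohort size from the Results
rounds = list(cross_validation_rounds(folds))
print(len(holdout), len(folds), len(rounds))
```

Each model would be fit on `training` and scored on `validation` in every round, with the hold-out rows scored once at the end.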

Results
There were 64,609 patient admissions to the General Internal Medicine ward between 2012 and 2018 at Toronto General Hospital and Toronto Western Hospital (Figure 1). Of those, 3,620 (5.6%) were admitted for an AECOPD. From this subset, 3,372 (93.1%) were alive at discharge. Of these discharges, 518 (15.4%) had readmissions to the hospital within 30 days of discharge (Figure 1).
With respect to the 'Cohort to identify patients with COPD exacerbations', the mean age was 66 years with an approximately equal number of males and females (Table 1). Compared to those who did not have an admission coded as an AECOPD, patients who had an admission coded as an AECOPD tended to be older and were more likely to be male, to have a past diagnosis of COPD, and to have a history of heart failure. For admissions eventually coded as an AECOPD, the clinician provided a diagnosis of a COPD exacerbation at the time of admission in 49%. The second most common admitting diagnosis for these patients eventually coded as AECOPD was pneumonia (23%).
With respect to the 'Cohort to predict 30-day readmissions', patients who were readmitted within 30 days after an admission for an AECOPD were more likely to be given a designation of 'discharged against medical advice', to have a shorter length of stay, and to have a higher heart rate and respiratory rate on admission (Table 2).
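The cohort percentages above follow directly from the reported counts, as the short Python check below shows. Only the four counts come from the text; the variable names and the derived number of admissions not discharged alive are illustrative.

```python
ALL_ADMISSIONS = 64609        # General Internal Medicine admissions, 2012 to 2018
AECOPD_ADMISSIONS = 3620      # admissions coded as AECOPD
ALIVE_AT_DISCHARGE = 3372     # AECOPD admissions alive at discharge
READMITTED_30_DAYS = 518      # readmitted within 30 days of discharge


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(pct(AECOPD_ADMISSIONS, ALL_ADMISSIONS))      # share of admissions coded AECOPD
print(pct(ALIVE_AT_DISCHARGE, AECOPD_ADMISSIONS))  # share alive at discharge
print(pct(READMITTED_30_DAYS, ALIVE_AT_DISCHARGE)) # 30-day readmission rate

# Derived, not stated in the text: AECOPD admissions not alive at discharge.
not_discharged_alive = AECOPD_ADMISSIONS - ALIVE_AT_DISCHARGE
print(not_discharged_alive)
```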

Model variables and performance
For identification of AECOPD, after elimination of variables for missing data, there were a total of 71 variables (Tables S4-S6). For the logistic regression model, when variables were added in descending order of importance, there was no further significant change in performance after inclusion of the fourth variable (Figure 2). The four variables that remained in the logistic regression model were all related to medications ordered for the patient within 72 h of admission: 1) 'Prednisone', 2) a long-acting anticholinergic, 3) a short-acting anticholinergic, and 4) 'pneumonia antibiotics.' For all four variables, the presence of the medication order was associated with increased likelihood of an AECOPD. When comparing the C-statistic between the models, the top-performing model was LASSO (Table 3), with a statistically significant improvement compared with logistic regression (p < 0.001), recursive partitioning (p < 0.001), and linear discriminant analysis (p = 0.02). Comparing the top models of LASSO and the four-variable logistic regression at different thresholds (50% vs 20%), LASSO had significantly better specificity at 50% (99.1% vs 98.8%) and sensitivity at 20% (76.6% vs 70.4%), but worse specificity at 20% (97.1% vs 97.8%) (Table 4, Table S7).

Model variables and performance
For the 30-day risk of readmission after an admission coded as AECOPD, a total of 82 variables were used after elimination of variables for missing data. There were no further significant changes to the logistic regression model after the second variable (Figure 3). The top two variables were the number of COPD admissions in the last year and the number of non-COPD admissions in the last year. For assessing the performance of logistic regression models, two models were selected: the first using the above two variables and a second, one-variable model using the total number of admissions in the last year. The highest-performing models were logistic regression (1 and 2 variable), LDA, and LASSO. The differences among the 1-variable and 2-variable logistic regression models, LDA, and LASSO were not statistically significant (Tables S11-S13).

Comparing to HOSPITAL score
When the performance of the five models was compared to the HOSPITAL score risk category, both logistic regression models, LASSO, and LDA were all superior to the HOSPITAL score (p < 0.001) (Table 5). The 1-variable readmission prediction model had superior sensitivity compared to the HOSPITAL score risk category models at a 20% threshold (Table 6).

Discussion
In a retrospective study of General Internal Medicine admissions over a 7-year period at two hospital sites, we generated clinical prediction models to identify AECOPDs during hospitalization and to predict all-cause 30-day readmissions after an acute exacerbation from a dataset derived from available EMR data. A LASSO model to identify admissions that were coded as AECOPDs had a sensitivity of 76.6% and a specificity of 97.1% at the 20% decision threshold, and a C-statistic of 0.975 ± 0.004. With respect to 30-day readmission prediction, we found that a logistic regression model with one variable, the number of hospitalizations in the last year, provided the best combination of parsimony and accuracy, with a sensitivity of 20.5%, specificity of 97.2%, and accuracy of 85.4%. Given the low sensitivity of our readmission prediction model, further work is required before it is clinically useful.
In the identification of patients with COPD, we found that medications ordered early in admission were the most important predictors of whether an admission would be later coded as an AECOPD. While this has not been described in the inpatient setting previously, it is similar to previous findings in the outpatient setting that medication information improves the identification of patients with COPD [4].
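The three readmission metrics quoted above can be cross-checked against the cohort counts with a few lines of Python. This is a rough consistency check, not a reanalysis. It assumes the reported sensitivity and specificity can be applied to the whole readmission cohort (518 readmitted of 3,372 discharged), whereas the published figures are cross-validated estimates, so the implied counts are approximate.

```python
DISCHARGED = 3372       # AECOPD admissions alive at discharge
READMITTED = 518        # readmitted within 30 days
NOT_READMITTED = DISCHARGED - READMITTED

SENSITIVITY = 0.205     # reported for the one-variable model
SPECIFICITY = 0.972     # reported for the one-variable model

# Approximate cell counts implied by the reported rates (assumption: the
# rates apply to the full cohort rather than to individual folds).
true_positives = SENSITIVITY * READMITTED
true_negatives = SPECIFICITY * NOT_READMITTED
false_positives = NOT_READMITTED - true_negatives

accuracy = (true_positives + true_negatives) / DISCHARGED
implied_ppv = true_positives / (true_positives + false_positives)

print(f"implied accuracy: {100 * accuracy:.1f}%")
print(f"implied positive predictive value: {100 * implied_ppv:.0f}%")
```

Under this assumption the implied accuracy matches the reported 85.4%. The implied positive predictive value is a derived figure that the text does not report.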
While the clinicians ordering these medications likely have AECOPD as a working diagnosis, the larger interprofessional team may be unaware of it. Instead of relying solely on appropriate documentation or communication of the diagnosis of AECOPD by clinicians, identification through EMR medication data may be a scalable and reliable systematic method to identify patients early in admission. This model could potentially be used to flag patients with AECOPD earlier to the interdisciplinary team. Identification to the larger interprofessional team may enable order sets and care bundles to be implemented for patients with COPD exacerbations. External validation of our model is required, but the current performance suggests it is likely accurate enough to help identify most patients with COPD exacerbations early during hospitalization.
With respect to readmission prediction, the Global Initiative for Chronic Obstructive Lung Disease COPD Report lists previous exacerbations as the best predictor of future exacerbations and notes that worsened airflow obstruction also predicts future exacerbations [24]. A systematic review of prognostic models in COPD found that the most common variables used in the development of prognostic models were age, FEV1, sex, body mass index, smoking, previous exacerbations, previous hospital admissions, the MRC dyspnea scale, BODE index, and Charlson comorbidity index [25]. The review also found that most prognostic models predicted 30-day mortality, whereas there were fewer models for predicting risk of readmission, and these often had worse accuracy than the models developed for predicting mortality. More recently, the PEARL score has been internally and externally validated to predict readmissions and uses five variables (previous admissions, extended MRC dyspnea score, age, right-sided heart failure, and left-sided heart failure) (Table 7) [9]. Unfortunately, three of these variables are not typically available from standard EMRs, as they require asking patients about their dyspnea severity (extended MRC dyspnea score) or require an echocardiogram and clinical judgment (right-sided heart failure and left-sided heart failure). Comparing our models to PEARL, we found that neither age nor history of heart failure was a significant predictor of readmission in our parsimonious logistic regression model, and neither improved the performance of our model. Another prediction model, the RACE scale, had a dominant predictor of Medicare/Medicaid insurance [8]. As a result, it has limited generalizability outside the U.S.
Finally, the CORE score to predict readmission has five predictors: eosinophil count, lung function, triple inhaler therapy, previous hospitalization, and neuromuscular disease. The CORE score had an area under the curve of 0.703 in predicting 1-year hospitalizations [11]. Our study further emphasizes the importance of previous hospitalizations over other predictors and uses readily available EMR data. With a mean specificity of 97%, our readmission prediction model, once validated, could be helpful in identifying some patients at high risk of readmission. However, the model sensitivity of 20.5% means that it will miss a large number of patients, as a negative prediction cannot rule out a high risk of readmission. Further work is needed to develop and validate a model with high sensitivity.
There are limitations to our study. The first is that while we had a large number of admissions in our first cohort, only about 5.6% of our patient population was diagnosed with an AECOPD. Of those discharged, only about 15% had readmissions within 30 days. Although this is in keeping with typical admission patterns in North America [1,26], this imbalance led to an overall bias in our dataset favoring a negative diagnosis of AECOPD and a negative prediction of readmission. The low numbers permitted only pseudo-external validation in the first model development and only cross-validation in the second model development. Readmissions were also only considered if the patient was readmitted to one of our two hospital sites; thus, we are missing readmissions to other facilities. For our readmission prediction, we were unable to compare to other COPD readmission prediction algorithms because some of their variables were not within our EMR. We did compare to the HOSPITAL score, which was designed to predict potentially avoidable 30-day readmissions, whereas our model considers all 30-day readmissions. This may explain the worse performance of the HOSPITAL score, as it targets avoidable readmissions. The coding of COPD exacerbations can also be challenging, and other diagnoses such as pneumonia, influenza, and acute decompensation of heart failure may be co-occurring. We did use a definition of COPD based on ICD-10 codes from the literature [13]. While this definition excluded some ICD-10 codes that are sometimes included with COPD cohorts, these are typically rare and unlikely to affect generalizability. Lastly, the final diagnosis is often a clinical diagnosis; thus, there can be variability in the accuracy of the final coded diagnoses.
Future work should include using a large dataset to validate COPD identification. Once externally validated, this four-variable model could be incorporated within EMRs to identify most patients with acute exacerbations of COPD. This systems approach would ensure that most patients who are eventually coded as having an AECOPD are identified prior to discharge. Similarly, the one- or two-variable model to predict readmissions could be incorporated in EMRs. Yet, further work is needed to improve the model for predicting readmissions, especially to improve the sensitivity. There is a trend for more data, such as income, comorbidities, lung function, and symptom scores, to routinely become coded information in EMRs. These added features may enable further improvements in the accuracy of predicting readmission.

Conclusion
We generated clinical prediction models to identify AECOPDs during hospitalization and to predict all-cause 30-day readmissions after an acute exacerbation from a dataset derived from available EMR data. Further work is needed to improve and externally validate these models.

Figure 1. Patient data flow-chart. The 'Cohort to identify patients with COPD exacerbations' had 64,609 visits. The 'Cohort to predict 30-day readmissions' had 3,372 visits.

Figure 2. Change in performance of logistic regression models to identify AECOPD with stepwise addition of variables of decreasing importance. No significant change in C-statistic after addition of the fourth variable.

Figure 3. Change in performance in the 30-day readmission risk logistic regression models with addition of more variables. No significant change in C-statistic after addition of the second variable.

Table 1. Demographics of cohort to identify admissions coded as COPD exacerbations.

Table 2. Demographics of cohort to predict 30-day readmissions.

Table 3. Performance of 10-fold cross-validation identification of AECOPD models.

Table 4. LASSO and logistic regression models for identification of AECOPD at 20% and 50% decision thresholds.

Table 6. Comparison of 1-variable logistic regression model and HOSPITAL score risk category for 30-day readmission risk at 20% decision threshold.

Table 7. Characteristics of prediction scores to predict readmissions.