An explainable multi-class decision support framework to predict COVID-19 prognosis utilizing biomarkers

Abstract
Millions of lives have been impacted by COVID-19, which has spread rapidly. Several vaccines have been developed to curb the severe prognosis induced by the virus. However, a part of the population (the elderly and patients with coexisting conditions) is still at risk. It is crucial to identify these patients early, since appropriate treatments can then be provided to prevent the onset of severe symptoms such as breathlessness and hypoxia. Hence, this study utilizes machine learning and explainable artificial intelligence (XAI) to predict COVID-19 severity using biochemical, haematological and inflammatory markers. The patients are grouped into three classes: mild, moderate and severe. Four nature-inspired techniques have been utilized to select the best markers. The final stacked model obtained a maximum accuracy of 84%. The models have been demystified using four XAI techniques, including SHapley Additive exPlanations (SHAP) and Local Interpretable Model-Agnostic Explanations (LIME). According to these techniques, lactate dehydrogenase (LDH), albumin, D-Dimer, C-reactive protein (CRP) and lymphocytes were the most important markers. The classifiers can be utilized as a prognostic decision support framework to aid medical personnel in classifying COVID-19 patients.


Introduction
On a global level, the COVID-19 outbreak is a significant health problem. Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) causes this disease, and it has already caused many fatalities (Ioannidis, 2021). Some of the common symptoms of COVID-19 are myalgia, cough, fever, loss of taste, nausea and shortness of breath (Alimohamadi et al., 2020).
Most patients experience mild to moderate symptoms. However, severe health issues such as acute respiratory distress syndrome (ARDS), hypoxia and multi-organ failure have also been reported (Berlin et al., 2020). Cytokine storm is one of the major causes of severe prognosis in COVID-19 patients: the immune system starts attacking the body itself by releasing massive amounts of cytokines (Fajgenbaum et al., 2020). Vaccines were eventually developed and administered to people worldwide. They have been highly helpful in preventing the severe prognosis caused by the disease. However, patients with existing health conditions such as hypertension, diabetes and cancer are still at risk (Ejaz et al., 2020). The elderly population is also vulnerable to this contagious disease (Dadras et al., 2022).
In recent years, the healthcare sector has benefited from various artificial intelligence (AI) applications. Functional screening, decision support, diagnostic tools and early prognostic frameworks have been developed to help healthcare professionals make crucial clinical decisions (Apell & Eriksson, 2023). Explainable artificial intelligence (XAI) applications are also being heavily used in the medical sector to decipher predictions made by machine learning (ML) classifiers (Arrieta et al., 2020). XAI has successfully interpreted various predictions using visual, mathematical and modular techniques.
COVID-19 prognosis can be monitored using several haematological, biochemical and inflammatory markers (Khartabil et al., 2020). D-Dimer is a marker used to identify coagulation disorders, and higher D-Dimer levels have been reported in severe COVID-19 patients (Rostami & Mansouritorghabeh, 2020). C-reactive protein (CRP) levels tend to increase before the onset of severe symptoms, and studies have used CRP to monitor the progress of COVID-19 patients (Mosquera-Sulbaran et al., 2021). Ferritin stores iron, and according to many studies, ferritin levels increase significantly in COVID-19 patients (Cheng et al., 2020). Lactate dehydrogenase (LDH) can be used to check for tissue damage, and elevated LDH levels have been reported in severe coronavirus patients (Wang et al., 2021). The neutrophil-to-lymphocyte ratio (NLR) has been monitored in many COVID-19 studies and can also be considered a crucial prognostic marker (Yang et al., 2020a).
Many studies have combined clinical markers and ML to predict COVID-19 severity. A few of them are reviewed below. Huyut (2023) developed an automated AI system to classify mild and severe COVID-19 patients. The minimum redundancy maximum relevance method was used for feature selection. Twenty-nine attributes were considered, and the locally weighted learning (LWL) algorithm obtained a maximum accuracy of 97.86%. A data mining approach was used to predict severity in COVID-19 patients (Khounraz et al., 2023). The data was collected from Shahid Beheshti University of Medical Sciences, and the random forest obtained a maximum accuracy of 86.45%. A scoring system for COVID-19 severity prediction was developed in another study (Zhang et al., 2023). A total of 2649 patients were considered for model training. The most important attributes were age, gender, clinical typing and pulmonary insufficiency. Solayman et al. (2023) used XAI to predict COVID-19 severity. The hybrid CNN-LSTM obtained a maximum accuracy of 96.34%. Local Interpretable Model-Agnostic Explanations (LIME) was the XAI technique used, and the essential features according to it were headache, sore throat, shortness of breath, fever and cough. Another study used multiple filter, wrapper and embedded feature selection methods to predict COVID-19 severity (Hayet-Otero et al., 2023). According to that study, the most important markers were CRP, respiratory rate, pneumonia, severity index and oxygen levels. Autoencoders were used to predict the COVID-19 prognosis in another study (Mahdavi et al., 2023). A hybrid framework was utilized on the clinical data of 1474 patients, and the elastic net obtained a maximum AUC of 0.913. A COVID-19 prognosis model for Vietnamese patients was developed in another study (Nguyen et al., 2023). Two hundred sixty-one patients were considered, and the random forest model delivered an accuracy of 97%. CRP, IL-6, dyspnea, ferritin and D-Dimer were the most critical markers.
It is evident from the above studies that COVID-19 severity can be predicted using biomarkers and machine learning. This study uses multiple baseline classifiers to categorize COVID-19 patients into mild, moderate and severe classes. The other contributions of this study are as follows:
• Four feature selection techniques, namely the genetic algorithm, differential evolution, the sine cosine algorithm and the whale optimization algorithm, have been compared in this study.
• A custom stack model has been created using eight baseline classifiers to predict COVID-19 severity.
• SHAP (SHapley Additive exPlanations), LIME, Eli5 and Anchor XAI techniques have been utilized to interpret the predictions.
• The results have been analysed from a medical perspective to understand the significance of the best prognostic markers.
The remaining sections are arranged as follows: Section 2 explains the materials and methods. The results are presented in Section 3. In Section 4, the findings are discussed. Section 5 concludes the study.

Dataset description
The research data were obtained from two Manipal hospitals: "Dr TMA Pai Hospital" and "Kasturba Medical College". Ethical clearance was obtained to perform this machine learning analysis. The data were collected from January 2022 to May 2022. Individuals older than 18 years were considered for this research. In total, 899 patients were involved in this study. The number of attributes was 33, including the target variable. Haematological, biochemical and inflammatory markers were included. The target variable categorized each patient as "mild", "moderate" or "severe". All patients were labelled by the doctors themselves. The markers are briefly described in Table 1.

Dataset preprocessing
Data preprocessing is a crucial phase in machine learning. Several techniques, such as imputation of null values, data standardization, variable encoding and data balancing, were considered. The dataset was first subjected to a descriptive statistical analysis. The "Jamovi" application was used for this purpose. The application is open-source and is utilized by data scientists all over the globe.
A few descriptive statistical measures are tabulated in Table 2.
A Pareto chart combines bar and line graphs, where the bars reflect each value and the line denotes the cumulative total (Al-Subehat, 2022). Pareto charts for various markers are shown in Figure 1. The figure shows that lymphocyte and monocyte levels were higher in mild patients. Urea and AST were higher in the severe cohort. Further, CRP, D-Dimer and LDH levels were higher in severe COVID-19 patients.
The dataset contained only one categorical feature, which was gender. One-hot encoding was utilized to encode this parameter (Rodríguez et al., 2018). The entire dataset was also scaled using the standardization method (Ferreira et al., 2019). This step is essential since models can otherwise favour attributes with larger values. The dataset was already balanced. Hence, data balancing was not performed in this study.
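The two preprocessing steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the helper names `one_hot` and `standardize` and the toy values are assumptions for demonstration and are not taken from the study's dataset or code.

```python
from statistics import mean, pstdev

def one_hot(values):
    """Return (categories, rows); each row is a 0/1 indicator list per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

def standardize(column):
    """Z-score a numeric column: subtract the mean, divide by the standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]

# Hypothetical toy records (illustrative values only).
gender = ["M", "F", "F", "M"]
ldh = [220.0, 480.0, 310.0, 650.0]

categories, gender_encoded = one_hot(gender)
ldh_scaled = standardize(ldh)
```

After standardization every marker has zero mean and unit variance, so a marker measured in the hundreds (such as LDH) no longer dominates one measured in single digits.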

Feature selection and a few machine learning terminologies
To obtain good results, it is essential to choose only the necessary features. When a model is trained on relevant data, its predictions are also more trustworthy. In this study, we have used four feature selection methods: differential evolution, the genetic algorithm, the sine cosine algorithm and whale optimization.
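As an illustration of how one of these methods can drive feature selection, the sketch below applies a basic differential evolution loop (the common rand/1/bin variant) to a vector of real values in [0, 1], where a feature is kept if its value exceeds 0.5. This is an assumption-laden toy, not the package or configuration used in the study: the objective `toy_objective`, the set `INFORMATIVE` and all parameter values are hypothetical. In practice the objective would be something like the cross-validated error of a classifier trained on the selected features.

```python
import random

def differential_evolution(objective, dim, pop_size=20, generations=60,
                           mutation=0.8, crossover=0.9, seed=0):
    """Minimize `objective` over [0, 1]^dim with rand/1/bin differential evolution."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    scores = [objective(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals, all different from the target i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one gene comes from the mutant
            trial = []
            for j in range(dim):
                if rng.random() < crossover or j == forced:
                    v = pop[a][j] + mutation * (pop[b][j] - pop[c][j])
                    trial.append(min(1.0, max(0.0, v)))  # clip to [0, 1]
                else:
                    trial.append(pop[i][j])
            trial_score = objective(trial)
            # Greedy selection: keep the trial only if it is no worse.
            if trial_score <= scores[i]:
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]

def to_mask(vector, threshold=0.5):
    """Turn a real-valued vector into a keep/drop decision per feature."""
    return [x > threshold for x in vector]

# Hypothetical stand-in objective: penalize missing an "informative" feature
# heavily and each selected feature lightly.
INFORMATIVE = {0, 2, 5}

def toy_objective(vector):
    mask = to_mask(vector)
    missed = sum(1 for k in INFORMATIVE if not mask[k])
    return missed + 0.1 * sum(mask)

best_vector, best_score = differential_evolution(toy_objective, dim=8)
selected = [k for k, keep in enumerate(to_mask(best_vector)) if keep]
```

The greedy replacement step means a candidate's score never gets worse from one generation to the next, which is why the best solution is simply the candidate with the lowest objective value at the end.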
Non-differentiable functions can be optimized using the differential evolution method (Pant et al., 2020). A population of candidate solutions is iteratively improved using mutation, crossover and selection. The candidate with the lowest objective value is the best solution. The genetic algorithm utilizes Charles Darwin's theory of evolution (Mirjalili & Mirjalili, 2019). The fittest individuals are chosen to produce better offspring in this method. The best candidate solution is discovered after a predetermined number of iterations. The sine cosine algorithm was proposed by Mirjalili (2016). It utilizes the trigonometric sine and cosine functions for optimization. The hunting technique of humpback whales is used in the whale optimization algorithm (Mirjalili & Lewis, 2016). The whales use a technique called "bubble-net" feeding to catch their prey, and the same idea is used to find the best solution. All the above techniques can be used for feature selection using a GitHub package (Pavlyshenko, 2018). The features chosen by the four techniques are described in Table 3.
Machine learning algorithms can be used for prediction after model training. This study makes use of supervised learning, where the data is labelled. Eight classifiers, including bagging and boosting models, were utilized in this study. Further, the models were ensembled using the stacking classifier (Pavlyshenko, 2018). Figure 2 describes the construction of the stacked model. Stacking merges multiple weak learners and combines their results with a meta-learner to make the predictions more trustworthy. Stacking is also called "stacked generalization". In this method, each submodel contributes, according to its weight, to a new classifier with more accurate results. The average of the results from 10 runs of each classifier was taken into consideration. The predictions made by the models were demystified using four XAI techniques. The entire process flow is described in Figure 3.
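The stacking idea described above can be sketched in a few lines. The example below is a deliberately small stand-in, not the eight-classifier stack used in the study: the two base learners (`NearestCentroid`, `OneNearestNeighbour`), the lookup-table meta-learner (`MajorityTableMeta`) and the toy data are all hypothetical choices made so that the sketch runs with the standard library alone. The key point it shows is that the meta-learner is trained on out-of-fold predictions of the base learners.

```python
from collections import Counter, defaultdict

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

class NearestCentroid:
    def fit(self, X, y):
        counts, sums = Counter(y), {}
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for j, v in enumerate(row):
                acc[j] += v
        self.centroids = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}
        return self

    def predict(self, X):
        return [min(self.centroids, key=lambda lab: sq_dist(row, self.centroids[lab]))
                for row in X]

class OneNearestNeighbour:
    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, X):
        return [self.y[min(range(len(self.X)), key=lambda i: sq_dist(row, self.X[i]))]
                for row in X]

class MajorityTableMeta:
    """Meta-learner: maps each tuple of base predictions to its most common true label."""
    def fit(self, Z, y):
        table = defaultdict(Counter)
        for z, label in zip(Z, y):
            table[tuple(z)][label] += 1
        self.table = {z: c.most_common(1)[0][0] for z, c in table.items()}
        self.default = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, Z):
        return [self.table.get(tuple(z), self.default) for z in Z]

def fit_stack(base_factories, meta, X, y, n_folds=3):
    n = len(X)
    oof = [[None] * len(base_factories) for _ in range(n)]  # out-of-fold predictions
    for fold in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        for b, make in enumerate(base_factories):
            model = make().fit([X[i] for i in train_idx], [y[i] for i in train_idx])
            for i, p in zip(test_idx, model.predict([X[i] for i in test_idx])):
                oof[i][b] = p
    meta.fit(oof, y)                                         # level-1 training
    bases = [make().fit(X, y) for make in base_factories]    # refit on all data
    return bases, meta

def predict_stack(bases, meta, X):
    base_preds = [m.predict(X) for m in bases]
    return meta.predict(list(zip(*base_preds)))

# Hypothetical, already-scaled toy markers (two features per patient).
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
     [1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
     [2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]
y = ["mild"] * 3 + ["moderate"] * 3 + ["severe"] * 3

bases, meta = fit_stack([NearestCentroid, OneNearestNeighbour], MajorityTableMeta(), X, y)
predictions = predict_stack(bases, meta, [[0.1, 0.1], [1.1, 1.0], [2.1, 2.0]])
```

Training the meta-learner on out-of-fold predictions, rather than on predictions for data the base learners have already seen, keeps it from simply trusting whichever base model overfits the most.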

Results
In this study, multiple baseline classifiers were applied to the attributes chosen by the four nature-inspired algorithms. Table 4 provides a summary of the models' findings. The differential evolution algorithm delivered the most effective results of the four. The catboost model obtained an accuracy of 87%. The lightgbm classifier also performed well, with an accuracy of 86%. The accuracies obtained by the random forest, adaboost, xgboost and stack models were 84% each. The catboost model also obtained precision and recall values of 87% and the lowest Hamming loss of 0.13. The Jaccard score and Matthews correlation coefficient (MCC) obtained by the catboost were 0.76 and 0.80, respectively.
The genetic algorithm performed relatively well. Among all the ML algorithms, the xgboost obtained the best results with an accuracy of 86%. The precision, recall, f1-score, Hamming loss, Jaccard score and MCC obtained were 86%, 86%, 86%, 0.14, 0.75 and 0.78, respectively. The catboost and the random forest obtained accuracies of 83% each. The lightgbm and the stack model obtained accuracies of 81% each.
The whale optimization algorithm performed relatively poorly compared to the other three algorithms. The accuracy, precision, recall, f1-score, Hamming loss, Jaccard score and MCC obtained by the random forest algorithm were 80%, 81%, 80%, 80%, 0.2, 0.66 and 0.70, respectively. The logistic regression delivered an accuracy of 77%. The confusion matrices for the four nature-inspired feature selection algorithms are depicted in Figure 4. It can be seen that the numbers of false positives and false negatives were low. Hence, the models were able to obtain good results. Further, severe COVID-19 cases were more accurately identified than mild and moderate cases. Overall, the random forest, catboost, lightgbm, xgboost and stacked models obtained better results than the other classifiers.
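The metrics reported above can all be derived from a multi-class confusion matrix. The sketch below shows one way to do so; the confusion matrix is hypothetical, and the macro-averaged Jaccard score is an assumption, since the averaging scheme used in the study is not stated. For single-label multi-class problems the Hamming loss equals one minus the accuracy, which is consistent with the reported pair of 87% accuracy and 0.13 Hamming loss.

```python
from math import sqrt

def multiclass_metrics(cm):
    """cm[i][j] = number of samples with true class i that were predicted as class j."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(k))
    true_totals = [sum(cm[i]) for i in range(k)]
    pred_totals = [sum(cm[i][j] for i in range(k)) for j in range(k)]

    accuracy = correct / total
    hamming_loss = 1.0 - accuracy  # holds for single-label classification

    # Per-class Jaccard: TP / (TP + FP + FN), then macro-averaged (assumed).
    jaccard = []
    for i in range(k):
        union = true_totals[i] + pred_totals[i] - cm[i][i]
        jaccard.append(cm[i][i] / union if union else 0.0)
    macro_jaccard = sum(jaccard) / k

    # Multi-class Matthews correlation coefficient.
    num = correct * total - sum(t * p for t, p in zip(true_totals, pred_totals))
    den = sqrt((total ** 2 - sum(p * p for p in pred_totals))
               * (total ** 2 - sum(t * t for t in true_totals)))
    mcc = num / den if den else 0.0

    return {"accuracy": accuracy, "hamming_loss": hamming_loss,
            "jaccard": macro_jaccard, "mcc": mcc}

# Hypothetical confusion matrix (rows: true mild/moderate/severe).
confusion = [[26, 3, 1],
             [3, 25, 2],
             [1, 3, 26]]
metrics = multiclass_metrics(confusion)
```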
Explainable artificial intelligence (XAI) makes the predictions of machine learning algorithms easier to trust (Antoniadi et al., 2021). The impact of the models and their biases can be identified using XAI. It helps define model accuracy, fairness and transparency. In this study, four XAI techniques have been used for interpretation: SHAP, LIME, Eli5 and Anchor. All four XAI techniques were tested on the stack model since it is an ensemble of multiple baseline classifiers.
SHAP utilizes a game-theoretic approach to interpret the models (Younisse et al., 2022). The Shapley value is the average marginal contribution of a feature value across all possible combinations of features. The SHAP summary plots for the four nature-inspired methods are shown in Figure 5. The most important markers for the differential evolution technique are LDH, D-Dimer, lymphocytes, CRP and NLR. The most critical features for the genetic algorithm technique are albumin, LDH, D-Dimer, lymphocytes and respiratory rate. The most important features for the sine cosine algorithm are LDH, albumin, CRP, lymphocytes and basophils. For the whale optimization technique, LDH and lymphocytes were the most valuable attributes.
LIME was developed in 2016 by Marco Ribeiro (Gramegna & Giudici, 2021). LIME can be used to demystify each individual prediction. It utilizes a linear ridge regression model for interpretation. The LIME explainers for all four nature-inspired techniques are shown in Figure 6.
Eli5 is a Python package that makes it possible to use a uniform API to visualize and debug different classifiers (Kuzlu et al., 2020). It offers a mechanism for explaining black-box models and supports numerous algorithms. The Eli5 predictions for all four feature selection techniques are shown in Figure 7.
Eli5 predicts essential attributes for the mild, moderate and severe COVID-19 classes. When using differential evolution, the most critical markers which predicted mild COVID-19 were LDH, D-Dimer and NLR. LDH, CRP and protein were crucial in predicting moderate COVID-19 cases. D-Dimer, CRP and LDH were the essential markers in predicting severe COVID-19 cases. When the genetic algorithm was used, albumin, D-Dimer and LDH helped predict mild cases. LDH, albumin and D-Dimer were crucial in predicting moderate COVID-19 cases. LDH, D-Dimer and albumin were critical in determining severe cases. When using the sine cosine algorithm, neutrophils, urea and albumin were important in predicting mild COVID-19 cases. Urea, albumin and lymphocytes helped predict moderate cases. LDH, lymphocytes and albumin were significant in predicting severe cases. The whale optimization chose LDH, D. Bilirubin and lymphocytes to predict all the COVID-19 classes. Eli5 also considers the bias parameter during interpretation.
Anchors use rules and conditions to explain model predictions (Jouis et al., 2021). They also use graph search algorithms and reinforcement learning techniques to find these rules. Anchors use two metrics to measure the quality of an interpretation: precision and coverage. The anchor predictions for all four feature selection techniques are listed in Table 5. The most essential features for the differential evolution technique were CRP, lymphocytes, LDH and NLR. For the genetic algorithm, the most crucial features were albumin, D-Dimer, lymphocytes and LDH. For the sine cosine algorithm, the critical features were basophils, LDH and albumin. For the whale optimization technique, lymphocytes and LDH were considered significant.
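The two anchor quality metrics mentioned above can be illustrated with a short sketch. It is a simplification under stated assumptions: real anchor explainers estimate precision over perturbed samples drawn around the instance being explained, whereas here a fixed sample list is used, and the classifier `toy_model`, the rule `ldh_rule` and the marker values are all hypothetical.

```python
def anchor_quality(rule, model, samples, target_label):
    """Return (precision, coverage) of a candidate anchor rule.

    coverage  = share of samples to which the rule applies
    precision = share of covered samples the model assigns to target_label
    """
    covered = [s for s in samples if rule(s)]
    coverage = len(covered) / len(samples)
    if not covered:
        return 0.0, 0.0
    hits = sum(1 for s in covered if model(s) == target_label)
    return hits / len(covered), coverage

# Hypothetical threshold classifier standing in for the stacked model.
def toy_model(sample):
    if sample["ldh"] > 400 and sample["crp"] > 40:
        return "severe"
    if sample["ldh"] < 250:
        return "mild"
    return "moderate"

# Candidate anchor: "LDH > 400 implies severe".
def ldh_rule(sample):
    return sample["ldh"] > 400

samples = [
    {"ldh": 520, "crp": 80}, {"ldh": 450, "crp": 30},
    {"ldh": 610, "crp": 95}, {"ldh": 480, "crp": 60},
    {"ldh": 230, "crp": 10}, {"ldh": 300, "crp": 55},
    {"ldh": 200, "crp": 5},  {"ldh": 350, "crp": 20},
]

precision, coverage = anchor_quality(ldh_rule, toy_model, samples, "severe")
```

A rule with high precision but very low coverage explains only a handful of patients, so the two numbers are best read together.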
The most crucial characteristics, according to the four XAI methods, are LDH, D-Dimer, lymphocytes, albumin and CRP. The above markers can be monitored carefully to predict the COVID-19 prognosis in advance.
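The Shapley value that underlies SHAP, described earlier as the average marginal contribution of a feature, can be computed exactly for a small model by enumerating every feature ordering. The sketch below does this under simplifying assumptions: "absent" features are replaced by a fixed baseline value, and `toy_model`, its coefficients and the input values are hypothetical. The SHAP library itself relies on approximations and model-specific algorithms rather than brute-force enumeration.

```python
from itertools import permutations
from math import factorial

def shapley_values(predict, instance, baseline):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    n = len(instance)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)          # start with every feature "absent"
        previous = predict(current)
        for j in order:
            current[j] = instance[j]      # switch feature j to its real value
            value = predict(current)
            phi[j] += value - previous    # marginal contribution of feature j
            previous = value
    return [p / factorial(n) for p in phi]

# Hypothetical severity score with an LDH x D-Dimer interaction term.
def toy_model(x):
    ldh, d_dimer, lymphocytes = x
    return 0.5 * ldh + 0.3 * d_dimer - 0.4 * lymphocytes + 0.2 * ldh * d_dimer

instance = [1.0, 2.0, -1.0]   # illustrative standardized marker values
baseline = [0.0, 0.0, 0.0]    # reference patient (all markers at the mean)

phi = shapley_values(toy_model, instance, baseline)
```

The contributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is split evenly between the two markers involved.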

Discussion
This study utilized machine learning techniques and XAI to predict and understand the COVID-19 prognosis. A custom stacked model was developed using eight baseline classifiers. Differential evolution, the sine cosine algorithm, whale optimization and the genetic algorithm were used for feature selection. Among the four, the differential evolution algorithm obtained the best results. Among the nine classifiers, the catboost performed best, with an accuracy of 87%. According to XAI, the critical markers were LDH, D-Dimer, lymphocytes, CRP, albumin, NLR and respiratory rate.
LDH levels tend to increase in severe COVID-19 patients (Wang et al., 2021). Similar observations were made in this research. LDH has already been used as a COVID-19 marker in many studies (Khartabil et al., 2020; Wang et al., 2021). D-Dimer levels were higher in the severe cohort in this research. This result agrees with other studies and follows a similar trend (Khartabil et al., 2020; Rostami & Mansouritorghabeh, 2020). Many existing studies have observed an increase in CRP levels among severe COVID-19 patients (Khartabil et al., 2020; Mosquera-Sulbaran et al., 2021). A similar observation was made in this research too. Albumin levels decreased in the severe COVID-19 cohort, which agrees with many published studies (Aziz et al., 2020; Huang et al., 2020). NLR is an important COVID-19 prognostic marker. NLR levels have been monitored to predict the prognosis of COVID-19 patients (Yang et al., 2020a, 2020b). The severe COVID-19 group has been found to have elevated NLR levels. The respiratory rate increased in severe COVID-19 patients. Many patients struggle to breathe during the onset of severe COVID-19 symptoms (Zheng et al., 2022). These were some of the inferences obtained from this study. Several studies have used clinical markers to predict COVID-19 severity.
Weizman et al. (2022) developed a scoring system to predict COVID-19 prognosis. It was a multicenter study, and 11 clinical markers were included. The authors claim that their risk scores are better than the standard existing risk scores. Another study used machine learning tools to predict prognostic markers (Johm Jaime et al., 2023). Patient data was obtained from three Colombian hospitals. The logistic regression model delivered an accuracy of 72.9%. Ustebay et al. (2023) compared ML algorithms to predict COVID-19. Two datasets were considered in that study. All prognostic models obtained an AUC above 0.92. According to the study, the most important features were CRP, lymphocytes, serum calcium and albumin. COVID-19 prognosis was predicted using machine learning and laboratory tests in another study (Mlambo et al., 2022). Six algorithms were used, and the random forest classifier obtained the maximum accuracy. Alshanbari et al. (2022) used biochemical markers to predict COVID-19 severity. The patient details were obtained from King Fahad Hospital, Saudi Arabia. The logistic regression model obtained an accuracy of 85%.
Our study has a few limitations too. Though biomarkers are efficient in predicting the prognosis, they have their own limitations (Wang et al., 2021; Weizman et al., 2022). A few biomarkers could increase or decrease drastically due to other factors (Alshanbari et al., 2022; Mlambo et al., 2022). The data collected in this study was from India alone. Data from other countries was not explored. The number of patients was also only 899. The use of ML and XAI techniques can increase space and time complexity, which can be a burden to the users. In our study, real-time testing was not considered. Model validation was also not performed by doctors. Model validation is crucial to find the most important biomarkers. Deep learning models were not considered in this study. Neural networks often perform better than traditional ML models when the dataset is very large. Cloud-based frameworks are being widely used to save data and models. However, they have not been used in this research. This research also did not make use of unsupervised and reinforcement learning. Other clinical modalities such as imaging, genomic and voice-based analysis were not combined in this study. The study also does not discuss the generalizability of the models to broader populations or regions, which is essential for the practical application of the classifiers in different healthcare settings.
In the future, we plan to collect data from multiple geographical regions to validate the generalizability of the models. The sample size could be increased to enhance the robustness of the prediction models. Deep neural networks could be evaluated to explore their potential in handling large datasets and complex patterns, and deep learning and machine learning frameworks could be compared. Cloud-based networks should be designed and implemented to store patient data and models in a scalable and secure manner. The models could also be shared with different healthcare institutions. Information from other modalities such as genomics, medical imaging (chest X-rays and computed tomography scans) and additional clinical data could be integrated, and other modalities that could improve accuracy can be investigated. Unsupervised algorithms such as clustering and dimensionality reduction could be added to extract additional insights from the data. The use of reinforcement learning algorithms to develop personalized treatment policies could also be investigated. We could develop a user-friendly interface for healthcare professionals that allows for patient data input and provides clear predictions and interpretations of the model decisions. The practical application could be made more accessible and inexpensive. Other health indicators such as blood pressure, oxygen saturation and heart rate could be included as input features too. Other XAI models could be used to make the predictions even more interpretable and useful. The model's ability to predict the long-term course of the disease, including recovery after the acute phase of infection, could be investigated too. The clinical and economic impact of implementing these models, including cost reductions and improvements in patient outcomes, could be evaluated. Finally, data security and privacy measures could be strengthened to protect patient information in compliance with regulations and security standards.

Conclusion
A multi-class machine learning framework is developed in this study to classify patient prognosis into mild, moderate and severe COVID-19. The COVID-19 data was collected from two hospitals in Manipal. Thirty-two demographic and clinical features were considered initially. The most important features were chosen using four feature selection techniques: differential evolution, the genetic algorithm, the sine cosine algorithm and whale optimization. Differential evolution proved to be superior to the other methods. The catboost performed best, with a maximum accuracy of 87%. Further, the predictions were demystified using four XAI techniques. According to the XAI techniques, the most critical attributes are LDH, D-Dimer, CRP and lymphocytes. This decision support framework can be utilized in healthcare facilities as a preliminary screening tool to predict the prognosis of a patient in advance. Vulnerable patients can then be provided with appropriate care and treatments.

Funding
The article will be funded by Manipal Academy of Higher Education.
chose LDH and lymphocytes. The whale optimization algorithm chose only four features. The differential evolution and sine cosine algorithms chose 12 features each.
Figure 6(a) indicates a mild COVID-19 prediction. Lymphocyte count is the marker which pulls the prediction towards mild COVID-19. Figure 6(b,c) also indicate mild COVID-19 predictions. Markers such as lymphocytes, albumin, CRP and basophils were crucial in making these predictions. In Figure 6(d), LIME interprets a moderate COVID-19 prediction. Markers such as LDH, lymphocytes and respiratory rate were crucial in this prediction.

Figure 3. Machine learning pipeline used in this study.