The Deep Learning ResNet101 and Ensemble XGBoost Algorithm with Hyperparameters Optimization Accurately Predict the Lung Cancer

ABSTRACT Lung cancer is among the most common cancers and a leading cause of cancer death, with one of the lowest survival rates, owing to the lack of efficient diagnostic tools. Researchers are currently devising artificial intelligence based tools to improve diagnostic capabilities. Traditional machine learning (ML) requires hand-crafted features to train the algorithms, and extracting the most relevant features remains a challenging task in the field of image processing. We first extracted gray level co-occurrence matrix (GLCM) texture features and fed them to traditional ML algorithms, namely k-nearest neighbor (KNN) and support vector machine (SVM). The SVM yielded an accuracy of 83.0%, whereas KNN produced an accuracy of 97.0%. We then optimized and employed the ensemble extreme gradient boosting (XGBoost) algorithm, which improved the detection performance, with precision, recall, and accuracy of 100%. We also optimized and employed the deep learning ResNet101 model to distinguish small cell lung cancer from non-small cell lung cancer and obtained 100% on the same evaluation measures. The results reveal that the proposed approach is more robust than traditional ML algorithms. Based on these results, the proposed methodology can be helpful for the early detection and treatment of lung cancer and for building better diagnostic systems.


Introduction
According to recent lung cancer statistics for 2022 (Siegel et al. 2022), about 2.36 million new cases of lung cancer were expected to be diagnosed, of which 85% are non-small cell lung cancer (NSCLC). NSCLC is treated using stereotactic body radiotherapy (SBRT) and radiofrequency (RF). Lung cancer has two subtypes: small cell lung carcinoma (SCLC) and NSCLC. The two types differ in how they spread and how they are treated. NSCLC grows slowly, whereas SCLC grows rapidly, is related to smoking, spreads quickly through the whole body, and forms tumors. Lung cancer deaths are related to the number of cigarettes smoked (Moldovanu, de Koning, and van der Aalst 2021).
SCLC is an aggressive type of lung cancer that is directly linked to cigarette smoking. Increasing evidence links autocrine growth loops, proto-oncogenes, and tumor-suppressor genes to its development. Therefore, SCLC requires different methods of treatment and diagnosis than NSCLC. Early detection of NSCLC can be very helpful, with a survival rate of 35%-85% depending on the stage and tumor type. However, most tumors are detected late, so the overall 5-year survival rate for NSCLC remains only 16%. Chemotherapy is utilized for SCLC; it provokes a 60% response in NSCLC patients. Nevertheless, the cancer returns within a few months, resulting in an abysmal overall 5-year survival rate of 6%. Excessive tobacco use and smoking cause around 90% of lung cancer cases. Other factors that may lead to lung cancer include air pollution exposure, radon gas, asbestos, and chronic infections. In addition, both inherited and acquired mechanisms of lung cancer susceptibility have been suggested. Radiation therapy, surgery, targeted therapy, and chemotherapy are also choices for lung cancer treatment (Zang et al. 2021).
After radiation and X-rays were discovered at the end of the 19th century, physicians used them to examine the human body, and approaches to non-surgical cancer treatment followed. Hospital radiologists and surgeons started working together, and with the use of computers, significant cancer data began to accumulate in 1968. Considerable effort has been spent in this field over the past 50 years. Several tests or imaging modalities are typically used to evaluate the stage of lung cancer. Computed tomography (CT) produces detailed pictures of the anatomy and the lung tumor, which are crucial for treatment planning. For cancer staging, CT scans of the chest are essential, and abdominal CT scans are used to locate secondaries and metastases (Kemps et al. 2021). A positron emission tomography (PET) scan uses radioactive sugar, because cancer cells consume sugar rapidly, and is essential for identifying spread to lymph nodes or other organs (Zhang et al. 2021). One of the best currently available scans is magnetic resonance imaging (MRI), which is used for scanning the brain. Scanning of the brain may be necessary to determine whether the tumor has propagated to the brain (Hamdeni et al. 2018).
X-rays are used to gain functional and structural details about the human body. The radiation dose reduces the quality of the CT image. Experts can describe and analyze the findings of various machine learning (ML) techniques that are useful for lung cancer prognosis and prediction (Kourou et al. 2015). ML techniques generally help to improve the performance or predictive precision of most predictions, specifically when compared with expert-based systems or traditional statistical methods. Computer-aided diagnosis systems have been developed for the characterization and identification of a variety of lesions in the field of lung cancer diagnosis. Such a system overcomes the challenge of extracting full features from segmented suspicious regions in X-ray images of the lungs, and these features can be used directly to classify lung tumors as benign or malignant (Mridha et al. 2022). Imaging modalities are widely used to assess the stage of lung cancer: CT scans of the chest and abdomen provide accurate images of the lung tumor and anatomy and are useful in care planning. CT scans of the chest are critical for cancer staging, and CT scans of the abdomen are used to identify secondary tumors and metastases. Since cancer cells use sugar quickly, a PET scan that uses radioactive sugar is useful for detecting spread to lymph nodes or other organs.
Recently, ML algorithms have found many applications in medical diagnostic systems and in improving the prediction of lung disease. Standard toolkits for computing feature importance have recently been developed, as utilized by Liang et al. (2022). More recently, Shahbandegan et al. (2022) developed ML algorithms to predict which patients need a CT exam in the emergency department (ED). The proposed approach can help the ED allocate resources for prompt action, maintain patient flow, and reduce overcrowding. Binson, Subramoniam, and Mathew (2021a, 2021b) developed an electronic nose (e-nose) to distinguish chronic obstructive pulmonary disease (COPD) patients from healthy subjects by recognizing the amount of volatile organic compounds present. Freitas et al. (2021) used liquid biopsy for the diagnosis and detection of lung cancer, focusing on circulating cell-free DNA, tumor cells, tumor-derived exosomes, micro-RNAs, and tumor-educated platelets, for its applicability in future clinical practice. Different combinations of biomarkers, along with several other computational tools, can provide very good diagnosis and prognosis of lung cancer. Lener et al. (2021) used the blood cadmium level as a marker to detect lung cancer, especially in former smokers. Hsu et al. (2021) utilized ML methods with feature extraction and selection to detect lung cancer from electronic healthcare records in order to improve the diagnosis and treatment of individuals. Pradhan and Chawla (2020) summarized lung cancer datasets and ML techniques for improving lung cancer prediction in the clinical internet of things (IoT) environment. The proposed methods can be helpful for early diagnosis, detecting lung cancer patients precisely and in time. The authors (Binson, Subramoniam, and Mathew 2021c; Pradhan and Chawla 2020) developed an e-nose system that analyzes exhaled breath to separate healthy subjects from patients suffering from COPD, lung cancer, and asthma using SVM, XGBoost, and ensemble methods. ML methods have been used successfully for decades in the analysis of different disease pathologies, such as brain levels of polyamines and histamine under various extreme exposures, as utilized by Goroshinskaia et al. (1987).
Figure 1 shows a schematic diagram detailing the flow of our model. In the first step, the lung dataset was fed as input and pre-processing was applied to the input images, such as data cleaning, augmentation, reduction, interpolation, and feature engineering. In the second phase, the training and testing data were formed using 10-fold cross validation. In the third phase, the machine learning algorithms, along with deep learning methods, were applied with their hyperparameters optimized using the grid search method. For ML, we first extracted the gray level co-occurrence matrix (GLCM) and Haralick texture features, which are standard and widely used texture features for medical imaging diagnosis, and fed them to the traditional ML algorithms SVM and k-nearest neighbor (KNN). We then fed the GLCM features to XGBoost with and without hyperparameter optimization. Finally, we applied the deep learning ResNet101 method with a transfer learning approach, optimizing the hyperparameters with the grid search method.
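The 10-fold cross validation used in the second phase can be illustrated with a minimal, standard-library sketch. The function name `k_fold_splits`, the shuffling seed, and the way folds are cut are assumptions made for illustration; the paper does not state how its folds were constructed. The value 945 is the image count reported for the LCA dataset.

```python
import random


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation.

    Hypothetical helper: the fold construction is an assumption, not the
    authors' procedure.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i, test in enumerate(folds):
        # Every fold except fold i is used for training.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_splits(945, k=10))
print(len(splits), sorted({len(test) for _, test in splits}))
```

Each sample appears in exactly one test fold, so every image is used for testing once and for training nine times.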

Dataset
In this study, the first dataset was the lung cancer CT image dataset publicly provided by the Lung Cancer Alliance (LCA) and previously utilized by Hussain et al. (2019). LCA is a nonprofit organization that provides advocacy and support exclusively to patients suffering from, or at risk of, lung cancer. The database was in DICOM format and covered 76 patients with a total of 945 images, of which 568 belong to SCLC and 377 to NSCLC subjects.

Pre-Processing
The following image pre-processing methods were applied to the lung cancer images.

Image Resize
We used "inter area", a type of interpolation that resizes images in a way that produces smooth, accurate results. In computer vision, interpolation is a method of estimating the value of a pixel in an image based on the values of surrounding pixels (Hashemzadeh, Asheghi, and Farajzadeh 2019). The "inter area" option specifies that the interpolation should be performed using the area-based method, in which the value of each output pixel is calculated as the average value of the pixels in the area surrounding it. This method is typically used when the goal is to reduce the size of the image by reducing the number of pixels. Because the area-based method considers the values of multiple pixels, it can produce smoother, more accurate results than other interpolation methods.

Data Augmentation
Data augmentation is a method of creating additional data samples from existing ones in order to artificially increase the size of a dataset (Shorten and Khoshgoftaar 2019). This can be useful when training ML models, especially when the available dataset is small or not representative of the problem being addressed. There are various techniques for data augmentation, including adding noise to the data, applying transformations to the data, and generating synthetic data by combining or modifying existing samples. Data augmentation can improve the generalization of a model by introducing variations in the training data that the model may encounter in the real world, and it can also help prevent overfitting.
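As an illustration of the techniques listed above (transformations and added noise), the following standard-library sketch produces several variants of one grayscale image. The specific transformations, the noise level `sigma`, and all function names are assumptions chosen for illustration; the paper does not specify which augmentations were applied.

```python
import random


def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Mirror the image top to bottom."""
    return [row[:] for row in image[::-1]]


def rotate_90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*image[::-1])]


def add_gaussian_noise(image, sigma=5.0, seed=None, lo=0, hi=255):
    """Add zero-mean Gaussian noise and clamp to the valid intensity range."""
    rng = random.Random(seed)
    return [[min(hi, max(lo, px + rng.gauss(0.0, sigma))) for px in row]
            for row in image]


def augment(image, sigma=5.0, seed=0):
    """Return a list of augmented variants of one image."""
    return [flip_horizontal(image), flip_vertical(image),
            rotate_90(image), add_gaussian_noise(image, sigma, seed)]


variants = augment([[10, 20], [30, 40]])
print(len(variants))
```

In a classification setting each variant keeps the label of the image it was derived from, which is how the training set grows without new acquisitions.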

Hyperparameters Optimization
The learning process is a crucial aspect of any model. Before this process begins, certain parameters that have a direct impact on the model and are external to it, known as hyperparameters, need to be set (Bengio 2000). XGBoost and ResNet101 each have their own set of hyperparameters that can be adjusted or fine-tuned to improve performance. Some particularly significant hyperparameters are discussed below: (i) max_depth: as the depth of the model increases, its performance also improves, but so do the complexity and the risk of overfitting. The max_depth value is typically set to 6, and it must be a positive integer. (ii) The learning rate, a key hyperparameter, helps to reduce error and better approximate the model's objectives. A higher learning rate may not necessarily lead to optimal results, while a lower rate may take longer to process but offers a higher probability of optimal results. The learning rate is set between 0 and 1, with a common value being 0.3. (iii) n_estimators: XGBoost learns a total of n_estimators trees during the boosting rounds of the learning process. (iv) colsample_bytree is set to a value between 0 and 1, with a typical setting of 1. It determines the fraction of columns that are randomly selected for each tree during the training process. (v) subsample is a value that ranges between 0 and 1, with a default value of 1. It determines the fraction of observations that will be used for each tree during the learning process. Setting a low value (close to 0) may help prevent overfitting, but there is a risk of underfitting.

Tools, Languages and Libraries
In this study, we ran the ML XGBoost and deep learning ResNet101 algorithms on Google Colab and optimized the hyperparameters of these algorithms using the grid search method. The libraries for each model are reflected below:

Grid Search
Grid search is a method for hyperparameter optimization in ML (Bao and Liu 2006). It involves specifying a grid of hyperparameter values, and then training and evaluating a model for each combination of these values. The goal is to find the combination of values that results in the best performance of the model.
Here is the procedure for performing a grid search:
(1) Define a grid of hyperparameter values to search over. This can be done by specifying a list of values for each hyperparameter.
(2) Train and evaluate a model for each combination of hyperparameter values. This can be done using a loop over the values in the grid.
(3) Select the combination of hyperparameter values that resulted in the best performance of the model, as measured by a performance metric such as accuracy or F1-score.
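The three steps of the procedure can be sketched with the standard library alone. The grid values and the scoring function `toy_score` are hypothetical placeholders: in practice the scoring function would train the model (for example XGBoost) with the given hyperparameters and return a cross-validated accuracy or F1-score. The grid actually searched in the study is not given in the text.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Exhaustively evaluate every combination in param_grid.

    param_grid maps each hyperparameter name to a list of candidate values;
    evaluate(params) must return a score where higher is better.
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    # Step 2: loop over every combination of values in the grid.
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        # Step 3: keep the best-scoring combination seen so far.
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Step 1: define the grid (illustrative values only).
grid = {
    "max_depth": [3, 6, 9],
    "learning_rate": [0.01, 0.1, 0.3],
    "n_estimators": [100, 200],
}


def toy_score(params):
    """Stand-in for 'train and evaluate'; peaks at one known combination."""
    return -(abs(params["max_depth"] - 6)
             + 10 * abs(params["learning_rate"] - 0.1)
             + abs(params["n_estimators"] - 200) / 100)


best, score = grid_search(grid, toy_score)
print(best, score)
```

The cost grows with the product of the list lengths (here 3 x 3 x 2 = 18 evaluations), which is why grids are usually kept small.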

Feature Extraction
In ML, feature engineering is highly desirable and requires knowledge specific to the problem. Researchers have computed different imaging-related features to capture the most relevant information. Rathore, Hussain, and Khan (2015) computed hybrid features to detect colon cancer (Hussain et al. 2018, 2019; N. Rathore et al. 2014; Rathore, Hussain, and Khan 2015). The gray level co-occurrence matrix (GLCM) features are an extended version of texture features that further improve detection performance; therefore, in this research, the GLCM features were computed from the lung cancer imaging datasets.

Gray Level Co-Occurrence Matrix (GLCM) Features
The GLCM is the most widely used second-order statistical tool for extracting relevant information from an image. GLCM features capture the texture properties and the spatial relationships between image pixels. The GLCM was computed in four directions, 0°, 45°, 90°, and 135°, as detailed in Kairuddin and Mahmud (2017). We computed contrast, autocorrelation, cluster prominence, correlation, cluster shade, energy, dissimilarity, homogeneity, entropy, maximum probability, sum average, sum of squares, sum variance, difference variance, sum entropy, difference entropy, information measures of correlation I and II, inverse difference normalized, and inverse difference moment normalized (Kairuddin and Mahmud 2017).
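To make the construction concrete, the sketch below builds a normalized, symmetric GLCM for one of the four directions and derives four of the listed features. It is a simplified illustration, assuming pixel intensities already quantized to integers in the range 0 to levels - 1 and a pixel distance of 1; the function names are hypothetical and the paper's exact settings (number of gray levels, distance, symmetry) are not stated. Energy is taken here as the sum of squared entries (the angular second moment); some libraries report its square root instead.

```python
import math

# Pixel offsets (dx, dy) for the four standard directions; y grows downward.
OFFSETS = {0: (1, 0), 45: (1, -1), 90: (0, -1), 135: (-1, -1)}


def glcm(image, angle, levels):
    """Return the normalized symmetric co-occurrence matrix for one direction."""
    dx, dy = OFFSETS[angle]
    rows, cols = len(image), len(image[0])
    counts = [[0] * levels for _ in range(levels)]
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1
                counts[j][i] += 1  # count each pair in both orders
    total = sum(sum(row) for row in counts)
    if total == 0:
        raise ValueError("image too small for this offset")
    return [[v / total for v in row] for row in counts]


def texture_features(p):
    """Compute a few Haralick-style features from a normalized GLCM."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    return {
        "contrast": sum(v * (i - j) ** 2 for i, j, v in cells),
        "energy": sum(v * v for _, _, v in cells),
        "homogeneity": sum(v / (1 + (i - j) ** 2) for i, j, v in cells),
        "entropy": -sum(v * math.log(v) for _, _, v in cells if v > 0),
    }


image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 2, 2, 2],
         [2, 2, 3, 3]]
print(texture_features(glcm(image, angle=0, levels=4)))
```

Repeating the computation for 45°, 90°, and 135° and concatenating (or averaging) the results gives a direction-aware feature vector per image.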

eXtreme Boosting (XGBoost) Algorithm
Proposed by Chen and Guestrin (2016), XGBoost is a supervised ML algorithm that implements a boosting process to yield accurate models. A predictive model trained on labeled examples is then applied to new, unseen examples. Boosting is an ensemble learning method that builds many models sequentially, where each new model attempts to correct the deficiencies of the preceding one; XGBoost is a boosted tree algorithm of this kind (Friedman 2001). XGBoost extends generalized gradient boosting by including a regularization term to combat overfitting and by supporting arbitrary differentiable loss functions. These properties make XGBoost more robust in improving lung cancer detection performance.
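Gradient boosting divides the optimization problem into two parts: determining the step direction and optimizing the step length. XGBoost instead solves

$$\frac{\partial S\left(y, f^{(m-1)}(x) + f_m(x)\right)}{\partial f_m(x)} = 0 \tag{1}$$

for every x in the data to fix the step directly. Using the second-order Taylor expansion of the loss function, we have

$$S\left(y, f^{(m-1)}(x) + f_m(x)\right) \approx S\left(y, f^{(m-1)}(x)\right) + g_m(x) f_m(x) + \frac{1}{2} h_m(x) f_m(x)^2, \tag{2}$$

where $g_m(x)$ is the gradient and $h_m(x)$ is the Hessian of the loss at the current estimate:

$$g_m(x) = \left.\frac{\partial S(y, f(x))}{\partial f(x)}\right|_{f(x) = f^{(m-1)}(x)}, \qquad h_m(x) = \left.\frac{\partial^2 S(y, f(x))}{\partial f(x)^2}\right|_{f(x) = f^{(m-1)}(x)}. \tag{3}$$

Then, the loss function can be rewritten as

$$S(f_m) \approx \sum_{i=1}^{n} \left[ g_m(x_i) f_m(x_i) + \frac{1}{2} h_m(x_i) f_m(x_i)^2 \right] + \text{const} \propto \sum_{j=1}^{T_m} \sum_{i \in R_{jm}} \left[ g_m(x_i) w_{jm} + \frac{1}{2} h_m(x_i) w_{jm}^2 \right], \tag{4}$$

where $R_{jm}$ is region (leaf) $j$ of the $m$-th tree, $w_{jm}$ is its weight, and $T_m$ is the number of leaves. In region $j$, let $G_{jm}$ denote the sum of the gradients and $H_{jm}$ the sum of the Hessians; the equation then becomes

$$S(f_m) \propto \sum_{j=1}^{T_m} \left[ G_{jm} w_{jm} + \frac{1}{2} H_{jm} w_{jm}^2 \right]. \tag{5}$$

The optimal value can be computed as

$$w_{jm} = -\frac{G_{jm}}{H_{jm}}, \qquad j = 1, \ldots, T_m. \tag{6}$$

Plugging it back, we get the loss function

$$S(f_m) \propto -\frac{1}{2} \sum_{j=1}^{T_m} \frac{G_{jm}^2}{H_{jm}}. \tag{7}$$

The tree structure is scored using this function; the lower the score, the better the structure (Chen and Guestrin 2016). The maximum gain for every split is

$$\text{Gain} = \frac{1}{2} \left[ \frac{G_{jmL}^2}{H_{jmL}} + \frac{G_{jmR}^2}{H_{jmR}} - \frac{G_{jm}^2}{H_{jm}} \right], \tag{8}$$

where the subscripts $L$ and $R$ denote the left and right children of the split. To improve the performance, the loss function can be rewritten with regularization as

$$S(f_m) \propto \sum_{j=1}^{T_m} \left[ G_{jm} w_{jm} + \frac{1}{2} H_{jm} w_{jm}^2 \right] + \gamma T_m + \alpha \sum_{j=1}^{T_m} |w_{jm}| + \frac{1}{2} \lambda \sum_{j=1}^{T_m} w_{jm}^2, \tag{9}$$

where γ penalizes the number of leaves, α denotes the L1 regularization, and λ denotes the L2 regularization. The optimal weight for each region j can be calculated as

$$w_{jm} = -\frac{T_\alpha(G_{jm})}{H_{jm} + \lambda}, \tag{10}$$

and the gain is

$$\text{Gain} = \frac{1}{2} \left[ \frac{T_\alpha(G_{jmL})^2}{H_{jmL} + \lambda} + \frac{T_\alpha(G_{jmR})^2}{H_{jmR} + \lambda} - \frac{T_\alpha(G_{jm})^2}{H_{jm} + \lambda} \right] - \gamma, \tag{11}$$

where

$$T_\alpha(G) = \operatorname{sign}(G) \max\left(0, |G| - \alpha\right). \tag{12}$$

The XGBoost classifier is important because its learning process offers more randomization and regularization options, and it is fast and easy to use. We used the following hyperparameters: To summarize, the challenge of optimizing the main function is reduced to identifying the minimum of a quadratic function. Owing to the added regularization, XGBoost has a stronger capability to avoid overfitting. The structure of XGBoost can be seen in Figure 2.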
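The regularized leaf weight and split gain discussed in this section can be checked numerically with a short sketch. It follows the standard formulation from Chen and Guestrin (2016), in which the optimal leaf weight is -T_alpha(G) / (H + lambda), with T_alpha the soft-thresholding operator, and the gain compares the two children against the parent before subtracting gamma. The function names and the example numbers are illustrative assumptions, and this is not the XGBoost library's internal code.

```python
def soft_threshold(g, alpha):
    """T_alpha(G) = sign(G) * max(0, |G| - alpha), the L1 shrinkage step."""
    magnitude = max(0.0, abs(g) - alpha)
    return magnitude if g >= 0 else -magnitude


def leaf_weight(G, H, lam=1.0, alpha=0.0):
    """Optimal leaf weight: w = -T_alpha(G) / (H + lambda)."""
    return -soft_threshold(G, alpha) / (H + lam)


def leaf_score(G, H, lam=1.0, alpha=0.0):
    """Structure score contribution of one leaf: T_alpha(G)^2 / (H + lambda)."""
    return soft_threshold(G, alpha) ** 2 / (H + lam)


def split_gain(G_left, H_left, G_right, H_right,
               lam=1.0, alpha=0.0, gamma=0.0):
    """Gain of splitting a node whose gradient/Hessian sums are the totals."""
    parent = leaf_score(G_left + G_right, H_left + H_right, lam, alpha)
    children = (leaf_score(G_left, H_left, lam, alpha)
                + leaf_score(G_right, H_right, lam, alpha))
    return 0.5 * (children - parent) - gamma


# Illustrative gradient and Hessian sums for the two candidate children.
print(leaf_weight(-4.0, 3.0, lam=1.0))
print(split_gain(-4.0, 3.0, 2.0, 1.0, lam=1.0, alpha=0.0, gamma=0.0))
```

A split is only worth making when the gain is positive, so a larger gamma prunes more aggressively, while larger lambda and alpha shrink the leaf weights toward zero.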

Deep Learning
In the second approach, the deep learning ResNet101 model was utilized with a transfer learning approach. Deep learning methods yield good performance but require high computational resources, as detailed below:

Transfer Learning Approach
We applied the transfer learning approach, in which a convolutional neural network (CNN) such as ResNet101 is first pre-trained on a large dataset. The ResNet101 network consists of convolutional layers, residual blocks, and a fully connected layer. In this case, the ImageNet dataset, consisting of 14 million images, was used to pre-train the network. This initial training helps the early layers learn highly generalizable features from the larger dataset; the later layers of the network then adapt to the specifics of the smaller dataset. We used ResNet101 in our study, as described in the section below. CNNs require high computational resources, because a very large number of operations are performed during convolution and pooling to compute low-level, mid-level, and high-level features, weight filters, and weight channels. Figure 3 shows heatmaps for a few selected images used to distinguish NSCLC from SCLC, that is, the original image samples along with their heatmaps. For example, if we have pooling with filter 512, then the memory and parameters are computed as depicted below:

ResNet101
The ResNet (residual network) model was proposed by He et al. (2016). It is used in diverse applications such as medical imaging, pattern recognition, and computer vision. A CNN comprises multiple layers interconnected in a specific manner and trained to perform various tasks (Sun et al. 2017). ResNet101 has 104 convolutional layers organized into 33 blocks. Of these 33 blocks, 29 use the output of the previous block directly through a residual connection, which serves as an operand of the summation operation. The four remaining blocks receive the output of the previous block as input and apply it to a convolutional layer with a filter size of 1 × 1 and a stride of 1, followed by a batch normalization layer.
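The layer and block counts quoted for ResNet101 can be reproduced with simple arithmetic, assuming the standard layout from He et al. (2016): four stages of 3, 4, 23, and 3 bottleneck blocks, three convolutions per block, one stem convolution, and one 1 × 1 projection convolution on the shortcut of the first block of each stage. That the figure of 104 includes the projection convolutions is an assumption about the counting convention, and the helper names below are hypothetical.

```python
def resnet_layer_counts(stage_blocks=(3, 4, 23, 3), convs_per_block=3):
    """Count blocks and convolutional layers for a bottleneck ResNet."""
    total_blocks = sum(stage_blocks)
    # The first block of each stage changes the tensor shape, so its
    # shortcut needs a 1x1 projection convolution instead of an identity.
    projection_blocks = len(stage_blocks)
    identity_blocks = total_blocks - projection_blocks
    block_convs = total_blocks * convs_per_block
    return {
        "total_blocks": total_blocks,
        "identity_blocks": identity_blocks,
        "projection_blocks": projection_blocks,
        # stem conv + convs inside blocks + shortcut projection convs
        "conv_layers": 1 + block_convs + projection_blocks,
        # depth as named in "ResNet101": stem conv + block convs + final FC
        "weighted_depth": 1 + block_convs + 1,
    }


def residual_add(x, fx):
    """Residual connection: the block output is F(x) + x, element-wise."""
    return [a + b for a, b in zip(x, fx)]


print(resnet_layer_counts())
```

Under these assumptions the sketch yields 33 blocks (29 with identity shortcuts and 4 with projections), 104 convolutional layers, and a nominal depth of 101.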

Performance Evaluations Measures
The performance was evaluated using standard performance evaluation measures, and the training and testing data were formed using the split method and 10-fold cross validation (CV) (Divya Rathore and Agarwal 2014; Hussain et al. 2019; Rathore et al. 2013, 2014; Rathore, Hussain, and Khan 2015). ML and deep learning techniques are evaluated using standard performance metrics such as accuracy, precision, recall, and F1-score to measure their effectiveness and efficiency in solving a given task. The F1-measure is the harmonic mean of precision and recall. Precision has been widely used as a measure to evaluate the performance of information retrieval techniques, where it refers to the fraction of retrieved documents that are relevant. The following standard performance evaluation metrics are utilized (Jalil et al. 2022).
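The four metrics named above follow directly from the counts of a binary confusion matrix. The sketch below uses the standard definitions; the function names and the example counts are illustrative and are not taken from the paper's tables.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts.

    tp/fp/fn/tn are the true-positive, false-positive, false-negative,
    and true-negative counts for the class treated as positive.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


print(classification_metrics(tp=90, fp=10, fn=30, tn=70))
print(round(f1_score(0.83, 0.69), 2))
```

Because the F1-score is a harmonic mean, it stays close to the smaller of precision and recall, so a classifier cannot score well by maximizing only one of them.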

Results and Discussions
This study was conducted specifically to improve lung cancer detection by first extracting hand-crafted features, namely GLCM texture features. We fed these features as input to traditional ML algorithms; we then optimized the hyperparameters of the ensemble XGBoost and ResNet101 models. Tables 1 and 2 report the lung cancer prediction results based on GLCM features using traditional supervised ML algorithms. With SVM (Table 1), the overall test accuracy was 83.0%. For NSCLC, the precision, recall, and F1-score were 83.0%, 69.0%, and 75.0%, respectively; for SCLC, they were 82.0%, 91.0%, and 81.0%.
Table 4 reports the lung cancer detection performance of the XGBoost algorithm on GLCM features with optimized hyperparameters. The highest performance was obtained, with 100% precision, recall, F1-score, and training and testing accuracy.
Figure 4 (left) depicts an AUC-ROC of 1.0 for both training and testing data using XGBoost with optimized hyperparameters, and the right side depicts the error graph. We also applied XGBoost to a second dataset to distinguish lung infection (pneumonia) from normal chest X-rays, by first extracting the GLCM features and then applying the robust XGBoost algorithm. Chest X-ray images of pneumonia (N = 3863) and of normal (healthy) subjects (N = 1525) were used; the pneumonia images comprised bacterial pneumonia (N = 2521) and viral pneumonia (N = 1342). Table 9 presents the classification accuracy for bacterial lung infection versus normal chest X-rays. A training accuracy of 100%, a test accuracy of 97.04%, and an AUC of 0.99 were obtained. The AUC-ROC accuracy and error graph is shown in Figure 9, whereas the ROC and PR curves are shown in Figure 10.
Figure 11 shows the (a) accuracy and (b) loss curves of the training and testing data for 50 epochs using ResNet101 to distinguish NSCLC from SCLC subjects. After 20 epochs, the accuracy and loss curves remain almost constant. Figure 7 shows the confusion matrices for folds 1 to 10. Except for folds 5 and 8, the predictions were 100% correct.
Table 9 reports the results for bacterial versus normal subjects: for the bacterial class, the precision, recall, and F1-score were 98.0%, 97.0%, and 98.0%, respectively, whereas for the normal class they were 97.0%, 97.0%, and 97.0%. Figure 8 shows the (a) ROC curve and the corresponding (b) precision-recall curve for distinguishing bacterial from normal lungs using the XGBoost algorithm. An AUC of 0.99 was obtained. The corresponding accuracy and loss curve is represented in Figure 8.
Table 10 presents the classification accuracy for viral lung infection versus normal chest X-rays. A training accuracy of 100%, a test accuracy of 96.16%, and an AUC of 1.00 were obtained. The AUC-ROC accuracy and error graph is shown in Figure 9, whereas the ROC and PR curves are shown in Figure 9. Table 11 shows the binary class (viral, normal) classification using ResNet101 with 10-fold cross validation. A 100% prediction performance was obtained, as reflected in Table 12. The comparison of results with other studies is given in Table 12.
Figure 11 reflects the (a) accuracy, (b) loss curve of training and testing data for 200 epochs using ResNet101 to distinguish the viral from normal subjects.
We proposed XGBoost with optimized hyperparameters in order to improve lung cancer detection performance on hand-crafted GLCM features, and we compared the results with traditional ML algorithms. Previously, a few studies achieved performance of up to 95% on different extracted features using traditional ML techniques. However, the performance can be improved by applying more robust algorithms and optimizing their hyperparameters. The XGBoost algorithm improved the detection performance over the other traditional methods. We also utilized the deep learning ResNet101 algorithm with a transfer learning approach and optimized hyperparameters, which likewise improved the detection performance. The first LCA dataset was small, so to check the validity of our proposed algorithms, we applied XGBoost and ResNet101 to another, larger dataset of lung infections to distinguish normal lungs from lungs with community-acquired bacterial and viral pneumonia, and consistent results were obtained.

Conclusions
Lung cancer is the deadliest cancer, with the lowest survival rate. In the majority of countries, the incidence of deaths has multiplied unexpectedly. Researchers are trying to develop artificial intelligence tools to improve prediction performance. Traditional ML methods mostly have limitations that make them less appropriate for highly nonlinear and complex problems. In this study, we proposed the ensemble XGBoost and ResNet101 algorithms with optimized hyperparameters to distinguish NSCLC from SCLC. We also compared the results with traditional ML methods. The results reveal that the proposed model, owing to its robust performance and functionality, improved the prediction performance. Based on these results, the proposed methodology can be very helpful in the early detection and treatment of lung cancer, with the potential to decrease the mortality rate and increase the survival rate. Currently, we do not have clinical information on the patients; in the future, we will apply the proposed models to a larger dataset to predict survival, recurrence, and disease severity.

Figure 1. Schematic diagram of the workflow to detect lung cancer using the XGBoost algorithm and deep learning ResNet101 with hyperparameter tuning.

Figure 3. Lung cancer NSCLC and SCLC original image samples along with heatmaps.

Figure 4. AUC-ROC and error graph to detect lung cancer using XGBoost.

Figure 5. ROC and PR curves to detect lung cancer.

Figure 7. Confusion matrix for binary (NSCLC, SCLC) classification using the ResNet101 model at folds 1 to 10 of the k-fold cross validation.

Figure 8. AUC-ROC and error graph to detect bacterial lung infection using XGBoost.

Figure 9. ROC and PR curves to detect bacterial lung infection.
Hussain et al. and coworkers extracted texture, morphological, elliptic Fourier descriptor (EFD), scale invariant feature transform, and entropy-based features to detect prostate cancer, breast cancer, brain tumor, and lung cancer. For the computation of the F1-measure, each record is considered as if it were the result of a query and each class as if it were the desired set of documents for a query; the recall and precision of that record for each given class are then calculated. The F1-measure of record j and class i is defined as follows:

Table 3. Lung cancer detection performance utilizing XGBoost with default parameters.

Table 4. Lung cancer detection performance utilizing XGBoost with hyperparameter optimization.

Table 9. Bacterial lung infection detection performance utilizing XGBoost with optimized hyperparameters.

Table 10. Viral lung infection detection performance utilizing XGBoost with optimized hyperparameters.

Table 11. Viral lung infection detection performance utilizing ResNet101 with 10-fold cross validation.

Table 12. Comparison of findings from previous studies.