Gastrointestinal tract disease recognition based on denoising capsule network

Abstract Today, cancer is one of the leading causes of death worldwide. Cancers affect different parts of the human anatomy in different ways, and gastrointestinal cancers such as colorectal cancer account for a significant share of cancer deaths. With computer-aided diagnosis systems, medical practitioners can recognize a variety of illnesses more quickly than with manual procedures, and mortality can be reduced significantly by detecting and removing precancerous lesions early. Because manual diagnosis is time-consuming and challenging, researchers have developed computational algorithms to identify and classify diseases such as gastrointestinal (GI) diseases. Computer-aided diagnosis relies heavily on the classification of medical images, which is a challenging task. Therefore, this study presents a less sophisticated but still effective pre-processing technique for identifying endoscopic images, known as the denoising capsule network (Dn-CapsNets). Moreover, we constructed activation maps (AMs) from the feature representations to visualize the results. In these evaluations, the trained model achieved 94.16% accuracy, 83.1% precision, 86.7% sensitivity, 96.1% specificity, an 86.6% F1-score, and a Matthews correlation coefficient of +0.69. Comparisons between the proposed method and the current state-of-the-art show improved accuracy.


Introduction
The automatic identification and classification of diseases such as skin cancer (Saba, 2019), lung cancer (Sharif et al., 2019), brain tumor (Sharif et al., 2019), stomach cancer (Khan, Rubab et al., 2020), and a few others have been major study subjects in medical imaging for decades. Cancers of the colon and tumors of the gastrointestinal tract are two of the most common gastrointestinal health problems in the United States (Siegel et al., 2017), affecting between 60 and 70 million people each year. Stomach cancer claimed the lives of over one million individuals worldwide in 2017, making it the third-leading cause of cancer-related death (Hyung et al., 2019). Most colorectal cancers (CRCs) arise from previous adenomas, and the adenoma-carcinoma sequence can be used to screen for CRC and prevent it (Bretthauer, 2011). Colonoscopy combined with adenoma resection can reduce the risk of CRC by up to 80% and the related mortality by 50% (Vokes et al., 1993). The development of computer-aided diagnostic (CAD) technologies has proven immensely useful for doctors in their clinics (Akram et al., 2018), (Nasir et al., 2018), (M. Sharif et al., 2021). CNNs, for instance, rely on pooling layers to retain only the most significant information, which makes the network translationally invariant. Consequently, CNNs need to be augmented with a variety of data augmentation methods in order to generalize to other perspectives. These augmentation methods, however, are time-consuming and difficult. Capsule networks with dynamic routing were developed to overcome CNNs' drawbacks (Sabour et al., 2017).
A capsule (Sabour et al., 2017) is made up of a number of neurons whose activity vector encodes the instantiation parameters of an entity and whose length indicates the likelihood that the feature encoded by the capsule is present. In layer l, each lower-level capsule i uses its own activity vector to encode spatial information (e.g., scale, position, orientation, and skewness) of the input features as instantiation parameters. Despite their ability to predict attributes of each object in an image, capsule networks struggle with complex images. As a result, most capsule network models proposed in the literature for image recognition have been trained on CIFAR-10. To address this issue, researchers have started developing computational approaches in which segmentation, classification, and feature visualization are all crucial processes. In their work, Yuan & Meng (Yuan & Meng, 2014) combined a bag-of-features (BoF) technique with a saliency map to create an integrated polyp detection system. To characterize local features in the first phase of the BoF technique, SIFT feature vectors are clustered using k-means. Saliency characteristics were then determined using the saliency map histogram. Lastly, employing both BoF and saliency variables, an SVM was used to perform classification. The authors in (Yuan et al., 2016) improved this technique by adding LBP, uniform LBP (ULBP), complete LBP (CLBP), and histograms of oriented gradients (HoG) features alongside SIFT features. Finally, SVM and Fisher's linear discriminant analysis (FLDA) classifiers were employed to categorize these features using different combinations of local characteristics. The SVM classifier produced good classification accuracy when SIFT and CLBP characteristics were combined. The medical imaging technique of wireless capsule endoscopy is used to examine the gastrointestinal (GI) tract.
As shown in Figure 1, this technique (Campus et al., 2018) is widely used in hospitals to detect gastrointestinal abnormalities such as ulcers, bleeding, and many others.
CNNs have lately been found to be quite useful in endoscopic procedures such as esophagogastroduodenoscopy (EGD), colonoscopy, and capsule endoscopy. The anatomical position in EGD images (Takiyama et al., 2018), Helicobacter pylori (HP) infections (Shichijo et al., 2017), (Itoh et al., 2018), and gastric cancer (Hirasawa et al., 2018) have all been addressed by CNN-based diagnostic tools in EGD. During colonoscopy, CNN-based diagnostic tools have been employed to detect and characterize colorectal polyps (Komeda et al., 2017), (Byrne et al., 2019b), (R. Zhang et al., 2017). In 2017, Komeda et al. (Komeda et al., 2017) reported a study that employed a CNN to diagnose colorectal polyps using 1,200 colonoscopy photographs and 10 additional video images of unlearned procedures. A 10-fold cross-validation produced an accuracy of 0.751, where accuracy refers to the percentage of answers that are correct. In a study by Byrne et al. (Byrne et al., 2019a), the researchers developed a deep CNN model for real-time analysis of colorectal polyps within colonoscopic video images. Training and testing of the CNN model was done using only NBI video frames (split evenly across the relevant classes) and unedited routine exam films that were not explicitly intended for CNN classification. 106 sequentially encountered small polyps from a second set of 125 videos were used to validate the model. For adenoma detection, the CNN achieved an accuracy of 94%, a sensitivity of 98%, a specificity of 83%, a negative predictive value of 97%, and a positive predictive value of 90%. Similarly, Chang and Chen (Y. Chang & Chen, 2019) achieved better categorization across a wide range of categories using a deep attention neural network. A variety of techniques are employed to transform the data, including automatic data fusion, multi-epoch fusion, and adaptive threshold selection, achieving an F1-score of 90.70%.
A polyp identification convolutional neural network based on a single-shot multibox detector (SSD) was described by X. Zhang et al. (2019). According to the data, an upgraded SSD can increase mean average precision (mAP) from 88.5% to 90.4%. The findings also show that when max-pooling layers are used, convolutional neural networks lose about three-quarters of their critical information. Ayidzoe et al. (Afriyie et al., 2021) suggested a capsule network variation that is less sophisticated but still robust and capable of extracting features. Their model took advantage of the Gabor filter and a proprietary preprocessing block to understand the structure and semantic information in an image, resulting in higher accuracy on the datasets studied. Several deep learning models based on CNNs have been proposed to diagnose gastrointestinal disorders. However, to our knowledge, only a few studies have used gastrointestinal images to train capsule networks. We therefore propose CapsNets with a modified squash function for detecting gastrointestinal diseases. This paper also introduces improved methodologies that can be used as CapsNet performance indicators, enhancing the models' dependability, explainability, and understandability.
The paper also underlines the need for CapsNet GI models that make judgments based on accuracies and activation map-related performance measures to be re-evaluated.
The identification of similar patterns in diseased regions is one of the most challenging components of gastrointestinal tract disease classification. Images of low-contrast ulcers, for instance, have hues similar to polyps, and low-contrast ulcer zones are also difficult to distinguish from healthy regions, making classification problematic. For successful classification, the selection of usable features is crucial, and this article focuses on selecting the best features. By delivering more accurate classification results, robust features improve the overall efficiency of the CAD system.
We propose a modified framework to detect anomalies in GI images automatically by combining hybrid features and deep learning information. A denoising technique reduces noise in the images to help CapsNets learn better features from the endoscopic images and build a faster network. In summary, the contributions of this paper are as follows: (1) Based on CapsNets, we present a novel and augmented framework for detecting malignant GI anomalies. To extract the required features, we create optimized CapsNets.
(2) The proposed model is trained and extensively tested on three datasets with GI anomalies. All datasets were successfully analyzed with our technique, and we demonstrated generalized high performance with the newly built architecture.

Related work
Capsule network research is advancing rapidly in terms of its ability to recognize complex images. Capsules performed better on complicated images (S. Chang & Liu, 2020) when more layers were stacked, the number of capsule layers increased, and ensemble averaging was used. Many methods, such as changing the reconstruction scaling factor and parallelizing capsule layers (Xiong et al., 2019), have been proposed. Detecting and recognizing gastrointestinal disorders is primarily done with supervised learning algorithms using CNNs and handcrafted features. Wimmer et al. (Wimmer et al., 2018) employed endoscopic image datasets to train three pre-trained CNN architectures, and an SVM was then used to classify colonic polyps and celiac disease. They experimented with classification by concatenating and combining features from several levels, and their method outperformed other CNN-based techniques. Mossotto et al. (Mossotto et al., 2017) offered three unsupervised machine learning models that used only endoscopic data, only histology data, and mixed endoscopic data, with accuracies of 71%, 76.9%, and 82.7%, respectively. Ozawa et al. demonstrated the robustness of a GoogLeNet CNN architecture as the basis of a computer-aided diagnostic (CAD) system for predicting ulcerative colitis severity. In a similar study, Maeda et al. (Maeda et al., 2019) showed that persistent inflammation associated with ulcerative colitis can be predicted using CAD. Our study, however, presents a novel and integrated framework that is capable of detecting abnormalities from endoscopic images based on optimized CapsNets.

Proposed method
Dynamic routing (Sabour et al., 2017) is used in the proposed method, which includes an optimized squash function and an amplifier algorithm. In Figure 2, a three-layer convolutional capsule is shown along with four amplified layers that map onto the primary capsule to provide enhanced features. The images are resized to 28 × 28 to identify and categorize the Kvasir-v2 dataset, as shown in our proposed architecture. The input images are fed into the modified layers using 3 × 3 kernels to create 28 × 28 feature maps. The first modified layer, which produces 96 feature maps of size 26 × 26 with ReLU activation, feeds Conv1. The second modified layer receives these 96, 26 × 26 maps and outputs 96, 24 × 24 maps, which are then fed into Conv2, which has 128, 24 × 24 dimensions. Convolutional and modification layers are sandwiched between preprocessing layers and batch normalization layers. Concatenated layers were added to Conv4 to extract more semantic information and to improve the network's learning ability.
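The feature-map sizes quoted above (28 × 28 shrinking to 26 × 26 and then 24 × 24) follow from the standard output-size formula for valid 3 × 3 convolutions; a minimal sketch, with the helper name `conv_out` chosen here purely for illustration:

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    # Spatial size after a convolution: floor((size + 2p - k) / s) + 1
    return (size + 2 * padding - kernel) // stride + 1

# 28 x 28 input -> 26 x 26 -> 24 x 24 under successive valid 3 x 3 convolutions
sizes = [conv_out(28), conv_out(conv_out(28))]
```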

Feature extraction and classification
Recently, extracting valuable learning features from a set of extracted features has become an active area of research. Irrelevant and redundant features in the originally extracted set may need to be discarded before final learning, and it is critical to remove extraneous data to keep the proposed model consistent (Afriyie et al., 2021). Dynamic routing is used in this study because it has been widely adopted in the literature. This paper describes the encoder network using an input layer, custom preprocessing (CP), convolutional layers, a primary capsule layer, and a class capsule layer. The images are preprocessed (see Algorithm 1) to reduce noise, and features are extracted by the convolutional and modified layers.
The trained images may contain noise, expressed as μ(x) = f(x) + n, where n represents additive white Gaussian noise, N_d(x) represents the square patch of size (2d + 1) × (2d + 1) centered at x, and W is the normalizing term, W(x) = Σ_y w(x, y). Eqns (1) and (2) demonstrate how the relevant features of each image are passed to the fastNlMeansDenoisingColoredMulti function (Orchard et al., 2008) from OpenCV (cv2), which suppresses the noise. In the weighting scheme, the filter parameter (α) controls how fast the exponential expression decays; the α-parameter is controlled by Algorithm 1. A sufficiently large α-value is necessary to achieve a smoothing effect in the images, as a very small α-value leaves noise in the training images. Using an α-value in the range [30, 50], we obtained smooth images, thereby discarding irrelevant features. For all layer l + 1 capsules, the prediction vector û_{j|i} is generated; it serves as the transformation of the entity represented by capsule j at level l + 1 by capsule i at level l. The vector û_{j|i} represents the proportion of the primary capsule (PC) i's contribution to the class capsule (CC) j. In other words, û_{j|i} represents a vote for the class capsule j at layer l + 1 from capsule i at layer l. To determine whether top-down feedback should raise the coupling coefficient c_ij for that route, a scalar product between the prediction vector û_{j|i} and the output v_j of each class capsule is used. A large scalar product (a_ij = v_j · û_{j|i}) between a lower-level capsule and a specific higher-level capsule indicates strong agreement, with the c_ij for that connection increasing and the c_ij for all other links decreasing.
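The non-local means weighting behind the denoising step can be sketched in pure NumPy. This toy grayscale version, with patch size (2d + 1) × (2d + 1) and weights w(x, y) = exp(−‖P(x) − P(y)‖² / α²), is only an illustration of the principle, not the OpenCV `fastNlMeansDenoisingColoredMulti` implementation used in the paper:

```python
import numpy as np

def nl_means(img, d=1, search=2, alpha=10.0):
    """Toy non-local means: mu(x) = (1/W(x)) * sum_y w(x, y) f(y),
    with w(x, y) = exp(-||P(x) - P(y)||^2 / alpha^2)."""
    h, w = img.shape
    pad = d + search
    padded = np.pad(img.astype(float), pad, mode="reflect")
    out = np.zeros((h, w))
    for x0 in range(h):
        for x1 in range(w):
            cx, cy = x0 + pad, x1 + pad
            p = padded[cx - d:cx + d + 1, cy - d:cy + d + 1]  # patch P(x)
            weights, values = [], []
            for dx in range(-search, search + 1):
                for dy in range(-search, search + 1):
                    q = padded[cx + dx - d:cx + dx + d + 1,
                               cy + dy - d:cy + dy + d + 1]   # patch P(y)
                    weights.append(np.exp(-np.sum((p - q) ** 2) / alpha ** 2))
                    values.append(padded[cx + dx, cy + dy])
            # Normalize by W(x) = sum_y w(x, y)
            out[x0, x1] = np.dot(weights, values) / np.sum(weights)
    return out
```

A larger α flattens the weights toward a plain average (stronger smoothing), which mirrors the role of the filter parameter chosen by Algorithm 1.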
The squash function produces v_j from s_j, the input to capsule j, which is the weighted sum over all û_{j|i} from the lower-level capsules, s_j = Σ_i c_ij û_{j|i}. The coupling coefficients c_ij are refined during the iterative dynamic routing process and computed by a softmax (Gao & Pavel, n.d.), with the initial logits b_ij being the log prior probabilities that capsules i and j should be coupled. The logits start at zero and are updated from the old b_ij values and the agreement a_ij as b_ij ← b_ij + a_ij. The constraint Σ_j c_ij = 1 means that, at any point during training, all the c_ij between a given capsule i and the capsules in layer l + 1 must add up to one.
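The routing update described in this and the preceding paragraph can be sketched as follows. This is a minimal NumPy illustration of Sabour-style dynamic routing (squash, softmax coupling, agreement update), not the exact code of the proposed model:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # v = (||s||^2 / (1 + ||s||^2)) * (s / ||s||): length in (0, 1)
    sq = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def dynamic_routing(u_hat, iterations=3):
    # u_hat[i, j]: prediction vector u_hat_{j|i} from lower capsule i to capsule j
    n_lower, n_upper, _ = u_hat.shape
    b = np.zeros((n_lower, n_upper))                          # logits b_ij start at 0
    for _ in range(iterations):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # c_ij = softmax_j(b_ij)
        s = (c[..., None] * u_hat).sum(axis=0)                # s_j = sum_i c_ij u_hat_{j|i}
        v = squash(s)                                         # v_j = squash(s_j)
        b = b + np.einsum("ijd,jd->ij", u_hat, v)             # b_ij += a_ij = v_j . u_hat_{j|i}
    return v, c
```

Note that each row of c sums to one, matching the constraint Σ_j c_ij = 1, and every output vector has length below one, as required of a capsule's existence probability.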

Datasets description and experimental setup
The experiments in this paper were carried out on a single 64-bit Windows PC with an NVIDIA GeForce GTX 1050 Graphics Processing Unit (GPU) running CUDA 10.1 and 8 GB of dedicated memory, using Keras (TensorFlow backend). Using a 0.001 learning rate and a learning rate decay of 0.9, each model is trained for 200 epochs. The number of routing iterations for each dataset is varied between 1 and 6, and the margin loss function is used to train the models. Equation 2 depicts the margin loss. For all of the models, the datasets were divided in a 9:1 ratio for training and testing. The margin and reconstruction losses make up the loss function needed to train the model.
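The 9:1 train/test division mentioned above can be sketched as a simple shuffled hold-out split; the function name and seed below are illustrative, not taken from the paper:

```python
import numpy as np

def leave_out_split(x, y, test_frac=0.1, seed=42):
    # Shuffled hold-out split; test_frac=0.1 gives the paper's 9:1 ratio.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_test = int(round(len(x) * test_frac))
    test, train = idx[:n_test], idx[n_test:]
    return x[train], y[train], x[test], y[test]
```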
In this implementation, the default values of m+, m-, and the loss function (Sabour et al., 2017) were kept. Three routing iterations were used during training of the capsule models.
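With the default values m+ = 0.9, m- = 0.1, and λ = 0.5 from Sabour et al., the margin loss over the class-capsule lengths can be sketched as:

```python
import numpy as np

def margin_loss(v_norms, targets, m_plus=0.9, m_minus=0.1, lam=0.5):
    # L_k = T_k max(0, m+ - ||v_k||)^2 + lam (1 - T_k) max(0, ||v_k|| - m-)^2
    present = targets * np.maximum(0.0, m_plus - v_norms) ** 2
    absent = lam * (1.0 - targets) * np.maximum(0.0, v_norms - m_minus) ** 2
    return (present + absent).sum(axis=-1).mean()
```

A confident, correct prediction (e.g., the true class capsule longer than m+ and all others shorter than m-) incurs zero loss.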

Kvasir-v2 Dataset:
We used the Kvasir dataset for our tests (Pogorelov et al., n.d.). The Kvasir dataset was produced to improve applications that involve automatic detection, classification, and localization of endoscopic pathological abnormalities in gastrointestinal tract images. A total of 4,000 colored images have been categorized and confirmed by medical endoscopists in this collection. It is divided into eight classes that depict various diseases as well as typical anatomical markers, with 500 examples per class, ensuring that the dataset is well balanced. This experiment used five classes in total: 0 signifies esophagitis, 1 signifies normal-cecum, 2 signifies normal-pylorus, 3 signifies polyps, and 4 signifies ulcerative colitis. Polyps are pathological findings of the bowel, whereas ulcerative colitis is an inflammatory condition of the large bowel. Esophagitis is a condition characterized by abnormalities of the esophagus. Because they duplicate the aforementioned classes, the other three classes were not included in the experiment. A 90:10 leave-out strategy was applied to the dataset, which was scaled and resized to 28 × 28 × 3.

Fashion-MNIST:
The authors of (Vollgraf, n.d.) introduced Fashion-MNIST, which consists of 70,000 grayscale images divided among 10 classes, each with 7,000 images. Tests were conducted using 10,000 images, while training was conducted using 60,000 images.

CIFAR-10:
In addition, we used Krizhevsky and Hinton's CIFAR-10 (Krizhevsky, 2009), which consists of 60,000 colored 32 × 32 RGB images of the real world divided into ten classes. There are 50,000 training images and 10,000 testing images in the dataset.

CIFAR-100:
A total of 60,000 32x32x3 images make up this dataset. There are 100 classes in the dataset, made up of 50,000 training images and 10,000 test images.
As part of the training process, patience, the early stopping hyperparameter, is set to 10, and only the best models are saved.

Results and discussion
The experimental results of the model are evaluated using the methods proposed in this section. Samples of the different classes of the Kvasir dataset are shown in Figure 3.

Model's flexibility and robustness
Models should be tested using the methods described here to ensure their stability, understandability, and credibility. Equation (4) gives the average accuracy of a model:

average accuracy = (number of correct predictions) / (total number of samples in the test set)    (4)
We adjusted the hyperparameters and/or intermediate layers of all the models randomly to see how responsive they were to the changes. Accuracy, the most common assessment metric for classification algorithms (Mensah et al., 2020), makes it difficult to detect how effectively a classifier differentiates one class from another when there is a class imbalance (Provost et al., 1998). According to Patrick et al. (Ayidzoe, Yongbin, Kwabena et al., 2021), there was no significant difference in the performance of CapsNet models when momentum, batch size, learning rate, dropout, and learning rate decay were changed. One of the most significant hyperparameters that affected CapsNet models' performance was the number of routing iterations, with three iterations generating the best results. Our study examines the proposed model's accuracy and its validation findings (see Figure 2). As shown in Table 1, our proposed model achieved comparably better results on all of the datasets used for testing.
In Figure 4, the confusion matrix shows how many images were correctly and incorrectly identified. The precision, sensitivity, specificity, and accuracy values for each class were calculated to extract useful information from these data. Considering the true-positive and per-class accuracy values, the proposed model performed well across all classes. In contrast to the proposed capsule models, these hyperparameter adjustments negatively affected the CNN models' performance.

Confusion matrix-based metrics
Based on multi-class confusion matrices, Figure 4 summarizes the performance of the models. The confusion matrix can also be used to derive other performance metrics, such as precision, sensitivity (recall), specificity, and per-class accuracy, to evaluate the model's performance on specific classes. The confusion matrix also yields true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs), which are critical to decision making based on model performance, even if interpretations can be complex and situation-specific in some cases. Classifying a healthy GI as diseased is not fatal, so a classifier can tolerate some false positives. Many FN predictions, however, indicate a poor model, as it is hazardous to label an infected GI as healthy. We present the other metrics in equations (5)-(9). The fitness of the model is also assessed with the Matthews correlation coefficient (MCC; Matthews, 1975), which measures the correlation between the predicted and true classifications:

MCC = (TP × TN - FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))    (9)

MCC lies in the range [-1, +1]. In general, a larger MCC value indicates better image classification performance. We achieved an average MCC value of +0.691, indicating a strong positive correlation between the predicted and true classes of GI disease. Also, the performance comparison shown in Table 2 depicts an overall accuracy, sensitivity, precision, F1-score, and specificity of 94.16%, 86.7%, 83.1%, 86.6%, and 96.1%, respectively.
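The confusion-matrix metrics and the MCC used above can be computed directly from the TP, TN, FP, and FN counts; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def confusion_metrics(tp, tn, fp, fn):
    # Standard binary confusion-matrix metrics; MCC per Matthews (1975).
    acc = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / (tp + fp)
    sens = tp / (tp + fn)                       # sensitivity = recall
    spec = tn / (tn + fp)
    f1 = 2 * prec * sens / (prec + sens)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"accuracy": acc, "precision": prec, "sensitivity": sens,
            "specificity": spec, "f1": f1, "mcc": mcc}
```

For multi-class problems, the same formulas are applied per class (one-vs-rest) and averaged, as done for the per-class values reported from Figure 4.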

Model convergence
As can be seen in Figure 5, the proposed Dn-CapsNets model learns and converges faster. For example, the Dn-CapsNets model reaches its higher accuracies between epochs 75 and 200. As a result, the proposed capsule network's final accuracy can be approximated during the first few epochs, whereas for the baseline one must wait for the complete training run before the final accuracy can be determined. The optimized architecture's ability to encode the texture of the infectious region of the stomach is credited with this convergence. The collaboration of these layers results in rapid learning and convergence. This behavior is particularly useful for prototyping and preliminary GI disease investigations.
When feature extraction is performed, it is imperative to examine whether sections of the network are inactive, as this condition causes excessive oscillations and prolongs training time (Japkowicz & Shah, 2011). Figure 6 shows that the Conv2 layer of the proposed model (see Figure 6b) is an efficient extractor, possibly because it is a higher-level layer that can sample features from lower-level layers (such as Conv1) to represent advanced image characteristics. Thus, the proposed PC layer possesses all of the important classification properties that the baseline model lacks (see Figure 6a). As well as helping to make CapsNets more explainable and understandable, this analysis enables them to be used in real-world settings.
Due to its independence from the decision threshold and invariance to the a priori probability distributions of the classes, the area under the curve (AUC) is preferred over accuracy. AUC is a superior metric to accuracy in terms of consistency and discrimination (Ling et al., 2003), and a classifier with a large AUC is preferable to one with a smaller AUC. AUC may be calculated for both the Receiver Operating Characteristic (ROC) curve and the Precision-Recall (PR) curve. The ROC curve should not be utilized to analyze imbalanced datasets, since it tends to be unduly optimistic when the dataset class distribution is heavily skewed; for unbalanced datasets, the PR curve is appropriate (Singla & Domingos, 2005). The ROC curve should also not be employed when it is not possible to gather enough data for the model (Sokolova, n.d.). In the literature, this distinction is rarely addressed or adopted for CapsNet implementations. The proposed models' ROC and PR curves have large regions beneath the curves, as shown in Figure 8. In ROC and PR spaces, a class's objective is to be in the top-left and upper-right corners, respectively. As shown in Figure 7, the proposed model's classes achieve this purpose better.
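ROC AUC itself can be computed without choosing any decision threshold, via its rank-statistic (Mann-Whitney U) formulation; a small illustrative sketch:

```python
import numpy as np

def roc_auc(scores, labels):
    # AUC = P(score of a random positive > score of a random negative);
    # ties between a positive and a negative count one half.
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

Perfectly separated scores give an AUC of 1.0, and perfectly inverted scores give 0.0, which is why a large area under the curve indicates strong discrimination.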

Comparison of results
To demonstrate the performance of our proposed model, we compare it with the state-of-the-art in the literature. Given the limited number of works on GI diseases using deep learning, we compare our proposed method to those used on similar tasks. The proposed method is quite effective for classifying labeled medical image datasets. Datasets with a numerical imbalance between classes are also expected to benefit from the proposed method.
We further compare the average training time on the Kvasir-v2 dataset for the proposed model and Sabour's model, as seen in Table 4. It can be seen from Table 4 that the proposed model converges faster than the baseline model. We also compare our proposed algorithm with state-of-the-art schemes in Table 3.
Figure 9 shows the secondary capsule layer clusters produced by the Dn-CapsNets model; the first column shows clusters derived from the proposed model. Despite some overlap in classes, the proposed model forms separable clusters on these datasets. A few outliers appear in the proposed model, but they are not too far from their respective clusters. This indicates that the proposed model can discriminate well between GI images when classifying them.

Conclusion
In this study, we present a robust framework (Dn-CapsNets) for identifying gastrointestinal tract diseases in the Kvasir-v2 dataset. A deep learning technique can help reduce the likelihood of developing malignant diseases and even decrease the number of benign tumors that need to be removed. To improve classification accuracy, the images were optimized to remove noise, and the convolutional and modified layers were applied to extract relevant features. The proposed Dn-CapsNets showed superior performance on the Kvasir dataset when applied to endoscopic image classification. We discovered that the proposed model was able to capture more detailed differences between similar-looking images, resulting in improved classification. We also believe that our framework can be applied to other diseases that can be detected through computer-aided diagnosis. In the future, we hope to apply an improved model to HyperKvasir, which is the largest, unbalanced, and possibly most complex digestive-tract visual dataset.

Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.