Detection and classification of gastrointestinal disease using convolutional neural network and SVM

Abstract Gastrointestinal tract is a series of hollow organs connected in a long tube twisting from the mouth to the anus. Recovery of gastrointestinal diseased patients depends on the early diagnosis of the disease and proper treatment. In recent years, the diagnosis of gastrointestinal tract diseases using endoscopic image classification has become an active area of research in the biomedical domain. However, previous studies show that there is a need for improvement as some classes are more difficult to identify than others. In this study, we propose a concatenated neural network model by concatenating the extracted features of VGGNet and InceptionNet networks to develop a gastrointestinal disease diagnosis model. The deep convolutional neural networks VGGNet and InceptionNet are trained and used to extract features from the given endoscopic images. These extracted features are then concatenated and classified using machine learning classification techniques (Softmax, k-Nearest Neighbor, Random Forest, and Support Vector Machine). Among these techniques, support vector machine (SVM) achieved the best performance compared to others, using the available standard dataset. The proposed model achieves a classification accuracy of 98% and Matthews’s Correlation Coefficient of 97.8%, which is a significant improvement over previous techniques and other neural network architectures.


PUBLIC INTEREST STATEMENT
The digestive system consists of the gastrointestinal tract and other organs which help the body to break down and absorb food. The gastrointestinal tract may be affected by a variety of diseases which affect its functionality. The domain of gastrointestinal endoscopy includes the endoscopic diagnosis of various digestive diseases using image analysis and various devices. Endoscopy is currently the preferred method for examining the gastrointestinal tract; however, previous studies have shown that there is a need for improvement as some classes are more difficult to identify than others. In this study, we proposed a concatenated neural network model by concatenating the extracted features of VGGNet and InceptionNet networks to develop a gastrointestinal disease diagnosis model. The deep convolutional neural networks VGGNet and InceptionNet are trained and used to extract features from the given endoscopy images. The proposed model achieves a classification accuracy of 98% and Matthews's Correlation Coefficient of 97.8%, which is a significant improvement over previous techniques and other neural network architectures.

Introduction
The human gastrointestinal tract is prone to several abnormal outcomes from minor discomfort to life-threatening diseases (Naz et al., 2021). Gastrointestinal tract diseases such as ulcers, bleeding, polyps, Crohn's disease, colorectal cancer, tumor cancer, and other related diseases are well known today worldwide (Jha et al., 2021). According to the International Agency for Research on Cancer, the number of new cases of gastrointestinal cancer and the number of related deaths worldwide in 2018 is estimated at 4.8 million and is responsible for 26% of global cancer cases and 35% of all cancer deaths (Naz et al., 2021). Early diagnosis of various gastrointestinal diseases can result in effective treatment and reduce the risk of mortality. Unfortunately, various gastrointestinal diseases are undetectable or create confusion during screening by medical experts due to noise in the images which suppress important details (Borgli et al., 2020;Khan, Majid et al., 2021;Ramzan et al., 2021). Visual assessment of endoscopy images is subjective, often time-consuming, and minimally repetitive, which could lead to an inaccurate diagnosis (Borgli et al., 2020). The use of artificial intelligence (AI) in a variety of gastrointestinal endoscopic applications has the potential to improve clinical practice and improve the efficiency and precision of current diagnostic methods (Chanie et al., 2020;Woreta et al., 2015). Several studies have been conducted to classify gastrointestinal diseases with the goal of making a significant contribution to the effective diagnosis and treatment of gastrointestinal diseases. However, several studies used a small amount of data , and the majority of research studies aimed to classify a small number of gastrointestinal disorders in a specific part of the human gastrointestinal system. However, in Borgli et al. (2020), a series of experiments were conducted based on deep convolutional neural networks to classify 23 different classes of images. The authors achieved a Matthews Correlation Coefficient of 0.902. More importantly, the study showed that there is a need for improvement as some gastrointestinal diseases are more difficult to identify than others.
Deep learning techniques are another form of machine learning techniques which have been used in several areas of gastrointestinal endoscopy including colorectal polyp detection and classification, analysis of endoscopic images for diagnosis of helicobacter pylori infection detection and depth assessment of early gastric cancer, and detection of various abnormalities in wireless capsule endoscopy images (Khan, Nasir et al., 2021;Khan, Sarfraz et al., 2020;Majid et al., 2020). Öztürk & Özkaya (2020) stated that due to the deficiencies of radiologists and other human factors which could lead to a false diagnosis, a computer-aided automated system would be useful for accurately diagnosing gastrointestinal polyps in the early stages of cancer. In this paper, we proposed a gastrointestinal disease diagnosis model that uses SVM as a classifier to determine which of the 23 gastrointestinal diseases the endoscopic scan images correspond to. The main contributions of this paper are as follows: • We propose a novel concatenated feature extraction model of the VGGNet and InceptionNet deep learning models for endoscopic image classification. This method builds on convolving images using concatenated VGGNET and InceptionNet architectures for feature modeling and then classifying these features using RBF kernel-based multi-support vector machine. The proposed approach avoids the use of handcrafted feature extraction and selection. Furthermore, the proposed model overcomes the problem of over-fitting and requires few parameters to learn discriminant features.
• We consider two popular pre-trained CNN models with transfer learning principles for robustness testing of the proposed model.
The remainder of this paper is organized as follows. Section 2 presents an overview of related works. Section 3 explains the materials and methods, while Section 4 articulates the experimental results and comparative analysis. Finally, Section 5 deals with the concluding remarks.

Related works
Over the past few years, the strength of deep learning algorithms have been explored in the field of endoscopy including capsule endoscopy, esophagus-astro-duodenoscopy, and colonoscopy to help doctors successfully diagnose various gastrointestinal lesions (Borgli et al., 2020;Chanie et al., 2020;Khan, Nasir et al., 2021;Gamage et al., 2019). Nowadays, deep learning approaches especially CNN has become a powerful machine learning technique in image processing tasks such as the classification of gastrointestinal diseases.
Borgli et al. (2020) carried out a series of experiments based on deep convolutional neural networks using applied common architectures with a slight modification to classify 23 different classes of images with the help of Pre-Trained ResNet-50, Pre-Trained ResNet-152, Pre-Trained DensNet-161, Averaged ResNet-152 + DenseNet-161, and ResNet-152 + DenseNet-161 + MLP which achieve a Matthews correlation coefficient (MCC) of 0.826, 0.898, 0.899, 0.899, and 0.902, respectively. However, the results also show that there is room for improvement as some classes are more difficult to identify than others (difficulty in classifying ulcerative colitis and esophagitis, difficulty in classifying dyed lifted polyps and dyed resection margins, and difficulty in classifyingbarrettes from esophagitis or z-line).
The descriptive study of Woreta et al. (2015) conducted at Gondar University Hospital based on patient data indicates that esophago-gastroduodenal pathology was detected in 1093 (83.4%) patients of which duodenal ulcer was the most common endoscopic findings obtained from 333 (25.4%) cases.
The study performed by Chanie et al. (2020) at Saint Paul's Hospital Millennium Medical College on 128 patients varices indicates that the most common cause of upper gastrointestinal bleeding was observed in 46.1% (59), followed by peptic ulcer disease, 24.2% (31), esophagitis, 3.9% (5), gastritis, 6.3% (8), duodenitis, 3.1% (4), and malignancy, 4.7% (6), while 10 patients (7.8%) had both varices and ulcer. The proportion of deaths in the study was 17.2%. Khan, Nasir et al., (2021) presented a ulcer segmentation approach based on deep learning of the modified Recurrent CNN masks for the detection of ulcers and classification of gastrointestinal diseases (ulcer, polyp, bleeding). During the classification phase, the ResNet101 pre-trained CNN model is further developed through transfer learning to derive deep characteristics. Grasshopper optimization and the minimal distance fitness function were used to improve the acquired deep features. For the final classification, the best-selected features are fed into a multi-support vector machine classifier using the cubic kernel function. Validation is done with a private data set, which achieves a MOC of 88.08% for ulcer segmentation and 99.92% for the classification accuracy. The ulcer regions were not correctly segmented because of the small amount of training data which lead to the failure of the approach to segment polyp and bleeding regions. Gamage et al. (2019) used pre-trained DenseNet-201, ResNet-18, and VGG-16 CNN models as feature extractors with a global average pooling (GAP) layer to produce an ensemble of deep features as a single feature vector with a promising accuracy of 97.38%. Only eight classes of digestive tract anomalies were predicted by the suggested method. Takiyama et al. (2018) proposed an alternative CNN-based classification model for categorizing the anatomical location of the human digestive tract. This technique could classify EGD images into four large anatomical locations (larynx, esophagus, stomach, and duodenum) and three subcategories for stomach images (upper, middle, and lower regions). A predetermined CNN architecture called GoogleNet was used for this classification problem, which achieved a high performance. Owais et al. (2019) proposed a CNN and TML classification framework for the classification of several gastrointestinal diseases using endoscopic videos with a dataset containing a total of 77 video files with52,471 images. They considered a total of 37 different classes (both diseased and normal cases) related to the human GI tract and achieved an average accuracy of 92.57%. However, more data was required to improve the systems performance and due to the limited size of the dataset, it was not possible to use different videos for training and testing, because most of the classes (about 21 out of 37 classes) consisted of single-patient data (one video per class). Sharif et al. (2019) proposed a new technique based on merging deep CNN and geometrical functions. Initially, the disease regions are extracted from the given WCE images using a new approach called improved contrast color features. The geometric features were derived from the segmented disease area. Thereafter, a unique fusion of the deep CNN functions, VGG16 and VGG19 was carried out based on the Euclidean Fisher Vector. The unique features are combined with geometric features which are then used to choose the best features using the conditional entropy approach. K-Nearest Neighbor (kNN) was used to classify the selected features. For the evaluation of the suggested method, a privately gathered database of 5500 WCE images was used which achieved a classification accuracy of 99.42% and a precision rate of 99.51%. However, the authors obtained just three classification classes, namely: ulcers, bleeding, and health. Shichijo et al. (2017) proposed a deep learning CAD tool for the diagnosis of Helicobacter pylori (H. pylori) infection. The proposed framework employed two-step CNN models. In the first step, a 22-layer deep CNN was refined for classification (i.e., positive or negative) of H. pylori infection. Then, in step 2, another CNN was used to further classify the data set (EGD images) according to eight different anatomical locations to achieve an accuracy of 83.1%. Segu et al. (2016) introduced a deep CNN system for characterizing small intestine motility. This CNN-based method exploited the general representation of six different intestinal motility events by extracting deep features, which resulted in superior classification performance when compared to the other handcrafted feature-based methods. Although they achieved a high classification performance of 96% accuracy, they only considered a limited number of classes.  used machine learning methods to analyze lymph node metastasis in gastric cancer using gemstone spectral imaging. They used feature selection and metric learning methods to reduce data dimension and feature space, and then used the kNN classifier to distinguish lymph node metastasis from non-lymph node metastasis using 38 lymph node samples in gastric cancer to achieve an overall accuracy of 96.33%. Wang et al. (2015) proposed a system for detecting polyps in colonoscopy. It can generate an alert with real-time feedback during colonoscopy. The authors used visual elements and a rule-based classifier to detect polyp borders. The system achieved an accuracy of 97.7% for polyp detection.
A multiscale approach for detecting ulcers was introduced by Souaidi et al. (2019). The authors retrieved texture patterns such as the entire pyramids of LBP and Laplacian, and then classified them using SVM. They tested the system on two WCE datasets and found that it achieved accuracies of 95.11% and 93.88%.  proposed a framework for wireless CE images through integrated LBP and discrete wave transformation to extract texture characteristics from scale and rotation invariants. Finally, the selected characteristics were categorized using the SVM classifier.
By combining the bag-of-features (BoF) approach with the salience map, Yuan & Meng (2014) proposed an integrated polyp identification system. The bag-of-features approach uses a Scale-Invariant Feature Transformation feature vector with k-means clustering to characterize local features in the first stage. The saliency map was then used to generate a histogram, which was then used to extract saliency features. Finally, the BoF and primary attributes were sent into the SVM classifier, which was used to classify the data. For the detection of polyps and ulcers in the small intestine in CE, Karargyris & Bourbakis (2011) proposed a geometric and texture-based technique. The images were pre-processed with Log Gabor filters and the SUSAN edge detector, and geometrical features were retrieved to determine the polyp and ulcer area. To identify normal and pathological tissues,  employed discrete wavelet processing and a uniform local binary model (LBP) with SVM. Wavelet transform combines the ability of multi-resolution and uniform LBP analysis in this feature extraction approach to give robustness to lighting fluctuations, resulting in superior performance.
Most of the presented approaches have been designed to classify a limited number of gastrointestinal diseases in a specific portion of the human gastrointestinal tract. Moreover, feature extraction is a challenging task in medical imaging for recognition of infections because of additional and irrelevant features, or noises that degrades the overall recognition accuracy. Also, some gastrointestinal diseases are difficult to identify from the other disease types. Therefore, to deal with these challenges, in this paper, we employ different image processing stages. Finally, we used deep feature concatenation as a single feature vector by combining VGGNet and InceptionNet models as feature extractors, followed by a SVM classification.

Dataset acquisition
In this paper, we used the publicly available HyperKvasir Dataset (Borgli et al., 2020). The dataset is composed of endoscopic imagery from inside the GIT which has 23 different classes. It contains anatomical landmarks such as cecum, ileum, retroflex-rectum, pylorus, retroflex-stomach, z-line, and pathological findings such as hemorrhoids, polyps, ulcerative-colitis-grade-0-1, ulcerative colitis grade-1, ulcerative colitis grade-1-2, ulcerative colitis grade-2, ulcerative colitis grade-2-3, ulcerative colitis grade-3, Barrett's, short segment Barrett, esophagitis-a, esophagitis b-d, and quality of mucosal views which include bbps 0-1, bbps 2-3, and impacted stool and therapeutic interventions; dyed-lifted-polyps, dyed-resection-margins. In total, the dataset contains 10,662 labeled images stored using the JPEG format. Finally, the data is split into 70/30 which means 70% for training, 20% for testing, and 10% of the data for validation. Image samples of the various labeled classes for gastrointestinal disease are shown in Figure 1.

Proposed methodology
The proposed gastrointestinal disease classification model development comprises of different processes such as preprocessing, segmentation, feature extraction, and classification.

Preprocessing
Each endoscopy image is stored, and its quality might be degraded by different factors such as no uniform intensity, variations, motions, shift, and noise. Thus, it requires preprocessing techniques to reduce the disease identification complexity. In this study, we used image preprocessing techniques such as image resizing, and noise filtering techniques.
3.2.1.1. Image resizing. The dataset has images that have a uniform size of 100 × 100 resolution, but due to our computer performance, the image resolution of the original image needed to be resized before feeding it into the model. Therefore, the size of the image was resized to a resolution of 32 × 32.

Denoising.
Medical images play a vital role in disseminating information about polyps, cancers, hemorrhoids, and other diseases. A major challenge in the process of medical imaging is to obtain an image without the loss of any significant information. The images obtained were affected by noise, and this noise affects the classification accuracy of the proposed model. Basic digital image filters like Gaussian filter (GF), Median filter (MF), and Adaptive Median filter (AMF) (Guan et al., 2014) are common filters used to remove noise in medical images. In this study, we compared GF, MF, and AMF todetermine the most suitable filtering technique that improves the classification performance of the proposed model.
i. Adaptive median filter Adaptive median filter (AMF) is being applied widely as an advanced denoising technique compared with the traditional median filter because the AMF executes spatial processing to determine which pixels in the image has been affected by noise. It classifies pixels by comparison of each pixel in the image to its surrounding neighbor pixel. It preserves image details such as edges and smoothens non-impulsive noise, while the standard median filter does not (Ali, 2018). An illustration of the resultant image after using the noise removal technique is shown in Figure 2.

Data augmentation.
Deep learning has been used to classify several types of medical images over the past few years and has achieved a desirable performance. This was found to be true when there is an extensive amount of data that helps to attain the desired performance model. The limitations of the lack of training data are addressed through augmentation and transfer learning, which is very essential because of the inadequate medical datasets. Data augmentation enhances classification performance and prevents the overfitting issue since it offers a large amount of data (Shorten & Khoshgoftaar, 2019). In this paper, we applied rotation of 45°, width_shift_range = 0.2, height_shif-t_range = 0.2, and horizontal flip augmentation technique. After augmentation, the dataset is increased to 47,398 images.

Feature extraction
Feature extraction is a type of dimensionality reduction where a large number of pixels of an image are efficiently represented in such a way that essential parts of the image are captured effectively. For feature extraction, a concatenated model of VGGNet and the InceptionNet CNN architectures was used. The concatenation process of the two model features integrates the information from the different CNNs to create a more discriminative feature representation than when only the feature extracted from a single CNN model is used since different CNN architectures can capture diverse information of input images which increase their performance compared to 3.2.2.1. Feature extraction using the proposed VGGNet model. In this study, high-level features are extracted using the proposed VGGNet model. Despite this, feature extraction with CNN models consists of several similar steps because each step is made up of cascading layers: convolution, pooling, dropout, batch normalization, and dense layer, each of which represents features at different levels of abstraction. However, CNN models are different from each other in how these important layers are adjusted and also in how they form the network. The architectural design of the proposed VGGNet model used in this study is presented in Figure 3, while the parameters used for the proposed VGGNet architecture of the CNN model are presented in Table 1. The proposed VGGNet model comprises of six convolutions, two batch normalizations, three max-pooling, three dropouts, and one flattened layer. A dropout layer is inserted below each max pooling and dense layer to overcome overfitting. Finally, the extracted features are classified into 23 gastrointestinal disease types by Softmax, KNN, RF, and SVM classifiers. When we trained the model, we observed overfitting of the training data. During this time, the model performs well with the training set but didnt perform well with the validation or test set. To solve this challenge, we used data augmentation to increase the dataset and dropout operation for regularization.

3.2.2.2.
Feature extraction using the proposed InceptionNet model. The second model (the architectural design of the proposed InceptionNet model) used in this study is described in Figure 4, while the parameters used for the proposed InceptionNet architecture of the CNN model are presented in Table 2. The model comprises of six convolutions, one max-pooling, one global average polling, one batch normalization, one dropout, and one flattened layer. A dropout layer is inserted below the global average pooling and dense layer to overcome overfitting. Finally, the extracted features are classified into 23 gastrointestinal disease types using Softmax, KNN, RF, and
In this study, we compared four classifiers, Softmax, KNN, RF, and SVM to determine the best classifier for correctly identifying gastrointestinal diseases using the proposed model. A Radial Basis Function kernel-based SVM classifier with a concatenated feature extraction model was also proposed to efficiently classify the type of gastrointestinal disease. Support Vector Machine is a supervised classification algorithm presented in 1992 by Boser, Guyon, and Vapnik and it has been broadly utilized for classification, regression, detection tasks, and feature reduction (Paoletti et al., 2020). Support vector machine (SVM) tries to find the optimal hyperplane which can be used to classify new data points into two classes with an ideal isolating hyperplane by increasing the distance between the nearest vectors to the hyperplane (Aychew, 2020). After breaking down the multiclass classification problem into multiple binary classification problems, the same principle is used for multiclass classification as well as binary classification. Then, the algorithm can classify an instance as only one of three classes or more. Different SVM algorithms use different types of kernel functions like-Polynomial Kernel, Gaussian Kernel, Radial Basis Function (RBF), Sigmoid Kernel, and others. The most used type of kernel function is RBF because it has localized and finite responses along the entire x-axis (Salau & Jain, 2021). Because real noisy data cannot be linearly separated in this case, SVM is preferable because it can still be used for nonlinear classification problems by applying appropriate kernel functions that map the input data into a higher-dimensional space where the classes are linearly separable , and when the amount of available training data is small, the SVM classifier outperforms the softmax classifier for the CNN features (Ayalew et al., 2022). RBF is used in this study to solve our 23 class classification problem because it is the most commonly used SVM kernel function used by several researchers. We used C values of 1, 10, 100, and 1000, as well as gamma values of 0.001 and 0.0001. The best model performance was obtained when the c value was 1 and the gamma value was 0.001.

Convolutional neural network
Computer vision can be very useful in practical areas such as surgery, where it can improve surgeon performance by providing real-time contextual awareness, and endoscopy, where it can provide proper therapy after surgery (Maier-hein et al., 2017). Machine perception covers a range of levels, from low-level tasks like identifying edges, to high-level tasks like understanding full scenes because of the maturation of deep learning and the supply of large datasets labeled to train these algorithms Deng et al., 2009). When there is a lack of domain understanding for feature extraction, CNN is preferable because useful learned features are extracted automatically rather than handcrafted based algorithms that manually define and extract a set of features such as edge detection, corner detection, histograms, and so on. However, this approach does not guarantee that the number of corners is a good descriptor for classifying images (Noor Mohamed & Srinivasan, 2020; Lin et al., 2020). Deep Learning is now widely used to solve complex problems like image classification, natural language processing, and voice recognition. RBM, Autoencoder, DBN, CNN, and RNN are the most important deep learning algorithms (Shetty & Siddiqa, 2019). According to Guo et al. (2016), CNN-based architectures are most widely used for image classification problems. CNN is currently the most widely used machine learning algorithm in medical image analysis because it preserves the spatial relationships when filtering input images, which is crucial in endoscopy, for example, where a normal colon tissue meets a cancerous tissue . CNN employs three basic types of neural layers: convolutional layers, pooling layers, and fully connected layers. Convolutional Layers: In the convolutional layers, CNN uses different portions of the image as computational intelligence to convolve the full image and produce different feature maps. Pooling layers: Pooling layers are in charge of reducing the spatial measurements (width, height) of the information volume for the next convolutional layer. This layer's activity is also known as subsampling or downsampling, because the decrease in size results in a loss of data. Nonetheless, such loss is beneficial to the network because the reduction in size results in less computational overhead for the network's subsequent layers, as well as the elimination of overfitting. The most commonly used techniques are average pooling and maximum pooling. Furthermore, there are various varieties of the pooling layers in literature, each propelled by different inspirations and serving specific needs, for example, stochastic pooling, spatial pyramid pooling, and def-pooling. Fully Connected Layers: After a few convolutional and pooling layers, the neural network's important level of thinking is carried out through completely associated layers. As the name implies, neurons in a completely associated layer have complete associations with all representations in the previous layer. Their actuation can then be achieved by duplicating the framework followed by an inclination counterbalance. Finally, fully associated layers convert the two-dimensional component maps to a one-dimensional element vector. The obtained vector could either be carried forward into a specific number of classes for classification or considered as a component vector for additional preparation (Escobar et al., 2021). Pre-trained models are a type of deep learning models that have been trained on a large dataset to solve classification problems (Al-Adhaileh et al., 2021). VGGNet, ResNet, Inception, and Xception are some of the most recent basic deep learning structures for CNN. Simonyan and Zisserman introduced VGGNet in 2014. VGGNet was the first runner-up design in the characterization task of the ImageNet Large Scale Visual Recognition Competition (Simonyan & Zisserman, 2015;Sakib et al., 2018). ResNet CNN's architecture demonstrates exceptional fundamental use of skip connections and batch normalization. Also, at the network's end, there are no fully connected layers. The main flaw of this system is that due to the vastness of the boundaries, it is expensive to use. In any case, ResNet is currently regarded as the best convolutional neural network model, having won the ILSVRC 2015 challenge. It is worth noting that, despite having many more layers than the VGG, it requires much less memory, nearly 5 times less. This is because, instead of dense layers in the classification phase, this network employs GlobalAveragepooling, a type of layer that converts the final layer in the feature extraction stage's two-dimensional activity maps to an n-class vector that is used to calculate the probability of belonging to each class. Szegedy et al. (2015) proposed Inception, a deep convolutional neural network design that is responsible for setting a new critical edge for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge in 2014 (ILSVRC14). The main advantage of this strategy is that it adds a critical component to a minor increase in computer requirements when compared to smaller and smaller networks. Chollet (2017) designed Xception, which has 36 convolutional layers that serve as the foundation for extracting network features. The Xception architecture is made up of a linear stack of depth-separable convolution layers with residual connections. It is an adaptation of Inception, with deep-separable convolutions replacing the Inception modules. Furthermore, it has approximately the same number of parameters as Inception-v1. We used InceptionNet and VGGNet because both performed well in the 2014 ImageNet challenge, and InceptionNet gained attention due to its architecture based on modules called Inception.

Proposed CNN model
The proposed model aims to accurately diagnose gastrointestinal disease using concatenated deep features extracted from endoscopy images of two different models (CNN architecture of VGGNet and InceptionNet). Initially, the input images are appropriately prepared by resizing and noise filtering processes. The features extracted from endoscopy images are then extracted using the proposed VGGNet model. The proposed InceptionNet model extracts features from the same images in the other model. Finally, the features extracted from these models are combined into a single classification descriptor. Figure 5 depicts the proposed concatenated model architecture, which concatenates the features extracted by the two flattened models (CNN architecture of VGGNet and InceptionNet).

Architecture of the proposed model
The overall steps followed to identify the corresponding class of a particular input image are presented in the proposed gastrointestinal disease classification model architecture as shown in Figure 5. There are three main phases in the systems architecture. The training phase, the testing phase, and the validation phase. The training phase and validation phase are performed before the testing phase for training the model. The testing phase is performed after the training phase and

Transfer learning
Transfer learning is the process of improving learning in a new task by transferring knowledge from a previously learned related task. It is implemented by slicing off the final output layers of a pretrained model and replacing them with output layers specific to the new target output. Pre-trained models, on the other hand, have been trained on an ImageNet dataset that contains 1000 classes of images, such as animals and cars. Because models trained on the ImageNet dataset learn features that are completely different from medical image features, fine-tuning the models for medical tasks may not have a significant impact on their performance.

(i)Learning method
The proposed concatenated model is evaluated in terms of accuracy and loss function. To measure the difference between the output predictions and the target output, we used a categorical crossentropy loss function, which is commonly used for multi-class problems. Adaptive momentum estimation (Adam) optimization function with an initial learning rate of 0.001 was used to adjust the learning rate and other parameters to increase the accuracy of the model and to decrease the loss. The classified classes are compared in accordance with the classification's performance parameters. The confusion matrix is used to calculate the parameters performance. According to Ayalew et al. (2022;Salau (2021), each performance measure parameter is computed as follows: • False-Positive Rate (FPR): indicates that the diseases are not detected correctly.
• False-Negative Rate (FNR): indicates that the diseases are detected as non-disease regions.
• True Positive Rate (TPR): indicates that the diseases are detected as disease regions.
• True Negative Rate (TNR): indicates that a non-disease region is correctly recognized as a nondisease region.
• Confusion matrix (CM): CM is used to find the correctness and accuracy of the model. It is obtained using parameters such as TP, TN, FP, and FN of the matrix.
• Accuracy: Accuracy is the number of correct predictions divided by the total number of samples. The formula for accuracy is given as follows: (1) • Precision: It is a measure that indicates the proportion of patients that were diagnosed as having a disease. The predicted positives (People predicted with a disease are TP and FP) and the people having a disease are TP as follows: (2) • Recall (Sensitivity): It is a measure that indicates the proportion of patients that have a disease and were diagnosed by the model as having a disease. The actual positives (People having a disease are TP and FN) and the people diagnosed by the model having a disease are TP as follows: • Dice similarity coefficient (F1-score): It represents a harmonic average of the precision and sensitivity and measures the similarity rate between predicted and ground-truth regions. It is calulated using Eq. (4): • Matthews Correlation Coefficient (MCC): accuracy is sensitive to class imbalance. Precision, recall, and F1-score are asymmetric, but MCC is perfectly symmetric; therefore, no class is more important than the other. It measures the classification quality even if the classes are of different sizes, true positives, and false positives and negatives as follows:

Experimental results and discussion
We compared different filtering and segmentation techniques, as well as different state-of-the-art classifiers and CNN pre-trained models with the proposed concatenated CNN model.

Experimental setup
Experiments were performed with the Keras prototype on the tensor flow backend. The performance of the proposed concatenated model is measured using accuracy, confusion matrix, precision, recall, f1-score, and Matthews correlation coefficient metrics of the classification. The whole image dataset before augmentation (10,662 images) or after augmentation (47,398 images) is partitioned into training, validation, and testing. Image augmentation is required to enable the trained model to extract a large number of important features. After augmentation, we used a dataset of 70% (33,944 images) for training, 20% (9610 images) for testing, and 10% (3844 images) for validation, with no overlap from the original images. The same splitting method was also applied to the dataset for the experiments which were performed before augmentation. For this study, we used an anaconda environment, python programming language, a free-source Keras and a free-source CNN library using Tensor Flow which served as the backend. The python code was written using Jupiter Notebook. The designed prototype was tested on the Windows 10 Hp Pavilion laptop with 64-bit operating system. The laptop is an Intel(R) Core (TM) i3-8145UMQ with 2.30 GHz CPU processor, with 8 GB RAM and 700 GB hard disk capacity.

Experiments before augmentation
In this stage, we trained the proposed concatenated model for 75 epochs, a batch size of 32, and an initial learning rate of 0.001. With the end-to-end proposed concatenated CNN model using a softmax classifier, we performed seven experiments; the first experiment was performed to obtain the result before noise filtering and segmentation techniques; the second and third experiment was performed to compare the three filtering techniques (gaussian filtering, median filtering, and adaptive median filtering). We have achieved 70.8%, 68.6%, and 72.7% accuracy, when we used GF, MF, and AMF, respectively. Since AMF achieves superior results, we employed AMF for the rest of the experiments and the last three experiments were performed for viewing the effect of the three segmentation techniques (we compared the otsu, adaptive thresholding, and kmeans segmentation techniques). We achieved 54.7%, 52.3%, and 56.3% accuracy when we used otsu, adaptive thresholding, and k-means, respectively. Since both segmentation techniques achieved low accuracies, we did not use segmentation techniques for the rest of the experiment. The results are shown in Table 3.

Experiments after augmentation
In this stage, we conducted 14 experiments. We trained the proposed concatenated model using 75 epochs, a batch size of 32, and an initial learning rate of 0.001. The first four experiments were performed to compare the classifiers using the VGGNet feature extraction method. The results show that 93%, 93.2%, 93.3%, and 93.4% accuracy was achieved when we used Softmax, KNN, RF, and SVM, respectively. The second four experiments were performed to compare classifiers using the InceptionNet feature extraction method. The results show that 84%, 84.5%, 84.7%, and 85% accuracy was achieved when we used Softmax, KNN, RF, and SVM, respectively. The third and fourth experiments were performed to compare classifiers using concatenated VGGNet-InceptionNet feature extraction method. We achieved 94%, 95.4%, 95.7%, and 98% accuracy when we used Softmax, KNN, RF, and SVM, respectively. The last two experiments were used to demonstrate the performance of the pre-trained VGGNet16 and pre-trained Inceptionv3. The models achieve an accuracy of 89% and 91%, respectively. The results are shown in Table 4.

Performance of the proposed concatenated model using the SVM classifier
The results of the experiments performed to measure the training accuracy, validation accuracy, testing accuracy, and Matthews's correlation coefficient of the proposed concatenated CNN model with SVM classifier are presented in Table 5. The experiments were performed using the concatenated model with a resized and a filtered (AMF technique) image dataset and SVM classifier. The experiments were performed using 75 epochs with a batch size of 32. We train the model with 33,944 training images and 3844 validation images, and we test with 9610 images. The model validation is

Result discussion
The dataset was split into training, validation, and testing dataset for the proposed concatenated model training and the performance of the proposed concatenated model was tested using the dataset samples. To classify the gastrointestinal images into 23 classes, we have concatenated VGGNet and InceptionNet features for feature extraction and SVM for classification. During our  experiment, we confirmed that the prediction ability of the model is increased when we used the concatenated features of the proposed VGGNet and InceptionNet models. Before using noise filtering and augmentation techniques, the accuracy of the model is 71.8%, and after noise removal, the accuracy of the model increased by 0.9% (72.7%). We obtained the best classification accuracy with the adaptive median noise filtering technique. We also applied the augmentation technique to control overfitting and to learn different features of images for the proposed concatenated model that helps to increase its classification ability. After augmentation, the accuracy of the proposed endto-end model is 94%. We also applied different segmentation techniques to identify regions of interest in the image; however, these segmentation techniques result in the low classification performance of the model. Also, when the accuracy of each model was examined with the SVM classifier after the features were extracted with each of the feature extractors individually, the results show that VGGNet and InceptionNet have accuracies of 85% and 93.4%, respectively. Afterwards, the features of VGGNet and InceptionNet were concatenated which results in an improved classification performance of 98% accuracy. In this study, satisfactory results were obtained after we conducted 21 experiments. Finally, a comparison of state-of-the-art classifiers was performed using the proposed concatenated model. Our proposed concatenated model with SVM classifier achieves a training accuracy of 98.7%, validation accuracy of 98.2%, testing accuracy of 98%, and Matthews's correlation coefficient of 97.8%. The results of the proposed concatenated model outperformed the proposed VGGNet, the proposed InceptionNet, and the state-of-the-art pre-trained VGGNet16 and Inceptionv3 models. Generally, based on the experiments, our proposed model improves Matthews's correlation coefficient of the previous work by 7.6%. This indicates that the proposed concatenated model is effective for gastrointestinal disease diagnosis from endoscopy images. A comparative analysis of our proposed method with existing works is presented in Table 6. The results show that the proposed method achieves a high accuracy as compared to other methods.

Conclusion
The digestive system consists of the gastrointestinal tract and other organs which help the body to break down and absorb food. The gastrointestinal tract is affected by a variety of diseases which makes it not to function properly. The domain of gastrointestinal endoscopy includes the endoscopic diagnosis of various digestive diseases using image analysis and various devices. Endoscopy is currently the preferred method for examining the gastrointestinal tract; however, its effectiveness is severely limited by the variation of the models performance. This study presents a concatenated model by concatenating features of VGGNet and InceptionNet for the classification of multiple gastrointestinal diseases from endoscopy images. The proposed model was trained and tested using a publicly available dataset from the HyperKvasir database. Extracted features were used to train the Softmax, KNN, RF, and SVM classifiers. The model achieves a test accuracy of 98% using the SVM classifier, which is an incredible performance when compared to state-of-the-art approaches. It outperforms the previous similar works by 7.6%, with Matthews' correlation coefficient of 90.2%. As a result, the proposed model is one of the best gastrointestinal disease classification models. Based on the experimental outcomes of this study, we suggest the following recommendation as future work: • Segmentation plays a critical role in the early and accurate diagnosis of various diseases using computerized systems. For the future, we recommend that a new simultaneous segmentation method can be used by taking advantage of MASK-RCNN for detection of the disease location and GRABCUT for segmentation to improve the performance of the model.
• This study was performed for only 23 gastrointestinal disease types; therefore, other types of gastrointestinal disease can also be considered.
• In future works, it is important to evaluate different feature selection algorithms that can help to determine the smallest subset of features that can aid in the accurate classification of gastrointestinal disease types.