Pneumonia detection in chest X-ray images using compound scaled deep learning model

Pneumonia is the leading cause of death worldwide for children under 5 years of age. For pneumonia diagnosis, chest X-rays are examined by trained radiologists; however, this process is tedious and time-consuming. Biomedical image diagnosis techniques show great potential in medical image examination. In this paper, a model for the identification of pneumonia, trained on chest X-ray images, is proposed. The compound scaled ResNet50, an upscaled version of ResNet50, is used. ResNet50 is a multilayer convolutional neural network with residual blocks. As it was very difficult to obtain a sufficiently large dataset for the detection task, data augmentation techniques were used to enlarge the training dataset, and transfer learning was used while training the models. The proposed model could help in detecting the disease and can assist radiologists in their clinical decision-making process. The model was evaluated and statistically validated against overfitting and generalization errors. Different scores, such as testing accuracy, F1, recall, precision and AUC, were computed to check the efficacy of the proposed model. The proposed model attained a test accuracy of 98.14% and an AUC score of 99.71 on the test data from the Guangzhou Women and Children’s Medical Center pneumonia dataset.


Introduction
Pneumonia is an inflammatory disease of the lungs, mainly caused by pathogens such as bacteria and viruses. It is one of the primary causes of death in countries such as the United States [1] and India [2]. In older patients, it often goes unnoticed and untreated until it has reached a life-threatening stage. Pneumonia is the primary cause of death in children below the age of 5 [3]. According to the WHO, it kills an estimated 1.4 million children under the age of five every year, accounting for 18% of all deaths of children under five worldwide. Pneumonia affects children and families everywhere but is most prevalent in South Asia and sub-Saharan Africa. Children can be protected from pneumonia: it can be prevented with simple interventions and treated with low-cost, low-tech medication and care [2].
It is a very common disease across the globe. Due to poverty, many people lack access to trained radiologists. For a disease as fatal as pneumonia, the level of accuracy in the diagnosis must be good enough to assure proper treatment [4]. Hence, to reduce the mortality of pneumonia, there is a need for research in the field of computer-aided diagnosis (CAD). There are many tests for pneumonia diagnosis, such as chest ultrasound, chest MRI, chest X-ray, computed tomography of the lungs and needle biopsy of the lung [5]. X-rays are the most widely available diagnostic imaging technique [6]. The examination of chest X-rays is a difficult task even for radiologists: sometimes the appearance of pneumonia in a patient's X-ray is very unclear, which makes it difficult to identify the features that indicate the presence of the disease. This is the primary reason behind the misclassification of the X-ray images in the dataset. Several CAD systems have already been demonstrated to aid the medical field, primarily in the detection of lung nodules, breast cancer, etc. Deep learning remains the most promising and widely used Machine Learning (ML) technique for radiology in particular and disease detection in general. Deep learning techniques have already proved their effectiveness, and their disease prediction accuracy matches that of an average radiologist [7]. Presently, deep learning-based methods cannot be used as a replacement for trained clinicians in medical examination. Hence, computer-aided diagnosis techniques based on deep learning can be used as a supplement in the clinical decision-making process.
The rest of the paper is structured as follows: Section 2 discusses the related works. The methods and the dataset used in this paper are described in Section 3. The proposed methodology and the results are discussed in Sections 4 and 5, respectively. Finally, Section 6 concludes the paper.

Related work
Several biomedical image detection techniques have already been proposed by different authors. The authors in [8] discussed the identification of pneumonia. Razaak et al. addressed the challenges of methods for medical imaging [9]. Many authors have proposed methods for the detection of numerous diseases [10]. For example, Andre et al. presented a deep CNN based on the Inception v3 architecture for skin cancer classification [11]; Milletari et al. worked on a CNN-based method for detecting the prostate in MRI volumes [12]; Grewal et al. used deep learning for detecting brain haemorrhage in computed tomography scans [13]; and Varun et al. worked on a method for the detection of diabetic retinopathy [14]. Lakhani et al. proposed an algorithm based on convolutional neural networks (CNNs) for the automated classification of pulmonary tuberculosis [15]. Bar et al. also published a paper in which deep neural networks (DNNs) were used for chest pathology detection [16]. CNNs are a substantial advancement over plain DNNs, as they can operate directly on 2-D and 3-D images and extract the features needed to classify the disease; efficient pooling layers and shared weights make this possible. However, since CNNs rely on gradient-based learning, deeper networks must also cope with the problem of diminishing gradients during training. CNNs rely on the features extracted by their successive layers to identify chest X-rays and classify them accurately. Algorithms were devised earlier in the papers of Shin et al. for data mining [17,18], and the paper of Boussaid et al. proposed that labels be predicted by feature extraction and segmentation techniques applied to chest X-ray radiology images [19]. Methods for disease diagnosis using chest X-rays were also investigated earlier by authors such as Avni, Melendez and Jaeger, who performed various examination techniques [20-22].
After the examination, a classification algorithm is applied to the chest X-ray view, as discussed by Xue [23]. Roth et al. [24] proposed an algorithm for lymph node detection in computed tomography scans using a 2.5D resampling approach and a deep CNN. The lung area is first segregated from the whole chest X-ray, and the network then performs various operations on it, looking for features that identify the disease. Rajaraman et al. [25] presented a visualization strategy for localizing the regions of interest in chest X-rays, using a customized VGG16 model. Togacar et al. [26] combined deep features from several deep models to form an efficient deep feature set and then used algorithms such as k-nearest neighbours, decision trees and linear discriminant analysis for classifying the deep feature set for pneumonia detection. Lakhani et al. [15] used the GoogLeNet and AlexNet CNNs and reported an area under the curve (AUC) of 0.94-0.95. Ayan et al. [27] and Rahman et al. [28] used CNNs such as VGG16, Xception, DenseNet201 and SqueezeNet for pneumonia classification. Vikash et al. [29] proposed a model based on an ensemble architecture that combined outputs from several pre-trained models for pneumonia diagnosis. Jaiswal et al. [30] used a deep learning model based on Mask R-CNN, which fuses local and global features for pixel-wise segmentation, for pneumonia classification. Pezeshk et al. [31] proposed the use of 3-D CNNs for the fast screening of pulmonary nodules in chest CT. Our previous work [32] used an ensemble-like model that integrated five deep learning models (ResNet18, Xception, InceptionV3, DenseNet121 and MobileNetV3).
The main contribution of this paper is a compound scaled version of ResNet50 for pneumonia detection, which requires fewer computational resources compared to our previous work [32]. First, the ResNet50 architecture was used for binary pneumonia classification. Then the ResNet50 architecture was scaled up using the compound scaling method introduced by Tan et al. [33] and was used for pneumonia detection. A comparative analysis of the two architectures for pneumonia detection was performed. Finally, class activation maps were plotted to explain the results of the deep learning models.

Convolutional neural network
CNNs are arguably the most established deep learning architecture. LeCun et al. [34] first proposed the use of CNNs for the recognition of handwritten Zip Codes. CNNs belong to the category of feed-forward networks. Their primary advantage is that they can exploit the locality of features. The central part of the CNN is the convolutional layer, which gives the network its name and extracts the features from an input image. The pooling layer reduces the dimensionality of each input map while retaining the important information. The network architecture consists of several convolutional and pooling layers in series, as shown in Figure 1. At the end of the network, the softmax layer is used for image classification. The main drawback of deeper CNNs is vanishing gradients.
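As a concrete illustration of these two building blocks, the following minimal NumPy sketch (not the paper's Keras implementation) applies a single convolution kernel and a max-pooling step to a toy single-channel image; the kernel values are arbitrary and chosen only for demonstration.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation) of a single-channel image."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling: divides each spatial dimension by `size`."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)       # toy 6x6 "image"
edge_kernel = np.array([[1., -1.], [1., -1.]])       # simple vertical-edge detector
features = max_pool2d(conv2d(img, edge_kernel))
print(features.shape)  # (2, 2): 6x6 -> conv -> 5x5 -> pool -> 2x2
```

Stacking many such convolution and pooling stages, followed by a softmax layer, gives the basic CNN pipeline shown in Figure 1.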

Residual networks
Kaiming He et al. of Microsoft Research [35] proposed residual networks to solve the problem of vanishing gradients. The central idea of the residual block is to compute the output from both the transformation F(x) produced by the stacked layers and the original input x. This residual block can be seen in Figure 2.
The mapping F(x) + x can be realized by feed-forward neural networks with "bypass connections". One or more intermediate layers can be skipped using these bypass connections. The outputs of the bypass connection and of the stacked layers are summed together. The network can then be trained using SGD with backpropagation.
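A minimal NumPy sketch of this idea (a toy fully connected residual block, not the actual ResNet50 convolutional block) shows how the bypass connection adds the input back to the stacked layers' output:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    """y = ReLU(F(x) + x), where F is two weight layers with a ReLU between.

    The identity 'bypass connection' adds the input x back to the stacked
    layers' output, so gradients can flow through the addition unchanged.
    """
    fx = relu(x @ w1) @ w2      # F(x): two linear layers with one ReLU
    return relu(fx + x)         # summation with the bypass connection

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))
w1 = rng.normal(size=(8, 8)) * 0.1
w2 = rng.normal(size=(8, 8)) * 0.1
y = residual_block(x, w1, w2)

# With zero weights, F(x) = 0 and the block reduces to the identity (after ReLU):
y_id = residual_block(x, np.zeros((8, 8)), np.zeros((8, 8)))
print(np.allclose(y_id, relu(x)))  # True
```

The zero-weight case illustrates why residual blocks are easy to optimize: the identity mapping is trivially representable, so extra depth cannot easily hurt.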

Transfer learning
CNNs generally require large datasets to train on; when trained on smaller datasets they generalize poorly. Transfer learning can be used in such cases. Figure 3 shows the process of transfer learning, where the knowledge gained by a model while solving one problem is reused for solving another.
In medical image processing problems, sufficiently large datasets are often not available. Therefore, training a model without proper initialization of weights is expensive in terms of computational resources and yields poor results. Transfer learning was used with both ResNet50 and the compound scaled version of ResNet50. Initializing the weights using transfer learning is effective because the model is already trained to detect generic features, such as edges, in an image.
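The idea can be sketched with a toy NumPy example: a "backbone" with frozen stand-in weights (playing the role of a pre-trained feature extractor) produces features, and only a small new classification head is trained on the target task. All names, weights and data here are illustrative, not the paper's Keras pipeline.

```python
import numpy as np

def backbone(x, w_frozen):
    """Stand-in for a feature extractor pre-trained on a source task."""
    return np.maximum(0.0, x @ w_frozen)   # weights stay frozen throughout

def log_loss(feats, labels, w):
    """Binary cross-entropy of a linear head with sigmoid output."""
    p = 1.0 / (1.0 + np.exp(-(feats @ w)))
    return -np.mean(labels * np.log(p + 1e-12) + (1 - labels) * np.log(1 - p + 1e-12))

def train_head(feats, labels, lr=0.01, epochs=500):
    """Train only the new classification head on top of the frozen features."""
    w = np.zeros(feats.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(feats @ w)))
        w -= lr * feats.T @ (p - labels) / len(labels)   # logistic-loss gradient
    return w

rng = np.random.default_rng(1)
w_frozen = rng.normal(size=(4, 6)) * 0.5     # "pre-trained" weights, never updated
x = rng.normal(size=(64, 4))                 # toy target-task inputs
labels = (x[:, 0] > 0).astype(float)
feats = backbone(x, w_frozen)
w_head = train_head(feats, labels)
```

Only the head's weights are updated; in this paper, by contrast, every layer was eventually unfrozen and fine-tuned, with transfer learning serving as the initialization.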

Dataset
There were a total of 5856 images in the original dataset from the Guangzhou Women and Children's Medical Center [36], comprising both healthy and pneumonia cases: 1583 healthy chest X-ray images and 4273 pneumonia-infected chest X-ray images. The entire dataset was rearranged into training and test sets, with 700 images in the test set and 5156 images in the training set. Two sample images from the dataset are shown in Figure 4.

Proposed methodology
An efficient algorithm for pneumonia detection from chest X-rays is proposed in this paper. Data augmentation techniques were deployed to increase the size of the limited dataset. A pre-trained ResNet50 architecture was fine-tuned for the pneumonia classification task. Then, the ResNet50 architecture was scaled up using compound scaling, as introduced in [33]. Figure 5 shows the block diagram of the proposed methodology. The techniques used in the proposed methodology are explained in the subsequent sections.

Pre-processing and data augmentation
First, the input images were resized to 224 × 224 and used to train the model. Effective training of a neural network requires a large dataset; when deep networks are trained on a smaller dataset, they cannot generalize, and testing accuracy suffers. Data augmentation is one solution to this problem: it utilizes the existing dataset efficiently and expands it. In the training dataset used in this paper, there were a total of 1283 healthy chest X-ray images and 3873 pneumonia-infected chest X-ray images. The dataset had enough images of the pneumonia case, so only the images of the normal case were augmented, twice each. The settings used to create the augmented images (Figure 6) are shown in Table 1. After augmentation, there were 3849 normal images and 3873 pneumonia images. The test set images were not augmented.
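The following NumPy sketch illustrates augmenting only the minority (normal) class twice per image; the flip and shift transforms are illustrative stand-ins and do not reproduce the exact settings of Table 1.

```python
import numpy as np

def augment(image, rng):
    """Produce one randomly perturbed copy of a chest-X-ray-like array.

    A horizontal flip and a small translation stand in for the augmentation
    settings of Table 1 (which this sketch does not reproduce).
    """
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1]              # horizontal flip
    shift = int(rng.integers(-2, 3))    # small horizontal shift
    return np.roll(out, shift, axis=1)

rng = np.random.default_rng(42)
normal_images = [np.ones((8, 8)) * i for i in range(3)]   # placeholder "normal" class
# Augment the minority class twice per image, as done for the normal images:
augmented = normal_images + [augment(im, rng) for im in normal_images for _ in range(2)]
print(len(augmented))  # 3 originals + 6 augmented copies = 9
```

Tripling the minority class in this way is what brings the 1283 normal images close to parity with the 3873 pneumonia images (3849 vs. 3873 after augmentation).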

Compound scaling
Scaling of CNNs can be done across three dimensions: depth, width and resolution. The depth of a CNN is the number of layers it has, the width refers to the number of channels in the convolutional layers and the resolution refers to the resolution of the image passed to the CNN. Scaling can be used to achieve greater performance on a particular task and to create more efficient models. Figure 7 illustrates scaling across the different dimensions. Scaling up any single dimension of the network (width, depth or resolution) improves accuracy, but the accuracy gain diminishes for bigger models. While scaling CNNs, it is therefore critical to balance all dimensions of the network (width, depth and resolution) to obtain improved accuracy and efficiency. Tan et al. [33] proposed the compound scaling technique in their network, EfficientNet. The scaling of a CNN can be expressed mathematically as Equation (1):

depth: d = α^φ,  width: w = β^φ,  resolution: r = γ^φ  (1)

such that α · β² · γ² ≈ 2 and α ≥ 1, β ≥ 1, γ ≥ 1. Here, φ is a user-specified coefficient that controls the resources available, while α, β and γ specify how to allocate these resources to network depth, width and resolution, respectively. In a CNN, the convolutional layers are the most computationally expensive part of the network, and the FLOPS of a convolution operation are almost proportional to d, w² and r². Scaling the network using Equation (1) will therefore increase the total FLOPS by (α · β² · γ²)^φ. The constraint α · β² · γ² ≈ 2 was applied so that the total FLOPS do not exceed roughly 2^φ. The constants α, β and γ can be determined using grid search, and the compound coefficient φ can then be increased to obtain larger but more accurate models. In this work, the ResNet50 model was scaled up with d = 1.4, w = 1.2 and r = 1.3.
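The arithmetic of compound scaling can be sketched in a few lines of Python. The base dimensions below (50 layers, 64 base channels, a 224-pixel input) are nominal illustrative values, and applying the multipliers used in this work directly gives the approximate FLOPS growth d · w² · r²:

```python
# Compound scaling multiplies network depth (d), width (w) and input
# resolution (r) together rather than scaling one dimension alone.
d, w, r = 1.4, 1.2, 1.3   # multipliers used to scale ResNet50 in this work

# Conv-layer FLOPS are roughly proportional to d * w^2 * r^2, so these
# multipliers increase the compute cost by about this factor:
flops_factor = d * w ** 2 * r ** 2

def scale_resnet50(base_layers=50, base_channels=64, base_resolution=224):
    """Apply the compound multipliers to nominal ResNet50 base dimensions."""
    return (round(base_layers * d),          # deeper: more layers
            round(base_channels * w),        # wider: more channels per conv
            round(base_resolution * r))      # higher-resolution input

print(scale_resnet50(), round(flops_factor, 2))  # (70, 77, 291) 3.41
```

Note that scaling all three dimensions at once compounds the cost: these multipliers increase FLOPS by roughly 3.4×, which is the price paid for the accuracy gain over the unscaled ResNet50.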

Fine-tuning the models
ResNet50, pre-trained on the ImageNet dataset [37], is used as the base architecture. Figure 8 and Table 2 show the framework of the network and its details. The network has five stages. The first stage has only one convolutional layer. The second, third, fourth and fifth stages have 9, 12, 18 and 9 convolutional layers, respectively, with each bottleneck block stacking 1 × 1, 3 × 3 and 1 × 1 convolutions.
After adding a dense layer with two output classes to this imported architecture, every layer of the network was unfrozen. Stochastic gradient descent (SGD) was used as the optimizer and the model was trained for 50 epochs. Different learning rates were tried while training the model, yielding different results; the best performance was achieved with a learning rate of 0.001. Then the scaled version of ResNet50 was created, as discussed in Section 3.2.

Class activation maps
One of the problems with deep neural networks is that they are black-box models, and it is difficult to explain their final predictions. Class activation maps (CAMs) are one solution to this problem. Interpreting the results of the model is necessary for clinical decision-making and can also help in understanding the points of failure of the model. There are several ways to generate CAMs. In this paper, a global average pooling (GAP) layer was first attached to the model, followed by a softmax layer that gives the final prediction. The network can be seen in Figure 5. CAMs were generated for the ResNet50 model and the compound scaled ResNet50 model, and the results are presented in Section 5.
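With this GAP-plus-softmax setup, a CAM is simply a weighted sum of the final convolutional feature maps, using the output-layer weights of the chosen class. The following NumPy sketch (with toy random activations and weights, not the trained model's) illustrates the computation:

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """CAM for one class: weighted sum of the final conv feature maps.

    feature_maps:  (H, W, C) activations before global average pooling.
    class_weights: (C,) weights connecting the GAP output to the class unit.
    """
    return np.tensordot(feature_maps, class_weights, axes=([2], [0]))

rng = np.random.default_rng(7)
fmaps = rng.random((7, 7, 16))       # toy final-conv activations
w_pneumonia = rng.random(16)         # toy weights for the "pneumonia" unit
cam = class_activation_map(fmaps, w_pneumonia)
print(cam.shape)  # (7, 7) heat map, upsampled to image size for display
```

Because GAP and the weighted sum are both linear, the class score equals the spatial mean of the CAM, which is what makes the CAM faithful to the model's prediction.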

Experiment results and discussion
The details of the experiments conducted to test the proposed architecture are presented in this section. The Keras open-source framework with TensorFlow as the backend was used to implement the deep learning networks. Computation was done on a system with 16 GB RAM, an NVIDIA Quadro 4000 graphics card with 2 GB GDDR5 GPU memory and an Intel i7 7th-generation processor.

Testing accuracy and testing loss
Each experiment was conducted several times to examine the proposed model's performance. The best results were obtained with a learning rate of 0.001 and SGD as the optimizer. The testing accuracy achieved by the ResNet50 architecture was 97.85% and the testing loss was 0.070. With 0.1 as the learning rate, the model was unable to learn anything, and the testing and training accuracy remained stuck at 50%. Table 3 shows the testing accuracy achieved by ResNet50 with different learning rates. Finally, with the compound scaled ResNet50 architecture and the same hyperparameters, the testing accuracy increased to 98.14% and the testing loss decreased to 0.061. Training accuracy and loss graphs are shown in Figure 9. The models were trained for 50 epochs in every experiment. The average inference time was 0.042 s for both models.
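This sensitivity to the learning rate can be illustrated on a toy convex problem: with plain gradient descent on f(w) = w², a small step size converges toward the minimum while an overly large one makes the iterates grow. This is only an analogy for why an unsuitably large rate can prevent the network from learning, not the actual training dynamics:

```python
def sgd_path(lr, steps=50, w0=1.0):
    """Run `steps` gradient-descent updates on f(w) = w^2 (gradient = 2w)."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

# A small learning rate shrinks w toward the minimum at 0 ...
small = sgd_path(lr=0.001)
# ... while a step size above 1.0 (for this loss) makes the iterates diverge.
large = sgd_path(lr=1.1)
print(abs(small) < 1.0, abs(large) > 1.0)  # True True
```

For this quadratic the stability threshold is lr = 1.0; real networks have no such closed-form threshold, which is why the learning rate was tuned empirically (Table 3).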

Performance analysis
Accuracy, recall and precision scores for the two models, ResNet50 and the compound scaled ResNet50, were calculated. Accuracy is the ratio of correct predictions to the total number of predictions (Equation (3)). Precision measures how precise the model is in predicting the true labels (Equation (4)). Recall is the ratio of true positives to the sum of true positives and false negatives (Equation (5)). To calculate these scores, the confusion matrices for both ResNet50 and compound scaled ResNet50 were obtained (Figures 10 and 11, respectively). The confusion matrix provides the false positives, true negatives, true positives and false negatives, which in turn are used to analyse the performance of the model and calculate the above scores.
Generally, as recall increases, precision decreases and vice versa. The F1 score, the harmonic mean of precision and recall, can be calculated to find the optimal blend of the two.
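All four scores follow directly from the confusion-matrix counts. The sketch below computes them for purely illustrative counts (not the actual values in Figures 10 and 11):

```python
def scores(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts only, not taken from the paper's confusion matrices:
acc, prec, rec, f1 = scores(tp=90, tn=85, fp=5, fn=10)
print(round(acc, 4), round(prec, 4), round(rec, 4), round(f1, 4))
# 0.9211 0.9474 0.9 0.9231
```

Note how the F1 score (0.9231) sits between the precision (0.9474) and the recall (0.9), penalizing whichever of the two is lower.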
The F1 score of the ResNet50 model was 99.50, and that of the compound scaled ResNet50 model was 99.71. Furthermore, the AUC curves were plotted and can be seen in Figure 12; the compound scaled ResNet50 architecture achieved a higher AUC score. All the results are summarized in Table 4. The achieved results also show that the generic image features, such as edges and curves, learned by the deep neural networks while being trained on the ImageNet dataset, served as a good initialization of the weights. Despite the domain difference, transfer learning was effective; hence, initializing the weights from models trained on any classification dataset is better than random initialization. It can also be seen that the compound scaling used in this paper was effective in scaling up the model and increasing its overall performance. Furthermore, CAMs were plotted to explain the predictions of the deep model. In Figure 13, it can be seen that the proposed model was able to differentiate between a normal and a pneumonia-infected chest X-ray: the deep learning model identified abnormal lung regions to detect pneumonia, and the image parts on which the model focused while giving its output are highlighted in the CAM.

Comparative analysis with respect to various existing methods
Furthermore, a comparative study is presented in Table 5, in which the accuracy of various existing methods and the proposed methodology is compared. Ayan et al. [27] and Rahman et al. [28] used CNNs such as VGG16, Xception, DenseNet201 and SqueezeNet and achieved testing accuracies of 84.5% and 98.0%, respectively. Vikash et al. [29] proposed an ensemble model to combine predictions from multiple deep learning models and reported a testing accuracy of 96.36%. The proposed methodology achieved a testing accuracy of 98.14% with an AUC of 99.71. Our previous work [32] achieved a higher testing accuracy (98.43%) and an equally high AUC score (99.71), but it required more computational resources, as five deep learning models (ResNet18, Xception, InceptionV3, DenseNet121 and MobileNetV3) had to be trained and the predictions from all five had to be combined to give the final output. The average inference time for the proposed approach was 0.042 s, compared with 0.203 s for the previous approach. Hence, even though the present approach achieves slightly lower accuracy than the previous one, it would be easier to deploy on a real-world device.

Conclusions and future scope
Pneumonia is a life-threatening infectious disease. For patients over 75 years of age, its mortality rate is 24.8% [38]. In this paper, an algorithm that can further support computer-aided diagnosis of pneumonia has been proposed. The deep residual network used in this paper has a more complex structure yet fewer parameters, and achieves higher accuracy. Furthermore, this model was scaled up efficiently using the method of compound scaling. Data augmentation and transfer learning were also used to tackle the obstacle of an insufficient training dataset. Different scores, such as recall, precision and accuracy, were computed to demonstrate the robustness of the model. The proposed model attained an accuracy of 98.14%, a high AUC score of 99.71 and an F1 score of 98.3. Future work involves developing an algorithm that can localize the parts of the lung affected by pneumonia.