Improvement of chest X-ray image segmentation accuracy based on FCA-Net

Abstract Medical image segmentation is a crucial stage in computer vision and image processing that helps make later-stage diagnosis more accurate, because segmenting medical images such as X-rays can extract tissue, organs, and pathological structures. However, medical image processing, primarily in the segmentation process, faces significant challenges in feature representation, because medical images differ from other images in contrast, blur, and noise. This study proposes lung segmentation on chest X-ray images based on deep learning with the FCA-Net (Fully Convolutional Attention Network) architecture. Attention modules, namely spatial attention and channel attention, are added to the Res2Net encoder so that features are expected to be represented better. This research was conducted on chest X-ray images from Qatar University available in a Kaggle repository. A total of 1500 chest X-ray images measuring 256 × 256 pixels were divided into 10% testing data and 90% training data. The training data were then processed with K-Fold Cross Validation from K = 2 to K = 10. The experiments covered three scenarios: spatial attention, channel attention, and a combination of spatial and channel attention. The best test results were obtained with the combination of spatial attention and channel attention at K = 5, with a DSC (Dice Similarity Coefficient) of 97.24% and an IoU (Intersection over Union) of 94.66% on the testing data. This result is better than those of the UNet++, DeepLabV3+, and SegNet architectures.


Introduction
Medical images are an essential part of health care for performing diagnostic procedures involving a visual and functional representation of the human body and its organs (Rizwan I Haque & Neubert, 2020). One type of medical image commonly used to diagnose disease is the chest X-ray. Chest X-rays are still among the most widely used tests globally because of their affordable cost, even in developing countries. They also remain essential in identifying various types of lung disease (Chest X-ray, 2020). Although they provide comprehensive information about the patient's condition, interpreting them accurately is a significant challenge for radiologists and requires skill, experience, and concentration.
To assist medical personnel in detecting diseases through X-rays more accurately using computer vision, segmentation is one of the essential tasks in medical image processing. Medical image segmentation can extract different tissue, organ, and pathological structures so that, in the next step, it can help with diagnosis, surgery planning, and treatment (R. Wang et al., 2022).
Manual segmentation is generally inefficient: it takes longer and is feasible only for small datasets. In addition, as the diversity of medical images increases, manual segmentation becomes impractical due to cost and productivity concerns. Another problem with manual medical image segmentation is that the result is subjective, depending on the knowledge and experience of the medical experts (Rizwan I Haque & Neubert, 2020).
Therefore, X-ray image segmentation can be done automatically using a computer. This process plays a crucial role in computer-aided diagnosis and innovative medicine because it can help make the subsequent diagnosis more accurate and efficient (R. Wang et al., 2022). However, extracting important features from X-ray images is more difficult than from standard images because of problems such as blur, noise, and low contrast. X-ray image segmentation therefore remains a challenging topic in computer vision.
Applying deep learning to the segmentation of chest X-rays and other medical images has contributed to the development of new image segmentation models with better performance. Deep neural networks can achieve high average accuracy on various datasets.
The following are the main contributions of the paper:
• We use 1500 images from the Qatar University dataset, divided into 90% training data and 10% testing data. The training data are processed with K-Fold Cross Validation from K = 2 to K = 10.
• This research provides a deep learning model using the FCA-Net (Fully Convolutional Attention Network) architecture for lung segmentation in chest X-ray images, intended as a better alternative to manual image segmentation.
• The experimental evaluation uses DSC (Dice Similarity Coefficient) and IoU (Intersection over Union).
• This research achieves high DSC and IoU values, which can in turn support accurate diagnosis of lung disease.

Literature review
In 2016, research on lung segmentation on 354 chest X-ray images was done using a CNN (Convolutional Neural Network) with a four-layer encoder and a four-layer decoder. These images came from a tuberculosis portal and the Japanese Society of Radiological Technology. The encoder consists of a stack of convolution layers, while the decoder consists of a stack of deconvolution layers. This architecture uses ReLU as the activation function and produced a DSC value of 96.2% (Kalinovsky & Kovalev, 2017).
Then, in 2018, the LF-Segnet model, a fully convolutional encoder-decoder network with 13 convolution layers of 3 × 3 kernels, was applied to lung segmentation on chest X-ray images from the JSRT and Montgomery datasets, with the images resized to 224 × 224 pixels. The training data comprised 1674 images and the testing data 199 images. The LF-Segnet architecture produced an IoU value of 95.10% (Mittal et al., 2018). In the same year, another study segmented the lungs on chest X-ray images using an FCN (Fully Convolutional Neural Network) architecture with 13 convolution layers, each with a 3 × 3 kernel, and experimented on the JSRT and Montgomery datasets. In this architecture, the 13 convolution layers are divided into five sets; each set consists of a convolution layer, an addition layer (the sum of the previous convolution layers), dropout, and max pooling. A ReLU activation function follows the output of each convolution layer. This architecture produced an IoU value of 95.88% (Hooda et al., 2018).
Another study on lung segmentation on chest X-ray images was also conducted in 2018 using a convolutional encoder-decoder network called SegNet with five encoder layers and five decoder layers. Each convolution layer is followed by batch normalization and a ReLU activation function. This architecture was applied to 154 images from the JSRT dataset, consisting of 119 training images and 35 testing images, with an average DSC score of 95.7% for the left lung and 96.2% for the right lung (Saidy & Lee, 2018). Research on lung segmentation was also conducted on 138 chest X-ray images from the Montgomery dataset. The approach used a deep convolutional neural network that combines initial segmentation using AlexNet with reconstruction using ResNet, giving an average DSC of 94% and an IoU of 88.07% (Souza et al., 2019).
In 2020, a study on lung segmentation on chest X-ray images used a GAN (Generative Adversarial Network) architecture with four different discriminators. Experiments were done on three different chest X-ray datasets: the JSRT dataset, the Montgomery dataset, and the Shenzhen X-ray dataset. The JSRT dataset was divided into 200 training, 20 validation, and 20 testing images. This architecture had a best DSC value of 97.4% and a best IoU value of 94.3%. For the Montgomery dataset, the split was 110 training, ten validation, and 18 testing images. For the Shenzhen dataset, 200 images were used for training and 40 for validation and testing. The disadvantage of the GAN architecture is that it requires high computing power (Munawar et al., 2020). In the same year, another study segmented the lungs on chest X-ray images using the fully convolutional multi-scale ScSE-DenseNet method. FC-DenseNet is used as the backbone network, with multi-scale and scSE (spatial and channel squeeze and excitation) modules inserted. The FC-DenseNet downsampling path contains multiple blocks, each of which contains a dense block. The multi-scale convolution uses kernels of size 3 × 3, 5 × 5, and 7 × 7, with 16 of each. An scSE module is inserted into each dense block. This architecture was applied to 5566 chest X-ray images with pneumothorax and 5485 chest X-ray images without pneumothorax. The dataset was split into 64% training data, 12% validation data, and 20% testing data. This architecture substantially reduced the number of trainable parameters and achieved a DSC (Dice Similarity Coefficient) value of 92% (Q. Wang et al., 2020).
Research on biomedical image segmentation in 2020 was also conducted with the FCA-Net (Fully Convolutional Attention Network) architecture. The attention modules added to the architecture are the spatial and channel attention modules, with the Res2Net block as the encoder. Spatial attention can clarify the differences at image boundaries, while channel attention can improve the segmentation of image details. The datasets used in that study included a chest X-ray dataset of 138 images, a nucleus dataset of 660 images, and a cervical cancer cell dataset of 917 images. The datasets were evaluated with five-fold cross-validation. The resulting DSC (Dice Similarity Coefficient) values were 98.32% for chest X-ray images, 89.9% for nucleus images, and 93.21% for cervical cancer images (Cheng et al., 2020).
In the same year, research was conducted on image segmentation using Multi-Scale U-Net on five different datasets. One was a chest X-ray image dataset from the National Library of Medicine, Maryland, USA, and Medical College, Shenzhen, China, with a total of 800 images. Multi-Scale U-Net is a development of U-Net that removes the convolution blocks in the U-Net decoder structure and replaces them with multi-scale blocks. The evaluation results showed a DSC value of 92.9% on the chest X-ray dataset (Su et al., 2021).

Materials and methods
Figure 1 shows the system architecture of this research. The detailed stages are as follows:
(1) The research dataset is divided into 10% testing data and 90% training data.
(2) The training data are processed using K-Fold Cross Validation, with the number of folds starting from K = 2, to produce training and validation sets.
(3) Training is carried out on the training set using the FCA-Net architecture.
(4) The trained model is then tested on the validation set to produce segmentation images and evaluation results in the form of DSC and IoU values.
(5) The best model is then tested on the testing data to produce segmentation images and evaluation results in the form of DSC and IoU values.
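The data-splitting part of stages (1) and (2) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the function names, the fixed random seed, and the use of contiguous folds over shuffled indices are all assumptions made for the example.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.10, seed=0):
    """Shuffle sample indices and hold out test_fraction of them for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary illustrative choice
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]  # (training, testing)

def k_fold_splits(train_indices, k):
    """Yield one (training set, validation set) pair of index lists per fold."""
    n = len(train_indices)
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = train_indices[start:start + size]
        train = train_indices[:start] + train_indices[start + size:]
        yield train, val
        start += size

train_idx, test_idx = train_test_split_indices(1500)
folds = list(k_fold_splits(train_idx, k=5))
print(len(train_idx), len(test_idx))    # 1350 150
print([len(val) for _, val in folds])   # [270, 270, 270, 270, 270]
```

With 1500 images this gives 1350 training and 150 testing images, and at K = 5 each fold validates on 270 images while training on the remaining 1080.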

Dataset of the study
The chest X-ray image dataset used in this study is a public dataset sourced from Qatar University (QU) and available in a Kaggle repository (https://www.kaggle.com/datasets/anasmohammedtahir/covidqu) (Tahir et al., 2021). The images are 256 × 256 pixels and fall into three categories: lungs infected with Covid-19, lungs infected with pneumonia, and normal lungs. A sample of the X-ray images and their ground truth is shown in Figure 2.
The ground truth results from previous research that used a collaborative human-machine approach consisting of two stages. In the first stage, the radiologist manually draws the segmentation on a subset of the images (500 images). These manual segmentations are then used as ground truth to train several models inspired by the U-Net architecture, and the trained models are used to predict these 500 images in each training fold. The radiologist then reviews the predicted and manual segmentations and selects a valid segmentation for each image. This early observation is notable because it shows that radiologists preferred the segmentations produced by the models, which they rated above the manually drawn ones (Degerli et al., 2021).
Then, in the second stage, the collaborative human-machine ground truth selected in the first stage was used to train five modified models inspired by U-Net, U-Net++, and DLA. A radiologist then identifies and chooses the five most reliable images from the five prediction models to use as inputs for the trained model's predictions of the remaining images. The most valid ground truths are then collected to compile the benchmark data (Degerli et al., 2021).
The total number of X-ray images with ground-truth lung segmentation available is 33,920, comprising 11,956 images infected with Covid-19, 11,263 infected with pneumonia, and 10,701 with normal lungs. For this study, however, 1500 images are used, 500 from each class, selected randomly by a computer program. Table 1 describes the number of images used.
The dataset is divided into 90% training data (1350 images) and 10% testing data (150 images), and K-Fold Cross Validation is then carried out with the number of folds ranging from K = 2 to K = 10. In addition, this study also conducted experiments on an unbalanced dataset of CXR images obtained from Airlangga University Hospital (RSUA), Indonesia. This dataset consists of 207 CXR images infected with Covid-19, 53 CXR images infected with pneumonia, and 32 normal CXR images. The dataset, along with the ground truth of lung segmentation, is available at Mendeley Data (RSUA, Radiologi, 2023).
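As an illustration of the random selection described above, the following sketch draws 500 image indices per class without replacement. The class sizes are those reported for the full dataset; the function name, the seed, and the idea of addressing images by integer index are assumptions for the example, since the selection program itself is not described.

```python
import random

# Class sizes as reported for the full QU dataset.
CLASS_SIZES = {"Covid-19": 11956, "Pneumonia": 11263, "Normal": 10701}

def sample_per_class(class_sizes, n_per_class=500, seed=42):
    """Draw n_per_class distinct image indices from each class."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    return {name: sorted(rng.sample(range(size), n_per_class))
            for name, size in class_sizes.items()}

subset = sample_per_class(CLASS_SIZES)
total = sum(len(ids) for ids in subset.values())
print(sum(CLASS_SIZES.values()), total)  # 33920 1500
```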

Fully convolutional attention network
The FCA-Net architecture uses a Fully Convolutional Network as the base architecture and adds two attention modules, namely spatial attention and channel attention. This architecture uses Res2Net as the encoder network and combines both attention modules on the encoder network to better represent features (Cheng et al., 2020). Figure 3 illustrates the FCA-Net architecture.
In the encoder section, a Res2Net block is used to extract features and produce a better feature representation for pixel prediction. Features are extracted from the 256 × 256 input image using the Res2Net encoder. Dilated Res2Net is also used in the encoder network, with dilated convolution used in the last two blocks of the encoder after max pooling of the image. The final feature map is 1/8 the size of the input image. When the image size is reduced to 1/2 of the input image, an attention module is inserted to improve the representation of features (Gao et al., 2021).
The Res2Net block extends the bottleneck block by replacing its 3 × 3 filters with smaller groups of filters that are connected in a hierarchical manner, as shown in Figure 4. This architecture can better extract multi-scale features at a computational cost comparable to that of the bottleneck block (Gao et al., 2021).
After the 1 × 1 convolution, the feature map is split into s feature map subsets, denoted by X_i, where i = 1, 2, ..., s. Each subset has the same spatial size but a channel count equal to 1/s of that of the input feature map. Each X_i has a corresponding 3 × 3 convolution layer, denoted by K_i, except for X_1.
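The hierarchical connection between the subsets can be sketched in NumPy as below. The rule used here (the first subset passes through unchanged, and from the third subset onward the previous output is added to the subset before its 3 × 3 convolution) follows the original Res2Net formulation and is not spelled out in the text above. The 3 × 3 mean filter is only a stand-in for the learned convolutions K_i, and all function names are hypothetical.

```python
import numpy as np

def box_filter_3x3(x):
    """Stand-in for a learned 3x3 convolution K_i: zero-padded 3x3 mean filter."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += padded[:, dy:dy + h, dx:dx + w]
    return out / 9.0

def res2net_split(features, s=4, conv=box_filter_3x3):
    """Hierarchical multi-scale step of a Res2Net block on a (C, H, W) map."""
    subsets = np.split(features, s, axis=0)  # X_1..X_s, each with C/s channels
    outputs = [subsets[0]]                   # Y_1 = X_1 (no convolution)
    for i in range(1, s):
        x_i = subsets[i]
        if i > 1:
            x_i = x_i + outputs[-1]          # add the previous output Y_{i-1}
        outputs.append(conv(x_i))            # Y_i = K_i(...)
    return np.concatenate(outputs, axis=0)   # same shape as the input

features = np.random.default_rng(0).normal(size=(8, 16, 16))
out = res2net_split(features, s=4)
print(out.shape)  # (8, 16, 16)
```

In a real block each K_i has its own learned weights and the concatenated outputs pass through a further 1 × 1 convolution; the sketch only shows how later subsets see progressively larger receptive fields.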

Spatial attention
The spatial attention module computes the important features of each pixel within a local region and extracts essential information from the image, which helps overcome the lack of contextual information that causes the FCN to misclassify local features. The architecture of spatial attention is shown in Figure 5.
As shown in Figure 5, local features are sent to a convolution layer to produce two feature maps, FA and FB. Average pooling and max pooling are then carried out on FA, and the results are concatenated and convolved with a single 7 × 7 filter. The Hadamard product of the convolution result and FB then gives the output of spatial attention. Spatial attention is essential for finding important feature points in the spatial domain. This can be seen in the average pooling and max pooling processes, which calculate the average and maximum values at each pixel position across all channels, so that the coordinates or regions containing essential features receive a higher weight.
Spatial attention is calculated as follows:
SA = σ(Conv7×7(Con(AvgPool(FA), MaxPool(FA)))) ⨀ FB
where σ is the sigmoid function, Conv7×7 indicates a convolution layer with a 7 × 7 kernel, Con indicates the concatenation of the two matrices, and ⨀ denotes the Hadamard product. The Hadamard product, also called element-wise multiplication, multiplies each element of one matrix by the corresponding element of another matrix of the same size.
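The spatial attention computation can be sketched in NumPy as follows. This is an illustrative sketch, not the reference implementation: the function names are hypothetical, the kernel is random rather than learned, the naive loop convolution (cross-correlation, as in deep learning frameworks) stands in for a framework layer, and zero padding is an assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, kernel):
    """Naive zero-padded convolution of a (C, H, W) input with one (C, k, k) kernel."""
    _, h, w = x.shape
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((h, w))
    for dy in range(k):
        for dx in range(k):
            # Weighted sum over channels of the shifted input for this kernel tap.
            out += np.einsum("chw,c->hw",
                             padded[:, dy:dy + h, dx:dx + w], kernel[:, dy, dx])
    return out

def spatial_attention(f_a, f_b, kernel):
    avg_map = f_a.mean(axis=0)                       # average pooling across channels
    max_map = f_a.max(axis=0)                        # max pooling across channels
    stacked = np.stack([avg_map, max_map], axis=0)   # concatenation: (2, H, W)
    weights = sigmoid(conv2d_same(stacked, kernel))  # one 7x7 filter -> (H, W)
    return weights[None, :, :] * f_b                 # Hadamard product with FB

rng = np.random.default_rng(0)
f_a, f_b = rng.normal(size=(2, 8, 32, 32))   # two (8, 32, 32) feature maps
kernel = 0.1 * rng.normal(size=(2, 7, 7))    # hypothetical, untrained weights
print(spatial_attention(f_a, f_b, kernel).shape)  # (8, 32, 32)
```

Every channel of FB is scaled by the same H × W weight map, so the module reweights positions rather than channels.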

Channel attention
The channel attention module leverages global information to identify the features that matter. Channel attention is built on the inter-channel relationships of the features, where each channel of the feature map is considered a feature detector. Exploiting the dependencies between feature map channels can improve the feature representation for semantic segmentation.
In Figure 6, global max pooling and global average pooling are carried out on the feature map, and both pooling results are then passed through shared dense one and shared dense two, which together form an MLP (multilayer perceptron) with one hidden layer. Next, element-wise summation is performed on the shared dense outputs of the max pooling and average pooling branches. Finally, the Hadamard product (element-wise multiplication) of the summed matrix MD and the input matrix gives the final result of the channel attention map.
The basic principle of channel attention is to focus on the most critical channels by assigning a higher value to channels that are considered to have more essential features than others. This is done by applying global average pooling and global max pooling to each channel so that a 1 × 1 × C feature map is generated, which is multiplied (Hadamard product) with the initial input. As a result, channels with more essential features receive larger values than the other channels.
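A NumPy sketch of this channel attention computation is given below. Several details are assumptions rather than statements from the text: the ReLU in the hidden layer, the sigmoid applied to the summed vector before multiplication (both follow common CBAM-style practice), the reduction ratio of the hidden layer, and the random untrained weights. Function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def shared_mlp(v, w1, w2):
    """Shared dense one (ReLU hidden layer, an assumption) then shared dense two."""
    return np.maximum(v @ w1, 0.0) @ w2

def channel_attention(features, w1, w2):
    avg_vec = features.mean(axis=(1, 2))   # global average pooling -> (C,)
    max_vec = features.max(axis=(1, 2))    # global max pooling -> (C,)
    # The same MLP weights are applied to both pooled vectors, then summed.
    summed = shared_mlp(avg_vec, w1, w2) + shared_mlp(max_vec, w1, w2)
    weights = sigmoid(summed)              # assumed squashing to (0, 1)
    return weights[:, None, None] * features  # 1 x 1 x C weights times the input

rng = np.random.default_rng(0)
c, reduction = 8, 4                        # hypothetical sizes
features = rng.normal(size=(c, 32, 32))
w1 = rng.normal(size=(c, c // reduction))  # shared dense one
w2 = rng.normal(size=(c // reduction, c))  # shared dense two
print(channel_attention(features, w1, w2).shape)  # (8, 32, 32)
```

Here each channel is scaled by a single scalar, in contrast to spatial attention, where each position is scaled by a single scalar shared by all channels.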

DSC (Dice Similarity Coefficient)
DSC (Dice Similarity Coefficient) is a commonly used performance metric for segmentation. This metric measures the overlap between the segmentation result and the ground truth. The DSC ranges from 0 to 1, where a value of 1 means that the segmentation result overlaps perfectly with the ground truth.
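For binary masks, the standard definition is DSC = 2|A ∩ B| / (|A| + |B|), where A is the predicted mask and B is the ground truth. A minimal NumPy sketch follows; the function name and the convention of returning 1.0 when both masks are empty are choices made for this example.

```python
import numpy as np

def dice_coefficient(pred, truth):
    """Dice Similarity Coefficient between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return 2.0 * np.logical_and(pred, truth).sum() / denom

pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
truth = np.array([[1, 0, 0],
                  [0, 1, 1]])
# 2 overlapping pixels, 3 predicted, 3 true: 2 * 2 / (3 + 3)
print(round(dice_coefficient(pred, truth), 4))  # 0.6667
```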

IoU (Intersection over Union)
IoU (Intersection over Union), or the Jaccard index, is a metric commonly used to measure image segmentation performance. IoU is the number of pixels in the intersection of the predicted segmentation and the ground truth divided by the number of pixels in their union.
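In set notation, IoU = |A ∩ B| / |A ∪ B| for a predicted mask A and ground truth B. The sketch below computes it with NumPy; the function names and the empty-mask convention are choices made for this example. It also shows the identity DSC = 2·IoU / (1 + IoU), which holds exactly for a single pair of masks but only approximately for values averaged over many images.

```python
import numpy as np

def iou_score(pred, truth):
    """Intersection over Union (Jaccard index) between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return np.logical_and(pred, truth).sum() / union

def dice_from_iou(iou):
    """Per-mask identity linking the two metrics."""
    return 2.0 * iou / (1.0 + iou)

pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
truth = np.array([[1, 0, 0],
                  [0, 1, 1]])
iou = iou_score(pred, truth)          # 2 shared pixels / 4 pixels in the union
print(iou)                            # 0.5
print(round(dice_from_iou(iou), 4))   # 0.6667
```

Because DSC is never smaller than IoU for the same masks, a reported DSC below the corresponding IoU signals a transcription error.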

Implementation details
The experiments use a learning rate of 10⁻⁴, the Adam optimizer, and 150 epochs. They were run on Google Colaboratory Pro with an NVIDIA Tesla A100 GPU, following the scenarios described in Table 2. The experiments were conducted under scenarios A and B. Scenario A is a K-Fold experiment ranging from K = 2 to K = 10 using both the spatial and channel attention modules. Scenario B then uses the optimal K value obtained from scenario A. Scenario B consists of scenario B1, which uses both attention modules; scenario B2, which uses only the spatial attention module; and scenario B3, which uses only the channel attention module. Scenario B aims to determine the effect of the spatial and channel attention modules on DSC and IoU.

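Table 2. Experiment scenarios.
Scenario | K value | Attention module
A | K = 2 to K = 10 | Spatial and channel
B1 | The most optimal K in Scenario A | Spatial and channel
B2 | The most optimal K in Scenario A | Spatial
B3 | The most optimal K in Scenario A | Channel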

Results & discussion
To illustrate the image segmentation process, a simple segmentation model was built with the following layers:
(1) Convolution layer with 16 filters of size 3 × 3
(2) Max pooling
(3) Convolution layer with 64 filters of size 3 × 3
(4) Convolution and upsampling layer
(5) Convolution layer with 2 filters of size 3 × 3
(6) Final convolution layer with a single 1 × 1 filter and sigmoid activation
The input to train the model consists of 1,500 lung X-ray images resized to 32 × 32. Training is supervised because each image has a target/label in the form of its ground truth. The trained and stored model is then used to predict one lung image sample, and the kernels and feature maps of each layer are visualized. Figure 7 shows a sample of the kernels and feature map of the first convolution.
Sixteen kernels are used, each of size 3 × 3. The feature map of the prediction results using these kernels is shown in Figure 8. Figure 9 is an example of the kernel visualization for the last convolution.
A single kernel of size 1 × 1 is used. This last convolution uses the sigmoid activation function. The feature map of the prediction results using this kernel is shown in Figure 10.

Training result
The model was first trained under the initial scenario, the K-Fold experiment starting with K = 2. The FCA-Net architecture was used with both the channel attention and spatial attention modules, as shown in Table 3 and Figure 11. Based on the results of the K-Fold experiment from K = 2 to K = 10, the best mean DSC on the validation data was obtained with K = 5, at 94.18%, with a mean IoU of 89.60%. These values have a DSC standard deviation of 0.06 and an IoU standard deviation of 0.10.
The smallest standard deviation occurs at K = 2, with a DSC standard deviation of 0.00014 and an IoU standard deviation of 0.0021, but the DSC and IoU values at K = 2 are still poor, at 72.25% and 56.61%, respectively. In this case, therefore, K = 5 is still superior. Figure 12 shows that the greater the K value, the longer the computation time required. Training was then carried out with the attention module experiments (Scenarios B1, B2, and B3). Scenario B1 uses a combination of the spatial attention and channel attention modules. The results of training with the combined attention modules are shown in Table 4.
Based on Table 4, the best DSC and IoU values on the validation data were a DSC of 97.43% and an IoU of 95.01%, both in the first fold. The standard deviation of training with the combined attention modules is 0.061 for DSC and 0.10 for IoU.
The next training run, referred to as the B2 scenario, divided the data using 5-fold cross-validation and used only the spatial attention module, without the channel attention module; the results are shown in Table 5. The standard deviation of training with the spatial attention module is 0.061 for DSC and 0.10 for IoU.
The last scenario is Scenario B3, which divides the data into five folds and uses only the channel attention module. According to Table 6, the best DSC and IoU values on the validation data are a DSC of 97.31% and an IoU of 94.78% in the first fold. These outcomes match those of scenario B2. The other folds differ, however, so the standard deviations of DSC and IoU for training with the channel attention module are 0.068 and 0.11, respectively.
Based on the training results, the best model from each scenario was selected and then evaluated on the testing data.

Testing result
Based on the K-Fold Cross Validation experiments, the optimal value is K = 5, because its DSC and IoU values, overall and in each fold, are relatively higher than those of the other K values during training. The results of the attention module trials with 5-fold data division are shown in Table 7.
The test results showed that the combined use of spatial attention and channel attention gave the highest DSC and IoU values, namely a DSC of 97.24% and an IoU of 94.66% on the QU dataset. On the RSUA dataset, the testing results were a DSC of 97.65% and an IoU of 95.42%. The results on the two datasets indicate that, for both balanced and unbalanced data, combining spatial and channel attention gives the highest lung segmentation accuracy. Although the unbalanced dataset is small, it still gives high test results.
Moreover, in the attention module experiments, the DSC and IoU values at testing time differed only slightly between scenarios. This shows that the FCA-Net architecture, combining the Spatial Attention Module and the Channel Attention Module, can produce good lung segmentation. However, the FCA-Net architecture has a long computation time and requires significant resources for training. Table 8 compares the segmentation results of four deep learning architectures: SegNet, UNet++, DeepLabV3, and FCA-Net. The table shows that the FCA-Net architecture has higher DSC and IoU values than the other three architectures. Figure 13 shows an example of the overlap between the ground truth and the segmentation results of the four deep learning architectures. Consistent with the accuracy results in Table 8, the visualization shows that FCA-Net produces fewer errors (marked by boxes for each architecture).

Conclusion & future work
Based on the test scenarios in this study, covering both the K-Fold value experiments and the attention module trials, K-Fold Cross Validation with K = 5 gives the best result in terms of DSC, IoU, and computing time. In the attention module trials, the best result was obtained by combining the spatial attention module and the channel attention module, with a DSC of 97.43% and an IoU of 95.01% on the validation data. On the testing data, the model obtained a DSC of 97.24% and an IoU of 94.66%.
Although the FCA-Net model produced good evaluation values in this study, its training process requires a long time and large resources. Therefore, future studies are recommended to improve the encoder and decoder paths of the model to reduce computational time during training.

Figure 8. Feature map of first convolution.

Figure 13. Example image overlap between segmentation results and ground truth of the four deep learning architectures being compared.