Comparative analysis of deep learning methods of detection of diabetic retinopathy

Abstract Diabetic retinopathy is a common complication of diabetes that affects the blood vessels in the retina, the light-sensitive tissue of the eye. It is the most common cause of vision loss among people with diabetes and the leading cause of vision impairment and blindness among working-age adults. Recent progress in automated systems for diagnosing diabetic retinopathy has posed new challenges for the industry, namely the search for less resource-intensive architectures, e.g., for the development of low-cost embedded software. This paper compares two widely used conventional architectures (DenseNet, ResNet) with a newer optimized one (EfficientNet). The compared methods classify a retinal image into one of 5 classes, using the dataset released for the 4th Asia Pacific Tele-Ophthalmology Society (APTOS) Symposium.

ABOUT THE AUTHOR Our team comprises five people. We specialize in artificial neural networks, natural language processing, and digital signal and image processing. The goals and objectives of the research are the development and testing of methods, algorithms, and architectures of neural networks applied to medical data, in particular to diabetic retinopathy. Pak Alexandr Alexandrovich is a Candidate of Technical Sciences, Assistant Professor at FIT KBTU, scientific supervisor of the doctoral candidate, and manager of project No. AP05132760 "The development of methods for deep learning of semantic probability inference" of the Institute of Information and Computational Technologies (IICT). Atabay Ziyaden is a software engineer at the Big Data Mining Laboratory, IICT. Tukeshev Kuanysh is a Medical Doctor at the Institute of Eye Diseases. Jaxylykova Assel is a PhD student at KazNU and IICT in the specialty "Information Systems". Abdullina Dana is a Medical Doctor at the Institute of Eye Diseases.

PUBLIC INTEREST STATEMENT
Imagine that we could stop blindness before it becomes irreversible with the help of simple hardware and a common cell phone. The problem of blindness is closely related to diabetic retinopathy (DR), a vascular complication of diabetes mellitus (DM). DR is one of the leading causes of blindness and visual impairment in the age group from 20 to 70 years. In this study, we present the results of testing various state-of-the-art neural network architectures applied to DR diagnosis, in terms of accuracy and performance. The purpose of this work is a hardware-software complex in the form of a service that can be used by non-specialized medical workers (paramedics and nurses). This is especially important for rural and regional populations, where there are not enough doctors (ophthalmologists and endocrinologists) and where Internet access is not available everywhere.

Introduction
Diabetic retinopathy (DR) is one of the significant causes of blindness. Since DR is a progressive disease, medical experts recommend that patients with diabetes be screened at least twice a year so that signs of the disease are diagnosed promptly. In current clinical practice, detection relies mainly on an ophthalmologist examining a color image of the fundus. This examination is laborious and time-consuming, which increases the risk of error. Also, because of the large number of patients with diabetes and the lack of medical resources in some areas, many patients with DR cannot be diagnosed and treated promptly; they miss the best treatment options, which ultimately leads to irreversible loss of vision. For patients at an early stage in particular, if DR is detected and treated immediately, its progression can be well controlled and delayed. At the same time, the quality of manual interpretation depends heavily on the experience of the clinician, and misdiagnosis often occurs when doctors lack experience (Aljawadi & Shaya, 2007; Kaleeva & Libman, 2010; Lisochkina et al., 2004).
To address the problem of erroneous diagnosis, the development of intelligent decision-support systems for ophthalmologists has attracted the interest of the scientific community in several works (Gadekallu et al., 2020; Gulshan et al., 2016; Mateen et al., 2018; Reddy et al., 2020).
We should note the pioneering work (Gulshan et al., 2016), whose main idea concerns telemedicine, namely the ability to diagnose retinopathy remotely from images obtained with a cell phone. The study presents a methodological approach for obtaining fundus images, diagnosing them with neurocomputing algorithms, and comparing the results with the opinion of ophthalmologists. The neurocomputing algorithms in question are Convolutional Neural Networks (CNNs). The base hardware is an iPhone 4 smartphone. The patient sample was small: only 55 people and 110 eye images. In general, remote diagnostics showed high values of sensitivity and specificity.
CNNs have achieved remarkable results in a large number of computer vision and image classification tasks, significantly outperforming all previous image analysis methodologies.
(Russakovsky et al., 2012) described a new CNN architecture, AlexNet. It showed significant performance improvements at the 2012 ILSVRC competition, achieving a top-5 error of 15.31%. For comparison, a method that does not use convolutional neural networks obtained an error of 26.1%. The network contains 62.3 million parameters and requires 1.1 billion computations in a forward pass. The convolutional layers, which account for 6% of all parameters, perform 95% of the computations.
VGG Net is a convolutional neural network model proposed in (Simonyan & Zisserman, 2014). At the 2014 ILSVRC competition, an ensemble of two VGG Nets obtained a top-5 error of 7.3%. Due to its depth and number of fully connected nodes, VGG16 weighs over 533 MB and contains 138 million parameters. The enormous size of VGG makes deploying the model a tedious task. The next step in the development of CNN models was the winner of ILSVRC 2015, an ensemble of six ResNet (Residual Network) networks developed by Microsoft Research, with a top-5 error of 3.57% (K. He et al., 2016). ResNet-50 has over 23 million trainable parameters.
The next architecture, which significantly reduced the number of parameters without a significant loss of quality, was published in (Huang et al., 2017) and called DenseNet (Dense Convolutional Network). The main idea of the architecture is to shorten the connections between layers in the CNN, which makes it possible to train deeper and more accurate models. With dense connections, DenseNet achieves high accuracy with fewer parameters than ResNet and Pre-Activation ResNet. The DenseNet121 model has 7 million parameters.
Recent work (Real et al., 2019) presented a new automatically discovered architecture, AmoebaNet-A. The architecture achieves 83.9% top-1 and 96.6% top-5 ImageNet accuracy, setting a new state of the art. The results are comparable to the current state-of-the-art ImageNet models.
EfficientNet is a modern architecture that achieves much better accuracy and efficiency than AmoebaNet-C. EfficientNet-B7 achieves state-of-the-art 84.4% top-1 and 97.1% top-5 accuracy on ImageNet while being 8.4x smaller and 6.1x faster at inference than the best existing ConvNet (Tan & Le, 2019). Thus, reducing the computational complexity of neural network architectures while maintaining an acceptable level of accuracy has become an important area of research. Indeed, modern requirements for diagnostic software include the ability to work offline, since not all settlements have high-quality, high-speed Internet access. This paper explores the use of low-resource CNN architectures for the diagnosis of diabetic retinopathy.

Related work
The early detection of DR is a hard problem in the field of computer vision. The goal of detection is to find the clinical features of retinopathy, as required for diagnostic transparency. Retinal color fundus images show various clinical features of DR, such as microaneurysms, hemorrhages, exudates, and others. Extracting these signs is essential for precise diagnostics, since they help to determine the actual status of DR. (Hann et al., 2009) describes a conventional approach based on the morphology of digital images that extracts a number of such features from fundus images. The advantages of the approach are transparency and simplicity. The presence of exudates near the macula in a fundus image is an important diagnostic sign of diabetic macular edema.
The work of (Walter et al., 2002) presented efficient algorithms for the extraction of retinal exudates and the optic disc. The main idea of the paper is that exudates can be found using their high grey-level variation, and their contours can be determined with morphological reconstruction techniques.
Another work (Agurto et al., 2010) proposed an approach based on multiscale amplitude-modulation frequency-modulation (AM-FM) methods to discriminate between normal and pathological retinal images. The modulations were applied to a set of small regions of fundus images containing different types of lesions. The feature vector of each small region was then derived from the amplitude-frequency response. The authors claimed that normal retinal structures and pathological lesions can be statistically differentiated based on AM-FM features. (Kazakh-British et al., 2018) conducted numerical experiments with the following processing pipeline: first, Frangi and Sato filters were applied to fundus images to extract blood vessels; after that, a CNN classifier was trained to detect lesions.
(Abràmoff et al., 2010) tested wavelet detectors and k-Nearest Neighbors for clinical feature extraction from fundus images. The extraction achieved an AUC of 0.86 with a standard error of 0.0084. It is worth noting that the dataset was produced with "non-mydriatic" digital retinal cameras. The size of a fundus image varied from 0.15 to 0.5 MB. (Gulshan et al., 2016) tested a deep learning algorithm for diabetic retinopathy on a dataset produced with a smartphone. This paper shows that diagnostic methods for diabetic retinopathy can be made accessible to a broad audience. (Pratt et al., 2016) developed and tested a CNN with 13 layers to detect the stage of DR. Their CNN was trained and tested on an NVIDIA K40c. The authors also stated that an image of size 512 × 512 could be processed in 0.04 seconds, which makes real-time feedback possible. (Tan et al., 2017) proposed another CNN with 10 layers to simultaneously segment and discriminate exudates, hemorrhages, and microaneurysms. The results of exudate detection are sensitivities of 0.87 and 0.71. The sensitivities for hemorrhages and microaneurysms are 0.62 and 0.46.
(Choi et al., 2017) tested the VGG-19 architecture on the STructured Analysis of the REtina (STARE) database. With 10 categories as the target variable, the results were an accuracy of 30.5%, a relative classifier information (RCI) of 0.052, and a Cohen's kappa of 0.224. With 3 categories, the results showed an accuracy of 72.8%, an RCI of 0.283, and a kappa of 0.577. (Mateen et al., 2018) proposed a symmetrically optimized solution that combines a Gaussian mixture model (GMM), a visual geometry group network (VGGNet), singular value decomposition (SVD) and principal component analysis (PCA), and softmax for region segmentation, high-dimensional feature extraction, feature selection, and fundus image classification, respectively. The authors claimed that the VGG-19 model outperformed AlexNet and the scale-invariant feature transform (SIFT) in terms of classification accuracy and computational time. (Khalifa et al., 2019) investigated deep transfer learning models for medical DR detection. The numerical experiments were conducted on the Asia Pacific Tele-Ophthalmology Society (APTOS) 2019 dataset. The models in this research were AlexNet, ResNet18, SqueezeNet, GoogleNet, VGG16, and VGG19. These models were selected because they consist of a small number of layers compared to larger models such as DenseNet and InceptionResNet. Data augmentation techniques were used to make the models more robust and to overcome overfitting.
We work on classifying fundus images according to the severity of DR, so that end-to-end classification from the fundus image to the patient's condition can be performed in real time. For this task, we use pixel normalization techniques to highlight various clinical features (blood vessels, exudates, microaneurysms, and others) and then classify the retinal image into the appropriate stage of the disease. We adopt CNN architectures for DR detection on the "APTOS 2019 Blindness Detection" dataset. The contributions of this article are summarized as follows: (1) We test recent CNN models (DenseNet121, ResNet50, ResNet101, EfficientNet-b4) for recognizing small differences between classes of images for DR detection (F. He et al., 2019; Hu et al., 2017; Tan & Le, 2019). (2) Manual training and hyper-parameter tuning are adopted, and the experimental results demonstrate better accuracy than the non-transfer training method for classifying DR images.

Dataset
Aravind Eye Hospital in India collected the dataset of high-quality fundus images; information about the dataset can be found at https://www.kaggle.com/c/aptos2019-blindness-detection/. It consists of approximately 10 GB of data across 5,590 RGB images of the fundus. The data owners divided the dataset into two parts: a training set of 3,662 images with target labels and a test set of 1,928 images without labels. Like any real-world dataset, it contains noise in both the images and the labels. Images may contain artifacts or be out of focus, underexposed, or overexposed. The images were gathered from multiple clinics using a variety of cameras over an extended period, which introduces further variation.
There are five classes in the data; Table 1 shows that the classes are not uniformly represented.

Preprocessing
All images are preprocessed before augmentation and training. All images were normalized to preserve the effectiveness of models pre-trained on ImageNet. Preprocessing involved several steps:
1. Balancing image sizes: images were rescaled to have the same radius and cropped to remove uninformative black pixels around the edges of the fundus.
2. Reducing lighting-condition effects: the images were taken under many different lighting conditions, and some are very dark. This was corrected by subtracting the local average color from each pixel.
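The two preprocessing steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the brightness threshold, the box-filter window, and the re-centering of the result at mid-gray (128) are hypothetical choices, and rescaling to a common radius is omitted.

```python
import numpy as np


def crop_black_border(image: np.ndarray, threshold: int = 10) -> np.ndarray:
    """Crop rows and columns that contain only near-black pixels.

    `threshold` is a hypothetical brightness cut-off, not a value from the paper.
    """
    brightest = image.max(axis=2)            # brightest channel per pixel
    mask = brightest > threshold
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0 or cols.size == 0:     # fully dark image: nothing to crop
        return image
    return image[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]


def box_blur(image: np.ndarray, radius: int) -> np.ndarray:
    """Local average color over a (2*radius+1)^2 window with edge padding."""
    k = 2 * radius + 1
    padded = np.pad(image.astype(np.float64),
                    ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    # Summed-area table with a leading row and column of zeros.
    sat = np.pad(padded.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0), (0, 0)))
    h, w = image.shape[:2]
    window_sum = (sat[k:k + h, k:k + w] - sat[:h, k:k + w]
                  - sat[k:k + h, :w] + sat[:h, :w])
    return window_sum / (k * k)


def subtract_local_average(image: np.ndarray, radius: int = 2) -> np.ndarray:
    """Subtract the local average color and re-center at mid-gray.

    Adding 128 keeps the result in the displayable range; it is an
    assumption of this sketch.
    """
    diff = image.astype(np.float64) - box_blur(image, radius)
    return np.clip(diff + 128.0, 0, 255).astype(np.uint8)


if __name__ == "__main__":
    fundus = np.zeros((8, 8, 3), dtype=np.uint8)   # synthetic stand-in image
    fundus[2:6, 2:6] = 200                         # bright "retina" region
    cropped = crop_black_border(fundus)
    print(cropped.shape, subtract_local_average(cropped).dtype)
```

In a perfectly uniform region the subtraction leaves only the mid-gray offset, which is why this kind of normalization suppresses slow lighting gradients while keeping local structures such as vessels and exudates.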
We applied augmentation to the images in real time to reduce overfitting. During each epoch, random augmentations that preserve collinearity and distance ratios were applied to the images.
Training started with no augmentation, and the models were then retrained with light, mid, and strong augmentation:
1. Light: randomly rotate an image by 90 degrees and randomly flip it horizontally.
2. Mid: randomly rotate an image by 90 degrees, randomly flip it horizontally, and transpose it by swapping rows and columns. Apply median blur with randomly picked parameters. Randomly apply CLAHE, sharpening, or randomized contrast and brightness.
3. Strong: randomly rotate an image by 90 degrees, randomly flip it horizontally, and transpose it by swapping rows and columns. Apply median blur with randomly picked parameters. Randomly apply CLAHE, a sharpening filter, or randomized contrast and brightness adjustments, distortion, and hue shift.
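The geometric part of these augmentation levels can be sketched in NumPy as below. This is only an illustration: the function name, the 0.5 probabilities, and the reading of "rotate 90 degrees" as a random multiple of 90 degrees are assumptions, and the photometric operations (median blur, CLAHE, sharpening, contrast, brightness, distortion, hue shift) are left out.

```python
import numpy as np


def geometric_augment(image: np.ndarray, rng: np.random.Generator,
                      allow_transpose: bool = False) -> np.ndarray:
    """Randomly rotate, flip, and optionally transpose an H x W x C image.

    With allow_transpose=False this mirrors the "light" level; with
    allow_transpose=True it adds the row/column swap of the "mid" and
    "strong" levels. Probabilities of 0.5 are hypothetical.
    """
    # Rotate by 0, 90, 180, or 270 degrees in the image plane.
    image = np.rot90(image, k=int(rng.integers(0, 4)), axes=(0, 1))
    if rng.random() < 0.5:
        image = image[:, ::-1]               # horizontal flip
    if allow_transpose and rng.random() < 0.5:
        image = image.transpose(1, 0, 2)     # swap rows and columns
    return np.ascontiguousarray(image)


if __name__ == "__main__":
    rng = np.random.default_rng(seed=0)      # fixed seed for repeatability
    sample = np.arange(4 * 4 * 3, dtype=np.uint8).reshape(4, 4, 3)
    augmented = geometric_augment(sample, rng, allow_transpose=True)
    print(augmented.shape)
```

These operations only permute pixels, so they change the orientation of the fundus without altering any intensity values; applying a fresh random draw in every epoch is what makes the augmentation "real-time".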
All preprocessing steps are presented in Table 2. The visual result of preprocessing is shown in Figure 1.

Step 1: Image size normalization
Step 2: Correcting lighting conditions
Step 3: Applying CLAHE
Step 4: Applying image augmentation
Step 5: Training the classifiers

EfficientNet model
It is common practice to develop convolutional neural networks (CNNs) under a fixed resource budget and then, if more resources become available, to scale them up to achieve better performance. There are several ways to scale a model: arbitrarily increasing the depth or width of the CNN, or using higher-resolution input images during training to capture data dependencies in finer detail.
While such approaches improve model accuracy, they usually require manual tuning and still often yield suboptimal performance.
For the comparative analysis, we chose the EfficientNet-B4 model, which achieves leading results while having significantly fewer model parameters. Its architecture is presented in Figure 2.
In general, one can define a convolutional layer as a tensor function, namely $Y_i = F_i(X_i)$, where $F_i$ is the tensor operator, $Y_i$ is the output tensor, and $X_i$ is the input tensor. The input tensor has shape $(H_i, W_i, C_i)$, where $H_i$ and $W_i$ are the spatial dimensions and $C_i$ is the channel dimension. One can define a convolutional network as a sequence of composed functions (Tan & Le, 2019),

$$N = \bigodot_{i=1 \ldots s} F_i^{L_i}\left(X_{(H_i, W_i, C_i)}\right), \tag{1}$$

where $F_i^{L_i}$ denotes the layer $F_i$ repeated $L_i$ times at stage $i$, $(H_i, W_i, C_i)$ is the shape of the input tensor of that layer, and $\bigodot(\ldots)$ denotes the composition of the stages. The complexity and performance of a CNN model depend on its depth, width, and input resolution. The compound scaling method scales up a CNN for better accuracy and efficiency by using a single coefficient $\Phi$ to uniformly scale network depth, width, and resolution (Tan & Le, 2019),

$$d = \alpha^{\Phi}, \quad w = \beta^{\Phi}, \quad r = \gamma^{\Phi}, \qquad \text{s.t. } \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \quad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1, \tag{2}$$

where $d$, $w$, and $r$ are the depth, width, and resolution multipliers, and $\alpha$, $\beta$, $\gamma$ are constants that can be determined by grid search. Briefly, $\Phi$ is a user-defined parameter that controls how many additional resources are available for model scaling, while $\alpha$, $\beta$, $\gamma$ specify how those resources are assigned to network depth, width, and resolution (Tan & Le, 2019). EfficientNet-b4 was generated according to these principles and used in the present work.
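As a worked illustration of compound scaling, the short sketch below computes the depth, width, and resolution multipliers $\alpha^{\Phi}$, $\beta^{\Phi}$, $\gamma^{\Phi}$ for a given coefficient. The default constants $\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$ are the grid-search values reported by Tan & Le (2019) for their baseline network; they are taken as an assumption here and are not stated in this paper. The function name and the returned cost estimate are hypothetical.

```python
from typing import NamedTuple


class ScalingFactors(NamedTuple):
    depth: float        # multiplier for the number of layers
    width: float        # multiplier for the number of channels
    resolution: float   # multiplier for the input image side
    cost: float         # approximate growth of the computation cost


def compound_scaling(phi: float, alpha: float = 1.2, beta: float = 1.1,
                     gamma: float = 1.15) -> ScalingFactors:
    """Return the multipliers alpha**phi, beta**phi, gamma**phi.

    Cost grows roughly linearly with depth and quadratically with width
    and resolution, hence the factor (alpha * beta**2 * gamma**2) ** phi.
    """
    if min(alpha, beta, gamma) < 1.0:
        raise ValueError("alpha, beta and gamma must all be >= 1")
    return ScalingFactors(
        depth=alpha ** phi,
        width=beta ** phi,
        resolution=gamma ** phi,
        cost=(alpha * beta ** 2 * gamma ** 2) ** phi,
    )


if __name__ == "__main__":
    for phi in (0, 1, 2, 3):
        factors = compound_scaling(phi)
        print(phi, [round(value, 3) for value in factors])
```

Because $\alpha \cdot \beta^{2} \cdot \gamma^{2}$ is close to 2 for these constants, each unit increase of the coefficient roughly doubles the computation cost, which is what makes the coefficient a convenient single knob for a resource budget.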

Numerical experiments
To compare the quality and performance of the various models, we conducted experiments with preprocessed and non-preprocessed images. The dataset was divided into training and validation data, and the variants were compared on all 5 classes. The classification is assessed with the κ-statistic, which corrects for instances that may have been classified correctly by chance. κ is calculated from both the observed (total) accuracy and the random accuracy,

$$\kappa = \frac{A_T - A_R}{1 - A_R},$$

where $A_T$ is the total accuracy and $A_R$ is the random accuracy. With non-preprocessed images, the various models obtained a κ below 0.656. The error was calculated using cross-entropy for discrete values. Accuracy was calculated using the conventional formula. Table 2 presents the results of the numerical experiments with the various models. With preprocessed images, κ stayed below 0.690; as Table 3 shows, accuracy increased significantly when strong augmentation was added to the pipeline of the models. The best model was obtained by encoding the network output with ordinal regression, with a κ coefficient below 0.790. Moreover, the best results were obtained with the contemporary EfficientNet-b4 model rather than with the familiar DenseNet and ResNet models, which indicates the good convergence of EfficientNet-b4 and its full applicability to the task of DR diagnosis. The estimates of computational complexity are presented in Table 4.
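The κ-statistic, computed from the total accuracy and the random (chance) accuracy as (total − random) / (1 − random), can be sketched in plain Python as follows. The second pair of functions shows one common way to encode the five DR grades for ordinal regression; the paper does not specify its encoding, so that scheme, like all function names and the 0.5 threshold, is an assumption of this sketch.

```python
from collections import Counter
from typing import List, Sequence


def cohen_kappa(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Kappa = (A_T - A_R) / (1 - A_R) for two equally long label sequences."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("label sequences must be non-empty and equally long")
    # A_T: fraction of samples where prediction and label agree.
    total_acc = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # A_R: agreement expected by chance from the two label distributions.
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    random_acc = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if random_acc == 1.0:
        raise ValueError("kappa is undefined when only one class occurs")
    return (total_acc - random_acc) / (1.0 - random_acc)


def ordinal_encode(grade: int, num_classes: int = 5) -> List[int]:
    """Encode grade k as k leading ones among num_classes - 1 binary targets."""
    return [1 if grade > i else 0 for i in range(num_classes - 1)]


def ordinal_decode(outputs: Sequence[float], threshold: float = 0.5) -> int:
    """Recover a grade by counting leading outputs above the threshold."""
    grade = 0
    for value in outputs:
        if value <= threshold:
            break
        grade += 1
    return grade


if __name__ == "__main__":
    labels = [0, 0, 1, 1, 2, 2]
    predictions = [0, 0, 1, 2, 2, 2]
    print(cohen_kappa(labels, predictions))
    print(ordinal_encode(3), ordinal_decode([0.9, 0.8, 0.3, 0.7]))
```

With this encoding, a prediction that is off by one grade differs from the target in a single output, whereas a one-hot target treats all misclassifications alike; this is the usual motivation for ordinal targets when classes are ordered by severity.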

Conclusion
The early detection of diabetic retinopathy is a pressing problem of predictive medicine. Diabetic retinopathy is the most common cause of blindness among the older age group. Due to the  The open problem of the area is computational complexity, because there is a demand for smartphone-embedded software with low computational requirements. In this paper, we show that various pixel normalization techniques can be used to highlight clinical features (blood vessels, exudates, microaneurysms, and others) and then classify the retinal image into the appropriate stage of the disease. We also show that such a diagnosis method can be developed with EfficientNet, in comparison with other neural network architectures. EfficientNet is an optimal model in terms of accuracy and number of parameters. The numerical experiments were conducted on the Asia Pacific Tele-Ophthalmology Society (APTOS) 2019 dataset. Further work will extend the algorithm with preprocessing aimed at revealing clinical-pathological features and with performance improvements.