Diabetes classification using MapReduce-based capsule network

Big data analytics is a complex exploratory process that uncovers hidden information from vast collections of data. It draws on diverse sources, and analytics extracts usable knowledge from the collected noisy data. In the case of diabetes, there exist massive collections of patient data containing significant information on patient health and its critical nature. To validate and analyse these data and obtain the desired information about a patient and their health risk, this study uses big-data-based deep learning analytics. A deep learning model, the capsule network (CapsNet), is executed on a MapReduce framework. The CapsNet within the MapReduce framework classifies instances via proper regulation. After suitable training on the training dataset, the model classifies instances to detect the nature of a patient's risk. Validation on the test dataset shows that the proposed CapsNet-based MapReduce model obtains higher accuracy, recall, and F-score than conventional MapReduce and deep learning models.


Introduction
Data analytics, clustering, healthcare systems, data mining, machine and deep learning, storage, and cloud computing have all seen significant shifts in direction recently. Health must be well cared for using all of these technologies, and advances in data analysis help, since diabetes is a dangerous disease that may, in some cases, cause death. Diabetes significantly impairs the body's ability to produce insulin, which disrupts carbohydrate metabolism and raises blood glucose levels. Diabetic patients are most commonly affected by high blood sugar, with reported symptoms such as increased thirst, increased hunger, and frequent urination [1].
With growing numbers of patients, social medical data and sensing data are making medical records larger today [2]. The analysis process therefore needs to be distributed across a distributed system to cover the full scope of diabetes datasets. In addition, machine learning (ML) techniques are needed to build models capable of accurately identifying patients according to characteristics such as glucose, insulin, and BMI.
When used correctly, this data could help solve many problems at little or no cost. Studies now investigate these topics scientifically, based on analysis of historical records in various fields.
Businesses, research institutions, and governments now routinely create data at an unprecedented scale and complexity. Organizations all over the world have realized that gathering useful information from large quantities of data is crucial, yet obtaining meaningful insights quickly and easily from these data is difficult. Analytics has thus become impossible to ignore for realizing the full value of data, improving overall business performance, and increasing market share. Data volume, velocity, and variety have all increased over the last few years. The required technologies are generally not prohibitively expensive, and open-source software makes up the majority of the tooling [3].
Missing data is one of the greatest challenges in data analytics; it can stem from sensor failures, human errors, or the transmission of data between systems and locations such as cloud servers. In very large datasets, as the number of missing items grows, the dataset must be repaired to preserve its statistical power [4,5].
Conventional MapReduce and deep learning models cannot effectively handle the complexity and diverse nature of diabetes patient data, resulting in suboptimal accuracy, recall, and F-score. Previous approaches also failed to leverage the advantages of capsule networks (CapsNet), hindering the ability to uncover critical patient health information from massive data collections.
The work's primary goal is to advance the classification of large datasets using smaller training datasets. In this paper, a CapsNet [6] is developed on a MapReduce framework. The CapsNet within the MapReduce framework classifies instances via proper regulation. After suitable training on the training dataset, the model classifies instances to detect the nature of a patient's risk.
The paper's key contributions are listed below:
• The authors used a MapReduce framework to bring large data into the model for processing. The data volume is effectively reduced using the mapper and reducer functions, which simultaneously preserve the essential data.
• The study uses CapsNet for making optimal decisions with a simple formulation that focuses on the creation of document representations.
The CapsNet classifies the needed data to predict patient risk from big data, with proper preprocessing and feature extraction procedures.
The rest of the paper is organized as follows: Section 2 reviews the literature. Section 3 discusses the proposed big data classification model. Section 4 evaluates the complete work, and Section 5 concludes it.

Literature review
Big data analytics faces numerous issues within the healthcare industry, and ongoing research aims to rectify them.
According to a study [7], massive datasets are no longer tractable for conventional applications. That research centres on understanding patients with diabetes using data mining algorithms; Hadoop and MapReduce are used to conduct an in-depth examination.
Healthcare, which must grapple with vast amounts of data, was described by [8]. Important examples were mined using information-mining techniques, eliminating incoherent information. Association rule mining shows how items are connected as well as how they behave individually. Because the Apriori algorithm degrades under huge amounts of data, an Apriori implementation on the Hadoop MapReduce framework was provided.
[9] developed effective diabetes detection and management strategies. According to the evaluation in that study, classification of type 1 and type 2 diabetes was completed. To improve diabetes screening and diagnosis, the 5G-Smart Diabetes framework calls for the incorporation of modern technology.
[10] focused on big data analytics, highlighting how rapidly the healthcare sector is applying big data. Classification and clustering algorithms are employed to classify and group diabetes-related information, and their capability is examined.
The recent development of smart decision support systems (DSSs) that aim to recreate human reasoning was noted in [11]. Artificial neural networks (ANNs) can handle a wide range of decision-making activities; the model in [11] utilizes an ANN for gestational diabetes detection.
[12] created an improved C4.5 classification framework and applied reverse planning, enabling the construction of a classification representation. [13] used DataSpeak for mining, classification, and aggregation; this approach overcomes the limitations of k-nearest neighbour and gives quick access to information.
[14] developed a classification scheme that adheres to the MapReduce programming model and incorporates association rules. From data inspection within MapReduce, [15] deduced the relationship between the number of labels for the fuzzy components and the lack of information, and examined how partitioning the instance set affects the granularity of the results.
The InfusedHeart framework [16], employing knowledge-infused learning, outperforms various machine learning and deep learning approaches [17,18], achieving high accuracy in heartbeat acoustic event classification under different signal-to-noise ratio conditions. FL-PMI leverages DRL and BiLSTM within a federated learning framework, achieving high accuracy, reduced memory usage, and reduced transmitted data in smart healthcare applications [19].
[20] proposes a heart and diabetes disease prediction model that combines rough set theory for attribute reduction with a fuzzy logic system for classification. The experimental results demonstrate the superiority of the algorithm.
An effective diabetes dataset classification and extraction process was also developed. [9] designed the AdaBoost.M1 and LogitBoost.M1 algorithms for building models for diabetes diagnosis from medical test data. A model to forecast the prevalence of diabetes mellitus (DM) was created using 150 machine learning classifiers, together with an assessment of the gains from clustering.
Several classification techniques were discussed for forecasting patient data, in addition to a general diabetic presentation and the identification of related diseases. To classify diabetes and other chronic diseases, researchers developed ML techniques such as decision trees and random forests to create classification-based forecasting representations, and used decision trees and random forests to develop diagnostic-rule-based forecasting representations. In addition, an algorithm with a random forest signature was created to find difficult regions of type 2 diabetes. Logistic regression (LR) and SVM have also been used to train classifiers. A summary of the reviewed techniques is given in Table 1.

Table 1. Summary of reviewed techniques.
[7] Hadoop and MapReduce
Bhattacharya [8] Apriori algorithm
Mamatha et al. [10] Big data classification
Moreira et al. [11] ANN
Rawal et al. [12] C4.5
Shankar et al. [13] DataSpeak
Bechini et al. [14] MapReduce programming model

Proposed model
Diabetes is a prevalent disease, and this necessitates detecting it as early as possible [16][17][18][19][20] to ensure the patient's well-being. In recent healthcare settings, dealing with unstructured health records creates further complications. As a result, healthcare faces many difficult challenges when attempting to extract information from such data, leading to a demand for more advanced data analytics. The goal of this research is to classify dataset instances so as to provide predictions that are as accurate as possible for the patients to whom the results apply.
The insulin dosage is determined from the patient's prior medical records, and the blood sugar stage is estimated using customary intervals. The Hadoop and MapReduce platforms are used in tandem with CapsNets in this case. Although existing methods can process MapReduce on this platform, they handle only a subset of attributes; the proposed method is more effective because it takes many factors into consideration. Through the MapReduce platform and a multiclass outlier technique, the proposed hierarchical technique examines the diabetes patient database. The insulin dosage is then determined in accordance with the findings, taking into consideration the patient records and the experimental results. An illustration is given in Figure 1.

Hadoop and map-reduce
The Hadoop ecosystem provides a platform to store, access, and analyse massive quantities of data. The MapReduce framework is utilized for parallel processing of data. A Hadoop MapReduce job has two phases, Mapper and Reducer; the results of the map phase are fed into the reduce phase as input.
The mapper transforms the input data into intermediate key-value pairs. The first step extracts a certain number of field values from every record; values missing from the dataset are also noted. Key-value pairs are created by processing all the records sequentially and in parallel, giving pairs of the form in Equation (1):

map(k1, v1) → list(k2, v2)    (1)

The reduce phase then takes the key-value pairs produced by the mapper phase as its input. After all intermediate values for a key have been gathered, the output of the last stage is the unified structure of those values, as in Equation (2):

reduce(k2, list(v2)) → list(v3)    (2)

MapReduce framework processing is visualized in Figure 2. After reduction by MapReduce, the big dataset is smaller and acts as a prerequisite for further processing.
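The map/shuffle/reduce pattern described above can be sketched in pure Python. This is a minimal in-memory illustration, not the Hadoop API; the record layout and the `Outcome` label column are assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain

def map_phase(records, mapper):
    # Apply the mapper to every record; each call emits (key, value) pairs (Eq. 1).
    return chain.from_iterable(mapper(rec) for rec in records)

def shuffle(pairs):
    # Group intermediate values by key, as the framework does between the phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    # Merge each key's list of values into one result per key (Eq. 2).
    return {key: reducer(key, values) for key, values in groups.items()}

# Count how many records carry each outcome label (1 = diabetic, 0 = not).
records = [{"Outcome": 1}, {"Outcome": 0}, {"Outcome": 1}, {"Outcome": 1}]
pairs = map_phase(records, lambda rec: [(rec["Outcome"], 1)])
counts = reduce_phase(shuffle(pairs), lambda key, values: sum(values))
# counts == {1: 3, 0: 1}
```

In a real Hadoop job the shuffle step is performed by the framework between the two phases; it is written out explicitly here only to make the grouping by key visible.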

Capsnet
The CapsNet architecture used for classification is presented in this section. In a CapsNet, a capsule is a group of neurons whose activity vector represents the instantiation parameters of an entity; the length of this vector indicates the probability that the specific entity is present.
CapsNet overcomes the problems of CNNs, whose layers discard spatial relationships, by iteratively refining the connections between layers. The coupling coefficients between child and parent capsules may vary across routing iterations: each parent capsule compares the predictions it receives after an iteration with those before it and strengthens the couplings that agree. As illustrated in Figure 3, a CapsNet model was developed.
Considering u_i as the output of capsule i, the prediction vector for the parent capsule j is evaluated as

û_{j|i} = W_ij u_i
where û_{j|i} is the prediction vector for parent capsule j obtained from capsule i, and W_ij is a weight matrix learned during the backward pass. The coupling coefficients c_ij of the next layer are computed with a softmax function, based on how well the capsules in the previous layer matched:

c_ij = exp(b_ij) / Σ_k exp(b_ik)
where b_ij is the log-likelihood that capsule i should be coupled with capsule j; it is set to zero at the start of the routing. The input vector of parent capsule j is then estimated as

s_j = Σ_i c_ij û_{j|i}

and a squashing non-linearity shrinks each capsule's output vector so that its length lies below one while preserving its direction:

v_j = (||s_j||² / (1 + ||s_j||²)) · (s_j / ||s_j||)
where s_j is the input vector of the j-th capsule and v_j is its output vector. During classification, the log-likelihood b_ij is updated by agreement, measured as the inner product of the prediction vector û_{j|i} and the output vector v_j.
The study uses an agreement function to update the log-likelihood behind the coupling coefficient:

b_ij ← b_ij + û_{j|i} · v_j

A loss function l_k is attached to each capsule k of the final capsule layer; it penalizes the prediction error until the specified level of accuracy is met:

l_k = T_k max(0, m⁺ − ||v_k||)² + λ (1 − T_k) max(0, ||v_k|| − m⁻)²

where T_k = 1 when class k is present. The hyperparameters (m⁺, m⁻, and λ) are estimated during training. Since true labels are exceptionally tough to obtain, only three CapsNet layers are required: one for convolutional operations and two for capsule operations.
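The routing equations above can be condensed into a short NumPy sketch: the squash non-linearity, routing-by-agreement, and the margin loss. The tensor shapes, the three routing iterations, and the default values of m⁺, m⁻, and λ are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def squash(s, eps=1e-8):
    # v = (||s||^2 / (1 + ||s||^2)) * s / ||s||; keeps direction, length in [0, 1).
    norm2 = np.sum(s * s, axis=-1, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

def routing(u_hat, n_iters=3):
    # u_hat: (n_child, n_parent, dim) prediction vectors u_hat_{j|i}.
    n_child, n_parent, _ = u_hat.shape
    b = np.zeros((n_child, n_parent))            # routing logits b_ij start at 0
    for _ in range(n_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax over parents
        s = (c[..., None] * u_hat).sum(axis=0)   # s_j = sum_i c_ij * u_hat_{j|i}
        v = squash(s)                            # v_j
        b = b + (u_hat * v[None, :, :]).sum(-1)  # b_ij += u_hat_{j|i} . v_j
    return v, c

def margin_loss(v, labels, m_pos=0.9, m_neg=0.1, lam=0.5):
    # l_k = T_k max(0, m+ - ||v_k||)^2 + lam (1 - T_k) max(0, ||v_k|| - m-)^2
    lengths = np.linalg.norm(v, axis=-1)
    loss = labels * np.maximum(0.0, m_pos - lengths) ** 2 \
         + lam * (1.0 - labels) * np.maximum(0.0, lengths - m_neg) ** 2
    return loss.sum()
```

Because of the squash function, every output capsule's length stays below one and can be read directly as the probability that the corresponding class is present.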

Results and discussions
This section compares the proposed MapReduce-CapsNet model with traditional MapReduce and traditional deep learning models such as recurrent neural networks (RNNs) for classifying the nature of diabetic disease risk in a patient.
The entire simulation is conducted on a high-end computing engine with 8 GB of primary random access memory and an Intel Core i5 processor. The study uses the Pima Indian diabetes dataset, gathered from the UCI repository, to train and test the classifier; Table 2 shows the parameters considered for training and testing. A further 2500 data items with 15 attributes are used for validation of the classifier after training. Reducing records and attributes matters even in a small, properly compiled training dataset like PIMA, as it improves the training process in this framework. The study does not use the same data for both training and testing: of the 768 valid items, 70% are employed for training and the remaining 30% are used to test the classifier. A 10-fold cross-validation is conducted on the high-end computing system using the MapReduce-CapsNet model.
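The 70/30 split and the 10-fold cross-validation indices described above could be generated as in the sketch below; the fixed seed and the pure-Python index handling are assumptions for illustration, not the paper's tooling.

```python
import random

def train_test_split_indices(n_samples, train_frac=0.7, seed=42):
    # Shuffle the sample indices once, then cut them 70/30 into train and test.
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    cut = int(train_frac * n_samples)
    return idx[:cut], idx[cut:]

def ten_fold(indices, n_folds=10):
    # Carve the training indices into n_folds roughly equal folds; each fold
    # serves as the held-out validation set exactly once.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]

train_idx, test_idx = train_test_split_indices(768)
```

The test split is kept entirely outside the cross-validation loop, so the reported metrics are computed on data the classifier has never seen during training.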
The MapReduce-CapsNet for diabetic classification is compared with conventional MapReduce-RNN and RNN. The validation helps detect a patient's diabetic risk level using the given classifier. Figure 4 compares the accuracy of the MapReduce-CapsNet classifier with that of the traditional MapReduce-RNN and RNN classifiers. The MapReduce-CapsNet provides a multi-variable classification of input diabetic datasets that enables optimal classification with supervised learning under 10-fold cross-validation. The test results in Figure 4 show that the MapReduce-CapsNet classifier achieves a higher accuracy level than the MapReduce-RNN and RNN classifiers.
The MapReduce-CapsNet classifier's F-measure findings are displayed in Figure 5 alongside those of the conventional MapReduce-RNN and RNN classifiers. The MapReduce-CapsNet provides a multi-variable classification of input diabetic datasets under 10-fold cross-validation, yielding a higher F-measure by combining higher sensitivity with higher specificity. The test results in Figure 5 show that the MapReduce-CapsNet classifier achieves a higher F-measure level than the MapReduce-RNN and RNN classifiers.
Figure 6 displays the MapReduce-CapsNet classifier's G-mean results against the traditional MapReduce-RNN and RNN classifiers. The MapReduce-CapsNet provides a multi-variable classification of input diabetic datasets that enables optimal classification with reduced errors under 10-fold cross-validation. The test results in Figure 6 show that the MapReduce-CapsNet classifier achieves a higher G-mean level than the MapReduce-RNN and RNN classifiers.
Figure 7 shows the MAPE results of the MapReduce-CapsNet classifier against the conventional MapReduce-RNN and RNN classifiers. The MapReduce-CapsNet provides a multi-variable classification of input diabetic datasets with a sigmoid loss function under 10-fold cross-validation, enabling reduced MAPE. The test results in Figure 7 show that the MapReduce-CapsNet classifier achieves a lower MAPE than the MapReduce-RNN and RNN classifiers.
Figure 8 compares the sensitivity findings of the MapReduce-CapsNet classifier with those of the traditional MapReduce-RNN and RNN classifiers. The MapReduce-CapsNet provides a multi-variable classification of input diabetic datasets that correctly captures positive samples under 10-fold cross-validation. The test findings in Figure 8 demonstrate that the MapReduce-CapsNet classifier achieves a higher sensitivity level than the MapReduce-RNN and RNN classifiers. Figure 9 compares the specificity results of the MapReduce-CapsNet classifier with those of the traditional MapReduce-RNN and RNN classifiers. The MapReduce-CapsNet provides a multi-variable classification of input diabetic datasets that correctly classifies negative samples under 10-fold cross-validation. The test results in Figure 9 show that the MapReduce-CapsNet classifier achieves a higher specificity level than the MapReduce-RNN and RNN classifiers (Figure 10).
From the results, it is found that accuracy tends to increase with more variables, and that increasing the training data improves the rate at which instances are correctly classified.

Conclusion
In this paper, a MapReduce-based CapsNet enables optimal classification of diabetes conditions from big data. The MapReduce-based CapsNet undergoes suitable training that enables improved diabetes data classification and detects the nature of a patient's risk. The simulation on the test datasets shows that the MapReduce-based CapsNet framework obtains higher classification accuracy, recall rate, and F-score than conventional MapReduce and deep learning algorithms such as RNN and DenseNet. In the future, larger datasets can be added to improve the training rate.

Figure 1
Figure 1 depicts the architecture of the proposed model, which involves pre-processing steps using MapReduce, dataset splitting, and training, followed by CapsNet-based classification and validation. Rough set theory and a hybrid firefly and bat optimization technique are used in the model's feature-extraction block to reduce the dimensionality of the input data and identify relevant characteristics for further processing and classification using CapsNet.
The measures used to evaluate the effectiveness of the classifiers are listed below:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
F-measure = TP / (TP + 0.5(FP + FN))
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
MAPE = (100 / N) Σ_t |A_t − F_t| / A_t

where N represents the total number of iterations, A_t the actual value, and F_t the forecast value; TP stands for true positive, FP for false positive, TN for true negative, and FN for false negative.
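The formulas above can be checked with a small helper; the confusion-matrix counts and the actual/forecast values used below are made-up numbers for illustration only.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    # Standard confusion-matrix metrics, matching the formulas above.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f_measure = tp / (tp + 0.5 * (fp + fn))
    sensitivity = tp / (tp + fn)   # true-positive rate (recall)
    specificity = tn / (tn + fp)   # true-negative rate
    return accuracy, f_measure, sensitivity, specificity

def mape(actual, forecast):
    # Mean absolute percentage error over paired actual/forecast values.
    return 100.0 / len(actual) * sum(
        abs(a - f) / a for a, f in zip(actual, forecast))

acc, f1, sens, spec = classification_metrics(tp=50, tn=40, fp=5, fn=5)
# acc == 0.9, f1 ≈ 0.909, sens ≈ 0.909, spec ≈ 0.889
```

Note that specificity divides by TN + FP (all actual negatives); writing TN + TP in its place would mix the two classes and no longer bound the measure by the negative class.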

Table 2 .
Attributes considered for the study.