Automated disease classification in (Selected) agricultural crops using transfer learning

Biotic stress in agricultural crops is a major concern across the globe. Its effects are felt most acutely in economically poor countries, where advanced facilities for diagnosing disease are limited and awareness among farmers is lacking. The recent revolution in smartphone technology and deep learning techniques has created an opportunity for automated disease classification. In this study, images acquired through a smartphone are transmitted to a personal computer via a wireless Local Area Network (LAN) for classification of ten different diseases, using transfer learning, in four major but little-explored agricultural crops. Six pre-trained Convolutional Neural Networks (CNNs) have been used, namely AlexNet, Visual Geometry Group 16 (VGG16), VGG19, GoogLeNet, ResNet101 and DenseNet201, and their corresponding results are explored. GoogLeNet achieved the best validation accuracy of 97.3%. Misclassification was mainly due to Tobacco Mosaic Virus (TMV) and the two-spotted spider mite. In test conditions, images were classified in real time and prediction scores were evaluated for each disease class. A reduction in accuracy was observed for all models, with VGG16 achieving the best test accuracy of 90%. Various factors contributing to the reduction in accuracy, and the scope for future improvement, are elucidated.


Introduction
Globally, different diseases are key factors affecting agricultural crop production [1]. In developing nations, the lack of knowledge in identifying specific diseases further complicates the control of their inducing factors. A survey by Abang et al. [2] on farmers' knowledge in Cameroon brought to the limelight that only 21% and 16% of farmers were able to identify diseases and pests, respectively. Another survey on farmers' knowledge of various diseases showed inadequate awareness of biotic stress in crops [3]. A similar inadequacy has been reported by Islam [4] for viral disease and its management. Therefore, appropriate training of the farmers can improve the identification and management of diseases. However, the feasibility of such massive training suffers from the lack of sufficient experts and facilities. Moreover, each farmer cultivates different crops in their field, which poses a significant challenge for the farmers to get trained, as well as for the experts to train them on various diseases.
Recently, the abundance of cost-effective smartphones among farmers has created an opportunity for classifying diseases using images of the infected foliage [5]. The smartphone camera acts as a vision sensor which acquires color information in Red Green Blue (RGB) channels. Machine learning technologies use this information to learn to recognize the patterns of a disease based on the visible symptoms on the leaves. Traditionally, features of the symptoms are represented using mathematical functions and evaluated for their ability to map the distinct patterns of different diseases [6][7][8]. These handcrafted features are fed to machine learning algorithms such as Neural Networks (NN), Support Vector Machines (SVM), etc., for classifying the different diseases, and the best features, which result in better accuracy, are utilized [6,8,[9][10][11][12]. Recently, due to advances in computing, a large number of artificial neurons can be stacked in a specific architecture to form deep neural networks, which are capable of learning the features automatically, in contrast to the previous approach. These features are used for image classification (in different domains), and this is popularly known as deep learning. One deep learning approach, namely the Convolutional Neural Network (CNN), is widely used for image classification [13].
Different CNN-based architectures, namely AlexNet, VGG16, VGG19, GoogLeNet, ResNet, DenseNet, etc., have been developed and adopted for solving the problem of disease classification in various crops [5,9,[14][15][16][17][18][19]. The training-from-scratch approach requires a large dataset, as it is a data-driven technique, which poses a significant challenge to researchers. Hence, an approach known as transfer learning [16] is widely used, where the architecture is pre-trained on the ImageNet dataset and each processing element has weights optimized as a result of that training. These pre-trained networks have then been retrained with standard datasets. For example, PlantVillage [20] includes images of common diseases from various crops, using which many studies have been performed [5,9,[14][15][16][17][18][19]. The performance of these pre-trained networks is generally excellent, and in some cases it resulted in an accuracy greater than 99%. In a few other studies, authors have used their own datasets for the classification of various crop diseases [21][22][23][24]. Despite these efforts, the dataset for many crops is limited, and it is the critical input for training deep-learning-based algorithms.
This gap has motivated the authors to create a dataset for a few potential diseases in some vital crops (for which it is unavailable) based on the literature. Also, a rapid disease classification system with an easily available smartphone camera will be beneficial to the farmers. As an initial phase of this objective, a preliminary system consisting of a smartphone camera connected to a wireless Local Area Network (LAN) with a CPU will be used for classifying the disease with pre-trained deep learning architectures such as AlexNet, VGG16, VGG19, GoogLeNet, ResNet and DenseNet [25][26][27][28][29]. These architectures are briefly discussed in the following section. The observed results and analysis are presented in Section 3. The influential factors, drawbacks, possible solutions and future directions are included in Section 4.

System overview
The system consists of a smartphone, wireless LAN and a Personal Computer (PC) (as shown in Figure 1). The smartphone used in the study was Lenovo A7700 equipped with 8 Megapixel camera which costs approximately U.S. $100. Wireless LAN is used for sending the acquired images from a smartphone to the PC. A mobile application (IP webcam) has been used for acquiring and transmitting the images to the PC. The computer system used for the processing of images was ACER NITRO 5 SPIN laptop with 4 GB NVIDIA 1050GTX graphics card and 8 GB of Random Access Memory (RAM). The data communication and classification utilizing deep learning was performed using MATLAB 2018a.

Dataset and deep learning models
In this study, ten different diseases (as shown in Figure 2) of four varieties of crops, namely eggplant (Solanum melongena), hyacinth beans (Dolichos lablab), ladies finger (Abelmoschus esculentus L.) and lime (Citrus aurantifolia), have been considered for developing a disease classification system (shown in Table 1).
The dataset was created by collecting leaf samples from a field located at Tirumalaisamudram, Thanjavur district (10°43′33.8″N, 79°00′57.5″E) in the state of Tamil Nadu, India. The leaves were examined by experts and categorized into the respective disease classes. A smartphone was used to acquire images of isolated leaves placed on a white sheet of paper under a glass sheet, to minimize the loss of visual information from the surface. The images were acquired in a room environment with sufficient natural illumination; no special lighting was used. The images were segmented manually using a software tool to minimize the effect of the background on the learning process.
The number of input images in the dataset has to be increased, as deep learning models require a reasonably large dataset. Data augmentation is a standard procedure in which the number of images is increased artificially by applying image transformations such as rotation, translation and changes in intensity value, which helps prevent the common deep learning problem of overfitting [16]. A random rotation angle within the limit of ±30° was generated and applied to each image. In addition, a random translation within the range of ±30 pixels along the horizontal and vertical directions, together with a random intensity shift in the range of ±30, was applied to each image. These limits were selected based on numerous trials, and the generated dataset was used for the training and validation process. Six pre-trained deep learning models, namely AlexNet, VGG16, VGG19, GoogLeNet, ResNet101 and DenseNet201, have been used for training and validation on the created dataset. AlexNet, VGG16 and VGG19 are series CNN-based networks (as shown in Figure 3) with repeated convolution layers, each followed by a nonlinear activation function, namely the Rectified Linear Unit (ReLU). The corresponding mathematical representations of the convolution operation and ReLU are given by Equations (1) and (2) respectively [21,40]:

$$X_j^l = \sum_i X_i^{l-1} * K_{ij}^l + B_j^l \qquad (1)$$

$$\mathrm{ReLU}(x) = \max(0, x) \qquad (2)$$
$X_j^l$ is the activation map from layer $l$ resulting from the convolution operation using kernel $K_{ij}^l$ on the image or activation map of the previous layer, $X_i^{l-1}$, with bias term $B_j^l$. In some layers, ReLU is followed by max-pooling (as depicted in Equation (3)), which reduces the dimension of the activation map:

$$P_i = \max_{\alpha_i \in R_i} \alpha_i \qquad (3)$$

The resulting map $P_i$ is obtained by applying the pooling region $R_i$ to the feature map (with elements $\alpha_i$) from the previous layer. In the case of AlexNet, an additional layer, known as a batch normalization layer, is included after a few layers. The architecture finally ends with two fully connected layers of 4096 neurons each, followed by another fully connected layer with the number of neurons equal to the number of classes, a softmax layer and a classification layer. The softmax layer (given in Equation (4)) outputs the probability $y_m$ for each possible class from an input vector $T$ of dimension $M$, and the classification layer decides the class based on this probability [21,40]:

$$y_m = \frac{e^{T_m}}{\sum_{k=1}^{M} e^{T_k}} \qquad (4)$$
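For intuition, the four operations referenced in Equations (1)–(4) can be sketched in NumPy. This is a minimal single-channel illustration, not the MATLAB implementation used in the study:

```python
import numpy as np

def relu(x):
    """Equation (2): element-wise Rectified Linear Unit."""
    return np.maximum(0, x)

def conv2d_valid(x, k, b=0.0):
    """Equation (1): a single-channel 'valid' convolution with bias term b."""
    kh, kw = k.shape
    h = x.shape[0] - kh + 1
    w = x.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k) + b
    return out

def maxpool2d(x, size=2):
    """Equation (3): non-overlapping max-pooling over size x size regions."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

def softmax(t):
    """Equation (4): class probabilities from an M-dimensional score vector."""
    e = np.exp(t - t.max())   # shifted for numerical stability
    return e / e.sum()
```

A series network such as VGG16 simply chains these: convolution, ReLU, occasional max-pooling, and a final softmax over the class scores.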
The GoogLeNet architecture contains inception modules (as shown in Figure 4), each consisting of convolution layers arranged in parallel. An inception module extracts features in parallel and concatenates them at its end [27]. These inception modules are arranged in a stack, and the architecture contains 27 layers in total. The fully connected layers, which consume substantial computing resources, have been removed in this architecture.
As the depth of an architecture is increased by stacking more convolution layers, the accuracy and loss saturate quickly due to vanishing gradients during backpropagation. To counteract this problem, He et al. [28] developed a technique of using shortcut or residual connections between convolution layers (as shown in Figure 4), which solves the vanishing gradient problem. There are several variants of the resulting Residual Net, or ResNet, among which ResNet101 (one of the deepest ResNet architectures) was used in this study, due to the limitation of the available GPU resources. As an alternative to a residual connection with an addition layer for solving the vanishing gradient problem, DenseNet uses concatenation layers which combine the feature maps from the previous layers instead of adding them (as shown in Figure 5) [29].
The architecture ensures maximum connectivity, and all the layers have additional input from the previous layers. In our study, DenseNet201, which consists of 709 layers and demands comparatively higher computational resources for execution, has been used.
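The two connectivity schemes can be contrasted in a toy NumPy sketch. This is illustrative only: real ResNet and DenseNet blocks use convolutions, batch normalization and many channels, whereas here a single matrix multiply stands in for the learned transformation:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(x, w):
    """ResNet-style shortcut: the block learns a residual f(x) that is
    ADDED to its input, so gradients can flow through the identity path."""
    return relu(x @ w) + x

def dense_block(x, w):
    """DenseNet-style connectivity: the new feature map is CONCATENATED
    with the input, so later layers see all earlier feature maps."""
    return np.concatenate([x, relu(x @ w)], axis=-1)
```

Note that with zero weights a residual block reduces exactly to the identity mapping, which is why very deep stacks of such blocks remain trainable, while a dense block grows the feature dimension at every step.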

Methodology
The augmented image dataset was used for training and validation (as shown in Table 2) of the six deep learning models. The trained and validated models were deployed to classify a given image (transmitted through a wireless network from a smartphone, as shown in Figure 6). The sample test set was also augmented in order to increase its size and variation, and the test performance was evaluated on it.
During training, in order to improve the accuracy of the models, fine-tuning of the hyperparameters was performed. The following were the fine-tuned parameters for training the network: maximum epochs, 20; initial learning rate, 0.0001; L2 regularization, 0.0001; and momentum, 0.9. The following parameters were tuned for the last fully connected layer: weight learning rate factor, 3; weight L2 factor, 1; bias learning rate factor, 2; and bias L2 factor, 0. The stochastic gradient descent algorithm was used for updating the weight and bias parameters. As these are pre-trained networks, the learning rate for all the layers was kept at a minimum, while the last fully connected layer is required to learn quickly; hence the learning rate factors for this layer are higher. The obtained results are analysed and reasons for misclassification are provided in the Results section.
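The update rule implied by these settings can be sketched as follows. This is a hedged illustration of SGD with momentum, L2 regularization and per-layer learning-rate factors; `lr_mult` mirrors the weight/bias learning rate factors of 3 and 2 for the final layer, and MATLAB's exact `sgdm` update may differ in detail:

```python
import numpy as np

def sgd_momentum_step(w, grad, vel, base_lr=1e-4, lr_mult=1.0,
                      momentum=0.9, l2=1e-4):
    """One SGD update with momentum and L2 regularization.

    base_lr, momentum and l2 match the paper's settings; lr_mult is
    the per-layer factor (1 for the pre-trained layers, 3 for the
    weights and 2 for the biases of the new fully connected layer)."""
    lr = base_lr * lr_mult
    vel = momentum * vel - lr * (grad + l2 * w)   # velocity update
    return w + vel, vel

# Hypothetical per-parameter settings mirroring the paper:
# pre-trained conv weights: lr_mult=1, l2=1e-4
# final FC weights:         lr_mult=3, l2=1e-4  (weight L2 factor 1)
# final FC biases:          lr_mult=2, l2=0.0   (bias L2 factor 0)
```

The larger `lr_mult` lets the randomly re-initialized final layer adapt quickly to the ten disease classes while the pre-trained feature extractor changes only slowly.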

Results
The training was carried out with 80% of the dataset, and 20% was used for validation, as depicted in the previous section. The learning rate remained constant for the entire training of all the models. The training plots for all six models are shown in Figure 7. The times taken for training on the dataset using AlexNet, VGG16, VGG19, GoogLeNet, ResNet101 and DenseNet201 were approximately 11, 108, 224, 32, 175 and 447 min, respectively. As the number of convolution layers in AlexNet is lower than in the other architectures, its time to complete 20 epochs is the lowest of all the models used in the study. GoogLeNet is deeper and contains more convolution layers than VGG16, but it still takes reasonably less time and utilizes minimal computational resources without sacrificing performance. This is mainly due to the development of the architecture based on the Hebbian principle [27]. The convergence of the training accuracy and loss curves is faster with ResNet101 and DenseNet201.
The validation accuracies using these six models, i.e. AlexNet, VGG16, VGG19, GoogLeNet, ResNet101 and DenseNet201, were 91.4%, 96.5%, 94.3%, 97.3%, 96.9% and 93.7%, respectively. GoogLeNet, with its inception modules, was found to produce the best result, due to parallel processing and a possibly optimal depth of the architecture, which improves the learning of features from the image. The obtained result agrees with previous studies carried out using the PlantVillage dataset [5,9] while contradicting another study [17] in which VGG16 produced the best results. In general, deeper architectures such as VGG16 and VGG19 produced better results at the expense of computational resources, which is evident from the time taken and the memory space allotted for training. In the case of DenseNet201 and ResNet101, the accuracy did not improve significantly with deeper layers, but the computing resources were used more efficiently compared to the earlier models. In addition, although the time taken for training is longer, test-time performance is in milliseconds, comparable to that of shallower traditional learning algorithms.
When the validation accuracy was analysed using the confusion matrix (as shown in Figure 8), the disease class mainly affecting the performance was the two-spotted spider mite. It was mainly misclassified as TMV, as the symptoms of the two-spotted spider mite are not learned efficiently. Moreover, the symptoms of this disease do not have distinct, clear patterns, but rather a very fine powdery appearance due to the presence of the pest on the surface of the leaves. In severe cases, discoloration begins to appear, which may be the possible cause of its misclassification as TMV.
This was also evident from the estimated property values, namely the False Negative Rate (FNR), False Positive Rate (FPR), True Positive Rate (TPR) and True Negative Rate (TNR), computed from the confusion matrix, as shown in Table 3.
The FPR of the class TMV is higher with all the algorithms, which shows that other classes have been misclassified as TMV. The other classes affecting the accuracy were Cercospora leaf spot and Epilachna beetle, which were misclassified as TMV. This misclassification is substantiated by the TPR values of Cercospora leaf spot and Epilachna beetle with most of the models. The attributed reason may be the presence of discoloration and pests on the leaves in a few images of the Epilachna beetle class. A few of the disease classes achieved high TPR values (as shown in Table 3) for VGG16, VGG19, ResNet101 and DenseNet201, which depicts a higher prediction capability. In the case of AlexNet, the other class which majorly affected its accuracy was leaf hopper (TPR = 0.871), which was misclassified as yellow vein mosaic due to similarity in symptom color. Similar results were observed in the case of VGG19 and DenseNet201. From the confusion matrix, it is also evident that the misclassification of a disease of one species to another species was very low. This signifies the ability of the architectures to learn other features, such as different leaf shapes, veins, etc.
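The four rates reported in Table 3 can be computed from a confusion matrix in a one-vs-rest fashion. A sketch, assuming rows are true classes and columns are predicted classes:

```python
import numpy as np

def class_rates(cm):
    """Per-class TPR, TNR, FPR and FNR from a confusion matrix.

    cm[i, j] = number of samples of true class i predicted as class j."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    tp = np.diag(cm)                  # correct predictions per class
    fn = cm.sum(axis=1) - tp          # true class i, predicted elsewhere
    fp = cm.sum(axis=0) - tp          # other classes predicted as i
    tn = total - tp - fn - fp         # everything else
    return {"TPR": tp / (tp + fn), "TNR": tn / (tn + fp),
            "FPR": fp / (fp + tn), "FNR": fn / (fn + tp)}
```

For example, a class whose column collects many off-diagonal counts (as TMV does here) shows an elevated FPR even when its own TPR is high.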

Visualization of features
The learning of features by the trained model can be further visualized and analysed qualitatively using the deep dream image function (originally developed by Alexander Mordvintsev) available in the MATLAB software tool. A few studies have used visualization for understanding the learning ability of different architectures and convolution layers [41,42]. Due to a limitation of the software tool used in this study, the function is applicable only to series networks (AlexNet, VGG16, VGG19) and cannot be utilized for GoogLeNet, as it is a Directed Acyclic Graph (DAG) network. The visualization gives an intuition about the learning ability of the network. The function is applied to the last fully connected layer of AlexNet, VGG16 and VGG19, as abstract features are combined there using the features from different layers. The function generates an image which enhances the activation of the neurons in the network layers. The images were generated starting from a random image with pixels drawn from a normal distribution. When the features were visualized with AlexNet, they did not closely match the real symptoms. But in the case of VGG16 and VGG19, features from the symptoms of almost all diseases could be related to the visualization map. A sample visualization of Cercospora leaf spot is shown in Figure 9, where random spots are present in the generated image.
The poorly learned classes did not show features matching those disease classes (e.g. two-spotted spider mite) and generated images which may be based mostly on the weights from pre-training. This is in agreement with the analysis of the confusion matrix, where the two-spotted spider mite was more often misclassified into other classes.
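The underlying idea, gradient ascent on a random-normal image to maximize a chosen activation, can be sketched with a toy linear "unit". This illustrates the principle only; it is not MATLAB's deepDreamImage, and a real network would obtain the gradient by backpropagation:

```python
import numpy as np

def activation_maximization(weight, unit, shape, steps=200, lr=0.5, seed=0):
    """Gradient-ascent sketch of the deep-dream idea: start from a
    random-normal image and push it to maximally activate one unit.

    For this toy linear 'layer' (activation = w_unit . x) the gradient
    of the activation w.r.t. the image is simply w_unit."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=shape)           # random-normal starting image
    w = weight[unit].reshape(shape)      # this unit's weight pattern
    for _ in range(steps):
        x += lr * w / np.linalg.norm(w)  # normalized gradient step
    return x
```

After many steps the image is dominated by the unit's preferred pattern, which is why well-learned disease classes yield visualizations resembling their symptoms while poorly learned ones do not.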

Testing
Finally, testing of the sample images obtained from the smartphone camera and transmitted through the wireless LAN was performed. The images were read from the specific virtual COM port of the PC with the MATLAB tool. The testing was carried out in two ways, one with pre-processing and the other without. In pre-processing, a binary mask was constructed using a thresholding operation, and the input test image was segmented to obtain a background similar to the training and validation images. In the other case, the input image was classified directly with the trained model. A sample output obtained as a result of using pre-processed images for classification is shown in Figure 10. There was a difference in the classification accuracy between the original and pre-processed images across the different models. The applied segmentation procedure also resulted in the loss of some pixel information. As expected, the performance of AlexNet was poor, with a resulting accuracy of 50% (as shown in Figure 11).
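The thresholding step can be sketched as follows. This is a minimal grayscale version; the threshold of 200 is an assumed value, and the study's MATLAB pipeline operated on RGB images:

```python
import numpy as np

def segment_on_white(gray, thresh=200, fill=255):
    """Threshold-based pre-processing sketch: pixels brighter than
    `thresh` are treated as the white-paper background and replaced
    with a uniform white fill, leaving only the leaf region.

    `gray` is a uint8 grayscale image; the threshold is an assumption,
    not the value used in the study."""
    mask = gray < thresh              # True where the leaf is
    out = np.full_like(gray, fill)    # start from a uniform background
    out[mask] = gray[mask]            # keep only the leaf pixels
    return out, mask
```

Because thresholding is a hard cut, leaf pixels brighter than the threshold are lost, which is the "loss of some pixel information" noted above.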
Overall, the accuracy dropped significantly when the models were provided with the test set. It was interesting to observe that VGG16 resulted in the best accuracy (i.e. 90%) compared to deeper architectures such as GoogLeNet (60%), VGG19 (80%), ResNet101 (76%) and DenseNet201 (64%) using the segmented images. The class misclassified in almost all cases with VGG16 was Cercospora leaf spot. It was falsely classified as citrus canker, as the symptoms of the two diseases are similar; hence the prediction score is 100% for citrus canker, as shown in Figure 12.
All the models were able to classify six diseases with higher accuracy, namely little leaf, citrus canker, citrus hindu mite, Epilachna beetle, yellow vein mosaic and leaf hopper. In all the models, Cercospora leaf spot was misclassified as citrus canker, as the symptoms are similar, with distinct spots randomly present on the leaf surface. The prediction score for each class using the best performing model (VGG16) on the test set with pre-processed sample images is shown in Figure 13.
In order to observe the effect of transformations such as rotation and translation on the test images, they were translated randomly and rotated to 90°, 180° and 270°.
There was a significant fall in accuracy with all the models, and AlexNet was the poorest performer, with 40% accuracy. An example of this introduced uncertainty is shown in Figure 14, where TMV was classified as Cercospora leaf spot when the image was rotated to 270° with VGG16.
When an additional change in illumination was introduced, the accuracy fell further, as the test dataset included images under illumination conditions not seen in training. This is in agreement with previous studies which have shown that deep learning architectures trained with images acquired under controlled conditions are unable to classify disease effectively under different conditions [5,43]. Hence, the developed method is limited to images of leaf samples acquired against a constant white background under similar illumination.

Discussion
VGG16 alone was able to produce a classification accuracy of 90% on the test set, and all the models reported a lower accuracy compared to validation. The results suggest that the trained deep learning models work well when the provided input data are similar to the training data, and the accuracy drops when they differ. The effect of an unseen dataset on classification accuracy is in agreement with earlier studies by Mohanty et al. [5] and Barbedo [16]. Also, applying different augmentation techniques separately to the training and test sets resulted in a drop in performance. Hence, the orientation of the leaf in the input test images should be similar to that in the training images. The results also suggest that when the trained models were put to the test without removal of the background, the accuracy fell significantly compared to segmented images in all cases except DenseNet201. Although deep learning models have the ability to learn features irrespective of the background, the limited quantity of the dataset may be one of the factors influencing the prediction capability. Barbedo [16] likewise showed a significant impact of the background on classification accuracy, but also suggested that increasing the dataset size may improve performance.
In this study, Cercospora leaf spot in brinjal was misclassified as citrus canker in the testing scenario. When the symptoms were analysed visually, both diseases contain a dark spot surrounded by a yellow halo region. The properties of these symptoms are different, but in a few cases they match each other. An earlier study shows that similarity of symptoms between two disease classes can lead to significant errors in classification. According to Barbedo [16], this problem can be partially solved by increasing the dataset size used for training.
The studies by Ferentinos [17] and Shijie et al. [14] reported higher accuracy utilizing the PlantVillage dataset and VGG16. But one study, by Too et al. [19], reported a lower accuracy of 76.12% using VGG16 and the PlantVillage dataset, contradicting the results of those studies. Despite the higher performance shown by VGG16, the major bottleneck for its implementation is its heavy consumption of computer resources, even though it is shallower than other, deeper architectures.
Most of the studies have utilized the PlantVillage dataset, but unfortunately it lacks several unexplored diseases and crops [5,9,[14][15][16]. We have attempted to close this gap by creating a dataset for ten diseases in four crops that are not available in the above dataset. Also, the study achieves good accuracy even with a lower number of training images and limited computational resources for training, compared to the PlantVillage dataset, which has thousands of images and requires a GPU with larger memory.
Our study has also focused on the application side of the system for the farmers, and a Graphical User Interface (GUI) has been developed (as shown in Figure 14) using the GUI options available in MATLAB 2018a. The developed GUI is a precursor for the development of a software application that encompasses the various options available to the user. When the input image option is selected, a snapshot is acquired from the video footage of the smartphone camera transmitted via the wireless LAN and stored in the computer memory. Pre-processing is an optional step, using which the image can be segmented. Finally, the classify disease option loads one of the trained models, which predicts the associated disease class. The output of the application is the predicted class with its prediction score. The prediction score reveals the confidence of the network in predicting the particular target class. Although the developed system is not state of the art, it can be utilized as a disease diagnostic support system in remote areas that lack facilities and manpower for diagnosis.
In future, it is proposed to prepare a dataset of the above diseases with complex backgrounds under varying conditions. The behavior of the architectures will be analysed for developing a sustainable disease diagnostic system for real-time classification of disease by implementing a model on a smartphone.

Conclusions
This study has proposed an automated disease diagnostic system with cost-effective resources, such as a smartphone, a wireless network and a PC, using intelligent deep-learning-based algorithms, for assisting farmers. The study has evaluated six pre-trained deep learning models, namely AlexNet, VGG16, VGG19, GoogLeNet, ResNet101 and DenseNet201, for classification of diseases on a created dataset of ten different diseases across four different crops. The data augmentation technique has been utilized for enhancing the dataset with introduced distortions, which improves generalization and prevents overfitting. When the results from the validation data were compared, GoogLeNet was found to produce the best accuracy of 97.3%. In contrast, in the test scenario, VGG16 resulted in the best accuracy. The reasons for the variation in accuracy between the validation and test sets were discussed, and future directions for this work have been presented.

Disclosure statement
No potential conflict of interest was reported by the author(s).