Crop pest recognition based on a modified capsule network

Crop pest insects seriously affect the yield and quality of crops, and pesticide-based control methods cause severe environmental pollution, both of which affect people's daily lives. Crop pest identification in the field is a crucial component of pest control. It is much more complex than generic object recognition because the same pest species appears in the field with various shapes, colours and sizes against complex backgrounds. A crop pest recognition method is proposed based on a modified capsule network (MCapsNet). In MCapsNet, a capsule network is used to improve the traditional convolutional neural network (CNN), and an attention module is introduced to capture the most important classification features and to speed up network training. Experimental results on a pest image dataset validate that the proposed method is effective and feasible for classifying various types of insects in field crops and can be deployed in the agriculture sector for crop protection.


Introduction
Crops such as maize, wheat, rice, soybean, sugarcane and cotton are often affected by different kinds of pests, which severely reduce the production and quality of crops. Accurate detection and identification of crop pests is essential for pest control in the early stages of crop growth. The detection and recognition of crop insects remain difficult and challenging on a massive scale because of their similar appearance and complex backgrounds. Moreover, pest images collected in the natural field environment are often affected by illumination, insect morphology, image size, shooting angle, etc., which greatly complicate pest detection and recognition (Wu et al., 2019b). With the continuous development of image processing and pattern recognition technology, many crop pest detection and identification methods have been presented (Deng et al., 2020; Zhang et al., 2013), which can be divided into feature extraction-based methods (Miranda et al., 2014) and deep learning-based methods (Thenmozhi & Reddy, 2019).
Feature extraction and the classifier are crucial to pest recognition in the field. Wang et al. (2012a) designed an automatic identification system for insect images, in which several relative features are extracted by digital image processing, pattern recognition and the theory of taxonomy, and artificial neural networks (ANNs) and support vector machines (SVMs) are used as classifiers in pest identification tests. To extract invariant features representing pest appearance, Deng et al. (2018) integrated a bio-inspired hierarchical model, the scale-invariant feature transform (SIFT) and non-negative sparse coding (NNSC) to increase feature invariance, and then extracted invariant texture features based on a local configuration pattern algorithm. Zhang et al. (2014) presented a pest image recognition method combining colour, shape and texture features and validated its effectiveness on 34 kinds of pest images covering rice, rape, corn and soybean. Wang et al. (2012b) conducted extensive tests on eight and nine insect orders with different features, compared the advantages and disadvantages of their system, designed an automatic insect identification system at the order level based on image processing, pattern recognition and the theory of taxonomy, and provided advice for future research on insect image recognition.
Although various image segmentation algorithms (Wang et al., 2010a; Wang & Huang, 2008), feature extraction algorithms (Chen et al., 2005; Du et al., 2006b; Huang & Jiang, 2012; Wang & Huang, 2009) and neural networks (NNs) (Han et al., 2008, 2010; Han & Huang, 2006; Zhao et al., 2004) can be applied to the pest image recognition task (Deng et al., 2018; Miranda et al., 2014; Zhang et al., 2013), they rely on tedious steps, such as image preprocessing, image segmentation, feature extraction and feature classification, to recognize crop pests, and the corresponding results depend heavily on the quality of the image preprocessing and segmentation (Du et al., 2007; Shang et al., 2006; Xiao-Feng et al., 2008a), the extracted handcrafted features and the classifiers, such as NNs (Huang & Du, 2008) and radial basis probabilistic NNs (Du et al., 2006a; Huang, 1999), as well as on the background (Kamilaris & Prenafeta-Boldú, 2018). Moreover, because the same pest appears in the field in irregular forms with various shapes, poses and colours against complex backgrounds, as shown in Figure 1, manually designed features can hardly capture the natural attributes of crop pests in the field, and it is therefore difficult to extract robust and invariant classification features from pest images.
Image recognition using deep learning is considered the state of the art in computer vision research (Zhang et al., 2018). Deep learning-based crop pest detection methods not only save time and effort but also achieve real-time judgments. They perform automatic feature extraction and learn complex high-level features in image classification applications (Shah et al., 2020). Owing to their ability to learn data-dependent features automatically, many convolutional neural networks (CNNs) and their variant models have been applied to pest identification tasks (Liu & Wang, 2020). Cheng et al. (2017) built an image dataset of tomato diseases and pests in a real natural environment and proposed a tomato disease and pest detection method based on an improved YOLOv3, which serves as a reference for intelligent recognition and engineering applications of plant disease and pest detection. To achieve pest identification against complex farmland backgrounds, Alfarisy et al. (2018) proposed a pest identification method using deep residual learning, which can be integrated with current agricultural networking systems for actual agricultural pest control tasks. Wu et al. (2019a) collected 4511 images of paddy pests and diseases using search engines in four languages and augmented them to develop diverse datasets. These datasets were fed into the CaffeNet model under the Caffe framework for paddy pest and disease recognition, achieving an accuracy of 87%, far above the random-selection baseline of 7.6%. To control crop diseases and pests, Nanni et al. (2020) introduced AlexNet and GoogleNet to detect crop pests and diseases on several crop pest and disease image sets, achieving the highest detection accuracy of 98.48% on 38 pest and disease classes with high efficiency, practicability and accuracy. Li et al. 
(2020b) proposed an automatic classifier for pest recognition by integrating saliency methods and CNNs, where three different saliency methods are used as image preprocessing to create three images per method, and new 3 × 3 composite images are created for every original image to train different CNNs. Thenmozhi & Reddy (2019) introduced a crop pest recognition method based on several deep CNNs that accurately recognizes ten common crop pests. Chen et al. (2021) proposed a convolutional neural network (CNN) model to classify insect species on three publicly available insect datasets. The proposed model was evaluated against pre-trained deep learning architectures such as AlexNet, ResNet, GoogLeNet and VGGNet for insect classification, and the experimental results validated that the CNN can comprehensively extract multifaceted insect features. To enhance the learning capability for pest images with cluttered backgrounds, Li et al. (2020a) proposed crop pest recognition with an attention-embedded lightweight network under field conditions, where an optimized loss function and two-stage transfer learning are adopted in model training to improve the identification accuracy of crop pest images. Wang et al. (2020) established a large-scale standardized agricultural pest dataset, Pest24, containing 25,378 labelled images of 24 kinds of field pests. They applied several state-of-the-art deep learning detection methods (Faster R-CNN, SSD, YOLOv3 and Cascade R-CNN) to detect the pests in the dataset, obtained encouraging results for real-time monitoring of field crop pests, and analysed the dataset from a variety of aspects, finding that relative scale, number of instances and object adhesion mainly influence pest detection performance.
The modelling ability and classification performance of CNNs and their improved models for geometrically invariant features mainly come from the expansion of datasets, the deepening of network layers and the manual design of the model, but these do not fundamentally solve the deformation problem of field pests (Chen et al., 2021; Li et al., 2020a, 2020b; Wang et al., 2020).
CNNs and their improved models perform very well in image classification and recognition, but they do not fundamentally solve the deformation problem of field pests: the colours of field pest images are rich, their colour gradients change significantly, and their sizes and shapes are varied and complex, so applying a CNN directly cannot extract deep, rich characteristics from pest images. Unlike the maximum pooling in CNNs, the capsule network (CapsNet) does not discard the location information between entities within a region; it retains semantic information and the spatial relationships between features, as demonstrated in text classification (Kumar et al., 2020).
A capsule in CapsNet is composed of neurons whose activities encode characteristic attributes of the input image, such as texture and colour. A capsule can predict the global characteristics of a whole entity from some of the entity's attributes held in its neurons. It is a carrier of multiple neurons that recognizes a visual entity within a limited range of viewing conditions and deformations, and it outputs a set of instantiation parameters together with a significance value (that is, the probability that the entity exists); the parameters include the entity's precise location, colour and shape. The significance of a visual entity is locally invariant (the recognition probability of the entity type does not change under small transformations), while the instantiation parameters are equivariant (they change correspondingly with the transformation). In CapsNet, the neural nodes of a CNN are replaced with neuron vectors, and the network is trained using the dynamic routing algorithm (DRA) instead of the maximum pooling operation of a CNN. It not only considers the characteristics of image pixels when extracting image features but also fully considers the spatial relations of image elements. The convolutional layer and the primary capsule layer form the main feature extraction part, achieving the matching and projection from the image's low-dimensional features to high-dimensional features. During the capsule operations, a lower capsule transmits part of its extracted features to the upper capsules for overall recognition. DRA is the key to projecting feature information between capsules in CapsNet. CapsNet is ideal for the segmentation and recognition of handwritten and medical images (Sabour et al., 2017).
However, because field pest images have rich colours, significant colour gradient changes, and varied and complex sizes and shapes, applying CapsNet directly cannot extract deep, rich characteristics from pest images (Kromm & Rohr, 2020). Inspired by CNNs, CapsNet and the attention mechanism, a modified CapsNet with an attention mechanism, namely MCapsNet, is constructed for pest recognition. The main contributions of this paper are as follows.
(1) MCapsNet is constructed to extract invariant features from varied pest images.
(2) An attention mechanism is used to capture rich contextual relationships for better feature extraction and improved network training.
(3) A LeakyReLU activation function is used to speed up model convergence and to prevent gradient dispersion.
The remainder of this paper is organized as follows. Section 2 introduces the attention mechanism and CapsNet. Section 3 presents the improved CapsNet with an attention mechanism in detail. The experiments and results are presented in Section 4. Section 5 summarizes this paper and gives future work.

Attention mechanism
Attention mechanisms have been successfully applied to deep learning-based recognition tasks, such as machine translation, question answering, speech recognition and image captioning (Li et al., 2020a). The central idea is to assign high weights to important features, ignore irrelevant features and thereby amplify the desired features. Let the input matrix be H = [h_1, h_2, ..., h_n], where h_i is the i-th output feature vector and n is the sequence length. A score α_i is obtained by the tanh function, α_i = tanh(W_α h_i + b), where W_α ∈ R^{1×n} is a weight matrix and b is a bias. The attention weight β_i is calculated from the scores by the softmax function, and the output feature is obtained as the weighted sum Σ_i β_i h_i.
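The scoring, normalization and weighted-sum steps above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the function name `attention_pool` and the concrete shapes are assumptions.

```python
import numpy as np

def attention_pool(H, W_a, b):
    """Score each feature vector with tanh, normalize with softmax,
    and return the attention-weighted sum of the features."""
    # Unnormalized scores: one scalar per feature vector h_i.
    scores = np.tanh(H @ W_a + b)            # shape (n,)
    # Softmax turns the scores into attention weights beta_i.
    e = np.exp(scores - scores.max())
    beta = e / e.sum()
    # The weighted sum amplifies important features, damps the rest.
    return beta @ H                          # shape (d,)

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))                  # n = 5 feature vectors, d = 8
W_a = rng.normal(size=(8,))
out = attention_pool(H, W_a, 0.0)
```

The softmax guarantees the weights are positive and sum to one, so the output stays in the span of the input features.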

Capsule network (CapsNet)
CapsNet is mainly composed of a convolutional layer, a primary capsule layer (PrimaryCaps), a digit capsule layer (DigitCaps) and a concatenation layer, as shown in Figure 2. Suppose the input is an image of 28 × 28 pixels. The convolutional layer captures features from the input image and outputs feature maps, which are transformed into vector capsules by the PrimaryCaps layer and then output after the calculation of DigitCaps. DigitCaps is similar to the fully connected layer of a CNN, but each neuron is replaced by a capsule structure for classification and output. The probability of a category is measured by the length of the output vector, and the vector with the largest length gives the output category. In the classical CapsNet, the convolutional layer uses 9 × 9 convolution kernels with a stride of 1 and a depth of 256, with ReLU as the activation function. In the PrimaryCaps layer, eight groups of convolution kernels with a stride of 2, a depth of 32 and a size of 9 × 9 are used; eight convolution operations on the feature maps output by the convolutional layer yield 6 × 6 × 8 × 32 feature vectors, from which 1152 capsules are obtained, each consisting of an eight-dimensional vector. The DigitCaps layer outputs a 16 × 10 tensor. The algorithm of CapsNet is shown in Figure 3.
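The capsule counts quoted above follow from simple shape arithmetic for valid-padding convolutions; the sketch below (helper name assumed for illustration) reproduces them.

```python
def conv_out(size, kernel, stride):
    """Output side length of a valid-padding convolution."""
    return (size - kernel) // stride + 1

side = conv_out(28, 9, 1)          # conv layer: 28 -> 20
side = conv_out(side, 9, 2)        # PrimaryCaps conv: 20 -> 6
num_capsules = side * side * 32    # 32 channels of 6x6 capsule grids
```

With a 28 × 28 input this gives a 6 × 6 grid per channel and 6 × 6 × 32 = 1152 eight-dimensional capsules, matching the figures in the text.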
In Figure 3, W denotes the back-propagation operation in the parameter update, and C denotes the scalar weighted-sum operation. The right part of Figure 3 is the capsule network. DRA gives better classification results when the number of iterations is set to 3, so in this model the number of dynamic routing iterations is set to 3, the PrimaryCaps convolution kernel size is set to 3 × 3, the number of channels is set to 3, and the number of DigitCaps labels is set to the number of categories in the dataset. The feature vectors are input into the capsules. The feature mapping between low-level and high-level capsules is encoded in the weight matrix W, and each feature vector is multiplied by the corresponding weight matrix W. To better combine the feature vectors of the preceding low-level capsules, the prediction vectors are weighted and summed by scalar coefficients before being input into the high-level capsules. CapsNet adopts the squashing nonlinear function to ensure that the length of the output vector lies between 0 and 1. The j-th output capsule of the parent capsule layer is calculated as

v_j = (||s_j||^2 / (1 + ||s_j||^2)) · (s_j / ||s_j||)    (4)

where s_j is the total input vector of the j-th capsule, obtained by the weighted sum over the child capsules connected to the j-th parent capsule, ||s_j||^2 / (1 + ||s_j||^2) is the reduction coefficient of s_j, s_j / ||s_j|| is the normalized unit vector of s_j, and s_j = Σ_i c_ij û_{j|i}. The prediction vector û_{j|i} is obtained by multiplying the output features of the BN layer by the weight matrix of the primary capsule layer:

û_{j|i} = w_ij u_i    (5)

where u_i is the i-th output capsule of the child capsule layer and w_ij is an element of the weight matrix W. Similar to a fully connected layer in a CNN, s_j is the linear weighted sum of the u_i of the lower capsule layer. On this basis, v_j is introduced into CapsNet. The coefficient c_ij is updated between capsules by DRA as follows:

c_ij = exp(b_ij) / Σ_k exp(b_ik)
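Equation (4) can be sketched in a few lines, assuming NumPy; the small epsilon guarding the division by zero is an implementation detail, not part of the equation.

```python
import numpy as np

def squash(s, eps=1e-8):
    """Eq. (4): shrink s_j to a length in (0, 1) while
    keeping its direction unchanged."""
    norm2 = np.sum(s * s)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

v = squash(np.array([3.0, 4.0]))   # ||s|| = 5, so ||v|| = 25/26
```

Long vectors are mapped close to unit length and short vectors close to zero, which is what lets the vector length act as a class probability.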
where k is the category number. CapsNet uses the margin (edge) loss as the model loss function, expressed as

L_k = T_k max(0, m+ − ||V_k||)^2 + λ(1 − T_k) max(0, ||V_k|| − m−)^2    (6)
where the former term calculates the loss of correctly classified digit capsules and the latter term calculates the loss of wrongly classified digit capsules, m+ = 0.9 and m− = 0.1 are the default category prediction margins, λ = 0.5 is the default balance coefficient, T_k is the data category label, with T_k = 1 for the correct label, CNum is the number of categories, and ||V_k|| is the length of the vector representing the probability that the input is a pest of the k-th class. The total loss is the sum of the losses of all digit capsules. In Equation (6), the smaller L_c is, the smaller the difference between the predicted output vector and the true value of the input vector, that is, the better the classification effect of CapsNet. To estimate the error between the predictions and the ground truth when updating the model parameters with the dynamic routing algorithm, the number of vectors v_j in the last digit capsule layer is kept consistent with the number of output categories. The length of the vector v_j expresses the probability that the target belongs to the j-th class. The correct category and the wrong category are labelled 1 and 0, respectively: if the judgment is correct, the first term calculates the loss; if it is incorrect, the latter term calculates the loss.
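The margin loss above can be sketched in plain Python as follows (the function name is an assumption; per-capsule losses are summed as the text describes).

```python
def margin_loss(lengths, labels, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Eq. (6) summed over all digit capsules: lengths are ||V_k||,
    labels are T_k (1 for the correct class, 0 otherwise)."""
    total = 0.0
    for v_len, t in zip(lengths, labels):
        present = t * max(0.0, m_pos - v_len) ** 2          # correct class
        absent = lam * (1 - t) * max(0.0, v_len - m_neg) ** 2  # wrong class
        total += present + absent
    return total

# Two capsules: the true class fires strongly (0.95), the other weakly (0.2).
loss = margin_loss([0.95, 0.2], [1, 0])
```

Here the correct capsule exceeds m+ so it contributes nothing, and only the wrong capsule's excess over m− is penalized, scaled by λ.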
CapsNet classifies the input features by adopting DRA instead of the pooling layer of a CNN. The more similar the features are, the more they reinforce each other, which is equivalent to a feature selection process (Kumar et al., 2020; Zhang et al., 2020). The main idea of DRA is that each child capsule's output predicts the instantiation parameters of the parent capsules through a transformation matrix; when the predictions of the bottom capsules agree, the parent capsule activates and outputs its eigenvector.
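The routing-by-agreement idea can be sketched compactly, assuming NumPy and the classical shapes (1152 child capsules, 10 parent capsules, 16-D outputs); the names and the random initialization are illustrative, not the paper's code.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Shrink vectors along `axis` to length in (0, 1)."""
    n2 = np.sum(s * s, axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def dynamic_routing(u_hat, iters=3):
    """Route predictions u_hat (n_child, n_parent, dim) by agreement;
    returns the parent capsule outputs (n_parent, dim)."""
    n_child, n_parent, _ = u_hat.shape
    b = np.zeros((n_child, n_parent))            # routing logits b_ij
    for _ in range(iters):
        # Coupling coefficients c_ij: softmax over parent capsules.
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        s = np.einsum('ij,ijd->jd', c, u_hat)    # weighted sum s_j
        v = squash(s)                            # parent outputs v_j
        # Raise logits where prediction and output agree (dot product).
        b = b + np.einsum('ijd,jd->ij', u_hat, v)
    return v

rng = np.random.default_rng(1)
u_hat = rng.normal(size=(1152, 10, 16))
v = dynamic_routing(u_hat)
```

Three iterations, as the text recommends, are usually enough for the coupling coefficients to concentrate on the parents whose outputs the child predictions agree with.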

Modified CapsNet with an attention mechanism
CapsNet is not effective on large-scale image datasets, which limits its range of application. A modified CapsNet (MCapsNet) is therefore constructed by introducing an attention mechanism and local DRA (LDRA) in place of the DRA of CapsNet. Firstly, features are extracted from the two-dimensional images through three convolutional layers and transmitted to the feature capsule layer to form high-dimensional feature capsules. Then, the LDRA of the category capsule layer maps them to the final classification result.
The convolutional part of CapsNet uses only a single convolution operation to extract image characteristics, which cannot extract the high-level characteristics of pest images. In MCapsNet, three consecutive convolutional layers replace the single convolutional layer of CapsNet to achieve high-level feature extraction from pest images. Suppose pest images with a size of 224 × 224 × 3 are the input of MCapsNet. The first convolution generates a 64-channel feature map of size 112 × 112, and a LeakyReLU nonlinear function between the convolutional layers activates the convolution output before passing it to the next layer. After the same convolution operation, the feature map generated at the previous layer is transformed into a 128-channel feature map of size 56 × 56. The last convolutional layer continues the convolution on the twice-activated feature map: 256 convolution kernels of size 3 × 3 slide over the feature map generated by the middle convolutional layer, and the result is produced before activation. MCapsNet thus performs pre-convolution and per-layer activation using three successive convolutions with a stride of 2 and a kernel size of 3 × 3. Through these three convolutions, the 224 × 224 × 3 pest images are transformed into two-dimensional image features, which facilitate the feature analysis and processing of the capsule layer, allow better abstraction of the extracted features, and enhance the expression of two-dimensional image features by the feature capsule layer.
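Assuming 'same' padding, each stride-2 convolution halves the spatial side, which reproduces the 224 → 112 → 56 progression described above; the plain-Python sketch below (helper name assumed) traces the shapes.

```python
import math

def same_conv_out(size, stride):
    """Output side length of a 'same'-padded strided convolution."""
    return math.ceil(size / stride)

side, channels = 224, 3
for out_ch in (64, 128, 256):       # the three stacked 3x3, stride-2 convs
    side = same_conv_out(side, 2)
    channels = out_ch
```

After the three convolutions the 224 × 224 × 3 input has become a 28 × 28 × 256 feature volume, small enough for the capsule layers to process.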
To avoid the influence of gradient vanishing during back-propagation through the pre-convolution on the classification results of pest images, an appropriate activation function is used to organize the features propagated to the next layer. To give the model better fitting performance and convergence speed, the activation function in CapsNet is replaced, and each convolutional layer in the pre-convolution network is activated. The nonlinear activation function activates the image features and removes some redundant image features, and MCapsNet adopts the LeakyReLU activation function to speed up model convergence to some extent and prevent gradient dispersion. The difference between LeakyReLU and ReLU is that ReLU outputs 0 when the input x is less than 0, whereas LeakyReLU retains some information and its gradient is not zero. LeakyReLU is defined as

f(x) = x, if x > 0; f(x) = leak · x, otherwise,

where leak is a small decimal set to 0.1 and x is the input data.
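The definition above is a one-liner in code; this plain-Python sketch (function name assumed) makes the two branches explicit.

```python
def leaky_relu(x, leak=0.1):
    """LeakyReLU: identity for x > 0, small slope `leak` otherwise,
    so the gradient never vanishes entirely for negative inputs."""
    return x if x > 0 else leak * x
```

Unlike ReLU, a negative input such as -3.0 maps to -0.3 rather than 0, so some signal and a nonzero gradient survive.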
To adapt to different objective functions, MCapsNet adopts the Adam algorithm with an adaptive learning rate. Its gradient is smoother during training, and all parameters are optimized. Its adaptive learning rate is based on adaptive estimates of low-order moments, which accelerates the convergence of MCapsNet. For pest images with heavy noise, the Adam algorithm can reduce the impact of noise on feature extraction. During training, the adaptive learning rate and momentum are used together, the learning rate is kept within a fixed range, the parameter updates are relatively stable, and unstable gradient descent is avoided. Updating the model's weights and parameters in this way gives MCapsNet better performance; making full use of the means of the first and second moments of the gradient is an important part of the update. To further alleviate the over-fitting caused by overly long training, a two-stage model training method is used: the model parameters are first initialized with a small amount of training, and then the model is trained on all the data.
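The moment-based update Adam performs can be illustrated for a single scalar parameter; this is a standard-form sketch (function name assumed), not MCapsNet's training code.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: adapt the step size from bias-corrected
    estimates of the first and second moments of the gradient."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (variance)
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adam_step(theta, grad=2.0, m=m, v=v, t=1)
```

After bias correction, the first step is close to lr in magnitude regardless of the raw gradient scale, which is what keeps the parameter changes within a stable range.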
From the above analysis, an MCapsNet with an attention mechanism model is constructed. Its architecture and corresponding parameters are shown in Figure 4 and Table 1, respectively. Three continuous convolutional operations in Layer_1 of MCapsNet are shown in Figure 5.
Based on CapsNet, MCapsNet adjusts the input feature map and the size of the convolution kernels, and utilizes three successive convolutional layers, an attention mechanism layer and a convolutional layer to generate a 28 × 28 feature map as the input of the PrimaryCaps layer, outputting 50,176 units. The feature maps are encapsulated into groups of 8 × 1 capsules by the DigitCaps operation and finally encapsulated into 6272 capsules. After adaptive modification of the parameters, the feature capsule layer can express the rich details of the pest image. In the attention mechanism, a self-attention module focuses on the significant features extracted by the three successive convolutional layers while ignoring needless features, which is useful for extracting the crucial features of plant pest images. After the similarity output matrix is obtained, a mask calculation of global attention is carried out with the passing output. The output features of the self-attention module are transferred to the second convolutional layer, where a 3 × 3 convolution kernel further improves the feature description ability of the model. In the PrimaryCaps layer, the input feature map of Conv2 is vectorized; PrimaryCaps is divided into main and minor capsule layers, each comprising 32 capsules, and each capsule is composed of 16 convolution kernels of size 5 × 5. The DigitCaps layer is composed of CNum × 16D vectors, which predict the type of the input image. Routing between capsule layers is realized by LDRA. In model training, the loss function L_loss(Vector_out, R_r) is calculated, and the model parameters are updated by the back-propagation algorithm.
In the convolutional layer, the attention mechanism is used to learn more effective feature maps so that the network pays more attention to foreground regions. The attention block consists of a convolutional layer and a gating mechanism. In this convolutional layer, a convolution kernel V ∈ R^{k×1} is applied to the context features, with window size k (padded when necessary), of every feature c_j (j = 1, 2, ..., n − h + 1) in the feature map C to produce the attention weight matrix A = [a_1, a_2, ..., a_{n−h+1}]^T.
Through the gating of the attention weight matrix A, the output feature maps m_l are obtained by element-wise gating of the convolutional features with the attention weights, where m_l, b ∈ R^{(n−h+1)×1}, b is a bias matrix, and t is the number of convolution kernels used in the attention-gated layer. Kernels with different window sizes k are used to extract different attention weight matrices A_l (l = 1, 2, ..., t). Finally, the output feature map is given as the combination of the gated maps,
where b ∈ R^{(n−h+1)×1} is the bias. The basic operations of the MCapsNet-based pest identification method are as follows: (1) Matrix multiplication of the input vectors by the weight matrix expresses the spatial relations between low-level and high-level features in the image.
(2) The input vectors are weighted, where the weights, determined by the dynamic routing process, decide which higher-level capsule the current capsule sends its output to.
(3) The weighted vectors are summed, as in a general CNN block. (4) The squash function compresses each vector so that its length lies between 0 and 1 while keeping its direction unchanged.

Experiments and results
In this section, eight common crop pests, namely mucilworms, corn borers, moths, caterpillars, ladybugs, aphids, cotton bollworms and flying cicadas, were studied. The images were collected from the experimental base in Baoji City, Shaanxi Province. Nearly 2000 pest images were collected in different periods in a natural field environment using image acquisition devices such as smartphones and cameras and via the Internet of Things, with about 250 images of each pest. Some images from the Internet were also used to supplement the dataset and ensure its completeness. To improve the training efficiency of the subsequent network model, Photoshop was used to crop the images into JPG colour images of 256 × 256 pixels. Examples of the original pest images are shown in Figure 6. The learning rate plays a decisive role in the convergence of the network model's optimization algorithm: if the learning rate is too small, convergence is slow; conversely, if it is too large, the model may not converge to the optimal solution. Generally, the learning rate is large at the beginning of iterative training and becomes smaller as the model gradually converges, so that the model can better converge to a minimum and avoid an increase in loss. During the MCapsNet iterations, the Adam algorithm was used for optimization; a relatively large initial value of 0.9 was selected and attenuated according to an exponential function of the number of completed training passes over the dataset.
Crop pest recognition based on MCapsNet mainly includes three parts: dataset preprocessing, MCapsNet training and pest image classification using the trained MCapsNet. The main process is as follows. (1) Augment the original pest images to expand the dataset.
(2) Normalize each image after augmentation, and then divide all images into a training set and a test set. (3) Train MCapsNet with the training set, calculate the weight update by Adam during each iteration and determine whether the weight update is less than the threshold. If it is, terminate the iteration;
otherwise, keep training. The default threshold is 0.001. (4) Test the average recognition rate of MCapsNet on crop pest images using the test set.
The precision rate (P) and recall rate (R) are usually used to evaluate model performance, with P = TP/(TP + FP) and R = TP/(TP + FN), where, for a given class, TP is the number of pest images correctly classified into that class, FN is the number of images of that class predicted as other classes, FP is the number of images of other classes predicted as that class, and TN is the number of images correctly rejected for that class. Extensive experiments are conducted to evaluate the classification performance of the proposed method. TensorFlow is used as the deep learning framework with the Python 3.7 programming language; the operating system is Windows 10 64-bit, and the hardware environment is an Intel Xeon E5-2643 v3 @ 3.40 GHz CPU with 64 GB memory and an NVIDIA Quadro M4000 GPU.
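The two metrics above reduce to simple ratios of confusion-matrix counts; the sketch below (function name and counts assumed for illustration) computes them for one class.

```python
def precision_recall(tp, fp, fn):
    """P = TP/(TP+FP), R = TP/(TP+FN) from per-class confusion counts;
    returns 0.0 when a denominator would be zero."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

# Hypothetical counts for one pest class: 90 correct detections,
# 10 false alarms, 30 missed instances.
p, r = precision_recall(tp=90, fp=10, fn=30)
```

Precision penalizes false alarms while recall penalizes misses, so reporting both, as Table 2 does, characterizes the classifier more fully than accuracy alone.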
To verify the effectiveness of the proposed algorithm for crop pest detection in the field, the experimental results are compared with those of four other CNN models (ICNN, VGG16, ResNet and CapsNet) on the augmented pest image database. LeakyReLU is used as the activation function to ensure the nonlinear capability of the model. The initial learning rate is 0.01, the momentum factor is 0.9, and the batch size is 128.
After 1200 iterations of the model, the learning rate is 0.001. Figure 7 shows the feature maps produced by MCapsNet. Table 2 gives the recognition results of crop pests by the five methods. From Table 2, it is found that MCapsNet achieves the highest recognition precision and recall rate under the same conditions. The main reason is that LeakyReLU and the attention mechanism are introduced to improve the performance of MCapsNet, reduce the influence of noise on the model and increase its recognition accuracy.

Conclusions
Insect pest recognition is vital for food security, a stable agricultural economy and quality of life. To realize rapid crop insect pest recognition, an MCapsNet-based pest identification method is proposed in this paper. Experiments on crop pest image recognition under complex backgrounds are carried out and compared with ICNN, VGG16, ResNet and CapsNet. The results show that the proposed method better accomplishes the crop pest recognition task in natural fields with complex backgrounds, achieves higher recognition precision and recall rates, and can meet the requirements of crop pest recognition in the field. As a result, the proposed model can be applied in real-world applications and can further motivate research on crop disease identification.