The effect of three novel feature extraction methods on the prediction of the subcellular localization of multi-site virus proteins

ABSTRACT Experimental methods play a crucial role in identifying the subcellular localization of proteins and building high-quality databases. However, more efficient, automated computational methods are required to predict the subcellular localization of proteins on a large scale. Various efficient feature extraction methods have been proposed to predict subcellular localization, but challenges remain. In this paper, three novel feature extraction methods are established to improve multi-site prediction. The first novel feature extraction method utilizes repetitive information via moving windows based on a dipeptide pseudo amino acid composition method (R-Dipeptide). The second novel feature extraction method utilizes the impact of each amino acid residue on its following residues based on pseudo amino acids (I-PseAAC). The third novel feature extraction method provides local information about protein sequences that reflects the strength of the physicochemical properties of residues (PseAAC2). The multi-label k-nearest neighbor algorithm (MLKNN) is used to predict the subcellular localization of multi-site virus proteins. The best overall accuracy values of R-Dipeptide, I-PseAAC, and PseAAC2 when applied to dataset S from Virus-mPloc are 59.92%, 59.13%, and 57.94% respectively.


Introduction
Knowledge about the subcellular localization of proteins is critical for understanding their functions and biological processes in cells. 1 High-quality databases of information on the subcellular localization of proteins are informed by wet laboratory experiments. However, such experiments are time-consuming, costly and laborious. 2 Experimental methods for handling proteins on a large scale have become increasingly difficult. It is necessary to develop effective computational methods to analyze subcellular localization. 3 The web servers proposed to identify the subcellular localization of proteins based on their sequence information can be classified into two series. 4 One is the "PLoc" series, and the other is the "iLoc" series. The "PLoc" series includes six web servers to handle eukaryotic, plant, human, gram-negative bacterial, gram-positive bacterial, and viral proteins, while the "iLoc" series includes seven web servers to handle eukaryotic, plant, human, animal, gram-negative bacterial, gram-positive bacterial, and viral proteins. 5 Many studies have indicated that greater progress in prediction systems is obtained by developing feature extraction methods than by improving the classifiers. 6,7 In recent years, a wide range of feature extraction methods have been proposed to improve the performance of prediction: (1) amino acid composition (AAC) methods; [8][9][10][11] (2) homology-based methods; 7,12 (3) sorting signal-based methods; [13][14] and (4) pseudo amino acid-based feature methods (PseAAC). [15][16][17] All these methods have shown good performance but could be improved. AAC methods lack location information; homology-based methods are not suitable for low-homology protein sequences; and PseAAC can reflect some of the effects of sequence order but lacks the impact of each residue on the subsequent residues. Therefore, three feature extraction methods are proposed to improve the performance of multi-site prediction.
The three novel feature extraction methods proposed in this study are called R-Dipeptide, I-PseAAC and PseAAC2. Inspired by the long short-term memory with attention mechanism (A-LSTM), 18 R-Dipeptide focuses on using repetitive information. First, the spacing between two windows is set by the user, often to a small number. In this study, the spacing is one. Then, two better protein sub-sequences are selected according to the prediction results and combined. This method makes up for the lack of extraction of key information by PseAAC. I-PseAAC computes the impact of each amino acid residue on the subsequent residues. This method offers global order information, rather than the local order information provided by PseAAC. PseAAC2 focuses on location information. This method not only offers global order information but also adds the relative strengths of the residues, whereas PseAAC lacks information on the relative strengths of residues.

Dataset
Dataset S, constructed by Shen in establishing Virus-mPloc, is the benchmark dataset for the study. 19 Dataset S offers three advantages. (1) The dataset is specialized for virus proteins. (2) None of the proteins included in S has ≥25% pairwise sequence identity to any other protein in the same location. (3) The dataset includes proteins with more than one location and thus can be utilized to address the subcellular localization of multi-site virus proteins. 20 Dataset S includes 207 virus protein sequences, of which 165 belong to one subcellular location, 39 to two locations, and 3 to three locations. 20 The dataset is classified into 6 subcellular locations, 21 as expressed in Eq. 1:

$$S = S_1 \cup S_2 \cup S_3 \cup S_4 \cup S_5 \cup S_6 \tag{1}$$

where $S_1$ represents the subset for the subcellular location "viral capsid", $S_2$ the subset for "host cell membrane", and so forth (Table 1), while $\cup$ denotes "union" in set theory. 21 Here, locative proteins and different proteins are briefly described as follows. The number of locative proteins is given by Eq. 2:

$$N(\text{locative}) = N(\text{different}) + \sum_{m} (m - 1)\,N(m) \tag{2}$$

where $N(\text{locative})$ represents the number of locative proteins and $N(\text{different})$ represents the number of different proteins. Here, $m$ is the number of locations where the specific protein is identified, and $N(m)$ is the number of proteins that are identified in $m$ locations.
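As a worked check of the counts quoted above, the following minimal sketch computes the number of different and locative proteins. It assumes that Eq. 2 has the form N(locative) = N(different) + Σ (m − 1) N(m); the function name `count_locative` is hypothetical.

```python
# N(m) from the dataset description: proteins found in exactly m locations.
proteins_by_location_count = {1: 165, 2: 39, 3: 3}


def count_locative(n_by_m):
    """Return (N_different, N_locative) under the assumed form of Eq. 2."""
    n_different = sum(n_by_m.values())
    # A protein with m locations adds (m - 1) extra locative entries.
    n_locative = n_different + sum((m - 1) * n for m, n in n_by_m.items())
    return n_different, n_locative


n_different, n_locative = count_locative(proteins_by_location_count)
print(n_different, n_locative)  # 207 252
```

Under this assumption, the 207 different proteins correspond to 252 locative proteins.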

R-Dipeptide
R-Dipeptide utilizes repetitive information via moving windows based on a dipeptide pseudo amino acid composition method. First, the number of each amino acid residue in every protein sequence is calculated in Eq. 3. Then, the number of residues is normalized in Eq. 4.
In Eq. 3, $v_i$ is the number of residues of the $i$-th type in every protein sequence. The normalization in Eq. 4 is

$$v_i^{*} = \frac{v_i - \mu}{\sigma} \tag{4}$$

where $v_i^{*}$ is the normalized value of $v_i$, $\mu$ denotes the mean of $v_i$, and $\sigma$ represents the standard deviation of $v_i$.
Second, the spacing between two windows is set to one, and the window size is set to thirty. The subsequence of the first group is $\{R_1, R_2, \ldots, R_{30}\}$, the subsequence of the second group is $\{R_2, R_3, \ldots, R_{31}\}$, and so forth. For the last residue ($R_L$), $L$ is smaller than the minimum length of all protein sequences. Then, two improved protein sub-sequences are combined to create a new database based on the prediction results. The new database contains important repetitive information that contributes to the prediction of subcellular localization.
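The windowing and normalization steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that the mean and standard deviation are taken over the 20 residue counts of one sub-sequence, and the function names and the toy sequence are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def sliding_windows(sequence, window_size=30, spacing=1):
    """Return the sub-sequences covered by a window moved along the sequence."""
    last_start = len(sequence) - window_size
    return [sequence[s:s + window_size]
            for s in range(0, last_start + 1, spacing)]


def residue_counts(sequence):
    """Count each of the 20 residue types (v_i in the text)."""
    return np.array([sequence.count(a) for a in AMINO_ACIDS], dtype=float)


def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    sigma = values.std()
    if sigma == 0:
        return np.zeros_like(values)
    return (values - values.mean()) / sigma


toy_sequence = "MKTAYIAKQR" * 4         # hypothetical 40-residue sequence
groups = sliding_windows(toy_sequence)  # window starts 0..10
features = [standardize(residue_counts(g)) for g in groups]
print(len(groups), len(features[0]))
```

In the study the windows would stop at a common position below the shortest sequence length, so that every protein contributes to every group; the sketch simply stops at the end of the single toy sequence.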
Lastly, a dipeptide pseudo amino acid composition method (Dipeptide) is used for the new database. Dipeptide will generate 400 components, i.e., AA, AC, AD, …, YV, YW, and YY. These 400 components are calculated for every protein sequence and then subjected to a standard conversion.
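A minimal sketch of the 400-component dipeptide composition is given below. It assumes that each component is the relative frequency of one ordered residue pair and leaves out the subsequent standard conversion; the names `DIPEPTIDES` and `dipeptide_composition` are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# All ordered pairs in alphabetical order: AA, AC, AD, ..., YV, YW, YY.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the relative frequency of each of the 400 dipeptides."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[d] / total if total else 0.0 for d in DIPEPTIDES]


vector = dipeptide_composition("ACAC")  # pairs: AC, CA, AC
print(len(vector))  # 400
```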

I-PseAAC
PseAAC was proposed by Chou to avoid losing the sequence order information of proteins. 23 A protein ($P$) including $L$ amino acid residues can be described by Eq. 5:

$$P = R_1, R_2, R_3, \ldots, R_L \tag{5}$$

where $R_1$ is the first residue of the protein sequence $P$, $R_2$ is the second residue of the protein sequence $P$, and so forth. The sequence order information can be represented by Eq. 6.
$$\cdots \qquad (u = 1, 2, \ldots, n \ \text{and}\ n < L) \tag{6}$$

where $\delta_u$ is the $u$-th correlation factor, which provides the sequence order information between the $u$-th most contiguous residues. $V(R_i, R_{i+1})$ can be described by Eq. 7, where $H_1(R_i)$, $H_2(R_i)$, $Pk_1(R_i)$, $Pk_2(R_i)$, $PI(R_i)$, and $M(R_i)$ denote the hydrophobicity value, the hydrophilicity value, Pk1(-COOH), Pk2(-NH3), PI, and the mass value of the amino acid residue $R_i$, respectively. All physicochemical properties should be normalized before being used in the calculation of Eq. 7.

PseAAC2
In contrast to PseAAC and I-PseAAC, PseAAC2 provides a different kind of local information to reflect the strength of the physicochemical properties of residues, as described in Eq. 8 and Eq. 9.

MLKNN
MLKNN is a multi-label classifier that uses the k-nearest neighbor algorithm to collect the label information of neighboring samples and applies the maximum a posteriori principle to infer the label set of an unseen sample. 21,24 MLKNN can be described by Eq. 10 and Eq. 11, where $N(x)$ is the set of nearest neighbors of $x$ and $C_j$ represents the number of neighbors in $N(x)$ that belong to class $y_j$. 21 Eq. 11 defines the predicted label set $h(x)$, where $H_j$ denotes the event that $x$ has the category $y_j$, and $P(H_j \mid y_j)$ denotes the posterior probability of $H_j$ given that $N(x)$ contains $C_j$ samples in the category $y_j$.
Fig. 1(b) and Fig. 1(c). The impact of each residue on the subsequent residues.
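The neighbor counting and maximum a posteriori decision described for MLKNN can be sketched as follows. This follows the widely used ML-KNN formulation and may differ in detail from Eq. 10 and Eq. 11; the Euclidean distance, the Laplace smoothing, the function names, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def mlknn_fit(X, Y, k=3, smooth=1.0):
    """Estimate label priors and neighbor-count likelihoods from training data."""
    n, q = Y.shape
    prior = (smooth + Y.sum(axis=0)) / (2 * smooth + n)  # P(H_j)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    C = Y[neighbors].sum(axis=1)  # C[i, j]: neighbors of i carrying label j
    with_label = np.zeros((k + 1, q))
    without_label = np.zeros((k + 1, q))
    for c in range(k + 1):
        with_label[c] = ((C == c) & (Y == 1)).sum(axis=0)
        without_label[c] = ((C == c) & (Y == 0)).sum(axis=0)
    like = (smooth + with_label) / (smooth * (k + 1) + with_label.sum(axis=0))
    like_not = (smooth + without_label) / (
        smooth * (k + 1) + without_label.sum(axis=0))
    return {"X": X, "Y": Y, "k": k, "prior": prior,
            "like": like, "like_not": like_not}


def mlknn_predict(model, x):
    """Return a 0/1 vector: label j is kept if H_j is the more probable event."""
    dist = np.linalg.norm(model["X"] - x, axis=1)
    neighbors = np.argsort(dist)[:model["k"]]
    C = model["Y"][neighbors].sum(axis=0).astype(int)  # C_j for each label
    cols = np.arange(model["Y"].shape[1])
    p_yes = model["prior"] * model["like"][C, cols]
    p_no = (1 - model["prior"]) * model["like_not"][C, cols]
    return (p_yes > p_no).astype(int)


# Hypothetical toy data: two well-separated clusters, one label each.
X_train = np.array([[0, 0], [0, 1], [1, 0], [0.5, 0.5],
                    [10, 10], [10, 11], [11, 10], [10.5, 10.5]], dtype=float)
Y_train = np.array([[1, 0]] * 4 + [[0, 1]] * 4)
model = mlknn_fit(X_train, Y_train, k=3)
print(mlknn_predict(model, np.array([0.2, 0.2])))
```

Comparing the two unnormalized posteriors is equivalent to thresholding the posterior probability of H_j at 0.5.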

Evaluation
To provide a more intuitive and easier-to-understand measurement, a new scale, the so-called "absolute true" overall accuracy, 20 reflecting the accuracy of a predictor, is given in Eq. 12:

$$\Lambda = \frac{1}{N} \sum_{i=1}^{N} \Delta(i) \tag{12}$$

where $\Lambda$ represents the absolute true rate, $N$ represents the total number of proteins investigated, and $\Delta(i) = 1$ or $\Delta(i) = 0$. All subcellular locations of the $i$-th protein are tested. If every subcellular location of the $i$-th protein is correctly predicted, $\Delta(i) = 1$; otherwise, $\Delta(i) = 0$.
Therefore, the absolute true scale is much stricter than the scale used previously to measure the overall accuracy.
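The absolute true scale can be illustrated with a short sketch. It assumes that a protein counts as correct only when its predicted location set equals its true location set exactly; the function name and the toy labels (S1, S2, S3) are hypothetical.

```python
def absolute_true_rate(true_sets, predicted_sets):
    """Fraction of proteins whose predicted location set matches exactly."""
    if len(true_sets) != len(predicted_sets):
        raise ValueError("both inputs must describe the same proteins")
    hits = sum(1 for t, p in zip(true_sets, predicted_sets) if set(t) == set(p))
    return hits / len(true_sets)


true_locations = [{"S1"}, {"S2", "S3"}, {"S3"}]
predicted_locations = [{"S1"}, {"S2"}, {"S3"}]  # second protein is incomplete
rate = absolute_true_rate(true_locations, predicted_locations)
print(round(rate, 4))  # 0.6667
```

The second protein scores zero even though one of its two locations is predicted correctly, which is what makes this scale strict.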
In addition, a series of other evaluation functions are applied to evaluate the prediction performance. 22
HammingLoss: HammingLoss measures how often an instance-label pair is misclassified. A lower value of HammingLoss represents better algorithm performance.
RankingLoss: $C_i$ is the collection of labels with a value of one, denoted by labels-one, and $\bar{C}_i$ is the collection of labels with a value of zero, denoted by labels-zero. If the predicted labels of an instance are completely correct, every output value in labels-one should be higher than every output value in labels-zero. RankingLoss measures how often this ordering is violated, that is, how often a label in labels-zero is ranked above a label in labels-one. A lower value of RankingLoss indicates better algorithm performance.
One_error: One_error measures how often the top-ranked label is not in the true label set of the instance. A lower value of One_error represents better algorithm performance.
Coverage: Coverage measures how far down the ranked label list it is necessary to go to cover all true labels of an instance. A lower value of Coverage indicates better algorithm performance.
Average_Precision: Average_Precision measures the average fraction of labels ranked at or above a given true label that are themselves true labels. A higher value of Average_Precision represents better algorithm performance.
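The five evaluation functions can be sketched with NumPy as follows. These are the common multi-label definitions and may differ in detail from Eqs. 13-17; the function names and toy values are hypothetical, ties are counted as ranking errors, and every instance is assumed to have at least one true label.

```python
import numpy as np


def label_ranks(scores):
    """Rank labels per instance; rank 1 is the highest score."""
    return (-scores).argsort(axis=1).argsort(axis=1) + 1


def hamming_loss(Y_true, Y_pred):
    return float(np.mean(Y_true != Y_pred))


def one_error(Y_true, scores):
    top = scores.argmax(axis=1)
    return float(np.mean(Y_true[np.arange(len(Y_true)), top] == 0))


def coverage(Y_true, scores):
    deepest = np.where(Y_true == 1, label_ranks(scores), 0).max(axis=1)
    return float(np.mean(deepest - 1))


def ranking_loss(Y_true, scores):
    losses = []
    for y, s in zip(Y_true, scores):
        pos, neg = s[y == 1], s[y == 0]
        if len(pos) == 0 or len(neg) == 0:
            continue  # no label pair to compare
        wrong = (pos[:, None] <= neg[None, :]).sum()
        losses.append(wrong / (len(pos) * len(neg)))
    return float(np.mean(losses))


def average_precision(Y_true, scores):
    values = []
    for y, r in zip(Y_true, label_ranks(scores)):
        rel = r[y == 1]  # ranks of the true labels
        values.append(np.mean([(rel <= ri).sum() / ri for ri in rel]))
    return float(np.mean(values))


Y_true = np.array([[1, 0, 0], [0, 1, 1]])
scores = np.array([[0.9, 0.2, 0.1], [0.6, 0.8, 0.3]])
Y_pred = (scores >= 0.5).astype(int)
print(hamming_loss(Y_true, Y_pred), one_error(Y_true, scores),
      coverage(Y_true, scores), ranking_loss(Y_true, scores),
      average_precision(Y_true, scores))
```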

Results
In this study, the spacing between two windows is set to 1, and the window size is set to 30. The database is divided into 24 groups: (0,30) is the first group, (1,30) is the second group, and so forth. The number of each amino acid residue in every group is calculated by Eq. 3 and Eq. 4. The overall accuracy of each group is shown in Table 2. The overall accuracy of the original database is 55.16%, while the best overall accuracy of the groups is 57.54%.
As shown in Table 3, the two methods AAC and Dipeptide give better results when applied to the new database than when applied to the original database. The original database contains redundant information, which limits the performance of both methods. The new database utilizes the repetitive information from sub-sequences. This approach is equivalent to increasing the weight of key residues.
Six physicochemical properties are used in the PseAAC2 and I-PseAAC methods: the hydrophobicity, hydrophilicity, Pk1(-COOH), Pk2(-NH3), PI and mass values of each amino acid residue, as described in Table 4.
The three novel feature extraction methods are compared with PseAAC. 23 Group 5 and group 7 are combined to create a new database, and the four feature extraction methods are applied to the new database to identify the subcellular localization of multi-site virus proteins by MLKNN. The results of the PseAAC method are obtained via a web server called PseAAC at http://www.csbio.sjtu.edu.cn/bioinf/PseAAC/#. The weight factor is 0.05, and the Lambda parameter is 40.
As shown in Table 5, the three novel feature extraction methods show superior performance, achieving 59.92%, 59.13%, and 57.94% accuracy, respectively, with the MLKNN algorithm. The PseAAC method achieves 57.14% accuracy with the MLKNN algorithm. Thus, the three novel feature extraction methods improve the performance of multi-site prediction.
As shown in Table 6, the number of correct predictions of every subcellular location is calculated by Eq. 12. The overall accuracy is the sum of the correct predictions.
To simplify the representation of the evaluation functions, Average_Precision is denoted by A, Coverage is denoted by C, HammingLoss is denoted by H, One_error is denoted by O, and RankingLoss is denoted by R. The calculation details of the five evaluation functions are described in Eqs. 13-17. The feature extraction method is denoted by FEM.
As shown in Table 7, R-Dipeptide, I-PseAAC, and PseAAC2 all show better overall performance than PseAAC.

Conclusion and discussion
In this study, three novel feature extraction methods are proposed to improve the performance of multi-site prediction. In experimental comparisons, the R-Dipeptide, I-PseAAC, and PseAAC2 methods achieve higher accuracy rates for the MLKNN algorithm than does the PseAAC method. Thus, repetitive information, the impact of each residue on subsequent residues, and local information are critical for the performance of multi-site prediction.
The advantage of R-Dipeptide is the extraction of key information using the repetitive information method. Key information is usually extracted by adjusting weights in the algorithm. For a large-scale dataset, weight adjustment is an effective method for the extraction of key information. However, if the dataset is limited in scale, the repetitive information method is better than the weight adjustment method.
The advantage of I-PseAAC is that it can reflect the difference in physicochemical properties between each amino acid residue and the subsequent residues. In addition, I-PseAAC provides global information on the residues. The disadvantage is that the difference between the i-th residue and the j-th residue may be the same as the difference between the i-th residue and the k-th residue. For example, two kinds of physicochemical properties are denoted by A and B, respectively. The difference in A between the i-th residue and the subsequent j-th residue is 0.2, and the difference in B is −0.2. The difference between the i-th residue and the subsequent k-th residue in A is 0.3, and the difference in B is −0.3. In both cases the differences cancel, so I-PseAAC sees no difference between the j-th residue and the k-th residue.
The advantage of PseAAC2 is that it amplifies the differences in the physicochemical properties of different residues by providing another source of local information about protein sequences. The disadvantage is the difficulty of choosing a set of representative physicochemical properties. A physicochemical property is representative if its values differ across residues. If some of the residues have the same physicochemical property values, the performance of PseAAC2 will decline.
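The cancellation in the I-PseAAC example can be checked numerically. The sketch below assumes that the per-property differences are simply summed, which is how the example reads; the exact aggregation used by I-PseAAC is not restated here, and the function name is hypothetical.

```python
def summed_difference(differences):
    """Aggregate signed per-property differences by summation (assumed)."""
    return sum(differences)


# Differences in properties A and B, taken from the example in the text.
i_to_j = {"A": 0.2, "B": -0.2}
i_to_k = {"A": 0.3, "B": -0.3}

score_j = summed_difference(i_to_j.values())
score_k = summed_difference(i_to_k.values())
print(score_j, score_k, score_j == score_k)
```

Both aggregated values are zero, so the j-th and k-th residues are indistinguishable under this assumption even though their individual property differences are not the same.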
The three novel feature extraction methods have shown good performance but can still be improved. The first question is how to set an appropriate window size and spacing between two windows. If the window is too small, important information will be lost and a large number of groups will be generated. If the window is too large, too much redundant information will be generated. If the spacing between two windows is too large, repetitive information will be lost. In addition, groups can be combined in a variety of ways, such as adjacent groups (group 4, group 5), interval groups (group 4, group 7), or more than two groups (group 4, group 5, group 7). Our future studies will focus on these questions with regard to subcellular localization.
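The trade-off between window size, spacing, and the number of groups can be made concrete with a small sketch. The covered length of 53 residues is a hypothetical value chosen so that a window of 30 with spacing 1 yields 24 groups; it is an inference, not a figure stated in the study, and the function name is also hypothetical.

```python
def count_groups(covered_length, window_size, spacing):
    """Number of windows that fit when sliding over covered_length residues."""
    if covered_length < window_size:
        return 0
    return (covered_length - window_size) // spacing + 1


for window_size, spacing in [(30, 1), (10, 1), (30, 5)]:
    print(window_size, spacing, count_groups(53, window_size, spacing))
```

A smaller window increases the number of groups, while a larger spacing reduces both the number of groups and the overlap between neighboring windows.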

Disclosure of potential conflicts of interest
No potential conflicts of interest were disclosed.