Predicting algorithm of attC site based on combination optimization strategy

Site-specific recombination systems are widely used as bioengineering tools. However, the traditional site-specific recombination system requires a consensus sequence for the specific site. Such sequence-level constraints limit effective recombination between sites. Therefore, in order to develop an efficient site-specific recombination system, we investigated the attC site of the bacterial integration subsystem and built a predictive model to infer the important features that contribute to recombination. Here, we design an attC site prediction algorithm based on a combination optimisation strategy. Based on the structural features of attC sites, the prediction algorithm realises the high-precision prediction of the recombination frequencies between sites and the screening of the top 20 important features that play a role in recombination, which are effective for improving the design method of attC sites. The algorithm has better portability and higher prediction accuracy compared with the existing advanced algorithms, among which the Pearson correlation coefficient is 0.87, explained variance score is 0.73, root mean square error is 0.006 and mean absolute error is 0.041. This can not only provide ideas for the research of efficient recombination systems but also provide a theoretical basis for developing genetic engineering further.


Introduction
Gene recombination is the process by which organisms use recombinases to recombine different genes and produce individuals with new genotypes. It is widespread in prokaryotes and is important for maintaining genetic diversity and promoting evolution (Epum & Haber, 2022). Common types of recombination include homologous recombination, translocation recombination and site-specific recombination. With the development of site-specific recombination systems, site-specific recombination technology has been used extensively in genetic engineering, especially in higher eukaryotes (Bessen et al., 2019; Häcker et al., 2017). Site-specific recombination refers to the integration, excision and transformation of DNA fragments between specific sites, which is catalysed by integron integrase. This type of recombination is associated with specific DNA sequences in bacteriophages and bacteria, and the reaction always involves two specific DNA sites. However, these two sites usually have very similar or even identical DNA sequences, and such sequence-level constraints restrict efficient recombination between them (Tian & Zhou, 2021). Therefore, to overcome the problem of sequence constraints, it is necessary to study the structure of the specific recombination sites in the recombination system.
This study is based on the bacterial integration subsystem. The bacterial integration system is an important application of site-specific recombination: it captures foreign gene cassettes and converts them into functional gene expression units by site-specific recombination (Domingues et al., 2012). Through the recombination of DNA fragments between sites, bacteria can acquire properties that benefit them, such as increased resistance. Studies have shown that integron-promoted horizontal gene transfer, which enables bacteria to acquire foreign drug resistance genes, is an important reason for the accelerated spread of clinically relevant Gram-negative pathogens (Mazel et al., 2015; Stalder et al., 2012). The function of the integron depends on the activity of the integrase (Nivina et al., 2016; Weiss et al., 2019), a site-specific tyrosine recombinase with the ability to excise, integrate, invert and translocate DNA fragments in organisms. Gene cassettes are usually mobile elements that carry a single gene and an associated recombination site (attC), and many gene cassettes carry antibiotic resistance genes. Gene cassettes can exist independently in free circular form, or they can be recognised by integrases and become part of the integron structure through site-specific recombination (Ghaly et al., 2021; Vit et al., 2020).
In the integron system, the excision of a gene cassette occurs between two adjacent attC sites, and the insertion of a gene cassette is mainly achieved through recombination between the attI site on the integron platform and the attC site on the free gene cassette (Mukhortava et al., 2019). The bottom strand of the attC site folds into a hairpin-like structure and is recombined as a single strand, and the conserved folded structure ensures the specificity of the reaction. Previous research has shown that the tyrosine recombinase has high sequence homology requirements for recombining attI sites, but it can efficiently recombine attC sites with highly variable sequence and structure, and the recombination of attC sites is highly dependent on their structural features (Olorunniji et al., 2016). Meanwhile, attC sites appear to have very few sequence-level constraints. The attC site allows the cassette to be inserted at the integron-associated recombination site (attI), and recombination mainly occurs between the A and C in the consensus sequence 5'-RYYYAAC-3' of the R box (Larouche & Roy, 2011). Investigations revealed that integrase binding and recombination are mainly driven by the structural features of the stem-loop at the attC site: the extrahelical bases (EHBs), the unpaired central spacer (UCS) and the variable terminal structure (VTS) (Frumerie et al., 2010; Grieb et al., 2017). These favourable properties make attC sites key to developing efficient site-specific recombination systems. Therefore, studying the relationship between the function and structure of attC sites and identifying the features that contribute to recombination can not only suggest new ideas for the synthesis of efficient attC sites but also help in understanding bacterial evolution and the spread of antibiotic resistance.
This paper proposes an attC site prediction algorithm based on a combination optimisation strategy for the attC site of the bacterial integration subsystem. By analysing and quantifying the structural characteristics of attC sites, three regression algorithms are used to establish prediction models, which improves the accuracy of prediction and avoids the shortcomings of a single model. Meanwhile, by visualising the correlation between structure and function, we can find previously unknown important features that contribute to recombination, which plays a key role in improving the design method for synthetic sites and increasing the recombination rate. Compared with four traditional regression algorithms, namely Decision Tree Regression (Lewis, 2000), Ridge Regression (Hoerl & Kennard, 2000), Support Vector Regression (Klopfenstein & Vaiter, 2021) and Random Forest Regression (Pellegrino et al., 2021), the proposed algorithm achieves better scores on the evaluation indicators. More importantly, the algorithm is highly flexible and extensible.
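As an illustration of the R-box consensus mentioned above, the following minimal sketch searches a sequence for 5'-RYYYAAC-3' using the standard IUPAC codes (R = A or G, Y = C or T). The function name `find_r_boxes` and the example sequence are hypothetical and are not taken from the original study.

```python
import re

# IUPAC nucleotide codes: R = purine (A or G), Y = pyrimidine (C or T).
# The pattern below is the consensus 5'-RYYYAAC-3' written as a regex.
R_BOX = re.compile(r"[AG][CT]{3}AAC")

def find_r_boxes(seq):
    """Return (start, matched_text, c_index) for each non-overlapping match.

    c_index is the position of the final C; the recombination point
    described in the text lies between the preceding A and this C.
    """
    hits = []
    for m in R_BOX.finditer(seq.upper()):
        hits.append((m.start(), m.group(), m.end() - 1))
    return hits

example = "ttGTTTAACgg"  # made-up sequence, not a real attC site
print(find_r_boxes(example))  # → [(2, 'GTTTAAC', 8)]
```

Such a sequence-level scan only locates candidate R boxes; as the text stresses, recombination of attC sites depends mainly on structural features rather than on sequence.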

Related works
With the rapid development of gene sequencing, traditional biochemical experiments cannot meet the needs of massive gene data. The application of computer technology in biology has shown strong performance advantages, and machine learning has been applied in various fields of biology (Wang et al., 2021; Wei et al., 2021). Research on the biological significance and value behind the data is also a current research hotspot. Machine learning acquires knowledge from data and experience. It can not only extract knowledge from massive biological data and make data-based predictions, but also continuously improve its performance during learning (Cao et al., 2020; Libbrecht & Noble, 2015). Therefore, applying machine learning to site prediction is an effective way to improve testing efficiency. At present, the typical machine learning algorithms applied in this field include Random forest, hidden Markov models and logistic regression.
With the development of synthetic biology, strategies and tools for rapidly constructing new biochemical pathways will become more and more valuable (Doudna, 2020). Bikard et al. proposed a synthetic integron for genetic recombination in vivo, providing a way to construct and optimise metabolic pathways using the inherent gene recombination activity of integrons, the natural bacterial site-specific recombination system. This method demonstrated for the first time that site-specific recombination can efficiently generate a large number of gene combinations and arrangements in vivo (Bikard et al., 2010). This indicates that studying the structure of attC sites and synthesising attC sites that recombine efficiently will effectively improve the integration subsystem.
Pereira et al. presented a new method HattCI based on Hidden Markov Model (HMM) to quickly and accurately identify attC sites in DNA sequence data. This model describes each core component of an attC site separately, that is, the attC site can be directly identified in the fragment data without any additional information about the integrated substructure. The results of cross-validation showed that HattCI achieved a high sensitivity of up to 91.9% while maintaining a satisfactory false positive rate (FPR) (Pereira et al., 2016).
Meanwhile, there are many programs that can recognise attC sites. The XXR program uses pattern matching to recognise attC sites in Vibrio vulgaris integrons (Rowe-Magnus et al., 2003), and the programs ACID (Joss et al., 2009) and Attacca (Tsafnat et al., 2009) are designed to search for class 1 to class 3 mobile integrons. However, these identification programs, which are based on sequence conservation, can only identify attC sites in a limited set of integron classes and have greater difficulty in identifying distantly related attC sites with highly divergent sequences.
The identification program IntegronFinder proposed by Cury et al. aims to accurately identify any integrase and attC sites belonging to integrons. It uses 291 manually curated attC sites to build a covariance model for attC sites and uses this model to search for attC sites in 2484 complete bacterial genomes. IntegronFinder achieved 96% sensitivity while preserving maximum attC site diversity (Cury et al., 2016).
Nivina et al. designed a structure-specific recombination system based on attC sites and constructed a large-scale mutation library for the attC r0 site, which has a low recombination rate. By analysing and quantifying the structural features of the sites in the library, they implemented a Random forest-based approach that achieved high prediction accuracy for attC site recombination. This model is also applicable to other specific sites and other genetic elements based on sequence characteristics, so it has good portability. However, the method relies on a single model, so its predictions may be biased.
With the further development of genome research, a growing number of genomes have been completely sequenced, and the prediction of DNA recombination sites has become a significant goal of computational analysis of biological information. Among these sites, the attC site is an important element in the realisation of site-specific recombination systems. Studies have shown that this site is closely related to gene therapy and drug development. Therefore, accurate prediction of the attC site can provide theoretical support for the treatment of certain diseases. Traditional biochemical experiments cannot cope with the processing of large amounts of biological data, whereas the advantages of combining biological information with computer technology have become more and more obvious. A prediction model of attC sites built from biological data can not only support the development of recombination systems, but also be applied to other research fields, with high flexibility and portability. Reviewing previous studies, we found that most of them identify attC sites in DNA sequences and ignore the prediction of their recombination rates. High-precision prediction of the recombination rate based on the characteristics of the attC site is of great significance for establishing an efficient recombination system.
Previous research has shown that the attC site has highly conserved structural features, which can be used to identify sites with highly variable sequences (Bouvier et al., 2009; Lacotte et al., 2017). Therefore, building prediction experiments on structural feature data is an effective approach. However, the predictions of attC sites defined by traditional research methods differ from one another. Most of them suffer from problems such as relying on a single feature, unrepresentative features, a single model and complex extraction of biological data. At the same time, a large amount of feature data not only wastes time, but erroneous features can also strongly affect the experimental results. Therefore, using machine learning and deep learning comprehensively to improve prediction performance on the above problems has great development potential.

Prediction of attC site
The attC site is the main site for site-specific recombination, and its bottom strand folds into a hairpin-like structure and is recombined as a single strand (Figure 1). The attC site can mediate the insertion and excision of gene cassettes under the catalysis of integrase. The excision of a gene cassette mainly occurs between two adjacent attC sites on the integrative system, and the insertion of a gene cassette involves the attC site on the gene cassette and the attI site on the integrative system (Figure 2).
At present, research on site prediction is developing rapidly, and there are mainly two approaches. The first is to predict the site through biochemical experiments, and the second is to predict the site quickly through bioinformatics combined with computational techniques. However, the prediction of recombination sites in bacterial integration subsystems is currently mostly limited to biochemical experiments and a small number of bioinformatics methods, which are costly in both money and time. Consequently, this paper targets the attC site of the bacterial integration subsystem, studies the correlation between its structure and function, and achieves accurate and efficient prediction of the attC site, which is a useful supplement to the existing experimental methods.

Data sources for attC sites
In order to determine which structural features contribute to a high recombination rate of the attC site, we analysed the attC r0 mutant library. The library includes all sequences with a single mutation in the constant region of the attC r0 site and the sequences containing all possible combinations of two mutations. This paper selects the attC site sequences containing two mutations as the initial data set. The data set contains 12879 attC r0 mutants and 292 structural features, including 9 global features and 283 basic features. Meanwhile, we employ five-fold cross-validation, which randomly selects a validation set and a training set from the data set at a ratio of 1:4 during each training session.
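The five-fold split described above can be sketched as follows. This is a minimal standard-library illustration under the assumption of a simple shuffled split; the function name `k_fold_indices` and the seed are hypothetical, and the original study may have used a library routine instead.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds.

    Indices are shuffled once; every fold serves as the validation set
    exactly once, giving a validation:training ratio of about 1:(k-1).
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, val

# 12879 is the number of attC r0 mutants stated in the text.
splits = list(k_fold_indices(12879, k=5))
for train, val in splits:
    print(len(train), len(val))
```

With 12879 samples, each validation fold holds 2575 or 2576 sites and the remaining four folds form the training set.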

Data preprocessing and segmentation
Data preprocessing refers to the processing applied to the data before the main processing. Since the quality of the data set directly affects the experimental results, it is particularly important to preprocess the data before training the model. First, the invalid features in the data set are filtered out and removed, including features whose values are all 0, features with zero variance and features with low variance, where a low-variance feature is one in which more than 80% of the values are identical. These features have little influence on the results and are not representative. The variance judgment formula is shown in formula (1). A total of 14 features with zero variance are removed here (Table 1).
Second, the remaining features are standardised and linearly normalised so that the value of each feature is scaled to [0,1]. The standardisation formula and the normalisation formula are shown in formula (2) and formula (3) respectively:

$$z = \frac{x - \mu}{\sigma} \qquad (2)$$

$$x^{*} = \frac{x - X_{\min}}{X_{\max} - X_{\min}} \qquad (3)$$

Among them, μ is the average value of a single feature over its m values, σ is the standard deviation of a single feature, and X_min and X_max are the minimum and maximum values of that feature.
Finally, an oversampling operation is performed to construct a balanced data set. By defining a threshold on the recombination rate (0.46), positive sites and negative sites were screened out: sites with a recombination rate greater than the threshold are marked as 1 (positive sample) and sites with a recombination rate less than the threshold are marked as 0 (negative sample). A total of 1762 positive samples and 11117 negative samples were obtained. Then, we performed sampling with replacement on the positive samples, resulting in a total of 11117 positive samples, which together with the 11117 negative samples constitute a balanced data set of 22234 samples. At this point, a class label column is added to the data set to record the class of each site (1 for positive sites and 0 for negative sites).
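The filtering rule above can be sketched in a few lines. This is a minimal illustration using the population variance and the 80% rule stated in the text; the function names and the toy columns are hypothetical and are not data from the mutant library.

```python
from collections import Counter

def keep_feature(values, max_dominant_share=0.8):
    """Return False for zero-variance columns (including all-zero ones)
    and for columns where a single value makes up more than
    max_dominant_share of the entries."""
    m = len(values)
    mean = sum(values) / m
    variance = sum((x - mean) ** 2 for x in values) / m
    if variance == 0:
        return False
    dominant_count = Counter(values).most_common(1)[0][1]
    return dominant_count / m <= max_dominant_share

def filter_features(columns):
    """Keep only the informative columns of a {name: values} mapping."""
    return {name: col for name, col in columns.items() if keep_feature(col)}

# Toy columns for illustration only.
toy = {
    "all_zero": [0] * 10,
    "constant": [3] * 10,
    "mostly_same": [1] * 9 + [0],          # 90% identical values
    "informative": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
}
kept = filter_features(toy)
print(list(kept))  # → ['informative']
```

The exact comparison `variance == 0` is adequate for this toy data; for floating-point features a small tolerance would be a safer assumption.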
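The scaling, labelling and oversampling steps described above can be sketched as follows. This is a minimal standard-library illustration; the function names, the seed and the toy recombination rates are hypothetical, and treating a rate exactly equal to the threshold as negative is an assumption, since the text does not cover that case.

```python
import random
import statistics

def standardise(values):
    """Z-score: subtract the mean and divide by the standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # assumes a non-constant feature
    return [(x - mu) / sigma for x in values]

def min_max(values):
    """Linear normalisation of a non-constant feature to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def label(rates, threshold=0.46):
    """1 for rates above the threshold, 0 otherwise (assumption for ties)."""
    return [1 if r > threshold else 0 for r in rates]

def oversample_positives(rows, labels, seed=0):
    """Sample positives with replacement until they match the negatives."""
    rng = random.Random(seed)
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    resampled_pos = [rng.choice(pos) for _ in range(len(neg))]
    return resampled_pos, neg

# Toy recombination rates for illustration only.
rates = [0.10, 0.50, 0.20, 0.90, 0.30, 0.05]
labels = label(rates)
pos_bal, neg = oversample_positives(list(range(len(rates))), labels)
scaled = min_max(standardise([1.0, 2.0, 3.0, 4.0]))
print(labels, len(pos_bal), len(neg), scaled[0], scaled[-1])
```

Applied to the counts in the text, 1762 positives are resampled up to 11117, so the balanced set has 2 × 11117 = 22234 samples.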
The validation set and training set are randomly selected from 22234 data based on the ratio of 1:4, so the numbers of training set and validation set in this experiment are 17787 and 4447, respectively.

Parameter optimisation of the model
Hyperparameters are parameters that control the behaviour of a machine learning model and affect its performance to a great extent. For most machine learning models, hyperparameter optimisation plays a decisive role in the performance of the model. Therefore, in order to obtain the optimal hyperparameters of each model, we use Optuna, an efficient hyperparameter optimisation framework. Optuna allows users to define the search space of hyperparameters dynamically according to their own needs, which makes it efficient and convenient. Meanwhile, Optuna also provides visualisation of the hyperparameter optimisation process, which is very helpful for finding the optimal hyperparameter combination.
The specific process of hyperparameter optimisation is as follows: first, estimate the space of possible values of each hyperparameter and search for the hyperparameters in this space. When new results appear, update the interval and continue searching, using repeated searches and evaluation updates to find better-performing hyperparameters. Here, we perform 4 iterative experiments of 100 rounds for each model, select the four groups of hyperparameter combinations with the highest five-fold cross-validation scores, and then select the optimal hyperparameter combination through the model evaluation indicators to establish the final predictive model (Figure 3).
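The structure of this procedure (4 experiments of 100 rounds, one best candidate per experiment, then a final choice among the four) can be sketched as below. Note that plain random search is used here as a stand-in for Optuna, the search ranges are hypothetical rather than those of Table 3, and `toy_cv_score` is a placeholder for a real five-fold cross-validation score.

```python
import random

# Hypothetical search space, for illustration only.
SEARCH_SPACE = {
    "n_estimators": (50, 500),
    "max_depth": (2, 12),
}

def sample_params(rng):
    """Draw one hyperparameter combination uniformly from the space."""
    return {name: rng.randint(lo, hi) for name, (lo, hi) in SEARCH_SPACE.items()}

def toy_cv_score(params):
    """Placeholder objective (higher is better). A real objective would
    train a model and return its cross-validated score."""
    return (-((params["n_estimators"] - 300) ** 2) / 1e4
            - (params["max_depth"] - 6) ** 2)

def run_search(n_trials, seed):
    """One experiment: keep the best-scoring combination of n_trials draws."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = sample_params(rng)
        score = toy_cv_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# b = 4 experiments of c = 100 rounds each, as in the text.
candidates = [run_search(n_trials=100, seed=s) for s in range(4)]
final_params, final_score = max(candidates, key=lambda c: c[1])
print(final_params, final_score)
```

In this sketch the final choice reuses the same score; in the paper the four candidates are compared with the model evaluation indicators instead.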

Algorithm description of XRLattCPred
Machine learning can integrate different algorithms for optimisation, which brings the advantages of each model into full play and achieves higher performance. Random forest builds a forest in a random way. The forest is composed of many decision trees (CART), and the decision trees are independent of one another (Nagra et al., 2019; Zhang et al., 2020). Random forest uses many decision trees to train on and predict samples, and can be used for tasks such as feature selection (Too et al., 2021), regression prediction (Xu et al., 2021) and classification prediction (Cao et al., 2021). In regression, the simple average method is usually used: the regression results obtained by the multiple weak learners are arithmetically averaged to give the final model output (Abdulkareem & Abdulazeez, 2021). The XGBoost (Extreme Gradient Boosting) algorithm is an improvement of the GBDT algorithm. It is an ensemble learning algorithm based on boosting, which integrates multiple decision trees to form a strong learner from multiple weak learners. On this basis, the objective function adopts a second-order Taylor expansion and adds a regularisation term to find the optimal solution of the model (Chen & Guestrin, 2016). The LightGBM algorithm uses the histogram algorithm. The basic idea is to discretise continuous floating-point feature values into k integers and, at the same time, construct a histogram with a width of k. LightGBM supports parallel learning, so it has faster training speed and higher efficiency, and is able to process big data (Song et al., 2021). Based on these ideas, this paper adopts the strategy of combination optimisation and parameter optimisation to construct the prediction algorithm XRLattCPred, which finds the combination with the best performance by combining and reconstructing the above three algorithms.
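The histogram idea can be illustrated with a small sketch that discretises continuous values into k integer bins and counts them. Equal-width binning is an assumption made here for simplicity; LightGBM's own bin construction is more elaborate, and the function names are hypothetical.

```python
def bin_feature(values, k):
    """Map each continuous value to an integer bin index in [0, k-1]
    using equal-width bins over the observed range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = []
    for x in values:
        b = int((x - lo) / width) if width > 0 else 0
        bins.append(min(b, k - 1))  # the maximum value goes in the last bin
    return bins

def histogram(bins, k):
    """Count how many values fall into each of the k bins."""
    counts = [0] * k
    for b in bins:
        counts[b] += 1
    return counts

values = [0.0, 0.1, 0.25, 0.5, 0.75, 1.0]
bins = bin_feature(values, k=4)
print(bins)                # → [0, 0, 1, 2, 3, 3]
print(histogram(bins, 4))  # → [2, 1, 1, 2]
```

Once features are binned, candidate split points only need to be evaluated at the k bin boundaries rather than at every distinct feature value, which is the source of the speed advantage.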
The algorithm XRLattCPred achieves a significant improvement in prediction efficiency, and the algorithm has high flexibility and portability. Figure 4 shows the prediction process of the algorithm XRLattCPred.
The algorithm XRLattCPred performs site prediction by learning the relationship between the structural characteristics of the recombination site and the recombination rate. The input of the algorithm is the structural feature data set D of the attC site, the attC data set Z to be predicted, and the threshold of the recombination rate, and the output is the recombination rate of each site in the data set Z. The models used in the algorithm XRLattCPred are Random forest, XGBoost and LightGBM. Through parameter optimisation and model evaluation, the study found that the model constructed by Random forest + XGBoost has better performance, so it is used as the final prediction model. The specific description is as follows:
(1) Preprocess the initial data set D to obtain the data set D', and then obtain a balanced data set D" after oversampling;
(2) Use Random forest, XGBoost and LightGBM to construct the initial prediction models, and perform operations such as parameter optimisation and model reconstruction;
(3) Input the data set D" into the above models to obtain the first 20 important features in the feature sequence output by each model, and name the resulting data sets D_r, D_x and D_l;
(4) Input these three data sets as training sets into Random forest, XGBoost and LightGBM respectively, constructing nine algorithm combination models;
(5) Select the Random forest + XGBoost model, which obtains the best score, as the final prediction model;
(6) Finally, input the data set Z to predict the recombination rate.
The pseudocode of the algorithm XRLattCPred is as follows:
Algorithm XRLattCPred.
Assign the values of recombination threshold t = 0.46, hyperparameter training times b = 4, hyperparameter iterations c = 100 and feature selection iterations d = 10;
for i = 1 to n do
    if y_i > t
    ...
D_train and D_test ← Divide the data set D" according to the ratio of training set : test set = 2:1;
Build the initial models by Random forest, XGBoost and LightGBM;
for i = 1 to b do
    for j = 1 to c do
        Parameter values set P ← Optimise the hyperparameters of the three models with Optuna and five-fold cross-validation;
    end for
end for
Reconstruct the three models separately according to P;
Construct scoring mechanism indicators → Output sets M_r, M_x and M_l;
for k = 1 to d do
    Input D_train and D_test into M_r, M_x and M_l respectively → Output feature scoring sets S_r^k, S_x^k and S_l^k;
end for
Filter features according to S_r, S_x and S_l → Get feature sets D_r, D_x and D_l;
Input D_r, D_x and D_l into Random forest, XGBoost and LightGBM respectively → Get combination model M;
Run M with prediction data set Z → Get the predicted recombination frequency values;
End

As shown in the algorithm XRLattCPred, nine algorithm combination models are established in the experiment. The combinations are ordered: the algorithm that selects the features and the algorithm that makes the prediction play different roles, so different arrangement orders may lead to different model results. In this paper, the combined model with the best prediction performance is named XRLattCPred and used as the final prediction model.
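The construction of the nine ordered combinations can be sketched as follows, assuming that each combination pairs one algorithm used for feature selection with one algorithm used for prediction. The importance scores and the `toy_evaluate` function are placeholders for illustration, not results from the paper.

```python
from itertools import product

ALGORITHMS = ("RandomForest", "XGBoost", "LightGBM")

def top_k_features(importances, k=20):
    """Names of the k features with the highest importance scores."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:k]

def best_combination(importances_by_selector, evaluate, k=20):
    """Score every ordered (selector, predictor) pair and return the best.

    `evaluate(predictor, features)` is expected to train the predictor on
    the selected features and return a score where higher is better.
    """
    results = {}
    for selector, predictor in product(ALGORITHMS, repeat=2):
        features = top_k_features(importances_by_selector[selector], k)
        results[(selector, predictor)] = evaluate(predictor, features)
    best = max(results, key=results.get)
    return best, results

# Placeholder importance scores for 30 hypothetical features per selector.
toy_importances = {
    name: {f"f{i}": ((i * (j + 3)) % 7) / 10 for i in range(30)}
    for j, name in enumerate(ALGORITHMS)
}

def toy_evaluate(predictor, features):
    """Placeholder score; a real version would fit and validate a model."""
    return len(set(features)) + len(predictor) / 100

best, results = best_combination(toy_importances, toy_evaluate)
print(len(results), best)
```

Because `product(ALGORITHMS, repeat=2)` yields ordered pairs, (Random forest, XGBoost) and (XGBoost, Random forest) are distinct models, which is why the arrangement order matters.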

Evaluation index
The evaluation index is an intuitive performance that reflects the quality of the model. In the experiment, we used four different evaluation indicators to evaluate this model. They are Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Pearson Correlation Coefficient (PCC) and Explained Variance Score (VarScore), respectively.
MAE is the mean of the absolute differences between the predicted values and the actual values. The smaller the value, the better the fit between the two sets of data. The calculation formula of MAE is shown in formula (4):

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - u_i\rvert \qquad (4)$$

RMSE is the square root of the mean squared residual (the predicted value minus the actual value). The fit between the two sets of data improves as the RMSE decreases. The calculation formula of RMSE is shown in formula (5):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - u_i)^2} \qquad (5)$$

PCC indicates the degree of linear correlation between two variables. In general it lies in [-1, 1]; for a useful prediction model it is usually between 0 and 1. The stronger the correlation between the two sets of variables, the closer the value is to 1; conversely, the weaker the correlation, the closer the value is to 0 (Shah & Zaveri, 2021; Wiedermann & Hagmann, 2016). Here, we expect the model to obtain a higher PCC score, which means that the model has higher prediction accuracy, better prediction performance and higher data correlation. The calculation formula of PCC is shown in formula (6):

$$\mathrm{PCC} = \frac{\sum_{i=1}^{n}(y_i - \bar{y})(u_i - \bar{u})}{\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}\,\sqrt{\sum_{i=1}^{n}(u_i - \bar{u})^2}} \qquad (6)$$

The value of VarScore is at most 1 and usually lies in [0,1]. The closer the value is to 1, the better the independent variables explain the variance of the dependent variable; the smaller the value, the worse the effect. The calculation formula of VarScore is shown in formula (7):

$$\mathrm{VarScore} = 1 - \frac{\mathrm{Var}(y - u)}{\mathrm{Var}(y)} \qquad (7)$$

Among them, y_i and u_i represent the actual recombination frequency and the predicted recombination frequency respectively, \bar{y} and \bar{u} are their mean values, n is the total number of data points, and Var is the variance of each distribution.
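The four indicators can be computed with the standard library alone, as in the minimal sketch below. It follows the usual textbook definitions of MAE, RMSE, Pearson correlation and explained variance; the function names and the toy vectors are hypothetical.

```python
import math

def mae(y, u):
    """Mean absolute error between actual y and predicted u."""
    return sum(abs(a - b) for a, b in zip(y, u)) / len(y)

def rmse(y, u):
    """Root mean square error between actual y and predicted u."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, u)) / len(y))

def pcc(y, u):
    """Pearson correlation coefficient of y and u."""
    mean_y, mean_u = sum(y) / len(y), sum(u) / len(u)
    cov = sum((a - mean_y) * (b - mean_u) for a, b in zip(y, u))
    sd_y = math.sqrt(sum((a - mean_y) ** 2 for a in y))
    sd_u = math.sqrt(sum((b - mean_u) ** 2 for b in u))
    return cov / (sd_y * sd_u)

def variance(v):
    """Population variance."""
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / len(v)

def var_score(y, u):
    """Explained variance score: 1 - Var(y - u) / Var(y)."""
    residuals = [a - b for a, b in zip(y, u)]
    return 1 - variance(residuals) / variance(y)

y_true = [1.0, 2.0, 3.0, 4.0]
y_shifted = [2.0, 3.0, 4.0, 5.0]   # every prediction is off by exactly 1
print(mae(y_true, y_shifted), rmse(y_true, y_shifted))
print(pcc(y_true, y_shifted), var_score(y_true, y_shifted))
```

In the shifted example, MAE and RMSE both equal 1 while PCC and VarScore stay at their maximum, which shows why the error-based and correlation-based indicators are reported together.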

Experimental environment configuration
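All the algorithms used in this paper are available in Python through the scikit-learn library or scikit-learn-compatible interfaces (XGBoost and LightGBM are distributed as the separate xgboost and lightgbm packages), and the experimental environment is Python 3.6.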

Parameter settings
Parameter setting is one of the necessary conditions for the model to have good performance. In this paper, we utilised parameter optimisation to find the optimal hyperparameters for each model. The specific parameters are shown in Table 2.
The parameter n_estimators represents the number of decision trees in the forest. The larger the n_estimators, the stronger the learning ability of the model and the easier the model is to overfit. The parameter max_depth represents the maximum depth of the decision tree, which can effectively limit overfitting. The parameter learning_rate is the learning rate, that is, the step size of the decision tree iteration. The larger the learning_rate, the faster the iteration speed, but it may not converge to the real optimum. The smaller the learning_rate, the more likely it is to find a more precise optimal value, but the iteration speed will be slower. Their search scopes are shown in Table 3.

Efficient prediction of XRLattCPred
In this paper, XRLattCPred is used to predict the recombination rate of attC r0 mutants, and a prediction algorithm based on structural features is established. In the algorithm XRLattCPred, the input describes the structural characteristics of attC sites, and the output is the recombination rate. By learning the relationship between structural features and recombination, the algorithm XRLattCPred can infer which important structural features recombination depends on, which is very helpful for improving the method of synthesising recombination sites. The algorithm XRLattCPred is built on the combination optimisation strategy. It establishes a prediction model from the first 20 important features in the feature sequence and cross-uses the Random forest, XGBoost and LightGBM algorithms. A total of nine algorithm combinations are constructed, and the best-performing combination, Random forest + XGBoost, is selected (Table 4).
The experimental results show that the scores of the algorithm XRLattCPred on the four evaluation indicators are PCC = 0.87, VarScore = 0.73, MAE = 0.041 and RMSE = 0.006, respectively. Among them, a high PCC score indicates a high correlation between the real results and the predicted results, a high VarScore indicates that the algorithm can better reflect the changes of the data set, a low MAE score indicates that there is a smaller error between the predicted results and the real results, and a low RMSE score indicates that the predicted results are closer to the real results. In conclusion, the algorithm XRLattCPred achieves high precision prediction of attC sites and achieves smaller prediction error. We can conclude that the use of smaller feature sequences can still achieve effective prediction of attC sites from this experiment.

Comparison with other algorithms
Traditional regression algorithms usually have good predictive performance, so we also compare the algorithm XRLattCPred with the traditional Decision tree regression, Ridge regression, Support vector regression and Random forest regression algorithms, and the experimental results show the strong performance of the algorithm XRLattCPred (the comparison result is shown in Figure 5). The overall comparison shows that the algorithm XRLattCPred achieves good scores on all four evaluation indicators. The Pearson correlation coefficient and the explained variance score are both the highest, with PCC = 0.87 and VarScore = 0.73, respectively. This shows that the correlation between the values predicted by the algorithm XRLattCPred and the actual values is higher than for the other models. The mean absolute error and root mean square error are the lowest, with MAE = 0.041 and RMSE = 0.006, respectively. This shows that the algorithm XRLattCPred has a smaller error. To sum up, the algorithm XRLattCPred has higher prediction accuracy and better performance. Meanwhile, the results indicate that the feature sequences describing the attC site contain features that have opposite effects on recombination.

Important features filtered out
Regression is a supervised learning technique commonly used in machine learning. It learns the relationship between independent and dependent variables and estimates an unknown continuous function from a limited number of data points. The above experiments show that the regression algorithm XRLattCPred has achieved a good score on the attC r0 mutant data set, which indicates that it has learned the relationship between the structural characteristics of attC sites and the recombination rate.
Therefore, this paper judges the importance of a feature by calculating the correlation between that feature and the target value, so as to output a score for each feature in the feature sequence. This can not only screen out important features that contribute to recombination, but also be of great help in improving the synthesis method of attC sites. In this paper, the 10 feature scoring sequences output by Random forest, XGBoost and LightGBM are accumulated and summed, which not only reduces the randomness and uncertainty caused by a single result, but also gives the output feature scoring sequence higher reliability and stability. The importance score of each feature lies in the interval [0,1], the scores of all features sum to 1, and important features receive higher scores. The top 20 important features obtained through XRLattCPred are the final result of the feature sequence output by Random forest (Figure 6). The output feature score map shows that the recombination rate of attC sites is a multi-factor function, and the final result depends on a series of important features. Therefore, we propose the algorithm XRLattCPred, which establishes a prediction model based on important features.
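The accumulation of feature scores over repeated runs can be sketched as follows. Rescaling the summed scores so that they add up to 1 is an assumption made here to match the stated property of the scores; the function names and the toy scores are hypothetical.

```python
def aggregate_importances(runs):
    """Sum per-feature scores over repeated runs, then rescale the totals
    so that they add up to 1."""
    totals = {}
    for scores in runs:
        for name, score in scores.items():
            totals[name] = totals.get(name, 0.0) + score
    grand_total = sum(totals.values())
    return {name: score / grand_total for name, score in totals.items()}

def top_features(importances, k=20):
    """The k (name, score) pairs with the highest aggregated score."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]

# Two toy runs over three made-up features (the paper uses 10 runs).
runs = [
    {"a": 0.5, "b": 0.3, "c": 0.2},
    {"a": 0.4, "b": 0.4, "c": 0.2},
]
combined = aggregate_importances(runs)
print(top_features(combined, k=2))
```

Averaging over several runs in this way dampens the run-to-run variation of tree-based importance scores before the top 20 features are selected.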

Conclusions
Due to the high importance of DNA recombination sites for gene therapy and drug development, we studied the prediction problem of DNA recombination sites and proposed an attC site prediction algorithm XRLattCPred based on combination optimisation strategy. Algorithm XRLattCPred adopts combination optimisation strategy, which combines different algorithms to give full play to the advantages of each algorithm and improve the accuracy of prediction. In addition, the XRLattCPred algorithm is also suitable for other specific sites and other genetic elements based on sequence features, with good portability.
In this work, the data set we used is the structural feature data of attC site mutants, and the recombination of attC site may also be influenced by the environment, integron and other factors. Therefore, in the future, we will consider more features and establish an attC site prediction algorithm based on multiple features to further improve prediction accuracy and credibility.