Patterns of genome-wide codon usage bias in tobacco, tomato and potato

Abstract Understanding codon usage patterns and their shaping factors in members of the Solanaceae family may reflect important features of their evolutionary history. Here, we investigated codon usage bias (CUB) over the whole genome of tobacco, tomato and potato. Analyzing the effective number of codons (ENc) revealed a weak CUB over the whole genome of the three genomes. The correlation analysis between the ENc and the nucleotide content at the third codon position indicated that genes with high ENc have high A/T and low G/C contents. Analyzing nucleotide composition and the relative synonymous codon usage confirmed that the three genomes are AT-rich and favor codons ending with T and/or A. Comparing CUB in each of the three genomes revealed that approximately 50% and 45% of the genes were dominated by mutational and natural selection, respectively. The remaining 5% of the genes were affected by combined factors of mutation and translation selection. The first four axes of the correspondence analysis (CA) were the major contributors to the variation within the codon sequences (CDS) of the three genomes. Results of a multi-regression analysis between the 4 axes and the CUB indices verified the fact that natural selection and directional mutation for base composition along with other factors could describe the variations in the CUB within the CDS of the three genomes. In conclusion, the CUB of the three species showed the same weak trend and that mutation and natural selection were the main factors influencing the nuclear genes of the three genomes.


Introduction
During the translation process from mRNAs to proteins, genetic information is transmitted in the form of triple nucleotides named codons to encode amino acids. Apart from methionine (Met) and tryptophan (Trp), each amino acid can be encoded by multiple codons which are known as synonymous codons [1]. Several studies on different organisms reported that synonymous codons are not used uniformly within the same species nor between different species, while translating genes into proteins, a phenomenon called 'synonymous codon usage bias (SCUB) or codon usage bias (CUB) [2,3]. Hence, each organism has its optimal codons that are used more frequently in highly expressed genes than in lowly expressed genes [4]. Directional mutation and natural selection are the two main factors shaping CUB of an organism. If mutational bias solely shapes the CUB of an organism, the four base content should be observed uniformly within a codon family in a gene [5,6]. There are other factors that influence CUB of an organism such as nucleotide composition, gene length [7], tRNA abundance [8], codon hydropathy and DNA replication initiation site [9], and expression level [10]. Investigating the CUB of an organism facilitates a better understanding of its molecular evolution, gene expression, host-pathogen co-adaptation and predicted optimal codons. Investigating the CUB of an organism facilitates a better understanding of its molecular evolution, gene expression, host-pathogen co-adaptation and predicted optimal codons. The later facilitates designing highly expressed genes and constructing suitable cloning vectors [11]. Various studies have quantified CUB using the position and the total GC content [12,13]. Those studies illustrated that the synonymous third codon position (GC 3 ), also known to be linked with translation optimization, is a strong predictor of CUB. Different approaches have been used to substantiate the degree of CUB and to compare the relative effects of natural selection and mutation bias in shaping codon usage. The most widely used approaches are, the effective number of codons (ENc), the codon adaptation index (CAI), the relative synonymous codon usage (RSCU) and the translation selection (P2) index [4]. The ENc measures the bias of using a smaller subset of codons apart from the equal use of synonymous codons irrespective of gene length. Generally, highly expressed genes should have a high level of codon preference and hence low ENc values. Besides, ENc can be an indicator of codon usage regarding mutational bias [14]. In the case of CAI, a reference set of highly expressed genes (e.g. ribosomal genes) is used to serve as an indicator of gene expression levels and natural selection, where both help in shaping the pattern of codon usage. Also, CAI can be used as an indicator of selection for a bias toward translational efficiency and/or accuracy [15]. RSCU calculates the observed count in relation to the expected count of each analyzed codon. P2 index measures the bias of anticodon-codon interactions, from which the translation efficiency can be inferred. Neutrality plot [16] and parity rule 2 (PR2)-bias plot analysis have also been used to examine the influence of natural selection, mutation or both on CUB [17].
Solanaceae family is one of the biggest families within the angiosperms. This family consists of approximately 3,000-4,000 species and contains a number of major economically important species all over the world such as tobacco (Nicotiana tabacum), tomato (Solanum lycopersicum) and potato (Solanum tuberosum).
Not much information is available on the factors influencing CUB in tobacco, tomato and potato. For example, a previous study on the tomato cultivar Micro-Tom using the available non-redundant full-length cDNA (nrFLcDNA) and expressed sequence tags (ESTs), revealed that GC3 content and CAI were the major factors affecting the CUB [18].
To increase our understanding of the CUB and the main factors influencing it in the three genomes, we designed the present study. Our results revealed similar weak CUB over the whole genome of the three species. We also showed that mutation and natural selection on translation efficiency and/or accuracy were the main shaping factors influencing the nuclear genes of the three species.
Nucleotide content/frequencies in the third codon position (A3, T3, G3 and C3) were calculated using a self-developed python script. In addition, GC content at different codon positions, i.e. GC1, GC2 and, GC3, as well as the overall GC and GC12, were assessed. Averages were computed to examine the association between codon usage variation and base composition. Using the above equation, plots of the expected fitting curve of ENc values were drawn using ggplot package in R [20], and then ENc values versus GC3s values for each gene were plotted. If the distribution of the plotted genes is along/near the curve, then the codon usage bias is only affected by the mutation. If the distribution of the plotted genes is below the curve, then the codon usage bias is affected by the selection and other factors.

Codon adaptation index
The python Package CAI [21] was used to calculate CAI using the following equation [15]: where L is the count of codons in the gene and wc (k) is the relative adaptive value for the kth codon in the gene. CAI values range from 0 to 1 with 0 for genes with even usage of all synonymous codons, and 1 for genes with consistent usage of specific codons.

Relative synonymous codon usage
The R package seqinR [22] was used to calculate RSCU using and the following equation [15]: where O ac is the count of codon c for an amino acid and k a is the number of synonymous codons.
From the above equation, if the value of RSCU is less than one, then a codon is observed less frequently than the average codon usage. RSCU becomes equal to one when the codon is observed as the average, and more than one when the codon is observed more frequently than the average codon usage [23].

Translational selection index
Translational selection index (P2) for each CDS was calculated using python and the following formula;

WWY SSY
where W = A or U(T), S = G or C, and y = C or U(T). Then the P2-index for the whole CDS for each genome was calculated by taking the average. P2 values correlate positively with the level of gene expression, i.e. high for highly expressed genes and low for lowly expressed genes [4]. If P2 is more than 0.5, there is a bias toward translation selection in the tested CDS.

Neutrality plot
Neutrality plots were drawn using ggplot package in R [20]. The neutrality plot was used to explore the pattern of codon usage. The average GC content at the first and the second codon positions (GC12) was plotted against the GC content at the third codon position (GC3). A regression plot with a slope equal to or near 0 indicates that natural selection influences codon usage, while a slope equal to or near 1 indicates that mutation influences codon usage [16].

Parity rule 2-plot analysis
Using nucleotide content at the third codon position (A3, T3, G3, and C3) for each gene, AT-bias (A3/ (A3 + T3)) and GC-bias (G3/(G3 + C3)) were estimated and used in drawing the PR plot as the ordinate and the abscissa, respectively, with both coordinates equal to 0.5, where A = T and G = C [24]. If genes are scattered equally over the plot, then the CUB is likely to be solely caused by the mutation [17]. PR plot was drawn using ggplot package in R [20].

Determination of putative optimal codons
Five percent of the highest and lowest expressed genes were determined using the ENc values. Then, the RSCU values for each codon are compared between the highest and lowest expressed genes using the chi-square test. The codon that is used significantly more in the highly expressed genes than in the lowly expressed genes is regarded as an optimal codon [25].

Correspondence analysis
To examine the variations in the codon usage bias among all the CDS, we performed multivariate statistical analysis on the 59 codons, i.e. the 61 synonymous codons excluding Met and Trp codons. One way to do that was using correspondence analysis (CA) [17] by plotting the group of genes on the continuous axes in multidimensional space according to the trends affecting the synonymous codon usage within the gene groups. Two types of CA were employed on codons to check the factors affecting the choice of amino acids with respect to the codons being used [26]. The first CA used the relative codon usage frequencies and the second CA used the amino acid usage. The correlation between the first 4 axes of all CA types (the major contribution to the codon usage variation) and CAI, ENc and GC3 was estimated. CA analysis was implemented using CodonW software (http://codonw.sourceforge.net/).

Codon base composition
To facilitate our analysis of the three genomes, and to investigate CUB and ENc, we used 84,255, 33,697 and 56,210 coding sequences (CDS), i.e. the total count of codons in all CDS of the genome (Table 1) of tobacco, tomato and potato, respectively. The total percent of GC observed here was 42.61%, 41.28% and 42.50%, while the GC3 percent, i.e. percent of GC at third codon position, were 38.48%, 37.01% and 38.22% for tobacco, tomato and potato, respectively (Table 1). Analyzing each nucleotide at the synonymous third codon position (A3s, T3s, C3s, and G3s) in the genomes of the three species, tobacco, tomato and potato, indicates that the mean values of A3s and T3s are significantly higher (t-test, p < 0.01) than the mean values of G3s and C3s (Figure 1-i). The neutrality correlation analysis revealed significant, P-value <0.05, positive correlation between GC12 and GC3, with slope approximately equal to zero, i.e. 0.000311, 0.00105, 0.00102, for tobacco, tomato and potato, respectively.

The synonymous codon usage
To compare the genomes of the three species and to measure the strength of CUB, ENc was calculated. The three species showed similar weak CUB with ENc values of 58.31, 57.87 and 57.67 for tobacco, tomato and potato, respectively. The ENc values for each gene of the three genomes were significantly negatively correlated with their GC, GC12, i.e. the average of GC1 + GC2, and GC3 nucleotides contents ( Table 2). The highest negative correlation, i.e. r = −0.69, was observed for potato between GC3 and ENc. In the case of tobacco and tomato, GC3 and ENc were significantly negatively correlated, i.e. r = −0.43 and −0.31, respectively.
In the ENc plot (Figure 1-ii), the plotting values of ENc against GC3s content illustrated the effect of the base contents on CUB in the three CDSs of the studied genomes. Coding sequences (CDSs) were gathered around the expected ENc fitting curve, in case of high ENc values, weak bias. In contrast, CDSs with low ENc values (strong bias), were found to be below the expected curve.

The synonymous codon usage features and the determination of putative optimal codons
In general, RSCU values were greater than 1, meaning a codon is observed more than expected, for 28 codons in each of the three genomes. These high-frequency codons ended with either A or T in 24 codons, where 20 of these codons are common in the three genomes i.e. AAA, AAT, ACA, ACT, AGA, ATT, CAA, CAT, CCA, CCT, CTT, GAA, GAT, GCA, GCT, GGA, TGT, TCT, TCA, TAT. Based on the correlation between RSCU and ENc for each gene in each genome, 25 optimal codons ending with either A (7) or T (16) were determined in tobacco, with two additional codons ending with C (Table 3). In tomato, 30 optimal codons ended either with A (13), T (15) or G (2). In potato, 26 optimal codons were found to end either with A (9), T (16) or G(1) ( Table 3).

PR2-bias plot
Figure 1(iii) shows the PR2-bias plots of the three genomes, where CDS of the three genomes were distributed on the left quartiles along the ordinate, or the upper left quartile along the abscissa. The numbers of CDS were scattered on the G > C plane more than the G < C plane.

Translational selection (P2) index
Analyzing the P2-index for each gene in the three genomes revealed similar effects of natural selection on translational efficiency and/or accuracy in tobacco and tomato. For example, out of 84,255 and 33,697 total CDS in tobacco and tomato, respectively, 52% showed low selection effect with P2 < 0.5, 4% had moderate selection effect with P2 = 0.5, and 44% had high selection effect with P2 > 0.5. In the case of the potato genome, the result was slightly different, where from the total of 56,210 CDS; 49% of the CDS had low selection effect with P2 < 0.5, 4% had moderate selection effect with P2 = 0.5, and 47% had high selection effect with P2 > 0.5. For the three genomes, it was observed that both SSU and WWU were higher than SSC and WWC (Table 4), where W = A or U, S = C or G. Hence among the two pyrimidines (U and C) the preference tends to be U (T) in the third position of a codon.

Correspondence analysis using codon usage frequencies
The result for correspondence analysis using the relative synonymous codon usage frequencies (CA-RSCU) 38.5% (SD = 6.1) 37.0% (SD = 6.95) 38.2% (SD = 6.9) note: number of cDS = the total count of codon sequence of a genome. number of codons = the total count of codons in all cDS of a genome. gc% refers to the total percent of gc, gc12% = percent of gc at first and second positions, gc3% = percent of gc at third codon position. SD refers to standard deviation.
for each genome showed that CA axes could explain 32%, 44% and 46%, respectively, in tobacco, tomato and potato. The first four axes explained most of the CUB variation found in the CDS of the three genomes; therefore correlation analysis between the first four axes and the calculated CUB indices (CAI, ENc, and GC3) was performed (Table 5). To examine the variation within the amino acid, CA using the amino acid  Table 2. the correlation analysis between the effective number of codons (enc) and each of gc, gc12 and gc3 in the three genomes. usage was performed and correlation analysis between the first four axes and the calculated CUB indices was carried out (Table 5).

Discussion
One factor shaping the pattern of codon usage is the nucleotide composition at the third codon site, i.e. A3, T3, C3 or G3. Our results revealed that the three investigated genomes favor using codons ending with T or A more frequently than codons ending with C or G. This finding suggests that there are similar codon base compositions at the genome level of tobacco, tomato and potato. Our results agree with the previous studies on the genomes of dicot species, i.e. cotton [4], tomato [18], Arabidopsis thaliana [27,28], peas, tobacco [27], where they show that dicot genomes have low GC-content and tend to use A/T ending codons. Earlier studies showed significant over-representation of codons ending with A/T in dicot species, including tomato and potato, more than in monocot species [29,30]. Neutrality correlation analysis establishes a relationship between GC12 and GC3 contents to show the role of either natural selection or mutation in shaping the CUB. Our neutrality plots displayed a significant correlation between GC12 and GC3 contents in the three genomes suggesting that mutational stress plays a role in shaping the codon usage preferences. On the other hand, and with a regression neutrality plot with a slope almost equal to zero, natural selection can be suggested to play a role in shaping the codon usage preferences.
The negative correlation between the ENc, with values of 58.31, 57.87 and 57.67 for tobacco, tomato and potato, respectively, and CUB indicate low or weak CUB meaning general random codon preference. Also, weak CUB was observed for CDSs with high ENc values, i.e. around the expected ENc curve in Figure 1(ii), indicating that CUB in the three genomes is affected by the mutation. In addition, the significant negative correlations observed between the ENc and each of the GC, GC12 and GC3 indicate that the base content at the first, second and third positions of the synonymous codons directly affects the level of CUB. Moreover, plotting the ENc values for the CDS versus the GC3s of the three genomes indicates that the codon usage pattern of CDS might be formed by the combined impacts of directional mutation and natural selection.    To determine the putative optimal codons for the three genomes, RSCU analysis was performed and revealed 26 common codons ending with T or/and A. Those 26 codons were observed more frequently than expected. This finding supports a high degree of uniformity among the three genomes. A similar preference of optimal codons was recently reported among four different cotton species [4]. Taking advantage of the positive correlation between the CUB and gene expression, putative optimal codons could be predicted [10]. Therefore, we conclude that genes with weak CUB would have lower G3s, and C3s, and higher T3s and A3s values. Altogether, our results indicate that genes of the three investigated genomes prefer to use high expression codons ending with pyrimidines (T/A). A similar trend was reported for the optimal codons in chloroplast genes among 12 Solanum species, where the chloroplast genes were found to prefer A/T ending codons, and 66% of the optimal codons had A/U at the third codon position [25].
Despite the GC content of a gene, using PR2-bias plot facilitates identifying a gene with CUB formed only by nucleotide composition, i.e. having even distribution between G3 and C3, as well as A3 and T3 [31]. In the present study, the PR-2 bias plots of the three genomes showed inequality between A + T and G + C at the third codon positions indicating that mutation, selection and other factors such as gene expression, gene length, etc. were the main factors defining the codon usage patterns. Similar asymmetry between pyrimidines and purines was previously observed in chloroplast genes among 12 Solanum species [25] as well as in four cotton species [4].
Normalized values of translational selection (P2-index) range between 0 and 1, where a gene with P2-index equal to 1 has the largest translation selection. The CDS of the three genomes investigated here showed an average P2-index equal to 0.5, indicating that the CDS of the three genomes, in general, are dominated by mutation pressure rather by translation selection. Further analysis showed that 44%, 44% and 47% of the genes in the genome of tobacco, tomato and potato, respectively, have a P2-index more than 0.5, indicating that translational selection played the dominant role over mutation pressure in the codon usage. Altogether, our results illustrated that not only one factor, but both translation and mutation affect the CUB of the three genomes.
To verify the factors affecting the codon usage in the three genomes, correlation analysis was carried out between CAI, ENc and GC3 and the first 4 axes of CA. The results showed that the first 4 axes were the major contributors to the codon usage variation for the three genomes. In tobacco, a significant correlation was found between axis-1 of CA-RSCU and CAI (r = −0.57, p-value < 0.05) which may support the hypothesis that selection for translation efficiency and/or accuracy explains most of the variation of codon usage. Furthermore, CA using amino acid usage (CA-AA) showed strong correlation between axis-1 and CAI (r = −0.56, p-value < 0.05). Hence, it could be inferred that there is an effective choice of amino acids for translation efficiency in the tobacco CDS. Different results were observed in tomato, where axis-1 of CA-RSCU correlated with GC3 (r = −0.75, p-value < 0.05) as well as ENc (r = 0.37, p-value < 0.05). This result reveals that directional mutation for base composition, not selection, dominates the CUB within the tomato genome. However, we did not find any remarkable trend for CA-AA and hence amino acid usage in tomato was affected by factors other than mutation and selection.
In potato, no single force could explain most of the variation in CUB, where in CA-RSCU axis-1 the highest correlation was found with CAI (r = −0. 384, p-value < 0.05) and for axis-2 the highest correlation was found with GC3 (r = −0. 373, p-value < 0.05). The same trend in the result was found for CA-AA. Therefore, it can be concluded that CUB in potato was influenced by different factors and mainly by mutation for base composition and translation selection.

Conclusions
The present study investigated the genomes of tobacco, tomato and potato, which shared the following main criteria. They uniformly use synonymous codons ending with either T or A. They show the same weak trend of CUB with a high percentage of T3/A3 and low G3/C3 content. Their nuclear genes were influenced by mutation and natural selection. Although it was not possible to obtain the required data, we are aware that the GC content in introns or intergenic sequences could be used as the reference to measure the effect of mutation bias. Finally, studying plant CUB and the factors affecting it will increase our understanding of the evolutionary history of those plants. Also, it enables determining the preferred optimal codons that help in some biotechnology applications such as gene transfer and gene editing.

Data availability
Data used here are available online from the websites stated in materials and methods.

Disclosure statement
No potential competing interest was reported by the authors.