Bioinformatic analysis of promoter, motifs and CpG islands of genes encoding potassium transporters in crop plants

ABSTRACT Potassium transporter genes are essential for plant salt stress tolerance. Identification of gene regulatory elements is vital for the recognition of gene expression patterns. Thus, understanding the regulatory systems of potassium transporter genes is useful for improving the salt tolerance of crop plants. The present study was aimed at in silico analysis of the promoters and regulatory elements of potassium transporter genes in coffee, peanut, soybean, maize, sorghum and potato. A total of 19 potassium transporter genes were identified, and their transcription start sites (TSSs), conserved motifs and CpG islands were analysed using various computational tools. The highest promoter prediction score (1) was obtained for a transcription start site in one Coffea arabica gene sequence (LOC113728271), whereas the lowest score (0.8) was recorded for a transcription start site in one Sorghum bicolor gene sequence (LOC8065633). The analysis also showed that 36% of the genes contained more than one transcription start site, whereas 63.16% had only one transcription start site. Five motifs were identified. Motif 1 was the common promoter motif, residing in 78.95% of the potassium transporter promoter sequences with an E-value of 3.4e-002. C2H2 zinc finger transcription factors are predicted to bind to this conserved motif with high statistical significance (2.28e-03). Very few CpG islands were observed in either the promoter or the gene body regions using two algorithms. The present study could contribute to a better understanding of gene expression and to the improvement of crop tolerance to environmental stresses through molecular-assisted breeding or genetic engineering techniques.


Introduction
Potassium (K+) is one of the most important macronutrients for plant health and contributes 0.8%-8% of plant dry weight [1]. It is involved in plant growth and development and determines the yield and quality of crop production. Potassium is associated with enzyme activation, the opening and closing of stomata, and the exchange of water vapor, oxygen and carbon dioxide [2,3]. Therefore, it helps plants resist salinity, drought and diseases. Moreover, in the phloem, K+ is critical for the transport of photoassimilates from source to sink [4]. Most of these functions depend on the activities and regulation of membrane-bound K+ transport proteins, which work over a large range of K+ concentrations [5].
Potassium is absorbed by plant roots and translocated within plants through several K+ channels and transporters [6,7]. K+ transport systems are proteins encoded by a large family of genes that are expressed in almost all plant tissues. As plants require significant amounts of K+, there is a large number of K+ transport systems to allow its uptake from complex soil environments and its redistribution to plant tissues [3]. These transporter genes have been clustered into four main gene families in plants: the KT/HAK/KUP family, the Trk/HKT family, the KEA (K+ efflux antiporter) family and the CHX (cation/hydrogen exchanger) family [8,9]. Transporters of the HKT family are few in plants. For instance, there is only one HKT gene in Arabidopsis thaliana [10] and probably two genes in Oryza sativa [11] and Eucalyptus spp. [12]. In contrast, the HAK transporter family, also referred to as the KT or KUP transporter family, is the largest gene family found in plants [13]. Members of this family are expressed in virtually all plant tissues that have been investigated in a wide variety of species [14].
Several previous investigations found that plant HAK/KUP/KT family members differ in their expression patterns across tissues and cell types and in their response to specific signals [2]. For example, under normal K+ conditions, AtHAK5 transcripts were detected in roots but not in shoots, whereas under K+ deficiency the AtHAK5 transcript was present in both roots and shoots, with higher expression in roots [15]. Similarly, qRT-PCR (quantitative real-time polymerase chain reaction) analysis showed that the expression levels of TaHAK genes in wheat seedlings were considerably up- or down-regulated once the seedlings were exposed to K+ deficiency, high salinity or dehydration [16].
The study of gene expression regulation is an important step in understanding the molecular mechanisms of plant stress response and gene function. Gene expression is usually controlled by the gene promoter sequence. The promoter is a non-coding DNA segment composed of multiple motifs, or cis-acting regulatory elements (CAREs), and CpG islands that take part in the regulation of gene expression [17,18]. A promoter is usually located upstream of the transcription start site (TSS) and is responsible for the initiation of transcription. CAREs are short conserved motifs (5 to 20 nucleotides) that are specific binding sites for proteins such as transcription factors (TFs). TFs bind to a particular motif in the promoter region and orchestrate the initiation of transcription by promoting or blocking the recruitment of the RNA polymerase II (Pol II) complex. CpG islands (CGIs) are clusters of CpG dinucleotides in GC-rich regions, often located at the 5′ end of genes. Methylation of promoter-associated CGIs plays an important role in the regulation of tissue-specific genes [19].
Accurate prediction of promoter regions and identification of regulatory elements are useful in improving genome annotation and elucidating the regulation and expression patterns of genes, and may even produce valuable information about the function and signalling of novel and uncharacterized genes [20]. Modern molecular techniques such as RNA interference, microarrays, RNAseq and others have permitted the prediction and analysis of promoter regions of target genes, but these techniques are costly and technically difficult. Many computational tools are now available to search promoter regions for the regulatory elements responsible for gene regulation [21,22]. Despite many recent efforts, the in silico identification of promoter regions is still in its infancy compared to gene prediction. Therefore, the present study aimed to predict the promoter regions, conserved motifs and CpG islands of potassium transporter genes in selected crop plants.

Gene search
The sequences of three maize, four potato, four sorghum, two peanut, three soybean and three coffee potassium transporter encoding genes were retrieved from the National Center for Biotechnology Information (NCBI) nucleotide database (http://www.ncbi.nlm.nih.gov). For these crops, pseudogenes were excluded; only genes with full-length sequences (with a start codon) were collected in FASTA format. As a result, the number of genes considered in the study differed among crops. Information about the chromosomal location, DNA length (bp), exon number, open reading frame (ORF) number and source organism was obtained for each gene from NCBI GenBank (http://www.ncbi.nlm.nih.gov).

Determination of transcription start sites (TSSs) and promoter regions of the genes
Transcription start sites (TSSs) and promoter regions of the genes were determined using the Neural Network Promoter Prediction tool (NNPP version 2.2; https://www.fruitfly.org/seq_tools/promoter.html) (Reese, 2001), which reports a promoter score between 0 and 1 for each prediction. Sequences of about 1-3 kb upstream of the start codon were excised from each gene and submitted to the program after setting the type of organism, the DNA strand and the minimum promoter score (cut-off). For this study, the default cut-off value for eukaryotes (0.8) was used. The program scans the sequences and shows the predicted transcription start site in large font. For sequences containing more than one predicted transcription start site, the site with the highest prediction score (at or above the 0.8 cut-off) was considered the most reliable prediction. The remaining transcription start sites were used for simple comparative analysis among the sequences. Gene promoter regions, defined as the 1 kb region upstream of each transcription start site, were then extracted and used to predict conserved motifs across the gene sequences.
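The selection rule described above can be illustrated with a minimal Python sketch. It assumes the NNPP hits have already been copied by hand into (TSS index, score) pairs; the function names, the 0-based indexing and the example values are hypothetical and are not part of the NNPP tool.

```python
from typing import List, Optional, Tuple

# Hypothetical representation of one NNPP hit: (0-based TSS index, score).
Prediction = Tuple[int, float]


def select_tss(predictions: List[Prediction], cutoff: float = 0.8) -> Optional[Prediction]:
    """Return the highest-scoring prediction at or above the cut-off, or None."""
    kept = [p for p in predictions if p[1] >= cutoff]
    if not kept:
        return None
    return max(kept, key=lambda p: p[1])


def promoter_region(sequence: str, tss_index: int, length: int = 1000) -> str:
    """Return up to `length` bases immediately upstream of the TSS."""
    start = max(0, tss_index - length)
    return sequence[start:tss_index]


# Illustrative values only; they are not taken from the study.
hits = [(120, 0.85), (640, 0.97), (900, 0.62)]
best = select_tss(hits)
upstream = promoter_region("A" * 2000, 1500)
print(best, len(upstream))
```

If fewer than 1000 bases are available upstream of the selected site, the sketch simply returns the shorter region rather than padding it.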

Determination of conserved motifs and transcription factors that bind to them in potassium transporter gene promoter regions
Conserved motifs in the potassium transporter genes were analysed with MEME (Multiple EM for Motif Elicitation) version 5.0.5 (http://meme.sdsc.edu), using the 1 kb region upstream of the transcription start site determined by Neural Network Promoter Prediction [23]. With this technique, common motifs that serve as binding sites for the transcription factors expected to regulate the expression of potassium transporter genes were identified across the promoter regions. Before searching the uploaded sequences for common motifs, basic search parameters such as the motif discovery mode, sequence alphabet, site distribution and number of motifs were set. For the present study, the classic discovery mode, the DNA sequence alphabet, the default site distribution (zero or one occurrence per sequence, zoops) and five motifs were used. The MEME HTML output format was used, although six different formats were available. With the minimum and maximum motif widths set to 6 and 50 residues, respectively, the motif with the lowest E-value was selected as the best motif and submitted to TOMTOM, a web-based motif comparison program, for transcription factor prediction. TOMTOM predicts the proteins that bind to DNA motifs in the promoter region by comparing one or more query motifs against a database of known motifs, which was the JASPAR database in this study. In this way, TOMTOM generated a list of transcription factors ranked by p-value and E-value. The match with the lowest p-value and E-value was taken as the transcription factor most likely to bind the motif [24].
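As a rough illustration of the selection step, the following Python sketch picks the motif with the lowest E-value among those within the width limits and converts a site count into the share of sequences carrying the motif. The `Motif` record and function names are hypothetical and are not part of MEME; the values for motif 1 (width 15, 15 sites, E-value 3.4e-002) are those reported in this study, while the width and E-value given for motif 5 are placeholders chosen only for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Motif:
    # Hypothetical record holding values read from a MEME report.
    name: str
    width: int
    e_value: float
    sites: int


def best_motif(motifs: List[Motif], min_width: int = 6, max_width: int = 50) -> Optional[Motif]:
    """Return the motif with the lowest E-value among those within the width limits."""
    eligible = [m for m in motifs if min_width <= m.width <= max_width]
    if not eligible:
        return None
    return min(eligible, key=lambda m: m.e_value)


def coverage(sites: int, n_sequences: int) -> float:
    """Percentage of sequences with the motif, assuming at most one site per sequence (zoops)."""
    return 100 * sites / n_sequences


motifs = [
    Motif("motif 1", width=15, e_value=3.4e-2, sites=15),
    Motif("motif 5", width=11, e_value=5.0e3, sites=5),  # width and E-value are placeholders
]
print(best_motif(motifs).name, round(coverage(15, 19), 2), round(coverage(5, 19), 2))
```

Under the zoops model each sequence contributes at most one site, which is why the site count can be divided directly by the number of input sequences.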

Search for CpG islands
A 2-kb query sequence upstream of the start codon was prepared for each of the 19 protein-coding potassium transporter genes. Two algorithms were used to search for CpG islands. The first applied the stringent search criteria of Takai and Jones [25] (GC content ≥55%, ObsCpG/ExpCpG ≥0.65 and length ≥500 bp), as implemented in the CpG island searcher program (CpGi130) available at http://dbcat.cgm.ntu.edu.tw/. The second was the offline tool CLC Genomics Workbench ver. 3.6.1 (http://clcbio.com), which was used to search for restriction enzyme MspI cutting sites (fragment sizes between 40 and 220 bp).
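The two searches can be approximated with a short Python sketch. It is not the CpGi130 or CLC implementation: it only tests whether a single candidate window meets the three thresholds quoted above, and it assumes the commonly used definition ObsCpG/ExpCpG = (number of CpG × window length) / (number of C × number of G). The function names are hypothetical, and MspI is taken to cut C/CGG.

```python
from typing import List


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def obs_exp_cpg(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def passes_takai_jones(window: str, min_gc: float = 0.55,
                       min_ratio: float = 0.65, min_len: int = 500) -> bool:
    """Check one candidate window against the three thresholds."""
    return (len(window) >= min_len
            and gc_content(window) >= min_gc
            and obs_exp_cpg(window) >= min_ratio)


def mspi_cut_positions(seq: str) -> List[int]:
    """0-based positions where MspI (C/CGG) cuts, i.e. just after the first C."""
    seq = seq.upper()
    cuts = []
    pos = seq.find("CCGG")
    while pos != -1:
        cuts.append(pos + 1)
        pos = seq.find("CCGG", pos + 1)
    return cuts


print(passes_takai_jones("CG" * 300), mspi_cut_positions("AACCGGTT"))
```

A full island search would slide such a window along the sequence and merge overlapping hits; that step, and the rule for turning cut positions into 40-220 bp fragments, are left out here.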

Phylogenetic analysis
Phylogenetic analysis was performed using full-length protein-coding gene sequences to understand the evolutionary relationships among and within the genes. First, the gene sequences retrieved from NCBI were aligned using the ClustalW multiple sequence alignment (MSA) tool of MEGA ver. 6.06 [26] with the default settings. Then, an unrooted phylogenetic tree was constructed with Molecular Evolutionary Genetics Analysis ver. 6.06 (MEGA6) (http://www.megasoftware.net) using the neighbor-joining method [27]. Some parameters in MEGA ver. 6.06 were modified to obtain a stable and reliable phylogenetic tree. The p-distance model was used for nucleotide substitutions (transitions and transversions). Bootstrap values were computed with 2000 replicates, a uniform rate among sites and complete deletion of gap sites.
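For readers unfamiliar with the distance settings, the sketch below shows what the p-distance with complete deletion amounts to: alignment columns containing a gap in any sequence are removed, and the distance between two sequences is the proportion of remaining sites at which they differ. This is a plain Python illustration with hypothetical function names and a toy alignment, not the MEGA implementation, and it does not build the neighbor-joining tree.

```python
from typing import List


def complete_deletion(alignment: List[str]) -> List[str]:
    """Drop every alignment column that contains a gap in any sequence."""
    keep = [i for i in range(len(alignment[0]))
            if all(seq[i] != "-" for seq in alignment)]
    return ["".join(seq[i] for i in keep) for seq in alignment]


def p_distance(a: str, b: str) -> float:
    """Proportion of sites at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(alignment: List[str]) -> List[List[float]]:
    """Pairwise p-distances after complete deletion of gap columns."""
    cleaned = complete_deletion(alignment)
    return [[p_distance(a, b) for b in cleaned] for a in cleaned]


# Toy alignment for illustration only.
toy = ["ACGT-A", "ACGTTA", "AGGTTA"]
print(distance_matrix(toy))
```

The resulting matrix is the input a neighbor-joining procedure would cluster.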

Sequence retrieval
Gene sequences of 19 potassium transporter family members from six crop species, namely 3 coffee (Coffea arabica), 2 peanut (Arachis hypogaea), 3 soybean (Glycine max), 3 maize (Zea mays), 4 sorghum (Sorghum bicolor) and 4 potato (Solanum tuberosum) sequences, were retrieved from the NCBI database. These crops were selected randomly to examine the genetic variation that may exist among potassium transporters of different crop species. The characteristics of each potassium transporter gene sequence are shown in Table 1, including locus, chromosome number, number of cis-acting elements, number of ORFs, gene length, number of exons and source organism. The retrieved gene sequences varied in length from 1287 to 7085 nucleotides. The coding sequences of all the potassium transporter genes were interrupted by introns, with exon numbers varying from three to eleven. Four of the 19 genes contained 10 exons and another four contained 8 exons. The number of ORFs ranged from 3 to 30. The maximum number of ORFs (30) was recorded for the LOC8086366 gene, while the minimum number (3) was found in five genes (LOC112747245, LOC112736009, LOC100799414, LOC102661038 and LOC102604030). The exon number, number of ORFs and sequence length differed even among genes identified from the same crop (Table 1).

Determination of transcription start sites and promoter regions for potassium transporter genes
Determining the location of the transcription start sites and promoter region of a gene is crucial for mapping the cis-regulatory elements and hence for further studying the mechanism of gene regulation [28]. To identify the transcription start sites, 1-3 kb sequences upstream of the start codon were excised from each sequence. The Neural Network Promoter Prediction (NNPP version 2.2) tool was employed with a minimum predictive score (between 0 and 1) cut-off of 0.8 for eukaryotes [29]. The analysis showed considerable variation in the number of transcription start sites among the genes (Table 2). The in silico prediction revealed that seven, five, three, two, one and one potassium transporter genes contained one, two, three, four, five and six transcription start sites, respectively. To make the prediction more reliable for genes with more than one transcription start site, the site with the highest score was taken as the correct transcription start site. The highest number of transcription start sites (6) was recorded in the regulatory region of the gene LOC8065633 from sorghum, whereas the lowest number (one) was obtained in the LOC113687353 (Coffea arabica) and LOC102594138 (Solanum tuberosum) gene sequences. In other words, 36% of the genes contained more than one transcription start site, whereas 63.16% had only one transcription start site. Unlike the other crops, all three genes obtained from Glycine max showed the lowest number (1) of transcription start sites (Table 2).
The relative positions of all transcription start sites with respect to the start codon are also given in Table 2. The transcription start sites nearest to the start codon were recorded for LOC8075232 (−27) followed by LOC8077489 (−28), whereas the farthest was observed for LOC8086366 (−2804), followed by another transcription start site of the same sequence (−2769). About 80% of the transcription start sites were located within −1000 bp of the start codon. The highest frequency of occurrence (20%) was beyond −1000 bp, followed by the upstream regions −1 to −200 bp and −201 to −400 bp (17.8% each). The lowest frequencies were observed at −601 to −800 bp and −801 to −1000 bp (13.3% each).
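The positional summary above is a simple binning of TSS coordinates in 200-bp steps. A minimal Python sketch of that tabulation follows; the function names are hypothetical, and the example list mixes four positions quoted in the text with one made-up value (−450), so the printed percentages are illustrative and do not reproduce Table 2.

```python
from collections import Counter
from typing import Dict, List


def bin_label(pos: int, width: int = 200, limit: int = 1000) -> str:
    """Label the upstream bin of a TSS position given relative to the start codon."""
    if pos >= 0:
        raise ValueError("position must be negative (upstream of the start codon)")
    distance = -pos
    if distance > limit:
        return f"beyond -{limit}"
    low = ((distance - 1) // width) * width + 1
    return f"-{low} to -{low + width - 1}"


def bin_percentages(positions: List[int]) -> Dict[str, float]:
    """Percentage of TSSs falling in each upstream bin."""
    counts = Counter(bin_label(p) for p in positions)
    total = len(positions)
    return {label: 100 * n / total for label, n in counts.items()}


# -27, -28, -2804 and -2769 are quoted in the text; -450 is a made-up value.
example = [-27, -28, -2804, -2769, -450]
print(bin_percentages(example))
```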

Common motifs and respective transcription factors in the promoter regions of potassium transporters
The MEME Suite web server was used to identify common elements (motifs) in the set of promoter sequences. Five motifs (motifs 1 to 5) were found, each present in the promoter regions of at least 5 out of 19 (26.32%) potassium transporter genes. The relative positions and spatial distribution of these motifs in the promoter regions are presented in Figure 1. Among the five motifs, motif 1 was the most common promoter motif, residing in 78.95% of the potassium transporter promoter sequences, with 15 binding sites and a width of 15 nucleotides (Table 3). This common motif is a candidate binding site for different transcription factors that regulate the expression of these genes. Motif 1 was found on the negative strand of 52.6% of the genes, whereas the rest resided on the positive strand. The least common motif was motif 5, which was found in only 26.32% of the input sequences, with five binding sites (Table 3). In total, about 48 motif sites were discovered in the promoter regions of the genes encoding potassium transporters in the six crops, with more sites on the positive strand (32) than on the negative strand (16). In the potassium transporter genes of these crops, the distribution of motifs along the promoter regions was almost the same on the positive and negative strands, except for genes isolated from Sorghum bicolor, for which 10 and 5 motifs were discovered on the positive and negative strands, respectively. The motifs were located between −1000 and −30 bp, with a higher concentration between −850 and −30 bp relative to the transcription start sites (TSSs) (Figure 1).
Motif 1 of the potassium transporter genes was analysed further (Figure 2A). It was compared with the registered motifs in publicly available databases such as JASPAR, using the TOMTOM web application and its statistical measure of motif-motif similarity [24], to determine whether the newly discovered putative motif resembles any previously described regulatory motif for transcription factors. The result revealed that motif 1 matched 18 out of 489 known motifs in the JASPAR plant motif database. Based on their estimated statistical significance, the top 10 matched motifs were selected (Table 4). The candidates for binding motif 1 belonged to the C2H2 zinc finger (two), Barley B recombinant/basic pentacysteine (BBR-BPC; three), basic helix-loop-helix (bHLH; one), helix-turn-helix (one) and Dof-type transcription factor families. Motif 1 had the highest resemblance to a motif bound by C2H2 zinc finger proteins, a well-known transcription factor family. The sequence logo of the C2H2 zinc finger transcription factor motif predicted by TOMTOM is presented in Figure 2(B).

Determination of CpG islands in potassium transporter gene promoter regions
In this study, CpG islands were searched for in the promoter and gene body regions of the 19 potassium transporter genes of six crop plants using two algorithms, CLC Genomics Workbench 3.6.1 (CLC bio, 2014) and the CpG island searcher (Takai and Jones, 2002). The former, using restriction enzyme MspI (C/CGG sequence) cutting sites with standard fragment sizes between 40 and 220 bp, revealed CpG islands in the gene body of seven genes (LOC112747245, LOC100784861, LOC103647711, LOC8065633, LOC8077489, LOC102604030 and LOC107063183) and in the promoter region of four genes (LOC112736009, LOC103639705, LOC8065633 and LOC102604030) (Tables 5 and 6). CpG islands were observed in the gene body of genes from all crops except Coffea arabica. In the promoter regions, Arachis hypogaea, Zea mays and Solanum tuberosum each contained one CpG island in one of their genes, while Sorghum bicolor had two CpG islands (Table 6). The second algorithm, the CpG island searcher program (CpGi130) [25], was also used to determine the presence of CpG islands in the promoter and gene body of the potassium transporter genes. This analysis revealed that only one gene sequence, LOC8086366 (S. bicolor), contained a CpG island, located in its gene body, with a length of 833 bp and a GC content of 63% (Table 7). No CpG islands were detected in the promoter region of any of the potassium transporter genes studied here.

Phylogenetic analysis
To illustrate the phylogenetic relationships within the potassium transporter gene family, we constructed a phylogenetic tree based on the alignment of full-length DNA sequences using MEGA 6.06 software and the neighbor-joining (NJ) method [26]. As indicated in Figure 3, the transporter genes were grouped into four classes (I, II, III and IV) with 4, 4, 5 and 6 members, respectively. The unrooted tree, with bootstrap values of 14 to 100, was constructed based on the alignment of the KT/HAK/KUP DNA sequences from coffee (C. arabica), peanut (A. hypogaea), soybean (G. max), maize (Z. mays), sorghum (S. bicolor) and potato (S. tuberosum). The phylogenetic tree indicated that the gene sequences clustered together regardless of their source organism, which implies that the relationships among the genes do not simply follow the relationships among the species.

Discussion
In the present study, transcription start sites were predicted for 19 potassium transporter genes of crop plants. The results revealed that some of the genes contained multiple transcription start sites at different locations in their promoter regions, whereas others had a single transcription start site. Accordingly, 36% of the genes contained more than one transcription start site, whereas 63.16% had only one transcription start site. This result is similar to a previous report that 37.9% of 29 transcriptional regulator ACP92 genes in Herbaspirillum seropedicae contained more than one transcription start site, whereas 62.1% had only one site [30]. Aman Beshir and Kebede [31] also reported that the number of transcription start sites varied from 1 to 3, with the majority of sequences (71.43%) having more than one transcription start site. Triska et al. [32] likewise reported that in eukaryotic genes a promoter contains one or more transcription start sites, which initiate the transcription of a gene. For instance, the MAP kinase gene OsBWMK1 of rice (Oryza sativa) is differentially expressed, producing two variants of OsBWMK1 transcripts with different transcriptional initiation sites [33]. The Oryza sativa Lagging Growth and Development 1 (LGD1) gene generates multiple transcripts that have different transcription start sites [34,35]. In general, multiple transcription start sites could give genes the potential for alternative transcription, producing transcript variants in response to changes in environmental conditions. In addition, high-throughput TSS identification is helpful for accurate prediction of the promoter region [32], as the promoter region comprising the transcription factor binding sites is mostly found within 1 kb upstream of the transcription start site.
The proximal promoter usually possesses dozens of short conserved motifs that serve as transcription factor binding sites, which are responsible for specific transcription regulation [36]. In the present analysis, we identified common motifs with different information content in the promoter regions of the genes. Five motifs, each shared by at least 26.32% of the potassium transporter promoter sequences, were predicted. Among these five motifs, motif 1 was the common promoter motif, residing in 78.95% of the potassium transporter promoter sequences. A previous study reported that three putative conserved motifs, each 50 aa in length, were identified in the OsHAK family [8]. Similar to the present finding, Wang et al. [37] identified five motifs in pear KT/HAK/KUP genes based on MEME analysis. Moreover, Song et al. [38] identified three putative conserved motifs for 41 NtHAK genes. In a similar study, three putative conserved motifs were identified in ZmHAK genes, and all three were present in all the ZmHAK members [39]. In this study, C2H2 zinc finger, BBR-BPC, bHLH, helix-turn-helix and Dof-type zinc finger transcription factors were predicted to bind to the identified motif (motif 1).
Table 6 (partial, as recovered from the text). MspI cutting sites in the promoter regions of the potassium transporter genes
Promoter            MspI cutting site(s)                           CpG island position(s)
[not recovered]     Single cut (at 616)                            -
pro-loc112747245    Single cut (at 330)                            -
pro-loc112736009    Single cut (at 169)                            169
pro-loc100784861    Single cut (at 559)                            -
pro-loc100799414    Single cut (at 559)                            -
pro-loc102661038    -                                              -
pro-loc103639705    Single cut (at 127)                            127
pro-loc103645925    Single cut (at 24)                             -
pro-loc103647711    Single cut (at 463)                            -
pro-loc8065633      Single cut (at 47) and single cut (at 184)     47, 184
pro-loc8075232      Single cut (at 249)                            -
pro-loc8077489      -                                              -
pro-loc8086366      Single cut (at 502)                            -
pro-loc102587332    Single cut (at 419)                            -
pro-loc102594138    -                                              -
pro-loc102604030    Single cut (at 109)                            109
pro-loc107063183    Single cut (at 501)                            -
These transcription factors bind to specific sequences (here, motif 1) in the promoters of their target genes, thereby regulating gene expression and affecting biological phenotypes [40]. In agreement with the present findings, Yirgu and Kebede [30] identified five different transcription factor families binding to common motifs in the promoter regions of Herbaspirillum seropedicae ACP92 genes.
CpG islands (CGIs) are also important regulatory elements in the promoter regions of the genome [41]. In the present study, CpG islands in the promoter region and gene body were predicted using two algorithms. The results suggest that potassium transporter genes are poor in CpG islands in both the promoter region and the gene body. A low number of CpG islands implies less DNA methylation, meaning that there is a low probability of epigenetic effects in the regulation of these genes. Ashikawa [42] predicted CpG islands in plant genomes and showed that 80%, 76% and 69% of rice, Arabidopsis and sorghum genes, respectively, had a CpG island within their gene region. That study also showed that plant genomes contain CpG islands more frequently than the human genome does. In another study, Yirgu and Kebede [30] reported the occurrence of a myriad of CpG islands in the promoter region and gene body of ACP92 genes of the bacterium Herbaspirillum seropedicae.
The phylogenetic tree demonstrates the evolutionary relationships of the members under consideration. The phylogenetic tree constructed in the present study revealed that all K+ transporter genes in the crops studied were grouped into four major clusters. This implies that the gene sequences grouped in the same cluster are more closely related to each other than to the genes falling in other clusters. Amrutha et al. [43] reported that in a phylogenetic analysis of K+ transporters in rice, the genes fell into five major branches. Similar phylogenetic tree results for KT/HAK/KUP transporters have been reported in purple osier willow (Salix purpurea) [44], rice (Oryza sativa) [8], poplar (Populus nigra) [9] and pear (Pyrus communis) [37].

Conclusions
It is known that the potassium transporter gene family, HAK/KUP/KT, enables various plants to withstand salt and drought stress. Our in silico study identified the promoter regions of 19 potassium transporter genes and their regulatory elements with diverse biological functions, including core promoter elements, transcription factor binding sites and CpG islands. The genes identified from different crops have common motifs that serve as binding sites for proteins such as transcription factors. CpG islands were scarce in the gene sequences analyzed in this study. The regulatory motifs were distributed in the promoter region in a clustered manner, which helped us to identify DNA regulatory sequences that may be responsible for spatial and/or temporal-specific gene expression. The accurate detection and demarcation of promoter regions are essential for several reasons, such as improving genome annotation and understanding transcriptional regulation in plants. The results of this in silico analysis expanded our knowledge of the promoters and regulatory motifs of the potassium transporter genes, and this information could support the improvement of the crops under study through genetic engineering or molecular marker-based techniques. The results also provide valuable information for the functional characterization of the potassium transporter genes of the selected crop plants.