Single nucleotide polymorphism analysis in plastomes of eight Catharanthus roseus cultivars

Abstract
Catharanthus roseus, an important medicinal plant, is known for the production of various pharmaceutical compounds, including the anti-tumour drugs vinblastine and vincristine. This ornamental plant is also widely known for its flowers of different colours. Its varieties are identified on the basis of the morphological characteristics of the flowers, including petal and flower eye colour. Morphological characterization cannot be performed before the flowering stage, which is a major obstacle for consumers, since most of the sales associated with its medicinal value occur at the vegetative stage. In the present study, we utilized high-throughput, next-generation sequencing to detect chloroplast single nucleotide polymorphisms (SNPs) that are unique to each variety for molecular characterization. The total genomic DNA of eight C. roseus varieties was sequenced using the Illumina HiSeq 2000 platform. The alignment of the resulting sequences to the chloroplast reference genome showed the presence of specific SNPs for all eight varieties. Intravarietal SNPs were also found, which confirmed the applicability of the heteroplasmy theory to the chloroplast genome of this species. Thus, this investigation provides valuable insights into the molecular characterization of C. roseus, especially at the vegetative stage.


Introduction
The tropical plant Catharanthus roseus, also known as Vinca rosea or Madagascar periwinkle, is a flowering species that belongs to the family Apocynaceae. It originated in Madagascar and has spread widely to warmer areas, including Saudi Arabia [1]. This plant has been extensively studied for its pharmacological content and medicinal importance. It has traditionally been used to treat diabetes, diarrhea, scurvy, hypertension, helminth infection, ulcers, chronic wounds, and loss of memory [2]. The past decade has witnessed increased use of metabolites obtained from this plant as anti-cancer drugs, with yearly revenue of 25 to 300 million U.S. dollars [1]. C. roseus is one of the major sources of terpenoid indole alkaloids (TIAs), which are secondary metabolites synthesized by the plant as a means of defense against insects and pests [3]. Approximately 130 TIAs with pharmaceutical properties [4] are produced by the plant, including the powerful species-specific anti-tumour drugs vinblastine and vincristine [5]. These drugs are used in chemotherapy to treat a broad range of cancers, which is attributable to their distinctive mechanisms of blocking mitosis [6].
Apart from its anti-cancer properties, C. roseus is a widespread ornamental plant owing to its varied and attractive flower colours and its excellent ability to tolerate dry and nutrient-deficient conditions [7]. Breeding programs have produced a broad spectrum of flower petal colours, including white, pink, scarlet, reddish-orange, peach, and purple, with flower eyes of various shades such as white, red, pink, deep pink, pale-yellow, and many more [4]. Many studies have concluded that varieties with coloured flowers yield more alkaloids than those producing colourless flowers. For example, C. roseus var rosea (pink) has a higher alkaloid content than C. roseus var alba (white) [8][9][10]. The major obstacle is the inability to characterize the varieties morphologically before the flowering stage. This causes problems for consumers, since most of the sales associated with the plant's medicinal features take place at the vegetative stage. This limitation has prompted investigators to use molecular markers for characterization of the varieties during the past decade. Although molecular markers based on fragment data, such as RAPD, ISSR, SSR, EST-SSR and AFLP, have many advantages, they also suffer from drawbacks including limited repeatability and variation in fragment size accuracy [11]. These limitations have been overcome with the recent revolution in molecular markers and their detection platforms. In this regard, the emergence of next-generation sequencing has facilitated the use of single nucleotide polymorphisms (SNPs) as a recently developed molecular marker [12].
A single nucleotide polymorphism is a variation in a single nucleotide at a specific position in the genome of individuals. SNPs have received much attention because they are highly accurate, abundant in genomes, and well suited as molecular markers for high-throughput detection platforms [13]. These advantages make SNP detection a powerful approach to discover relationships between varieties and identify the genetic basis of commercially important traits [14]. The publication of the C. roseus chloroplast genome [15] provided a reference genome with which to examine chloroplast SNPs that are unique to each variety and to determine phylogenetic relationships among varieties. SNPs in the chloroplast genome can be inter- or intravarietal. The occurrence of intravarietal SNPs is referred to as heteroplasmy and is common in chloroplast genomes [16].
The present study investigated (1) the reliability of using chloroplast SNP variants to identify C. roseus varieties, (2) the applicability of the heteroplasmy theory to the chloroplast genome, and (3) the phylogenetic distances among varieties.

Plant material and sampling
The present study included eight varieties of C. roseus obtained from a plant nursery in the Makkah region, Jeddah, Saudi Arabia. Each variety is characterized by a distinct colour of the petal, flower eye, and center (Table 1, Figure 1). Flower colours have been described in detail previously [17,18]. Samples were kept in the greenhouse under normal conditions, which included 14 h of light per day at 22 °C and 80% humidity, and were irrigated with full-strength Hoagland solution for 2 weeks until they started blooming. Samples from a single plant of each of the eight varieties were harvested in three replicates, flash-frozen in liquid nitrogen, and stored at −80 °C for genomic DNA extraction. Each sample consisted of about 50 mg of fresh, young leaf tissue.

Genomic DNA extraction and purification
Total DNA was extracted from leaf tissues using the DNeasy Plant Mini Kit (no. 69104) purchased from Qiagen. To achieve maximum concentration, DNA from the three samples of each variety was pooled and concentrated in a DNA SpeedVac concentrator. A Qubit 2.0 fluorometer was used to evaluate DNA concentration. High-quality DNA with a concentration between 20 and 40 ng/mL and an OD ratio (260/280) close to 1.8 was used for further analysis.

Genome sequencing, reference assembly and SNP analysis
Total genomic DNA samples were transported on dry ice to the Beijing Genomics Institute (BGI), Shenzhen, China, for deep sequencing using the Illumina HiSeq 2000 platform. The raw data received for the eight varieties consisted of about 70 million 100 bp paired-end reads. Before the raw data were analyzed, adapter sequences were trimmed and reads with low-quality bases were filtered. High-quality reads (quality score >20) from each variety were then assembled separately against the C. roseus chloroplast reference genome (NCBI accession no. KC561139) using the CLC Genomics Workbench 3.6.5 software with parameters of length fraction ≥ 50% and identity ≥ 80%. The same software was used to identify SNPs for each chloroplast genome. The SNP detection parameters were average coverage = 500, central base quality ≥ 20, and SNP frequency ≥ 35%.
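The detection thresholds listed above can be read as a per-site filter. The following Python sketch is illustrative only: it is not the CLC Genomics Workbench implementation, the names `SiteCall` and `passes_filters` are hypothetical, and the example calls are invented. Treating the reported coverage value of 500 as a minimum cutoff is an assumption made for the sake of the example.

```python
from dataclasses import dataclass

# Thresholds quoted in the text. Using 500 as a *minimum* coverage is an
# assumption; the text only reports "average coverage = 500".
MIN_COVERAGE = 500
MIN_CENTRAL_QUALITY = 20
MIN_SNP_FREQUENCY = 0.35


@dataclass
class SiteCall:
    """Hypothetical record for one candidate SNP site."""
    position: int
    coverage: int         # reads covering the site
    alt_reads: int        # reads carrying the non-reference base
    central_quality: int  # Phred quality of the variant base

    @property
    def alt_frequency(self) -> float:
        return self.alt_reads / self.coverage if self.coverage else 0.0


def passes_filters(call: SiteCall) -> bool:
    """Keep a site only if it meets all three thresholds."""
    return (
        call.coverage >= MIN_COVERAGE
        and call.central_quality >= MIN_CENTRAL_QUALITY
        and call.alt_frequency >= MIN_SNP_FREQUENCY
    )


# Invented example data, not taken from the study.
example_calls = [
    SiteCall(1200, 6000, 5900, 35),   # high frequency, good quality
    SiteCall(4500, 5500, 1100, 38),   # 20% alternate reads: below 35%
    SiteCall(9800, 300, 290, 36),     # coverage below 500
    SiteCall(15000, 7000, 2800, 18),  # base quality below 20
    SiteCall(20000, 6500, 2600, 30),  # 40% alternate reads
]

kept = [c.position for c in example_calls if passes_filters(c)]
print(kept)
```

Sites retained with an alternate-allele frequency between the 35% threshold and 100% carry both reference and alternate bases in the reads, which is the pattern described later as heteroplasmy.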

Multiple sequence alignment and phylogenetic analysis
Fragments from nine chloroplast genomes of C. roseus varieties (eight from this study and one reference from GenBank) were aligned using the CLC Genomics Workbench. The fragments were selected according to the presence of SNPs. A maximum likelihood (ML) phylogenetic tree of the chloroplast SNPs was constructed using the CLC Genomics Workbench to estimate the genetic distances among the varieties.

Results and discussion
Mapping of reads to reference genome
The total reads of all eight C. roseus varieties ranged from 73.06 to 75.40 (Figure S1). The average coverage of chloroplast reads ranged from 5,387.75 to 7,491.05.

Variety-specific SNPs and phylogenetic relationship
Although extensive research has been carried out on C. roseus, none has used SNPs to characterize varieties. All previous studies on C. roseus focused on molecular markers based on fragment data, such as RAPD, ISSR, SSR, EST-SSR and AFLP [17][18][19][20][21][22][23]. High-throughput, next-generation sequencing technology, such as Illumina HiSeq, has facilitated the discovery of SNPs in different genomes and eliminated the drawbacks of low-throughput approaches [12]. The use of organellar SNPs has increased tremendously [24,25] owing to the relatively small size of the chloroplast genome, its uniparental inheritance, and its repeatability within a single cell as compared to the nuclear genome [26]. A few studies have employed the chloroplast genome to detect SNP variants [16][27][28][29][30][31]; however, these studies detected only a low level of genetic polymorphism because the mutation rate in chloroplast genomes is lower than in the nuclear and mitochondrial genomes. This low mutation rate has acted as a major barrier to the detection of chloroplast markers [32,33].
In this study, the number of chloroplast SNPs ranged from 24 to 84 per variety, with a total of 429 SNPs among all eight varieties (Supplemental Table S1). Unique SNPs varied from 1 to 36 per variety, with a total of 82 unique SNPs across the varieties (Figure 2, Table 3). The highest numbers of total (84) and unique (36) SNPs were recorded for the PW variety, whereas the lowest numbers of total (24) and unique (1) SNPs were recorded for the ED variety (Figure 2). Approximately 40% of the SNPs were heteroplasmic, that is, both the reference and the alternate nucleotide were present in the reads of a given variety. The number of shared SNPs among varieties was 347: 51 SNPs were shared by two varieties, 31 by three varieties, 7 by four varieties, 13 by five varieties, 5 by six varieties, 3 by seven varieties, and 1 by all varieties. SNPs were located in 193 distinct locations within the chloroplast genome, with 30 in intragenic regions, 39 in introns, 89 in intergenic spacers (IGS), 14 in rRNA-related sequences and 21 in tRNA-related sequences. The intragenic SNPs resulted in 10 synonymous (S) and 20 non-synonymous (NS) substitutions (Figure 3).
In general, the observations of the present study are consistent with previous data in terms of the small number of SNPs detected in chloroplast genomes. However, our results revealed a higher number of SNPs than the most recent analogous study [16]. That group compared nine date palm cultivars for organellar SNPs but found only 30 chloroplast SNPs, including 15 unique SNPs in seven of the nine cultivars [16]. In the current study, we identified a molecular signature of variety-specific chloroplast SNPs for all eight C. roseus varieties (Figure 2). A possible explanation is that C. roseus varieties are wild germplasms with a high rate of genetic variation as compared to cultivated crops that have been subjected to domestication and homology processes [34]. Another possible explanation is that the stringent methodology used by [16] reduced the number of SNPs detected.
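The tallies reported above can be cross-checked with simple arithmetic. The short Python sketch below uses only the counts given in the text. It assumes that the figures "shared by two varieties", "by three varieties" and so on count SNP positions, so that each position contributes once per variety carrying it; that reading is an interpretation, although the totals reconcile under it.

```python
# Number of SNP positions shared by k varieties (k -> count), from the text.
positions_by_sharing = {2: 51, 3: 31, 4: 7, 5: 13, 6: 5, 7: 3, 8: 1}
unique_snps = 82   # variety-specific SNPs
total_snps = 429   # SNP calls summed over all eight varieties

# Each shared position is counted once per variety that carries it.
shared_positions = sum(positions_by_sharing.values())
shared_calls = sum(k * n for k, n in positions_by_sharing.items())
distinct_positions = unique_snps + shared_positions

# Locations by genomic context, from the text.
by_region = {
    "intragenic": 30,
    "intron": 39,
    "IGS": 89,
    "rRNA-related": 14,
    "tRNA-related": 21,
}

print(shared_calls)                   # 347
print(unique_snps + shared_calls)     # 429
print(distinct_positions)             # 193
print(sum(by_region.values()))        # 193
```

Under this reading, the 347 shared SNP calls, the 429 total calls and the 193 distinct locations are mutually consistent, and the 10 synonymous plus 20 non-synonymous substitutions account for the 30 intragenic locations.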
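The notion of a variety-specific SNP used above can be expressed as a set operation: a SNP is specific to a variety if no other variety carries the same change at the same position. The sketch below is a minimal illustration under that assumption. The function name `variety_specific` is hypothetical, and the positions and bases are invented rather than taken from Supplemental Table S1.

```python
from collections import Counter


def variety_specific(snps_by_variety):
    """Return, for each variety, the SNPs that occur in no other variety.

    snps_by_variety maps a variety label to a set of (position, alt_base)
    tuples.
    """
    occurrences = Counter(
        snp for snps in snps_by_variety.values() for snp in set(snps)
    )
    return {
        variety: {snp for snp in snps if occurrences[snp] == 1}
        for variety, snps in snps_by_variety.items()
    }


# Invented example calls for three variety labels.
calls = {
    "PW": {(101, "A"), (250, "T"), (999, "G")},
    "ED": {(250, "T"), (640, "C")},
    "CO": {(250, "T"), (999, "G"), (1500, "A")},
}

signatures = variety_specific(calls)
for variety in sorted(signatures):
    print(variety, sorted(signatures[variety]))
```

A variety can be identified at the vegetative stage only if its signature set is non-empty, which is why the finding of at least one unique SNP in every variety matters.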

Location of organellar single nucleotide polymorphism
Most SNP locations fell in non-coding regions, chiefly intergenic spacers and introns, whereas only 30 were intragenic. These results agree with the view that substitutions accumulate rapidly in non-coding sequences owing to the lack of direct functional constraints. Compared with non-coding regions, coding regions are characterized by conservation and a low rate of substitution [35]. A noteworthy observation in the present study was that non-synonymous SNPs (nsSNPs) were located in six different genes, namely ycf2, ycf15, rpl2, rps12, rpl23 and ndhB. The first two genes are hypothetical chloroplast protein-coding regions with unknown functions. The genes ycf1 and ycf15 contained 12 and 4 nsSNPs, respectively. This is in accordance with studies suggesting that the ycf1 and ycf15 genes display high rates of mutation, which can be attributed to their being targets of positive natural selection [36,37]. Only one nsSNP was found in each of the rpl2, rps12 and rpl23 genes, which encode ribosomal proteins involved in the assembly of ribosomal subunits. The ndhB gene also contained one nsSNP, in the PW variety. This gene encodes NADH dehydrogenase subunit 2, which has a role in photosynthesis. A previous study by [38] in tobacco demonstrated a moderate decline in photosynthesis, leading to growth retardation, in ndhB-inactivated plants. Interestingly, the PW variety has a lower rate of photosynthesis and growth than the other varieties [8]. It can thus be hypothesized that the reduced photosynthesis and growth of the PW variety may be related to a deleterious substitution in the ndhB gene.
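The distinction between synonymous and non-synonymous substitutions can be illustrated by translating a codon before and after the change. The sketch below is not the procedure used in the study. It assumes the standard amino-acid assignments of the genetic code, takes codons on the coding strand, and ignores RNA editing. The function name `classify_substitution` and the example codons are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
# (first base changes slowest, third base fastest).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon: str, offset: int, alt_base: str) -> str:
    """Classify a single-base change at codon[offset] (offset 0, 1 or 2)."""
    mutated = codon[:offset] + alt_base + codon[offset + 1:]
    same_residue = CODON_TABLE[codon] == CODON_TABLE[mutated]
    return "synonymous" if same_residue else "non-synonymous"


# Invented examples: CTT -> CTC keeps leucine; GAT -> GAA changes Asp to Glu.
print(classify_substitution("CTT", 2, "C"))
print(classify_substitution("GAT", 2, "A"))
```

Third-position changes are often synonymous because of codon degeneracy, which is one reason coding regions can tolerate some substitutions without altering the protein.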

Single nucleotide polymorphism hotspots
Another obvious finding was the uniform distribution pattern of SNPs in the SNP hotspots of the chloroplast genome. Hotspots were found in 10 different positions within the chloroplast genomes, with sizes ranging from 40 to 520 bp. All hotspots occupied non-coding regions, including introns, IGS, pseudogenes, and tRNA- and rRNA-related sequences. Eight of them were shared among varieties, and only two hotspots were present in single varieties. The number of shared SNPs in hotspots differed from one variety to another (Table 4, Supplemental Figure S2.1-S2.33). In general, even small clusters of four SNPs are considered not to be randomly distributed [39]. The reason behind the formation of SNP hotspots remains elusive, and several hypotheses have been proposed. First, balancing selection can create regions with non-random, increased variability [40]. Second, some regions of the genome are relatively highly conserved, causing other regions to appear to have above-average variability [39].
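One simple way to flag candidate hotspots is to group sorted SNP positions into clusters and keep those with at least four SNPs, in line with the cluster size mentioned above. The sketch below is an assumption-laden illustration and not the method used in this study: the function name `find_hotspots` and the `max_gap` parameter (the largest distance allowed between neighbouring SNPs in a cluster) are hypothetical, and the positions are invented.

```python
def find_hotspots(positions, max_gap=100, min_snps=4):
    """Group SNP positions into clusters of neighbouring sites.

    Consecutive positions no more than max_gap bp apart belong to the same
    cluster. Clusters with at least min_snps sites are returned as
    (start, end, snp_count) tuples.
    """
    hotspots = []
    cluster = []
    for pos in sorted(set(positions)):
        if cluster and pos - cluster[-1] > max_gap:
            if len(cluster) >= min_snps:
                hotspots.append((cluster[0], cluster[-1], len(cluster)))
            cluster = []
        cluster.append(pos)
    # Close the final cluster after the loop ends.
    if len(cluster) >= min_snps:
        hotspots.append((cluster[0], cluster[-1], len(cluster)))
    return hotspots


# Invented positions, not taken from the study.
snp_positions = [
    120, 5000, 5030, 5075, 5110, 5400,
    20000, 20010, 20020, 20025, 20300, 90000,
]
print(find_hotspots(snp_positions))
```

The choice of `max_gap` strongly affects which regions are reported, so any threshold of this kind should be justified against the expected SNP density of the genome.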

Chloroplast INDELS
Several deletion/insertion polymorphisms (DIPs) were generated during the assembly of each of the eight C. roseus varieties. They ranged from 3 to 16 DIPs per variety, with a total of 81 DIPs across varieties. Deletions and insertions represented 63% and 37% of the total DIPs, respectively. All varieties except Deep Pink had at least one unique DIP, with a total of 26 unique DIPs (Table 5). DIPs were present in 44 different locations, mostly in non-coding regions; only nine were in coding regions, where they resulted in frame-shift mutations. This is attributed to the fact that the chances of substitution are higher close to indels in eukaryotes, as indels enhance mutation in the surrounding sequences [41,42]. However, the existence of DIPs in coding regions results in deleterious effects on the organism [42]. DIPs in coding regions accounted for about 20.4% of the 44 DIP locations in this study.
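The DIP proportions can be checked from the counts given in the text. In the Python sketch below, the deletion and insertion counts are back-calculated from the rounded percentages and are therefore approximate. The reported 20.4% appears to correspond to nine coding locations out of 44 in total, which is an interpretation rather than a statement from the original analysis.

```python
total_dips = 81
deletion_share = 0.63       # rounded percentage from the text
dip_locations = 44
coding_locations = 9

# Approximate counts implied by the rounded percentages.
approx_deletions = round(total_dips * deletion_share)
approx_insertions = total_dips - approx_deletions

# Two possible readings of the coding-region proportion.
coding_share_of_all = coding_locations / dip_locations
coding_vs_noncoding = coding_locations / (dip_locations - coding_locations)

print(approx_deletions, approx_insertions)   # 51 30
print(f"{coding_share_of_all:.4f}")          # 0.2045
print(f"{coding_vs_noncoding:.4f}")          # 0.2571
```

Only the first reading (9 of 44, about 20.45%) is close to the reported value; the ratio of coding to non-coding locations (9 to 35) would be about 25.7%.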

Phylogenetic analysis
A maximum likelihood (ML) phylogenetic tree was generated for the eight C. roseus varieties based on chloroplast SNPs (Figure 4). The results indicated that the Patricia White variety was the least similar to the other varieties. No relationship was detected between petal colours and the ML tree topology. On the other hand, the varieties with moderate TIA accumulation (First Kiss Peach and First Kiss Polka Dot) were grouped together in the tree. High-TIA varieties were sub-grouped in the tree according to other traits: (CO and BP), (VR and ER), and (ED and REF). The most closely related varieties were Cooler Orchid and Blue Pearl, and the most distinct were Victory Red and Patricia White.
The ML tree of chloroplast SNPs did not reflect flower petal colour, since varieties with similar flower colours were genetically distant in the tree. However, the ML tree provided some information toward partial separation of varieties with different center colours.

Intravarietal single nucleotide polymorphism or heteroplasmy
Heteroplasmy was present in the chloroplast genomes of many C. roseus varieties in the present study (Supplemental Table S1). The chloroplast heteroplasmy is a phenomenon that has been detected in many flowering plants, for example, Pelargonium [43], Gossypium [44], Oenothera [45], Medicago [46], Actinidia [47], Cynomorium [48], Passiflora [49] and Phoenix dactylifera [16]. However, a concern has been raised about the commonness of this phenomenon in chloroplast genomes. As more molecular studies on organelle genomes were conducted, more evidence was gathered to support and/or prove chloroplast heteroplasmy to be a common and regular phenomenon. Heteroplasmy occurs in two circumstances, either biparental or uniparental chloroplast inheritance. In the case of biparental inheritance, both parents transmit their chloroplasts to the zygote resulting in heteroplasmy. Heteroplasmy, in uniparental inheritance, is caused by either mutation in chloroplast genomes after zygote formation or variation in parental chloroplast owing to incomplete vegetative segregation [50]. In C. roseus, heteroplasmy in chloroplast genome is a result of uniparental inheritance [15].

Conclusions
Overall, this study detected chloroplast SNPs and DIPs and found them to be a reliable means of identifying C. roseus varieties. The study also confirmed the applicability of the heteroplasmy theory in this species, thereby providing additional evidence for it. The study demonstrated a limited phylogenetic relationship among varieties, with low support values. We believe that the study of the mitochondrial genome of C. roseus will add to our understanding of the utility of SNPs and DIPs in distinguishing closely related varieties. It will also shed light on the mechanism of heteroplasmy. However, the lack of information regarding the C. roseus mitochondrial genome limits further examination of that genome for SNPs and DIPs. In addition, research could be conducted on the nuclear genome, which has a higher prevalence of SNPs and could therefore be suitable for constructing a high-resolution ML phylogenetic tree and obtaining a better picture of relationships among C. roseus varieties.