Characterization and novel Est-SSR marker development of an important Chinese medicinal plant, Morinda officinalis How (Rubiaceae)

Abstract Morinda officinalis How (Rubiaceae) is well known as one of the four most popular medicinal plants in the southern region of China. In order to assess its genetic diversity and assist the development of its molecular assisted-selection breeding in China, we constructed the transcriptome database of M. officinalis using fresh leaf material and developed novel expressed sequence tag–simple sequence repeat (EST-SSR) markers for this important medicinal plant species. Our study identified 8064 potential EST-SSRs from 6668 unigenes. For perfect repeat motifs, the mononucleotide repeats were most abundant (4221; 52.34%), followed by the trinucleotide repeats (1588; 19.69%), and the dinucleotide ones (1040; 12.90%). The seven most commonly occurring repeat motifs were A/T (4661), AG/CT (588), AAG/CTT (408), AC/GT (353), ATC/ATG (338), AT/AT (296) and AGG/CCT (277). Of a total of 5605 SSR primer sets that were successfully designed de novo, 100 were selected for validation analysis. There, 24 primer sets were polymorphic across the 96 M. officinalis individuals collected from four different populations in Guangdong province. Based on cross-species transferability test, 19 and 22 primer sets produced successful amplification in M. umbellata subsp. obovata and M. parvifolia, respectively. These novel EST-SSR markers with high cross-species transferability between M. officinalis and its closely-related species could serve as aids to future studies in genetic diversity, conservation, assisted-selection breeding and genes associated with the medical uses of the genus.


Introduction
Morinda officinalis How (Rubiaceae) is a perennial vine plant, naturally found in the wild in the Indo-Malaysian region, i.e. in China, India, Japan, Philippines, Sri Lanka, Thailand, Vietnam and other countries [1,2]. Its long succulent roots were harvested and air-dried to be made into Chinese traditional medicine, Bajitian, currently known as one of the four major southern medicinal materials in China. Nowadays, the species is widely applied in the treatment of male impotence, female infertility, irregular menstruation, hypogastriccold pains and rheumatoid arthritis [3,4].
Today, mass cultivation of M. officinalis has been promoted in many places of the southern regions of China, yet its wild populations have been greatly threatened due to unsustainable harvesting activities [2]. A recent assessment carried out by Zhang et al. [8] reported the natural population of M. officinalis to be on the edge of extinction, especially in Guangdong Province. It was listed as a 'Category III' protected species in the Chinese National Key Protected Wild Plants [9]. Although researches on M. officinalis have mostly centered on its pharmacological properties, He et al. [10] proposed that M. officinalis growing in different regions exhibited variable phenotypic characters in its chemical properties, which was further supported by Zhu et al. [11] and Zhou et al. [12]. More and more studies have begun to pay attention to its molecular mechanism of action. For example, Ding et al. [13] used random amplified polymorphism DNA (RAPD) markers to investigate the genetic diversity of 5 populations of M .officinalis in Guangdong Province of China, and the intergenic spacer region psbA-trnH, and internal transcribed spacer (ITS) were further applied in the identification of specific cultivars of Bajitian [14,15]. Recently the complete chloroplast genome structure of this medicinal plant was characterized [16]. In order to design suitable conservation plans for the natural populations and identify superior breeding materials among these cultivars, an appropriate fingerprinting technique is required in looking into the intra-specific variation across M. officinalis natural populations and cultivars, as well as identifying gene pools and producing useful markers for future plant breeding programs.
At present, the advanced development in life science technologies has allowed researchers to obtain complete genomes of non-model plants, providing convenience to identify and work on regulating genes toward secondary metabolism biosynthetic pathways, thus obtaining information of their potential active ingredients. Although pharmacologically related studies are well-documented for M. officinalis, to date, there is no transcriptomic work reported for this valuable medicinal plant. Therefore, in this study, we aimed to establish the first transcriptome library for M. officinalis using next generation sequencing (NGS). In addition, we took the initiative to develop novel expressed sequence tag simple sequence repeat (EST-SSR) primers for M. officinalis, as well as to test on their cross-transferability between its closely-related species, M. parvifolia and M. umbellata subsp. obovata.

Plant materials
For the transcriptomic study, fresh leaf samples were collected from an ex-situ M. officinalis tree planted in the greenhouse of Sun Yat-sen University, originating from a natural population in Deqing (DQ), Guangdong province. For the EST-SSR study, fresh leaf samples were collected from 96 individuals of four M. officinalis natural populations (Conghua (CH), n ¼ 26; Dianbai (DB), n ¼ 19; DQ, n ¼ 26; Fengkai (FK), n ¼ 25), whereas fresh leaf samples from six individuals were collected for M. parvifolia and M. umbellata subsp. obovata, both originating from natural populations in FK, Guangdong province. All collected fresh leaf samples were kept in zip-lock bags filled with silica gel prior to transfer back to the laboratory. Voucher specimens were deposited at the Herbarium of Sun Yat-sen University (Supplemental Table S1).

DNA extraction
A total of 150 g fresh leaf samples were pulverized into fine powder using mortar and pestle, with the aid of liquid nitrogen. DNA extraction was carried out using the cetyltrimethylammonium bromide (CTAB) method [17]. The extracted DNA was quantified using Nano-drop TM (Thermo Fisher, USA) and stored at À20 C until further use.

RNA extraction
A total of 150 g fresh leaf samples were pulverized into fine powder using mortar and pestle, with the aid of liquid nitrogen. Total RNA was isolated based on a modified CTAB method. During the extraction process, residual DNA was treated with DNase I (Takara Biotechnology, Japan) for 45 min at 37 C, and the quality and concentration of the extracted total RNA was visualized under 2.0% agarose gel, quantified using Nano-drop TM (Thermo Fisher, USA).

De novo assembly and unigene annotation
The cDNA library was constructed using TruSeq RNA Library Preparation Kit version 2.0 (Illumina, USA) based on the manufacturer's protocol and sequenced using HiSeq 2000 sequencing system (Illumina, USA). Amplicons were sequenced from both paired ends to obtain 125-bp short reads. Raw reads were filtered by eliminating reads containing 'N' bases or more than 10% bases with a Q-value less than 20 using NGS toolkit V 2.3.3 [18]. Subsequently, all the clean reads were de novo-assembled to construct contigs using Trinity v 2.4.0 [19] based on its default settings; then the contigs were connected to form transcripts. Paired-end reads were incorporated to fill up the gap between these transcripts in order to obtain unigenes that were later deposited to the NCBI GenBank as short read archive (SRA; Accession no. SRP149778).
Unigenes were annotated by matching against the Gene Ontology (GO) database (http://www.geneontology.org), based on the condition where by the e-value cut-off was 1.0e À5 , and provided a minimum of 20 best hits, using Diamond software [20]. Web Gene Ontology (WEGO) plot annotation was conducted to perform GO functional classification and to illustrate the distribution of gene functions at a macro-level [21].

Est-SSR primers development
Short sequence repeats (SSR) with more than five repeat motifs within the assembled unigenes were identified using MIcroSAtellite identification tool (MISA, http://pgrc.ipk-gatersleben.de/misa/). A total of 5605 primer sets were designed successful by MISA (Supplemental Table S2). The primers were designed by MISA based on the following parameters: (1) primer length ranging between 18 and 23 bp, with 21 bp as optimum; (2) amplicon size ranging between 100 and 300 bp; (3) an optimum annealing temperature of 52-65 C; and (4) a percentage of GC content at 40-60%, with 50% as optimum. One hundred SSRs were randomly selected from the 5605 primers to design SSR-specific markers using Primer3 (http://pgrc. ipk-gatersleben.de/misa/primer3.html) by incorporating the online Perl scripts [22]. The selected 100 primer sets were synthesized (Sangon Biotech (Shanghai) Co., Ltd., Guangzhou, China) for polymerase chain reaction (PCR) amplification in the subsequent experiments.

Results and discussion
In China, the cultivation of M. officinalis was first reported in Deqing of Guangdong province and Nanjing of Fujian province as a renewable medicinal resource. Yet its wild populations were greatly threatened by artificial harvesting activities [26]. Previous studies of M. officinalis in part of the southern region of China based on RAPD and ISSR markers, and ITS and chloroplast sequencing showed the apparent genetic structure in M. officinalis [26,27]. Researches on chemical compounds showed that populations collected from different regions always demonstrated obvious variations in their morphological characters and chemical compositions [28]. These researches also showed that the wild resources of M. officinalis may contain ample useful chemical compounds, which calls for urgent protection of its natural populations and scientific researches on its genetic resources. Therefore, by utilizing the high-throughput sequencing technology, we sequenced the transcriptome of M. officinalis, producing tens of thousands of genetic sequences, which could be helpful for studies on gene functions [29]. These novel SSR markers could be greatly useful in further studies on the genetic diversity of M. officinalis to lay out plans for the protection of the species.

Characterization of Est-SSRs
This is the first report on the transcriptome and development of the working EST-SSR primer sets for M. officinalis using NGS. A total of 6668 selected unigenes containing SSRs were identified from 51049 examined sequences ( Table 1). The average length and N 50 value of the unigenes (801 bp and 1587 bp) were at least comparable to other species of the same family, such as Coffea canephora (662 bp and 868 bp) and C.arabica (663 bp and 832 bp) [30], thus indicating that the transcript quality was satisfactory in this study. Comparing to other species of Rubiaceae, the EST-SSRs frequency for M. officinalis was 1 per 5.084 kb (8064 SSRs in 40.9 Mb), which was higher than C. canephora (1/ 7.73 kb), and lower than Gardenia jasminoides (1/ 3.516 kb) [31,32]. A total of 8064 SSRs were identified from 6668 selected unigenes (Table 1). Among the identified SSRs, 7570 were perfect repeats (Supplemental Table  S2), and 494 were imperfect repeats ( Table 2). A total of 457 SSRs were perfect compound repeats. Among these perfect SSRs, mononucleotide repeats (52.34%) were the most common, followed by trinucleotide (19.69%), dinucleotide (12.90%), tetranucleotide (1.98%), pentanucleotide (0.79%) and hexanucleotide (0.50%) repeats. Among all SSRs, the mononucleotide repeat units also had the biggest percentage number of repeat units and the trinucleotide repeat units ranked second with a percentage of 22.81% (1839), which was 1.47 times when compared to the dinucleotide motifs (15.45%) ( Table 3). The repeat unit that came with the highest frequency was 10 (27.96%); the longest repeat motif was the mononucleotide A, with a total of 42 repeating bases (Table 3). Three trinucleotide motifs, AAG/CTT (408), ATC/ATG (338) and AGG/CCT (277), ranked three, five and seven in the greatest copy numbers of motif type. Yet it is known that shifts in trinucleotide SSRs, which are in multiple of three nucleotide bases, do not cause changes in EST reading frames as codons are represented with three nucleotide bases. Therefore, the consistent multiplication of trinucleotide SSRs without fail in M. officinalis indicated a positive selection pressure for the SSRs in genic sequences [33].

Categories and annotations of unigenes containing SSRs
In this study, 6046 unigenes of the 6668 unigenes were annotated in the GO database and divided into    three major categories according to their functions, namely biological process, cellular component, and molecular function. In the biological process category, 25 different processes were identified, whereby most unigenes were involved in the cellular processes (n ¼ 6041; 8.06%), followed by the cellular component organization or biogenesis (n ¼ 5924; 7.90%), and metabolic process (n ¼ 5887; 7.85%) (Figure 1a). For the cellular component category, 13 different cellular components were recorded and most unigenes were involved in the cell (n ¼ 5515; 16.06%), followed by the cell parts (n ¼ 5514; 16.06%) and organelles (n ¼ 5190; 15.11%) (Figure 1b). In the molecular function category, nine different functions were recorded, whereby most unigenes were involved in catalytic activities (n ¼ 5459; 33.45%), followed by binding (n ¼ 4221; 25.87%) and transporter activity (n ¼ 3237; 19.84%) (Figure 1c). When we matched the M. officinalis transcriptome against the GO database, although some returned with low match identity, most of the genes were able to be correctly annotated. This showed that M. officinalis has low probability of presence of rare or unknown genes. As bajitian, the root of M. officinalis, is a raw medicinal material. We reported the presence of gene synthases, such as b-fructofuranosidase (EC:3.2.1.26), which is involved in inulin-type oligosaccharide, galactosum and sucrose (PATH: ko00500) pathways in metabolism. However, we seemed not able to detect several pharmacologically-related genes which are responsible for the production of chemical compounds such as polysaccharide, anthraquinones and terpenoids in bajitian [6,10,34]. As bajitian is processed from M. officinalis roots, whereas our specimen for the transcriptome sequencing was from its leaves, although the chances are small, it is possible that the genes responsible for these chemical compounds may be specific to the roots [35]. Alternatively, the differences in reported gene synthases could be caused by phenotypic changes, whereby a comprehensive comparison of bioactive components and pharmacological activities between different geographical origins could perhaps be able to link components to the related genes in this species [36].

Polymorphism and cross-species transferability
In this study, a total of 5605 primers were designed from the 6668 unigene sequences that contained SSRs. However, we randomly selected 100 primer sets from the list to check on their capability of successful amplification. As a result, 59 of them were recorded with successful amplification; while 24 of them showed polymorphisms in all four populations (Table 4). Often, EST-SSR primer sets failed to amplify due to a few reasons such as existing of large introns, primer sequences designed across splice sites, poor sequence quality, or assembly errors. The primer sequences used in this study were deposited in the NCBI GenBank under the accession numbers: MF496207-MF496230 (Supplemental Table S3). Based on this study, the linkage disequilibrium (LD) test showed no significantly linked pairs of primers after a Bonferroni correction, so these primers could be used alone to test the populations genetic parameters. The number of alleles (A) ranged from three to ten in the population DB, from three to eleven in the population FK and CH, and from Four to twelve in the population DQ. The Ho and He values ranged, respectively, from 0.211 to 1 and from 0.483 to 0.817 in population DB, from 0.400 to 1 and from 0.554 to 0.873 in population FK, from 0.115 to 1 and from 0.629 to 0.857 in population CH, and from 0.385 to 1 and from 0.576 to 0.878 in population DQ. HW tests showed that three populations of FK, CH and DQ significantly deviated from the Hardy-Weinberg equilibrium in most loci, and DB had most loci according to the Hardy-Weinberg equilibrium (Table 4).
For the cross-species transferability test, 19 primer set were able to amplify in both M. parvifolia and M. umbellata subsp. obovata (Table 4), while three primer sets (MO57, MO60 and MO63) were able to amplify in M. parvifolia but not in M. umbellata subsp. obovata. From this study, we believe that M. officinalis has a closer evolutionary relationship with M. parvifolia when compared to M. umbellata subsp. obovata. Also, based on the high success of the amplification rate in the cross-species transferability test, we foresee that the newly developed M. officinalis EST-SSR primer sets could work well on other Morinda species, as well as other members of the family Rubiaceae, thus contributing as a powerful genetic tool to provide useful genetic information for future biotechnology purposes.

Conclusions
EST-SSRs are known to be valuable gene function markers, which is useful in genetic diversity assessment across germplasm resources and marker-assisted breeding programs. Therefore, the M. officinalis transcriptome and the newly developed EST-SSR primer sets will be beneficial to future research on gene association with active pharmaceutical ingredients, genetic diversity, population structure and improving the understanding of the regulatory processes of pharmacologically-related compounds in M. officinalis, specifically towards the development of a renewable medicinal source for bajitian.