EST–SSR marker development and transcriptome sequencing analysis of different tissues of Korean pine (Pinus koraiensis Sieb. et Zucc.)

ABSTRACT Korean pine is a gymnosperm, and gymnosperms have relatively large genome sequences and lack a model organism reference genome. Understanding the important gene expression in the tissue growth process of needles (T1), stems (T2), female flowers (T3) and cones (T4) of the Korean pine is necessary to develop and compound the enzyme genes related to secondary metabolite synthesis. Simple sequence repeat (SSR) markers were explored based on the expressed sequence tag (EST). High-throughput sequencing technology was used to compare and analyse transcriptomes of four different tissue parts of the Korean pine, yielding many differentially expressed unigene sequences. Fluorescently labelled SSR primers were designed to analyse the polymorphism level of 60 open-pollinated families from the Heilongjiang province of China. The research showed that (1) 21.3 GB of data was obtained from the transcriptome sequencing, and 41,476 candidate unigenes were identified based on sequence annotation using various databases. Clusters from orthologous groups and gene ontology function classification tools were used to divide the annotated transcript sequences into 56 functional categories. (2) Cones had the highest number of expressed genes during puberty, with rich expression as they were being formed. (3) By pathway enrichment analysis, 16 key enzyme genes related to fatty acid synthesis in other homologous species were obtained. (4) Ten novel polymorphic fluorescence labelling were used to identify 60 open-pollinated families with a medium polymorphism level. The research showed that high-throughput sequencing technology could analyse the transcriptome expression level between different organisms, and SSR markers were successfully developed.


Introduction
The tree species Korean pine (Pinus koraiensis), commonly called the Korean pine, is in the genus Pinus of the family Pinaceae. The Korean pine is a grade II national key preserved wild plant of China and is a native tree species in the north-east China forest region [1]. Pinus koraiensis Sieb. has gained more and more attention as the most important tree species for timber and forestation in north-east China. In addition, Korean pine nuts are famous for their medical uses and nutritional value. Korean pine seeds are the most popular among current pine-seed products [2][3][4], and they contain large amounts of crude protein, crude fat, total sugar and crude fibre, as well as vitamins, minerals and trace elements (calcium, phosphorous, manganese, cobalt, copper and zinc) [4,5].
Currently, in the process of genetic breeding of the Korean pine, edible nut grove breeding and strain improvement are given priority. However, there are few studies on the molecular biology, diversity level and genetic markers of the Korean pine. Korean pine is a gymnosperm, and gymnosperms have relatively large genome sequences and lack a model organism reference genome. New generation sequencing has enabled the sequencing of several conifer genomes despite their very large size, estimated at 20-24 Gbp [6,7]. Understanding the genetic characteristics of Korean pine growth and development, the impact of gene expression on the growth of the Korean pine, the development of enzyme genes in the synthesis process of secondary metabolites and the development of EST-SSR markers can prove fundamental for hybridization breeding, germplasm resource preservation and new germplasm excavation [8,9]. Expressed sequence tag (EST) technology is an effective method for developing transcriptome research. It has been widely applied in the analysis of gene expression and gene function prediction in animals, plants and the microbiology spectrum [10][11][12]. It can reduce the cost and time of sequencing, and abundant data can be obtained with the development of sequencing technology and the applications of highthroughput sequencing to the study of transcriptomes. This research studied transcriptome data from Pinus trees obtained by HiSeq2000 sequencing technology developed by the Illumina company. To study gene expression in the process of growth and development of the Korean pine, we performed research such as building a homogenization cDNA library of the Korean pine, sequencing and gene functional annotation and classification, etc. Then, we obtained difference unigenes, which was helpful for selective breeding and genetic engineering research on the Korean pine and its similar species. We used high-throughput approaches to obtain transcriptome data from the Korean pine. We developed a large collection of transcriptome SSR markers by using computer assistance [13][14][15][16][17]. Simultaneously, we analysed its composition, distribution and characteristics, combined with the SSR fluorescence labelling capillary electrophoresis method. Finally, we found suitable primers for SSR markers of the Korean pine and detected the polymorphism level of 60 open-pollinated families.

Sample source
Fresh samples were collected from superior clones (No. 15 and No. 20) in Linkou seed orchard of the Korean pine, located in Heilongjiang province. The seed orchard was grafted and built in 1979. The clones were propagated by grafting, and no relationships existed between their ortets. We collected four tissues, needles (T 1 ), the top of the stem (T 2 ), female flowers (T 3 ) and cones (T 4 ), in the same period in June 2013. These samples were immediately frozen in liquid nitrogen in the field and then stored at ¡80 C for RNA extraction. We extracted RNA from each of the four materials. The extracted RNA was used for transcriptome sequencing. Sixty genome samples were used for EST-SSR amplification. The samples were collected in clonal seed orchards from Weihe, Teili, Hegang and Linkou in the Heilongjiang province ( Figure 1). From each seed orchard, 15 open-pollinated families were chosen. Stock plants from four seed orchards were grafted and planted in 1979. The plant spacing was 4 m £ 6 m, with a dislocated arrangement in the seed orchard. We collected seeds in the autumn of 2013, and seed-accelerated germination was performed in the winter of 2013. Breeding was conducted in the culturing room. Each family was individually collected and then stored at ¡80 C.

RNA isolation, cDNA library creation and HiSeq2000 sequencing
The total RNA extraction method for T1-T4 tissues was performed according to the instructions of Trizol Reagent (Invitrogen, USA). Then, the total RNA was delivered to Biomarker Technology Company (Beijing, China) for cDNA library construction and sequencing. We used the NEB Next Poly (A) mRNA Magnetic Isolation Module (NEB, E7490) kit to collect mRNA after identifying the quality, integrity and purity of the total RNA. Using RNA as a template, we built a computer library using NEB Next mRNA Library (Library effective concentra-tion༞2 nmol/L). Clusters were created with a detectionqualified library. Finally, the sequencing was processed by HiSeqTM 2000.

Assembly and annotation
We converted image data into corresponding nucleotide sequence data by using the Illumina platform sequencing. Trinity software was used for the mixed assembly of sample data [18]. Contigs were obtained by assembling overlap information between sequences. According to the similarity between the paired-ends of the sequence information and contigs, we could obtain transcripts by assembling. Through the assembly of components from the selection of potential alternatively spliced transcripts, we selected the longest transcript as the unigene sequence of the sample.
Unigene sequences were compared between databases non-redundant protein, Swiss-Prot, gene ontology (GO), COG and Kyoto Encyclopedia of Genes and Genomes (KEGG) by using BLAST software [19][20][21][22][23]. Comment information on the unigenes was obtained (E-value 1e-05). Then, the sequences were divided into different categories according to different functions. Finally, gene function information notation by KEGG was commented and predicted by biological pathways.

Analysis of differential unigene expression in different tissues
A reads database and a unigene database were obtained by sequencing each sample; then, both of these databases were compared using Bowtie comparison software. Expression levels were estimated using RSEM [24]. According to comparison information, we used RPKM to reflect the expressive abundance of unigenes [25]. Differential expression analysis was selected by EBSeq [26]. We used the False Discovery Rate (FDR) method to analyse unigene expression. In the screening of differential genes, we selected an FDR <0.01 and a fold change 2 as the standard definition for differential unigenes. Significant enrichment analysis by GO function could perform the main biological function of differentially expressed genes (DEGs). Through pathway, significant enrichment could identify (DEGs) involved in the main biochemical pathways and signal transduction pathways.

Genome extraction, primer screening and sequencing validation
The cetyl-trimethyl ammonium bromide method with minor modifications was used for total genomic DNA extraction [29]. A total of 101 pairs of primer were designed for amplification and detection. The Polymerase Chain Reaction (PCR) reaction mixture was 25 mL: 2.0 mL of template DNA (25 ng/mL), 2.5 mL of 10 £ PCR buffer, 0.5 mL 1 U Taq enzyme, 0.5 mL of primer (10 mmol/L), 0.5 mL of dNTPs and 18.5 mL of ddH2O. The thermal cycle was as follows: initial denaturation for 5 min at 95 C, followed by 35 cycles of denaturation for 30 s at 95 C and annealing at the optimal temperature 30 s, and then 72 C for 10 min. To detect polymorphic loci with SSR primers, PCR amplification was performed; 6% denaturing polyacrylamide gel electrophoresis was separated and tested; amplification products were collected, followed by sequencing validation (Shanghai Sangon Biological Engineering, Shanghai, China).
Four families of DNA templates were selected for the preliminary screening of primers. Ten pairs of polymorphic loci primers, with stable amplification reactions and clear bands, were chosen for the synthesis of fluorescence labelling primers. After PCR amplification, 1 mL of PCR products was mixed with 8.5 mL of ion formamide and 0.5 mL of LROX3730-500 internal standard, and the mixture was denatured for 5 min at 95 C, stored at 4 C for 5 min, and centrifuged for 1 min at 3000 r/min. Automatic fluorescence detection of capillary electrophoresis was performed with an ABI3730DNA analyzer, and then, data were collected and analysed by GeneMapper 4.0 software [30,31].

High-throughput sequencing and functional annotation
Prior to sequencing, cDNA samples obtained from four tissues and from individuals were normalized to increase the sequencing efficiency of rare transcripts. Subsequently, a total of 21.3 GB of data were generated from a full Illumina Hiseq 2000. 4901 106 contig were obtained through sequence splicing (removing low quality and uncertain sequence from the sequencing process). Among these segments, there were 35,331 segments bigger than 300 bp and 14,661 segments bigger than 1 kbp. A total of 71,238 transcripts and 41,476 unigenes were obtained through the Trinity assembly. The N50 of transcripts and unigenes was 1855 and 1826 bp, respectively. As shown in Figure 2, after BLAST sequence alignment, 25,824 (62.26% of the total number) unigenes with annotation information were eventually obtained. In total, 41,476 unigenes of the Korean pine were obtained by assembling, and then, they were compared with the Nr database. A total of 11,837 unigenes of the Korean pine (45.84%) showed sequence homology with the spruce (Picea sitchensis). A total of 2830 unigenes (10.96%) showed sequence homology with the grape (Vitis vinifera). In total, 41,476 unigenes were distributed into one or more GO terms, with 23.75% for cellular components, 27.24% for molecular functions and 49.01% for biological processes (Figure 3). In addition, compared with the KEGG database, 5356 unigenes (19.95%) were annotated by a pathway and related to 118 KEGG pathways.

GO enrichment analysis of differentially expressed genes (DEGs)
Annotated genes from the different tissues were used to perform a differential comparative analysis. The results showed that the three tissues of needles (T 1 ), stems (T 2 ) and flowers (T 3 ) showed higher differential gene numbers than fruits (T 4 ) ( Figure 4). Counting of variable genes revealed that the number of DEGs with down-regulated expression was larger than those with up-regulated expression for the three tissues and fruits. Needles and fruits had the highest number of variable genes, which suggested that fruits had the highest number of expressed genes during the developmental period, and cone expression was rich in the formation process.
GO enrichment analysis of the DEGs showed that in the different tissues, the unigenes were divided into three groups: components, molecular function and biological process. The proportion of the three subclasses, cell, organelle and cell parts, was the highest in the cellular components group. The proportion of the two subclasses, catalytic activity and binding, was the highest in the molecular function group. Metabolic process and cellular process were the highest proportion in biological process group ( Figure 5). GO enrichment correlation analysis of unigene variation was performed between different tissues based on this grouping. (T 1 ) vs. (T 4 ) were used as examples ( Figure 6). There were significant differences between leaf tissue and cone unigenes regarding sequence-specific DNA binding transcription factor activity (122), oxidoreductase activity (63), metabolic processes (183), oxidationreduction processes (298), carbohydrate metabolic processes (83), the extracellular region (129), planttype cell wall (95), cytoplasmic membrane-bounded vesicles (190) and apoplasts (91) in the formative period (P-value 0.05).

Pathway annotation analysis of different unigene in different organizations
We used the KEGG database to do the pathway enrichment analysis (Figure 7), in order to study the metabolic pathways of different unigenes involved. For example, the research found out that needles (T 1 ) and fruit (T 4 ) had significant differences in the pathway of penylpropanoid biosynthesis (61), penylalanine metabolism (43), photosynthesis (31), crcadian rhythm (17), favonoid biosynthesis (25), dterpenoid biosynthesi (13) and protein processing in endoplasmic reticulum (47). The data of the top three different unigene pathway showed that the pathways were consistent between the T 1 and T 2 , T 1 and T 3 . The shared pathways were phenylpropanoid biosynthesis, phenylalanine metabolism and photosynthesis. In addition, the pathways were consistent between T 1 and T 4 , T 2 and T 4 , T 3 and T 4 . The shared pathways were penylpropanoid biosynthesis and protein processing in endoplasmic reticulum. In the end, the shared pathways were phenylpropanoid biosynthesis, phenylalanine metabolism and plant hormone signal transduction between T 2 and T 3 .

Unigenes of pathway in fatty acid synthesis
There were 669 DEGs annotated to metabolic pathways of secondary metabolite unigenes in the transcriptome database, and the metabolic pathways had 25 different  branches. Forty-eight unigenes could be associated with the fatty acid biosynthesis pathway by pathway enrichment analysis from the transcriptome data of the different tissues. Fatty acid synthesis is the foundation of fat synthesis. Study of the genes involved in the fatty acid synthesis pathway would help to further understand the mechanism of fat synthesis. A pairwise comparison of DEGs was performed on each tissue. Some of the DEGs in each group were classified as associated with the fatty acid biosynthetic pathway after pathway enrichment analysis. Sixteen key enzyme genes, homologous with other species, were obtained after a comparison between 48 unigene sequences in the KEGG database. The average fragment length was 1397.65 bp. Among the genes, a ketoacyl-ACP synthase (KASIII) fragment sequence was the longest, with 2542.5 bp. Palmitoyl-ACP thioesterase (FatB) and stearoyl-ACP desaturase (SAD) had the highest expression level. Moreover, these two genes were at the highest level of expression in cone tops (Figure 8).

The number and distribution of SSR loci
A total of 1757 SSRs conforming to the definitions were recovered from 1549 sequences. The distribution frequency (the number of SSRs compared with the total number of sequences) of SSR loci was 4.24%. One SSR locus was found in an average of 17.38 kbp. The SSR loci contained 1-5 repeat types. The types of repeats were mainly concentrated among mononucleotide to trinucleotide repeats, which accounted for 98.68% of the total SSRs. The average unit length for di-, tri-, tetra-and pentanucleotide repeats were 11.81, 14.34, 16.37, 21.11 and 20 bp, respectively. The most abundant repetitive sequence length was 10-14 bp, which reached a 52.82% proportion; the proportion of 15-18 bp was 37.56%; the proportion of 19-22 bp was 7.63%; the proportion of 23 bp was 1.94%.

Screening and verification of EST-SSR primers
In total, 101 pairs of EST-SSR primers were designed and synthesized, which included 2-5 nucleotide repeat units.
In the prescreening of all primer pairs with an individual pair randomly chosen from each seed orchard, we successfully amplified 21 SSRs (20.8%). Twenty-one pairs of primers were tested by detecting polymorphic loci, and then, PCR amplification was performed; 6% denaturing polyacrylamide gel electrophoresis was used to separate  and test the PCR products, and amplification products were collected. Sequencing validation was then conducted (Shanghai Sangon Biological Engineering, Shanghai, China). The results showed 16 SSRs (76.19%, containing 2 di-, 12 tri-, 2 tetra-) among the 21 pairs of primers with the expected sizes (Table 1).

Polymorphism analysis of SSR loci
Ten SSRs were selected to synthesize fluorescence labelling and detect polymorphisms in 60 open-pollinated families [32]. Four capillary electrophoresis templates were amplified by the P92 primers ( Figure 9). Data analysis by GeneMapper4.0 software showed that the fragment size of the amplification products was between 110 and 432 bp. Detection of 10 SSR loci of genetic variants in 60 samples is shown in Table 2. The polymorphism information content (PIC) detected from these 10 SSR loci was 0.1168-0.6913, with an average of 0.4500. According to the criteria of Botstein et al., half of the 10 novel polymorphic SSRs demonstrated highly informative scores (PIC > 0.50), and the other polymorphic SSRs had reasonably informative scores (0.50 > PIC > 0.25), except for the SSRs P.74 and P.70 (PIC < 0.25) [33]. The results suggested that these markers could be considered an important marker resource in Korean pine population genetic studies.

Results and discussion
A large number of ESTs for pines have been sequenced to date. Large numbers of ESTs have been sequenced  and analysed in pine to discover growth and wood quality trait-related genes [34][35][36][37][38]. But researches are few in the different organizational development system of Korean pine. Especially, Korean pine nuts are the product of an edible woody oil tree species with oil content as high as 56.76%. The nutritional components of the seeds of the P. koraiensis trees exhibited rich variation; the variation coefficient of the fatty acids was 2.24%-66.83% [39]. Our research obtained 41,476 unigenes and 25,824 annotation information unigenes by assembly. The DEGs of different Korean pine tissues were subjected to GO function classification and enrichment analysis of unigene differences. The results showed that the most unigene differences were observed between cones and needles, tender stems and flowers. The results also suggested that cones had abundant biochemical processes and a high level of gene expression during the developmental period. Li et al. [40] used high-throughput sequencing technology to perform a transcriptome analysis between different tissues and to find saponin biosynthesis genes. Some researchers have used highthroughput sequencing for transcriptome sequencing and to excavate gene resources in Norway spruce, lodgepole pine and Pinus tabulaeformis [34][35][36]. This study confirmed that high-throughput sequencing of the Korean pine is feasible for studying the transcription levels between different tissues. Genetic improvement of the Korean pine had, so far, mainly involved traditional selection and crossing based on phenotypes. Preliminary showed that phenotypic traits have abundant genetic variation [40]. The developed SSR markers could be used for fingerprinting of superior clones, paternity analysis and pedigree reconstruction for breeding populations [8,13,14,41,42]. Thus, we expect these makers to significantly enhance the accuracy and progress of the Korean pine breeding program. Furthermore, our subsequent development of large-scale EST-based SSR markers should greatly facilitate Quantitative Trait Loci (QTL) mapping, Molecular Assisted Selection (MAS) and comparative genomics studies. SSRs have emerged as an important marker system in many areas of molecular genetics because they exhibit a high degree of allelic diversity. The identification and development of SSR markers represents significant challenges for non-model organisms and organisms with extremely large genomes [8].
Although traditional sequencing yields longer EST sequences, it has little advantage compared to new assembly technology based on the large-scale ESTs. Additionally, most previous studies used a cDNA library of one tissue, whereas we used a normalized cDNA library comprising multiple tissues and individuals. These large-scale ESTs will provide more comprehensive pine transcriptome information and facilitate the assembly of Pinus koraiensis ESTs in the future. The Korean pine is a gymnosperm with relatively large genome sequences. It is a model organism without a suitable reference genome that must be assembled de novo from transcriptome data. We examined the characteristics of 2-5 bp unit-type microsatellites in the Korean pine based on next-generation genome sequencing technology. The distribution frequency of SSRs was 4.24%, which is less than the American black poplar (Populus deltoides) (14.83%), and Oil-tea (Camellia oleifera) (6.7%) but greater than the Loblolly pine (Pinus taeda) (4.32%) and the Masson pine (Pinus massoniana) (1.4%) [15,[43][44][45].
In the study of different tree species, the proportion and type of SSR repetitive sequences in ESTs vary. This study could amplify polymorphic loci with primers, and priority was given to dinucleotide and trinucleotide repeat units. Codons are a functional unit with three nucleotides; therefore, a three-nucleotide displacement would not have a large effect on the reading frame of expressed genes. Therefore, the trinucleotide repeat SSR types are the most frequent in EST sequences, while dinucleotide and trinucleotide types are the common type of transcriptome SSR repeat unit in most species [46]. Our study is consistent with the above research. Trinucleotide repeats predominated, accounting for 34.66% of the total number of SSRs; dinucleotide repeats accounted for 17.12% of the total number of SSRs. At the same time, the results verified that the polymorphism of low-level SSR units was higher than that of senior units, so designed SSR primers should focus on containing the sequence of the lower unit [47,48].
EST-SSRs have a great advantage in characterizing and protecting the genetic diversity of germplasm resources, but they also have some shortcomings. Because EST-SSRs come from functional gene sequences of relative conservation; therefore, species-specific genomic SSR markers generally have higher polymorphism levels than EST-based ones, which is useful for fingerprinting and the pedigree reconstruction of closely related individuals [49][50][51]. SSR polymorphism molecular markers are based on the different branches of fragments amplified in micro-satellite repeat number, but they cannot effectively distinguish between SSRs of the same length and different repeat sequence and SSRs with a similar length. With increasing EST sequencing and the development of new EST-SSR markers in public databases, ESTs are obtained based on more effective high-throughput sequencing technologies, helping to increase the number of polymorphic SSR primers and to make up for the low polymorphism of EST-SSR markers. At the same time, the reliability and accuracy of the test can be improved by combining with other comparative studies on molecular markers, such as markers, Single Nucleotide Polymorphisms (NP), etc. In this research we adopted the ABI3730DNA analyzer instrument, using capillary electrophoresis, PCR product segments amplified by 10 pairs of fluorescence labelling SSR primers; this approach overcomes the artificial estimation error caused by fragment size inherent to polyacrylamide gel electrophoresis methods.

Conclusions
In this research, high-throughput sequencing technology was used to analyse the transcriptome expression level of different organizations. The DEGs of different Korean pine tissues were subjected to GO function classification and enrichment analysis of unigene differences, and SSR markers were successfully developed. EST was obtained based on effective highthroughput sequencing technologies. High-throughput sequencing technologies were contributed to increase the numbers of polymorphic SSR primers, and made up polymorphism of EST-SSR marker. EST-SSR markers are useful for the studies of population genetics in Korean pine.

Disclosure statement
No potential conflict of interest was reported by the authors.