The first transcriptome sequencing and analysis of the endangered plant species Picea neoveitchii Mast. and potential EST-SSR markers development

Abstract Picea neoveitchii Mast. is a rare and endangered evergreen coniferous tree species in China. P. neoveitchii grows within a specific ecological and geographical range. Here, we performed transcriptome sequencing of P. neoveitchii using the Illumina high-throughput deep sequencing technology with the following aims: (i) to perform de novo transcriptome assembly, (ii) to perform transcriptome assembly with functional annotation, (iii) to study the candidate genes involved in the biosynthesis of cellulose and lignin, (iv) to identify and analyse the transcription factors, and (v) to develop potential expressed sequence tag–simple sequence repeat (EST-SSR) markers for future population genomic and genome-wide association studies in P. neoveitchii. This is, to the best of our knowledge, the first study to provide a comprehensive transcriptome of P. neoveitchii. The transcriptome data contained 12.97 gigabase pairs (Gbp) for a total of 120,951 different transcripts belonging to 66,022 unigenes. In addition, we classified enzymes involved in the biosynthesis of cellulose and lignin and identified and analysed the transcription factors involved in stress responses, growth and development. Based on the larger than 1 kb unigenes, 1,814 EST-SSR markers were shown. Our results demonstrated that the transcriptome analysis is a valuable technology for non-model species with large genomes and this dataset will be an essential resource for future genetic and genomic research in P. neoveitchii.


Introduction
Picea neoveitchii Mast. is an evergreen coniferous tree that belongs to the genus Picea in the Pinaceae family. This species is both rare and an endangered species that is native to China. It grows within an ecological and geographical range that is severely threatened by climate change, forest fires and landuse changes [1,2]. This species is important to various disciplines, including botany, geography and climatology. When compared with other conifers, this conifer has a straight and vertical trunk, and its fibre and wood are of a very high quality. Furthermore, whereas its cones contain compounds that can be used for the treatment of cough and phlegm, its leaves contain compounds that can inhibit fungal damage to plants [3,4]. So far, research into the protection and utilisation of Picea neoveitchii has focused on its wood and leaf extracts, and its genomics and transcriptomics have not been explored.
Recent studies have shown that RNA-Seq is a very effective and powerful technique for gathering extensive transcriptome information in many plants [5][6][7][8].
This study focused particularly on the classification of the assembled unigenes involved in the biosynthesis of cellulose and lignin. Cellulose is an important part of the plant cell wall and is considered to be the most abundant polysaccharide material in nature, with an average annual production and decomposition of about 180 billion tons [22,23]. Lignin is also an important part of the plant cell wall; it is a highly branched polymer composed of phenylpropanoids and represents about 30% of the organic carbon in the biosphere [24,25]. Using this new comprehensive transcriptome dataset, we can also further study some important functional genes and determine some molecular markers for transcriptional genomics, such as ones for stress responses, growth and development. These data will be an essential resource for future genetic and genomic research.
In this study, we obtained the P. neoveitchii comprehensive transcriptome sequence by Illumina RNA-Seq technology. The transcriptome is publically available in the NCBI (National Center for Biotechnology Information) database, and we believe that it will help future researchers to perform breeding, gene expression and population genomic analyses of this coniferous tree species. At the same time, the study will contribute to the study of the genetic evolution of the entire gymnosperms.

Plant materials and RNA extraction
Root, leaf and stem samples from a single individual P. neoveitchii tree were collected. The samples were snap frozen in liquid nitrogen and stored for subsequent experiments. We used the Plant RNA Kit (Aidlab-Biotech, China) to extract total mRNA. A spectrophotometer and an Agilent 2100 Bioanalyzer were used for measuring the quality and quantity of purified mRNA. Subsequently, we combined the same amount of high quality mRNA from each tissue to produce a pooled mRNA sample for the mRNA-seq library.
Library construction for illumina sequencing, data analysis and assembly The Illumina sequencing library was constructed by the Illumina's TruSeq RNA Sample Preparation Kit. The quality of the library was assessed by the Agilent Bioanalyzer 2100 system. The fragments (300 bp ± 25 bp) were purified and amplified as previously reported in detail [5], and then the library was sequenced using the Illumina HiSeq platform.
In order to obtain high-quality clean data, we removed all adapter sequences and low-quality sequences from the raw data. High quality data reads were de novo assembled using the Trinity program (http://trinityrnaseq.sourceforge.net/) [26]. Our optimised k-mer length was 25. The longest assembled transcripts were defined as unigenes. The coding area was predicted by the EMBOSS Getorf Software (http://emboss.bioinformatics. nl/cgi-bin/emboss/getorf) [27]. The transcriptome is publically available in the NCBI (National Center for Biotechnology Information) database.

Functional annotation
Annotating the unigenes was done based on the following six databases: NR, NT, SWISS-PROT, GO, COG and KEGG. We selected the best alignment from the matches with an E-value 10 À5 . Based on the highest score BLAST comparison, we gave a gene name for each assembled unigene sequence. We assigned GO to the assembled unigene sequence using the "Blast2GO" [28]. The Kyoto Encyclopedia of Genes and Genomes pathways were assigned to the assembled sequences using the online KAAS (http://www.genome.jp/kegg/kaas/).

EST-SSR detection
We used the MISA software to perform EST-SSR detection in the assembled 12,845 unigenes (!1kb) (http:// pgrc.ipk-gatersleben.de/misa/) by adjusting the parameters to identify perfect Di-, Tri-, Tetra-, Penta-and Hexa-nucleotide motifs with a minimum of six, five, four, four and four repeats respectively as previously described [5].

RNA-Seq and transcriptome assembly
After removing all adapter sequences and low-quality sequences from the raw data, about 64,221,704 reads (12.97 Gbp) with 91.44% Q20 bases were selected (see Supplemental Table S1). High quality Illumina sequencing reads were submitted to the NCBI Short Read Archive (accession number: SRR2132464).
We used the Trinity software program to de novo assemble the high-quality sequencing data [19], which produced 120,951 transcripts with an N50 length of 1,467 bp and a mean length of 871 bp. The length distribution of the transcripts is shown in Supplemental Figure S1A and Table S2. Then the assembled transcripts were further analysed for clustering and assembly. This produced 66,022 unigenes with an N50 length of 1,194 bp and a mean length of 668 bp (Table 1 and Supplemental Figure S1B). These data will be used for subsequent annotation analysis.

Functional annotation and classification
The unigenes functional annotation is shown in Table 2. We used BLASTX for similarity analysis and compared against the NR database. Of these unigenes, 41,011 (62.12%) unigenes had a significant match, whereas approximately 40% of the unigenes did not show a significant match. Based on the previous studies, it depicts that the determined sequences of unigenes did not significantly match about 35%-50% of the sequences. Among the many plant protein sequences, the assembled unigenes of P. neoveitchii had the highest number of hits against Picea sitchensis at 12,205 hits (29.76%), followed by Vitis vinifera at 2,896 hits (7.06%), Glycine max at 1,784 hits (4.35%), Theobroma cacao at 1,295 hits (3.16%), Ricinus communis at 1,227 hits (2.99%), and Populus trichocarpa at 1,086 hits (2.65%) (Figure 1). The high similarity of Picea neoveitchii unigenes and Picea sitchensis genes indicates that we can use the Picea sitchensis transcriptome and genome as a reference for further study.
The SWISS-PROT database annotation results contain experimental results, computational features and scientific conclusions. Of the 66,022 assembled unigenes, 30,924 (46.84%) were similar to accessions in SWISS-PROT. Using the GO database for enrichment analysis, the unigenes could be classified into three separate groups. The most GO terms were assigned to biological process 67,447 (48.40%); molecular function had 38, 975 (27.97%) terms assigned, and cellular component had 32,927 (23.63%) terms assigned ( Figure 2).

Candidate enzymes involved in the biosynthesis of cellulose and lignin
Cellulose and lignin, which account for most of the dry weight of wood, are two important biopolymers. We focused the next step of our analysis on the genes involved in the biosynthesis of cellulose and lignin.
At present, cellulose synthesis is one of the main fields in plant molecular biology research. Based on previous studies [29,30], we identified 111 unigenes associated with 6 enzymes of the cellulose biosynthesis pathway in the annotated P. neoveitchii transcriptome ( Figure 4A). The cellulose synthase complex (CSC) is composed of cellulose synthase catalytic subunits (CesA). In this study, 59 unigene sequences were annotated as encoding CesA subunits.   Although lignin research has been carried out for many years, some aspects of the lignin biosynthesis remain a debated topic. In this study, we identified 105 assembled unigenes which were associated with 10 enzymes in the normal phenylpropane pathway ( Figure 4B).  However, since cellulose and lignin biosynthesis are complex processes involving multiple parameters, we assume that our transcriptome sequencing analysis will throw light on only a part of the whole process. To reach more specific conclusions, we need additional accurate molecular, genomics and proteomics analysis programmes to validate and further build on our predictions. Based on the sequence and annotation information from NR, NT, SWISS-PROT, GO and the KEGG database, 514 proteins were identified as putative TFs that belong to different TF families ( Figure 5). In the model plant species Arabidopsis thaliana, the MYB family is one of the largest TF families [31]. In our dataset, the most abundant family was also the MYB family. Being the largest TF groups in plants, the MYB family has demonstrated its essential role for the responses to abiotic stresses [32][33][34][35]. In addition, the WRKY, bHLH and NAC TFs were also identified in the dataset, and efforts have been made to unveil the molecular mechanism of the responses to abiotic stresses in Arabidopsis, Oryza sativa, Populus and other plant species [36][37][38][39][40][41]. Increasing evidence has been provided for the crucial functions of TFs, as they relate to mediating the growth and development of plants. This evidence increases our knowledge of the functional importance of the MADS-box genes involved in flower, fruit, and seed development [42][43][44][45][46][47][48]. Additionally, there have been strides in conifer tree research over the past decade [49][50][51].

Identification and analysis of transcription factors in P. neoveitchii
Among the other classes of TFs in Arabidopsis, the transcription factor MYBL2 is a transcriptional repressor that regulates anthocyanin biosynthesis by modulating the MYB/bHLH/WD40 complex [52]. Similarly, AtMYB15 combines with the bHLH protein AtICE1, and they both bind to the AtCBF3 promoter to regulate the cold stress response [53]. Overall, through in-depth analysis of transcription factors in our transcriptome dataset, we can identify candidate genes involved in the growth and development, plant signalling pathways and regulatory networks for this conifer tree.

EST-SSR markers discovery
The SSR markers are the most widely used molecular marker systems. To explore the EST-SSR markers, we extracted 1,560 sequences containing 1,814 EST-SSRs from 12,845 unigenes. The most highly represented repeat units of potential SSRs were five and six, which accounted for 46.67% (414) and 29.43% (261), respectively ( Table 3). The most common repeat sequences were AT/AT (156), AGC/GCT (132), AAG/CTT (119) and AGG/CCT (112) (Supplemental Table S3). We believe that this large set of EST-SSR markers in this study

Conclusions
To the best of our knowledge, this study provides the first comprehensive transcriptome of P. neoveitchii. The coverage of the transcriptome includes 12.97 Gbp, with a total of 66,022 different unigenes. We used the assembled transcriptome to identify enzymes involved in the biosynthesis of cellulose and lignin and transcription factors involved in stress responses, growth and development. We also proposed a large number of EST-SSR markers, which will be helpful for the subsequent population genetic studies and molecular breeding.

Disclosure statement
No potential conflict of interest was reported by the authors.