De novo assembly and characterization of the transcriptome of a wild edible mushroom Leucocalocybe mongolica and identification of SSR markers

ABSTRACT Leucocalocybe mongolica is a prestigious rare wild edible mushroom in Northeast Asia. It is the unique species of the genus and the studying of its phynotype and genome is crucial to genus and species classification. Beyond that, it has high application and ecological value in the industry of food and atmosphere investigation. On account of the fact that transcriptomic and genomic data of L. mongolica lacked in the biological information database, that is a limitation to further study. The transcriptome data were obtained by virtue of Illumina paired-end sequencing technology: 42,622,958 clean reads were achieved and 37,302 contigs were generated. These contigs were subsequently assembled into 13,821 unigenes. These unigenes were annotated within 7 public databases. The 3914 unigenes were associated with a COG classification. Throughout all of the unigenes, 6642 were classified as three functional groups; 3110 unigenes were selected from KEGG pathways, and taken to further clustering analysis for 5 main categories; 57 genes, potentially involved in terpenoid, steroid, and unsaturated fatty acids biosynthesis were identified and selected for further research. The total number of carbohydrate-active enzymes of L. mongolica is 446 and the number of carbohydrate binding module (CBM) is relatively low. CE11, GT19, GT51, GT56, GH131, GH133, GH135 constitute the characteristic carbohydrate-active enzymes subfamily compared to other edible mushrooms. The characteristic carbohydrate enzymes relative to other mushrooms could play a vital role on the metabolism of nutrients. In these generated sequences, 1860 SSRs were identified and characterized as molecular candidate markers existing L. mongolica.


Introduction
Leucocalocybe mongolica (S. Imai) X.D. Yu & Y.J. Yao is a prestigious rare wild edible mushroom in Northeast Asia. Its former Latin name was Tricholoma mongolicum S. Imai [1] and its Chinese name is 'Bai-mo' or 'Kou-mo'. T. mongolicum was proposed to constitute a new genus, Leucocalocybe, based on morphological and molecular evidence [2], which was later verified by other experimenters [3]. L. mongolica is the unique member of a single-species genus. The species has an isolated taxonomical status, which is of great importance in scientific research from the perspective of biodiversity [4].
In a global context, it is an endemic species in some specific areas of temperate steppe in Northeast Asia. It is distributed in the grassland of Inner Mongolia of China, Mongolia, and the Far East of Russia. Due to its highly desirable flavour and short fruiting season, L. mongolica is collected by indigenes and mycophiles. Since ancient times, L. mongolica has been traded as a natural delicacy. Besides that, it has a potential for application in the treatment of a number of diseases, such as stomach disorders. Lectin in L. mongolicais is known to inhibit cancer and lower both blood pressure and cholesterol [5].
The mushroom is a native folk medicine for local minorities. Mongolians believe that it can be applied in the treatment of botulism and wounds, and for detoxification. Evenks boiled it with mutton to fortify their health and to relieve blood-heat syndrome [6]. The bioactivities of L. mongolica have attracted intense interest from scientists over recent years, as the basidiomata has tumour growth inhibitory activity in H 22 -bearing mice. Besides, the ergosterol and ergosterol peroxide from the basidiomata could induce apoptosis of HepG 2 cells and could inhibit their division [7]. Fungal polysaccharide has attracted widespread interest for searching active substances [8][9][10][11][12]. Some new biological activities have also been detected, such as anti-proliferative effect on human cervical carcinoma cells and hepatoma cells [12]. L. mongolica has a high antioxidant activity and can decrease the risk of atherosclerosis [13]. L. mongolica contains carbohydrate-binding proteinslectins [14,15] which possess a number of physiological activities in tests on mice, such as immunomodulatory, antitumour [16][17][18][19], antiproliferative [20] and vasorelaxant activity [21]. The main types of chemical constituents found in the medicinal fungi include terpenoids, steroids, aikaloids, phenolic constituents, sphingolipids, pigments and polysaccharides [22]. However, for more comprehensive activity screening, the research has been restricted by the rareness of L. mongolica, so the potential applications of the biologically active components of L. mongolica remain to be exploited.
Nowadays, there is still a lack of a positive procedure for artificial cultivation of L. mongolica. L. mongolica is frequently collected, wild resources are in the state of predatory development and their habitats are vulnerable to destruction like grassland desertification, overgrazing and trampling. The mushroom is highly endangered. Therefore, more attention should be drawn to this unique member of the single-species genus Leucocalocybe.
With the advancements of technology, the nextgeneration sequencing technologies have opened up exciting possibilities for whole genome sequencing of macro-fungi. Macro-fungi belonging to 64 families (seven species are incertae sedis) have been reported with their genome sequences in the biology informatics databases. The advances in Macro-fungal Genomes Sequencing stimulate the development of basic research, conservation and utilization of macro-fungi. On account of the fact that there are no transcriptomic data or genomic data of L. mongolica in the biological information databases, it is necessary to obtain transcriptomic and genomic data. The further understanding of genomic information is indispensable for more far-reaching protection of the germplasm diversity. We gained a deeper understanding of the complex transcriptome involved in the biosynthesis of active substances and conducted a new plan for further research. We can identify candidate genes involved in the biosynthesis of key constituents.
Microsatellites (SSRs) as essential molecular markers have been used in forensics, population and conservation genetics phylogeography, genomic mapping and determining parentage [23][24][25] due to their high polymorphism, rich informativeness, simplicity and spread through the genome. The key research topic concerning the conservation of L. mongolica populations is the genetic diversity. In this study, SSR markers were sought on the basis of the transcriptome sequences to explore the population genetic diversity of L. mongolica.

Strain isolation and identification
The strain was originally isolated from basidiomata which was collected in Old Barag Banner, Inner Mongolia Autonomous Region, China (GPS coordinates: N49.34762449; E119.2343394) at an altitude of 625 m in 2013 (Online supplementary Figure 1S). The strain was maintained in Strains Laboratory of Fungi at the Engineering Research Center of the Chinese Ministry of Education for Edible and Medicinal Fungi in Jilin Agricultural University. The voucher specimen was preserved in Herbarium Mycology of Jilin Agriculture University (HMJAU) under number MCCJLAU2015YKS. It was preserved on potato dextrose agar (PDA) medium at 4 C. The sequence (ITS1-5.8S-ITS2 of nuclear ribosomal DNA) was amplified from the mycelium (annealing Tm. is 50 C) and 639 bp of the DNA fragment were sequenced (Sangon Co., Ltd., Shanghai, China). The sequence of the strain (GenBank accession number: KY070316) is identical to that of the voucher specimen.

Fungal culture conditions
The strain was transferred into Petri dishes by the rolltube technique. The media components included soluble starch 2%, beef extract 0.2%, CaCO 3 1%, agar 2% and distilled water. The medium was autoclaved at 121 C for 30 min. When mycelia had grown all over the Petri dishes, 1 cm (diameter) homogeneous pieces were punched around concentric circles using a puncher. Five homogeneous pieces were transferred into 250 mL flasks with fermentation medium. They were propagated on a rotary shaker (150 rpm) at 25 C. The formula of the fermentation medium contained 20% corn extract. Mycelium pellets were collected by the filtering method. After that, mycelium pellets were stored at ¡80 C.

RNA extraction, cDNA library construction and sequencing
Total RNA was extracted from frozen mycelium pellets by using the Transzol plant (TransGen Biotech, Inc.) following the manufacturer's instructions. The mycelium pellets samples from three fermentation periods (8 days, 14 days and 20 days) were mixed together to prepare total RNA. The RNA samples that met the requirements were used to construct transcriptome sequence libraries. The RNA samples were tested for concentration, purity and RNA integrity with NanoDrop 2000 microvolume spectrophotometer (Thermo Scientific, Waltham, MA, USA) and Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA). The total RNA amount of12.2 mg was equally pooled for cDNA library preparation. cDNA library construction and Illumina sequencing of the sample was conducted at Annoroad Gene Technology Co Ltd., Beijing, China. The specific method of cDNA library construction was done as described in [26]. IlluminaHiseq 2000 system with sequencing PE150 strategy was implemented with regard to RNA-sequencing.
Transcriptome data processing and de novo assembly The image data originally generated by the sequenator were converted to raw data which were processed with Perl scripts to ensure the quality of data. The raw data were filtered by the following criteria: (1) removing the adaptor-polluted reads, (2) removing the low-quality reads which account for more than 15% of total bases, (3) removing reads with a number of N bases accounting for more than 5%. As for paired-end sequencing data, both reads would be filtered out if any read of the paired-end reads was adaptor-polluted. The obtained Clean Data after filtering were analysed statistically based on quantity and quality, including Q30 (those with a base quality value less than 30), data quantity and base content statistics, etc.
The reference genome of L. mongolica has not been reported in biology informatics databases. The software Trinity was used for de novo assembly [27]. Trinity represents a novel method for efficient and robust de novo reconstruction of transcriptomes from RNA-sequencing data.

Gene annotation and analysis
Trinotate version 20140717 (http://trinotate.github.io/) was used for performing the functional annotation of unigenes [28]. Trinotate has an integrated annotation procedure, especially for de novo assembled transcriptomes. Trinotate was run for functional annotation by virtue of a series of different well referenced programs. Homology search in bioinformatics databases which consist of the National Center for Biotechnology Information (NCBI) non-redundant (NR) protein database, NCBI nucleotide sequence database (NT), Swissprot, Pfam (http://pfam.xfam.org/) and EggNOG (http://egg nog.embl.de/version ç 4.0.beta/) was implemented. BlastX (https://blast.ncbi.nlm.nih.gov/Blast.cgi) analysis was further carried out with an E-value cut-off of 1.0e-5 based on the results of the protein database annotation. On account of an unknown functional gene, the cluster of orthologous group (COG) method was utilized [29].
The Gene Ontology (GO) (http://geneontology.org/) classification was applied to classify the annotated sequences. The vital biological biosynthesis pathways of L. mongolica were identified by Kyoto Encyclopedia of Genes and Genomes (KEGG) (http://www.kegg.jp/).

RNA sequencing and de novo assembly
A total of 46,624,842 raw reads were generated for L. mongolica with a paired-end sequencing method by Illuminasequenator. Based on the filter of the low-quality reads, adapters, poly-A tails and reads containing more than 5% unknown nucleotides, the raw data were processed in order to satisfy the quality control. Approximately 42,622,958 clean reads were obtained. A general summary of the statistical data on sequencing and assembly is demonstrated in Note: Q30 percentage is the proportion of nucleotides with quality value larger than 30 in reads; Ns reads-rate is the proportion of unknown nucleotides in clean reads; GC percentage is the proportion of guanidine and cytosine nucleotides among the total nucleotides; N50 length is defined as the shortest sequence length at 50% of the genome.
respectively. The genome data of L. mongolica has not been reported; therefore, de novo assembly was chosen ( Table 1). As a consequence, 37,302 contigs were generated, which were subsequently assembled into 13,821 unigenes. In addition, the GC percentage of these unigenes is 49.62%. The maximum length of unigenes is 43,076 and the minimum length is 201. In brief, all relevant data revealed a high quality and sequencing depth generated by Illuminasequenator.

Functional annotation by searching against public databases
The reference genome of L. mongolica was not present in biology informatics databases. The de novo transcriptome assembly was annotated by Trinity: 138,201 non-redundant unigenes were subjected to similarity analysis in seven public databases ( A total of 9278 unigenes were annotated in the NCBI Nr protein database. Of all resoundingly annotated unigenes, 72.65% had powerful homology with the aligned proteins (E-value <1E ¡45 ) (Figure 1(A)). The similarity distribution showed that 54.39% of the unigenes had a similarity higher than 60% (Figure 1(B)). Distribution analysis by Blast based on the Nr protein database showed that 12.68% of the annotated unigenes have homologs in Laccaria amethystina (Figure 1(C)), indicating that the annotated L. mongolica unigenes were not highly homologous with species in this database.
The Venn diagram ( Figure 2) illustrates the interrelation of Blast hits for unigenes against three databases. BlastX represents an optimal gene alignment in the Uniprot database. BlastP represents an optimal open-reading frame (ORF) alignment in the Uniprot database. The results revealed 1373 unigenes were considered as orthologous genes across all these three databases, as shown in the Venn diagram.

Functional classification by COG, GO and KEGG
To further obtain better understanding of the functions of the unigenes, the BlastX method was implemented by comparing to the COG database ( Figure 3). A total of 3,914 unigenes were associated with a COG categorization. With regard to the 25 COG categorizations, the R category (general function prediction only) represented the maximal group (439; 11.22%), followed by the E category (amino acid transport and metabolism), the C category (energy production and conversion), the G category (carbohydrate transport and metabolism) and the J category (translation, ribosomal structure and biogenesis).
GOntology classification was applied to classify the predicted functions of L. mongolica genes. Within the molecular function category, 6642 unigenes were distributed among 67 functional groups in Gene Ontology ( Figure 4). Three dominant groups, and namely biological processes, cellular components and molecular functions, were shown to include 52,744,137 and 3097 unigenes, respectively. Under the category of biological processes, cellular processes (1236, 18.61%), metabolic processes (1209, 18.20%) and single-organism processes (1005, 15.13%) were conspicuous and considered to be involved in vital cell processes and metabolic activities of L. mongolica. Regarding the cellular components, there were two prominent categories: cell (1277 unigenes/19.23%) and organelle (689 unigenes/10.37%). In the categorization of molecular function, the first and second largest groups were catalytic activity (1283 unigenes / 19.32%) and binding (1174 unigenes/17.68%), respectively. The rest of the categories, such as transporter, electron carrier, molecular function regulator, nucleic acid binding, enzyme regulator, structural molecule, molecular transducer, antioxidant, protein-binding transcription factor, guanyl-nucleotide exchange factor and nutrient reservoir accounted for 638 unigenes, representing only 9.61%.
The annotation of the KEGG pathway database can promote the further understanding on the biological functions and interactions of genes. Out of the 4019 unigenes, 3110 had prominent matches in the KEGG database and were classified into five main groups, including 35 KEGG pathways ( Figure 5). Relative to the five main categories, metabolism was the largest (1389, 44.66%), followed by genetic information (620, 19.94%), organismal systems (452, 14.53%), cellular processes (360, 11.58%) and environmental information processing (289, 9.29%). The results indicate that stirring metabolic processes were currently taking place.
From Figure 5, it can be seen that KEGG metabolism contained 13 categories, for instance, amino acid metabolism, biosynthesis of other secondary metabolites, carbohydrate metabolism and energy metabolism. The KEGG database annotation was also conducive to revealing the essential biological pathways operating in L. mongolica.

Candidate genes involved in bioactivation in the three categories of significant biosynthesis
Mushrooms are usually applied as adaptogens and immunostimulants to balance and promote a healthy body. Most mushroom products possess beneficial health effects owing to the synergistic action of the present bioactive molecules and can be used on a regular basis without harm [31]. So far, terpenoids and steroids have been extracted from various mushroom species [32]. Terpenoids and steroids are important pharmacologically active ingredients. Like the Ganodermatriterpenoids, peroxyergosterol is recognized worldwide. In addition, unsaturated fatty acids play a crucial role in healthy products that are becoming a hot spot of the wellness industry. We investigated the genes that control the terpenoid, steroid and unsaturated fatty acids biosynthesis in L. mongolica. As shown in Table 3, 27 genes related to terpenoid biosynthesis (15 involved in the terpenoid backbone, 12 involved in ubiquinone and others involved in terpenoid-quinone synthesis), 20 genes related to the steroid biosynthesis (9 involved in steroid hormone synthesis and 11 involved in steroid synthesis) and 10 genes related to unsaturated fatty acids biosynthesis in L. mongolica were identified. L. mongolica is also a folk medicine for local native minorities. It is originally attractive for its high nutritional and healthy function. Secondary metabolites have aroused the interest of more and more researchers.

Carbohydrate-active enzyme (CAZy) database annotation
The transcriptome of L. mongolica was annotated with the carbohydrate-active enzyme (CAZy) database that contributes to promote the deeper understanding of the correlation between nutritional modes and ecological environment. Carbohydrate-active enzymes play an important role in substrate degradation processes [33]. Mushrooms are important decomposers in the ecosystem returning nutrients to the soil by putrefaction and decay. We can obtain information on the distribution of the major classes of carbohydrate-active enzymes in L. mongolica from Figure 6. GH constitute a reasonable  Nine other mushrooms were selected as references. Some model mushroom species possess representative ecological characteristics. For example, grass-rotting, wood-rotting and ECM fungi form mutualistic symbioses with many tree species. The three ecotypes are rather typical. Among the nine mushrooms, Agaricus bisporus, Volvariella volvacea and Coprinus cinereus are grass-rotting fungi, Pleurotusostreatus, Lentinuse dodes, Trametes versicolor, Schizophyllum commune and Serpula lacrymans are wood-rotting fungi and Laccaria bicolor is an ectomycorrhizal fungus. The above species have reference genomes. As shown in Figure 7, the total number of CAZymes in L. mongolica is 446, more than those of A. bisporus and L. bicolor. The number of carbohydratebinding modules was the lowest relative to other mushrooms. The distribution of 446 CAZyme genes in L. mongolica is presented in the Online supplementary  Table 1S.
Out of the 13,821 unigenes that were examined, an aggregate number of 1860 SSRs were identified in L. mongolica. Statistical analysis of the identified SSRs is shown in Table 4. The quantities of sequences containing SSRs which is equal and greater than 1 were 1407 and 286, respectively. The numbers of mono-, di-, tri-, tetra-, penta-and hexarepeats were 1252, 200, 383, 14, 7 and 4, respectively. These SSRsare very informative in forensics, population and conservation genetics phylogeography, genomic mapping and determinning parentage inL. mongolica. From Table 5 it can be seen that the number of potential SSRs with 10 repeats was 478, followed by 11 (274), 5 (237) and 6 (207). There were 147 potential SSRs with more than 15 repeats, and all these motifs were mono-or dinucleotide repeats.
Relative to whole-genome sequencing of Eukaryota, transcriptome sequencing is a faster and more low-cost approach to develop molecular SSR markers for genetic diversity, cultivar identification and breeding programmes. SSR markers have been used to trace pedigree relationships of cultivars [45]. Researchers applied SSR  markers to identify wild populations which were rich in diversity and private alleles and good candidates for conservation [46].

Conclusions
In this study, we comprehensively characterized the L. mongolica transcriptome by next-generation Illumina sequencing and de novo analysis without a reference genome. The results showed that Illumina paired-end sequencing by the strategy PE150 is feasible. Candidate genes potentially related to terpenoid, steroid and unsaturated fatty acids biosynthesis were identified. Seven characteristic carbohydrate-active enzymes subfamily compared to other edible mushroom were discovered. Furthermore, in all of the sequences, a total number of 1860 SSRs were identified as molecular markers in L. mongolica. The annotation information associated with the genomic sequences will expedite the further study of the physiology and ecology of this specie.

Acknowledgement
We would like to express our gratitude to professor Mei Ya and doctor Liang Guo in the Xilin Gol institute of bioengineering who helped us during the writing of the thesis.