Integration of transcriptomics and metabolomics to identify key coumarin biosynthetic genes in Bupleurum chinense

Abstract Bupleurum chinense DC. is a perennial herb plant belonging to the Umbelliferae family, genus Bupleurum, that is used as a traditional Chinese medicine. Coumarins are characteristic secondary metabolites of the Apiaceous family, which possess anti-inflammatory, anti-cancer and antiviral activities, among others. Few studies on the coumarin biosynthesis pathways in B. Chinese have been reported. In this study, we identified 89 differentially expressed genes belonging to 7 gene families involved in the synthesis of coumarins in B. chinense, using transcriptome and metabolomic data obtained by high-throughput RNA-sequencing data analysis, and liquid chromatography tandem mass spectrometry. Five coumarin compounds: esculetin (2), fraxetin (4), psoralen (5), parmesan (6) and angelicin (7) were identified through metabolomic analysis. A total of 39 unigenes belonging to five gene families: shikimate O-hydroxycinnamoyltransferase (HCT), caffeic acid ο-methyltransferase (COMT), 4-coumarate: coenzyme A ligase (4CL), feruloyl-CoA 6’-hydroxylase (F6’H) and psoralen synthase (PS), were identified by correlating coumarin content with transcriptional changes. According to correlation network analysis, PS genes (Bc09210, Bc12711), COMT genes (Bc14830, Bc17539, Bc28792, Bc32856) and HCT genes (Bc32814, Bc10245) may be important factors affecting coumarin biosynthesis.

Currently, high-throughput integrated analysis technology is an essential means of obtaining valuable biological information. Transcriptomics, metabolomics and their combination are important aspects of systems biology [26], which have been applied to studies of Bupleurum. Illumina HiSeqTM 2500 high-throughput sequencing technology was used for Bupleurum transcriptome sequencing, obtaining 6.52 gB of clean data, and 59,288 unigenes [27]. Chloroplast genome sequences from B. chinense DC. and B. boissieuanum H. Wolff were sequenced on an Illumina HiSeq X Ten platform, and comparative genomics and phylogenetic analysis were carried out based on sequencing data [28]. The coumarin biosynthetic pathways of other plants have also been studied. One hundred and twenty-nine up-regulated differentially expressed genes (Degs) (encoding BgA (99 Degs), 4CL (2 Degs), HCT (13 Degs), COMT (7 Degs) and F60H1 (8 Degs)) involved in coumarin biosynthesis and accumulation were identified by analyzing the transcriptome data of flower, leaf, root and stem tissues of Cnidium monnieri L. Cuss [24]. A large amount of transcriptomic data for Peucedanum praeruptorum Dunn was assembled to identify putative CYP450 and multidrug resistance (MDR) genes involved in coumarin compound biosynthesis and transport [21]. To date, coumarins have been under-studied in B. chinense. Thus the biosynthetic pathway remains to be studied, and key regulatory genes remain unclear.
To identify the functional genes regulating coumarin biosynthesis in B. chinense, this study performed metabolome analysis on B. chinense root, stem, leaf and flower tissues at flowering stage. Transcriptome analyses were also performed. We screened to identify key genes involved in regulatory networks impacting coumarin biosynthesis by correlating transcriptome and metabolome analysis results.

Plant materials
The commercial B. chinense variety Chuanbeichai No. After cleaning with ultrapure water, three biological replicates of each sample type were immediately frozen in liquid nitrogen and stored in an ultra-low freezer at −80 °C.

Metabolite extraction and HPLC-MS/MS analysis
One hundred milligrams each of 12 tissue samples (four organs × three biological replicates) were placed into eP tubes, snap frozen in liquid nitrogen, ground, then suspended in 500 μL 80% methanol (Thermo, United States) aqueous solution containing 0.1% formic acid (Thermo, United States), and vortexed. All samples were placed in an ice bath for 5 min, then centrifuged at 15000 rpm (Thermo ST16R centrifuge, United States), 4 °C for 10 min. A portion of each supernatant was diluted to a final concentration containing 53% methanol, using 1/2 volume LC-MS grade water. Samples were centrifuged at 15000 rpm, 4 °C for 20 min in a fresh eppendorf tube, and supernatants were collected for analyses by high-per formance liquid chromatography-tandem mass spectrometry (HPLC-MS/ MS). equal volumes were taken from each sample and blended for a quality control (qC) reference sample. An exionLC™ AD system (SCIeX, United States) coupled with a qTRAP® 6500+ mass spectrometer (SCIeX) [29,30] were used for HPLC-MS/MS analysis. A BeH C8 Column (100 mm ×2.1 mm, 1.9 μm, Waters, United States) was used with a 30-min linear gradient at a flow rate of 0.35 mL/min in positive polarity mode (eSI+). An aHSS T3 Column (100 mm ×2.1 mm) was used with a 25-min linear gradient at a flow rate of 0.35 mL/min in negative polarity mode (eSI-). In both modes, the mobile phase consists of solvent A (0.1% formic acid-water) and solvent B (0.1% formic acidacetonitrile). The solvent gradient was as follows: 5% B (2% in eSI-), 1 min; 5%-100% B (2%-100% in eSI-), 24.0 min (18 min in eSI-); 100% B, 28.0 min (22 min in eSI-); 100%-5% B, 28.1 min (22.1 min in eSI-); 5% B, 30 min (25 min in eSI-). A qTRAP® 6500+ mass spectrometer was operated in eSI + and eSI-modes with Curtain gas at 35 psi, Collision gas at Medium, an IonSpray Voltage of 5500 V (-4500 V), Temperature of 500 °C, Ion Source gas of 1:55. Data files were integrated and peaks corrected with SCIeX OS Version 1.4 software.

Iso-seq and transcriptome analyses
RNA was extracted from four different organs, and transcriptome sequencing was carried out according to the Isoform Sequencing protocol (Iso-Seq) using a Clontech SMARTer PCR cDNA Synthesis Kit (Clontech, United States) and the BluePippin Size Selection System protocol described by Pacific Biosciences (Pacific Biosciences, California, United States). Sequencing data was processed using SMRTlink 5.1 software. Additional nucleotide errors were corrected using Illumina RNAseq data with LoRDeC software [39], and CD-HIT [40] was used to remove redundancies in corrected consensus reads. We used high confidence protein sequences derived from this species or closely related species for ANgeL training, then ran ANgeL to predict and analyze coding sequences (CDS) [41]. CNCI [42], CPC [43], Pfam [44] and PLeK [45] were used to predict the coding potential of transcripts. Plant transcription factors (TFs) were predicted using iTAK software [46]. Iso-seq and transcriptome analyses were both completed by the Novogene Bioinformatics Institute (Novogene, Beijing, China).

Gene functional annotation and analyses
After redundancies were removed with CD-HIT software, gene function was annotated using the following databases: NCBI non-redundant protein sequences (NR) [47], NCBI non-redundant nucleotide sequences (NT), gene Ontology (gO) [48], Cluster of Orthologous groups of proteins (COg) [49], Kyoto encyclopedia of genes and genomes (Kegg) [50], Protein family (Pfam), followed by the manually annotated and reviewed protein sequence database (Swiss-Prot) [51]. BLAST was used for alignments, with an e-value of '1e −10′ in NT database analysis, and Diamond BLASTX was used with an e-value of '1e −10′ in NR, KOg, Swiss-Prot and Kegg database analyses. Hmmscan was used for Pfam database analysis. The expected number of Fragments Per Kilobase of transcript sequence per Millions base pairs sequenced (FPKM) transformation was determined using the readcount data of each sample compared to each gene, as counted by bowtie2 of RSeM [52] software in end-toend, sensitive mode. Other default parameters were used for comparison results, after correction of consensus sequence redundancy. The average Fragments Per Kilobase Million (FPKM) values of genes were converted to log 2 values, and the Heatmap package in R (3.6.2) was used for visualization. Differential expression analysis was performed using the DeSeq2 R package (3.6.2) [53]. We used a model based on the negative binomial distribution by DeSeq2 to determine differential expression from digital gene expression data. The Benjamin and Hochberg approaches were used to control the false discovery rate and adjust the obtained P values. genes with adjusted P-values < 0.05 found by DeSeq2 were designated as differentially expressed. gO enrichment analysis of differentially expressed genes was carried out with the gOseq R package (3.6.2), and KOBAS software was used to test the statistical enrichment of differentially expressed genes (Degs) in Kegg pathways.

Integrative metabolome and transcriptome analyses
A regulatory factor network was constructed using the Pearson correlation algorithm on TF related genes and metabolites. In four different organs, the average content of metabolomic data and the average value of differentially expressed genes in transcriptomic data were used for correlation analyses, and the fold changes in metabolomic and transcriptomic data in each organ were compared. The log 2 (fold change) of each metabolite, and log 2 (fold change) of each transcript were calculated using eXCeL, and a correlation coefficient of R 2 > 0.9 (P-value < 0.05) was selected. Cytoscape (version 3.6.1) [54] software was used to visualize metabolites and transcription factors.

Coumarins compounds of metabolites
In this study, 5 metabolites involved in coumarin biosynthesis pathways were found. These were fraxetin, angelicin, psoralen, esculetin and marmesin (Table 1). Significant differences in fraxetin and angelicin were observed between root and stem, and between root and leaf, and differences in psoralen were observed between root and flower (VIP >1.0, P-value <0.05). The fraxetin and angelicin contents in roots were higher than those in leaf, stem and flower samples. The esculetin and marmesin contents were higher in leaves than in other tissues. The psoralen content was highest in flowers.

Unigenes involved in coumarins biosynthesis
Coumarin biosynthesis mainly occurs through the phenylalanine pathway. FPKM values in various organs were used to indicate levels of differential expression. eighty-nine differentially expressed genes related to enzymes in coumarin biosynthesis pathways were identified, including members of seven gene families: PAL  Table S1). We produced a heatmap to show the relative expression of different genes in root, stem, leaf and flower tissues. A portion of these genes are obviously highly expressed in roots, especially the C4H and F6'H families. Some other gene families are highly expressed in flowers, including two HCT genes, sixteen COMT genes, one 4CL gene and three PS genes. The remaining genes were highly expressed in stems or leaves.

Integrated analysis of gene expression and metabolome data
Pearson correlation analysis between 5 coumarins and 89 unigenes was carried out, yielding strong correlation coefficient values (P-value <0.05) between 39 unigenes and 3 metabolites. An interaction network between metabolites and unigenes associated with coumarin biosynthesis pathways was established based on the results in root, stem, leaf and flower tissues in B. chinense, as described above (Figure 4). Differences in unigene and metabolite distribution in different organs, and their relationships to each other can be found in the network. Among these, the contents of fraxetin and angelicin were obviously highest in roots, while the psoralen content was highest in flowers. Bc12711 (encoding PS), Bc14830 (encoding COMT), Bc32814 (encoding HCT) displayed their strongest correlations with fraxetin, Bc10245 (encoding HCT) displayed its strongest correlation with angelicin. The strongest correlations with psoralen in flowers were seen with Bc09210 (encoding PS), Bc17539 (encoding COMT), Bc28792 (encoding COMT) and Bc32856 (encoding COMT), as these are most abundantly expressed in flowers.

Discussion
As a traditional Chinese herbal medicine, Bupleurum possesses important medicinal properties. Most studies of Bupleurum to date have focussed on its active components, including saikosaponin and volatile oils. There is currently a lack of research into coumarin biosynthesis. In the present study, 12 specimens were comprehensively analyzed for metabolome and transcriptome data for root, stem, leaf and flower tissues. Correlation analyses between 89 genes and 5 coumarins identified 39 genes that were significantly correlated with 3 coumarins (fraxetin, angelicin and psoralen). We used HPLC and high-resolution MS, solid-phase extraction (SPe) with LC-MS, ultraviolet spectrophotometry, and other biophysical techniques for quantitative coumarin studies [55][56][57]. Metabolites, especially the coumarins fraxetin, psoralen, angelicin, esculetin and marmesin in root, stem, leaf and flower of Bupleurum were detected with HPLC-MS/MS (Table 1). In CBC1, each of these metabolites produced a different performance result. Fraxetin is most abundant in roots, as it is in Arabidopsis. Other studies have shown that fraxetin chelates and mobilizes Fe most efficiently [58]. Our results are thus consistent with the general finding of large amounts of iron in plant roots [59]. The relative content of angelicin, which is formed by angelicin synthase catalyzed conversion of (+) -Columbianetin [60,61] in different organs of CBC1 displayed the same pattern as that of fraxetin (root > flower > stem > leaf ). The psoralen contents in root, stem, leaf and seed were comparable to results from a study of Psoralea corylifolia, which found that the psoralen content was highest in fruits, and lowest in roots [62]. In this study, the psoralen content was assayed in root, stem, leaf and flower tissues, and was found to be highest in flowers and lowest in roots. esculetin and marmesin were also detected in our samples, displaying their highest content in leaves, but the esculetin content was lowest in roots, and the marmesin content was lowest in stems.
The five metabolites in the present study can all be synthesized through the phenylalanine metabolic pathway (Figure 3), and can be regulated by PAL, C4H, HCT, COMT, 4CL, F6'H, PS gene family enzymes. One gene (Bc16375) in the PAL gene family displayed significantly upregulated expression in stems alone. The Bc36725 gene displayed significantly downregulated expression in roots. In addition, one C4H gene, eight HCT genes, nineteen COMT genes, one 4CL gene, eight PS genes, and fourteen F6'H genes displayed significantly upregulated expression in roots. Four HCT genes, seventeen COMT genes, one 4CL gene, and three PS genes displayed elevated expression in flowers.
According to transcriptome and correlation analyses based on FPKM values, 39 unigenes related to coumarin biosynthesis from 5 gene families (HCT, COMT, 4CL, F6'H and PS) were identified (Figure 4). These genes have different correlations with coumarin metabolites in the four examined tissues. Bc10245 and Bc02075, from the HCT family, correlated significantly (P-value <0.05) with angelicin. Thirteen unigenes (Bc79689 from the HCT family, Bc09210 and Bc20312 from the PS family, and Bc17539, Bc32856, c28792, Bc17945, Bc11015, Bc27729, Bc35437, Bc17695, Bc34074 and Bc30832 from the COMT family) correlated significantly with psoralen. The remaining 24 unigenes correlated significantly with fraxetin. In previous studies, PAL, 4CL, HTC and F6′H were identified as important enzymes for scopoletin biosynthesis [63]. Scopoletin is synthesized through the phenylpropanoid pathway, which is the largest biosynthetic pathway among all pathways involved in secondary metabolite biosynthesis. Most of these transcripts play an important role in coumarin synthesis via this pathway [64]. In the phenylalanine metabolic pathway, C4H and 4CL can convert cinnamate into p-coumaroyl-CoA via hydroxylation, methylation and dehydration [65], and their activity levels can change in coordinated fashion in different development stages [66]. The results of this study show that both Bc51139 (encoding C4H) and Bc28789 (encoding 4CL) were highly expressed in the root, and Bc28789 correlated significantly with fraxetin (p-value =0.01). Most research into phenylpropanoid metabolism has focussed on COMT, which is . putative biosynthetic pathways of coumarins in Bupleurum. pal, phenylalanine ammonia lyase; c4h, cinnamate 4-hydroxylase; c3h, ρ-coumarate 3-hydroxylase; hct, shikimate o-hydroxycinnamoyltransferase; comt, caffeic acid ο-methyltransferase; 4cl, 4-coumarate: coenzyme a ligase; F6'h, feruloyl-coa 6'-hydroxylase; pS,psoralen synthase.; c2'h, ρ-coumaroyl coa 2'-hydroxylase; u-8-p, umbelliferone 8-prenyltransferase; cS, columbianetin synthase; aS, angelicin synthase; ca2h, caffeic acid 2-hydroxylase; omt, ο-methyl-transferase; S8h, scopoletin 8-hydroxylase; *, no corresponding gene was found. the first plant small molecule OMT to be described [67], or have focussed on an important protein that participates in lignin synthesis [68], but is also important for coumarin biosynthesis. In total, 13 and 10 unigenes are significantly associated with fraxetin and psoralen, respectively. Among these, expression levels of Bc14830, which correlates with fraxetin, are highest in the roots. The expression levels of Bc17539, Bc28792 and Bc32856 are higher in the flowers, and are all significantly correlated with psoralen. This suggests that COMT may regulate psoralen biosynthesis. PS (a cytochrome P450) can directly convert marmesin into psoralen [69], consistent with the results of the present study, which indicate that genes PS are highly expressed in flowers and significantly correlate with psoralen. The coumarin content generally correlates positively with expression levels of related enzymes, and various enzymes may interact with each other and ultimately affect coumarin synthesis. The F6′H gene could be over-expressed to increase the cellular scopoletin content [64], consistent with our results, that characterized most of the F6 'H genes as having their highest expression in roots, corresponding with highest fraxetin content occurring in roots.
In brief, transcriptomic and metabonomic analyses were performed on root, stem, leaf and flower tissues of adult CBC1 plants in this study. enzymes involved in coumarin biosynthesis pathways were analyzed based on transcriptomic and metabonomic analysis results. Comparing the metabolite levels with related unigene expression levels indicated 8 unigenes that may play important roles in regulating coumarin biosynthesis. These results can be used to inform cultivation and breeding of new B. chinense cultivars in the future and can also be helpful for other research and related studies of other plant species.

Conclusions
In conclusion, we explored the regulatory network associated with coumarin biosynthesis using integrated analysis involving the metabolome and transcriptome of four tissues from CBC1. We also found that genes in COMT, HCT, PS, F6'H and 4CL families played an important role in regulating coumarin biosynthesis in B. chinense. In addition, Bc09210 (encoding PS), BC12711 (encoding PS), BC14830 (encoding COMT), BC32814 (encoding HCT), BC17539 (encoding COMT), BC28792 (encoding COMT), BC32856 (encoding COMT) and BC10245 (encoding HCT) may be key regulatory factors, making them of interest for future studies of coumarins in B. chinense.

Disclosure statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Data availability statement
All data generated or analyzed during this study are included in this published article and its supplementary information files, further inquiries can be directed to the corresponding author.