Analysis of the complete chloroplast genomes of Scutellaria tsinyunensis and Scutellaria tuberifera (Lamiaceae)

Abstract Scutellaria Linn. is a perennial herb with about 300 species. This genus has high medicinal value and many are used in Traditional Chinese Medicine (TCM). In this study, we sequenced and assembled the complete chloroplast genomes of Scutellaria tsinyunensis and S. tuberifera. Subsequently, we conducted a comprehensive comparative genomics analysis with 12 other published Scutellaria species. These genomes all had a conserved quartile structure, and the gene contents, gene sequences and GC contents are highly similar. The study on the genetic characteristics and nucleotide substitution rate of different genes found that the protein-coding genes of chloroplasts have differed greatly. Most genes are under purifying selection, but the rps12 gene may have undergone positive selection. Besides, we identified three hypervariable regions as potential markers for Scutellaria taxa, which could play an important role in species identification of Scutellaria. Phylogenetic analysis showed that the 14 Scutellaria taxa were divided into two major clades. Moreover, the variation of IR regions is closely related to the evolutionary history as was reconstructed based on SNPs. In conclusion, we provided two high-quality chloroplast reference genomes of Scutellaria, this reliable information and genomic resources are valuable for developing of efficient DNA barcodes as reconstruction of chloroplast evolutionary history of the genus.


Introduction
Scutellaria Linn. is a perennial herb of about 300 species, which belongs to the family Lamiaceae. Scutellaria plants are widely distributed throughout the world except for tropical Africa. Several species from Scutellaria are used in Traditional Chinese Medicine (TCM) with the functions of clearing away heat and dampness, purging internal heat, and detoxification (Zhao T et al. 2019). For instance, the dried roots of S. baicalensis, also known as 'Huang Qin', are used for liver and lung complaints and even used for complementary cancer treatments (EghbaliFeriz et al. 2018;Wang CZ et al. 2020). Phytochemical studies have shown that the main active compounds of Scutellaria species are a series of flavonoids, include wogonin, wogonoside, baicalin, and baicalein (Wang ZL et al. 2018;. By now, the research on Scutellaria taxa is mainly focused on chemical composition, medicine activity and biological technology (Wang ZL et al. 2018;). In particular for S. baicalensis, which is favored for excellent effect in disease treatment. However, the resource identification based on molecular phylogenetic studies is relatively scarce.
Chloroplast genome (referred to as cp genome in the following text) plays an important role in plant photosynthesis (Szab o and Spetea 2017) and are widely used in phylogenetic studies and species identification (Santos and Pereira 2018;). Due to its conservative genome structure and contents, the cp genome has become an ideal model for evolutionary and comparative genomic studies (Shin et al. 2016). Although the cp genome is relative conserved compared to the nuclear genomes, it also contains highly variable regions that were widely used as molecular markers (Liu ML et al. 2018;Liu X et al. 2018;Pang et al. 2019;Thakur et al. 2019). For instance, matK, rbcL, and trnH-psbA were used as the universal DNA barcodes for distinguishing species (de Vere et al. 2015;Guo et al. 2011;Yu et al. 2021). In a recent study, Zhao et al. (2020) reported 8 cp genomes of Scutellaria plants, which have greatly enriched the cp genome resources. However, cp genome sequencing is still inadequate in such a moderately large genus, and the comparative genomic analysis of cp genomes is incomplete.
In our study, we have sequenced two cp genomes of Scutellaria species, they are S. tsinyunensis C.Y. Wu & S. Chow and S. tuberifera C. Y. Wu et C. Chen. Among them, S. tsinyunensis is an endangered perennial herb endemic to Mt. Jinyun, Chongqing, China (Li and Hedge 1994). Subsequently, we conducted a comprehensive comparative genomics analysis with 12 other published Scutellaria taxa. In particular, we focused on the molecular evolution of chloroplast genomes, such as the expansion/contraction of IR regions, the evolution of protein-coding genes, and the identification of hypervariable regions. The entire cp genome sequences were used as a super-barcode to determine the phylogenetic position of Scutellaria plants.

Sampling, DNA extraction and sequencing
The fresh leaves of two Scutellaria species, S. tsinyunensis and S. tuberifera were collected from Mt. Jinyun, Chongqing (Geospatial coordinates: N29.842889, E106.394527) and Greenhouse 9, Southwest University, Chongqing (Geospatial coordinates: N29.817767, E106.421054), respectively. The samples have been deposited in the herbarium of Southwest University, Chongqing, China with the accession number: 20200320CQ-1 and 20200320CQ-2, respectively. The total genomic DNA was extracted by using CTAB method (Arseneau et al. 2017). The DNA library with an insert size of 350 bp was constructed using the NEBNext V R library building kit (Emerman et al. 2017) and sequenced by using the Hiseq Xten PE150 sequencing platform. Sequencing produced a total of 4.19 G and 5.23 G raw data. A total of 19,816,746 and 22,736,325 raw reads (2 Â 150 bp) were obtained. Clean data were obtained by removing low-quality sequences: sequences with a quality value of Q < 19 accounted for more than 50% of the total base, and sequences with more than 5% bases being 'N'.

Genome assembly and annotation
Genome assembly from the clean data was accomplished utilizing NOVOPlasty version 2.7.2 (Dierckxsens et al. 2017), with a k-mer length of 39 bp and a sequence fragment of the rbcL gene from maize as the seed sequence. The average basecoverage was 499.3 (S. tsinyunensis) and 506.6 (S. tuberifera). Then, we use Geneious version 8.1 (Auckland, New Zealand) (Kearse et al. 2012) to map all clean reads to the assembled genome sequence to verify whether the spliced contigs were correct. The cp genome was annotated initially by using CPGAVAS2 (Shi et al. 2019) using the reference dataset of 2544-plastomes. Geseq was then used to confirm the annotation results (Tillich et al. 2017). Furthermore, the annotations with problems were manually edited by using Apollo (Misra and Harris 2005).

Sequence analysis and genome comparison
The GC content was conducted by using the cusp program provided by EMBOSS version 6.3.1 (Rice et al. 2000). IRscope (https://irscope.shinyapps.io/irapp/) was used for visualizing the IR boundaries in these cp genomes (Amiryousefi et al. 2018). A total of 78 orthologous genes and 89 intergenic spacer regions (IGSs) among 14 Scutellaria species were identified and extracted by using Phylosuite version 1.2.1 . The corresponding nucleotide sequences were aligned by using MAFFT version 7.450 (https://mafft. cbrc.jp/alignment/server/) (Rozewicki et al. 2019) implemented in Phylosuite. We used MEGA version 6.0 (Tamura et al. 2013) to calculate the percentage of variable sites (PV) in protein-coding genes and the pairwise K2-P distance in IGSs. Then, we used DnaSP version 6.0 (Rozas et al. 2017) to calculate the nucleotide diversity (Pi) among the proteincoding sequence.

Nucleotide substitution rate analysis
The protein-coding sequences in the previous step were processed in parallel. We used the CODEML module in PAML version 4.9 (Yang 2007) to estimate rates of nucleotide substitution, including dN (nonsynonymous), dS (synonymous), and the ratio of nonsynonymous to synonymous rates (dN/ dS). The detailed parameters were: CodonFreq ¼ 2 (F3 Â 4 model); model ¼ 0 (allowing a single dN/dS value to vary among branches); cleandata ¼ 1 (remove sites with ambiguity data); other parameters in the CODEML control file were left at default settings. The phylogeny tree structure of each gene was generated by using the maximum-likelihood (ML) method implemented in RaxML version 8.2.4 (Stamatakis 2014).

Phylogenetic analysis
The cp genome sequences of 14 species belonging to Lamiaceae were downloaded from GenBank (Table S1). Two species (Lamium album and Stachys byzantina) were used as outgroups. A total of 16 complete cp genome sequences were aligned by using MAFFT version 7.450 online version with default setting (Rozewicki et al. 2019). These aligned sequences were used to construct the phylogenetic trees by using the ML method implemented in RaxML version 8.2.4 (Stamatakis 2014). The parameters were 'raxmlHPC-PTHREADS-SSE3 -f a -N 1000 -m GTRGAMMA -x 551314260 -p 551314260'. The bootstrap analysis was performed with 1000 replicates.

General features of cp genomes
The cp genomes of Scutellaria species are characterized by a typical circular DNA molecule with the length of 151,675-152,417 bp. It has a conservative quartile structure which is composed of a LSC region (83,891-84,608 bp), an SSC region (17,305-17,570 bp), and a pair of IR regions (25,208-25,255 bp) ( Table 1). The GC content analysis showed that the overall GC contents ranged from 38.3% to 38.4% in the 14 cp genomes.
The cp genomes encode a large number of genes. Take S. tsinyunensis for example, the cp genomes comprise 134 genes. Among which, 114 are unique genes, including 80 protein-coding genes, four rRNAs, and 30 tRNAs (Table 2). Figure 1 shows the schematic diagram of the cp genomes of S. tsinyunensis. This result is similar to that of other species in this genus (Jiang et al. 2017;Lee and Kim 2019). In one particular case, two protein-coding genes of S. kingiana, ndhD, and ndhF, are encounter termination codons in advance within the coding frame. As two pseudogenes, they cannot translate the normal protein products. The two genes were not included in subsequent analysis.

Contraction and expansion analysis of IR regions
We observed four genes are span the boundary regions in all 14 species, they are trnH, rps19, ndhF, and ycf1 ( Figure 2). Extensively comparative analysis observed the location of these four genes of Scutellaria species is slightly different. Based on these differences, we divide them into two types (three subtypes). For gene rps19, it overlaps with the IRb regions by 41 bp in type I. However, in type II, the overlap is 46 bp (type IIa) or more than 50 bp (type IIb). For gene ndhF, most sequences are located in SSC regions, it also overlaps with the IRb regions by 32 bp in type I except for S. quadrilobulata (25 bp). In type II, the overlap is 45 bp (type IIa) or 25 bp (type IIb). The variation of ycf1 genes is quite different, and it did not show an obvious classified pattern. It may be related to the high mutation rates of ycf1. It is worth noting that ndhF gene is a pseudogene in S. kinglana.
Interestingly, the ndhF genes cross the border of IRb/SSC, and we observed overlaps of ndhF and the first copy of ycf1. The length of the overlapping regions ranged from 25 to 35 bp in type I and type IIb, but over 120 bp in two species from type IIa, indicating that type I is close to type IIb, and they are quite different from type IIa.

Genetic characteristics of protein-coding genes
In our study, the Pi value and PV value were highly similar in all 78 genes (Figure 3(A)). The Pi value (0.0190) and PV value (5.7550) of ycf1 were all the highest. Other genes with high nucleotide polymorphism were rpl32 (0.0176, 5.7471), rps16 (0.0168, 4.9242), and rpl22 (0.0138, 4.1850). The Pi value and PV value were given in parentheses one by one, respectively Conserves open reading frames ycf1, ycf15 (x2), ycf3, ycf2 (x2), ycf4 Gene Fragments (pseudogene) ycf1, rps19, ndhD Ã , ndhF Ã Note. The '(x2)' indicates that the gene located in the IRs and thus had two complete copies. The ' Ã ' indicates that it was a pseudogene only in S. kingiana.  (Table S2). Five genes (ycf15, petN, psbE, psbN, and rpl23) did not have any variable sites and they are highly conserved. The rates of synonymy (dS) and non-synonymous (dN) substitution rates and their ratios (dN/dS) of 78 orthologous genes were estimated to detect the heterogeneity of substitution rates. Among the 78 genes, rps12, ycf1, rpl22, and psbK had higher dN values, which were 0.0652, 0.0604, 0.0591, and 0.0432, respectively. The dS value of rpl32 was the highest at 0.2896 (Figure 3(B), Table S3). The dN/dS value of most genes was less than 0.6, indicating that they have been under purifying selection during evolution. It is worth noting that the dN/dS value of rps12 gene reaches 1.7814, which is likely to undergo positive selection. Other genes with higher dN/dS values are cemA (0.8926), ycf3 (0.7090), ycf1 (0.6734), ccsA (0.6698), and matK (0.6406), which are all active genes in the process of evolution.

Identification of hypervariable regions
Considering that the protein-coding genes are extremely conserved, we are more focused on the IGSs. As shown in Figure 4, Figure 1. Graphic representation of features identified in the cp genomes of Scutellaria plants by using CPGAVAS2. Taking S. tsinyunensis as an example, the map contains four rings. From the center going outward, the first circle shows the forward and reverse repeats connected with red and green arcs, respectively. The next circle shows the tandem repeats marked with short bars. The third circle shows the microsatellite sequences identified using MISA. The fourth circle is drawn using drawgenemap and shows the gene structure on the cp genomes. The genes were colored based on their functional categories, which are shown in the left corner. Label intron-containing genes with Ã . the K2P distances of the 89 IGSs were quite different. The maximum, minimum and mean K2-P distance showed significant differences in three IGS, which are ndhF-rpl32 (5.8178), trnL-UAG-ccsA (6.1056), and rpl32-trnL-UAG (10.5154). The mean was given in the parentheses, and the details are shown in Table 3. The above three IGSs could be used as potential DNA barcodes. Other IGSs with larger differences were trnH-GUG-psbA, rpl16-rps3, trnC-GCA-petN, and psaC-ndhE, which could be used as candidate hypervariable regions.

Phylogenetic analysis
In this study, we selected two outgroups and analyzed the phylogenetic relationships of 14 Scutellaria species. The 16 complete cp genome sequences were used for constructing a ML tree. The phylogenetic trees have high bootstrap support values (100) on most nodes except for three nodes, showing the reliability of the phylogeny recovered ( Figure 5). Our phylogenetic trees displayed two clades clearly, and then further diversified into different subclades. Among the two clades, five species (S. baicalensis, S. Amoena, S. Kingiana, S. Altaica, and S. Przewalskii) were clustered, and the other nine Scutellaria taxa clustered together. In the two species that we sequenced, S. tsinyunensis had the closest relationship with S. quadrilobulata, and S. tuberifera had the closest relationship with S. lateriflora. These results exhibited that the whole cp genome sequences can be used as a super-barcode for species identification with extremely high resolution at the species level.

Discussion
Here, we sequenced and assembled the complete chloroplast genomes of S. tsinyunensis and S. tuberifera, which were highly similar to previously published one (Jiang et al. 2017;Lee and Kim 2019). This result suggested the cp genomes were highly conserved in Scutellaria.
The contraction and expansion of IR regions are considered to be an important reason for the length diversity in cp genomes (Goulding et al. 1996). In the comparative analysis, although no significant differences were observed in Scutellaria, we were able to divide the cp genomes into several basic types based on the subtle differences of IR regions. It is worth noting that the dynamic changes of genes near the IR boundary are consistent with the topology of the phylogenetic tree, indicating that, structure variation, e.g. the shift of IR boundaries, reflected similar evolutionary history as was reconstructed based on SNPs.
In previous studies, the evolution of plastid protein-coding genes in Scutellaria was rarely involved. The 14 samples allowed us to conduct a wide range of plastid gene studies in Scutellaria, and we calculated the nucleotide substitution rates to understand the evolution rates of plastid genes in Scutellaria species. We found a wide range of heterogeneous evolutionary rates of the plastid genes. Some genes have higher mutation rates, such as ycf1, rps16, and rpl32. While some are highly conserved and have no mutation sites. This difference in the rates of evolution is important for studying the molecular evolution of chloroplasts, because it is usually related to the purifying or positive selection of genes during evolution. In particular, we observed that the dN/dS value of rps12 gene was greater than 1, suggesting that the gene might have undergone positive selection during evolution processes. As part of the 30S ribosomal subunit, rps12 had the function of rRNA binding, and it is also the only transsplicing gene in the chloroplast. By contrast, the evolution rate of non-coding regions of the cp genome is generally higher than that of protein-coding regions. The study of noncoding regions or IGSs is helpful for us to find appropriate species-specific DNA barcodes. Based on the results of K2-P distance, we recommended three hypervariable regions, including ndhF-rpl32, trnL-UAG-ccsA, and rpl32-trnL-UAG. These markers could be used to distinguish different species of the genus or even different individuals of the same species.
In summary, our results enrich the data on the cp genomes of Scutellaria and provide the basis for the phylogenetic reconstruction of Scutellaria. We have carried out in-depth studies on plastid genes and deepened our understanding of plastid genes of Scutellaria taxa.   The 'Â in each boxplot represents the average K2-P distance. Three IGS had mean K2-P distance over 5, they are ndhF-rpl32 (5.8178), trnL-UAG-ccsA (6.1056) and rpl32-trnL-UAG (10.5154).