Structure and features of the complete chloroplast genome of Salix triandroides (Salicaceae)

Abstract Salix triandroides W. P. Fang, an indigenous small willow with high potential for heavy-metal soil remediation and levee protection, has been used for wetland ecological restoration for decades. However, studies on the evolution and genetic breeding of S. triandoides have been hindered by the lack of its genetic information. Here, we present the complete chloroplast genome of S. triandroides, which was assembled from Illumina sequencing data. The chloroplast genome of this species is circular, 155,683 bp in length and includes two inverted repeat regions (IRs) (27,490 bp) separated by a large single-copy (LSC) region (84,463 bp) and a small single-copy (SSC) region (16,240 bp). A total of 126 unique genes were identified, including 81 protein-coding genes, 37 tRNA genes and eight rRNA genes. Comparative analysis identified some hypervariable regions, with potential to be used as specific DNA barcodes or candidate markers for phylogenetic studies. Based on the sequences of the protein-coding genes, the phylogenetic analysis assigned 32 Salix species into two major clades and revealed that S. triandoides was a sister taxon to S. triandra. Our results provide a foundation for further molecular breeding of S. triandroides and insights into the evolution of Salix. Supplemental data for this article is available online at https://doi.org/10.1080/13102818.2021.2023326 .


Introduction
Salix L. (often know as willows) is the largest genus of the family Salicaceae and has 450-520 species [1][2][3]. Salix species are distributed mainly in the Northern Hemisphere [2] and over half of the willow species (275) could be found in China [4,5]. Salix species have been used for ecological remediation and restoration in heavy metal-contaminated soil because these trees are highly productive and thrive on many nutrient-poor soils [6][7][8]. Salix triandroides W. P. Fang, an indigenous small willow, has been used to remediate Cd-contaminated soils for decades [9,10]. Being highly adaptable in wet conditions, S. triandroides are often planted for wave reduction, levee protection and biodiversity conservation in wetlands [11]. Previous studies on this species are few and mainly focus on effects of various stress factors (e.g. flooding and heavy-metal toxicity). So far few genomic studies have been documented, except for phylogenetic analysis with a few short gene sequences [3].
The plant chloroplast genome contains a typical quadripartite construction, which contains two inverted repeat regions (IRa and IRb), each insulated by a large single-copy (LSC) at one end and by a small single-copy (SSC) at the other end. The chloroplast genome is a haploid and maternally inherited in angiosperms. It is a good complement to the nuclear genetic markers and has been widely employed for phylogenetic analysis and domestication studies of higher plants [12][13][14]. The complete chloroplast genome sequences also show potential in understanding structural and functional evolutionary pathways [12,15]. In addition, chloroplast genetic engineering has been used for improving crops productivity and stress-resistance [16]. In genus Salix, nearly 30 chloroplast genomes have been published, but the chloroplast genome of S. triandroides has not yet been reported.
Willows; wetland protection plants; chloroplast genome; mutational hotspots; phylogenetic relationship In this study, we de novo sequenced and assembled a complete chloroplast genome of S. triandroides. In parallel, we assembled two chloroplast genomes of two other closely related species, S. brachista and S. dunnii, from published Illumina sequencing data [17,18]. Then we characterized their genomic structures and features including the gene content, repeated sequences and IR expansions/contractions, etc. We detected sequence divergence and phylogenetic relationship in Salix by comparative chloroplast genome analyses. We also predicted mutation hotspot regions in Salix, which would be helpful for developing potential molecular markers. This study will also provide a powerful tool to accelerate breeding and biotechnological studies for S. triandroides.

Sample preparation, de novo sequencing and assembly
For chloroplast genome sequencing, we chose a mature and healthy S. triandroides tree from yueyang Forestry and ecology garden (yueyang City, Hunan Province, China). Total genomic DNA was extracted from fresh leaves following the CTAB-chloroform protocol [19]. We constructed short-insert libraries with sizes of 500 bp for sequencing on the Illumina HiSeq X Ten platform under the Pe 150 bp protocol. After quality assessment, reads were assembled into contigs using the program NOVOPlasty [20]. A ribulose-1, 5-bisphosphate carboxylase/oxygenase (rbcL) gene sequence from S. triandra (genBank accession no. mK722343) was used as the seed sequence, and complete chloroplast genome sequences of S. triandra and S. wilsonii (NC053549) were used as references to resolve the inverted repeat in the chloroplast genome of S. triandroides.
For the lack of available annotations of the chloroplast genome, we assembled the chloroplast genomes of another two Salix species (S. brachista and S. dunnii) which had been whole-genome sequenced [17,18]. These two chloroplast genomes were assembled with Illumina reads which were downloaded from the NCBI (National Center for Biotechnology Information) database (SRX9358501 and SRX4215130).

Genome annotation and sequence characterization
The assembled chloroplast genomes were annotated by a combination of geSeq [21] and CPgAVAS2 techniques [22]. The different annotations of protein-coding sequences were determined using BLASTx. The tRNAs were scanned with tRNAscan-Se v2.0.3 [23]. Final chloroplast genome maps were drawn using OgDRAW program [21]. The annotated chloroplast genomic sequence has been deposited in genBank with an accession number: mW929213-mW929215.
We acquired chloroplast genome sequences of Salix species and two outgroup species in genus Populus (Salicaceae) from genBank (Supplemental Table S1) to compare features among them and the chloroplast genome assembled in this study. We calculated the gC content for chloroplast genomes using an in-house perl script. The codon usage frequencies and the relative synonymous codon usage (RSCu) were analyzed for all protein-coding sequences using megA-X [24]. In addition, we predicted RNA editing sites in protein-coding genes using the PReP suite with a cut-off value of 0.8 [25]. To identify simple sequence repeats (SSRs), identified mono-, di-, tri-, tetra-, pentaand hexanucleotides with the minimum repeat number were set at 10, 5, 4, 3, 3 and 3, respectively, through the mIcroSAtellite (mISA) program. Redundant SSRs were manually checked and discarded. In addition, complex repeat sequences (i.e. forward, reverse, complement and palindromic) were identified using RePuter online software setting hamming distance ≤ 3 bp and the minimum repeat size ≥ 30 bp [26].

Comparison of the chloroplast genomes and divergent hotspot identification
To estimate the conservation of chloroplast genes, we compared six Salix species with the annotation of S. triandroides as the reference. We used mVISTA online software to visualize rearrangements and inversions in sequences through the Shuffle-LAgAN program [27]. Junction regions of IR-LSC and IR-SSC in the chloroplast genomes of these six Salix plants and Populus trichocarpa were compared by using IRscope [28].
For the nucleotide diversity analysis, we downloaded the chloroplast genomes of 29 Salix species from NCBI (Supplemental Table S1). The nucleotide diversity (Pi value) was calculated based on the sliding window analysis using the DnaSP v5.10 [29]. The window size was set to 600 bp with a step size of 100 bp. After identifying hypervariable regions by sliding widows, we recalculated Pi values for the complete genic or intergenic sequences of these potential hotspot regions.

Phylogenetic analysis
For the phylogenetic analysis, 32 Salix species and two outgroup Populus species were recruited to construct the phylogenetic tree. The whole chloroplast genome sequences of the analyzed species were aligned using mAFFT v7.475 with the default settings [30]. The best-fitting nucleotide substitution model was determined using the Akaike Information Criterion in the model-finder IQ-TRee [31]. mL analysis was performed using IQ-TRee with the best model of gTR + F + R2 and 1000 bootstrap replicates.

Organization of chloroplast genomes of Salix species
The chloroplast genomes of three Salix species (S. triandroides, S. brachista and S. dunnii) contain the typical quadripartite structures ( Figure 1; Table 1), which include a large single copy (LSC) region, two inverted repeats (IR) and a small single copy (SSC) region. These three Salix chloroplast genomes ranged from 155,600 bp to 155,683 bp in length (Table 1), with IRs ranging from 27,429 to 27,490 bp, LSCs ranging from 84,422 to 84,497 bp and SSCs ranging from 16,240 to 16,292 bp. each chloroplast genome encoded 37 transfer RNAs (tRNAs) and 8 ribosomal RNAs (rRNAs). The chloroplast genome of Salix dunnii included 82 protein-coding genes, while ycf1 around JSA (joint of SSC and IRa regions) was lacking in the other two species. In total, each chloroplast genome included 126 (ycf1 loss) or 127 (ycf1 present) genes. In addition, 17 intron-containing genes were detected in all genomes, and three of the intron-containing genes (rps12, pafI and clpP1) contained three exons ( Table 1). The total gC contents of these three assembled chloroplast genomes were 36.7% on average (Table 1).
To detect length variations, we compared the chloroplast genomes of nine Salix plants and nine Populus plants. There were higher genetic variabilities in the single copy regions (both LSC and SSC) than in the IR regions in Salix species. However, the IR regions diverged more than the SSC region in Populus species (Supplemental Table S2). In addition, we compared the total gC content of nine Salix plants and nine Populus plants. The Salix species have similar total gC contents to Populus species (average in 36.7%) (Supplemental Table S2).

Codon usage analysis and RNA editing in protein coding genes
In this study, we found that there were 23,879, 23,991 and 22,601 codons (including stop codons) in the chloroplast genomes of S. triandroides, S. brachista and S. dunnii, respectively. Twenty amino acids were translated from each genome. Leucine was the most abundant translated amino acid and corresponded to 2423 to 2545 codons, followed secondly by isoleucine corresponding to 1938 to 2076 codons.
Commonly, a RSCu (relative synonymous codon usage) larger than 1 means that the codon is a frequently used one, while a value less than 1 suggests a relatively less used codon [32]. A RSCu at 1 indicates there is no preference for codon use [32]. In this study, we found that there were 64 types of codons in the Salix chloroplast genome. RSCu showed high level of similarities in the three Salix chloroplast genomes ( Table 2). Leucine and isoleucine were the most abundant amino acids and had a frequency of 10.6% to 10.7% and 8.6% to 8.7% in these genomes. In each genome, there were 30 codons with RSCu values larger than 1, and 32 codons with RSCu values less than 1. In addition, the RSCu values of Aug and ugg were equal to 1, which suggests that these two types of codons have no bias in expression. The codon usage bias in Salix was similar to those in Garcinia species [33] and Theobroma species [14], but different from Lilium species [34].
There were 23 protein coding genes in the chloroplast genome of S. triandroides with a total of 58 RNA editing sites (Supplemental Table S3). Both S. brachista and S. dunnii contained 60 RNA editing sites, and they were distributed on 23 and 25 coding genes (Supplemental Table S4 and S5). All sites served the function of changing the nucleobases from Cytosine to uracil. Both ndhB and ndhD genes contained 11 RNA editing sites in these three chloroplast genomes.

Repeat sequences and SSR analysis
SSR molecular markers have been well-developed and frequently used in plant genetic analysis. In this study, mISA was used to analyze the SSRs in six Salix chloroplast genomes (Figure 2). S. dunnii had the largest number of SSRs at 275, whereas S. triandroides had the least number at 261 (Figure 2). The majority of these SSRs were mononucleotides (poly-A or poly-T mainly), with 194-202 members in these chloroplast genomes ( Figure 2). In each species, there were more than 50 dinucleotide repeats, about ten tetranucleotide repeats and at least one trinucleotide repeat. Pentanucleotide and hexanucleotide repeats were rarely found in some species. For most SSRs (more than 65%), the A/T content is much higher than that of g/C. It was consistent with the fact that the complete chloroplast genomes of Salix species are rich in A/T (> 63.2%).
We also detected long complex repetitive sequences including forward repeats, reverse repeats, palindromic repeats and complement repeats in these six species. most of the repeats were of 30-60 bp in length. The minimum number of repeats was found in S. brachista (36) whereas the maximum was found in S. dunnii (49) (Figure 3A). In addition, we studied the correlation between the repeat number and size of the genome for the nine species, and found that they were not related ( Figure 3B). These observations  strengthen the independence of repeats number from genome size.

Inverted repeats contraction and expansion
The inverted repeats contraction and expansion revealed variations at the junctions of any two of the regions (LSC/IRs/SSC). The comparative analysis indicated that the junction of LSC and IRb regions (JLB) and the junction of LSC and IRa regions (JLA) were conserved across species. In all species, rps19 is present completely in the LSC region and rpl22 starts in LSC regions and extends into the SSC region. A functional copy of ycf1 gene was observed at JSA in four species, whereas the truncated copy was found at the junction of IRb/SSC (JSB) in all species. At the  junction of LSC and IRa (JLA), trnH exists in the LSC and rps19 exists in the IRa (Figure 4).

Sequence divergence and hotspots
The mVISTA plots display sequence similarities, collinearity patterns and genetic variations among the chloroplast genomes from six Salix species ( Figure  5). Our results indicated that the gene number, order and orientation were conserved in Salix. The sequences in the IR regions were more conserved than those in the single copy (LSC and SSC) regions. This may be due to the conservative characteristics of rRNA genes, which were all distributed in IR regions. The sequences in the non-coding regions were more divergent than those in the coding regions. The most divergent non-coding regions among these chloroplast genomes were trnK-psbK, ndhC-trnM and psbE-petL. Additionally, we downloaded another 29 Salix chloroplast sequences from NCBI (Supplemental Table S1) and measured the nucleotide diversity by DNAsp to identify the mutation hotspot regions in the whole Salix chloroplast genomes ( Figure 6). Nucleotide sequences less than 600 bp long had diversity values varying from 0 to 0.01559. Highly diverged regions were shown in Figure 6. We extracted the whole sequences of highly diverged regions and calculated their Pi values ( Table 3). The region ndhC + trnV has the highest Pi values (Pi = 0.01295) followed by the other four spacer regions (Pi > 0.01) including trnS-lhbA-trnG-trnfM-rps14, trnF-ndhJ, rpoB-trnC-petN and atpH-atpI (Table 3). All these features were in the LSC region.

Phylogenetic analysis
To study the phylogenetic position of S. triandroides, we constructed a maximum likelihood (mL) phylogenetic tree by using the amino acid sequences of protein-coding genes of 32 Salix species (29 downloaded from NCBI database and three assembled in this study) with two Populus species as outgroups (Figure 7). Salix plants were clearly subdivided into two well-supported clades (boot-strap values = 100%), consistent with previous studies [35,36]. The phylogenetic analysis strongly supported that S. triandroides was closely related to S. triandra, and they were assigned to Clade I. This is congruent with previous research [3]. This analysis substantially increases our understanding of the evolutionary relationship among Salix species.

Discussion
In terms of species number, Salix is the largest genus in Salicaceae [37]. Due to its wide geographic distribution, strong adaptability and fast growth rates, most Salix species are economically valuable. Some of the Salix species have been cultivated as raw materials for a wide range of wood products, as well as for paper and pulp, rope making and others [38]. Due to the great demand of wood, genetic engineering within the chloroplast genome has been used as an effective way to improve plant productivity [16]. Chloroplast genome resources can facilitate genetic engineering in Salix species. Here, we obtained the complete chloroplast genome of S. triandroides using Illumina sequencing data.
The structure of the chloroplast genome of S. triandroides was conservative and displayed a classical quadripartite structure. The gene content, gene order and gene orientation in S. triandroides were conserved. Some length differences in the chloroplast genomes were observed among strains of both Salix and Populus species (Supplemental Table S2). In Salix, the length of LSC regions accounted for the size difference in the chloroplast genome, consistent with similar studies for Gossypium genus [39] and Chrysanthemum boreale strains [40]. In contrast, in Populus, the IR regions rather than the SSC regions contributed more to the variations in lengths of chloroplast genomes, which agree with a study on duckweed species (Lemnoideae) [41]. This indicated that variations in chloroplast size may be lineage-specific in Salicaceae. even though universal DNA barcodes have been used for phylogenetic analysis at various taxonomic levels, they often discriminate the closely related species poorly for the lack of variations and lower divergence [42]. In Salix, traditional chloroplast markers (Table 3) lack variability, the employment of which may lead to unsuccessful species identification [36,43,44]. Chloroplast genome sequences provide an efficient tool for taxonomic studies and provide valuable genetic information for subsequent analysis. The mutation events are unevenly distributed within the chloroplast genome and are concentrated in certain regions, called 'hotspot' regions [42]. In this study, we identified some hotspot regions ( Figure 6; Table 3) for 32 Salix species. These hypervariable regions could be developed to be specific DNA barcodes for further studies.
In Salix, the occurrence of repeats (SSR and LDR) was more prevalent in the non-coding regions (intergenic regions and introns) than in the coding regions, which is similar to the reports in Chrysanthemum boreale [40], Artemisia selengensis [45] and Aster tataricus [46]. Weng et al. inferred that the distribution patterns of repeat sequences in plastid genomes are associated with genome rearrangements and nucleotide substitution rates [47]. Therefore, these repeats could be used to develop genetic markers not only for phylogenetic studies, but also for detecting the diversity and differentiation of Salix and other Salicaceae species.  Phylogenetic analysis of willows has been reported for more than two decades. Azuma et al. (2000) firstly assigned 23 Salix species, based on rbcL genes, into two major clades [35], which was supported by many subsequent studies [36,43,44]. However, all these phylogenetic analyses were conducted based on less than three markers and relationships within species were not well resolved. A similar result has been found in subgenus Salix s.l. [3]. Based on the whole genome assembly, Zhang et al. [48] and He et al. [18] constructed phylogenetic trees with high support values by using thousands of single-copy orthologous genes. However, these studies analyzed only three and five Salix species. In the present study, 32 Salix species were also subdivided into two major clades and the relationships among species were well resolved ( Figure  7). To the best of our knowledge, this is the first phylogenetic tree generated for so many Salix species based on the whole chloroplast genome sequences. The low support values of some nodes may be due to the high frequency of interspecific hybridization within this genus [49]. Nevertheless, two major clades were supported with high bootstrap values.
To summarize, we have presented a de novo assembled chloroplast genome sequence of S. triandroides. We detected some hypervariable regions of Salix and Figure 7. phylogenetic tree constructed using the maximum likelihood (ml) method based on the whole chloroplast genomes from 32 Salix species and two Populus species. the numbers above the branches represent the ml bootstrap values.

Conclusions
Salix triandroides is an important small willow for levee protection and ecological restoration of wetland. In this study, we reported a complete chloroplast genome of S. triandroides which is 155,683 bp in length, including a pair of inverted repeat regions (27,490 bp), a large single-copy region (84,463 bp) and a small single-copy region (16,240 bp). We annotated a total of 126 unique genes in this circular chloroplast genome. We further predicted some hypervariable regions in this chloroplast genome for candidate marker development. Phylogenetic analysis assigned Salix species into two major clades and revealed that S. triandoides was a sister to S. triandra. Our research will provide a foundation for molecular breeding of S. triandroides and represent new insights into the evolution of Salix.