A novel SARS-CoV-2 related coronavirus with complex recombination isolated from bats in Yunnan province, China

ABSTRACT At the end of 2019, A new type of beta-CoV, SARS-CoV-2 emerged and triggered the COVID-19 pandemic, which spread overwhelmingly around the world in less than a year. However, the origin and direct ancestral viruses of SARS-CoV-2 remain unknown. RaTG13, a novel coronavirus found in bats in China’s Yunnan Province, is the closest relative virus of the SARS-CoV-2 identified so far. In this study, a new SARS-CoV-2 related virus, provisionally named PrC31, was discovered in Yunnan province by retrospectively analyse bat next generation sequencing (NGS) data of intestinal samples collected in 2018. PrC31 shared 90.7% and 92.0% nucleotide identities to the genomes of SARS-CoV-2 and the bat SARSr-CoV ZC45, respectively. Sequence alignment of PrC31 showed that several genomic regions, especially orf1a and orf8 had the highest homology with those corresponding genomic regions of SARS-CoV-2 than any other related viruses. Phylogenetic analysis indicated that PrC31 shared a common ancestor with SARS-CoV-2 in evolutionary history. The differences between the PrC31 and SARS-CoV-2 genomes were mainly manifested in the spike genes. The amino acid homology between the receptor binding domains of PrC31 and SARS-CoV-2 was only 64.2%. Importantly, recombination analysis revealed that PrC31 underwent multiple complex recombination events (including three recombination breakpoints) involving the SARS-CoV and SARS-CoV-2 sub-lineages, indicating that PrC31 evolved from yet-to-be-identified intermediate recombination strains. Combined with previous studies, it is revealed that the beta-CoVs may possess a more complex recombination mechanism than we thought.


Introduction
Coronaviruses (CoVs) are a group of viruses that can infect humans and various mammalian and bird species [1,2]. So far, seven CoV species have been identified in humans. Of these, severe acute respiratory syndrome coronavirus (SARS-CoV) emerged in 2003 and caused multiple epidemics worldwide, and had a fatality rate of ∼9.5% [3]. Approximately ten years later, another highly pathogenic human CoV, Middle East respiratory syndrome coronavirus (MERS-CoV) emerged and caused numerous outbreaks in the Middle East and South Korea in 2015 [4][5][6]. In December 2019, a novel beta-CoV, now termed severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), was first identified. SARS-CoV-2 caused a pneumonia outbreak in Wuhan, China, and eventually caused a pandemic, with >116,521,000 reported cases and >2,589,000 deaths worldwide as of March 9, 2021 [7-10].

Methods
We retrospectively analyzed bat next generation sequencing (NGS) data that we performed in 2019, and found SARS-CoV-2-related reads present in one pool of intestinal tissues. The details of sampling and high-throughput sequencing are given below.

Sample collection and pretreatment
In 2018, 36 bats were captured in Yunnan province, China. The bats were dissected to collect liver, lung, spleen and intestinal tissue specimens after anesthetization and transported to the Chinese Center for Disease Control and Prevention. The samples were stored at −80°C until further analysis. The bat species were identified by PCR to amplify the cytochrome B gene (cytb), as previously described [24]. The tissue samples collected from 36 bats were homogenized in minimum essential medium and the suspensions were centrifuged at 8,000 rpm. The intestinal tissue supernatants were merged into two pools according to bat species, and then digested using DNase I. All procedures were performed in a biosafety cabinet in a biosafety level 2 facilities. This study was approved by the ethics committee of the CCDC (No.20160715023), and was performed according to Chinese laws and regulations.

RNA extraction and next-generation sequencing (NGS)
Nucleic acids were extracted using a QIAamp MinElute Virus Spin Kit (QIAGEN). The library preparation and sequencing steps were performed by Novogene Bioinformatics Technology (Beijing, China). In brief, the ribosomal RNA was removed using the Ribo-Zero-Gold (Human-Mouse-Rat) Kit (Illumina, USA) and the Ribo-Zero-Gold (Epidemiology) Kit (Illumina). The libraries were constructed using a Nextera XT kit (Illumina), and sequencing was performed on the Illumina NovaSeq 6000 platform according to the procedure for transcriptome sequencing.

Bioinformatic analyses
Bioinformatics analysis of the sequencing data was conducted using an in-lab bioinformatics analysis platform. Prinseq-lite software (version 0.20.4) was used to remove lower quality reads [25], and Bowtie2 was used to align and map the filtered reads to the host reference genome [26]. Mira (version 4.0.2) was used for de novo assembly of the clean reads [27]. Both BLASTn and BLASTx of the BLAST+ package (version 2.2.30) were used to search against local viral nucleotide and protein databases. The E-value cut-off was set to 1 × 10 −5 to maintain high sensitivity and a low false-positive rate when performing BLAST searches.

Sequencing of full-length genomes and quantitative real-time PCR (qRT-PCR)
We obtained reads that showed 96 − 98% nucleotide identity to SARS-CoV-2 from NGS data. To confirm the sequences and fill the gaps, we designed 32 primer pairs according to the consensus sequences from the NGS and the conserved regions of SARS-CoV-2, RaTG13 and RmYN02, to amplify the whole genome with at least 100 bp overlap between adjacent PCR fragments for Sanger sequencing (Table S1). The sequences were assembled using Geneious Prime. Positive samples were quantified using TaqManbased qPCR kit targeting the ORF1ab and N genes (BioGerm, China).

Phylogenetic and recombination analyses
The complete genome sequences of reference viruses were downloaded from GenBank (https://www.ncbi. nlm.nih.gov/) and GISAID (https://www.gisaid.org/). The complete genome of PrC31 was aligned with representative sequences using MAFFT (v7.475). Phylogenetic analyses were performed using RaxML software (v8.2.11) with the general time reversible nucleotide substitution model, GAMMA distribution of rates among sites, and 1000 bootstrap replicates. The phylogenetic trees were visualized using online website: iTOL [28]. Potential recombination events were screened using RDP4 software and further analyzed by similarity plot using Simplot (v3.5.1) with potential major and minor parents.

Structural modelling
The three-dimensional structures of PrC31, ZC45 and SARS-CoV-2 RBDs were modelled with the Swiss-Model programme using the SARS-CoV-2 RBD structure (PDB: 7a91.1) as the template.

Identification of a novel SARS-CoV-2-related coronavirus
Based on the molecular identification results, all collected bats belonged to five different species: Rhinolophus affinis, Miniopterus schreibersii, Rhinolophus blythi, Rhinolophus pusillus, and Hipposideros armiger (Table S2). By retrospectively analyzing our NGS data, we found a new bat beta-CoV related to SARS-CoV-2 in intestinal tissues of Rhinolophus blythi collected from southern region of Yunnan province, China, in 2018. The corresponding pools of liver, lung and spleen did not present reads against to beta-CoV. All tissue samples were used to screen SARS-CoV-2 by qRT-PCR, results revealed that three samples tested positive for SARS-CoV-2 with Ct values of 32.4 (sample C25, intestinal), 35.1(sample F25, lung corresponding to C25) and 35.6 (sample C31, intestinal). Both bats were identified as Rhinolophus blythi.
A near complete genome of this virus comprising 29,749 bp was obtained from sample C31 and tentatively named PrC31. The virus genome isolated from the second positive sample had the same sequence as PrC31.

Genetic characteristics and comparison with SARS-CoV-2 and other related viruses
Analysis of the complete PrC31 genome revealed that it shared 90.7% and 92.0% nucleotide identity to SARS-CoV-2 and bat SARSr-CoV ZC45, respectively (Table 1). Although the whole genome of PrC31 was more closely related to ZC45 and ZXC21 compared to the other viruses examined, several genes of PrC31 showed highly similar nucleotide identities (>96%) with SARS-CoV-2, including E, ORF7a, ORF7b, ORF8, N and ORF10 (Table 1). Notably, ORF8 and ORF1a (the region spanning nucleotides 1-12,719) of PrC31 were genetically closer to SARS-CoV-2 than any other virus identified to date, exhibiting 98.1% and 96.6% nucleotide identities, respectively. However, in other regions, PrC31 was more similar to SARS-CoV or SARSr-CoV ZC45. Based on the most conserved gene RdRp, the PrC31 share a common ancestor with SARS-CoV.
Unexpectedly, the S gene of PrC31 showed 74.9% nucleotide and 80.1% animo acid identities with SARS-CoV-2, but it was more closer to ZC45 with 94.8% nucleotide and 99.1% animo acid identities. Especially, RBD region was evolutionarily distant from SARS-CoV-2, sharing only 64.2% amino acid identity, whereas it was almost identical to that of ZC45, with only one amino acid difference. Similar to most bat SARSr-CoVs, two deletions (14 and 5 aa) were present in PrC31, while were absent from SARS-CoV, SARS-CoV-2, pangonlin-CoVs and RaTG13. We predicted the three-dimensional structure of the RBD of PrC31, ZC45 and SARS-CoV-2 using homology modelling. Similar to RmYN02, the two loops close to the receptor binding site of the PrC31 RBD were shorter than those of SARS-CoV-2, due to two deletions; this region may influence the binding capacity of the PrC31 RBD with the angiotensin converting enzyme 2 (ACE2) receptor (Figure 1  (A-D)). Moreover, of the six amino acid residues that are essential for the binding of the SARS-CoV-2 spike protein to ACE2 (L455, F486, Q493, S494, N501, and Y505), PrC31 shared only one with SARS-CoV-2 (Y505) (Figure 1(E)).

Phylogenetic analysis of PrC31 and representative sarbecoviruses
Phylogenetic analysis of the complete PrC31 genome revealed that it belonged to a separate clade to SARS-CoV-2, while most other SARS-CoV-2-related viruses were grouped together ( Figure 2). However, the PrC31 RNA-dependent RNA polymerase was phylogenetically grouped within the SARS-CoV lineage and clustered with bat SARS-rCoV. The spike protein of PrC31 fell within the SARS-CoV-2 sub-lineage and clustered with ZC45 and CXZ21, while being distant from SARS-CoV-2. The topological differences between various regions of PrC31 strongly suggest the occurrence of recombination events throughout its evolution.

Multiple and complex recombination events in the evolution of PrC31
The full-length genome sequences of PrC31 and closely related beta-CoVs were aligned to search for possible recombination events. Strikingly, both the similarity and bootstrap plots revealed multiple and complex long-segment recombination events in PrC31, which likely arose from multiple beta-CoVs from within the SARS-CoV and SARS-CoV-2 sublineage. As shown in Figure 3, three recombination breakpoints were detected. For the region spanning nucleotides 1-12,719 and 27,143 to the 3 ′ terminus of the genome, PrC31 was most closely related to SARS-CoV-2 and RmYN02. In these regions, PrC31 was phylogenetically grouped with RmYN02 and in a sister clade to SARS-CoV-2 (Figure 4(a,d)). For the 12,720-20,244 nucleotide region, which included ORF1ab, PrC31 was grouped with SARS-CoV and bat SARSr-CoVs (Figure 4(b)). Moreover, PrC31 presented the highest similarity to ZC45 in the 20,245-27,142 genomic fragment, which included part of ORF1ab, S, ORF3, E, and part of the M gene, and fell within the SARS-CoV-2 sub-lineage (Figure 4(c)).

Discussion
The recently-emerged SARS-CoV-2 virus triggered the ongoing COVID-19 pandemic, which has high morbidity and fatality rates, and poses a great threat to global public health. Identifying the origin and host range of SARS-CoV-2 will aid in its prevention and control, and will facilitate preparation for future CoV pandemics. Although several SARS-CoV-2related viruses were detected in bats and pangolins, none of them appear to be the immediate ancestor of SARS-CoV-2; the exact origin of SARS-CoV-2 is still unclear [12,29]. In this study, we discovered PrC31, a sarbecovirus discovered from bat intestinal tissues collected in 2018. PrC31 phylogenetically falls  into the SARS-CoV-2 clade and has undergone multiple and complex recombination events.
Animals that continuously harbour viruses closely related to SARS-CoV-2 for extended time periods can become natural SARS-CoV-2 hosts [2]. To date, several bat viruses have been identified that have strong sequence similarities to SARS-CoV-2, sharing more than 90% sequence identity. Especially, RaTG13 possesses 96.2% identity with SARS-CoV-2 [7,17,18,20,22]. The PrC31 virus identified in this study showed 90.7% genome identity with SARS-CoV-2; notably, the E, ORF7, ORF8, N and ORF10 genes shared more than 96% identity with SARS-CoV-2. Both the genetic similarity and diversity of SARS-CoV-2-related viruses support the claim that bats were the natural hosts of SARS-CoV-2 [10,18].
Recombination events between various SARSr-CoVs have occurred frequently in bats [5,16]. SARS-CoV-2 may also be a recombinant virus, potentially with the backbone of RaTG13 and a RBD region acquired from pangolin-like SARSr-CoVs [12,20]. In this study, we found that PrC31 phylogenetic clustered with SARS-CoV-2 and its related viruses. The results from our phylogenetic analyses suggested that recombination had occurred in PrC31. The similarity plot indicated that the PrC31 was subjected to multiple and complex recombination events involving more than two sarbecoviruses in the SARS-CoV and SARS-CoV-2 sub-lineages. The three recombination breakpoints of PrC31 separate the genome into four regions. Region 1 (within ORF1a) and region 4 of PrC31 were closely related to SARS-CoV-2, RaTG13 and RmYN02. Region 2 of PrC31 was more similar to members of the SARS-CoV sub-lineage, including SARS-CoV and SARSr-CoV Rs4237 strain; region 3 was more closely related to ZC45 within SARS-CoV-2 sub-lineage. The multiple recombination events of PrC31 hint toward the existence of intermediate recombination strains within the SARS-CoV and SARS-CoV-2 sub-lineages that are yet to be identified. Our work suggests that the backbone of PrC31 may have evolved from a recent common ancestor of RaTG13, RmYN02 and SARS-CoV-2, and that it acquired regions 2 and 3 from precursor viruses of SARS-CoV and SARSr-CoV ZC45, respectively.   At present, the precise patterns and mechanisms driving recombination in sarbecoviruses are largely unknown. A recent report identified 16 recombination breakpoints in 69 sarbecoviruses [30], although in the majority of strains, the recombination sites were located within the S gene and upstream of ORF8 [5,9,16]. The three recombination breakpoints of PrC31 were located in ORF1a, ORF1b and M genes with long fragment recombination, suggestive of a complicated recombination pattern in sarbecoviruses. Similar to PrC31, SARS-CoV-2 may have evolved via complex recombination between various related coronaviruses or their progenitors [10]. In fact, the direct progenitor of SARS-CoV may have evolved by recombination with progenitors of SARSr-CoV (Hu et al. 2017). Together, these findings suggest that recombination and its role in the evolution history of sarbecoviruses may be more complicated and significant than initially expected.
Pangolins may also harbour ancestral beta-CoVs related to SARS-CoV-2 [2,19,20]; the receptor-binding motif of pangolin beta-CoVs share an almost identical amino acid sequence with SARS-CoV-2 [19,20], suggesting that SARS-CoV-2 may have acquired its RBD region from a pangolin CoV via recombination [31]. However, unlike bats, pangolins infected with beta-CoVs present overt symptoms and eventually die, rendering them unlikely to be natural hosts. Intermediate hosts generally serve as zoonotic sources for human infection, acting as vectors for viral replication and transmission to humans [2,32]. Current evidence suggests that pangolins were not the direct intermediate hosts of SARS-CoV-2. However, pangolins certainly played an important role in the evolutionary history of SARS-CoV-2 related viruses, eventually leading to the transmission of SARS-CoV-2 to humans. It cannot be excluded that a novel recombination event involving SARS-CoV-2 related viruses and or SARSr-CoV will produce a novel virus presumed as "SARS-CoV-3", which may be transmitted to human populations in the future.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Data availability statement
The sequence of PrC31 generated in this study was deposited in the GISAID and GenBank databases with the accession numbers EPI_ISL_1098866 and MW703458, respectively.