Naturally occurring SARS-CoV-2 gene deletions close to the spike S1/S2 cleavage site in the viral quasispecies of COVID19 patients

ABSTRACT The SARS-CoV-2 spike (S) protein, the viral mediator for binding and entry into the host cell, has sparked great interest as a target for vaccine development and treatments with neutralizing antibodies. Initial data suggest that the virus has low mutation rates, but its large genome could facilitate recombination, insertions, and deletions, as has been described in other coronaviruses. Here, we deep-sequenced the complete SARS-CoV-2 S gene from 18 patients (10 with mild and 8 with severe COVID-19), and found that the virus accumulates deletions upstream and very close to the S1/S2 cleavage site (PRRAR/S), generating a frameshift with appearance of a stop codon. These deletions were found in a small percentage of the viral quasispecies (2.2%) in samples from all the mild and only half the severe COVID-19 patients. Our results suggest that the virus may generate free S1 protein released to the circulation. We suggest that natural selection has favoured a “Don’t burn down the house” strategy, in which free S1 protein may compete with viral particles for the ACE2 receptor, thus reducing the severity of the infection and tissue damage without losing transmission capability.


Introduction
RNA viruses replicate using their own RNA-dependent RNA polymerase (RdRp), which lacks proofreading mechanisms and is prone to mutate at high rates (10 −3 -10 −5 substitutions/nucleotide/replication cycle), lending the virus a quasispecies structure [1,2]. Previous studies with severe acute respiratory syndrome coronavirus (SARS-CoV) and mouse hepatitis virus have reported moderate mutation rates of 9.06 × 10 −7 and 2.5 × 10 −6 subs/site/cycle respectively, below the expected range for RNA viruses [3]. This is consistent with a role for non-structural protein (nsp) 14 in RNA proofreading or repair functions because of its 3'-5' exonuclease (ExoN) activity. Nonetheless, the large size of the CoV RNA genome increases the probability that deletions will be generated and recombination events will take place, which could facilitate adaptation to new host environments, as occurs with jumping between species [1,2]. One naturally occurring deletion on 29 nucleotides in the open reading frame (ORF) 8 of SARS-CoV after human-to-human transmission was found to be associated with attenuation of replication [4].
The low mutation rate, high human-to-human transmissibility (R 0 = 2.2) [5], and absence of human pre-existing immunity against SARS-CoV-2 could explain its rapid spread through the human population, with very high sequence identity (99.9%) between isolates recovered all over the world (sequence published in the repository sequence data banks, GISAID and GenBank). The high pathogenicity of the virus, the severity of coronavirus disease-19  and the lack of an effective antiviral treatment or vaccine has pushed the scientific community worldwide to develop, in record time, a solution for this pandemic [6].
Among the SARS-CoV-2 structural proteins, including spike (S), envelope (E), and membrane (M) constituting the viral coat, and the nucleocapsid (N) protein that packages the viral genome, the S glycoprotein is the most promising as a therapeutic and vaccine target. The S protein is encoded by the S gene, and following trimerization, it composes the spikes of the characteristic viral particle crown (corona). The S protein is essential for SARS-CoV-2 to infect a host cell [7] by recognizing and binding to the human cell receptor, angiotensin-converting enzyme 2 (ACE2) [8], and possibly (with lower affinity) to other receptors, such as CD209L (L-SIGN), also used by SARS-CoV [9] and dipeptidyl peptidase 4 (DPP4), used by MERS [10].
The S gene has 3822 nucleotides with 1273 amino acids (GenBank reference sequence MN908947.3). It has five essential domains: the receptor-binding domain (RBD), O-linked glycan residues flanking a polybasic S1/S2 cleavage site, fusion peptide (FP), heptad repeats HR1 and HR2, and a transmembrane domain (TM). The S1 RBD includes 6 amino acid positions that show high affinity for the human ACE2 receptor, which is widely distributed, but mainly present in alveolar type 2 (AT2) cells of the lungs [11]. Once the virus is attached to the host cell receptor, cleavage occurs between subunits S1 and S2, and subunit S2 drives the viral and cellular membranes to fuse [12]. Thus, S1 recognizes and binds to the human cell receptor ACE2, whereas S2 directly facilitates entry into the host cell. Both functions are crucial for infection, and therein lies the interest of S as a target for the development of vaccines and antiviral agents.
Because of the importance of the S protein, we carried out a deep-sequencing study of the S gene in upper respiratory tract samples from 18 patients with mild or severe SARS-CoV-2 disease. Of particular note, hotspot deletion sites were found in minority mutants located upstream and very close to the S1/S2 (PRRAR/S) and S2' (KPSKR/SFI) cleavage sites, suggesting that these genomes code for a truncated S protein. The variants were significantly more prevalent in patients with mild than those with severe disease. Thus, their effect on the protein could constitute a favourable regulatory mechanism emerging in the viral quasispecies to modulate the pathological effect of the infection. Discussion is provided on the implications this observation may have in the biology of SARS-CoV-2.

Patients
Upper respiratory tract specimens (naso/oropharyngeal swabs or nasopharyngeal aspirates) from individuals consulting in the emergency room were collected for SARS-CoV-2 testing in the Department of Microbiology at Hospital Universitari Vall d'Hebron (HUVH), Barcelona (Spain). Samples from 18 patients with no previous comorbidities other than COVID-19 were included in the study. As defined by CDC criteria (https://www. cdc.gov/coronavirus/2019-ncov/hcp/clinical-guidancemanagement-patients.html), 10 patients had a mild clinical presentation of COVID-19 (absence of viral pneumonia and hypoxia, no hospitalization requirement, able to manage their illness at home), whereas 8 patients had severe disease (intensive care unit (ICU) admission for supportive management of complications of severe COVID-19 such as pneumonia, pneumonia, hypoxemic respiratory failure, sepsis, cardiomyopathy and arrhythmia, acute kidney injury, and other complications). All patients were attended by March 2020, and those with both mild and severe disease had a favourable outcome with resolution of the infection.

Detection of SARS-CoV-2
The diagnosis of COVID-19 was performed by two tests, an in-house RT-PCR assay using the primer/ probe set from the CDC 2019-nCoV Real-Time RT-PCR Diagnostic Panel (Qiagen, Hilden, Germany) and a commercial real-time RT-PCR assay (Allplex 2019-nCoV Assay, Seegene, South Korea).

SARS-CoV-2 sequencing
The 18 respiratory specimens were inactivated by mixing 140 µL of the sample with 560 µL of AVL buffer (Qiagen, Hilden, Germany). Extraction of nucleic acids was then performed using the QIAmp Viral RNA Mini Kit (Qiagen, Hilden, Germany) following the manufacturers' instructions but without the RNA carrier, obtaining a final elution of 30 µL.
The complete S gene was amplified using a double PCR. The first RT-PCR step consisted in amplifying 2 large fragments, 3314 base pairs (bp) and 3591 bp in length, respectively. The 5' end of primer 1 and 3' end of primer 2 were designed to be outside the S region to ensure that we were amplifying SARS-CoV-2 genomic RNA, and not subgenomic RNA (Table  S11).
The SuperScript III One-Step RT-PCR System with Platinum Taq HiFi DNA Polymerase (Invitrogen; Carlsbad, CA, USA) was used for the RT-PCR. Reverse transcription was done at 50°C for 30 min, followed by a retrotranscriptase inactivation step at 94°C for 2 min. Next, 30 cycles of PCR amplification were performed as follows: denaturation at 94°C for 15 sec, annealing at 54°C for 30 sec, and elongation at 68°C for 5 min. After the last cycle, amplification ended with a final elongation step at 68°C for 5 min.
The second round of amplification (nested) was done using overlapping internal primer pairs to amplify fragments 470 bp to 313 bp in length. The Fas-tStart High-Fidelity PCR System dNTPack (Sigma, St. Louis, MO, CA) was used for this purpose, as follows: activation at 94°C for 4 min, followed by 30 cycles with denaturation at 94°C for 30 sec, annealing at 55°C for 30 sec, and elongation at 72°C for 40 sec, ending with a single elongation step at 72°C for 7 min.
PCR products were purified using the QIAquick Gel Extraction Kit (Qiagen, Hilden, Germany) with QG buffer, following the manufacturers' instructions, and eluted DNA was quantified by fluorometry using the QUBIT dsDNA BR Assay Kit (ThermoFisher, MA, USA). For each patient, PCR products were normalized to 1.5 ng/µL, pooled in a single tube, and purified using KAPA Pure Beads (KapaBiosystems, Roche, Pleasanton, CA, USA) to ensure that no short DNA fragments were present in the library. Library preparation was done using the KAPA Hyper Prep Kit (Roche Applied Science, Pleasanton, CA, USA) and each pool was individually indexed using the SeqCap Adapter Kit A/B (Nimblegen, Roche, Pleasanton, CA, USA). After library enrichment and a second clean-up with KAPA Pure Beads, the pools were quantified again using the QUBIT dsDNA BR Assay Kit and qualitytested using the 4150 TapeStation System (Agilent, Santa Clara, CA, USA). All pools underwent a final normalization to 4 nM, and 10 µL of each pool was added to the final library tube. The final library was qPCR-quantified using the KAPA Library Quantification Kit (KapaBiosystems, Roche, Pleasanton, CA USA) in a LightCycler 480 system (Roche) to obtain the precise concentration of indexed DNA. PhiX V3 internal DNA control (Illumina, San Diego, CA, USA) was added to the final dilution. The library was loaded in a MiSeq Reagent Kit 600V3 cartridge (Illumina, San Diego, CA) and sequenced using the MiSeq platform (Illumina, San Diego, CA).

Bioinformatics analysis. InDel study
The sequence analysis aimed to obtain high-quality haplotypes fully covering the amplicons. The pipeline comprises the following steps: (1) Amplicons were reconstructed from the corresponding R1 and R2 paired ends using FLASH [13] and setting a minimum of 20 overlapping bases and a maximum of 10% mismatches. Low-quality reads that did not meet the requirements were discarded. (2) Next, all reads with more than 5% of bases below a Phred score of Q30 were filtered out. (3) The reads were demultiplexed by matching primers, allowing a maximum of three mismatches, and the primers were trimmed at both read ends. Identical reads were collapsed to haplotypes with the corresponding frequencies as read counts. A fasta file was generated with each pool/primer/ strand combination. The reverse haplotypes were reverse complemented. (4) Raw forward and reverse haplotypes were multiple aligned with MUltiple Sequence Comparison by Log-Expectation (MUSCLE) [14], then separated into strands, and haplotypes common to both strands at abundances ≥0.1% were identified. Low-abundance haplotypes (<0.1%) and those unique to one strand were discarded. The haplotypes common to both strands, with frequencies not below 0.1% were called consensus haplotypes, and were the basis of subsequent computations.
The amino acid alignments were computed as follows: (1) Gaps were removed and haplotypes translated to amino acids. (2) The translated stops generated were identified, and haplotypes were trimmed after the stop. (3) Resulting amino acid haplotypes were realigned with MUSCLE (EMBL-EBI https://www.ebi.ac. uk/Tools/msa/muscle/).
All computations were made in the R language and platform [15], developing in-house scripts using Biostrings [16] and Ape [17] packages.

Results
Eighteen COVID-19 patients (10 mild and 8 severe) were included in the study. In total, 48,746,647 reads, ranging from 81,202 to 597,558 reads per amplicon (median 171,478), were analysed from upper respiratory tract samples (Table 1), using 13 overlapping amplicons covering the complete S protein. Thus, we studied 3,749,742 complete S genes, with a mean of 208,319 per patient. Sequences have been uploaded to the GenBank Sequence Read Archive (SRA) database with BioProject accession number PRJNA630679. Results related to amplicon positions, coverage, percentage of the master sequence, gap incidence per patient and per amplicon, and premature stop codons are available as Supplementary Tables S1-S10.
Deletions were not randomly accumulated along the S gene, but instead, were found at specific regions ( Figure 1, Figures S1-S27). Deletions coded as delta (Δ1-Δ18) ranged from 1 to 42 nucleotides lost ( Table 2). In some cases, the sequence recovered the correct reading frame, in others, the frameshift caused the appearance of a premature stop codon very close to the deletion site, whereas in still others, a new amino acid segment appeared.
Among the total of 43 deletions detected in amplicon N07 (Table 3), a premature stop codon appeared immediately after the deletion site in 5 cases (11.6%) and the reading frame recovered after losing 4, 5, or 7 amino acids in 6 cases. However, a frameshift that changed the reading frame and caused the appearance of a premature stop codon several amino acids later was generated in most of the deletions 32/43 (74.4%), and in consequence the S1/S2 cleavage site and the polybasic domain (PRRAR/S) disappeared. In 39 of the 43 (90.7%) N07 deletions, a TATA box-like motif (nt 2,007-2,010) was lost. In this particular region, the deletion was characterized by a similar 3' cutting edge (Table S11). An interesting result at the amino acid level was that regardless of the starting point of the deletion (nt 654, 663, 664, 665, 666, 667 or 671), in 9 of the mild patients (all except P14) and in 2 severe ones (P09 and P10), the frameshift caused by the deletion generated a new peptide motif, IRLR-LILLGGHVV*, with a stop codon (*) at the end (Table 4).

Discussion
Here, we describe the naturally occurring deletions in the SARS-CoV-2 S gene in a set of patients with mild or severe COVID 19. The deletions mainly clustered in two hot-spot regions, one (Δ15, affecting aa660-aa679) located upstream but very close to the S1/S2 cleavage site (aa 685/686) and the second (Δ16 affecting aa817-823) situated just upstream to the secondary cleavage site S2' (aa 815/816). These two deletions were found in most of the patient samples studied, and notably, the Δ15 deletion was present in 100% of patients with mild infection and in half of those with severe disease, three quarters of the patients studied (Table 2). This finding suggests that the deletions are not sporadic events even though they were seen in a relatively small percentage of the viral quasispecies (2.2% for Δ15; and 0.54% for Δ16). The mutants could be interpreted as a strategy that natural selection has favoured during the SARS-CoV-2 infectious life cycle to facilitate extensive spread of the infection, as is discussed below. This study involved deep-sequencing of the complete SARS-CoV-2 spike gene using 13 overlapping amplicons in laboratory-confirmed samples for SARS-CoV-2 in 18 patients. In studies with other SARS-CoV viruses, several subgenomic RNAs were reported to be generated during the cell cycle [6,18]. To exclusively study the genomic viral RNA of SARS-CoV-2, RT-PCR was performed using two large PCR products in which the 5' end of primer pair 1 and the 3' end of primer pair 2 were designed to be outside the spike region (5' end in ORF1ab and 3' end in S1 to S14 for nucleotides and S15 to S27 for amino acids). Table 3. List of deletions found in amplicon N07 at the nucleotide level aligned under the reference sequence Wuhan Hu-1 (MN908947.3). Alignment between nucleotides 1974 and 2070 is shown. Nucleotides represented in bold red in the reference sequence indicate the S1/S2 cleavage site R (CGT) / S (AGT). ORF3a) (Table S11, Figure S14). Taking into consideration that CoV have 3'-5' ExoN activity (nsp14 protein), consistent with a proofreading mechanism to correct mutations during replication, the deepsequencing analysis accepted mutants present at a low frequency of ≥0.1%. Because of the possibility of  EHVNNSYECDIPIGAGICASY--SPRRAR/SVASQSIIAYTMSLGAENSVAYS rRF PCR artefacts, deep-sequencing point mutations, or deletion of single nucleotides generated mainly at homopolymeric sites, we did not include single deletions unless they were found in different patients and in overlapping amplicons at higher frequencies (>1%). No insertions were found. Entry of the viral genome into the cell depends on recognition and binding of the surface subunit S1 to the ACE2 human receptor [11], whereas the S2 subunit is responsible for fixing the S protein to the viral membrane surface. After binding to the ACE2 cell receptor, the S protein is primed by the serin-protease TMPRSS2, which leads to S protein cleavage at S1/S2 and S2' [8]. After cleavage, S1 remains attached to ACE2, while subunit S2 anchors the viral and cellular membranes, inducing fusion and viral entry. The Δ15 deletion ( Table 2) mainly causes a frameshift that generates an in-frame stop codon. The presence of this new stop codon would result in translation of a truncated S, which would consist of an almost complete S1 subunit, and total absence of the S2 subunit responsible for anchoring S to the lipid membrane of the viral particle. The absence of the S2 anchor peptide suggests that S1 could be produced as a "free" protein (free S1). As S1 is located on the exposed outside of SARS-CoV-2 in the crown structures, it could have hydrophilic domains and be a soluble peptide with potential for release outside the infected cell, in the lower respiratory tract and even to plasma (Figure 3). These free soluble proteins, which are not a part of the viral cycle or components of the viral particles have also been observed in other viral infections. For example, a huge amount of "empty" subviral genomic particles, consisting of viral envelope proteins (HBsAg), are often found in plasma of patients with hepatitis B virus (HBV) infection. These empty particles are produced and secreted during HBV infection, and have an immunomodulatory role [19]. In addition, soluble HBV e antigen (HBeAg), which is not a component of the viral particles and shares immunoactive epitopes with the HBV core antigen (HBcAg viral capsid component), is detected during HBV infection and has an immunomodulatory role [20].
Human respiratory syncytial virus (HRSV) is another respiratory virus with the ability to produce pre-anchored proteins. The attachment protein (G) of HRSV is an anchored protein whose main function is viral attachment to the host's cell membrane through a still unknown receptor [21]. As in many other viruses, this protein has several functions, and in this case, because of the existence of a second start codon, a soluble form of G protein lacking the anchor is produced, and this is shed to the extracellular medium [22] in abundant quantities by infected cells. The function of soluble, free G is to inhibit toll-like receptors, thereby modulating the host's immune response. Free G also binds to the host's neutralizing antibodies, which are mainly directed to this protein. In this way, neutralization of circulating virions is reduced, favouring viral infection [23].
The free S1 binding subunit of SARS-CoV-2 without its membrane anchor S2 could have similar functions (Figure 3). One putative action of secreted free S1 protein might be to attach to the human ACE2 cell receptor, thereby competing with complete viral particles to re-infect or newly infect respiratory tract cells, resulting in less severe disease. This could be interpreted as an effect of natural selection to attenuate the infection and facilitate its persistence with minimal damage, increasing the human-to-human transmission into the community. This strategy, which we have dubbed "Don't burn down the house" is supported by the finding that the minor variants carrying these deletions were statistically more frequent in patients with mild than severe COVID-19. This self-modulating viral strategy has also been seen in hepatitis delta virus (HDV) infection, where one viral antigen (short HDV antigen, SHDAg) enhances HDV replication, while a second antigen (large HDV antigen, LHDAg), produced after a stop codon edition (TAG to TGG) by cellular adenosine deaminase, acts as a negative regulator of replication [24].
The fact that the truncated S protein was present in only a low percentage of the entire viral quasispecies suggests that natural selection may have designed a favourable equilibrium in which a limited number of deleted virions are generated to balance virus production with infection of new cells during disease progression. A likely reason for maintaining a minority population of genomes with deletions able to produce free S1 protein would be to infect a host while causing minimal damage, which would greatly facilitate transmission of the virus within the population. However, the mutants were also found in half the patients with severe disease; hence, additional study is needed to determine whether they also relate to disease severity. In clinical practice, it has been seen that progression to severe disease can occur within hours, which suggests that any variant associated with virulence should be detected at the time of the diagnosis. The samples studied here were obtained on the day patients were admitted to the emergency room, and very close to the onset of infection. Does the percentage of these viral mutants change during disease progression? To elucidate this issue, it would be of interest to investigate changes in the frequency of deleted genomes in a large number of patients and in sequential samples from the same patient, together with virus culture experiments to determine whether the presence of deleted sequences increases or not during the passages. As a consequence of the frameshift, a new peptide motif, IRLR-LILLGGHVV*, appeared in several sequences with a deletion that started in different nucleotide points. Additional work is also needed to determine whether acquisition of this peptide motif has biological consequences.
Two other putative consequences of the S mutants might be that free S1 protein could bind with S-specific antibodies, acting as a decoy and weakening the immune response, or to circulating ACE2, released from the cell membrane to plasma [25,26], with cardiovascular effects. However, as the deletions were mainly found in patients with mild disease and considering the zoonotic origin of the virus (animal immune and cardiac systems differ from human ones) and the short time that the virus has been evolving in the human population, we believe that the most likely reason for maintaining a minor population of mutant genomes able to produce free S1 protein would be to cause an infection with limited damage in the host, thus facilitating transmission and persistence of the virus in the population. The observation of mutation hot spots in the S gene opens the door to further work on a number of potentially related aspects.
Recent studies have reported the presence of deleted variants in the S1/S2 junction in virus isolated by cell culture of clinical specimens [27]. Deletions of 10-15 nucleotides at the S1/S2 junction were identified by plaque purification of Vero-E6 cultured SARS-CoV-2 genomes obtained from nasopharyngeal aspirate of a COVID-19 patient. Infection of hamsters with virus containing these variants led to attenuated viral disease [27,28]. Digital PCR-based assays demonstrated that such mutants carrying deletions at low intra-host frequency can also be transmitted from human to human, which suggests that they may have significant implications in the zoonotic origin and natural evolution of SARS-CoV-2 [28]. These findings support Figure 3. Based on the life cycle of SARS-CoV, this diagram represents the hypothesis derived from our results. Entry of the virus in the host cell is shown at the top right of the diagram. At the transcription step, two scenarios are depicted: to the left, the viral particle resulting from normal S protein, and to the right the viral particle resulting from truncated S protein. In normal conditions, once the nucleoprotein is freed into the cytoplasm ss + RNA is translated into the non-structural proteins required for transcription. ss + RNA is transcribed into ss-RNA and later into genomic ss + RNA which is encapsidated (left side of the figure). Once the complete viral particle has been formed, it is secreted from the cell by exocytosis. The right side of the figure depicts the situation when a deletion occurs in the S gene during transcription of the complete genome and before subgenomic mRNAs are generated to produce the structural proteins. Translation of a deleted subgenomic spike mRNA would lead to a truncated S protein composed of the S1 domain without S2, which could be shed outside the cell as free S1. The box depicts possible destinations of free S1, which could bind to (1) the ACE2 cell receptor, (2) S1-specific neutralizing antibodies, or (3) free ACE2 receptor. ***The red triangle indicates the deletion in genomic RNA. ***Abbreviations: ACE2, angiotensin converting enzyme 2; mRNA, messenger RNA; NAb; neutralizing antibodies; pp1a, polyprotein 1a; RdRp, RNA-dependent RNA polymerase; S, spike; S1, subunit S1 at the N-terminal domain of the S protein, which includes receptor binding domain (RBD); S2, subunit S2 located at the C-terminal domain of S protein, which includes fusion peptide (FP), heptad repeat (HR) domain 1 and 2, and the transmembrane domain (TM); ss, single stranded; ss + RNA, single-stranded positive sense RNA; TMPRS22, human serine protease TMPRSS2.
our hypothesis that deletions close to the S1/S2 cleavage site are likely a natural phenomenon. Here, we suggest that this phenomenon may have been favoured by natural selection to enhance the spread of SARS-CoV-2.
To conclude, in-depth sequencing of the SARS-CoV-2 S gene in 18 patients with COVID-19 enabled identification of a naturally occurring deletion very close to the S1/S2 cleavage site. Our results indicate that the mutant S would have a large impact on the S protein, and suggest that the virus could produce free S1, which may have implications regarding the candidacy of S protein as a target for vaccination and antiviral treatment strategies. The deletions were significantly more prevalent in patients with mild than in those with severe disease, supporting the notion that they could be a strategy of natural selection to decrease the injury caused after onset of the infection. In this "Don't burn down the house" strategy, the ability of the virus to bind with ACE2 receptor and spread to others would be unchanged; thus its propensity for transmission would be enhanced by a mildly affected host. To prove this hypothesis, it is essential to investigate whether the truncated S protein (free S1) is present in respiratory tract specimens and in plasma. To detect free S1 at low concentration by western blot analysis, entire and truncated recombinant spike proteins should be used as controls, together with highly specific antibodies to S protein. These studies are currently ongoing, and in parallel, we are investigating whether the new peptide motif IRLRLILLGGHVV* will have sufficient antigenicity to be used as a probe to detect truncated free S1 protein.