“Outlaw” mutations in quasispecies of SARS-CoV-2 inhibit replication

ABSTRACT The evolution of SARS-CoV-2, the agent of COVID-19, has been remarkable for its high mutation potential, leading to the appearance of variants. Some mutations have never appeared in the published genomes, which represent consensus, or bona fide genomes. Here we tested the hypothesis that mutations that did not appear in consensus genomes were, in fact, as frequent as the mutations that appeared during the various epidemic episodes, but were not expressed because lethal. To identify these mutations, we analysed the genomes of 90 nasopharyngeal samples and the quasispecies determined by next-generation sequencing. Mutations observed in the quasispecies and not in the consensus genomes were considered to be lethal, what we called “outlaw” mutations. Among these mutations, we analysed the 21 most frequent. Eight of these “outlaws” were in the RNA polymerase and we were able to use a structural biology model and molecular dynamics simulations to demonstrate the functional incapacity of these mutated RNA polymerases. Three other mutations affected the spike, a major protein involved in the pathogenesis of COVID-19. Overall, by analysing the SARS-CoV-2 quasispecies obtained during sequencing, this method made it possible to identify “outlaws,” showing areas that could potentially become the target of treatments.


Introduction
We have recently been able to identify "hyperfertile" mutations associated with the evolution of SARS-CoV-2 [1].We have defined these mutations in this way because they are at the origin of phylogenetic nodes that have each generated more than 835 descendants.Other authors have described these mutations as belonging to a "whitelist" of mutations to indicate the benefit they confer to the virus [2].All these studies were carried out on viral genome sequences, which consist of consensus genomes being selected and reproducing.These consensus, or bona fide genomes, which we call "democratic" genomes, are virtual molecules that are built with the most common nucleotides at each genome position, as determined during the assembly of sequence reads generated by next-generation sequencing (NGS).However, as particularly highlighted by the massive sequencing and analyses of HIV genes for genotypic antiviral drug resistance testing [3] and most recently by a global and unprecedent effort of genomic surveillance applied to SARS-CoV-2 [4], it is currently clear that viral quasispecies exist that are not necessarily encompassed in the "democratic" genomes.
For SARS-CoV-2, "hyperfertile" mutations appeared in viruses that circulated in Europe and are found at several notable positions [1].The first amino acid substitution (P323L) is located in the NSP12 gene that encodes the RNA-dependent RNA polymerase (RdRp) and occurred in viruses that caused the pandemic onset after the Wuhan virus came into Europe [5].It was reported to lead to a significant increase in the mutation rate and hence to an evolutivity that was probably critical in the development of a pandemic.In contrast, epidemics of SARS-CoV-1 and MERS-CoV did not exhibit such outcome with "hyperfertile" mutations, persistence of high levels of incidence for many months and over several years, and viral spread associated with new variants [6].Regarding SARS-CoV-2, other key amino acid changes (firstly D614G) were located in the spike protein, which is explained by the predominant role of this protein in viral entry into host cells and hence in pathogenicity [7,8].Paradoxically, several other favourable mutations were identified in so-called accessory genes, including ORF8, where they generated stop codons [9,10].Thus, we described that these mutations in ORF8 obeyed the "Mistigri" rule as the loss of this gene led to a major rebound of virus incidence and spread, which in several cases was at the origin of a "hyperfertile" phylogenetic node [1].In our present work, we wanted to identify the missing mutations that could be at the origin of abortive forms of the virus, which we called "outlaw" mutations.These mutations are referred to as belonging to a "blacklist" by other authors [2].To achieve this, we used our database of SARS-CoV-2 genomes obtained by nextgeneration sequencing (NGS) and analysed the viral quasispecies in order to identify and locate the mutations that were the most common yet never expressed in the "democratic" genomes.Finally, we focused on frequent lethal mutations (not expressed in the "democratic" genomes) located in the RdRp in order to understand why these mutations could not be expressed.

SARS-CoV-2 quasispecies analysis
The analysis of SARS-CoV-2 quasispecies was based on raw NGS data analysed in a previous study [13].In short, this consisted of sequencing reads generated from 90 nasopharyngeal samples collected from SARS-CoV-2-positive patients between March and September 2020 through direct sequencing, in absence of prior PCR amplification that can affect sequencing accuracy, by the Illumina technology with the Nextera XT paired-end strategy on a MiSeq instrument (Illumina Inc., San Diego, CA, USA), as previously reported [14].These reads had been mapped on the genome of the SARS-CoV-2 Wuhan-Hu-1 isolate (GenBank Accession no.NC_045512.2) using the CLC genomics workbench software v7 (https://digitalinsights.qiagen.com/) with thresholds of 0.8 for sequence coverage and 0.9 for nucleotide similarity.Only genomes with a mean depth of sequencing reads per position ≥50 and a coverage of the genome NC_045512.2≥ 90% were considered.Also, an intra-sample nucleotide diversity of 4% at a given nucleotide position was considered as the significant diversity threshold (as this corresponded to ≥2 reads per nucleotide position for a mean number of reads per position ≥50 at the genome scale).The 90 SARS-CoV-2 consensus genomes had been classified in lineages that circulated before or during the Alpha variant epidemic, based on the Pangolin (https://cov-lineages.org/resources/pangolin. html; [16]) and the Nextclade web applications (https://clades.nextstrain.org/;[17]).In 23, 9 and 5 cases, SARS-CoV-2 genomes were classified as belonging to 20A, 20B and 20C lineages, respectively, that derived from the Wuhan-Hu-1 isolate and circulated in our geographical area until May 2020.We sought for mutations in the SARS-CoV-2 quasispecies with the pileup function of the Pysam python module (https://www.python.org/)when analysing NGS reads, with parameters that include a quality score >13 as threshold.Complete genome mapping data generated by the CLC software and exported from the mapping output file as tab separated value files were analysed through an in-house Python (https://www.python.org/)script.

"Outlaw" mutations
We classified the mutations according to their frequencies among quasispecies as well as among consensus, or "democratic," genomes.We gave the name "outlaw" to the mutations that were the most frequent among quasispecies, as defined by their presence with a prevalence of ≥4% in ≥20% of the genomes analysed [13], by their absence among "democratic" genomes from the 90 nasopharyngeal samples [1], and by their complete or almost complete absence among "democratic" genomes from our centre as defined by a presence in ≤50 (0.1%) of 61,397 genomes [1], and among "democratic" genomes from the GISAID database as assessed through the Cov-Spectrum web application (https://covspectrum.org/) [15] and defined by a presence in ≤3000 (0.02%) of 15,396,904 genomes (Figure 1).
Gene positions were taken from the UCSC genome browser web application (https://genome.ucsc.edu/cgi-bin/hgGateway) [18].The four SARS-CoV-2 gene categories considered here had been defined in a previous study [1] and consisted of genes encoding (i) structural proteins; (ii) informational proteins (proteins involved in information storage and processing); (iii) other non-structural proteins; or (iv) accessory proteins.

Protein structural analyses
The structural model of the SARS-CoV-2 RdRp (NSP12 gene product) was retrieved from pdb file 7bv2 [19].An energy minimized model was generated with the Polak-Ribiere algorithm with the Bio-CHARMM force field in Hyperchem using a maximum of 3 × 10 5 steps and a root mean square (RMS) gradient of 0.01 kcal/mol.Å as the convergence condition [20].Mutations were introduced in RdRp with Deep View/Swiss-Pdb viewer, followed by several rounds of energy minimization as described previously [10].Molecular dynamics simulations (MDS) were performed on a Dell workstation with the Hyperchem program (http://www.hypercubeusa.com) as previously described [21].The systems were equilibrated at constant temperature (310 K) and constant pressure (1 atm) [20].The energy of interaction of each RdRp-template complex was calculated with the ligand energy inspector function of Molegro Molecular Viewer (http://molexus.io/molegro-molecularviewer),as described previously [22].

Presence, prevalence and location of the "outlaw" mutations in the SARS-CoV-2 quasispecies
Twenty-one positions in SARS-CoV-2 genomes analysed here were identified as harbouring "outlaw" mutations (Table 1; Figure 2) as they exhibited significant intra-sample nucleotide diversity but were exceptionally mutated in "democratic" genomes obtained from our centre and GISAID [15,16] relatively to the Wuhan-Hu-1 isolate genome.At these 21 positions, mean intra-sample nucleotide diversity ranged between 2.3% and 6.7%.In the 90 "democratic" genomes from quasispecies whose intra-sample nucleotide diversity was analysed, none of these positions were mutated.In the whole set of 61,397 "democratic" genomes obtained in our institute [1], the mean proportion of genomes harbouring these mutations was 0.003 ± 0.005% (0.000-0.020%).Finally, in a set of   15,396,904 "democratic" genomes analysed through the Cov-Spectrum online tool [16], this mean proportion was 0.007 ± 0.005% (0.001-0.017%).The 21 nucleotide positions were located in NSP12 (the RdRp encoding gene) in 8 cases; in the Nucleocapsid encoding gene in 8 cases; in the Spike encoding gene in 3 cases; in NSP2, a non-structural gene that encodes a protein that may interact with other viral proteins, in one case; and in NSP3 that encodes a predicted phosphoesterase in one case [23] (Tables 1 and 2; Figure 2).None of these positions was located in accessory genes.We only focused hereafter on mutations located in the RdRp.

Structural analysis of RdRp in complex with a template-primer RNA: effect of mutations generating an amino acid substitution
The structure of an energy-minimized model of RdRp in complex with a template-primer RNA is shown in Figure 3(a).Some of the "outlaw" mutations identified here are expected to affect the enzymatic activity of RdRp.Amino acid S578 does not directly interact with the template (Figure 3(b)).However, it is a critical residue that forms a stabilizing hydrogen bond with the peptidic NH group of D484 (Figure 3(c)).This ensures that the alpha-helix in position 563-581 comes in close contact with the long disordered loop in position 479-505.Mutation S578T is predicted to abolish the hydrogen bond with D484, leaving the 479-505 loop free to adopt alternative conformations that do not respect the functional 3D structure of the enzyme, potentially leading to a function loss.MDS of the S578T mutant confirmed the absence of hydrogen bond between S578T and D484 and the subsequent reorientation of the 479-505 loop.Indeed, the energy of interaction of D684 with residue 578 dropped from −8.3 kJ.mol −1 (hydrogen bond D484-S578, Figure 3 (c)) to 0 kJ.mol −1 (no hydrogen bond between D484 and the S578T mutant, Figure 3(d)).This affected Notes: Quasispecies diversity is the proportion of samples who display a significant intra-sample nucleotide diversity.Nucleotide diversity is the proportion of sequencing reads that do not harbour the majority, consensus nucleotide.4(a)).This could, at best, allow the formation of a low energy hydrogen bond between the template and the enzyme (this possibility is not observed in the structure shown in Figure 4(a)).Mutation Q573K is expected to attract the template more strongly on the enzyme by replacing a loose hydrogen bond with strong electrostatic interactions.Such changes in the mode of enzyme-substrate interaction could functionally affect RdRp.Congruently, MDS suggested that Q573K considerably increased the energy of interaction between RdRp and the template.Starting from initial conditions (Figure 4(a)) with no interaction between the Q573 amide group and the template (a suspected weak hydrogen bond was not observed), Q573K resulted in a strong electrostatic interaction, with a ΔG of −26.5 kJ.mol −1 (Figure 4(b)).There is no other amino acid at such level of energy of interaction between wild-type enzyme and its substrates.Altogether these in silico data suggested that the complex between the Q573K mutant and the template might be too sticky to be fully functional.

Discussion
In this work, we were able to implement a way of detecting unexpressed, also called "blacklist" mutations [2] or "outlaw" mutations, which are observed as minority mutations in the quasispecies and not in the consensus, "democratic" genomes.Interestingly, these "outlaw" mutations are found in the same genes as "hyperfertile" mutations [1], but at different nucleotide sites.In the case of the RdRp, where mutations are understandably not necessarily welcome, non-synonymous and synonymous mutations have been identified as "outlaw" mutations.In some cases that involved non-synonymous mutations, structural changes easily explain the loss of RdRp activity, which is not compatible with virus multiplication.Regarding synonymous mutations, structural explanation must be cautious.Nonetheless, these mutations may be associated with codons changes that modulate translation kinetics, which may alter protein folding in case it requires transitory stops of the ribosome induced by codons in the messenger RNA (mRNA) that match tRNA isoacceptors of low-abundance.For instance, it was thus reported in Escherichia coli that "silent" mutations inducing change to codons with highly abundant corresponding tRNA may affect the translational pausing, hence altering the co-translational folding of a bacterial protein (SufI) [24].Another study conducted on Escherichia coli genes over 40,000 generations in a long-term evolution experiment indicated that purifying selection had a tendency to get rid of synonymous mutations that alter the secondary structures of mRNA, which implies these mutations might decrease bacterial fitness [25].It was also reported that a synonymous mutation in the Multidrug Resistance 1 gene that encodes the P-glycoprotein corrupted the conformation of this protein in HeLa cells, and suggested that this mutation might alter the timing of co-translational folding and insertion of P-gp into the membrane, and consequently the protein function [26].
Besides, in hepatitis A virus, changes in optimization of codon usage in the capsid encoding region were found to tune translation kinetics for a proper folding of this protein [27], and in influenza A virus it was reported that synonymous mutations altering the structure of the nucleoprotein-encoding RNA could impact genome packaging and cause viral attenuation [28].Altogether, these data indicate that synonymous mutations in genomes of RNA viruses may not be strictly neutral.
Overall, these observations are critical for the understanding of SARS-CoV-2 evolution.Some mutations have enabled spectacular and unprecedented development, while others were abortive.The frequency of "outlaw" mutations in quasispecies is surprising, suggesting a form of adaptive immunity with particular targeting of abortifacient viral genomic areas.We speculate that these areas could be the subject of hybridization with microRNA or other coding or non-coding elements of the human genome in response to viral aggression, possibly with the memory of past infection with viruses presenting sequences with a relative similarity, allowing recombination, which is only identifiable through the presence of such "lethal" mutations.This could correspond to an as yet undescribed mechanism of adaptive immunity.In fact, 128 human microRNAs were identified in the human lung epithelium as having the capability to target the SARS-CoV-2 genome [29].
Interestingly, a "outlaw" mutation may be the result of the action of cellular apolipoprotein B mRNA-editing enzyme catalytic polypeptide-like (APOBEC) enzymes, a described antiviral mechanism [30].This APOBEC enzyme activity was also identified in the generation of "hyperfertile" mutations [1].It is clear that this activity is random and not specific.Lastly, it is interesting to observe that ORF7 and ORF8 genes, which could harbour stop codons conferring no disadvantage to or being advantageous for the virus, and which were the subject of the greatest proportion of "hyperfertile," "fertile," and neutral or weakly deleterious mutations [1], were not at all the subject of "outlaw" mutations.This suggests that these genes probably only present disadvantages in the development of SARS-CoV-2 in humans.Thus, this seems in line with the general evolutionary strategy of coronaviruses, which are probably of animal origin, and which lose useless genes while adapting to humans, including ORF8 [10] that may have already disappeared from other human endemic coronaviruses [31].In contrast, SARS-CoV and MERS coronavirus did not spread as prolifically as SARS-CoV-2, possibly because they lacked "hyperfertile" mutations in the RdRp and spike proteins and they kept the ORF8 "Mistigri" gene, although SARS-CoV experienced major deletions in that gene [32].
The present study has several potential limitations.First, we cannot exclude that NGS errors could contribute to the prevalence of some "outlaw" mutations, but error rate was previously estimated to be approximately 0.5% or lower with the Illumina technology on a MiSeq instrument [33][34][35].Second, some "outlaw" mutations might be missed by NGS but with this Illumina technology, the estimated lowest detection threshold of minority variants was reported to be 0.5-1% and estimated sensitivity while detecting minor viral variants in a mixture of standards was reported to be 97.5% for a minor variant with a prevalence of 1% [36].Beyond, other mutations than "outlaw" mutations can exist that are unseen through NGS in viral consensus, "democratic" genomes as well as in viral quasispecies, as they are too rare and/or too drastically deleterious for the virus.Third, structural explanation must be cautious regarding synonymous mutations as MDS is not valuable to be performed on the whole enzyme in such cases.Fourth, another limitation in the present study is the lack of further validation of the findings through in vitro experiments, which could be worthy to be done in future, dedicated works.Finally, sequential samples from patients chronically infected with SARS-CoV-2 were not investigated here, but this would be worthy to be performed in future studies to try observing the disappearance of "outlaw" mutations.
In conclusion, we believe that the deposition of all the quasispecies detected by SARS-CoV-2 genome NGS during the epidemic phases is essential to understand the evolution of the virus and to identify, among these quasispecies, those that will never be expressed in the bona fide, "democratic" genomes, making it possible to identify the most fragile genome areas and possibly, in the future, to prepare therapeutic tools based on the identification of these regions.Conserved viral genomic sites were previously deemed or demonstrated to be critical for virus replication and expansion [2,37].Here, by searching deep in the viral quasispecies, we could identify the nucleotide and amino acid changes in viral gene sequences that lead to a fatal virus phenotype, and described that they can consist of both non-synonymous and synonymous mutations.Such "outlaw" mutations can be searched in already available NGS data or searched, and surveyed in future ones.Their identification may be contributive in the field of replication-defective live virus vaccines by conferring a fatal phenotype [38,39].
Beyond, the identification and characterization of "outlaw" mutations are worthy to help gaining a better insight into the structure-activity relationships in viral proteins and how synonymous mutations can have an impact on the generation and functionality of proteins.This might improve the targeting of particular sites in the viral proteins by monoclonal antibodies, or in the viral genes by small interfering RNA [40,41].Anyway, here again, phenotypic studies should be performed at preliminary steps to confirm the detrimental effect of the mutations on the SARS-CoV-2 replication and propagation.

Figure 1 .
Figure 1.Schematic diagram to illustrate how SARS-CoV-2 genomes obtained by NGS were analysed to identify and locate the "outlaw" mutations, which are present in the virus quasipecies yet never expressed in the consensus, "democratic" genomes.Note: IHU MI, University Hospital Institute (IHU) Méditerranée Infection; NGS, next-generation sequencing.

Figure 2 .
Figure 2. Quasispecies diversity along the viral genome in comparison with nucleotide diversity in consensus genomes for the same set of clinical samples.

Figure 3 .
Figure 3. Localization of "outlaw" mutations in the structure of the SARS-CoV-2 RNA-dependent RNA polymerase (NSP12 gene product) in complex with a Template-Primer RNA.(a) Global view of RNA-dependent RNA polymerase (grey ribbons) with RNA template (green) and RNA primer (yellow).The positions of amino acid residues T85 (cyan), L576 (orange), K577 (blue) and S578 (red) are indicated.(b-c) Focus on the 576-578 triad near the template and D484.The arrows in panel c indicate a favourable interaction between the cationic group of K577 and the template.This interaction is optimized by a hydrogen bond between D484 and S578.(d) The S578T mutation breaks the hydrogen bond between amino acids 578 and 484, due to the methyl group of S578T.The energy of interaction of K577 with the template has been calculated for the wild-type (c) and the S578T mutant (d) after 100 ns of molecular dynamics simulation.

Figure 4 .
Figure 4. Effect of mutation Q573K in RNA-dependent RNA polymerase on the interaction with a Template-Primer RNA.(a) In the wild-type RNA-dependent RNA polymerase, Q573 is too far from the template to interact with it (ΔG = 0 kJ.mol −1 ).(b) In the case of Q573K, the cationic group of lysine interacts strongly with the template (ΔG = -26.5 kJ.mol −1 ), which may result in an inactive enzyme-substrate complex.

Table 1 .
Presence, location and prevalence of the "outlaw" mutations in the SARS-CoV-2 quasispecies.

Table 2 .
Correspondence between mutated nucleotide positions in the NSP12 gene that encodes the RNAdependent RNA polymerase gene and amino acid changes.