Genetic cluster analysis of SARS-CoV-2 and the identification of those responsible for the major outbreaks in various countries

ABSTRACT A newly emerged coronavirus, SARS-CoV-2, caused severe pneumonia outbreaks in China in December 2019 and has since spread to various countries around the world. To trace the evolution route and probe the transmission dynamics of this virus, we performed phylodynamic analysis of 247 high quality genomic sequences available in the GISAID platform as of 5 March 2020. Among them, four genetic clusters, defined as super-spreaders (SSs), could be identified and were found to be responsible for the major outbreaks that subsequently occurred in various countries. SS1 was widely disseminated in Asia and the US, and mainly responsible for outbreaks in the states of Washington and California as well as South Korea, whereas SS4 contributed to the pandemic in Europe. Using the signature mutations of each SS as markers, we further analysed 1539 genome sequences reported after 29 February 2020 and found that 90% of these genomes belonged to SSs, with SS4 being the most dominant. The relative degree of contribution of each SS to the pandemic in different continents was also depicted. Identification of these super-spreaders greatly facilitates development of new strategies to control the transmission of SARS-CoV-2.


Introduction
A number of newly emerged coronaviruses such as the highly pathogenic severe acute respiratory syndrome coronavirus (SARS-CoV) and Middle East respiratory syndrome coronavirus (MERS-CoV) have caused serious respiratory and intestinal infections in humans within the past two decades [1]. In December 2019, another new coronavirus, SARS-CoV-2, emerged and caused outbreaks of lower respiratory tract infections, often with poor clinical outcome, in Wuhan, China. The virus, which has since spread to other cities in China and various countries worldwide [2], exhibited a high potential to undergo human-to-human transmission [3]. As of 7 May 2020, 3.7 million infections were recorded worldwide, among which a total of 3 million cases occurred in Northern America and Europe, and 83,000 cases were documented in China (https://www.gisaid.org/epiflu-applications/global-cases-betacov/). The WHO declared the risk of SARS-CoV-2 infection as "Very High" in late February (https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200228-sitrep-39-covid-19.pdf).
The genomic characteristics of SARS-CoV-2 have been elucidated by means of phylogenetic, structural and mutational analyses by scientists around the world [4]. High-throughput sequencing showed that SARS-CoV-2 was a novel betacoronavirus which resembled SARS-CoV at around 79.5% sequence identity [5,6]. A recent study indicated that SARS-CoV-2 was 96% identical to a bat coronavirus RaTG13 (accession: MN996532) at the genomic level, suggesting that bats might be a natural host of SARS-CoV-2 [7]. GISAID is a platform for sharing genetic data of influenza. Currently, a rapidly increasing number of SARS-CoV-2 genomic sequences are being deposited into this database from laboratories around the world [8]. On the other hand, some recent studies also suggested that Malayan pangolins (Manis javanica) could be the intermediate host of this new coronavirus, since the amino acid sequence of the S protein of coronaviruses derived from Malayan pangolins illegally imported to Guangdong Province of China, as well as coronaviruses harboured by pangolins in Guangxi Province of China, exhibited very high homology with the S protein sequence of SARS-CoV-2, even though the overall homology between SARS-CoV-2 and RaTG13 is still the highest [9]. However, because SARS-CoV-2 could not be detected in or isolated from pangolins in the Wuhan Huanan Seafood Wholesale Market, the site that the first batch of infected patients had commonly visited, the theory of pangolins being the culprit of the Wuhan pneumonia outbreak is not substantiated. The intermediate host for SARS-CoV-2 therefore remains a mystery. In fact, it remains unclear whether the Huanan Seafood Wholesale Market is the origin of this outbreak, as some of the earliest cases were confirmed to have no linkage with this market. It is necessary to identify the source(s) of the virus(es) that caused this outbreak in order to design more effective control measures to stop the continuous worldwide transmission of these highly contagious viruses.
With more sequences being released, we can obtain a more comprehensive view of the genomic features of this virus through in-depth sequence analysis. One recent study analysed over 100 available genome sequences and showed that sequences belonging to different genetic clusters have evolved [10]. In this work, we retrieved and analysed the publicly shared genome sequences of SARS-CoV-2 available as of 26 March 2020 to investigate the genetic diversity and phylodynamics of these viruses. We identified four distinct viral clusters which apparently exhibit a high mutation rate and have become the dominant viruses in the global pandemic that started in January 2020. Results of this study should provide valuable insight into key factors that scientists and clinicians need to consider in the control of the SARS-CoV-2 pandemic, in particular the transmission fitness and differential virulence levels of different viral strains.

Sequence analysis, alignment and mutation identification
A total of 343 full-length SARS-CoV-2 genomes available in the GISAID platform (https://platform.gisaid.org/) as of 10 March 2020, with 5 March 2020 as the cut-off date, were downloaded [8]. A total of 247 sequences with high sequence quality as noted in the GISAID database were included for further analysis after removing sequences that contained little temporal signal and were thus unsuitable for inference using phylogenetic molecular clock models. Information regarding the date and country of isolation was also retrieved from the GISAID platform. The annotated reference genome sequence of the SARS-CoV-2 isolate Wuhan-Hu-1 (accession: NC_045512.2) was downloaded from the NCBI GenBank database. All genomes were annotated by GATU Genome Annotator [11] using the SARS-CoV-2 isolate Wuhan-Hu-1 (NC_045512.2) as reference [12]. Nucleotide and amino acid mutations of all genomes and separate proteins were analysed by BLAST (https://blast.ncbi.nlm.nih.gov/) using the sequence of strain Wuhan-Hu-1 as reference.

Phylodynamic analysis
Global genomic surveillance of SARS-CoV-2 was implemented by means of the automated phylogenetic analysis pipeline Nextstrain, which generates an interactive visualization integrating a phylogeny with sample metadata such as geographic location or isolation date [13]. The pipeline involved sequence alignment with MAFFT [14], phylogenetic analysis with IQ-TREE [15], maximum-likelihood phylodynamic analysis with TreeTime [16], identification of nucleotide and amino acid mutations with Augur, and result visualization with Auspice [13]. The outputs were edited with Inkscape 0.91 [17].

Quick identification of the types of SARS-CoV-2 genomes in the database
All complete genomes available in the GISAID database as of 28 March 2020, with 26 March 2020 as the cut-off date, were downloaded. Single nucleotide polymorphism (SNP) calling was performed by Snippy (https://github.com/tseemann/snippy) using Wuhan-Hu-1 as reference. Genomes were assigned to super-spreader clusters according to the relevant variants. A total of 1539 qualified genomes submitted after 29 February 2020 were included.

Phylodynamics analysis of genome sequences of SARS-CoV-2 strains collected worldwide
To trace the evolution process and identify the common ancestor of 247 strains of SARS-CoV-2 collected worldwide, a root-to-tip regression analysis was conducted among all SARS-CoV-2 genomes, with R² being found to be 0.23, suggesting that these 247 viral sequences shared a common recent ancestor (Figure S1a). The date of the most recent common ancestor (tMRCA) of all reported SARS-CoV-2 viruses was 12 November 2019, suggesting that this virus emerged recently (Figure S1a). A total of 379 nucleotide mutations were identified among these 247 sequences based on sequence alignment, among which G11083T (n = 5), T3G (n = 3), G29864A (n = 3), C29870A (n = 3), A1T (n = 2), A4T (n = 2), T4402C (n = 2), G5062T (n = 2), T18603C (n = 2) and G22661T (n = 2) were the most homoplasic mutations (Figure 1, Table S1). A total of 147 strains were found to contain a single amino acid change, with the majority of such changes being located within ORF1ab (n = 104). The L3606F change was detected in two viral sequences, while other mutations occurred only once. Mutations that result in amino acid changes include single substitutions in the S protein (n = 19: D614G, L752F, F32I, H655Y, V483A, F157L, V615L, K202N, S939F, F797C, A930V, R408I, V367F, Q409E, S254F, A435S, D1146E, S247R and P1143L), ORF3a (n = 8: E191G, G76S, K61N, V259L, T176I, L140V, T269M and V88L), N protein (n = 6: K247I, S194L, P46S, S327L, E378Q and D343V), ORF8 (n = 4: T11I, L84S, S97N and S67F), ORF7a (n = 3: P34S, Q62* and H73Q), ORF10 (n = 2: P10S and I13M) and E protein (n = 1: S6L) (Figure S1b). Identification of single amino acid substitutions in SARS-CoV-2 isolates consistently showed that these isolates shared a recent common ancestor but entered diverse evolution paths.
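As an illustration of what a root-to-tip regression computes, the following minimal sketch fits genetic distance from the root against sampling date by ordinary least squares. The slope estimates the substitution rate, the date at which the fitted line crosses zero estimates the tMRCA, and R² measures the strength of the temporal signal. This is not the analysis performed in this study, which relied on TreeTime within Nextstrain; the function name `root_to_tip_regression` is hypothetical, and the dates and distances are invented toy values rather than data from the 247 genomes.

```python
from statistics import mean


def root_to_tip_regression(dates, distances):
    """Least-squares fit of root-to-tip distance against sampling date.

    dates: sampling times in decimal years.
    distances: root-to-tip distances in substitutions per site.
    Returns (rate, tmrca, r_squared).
    """
    x_bar, y_bar = mean(dates), mean(distances)
    sxx = sum((x - x_bar) ** 2 for x in dates)
    syy = sum((y - y_bar) ** 2 for y in distances)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(dates, distances))
    rate = sxy / sxx                   # slope, in subs/site/year
    intercept = y_bar - rate * x_bar
    tmrca = -intercept / rate          # date at which the fitted distance is zero
    r_squared = sxy ** 2 / (sxx * syy)
    return rate, tmrca, r_squared


# Invented toy data, for illustration only
toy_dates = [2019.98, 2020.02, 2020.05, 2020.10, 2020.15, 2020.17]
toy_distances = [0.00003, 0.00010, 0.00007, 0.00017, 0.00014, 0.00024]
rate, tmrca, r2 = root_to_tip_regression(toy_dates, toy_distances)
print(f"rate = {rate:.2e} subs/site/year, tMRCA = {tmrca:.2f}, R^2 = {r2:.2f}")
```

A low R², as with the value of 0.23 reported here, indicates that sampling date explains only part of the variation in root-to-tip distance, which is expected when sequences are sampled over a short time span.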
The estimated substitution rate of SARS-CoV-2 was 8.90e-4 subs/site/year, which was similar to that of other RNA viruses including SARS-CoV, Ebola virus, Zika virus and others, which was found to be ∼1e-3 subs/site/year (http://virological.org/t/phylodynamic-analysis-93-genomes-15-feb-2020/356). Based on this mutation rate, a SARS-CoV-2 genome of 29 kb (kilobases) will accumulate ∼26 mutations per year, suggesting that within the four-month study period, the number of mutations in each genome should not exceed ten if all test isolates emerged as a result of natural evolution of a single SARS-CoV-2 strain.
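The arithmetic behind this estimate can be checked with a short sketch. It assumes a constant-rate molecular clock and the rounded genome length of 29 kb used in the text; the function name `expected_mutations` is hypothetical.

```python
def expected_mutations(rate_per_site_per_year, genome_length, years):
    """Expected substitutions per genome under a constant-rate clock."""
    return rate_per_site_per_year * genome_length * years


RATE = 8.90e-4           # subs/site/year, the estimate quoted in the text
GENOME_LENGTH = 29_000   # ~29 kb, the rounded genome size used in the text

per_year = expected_mutations(RATE, GENOME_LENGTH, 1.0)
per_four_months = expected_mutations(RATE, GENOME_LENGTH, 4 / 12)
print(f"per year: {per_year:.1f}")                # per year: 25.8
print(f"per four months: {per_four_months:.1f}")  # per four months: 8.6
```

The four-month figure of about 8.6 substitutions is the basis for the upper bound of ten mutations per genome stated above.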

Multiple origins of SARS-CoV-2 in Wuhan, China
To shed light on the evolution trend of SARS-CoV-2, we analysed the time-dependent changes in mutation profiles of the test strains in detail. A total of 16 viral genomes collected before 1 January 2020 were included (Table S2). All of these 16 genomes were obtained from Wuhan, with half being from Huanan Seafood Wholesale Market (HNSM). Six genomes contained identical sequences, four of which belonged to isolates obtained from HNSM. Compared to these six viral genomes, the others displayed various mutation profiles comprising 1 to 6 mutations in the genome. We therefore set these six genomes as the reference genome for subsequent analyses. The two earliest viral genomes, reported on 24 and 26 December 2019, were found to harbour two and three mutations, respectively, when compared to the reference viral genome. Four viral genomes from HNSM also contained two mutations with different profiles, suggesting that the original SARS-CoV-2 strain might have been circulating in HNSM for a certain period of time and underwent mutational changes in different intermediate hosts before infecting humans (Table S1). These observations suggested that HNSM was not the only origin of the COVID-19 outbreak; instead, the market might only have served as a medium in which transmission of this virus to humans first occurred. The original virus was subsequently transmitted to various provinces in China, including Guangdong, Zhejiang, Anhui, Jiangsu and Chongqing, and then to other countries such as Japan, Taiwan, Thailand and USA in the following month (January 2020). The viral genomes reported initially from USA were those of viruses recovered from patients on the Diamond Princess cruise ship, confirming that the original virus was the one that caused the outbreak on this cruise ship; this view is consistent with the finding that identical genomes were reported in Japan, where the cruise ship was docked. A total of 26 out of the 247 genome sequences tested contain one mutation. Unless isolated from the same location, most of these genomes exhibit a unique mutational profile. Five sequences from the Diamond Princess cruise ship were found to exhibit unique mutation profiles, further suggesting that the virus could undergo adaptive evolution during the transmission process, generating a number of genetic variants. It should be noted that these genome sequences were also reported in Wuhan, other parts of China and various other countries, confirming that the transmission of the original virus to different parts of the world was accompanied by active but random mutational changes during the process (Table S1).

Phylogenetic analysis of genome sequences of SARS-CoV-2
Phylogenetic analysis of the 247 SARS-CoV-2 genomes was also performed, with results showing that such viral genomes exhibited highly diverse genetic profiles and that random mutations occurred during the evolution process within the first two months. Interestingly, four distinct clusters of genome sequences could be identified among the 247 genomes, with the rest exhibiting more diverse profiles. These results were consistent with the data of maximum-likelihood phylodynamic analysis shown in Figure 1. Comparison of the mutation profile of each cluster enabled us to discover that all viral genomes in the same cluster were derived from one parental viral strain which bears a signature mutation profile, as such a profile could be identified in all offspring of that parental strain (Figure 1). The first cluster contained two mutations, C8782T and T28144C; the second cluster contained the mutation G26144T; the third cluster contained the mutation G11083T; the fourth cluster contained three mutations, C241T, C3037T and A23403G. Tracing the changes in mutation profiles of these viral genomes over time allowed us to visualize the transmission and evolution dynamics of SARS-CoV-2. Since viruses of all four clusters exhibited very high potential to undergo global transmission, we define viruses in these four clusters as super-spreader cluster 1 (SS1), 2 (SS2), 3 (SS3) and 4 (SS4), respectively.
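The use of signature mutations as cluster markers can be sketched as a simple set-membership test. The sketch below assumes that SNPs have already been called against Wuhan-Hu-1 and are written in the form reference base, position, alternative base. The names `SIGNATURES` and `classify_genome` are hypothetical, and requiring every signature mutation of a cluster to be present is an assumption rather than a documented detail of the procedure used in this study.

```python
# Signature mutation profiles of the four clusters, as listed in the text
SIGNATURES = {
    "SS1": {"C8782T", "T28144C"},
    "SS2": {"G26144T"},
    "SS3": {"G11083T"},
    "SS4": {"C241T", "C3037T", "A23403G"},
}


def classify_genome(mutations):
    """Return the clusters whose complete signature occurs in `mutations`.

    mutations: iterable of SNP strings relative to Wuhan-Hu-1, e.g. "C8782T".
    An empty list means the genome carries none of the four signatures.
    """
    observed = set(mutations)
    return [name for name, signature in SIGNATURES.items() if signature <= observed]


print(classify_genome(["C8782T", "T28144C", "C18060T"]))  # ['SS1']
print(classify_genome(["C241T", "C3037T"]))               # []
```

Returning a list rather than a single label leaves room for genomes that carry more than one signature; how such cases should be resolved is left open here.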

Evolution and transmission of super-spreader cluster 1 (SS1)
SS1 carried the signature mutation profile of C8782T and T28144C. The C8782T change is a silent mutation, whereas T28144C is associated with the amino acid substitution L84S in the ORF8 protein. The SS1 viruses were presumably transmitted very efficiently, as a total of 85 out of the 247 (34%) genome sequences tested were found to belong to this cluster as of 3 March 2020. The earliest sequence in this cluster was reported in Wuhan, China on 5 January 2020; another seven were subsequently reported in January and February in different parts of China and Australia, suggesting that widespread transmission of this cluster of viruses occurred (Table 1). The viruses in SS1 were mainly transmitted among Asian countries such as China, Vietnam, Japan, South Korea, Taiwan and Singapore, but were also detectable in North America, in particular the states of California and Washington in USA (Table 1). The viruses in SS1 were also found to mutate rapidly along the transmission paths. Three genome sequences that were reported from Australia, Vietnam and USA on 24 February, 28 February and 3 March 2020, respectively, were found to harbour a total of 11 mutations. An additional nine mutations were thus acquired by the parental virus within 50 days (from 5 January to 24 February 2020), corresponding to a mutation rate of 2.3e-3 subs/site/year (29 kb genome size), which was much higher than the predicted mutation rate of SARS-CoV-2 (4.057e-4 subs/site/year) and other coronaviruses such as SARS-CoV and MERS-CoV. Among viral genomes in this cluster, 43 of the 85 genomes exhibited five or more mutations (Table 1).
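The within-cluster rate quoted for SS1 follows from dividing the number of additional mutations by the genome length and by the elapsed time in years. The sketch below reproduces that arithmetic; it is a crude single-lineage calculation rather than a phylogenetic estimate, and the function name `empirical_rate` is hypothetical.

```python
def empirical_rate(n_mutations, days, genome_length=29_000):
    """Substitutions per site per year implied by n_mutations gained in `days`."""
    return n_mutations / genome_length / (days / 365)


# SS1: nine additional mutations within 50 days, 29 kb genome
ss1_rate = empirical_rate(9, 50)
print(f"{ss1_rate:.1e} subs/site/year")  # 2.3e-03 subs/site/year
```

Because this figure is derived from the most divergent genomes in the cluster, it is better read as an upper bound on the rate along one lineage than as a cluster-wide average.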
Detailed analysis of mutation profiles of the genome sequences in SS1 enables us to trace the evolution routes of these viruses in specific regions. In Washington State, USA, a genome sequence with three mutations, C18060T, C8782T and T28144C, was reported on 25 January 2020. A virus carrying these three (Table 1). This virus was then further transmitted in the Washington area and continued to acquire mutations. Twelve genome sequences reported between 1 and 5 March 2020 in the states of Washington and California contained 6-11 mutations. These data represent direct evidence of active evolution that results in a large number of mutational changes during the process of transmission of a single virus within a short period of time. In addition, a genome sequence which had two additional mutations when compared to the original virus in this cluster, but which differed from those in the genome sequences from Washington, was reported in Sichuan, China, suggesting that the same parental virus was also transmitted across China during this period (Table 1).

Evolution and transmission of super-spreader 2 (SS2)
SS2 carried the signature mutation G26144T, which resulted in the G251V amino acid substitution in the ORF3a protein of SARS-CoV-2. The first viral genome in this cluster was reported in Australia on 25 January 2020. Among the 247 sequences tested, a total of 28 (11.2%) were found to belong to SS2. The parental virus had acquired different mutations and had been disseminated to various Asian countries, North America (USA), Europe, South America (Brazil) and Australia. Viruses in this cluster seemed to be extensively transmitted from the end of January until early February. By the end of February, however, the transmission efficiency of such viruses seemed to have dropped, as only 4 of 28 sequences reported during the period 26 February to 3 March 2020 belonged to this cluster. Viruses in this cluster were also found to have mutated significantly, with a total of 12 mutations observed in one strain isolated in Washington on 27 February 2020 (Table 2). Our data showed that as many as 11 additional mutations were acquired by the parental virus within a 30-day period (from 25 January to 27 February 2020), representing a mutation rate of 4.6e-3 subs/site/year (29 kb genome size), which was much higher than the predicted mutation rate of SARS-CoV-2 (8.0e-4 subs/site/year).

Evolution and transmission of super-spreader 3 (SS3)
SS3 carried the signature mutation G11083T, which caused the L3606F amino acid substitution in the ORF1ab protein of SARS-CoV-2. The first viral genome in this cluster was reported on 18 January 2020 in Chongqing, China. A total of 22 such genome sequences have been reported so far, accounting for 9% of the 247 SARS-CoV-2 sequences documented in the GISAID database as of 5 March 2020. This cluster has since been transmitted to several Asian countries including Singapore and Japan, as well as Europe, USA and Australia (Table 3). Like SS1 and SS2, viruses in this cluster were also found to mutate efficiently, with one genome reported on 27 February 2020 from Washington, USA, carrying 12 mutations. Our data showed that a total of eleven mutations were acquired by the parental virus within a 40-day period (from 18 January to 27 February 2020), with a mutation rate of 2.8e-3 subs/site/year (29 kb genome size), which was again much higher than the predicted mutation rate of SARS-CoV-2 (8.0e-4 subs/site/year). Curiously, no virus of this cluster has been reported in Iran, a country with one of the highest incidences of SARS-CoV-2 infection. However, two genome sequences from Australia, which belong to viruses recovered from patients with a history of travel to Iran, were reported, suggesting that this cluster of viruses might also have contributed to the outbreaks in Iran. In addition, the first genome sequence from Brazil, which might have originated from Italy, also belonged to this cluster (Table 3).

Evolution and transmission of super-spreader 4 (SS4)
SS4 carried a signature mutation profile that consists of three mutations: C241T, C3037T and A23403G. The C241T and C3037T changes are silent mutations, whereas A23403G results in the D614G substitution in the spike (S) protein of SARS-CoV-2. SS4 viruses were found to be transmitted only in Europe, with the exception of one genome reported from Mexico, from a patient with a history of travel to Italy. Viruses in SS4 were responsible for the explosive increase in incidence of COVID-19 in Europe in March (Table 4). Compared to SS1, 2 and 3, viruses in SS4 were reported more recently, mostly from the end of February to early March. A total of 21 of the 247 genomes examined (8.4%) were found to belong to SS4. Among genome sequences of the four clusters of super-spreaders, none was found to contain only one of the three SS4 mutations or a combination of two of the three mutations, suggesting that the parental viral genome of SS4 could not be identified. The first virus of this cluster, reported on 28 January 2020 in Germany, has acquired another silent mutation, C14408T, and was further disseminated to

Temporal and spatial distribution of super-spreaders of SARS-CoV-2
To better understand the temporal and spatial distribution of these super-spreaders, we plotted the variation in the types of genome sequences recovered from different continents against time. The original viruses were found to be spreading in the week before the emergence of these super-spreaders. SS1 was the first batch of viruses that emerged, and its dissemination continued throughout the study period. Other SSs emerged at different time points, and their transmission also peaked at different dates. Transmission of SS2 and SS3 mainly occurred between mid-January and mid-February. Transmission of SS4 viruses mainly began at the end of February.
Viruses of the four clusters exhibited a much higher mutation rate, when compared to the original genome, than those which exhibited diverse genetic profiles and could not be allocated to a specific genetic cluster (Figure 2(a)). SS1 viruses were those which were disseminated extensively in China, in particular in the later stage of the outbreak (Figure 2(b)). SS1, SS2 and SS3 were prevalent in other Asian countries (Figure 2(c)). All four clusters were involved in the outbreaks in Europe at the early stage, but SS4 was the cluster that eventually transformed the outbreaks in Europe to the pandemic level (Figure 2(d)). In Oceania, SS1 was involved mainly in the early stage of the outbreak, yet SS2 became dominant at a later stage (Figure 2(e)). SS1 and other types of viruses were the major transmitters in the US. SS1 was shown to be transmitted mainly in the states of Washington and California, whereas the other types were mainly transmitted in other states (Figure 2(f), Table S1).

Distribution of different super-spreader types of most recent SARS-CoV-2 in different parts of the world
Upon completion of analysis of SARS-CoV-2 sequences available in the GISAID database as of 5 March 2020 and identification of the four "superspreader" type strains, we investigated if viral strains of the four super-spreaders were responsible for the vast majority of subsequent infections. A total of 1539 genome sequences reported after 29 February 2020 were included for a quick analysis to identify the type of these most recent genomes. As shown in  (Table 5).

Discussion
We conducted detailed and comprehensive analyses of 247 high quality SARS-CoV-2 sequences deposited in the GISAID database during the period December 2019 to 5 March 2020 to provide insight into the evolution and transmission of this novel virus (Figure 3). Our data indicated that the ancestor strain of SARS-CoV-2 could have emerged as early as November 2019. According to the timeline of outbreaks, the original virus from Wuhan city and HNSM was responsible for the initial transmission of SARS-CoV-2 in various countries in January. The origin of the outbreak was not limited to HNSM; instead, infections that occurred at multiple sites in Wuhan city might have contributed more significantly to the early transmission events and subsequent dissemination to different parts of China and various countries around the world. These data implied that wild animals sold in HNSM may not be the intermediate host of SARS-CoV-2, as sources other than HNSM are also considered the origin of this virus. Given the fact that multiple patients in Wuhan were simultaneously infected by viruses of different genetic composition in the initial outbreak, we hypothesize that a common wild animal would be the most likely intermediate host. Alternatively, a common environmental factor, such as a faulty sewage system, may be involved. It is necessary to investigate the possible role of a common animal vector or dissemination route in eliciting the initial outbreak that involved multiple SARS-CoV-2 strains. Interestingly, as the original virus continued to be transmitted in China and all over the world, it evolved into four major genetic clusters, namely the super-spreader clusters, along with other non-cluster variants derived from the original virus. Each SS cluster carried one or more unique signature mutation(s), which enables us to trace the origin and transmission paths of most subsequently recovered SARS-CoV-2 strains.
In the early transmission stage (December 2019 and early January 2020), variants from the original virus were dominant, yet by the end of February and early March, members of the four super-spreader clusters became dominant, with different SSs being prevalent in different regions of the world. SS1 was prevalent in China and other parts of Asia and became the major virus that caused severe outbreaks in South Korea and in the states of Washington and California in the US; SS2 and SS3 were extensively transmitted in other parts of Asia and Europe from the end of January to early February, but their prevalence dropped at the end of February and early March, when they were replaced by SS4, which also contributed to the pandemic in Europe. Interestingly, SS4 was not reported in China or other parts of the world. The first genome of this cluster was reported in Germany and contributed to the rapid dissemination of SARS-CoV-2 in Europe. The mutation profile of SS4 is unique, with three mutations being observed in the first viral genome. Importantly, the mutation A23403G, which results in the D614G substitution in the S protein of SARS-CoV-2, was also identified as a characteristic genetic change in a highly transmissible strain by Korber et al. [20]. Genomes with only one or two of these three mutations were not reported elsewhere. However, these data do not necessarily imply that SS4 originated from Europe. One limitation of the study is that we can only make assessments using currently available genome sequences. The lack of SS4 genome sequences from other continents does not necessarily mean that SS4 viruses are not present there. In fact, viral strains carrying the D614G substitution were already recovered in Canada and the USA in March 2020 in our second phase analysis to verify the transmission potential of strains of the four super-spreaders [20].
A second limitation of this study is the lack of data to explain the mechanisms underlying the evolution of various genetic clusters into super-spreaders. Since every super-spreader cluster carries at least one amino acid substitution, whether such amino acid changes enabled SARS-CoV-2 to exhibit superior transmission potential needs to be investigated in future research studies.
Our data also unveiled the genetic features and transmission paths of major viral strains responsible for the current global pandemic of SARS-CoV-2 in detail. For example, in Italy, SS2 and SS3 were prevalent at the end of January but gave way to SS4 in February and early March. Similar trends were observable in other countries, with the exception of a consistently high proportion of the original viral genomes in the Netherlands throughout the course of the pandemic. In the US, the original viruses were reported in various states, whereas SS1 was dominant in the states of Washington and California. Other SS genomes were also sporadically reported in the US. Although data from Iran are not available, two genomes reported from Australia, from patients with a history of travel to Iran, were shown to belong to SS3, suggesting that this cluster was responsible for the pandemic in Iran. In Australia, all genome types except SS4 were reported.
[Figure caption, partially recovered: SS1 strains were transmitted mainly in Asia and the US but were less prevalent in other parts of the world. SS2 and SS3 strains were transmitted mainly in Asian countries other than China, as well as Europe from mid-January to mid-February. SS4 strains were transmitted mainly in Europe at the beginning of the pandemic and were then transmitted all over the world.]
Our data are consistent with those obtained in a recent study. Analysis of 160 complete genomes of SARS-CoV-2 by Forster et al. identified three central variants, namely Type A, B and C [21]. Type A was the ancestral virus. Type B strains carried the C8782T and T28144C mutations, and were thus equivalent to our SS1. Type C, which carried the G26144T change, was equivalent to SS2. Forster et al. found that Type A and C were mainly transmitted in Europe and America, whereas Type B was the most common type in East Asia. Our results are therefore highly consistent with theirs, but our work elucidates the transmission patterns in greater detail. For example, we found that viruses in SS1, or Type B according to the work of Forster et al., were mainly transmitted among Asian countries such as China, Japan, South Korea, Taiwan and Singapore, but were also common in North America, in particular the states of California and Washington in USA. Furthermore, we also identified SS3 and SS4, as more viral sequences were included in our analysis. The accuracy of this kind of phylodynamic analysis depends on the comprehensiveness of the viral genome sequences procured at different stages of infection. Data will inevitably be biased if genomic sequences of viruses that caused infections in specific countries are under-represented. Nevertheless, we were able to validate the accuracy of our phylodynamic analysis by using the signature mutations as markers for different SSs to determine the relative prevalence of each SS type in 1539 genomes reported in March. The data further confirmed that the four SSs continued to be dominant, with around 90% of the genomes belonging to these four SSs, among which SS4 remained the major cluster being disseminated in Europe.
This second dataset showed that viruses of SS4 have since been transmitted to other parts of the world including Africa, Asia, North America and Oceania. SS1 continued to be the major type in the US but has also been transmitted to South America, in particular Brazil. These data appear to suggest that SS1 and SS4 have out-competed SS2 and SS3 and become the super-spreader strains responsible for future transmission of SARS-CoV-2.
In conclusion, this study shows that four major genetic clusters of viruses have evolved from the original SARS-CoV-2 and have been transmitted extensively, each becoming dominant in different parts of the world, and that viruses without any signature mutation of the four super-spreaders appear to be transmitted much less efficiently. These super-spreaders exhibit not only high transmission efficiency but also a high mutation rate without compromising infectivity, posing an enormous challenge to the control of future transmission of SARS-CoV-2.