Comparative genomic analysis of two emergent human adenovirus type 14 respiratory pathogen isolates in China reveals similar yet divergent genomes

Human adenovirus type 14 (HAdV-B14p) was originally identified as an acute respiratory disease (ARD) pathogen in The Netherlands in 1955. For approximately fifty years, few sporadic infections were observed. In 2005, HAdV-B14p1, a genomic variant, re-emerged and was associated with several large ARD outbreaks across the U.S. and, subsequently, in Canada, the U.K., Ireland, and China. This strain was associated with an unusually higher fatality rate than previously reported for both this prototype and other HAdV types in general. In China, HAdV-B14 was first observed in 2010, when two unrelated HAdV-B14-associated ARD cases were reported in Southern China (GZ01) and Northern China (BJ430), followed by three subsequent outbreaks. While comparative genomic analysis, including indel analysis, shows that the three China isolates, with whole genome data available, are similar to the de Wit prototype, all are divergent from the U.S. strain (303600; 2007). Although the genomes of strains GZ01 and BJ430 are nearly identical, as per their genome type characterization and percent identities, they are subtly divergent in their genome mutation patterns. These genomes indicate possibly two lineages of HAdV-B14 and independent introductions into China from abroad, or subsequent divergence from one; CHN2012 likely represents a separate sub-lineage. Observations of these simultaneously reported emergent strains in China add to the understanding of the circulation, epidemiology, and evolution of these HAdV pathogens, as well as provide a foundation for developing effective vaccines and public health strategies, including nationwide surveillance in anticipation of larger outbreaks with potentially higher fatality rates associated with HAdV-B14p1.

Beginning in 2005, HAdV-B14 re-emerged in the U.S., with multiple outbreaks in both civilian and military settings, as a highly contagious respiratory pathogen that was associated with a 76% hospitalization rate and an 18% fatality rate. [47][48][49] This fatality rate was alarming and unusual for a HAdV epidemic, particularly in presumably otherwise healthy adults. The same genome type, HAdV-B14p1, appeared in Europe with two ARD outbreaks in Ireland (2009) and the U.K. (2011). 5,11,50 These resulted in the highest fatality rates reported for HAdV-B14 outbreaks, at 33% and 23%, respectively. 5,11,50 In 2011, this pathogen re-emerged in Canada and caused one death in the three patients hospitalized with ARD. 49 In China, there were no HAdV-B14 cases reported until 2010, when HAdV-B14 emerged nearly simultaneously in two geographically distinct locations (Guangzhou and Beijing). 12,32 Following these two reports, at least three additional HAdV-B14-related ARD outbreaks in China were recorded: One occurred in the Gansu Province (2011) in which 43 students in an elementary school setting presented with febrile respiratory illnesses, 51 one occurred in Beijing (2012) in which 30 adults presented with severe symptoms that required hospitalization, 33 and the third occurred in Liaoning Province (2012) in which 24 students in a middle school presented with febrile respiratory illnesses. 52 Unlike the U.S. 303600 strain, these five China strains did not have fatalities associated with them.
Given the severity of the symptoms and the numbers of afflicted individuals in China, the earlier outbreaks, with associated higher morbidity and fatality rates in the U.S., Canada, and Europe, and the putative transmission of HAdV-B14 from U.S. to Asia (South Korea) by military trainees (2007), 53 it is critical to determine the relationships between these China isolates and their global counterparts in order using high resolution genome analysis to understand the epidemiology of this re-emergent pathogen. These data will provide a basis for public health preventive measures, including diagnosis, surveillance, and appropriate protocols for limiting outbreaks and prevention. Specific information of the mutation rates, mutation hotspots, and new variations, particularly at their serologic recognition sites, will play a key role in the development of vaccines against these pathogens. This report presents the bioinformatics analysis of the HAdV-B14p1 Guangzhou strain (GZ01; October 2010), which was the first HAdV-B14 strain isolated in China, 12 and compares it with the contemporaneous Beijing strain (BJ430; December 2010) as well as a subsequent strain isolated in 2012 (CHN2012; February 2012). Provided is a high-resolution view of highly similar yet intriguingly divergent genomes, linking one of these strains with the globally re-emergent strain isolated in the U.S. in 2005, and suggesting at least two co-circulating lineages of type 14.

Cells, virus stock, and DNA preparation
HAdV-B14p1 GZ01was isolated from a throat swab of a 17-monthold child hospitalized with acute suppurative tonsillitis in Guangzhou, China (October 2010). 12 Viral DNA was extracted using a Viral DNA Extraction Kit (Omega Bio-Tek Inc Corp; Norcross, GA). The protocol for PCR amplification is described earlier. 54 All the experimental protocols in this study were approved by the institutional ethics committee of Southern Medical University and were carried out in accordance with the approved guidelines. The informed consent for participation in this study was obtained from the guardian of the under-aged participant. Data records of the sample and sample collection are de-identified and completely anonymous.

Genome sequencing and annotation
The genome sequence of HAdV-B14p1 strain GZ01 was obtained using the Sanger sequencing method following PCR amplification of targeted overlapping 1-2 kb regions that covered the entire genome, as described earlier. 12 The sequence data, collected with an ABI 3730 Genetic Analyzer, provided an average genome coverage of 3 to 5 folds, with both strands represented. Gaps and ambiguous sequences were PCR-amplified using different primers and resequenced. DNA sequence fragments were assembled using the SEQMAN software from the Lasergene package (DNAStar; Madison, WI) into a single contig. The genome was annotated with an annotation protocol used for the HAdV-C1 genome analysis, 35 by first dividing the sequence into contiguous 1 kb non-overlapping segments.

Sequence analysis
Comparison of sequence differences between and spanning the genomes was performed with Genome Mutation Mapper (GMM; unpublished) software developed by the authors. GMM compares two or more genomes for nucleotide differences, noting SNPs and indels of query genomes relative to a reference; the data were confirmed with DNA Sequencher v5.1 (GeneCodes, Inc.; Ann Arbor, MI). Additionally, pairs of genome sequences were aligned using EMBOSS (http://www.ebi.ac. uk/Tools/emboss/) to show sequence identity, which provided a visualization of the alignment. Nucleic acid and amino acid sequence percent identities were calculated using software which was part of the EMBOSS package. Pair-wise comparisons of genomes were performed with the LAGAN (Limited Area Global Alignment of Nucleotides) program of mVISTA (http://genome.lbl.gov/vista/lagan/submit.shtml). 55 Phylogenetic analysis of select genes and whole genomes was performed using MEGA 5.1.0 (http://www.megasoftware.net/megamac.php), which accepts FASTA files for sequence alignment and uses a Maximum Composite Likelihood method that generated neighborjoining and bootstrapped phylogenetic trees with 1000 bootstrap replications; for phylogenetic analysis; all other parameters were set by default.

GenBank accession numbers
The genome sequences used for phylogenic analyses are summarized in Table 1, with additional genome typing details given only for the type 14 strains ('p' denotes prototype; if no further designation is noted, the strain is the prototype). Sequences used for these studies included the E1A and hexon genes, either deposited as single gene entries in GenBank or which were extracted from the genome files.

Nucleotide sequence analysis of HAdV-B14p1 strain GZ01
The genome of strain GZ01 was sequenced, assembled, annotated, and analyzed using computational methods. Figure 1 presents the genomic Genome type of strains GZ01, BJ430, and CHN2012 as HAdV-B14p1 Genome type, based on restriction enzyme analysis (REA), was useful in the past for characterizing strains in the absence of full genome sequence data. 59,61 It has limited use in the era of whole genome data, but may be useful in comparisons with strains reported in the literature with REA maps but are no longer available for futher analysis. The genome types of strains GZ01, BJ430, and CHN2012 were determined by in silico REA using the Vector NTI 10.3.0 software (Invitrogen Corp.; San Diego, CA, USA) and in comparison with the other HAdV-B14 strains 4,34 ( Figure 2). The REA profiles of GZ01, BJ430, and CHN2012 were consistent with and indistinguishable from that of HAdV-14p1 strain 303600, all which were confirmed to be genome type 14p1 according to the nomenclature published previously. 4,34 It is clear, however, that genome type analysis by REA patterns may be misleading or incomplete as all the three China isolates are highly divergent from HAdV-14p1, as demonstrated by the indel mutations analysis. Furthermore, to support the observation that genome type identification may be misleading, it should be noted that while the initial wet-bench REA data suggested 303600 was a novel HAdV-B14a strain, it was disproven by the whole genome analysis. 4,34 Inverted terminal repeat (ITR) ITRs contain sequences encoding critical genome replication functions. The HAdV-B14p1 GZ01 genome has a 137 bp ITR that is identical to the ITRs of the other HAdV-B14p1 strains (BJ430, CHN2012, and 303600), except for a C to G mutation at nt 134 of strain CHN2012 (Figure 3). It is nearly identical to the ITRs reported for the other subspecies B2 members. The first 64 bases are completely identical amongst the prototype subspecies B2 ITRs, which is different for divergent subspecies B1. The binding sites for the host transcription factors NF I and NF III were identified in the ITR. 62,63 These sequences have roles in enhancing virus replication and are necessary for efficient virus growth. 62,63 Transcriptionally-related Sp1 and ATF binding sites were also conserved in subspecies B2 genomes. All four HAdV-B14p1 genomes differ from HAdV-B14p at position 68, with a C present, a transition mutation from the original T; this is also found at the complementary 3'-end, a validation that it is not a sequencing error. This particular T is conserved across all of the subspecies B2 prototypes, including HAdV-B55, but not amongst all the prototypes of the other subspecies, that is, all subspecies B1 members contain a C at this position; this is a marker to distinguish B1 from B2, and also HAdV-  Phylogenetic analysis of select genes and the whole genomes Of the five recent China strains, only three were reported with whole genome data (GZ01, BJ430, and CHN2012). Phylogenetic analysis of the HAdV-B14 E1A, fiber, and hexon genes shows that the recent HAdV-B14p1 strains (strains GZ01, BJ430, CHN2012, 303600, 2971, and Dublin2009) are closely related to each other as well to the prototype and earlier reported HAdV-B14p genomes (1955 and 1974), as shown in Figure 4. The phylogenetic analysis of the whole genome sequences also confirms the sequence similarity with the prototype genome after approximately 50 years and across large geographical distances ( Figure 4).

Distribution of indel mutations across the genomes
A global visualization of the alignment of genome sequences of GZ01, BJ430, 303600, CHN2012, and de Wit prototype is presented in Figure 5. This was generated using a 'beta test version' software, Genome Mutation Mapper (GMM; unpublished) that notes SNPs and indels of query genomes relative to a reference, and confirmed using DNA Sequencher (GeneCodes, Inc.; Ann Arbor, MI). The sequence similarities amongst them indicates that although these HAdV-B14p1 viruses have a common ancestor (type 14 serotype), their insertion/ deletion (indel) patterns suggest two lineages. Indels and their subsequently inherited conserved patterns across genomes are key markers in determining lineages and for following molecular evolution, as reversion is highly unlikely. 64 Both strains GZ01 and BJ430 are highly divergent from the U.S. strain 303600, with respect to the indel    Table 2). Base substitutions are indicative of recent and/or essential mutations as reversion may occur, unlike indel mutations. 64 The whole genome sequence of strain GZ01 is very similar to that of strain BJ430 with respect to mutations. Six mutation differences are noted between these genomes. These SNPs include two non-synonymous substitutions in the DNA polymerase (K to N) and protein VI genes (R to K). Four synonymous substitutions are noted for the E1A 29.1 K and 25.7 K genes, the L3 VI gene, and the E2A DNA binding protein gene. One mutation is in a non-coding region (nt 225) and two deletions, three T (nt 10 658) and one A (nt 13 266), in poly (T) and poly (A) regions, are also noted ( Table 2). Among these mutations, a deletion of A at nt 29 481 in the BJ430 genome distinguishes it from GZ01 ( Figure 5). When compared with strain 303600, 16 base substitutions were found in 11 coding regions of GZ01, which resulted in three nonsynonymous substitutions in the E1A 6.5 K, E1B 54.9 K, and Protein VI coding regions, respectively. All of the indels occurred in noncoding regions, which led to length changes of the corresponding poly (A) or poly(T) sequences. Compared with the prototype de Wit strain, there were more mutations in strain GZ01, as would be expected given the time differences between their isolations: 93 base substitutions in the 31 genes, resulting in 35 non-synonymous substitutions. Although there were nine indels involving 25 nucleotides, only two of these resulted in non-synonymous substitutions: One in E1 29.1 K and 25.7 K coding regions (SV to I), with the other resulting the insertion of two amino acids KE in fiber knob domain, which is the most notable sequence difference between the HAdV-14p de Wit and all the other HAdV-14p1 strains ( Table 2). This was believed to have the potential of altering cell receptor binding and tissue tropisms, and hence pathogenicity, as the fiber knob recognizes the host cell receptor. 41 When compared with strain 303600, 11 base substitutions were found in strain CHN2012, which resulted in two non-synonymous substitutions in the E1B 54.9 K, E3 20.8 K coding regions, respectively. There are four synonymous substitutions in the E2B/L1 43 K, hexon, L4 100 K coding regions, respectively. The other mutations located at NCR, including one C to G mutation in ITR.

DISCUSSION
China's large, dense, and generally inaccessible population represents a unique environment for studying viral pathogens once they enter the population. The emergence, re-emergence, and transmission of a particular pathogen may be followed using high-resolution genome sequencing and, through computational methods, the molecular evolution and epidemiology of the pathogen may be revealed in great detail. According to the genome analysis of the three HAdV-B14 strains circulating in China, all three strains appear to be of separate, but related, lineages and may likely have been transmitted from abroad. Unlike the past, Beijing and Guangzhou are now 'open' to unrestrictive global travel and this may situate them to be the foci for the introduction of infectious disease agents to and from both overseas and other China provinces. Uniquely, Guangzhou has four additional direct long-standing connections to the international community. First, since the mid-1800s, many China emigrants to overseas destinations have originated from this region; therefore, there has 'always' been direct physical contact through homecoming visits. Second, Guangzhou has hosted several international events recently, bringing in an influx of visitors. For example, the Asian Games (Nov. 2010) was hosted in Guangzhou, bringing in more than 14 000 athletes along with a large number of foreign tourists and regional transient workers. Third, residents of other China provinces migrate to this prosperous region. Finally, Guangzhou is only 119 km from Hong Kong with the populations commingling; both populations are also in close contact with the global community, including Americans from both hemispheres, neighboring Asians, and Europeans, all of which have experienced respiratory diseases linked to the subspecies B2 HAdVs as well as other respiratory pathogens. For instance, from 2002 -   Two emergent human adenovirus type 14 isolates Q Zhang et al to 2003, the severe acute respiratory syndrome coronavirus (SARS-CoV) spread quickly from Guangzhou to Hong Kong and Beijing, and then across the country and globally, with high morbidity and mortality rates. Guangzhou, therefore, serves as an important focal point for infectious disease pathogen introduction, and surveillance is important in order to limit the public health impact on its population and elsewhere in China.
HAdV are important pathogens that appear to re-emerge after lengthy absences, either due to non-reporting because of diminished pathogenicity and infectivity or to perhaps latency or other mechanisms of cryptic infection. HAdV-B14, for example, re-emerged after approximately fifty years and HAdV-B7d re-emerged after twenty-one years. 56 The 2005 re-emergent HAdV-B14p1 strain was associated with several highly contagious and geographically wide-spread outbreaks of ARD, and included unexpected higher rates of fatalities. 11,47,65 These outbreaks occurred in both civilian (24 communities) and military (nine communities) populations in the United States (2005-2009) and Europe (2009-2011). 4,5,11,47,50,65 Parenthetically and retrospectively, the first case identified was in a child in California (2003) 4 and was presumably the originating point for the subsequent larger military base outbreaks. 65 In a limited retrospective survey of respiratory infectious disease agents from patients at a hospital in Guangzhou (2010 and 2011), HAdVs were found and characterized with respect to types. 54 One of these isolates, typed, deposited, and noted in GenBank as an emergent and previously unreported HAdV type 14 in China ('human adenovirus 14 isolate HAdV-B/CHN/GZ01/2010/14[P14H14F14'), was isolated from a throat swab of a 17-month-old child with acute suppurative tonsillitis (October 2010). Another simultaneously emergent HAdV-B14 strain BJ430 was isolated from a six-month-old infant diagnosed with bronchial pneumonia and hospitalized at the Beijing Children's Hospital (December 2010). This is unusual because this condition had not been previously described for infections attributed to either HAdV-B14 or HAdV-B55, with its similar genome and disease symptoms. 11,14,48,66 As HAdV-B14 has re-emerged globally recently, and since it is known as a highly contagious pathogen that has been associated with high hospitalization rates and fatality rates, 11,47,65 the emergence of this virus should set off an alarm that the proper surveillance of this virus is critical for large dense populations with naive immunity to HAdV-B14.
Comparative genomic analysis is leading to discovery of large numbers of novel molecular markers that are proving very helpful in understanding many important aspects of microbial phylogeny; 67,68 of these molecular markers, the conserved indel mutations provide particularly useful means for identifying different groups of microbes in clear molecular terms and for understanding when they have branched off from a common ancestor. 67,[69][70][71] As reported here, indel mutations analysis suggests two co-circulating lineages from at least two co-circulating 'prototype' strains in 1950s, one giving rise to the 303600 strain and the other giving rise to the China cluster of HAdV-B14. Before the genomics era, many important adenovirus strains were characterized by REA. REA is still useful for characterizing current isolates by providing a bridge between their accessible genome data and the REA data that are only available in the literature, as important references and as historical isolates are no longer available for genomic analysis. 72 This report emphasizes that REA data should be carefully considered, as it may be misleading and incorrect. Although the REA analysis confirmed strains GZ01, BJ430, and CHN2012 as belonging to the same genome type 14p1 as strain 303600, high resolution analysis of the genome sequences, including indels, indicate different lineages and introductions into China, along with the absence of fatalities associated with 303600. Again, based on the genome mutation patterns, especially the indels, the de Wit prototype appears to be very different from the 303600 and is similar to the BJ430, GZ01, and CHN2012 strains. These may represent several lineages co-circulating 'prototype 14' strains, with one reported in the original study and the rest accumulating characteristic indel patterns that have been reported in this study. Most of the indels occur in the poly(A) and poly(T) regions, which have been proposed as high-resolution molecular strain markers for characterizing HAdV-14p1. 71 The recent report that HAdV-B14 dispersed from U.S. to Asia in 2007 via military trainees 53 provides another clue that the emergent HAdV-B14p1 strains in China may have been transmitted from abroad. However, as shown in this report, the whole genome data of that strain is critical for the correct interpretation of that possible transmission pattern.
The whole-genome sequences of strains GZ01 and BJ430 are nearly identical with each other, with respect to point mutations. One indel separates them. Strain GZ01 caused acute suppurative tonsillitis, an upper respiratory disease; however, strain BJ430 was associated with bronchial pneumonia, a lower respiratory tract disease. The reason for this difference in pathogenicity may be due to the genomic differences. A major difference between the two genomes is the two nonsynonymous substitutions in the DNA polymerase (K to N) and protein VI genes (R to K), respectively ( Table 2). Of these two proteins, the 22-kDa cement protein VI is located beneath the peripentonal hexons in the viral capsid. It is identified as an endosomal membrane lytic factor, which is important for adenoviruses to overcome the barrier of the host cell membrane. 73 The nonsynonymous substitution of R to K in the protein VI may lead to the function of protein VI being enhanced or weakened, which can presumably further change the HAdV tissue tropism. This is awaiting wet-bench studies.
Given the current ease of travel and global interactions, these ARD outbreaks associated with HAdV-B14 provide insights into the distribution, lineage, and molecular evolution of this pathogen. There had been no reports of HAdV-B14 in China prior to and during the period noted for HAdV-B14 re-emergence and circulation (2005)(2006)(2007)(2008)(2009)). This report presents the bioinformatics analysis of both HAdV-B14p1 strains in high resolution and detail, and provides putative lineages for these two surprisingly and intriguingly divergent, but, highly similar genomes linking Guangzhou and Beijing (2294 kilometers apart). A comparison with the re-emergent strain isolated in the U.S. (2007) as well as with the less similar but still remarkably conserved genome of the prototype virus from The Netherlands (1955) is also presented. The presentation of high resolution nucleotide sequence data and a map of mutations, particularly of the indels, provide detailed insight into two contemporaneously circulating human pathogens with divergent and parallel lineages. Both are noted as genome type HAdV-B14p1 via their low resolution but conveniently available REA patterns. The computational data presented in this report again demonstrates highly conserved but divergent HAdV-B14 genomes, which supports further studies of the epidemiology and molecular evolution of these related pathogens, and which, in turn, will provide a foundation for developing effective vaccines and public health strategies, including nationwide surveillance. of this manuscript were completed at the Department of Ophthalmology, Howe Laboratory, Massachusetts Eye and Ear Infirmary, Harvard Medical School (Boston, Massachusetts, USA) as Q.Z. was funded by the China Scholarship Council (CSC No. 201508440056) as a Visiting Scholar (2015-2016); he thanks James Chodosh and Jaya Rajaiya for providing a stimulating intellectual environment. This project was additionally supported by a summer research grant (2016) to D.S. from the Office of the Vice President for Research at George Mason University. The Funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.