Vector-borne viruses and their detection by viral metagenomics

ABSTRACT Arthropods, such as mosquitoes and ticks, are important vectors for different viruses (so-called vector-borne viruses), some of which cause a significant number of human and animal deaths every year and affect public health worldwide. Dengue virus, yellow fever virus, chikungunya virus, Japanese encephalitis virus, tick-borne encephalitis virus and Zika virus are just a few examples of important vector-borne viruses. The majority of vector-borne viruses have RNA genomes, which routinely undergo genetic modifications. These genomic changes, together with environmental factors, can influence the spread of viruses to new habitats and hosts and lead to the emergence of novel viruses, which may become a threat to public health. Therefore, it is important to investigate the viruses circulating in arthropod vectors to understand their diversity, host range and evolutionary history as well as to predict new emerging pathogens. The choice of detection method is important, as most methods can only detect viruses that have been previously well described. Viral metagenomics is a useful tool to simultaneously identify all the viruses present in a sample, including novel viruses. This review describes vector-borne viruses, their maintenance and emergence in nature, and their detection using viral metagenomics.


Introduction
Vector-borne diseases are an increasing problem worldwide for both human and animal populations. In fact, the World Health Organisation (WHO) has estimated that vector-borne diseases constitute over 17% of all human infectious diseases and cause more than 1 million deaths yearly [1]. These diseases are caused by different parasites, bacteria and viruses spread between hosts (humans and/or animals) by so-called vectors. The vectors are often blood-sucking insects that transmit infectious agents when taking blood meals. Viruses transmitted by arthropod vectors are often referred to as arthropod-borne viruses (arboviruses), defined as viruses that are maintained in nature through biological transmission between susceptible vertebrate hosts by haematophagous arthropods.
Mosquitoes are considered one of the primary vectors for infectious agents, with others including ticks, biting midges, sand-flies and flies. In humans, several mosquito-borne epidemics have been reported across the globe, including the emergence of dengue fever caused by dengue virus (DENV) serotypes 1-4 transmitted by Aedes aegypti, which is also an important vector for other disease-causing viruses including yellow fever virus (YFV), chikungunya virus (CHIKV), and Zika virus (ZIKV) [2]. YFV was controlled by mosquito abatement techniques in the Americas, but it remains a constant threat of re-emergence in new areas through Aedes mosquitoes. ZIKV transmission occurs primarily through bites of infected Aedes mosquitoes and has caused major disease outbreaks in humans [3][4][5]. West Nile virus (WNV) was introduced into the Americas in 1999 and was probably a derivative of an Israeli WNV strain. After the introduction, WNV rapidly spread across the United States [6]. Similarly, CHIKV was introduced into Asia from Africa in the mid-2000s before subsequently spreading to the Caribbean region in 2013 [7]. Rift Valley fever virus (RVFV), another medically important virus in livestock, has been identified in different mosquito species (Aedes, Culex, Anopheles, etc.) [8,9] and can be transmitted to humans from infected animals. Besides mosquitoes, several other arthropods have been identified as vectors for different pathogenic viruses. For example, adult midges transmit bluetongue virus (BTV) [10], sand-flies are able to transmit Toscana virus, and ticks can carry and transmit tick-borne encephalitis virus (TBEV) and Crimean-Congo haemorrhagic fever virus (CCHFV) [11,12]. Many more vector-borne viruses that affect public health are expected to be identified.
In addition to human and animal diseases caused by vector-borne viruses, these viruses can also affect invertebrate health, including that of honey bees, causing serious damage to food crops that results in huge economic losses for the agricultural industry [13]. Dicistroviruses (Acute bee paralysis virus, Israeli acute bee paralysis virus, Kashmir bee paralysis virus, etc.), iflaviruses (Deformed wing virus (DWV), Kakugo virus, Varroa destructor virus-1 (VDV-1), Sacbrood virus (SBV), etc.), and other groups of viruses have been reported as pathogens that infect honey bee populations in different geographical locations [14].
As shown by the examples above, vector-borne viruses belong to a wide variety of viral families, including Flaviviridae, Phenuiviridae, Reoviridae, Togaviridae, Rhabdoviridae, Orthomyxoviridae, Asfaviridae and Poxviridae (Figure 1). The majority of zoonotic arboviruses belong to the families Flaviviridae and Togaviridae [15,16], and other important arboviruses belong to the family Phenuiviridae, e.g. CCHFV and RVFV [17,18]. Colorado tick fever virus is another important arbovirus that belongs to the family Reoviridae and infects humans [19]. In this review, we go through the general characteristics of vector-borne viruses and describe how high-throughput sequencing can be used not only to characterize the virome of different vectors but also to discover novel viruses.

Life cycle and emergence of vector-borne viruses
Vector-borne viruses are maintained in the environment by a complex life cycle that includes a primary invertebrate host as well as a vertebrate host. Transmission of viruses may be influenced by several factors, such as the host susceptibility for the virus, the preference of the vectors for the host and the vector competence for a particular virus [20,21].
The majority of arboviruses are maintained through an enzootic cycle (sylvatic cycle), where birds, rodents or non-human primates serve as reservoir hosts (Figure 2) and virus transmission occurs by primary insect vectors. At the onset of viral infection, the virus replicates in the vertebrate host to high titres and induces viremia. Upon feeding on this host, an uninfected vector becomes infected and, after an extrinsic incubation period during which the virus moves to the salivary glands and replicates to high levels, the vector is able to transmit the virus to the next host through its saliva as it takes a new blood meal. Viruses may also be transmitted between vectors and domestic animals, such as pigs and equines (epizootic/rural cycle), as well as to humans (epidemic/urban cycle) [22]. Spillover events from the sylvatic cycle, for example through the movement of humans into sylvatic habitats, can trigger the emergence of disease outbreaks in humans and domestic animals. If the human or animal does not develop sufficient viremia, it is considered a dead-end host (e.g. horses and humans in the case of WNV), as the amplification of the virus is insufficient for arthropod vectors to become infected and transmit the virus further [23]. Some arboviruses, such as DENV, CHIKV and YFV, have altered their host range from non-human primates to humans, in whom they amplify and can be transmitted to the next person by mosquitoes, leading to outbreaks without the use of an animal reservoir [24].
During the past two decades, the incidence of vector-borne viruses has been expanding geographically. It has been estimated that approximately 50% of the world's population is currently affected by at least one type of vector-borne pathogen. The diseases caused by these pathogens constitute 30% of all emerging infectious diseases (EIDs) [25]. A combination of socio-economic, environmental and ecological factors has contributed to the emergence of novel viruses, including expanding human population densities, deforestation, climate change, scattering of livestock, livestock-wildlife contacts and viral adaptation to new host species [26,27]. Finally, globalisation, together with the complex web of factors mentioned above, facilitates the spread of viruses to new geographical locations, contributing to the emergence or re-emergence of vector-borne viruses [28].

Genetic diversity of vector-borne viruses
Vector-borne viruses comprise a genetically diverse group of viruses that differ in the structure, composition and organisation of their genomes. This diversity is generally not only evident between viral families but also between individual viral species, which can have distinctive molecular mechanisms for replication, transmission, pathogenesis and evolution [29]. The majority of vector-borne viruses contain RNA as their genetic material (Figure 1).
Apart from ecological factors, certain genetic factors influence the diversity and emergence of vector-borne viruses, including: (i) the lack of proofreading activity and repair mechanisms of the RNA-dependent RNA polymerase (RdRP), resulting in the generation of random insertions, deletions and substitutions (point mutations) and new viral variants [30]; (ii) the exchange of long stretches of genomic sequences between closely related viruses (genetic recombination), e.g. the Western equine encephalitis virus is a product of recombination between the Eastern equine encephalitis virus and a Sindbis-like virus [31,32], and in vitro studies have also shown the potential recombination within chikungunya virus species [33]; and (iii) the exchange of genome segments between segmented viruses during co-infections (genetic reassortment) that generates new genetic combinations, e.g. Thogoto virus [34], Bluetongue virus (BTV) and Schmallenberg virus, the latter of which may be the result of a reassortment between Sathuperi and Shamonda viruses [35]. Co-circulation or simultaneous infections of different BTV serotypes can potentially generate novel reassortant viruses [36,37].
As mentioned previously, arboviruses must be able to infect both invertebrate and vertebrate hosts to replicate and maintain their life cycle in nature; as a result, these viruses often diversify and evolve. The resulting variants may have the ability to alter the viral infection rate. For example, a single mutation in the envelope glycoprotein E1 enhanced CHIKV transmission by Aedes albopictus mosquitoes, i.e. it increased the vector competence of Ae. albopictus [38], and additional sequential mutations in CHIKV E2 increased the infection of Ae. albopictus [39]. Viral emergence can also be significantly influenced by viral intra-host evolution. For example, viral sequences containing mutations may not be recognised by the RNA interference defence system (RNAi, the primary antiviral defence mechanism in mosquitoes) [40,41]. Because of the high genetic diversity of arboviruses, improved molecular methods may be required to detect novel viruses as well as to characterise the viral populations, viral variants or quasispecies in different arthropod vectors.

Detection of vector-borne viruses by traditional approaches
The vast diversity of vector-borne viruses present in nature makes their discovery and classification challenging and may require a combination of methods. In general, the primary focus of most studies has been the detection of medically important pathogenic viruses, such as DENV, WNV, TBEV and CHIKV, rather than other insect-borne viruses, as many of the latter cause asymptomatic infections in the vertebrate host. The choice of detection method is based on the known characteristics that are specific to each virus, such as incubation period, viremia pattern and antibody response. The identification of infection by antibody-based serological methods is typically used at the onset of illness or weeks after the development of symptoms [42,43]. The classical serological methods include haemagglutination inhibition and complement fixation, while the most frequently used are enzyme-linked immunosorbent assays (ELISA) and immunofluorescence assays. Direct detection methods that are currently available include virus isolation, electron microscopy, molecular methods and viral antigen detection methods. Virus isolation has long been a gold standard method [44]. However, virus isolation and electron microscopy are laborious processes: viral cultivation takes a long time, is sometimes not possible, and requires special laboratory facilities.
Molecular detection primarily includes nucleic acid-based amplification methods, including polymerase chain reaction (PCR)-based methods and, specifically, reverse transcriptase (RT)-PCR-based assays, as most vector-borne viruses are RNA viruses [45]. These methods offer a means of rapid viral detection during the viremic phase and are highly sensitive [46,47]. However, some viruses, such as WNV, produce low and short-lived viremias, which makes them difficult to detect [48]. In addition to PCR, standard molecular methods, such as nucleic acid hybridisation methods and microarrays, have also been used as detection assays. All these methods rely on prior knowledge of the viral sequences present in the sample and are commonly species specific. Thus, the detection of a virus is sometimes not possible if it is not known which virus or viruses reside within a sample.

Viral metagenomics
Viral metagenomics is the study of the collective viral genomes from primary samples, e.g. environmental samples, clinical material from humans and animals, and insect tissue. This newly developed, culture- and sequence-independent method has been able to detect viruses behind diseases of unknown aetiology and allows the characterisation of the complete viral population in a given sample [49]. The workflow of viral metagenomics often includes the following steps: sample preparation, sequence-independent amplification, high-throughput sequencing, bioinformatics and follow-up studies, if necessary (Figure 3) [50].

Sample preparation and amplification
Sample preparation can include a combination of different methods that are used to enrich the virome in the sample, including filtration, ultracentrifugation, nuclease treatment and the removal of ribosomal RNA. This is an important step, as the amount of viral nucleic acid in a sample is much lower than that of host nucleic acid [51]. Amplification of nucleic acids can be performed by different methods, including sequence-independent, single-primer amplification (SISPA), which is based on the ligation of adapters to nucleic acids [52]. SISPA has been combined with random PCR and nuclease treatment steps [53][54][55][56] to amplify divergent viral sequences present in the sample. Other amplification methods that have been used are random PCR (rPCR) [57], linker-amplified shotgun library (LASL) [58], single-primer isothermal amplification (SPIA) [59] and multiple displacement amplification (MDA), the latter of which uses a strand-displacing DNA polymerase, e.g. the phi29 DNA polymerase [60]. Although these methods have been successfully used to amplify metagenomes, they have some limitations, such as an incomplete retrieval of viral genomes, an amplification bias towards the 3ʹ end of the genomes and a biased distribution of sequencing depth [59,61,62].

High-throughput sequencing
A combination of Sanger sequencing and advanced fluorescent detection methods led to the development of next generation sequencing (NGS), often referred to as second-generation sequencing. The first high-throughput sequencing (HTS) platform, 454 pyrosequencing by 454 Life Sciences (acquired by Roche in 2007 and later shut down in 2013), was introduced in 2005. Several HTS platforms have been developed over the years that differ in read length, type of sequencing, run time and throughput capacity [63]. The cost of sequencing for each reaction has been significantly reduced in recent years, and sequencing machines are able to generate massive sequence outputs, up to 1500 Gb. The Illumina method is based on a paired-end read chemistry and has numerous platforms (HiSeq, MiSeq, and NextSeq), each with different read lengths and run times while producing high-throughput data. Ion Torrent (from Life Technologies) runs as a single-read platform and was the first semiconductor-based platform; it could generate up to 1 Gb of data, with longer read lengths of up to 400 bases. The newer versions, Ion Proton and Ion S5, can generate up to 15 Gb of data with varying read lengths. The latest HTS platforms from Pacific Biosciences and Oxford Nanopore have been developed to generate longer sequences of up to 200 kb. The choice of sequencing platform depends on the application, and each platform has its strengths and weaknesses. The benchtop instruments developed by Illumina and Ion Torrent have been widely used in various insect virome sequencing projects [64][65][66][67][68]. Table 1 summarises the HTS platforms available and their sequencing features.

Bioinformatics
Bioinformatics is the application of tools and computational analyses to understand and interpret biological data. Bioinformatics is an interdisciplinary field that has been widely applied in modern biology and medicine for data management [69]. The analysis of massive sequencing data generated from HTS platforms typically includes quality checking, assembly and taxonomic classification of reads and/or contigs produced by the assembly (Figure 4). Quality checking involves trimming of sequences according to Phred quality scores, which are related to base calling error probabilities [70]. It also includes identifying and removing sequence duplicates that are produced by the HTS platform as a result of PCR amplification, PCR errors or sequencing errors. This step is necessary to reduce the computational time, to accurately estimate species abundance and to improve the assembly. All these quality filtering conditions can be specified depending on the downstream analyses required [71,72]. Moreover, sequences that are not a target of the study can be filtered out to eliminate misassemblies and to speed up the analysis. For example, host sequences can be removed from a sample if the target sequences are viral reads [73]. Possible contaminating sequences or sequences that are not relevant can also be removed by aligning the reads against reference sequences, which can be performed using several short-read alignment tools such as BWA, SOAP2 and Bowtie2 [74][75][76].
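The quality-checking step described above can be illustrated with a minimal Python sketch. It assumes the standard Phred relationship between a quality score Q and the base-calling error probability, P = 10^(-Q/10), and FASTQ quality strings encoded with an ASCII offset of 33. The function names, the threshold of Q20 and the exact-match definition of a duplicate are illustrative assumptions, not the behaviour of any particular tool cited in this review.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_qualities(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    Assumption: Phred+33 encoding; other encodings need a different offset.
    """
    return [ord(char) - offset for char in quality_string]


def trim_3prime(sequence, qualities, min_q=20):
    """Remove bases from the 3' end until a base with Q >= min_q is reached."""
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_q:
        end -= 1
    return sequence[:end], qualities[:end]


def remove_exact_duplicates(reads):
    """Keep the first copy of each read sequence, preserving input order.

    Assumption: duplicates are exact sequence matches; real tools may also
    tolerate mismatches caused by PCR or sequencing errors.
    """
    seen = set()
    unique = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            unique.append(read)
    return unique


if __name__ == "__main__":
    quals = decode_qualities("IIII##")          # 'I' is Q40, '#' is Q2
    print(trim_3prime("ACGTAC", quals))         # low-quality tail is removed
    print(remove_exact_duplicates(["ACGT", "ACGT", "TTGA"]))
```

In practice, trimming and duplicate removal are performed by dedicated programs, and the thresholds are chosen according to the downstream analysis.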
The assembly of shorter sequences that have matching overlaps generates longer sequences called contiguous sequences (contigs), a method referred to as de novo assembly. These contigs can be further extended by merging shorter contigs. There are two primary types of de novo assembly programmes: Overlap/Layout/Consensus assemblers (e.g. MIRA, Celera, and VICUNA), which are widely used for longer reads [77], and de Bruijn graph assemblers (e.g. Velvet, SOAPdenovo, SPAdes) [78][79][80]. However, the assembly process might generate 'chimeric' sequences in which sequences from different organisms or species are joined together. This may be a problem in viral metagenomic studies, as the biological sample may contain closely related viral sequences [81].
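The idea behind de Bruijn graph assembly can be sketched with a toy Python example. Reads are broken into k-mers, each k-mer links its (k-1)-base prefix to its (k-1)-base suffix, and a contig is extended for as long as the path through the graph is unambiguous. This is a simplified illustration with hypothetical function names; it ignores sequencing errors, reverse complements and coverage, all of which real assemblers such as those cited above have to handle.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Link each (k-1)-mer prefix to the (k-1)-mer suffixes observed in the reads."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Extend a contig from `start` while each node has exactly one successor.

    The walk stops at a dead end, at a branch (more than one successor)
    or when it would revisit a node (a cycle).
    """
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in visited:
            break
        contig += next_node[-1]
        visited.add(next_node)
        node = next_node
    return contig


if __name__ == "__main__":
    reads = ["ATGGCG", "GGCGTA", "CGTACC"]
    print(extend_contig(build_de_bruijn(reads, k=4), "ATG"))

    # A read from a closely related variant creates a branch in the graph,
    # so the unambiguous contig ends where the two variants diverge.
    mixed = reads + ["GGCGAA"]
    print(extend_contig(build_de_bruijn(mixed, k=4), "ATG"))
```

The second call shows, in miniature, why samples containing closely related viruses yield fragmented assemblies: an assembler must either stop at the branch or choose one path, and choosing the wrong one produces a chimeric contig.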
Taxonomic classification is the final step in the metagenomic analysis, where each sequence is assigned to a taxonomic group. The most commonly used similarity-based classification is Basic Local Alignment Search Tool (BLAST) [82], where the sequences are compared to known genomes. Different versions of BLAST can be used, such as BLASTx and tBLASTx [83,84]. Considering the time span for sequence classification, several different tools have been developed that can reduce the time required from weeks to days, e.g. RAPsearch2, Diamond, Kaiju and Kraken [85][86][87][88].
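As a rough illustration of similarity-based classification, the following Python sketch assigns a read to the reference sequence with which it shares the most k-mers and reports it as unclassified when too few k-mers match. This is a deliberately simplified stand-in and not the algorithm used by BLAST, Kraken or any other tool named above; the reference sequences, labels, k-mer size and threshold are all made-up assumptions.

```python
def kmers(sequence, k):
    """Return the set of all k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def classify(read, references, k=5, min_shared=2):
    """Assign a read to the reference sharing the most k-mers with it.

    `references` maps a label to a reference sequence (hypothetical data).
    Reads sharing fewer than `min_shared` k-mers with every reference
    are reported as "unclassified".
    """
    read_kmers = kmers(read, k)
    best_label, best_shared = "unclassified", 0
    for label, reference in references.items():
        shared = len(read_kmers & kmers(reference, k))
        if shared > best_shared:
            best_label, best_shared = label, shared
    return best_label if best_shared >= min_shared else "unclassified"


if __name__ == "__main__":
    # Toy reference database; the sequences are invented for illustration.
    references = {
        "virus_A": "ATGGCGTACCTTAGC",
        "virus_B": "TTTTCCCCGGGGAAAA",
    }
    print(classify("GCGTACCTT", references))   # matches virus_A
    print(classify("ACACACACAC", references))  # no related reference
```

The second read shows the limitation of any similarity-based approach: a sequence with no related entry in the database cannot be assigned, however abundant it is in the sample.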
Viral metagenomics provides basic information on which viruses are present in a sample. More extensive analysis or follow-up studies are necessary to understand the roles of the identified virus or viruses. These analyses may include obtaining full-length viral genomes by the primer walking approach, RACE analysis, virus isolation, viral characterisation and the development of diagnostic assays, depending on the objective of the study [50,89].

Bioinformatics challenges in analysing viral metagenomes
Despite the development of advanced computational tools to analyse viral sequences from mixed samples, several bottlenecks restrict effective data analysis. For example, the tools may require expertise and computational resources that are not available to all users. Building longer contigs by assembling shorter reads can be problematic because of the high viral diversity in the sample. Reads from closely related viral genomes can be mapped to a supplied reference genome through reference-based assembly, which is computationally efficient; however, divergent viral genomes cannot be aligned by this approach. The alternative, de novo assembly, which is used for the reconstruction of full-length genomes, may generate ambiguous or chimeric sequences due to mutations and recombination of closely related viruses [90]. The de novo assembly process also generates complex assembly graphs and fragmented assemblies and is computationally demanding. One of the most important issues in metagenomic studies is sequence classification, which mainly depends on the similarity between the query sequence and the annotated genomes in the database. Classification programs based on nucleotide alignments are sometimes not sensitive enough to detect divergent viral sequences, while protein searches may be slower and require powerful computers or highly optimized tools (e.g. Diamond and Kaiju). Customized databases, such as those containing only viral genomes, can be used for similarity searches; however, this may result in the misclassification of sequences. The limited representation of viral sequences in curated sequence databases is another challenge in classifying novel viruses, as reads that originate from viruses may remain unclassified because related viral sequences are not present in the database [91].

Implications of invertebrate viromes in human and animal public health
With the use of metagenomics and transcriptomics, a broad range of unknown and highly divergent RNA viruses have been discovered in different invertebrate species. For example, a metatranscriptomics analysis of 220 invertebrate species resulted in the discovery of 1445 RNA viruses, including probable new viral families [92]. In another study, 112 novel RNA viruses were reported from 70 arthropod species [93]. These studies show that invertebrates harbour RNA viruses with greater genetic diversity than previously expected, and that some of the identified viruses are likely to be ancestors of major viral groups, including those that infect vertebrates. Thus, analysing the biodiversity of invertebrate viromes may have important implications for our understanding of virus evolution, ecology and emergence. A further advantage of this approach is that it detects not only arboviruses but also viruses that are restricted to insect hosts (insect-specific viruses, ISVs). These viruses are interesting because evolutionary analyses show that they are related to arboviruses, and that some ISVs may have been ancestral to pathogenic arboviruses [94,95]. In addition, different studies have reported that some of these ISVs are able to reduce the replication of certain arboviruses following preinfection or coinfection [96][97][98].
Thus, the complex pattern of multiple viruses in the same sample, many of them completely new, poses a considerable challenge for the scientific community, which must bring order to these findings before their role in disease and health can be understood. For example, what is their function in the host, and do they cooperate in some way? A practical issue to solve before these questions can be answered is to isolate the viruses individually as virus particles, because they must be cultivated in order to study their respective biological properties. From previous experience, we know that many viruses are impossible, or very difficult, to cultivate in cell culture. In addition, how can we ensure that they are free from other viruses once they have been cultured? To address this difficult task, we may initially use a classical virus cultivation strategy with various cell types followed by plaque purification. It is, however, likely that many viruses cannot be cultivated and purified in this way. An alternative approach could be to construct full-length infectious clones of the discovered viruses by synthesising their genomes, inserting them into suitable vectors, transfecting these into cells and then recovering the viruses. This approach has been used successfully with a variety of different virus families [99,100]. The very large DNA viruses may, however, require another approach in which the genome is amplified and subcloned in parts that are then assembled into full-length genomes in the vector. With this strategy, it would be possible to recover and isolate individual viruses and then study their properties, such as which receptors they use, their tropism, their replication strategies, how they evade innate immunity and what type of cytopathic effect (CPE), if any, is typical for them. This knowledge is important in order to study which animals they can infect and their role in disease in animals and humans, including their mode of action.