Literature DB >> 34069049

An Unsupervised Algorithm for Host Identification in Flaviviruses.

Phuoc Truong Nguyen1,2, Santiago Garcia-Vallvé3, Pere Puigbò1,4,5.   

Abstract

Early characterization of emerging viruses is essential to control their spread, such as the Zika Virus outbreak in 2014. Among other non-viral factors, host information is essential for the surveillance and control of virus spread. Flaviviruses (genus Flavivirus), akin to other viruses, are modulated by high mutation rates and selective forces to adapt their codon usage to that of their hosts. However, a major challenge is the identification of potential hosts for novel viruses. Usually, potential hosts of emerging zoonotic viruses are identified after several confirmed cases. This is inefficient for deterring future outbreaks. In this paper, we introduce an algorithm to identify the host range of a virus from its raw genome sequences. The proposed strategy relies on comparing codon usage frequencies across viruses and hosts, by means of a normalized Codon Adaptation Index (CAI). We have tested our algorithm on 94 flaviviruses and 16 potential hosts. This novel method is able to distinguish between arthropod and vertebrate hosts for several flaviviruses with high values of accuracy (virus group 91.9% and host type 86.1%) and specificity (virus group 94.9% and host type 79.6%), in comparison to empirical observations. Overall, this algorithm may be useful as a complementary tool to current phylogenetic methods in monitoring current and future viral outbreaks by understanding host-virus relationships.

Entities:  

Keywords:  algorithm; codon adaptation index; flavivirus; host identification

Year:  2021        PMID: 34069049      PMCID: PMC8157105          DOI: 10.3390/life11050442

Source DB:  PubMed          Journal:  Life (Basel)        ISSN: 2075-1729


1. Introduction

Recent viral pandemics have shown that rapid characterization of the virus is essential during the development of an outbreak [1,2,3,4]. Among other factors, host information is essential for surveillance and control of virus spread. However, emerging viruses are fully characterized only after several confirmed cases occur; this is an inefficient method of deterring current and future outbreaks [5]. Fast and reliable computational biology methods are needed to develop antiviral treatments, to improve medical diagnoses and to efficiently contain viral outbreaks [6]. Viral genomes are modulated by high mutation rates [7] and by selective forces to adapt their codon usage to that of their hosts, especially when the viruses can infect a wide host range, as is the case for flaviviruses [8]. Previous methods identify flavivirus host range based on an analysis of dinucleotides [9,10] based on the idea that a virus that infects multiple hosts has a weaker dinucleotide bias [11]. In this article, we introduce an unsupervised algorithm to identify putative virus host ranges based on only genome sequence information. The proposed methodology has been tested in 94 viruses of genus Flavivirus and 16 potential hosts. Several flaviviruses are major human pathogens, with potential host ranges from vertebrates to arthropods [12]. Flaviviruses are classified by vector type into mosquito-borne (MBFV), tick-borne (TBFV), insect-only (IOFV) and unknown vector (UVFV) [13] flaviviruses. In MBFVs, there exists a paraphyletic subgroup of mosquito-specific viruses [14], also known as dual-host insect-only flaviviruses (dhIOFVs). However, certain annotation ambiguities exist; e.g., the Ecuador Paraíso Escondido Virus (EPEV) is defined as MBFV based on phylogeny, but may also be classified as dhIOFV [15]. Flaviviruses with the same host type tend to be monophyletic and are subject to the same selective pressures as the host; this situation is reflected in their codon usage and dinucleotide composition [16]. The most widespread and prevalent flaviviruses include Dengue virus (DENV), West Nile virus (WNV), Japanese encephalitis virus (JEV), and Zika virus (ZKV) [17]. Several articles suggest that highly similar codon usage frequencies between viruses and hosts are indicative of a high virus–host adaptation level [18]. Thus, the codon adaptation index (CAI) [19] may be a robust indicator for determining putative hosts. Here, we use a normalized CAI (nCAI) and a correspondence analysis (CA) to compare codon usage frequencies across virus and host sequences (see Materials and Methods section). Therefore, the nCAI-CA algorithm provides a fast and reliable method of identifying the putative host range of a virus. This method requires only coding sequences (CDSs) without prior knowledge, and can be implemented with minimal computational equipment. In addition, we have developed an easy-to-use web server, available at http://ppuigbo.me/programs/CAIcal/nCAI (accessed on 13 May 2021), to calculate nCAI values.

2. Materials and Methods

The optimal host identification algorithm (Figure 1) consists of two phases. In the first phase, the algorithm computes the required codon usage tables through two subroutines: one for the host and the other for the virus. These tables, along with complete genomic CDSs, are then used as the input data for CAIcal [20]. This produces CAI data between the virus and host (CAIh) using virus CDSs, and host codon usage tables and CAI data for the virus itself (CAIs) using virus CDSs and virus codon usage tables. In the second phase, the CAIh values are normalized by dividing each by its respective CAIs as in Equation (1):
Figure 1

Scheme of the algorithm used to calculate the normalized codon adaptation index (nCAI). (a) Pipeline to identify putative hosts based on nCAI values. The complete coding sequences of hosts and viruses are used to compute nCAI values, which are put into a table. These values are then subjected to correspondence analysis to identify optimal hosts and, thus, the likelihood of a virus infecting an organism. (b) Algorithm to calculate nCAI. The CAI values for possible hosts and viruses of interest are computed from the complete coding sequences (CDSs) and the codon usage tables, which are calculated from the same sequences. The CAI values of the host (CAIh) are calculated from virus CDSs and host codon usages, and the CAI values of the viruses (CAIs) are computed using virus CDSs and the codon usage values of the viruses themselves. The resulting CAI values are then normalized by dividing each CAIh by its respective CAIs.

This yields the normalized CAI (nCAI) value, from which the optimal and likely hosts can be inferred depending on how similar the codon usage of a virus is to the codon usage of its host organisms. The nCAI values range between −∞ and +∞, and the optimal value is 1.0, indicating identical codon usage between the virus and host and therefore perfect adaptation to the host. Values above and below 1.0 would indicate over- and underoptimization, respectively, and thus suboptimal adaptation to a host. The nCAI calculations can be performed with the CAIcal tool in a dedicated web server, written in PHP, that works on any web browser (http://ppuigbo.me/programs/CAIcal/nCAI, accessed on 13 May 2021). The server requires two sets of inputs: complete DNA or RNA CDSs of the viruses of interest in FASTA format and the codon usage tables of the potential host animals in the format used by the Codon Usage Database [21]. CAIcal will then output the results in a tab-delimited table with the following values: name of the query sequence (Name); CAI of the virus to a host (CAIh); CAI of the virus to itself (CAIs); normalized CAI, calculated by dividing CAIh by CAIs (nCAI); length of the query sequence (Length); overall %GC; and GC content at the first, second or third nucleotide of each codon (%GC1–3). Available flavivirus CDSs and their respective protein sequences were obtained from the RefSeq [22] and GenBank [23] databases. The viruses (n = 94) were chosen according to phylogenetic studies [24,25,26] and current ICTV classifications [13] (Supplementary Table S1). In this study, a vector is defined as an organism capable of transmitting a virus to another type of organism. This definition does not take into account whether the virus is virulent within a vector, i.e., there is no differentiation between a vector and a vector-host. A host, on the other hand, is an organism in which the virus primarily replicates, and it does not directly transmit the virus to another organism of the same type. The host organisms (N = 16) for this study were chosen based on information primarily provided by the Virus–Host Database [27], which includes representative arthropod (mosquitoes and tick) and vertebrate (mammals, birds, reptiles and amphibians) host species. Additionally, a more comprehensive list of hosts and vectors for each flavivirus is included in Supplementary Table S1. This table includes only confirmed cases of viruses sequenced from an organism, or cases in which viruses have successfully infected the cells of a host in a laboratory experiment. It is important to note that not all host animals listed in the database are primary hosts, as they might have acquired the viruses through happenstance. We computed a codon usage reference table for 16 putative hosts representing all possible flavivirus host types among vertebrates (mammals, birds, reptiles, and amphibians) and arthropods (mosquitoes and ticks). We analyzed genomes that contained over 10,000 CDSs to reflect actual codon usage frequencies, as well as those of Gallus gallus (6017) and Sus scrofa (2953). The %GC and relative synonymous codon usage (RSCU) values were calculated from the CDSs of the flaviviruses. The RSCU describes the preference bias for a codon to be used to encode an amino acid [28]. This can be calculated by dividing the observed number of a codon by the expected frequency of the same codon, assuming that individual codons for amino acids were used at equal frequency [29]. CA was performed for two different types of nCAI datasets. The first analysis included all known flaviviruses, and the second included separate datasets, containing only the values for DENV, JEV, WNV and ZKV. The correspondence analyses were performed with the “ca” package (version 0.70) and then plotted with the “ggplot2” package (version 2.2.1) in R (version 3.4.4). The nCAI values of all MBFVs were plotted in a heat map with the “pheatmap” package (version 1.0.10) in R. The virus phylogenetic tree was computed with the following steps: first, the amino acid sequences of each genome were aligned using MUSCLE [30]. Tree construction was performed with FastTree [31]. Host trees were built based on NCBI taxonomy [32]. Each of the viruses and host organisms were sorted to match their respective phylogenies. For the DENV, JEV, WNV and ZKV genomes, their results were clustered based on k-means (5) in the heat map (Supplementary Figure S9). The clustering of each subgroup was performed and visualized by computing centroids based on the multivariate normal distribution of each subgroup with a confidence level of 0.95. This was achieved with the “ggplot2” package (version 2.2.1) in R (version 3.4.4). The virus subgroups included MBFV, TBFV, IOFV, UVFV and dhIOFV, and the host type subgroups were vertebrates, mosquitoes, and ticks.

3. Results

First, to assess whether a codon usage methodology could distinguish subgroups within a viral species, we performed nCAI-CA analysis for all the available CDSs of DENV, WNV, JEV, and ZKV, which numbered 4865, 297, 1619 and 494, respectively. Each viral subgroup formed a distinct cluster based on relative synonymous codon usage (Supplementary Figures S3 and S4) and GC content (%GC) (Supplementary Figures S5–S7). In addition, we determined the interspecies and intraspecies variability of the RSCU and %GC in the DENV, JEV, WNV and ZKV genomes. The results show that the RSCU values could differentiate viral subgroups within species and that their distances mostly reflected the evolutionary histories of the viruses (Supplementary Figure S5). The %GC was not a discriminating factor at the intraspecific level (Supplementary Figure S6). At the interspecies level, the clustering patterns based on the RSCU were only slightly more similar to the evolutionary histories of the viruses than the %GC (Supplementary Figure S7). Next, we used the nCAI-CA algorithm (Figure 1) to identify the optimal hosts of 94 flaviviruses, based on only complete CDSs and codon usage tables from 16 potential hosts (vertebrates: mammals, birds, reptiles and amphibians; arthropods: mosquitoes and ticks) (Supplementary Tables S1–S3). The nCAI-CA algorithm was able to accurately determine host types for MBFVs and UVFVs (vertebrates), IOFVs (Aedes mosquitoes), and TBFVs (Ixodes scapularis) (Figure 2). The paraphyletic group of dhIOFVs clustered between Aedes mosquitoes and vertebrates (Supplementary Figure S1). The CA plot shows a partial overlap between the MBFV and TBFV groups (Supplementary Figure S1); however, on average, TBFVs had higher nCAI values (0.813) than MBFVs (0.765) for I. scapularis, suggesting a higher degree of optimization for tick hosts (Table 1 and Supplementary Table S2). The nCAI-CA analysis also revealed unexpected findings for individual viruses, e.g., WNVs clustered within MBFVs but near TBFVs, which aligns with the results of previous infectivity tests [33] and some observational studies (Supplementary Table S1). All the viruses could be classified into two general host groups: vertebrates (MBFVs, TBFVs, UVFVs and dhIOFVs) and mosquitoes (IOFVs) (Supplementary Figure S2). Our analysis shows that no group clusters near Culex quinquefasciatus, suggesting that this is not an optimal host for most flaviviruses. However, Culex mosquitoes are relatively good vector-hosts for certain flaviviruses (e.g., JEV and WNV) and in many cases the preferred mosquito vector-host is debatable [34,35,36,37,38]. Our results indicate a higher adaptation of MBFV towards Aedes; however, a higher genomic adaptation does not imply that Aedes is currently the most common host-vector for all MBFV, as additional factors should be considered. Moreover, our algorithm mostly rules out Anopheles gambiae as the main host-vector, in agreement with the literature [39], although there are few notable exceptions [40,41]. Overall, the nCAI-CA algorithm is able to predict virus groups and host types with high values of accuracy and specificity in comparison to empirical observations (Table 2 and Supplementary Table S4).
Figure 2

Correspondence analysis of the normalized codon adaptation index (nCAI) values of flaviviruses (genus Flavivirus; n = 94). The plot shows that nCAI can differentiate multiple subgroups of flaviviruses based on their degree of codon usage optimization relative to their host organisms. Mosquito-borne flaviviruses are generally optimized for vertebrate hosts, while tick-borne flaviviruses are optimized for ticks, and insect-only flaviviruses are optimized for mosquitoes. Dual-host insect-only flaviviruses show optimization for both mosquitoes and vertebrates, and unknown vector flaviviruses are also optimized for vertebrates. Dimension 1 explains 89.4% of the variation, and Dimension 2 explains 8.5% of the variation. MBFV: mosquito-borne flaviviruses, TBFV: tick-borne flaviviruses, IOFV: insect-only flaviviruses and UVFV: unknown vector flaviviruses.

Table 1

Values of %GC3 and nCAI (Mean ± Standard deviation) by flavivirus and host groups.

TickAedesAnophelesCulexMammalsOtherVertebrates
FlavivirusGroups %GC3 72.0% 58.1% 69.6% 69.3% 60.9% 52.9%
-±1.8%--±2.5%±4.5%
dhIOFV(n = 5) 50.0% nCAI 0.738 0.949 0.729 0.764 0.889 1.027
±2.3%±0.048±0.053±0.051±0.049±0.062±0.049
IOFV(n = 14) 54.3% nCAI 0.748 0.938 0.734 0.780 0.866 0.982
±3.8%±0.046±0.044±0.042±0.0470.051±0.037
MBFV(n = 49) 52.1% nCAI 0.765 0.967 0.752 0.792 0.925 1.053
±3.7%±0.033±0.033±0.029±0.032±0.051±0.035
TBFV(n = 20) 58.9% nCAI 0.812 0.979 0.795 0.842 0.938 1.045
±1.8%±0.021±0.024±0.020±0.020±0.043±0.028
UVFV(n = 6) 44.4% nCAI 0.705 0.948 0.706 0.746 0.907 1.057
±3.3%±0.012±0.028±0.010±0.010±0.053±0.038
Mosquito(n = 14) 54.3% nCAI 0.748 0.938 0.734 0.780 0.866 0.982
±3.8%±0.046±0.044±0.042±0.047±0.051±0.037
Tick(n = 20) 58.9% nCAI 0.812 0.979 0.795 0.842 0.938 1.045
±1.8%±0.021±0.024±0.020±0.020±0.043±0.028
Vertebrate(n = 60) 51.2% nCAI 0.757 0.963 0.746 0.785 0.920 1.052
±4.3%±0.038±0.035±0.033±0.035±0.053±0.037

Host types: Tick (Ixodes scapularis); Aedes (Aedes albopictus and Aedes aegypti); Anopheles (Anopheles gambiae); Culex (Culex quinquefasciatus); Mammals (Homo sapiens, Bos taurus, Sus scrofa, Mus musculus, Myotis davidii and Myotis brandtii); and Other Vertebrates (Alligator mississippiensis, Xenopus laevis, Anas platyrhynchos, Gallus gallus and Columba livia). Complete list of flaviviruses is available in Supplementary Table S1. nCAI = CAIh/CAIs; nCAI: normalized Codon Adaptation Index; CAIh: Codon Adaptation Index calculated using host codon usage as a reference; CAIs: Codon Adaptation Index calculated with virus codon usage as a reference. %GC3: Percentage of guanine and cytosine at the third codon position. MBFV: mosquito-borne flaviviruses, TBFV: tick-borne flaviviruses, IOFV: insect-only flaviviruses and UVFV: unknown vector flaviviruses.

Table 2

Values of specificity and accuracy between nCAI predictions and empirical observations from Supplementary Table S1.

VirusGroupdhIOFVIOFVMBFVTBFVUVFVHostTypeMosquitoTickVertebrate
Accuracy 191.9%94.7%96.8%81.9%95.7%90.4%86.1%78.8%86.7%92.9%
Specificity 294.9%94.4%100.0%95.6%94.6%90.9%79.6%75.3%84.0%79.5%

1 (TP + TN)/(TP + FN + FP + TN); 2 TN/(TN + FP).

The heat map of nCAI values for all flaviviruses shows common adaptation patterns (Supplementary Figure S8). Likely optimal hosts (within an nCAI range of 0.9–1.1) include mammals (Myotis brandtii, M. davidii, Mus musculus, Bos taurus, and Homo sapiens) and Aedes mosquitoes (Aedes aegypti and A. albopictus). Unlikely hosts due to low adaptation (nCAI < 0.9) include C. quinquefasciatus, A. gambiae, I. scapularis and S. scrofa. These results are in accordance with previous studies and observations, e.g., most MBFVs have a reproductive cycle that includes Aedes (host-vector) or Culex (vector) mosquitoes and a primary mammalian host (Supplementary Table S1). Based on these analyses, flaviviruses are potentially less adapted to reproduction in Culex mosquitoes due to the differences in %GC between Culex and Aedes mosquitoes. Moreover, our analysis suggests that TBFV is a group of flaviviruses optimized to reproduce in vertebrates and use ticks as vectors (and we speculate that they may occasionally reproduce in ticks). In general, flavivirus codon usage is overoptimized (nCAI > 1.1) for birds (Columba livia, G. gallus, Anas platyrhynchos), amphibians (Xenopus laevis) and reptiles (Alligator mississippiensis).

4. Discussion

Despite the high values of specificity and accuracy produced by the nCAI-CA (Table 2 and Supplementary Table S4), there are certain limitations in its application. The algorithm is based on the assumption that there is a selection pressure to optimize the relative use of synonymous codons in the virus. However, it is well known that some viruses use the opposite strategy, and some viruses deoptimize codon usage to hide from host defense mechanisms [42]. The highest level of optimization is at nCAI = 1.0, when the relative use of synonymous codons in the virus and host is identical. Virus–host adaptations were also evaluated with a correspondence analysis (CA) plot (Figure 2). Flaviviruses able to infect a wide range of hosts (generalists) tend to be in the center of the plot, whereas host-specific flaviviruses move away from the center, towards their optimal hosts (Supplementary Figures S1 and S2). Viruses with overoptimized codon usage (nCAI >> 1) might be explained by multiple factors, e.g., adaptation to multiple hosts, effects of extreme %GC bias or adaptation to highly expressed genes [43]. Moreover, some gene-specific codon usage biases may better explain adaptations in certain viruses. Nevertheless, further empirical investigations are necessary to determine reliable confidence intervals for nCAI. Host determination may be uncertain if viruses display approximately equal optimizations for different host types; for example, although dhIOFV codon usage is optimized for both vertebrate and mosquito hosts, they are insect-specific [14]. Although common host preference patterns are observed, the optimal hosts vary depending on the virus or subgroup and may not reflect documented cases (Supplementary Table S1). The observed host ranges also do not distinguish between vectors and hosts, and classical phylogenomic methods cannot determine potential hosts without confirmed cases. Our nCAI-based method overcomes this limitation by directly measuring the adaptation of viruses to the translational machinery of their hosts.

5. Conclusions

In conclusion, this novel algorithm provides a fast and proactive method to assess the potential host ranges and the risk of zoonotic host shift for new and emerging viruses. In flaviviruses, this method distinguishes between arthropod and vertebrate hosts with high accuracy. However, it might produce ambivalent results for viruses undergoing host shifts. Overall, this nCAI-based algorithm may be used as a complement to current phylogenetic methods to monitor current and future outbreaks.
  43 in total

1.  Host range specificity of flaviviruses: correlation with in vitro replication.

Authors:  G Kuno
Journal:  J Med Entomol       Date:  2007-01       Impact factor: 2.278

2.  Vector competence of field populations of the mosquito species Aedes japonicus japonicus and Culex pipiens from Switzerland for two West Nile virus strains.

Authors:  S Wagner; A Mathis; A C Schönenberger; S Becker; J Schmidt-Chanasit; C Silaghi; E Veronesi
Journal:  Med Vet Entomol       Date:  2017-10-30       Impact factor: 2.739

Review 3.  Flavivirus genome organization, expression, and replication.

Authors:  T J Chambers; C S Hahn; R Galler; C M Rice
Journal:  Annu Rev Microbiol       Date:  1990       Impact factor: 15.500

Review 4.  The 2009 A (H1N1) influenza virus pandemic: A review.

Authors:  Marc P Girard; John S Tam; Olga M Assossou; Marie Paule Kieny
Journal:  Vaccine       Date:  2010-05-27       Impact factor: 3.641

5.  Discovery of Novel Crustacean and Cephalopod Flaviviruses: Insights into the Evolution and Circulation of Flaviviruses between Marine Invertebrate and Vertebrate Hosts.

Authors:  Rhys Parry; Sassan Asgari
Journal:  J Virol       Date:  2019-06-28       Impact factor: 5.103

Review 6.  The West African ebola virus disease epidemic 2014-2015: A commissioned review.

Authors:  S A Omilabu; O B Salu; B O Oke; A B James
Journal:  Niger Postgrad Med J       Date:  2016 Apr-Jun

7.  Discovery of flavivirus-derived endogenous viral elements in Anopheles mosquito genomes supports the existence of Anopheles-associated insect-specific flaviviruses.

Authors:  Sebastian Lequime; Louis Lambrechts
Journal:  Virus Evol       Date:  2017-01-05

8.  A novel coronavirus outbreak of global health concern.

Authors:  Chen Wang; Peter W Horby; Frederick G Hayden; George F Gao
Journal:  Lancet       Date:  2020-01-24       Impact factor: 79.321

9.  Linking Virus Genomes with Host Taxonomy.

Authors:  Tomoko Mihara; Yosuke Nishimura; Yugo Shimizu; Hiroki Nishiyama; Genki Yoshikawa; Hideya Uehara; Pascal Hingamp; Susumu Goto; Hiroyuki Ogata
Journal:  Viruses       Date:  2016-03-01       Impact factor: 5.048

Review 10.  A Systematic Review of the Natural Virome of Anopheles Mosquitoes.

Authors:  Ferdinand Nanfack Minkeu; Kenneth D Vernick
Journal:  Viruses       Date:  2018-04-25       Impact factor: 5.048

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.