Jaime Iranzo, Mart Krupovic, Eugene V Koonin.
Abstract
Virus genomes are prone to extensive gene loss, gain, and exchange and share no universal genes. Therefore, in a broad-scale study of virus evolution, gene and genome network analyses can complement traditional phylogenetics. We performed an exhaustive comparative analysis of the genomes of double-stranded DNA (dsDNA) viruses by using the bipartite network approach and found a robust hierarchical modularity in the dsDNA virosphere. Bipartite networks consist of two classes of nodes, with nodes in one class, in this case genomes, being connected via nodes of the second class, in this case genes. Such a network can be partitioned into modules that combine nodes from both classes. The bipartite network of dsDNA viruses includes 19 modules that form 5 major and 3 minor supermodules. Of these modules, 11 include tailed bacteriophages, reflecting the diversity of this largest group of viruses. The module analysis quantitatively validates and refines previously proposed nontrivial evolutionary relationships. An expansive supermodule combines the large and giant viruses of the putative order "Megavirales" with diverse moderate-sized viruses and related mobile elements. All viruses in this supermodule share a distinct morphogenetic tool kit with a double jelly roll major capsid protein. Herpesviruses and tailed bacteriophages comprise another supermodule, held together by a distinct set of morphogenetic proteins centered on the HK97-like major capsid protein. Together, these two supermodules cover the great majority of currently known dsDNA viruses. We formally identify a set of 14 viral hallmark genes that comprise the hubs of the network and account for most of the intermodule connections.
IMPORTANCE: Viruses and related mobile genetic elements are the dominant biological entities on earth, but their evolution is not sufficiently understood and their classification is not adequately developed. The key reason is the characteristic high rate of virus evolution that involves not only sequence change but also extensive gene loss, gain, and exchange. Therefore, in the study of virus evolution on a large scale, traditional phylogenetic approaches have limited applicability and have to be complemented by gene and genome network analyses. We applied state-of-the-art methods of such analysis to reveal robust hierarchical modularity in the genomes of double-stranded DNA viruses. Some of the identified modules combine highly diverse viruses infecting bacteria, archaea, and eukaryotes, in support of previous hypotheses on direct evolutionary relationships between viruses from the three domains of cellular life. We formally identify a set of 14 viral hallmark genes that hold together the genomic network.
Year: 2016 PMID: 27486193 PMCID: PMC4981718 DOI: 10.1128/mBio.00978-16
Source DB: PubMed Journal: MBio Impact factor: 7.867
Basic properties of the bipartite dsDNA viral network
| Element | Value for full network | Value for core genes |
|---|---|---|
| Genomes | 1,073 | 1,071 |
| Gene families | 33,793 | 1,576 |
| Edges | 98,343 | 30,661 |
| Edge density | 0.003 | 0.018 |
| Modules | NA | 19 |
| Mean no. of genes per genome | 92.1 | 28.8 |
| Mean gene abundance (mean with ORFan excluded) | 2.9 (6.7) | 19.3 |
The core gene version of the full bipartite network is a bipartite subnetwork that includes core gene families and genomes with at least one core gene.
NA, not applicable.
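The edge densities in the table are consistent with the usual definition for a bipartite network: the number of observed genome-gene links divided by the number of possible links. The source does not state the formula, so the sketch below assumes that definition; the function name is illustrative.

```python
def bipartite_edge_density(n_genomes: int, n_families: int, n_edges: int) -> float:
    """Fraction of all possible genome-to-gene-family links that are present.

    Assumed definition (not stated in the source): edges / (genomes * families).
    """
    return n_edges / (n_genomes * n_families)


# Counts taken from the table above.
full_density = bipartite_edge_density(1073, 33793, 98343)
core_density = bipartite_edge_density(1071, 1576, 30661)

print(round(full_density, 3), round(core_density, 3))
```

Under this assumption, both values round to the densities reported in the table (0.003 for the full network and 0.018 for the core gene subnetwork).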
FIG 1 The dsDNA virus world as a bipartite network. Nodes corresponding to genomes are depicted as larger circles, and nodes corresponding to core gene families are depicted as dots. An edge is drawn whenever a genome harbors a representative of a core gene family. (A) The modular structure of the network is highlighted by coloring genome nodes according to the module to which they belong (color coding is as described for Fig. 4 to 6). The location of some major viral groups is indicated for illustrative purposes. (B) The degree distributions of genes (left) and genomes (right). In the case of genes, the best fit to a power law distribution is also shown. (C) The scaling of the clustering coefficient, C(k), with respect to the degree k (genes and genomes) suggests a hierarchical modular structure organized around high-level hallmark genes [large k and small C(k)] and low-level signature genes [small k and large C(k)].
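As a minimal illustration of the degrees plotted in panel B, the sketch below counts node degrees from a list of genome-to-gene-family edges. The edge list and all names are hypothetical toy data, not values from the study.

```python
from collections import Counter

# Hypothetical toy edges: (genome, core gene family) pairs.
edges = [
    ("genome1", "famA"),
    ("genome1", "famB"),
    ("genome2", "famA"),
    ("genome3", "famA"),
]


def bipartite_degrees(edge_list):
    """Return (genome degrees, gene family degrees) for a bipartite edge list."""
    # Degree of a genome: number of core gene families it harbors.
    genome_degree = Counter(genome for genome, _ in edge_list)
    # Degree of a gene family: number of genomes that harbor it.
    gene_degree = Counter(family for _, family in edge_list)
    return genome_degree, gene_degree


genome_degree, gene_degree = bipartite_degrees(edges)
print(dict(genome_degree))
print(dict(gene_degree))
```

In this toy example the widespread family `famA` has a high degree, which is the kind of node that would sit in the tail of the gene degree distribution.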
FIG 4 Higher-order structure of the virus network. (A) Bipartite network defined by modules (numbered as for Table 4) and connector genes. A module is linked to a connector gene if the prevalence (relative abundance) of the gene in that module is greater than exp(−1). Modules 1 (crenarchaeal viruses) and 2 (polyomaviruses and papillomaviruses) that are only weakly connected to other modules are not represented. Modules are represented as colored circles, with the node size proportional to the number of genomes in the module. Connector genes are represented as dots. The position of some hallmark genes discussed in the text is shown. (B) Tree representation of the hierarchical supermodule structure of the network. At each iteration, two (super)modules were merged if their members clustered together in at least 50 of 100 replicates of the module detection algorithm. Branch lengths are proportional to the number of iterations required for two modules to merge. The number associated with each branch indicates the robustness of the respective supermodule. (C and D) Heat map representations of the supermodule robustness matrices for genomes (C) and gene families (D) after the last iteration of the higher-order supermodule search. To generate these matrices, nodes of one class (genomes or gene families) were sorted according to the supermodule they belong to in the optimal partition of the network. For each pair of nodes, the matrix contains the fraction of 100 replicates in which both nodes were placed in the same supermodule. Robust supermodules appear as blocks in the module robustness matrix.
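The linking rule in panel A can be sketched as follows. This is a simplified illustration that assumes prevalence is the unweighted fraction of a module's genomes carrying the gene; the study may weight genomes differently, and all names and data below are hypothetical.

```python
import math


def module_gene_links(genome_genes, genome_module, connector_genes,
                      threshold=math.exp(-1)):
    """Return (module, gene) pairs whose prevalence exceeds the threshold.

    Prevalence is assumed here to be the plain fraction of genomes in the
    module that carry the gene (no down-weighting of similar genomes).
    """
    members = {}
    for genome, module in genome_module.items():
        members.setdefault(module, []).append(genome)

    links = set()
    for module, genomes in members.items():
        for gene in connector_genes:
            carriers = sum(gene in genome_genes[g] for g in genomes)
            if carriers / len(genomes) > threshold:
                links.add((module, gene))
    return links


# Hypothetical toy data: six genomes in two modules, two connector genes.
genome_genes = {
    "g1": {"A", "B"}, "g2": {"A"}, "g3": {"B"},
    "g4": {"A", "B"}, "g5": {"A"}, "g6": {"A"},
}
genome_module = {"g1": 1, "g2": 1, "g3": 1, "g4": 2, "g5": 2, "g6": 2}

links = module_gene_links(genome_genes, genome_module, {"A", "B"})
print(sorted(links))
```

Here gene `B` is present in one of three genomes of module 2 (prevalence 1/3, below exp(−1) ≈ 0.368), so no link is drawn, whereas its prevalence of 2/3 in module 1 produces a link.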
FIG 6 Internal structure of the Caudovirales supermodule. A module is linked to a connector gene if the prevalence of the gene in that module is greater than exp(−1). Modules are represented as larger circles, with sizes proportional to the number of genomes in the module; color coding is as shown in Fig. 4. Module 15 contains Siphoviridae from the Lactococcus phage 936 sensu lato and c2-like groups. Module 16 contains Clostridium phage phiCP26F and related strains. Connector genes are represented as smaller gray nodes. Hallmark genes are labeled.
FIG 2 Core-shell-cloud structure of viral gene families. For each bin, the bar indicates the number of gene families with a retention probability in the range defined by the x axis. The blue dots indicate the median abundance of such families in the whole set of genomes (error bars correspond to the 25th and 75th percentiles). Family abundances were normalized so that an abundance equal to 1 means that the given family is present in each genome (the contributions of highly similar genomes were downweighted to compensate for sampling bias [see Materials and Methods]). The gene families with the highest retention probability (right-most bin) are typically restricted to a small number of genomes (median abundance, approximately 0.06). In contrast, many of the “core” genes according to the intuitive definition (i.e., those present in a large number of genomes) belong to the bin with a retention probability in the range of 0.7 to 0.8. For the purpose of this work, gene families to the right of the dashed, vertical line (i.e., those with a retention probability greater than 1/e) were considered core genes.
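The core gene criterion can be written compactly. The retention probability is exp(−r), where r is the estimated loss rate, so a threshold of 1/e corresponds to r < 1. The sketch below only encodes that threshold; the function names are illustrative, and the estimation of r itself (described in Materials and Methods) is not reproduced.

```python
import math

CORE_THRESHOLD = 1 / math.e  # dashed vertical line in the figure


def retention_probability(loss_rate: float) -> float:
    """Retention probability exp(-r) for an estimated loss rate r."""
    return math.exp(-loss_rate)


def is_core(retention: float) -> bool:
    """A gene family counts as core if its retention probability exceeds 1/e."""
    return retention > CORE_THRESHOLD


# A family with loss rate 0.3 is retained with probability ~0.74 and is core.
print(is_core(retention_probability(0.3)))
# Retention values as listed in the gene tables of this article.
print(is_core(0.371), is_core(0.129))
```

A retention of 0.371 lies just above the threshold and a retention of 0.129 well below it, which matches the split between the core gene table and the mobilome table.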
Top 25 core genes sorted by normalized abundance
| Family no. | Annotation | Retention | Abundance | Taxon(s) with presence |
|---|---|---|---|---|
| 5 | | 0.969 | 0.661 | |
| 13 | | 0.960 | 0.532 | |
| 10 | | 0.718 | 0.327 | |
| 16 | | 0.740 | 0.314 | |
| 8 | DEAD-like helicase | 0.522 | 0.298 | Megavirales, |
| 6 | | 0.683 | 0.290 | |
| 11 | | 0.886 | 0.255 | Adenoviridae, “Megavirales,” Polintons, |
| 24 | | 0.794 | 0.254 | |
| 111 | | 0.379 | 0.254 | |
| 22 | | 0.904 | 0.240 | |
| 4 | | 0.594 | 0.240 | “Megavirales,” |
| 27 | HNHc endonuclease | 0.600 | 0.222 | |
| 20 | DNA polymerase A | 0.763 | 0.205 | |
| 19 | Ribonucleotide reductase large subunit | 0.571 | 0.204 | “Megavirales,” |
| 18 | Thioredoxin | 0.371 | 0.174 | “Megavirales,” some |
| 2 | Ribonucleotide reductase small subunit | 0.440 | 0.169 | “Megavirales,” |
| 23 | Phage tail tape measure protein | 0.690 | 0.164 | |
| 30 | UvrD-like helicase | 0.886 | 0.162 | “Megavirales,” |
| 35 | Portal protein | 0.794 | 0.158 | |
| 12 | | 0.868 | 0.156 | “Megavirales,” Polintons, |
| 26 | Phage mu protein F, putative minor head protein | 0.612 | 0.143 | |
| 68 | | 0.786 | 0.136 | |
| 44 | Baseplate J family protein | 0.895 | 0.133 | |
| 36 | AAA family ATPase | 0.843 | 0.120 | |
| 47 | RuvC Holliday junction resolvase; poxvirus A22 family | 0.600 | 0.117 | “Megavirales,” some |
Bold text is used to denote hallmark genes.
The retention probability of a gene family is equal to exp(−r), where r is the estimated loss rate (see Materials and Methods).
The abundances were normalized to the total number of genomes, such that a family present in every genome would have an abundance equal to 1.
Representative genes from the viral mobilome (low retention propensity, high abundance)
| Family no. | Annotation | Retention | Abundance | Taxon(s) with presence |
|---|---|---|---|---|
| 31 | HNH endonuclease | 0.002 | 0.134 | |
| 32 | dUTPase | 0.092 | 0.158 | |
| 43 | HNH endonuclease | 0.012 | 0.103 | |
| 48 | BRO protein, phage antirepressor | 0.118 | 0.121 | |
| 56 | DUF3310 | 0.110 | 0.124 | |
| 79 | DNA methylase N-4/N-6 | 0.005 | 0.109 | |
| 80 | Peptidoglycan recognition protein | 0.049 | 0.105 | |
| 91 | ssDNA-binding protein, SSB_OBF domain, | 0.111 | 0.122 | |
| 30573 | Thymidine kinase | 0.129 | 0.178 | “Megavirales,” |
The retention probability of a gene family is equal to exp(−r), where r is the estimated loss rate (see Materials and Methods).
The abundances were normalized to the total number of genomes, such that a family present in every genome would have an abundance equal to 1.
TABLE 4 Modules of the dsDNA virus network
| Module | Composition (genomes) | Representative gene product(s) | No. of genomes | No. of genes | Robustness (genomes) | Robustness (genes) |
|---|---|---|---|---|---|---|
| 1 | Crenarchaeal viruses except | RHH domain-containing proteins | 27 | 59 | 0.98 | 0.98 |
| 2 | | Papillomavirus L2 protein | 70 | 7 | 0.93 | 0.89 |
| 3 | “Megavirales” (except | D5-like primase-helicase | 46 | 304 | 0.81 | 0.76 |
| 4 | | Virion core protein P4a, IMV membrane protein, metalloproteinase, poly(A) polymerase | 26 | 107 | 0.99 | 0.99 |
| 5 | | pDNAP, packaging ATPase (FtsK superfamily), double and single jelly roll capsid proteins, Ulp1-like cysteine protease | 183 | 29 | 0.88 | 0.77 |
| 6 | | | 70 | 61 | 1.00 | 0.98 |
| 7 | | Envelope glycoproteins H, M, B, UL73; tegument proteins UL7, UL16 | 41 | 24 | 1.00 | 1.00 |
| 8 | | Capsid triplex protein | 7 | 34 | 1.00 | 1.00 |
| 9 | Multiple | HK97-like capsid protein, large terminase subunit, protease, tyrosine integrase | 282 | 165 | 0.80 | 0.76 |
| 10 | | Minor tail protein | 107 | 211 | 0.88 | 0.88 |
| 11 | | Replicative helicase/primase-polymerase | 12 | 23 | 1.00 | 1.00 |
| 12 | Mostly | Tail completion and sheath stabilizer protein, baseplate J protein | 95 | 196 | 0.79 | 0.81 |
| 13 | | Zn-ribbon-containing structural protein, tail tube subunit, tail assembly chaperone | 13 | 40 | 0.98 | 1.00 |
| 14 | | Head-to-tail connecting protein | 63 | 63 | 0.98 | 0.96 |
| 15 | | Single-stranded DNA-binding protein | 11 | 43 | 0.94 | 0.94 |
| 16 | | Cytolysin, ferritin-like superfamily protein | 4 | 32 | 1.00 | 0.98 |
| 17 | | Structural proteins, lysin A | 4 | 62 | 1.00 | 1.00 |
| 18 | | Tail proteins (Pb3, Pb4, etc.), NAD-dependent DNA ligase, nicking endonuclease | 3 | 74 | 1.00 | 1.00 |
| 19 | | RNA polymerase beta subunit, RNase H | 5 | 43 | 0.97 | 1.00 |
Representative genes are presented based on their classification (in parentheses and boldface) as signature (S), hallmark (H), or connector (C) genes.
Mitochondrial plasmids from Babjeviella inositovora and Debaryomyces hansenii were assigned to module 5, and those from Kluyveromyces lactis, Lachancea kluyveri, and Millerozyma acaciae belong to module 6, although these assignations were based on a small number of widespread core genes.
The robustness is equal to the fraction of replicas in which pairs of members of a module were assigned to the same module, averaged over all possible pairs. Two measures of robustness apply to each module, depending on whether pairs of genes or pairs of genomes were considered.
Malacoherpesviridae lack the tail components listed as representative genes for module 12 (see text for further details).
The assignment of the only genome from family Plasmaviridae to module 9 is based solely on a shared integrase.
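The robustness measure defined in the footnotes can be sketched as a pairwise co-assignment average. The example assumes each replicate of the module detection algorithm is available as a mapping from node to module label; that data layout, the function name, and the toy data are assumptions for illustration only.

```python
from itertools import combinations


def module_robustness(members, replicates):
    """Average, over all member pairs, of the fraction of replicates in which
    both members of the pair were assigned to the same module.

    `replicates` is a list of dicts mapping node -> module label (assumed layout).
    """
    pairs = list(combinations(members, 2))
    if not pairs or not replicates:
        raise ValueError("need at least two members and one replicate")
    together = sum(
        replicate[a] == replicate[b]
        for replicate in replicates
        for a, b in pairs
    )
    return together / (len(pairs) * len(replicates))


# Hypothetical toy data: three members of a module, two replicates.
replicates = [
    {"x": 1, "y": 1, "z": 1},  # all three nodes placed together
    {"x": 1, "y": 1, "z": 2},  # z split off in this replicate
]
robustness = module_robustness(["x", "y", "z"], replicates)
print(round(robustness, 3))
```

Passing genome nodes gives the genome-level robustness and passing gene family nodes gives the gene-level robustness, which is why the table reports two values per module.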
FIG 3 Robustness and cross-similarity of modules in the virus bipartite network. (A and B) Heat map representations of the module robustness matrices for genomes (A) and gene families (B). To generate these matrices, nodes of one class (genomes or gene families) were sorted according to the module they belong to in the optimal partition of the network. For each pair of nodes, the matrix contains the fraction of 100 replicates in which both nodes were placed in the same module. Robust modules appear as blocks in the module robustness matrix; deviations from the block structure correspond to modules that are sometimes merged or nodes without a clear module assignation. The asterisk shows the case of mitochondrial plasmids which belong to module 5 in the best partition but are often assigned to module 14. (C) Quantitative summary of the average robustness of modules at the genome and gene level (elements on the diagonal) and the cross-similarity between pairs of modules (fraction of replicates in which nodes of both modules appear together; off-diagonal elements). See Table 4 for the list of the taxa assigned to each module.
FIG 5 The internal structure of the PL-“Megavirales” supermodule. A module is linked to a connector gene if the prevalence of the gene in that module is greater than exp(−1). Modules are represented as larger circles, with sizes proportional to the number of genomes in the module; color coding is the same as in Fig. 4. Connector genes are represented as smaller gray nodes. The PL elements, which originally formed a single module (shaded oval), were further dissected to produce the submodule structure shown. The hallmark genes are labeled.
FIG 7 Characterization of viral hallmark genes and module-specific signature genes. (A) All core gene families sorted by their relative prevalence in the major supermodules are shown in gray. Hallmark genes are those that, besides belonging to the set of connector genes, have a relative prevalence greater than 0.35 in at least one of the two major supermodules. (B) Signature genes are those genes with mutual information greater than 0.6 to their best-matching module (x axis) and less than 0.02 to their second match (y axis). The rest of the gene families are represented in gray for comparison. (C) Betweenness-rank distribution for genes in the bipartite network. The nodes with the highest betweenness correspond to hallmark and other connector genes. Signature genes are represented in red. (D) Three-dimensional representation of core genes based on mutual information, relative prevalence, and exclusivity with respect to their assigned module (same color coding as in panel C). (E) A histogram with the number of signature, hallmark, connector (nonhallmark), and other (gray) genes per module. Reanalysis of the Caudovirales subnetwork detected 13 signature genes for module 12, which are not shown in the figure. In panels B and D, a large red point indicates the existence of 205 signature genes whose presence-absence patterns perfectly match their assigned modules.
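The thresholds given in panels A and B can be combined into a simple classification rule. The sketch below takes precomputed quantities as inputs; how mutual information is normalized and how overlapping categories are prioritized are not specified here, so the order of the checks is an assumption and the function name is illustrative.

```python
def classify_gene(is_connector, supermodule_prevalences, mi_best, mi_second):
    """Classify a core gene family using the thresholds from the caption.

    supermodule_prevalences: relative prevalence in each of the two major
    supermodules. mi_best / mi_second: mutual information with the
    best-matching and second-best-matching module (precomputed elsewhere).
    The precedence of signature over hallmark is an assumption.
    """
    if mi_best > 0.6 and mi_second < 0.02:
        return "signature"
    if is_connector and max(supermodule_prevalences) > 0.35:
        return "hallmark"
    if is_connector:
        return "connector"
    return "other"


# Hypothetical inputs, chosen only to exercise each branch.
print(classify_gene(True, (0.50, 0.10), 0.10, 0.05))   # hallmark
print(classify_gene(False, (0.00, 0.00), 0.90, 0.00))  # signature
print(classify_gene(True, (0.20, 0.10), 0.10, 0.05))   # connector
print(classify_gene(False, (0.10, 0.00), 0.30, 0.10))  # other
```

The four labels correspond to the categories counted per module in panel E.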