Arshan Nasir, Gustavo Caetano-Anollés.
Abstract
The origin of viruses remains mysterious because of their diverse and patchy molecular and functional makeup. Although numerous hypotheses have attempted to explain viral origins, none is backed by substantive data. We take full advantage of the wealth of available protein structural and functional data to explore the evolution of the proteomic makeup of thousands of cells and viruses. Despite the extremely reduced nature of viral proteomes, we established an ancient origin of the "viral supergroup" and the existence of widespread episodes of horizontal transfer of genetic information. Viruses harboring different replicon types and infecting distantly related hosts shared many metabolic and informational protein structural domains of ancient origin that were also widespread in cellular proteomes. Phylogenomic analysis uncovered a universal tree of life and revealed that modern viruses reduced from multiple ancient cells that harbored segmented RNA genomes and coexisted with the ancestors of modern cells. The model for the origin and evolution of viruses and cells is backed by strong genomic and structural evidence and can be reconciled with existing models of viral evolution if one considers viruses to have originated from ancient cells and not from modern counterparts.
Keywords: fold; horizontal gene transfer; origin of life; phylogenetic analysis; protein domain; structure; taxonomy; tree of life; virus
Year: 2015 PMID: 26601271 PMCID: PMC4643759 DOI: 10.1126/sciadv.1500527
Source DB: PubMed Journal: Sci Adv ISSN: 2375-2548 Impact factor: 14.136
Fig. 1. FSF sharing patterns and makeup of cellular and viral proteomes.
(A) Numbers in parentheses indicate the total number of proteomes that were sampled from Archaea, Bacteria, Eukarya, and viruses. (B) Barplots comparing the proteomic composition of viruses infecting the three superkingdoms. Numbers in parentheses indicate the total number of viral proteomes in each group. Numbers above bars indicate the total number of proteins in each of the three classes of proteins. VSFs are listed in Table 1. (C and D) FSF use and reuse for proteomes in each viral subgroup and in the three superkingdoms. Values given in logarithmic scale. Important outliers are labeled. Shaded regions highlight the overlap between parasitic cells and giant viruses.
VSFs and their distribution in the viral supergroup.
FSFs in boldface could be potential VSFs based on the criterion described in the text. FSFs were referenced by either SCOP ID or css. For example, the P-loop containing NTP hydrolase FSF is c.37.1, where “c” is the α/β class of secondary structure present in the protein domain, “37” is the fold, and “1” is the FSF.
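| SCOP ID | FSF (css) | Venn group | FSF description | Viral subgroup(s) |
| --- | --- | --- | --- | --- |
| 69070 | a.150.1 | V | Anti-sigma factor AsiA | dsDNA |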
| 55064 | d.58.27 | V | Translational regulator protein regA | dsDNA |
| 48493 | a.120.1 | V | Gene 59 helicase assembly protein | dsDNA |
| 89433 | b.127.1 | V | Baseplate structural protein gp8 | dsDNA |
| 69652 | d.199.1 | V | DNA binding C-terminal domain of the transcription factor MotA | dsDNA |
| 56558 | d.182.1 | V | Baseplate structural protein gp11 | dsDNA |
| 49894 | b.28.1 | V | Baculovirus p35 protein | dsDNA |
| 160957 | e.69.1 | V | Poly(A) polymerase catalytic subunit–like | dsDNA |
| 51289 | b.85.5 | V | Tlp20, baculovirus telokin-like protein | dsDNA |
| 88648 | b.121.6 | V | Group I dsDNA viruses | dsDNA |
| 161240 | g.92.1 | V | T-antigen–specific domain–like | dsDNA |
| 118208 | e.58.1 | V | Viral ssDNA binding protein | dsDNA |
| 54957 | d.58.8 | V | Viral DNA binding domain | dsDNA |
| 51332 | b.91.1 | V | E2 regulatory, transactivation domain | dsDNA |
| 56548 | d.180.1 | V | Conserved core of transcriptional regulatory protein vp16 | dsDNA |
| 90246 | h.1.24 | V | Head morphogenesis protein gp7 | dsDNA |
| 47724 | a.54.1 | V | Domain of early E2A DNA binding protein, ADDBP | dsDNA |
| 57917 | g.51.1 | V | Zn binding domains of ADDBP | dsDNA |
| 49889 | b.27.1 | V | Soluble secreted chemokine inhibitor, VCCI | dsDNA |
| 89428 | b.126.1 | V | Adsorption protein p2 | dsDNA |
| 82046 | b.116.1 | V | Viral chemokine binding protein m3 | dsDNA |
| 158974 | b.170.1 | V | WSSV envelope protein-like | dsDNA |
| 47852 | a.62.1 | V | Hepatitis B viral capsid (hbcag) | dsDNA-RT |
| 111379 | f.47.1 | V | VP4 membrane interaction domain | dsRNA |
| 48345 | a.115.1 | V | A virus capsid protein alpha-helical domain | dsRNA |
| 69908 | e.35.1 | V | Membrane penetration protein mu1 | dsRNA |
| 75347 | d.13.2 | V | Rotavirus NSP2 fragment, C-terminal domain | dsRNA |
| 69903 | e.34.1 | V | NSP3 homodimer | dsRNA |
| 75574 | d.216.1 | V | Rotavirus NSP2 fragment, N-terminal domain | dsRNA |
| 58030 | h.1.13 | V | Rotavirus nonstructural proteins | dsRNA |
| 49818 | b.19.1 | V | Viral protein domain | dsRNA, minus-ssRNA, plus-ssRNA |
| 88650 | b.121.7 | V | Satellite viruses | ssDNA |
| 48045 | a.84.1 | V | Scaffolding protein gpD of bacteriophage procapsid | ssDNA |
| 50176 | b.37.1 | V | N-terminal domains of the minor coat protein g3p | ssDNA |
| 75404 | d.213.1 | V | VSV matrix protein | Minus-ssRNA |
| 118173 | d.293.1 | V | Phosphoprotein M1, C-terminal domain | Minus-ssRNA |
| 69922 | f.12.1 | V | Head and neck region of the ectodomain of NDV fusion glycoprotein | Minus-ssRNA |
| 101089 | a.8.5 | V | Phosphoprotein XD domain | Minus-ssRNA |
| 58034 | h.1.14 | V | Multimerization domain of the phosphoprotein from Sendai virus | Minus-ssRNA |
| 50012 | b.31.1 | V | EV matrix protein | Minus-ssRNA |
| 48145 | a.95.1 | V | Influenza virus matrix protein M1 | Minus-ssRNA |
| 143021 | d.299.1 | V | Ns1 effector domain–like | Minus-ssRNA |
| 161003 | e.75.1 | V | Flu NP-like | Minus-ssRNA |
| 160453 | d.361.1 | V | PB2 C-terminal domain–like | Minus-ssRNA |
| 101156 | a.30.3 | V | Nonstructural protein ns2, Nep, M1 binding domain | Minus-ssRNA |
| 160892 | d.378.1 | V | Phosphoprotein oligomerization domain–like | Minus-ssRNA |
| 56983 | f.10.1 | V | Viral glycoprotein, central and dimerization domains | Plus-ssRNA |
| 101257 | a.190.1 | V | | Plus-ssRNA |
| 103145 | d.255.1 | V | Tombusvirus P19 core protein, VP19 | Plus-ssRNA |
| 89043 | a.178.1 | V | Soluble domain of poliovirus core protein 3a | Plus-ssRNA |
| 110304 | b.148.1 | V | Coronavirus RNA binding domain | Plus-ssRNA |
| 101816 | b.140.1 | V | Replicase NSP9 | Plus-ssRNA |
| 140367 | a.8.9 | V | Coronavirus NSP7–like | Plus-ssRNA |
| 143076 | d.302.1 | V | Coronavirus NSP8–like | Plus-ssRNA |
| 144246 | g.86.1 | V | Coronavirus NSP10–like | Plus-ssRNA |
| 103068 | d.254.1 | V | Nucleocapsid protein dimerization domain | Plus-ssRNA |
| 117066 | b.1.24 | V | Accessory protein X4 (ORF8, ORF7a) | Plus-ssRNA |
| 143587 | d.318.1 | V | SARS receptor binding domain–like | Plus-ssRNA |
| 159936 | d.15.14 | V | NSP3A-like | Plus-ssRNA |
| 160099 | d.346.1 | V | SARS Nsp1–like | Plus-ssRNA |
| 140506 | a.30.8 | V | FHV B2 protein–like | Plus-ssRNA |
| 144251 | g.87.1 | V | Viral leader polypeptide zinc finger | Plus-ssRNA |
| 141666 | b.164.1 | V | SARS ORF9b–like | Plus-ssRNA |
| 55671 | d.102.1 | V | Regulatory factor Nef | ssRNA-RT |
| 56502 | d.172.1 | V | gp120 core | ssRNA-RT |
| 57647 | g.34.1 | V | HIV-1 VPU cytoplasmic domain | ssRNA-RT |
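The css notation described in the table legend (class, fold, superfamily) can be split mechanically. The following is a minimal sketch, not part of the original study: `parse_css` and `Css` are hypothetical helper names, and the class names are taken from the standard SCOP class list rather than from this article.

```python
from typing import NamedTuple

# SCOP class letters that appear in the tables on this page.
# The descriptive names follow the standard SCOP class list (an assumption
# added here; the legend itself only spells out "c" as the alpha/beta class).
SCOP_CLASSES = {
    "a": "all-alpha",
    "b": "all-beta",
    "c": "alpha/beta",
    "d": "alpha+beta",
    "e": "multi-domain",
    "f": "membrane and cell surface",
    "g": "small proteins",
    "h": "coiled coil",
}


class Css(NamedTuple):
    scop_class: str   # e.g. "c"
    fold: int         # e.g. 37
    superfamily: int  # e.g. 1 (the FSF)


def parse_css(css: str) -> Css:
    """Split a css string such as 'c.37.1' into class, fold, and FSF."""
    parts = css.strip().split(".")
    if len(parts) != 3:
        raise ValueError(f"expected class.fold.superfamily, got {css!r}")
    scop_class, fold, superfamily = parts
    if scop_class not in SCOP_CLASSES:
        raise ValueError(f"unknown SCOP class letter in {css!r}")
    return Css(scop_class, int(fold), int(superfamily))


p_loop = parse_css("c.37.1")  # P-loop containing NTP hydrolase FSF
print(p_loop, SCOP_CLASSES[p_loop.scop_class])
```

Keeping the fold and superfamily as integers makes it straightforward to sort or group FSFs by fold when scanning a table such as the one above.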
Significantly enriched “biological process” GO terms in (66 + 43) VSFs (FDR < 0.01).
| GO:0044415 | Evasion or tolerance of host defenses | 14.56 | 4.01 × 10^-6 | 3.00 × 10^-5 |
| GO:0050690 | Regulation of defense response to virus by virus | 14.56 | 4.01 × 10^-6 | 3.00 × 10^-5 |
| GO:0044068 | Modulation by symbiont of host cellular process | 13.8 | 5.72 × 10^-6 | 3.00 × 10^-5 |
| GO:0052572 | Response to host immune response | 13.14 | 7.86 × 10^-6 | 3.02 × 10^-5 |
| GO:0002832 | Negative regulation of response to biotic stimulus | 12.57 | 1.05 × 10^-5 | 3.02 × 10^-5 |
| GO:0052255 | Modulation by organism of defense response of other organism involved in symbiotic interaction | 12.57 | 1.05 × 10^-5 | 3.02 × 10^-5 |
| GO:0051805 | Evasion or tolerance of immune response of other organism involved in symbiotic interaction | 12.57 | 1.05 × 10^-5 | 3.02 × 10^-5 |
| GO:0019048 | Modulation by virus of host morphology or physiology | 12.06 | 1.36 × 10^-5 | 3.53 × 10^-5 |
Fig. 2. Spread of viral FSFs in cellular proteomes.
(A) Violin plots comparing the spread (f value) of FSFs shared and not shared with viruses in archaeal, bacterial, and eukaryal proteomes. (B) Violin plots comparing the spread (f value) of FSFs shared with each viral subgroup in archaeal, bacterial, and eukaryal proteomes. Numbers on top indicate the total number of FSFs involved in each comparison. White circles in each boxplot represent group medians. Density trace is plotted symmetrically around the boxplots.
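The captions use the f value as a measure of how widely an FSF is spread across a set of proteomes. A minimal sketch follows, assuming that f is the fraction of sampled proteomes in which the FSF occurs at least once; this definition is an assumption made here, since the captions do not spell it out. The function name `f_value` and the toy proteomes are made up for illustration.

```python
from typing import Iterable, Set


def f_value(fsf: str, proteomes: Iterable[Set[str]]) -> float:
    """Fraction of proteomes that contain the given FSF (assumed definition)."""
    proteomes = list(proteomes)
    if not proteomes:
        raise ValueError("at least one proteome is required")
    hits = sum(1 for proteome in proteomes if fsf in proteome)
    return hits / len(proteomes)


# Toy data: each proteome is represented by the set of FSFs (css) it encodes.
# These four proteomes are invented and do not come from the study.
toy_proteomes = [
    {"c.37.1", "e.8.1"},
    {"c.37.1"},
    {"c.37.1", "b.121.4"},
    {"d.58.8"},
]

print(f_value("c.37.1", toy_proteomes))   # widely spread in the toy set
print(f_value("b.121.4", toy_proteomes))  # rare in the toy set
```

Under this reading, an f value near 1 marks an FSF present in almost every proteome of a group, and a value near 0 marks one that is nearly absent.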
Fig. 3. Virus-host preferences and FSF distribution in viruses infecting different hosts.
(A) The abundance of each viral replicon type that is capable of infecting Archaea, Bacteria, and Eukarya and major divisions in Eukarya. Virus-host information was retrieved from the National Center for Biotechnology Information Viral Genomes Project. Hosts were classified into Archaea, Bacteria, Protista (animal-like protists), Fungi, Plants (all plants, blue-green algae, and diatoms), Invertebrates and Plants (IP), and Metazoa (vertebrates, invertebrates, and humans). Host information was available for 3440 of the 3660 viruses that were sampled in this study. Two additional ssDNA archaeoviruses were added from the literature. Numbers on bars indicate the total virus count in each host group. (B) Venn diagram shows the distribution of 715 (of 716) FSFs that were detected in archaeoviruses, bacterioviruses, and eukaryoviruses. Host information on the Circovirus-like genome RW_B virus encoding the “Satellite viruses” FSF (b.121.7) was not available. (C) Mean f values for FSFs corresponding to each of the seven Venn groups defined in (B) in archaeal, bacterial, and eukaryal proteomes. Values were averaged for all FSFs in each of the seven Venn groups. Text above bars indicates how many different viral subgroups encoded those FSFs.
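The seven Venn groups in panel (B) arise from three sets (FSFs detected in archaeoviruses, bacterioviruses, and eukaryoviruses). A minimal sketch of how such groups can be derived from set membership is shown below; the function name `venn_groups`, the set labels, and the toy FSF assignments are all illustrative assumptions, not data from the study.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


def venn_groups(sets: Dict[str, Set[str]]) -> Dict[Tuple[str, ...], Set[str]]:
    """Group items by the exact combination of named sets they belong to."""
    groups: Dict[Tuple[str, ...], Set[str]] = defaultdict(set)
    for item in set().union(*sets.values()):
        # The key lists every set that contains the item, in input order.
        key = tuple(name for name in sets if item in sets[name])
        groups[key].add(item)
    return dict(groups)


# Toy membership, invented for illustration only.
toy_sets = {
    "archaeoviruses": {"c.37.1", "a.4.5"},
    "bacterioviruses": {"c.37.1", "a.4.5", "d.183.1"},
    "eukaryoviruses": {"c.37.1", "b.121.4"},
}

for combo, fsfs in sorted(venn_groups(toy_sets).items()):
    print(combo, sorted(fsfs))
```

With three input sets there are at most 2^3 − 1 = 7 non-empty membership patterns, which matches the seven Venn groups referred to in panel (C).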
FSFs involved in capsid/coat assembly processes in viruses.
FSFs that are completely absent in cellular proteomes are presented in boldface. Several other FSFs also have negligible f values in cells.
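| SCOP ID | FSF (css) | FSF description | Capsid lineage | f value in cells |
| --- | --- | --- | --- | --- |
| 82856 | e.42.1 | L-A virus major coat protein | BTV-like | 0.00025 |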
| 56831 | e.28.1 | Reovirus inner layer core protein p3 | BTV-like | 0.00019 |
| 56563 | d.183.1 | Major capsid protein gp5 | HK97-like | 0.2352 |
| 103417 | e.48.1 | Major capsid protein VP5 | HK97-like | 0.00006 |
| 88633 | b.121.4 | Positive stranded ssRNA viruses | Picornavirus-like | 0.00364 |
| 88645 | b.121.5 | ssDNA viruses | Picornavirus-like | 0.00099 |
| 49749 | b.121.2 | Group II dsDNA viruses VP | PRD1/adenovirus-like | 0.00031 |
| 47353 | a.28.3 | Retrovirus capsid dimerization domain–like | Other/unclassified | 0.00407 |
| 47943 | a.73.1 | Retrovirus capsid protein, N-terminal core domain | Other/unclassified | 0.00123 |
| 47195 | a.24.5 | TMV-like viral coat proteins | Other/unclassified | 0.00099 |
| 57987 | h.1.4 | Inovirus (filamentous phage) major coat protein | Other/unclassified | 0.00068 |
| 51274 | b.85.2 | Head decoration protein D (gpD, major capsid protein D) | Other/unclassified | 0.00049 |
| 64465 | d.196.1 | Outer capsid protein sigma 3 | Other/unclassified | 0.00006 |
| 55405 | d.85.1 | RNA bacteriophage capsid protein | Other/unclassified | 0.00006 |
Fig. 4. FSF distribution in the viral supergroup.
(A) Total number of FSFs that were either shared or uniquely present in each viral subgroup. A seven-set Venn diagram makes explicit the 127 (2^7 − 1) combinations that are possible with seven groups. (B) Ariadne’s threads give the most parsimonious solution to encase all highly shared FSFs between different viral subgroups. Threads were inferred directly from the seven-set Venn diagram. FSFs identified by SCOP css. (C) Number of FSFs shared in each viral subgroup with every other subgroup. Pie charts are proportional to the size of the FSF repertoire in each viral subgroup.
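The count of 127 in panel (A) is the number of non-empty subsets of seven groups, 2^7 − 1. As a quick check, the sketch below enumerates them for the seven replicon-type subgroups named in the tables on this page; the helper name `nonempty_combinations` is hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

# The seven viral subgroups (replicon types) that appear in the tables.
SUBGROUPS = [
    "dsDNA", "dsDNA-RT", "dsRNA", "ssDNA",
    "minus-ssRNA", "plus-ssRNA", "ssRNA-RT",
]


def nonempty_combinations(groups: Sequence[str]) -> List[Tuple[str, ...]]:
    """Return every non-empty combination of the given groups."""
    return [
        combo
        for size in range(1, len(groups) + 1)
        for combo in combinations(groups, size)
    ]


combos = nonempty_combinations(SUBGROUPS)
print(len(combos))  # number of regions in a seven-set Venn diagram
```

Each tuple corresponds to one region of the seven-set Venn diagram, so an FSF shared by, say, dsDNA and ssDNA viruses only falls in the region keyed by `("dsDNA", "ssDNA")`.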
FSFs shared by different viral subgroups.
| SCOP ID | FSF (css) | FSF description | Viral subgroups |
| --- | --- | --- | --- |
| 56672 | e.8.1 | DNA/RNA polymerases | dsDNA, dsRNA, dsDNA-RT, ssRNA-RT, minus-ssRNA, plus-ssRNA |
| 52540 | c.37.1 | P-loop containing nucleoside triphosphate hydrolases | dsDNA, dsRNA, ssDNA, plus-ssRNA |
| 53335 | c.66.1 | | dsDNA, dsRNA, ssDNA, minus-ssRNA, plus-ssRNA |
| 53098 | c.55.3 | Ribonuclease H–like | dsDNA, ssRNA-RT, ssDNA, minus-ssRNA |
| 88633 | b.121.4 | Positive stranded ssRNA viruses | dsDNA, dsRNA, minus-ssRNA, plus-ssRNA |
| 57850 | g.44.1 | RING/U-box | dsDNA, minus-ssRNA, plus-ssRNA |
| 51283 | b.85.4 | dUTPase-like | dsDNA, dsDNA-RT, ssRNA-RT |
| 56112 | d.144.1 | Protein kinase–like (PK-like) | dsDNA, dsRNA, ssRNA-RT |
| 54768 | d.50.1 | dsRNA binding domain–like | dsDNA, dsRNA, plus-ssRNA |
| 54001 | d.3.1 | Cysteine proteinases | dsDNA, minus-ssRNA, plus-ssRNA |
| 52266 | c.23.10 | SGNH hydrolase | dsDNA, minus-ssRNA, plus-ssRNA |
| 58100 | h.4.4 | Bacterial hemolysins | dsDNA, dsRNA, ssDNA |
| 49818 | b.19.1 | Viral protein domain | dsRNA, minus-ssRNA, plus-ssRNA |
| 57756 | g.40.1 | Retrovirus zinc finger–like domains | dsDNA, dsDNA-RT, ssRNA-RT |
| 50044 | b.34.2 | SH3 domain | dsDNA, dsRNA, ssRNA-RT |
| 57924 | g.52.1 | Inhibitor of apoptosis (IAP) repeat | dsDNA, plus-ssRNA |
| 50249 | b.40.4 | Nucleic acid binding proteins | dsDNA, ssDNA |
| 53041 | c.53.1 | Resolvase-like | dsDNA, ssDNA |
| 55550 | d.93.1 | SH2 domain | dsDNA, ssRNA-RT |
| 55464 | d.89.1 | Origin of replication binding domain, RBD-like | dsDNA, ssDNA |
| 56399 | d.166.1 | ADP ribosylation | dsDNA, ssDNA |
| 100920 | b.130.1 | Heat shock protein 70 kD (HSP70), peptide binding domain | dsDNA, plus-ssRNA |
| 47413 | a.35.1 | Lambda repressor–like DNA binding domains | dsDNA, ssDNA |
| 69065 | a.149.1 | RNase III domain–like | dsDNA, plus-ssRNA |
| 46785 | a.4.5 | Winged helix DNA binding domain | dsDNA, ssDNA |
| 53448 | c.68.1 | Nucleotide-diphospho-sugar transferases | dsDNA, dsRNA |
| 57997 | h.1.5 | Tropomyosin | dsDNA, dsRNA |
| 54236 | d.15.1 | Ubiquitin-like | dsDNA, ssRNA-RT |
| 47954 | a.74.1 | Cyclin-like | dsDNA, ssRNA-RT |
| 90229 | g.66.1 | CCCH zinc finger | dsDNA, minus-ssRNA |
| 103657 | a.238.1 | BAR/IMD domain–like | dsDNA, ssRNA-RT |
| 53067 | c.55.1 | Actin-like ATPase domain | dsDNA, plus-ssRNA |
| 47794 | a.60.4 | Rad51 N-terminal domain–like | dsDNA, ssDNA |
| 143990 | d.336.1 | YbiA-like | dsDNA, plus-ssRNA |
| 55811 | d.113.1 | Nudix | dsDNA, dsRNA |
| 51197 | b.82.2 | Clavaminate synthase–like | dsDNA, plus-ssRNA |
| 53756 | c.87.1 | UDP-glycosyltransferase/glycogen phosphorylase | dsDNA, dsRNA |
| 81665 | f.33.1 | Calcium ATPase, transmembrane domain M | dsDNA, plus-ssRNA |
| 52949 | c.50.1 | Macro domain–like | dsDNA, plus-ssRNA |
| 53955 | d.2.1 | Lysozyme-like | dsDNA, dsRNA |
| 49899 | b.29.1 | Concanavalin A–like lectins/glucanases | dsDNA, dsRNA |
| 48371 | a.118.1 | ARM repeat | dsDNA, plus-ssRNA |
| 51126 | b.80.1 | Pectin lyase–like | dsDNA, plus-ssRNA |
| 47598 | a.43.1 | Ribbon-helix-helix | dsDNA, ssDNA |
| 50494 | b.47.1 | Trypsin-like serine proteases | dsDNA, plus-ssRNA |
| 55144 | d.61.1 | LigT-like | dsDNA, plus-ssRNA |
| 81296 | b.1.18 | E set domains | dsDNA, plus-ssRNA |
| 161008 | e.76.1 | Viral glycoprotein ectodomain–like | dsDNA, minus-ssRNA |
| 90257 | h.1.26 | Myosin rod fragments | dsDNA, dsRNA |
| 57501 | g.17.1 | Cystine-knot cytokines | dsDNA, ssRNA-RT |
| 54117 | d.9.1 | Interleukin 8–like chemokines | dsDNA, dsRNA |
| 58069 | h.3.2 | Virus ectodomain | ssRNA-RT, minus-ssRNA |
| 50630 | b.50.1 | Acid proteases | dsDNA-RT, ssRNA-RT |
| 47459 | a.38.1 | HLH, helix-loop-helix DNA binding domain | dsDNA, ssRNA-RT |
| 50939 | b.68.1 | Sialidases | dsDNA, minus-ssRNA |
| 55166 | d.65.1 | Hedgehog/DD peptidase | dsDNA, ssDNA |
| 51225 | b.83.1 | Fiber shaft of virus attachment proteins | dsDNA, dsRNA |
| 49835 | b.21.1 | Virus attachment protein globular domain | dsDNA, dsRNA |
| 111474 | h.3.3 | Coronavirus S2 glycoprotein | dsDNA, plus-ssRNA |
| 55658 | d.100.1 | L9 N-domain–like | dsDNA, dsDNA-RT |
| 55895 | d.124.1 | Ribonuclease Rh–like | dsDNA, plus-ssRNA |
| 52972 | c.51.4 | ITPase-like | dsDNA, plus-ssRNA |
| 57959 | h.1.3 | Leucine zipper domain | dsDNA, ssRNA-RT |
| 50203 | b.40.2 | Bacterial enterotoxins | dsDNA, ssDNA |
| 48208 | a.102.1 | Six-hairpin glycosidases | dsDNA, ssDNA |
| 50022 | b.33.1 | ISP domain | dsDNA, ssRNA-RT |
| 58064 | h.3.1 | Influenza hemagglutinin (stalk) | dsDNA, minus-ssRNA |
Fig. 5. Phylogenomic analysis of FSF domains.
(A) ToD describes the evolution of 1995 FSF domains (taxa) in 5080 proteomes (characters) (tree length = 1,882,554; retention index = 0.74; g1 = −0.18). The bar on top of ToD is a simple representation of how FSFs appeared in its branches, which correlates with their age (nd). FSFs were labeled blue for cell-only and red for those either shared with or unique to viruses. The boxplots identify the most ancient and derived Venn groups. Two major phases in the evolution of viruses are indicated in different background colors. Patterned area highlights the appearances of AV, BV, and EV soon after A, B, and E, respectively. FSFs are identified by SCOP css. (B) Viral FSFs plotted against their spread in viral proteomes (f value) and evolutionary time (nd). FSFs identified by SCOP css. (C) Distribution of ABEV FSFs in each viral subgroup along evolutionary time (nd). Numbers in parentheses indicate the total number of ABEV FSFs in each viral subgroup. White circles indicate group medians. Density trace is plotted symmetrically around the boxplots.
Fig. 6. Ancient history of RNA viral proteomes.
(A) The length of Ariadne’s threads (colored lines) identifies FSFs that were shared by more than three viral subgroups. Filled circles indicate FSFs shared between two or three viral subgroups. Numbers next to each circle give the mean nd of FSFs shared by each combination. Numbers in parentheses give the range between the most ancient and the most recent FSFs that were shared by each combination. (B) Distribution of the most ancient (nd < 0.3) ABEV FSFs in evolutionary timeline (nd) for each viral subgroup. Numbers in parentheses indicate the total FSFs in each viral subgroup. White circles indicate group medians. A density trace is plotted symmetrically around the boxplots.
Fig. 7. Evolutionary relationships between cells and viruses.
(A) ToP describing the evolution of 368 proteomes (taxa) that were randomly sampled from cells and viruses and were distinguished by the abundance of 442 ABEV FSFs (characters) (tree length = 45,935; retention index = 0.83; g1 = −0.31). All characters were parsimony informative. Differently colored branches represent BS support values. Major groups are identified. Viral genera names are given inside parentheses. The viral order “Megavirales” is awaiting approval by the ICTV and hence written inside quotes. Viral families that form largely unified or monophyletic groups are labeled with an asterisk. Virion morphotypes were mapped to ToP and illustrated with images from the ViralZone Web resource. No picture was available for Turriviridae. Actinobacteria, Bacteroidetes/Chlorobi, Chloroflexi, Cyanobacteria, Fibrobacter, Firmicutes, Planctomycetes, and Thermotogae. (B) A distance-based phylogenomic network reconstructed from the occurrence of 442 ABEV FSFs in randomly sampled 368 proteomes (uncorrected P distance; equal angle; least-squares fit = 99.46). Numbers on branches indicate BS support values. Taxa were colored for easy visualization. Important groups are labeled. Actinobacteria, Bacteroidetes/Chlorobi, Chloroflexi, Cyanobacteria, Deinococcus-Thermus, Fibrobacter, Firmicutes, and Planctomycetes. Amoebozoa and Chromalveolata.
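The network in panel (B) is built from uncorrected P distances over FSF occurrence. In general, an uncorrected P distance is the proportion of characters at which two taxa differ. The sketch below applies that general definition to presence/absence profiles; the function name `p_distance` and the toy profiles are assumptions for illustration and do not reproduce the software or data used in the study.

```python
from typing import Sequence


def p_distance(x: Sequence[int], y: Sequence[int]) -> float:
    """Proportion of characters (FSFs) whose state differs between two taxa."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("profiles must be non-empty and of equal length")
    differences = sum(1 for a, b in zip(x, y) if a != b)
    return differences / len(x)


# Toy occurrence profiles over five FSFs (1 = present, 0 = absent).
# Invented values, not taken from the 442 ABEV FSFs of the study.
proteome_1 = [1, 1, 0, 1, 0]
proteome_2 = [1, 0, 0, 1, 1]

print(p_distance(proteome_1, proteome_2))  # two of five characters differ
```

Because no substitution model is applied, the distance is bounded between 0 (identical occurrence profiles) and 1 (profiles that differ at every character).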
Fig. 8. Evolutionary history of proteomes inferred from numerical analysis.
(A) Plot of the first three axes of evoPCO portrays evolutionary distances between cellular and viral proteomes. The percentage of variability explained by each coordinate is given in parentheses on each axis. The proteome of the last common ancestor of modern cells was added as an additional sample to infer the direction of evolutionary splits. Ignicoccus hospitalis, Lactobacillus delbrueckii, Caenorhabditis elegans. (B) A distance-based NJ tree reconstructed from the occurrence of 442 ABEV FSFs in randomly sampled 368 proteomes. Each taxon was given a unique tree ID (tables S1 and S2). Taxa were colored for quick visualization.