Arshan Nasir1,2, Kyung Mo Kim3, Gustavo Caetano-Anollés2.
Abstract
Untangling the origin and evolution of viruses remains a challenging proposition. We recently studied the global distribution of protein domain structures in thousands of completely sequenced viral and cellular proteomes with comparative genomics, phylogenomics, and multidimensional scaling methods. A tree of life describing the evolution of proteomes revealed viruses emerging from the base of the tree as a fourth supergroup of life. A tree of domains indicated an early origin of modern viral lineages from ancient cells that co-existed with the cellular ancestors. However, it was recently argued that the rooting of our trees and the basal placement of viruses was artifactually induced by small genome (proteome) size. Here we show that these claims arise from misunderstanding and misinterpretations of cladistic methodology. Trees are reconstructed unrooted, and thus, their topologies cannot be distorted a posteriori by the rooting methodology. Tracing proteome size in trees and multidimensional views of evolutionary relationships as well as tests of leaf stability and exclusion/inclusion of taxa demonstrated that the smallest proteomes were neither attracted toward the root nor caused any topological distortions of the trees. Simulations confirmed that taxa clustering patterns were independent of proteome size and were determined by the presence of known evolutionary relatives in data matrices, highlighting the need for broader taxon sampling in phylogeny reconstruction. Instead, phylogenetic tracings of proteome size revealed a slowdown in innovation of the structural domain vocabulary and four regimes of allometric scaling that reflected a Heaps law. These regimes explained increasing economies of scale in the evolutionary growth and accretion of kernel proteome repertoires of viruses and cellular organisms that resemble growth of human languages with limited vocabulary sizes. 
Results reconcile dynamic and static views of domain frequency distributions that are consistent with the axiom of spatiotemporal continuity that is a tenet of evolutionary thinking.
Keywords: Heaps law; origin of viruses; phylogenomics; protein structure; proteome growth; tree of life
Year: 2017 PMID: 28690608 PMCID: PMC5481351 DOI: 10.3389/fmicb.2017.01178
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640
Fact-checking the narrative of Harish et al. (2016).
| “ | In their re-examination, Harish et al. ( |
| We “ | Outgroups indicate sister taxa external to the ingroup (the taxon set being studied), which are defined |
| “ | We polarize character transformations |
| “ | The search of tree space using maximum parsimony as an optimality criterion is defined by homology relationships manifesting in tree branches not graded compositional similarities. |
| “ | During phylogenetic searches, we first optimize character change in unrooted trees using the Wagner algorithm (Farris, |
| “ | A simple node distance ( |
| “ | They included only 16 eukaryal (not 17 as they claim), 17 archaeal, 17 bacterial, and 5-9 viral proteomes, which only represent ~16% of our taxa and likely missed representation of key phyla/groups in their trees (Nasir and Caetano-Anollés, |
| “ | Our exclusion and inclusion of taxa followed clear rationale. Exclusion of organisms engaged in obligate cellular endosymbiosis ensured integrity of definition of taxa. Inclusion of representatives of all viral groups portrayed the entire viral supergroup, which is unified by its parasitic lifestyle. |
| “ | The article referred by the authors (Harish et al., |
| “ | All 49 core-FSFs (i.e., V |
| “ | To compare, our genomic dataset included 5,080 proteomes of 3,460 viruses and 1,620 cells in comparison to their inclusion of only 9 viruses and 51 cells (their Figures 1, 2). Clearly, Harish et al. ( |
| “ | Harish et al. ( |
A somewhat similar table can be found in an eLetter exchange (Nasir and Caetano-Anollés, .
Figure 1. Comparing the indirect outgroup comparison method of rooting trees and the direct generality criterion. Rooting involves orienting an unrooted tree and pulling down a branch that will hold the ancestor of all taxa examined. In outgroup comparison, sister (outgroup) taxa external to the study group (ingroup taxa) are identified a priori as being of ancestral origin, and the branch that is closest to the ingroup is pulled down. This creates a new outgroup node for rooting the phylogeny. The outgroup node adds a character state vector that includes character state o, which is diagnostic of the outgroup and is assumed to be ancestral and absent in the ingroup. Once the outgroup is made ancestral, the tree is rooted and character state i is shared and derived, making it a synapomorphy. In Weston's generality criterion (Weston, 1988, 1994), the character state distributions in the phylogeny are used to polarize character transformations. Character state z is less widely distributed than y within the ingroup (it is present only in a minority subset of taxa) and is considered shared and derived. The figure was modified from Bryant (2001).
Figure 2. FSF use (occurrence) and reuse (abundance) are strongly correlated. Scatter log-log plots reveal a strong correlation between FSF use and FSF reuse for total (A) and universal ABEV FSF (B) sets for 368-taxon trees (Nasir and Caetano-Anollés, 2015). Viruses (266), Archaea (34), Bacteria (34), and Eukarya (34) are colored red, black, blue, and green, respectively. Each of these supergroups has its own power law regime that complies with a four-regime Heaps law of vocabulary growth. Individual regimes are indicated with numbers and follow V ~ N^β relationships, with V representing FSF vocabulary size (use) and N representing FSF database size (reuse) in proteomes. Their fits to linear regression models using ordinary least squares and the estimation of the Heaps exponent β are described in Figure S2.
Figure 3. Trees of proteomes are robust and insensitive to the effects of genome size but sensitive to holobiont relationships defining taxa. (A) The single most parsimonious tree (taxa = 368; characters = 442; length = 45,935, retention index = 0.83, g1 = −0.31) describing the evolution of 102 cellular organisms (34 each from Archaea, Bacteria, and Eukarya) and 266 viruses (at least 5 viruses sampled from each family/order) (Nasir and Caetano-Anollés, 2015). The smallest proteomes for cells (I. hospitalis and A. gossypii; black and green asterisks) and viruses (bat cycloviruses; red asterisk) are indicated. The names of taxa are not shown because they would not be visible. Instead, the positions of terminals were colored according to supergroup: green (Eukarya), blue (Bacteria), black (Archaea), and red (viruses). (B) A strict consensus of two most parsimonious trees (length = 46,781, retention index = 0.83, g1 = −19.81 and −19.82) built using phylogenomic data from the 368 proteomes of panel (A) plus the proteomes from the two extremely reduced R. prowazekii and N. equitans (gray circles and asterisks). While no major topological distortions are observed, the consensus tree loses resolution at its base.
Figure 4. Scatter plots describe the relationship between ABEV FSF use (A) and reuse (B) and node distance (nd) for the 368-taxon ToL (Nasir and Caetano-Anollés, 2015). Data points for different supergroups are colored green (Eukarya), blue (Bacteria), black (Archaea) and red (viruses). The black line describes the nature of the relationship, as determined by the Locally Weighted Regression Scatter Plot Smoothing (LOWESS) method, which obtains a smoothed curve by fitting successive regression functions (q = 0.1, i = 100). The plot reveals high scatter, especially toward smaller nd values and clustering of bacterial and eukaryal taxa in the same nd range despite harboring big differences in FSF use and reuse.
Figure 5. Testing the SGA artifact with the Siddall and Whiting (1999) approach. A single most parsimonious phylogenomic tree (a) describes the evolutionary relationships between four proteomes sampled each from viruses, Archaea, Bacteria, and Eukarya. Taxa are colored as previously described. Numbers on branches indicate BS support values (%). Single most parsimonious trees b through e were recovered after successive elimination of the smallest viral proteomes. TL, tree length; RI, retention index.
Figure 6. Cellular endosymbionts differ from free-living organisms and viruses in their FSF composition profiles. Annotation of FSF domains into one of the seven major functional categories (Metabolism, Information, Intracellular Processes, Extracellular Processes, Regulation, General, and Other) for archaeal, bacterial, eukaryal, and viral proteomes sampled in our study (Nasir and Caetano-Anollés, 2015) and for nine viral and three extremely reduced cellular proteomes included by Harish et al. (2016) in their reconstructions. Cand. Nasuia deltocephalinicola was not part of our reconstructions (encodes only 55 universal FSFs). Obligate endosymbionts or parasites often increase the repertoire of informational FSF domains, as showcased by Cand. Tremblaya included by Harish et al. (2016), and for 311 other known obligate and facultative parasitic organisms (Figure 3 in Nasir et al., 2011). Functional scheme as defined by Christine Vogel in the SUPERFAMILY database (http://supfam.org/SUPERFAMILY/function.html). Category Other includes proteins with either unknown or viral functions. General includes proteins involved in binding to small molecules, ligands, and lipids, and structural proteins. Numbers in parentheses indicate the total number of proteomes included in the FSF profile representation.
Figure 7. Obligate parasitic taxa destabilize leaves of trees. (A) Leaf stabilities (LS maximum) were calculated with RadCon (Thorley and Page, 2000) from 2,000 unrooted BS trees. LS values are ordered in the table (A) according to the most informative strict reduced consensus (SRC) tree (33.54 bits) out of a set of 5 SRC trees, which matches the strict component consensus (consensus efficiency = 0.555) derived from the unrooted trees. (B) LS values are visualized as violin plots. A violin plot is a combination of a box plot (the black rectangle with a white circle representing the group median) and a density plot on each side (yellow) reflecting the data distribution. The spread of LS values was calculated for the control set (C) and all possible permutations of free-living Acidobacterium capsulatum (A1–A5) and the obligate endoparasite R. prowazekii (R1–R5) with individual taxa of the corresponding bacterial superkingdoms (identified with numbers following taxon labels). The density trace is plotted symmetrically around the boxplots. White circles are group medians. Asterisks indicate distributions significantly different from control C (Wilcoxon rank sum test, two-tailed, P < 0.01).
Figure 8. Viruses stabilize leaves of trees. (A) A single most parsimonious phylogenomic tree (length = 13,004, retention index = 0.61) reconstructed from the genomic abundance census of 442 universal FSFs (432 parsimony informative characters) in 24 proteomes selected equally from Archaea (black), Bacteria (blue), and Eukarya (green) (the 8880 dataset). The most stable taxa in each superkingdom, as indicated by TII values (Table 2), are labeled with an asterisk. (B) A single most parsimonious phylogenomic tree (length = 12,033, retention index = 0.70) reconstructed from the genomic abundance census of 442 universal FSFs (428 parsimony informative characters) in 24 proteomes selected equally from viruses (red), Archaea (black), Bacteria (blue), and Eukarya (green) after replacing the most stable cellular taxa in (A) with viruses (the 6666 dataset). (C) A comparison of various LS statistics between the 8880 and 6666 BS tree datasets, as displayed by violin plots. None of the comparisons were statistically significant (Wilcoxon rank sum test, two-tailed). (D) Comparison of TII distribution for the 8880 dataset against the 6666 dataset, as displayed by violin plots. Inclusion of viral taxa significantly reduces overall tree instability. Asterisk indicates significant mean difference (Wilcoxon rank sum test, two-tailed, P < 0.01).
Inclusion of viral taxa decreases tree instability.
| TII (8880 dataset) | TII (6666 dataset) | Decrease in TII (%) |
| 339984.56 | 176059.76 | – |
| 275171.68 | 276206.40 | −0.004 |
| 389177.39 | 260465.25 | 33.07 |
| 384956.57 | 202131.54 | 47.49 |
| 348612.99 | 139006.79 | – |
| 296748.86 | 51657.62 | 82.59 |
| 252024.48 | 86941.27 | 65.50 |
| 351756.39 | 208054.12 | 40.85 |
| 367648.99 | 172381.65 | 53.11 |
| 245672.98 | 227995.40 | 7.20 |
| 244554.08 | 227292.68 | 7.06 |
| 244638.31 | 51716.23 | 78.86 |
| 216223.26 | 193300.26 | 10.60 |
| 218019.48 | 194245.12 | 10.90 |
| 221980.99 | 176092.12 | – |
| 278079.67 | 115236.83 | 58.56 |
| 239186.33 | 145974.04 | – |
| 151131.73 | 143402.03 | – |
| 454093.65 | 216056.64 | 52.42 |
| 271355.69 | 102020.45 | 62.40 |
| 208389.88 | 365974.42 | −75.62 |
| 151131.73 | 139006.79 | – |
| 350181.51 | 308662.81 | 11.86 |
| 254966.77 | 63294.76 | 75.18 |
Comparison of TII values for the “8880” BS tree dataset, which comprises 8 proteomes each from Archaea, Bacteria, and Eukarya, against the “6666” dataset, which includes 6 proteomes each from Archaea, Bacteria, Eukarya, and viruses. For the construction of the 6666 dataset, the two most stable taxa from each of Archaea, Bacteria, and Eukarya were replaced with viral proteomes (highlighted in bold) used by Harish et al. (.
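The third column of the table above can be checked with a short calculation. The sketch below assumes that this column reports the percent decrease of the second TII value relative to the first; that reading is an inference from the numbers, not something the table states. The function name `percent_decrease` is hypothetical.

```python
def percent_decrease(tii_first: float, tii_second: float) -> float:
    """Percent decrease of the second TII value relative to the first.

    Assumption: this is how the third column of the table is derived.
    A negative result means TII went up rather than down.
    """
    return 100.0 * (tii_first - tii_second) / tii_first


# Two rows copied from the table: one where TII drops, one where it rises.
rows = [(389177.39, 260465.25), (208389.88, 365974.42)]
changes = [round(percent_decrease(first, second), 2) for first, second in rows]
print(changes)
```

Rows marked with a dash in the table are left out because, per the caption, they correspond to taxa that were swapped between datasets, so a per-taxon change is not defined for them.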
Figure 9. The space of ages of FSF structural domains reveals supergroups as distinct clouds and global evolutionary tendencies of growth in proteomes. (A) An evolutionary principal coordinate (evoPCO) analysis plot portrays in its first three axes (85% variability explained) the evolutionary distances between cellular and viral proteomes [taxa = 368, characters = 442 universal FSFs, character states = occurrence * (1−nd)]. (B) The most important evoPCO component plotted against universal ABEV FSF reuse in logarithmic scale. The reconstructed proteome of the last common ancestor of modern cells was added as reference to infer the direction of evolutionary change (Kim and Caetano-Anollés, 2011). a, Lassa virus; b, Ancestor; c, Pandoravirus salinus; d, Pandoravirus dulcis; e, Acanthamoeba polyphaga mimivirus; f, Megavirus chilensis; g, Megavirus iba; h, Ignicoccus hospitalis; i, Haloarcula marismortui; j, Lactobacillus delbrueckii; k, Sorangium cellulosum; l, Ashbya gossypii; m, Emiliania huxleyi.
Scaling exponents summarizing the Heaps law for the four distinct regimes that correspond to viruses and the cellular superkingdoms (see also Figure S2).
| FSF set | Regime | Heaps exponent β | R² | F | P |
| ABEV | 1-Viruses | 0.81 | 0.94 | 4,243 | 2.2E-16 |
| | 2-Archaea | 0.36 | 0.83 | 160 | 5.5E-14 |
| | 3-Bacteria | 0.19 | 0.89 | 259 | 2.2E-16 |
| | 4-Eukarya | 0.03 | 0.49 | 32 | 2.8E-6 |
| Total | 1-Viruses | 0.81 | 0.94 | 3,874 | 2.2E-16 |
| | 2-Archaea | 0.37 | 0.88 | 233 | 3.0E-16 |
| | 3-Bacteria | 0.26 | 0.85 | 182 | 9.6E-15 |
| | 4-Eukarya | 0.12 | 0.76 | 108 | 9.3E-12 |
Linear relationships were tested with the F statistic and coefficients of determination (R.
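A Heaps-law regime of the form V ~ N^β becomes a straight line on log-log axes, so β can be estimated as the slope of an ordinary least squares fit of log V on log N. The sketch below is a minimal illustration of that idea and is not the authors' code: the function name `fit_heaps` and the data points are hypothetical, and the F statistic uses the standard simple-regression formula F = R²/(1 − R²) × (n − 2).

```python
import math


def fit_heaps(reuse, use):
    """Fit log10(V) = a + beta * log10(N) by ordinary least squares.

    reuse: FSF reuse values N (one per proteome)
    use:   FSF use values V (one per proteome)
    Returns (beta, r_squared, f_statistic).
    """
    x = [math.log10(n) for n in reuse]
    y = [math.log10(v) for v in use]
    k = len(x)
    mean_x, mean_y = sum(x) / k, sum(y) / k
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx                     # slope = Heaps exponent
    intercept = mean_y - beta * mean_x
    ss_res = sum((yi - (intercept + beta * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1.0 - ss_res / syy       # coefficient of determination
    # F statistic for a one-predictor regression with k - 2 residual d.f.
    f_stat = math.inf if r_squared >= 1.0 else r_squared / (1.0 - r_squared) * (k - 2)
    return beta, r_squared, f_stat


# Hypothetical proteomes: V = 3 * N**0.36 with small multiplicative noise.
reuse = [100, 200, 400, 800, 1600, 3200]
noise = [1.05, 0.96, 1.03, 0.98, 1.02, 0.97]
use = [3.0 * n ** 0.36 * e for n, e in zip(reuse, noise)]

beta, r_squared, f_stat = fit_heaps(reuse, use)
print(f"beta={beta:.2f} R2={r_squared:.2f} F={f_stat:.0f}")
```

Fitting each supergroup separately in this way would give one exponent per regime, which is how a table of per-regime β, R², and F values like the one above could be assembled.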