Andrew J Crapitto, Amy Campbell, A J Harris, Aaron D Goldman.
Abstract
The availability of genomic and proteomic data from across the tree of life has made it possible to infer features of the genome and proteome of the last universal common ancestor (LUCA). A number of studies have done so, each using a unique set of methods and bioinformatics databases. Here, we compare predictions across eight such studies and measure their agreement both with one another and with the consensus predictions among them. We find that some LUCA genome studies show strong agreement with the consensus predictions of the others, but that no individual study shares a high or even moderate degree of similarity with any other individual study. From these observations, we conclude that the consensus among studies provides a more accurate depiction of the core proteome of the LUCA and its functional repertoire. The set of consensus LUCA protein family predictions between all of these studies portrays a LUCA genome that, at minimum, encoded functions related to protein synthesis, amino acid metabolism, nucleotide metabolism, and the use of common, nucleotide-derived organic cofactors.
Keywords: LUCA; ancient genomes; ancient life; ancient metabolism; early evolution; last universal common ancestor
Year: 2022 PMID: 35784055 PMCID: PMC9165204 DOI: 10.1002/ece3.8930
Source DB: PubMed Journal: Ecol Evol ISSN: 2045-7758 Impact factor: 3.167
Individual LUCA genome studies and their correspondence with eggNOG clusters of homologous proteins
| LUCA genome study | Number of predictions and source database | Taxonomic genera surveyed | Number of eggNOG clusters corresponding to LUCA predictions | Number of eggNOG clusters corresponding to the consensus LUCA predictions |
|---|---|---|---|---|
| Harris et al. | 80 COGs | 31 | 110 | 81 |
| Mirkin et al. | 571 COGs | 26 | 848 | 304 |
| Delaye et al. | 114 Pfam domains | 20 | 1259 | 302 |
| Yang et al. | 66 SCOP folds superfamilies | 122 | 609 | 230 |
| Ranea et al. | 140 protein structures representing CATH superfamilies | 71 | 119 | 76 |
| Wang et al. | 165 SCOP folds | 91–153 | 2078 | 345 |
| Srinivasan and Morowitz | 206 Enzyme Commission codes (via KEGG) | 4 | 794 | 209 |
| Weiss et al. | 336 COGs (via GenBank) | 612 | 328 | 117 |
Genera were determined based on reconciling species reportedly sampled in each paper with the NCBI Taxonomy Database (Federhen, 2012).
Consensus LUCA predictions refers to eggNOG clusters associated with the predictions of four or more LUCA genome studies.
Mirkin et al. offer a range of possible predictions. We used the dataset derived from the authors' gain penalty of 1 (i.e., equal weights assigned to a gain and a loss), which is the focus of their own analysis.
The range of genera given for Wang et al. (91–153) reflects a discrepancy between the 185 genomes that the article reports sampling and the specific genomes that could be identified from the article's figures on taxonomic sampling. The figures showed 123 unique species, which we determined belonged to 91 unique genera based on taxonomic reconciliation. If each of the remaining 62 genomes belonged to a different additional genus, the total number of genera would be 153. If instead the missing genomes represented no additional genera, the number of unique genera would be 91.
Protein families in Weiss et al. were determined by sequence searches and were not provided in the article's supporting information. However, COGs were associated with 336 of the 355 protein families predicted by the authors to have been present in the LUCA.
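
The consensus definition in the footnotes above (eggNOG clusters associated with the predictions of four or more studies) can be illustrated with a minimal sketch. The study names, cluster identifiers, and the function name `consensus_clusters` are hypothetical and chosen only for illustration; this is not the authors' code.

```python
from collections import Counter


def consensus_clusters(predictions, threshold=4):
    """Return the clusters predicted by at least `threshold` studies.

    `predictions` maps a study name to the set of cluster IDs it predicts.
    """
    # Count each cluster once per study, regardless of duplicates in the input.
    counts = Counter(
        cluster for clusters in predictions.values() for cluster in set(clusters)
    )
    return {cluster for cluster, n in counts.items() if n >= threshold}


# Hypothetical toy data: five studies and four clusters.
predictions = {
    "StudyA": {"c1", "c2", "c3"},
    "StudyB": {"c1", "c2"},
    "StudyC": {"c1", "c3"},
    "StudyD": {"c1", "c2", "c4"},
    "StudyE": {"c2"},
}

print(sorted(consensus_clusters(predictions, threshold=4)))
```

In this toy input, only `c1` and `c2` are supported by four studies, so only they pass the threshold of four. Lowering the threshold to two also admits `c3`.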
Inter‐rater test scores for the predictions of individual LUCA genome studies (referred to by the last name of the primary author)
| Statistic | Consensus threshold | Harris | Mirkin | Delaye | Yang | Ranea | Wang | Srinivasan | Weiss |
|---|---|---|---|---|---|---|---|---|---|
| Percent agreement | 2 | 0.86 | 0.57 | 0.39 | 0.61 | 0.81 | 0.30 | 0.46 | 0.54 |
| | 3 | 0.73 | 0.36 | 0.24 | 0.38 | 0.63 | 0.17 | 0.26 | 0.35 |
| | 4 | 0.56 | 0.16 | 0.10 | 0.17 | 0.45 | 0.06 | 0.10 | 0.19 |
| | 5 | 0.29 | 0.05 | 0.03 | 0.06 | 0.21 | 0.02 | 0.02 | 0.07 |
| Krippendorff's α / Scott's π | 2 | 0.68 | 0.15 | −0.08 | 0.19 | 0.55 | −0.05 | −0.09 | −0.03 |
| | 3 | 0.65 | 0.24 | 0.13 | 0.24 | 0.52 | 0.09 | 0.12 | 0.18 |
| | 4 | 0.54 | 0.14 | 0.09 | 0.15 | 0.41 | 0.05 | 0.08 | 0.15 |
| | 5 | 0.29 | 0.05 | 0.03 | 0.06 | 0.20 | 0.02 | 0.02 | 0.07 |
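
For readers unfamiliar with the statistics named in the table above, the following sketch shows the generic textbook forms of percent agreement and Scott's π for two raters who label the same items. It assumes binary labels (1 = cluster predicted, 0 = not predicted) and is not necessarily the exact procedure used to produce the table. The example vectors are hypothetical.

```python
def percent_agreement(a, b):
    """Fraction of items on which two raters assign the same label."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def scotts_pi(a, b):
    """Scott's pi: agreement corrected for chance, using pooled label frequencies."""
    observed = percent_agreement(a, b)
    pooled = list(a) + list(b)
    # Chance agreement: sum of squared pooled proportions of each label.
    expected = sum((pooled.count(label) / len(pooled)) ** 2 for label in set(pooled))
    if expected == 1.0:
        raise ValueError("pi is undefined when only one label is ever used")
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings over ten clusters: one study versus a consensus set.
study = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
consensus = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

print(percent_agreement(study, consensus))
print(round(scotts_pi(study, consensus), 3))
```

Because π subtracts the agreement expected by chance, it can be negative when two raters agree less often than chance would predict, which is how negative entries such as those in the threshold-2 row can arise.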
Jaccard's similarity index between pairs of individual LUCA genome studies (referred to by the last name of the primary author)
| | Harris | Mirkin | Delaye | Yang | Ranea | Wang | Srinivasan | Weiss |
|---|---|---|---|---|---|---|---|---|
| Harris | 1 | 0.13 | 0.05 | 0.08 | 0.24 | 0.03 | 0.02 | 0.08 |
| Mirkin | | 1 | 0.22 | 0.17 | 0.09 | 0.19 | 0.20 | 0.12 |
| Delaye | | | 1 | 0.15 | 0.06 | 0.19 | 0.17 | 0.09 |
| Yang | | | | 1 | 0.06 | 0.27 | 0.11 | 0.09 |
| Ranea | | | | | 1 | 0.04 | 0.03 | 0.05 |
| Wang | | | | | | 1 | 0.16 | 0.08 |
| Srinivasan | | | | | | | 1 | 0.08 |
| Weiss | | | | | | | | 1 |
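
Jaccard's similarity index for two sets is the size of their intersection divided by the size of their union. A minimal sketch of the pairwise computation behind a matrix like the one above follows. The study names and cluster sets are hypothetical, and returning 0.0 for two empty sets is a convention chosen here, not something the source specifies.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Hypothetical cluster predictions for three studies.
studies = {
    "StudyA": {"c1", "c2", "c3"},
    "StudyB": {"c2", "c3", "c4", "c5"},
    "StudyC": {"c6"},
}

# Upper triangle of the pairwise similarity matrix.
for (name_a, set_a), (name_b, set_b) in combinations(studies.items(), 2):
    print(name_a, name_b, round(jaccard(set_a, set_b), 2))
```

The index ranges from 0 (no shared clusters) to 1 (identical sets), so the low off-diagonal values in the table indicate little overlap between any two individual studies.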
FIGURE 1. Statistically overrepresented Gene Ontology (GO) terms in the Molecular Function category associated with consensus LUCA eggNOG clusters. Arrows indicate parent-child GO term relationships. All GO terms shown in the figure have an associated p-value < 4 × 10⁻⁶, which is below the Bonferroni-corrected threshold of p ≤ 4.6 × 10⁻⁶. Note that the term "Molecular Function" itself is not statistically significant. The terminal GO terms in each branch of the parent-child network (i.e., the most specific GO terms) have been removed for clarity, but all statistically significant GO terms are available as Appendix S3.
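
As a brief illustration of the Bonferroni correction mentioned in the Figure 1 caption, the sketch below divides a family-wise significance level by the number of tests and keeps only the terms at or below that threshold. The significance level of 0.05, the term names, and the p-values are assumptions made for illustration; the source does not state the uncorrected level or the number of tests.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def significant_terms(p_values, alpha=0.05):
    """Return the terms whose p-value is at or below the corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return {term for term, p in p_values.items() if p <= threshold}


# Hypothetical p-values for four terms; alpha = 0.05 is an assumed level.
p_values = {"term_a": 1e-9, "term_b": 0.004, "term_c": 0.03, "term_d": 0.2}

print(bonferroni_threshold(0.05, len(p_values)))
print(sorted(significant_terms(p_values)))
```

Under the same assumed level of 0.05, a corrected threshold of 4.6 × 10⁻⁶ would correspond to roughly 10,900 tests.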
FIGURE 2. Ancestral enzyme functions determined from consensus LUCA eggNOG clusters mapped onto a universal metabolic network. The consensus LUCA enzyme functions are represented by 169 Enzyme Commission codes. The universal metabolic network and color-coding of metabolic categories are from the global Metabolic Pathways network (map 01100) from the KEGG database (Kanehisa et al., 2017; Ogata et al., 1999). "Metabolism of Other Amino Acids" is terminology that the KEGG database uses to indicate amino acids that are not included in proteins, such as D-amino acids.
Metabolic pathway coverage of enzyme functions associated with consensus LUCA eggNOG clusters
| KEGG pathway | Number of EC code matches | Percentage of total pathway EC codes (%) |
|---|---|---|
| Aminoacyl‐tRNA biosynthesis (map00970) | 18 | 58 |
| Valine, leucine, and isoleucine biosynthesis (map00290) | 5 | 36 |
| Drug metabolism—other enzymes (map00983) | 7 | 28 |
| Alanine, aspartate, and glutamate metabolism (map00250) | 13 | 26 |
| Pyrimidine metabolism (map00240) | 15 | 24 |
| Lysine biosynthesis (map00300) | 8 | 24 |
| Carbon fixation in photosynthetic organisms (map00710) | 6 | 24 |
| Phenylalanine, tyrosine and tryptophan biosynthesis (map00400) | 9 | 24 |
| Arginine biosynthesis (map00220) | 7 | 23 |
| Glycolysis/Gluconeogenesis (map00010) | 11 | 22 |
| Purine metabolism (map00230) | 18 | 17 |
| Histidine metabolism (map00340) | 6 | 15 |
| Glycine, serine, and threonine metabolism (map00260) | 10 | 14 |
| Nitrogen metabolism (map00910) | 5 | 13 |
| Cysteine and methionine metabolism (map00270) | 9 | 11 |
| Carbon fixation pathways in prokaryotes (map00720) | 5 | 10 |
| Pentose phosphate pathway (map00030) | 5 | 9 |
| Methane metabolism (map00680) | 8 | 9 |
| Pyruvate metabolism (map00620) | 6 | 8 |
| Arginine and proline metabolism (map00330) | 6 | 7 |
| Amino sugar and nucleotide sugar metabolism (map00520) | 6 | 5 |
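
The two numeric columns in the table above are related by a simple ratio: the number of a pathway's EC codes matched by the consensus set, divided by the total number of EC codes in that pathway. The sketch below shows this calculation under that reading. The EC codes and the function name `pathway_coverage` are hypothetical placeholders, not data from the study or from KEGG.

```python
def pathway_coverage(consensus_ecs, pathway_ecs):
    """Return (number of matches, percent of pathway EC codes matched)."""
    pathway = set(pathway_ecs)
    if not pathway:
        raise ValueError("pathway must contain at least one EC code")
    matches = pathway & set(consensus_ecs)
    return len(matches), round(100 * len(matches) / len(pathway))


# Hypothetical EC codes, used purely as placeholders.
consensus_ecs = {"1.1.1.1", "2.2.2.2", "3.3.3.3", "9.9.9.9"}
toy_pathway = {"1.1.1.1", "2.2.2.2", "3.3.3.3", "4.4.4.4"}

matches, percent = pathway_coverage(consensus_ecs, toy_pathway)
print(matches, percent)
```

Read in reverse, the same ratio gives a rough sense of pathway size: 18 matches at 58% coverage would imply a pathway of about 31 EC codes.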