Sarah J Berkemer, Shawn E McGlynn.
Abstract
Comparative genomics and molecular phylogenetics are foundational for understanding biological evolution. Although many studies have been made with the aim of understanding the genomic contents of early life, uncertainty remains. A study by Weiss et al. (Weiss MC, Sousa FL, Mrnjavac N, Neukirchen S, Roettger M, Nelson-Sathi S, Martin WF. 2016. The physiology and habitat of the last universal common ancestor. Nat Microbiol. 1(9):16116.) identified a number of protein families in the last universal common ancestor of archaea and bacteria (LUCA) which were not found in previous works. Here, we report new research that suggests the clustering approaches used in this previous study undersampled protein families, resulting in incomplete phylogenetic trees which do not reflect protein family evolution. Phylogenetic analysis of protein families which include more sequence homologs rejects a simple LUCA hypothesis based on phylogenetic separation of the bacterial and archaeal domains for a majority of the previously identified LUCA proteins (∼82%). To supplement limitations of phylogenetic inference derived from incompletely populated orthologous groups and to test the hypothesis of a period of rapid evolution preceding the separation of the domains, we compared phylogenetic distances both within and between domains, for thousands of orthologous groups. We find a substantial diversity of interdomain versus intradomain branch lengths, even among protein families which exhibit a single domain separating branch and are thought to be associated with the LUCA. Additionally, phylogenetic trees with long interdomain branches relative to intradomain branches are enriched in information categories of protein families in comparison to those associated with metabolic functions. These results provide a new view of protein family evolution and temper claims about the phenotype and habitat of the LUCA.
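The comparison of phylogenetic distances within and between domains can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `intra_inter_ratio`, the input format (a dictionary of pairwise leaf distances keyed by `frozenset`), and the toy labels and distances are all assumptions made for illustration, and the statistic actually used in the study may be defined differently.

```python
from itertools import combinations
from statistics import mean


def intra_inter_ratio(distances, domain_of):
    """Return (mean_intra, mean_inter, mean_intra / mean_inter).

    distances: dict mapping frozenset({a, b}) -> pairwise tree distance
               (hypothetical input format chosen for this sketch)
    domain_of: dict mapping leaf label -> "archaea" or "bacteria"
    """
    intra, inter = [], []
    for a, b in combinations(sorted(domain_of), 2):
        d = distances[frozenset((a, b))]
        # Same domain -> intradomain pair, otherwise interdomain pair.
        (intra if domain_of[a] == domain_of[b] else inter).append(d)
    if not intra or not inter:
        raise ValueError("need at least one intradomain and one interdomain pair")
    m_intra, m_inter = mean(intra), mean(inter)
    return m_intra, m_inter, m_intra / m_inter


# Toy data (invented for illustration only).
domain_of = {"A1": "archaea", "A2": "archaea", "B1": "bacteria", "B2": "bacteria"}
distances = {
    frozenset(("A1", "A2")): 0.2,
    frozenset(("B1", "B2")): 0.4,
    frozenset(("A1", "B1")): 1.0,
    frozenset(("A1", "B2")): 1.2,
    frozenset(("A2", "B1")): 1.1,
    frozenset(("A2", "B2")): 1.3,
}
m_intra, m_inter, ratio = intra_inter_ratio(distances, domain_of)
```

In this toy case the interdomain pairs are far apart relative to the intradomain pairs, so the ratio is well below 1; a family whose archaeal and bacterial sequences are interleaved would give a ratio closer to 1.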
Keywords: LUCA; conserved orthologous groups of proteins; microbial physiology; orthology; progenote
Year: 2020 PMID: 32316034 PMCID: PMC7403611 DOI: 10.1093/molbev/msaa089
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
Table 1. Data Sets Analyzed in This Study.
| Name | Number of Protein Families | Number of Domain Separating Families | Underlying Data Set |
|---|---|---|---|
| | 286,514* | 355* | Clusters created by |
| | 293 | 52 | SSC composed of corresponding COG sequences |
| Conserved COGs | 80* | 50* | COGs, |
| Archaeal and bacterial COGs | 2,886 | 661 | COGs, |
Note: The number of domain separating groups and the corresponding number of domain separating families found in previous studies are marked by * as reported by Weiss et al. (2016) and Harris et al. (2003). SSC is the set of COGs associated with an SSC; these data and those for the total archaeal and bacterial COGs are based on work reported here. Archaeal and bacterial COGs are the set of COGs which include at least one protein sequence from each domain. For details on the construction of the data sets, see Materials and Methods and supplementary sections 2 and 3, Supplementary Material online.
Fig. 1. Comparison of tree topologies for two trees corresponding to the same protein family, but which contain different collections of sequences (SSC1665 on the left and COG1646 on the right). Bacterial sequences are shown in blue and archaeal sequences in red. Sequences with darker color shades appear in both trees; lighter color-shaded labels indicate genes that only appear in a single tree. Leaf labels are gene identifiers.
Fig. 2. Left: Violin plot depicting the number of sequences per group discussed in the text and table 1. The black bar in the yellow area indicates interquartile ranges. Right: The number of sequences per orthologous group plotted against the number of interdomain branches (splits) found when the sequences are subjected to phylogenetic analysis (log10 scales). Expanding SSCs (red squares) with the complete set of sequences of the corresponding COGs results in SSC (blue triangles).
Fig. 3. Relationship between the number of archaea:bacteria interdomain branches (splits) and the intra- to interdomain distance measure observed in phylogenetic trees drawn from the COGs. Top: Reconstructed trees for COG0048 (ribosomal protein S12), COG1110 (reverse gyrase), and COG1846 (DNA-binding transcriptional regulator, MarR) with the corresponding number of interdomain archaea:bacteria branches (splits, s) and distance measure values. The position of these trees is indicated in the bottom panel. The trees are drawn shading archaea in red and bacteria in blue, and the branch lengths are contained within the shaded region. Bottom: Interdomain split values for each COG plotted against the distance measure, where lower values represent phylogenetic trees with smaller average intra- to interdomain phylogenetic distances. The inset shows the distribution on normal scale, and the log (split) version is shown below. Symbols are slightly shifted to avoid overlays, and the differently shaped and colored symbols indicate subgroups as defined by Harris et al. (2003), Catchpole and Forterre, oxygen related COGs (Liu et al. 2018), CODH/ACS COGs, and further examples as indicated in the legend. Brackets on top of the log-plot summarize regions in the plot that correspond to 1, 2, … splits. Labeled symbols refer to the corresponding reconstructed phylogenetic trees shown in the top panel and in supplementary figure 5 and additional table 2, Supplementary Material online. COG0013 is the alanyl-tRNA synthetase, and COG1679 is a predicted Fe-S cluster binding aconitase.
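One way to get a feel for a split count is the sketch below. It uses Fitch parsimony, a standard technique swapped in here as a proxy; the source does not state how the authors counted interdomain branches, so this should not be read as their procedure. The function name `fitch_changes`, the nested-tuple tree encoding, and the toy trees are hypothetical, and the sketch assumes a strictly binary tree.

```python
def fitch_changes(tree, domain_of):
    """Minimum number of archaea/bacteria changes on a binary tree (Fitch parsimony).

    tree:      nested 2-tuples whose leaves are label strings (assumed encoding)
    domain_of: dict mapping leaf label -> "archaea" or "bacteria"
    """
    changes = 0

    def states(node):
        nonlocal changes
        if isinstance(node, str):
            return {domain_of[node]}
        left, right = (states(child) for child in node)
        common = left & right
        if common:
            return common
        # No shared state: at least one domain change is needed below this node.
        changes += 1
        return left | right

    states(tree)
    return changes


# Toy trees (invented for illustration only).
domain_of = {"A1": "archaea", "A2": "archaea", "B1": "bacteria", "B2": "bacteria"}
separated = (("A1", "A2"), ("B1", "B2"))    # domains form two clean clades
interleaved = (("A1", "B1"), ("A2", "B2"))  # domains are mixed
n_separated = fitch_changes(separated, domain_of)
n_interleaved = fitch_changes(interleaved, domain_of)
```

Under this proxy, a tree with a single domain separating branch scores 1, and interleaved archaeal and bacterial sequences raise the count.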