Falk S P Nagies, Julia Brueckner, Fernando D K Tria, William F Martin.
Abstract
Lateral gene transfer (LGT) has impacted prokaryotic genome evolution, yet the extent to which LGT compromises vertical evolution across individual genes and individual phyla is unknown, as are the factors that govern LGT frequency across genes. Estimating LGT frequency from tree comparisons is problematic when thousands of genomes are compared, because LGT becomes difficult to distinguish from phylogenetic artefacts. Here we report quantitative estimates for verticality across all genes and genomes, leveraging a well-known property of phylogenetic inference: phylogeny works best at the tips of trees. From terminal (tip) phylum-level relationships, we calculate the verticality for 19,050,992 genes from 101,422 clusters in 5,655 prokaryotic genomes and rank them by their verticality. Among functional classes, translation, followed by nucleotide and cofactor biosynthesis, and DNA replication and repair are the most vertical. The most vertically evolving lineages are those rich in ecological specialists such as Acidithiobacilli, Chlamydiae, Chlorobi and Methanococcales. Lineages most affected by LGT are the α-, β-, γ-, and δ-classes of Proteobacteria and the Firmicutes. The 2,587 eukaryotic clusters in our sample having prokaryotic homologues fail to reject eukaryotic monophyly using the likelihood ratio test. The low verticality of α-proteobacterial and cyanobacterial genomes requires only three partners (an archaeal host, a mitochondrial symbiont, and a plastid ancestor), each with mosaic chromosomes, to directly account for the prokaryotic origin of eukaryotic genes. In terms of phylogeny, the 100 most vertically evolving prokaryotic genes are neither representative nor predictive for the remaining 97% of an average genome.
In search of factors that govern LGT frequency, we find a simple but natural principle: Verticality correlates strongly with gene distribution density, LGT being least likely for intruding genes that must replace a preexisting homologue in recipient chromosomes. LGT is most likely for novel genetic material, intruding genes that encounter no competing copy.
Year: 2020 PMID: 33137105 PMCID: PMC7660906 DOI: 10.1371/journal.pgen.1009200
Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 5.917
Fig 1. Comparison of estimated verticality and number of genomes in a protein cluster for (a) all clusters (n = 101,422) and (b) all conserved clusters (average branch length ≤ 0.1; n = 8,547). Unrooted trees were analyzed if at least two taxonomic groups were present. Verticality was calculated as the sum of monophyletic taxonomic groups in a cluster, adjusted by the fraction of each taxonomic group represented in the cluster. The procedure for determining verticality on the basis of an example is shown in . This value correlates with the number of genomes, an approximation of universality, and the correlation is even more apparent when clusters of high evolutionary rate are filtered out (a: p < 10^−300, Pearson's R² = 0.726; b: p < 10^−300, R² = 0.829). In both plots, clusters of special interest are marked: the eukaryotic-prokaryotic clusters (EPCs) are highlighted in red, and the clusters that correspond to a gene from the mitochondrial genome of Reclinomonas americana [45] are displayed as blue triangles along the abscissa of the plot and in the graph. For the latter, the gene identifier is noted above each plot. Ribosomal proteins are indicated by the black diamond on the right of each plot and in the graph [6]. Notably, the ribosomal protein clusters show a steep gradient of verticality among conserved clusters with similarly wide distribution.
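The verticality score described in the caption can be sketched in a few lines of Python. This is a minimal illustration under an assumption: "adjusted by the fraction of a taxonomic group represented" is read here as weighting each monophyletic group by the share of its genomes present in the cluster. The class `GroupInCluster`, its field names, and the group names are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class GroupInCluster:
    name: str
    n_in_cluster: int   # genomes of this group present in the cluster
    n_in_dataset: int   # genomes of this group in the whole dataset
    monophyletic: bool  # group is separated by one split in the unrooted tree


def verticality(groups):
    """Sum, over monophyletic groups only, of the fraction of each group present.

    Assumed reading of the caption, not a confirmed formula.
    """
    return sum(g.n_in_cluster / g.n_in_dataset for g in groups if g.monophyletic)


# Hypothetical cluster with three taxonomic groups.
example = [
    GroupInCluster("GroupA", 10, 10, True),   # fully represented, monophyletic
    GroupInCluster("GroupB", 5, 20, True),    # a quarter represented, monophyletic
    GroupInCluster("GroupC", 8, 8, False),    # not monophyletic, contributes nothing
]
print(verticality(example))  # 1.0 + 0.25 = 1.25
```

Under this reading, a cluster in which every group is fully represented and monophyletic scores the number of groups, which is consistent with integer scores such as 24.00 for the most vertical, near-universal ribosomal proteins.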
Fig 4. Identification of the prokaryotic sister group to the eukaryotes in 2,575 eukaryotic-prokaryotic unrooted gene trees (EPC).
(a) shows the average clade sizes for eukaryotes, the sister group to eukaryotes and the outgroup in the analyzed trees for (right) the 229 trees with only plastid-derived lineages and (left) the 456 EPCs containing all taxa except photosynthetic lineages. (b) details the list of bacterial (top) and archaeal (bottom) phyla occurring in the trees with only plant lineages (right) and all other trees (left) that were filtered for conservation (average branch length of the tree ≤ 0.1). Archaeal and bacterial phyla with fewer than 5 representative species in the dataset were collapsed into ‘other archaea’ and ‘other bacteria’ groups. Pmono refers to the proportion of trees with a branch (split) separating the species of the respective phylum from all the others in the tree; Snon refers to the number of occurrences of the phylum only in the outgroup clade; Smix refers to the number of occurrences of the phylum as a mixed sister (more than one phylum in the clade); Spure refers to the number of occurrences of the phylum as a pure sister (the single phylum in the clade); Sp,avg shows the average size of the sister clade when the phylum occurs as a pure sister clade. Ntrees shows the number of occurrences of the phyla across the trees and Ngen indicates the number of species in each taxon included in the complete dataset.
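The sister-group counts defined in the caption (Spure, Smix, Snon) can be illustrated with a small tally. This is a sketch only: it assumes each tree has already been reduced to the list of phyla in the clade sister to eukaryotes and the list of phyla in the outgroup, and the function names and example trees are hypothetical.

```python
from collections import Counter


def classify(sister_phyla, outgroup_phyla):
    """Classify each phylum of one tree as 'pure', 'mixed' or 'non'.

    pure  : the phylum is the only phylum in the sister clade
    mixed : the phylum shares the sister clade with other phyla
    non   : the phylum occurs only in the outgroup clade
    """
    sister = set(sister_phyla)
    result = {}
    for phylum in sister:
        result[phylum] = "pure" if len(sister) == 1 else "mixed"
    for phylum in set(outgroup_phyla) - sister:
        result[phylum] = "non"
    return result


def tally(trees):
    """Count, per phylum, how often it is a pure sister, mixed sister or outgroup-only."""
    counts = {"pure": Counter(), "mixed": Counter(), "non": Counter()}
    for sister_phyla, outgroup_phyla in trees:
        for phylum, kind in classify(sister_phyla, outgroup_phyla).items():
            counts[kind][phylum] += 1
    return counts


# Three hypothetical trees: (phyla in sister clade, phyla in outgroup).
trees = [
    (["Alphaproteobacteria"], ["Cyanobacteria", "Bacilli"]),
    (["Alphaproteobacteria", "Cyanobacteria"], ["Bacilli"]),
    (["Cyanobacteria"], ["Alphaproteobacteria"]),
]
counts = tally(trees)
print(counts["pure"]["Alphaproteobacteria"])   # 1
print(counts["mixed"]["Cyanobacteria"])        # 1
print(counts["non"]["Bacilli"])                # 2
```

Identifying the sister clade in an unrooted tree is the harder step and is not shown here; the sketch covers only the bookkeeping once that clade is known.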
Maximum likelihood trees from 19,050,992 protein sequences from 5,443 bacterial and 212 archaeal species were calculated for clusters obtained by MCL, yielding 101,422 trees with at least four sequences and two taxonomic groups present. Each of the 101,422 trees was assigned the protein label that occurred most often among its NCBI sequence headers. In the left panel, all trees are annotated and sorted according to their verticality score for the genes (V).
The number of organisms in the respective cluster is stated as Nspec. In the right panel, the same values are stated only for conserved protein families, defined by an average branch length ≤ 0.1.
| V | Protein family (all 101,422 protein families) | Nspec | V | Protein family (the 8,547 most conserved protein families) | Nspec | |
|---|---|---|---|---|---|---|
| 24.00 | 30S ribosomal protein S10 | 5,646 | 24.00 | 30S ribosomal protein S10 | 5,646 | |
| 23.00 | 30S ribosomal protein S11 | 5,652 | 23.00 | 30S ribosomal protein S11 | 5,652 | |
| 22.30 | Asp/glu–tRNA amidotransferase subunit B | 4,269 | 22.30 | Asp/glu–tRNA amidotransferase subunit B | 4,269 | |
| 22.00 | 50S ribosomal protein L1 | 5,650 | 22.00 | 50S ribosomal protein L1 | 5,650 | |
| 21.89 | Alanine–tRNA ligase | 5,598 | 21.89 | Alanine–tRNA ligase | 5,598 | |
| 21.57 | 50S ribosomal protein L2 | 5,616 | 21.57 | 50S ribosomal protein L2 | 5,616 | |
| 20.93 | Sec family type I SRP | 5,571 | 20.93 | Sec family type I SRP | 5,571 | |
| 20.88 | 30S ribosomal protein S5 | 5,653 | 20.88 | 30S ribosomal protein S5 | 5,653 | |
| 19.82 | Translation elongation factor G | 5,624 | 19.82 | Translation elongation factor G | 5,624 | |
| 19.55 | DNA-directed RNA polymerase subunit beta | 5,300 | 19.55 | DNA-directed RNA polymerase subunit beta | 5,300 | |
| 19.32 | tRNA methylthiotransferase MiaB | 4,764 | 18.86 | Translation initiation factor IF-2 | 5,379 | |
| 18.94 | Signal recognition particle-docking protein FtsY | 5,525 | 18.80 | Histidine–tRNA ligase | 5,627 | |
| 18.86 | Translation initiation factor IF-2 | 5,379 | 18.76 | DNA gyrase subunit A | 5,467 | |
| 18.80 | Histidine–tRNA ligase | 5,627 | 18.00 | 50S ribosomal protein L14 | 5,655 | |
| 18.76 | DNA gyrase subunit A | 5,467 | 18.00 | Methionine–tRNA ligase | 5,587 | |
| 18.03 | tRNA pseudouridine synthase B | 5,434 | 17.98 | Excinuclease ABC subunit B | 5,411 | |
| 18.00 | 50S ribosomal protein L14 | 5,655 | 17.96 | DNA-directed RNA polymerase subunit alpha | 5,431 | |
| 18.00 | Methionine–tRNA ligase | 5,587 | 17.93 | CTP synthetase | 5,433 | |
| 17.98 | Excinuclease ABC subunit B | 5,411 | 17.88 | 30S ribosomal protein S8 | 5,653 | |
| 17.96 | DNA-directed RNA polymerase subunit alpha | 5,431 | 17.85 | Preprotein translocase subunit SecA | 5,395 | |
| 0 | Heavy metal-responsive transcriptional regulator | 2,392 | 0 | SDH cyt b556 large subunit | 2,344 | |
| 0 | SDH cyt b556 large subunit | 2,344 | 0 | RnfH family protein | 2,004 | |
| 0 | Anaerobic ribo.-triP | 2,078 | 0 | Hypothetical protein | 1,964 | |
| 0 | Thiol:disulfide interchange protein DsbC | 1,952 | 0 | Amino acid ABC transporter permease | 1,666 | |
| 0 | RnfH family protein | 2,004 | 0 | Succinate dehydrogenase, HM anchor protein | 1,800 | |
| 0 | Disulfide bond formation protein B 1 | 1,808 | 0 | LysR family transcriptional regulator | 1,267 | |
| 0 | Hypothetical protein | 1,964 | 0 | Hypothetical protein | 1,688 | |
| 0 | Amino acid ABC transporter permease | 1,666 | 0 | Maleylacetoacetate isomerase | 1,430 | |
| 0 | LysR family transcriptional regulator | 1,431 | 0 | Sigma-E factor regulatory protein RseB | 1,599 | |
| 0 | Succinate dehydrogenase, HM | 1,800 | 0 | tRNA synthase TrmP | 1,567 | |
| 0 | LysR family transcriptional regulator | 1,267 | 0 | tRNA 5-methoxyuridine(34) synthase CmoB | 1,525 | |
| 0 | Hypothetical protein | 1,688 | 0 | Chemotaxis phosphatase CheZ family protein | 1,483 | |
| 0 | Maleylacetoacetate isomerase | 1,430 | 0 | Hypothetical protein | 1,505 | |
| 0 | Sigma-E factor regulatory protein RseB | 1,599 | 0 | Hypothetical protein | 1,345 | |
| 0 | tRNA synthase TrmP | 1,567 | 0 | Outer membrane protein assembly protein | 1,301 | |
| 0 | tRNA 5-methoxyuridine(34) synthase CmoB | 1,525 | 0 | Deoxyribonuclease I | 1,269 | |
| 0 | Chemotaxis phosphatase CheZ family protein | 1,483 | 0 | Formate dehydrogenase accessory protein FdhE | 1,241 | |
| 0 | Hypothetical protein | 1,505 | 0 | Flagellar export protein FliJ | 1,208 | |
| 0 | Hypothetical protein | 1,345 | 0 | Hypothetical protein | 1,200 | |
| 0 | Hypothetical protein | 1,325 | 0 | Hypothetical protein | 1,179 | |
Notes
a SRP protein: general secretory pathway protein signal recognition particle protein
b ribo.-triP: ribonucleoside-triphosphate
c HM: hydrophobic membrane
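Two bookkeeping steps from the table caption above, the majority-vote protein label and the conservation filter, can be sketched as follows. This is a minimal illustration, assuming the sequence headers have already been reduced to plain protein names and that branch lengths are available as a list of numbers per tree; the function names are hypothetical.

```python
from collections import Counter


def majority_label(headers):
    """Return the protein name that occurs most often among a cluster's headers.

    Ties are resolved in favour of the name seen first (Counter.most_common order).
    """
    return Counter(headers).most_common(1)[0][0]


def is_conserved(branch_lengths, threshold=0.1):
    """True if the average branch length of the tree is at most the threshold."""
    return sum(branch_lengths) / len(branch_lengths) <= threshold


headers = [
    "30S ribosomal protein S10",
    "30S ribosomal protein S10",
    "hypothetical protein",
]
print(majority_label(headers))           # 30S ribosomal protein S10
print(is_conserved([0.05, 0.10, 0.12]))  # True  (mean 0.09)
print(is_conserved([0.20, 0.30]))        # False (mean 0.25)
```

A majority vote explains why several rows in the table carry the label "Hypothetical protein": in those clusters, uninformative headers simply outnumber any specific annotation.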
Assignment of KEGG level B functional annotations.
In the left panel, all prokaryotic maximum likelihood trees are annotated and sorted according to their average verticality score (Vavg). The number of clusters employed for this analysis is indicated (Nclust). The same procedure was performed for the right panel using only conserved protein families, defined by an average branch length ≤ 0.1.
| Function (all 101,422 protein families) | Vavg | Nclust | Function (the 8,547 most conserved protein families) | Vavg | Nclust |
|---|---|---|---|---|---|
| Translation | 5.31 | 2,428 | Translation | 14.82 | 284 |
| Metabolism of cofactors and vitamins | 4.86 | 2,443 | Nucleotide metabolism | 10.21 | 160 |
| Nucleotide metabolism | 4.28 | 1,419 | Metabolism of cofactors and vitamins | 7.95 | 199 |
| Amino acid metabolism | 3.83 | 3,777 | Carbohydrate metabolism | 7.23 | 534 |
| Carbohydrate metabolism | 3.63 | 4,836 | Replication and repair | 7.11 | 187 |
| Biosynthesis of other secondary metabolites | 3.62 | 507 | Energy metabolism | 7.07 | 208 |
| Glycan biosynthesis and metabolism | 3.42 | 3,349 | Amino acid metabolism | 7.06 | 438 |
| Metabolism | 3.31 | 4,260 | Folding, sorting and degradation | 6.77 | 118 |
| Energy metabolism | 3.28 | 2,705 | Metabolism of other amino acids | 5.87 | 81 |
| Xenobiotics biodegradation and metabolism | 3.26 | 1,606 | Metabolism | 5.67 | 337 |
| Replication and repair | 3.14 | 3,502 | Enzyme families | 5.53 | 164 |
| Transport and catabolism | 3.02 | 2,843 | Biosynthesis of other secondary metabolites | 5.50 | 25 |
| Metabolism of terpenoids and polyketides | 2.97 | 1,473 | Xenobiotics biodegradation and metabolism | 5.36 | 103 |
| Metabolism of other amino acids | 2.95 | 745 | Glycan biosynthesis and metabolism | 5.33 | 158 |
| Transcription | 2.84 | 7,245 | Signal transduction | 5.10 | 240 |
| Folding, sorting and degradation | 2.79 | 1,873 | Membrane transport | 4.69 | 1,431 |
| Lipid metabolism | 2.65 | 2,864 | Cell motility | 4.37 | 124 |
| Enzyme families | 2.59 | 3,735 | Metabolism of terpenoids and polyketides | 4.31 | 85 |
| Cellular processes and signaling | 2.49 | 3,905 | Transport and catabolism | 4.31 | 143 |
| Signal transduction | 2.48 | 6,712 | Lipid metabolism | 4.20 | 215 |
| Membrane transport | 2.46 | 19,992 | Transcription | 4.12 | 409 |
| Genetic information processing | 2.31 | 4,838 | Cellular processes and signaling | 3.75 | 257 |
| Cellular community prokaryotes | 2.21 | 3,986 | Cellular community prokaryotes | 3.55 | 172 |
| Drug resistance | 2.15 | 1,754 | Genetic information processing | 3.23 | 269 |
| Cell motility | 1.94 | 3,620 | Drug resistance | 3.10 | 88 |
| Poorly characterized | 1.41 | 178,665 | Poorly characterized | 1.68 | 2,970 |
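The per-category averages in the table above amount to a group-by over annotated clusters. The sketch below assumes the input is one (category, verticality) pair per annotation; how clusters with several KEGG annotations were counted is not stated here, so that choice is an assumption, and the function name and example values are hypothetical.

```python
from collections import defaultdict


def average_verticality_by_category(pairs):
    """Return rows (category, V_avg, N_clust) sorted by V_avg, highest first."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, v in pairs:
        sums[category] += v
        counts[category] += 1
    rows = [(c, sums[c] / counts[c], counts[c]) for c in sums]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical annotated clusters.
pairs = [
    ("Translation", 24.0),
    ("Translation", 22.0),
    ("Cell motility", 0.0),
    ("Cell motility", 4.0),
]
for category, v_avg, n_clust in average_verticality_by_category(pairs):
    print(f"{category}: Vavg = {v_avg:.2f}, Nclust = {n_clust}")
# Translation: Vavg = 23.00, Nclust = 2
# Cell motility: Vavg = 2.00, Nclust = 2
```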
Verticality of prokaryotic taxa across protein families with at least two taxonomic groups.
The list of bacterial (top) and archaeal (bottom) taxa occurring in all trees (left) and only in trees that were filtered for conservation (average branch length in the tree ≤ 0.1) (right). Archaeal and bacterial phyla with fewer than 5 representative species in the dataset were excluded. Pmono refers to the proportion of trees in which the taxon is monophyletic. Nmono indicates the number of trees in which the taxon is monophyletic, whereas Ntrees shows the number of occurrences of the taxon in the respective dataset.
| Taxon | Pmono (all trees, 101,422) | Nmono | Ntrees | Pmono (conserved trees, 8,547) | Nmono | Ntrees | | |
|---|---|---|---|---|---|---|---|---|
| Acidithiobacillia | 0.81 | 1,677 | 2,067 | 0.91 | 629 | 688 | ||
| Chlamydiae | 0.74 | 1,378 | 1,867 | 0.75 | 482 | 642 | ||
| Tenericutes | 0.68 | 2,770 | 4,076 | 0.50 | 391 | 776 | ||
| Actinobacteria | 0.60 | 30,050 | 49,958 | 0.37 | 1,214 | 3,293 | ||
| Bacilli | 0.59 | 24,365 | 41,526 | 0.25 | 1,017 | 3,997 | ||
| Chlorobi | 0.59 | 1,728 | 2,946 | 0.80 | 494 | 619 | ||
| Thermotogae | 0.57 | 2,252 | 3,937 | 0.65 | 495 | 764 | ||
| Cyanobacteria | 0.56 | 8,655 | 15,446 | 0.64 | 843 | 1,319 | ||
| Deinococcus-Thermus | 0.54 | 3,156 | 5,858 | 0.63 | 705 | 1,113 | ||
| Synergistetes | 0.53 | 1,001 | 1,872 | 0.70 | 484 | 692 | ||
| Epsilonproteobacteria | 0.52 | 3,815 | 7,270 | 0.37 | 513 | 1,397 | ||
| Fusobacteria | 0.51 | 1,805 | 3,516 | 0.60 | 717 | 1,194 | ||
| Spirochaetes | 0.50 | 5,063 | 10,130 | 0.44 | 683 | 1,564 | ||
| Bacteroidetes | 0.49 | 11,677 | 23,755 | 0.40 | 759 | 1,879 | ||
| Gammaproteobacteria | 0.48 | 29,439 | 61,803 | 0.18 | 1,078 | 5,874 | ||
| Negativicutes | 0.45 | 1,892 | 4,170 | 0.59 | 804 | 1,371 | ||
| Nitrospirae | 0.43 | 1,377 | 3,180 | 0.47 | 359 | 762 | ||
| Alphaproteobacteria | 0.43 | 18,086 | 41,953 | 0.35 | 1,312 | 3,735 | ||
| Aquificae | 0.43 | 1,210 | 2,826 | 0.43 | 290 | 672 | ||
| Planctomycetes | 0.40 | 1,755 | 4,399 | 0.55 | 533 | 961 | ||
| Chloroflexi | 0.39 | 2,349 | 6,003 | 0.46 | 521 | 1,141 | ||
| Acidobacteria | 0.38 | 1,789 | 4,666 | 0.58 | 625 | 1,077 | ||
| Betaproteobacteria | 0.38 | 14,203 | 37,225 | 0.34 | 1,601 | 4,775 | ||
| Deltaproteobacteria | 0.37 | 8,512 | 23,013 | 0.38 | 1,005 | 2,618 | ||
| Verrucomicrobia | 0.36 | 1,146 | 3,152 | 0.56 | 601 | 1,067 | ||
| Clostridia | 0.32 | 7,481 | 23,638 | 0.34 | 1,084 | 3,196 | ||
| Erysipelotrichia | 0.17 | 344 | 2,001 | 0.43 | 451 | 1,058 | ||
| Thermococcales | 0.73 | 2,482 | 3,380 | 0.79 | 271 | 341 | ||
| Methanococcales | 0.73 | 1,612 | 2,220 | 0.83 | 236 | 283 | ||
| Methanobacteriales | 0.68 | 1,949 | 2,857 | 0.79 | 282 | 356 | ||
| Sulfolobales | 0.66 | 2,223 | 3,387 | 0.75 | 280 | 374 | ||
| Archaeoglobales | 0.62 | 1,415 | 2,286 | 0.79 | 252 | 318 | ||
| Methanomicrobiales | 0.60 | 1,616 | 2,693 | 0.74 | 301 | 406 | ||
| Methanosarcinales | 0.60 | 3,392 | 5,654 | 0.63 | 318 | 503 | ||
| Thermoproteales | 0.55 | 1,537 | 2,775 | 0.61 | 257 | 420 | ||
| Thermoplasmatales | 0.49 | 662 | 1,364 | 0.58 | 212 | 366 | ||
| Desulfurococcales | 0.41 | 852 | 2,072 | 0.44 | 130 | 298 | ||
| Natrialbales | 0.32 | 1,459 | 4,503 | 0.42 | 246 | 588 | ||
| Haloferacales | 0.27 | 980 | 3,593 | 0.40 | 205 | 513 | ||
| Halobacteriales | 0.20 | 1,024 | 5,057 | 0.30 | 178 | 591 | ||
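The relationship between the columns above is Pmono = Nmono / Ntrees, where a taxon counts as monophyletic in an unrooted tree if a single branch (split) separates its members from all other leaves. The sketch below illustrates both steps. It assumes each tree is given as a collection of splits, each represented by the set of leaves on one side of a branch; that representation and the function names are illustrative assumptions, while the three numeric rows are taken from the table.

```python
def is_monophyletic(taxon_leaves, all_leaves, splits):
    """True if one split separates exactly the taxon's leaves from all others."""
    taxon = frozenset(taxon_leaves)
    rest = frozenset(all_leaves) - taxon
    return any(frozenset(side) in (taxon, rest) for side in splits)


def p_mono(n_mono, n_trees):
    """Proportion of trees in which the taxon is monophyletic."""
    return n_mono / n_trees


# Toy unrooted tree ((a1,a2),(b1,b2),c1): two internal splits.
leaves = {"a1", "a2", "b1", "b2", "c1"}
splits = [{"a1", "a2"}, {"b1", "b2"}]
print(is_monophyletic({"a1", "a2"}, leaves, splits))  # True
print(is_monophyletic({"a1", "b1"}, leaves, splits))  # False

# Rows from the table: (Nmono, Ntrees) for all trees.
rows = {
    "Acidithiobacillia": (1677, 2067),
    "Gammaproteobacteria": (29439, 61803),
    "Halobacteriales": (1024, 5057),
}
for taxon, (n_mono, n_trees) in rows.items():
    print(f"{taxon}: Pmono = {p_mono(n_mono, n_trees):.2f}")
# Acidithiobacillia: Pmono = 0.81
# Gammaproteobacteria: Pmono = 0.48
# Halobacteriales: Pmono = 0.20
```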