Marshall Bern, David Goldberg.
Abstract
BACKGROUND: Although there are now about 200 complete bacterial genomes in GenBank, deep bacterial phylogeny remains a difficult problem, due to confounding horizontal gene transfers and other phylogenetic "noise". Previous methods have relied primarily upon biological intuition or manual curation for choosing genomic sequences unlikely to be horizontally transferred, and have given inconsistent phylogenies with poor bootstrap confidence.
Year: 2005 PMID: 15927057 PMCID: PMC1175084 DOI: 10.1186/1471-2148-5-34
Source DB: PubMed Journal: BMC Evol Biol ISSN: 1471-2148 Impact factor: 3.260
Validation of our methodology on 10 deep phylogeny problems. Organism abbreviations are shown in Table 3, and the accepted clades are shown with parentheses. The column labeled "# Clades" gives the number of accepted clades to be found. The column labeled "# Genes" gives the number of genes used. The Trees column gives the number of gene trees that find all the accepted clades; results for representative proteins are on the left, and results for randomly picked ubiquitous proteins are on the right. For each gene, the most conserved 300-residue sequence was used, and randomly picked proteins were matched to the representative proteins in overall conservation level. Consensus gives the number of accepted clades found over all gene trees; an asterisk indicates that the consensus tree (computed using CONSENSE from the PHYLIP package [52]) finds all the accepted clades. Concatenation gives the number of clades found in 100 bootstraps from a concatenated alignment of all genes; an asterisk here indicates the success of the consensus over bootstrap trees. In problem 6, for example, there are 5 accepted clades, 8 single-gene trees, and 100 bootstrap trees, so a perfect "Consensus" score would be 40, and a perfect "Concatenation" score would be 500.
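| Organisms | # Clades | # Genes | Trees (representative) | Trees (random) | Consensus (representative) | Consensus (random) | Concatenation (representative) | Concatenation (random) |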
| 1. (Borr, Trep) (Chlor, Bac) (Campy, Bruc) | 3 | 8 | 8* | 2 | 24* | 12 | 299 | 112 |
| 2. (Neiss, Rals) (Xyl, Haem) (Rick, Meso) | 3 | 8 | 5* | 3 | 21* | 19 | 247 | 207 |
| 3. (Clost, Lacto) (Mycob, Bifid) (Campy, Rick) | 3 | 8 | 6* | 4* | 18* | 18* | 294 | 283 |
| 4. (Buch, Rick) (Mycob, Bifid) (Staph, Mycop) | 3 | 8 | 2 | 1* | 13 | 15* | 235 | 297 |
| 5. (Urea, Mycop) (Strep, Lacto) (Staph, List) | 3 | 8 | 8* | 5* | 24* | 21* | 300 | 300 |
| 6. (Syn, Pro) (Rick, Buch) (Chlor, Bac) (Staph, Strep) (Borr, Trep) | 5 | 8 | 7* | 2* | 37* | 26* | 481 | 472 |
| 7. ((Rick, Bruc) ((Vib, Esch, Haem), Neiss) (Heli, Campy)) (Syn, Pro) (Clost, Staph) (Borr, Trep) | 8 | 17 | 3* | 3 | 129* | 108 | 762 | 741 |
| 8. ((Caul, Meso), Esch) (Chlor, Bac) (Pro, Nos) | 4 | 8 | 7* | 3* | 30* | 27* | 400 | 398 |
| 9. ((Geo, Desulf), (Wol, Campy), (Caul, Rick)) (Borr, Lep) (Chlor, Bac) | 6 | 8 | 1 | 2 | 31* | 32 | 554 | 512 |
| 10. (Chlor, Bac) (Mycop, Strep, Clost) (Mycob, Bifid) | 3 | 8 | 1* | 2* | 15* | 13 | 255 | 245 |
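The perfect-score arithmetic described in the caption can be checked with a small sketch. The function name `perfect_scores` and its parameters are illustrative assumptions, not part of the paper; the only facts used are that the Consensus score counts accepted clades over all single-gene trees and the Concatenation score counts them over 100 bootstrap trees.

```python
def perfect_scores(n_clades, n_genes, n_bootstraps=100):
    """Return the maximum possible (Consensus, Concatenation) scores.

    Hypothetical helper: a perfect Consensus score finds every accepted
    clade in every single-gene tree, and a perfect Concatenation score
    finds every accepted clade in every bootstrap tree.
    """
    consensus_max = n_clades * n_genes
    concatenation_max = n_clades * n_bootstraps
    return consensus_max, concatenation_max


# Problem 6 in the table: 5 accepted clades and 8 single-gene trees.
consensus_max, concatenation_max = perfect_scores(5, 8)
print(consensus_max, concatenation_max)
```

Dividing a reported score by the corresponding maximum (for example, 37 out of 40 for the representative-protein Consensus column of problem 6) gives a fraction that can be compared across problems with different numbers of clades and genes.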
Bacterial genomes used in this paper. All phyla in GenBank as of December 2004 are represented. Bold letters give abbreviations used in Table 1.
| Specific Genome | Taxonomy (Phylum; Class) |
| Corynebacterium glutamicum | Actinobacteria;Actinobacteria |
| | Actinobacteria;Actinobacteria |
| | Actinobacteria;Actinobacteria |
| Streptomyces avermitilis | Actinobacteria;Actinobacteria |
| Aquifex aeolicus | Aquificae;Aquificae |
| | Bacteroidetes/Chlorobi;Bacteroidetes |
| | Bacteroidetes/Chlorobi;Chlorobi |
| Porphyromonas gingivalis W83 | Bacteroidetes/Chlorobi;Bacteroidetes |
| Chlamydia trachomatis | Chlamydiae/Verrucomicrobia;Chlamydiae |
| Chlamydophila pneumoniae AR39 | Chlamydiae/Verrucomicrobia;Chlamydiae |
| Chloroflexus aurantiacus | Chloroflexi;Chloroflexi |
| Gloeobacter violaceus | Cyanobacteria;Chroococcales |
| Synechococcus sp WH8102 | Cyanobacteria;Chroococcales |
| | Cyanobacteria;Chroococcales |
| | Cyanobacteria;Nostocales |
| | Cyanobacteria;Prochlorophytes |
| Deinococcus radiodurans | Deinococcus-Thermus;Deinococci |
| Bacillus subtilis | Firmicutes;Bacilli |
| Oceanobacillus iheyensis | Firmicutes;Bacilli |
| | Firmicutes;Bacilli |
| | Firmicutes;Bacilli |
| | Firmicutes;Bacilli |
| | Firmicutes;Bacilli |
| | Firmicutes;Clostridia |
| Thermoanaerobacter tengcongensis | Firmicutes;Clostridia |
| | Firmicutes;Mollicutes |
| | Firmicutes;Mollicutes |
| Fusobacterium nucleatum | Fusobacteria;Fusobacteria |
| Pirellula sp | Planctomycetes;Planctomycetacia |
| | Proteobacteria;Alphaproteobacteria |
| Rhodopseudomonas palustris | Proteobacteria;Alphaproteobacteria |
| | Proteobacteria;Alphaproteobacteria |
| Bradyrhizobium japonicum | Proteobacteria;Alphaproteobacteria |
| | Proteobacteria;Alphaproteobacteria |
| | Proteobacteria;Alphaproteobacteria |
| | Proteobacteria;Betaproteobacteria |
| | Proteobacteria;Betaproteobacteria |
| Chromobacterium violaceum | Proteobacteria;Betaproteobacteria |
| Bordetella pertussis | Proteobacteria;Betaproteobacteria |
| Nitrosomonas europaea | Proteobacteria;Betaproteobacteria |
| Coxiella burnetii | Proteobacteria;Gammaproteobacteria |
| | Proteobacteria;Gammaproteobacteria |
| | Proteobacteria;Gammaproteobacteria |
| Pseudomonas aeruginosa | Proteobacteria;Gammaproteobacteria |
| Shigella flexneri 2a | Proteobacteria;Gammaproteobacteria |
| Shewanella oneidensis | Proteobacteria;Gammaproteobacteria |
| | Proteobacteria;Gammaproteobacteria |
| Xanthomonas campestris | Proteobacteria;Gammaproteobacteria |
| | Proteobacteria;Gammaproteobacteria |
| | Proteobacteria;delta/epsilon subdivisions |
| | Proteobacteria;delta/epsilon subdivisions |
| | Proteobacteria;delta/epsilon subdivisions |
| | Proteobacteria;delta/epsilon subdivisions |
| | Proteobacteria;delta/epsilon subdivisions |
| | Spirochaetes;Spirochaetes |
| | Spirochaetes;Spirochaetes |
| | Spirochaetes;Spirochaetes |
| Thermotoga maritima | Thermotogae;Thermotogae |
Representative proteins used to compute Figure 2. Class is the COG functional code [18]. Rank is the rank in a list of most conserved proteins (families of orthologs), from 0 to 199, for the set of genomes under study; thus FtsH (rank 2) is more conserved than DNA polymerase I (rank 59). Coeff, S.Dev, and Max are respectively the correlation coefficient, the standard deviation, and the maximum elementwise difference between the scaled distance matrix given by this protein and the consensus distance matrix. Distances for this set of organisms were approximately 0–150. Each sequence was limited to the most conserved 300-residue amino acid sequence for the protein.
| Gene | Class | Name | COG | Rank | Coeff | S.Dev | Max | |
| GidA | D | glucose-inhibited division protein | 0445 | 23 | .92 | 4.27 | 11.90 | |
| - | R | GTP-binding protein | 0012 | 43 | .94 | 3.84 | 12.43 | |
| RuvB | L | Holliday junction DNA helicase | 2255 | 19 | .89 | 4.32 | 12.51 | |
| Pnp | J | polynucleotide phosphorylase | 1185 | 25 | .86 | 4.94 | 13.11 | |
| PyrG | F | CTP synthetase | 0504 | 26 | .91 | 4.19 | 13.95 | |
| LepA | N | GTP-binding elongation factor | 0481 | 15 | .92 | 4.88 | 14.00 | |
| DnaX | L | DNA polymerase III subunits gamma and tau | 2812 | 86 | .90 | 4.37 | 14.59 | |
| Mfd | LK | transcription-repair coupling factor | 1197 | 31 | .88 | 4.82 | 14.94 | |
| UvrB | L | DNA excision nuclease subunit B | 0556 | 12 | .93 | 4.44 | 16.29 | |
| InfB | J | translation initiation factor IF-2 | 0532 | 32 | .90 | 4.59 | 17.46 | |
| Exo | L | DNA polymerase I | 0258 | 59 | .89 | 4.85 | 17.60 | |
| PolC | L | DNA polymerase III, alpha chain | 0587 | 61 | .77 | 6.43 | 17.81 | |
| RecA | L | RecA protein | 0468 | 4 | .85 | 6.22 | 19.18 | |
| GyrA | L | DNA gyrase subunit A | 0188 | 10 | .88 | 5.60 | 19.74 | |
| HflB | O | cell division protein FtsH | 0465 | 2 | .86 | 5.29 | 19.89 | |
| ClpX | O | ATP-dependent Clp protease, ClpX | 1219 | 13 | .89 | 5.12 | 20.10 | |
| ThrS | J | threonyl-tRNA synthetase | 0441 | 33 | .77 | 6.94 | 20.19 | |
| Rho | K | transcription termination factor rho | 1158 | 3 | .87 | 5.83 | 20.20 | |
| GroL | O | GroEL, chaperone Hsp60 | 0459 | 8 | .92 | 5.90 | 20.35 | |
| ClpB | O | ClpB protein | 0542 | 7 | .75 | 6.23 | 21.01 | |
| - | R | putative GTP-binding protein | 1160 | 165 | .94 | 7.93 | 21.07 | |
| DnaK | O | dnaK, chaperone Hsp70 | 0443 | 5 | .81 | 6.48 | 21.27 | |
| RpsA | J | 30S ribosomal subunit protein S1 | 0539 | 38 | .88 | 8.14 | 22.48 | |
| RpoA | K | DNA-directed RNA polymerase alpha chain | 0202 | 102 | .91 | 10.93 | 32.41 | |
| TrxB | O | thioredoxin reductase | 0492 | 66 | .87 | 8.67 | 32.54 | |
| UvrC | L | excinuclease ABC subunit C | 0322 | 133 | .93 | 6.67 | 32.68 | |
| NusA | K | transcription pausing | 0195 | 106 | .83 | 11.11 | 39.39 | |
| QRI7 | O | o-sialoglycoprotein endopeptidase | 0533 | 123 | .89 | 7.53 | 43.68 | |
| YidC | N | 60 kD inner membrane protein | 0706 | 179 | .85 | 13.69 | 53.24 | |
| SecY | N | subunit of translocase | 0201 | 82 | .86 | 8.00 | 58.78 |
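The three agreement statistics in the caption (Coeff, S.Dev, Max) can be illustrated with a minimal sketch. Several points here are assumptions, since this section does not specify them: that the statistics are taken over the off-diagonal pairs of symmetric distance matrices, that S.Dev is the population standard deviation of the elementwise differences, and that the protein matrix has already been scaled. The function names and the toy matrices are hypothetical.

```python
import math


def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square symmetric matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def compare_to_consensus(protein_dist, consensus_dist):
    """Return (coeff, s_dev, max_diff) for two distance matrices.

    Assumes protein_dist is already scaled and that neither matrix has
    constant off-diagonal entries (which would make the correlation
    undefined).
    """
    x = upper_triangle(protein_dist)
    y = upper_triangle(consensus_dist)
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Pearson correlation coefficient between the two sets of distances.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    coeff = cov / math.sqrt(var_x * var_y)

    # Spread and worst case of the elementwise differences.
    diffs = [a - b for a, b in zip(x, y)]
    mean_d = sum(diffs) / n
    s_dev = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / n)
    max_diff = max(abs(d) for d in diffs)
    return coeff, s_dev, max_diff


# Toy 3-organism matrices, invented purely for illustration.
protein = [[0, 10, 20], [10, 0, 30], [20, 30, 0]]
consensus = [[0, 12, 18], [12, 0, 33], [18, 33, 0]]
coeff, s_dev, max_diff = compare_to_consensus(protein, consensus)
print(round(coeff, 3), round(s_dev, 3), max_diff)
```

Under these assumptions, a protein whose matrix tracks the consensus closely has a coefficient near 1 together with a small S.Dev and Max, which is the pattern at the top of the table.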
Figure 1. A rooted phylogenetic tree of Bacteria computed from representative proteins. As explained in the text, this tree was computed first altogether, then with an outgroup of Aeropyrum and Methanopyrus to place the root, and then again in overlapping halves using different proteins, with the split at the doubled edge near Chlorobium. The numbers indicate bootstrap support for clades out of 100 trials; omitted numbers are all 100. The bootstrap support for the root is 44; the second choice is shown dashed. The weakest bootstrap support is for the Spirochaetes and Chlamydiales clade; again the second choice is shown dashed. The Chloroflexus genome is available only in a draft; we give it a tentative placement, without bootstrap support or edge lengths, based on about 1200 columns.
Figure 2. An unrooted phylogenetic tree of Proteobacteria and related bacteria. This tree shows the left half of Figure 1, including a number of additional genomes. The numbers associated with edges give bootstrap support as before; the best supported alternative choices are shown dashed. This tree slightly favors breaking the Spirochaetes/Chlamydiales clade of Figure 1.
Figure 6. Histograms of evolutionary distances. Plotted are the evolutionary distances between E. coli and three other bacteria: Streptococcus pneumoniae, Neisseria meningitidis, and Haemophilus influenzae. Each distance D(i, j, k), described in the Methods section, is given by a pairwise alignment of amino acid sequences of a given length (typically 300 residues), the most conserved subsequences for a family of orthologous proteins. We can interpret distances as times, with greater time towards the left. All three histograms are roughly bell-shaped but with rather high variances, which suggests that reliable phylogenetic inference requires either a great many sequences or representative sequences that sit near the center in all pairwise histograms. The peaks at 100+ indicate missing orthologs. There are several apparent horizontal transfers (right-side outliers) in S. pneumoniae and N. meningitidis. Even discounting the peaks at 100+, the left-side outliers (rapid evolution, large insertions or deletions, missing domains, hidden paralogs, and horizontal transfers from more distant organisms) outnumber right-side outliers; this pattern holds true even for very distant pairs such as S. pneumoniae and E. coli.
Figure 3. An unrooted phylogenetic tree of . The edge labeled 67 is the only edge without bootstrap support of 100; the one alternative topology covers the other 33 bootstrap trials. This tree switches the branching order of Vibrio and Haemophilus from that shown in Figures 1 and 2. This tree should be more reliable, due to better taxon sampling and protein representativeness. For this less diverse set of taxa, many proteins had correlation coefficients greater than .90 with the consensus distance matrix. Notice that the two species of Vibrio are much more diverged than Escherichia and Salmonella.
Figure 4. An unrooted phylogenetic tree of Firmicutes, Actinobacteria, and Cyanobacteria. Chlorobium and Bacteroides were included here and in Figure 2 in order to orient the trees relative to each other.
Figure 5. An unrooted phylogenetic tree of Firmicutes and Actinobacteria. This tree includes the reduced genomes of Mycoplasma and Ureaplasma. We found that if these genomes were included in larger phylogenies, such as those in Figures 1 and 4, we obtained unreliable results with poor bootstrap support.