Dmitry Penzar, Mikhail Krivozubov, Sergey Spirin.
Abstract
BACKGROUND: Many algorithms and programs are available for phylogenetic reconstruction of protein families. The methods in wide use today rely either on distance-based principles or on the character-based principles of maximum parsimony or maximum likelihood.
Keywords: Algorithm; Open source software; Phylogeny reconstruction; Protein evolution; Web interface
Year: 2018 PMID: 30314446 PMCID: PMC6186109 DOI: 10.1186/s12859-018-2399-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Alignment datasets
| Name | Number of alignments |
|---|---|
| Metazoa-10 | 1499 |
| Metazoa-15 | 1283 |
| Fungi-15 | 1191 |
| Proteobacteria-15 | 784 |
| Metazoa-25 | 970 |
| Fungi-30 | 1004 |
| Proteobacteria-30 | 783 |
| Fungi-45 | 827 |
| Proteobacteria-45 | 780 |
The name of each set consists of the taxon name and the number of sequences in each alignment of the set
Mean relative tree scores, mean normalized Robinson – Foulds distances to the species trees, …
| Dataset | Mean relative tree score | Mean normalized RF distance | | | |
|---|---|---|---|---|---|
| Metazoa-10 | 0.9919 | 0.345 | −0.40 | 0.29 | −0.21 |
| Metazoa-15 | 0.9901 | 0.388 | −0.44 | 0.37 | −0.25 |
| Fungi-15 | 0.9915 | 0.329 | −0.42 | 0.38 | −0.31 |
| Proteobacteria-15 | 0.9816 | 0.564 | −0.27 | 0.39 | −0.03 |
| Metazoa-25 | 0.9900 | 0.418 | −0.39 | 0.42 | −0.25 |
| Fungi-30 | 0.9908 | 0.415 | −0.43 | 0.45 | −0.33 |
| Proteobacteria-30 | 0.9779 | 0.682 | −0.25 | 0.42 | −0.15 |
| Fungi-45 | 0.9912 | 0.445 | −0.48 | 0.47 | −0.33 |
| Proteobacteria-45 | 0.9762 | 0.739 | −0.29 | 0.43 | −0.18 |
The optimization strategy was stepwise addition repeated 10 times, followed by NNI hill climbing; the scoring matrix was BLOSUM62
Percentages of alignments for which different search strategies reach the maximum tree score
| Dataset | 1SA | 10SA | 100SA | NNI HC | NNI MC | SPR |
|---|---|---|---|---|---|---|
| Metazoa-10 | 61.4% | 99.3% | 99.9% | 99.5% | 100% | 99.8% |
| Metazoa-15 | 42.3% | 92.2% | 99.8% | 97.7% | 99.4% | 99.1% |
| Fungi-15 | 41.6% | 91.9% | 99.7% | 98.7% | 99.7% | 99.4% |
| Proteobacteria-15 | 25.5% | 73.5% | 97.2% | 93.5% | 98.2% | 97.1% |
| Metazoa-25 | 22.6% | 72.0% | 96.6% | 92.4% | 95.1% | 98.9% |
| Fungi-30 | 8.7% | 42.3% | 87.1% | 85.8% | 92.0% | 97.8% |
| Proteobacteria-30 | 1.4% | 11.4% | 41.0% | 57.3% | 70.9% | 93.4% |
| Fungi-45 | 1.6% | 13.3% | 48.0% | 62.8% | 75.1% | 96.1% |
| Proteobacteria-45 | 0.0% | 0.4% | 4.4% | 27.4% | 37.2% | 88.5% |
1SA, 10SA, and 100SA denote stepwise addition performed once, 10 times, and 100 times, respectively; NNI HC denotes NNI hill climbing, NNI MC denotes NNI Monte Carlo search, and SPR denotes subtree pruning and regrafting
Percentages of alignments for which different search strategies reach the minimum Robinson – Foulds distance to the species tree
| Dataset | 1SA | 10SA | 100SA | NNI HC | NNI MC | SPR |
|---|---|---|---|---|---|---|
| Metazoa-10 | 85.4% | 91.4% | 91.5% | 91.3% | 91.5% | 91.7% |
| Metazoa-15 | 80.3% | 84.8% | 85.1% | 85.3% | 85.0% | 85.3% |
| Fungi-15 | 75.1% | 83.4% | 83.7% | 83.8% | 83.5% | 83.7% |
| Proteobacteria-15 | 71.0% | 80.9% | 81.5% | 80.1% | 81.0% | 81.1% |
| Metazoa-25 | 70.8% | 77.4% | 78.5% | 78.5% | 78.8% | 78.2% |
| Fungi-30 | 50.2% | 63.3% | 65.8% | 65.8% | 65.3% | 65.5% |
| Proteobacteria-30 | 42.5% | 55.6% | 57.7% | 53.8% | 57.2% | 58.5% |
| Fungi-45 | 37.7% | 49.1% | 49.6% | 49.8% | 49.7% | 52.5% |
| Proteobacteria-45 | 31.8% | 38.3% | 39.9% | 43.8% | 40.9% | 46.0% |
Notation is the same as in Table 3
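
The normalized Robinson – Foulds distance reported in these tables can be illustrated with a minimal sketch. It assumes unrooted trees written as nested tuples of leaf names and normalizes by 2(n − 3), the maximum for two fully resolved trees on n leaves; the source does not spell out its normalization here, although the tabulated thresholds are multiples of 1/(n − 3), which is consistent with this choice. The function names are illustrative and are not taken from the PQ program.

```python
def splits(tree):
    """Return the non-trivial bipartitions of a tree given as nested tuples.

    Each bipartition is stored as the side that does NOT contain a fixed
    reference leaf, so the result does not depend on where the tree is rooted.
    """
    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves(child) for child in node))

    all_leaves = leaves(tree)
    ref = min(all_leaves)  # arbitrary but fixed reference leaf
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if ref in clade else clade
        # keep only splits with at least two leaves on each side
        if 2 <= len(side) <= len(all_leaves) - 2:
            found.add(side)
        return clade

    visit(tree)
    return found, len(all_leaves)


def normalized_rf(tree_a, tree_b):
    """Symmetric difference of split sets divided by 2 * (n - 3)."""
    splits_a, n_a = splits(tree_a)
    splits_b, n_b = splits(tree_b)
    if n_a != n_b or n_a < 4:
        raise ValueError("trees must share the same leaf set with n >= 4")
    return len(splits_a ^ splits_b) / (2 * (n_a - 3))


# Toy five-leaf trees (hypothetical data, not from the paper's datasets)
reference = ((("A", "B"), "C"), ("D", "E"))
inferred = ((("A", "C"), "B"), ("D", "E"))
print(normalized_rf(reference, inferred))
```

The sketch only checks that the leaf counts agree, not that the leaf names match, so inputs are assumed to come from the same alignment.
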
Average Robinson – Foulds distances between the species trees and reconstructions by the programs
| Dataset | PQ | ME | ML | MP | QP |
|---|---|---|---|---|---|
| Metazoa-10 | 0.345 | 0.379 | 0.390 | 0.433 | 0.357 |
| Metazoa-15 | 0.388 | 0.417 | 0.424 | 0.475 | 0.401 |
| Fungi-15 | 0.329 | 0.355 | 0.391 | 0.417 | 0.335 |
| Proteobacteria-15 | 0.564 | 0.584 | 0.620 | 0.633 | 0.574 |
| Metazoa-25 | 0.418 | 0.441 | 0.440 | 0.515 | 0.437 |
| Fungi-30 | 0.415 | 0.421 | 0.444 | 0.486 | 0.417 |
| Proteobacteria-30 | 0.682 | 0.697 | 0.718 | 0.747 | 0.693 |
| Fungi-45 | 0.445 | 0.438 | 0.457 | 0.512 | 0.452 |
| Proteobacteria-45 | 0.739 | 0.744 | 0.761 | 0.790 | 0.744 |
Numbers of “good” reconstructions
| Dataset | Threshold | PQ | ME | ML | MP | QP |
|---|---|---|---|---|---|---|
| Metazoa-10 | 0.143 | 192 | 145 | 152 | 111 | 166 |
| Metazoa-15 | 0.25 | 297 | 267 | 253 | 129 | 262 |
| Fungi-15 | 0.167 | 166 | 143 | 108 | 71 | 161 |
| Proteobacteria-15 | 0.417 | 143 | 126 | 81 | 71 | 127 |
| Metazoa-25 | 0.273 | 188 | 169 | 173 | 61 | 147 |
| Fungi-30 | 0.296 | 206 | 208 | 182 | 96 | 185 |
| Proteobacteria-30 | 0.593 | 186 | 166 | 127 | 78 | 163 |
| Fungi-45 | 0.357 | 198 | 236 | 211 | 108 | 187 |
| Proteobacteria-45 | 0.643 | 152 | 134 | 110 | 57 | 128 |
The column “Threshold” contains the first quartile of the Robinson – Foulds distances between PQ trees and species trees for each set. The other columns give the number of trees reconstructed by each method whose distance to the corresponding species tree is less than the threshold. The numbers in the PQ column are smaller than 1/4 of the set sizes because the distance can take only a few possible values
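
The threshold-and-count procedure described in this note can be sketched as follows. This is an illustration under assumptions: the source does not say which quartile convention was used, so the "inclusive" method of Python's `statistics.quantiles` is an arbitrary choice, and the distance lists are made-up toy values rather than data from the paper.

```python
import statistics


def count_good(pq_distances, distances_by_method):
    """Count, per method, the trees closer to the species tree than the
    first quartile of the PQ distances (strict inequality, as in the note)."""
    # quantiles(..., n=4) returns the three quartile cut points; [0] is Q1
    threshold = statistics.quantiles(pq_distances, n=4, method="inclusive")[0]
    counts = {
        method: sum(d < threshold for d in distances)
        for method, distances in distances_by_method.items()
    }
    return threshold, counts


# Hypothetical normalized distances for eight alignments
pq = [0.0, 0.25, 0.25, 0.5, 0.5, 0.5, 0.75, 1.0]
me = [0.0, 0.0, 0.25, 0.5, 0.5, 0.75, 0.75, 1.0]

threshold, counts = count_good(pq, {"PQ": pq, "ME": me})
print(threshold, counts)
```

Because the distance takes few distinct values, several PQ trees tie with the threshold and are excluded by the strict comparison, which is why the PQ count can fall below a quarter of the set.
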
Numbers of “bad” reconstructions
| Dataset | Threshold | PQ | ME | ML | MP | QP |
|---|---|---|---|---|---|---|
| Metazoa-10 | 0.571 | 193 | 239 | 248 | 317 | 210 |
| Metazoa-15 | 0.5 | 320 | 375 | 402 | 487 | 333 |
| Fungi-15 | 0.417 | 278 | 336 | 413 | 486 | 287 |
| Proteobacteria-15 | 0.667 | 184 | 213 | 250 | 297 | 189 |
| Metazoa-25 | 0.545 | 203 | 247 | 248 | 371 | 213 |
| Fungi-30 | 0.518 | 223 | 235 | 297 | 355 | 212 |
| Proteobacteria-30 | 0.778 | 173 | 210 | 252 | 290 | 197 |
| Fungi-45 | 0.524 | 205 | 202 | 255 | 344 | 200 |
| Proteobacteria-45 | 0.833 | 169 | 172 | 202 | 262 | 172 |
The column “Threshold” contains the third (upper) quartile of the Robinson – Foulds distances between PQ trees and species trees for each set. The other columns give the number of trees reconstructed by each method whose distance to the corresponding species tree is greater than the threshold. The numbers in the PQ column are smaller than 1/4 of the set sizes because the distance can take only a few possible values
Pairwise comparison of PQ with ME, ML, MP, and QP
| Dataset | ME | ML | MP | QP |
|---|---|---|---|---|
| Metazoa-10 | | | | |
| Metazoa-15 | | | | |
| Fungi-15 | | | | 352/302 |
| Proteobacteria-15 | | | | 236/184 |
| Metazoa-25 | | | | |
| Fungi-30 | 412/390 | | | 396/360 |
| Proteobacteria-30 | | | | |
| Fungi-45 | | | | 382/306 |
| Proteobacteria-45 | 350/273 | | | 347/279 |
The number before “/” in each cell is the number of alignments for which the PQ result is closer to the species tree; the number after “/” is the number of alignments for which the PQ result is more distant from the species tree. Statistically significant (p<0.001) results are in boldface
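
A significance check on such better/worse counts can be sketched as below. The source does not name the test behind the p<0.001 criterion here, so this example assumes an exact two-sided sign test that ignores ties; the function name `sign_test_p` is illustrative.

```python
from math import comb


def sign_test_p(better, worse):
    """Exact two-sided sign test p-value for paired win/loss counts.

    Under the null hypothesis each non-tied alignment is equally likely
    to favour either program, so the smaller count follows Binomial(n, 1/2).
    """
    n = better + worse
    k = min(better, worse)
    # probability mass of the lower tail, computed with exact integers
    tail = sum(comb(n, i) for i in range(k + 1))
    return min(1.0, 2 * tail / 2**n)


# Counts taken from one non-bold cell of the table above (Fungi-15, QP)
p = sign_test_p(352, 302)
print(p, p < 0.001)
```

Under this assumption the 352/302 cell does not reach the 0.001 level, in line with it not being marked as significant.
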
Results of the programs on 100 extractions from the alignment of fungal 18S rRNA
| Value | PQ | ME | ML | MP | QP |
|---|---|---|---|---|---|
| Mean distance | 0.20 | 0.21 | 0.23 | 0.28 | 0.22 |
| | −0.17 | −0.21 | −0.37 | −0.25 | −0.21 |
| Perfect | 9 | 9 | 4 | 4 | 6 |
| Bad | 18 | 23 | 29 | 45 | 21 |
| PQ is better | NA | 33 | 46 | 64 | 28 |
| PQ is worse | NA | 22 | 26 | 14 | 16 |
| p-value | NA | 0.17 | 0.024 | 8·10⁻⁹ | 0.1 |
The row
Results of the programs on 100 extractions from the alignment of proteobacterial 16S rRNA
| Value | PQ | ME | ML | MP | QP |
|---|---|---|---|---|---|
| Mean distance | 0.38 | 0.43 | 0.51 | 0.50 | 0.40 |
| | −0.26 | −0.28 | −0.16 | −0.09 | −0.21 |
| Good | 27 | 21 | 9 | 12 | 24 |
| Bad | 14 | 25 | 45 | 41 | 18 |
| PQ is better | NA | 51 | 76 | 74 | 34 |
| PQ is worse | NA | 7 | 8 | 9 | 14 |
| p-value | NA | 2·10⁻⁹ | 5·10⁻¹⁵ | 8·10⁻¹⁴ | 0.01 |
“Good” gives the number of inferred trees whose distance from the species tree is less than 0.25; “Bad” gives the number whose distance from the species tree is greater than 0.5. Other notation is the same as in Table 9
Results of the programs on 500 simulated amino acid alignments
| Value | PQ | ME | ML | MP | QP |
|---|---|---|---|---|---|
| Mean distance | 0.144 | 0.165 | 0.111 | 0.133 | 0.136 |
| | −0.34 | −0.49 | −0.50 | −0.48 | −0.52 |
| Perfect | 15 | 13 | 65 | 24 | 19 |
| Good | 156 | 115 | 316 | 181 | 172 |
| Bad | 106 | 146 | 30 | 96 | 96 |
| PQ is better | NA | 248 | 53 | 167 | 188 |
| PQ is worse | NA | 145 | 347 | 227 | 198 |
| p-value | NA | 2·10⁻⁷ | 5·10⁻⁵⁴ | 0.003 | 0.65 |
The row
Results of the programs on 500 simulated nucleotide alignments
| Value | PQ | ME | ML | MP | QP |
|---|---|---|---|---|---|
| Mean distance | 0.259 | 0.248 | 0.218 | 0.150 | 0.277 |
| | −0.12 | −0.15 | −0.31 | −0.22 | −0.15 |
| Perfect | 20 | 28 | 51 | 86 | 10 |
| Good | 177 | 204 | 253 | 362 | 146 |
| Bad | 95 | 85 | 79 | 17 | 108 |
| PQ is better | NA | 118 | 153 | 58 | 154 |
| PQ is worse | NA | 165 | 245 | 341 | 76 |
| p-value | NA | 0.006 | 4·10⁻⁶ | 7·10⁻⁵⁰ | 3·10⁻⁷ |
“Good” gives the number of inferred trees whose distance from the corresponding reference tree is less than 0.1667; “Bad” gives the number whose distance from the corresponding reference tree is greater than 0.3333. Other notation is the same as in Tables 9 and 11
Fig. 1 Tree of 45 Fungi. The tree of 45 Fungi is labeled with two phyla (Basidiomycota and Ascomycota), the subphylum Pezizomycotina, and five classes of Ascomycota. Letters a, b, c, d, and e denote branches that at least one program must reconstruct for an orthologous group to be used in the investigation of long-branch attraction. These branches are: the branch separating the two phyla (a), the branch separating Pezizomycotina (c), and three branches separating well-represented classes of Ascomycota (b, d, e)
Fig. 2 Putative long-branch attraction. a The correct tree for 18 fungal species; among Saccharomycetes and Sordariomycetes, only the species with the most rapidly evolving proteins have been retained. b The erroneous tree that can result from long-branch attraction