Alexis Criscuolo, Olivier Gascuel.
Abstract
BACKGROUND: Distance-based phylogeny inference methods first estimate evolutionary distances between every pair of taxa, then build a tree from the resulting distance matrix. These methods are fast and fairly accurate. However, they cope poorly with incomplete distance matrices. Such matrices are frequent in recent multi-gene studies, where two species may not share any gene in the analyzed data. The few existing algorithms that infer trees with satisfactory accuracy from incomplete distance matrices have time complexity in O(n^4) or more, where n is the number of taxa, which precludes large-scale studies. Agglomerative distance algorithms (e.g. NJ [1,2]) are much faster, with time complexity in O(n^3), which allows huge datasets and heavy bootstrap analyses to be dealt with. These algorithms proceed in three steps: (a) search for the taxon pair to be agglomerated, (b) estimate the lengths of the two branches thus created, (c) reduce the distance matrix and return to (a) until the tree is fully resolved. But available agglomerative algorithms cannot deal with incomplete matrices.
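The three agglomerative steps (a), (b) and (c) can be illustrated with a minimal sketch of classical neighbor joining on a complete distance matrix. This is not the authors' method for incomplete matrices; the function name `neighbor_joining`, the `frozenset`-keyed distance dictionary and the `nodeN` labels are assumptions made for illustration only.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """Classical NJ sketch for a COMPLETE matrix (illustrative only).

    dist maps frozenset({x, y}) -> distance.
    Returns a list of (node_a, node_b, branch_length) edges.
    """
    d = dict(dist)
    active = list(names)
    edges = []
    next_id = 0
    while len(active) > 2:
        r = len(active)
        # Row sums over the currently active taxa.
        total = {x: sum(d[frozenset((x, y))] for y in active if y != x)
                 for x in active}
        # (a) select the pair minimizing the NJ criterion.
        i, j = min(
            combinations(active, 2),
            key=lambda p: (r - 2) * d[frozenset(p)] - total[p[0]] - total[p[1]],
        )
        dij = d[frozenset((i, j))]
        # (b) estimate the lengths of the two new branches.
        li = 0.5 * dij + (total[i] - total[j]) / (2 * (r - 2))
        lj = dij - li
        u = f"node{next_id}"  # hypothetical label for the new internal node
        next_id += 1
        edges += [(i, u, li), (j, u, lj)]
        # (c) reduce the matrix: replace i and j by the new node u.
        for k in active:
            if k not in (i, j):
                d[frozenset((u, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij
                )
        active = [k for k in active if k not in (i, j)] + [u]
    x, y = active
    edges.append((x, y, d[frozenset((x, y))]))
    return edges

# Additive distances for the tree ((A:1,B:2):1,(C:3,D:4)).
pairs = {("A", "B"): 3, ("A", "C"): 5, ("A", "D"): 6,
         ("B", "C"): 6, ("B", "D"): 7, ("C", "D"): 7}
matrix = {frozenset(p): v for p, v in pairs.items()}
tree_edges = neighbor_joining(["A", "B", "C", "D"], matrix)
```

Each pass over the pair-selection step costs O(n^2) and there are about n passes, which is where the O(n^3) total comes from. With a missing entry, the lookups in steps (a) and (c) would fail, which is the limitation the abstract points to.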
Year: 2008 PMID: 18366787 PMCID: PMC2335114 DOI: 10.1186/1471-2105-9-166
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Topological accuracy depending on the rate of missing entries. Horizontal axis: percentage of missing distances (Pmiss). Vertical axis: topological accuracy measured by the mean (over 500 trials) quartet distance (d) between the correct and inferred trees. s: number of taxon pairs that BIONJ* first selects using NJ-like criterion (6), and then analyzes using score-based criterion (9) (and criteria (8), (10), (11) in case of ties). The distance matrix is additive, and thus all methods recover the correct tree when Pmiss = 0.
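As an illustration of the accuracy measure, the sketch below computes a quartet distance between two trees, taken here as the proportion of four-taxon subsets on which the trees induce different topologies. The normalization to a proportion, the representation of a tree as a list of internal bipartitions, and the function names are assumptions for this example, not details taken from the paper.

```python
from itertools import combinations

def quartet_topology(splits, quartet):
    """Return the pairing a tree induces on four taxa, or None if unresolved.

    splits: iterable of frozensets, each one side of an internal bipartition.
    """
    a, b, c, d = quartet
    pairings = (((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c)))
    for (p, q), (r, s) in pairings:
        for side in splits:
            same_pq = (p in side) == (q in side)
            same_rs = (r in side) == (s in side)
            # The split separates {p, q} from {r, s}.
            if same_pq and same_rs and (p in side) != (r in side):
                return frozenset((frozenset((p, q)), frozenset((r, s))))
    return None

def quartet_distance(taxa, splits1, splits2):
    """Proportion of quartets resolved differently by the two trees."""
    quartets = list(combinations(sorted(taxa), 4))
    differing = sum(
        quartet_topology(splits1, q) != quartet_topology(splits2, q)
        for q in quartets
    )
    return differing / len(quartets)

# Two five-taxon trees: ((A,B),C,(D,E)) and ((A,C),B,(D,E)).
tree1 = [frozenset("AB"), frozenset("DE")]
tree2 = [frozenset("AC"), frozenset("DE")]
d = quartet_distance("ABCDE", tree1, tree2)  # 2 of the 5 quartets differ
```

Enumerating all quartets costs O(n^4) set lookups per split, so this form is only practical for small trees.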
Table 1. Topological accuracy with medium-level distance supermatrices
| k | d | d | d | p-value |
| 2 | 0.0906 | 0.0926 | 0.0857 | 0.286 |
| 4 | 0.0546 | 0.0595 | 0.0524 | 0.466 |
| 6 | 0.0445 | 0.0454 | 0.0410 | 0.015 |
| 8 | 0.0330 | 0.0356 | 0.0386 | 0.958 |
| 10 | 0.0300 | 0.0317 | 0.0286 | 0.364 |
| 12 | 0.0317 | 0.0354 | 0.0314 | 0.030 |
| 14 | 0.0266 | 0.0286 | 0.0251 | 0.816 |
| 16 | 0.0318 | 0.0327 | 0.0303 | 0.028 |
| 18 | 0.0278 | 0.0280 | 0.0265 | 0.020 |
| 20 | 0.0259 | 0.0281 | 0.0247 | 0.955 |

| k | d | d | d | p-value |
| 2 | 0.2174 | 0.2187 | 0.2163 | 0.920 |
| 4 | 0.1778 | 0.1818 | 0.1713 | 0.060 |
| 6 | 0.1443 | 0.1534 | 0.1418 | ≈ 0.0 |
| 8 | 0.1253 | 0.1302 | 0.1137 | 0.176 |
| 10 | 0.1039 | 0.1117 | 0.0959 | 0.033 |
| 12 | 0.0968 | 0.1021 | 0.0875 | 0.470 |
| 14 | 0.0749 | 0.0850 | 0.0710 | 0.464 |
| 16 | 0.0731 | 0.0802 | 0.0658 | 0.335 |
| 18 | 0.0617 | 0.0687 | 0.0555 | 0.074 |
| 20 | 0.0600 | 0.0682 | 0.0560 | 0.189 |
In the medium-level combination, distance matrices are directly estimated from each of the genes and then combined (using SDM) into the distance supermatrix. Topological accuracy is measured by the mean (over 500 trials) quartet distance (d) between the correct and inferred trees. k: number of genes. p-value: sign-test significance when comparing the 500 d values of the two best methods, which are indicated in bold and underlined (1st method) and bold (2nd one).
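The sign-test comparison of paired d values can be sketched as follows. The example assumes a two-sided exact binomial test with ties discarded; the note above does not say which variant was used, and the function name `sign_test_p_value` is hypothetical.

```python
from math import comb

def sign_test_p_value(d_first, d_second):
    """Two-sided exact sign test on paired values (assumed variant).

    Counts how often one method beats the other, drops ties, and
    compares the counts with a Binomial(n, 1/2) null distribution.
    """
    wins = sum(x < y for x, y in zip(d_first, d_second))
    losses = sum(x > y for x, y in zip(d_first, d_second))
    n = wins + losses  # ties carry no sign information
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Toy paired data: the first method is better in all five trials.
p = sign_test_p_value([0.1] * 5, [0.2] * 5)  # 2 * (1/32) = 0.0625
```

Because the test only uses the sign of each paired difference, it makes no assumption about how the d values are distributed, which suits bounded, tie-prone measures such as quartet distances.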
Table 2. Topological accuracy with high-level distance supermatrices
| k | d | d | d | p-value |
| 2 | 0.0561 | 0.0586 | 0.0566 | ≈ 0.0 |
| 4 | 0.0345 | 0.0361 | 0.0351 | ≈ 0.0 |
| 6 | 0.0265 | 0.0272 | 0.0261 | ≈ 0.0 |
| 8 | 0.0227 | 0.0228 | 0.0217 | 0.094 |
| 10 | 0.0188 | 0.0194 | 0.0192 | 0.047 |
| 12 | 0.0207 | 0.0215 | 0.0199 | 0.949 |
| 14 | 0.0164 | 0.0164 | 0.0165 | 0.882 |
| 16 | 0.0208 | 0.0210 | 0.0213 | ≈ 0.0 |
| 18 | 0.0177 | 0.0177 | 0.0174 | 0.271 |
| 20 | 0.0162 | 0.0168 | 0.0171 | 0.648 |

| k | d | d | d | p-value |
| 2 | 0.1876 | 0.1877 | 0.1824 | 0.282 |
| 4 | 0.1396 | 0.1397 | 0.1390 | 0.018 |
| 6 | 0.1125 | 0.1134 | 0.1119 | 0.166 |
| 8 | 0.0892 | 0.0926 | 0.0870 | 0.005 |
| 10 | 0.0739 | 0.0766 | 0.0723 | 0.023 |
| 12 | 0.0670 | 0.0705 | 0.0677 | 0.015 |
| 14 | 0.0538 | 0.0567 | 0.0534 | ≈ 0.0 |
| 16 | 0.0518 | 0.0554 | 0.0512 | ≈ 0.0 |
| 18 | 0.0416 | 0.0485 | 0.0424 | 0.922 |
| 20 | 0.0435 | 0.0453 | 0.0431 | ≈ 0.0 |
In the high-level combination, ML trees are first inferred separately for every gene, and then these trees are turned into path-length distance matrices, which are combined (using SDM) into the distance supermatrix. For symbols and notation, see note to Table 1.
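Turning a tree into a path-length distance matrix amounts to summing branch lengths along the path between each pair of leaves. The sketch below assumes the tree is given as a list of weighted edges; the function name `path_length_distances` and the node labels are hypothetical, and the subsequent SDM combination across genes is not shown.

```python
from collections import defaultdict

def path_length_distances(edges):
    """Leaf-to-leaf path lengths in a tree given as (u, v, length) edges."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    leaves = sorted(x for x in adj if len(adj[x]) == 1)
    dist = {}
    for src in leaves:
        # Depth-first walk from each leaf, accumulating branch lengths.
        stack = [(src, None, 0.0)]
        while stack:
            node, parent, acc = stack.pop()
            if node != src and len(adj[node]) == 1:
                dist[(src, node)] = acc
            for nxt, w in adj[node]:
                if nxt != parent:
                    stack.append((nxt, node, acc + w))
    return dist

# Gene tree ((A:1,B:2):1,(C:3,D:4)) with internal nodes u and v.
gene_tree = [("A", "u", 1.0), ("B", "u", 2.0), ("u", "v", 1.0),
             ("C", "v", 3.0), ("D", "v", 4.0)]
dm = path_length_distances(gene_tree)
```

A gene tree only yields distances between the taxa it contains, so pairs of species that never occur together in any gene remain missing entries in the combined supermatrix.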
Table 3. Topological accuracy with datasets generated from Driskell et al. [20]
| 0.0234 | 0.0268 | 0.0289 | ≈ 0.0 | ≈ 0.0 | |||
| 0.0161 | 0.0165 | 0.0182 | 0.0172 | 0.001 | 0.193 | ||
(a): Medium-level combination of the distance matrices directly estimated from the gene sequences. (b): High-level combination; ML trees are first inferred separately for every gene; MRP turns the gene trees into a matrix of partial binary characters, which is analyzed with parsimony; with the other (distance) methods, the gene trees are turned into path-length distance matrices, which are combined into the distance supermatrix. All combinations of source distance matrices are achieved using SDM. p-value: sign-test significance when comparing the 100 d values of MVR* (our best algorithm) to those of FITCH and MRP. For other symbols and notation, see note to Table 1.
Table 4. Run times
| 10 | 2 | 10 | 20 | 2 | 10 | 20 | 2 | 10 | 20 | |
| < 1 | 11 | 23 | 23 | 21 | 39 | 41 | < 1 | < 1 | < 1 | |
| 5 | 437 | 482 | 479 | 623 | 932 | 926 | 7 | 6 | 5 | |
| 32 | 11,065 | 13,864 | 13,945 | 23,541 | 34,368 | 35,017 | 57 | 60 | 42 | |
| 10 | 2 | 10 | 20 | 2 | 10 | 20 | 2 | 10 | 20 | |
| < 1 | 6 | 17 | 23 | 10 | 28 | 36 | < 1 | < 1 | < 1 | |
| < 1 | 22 | 455 | 492 | 29 | 656 | 667 | < 1 | 4 | 7 | |
| 2 | 448 | 11,532 | 14,025 | 916 | 32,371 | 34,152 | 3 | 31 | 52 | |
| 334 | 132 | 268 | < 1 | |||||||
Mean run times are provided for various taxon numbers (n), gene numbers (k) and proportions of missing entries: (a) 25%, (b) 75%, and (c) 1.2% using Driskell et al. [20]-like datasets. Run times are measured in seconds on a standard PC (Pentium IV 1.8 GHz, 1 GB RAM). The low run times with k = 2 and a 75% deletion rate are explained by the small size of the distance supermatrices (see text for explanations).