Abstract
The process of inferring phylogenetic trees from molecular sequences almost always starts with a multiple alignment of these sequences but can also be based on methods that do not involve multiple sequence alignment. Very little is known about the accuracy with which such alignment-free methods recover the correct phylogeny or about the potential for increasing their accuracy. We conducted a large-scale comparison of ten alignment-free methods, among them one new approach that does not calculate distances and a faster variant of our pattern-based approach; all distance-based alignment-free methods are freely available from http://www.bioinformatics.org.au (as Python package decaf+py). We show that most methods exhibit a higher overall reconstruction accuracy in the presence of high among-site rate variation. Under all conditions that we considered, variants of the pattern-based approach were significantly better than the other alignment-free methods. The new pattern-based variant achieved a speed-up of an order of magnitude in the distance calculation step, accompanied by a small loss of tree reconstruction accuracy. A method of Bayesian inference from k-mers did not improve on classical alignment-free (and distance-based) methods but may still offer other advantages due to its Bayesian nature. We found the optimal word length k of word-based methods to be stable across various data sets, and we provide parameter ranges for two different alphabets. The influence of these alphabets was analyzed to reveal a trade-off in reconstruction accuracy between long and short branches. We have mapped the phylogenetic accuracy for many alignment-free methods, among them several recently introduced ones, and increased our understanding of their behavior in response to biologically important parameters. In all experiments, the pattern-based approach emerged as superior, at the expense of higher resource consumption. 
Nonetheless, no alignment-free method that we examined recovers the correct phylogeny as accurately as does an approach based on maximum-likelihood distance estimates of multiply aligned sequences.
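Word-based alignment-free methods of the kind compared here start from counts of words of length k (k-mers), optionally after recoding the sequences into a reduced alphabet. The following is a minimal sketch of that idea, assuming a squared Euclidean distance between k-mer count vectors. The residue grouping `EXAMPLE_CLASSES` and all function names are hypothetical; they are neither the decaf+py API nor the chemical equivalence classes used in the paper.

```python
from collections import Counter

# Hypothetical grouping of the 20 amino acids into classes.
# Illustrative only: NOT the CE classes used in the paper.
EXAMPLE_CLASSES = {
    "AVLIMC": "a",
    "FWYH": "r",
    "STNQ": "p",
    "KR": "b",
    "DE": "c",
    "GP": "s",
}
REDUCE = {aa: label for group, label in EXAMPLE_CLASSES.items() for aa in group}


def reduce_alphabet(seq):
    """Recode a sequence into the reduced alphabet (unknown symbols are kept)."""
    return "".join(REDUCE.get(aa, aa) for aa in seq)


def kmer_counts(seq, k):
    """Count all overlapping words of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def squared_euclidean(seq_a, seq_b, k):
    """Squared Euclidean distance between the k-mer count vectors of two sequences."""
    ca, cb = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    # A Counter returns 0 for words that are absent from one sequence.
    return sum((ca[w] - cb[w]) ** 2 for w in set(ca) | set(cb))
```

A pairwise matrix of such distances could then be passed to any distance-based tree-building method; the choice of k and alphabet is the parameter space explored in the tables.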
Year: 2007 PMID: 17454975 PMCID: PMC7107264 DOI: 10.1080/10635150701294741
Source DB: PubMed Journal: Syst Biol ISSN: 1063-5157 Impact factor: 15.683
Figure 1. RF distance landscape for method B-bin. Average RF distance (y-axis) of method B-bin on three reference sets (top to bottom: set 2, set 4, and set 6) of two synthetic data sets (a, c, e: control; b, d, f: ASRV). Each subfigure shows the behavior as a function of word length k (x-axis) for two alphabets (AA: original amino acids, CE: chemical equivalence classes). Points are joined for ease of visual inspection only.
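The RF (Robinson-Foulds) distance used throughout counts the bipartitions (splits) that occur in one tree but not in the other. A minimal sketch follows, assuming trees are given as nested tuples of leaf names and that a normalized value divides by the maximum 2(n − 3) for binary trees on n leaves. Whether the averages reported here use exactly this normalization is an assumption, and the function names are hypothetical.

```python
def leaves(tree):
    """Return the set of leaf names below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Collect the non-trivial bipartitions induced by the internal edges."""
    all_leaves = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = all_leaves - side
        # Keep only splits with at least two leaves on each side.
        if len(side) > 1 and len(other) > 1:
            found.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return found


def rf_distance(t1, t2, normalized=False):
    """Size of the symmetric difference of the two split sets."""
    d = len(splits(t1) ^ splits(t2))
    if normalized:
        n = len(leaves(t1))
        return d / (2 * (n - 3))  # maximum for two binary trees on n leaves
    return d
```

For example, `rf_distance(((("A", "B"), "C"), ("D", "E")), ((("A", "C"), "B"), ("D", "E")))` compares two five-leaf trees that share the split ABC|DE but disagree on the remaining one.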
Distances for control data set. Median of calculated distances for each reference set of the synthetic control data set (sequence length of 1000 amino acids, no ASRV). Order of methods and values for k are as in Table 1. Note that method B-bin is not listed as it does not calculate distances.
| No. | Method | A | k | Set 1 | Set 2 | Set 3 | Set 4 | Set 5 | Set 6 | Set 7 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | | AA | — | 0.7444 | 1.0993 | 1.6257 | 2.0488 | 2.4290 | 2.9717 | 3.3942 |
| 2 | | CE | — | 1.9895 | 2.4403 | 2.8530 | 3.0180 | 3.0853 | 3.1345 | 3.1525 |
| 3 | | CE | — | 8.6963 | 9.3471 | 9.8041 | 9.9639 | 10.025 | 10.063 | 10.075 |
| 4 | | AA | — | 1.1372 | 1.5553 | 1.9732 | 2.1295 | 2.1927 | 2.2274 | 2.2394 |
| 5 | | AA | — | 6.8358 | 7.9676 | 8.7688 | 8.9946 | 9.0815 | 9.1209 | 9.1362 |
| 6 | | CE | — | 1.2536 | 1.3577 | 1.4002 | 1.4128 | 1.4176 | 1.4212 | 1.4226 |
| 7 | | AA | — | 0.8550 | 0.9308 | 0.9668 | 0.9775 | 0.9825 | 0.9856 | 0.9871 |
| 8 | | CE | 5 | 5.6698 | 6.1015 | 6.3468 | 6.4461 | 6.4881 | 6.4606 | 6.5021 |
| 9 | | CE | 5 | 0.6343 | 0.8066 | 0.8821 | 0.9027 | 0.9098 | 0.9157 | 0.9175 |
| 10 | | AA | 4 | 0.8659 | 0.9516 | 0.9778 | 0.9839 | 0.9859 | 0.9870 | 0.9874 |
| 11 | | CE | 5 | 1.4374 | 1.7130 | 1.8759 | 1.9229 | 1.9437 | 1.9578 | 1.9650 |
| 12 | | AA | 4 | 1850 | 1946 | 1978 | 1986 | 1990 | 1994 | 1992 |
| 13 | | CE | 5 | 1780 | 1902 | 1958 | 1976 | 1982 | 1986 | 1990 |
| 14 | | AA | 4 | 1.7079 | 2.0017 | 2.1366 | 2.1712 | 2.1889 | 2.1889 | 2.1979 |
| 15 | | AA | 4 | 0.2928 | 0.3099 | 0.3184 | 0.3212 | 0.3230 | 0.3207 | 0.3221 |
| 16 | | CE | — | 0.8560 | 0.8853 | 0.8967 | 0.9002 | 0.9017 | 0.9024 | 0.9029 |
| 18 | | AA | — | 0.8721 | 0.8957 | 0.9045 | 0.9073 | 0.9080 | 0.9086 | 0.9094 |
| 20 | | AA | 3 | 0.4696 | 0.4961 | 0.5060 | 0.5092 | 0.5108 | 0.5114 | 0.5119 |
| 21 | | CE | 4 | 0.4672 | 0.4924 | 0.5035 | 0.5073 | 0.5086 | 0.5101 | 0.5102 |
| 22 | | AA | (1) | 0.0043 | 0.0059 | 0.0071 | 0.0081 | 0.0088 | 0.0097 | 0.0099 |
Distances for short-sequences data set. Median of calculated distances for each reference set of the synthetic short-sequences data set (sequence length of 300 amino acids, no ASRV). Order of methods and values for k are as in Table A1.
| No. | Method | A | k | Set 1 | Set 2 | Set 3 | Set 4 | Set 5 | Set 6 | Set 7 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | | AA | — | 0.7512 | 1.0948 | 1.5985 | 2.0736 | 2.4174 | 2.9717 | 3.3904 |
| 2 | | CE | — | 1.6917 | 2.1449 | 2.6389 | 2.9273 | 3.0302 | 3.1279 | 3.1704 |
| 3 | | CE | — | 8.1736 | 8.9673 | 9.5701 | 9.8498 | 9.9464 | 10.048 | 10.076 |
| 4 | | AA | — | 0.9011 | 1.2594 | 1.7497 | 2.0307 | 2.1513 | 2.2199 | 2.2504 |
| 5 | | AA | — | 5.8913 | 7.2013 | 8.3587 | 8.8410 | 9.0104 | 9.1216 | 9.1536 |
| 6 | | CE | 5 | 0.6658 | 0.8564 | 0.9346 | 0.9541 | 0.9627 | 0.9695 | 0.9700 |
| 7 | | CE | — | 0.8030 | 0.9004 | 0.9506 | 0.9678 | 0.9781 | 0.9871 | 0.9892 |
| 8 | | CE | 5 | 531 | 568 | 582 | 588 | 590 | 590 | 592 |
| 9 | | AA | 4 | 0.8566 | 0.9553 | 0.9853 | 0.9910 | 0.9949 | 0.9953 | 0.9954 |
| 10 | | CE | 5 | 1.5378 | 1.8705 | 2.0634 | 2.1180 | 2.1465 | 2.1758 | 2.1758 |
| 11 | | AA | 4 | 554 | 580 | 590 | 592 | 594 | 594 | 594 |
| 13 | | AA | 4 | 1.7678 | 2.0641 | 2.2064 | 2.2374 | 2.2695 | 2.2695 | 2.2695 |
| 14 | | AA | — | 1.2130 | 1.3537 | 1.4238 | 1.4475 | 1.4597 | 1.4713 | 1.4728 |
| 15 | | CE | — | 0.8189 | 0.8548 | 0.8728 | 0.8774 | 0.8794 | 0.8819 | 0.8830 |
| 16 | | CE | 5 | 62.255 | 68.090 | 70.418 | 71.976 | 72.231 | 72.839 | 72.479 |
| 17 | | AA | — | 0.8366 | 0.8668 | 0.8795 | 0.8842 | 0.8861 | 0.8884 | 0.8880 |
| 19 | | AA | 4 | 3.2320 | 3.4452 | 3.5341 | 3.5909 | 3.5829 | 3.5996 | 3.5983 |
| 20 | | AA | 3 | 0.4674 | 0.4920 | 0.5044 | 0.5078 | 0.5095 | 0.5102 | 0.5105 |
| 21 | | CE | 4 | 0.4623 | 0.4888 | 0.5027 | 0.5067 | 0.5091 | 0.5097 | 0.5102 |
| 22 | | AA | (1) | 0.0154 | 0.0202 | 0.0245 | 0.0273 | 0.0296 | 0.0325 | 0.0334 |
Control data set. Average RF distance for each reference set of the synthetic control data set (sequence length of 1000 amino acids, no ASRV). For word-based methods, we show the best performing word length k for each alphabet A (AA: original amino acids; CE: chemical equivalence classes), the only exception being B-bin with CE: k = 5 is slightly better on this data set but k = 4 performs better on the other two data sets. Methods are ordered according to their rank sums ∑R. The Friedman test statistic is F = 4758.1 (P < 10⁻¹⁰). Significant differences are found at or beyond the α = 0.05 level between the following pairs (numbers refer to column “No.”): method 1 versus methods 22–2; method 2 versus methods 22–4; method 3 versus methods 22–5; methods 4 and 5 versus methods 22–6; method 6 versus methods 22–18; method 7 versus methods 22–19; methods 8–19 versus methods 22–20; and methods 20 and 21 versus method 22.
| No. | ∑R | Method | A | k | Set 1 | Set 2 | Set 3 | Set 4 | Set 5 | Set 6 | Set 7 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3228.0 | | AA | — | 0.024 | 0.044 | 0.068 | 0.092 | 0.140 | 0.160 | 0.192 |
| 2 | 4285.0 | | CE | — | 0.044 | 0.068 | 0.090 | 0.148 | 0.266 | 0.356 | 0.518 |
| 3 | 4483.5 | | CE | — | 0.040 | 0.084 | 0.096 | 0.154 | 0.276 | 0.388 | 0.556 |
| 4 | 5374.0 | | AA | — | 0.044 | 0.070 | 0.104 | 0.176 | 0.362 | 0.570 | 0.736 |
| 5 | 5650.5 | | AA | — | 0.050 | 0.076 | 0.120 | 0.176 | 0.380 | 0.612 | 0.744 |
| 6 | 8127.5 | | CE | — | 0.068 | 0.156 | 0.222 | 0.392 | 0.590 | 0.744 | 0.872 |
| 7 | 8285.5 | | AA | — | 0.076 | 0.108 | 0.234 | 0.398 | 0.660 | 0.756 | 0.872 |
| 8 | 8316.5 | | CE | 5 | 0.082 | 0.160 | 0.276 | 0.398 | 0.624 | 0.712 | 0.844 |
| 9 | 8336.5 | | CE | 5 | 0.058 | 0.124 | 0.228 | 0.402 | 0.660 | 0.778 | 0.882 |
| 10 | 8362.5 | | AA | 4 | 0.062 | 0.112 | 0.224 | 0.420 | 0.666 | 0.798 | 0.870 |
| 11 | 8452.0 | | CE | 5 | 0.052 | 0.130 | 0.240 | 0.418 | 0.662 | 0.790 | 0.882 |
| 12 | 8529.5 | | AA | 4 | 0.054 | 0.110 | 0.240 | 0.432 | 0.696 | 0.806 | 0.872 |
| 13 | 8555.0 | | CE | 5 | 0.060 | 0.128 | 0.244 | 0.430 | 0.676 | 0.784 | 0.880 |
| 14 | 8572.0 | | AA | 4 | 0.062 | 0.108 | 0.240 | 0.436 | 0.688 | 0.804 | 0.880 |
| 15 | 8706.0 | | AA | 4 | 0.076 | 0.156 | 0.274 | 0.440 | 0.684 | 0.746 | 0.862 |
| 16 | 8846.5 | | CE | — | 0.066 | 0.146 | 0.268 | 0.472 | 0.672 | 0.792 | 0.868 |
| 17 | 9015.0 | | AA | 3 | 0.064 | 0.138 | 0.290 | 0.480 | 0.710 | 0.800 | 0.876 |
| 18 | 9046.0 | | AA | — | 0.072 | 0.116 | 0.270 | 0.488 | 0.712 | 0.826 | 0.890 |
| 19 | 9192.5 | | CE | 4 | 0.080 | 0.138 | 0.300 | 0.506 | 0.686 | 0.792 | 0.900 |
| 20 | 10,286.0 | | AA | 3 | 0.110 | 0.188 | 0.394 | 0.588 | 0.798 | 0.862 | 0.888 |
| 21 | 10,851.0 | | CE | 4 | 0.116 | 0.240 | 0.420 | 0.648 | 0.792 | 0.884 | 0.904 |
| 22 | 12,599.0 | | AA | (1) | 0.494 | 0.564 | 0.688 | 0.700 | 0.836 | 0.868 | 0.892 |
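
The rank sums ∑R and the Friedman statistic reported with this table can be illustrated with a short sketch. It assumes that each block is one test case holding one error score per method, that ties receive average ranks, and that the statistic is the classical form without tie correction. The function names are hypothetical, and the paper's exact procedure may differ in these details.

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman(blocks):
    """Return (rank sums per method, Friedman statistic without tie correction)."""
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for block in blocks:
        for method, rank in enumerate(average_ranks(block)):
            rank_sums[method] += rank
    stat = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return rank_sums, stat
```

A lower rank sum means that a method more often produced the smaller error within a block, which is the ordering used in the "No." column.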
ASRV data set. Average RF distance for each reference set of the synthetic ASRV data set (sequence length of 1000 amino acids, high ASRV with α = 0.5). Order of methods and values for k are determined as in Table 1. The Friedman test statistic is F = 4873.2 (P < 10⁻¹⁰). Significant differences are found at or beyond the α = 0.05 level between the following pairs (numbers refer to column “No.”): method 1 versus methods 22–2; methods 2–5 versus methods 22–6; methods 6–8 versus methods 22–12; method 9 versus methods 22–14; method 10 versus methods 22–17; method 11 versus methods 22–19; methods 12–19 versus methods 22–20; and methods 20 and 21 versus method 22.
| No. | ∑R | Method | A | k | Set 1 | Set 2 | Set 3 | Set 4 | Set 5 | Set 6 | Set 7 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4571.5 | | AA | — | 0.040 | 0.068 | 0.078 | 0.108 | 0.144 | 0.202 | 0.238 |
| 2 | 4958.5 | | AA | — | 0.040 | 0.066 | 0.100 | 0.122 | 0.188 | 0.226 | 0.312 |
| 3 | 5121.5 | | AA | — | 0.042 | 0.070 | 0.108 | 0.130 | 0.196 | 0.244 | 0.316 |
| 4 | 5647.5 | | CE | — | 0.056 | 0.082 | 0.122 | 0.158 | 0.214 | 0.278 | 0.360 |
| 5 | 5722.0 | | CE | — | 0.058 | 0.092 | 0.126 | 0.154 | 0.216 | 0.282 | 0.364 |
| 6 | 7329.5 | | AA | 4 | 0.072 | 0.114 | 0.158 | 0.226 | 0.350 | 0.400 | 0.498 |
| 7 | 7350.0 | | AA | 4 | 0.074 | 0.116 | 0.146 | 0.228 | 0.348 | 0.430 | 0.492 |
| 8 | 7353.5 | | AA | 4 | 0.078 | 0.110 | 0.154 | 0.230 | 0.354 | 0.406 | 0.498 |
| 9 | 7628.0 | | AA | — | 0.062 | 0.102 | 0.158 | 0.226 | 0.364 | 0.460 | 0.558 |
| 10 | 7741.0 | | AA | — | 0.082 | 0.124 | 0.180 | 0.248 | 0.368 | 0.440 | 0.506 |
| 11 | 8177.5 | | AA | 3 | 0.090 | 0.112 | 0.174 | 0.244 | 0.400 | 0.510 | 0.582 |
| 12 | 8424.5 | | CE | 5 | 0.092 | 0.146 | 0.202 | 0.248 | 0.386 | 0.488 | 0.596 |
| 13 | 8452.5 | | AA | 4 | 0.082 | 0.136 | 0.182 | 0.272 | 0.440 | 0.484 | 0.608 |
| 14 | 8535.5 | | CE | — | 0.082 | 0.120 | 0.186 | 0.238 | 0.420 | 0.550 | 0.640 |
| 15 | 8546.5 | | CE | 5 | 0.086 | 0.150 | 0.202 | 0.258 | 0.412 | 0.496 | 0.604 |
| 16 | 8593.5 | | CE | 5 | 0.086 | 0.132 | 0.192 | 0.256 | 0.438 | 0.514 | 0.624 |
| 17 | 8664.0 | | CE | — | 0.106 | 0.152 | 0.220 | 0.270 | 0.402 | 0.492 | 0.588 |
| 18 | 9025.0 | | CE | 4 | 0.090 | 0.130 | 0.238 | 0.280 | 0.460 | 0.540 | 0.660 |
| 19 | 9119.5 | | CE | 5 | 0.102 | 0.164 | 0.220 | 0.294 | 0.452 | 0.556 | 0.634 |
| 20 | 10,511.0 | | AA | 3 | 0.116 | 0.212 | 0.278 | 0.394 | 0.574 | 0.644 | 0.720 |
| 21 | 11,216.5 | | CE | 4 | 0.126 | 0.214 | 0.330 | 0.488 | 0.620 | 0.716 | 0.780 |
| 22 | 14,411.0 | | AA | (1) | 0.502 | 0.632 | 0.708 | 0.786 | 0.854 | 0.866 | 0.880 |
Short-sequences data set. Average RF distance for each reference set of the synthetic short-sequences data set (sequence length of 300 amino acids, no ASRV). Order of methods and values for k are determined as in Table 1. The Friedman test statistic is F = 3693.4 (P < 10⁻¹⁰). Significant differences are found at or beyond the α = 0.05 level between the following pairs (numbers refer to column “No.”): method 1 versus methods 22–2; methods 2 and 3 versus methods 22–4; methods 4 and 5 versus methods 22–6; methods 6–19 versus methods 22–20; and methods 20 and 21 versus method 22.
| No. | ∑R | Method | A | k | Set 1 | Set 2 | Set 3 | Set 4 | Set 5 | Set 6 | Set 7 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3624.5 | | AA | — | 0.060 | 0.102 | 0.138 | 0.178 | 0.244 | 0.304 | 0.350 |
| 2 | 4765.5 | | CE | — | 0.076 | 0.108 | 0.172 | 0.218 | 0.360 | 0.492 | 0.632 |
| 3 | 4836.5 | | CE | — | 0.064 | 0.098 | 0.176 | 0.218 | 0.356 | 0.534 | 0.658 |
| 4 | 5827.5 | | AA | — | 0.080 | 0.106 | 0.198 | 0.266 | 0.498 | 0.662 | 0.750 |
| 5 | 5984.0 | | AA | — | 0.062 | 0.100 | 0.204 | 0.272 | 0.504 | 0.712 | 0.764 |
| 6 | 8171.5 | | CE | 5 | 0.110 | 0.180 | 0.308 | 0.456 | 0.684 | 0.794 | 0.838 |
| 7 | 8206.5 | | CE | — | 0.112 | 0.180 | 0.312 | 0.464 | 0.670 | 0.806 | 0.830 |
| 8 | 8251.0 | | CE | 5 | 0.096 | 0.170 | 0.338 | 0.462 | 0.700 | 0.798 | 0.850 |
| 9 | 8258.5 | | AA | 4 | 0.074 | 0.146 | 0.322 | 0.468 | 0.714 | 0.802 | 0.892 |
| 10 | 8359.0 | | CE | 5 | 0.108 | 0.186 | 0.328 | 0.468 | 0.706 | 0.792 | 0.842 |
| 11 | 8442.0 | | AA | 4 | 0.068 | 0.156 | 0.322 | 0.490 | 0.730 | 0.816 | 0.890 |
| 12 | 8456.5 | | CE | 4 | 0.106 | 0.170 | 0.370 | 0.486 | 0.706 | 0.784 | 0.834 |
| 13 | 8475.5 | | AA | 4 | 0.078 | 0.146 | 0.330 | 0.492 | 0.728 | 0.820 | 0.892 |
| 14 | 8479.5 | | AA | — | 0.088 | 0.158 | 0.338 | 0.526 | 0.702 | 0.802 | 0.854 |
| 15 | 8558.5 | | CE | — | 0.092 | 0.162 | 0.332 | 0.514 | 0.744 | 0.822 | 0.838 |
| 16 | 8628.5 | | CE | 5 | 0.128 | 0.222 | 0.362 | 0.492 | 0.696 | 0.794 | 0.808 |
| 17 | 8697.0 | | AA | — | 0.068 | 0.154 | 0.334 | 0.544 | 0.762 | 0.842 | 0.860 |
| 18 | 8791.5 | | AA | 3 | 0.086 | 0.176 | 0.354 | 0.538 | 0.738 | 0.852 | 0.832 |
| 19 | 9016.5 | | AA | 4 | 0.104 | 0.244 | 0.358 | 0.524 | 0.730 | 0.816 | 0.868 |
| 20 | 10,198.0 | | AA | 3 | 0.116 | 0.252 | 0.444 | 0.614 | 0.816 | 0.890 | 0.906 |
| 21 | 10,964.0 | | CE | 4 | 0.176 | 0.338 | 0.506 | 0.692 | 0.836 | 0.890 | 0.884 |
| 22 | 12,108.0 | | AA | (1) | 0.482 | 0.546 | 0.668 | 0.734 | 0.800 | 0.872 | 0.886 |
Figure 2. Average RF distance for six methods. Average RF distance (y-axis) for six selected methods on all seven reference sets (x-axis) of two synthetic data sets (a: control; b: ASRV). For each data set, we show (1) the ML distance estimate based on correct alignments; (2) the best pattern-based variant; (3 and 4) the best word-based method and the best method not based on words; (5) the best composition distance; and (6) the W-metric. The numbers in the inserted legends refer to the far left-hand column of Table 1 (Figure 2a) and Table 2 (Figure 2b), respectively.
FN distance (× 10) for putative orthologs data set. Average FN distance (multiplied by 10) for each reference set of the putative orthologs data set. For word-based methods, we show the best performing word length k for each alphabet A. Methods are ordered according to their rank sums ∑R.
| No. | ∑R | Method | A | k | F-S | F-L | M-S | M-L |
|---|---|---|---|---|---|---|---|---|
| 1 | 15.5 | | CE | — | 0.607 | 0.272 | 0.735 | 0.866 |
| 2 | 16.5 | | CE | — | 0.536 | 0.272 | 0.837 | 0.984 |
| 3 | 18.0 | | AA | 3 | 0.473 | 0.167 | 0.937 | 1.252 |
| 4 | 22.0 | | AA | 4 | 0.580 | 0.304 | 0.840 | 1.042 |
| 5 | 23.5 | | AA | — | 0.533 | 0.272 | 0.754 | 1.337 |
| 6 | 27.5 | | AA | — | 0.650 | 0.385 | 0.712 | 1.053 |
| 7.5 | 37.0 | | AA | 4 | 0.713 | 0.353 | 0.880 | 1.182 |
| 7.5 | 37.0 | | CE | 6 | 0.657 | 0.256 | 1.022 | 1.337 |
| 9 | 38.5 | | CE | 4 | 0.533 | 0.272 | 1.338 | 1.393 |
| 10 | 44.0 | | AA | — | 0.763 | 0.423 | 0.897 | 1.074 |
| 11 | 45.5 | | AA | 3 | 0.747 | 0.337 | 0.869 | 1.402 |
| 12.5 | 47.0 | | AA | 4 | 0.697 | 0.449 | 0.998 | 1.259 |
| 12.5 | 47.0 | | CE | 4 | 0.833 | 0.176 | 1.170 | 1.344 |
| 14 | 49.0 | | CE | 6 | 0.800 | 0.353 | 0.991 | 1.328 |
| 15 | 49.5 | | CE | — | 0.673 | 0.337 | 1.004 | 1.465 |
| 16 | 53.0 | | CE | 4 | 0.840 | 0.224 | 1.139 | 1.619 |
| 17 | 55.5 | | AA | — | 0.713 | 0.385 | 1.454 | 1.305 |
| 18 | 64.0 | | CE | — | 0.973 | 0.321 | 1.453 | 1.437 |
| 19.5 | 75.0 | | AA | 4 | 0.847 | 0.978 | 1.413 | 2.374 |
| 19.5 | 75.0 | | CE | 4 | 0.807 | 1.346 | 1.832 | 2.183 |
DPB count for putative orthologs data set. Count of unrecovered DPBs for each reference set of the putative orthologs data set; numbers in parentheses indicate set size/maximal possible values. Values for k are identical to Table 3; order of methods is determined as in Table 3.
| No. | ∑R | Method | A | k | F-S (50) | F-L (52) | M-S (80) | M-L (38) |
|---|---|---|---|---|---|---|---|---|
| 1 | 20.5 | | CE | — | 1 | 1 | 3 | 1 |
| 2 | 24.5 | | CE | — | 1 | 1 | 4 | 1 |
| 3 | 25.5 | | CE | — | 1 | 1 | 3 | 2 |
| 4.5 | 29.5 | | CE | 6 | 2 | 1 | 3 | 1 |
| 4.5 | 29.5 | | CE | 6 | 1 | 1 | 4 | 2 |
| 6 | 31.0 | | AA | — | 1 | 2 | 4 | 1 |
| 7 | 32.0 | | AA | 4 | 1 | 2 | 3 | 2 |
| 8.5 | 41.0 | | AA | 3 | 1 | 0 | 8 | 4 |
| 8.5 | 41.0 | | CE | 4 | 1 | 0 | 8 | 4 |
| 10 | 42.0 | | AA | 4 | 1 | 3 | 5 | 2 |
| 11 | 42.5 | | CE | 4 | 2 | 0 | 6 | 3 |
| 12.5 | 44.0 | | AA | 3 | 1 | 1 | 6 | 4 |
| 12.5 | 44.0 | | AA | 4 | 1 | 2 | 5 | 3 |
| 14 | 45.0 | | AA | — | 2 | 2 | 4 | 2 |
| 15 | 45.5 | | AA | — | 1 | 1 | 14 | 3 |
| 16.5 | 48.0 | | CE | 4 | 3 | 0 | 5 | 4 |
| 16.5 | 48.0 | | CE | — | 2 | 1 | 8 | 2 |
| 18 | 54.5 | | AA | — | 2 | 1 | 7 | 4 |
| 19 | 73.5 | | AA | 4 | 2 | 4 | 15 | 11 |
| 20 | 78.5 | | CE | 4 | 3 | 7 | 17 | 5 |
Duration of distance calculation. Duration of the distance calculation step for two variants of the pattern-based method (d, d). We present the time (measured in seconds) averaged over 100 sets of sequences in any given reference set (sets with the lowest/highest phylogenetic distances from the synthetic data sets are used) encoded using two alphabets A. The hardware consisted of a 64-bit 2.4-GHz x86-compatible Intel processor.
| Method | A | Control, set 1 | Control, set 7 | ASRV, set 1 | ASRV, set 7 | Short-sequences, set 1 | Short-sequences, set 7 |
|---|---|---|---|---|---|---|---|
| | CE | 1084 | 1045 | 864 | 1172 | 103.7 | 87.3 |
| | CE | 81.1 | 97.3 | 63.4 | 103.6 | 7.10 | 7.46 |
| | AA | 76.2 | 36.0 | 97.2 | 68.9 | 11.77 | 3.24 |
| | AA | 5.33 | 3.62 | 5.77 | 5.26 | 0.79 | 0.39 |
Extent of the burn-in phase. Summary of the extent of the burn-in phase (measured in samples; e.g., 100 samples correspond to 10,000 generations). We show results for the overall best performing word length k under each alphabet A for the Bayesian phylogenetic inference from k-mers with a binary encoding (B-bin).
| Synthetic data set | A | k | Upper quartile | Median | Lower quartile |
|---|---|---|---|---|---|
| Control | AA | 3 | 176 | 149 | 126 |
| Control | CE | 4 | 152 | 129 | 108 |
| ASRV | AA | 3 | 158 | 132 | 110 |
| ASRV | CE | 4 | 141 | 117 | 96 |
| Short-sequences | AA | 3 | 119 | 99 | 82 |
| Short-sequences | CE | 4 | 106 | 88 | 72 |
Convergence measured by δ ratios. Summary of assessment of convergence for method B-bin as measured by δ ratios of adjacent versus nonadjacent fragments. We show results for the overall best performing word length k under each alphabet A.
| Synthetic data set | A | k | Upper quartile | Median | Lower quartile |
|---|---|---|---|---|---|
| Control | AA | 3 | 1.023 | 1.002 | 0.978 |
| Control | CE | 4 | 1.027 | 1.001 | 0.976 |
| ASRV | AA | 3 | 1.029 | 1.000 | 0.971 |
| ASRV | CE | 4 | 1.028 | 0.998 | 0.972 |
| Short-sequences | AA | 3 | 1.021 | 1.000 | 0.980 |
| Short-sequences | CE | 4 | 1.020 | 0.999 | 0.979 |
Distances for ASRV data set. Median of calculated distances for each reference set of the synthetic ASRV data set (sequence length of 1000 amino acids, high ASRV with α = 0.5). Order of methods and values for k are as in Table 2.
| No. | Method | A | k | Set 1 | Set 2 | Set 3 | Set 4 | Set 5 | Set 6 | Set 7 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | | AA | — | 0.4635 | 0.6077 | 0.7746 | 0.9114 | 1.0031 | 1.1273 | 1.2114 |
| 2 | | AA | — | 0.7188 | 0.8724 | 1.0516 | 1.1905 | 1.2653 | 1.3803 | 1.4582 |
| 3 | | AA | — | 5.1888 | 5.8917 | 6.5781 | 7.0346 | 7.2585 | 7.5753 | 7.7762 |
| 4 | | CE | — | 1.6066 | 1.8151 | 2.0460 | 2.2236 | 2.3105 | 2.4451 | 2.5255 |
| 5 | | CE | — | 7.9433 | 8.3644 | 8.7589 | 9.0329 | 9.1515 | 9.3314 | 9.4407 |
| 6 | | AA | 4 | 0.6591 | 0.7971 | 0.8737 | 0.9111 | 0.9278 | 0.9431 | 0.9511 |
| 7 | | AA | 4 | 1612 | 1744 | 1834 | 1882 | 1904 | 1924 | 1938 |
| 8 | | AA | 4 | 1.2087 | 1.4550 | 1.6646 | 1.7946 | 1.8633 | 1.9301 | 1.9724 |
| 9 | | AA | — | 0.8026 | 0.8423 | 0.8678 | 0.8799 | 0.8854 | 0.8910 | 0.8942 |
| 10 | | AA | — | 1.0164 | 1.1450 | 1.2363 | 1.2858 | 1.3101 | 1.3347 | 1.3484 |
| 12 | | CE | 5 | 0.3952 | 0.5859 | 0.7109 | 0.7782 | 0.8081 | 0.8368 | 0.8528 |
| 13 | | AA | 4 | 0.2535 | 0.2764 | 0.2913 | 0.2995 | 0.3034 | 0.3081 | 0.3119 |
| 14 | | CE | — | 0.7958 | 0.8382 | 0.8631 | 0.8762 | 0.8821 | 0.8876 | 0.8902 |
| 15 | | CE | 5 | 1.0672 | 1.3031 | 1.4939 | 1.6150 | 1.6748 | 1.7413 | 1.7762 |
| 16 | | CE | 5 | 1556 | 1708 | 1808 | 1862 | 1888 | 1914 | 1926 |
| 17 | | CE | — | 0.7131 | 0.8085 | 0.8713 | 0.9053 | 0.9202 | 0.9359 | 0.9454 |
| 19 | | CE | 5 | 4.9359 | 5.4552 | 5.7908 | 5.9795 | 6.0650 | 6.1590 | 6.2287 |
| 20 | | AA | 3 | 0.4121 | 0.4457 | 0.4665 | 0.4783 | 0.4850 | 0.4906 | 0.4933 |
| 21 | | CE | 4 | 0.4167 | 0.4493 | 0.4711 | 0.4818 | 0.4877 | 0.4938 | 0.4972 |
| 22 | | AA | (1) | 0.0032 | 0.0040 | 0.0049 | 0.0053 | 0.0058 | 0.0062 | 0.0064 |