Nam Nguyen, Siavash Mirarab, Tandy Warnow.
Abstract
BACKGROUND: Supertree methods combine trees on subsets of the full taxon set to produce a tree on the entire set of taxa. Of the many supertree methods, the most popular is MRP (Matrix Representation with Parsimony), which first encodes the input set of source trees as a large matrix (the "MRP matrix") over {0, 1, ?} and then runs maximum parsimony heuristics on that matrix. Experimental studies comparing MRP with other supertree methods have established that, for large datasets, MRP generally produces trees of equal or greater accuracy than other methods and can run on larger datasets. A recent development in supertree methods is SuperFine+MRP, a method that combines MRP with a divide-and-conquer approach and produces more accurate trees in less time than MRP. In this paper we consider a new approach for supertree estimation, called MRL (Matrix Representation with Likelihood). MRL begins with the same MRP matrix but then analyzes it using heuristics (such as RAxML) for 2-state Maximum Likelihood.
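The matrix encoding described in the abstract can be illustrated with a small sketch. It assumes the usual MRP convention of one column per internal edge of each source tree, with `1` for taxa on one side of the edge, `0` for the other taxa in that tree, and `?` for taxa absent from the tree. The input format and the name `mrp_matrix` are hypothetical, chosen only for illustration, and are not taken from the paper's software.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical input format: a source tree is summarised by its leaf set and,
# for each internal edge, the set of leaves on one side of that edge.
SourceTree = Tuple[FrozenSet[str], List[FrozenSet[str]]]


def mrp_matrix(source_trees: List[SourceTree]) -> Dict[str, str]:
    """Return one row per taxon; each column encodes one internal edge."""
    all_taxa = sorted(set().union(*(leaves for leaves, _ in source_trees)))
    rows = {taxon: [] for taxon in all_taxa}
    for leaves, clades in source_trees:
        for clade in clades:
            for taxon in all_taxa:
                if taxon not in leaves:
                    rows[taxon].append("?")  # taxon absent from this source tree
                elif taxon in clade:
                    rows[taxon].append("1")  # on the chosen side of the edge
                else:
                    rows[taxon].append("0")  # on the other side of the edge
    return {taxon: "".join(chars) for taxon, chars in rows.items()}


# Two toy source trees: AB|CD on taxa A-D, and BC|DE on taxa B-E.
trees = [
    (frozenset("ABCD"), [frozenset("AB")]),
    (frozenset("BCDE"), [frozenset("BC")]),
]
matrix = mrp_matrix(trees)
for taxon, row in matrix.items():
    print(taxon, row)
# Rows: A 1?, B 11, C 01, D 00, E ?0
```

MRP would then search for a tree minimising the parsimony score of this matrix, while MRL would search for a tree maximising its likelihood under a two-state model.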
Year: 2012 PMID: 22280525 PMCID: PMC3308190 DOI: 10.1186/1748-7188-7-3
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Statistics for biological datasets
| Dataset | Number of taxa | Number of source trees | Scaffold density | Resolution of SCM tree | Reference |
|---|---|---|---|---|---|
| Placental | 116 | 726 | 1.00 | 0.01 | [ |
| Seabirds | 121 | 7 | 0.74 | 0.57 | [ |
| Marsupials | 267 | 158 | 1.00 | 0.10 | [ |
| THPL | 558 | 19 | 0.25 | 0.57 | [ |
| CPL | 2,228 | 39 | 0.74 | 0.52 | [ |
We show the number of source trees, total taxa, resolution of the strict consensus merger tree, and the source of the original data for each of the biological datasets. The scaffold density is the proportion of the total taxa present in the largest source tree.
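As a minimal sketch of the scaffold density definition above (the fraction of all taxa that appear in the largest source tree), assuming each source tree is given simply as a set of taxon names; the function name is hypothetical:

```python
def scaffold_density(leaf_sets):
    """Fraction of the full taxon set covered by the largest source tree."""
    all_taxa = set().union(*leaf_sets)
    largest = max(leaf_sets, key=len)
    return len(largest) / len(all_taxa)


# Toy example: four taxa overall, the largest source tree has three of them.
density = scaffold_density([{"A", "B", "C"}, {"C", "D"}])
print(density)  # 0.75
```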
Figure 1. Average missing branch rates and running times for 1000-taxon model conditions. The average missing branch rates and running times (in minutes) for the supertree methods under the 1000-taxon model conditions, as a function of scaffold density. The standard error is shown for the missing branch rates, and the standard deviation is shown for the running times. Averages are computed only on replicates where there is sufficient taxonomic overlap to perform an accurate supertree analysis. n = 10 for all scaffold densities except n = 7 for the 20% scaffold density and n = 9 for the 50% scaffold density.
Missing branch rates for 1000-taxon model conditions
| Method / Scaffold density (%) | 20 | 50 | 75 | 100 | Average |
|---|---|---|---|---|---|
| MRP(PAUP*) | 20.7 (1.1) | 17.7 (0.8) | 16.2 (0.8) | 11.7 (0.9) | 16.2 (0.7) |
| MRP(TNT) | 27.9 (1.6) | 19.1 (1.3) | 14.9 (1.1) | 11.7 (0.8) | 17.6 (1.1) |
| MRL(RAxML) | 13.8 (1.0) | 11.9 (0.8) | |||
| SCM | 22.6 (0.7) | 22.7 (0.7) | 21.0 (0.7) | 19.0 (0.6) | 21.2 (0.4) |
| SuperFine+MRP(PAUP*) | 14.7 (0.7) | 13.7 (0.9) | |||
| SuperFine+MRP(TNT) | 14.5 (0.7) | 11.8 (0.8) | |||
| SuperFine+MRL(RAxML) | 16.1 (0.8) | 15.0 (0.5) | 13.9 (0.9) | 11.9 (0.8) | 14.0 (0.5) |
We show the average missing branch rates (reported as %) on the 1000-taxon datasets. The missing branch rate is calculated as the total number of FN (false negative) edges in the model tree divided by the total number of internal edges in the model tree. Each simulated dataset has 25 clade-based source trees and 1 scaffold tree. The scaffold density is the percentage of the full taxon set that is present in the scaffold tree. The standard error is shown in parentheses. n = 7 for the 20% scaffold density, n = 9 for the 50% scaffold density, and n = 10 for the remaining scaffold densities. n = 36 for the average. The lowest missing branch rate for each scaffold density is shown in bold.
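The missing branch rate can be sketched as a set difference between bipartitions. The example below assumes each internal edge is given as the set of taxa on one side, and normalises every bipartition to the side that excludes the alphabetically smallest taxon so that the two trees can be compared. The helper names and the toy trees are hypothetical.

```python
def canonical(side, taxa):
    """Represent a bipartition by the side that excludes the smallest taxon."""
    side = frozenset(side)
    return side if min(taxa) not in side else frozenset(taxa) - side


def missing_branch_rate(model_splits, estimated_splits, taxa):
    """FN edges of the model tree divided by its number of internal edges."""
    model = {canonical(s, taxa) for s in model_splits}
    estimated = {canonical(s, taxa) for s in estimated_splits}
    false_negatives = model - estimated
    return len(false_negatives) / len(model)


taxa = set("ABCDEF")
model_splits = [set("AB"), set("ABC"), set("EF")]      # AB|CDEF, ABC|DEF, ABCD|EF
estimated_splits = [set("AB"), set("ABD"), set("EF")]  # ABC|DEF is not recovered
rate = missing_branch_rate(model_splits, estimated_splits, taxa)
print(f"{100 * rate:.1f}%")  # 33.3%
```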
Missing branch rates for 500-taxon model conditions
| Method / Scaffold density (%) | 20 | 50 | 75 | 100 | Average |
|---|---|---|---|---|---|
| MRP(PAUP*) | 22.1 (1.0) | 18.8 (0.6) | 14.7 (0.7) | 16.4 (0.5) | |
| MRP(TNT) | 29.4 (1.7) | 18.4 (0.9) | 14.1 (0.6) | 11.2 (0.4) | 17.7 (0.8) |
| MRL(RAxML) | 15.9 (0.6) | 14.0 (0.5) | 12.9 (0.5) | 11.2 (0.4) | 13.4 (0.3) |
| SCM | 22.3 (0.6) | 21.6 (0.5) | 20.6 (0.6) | 18.6 (0.6) | 20.7 (0.3) |
| SuperFine+MRP(PAUP*) | 15.2 (0.5) | 14.0 (0.4) | 12.5 (0.4) | 13.1 (0.3) | |
| SuperFine+MRP(TNT) | 11.2 (0.4) | 13.0 (0.3) | |||
| SuperFine+MRL(RAxML) | 15.4 (0.5) | 14.2 (0.4) | 13.1 (0.4) | 11.3 (0.4) | 13.4 (0.3) |
We present the average missing branch rates (reported as %) on the 500-taxon datasets. The missing branch rate is calculated as the total number of FN (false negative) edges in the model tree divided by the total number of internal edges in the model tree. Each simulated dataset has 15 clade-based source trees and 1 scaffold tree. The scaffold density is the percentage of the full taxon set that is present in the scaffold tree. The standard error is shown in parentheses. n = 24 for the 20% scaffold density, and n = 30 for the remaining scaffold densities. n = 114 for the average. The lowest missing branch rate for each scaffold density is shown in bold.
False positive rates for 1000-taxon model conditions
| Method / Scaffold density (%) | 20 | 50 | 75 | 100 | Average |
|---|---|---|---|---|---|
| MRP(PAUP*) | 20.7 (1.1) | 17.7 (0.8) | 16.1 (0.8) | 11.7 (0.9) | 16.2 (0.7) |
| MRP(TNT) | 27.9 (1.6) | 19.1 (1.3) | 14.9 (1.1) | 11.7 (0.8) | 17.6 (1.1) |
| MRL(RAxML) | 15.7 (0.7) | 14.1 (0.6) | 13.8 (1.0) | 11.9 (0.8) | 13.7 (0.5) |
| SCM | |||||
| SuperFine+MRP(PAUP*) | 14.4 (0.6) | 13.2 (0.6) | 12.7 (0.8) | 11.6 (0.8) | 12.8 (0.4) |
| SuperFine+MRP(TNT) | 14.4 (0.7) | 13.0 (0.6) | 12.6 (0.8) | 11.8 (0.8) | 12.8 (0.4) |
| SuperFine+MRL(RAxML) | 14.8 (0.7) | 13.5 (0.5) | 12.9 (0.8) | 11.9 (0.8) | 13.1 (0.4) |
We show the average false positive rates (reported as %) on the 1000-taxon datasets. The false positive rate is calculated as the total number of FP (false positive) edges in the estimated tree divided by the total number of internal edges in the estimated tree. Each simulated dataset has 25 clade-based source trees and 1 scaffold tree. The scaffold density is the percentage of the full taxon set that is present in the scaffold tree. The standard error is shown in parentheses. n = 7 for the 20% scaffold density, n = 9 for the 50% scaffold density, and n = 10 for the remaining scaffold densities. n = 36 for the average. The lowest false positive rate for each scaffold density is shown in bold.
False positive rates for 500-taxon model conditions
| Method / Scaffold density (%) | 20 | 50 | 75 | 100 | Average |
|---|---|---|---|---|---|
| MRP(PAUP*) | 22.1 (1.0) | 18.8 (0.6) | 14.7 (0.7) | 11.1 (0.4) | 16.4 (0.5) |
| MRP(TNT) | 29.4 (1.7) | 18.4 (0.9) | 14.1 (0.6) | 11.2 (0.4) | 17.7 (0.8) |
| MRL(RAxML) | 15.9 (0.6) | 14.0 (0.5) | 12.9 (0.5) | 11.2 (0.4) | 13.4 (0.3) |
| SCM | |||||
| SuperFine+MRP(PAUP*) | 13.9 (0.5) | 12.6 (0.4) | 11.5 (0.4) | 11.1 (0.4) | 12.2 (0.2) |
| SuperFine+MRP(TNT) | 13.8 (0.6) | 12.5 (0.4) | 11.4 (0.4) | 11.2 (0.4) | 12.1 (0.2) |
| SuperFine+MRL(RAxML) | 14.2 (0.5) | 12.8 (0.4) | 12.1 (0.4) | 11.3 (0.4) | 12.5 (0.2) |
We present the average false positive rates (reported as %) on the 500-taxon datasets. The false positive rate is calculated as the total number of FP (false positive) edges in the estimated tree divided by the total number of internal edges in the estimated tree. Each simulated dataset has 15 clade-based source trees and 1 scaffold tree. The scaffold density is the percentage of the full taxon set that is present in the scaffold tree. The standard error is shown in parentheses. n = 24 for the 20% scaffold density, and n = 30 for the remaining scaffold densities. n = 114 for the average. The lowest false positive rate for each scaffold density is shown in bold.
MRL scores for 1000-taxon model conditions
| Method / Scaffold density (%) | 20 | 50 | 75 | 100 | Average |
|---|---|---|---|---|---|
| MRP(PAUP*) | -16632 (1870) | -19844 (1988) | -21742 (2879) | -24325 (2896) | -20991 (3684) |
| MRP(TNT) | -16584 (1861) | -19764 (1983) | -21645 (2837) | -24332 (2896) | -20937 (3687) |
| MRL(RAxML) | |||||
| SuperFine+MRP(PAUP*) | -16368 (1869) | -19714 (1995) | -21625 (2844) | -24329 (2891) | -20876 (3742) |
| SuperFine+MRP(TNT) | -16366 (1870) | -19718 (1998) | -21630 (2845) | -24326 (2892) | -20878 (3742) |
| SuperFine+MRL(RAxML) | -16389 (1872) | -19749 (1996) | -21648 (2859) | -24336 (2894) | -20897 (3741) |
| True Tree | -16852 (1929) | -20246 (2024) | -22147 (2770) | -24820 (2783) | -21385 (3714) |
We present the average MRL scores (ML scores with respect to the MRP matrix under the symmetric two-state model with gamma-distributed rates across sites, given as log-likelihood scores) for the 1000-taxon supertrees. Thus, numbers with smaller magnitude represent improvements. The scaffold density is the percentage of the full taxon set that is present in the scaffold tree. The standard deviation is shown in parentheses. All scores are rounded to the nearest integer. The lowest MRL score (in magnitude) for each scaffold density is shown in bold.
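To give a feel for what a log-likelihood under the symmetric two-state model looks like, here is a deliberately simplified sketch for just two taxa joined by a single branch. It assumes equal state frequencies, measures branch length in expected substitutions per site, sums over both states for a `?` entry, and omits the gamma rate variation mentioned above. It is not the computation RAxML performs on a full tree, and all names are hypothetical.

```python
import math


def p_change(t):
    """Probability that the two end states of a branch of length t differ."""
    return 0.5 * (1.0 - math.exp(-2.0 * t))


def pair_log_likelihood(row_a, row_b, t):
    """Log-likelihood of two aligned 0/1/? rows joined by one branch."""
    total = 0.0
    for a, b in zip(row_a, row_b):
        if a == "?" and b == "?":
            site = 1.0                        # no data: every outcome is allowed
        elif a == "?" or b == "?":
            site = 0.5                        # only one observed state, frequency 1/2
        elif a == b:
            site = 0.5 * (1.0 - p_change(t))  # same state at both ends
        else:
            site = 0.5 * p_change(t)          # different states at the two ends
        total += math.log(site)
    return total


score = pair_log_likelihood("1100?", "1010?", t=0.3)
print(round(score, 3))
```

Because each site contributes the logarithm of a probability, the score is negative, which is why values of smaller magnitude correspond to better fits.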
MRL scores for 500-taxon model conditions
| Method / Scaffold density (%) | 20 | 50 | 75 | 100 | Average |
|---|---|---|---|---|---|
| MRP(PAUP*) | -7815 (2419) | -9089 (2377) | -10242 (2360) | -11425 (2349) | -9739 (2709) |
| MRP(TNT) | -7799 (2417) | -9039 (2373) | -10218 (2370) | -11426 (2349) | -9716 (2715) |
| MRL(RAxML) | |||||
| SuperFine+MRP(PAUP*) | -7721 (2449) | -9021 (2380) | -10209 (2374) | -11424 (2349) | -9692 (2736) |
| SuperFine+MRP(TNT) | -7722 (2449) | -9022 (2380) | -10209 (2373) | -11426 (2351) | -9693 (2736) |
| SuperFine+MRL(RAxML) | -7731 (2449) | -9035 (2390) | -10221 (2377) | -11433 (2351) | -9704 (2740) |
| True Tree | -7901 (2454) | -9216 (2377) | -10390 (2376) | -11607 (2346) | -9877 (2736) |
We present the average MRL scores (ML scores with respect to the MRP matrix, given as log-likelihoods) for the 500-taxon supertrees. Thus, numbers with smaller magnitude represent improvements. The scaffold density is the percentage of the full taxon set that is present in the scaffold tree. The standard deviation is shown in parentheses. All scores are rounded to the nearest integer. The lowest MRL score (in magnitude) for each scaffold density is shown in bold.
MRP scores for 1000-taxon model conditions
| Method / Scaffold density (%) | 20 | 50 | 75 | 100 | Average |
|---|---|---|---|---|---|
| MRP(PAUP*) | 2653 (276) | 2981 (290) | 3141 (428) | 3075 (449) | |
| MRP(TNT) | 2625 (268) | 2962 (292) | 3059 (452) | ||
| MRL(RAxML) | 2618 (275) | 2969 (293) | 3143 (417) | 3424 (429) | 3075 (463) |
| SuperFine+MRP(PAUP*) | |||||
| SuperFine+MRP(TNT) | 2614 (276) | ||||
| SuperFine+MRL(RAxML) | 2628 (277) | 2980 (294) | 3141 (421) | 3407 (425) | 3075 (457) |
| True Tree | 2788 (294) | 3144 (307) | 3308 (411) | 3578 (405) | 3241 (456) |
We present the average MRP scores (MP scores with respect to the MRP matrix) for the 1000-taxon supertrees. The scaffold density is the percentage of the full taxon set that is present in the scaffold tree. The standard deviation is shown in parentheses. All scores are rounded to the nearest integer. The lowest MRP score for each scaffold density is shown in bold.
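An MP score with respect to a 0/1/? matrix can be computed column by column with Fitch's algorithm. The sketch below assumes a rooted binary tree written as nested tuples of taxon names (the parsimony score does not depend on where the root is placed) and treats `?` as compatible with either state. The function names and the toy data are hypothetical.

```python
def fitch_column(tree, states):
    """Return (possible_states, changes) for one character on a binary tree."""
    if isinstance(tree, str):  # leaf: '?' may be either state
        s = states[tree]
        return ({"0", "1"} if s == "?" else {s}), 0
    left, right = tree
    left_set, left_cost = fitch_column(left, states)
    right_set, right_cost = fitch_column(right, states)
    common = left_set & right_set
    if common:  # children agree: no change needed here
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def mrp_score(tree, matrix):
    """Sum the Fitch change counts over all columns of the matrix."""
    n_cols = len(next(iter(matrix.values())))
    return sum(
        fitch_column(tree, {taxon: row[c] for taxon, row in matrix.items()})[1]
        for c in range(n_cols)
    )


matrix = {"A": "1?", "B": "11", "C": "01", "D": "00", "E": "?0"}
compatible_tree = ((("A", "B"), "C"), ("D", "E"))
conflicting_tree = (("A", "C"), ("B", ("D", "E")))
print(mrp_score(compatible_tree, matrix))   # 2
print(mrp_score(conflicting_tree, matrix))  # 3
```

The first tree displays both encoded bipartitions, so each column needs only one change; the second tree breaks the AB|CD grouping and pays one extra step.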
MRP scores for 500-taxon model conditions
| Method / Scaffold density (%) | 20 | 50 | 75 | 100 | Average |
|---|---|---|---|---|---|
| MRP(PAUP*) | 1283 (330) | 1434 (334) | 1563 (332) | 1504 (364) | |
| MRP(TNT) | 1273 (331) | 1422 (334) | 1497 (367) | ||
| MRL(RAxML) | 1276 (338) | 1431 (336) | 1570 (336) | 1707 (339) | 1508 (372) |
| SuperFine+MRP(PAUP*) | 1422 (334) | 1557 (335) | |||
| SuperFine+MRP(TNT) | |||||
| SuperFine+MRL(RAxML) | 1277 (337) | 1431 (341) | 1568 (340) | 1704 (340) | 1507 (373) |
| True Tree | 1347 (341) | 1502 (337) | 1634 (340) | 1773 (339) | 1575 (372) |
We show the average MRP scores (MP scores with respect to the MRP matrix) for the 500-taxon supertrees. The scaffold density is the percentage of the full taxon set that is present in the scaffold tree. The standard deviation is shown in parentheses. All scores are rounded to the nearest integer. The lowest MRP score for each scaffold density is shown in bold.
Running times for 1000-taxon model conditions
| Method / Scaffold density (%) | 20 | 50 | 75 | 100 | Average |
|---|---|---|---|---|---|
| MRP(PAUP*) | 76.14 (15.45) | 55.53 (19.99) | 43.87 (7.72) | 54.56 (10.24) | 56.03 (17.67) |
| MRP(TNT) | 2.01 (0.64) | 2.95 (1.31) | 3.27 (0.59) | 4.33 (0.75) | 3.24 (1.19) |
| MRL(RAxML) | 99.86 (28.15) | 111.57 (48.06) | 81.57 (28.87) | 87.33 (11.59) | 94.22 (33.76) |
| SuperFine+MRP(PAUP*) | 7.14 (1.10) | 6.16 (1.28) | 8.13 (1.27) | 5.56 (1.09) | 6.73 (1.57) |
| SuperFine+MRP(TNT) | |||||
| SuperFine+MRL(RAxML) | 1.00 (0.14) | 1.56 (0.94) | 1.32 (0.71) | 1.27 (0.23) | 1.30 (0.64) |
We show the average running times, in minutes, to calculate the 1000-taxon supertrees estimated from 25 clade-based source trees and 1 scaffold tree. The standard deviation is shown in parentheses. The lowest running time for each scaffold density is shown in bold.
Running times for 500-taxon model conditions
| Method / Scaffold density (%) | 20 | 50 | 75 | 100 | Average |
|---|---|---|---|---|---|
| MRP(PAUP*) | 8.98 (1.66) | 8.96 (1.43) | 9.58 (2.28) | 8.12 (1.63) | 8.91 (1.86) |
| MRP(TNT) | 0.32 (0.15) | 0.42 (0.12) | 0.45 (0.11) | 0.53 (0.10) | 0.43 (0.14) |
| MRL(RAxML) | 18.99 (6.88) | 19.24 (4.72) | 20.35 (5.96) | 18.35 (4.84) | 19.24 (5.65) |
| SuperFine+MRP(PAUP*) | 4.75 (1.46) | 4.30 (1.23) | 3.24 (1.89) | 5.87 (1.53) | 4.53 (1.82) |
| SuperFine+MRP(TNT) | |||||
| SuperFine+MRL(RAxML) | 0.40 (0.13) | 0.49 (0.12) | 0.58 (0.13) | 0.55 (0.31) | 0.51 (0.20) |
We give the average running times, in minutes, to calculate the 500-taxon supertrees estimated from 15 clade-based source trees and 1 scaffold tree. The standard deviation is shown in parentheses. The lowest running time for each scaffold density is shown in bold.
Figure 2. Scatterplot of average missing branch rates versus running times for 1000-taxon 50% scaffold density model conditions. The average missing branch rate versus running time (in minutes) for the supertree methods, for 9 replicates of the 1000-taxon 50% scaffold density model conditions.
Correlation analyses for 1000-taxon model conditions
| Statistic / Scaffold density (%) | 20 | 50 | 75 | 100 |
|---|---|---|---|---|
| MRP Score | 0.770 | 0.908 | 0.968 | 0.991 |
| MRL Score | 0.988 | |||
| Sum-FN | 0.762 | 0.907 | 0.966 | |
We show the average Spearman's rank correlation coefficient between different statistics and the FN error rates of trees generated around each of the six estimated supertrees for the 1000-taxon model conditions. For each of the six estimated supertrees, 100 trees were generated using a p-ECR move, for a total of 606 trees (600 p-ECR trees plus 6 supertrees) per replicate. MRP score is the MP score of the estimated tree with respect to the MRP matrix. MRL score is the negative log-likelihood score of the estimated tree with respect to the MRP matrix. Sum-FN is the number of bipartitions in the source trees that are not present in the estimated tree, divided by the total number of bipartitions in the source trees. Coefficients with larger magnitude represent stronger correlation between the test statistic and FN error rates. The largest correlation coefficient for each scaffold density is shown in bold.
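Spearman's rank correlation is the Pearson correlation of the ranks of two variables. The following standard-library sketch (tied values receive the average of their ranks) shows how a criterion score could be correlated with FN rates over a set of trees. The numbers are made up for illustration and are not taken from the study.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical criterion scores and FN rates (%) for five candidate trees.
scores = [100, 104, 109, 115, 130]
fn_rates = [10.0, 12.5, 11.0, 15.0, 22.0]
rho = spearman(scores, fn_rates)
print(round(rho, 3))  # 0.9
```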
Correlation analyses for 500-taxon model conditions
| Statistic / Scaffold density (%) | 20 | 50 | 75 | 100 |
|---|---|---|---|---|
| MRP Score | 0.690 | 0.879 | 0.947 | 0.984 |
| MRL Score | 0.980 | |||
| Sum-FN | 0.689 | 0.879 | 0.948 | |
We show the average Spearman's rank correlation coefficient between different statistics and the FN error rates of trees generated around each of the six estimated supertrees for the 500-taxon model conditions. For each of the six estimated supertrees, 100 trees were generated using a p-ECR move, for a total of 606 trees (600 p-ECR trees plus 6 supertrees) per replicate. MRP score is the MP score of the estimated tree with respect to the MRP matrix. MRL score is the negative log-likelihood score of the estimated tree with respect to the MRP matrix. Sum-FN is the number of bipartitions in the source trees that are not present in the estimated tree, divided by the total number of bipartitions in the source trees. Coefficients with larger magnitude represent stronger correlation between the test statistic and FN error rates. The largest correlation coefficient for each scaffold density is shown in bold.
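The Sum-FN statistic defined above can be sketched as follows, under the assumption that a source-tree bipartition counts as present when some internal edge of the supertree induces it after the supertree is restricted to that source tree's taxa. Trees are again given as a leaf set plus one side of each internal edge; the names and toy data are hypothetical.

```python
def sum_fn(source_trees, supertree_splits):
    """Fraction of source-tree bipartitions absent from the supertree."""
    missing = 0
    total = 0
    for leaves, splits in source_trees:
        # Bipartitions the supertree induces on this source tree's taxa.
        induced = set()
        for side in supertree_splits:
            a = frozenset(side) & leaves
            induced.add(frozenset((a, leaves - a)))
        for side in splits:
            a = frozenset(side)
            total += 1
            if frozenset((a, leaves - a)) not in induced:
                missing += 1
    return missing / total


source_trees = [
    (frozenset("ABCD"), [frozenset("AB")]),  # AB|CD
    (frozenset("BCDE"), [frozenset("BC")]),  # BC|DE
]
good_supertree = [set("AB"), set("DE")]  # AB|CDE and ABC|DE
poor_supertree = [set("AC"), set("DE")]  # AC|BDE and ABC|DE
print(sum_fn(source_trees, good_supertree))  # 0.0
print(sum_fn(source_trees, poor_supertree))  # 0.5
```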
Sum-FN rates for the biological supertrees
| Method / Biological dataset | Placental | Seabirds | Marsupials | THPL | CPL |
|---|---|---|---|---|---|
| MRP(PAUP*) | 16.0 | 26.0 | 25.1 | 33.3 | |
| MRP(TNT) | 13.8 | 26.0 | 21.3 | 33.2 | |
| MRL(RAxML) | 36.3 | 21.0 | 26.2 | 16.2 | 35.1 |
| SuperFine+MRP(PAUP*) | 13.8 | 26.0 | 16.7 | ||
| SuperFine+MRP(TNT) | 33.4 | ||||
| SuperFine+MRL(RAxML) | 36.3 | 16.0 | 26.8 | 18.9 | 34.2 |
The lowest Sum-FN for each dataset is shown in bold.
The MRP scores (MP scores with respect to the MRP matrix) for the biological supertrees
| Method / Biological dataset | Placental | Seabirds | Marsupials | THPL | CPL |
|---|---|---|---|---|---|
| MRP(PAUP*) | 217 | 974 | 5488 | ||
| MRP(TNT) | 931 | 5477 | |||
| MRL(RAxML) | 9508 | 230 | 2286 | 890 | 5738 |
| SuperFine+MRP(PAUP*) | 214 | 902 | 5481 | ||
| SuperFine+MRP(TNT) | |||||
| SuperFine+MRL(RAxML) | 9508 | 220 | 2295 | 911 | 5671 |
The lowest MRP score for each dataset is shown in bold.
The MRL scores (ML scores with respect to the MRP matrix, given as log likelihoods) for the biological supertrees
| Method / Biological dataset | Placental | Seabirds | Marsupials | THPL | CPL |
|---|---|---|---|---|---|
| MRP(PAUP*) | -41544 | -1137 | -10977 | -5182 | -41003 |
| MRP(TNT) | -41544 | -1124 | -10974 | -5043 | -41053 |
| MRL(RAxML) | |||||
| SuperFine+MRP(PAUP*) | -41543 | -1124 | -10974 | -4845 | -40890 |
| SuperFine+MRP(TNT) | -41546 | -1120 | -10968 | -4800 | -40923 |
| SuperFine+MRL(RAxML) | -1128 | -10980 | -4799 | -40533 | |
Numbers with smaller magnitude represent improvements. All scores are rounded to the nearest integer. The lowest MRL score (in magnitude) for each dataset is shown in bold.
Running times, in minutes, for the biological supertrees
| Method / Biological dataset | Placental | Seabirds | Marsupials | THPL | CPL |
|---|---|---|---|---|---|
| MRP(PAUP*) | 3.57 | 0.22 | 3.87 | 31.97 | 675.00 |
| MRP(TNT) | 0.38 | 29.82 | |||
| MRL(RAxML) | 7.47 | 0.45 | 7.20 | 25.37 | 461.82 |
| SuperFine+MRP(PAUP*) | 4.00 | 0.20 | 2.60 | 0.72 | 21.97 |
| SuperFine+MRP(TNT) | 1.30 | 0.07 | 0.67 | ||
| SuperFine+MRL(RAxML) | 9.23 | 0.05 | 5.00 | 0.47 | 29.02 |
The lowest running time for each dataset is shown in bold.