Heroen Verbruggen, Christine A Maggs, Gary W Saunders, Line Le Gall, Hwan Su Yoon, Olivier De Clerck.
Abstract
BACKGROUND: The assembly of the tree of life has seen significant progress in recent years, but algae and protists have been largely overlooked in this effort. Many groups of algae and protists have ancient roots, and it is unclear how much data will be required to resolve their phylogenetic relationships for incorporation in the tree of life. The red algae, a group of primary photosynthetic eukaryotes more than a billion years old, provide the earliest fossil evidence for eukaryotic multicellularity and sexual reproduction. Despite this evolutionary significance, their phylogenetic relationships are understudied. This study aims to infer a comprehensive red algal tree of life at the family level from a supermatrix containing data mined from GenBank. We aim to locate remaining regions of low support in the topology, evaluate their causes and estimate the amount of data required to resolve them.
Year: 2010 PMID: 20089168 PMCID: PMC2826327 DOI: 10.1186/1471-2148-10-16
Source DB: PubMed Journal: BMC Evol Biol ISSN: 1471-2148 Impact factor: 3.260
Figure 1. Data availability matrix. Graphical representation of our concatenated alignment, showing the availability of sequence data. The color of each column and row header indicates the amount of data available for that column or row. Green indicates high data availability, red indicates low data availability and yellow/orange represents intermediate data availability. The matrix density is 34% in a locus × OTU context and 35% in a character × OTU context. Numbers in cells indicate the length of the sequence in the alignment, which may include gaps and/or exclude ambiguously aligned regions. Figure generated with the gDAM software (http://www.phycoweb.net).
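The two density figures quoted in the caption can be illustrated with a minimal sketch. It assumes that density in a locus × OTU context is the fraction of filled cells, and that density in a character × OTU context weights each filled cell by the alignment length of its locus. The function name `matrix_density` and the toy data are hypothetical and are not taken from gDAM or from the study.

```python
def matrix_density(present, locus_lengths):
    """Return (locus x OTU density, character x OTU density).

    present[i][j] is True when OTU i has sequence data for locus j.
    locus_lengths[j] is the alignment length of locus j.
    Both definitions are assumptions made for illustration.
    """
    n_otus = len(present)
    n_loci = len(locus_lengths)

    # Locus x OTU context: every cell counts equally.
    filled_cells = sum(1 for row in present for p in row if p)
    locus_density = filled_cells / (n_otus * n_loci)

    # Character x OTU context: each filled cell counts for its locus length.
    filled_chars = sum(
        length
        for row in present
        for p, length in zip(row, locus_lengths)
        if p
    )
    char_density = filled_chars / (n_otus * sum(locus_lengths))
    return locus_density, char_density


# Toy matrix: 4 OTUs x 3 loci (hypothetical values).
present = [
    [True, True, False],
    [True, False, False],
    [True, True, True],
    [False, True, False],
]
locus_lengths = [1000, 500, 1500]
locus_d, char_d = matrix_density(present, locus_lengths)
print(f"locus x OTU: {locus_d:.1%}, character x OTU: {char_d:.1%}")
```

The two values differ whenever long and short loci are sampled unevenly, which may be why the caption reports both.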
Figure 2. Red algal tree of life with current taxonomic classification. The tree was reconstructed using Bayesian phylogenetic inference of DNA data mined from GenBank (Figure 1). Branch colors indicate statistical support of the clades: whereas black branches are strongly supported, the orange parts of the tree are poorly resolved. Intermediate colors represent intermediate support (see gradient legend). Five poorly supported regions are indicated with gray boxes (A-E). Numbers at nodes indicate branch support given as bootstrap values from maximum likelihood analysis before the vertical bar and Bayesian posterior probabilities after the vertical bar. Values are only shown if they exceed 50 and 0.95, respectively.
Likelihood based topological tests
| | lnL | BV | AU test P-value |
|---|---|---|---|
| Bayesian tree | -185,594.97 | 0% | 0.403 |
| Ceramiales | -185,607.63 | 5% | 0.186 |
| Gigartinales | -185,635.98 | 0% | 0.574 |
| region A | -185,622.35 | 0% | < 0.001 |
| region B | -185,678.38 | 0% | < 0.001 |
| region C | -185,818.04 | 0% | < 0.001 |
| region D | -185,686.91 | 0% | 0.001 |
| region E | -185,708.05 | 0% | < 0.001 |
Various alternative topologies are compared to the ML topology using an AU test. For each alternative topology (rows of the table), the lnL of the alternative topology is given along with the percentage of occurrences of the alternative topology in the unconstrained bootstrap analysis (BV), and the P-value of the AU test on a larger set of trees. On the first data line, the Bayesian tree is compared to the ML tree. In this case, the null hypothesis of the AU test is that the ML tree is not significantly more likely than the BI tree. In the middle part of the table, each of the non-monophyletic orders is listed along with the lnL of the topology in which the order is constrained to be monophyletic. In this case, the null hypothesis of the AU test is that unconstrained and constrained topologies are equally likely. In the bottom part of the table, the possibility that the poorly resolved regions represent hard polytomies is tested. The listed lnL values are for the trees in which one of the poorly resolved regions was collapsed, and in this case the null hypothesis of the AU test is that uncollapsed and collapsed topologies are equally likely. The lnL of the unconstrained, uncollapsed topology is -185,569.97.
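One way to read the P-value column is sketched below. The P-values are copied from the table; the significance threshold of 0.05 is an assumption made here for illustration, since the caption does not state one, and the helper `upper_bound` is hypothetical.

```python
ALPHA = 0.05  # assumed threshold; not stated in the caption

# AU test P-values as printed in the table.
au_p_values = {
    "Bayesian tree": "0.403",
    "Ceramiales": "0.186",
    "Gigartinales": "0.574",
    "region A": "< 0.001",
    "region B": "< 0.001",
    "region C": "< 0.001",
    "region D": "0.001",
    "region E": "< 0.001",
}


def upper_bound(p_text):
    """Convert '0.403' or '< 0.001' to a number the P-value does not exceed."""
    return float(p_text.replace("<", "").strip())


# True means the null hypothesis for that row is rejected at ALPHA.
rejected = {name: upper_bound(p) < ALPHA for name, p in au_p_values.items()}

for name, is_rejected in rejected.items():
    verdict = "rejected" if is_rejected else "not rejected"
    print(f"{name}: null hypothesis {verdict}")
```

Under that assumed threshold, the null hypothesis is not rejected for the Bayesian tree or for the two monophyly constraints, and it is rejected for each of the five collapsed regions.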
Data availability, relative age and node density of poorly supported regions
| | informative loci | data overlap | relative age | node density |
|---|---|---|---|---|
| region A | 9 → 64.3% | 100% | 0.88 - 0.97 | 0.529 |
| region B | 4 → 28.6% | 83.3% | 0.35 - 0.53 | 0.449 |
| region C | 7 → 50.0% | 60.3% | 0.33 - 0.53 | 0.548 |
| region D | 5 → 35.4% | 57.5% | 0.34 - 0.43 | 0.787 |
| region E | 3 → 21.4% | 75.8% | 0.14 - 0.25 | 1.000 |
The four statistics presented in this table describe the current data availability for each of the five poorly supported regions and the relative difficulty of resolving them. The proportion of potentially informative loci and the data overlap among potentially informative loci measure current data availability. Potentially informative loci are those that are present for more than three of the OTUs in the matrix. Data overlap is given as the average relative edge weight in the intersection graph of informative loci (see methods). The relative age and node density may indicate how difficult resolving the region will be. The relative age represents how ancient the region is, on a scale from zero (the present) to one (the root of our tree). The node density index is proportional to the number of nodes that need to be resolved per time unit (see methods). The partial data availability matrices for each region can be found in Additional file 3.
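The definition of potentially informative loci given in the caption (present for more than three OTUs) can be sketched as a simple filter. The function name `informative_loci` and the toy presence matrix are hypothetical. The data overlap, relative age and node density statistics depend on the methods section, which is not reproduced here, so they are not sketched.

```python
def informative_loci(present, threshold=3):
    """Return (indices of informative loci, their proportion of all loci).

    present[i][j] is truthy when OTU i has data for locus j.
    A locus counts as potentially informative when it is present
    for more than `threshold` OTUs, following the caption.
    """
    n_loci = len(present[0])
    # Number of OTUs with data, per locus.
    counts = [sum(1 for row in present if row[j]) for j in range(n_loci)]
    informative = [j for j, c in enumerate(counts) if c > threshold]
    return informative, len(informative) / n_loci


# Toy region: 5 OTUs x 4 loci (hypothetical values).
region = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]
loci, proportion = informative_loci(region)
print(f"{len(loci)} informative loci -> {proportion:.1%}")
```

The printed line mirrors the "count → percentage" layout of the first table column.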
Figure 3. Estimated data requirement for resolving the five poorly supported regions. Each graph shows how average bootstrap support increases as a function of alignment length for three types of simulations: nonparametric resampling of the empirical alignment (orange), parametric simulation of data (blue) and parametric simulation followed by introduction of missing data (gray). The approximate amount of data required to resolve a region can be derived for each simulation type by specifying a desired level of bootstrap support (e.g., the dashed line drawn at 80) and deducing the corresponding alignment length on the x-axis. Note that the x-axis uses a logarithmic scale. The lines connect the means of the five values of each condition.
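Reading an alignment length off such a curve can be sketched as follows. This assumes straight line segments between the plotted means and interpolates in log10 of the alignment length, because the x-axis is logarithmic. The function name `length_for_support` and the sample points are hypothetical and are not values from the study.

```python
import math


def length_for_support(lengths, mean_support, target=80.0):
    """Estimate the alignment length at which mean support reaches `target`.

    lengths: increasing alignment lengths (x-axis, log scale).
    mean_support: mean bootstrap support at each length (y-axis).
    Returns None if the curve never reaches the target.
    """
    if mean_support[0] >= target:
        return float(lengths[0])
    points = list(zip(lengths, mean_support))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 < target <= y1:
            # Linear interpolation in log10(length), matching the log x-axis.
            frac = (target - y0) / (y1 - y0)
            log_x = math.log10(x0) + frac * (math.log10(x1) - math.log10(x0))
            return 10 ** log_x
    return None


# Hypothetical curve: support of 40, 70 and 90 at 10^3, 10^4 and 10^5 sites.
needed = length_for_support([1_000, 10_000, 100_000], [40.0, 70.0, 90.0])
print(f"about {needed:,.0f} sites for a mean bootstrap support of 80")
```

Interpolating in log space matters here: halfway between 10,000 and 100,000 on a logarithmic axis is roughly 31,600 sites, not 55,000.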