Michael J Sanderson, Michelle M McMahon.
Abstract
BACKGROUND: Most studies inferring species phylogenies use sequences from single copy genes or sets of orthologs culled from gene families. For taxa such as plants, with very high levels of gene duplication in their nuclear genomes, this has limited the exploitation of nuclear sequences for phylogenetic studies, such as those available in large EST libraries. One rarely used method of inference, gene tree parsimony, can infer species trees from gene families undergoing duplication and loss, but its performance has not been evaluated at a phylogenomic scale for EST data in plants.
Year: 2007 PMID: 17288576 PMCID: PMC1796612 DOI: 10.1186/1471-2148-7-S1-S3
Source DB: PubMed Journal: BMC Evol Biol ISSN: 1471-2148 Impact factor: 3.260
Figure 1. Tree reconciliation example. Two alternative rootings of the same unrooted gene tree (thin black lines) embedded in a species tree (thick grey lines), visualized with the tool PrIMETV [76]. The gene tree is the maximum likelihood tree for a data set with 12 tentative consensus (TC) sequences assembled from ESTs from seven taxa (our cluster 13024). Bars indicate duplications within species (in-duplications) and black circles indicate out-duplications (those followed by a speciation event). A. The gene tree rooted to minimize the number of duplications required to reconcile the trees (two out-duplications required). B. The gene tree rooted using midpoint rooting, which places the root along the branch to the Arabidopsis sequences. This rooting is suboptimal, requiring five out-duplications.
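The duplication counts in a reconciliation like the one in Figure 1 can be illustrated with the standard least-common-ancestor (LCA) mapping. The sketch below is a minimal illustration, not the authors' implementation: the nested-tuple tree representation, the function names, and the toy trees are all hypothetical, and classifying a duplication as an in-duplication when it maps to a single species (and as an out-duplication otherwise) is an assumption based on the caption's wording.

```python
def clades(tree):
    """Return (leaf set of this node, leaf sets of every node in the subtree)."""
    if isinstance(tree, str):
        leaf = frozenset([tree])
        return leaf, [leaf]
    left, right = tree
    left_set, left_all = clades(left)
    right_set, right_all = clades(right)
    node_set = left_set | right_set
    return node_set, left_all + right_all + [node_set]


def count_duplications(gene_tree, species_tree, gene_to_species):
    """Count duplications implied by a rooted gene tree under LCA mapping.

    Trees are nested 2-tuples with string leaves (hypothetical format).
    Assumption: a duplication mapped to a single species is counted as "in",
    one mapped to an internal species-tree node as "out".
    """
    _, species_clades = clades(species_tree)

    def lca(species_set):
        # Clades containing a given set are nested, so the smallest is unique.
        return min((c for c in species_clades if species_set <= c), key=len)

    counts = {"in": 0, "out": 0}

    def visit(node):
        if isinstance(node, str):
            return frozenset([gene_to_species[node]])
        left_species = visit(node[0])
        right_species = visit(node[1])
        both = left_species | right_species
        mapped = lca(both)
        # A node is a duplication if it maps to the same place as a child.
        if mapped == lca(left_species) or mapped == lca(right_species):
            counts["in" if len(mapped) == 1 else "out"] += 1
        return both

    visit(gene_tree)
    return counts


# Toy example (not cluster 13024): three species, five gene copies.
species_tree = (("A", "B"), "C")
gene_to_species = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
gene_tree = (("a1", "b1"), ("a2", ("b2", "c1")))
toy_counts = count_duplications(gene_tree, species_tree, gene_to_species)
```

Scoring every possible rooting of an unrooted gene tree with such a function and keeping the minimum would correspond to the rooting strategy of panel A; midpoint rooting (panel B) fixes the root without reference to the species tree.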
Figure 2. Accepted species tree for seven plant model species. Names of clades are indicated at internal nodes. See text for discussion of strength of evidence for this phylogeny.
Sequence and cluster data for each taxon
| Taxon (a) | Release (b) | Original TCs (c) | MaxORFs (d) | Clusters (e) | Final TCs (f) |
| | 12.1 | 28900 | 23737 | 343 | 729 |
| | 12.0 | 31928 | 13930 | 538 | 1065 |
| | 3.0 | 12485 | 3116 | 365 | 452 |
| | 8.0 | 18612 | 12254 | 528 | 852 |
| | 16.0 | 36381 | 25842 | 199 | 418 |
| | 6.0 | 23531 | 13949 | 159 | 315 |
| | 10.0 | 21063 | 12625 | 378 | 705 |
| Total | | 172900 | 105453 | 577 | 4536 |
a Taxon as given by TIGR for the EST collection assembled in the Gene Index Database.
b Versions used in this paper, current as of 18 February 2006.
c The 363,971 sequences in the database for these taxa were screened to include only those sequences assembled by TIGR into Tentative Consensus (TC) sequences.
d TCs were trimmed to the largest sense-direction ORF that was at least 500 nt in length; shorter sequences were discarded.
e Number of clusters in which the taxon is represented, after screening for phylogenetic informativeness (at least three taxa and at least four sequences).
f Total number of sequences from each taxon in the final set of clusters.
g TIGR assembled this library from several species of Pinus.
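The trimming step in footnote d (keep the largest sense-direction ORF if it is at least 500 nt, otherwise discard the sequence) might be sketched as follows. This is only an illustration under stated assumptions: an ORF is taken here to be the longest stop-free run of whole codons in any of the three forward reading frames, which may differ from the definition the authors used, and the function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_forward_orf(seq):
    """Longest stop-free run of whole codons over the three forward frames.

    Assumption: no start codon is required, because assembled EST consensus
    sequences are often incomplete at the 5' end.
    """
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = frame
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                if i - start > len(best):
                    best = seq[start:i]
                start = i + 3  # resume just after the stop codon
        # Stretch running to the last complete codon of this frame.
        end = frame + ((len(seq) - frame) // 3) * 3
        if end - start > len(best):
            best = seq[start:end]
    return best


def trim_or_discard(seq, min_len=500):
    """Return the trimmed ORF, or None if it is shorter than min_len nt."""
    orf = longest_forward_orf(seq)
    return orf if len(orf) >= min_len else None
```

Only forward frames are scanned because the footnote restricts the search to the sense direction; a six-frame search would also need the reverse complement.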
Effects of hit fraction threshold on cluster assembly. Bold indicates the threshold chosen for the current study.
| Hit fractiona | Clustersb | Singletonsc | Phylogenetically informative clustersd | Max sizee | TCs in phylogenetically informative clustersf |
| 0.0 | 39924 | 26782 | 4423 | 6565 | 54051 |
| 0.1 | 47798 | 32824 | 4079 | 1947 | 42406 |
| 0.2 | 57229 | 41327 | 3324 | 1362 | 29403 |
| 0.3 | 64691 | 48864 | 2561 | 330 | 21504 |
| 0.4 | 71333 | 56383 | 1876 | 117 | 15457 |
| 0.5 | 77564 | 63890 | 1340 | 98 | 10721 |
| 0.6 | 83435 | 71539 | 897 | 95 | 7105 |
| 0.8 | 94296 | 87186 | 324 | 92 | 2529 |
| 0.9 | 99843 | 95975 | 103 | 89 | 872 |
| 1.0 | 105144 | 104860 | 1 | 6 | 6 |
a Minimum proportion of sequence similarity based on BLAST's pairwise comparisons. The hit fraction determines whether a sequence is linked to another (if a pair is linked, they will be placed in the same cluster) and thus affects the level of heterogeneity within clusters and the number of assembled clusters. Original number of sequences is 105,453 TCs.
b Total number of assembled clusters.
c Number of single-sequence clusters.
d Phylogenetically informative clusters for this study are those that include at least three species and at least four sequences.
e Number of tentative consensus sequences (TCs) in the largest phylogenetically informative cluster.
f Total TCs in all phylogenetically informative clusters.
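Footnote a describes a linkage rule: any pair of sequences whose hit fraction reaches the threshold ends up in the same cluster. That is single-linkage clustering, which can be sketched with a union-find structure. The code below is a minimal illustration under assumptions: the input format (a dictionary of pairwise hit fractions), the function names, and the toy data are hypothetical, and how the hit fraction itself is derived from BLAST output is not reproduced here. The informativeness filter follows footnote d (at least three species and at least four sequences).

```python
from collections import defaultdict


def single_linkage_clusters(seq_ids, pair_scores, threshold):
    """Group sequences so that any pair scoring >= threshold shares a cluster."""
    parent = {s: s for s in seq_ids}

    def find(x):
        # Find the cluster representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in pair_scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for s in seq_ids:
        clusters[find(s)].add(s)
    return list(clusters.values())


def is_informative(cluster, taxon_of, min_taxa=3, min_seqs=4):
    """Footnote d: at least three species and at least four sequences."""
    taxa = {taxon_of[s] for s in cluster}
    return len(cluster) >= min_seqs and len(taxa) >= min_taxa


# Toy data (hypothetical identifiers and scores).
seqs = ["a1", "a2", "b1", "c1", "d1"]
taxon_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C", "d1": "D"}
scores = {("a1", "a2"): 0.9, ("a2", "b1"): 0.8, ("b1", "c1"): 0.75, ("c1", "d1"): 0.3}
loose = single_linkage_clusters(seqs, scores, threshold=0.5)
strict = single_linkage_clusters(seqs, scores, threshold=0.85)
```

Raising the threshold breaks links, so the number of clusters and singletons grows while the number of phylogenetically informative clusters falls, which is the trend the table shows.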
Distributions of cluster sizes by number of taxa
| Number of taxa in cluster | Number of clusters |
| 1 | 86022 |
| 2 | 1986 |
| 3 | 478 |
| 4 | 162 |
| 5 | 90 |
| 6 | 67 |
| 7 | 59 |
Distributions of cluster sizes by number of tentative consensus sequences (TCs)
| Number of TCs in cluster | Number of clusters |
| 1 | 79122 |
| 2–3 | 8645 |
| 4–9 | 930 |
| 10–94 | 167 |
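The two distributions above partition the same set of clusters, once by number of taxa and once by number of TCs, so their totals should agree. A short arithmetic check using only the values tabulated above (the variable names are illustrative):

```python
# Number of clusters by number of taxa represented (values from the table above).
clusters_by_taxa = {1: 86022, 2: 1986, 3: 478, 4: 162, 5: 90, 6: 67, 7: 59}

# Number of clusters by number of TCs they contain (values from the table above).
clusters_by_tcs = {"1": 79122, "2-3": 8645, "4-9": 930, "10-94": 167}

total_by_taxa = sum(clusters_by_taxa.values())
total_by_tcs = sum(clusters_by_tcs.values())

# Clusters meeting each informativeness criterion taken separately.
at_least_three_taxa = sum(n for k, n in clusters_by_taxa.items() if k >= 3)
at_least_four_tcs = clusters_by_tcs["4-9"] + clusters_by_tcs["10-94"]
```

Both totals come to 88,864 clusters. The two marginal counts (856 clusters with at least three taxa, 1,097 with at least four TCs) are upper bounds on the number of phylogenetically informative clusters, which must satisfy both criteria at once.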
Distribution of duplication scores among clusters
| | MP gene trees | ML gene trees |
| Number of clusters with zero duplications | 40 | 42 |
| Number of clusters with zero | 211 | 226 |
| Maximum duplications in any cluster | 83 | 81 |
| Maximum | 20 | 17 |
Figure 3. Species tree inferred by gene tree parsimony. A. The best species tree obtained using gene tree parsimony (GTP) based on the maximum likelihood gene tree collection. It is identical to the accepted tree in Figure 2. B. The best species tree obtained using GTP based on the maximum parsimony gene tree collection. It differs from the accepted tree only within the legumes. Bootstrap II support values (resampling the gene trees: see text) are shown in plain text for each bipartition in the tree. Bootstrap I values (resampling the data within the original clusters) are shown in italics for tree B.
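The idea behind resampling gene trees (Bootstrap II in the caption above) can be sketched as follows. This is a simplified illustration, not the authors' procedure: it assumes a precomputed table `costs[g][t]` holding the duplication cost of gene tree `g` on candidate species tree `t`, and it reports how often each candidate species tree is optimal across replicates. The function name and input layout are hypothetical.

```python
import random
from collections import Counter


def bootstrap_best_trees(costs, n_reps=100, seed=0):
    """Fraction of replicates in which each candidate species tree is optimal.

    costs[g][t] is the reconciliation cost of gene tree g on species tree t
    (hypothetical precomputed input). Ties go to the lowest tree index.
    """
    rng = random.Random(seed)
    n_genes = len(costs)
    n_trees = len(costs[0])
    wins = Counter()
    for _ in range(n_reps):
        # Resample gene trees with replacement, keeping the collection size.
        sample = [rng.randrange(n_genes) for _ in range(n_genes)]
        totals = [sum(costs[g][t] for g in sample) for t in range(n_trees)]
        wins[totals.index(min(totals))] += 1
    return {t: wins[t] / n_reps for t in range(n_trees)}
```

Support for an individual bipartition would then be the summed frequency of all winning trees that contain it.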
Figure 4. Distribution of duplication scores across all species trees. Distributions of out-duplication scores across all 945 binary angiosperm species trees (all rooted with Pinus). An out-duplication score is the sum of all out-duplications required to reconcile all 577 gene trees (or sets of trees) to that species tree. The upper panel shows the distribution of scores when the gene trees were estimated using maximum parsimony; the lower panel gives the same for the maximum likelihood gene trees. Arrows indicate the bins in which the accepted species tree occurs. For the MP gene trees, the accepted species tree was fourth from the best and had a score of 796.3 (the optimal species tree had a score of 771.9). For the ML gene trees, the optimal tree was the same as the accepted tree and had a score of 779.0.
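The figure of 945 species trees follows from the standard count of rooted binary topologies on n labelled taxa, (2n - 3)!! = 3 × 5 × ... × (2n - 3). With Pinus fixed as the outgroup, the six remaining taxa form the ingroup, giving 3 × 5 × 7 × 9 = 945. A short check (the function name is illustrative):

```python
def num_rooted_binary_trees(n_taxa):
    """Number of rooted binary tree topologies on n labelled taxa: (2n - 3)!!"""
    count = 1
    for k in range(3, 2 * n_taxa - 2, 2):
        count *= k
    return count


# Six angiosperm taxa, rooted by the Pinus outgroup.
n_species_trees = num_rooted_binary_trees(6)
```

Because the search space is this small, every candidate species tree can be scored exhaustively rather than searched heuristically, which is what makes the full score distribution in this figure possible.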