Liat Shavit Grievink, David Penny, Barbara R Holland.
Abstract
Phylogenetic studies based on molecular sequence alignments are expected to become more accurate as the number of sites in the alignments increases. With the advent of genomic-scale data, where alignments have very large numbers of sites, bootstrap values close to 100% and posterior probabilities close to 1 are the norm, suggesting that the number of sites is now seldom a limiting factor on phylogenetic accuracy. This provokes the question, should we be fussy about the sites we choose to include in a genomic-scale phylogenetic analysis? If some sites contain missing data, ambiguous character states, or gaps, then why not just throw them away before conducting the phylogenetic analysis? Indeed, this is exactly the approach taken in many phylogenetic studies. Here, we present an example where the decision on how to treat sites with missing data is of equal importance to decisions on taxon sampling and model choice, and we introduce a graphical method for illustrating this.
Year: 2013 PMID: 23471508 PMCID: PMC3641631 DOI: 10.1093/gbe/evt032
Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416
The two competing trees. Mesostigma is either sister to the Streptophyta (S) or basal to both Streptophyta and Chlorophyta (B). The taxa included in the eight-taxon data set are marked in red.
The Positioning of Mesostigma in Trees Estimated Using Three Different Models (JTT, WAG, and CpREV) and Combinations of +I, +G, and +F
| Model | Original 13-Taxa (6,622 Positions) | 13-Taxa Clean (1,948 Positions) | 8-Taxa (6,622 Positions) | 8-Taxa Reduced Then Cleaned (3,910 Positions) | 8-Taxa Cleaned Then Reduced (1,948 Positions) |
|---|---|---|---|---|---|
| JTT | P (30,8,62) | S (82,1,15) | B (5,94,1) | B (7,91,2) | S (71,22,7) |
| JTT + F | P (36,12,52) | S (88,0,12) | B (7,89,4) | B (5,95,0) | S (74,11,12) |
| JTT + I | P (45,4,51) | S (96,0,4) | B (18,78,4) | B (20,78,2) | S (81,8,8) |
| JTT + I + F | P (57,8,35) | S (89,0,10) | B (26,73,1) | B (29,69,2) | S (76,13,10) |
| JTT + G | S (70,4,26) | S (83,0,16) | B (25,75,0) | S (47,52,1) | S (74,11,14) |
| JTT + G + F | S (68,1,31) | S (88,0,11) | S (33,65,2) | S (50,48,2) | S (84,7,7) |
| JTT + I + G | S (78,2,20) | S (90,0,9) | B (43,54,3) | S (50,49,1) | S (81,10,9) |
| JTT + I + G + F | S (68,0,32) | S (91,0,9) | S (45,55,0) | S (54,46,0) | S (78,11,11) |
| WAG | S (26,13,61) | S (83,2,12) | B (6,92,2) | B (6,93,1) | S (66,20,11) |
| WAG + F | P (28,13,59) | S (79,3,14) | B (8,92,0) | B (8,91,1) | S (67,19,9) |
| WAG + I | P (44,8,48) | S (93,1,5) | B (18,82,0) | B (26,74,0) | S (77,9,8) |
| WAG + I + F | P (45,2,53) | S (94,0,6) | B (15,82,3) | B (24,75,1) | S (82,7,10) |
| WAG + G | S (54,6,40) | S (87,0,13) | B (37,62,1) | S (46,54,0) | S (87,6,6) |
| WAG + G + F | S (65,2,33) | S (78,0,21) | S (31,69,0) | S (43,56,1) | S (84,5,9) |
| WAG + I + G | S (66,2,32) | S (76,0,21) | B (28,71,1) | S (38,62,0) | S (84,7,9) |
| WAG + I + G + F | S (63,4,33) | S (86,0,10) | S (42,57,1) | S (50,50,0) | S (82,5,10) |
| CpREV | S (41,11,48) | S (79,3,15) | B (13,84,3) | B (5,93,2) | S (73,17,8) |
| CpREV + F | P (43,5,52) | S (83,0,15) | B (9,90,1) | B (11,85,4) | S (65,15,18) |
| CpREV + I | S (52,6,42) | S (90,0,10) | B (12,86,2) | B (22,72,6) | S (62,14,15) |
| CpREV + I + F | P (45,5,50) | S (83,1,13) | B (16,81,3) | B (23,73,4) | S (78,3,16) |
| CpREV + G | S (69,2,29) | S (84,0,14) | S (35,64,1) | S (45,55,0) | S (86,4,10) |
| CpREV + G + F | S (72,3,25) | S (83,1,16) | S* (39,61,0) | S (41,55,4) | S (80,6,11) |
| CpREV + I + G | S (71,1,28) | S (90,0,9) | B (44,53,3) | S (42,54,4) | S (74,11,12) |
| CpREV + I + G + F | S* (70,1,29) | S* (89,0,9) | S (34,63,3) | S* (38,59,3) | S* (84,3,11) |
Note.—“S” indicates that Mesostigma is sister to Streptophyta, “B” indicates that Mesostigma is basal to green plants, and “P” indicates that Mesostigma is sister to Prototheca (fig. 1). The best-fit model for each setting, found using ProtTest, is marked with an *. Numbers in parentheses show bootstrap support for the S split, B split, and P split, in that order.
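The last two column headings differ only in the order of two operations: removing sites with missing data ("cleaning") and removing taxa ("reducing"). The following is a minimal sketch, on a made-up toy alignment, of why that order changes the number of retained sites. The helper names `clean_sites` and `reduce_taxa`, the taxon labels, and the set of symbols treated as missing are all assumptions for illustration and are not taken from the study's data or software.

```python
# Toy illustration only: the alignment, taxon names, and helper names are
# hypothetical and are not taken from the study's data or software.
MISSING = set("-?X")  # assumed symbols for gaps, missing, and ambiguous states


def clean_sites(alignment):
    """Drop every column in which at least one taxon has a missing symbol."""
    names = list(alignment)
    length = len(alignment[names[0]])
    keep = [i for i in range(length)
            if all(alignment[n][i] not in MISSING for n in names)]
    return {n: "".join(alignment[n][i] for i in keep) for n in names}


def reduce_taxa(alignment, taxa):
    """Keep only the named taxa; columns are left untouched."""
    return {n: alignment[n] for n in taxa}


def n_sites(alignment):
    """Number of columns in the alignment."""
    return len(next(iter(alignment.values())))


toy = {
    "taxon1": "MKV-LSA",
    "taxon2": "MKVTLSA",
    "taxon3": "M?VTLSA",
    "taxon4": "MKVTL-A",
}
subset = ["taxon1", "taxon2"]

# Reducing first means only gaps in the retained taxa cost a column.
reduced_then_cleaned = clean_sites(reduce_taxa(toy, subset))
# Cleaning first also discards columns whose gaps sit in taxa removed later.
cleaned_then_reduced = reduce_taxa(clean_sites(toy), subset)

print(n_sites(reduced_then_cleaned))  # 6
print(n_sites(cleaned_then_reduced))  # 4
```

Cleaning before reducing discards columns whose only missing entries belong to taxa that are later removed, which is consistent with the smaller site count reported for the cleaned-then-reduced data set (1,948 versus 3,910 positions).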
Summary of Site Likelihoods Using the WAG Model for the 8- and 13-Taxon Data Sets, with and without the Removal of Missing Data
| Data Set | No. of Taxa | Treatment of Sites with Missing Data | Mesostigma Position in ML Tree | No. of Sites Preferring Position S or B (only one count recoverable) | Total Number of Sites | Average Difference in Likelihood between Trees |
|---|---|---|---|---|---|---|
| a | 13 | Included | S | 2,506 | 6,622 | 0.0017 |
| b | 13 | Excluded | S | 491 | 1,948 | 0.0134 |
| c | 8 | Included | B | 2,768 | 6,622 | 0.0074 |
| d | 8 | Excluded after taxon sampling | B | 1,280 | 3,910 | 0.003 |
| e | 8 | Excluded before taxon sampling | S | 438 | 1,948 | 0.0138 |
Note.—“S,” sister to Streptophyta; “B,” basal to green plants. The majority of sites are underlined and marked in bold.
Truncated histograms of the differences in site likelihood for the two competing positions of Mesostigma for the five data sets of table 2. For each site, the log likelihoods are calculated for the two positions (S vs. B, see fig. 1) and then subtracted. For example, in (a) most sites (3,885) support position S, but the distribution is not symmetrical; a small number of sites (<1%) support B very strongly and dominate the larger number of sites supporting position S. (b–e) are the other data sets from table 2.
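The per-site subtraction described in this caption can be sketched as follows. This is a minimal illustration with invented numbers, not the authors' code: the function name `summarize_site_preferences`, the sign convention (log likelihood under S minus log likelihood under B), and the input values are assumptions. In practice the per-site log likelihoods would come from a phylogenetics program evaluated on each candidate tree.

```python
# Hypothetical per-site log likelihoods; real values would come from a
# phylogenetics program run on each of the two candidate trees.
def summarize_site_preferences(lnl_s, lnl_b):
    """Compare per-site log likelihoods under tree S and tree B.

    Assumed sign convention: a positive difference means the site prefers S.
    """
    diffs = [s - b for s, b in zip(lnl_s, lnl_b)]
    total = sum(diffs)
    return {
        "prefer_s": sum(1 for d in diffs if d > 0),
        "prefer_b": sum(1 for d in diffs if d < 0),
        "total_diff": total,
        "mean_diff": total / len(diffs),
    }


# Four sites weakly prefer S; one site strongly prefers B.
lnl_s = [-2.0, -3.0, -1.5, -4.0, -9.0]
lnl_b = [-2.1, -3.2, -1.6, -4.1, -6.0]

result = summarize_site_preferences(lnl_s, lnl_b)
print(result["prefer_s"], result["prefer_b"])  # 4 1
print(result["total_diff"] < 0)                # True: the summed difference favors B
```

In this toy case a count of sites favors S while the summed difference favors B, which is the kind of asymmetry the histograms are intended to expose.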
Nonrandomness of sites with missing data. Sites in the eight-taxon alignment have been ranked in order of increasing level of preference for the S tree over the B tree (x axis), and the y axis shows the cumulative total number of sites with missing data. The solid blue line records sites with missing data in the 13-taxon data (4,674 in total), and the solid red line records sites with missing data in the eight-taxon data (2,712 in total). Dashed straight lines show the expectation if sites with missing data were allocated randomly with respect to level of preference.
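The curves described in this caption can be approximated with a short sketch: rank sites by their preference for S over B, accumulate the count of sites with missing data along that ranking, and compare it with a straight line from zero to the total. The function name `cumulative_missing_by_rank` and the input values below are hypothetical, and the preference score is assumed to be the per-site log likelihood under S minus that under B.

```python
def cumulative_missing_by_rank(preference, has_missing):
    """Return (observed, expected) cumulative counts of missing-data sites.

    preference:  per-site score, assumed to be lnL(S) - lnL(B).
    has_missing: per-site flag, True if the site contains missing data.
    Sites are ranked by increasing preference for S, as on the x axis.
    """
    order = sorted(range(len(preference)), key=lambda i: preference[i])
    observed = []
    running = 0
    for i in order:
        if has_missing[i]:
            running += 1
        observed.append(running)
    n = len(order)
    # Straight-line expectation if missing-data sites were spread at random.
    expected = [running * (k + 1) / n for k in range(n)]
    return observed, expected


# Invented example: the two missing-data sites are the two that favor B most.
preference = [0.3, -2.0, 0.1, -0.5, 0.2, 0.05]
has_missing = [False, True, False, True, False, False]

observed, expected = cumulative_missing_by_rank(preference, has_missing)
print(observed)  # [1, 2, 2, 2, 2, 2]
```

An observed curve that rises well above the straight line at the left end, as in this toy case, would indicate that sites with missing data are concentrated among the sites that most strongly prefer B.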