Soowon Cho, Andreas Zwick, Jerome C Regier, Charles Mitter, Michael P Cummings, Jianxiu Yao, Zaile Du, Hong Zhao, Akito Y Kawahara, Susan Weller, Donald R Davis, Joaquin Baixeras, John W Brown, Cynthia Parr.
Abstract
This paper addresses the question of whether one can economically improve the robustness of a molecular phylogeny estimate by increasing gene sampling in only a subset of taxa, without having the analysis invalidated by artifacts arising from large blocks of missing data. Our case study stems from an ongoing effort to resolve poorly understood deeper relationships in the large clade Ditrysia (>150,000 species) of the insect order Lepidoptera (butterflies and moths). Seeking to remedy the overall weak support for deeper divergences in an initial study based on five nuclear genes (6.6 kb) in 123 exemplars, we nearly tripled the total gene sample (to 26 genes, 18.4 kb) but only in a third (41) of the taxa. The resulting partially augmented data matrix (45% intentionally missing data) consistently increased bootstrap support for groupings previously identified in the five-gene (nearly) complete matrix, while introducing no contradictory groupings of the kind that missing data have been predicted to produce. Our results add to growing evidence that data sets differing substantially in gene and taxon sampling can often be safely and profitably combined. The strongest overall support for nodes above the family level came from including all nucleotide changes, while partitioning sites into sets undergoing mostly nonsynonymous versus mostly synonymous change. In contrast, support for the deepest node for which any persuasive molecular evidence has yet emerged (78-85% bootstrap) was weak or nonexistent unless synonymous change was entirely excluded, a result plausibly attributed to compositional heterogeneity. This node (Gelechioidea + Apoditrysia), tentatively proposed by previous authors on the basis of four morphological synapomorphies, is the first major subset of ditrysian superfamilies to receive strong statistical support in any phylogenetic study.
A "more-genes-only" data set (41 taxa×26 genes) also gave strong signal for a second deep grouping (Macrolepidoptera) that was obscured, but not strongly contradicted, in more taxon-rich analyses.
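The sampling design in the abstract implies a block of data that is missing by design, and its size can be estimated from the figures given there. The sketch below is a back-of-envelope calculation only: the function name is hypothetical, and it assumes that every augmented taxon has all 18.4 kb and every other taxon has exactly the original 6.6 kb. With those rounded inputs the result is close to, but not identical to, the reported 45%, which presumably reflects the actual alignment lengths and any incidental gaps.

```python
def designed_missing_fraction(n_taxa, n_augmented, base_kb, total_kb):
    """Fraction of matrix cells left empty by design when extra
    sequence is added for only a subset of taxa (hypothetical helper)."""
    # Taxa that were not augmented lack all of the added sequence.
    missing_cells = (n_taxa - n_augmented) * (total_kb - base_kb)
    all_cells = n_taxa * total_kb
    return missing_cells / all_cells


# Figures taken from the abstract: 123 taxa, 41 augmented, 6.6 kb -> 18.4 kb.
frac = designed_missing_fraction(123, 41, 6.6, 18.4)
print(f"{frac:.1%}")  # prints 42.8%
```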
Year: 2011 PMID: 21840842 PMCID: PMC3193767 DOI: 10.1093/sysbio/syr079
Source DB: PubMed Journal: Syst Biol ISSN: 1063-5157 Impact factor: 15.683
Comparison of bootstrap support (nodes above the family level) between five-gene complete and partially augmented matrices
Gene regions sequenced
| A. 21 gene segments adopted from the arthropod study of Regier, Shultz, et al. (2008) | | | |
| PCR amplicon name | Gene name/function | Fragment length (bp) | Average number of substitutions per nt2 site |
| 36fin1_3 | | 471 | 0.39 |
| 44fin2_3 | | 528 | 0.85 |
| 3007fin1_2 | | 621 | 1.22 |
| 8091fin1_2 | | 666 | 1.22 |
| 3006fin1_2 | | 222 | 1.24 |
| 113fin1_2 | | 975 | 1.27 |
| acc2_4 | | 501 | 1.31 |
| 69fin2_3 | Clathrin coat assembly protein | 627 | 1.36 |
| 109fin1_2 | | 594 | 1.40 |
| 3070fin4_5 | | 705 | 1.43 |
| 262fin1_2 | Proteasome subunit | 501 | 1.48 |
| 268fin1_2 | | 768 | 1.68 |
| 270fin2_3 | (Hypothetical protein) | 447 | 1.70 |
| 3017fin1_2 | | 594 | 1.74 |
| 40fin2_3 | | 750 | 1.77 |
| 8028fin1_2 | Nucleolar cysteine-rich protein | 324 | 1.81 |
| 42fin1_2 | Putative GTP-binding protein | 840 | 1.95 |
| 3059fin1_3 | | 732 | 2.23 |
| 197fin1_2 | | 444 | 2.30 |
| 192fin1_2 | | 402 | 2.78 |
| 265fin2_3 | | 447 | 3.86 |
| B. Gene segments from Regier et al. (2009), with estimated substitution rates | | | |
| nt3 | Gene name | Fragment length (bp) | “nt2 est.” |
| 85.0 | | 2928 | 1.60 |
| 18.4 | | 1281 | 1.62 |
| 22.4 | | 1134 | 1.17 |
| 10.8 | | 888 | 6.00 |
| 42.4 | | 402 | 0.34 |
Notes: Provided are PCR amplicon names, gene names/functions and fragment lengths (excluding nucleotide characters of uncertain alignment) of the 21 additional gene regions sequenced for 41 taxa, ordered by evolutionary rate at nt2 on a phylogeny for 13 arthropod exemplars (Regier, Shultz, et al. 2008).
Average number of nucleotide changes per second codon position site, estimated by ML on a constrained tree of 13 divergent arthropod species, from table 2 of Regier, Shultz, et al. (2008).
Average number of nucleotide changes per third codon position site, estimated by ML on a constrained tree of 32 species of Bombycoidea (Lepidoptera), from table 4 of Regier, Cook, et al. (2008).
Average number of nucleotide changes per site in a character set consisting of nt2 plus all nt1 sites at which no leucine or arginine occurs in any taxon (noLR1 + nt2), estimated by ML on a constrained tree of 32 species of Bombycoidea (Lepidoptera), from table 4 of Regier, Cook, et al. (2008). This is an estimate of the rate of nonsynonymous change. The ratio of rates at nt3 to rates in this character set is an estimate of the relative rate of synonymous to nonsynonymous substitution.
Approximation of nonsynonymous substitution rate, for comparison to the 21 additional gene fragments above. Gene 19 in table 2 of Regier, Shultz, et al. (2008), not included in the 21 additional genes of this study, is a 600 base pair piece of CAD. Estimates of nonsynonymous rates (noLR1 + nt2, above) for the five genes used by Regier et al. (2009) were first converted to proportions of the rate for CAD, then rescaled to reflect the ranking of CAD among the nt2 rates for the 21 additional gene fragments. The result is an approximate scale of comparison for rates of nonsynonymous substitution across all 26 genes, assuming that rates of substitution at nt2 and at nt1 sites undergoing only nonsynonymous substitutions are comparable.
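The rescaling described in the last note can be read as a two-step proportional conversion: express each gene's nonsynonymous rate as a fraction of the rate for CAD, then multiply by the nt2 rate of CAD on the arthropod scale. The sketch below illustrates that reading only. The function name, the gene labels other than CAD, and all numeric values are hypothetical and are not taken from the study's tables.

```python
def rescale_to_nt2(nonsyn_rates, reference_gene, reference_nt2_rate):
    """Put nonsynonymous rates on the nt2 scale of a reference gene.

    nonsyn_rates: gene -> nonsynonymous rate (noLR1 + nt2 character set).
    reference_gene: gene present on both scales (CAD in the note above).
    reference_nt2_rate: that gene's nt2 rate on the target scale.
    """
    ref = nonsyn_rates[reference_gene]
    # Step 1: proportion of the reference rate; step 2: rescale.
    return {gene: rate / ref * reference_nt2_rate
            for gene, rate in nonsyn_rates.items()}


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
nonsyn = {"CAD": 0.20, "gene_A": 0.10, "gene_B": 0.30}
scaled = rescale_to_nt2(nonsyn, "CAD", reference_nt2_rate=1.5)
print(scaled)
```

Because the conversion is a single multiplicative factor, it preserves the rank order and the ratios among the five genes; only the units change.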
Diagram of gene and taxon sampling design, showing relationships among the three data sets analyzed. a) Five-gene complete matrix (123 taxa; from Regier et al. 2009). b) Partially augmented matrix, deliberately incomplete, created by adding 21 genes for just 41 of the 123 taxa in the five-gene complete matrix. c) More-genes-only matrix, consisting of just the 41 species sequenced for all 26 genes.
Comparison of ML trees of family relationships inferred from the five-gene complete matrix (left column) to those from the partially augmented matrix (123 taxa×5 or 26 genes; right column), simplified from full 123-taxon trees shown in Figures S1–S4. Black triangles denote families with multiple exemplars. Numbers in parentheses after the family name give the number of exemplars for the five-gene complete matrix, followed by the number with 26 genes/total number for the partially augmented matrix. ### denotes families with one or more exemplars scored for 26 genes for the partially augmented matrix. BPs > 50% are shown above branches; number of replicates is 1000 for a) and b), 2000 for c) and d). a) nt123 partitioned, five-gene complete matrix; b) nt123 partitioned, partially augmented matrix; c) all nonsynonymous coding, five-gene complete matrix; d) all nonsynonymous coding, partially augmented matrix.
a)–c) ML trees of family relationships inferred from the more-genes-only data set (41 taxa×26 genes), simplified, except in c), from full 41-taxon trees shown in Figures S5 and S6. Black triangles denote families with multiple exemplars; number of exemplars shown in parentheses after family name. a) all nonsynonymous coding, phylogram; b) all nonsynonymous coding, cladogram; c) nt123 partitioned. BPs > 50% are shown above branches; number of replicates is 1000 for a), 2000 for b). d) Relationships among the sampled families (only) according to the morphology-based working hypothesis of Kristensen and Skalski (1998).