Karen Siu-Ting, María Torres-Sánchez, Diego San Mauro, David Wilcockson, Mark Wilkinson, Davide Pisani, Mary J O'Connell, Christopher J Creevey.
Abstract
Increasingly, large phylogenomic data sets include transcriptomic data from nonmodel organisms. This not only has allowed controversial and unexplored evolutionary relationships in the tree of life to be addressed but also increases the risk of inadvertent inclusion of paralogs in the analysis. Although this may be expected to result in decreased phylogenetic support, it is not clear if it could also drive highly supported artifactual relationships. Many groups, including the hyperdiverse Lissamphibia, are especially susceptible to these issues due to ancient gene duplication events and small numbers of sequenced genomes and because transcriptomes are increasingly applied to resolve historically conflicting taxonomic hypotheses. We tested the potential impact of paralog inclusion on the topologies and timetree estimates of the Lissamphibia using published and de novo sequencing data including 18 amphibian species, from which 2,656 single-copy gene families were identified. A novel paralog filtering approach resulted in four differently curated data sets, which were used for phylogenetic reconstructions using Bayesian inference, maximum likelihood, and quartet-based supertrees. We found that paralogs drive strongly supported conflicting hypotheses within the Lissamphibia (Batrachia and Procera) and older divergence time estimates even within groups where no variation in topology was observed. All investigated methods, except Bayesian inference with the CAT-GTR model, were found to be sensitive to paralogs, but with filtering, convergence to the same answer (Batrachia) was observed. This is the first large-scale study to address the impact of orthology selection using transcriptomic data and emphasizes the importance of quality over quantity, particularly for understanding relationships of poorly sampled taxa.
Keywords: Lissamphibia; orthology; paralogy; phylogenomics; timetree
Year: 2019 PMID: 30903171 PMCID: PMC6526904 DOI: 10.1093/molbev/msz067
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
Fig. 1. Evolutionary relationships in the Lissamphibia and hypotheses of ancient duplications, speciation, and loss of gene copies. (A) The three possible hypothetical phylogenetic relationships (Batrachia, Procera, and a Third topology) among the living lineages of Lissamphibia: Gymnophiona (G), Anura (A), and Caudata (C). (B) Hypothesis of two rounds of ancient duplication events (“DUP” in orange) prior to the Lissamphibia speciation events (“SP” in light green) that give rise to multiple gene copies assuming the most accepted hypothesis in the literature, Batrachia. In superscript are the IDs of the different copies of the gene that have arisen following the duplication events. (C) Framework of “Early Loss” resulting from gene deletions (red crosses) prior to the speciation events. In this example, all gene copies are orthologs and retrieve Batrachia. (D) Framework of “Late Loss” resulting from gene deletions (red crosses) after the speciation events. In this example, there is a mix of ortholog and paralog genes which retrieve Procera. For figures (B–D), lineages in black represent retained copies and lineages in gray represent lost copies.
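The contrast between the "Early Loss" and "Late Loss" frameworks can be illustrated with a toy example. The following is a minimal sketch, not the authors' method: it assumes a single duplication (the figure shows two rounds), represents trees as nested tuples, and uses hypothetical helper names (`prune`, `species_only`, `sister_pair`). Each gene copy is assumed to follow the Batrachia species tree, and only the pattern of copy loss differs between the two scenarios.

```python
# Toy illustration (assumption: one duplication instead of the two rounds in the figure).
# Trees are nested tuples; leaves are strings such as "A1" (Anura, gene copy 1).

def prune(tree, keep):
    """Drop leaves not in `keep` and collapse nodes left with a single child."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [k for k in (prune(child, keep) for child in tree) if k is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)

def species_only(tree):
    """Strip copy IDs so that only the lineage letters (G, A, C) remain."""
    if isinstance(tree, str):
        return tree[0]
    return tuple(species_only(child) for child in tree)

def sister_pair(tree):
    """Return the two lineages that are sisters in a rooted three-taxon tree."""
    for child in tree:
        if isinstance(child, tuple):
            return frozenset(child)
    raise ValueError("tree has no resolved pair")

def batrachia(copy):
    # Assumed species tree under Batrachia: Gymnophiona sister to (Anura, Caudata).
    return ("G" + copy, ("A" + copy, "C" + copy))

# The duplication precedes speciation, so each copy carries its own Batrachia subtree.
gene_tree = (batrachia("1"), batrachia("2"))

# Early loss: copy 2 is lost before speciation; every retained sequence is copy 1.
early_loss = prune(gene_tree, {"G1", "A1", "C1"})

# Late loss: lineages lose different copies after speciation (a mix of copies 1 and 2).
late_loss = prune(gene_tree, {"G1", "C1", "A2"})

print(sister_pair(species_only(early_loss)))  # Anura + Caudata (Batrachia)
print(sister_pair(species_only(late_loss)))   # Gymnophiona + Caudata (Procera)
```

In the late-loss case the retained Anura sequence is a paralog of the other two, so the deepest split in the pruned gene tree reflects the duplication rather than a speciation event, and the tree appears to support Procera even though the underlying species tree is Batrachia.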
Fig. 2. Resulting topologies for the extant lineages of Lissamphibia with all four data sets tested. Data sets 2656 and 768 have no enrichment of orthologs (no Clan_Check filter) and Data sets 2019 and 348 have enrichment of orthologs (after Clan_Check filter). Alternative hypotheses retrieved (shown on the right side) for Lissamphibia are represented by a different color in the table: Batrachia is in purple, Procera is in olive green. Numbers in the table are the support values for each data set and method used, and these are represented by color intensity (higher support for the node, more intense color). The complete set of results for the entire 33 taxa can be found in supplementary figures S2–S13, Supplementary Material online. Abbreviations: “ML” stands for maximum likelihood, “BI” for Bayesian inference, and “QS” for quartet-based supertree method. For ML, support values are bootstrap proportions; for BI, they are posterior probabilities (scaled over 1); and for QS, they are local posterior probabilities (scaled over 1).
Results from AU Tests of All Three Possible Topologies with the 768 Gene Families Capable of Addressing the Root of the Lissamphibia.
| Gene-Family Data Set | Number with Single Accepted Tree (a) | Favoring Batrachia, % (n) | Favoring Procera, % (n) | Favoring Third Topology, % (n) |
|---|---|---|---|---|
| All 768 gene families | 90 | 45% (41) | 36% (33) | 17% (16) |
| 348 genes passing Clan_Check | 35 | 51% (18) | 28% (10) | 20% (7) |
| 420 genes failing Clan_Check | 55 | 42% (23) | 42% (23) | 16% (9) |
(a) Number of gene families that had enough phylogenetic information to reject all hypotheses except one.
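As a quick consistency check on the table, the counts can be tallied as follows. This is a small sketch with hypothetical variable names. It only recomputes row totals and shares from the numbers printed above, and the recomputed shares may differ from the printed percentages by about a point, presumably because of rounding.

```python
# Counts copied from the table, in the order (Batrachia, Procera, Third topology).
counts = {
    "all_768": (41, 33, 16),
    "pass_348": (18, 10, 7),
    "fail_420": (23, 23, 9),
}

def shares(row):
    """Return the row total and each hypothesis's share as a percentage."""
    total = sum(row)
    return total, [round(100 * n / total, 1) for n in row]

for name, row in counts.items():
    total, pct = shares(row)
    print(name, total, pct)

# The passing and failing subsets should add up to the full 768-family row.
combined = tuple(p + f for p, f in zip(counts["pass_348"], counts["fail_420"]))
print(combined == counts["all_768"])
```

The row totals reproduce the "Number with Single Accepted Tree" column (90, 35, and 55), and the two Clan_Check subsets sum to the counts for all 768 gene families.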
Fig. 3. Timetree of vertebrates based on the most curated data set after enrichment for orthologs. Estimated upper and lower ranges for each node are shown as red and blue bars at that node. Red bars correspond to the divergence times estimated using Data set 348, the most curated data set. Dark blue bars correspond to the divergence times estimated using the larger Data set 768, which is before the “Clan_Check” step. Timescale in million years. Background colors for the geological time periods follow the International Commission on Stratigraphy.