Dominic J Bennett, Hannes Hettling, Daniele Silvestro, Alexander Zizka, Christine D Bacon, Søren Faurby, Rutger A Vos, Alexandre Antonelli.
Abstract
The exceptional increase in molecular DNA sequence data in open repositories is mirrored by an ever-growing interest among evolutionary biologists in harvesting and using those data for phylogenetic inference. Many quality issues are known, however, and the sheer amount and complexity of the available data can pose considerable barriers to their usefulness. A key issue in this domain is the high frequency of sequence mislabeling encountered when searching for suitable sequences for phylogenetic analysis. These issues include, among others, the incorrect identification of sequenced species, non-standardized and ambiguous sequence annotation, and the inadvertent addition of paralogous sequences by users. Taken together, these issues likely add considerable noise, error or bias to phylogenetic inference, a risk that is likely to increase with the size of phylogenies or of the molecular datasets used to generate them. Here we present a software package, phylotaR, that bypasses these issues by instead using an alignment search tool to identify orthologous sequences. Our package builds on the framework of its predecessor, PhyLoTa, by providing a modular pipeline for identifying overlapping sequence clusters from up-to-date GenBank data, together with new features, improvements and tools. We demonstrate and test our pipeline's effectiveness by presenting trees generated from phylotaR clusters for two large taxonomic clades: palms and primates. Given the versatility of this package, we hope that it will become a standard tool for any research aiming to use GenBank data for phylogenetic analysis.
Keywords: BLAST; DNA; R; open source; phylogenetics; sequence orthology
Year: 2018 PMID: 29874797 PMCID: PMC6027284 DOI: 10.3390/life8020020
Source DB: PubMed Journal: Life (Basel) ISSN: 2075-1729
Comparing phylotaR and PhyLoTa.
| Feature | phylotaR | PhyLoTa |
|---|---|---|
| Direct clades | Yes | Yes |
| Subtree clades | Yes | Yes |
| Paraphyletic clades | Yes | No |
| Merged clades | Yes | No |
| Outputs | Clusters | Clusters, alignments, trees |
| Language | R | Perl |
| Open source | Yes | No |
| Execution | Local computer | Web interface |
| Modular design | Yes | No |
| GenBank release | Latest | 2013 |
| Search tool | BLAST, user-choice * | BLAST |
| Taxonomy | NCBI, user-choice * | NCBI |
| Sequence features | Yes | No |
| Non-NCBI sequences | Yes * | No |
* Features yet to be implemented.
Figure 1. The phylotaR pipeline identifies all sequences in GenBank associated with a user-specified taxonomic identity (a). The pipeline then performs all-vs.-all BLAST across all the sequences to identify orthologous clusters (b). These searches are constrained to run within taxonomic groups up to a user-determined limit (default 50,000 sequences and 100,000 nodes). To generate higher taxonomic level clusters, an additional BLAST search is performed of the most connected sequences within clusters (i.e., the seed sequences) from the lower-level clusters. The clusters of overlapping seed sequences are then merged into larger clusters (c). All clusters, merged and non-merged, are then reported for inspection by the user. For more details on the pipeline, see Appendix A.
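The clustering and merging steps in the Figure 1 caption can be illustrated with a toy sketch. This is not phylotaR code (the package itself is written in R), and it makes several assumptions: BLAST hits are given as pre-computed, already filtered pairs of sequence IDs; a "cluster" is taken to be a connected component of the hit graph; and the seed is the member with the most hits. The function names `cluster_hits` and `merge_by_seeds`, and all sequence IDs, are hypothetical.

```python
def cluster_hits(seq_ids, hits):
    """Group sequences into connected components of the hit graph.

    Assumption: `hits` is a list of (id_a, id_b) pairs standing in for
    filtered all-vs-all BLAST results within one taxonomic group.
    """
    neighbours = {s: set() for s in seq_ids}
    for a, b in hits:
        if a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)

    clusters, seen = [], set()
    for start in seq_ids:
        if start in seen:
            continue
        # Depth-first walk collects everything reachable from `start`.
        stack, members = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            members.append(node)
            for nb in sorted(neighbours[node]):
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        members.sort()
        # Seed = most connected member; ties go to the first ID in sort order.
        seed = max(members, key=lambda s: len(neighbours[s]))
        clusters.append({"seed": seed, "members": members})
    return clusters


def merge_by_seeds(clusters, seed_hits):
    """Merge lower-level clusters whose seed sequences hit each other."""
    by_seed = {c["seed"]: c for c in clusters}
    merged = []
    for group in cluster_hits(list(by_seed), seed_hits):
        members = sorted(
            set().union(*(by_seed[s]["members"] for s in group["members"]))
        )
        merged.append({"seeds": group["members"], "members": members})
    return merged


# Two hypothetical taxonomic groups, clustered separately.
clusters_a = cluster_hits(["a1", "a2", "a3"], [("a1", "a2"), ("a2", "a3")])
clusters_b = cluster_hits(["b1", "b2", "b3"], [("b1", "b2")])

# A second search among seed sequences only links the two groups.
merged = merge_by_seeds(clusters_a + clusters_b, [("a2", "b1")])
for cluster in merged:
    print(cluster["seeds"], cluster["members"])
```

In this toy input, the seeds `a2` and `b1` hit each other, so their clusters are merged, while the singleton `b3` stays on its own. Comparing only seeds at the higher level keeps the second search small compared with a full all-vs-all search across both groups.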
Figure 2. Initiating the phylotaR pipeline in R for primates (TaxID: 9443).
Figure 3. Presence/absence of tribes and genera for palms (a) and primates (b), respectively, across the top ten best clusters. X-axis numbers are unique cluster IDs. For more details on each of these clusters, see Tables S4a and S4b.
Figure 4. Tribe- and genus-level trees for palms (a) and primates (b). Roots were determined manually by rooting with Strepsirrhini and Calamoideae for primates and palms, respectively. Branch lengths have been removed. Support calculated from 100 rapid bootstraps: *** >0.95, ** >0.75 and * >0.50. Complete tree construction methods are in Appendix B. For tree comparisons with published trees for palms [37] and primates [38], see Figures S4 and S5.