Abstract
The amount of phylogenetically informative sequence data in GenBank is growing at an exponential rate, and large phylogenetic trees are increasingly used in research. Tools are needed to construct phylogenetic sequence matrices from GenBank data and evaluate the effect of missing data. Supermatrix Constructor (SUMAC) is a tool to data-mine GenBank, construct phylogenetic supermatrices, and assess the phylogenetic decisiveness of a matrix given the pattern of missing sequence data. SUMAC calculates a novel metric, Missing Sequence Decisiveness Scores (MSDS), which measures how much each individual missing sequence contributes to the decisiveness of the matrix. MSDS can be used to compare supermatrices and prioritize the acquisition of new sequence data. SUMAC constructs supermatrices either through an exploratory clustering of all GenBank sequences within a taxonomic group or by using guide sequences to build homologous clusters in a more targeted manner. SUMAC assembles supermatrices for any taxonomic group recognized in GenBank and is optimized to run on multicore computer systems by parallelizing multiple stages of operation. SUMAC is implemented as a Python package that can run as a stand-alone command-line program, or its modules and objects can be incorporated within other programs. SUMAC is released under the open source GPLv3 license and is available at https://github.com/wf8/sumac.
Keywords: GenBank; data-mining; decisiveness; partial taxon coverage; phylogenetics; supermatrix
Year: 2015; PMID: 26648681; PMCID: PMC4666519; DOI: 10.4137/EBO.S35384
Source DB: PubMed; Journal: Evol Bioinform Online; ISSN: 1176-9343; Impact factor: 1.625
Figure 1. MSDS for a sequence matrix with 10 genes, 384 operational taxonomic units (OTUs), taxon coverage density of 0.26, and PD of 0.31.
Notes: Pale yellow represents sequence data present, shades of orange represent missing sequences with low-to-intermediate MSDS (0–0.75), and red to maroon represents missing sequences with high MSDS (0.75–1.0). MSDS measures how much an individual missing sequence contributes to the decisiveness of the matrix given the overall pattern of missing data. MSDS therefore prioritizes which sequences to add to the matrix: adding a missing sequence with high MSDS increases the decisiveness of the matrix more than adding one with low MSDS.
MSDS for some of the 2857 missing sequences in the data matrix shown in Figure 1.
| MSDS RANK | MSDS | OTU | GENE REGION | GENE NAME |
|---|---|---|---|---|
| 1 | 0.862 | Ludwigia peploides | 1 | ITS |
| 2 | 0.857 | Ludwigia hyssopifolia | 1 | ITS |
| 3 | 0.775 | Epilobium brachycarpum | 1 | ITS |
| 4 | 0.772 | Clarkia lewisii | 1 | ITS |
| 5 | 0.772 | Epilobium macropus | 1 | ITS |
| 2855 | 0.001 | Sonneratia ovata | 2 | matK |
| 2856 | 0.001 | Sonneratia ovata | 9 | pgiC |
| 2857 | <0.001 | Sonneratia ovata | 3 | ndhF |
Notes: The scores are shown in descending order, prioritizing which holes in the data matrix should be filled to increase the phylogenetic decisiveness of the sequence matrix. SUMAC outputs the entire list as a CSV spreadsheet.
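The numbers quoted in the Figure 1 caption and the table caption can be cross-checked with a few lines of Python, and the ranked list can be filtered in the same way. This is only a minimal sketch. It assumes that taxon coverage density means the fraction of OTU-by-gene cells that contain a sequence. The function names and the CSV column names (`otu`, `gene`, `msds`) are hypothetical and are not taken from SUMAC's actual output format. The sample rows are copied from the table above.

```python
import csv
import io


def coverage_density(n_otus, n_genes, n_missing):
    """Fraction of OTU-by-gene cells that contain a sequence (assumed definition)."""
    total_cells = n_otus * n_genes
    return (total_cells - n_missing) / total_cells


def top_missing(csv_text, n):
    """Return (OTU, gene) pairs for the n missing sequences with the highest MSDS."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: float(row["msds"]), reverse=True)
    return [(row["otu"], row["gene"]) for row in rows[:n]]


# Values quoted in the Figure 1 caption (384 OTUs, 10 genes)
# and in the table caption (2857 missing sequences).
density = coverage_density(n_otus=384, n_genes=10, n_missing=2857)
print(round(density, 2))  # 0.26, matching the caption

# Hypothetical CSV layout; the real SUMAC column names may differ.
SAMPLE = """otu,gene,msds
Sonneratia ovata,matK,0.001
Epilobium brachycarpum,ITS,0.775
Ludwigia peploides,ITS,0.862
Ludwigia hyssopifolia,ITS,0.857
"""
print(top_missing(SAMPLE, 2))
```

With 384 × 10 = 3840 cells and 2857 of them empty, 983 cells are filled, which gives a density of about 0.256 and agrees with the 0.26 reported for Figure 1.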