Jacob L Steenwyk, Thomas J Buida, Yuanning Li, Xing-Xing Shen, Antonis Rokas.
Abstract
Highly divergent sites in multiple sequence alignments (MSAs), which can stem from erroneous inference of homology and saturation of substitutions, are thought to negatively impact phylogenetic inference. Thus, several different trimming strategies have been developed for identifying and removing these sites prior to phylogenetic inference. However, a recent study reported that doing so can worsen inference, underscoring the need for alternative alignment trimming strategies. Here, we introduce ClipKIT, an alignment trimming software that, rather than identifying and removing putatively phylogenetically uninformative sites, instead aims to identify and retain parsimony-informative sites, which are known to be phylogenetically informative. To test the efficacy of ClipKIT, we examined the accuracy and support of phylogenies inferred from 14 different alignment trimming strategies, including those implemented in ClipKIT, across nearly 140,000 alignments from a broad sampling of evolutionary histories. Phylogenies inferred from ClipKIT-trimmed alignments are accurate, robust, and time saving. Furthermore, ClipKIT consistently outperformed other trimming methods across diverse datasets, suggesting that strategies based on identifying and retaining parsimony-informative sites provide a robust framework for alignment trimming.
Year: 2020 PMID: 33264284 PMCID: PMC7735675 DOI: 10.1371/journal.pbio.3001007
Source DB: PubMed Journal: PLoS Biol ISSN: 1544-9173 Impact factor: 8.029
The 14 different MSA trimming strategies tested in this study.
| Software | Approach | Parameter(s) | Reference |
|---|---|---|---|
| ClipKIT | Keep parsimony-informative sites | kpi mode | This study |
| | Keep parsimony-informative sites and remove highly gappy sites | kpi-gappy mode; remove sites with 90% gaps | |
| | Keep parsimony-informative and constant sites | kpic mode | |
| | Keep parsimony-informative and constant sites and remove highly gappy sites | kpic-gappy mode; remove sites with 90% gaps | |
| | Remove highly gappy sites | gappy mode; remove sites with 90% gaps | |
| BMGE | Remove sites with high entropy | Entropy threshold of 0.3 | [ |
| | Remove sites with high entropy | Default entropy threshold of 0.5 | |
| | Remove sites with high entropy | Entropy threshold of 0.7 | |
| Gblocks | Remove sites that are gap rich and highly variable | default | [ |
| Noisy | Predict homoplastic sites and remove them | default | [ |
| trimAl | Remove highly gappy and variable sites | strict mode | [ |
| | Remove highly gappy and variable sites | strictplus mode | |
| | Remove highly gappy sites | gappyout mode | |
| No trimming | N/A | N/A | N/A |
Each MSA trimming strategy tested by our study, the software used, a general description of its trimming approach, its parameters, and a citation for the software used are described here.
BMGE, Block Mapping and Gathering with Entropy; MSA, multiple sequence alignment; N/A, not applicable.
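The five ClipKIT modes listed in the table can be illustrated with a minimal Python sketch. This is not ClipKIT's implementation. It assumes the textbook definition of a parsimony-informative site (at least two character states, each present in at least two sequences), treats a site as constant when exactly one state is observed outside of gaps, treats only `-` and `?` as gap characters, and drops a site as "gappy" when its gap fraction is at or above the threshold. The names `classify_site` and `trim` are hypothetical.

```python
from collections import Counter

GAPS = {"-", "?"}  # assumption: only these characters count as gaps


def classify_site(column):
    """Label one alignment column as 'informative', 'constant', or 'other'."""
    counts = Counter(ch.upper() for ch in column if ch not in GAPS)
    # States that occur in at least two sequences
    repeated_states = sum(1 for n in counts.values() if n >= 2)
    if repeated_states >= 2:
        return "informative"  # parsimony-informative (assumed definition)
    if len(counts) == 1:
        return "constant"     # a single observed state (assumed definition)
    return "other"


def trim(alignment, mode="kpi", gap_threshold=0.9):
    """Return a trimmed copy of `alignment` (dict: name -> aligned sequence).

    mode is one of: kpi, kpi-gappy, kpic, kpic-gappy, gappy.
    """
    kinds_kept = {
        "kpi": {"informative"},
        "kpic": {"informative", "constant"},
        "gappy": {"informative", "constant", "other"},
    }[mode.replace("-gappy", "")]
    drop_gappy = mode.endswith("gappy")

    names = list(alignment)
    kept = []
    for column in zip(*(alignment[name] for name in names)):
        gap_fraction = sum(ch in GAPS for ch in column) / len(column)
        if drop_gappy and gap_fraction >= gap_threshold:
            continue  # site is too gappy for the *-gappy modes
        if classify_site(column) in kinds_kept:
            kept.append(column)
    return {name: "".join(col[i] for col in kept)
            for i, name in enumerate(names)}


toy = {"t1": "AAC-A", "t2": "AAC-G", "t3": "ATG-T", "t4": "ATGAC"}
print(trim(toy, mode="kpi"))
print(trim(toy, mode="kpic-gappy", gap_threshold=0.7))
```

In the toy alignment, only the second and third columns are parsimony-informative, so `kpi` keeps two sites, while the `kpic` variants also retain the constant first column.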
Fig 1. The 14 alignment trimming strategies tested differ in resulting MSAs and metrics of phylogenetic tree accuracy and support.
Principal component analysis of alignment length, nRF, and ABS values across the 14 MSA trimming strategies for 4 empirical datasets (A) and 4 simulated datasets (B). Insets of scree plots depict the percentage of variation explained (y-axis) for the first 5 dimensions (x-axis). Data were scaled prior to conducting principal component analysis. Note that the BMGE 0.3 and Gblocks strategies are not represented in Fig 1B because they frequently removed entire alignments and were therefore excluded from the analysis of simulated sequences. Data used to generate this figure can be found on figshare (doi: 10.6084/m9.figshare.12401618). ABS, average bipartition support; BMGE, Block Mapping and Gathering with Entropy; MSA, multiple sequence alignment; nRF, normalized Robinson–Foulds.
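For readers unfamiliar with the nRF metric named in this caption, the following Python sketch shows one common way to compute it. The record above does not give the formula, so this assumes the usual convention for fully resolved trees on the same n taxa: the Robinson–Foulds distance (the number of non-trivial bipartitions found in only one of the two trees) divided by its maximum, 2(n − 3). Trees are written as nested tuples of taxon names purely for illustration, and `bipartitions` and `normalized_rf` are hypothetical helper names.

```python
def bipartitions(tree):
    """Collect the non-trivial bipartitions of a tree given as nested tuples.

    Each bipartition is stored as the side that does NOT contain a fixed
    reference taxon, so the same split always has the same representation.
    """
    def leaf_set(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaf_set(child) for child in node))

    all_leaves = leaf_set(tree)
    reference = min(all_leaves)
    splits = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(walk(child) for child in node))
        side = all_leaves - clade if reference in clade else clade
        # Keep only splits with at least two taxa on each side
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
        return clade

    walk(tree)
    return splits, len(all_leaves)


def normalized_rf(tree_a, tree_b):
    """Robinson-Foulds distance scaled to [0, 1] (assumes binary trees, n >= 4)."""
    splits_a, n_a = bipartitions(tree_a)
    splits_b, n_b = bipartitions(tree_b)
    if n_a != n_b:
        raise ValueError("trees must contain the same number of taxa")
    rf = len(splits_a ^ splits_b)  # bipartitions unique to either tree
    return rf / (2 * (n_a - 3))


reference_tree = (("A", "B"), ("C", ("D", "E")))
inferred_tree = (("A", "C"), ("B", ("D", "E")))
print(normalized_rf(reference_tree, inferred_tree))
```

A value of 0 means the two trees share every bipartition; a value of 1 means they share none.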
Fig 2. ClipKIT is a top-performing software for trimming MSAs.
Desirability-based integration of accuracy and support metrics per MSA facilitated the comparison of relative performance of the 14 different MSA trimming strategies for empirical (A–D) and simulated (E–H) datasets. Examination of performance for individual datasets and average performance across empirical (I) and simulated (J) datasets revealed that ClipKIT is a top-performing software. MSA trimming strategies are ordered along the x-axis from the highest-performing strategy to the lowest-performing one according to average desirability-based rank. Boxplots embedded in violin plots have lower, middle, and upper hinges that represent the first, second, and third quartiles, respectively. Whiskers extend to 1.5 times the interquartile range. Data used to generate this figure can be found on figshare (doi: 10.6084/m9.figshare.12401618). AA, amino acid; BMGE, Block Mapping and Gathering with Entropy; MSA, multiple sequence alignment; NT, nucleotide.