Emanuel Maldonado, Agostinho Antunes.
Abstract
BACKGROUND: Recent advances in genome sequencing technologies and the falling cost of high-throughput sequencing continue to produce a deluge of data for downstream analyses. Evolutionary biologists, among others, often use genomic data to uncover phenotypic diversity and adaptive evolution in protein-coding genes. Multiple sequence alignments (MSA) and phylogenetic trees (PT) therefore need to be estimated with optimal results. However, preparing an initial dataset of multiple sequence file(s) (MSF) and carrying out the steps involved can be challenging when the amount of data is extensive. It is thus necessary to develop a tool that removes potential sources of error and automates the time-consuming steps of a typical workflow, with high-throughput and optimal MSA and PT estimations.
Keywords: Accuracy; Character coding; Consensus; High-throughput; Multi-core; Multigene; Multiple sequence alignment; Phylogeny; Software package; Uncertainty
Year: 2019 PMID: 31888452 PMCID: PMC6937843 DOI: 10.1186/s12859-019-3292-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
LMAP_S listing of integrated software (31) and related stages
| LMAP_S Stage | Integrated Software | References | Algorithms Implemented |
|---|---|---|---|
| Stage 2 (AE) | Clustal Omega (v.1.2.1) | [ | |
| | ClustalW (v.2.1) | [ | |
| | Dialign-tx (v.1.0.2) | [ | (3) Dialign-tx |
| | FSA (v.1.15.9) | [ | (2) FSA |
| | GramAlign (v.3.0) | [ | |
| | Kalign (v.2.04) | [ | |
| | MACSE (v.1.0.2) | [ | (2) MACSE |
| | MAFFT (v.7.271) | [ | (8) MAFFT |
| | MUSCLE (v.3.8.31) | [ | |
| | Opal (v.2.1.3) | [ | |
| | Prank (v.150803) | [ | (6) Prank |
| | ProbAlign (v.1.4) | [ | |
| | ProbCons (v.1.12) | [ | |
| | T-COFFEE (v.11.00.8cbe486) | [ | (4) |
| Stage 3 (AOD) | OD-Seq (v.1.0) | [ | |
| | EvalMSA (v.1.0) | [ | |
| Stage 4 (ARC) | Gblocks (v.0.91b) | [ | (2) |
| | MaxAlign (v.1.1) | [ | |
| | MergeAlign (n.f.) | [ | |
| | Noisy (v.1.5.12) | [ | |
| | PSAR-Align (v.1.0) | [ | |
| | TCS (T-COFFEE) (v.11.00.8cbe486) | [ | (3) TCS, TCS_original, TCS_FM |
| | TrimAl (v.1.4) | [ | (6) TrimAl |
| | WeaveAlign (v.1.2.1) | [ | |
| Stage 5 (PE) | IQ-TREE (v.1.6.2) | [ | (15) IQ-TREE DNA, IQ-TREE DNA (DEG), IQ-TREE DNA (RY), IQ-TREE CODON, IQ-TREE NT2AA. Each case is available for |
| | MPBoot (v.1.1.0) | [ | (2) MPBoot DNA. Each case is available for |
| | Ninja (v.1.2.2) | [ | |
| | SMS (v.1.8.1) | [ | (4) AIC + NNI, AIC + SPR, BIC + NNI, BIC + SPR |
| | Degen (v.1.4) (*) | [ | |
| | RYcode (v.1.0.0) (*) | This work | |
| Stage 6 (PCC) | CONSEL (v.1.20) | [ | |
| | TreeCmp (v.1.1) | [ | |
Legend: (Number) – in the Algorithms Implemented column, where present, indicates the total number of algorithms implemented. (n.f.) – not found. (*) – Integrated as part of the Stage 5 IQ-TREE algorithms DNA (DEG) and DNA (RY). (#) – Stage 4 consensus algorithms. DNA (nucleotide coding), DEG (degeneracy coding), RY (puRine and pYrimidine coding), NT2AA (translated – amino acid coding). AIC (Akaike Information Criterion) [89], BIC (Bayesian Information Criterion) [90], NNI (Nearest-Neighbor Interchange), SPR (Subtree Pruning and Regrafting). dN (non-synonymous distance), dS (synonymous distance). Listed software versions (see also Additional file 2: Figure S7) are given only as a reference for working cases and can be replaced by newer ones.
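The RY coding named in the legend can be illustrated with a minimal Python sketch. It shows only the generic recoding (purines A and G become R; pyrimidines C, T and U become Y) and is not the RYcode program integrated in LMAP_S. The function name `ry_recode`, the decision to leave gaps and ambiguity codes untouched, and recoding every position rather than selected codon positions are all assumptions made for illustration.

```python
# Generic RY recoding sketch (hypothetical helper, not the LMAP_S RYcode tool).
# Purines (A, G) -> R; pyrimidines (C, T, U) -> Y.
# Assumption: gaps and ambiguity codes are passed through unchanged.
RY_TABLE = str.maketrans("AGCTUagctu", "RRYYYRRYYY")


def ry_recode(sequence: str) -> str:
    """Return the sequence with every unambiguous nucleotide recoded as R or Y."""
    return sequence.translate(RY_TABLE)


if __name__ == "__main__":
    print(ry_recode("ATGC-N"))
```

Recoding collapses the four nucleotides into two states, so the recoded alignment only records transversions; whether that trade-off suits a given dataset is a choice for the researcher.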
Fig. 1 LMAP_S workflow. Flowchart exhibiting the lmap-s.pl workflow, where stages are organized in a sequential fashion. The omission or inclusion of certain Stages helps devise specific workflows based on researcher requirements. Gray boxes reflect optional stages. Stages 3 (AOD) and 6 (PCC) produce reports only, seven in total. Stage 4 (ARC) additionally produces one report. NDP – Nucleotide Data Pre-processing, AE – MSA Estimation, AOD – MSA Outlier Detection, ARC – MSA Refinement and Consensus, PE – PT Estimation, PCC – PT Comparison and Consensus, PDP – PT Data Post-processing.
Fig. 2 LMAP_S interactive functioning. (a) The default or main “Run Status” screen presents the currently running tasks. Pressing “2” shows the “Task Status” screen, which lists (b) the tasks that will run next (first ten) and (c) the most recently finished tasks (last ten); pressing “1” returns to (a). (d) When the execution of lmap-s.pl is interrupted (by typing “Ctrl-c” or “Ctrl-\”), the user can either quit or proceed to the built-in process manager presented here, which allows specific tasks to be terminated: a group of tasks by typing “G:MMAPID”, or a single task by typing “P:PROCID”. The MMAPID and PROCID identifiers are shown in the respective columns of the table.
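The two process-manager commands in the Fig. 2 caption have the shape `PREFIX:IDENTIFIER`. The Python sketch below shows one way such input could be parsed and validated. It is not how lmap-s.pl (a Perl script) handles it; the names `KillRequest` and `parse_kill_command` are hypothetical, and treating the identifier as an opaque non-empty string and rejecting lowercase prefixes are assumptions.

```python
from typing import NamedTuple


class KillRequest(NamedTuple):
    """Hypothetical container for a parsed process-manager command."""
    scope: str       # "group" for G:MMAPID, "process" for P:PROCID
    identifier: str  # kept as an opaque string; its real format is not assumed


def parse_kill_command(text: str) -> KillRequest:
    """Parse 'G:<MMAPID>' or 'P:<PROCID>' into a KillRequest."""
    prefix, separator, identifier = text.strip().partition(":")
    scopes = {"G": "group", "P": "process"}
    if separator != ":" or prefix not in scopes or not identifier:
        raise ValueError(f"expected 'G:MMAPID' or 'P:PROCID', got {text!r}")
    return KillRequest(scopes[prefix], identifier)


if __name__ == "__main__":
    print(parse_kill_command("G:12"))
    print(parse_kill_command("P:3456"))
```

Rejecting anything outside the two documented prefixes keeps a mistyped command from terminating the wrong tasks.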
Fig. 3 Pie charts exhibiting the optimal consensus strategies (with highest TTS). Illustration of the optimal results for the provided dataset, derived from the LMAP_S consensus histogram report (Additional file 5: Table S8). Each pie chart presents the results for one gene, showing at the center the highest TTS value in parentheses. The arc lines surrounding each pie chart highlight the number of optimal consensus strategies. The fraction at the bottom right corner of each chart shows the number of consensus strategies sharing the highest TTS over the total number of strategies. The three squared cases (genes ATP8, COX2 and ND4L) are the only ones showing a unique consensus strategy with the highest TTS. These optimal strategies are “ATP8_MAFFTF2_TRIMALS_DNA_UB”, “COX2_PRANKCDF_MAXALIGN_DNA_UB” and “ND4L_MAFFTEI_MAXALIGN_DNA_UB”, which clearly show that the optimal algorithms (from the AE and ARC Stages) differ between genes. Notably, the optimal CC option (DNA) was the same in all three. For the remaining cases, any of the consensus strategies can be taken as optimal as long as it has the highest TTS.
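The selection rule in the Fig. 3 caption (keep every consensus strategy that shares the highest TTS, and call the result unique when only one remains) can be sketched in a few lines of Python. This is an illustration of the rule only, not LMAP_S code; the function name `optimal_strategies`, the strategy labels and the scores are invented for the example and do not come from the paper's results.

```python
from typing import Dict, List


def optimal_strategies(tts_by_strategy: Dict[str, float]) -> List[str]:
    """Return all strategy names that share the highest TTS, sorted by name."""
    if not tts_by_strategy:
        return []
    best = max(tts_by_strategy.values())
    # Exact equality is intended here: ties are strategies with the same score.
    return sorted(name for name, tts in tts_by_strategy.items() if tts == best)


if __name__ == "__main__":
    # Hypothetical scores for one gene (illustrative values only).
    scores = {"strategy_a": 3.0, "strategy_b": 5.0, "strategy_c": 5.0}
    winners = optimal_strategies(scores)
    print(winners, "unique" if len(winners) == 1 else "tied")
```

With these made-up scores two strategies tie, which corresponds to the non-squared pie charts where any of the tied strategies may be taken as optimal.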