Andrzej Zielezinski, Hani Z Girgis, Guillaume Bernard, Chris-Andre Leimeister, Kujin Tang, Thomas Dencker, Anna Katharina Lau, Sophie Röhling, Jae Jin Choi, Michael S Waterman, Matteo Comin, Sung-Hou Kim, Susana Vinga, Jonas S Almeida, Cheong Xin Chan, Benjamin T James, Fengzhu Sun, Burkhard Morgenstern, Wojciech M Karlowski.
Abstract
BACKGROUND: Alignment-free (AF) sequence comparison is attracting persistent interest driven by data-intensive applications. Hence, many AF procedures have been proposed in recent years, but the lack of a clearly defined benchmarking consensus hampers their performance assessment.
Keywords: Alignment-free; Benchmark; Horizontal gene transfer; Sequence comparison; Web service; Whole-genome phylogeny
Year: 2019 PMID: 31345254 PMCID: PMC6659240 DOI: 10.1186/s13059-019-1755-7
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Alignment-free sequence comparison tools included in this study
| Software | Approach class | Software version | Availability |
|---|---|---|---|
| AAF | Exact | 10/01/2017 | |
| AFKS | Exact | 1.0 | |
| alfpy | Exact | 1.0.6 | |
| CAFÉ | Exact | 1.0.0 | |
| FFP | Exact | 2v.2.1 | |
| jD2Stat | Exact | 1.0 | |
| LZW-Kernel | Information theory | NA | |
| spaced | Inexact | 1.0 | |
| kWIP | Inexact | 0.2.0-13-g3cf8a9e | |
| ALFRED-G | Maximal length of exact common substrings | NA | |
| kmacs | Maximal length of exact common substrings | 1.0 | |
| kr | Maximal length of exact common substrings | 2.0.2 | |
| Underlying Approach | Maximal length of exact common substrings | NA | |
| andi | Micro-alignments | 0.02 | |
| co-phylog | Micro-alignments | NA | |
| FSWM | Micro-alignments | 1.0 | |
| Multi-SpaM | Micro-alignments | 1.0 | |
| phylonium | Micro-alignments | 0.3 | |
| mash | Number of word matches | 2.1 | |
| Slope-SpaM | Number of word matches | 0.1 | |
| Skmer | Number of word matches | 3.0.0 | |
| RTD-Phylogeny | Return time distribution | 1.0.1 | |
| kSNP3 | SNP count | 3.1 | |
| EP-sim | Variable-length word counts | 1.0 | |
Detailed information about the tools’ parameter values used in this study for different reference data sets is provided in Additional file 1: Table S1. A concise description of the listed tools is provided in the “Methods” section.
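To make the "Exact" word-count approach class in the table above more concrete, the following is a minimal sketch of one generic word-count distance: count every overlapping word of length k in each sequence and take the Euclidean distance between the two count vectors. It is an illustration only and does not reproduce any of the listed tools; the function names and the default `k = 3` are assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def euclidean_kmer_distance(a: str, b: str, k: int = 3) -> float:
    """Euclidean distance between the k-mer count vectors of a and b.

    Illustrative measure only (an assumption of this sketch), not the
    method implemented by any specific tool in the table.
    """
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # A Counter returns 0 for words that are absent from one sequence.
    return sqrt(sum((ca[w] - cb[w]) ** 2 for w in ca.keys() | cb.keys()))


# Identical sequences have distance 0; diverged sequences a positive value.
print(euclidean_kmer_distance("ACGTACGT", "ACGTACGT"))
print(euclidean_kmer_distance("ACGTACGT", "TTTTACGT"))
```

Because no alignment is computed, the cost grows roughly linearly with sequence length, which is one reason word-count measures are attractive for large data sets.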
Overview of the reference data sets
| Category | Name | # Sequences | Average sequence length | # Files | # Sequence comparisons |
|---|---|---|---|---|---|
| Regulatory element detection | | 370 | 764 nt | 370 | 68,256 |
| Protein sequence classification | Low sequence identity (< 40%) | 1,066 | 180 aa | 1,066 | 567,645 |
| | High sequence identity (≥ 40%) | 2,128 | 184 aa | 2,128 | 2,263,128 |
| Gene tree inference | SwissTree | 651 | 398 aa | 651 | 211,575 |
| Genome-based phylogeny | Assembled genomes | | | | |
| | 29 | 29 | 4,895,247 nt | 29 | 406 |
| | 14 plant species | 14 | 337,515,688 nt | 14 | 91 |
| | 25 fish mitochondrial genomes | 25 | 16,623 nt | 25 | 300 |
| | Unassembled genomes | | | | |
| | 29 | | | | |
| | Coverage 0.03125 | 29,557 | 150 nt | 29 | 406 |
| | Coverage 0.0625 | 59,116 | 150 nt | 29 | 406 |
| | Coverage 0.125 | 118,266 | 150 nt | 29 | 406 |
| | Coverage 0.25 | 236,541 | 150 nt | 29 | 406 |
| | Coverage 0.5 | 473,081 | 150 nt | 29 | 406 |
| | Coverage 1 | 946,169 | 150 nt | 29 | 406 |
| | Coverage 5 | 4,730,778 | 150 nt | 29 | 406 |
| | 14 plant species | | | | |
| | Coverage 0.015625 | 48,274 | 150 nt | 14 | 91 |
| | Coverage 0.03125 | 96,489 | 150 nt | 14 | 91 |
| | Coverage 0.0625 | 1,931,268 | 150 nt | 14 | 91 |
| | Coverage 0.125 | 3,862,905 | 150 nt | 14 | 91 |
| | Coverage 0.25 | 7,725,928 | 150 nt | 14 | 91 |
| | Coverage 0.5 | 15,461,718 | 150 nt | 14 | 91 |
| | Coverage 1 | 30,903,727 | 150 nt | 14 | 91 |
| Horizontal gene transfer | 27 | 27 | 4,905,896 nt | 27 | 351 |
| | 8 Yersinia species | 8 | 4,605,553 nt | 8 | 28 |
| | 33 simulated genomes | | | | |
| | HGT level 0 | 33 | 2,205,524 nt | 33 | 528 |
| | HGT level 250 | 33 | 2,149,620 nt | 33 | 528 |
| | HGT level 500 | 33 | 2,230,317 nt | 33 | 528 |
| | HGT level 750 | 33 | 2,263,926 nt | 33 | 528 |
| | HGT level 1,000 | 33 | 2,238,661 nt | 33 | 528 |
An interactive visualization of all results for all data sets can be found online (http://afproject.org).
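The "# Sequence comparisons" column for the protein, gene tree, genome, and horizontal gene transfer data sets matches the number of unordered pairs, n(n − 1)/2, of the n sequences or files being compared. The sketch below checks that arithmetic for several rows. It also includes a rough read-count estimate for the unassembled data sets, under the assumption (not stated in this text) that the number of reads is approximately coverage × total genome length ÷ read length. The function names are hypothetical.

```python
def pair_count(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def estimated_reads(n_genomes: int, avg_len: int, coverage: float,
                    read_len: int = 150) -> int:
    """Rough read count for a given coverage.

    Assumption of this sketch: reads ~ coverage * total bases / read length.
    """
    return round(coverage * n_genomes * avg_len / read_len)


# n values taken from the table: protein sets, SwissTree, genome sets.
for n in (1066, 2128, 651, 29, 14, 25, 27, 8, 33):
    print(n, pair_count(n))

# 29 assembled genomes, average length 4,895,247 nt, coverage 1:
# the estimate lands close to the 946,169 reads listed in the table.
print(estimated_reads(29, 4_895_247, 1.0))
```

The quadratic growth in pairs explains why the largest protein set (2,128 sequences) already requires more than two million comparisons.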
Fig. 1 Overview of the AFproject benchmarking service facilitating assessment and comparison of AF methods. AF method developers run their methods on a reference sequence set and submit the computed pairwise sequence distances to the service. The submitted distances are subjected to a test specific to given data sets, and the results are returned to the method developer, who can choose to make the results publicly available
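The developer-side step of this workflow, computing a distance for every pair of reference sequences before submission, can be sketched as follows. The three-column tab-separated layout, the function names, and the placeholder distance are all assumptions made for illustration; the file formats the service actually accepts are not described in this text.

```python
import csv
import io
from itertools import combinations
from typing import Callable, Dict, Iterator, Tuple


def pairwise_rows(seqs: Dict[str, str],
                  dist: Callable[[str, str], float]
                  ) -> Iterator[Tuple[str, str, float]]:
    """Yield (id_a, id_b, distance) for every unordered pair of sequences."""
    for (id_a, seq_a), (id_b, seq_b) in combinations(seqs.items(), 2):
        yield id_a, id_b, dist(seq_a, seq_b)


def length_difference(a: str, b: str) -> float:
    """Placeholder distance (hypothetical); substitute a real AF measure."""
    return float(abs(len(a) - len(b)))


def to_tsv(seqs: Dict[str, str], dist: Callable[[str, str], float]) -> str:
    """Serialize all pairwise distances as tab-separated text (assumed layout)."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(pairwise_rows(seqs, dist))
    return buf.getvalue()


# Toy input with made-up identifiers; three sequences give three pairs.
toy = {"seqA": "ACGT", "seqB": "ACGTAC", "seqC": "AC"}
print(to_tsv(toy, length_difference), end="")
```

Passing the distance function as a parameter keeps the pair enumeration independent of the AF method being evaluated.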
Fig. 2 Summary of AF tool performance across all reference data sets. The numbers in the fields indicate the performance scores (from 0 to 100; see the “Methods” section) of a given AF method for a given data set. Fields are color-coded by performance values. The numbers in bold indicate the highest performance obtained within a given data set. An empty field indicates the corresponding tool’s inability to be run on a data set. An extended version of this figure including values of the overall performance score is provided in Additional file 1: Table S14. The most up-to-date summary of AF tool performance can be found at: http://afproject.org/app/tools/performance/