Emanuel Weitschek, Fabio Cunial, Giovanni Felici.
Abstract
Alignment-free algorithms can be used to estimate the similarity of biological sequences and hence are often applied to the phylogenetic reconstruction of genomes. Most of these algorithms rely on comparing the frequency of all the distinct substrings of fixed length (k-mers) that occur in the analyzed sequences. In this paper, we present Logic Alignment Free (LAF), a method that combines alignment-free techniques and rule-based classification algorithms in order to assign biological samples to their taxa. This method searches for a minimal subset of k-mers whose relative frequencies are used to build classification models as disjunctive-normal-form logic formulas (if-then rules). We successfully apply LAF to the classification of bacterial genomes into their corresponding taxa. In particular, we obtain reliable classification at different taxonomic levels by extracting a handful of rules, each one based on the frequency of just a few k-mers. State-of-the-art methods that adjust the frequency of k-mers to the character distribution of the underlying genomes have negligible impact on classification performance, suggesting that the signal of each class is strong and that LAF is effective in identifying it.
Keywords: Alignment-free sequence comparison; Bacterial taxonomy; Supervised classification
Year: 2015 PMID: 26664519 PMCID: PMC4673791 DOI: 10.1186/s13040-015-0073-1
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Example of the matrix of frequency vectors extracted by LAF and provided as input to the rule-based classifiers
| k-mer | | | … | | |
|---|---|---|---|---|---|
| AAA | 0.46 | 0.26 | … | 0.24 | 0.26 |
| AAC | 0.12 | 0.16 | … | 0.23 | 0.24 |
| AAG | 0.13 | 0.23 | … | 0.23 | 0.22 |
| … | … | … | … | … | … |
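The matrix above has one row per k-mer and one column per sequence. The following is a minimal sketch of how such relative frequencies could be computed. It assumes a DNA alphabet of A, C, G, T, overlapping windows, and normalization by the number of valid windows; the function names are illustrative and are not taken from the LAF software, whose exact counting and normalization may differ.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: plain DNA alphabet


def kmer_relative_frequencies(sequence, k):
    """Return {k-mer: relative frequency} over all 4**k DNA k-mers."""
    sequence = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    # Normalize only over windows made of A, C, G, T (e.g. windows with N are ignored).
    total = sum(counts[w] for w in kmers)
    return {w: (counts[w] / total if total else 0.0) for w in kmers}


def frequency_matrix(sequences, k):
    """Rows are k-mers, columns are sequences, mirroring the table layout."""
    columns = [kmer_relative_frequencies(s, k) for s in sequences]
    kmers = sorted(columns[0]) if columns else []
    return {w: [col[w] for col in columns] for w in kmers}


# Toy sequences (hypothetical, not from the paper's data set).
toy = ["AAAAAC", "AAGAAG"]
matrix = frequency_matrix(toy, 3)
print(matrix["AAA"], matrix["AAC"], matrix["AAG"])
```

Each column of the resulting matrix sums to 1, so genomes of different lengths remain comparable.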
Fig. 1 Flow chart of the LAF method
Percent accuracy of the rule-based classifiers for each taxonomic unit (10-fold cross validation) on the filtered data set
| Level | RIPPER | RIDOR | PART | DMB | Avg ±std.dev |
|---|---|---|---|---|---|
| Species | 93.21 | 97.33 | 96.36 | | 96.13 ±2.0 |
| Genus | 93.98 | | 97.10 | 98.44 | 97.08 ±2.2 |
| Order | 98.79 | | 98.31 | 98.58 | 98.74 ±0.4 |
| Class | 96.50 | 97.81 | | 97.06 | 97.79 ±0.9 |
| Phylum | 96.88 | | 98.07 | 98.53 | 98.06 ±0.8 |
| Avg ±std.dev | 95.87 ±2.2 | 97.72 ±1.0 | 98.24 ±0.4 | 97.55 ±1.0 | |
The best performances are highlighted in bold for each taxon
Accuracy (ACC) [%] and computational times (T) [sec] on the order level with different values of K
| Data set | Classifier | K=3 ACC [%] | K=3 T [s] | K=4 ACC [%] | K=4 T [s] | K=5 ACC [%] | K=5 T [s] | K=6 ACC [%] | K=6 T [s] |
|---|---|---|---|---|---|---|---|---|---|
| Original | RIPPER | 64.50 | | 69.82 | 83.53 | 69.76 | 203.53 | 69.92 | 765.34 |
| Original | RIDOR | 61.63 | | 62.25 | 320.72 | 64.19 | 1509.75 | 64.75 | 10320.40 |
| Original | PART | 65.37 | | 67.05 | 24.58 | 67.77 | 70.13 | 70.02 | 280.23 |
| Original | SVM | 70.69 | | 85.37 | 937.32 | 88.59 | 1312.52 | 89.56 | 2020.60 |
| Original | NN | 83.27 | | 85.67 | 12.13 | 86.49 | 19.34 | 87.06 | 114.48 |
| Filtered | RIPPER | 98.79 | | 98.79 | 1.55 | 99.27 | 4.56 | 98.79 | 27.76 |
| Filtered | RIDOR | 96.12 | | 99.27 | 3.05 | 96.36 | 26.16 | 97.33 | 34.31 |
| Filtered | PART | 97.34 | | 98.31 | 1.00 | 97.58 | 2.28 | 97.09 | 23.11 |
| Filtered | SVM | 99.56 | | 99.87 | 11.58 | 99.65 | 13.10 | 99.68 | 14.71 |
| Filtered | NN | 99.45 | | 99.93 | 3.30 | 99.34 | 3.70 | 99.63 | 4.18 |
| Avg | - | 83.67 | | 86.63 | 139.88 | 86.90 | 316.51 | 87.38 | 1360.51 |
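The bottom row of the table appears to be the plain column average over all ten data set and classifier combinations. As a quick arithmetic check, the sketch below recomputes it for the K=4 columns, using the values copied from the table; the variable names are illustrative.

```python
from statistics import mean

# K=4 columns copied from the table above: five "Original" rows, then five "Filtered" rows.
k4_acc = [69.82, 62.25, 67.05, 85.37, 85.67,
          98.79, 99.27, 98.31, 99.87, 99.93]
k4_time = [83.53, 320.72, 24.58, 937.32, 12.13,
           1.55, 3.05, 1.00, 11.58, 3.30]

avg_acc = round(mean(k4_acc), 2)    # average accuracy in percent
avg_time = round(mean(k4_time), 2)  # average time in seconds

print(avg_acc, avg_time)  # prints 86.63 139.88
```

Both results agree with the reported averages for K=4.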
Fig. 2 Accuracy and computational times of RIPPER with respect to increasing values of k on the original data set
A sample of classification rules at the species level extracted by the DMB software. f(W) represents the relative frequency of substring W in a genome, multiplied by 10^5 for readability
| Species | Rule |
|---|---|
| A. baumannii | |
| B. cereus | 384.04≤ |
| B. animalis | 762.28≤ |
| B. longum | |
| B. aphidicola | 57.77≤ |
| C. jejuni | 490.11≤ |
| C. trachomatis | 305.55≤ |
| C. botulinum | 371.77≤ |
| C. diphtheriae | 819.04≤ |
| C. pseudotuberculosis | 875.80≤ |
| E. coli | 710.86≤ |
| F. tularensis | 592.00≤ |
| H. influenzae | 549.73≤ |
| H. pylori | 5.56≤ |
| L. monocytogenes | 411.43≤ |
| M. tuberculosis | 649.71≤ |
| N. meningitidis | 590.29≤ |
| P. marinus | ( ∧117.33≤ |
| S. enterica | 525.98≤ |
| S. aureus | 1082.23≤ |
| S. pneumoniae | 393.10≤ |
| S. pyogenes | 596.06≤ |
| S. suis | 918.25≤ |
| S. islandicus | 218.01≤ |
| Y. pestis | 596.17≤ |
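Rules of this kind are disjunctive-normal-form formulas over interval tests on f(W). The sketch below shows one way such a rule could be represented and evaluated. It assumes half-open intervals (lower ≤ f(W) < upper) and the 10^5 scaling from the caption. The k-mers, bounds, and names in the example are hypothetical placeholders, not rules extracted by DMB.

```python
from typing import Dict, List, Tuple

SCALE = 1e5  # f(W) is the relative frequency multiplied by 10^5

# One literal: lower <= f(kmer) < upper (the half-open interval is an assumption).
Literal = Tuple[str, float, float]
Clause = List[Literal]  # literals joined by AND
Rule = List[Clause]     # clauses joined by OR (disjunctive normal form)


def f(freqs: Dict[str, float], kmer: str) -> float:
    """Scaled relative frequency of a k-mer; absent k-mers count as 0."""
    return freqs.get(kmer, 0.0) * SCALE


def rule_matches(rule: Rule, freqs: Dict[str, float]) -> bool:
    """True if at least one clause has all of its interval tests satisfied."""
    return any(
        all(lower <= f(freqs, kmer) < upper for kmer, lower, upper in clause)
        for clause in rule
    )


# Hypothetical rule: (380 <= f(ACGT) < 500) OR (100 <= f(TTTT) < 200 AND 120 <= f(GGCC) < 300)
example_rule: Rule = [
    [("ACGT", 380.0, 500.0)],
    [("TTTT", 100.0, 200.0), ("GGCC", 120.0, 300.0)],
]

genome_a = {"ACGT": 0.0040}                                  # satisfies the first clause
genome_b = {"ACGT": 0.0010, "TTTT": 0.0015, "GGCC": 0.0020}  # satisfies the second clause
genome_c = {"ACGT": 0.0010}                                  # satisfies neither clause

print(rule_matches(example_rule, genome_a),
      rule_matches(example_rule, genome_b),
      rule_matches(example_rule, genome_c))
```

A classifier built this way would hold one rule per taxon and assign a genome to the taxon whose rule it satisfies; how ties or unmatched genomes are handled is left open here.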
Percent accuracy of the classifiers for each taxonomic unit (10-fold cross validation) on the original data set
| Level | RIPPER | RIDOR | PART | DMB | SVM | NN | Avg ±std.dev |
|---|---|---|---|---|---|---|---|
| Species | - | - | - | - | - | - | - |
| Genus | 54.17 | 47.67 | 50.17 | 48.54 | - | | 45.60 ±24.2 |
| Order | 69.82 | 62.25 | 67.05 | 63.78 | 85.37 | | 72.32 ±10.5 |
| Class | 75.08 | 69.92 | 71.76 | 72.05 | 88.43 | | 77.72 ±8.7 |
| Phylum | 75.85 | 70.99 | 56.77 | 71.45 | 85.93 | | 74.51 ±8.2 |
| Avg ±std.dev | 68.73 ±10.0 | 62.71 ±10.7 | 61.44 ±9.7 | 63.96 ±11 | 64.93 ±43.3 | 67.54 ±14.8 | |
The best performances are highlighted in bold for each taxon