Ivan Borozan, Stuart Watt, Vincent Ferretti.
Abstract
MOTIVATION: Alignment-based sequence similarity searches, while accurate for some types of sequences, can produce incorrect results when used on more divergent but functionally related sequences that have undergone the sequence rearrangements observed in many bacterial and viral genomes. Here, we propose a classification model that exploits the complementary nature of alignment-based and alignment-free similarity measures with the aim of improving the accuracy with which DNA and protein sequences are characterized.
Year: 2015 PMID: 25573913 PMCID: PMC4410667 DOI: 10.1093/bioinformatics/btv006
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. A graphical representation of how the CSSS model can improve the classification accuracy of a given test sample. M1 and M2 are two similarity measures and T is a test sample that can be assigned either the class ‘circle’ or the class ‘triangle’, both of which are present in the training set. In (a), according to M1, T is assigned the correct class (i.e. ‘circle’), whereas according to M2, T is assigned the incorrect class (i.e. ‘triangle’). In (b), we show the classification according to the combined unweighted score. In (c), we show the classification according to the combined weighted score.
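The idea in Fig. 1 can be illustrated with a minimal sketch. It assumes that the two measures are available as distance matrices between test and training samples and that they are combined linearly with a single weight `w`; the function name `combined_1nn`, the linear form of the combination and the toy distances are all hypothetical and are not taken from the CSSS model itself.

```python
import numpy as np


def combined_1nn(d1, d2, labels, w=0.5):
    """Label each test sample with the class of its nearest training sample
    under a weighted combination of two distance measures.

    d1, d2 : array-like, shape (n_test, n_train), distances under M1 and M2
    labels : sequence of n_train class labels
    w      : weight on M1 in [0, 1]; w = 0.5 weights both measures equally
             (assumed linear combination, for illustration only)
    """
    d1 = np.asarray(d1, dtype=float)
    d2 = np.asarray(d2, dtype=float)
    combined = w * d1 + (1.0 - w) * d2
    # 1-NN: index of the closest training sample for every test sample
    nearest = combined.argmin(axis=1)
    return [labels[i] for i in nearest]


# Toy data (invented): one test sample T, two training samples.
labels = ["circle", "triangle"]
d1 = [[0.2, 0.6]]  # under M1 the 'circle' sample is nearer
d2 = [[0.7, 0.2]]  # under M2 the 'triangle' sample is nearer

unweighted = combined_1nn(d1, d2, labels, w=0.5)
weighted = combined_1nn(d1, d2, labels, w=0.8)
print(unweighted, weighted)
```

With these invented distances the equally weighted combination still follows M2, whereas putting more weight on M1 recovers ‘circle’. The numbers were chosen only to show that the weighting can change the outcome; they do not reproduce the panels of the figure.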
The classification accuracy [see Equation (16)] for Dataset I obtained with the CSSS (1-NN classifier) and five other models: PhymmBL (Brady and Salzberg, 2009), NBC (Rosen et al., 2011), Kraken (Wood and Salzberg, 2014), RAIphy (Nalbantoglu et al., 2011) and PAUDA (Huson and Xie, 2014), when predicting 147 different viral genera across 266 viral DNA sequences, as a function of the viral fragment length.
| Classifier | Full-length genomes, accuracy (%) | 1000-bp viral fragments, accuracy (%) | 500-bp viral fragments, accuracy (%) | 100-bp viral fragments, accuracy (%) |
|---|---|---|---|---|
| CSSS | 91.43 ± 0.99 | 70.02 ± 2.01 | 63.02 ± 1.49 | 35.94 ± 3.31 |
| PhymmBL | 86.56 ± 2.19 | 68.90 ± 1.78 | 57.28 ± 2.09 | 29.79 ± 1.66 |
| NBC | 74.67 ± 0.64 | 59.06 ± 1.49 | 50.39 ± 2.77 | 34.04 ± 1.53 |
| Kraken | 48.47 ± 1.85 | 26.66 ± 1.94 | 23.07 ± 2.19 | 16.26 ± 1.40 |
| RAIphy | 42.03 ± 1.56 | 30.72 ± 1.66 | 23.97 ± 1.66 | 14.06 ± 1.17 |
| PAUDA | 0.10 ± 0.15 | 6.73 ± 1.40 | 21.22 ± 1.32 | 31.89 ± 2.42 |
The classification accuracy [see Equation (16)] for Dataset II obtained with the CSSS (1-NN classifier) and six other models: PhymmBL (Brady and Salzberg, 2009), PhyloPythiaS (Patil et al., 2011), NBC (Rosen et al., 2011), Kraken (Wood and Salzberg, 2014), RAIphy (Nalbantoglu et al., 2011) and PAUDA (Huson and Xie, 2014), when predicting the phyla for 20 907 reads belonging to the Leptospirillum sp. groups II and III genomes (18 579 reads) and the Ferroplasma acidarmanus genome (2328 reads).
| Classifier | Euryarchaeota accuracy (%) | Nitrospirae accuracy (%) |
|---|---|---|
| CSSS | 87.03 | 96.66 |
| PhymmBL | 81.14 | 97.67 |
| PhyloPythiaS | 72.76 | 95.42 |
| NBC | 16.15 | 82.07 |
| Kraken | 0.26 | 77.14 |
| RAIphy | 1.03 | 66.99 |
| PAUDA | 4.38 | 8.41 |
The classification performance on protein domain sequences (Dataset III) for the CSSS model (1-NN classifier) with k-mer size = 1 (see Section 3), expressed as the integral of the AUC curve shown in Supplementary Figure S2 in the Supplementary Data.
| Similarity/distance measure | Classification method: SVM | Classification method: 1-NN |
|---|---|---|
| SW | 48.66 | 50.22 |
| LZW-BLAST | 49.0 | 37.18 |
| CSSS | NA | 50.64 |
Since Dataset III contains 54 protein families, the maximum value for the integral of the AUC curve is 54, which corresponds to all 54 protein families being classified without error.
a Similarity/distance measures presented in Kocsor .
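Because the maximum of 54 is reached only when every family is classified without error, the tabulated integrals can be put on a 0 to 1 scale by dividing by the number of families. The sketch below does this for the values in the table; reading the result as a mean per-family AUC is an assumption, which holds if the AUC curve gives, for each threshold, the number of families whose AUC reaches that threshold. The names `N_FAMILIES`, `auc_integrals` and `normalized_auc` are hypothetical.

```python
N_FAMILIES = 54  # number of protein families in Dataset III

# Integrals of the AUC curve as reported in the table (NA entry omitted).
auc_integrals = {
    ("SW", "SVM"): 48.66,
    ("SW", "1-NN"): 50.22,
    ("LZW-BLAST", "SVM"): 49.0,
    ("LZW-BLAST", "1-NN"): 37.18,
    ("CSSS", "1-NN"): 50.64,
}


def normalized_auc(integral, n_families=N_FAMILIES):
    """Scale an AUC-curve integral to [0, 1] by its maximum possible value."""
    return integral / n_families


normalized = {key: normalized_auc(value) for key, value in auc_integrals.items()}

for (measure, method), score in normalized.items():
    print(f"{measure:10s} {method:5s} {score:.4f}")
```

On this scale the CSSS 1-NN entry is about 0.94 of the attainable maximum, compared with 0.93 for SW with 1-NN.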