Jie Lin1, Jing Wei1, Donald Adjeroh2, Bing-Hua Jiang3, Yue Jiang4.
Abstract
BACKGROUND: Alignment-free sequence similarity analysis methods often lead to significant savings in computational time over alignment-based counterparts.
Keywords: Complex numbers; Frequency domain; Sequence similarity; Wavelet transform; k-mers
Year: 2018 PMID: 29720081 PMCID: PMC5930706 DOI: 10.1186/s12859-018-2155-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Length 2 k-mers and associated standardized frequencies (Eq. 4)
| k-mers | AA | AC | AG | AT | CA | CC | CG | CT | GA | GC | GG | GT | TA | TC | TG | TT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S1 count | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| S1 standardized frequency | 0.07 | -0.84 | -0.17 | -0.38 | -0.76 | -0.76 | -0.55 | -0.38 | -0.09 | -0.76 | -0.42 | -0.14 | -0.09 | -0.35 | -0.18 | -0.3 |
| S2 count | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| S2 standardized frequency | -0.41 | -1.13 | -0.17 | -0.38 | -1.02 | -0.23 | -0.29 | -0.38 | -0.09 | -0.48 | -0.42 | -0.14 | -0.09 | -0.35 | -0.18 | -0.3 |
| mean | 1.7 | 3.9 | 0.9 | 1.3 | 3.9 | 2.9 | 2.1 | 1.3 | 0.3 | 2.7 | 1.5 | 0.7 | 0.3 | 1.2 | 0.7 | 1.1 |
| sd | 4.14 | 3.45 | 5.17 | 3.45 | 3.84 | 3.84 | 3.84 | 3.45 | 3.45 | 3.55 | 3.55 | 5.07 | 3.45 | 3.45 | 3.89 | 3.71 |
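The standardized values in the table above are consistent with subtracting each k-mer's mean count and dividing by its standard deviation. Eq. 4 itself is not reproduced in this excerpt, so the sketch below assumes that form. The function names `count_kmers` and `standardize` and the example sequence `AACAA` are illustrative assumptions, not taken from the paper.

```python
from itertools import product


def count_kmers(seq, k=2, alphabet="ACGT"):
    """Count overlapping k-mers of seq, keyed in lexicographic k-mer order."""
    counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows with symbols outside the alphabet
            counts[kmer] += 1
    return counts


def standardize(counts, mean, sd):
    """Assumed standardization: (count - mean) / sd for each k-mer with known stats."""
    return {
        kmer: (c - mean[kmer]) / sd[kmer]
        for kmer, c in counts.items()
        if kmer in mean
    }


# Hypothetical sequence whose 2-mer counts match the S1 row (AA=2, AC=1, CA=1).
counts = count_kmers("AACAA")

# Mean and sd for two of the k-mers, copied from the table.
mean = {"AA": 1.7, "AC": 3.9}
sd = {"AA": 4.14, "AC": 3.45}

z = standardize(counts, mean, sd)
print({kmer: round(v, 2) for kmer, v in z.items()})
```

Rounded to two decimals, the two values agree with the S1 entries for AA and AC (0.07 and -0.84).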
Fig. 1 The distribution of 16 k-mers (AA, AC, …, TT) on the unit circle, moving counterclockwise
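One way to realize the placement described in the Fig. 1 caption is to assign each of the 16 2-mers a complex number on the unit circle. The sketch below assumes evenly spaced angles, lexicographic k-mer order, and AA at angle 0. The excerpt does not state the actual angles, so the spacing, the starting point, and the name `kmer_to_complex` are all assumptions.

```python
import cmath
from itertools import product


def kmer_to_complex(k=2, alphabet="ACGT"):
    """Map each k-mer to a point on the unit circle, counterclockwise from angle 0."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    n = len(kmers)
    # Assumption: equal angular spacing of 2*pi/n between consecutive k-mers.
    return {kmer: cmath.exp(2j * cmath.pi * i / n) for i, kmer in enumerate(kmers)}


points = kmer_to_complex()
for kmer in ("AA", "AC", "TT"):
    z = points[kmer]
    print(kmer, round(z.real, 3), round(z.imag, 3))
```

Under these assumptions every k-mer has modulus 1, and a sequence of k-mers becomes a complex-valued signal.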
Confusion matrix
| Actual class | Predicted positive | Predicted negative |
|---|---|---|
| Positive | True positives (TP) | False negatives (FN) |
| Negative | False positives (FP) | True negatives (TN) |
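Accuracy, precision, recall, and F-score, as reported in the tables that follow, are conventionally derived from these four counts. The sketch below uses the standard binary definitions. How the paper aggregates them over multiple classes is not stated in this excerpt, and the function name and example counts are illustrative.

```python
def binary_metrics(tp, fn, fp, tn):
    """Standard metrics computed from binary confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F-score here is F1, the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Made-up counts, for illustration only.
metrics = binary_metrics(tp=40, fn=10, fp=5, tn=45)
print({name: round(value, 4) for name, value in metrics.items()})
```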
Correlations between edit distance (the global alignment identity score) and three methods
| | DNA | | | Protein | | |
|---|---|---|---|---|---|---|
| | SSAW | WFV | | SSAW | WFV | |
| | 0.779 | 0.837 | -0.67 | 0.852 | 0.861 | -0.842 |
| | -0.741 | -0.742 | 0.799 | -0.841 | -0.822 | 0.789 |
Comparison of the clustering results on DNA dataset
| DNA-Data | Model | F-score | Precision | Recall |
|---|---|---|---|---|
| HOG100 | SSAW | 0.6099 | 0.5953 | 0.6648 |
| HOG100 | WFV | 0.5724 | 0.5569 | 0.6227 |
| HOG100 | | 0.5551 | 0.5112 | 0.6073 |
| HOG200 | SSAW | 0.5982 | 0.5841 | 0.6508 |
| HOG200 | WFV | 0.5635 | 0.5610 | 0.6214 |
| HOG200 | | 0.5788 | 0.5364 | 0.6285 |
| HOG300 | SSAW | 0.5961 | 0.5869 | 0.6421 |
| HOG300 | WFV | 0.5359 | 0.5434 | 0.5800 |
| HOG300 | | 0.5466 | 0.5081 | 0.5915 |
Comparison of the classification results on DNA datasets
| DNA-Data | Model | Accuracy | F-score | Precision | Recall |
|---|---|---|---|---|---|
| HOG100 | SSAW | 0.9576 | 0.9315 | 0.9326 | 0.9305 |
| HOG100 | WFV | 0.9574 | 0.9426 | 0.9475 | 0.9447 |
| HOG100 | | 0.9587 | 0.9335 | 0.9472 | 0.9202 |
| HOG200 | SSAW | 0.9548 | 0.9256 | 0.9366 | 0.9149 |
| HOG200 | WFV | 0.9544 | 0.9355 | 0.9430 | 0.9350 |
| HOG200 | | 0.9439 | 0.9320 | 0.9331 | 0.9309 |
| HOG300 | SSAW | 0.9509 | 0.9311 | 0.9354 | 0.9268 |
| HOG300 | WFV | 0.9402 | 0.9208 | 0.9286 | 0.9219 |
| HOG300 | | 0.9328 | 0.9255 | 0.9229 | 0.9282 |
Running time for clustering and classification on DNA datasets. The fold improvement from a given method to the proposed SSAW approach is listed in parentheses
| DNA-Data | Model | Total clustering time | Total classification time |
|---|---|---|---|
| HOG100 | SSAW | 19.8000 | 16.8159 |
| HOG100 | WFV | 55.4619 (3) | 10.4614 |
| HOG100 | | 39.676 (2) | 11.3421 |
| HOG200 | SSAW | 50.9515 | 51.5956 |
| HOG200 | WFV | 238.5061 (5) | 26.8309 |
| HOG200 | | 104.327 (2) | 37.8473 |
| HOG300 | SSAW | 63.9960 | 77.7017 |
| HOG300 | WFV | 640.1409 (10) | 31.4625 |
| HOG300 | | 238.712 (4) | 94.8274 |
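The values in parentheses are consistent with dividing a method's running time by the SSAW running time on the same dataset and rounding to the nearest integer. That rule is inferred from the numbers in the table rather than stated in this excerpt, and `fold_improvement` is a hypothetical helper name.

```python
def fold_improvement(other_time, ssaw_time):
    """Ratio of another method's running time to SSAW's, rounded to an integer."""
    return round(other_time / ssaw_time)


# Clustering times (WFV, SSAW) copied from the DNA running-time table.
clustering = {
    "HOG100": (55.4619, 19.8000),
    "HOG200": (238.5061, 50.9515),
    "HOG300": (640.1409, 63.9960),
}
for dataset, (wfv, ssaw) in clustering.items():
    print(dataset, fold_improvement(wfv, ssaw))
```

Under this assumption the sketch reproduces the folds 3, 5, and 10 listed for WFV clustering.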
Comparison of the clustering results on protein datasets
| Protein-Data | Model | F-score | Precision | Recall |
|---|---|---|---|---|
| HOG100 | SSAW | 0.7651 | 0.7497 | 0.8001 |
| HOG100 | WFV | 0.5874 | 0.5687 | 0.6382 |
| HOG100 | | 0.6604 | 0.642 | 0.6798 |
| HOG200 | SSAW | 0.7746 | 0.7573 | 0.8103 |
| HOG200 | WFV | 0.6410 | 0.6195 | 0.6913 |
| HOG200 | | 0.6435 | 0.5969 | 0.6979 |
| HOG300 | SSAW | 0.7246 | 0.7088 | 0.7653 |
| HOG300 | WFV | 0.5016 | 0.4826 | 0.5551 |
| HOG300 | | 0.6429 | 0.6111 | 0.6782 |
Comparison of the classification results on protein data
| Data | Model | Accuracy | F-score | Precision | Recall |
|---|---|---|---|---|---|
| HOG100 | SSAW | 0.8158 | 0.6274 | 0.6225 | 0.6644 |
| HOG100 | WFV | 0.6741 | 0.5092 | 0.5012 | 0.5518 |
| HOG100 | | 0.8329 | 0.6540 | 0.6248 | 0.6861 |
| HOG200 | SSAW | 0.8222 | 0.5626 | 0.5441 | 0.6174 |
| HOG200 | WFV | 0.7051 | 0.4454 | 0.4359 | 0.4902 |
| HOG200 | | 0.8061 | 0.6279 | 0.5875 | 0.6743 |
| HOG300 | SSAW | 0.8690 | 0.7345 | 0.7466 | 0.7642 |
| HOG300 | WFV | 0.5685 | 0.3468 | 0.3551 | 0.3774 |
| HOG300 | | 0.8098 | 0.6308 | 0.5983 | 0.6670 |
Running time for clustering and classification on protein datasets. The fold improvement from a given method to the proposed SSAW is listed in parentheses
| Protein-data | Models | Total clustering time | Total classification time |
|---|---|---|---|
| HOG100 | SSAW | 0.1638 | 0.1262 |
| HOG100 | WFV | 5.5554 (34) | 0.4164 (3) |
| HOG100 | | 10.964 (67) | 1.3780 (11) |
| HOG200 | SSAW | 0.3542 | 0.2738 |
| HOG200 | WFV | 11.5037 (32) | 0.9362 (3) |
| HOG200 | | 49.016 (138) | 3.091 (11) |
| HOG300 | SSAW | 0.6965 | 0.5077 |
| HOG300 | WFV | 27.2514 (39) | 1.7460 (3) |
| HOG300 | | 126.984 (182) | 5.284 (10) |
Comparison of the clustering results on simulated dataset
| Model | F-score | Precision | Recall |
|---|---|---|---|
| SSAW | 0.8151 | 0.8085 | 0.8467 |
| WFV | 0.8211 | 0.8056 | 0.8587 |
| | 0.8584 | 0.8750 | 0.8425 |
Comparison of the classification results on simulated data
| Model | Accuracy | F-score | Precision | Recall |
|---|---|---|---|---|
| SSAW | 0.9789 | 0.9789 | 0.9804 | 0.9789 |
| WFV | 0.9992 | 0.9992 | 0.9993 | 0.9992 |
| | 0.9607 | 0.9662 | 0.9696 | 0.9628 |
Running time for three methods on clustering and classification using simulated data
| Models | Total clustering time | Total classification time |
|---|---|---|
| SSAW | 0.0632 | 0.0810 |
| WFV | 0.9288 (15) | 0.9313 (11) |
| | 1.123 (18) | 0.172 (2) |
Recommended methods for clustering and classification on the three datasets. A model in parentheses is competitive
| Data | Clustering | Classification |
|---|---|---|
| DNA | SSAW | WFV(SSAW) |
| Protein | SSAW | SSAW |
| Simulated | SSAW(WFV) | SSAW |