Yuan Jiang1, Jun Wang1, Dawen Xia2,3, Guoxian Yu4.
Abstract
Metagenomics brings new discoveries and insights into the uncultured microbial world. One fundamental task in metagenomic analysis is to determine the taxonomy of raw sequence fragments. Modern sequencing technologies produce relatively short fragments in much greater numbers, which makes taxonomic classification considerably more difficult than before. Fast and accurate techniques are therefore needed to classify large-scale fragment sets. We propose EnSVM (Ensemble Support Vector Machine) and an extension, EnSVMB (EnSVM with BLAST), to classify fragments accurately. EnSVM divides fragments into a large confident set and a small diffident set, according to whether linear SVMs trained with different k-mers give consistent or inconsistent predictions for each fragment. An empirical study shows that the sensitivity and specificity of EnSVM on the confident set are higher than 90% and 97%, respectively, whereas on the diffident set they are lower than 60% and 75%. To improve performance on the diffident set, EnSVMB uses the best BLAST hits to reclassify the fragments in that set. Experimental results show that EnSVM divides fragments into confident and diffident sets efficiently and effectively, and that EnSVMB achieves higher accuracy, higher sensitivity and more true positives than related state-of-the-art methods, with specificity comparable to the best of them.
Year: 2017 PMID: 28842700 PMCID: PMC5573435 DOI: 10.1038/s41598-017-09947-y
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
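The abstract says the linear SVMs are trained with different k-mers but does not spell out the featurization. A minimal sketch of one common choice, normalized k-mer frequencies over the ACGT alphabet, is given below; the function name `kmer_profile` and the decision to ignore k-mers containing ambiguous bases are assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter
from itertools import product


def kmer_profile(fragment: str, k: int) -> list:
    """Return normalized k-mer frequencies in lexicographic k-mer order.

    Hypothetical helper: the paper's exact featurization is not given here.
    """
    fragment = fragment.upper()
    # Count every overlapping window of length k.
    counts = Counter(fragment[i:i + k] for i in range(len(fragment) - k + 1))
    # Fixed vocabulary of 4**k k-mers so every fragment maps to the same length.
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing other symbols (e.g. N) are ignored (an assumption).
    total = sum(counts[w] for w in vocab)
    return [counts[w] / total if total else 0.0 for w in vocab]


profile = kmer_profile("ACGTACGT", 2)
```

Each value of k yields a vector of length 4^k, so one classifier per k can be trained on its own feature space and the per-classifier predictions compared afterwards.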
Results of linear SVM on validation set with different k-mers.
| Methods | Accuracy | Sensitivity | Specificity | True positives |
|---|---|---|---|---|
| SVM ( | 86.95% | 85.62% | 86.77% | 72086 |
| SVM ( | 86.34% | 88.17% | 89.21% | 74041 |
| SVM ( | 90.39% | 90.37% | 89.37% | 74912 |
| SVM ( | 90.46% | 90.58% | 89.57% | 74970 |
| SVM ( | 88.96% | 89.21% | 88.43% | 73726 |
| SVM ( | 89.49% | 89.85% | 89.17% | 74166 |
| SVM ( | 83.95% | 84.23% | 85.54% | 69574 |
Accuracy is computed as the ratio between the number of true positives and the number of fragments in the validation set[8].
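The accuracy definition above can be written as a short function. This is a sketch on toy labels; the function name `accuracy` and the example labels are illustrative and do not come from the paper's data.

```python
def accuracy(predicted, actual):
    """Fraction of fragments whose predicted label equals the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    # A "true positive" here is a fragment assigned to its correct taxon.
    true_positives = sum(p == a for p, a in zip(predicted, actual))
    return true_positives / len(actual)


# Toy example with four fragments, three of them classified correctly.
toy_accuracy = accuracy(["A", "B", "C", "A"], ["A", "B", "A", "A"])
```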
Accuracy, sensitivity, specificity, number of true positives and runtime of EnSVM and EnSVMB at different stages. Experiment platform configuration: CentOS 6.5, Intel Xeon E5-2678v3 and 256 GB RAM.
| Stage | Accuracy | Sensitivity | Specificity | True positives | Runtime |
|---|---|---|---|---|---|
| EnSVM (confident set 71496 fragments) | 95.12% | 95.47% | 97.10% | 68886 | 5 min 21 s |
| EnSVM (diffident set 11380 fragments) | 54.95% | 60.37% | 50.23% | 6253 | 5 min 21 s |
| EnSVMB (diffident set 11380 fragments) | 88.85% | 87.55% | 98.12% | 8272 | 6 min 33 s |
| EnSVMB (validation set 82876 fragments) | 95.32% | 94.06% | 97.76% | 77158 | 11 min 54 s |
Row 1 gives the results of EnSVM on the confident set (71496 fragments). Row 2 gives the results of EnSVM on the diffident set (11380 fragments). Row 3 gives the results of EnSVMB on the diffident set, with BLAST run in parallel on 6 CPU cores. Row 4 gives the prediction results of EnSVMB on the whole validation set.
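The last row of the table combines the two stages. The short check below, using only figures from the table, shows how the fragment counts, true positives and runtimes of the confident-set stage and the BLAST stage add up to the whole-validation-set row; the variable and function names are illustrative.

```python
# Figures copied from the stage table above.
confident = {"fragments": 71496, "true_positives": 68886}
diffident_after_blast = {"fragments": 11380, "true_positives": 8272}


def to_seconds(minutes, seconds):
    """Convert a 'min s' runtime into seconds."""
    return 60 * minutes + seconds


total_fragments = confident["fragments"] + diffident_after_blast["fragments"]
total_true_positives = (
    confident["true_positives"] + diffident_after_blast["true_positives"]
)
# EnSVM stage (5 min 21 s) followed by the BLAST stage (6 min 33 s).
total_runtime = divmod(to_seconds(5, 21) + to_seconds(6, 33), 60)
```

The sums reproduce the 82876 fragments, 77158 true positives and 11 min 54 s reported for EnSVMB on the validation set.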
Results on a medium metagenomics dataset at the species level.
| Methods | Accuracy | Sensitivity | Specificity | True positives | Training time | Prediction time |
|---|---|---|---|---|---|---|
| VW | 85.24% | 84.63% | 90.11% | 201146 | 401 min 09 s | 22 min 45 s |
| NBC | 75.45% | 76.52% | 83.54% | 203405 | 1 min 52 s | 20 s |
| Kraken | 84.33% | 80.03% | 95.60% | 227344 | — | 1 min 32 s |
| BLAST (blastn) | 83.71% | 82.81% | | 225673 | — | 1773 min 36 s |
| BWA | 81.57% | 78.80% | | 213204 | — | 38 min 14 s |
| EnSVM on confident set (175985) | 97.83% | 94.14% | 97.12% | 175985 | 51 min 3 s | 10 min 20 s |
| EnSVM on diffident set (93604) | 64.09% | 67.31% | 75.29% | 59991 | 51 min 3 s | 10 min 20 s |
| EnSVM | 86.12% | 84.44% | 90.52% | 232170 | 51 min 3 s | 10 min 20 s |
| EnSVMB | | | | | 51 min 3 s | 35 min 14 s |
Results on a medium metagenomics dataset at the phylum level.
| Methods | Accuracy | Sensitivity | Specificity | True positives |
|---|---|---|---|---|
| VW | 85.12% | 86.11% | 92.42% | 229474 |
| NBC | 79.32% | 76.64% | 84.65% | 213838 |
| BLAST (blastn) | 85.44% | 86.65% | | 230337 |
| BWA | 84.23% | 82.36% | | 227074 |
| Kraken | 86.71% | 89.36% | 98.63% | 233760 |
| EnSVM | 87.78% | 85.54% | 90.04% | 236645 |
| EnSVMB | | | | |
Figure 1. The performance of six methods under different fragment lengths. EnSVMB(vote = 3), EnSVMB(vote = 4) and EnSVMB(vote = 5) mean that the voting threshold of EnSVMB is set to 3, 4 and 5, respectively.
Figure 2. The performance of six methods on the large-scale dataset.
Results on a simulated metagenomics dataset.
| Methods | Accuracy | Sensitivity | Specificity | True positives |
|---|---|---|---|---|
| VW | 84.23% | 83.79% | 89.16% | 198763 |
| NBC | 75.17% | 76.29% | 83.25% | 202650 |
| BLAST (blastn) | 83.65% | 82.73% | | 225511 |
| BWA | 81.03% | 78.34% | | 218448 |
| Kraken | 84.25% | 79.98% | 95.52% | 227129 |
| EnSVM | 85.82% | 84.06% | 90.21% | 231361 |
| EnSVMB | | | | |
Figure 3. Abundance profiles identified by BWA, BLAST, EnSVMB and NBC. ‘Providers’ means that the abundance profiles are taken from EBI (https://www.ebi.ac.uk/metagenomics/).
Details of the small dataset.
| Species | Number of genome sequences in the reference set |
|---|---|
| Corynebacterium diphtheriae | 12 |
| Brucella abortus | 6 |
| Methylobacterium extorquens | 7 |
| Lactobacillus rhamnosus | 5 |
| Erwinia amylovora | 3 |
| Shigella boydii | 6 |
| Desulfovibrio vulgaris | 5 |
| Bacteroides fragilis | 3 |
Figure 4. Five linear SVMs are integrated into an ensemble classifier (EnSVM). EnSVM then divides the fragments in the validation set into the confident and diffident sets based on the aggregated predictions from these SVMs. The voting threshold (labeled as vote) is adjustable. EnSVMB further applies BLAST to reclassify the fragments in the diffident set, and tags as unknown any fragment that cannot be retrieved from the reference set with a confident e-value.
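The workflow in Figure 4 can be sketched as two small functions. This is a hedged illustration, not the authors' implementation: it assumes a fragment is confident when its most frequent label receives at least `vote` of the SVM votes, and it stands in for BLAST with a hypothetical `best_hit` callable that returns a `(label, e_value)` pair or `None`. The names `split_by_votes`, `reclassify` and `max_evalue` are likewise invented here.

```python
from collections import Counter


def split_by_votes(votes_per_fragment, vote):
    """Split fragments into a confident dict and a diffident list.

    Assumption: a fragment is confident if its top label gets >= `vote` votes.
    """
    confident, diffident = {}, []
    for frag_id, votes in votes_per_fragment.items():
        label, count = Counter(votes).most_common(1)[0]
        if count >= vote:
            confident[frag_id] = label
        else:
            diffident.append(frag_id)
    return confident, diffident


def reclassify(diffident, best_hit, max_evalue):
    """Relabel diffident fragments from a best-hit lookup, else 'unknown'."""
    result = {}
    for frag_id in diffident:
        hit = best_hit(frag_id)  # hypothetical stand-in for a BLAST best hit
        if hit is not None and hit[1] <= max_evalue:
            result[frag_id] = hit[0]
        else:
            result[frag_id] = "unknown"
    return result


# Toy votes from five classifiers for four fragments.
votes = {
    "f1": ["A", "A", "A", "A", "A"],
    "f2": ["A", "B", "A", "C", "B"],
    "f3": ["B", "B", "B", "B", "A"],
    "f4": ["C", "A", "B", "C", "A"],
}
confident, diffident = split_by_votes(votes, vote=4)

# Toy best hits: f2 has a strong hit, f4 has none.
toy_hits = {"f2": ("B", 1e-20)}
rescued = reclassify(diffident, toy_hits.get, max_evalue=1e-5)
```

Raising `vote` shrinks the confident set and sends more fragments to the slower BLAST stage, which is the trade-off the adjustable threshold controls.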