Xiang Gao, Huaiying Lin, Kashi Revanna, Qunfeng Dong.
Abstract
BACKGROUND: Species-level classification of 16S rRNA gene sequences remains a serious challenge for microbiome researchers, because existing taxonomic classification tools for 16S rRNA gene sequences either do not provide species-level classification or produce unreliable results. The unreliable results stem from limitations of the existing methods, which either lack solid probabilistic-based criteria to evaluate the confidence of their taxonomic assignments or use nucleotide k-mer frequency as a proxy for sequence similarity.
Keywords: 16S rRNA gene; Taxonomic classification
Year: 2017 PMID: 28486927 PMCID: PMC5424349 DOI: 10.1186/s12859-017-1670-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The overview of the BLCA algorithm. See main text for details.
Comparison of the classification accuracies using the simulated dataset
| CST | Taxonomic level | Classifier | V2 | V4 | V1V3 | V3V5 | V6V9 |
|---|---|---|---|---|---|---|---|
| 0.8 | Species | BLCA | 0.7594 ± 0.0164* | 0.5331 ± 0.0208 | 0.9323 ± 0.0054* | 0.8335 ± 0.0072* | 0.8690 ± 0.0012* |
| 0.8 | Species | Kraken | 0.7275 ± 0.0054 | 0.5326 ± 0.0181 | 0.8672 ± 0.0072 | 0.7542 ± 0.0087 | 0.7572 ± 0.0056 |
| 0.8 | Species | MEGAN | 0.7290 ± 0.0114 | 0.5238 ± 0.0161 | 0.7071 ± 0.0053 | 0.5206 ± 0.0108 | 0.5227 ± 0.0140 |
| 0.8 | Species | RDP | 0.6102 ± 0.0042 | 0.3928 ± 0.0292 | 0.8549 ± 0.0199 | 0.7307 ± 0.0203 | 0.7823 ± 0.0124 |
| 0.8 | Species | SPINGO | 0.5700 ± 0.0187 | 0.3910 ± 0.0106 | 0.7907 ± 0.0061 | 0.6900 ± 0.0071 | 0.7318 ± 0.0116 |
| 0.8 | Genus | BLCA | 0.9498 ± 0.0019* | 0.8982 ± 0.0107* | 0.9965 ± 0.0012* | 0.9863 ± 0.0011* | 0.9925 ± 0.0012* |
| 0.8 | Genus | Kraken | 0.9072 ± 0.0066 | 0.8612 ± 0.0189 | 0.9691 ± 0.0051 | 0.9463 ± 0.0006 | 0.9437 ± 0.0034 |
| 0.8 | Genus | MEGAN | 0.9334 ± 0.0079 | 0.8830 ± 0.0115 | 0.9528 ± 0.0040 | 0.9002 ± 0.0027 | 0.8939 ± 0.0041 |
| 0.8 | Genus | RDP | 0.8768 ± 0.0065 | 0.8067 ± 0.0139 | 0.9629 ± 0.0072 | 0.9562 ± 0.0065 | 0.9657 ± 0.0042 |
| 0.8 | Genus | SPINGO | 0.8481 ± 0.0002 | 0.7726 ± 0.0077 | 0.9333 ± 0.0057 | 0.9192 ± 0.0034 | 0.9238 ± 0.0067 |
| 0.8 | Family | BLCA | 0.9791 ± 0.0009* | 0.9787 ± 0.0018* | 0.9984 ± 0.0019* | 0.9975 ± 0.0019* | 0.9970 ± 0.0014* |
| 0.8 | Family | Kraken | 0.9594 ± 0.0038 | 0.9480 ± 0.0028 | 0.9882 ± 0.0021 | 0.9850 ± 0.0033 | 0.9799 ± 0.0032 |
| 0.8 | Family | MEGAN | 0.9495 ± 0.0089 | 0.9413 ± 0.0015 | 0.9517 ± 0.0032 | 0.9397 ± 0.0044 | 0.9447 ± 0.0034 |
| 0.8 | Family | RDP | 0.9461 ± 0.0093 | 0.9295 ± 0.0062 | 0.9818 ± 0.0007 | 0.9806 ± 0.0054 | 0.9855 ± 0.0013 |
| 0.8 | Family | SPINGO | NA | NA | NA | NA | NA |
| 0.5 | Species | BLCA | 0.8485 ± 0.0128* | 0.6813 ± 0.0115* | 0.9629 ± 0.0077* | 0.9050 ± 0.0034* | 0.9315 ± 0.0045* |
| 0.5 | Species | Kraken | 0.7275 ± 0.0054 | 0.5326 ± 0.0181 | 0.8672 ± 0.0072 | 0.7542 ± 0.0087 | 0.7572 ± 0.0056 |
| 0.5 | Species | MEGAN | 0.7290 ± 0.0114 | 0.5238 ± 0.0161 | 0.7071 ± 0.0053 | 0.5206 ± 0.0108 | 0.5227 ± 0.0140 |
| 0.5 | Species | RDP | 0.7526 ± 0.0107 | 0.5692 ± 0.0194 | 0.8997 ± 0.0144 | 0.8221 ± 0.0105 | 0.8621 ± 0.0094 |
| 0.5 | Species | SPINGO | 0.6570 ± 0.0124 | 0.5008 ± 0.0114 | 0.8256 ± 0.0038 | 0.7497 ± 0.0041 | 0.7805 ± 0.0021 |
| 0.5 | Genus | BLCA | 0.9722 ± 0.0028* | 0.9467 ± 0.0031* | 0.9985 ± 0.0019* | 0.9947 ± 0.0013* | 0.9972 ± 0.0002* |
| 0.5 | Genus | Kraken | 0.9072 ± 0.0066 | 0.8612 ± 0.0189 | 0.9691 ± 0.0051 | 0.9463 ± 0.0006 | 0.9437 ± 0.0034 |
| 0.5 | Genus | MEGAN | 0.9334 ± 0.0079 | 0.8830 ± 0.0115 | 0.9528 ± 0.0040 | 0.9002 ± 0.0027 | 0.8939 ± 0.0041 |
| 0.5 | Genus | RDP | 0.9319 ± 0.0044 | 0.8960 ± 0.0086 | 0.9710 ± 0.0049 | 0.9693 ± 0.0046 | 0.9729 ± 0.0003 |
| 0.5 | Genus | SPINGO | 0.8807 ± 0.0034 | 0.8354 ± 0.0041 | 0.9400 ± 0.0030 | 0.9287 ± 0.0024 | 0.9317 ± 0.0083 |
| 0.5 | Family | BLCA | 0.9870 ± 0.0013* | 0.9856 ± 0.0035* | 0.9987 ± 0.0021* | 0.9991 ± 0.0012* | 0.9984 ± 0.0019* |
| 0.5 | Family | Kraken | 0.9594 ± 0.0038 | 0.9480 ± 0.0028 | 0.9882 ± 0.0021 | 0.9850 ± 0.0033 | 0.9799 ± 0.0032 |
| 0.5 | Family | MEGAN | 0.9495 ± 0.0089 | 0.9413 ± 0.0015 | 0.9517 ± 0.0032 | 0.9397 ± 0.0044 | 0.9447 ± 0.0034 |
| 0.5 | Family | RDP | 0.9696 ± 0.0040 | 0.9674 ± 0.0015 | 0.9836 ± 0.0017 | 0.9830 ± 0.0033 | 0.9868 ± 0.0004 |
| 0.5 | Family | SPINGO | NA | NA | NA | NA | NA |
Each entry in the table shows the average and standard deviation of the F-scores for a particular classifier (rows) at a specific 16S region (columns), based on three random sets of 1000 test sequences. Two confidence score thresholds (CST), 0.8 and 0.5, were applied for BLCA, RDP Classifier, and SPINGO as described in the main text. An asterisk (*) indicates that the F-scores of BLCA are significantly higher than those of the other software, based on a one-tailed paired t-test with a p-value less than 0.05. Similar statistical significance was also obtained using the one-tailed Wilcoxon signed-rank test. Note that the SPINGO program does not produce family-level classification. In addition, Kraken and MEGAN do not provide any probabilistic-based parameters for evaluating the assigned taxa, so we used their default taxonomic assignments for comparison; their entries are therefore identical under both CST values.
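As a rough illustration of the one-tailed paired t-test mentioned in the note, the sketch below computes a paired t statistic by hand. The table does not report per-replicate F-scores, so the sketch pairs the reported species-level means (CST = 0.8) of BLCA and Kraken across the five 16S regions. That pairing is an assumption made purely for illustration and is not necessarily how the authors paired their samples; the function name `paired_t_statistic` is likewise hypothetical.

```python
import math
from statistics import mean, stdev

# Species-level mean F-scores at CST = 0.8, copied from the table above
# (order: V2, V4, V1V3, V3V5, V6V9). Pairing across regions is an
# illustrative assumption, not the authors' documented procedure.
blca = [0.7594, 0.5331, 0.9323, 0.8335, 0.8690]
kraken = [0.7275, 0.5326, 0.8672, 0.7542, 0.7572]


def paired_t_statistic(a, b):
    """Return the t statistic for the mean of the paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    standard_error = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / standard_error


t_stat = paired_t_statistic(blca, kraken)

# One-tailed critical value of Student's t distribution for
# alpha = 0.05 with n - 1 = 4 degrees of freedom.
T_CRIT_DF4 = 2.132
significant = t_stat > T_CRIT_DF4

print(f"t = {t_stat:.2f}; exceeds one-tailed 0.05 critical value: {significant}")
```

With per-replicate F-scores available, the same function could be applied to the three replicate pairs of a single region instead; the critical value would then have to be the one for 2 degrees of freedom.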
BLCA accuracy is insensitive to the inclusion of dissimilar BLAST hits
| 16S region | topPercent | Genus, BLCA | Genus, MEGAN | Species, BLCA | Species, MEGAN |
|---|---|---|---|---|---|
| V2 | 5% | 0.9539 ± 0.0038 | 0.9531 ± 0.0044 | 0.7747 ± 0.0150 | 0.8091 ± 0.0153 |
| V2 | 10% | 0.9498 ± 0.0019 | 0.9334 ± 0.0079 | 0.7594 ± 0.0164 | 0.7290 ± 0.0114 |
| V2 | 20% | 0.9487 ± 0.0018 | 0.8966 ± 0.0080 | 0.7580 ± 0.0176 | 0.5983 ± 0.0075 |
| V4 | 5% | 0.9078 ± 0.0078 | 0.9230 ± 0.0082 | 0.5597 ± 0.0175 | 0.6497 ± 0.0058 |
| V4 | 10% | 0.8982 ± 0.0107 | 0.8830 ± 0.0115 | 0.5331 ± 0.0208 | 0.5238 ± 0.0161 |
| V4 | 20% | 0.8965 ± 0.0092 | 0.8016 ± 0.0041 | 0.5317 ± 0.0189 | 0.3915 ± 0.0119 |
| V1V3 | 5% | 0.9960 ± 0.0009 | 0.9778 ± 0.0006 | 0.9314 ± 0.0058 | 0.8394 ± 0.0069 |
| V1V3 | 10% | 0.9965 ± 0.0012 | 0.9528 ± 0.0040 | 0.9323 ± 0.0054 | 0.7071 ± 0.0053 |
| V1V3 | 20% | 0.9959 ± 0.0009 | 0.8609 ± 0.0087 | 0.9321 ± 0.0053 | 0.4673 ± 0.0150 |
| V3V5 | 5% | 0.9865 ± 0.0020 | 0.9550 ± 0.0041 | 0.8380 ± 0.0064 | 0.7025 ± 0.0112 |
| V3V5 | 10% | 0.9863 ± 0.0011 | 0.9002 ± 0.0027 | 0.8335 ± 0.0072 | 0.5206 ± 0.0108 |
| V3V5 | 20% | 0.9863 ± 0.0011 | 0.7369 ± 0.0094 | 0.8361 ± 0.0039 | 0.2880 ± 0.0061 |
| V6V9 | 5% | 0.9933 ± 0.0011 | 0.9532 ± 0.0050 | 0.8722 ± 0.0066 | 0.7258 ± 0.0129 |
| V6V9 | 10% | 0.9925 ± 0.0012 | 0.8939 ± 0.0041 | 0.8690 ± 0.0012 | 0.5227 ± 0.0140 |
| V6V9 | 20% | 0.9931 ± 0.0017 | 0.7138 ± 0.0083 | 0.8701 ± 0.0050 | 0.2691 ± 0.0255 |
The parameter topPercent keeps only the BLAST hits whose bit scores are within a given percentage of the bit score of the best BLAST hit. The larger the parameter, the more dissimilar database hits are included in the taxonomic classification of the query sequence. The default value of this parameter in MEGAN is 10%. In our comparisons, we set topPercent to 5, 10, and 20% for both BLCA and MEGAN, the range recommended by the original MEGAN publication, to compare the performance of the two programs under different stringencies of retaining BLAST hits. Each table entry shows the average and standard deviation of the F-scores, based on the confidence score threshold of 0.8, for each tested software at the corresponding 16S region. The F-scores of BLCA are much less sensitive to the value of topPercent than those of MEGAN.
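The topPercent filter can be sketched as follows. This is a minimal illustration, assuming that "within a given percentage of the best hit" means a bit score of at least `best * (1 - topPercent / 100)`. The `BlastHit` type, the `filter_top_percent` function, and the example hits are hypothetical and are not taken from the BLCA or MEGAN source code.

```python
from typing import List, NamedTuple


class BlastHit(NamedTuple):
    subject_id: str   # hypothetical database sequence identifier
    bit_score: float  # BLAST bit score of the alignment


def filter_top_percent(hits: List[BlastHit], top_percent: float) -> List[BlastHit]:
    """Keep hits whose bit score is within top_percent % of the best bit score.

    Assumption: "within" is read as bit_score >= best * (1 - top_percent / 100).
    """
    if not hits:
        return []
    best = max(hit.bit_score for hit in hits)
    cutoff = best * (1.0 - top_percent / 100.0)
    return [hit for hit in hits if hit.bit_score >= cutoff]


# Made-up hits for one query sequence.
hits = [
    BlastHit("refA", 500.0),
    BlastHit("refB", 480.0),
    BlastHit("refC", 455.0),
    BlastHit("refD", 410.0),
]

kept_5 = [hit.subject_id for hit in filter_top_percent(hits, 5)]
kept_10 = [hit.subject_id for hit in filter_top_percent(hits, 10)]
kept_20 = [hit.subject_id for hit in filter_top_percent(hits, 20)]
```

In this made-up example, raising topPercent from 5 to 20 grows the retained set from two hits to all four, which is the effect the table probes: more dissimilar hits enter the classification step as the parameter increases.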
Comparison of the classification accuracies using a real-world dataset
| Taxonomy level | Method | V1V2 region, CST = 0.8 | V1V2 region, CST = 0.5 |
|---|---|---|---|
| Species | BLCA | 0.570 | 0.716 |
| Species | Kraken | 0.589 | 0.589 |
| Species | MEGAN | 0.544 | 0.544 |
| Species | RDP | 0.490 | 0.613 |
| Species | SPINGO | 0.486 | 0.562 |
| Genus | BLCA | 0.729 | 0.79 |
| Genus | Kraken | 0.694 | 0.694 |
| Genus | MEGAN | 0.745 | 0.745 |
| Genus | RDP | 0.643 | 0.708 |
| Genus | SPINGO | 0.605 | 0.650 |
| Family | BLCA | 0.814 | 0.832 |
| Family | Kraken | 0.777 | 0.777 |
| Family | MEGAN | 0.869 | 0.869 |
| Family | RDP | 0.775 | 0.805 |
| Family | SPINGO | NA | NA |
Each entry in the table shows the F-score for a classifier (rows) based on all the OTU sequences in the msd16s dataset, as described in the main text. Two confidence score thresholds (CST), 0.8 and 0.5, the same thresholds as in Table 1, were applied for BLCA, RDP Classifier, and SPINGO. Note that the SPINGO program does not produce family-level classification. In addition, Kraken and MEGAN do not provide any probabilistic-based parameters for evaluating the assigned taxa, so we used their default taxonomic assignments for comparison; their entries are therefore identical under both CST values.
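The effect of a confidence score threshold can be sketched as follows. The exact rule is described in the main text and is not reproduced here, so this sketch assumes that an assignment is accepted only when its confidence score is at least the CST and is otherwise treated as unclassified. The `apply_cst` function, the OTU identifiers, and the scores are all hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical species-level calls: OTU id -> (taxon, confidence score in [0, 1]).
# The scores are made up for illustration only.
calls: Dict[str, Tuple[str, float]] = {
    "otu1": ("Escherichia coli", 0.93),
    "otu2": ("Bacteroides fragilis", 0.64),
    "otu3": ("Prevotella copri", 0.41),
}


def apply_cst(
    calls: Dict[str, Tuple[str, float]], cst: float
) -> Dict[str, Optional[str]]:
    """Keep a taxon only if its confidence score reaches the threshold.

    Assumption: scores below the threshold leave the OTU unclassified (None).
    """
    return {
        otu: (taxon if score >= cst else None)
        for otu, (taxon, score) in calls.items()
    }


strict = apply_cst(calls, 0.8)   # only otu1 keeps its species label
relaxed = apply_cst(calls, 0.5)  # otu1 and otu2 keep their species labels
```

Under this assumption, lowering the CST from 0.8 to 0.5 can only add accepted assignments, never remove them, which is why only the classifiers that report a confidence score change between the two columns.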