Gail L Rosen, Robi Polikar, Diamantino A Caseiro, Steven D Essinger, Bahrad A Sokhansanj.
Abstract
High-throughput sequencing technologies enable metagenome profiling, the simultaneous sequencing of multiple microbial species present within an environmental sample. Since metagenomic data include sequence fragments ("reads") from organisms that are absent from any database, new algorithms must be developed for the identification and annotation of novel sequence fragments. Homology-based techniques have been modified to detect novel species and genera, but composition-based methods have not been adapted. We develop a detection technique that can discriminate between "known" and "unknown" taxa and that can be used with composition-based methods as well as a hybrid method. Unlike previous studies, we rigorously evaluate all algorithms for their ability to detect novel taxa. First, we show that the integration of a detector with a composition-based method performs significantly better than homology-based methods for the detection of novel species and genera, with best performance at finer taxonomic resolutions. Most importantly, we evaluate all the algorithms by introducing an "unknown" class and show that the modified version of PhymmBL has similar or better overall classification performance than the other modified algorithms, especially at the species level and for ultrashort reads. Finally, we evaluate the performance of several algorithms on a real acid mine drainage dataset.
Year: 2011 PMID: 21541181 PMCID: PMC3085467 DOI: 10.1155/2011/495849
Source DB: PubMed Journal: J Biomed Biotechnol ISSN: 1110-7243
Figure 1. The datasets used in this paper are composed of a classifier training set/database, novel genomes used to train the detector, and a separate novel-genome test set. The blue areas represent the percentage of genomes that have “known” genera/species; the green areas represent the percentage of genomes that are “known” at the genus level but “unknown” at the species level; the red areas represent the percentage of genomes that are “unknown” at both the species and genus levels.
Figure 2. The ability of BLAST to discern the correct strain genome (red dashed), the correct species genome (green dashed), and the correct genus label (blue) for 25 bp reads: 63,500 known reads (100 randomly selected reads from each of 635 known genomes) plus 10,200 unknown reads (100 randomly selected reads from each of 102 novel genomes). The ROC curves compare BLAST's bit scores against a varying threshold. The plot demonstrates that BLAST predicts most “known” genomes correctly at the optimal operating point but detects “unknown” genomes poorly. For strain-level detection, the area under the curve (AUC) is 60.1%, with the best threshold yielding a sensitivity of 99.8% and a specificity of 20.4%. For species-level detection, the AUC is 65%, with a sensitivity of 99.1% and a specificity of 34.7%. For genus-level detection, the AUC is 78.9%, with the best threshold yielding a sensitivity of 98.6% and a specificity of 59.3%. The red line represents the 50% chance line.
Figure 3. Comparison of ROC curves using the likelihood-based NBC scores. The AUC metric shows that the detector performs best on 500 bp reads (and only slightly worse on 25 bp reads) at both the species and genus levels, using 100 reads from each of the 635 known and 102 unknown training genomes.
Figure 4. The PhymmBL ROCs follow a similar shape to the BLAST ROCs but have significantly better optimal operating points. However, the AUC of each curve falls below that of the NBC curves.
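The ROC curves in Figures 2 to 4 are obtained by comparing a per-read score against a varying threshold. The following is a minimal sketch of that procedure, not the authors' code: it assumes that a higher score means "more likely known", that a read is accepted as “known” when its score meets the threshold, and that AUC is computed with the trapezoidal rule. The function names and the toy scores are hypothetical.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float],
               is_known: Sequence[bool]) -> List[Tuple[float, float]]:
    """Sweep a threshold over the scores, highest first.

    A read is accepted as "known" when its score is >= the threshold.
    Returns (1 - specificity, sensitivity) points, starting at (0, 0).
    """
    if len(scores) != len(is_known):
        raise ValueError("scores and labels must have the same length")
    n_known = sum(1 for k in is_known if k)
    n_unknown = len(is_known) - n_known
    if n_known == 0 or n_unknown == 0:
        raise ValueError("need at least one known and one unknown read")

    ranked = sorted(zip(scores, is_known), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    accepted_known = accepted_unknown = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Reads with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                accepted_known += 1
            else:
                accepted_unknown += 1
            i += 1
        points.append((accepted_unknown / n_unknown, accepted_known / n_known))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data (hypothetical): four reads, two from known genomes.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_is_known = [True, False, True, False]
print(auc(roc_points(toy_scores, toy_is_known)))  # 0.75
```

An AUC of 0.5 corresponds to the chance line, and 1.0 to scores that separate known from unknown reads perfectly.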
Sensitivity, specificity, and (detector) accuracy rates of detectors for accepting/rejecting reads as “known” from a 275-strain test set. Using 5-fold cross-validation, the maximum standard deviation is 1%. If all fragments were rejected, the species level would obtain 66% accuracy and the genus level 37% accuracy, and PhymmBL+Detector achieves 15–30% above this baseline. SOrt-ITEMS did not classify any fragment below the genus level, so N/A is designated for the species level. WebCarma, using 500 bp fragments, achieved 20.1% sensitivity, 86.9% specificity, and 54% detector accuracy at the species level, and 23% sensitivity, 85% specificity, and 40.3% detector accuracy at the genus level. WebCarma classified only about 10,000 of the 27,500 reads. Due to its poor performance, we did not include it in the table.
| Fragment length | Species sensitivity | Species specificity | Species accuracy | Genus sensitivity | Genus specificity | Genus accuracy |
|---|---|---|---|---|---|---|
| NBC detector | | | | | | |
| 500 bp | 53.7% | 96.3% | 81.9% | 32.9% | 99.9% | 58.0% |
| 100 bp | 62.2% | 95.5% | 84.3% | 39.3% | 99.5% | 61.8% |
| 25 bp | 77.4% | 89.6% | 85.5% | 61.7% | 76.6% | 67.3% |
| PhymmBL detector | | | | | | |
| 500 bp | 84.0% | 88.3% | 86.8% | 58.5% | 97.4% | 73.0% |
| 100 bp | 79.9% | 92.0% | 87.9% | 52.5% | 98.3% | 69.6% |
| 25 bp | 77.2% | 86.8% | 83.5% | 51.2% | 92.6% | 66.7% |
| MEGAN as a detector | | | | | | |
| 500 bp | 83.3% | 60.0% | 68.1% | 76.6% | 66.5% | 72.8% |
| 100 bp | 79.5% | 71.4% | 74.2% | 66.9% | 76.8% | 70.6% |
| 25 bp | 71.0% | 74.5% | 73.2% | 55.3% | 73.4% | 62.1% |
| SOrt-ITEMS as a detector | | | | | | |
| 500 bp | N/A | N/A | N/A | 57.1% | 96.5% | 71.2% |
| 100 bp | N/A | N/A | N/A | 44.8% | 97.9% | 64.5% |
| 25 bp | N/A | N/A | N/A | 6.1% | 98.7% | 40.5% |
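The detector accuracy in this table can be related to sensitivity and specificity through the fraction of “known” reads in the test set. The sketch below is an illustration under stated assumptions rather than the authors' computation: it treats “known” as the positive class and infers the known fraction from the reject-all baselines in the caption (66% at the species level implies roughly 34% known reads). The function name is hypothetical.

```python
def detector_accuracy(sensitivity: float, specificity: float,
                      known_fraction: float) -> float:
    """Fraction of reads correctly accepted as known or rejected as unknown.

    sensitivity:    share of known reads that are accepted as known
    specificity:    share of unknown reads that are rejected
    known_fraction: share of all test reads that are known (assumed input)
    """
    return sensitivity * known_fraction + specificity * (1.0 - known_fraction)


# Assumption: about 34% of reads are known at the species level,
# inferred from the 66% reject-all baseline quoted in the caption.
SPECIES_KNOWN_FRACTION = 0.34

# Rejecting every read gives sensitivity 0 and specificity 1.
print(f"{detector_accuracy(0.0, 1.0, SPECIES_KNOWN_FRACTION):.1%}")      # 66.0%

# 500 bp, species level, using the table's sensitivity and specificity.
print(f"{detector_accuracy(0.537, 0.963, SPECIES_KNOWN_FRACTION):.1%}")  # 81.8%
print(f"{detector_accuracy(0.840, 0.883, SPECIES_KNOWN_FRACTION):.1%}")  # 86.8%
```

The last two values agree with the tabulated NBC (81.9%) and PhymmBL (86.8%) accuracies to within the rounding of the inputs, which supports the assumed class convention.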
Comparison of overall classification accuracies (the number of reads identified as “known” that are classified into their correct class, plus the number of unknowns that are correctly rejected, divided by the number of all reads) on the 275-strain test set. Using 5-fold cross-validation, the maximum standard deviation is 1%. NBC, BLAST, and PhymmBL, in their native form, cannot detect “unknown” classes, while the methods combined with a detector can. Performance is also compared to MEGAN and SOrt-ITEMS accuracy. N/A is designated for the species level for SOrt-ITEMS since it did not classify anything below the genus level. SOrt-ITEMS obtains the best performance for 500 bp reads at the genus level, but its margin is within the 1% standard deviation and so is not statistically significant. WebCarma was not included because its overall performance for 500 bp reads was 50% at the species level and 37% at the genus level. Note that the overall classification performance increases dramatically when a detector is added to NBC and PhymmBL.
| Fragment length | NBC | BLAST | PhymmBL | MEGAN | SOrt-ITEMS | NBC + detector | PhymmBL + detector |
|---|---|---|---|---|---|---|---|
| Species | | | | | | | |
| 500 bp | 27.5% | 28.1% | 28.0% | 63.2% | N/A | 78.0% | 78.6% |
| 100 bp | 25.3% | 26.1% | 26.9% | 69.4% | N/A | 78.3% | 81.1% |
| 25 bp | 20.9% | 22.8% | 23.5% | 68.1% | N/A | 74.7% | 73.6% |
| Genus | | | | | | | |
| 500 bp | 43.4% | 49.2% | 51.4% | 68.8% | 71.0% | 53.6% | 70.8% |
| 100 bp | 37.6% | 42.8% | 44.4% | 66.5% | 64.0% | 54.9% | 67.4% |
| 25 bp | 30.0% | 32.7% | 33.5% | 54.8% | 40.1% | 45.3% | 60.3% |
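The overall classification accuracy defined in the caption can be written out directly. This is a minimal sketch on hypothetical toy labels, not the evaluation code used for the table; the `UNKNOWN` sentinel and the function name are assumptions made for illustration.

```python
from typing import Sequence

UNKNOWN = "unknown"  # hypothetical sentinel for taxa absent from the database


def overall_accuracy(true_labels: Sequence[str],
                     predicted_labels: Sequence[str]) -> float:
    """(known reads placed in their correct class
        + unknown reads correctly rejected) / all reads."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(true_labels, predicted_labels))
    correct_known = sum(1 for t, p in pairs if t != UNKNOWN and p == t)
    correct_rejected = sum(1 for t, p in pairs if t == UNKNOWN and p == UNKNOWN)
    return (correct_known + correct_rejected) / len(pairs)


# Toy example: three reads from known taxa, three from unknown taxa.
truth = ["A", "A", "B", UNKNOWN, UNKNOWN, UNKNOWN]

# A native classifier must assign every read to some database class,
# so it can never be right on the unknown reads.
native = ["A", "B", "B", "A", "B", "A"]

# With a detector, reads may be rejected as unknown.
with_detector = ["A", UNKNOWN, "B", UNKNOWN, UNKNOWN, "A"]

print(overall_accuracy(truth, native))         # 2 of 6 reads correct
print(overall_accuracy(truth, with_detector))  # 4 of 6 reads correct
```

Because a classifier without rejection is wrong on every unknown read, its overall accuracy cannot exceed the fraction of known reads, which is consistent with the low native NBC, BLAST, and PhymmBL values in the table.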
PhymmBL and NBC detector accuracy rates versus coding/noncoding reads (coding includes full and partial coding regions).
| PhymmBL | NBC | |||||
|---|---|---|---|---|---|---|
| Method | All | Coding | Noncoding | All | Coding | Noncoding |
| Species | 87.5% | 79.4% | 82.8% | 82.7% | ||
| Genus | 72.6% | 68.2% | 57.1% | 56.9% | ||
The table shows the eight most abundant genera, by number of reads that passed the genus-resolution detectors, for the red Soudan acid mine drainage dataset, using the 635-genome training database.
| NBC detector | | PhymmBL detector | | SOrt-ITEMS | |
|---|---|---|---|---|---|
| Organism | Matched reads | Organism | Matched reads | Organism | Matched reads |
| Marinobacter | 40 | Dinoroseobacter | 101 | Marinobacter | 476 |
| Dinoroseobacter | 24 | Marinobacter | 73 | Gramella | 388 |
| Rhodobacter | 23 | Ruegeria | 59 | Dinoroseobacter | 297 |
| Shewanella | 20 | Rhodobacter | 41 | Rhodobacter | 264 |
| Ruegeria | 19 | Shewanella | 41 | Flavobacterium | 161 |
| Paracoccus | 9 | Pseudomonas | 26 | Pseudomonas | 131 |
| Desulfotalea | 4 | Bacillus | 21 | Alkalilimnicola | 111 |
| Bartonella | 4 | Clostridium | 21 | Roseobacter | 101 |
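A ranking like the one above can be produced by tallying the genus label of each read that passes the detector. The following sketch assumes a hypothetical list of per-read assignments in which rejected reads are recorded as `None`; the data are illustrative and are not taken from the dataset.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def top_genera(assignments: Sequence[Optional[str]],
               n: int = 8) -> List[Tuple[str, int]]:
    """Count reads per genus, skipping rejected reads (None),
    and return the n most abundant genera with their read counts."""
    counts = Counter(genus for genus in assignments if genus is not None)
    return counts.most_common(n)


# Hypothetical per-read output: genus label, or None if the read was rejected.
toy_assignments = ["Marinobacter", None, "Rhodobacter", "Marinobacter", None]
print(top_genera(toy_assignments, n=2))
```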
(a)
| PhymmBL | |
|---|---|
| Organism | Matched reads |
| Gramella forsetii | 4102 |
| Marinobacter hydrocarbonoclasticus | 3885 |
| Flavobacterium johnsoniae | 3480 |
| Dinoroseobacter shibae | 3402 |
| Ruegeria pomeroyi | 3119 |
| Polaromonas naphthalenivorans | 3116 |
| Aeromonas salmonicida | 2899 |
| Rhodobacter sphaeroides | 2616 |
| Rhizobium leguminosarum | 2541 |
| Paracoccus denitrificans | 2533 |
(b)
| NBC detector | | PhymmBL detector | |
|---|---|---|---|
| Organism | Matched reads | Organism | Matched reads |
| Marinobacter hydrocarbonoclasticus | 31 | Dinoroseobacter shibae | 85 |
| Dinoroseobacter shibae | 18 | Marinobacter hydrocarbonoclasticus | 62 |
| Ruegeria sp. TM1040 | 17 | Rhodobacter sphaeroides | 24 |
| Rhodobacter sphaeroides | 15 | Ruegeria pomeroyi | 24 |
| Shewanella sp. ANA-3 | 11 | Ruegeria sp. TM1040 | 22 |
| Shewanella baltica | 5 | Paracoccus denitrificans | 20 |
| Desulfotalea psychrophila | 4 | Shewanella baltica | 17 |
| Paracoccus denitrificans | 4 | Shewanella sp. ANA-3 | 14 |