Anders Lanzén, Steffen L Jørgensen, Daniel H Huson, Markus Gorfer, Svenn Helge Grindhaug, Inge Jonassen, Lise Øvreås, Tim Urich.
Abstract
Sequencing of taxonomic or phylogenetic markers is becoming a fast and efficient method for studying environmental microbial communities. This has resulted in a steadily growing collection of marker sequences, most notably of the small-subunit (SSU) ribosomal RNA gene, and an increased understanding of microbial phylogeny, diversity and community composition patterns. However, to utilize these large datasets together with new sequencing technologies, a reliable and flexible system for taxonomic classification is critical. We developed CREST (Classification Resources for Environmental Sequence Tags), a set of resources and tools for generating and utilizing custom taxonomies and reference datasets for classification of environmental sequences. CREST uses an alignment-based classification method with the lowest common ancestor algorithm. It also uses explicit rank similarity criteria to reduce false positives and identify novel taxa. We implemented this method in a web server, a command line tool and the graphical user interface program MEGAN. Further, we provide the SSU rRNA reference database and taxonomy SilvaMod, derived from the publicly available SILVA SSURef, for classification of sequences from bacteria, archaea and eukaryotes. Using cross-validation and environmental datasets, we compared the performance of CREST and SilvaMod to the RDP Classifier. We also utilized Greengenes as a reference database, both with CREST and the RDP Classifier. These analyses indicate that CREST performs better than alignment-free methods, with a higher recall rate (sensitivity) as well as precision, and with the ability to accurately identify most sequences from novel taxa. Classification using SilvaMod performed better than with Greengenes, particularly when applied to environmental sequences. CREST is freely available under a GNU General Public License (v3) from http://apps.cbu.uib.no/crest and http://lcaclassifier.googlecode.com.
Year: 2012 PMID: 23145153 PMCID: PMC3493522 DOI: 10.1371/journal.pone.0049334
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Overview of the resources of CREST.
The flow of information during the construction of a new reference database (top part) or classification (bottom part) is represented by arrows. The classification tools MEGAN or LCAClassifier can utilize CREST taxonomy files and databases such as SilvaMod for classification of environmental sequences, aligned to the reference database with Megablast.
Assignment accuracy from ten-fold cross validation.
| Method | Training/Reference set | Fragment length | Accuracy: Genus | Accuracy: Family | Accuracy: Phylum |
| LCA | SilvaMod | F.L. |  | 92% |  |
| LCA | SilvaMod | 450 bp | 62% | 88% |  |
| LCA | SilvaMod | 100 bp |  | 61% |  |
| LCA | Greengenes | F.L. | 69% | 94% | 99% |
| LCA | Greengenes | 450 bp | 48% | 87% | 99% |
| LCA | Greengenes | 100 bp | 33% |  |  |
| RDP | Greengenes | F.L. | – |  | 98% |
| RDP | Greengenes | 450 bp | – |  | 95% |
| RDP | Greengenes | 100 bp | – | 49% | 51% |
| RDP | RDP v6 | F.L. | 81% | 95% | 99% |
| RDP | RDP v6 | 450 bp |  | 92% | 98% |
| RDP | RDP v6 | 100 bp | 35% | 56% | 90% |
Assignment accuracy is defined as the number of correct assignments divided by the total number of sequences tested, and is given at three different ranks. The best values for each combination of rank and fragment length are indicated in bold.
LCA: Classification using Megablast alignments and the CREST LCAClassifier within a 2% LCA range of the highest bitscore as well as percent similarity filters.
RDP: Naïve Bayes classification using the RDP Classifier with a bootstrap confidence cutoff of 0.8. With the Greengenes training set, the RDP Classifier was run via the QIIME script assign_taxonomy, which does not classify sequences beyond the family level.
F.L.: Un-cropped full-length sequences from the reference or training dataset.
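The LCA step named in the footnotes above can be illustrated with a minimal Python sketch. It is not the CREST LCAClassifier code: the hit tuple layout, the function name `lca_assign`, and the per-depth identity thresholds in `MIN_IDENTITY_BY_DEPTH` are all assumptions made for illustration. Only the 2% range relative to the highest bitscore and the existence of percent similarity filters come from the text.

```python
from typing import List, Tuple

# A hit is (bitscore, percent_identity, lineage); lineage is ordered root -> leaf.
Hit = Tuple[float, float, List[str]]

# Hypothetical minimum percent identity required to keep an assignment at a
# given lineage depth (index 0 = domain). Illustrative values only, not the
# thresholds used by CREST.
MIN_IDENTITY_BY_DEPTH = [0.0, 80.0, 85.0, 90.0, 95.0, 97.0]


def lca_assign(hits: List[Hit], lca_range: float = 0.02) -> List[str]:
    """Return the lineage shared by all hits within lca_range of the top bitscore."""
    if not hits:
        return []
    best_score = max(score for score, _, _ in hits)
    best_identity = max(ident for score, ident, _ in hits if score == best_score)
    # Keep only hits whose bitscore is within the LCA range of the best hit.
    kept = [lin for score, _, lin in hits if score >= best_score * (1.0 - lca_range)]

    # Lowest common ancestor = longest common prefix of the kept lineages.
    common: List[str] = []
    for level in zip(*kept):
        if len(set(level)) != 1:
            break
        common.append(level[0])

    # Similarity filter: move up the lineage while the current depth demands
    # a higher identity than the best hit provides.
    last = len(MIN_IDENTITY_BY_DEPTH)
    while common and best_identity < MIN_IDENTITY_BY_DEPTH[min(len(common), last) - 1]:
        common.pop()
    return common


entero = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
          "Enterobacterales", "Enterobacteriaceae"]
hits = [
    (500.0, 96.0, entero + ["Escherichia"]),
    (495.0, 95.5, entero + ["Salmonella"]),
    (400.0, 90.0, ["Bacteria", "Firmicutes", "Bacilli"]),
]
print(lca_assign(hits))
```

In this toy input the third hit falls outside the 2% range, the two remaining hits disagree at the deepest level, and the read is therefore assigned to their shared family-level lineage.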
Figure 2. Precision-recall curves from ten-fold cross validation.
The plot shows precision (number of correct assignments / number of assignments made) on the y-axis and measured recall (sensitivity, or true positive rate) on the x-axis as the LCA range or confidence cutoff is varied. Circles indicate the default cutoffs (cutoff for RDP = 0.8, LCA range = 2%).
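As a worked illustration of these definitions, the following Python sketch computes accuracy (correct assignments over all sequences tested) and precision (correct assignments over assignments made) for one rank. The function name `evaluate` and the dictionary-based input format are assumptions. Recall is computed here as correct assignments over all sequences with a known label, which is an assumption and may differ from how recall was measured for the figure.

```python
from typing import Dict, Optional


def evaluate(predicted: Dict[str, Optional[str]], truth: Dict[str, str]) -> Dict[str, float]:
    """Score predictions at a single rank; None or a missing key means 'unassigned'."""
    total = len(truth)
    made = sum(1 for seq in truth if predicted.get(seq) is not None)
    correct = sum(1 for seq, taxon in truth.items() if predicted.get(seq) == taxon)
    return {
        # Accuracy as defined in the table notes: correct / sequences tested.
        "accuracy": correct / total if total else 0.0,
        # Precision as defined in the caption: correct / assignments made.
        "precision": correct / made if made else 0.0,
        # ASSUMPTION: recall taken as correct / sequences with a known label.
        "recall": correct / total if total else 0.0,
    }


truth = {"s1": "A", "s2": "B", "s3": "C", "s4": "D"}
predicted = {"s1": "A", "s2": "X", "s3": None}  # s4 has no prediction at all
scores = evaluate(predicted, truth)
print(scores)
```

Repeating this calculation while sweeping the LCA range or the bootstrap confidence cutoff would give the points of a precision-recall curve.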
Assignment accuracy from removal-of-taxa cross validation.
| Method | Training/Reference set | Fragment length | Accuracy: Genera | Accuracy: Families | Accuracy: Phyla |
| LCA | SilvaMod | F.L. |  | 90% | 7% |
| LCA | SilvaMod | 450 bp | 77% | 64% | 27% |
| LCA | SilvaMod | 100 bp | 81% | 66% | 76% |
| LCA | Greengenes | F.L. | 85% |  | 37% |
| LCA | Greengenes | 450 bp |  |  | 24% |
| LCA | Greengenes | 100 bp | 87% | 72% | 71% |
| RDP | Greengenes | F.L. | – | 57% |  |
| RDP | Greengenes | 450 bp | – | 83% |  |
| RDP | Greengenes | 100 bp | – |  |  |
| RDP | RDP v6 | F.L. | 62% | 62% | 21% |
| RDP | RDP v6 | 450 bp | 75% | 78% | 89% |
| RDP | RDP v6 | 100 bp |  | 92% | 96% |
Accuracy is defined as the number of correct assignments divided by the total number of sequences tested, and is given at three different ranks. The best values for each combination of rank and fragment length are indicated in bold.
LCA: Classification using Megablast alignments and the CREST LCAClassifier within a 2% LCA range of the highest bitscore as well as percent similarity filters.
RDP: Naïve Bayes classification using the RDP Classifier with a bootstrap confidence cutoff of 0.8. With the Greengenes training set, the RDP Classifier was run via the QIIME script assign_taxonomy, which does not classify sequences beyond the family level.
F.L.: Un-cropped full-length sequences from the reference or training dataset.
Datasets used for performance testing.
| Dataset | Sequencing technology | Library type | Total SSU rRNA reads |
| Lake Lanier | GS FLX Ti | Shotgun metagenome | 558 |
| Forest soil | GS FLX Ti | Shotgun metatranscriptome | 51,202 |
| Siberian soil | Illumina | 16S rRNA amplicons | 2,173 |
| Hydrothermal mat | GS FLX Ti | 16S rRNA amplicons | 8,903 |
Total SSU rRNA reads: reads with a BLASTN alignment bitscore >50 to a sequence in SilvaMod.
Results from performance testing using environmental datasets.
| Method | Training/Reference set | Dataset | Share of reads assigned: Genus | Share of reads assigned: Family | Share of reads assigned: Phylum | Unique taxa (B+A+E): Genera | Unique taxa (B+A+E): Families | Unique taxa (B+A+E): Phyla |
| LCA | SilvaMod | Lake Lanier |  |  |  |  |  |  |
| LCA | SilvaMod | Forest soil |  |  |  |  |  |  |
| LCA | SilvaMod | Siberian soil | 31.2% |  |  |  |  |  |
| LCA | SilvaMod | Hydrothermal mat |  |  |  |  |  | 19+2+1 |
| LCA | Greengenes | Lake Lanier | 11.5% | 64.5% | 98.9% | 15+0+0 | 25+0+0 |  |
| LCA | Greengenes | Forest soil | 14.7% | 55.1% | 84.1% | 130+0+0 | 126+1+0 | 31+2+2 |
| LCA | Greengenes | Siberian soil |  | 60.1% | 85.6% | 38+1+0 | 53+1+0 | 18+1+0 |
| LCA | Greengenes | Hydrothermal mat | 77.5% | 89.0% | 99.4% | 15+1+0 | 23+6+0 |  |
| RDP | Greengenes | Lake Lanier | 0 | 72.2% | 91.8% | 0 | 28+0+0 | 9+0+0 |
| RDP | Greengenes | Forest soil | 0 | 52.2% | 86.7% | 0 | 111+0+0 | 16+2+1 |
| RDP | Greengenes | Siberian soil | 0 | 53.4% | 90.5% | 0 | 53+1+0 | 10+1+0 |
| RDP | Greengenes | Hydrothermal mat | 0 | 81.6% | 97.8% | 0 | 19+3+0 | 9+2+0 |
| RDP | RDP v6 | Lake Lanier | 9.3% | 51.1% | 87.1% | 17+0+0 | 20+0+2 | 10+0+2 |
| RDP | RDP v6 | Forest soil | 11.9% | 40.4% | 80.9% | 176+2+0 | 95+2+0 | 20+2+1 |
| RDP | RDP v6 | Siberian soil | 6.7% | 39.7% | 66.0% | 36+1+0 | 39+1+0 | 10+1+0 |
| RDP | RDP v6 | Hydrothermal mat | 84.4% | 91.7% | 97.7% | 21+2+0 | 17+2+0 | 8+2+0 |
| SINA | SSURef108 | Hydrothermal mat | 20.4% | 27.7% | 93.2% | 32+1+0 | 25+5+0 | 9+2+0 |
Share of reads assigned: proportion of the total reads in the dataset for which a taxonomic assignment was achieved at the given taxonomic level.
Unique taxa (B+A+E): number of unique taxa identified, given separately for bacteria + archaea + eukaryotes. Where the highest total number of taxa was predicted from a test dataset, the number is indicated in bold.
LCA: Classification using Megablast alignments and the CREST LCAClassifier within a 2% LCA range of the highest bitscore as well as percent similarity filters.
RDP: Naïve Bayes classification using the RDP Classifier with a bootstrap confidence cutoff of 0.8. With the Greengenes training set, the RDP Classifier was run via the QIIME script assign_taxonomy.
SINA: LCA classification based on the SINA Aligner, using default parameters on the SILVA website.
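The two summary statistics reported in this table can be computed from per-read assignments along the following lines. This is a minimal Python sketch under stated assumptions: the function name `summarize`, the `(domain, taxon)` tuple format and the domain labels are hypothetical and are not taken from CREST or its output files.

```python
from typing import Dict, List, Optional, Set, Tuple

# Order matches the "B+A+E" convention: bacteria + archaea + eukaryotes.
DOMAINS = ("Bacteria", "Archaea", "Eukaryota")


def summarize(assignments: List[Optional[Tuple[str, str]]]) -> Tuple[float, str]:
    """Summarize one rank: each entry is (domain, taxon), or None if unassigned."""
    total = len(assignments)
    assigned = [a for a in assignments if a is not None]
    # Share of reads assigned at this rank, relative to all reads in the dataset.
    share = len(assigned) / total if total else 0.0
    # Unique taxa, counted separately for each domain.
    unique: Dict[str, Set[str]] = {d: set() for d in DOMAINS}
    for domain, taxon in assigned:
        if domain in unique:
            unique[domain].add(taxon)
    label = "+".join(str(len(unique[d])) for d in DOMAINS)
    return share, label


reads = [
    ("Bacteria", "Pseudomonas"),
    ("Bacteria", "Pseudomonas"),
    ("Bacteria", "Bacillus"),
    ("Archaea", "Nitrososphaera"),
    None,  # read not assigned at this rank
]
share, label = summarize(reads)
print(f"{share:.1%} assigned, unique taxa (B+A+E) = {label}")
```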
Figure 3. Average proportion of reads classified at different ranks in four environmental datasets.
The CREST LCAClassifier (analogous to MEGAN) was tested using the full SilvaMod and Greengenes [21] reference databases with their respective taxonomies, as well as the RDP Classifier [22] retrained with Greengenes (99% OTU dataset; executed via QIIME) and version 6 of the default RDP training dataset.