Arthur Brady, Steven L Salzberg.
Abstract
Metagenomics projects collect DNA from uncharacterized environments that may contain thousands of species per sample. One main challenge facing metagenomic analysis is phylogenetic classification of raw sequence reads into groups representing the same or similar taxa, a prerequisite for genome assembly and for analyzing the biological diversity of a sample. New sequencing technologies have made metagenomics easier, by making sequencing faster, and more difficult, by producing shorter reads than previous technologies. Classifying sequences from reads as short as 100 base pairs has until now been relatively inaccurate, requiring researchers to use older, long-read technologies. We present Phymm, a classifier for metagenomic data, that has been trained on 539 complete, curated genomes and can accurately classify reads as short as 100 base pairs, a substantial improvement over previous composition-based classification methods. We also describe how combining Phymm with sequence alignment algorithms improves accuracy.
Year: 2009 PMID: 19648916 PMCID: PMC2762791 DOI: 10.1038/nmeth.1358
Source DB: PubMed Journal: Nat Methods ISSN: 1548-7091 Impact factor: 28.547
Figure 1. Percent accuracy of Phymm, with species-level matches masked, for read lengths from 100–1000 bp
Colored dots show the classification accuracy reported for PhyloPythia at 1000 bp for genus- through phylum-level predictions, and for CARMA at 100 bp (as a percentage of the entire input data set) for genus- and phylum-level predictions.
Comparison of performance accuracy, with same-species matches masked, for 1,000-bp reads, showing all three methods described in this study alongside PhyloPythia.
| Taxonomic level | Phymm | BLAST | PhymmBL | PhyloPythia |
|---|---|---|---|---|
| Genus | 71.1 | 73.8 | 78.4 | 7.1 |
| Family | 77.5 | 79.2 | 84.8 | – |
| Order | 80.6 | 80.8 | 86.9 | 25.1 |
| Class | 85.4 | 84.1 | 90.6 | 30.8 |
| Phylum | 89.8 | 88.0 | 93.8 | 50.3 |
Accuracy is measured as the percentage of all reads for which each method produced the correct phylogenetic label. We performed this set of experiments once for each read length. All other experiments with our synthetic metagenome data set were performed only for 100-bp reads; each of these experiments was repeated 10 times. Results for experiments at all levels of the phylogeny, along with standard deviations for each result, are provided in Supplementary Tables 4–6.
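The accuracy definition above can be made concrete with a minimal sketch. The function name `classification_accuracy`, the use of `None` for a read that a method left unlabeled, and the example genus labels are assumptions made for illustration; none of them come from Phymm or PhymmBL. The only point taken from the text is that the denominator is all reads, so an unlabeled read counts as incorrect.

```python
from typing import Optional, Sequence


def classification_accuracy(
    predicted: Sequence[Optional[str]],
    truth: Sequence[str],
) -> float:
    """Return the percentage of all reads whose predicted label is correct.

    A prediction of None (an assumed marker for "no label produced") never
    matches a true label, so it lowers accuracy instead of being skipped.
    """
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if len(truth) == 0:
        raise ValueError("at least one read is required")

    # Count reads where the predicted label equals the true label.
    correct = sum(1 for p, t in zip(predicted, truth) if p == t)

    # Divide by the total number of reads, not by the number classified.
    return 100.0 * correct / len(truth)


if __name__ == "__main__":
    # Hypothetical genus-level labels for four reads.
    predicted = ["Escherichia", "Bacillus", None, "Vibrio"]
    truth = ["Escherichia", "Bacillus", "Pseudomonas", "Salmonella"]
    print(classification_accuracy(predicted, truth))  # prints 50.0
```

Dividing by the number of classified reads instead would give a precision-like figure, which is not what the table reports.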
Figure 2. PhymmBL’s phylum-level population characterization of the AMD data, using (A) the RefSeq-generated IMMs plus IMMs generated from the draft genomes of the three dominant species in the AMD set, and (B) the RefSeq-generated IMMs on their own.
Figure 3. PhymmBL’s species-level population characterization of the AMD data, using the RefSeq-generated IMMs plus IMMs generated from the draft genomes of the three dominant species in the AMD set.