Kaustubh Raosaheb Patil, Linus Roune, Alice Carolyn McHardy.
Abstract
Metagenome sequencing is becoming common and there is an increasing need for easily accessible tools for data analysis. An essential step is the taxonomic classification of sequence fragments. We describe a web server for the taxonomic assignment of metagenome sequences with PhyloPythiaS. PhyloPythiaS is a fast and accurate sequence composition-based classifier that utilizes the hierarchical relationships between clades. Taxonomic assignments with the web server can be made with a generic model, or with sample-specific models that users can specify and create. Several interactive visualization modes and multiple download formats allow quick and convenient analysis and downstream processing of taxonomic assignments. Here, we demonstrate usage of our web server by taxonomic assignment of metagenome samples from an acidophilic biofilm community of an acid mine and of a microbial community from cow rumen.
Year: 2012 PMID: 22745671 PMCID: PMC3380018 DOI: 10.1371/journal.pone.0038581
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Taxonomic assignments of the acid mine drainage metagenome scaffolds.
Each slice represents the number of bases assigned. Panels: (a) the PhyloPythiaS generic model at the phylum level, (b) the PhyloPythiaS sample-specific model at the phylum level, (c) the PhyloPythiaS sample-specific model at various ranks, (d) taxonomic reference composition, obtained by alignment of the scaffolds with draft genome assemblies, (e) quantitative cell counts from a FISH study, reproduced from Tyson et al. (2004) [13], and (f) NBC with N-mer length 15 and Bacteria/Archaea genomes at the phylum level. The “Other” slice represents sequences that were unassigned or assigned at a higher level. Assignments were mapped to phylum level in plots a, b and f for ease of visualization.
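The slices described above amount to summing scaffold lengths per clade at one rank and pooling everything that is unassigned, or assigned only above that rank, into "Other". A minimal Python sketch of that aggregation follows. It assumes each assignment is available as a scaffold length plus a rank-to-name mapping; the function name, the data layout and the toy scaffolds are illustrative and are not taken from the web server's download formats.

```python
from collections import defaultdict


def bases_per_slice(assignments, rank="phylum"):
    """Sum scaffold lengths per clade at `rank`.

    `assignments` is a list of (length_bp, lineage) pairs, where `lineage`
    maps rank names to clade names for the ranks that received an assignment.
    Scaffolds without an entry at `rank` (unassigned, or assigned only at a
    higher rank) are pooled into "Other".
    """
    totals = defaultdict(int)
    for length_bp, lineage in assignments:
        totals[lineage.get(rank, "Other")] += length_bp
    return dict(totals)


# Toy data, invented for illustration only.
toy_scaffolds = [
    (5000, {"domain": "Bacteria", "phylum": "Nitrospirae", "genus": "Leptospirillum"}),
    (3000, {"domain": "Archaea", "phylum": "Euryarchaeota"}),
    (2000, {"domain": "Archaea"}),  # assigned above phylum level
    (1000, {}),                     # unassigned
]

slices = bases_per_slice(toy_scaffolds, rank="phylum")
total = sum(slices.values())
for clade, bases in sorted(slices.items(), key=lambda kv: -kv[1]):
    print(f"{clade}: {bases} bp ({100 * bases / total:.1f}%)")
```

Weighting by bases rather than by scaffold count keeps a few long scaffolds from being outvoted by many short ones, which is why the slices need not match scaffold-count proportions.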
Percentage of bases correctly assigned to modeled taxa by different methods for the AMD metagenome scaffolds.
| Rank | PhyloPythiaS sample-specific | PhyloPythiaS generic | BLASTN | MEGAN | NBC |
| | 41.353 | 0.000 | 0.000 | 0.000 | 0.000 |
| | 41.353 | 0.000 | 1.685 | 0.000 | 0.000 |
| | 74.706 | 38.189 | 45.536 | 42.210 | 1.742 |
| | 74.706 | 38.189 | 45.536 | 42.210 | 1.742 |
| | 89.540 | 47.821 | 47.011 | 42.673 | 1.798 |
| | 92.673 | 88.978 | 86.042 | 70.194 | 44.805 |
The reference taxonomic affiliations were obtained by aligning the test scaffolds with the draft genomes. For PhyloPythiaS (both generic and sample-specific), the drop in accuracy is mostly due to unassigned sequences at a particular rank, while other methods produced more false assignments. Thermoplasmatales archaeon Gpl (comprising 21.8% of the total bases) has no defined parental clade at the genus and family ranks, contributing to the observed lower accuracy values for these ranks. Additional measures are shown in Figure S6.
Taxonomic distance analysis for AMD metagenome scaffolds assignment to draft genome assemblies generated for five strains of three different genera in the AMD metagenome project.
| Method | Measure | Genus | | | Taxonomic Distance | |
| | | L (543) | T (404) | F (236) | Micro average | Macro average |
| | Assigned | 528 | 404 | 236 | – | – |
| | Const_n_scaff | 0.92 | 0.83 | 0.97 | 0.89 | 0.91 |
| | Const_n_bp | 0.97 | 0.89 | 0.99 | 0.95 | 0.95 |
| | Tax dist | 0.96 | 1.79 | 2.22 | 1.48 | 1.65 |
| | Assigned | 540 | 403 | 236 | – | – |
| | Const_n_scaff | 0.36 | 0.81 | 0.95 | 0.63 | 0.71 |
| | Const_n_bp | 0.24 | 0.86 | 0.98 | 0.62 | 0.70 |
| | Tax dist | 6.90 | 1.96 | 2.53 | 4.32 | 3.80 |
| | Assigned | 542 | 403 | 236 | – | – |
| | Const_n_scaff | 0.13 | 0.13 | 0.07 | 0.12 | 0.11 |
| | Const_n_bp | 0.06 | 0.08 | 0.02 | 0.05 | 0.05 |
| | Tax dist | 9.36 | 3.78 | 4.95 | 6.56 | 6.03 |
| | Assigned | 337 | 272 | 194 | – | – |
| | Const_n_scaff | 0.60 | 0.14 | 0.12 | 0.22 | 0.28 |
| | Const_n_bp | 0.58 | 0.14 | 0.11 | 0.30 | 0.28 |
| | Tax dist | 5.77 | 2.12 | 3.93 | 2.78 | 3.94 |
| | Assigned | 539 | 403 | 235 | – | – |
| | Const_n_scaff | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| | Const_n_bp | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| | Tax dist | 10.10 | 9.44 | 11.79 | 10.16 | 10.45 |
The genera are Leptospirillum (L), Thermoplasmatales (T), and Ferroplasma (F). The evaluated methods are the PhyloPythiaS sample-specific model (PPS SS), the PhyloPythiaS generic model (PPS G), BLASTN, MEGAN and the Naïve Bayesian classifier method (NBC). The assignments provided by each method were mapped to the genus or corresponding clade at a higher taxonomic rank for this analysis. The numbers in brackets after the population name show the number of scaffolds originating from each genus. The rows show the number of assigned scaffolds (Assigned), the fraction of scaffolds assigned to either the correct clade itself or a parental clade thereof (Const_n_scaff), the fraction of base-pairs in the same lineage as the correct taxon (Const_n_bp) and the average taxonomic distance of assignments with respect to genus level clades of the draft reference genomes (Tax Dist). See ‘Results’ for the definitions of consistency and taxonomic distance. Micro average shows average value over all test scaffolds and macro average shows average over the three genera.
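The two consistency measures can be illustrated with a short Python sketch. An assignment is counted as consistent when the predicted clade is the correct clade itself or one of its parental clades, and the same test is applied once per scaffold (Const_n_scaff) and once weighted by scaffold length (Const_n_bp). The sketch assumes that the denominator is the set of assigned scaffolds and that the reference lineage is given as a list of clade names; both are assumptions made here for illustration, as are the function name and the toy data. The authoritative definitions are those in the paper's 'Results' section.

```python
def consistency(scaffolds):
    """Return (n_assigned, const_n_scaff, const_n_bp).

    `scaffolds` is a list of (length_bp, predicted_clade, true_lineage):
    `predicted_clade` is None for unassigned scaffolds, and `true_lineage`
    lists the correct clade together with all of its parental clades.
    Assumption: fractions are taken over assigned scaffolds only.
    """
    n_assigned = n_consistent = 0
    bp_assigned = bp_consistent = 0
    for length_bp, predicted, true_lineage in scaffolds:
        if predicted is None:
            continue  # unassigned scaffolds are skipped
        n_assigned += 1
        bp_assigned += length_bp
        if predicted in true_lineage:
            n_consistent += 1
            bp_consistent += length_bp
    if n_assigned == 0:
        return 0, 0.0, 0.0
    return n_assigned, n_consistent / n_assigned, bp_consistent / bp_assigned


# Toy reference lineage and predictions, invented for illustration only.
lineage_l = ["Bacteria", "Nitrospirae", "Nitrospira", "Nitrospirales",
             "Nitrospiraceae", "Leptospirillum"]
toy = [
    (4000, "Leptospirillum", lineage_l),  # correct clade
    (3000, "Nitrospirae", lineage_l),     # parental clade, still consistent
    (2000, "Proteobacteria", lineage_l),  # different lineage, inconsistent
    (1000, None, lineage_l),              # unassigned
]

n_assigned, const_n_scaff, const_n_bp = consistency(toy)
print(n_assigned, round(const_n_scaff, 3), round(const_n_bp, 3))
```

Because a prediction at a parental clade still counts as consistent, a cautious classifier that stops at a higher rank is not penalized by this measure; the taxonomic distance row captures how far such assignments are from the genus-level reference.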
Figure 2. Taxonomic assignments of the cow rumen metagenome scaffolds with the PhyloPythiaS generic model.
This data set contained 26,042 scaffolds in total. The assignments are shown at the order level. Each slice represents the number of bases assigned. The “Other” slice represents sequences that were unassigned or assigned at a higher level.
Taxonomic distance and consistency analysis of the 15 genome bins from the cow rumen metagenome consisting of 466 scaffolds in total.
| Genome bin | Correct order | #Scaff | PhyloPythiaS generic model prediction | | |
| | | | Tax Dist | Const_n_scaff | Const_n_bp |
| | Bacteroidales | 20 | 0.000 | 1.000 | 1.000 |
| | Bacteroidales | 22 | 0.000 | 1.000 | 1.000 |
| | Spirochaetales | 19 | 0.000 | 1.000 | 1.000 |
| | Bacteroidales | 24 | 0.000 | 1.000 | 1.000 |
| | Bacteroidales | 26 | 0.231 | 0.962 | 0.990 |
| | Clostridiales | 32 | 0.625 | 0.906 | 0.967 |
| | Bacteroidales | 35 | 0.743 | 0.886 | 0.938 |
| | Clostridiales | 42 | 1.738 | 0.690 | 0.776 |
| | Spirochaetales | 28 | 1.893 | 0.714 | 0.759 |
| | Clostridiales | 55 | 3.636 | 0.382 | 0.454 |
| | Clostridiales | 53 | 5.245 | 0.189 | 0.114 |
| | Clostridiales | 22 | 6.682 | 0.182 | 0.086 |
| | Myxococcales | 20 | 3.100 | 0.250 | 0.076 |
| | Clostridiales | 27 | 3.704 | 0.074 | 0.046 |
| | Clostridiales | 41 | 7.073 | 0.000 | 0.000 |
| Macro average | – | 31.067 | 2.311 | 0.616 | 0.614 |
| Micro average | – | – | 2.693 | 0.560 | 0.613 |
The first three columns describe the dataset while the last three columns summarize the predictions of the PhyloPythiaS generic model. The last three columns show the average taxonomic distances between the predicted order and the correct order (Tax Dist), the consistency calculated based on the fraction of assigned scaffolds (Const_n_scaff) and the consistency calculated based on the fraction of assigned base-pairs (Const_n_bp). See ‘Results’ for the definitions of taxonomic distance and consistency. The micro average is the average value over all scaffolds and the macro average represents the average over the genome bins.
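The difference between the two averages can be checked directly from the per-bin values in the table. The Python sketch below takes the macro average as the unweighted mean over the 15 genome bins and the micro average as the mean weighted by each bin's scaffold count, which is equivalent to averaging over all 466 scaffolds. The variable and function names are illustrative. Const_n_bp is left out because its micro average would need per-scaffold lengths, which the table does not list.

```python
# (n_scaffolds, tax_dist, const_n_scaff) for the 15 genome bins, copied from the table.
bins = [
    (20, 0.000, 1.000), (22, 0.000, 1.000), (19, 0.000, 1.000),
    (24, 0.000, 1.000), (26, 0.231, 0.962), (32, 0.625, 0.906),
    (35, 0.743, 0.886), (42, 1.738, 0.690), (28, 1.893, 0.714),
    (55, 3.636, 0.382), (53, 5.245, 0.189), (22, 6.682, 0.182),
    (20, 3.100, 0.250), (27, 3.704, 0.074), (41, 7.073, 0.000),
]


def macro_average(values):
    """Unweighted mean: every genome bin counts equally."""
    return sum(values) / len(values)


def micro_average(values, weights):
    """Weighted mean: every scaffold counts equally."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


n_scaff = [n for n, _, _ in bins]
tax_dist = [d for _, d, _ in bins]
const_scaff = [c for _, _, c in bins]

macro_tax_dist = macro_average(tax_dist)
micro_tax_dist = micro_average(tax_dist, n_scaff)
macro_const = macro_average(const_scaff)
micro_const = micro_average(const_scaff, n_scaff)

print(f"scaffolds: {sum(n_scaff)}, mean per bin: {macro_average(n_scaff):.3f}")
print(f"Tax Dist       macro {macro_tax_dist:.3f}  micro {micro_tax_dist:.3f}")
print(f"Const_n_scaff  macro {macro_const:.3f}  micro {micro_const:.3f}")
```

Rounded to three decimals, these reproduce the table's two summary rows: 2.311 and 0.616 for the macro average (the row that also reports the mean bin size of 31.067), and 2.693 and 0.560 for the micro average. The micro values are worse because the largest bins, with 53 and 55 scaffolds, are among the least consistently assigned.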
Figure 3. Schematic representation of the PhyloPythiaS web server implementation.
Arrows represent the direction of communication.