Dongying Wu, Amber Hartman, Naomi Ward, Jonathan A Eisen.
Abstract
Comparative analysis of small-subunit ribosomal RNA (ss-rRNA) gene sequences forms the basis for much of what we know about the phylogenetic diversity of both cultured and uncultured microorganisms. As sequencing costs continue to decline and throughput increases, sequences of ss-rRNA genes are being obtained at an ever-increasing rate. This increasing flow of data has opened many new windows into microbial diversity and evolution, and at the same time has created significant methodological challenges. Those processes which commonly require time-consuming human intervention, such as the preparation of multiple sequence alignments, simply cannot keep up with the flood of incoming data. Fully automated methods of analysis are needed. Notably, existing automated methods avoid one or more steps that, though computationally costly or difficult, we consider to be important. In particular, we regard both the building of multiple sequence alignments and the performance of high quality phylogenetic analysis to be necessary. We describe here our fully-automated ss-rRNA taxonomy and alignment pipeline (STAP). It generates both high-quality multiple sequence alignments and phylogenetic trees, and thus can be used for multiple purposes including phylogenetically-based taxonomic assignments and analysis of species diversity in environmental samples. The pipeline combines publicly-available packages (PHYML, BLASTN and CLUSTALW) with our automatic alignment, masking, and tree-parsing programs. Most importantly, this automated process yields results comparable to those achievable by manual analysis, yet offers speed and capacity that are unattainable by manual efforts.
Year: 2008 PMID: 18596968 PMCID: PMC2432038 DOI: 10.1371/journal.pone.0002566
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Comparison of STAP's computational abilities relative to existing commonly-used ss-rRNA analysis tools.
| | STAP | ARB | Greengenes | RDP |
| Installed where? | Locally | Locally | Web only | Web only |
| User interface | Command line | GUI | Web portal | Web portal |
| Parallel processing | YES | NO | NO | NO |
| Manual curation for taxonomy assignment | NO | YES | NO | NO |
| Manual curation for alignment | NO | YES | NO | NO |
| Open source | YES | NO | NO | NO |
| Processing speed | Fast | Slow | Medium | Medium |
Note that STAP is the only software in this comparison that runs on the command line and can take advantage of parallel processing on Linux clusters; it is also more amenable to downstream code manipulation.
Note: Greengenes alignment output is compatible with upload into ARB and downstream manual alignment.
The STAP program itself is open source; the programs it depends on are freely available but not open source.
Figure 1. A flow chart of the STAP pipeline.
Figure 2. Domain assignment.
In Step 1, STAP assigns a domain to each query sequence based on its position in a maximum likelihood tree of representative ss-rRNA sequences. Because the tree illustrated here is not rooted, domain assignment would not be accurate or reliable (sequence-similarity-based methods cannot make an accurate assignment in this case either). However, the figure illustrates an important role of the tree-based domain assignment step, namely the automatic identification of deep-branching environmental ss-rRNAs.
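To make the idea of tree-based domain assignment concrete, here is a minimal sketch in Python. It is not STAP's tree-parsing program: it assumes a simplified rule in which the query takes the domain of the reference leaf with the smallest patristic distance (summed branch lengths) in an unrooted tree. The function name `nearest_labeled_leaf`, the toy tree, and all leaf and node names are hypothetical.

```python
import heapq
from collections import defaultdict


def nearest_labeled_leaf(edges, labels, query):
    """Return (leaf, domain, distance) for the labeled leaf closest to `query`.

    edges  : iterable of (node_a, node_b, branch_length) for an unrooted tree
    labels : dict mapping reference leaf name -> domain
    query  : name of the query leaf
    Assumption: nearest-neighbor rule, a simplification of tree-based assignment.
    """
    graph = defaultdict(list)
    for a, b, length in edges:
        graph[a].append((b, length))
        graph[b].append((a, length))

    # Dijkstra from the query; in a tree the path length is the patristic distance.
    heap = [(0.0, query)]
    seen = set()
    while heap:
        dist, node = heapq.heappop(heap)
        if node in seen:
            continue
        seen.add(node)
        if node != query and node in labels:
            return node, labels[node], dist
        for neighbor, length in graph[node]:
            if neighbor not in seen:
                heapq.heappush(heap, (dist + length, neighbor))
    return None  # no labeled reference leaf reachable from the query


# Toy unrooted tree (hypothetical names and branch lengths).
toy_edges = [
    ("bact_A", "n1", 0.10), ("bact_B", "n1", 0.12), ("n1", "n2", 0.30),
    ("arch_A", "n3", 0.15), ("arch_B", "n3", 0.11), ("n3", "n2", 0.25),
    ("euk_A", "n2", 0.40), ("query", "n3", 0.20),
]
toy_labels = {
    "bact_A": "Bacteria", "bact_B": "Bacteria",
    "arch_A": "Archaea", "arch_B": "Archaea",
    "euk_A": "Eukaryota",
}

leaf, domain, distance = nearest_labeled_leaf(toy_edges, toy_labels, "query")
print(f"query -> {domain} (nearest reference: {leaf}, distance {distance:.2f})")
```

A query whose nearest reference is unusually distant would be a candidate deep-branching sequence under this simplified rule; any cutoff for "unusually distant" would have to be chosen separately and is not given here.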
Figure 3. Determination of the quality score cutoff for automated alignment trimming.
The average quality score for all columns for alignments of randomly-generated sequences is plotted against the number of sequences in the alignment (see Methods). Standard deviations are indicated by gray shading.
Figure 4. Comparison of reliability of BLASTN and STAP taxonomic assignments.
The number below each taxonomic level indicates the number of bacterial sequences in the analysis that were annotated at that level (see Results and Discussion).
Discrepancies between taxonomic assignments made by BLASTN and STAP.
| Taxonomic Level | Phylum | Class | Order | Family | Genus |
| STAP more accurate | 1 | 1 | 11 | 10 | 25 |
| BLASTN more accurate | 0 | 0 | 0 | 3 | 8 |
| Unresolved | 2 | 3 | 7 | 6 | 8 |
Bacterial sequences for which the assignments made by BLASTN differed from those made by STAP were identified, and the level of the Hugenholtz annotation for each was noted. Accuracy was scored based on comparisons with the Hugenholtz annotation. Those few cases where the BLASTN results matched the annotations but the STAP results did not were always found to be due to incorrect annotation in the Greengenes database for the sequence's closest neighbor in the tree. Sequences whose position in the STAP-generated tree was between neighboring groups were classified as “unresolved.”
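The counts in the discrepancy table can be summarized with a short tally. The Python sketch below uses only the numbers from the table; the variable names are illustrative, and the percentages it prints are simple shares of the discrepant sequences, not a statistic reported by the authors.

```python
levels = ["Phylum", "Class", "Order", "Family", "Genus"]

# Counts copied from the discrepancy table, one value per taxonomic level.
counts = {
    "STAP more accurate":   [1, 1, 11, 10, 25],
    "BLASTN more accurate": [0, 0, 0, 3, 8],
    "Unresolved":           [2, 3, 7, 6, 8],
}

# Row totals, column totals, and the overall number of discrepant sequences.
totals = {outcome: sum(row) for outcome, row in counts.items()}
per_level = [sum(column) for column in zip(*counts.values())]
grand_total = sum(totals.values())

for outcome, total in totals.items():
    share = 100 * total / grand_total
    print(f"{outcome}: {total} of {grand_total} ({share:.1f}%)")

for level, n in zip(levels, per_level):
    print(f"{level}: {n} discrepant sequences")
```

Across all five levels, the table lists 85 discrepancies: 48 scored in favor of STAP, 11 in favor of BLASTN, and 26 unresolved.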