Amber L Hartman, Sean Riddle, Timothy McPhillips, Bertram Ludäscher, Jonathan A Eisen.
Abstract
BACKGROUND: For more than two decades microbiologists have used a highly conserved microbial gene as a phylogenetic marker for bacteria and archaea. The small-subunit ribosomal RNA gene, also known as 16S rRNA, is encoded by ribosomal DNA, 16S rDNA, and has provided a powerful comparative tool to microbial ecologists. Over time, the microbial ecology field has matured from small-scale studies in a select number of environments to massive collections of sequence data that are paired with dozens of corresponding collection variables. As the complexity of data and tool sets has grown, the need for flexible automation and maintenance of the core processes of 16S rDNA sequence analysis has increased correspondingly.
Year: 2010 PMID: 20540779 PMCID: PMC2898799 DOI: 10.1186/1471-2105-11-317
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Overview of WATERS. Schema of WATERS where white boxes indicate "behind the scenes" analyses that are performed in WATERS. Quality control files are generated for white boxes, but not otherwise routinely analyzed. Black arrows indicate that metadata (e.g., sample type) has been overlaid on the data for downstream interpretation. Colored boxes indicate different types of results files that are generated for the user for further use and biological interpretation. Colors indicate different types of WATERS actors from Fig. 2 which were used: green, Diversity metrics, WriteGraphCoordinates, Diversity graphs; blue, Taxonomy, BuildTree, Rename Trees, Save Trees; CreateUnifrac; yellow, CreateOtuTable, CreateCytoscape, CreateOTUFile; white, remaining unnamed actors.
Figure 2. Screenshot of WATERS in Kepler software. Key features: the library of actors un-collapsed and displayed on the left-hand side, the input and output paths where the user declares the location of their input files and desired location for the results files. Each green box is an individual Kepler actor that performs a single action on the data stream. The connectors (black arrows) direct and hook up the actors in a defined sequence. Double-clicking on any actor or connector allows it to be manipulated and re-arranged.
Comparison of WATERS' tools to existing web services and stand-alone software tools.
| | Greengenes | RDP II | RDP-Py | Silva | Mothur | QIIME | WATERS |
|---|---|---|---|---|---|---|---|
| Use | Web | Web | Web | Web | Command line | Command line | GUI |
| Align | NAST | Infernal | Infernal | SINA | NAST | Infernal | |
| Chimeras | Bellerophon | No | No | No | Unknown | Mallard | |
| OTUs | Yes | DOTUR | Complete-linkage | No | DOTUR | OTUHunter | |
| Taxonomy | Simrank; 7mer classification | naïve Bayesian classifier | naïve Bayesian classifier | Yes | Yes | STAP | |
| Trees | No | NJ | NJ | No | Yes | ML; NJ | |
| Ecology | No | No | Yes | No | Yes | Yes | |
| Unifrac | No | No | No | No | Yes | Yes | |
| Export DB | Yes | Yes | No | Yes | No | No | |
| Trim | Yes | Yes | Yes | No | Yes | No | |
| Dataset size | hundreds | hundreds | 500,000 | hundreds | tens of thousands | tens of thousands | tens of thousands |
Along the left column, "Use" indicates where or how the software is used; "Align" indicates the alignment programs available; "Chimeras" indicates the chimera removal software available; "OTUs" indicates the software used to detect and determine operational taxonomic units; "Taxonomy" indicates the software used to assign taxonomy to OTUs; "Trees" indicates the software used to build phylogenetic trees; "Ecology" indicates whether or not ecological indices such as Chao1 and the Shannon index are calculated; "Unifrac" indicates whether Unifrac analyses are done within the software or whether data is formatted for downstream use in Unifrac; "Export DB" indicates whether a quality-controlled, curated 16S dataset is available for export and/or for comparison to the user's own dataset; "Trim" indicates the availability of quality control trimming to remove sequence vectors or low-quality bases from the initial upload of sequences; "Dataset size" indicates the estimated number of sequences that can be readily processed through each software type. Along the top are all known multi-tool 16S rDNA analysis software suites. Note that each of these software packages is under very active development, so this table represents a snapshot in time of current tool availability. ML, maximum-likelihood; NJ, neighbor-joining.
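The "Ecology" row refers to indices such as Chao1 and the Shannon index. As a rough illustration of what those indices measure, the sketch below computes both from a list of per-OTU sequence counts using their standard textbook definitions. It is not taken from WATERS or any of the tools in the table; the function names and the example counts are hypothetical.

```python
import math


def shannon_index(otu_counts):
    """Shannon index H = -sum(p_i * ln p_i) over OTUs with nonzero counts."""
    total = sum(otu_counts)
    return -sum((n / total) * math.log(n / total) for n in otu_counts if n > 0)


def chao1(otu_counts):
    """Chao1 richness estimate from singleton and doubleton OTU counts.

    Uses the classic form S_obs + F1^2 / (2 * F2), and falls back to
    S_obs + F1 * (F1 - 1) / 2 when there are no doubletons (F2 == 0).
    """
    observed = sum(1 for n in otu_counts if n > 0)
    singletons = sum(1 for n in otu_counts if n == 1)
    doubletons = sum(1 for n in otu_counts if n == 2)
    if doubletons == 0:
        return observed + singletons * (singletons - 1) / 2
    return observed + singletons ** 2 / (2 * doubletons)


# Hypothetical library: seven OTUs, three singletons, two doubletons.
example_counts = [10, 5, 2, 2, 1, 1, 1]
print(shannon_index(example_counts))
print(chao1(example_counts))
```

Shannon rewards both richness and evenness, whereas Chao1 estimates how many OTUs remain unseen from the number of rare ones, which is why the two are usually reported together.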
Results files generated by WATERS for further analyses.
| Results file | Contents | Purpose |
|---|---|---|
| aligned_sequences.fas | All seqs pre-OTUHunter | QC; Alignment |
| bad_infernal_sequences.fas | Seqs un-alignable by Infernal | QC |
| chimeras.fas | Seqs removed by Mallard | QC |
| coordinates_Rank_abundance.csv | x,y coordinates for rank-abundance | Create graphs |
| coordinates_Rarefaction.csv | x,y coordinates for rarefaction | Create graphs |
| graph_*_variable.xgmml | Similarities between libraries based on shared OTUs | Cytoscape |
| graph-Rank_Abundance.bip/.ps | Printed graph of rank-abundance curves | View graphs |
| graph-Rarefaction.bip/.ps | Printed graph of rarefaction curves | View graphs |
| otu-table.txt | Counts of OTUs and diversity indices at each cutoff and metadata variable | Graph OTUs; diversity metrics |
| sequences-*.fas | One representative seq for each OTU found | Alignment |
| short_sequences.fas | Short seqs that did not pass cut off | QC |
| tree_*.txt | Phylogenetic tree of representatives with taxonomy information | Unifrac |
| unifrac_*_variable.txt | "Environment file" for Unifrac; OTU abundance and library info | Unifrac |
| workflow.trace | Provenance file written by Kepler describing the workflow run | QC |
Fourteen different types of results files can be generated from one run of WATERS in its complete configuration. * represents the cutoffs used in OTUHunter, by default 97 and 99 percent similarity, which will generate two different files at each cutoff used. Abbreviations: Seq(s), sequence(s); QC, Quality Control; OTU, Operational Taxonomic Unit.
Figure 3. Biologically similar results automatically produced by WATERS on published colonic microbiota samples. (A) Rarefaction curves similar to curves shown in Eckburg et al. Fig. 2; 70-72, indicate patient numbers, i.e., 3 different individuals. (B) Weighted Unifrac analysis based on phylogenetic tree and OTU data produced by WATERS very similar to Eckburg et al. Fig. 3B. (C) Neighbor-joining phylogenetic tree (Quicktree) representing the sequences analyzed by WATERS, which is clearly similar to Fig. S1 in Eckburg et al.
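For readers unfamiliar with rarefaction curves such as those in panel (A), the sketch below shows one common way to compute a point on such a curve: the expected number of OTUs observed when a library is subsampled without replacement to a given depth. This is the standard analytic rarefaction formula, offered as an assumption about how such curves can be produced in general; this section does not state how WATERS computes them. The function name and the example counts are hypothetical.

```python
from math import comb


def rarefied_richness(otu_counts, depth):
    """Expected number of OTUs seen in a random subsample of `depth` sequences.

    For each OTU with n_i sequences out of N total, the chance that it is
    missed entirely is C(N - n_i, depth) / C(N, depth); the expected richness
    is the sum over OTUs of one minus that probability.
    """
    total = sum(otu_counts)
    if not 0 <= depth <= total:
        raise ValueError("depth must be between 0 and the total sequence count")
    all_subsamples = comb(total, depth)
    return sum(
        1 - comb(total - n, depth) / all_subsamples
        for n in otu_counts
        if n > 0
    )


# Hypothetical library of 22 sequences in 7 OTUs; one curve point per depth.
example_counts = [10, 5, 2, 2, 1, 1, 1]
curve = [(d, rarefied_richness(example_counts, d)) for d in range(0, 23, 2)]
for depth, expected_otus in curve:
    print(depth, round(expected_otus, 2))
```

Plotting the resulting pairs, one series per library, gives a curve whose flattening indicates that additional sequencing is finding few new OTUs.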
Comparison of OTU abundance between WATERS' automated results and previously published data.
| Taxonomy | WATERS abundance | WATERS OTUs | Eckburg et al. abundance | Eckburg et al. OTUs |
|---|---|---|---|---|
| Actinobacteria | 18 | 10 | 22 | 10 |
| Alphaproteobacteria | 10 | 4 | 10 | 4 |
| Bacteroides | 5510 | 67 | 5640 | 65 |
| Betaproteobacteria | 27 | 6 | 32 | 5 |
| Cyanobacteria | 3 | 1 | 3 | 1 |
| Deltaproteobacteria | 24 | 4 | 24 | 4 |
| Epsilonproteobacteria | 2 | 1 | 2 | 1 |
| Firmicutes Clostridia | 4849 | 265 | 5721 | 274 |
| Firmicutes Mollicutes | 318 | 19 | 287 | 27 |
| Fusobacteria | 9 | 1 | 9 | 1 |
| Gammaproteobacteria | 5 | 2 | 5 | 2 |
| Verrucomicrobia | 69 | 1 | 76 | 1 |
Along the left are the bacterial taxonomic groups detected in the dataset. Across the top are the results from WATERS compared to the previously published results. Columns 1 and 3 provide the total abundance of all OTUs in that taxonomic category. Columns 2 and 4 provide the number of discrete OTUs observed in that taxonomic group. OTU abundance data for the Eckburg et al. dataset can be found on page 23 of the original publication's supplemental material [48].