Daniel McDonald, Morgan N Price, Julia Goodrich, Eric P Nawrocki, Todd Z DeSantis, Alexander Probst, Gary L Andersen, Rob Knight, Philip Hugenholtz.
Abstract
Reference phylogenies are crucial for providing a taxonomic framework for interpretation of marker gene and metagenomic surveys, which continue to reveal novel species at a remarkable rate. Greengenes is a dedicated full-length 16S rRNA gene database that provides users with a curated taxonomy based on de novo tree inference. We developed a 'taxonomy to tree' approach for transferring group names from an existing taxonomy to a tree topology, and used it to apply the Greengenes, National Center for Biotechnology Information (NCBI) and cyanoDB (Cyanobacteria only) taxonomies to a de novo tree comprising 408,315 sequences. We also incorporated explicit rank information provided by the NCBI taxonomy to group names (by prefixing rank designations) for better user orientation and classification consistency. The resulting merged taxonomy improved the classification of 75% of the sequences by one or more ranks relative to the original NCBI taxonomy with the most pronounced improvements occurring in under-classified environmental sequences. We also assessed candidate phyla (divisions) currently defined by NCBI and present recommendations for consolidation of 34 redundantly named groups. All intermediate results from the pipeline, which includes tree inference, jackknifing and transfer of a donor taxonomy to a recipient tree (tax2tree) are available for download. The improved Greengenes taxonomy should provide important infrastructure for a wide range of megasequencing projects studying ecosystems on scales ranging from our own bodies (the Human Microbiome Project) to the entire planet (the Earth Microbiome Project). The implementation of the software can be obtained from http://sourceforge.net/projects/tax2tree/.
Year: 2011 PMID: 22134646 PMCID: PMC3280142 DOI: 10.1038/ismej.2011.139
Source DB: PubMed Journal: ISME J ISSN: 1751-7362 Impact factor: 10.302
Figure 1. Overview of the tax2tree workflow. (i) The inputs to tax2tree: a taxonomy file that matches known taxonomy strings to identifiers that are associated with tips of (that is, sequences within) a phylogenetic tree. To simplify the diagram, only the family, genus and species are used, although the full algorithm uses all phylogenetic ranks. (ii) The input taxonomy represented as a tree and a taxon name legend for the figure. (iii–v) Nodes chosen by the F-measure procedure at each rank: (iii) species, (iv) genus and (v) family. In this example, the genus Clostridium is polyphyletic, and the F-measure procedure picked the 'best' internal node for the name (uniting tips A–F). However, as unique names at a given rank can only be placed once on the tree, this leaves tips I–L without a genus name placed on an interior node. (vi) The backfilling procedure detects that tips I–L have an incomplete taxonomic path (species to family) and prepends the missing genus name (obtained from the input taxonomy) to the lower rank, because this step of the procedure examines only ancestors but not siblings. (vii) The common name promotion step identifies internal nodes in which all of the nearest named descendants share a common name. In this example, the node that is the lowest common ancestor for tips I–L has immediate descendants that all share the same genus name, Clostridium. This name can be safely promoted to the lowest common ancestor (interior node) uniting tips I–L. (viii) The resulting taxonomy. Note that the sequence identified as B was unclassified in the donor taxonomy but is now classified as f__Lachnospiraceae; g__Clostridium; s__.
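The node-selection idea described in the caption above can be sketched as follows, assuming the F-measure is the usual harmonic mean of precision and recall computed over the tips beneath each internal node. This is not the tax2tree implementation: the nested-list tree, the `labels` mapping, the helper names, the placeholder genus names and the choice to ignore unlabelled tips when computing precision are all illustrative assumptions, and the toy tree is not a reproduction of the figure.

```python
def collect_tips(node):
    """Return the tip names under a node; a tip is a plain string."""
    if isinstance(node, str):
        return [node]
    tips = []
    for child in node:
        tips.extend(collect_tips(child))
    return tips


def internal_nodes(node):
    """Yield every internal node (nested list) in the tree, root included."""
    if isinstance(node, str):
        return
    yield node
    for child in node:
        yield from internal_nodes(child)


def f_measure(node, name, labels):
    """Harmonic mean of precision and recall for placing `name` on `node`.

    Assumption: tips without a donor label (None) are ignored for precision.
    """
    labelled = [t for t in collect_tips(node) if labels.get(t) is not None]
    hits = sum(1 for t in labelled if labels[t] == name)
    total = sum(1 for value in labels.values() if value == name)
    if hits == 0 or total == 0:
        return 0.0
    precision = hits / len(labelled)
    recall = hits / total
    return 2 * precision * recall / (precision + recall)


def best_node(tree, name, labels):
    """Pick the internal node with the highest F-measure for `name`."""
    return max(internal_nodes(tree), key=lambda n: f_measure(n, name, labels))


# Hypothetical toy tree: a polyphyletic genus split across two clades.
clade_af = [["A", "B", "C"], ["D", "E", "F"]]
clade_other = [["G", "H"], ["I", "J"]]
clade_kl = ["K", "L"]
tree = [clade_af, [clade_other, clade_kl]]

labels = {
    "A": "g__Clostridium", "B": None, "C": "g__Clostridium",
    "D": "g__Clostridium", "E": "g__Clostridium", "F": "g__Clostridium",
    "G": "g__Other1", "H": "g__Other1", "I": "g__Other2", "J": "g__Other2",
    "K": "g__Clostridium", "L": "g__Clostridium",
}

chosen = best_node(tree, "g__Clostridium", labels)
```

In this toy tree the clade uniting tips A–F scores 10/12 and is selected over the root (14/18), so the two remaining Clostridium tips (K and L) receive no genus name on an interior node, which is the situation that backfilling and common name promotion address.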
Figure 2. A comparison of the NCBI taxonomy to the updated Greengenes taxonomy for sequences in tree_16S_all_gg_2011_1. (a) Lowest taxonomic rank assigned to each sequence; (b) taxonomic differences between NCBI and Greengenes at each rank, showing the percentage of sequences classified to each of five possible categories (see inset legend; GG, Greengenes) highlighting cases where NCBI and Greengenes differ.
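Panel (a) depends on finding the lowest rank that actually carries a name in each taxonomy string. A hedged sketch, assuming semicolon-separated, rank-prefixed strings such as the `f__Lachnospiraceae; g__Clostridium; s__` example in Figure 1; the seven-rank prefix list (`k`, `p`, `c`, `o`, `f`, `g`, `s`) and the function names are assumptions for illustration, not the code used to produce the figure.

```python
from collections import Counter

# Assumed rank prefixes, ordered from highest to lowest rank.
RANK_ORDER = ["k", "p", "c", "o", "f", "g", "s"]


def lowest_assigned_rank(taxonomy):
    """Return the prefix of the lowest rank that has a non-empty name.

    A field such as "s__" carries a rank prefix but no name, so it does
    not count as assigned. Returns None when no rank is named.
    """
    lowest = None
    for field in taxonomy.split(";"):
        prefix, sep, name = field.strip().partition("__")
        if not sep or not name or prefix not in RANK_ORDER:
            continue
        if lowest is None or RANK_ORDER.index(prefix) > RANK_ORDER.index(lowest):
            lowest = prefix
    return lowest


def tally_lowest_ranks(taxonomies):
    """Count how many taxonomy strings bottom out at each rank."""
    return Counter(lowest_assigned_rank(t) for t in taxonomies)


examples = [
    "f__Lachnospiraceae; g__Clostridium; s__",
    "p__OP1",
    "s__",
]
counts = tally_lowest_ranks(examples)
```

Running the same tally over two taxonomies for the same sequence identifiers would give the per-rank counts that a comparison like panel (a) summarizes.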
Greengenes classifications of NCBI-defined candidate phyla (divisions) based on tree_16S_candiv_gg_2011_1. SILVA_106 and RDP classifications are included for reference.
| NCBI candidate phylum | Full-length (partial) sequences | Greengenes | SILVA_106 | RDP |
|---|---|---|---|---|
| AC1 | 6 (7) | p__AC1 | TA06 | |
| OS-K | 3 (7) | p__Acidobacteria | Acidobacteria | Acidobacteria |
| OP10 | 69 (279) | p__Armatimonadetes | OP10 | OP10 |
| KSA1 | 0 (2) | p__Bacteroidetes | Bacteroidetes | |
| KSB1 | 13 (23) | p__Caldithrix | Deferribacteres | |
| MSBL5 | 0 (1) | p__Chloroflexi | Chloroflexi | |
| NT-B4 | 0 (1) | p__Chloroflexi | | |
| CAB-I | 7 (59) | p__Cyanobacteria | Cyanobacteria | Cyanobacteria |
| OP2 | 1 (25) | p__Elusimicrobia | Thermotogae | |
| GN01 | 10 (12) | p__GN01 | Spirochaetes | |
| GN02 | 4 (10) | p__GN02 | BD1-5 | |
| GN10 | 3 (4) | p__GN02 | BD1-5 | |
| GN11 | 3 (0) | p__GN02 | BD1-5 | |
| GN07 | 0 (4) | p__GN02 | | |
| GN08 | 0 (1) | p__GN02 | | |
| GN04 | 7 (7) | p__GN04 | TA06 | |
| GN12 | 0 (2) | p__GN04 | | |
| GN15 | 0 (2) | p__GN04 | | |
| GN13 | 0 (2) | p__GN13 | | |
| GN14 | 0 (2) | p__GN14 | | |
| GN06 | 1 (2) | p__KSB3 | Proteobacteria | |
| NC10 | 6 (27) | p__NC10 | Nitrospirae | Firmicutes |
| NKB19 | 4 (11) | p__NKB19 | BRC1 | |
| KB1 group | 7 (20) | p__OP1 | EM19 | |
| OP1 | 10 (38) | p__OP1 | EM19 | |
| MSBL6 | 0 (5) | p__OP1 | | |
| Sediment-3 | 0 (1) | p__OP1 | | |
| MSBL4 | 0 (3) | p__OP3 | | |
| kpj58rc | 0 (1) | p__OP3 | | |
| OP8 | 36 (390) | p__OP8 | Nitrospirae | |
| JS1 | 26 (89) | p__OP9 | OP9 | Firmicutes |
| VC2 | 0 (2) | p__Proteobacteria | | |
| Marine group | 0 (2) | p__SAR406 | | |
| SBR1093 | 9 (1) | p__SBR1093 | Proteobacteria | |
| SPAM | 8 (1) | p__SPAM | Nitrospirae | |
| GN05 | 4 (9) | p__Spirochaetes | Spirochaetes | |
| WWE1 | 3 (2) | p__Spirochaetes | Spirochaetes | |
| OP4 | 1 (1) | p__Spirochaetes | Spirochaetes | |
| MSBL2 | 0 (6) | p__Spirochaetes | | |
| KSA2 | 0 (1) | p__Spirochaetes | | |
| Sediment-4 | 0 (3) | p__Spirochaetes | | |
| Sediment-2 | 0 (2) | p__Spirochaetes | | |
| GN09 | 6 (4) | p__TG3 | Fibrobacteres | |
| TG3 | 41 (40) | p__TG3 | Fibrobacteres | |
| MSBL3 | 0 (1) | p__Verrucomicrobia | | |
| Sediment-1 | 0 (3) | p__WS3 | | |
| GN03 | 0 (27) | p__WS3 | | |
| KSB4 | 0 (1) | p__WS3 | WS3 | |
| WS5 | 1 (2) | p__WS5 | WCHB1-60 | |
| WWE3 | 116 (0) | p__WWE3 | OD1 | |
| ZB3 | 11 (0) | p__ZB3 | Cyanobacteria | |
| TG2 | 4 (0) | p__ZB3 | Cyanobacteria | |
| SAM | 1 (0) | Chimera | Chloroflexi | |
Abbreviation: NCBI, National Center for Biotechnology Information.
The following candidate phyla are not shown because they were consistent between NCBI, Greengenes, SILVA and RDP (where classifications were available): BRC1, KSB2, KSB3, OD1, OP11, OP3, OP6, OP7, OP9, SR1, TM6, TM7, WS1, WS2, WS3, WS4, WS6 and WYO.
Full-length representatives are ⩾1200 nt and partial-length representatives are <1200 nt; not all sequences are 16S rRNA. Phylogenetic placements based only on partial sequences should be considered probationary until full-length or genomic sequence data become available.
Name of phylum that encompasses the majority of the NCBI representative sequences, except where specifically noted. Gaps indicate no classification.
Not robustly supported as a monophyletic group in tree_408135 (jackknife <70%).
On the basis of the position of the single full-length representative after which the group was originally named, the 25 partial length representatives are not affiliated with the full-length sequence and belong to the Chlorobi.
On the basis of the longest representative of this proposed group (AF142890), the two shorter sequences are members of the Firmicutes.
One representative belongs to each phylum; AF142866—Spirochaetes, AF142828—SAR406.
Between Planctomycetes and Chloroflexi.