Jonathan L Golob, Elisa Margolis, Noah G Hoffman, David N Fredricks.
Abstract
BACKGROUND: Microbiome studies commonly use 16S rRNA gene amplicon sequencing to characterize microbial communities. Errors introduced at multiple steps in this process can affect the interpretation of the data. Here we evaluate the accuracy of operational taxonomic unit (OTU) generation, taxonomic classification, and alpha- and beta-diversity measures for different settings in QIIME, MOTHUR and a pplacer-based classification pipeline, using a novel software package: DECARD.
Keywords: Classification; MOTHUR; Microbiome; Operational taxonomic unit; Optimization; QIIME; UniFrac
Year: 2017 PMID: 28558684 PMCID: PMC5450146 DOI: 10.1186/s12859-017-1690-0
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1. Estimated Real versus Synthetic Healthy Human Stool Microbiota. (a) Each column represents one sample. Each band represents one organism. The height of each band of color is proportional to the relative abundance of each sequence type. Taxonomically similar organisms are closer in color. Colors are by phylum (inspired by a Gram stain): blue and purple for Firmicutes; orange for Bacteroides; tan and pinks for Proteobacteria. Estimated relative abundances from real data are on the left, underlined in purple for healthy donor human stool microbiota and in blue for the Human Microbiome Project (HMP) samples; synthetic data are on the right, underlined in green. (b) The diversity of each microbiota (synthetic in green, healthy donor in purple and HMP in blue) for Hill numbers varying from −1 to 5, in 0.5 intervals. Solid lines are the mean, and dashed lines span the 95% confidence interval for the mean after 5000 bootstrap iterations (with replacement).
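The diversity profile in panel b can be illustrated with a minimal sketch. It assumes the standard definition of the Hill number of order q, D_q = (Σ p_i^q)^(1/(1−q)), with the q = 1 case taken as the exponential of Shannon entropy, and a simple percentile bootstrap for the mean. The function names `hill_number` and `bootstrap_mean_ci` are hypothetical and are not taken from DECARD.

```python
import math
import random


def hill_number(abundances, q):
    """Hill number of order q for one community (standard definition, assumed here)."""
    total = sum(abundances)
    # Drop zero counts so that negative orders q do not divide by zero.
    props = [a / total for a in abundances if a > 0]
    if abs(q - 1.0) < 1e-12:
        # The limit q -> 1 is the exponential of Shannon entropy.
        return math.exp(-sum(p * math.log(p) for p in props))
    return sum(p ** q for p in props) ** (1.0 / (1.0 - q))


def bootstrap_mean_ci(values, n_iter=5000, alpha=0.05, seed=0):
    """Mean of `values` with a percentile bootstrap confidence interval."""
    rng = random.Random(seed)
    n = len(values)
    # Resample the communities with replacement and record each resampled mean.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_iter))
    lower = means[int((alpha / 2) * n_iter)]
    upper = means[int((1 - alpha / 2) * n_iter) - 1]
    return sum(values) / n, lower, upper


# Orders from -1 to 5 in steps of 0.5, as in the caption.
orders = [q / 2 for q in range(-2, 11)]

# Hypothetical read counts for three communities (illustrative only).
communities = [[50, 30, 15, 5], [25, 25, 25, 25], [90, 5, 3, 2]]
profile = {
    q: bootstrap_mean_ci([hill_number(c, q) for c in communities], n_iter=500)
    for q in orders
}
```

A perfectly even community of four organisms has a Hill number of 4 at every order, which makes a convenient sanity check; uneven communities give lower values as q increases.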
Fig. 2. Assessment of OTU Performance. On the left are the various conditions tested. The first column specifies the pipeline, the second the strategy, the third the methodological details (e.g. reference set or algorithm used). Abbreviations: gg is GreenGenes; Sub is Subsetted OTU generation. (a) No sequencing error. (b) Simulated sequencing error
Species Level Classification
| Pipeline | OTU Strategy | OTU Algorithm | Reference | Undercalled (%) | Undercalled (Ranks off) | Correct (%) | Miscalled (%) | Miscalled (Ranks off) | Lost (%) |
|---|---|---|---|---|---|---|---|---|---|
| QIIME | Closed | | GreenGenes | 55.8 | 1 (1–4) | 18.9 | 22.3 | 4 (1–10) | 3.0 |
| QIIME | Sub | UClust | GreenGenes | 63.3 | 1 (1–3) | 12 | 24.5 | 4 (1–10) | 0.2 |
| QIIME | Sub | UClust | Silva | 77.1 | 1 (1–6) | 8.8 | 13.8 | 4 (1–14) | 0.2 |
| QIIME | De novo | UClust | GreenGenes | 61.4 | 1 (1–3) | 12.2 | 26.2 | 4 (1–10) | 0.1 |
| QIIME | De novo | UClust | Silva | 77.7 | 1 (1–3) | 8.7 | 13.4 | 4 (1–12) | 0.1 |
| QIIME | De novo | Swarm | GreenGenes | 61.5 | 1 (1–4) | 12.4 | 25.9 | 4 (1–10) | 0.1 |
| MOTHUR | Closed | | Silva/RDP | 54.6 | 1 (1–3) | 6.9 | 21.9 | 10 (4–12) | 16.6 |
| pplacer | De novo | Swarm | RDP | 68.2 | 1 (1–8) | 18.1 | 12.5 | 4 (1–10) | 1.2 |
Summary of Classification Performance. On the left are the various conditions tested. The first column specifies the pipeline, the second the OTU strategy, the third the methodological details (e.g. reference set or algorithm used). Table 1 is for species-level classification, Table 2 is for genus-level. Source organisms can be correctly called, undercalled (in the correct clade, but not resolved to the target species- or genus-level classification), or miscalled (placed down the wrong taxonomic clade). We present both the percentage in each category (correct, undercalled, and miscalled) and the median (min and max in parentheses) number of taxonomic ranks off for undercalled and miscalled source organisms
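The outcome categories described in this caption can be sketched as a comparison of two lineages, each listed from the root down. This is an illustrative sketch only: the function name `classify_outcome` is hypothetical, and the way "ranks off" is counted for miscalled organisms (ranks between the point of divergence and the target rank) is an assumption, since the caption does not define it.

```python
from typing import Optional, Sequence, Tuple


def classify_outcome(
    true_lineage: Sequence[str],
    predicted: Optional[Sequence[str]],
) -> Tuple[str, int]:
    """Return (category, ranks_off) for one source organism.

    `true_lineage` runs from the root to the target rank (species or genus).
    `predicted` is the pipeline's lineage, or None if the organism was lost.
    """
    if predicted is None:
        return ("lost", 0)

    target_depth = len(true_lineage)
    # Ignore any predicted ranks deeper than the target rank.
    pred = list(predicted)[:target_depth]

    # Count how many leading ranks agree.
    shared = 0
    for true_name, pred_name in zip(true_lineage, pred):
        if true_name != pred_name:
            break
        shared += 1

    if shared == len(pred):
        if shared == target_depth:
            return ("correct", 0)
        # Right clade, but the call stops short of the target rank.
        return ("undercalled", target_depth - shared)

    # Assumption: ranks off = ranks from the divergence point to the target.
    return ("miscalled", target_depth - shared)


truth = ["Bacteria", "Firmicutes", "Bacilli", "Lactobacillales",
         "Lactobacillaceae", "Lactobacillus", "Lactobacillus crispatus"]

genus_only = truth[:6]
wrong_family = truth[:4] + ["Streptococcaceae", "Streptococcus",
                            "Streptococcus mitis"]

results = [
    classify_outcome(truth, truth),
    classify_outcome(truth, genus_only),
    classify_outcome(truth, wrong_family),
    classify_outcome(truth, None),
]
```

Under these assumptions, a call that stops at genus for a species-level target is undercalled by one rank, while a call that diverges at the family level is miscalled by three.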
Genus Level Classification
| Pipeline | OTU Strategy | OTU Algorithm | Reference | Undercalled (%) | Undercalled (Ranks off) | Correct (%) | Miscalled (%) | Miscalled (Ranks off) | Lost (%) |
|---|---|---|---|---|---|---|---|---|---|
| QIIME | Closed | | GreenGenes | 24.0 | 1 (1–3) | 53.7 | 19.3 | 4 (1–9) | 3.0 |
| QIIME | Sub | UClust | GreenGenes | 27.8 | 1 (1–3) | 50.4 | 21.6 | 4 (1–9) | 0.2 |
| QIIME | Sub | UClust | Silva | 11.6 | 1 (1–5) | 74.8 | 13.4 | 4 (1–13) | 0.2 |
| QIIME | De novo | UClust | GreenGenes | 22.9 | 1 (1–3) | 53.5 | 23.5 | 4 (1–9) | 0.1 |
| QIIME | De novo | UClust | Silva | 12.0 | 1 (1–3) | 74.6 | 13.3 | 4 (1–11) | 0.1 |
| QIIME | De novo | Swarm | GreenGenes | 26.3 | 1 (1–3) | 50.5 | 23.1 | 5 (1–9) | 0.1 |
| MOTHUR | Closed | | Silva/RDP | 5 | 1 (1–2) | 56.5 | 21.9 | 9 (1–11) | 16.6 |
| pplacer | De novo | Swarm | RDP | 31.7 | 2 (1–7) | 55.2 | 11.9 | 4 (1–9) | 1.2 |
Classification outcomes by order for all pipelines
| Order | Correct (%) | Miscalled (%) | Undercalled (%) | Dropped (%) | Miscalled (Ranks off) | Undercalled (Ranks off) | Total (Ranks off) |
|---|---|---|---|---|---|---|---|
| Verrucomicrobiae | 57.4 | 0.0 | 35.6 | 7.1 | 0.0 | 0.5 | 0.5 |
| Lentisphaeria | 30.9 | 0.0 | 57.5 | 11.5 | 0.0 | 1.3 | 1.3 |
| Fusobacteriales | 23.9 | 8.0 | 51.0 | 17.2 | 0.6 | 0.5 | 1.2 |
| Acholeplasmatales | 22.6 | 13.1 | 54.3 | 10.0 | 0.8 | 1.0 | 1.8 |
| Pasteurellales | 19.5 | 36.9 | 34.0 | 9.6 | 1.8 | 0.4 | 2.2 |
| Bacteroidia | 15.8 | 8.2 | 67.1 | 8.9 | 0.6 | 0.8 | 1.3 |
| Lactobacillales | 13.9 | 11.9 | 65.7 | 8.6 | 0.8 | 0.8 | 1.7 |
| Selenomonadales | 12.9 | 13.8 | 65.3 | 8.0 | 0.6 | 0.9 | 1.5 |
| Mycoplasmatales | 12.3 | 65.9 | 10.7 | 11.2 | 5.5 | 0.6 | 6.1 |
| Clostridiales | 10.0 | 30.9 | 48.7 | 10.5 | 1.8 | 0.8 | 2.6 |
| Deltaproteobacteria | 9.1 | 7.5 | 74.8 | 8.6 | 0.6 | 1.3 | 1.9 |
| Burkholderiales | 8.0 | 29.1 | 56.0 | 6.9 | 1.2 | 0.6 | 1.8 |
| Actinobacteridae | 7.8 | 15.2 | 70.1 | 6.9 | 1.0 | 0.8 | 1.8 |
| Coriobacteridae | 7.7 | 11.1 | 74.0 | 7.3 | 0.9 | 1.6 | 2.5 |
| Erysipelotrichales | 7.4 | 2.9 | 81.7 | 7.9 | 0.2 | 1.5 | 1.8 |
| Enterobacteriales | 2.2 | 37.9 | 50.0 | 10.0 | 2.3 | 1.0 | 3.3 |
| Rhodospirillales | 0.0 | 85.3 | 0.0 | 14.7 | 5.2 | 0.0 | 5.2 |
Classification Performance by Order of Source Organism. Combined performance for all pipelines and settings, broken down by the order of the source organism. Correct are organisms classified correctly. Miscalled are organisms classified into the wrong clade. Undercalled are organisms placed into the correct clade, but at a higher rank than species
Fig. 3. True versus Estimated Shannon Diversity. In each scatter plot, the x-axis is the true Shannon diversity for a community, and the y-axis is the diversity estimated by the given pipeline. The top graph is true-versus-true, for comparison with the others. We used Spearman's correlation coefficients (inset, with 95% confidence intervals in parentheses) to test for monotonicity (consistency) of the estimates relative to the true values
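The comparison described in this caption can be sketched with the standard library alone. This is a minimal illustration, not the authors' code: the Shannon index uses the natural logarithm (an assumption, since the caption does not state the base), Spearman's coefficient is computed as the Pearson correlation of average ranks, and the function names are hypothetical. Confidence intervals are omitted.

```python
import math


def shannon(abundances):
    """Shannon diversity (natural log) from raw counts or relative abundances."""
    total = sum(abundances)
    return -sum((a / total) * math.log(a / total) for a in abundances if a > 0)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical true and estimated communities (illustrative counts only).
true_communities = [[97, 1, 1, 1], [70, 20, 5, 5], [40, 30, 20, 10], [25, 25, 25, 25]]
estimated_communities = [[99, 1], [80, 15, 5], [50, 30, 20], [30, 30, 20, 20]]

true_h = [shannon(c) for c in true_communities]
estimated_h = [shannon(c) for c in estimated_communities]
rho = spearman(true_h, estimated_h)
```

Because the coefficient depends only on rank order, a pipeline that systematically underestimates diversity can still score close to 1 as long as it orders the communities the same way as the truth, which is the sense of "consistency" used in the caption.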
Fig. 4. True versus Estimated Pairwise Distance. In each density plot, the x-axis is the true pairwise distance and the y-axis is the estimated pairwise distance between communities. We used Spearman's correlation coefficients (inset, with 95% confidence intervals in parentheses) to test for monotonicity (consistency) of the estimates relative to the true values. The left column is pairwise distance as calculated by weighted UniFrac. The right column is pairwise distance as calculated by double principal coordinate analysis (DPCoA)