Jai Ram Rideout1, Yan He2, Jose A Navas-Molina3, William A Walters4, Luke K Ursell5, Sean M Gibbons6, John Chase7, Daniel McDonald8, Antonio Gonzalez9, Adam Robbins-Pianka8, Jose C Clemente10, Jack A Gilbert11, Susan M Huse12, Hong-Wei Zhou2, Rob Knight13, J Gregory Caporaso14.
Abstract
We present a performance-optimized algorithm, subsampled open-reference OTU picking, for assigning marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. This algorithm provides benefits over de novo OTU picking (clustering can be performed largely in parallel, reducing runtime) and closed-reference OTU picking (all reads are clustered, not only those that match a reference database sequence with high similarity). Because more of our algorithm can be run in parallel relative to "classic" open-reference OTU picking, it makes open-reference OTU picking tractable on massive amplicon sequence data sets (though on smaller data sets, "classic" open-reference OTU clustering is often faster). We illustrate that here by applying it to the first 15,000 samples sequenced for the Earth Microbiome Project (1.3 billion V4 16S rRNA amplicons). To the best of our knowledge, this is the largest OTU picking run ever performed, and we estimate that our new algorithm runs in less than 1/5 of the time that would be required for "classic" open-reference OTU picking. We show that subsampled open-reference OTU picking yields results that are highly correlated with those generated by "classic" open-reference OTU picking through comparisons on three well-studied datasets. An implementation of this algorithm is provided in the popular QIIME software package, which uses uclust for read clustering. All analyses were performed using QIIME's uclust wrappers, though we provide details (aided by the open-source code in our GitHub repository) that will allow implementation of subsampled open-reference OTU picking independently of QIIME (e.g., in a compiled programming language, where runtimes should be further reduced). Our analyses should generalize to other implementations of these OTU picking algorithms.
Finally, we present a comparison of parameter settings in QIIME's OTU picking workflows and make recommendations on settings for these free parameters to optimize runtime without reducing the quality of the results. These optimized parameters can vastly decrease the runtime of uclust-based OTU picking in QIIME.
Keywords: Bioinformatics; Microbial ecology; Microbiome; OTU picking; Qiime
Year: 2014 PMID: 25177538 PMCID: PMC4145071 DOI: 10.7717/peerj.545
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Figure 1. Schematic of the subsampled open-reference OTU picking algorithm.
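The schematic itself is not reproduced in this record. As a rough illustration only, the toy sketch below follows the four-step structure that, to our understanding, QIIME's open-reference workflow uses: closed-reference picking, de novo clustering of a random subsample of the failures to build new reference sequences, closed-reference picking of all failures against those new references, and de novo clustering of whatever still fails. That step structure, the `identity` function (a crude stand-in for a uclust alignment), the greedy clustering, the `New.CleanUp` prefix, and all function names are assumptions made for illustration; the optional prefilter step is omitted.

```python
import random

def identity(a, b):
    """Toy similarity: fraction of matching positions (not a real alignment)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / longest

def closed_reference(reads, refs, threshold):
    """Assign each read to the first reference at or above threshold."""
    otus, failures = {}, []
    for read_id, seq in reads.items():
        hit = next((rid for rid, rseq in refs.items()
                    if identity(seq, rseq) >= threshold), None)
        if hit is None:
            failures.append(read_id)
        else:
            otus.setdefault(hit, []).append(read_id)
    return otus, failures

def de_novo(reads, threshold, prefix):
    """Greedy centroid clustering; a read that fits no centroid becomes one."""
    centroids, otus = {}, {}
    for read_id, seq in reads.items():
        hit = next((cid for cid, cseq in centroids.items()
                    if identity(seq, cseq) >= threshold), None)
        if hit is None:
            hit = f"{prefix}{len(centroids)}"
            centroids[hit] = seq
        otus.setdefault(hit, []).append(read_id)
    return otus, centroids

def subsampled_open_reference(reads, refs, threshold=0.97,
                              subsample=0.001, seed=0):
    rng = random.Random(seed)
    # Step 1: closed-reference picking against the reference collection.
    otus, failed = closed_reference(reads, refs, threshold)
    # Step 2: cluster a random subsample of the failures de novo; the
    # resulting centroids act as new reference sequences.
    n = max(1, round(len(failed) * subsample)) if failed else 0
    sample = rng.sample(failed, n)
    _, new_refs = de_novo({r: reads[r] for r in sample},
                          threshold, "New.Reference")
    # Step 3: closed-reference picking of all failures against the new
    # references (each read is independent, so this step parallelizes).
    step3, still_failed = closed_reference(
        {r: reads[r] for r in failed}, new_refs, threshold)
    # Step 4: de novo clustering of the reads that still failed.
    step4, _ = de_novo({r: reads[r] for r in still_failed},
                       threshold, "New.CleanUp")
    otus.update(step3)
    otus.update(step4)
    return otus

# Made-up toy data, with a loose threshold so the short reads can cluster.
toy_refs = {"ref1": "AAAAAAAAAA"}
toy_reads = {"r1": "AAAAAAAAAA", "r2": "AAAAAAAAAT", "r3": "CCCCCCCCCC",
             "r4": "CCCCCCCCCG", "r5": "GGGGGGGGGG"}
toy_otus = subsampled_open_reference(toy_reads, toy_refs,
                                     threshold=0.9, subsample=0.34)
print(toy_otus)
```

In this sketch only steps 2 and 4 are serial, and step 2 operates on a small subsample, which is the intuition behind the parallelism claim in the abstract.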
Method definitions.
Definitions of the OTU picking methods being compared here, based on the abbreviations used throughout the paper. From here, we refer to each method by its abbreviation for simplicity. We note that both de novo (uc) and classic open-reference OTU picking (ucr) are accessed through QIIME’s pick_de_novo_otus.py command. ucr is applied when pick_otus:otu_picking_method uclust_ref is specified in the parameters file, and uc is applied when that option is absent. The exact command/parameter combinations used for each OTU picking run are provided in the study’s GitHub repository (see Data Availability).
| Abbreviation | Title | Command | max_ | max_ | step | word | prefilter_ | min_ | speed_ | Processors | reference_ | subsample_ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| uc | De novo | pick_de_novo_otus.py | 20 | 500 | 20 | 12 | NA | NA | slow | 1 | 0.97 | NA |
| ucr | Legacy open | pick_de_novo_otus.py | 20 | 500 | 20 | 12 | NA | NA | slow | 10 | 0.97 | NA |
| ucrC | Closed reference | pick_closed_reference_otus.py | 20 | 500 | 20 | 12 | NA | NA | slow | 10 | 0.97 | NA |
| ucrss | Subsampled open | pick_open_reference_otus.py | 20 | 500 | 20 | 12 | 0 | 1 | slow | 10 | 0.97 | 0.001 |
| ucrss_wfilter | Subsampled open | pick_open_reference_otus.py | 20 | 500 | 20 | 12 | 0.6 | 1 | slow | 10 | 0.97 | 0.001 |
| uc_fast | De novo, | pick_de_novo_otus.py | 1 | 8 | 8 | 8 | NA | NA | fast | 1 | 0.97 | NA |
| ucr_fast | Legacy open | pick_de_novo_otus.py | 1 | 8 | 8 | 8 | NA | NA | fast | 10 | 0.97 | NA |
| ucrC_fast | Closed reference, | pick_closed_reference_otus.py | 1 | 8 | 8 | 8 | NA | NA | fast | 10 | 0.97 | NA |
| ucrss_fast | Subsampled open | pick_open_reference_otus.py | 1 | 8 | 8 | 8 | 0 | 1 | fast | 10 | 0.97 | 0.001 |
| ucrss_wfilter_fast | Subsampled open | pick_open_reference_otus.py | 1 | 8 | 8 | 8 | 0.6 | 1 | fast | 10 | 0.97 | 0.001 |
| ucr_fast_O29_r82 | Legacy open | pick_de_novo_otus.py | 1 | 8 | 8 | 8 | 0 | 1 | fast | 29 | 0.82 | 0.001 |
| ucr_fast_O29_r97 | Legacy open | pick_de_novo_otus.py | 1 | 8 | 8 | 8 | 0 | 1 | fast | 29 | 0.97 | 0.001 |
| ucrss_fast_O29_r82 | Subsampled open | pick_open_reference_otus.py | 1 | 8 | 8 | 8 | 0 | 1 | fast | 29 | 0.82 | 0.001 |
| ucrss_fast_O29_r97 | Subsampled open | pick_open_reference_otus.py | 1 | 8 | 8 | 8 | 0 | 1 | fast | 29 | 0.97 | 0.001 |
| ucrss_fast_O29_s1 | Subsampled open | pick_open_reference_otus.py | 1 | 8 | 8 | 8 | 0 | 1 | fast | 29 | 0.97 | 0.1 |
Alpha diversity results.
Pearson correlation coefficients (r) of alpha diversity for (a) 88-soils PD, (b) moving-pictures PD, (c) whole-body PD, (d) 88-soils observed species, (e) moving-pictures observed species, and (f) whole-body observed species.
| (a) | uc | ucr | ucrC | ucrss | ucrss_wfilter | uc_fast | ucr_fast | ucrC_fast | ucrss_fast | ucrss_fast_wfilter |
|---|---|---|---|---|---|---|---|---|---|---|
| uc | 1 | 0.951 | 0.933 | 0.934 | 0.953 | 0.956 | 0.936 | 0.927 | 0.948 | 0.947 |
| ucr | 0.951 | 1 | 0.902 | 0.931 | 0.93 | 0.946 | 0.94 | 0.903 | 0.952 | 0.944 |
| ucrC | 0.933 | 0.902 | 1 | 0.894 | 0.909 | 0.905 | 0.914 | 0.978 | 0.902 | 0.911 |
| ucrss | 0.934 | 0.931 | 0.894 | 1 | 0.929 | 0.944 | 0.935 | 0.894 | 0.948 | 0.949 |
| ucrss_wfilter | 0.953 | 0.93 | 0.909 | 0.929 | 1 | 0.952 | 0.933 | 0.903 | 0.931 | 0.943 |
| uc_fast | 0.956 | 0.946 | 0.905 | 0.944 | 0.952 | 1 | 0.953 | 0.898 | 0.956 | 0.96 |
| ucr_fast | 0.936 | 0.94 | 0.914 | 0.935 | 0.933 | 0.953 | 1 | 0.914 | 0.95 | 0.952 |
| ucrC_fast | 0.927 | 0.903 | 0.978 | 0.894 | 0.903 | 0.898 | 0.914 | 1 | 0.902 | 0.903 |
| ucrss_fast | 0.948 | 0.952 | 0.902 | 0.948 | 0.931 | 0.956 | 0.95 | 0.902 | 1 | 0.962 |
| ucrss_fast_wfilter | 0.947 | 0.944 | 0.911 | 0.949 | 0.943 | 0.96 | 0.952 | 0.903 | 0.962 | 1 |
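As an illustration of how a single entry in such a matrix could be obtained, the following minimal sketch computes Pearson's r between per-sample alpha diversity values produced by two OTU picking methods. The function name and the PD values are hypothetical and are not taken from the study.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)

# Made-up PD values for the same five samples under two methods.
pd_method_a = [10.2, 14.8, 9.1, 20.5, 16.3]
pd_method_b = [10.9, 15.1, 8.7, 21.4, 15.8]

r = pearson_r(pd_method_a, pd_method_b)
print(f"Pearson r = {r:.3f}")
```

Repeating this for every pair of methods would fill a symmetric matrix with ones on the diagonal, as in the table above.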
Beta diversity results.
Mantel correlation coefficients (r) of beta diversity for (a) 88-soils unweighted UniFrac, (b) moving-pictures unweighted UniFrac, (c) whole-body unweighted UniFrac, (d) 88-soils weighted UniFrac, (e) moving-pictures weighted UniFrac, and (f) whole-body weighted UniFrac.
| (a) | uc | ucr | ucrC | ucrss | ucrss_wfilter | uc_fast | ucr_fast | ucrC_fast | ucrss_fast | ucrss_fast_wfilter |
|---|---|---|---|---|---|---|---|---|---|---|
| uc | NA | 0.935 | 0.908 | 0.944 | 0.942 | 0.939 | 0.945 | 0.909 | 0.943 | 0.941 |
| ucr | NA | NA | 0.915 | 0.94 | 0.945 | 0.934 | 0.942 | 0.918 | 0.944 | 0.949 |
| ucrC | NA | NA | NA | 0.917 | 0.91 | 0.926 | 0.913 | 0.95 | 0.917 | 0.92 |
| ucrss | NA | NA | NA | NA | 0.94 | 0.938 | 0.945 | 0.914 | 0.938 | 0.942 |
| ucrss_wfilter | NA | NA | NA | NA | NA | 0.934 | 0.943 | 0.907 | 0.942 | 0.941 |
| uc_fast | NA | NA | NA | NA | NA | NA | 0.938 | 0.92 | 0.939 | 0.941 |
| ucr_fast | NA | NA | NA | NA | NA | NA | NA | 0.909 | 0.946 | 0.947 |
| ucrC_fast | NA | NA | NA | NA | NA | NA | NA | NA | 0.917 | 0.924 |
| ucrss_fast | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.945 |
| ucrss_fast_wfilter | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
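For readers unfamiliar with the Mantel statistic, the sketch below shows one simple way to compute it: Pearson's r over the upper triangles of two distance matrices, with a permutation test that shuffles the sample order of one matrix. This is a generic, one-sided version written for illustration; it is not the implementation used in the study, and the function names and the tiny distance matrices are made up.

```python
import random
from math import sqrt

def _pearson(x, y):
    """Pearson's r between two equal-length, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) *
                      sum((b - my) ** 2 for b in y))

def _upper(matrix, order):
    """Upper-triangle entries of a matrix after reordering its samples."""
    n = len(order)
    return [matrix[order[i]][order[j]]
            for i in range(n) for j in range(i + 1, n)]

def mantel(d1, d2, permutations=999, seed=0):
    """Return (r, p) for two symmetric distance matrices of equal size."""
    order = list(range(len(d1)))
    x = _upper(d1, order)
    r_obs = _pearson(x, _upper(d2, order))
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        perm = order[:]
        rng.shuffle(perm)  # relabel samples in the second matrix only
        if _pearson(x, _upper(d2, perm)) >= r_obs:
            hits += 1
    # One-sided p-value; the observed arrangement counts as one permutation.
    return r_obs, (hits + 1) / (permutations + 1)

# Made-up distances among four samples; d_b is d_a scaled by two.
d_a = [[0.0, 0.2, 0.5, 0.9],
       [0.2, 0.0, 0.4, 0.7],
       [0.5, 0.4, 0.0, 0.3],
       [0.9, 0.7, 0.3, 0.0]]
d_b = [[2 * v for v in row] for row in d_a]

mantel_r, mantel_p = mantel(d_a, d_b)
print(f"Mantel r = {mantel_r:.3f}, p = {mantel_p:.3f}")
```

Because the statistic is symmetric in its two inputs, only the upper triangle of the results matrix needs to be reported, which is consistent with the NA entries above.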
Taxonomic profile results.
Pearson correlation coefficients (r) of taxonomic summaries for (a) 88-soils at phylum level, (b) 88-soils at genus level, (c) moving-pictures at phylum level, (d) moving-pictures at genus level, (e) whole-body at phylum level, and (f) whole-body at genus level.
| (a) | uc | ucr | ucrC | ucrss | ucrss_wfilter | uc_fast | ucr_fast | ucrC_fast | ucrss_fast | ucrss_fast_wfilter |
|---|---|---|---|---|---|---|---|---|---|---|
| uc | NA | 1 | 0.983 | 1 | 1 | 1 | 1 | 0.981 | 1 | 1 |
| ucr | NA | NA | 0.983 | 1 | 1 | 1 | 1 | 0.981 | 1 | 1 |
| ucrC | NA | NA | NA | 0.983 | 0.983 | 0.983 | 0.983 | 0.999 | 0.983 | 0.983 |
| ucrss | NA | NA | NA | NA | 1 | 1 | 1 | 0.981 | 1 | 1 |
| ucrss_wfilter | NA | NA | NA | NA | NA | 1 | 1 | 0.981 | 1 | 1 |
| uc_fast | NA | NA | NA | NA | NA | NA | 1 | 0.981 | 1 | 1 |
| ucr_fast | NA | NA | NA | NA | NA | NA | NA | 0.981 | 1 | 1 |
| ucrC_fast | NA | NA | NA | NA | NA | NA | NA | NA | 0.981 | 0.981 |
| ucrss_fast | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 |
| ucrss_fast_wfilter | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Figure 2. Runtime comparison.
Runtime comparison.
Comparison of runtimes (as seconds of wall time) for each method on each data set.
| Abbreviation | 88-soil | Moving-picture | Whole-body |
|---|---|---|---|
| uc | 1220 | 27748 | 1095 |
| ucr | 1358 | 46576 | 1082 |
| ucrC | 226 | 28572 | 388 |
| ucrss | 1493 | 47207 | 1212 |
| ucrss_wfilter | 1885 | 76061 | 2088 |
| uc_fast | 914 | 23510 | 489 |
| ucr_fast | 1052 | 19371 | 621 |
| ucrC_fast | 44 | 2428 | 68 |
| ucrss_fast | 1021 | 23710 | 707 |
| ucrss_fast_wfilter | 1525 | 52811 | 1661 |
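To make the effect of the "fast" parameter settings easier to read off, the short sketch below divides each default-parameter runtime by its "fast" counterpart, using the wall times from the table above. The pairing of each method with its "fast" variant is inferred from the abbreviations, and the variable and function names are illustrative.

```python
# Wall time in seconds: (88-soil, Moving-picture, Whole-body), from the table.
runtimes = {
    "uc": (1220, 27748, 1095),
    "ucr": (1358, 46576, 1082),
    "ucrC": (226, 28572, 388),
    "ucrss": (1493, 47207, 1212),
    "ucrss_wfilter": (1885, 76061, 2088),
    "uc_fast": (914, 23510, 489),
    "ucr_fast": (1052, 19371, 621),
    "ucrC_fast": (44, 2428, 68),
    "ucrss_fast": (1021, 23710, 707),
    "ucrss_fast_wfilter": (1525, 52811, 1661),
}

# Default-parameter method -> its "fast" variant (inferred from the names).
fast_variant = {
    "uc": "uc_fast",
    "ucr": "ucr_fast",
    "ucrC": "ucrC_fast",
    "ucrss": "ucrss_fast",
    "ucrss_wfilter": "ucrss_fast_wfilter",
}

def speedups(method):
    """Ratio of default to fast runtime for each of the three data sets."""
    slow = runtimes[method]
    fast = runtimes[fast_variant[method]]
    return tuple(s / f for s, f in zip(slow, fast))

for method in fast_variant:
    soil, moving, body = speedups(method)
    print(f"{method:14s} {soil:5.1f}x {moving:5.1f}x {body:5.1f}x")
```

A ratio above 1 means the "fast" settings reduced the wall time for that method and data set.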
Runtime comparisons (subsampled open-reference OTU picking variants).
Comparison of runtimes (as seconds of wall time) for subsampled and “classic” open-reference OTU picking methods with variations on the default parameters.
| Abbreviation | Moving-picture |
|---|---|
| ucr_fast_O29_r82 | 21737 |
| ucr_fast_O29_r97 | 16241 |
| ucrss_fast_O29_r82 | 17812 |
| ucrss_fast_O29_r97 | 16169 |
| ucrss_fast_O29_s1 | 14911 |
Significantly different OTUs by environmental metadata.
Top 10 OTUs identified as significantly different across (a) binned pH in 88-soils, (b) body site in moving-pictures, and (c) body site in whole-body.
| (a) OTU | Taxonomy | Test-statistic |
|---|---|---|
| 113212 | k__Bacteria;p__Acidobacteria;c__DA052;o__Ellin6513;f__;g__;s__ | 55.859 |
| 1123837 | k__Bacteria;p__Actinobacteria;c__Rubrobacteria;o__Rubrobacterales;f__Rubrobacteraceae; | 50.433 |
| New.Reference | k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__;g__;s__ | 49.172 |
| 252012 | k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Xanthomonadales;f__Sinobacteraceae;g__;s__ | 48.65 |
| 843189 | k__Bacteria;p__Acidobacteria;c__Solibacteres;o__Solibacterales;f__Solibacteraceae; | 47.006 |
| 1127423 | k__Bacteria;p__Acidobacteria;c__Acidobacteriia;o__Acidobacteriales;f__Koribacteraceae;g__;s__ | 43.87 |
| 1129210 | k__Bacteria;p__Acidobacteria;c__Acidobacteriia;o__Acidobacteriales;f__Koribacteraceae;g__;s__ | 43.804 |
| 831520 | k__Bacteria;p__Actinobacteria;c__Rubrobacteria;o__Rubrobacterales; | 43.625 |
| 1139779 | k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria | 41.863 |
| 804187 | k__Bacteria;p__Acidobacteria;c__[Chloracidobacteria];o__RB41;f__;g__;s__ | 41.151 |
OTU counts by environment.
Comparison of OTUs with closed-reference and open-reference OTU picking by biome in the Earth Microbiome Project dataset.
| Environmental biome | Average de novo | SD de novo | Average reference | SD reference | % novel | % error | Number of samples |
|---|---|---|---|---|---|---|---|
| Mangrove biome | 2,169 | 1,159 | 354 | 73 | 0.86 | 0.46 | 7 |
| Tropical humid forests | 2,398 | 260 | 397 | 35 | 0.858 | 0.094 | 26 |
| Tundra biome | 1,771 | 403 | 312 | 117 | 0.85 | 0.201 | 110 |
| Deserts and xeric | 3,917 | 127 | 707 | 15 | 0.847 | 0.028 | 7 |
| Taiga | 2,598 | 102 | 505 | 35 | 0.837 | 0.035 | 4 |
| Marine biome | 2,040 | 1,048 | 484 | 410 | 0.808 | 0.446 | 890 |
| Aquatic biome | 714 | 299 | 177 | 199 | 0.801 | 0.403 | 762 |
| Freshwater biome | 768 | 541 | 194 | 120 | 0.798 | 0.576 | 375 |
| Warm deserts and semideserts | 2,386 | 473 | 607 | 147 | 0.797 | 0.166 | 97 |
| Tropical and subtropical moist broadleaf forest biome | 3,072 | 125 | 846 | 18 | 0.784 | 0.032 | 2 |
| Temperate needle-leaf forests | 2,836 | 159 | 785 | 132 | 0.783 | 0.057 | 21 |
| Polar biome | 1,721 | 886 | 483 | 218 | 0.781 | 0.414 | 277 |
| Tropical and subtropical coniferous forest biome | 1,993 | 256 | 579 | 94 | 0.775 | 0.106 | 3 |
| Mixed island systems | 1,552 | 618 | 511 | 203 | 0.752 | 0.315 | 124 |
| Marginal sea | 1,795 | 325 | 611 | 225 | 0.746 | 0.164 | 7 |
| Temperate coniferous | 2,504 | 1,206 | 885 | 201 | 0.739 | 0.361 | 19 |
| Mediterranean forests, | 695 | 361 | 275 | 195 | 0.717 | 0.424 | 371 |
| Large river biome | 1,844 | 629 | 743 | 369 | 0.713 | 0.282 | 5 |
| Terrestrial biome | 2,714 | 222 | 1,138 | 163 | 0.705 | 0.072 | 627 |
| Nest of bird | 821 | 276 | 355 | 138 | 0.698 | 0.262 | 313 |
| Temperate broadleaf and | 1,910 | 491 | 879 | 235 | 0.685 | 0.195 | 14 |
| Temperate grasslands | 2,745 | 290 | 1,315 | 164 | 0.676 | 0.082 | 696 |
| Animal-associated habitat | 758 | 329 | 376 | 240 | 0.668 | 0.359 | 1036 |
| Mammalia-associated habitat | 973 | 357 | 583 | 222 | 0.625 | 0.27 | 1918 |
| Cold-winter (continental) | 847 | 210 | 551 | 215 | 0.606 | 0.215 | 102 |
| Temperate grasslands, | 1,688 | 272 | 1,497 | 275 | 0.53 | 0.121 | 85 |
| Human-associated habitat | 292 | 242 | 590 | 366 | 0.331 | 0.498 | 1597 |
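The "% novel" values in the table (reported as fractions) appear to be consistent with dividing the first Average column by the sum of the two Average columns, i.e., the de novo OTU count over the de novo plus reference OTU counts. That relationship is our inference from the numbers rather than a definition stated in this record; the sketch below checks it on a few rows copied from the table, with illustrative names.

```python
# (biome, first Average column, second Average column, reported "% novel")
rows = [
    ("Mangrove biome", 2169, 354, 0.86),
    ("Tropical humid forests", 2398, 397, 0.858),
    ("Taiga", 2598, 505, 0.837),
    ("Human-associated habitat", 292, 590, 0.331),
]

def novel_fraction(de_novo_avg, reference_avg):
    """Share of OTUs per sample that did not come from the reference."""
    return de_novo_avg / (de_novo_avg + reference_avg)

for biome, de_novo_avg, reference_avg, reported in rows:
    computed = novel_fraction(de_novo_avg, reference_avg)
    print(f"{biome:26s} computed={computed:.3f} reported={reported}")
```

Under this reading, human-associated samples are the only listed environment in which fewer than half of the OTUs are novel.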