Literature DB >> 23202435

Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing.

Nicholas A Bokulich1, Sathish Subramanian, Jeremiah J Faith, Dirk Gevers, Jeffrey I Gordon, Rob Knight, David A Mills, J Gregory Caporaso.   

Abstract

High-throughput sequencing has revolutionized microbial ecology, but read quality remains a considerable barrier to accurate taxonomy assignment and α-diversity assessment for microbial communities. We demonstrate that high-quality read length and abundance are the primary factors differentiating correct from erroneous reads produced by Illumina GAIIx, HiSeq and MiSeq instruments. We present guidelines for user-defined quality-filtering strategies, enabling efficient extraction of high-quality data and facilitating interpretation of Illumina sequencing results.

Entities:  

Mesh:

Year:  2012        PMID: 23202435      PMCID: PMC3531572          DOI: 10.1038/nmeth.2276

Source DB:  PubMed          Journal:  Nat Methods        ISSN: 1548-7091            Impact factor:   28.547


Recent advances in high-throughput, short-amplicon sequencing are revolutionizing efforts to describe microbial diversity within and across complex biomes, including the human body[1] and Earth’s biosphere[2]. The greater sequence coverage and lower per-base cost of the Illumina GAIIx, HiSeq, and MiSeq instruments aids this progress over more expensive, lower-coverage platforms. Given unknown sequencing error rates for amplicon data generated by these rapidly evolving instruments and changing chemistries, and the potential for PCR error introduced during short-amplicon sample preparation, quality-filtering is integral to high-throughput sequencing data analysis, removing erroneous reads that otherwise overestimate microbial diversity. “Denoising”[3,4], an approach employed to address this issue for amplicon sequencing by the 454 Life Sciences pyrosequencer, is specific to the 454 platform’s error profile, and does not scale to Illumina instruments, which generate tens (MiSeq) to hundreds (GAIIx) to thousands (HiSeq2000) of times more data per run. Illumina systems provide per-nucleotide Phred quality scores representing the probability that a given base call is erroneous. How best to incorporate these scores in marker-gene-based microbial ecology studies has not been thoroughly investigated, and stringent filtration that discards many reads has been recommended to avoid exaggerated diversity estimates[5]. Previous investigation into quality-filtering of Illumina data[6] focused on whole-genome sequencing applications, where error profiles are expected to differ from those in amplicon-sequencing runs. Additionally, the strategy discussed here differs from Illumina’s quality-filtering software CASAVA, which filters on a per-read basis, while our strategy works on a per-nucleotide basis, truncating reads at the position where their quality begins to drop. To illuminate the “black box” of Illumina amplicon quality-filtering, we tested the effects of different quality-filtering parameters on taxonomic classification, α– diversity, and β-diversity estimates using the Quantitative Insights into Microbial Ecology (QIIME)[7] pipeline (Table S1), outlined in Figure 1. To evaluate the effect of varying parameters in Figure 1, we tested four different ‘mock’ communities sequenced on the GAIIx (n = 1), HiSeq (n = 2), and MiSeq (n = 3) (Table S2). These comprised deliberately combined collections of 12 to 67 bacterial or fungal species whose genomes had been previously sequenced (Tables S3–S6). We also compared free-living and host-associated communities [5,8], representing samples with high β-diversity, and wine[9] and spontaneous beer fermentation-associated communities[10], representing samples with lower β-diversity, to evaluate the effects of filtering settings on β-diversity comparisons of different community types. Raw read counts and sample counts for all datasets are presented in Table S7.
Figure 1

Quality Filtration Process Flow in QIIME v1.5.0.

We evaluated how primary (p, q, r, and n) and secondary (c; see Figure 1 for definitions) quality-filtering parameters affect analyses using five separate evaluations, defined here. α– diversity and qualitative taxonomic composition, using mock communities, tests which settings best measure true community composition, minimizing spurious additional OTUs (Figure 2; Figure S1–S7).
Figure 2

α- and β-Diversity comparisons of mock community reads filtered using select phred_quality_score (q) settings (dataset 1). A, B: Family-level (A) and genus-level (B) taxon counts for mock communities filtered with variable (q) values at multiple OTU minimum abundance thresholds (c) (as %). Arrows below color key indicate expected genus- (blue) and family-level (red) taxon counts. C, D, E: Procrustes PCoA biplot of GAIIx weighted UniFrac distance comparing variation in (q). Comparison of (q) setting listed in bottom-right corner to (q) = 3. Top-right corner indicates Bonferroni-corrected p-value for Procrustes goodness of fit. Red, human feces; Magenta, mock community; Cyan, human skin; Dark cyan, human tongue; Blue, freshwater; Orange, freshwater creek; Purple, ocean; Yellow, estuary sediment; Pink, soil. All other settings represent defaults in both α- and β-diversity comparisons.

quantitative taxonomic composition, using defined mock communities, tests whether different settings introduce biases in specific taxa (Figure S8–S10). β-diversity, using mock communities, determines whether different settings cause significant differences in phylogenetic composition between identical communities (Table S8). β-diversity, using real communities, tests whether different settings affect our ability to differentiate sample types in principal coordinates (PCoA) plots (Table S9; Figure 2; Figure S11–S16). β-diversity, using real communities, tests whether differences detected between communities on different sequencing platforms are consistent with one another. Our results across Evaluations 1–5 reveal general patterns. First, parameters (p), (q) and (c) have a marked effect on α– diversity and estimates of taxonomic composition, but not (n) and (r) (Figure 2A–B; Figure S1–S7). The effects of (p) and (q) were variable across runs in an apparently platform-independent fashion (Figure S4–S5). All settings except high (q) values required secondary filtration with (c) to reach expected taxon counts, but the required level varied between 0.01% to 0.0001% of total sequences, dependant upon (q) and (p) settings.. Increasing (p) also decreased abundance of unassigned sequences and sequences given shallow taxonomic assignment. In all mock data sets studied, extreme settings of (q) and (p), but not (r) and (n), had a marked impacted on taxonomic distribution (Figure S8–S10). These results are described in detail in Supplementary Text Evaluations 1–2. Second, weighted UniFrac[11] distances between mock communities (see Supplementary Text Evaluation 3) were more robust to changes in parameter settings than unweighted UniFrac distances at low (c); however, these differences disappear at high (c). Thus as expected, differences in low-abundance OTUs have a larger impact on the unweighted metric. We note that any filtering strategies that remove low-abundance reads make it impossible to apply richness estimation metrics such as ACE and Chao1, which incorporate low-abundance read counts in their calculations. These metrics are unlikely to be accurate, however, if many of these reads actually represent sequencing errors. Because observations in microbial ecology are often based on PCoA of samples, we applied Procrustes analysis to compare PCoA plots from different parameter settings on both biological and mock communities. We found that conclusions derived from PCoA plots were also robust to differences in parameter settings: the only notable differences occurred at stringent (q), (p), and (c), which result in extreme levels of read filtration that blur the known major distinction between host-associated and free-living communities (Figure 2C–E; Figure S11–S12) and closely related wine and beer fermentation-associated communities (Figure S13–S16). These results are described in detail in Supplementary Text Evaluations 3–4. Finally, these observations generalize from the GAIIx to the HiSeq2000 and MiSeq platforms. The same β-diversity trends (e.g., separation in host-associated and free-living communities) were observed on all three platforms, and heavily decreased (p) (e.g., p = 0.25) and increased (q) (e.g., q ≥ 20) were the only factors that caused these sample types to erroneously cluster together on the HiSeq2000. These results are described in detail in Supplementary Text Evaluation 5. Strategic quality-filtering of Illumina sequence data is necessary to retrieve reliable assessments of α- and β-diversity and taxonomic distribution of microbial communities. These results confirm the need to apply appropriate (q), (p), and (c) filters to eliminate erroneous OTU assignments, regulating these parameters depending upon raw sequence length and quality to reach the necessary balance of stringency and sequencing depth. To calibrate optimal filtering settings, we highly recommend including a standardized mock community in each individual sequencing run. We believe this will be necessary for confident comparison of samples from multiple sequencing runs in order to normalize run-to-run PCR and sequencing error, but further work is needed to evaluate which factors (e.g., community composition and complexity) define optimal mock communities for filter calibration under different experimental conditions. For datasets where a mock community is not included for calibration, we recommend the conservative threshold of (c = 0.005%). Further work is also required to address the impact of filtering strategies on the analysis of paired-end reads. These results also enable users to process sequencing data under specific filtering conditions to support different downstream analyses. For example, users with a majority of high-quality, full-length sequences may wish to increase (q) and (p) in lieu of using (c), thereby retrieving only full-length sequences with low error rates, potentially increasing the discovery rate of rare OTUs (as sequence selection will be based on length and quality, not count). Alternatively, users with shorter reads or reads truncated by early low-quality base calls may wish to increase (r), lower (p), and use a higher (c) threshold. In this way, lower-quality but taxonomically useful reads will be retained, and reliable sequences selected based on abundance rather than error probability. Other users may be more interested in maximizing read count for implementation of machine-learning tools, identifying OTUs with significantly different abundances across metadata categories or treatment regimes, or jackknifing/permutational tests for β-diversity, all of which benefit from increased sample sizes. In this scenario, reads should be filtered using primary filters (q) and (p) instead of (c), which greatly reduces read count. Experienced users can adjust their filtering parameters to control the primary source of filtration (read length, error probability, or abundance) based on the idiosyncrasies of each sequencing run and the demands of the downstream analysis. These conclusions serve as a benchmark for informed quality-filtering of Illumina amplicon sequence data. With these guidelines, users can confidently extract more, higher-quality sequences and decrease OTU filtration thresholds (c), increasing acuity for rare OTU discrimination and β-diversity comparisons. The Earth Microbiome Project (EMP)[2] is adopting these guidelines for routine analysis of all SSU rRNA gene sequencing on the Illumina HiSeq and MiSeq systems, facilitating deeper, more efficient insight into how microbial diversity varies over spatial and temporal scales across our planet. The conclusions drawn from this study are conserved across the HiSeq2000, MiSeq, and GAIIx systems, supporting confident cross-platform data handling. In addition, we recommend new default settings for Illumina processing in QIIME (r = 3; p = 0.75 total read length; q = 3; n = 0; c = 0.005%; see Supplementary Text: Default recommendations for QIIME parameters for additional details), incorporated in the recent release of QIIME 1.5.0. Finally, although quality parameters tested here were evaluated using QIIME, these conclusions are relevant to Illumina amplicon quality-filtering across all bioinformatics pipelines for improved diversity estimates in all taxa and environments. Table S1. Parameter Values and Cross-interactions Tested in This Study. Table S2. Mock Communities Analyzed in This Study. Table S3. Composition of Mock Community A. Table S4. Composition of Mock Community B. Table S5. Composition of Mock Community C. Table S6. Composition of Mock Community D. Table S7. Raw Sequence Counts and Sample Counts for Datasets Analyzed in This Study. Table S8. β-Diversity Comparisons of Mock Communities Between Filtration Settings. Table S9. β-Diversity Comparisons of Host-associated and Free-living Communities Between Filtration Settings. Figure S1. α-Diversity comparison of mock community reads filtered using select quality parameter settings (dataset 1). Family-level taxon counts (z-axis) for mock communities filtered with variable split_library_fastq.py parameters, showing all cross-interactions between minimum per-read length as % of total read length (p) (x-axes), OTU abundance threshold (c) (as %; y-axes), maximum bad run length (r) (increasing values left to right), and maximum ambiguous base-call count (n) (increasing values top to bottom) at multiple OTU minimum abundance thresholds (c). Red arrow below color key indicates expected family-level taxon count. Figure S2. α-Diversity comparison of mock community reads filtered using select quality parameter settings (dataset 1). Genus-level taxon counts (z-axis) for mock communities filtered with variable split_library_fastq.py parameters, showing all cross-interactions between minimum per-read length as % total read length (p) (x-axes), OTU abundance threshold (c) (as %; y-axes), maximum bad run length (r) (increasing values left to right), and maximum ambiguous base-call count (n) (increasing values top to bottom) at multiple OTU minimum abundance thresholds (c). Red arrow below the color key indicates expected genus-level taxon count. Figure S3. α-Diversity comparison of mock community reads filtered using select quality parameter settings (dataset 1). Observed counts (z-axis) of OTUs clustered at 97% similarity for mock communities filtered with variable split_library_fastq.py parameters, showing all cross-interactions between minimum per-read length as % total read length (p) (x-axes), OTU abundance threshold (c) (as absolute values) (y-axes), maximum bad run length (r) (increasing values left to right), and maximum ambiguous base-call count (n) (increasing values top to bottom) at multiple OTU minimum abundance thresholds (c). Red arrow below the color key indicates expected species count. Figure S4. Interaction of Phred score threshold (q) and OTU abundance threshold (c) (as %) on observed family and genus-level taxa (datasets 2–6). NA = no OTU abundance threshold was applied. Heading indicates platform type, read length, and expected family-and genus-level counts. Figure S5. Interaction of minimum per-read length (p) (as % total length) and OTU abundance threshold (c) (as %) on observed family and genus-level taxa (datasets 2–6). NA = no OTU abundance threshold applied. Heading indicates platform type, read length, and expected family- and genus-level counts. Figure S6. Interaction of maximum ambiguous base-call count (n) and OTU abundance threshold (c) (as %) on observed family and genus-level taxa (datasets 2–6). NA = no OTU abundance threshold applied. Heading indicates platform type, read length, and expected family- and genus-level counts. Figure S7. Interaction of maximum bad run length (r) and OTU abundance threshold (c) (as %) on observed family- and genus-level taxa (datasets 2–6). NA = no OTU abundance threshold was applied. Heading indicates type of sequencing platform used, read length, and expected family- and genus-level counts. Figure S8. Family-level taxonomic distribution of mock communities filtered using variable (p) (as % total length), (c), and (q) thresholds (dataset 1). All other parameters were held constant. E, expected values. *default parameters. Figure S9. Family-level taxonomic Distribution of mock communities filtered using variable (n) and (r) thresholds (dataset 1). All other parameters were held constant. E, expected values. Figure S10. Taxonomic key to Figures S8 and S9. Figure S11. Procrustes PCoA of GAIIx weighted UniFrac distance biplot comparing variation in OTU abundance threshold (c), using Illumina GAIIx data (dataset 1). Comparison of (c) setting listed in bottom-right corner to dataset without (c) filtering. Top-right corner indicates Bonferroni-corrected p-value for Procrustes goodness of fit. Red, feces; Magenta, mock community; Cyan, skin; Dark cyan, tongue; Blue, freshwater; Orange, freshwater creek; Purple, ocean; Yellow, estuary sediment; Pink, soil. Figure S12. Procrustes PCoA weighted UniFrac distance biplot comparing variation in minimum per-read length (p) (as % total read length), using Illumina GAIIx data (dataset 1). Comparison of setting listed in bottom-right corner to (p) = 0.75. Top-right corner indicates Bonferroni-corrected p-value for Procrustes goodness of fit. Red, human feces; Magenta, mock community; Cyan, human skin; Dark cyan, human tongue; Blue, freshwater; Orange, freshwater creek; Purple, ocean; Yellow, estuary sediment; Pink, soil. Figure S13. Filtering minimally impacts abundance-weighted β-diversity comparisons of gradient communities. Weighted UniFrac PCoA plots comparing bacterial communities of spontaneous beer fermentations (dataset 9) with different filter settings. Row label indicates variable filter setting tested. Labels above plots indicate parameter setting. All other variables held constant to QIIME v1.4.0 defaults (r = 1; n = 0; p = 75 bp; q = 3). Note that (p) represents an absolute value (in nucleotides) in this figure, 150 nt being the maximum read length. Figure S14. Filtering minimally impacts unweighted β-diversity comparisons of gradient communities. Unweighted UniFrac PCoA plots comparing bacterial communities of spontaneous beer fermentations (dataset 9) with different filter settings. Row label indicates variable filter setting tested. Labels above plots indicate parameter setting. All other variables held constant to QIIME v1.4.0 defaults (r = 1; n = 0; p = 75 bp; q = 3). Note that (p) represents an absolute value (in nucleotides) in this figure, 150 nt being the maximum read length. Figure S15. Filtering minimally impacts abundance-weighted β-diversity comparisons of cluster communities. Weighted UniFrac PCoA plots comparing bacterial communities of inoculated (blue) and uninoculated (red) wine fermentations (dataset 10) with different filter settings. Row label indicates variable filter setting tested. Labels above plots indicate parameter setting. All other variables were held constant to QIIME v1.4.0 defaults (r = 1; n = 0; p = 75 bp; q = 3). Note that (p) represents an absolute value (in nucleotides) in this figure, 150 nt being the maximum read length. Figure S16. Filtering minimally impacts unweighted β-diversity comparisons of cluster communities. Unweighted UniFrac PCoA plots comparing bacterial communities of inoculated (blue) and uninoculated (red) wine fermentations (dataset 10) with different filter settings. Row label indicates variable filter setting tested, labels above plots indicate parameter setting. All other variables held constant to QIIME v1.4.0 defaults (r = 1; n = 0; p = 75 bp; q = 3). Note that (p) represents an absolute value (in nucleotides) in this figure, 150 nt being the maximum read length.
  15 in total

1.  Search and clustering orders of magnitude faster than BLAST.

Authors:  Robert C Edgar
Journal:  Bioinformatics       Date:  2010-08-12       Impact factor: 6.937

2.  Organismal, genetic, and transcriptional variation in the deeply sequenced gut microbiomes of identical twins.

Authors:  Peter J Turnbaugh; Christopher Quince; Jeremiah J Faith; Alice C McHardy; Tanya Yatsunenko; Faheem Niazi; Jason Affourtit; Michael Egholm; Bernard Henrissat; Rob Knight; Jeffrey I Gordon
Journal:  Proc Natl Acad Sci U S A       Date:  2010-04-02       Impact factor: 11.205

3.  Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample.

Authors:  J Gregory Caporaso; Christian L Lauber; William A Walters; Donna Berg-Lyons; Catherine A Lozupone; Peter J Turnbaugh; Noah Fierer; Rob Knight
Journal:  Proc Natl Acad Sci U S A       Date:  2010-06-03       Impact factor: 11.205

4.  Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB.

Authors:  T Z DeSantis; P Hugenholtz; N Larsen; M Rojas; E L Brodie; K Keller; T Huber; D Dalevi; P Hu; G L Andersen
Journal:  Appl Environ Microbiol       Date:  2006-07       Impact factor: 4.792

5.  Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy.

Authors:  Qiong Wang; George M Garrity; James M Tiedje; James R Cole
Journal:  Appl Environ Microbiol       Date:  2007-06-22       Impact factor: 4.792

6.  Rapidly denoising pyrosequencing amplicon reads by exploiting rank-abundance distributions.

Authors:  Jens Reeder; Rob Knight
Journal:  Nat Methods       Date:  2010-09       Impact factor: 28.547

7.  Brewhouse-resident microbiota are responsible for multi-stage fermentation of American coolship ale.

Authors:  Nicholas A Bokulich; Charles W Bamforth; David A Mills
Journal:  PLoS One       Date:  2012-04-18       Impact factor: 3.240

8.  Ultra-high-throughput microbial community analysis on the Illumina HiSeq and MiSeq platforms.

Authors:  J Gregory Caporaso; Christian L Lauber; William A Walters; Donna Berg-Lyons; James Huntley; Noah Fierer; Sarah M Owens; Jason Betley; Louise Fraser; Markus Bauer; Niall Gormley; Jack A Gilbert; Geoff Smith; Rob Knight
Journal:  ISME J       Date:  2012-03-08       Impact factor: 10.302

9.  Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and genome analyzer systems.

Authors:  André E Minoche; Juliane C Dohm; Heinz Himmelbauer
Journal:  Genome Biol       Date:  2011-11-08       Impact factor: 13.583

10.  Human gut microbiome viewed across age and geography.

Authors:  Tanya Yatsunenko; Federico E Rey; Mark J Manary; Indi Trehan; Maria Gloria Dominguez-Bello; Monica Contreras; Magda Magris; Glida Hidalgo; Robert N Baldassano; Andrey P Anokhin; Andrew C Heath; Barbara Warner; Jens Reeder; Justin Kuczynski; J Gregory Caporaso; Catherine A Lozupone; Christian Lauber; Jose Carlos Clemente; Dan Knights; Rob Knight; Jeffrey I Gordon
Journal:  Nature       Date:  2012-05-09       Impact factor: 49.962

View more
  978 in total

1.  Ectomycorrhizal fungal spore bank recovery after a severe forest fire: some like it hot.

Authors:  Sydney I Glassman; Carrie R Levine; Angela M DiRocco; John J Battles; Thomas D Bruns
Journal:  ISME J       Date:  2015-10-16       Impact factor: 10.302

2.  Characterization of the stromatolite microbiome from Little Darby Island, The Bahamas using predictive and whole shotgun metagenomic analysis.

Authors:  Giorgio Casaburi; Alexandrea A Duscher; R Pamela Reid; Jamie S Foster
Journal:  Environ Microbiol       Date:  2015-12-10       Impact factor: 5.491

3.  Fructose malabsorption induces cholecystokinin expression in the ileum and cecum by changing microbiota composition and metabolism.

Authors:  Xufei Zhang; Alexandra Grosfeld; Edek Williams; Daniel Vasiliauskas; Sharon Barretto; Lorraine Smith; Mahendra Mariadassou; Catherine Philippe; Fabienne Devime; Chloé Melchior; Guillaume Gourcerol; Nathalie Dourmap; Nicolas Lapaque; Pierre Larraufie; Hervé M Blottière; Christine Herberden; Philippe Gerard; Jens F Rehfeld; Ronaldo P Ferraris; J Christopher Fritton; Sandrine Ellero-Simatos; Veronique Douard
Journal:  FASEB J       Date:  2019-04-02       Impact factor: 5.191

4.  Facility-specific "house" microbiome drives microbial landscapes of artisan cheesemaking plants.

Authors:  Nicholas A Bokulich; David A Mills
Journal:  Appl Environ Microbiol       Date:  2013-06-21       Impact factor: 4.792

5.  Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform.

Authors:  James J Kozich; Sarah L Westcott; Nielson T Baxter; Sarah K Highlander; Patrick D Schloss
Journal:  Appl Environ Microbiol       Date:  2013-06-21       Impact factor: 4.792

6.  Analysis of multiple tsetse fly populations in Uganda reveals limited diversity and species-specific gut microbiota.

Authors:  Emre Aksoy; Erich L Telleria; Richard Echodu; Yineng Wu; Loyce M Okedi; Brian L Weiss; Serap Aksoy; Adalgisa Caccone
Journal:  Appl Environ Microbiol       Date:  2014-05-09       Impact factor: 4.792

7.  Microbial community composition and diversity in Caspian Sea sediments.

Authors:  Nagissa Mahmoudi; Michael S Robeson; Hector F Castro; Julian L Fortney; Stephen M Techtmann; Dominique C Joyner; Charles J Paradis; Susan M Pfiffner; Terry C Hazen
Journal:  FEMS Microbiol Ecol       Date:  2014-12-05       Impact factor: 4.194

8.  Characterization of digestate microbial community structure following thermophilic anaerobic digestion with varying levels of green and food wastes.

Authors:  Jesus D Fernandez-Bayo; Christopher W Simmons; Jean S VanderGheynst
Journal:  J Ind Microbiol Biotechnol       Date:  2020-10-30       Impact factor: 3.346

9.  Supplementation with dairy matrices impacts on homocysteine levels and gut microbiota composition of hyperhomocysteinemic mice.

Authors:  Paola Zinno; Vincenzo Motta; Barbara Guantario; Fausta Natella; Marianna Roselli; Cristiano Bello; Raffaella Comitato; Domenico Carminati; Flavio Tidona; Aurora Meucci; Paola Aiello; Giuditta Perozzi; Fabio Virgili; Paolo Trevisi; Raffaella Canali; Chiara Devirgiliis
Journal:  Eur J Nutr       Date:  2019-01-30       Impact factor: 5.614

Review 10.  Toward Accurate and Quantitative Comparative Metagenomics.

Authors:  Stephen Nayfach; Katherine S Pollard
Journal:  Cell       Date:  2016-08-25       Impact factor: 41.582

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.