David A W Soergel, Neelendu Dey, Rob Knight, Steven E Brenner.
Abstract
Microbial community profiling using 16S rRNA gene sequences requires accurate taxonomy assignments. 'Universal' primers target conserved sequences and amplify sequences from many taxa, but they provide variable coverage of different environments, and regions of the rRNA gene differ in taxonomic informativeness, especially when high-throughput short-read sequencing technologies (for example, 454 and Illumina) are used. We introduce a new evaluation procedure that provides an improved measure of expected taxonomic precision when classifying environmental sequence reads from a given primer. Applying this measure to thousands of combinations of primers and read lengths, simulating single-ended and paired-end sequencing, reveals that these choices greatly affect taxonomic informativeness. The most informative sequence region may differ by environment, partly due to variable coverage of different environments in reference databases. Using our Rtax method of classifying paired-end reads, we found that paired-end sequencing provides substantial benefit in some environments including human gut, but not in others. Optimal primer choice for short reads totaling 96 nt provides 82-100% of the confident genus classifications available from longer reads.
Year: 2012 PMID: 22237546 PMCID: PMC3379642 DOI: 10.1038/ismej.2011.208
Source DB: PubMed Journal: ISME J ISSN: 1751-7362 Impact factor: 10.302
Figure 1. Classification performance, at three levels of estimated accuracy (Supplementary Methods), of 6617 possible choices of amplification primer, sequencing primer and read length for single-ended reads from different environments (left portion of each panel) and 3061 possible choices of primer pair and read length for paired-end reads (right portion). Combinations of primers and read lengths are sorted on the x axis according to a measure of overall classification performance (Supplementary Methods). Stacked bars show the proportion of non-chimeric, non-unique sequences from each sample (not the proportion of the total sample) that can be classified to each taxonomic level for each combination. See Supplementary Figure S1 and Supplementary Table S1 for the excluded proportion of novel (and thus a priori unclassifiable) sequences in each sample. The top of each colored section indicates how much of the sample can be classified to the given level or better. 'Primer miss' (black) indicates sequences that did not match a given primer and so would not be amplified. Classifications more specific than the genus level are exceedingly rare and so are not visible here. Horizontal lines indicate the maximum proportion of each sample classifiable to the genus level using 96 nt or less of sequence (i.e., with an optimal choice of primer or primer pair; see also Supplementary Tables S4 and S5), showing that short reads from the best primers frequently, but not always, provide taxonomic information nearly matching that obtained from longer read lengths. Full-size versions of these panels are available in the supplementary data.
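The stacked-bar reading described in the Figure 1 caption ("classified to the given level or better") can be illustrated with a minimal sketch. This is not the authors' code or the Rtax implementation. The rank list, the label strings `primer_miss` and `unclassified`, and the function name `cumulative_classified` are all assumptions made for illustration. The input is assumed to hold, for one primer and read-length combination, the most specific rank assigned to each sequence that remains after novel sequences have been excluded.

```python
from collections import Counter

# Assumed rank order, least specific first (hypothetical; the paper's exact
# set of levels is not given in this section).
RANKS = ["phylum", "class", "order", "family", "genus"]


def cumulative_classified(assignments):
    """Return, for each rank, the fraction of sequences classified to that
    rank or to a more specific one, plus the primer-miss fraction.

    `assignments` holds one label per sequence: the most specific rank
    reached, "unclassified", or "primer_miss" (labels are hypothetical).
    """
    total = len(assignments)
    if total == 0:
        raise ValueError("no sequences to summarize")

    counts = Counter(assignments)
    fractions = {}
    running = 0
    # Walk from the most specific rank outward so each value accumulates
    # everything classified at that level or better.
    for rank in reversed(RANKS):
        running += counts[rank]
        fractions[rank] = running / total

    # Sequences that did not match the primer would not be amplified at all.
    fractions["primer_miss"] = counts["primer_miss"] / total
    return fractions


if __name__ == "__main__":
    example = ["genus", "genus", "family", "phylum",
               "primer_miss", "unclassified", "order", "genus"]
    for level, fraction in cumulative_classified(example).items():
        print(f"{level}: {fraction:.3f}")
```

In this sketch the value for `genus` corresponds to the top of the genus section of a bar, and the value for `phylum` to the top of the least specific colored section. The gap between that value and one, minus the primer-miss fraction, is the share that amplified but could not be classified.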
Genus classification rates for optimal choices of primers, grouped by total read length