Orna Mizrahi-Man, Emily R Davenport, Yoav Gilad.
Abstract
Massively parallel high-throughput sequencing technologies allow us to interrogate the microbial composition of biological samples at unprecedented resolution. The typical approach is to perform high-throughput sequencing of 16S rRNA genes, which are then taxonomically classified based on similarity to known sequences in existing databases. Current technologies pose a predicament, though: although they enable deep coverage of samples, they are limited in the length of sequence they can produce. As a result, high-throughput studies of microbial communities often do not sequence the entire 16S rRNA gene. The challenge is to obtain a reliable representation of bacterial communities through taxonomic classification of short 16S rRNA gene sequences. In this study we explored properties of different study designs and developed specific recommendations for effective use of short-read sequencing technologies for the purpose of interrogating bacterial communities, with a focus on classification using naïve Bayesian classifiers. To assess precision and coverage of each design, we used a collection of ∼8,500 manually curated 16S rRNA gene sequences from cultured bacteria and a set of over one million bacterial 16S rRNA gene sequences retrieved from environmental samples, respectively. We also tested different configurations of taxonomic classification approaches using short-read sequencing data, and provide recommendations for the optimal choice of the relevant parameters. We conclude that with a judicious selection of the sequenced region and the corresponding choice of a suitable training set for taxonomic classification, it is possible to explore bacterial communities at great depth using current technologies, with only a minimal loss of taxonomic resolution.
Year: 2013 PMID: 23308262 PMCID: PMC3538547 DOI: 10.1371/journal.pone.0053608
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Table 1. Training sets used for the naïve Bayesian classification of bacterial 16S rRNA gene sequences.
| Abbreviation | Description | Sequence Database | Underlying Taxonomy |
| RDP TS6 | RDP classifier training set v.6 (default for v. 2.3 of the RDP classifier) | 8,127 bacterial and 295 archaeal sequences | Based on “The Taxonomic Outline of Bacteria and Archaea” (TOBA) 7.7 |
| LTP | Bacterial subset of “The Living Tree Project” v. 106 | 8,494 bacterial sequences | “List of Prokaryotic names with Standing in Nomenclature (LPSN; |
| unfiltered RDP | All bacterial isolates in RDP database | 31,334 non-redundant bacterial sequences | Based on “The Taxonomic Outline of Bacteria and Archaea” (TOBA) 7.7 |
| filtered NCBI | All bacterial isolates in RDP database, filtered for annotation quality | 21,240 non-redundant bacterial sequences | NCBI taxonomy |
Except for the ‘RDP TS6’ training set, which always trains on the full sequence, numbers are only for the testing of 100 nt single-reads from the V4 region. For the three other training sets, which train only on the region to be classified, the number of sequences reflects both the number of sequences covering this region (all three training sets) and its degree of redundancy (‘unfiltered RDP’ and ‘filtered NCBI’).
The numbers are for the ‘original non-redundant training set’ (see Methods section ‘Leave k out classification testing’); numbers for each leave-k-out iteration may vary slightly.
Figure 1. Performance of different training sets in the classification of 100 nt reads from the V4 amplicon.
Each panel compares the performance of the training sets (described in Table 1) for a different rank. We used the results of leave-k-out tests classifying the LTP sequences to determine confidence score thresholds for a set of desired false prediction rate (FPR) values (x axis), so that the FPR would be at most the desired value. We then used these thresholds to calculate the classification coverage of sequences from environmental (uncultured) bacteria that corresponds to the desired FPR (y axis).
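The threshold-then-coverage procedure in this caption can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes that the FPR is the fraction of predictions at or above a threshold that turn out to be wrong, and that candidate thresholds are scanned in steps of 5. The function names `threshold_for_fpr` and `coverage` are hypothetical.

```python
def threshold_for_fpr(scores, correct, max_fpr, candidates=range(0, 101, 5)):
    """Return the lowest candidate threshold whose FPR is at most max_fpr.

    scores  -- confidence scores of predictions on labelled test sequences
    correct -- booleans, True where the prediction matched the known taxon
    Assumption: FPR = wrong predictions / all predictions kept at the threshold.
    Returns None if no candidate threshold satisfies the target.
    """
    for t in sorted(candidates):
        kept = [ok for s, ok in zip(scores, correct) if s >= t]
        if not kept:
            continue  # nothing is predicted at this threshold
        fpr = sum(1 for ok in kept if not ok) / len(kept)
        if fpr <= max_fpr:
            return t
    return None


def coverage(scores, threshold):
    """Fraction of (unlabelled) sequences classified at or above the threshold."""
    return sum(1 for s in scores if s >= threshold) / len(scores)


# Toy labelled data standing in for leave-k-out results (made-up numbers).
test_scores = [99, 95, 90, 80, 60, 40]
test_correct = [True, True, True, False, True, False]
ct = threshold_for_fpr(test_scores, test_correct, max_fpr=0.05)

# Toy scores standing in for environmental sequences (made-up numbers).
env_scores = [100, 90, 50, 30]
print(ct, coverage(env_scores, ct))
```

The threshold is fitted on sequences with known taxonomy and then applied unchanged to the environmental sequences, which is why the two steps are kept as separate functions.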
Comparison of classification coverage of V4 reads from fecal samples among different training sets.
| | Genus | | | Family | | | Order | | | Class | | | Phylum | | |
| Training set | DB | A | B | DB | A | B | DB | A | B | DB | A | B | DB | A | B |
| RDP TS6 | 41.7 | 46.2 | 29.6 | 64.2 | 73.8 | 59.0 | 84.1 | 95.8 | 79.3 | | | | 93.1 | 99.7 | 86.3 |
| LTP | 42.6 | 44.3 | 30.8 | 64.8 | 66.1 | 54.5 | 80.5 | 94.5 | 75.6 | 88.7 | 98.7 | | 91.8 | 99.8 | 93.4 |
| Unfiltered RDP | | | | 71.2 | | | | | | 87.1 | 96.7 | 82.3 | 93.9 | | |
| Filtered NCBI | 41.9 | 46.8 | 38.8 | | 78.2 | 61.4 | | 96.4 | 79.2 | 90.7 | 97.5 | 85.6 | | | |
We used each of the four training sets to classify single 100 bp reads excised from 16S rRNA gene sequences of environmental (uncultured) bacteria in the RDP database (DB), as well as single 100 bp reads from the same region sequenced from two fecal samples: A (6,298,382 sequences) and B (3,452,321 sequences). We then computed coverage for each of the ranks (phylum, class, order, family and genus), using per-rank confidence score thresholds that would ensure an FPR of at most 5%. The highest coverage in each column is underlined.
The confidence score threshold for these cases was lower than that of one or more higher levels, so a sequence could be classified at the current level but not at the higher taxonomic levels. We found that the classification of such sequences is associated with a high error rate, and our recommendation is to exclude them. We have therefore adjusted coverage accordingly.
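One way to picture this exclusion rule is the sketch below. It assumes each read carries one confidence score per rank, and it counts a read as classified at a rank only if it also passes the thresholds at every higher rank. The helper names are hypothetical, and the example thresholds are illustrative values borrowed from the V4 100 nt single-read row of the recommended designs table.

```python
# Ranks ordered from highest (most general) to lowest (most specific).
RANKS = ["phylum", "class", "order", "family", "genus"]


def classified_ranks(conf, thresholds):
    """Ranks at which a read counts as classified.

    A rank is kept only if the read meets the threshold at that rank and at
    every rank above it; the walk stops at the first failure.
    """
    kept = []
    for rank in RANKS:
        if conf[rank] < thresholds[rank]:
            break
        kept.append(rank)
    return kept


def adjusted_coverage(reads, thresholds, rank):
    """Fraction of reads classified at `rank` after the exclusion rule."""
    n = sum(1 for conf in reads if rank in classified_ranks(conf, thresholds))
    return n / len(reads)


# Illustrative thresholds: the order threshold (60) is lower than the class
# threshold (80), which is exactly the situation the footnote describes.
thresholds = {"phylum": 65, "class": 80, "order": 60, "family": 80, "genus": 90}

reads = [
    # Passes order (65 >= 60) but fails class (70 < 80): excluded at order.
    {"phylum": 100, "class": 70, "order": 65, "family": 40, "genus": 20},
    # Passes every rank down to family.
    {"phylum": 100, "class": 95, "order": 95, "family": 85, "genus": 50},
]
print(adjusted_coverage(reads, thresholds, "order"))
```

Without the rule, both toy reads would count as classified at the order level; with it, only the second does.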
Figure 2. Classification performance of different experimental designs.
Each panel compares performance of different regions for a different combination of rank (genus or family) and sequencing strategy (100/120 nt single/paired-end reads). We used the results of leave-k-out tests classifying the LTP sequences to determine confidence score thresholds for a set of desired false prediction rate (FPR) values (x axis), so that the FPR would be at most the desired value (Tables S4, S5, S6, S7, S8, S9, and S10). We then used these thresholds to calculate the classification coverage of sequences from environmental (uncultured) bacteria that corresponds to the desired FPR (y axis). Figure S5 compares the performance of different regions across the same sequencing configurations for the ranks order, class, and phylum.
Recommended experimental designs.
| | | Primer pair | | Genus | | Family | | Order | | Class | | Phylum | |
| Sequencing configuration | Region | Forward | Reverse | CT | Coverage | CT | Coverage | CT | Coverage | CT | Coverage | CT | Coverage |
| 100 nt single | V3 | F343 | R534 | 95 | 54 | 95 | 68 | 60 | 92 | 50 | 95 | 60 | 96 |
| 100 nt single | V4 | F515 | R806 | 90 | 49 | 80 | 74 | 60 | 88 | 80 | 90 | 65 | 95 |
| 120 nt single | V3 | F343 | R534 | 95 | 58 | 90 | 77 | 70 | 91 | 55 | 95 | 50 | 97 |
| 120 nt single | V4 | F515 | R806 | 90 | 59 | 80 | 79 | 60 | 91 | 50 | 95 | 55 | 96 |
| 100 nt paired | V3 | F343 | R534 | 95 | 60 | 95 | 73 | 75 | 92 | 55 | 96 | 60 | 96 |
| 100 nt paired | V4 | F515 | R806 | 95 | 62 | 85 | 84 | 70 | 93 | 70 | 95 | 45 | 98 |
| 120 nt paired | V4 | F515 | R806 | 95 | 67 | 85 | 85 | 80 | 92 | 80 | 94 | 45 | 98 |
Primer would be used only for amplification, not for sequencing.
The lowest confidence value threshold (CT) that is consistent with an FPR of 5% (see Methods).
Coverage (in percentage units) observed for the confidence threshold in environmental sequences.
Median number of predictions in the interval [CT, CT+4] was smaller than 10.
Results for 100 nt and 120 nt paired end configurations were practically identical for this region, as we encountered few V3 amplicons that were longer than 100 nt (all in the environmental sequences).
CT for these cases is lower than that of one or more higher levels, so a sequence can be classified at the current level but not at the higher taxonomic levels. We find that the classification of such sequences is associated with a high error rate, and our recommendation is to exclude them. We have adjusted coverage accordingly.
Figure 3. Classification performance of combined 100 nt single-read predictions, as compared to the best performing paired-end configurations.
We combined predictions made for different 100 nt fragments of the same sequence, by selecting the prediction with the highest confidence score at the genus level (or the lowest common level available). We evaluated the performance, at ranks genus and family (left and right panels, respectively), of combinations of fragments from the V3 and V4 regions (top and bottom panels, respectively) with fragments from each of the other regions examined, and compared it to the performance of the V3 and V4 100 nt paired-end configurations (pointed to by arrows). We used the results of leave-k-out tests classifying the LTP sequences to determine confidence score thresholds for a set of desired false prediction rate (FPR) values (x axis), so that the FPR would be at most the desired value. We then used these thresholds to calculate the classification coverage of sequences from environmental (uncultured) bacteria that corresponds to the desired FPR (y axis). Figure S6 compares the performance of the combinations for the ranks order, class, and phylum.
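The combination rule described in this caption might look roughly like the sketch below. It rests on assumptions that the caption does not spell out: each fragment's prediction is a mapping from rank to a (taxon, confidence) pair, and "the lowest common level available" is read as the lowest rank present in every prediction being combined. The function name `combine_predictions` and the taxon names in the example are purely illustrative.

```python
# Ranks ordered from highest (most general) to lowest (most specific).
RANKS = ["phylum", "class", "order", "family", "genus"]


def combine_predictions(predictions):
    """Pick one prediction from several fragments of the same sequence.

    predictions -- list of dicts mapping rank -> (taxon, confidence)
    The predictions are compared at genus if all of them reach genus,
    otherwise at the lowest rank they all share (an assumed reading of
    "lowest common level available"). The one with the highest confidence
    at that rank is returned.
    """
    common = [r for r in RANKS if all(r in p for p in predictions)]
    if not common:
        raise ValueError("predictions share no taxonomic rank")
    rank = common[-1]  # lowest shared rank
    return max(predictions, key=lambda p: p[rank][1])


# Two fragments of one sequence, both classified down to genus.
v3 = {"family": ("Lachnospiraceae", 80), "genus": ("Blautia", 55)}
v4 = {"family": ("Lachnospiraceae", 95), "genus": ("Roseburia", 90)}
best = combine_predictions([v3, v4])
print(best["genus"])
```

Choosing a single winning prediction, rather than averaging scores across fragments, keeps the result a coherent lineage from one classifier run.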