Matthew J Morgan, Anthony A Chariton, Diana M Hartley, Leon N Court, Christopher M Hardy.
Abstract
Accurate estimation of biological diversity in environmental DNA samples using high-throughput amplicon pyrosequencing must account for errors generated by PCR and sequencing. We describe a novel approach to distinguish the underlying sequence diversity in environmental DNA samples from errors that uses information on the abundance distribution of similar sequences across independent samples, as well as the frequency and diversity of sequences within individual samples. We have further refined this approach into a bioinformatics pipeline, Amplicon Pyrosequence Denoising Program (APDP), that is able to process raw sequence datasets into a set of validated sequences in formats compatible with commonly used downstream analysis packages. We demonstrate, by sequencing complex environmental samples and mock communities, that APDP is effective for removing errors from deeply sequenced datasets comprising biological and technical replicates, and can efficiently denoise single-sample datasets. APDP provides more conservative diversity estimates for complex datasets than other approaches; however, for some applications this may provide a more accurate and appropriate level of resolution, and result in greater confidence that returned sequences reflect the diversity of the underlying sample.
Year: 2013 PMID: 23991013 PMCID: PMC3753314 DOI: 10.1371/journal.pone.0071974
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Table 1. Accuracy of APDP for eight Titanium pyrosequence data sets.
| Data set | Expected sequences | Pyrosequencing reads | Unique sequences | Filtered sequences | Preliminary validation | Final validation |
| 18SEnv1 | ||||||
| All | Unknown | 357,432 | 99,303 | 18,559 | 1,164 | 929 |
| Decapoda only | 3 | 523 | 4 | 3 | ||
| 18SEnv2 | ||||||
| All | Unknown | 314,414 | 52,048 | 13,684 | 990 | 841 |
| Decapoda only | 2 | 754 | 2 | 2 | ||
| 18Smock1-3 | 16 | 268,874 | 26,675 | 6,088 | 55 | 16 |
| 18Smock4-6 | 16 | 275,876 | 24,934 | 5,874 | 63 | 16 |
| 16Sv13 | 24 | 268,818 | 12,793 | 4,144 | 1,127 | 34 |
| 16Sv34 | 20 | 75,447 | 9,135 | 1,056 | 57 | 17 |
| 16Sv45 | 80 | 62,873 | 13,831 | 1,321 | 112 | 93 |
| 16Sv6 | 20 | 53,653 | 1,040 | 266 | 56 | 19 |
Also shown are the number of raw pyrosequences, number of unique sequences, number of unique sequences after APDP filtering, and the number of unique sequences after Preliminary and Final validation.
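The step from raw pyrosequencing reads to unique sequences can be illustrated with a minimal dereplication sketch. This is not APDP's code: the function name `dereplicate` and the toy reads are hypothetical, and a real pipeline would also apply the quality filtering and validation steps summarized in the table.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical reads into unique sequences with their read counts."""
    return Counter(reads)


# Toy reads (hypothetical, not taken from the paper's data sets).
reads = ["ACGT", "ACGT", "ACGA", "ACGT", "TTGA"]
unique = dereplicate(reads)

print(f"{len(reads)} reads -> {len(unique)} unique sequences")
for sequence, count in unique.most_common():
    print(sequence, count)
```

Keeping the read count alongside each unique sequence matters because the abundance of a sequence, within and across samples, is the information the approach relies on to separate real diversity from errors.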
Table 2. Accuracy and sensitivity of APDP to rare taxa in high- and low-diversity communities.
| Data Set | Minimum reads (data set) | Minimum reads (sample) | Minimum relative frequency | Reference sequence frequency | Expected sequences | Potentially valid sequences | FP | FN |
| 18SEnv1 | 3 | 2 | 0.0008% | 0.002%–2.465% | 3 | 523 | 0 | 0 |
| 18SEnv2 | 3 | 2 | 0.0009% | 0.015%–8.149% | 2 | 754 | 0 | 0 |
| 18Smock1-3 | 8 | 2 | 0.0105% | 0.04%–17.51% | 16 | 1,381 | 0 | 0 |
| 18Smock4-6 | 8 | 2 | 0.0099% | 0.05%–17.67% | 16 | 1,364 | 0 | 0 |
| 16Sv13 | 6 | 2 | 0.0007% | 0.13%–31.53% | 24 | 1,276 | 11 | 1 |
| 16Sv34 | 2 | 2 | 0.0027% | 0.001%–1.178% | 20 | 1,344 | 4 | 7 |
| 16Sv45 | 2 | 2 | 0.0032% | 0.021%–0.556% | 80 | 1,330 | 13 | 0 |
| 16Sv6 | 2 | 2 | 0.0037% | 0.002%–14.193% | 20 | 311 | 3 | 4 |
For each data set the table shows the theoretical minimum number of reads required for a sequence to be potentially validated by APDP in the total data set and in each sample, the minimum relative frequency, the relative frequency of reference sequences, the expected number of sequences or OTUs, the number of unique sequences that passed the minimum read thresholds and were therefore potentially valid sequences, the number of errors retained (FP), and the number of undetected reference sequences (FN).
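The FP and FN columns can be read as set differences between the sequences a method returns and the reference sequences of a mock community. The sketch below assumes exact sequence matching and uses a hypothetical helper, `score_against_reference`, with toy labels; it does not reproduce how the authors matched sequences to references.

```python
def score_against_reference(observed, reference):
    """Return (false_positives, false_negatives) for a set of returned sequences.

    FP: returned sequences that are not in the reference set (errors retained).
    FN: reference sequences that were not returned (undetected references).
    """
    observed = set(observed)
    reference = set(reference)
    false_positives = len(observed - reference)
    false_negatives = len(reference - observed)
    return false_positives, false_negatives


# Toy example (hypothetical labels standing in for sequences).
reference = ["seqA", "seqB", "seqC"]
observed = ["seqA", "seqB", "seqX", "seqY"]

fp, fn = score_against_reference(observed, reference)
print("FP:", fp, "FN:", fn)
```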
Table 3. Overall accuracy of denoised pyrosequences and 3% OTUs retained by APDP and three alternative denoising approaches using six artificial community datasets.
| Sequences | 3% OTUs | ||||||||||
| TP only | TP+NM | TP only | TP+MC+NM | ||||||||
| Method | Number of Datasets | Total Expected Sequences | Total Observed Sequences | FP | FN | FP | FN | FP | FN | FP | FN |
| APDP | 6 | 176 | 195 | | | | | | | | 11 |
| AmpliconNoise | 6 | 174 | 506 | 383 | 50 | 360 | 27 | 189 | 39 | 164 | 14 |
| cutoff | 6 | 174 | 172 | 63 | 65 | 48 | 50 | 59 | 46 | 44 | 31 |
| mothur | 6 | 183 | 6205 | 6052 | 38 | 6038 | 24 | 2358 | 35 | 2329 | |
| cutoff | 6 | 183 | 538 | 385 | 39 | 377 | 31 | 315 | 36 | 292 | 13 |
| QIIME | 6 | 204 | 1113 | 1024 | 115 | 946 | 37 | 431 | 102 | 351 | 22 |
| cutoff | 6 | 204 | 772 | 683 | 115 | 610 | 42 | 269 | 102 | 190 | 23 |
Bold numbers indicate best result in the column. The number of false positives (FP) and false negatives (FN) is shown for cases where miscalled and near-match OTUs are considered to be incorrect (TP only) or correct (TP+NM and TP+MC+NM). In addition, the results are shown for each method after excluding OTUs with fewer reads than the minimum detection threshold of APDP (cutoff).
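The "cutoff" rows amount to re-scoring each method after discarding low-abundance results. A minimal sketch of that filter follows, assuming read counts are held in a dictionary; the name `apply_cutoff` and the toy counts are hypothetical, and the threshold of 8 reads is used only as an illustrative value.

```python
def apply_cutoff(read_counts, min_reads):
    """Keep only sequences or OTUs supported by at least `min_reads` reads."""
    return {name: n for name, n in read_counts.items() if n >= min_reads}


# Toy read counts per OTU (hypothetical values).
read_counts = {"otu_a": 120, "otu_b": 8, "otu_c": 7, "otu_d": 1}

retained = apply_cutoff(read_counts, min_reads=8)
print(sorted(retained))
```

Applying the same minimum read threshold to every method makes the comparison less dependent on how many rare, low-support sequences each method happens to return.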
Table 4. Accuracy of denoised pyrosequences from the three multi-sample artificial community datasets retained by APDP and three alternative denoising approaches.
| 18S1–3 | 18S4–6 | 16Sv13 | ||||||||||
| TP only | TP+NM | TP only | TP+NM | TP only | TP+NM | |||||||
| Method | FP | FN | FP | FN | FP | FN | FP | FN | FP | FN | FP | FN |
| APDP | | | | | | | | | | | | |
| AmpliconNoise | 33 | 6 | 30 | 3 | 38 | 3 | 37 | 2 | 243 | 3 | 243 | 3 |
| cutoff | 2 | 12 | 2 | 12 | 0 | 10 | 0 | 10 | 36 | 3 | 36 | 3 |
| mothur | 2226 | 0 | 2226 | 0 | 2502 | 0 | 2502 | 0 | 742 | 9 | 738 | 5 |
| cutoff | 110 | 0 | 110 | 0 | 102 | 0 | 102 | 0 | 20 | 9 | 19 | 8 |
| QIIME | 188 | 2 | 186 | 0 | 172 | 1 | 171 | 0 | 496 | 34 | 476 | 14 |
| cutoff | 87 | 2 | 85 | 0 | 91 | 1 | 90 | 0 | 357 | 34 | 341 | 18 |
The number of false positives (FP) and false negatives (FN) is shown for cases where near-match sequences are considered to be incorrect (TP only) or correct (TP+NM). In addition, the results are shown for each method after excluding denoised sequences with fewer reads than the minimum detection threshold of APDP (cutoff).
Table 5. Accuracy of denoised pyrosequences from the three single-sample artificial community datasets retained by APDP and three alternative denoising approaches.
| 16Sv34 | 16Sv45 | 16Sv6 | ||||||||||
| TP only | TP+NM | TP only | TP+NM | TP only | TP+NM | |||||||
| Method | FP | FN | FP | FN | FP | FN | FP | FN | FP | FN | FP | FN |
| APDP | | 13 | 4 | 7 | 17 | | 13 | | | 6 | | 4 |
| AmpliconNoise | 20 | 16 | 9 | | 28 | 17 | 22 | 11 | 21 | | 19 | |
| cutoff | 12 | 16 | | 6 | | 17 | | 13 | 8 | 7 | 7 | 6 |
| mothur | 51 | | 51 | 11 | 448 | 10 | 438 | 0 | 83 | 8 | 83 | 8 |
| cutoff | 20 | | 20 | 11 | 99 | 10 | 92 | 3 | 34 | 9 | 34 | 9 |
| QIIME | 19 | 19 | 11 | 11 | 105 | 52 | 59 | 6 | 44 | 7 | 43 | 6 |
| cutoff | 19 | 19 | 11 | 11 | 86 | 52 | 41 | 7 | 43 | 7 | 42 | 6 |
Nomenclature and abbreviations are as Table 4.
Table 6. Accuracy of 3% OTUs constructed from denoised pyrosequences from the three multi-sample artificial community datasets retained by APDP and three alternative denoising approaches.
| 18S1–3 | 18S4–6 | 16Sv13 | ||||||||||
| TP only | TP+NM | TP only | TP+NM | TP only | TP+NM | |||||||
| Method | FP | FN | FP | FN | FP | FN | FP | FN | FP | FN | FP | FN |
| APDP | | | | | | | | | | | | |
| AmpliconNoise | 14 | 7 | 9 | 2 | 17 | 7 | 13 | 3 | 98 | 1 | 97 | 0 |
| cutoff | 2 | 10 | 2 | 10 | 2 | 9 | 1 | 8 | 30 | 1 | 29 | 0 |
| mothur | 1017 | 1 | 1016 | 0 | 1021 | 0 | 1021 | 0 | 71 | 1 | 70 | 0 |
| cutoff | 97 | 1 | 96 | 0 | 101 | 0 | 101 | 0 | 4 | 1 | 3 | 0 |
| QIIME | 73 | 4 | 69 | 0 | 70 | 1 | 69 | 0 | 144 | 13 | 131 | 0 |
| cutoff | 37 | 4 | 33 | 0 | 36 | 1 | 35 | 0 | 82 | 13 | 69 | 0 |
The number of false positives (FP) and false negatives (FN) is shown for cases where miscalled and near-match OTUs are considered to be incorrect (TP only) or correct (TP+NM). In addition, the results are shown for each method after excluding OTUs with fewer reads than the minimum detection threshold of APDP (cutoff).
Table 7. Accuracy of 3% OTUs constructed from denoised pyrosequences from the three single-sample artificial community datasets retained by APDP and three alternative denoising approaches.
| 16Sv34 | 16Sv45 | 16Sv6 | ||||||||||
| TP only | TP+NM | TP only | TP+NM | TP only | TP+NM | |||||||
| Method | FP | FN | FP | FN | FP | FN | FP | FN | FP | FN | FP | FN |
| APDP | | | 3 | 5 | | | | | | 5 | | 3 |
| AmpliconNoise | 19 | 13 | 9 | 3 | 22 | 7 | 19 | 4 | 19 | | 17 | |
| cutoff | 11 | 13 | | 4 | 5 | 7 | 2 | 4 | 9 | 6 | 8 | 5 |
| mothur | 20 | 10 | 12 | | 138 | 17 | 124 | 3 | 91 | 6 | 86 | 1 |
| cutoff | 10 | 10 | 3 | 3 | 60 | 15 | 49 | 6 | 43 | 7 | 40 | 4 |
| QIIME | 16 | 15 | 4 | 3 | 77 | 40 | 41 | 4 | 23 | 12 | 19 | 8 |
| cutoff | 16 | 15 | 4 | 3 | 63 | 40 | 27 | 4 | 22 | 12 | 18 | 8 |
Nomenclature and abbreviations are as Table 6.
Figure 1. Abundance and accuracy of OTUs recovered by APDP, QIIME and mothur analyses of high diversity environmental datasets.
Sequences are ordered by rank abundance on the x-axes. Note that the scale varies for each graph as the number of OTUs recovered differs for each dataset and method. Y-axes are log-scaled. The total number of OTUs recovered (n), and the relative frequency distribution for all OTUs (rank abundance bar chart) are shown for each analysis. a) All OTUs recovered from the high-diversity 18SEnv1 dataset that contains an unknown number of OTUs by each analysis method. b) OTUs recovered from 18SEnv1 that were classified as Decapoda (expected to comprise three OTUs). Spacing between some high-frequency bars has been manipulated to aid visualization of the mothur result due to the high OTU return. c) All OTUs recovered from the high-diversity 18SEnv2 dataset. d) OTUs recovered from 18SEnv2 that were classified as Decapoda (expected to comprise two OTUs). The expected number of decapod OTUs (b and d) is shown in parentheses along with a pie chart showing the proportion classified as good (blue), noise (red) or missed (no color).
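The rank abundance ordering used on the x-axes can be sketched as follows, assuming OTU read counts are available as a dictionary. The function name `rank_abundance` and the toy counts are hypothetical; this only illustrates the ordering and the relative frequency calculation, not the figure's plotting code.

```python
def rank_abundance(otu_counts):
    """Order OTUs from most to least abundant and attach relative frequencies.

    Returns a list of (rank, otu_name, relative_frequency) tuples, rank 1 first.
    """
    total = sum(otu_counts.values())
    ranked = sorted(otu_counts.items(), key=lambda item: item[1], reverse=True)
    return [
        (rank, name, count / total)
        for rank, (name, count) in enumerate(ranked, start=1)
    ]


# Toy OTU read counts (hypothetical values).
otu_counts = {"otu_b": 30, "otu_a": 50, "otu_c": 20}

for rank, name, frequency in rank_abundance(otu_counts):
    print(rank, name, f"{frequency:.1%}")
```

Plotting these relative frequencies on a log-scaled y-axis, as in the figure, keeps rare OTUs visible next to the few dominant ones.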