Beifang Niu, Limin Fu, Shulei Sun, Weizhong Li.
Abstract
BACKGROUND: Artificial duplicates from pyrosequencing reads may lead to incorrect interpretation of the abundance of species and genes in metagenomic studies. Duplicated reads were filtered out in many metagenomic projects. However, since the duplicated reads observed in a pyrosequencing run also include natural (non-artificial) duplicates, simply removing all duplicates may also cause underestimation of abundance associated with natural duplicates.
Year: 2010 PMID: 20388221 PMCID: PMC2874554 DOI: 10.1186/1471-2105-11-187
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Genome projects with full genomes from RefSeq and pyrosequencing reads from the Short Read Archive
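| Project IDa | SRA study | SRA run | Platform | Genome accession | Genome length (Mb) | GC (%) | Number of reads | Read densityb | All duplicates (% of reads) | Natural duplicates (% of reads) | (σ)d |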
|---|---|---|---|---|---|---|---|---|---|---|---|
| 20067 | SRP000091 | SRR000351 | GS_20 | NC_010741 | 1.13946 | 52 | 529181 | 0.4644 | 13.585 | 5.032 | (0.030) |
| 20739 | SRP000868 | SRR017616 | GS_FLX | NC_013170 | 1.61780 | 50 | 513712 | 0.3175 | 17.751 | 4.999 | (0.022) |
| 29525 | SRP000571 | SRR013433 | GS_FLX | NC_013124 | 2.15816 | 68 | 570098 | 0.2641 | 9.938 | 4.464 | (0.034) |
| 19265 | SRP000036 | SRR000223 | GS_20 | NC_010085 | 1.64526 | 34 | 429372 | 0.2609 | 12.120 | 4.418 | (0.026) |
| 19981 | SRP000204 | SRR001584 | GS_20 | NC_010830 | 1.88436 | 35 | 399515 | 0.2120 | 26.027 | 4.293 | (0.030) |
| 20655 | SRP000207 | SRR001568 | GS_20 | NC_012803 | 2.50109 | 72 | 528437 | 0.2112 | 9.734 | 4.099 | (0.032) |
| 20833 | SRP000867 | SRR017612 | GS_FLXe | NC_013174 | 2.74965 | 58 | 574027 | 0.2087 | 16.266 | 3.745 | (0.027) |
| 18819 | SRP000035 | SRR000219 | GS_20 | NC_009637 | 1.77269 | 33 | 332809 | 0.1877 | 15.145 | 3.664 | (0.024) |
| 29443 | SRP000895 | SRR017790 | GS_FLX | NC_013166 | 2.85207 | 43 | 529344 | 0.1855 | 16.017 | 3.044 | (0.025) |
| 29419 | SRP000560 | SRR013388 | GS_FLX | NC_012785 | 2.30212 | 41 | 416146 | 0.1807 | 9.537 | 3.025 | (0.031) |
| 19543 | SRP000205 | SRR001565 | GS_20 | NC_010483 | 1.87769 | 46 | 321938 | 0.1714 | 25.373 | 2.886 | (0.038) |
| 29381 | SRP000558 | SRR013382 | GS_FLX | NC_011832 | 2.92292 | 55 | 461295 | 0.1578 | 8.526 | 2.911 | (0.024) |
| 29403 | SRP000584 | SRR013477 | GS_FLX | NC_013162 | 2.61292 | 39 | 400460 | 0.1532 | 11.796 | 2.549 | (0.021) |
| 29177 | SRP000442 | SRR007446 | GS_FLX | NC_011901 | 3.46455 | 65 | 438386 | 0.1265 | 14.140 | 2.239 | (0.023) |
| 29493 | SRP000569 | SRR013431 | GS_FLX | NC_011883 | 2.87344 | 58 | 362855 | 0.1262 | 11.171 | 2.204 | (0.023) |
| 29175 | SRP000928 | SRR018125 | GS_FLX | NC_011661 | 1.85556 | 33 | 225795 | 0.1216 | 6.209 | 2.151 | (0.021) |
| 27731 | SRP000397 | SRR006411 | GS_FLX | NC_011769 | 4.04030 | 67 | 488823 | 0.1209 | 18.793 | 2.073 | (0.020) |
| 31289 | SRP000919 | SRR018042 | GS_FLX | NC_012917 | 4.86291 | 51 | 517593 | 0.1064 | 3.211 | 1.948 | (0.037) |
| 20635 | SRP000049 | SRR000266 | GS_20 | NC_011666 | 4.30543 | 63 | 401125 | 0.0931 | 10.364 | 1.943 | (0.022) |
| 31295 | SRP000921 | SRR018051 | GS_FLX | NC_012912 | 4.81385 | 54 | 441287 | 0.0916 | 6.938 | 1.941 | (0.019) |
| 29527 | SRP000893 | SRR017783 | GS_FLX | NC_013173 | 3.94266 | 58 | 352814 | 0.0894 | 8.356 | 1.926 | (0.022) |
| 20039 | SRP000209 | SRR001574 | GS_FLXf | NC_010524 | 4.90940 | 68 | 422674 | 0.0860 | 8.566 | 1.785 | (0.023) |
| 19701 | SRP000046 | SRR000255 | GS_20 | NC_010644 | 1.64356 | 39 | 136514 | 0.0830 | 5.922 | 1.744 | (0.022) |
| 19743 | SRP000045 | SRR000254 | GS_20 | NC_011145 | 5.06163 | 74 | 409136 | 0.0808 | 7.464 | 1.739 | (0.019) |
| 20095 | SRP000054 | SRR000278 | GS_20 | NC_011891 | 5.02933 | 74 | 404796 | 0.0804 | 8.363 | 1.515 | (0.022) |
| 30681 | SRP000922 | SRR018054 | GS_FLX | NC_012947 | 4.57094 | 50 | 367491 | 0.0803 | 11.324 | 1.449 | (0.018) |
| 21119 | SRP000208 | SRR001573 | GS_FLXf | NC_012032 | 5.26895 | 56 | 392222 | 0.0744 | 10.026 | 1.321 | (0.025) |
| 18637 | SRP000034 | SRR000215 | GS_20 | NC_010172 | 5.47115 | 68 | 395973 | 0.0723 | 12.998 | 1.306 | (0.016) |
| 20167 | SRP000053 | SRR000277 | GS_20 | NC_011004 | 5.74404 | 64 | 413261 | 0.0719 | 14.572 | 1.248 | (0.022) |
| 19989 | SRP000211 | SRR001579 | GS_20 | NC_010571 | 5.95761 | 65 | 378824 | 0.0635 | 5.484 | 1.145 | (0.023) |
| 19449 | SRP000043 | SRR000248 | GS_20 | NC_011768 | 6.51707 | 54 | 395672 | 0.0607 | 16.631 | 1.108 | (0.018) |
| 33873 | SRP000554 | SRR013372 | GS_FLX | NC_012691 | 3.47129 | 49 | 191873 | 0.0552 | 43.680 | 1.001 | (0.025) |
| 27951 | SRP000587 | SRR013487 | GS_FLX | NC_013132 | 9.12735 | 45 | 496792 | 0.0544 | 15.017 | 0.283 | (0.033) |
| 20827 | SRP000582 | SRR013470 | GS_FLX | NC_012669 | 4.66918 | 73 | 246279 | 0.0527 | 4.140 | 0.295 | (0.023) |
| 33069 | SRP000920 | SRR018045 | GS_FLX | NC_012880 | 4.67945 | 55 | 226208 | 0.0483 | 13.585 | 5.032 | (0.030) |
| 19705 | SRP000576 | SRR013446 | GS_FLX | NC_013093 | 8.24814 | 73 | 381851 | 0.0462 | 17.751 | 4.999 | (0.022) |
| 29975 | SRP000443 | SRR013137 | GS_FLX | NC_011992 | 3.79657 | 66 | 161655 | 0.0425 | 9.938 | 4.464 | (0.034) |
| 17265 | SRP000067 | SRR000311 | GS_20 | NC_008369 | 1.89572 | 32 | 28221 | 0.0148 | 12.120 | 4.418 | (0.026) |
| 20729 | SRP000267 | SRR004103 | GS_FLX | NC_012918 | 4.74581 | 60 | 22822 | 0.0048 | 26.027 | 4.293 | (0.030) |
a Project IDs, SRA study accessions, and SRA run accessions are from the NCBI Short Read Archive at http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi.
b Read density is the number of reads divided by the genome length in base pairs.
d σ is the standard deviation, based on the results of 100 simulations (see the "Duplicated reads of genomic datasets" section).
e The platform provided by SRA is GS_FLX, and the read length (~400 bp) suggests GS_FLX Titanium.
f The platform provided by SRA is GS_20, but the read length (~200 bp) suggests GS_FLX.
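Footnotes b and d describe two small computations: read density as the number of reads divided by the genome length, and a mean and standard deviation taken over 100 simulations. The Python sketch below illustrates both under a simplifying assumption that is not taken from the source: reads start at uniformly random positions on a randomly chosen strand, and a read counts as a natural duplicate when an earlier read has the same start and strand. The function names are hypothetical. The paper's actual duplicate criterion is defined in a section not reproduced here, so the simulated percentages should not be expected to match the table.

```python
import random
import statistics


def read_density(num_reads, genome_length_bp):
    """Reads per base pair: number of reads divided by genome length."""
    return num_reads / genome_length_bp


def simulate_duplicate_pct(num_reads, genome_length_bp, rng):
    """Percent of simulated reads that repeat an earlier (start, strand) pair.

    ASSUMPTION: uniform random starts and a same-start, same-strand
    duplicate rule; this is an illustration, not the paper's procedure.
    """
    seen = set()
    duplicates = 0
    for _ in range(num_reads):
        key = (rng.randrange(genome_length_bp), rng.random() < 0.5)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return 100.0 * duplicates / num_reads


def summarize(num_reads, genome_length_bp, runs=100, seed=0):
    """Mean and sample standard deviation of duplicate percent over `runs`."""
    rng = random.Random(seed)
    pcts = [
        simulate_duplicate_pct(num_reads, genome_length_bp, rng)
        for _ in range(runs)
    ]
    return statistics.mean(pcts), statistics.stdev(pcts)


if __name__ == "__main__":
    # First table row: 529181 reads on a 1.13946 Mb genome.
    print(round(read_density(529181, 1_139_460), 4))  # 0.4644
    # Small hypothetical genome so the example runs quickly.
    mean_pct, sigma = summarize(num_reads=2000, genome_length_bp=10_000, runs=20)
    print(mean_pct, sigma)
```

Seeding the generator keeps repeated runs comparable; running the simulation at the scale of the real datasets would take noticeably longer in pure Python.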
Pyrosequencing error rate of the GS-20 platform
| Project IDa | Error rateb | Insertion (% of errors) | Deletion (% of errors) | Substitution (% of errors) |
|---|---|---|---|---|
| 17265 | 0.00774 | 27.06 | 17.28 | 55.67 |
| 18637 | 0.00250 | 60.94 | 22.24 | 16.82 |
| 18819 | 0.01194 | 36.91 | 19.62 | 43.47 |
| 19265 | 0.00893 | 41.32 | 14.51 | 44.18 |
| 19449 | 0.00569 | 38.98 | 20.23 | 40.79 |
| 19543 | 0.00522 | 49.07 | 18.97 | 31.96 |
| 19701 | 0.01097 | 32.89 | 17.93 | 49.18 |
| 19743 | 0.00287 | 57.51 | 28.29 | 14.21 |
| 19981 | 0.00530 | 28.85 | 17.73 | 53.42 |
| 19989 | 0.00216 | 54.29 | 22.33 | 23.38 |
| 20067 | 0.00679 | 46.67 | 12.15 | 41.19 |
| 20095 | 0.00216 | 50.48 | 28.75 | 20.77 |
| 20167 | 0.00157 | 53.65 | 26.70 | 19.65 |
| 20635 | 0.00231 | 57.82 | 18.25 | 23.93 |
| 20655 | 0.00211 | 51.31 | 29.94 | 18.75 |
| Average | 0.00522 | 45.85 | 20.99 | 33.16 |
a Project ID is the same as in Table 1.
b Error rate is calculated for aligned reads as the number of errors (insertion, deletion, and substitution) divided by the number of bases of reads.
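The error rate in footnote b, and the per-type percentages in the table, can be computed from raw error counts. The following is a minimal Python sketch; the function name and the counts in the usage example are hypothetical and are not data from the study.

```python
def error_profile(insertions, deletions, substitutions, total_bases):
    """Return (error_rate, percent_by_type) for a set of aligned reads.

    error_rate is the total number of errors divided by the number of read
    bases; percent_by_type gives each error type as a percentage of all
    errors.
    """
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    counts = {
        "insertion": insertions,
        "deletion": deletions,
        "substitution": substitutions,
    }
    errors = sum(counts.values())
    rate = errors / total_bases
    if errors == 0:
        shares = {name: 0.0 for name in counts}
    else:
        shares = {name: 100.0 * n / errors for name, n in counts.items()}
    return rate, shares


if __name__ == "__main__":
    # Hypothetical counts: 100 errors observed in 20,000 aligned read bases.
    rate, shares = error_profile(46, 21, 33, 20_000)
    print(rate, shares)
```

Because the three percentages are shares of the same error total, they sum to 100 for any dataset with at least one error, as the rows of the table do up to rounding.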
Pyrosequencing error rate of the GS-FLX platform
| Project IDa | Error rateb | Insertion (% of errors) | Deletion (% of errors) | Substitution (% of errors) |
|---|---|---|---|---|
| 20039 | 0.00196 | 51.35 | 26.49 | 22.16 |
| 21119 | 0.00360 | 60.85 | 22.18 | 16.97 |
| 19705 | 0.00189 | 53.24 | 31.03 | 15.73 |
| 20729 | 0.00413 | 19.49 | 23.86 | 56.65 |
| 20739 | 0.00244 | 55.33 | 24.00 | 20.66 |
| 20827 | 0.00122 | 46.32 | 32.48 | 21.20 |
| 20833 | 0.00540 | 53.55 | 37.38 | 9.07 |
| 27731 | 0.00280 | 34.28 | 19.11 | 46.61 |
| 27951 | 0.00377 | 42.11 | 14.75 | 43.14 |
| 29175 | 0.00909 | 40.33 | 16.96 | 42.72 |
| 29177 | 0.00645 | 68.49 | 16.74 | 14.76 |
| 29381 | 0.00607 | 59.57 | 17.15 | 23.29 |
| 29403 | 0.01035 | 58.48 | 19.48 | 22.04 |
| 29419 | 0.00689 | 39.99 | 17.75 | 42.26 |
| 29443 | 0.00396 | 39.97 | 16.41 | 43.62 |
| 29493 | 0.00741 | 46.69 | 17.81 | 35.50 |
| 29525 | 0.00196 | 55.62 | 28.66 | 15.72 |
| 29527 | 0.00613 | 57.44 | 21.91 | 20.66 |
| 29975 | 0.00391 | 57.17 | 19.47 | 23.36 |
| 30681 | 0.00605 | 60.41 | 20.04 | 19.55 |
| 31289 | 0.00389 | 53.18 | 17.61 | 29.21 |
| 31295 | 0.00444 | 58.20 | 15.56 | 26.24 |
| 33069 | 0.00508 | 60.19 | 18.06 | 21.76 |
| 33873 | 0.00540 | 63.31 | 16.22 | 20.47 |
| Average | 0.00476 | 51.48 | 21.30 | 27.22 |
a Project ID is the same as in Table 1.
b Error rate is calculated for aligned reads as the number of errors (insertion, deletion, and substitution) divided by the number of bases of reads.
Figure 1. Ratio of all duplicates and average natural duplicates to all reads from genome projects. The x-axis is the project identifier of each dataset, ordered by decreasing read density (number of reads divided by genome size). The y-axis is the ratio of duplicated reads to all reads.
Metagenomic datasets used in this study
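| Project/Samplea | Environment | Platform | Number of reads | All duplicates (% of reads) | % natural duplicates, high-complexityb, 3 Mb | % natural duplicates, high-complexityb, 100 kb | % natural duplicates, high-complexityb, 10 kb | % natural duplicates, moderate-complexityc, 3 Mb | % natural duplicates, moderate-complexityc, 100 kb | % natural duplicates, moderate-complexityc, 10 kb |
|---|---|---|---|---|---|---|---|---|---|---|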
| 16339/SRR000905 | Marine | GS_20 | 208633 | 5.74 | 0.01 | 0.52 | 4.98 | 0.10 | 3.22 | 24.88 |
| 28969/SRR000674 | Coastal water | GS_FLX | 201671 | 17.65 | 0.02 | 0.51 | 4.87 | 0.10 | 3.13 | 24.27 |
| 29421/SRR001308 | Waste water | GS_FLX | 378601 | 12.39 | 0.03 | 0.93 | 8.94 | 0.20 | 5.65 | 37.09 |
| 30445/SRR001663 | Marine | GS_FLX | 369811 | 15.39 | 0.03 | 0.93 | 8.68 | 0.19 | 5.49 | 36.53 |
| 30563/SRR001669 | Human gut | GS_20 | 41649 | 7.26 | 0.00 | 0.11 | 1.00 | 0.03 | 0.65 | 6.16 |
| 33243/SRR006907 | Freshwater | GS_FLX | 255722 | 20.57 | 0.02 | 0.61 | 6.07 | 0.13 | 3.88 | 28.71 |
| 38721/SRR023845 | Phyllosphere | GS_FLX | 543285 | 11.17 | 0.05 | 1.33 | 12.41 | 0.29 | 7.93 | 45.07 |
| Western channel/Apr_Day_gDNA | Saline water | Titanium | 421004 | 23.38 | 0.04 | 1.04 | 9.80 | 0.20 | 6.23 | 39.42 |
| Ocean viruses/Arctic_Shotgun | Ocean viruses | GS_20 | 688590 | 7.14 | 0.05 | 1.67 | 15.46 | 0.36 | 9.86 | 50.15 |
| North Atlantic/BATS-174-2 | Ocean gyre | GS_20 | 288735 | 17.56 | 0.02 | 0.73 | 6.92 | 0.16 | 4.43 | 31.24 |
a Datasets are either from the NCBI Short Read Archive, with project IDs and run accession numbers at http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi, or from CAMERA, with project and sample names at http://camera.calit2.net.
b High-complexity and c moderate-complexity microbial (or viral) environments, each with an average genome length of 3 Mb, 100 kb, or 10 kb.
Figure 2. Ratio of all duplicates and natural duplicates under different hypothetical types for metagenomic samples. The x-axis is the name or project identifier of each metagenomic sample. For the real metagenomic dataset, the duplicates include both artificial and natural duplicates. For the other hypothetical sample types, the duplicates are natural duplicates.