A Murat Eren, Hilary G Morrison, Susan M Huse, Mitchell L Sogin.
Abstract
The extremely high error rates reported by Keegan et al. in 'A platform-independent method for detecting errors in metagenomic sequencing data: DRISEE' (PLoS Comput Biol 2012; 8: e1002541) for many next-generation sequencing datasets prompted us to re-examine their results. Our analysis reveals that the presence of conserved artificial sequences, e.g. Illumina adapters, and other naturally occurring sequence motifs accounts for most of the reported errors. We conclude that DRISEE reports inflated levels of sequencing error, particularly for Illumina data. Tools offered for evaluating large datasets need scrupulous review before they are implemented.
Keywords: PCR; adapter ligation; next-generation sequencing; quality score; sequencing error
Year: 2013 PMID: 23698723 PMCID: PMC4171678 DOI: 10.1093/bib/bbt010
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Figure 1: Change in DRISEE error estimation for reads with and without Illumina adapter contamination for all 12 datasets that were used in the original publication to demonstrate how DRISEE error profiles differ markedly from quality scores.
Figure 2: DRISEE error by position. The largest bin contained 15 264 reads and the prefix appeared to be a true ADR (bacterial genomic sequence). The per cent error at each position is plotted on the y-axis (light blue). Scores for an adapter-generated bin with 8177 reads are shown for comparison (dark red).
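As a rough illustration of what a per-position error profile such as the one in Figure 2 involves, the Python sketch below computes, for a single bin of reads, the percentage of reads that disagree with the most common base at each position. Treating the most common base as the reference is an assumption made here for illustration and is not claimed to be DRISEE's actual implementation; the function name `percent_error_by_position` and the toy reads are hypothetical.

```python
from collections import Counter


def percent_error_by_position(reads):
    """Return, per position, the percentage of reads that differ from
    the most common base at that position (illustrative assumption,
    not DRISEE's own procedure)."""
    if not reads:
        return []
    # Compare only positions that every read in the bin covers.
    length = min(len(read) for read in reads)
    errors = []
    for i in range(length):
        column = Counter(read[i] for read in reads)
        # Count of the most frequent base in this column.
        consensus_count = column.most_common(1)[0][1]
        errors.append(100.0 * (len(reads) - consensus_count) / len(reads))
    return errors


# Toy bin of four reads (hypothetical data).
toy_bin = ["ACGT", "ACGA", "ACGT", "TCGT"]
profile = percent_error_by_position(toy_bin)
```

In a bin whose reads truly derive from one template, such a profile would reflect sequencing error; in a bin formed by a shared adapter or conserved motif, it would instead reflect genuine sequence differences downstream of the prefix, which is the concern raised in the abstract.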
Change in DRISEE error estimation for SRR061488 after removing adapter-contaminated and low-complexity bins from the analysis
| Category | Number (a) | DRISEE error (%) |
|---|---|---|
| All bins | 4766 | 39.9 |
| Adapter-contaminated bins | 1645 | 45.3 |
| Low-complexity bins | 2718 | 34.8 |
| Remaining bins | 403 | 6.6 |
(a) Number of bins containing ≥20 reads and no ambiguous bases in their prefixes.
Figure 3: Some of the motifs that generated invalid bins for dataset 4441625.3. The 20 largest bins are shown. The first column is bin size and the second is the 50-nt prefix. Similar motifs are shown using the same font colour.
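The kind of listing described for Figure 3 (bin size alongside a 50-nt prefix) can be approximated with a short Python sketch that groups reads by identical prefix and keeps bins with at least 20 reads and no ambiguous bases in the prefix, the criteria given in the table footnote. This is a minimal sketch, not the authors' or DRISEE's code: the function name `prefix_bins` is hypothetical, and "ambiguous" is assumed here to mean any character other than A, C, G or T.

```python
from collections import Counter

PREFIX_LEN = 50     # prefix length named in the Figure 3 caption
MIN_BIN_SIZE = 20   # minimum reads per bin, from the table footnote


def prefix_bins(reads, prefix_len=PREFIX_LEN, min_size=MIN_BIN_SIZE):
    """Group reads by identical prefix and return (prefix, size) pairs,
    largest bins first. Reads shorter than the prefix are skipped."""
    counts = Counter(
        read[:prefix_len].upper()
        for read in reads
        if len(read) >= prefix_len
    )
    kept = [
        (prefix, size)
        for prefix, size in counts.items()
        # Assumption: only A, C, G and T count as unambiguous bases.
        if size >= min_size and set(prefix) <= set("ACGT")
    ]
    # Sort by descending size, then by prefix for a stable order.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept


# Toy input (hypothetical data): two valid bins, one bin with an
# ambiguous base, one bin that is too small, and reads that are too short.
toy_reads = (
    ["A" * 50 + "CG"] * 25
    + ["C" * 50] * 20
    + ["N" + "A" * 49] * 30
    + ["G" * 50] * 5
    + ["ACGT"] * 40
)
bins = prefix_bins(toy_reads)
```

Inspecting the largest bins this way makes recurring adapter sequences or low-complexity motifs easy to spot before any error estimate is computed from them.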