DRISEE overestimates errors in metagenomic sequencing data.

A Murat Eren, Hilary G Morrison, Susan M Huse, Mitchell L Sogin.

Abstract

The extremely high error rates reported by Keegan et al. in 'A platform-independent method for detecting errors in metagenomic sequencing data: DRISEE' (PLoS Comput Biol 2012; 8:e1002541) for many next-generation sequencing datasets prompted us to re-examine their results. Our analysis reveals that the presence of conserved artificial sequences, e.g. Illumina adapters, and other naturally occurring sequence motifs accounts for most of the reported errors. We conclude that DRISEE reports inflated levels of sequencing error, particularly for Illumina data. Tools offered for evaluating large datasets need scrupulous review before they are implemented.

Keywords: PCR; adapter ligation; next-generation sequencing; quality score; sequencing error

Substances: DNA

Year: 2013 PMID: 23698723 PMCID: PMC4171678 DOI: 10.1093/bib/bbt010

Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622

INTRODUCTION

Error identification and correction in high-throughput sequencing datasets, especially at the single read level, have been addressed by many investigators [1–18]. Many approaches use platform-dependent quality scores, read consensus or k-mer analysis. Recently, Keegan et al. [19] described DRISEE, a method to assess the quality of genomic and metagenomic next-generation sequencing runs. The authors analysed numerous publicly available datasets with DRISEE and reported widely variable levels of sequencing error, generally far higher than other published estimates [1, 20–22].

DRISEE bases its error estimates on variation from a consensus sequence in bins of artificially duplicated reads (ADRs). DRISEE assumes that prior to sequencing, over-amplification from a given start point in the template leads to formation of ADRs, and that sequencing error, not naturally occurring sequence diversity, accounts for sequence variation within an ADR bin. An ADR bin consists of all reads starting with an identical prefix, by default the first 50 nt of the read. DRISEE as described might provide a better method for estimating sequencing errors than the platform-based quality scores; however, the authors failed to carefully examine the origins of ADR bins. DRISEE analyses all reads except those that contain ambiguous bases. The authors correctly note, ‘Bins can be screened for eukaryotic content, sequences with low complexity, and/or known sequences that may exhibit an unusually high level of biological repetition (16s rRNA-based, sequences with low complexity, eukaryotic sequences etc.). Bins that contain such sequences should be excluded from further consideration’. However, the Supplemental Methods in the DRISEE manuscript reveal that the authors did not exclude such reads.
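The binning step described above can be sketched in a few lines of Python. This is a minimal illustration, not the DRISEE implementation: the function names are hypothetical, the 20-read minimum is borrowed from the bin threshold used in this re-analysis, and the error measure counts only substitutions relative to a per-position majority base (insertions and deletions are ignored).

```python
from collections import Counter, defaultdict


def bin_by_prefix(reads, prefix_len=50, min_bin_size=20):
    """Group reads that share an identical prefix (hypothetical helper).

    Reads containing ambiguous bases (anything other than A, C, G, T) or
    shorter than the prefix are skipped; bins below min_bin_size are dropped.
    """
    bins = defaultdict(list)
    for read in reads:
        if len(read) < prefix_len or set(read) - set("ACGT"):
            continue
        bins[read[:prefix_len]].append(read)
    return {prefix: members for prefix, members in bins.items()
            if len(members) >= min_bin_size}


def per_position_error(bin_reads, prefix_len=50):
    """Percent of reads that disagree with the majority base at each
    position after the prefix (simplified: substitutions only)."""
    errors = []
    longest = max(len(read) for read in bin_reads)
    for pos in range(prefix_len, longest):
        column = [read[pos] for read in bin_reads if len(read) > pos]
        consensus_count = Counter(column).most_common(1)[0][1]
        errors.append(100.0 * (len(column) - consensus_count) / len(column))
    return errors
```

Because any set of reads sharing a prefix forms a bin under this scheme, a shared adapter or conserved gene start produces a bin whose downstream biological variation is scored as error.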

Widespread Illumina adapter contamination

We obtained the 12 metagenomic datasets that were used to generate Figure 4b in the original publication from the NCBI Sequence Read Archive (SRA). In that figure, DRISEE error estimates showed a significant discrepancy from the quality scores reported by the Illumina platform. Our analysis of DRISEE-generated ADR bins with ≥20 reads showed that Illumina adapter sequences drive the formation of these bins. This 65-nt adapter sequence usually occurs upstream of the sequencing primer-binding site. Unfortunately, Illumina adapter artifacts sometimes contaminate libraries. Unless they are filtered out or trimmed, reads starting with Illumina adapters will present identical 65-nt prefixes at the start of the read and create a spurious ADR bin. DRISEE interprets the actual biological variation that follows the adapter sequences in these bins as extensive sequencing error. With DRISEE (version 1.2), we re-analysed the 12 datasets, identifying reads as ‘adapter contaminated’ if they presented at least 15 nt perfect identity to the Illumina adapter sequences in the first 50 nt (see Supplemental Methods). Figure 1 shows the marked difference in error estimation for reads with and without Illumina adapters. Although the claim by Keegan et al. [19] that the true error rates are higher than reported in the quality scores may be correct, the exceedingly high error rates presented in Figure 4b of the original publication reflect the presence of untrimmed Illumina adapter sequences and do not support that claim.
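The contamination criterion above (at least 15 nt of perfect identity to an adapter within the first 50 nt of a read) can be sketched as a k-mer lookup. The function name and the parameters `k` and `window` are hypothetical, and the adapter sequence must be supplied by the user; the procedure actually used is described in the Supplemental Methods and may differ in detail.

```python
def is_adapter_contaminated(read, adapter, k=15, window=50):
    """Return True if any k-mer in the first `window` bases of `read`
    exactly matches a k-mer of `adapter` (hypothetical helper).

    Only matches lying entirely within the first `window` bases count.
    """
    adapter_kmers = {adapter[i:i + k] for i in range(len(adapter) - k + 1)}
    head = read[:window]
    return any(head[i:i + k] in adapter_kmers
               for i in range(len(head) - k + 1))


def split_by_contamination(reads, adapter, k=15, window=50):
    """Partition reads into (contaminated, clean) lists."""
    contaminated, clean = [], []
    for read in reads:
        if is_adapter_contaminated(read, adapter, k, window):
            contaminated.append(read)
        else:
            clean.append(read)
    return contaminated, clean
```

Running the error estimate separately on the two partitions is the kind of comparison summarized in Figure 1.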

Figure 1:

Change in DRISEE error estimation for reads with and without Illumina adapter contamination for all 12 datasets that were used in the original publication to demonstrate how DRISEE error profiles differ markedly from quality scores.

Spurious ADR bins caused by adapter sequences differ markedly from valid bins in the magnitude of errors and their distribution by nucleotide position. Figure 2 shows DRISEE output from individual large bins from dataset SRR061459. The adapter-generated bin exhibits error greater than zero at all positions following the prefix and the average error greatly exceeds that of the valid ADR bin.

Figure 2:

DRISEE error by position. The largest bin contained 15 264 reads and the prefix appeared to be a true ADR (bacterial genomic sequence). The per cent error at each position is plotted on the y-axis (light blue). Scores for an adapter-generated bin with 8177 reads are shown for comparison (dark red).

Low-complexity and conserved gene reads

Analysis of all 10 Illumina genomic datasets, as well as 10 randomly chosen Illumina metagenomic runs from Keegan et al.’s Figure 3 [19], detected significant Illumina adapter contamination and a high proportion of low-complexity reads in all datasets, both of which generated spurious bins that inflated DRISEE error drastically. Table 1 demonstrates the inflation of DRISEE error for one of these datasets chosen randomly (SRA accession SRR061488). Genes with conserved regions followed by biological variation that commonly occurs in both bacterial and eukaryotic genomes can create bins large enough to be considered by DRISEE and inflate overall error estimations. For instance, 74 of 403 bins in SRR061488 derive from the 16S rRNA gene.
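The text does not state how low-complexity prefixes were identified, so the following is only one possible screen, offered as an assumption: it scores a prefix by the Shannon entropy of its dinucleotide composition and flags prefixes that fall below an arbitrary threshold. The function names and the `min_entropy` value are hypothetical and would need tuning on real data.

```python
import math
from collections import Counter


def dinucleotide_entropy(seq):
    """Shannon entropy (bits) of overlapping dinucleotides in seq.

    Ranges from 0 (a single repeated dinucleotide) to 4 (all 16
    dinucleotides equally frequent).
    """
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    total = len(pairs)
    counts = Counter(pairs)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def is_low_complexity(prefix, min_entropy=2.0):
    """Flag a prefix whose dinucleotide entropy is below min_entropy.

    The threshold is an illustrative assumption, not a published cut-off.
    """
    return dinucleotide_entropy(prefix) < min_entropy
```

A homopolymer prefix scores 0 bits and a short tandem repeat scores close to the log2 of its repeat-unit length, so both fall below the example threshold, whereas a prefix with varied composition approaches 4 bits.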

Table 1:

Change in DRISEE error estimation for SRR061488 after removing adapter-contaminated and low-complexity bins from the analysis

Category	Number^a	DRISEE error (%)
All bins	4766	39.9
Adapter-contaminated bins	1645	45.3
Low-complexity bins	2718	34.8
Remaining bins	403	6.6

^a Number of bins containing ≥20 reads and no ambiguous bases in their prefixes.
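The figures in Table 1 can be checked with a few lines of arithmetic. The snippet below only restates the table values; the variable names are illustrative.

```python
# Bin counts and DRISEE error (%) copied from Table 1 (SRR061488).
table1 = {
    "adapter_contaminated": (1645, 45.3),
    "low_complexity": (2718, 34.8),
    "remaining": (403, 6.6),
}
all_bins, all_error = 4766, 39.9

# The three categories should account for every bin in the "All bins" row.
total_bins = sum(count for count, _ in table1.values())

# Share of bins that are spurious (adapter-contaminated or low-complexity).
spurious_bins = table1["adapter_contaminated"][0] + table1["low_complexity"][0]
spurious_pct = 100.0 * spurious_bins / all_bins

# How much larger the unfiltered estimate is than the filtered one.
fold_inflation = all_error / table1["remaining"][1]

print(f"categories sum to {total_bins} of {all_bins} bins")
print(f"spurious bins: {spurious_pct:.1f}% of all bins")
print(f"unfiltered error is {fold_inflation:.1f}x the filtered error")
```

The three categories sum to the 4766 bins in the first row, about 91.5% of the bins are spurious, and the unfiltered estimate is roughly six times the estimate from the remaining bins.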

Platform-specific error

Keegan et al. [19] also report a striking difference in error rates between 454 and Illumina datasets. As we have shown, contaminating adapter sequences account for much of the DRISEE error in Illumina datasets. We next analysed 55 of the 65 Roche/454 metagenomic datasets used to generate Figure 3 in the DRISEE manuscript (the other 10 datasets were not available in MG-RAST or SRA). Our analysis showed that while adapter contamination is rare in 454 data, the 50-nt prefixes from 34 of the datasets were dominated by similar sequence motifs from sources we could not identify (see Supplemental Methods). Figure 3 exemplifies some of these motifs in one dataset (MG-RAST ID 4441625.3). Identical motifs in multiple datasets from the same research project suggest a library preparation artifact. Bins from another eight datasets had low-complexity, repetitive sequence prefixes. Whole genome amplification provided material for at least six of these libraries. Other datasets derived from metatranscriptomic material and contained a high proportion of rRNA-templated reads. The majority of the datasets used to compare the error rates of sequencing platforms in Figure 3 of the original publication violate the underlying assumptions of DRISEE and led to the publication of misleading results.

Figure 3:

Some of the motifs that generated invalid bins for dataset 4441625.3. The 20 largest bins are shown. The first column is bin size and the second is the 50-nt prefix. Similar motifs are shown using the same font colour.

Improving DRISEE

Not all reads that share the same first 50 bases represent artificial duplication. Meaningful results from DRISEE require understanding the source and distribution of sequence sets with identical prefixes. Suspicious bins must be excluded. However, this adds a layer of complexity and might result in too few bins to reach a robust error estimate. The minimum number of bins necessary to reach a reliable estimate and the impact of the sub-sampling necessary to complete the analysis in a reasonable time were not adequately addressed by the authors. Although DRISEE may eventually have the potential to identify problematic datasets and assess the sequencing quality of next-generation sequencing runs based on ADRs, the current version of the software is inadequate and its results are unrealistic.

SUPPLEMENTARY DATA

Supplementary data are available online at http://bib.oxfordjournals.org/.

DRISEE is proposed as a method for detecting errors in metagenomic sequencing data by binning reads that contain the same prefix and investigating their divergence. DRISEE does not eliminate bins created by adapter contamination or that arise from closely related or low-complexity sequences, which results in inflated error estimates. DRISEE in its current implementation is inaccurate, and error rates reported in the DRISEE publication regarding Illumina and 454 technologies are misleading.

FUNDING

National Institutes of Health [1UH2DK083993 to M.L.S.]; National Science Foundation [BDI-096026 to S.M.H.].

22 in total

1. Reptile: representative tiling for short read error correction.

Authors: Xiao Yang; Karin S Dorman; Srinivas Aluru
Journal: Bioinformatics Date: 2010-08-16 Impact factor: 6.937

2. ECHO: a reference-free short-read error correction algorithm.

Authors: Wei-Chun Kao; Andrew H Chan; Yun S Song
Journal: Genome Res Date: 2011-04-11 Impact factor: 9.043

3. Reference-free validation of short read data.

Authors: Jan Schröder; James Bailey; Thomas Conway; Justin Zobel
Journal: PLoS One Date: 2010-09-22 Impact factor: 3.240

4. Repeat-aware modeling and correction of short read errors.

Authors: Xiao Yang; Srinivas Aluru; Karin S Dorman
Journal: BMC Bioinformatics Date: 2011-02-15 Impact factor: 3.169

5. SAMQA: error classification and validation of high-throughput sequenced read data.

Authors: Thomas Robinson; Sarah Killcoyne; Ryan Bressler; John Boyle
Journal: BMC Genomics Date: 2011-08-18 Impact factor: 3.969

6. ConDeTri--a content dependent read trimmer for Illumina data.

Authors: Linnéa Smeds; Axel Künstner
Journal: PLoS One Date: 2011-10-19 Impact factor: 3.240

7. Error correction of high-throughput sequencing datasets with non-uniform coverage.

Authors: Paul Medvedev; Eric Scott; Boyko Kakaradov; Pavel Pevzner
Journal: Bioinformatics Date: 2011-07-01 Impact factor: 6.937

8. Identification and correction of systematic error in high-throughput sequence data.

Authors: Frazer Meacham; Dario Boffelli; Joseph Dhahbi; David I K Martin; Meromit Singer; Lior Pachter
Journal: BMC Bioinformatics Date: 2011-11-21 Impact factor: 3.169

9. Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and genome analyzer systems.

Authors: André E Minoche; Juliane C Dohm; Heinz Himmelbauer
Journal: Genome Biol Date: 2011-11-08 Impact factor: 13.583

10. BIGpre: a quality assessment package for next-generation sequencing data.

Authors: Tongwu Zhang; Yingfeng Luo; Kan Liu; Linlin Pan; Bing Zhang; Jun Yu; Songnian Hu
Journal: Genomics Proteomics Bioinformatics Date: 2011-12 Impact factor: 7.691
