EDAR: an efficient error detection and removal algorithm for next generation sequencing data.

Xiaohong Zhao, Lance E Palmer, Randall Bolanos, Cristian Mircean, Dan Fasulo, Gayle M Wittenberg.

Abstract

Genomic sequencing techniques introduce experimental errors into reads, which can mislead sequence assembly efforts and complicate the diagnostic process. Here we present a method for detecting and removing sequencing errors from reads generated in genomic shotgun sequencing projects prior to sequence assembly. For each input read, the set of all length-k substrings (k-mers) it contains is calculated. The read is evaluated based on the frequency with which each k-mer occurs in the complete data set (k-count). For each read, k-mers are clustered using the variable-bandwidth mean-shift algorithm. Based on the k-count of the cluster center, clusters are classified as error regions or non-error regions. For the 23 real and simulated data sets tested (454 and Solexa), our algorithm detected error regions that cover 99% of all errors. A heuristic algorithm is then applied to detect the location of errors in each putative error region. A read is corrected by removing the errors, thereby creating two or more smaller, error-free read fragments. After performing error removal, the error rate for all data sets tested decreased (∼35-fold reduction, on average). EDAR has accuracy comparable to methods that correct rather than remove errors, and it performs better when the error rate of simulated data sets is greater than 3%. The performance of the Velvet assembler is generally better with error-removed data. However, for short reads, splitting at the location of errors can be problematic. Following error detection with error correction, rather than removal, may improve the assembly results.
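The k-count and read-splitting ideas in the abstract can be illustrated with a minimal Python sketch. It is not the EDAR implementation: a fixed threshold (the hypothetical parameter `min_count`) stands in for the variable-bandwidth mean-shift clustering and for the error-localisation heuristic, and all function names are assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count how often each length-k substring occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def k_count_profile(read, counts, k):
    """Return the k-count of every k-mer in the read, ordered by position."""
    return [counts[read[i:i + k]] for i in range(len(read) - k + 1)]


def split_at_low_count(read, counts, k, min_count):
    """Keep only the stretches of a read covered by k-mers with k-count >= min_count.

    Simplification (assumption): a fixed threshold replaces the mean-shift
    clustering and the heuristic that locates errors within an error region.
    """
    profile = k_count_profile(read, counts, k)
    fragments = []
    start = None  # index of the first k-mer in the current trusted run
    for i, count in enumerate(profile):
        if count >= min_count:
            if start is None:
                start = i
        elif start is not None:
            # The trusted run spans k-mers start..i-1, i.e. bases start..i+k-2.
            fragments.append(read[start:i - 1 + k])
            start = None
    if start is not None:
        fragments.append(read[start:])
    return fragments
```

For example, given five copies of the read `ACGTTGCAAGGCT` and one copy with a substitution at the seventh base (`ACGTTGTAAGGCT`), calling `split_at_low_count` on the erroneous read with `k=4` and `min_count=2` yields the two fragments `ACGTTG` and `AAGGCT`, neither of which contains the substituted base.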

Year: 2010 PMID: 20973743 DOI: 10.1089/cmb.2010.0127

Source DB: PubMed Journal: J Comput Biol ISSN: 1066-5277 Impact factor: 1.479

Cited by

15 in total

1. Effects of short read quality and quantity on a de novo vertebrate transcriptome assembly.

Authors: T I Garcia; Y Shen; J Catchen; A Amores; M Schartl; J Postlethwait; R B Walter
Journal: Comp Biochem Physiol C Toxicol Pharmacol Date: 2011-06-01 Impact factor: 3.228

2. Analysis of the evolution and structure of a complex intrahost viral population in chronic hepatitis C virus mapped by ultradeep pyrosequencing.

Authors: Brendan A Palmer; Zoya Dimitrova; Pavel Skums; Orla Crosbie; Elizabeth Kenny-Walsh; Liam J Fanning
Journal: J Virol Date: 2014-09-17 Impact factor: 5.103

3. Slim-filter: an interactive Windows-based application for illumina genome analyzer data assessment and manipulation.

4. Challenges and opportunities in estimating viral genetic diversity from next-generation sequencing data.

5. Ultrafast clustering algorithms for metagenomic sequence analysis.

6. Efficient error correction for next-generation sequencing of viral amplicons.

7. Overcoming bias and systematic errors in next generation sequencing data.

8. Error correction of high-throughput sequencing datasets with non-uniform coverage.

9. QuorUM: An Error Corrector for Illumina Reads.

10. Using false discovery rates to benchmark SNP-callers in next-generation sequencing projects.