
EDAR: an efficient error detection and removal algorithm for next generation sequencing data.

Xiaohong Zhao, Lance E Palmer, Randall Bolanos, Cristian Mircean, Dan Fasulo, Gayle M Wittenberg.

Abstract

Genomic sequencing techniques introduce experimental errors into reads, which can mislead sequence assembly efforts and complicate the diagnostic process. Here we present a method for detecting and removing sequencing errors from reads generated in genomic shotgun sequencing projects prior to sequence assembly. For each input read, the set of all length-k substrings (k-mers) it contains is computed. The read is evaluated based on the frequency with which each k-mer occurs in the complete data set (its k-count). For each read, k-mers are clustered using the variable-bandwidth mean-shift algorithm. Based on the k-count of the cluster center, clusters are classified as error regions or non-error regions. For the 23 real and simulated data sets tested (454 and Solexa), our algorithm detected error regions that cover 99% of all errors. A heuristic algorithm is then applied to detect the location of errors in each putative error region. A read is corrected by removing the errors, thereby creating two or more smaller, error-free read fragments. After error removal, the error rate decreased for all data sets tested (∼35-fold reduction, on average). EDAR has accuracy comparable to methods that correct rather than remove errors, and it performs better on simulated data sets when the error rate is greater than 3%. The performance of the Velvet assembler is generally better with error-removed data. However, for short reads, splitting at the location of errors can be problematic. Following error detection with error correction, rather than removal, may improve the assembly results.
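The k-count idea in the abstract can be illustrated with a minimal Python sketch. This is not the EDAR implementation: the variable-bandwidth mean-shift clustering and the error-location heuristic are replaced by a plain fixed threshold (`min_count`), and all function names, the toy reads, and the parameter values are assumptions made for illustration only.

```python
from collections import Counter


def kmers(read, k):
    """Return every length-k substring of the read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads, k):
    """Build the k-count table: occurrences of each k-mer in the whole data set."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def kcount_profile(read, counts, k):
    """k-count of each k-mer along the read, by k-mer start position."""
    return [counts[km] for km in kmers(read, k)]


def split_read(read, counts, k, min_count):
    """Split a read into fragments covered only by 'trusted' k-mers.

    Assumption: a k-mer is trusted when its k-count is >= min_count.
    This fixed threshold stands in for the clustering step described
    in the abstract; it is a simplification, not the published method.
    """
    fragments = []
    start = None  # start index of the current run of trusted k-mers
    for i, c in enumerate(kcount_profile(read, counts, k)):
        if c >= min_count:
            if start is None:
                start = i
        elif start is not None:
            # The last trusted k-mer starts at i - 1 and ends at i - 1 + k.
            fragments.append(read[start:i - 1 + k])
            start = None
    if start is not None:
        fragments.append(read[start:])
    return fragments


# Toy data set: five error-free copies of a read plus one copy with a
# single substitution (A -> C at 0-based position 8).
correct = "ACGTTGCAAGCTTAGG"
bad = "ACGTTGCACGCTTAGG"
reads = [correct] * 5 + [bad]
counts = count_kmers(reads, k=4)

print(kcount_profile(bad, counts, k=4))
print(split_read(bad, counts, k=4, min_count=2))
```

In this toy case the four k-mers that overlap the substituted base occur only once in the data set, so the read is split into two shorter fragments on either side of the error. The same example also shows why splitting can hurt short reads: each error costs the read at least one base and leaves fragments that may be too short to assemble.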


Year:  2010        PMID: 20973743     DOI: 10.1089/cmb.2010.0127

Source DB:  PubMed          Journal:  J Comput Biol        ISSN: 1066-5277            Impact factor:   1.479

