Efficient frequency-based de novo short-read clustering for error trimming in next-generation sequencing.

Wei Qu, Shin-Ichi Hashimoto, Shinichi Morishita.

Abstract

Novel massively parallel sequencing technologies provide highly detailed structures of transcriptomes and genomes by yielding deep coverage of short reads, but their utility is limited by inadequate sequencing quality and short-read lengths. Sequencing-error trimming in short reads is therefore a vital process that could improve the rate of successful reference mapping and polymorphism detection. Toward this aim, we herein report a frequency-based, de novo short-read clustering method that organizes erroneous short sequences originating in a single abundant sequence into a tree structure; in this structure, each "child" sequence is considered to be stochastically derived from its more abundant "parent" sequence with one mutation through sequencing errors. The root node is the most frequently observed sequence that represents all erroneous reads in the entire tree, allowing the alignment of the reliable representative read to the genome without the risk of mapping erroneous reads to false-positive positions. This method complements base calling and the error correction of making direct alignments with the reference genome, and is able to improve the overall accuracy of short-read alignment by consulting the inherent relationships among the entire set of reads. The algorithm runs efficiently with a linear time complexity. In addition, an error rate evaluation model can be derived from bacterial artificial chromosome sequencing data obtained in the same run as a control. In two clustering experiments using small RNA and 5'-end mRNA reads data sets, we confirmed a remarkable increase (approximately 5%) in the percentage of short reads aligned to the reference sequence.
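
The parent/child tree described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`neighbors`, `build_parents`, `root_of`, `cluster`) are hypothetical, "one mutation" is assumed to mean a single-base substitution, and the rule of choosing the most abundant strictly-more-frequent neighbor as the parent is an assumption. The sketch is not tuned to reproduce the linear time bound stated above.

```python
from collections import Counter

ALPHABET = "ACGT"  # assumption: reads contain only these four bases


def neighbors(seq):
    """Yield every sequence that differs from seq by one substitution."""
    for i, base in enumerate(seq):
        for alt in ALPHABET:
            if alt != base:
                yield seq[:i] + alt + seq[i + 1:]


def build_parents(reads):
    """Map each distinct read to its parent, or to None if it is a root.

    Assumed rule: the parent is the most abundant one-substitution
    neighbor whose count is strictly higher than the read's own count.
    """
    counts = Counter(reads)
    parent = {}
    for seq, n in counts.items():
        best = None
        for cand in neighbors(seq):
            c = counts.get(cand, 0)
            if c > n and (best is None or c > counts[best]):
                best = cand
        parent[seq] = best
    return counts, parent


def root_of(seq, parent):
    """Follow parent links upward; counts strictly increase, so this ends."""
    while parent[seq] is not None:
        seq = parent[seq]
    return seq


def cluster(reads):
    """Return {root sequence: total reads in its tree}."""
    counts, parent = build_parents(reads)
    totals = Counter()
    for seq, n in counts.items():
        totals[root_of(seq, parent)] += n
    return dict(totals)


if __name__ == "__main__":
    reads = ["ACGT"] * 10 + ["ACGA"] * 2 + ["TCGA"] + ["GGGG"] * 3
    print(cluster(reads))
```

In this toy input, `ACGA` attaches to `ACGT` and `TCGA` attaches to `ACGA`, so all 13 of those reads are represented by the root `ACGT`, while `GGGG` has no more abundant neighbor and remains its own root. Only the root sequences would then be aligned to the reference.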

Year:  2009        PMID: 19439514      PMCID: PMC2704438          DOI: 10.1101/gr.089151.108

Source DB:  PubMed          Journal:  Genome Res        ISSN: 1088-9051            Impact factor:   9.043


