
Compressed indexing and local alignment of DNA.

T W Lam, W K Sung, S L Tam, C K Wong, S M Yiu.

Abstract

MOTIVATION: Recent experimental studies on compressed indexes (BWT, CSA, FM-index) have confirmed their practicality for indexing very long strings such as the human genome in main memory. For example, a BWT index for the human genome (about 3 billion characters) occupies only around 1 GB. However, these indexes are designed for exact pattern matching, which is too stringent for biological applications. What is usually needed is to find local alignments (pairs of similar substrings, with gaps allowed). Without indexing, one can use dynamic programming to find all the local alignments between a text T and a pattern P in O(|T||P|) time, but this is too slow when the text is of genome scale (e.g. aligning a gene with the human genome would take tens to hundreds of hours). In practice, biologists use heuristic-based software such as BLAST, which is very efficient but is not guaranteed to find all local alignments.
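To make the O(|T||P|) baseline concrete, here is a minimal Python sketch of unindexed local alignment by dynamic programming (Smith-Waterman with a linear gap penalty). The function name and the scoring values (match +1, mismatch -3, gap -2) are illustrative assumptions, not taken from the paper, and the sketch reports only the single best-scoring alignment end point rather than all local alignments.

```python
def smith_waterman(text, pattern, match=1, mismatch=-3, gap=-2):
    """Return (best_score, end_in_text, end_in_pattern) for the best local alignment.

    End positions are 1-based. Scores are assumed values for illustration.
    Every (i, j) cell is visited once, hence O(|T||P|) time.
    """
    best = (0, 0, 0)
    prev = [0] * (len(pattern) + 1)          # DP row for text prefix i-1
    for i in range(1, len(text) + 1):
        curr = [0] * (len(pattern) + 1)      # DP row for text prefix i
        for j in range(1, len(pattern) + 1):
            sub = match if text[i - 1] == pattern[j - 1] else mismatch
            score = max(
                0,                   # start a new local alignment here
                prev[j - 1] + sub,   # align text[i-1] with pattern[j-1]
                prev[j] + gap,       # gap in the pattern
                curr[j - 1] + gap,   # gap in the text
            )
            curr[j] = score
            if score > best[0]:
                best = (score, i, j)
        prev = curr
    return best
```

For a genome-scale text the outer loop runs about 3 billion times for every pattern position, which is why the unindexed approach is impractical.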
RESULTS: In this article, we show how to build a tool called BWT-SW that exploits a BWT index of a text T to speed up the dynamic programming for finding all local alignments. Experiments reveal that BWT-SW is very efficient (e.g. aligning a pattern of length 3000 with the human genome takes less than a minute). We have also analyzed BWT-SW mathematically for a simpler similarity model (with gaps disallowed), and we show that the expected running time is O(|T|^0.628 |P|) for random strings. As far as we know, BWT-SW is the first practical tool that can find all local alignments. Yet BWT-SW is not meant to replace BLAST: BLAST is still several times faster than BWT-SW for long patterns, and BLAST is accurate enough in most cases (we used BWT-SW to check the accuracy of BLAST and found that BLAST only rarely misses significant alignments).
AVAILABILITY: www.cs.hku.hk/~ckwong3/bwtsw
CONTACT: twlam@cs.hku.hk.
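For readers unfamiliar with BWT indexes, the following Python sketch shows the exact-matching primitive (backward search) that such an index provides. It is not the BWT-SW algorithm, which combines the index with dynamic programming. The function names are hypothetical, and the tables are stored uncompressed and built by naive suffix sorting, so the sketch only suits short strings.

```python
from collections import Counter

def build_index(text):
    """Build a toy, uncompressed BWT index: suffix array, C table, occurrence table."""
    text += "$"                                   # sentinel, sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])   # naive suffix sort
    bwt = "".join(text[i - 1] for i in sa)        # character preceding each suffix
    counts = Counter(text)
    C, total = {}, 0                              # C[c]: number of characters smaller than c
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    occ = {c: [0] * (len(bwt) + 1) for c in counts}   # occ[c][k]: count of c in bwt[:k]
    for k, ch in enumerate(bwt):
        for c in occ:
            occ[c][k + 1] = occ[c][k] + (ch == c)
    return sa, C, occ

def locate(pattern, sa, C, occ):
    """Return the sorted 0-based start positions of exact occurrences of pattern."""
    lo, hi = 0, len(sa)                           # half-open suffix-array interval
    for c in reversed(pattern):                   # extend the match one character leftwards
        if c not in C:
            return []
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])

sa, C, occ = build_index("ACGTACGT")
print(locate("CGT", sa, C, occ))
```

Each pattern character costs a constant number of table lookups, so the search time depends on |P| rather than |T|. Practical compressed indexes keep only sampled portions of these tables to reach the memory footprint quoted in the abstract.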

Year:  2008        PMID: 18227115     DOI: 10.1093/bioinformatics/btn032

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  20 in total

Review 1.  A survey of sequence alignment algorithms for next-generation sequencing.

Authors:  Heng Li; Nils Homer
Journal:  Brief Bioinform       Date:  2010-05-11       Impact factor: 11.622

2.  Short Read Mapping: An Algorithmic Tour.

Authors:  Stefan Canzar; Steven L Salzberg
Journal:  Proc IEEE Inst Electr Electron Eng       Date:  2015-09-07       Impact factor: 10.961

3.  BarraCUDA - a fast short read sequence aligner using graphics processing units.

Authors:  Petr Klus; Simon Lam; Dag Lyberg; Ming Sin Cheung; Graham Pullan; Ian McFarlane; Giles Sh Yeo; Brian Yh Lam
Journal:  BMC Res Notes       Date:  2012-01-13

4.  Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.

Authors:  Ben Langmead; Cole Trapnell; Mihai Pop; Steven L Salzberg
Journal:  Genome Biol       Date:  2009-03-04       Impact factor: 13.583

5.  How do alignment programs perform on sequencing data with varying qualities and from repetitive regions?

Authors:  Xiaoqing Yu; Kishore Guda; Joseph Willis; Martina Veigl; Zhenghe Wang; Sanford Markowitz; Mark D Adams; Shuying Sun
Journal:  BioData Min       Date:  2012-06-18       Impact factor: 2.522

6.  Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors:  Heng Li; Richard Durbin
Journal:  Bioinformatics       Date:  2009-05-18       Impact factor: 6.937

7.  Fast and accurate long-read alignment with Burrows-Wheeler transform.

Authors:  Heng Li; Richard Durbin
Journal:  Bioinformatics       Date:  2010-01-15       Impact factor: 6.937

8.  Levenshtein Distance, Sequence Comparison and Biological Database Search.

Authors:  Bonnie Berger; Michael S Waterman; Yun William Yu
Journal:  IEEE Trans Inf Theory       Date:  2020-05-21       Impact factor: 2.501

9.  Long read alignment based on maximal exact match seeds.

Authors:  Yongchao Liu; Bertil Schmidt
Journal:  Bioinformatics       Date:  2012-09-15       Impact factor: 6.937

10.  SAP--a sequence mapping and analyzing program for long sequence reads alignment and accurate variants discovery.

Authors:  Zheng Sun; Weidong Tian
Journal:  PLoS One       Date:  2012-08-07       Impact factor: 3.240
