
A memory-efficient data structure representing exact-match overlap graphs with application for next-generation DNA assembly.

Hieu Dinh, Sanguthevar Rajasekaran.

Abstract

MOTIVATION: Exact-match overlap graphs have been broadly used in the context of DNA assembly and the shortest superstring problem, where the number of strings n ranges from thousands to billions. The length ℓ of the strings ranges from 25 to 1000, depending on the DNA sequencing technology. However, many DNA assemblers that use overlap graphs need too much time and space to construct the graphs. It is nearly impossible for these assemblers to handle the huge amount of data produced by next-generation sequencing technologies, where the number n of strings could be several billion. If the overlap graph is stored explicitly, it requires Ω(n²) memory, which could be prohibitive in practice when n is greater than a hundred million. In this article, we propose a novel data structure that stores the overlap graph compactly. This data structure requires only linear time to construct and linear memory to store.
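To make the quadratic cost concrete, the following back-of-the-envelope sketch estimates the size of an explicitly stored adjacency matrix. The layout (one bit per ordered pair of reads) and the function name `dense_graph_bytes` are assumptions chosen for illustration; they are not taken from the article.

```python
def dense_graph_bytes(n: int, bits_per_entry: int = 1) -> int:
    """Bytes for an n x n adjacency matrix (assumed layout: one entry per ordered pair)."""
    return n * n * bits_per_entry // 8


# Report the estimate for a few read counts in the range the abstract mentions.
for n in (10**6, 10**8, 10**9):
    print(f"n = {n:>13,}: {dense_graph_bytes(n) / 1e12:,.3f} TB")
```

Even under this optimistic one-bit assumption, n = 10^8 reads already needs 1.25 × 10^15 bytes, which suggests why an explicit representation becomes impractical at that scale.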
RESULTS: For a given set of input strings (also called reads), we can informally define an exact-match overlap graph as follows. Each read is represented as a node in the graph, and there is an edge between two nodes if the corresponding reads overlap sufficiently. A formal description follows. The maximal exact-match overlap of two strings x and y, denoted by ov_max(x, y), is the longest string that is a suffix of x and a prefix of y. The exact-match overlap graph of n given strings of length ℓ is an edge-weighted graph in which each vertex is associated with a string and there is an edge (x, y) of weight ω = ℓ - |ov_max(x, y)| if and only if ω ≤ λ, where |ov_max(x, y)| is the length of ov_max(x, y) and λ is a given threshold. In this article, we show that exact-match overlap graphs can be represented by a compact data structure that can be stored using at most (2λ - 1)(2⌈log n⌉ + ⌈log λ⌉)n bits, with a guarantee that the basic operation of accessing an edge takes O(log λ) time. We also propose two algorithms for constructing the data structure for the exact-match overlap graph. The first algorithm runs in O(λℓn log n) worst-case time and requires O(λ) extra memory. The second one runs in O(λℓn) time and requires O(n) extra memory. Our experimental results on a huge amount of simulated data from sequence assembly show that the data structure can be constructed efficiently in time and memory.
AVAILABILITY: Our DNA sequence assembler that incorporates the data structure is freely available on the web at http://www.engr.uconn.edu/~htd06001/assembler/leap.zip
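The definitions in the RESULTS paragraph can be illustrated with a small brute-force sketch. This is not the authors' compact data structure or either of their construction algorithms; it is a naive quadratic-time illustration of the maximal exact-match overlap, the edge weight ω = ℓ - |overlap| and the stated bit bound. The function names are hypothetical, self-pairs are skipped by assumption, and the logarithms in the bound are assumed to be base 2.

```python
from math import ceil, log2


def ov_max_len(x: str, y: str) -> int:
    """Length of the longest string that is both a suffix of x and a prefix of y."""
    for k in range(min(len(x), len(y)), 0, -1):
        if x.endswith(y[:k]):
            return k
    return 0


def overlap_graph(reads: list[str], lam: int) -> list[tuple[int, int, int]]:
    """Naive construction: edges (i, j, w) with w = ell - |ov_max| <= lam."""
    ell = len(reads[0])
    if any(len(r) != ell for r in reads):
        raise ValueError("all reads must have the same length")
    edges = []
    for i, x in enumerate(reads):
        for j, y in enumerate(reads):
            if i == j:
                continue  # assumption: a read is not compared with itself
            w = ell - ov_max_len(x, y)
            if w <= lam:
                edges.append((i, j, w))
    return edges


def bound_bits(n: int, lam: int) -> int:
    """Upper bound (2*lam - 1)(2*ceil(log n) + ceil(log lam)) * n, assuming base-2 logs."""
    return (2 * lam - 1) * (2 * ceil(log2(n)) + ceil(log2(lam))) * n


reads = ["ACGTA", "GTACC", "TACCG"]
print(overlap_graph(reads, lam=3))  # [(0, 1, 2), (0, 2, 3), (1, 2, 1)]
print(bound_bits(n=3, lam=3))       # 90
```

In the toy input, "ACGTA" and "GTACC" share the overlap "GTA" of length 3, so the edge weight is 5 - 3 = 2, which is within the threshold λ = 3.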

Year:  2011        PMID: 21636593      PMCID: PMC3129531          DOI: 10.1093/bioinformatics/btr321

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  3 in total

1.  The fragment assembly string graph.

Authors:  Eugene W Myers
Journal:  Bioinformatics       Date:  2005-09-01       Impact factor: 6.937

2.  Genome assembly reborn: recent computational challenges.

Authors:  Mihai Pop
Journal:  Brief Bioinform       Date:  2009-05-29       Impact factor: 11.622

3.  Efficient construction of an assembly string graph using the FM-index.

Authors:  Jared T Simpson; Richard Durbin
Journal:  Bioinformatics       Date:  2010-06-15       Impact factor: 6.937

  4 in total

1.  Readjoiner: a fast and memory efficient string graph-based sequence assembler.

Authors:  Giorgio Gonnella; Stefan Kurtz
Journal:  BMC Bioinformatics       Date:  2012-05-06       Impact factor: 3.169

2.  A Practical and Scalable Tool to Find Overlaps between Sequences.

Authors:  Maan Haj Rachid; Qutaibah Malluhi
Journal:  Biomed Res Int       Date:  2015-04-19       Impact factor: 3.411

3.  Evolutionary potential, cross-stress behavior and the genetic basis of acquired stress resistance in Escherichia coli.

Authors:  Martin Dragosits; Vadim Mozhayskiy; Semarhy Quinones-Soto; Jiyeon Park; Ilias Tagkopoulos
Journal:  Mol Syst Biol       Date:  2013       Impact factor: 11.429

4.  A Comprehensive Study of De Novo Genome Assemblers: Current Challenges and Future Prospective.

Authors:  Abdul Rafay Khan; Muhammad Tariq Pervez; Masroor Ellahi Babar; Nasir Naveed; Muhammad Shoaib
Journal:  Evol Bioinform Online       Date:  2018-02-20       Impact factor: 1.625

