A new efficient data structure for storage and retrieval of multiple biosequences.

Sascha Steinbiss, Stefan Kurtz.

Abstract

Today's genome analysis applications require sequence representations allowing for fast access to their contents while also being memory-efficient enough to facilitate analyses of large-scale data. While a wide variety of sequence representations exist, lack of a generic implementation of efficient sequence storage has led to a plethora of poorly reusable or programming language-specific implementations. We present a novel, space-efficient data structure (GtEncseq) for storing multiple biological sequences of variable alphabet size, with customizable character transformations, wildcard support and an assortment of internal representations optimized for different distributions of wildcards and sequence lengths. For the human genome (3.1 gigabases, including 237 million wildcard characters) our representation requires only 2 + 8 × 10^-6 bits per character. Implemented in C, our portable software implementation provides a variety of methods for random and sequential access to characters and substrings (including different reading directions) using an object-oriented interface. In addition, it includes access to metadata like sequence descriptions or character distributions. The library is extensible to be used from various scripting languages. GtEncseq is much more versatile than previous solutions, adding features that were previously unavailable. Benchmarks show that it is competitive with respect to space and time requirements.

Year:  2011        PMID: 22084150     DOI: 10.1109/TCBB.2011.146

Source DB:  PubMed          Journal:  IEEE/ACM Trans Comput Biol Bioinform        ISSN: 1545-5963            Impact factor:   3.710


  5 in total

1.  An automated real-time integration and interoperability framework for bioinformatics.

Authors:  Pedro Lopes; José Luís Oliveira
Journal:  BMC Bioinformatics       Date:  2015-10-13       Impact factor: 3.169

2.  Readjoiner: a fast and memory efficient string graph-based sequence assembler.

Authors:  Giorgio Gonnella; Stefan Kurtz
Journal:  BMC Bioinformatics       Date:  2012-05-06       Impact factor: 3.169

3.  Data compression for sequencing data.

Authors:  Sebastian Deorowicz; Szymon Grabowski
Journal:  Algorithms Mol Biol       Date:  2013-11-18       Impact factor: 1.405

4.  Vertical lossless genomic data compression tools for assembled genomes: A systematic literature review.

Authors:  Kelvin V Kredens; Juliano V Martins; Osmar B Dordal; Mauri Ferrandin; Roberto H Herai; Edson E Scalabrin; Bráulio C Ávila
Journal:  PLoS One       Date:  2020-05-26       Impact factor: 3.240

5.  LTRsift: a graphical user interface for semi-automatic classification and postprocessing of de novo detected LTR retrotransposons.

Authors:  Sascha Steinbiss; Sascha Kastens; Stefan Kurtz
Journal:  Mob DNA       Date:  2012-11-07