
Storage and retrieval of highly repetitive sequence collections.

Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, Niko Välimäki.

Abstract

A repetitive sequence collection is a set of sequences that are small variations of each other. A prominent example is the genome sequences of individuals of the same or closely related species, where the differences can be expressed by short lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, the suffix tree occupies much space, which very soon inhibits in-memory analyses. Recent advances in full-text indexing reduce the space of the suffix tree to, essentially, that of the compressed sequences, while retaining its functionality with only a polylogarithmic slowdown. However, the underlying compression model considers only the predictability of the next sequence symbol given the k previous ones, where k is a small integer. This model is unable to capture longer-term repetitiveness. For example, r identical copies of an incompressible sequence will be incompressible under this model. We develop new static and dynamic full-text indexes that are capable of capturing the fact that a collection is highly repetitive, and that require space basically proportional to the length of one typical sequence plus the total number of edit operations. The new indexes can be plugged into a recent dynamic fully-compressed suffix tree, achieving full functionality for sequence analysis, while retaining the reduced space and the polylogarithmic slowdown. Our experimental results confirm the practicality of our proposal.
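The limitation of the k-th order model described in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the function names `empirical_entropy_k` and `bwt_runs`, the random test sequence, and the choice of k = 2 are assumptions made for illustration only. The number of equal-letter runs in the Burrows-Wheeler transform is used here merely as one example of a repetition-aware measure; the abstract itself does not specify the internals of the new indexes.

```python
import math
import random
from collections import Counter, defaultdict


def empirical_entropy_k(text: str, k: int) -> float:
    """Empirical k-th order entropy of `text`, in bits per symbol.

    Each symbol is charged according to how predictable it is from
    the k symbols preceding it (the first k symbols are skipped).
    """
    contexts = defaultdict(Counter)
    for i in range(k, len(text)):
        contexts[text[i - k:i]][text[i]] += 1
    total_bits = 0.0
    for counts in contexts.values():
        n_ctx = sum(counts.values())
        for c in counts.values():
            total_bits -= c * math.log2(c / n_ctx)
    return total_bits / len(text)


def bwt_runs(text: str) -> int:
    """Number of equal-letter runs in the BWT of `text` + '$'.

    Uses a naive rotation sort, which is only suitable for short inputs.
    """
    s = text + "$"
    order = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    bwt = [s[i - 1] for i in order]  # symbol preceding each sorted rotation
    return 1 + sum(1 for a, b in zip(bwt, bwt[1:]) if a != b)


if __name__ == "__main__":
    rng = random.Random(0)
    base = "".join(rng.choices("ACGT", k=200))  # stand-in for an incompressible sequence
    collection = base * 8                       # r = 8 identical copies

    for name, seq in (("one copy", base), ("eight copies", collection)):
        print(name, len(seq),
              round(empirical_entropy_k(seq, 2), 3),
              bwt_runs(seq))
```

In this sketch the bits-per-symbol estimate for the eight-copy collection is no lower than for a single copy, so the total size predicted by the k-th order model grows with the number of copies. The run count, by contrast, stays bounded by a small multiple of the single-copy length rather than growing with the whole collection.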


Year:  2010        PMID: 20377446     DOI: 10.1089/cmb.2009.0169

Source DB:  PubMed          Journal:  J Comput Biol        ISSN: 1066-5277            Impact factor:   1.479


  19 in total

1.  Compressive genomics.

Authors:  Po-Ru Loh; Michael Baym; Bonnie Berger
Journal:  Nat Biotechnol       Date:  2012-07-10       Impact factor: 54.908

2.  The design and construction of reference pangenome graphs with minigraph.

Authors:  Heng Li; Xiaowen Feng; Chong Chu
Journal:  Genome Biol       Date:  2020-10-16       Impact factor: 13.583

Review 3.  Pangenome Graphs.

Authors:  Jordan M Eizenga; Adam M Novak; Jonas A Sibbesen; Simon Heumos; Ali Ghaffaari; Glenn Hickey; Xian Chang; Josiah D Seaman; Robin Rounthwaite; Jana Ebler; Mikko Rautiainen; Shilpa Garg; Benedict Paten; Tobias Marschall; Jouni Sirén; Erik Garrison
Journal:  Annu Rev Genomics Hum Genet       Date:  2020-05-26       Impact factor: 8.929

4.  Document retrieval on repetitive string collections.

Authors:  Travis Gagie; Aleksi Hartikainen; Kalle Karhu; Juha Kärkkäinen; Gonzalo Navarro; Simon J Puglisi; Jouni Sirén
Journal:  Inf Retr Boston       Date:  2017-04-01       Impact factor: 2.293

5.  PFP Compressed Suffix Trees.

Authors:  Christina Boucher; Ondřej Cvacho; Travis Gagie; Jan Holub; Giovanni Manzini; Gonzalo Navarro; Massimiliano Rossi
Journal:  Proc Worksh Algorithm Eng Exp       Date:  2021

6.  deBWT: parallel construction of Burrows-Wheeler Transform for large collection of genomes with de Bruijn-branch encoding.

Authors:  Bo Liu; Dixian Zhu; Yadong Wang
Journal:  Bioinformatics       Date:  2016-06-15       Impact factor: 6.937

Review 7.  Computational pan-genomics: status, promises and challenges.

Authors: 
Journal:  Brief Bioinform       Date:  2018-01-01       Impact factor: 11.622

Review 8.  Prospects and limitations of full-text index structures in genome analysis.

Authors:  Michaël Vyverman; Bernard De Baets; Veerle Fack; Peter Dawyndt
Journal:  Nucleic Acids Res       Date:  2012-05-13       Impact factor: 16.971

9.  Indexes of large genome collections on a PC.

Authors:  Agnieszka Danek; Sebastian Deorowicz; Szymon Grabowski
Journal:  PLoS One       Date:  2014-10-07       Impact factor: 3.240

10.  HIVE-hexagon: high-performance, parallelized sequence alignment for next-generation sequencing data analysis.

Authors:  Luis Santana-Quintero; Hayley Dingerdissen; Jean Thierry-Mieg; Raja Mazumder; Vahan Simonyan
Journal:  PLoS One       Date:  2014-06-11       Impact factor: 3.240
