Lossless indexing with counting de Bruijn graphs.

Mikhail Karasikov, Harun Mustafa, Gunnar Rätsch, André Kahles

Abstract

Sequencing data are rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing. There have recently been remarkable advances in building compressed representations of annotated (or colored) de Bruijn graphs for efficiently indexing k-mer sets. However, approaches for representing quantitative attributes such as gene expression or genome positions in a general manner have remained underexplored. In this work, we propose counting de Bruijn graphs, a notion generalizing annotated de Bruijn graphs by supplementing each node-label relation with one or many attributes (e.g., a k-mer count or its positions). Counting de Bruijn graphs index k-mer abundances from 2652 human RNA-seq samples in over eightfold smaller representations compared with state-of-the-art bioinformatics tools and are faster to construct and query. Furthermore, counting de Bruijn graphs with positional annotations losslessly represent entire reads in indexes on average 27% smaller than the input compressed with gzip for human Illumina RNA-seq and 57% smaller for Pacific Biosciences (PacBio) HiFi sequencing of viral samples. A complete searchable index of all viral PacBio SMRT reads from NCBI's Sequence Read Archive (SRA) (152,884 samples, 875 Gbp) comprises only 178 GB. Finally, on the full RefSeq collection, we generate a lossless and fully queryable index that is 4.6-fold smaller than the MegaBLAST index. The techniques proposed in this work naturally complement existing methods and tools using de Bruijn graphs and significantly broaden their applicability: from indexing k-mer counts and genome positions to implementing novel sequence alignment algorithms on top of highly compressed graph-based sequence indexes.
© 2022 Karasikov et al.; Published by Cold Spring Harbor Laboratory Press.
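
The core idea in the abstract, attaching a count or a list of positions to each node-label relation, can be illustrated with a toy in-memory structure. The following is a minimal sketch only: the class `ToyCountingDBG` and its methods are hypothetical names chosen for illustration, plain dictionaries stand in for the compressed representations the paper describes, and it assumes every labeled sequence is at least k characters long.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield (start position, k-mer) for every k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


class ToyCountingDBG:
    """Hypothetical, uncompressed illustration of a counting de Bruijn graph."""

    def __init__(self, k):
        self.k = k
        # node (k-mer) -> label -> start positions of the k-mer in that sequence
        self.positions = defaultdict(lambda: defaultdict(list))

    def add(self, label, seq):
        """Index one labeled sequence, recording where each k-mer occurs."""
        for pos, kmer in kmers(seq, self.k):
            self.positions[kmer][label].append(pos)

    def count(self, kmer, label):
        """Abundance of a k-mer under a label (the 'count' attribute)."""
        node = self.positions.get(kmer)
        return len(node[label]) if node and label in node else 0

    def successors(self, kmer):
        """Outgoing edges: indexed k-mers overlapping this one by k-1 characters."""
        return [kmer[1:] + c for c in "ACGT" if kmer[1:] + c in self.positions]

    def reconstruct(self, label):
        """Rebuild the labeled sequence from its positional annotations."""
        by_pos = {}
        for kmer, labels in self.positions.items():
            for pos in labels.get(label, ()):
                by_pos[pos] = kmer
        if not by_pos:
            return ""
        ordered = [by_pos[i] for i in range(len(by_pos))]
        return ordered[0] + "".join(km[-1] for km in ordered[1:])


g = ToyCountingDBG(k=3)
g.add("read1", "ACGTACGA")
print(g.count("ACG", "read1"))   # abundance of ACG in read1
print(g.reconstruct("read1"))    # sequence recovered from positions alone
```

Because every k-mer occurrence keeps its start position, the original sequence can be recovered exactly, which is the sense in which positional annotations make the index lossless. The sketch does not attempt the compression that gives the reported index sizes.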

Year:  2022        PMID: 35609994      PMCID: PMC9528980          DOI: 10.1101/gr.276607.122

Source DB:  PubMed          Journal:  Genome Res        ISSN: 1088-9051            Impact factor:   9.438

