An Efficient, Scalable, and Exact Representation of High-Dimensional Color Information Enabled Using de Bruijn Graph Search.

Fatemeh Almodaresi, Prashant Pandey, Michael Ferdman, Rob Johnson, Rob Patro.

Abstract

The colored de Bruijn graph (cdbg) and its variants have become an important combinatorial structure used in numerous areas of genomics, such as population-level variation detection in metagenomic samples, large-scale sequence search, and cdbg-based reference sequence indices. As samples or genomes are added to the cdbg, the color information comes to dominate the space required to represent this data structure. In this article, we show how to represent the color information efficiently by adopting a hierarchical encoding that exploits correlations among color classes (patterns of color occurrence) present in the de Bruijn graph (dbg). A major challenge in deriving an efficient encoding of the color information that takes advantage of such correlations is determining which color classes are close to each other in the high-dimensional space of possible color patterns. We demonstrate that the dbg itself can be used as an efficient mechanism to search for approximate nearest neighbors in this space. While our approach reduces the encoding size of the color information even for relatively small cdbgs (hundreds of experiments), the gains are particularly consequential as the number of potential colors (i.e., samples or references) grows into the thousands. We apply this encoding in the context of two different applications: the implicit cdbg used for a large-scale sequence search index, Mantis, and the encoding of color information used in population-level variation detection by tools such as Vari and Rainbowfish. Our results show significant improvements in the overall size and scalability of the representation of the color information. In our experiment on 10,000 samples, we achieved more than 11× better compression compared with Raman, Raman, and Rao (RRR) encoding.
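The abstract does not spell out the encoding itself, so the following is only a toy sketch of how a hierarchical, correlation-exploiting encoding of color classes could look. It assumes, purely for illustration, that each color class is a bit vector over samples (stored here as a Python `int`), that `candidate_edges` lists pairs of color classes whose k-mers are adjacent in the dbg (standing in for the approximate nearest-neighbor search), and that each class is stored as its XOR difference from a parent chosen by a minimum spanning tree over Hamming distances. The function names and the virtual all-zero root are hypothetical and are not taken from Mantis, Vari, or Rainbowfish.

```python
from collections import defaultdict, deque


def popcount(x):
    """Number of set bits, i.e. the Hamming weight of a bit vector."""
    return bin(x).count("1")


def build_delta_tree(colors, candidate_edges):
    """Build a spanning tree over color classes and XOR deltas to parents.

    colors: list of ints, one bit per sample (illustrative representation).
    candidate_edges: (u, v) index pairs, assumed to come from dbg adjacency.
    A virtual all-zero root (index len(colors)) is linked to every class so
    the tree is always connected.
    """
    n = len(colors)
    root = n
    vecs = list(colors) + [0]
    edges = [(popcount(vecs[u] ^ vecs[v]), u, v) for u, v in candidate_edges]
    edges += [(popcount(vecs[u]), u, root) for u in range(n)]
    edges.sort()

    # Kruskal's algorithm with a small union-find.
    uf = list(range(n + 1))

    def find(x):
        while uf[x] != x:
            uf[x] = uf[uf[x]]
            x = uf[x]
        return x

    adj = defaultdict(list)
    for _, u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            uf[ru] = rv
            adj[u].append(v)
            adj[v].append(u)

    # Orient the tree away from the virtual root to assign parents.
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)

    # Each class stores only the bits in which it differs from its parent.
    deltas = {v: vecs[v] ^ vecs[p] for v, p in parent.items() if p is not None}
    return parent, deltas


def decode(node, parent, deltas):
    """Recover the exact color vector by XOR-ing deltas up to the root."""
    vec = 0
    while parent[node] is not None:
        vec ^= deltas[node]
        node = parent[node]
    return vec
```

In this sketch, similar color classes cost only a few stored bits each because they hang off one another in the tree, while decoding remains exact; the price is that a lookup must walk from the class up to the root.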

Keywords: RNA-sequence search; compression schemes; de Bruijn graph; approximate membership query

Year: 2020   PMID: 32176522   PMCID: PMC7185321   DOI: 10.1089/cmb.2019.0322

Source DB: PubMed   Journal: J Comput Biol   ISSN: 1066-5277   Impact factor: 1.479


