Improved Search of Large Transcriptomic Sequencing Databases Using Split Sequence Bloom Trees.

Brad Solomon, Carl Kingsford.

Abstract

Enormous databases of short-read RNA-seq experiments such as the NIH Sequence Read Archive are now available. These databases could answer many questions about condition-specific expression or population variation, and this resource is only going to grow over time. However, these collections remain difficult to use because they cannot be searched for a particular expressed sequence. Although some progress has been made on this problem, it is still not feasible to search collections of hundreds of terabytes of short-read sequencing experiments. We introduce an indexing scheme called split sequence bloom trees (SSBTs) to support sequence-based querying of terabyte-scale collections of thousands of short-read sequencing experiments. SSBT is an improvement over the sequence bloom tree (SBT) data structure for the same task. We apply SSBTs to the problem of finding conditions under which query transcripts are expressed. Our experiments are conducted on a set of 2652 publicly available RNA-seq experiments for the breast, blood, and brain tissues. We demonstrate that this SSBT index can be queried for a 1000 nt sequence in <4 minutes using a single thread and can be stored in just 39 GB, a fivefold improvement in search and storage costs compared with SBT.
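
The abstract does not describe how the index works internally, so the following is only a minimal Python sketch of the baseline idea behind a sequence bloom tree, not the authors' implementation and not the split (SSBT) variant. It assumes that each leaf holds a Bloom filter of the k-mers in one experiment, that each internal node holds the union of its children's filters, and that a query descends only into subtrees where at least a fraction `theta` of the query k-mers appear present. The names `Node`, `query`, `K`, `M`, `NUM_HASHES`, and `theta`, and all parameter values, are hypothetical toy choices.

```python
import hashlib

K = 5           # k-mer length (toy value, assumption)
M = 1024        # Bloom filter size in bits (toy value, assumption)
NUM_HASHES = 2  # hash functions per k-mer (toy value, assumption)


def kmers(seq, k=K):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def positions(kmer):
    """Bit positions for one k-mer, derived from salted SHA-256 digests."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{kmer}".encode()).digest()[:8], "big") % M
        for i in range(NUM_HASHES)
    ]


class Node:
    """Tree node whose Bloom filter is stored as a Python int bitmask."""

    def __init__(self, name=None, reads=(), children=()):
        self.name = name
        self.children = list(children)
        self.bits = 0
        # Leaf content: insert every k-mer of every read.
        for read in reads:
            for km in kmers(read):
                for p in positions(km):
                    self.bits |= 1 << p
        # Internal node: union (bitwise OR) of the children's filters.
        for child in self.children:
            self.bits |= child.bits

    def contains(self, kmer):
        """True if the k-mer may be present (false positives are possible)."""
        return all((self.bits >> p) & 1 for p in positions(kmer))


def query(node, seq, theta=0.8):
    """Return names of leaves where >= theta of the query k-mers appear present."""
    qk = kmers(seq)
    hits = sum(node.contains(km) for km in qk)
    if hits < theta * len(qk):
        return []  # prune: no leaf below this node can pass the threshold
    if not node.children:
        return [node.name]
    found = []
    for child in node.children:
        found.extend(query(child, seq, theta))
    return found


# Toy usage: two "experiments" with one read each.
breast = Node("breast", reads=["ACGTACGTTGCA"])
brain = Node("brain", reads=["TTGACCATGGCATCA"])
root = Node(children=[breast, brain])
print(query(root, "GACCATGGCA"))
```

Because Bloom filters never produce false negatives, a leaf that truly contains all query k-mers is always reported; unrelated leaves can occasionally be reported as well, at a rate that depends on the filter size and number of hash functions.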

Keywords:  RNA-seq; data indexing; sequence bloom trees; sequence search.

Year:  2018        PMID: 29641248      PMCID: PMC6067102          DOI: 10.1089/cmb.2017.0265

Source DB:  PubMed          Journal:  J Comput Biol        ISSN: 1066-5277            Impact factor:   1.479


  13 in total

1.  Compressive genomics.

Authors:  Po-Ru Loh; Michael Baym; Bonnie Berger
Journal:  Nat Biotechnol       Date:  2012-07-10       Impact factor: 54.908

2.  Efficient q-gram filters for finding all epsilon-matches over a given length.

Authors:  Kim R Rasmussen; Jens Stoye; Eugene W Myers
Journal:  J Comput Biol       Date:  2006-03       Impact factor: 1.479

3.  Entropy-scaling search of massive biological data.

Authors:  Y William Yu; Noah M Daniels; David Christian Danko; Bonnie Berger
Journal:  Cell Syst       Date:  2015-08-26       Impact factor: 10.304

4.  BLAST+: architecture and applications.

Authors:  Christiam Camacho; George Coulouris; Vahram Avagyan; Ning Ma; Jason Papadopoulos; Kevin Bealer; Thomas L Madden
Journal:  BMC Bioinformatics       Date:  2009-12-15       Impact factor: 3.169

5.  The sequence read archive.

Authors:  Rasko Leinonen; Hideaki Sugawara; Martin Shumway
Journal:  Nucleic Acids Res       Date:  2010-11-09       Impact factor: 16.971

6.  Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms.

Authors:  Rob Patro; Stephen M Mount; Carl Kingsford
Journal:  Nat Biotechnol       Date:  2014-04-20       Impact factor: 54.908

7.  These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure.

Authors:  Qingpeng Zhang; Jason Pell; Rosangela Canino-Koning; Adina Chuang Howe; C Titus Brown
Journal:  PLoS One       Date:  2014-07-25       Impact factor: 3.240

8.  Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage.

Authors:  Guillaume Holley; Roland Wittler; Jens Stoye
Journal:  Algorithms Mol Biol       Date:  2016-04-14       Impact factor: 1.405

9.  Compressive genomics for protein databases.

Authors:  Noah M Daniels; Andrew Gallant; Jian Peng; Lenore J Cowen; Michael Baym; Bonnie Berger
Journal:  Bioinformatics       Date:  2013-07-01       Impact factor: 6.937

10.  CRAC: an integrated approach to the analysis of RNA-seq reads.

Authors:  Nicolas Philippe; Mikaël Salson; Thérèse Commes; Eric Rivals
Journal:  Genome Biol       Date:  2013-03-28       Impact factor: 13.583

  19 in total

1.  Improved representation of sequence bloom trees.

Authors:  Robert S Harris; Paul Medvedev
Journal:  Bioinformatics       Date:  2020-02-01       Impact factor: 6.937

2.  Lossless indexing with counting de Bruijn graphs.

Authors:  Mikhail Karasikov; Harun Mustafa; Gunnar Rätsch; André Kahles
Journal:  Genome Res       Date:  2022-05-24       Impact factor: 9.438

3.  SPRISS: Approximating Frequent K-mers by Sampling Reads, and Applications.

Authors:  Diego Santoro; Leonardo Pellegrina; Matteo Comin; Fabio Vandin
Journal:  Bioinformatics       Date:  2022-05-18       Impact factor: 6.931

4.  CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices.

Authors:  Shaopeng Liu; David Koslicki
Journal:  Bioinformatics       Date:  2022-06-24       Impact factor: 6.931

5.  An Efficient, Scalable, and Exact Representation of High-Dimensional Color Information Enabled Using de Bruijn Graph Search.

Authors:  Fatemeh Almodaresi; Prashant Pandey; Michael Ferdman; Rob Johnson; Rob Patro
Journal:  J Comput Biol       Date:  2020-03-16       Impact factor: 1.479

6.  To Petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics.

Authors:  R A Leo Elworth; Qi Wang; Pavan K Kota; C J Barberan; Benjamin Coleman; Advait Balaji; Gaurav Gupta; Richard G Baraniuk; Anshumali Shrivastava; Todd J Treangen
Journal:  Nucleic Acids Res       Date:  2020-06-04       Impact factor: 16.971

7.  Improved Search of Large Transcriptomic Sequencing Databases Using Split Sequence Bloom Trees.

Authors:  Brad Solomon; Carl Kingsford
Journal:  J Comput Biol       Date:  2018-03-12       Impact factor: 1.479

8.  An Incrementally Updatable and Scalable System for Large-Scale Sequence Search using the Bentley-Saxe Transformation.

Authors:  Fatemeh Almodaresi; Jamshed Khan; Sergey Madaminov; Michael Ferdman; Rob Johnson; Prashant Pandey; Rob Patro
Journal:  Bioinformatics       Date:  2022-03-23       Impact factor: 6.931

9.  Expected 10-anonymity of HyperLogLog sketches for federated queries of clinical data repositories.

Authors:  Ziye Tao; Griffin M Weber; Yun William Yu
Journal:  Bioinformatics       Date:  2021-07-12       Impact factor: 6.931

10.  Disk compression of k-mer sets.

Authors:  Amatur Rahman; Rayan Chikhi; Paul Medvedev
Journal:  Algorithms Mol Biol       Date:  2021-06-21       Impact factor: 1.405
