
Entropy-scaling search of massive biological data.

Y William Yu, Noah M Daniels, David Christian Danko, Bonnie Berger.

Abstract

Many data sets exhibit well-defined structure that can be exploited to design faster search tools, but it is not always clear when such acceleration is possible. Here we introduce a framework for similarity search based on characterizing a data set's entropy and fractal dimension. We prove that, if the fractal dimension of the data set is low, search time scales with metric entropy (the number of covering hyperspheres), and space scales with the sum of metric entropy and information-theoretic entropy (the randomness of the data). Using these ideas, we present accelerated versions of standard tools, with no loss in specificity and little loss in sensitivity, for use in three domains: high-throughput drug screening (Ammolite, 150x speedup), metagenomics (MICA, 3.5x speedup of DIAMOND and 3700x speedup of BLASTX), and protein structure search (esFragBag, 10x speedup of FragBag). Our framework can be used to achieve 'compressive omics,' and the general theory can be readily applied to data science problems outside of biology. Source code: http://gems.csail.mit.edu.
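
The idea that search time scales with the number of covering hyperspheres can be illustrated with a two-stage range search: a coarse pass over sphere centers, then a fine pass inside only the spheres that could contain a match. The following is a minimal sketch under stated assumptions, not the authors' implementation. The function names, the greedy cover construction, and the use of Euclidean distance are all hypothetical choices made for illustration; any distance obeying the triangle inequality would serve.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_cover(points, cluster_radius, dist=euclidean):
    """Greedily cover the data with spheres of radius cluster_radius.

    Each point joins the first existing center within cluster_radius;
    otherwise it becomes the center of a new sphere.
    Returns a list of (center, members) pairs.
    """
    clusters = []
    for p in points:
        for center, members in clusters:
            if dist(center, p) <= cluster_radius:
                members.append(p)
                break
        else:
            clusters.append((p, [p]))
    return clusters


def range_search(clusters, query, radius, cluster_radius, dist=euclidean):
    """Return every stored point within radius of query.

    Coarse pass: keep only spheres whose center lies within
    radius + cluster_radius of the query. By the triangle inequality,
    a sphere outside that bound cannot contain a point within radius.
    Fine pass: check the members of the surviving spheres directly.
    """
    hits = []
    for center, members in clusters:
        if dist(center, query) <= radius + cluster_radius:
            hits.extend(m for m in members if dist(m, query) <= radius)
    return hits


def brute_force(points, query, radius, dist=euclidean):
    """Reference linear scan used to check the two-stage search."""
    return [p for p in points if dist(p, query) <= radius]
```

In this sketch the coarse pass costs one distance computation per sphere, so its cost tracks the number of spheres rather than the number of points. The fine pass stays small only when few spheres fall near the query, which is where the low fractal dimension assumption enters.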


Year:  2015        PMID: 26436140      PMCID: PMC4591002          DOI: 10.1016/j.cels.2015.08.004

Source DB:  PubMed          Journal:  Cell Syst        ISSN: 2405-4712            Impact factor:   10.304


