Entropy-scaling search of massive biological data.

Y William Yu¹, Noah M Daniels¹, David Christian Danko², Bonnie Berger¹.

Abstract

Many data sets exhibit well-defined structure that can be exploited to design faster search tools, but it is not always clear when such acceleration is possible. Here we introduce a framework for similarity search based on characterizing a data set's entropy and fractal dimension. We prove that searching scales in time with metric entropy (number of covering hyperspheres), if the fractal dimension of the data set is low, and scales in space with the sum of metric entropy and information-theoretic entropy (randomness of the data). Using these ideas, we present accelerated versions of standard tools, with no loss in specificity and little loss in sensitivity, for use in three domains: high-throughput drug screening (Ammolite, 150x speedup), metagenomics (MICA, 3.5x speedup over DIAMOND and 3700x over BLASTX), and protein structure search (esFragBag, 10x speedup over FragBag). Our framework can be used to achieve 'compressive omics,' and the general theory can be readily applied to data science problems outside of biology. Source code: http://gems.csail.mit.edu.
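To make the idea of "covering hyperspheres" concrete, the following is a minimal sketch of a two-stage (coarse, then fine) range search. It is an illustration under stated assumptions, not the implementation used in Ammolite, MICA, or esFragBag: it assumes Euclidean distance, a simple greedy cover, and hypothetical names (`build_cover`, `search`, `cluster_radius`) chosen only for this example. The coarse stage costs one distance computation per covering sphere; the fine stage only looks inside spheres that the triangle inequality cannot rule out.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def build_cover(points: Sequence[Point], cluster_radius: float) -> Dict[Point, List[Point]]:
    """Greedy cover (an assumption of this sketch): each point joins the first
    existing center within cluster_radius, otherwise it becomes a new center."""
    clusters: Dict[Point, List[Point]] = {}
    for p in points:
        for center in clusters:
            if math.dist(p, center) <= cluster_radius:
                clusters[center].append(p)
                break
        else:
            clusters[p] = [p]
    return clusters


def search(clusters: Dict[Point, List[Point]], query: Point,
           radius: float, cluster_radius: float) -> List[Point]:
    """Return all stored points within `radius` of `query`."""
    hits: List[Point] = []
    for center, members in clusters.items():
        # Coarse stage: by the triangle inequality, a member within `radius`
        # of the query must have its center within radius + cluster_radius.
        if math.dist(query, center) <= radius + cluster_radius:
            # Fine stage: check only the members of the surviving spheres.
            hits.extend(m for m in members if math.dist(query, m) <= radius)
    return hits


# Toy data: three well-separated groups, so three covering spheres.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
clusters = build_cover(points, cluster_radius=0.5)
hits = search(clusters, query=(0.05, 0.05), radius=0.1, cluster_radius=0.5)
brute_force = [p for p in points if math.dist((0.05, 0.05), p) <= 0.1]
```

In this toy case the coarse stage discards two of the three spheres before any member is examined, and the result matches a brute-force scan. With a true metric this sketch loses no hits; the trade-offs of the published tools depend on their own distance functions and are not modeled here.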

Year: 2015 PMID: 26436140 PMCID: PMC4591002 DOI: 10.1016/j.cels.2015.08.004

Source DB: PubMed Journal: Cell Syst ISSN: 2405-4712 Impact factor: 10.304

References: 35 in total (10 listed)

1. KEGG: Kyoto Encyclopedia of Genes and Genomes.

Authors: M Kanehisa; S Goto
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. ProtoMap: automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space.

Authors: G Yona; N Linial; M Linial
Journal: Proteins Date: 1999-11-15

3. BLAT--the BLAST-like alignment tool.

Authors: W James Kent
Journal: Genome Res Date: 2002-04 Impact factor: 9.043

4. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path.

Authors: I N Shindyalov; P E Bourne
Journal: Protein Eng Date: 1998-09

5. Integrative analysis of environmental sequences using MEGAN4.

Authors: Daniel H Huson; Suparna Mitra; Hans-Joachim Ruscheweyh; Nico Weber; Stephan C Schuster
Journal: Genome Res Date: 2011-06-20 Impact factor: 9.043

6. Global view of the protein universe.

Authors: Sergey Nepomnyachiy; Nir Ben-Tal; Rachel Kolodny
Journal: Proc Natl Acad Sci U S A Date: 2014-07-28 Impact factor: 11.205

7. Small Molecule Subgraph Detector (SMSD) toolkit.

Authors: Syed Asad Rahman; Matthew Bashton; Gemma L Holliday; Rainer Schrader; Janet M Thornton
Journal: J Cheminform Date: 2009-08-10 Impact factor: 5.514

8. Metagenomic microbial community profiling using unique clade-specific marker genes.

Authors: Nicola Segata; Levi Waldron; Annalisa Ballarini; Vagheesh Narasimhan; Olivier Jousson; Curtis Huttenhower
Journal: Nat Methods Date: 2012-06-10 Impact factor: 28.547

9. Host lifestyle affects human microbiota on daily timescales.

Authors: Lawrence A David; Arne C Materna; Jonathan Friedman; Maria I Campos-Baptista; Matthew C Blackburn; Allison Perrotta; Susan E Erdman; Eric J Alm
Journal: Genome Biol Date: 2014 Impact factor: 13.583

10. Compressive genomics for protein databases.

Authors: Noah M Daniels; Andrew Gallant; Jian Peng; Lenore J Cowen; Michael Baym; Bonnie Berger
Journal: Bioinformatics Date: 2013-07-01 Impact factor: 6.937

Cited by: 21 in total (10 listed)

1. Low-Density Locality-Sensitive Hashing Boosts Metagenomic Binning.

Authors: Yunan Luo; Jianyang Zeng; Bonnie Berger; Jian Peng
Journal: Res Comput Mol Biol Date: 2016-04

2. Online Databases for Taxonomy and Identification of Pathogenic Fungi and Proposal for a Cloud-Based Dynamic Data Network Platform. (Review)

Authors: Peralam Yegneswaran Prakash; Laszlo Irinyi; Catriona Halliday; Sharon Chen; Vincent Robert; Wieland Meyer
Journal: J Clin Microbiol Date: 2017-02-08 Impact factor: 5.948

3. Fast genotyping of known SNPs through approximate k-mer matching.

Authors: Ariya Shajii; Deniz Yorukoglu; Yun William Yu; Bonnie Berger
Journal: Bioinformatics Date: 2016-09-01 Impact factor: 6.937

4. Weighted minimizer sampling improves long read mapping.

Authors: Chirag Jain; Arang Rhie; Haowen Zhang; Claudia Chu; Brian P Walenz; Sergey Koren; Adam M Phillippy
Journal: Bioinformatics Date: 2020-07-01 Impact factor: 6.937

5. ChIPWig: a random access-enabling lossless and lossy compression method for ChIP-seq data.

Authors: Vida Ravanmehr; Minji Kim; Zhiying Wang; Olgica Milenkovic
Journal: Bioinformatics Date: 2018-03-15 Impact factor: 6.937

6. Harnessing Big Data for Systems Pharmacology. (Review)

Authors: Lei Xie; Eli J Draizen; Philip E Bourne
Journal: Annu Rev Pharmacol Toxicol Date: 2016-10-13 Impact factor: 13.820

7. Fast search of thousands of short-read sequencing experiments.

Authors: Brad Solomon; Carl Kingsford
Journal: Nat Biotechnol Date: 2016-02-08 Impact factor: 54.908

8. An Efficient, Scalable, and Exact Representation of High-Dimensional Color Information Enabled Using de Bruijn Graph Search.

Authors: Fatemeh Almodaresi; Prashant Pandey; Michael Ferdman; Rob Johnson; Rob Patro
Journal: J Comput Biol Date: 2020-03-16 Impact factor: 1.479

9. Generalizable and Scalable Visualization of Single-Cell Data Using Neural Networks.

Authors: Hyunghoon Cho; Bonnie Berger; Jian Peng
Journal: Cell Syst Date: 2018-06-20 Impact factor: 10.304

10. Computational Biology in the 21st Century: Scaling with Compressive Algorithms.

Authors: Bonnie Berger; Noah M Daniels; Y William Yu
Journal: Commun ACM Date: 2016-08 Impact factor: 4.654