
Hashing algorithms and data structures for rapid searches of fingerprint vectors.

Ramzi Nasr, Daniel S Hirschberg, Pierre Baldi.

Abstract

In many large chemoinformatics database systems, molecules are represented by long binary fingerprint vectors whose components record the presence or absence of particular functional groups or combinatorial features. To speed up database searches, we propose adding to each fingerprint a short integer signature vector of length M. For a given fingerprint, the i-th component of the signature vector counts the number of 1-bits in the fingerprint that fall on components congruent to i modulo M. Given two signatures, we show how one can rapidly compute a bound on the Jaccard-Tanimoto similarity measure of the two corresponding fingerprints, using the intersection bound. These signatures thus allow one to prune the search space significantly by discarding molecules associated with unfavorable bounds. Analytical methods are developed to predict the resulting amount of pruning as a function of M. Data structures combining different values of M are also developed, together with methods for predicting the optimal values of M for a given implementation. Simulations using a particular implementation show that the proposed approach leads to a speedup of one order of magnitude over a linear search and a 3-fold speedup over a previous implementation. All theoretical results and predictions are corroborated by large-scale simulations using molecules from the ChemDB. Several possible algorithmic extensions are discussed.
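The signature construction and the pruning step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the bound's formula, so the form used here is an assumption: since the intersection of two fingerprints cannot exceed the sum of componentwise minima of their signatures, the Tanimoto similarity is at most sum(min(a_i, b_i)) / sum(max(a_i, b_i)). The function names (`signature`, `tanimoto_upper_bound`, `pruned_search`) and the list-of-bits fingerprint representation are hypothetical and chosen for readability, not taken from the paper's implementation.

```python
from typing import List, Sequence, Tuple


def signature(bits: Sequence[int], m: int) -> List[int]:
    """Component i counts the 1-bits at positions congruent to i modulo m."""
    sig = [0] * m
    for pos, bit in enumerate(bits):
        if bit:
            sig[pos % m] += 1
    return sig


def tanimoto(a: Sequence[int], b: Sequence[int]) -> float:
    """Exact Jaccard-Tanimoto similarity of two equal-length bit vectors."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 1.0


def tanimoto_upper_bound(sa: Sequence[int], sb: Sequence[int]) -> float:
    """Upper bound on the Tanimoto similarity computed from signatures only.

    Assumed form: sum of componentwise minima over sum of componentwise maxima.
    """
    num = sum(min(x, y) for x, y in zip(sa, sb))
    den = sum(max(x, y) for x, y in zip(sa, sb))
    return num / den if den else 1.0


def pruned_search(
    query: Sequence[int],
    database: Sequence[Sequence[int]],
    signatures: Sequence[Sequence[int]],
    m: int,
    threshold: float,
) -> List[Tuple[int, float]]:
    """Return (index, similarity) for every entry with similarity >= threshold.

    Entries whose signature bound falls below the threshold are discarded
    without computing the exact similarity.
    """
    sq = signature(query, m)
    hits = []
    for idx, (fp, sig) in enumerate(zip(database, signatures)):
        if tanimoto_upper_bound(sq, sig) < threshold:
            continue  # pruned: the exact similarity cannot reach the threshold
        t = tanimoto(query, fp)
        if t >= threshold:
            hits.append((idx, t))
    return hits


if __name__ == "__main__":
    m = 4
    query = [1, 1, 0, 0, 1, 0, 1, 0]
    database = [
        [1, 0, 0, 1, 1, 0, 0, 0],
        [1, 1, 0, 0, 1, 0, 1, 0],
        [0, 0, 0, 0, 0, 0, 0, 1],
    ]
    # In a real system the signatures would be precomputed and stored.
    signatures = [signature(fp, m) for fp in database]
    print(pruned_search(query, database, signatures, m, threshold=0.5))
```

In this sketch the signatures are computed once up front, so each query only pays for M integer comparisons per database entry before deciding whether the full fingerprint needs to be examined. Because the bound never underestimates the true similarity, the pruned search returns the same hits as an exhaustive linear scan.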


Year: 2010    PMID: 20681581    PMCID: PMC2926297    DOI: 10.1021/ci100132g

Source DB: PubMed    Journal: J Chem Inf Model    ISSN: 1549-9596    Impact factor: 4.956


