Speeding up chemical database searches using a proximity filter based on the logical exclusive OR.

Pierre Baldi, Daniel S Hirschberg, Ramzi J Nasr.

Abstract

In many large chemoinformatics database systems, molecules are represented by long binary fingerprint vectors whose components record the presence or absence in the molecular graphs of particular functional groups or combinatorial features, such as labeled paths or labeled trees. To speed up database searches, we propose to store with each fingerprint a small header vector containing primarily the result of applying the logical exclusive OR (XOR) operator to the fingerprint vector after modulo wrapping to a smaller number of bits, such as 128 bits. From the XOR headers of two molecules, tight bounds on the intersection and union of their fingerprint vectors can be rapidly obtained, yielding tight bounds on derived similarity measures, such as the Tanimoto measure. During a database search, every time these bounds are unfavorable, the corresponding molecule can be rapidly discarded with no need for further inspection. We derive probabilistic models that allow us to estimate precisely the behavior of the XOR headers and the level of pruning under different conditions in terms of similarity threshold and fingerprint density. These theoretical results are corroborated by experimental results on a large set of molecules. For a Tanimoto threshold of 0.5 (respectively 0.9), this approach requires searching less than 50% (respectively 10%) of the database, leading to typical search speedups of 2 to 3 times over the previous state-of-the-art.
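The pruning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of Python integers as bit vectors, and the choice to store the 1-bit count next to the XOR header are assumptions made for illustration. The sketch relies on one fact: XOR-folding is linear, so the XOR of two headers equals the folded XOR of the two fingerprints, and each 1-bit in it implies at least one position where the fingerprints differ. With A and B the 1-bit counts and h the number of 1-bits in the XOR of the two headers, the intersection is at most (A + B - h)/2 and the union is at least (A + B + h)/2, so the Tanimoto similarity is at most (A + B - h)/(A + B + h).

```python
def popcount(x: int) -> int:
    """Number of 1-bits in a non-negative integer."""
    return bin(x).count("1")


def fold_xor(bits: int, m: int = 128) -> int:
    """Modulo-wrap a fingerprint to m bits, combining wrapped bits with XOR.

    Bit i of the fingerprint lands on bit (i mod m) of the header.
    """
    mask = (1 << m) - 1
    header = 0
    while bits:
        header ^= bits & mask  # XOR in the next m-bit chunk
        bits >>= m
    return header


def tanimoto(a: int, b: int) -> float:
    """Exact Tanimoto similarity of two fingerprints."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 1.0


def xor_upper_bound(count_a: int, header_a: int,
                    count_b: int, header_b: int) -> float:
    """Upper bound on Tanimoto from 1-bit counts and XOR headers only."""
    h = popcount(header_a ^ header_b)  # lower bound on |A xor B|
    total = count_a + count_b
    if total + h == 0:
        return 1.0  # both fingerprints are empty
    return (total - h) / (total + h)


def search(query: int, database, threshold: float, m: int = 128):
    """Return (fingerprint, similarity) pairs at or above the threshold.

    `database` is an iterable of (fingerprint, count, header) tuples,
    a hypothetical layout chosen for this sketch.
    """
    q_count, q_header = popcount(query), fold_xor(query, m)
    hits = []
    for fp, count, header in database:
        # Cheap test first: discard when even the bound is too low.
        if xor_upper_bound(q_count, q_header, count, header) < threshold:
            continue
        t = tanimoto(query, fp)  # full comparison only for survivors
        if t >= threshold:
            hits.append((fp, t))
    return hits
```

Because the bound never underestimates the true similarity, the filter discards no true hit; it only skips the full comparison for molecules that cannot reach the threshold. How much is pruned depends on the threshold and the fingerprint density, as the abstract notes.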

Year:  2008        PMID: 18593143     DOI: 10.1021/ci800076s

Source DB:  PubMed          Journal:  J Chem Inf Model        ISSN: 1549-9596            Impact factor:   4.956


  11 in total

1.  Methods for Similarity-based Virtual Screening (Review).

Authors:  Thomas G Kristensen; Jesper Nielsen; Christian N S Pedersen
Journal:  Comput Struct Biotechnol J       Date:  2013-03-03       Impact factor: 7.271

2.  Hashing algorithms and data structures for rapid searches of fingerprint vectors.

Authors:  Ramzi Nasr; Daniel S Hirschberg; Pierre Baldi
Journal:  J Chem Inf Model       Date:  2010-08-23       Impact factor: 4.956

3.  Speeding up chemical searches using the inverted index: the convergence of chemoinformatics and text search methods.

Authors:  Ramzi Nasr; Rares Vernica; Chen Li; Pierre Baldi
Journal:  J Chem Inf Model       Date:  2012-04-10       Impact factor: 4.956

4.  An intersection inequality sharper than the Tanimoto triangle inequality for efficiently searching large databases.

Authors:  Pierre Baldi; Daniel S Hirschberg
Journal:  J Chem Inf Model       Date:  2009-08       Impact factor: 4.956

5.  Accelerated similarity searching and clustering of large compound sets by geometric embedding and locality sensitive hashing.

Authors:  Yiqun Cao; Tao Jiang; Thomas Girke
Journal:  Bioinformatics       Date:  2010-02-23       Impact factor: 6.937

6.  Large scale study of multiple-molecule queries.

Authors:  Ramzi J Nasr; S Joshua Swamidass; Pierre F Baldi
Journal:  J Cheminform       Date:  2009-06-04       Impact factor: 5.514

7.  A probabilistic molecular fingerprint for big data settings.

Authors:  Daniel Probst; Jean-Louis Reymond
Journal:  J Cheminform       Date:  2018-12-18       Impact factor: 5.514

8.  A tree-based method for the rapid screening of chemical fingerprints.

Authors:  Thomas G Kristensen; Jesper Nielsen; Christian N S Pedersen
Journal:  Algorithms Mol Biol       Date:  2010-01-04       Impact factor: 1.405

9.  A survey of quantitative descriptions of molecular structure.

Authors:  Rajarshi Guha; Egon Willighagen
Journal:  Curr Top Med Chem       Date:  2012       Impact factor: 3.295

10.  Fragger: a protein fragment picker for structural queries.

Authors:  Francois Berenger; David Simoncini; Arnout Voet; Rojan Shrestha; Kam Y J Zhang
Journal:  F1000Res       Date:  2017-09-22