Speeding up chemical database searches using a proximity filter based on the logical exclusive or.

Pierre Baldi, Daniel S Hirschberg, Ramzi J Nasr.

Abstract

In many large chemoinformatics database systems, molecules are represented by long binary fingerprint vectors whose components record the presence or absence in the molecular graphs of particular functional groups or combinatorial features, such as labeled paths or labeled trees. To speed up database searches, we propose to store with each fingerprint a small header vector containing primarily the result of applying the logical exclusive OR (XOR) operator to the fingerprint vector after modulo wrapping to a smaller number of bits, such as 128 bits. From the XOR headers of two molecules, tight bounds on the intersection and union of their fingerprint vectors can be rapidly obtained, yielding tight bounds on derived similarity measures, such as the Tanimoto measure. During a database search, every time these bounds are unfavorable, the corresponding molecule can be rapidly discarded with no need for further inspection. We derive probabilistic models that allow us to estimate precisely the behavior of the XOR headers and the level of pruning under different conditions in terms of similarity threshold and fingerprint density. These theoretical results are corroborated by experimental results on a large set of molecules. For a Tanimoto threshold of 0.5 (respectively 0.9), this approach requires searching less than 50% (respectively 10%) of the database, leading to typical search speedups of 2 to 3 times over the previous state-of-the-art.
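The pruning idea described in the abstract can be illustrated with a short Python sketch. It is not the authors' implementation, and it makes several assumptions: fingerprints are held as Python integers, the header stores only the bit count and a 128-bit XOR-folded vector, and all function names are hypothetical. The sketch relies on one fact: folding commutes with XOR, so the number of 1-bits in the XOR of two headers (call it x) cannot exceed the number of 1-bits in the XOR of the full fingerprints. With |A| and |B| the bit counts, this gives |A ∩ B| <= (|A| + |B| - x) / 2 and |A ∪ B| >= (|A| + |B| + x) / 2, hence Tanimoto(A, B) <= (|A| + |B| - x) / (|A| + |B| + x). The bounds derived in the paper may be tighter than this one.

```python
import random


def popcount(x):
    """Number of 1-bits in a non-negative integer."""
    return bin(x).count("1")


def xor_fold(fp, k=128):
    """Wrap a fingerprint modulo k bits, combining colliding bits with XOR."""
    mask = (1 << k) - 1
    header = 0
    while fp:
        header ^= fp & mask  # bit i of fp lands on bit i mod k
        fp >>= k
    return header


def tanimoto(a, b):
    """Exact Tanimoto similarity of two fingerprints (1.0 if both are empty)."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 1.0


def tanimoto_upper_bound(count_a, header_a, count_b, header_b):
    """Upper bound on Tanimoto similarity computed from the headers alone."""
    x = popcount(header_a ^ header_b)  # x <= popcount(A ^ B)
    total = count_a + count_b
    if total + x == 0:
        return 1.0
    return (total - x) / (total + x)


def build_index(fingerprints, k=128):
    """Precompute (fingerprint, bit count, XOR header) for every molecule."""
    return [(fp, popcount(fp), xor_fold(fp, k)) for fp in fingerprints]


def search(query, index, threshold, k=128):
    """Return indices of hits and the number of full comparisons performed."""
    q_count, q_header = popcount(query), xor_fold(query, k)
    hits, full_comparisons = [], 0
    for i, (fp, count, header) in enumerate(index):
        if tanimoto_upper_bound(q_count, q_header, count, header) < threshold:
            continue  # bound is unfavorable: discard without inspecting fp
        full_comparisons += 1
        if tanimoto(query, fp) >= threshold:
            hits.append(i)
    return hits, full_comparisons


def random_fingerprint(n_bits, density, rng):
    """Toy fingerprint with independent bits; real fingerprints are not random."""
    fp = 0
    for i in range(n_bits):
        if rng.random() < density:
            fp |= 1 << i
    return fp


if __name__ == "__main__":
    rng = random.Random(0)
    index = build_index([random_fingerprint(1024, 0.1, rng) for _ in range(1000)])
    hits, full_comparisons = search(index[0][0], index, threshold=0.5)
    print(len(hits), "hits;", full_comparisons, "of", len(index), "fully compared")
```

Because the bound never underestimates the true similarity, the pruned search returns the same hits as a brute-force scan. The fraction of the database pruned depends on the threshold and on fingerprint density, as the abstract notes; the random toy data here says nothing about the figures reported in the paper.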

Year: 2008 PMID: 18593143 DOI: 10.1021/ci800076s

Source DB: PubMed Journal: J Chem Inf Model ISSN: 1549-9596 Impact factor: 4.956

Cited

11 in total

1. Methods for Similarity-based Virtual Screening. (Review)

Authors: Thomas G Kristensen; Jesper Nielsen; Christian N S Pedersen
Journal: Comput Struct Biotechnol J Date: 2013-03-03 Impact factor: 7.271

2. Hashing algorithms and data structures for rapid searches of fingerprint vectors.

Authors: Ramzi Nasr; Daniel S Hirschberg; Pierre Baldi
Journal: J Chem Inf Model Date: 2010-08-23 Impact factor: 4.956

3. Speeding up chemical searches using the inverted index: the convergence of chemoinformatics and text search methods.

Authors: Ramzi Nasr; Rares Vernica; Chen Li; Pierre Baldi
Journal: J Chem Inf Model Date: 2012-04-10 Impact factor: 4.956

4. An intersection inequality sharper than the Tanimoto triangle inequality for efficiently searching large databases.

Authors: Pierre Baldi; Daniel S Hirschberg
Journal: J Chem Inf Model Date: 2009-08 Impact factor: 4.956

5. Accelerated similarity searching and clustering of large compound sets by geometric embedding and locality sensitive hashing.

Authors: Yiqun Cao; Tao Jiang; Thomas Girke
Journal: Bioinformatics Date: 2010-02-23 Impact factor: 6.937

6. Large scale study of multiple-molecule queries.

7. A probabilistic molecular fingerprint for big data settings.

8. A tree-based method for the rapid screening of chemical fingerprints.

9. A survey of quantitative descriptions of molecular structure.

10. Fragger: a protein fragment picker for structural queries.