
Bounds and algorithms for fast exact searches of chemical fingerprints in linear and sublinear time.

S Joshua Swamidass, Pierre Baldi.

Abstract

Chemical fingerprints are used to represent chemical molecules by recording the presence or absence, or by counting the number of occurrences, of particular features or substructures, such as labeled paths in the 2D graph of bonds, of the corresponding molecule. These fingerprint vectors are used to search large databases of small molecules, currently containing millions of entries, using various similarity measures, such as the Tanimoto or Tversky's measures and their variants. Here, we derive simple bounds on these similarity measures and show how these bounds can be used to considerably reduce the subset of molecules that need to be searched. We consider both the case of single-molecule and multiple-molecule queries, as well as queries based on fixed similarity thresholds or aimed at retrieving the top K hits. We study the speedup as a function of query size and distribution, fingerprint length, similarity threshold, and database size |D| and derive analytical formulas that are in excellent agreement with empirical values. The theoretical considerations and experiments show that this approach can provide linear speedups of one or more orders of magnitude in the case of searches with a fixed threshold, and achieve sublinear speedups in the range of O(|D|^0.6) for the top K hits in current large databases. This pruning approach yields subsecond search times across the 5 million compounds in the ChemDB database, without any loss of accuracy.
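
The pruning idea for a fixed-threshold search can be illustrated with a minimal sketch. It assumes binary fingerprints stored as Python integers and uses one simple bound of the kind the abstract describes: if the query has a bits set and a database fingerprint has b bits set, their intersection has at most min(a, b) bits and their union at least max(a, b) bits, so the Tanimoto similarity cannot exceed min(a, b) / max(a, b). The abstract does not spell out the bounds, so this formulation, along with the function names and the bucket-by-bit-count index, is an assumption made for illustration and not necessarily the authors' implementation.

```python
from collections import defaultdict


def popcount(fp: int) -> int:
    """Number of bits set in a binary fingerprint stored as an int."""
    return bin(fp).count("1")


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity: |A and B| / |A or B| for binary fingerprints."""
    union = popcount(a | b)
    if union == 0:
        return 1.0  # convention chosen here for two empty fingerprints
    return popcount(a & b) / union


def build_index(fingerprints):
    """Group fingerprints by bit count (hypothetical index layout)."""
    index = defaultdict(list)
    for fp in fingerprints:
        index[popcount(fp)].append(fp)
    return index


def threshold_search(query: int, index, t: float):
    """Return (hits, scanned): every fingerprint with similarity >= t,
    plus the number of fingerprints that were actually scored."""
    a = popcount(query)
    hits = []
    scanned = 0
    for count, bucket in index.items():
        largest = max(a, count)
        # Upper bound on the similarity of the query to anything in this bucket.
        bound = 1.0 if largest == 0 else min(a, count) / largest
        if bound < t:
            continue  # no fingerprint in this bucket can reach the threshold
        for fp in bucket:
            scanned += 1
            score = tanimoto(query, fp)
            if score >= t:
                hits.append((score, fp))
    return sorted(hits, reverse=True), scanned


if __name__ == "__main__":
    database = [0b1111, 0b1110, 0b0001, 0b11111111, 0b1100, 0b1011]
    hits, scanned = threshold_search(0b1111, build_index(database), 0.7)
    print(hits, scanned)
```

Because whole buckets are skipped only when the bound proves that no member can reach the threshold, the result matches an exhaustive scan. In this toy database only the buckets with 3 and 4 bits set are scored for the query 0b1111 at threshold 0.7.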


Year:  2007        PMID: 17326616      PMCID: PMC2527184          DOI: 10.1021/ci600358f

Source DB:  PubMed          Journal:  J Chem Inf Model        ISSN: 1549-9596            Impact factor:   4.956


