Speeding up chemical searches using the inverted index: the convergence of chemoinformatics and text search methods.

Ramzi Nasr, Rares Vernica, Chen Li, Pierre Baldi.

Abstract

In ligand-based screening, retrosynthesis, and other chemoinformatics applications, one often needs to search large databases of molecules to retrieve those that are similar to a given query. As molecular databases grow, the efficiency and scalability of data structures and algorithms for chemical searches become increasingly important. Remarkably, the chemoinformatics and information retrieval communities have converged on similar solutions: molecules or documents are represented by binary vectors, or fingerprints, that index their substructures (labeled paths for molecules, n-grams for text), and are compared with the same Jaccard-Tanimoto similarity measure. As a result, similarity search methods from one field can be adapted to the other. Here we adapt recent, state-of-the-art inverted index methods from information retrieval to speed up similarity searches in chemoinformatics. Our results show a several-fold speed-up over previous methods for both threshold searches and top-K searches. We also provide a mathematical analysis that predicts the level of pruning achieved by the inverted index approach, and we validate the quality of these predictions through simulation experiments. All results can be replicated using data freely downloadable from http://cdb.ics.uci.edu/.
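
To make the idea of an inverted index over fingerprints concrete, the following is a minimal Python sketch of a Jaccard-Tanimoto threshold search. It is an illustration only and not the method of the paper, which adapts more sophisticated techniques from information retrieval. The function names (`build_inverted_index`, `threshold_search`), the representation of a fingerprint as a set of on-bit positions, and the simple size-based pruning bound are all assumptions made for this example. The sketch also assumes a non-empty query and a threshold `t` strictly greater than 0, so that molecules sharing no bit with the query can be skipped safely.

```python
from collections import defaultdict


def build_inverted_index(fingerprints):
    """Map each bit position to the ids of the molecules that have it set.

    `fingerprints` is a list of sets of on-bit positions (an assumed,
    simplified representation of binary fingerprints).
    """
    index = defaultdict(list)
    for mol_id, bits in enumerate(fingerprints):
        for bit in bits:
            index[bit].append(mol_id)
    return index


def threshold_search(query, fingerprints, index, t):
    """Return (mol_id, similarity) pairs with Tanimoto similarity >= t.

    Assumes a non-empty query and 0 < t <= 1.
    """
    # Walk only the posting lists of the query's on-bits and count,
    # for each molecule reached, how many bits it shares with the query.
    overlap = defaultdict(int)
    for bit in query:
        for mol_id in index.get(bit, ()):
            overlap[mol_id] += 1

    q = len(query)
    hits = []
    for mol_id, inter in overlap.items():
        size = len(fingerprints[mol_id])
        # Size bound: similarity can never exceed min(q, size) / max(q, size),
        # so molecules failing this test are pruned without further work.
        if min(q, size) < t * max(q, size):
            continue
        sim = inter / (q + size - inter)  # |A and B| / |A or B|
        if sim >= t:
            hits.append((mol_id, sim))
    # Most similar first; ties broken by molecule id.
    return sorted(hits, key=lambda h: (-h[1], h[0]))


# Toy database of four hypothetical fingerprints.
database = [{1, 2, 3, 4}, {1, 2, 3}, {7, 8}, {1, 2, 3, 4, 5, 6, 7, 8}]
inverted = build_inverted_index(database)
results = threshold_search({1, 2, 3, 4}, database, inverted, 0.7)
print(results)
```

In this toy run, molecule 2 is never touched because it shares no bit with the query, and molecule 3 is discarded by the size bound alone. A top-K search could be built on the same counting step by keeping only the K best candidates, for example in a heap, instead of applying a fixed threshold.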

Year: 2012   PMID: 22462644   PMCID: PMC3415597   DOI: 10.1021/ci200552r

Source DB: PubMed   Journal: J Chem Inf Model   ISSN: 1549-9596   Impact factor: 4.956


