
Mathematical correction for fingerprint similarity measures to improve chemical retrieval.

S Joshua Swamidass, Pierre Baldi.

Abstract

In many modern chemoinformatics systems, molecules are represented by long binary fingerprint vectors recording the presence or absence of particular features or substructures, such as labeled paths or trees, in the molecular graphs. These long fingerprints are often compressed to much shorter fingerprints using a simple modulo operation. As the length of the fingerprints decreases, their typical density and overlap tend to increase, and so does any similarity measure based on overlap, such as the widely used Tanimoto similarity. Here we show that this correlation between shorter fingerprints and higher similarity can be thought of as a systematic error introduced by the fingerprint folding algorithm and that this systematic error can be corrected mathematically. More precisely, given two molecules and their compressed fingerprints of a given length, we show how a better estimate of their uncompressed overlap, hence of their similarity, can be derived to correct for this bias. We show how the correction can be implemented not only for the Tanimoto measure but also for all other commonly used measures. Experiments on various data sets and fingerprint sizes demonstrate how, with a negligible computational overhead, the correction noticeably improves the sensitivity and specificity of chemical retrieval.
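The folding bias described in the abstract can be illustrated with a small sketch. It represents fingerprints as sets of on-bit indices, folds them with a modulo operation, and compares the Tanimoto similarity before and after folding. The correction shown is an assumption, not necessarily the authors' exact formula: it supposes that on-bits land uniformly at random after folding, so that a fingerprint with A on-bits has about N(1 - (1 - 1/N)^A) on-bits at length N, and inverts that relation to estimate the uncompressed counts. All function names and the synthetic data are hypothetical.

```python
import math
import random


def fold(on_bits, n):
    """Fold a fingerprint (set of on-bit indices) to length n using modulo."""
    return {i % n for i in on_bits}


def tanimoto(a, b):
    """Tanimoto similarity |A & B| / |A | B| for two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def estimate_unfolded_count(folded_count, n):
    """Estimate the number of on-bits before folding to length n (n >= 2).

    Assumption: bits fall uniformly at random, so the expected folded count
    is n * (1 - (1 - 1/n) ** count). This function inverts that relation.
    """
    if folded_count >= n:
        raise ValueError("folded fingerprint is saturated; cannot invert")
    return math.log(1 - folded_count / n) / math.log(1 - 1 / n)


def corrected_tanimoto(fa, fb, n):
    """Tanimoto estimate from folded fingerprints using the inversion above."""
    a_hat = estimate_unfolded_count(len(fa), n)
    b_hat = estimate_unfolded_count(len(fb), n)
    u_hat = estimate_unfolded_count(len(fa | fb), n)
    if u_hat == 0:
        return 1.0
    # Inclusion-exclusion on the estimated counts; clamp at zero.
    i_hat = max(0.0, a_hat + b_hat - u_hat)
    return i_hat / u_hat


def demo(seed=0, n=512):
    """Build two synthetic fingerprints with a known Tanimoto of 0.2."""
    rng = random.Random(seed)
    bits = rng.sample(range(2 ** 20), 500)      # 500 distinct indices
    a = set(bits[:300])                         # 300 on-bits
    b = set(bits[:100]) | set(bits[300:])       # 300 on-bits, 100 shared with a
    fa, fb = fold(a, n), fold(b, n)
    return tanimoto(a, b), tanimoto(fa, fb), corrected_tanimoto(fa, fb, n)


if __name__ == "__main__":
    true_sim, folded_sim, corrected_sim = demo()
    print(f"uncompressed: {true_sim:.3f}")
    print(f"folded:       {folded_sim:.3f}")
    print(f"corrected:    {corrected_sim:.3f}")
```

On this synthetic pair, the folded similarity should come out well above the uncompressed value of 0.2, while the corrected estimate should land closer to it. Real fingerprints are not uniformly random, so the size of the improvement on actual molecules may differ.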

Year: 2007  PMID: 17444629  DOI: 10.1021/ci600526a

Source DB: PubMed  Journal: J Chem Inf Model  ISSN: 1549-9596  Impact factor: 4.956


  7 in total

1.  Lossless compression of chemical fingerprints using integer entropy codes improves storage and retrieval.

Authors:  Pierre Baldi; Ryan W Benz; Daniel S Hirschberg; S Joshua Swamidass
Journal:  J Chem Inf Model       Date:  2007-10-30       Impact factor: 4.956

2.  DNA Barcoding a Complete Matrix of Stereoisomeric Small Molecules.

Authors:  Christopher J Gerry; Mathias J Wawer; Paul A Clemons; Stuart L Schreiber
Journal:  J Am Chem Soc       Date:  2019-06-25       Impact factor: 15.419

3.  Hashing algorithms and data structures for rapid searches of fingerprint vectors.

Authors:  Ramzi Nasr; Daniel S Hirschberg; Pierre Baldi
Journal:  J Chem Inf Model       Date:  2010-08-23       Impact factor: 4.956

4.  Speeding up chemical searches using the inverted index: the convergence of chemoinformatics and text search methods.

Authors:  Ramzi Nasr; Rares Vernica; Chen Li; Pierre Baldi
Journal:  J Chem Inf Model       Date:  2012-04-10       Impact factor: 4.956

5.  Large scale study of multiple-molecule queries.

Authors:  Ramzi J Nasr; S Joshua Swamidass; Pierre F Baldi
Journal:  J Cheminform       Date:  2009-06-04       Impact factor: 5.514

6.  Securely measuring the overlap between private datasets with cryptosets.

Authors:  S Joshua Swamidass; Matthew Matlock; Leon Rozenblit
Journal:  PLoS One       Date:  2015-02-25       Impact factor: 3.240

7.  Dashing: fast and accurate genomic distances with HyperLogLog.

Authors:  Daniel N Baker; Ben Langmead
Journal:  Genome Biol       Date:  2019-12-04       Impact factor: 13.583

