When is chemical similarity significant? The statistical distribution of chemical similarity scores and its extreme values.

Pierre Baldi, Ramzi Nasr.

Abstract

As repositories of chemical molecules continue to expand and become more open, it becomes increasingly important to develop tools to search them efficiently and assess the statistical significance of chemical similarity scores. Here, we develop a general framework for understanding, modeling, predicting, and approximating the distribution of chemical similarity scores and its extreme values in large databases. The framework can be applied to different chemical representations and similarity measures but is demonstrated here using the most common binary fingerprints with the Tanimoto similarity measure. After introducing several probabilistic models of fingerprints, including the Conditional Gaussian Uniform model, we show that the distribution of Tanimoto scores can be approximated by the distribution of the ratio of two correlated Normal random variables associated with the corresponding unions and intersections. This also holds when the distribution of similarity scores is conditioned on the size of the query molecules to derive more fine-grained results and improve chemical retrieval. The corresponding extreme value distributions for the maximum scores are approximated by Weibull distributions. From these various distributions and their analytical forms, Z-scores, E-values, and p-values are derived to assess the significance of similarity scores. In addition, the framework allows one to predict the values of standard chemical retrieval metrics, such as sensitivity and specificity at fixed thresholds, or receiver operating characteristic (ROC) curves at multiple thresholds, and to detect outliers in the form of atypical molecules. Numerous and diverse experiments, performed in part with large sets of molecules from ChemDB, show remarkable agreement between theory and empirical results.
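
To make the quantities in the abstract concrete, the following is a minimal Python sketch of a Tanimoto score (intersection size over union size of two binary fingerprints) together with an empirical Z-score and p-value against a background sample of scores. It is an illustration only, not the authors' method: the independent-bit random fingerprint generator, the function names, and the parameter values are all assumptions made for this example, and the significance estimates are computed by simple counting rather than from the analytical ratio-of-Normals or Weibull approximations described above.

```python
import random
import statistics


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    # Convention chosen for this sketch: two empty fingerprints score 0.0.
    return len(a & b) / union if union else 0.0


def random_fingerprint(n_bits, p_on, rng):
    """Hypothetical background model: each bit is on independently with probability p_on."""
    return {i for i in range(n_bits) if rng.random() < p_on}


def empirical_significance(score, background):
    """Return (z, p) for `score` relative to a list of background scores.

    z uses the background mean and population standard deviation;
    p is the add-one empirical fraction of background scores >= score.
    """
    mu = statistics.fmean(background)
    sigma = statistics.pstdev(background)
    z = (score - mu) / sigma if sigma else 0.0
    p = (1 + sum(s >= score for s in background)) / (1 + len(background))
    return z, p


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    # Assumed toy settings: 512-bit fingerprints, 10% bit density, 2000 molecules.
    query = random_fingerprint(512, 0.1, rng)
    database = [random_fingerprint(512, 0.1, rng) for _ in range(2000)]
    background = [tanimoto(query, fp) for fp in database]

    # Score of a hypothetical close analog that shares most of the query's bits.
    analog = set(list(query)[: int(0.8 * len(query))])
    z, p = empirical_significance(tanimoto(query, analog), background)
    print(f"z = {z:.2f}, empirical p = {p:.4f}")
```

Counting against a sampled background is the brute-force baseline; the appeal of an analytical model such as the one the abstract describes is that significance can be estimated without rescanning the database for every query.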

Year: 2010    PMID: 20540577    PMCID: PMC2914517    DOI: 10.1021/ci100010v

Source DB: PubMed    Journal: J Chem Inf Model    ISSN: 1549-9596    Impact factor: 4.956

