Lossless compression of chemical fingerprints using integer entropy codes improves storage and retrieval.

Pierre Baldi, Ryan W Benz, Daniel S Hirschberg, S Joshua Swamidass.

Abstract

Many modern chemoinformatics systems for small molecules rely on large fingerprint vector representations, where the components of the vector record the presence or number of occurrences in the molecular graphs of particular combinatorial features, such as labeled paths or labeled trees. These large fingerprint vectors are often compressed to much shorter fingerprint vectors using a lossy compression scheme based on a simple modulo procedure. Here, we combine statistical models of fingerprints with integer entropy codes, such as Golomb and Elias codes, to encode the indices or the run lengths of the fingerprints. After reordering the fingerprint components in order of decreasing frequency, the indices are monotone-increasing and the run lengths are quasi-monotone-increasing, and both exhibit power-law distribution trends. We take advantage of these statistical properties to derive new, efficient, lossless compression algorithms for monotone integer sequences: monotone value (MOV) coding and monotone length (MOL) coding. In contrast to lossy systems that use 1024 or more bits of storage per molecule, we can achieve lossless compression of long chemical fingerprints based on circular substructures in slightly over 300 bits per molecule, close to the Shannon entropy limit, using a MOL Elias Gamma code for run lengths. The improvement in storage comes at a modest computational cost. Furthermore, because the compression is lossless, uncompressed similarity (e.g., Tanimoto) between molecules can be computed exactly from their compressed representations, leading to significant improvements in retrieval performance, as shown on six benchmark data sets of druglike molecules.
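
As an illustration only, the following Python sketch contrasts the two approaches the abstract describes: lossy modulo folding and lossless coding of the gaps between set bits with a standard Elias gamma code. It is not the authors' MOV or MOL algorithm, and it does not reorder components by frequency. The function names, the toy fingerprints, and the folding length of 8 are assumptions made for this example.

```python
def elias_gamma_encode(n: int) -> str:
    """Elias gamma code of a positive integer, returned as a bit string."""
    if n < 1:
        raise ValueError("Elias gamma codes are defined for integers >= 1")
    binary = bin(n)[2:]                       # binary digits, no leading zeros
    return "0" * (len(binary) - 1) + binary   # unary length prefix, then value


def encode_fingerprint(on_bits) -> str:
    """Losslessly encode the indices of the set bits as gamma-coded gaps."""
    pieces = []
    prev = -1
    for idx in sorted(set(on_bits)):
        pieces.append(elias_gamma_encode(idx - prev))  # gap is always >= 1
        prev = idx
    return "".join(pieces)


def decode_fingerprint(code: str) -> list:
    """Recover the sorted set-bit indices from the gamma-coded gaps."""
    indices = []
    prev = -1
    pos = 0
    while pos < len(code):
        zeros = 0
        while code[pos] == "0":               # count the unary length prefix
            zeros += 1
            pos += 1
        gap = int(code[pos:pos + zeros + 1], 2)
        pos += zeros + 1
        prev += gap
        indices.append(prev)
    return indices


def fold(on_bits, length: int) -> set:
    """Lossy modulo folding: distinct indices may collide on one position."""
    return {idx % length for idx in on_bits}


def tanimoto(a, b) -> float:
    """Tanimoto similarity between two collections of set-bit indices."""
    sa, sb = set(a), set(b)
    union = len(sa | sb)
    return len(sa & sb) / union if union else 1.0


# Toy fingerprints (hypothetical data), given as indices of set bits.
fp_a = [0, 3, 4, 10]
fp_b = [3, 4, 11]

code_a = encode_fingerprint(fp_a)             # "1011100110"
code_b = encode_fingerprint(fp_b)

# Decoding restores the original indices, so the similarity is exact.
exact = tanimoto(decode_fingerprint(code_a), decode_fingerprint(code_b))  # 0.4

# Folding to 8 positions maps index 11 onto 3 and changes the similarity.
folded = tanimoto(fold(fp_a, 8), fold(fp_b, 8))                           # 0.5
```

In this toy case the folded fingerprints give a Tanimoto value of 0.5, while the decoded lossless representation gives the true value of 0.4, which is the kind of distortion the lossless scheme avoids.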

Year:  2007        PMID: 17967006      PMCID: PMC2536658          DOI: 10.1021/ci700200n

Source DB:  PubMed          Journal:  J Chem Inf Model        ISSN: 1549-9596            Impact factor:   4.956


