ChemMine tools: an online service for analyzing and clustering small molecules.

Tyler W H Backman, Yiqun Cao, Thomas Girke.

Abstract

ChemMine Tools is an online service for small molecule data analysis. It provides a web interface to a set of cheminformatics and data mining tools that are useful for various analysis routines performed in chemical genomics and drug discovery. The service also offers programmable access options via the R library ChemmineR. The primary functionalities of ChemMine Tools fall into five major application areas: data visualization, structure comparisons, similarity searching, compound clustering and prediction of chemical properties. First, users can upload compound data sets to the online Compound Workbench. Numerous utilities are provided for compound viewing, structure drawing and format interconversion. Second, pairwise structural similarities among compounds can be quantified. Third, interfaces to ultra-fast structure similarity search algorithms are available to efficiently mine the chemical space in the public domain. These include fingerprint and embedding/indexing algorithms. Fourth, the service includes a Clustering Toolbox that integrates cheminformatic algorithms with data mining utilities to enable systematic structure and activity based analyses of custom compound sets. Fifth, physicochemical property descriptors of custom compound sets can be calculated. These descriptors are important for assessing the bioactivity profile of compounds in silico and quantitative structure-activity relationship (QSAR) analyses. ChemMine Tools is available at: http://chemmine.ucr.edu.

Year: 2011 PMID: 21576229 PMCID: PMC3125754 DOI: 10.1093/nar/gkr320

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Cheminformatics tools for analyzing small molecule screening data play an important role in many fields including chemical biology, chemical genomics, drug discovery and agrochemical research (1–3). Informatics resources in these areas are essential for exploring the structure, properties and bioactivity of biologically relevant molecules. To provide these capabilities, software tools are required for analyzing the structural similarities, physicochemical properties and bioactivity profiles of natural and synthetic compounds to gain insight into their modes of action in biological systems. This information is important for the development of effective small molecule probes for studying the functions of protein and cellular networks in chemical genomics and drug discovery research (4). In addition, similar informatics resources are required for identifying the structural and physicochemical relationships among compounds from metabolic or signaling pathways (5–7). The rapidly growing relevance of chemical genomics approaches for modern biology research has significantly increased demand for small molecule mining systems in academia (8). Currently, the structures of over 30 million distinct small molecules are available in open-access databases, including PubChem, ChemBank and many others (9–15). In addition, preliminary bioactivity data from hundreds of high-throughput screening (HTS) experiments against a wide spectrum of target sites have become available for almost one million compounds in the bioassay sections of various public databases (see below; 9,10,15,16). To efficiently analyze these resources, the development of novel compound data mining and cheminformatic web services is essential. While there has been extensive development of public domain small molecule databases in recent years (6,9–11, 13–24), the number of open access web services for analyzing public or custom small molecule data is extremely limited at this point (25,26). 
Thus far, most development has been focused on standalone software applications targeted toward computational rather than experimental scientists. These include Open Babel (27,28), the Chemistry Development Kit (29,30), the Chemical Descriptors Library (31) and JOELib (32). Examples of software designed for non-expert users in this field are Chembench (33) for online quantitative structure-activity relationship (QSAR) modeling and KNIME (34) for designing data analysis pipelines. Here, we present ChemMine Tools as an online portal to a variety of cheminformatics, visualization, search and clustering tools for small molecule data. The utilities provided by this service are useful for various analysis and data mining routines of small molecule screening experiments in chemical genomics and related areas. An easy-to-use web interface makes these tools accessible to experimental scientists without an extensive computational background.

METHODS

Conceptually, the ChemMine Tools online service is divided into five application domains (Figure 1 and Table 1): (i) a Compound workbench for data imports and result management; (ii) a Structure Similarity toolbox to quantify the similarities among compounds; (iii) a Search toolbox for retrieving similar compounds from PubChem; (iv) a Clustering toolbox for accessing clustering and data visualization tools; and (v) a Property toolbox for predicting physicochemical properties of compounds. To construct robust data analysis workflows, the back-end of the server employs a modular design architecture with object-oriented methods and container classes assuring compatible input/output flows and parameter settings among the different data processing units. Currently, the server integrates over 30 cheminformatics and data mining tools that were developed by this or related open source projects. The modular organization of the ChemMine Tools service has several advantages. For instance, it maximizes the transparency and maintainability of the system, and simplifies the addition of new features and analysis methods upon user request. The web interface of ChemMine Tools is written in Python using the object-oriented and highly scalable Django web framework. Modern JavaScript/Ajax utilities are embedded to generate interactive and customizable high-content web pages. Moreover, the ChemMine Tools project is dedicated to an open access and resource sharing policy. All of its online services and downloadable software components are freely available without restrictions. The following subsections give a detailed description of the underlying algorithms and software tools used by the individual ChemMine Tools services.

Figure 1.

Illustration of the functionalities provided by ChemMine Tools. The utilities of the five application domains (i–v) are listed in more detail in Table 1.

Table 1.

List of services provided by ChemMine Tools

Functions	Program	Input	Output	Comments
(i) Compound workbench
Structure import/export	Open Babel	Mouse clicks	SMILES/SDF	One or many compounds
Format interconversions	Open Babel	SDF/SMILES	SMILES/SDF	One or many compounds
Bioactivity data import	JavaScript/Ajax	Tabular data	Table/heat map	SAR table
Structure depictions	CACTVS	SMILES/SDF	Image file (GIF)	One or many compounds
Structure drawing	JME Molecular Editor	Mouse clicks	SMILES/SDF	Single compound
Database import	SOAP	XML/SDF	SMILES/SDF	PubChem
Scriptable access from R	ChemmineR^a	SDF, tabular data	Online viewing	SAR table
(ii) Similarity toolbox
Fragment-based similarity	Atom Pairs^a	SDF/SMILES	Similarity coefficients	Pairwise comparisons
Maximum common substructure	MCS^a	SDF/SMILES	MCS (SDF), similarity coefficient	Pairwise comparisons
(iii) Search toolbox
Embedding and indexing	EI Search^a	Mouse clicks, SDF/SMILES	Ranked compound list	Database search
Fingerprint search	PubChem PUG	Mouse clicks, SDF/SMILES	Ranked compound list	Database search
(iv) Clustering toolbox
Binning clustering	cmp.cluster^a	SDF/SMILES, custom table	Cluster table
Hierarchical clustering	hclust	SDF/SMILES, custom table	Tree, distance matrix	Optional heat map
Multidimensional scaling	cmdscale	SDF/SMILES, custom table	Scatter plot	Interactive
(v) Property toolbox
Physicochemical descriptors	JOELib	SDF/SMILES	Property table	38 descriptors

The names of software tools, libraries and environments are italicized.

^a Programs developed by the ChemMine Tools project. Acronyms defined in text.

DISCUSSION OF SERVICES

Compound workbench

A central feature of ChemMine Tools is its Compound workbench. It provides a flexible online workspace to upload, manage and visualize small molecule data. Compounds can be imported by reading them from local files, copy and paste, PubChem queries (see Search toolbox) or by interacting with the service through the ChemmineR library (35) within the statistical programming environment R. The latter is an extension of the ChemMine Tools project to provide a programmable interface to more advanced users. Alternatively, compounds can be drawn online with the JME Molecular Editor (36) and then added to the Compound workbench. Currently, the import utility supports the structure data format (SDF) and simplified molecular input line entry system (SMILES). After the import, one can organize and annotate the compounds or view their structure images in single or batch modes. These images are generated in real time from the underlying structure definition data using the structure depiction tool of the CACTVS software suite (11) which runs on the server side. To revisit instances of compound sets, users can save their workbench for later use by downloading the compounds to local files. The compound download function also serves as a format conversion tool to interconvert structure representations between SDF and SMILES formats using utilities from the Open Babel project (27,28). Once the user has populated the Compound workbench with structures, it serves as a central submission system to all downstream analysis services.
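As a rough illustration of what an SDF import has to handle, the sketch below splits a multi-compound SDF string into per-compound records and reads the title line and any attached data items. It relies only on the general SDF layout (records terminated by a `$$$$` line, data items introduced by `> <TAG>` lines). The function names and the example records are hypothetical and are not part of ChemMine Tools, ChemmineR or Open Babel.

```python
from typing import Dict, List, Optional


def split_sdf(text: str) -> List[str]:
    """Split a multi-compound SDF string into one record per compound.

    Each record in an SDF file is terminated by a line containing '$$$$'.
    """
    records: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if any(item.strip() for item in current):
                records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    # Tolerate a final record that lacks the closing '$$$$' delimiter.
    if any(item.strip() for item in current):
        records.append("\n".join(current))
    return records


def compound_name(record: str) -> str:
    """Return the title line, i.e. the first line of the record header."""
    lines = record.splitlines()
    return lines[0].strip() if lines else ""


def data_fields(record: str) -> Dict[str, str]:
    """Collect '> <TAG>' data items; a blank line ends each item."""
    fields: Dict[str, str] = {}
    tag: Optional[str] = None
    for line in record.splitlines():
        if line.startswith(">") and "<" in line and ">" in line[1:]:
            tag = line[line.index("<") + 1 : line.rindex(">")]
            fields[tag] = ""
        elif tag is not None:
            if line.strip() == "":
                tag = None
            else:
                fields[tag] = (fields[tag] + "\n" + line) if fields[tag] else line
    return fields


# Two hypothetical, atom-free records used only to exercise the parser.
EXAMPLE_SDF = """cmp1
  hypothetical

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <ACTIVITY>
0.82

$$$$
cmp2
  hypothetical

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
$$$$
"""

if __name__ == "__main__":
    for rec in split_sdf(EXAMPLE_SDF):
        print(compound_name(rec), data_fields(rec))
```

A production import would additionally validate the atom and bond blocks, which is the kind of work the service delegates to established toolkits such as Open Babel.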

Similarity toolbox

In many small molecule screening data analysis routines, it is important to compute objective similarity measures among compounds as a means to compare and prioritize structurally related lead compounds. To provide this functionality, ChemMine Tools has implemented two algorithms for computing similarity coefficients among compound structures. The first employs atom pairs as structural descriptors (37) and the widely used Tanimoto coefficient as a similarity measure (see below for more details). Alternatively, users can choose other similarity coefficients, such as Tversky or Dice (38). The second algorithm identifies the maximum common substructure (MCS) shared among compound pairs (39). Subsequently, the sizes of both compounds and the size of their shared MCS are used to calculate the available similarity coefficients. The underlying MCS algorithm often provides the most accurate and sensitive similarity measure, especially for compounds with large size differences (40,41).
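The coefficients named above can be sketched in a few lines. The example assumes that each compound is represented as a plain set of descriptor labels, which is a simplification of the atom pair descriptors used by the service, and the labels themselves are made up. The `mcs_tanimoto` helper shows how the same Tanimoto form could be applied to compound sizes and MCS size; it is an illustration, not the project's implementation.

```python
from typing import Hashable, Set


def tanimoto(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Tanimoto coefficient: shared features / features present in either set."""
    c = len(a & b)
    union = len(a) + len(b) - c
    return c / union if union else 0.0


def dice(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Dice coefficient: twice the shared features / sum of both set sizes."""
    c = len(a & b)
    total = len(a) + len(b)
    return 2 * c / total if total else 0.0


def tversky(a: Set[Hashable], b: Set[Hashable],
            alpha: float = 0.5, beta: float = 0.5) -> float:
    """Tversky index with separate weights for features unique to a and to b.

    alpha = beta = 1 reproduces Tanimoto; alpha = beta = 0.5 reproduces Dice.
    """
    c = len(a & b)
    denom = c + alpha * len(a - b) + beta * len(b - a)
    return c / denom if denom else 0.0


def mcs_tanimoto(size_a: int, size_b: int, size_mcs: int) -> float:
    """Tanimoto-style similarity from two compound sizes and their MCS size."""
    denom = size_a + size_b - size_mcs
    return size_mcs / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical descriptor labels for two compounds.
    cmp_a = {"C-N-2", "C-C-1", "C-O-3", "N-O-4"}
    cmp_b = {"C-N-2", "C-C-1", "C-O-5"}
    print(tanimoto(cmp_a, cmp_b))          # 2 shared / 5 total
    print(dice(cmp_a, cmp_b))              # 4 / 7
    print(tversky(cmp_a, cmp_b, 1.0, 1.0))
    print(mcs_tanimoto(20, 10, 10))
```

The last call illustrates why size differences matter: when a 10-atom compound is fully contained in a 20-atom compound, the Tanimoto form still yields only 0.5, whereas an asymmetric Tversky weighting can be tuned to reward such substructure relationships.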

Search toolbox

To efficiently mine much of the chemical structure and bioactivity space available in the public domain, the ChemMine Tools service provides text and structure similarity search methods that interface with the PubChem database (15) via its SOAP-based Power User Gateway (PUG) data exchange feature. During an analysis session, instantaneous search functionality is often important for retrieval of detailed property and annotation information for compounds of interest, or to identify related structures. In ChemMine Tools, structural similarity searches can be performed with PubChem's fingerprint search engine or via the EI Search method. The latter was developed in house as part of this project to provide ultra-fast structure similarity search functionality using an embedding/indexing (EI) algorithm (42). When the fingerprint method is chosen, the query is sent to PubChem, where the structure search is performed and the results are returned to the compound workbench. In contrast to this, EI Search is specific to the ChemMine Tools project and thus, runs locally on its servers. These two tools possess complementary strengths and weaknesses in identifying weak similarities among compounds (42).

Clustering toolbox

Clustering of compounds by structural or property similarity can be a powerful approach to correlating compound features with biological activity. Clustering tools are also widely utilized for diversity analyses to identify structural redundancies and other biases in compound libraries. ChemMine Tools' clustering workbench provides an online interface to three clustering algorithms which include hierarchical clustering, multidimensional scaling (MDS) and binning clustering (35). The following provides a short overview of these tools, while a more detailed outline of the underlying theory and clustering schemes is available in the online tutorial. When clustering by structural similarity, the required similarity measures are computed by first generating the atom pair descriptors (features) for each compound which are then used to calculate a similarity matrix based on the common and unique features observed among all compound pairs using the Tanimoto coefficient. The Tanimoto coefficient has a range from 0 to 1 with higher values indicating greater similarity than lower ones. For the subsequent clustering steps, the similarity matrix is converted into a distance matrix by subtracting the similarity values from 1. The hierarchical and MDS clustering methods provided by ChemMine Tools are based on the R programs hclust and cmdscale, respectively; the third method utilizes an internally developed C++ implementation. These three programs complement one another with respect to their data outputs and visualization options. Hierarchical clustering organizes compounds by similarity in a tree with branch lengths proportional to the item-to-item (compound-to-compound) similarities, while the MDS output encodes this information in a scatter plot. 
These two methods do not directly provide assignments of compounds to discrete similarity groups; assignments are generated downstream of the actual clustering process using various post-processing methods, such as tree cutting approaches. The binning clustering output provides these groupings directly for a user-definable similarity cutoff. For instance, if a Tanimoto coefficient of 0.6 is chosen then compounds will be joined into groups that share a similarity of this value or greater using a ‘single linkage’ rule for cluster joining. Final results are presented as interactive visualization pages to simplify the interpretation of the (often complex) clustering results. The hierarchical clustering result page uses the Google Maps API to generate zoom- and click-able trees aligned with molecular structure images. Moreover, heat maps of user uploaded data containing compound property, activity or other information can be viewed alongside the tree. A similar system is used to present the MDS results as click-able scatter plots with cursor-over viewing of compound structures. The binning clustering results are presented in a table view containing (among other information) the cluster identifiers and the corresponding compound depictions.
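The conversion from similarities to distances and the single linkage grouping at a fixed cutoff can be sketched as follows. This is a minimal pure-Python illustration built on a union-find structure, not the internally developed C++ implementation; the function names and the example similarity matrix are hypothetical.

```python
from typing import Dict, List, Sequence


def to_distance(sim: Sequence[Sequence[float]]) -> List[List[float]]:
    """Convert a similarity matrix into a distance matrix (distance = 1 - similarity)."""
    return [[1.0 - s for s in row] for row in sim]


def binning_clusters(sim: Sequence[Sequence[float]], cutoff: float) -> List[List[int]]:
    """Group items by single linkage: any pair with similarity >= cutoff is joined."""
    n = len(sim)
    parent = list(range(n))

    def find(i: int) -> int:
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= cutoff:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical Tanimoto similarities among four compounds.
SIM = [
    [1.0, 0.7, 0.2, 0.1],
    [0.7, 1.0, 0.65, 0.1],
    [0.2, 0.65, 1.0, 0.3],
    [0.1, 0.1, 0.3, 1.0],
]

if __name__ == "__main__":
    print(to_distance(SIM)[0])
    print(binning_clusters(SIM, 0.6))  # compounds 0, 1 and 2 are chained together
    print(binning_clusters(SIM, 0.7))  # only compounds 0 and 1 remain joined
```

With a cutoff of 0.6, compounds 0 and 2 end up in the same group even though their direct similarity is only 0.2, because both are linked through compound 1. This chaining behavior is characteristic of the single linkage rule and is worth keeping in mind when choosing a cutoff.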

Property toolbox

Predictions of small molecule physicochemical properties are important for assessing their ‘druglikeness’ and ‘leadlikeness’ in silico (43,44). They are also useful for enriching compound collections with desirable properties. For instance, the famous ‘Lipinski Rule of Five’ (45) is often applied to enrich compound collections with druglike candidates. This rule filters for compounds with ≤5 hydrogen bond donors, ≤10 hydrogen bond acceptors, a molecular weight ≤500 daltons and an octanol-water partition coefficient log P ≤ 5. Physicochemical property data are essential for predicting bioactive and other properties of small molecules using modern machine learning approaches. These data are fundamental to the development of QSAR models (25). ChemMine Tools provides an online interface to the property prediction module of the JOELib package (32). This service can calculate 38 physicochemical property values, including Lipinski descriptors, for custom compound sets. The resulting property tables can be downloaded or further processed on ChemMine Tools by sending them to the Clustering toolbox. There, they can be used to cluster compounds by similar property profiles, as described above, or the data can be visualized as a heat map next to the hierarchical clustering trees.
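Applied to a downloaded property table, the Rule of Five filter described above might look like the following sketch. The `Properties` record, its field names and the example values are hypothetical and do not correspond to JOELib descriptor names. The filter follows the strict form stated in the text, in which all four thresholds must be met.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Properties:
    """Hypothetical row of a physicochemical property table."""
    name: str
    h_bond_donors: int
    h_bond_acceptors: int
    molecular_weight: float  # daltons
    log_p: float             # octanol-water partition coefficient


def passes_rule_of_five(p: Properties) -> bool:
    """Return True only if all four thresholds quoted in the text are satisfied."""
    return (
        p.h_bond_donors <= 5
        and p.h_bond_acceptors <= 10
        and p.molecular_weight <= 500.0
        and p.log_p <= 5.0
    )


def filter_druglike(compounds: Iterable[Properties]) -> List[Properties]:
    """Keep the compounds that pass the filter, preserving input order."""
    return [p for p in compounds if passes_rule_of_five(p)]


if __name__ == "__main__":
    # Made-up property values for illustration only.
    table = [
        Properties("cmpA", 1, 4, 180.2, 1.2),
        Properties("cmpB", 6, 12, 733.9, 3.1),
        Properties("cmpC", 2, 5, 499.9, 5.4),
    ]
    print([p.name for p in filter_druglike(table)])
```

Keeping the thresholds in one predicate makes it easy to relax the filter, for example by tolerating a single violation, without touching the code that reads the property table.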

CONCLUSION AND FUTURE DEVELOPMENT

ChemMine Tools is an online service for compound analysis in the chemical genomics field. The service is unique in that it integrates a large number of cheminformatic programs with clustering and visualization functionalities. Additional outstanding features of ChemMine Tools include: (i) its commitment to publicly developed open source software throughout its infrastructure; (ii) its strong dedication to the development of new cheminformatic tools and their free distribution in the community; and (iii) the integration of its many components into a unified online and downloadable software infrastructure which maximizes their utility for diverse tasks with different levels of complexity and customization needs. An intuitive web interface makes these tools accessible to scientists with limited computational background, while simultaneously providing a programmable interface for advanced users. To the best of our knowledge, there are currently no related online services available that provide a comparable suite of functionalities. Overlaps exist; however, they are limited to isolated functionalities. For instance, ChemDB and VCCLab (13,43) can be used for property predictions and structure format interconversions of single compound queries, and PubChem supports structure-based clustering for compounds retrieved from its own database. In the future, many additional utilities will be added to the ChemMine Tools service, including MCS-based search functionality within the Similarity toolbox to support more complex graph-based search strategies against custom compound sets imported into the Compound workbench. Existing functionalities for analyzing bioactivity data will also be expanded by adding a Bioactivity toolbox that will contain regression, machine learning and QSAR modeling tools.

FUNDING

National Science Foundation (grant numbers ABI-0957099, 2010-0520325 and IGERT-0504249). Funding for open access charge: National Science Foundation (grant number: ABI-0957099).

Conflict of interest statement. None declared.

43 in total

1. Enhanced CACTVS browser of the Open NCI Database.

Authors: Wolf-Dietrich Ihlenfeldt; Johannes H Voigt; Bruno Bienfait; Frank Oellien; Marc C Nicklaus
Journal: J Chem Inf Comput Sci Date: 2002 Jan-Feb

2. Analysis and display of the size dependence of chemical similarity coefficients.

Authors: John D Holliday; Naomie Salim; Martin Whittle; Peter Willett
Journal: J Chem Inf Comput Sci Date: 2003 May-Jun

3. Performance of similarity measures in 2D fragment-based similarity searching: comparison of structural descriptors and similarity coefficients.

Authors: Xin Chen; Charles H Reynolds
Journal: J Chem Inf Comput Sci Date: 2002 Nov-Dec

4. Maximum common subgraph isomorphism algorithms for the matching of chemical structures.

Authors: John W Raymond; Peter Willett
Journal: J Comput Aided Mol Des Date: 2002-07 Impact factor: 3.686

5. Virtual computational chemistry laboratory--design and description.

Authors: Igor V Tetko; Johann Gasteiger; Roberto Todeschini; Andrea Mauri; David Livingstone; Peter Ertl; Vladimir A Palyulin; Eugene V Radchenko; Nikolay S Zefirov; Alexander S Makarenko; Vsevolod Yu Tanchuk; Volodymyr V Prokopenko
Journal: J Comput Aided Mol Des Date: 2005-06 Impact factor: 3.686

6. ChemDB update--full-text search and virtual chemical space.

Authors: Jonathan H Chen; Erik Linstead; S Joshua Swamidass; Dennis Wang; Pierre Baldi
Journal: Bioinformatics Date: 2007-06-28 Impact factor: 6.937

7. Accelerated similarity searching and clustering of large compound sets by geometric embedding and locality sensitive hashing.

Authors: Yiqun Cao; Tao Jiang; Thomas Girke
Journal: Bioinformatics Date: 2010-02-23 Impact factor: 6.937

8. Small Molecule Subgraph Detector (SMSD) toolkit.

Authors: Syed Asad Rahman; Matthew Bashton; Gemma L Holliday; Rainer Schrader; Janet M Thornton
Journal: J Cheminform Date: 2009-08-10 Impact factor: 5.514

9. Feature selection for descriptor based classification models. 2. Human intestinal absorption (HIA).

Authors: Jörg K Wegner; Holger Fröhlich; Andreas Zell
Journal: J Chem Inf Comput Sci Date: 2004 May-Jun

10. From genomics to chemical genomics: new developments in KEGG.

Authors: Minoru Kanehisa; Susumu Goto; Masahiro Hattori; Kiyoko F Aoki-Kinoshita; Masumi Itoh; Shuichi Kawashima; Toshiaki Katayama; Michihiro Araki; Mika Hirakawa
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971
