
MolClass: a web portal to interrogate diverse small molecule screen datasets with different computational models.

Jan Wildenhain, Nicholas Fitzgerald, Mike Tyers.

Abstract

The MolClass toolkit and data portal generate computational models from user-defined small molecule datasets, based on structural features identified in hit and non-hit molecules in different screens. Each new model is applied to all datasets in the database to classify compound specificity. MolClass thus defines a likelihood value for each compound entry and creates an activity fingerprint across diverse sets of screens. MolClass uses a variety of machine-learning methods to find molecular patterns and can therefore also assign a priori predictions of bioactivities for previously untested molecules. The power of the MolClass resource will grow as a function of the number of screens deposited in the database.
AVAILABILITY AND IMPLEMENTATION: The MolClass web portal, software package and source code are freely available for non-commercial use at http://tyerslab.bio.ed.ac.uk/molclass. A MolClass tutorial and a guide on how to build models from datasets can also be found on the web site. MolClass uses the Chemistry Development Kit (CDK), WEKA and MySQL for its core functionality. A REST service based on the OpenTox API 1.2 is available at http://tyerslab.bio.ed.ac.uk/molclass/api.

Year: 2012; PMID: 22711790; PMCID: PMC3413392; DOI: 10.1093/bioinformatics/bts349

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937

1 INTRODUCTION

Bioactive molecules can serve as powerful tools for interrogating biological systems and as precursors in drug discovery. An objective of chemical systems biology is to model biological systems in order to understand the effects of small molecules on cellular processes, and thereby explain the basis for small molecule action (Hopkins, 2008). Realization of this ambitious goal will require extensive experimental datasets, but the generation of chemical datasets from biological screening assays is usually limited by cost and throughput. Pharmaceutical companies and academic groups use high-throughput screens to search large libraries for small molecules that elicit a desired biological response, typically against a single target or at most a few related targets. However, chemical space is estimated to contain on the order of 10^60 molecular entities, which greatly exceeds even the multi-million compound libraries at the disposal of large pharmaceutical companies (Dobson, 2004). This vastness of chemical space requires that researchers devise rational approaches for identifying small bioactive molecules, particularly given the severe resource constraints on academic screening initiatives. The computational evaluation of potential bioactive molecules can drive down the high cost of screens and help extract potential drug-like compounds from pre-existing data in the public domain. To enable the extraction of information from existing chemical screen data, we have developed a suite of machine-learning tools that statistically rank each compound for any given assay in a user-defined database. MolClass will thus facilitate the identification of specific bioactive molecules and allow the prediction of moieties that underpin biological activity.

2 WORKFLOW FEATURES

Existing resources for chemical screen data, notably PubChem, ChEMBL and ChemBank, are passive repositories that house an incomplete matrix of small molecule activity across submitted screens of various types, ranging from in vitro binding and enzyme assays to complex cellular and whole organism phenotypic assays (Fig. 1A). To interrogate such data, MolClass generates a complete matrix of compound activities across many screens and thereby enables functional predictions for all molecules, even those not tested in a specific screen. The user can upload input datasets of up to 20 000 molecules in SDF file format, in which tags distinguish hit from non-hit compounds in one or several screens. MolClass combines the datasets to generate a computational model for each screen submitted (Fig. 1B). These models are then applied to all molecules stored in MolClass to predict activity. For the molecular description, the user can choose either a composite of all 2D chemical descriptors (2529 bit, which equals the combined length of the five descriptor sets that follow) or, independently, 152 property descriptors, MACCS (166 bit), Substructure (306 bit), CDK extended (1024 bit) or PubChem (881 bit) fingerprints. As different machine-learning algorithms tend to generate slightly different likelihood values, MolClass provides a variety of algorithms, including Random Forest, Naïve Bayes, SVM, KNN, Logistic Model Tree and J48. The user can apply one or several algorithms to any dataset of interest. Unbalanced datasets are boosted using SMOTE (Nitesh et al.), which at most doubles the size of the smaller class; if the ratio of active to inactive compounds still exceeds 1:5, the dataset is further adjusted using the WEKA under-sampling method. For training and testing, MolClass uses 10-fold cross-validation. All models in MolClass are then applied to the uploaded molecules to generate activity fingerprints.
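The balancing rule described above can be read as simple size arithmetic. The sketch below is one interpretation, not MolClass's actual code: it assumes that SMOTE at most doubles the smaller class (and never lets it outgrow the larger one), and that under-sampling then trims the larger class to five times the boosted smaller class. The function name `plan_balancing` and the `max_ratio` parameter are hypothetical.

```python
def plan_balancing(n_active, n_inactive, max_ratio=5):
    """Return the (n_active, n_inactive) class sizes after balancing.

    Assumed reading of the text, not MolClass's code:
      1. SMOTE may at most double the smaller class, and is capped so the
         smaller class never outgrows the larger one.
      2. If the classes still differ by more than 1:max_ratio, the larger
         class is under-sampled down to max_ratio times the smaller one.
    """
    if n_active <= 0 or n_inactive <= 0:
        raise ValueError("both classes need at least one compound")

    small, large = sorted((n_active, n_inactive))
    small = min(2 * small, large)           # step 1: SMOTE boost
    large = min(large, max_ratio * small)   # step 2: under-sampling

    # Hand the sizes back in the original (active, inactive) order.
    if n_active <= n_inactive:
        return small, large
    return large, small
```

Under these assumptions, a screen with 100 hits and 2000 non-hits would be trained on 200 hits (100 of them synthetic) and 1000 non-hits.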
The user can examine the model statistics, the likelihood scores for screens of interest and, as shown in Fig. 1C, single molecule likelihood fingerprints for existing models. Finally, MolClass also enables a substructure search using the JME Editor in the event that a molecule of interest is not present in the database.
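One way to picture a likelihood fingerprint is as a row of scores, one per stored model. The following is a minimal sketch, assuming each model can be reduced to a callable that returns a likelihood between 0 and 1. The toy models, the compound records and the helper names are hypothetical and merely stand in for the WEKA classifiers that MolClass trains.

```python
def activity_fingerprint(compound, models):
    """Score one compound with every model, in a fixed (sorted) model order."""
    return [models[name](compound) for name in sorted(models)]


def likelihood_matrix(compounds, models):
    """Apply all models to all compounds: a complete compound x screen matrix."""
    return {cid: activity_fingerprint(c, models) for cid, c in compounds.items()}


# Toy stand-ins for trained classifiers (hypothetical, for illustration only).
models = {
    "screen_A": lambda c: 0.9 if c["has_nitro"] else 0.1,
    "screen_B": lambda c: min(1.0, c["rings"] / 4),
}
compounds = {
    "cmpd_1": {"has_nitro": True, "rings": 2},
    "cmpd_2": {"has_nitro": False, "rings": 4},
}

matrix = likelihood_matrix(compounds, models)
```

Because the model order is fixed by sorting the names, position i refers to the same screen for every compound, which is what makes two fingerprints comparable.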

Fig. 1

MolClass features. (A) Current state of data from public resources such as PubChem and ChemBank. (B) MolClass workflow from experimental data to activity likelihoods. (C) Likelihood scores for fenbendazole and aspirin in 14 different models: (1) neurosphere proliferation, +/none (Diamandis et al., 2007); (2) Caco-2 permeation, +/− (Hou et al.); (3) fluconazole synergizer, +/none (Spitzer et al., 2011); (4) Caenorhabditis elegans drug bioaccumulation, none/+ (Burns et al., 2010); (5) Ames mutagenicity benchmark, none/+ (Hansen et al., 2009); (6) mutagenicity prediction, +/none (Kazius et al., 2005); (7) blood–brain barrier penetration, +/− (Li et al., 2005); (8) PubChem AID 1828, +/none; (9) PubChem AID 595, +/−; (10) ChemBank 1000423, +/−; (11) ChemBank 1001644, +/−; (12) ChemBank 1000359, +/−; (13) autofluorescence, none/+; and (14) ChEMBL TargetID CHEMBL204, none/+. ‘+’ denotes activating, ‘−’ inhibiting and ‘none’ no effect.

3 CONCLUSION

MolClass provides a comprehensive overview of compound activity in different screens. These profiles can reveal promiscuous activities across several screens, which may reflect undesirable off-target effects. For experimental datasets, the user can discover structure-activity relationships because similar structures and activities will lead to specific likelihood patterns. As users expand the data collection to different biological responses and assay formats, the classification power of the portal will increase, and thereby facilitate chemical systems biology.
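Both uses of the profiles can be sketched on a likelihood fingerprint. The example assumes that a fingerprint is a list of likelihoods between 0 and 1, one per screen. The 0.5 cut-off, the threshold of three screens and the function names are illustrative choices, not values taken from MolClass.

```python
import math


def count_predicted_hits(fingerprint, cutoff=0.5):
    """Number of screens in which the compound is predicted active."""
    return sum(1 for score in fingerprint if score >= cutoff)


def is_promiscuous(fingerprint, cutoff=0.5, min_screens=3):
    """Flag compounds predicted active in at least min_screens screens."""
    return count_predicted_hits(fingerprint, cutoff) >= min_screens


def cosine_similarity(fp_a, fp_b):
    """Compare two likelihood patterns; 1.0 means the same direction."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must cover the same screens")
    dot = sum(a * b for a, b in zip(fp_a, fp_b))
    norm_a = math.sqrt(sum(a * a for a in fp_a))
    norm_b = math.sqrt(sum(b * b for b in fp_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero pattern is treated as dissimilar to everything
    return dot / (norm_a * norm_b)
```

Cosine similarity is only one possible way to compare patterns; any vector similarity measure could be substituted without changing the idea.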

4 IMPLEMENTATION

MolClass is implemented in Java and Perl using CDK (Steinbeck et al., 2003), WEKA (Hall et al.) and moldb4 (Haider, 2010). The web interface and REST service are written in PHP5 using Slim and PEAR, and run as an Apache HTTP service on a Fedora Linux 8 server. The data are stored in a MySQL 5.5 database running on a separate Fedora Linux 16 server.

REFERENCES

1. Christopher M Dobson (2004). Chemical space and biology. Nature.

2. Jeroen Kazius, Ross McGuire, Roberta Bursi (2005). Derivation and validation of toxicophores for mutagenicity prediction. J Med Chem.

3. Hu Li, Chun Wei Yap, Choong Yong Ung, Ying Xue, Zhi Wei Cao, Yu Zong Chen (2005). Effect of selection of molecular descriptors on the prediction of blood-brain barrier penetrating and nonpenetrating agents by statistical learning methods. J Chem Inf Model.

4. Phedias Diamandis, Jan Wildenhain, Ian D Clarke, Adrian G Sacher, Jeremy Graham, David S Bellows, Erick K M Ling, Ryan J Ward, Leanne G Jamieson, Mike Tyers, Peter B Dirks (2007). Chemical genetics reveals a complex functional ground state of neural stem cells. Nat Chem Biol.

5. Andrew L Hopkins (2008). Network pharmacology: the next paradigm in drug discovery. Nat Chem Biol.

6. Katja Hansen, Sebastian Mika, Timon Schroeter, Andreas Sutter, Antonius ter Laak, Thomas Steger-Hartmann, Nikolaus Heinrich, Klaus-Robert Müller (2009). Benchmark data set for in silico prediction of Ames mutagenicity. J Chem Inf Model.

7. Andrew R Burns, Iain M Wallace, Jan Wildenhain, Mike Tyers, Guri Giaever, Gary D Bader, Corey Nislow, Sean R Cutler, Peter J Roy (2010). A predictive model for drug bioaccumulation and bioactivity in Caenorhabditis elegans. Nat Chem Biol.

8. Michaela Spitzer, Emma Griffiths, Kim M Blakely, Jan Wildenhain, Linda Ejim, Laura Rossi, Gianfranco De Pascale, Jasna Curak, Eric Brown, Mike Tyers, Gerard D Wright (2011). Cross-species discovery of syncretic drug combinations that potentiate the antifungal fluconazole. Mol Syst Biol.

9. Christoph Steinbeck, Yongquan Han, Stefan Kuhn, Oliver Horlacher, Edgar Luttmann, Egon Willighagen (2003). The Chemistry Development Kit (CDK): an open-source Java library for Chemo- and Bioinformatics. J Chem Inf Comput Sci.

10. Norbert Haider (2010). Functionality pattern matching as an efficient complementary structure/reaction search tool: an open-source approach. Molecules.
