Jiangming Sun, Nina Jeliazkova, Vladimir Chupakin, Jose-Felipe Golib-Dzib, Ola Engkvist, Lars Carlsson, Jörg Wegner, Hugo Ceulemans, Ivan Georgiev, Vedrin Jeliazkov, Nikolay Kochev, Thomas J Ashby, Hongming Chen.
Abstract
Chemogenomics data generally refers to the activity data of chemical compounds on an array of protein targets and represents an important source of information for building in silico target prediction models. The increasing volume of chemogenomics data offers exciting opportunities to build models based on Big Data. Preparing a high-quality dataset is a vital step in realizing this goal, and this work aims to compile such a comprehensive chemogenomics dataset. This dataset comprises over 70 million SAR data points from publicly available databases (PubChem and ChEMBL), including structure, target information and activity annotations. Our aspiration is to create a useful chemogenomics resource reflecting industry-scale data, not only for building predictive models of in silico polypharmacology and off-target effects but also for the validation of cheminformatics approaches in general.
Keywords: Big Data; Bioactivity; Chemical structure; Chemogenomics; Molecular fingerprints; QSAR; Search engine
Year: 2017 PMID: 28316655 PMCID: PMC5340785 DOI: 10.1186/s13321-017-0203-5
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 Workflow for data preparation
Fig. 2 Browsing the ExCAPE-DB web interface. a Searching the database via gene symbol or free text. The result page links to the original compound information. b Searching the database via substructure search
Public chemogenomics dataset
| | ChEMBL | PubChem | ExCAPE-DB |
|---|---|---|---|
| Actives | | | |
| # SAR data points | 1,259,338 | 439,288 | 1,332,426 |
| # Compounds | 566,143 | 263,119 | 593,156 |
| Inactives | | | |
| # SAR data points | 1,530,908 | 68,948,609 | 69,517,737 |
| # Compounds | 416,655 | 654,562 | 719,192 |
| Total | | | |
| # SAR data points | 2,790,246 | 69,387,897 | 70,850,163 |
| # Compounds | 710,324 | 828,317 | 998,131 |
| # Targets | 1644 | 1588 | 1667 |
Fig. 3 Composition of active compounds in the dataset. The distribution of active compounds among the targets in a ExCAPE-DB, b ChEMBL part of ExCAPE-DB and c the fraction span of actives in both datasets. We note that the ChEMBL dataset is shown here before the filtering and aggregation process and contains only single-target assays. A compound is counted as active if its pXC50 is no less than 5, and only targets with at least 20 active compounds were considered
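The activity rule stated in the Fig. 3 caption can be sketched in Python as follows. This is a minimal illustration and not the authors' pipeline: the record layout (tuples of target, compound identifier and pXC50), the function name and the sample values are all hypothetical. Only the two thresholds (pXC50 of at least 5, at least 20 active compounds per target) come from the caption.

```python
from collections import defaultdict

PXC50_ACTIVE_CUTOFF = 5.0    # from the caption: active if pXC50 is no less than 5
MIN_ACTIVES_PER_TARGET = 20  # from the caption: keep targets with >= 20 actives


def targets_with_enough_actives(records,
                                cutoff=PXC50_ACTIVE_CUTOFF,
                                min_actives=MIN_ACTIVES_PER_TARGET):
    """Return {target: set of active compound ids} for qualifying targets.

    `records` is a hypothetical iterable of (target, compound_id, pxc50).
    """
    actives = defaultdict(set)
    for target, compound_id, pxc50 in records:
        if pxc50 >= cutoff:
            # A set counts each compound once per target.
            actives[target].add(compound_id)
    return {t: c for t, c in actives.items() if len(c) >= min_actives}


# Made-up example data, for illustration only.
records = [("DRD2", f"cpd{i}", 6.2) for i in range(25)]
records += [("MAOA", f"cpd{i}", 5.0) for i in range(5)]
records += [("DRD2", "cpd_x", 4.9)]  # below the cutoff, so not active

kept = targets_with_enough_actives(records)
print(sorted(kept))
```

In this toy input, MAOA has only five actives and is dropped, while DRD2 is kept with 25 actives.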
Fig. 4 Distribution of cluster size in ExCAPE-DB. Here singletons and small clusters whose size is <4 are excluded from the analysis
Fig. 5 Target family distribution in the dataset
Fig. 6 The physicochemical property distribution. a Molecular weight (MW), b calculated value of lipophilic efficiency (ClogP), c polar surface area (PSA) and d fraction of sp3 carbon (FCS)
Performances of fivefold cross-validation for 18 targets using SVM
| Target | Active compounds | Inactive compounds | Ratio (active/inactive compounds) | Sensitivity | Precision | Specificity | κ |
|---|---|---|---|---|---|---|---|
| PPARA | 1955 | 1465 | 1.33 | 0.96 | 0.94 | 0.92 | 0.89 |
| MMP2 | 2742 | 2363 | 1.16 | 0.96 | 0.96 | 0.96 | 0.92 |
| MAOA | 732 | 733 | 1.00 | 0.79 | 0.80 | 0.81 | 0.59 |
| NR1I2 | 249 | 1090 | 0.23 | 0.82 | 0.73 | 0.93 | 0.72 |
| TMPRSS15 | 139 | 724 | 0.19 | 0.43 | 0.54 | 0.93 | 0.39 |
| HSD17B10 | 3410 | 11,510 | 0.30 | 0.41 | 0.40 | 0.82 | 0.23 |
| KDM4E | 3938 | 35,059 | 0.11 | 0.22 | 0.29 | 0.94 | 0.18 |
| LMNA | 14,533 | 171,164 | 0.09 | 0.49 | 0.13 | 0.72 | 0.10 |
| TDP1 | 23,133 | 276,782 | 0.08 | 0.76 | 0.38 | 0.90 | 0.45 |
| TARDBP | 12,193 | 387,934 | 0.03 | 0.22 | 0.08 | 0.92 | 0.08 |
| ALOX15 | 1932 | 69,362 | 0.03 | 0.49 | 0.12 | 0.90 | 0.16 |
| BRCA1 | 8619 | 363,912 | 0.02 | 0.72 | 0.20 | 0.93 | 0.29 |
| DRD2 | 4613 | 343,076 | 0.01 | 0.96 | 0.93 | 1.00 | 0.94 |
| GSK3B | 3334 | 300,186 | 0.01 | 0.85 | 0.72 | 1.00 | 0.78 |
| JAK2 | 2158 | 213,915 | 0.01 | 0.85 | 0.81 | 1.00 | 0.83 |
| POLK | 773 | 389,418 | 0.002 | 0.55 | 0.17 | 0.99 | 0.26 |
| FEN1 | 1050 | 381,575 | 0.003 | 0.35 | 0.03 | 0.96 | 0.04 |
| HDAC3 | 369 | 311,425 | 0.001 | 0.98 | 0.76 | 1.00 | 0.86 |
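For readers who want to reproduce the column definitions in the table above, the sketch below computes sensitivity, precision, specificity and Cohen's κ from a binary confusion matrix. It assumes the standard textbook definitions of these metrics; the source does not spell out its formulas, and the function name and the example counts are hypothetical rather than taken from the paper.

```python
def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, precision, specificity and Cohen's kappa from counts.

    Assumes every denominator is non-zero (at least one actual positive,
    one actual negative and one predicted positive).
    """
    n = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)   # recall on the active class
    precision = tp / (tp + fp)     # fraction of predicted actives that are active
    specificity = tn / (tn + fp)   # recall on the inactive class
    observed = (tp + tn) / n       # observed agreement (accuracy)
    # Agreement expected by chance, from the row and column marginals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (observed - expected) / (1 - expected)
    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "specificity": specificity,
        "kappa": kappa,
    }


# Made-up counts for an imbalanced target (100 actives, 900 inactives).
m = binary_metrics(tp=90, fp=10, tn=890, fn=10)
print({name: round(value, 2) for name, value in m.items()})
```

Because κ corrects for chance agreement, it stays informative when inactives vastly outnumber actives, which is the situation for most targets in the table; specificity alone can look near-perfect in those rows.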