Qiwan Hu, Mudong Feng, Luhua Lai, Jianfeng Pei.
Abstract
Due to diverse reasons, most drug candidates cannot eventually become marketed drugs. Developing reliable computational methods to predict the drug-likeness of candidate compounds is therefore of vital importance for improving the success rate of drug discovery and development. In this study, we used fully connected neural networks (FNNs) to construct drug-likeness classification models, with deep autoencoders used to initialize the model parameters. We collected datasets of drugs (represented by ZINC World Drug), bioactive molecules (represented by MDDR and WDI), and common molecules (represented by ZINC All Purchasable and ACD). Compounds were encoded with MOLD2 two-dimensional structure descriptors. The classification accuracies of the drug-like/non-drug-like models are 91.04% on the WDI/ACD databases and 91.20% on MDDR/ZINC, respectively, outperforming previously reported models. In addition, we developed a drug/non-drug model (ZINC World Drug vs. ZINC All Purchasable), which distinguishes drugs from common compounds with a classification accuracy of 96.99%. Our work shows that, using high-dimensional molecular descriptors, deep learning can be applied to establish state-of-the-art drug-likeness prediction models.
Keywords: MDDR; ZINC; auto-encoder; deep learning; drug-likeness
Year: 2018 PMID: 30538725 PMCID: PMC6277570 DOI: 10.3389/fgene.2018.00585
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Detailed information of the dataset pairs.
| Dataset pair | Number of positive | Number of negative | Total |
|---|---|---|---|
| WDI/ACD | 38,260 | 288,540 | 326,800 |
| MDDR/ZINC | 171,850 | 199,220 | 371,070 |
| WORLDDRUG/ZINC | 3,380 | 199,220 | 202,600 |
Data preprocessing and post-processing steps used in this study.
| Step name/software | Step description |
|---|---|
| Data preprocessing | |
| Element filter/KNIME | Hydrocarbons are removed. Molecules containing elements other than C, H, O, N, P, S, Cl, Br, I, and Si are removed. |
| Remove mixtures/KNIME | All records containing more than one molecule are removed. |
| Standardize/ChemAxon Standardizer | Neutralize, tautomerize, aromatize, and clean 2D structures. |
| Remove duplicates/OpenBabel | Two molecules with the same InChI (including stereochemistry) are treated as duplicates. If a molecule appears in both the drug set and the non-drug set, it is removed from the non-drug set. For duplicates within the same set, only the first occurrence is kept. |
| Data post-processing | |
| Remove error values/Python | If any descriptor value of a molecule is N/A or infinite, that molecule is removed. |
| Remove constant descriptors/Python | If a descriptor has the same value across all molecules, it is removed from the descriptor list. |
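The two post-processing rules above can be sketched as a small NumPy routine (the function name and the toy matrix are ours, not from the paper):

```python
import numpy as np

def clean_descriptor_matrix(X):
    """Post-process a molecules x descriptors matrix as described above:
    drop molecules (rows) containing any N/A or infinite descriptor value,
    then drop descriptors (columns) that are constant across all molecules.
    Returns the cleaned matrix plus boolean masks of kept rows and columns.
    """
    X = np.asarray(X, dtype=float)
    row_ok = np.isfinite(X).all(axis=1)        # molecules with only valid values
    X = X[row_ok]
    col_ok = X.max(axis=0) != X.min(axis=0)    # descriptors that actually vary
    return X[:, col_ok], row_ok, col_ok

# Toy example: 4 molecules x 3 descriptors; descriptor 0 is constant,
# molecule 2 carries an infinite value.
X = [[1.0, 2.0, 3.0],
     [1.0, 5.0, 1.0],
     [1.0, np.inf, 2.0],
     [1.0, 7.0, 4.0]]
Xc, rows, cols = clean_descriptor_matrix(X)   # Xc has shape (3, 2)
```

Note the ordering: molecules with error values are dropped first, so a descriptor that is constant on the remaining molecules is still removed.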
FIGURE 1 Schematic architecture of a stacked autoencoder. Left: the autoencoder architecture, which can be stacked layer by layer. Right: a pre-trained autoencoder initializing a fully connected network of the same structure for classification.
Hyper-parameter settings of the stacked autoencoder.
| Hyperparameter | Setting |
|---|---|
| Initializer | TruncatedNormal |
| Number of hidden layers | 1 |
| Number of hidden layer nodes | 512 |
| L2 Normalization term | 1e-4 |
| Dropout rate | 0.14 |
| Activation | ReLU |
| Batch size | 128 |
| Optimizer | Adam |
| Loss | MSE for the autoencoder; binary cross-entropy for the classifier |
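A minimal NumPy sketch of the pre-training idea, assuming MOLD2's 777 two-dimensional descriptors as input: an autoencoder with one 512-node ReLU hidden layer is trained to minimize MSE reconstruction loss, and its encoder weights then initialize the hidden layer of the binary classifier. The training loop, dropout, and L2 term are omitted, and all function and variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def truncated_normal(shape, std=0.05):
    """TruncatedNormal initializer: redraw any value beyond 2 standard deviations."""
    w = rng.normal(0.0, std, size=shape)
    bad = np.abs(w) > 2 * std
    while bad.any():
        w[bad] = rng.normal(0.0, std, size=bad.sum())
        bad = np.abs(w) > 2 * std
    return w

n_features, n_hidden = 777, 512   # MOLD2 descriptor count (assumed); 512 hidden nodes

# --- autoencoder: one hidden layer, trained to reconstruct its input (MSE loss)
W_enc = truncated_normal((n_features, n_hidden))
b_enc = np.zeros(n_hidden)
W_dec = truncated_normal((n_hidden, n_features))
b_dec = np.zeros(n_features)

def relu(x):
    return np.maximum(x, 0.0)

def autoencoder_mse(X):
    h = relu(X @ W_enc + b_enc)        # encoder
    X_hat = h @ W_dec + b_dec          # decoder (linear output)
    return np.mean((X - X_hat) ** 2)   # reconstruction loss

# --- after pre-training, the encoder weights initialize the classifier's hidden layer
W1, b1 = W_enc, b_enc                  # copied hidden-layer parameters
W_out = truncated_normal((n_hidden, 1))
b_out = np.zeros(1)

def classifier_prob(X):
    h = relu(X @ W1 + b1)
    return 1.0 / (1.0 + np.exp(-(h @ W_out + b_out)))  # sigmoid for binary output

X = rng.normal(size=(4, n_features))   # toy batch of 4 "molecules"
```

Only the output layer starts from scratch; the hidden layer begins from the pre-trained encoder, which is the initialization scheme Figure 1 depicts.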
Performance on the training sets with 5-fold cross-validation (5-CV), balancing classes either by copying the minority class or by SMOTE over-sampling.
| Model | Copy ACC | Copy SE | Copy SP | Copy AUC | SMOTE ACC | SMOTE SE | SMOTE SP | SMOTE AUC |
|---|---|---|---|---|---|---|---|---|
| WDI/ACD | 0.8923 | 0.8991 | 0.8859 | 0.9598 | 0.9265 | 0.9244 | 0.9286 | 0.9783 |
| MDDR/ZINC | 0.9095 | 0.8855 | 0.9302 | 0.9701 | 0.9116 | 0.9141 | 0.9092 | 0.9719 |
| WORLD/ZINC | 0.9910 | 0.9961 | 0.9859 | 0.9986 | 0.9906 | 0.9937 | 0.9874 | 0.9990 |
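SMOTE, the over-sampling method compared in the table above, generates synthetic minority samples by interpolating between a real minority sample and one of its k nearest minority-class neighbours. A self-contained NumPy sketch of the idea (not the exact implementation used in the study):

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """SMOTE over-sampling sketch: each synthetic minority sample is an
    interpolation between a minority sample and one of its k nearest
    minority-class neighbours: x_new = x + u * (x_nn - x), u ~ U(0, 1).
    """
    rng = rng or np.random.default_rng(0)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                # a sample is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]          # indices of the k nearest neighbours
    out = []
    for _ in range(n_new):
        i = rng.integers(n)                    # pick a random minority sample
        j = nn[i, rng.integers(min(k, n - 1))] # and one of its neighbours
        u = rng.random()
        out.append(X_min[i] + u * (X_min[j] - X_min[i]))
    return np.stack(out)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X_syn = smote(X_min, 10, k=2)                  # 10 synthetic minority samples
```

Because every synthetic point lies on a segment between two real minority samples, the new samples stay inside the minority class's convex hull rather than being exact copies, which is the advantage over simply duplicating the minority class.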
Performance of the models (trained with SMOTE over-sampling) on the validation sets.
| Model | ACC | SE | SP | MCC | AUC |
|---|---|---|---|---|---|
| WDI/ACD | 0.9014 | 0.7683 | 0.9191 | 0.6014 | 0.9271 |
| MDDR/ZINC | 0.9025 | 0.9012 | 0.9036 | 0.8043 | 0.9669 |
| WORLD/ZINC | 0.9800 | 0.7544 | 0.9838 | 0.5690 | 0.9707 |
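The tabulated metrics (other than AUC) derive directly from the confusion matrix; SE is sensitivity (recall on positives) and SP is specificity. A plain-Python sketch of their standard definitions:

```python
import math

def binary_metrics(y_true, y_pred):
    """Compute ACC, SE, SP, and MCC from hard 0/1 predictions:
    ACC = (TP+TN)/N, SE = TP/(TP+FN), SP = TN/(TN+FP),
    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = (tp + tn) / len(y_true)
    se = tp / (tp + fn) if tp + fn else 0.0
    sp = tn / (tn + fp) if tn + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, se, sp, mcc
```

MCC is the most informative single number here for imbalanced pairs such as WORLD/ZINC, where a high ACC can mask poor sensitivity, exactly the pattern visible in the table above.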
FIGURE 2 Evaluation metrics of the different models as a function of the weight assigned to the positive-sample loss.
Performance on the training set after optimizing the positive-sample weight in the loss function (SMOTE over-sampling).
| Model | ACC | SE | SP | MCC | AUC |
|---|---|---|---|---|---|
| WDI/ACD | 0.9104 | 0.9694 | 0.8515 | 0.8270 | 0.9757 |
| MDDR/ZINC | 0.9120 | 0.9219 | 0.9020 | 0.8243 | 0.9726 |
| WORLD/ZINC | 0.9699 | 0.9985 | 0.9414 | 0.9416 | 0.9955 |
Performance on the validation set after optimizing the positive-sample weight in the loss function (SMOTE over-sampling).
| Model | ACC | SE | SP | MCC | AUC |
|---|---|---|---|---|---|
| WDI/ACD | 0.8458 | 0.8524 | 0.8449 | 0.5286 | 0.9253 |
| MDDR/ZINC | 0.9046 | 0.9174 | 0.8935 | 0.8095 | 0.9699 |
| WORLD/ZINC | 0.9366 | 0.8804 | 0.9376 | 0.4049 | 0.9622 |
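The record does not spell out the exact weighting scheme swept in Figure 2; a common formulation is to scale the positive-class terms of the binary cross-entropy, as in this sketch (function and variable names are ours, not from the paper):

```python
import numpy as np

def weighted_bce(y_true, p, w_pos=1.0):
    """Binary cross-entropy in which each positive-sample term is scaled by
    w_pos. Setting w_pos > 1 penalizes missed positives more heavily, trading
    specificity for sensitivity, the shift visible in the tables above.
    """
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), 1e-7, 1 - 1e-7)  # avoid log(0)
    per_sample = -(w_pos * y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return per_sample.mean()

# Up-weighting positives raises the penalty contributed by positive samples:
y = [1, 0, 1, 0]
p = [0.6, 0.4, 0.3, 0.2]
loss_1 = weighted_bce(y, p, w_pos=1.0)   # standard BCE
loss_2 = weighted_bce(y, p, w_pos=2.0)   # positives weighted twice as much
```

With `w_pos = 1` this reduces to ordinary binary cross-entropy, consistent with the loss listed in the hyper-parameter table.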