| Literature DB >> 29086090 |
Alexios Koutsoukas, Keith J Monaghan, Xiaoli Li, Jun Huan.
Abstract
BACKGROUND: In recent years, research in artificial neural networks has resurged, now under the deep-learning umbrella, and grown extremely popular. Recently reported successes of DL techniques in crowd-sourced QSAR and predictive toxicology competitions have showcased these methods as powerful tools in drug-discovery and toxicology research. The aim of this work was twofold: first, a large number of hyper-parameter configurations was explored to investigate how they affect the performance of DNNs and to serve as starting points when tuning DNNs; second, DNN performance was compared to that of popular methods widely employed in cheminformatics, namely Naïve Bayes, k-nearest neighbor, random forest and support vector machines. Moreover, the robustness of the machine-learning methods to different levels of artificially introduced noise was assessed. The open-source Caffe deep-learning framework and modern NVidia GPUs were utilized to carry out this study, allowing a large number of DNN configurations to be explored.
Keywords: Cheminformatics; Data-mining; Deep learning; Machine-learning; Naïve Bayes; Random forest; SARs; Support vector machines; kNN
Year: 2017 PMID: 29086090 PMCID: PMC5489441 DOI: 10.1186/s13321-017-0226-y
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1a A feed-forward deep neural network with two hidden layers; each layer consists of multiple neurons, which are fully connected with the neurons of the previous and following layers. b Each artificial neuron receives one or more input signals x1, x2, …, xn and outputs a value y to the neurons of the next layer. The output y is a nonlinear weighted sum of the input signals; nonlinearity is achieved by passing the linear sum through non-linear functions known as activation functions. c Popular neuron activation functions: the rectified linear unit (ReLU) (red), Sigmoid (Sigm) (green) and Tanh (blue)
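As a minimal sketch of panel (b), the output of a single artificial neuron can be computed as an activation function applied to the weighted sum of its inputs. The weights, inputs and bias below are illustrative values, not parameters from the paper:

```python
import numpy as np

def relu(z):
    # Rectified linear unit: max(0, z)
    return np.maximum(0.0, z)

def sigmoid(z):
    # Sigmoid squashes z into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, b, activation=relu):
    # y = f(w . x + b): a nonlinear weighted sum of the input signals
    return activation(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # input signals x1, x2, x3
w = np.array([0.2, 0.4, 0.1])    # weights (illustrative)
b = 0.05                         # bias term
y_relu = neuron_output(x, w, b, relu)     # 0.0, since w.x + b is negative
y_tanh = neuron_output(x, w, b, np.tanh)
```

Stacking layers of such neurons, each fully connected to the previous layer's outputs, yields the feed-forward network of panel (a).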
Bioactivity datasets assembled from ChEMBL repository and utilized in this study
| Activity class | CHEMBL target id | Number of active inhibitors | Number of decoys |
|---|---|---|---|
| Carbonic anhydrase II | CHEMBL205 | 1631 | 16,310 |
| Cyclin-dependent kinase 2 | CHEMBL301 | 705 | 7050 |
| HERG | CHEMBL240 | 700 | 7000 |
| Dopamine D4 receptor | CHEMBL219 | 506 | 5060 |
| Coagulation factor X | CHEMBL244 | 1144 | 11,440 |
| Cannabinoid CB1 receptor | CHEMBL218 | 1911 | 19,013 |
| Cytochrome P450 19A1 | CHEMBL1978 | 621 | 6210 |
The ratio of decoys/active per activity class was set to 10:1
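The 10:1 decoy/active split above can be reproduced by drawing ten decoys per active compound from a larger decoy pool. A minimal sketch, with hypothetical placeholder identifiers rather than the actual ChEMBL records:

```python
import random

def sample_decoys(actives, decoy_pool, ratio=10, seed=42):
    # Draw ratio-times as many decoys as there are actives, mirroring
    # the 10:1 decoy/active split used for each activity class.
    rng = random.Random(seed)  # fixed seed for reproducibility
    return rng.sample(decoy_pool, ratio * len(actives))

actives = [f"active_{i}" for i in range(705)]        # e.g. CDK2: 705 actives
pool = [f"decoy_{i}" for i in range(100_000)]        # hypothetical decoy pool
decoys = sample_decoys(actives, pool, ratio=10)      # 7050 decoys, as in the CDK2 row
```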
Hyper-parameters values explored for Bernoulli Naïve Bayes, k-nearest neighbor, random forest, support vector machines and deep neural networks
| Hyper-parameters | Values explored | Parameter |
|---|---|---|
| Bernoulli Naïve Bayes | ||
| Alpha | 1, 0.5, 0.1 | Laplace/Lidstone smoothing parameter |
| Fit_prior | True, False | Whether to learn class prior probabilities; if False, a uniform prior is used |
| k-Nearest neighbor | ||
| Nn | 1, 3, 5, 7, 9, 11 | Number of nearest neighbors |
| Random forest | ||
| Ntrees | 10, 50, 100, 300, 700, 1000 | Number of trees |
| Criterion | Gini, entropy | Functions used to measure the quality of each split |
| Max_features | Sqrt(n_features), log2(n_features) | Number of features considered for each split |
| Support vector machines | ||
| Kernel | rbf | Radial basis function |
| C | 10³, 10², 10, 1 | Cost |
| γ | 10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹ | Gamma |
| Kernel | Linear | Linear kernel |
| C | 10³, 10², 10, 1, 10⁻¹, 10⁻², 10⁻³, 10⁻⁴ | Cost |
| Deep neural networks | ||
| η | 1, 10⁻¹, 10⁻², 10⁻³, 10⁻⁴ | Learning rate for the stochastic gradient descent (SGD) |
| Momentum (μ) | 0.9 | Weight of the previous update |
| Weight decay | 0.0005 | L2 regularization coefficient applied to the weights |
| Epochs | 300 | Number of training epochs |
| Batch size | 256 | Mini-batch training size |
| Hidden layers | 1, 2, 3, 4 | Number of hidden layers |
| Number neurons | 5, 10, 50, 100, 200, 500, 700, 1000, 1500, 2000, 2500, 3000, 3500 | Number of neurons per hidden layer |
| Activation function | ReLU, Sigmoid, Tanh | Neuron activation functions |
| Regularization | No, Dropout | Regularization techniques |
| Dropout | (0%, 20%, 50%) input layer, 50% hidden layers | % of neurons “dropped” using the Drop-out technique |
| Weight and bias initialization | Gaussian (SD: 0.01) | Distribution used to initialize weights and biases |
| Loss function | SoftmaxWithLoss | Loss minimized during training (softmax with multinomial logistic loss) |
| Output function | Softmax | Function used to calculate class probabilities for predictions |
| Number of classes | 2 | Binary classification |
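The DNN rows of the table define a hyper-parameter grid. A sketch of enumerating that grid with `itertools.product`; note the full Cartesian product below is my own count of the listed values, and the paper does not necessarily state that every combination was trained:

```python
from itertools import product

# Values taken from the DNN section of the table above
learning_rates = [1, 1e-1, 1e-2, 1e-3, 1e-4]
hidden_layers  = [1, 2, 3, 4]
neurons        = [5, 10, 50, 100, 200, 500, 700,
                  1000, 1500, 2000, 2500, 3000, 3500]
activations    = ["ReLU", "Sigmoid", "Tanh"]
regularization = ["none", "dropout"]

# Every combination of the listed values: 5 * 4 * 13 * 3 * 2 = 1560
grid = list(product(learning_rates, hidden_layers,
                    neurons, activations, regularization))
```

Each tuple in `grid` would then be handed to the training framework (Caffe, in this study) as one configuration to train and evaluate.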
Fig. 2 Comparison of the activation functions rectified linear units (ReLU), Tanh and Sigmoid (Sigm) on the performance of DNNs. DNNs with a single hidden layer and a variable number of neurons were trained and tested with the ReLU (red), Sigm (green) and Tanh (blue) activation functions over fivefold cross-validation. Performance was measured using MCC as the evaluation metric
Pairwise comparison of performance between deep neural networks with rectified linear units (ReLU) and those with Sigmoid (Sigm) and Tanh activation functions, based on the Wilcoxon paired signed-rank test at the 99% confidence level
| Activation pair | Mean of MCC diff. | SD of MCC diff. | p value |
|---|---|---|---|
| ReLU—Sigm | 0.018 | 0.022 | 3.922e−11 |
| ReLU—Tanh | 0.029 | 0.033 | 3.417e−14 |
DNNs with the ReLU activation function were found to statistically outperform those with the Sigm and Tanh functions
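The Wilcoxon paired signed-rank comparison above can be sketched with a textbook exact implementation for small samples. The MCC scores below are invented for illustration (they are not the paper's values), and the authors presumably used a statistics package rather than this hand-rolled version:

```python
from itertools import product

def wilcoxon_greater_p(diffs):
    """Exact one-sided Wilcoxon signed-rank p value, P(W+ >= observed).

    Assumes no zero and no tied |differences| (true for this toy data)."""
    n = len(diffs)
    # Rank the differences by absolute magnitude (rank 1 = smallest)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    # Observed statistic: sum of ranks of the positive differences
    w_obs = sum(r for d, r in zip(diffs, ranks) if d > 0)
    # Under the null, every sign pattern is equally likely: enumerate 2^n
    hits = sum(1 for signs in product((0, 1), repeat=n)
               if sum(r for s, r in zip(signs, range(1, n + 1)) if s) >= w_obs)
    return hits / 2 ** n

# Illustrative paired MCC scores (one pair per fold/dataset):
mcc_relu = [0.88, 0.91, 0.87, 0.93, 0.90, 0.89, 0.92]
mcc_sigm = [0.87, 0.89, 0.84, 0.89, 0.85, 0.83, 0.85]
p = wilcoxon_greater_p([a - b for a, b in zip(mcc_relu, mcc_sigm)])
# All seven differences favor ReLU, so p = 1/2**7 ≈ 0.0078 < 0.01
```

A p value below 0.01 rejects the null hypothesis at the 99% confidence level used in the table.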
Fig. 3 Effect of the hyper-parameters (i) number of hidden layers, (ii) number of neurons and (iii) dropout regularization on the performance of DNNs, measured by MCC as the evaluation metric. Configuration A shows results, averaged over the seven activity datasets, for a DNN with a single hidden layer of 10 neurons, ReLU activation and no regularization; B a DNN with two hidden layers of 500 neurons each, ReLU activation and no regularization; C two hidden layers with 3000 neurons each and dropout regularization (0% for the input and 50% for hidden layers); D two hidden layers with 3000 neurons each and dropout regularization (20% for the input and 50% for hidden layers); E two hidden layers with 3000 neurons each and dropout regularization (50% for both the input and hidden layers); F three hidden layers with 3000 neurons per layer and dropout regularization (50% for both the input and hidden layers); and G four hidden layers with 3500 neurons per layer and dropout regularization (50% for the input and hidden layers)
Performance achieved by DNN, NB, kNN, RF and SVM measured using MCC as evaluation metric
| Dataset | Algorithm | MCC 1st | MCC 2nd | MCC 3rd | Mean MCC | Std MCC |
|---|---|---|---|---|---|---|
| DRD4 | DNN | | | | | 0.018 |
| | SVM_rbf | 0.889 | 0.876 | 0.865 | 0.876 | 0.012 |
| | SVM_linear | 0.828 | 0.816 | 0.840 | 0.828 | 0.012 |
| | RF | 0.867 | 0.854 | 0.862 | 0.861 | 0.007 |
| | kNN | 0.762 | 0.778 | 0.763 | 0.767 | 0.009 |
| | NB | 0.742 | 0.761 | 0.750 | 0.751 | 0.009 |
| HERG | DNN | | | | | 0.028 |
| | SVM_rbf | 0.845 | 0.891 | 0.872 | 0.869 | 0.023 |
| | SVM_linear | 0.773 | 0.782 | 0.780 | 0.778 | 0.005 |
| | RF | 0.838 | 0.857 | 0.848 | 0.847 | 0.010 |
| | kNN | 0.818 | 0.813 | 0.825 | 0.819 | 0.006 |
| | NB | 0.612 | 0.620 | 0.602 | 0.611 | 0.009 |
| CDK2 | DNN | 0.919 | | | | 0.007 |
| | SVM_rbf | | 0.913 | 0.922 | 0.919 | 0.005 |
| | SVM_linear | 0.863 | 0.889 | 0.864 | 0.872 | 0.015 |
| | RF | 0.895 | 0.912 | 0.902 | 0.903 | 0.008 |
| | kNN | 0.895 | 0.904 | 0.910 | 0.903 | 0.007 |
| | NB | 0.769 | 0.773 | 0.780 | 0.774 | 0.006 |
| CoagX | DNN | | | | | 0.003 |
| | SVM_rbf | 0.978 | 0.979 | 0.980 | 0.979 | 0.001 |
| | SVM_linear | 0.971 | 0.970 | 0.977 | 0.973 | 0.004 |
| | RF | 0.973 | 0.979 | 0.982 | 0.978 | 0.004 |
| | kNN | 0.968 | 0.971 | 0.971 | 0.970 | 0.002 |
| | NB | 0.889 | 0.897 | 0.882 | 0.889 | 0.008 |
| CYP_19A1 | DNN | | | | | 0.014 |
| | SVM_rbf | 0.886 | 0.896 | 0.889 | 0.890 | 0.005 |
| | SVM_linear | 0.849 | 0.866 | 0.862 | 0.859 | 0.009 |
| | RF | 0.873 | 0.910 | 0.879 | 0.887 | 0.020 |
| | kNN | 0.805 | 0.811 | 0.821 | 0.812 | 0.008 |
| | NB | 0.755 | 0.821 | 0.775 | 0.784 | 0.034 |
| CB1 | DNN | | | | | 0.002 |
| | SVM_rbf | 0.941 | 0.937 | 0.931 | 0.936 | 0.005 |
| | SVM_linear | 0.885 | 0.893 | 0.881 | 0.886 | 0.007 |
| | RF | 0.908 | 0.923 | 0.914 | 0.915 | 0.008 |
| | kNN | 0.906 | 0.921 | 0.901 | 0.909 | 0.011 |
| | NB | 0.758 | 0.781 | 0.765 | 0.768 | 0.012 |
| CAII | DNN | | | 0.843 | | 0.021 |
| | SVM_rbf | 0.857 | 0.851 | | 0.858 | 0.007 |
| | SVM_linear | 0.828 | 0.826 | 0.830 | 0.828 | 0.002 |
| | RF | 0.836 | 0.857 | 0.861 | 0.851 | 0.013 |
| | kNN | 0.558 | 0.557 | 0.577 | 0.564 | 0.011 |
| | NB | 0.754 | 0.769 | 0.783 | 0.769 | 0.015 |
Results for each activity class and validation set from three experiments are shown. The best recorded results for each activity class are highlighted in italics
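The MCC values throughout these tables follow the standard Matthews correlation coefficient definition from confusion-matrix counts. A sketch, with illustrative counts for a 10:1 imbalanced validation split (not numbers from the paper):

```python
import math

def mcc(tp, tn, fp, fn):
    # Matthews correlation coefficient: ranges from -1 (total
    # disagreement) through 0 (random guessing) to +1 (perfect
    # prediction), and is well suited to imbalanced classes.
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0  # convention: 0 when undefined

# Hypothetical counts for a 10:1 decoy/active validation split:
score = mcc(tp=450, tn=4900, fp=100, fn=56)   # roughly 0.84
```

On such imbalanced data plain accuracy is misleading (predicting all decoys already scores ~91% accuracy), which is why MCC is used as the evaluation metric here.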
The Wilcoxon paired signed-rank test was employed to compare the performance of DNN against the other algorithms (NB, kNN, RF and SVM) across the datasets, at the 99% confidence level
| Algorithm pair | Mean MCC diff. | SD of MCC diff. | p value |
|---|---|---|---|
| DNN-NB | 0.149 | 0.061 | 4.768E−07 |
| DNN-kNN | 0.092 | 0.095 | 4.768E−07 |
| DNN-SVM (linear) | 0.052 | 0.031 | 3.2E−5 |
| DNN-RF | 0.021 | 0.016 | 8.6E−5 |
| DNN-SVM (rbf) | 0.009 | 0.012 | 5.075E−4 |
Reported are the mean and standard deviation of the observed MCC differences for each algorithm pair, together with the p value
Fig. 4 Boxplot of the differences between the performance achieved by the tuned DNN and that of the other algorithms, measured using MCC as the evaluation metric on the validation sets over the seven activity classes. Results are ranked by decreasing mean difference. The differences averaged 0.149 MCC units between DNN and NB, 0.092 between DNN and kNN, 0.052 between DNN and SVM with a linear kernel, 0.021 between DNN and RF, and 0.009 between DNN and SVM with the rbf kernel
Fig. 5 Robustness of the machine-learning methods to different levels of noise for 4 of the 7 activity classes. At low levels of noise, below 20%, the non-linear methods performed well, achieving performance above 0.7 MCC units on most of the tested datasets. At higher levels of noise, 30% or above, performance for most algorithms dropped below 0.7 MCC, and on several occasions even below 0.6 at 50% noise. Naïve Bayes was the least affected method, achieving performance above 0.6 MCC on several tested datasets even at the highest level of noise tested (50%), outperforming more complex methods
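One common way to introduce class-label noise at a fixed level, presumably similar in spirit to the robustness experiments above, is to flip a fraction of the binary labels before training. A minimal sketch with hypothetical labels (the exact noise-injection protocol is not specified in this excerpt):

```python
import random

def add_label_noise(labels, noise_level, seed=0):
    # Flip a fixed fraction of binary (0/1) labels to simulate
    # class-label noise at levels such as 10%, 20%, ..., 50%.
    rng = random.Random(seed)  # fixed seed for reproducibility
    noisy = list(labels)
    n_flip = int(noise_level * len(noisy))
    for i in rng.sample(range(len(noisy)), n_flip):
        noisy[i] = 1 - noisy[i]
    return noisy

labels = [1] * 500 + [0] * 5000        # hypothetical 10:1 active/decoy labels
noisy = add_label_noise(labels, noise_level=0.2)
# Exactly 1100 labels (20% of 5500) differ from the originals
```

Training each algorithm on `noisy` while evaluating on clean validation labels, at increasing `noise_level`, yields robustness curves like those in the figure.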