| Literature DB >> 31655545 |
Comprehensive ensemble in QSAR prediction for drug discovery
Sunyoung Kwon, Ho Bae, Jeonghee Jo, Sungroh Yoon
Abstract
BACKGROUND: Quantitative structure-activity relationship (QSAR) is a computational modeling method for revealing relationships between the structural properties of chemical compounds and their biological activities. QSAR modeling is essential for drug discovery, but it has many constraints. Ensemble-based machine learning approaches have been used to overcome these constraints and obtain reliable predictions. Ensemble learning builds a set of diversified models and combines them. However, the most prevalent approach, random forest, and other ensemble approaches in QSAR prediction limit their model diversity to a single subject.
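The uniform-averaging style of combining that such ensembles rely on can be sketched as follows. This is a minimal illustration with scikit-learn on synthetic data, not the paper's datasets, models, or hyperparameters:

```python
# Minimal sketch of ensemble averaging: several diverse classifiers are
# trained on the same data and their predicted probabilities are averaged
# to form the ensemble score.  Synthetic data stands in for a bioassay.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=40, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = [
    RandomForestClassifier(n_estimators=100, random_state=0),
    GradientBoostingClassifier(random_state=0),
    SVC(probability=True, random_state=0),
]
probs = [m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in models]
ensemble_prob = np.mean(probs, axis=0)  # uniform averaging of probabilities
print("ensemble AUC:", roc_auc_score(y_te, ensemble_prob))
```

Averaging calibrated probabilities (rather than hard labels) is what lets the combined score rank compounds more reliably than any single member.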
Keywords: Drug-prediction; Ensemble-learning; Meta-learning
Year: 2019 PMID: 31655545 PMCID: PMC6815455 DOI: 10.1186/s12859-019-3135-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Details of the bioassay datasets used in the experiments
| Assay ID | Description of BioAssay | # Active | # Inactive |
|---|---|---|---|
| 1851_1a2 | Cytochrome P450 Panel Assay, cyp1a2 | 5,902 | 6,974 |
| 1851_2c19 | Cytochrome P450 Panel Assay, cyp2c19 | 5,840 | 7,135 |
| 1851_2c9 | Cytochrome P450 Panel Assay, cyp2c9 | 4,065 | 8,361 |
| 1851_2d6 | Cytochrome P450 Panel Assay, cyp2d6 | 2,601 | 10,826 |
| 1851_3a4 | Cytochrome P450 Panel Assay, cyp3a4 | 5,175 | 7,446 |
| 1915 | Streptokinase Expression Inhibition | 2,219 | 1,017 |
| 2358 | Inhibitors of Protein Phosphatase 1 (PP1) | 1,006 | 934 |
| 463213 | Inhibitors of tim10-1 yeast | 4,138 | 3,234 |
| 463215 | Inhibitors of tim10 yeast | 2,941 | 1,695 |
| 488912 | Inhibitors of Sentrin-specific protease 8 | 2,491 | 3,705 |
| 488915 | Inhibitors of Sentrin-specific protease 6 | 3,568 | 2,628 |
| 488917 | Inhibitors of Sentrin-specific protease 7 | 4,283 | 1,913 |
| 488918 | Inhibitors of Sentrin-specific proteases | 3,691 | 2,505 |
| 492992 | Inhibitors of KCNK9 ∗ | 2,094 | 2,820 |
| 504607 | Inhibitors of Mdm2/MdmX interaction | 4,825 | 1,406 |
| 624504 | Inhibitors of the mtPTP | 3,944 | 1,090 |
| 651739 | Inhibition of T.cruzi proliferation | 4,043 | 1,322 |
| 651744 | NIH/3T3 (mouse embryonic fibroblast) toxicity | 3,099 | 2,303 |
| 652065 | Molecules that bind r(CAG) RNA repeats | 2,965 | 1,286 |
The 19 bioassays are those specified in [10]
∗Two-pore domain potassium channel
mtPTP: mitochondrial permeability transition pore
Performance comparison between the proposed comprehensive ensemble and the individual models on 19 bioassay datasets
| BioAssay | PubChem RF | PubChem SVM | PubChem GBM | PubChem NN | ECFP RF | ECFP SVM | ECFP GBM | ECFP NN | MACCS RF | MACCS SVM | MACCS GBM | MACCS NN | SMILES NN | Comprehensive ensemble |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1851_1a2 | | 0.896 | 0.900 | | 0.919 | 0.906 | 0.882 | 0.920 | 0.912 | 0.879 | 0.894 | 0.912 | | |
| 1851_2c19 | 0.871 | 0.852 | 0.848 | 0.872 | 0.882 | 0.871 | 0.854 | | 0.874 | 0.842 | 0.850 | | 0.875 | |
| 1851_2c9 | 0.871 | 0.857 | 0.851 | 0.873 | | 0.866 | 0.843 | | 0.858 | 0.828 | 0.840 | 0.870 | 0.877 | |
| 1851_2d6 | 0.858 | 0.847 | 0.832 | | | 0.850 | 0.833 | 0.856 | 0.854 | 0.816 | 0.830 | 0.852 | 0.846 | |
| 1851_3a4 | 0.877 | 0.868 | 0.865 | 0.887 | | 0.887 | 0.855 | | 0.867 | 0.832 | 0.851 | 0.875 | | |
| 1915 | 0.754 | 0.692 | 0.709 | 0.722 | 0.731 | 0.700 | 0.700 | 0.712 | | 0.716 | 0.736 | | 0.701 | |
| 2358 | | 0.705 | 0.736 | 0.770 | 0.780 | 0.767 | 0.722 | 0.761 | 0.774 | 0.731 | 0.763 | | 0.697 | |
| 463213 | 0.673 | 0.639 | 0.652 | 0.651 | | 0.652 | 0.644 | 0.661 | | 0.642 | 0.655 | 0.651 | 0.636 | |
| 463215 | 0.620 | 0.576 | 0.592 | 0.604 | 0.617 | 0.585 | 0.598 | 0.595 | | 0.600 | | 0.625 | 0.587 | |
| 488912 | 0.679 | 0.643 | 0.634 | 0.668 | | 0.654 | 0.668 | 0.675 | 0.667 | 0.634 | 0.650 | | 0.644 | |
| 488915 | 0.718 | 0.686 | 0.679 | 0.713 | | 0.693 | 0.680 | | 0.692 | 0.659 | 0.680 | 0.693 | 0.679 | |
| 488917 | | 0.777 | 0.759 | 0.805 | | 0.788 | 0.760 | 0.799 | 0.788 | 0.726 | 0.752 | 0.786 | 0.780 | |
| 488918 | 0.762 | 0.745 | 0.735 | | | 0.766 | 0.729 | 0.767 | 0.737 | 0.690 | 0.708 | 0.742 | 0.746 | |
| 492992 | | 0.784 | 0.783 | 0.800 | | 0.807 | 0.802 | 0.822 | 0.825 | 0.726 | 0.759 | 0.790 | 0.802 | |
| 504607 | | 0.678 | | 0.686 | 0.690 | 0.668 | 0.673 | 0.656 | 0.676 | 0.640 | 0.662 | 0.655 | 0.649 | |
| 624504 | | 0.850 | 0.857 | 0.867 | | 0.858 | 0.858 | 0.861 | 0.872 | 0.832 | 0.862 | 0.876 | 0.868 | |
| 651739 | 0.791 | 0.770 | 0.773 | 0.781 | | 0.782 | 0.771 | 0.788 | 0.779 | 0.729 | 0.759 | 0.754 | | |
| 651744 | 0.884 | 0.862 | 0.872 | 0.885 | | 0.883 | 0.875 | 0.896 | 0.869 | 0.829 | 0.843 | 0.853 | | |
| 652065 | | 0.752 | 0.782 | 0.780 | | 0.775 | 0.758 | 0.774 | 0.776 | 0.736 | 0.759 | 0.772 | 0.763 | |
| average | | 0.762 | 0.766 | 0.786 | | 0.777 | 0.763 | 0.784 | 0.783 | 0.741 | 0.762 | 0.778 | 0.771 | |
Each value is the AUC averaged over twenty repeated experiments on the test set (bold: top 3 AUC on each dataset); the last row shows the average over the 19 per-dataset AUC values
The AUC scores of the ensemble classifier and the best single classifier for 19 PubChem assays
| Assay ID | The Best Single Classifier (AUC) | The Ensemble Classifier (AUC) |
|---|---|---|
| 1851_1a2 | 0.922 | 0.934 |
| 1851_2c19 | 0.885 | 0.900 |
| 1851_2c9 | 0.88 | 0.898 |
| 1851_2d6 | 0.867 | 0.884 |
| 1851_3a4 | 0.895 | 0.914 |
| 1915 | 0.758 | 0.755 |
| 2358 | 0.787 | 0.803 |
| 463213 | 0.685 | 0.689 |
| 463215 | 0.630 | 0.627 |
| 488912 | 0.693 | 0.698 |
| 488915 | 0.731 | 0.735 |
| 488917 | 0.814 | 0.834 |
| 488918 | 0.778 | 0.799 |
| 492992 | 0.849 | 0.845 |
| 504607 | 0.694 | 0.721 |
| 624504 | 0.884 | 0.897 |
| 651739 | 0.802 | 0.804 |
| 651744 | 0.899 | 0.901 |
| 652065 | 0.800 | 0.826 |
Performance comparison with other ensemble approaches
| BioAssay | PubChem | ECFP | MACCS | RF | SVM | GBM | NN | NN (+SMILES) ∗ | Average | Meta-learning |
|---|---|---|---|---|---|---|---|---|---|---|
| 1851_1a2 | 0.921 | 0.922 | 0.910 | 0.931 | 0.920 | 0.907 | | | 0.934 | |
| 1851_2c19 | 0.875 | 0.889 | 0.879 | 0.893 | 0.887 | 0.869 | | | 0.900 | |
| 1851_2c9 | 0.878 | 0.885 | 0.866 | 0.888 | 0.882 | 0.865 | | | 0.898 | |
| 1851_2d6 | 0.870 | 0.869 | 0.853 | 0.880 | 0.869 | 0.852 | | | | |
| 1851_3a4 | 0.890 | 0.902 | 0.874 | 0.898 | 0.901 | 0.881 | 0.913 | | | |
| 1915 | 0.729 | 0.721 | 0.750 | | 0.728 | 0.739 | 0.747 | 0.750 | | |
| 2358 | 0.758 | 0.781 | 0.780 | | 0.780 | 0.772 | | 0.803 | 0.803 | |
| 463213 | 0.669 | 0.672 | 0.669 | | 0.671 | 0.666 | 0.682 | 0.684 | | |
| 463215 | 0.604 | 0.603 | | | 0.604 | 0.623 | 0.623 | 0.624 | 0.627 | |
| 488912 | 0.674 | 0.682 | 0.676 | | 0.668 | 0.667 | 0.695 | | | |
| 488915 | 0.720 | 0.719 | 0.699 | 0.731 | 0.711 | 0.700 | 0.732 | | | |
| 488917 | 0.811 | 0.815 | 0.785 | 0.824 | 0.808 | 0.782 | 0.832 | | | |
| 488918 | 0.777 | 0.783 | 0.743 | 0.780 | 0.782 | 0.752 | 0.793 | | | |
| 492992 | 0.820 | 0.829 | 0.795 | | 0.818 | 0.812 | 0.836 | 0.845 | | |
| 504607 | | 0.687 | 0.682 | 0.708 | 0.701 | 0.703 | 0.698 | 0.706 | | |
| 624504 | 0.879 | 0.875 | 0.867 | 0.896 | 0.880 | 0.878 | 0.892 | | | |
| 651739 | 0.795 | | 0.774 | 0.800 | 0.776 | 0.783 | 0.803 | | 0.804 | |
| 651744 | 0.892 | | 0.868 | 0.890 | 0.882 | 0.879 | 0.899 | | 0.901 | |
| 652065 | 0.795 | 0.791 | 0.784 | 0.807 | 0.804 | 0.803 | 0.813 | | | |
| average | 0.793 | 0.796 | 0.784 | 0.809 | 0.793 | 0.786 | 0.810 | | | |
All AUC values except those in the last two columns come from limited-subject ensembles; the last two columns come from the comprehensive ensemble. The first three columns (PubChem, ECFP, MACCS) are method ensembles that vary the learning method while fixing the molecular fingerprint. The next five columns (RF, SVM, GBM, NN, NN (+SMILES)) are representation ensembles that vary the chemical compound representation while fixing the learning method. Except for the final meta-learning approach, models are combined by uniform averaging. Each value is the AUC averaged over five repeated experiments (bold: top 3)
∗NN(+SMILES) is a representation ensemble that combines NN models trained on a diversified set of input representations: the three fingerprints (PubChem, ECFP, MACCS) and SMILES
Fig. 1 Ensemble effects on class-imbalanced datasets. a Improved average AUC value produced by neural network bagging (NN-bagging) and the neural-network-based representation ensemble (NN-representation ensemble) over three fingerprints. b Pearson's correlation (r = 0.69, p-value = 1.1 × 10⁻³) between the AUC improvements from NN-bagging and the class imbalance ratio. The class imbalance ratio was calculated from the numbers of active and inactive chemicals shown in Table 1
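The Fig. 1b analysis can be reproduced in outline. The active/inactive counts below are from Table 1, but the AUC-gain numbers are placeholders and the majority/minority ratio is one plausible definition of class imbalance, not necessarily the paper's exact formula:

```python
# Sketch of the Fig. 1b analysis: correlate per-dataset AUC improvement
# with class imbalance, as measured from active/inactive counts.
from scipy.stats import pearsonr

counts = {               # assay: (n_active, n_inactive), taken from Table 1
    "1851_1a2": (5902, 6974),
    "1851_2d6": (2601, 10826),
    "1915":     (2219, 1017),
    "504607":   (4825, 1406),
    "624504":   (3944, 1090),
}
# Imbalance ratio: majority class size over minority class size (assumed).
ratios = [max(a, i) / min(a, i) for a, i in counts.values()]

# Placeholder AUC improvements -- NOT the paper's measured values.
auc_gain = [0.004, 0.012, 0.010, 0.020, 0.015]
r, p = pearsonr(ratios, auc_gain)
print(f"Pearson r = {r:.2f}, p = {p:.3g}")
```

With the paper's real per-assay gains substituted for the placeholders, this procedure yields the reported r = 0.69.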
Performance comparison between multi-task [10] and meta-learning neural networks
| Assay ID | Multi-task | Proposed (Meta-learning) |
|---|---|---|
| 1851_1a2 | 0.938 | |
| 1851_2c19 | 0.903 | |
| 1851_2c9 | 0.907 | |
| 1851_2d6 | 0.861 | |
| 1851_3a4 | 0.897 | |
| 1915 | 0.750 | |
| 2358 | 0.751 | |
| 463213 | 0.676 | |
| 463215 | 0.654 | 0.634 |
| 488912 | 0.816 | 0.700 |
| 488915 | 0.873 | 0.739 |
| 488917 | 0.894 | 0.841 |
| 488918 | 0.842 | 0.801 |
| 492992 | 0.829 | |
| 504607 | 0.670 | |
| 624504 | 0.889 | |
| 651739 | 0.825 | 0.809 |
| 651744 | 0.900 | |
| 652065 | 0.792 | |
The mean AUC values for both neural networks are shown (bold: top AUC on each dataset)
Performance comparison with Multi-task neural networks [10] on HIV datasets [29]
| Method | AUC | Accuracy | MCC | F1-score |
|---|---|---|---|---|
| Multi-task [10] | 0.714 ±0.007 | 0.947 ±0.009 | 0.260 ±0.020 | 0.972 ±0.005 |
| Meta-learning | 0.714 ±0.007 | 0.964 ±0.001 | 0.269 ±0.026 | 0.982 ±0.001 |
The table shows average test-set values of each measure for the multi-task and meta-learning neural networks
Fig. 2 Interpretation of model importance through meta-learning. The weights learned through meta-learning were used to interpret model importance. Darker green indicates a highly weighted, significant model, while lighter yellow indicates a less weighted, less significant model
Fig. 3 Learning procedure of the proposed comprehensive ensemble. The i-th individual learning algorithm outputs its prediction probability Pi for the training dataset through 5-fold cross-validation. The n diverse learning algorithms produce n prediction probabilities (P1, P2, ⋯, Pn), which are concatenated and then used as input to the second-level learning algorithm, which makes the final decision. a First-level learning. b Second-level learning
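This two-level procedure is stacked generalization; a minimal scikit-learn sketch on synthetic data follows. The choice of first-level models and of logistic regression as the second-level learner is an assumption for illustration; the learned coefficients then play the role of the model weights visualized in Fig. 2:

```python
# Sketch of Fig. 3: out-of-fold prediction probabilities from each
# first-level model are concatenated into a feature matrix for a
# second-level (meta) learner.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, n_features=30, random_state=0)

first_level = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    GradientBoostingClassifier(random_state=0),
]

# First level: 5-fold out-of-fold probabilities P_1 ... P_n per model
P = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in first_level
])

# Second level: the meta-learner combines the concatenated probabilities
meta = LogisticRegression().fit(P, y)
print("meta-learner weights:", meta.coef_.ravel())
```

Using out-of-fold (rather than in-fold) probabilities is the key design choice: it keeps the second-level inputs unbiased by first-level overfitting.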
Fig. 4 Proposed CNN + RNN model. The input SMILES strings are converted with one-hot encoding and truncated to a maximum length of 100. The preprocessed input is subsequently fed to the CNN layer without pooling, and the outputs are directly fed into the GRU layer
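A minimal PyTorch sketch of this architecture follows. The vocabulary size, filter count, kernel size, and hidden dimension are illustrative assumptions; only the one-hot input, the fixed length of 100, the pooling-free convolution, and the GRU stage come from the caption:

```python
# Sketch of the Fig. 4 model: one-hot SMILES (padded/truncated to length
# 100) -> 1-D convolution without pooling -> GRU -> binary activity output.
import torch
import torch.nn as nn

VOCAB, MAXLEN = 40, 100  # assumed SMILES character-set size; fixed length

class SmilesCnnRnn(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(VOCAB, 64, kernel_size=4, padding="same")
        self.gru = nn.GRU(64, 64, batch_first=True)
        self.out = nn.Linear(64, 1)

    def forward(self, x):                 # x: (batch, MAXLEN, VOCAB) one-hot
        h = torch.relu(self.conv(x.transpose(1, 2)))  # (batch, 64, MAXLEN)
        h = h.transpose(1, 2)             # sequence-major for the GRU
        _, h_n = self.gru(h)              # final GRU hidden state
        return torch.sigmoid(self.out(h_n[-1]))       # activity probability

x = torch.zeros(2, MAXLEN, VOCAB)         # dummy batch of 2 one-hot SMILES
print(SmilesCnnRnn()(x).shape)            # torch.Size([2, 1])
```

Skipping pooling preserves the full 100-step sequence, so the GRU sees every character position produced by the convolution.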