| Literature DB >> 33430997 |
Chia-Hsiu Chen1, Kenichi Tanaka1, Masaaki Kotera1, Kimito Funatsu2.
Abstract
Ensemble learning improves machine learning results by combining several models, and typically yields better predictive performance than any single model. It also benefits and accelerates research in quantitative structure-activity relationship (QSAR) and quantitative structure-property relationship (QSPR). With the growing use of ensemble learning models such as random forest, however, the effectiveness of QSAR/QSPR is limited if the machine cannot explain its predictions to researchers. In fact, many implementations of ensemble learning models can quantify the overall contribution of each feature; feature importance, for example, allows us to assess the relative importance of features and to interpret the predictions. However, different ensemble learning methods or implementations may lead to different feature selections for interpretation. In this paper, we compared the predictability and interpretability of four typical, well-established ensemble learning models (random forest, extremely randomized trees, adaptive boosting and gradient boosting) on regression and binary classification modeling tasks. Blending methods were then built by combining the four ensemble learning methods. Blending led to better performance and to a unified interpretation obtained by summarizing the individual predictions of the different learning models. The important features of the two case studies, which provide valuable information about compound properties, are discussed in detail in this report. QSPR modeling with interpretable machine learning techniques can move chemical design forward, helping researchers work more efficiently, confirm hypotheses and establish knowledge for better results.
Keywords: Blending; Decision tree; Ensemble learning; Extremely randomized trees; Fluorescence; Liquid crystal; QSPR; Quantitative structure–property; Random forest
Year: 2020 PMID: 33430997 PMCID: PMC7106596 DOI: 10.1186/s13321-020-0417-9
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 Concept of blending
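The blending concept in Fig. 1 (averaging or weighting the predictions of several fitted base models) can be sketched roughly as follows. This is a minimal, self-contained illustration assuming scikit-learn and synthetic regression data in place of the paper's descriptors and λem values; for brevity the linear blender is fitted on in-sample base-model predictions, whereas out-of-fold predictions would be the more careful choice.

```python
# Minimal sketch of uniform and linear blending over four DT-based ensembles.
# Hypothetical data and hyperparameters; not the settings used in the paper.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (AdaBoostRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, RandomForestRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=50, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

base_models = {
    "RF": RandomForestRegressor(n_estimators=300, random_state=0),
    "ExtraTrees": ExtraTreesRegressor(n_estimators=300, random_state=0),
    "AdaBoost": AdaBoostRegressor(n_estimators=300, random_state=0),
    "GBM": GradientBoostingRegressor(n_estimators=300, random_state=0),
}
for model in base_models.values():
    model.fit(X_tr, y_tr)

# Uniform blending: simple average of the four individual predictions.
pred_uniform = np.mean([m.predict(X_te) for m in base_models.values()], axis=0)

# Linear blending: learn weights for the base-model predictions.
Z_tr = np.column_stack([m.predict(X_tr) for m in base_models.values()])
Z_te = np.column_stack([m.predict(X_te) for m in base_models.values()])
pred_linear = LinearRegression().fit(Z_tr, y_tr).predict(Z_te)

for name, pred in [("Uniform blending", pred_uniform), ("Linear blending", pred_linear)]:
    rmse = mean_squared_error(y_te, pred) ** 0.5
    print(f"{name}: R2 = {r2_score(y_te, pred):.3f}, RMSE = {rmse:.2f}")
```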
Fig. 2 Structure template of rod-like LC
Confusion table

| Actual class | Predicted: LC | Predicted: NLC |
|---|---|---|
| LC | | |
| NLC | | |
The results of four different DT-based ensemble learning methods
| Method | R² (training) | RMSE (nm, training) | R² (test) | RMSE (nm, test) |
|---|---|---|---|---|
| RF | 0.966 | 22.25 | 0.904 | 34.42 |
| ExtraTrees | 0.991 | 11.15 | 0.908 | 33.71 |
| AdaBoost | 0.981 | 16.22 | 0.904 | 34.45 |
| GBM | 0.988 | 12.92 | 0.905 | 34.26 |
Top 10 important descriptors selected by four DT-based ensemble learning models
| RF: selected descriptors | RF: feature importance | ExtraTrees: selected descriptors | ExtraTrees: feature importance |
|---|---|---|---|
| Gap | 0.3412 | Gap | 0.0712 |
| AP(xx) | 0.0986 | F01[C-N] | 0.0350 |
| Chi1_EA(dm) | 0.0344 | AP(xx) | 0.0243 |
| Chi0_EA(dm) | 0.0274 | SpMax2_Bh(i) | 0.0221 |
| EP(xx) | 0.0239 | F02[C-N] | 0.0192 |
| P_VSA_ppp_L | 0.0215 | SpMax7_Bh(m) | 0.0179 |
| SpDiam_AEA(ed) | 0.0160 | F01[C-C] | 0.0174 |
| SpMax_AEA(ed) | 0.0134 | C-004 | 0.0157 |
| SpMin5_Bh(m) | 0.0119 | P_VSA_e_2 | 0.0152 |
| CATS2D_06_LL | 0.0093 | EP(xx) | 0.0123 |
The results of three different blending methods
| Method | R² (training) | RMSE (nm, training) | R² (test) | RMSE (nm, test) |
|---|---|---|---|---|
| Uniform blending | 0.988 | 13.26 | 0.921 | 31.35 |
| Linear blending | 0.992 | 10.25 | 0.922 | 31.05 |
| Any blending | 0.996 | 7.84 | 0.931 | 29.11 |
Fig. 3 Experimental values versus calculated values of λem by any blending
Fig. 4 Performance comparison of (a) RF, ExtraTrees and any blending; (b) AdaBoost, GBM and any blending
Fig. 5 Chemical structures in area A and area B, for which none of the prediction models could provide accurate predictions
Fig. 6 Chemical structures in area C and area D, for which specific models could not provide accurate predictions
Top 10 important descriptors selected by three blending methods
| Uniform blending: descriptors | Uniform blending: importance | Linear blending: descriptors | Linear blending: importance | Any blending: descriptors | Any blending: importance |
|---|---|---|---|---|---|
| Gap | 0.1654 | Gap | 0.1330 | | |
| AP(xx) | 0.0470 | AP(xx) | 0.0359 | | |
| F01[C-N] | 0.0216 | F01[C-N] | 0.0247 | | |
| SpMax2_Bh(i) | 0.0176 | SpMax2_Bh(i) | 0.0174 | F01[C-N] | 0.0198 |
| Solvent | 0.0169 | Solvent | 0.0141 | SpMax2_Bh(i) | 0.0170 |
| P_VSA_MR_7 | 0.0162 | F02[C-N] | 0.0133 | P_VSA_ppp_L | 0.0147 |
| P_VSA_ppp_L | 0.0151 | EP(xx) | 0.0122 | EP(xx) | 0.0115 |
| EP(xx) | 0.0133 | F01[C-C] | 0.0111 | F02[C-N] | 0.0108 |
| Chi1_EA(dm) | 0.0119 | SpMin5_Bh(m) | 0.0106 | Chi1_EA(dm) | 0.0105 |
| F02[C-N] | 0.0109 | P_VSA_ppp_L | 0.0103 | Chi0_EA(dm) | 0.0092 |
Italic values indicate important features that strongly affect the fluorescence wavelengths
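As a rough illustration of how the tables above arrive at a single descriptor ranking from several models, the sketch below averages the feature_importances_ attributes of the four tree-based regressors, which corresponds to the uniform-blending interpretation; the dataset and descriptor names are synthetic placeholders, not the descriptors used in the paper.

```python
# Sketch: unified descriptor ranking by averaging feature importances
# across four fitted tree-based models (synthetic data, hypothetical names).
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import (AdaBoostRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, RandomForestRegressor)

X, y = make_regression(n_samples=300, n_features=20, noise=5.0, random_state=0)
descriptor_names = [f"descriptor_{i}" for i in range(X.shape[1])]

models = [RandomForestRegressor(random_state=0), ExtraTreesRegressor(random_state=0),
          AdaBoostRegressor(random_state=0), GradientBoostingRegressor(random_state=0)]
per_model = np.vstack([m.fit(X, y).feature_importances_ for m in models])

# Average importance across models, then rank descriptors (top 10 shown).
blended = pd.Series(per_model.mean(axis=0), index=descriptor_names)
print(blended.sort_values(ascending=False).head(10))
```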
Performance metrics values and corresponding confusion tables for four different classifiers
| Classifier | Training Acc (%) | Training F1 (%) | Test Acc (%) | Test Pr (%) | Test Re (%) | Test F1 (%) | Test MCC (%) | Actual class | Predicted LC | Predicted NLC |
|---|---|---|---|---|---|---|---|---|---|---|
| RF | 99.3 | 99.5 | 88.5 | 91.7 | 93.5 | 92.5 | 67.3 | LC | 673 | 47 |
| | | | | | | | | NLC | 61 | 166 |
| ExtraTrees | 99.3 | 99.5 | 87.5 | 91.2 | 92.5 | 91.9 | 65.2 | LC | 666 | 54 |
| | | | | | | | | NLC | 64 | 163 |
| AdaBoost | 99.3 | 99.5 | 88.1 | 91.2 | 93.4 | 92.3 | 65.1 | LC | 673 | 47 |
| | | | | | | | | NLC | 65 | 162 |
| GBM | 95.3 | 96.8 | 87.4 | 91.0 | 92.6 | 91.8 | 64.3 | LC | 667 | 53 |
| | | | | | | | | NLC | 66 | 161 |
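For reference, the reported classification metrics follow from the test-set confusion tables in the standard way; the sketch below plugs in the RF counts from the table above (small differences from the tabulated percentages may simply reflect rounding or the exact counts used in the original work).

```python
# Sketch: accuracy, precision, recall, F1 and MCC from a binary confusion table.
# Counts taken from the RF test-set rows above (LC treated as the positive class).
import math

tp, fn = 673, 47   # actual LC predicted as LC / NLC
fp, tn = 61, 166   # actual NLC predicted as LC / NLC

acc = (tp + tn) / (tp + tn + fp + fn)
pr = tp / (tp + fp)                      # precision
re = tp / (tp + fn)                      # recall
f1 = 2 * pr * re / (pr + re)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f"Acc = {acc:.1%}, Pr = {pr:.1%}, Re = {re:.1%}, F1 = {f1:.1%}, MCC = {mcc:.1%}")
```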
Fig. 7 Prediction comparison of three LC compounds (LC-A, LC-B and LC-C) and three NLC compounds (NLC-A, NLC-B and NLC-C)
Top five important descriptors of LC selected by four DT-based ensemble learning models
| RF: selected descriptors | RF: feature importance | ExtraTrees: selected descriptors | ExtraTrees: feature importance |
|---|---|---|---|
| HeavyAtomCount | 0.04649 | NumRotatableBonds | 0.03541 |
| NumRotatableBonds | 0.04381 | HeavyAtomCount | 0.03495 |
| wing2_HeavyAtomCount | 0.04329 | wing1_NumRotatableBonds | 0.02801 |
| fr_unbrch_alkane | 0.03315 | wing2_HeavyAtomCount | 0.02700 |
| wing1_NumRotatableBonds | 0.03218 | wing1_HeavyAtomCount | 0.02653 |
Performance metrics values and corresponding confusion tables for three different blending methods
| Method | Training Acc (%) | Training F1 (%) | Test Acc (%) | Test Pr (%) | Test Re (%) | Test F1 (%) | Test MCC (%) | Actual class | Predicted LC | Predicted NLC |
|---|---|---|---|---|---|---|---|---|---|---|
| Uniform blending | 99.5 | 99.7 | 88.3 | 91.3 | 93.6 | 92.5 | 67.3 | LC | 674 | 46 |
| | | | | | | | | NLC | 64 | 163 |
| Linear blending | 99.5 | 99.7 | 88.4 | 91.6 | 93.5 | 92.5 | 67.8 | LC | 673 | 47 |
| | | | | | | | | NLC | 62 | 165 |
| Any blending | 99.3 | 99.5 | 88.8 | 91.7 | 93.8 | 92.7 | 68.6 | LC | 675 | 45 |
| | | | | | | | | NLC | 61 | 166 |
Top five important descriptors selected by three blending methods
| Uniform blending: descriptors | Uniform blending: importance | Linear blending: descriptors | Linear blending: importance |
|---|---|---|---|
| HeavyAtomCount | 0.05993 | HeavyAtomCount | 0.04781 |
| NumRotatableBonds | 0.04803 | NumRotatableBonds | 0.04219 |
| wing2_HeavyAtomCount | 0.03827 | wing2_HeavyAtomCount | 0.03471 |
| mesogen_HeavyAtomCount | 0.03484 | wing1_HeavyAtomCount | 0.03006 |
| wing1_HeavyAtomCount | 0.03290 | fr_unbrch_alkane | 0.02897 |
Fig. 8 Bar chart of top 10 important descriptors selected by any blending