Ivan Olier, Noureddin Sadawi, G Richard Bickerton, Joaquin Vanschoren, Crina Grosan, Larisa Soldatova, Ross D King.
Abstract
We investigate the learning of quantitative structure-activity relationships (QSARs) as a case study of meta-learning. This application area is of the highest societal importance, as it is a key step in the development of new medicines. The standard QSAR learning problem is: given a target (usually a protein) and a set of chemical compounds (small molecules) with associated bioactivities (e.g. inhibition of the target), learn a predictive mapping from molecular representation to activity. Although almost every type of machine learning method has been applied to QSAR learning, there is no agreed single best way of learning QSARs, and therefore the problem area is well suited to meta-learning. We first carried out the most comprehensive comparison to date of machine learning methods for QSAR learning: 18 regression methods and 3 molecular representations applied to more than 2700 QSAR problems. (These results have been made publicly available on OpenML and represent a valuable resource for testing novel meta-learning methods.) We then investigated the utility of algorithm selection for QSAR problems. We found that this meta-learning approach outperformed the best individual QSAR learning method (random forests using a molecular fingerprint representation) by up to 13% on average. We conclude that meta-learning outperforms base-learning methods for QSAR learning, and, as this investigation is one of the most extensive comparisons of base- and meta-learning methods ever made, it provides evidence for the general effectiveness of meta-learning over base-learning.
Keywords: Algorithm selection; Drug discovery; Meta-learning; QSAR
Year: 2017 PMID: 31997851 PMCID: PMC6956898 DOI: 10.1007/s10994-017-5685-x
Source DB: PubMed Journal: Mach Learn ISSN: 0885-6125 Impact factor: 2.940
Fig. 1 Rice’s framework for algorithm selection.
Adapted from Rice (1976) and Smith-Miles (2008)
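Rice’s framework casts algorithm selection as learning a mapping from a problem’s meta-features to the algorithm expected to perform best on it. As a minimal illustrative sketch (the meta-feature values, algorithm names in the history, and the nearest-neighbour selection rule here are ours, not from the paper):

```python
# Minimal sketch of Rice-style algorithm selection: pick the algorithm that
# performed best on the most similar previously seen problem, where similarity
# is squared Euclidean distance in meta-feature space.

def select_algorithm(meta_features, history):
    """history: list of (meta_feature_vector, best_algorithm_name) pairs."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, best = min(history, key=lambda h: dist(h[0], meta_features))
    return best

# Hypothetical history of solved problems and their best base-learners.
history = [([0.2, 5.0], "rforest"), ([0.9, 1.5], "ksvmfp")]
print(select_algorithm([0.8, 1.4], history))  # nearest stored problem -> "ksvmfp"
```

In the paper this selection mapping is learned with more capable meta-learners (k-NN over meta-features and a multivariate random forest), but the structure of the problem is the same.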
List of baseline QSAR algorithms
| Short name | Name | Parameter settings |
|---|---|---|
| ctree | Conditional trees | min_split |
| rtree | Regression trees | min_split |
| cforest | Random forest (with conditional trees) | n_trees |
| rforest | Random forest | n_trees |
| gbm | Generalized boosted regression | n_trees |
| fnn | k-Nearest neighbor | k |
| earth | Adaptive regression splines (earth) | (As default) |
| glmnet | Regularized GLM | (As default) |
| ridge | Penalized ridge regression | (As default) |
| lm | Multiple linear regression | (As default) |
| pcr | Principal component regression | (As default) |
| plsr | Partial least squares | (As default) |
| rsm | Response surface regression | (As default) |
| rvm | Relevance vector machine | Kernel |
| ksvm | Support vector machines | Kernel |
| ksvmfp | Support vector machines with Tanimoto kernel | Kernel |
| nnet | Neural networks | size |
| nneth2o | Neural networks using H2O library | layers |
n_trees number of trees; min_split minimum node size allowed for splitting; min_bucket minimum size of the bucket; k number of neighbours; depth search depth; CV cross-validation; min_obs_node minimum number of observations per node; RBF radial basis function with nu (spread) and epsilon (scale) parameters; size number of neurons in the hidden layer; n_inputs length of the input vector
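The ksvmfp entry above uses a Tanimoto kernel, which compares two binary fingerprints by the ratio of shared set bits to total set bits. A minimal sketch, assuming fingerprints are 0/1 vectors (e.g. the 1024-bit FCFP4 representation); the convention of returning 1 for two empty fingerprints is our choice:

```python
# Tanimoto (Jaccard) kernel on binary fingerprints:
# k(a, b) = |a AND b| / |a OR b|.

def tanimoto_kernel(a, b):
    both = sum(1 for x, y in zip(a, b) if x and y)      # bits set in both
    either = sum(1 for x, y in zip(a, b) if x or y)     # bits set in either
    return both / either if either else 1.0             # convention for empty fps

fp1 = [1, 1, 0, 1, 0, 0]
fp2 = [1, 0, 0, 1, 1, 0]
print(tanimoto_kernel(fp1, fp2))  # 2 shared bits / 4 set bits = 0.5
```

Because the kernel is a valid positive semi-definite similarity on bit vectors, it can be plugged into a kernel SVM in place of an RBF kernel.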
Names of the generated dataset representations
| | Basic set of descriptors (43) | All descriptors (1447) | FCFP4 fingerprint (1024) |
|---|---|---|---|
| Original dataset | basicmolprop (not used) | allmolprop (not used) | fpFCFP4 |
| Missing value imputation | basicmolprop.miss | allmolprop.miss | (No missing values) |
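The *.miss representations above are produced by missing-value imputation on the descriptor matrices. As a hedged sketch of one common choice, per-column mean imputation (the paper’s exact imputation procedure is not specified here):

```python
# Per-column mean imputation: replace each missing value (None) with the mean
# of the observed values in that column.

def impute_mean(rows):
    """rows: list of equal-length lists with None marking missing values."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]

data = [[1.0, None], [3.0, 4.0], [None, 8.0]]
print(impute_mean(data))  # [[1.0, 6.0], [3.0, 4.0], [2.0, 8.0]]
```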
Fig. 2 Graphical representation of the number of times (target counts) a particular QSAR learning method obtained the best performance (minimum RMSE)
Fig. 3 Graphical representation of the number of times (target counts) a dataset representation was used by the best-performing QSAR method (minimum RMSE)
Fig. 4 Graphical representation of the number of times (target counts) a combination of dataset representation and QSAR method obtained the best performance (minimum RMSE)
Fig. 5 Average ranking of dataset representation and QSAR method combinations, as estimated using the RMSE ratio
Fig. 6 Box plots displaying the post-hoc test results over the top 6 ranked best-performing QSAR strategies: 1—rforest.fpFCFP4, 2—ksvm.fpFCFP4, 3—ksvmfp.fpFCFP4, 4—rforest.allmolprop.miss, 5—glmnet.fpFCFP4, and 6—rforest.basicmolprop.miss.fs. Statistically significant comparisons (P value) are represented with green boxes (Color figure online)
Fig. 7 The key branches of the meta-QSAR ontology (a fragment)
Fig. 8 The representation of the meta-features and their values
Dataset meta-features (examples)
| Feature | Description |
|---|---|
| multiinfo | Multi-information (also called total correlation) among the random variables in the dataset |
| mutualinfo | Mutual information between nominal attributes X and Y: the reduction in uncertainty of Y due to knowledge of X, computed from the conditional entropy |
| nentropyfeat | Normalised entropy of the features: the entropy divided by log(n), where n is the number of features |
| mmeanfeat | Average mean of the features |
| msdfeat | Average standard deviation of the features |
| kurtresp | Kurtosis of the response variable |
| meanresp | Mean of the response variable |
| skewresp | Skewness of the response variable |
| nentropyresp | Normalised entropy of the response variable |
| sdresp | Standard deviation of the response |
| aggFCFP4fp (1024 features) | Fingerprints aggregated over the dataset and normalised by the number of instances |
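Several of the meta-features above (meanresp, sdresp, skewresp, kurtresp) are simple moments of the response variable. A sketch using population moments; the paper’s exact estimators (e.g. bias-corrected vs population) are an assumption:

```python
import math

# Statistical meta-features of a QSAR dataset's response variable
# (bioactivity values), computed as population moments.

def response_meta_features(y):
    n = len(y)
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / n
    sd = math.sqrt(var)
    skew = sum((v - mean) ** 3 for v in y) / (n * sd ** 3)   # standardised 3rd moment
    kurt = sum((v - mean) ** 4 for v in y) / (n * sd ** 4)   # standardised 4th moment
    return {"meanresp": mean, "sdresp": sd, "skewresp": skew, "kurtresp": kurt}

# Hypothetical pIC50-like activity values.
print(response_meta_features([4.2, 5.1, 5.1, 6.3, 7.8]))
```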
Fig. 11 Violin plots with overlaid box plots representing the mean decrease in accuracy of the meta-features, grouped by meta-feature group. Note that for visualization purposes the dataset meta-features (as defined in Sect. 3) are shown as two separate groups: “Aggregated Fingerprints” and “Information Theory”
Drug targets meta-features (examples)
| Feature | Description |
|---|---|
| Aliphatic index | The aliphatic index of the protein sequence (Atsushi) |
| Boman index | The potential protein interaction index proposed by Boman |
| Hydrophobicity (38 features) | Hydrophobicity is the association of non-polar groups or molecules in an aqueous environment which arises from the tendency of water to exclude non-polar molecules (McNaught and Wilkinson) |
| Net charge | The theoretical net charge of a protein sequence as described by Moore |
| Molecular weight | Ratio of the mass of a molecule to the unified atomic mass unit. Sometimes called the molecular weight or relative molar mass (McNaught and Wilkinson) |
| Isoelectric point | The pH value at which the net electric charge of an elementary entity is zero. (pI is a commonly used symbol for this kind of quantity; however, a more accurate symbol is pH(I)) (McNaught and Wilkinson) |
| Sequence length | The number of amino acids in a protein sequence |
| Instability index | The instability index was proposed by Guruprasad et al. (1990). A protein whose instability index is smaller than 40 is predicted to be stable; a value above 40 predicts that the protein may be unstable |
| DC groups (400 features) | The dipeptide composition descriptor (Xiao et al.): the fraction of each of the 400 possible ordered amino-acid pairs in the protein sequence |
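The 400 DC features listed above count ordered amino-acid pairs. A sketch, assuming each pair’s frequency is taken over the L-1 overlapping dipeptides of a length-L sequence (the normalisation choice is an assumption):

```python
from itertools import product

# Dipeptide composition (DC) descriptor: frequency of each of the
# 20 x 20 = 400 ordered amino-acid pairs among the overlapping dipeptides.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def dipeptide_composition(seq):
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] += 1
    total = len(seq) - 1  # number of overlapping dipeptides
    return {dp: c / total for dp, c in counts.items()}

dc = dipeptide_composition("MKVLAA")          # toy sequence, not from the paper
print(len(dc), dc["AA"])                       # 400 features; "AA" is 1 of 5 dipeptides
```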
Fig. 9 Meta-learning setup to select QSAR combinations (workflows) for a given QSAR dataset. The 52 QSAR combinations are generated by combining 3 types of representation/preprocessing with 17 regression algorithms, plus the Tanimoto KSVM, which was only run on the fingerprint representation
Fig. 10 Schematic representation of the meta-dataset used for meta-QSAR
Fig. 12 Box plots representing the computed Spearman’s rank correlation coefficient (rs) between the predicted and actual rankings. Labels on the horizontal axis indicate: mRF—multivariate random forest; 1-NN, 5-NN, 10-NN, 50-NN, 100-NN, 500-NN, and All—1, 5, 10, 50, 100, 500, and all nearest neighbours, respectively
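The rank correlation plotted in Fig. 12 can be computed with the classic Spearman formula. A sketch assuming no tied values (ties would require average ranks instead):

```python
# Spearman's rank correlation rs = 1 - 6 * sum(d^2) / (n * (n^2 - 1)),
# where d is the per-item difference between the two rankings.

def spearman_rs(x, y):
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

print(spearman_rs([0.1, 0.4, 0.3], [0.2, 0.9, 0.5]))  # same ordering -> 1.0
```

rs = 1 means the meta-learner predicted the ordering of QSAR combinations perfectly; rs = -1 means it reversed it.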
Fig. 13 Average predicted ranking of QSAR combinations using the multivariate random forest algorithm, according to the RMSE ratio
Fig. 14 Visual comparison of performance distributions between the default strategy (in black) and all meta-learners (in grey) using asymmetric bean plots. Average RMSE for each implementation is represented by vertical black lines on the “beans” (performance distribution curves)
Comparison of performance between the default strategy and all Meta-QSAR implementations
| Implementation | Mean RMSE | Relative RMSE reduction (%) | P value |
|---|---|---|---|
| Default | 0.1964 | | |
| mRF | 0.1709 | 13.0 | < 0.001 |
| All-NN | 0.1737 | 11.6 | < 0.001 |
| 500-NN | 0.1738 | 11.5 | < 0.001 |
| 100-NN | 0.1738 | 11.5 | 0.009 |
| 50-NN | 0.1751 | 10.9 | 0.003 |
| 10-NN | 0.1815 | 7.6 | 0.011 |
| 5-NN | 0.1881 | 4.3 | < 0.001 |
| 1-NN | 0.2098 | -6.8 | < 0.001 |
| cl.Top2 | 0.1711 | 12.9 | 0.007 |
| cl.Top3 | 0.1709 | 13.0 | < 0.001 |
| cl.Top6 | 0.1779 | 9.5 | 0.022 |
| cl.Top11 | 0.1771 | 9.9 | 0.086 |
| cl.Top16 | 0.1788 | 9.0 | 0.072 |
| cl.All | 0.1823 | 7.2 | 0.189 |
Relative RMSE reduction (in %) is estimated as 100 × (RMSE_def − RMSE_meta) / RMSE_def, where RMSE_def and RMSE_meta correspond to the mean RMSE of the default and Meta-QSAR strategies, respectively. P values were estimated using the Wilcoxon rank-sum test
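The relative RMSE reduction column is 100 * (RMSE_default - RMSE_meta) / RMSE_default; the mRF and 1-NN rows of the table can be reproduced from the listed mean RMSE values:

```python
# Relative RMSE reduction of a meta-learner over the default strategy, as a
# percentage of the default's mean RMSE (negative means the meta-learner is worse).

def relative_rmse_reduction(rmse_default, rmse_meta):
    return 100.0 * (rmse_default - rmse_meta) / rmse_default

print(round(relative_rmse_reduction(0.1964, 0.1709), 1))  # mRF row: 13.0
print(round(relative_rmse_reduction(0.1964, 0.2098), 1))  # 1-NN row: -6.8
```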