Fabian Buchwald, Lothar Richter, Stefan Kramer.
Abstract
BACKGROUND: We present a machine learning approach to the problem of protein-ligand interaction prediction. We focus on a set of binding data obtained from 113 different protein kinases and 20 inhibitors. The data were obtained through ATP site-dependent binding competition assays and constitute the first available dataset of this kind. We extract information about the investigated molecules from various data sources to obtain an informative set of features.
Year: 2011 PMID: 21708012 PMCID: PMC3151211 DOI: 10.1186/1758-2946-3-22
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Figure 1. Training set inhibitors. Structures of the 20 inhibitors that were the subject of our study [7].
Characteristics of the dataset used
| Ambit Biosciences | |
|---|---|
| number of kinases | 113 |
| number of inhibitors | 20 |
| number of pairs | 2260 |
| number of bindings | 597 (26.4%) |
| number of no-bindings | 1663 (73.6%) |
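
The percentages in this table follow directly from the counts. The short sketch below recomputes them and also derives the accuracy of a classifier that always predicts the larger class, which is a useful reference point when reading the accuracy tables. The variable names are illustrative and not taken from the paper.

```python
# Counts taken from the dataset table above.
n_kinases = 113
n_inhibitors = 20
n_binding = 597

# Every kinase is paired with every inhibitor.
n_pairs = n_kinases * n_inhibitors
n_non_binding = n_pairs - n_binding

binding_rate = n_binding / n_pairs

# Accuracy obtained by always predicting the more frequent class.
majority_baseline = max(n_binding, n_non_binding) / n_pairs

print(f"pairs: {n_pairs}")
print(f"binding: {binding_rate:.1%}, non-binding: {1 - binding_rate:.1%}")
print(f"majority-class accuracy: {majority_baseline:.1%}")
```

Because roughly three quarters of the pairs are non-binding, accuracy alone can look high for a trivial predictor, which is presumably why recall and precision are reported alongside it in the figures.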
Summary of different features of kinases used in our study
| Short-hand | Full Name | # Features | Feature Type |
|---|---|---|---|
| STTK | Serine/Threonine, Tyrosine Kinases | 1 | nominal |
| PC | Phylogenetic Clustering | 2 | nominal |
| PRO | PROSITE patterns | 12 | numeric |
| Apri | Apriori patterns | 14 | numeric |
| glAli | global alignment scores | 113 | numeric |
| locAli | local alignment scores | 113 | numeric |
| PSF | Position Specific Features | 98 | nominal |
| abPSF | abstract Position Specific Features | 98 | nominal |
Summary of different features for the inhibitors used in our study
| Short-hand | Full Name | # Features | Feature Type |
|---|---|---|---|
| PT | Primary Target | 1 | nominal |
| MS | 2D Molecular Structure | 1 | nominal |
| FTs | Free Trees | 78 | numeric |
| KNN | KNN clustering | 20 | numeric |
| CF | Chemical Features | 15 | numeric |
| GF | Geometric Features | 5 | numeric |
| P | Pharmacophores | 50 | numeric |
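
To get a feel for the size of the representation, the feature counts from the two summary tables can be added up. The sketch below uses the short-hands given in the feature-set table legend and assumes, purely for illustration, that a kinase-inhibitor pair is described by concatenating the selected kinase and inhibitor feature groups; the helper `pair_width` is hypothetical and not part of the paper.

```python
# Number of features per group, copied from the two summary tables.
kinase_groups = {
    "STTK": 1, "PC": 2, "PRO": 12, "Apri": 14,
    "glAli": 113, "locAli": 113, "PSF": 98, "abPSF": 98,
}
inhibitor_groups = {
    "PT": 1, "MS": 1, "FTs": 78, "KNN": 20,
    "CF": 15, "GF": 5, "P": 50,
}


def pair_width(selected):
    """Width of a pair vector built by concatenating the selected groups.

    Assumption: a pair is represented by simple concatenation of the
    kinase and inhibitor feature groups named in `selected`.
    """
    all_groups = {**kinase_groups, **inhibitor_groups}
    return sum(all_groups[name] for name in selected)


n_kinase_features = sum(kinase_groups.values())
n_inhibitor_features = sum(inhibitor_groups.values())

print("kinase features:", n_kinase_features)
print("inhibitor features:", n_inhibitor_features)
print("all groups combined:", pair_width(list(kinase_groups) + list(inhibitor_groups)))
```

Passing a subset of short-hands to `pair_width` gives the width for any particular combination of groups under the same concatenation assumption.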
Figure 2. Hard and soft case of LOOCV. Illustration of the hard (left) and the soft (right) case of LOOCV.
Figure 3. Mixed and mixed-mixed case of LOOCV. Illustration of the mixed (left) and the mixed-mixed (right) case of LOOCV.
This table indicates which features are contained in which feature sets
| Feature group | FS1 | FS2 | FS3 | FS4(C5) | FS4(SVM) | FS5 | FS6(C5) | FS6(SVM) | FS7(C5) | FS7(SVM) |
|---|---|---|---|---|---|---|---|---|---|---|
| PT | X | X | X | X | X | X | X | X | X | X |
| MS | X | X | X | X | X | X | X | X | X | X |
| FTs | X | X | X | X | X | X | X | X | X | X |
| KNN | X | X | X | X | X | X | X | X | X | X |
| CF | | | | | | | | | X | X |
| GF | | | | | | | | | X | X |
| P | | | | | | | | | X | X |
| STTK | X | X | X | X | X | X | X | X | X | X |
| PC | X | X | X | X | X | X | X | X | X | X |
| PRO | X | X | X | X | X | X | X | X | X | X |
| Apri | X | X | X | X | X | X | X | X | X | X |
| glAli | X | X | X | X | ||||||
| locAli | X | X | X | X | X | X | ||||
| PSF | X | X | X | X | X | X | ||||
| abPSF | X | X | X | X | X | |||||
This table indicates which features are contained in which feature sets. PT: Primary Targets, MS: 2D Molecular Structure, FTs: Free Trees, KNN: KNN clustering, CF: Chemical Features, GF: Geometric Features, P: Pharmacophores, STTK: Partitioning into Serine/Threonine and Tyrosine Kinases, PC: Phylogenetic Clustering, PRO: PROSITE patterns, Apri: Apriori patterns, glAli: global alignment scores, locAli: local alignment scores, PSF: Position Specific Features, abPSF: abstract Position Specific Features. The upper part of the table lists the chemical features, the lower part the biological features. In the left part of the table (FS1 to FS6), the description of the kinases is optimized by testing combinations of alignment-based and position-specific features. In the right part (FS7), the chemical representation is further optimized by adding descriptors.
Comparison of prediction accuracies for single feature groups
| Classifier | PT | MS | FTs | KNN | CF | GF | P | STTK | PC | PRO | Apri | glAli | locAli | PSF | abPSF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C5 | 73.6 | 75.6 | 79.3 | 79.3 | 79.3 | 79.3 | 79.3 | 73.6 | 73.6 | 73.6 | 73.6 | 74.0 | 74.0 | 73.3 | 73.3 |
| SVM | 73.6 | 75.6 | 79.3 | 73.6 | 76.8 | 79.3 | 79.3 | 73.6 | 73.5 | 73.1 | 73.1 | 72.7 | 73.6 | 72.3 | 72.3 |
Comparison of prediction accuracies of C5 without global pruning and SVMs with the quadratic kernel for single feature groups on the test set. For a description of the abbreviations of the feature sets see Table 4.
Figure 4. Performance on different feature sets (soft case). Prediction accuracies, recall and precision for different feature sets from C5 and Support Vector Machines with different parameter settings (soft case).
Figure 5. Comparison of prediction accuracies with random features. Comparison of prediction accuracy for different feature sets including random features.
Figure 6. Performance comparison of the hard and the soft case. Comparison of the prediction accuracy and recall/precision in the hard and the soft case.
Figure 7. Performance using solely test kinase-inhibitor pairs. Comparison of prediction accuracy and recall/precision using solely test kinase-inhibitor pairs in the training set.
Figure 8. Performance comparison of different mixed cases (C5). Comparison of prediction accuracy and recall/precision for different mixed cases (C5 without global pruning).
Figure 9. Performance comparison of different mixed-mixed cases (C5). Comparison of prediction accuracy for the soft, hard, mixed and mixed-mixed cases (C5 without global pruning).
Comparison of prediction accuracies for some mixed-mixed cases (on an absolute basis) (FS7)
| Comparison of prediction accuracies of C5 for some mixed-mixed cases on the test set (on an absolute basis) | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| C5 | 72.4 | 77.5 | 78.6 | 73.5 | 73.9 | 78.9 | 79.1 | 79.4 | 79.5 |
Comparison of prediction accuracies for C5 for some mixed-mixed cases on the test set (on an absolute basis). Results are obtained with feature set 7.
Comparison of prediction accuracies for some mixed-mixed cases (on a percentage basis) (FS7)
| Comparison of prediction accuracies of different classifiers for some mixed-mixed cases on the test set | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| C5 | 81.4 | 79.4 | 80.6 | 80.5 | 79.6 | 71.8 | 82.7 | 73.9 | 83.1 | 81.1 | 83.8 |
| Bayes | 76.1 | 77.2 | 76.3 | 76.7 | 73.6 | 73.6 | 73.6 | 73.6 | 77.6 | 78.2 | 78.5 |
| Maj. | 73.6 | 73.6 | 73.6 | 73.6 | 73.6 | 73.6 | 73.6 | 73.6 | 73.6 | 73.6 | 73.6 |
Comparison of prediction accuracies of different classifiers for some mixed-mixed cases on the test set.
Results are obtained with feature set 7.
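
One way to read the percentage-basis table is to compare each classifier with the majority-class row (Maj.). The sketch below copies the accuracies from that table and reports, per classifier, the mean accuracy, the mean gain over the baseline, and in how many of the 11 cases the baseline is beaten. The column labels of the table are not available here, so cases are indexed by position only.

```python
# Accuracies (%) copied row by row from the percentage-basis table (FS7).
accuracies = {
    "C5": [81.4, 79.4, 80.6, 80.5, 79.6, 71.8, 82.7, 73.9, 83.1, 81.1, 83.8],
    "Bayes": [76.1, 77.2, 76.3, 76.7, 73.6, 73.6, 73.6, 73.6, 77.6, 78.2, 78.5],
}
majority = 73.6  # the Maj. row is constant across all cases

wins = {}
for name, values in accuracies.items():
    gains = [v - majority for v in values]
    # Count the cases where the classifier is strictly better than the baseline.
    wins[name] = sum(v > majority for v in values)
    mean_acc = sum(values) / len(values)
    mean_gain = sum(gains) / len(gains)
    print(f"{name}: mean accuracy {mean_acc:.1f}%, "
          f"mean gain {mean_gain:+.1f} points, "
          f"beats baseline in {wins[name]} of {len(values)} cases")
```

A comparison of this kind only shows how far each classifier sits above the trivial predictor; it does not replace the recall and precision figures reported in the paper.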
Figure 10. Performance on the external test set. Prediction accuracy and recall/precision on the external test set with feature set 7, for both C5 and SVMs.