| Literature DB >> 23046442 |
Gang Fu, Xiaofei Nan, Haining Liu, Ronak Y Patel, Pankaj R Daga, Yixin Chen, Dawn E Wilkins, Robert J Doerksen.
Abstract
BACKGROUND: In the context of drug discovery and development, much effort has been exerted to determine which conformers of a given molecule are responsible for the observed biological activity. In this work we aimed to predict bioactive conformers using a variant of supervised learning, named multiple-instance learning. A single molecule, treated as a bag of conformers, is biologically active if and only if at least one of its conformers, treated as an instance, is responsible for the observed bioactivity; and a molecule is inactive if none of its conformers is responsible for the observed bioactivity. The implementation requires instance-based embedding, and joint feature selection and classification. The goal of the present project is to implement multiple-instance learning in drug activity prediction, and subsequently to identify the bioactive conformers for each molecule.
Year: 2012 PMID: 23046442 PMCID: PMC3439725 DOI: 10.1186/1471-2105-13-S15-S3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Cartoon representation of the relationship between molecules and conformers. M_i, i = 1, 2, 3, 4, represent the molecules (bags), circled by dashed lines. The solid triangles in M_1, circles in M_2, squares in M_3, and stars in M_4 represent conformers for different molecules. Molecules 2, 3, and 4 were biologically active since they had at least one bioactive conformer, whereas molecule 1 was inactive since none of its conformers was bioactive. The distance between two molecules, M_i and M_j, was calculated as the minimum distance D(M_i, M_j).
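The two ideas in this caption, the bag-labeling rule and the minimum distance between bags, can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `bag_label`, `bag_distance`, and `hamming` are hypothetical, conformers are assumed to be equal-length binary fingerprints, and the Hamming count stands in for whatever instance-level distance is actually used.

```python
from typing import Callable, Sequence

Fingerprint = Sequence[int]  # assumption: one binary fingerprint per conformer


def hamming(p: Fingerprint, q: Fingerprint) -> float:
    """Placeholder instance-level distance: number of differing bits."""
    return float(sum(x != y for x, y in zip(p, q)))


def bag_label(conformer_is_bioactive: Sequence[bool]) -> bool:
    """A molecule (bag) is active iff at least one conformer is bioactive."""
    return any(conformer_is_bioactive)


def bag_distance(
    bag_i: Sequence[Fingerprint],
    bag_j: Sequence[Fingerprint],
    dist: Callable[[Fingerprint, Fingerprint], float] = hamming,
) -> float:
    """Minimum distance over all conformer pairs drawn from the two bags."""
    return min(dist(p, q) for p in bag_i for q in bag_j)
```

For example, two molecules that share one identical conformer have a bag distance of 0 under this definition, regardless of how different their other conformers are.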
Figure 2. Overview of the MILES approach. (1) Structure preprocessing and conformational sampling. (2) Creating pharmacophore fingerprints and significance analysis of pharmacophore models. (3) Instance-based feature mapping based on structural similarity measures. (4) Joint feature selection and classification using 1-norm SVM.
Metrics used for dissimilarity measurements
| Dissimilarity Measure | |
|---|---|
| Soergel | |
| Dice | |
| Manhattan | |
| Rogers-Tanimoto | |
Let P1 and P2 be two pharmacophore fingerprints. Let a be the count of bits set to 1 in both P1 and P2, b the count of bits set to 1 in P1 but not in P2, c the count of bits set to 1 in P2 but not in P1, and d the count of bits set to 0 in both P1 and P2. Thus a is the number of total matches, b and c are the numbers of single matches, and d is the number of no matches.
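As a rough illustration of how the counts a, b, c, and d feed the four measures, the sketch below uses the textbook definitions of these dissimilarities for binary fingerprints. The formulas in this record's table were lost, so these definitions are an assumption and may differ in normalization from the ones the authors used; the function names are hypothetical.

```python
from typing import Sequence, Tuple

Fingerprint = Sequence[int]


def match_counts(p1: Fingerprint, p2: Fingerprint) -> Tuple[int, int, int, int]:
    """Return (a, b, c, d) as defined in the text for two binary fingerprints."""
    if len(p1) != len(p2):
        raise ValueError("fingerprints must have the same length")
    a = sum(1 for x, y in zip(p1, p2) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(p1, p2) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(p1, p2) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(p1, p2) if x == 0 and y == 0)
    return a, b, c, d


def _ratio(num: float, den: float) -> float:
    """Guard against empty or all-zero fingerprints."""
    return num / den if den else 0.0


def soergel(p1: Fingerprint, p2: Fingerprint) -> float:
    a, b, c, _ = match_counts(p1, p2)
    return _ratio(b + c, a + b + c)  # assumed form: 1 - Tanimoto


def dice(p1: Fingerprint, p2: Fingerprint) -> float:
    a, b, c, _ = match_counts(p1, p2)
    return _ratio(b + c, 2 * a + b + c)  # assumed form: 1 - Dice coefficient


def manhattan(p1: Fingerprint, p2: Fingerprint) -> float:
    a, b, c, d = match_counts(p1, p2)
    return _ratio(b + c, a + b + c + d)  # assumed form: normalized by length


def rogers_tanimoto(p1: Fingerprint, p2: Fingerprint) -> float:
    a, b, c, d = match_counts(p1, p2)
    return _ratio(2 * (b + c), a + d + 2 * (b + c))  # assumed form
```

Note that only the Manhattan and Rogers-Tanimoto forms above depend on d, so they are the two that reward shared absent bits.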
Figure 3. Identification of bioactive conformers. Molecule i was circled by a dashed line and its conformers were represented by solid triangles. The plus circles represent the positively contributing prototype conformers and the minus circles represent the negatively contributing prototype conformers. The identification of bioactive conformers was accomplished by calculating the total contributions from the closest prototype conformers.
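One plausible reading of this caption can be sketched as follows. Each prototype conformer carries a signed weight from the classifier, is assigned to the conformer of molecule i that lies closest to it, and adds its weighted similarity to that conformer's total. The equation the authors used is not reproduced in this record, so everything here is an assumption: the function names, the use of 1 minus the dissimilarity as the similarity, and the Soergel form chosen for the example.

```python
from typing import Callable, List, Sequence

Fingerprint = Sequence[int]


def soergel(p1: Fingerprint, p2: Fingerprint) -> float:
    """Assumed Soergel dissimilarity (b + c) / (a + b + c) on binary bits."""
    a = sum(1 for x, y in zip(p1, p2) if x == 1 and y == 1)
    diff = sum(1 for x, y in zip(p1, p2) if x != y)
    return diff / (a + diff) if (a + diff) else 0.0


def conformer_contributions(
    conformers: Sequence[Fingerprint],
    prototypes: Sequence[Fingerprint],
    weights: Sequence[float],
    dissimilarity: Callable[[Fingerprint, Fingerprint], float] = soergel,
) -> List[float]:
    """Sum, per conformer, the weighted similarities of the prototypes
    for which that conformer is the closest one in the molecule."""
    totals = [0.0] * len(conformers)
    for proto, w in zip(prototypes, weights):
        dists = [dissimilarity(conf, proto) for conf in conformers]
        nearest = min(range(len(conformers)), key=dists.__getitem__)
        totals[nearest] += w * (1.0 - dists[nearest])  # similarity = 1 - d
    return totals


def rank_conformers(totals: Sequence[float]) -> List[int]:
    """Conformer indices sorted from highest to lowest total contribution."""
    return sorted(range(len(totals)), key=lambda j: totals[j], reverse=True)
```

Under this sketch, a conformer that is not the closest one to any prototype keeps a total of 0, which would correspond to a conformer that plays no role in the classification.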
Data set statistics
| Data set | No. of training molecules (positive) | No. of training molecules (negative) | No. of test molecules (positive) | No. of test molecules (negative) | Total no. of molecules |
|---|---|---|---|---|---|
| | 199 | 188 | 67 | 70 | 524 |
| | 191 | 210 | 62 | 74 | 537 |
| | 247 | 131 | 60 | 57 | 495 |
| | 94 | 93 | 28 | 35 | 250 |
Conformational sampling and pharmacophore fingerprints
| Data set | Conformers (training set) | Conformers (test set) | Pharmacophore bits (originally enumerated) | Pharmacophore bits (after filtering) | Pharmacophore bits (optimal subset) | Optimal threshold |
|---|---|---|---|---|---|---|
| | 17249 | 5399 | 1872521 | 243721 | 2979 | 1.77 |
| | 35434 | 12333 | 1670985 | 155220 | 14002 | 5.40 |
| | 32528 | 9942 | 1636254 | 145996 | 1542 | 1.80 |
| | 41960 | 10746 | 13687602 | 161018 | 3467 | 1.66 |
Columns, left to right: the number of conformers in the training set; the number of conformers in the test set; the number of pharmacophore bits in the fingerprint originally enumerated; the number of pharmacophore bits in the fingerprint after filtering; the number of bits in the optimal subset of the pharmacophore fingerprint; the optimal threshold value to select truly significant pharmacophore bits.
Optimization of tuning parameter λ for MILES
| Data set | Dissimilarity measure | Cross-validation accuracy | λ | No. of prototype conformers |
|---|---|---|---|---|
| | Soergel | 0.777 | 8.000 | 196 |
| | Dice | 0.761 | 4.400 | 165 |
| | Manhattan | 0.803 | 4.400 | 130 |
| | Rogers-Tanimoto | 0.801 | 4.000 | 153 |
| | Soergel | 0.865 | 0.001 | 103 |
| | Dice | 0.865 | 0.001 | 85 |
| | Manhattan | 0.877 | 0.022 | 63 |
| | Rogers-Tanimoto | 0.868 | 0.069 | 72 |
| | Soergel | 0.899 | 0.001 | 94 |
| | Dice | 0.901 | 0.001 | 75 |
| | Manhattan | 0.934 | 0.550 | 63 |
| | Rogers-Tanimoto | 0.935 | 4.400 | 46 |
| | Soergel | 0.579 | 0.003 | 125 |
| | Dice | 0.544 | 0.031 | 111 |
| | Manhattan | 0.690 | 0.550 | 87 |
| | Rogers-Tanimoto | 0.689 | 6.800 | 78 |
Cross-validation accuracy: the median classification accuracy for 5 replications of 5-fold cross-validation. No. of prototype conformers: the number of prototype conformers selected in the set Γ. For each data set, the model was selected based on the number of prototype conformers.
Predictive performance for different dissimilarity measures
| Data set | Dissimilarity measure | Training set accuracy | Training set MCC | Test set accuracy | Test set MCC |
|---|---|---|---|---|---|
| | Soergel | 0.972 | 0.944 | 0.854 | 0.714 |
| | Dice | 0.979 | 0.959 | 0.825 | 0.653 |
| | Manhattan | 0.941 | 0.881 | 0.861 | 0.725 |
| | Rogers-Tanimoto | 0.961 | 0.923 | 0.861 | 0.725 |
| | Soergel | 0.965 | 0.933 | 0.860 | 0.725 |
| | Dice | 0.965 | 0.933 | 0.868 | 0.745 |
| | Manhattan | 0.978 | 0.956 | 0.904 | 0.807 |
| | Rogers-Tanimoto | 0.973 | 0.946 | 0.897 | 0.793 |
| | Soergel | 0.989 | 0.977 | 0.846 | 0.706 |
| | Dice | 0.989 | 0.977 | 0.855 | 0.717 |
| | Manhattan | 0.979 | 0.954 | 0.838 | 0.686 |
| | Rogers-Tanimoto | 0.947 | 0.885 | 0.846 | 0.711 |
| | Soergel | 0.904 | 0.823 | 0.667 | 0.301 |
| | Dice | 0.904 | 0.823 | 0.635 | 0.307 |
| | Manhattan | 0.957 | 0.918 | 0.714 | 0.433 |
| | Rogers-Tanimoto | 0.898 | 0.811 | 0.794 | 0.584 |
For each data set, the model was selected based on the number of prototype conformers.
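For readers unfamiliar with the two metrics reported in this table, the sketch below computes accuracy and the Matthews correlation coefficient (MCC) from the four cells of a binary confusion matrix, using the standard definitions. The function names are hypothetical, and returning 0.0 when the MCC denominator vanishes is a common convention assumed here rather than something stated in the source.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of molecules classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Ranges from -1 (total disagreement) through 0 (no better than chance)
    to +1 (perfect prediction).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column sum is zero
    return (tp * tn - fp * fn) / denom
```

Unlike accuracy, MCC uses all four cells symmetrically, which makes it less sensitive to an imbalance between positive and negative molecules.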
Optimization of tuning parameter λ for 1-norm SVM
| Data set | Cross-validation accuracy | λ | No. of important pharmacophore bits |
|---|---|---|---|
| | 0.693 | 0.001 | 223 |
| | 0.880 | 2.000 | 80 |
| | 0.912 | 0.016 | 77 |
| | 0.598 | 0.125 | 89 |
Cross-validation accuracy: the median classification accuracy for 5 replications of 5-fold cross-validation. The last column gives the number of important pharmacophore bits.
Predictive performance for different models
| Data set | Method | Training set accuracy | Training set MCC | Test set accuracy | Test set MCC |
|---|---|---|---|---|---|
| | MILES | 0.941 | 0.881 | 0.861 | 0.725 |
| | Decision tree | 0.915 | 0.830 | 0.781 | 0.569 |
| | 1-norm SVM | 1.000 | 1.000 | 0.832 | 0.668 |
| | Random forest | 0.995 | 0.990 | 0.891 | 0.783 |
| | MILES | 0.978 | 0.956 | 0.904 | 0.807 |
| | Decision tree | 0.955 | 0.913 | 0.919 | 0.837 |
| | 1-norm SVM | 0.980 | 0.961 | 0.882 | 0.765 |
| | Random forest | 0.945 | 0.896 | 0.868 | 0.754 |
| | MILES | 0.947 | 0.885 | 0.846 | 0.711 |
| | Decision tree | 0.966 | 0.924 | 0.838 | 0.682 |
| | 1-norm SVM | 0.995 | 0.988 | 0.812 | 0.624 |
| | Random forest | 0.982 | 0.959 | 0.855 | 0.717 |
| | MILES | 0.898 | 0.811 | 0.794 | 0.584 |
| | Decision tree | 0.914 | 0.829 | 0.698 | 0.398 |
| | 1-norm SVM | 0.952 | 0.906 | 0.714 | 0.418 |
| | Random forest | 0.936 | 0.877 | 0.698 | 0.392 |
The MILES results for the first two data sets use the Manhattan dissimilarity measure; those for the last two data sets use the Rogers-Tanimoto dissimilarity measure.
Validations on the prediction of bioactive conformers
| ID | Name | PDB ID | Contribution | Rank | No. of conformers |
|---|---|---|---|---|---|
| 23 | AR | | 2.792 | 3 | 117 |
| 37 | Benzoimidazole-1 | | 0 | N.A. | 138 |
| 50 | Jonjon-1 | | 2.827 | 6 | 38 |
| 59 | LM-4 | | 0.858 | 1 | 2 |
| 60 | LM-5 | | 11.941 | 1 | 3 |
| 77 | LM-29 | | 8.576 | 2 | 7 |
| 97 | Maleimide | | 0 | N.A. | 121 |
| 98 | OxaD-0 | | 10.629 | 1 | 53 |
| 99 | OxaD-00 | | 4.637 | 2 | 9 |
| 153 | Pyzo-11 | | 10.371 | 1 | 11 |
| 198 | RM-0 | | 5.568 | 2 | 25 |
| 199 | Staurosporine | | 22.359 | 1 | 5 |
ID: molecule index in the data set. Name: molecular name in the data set. PDB ID: Protein Data Bank index for the protein structure from which the experimental conformer was extracted. Contribution: calculated using equation 6. Rank: the rank in the set of contributions. No. of conformers: the number of conformers for each molecule. N.A.: the rank cannot be determined and the conformer was predicted to be irrelevant to classification based on the MILES method.