Zhendong Zhao, Gang Fu, Sheng Liu, Khaled M Elokely, Robert J Doerksen, Yixin Chen, Dawn E Wilkins.
Abstract
BACKGROUND: In drug discovery and development, it is crucial to determine which conformers (instances) of a given molecule are responsible for its observed biological activity and, at the same time, to recognize the most representative subset of features (molecular descriptors). Because the bioactive conformers are difficult to obtain experimentally, computational approaches such as machine learning techniques are needed. Multiple Instance Learning (MIL) is a machine learning method capable of tackling this type of problem. In the MIL framework, each instance is represented as a feature vector, which usually resides in a high-dimensional feature space. The high dimensionality may provide significant information for learning tasks, but it may also include a large number of irrelevant or redundant features that negatively affect learning performance. Reducing the dimensionality of the data will hence facilitate the classification task and improve the interpretability of the model.
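To make the bag-of-instances representation concrete, here is a minimal sketch of the similarity-based bag embedding commonly associated with MILES, the baseline named in the result tables of this record. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy descriptor vectors, and the value of `sigma` are all hypothetical.

```python
import math

def bag_to_concept_similarity(bag, concept, sigma):
    """Similarity of a bag to one candidate concept: that of the bag's closest instance."""
    min_sq_dist = min(
        sum((a - b) ** 2 for a, b in zip(instance, concept))
        for instance in bag
    )
    return math.exp(-min_sq_dist / sigma ** 2)

def miles_embedding(bags, concepts, sigma):
    """Map every bag to a vector with one similarity value per candidate concept."""
    return [
        [bag_to_concept_similarity(bag, concept, sigma) for concept in concepts]
        for bag in bags
    ]

# Hypothetical toy data: two molecules (bags) with 3 and 2 conformers (instances),
# each conformer described by 4 binary descriptors.
bags = [
    [[1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1]],
    [[0, 1, 0, 1], [1, 0, 1, 0]],
]

# Assumption: every training instance serves as a candidate concept.
concepts = [instance for bag in bags for instance in bag]

embedded = miles_embedding(bags, concepts, sigma=1.0)
print(len(embedded), len(embedded[0]))
```

Each molecule becomes a single fixed-length vector whose length equals the number of candidate instances. This is why a data set with tens of thousands of conformers leads to a very high-dimensional problem, and why pruning instances and features matters.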
Year: 2013 PMID: 24267824 PMCID: PMC3850986 DOI: 10.1186/1471-2105-14-S14-S16
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. A toy example illustrating instance classification.
Figure 2. Flowchart of MIL via joint instance and feature selection.
Some statistics of data sets.
| Data set | Training set: positive | Training set: negative | Test set: positive | Test set: negative | Total No. of molecules |
|---|---|---|---|---|---|
| I | 199 | 188 | 67 | 70 | 524 |
| II | 191 | 210 | 62 | 74 | 537 |
| III | 247 | 131 | 60 | 57 | 495 |
| IV | 94 | 93 | 28 | 35 | 250 |
Number of conformers and length of pharmacophore fingerprint of data sets.
| Data set | Training conformers: total | Training conformers: average per molecule | Test conformers: total | Test conformers: average per molecule | Length of pharmacophore fingerprint |
|---|---|---|---|---|---|
| I | 17249 | 44.6 | 5399 | 39.4 | 2979 |
| II | 35434 | 88.4 | 12333 | 90.7 | 14002 |
| III | 32528 | 86.1 | 9942 | 85.0 | 1542 |
| IV | 41960 | 224.4 | 10746 | 170.6 | 3467 |
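The per-molecule averages can be cross-checked against the molecule counts in the statistics table. The short sketch below does this for the test sets only, taking rows in table order and assuming the average is simply total conformers divided by the number of test molecules (positive plus negative); the function name is illustrative.

```python
# Each tuple: (test conformers, positive test molecules, negative test molecules,
# reported average), copied from the two tables in row order.
rows = [
    (5399, 67, 70, 39.4),
    (12333, 62, 74, 90.7),
    (9942, 60, 57, 85.0),
    (10746, 28, 35, 170.6),
]

def average_conformers(total_conformers, n_molecules):
    """Mean number of conformers per molecule."""
    return total_conformers / n_molecules

for total, positive, negative, reported in rows:
    computed = average_conformers(total, positive + negative)
    print(f"computed {computed:.1f}, reported {reported}")
```

All four computed values agree with the reported averages to one decimal place, which also confirms that the rows of the two tables refer to the same data sets in the same order.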
Test error rates for each data set.
| Method | Data set | R1 | R2 | R3 | R4 | R5 | Average |
|---|---|---|---|---|---|---|---|
| Method 1 | I | 18.25 | 14.60 | 17.52 | 10.95 | 19.71 | 16.20 ± 3.48 (11.82 ± 1.20) |
| Method 1 | II | 18.38 | 16.18 | 17.65 | 19.85 | 15.44 | 17.50 ± 1.76 (16.91 ± 0.74) |
| Method 1 | III | 16.24 | 14.53 | 15.38 | 16.24 | 17.95 | 16.07 ± 1.27 (15.38 ± 1.05) |
| Method 1 | IV | 28.57 | 23.81 | 34.92 | 28.57 | 30.16 | 29.21 ± 3.98 (26.35 ± 0.87) |
| Method 2 | I | 15.33 | 16.06 | 19.71 | 21.90 | 19.71 | 18.54 ± 2.76 (11.82 ± 1.20) |
| Method 2 | II | 18.38 | 13.24 | 13.24 | 16.18 | 18.38 | 15.88 ± 2.58 (16.91 ± 0.74) |
| Method 2 | III | 16.24 | 17.09 | 17.52 | 17.09 | 15.38 | 16.67 ± 0.85 (15.38 ± 1.05) |
| Method 2 | IV | 37.30 | 34.92 | 26.98 | 28.57 | 36.51 | 32.86 ± 4.75 (26.35 ± 0.87) |
R1-R5: from the first to the fifth repeat.
Note: numbers in parentheses are average error rates of the original MILES method.
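As a sanity check on how the Average column relates to the five repeats, the sketch below recomputes the summary for the first Method 1 row. It assumes the value after ± is the sample standard deviation (n - 1 denominator), which the source does not state; small differences in the last digit are expected because the per-repeat entries are already rounded.

```python
import statistics

# Test error rates (%) for repeats R1-R5, first Method 1 row of the table.
repeats = [18.25, 14.60, 17.52, 10.95, 19.71]

mean_error = statistics.mean(repeats)
# statistics.stdev uses the n - 1 denominator (sample standard deviation).
std_error = statistics.stdev(repeats)

print(f"mean = {mean_error:.3f}, sample std = {std_error:.3f}")
```

Under this assumption the recomputed values match the reported 16.20 ± 3.48 to within rounding.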
Results of data set I.
| Method | Measure | Iteration 0 | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4 |
|---|---|---|---|---|---|---|
| Method 1 | CV error rate (%) | 19.23 | 14.10 | 11.67 | - | - |
| Method 1 | Training error rate (%) | 3.88 | 7.24 | 5.43 | - | - |
| Method 1 | Test error rate (%) | 13.14 | 13.14 | 17.52 | - | - |
| Method 1 | Number of instances | 17249 | 153 | 80 | - | - |
| Method 1 | Number of features | 2979 | 298 | 168 | - | - |
| Method 2 | CV error rate (%) | 19.47 | 15.79 | 15.39 | 16.67 | 14.27 |
| Method 2 | Training error rate (%) | 6.46 | 7.24 | 6.72 | 11.89 | 8.53 |
| Method 2 | Test error rate (%) | 10.95 | 16.79 | 18.98 | 18.25 | 19.71 |
| Method 2 | Number of instances | 17249 | 125 | 79 | 70 | 48 |
| Method 2 | Number of features | 2979 | 284 | 158 | 117 | 111 |
Note: Iteration 0 is the original MILES method without instance and feature selection.
Results of data set II.
| Method | Measure | Iteration 0 | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4 | Iteration 5 |
|---|---|---|---|---|---|---|---|
| Method 1 | CV error rate (%) | 12.50 | 10.00 | 10.00 | 11.10 | 8.63 | - |
| Method 1 | Training error rate (%) | 6.73 | 6.23 | 9.23 | 8.73 | 7.48 | - |
| Method 1 | Test error rate (%) | 16.18 | 15.44 | 14.71 | 14.71 | 15.44 | - |
| Method 1 | Number of instances | 35434 | 58 | 40 | 26 | 23 | - |
| Method 1 | Number of features | 14002 | 106 | 43 | 35 | 19 | - |
| Method 2 | CV error rate (%) | 12.50 | 8.75 | 10.00 | 10.00 | 12.50 | 12.50 |
| Method 2 | Training error rate (%) | 4.99 | 6.98 | 11.97 | 11.47 | 13.72 | 13.72 |
| Method 2 | Test error rate (%) | 16.91 | 19.85 | 14.71 | 15.81 | 16.18 | 16.18 |
| Method 2 | Number of instances | 35434 | 62 | 37 | 10 | 7 | 1 |
| Method 2 | Number of features | 14002 | 92 | 36 | 20 | 10 | 6 |
Note: Iteration 0 is the original MILES method without instance and feature selection.
Results of data set III.
| Method | Measure | Iteration 0 | Iteration 1 | Iteration 2 | Iteration 3 |
|---|---|---|---|---|---|
| Method 1 | CV error rate (%) | 5.26 | 5.26 | - | - |
| Method 1 | Training error rate (%) | 6.08 | 6.88 | - | - |
| Method 1 | Test error rate (%) | 17.09 | 16.24 | - | - |
| Method 1 | Number of instances | 32528 | 22 | - | - |
| Method 1 | Number of features | 1542 | 36 | - | - |
| Method 2 | CV error rate (%) | 5.33 | 6.69 | 9.21 | 7.89 |
| Method 2 | Training error rate (%) | 5.56 | 5.82 | 8.73 | 8.73 |
| Method 2 | Test error rate (%) | 14.53 | 15.39 | 17.09 | 17.09 |
| Method 2 | Number of instances | 32528 | 48 | 14 | 5 |
| Method 2 | Number of features | 1542 | 55 | 8 | 2 |
Note: Iteration 0 is the original MILES method without instance and feature selection.
Results of data set IV.
| Method | Measure | Iteration 0 | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4 | Iteration 5 |
|---|---|---|---|---|---|---|---|
| Method 1 | CV error rate (%) | 30.00 | 22.22 | 22.22 | 22.22 | 22.22 | 22.22 |
| Method 1 | Training error rate (%) | 17.65 | 22.46 | 22.46 | 24.60 | 24.06 | 25.13 |
| Method 1 | Test error rate (%) | 25.40 | 20.64 | 23.81 | 20.64 | 22.22 | 23.81 |
| Method 1 | Number of instances | 41960 | 40 | 15 | 8 | 8 | 8 |
| Method 1 | Number of features | 3467 | 142 | 56 | 50 | 48 | 27 |
| Method 2 | CV error rate (%) | 22.22 | 22.22 | 22.22 | - | - | - |
| Method 2 | Training error rate (%) | 19.79 | 25.67 | 25.13 | - | - | - |
| Method 2 | Test error rate (%) | 26.98 | 30.16 | 28.57 | - | - | - |
| Method 2 | Number of instances | 41960 | 27 | 13 | - | - | - |
| Method 2 | Number of features | 3467 | 90 | 54 | - | - | - |
Note: Iteration 0 is the original MILES method without instance and feature selection.
Figure 3. The four pharmacophoric elements selected by our approach: blue solid sphere (positive ionizable group), cyan and red small spheres (hydrophobes), and mixed red and cyan meshed sphere (hydrogen bond acceptor/donor). The 3D shape is constructed to provide the main skeleton of possible hits.
Figure 6. Comparison of the native ligand conformation with that of a very closely related molecule. Left: receptor grid based alignment, with the native ligand in green and the selected conformer in grey. Right: pharmacophore based alignment, with the native ligand in cyan and the selected conformer in orange. Perfect alignments are found in both cases, demonstrating the ability of this approach to find the bioactive conformer.