Alessandro Lusci1, Michael Browning2, David Fooshee1, Joshua Swamidass2, Pierre Baldi1.
Abstract
BACKGROUND: A number of algorithms have been proposed to predict the biological targets of diverse molecules. Some are structure-based, but the most common are ligand-based and use chemical fingerprints and the notion of chemical similarity. These methods tend to be computationally faster than others, making them particularly attractive tools as the amount of available data grows.
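As an illustration of fingerprint-based similarity, the sketch below computes the Tanimoto coefficient between two fingerprints represented as sets of on-bit indices, and derives two simple scores from it. The abstract does not say which similarity measure is used, so Tanimoto is an assumption here. Reading the MaxSim and MeanSim columns of the results table as the maximum and mean similarity to known actives is also an assumption. All function names are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen for this sketch: two empty fingerprints score 0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def max_sim_score(query, actives):
    """Assumed MaxSim: highest similarity between the query and any known active."""
    return max((tanimoto(query, a) for a in actives), default=0.0)


def mean_sim_score(query, actives):
    """Assumed MeanSim: average similarity between the query and all known actives."""
    if not actives:
        return 0.0
    return sum(tanimoto(query, a) for a in actives) / len(actives)


if __name__ == "__main__":
    # Toy fingerprints, invented for illustration only.
    query = {1, 2, 3, 4}
    actives = [{1, 2, 3, 5}, {7, 8}]
    print(max_sim_score(query, actives), mean_sim_score(query, actives))
```

Set-based fingerprints keep the example free of dependencies; a cheminformatics toolkit would normally supply the bit vectors.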
Keywords: Fingerprints; Influence-relevance voter; Large-scale; Molecular potency; Random inactive molecules; Target-prediction
Year: 2015 PMID: 26719774 PMCID: PMC4696267 DOI: 10.1186/s13321-015-0110-6
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 Distribution of the number of molecules associated with each protein. The x axis bins proteins by the number of small-molecule data points with which they are associated. The y axis plots the number of proteins in each bin. There are 2108 protein targets. About 33% of these datasets contain more than 100 molecules.
AUC performance in the cross-validation experiment on the ChEMBL dataset
| Cutoff ( | MaxSim | MeanSim | 11NN | IRV | PS-IRV | SVM | RF |
|---|---|---|---|---|---|---|---|
| All datasets | | | | | | | |
| | 0.79 | 0.76 | 0.81 | | | 0.84 | 0.84 |
| | 0.76 | 0.74 | 0.82 | 0.84 | | 0.84 | 0.82 |
| | 0.75 | 0.73 | 0.81 | | | 0.84 | 0.82 |
| Datasets with fewer than 100 molecules | | | | | | | |
| | 0.75 | 0.75 | 0.75 | | | 0.77 | |
| | 0.72 | 0.73 | 0.74 | 0.74 | 0.76 | | 0.76 |
| | 0.71 | 0.71 | 0.74 | 0.75 | 0.75 | | |
| Datasets with more than 100 molecules | | | | | | | |
| | 0.80 | 0.76 | 0.73 | 0.87 | | 0.85 | 0.85 |
| | 0.77 | 0.74 | 0.84 | 0.87 | | 0.86 | 0.84 |
| | 0.77 | 0.73 | 0.84 | 0.87 | | 0.86 | 0.86 |
| Datasets with more than 200 molecules | | | | | | | |
| | 0.81 | 0.75 | 0.84 | 0.89 | | 0.86 | 0.86 |
| | 0.78 | 0.74 | 0.86 | 0.89 | | 0.87 | 0.86 |
| | 0.77 | 0.73 | 0.86 | 0.88 | | 0.87 | 0.85 |
Each section of the table shows the average performance for datasets of different sizes
Best results within each group are in italics
Fig. 2 Cross-validation experiment: AUC scores as dataset size grows. Average AUC (y axis) plotted as a function of the minimum number of training molecules (x axis). Model performance (AUC) increases as datasets with fewer examples are excluded.
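The curve described in this caption can be sketched as follows: for each minimum-size threshold, keep only the per-target datasets with at least that many molecules and average their AUC values. The input layout, the function name, and the inclusive threshold (`>=`) are assumptions made for illustration, and the numbers in the usage example are invented.

```python
from statistics import mean


def mean_auc_by_min_size(results, thresholds):
    """Average AUC over datasets with at least `t` molecules, for each threshold `t`.

    `results` is a list of (n_molecules, auc) pairs, one pair per target dataset.
    Thresholds that leave no dataset are omitted from the returned dict.
    """
    curve = {}
    for t in thresholds:
        kept = [auc for n_molecules, auc in results if n_molecules >= t]
        if kept:
            curve[t] = mean(kept)
    return curve


if __name__ == "__main__":
    # Invented per-target results: (dataset size, AUC).
    results = [(20, 0.70), (150, 0.85), (400, 0.90)]
    for threshold, avg in mean_auc_by_min_size(results, [0, 100, 200]).items():
        print(threshold, round(avg, 3))
```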
Fig. 3 Cross-validation experiment: best performing models as dataset size grows. The fraction of times each model achieves the best performance for a dataset is plotted on the vertical axis, excluding datasets containing a number of molecules smaller than a specified size. PS-IRV is more consistently the best performer as more of the smaller datasets are excluded.
AUC performance in the cross-validation experiment on the external validation (ChEMBL 19) dataset
| Cutoff ( | PS-IRV | SVM | RF |
|---|---|---|---|
| All datasets | | | |
| | | 0.69 | 0.68 |
| | | 0.67 | 0.67 |
| | | 0.66 | 0.67 |
| Datasets with more than 100 molecules | | | |
| | | 0.70 | 0.70 |
| | | 0.68 | 0.69 |
| | | 0.67 | 0.67 |
| Datasets with more than 200 molecules | | | |
| | | | 0.71 |
| | | 0.69 | 0.70 |
| | | 0.68 | 0.68 |
Models were trained on the ChEMBL 13 dataset
Each section of the table shows the average performance for datasets of different sizes
Best results within each group are in italics
AUC performance in the simulated target-prediction experiments
| Method | Average AUC ( | Average AUC ( | Average AUC ( |
|---|---|---|---|
| Training without random negatives | | | |
| PS-IRV | | 0.84 | 0.83 |
| SVM | 0.84 | | |
| RF | 0.84 | 0.80 | 0.79 |
| Training with random negatives | | | |
| PS-IRV | | | 0.97 |
| SVM | | | 0.98 |
| RF | | | |
Models were trained using a tenfold cross-validation protocol and tested on the corresponding test set augmented with 9000 randomly selected ChEMBL molecules
In the top panel, models were trained in the standard way, without random negatives. In the bottom panel, the training set was supplemented with 1000 random negatives
Adding random negatives dramatically improves the performance of all methods
Best results are in italics
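The evaluation protocol described in these notes can be sketched in two steps: add randomly sampled background molecules to a set as putative inactives, then measure how well the model's scores separate actives from inactives with AUC. The code below is a minimal, dependency-free illustration. The function names are hypothetical, and it does not check whether a sampled background molecule is in fact a known active.

```python
import random


def augment_with_random_negatives(negatives, background, k, seed=0):
    """Return `negatives` plus `k` items sampled without replacement from `background`.

    The sampled items are treated as putative inactives, as in the table notes.
    """
    rng = random.Random(seed)
    return list(negatives) + rng.sample(list(background), k)


def auc(pos_scores, neg_scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in the set sizes, which is adequate for a sketch.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


if __name__ == "__main__":
    # Invented scores: two actives and two inactives.
    print(auc([0.9, 0.8], [0.1, 0.85]))
    # Invented identifiers standing in for molecules.
    test_negatives = augment_with_random_negatives(["inactive_1"], range(10000), 9000)
    print(len(test_negatives))
```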
Fig. 4 Simulated target-prediction experiment: AUC scores as dataset size grows. Average AUC (y axis) plotted as a function of the minimum number of training molecules (x axis). Each method’s ability to separate known actives from a background set of 9000 random ChEMBL molecules, assumed to be inactive, is measured. Training sets are not augmented.
Average enrichment in the simulated target-prediction experiment when training with random negatives
| Enrichment (%) | PS-IRV | SVM | RF |
|---|---|---|---|
| | | | |
| 5 | | 92 | 95 |
| 10 | | 94 | 96 |
| 20 | | 97 | 97 |
| 30 | | | 97 |
| | | | |
| 5 | | 92 | |
| 10 | | 94 | 96 |
| 20 | | 97 | 97 |
| 30 | 98 | 98 | 97 |
| | | | |
| 5 | | | 93 |
| 10 | | 95 | |
| 20 | | | |
| 30 | 97 | | 97 |
Models are tested using 10-fold cross-validation. 9000 randomly selected ChEMBL molecules are added to the original test set as putative inactives. 1000 randomly selected ChEMBL molecules are added to the original training sets as putative inactives. Best results at each cutoff are in italics
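The table does not define enrichment. One common reading, assumed here, is the percentage of known actives that appear in the top x% of the score-ranked test list. The sketch below implements that reading. The function name, the rounding of the cutoff with `ceil`, and the toy data are all assumptions.

```python
import math


def enrichment_at(scores, labels, top_percent):
    """Percentage of all actives found in the top `top_percent` of the ranked list.

    `scores` are model outputs (higher means more likely active) and `labels`
    are 1 for active, 0 for (putative) inactive.
    """
    total_actives = sum(labels)
    if total_actives == 0:
        raise ValueError("at least one active is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, math.ceil(len(ranked) * top_percent / 100))
    found = sum(label for _, label in ranked[:n_top])
    return 100.0 * found / total_actives


if __name__ == "__main__":
    # Invented example: 3 actives among 10 molecules.
    scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
    labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
    for cutoff in (20, 50):
        print(cutoff, round(enrichment_at(scores, labels, cutoff), 1))
```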
Fig. 5 Simulated target-prediction experiment when training with random negatives: AUC scores as dataset size grows. Average AUC (y axis) plotted as a function of the minimum number of training molecules (x axis). Each method’s ability to separate known actives from a background set of 9000 random ChEMBL molecules, assumed to be inactive, is measured. 1000 random negative molecules are added to the original training sets. The extended training sets result in significant performance improvements.
AUC performance in the simulated target-prediction experiment including external validation molecules
| Method | | | |
|---|---|---|---|
| Average AUC | | | |
| PS-IRV | | | |
| SVM | 0.88 | 0.86 | |
| RF | 0.85 | 0.84 | 0.84 |
| Median AUC | | | |
| PS-IRV | | | |
| SVM | 0.94 | 0.93 | |
| RF | 0.93 | 0.91 | 0.90 |
Models are trained using 10-fold cross-validation and tested on the external validation set
Training and test sets are augmented with 1000 and 9000 random negative molecules respectively
Here, we report both average and median AUC as we find a significant difference between the two measures. The results suggest that if we exclude a few outliers, AUC performance is consistently above 0.90 for each method. Best results are in italics
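The gap between average and median AUC can be illustrated with a small calculation: a few poorly predicted targets pull the mean down while leaving the median almost unchanged. The per-target AUC values below are invented for illustration and are not taken from the study.

```python
from statistics import mean, median


def summarize_auc(per_target_auc):
    """Return (mean, median) of a list of per-target AUC values."""
    return mean(per_target_auc), median(per_target_auc)


if __name__ == "__main__":
    # Invented values: four well-predicted targets and one outlier.
    aucs = [0.95, 0.94, 0.93, 0.92, 0.30]
    avg, med = summarize_auc(aucs)
    print(round(avg, 3), round(med, 3))
```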
Fig. 6 Probabilistic predictions. This reliability diagram plots the percentage of positive molecules (y axis) in each bin of molecules with similar prediction values (x axis). The data are collected from the outputs of the target-prediction models with AUC greater than 0.90. PS-IRV and RF both produce lines that closely follow the y = x line, indicating that their outputs can be interpreted as probabilities.
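A reliability diagram of this kind can be computed by binning predictions and comparing each bin's mean predicted value with its observed fraction of positives. The sketch below uses equal-width bins over [0, 1]. The number of bins and the function name are assumptions, since the caption does not specify how the bins were built.

```python
def reliability_bins(probs, labels, n_bins=10):
    """Group predictions into equal-width bins over [0, 1].

    Returns a list of (mean_prediction, fraction_positive, count) tuples,
    one per non-empty bin, ordered by increasing prediction value.
    A well-calibrated model has mean_prediction close to fraction_positive.
    """
    counts = [0] * n_bins
    positives = [0] * n_bins
    pred_sums = [0.0] * n_bins
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        counts[idx] += 1
        positives[idx] += y
        pred_sums[idx] += p
    return [
        (pred_sums[i] / counts[i], positives[i] / counts[i], counts[i])
        for i in range(n_bins)
        if counts[i] > 0
    ]


if __name__ == "__main__":
    # Invented predictions and labels.
    probs = [0.1, 0.15, 0.9, 0.95]
    labels = [0, 0, 1, 1]
    for row in reliability_bins(probs, labels, n_bins=2):
        print(row)
```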
Fig. 7 IRV interpretability. A test molecule is shown at center, along with its neighbors and their influences. Each neighbor’s influence factors into the overall vote determining the predicted activity of the test molecule. This test molecule has been experimentally determined as active, and is predicted by IRV to be active given its neighbors and their influences.
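The idea of neighbors casting weighted votes can be sketched with a deliberately simplified model. This is not the trained IRV. It assumes that each neighbor's influence is its similarity to the test molecule, signed by the neighbor's class, and that the summed influences are passed through a logistic function. The caption states only that each neighbor's influence factors into an overall vote, so the functional form, the `bias` parameter, and the function name are all assumptions.

```python
import math


def neighbor_vote(neighbors, bias=0.0):
    """Simplified similarity-weighted vote over nearest neighbors.

    `neighbors` is a list of (similarity, is_active) pairs. Each neighbor
    contributes +similarity if active and -similarity if inactive.
    Returns (predicted_probability, per_neighbor_influences).
    """
    influences = [sim if is_active else -sim for sim, is_active in neighbors]
    z = bias + sum(influences)
    probability = 1.0 / (1.0 + math.exp(-z))
    return probability, influences


if __name__ == "__main__":
    # Invented neighbors: two similar actives and one less similar inactive.
    prob, influences = neighbor_vote([(0.9, True), (0.8, True), (0.4, False)])
    print(round(prob, 3), influences)
```

Returning the individual influences alongside the prediction is what makes this kind of model inspectable: each term shows which neighbor pushed the prediction and in which direction.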