Matthieu Najm, Chloé-Agathe Azencott, Benoit Playe, Véronique Stoven.
Abstract
Identification of the protein targets of hit molecules is essential in the drug discovery process. Target prediction with machine learning algorithms can help accelerate this search, limiting the number of required experiments. However, the Drug-Target Interaction (DTI) databases used for training present high statistical bias, leading to a high number of false positives and thus increasing the time and cost of experimental validation campaigns. To minimize the number of false positives among predicted targets, we propose a new scheme for choosing negative examples, so that each protein and each drug appears an equal number of times in positive and negative examples. We artificially reproduce the process of target identification for three specific drugs, and more globally for 200 approved drugs. For the three detailed drug examples, and for the larger set of 200 drugs, training with the proposed scheme for the choice of negative examples improved target prediction results: the average number of false positives among the top-ranked predicted targets decreased and, overall, the rank of the true targets improved. Our method corrects the databases' statistical bias and reduces the number of false positive predictions, and therefore the number of useless experiments potentially undertaken.
Keywords: chemogenomic; drug discovery; false positive predictions; learning bias; machine learning; negative examples; random forests; support vector machines; target identification
Year: 2021 PMID: 34066072 PMCID: PMC8151112 DOI: 10.3390/ijms22105118
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
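The abstract's balancing idea (each protein and each drug appearing equally often in positive and negative examples) can be illustrated with a small greedy sketch. This is not the authors' procedure: the function name `sample_balanced_negatives`, the greedy ordering, and the toy identifiers are assumptions made for illustration, and a greedy pass is not guaranteed to reach perfect balance on real data.

```python
import random
from collections import Counter


def sample_balanced_negatives(positives, seed=0):
    """Greedy sketch (an assumption, not the published algorithm): choose
    unlabeled (drug, protein) pairs as negatives so that each drug and each
    protein appears, as far as possible, as often in negatives as in positives."""
    positives = set(positives)
    rng = random.Random(seed)
    drug_quota = Counter(d for d, _ in positives)  # negatives still owed per drug
    prot_quota = Counter(p for _, p in positives)  # negatives wanted per protein
    negatives = set()
    # Serve proteins with the most positives first: they are hardest to balance.
    for prot in sorted(prot_quota, key=lambda p: (-prot_quota[p], p)):
        candidates = [d for d in sorted(drug_quota)
                      if drug_quota[d] > 0 and (d, prot) not in positives]
        rng.shuffle(candidates)                        # random tie-breaking
        candidates.sort(key=lambda d: -drug_quota[d])  # largest remaining quota first
        for drug in candidates[:prot_quota[prot]]:
            negatives.add((drug, prot))
            drug_quota[drug] -= 1
    return negatives


# Toy, made-up interactions (hypothetical identifiers).
positives = [("d1", "p1"), ("d2", "p1"), ("d3", "p2"), ("d4", "p3")]
negatives = sample_balanced_negatives(positives)
print(sorted(negatives))
```

On this toy input every drug and every protein ends up with as many negative as positive examples; on a real database some quotas may remain unfilled and would need a fallback rule.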
Figure 1. Method for building one RN-dataset (or one BN-dataset).
Figure 2. Flowchart of the Drug-Target Interaction (DTI) prediction pipeline.
Figure 3. Flowchart of the target identification pipeline.
Performance of the SVM and RF algorithms for DTI predictions on the RN-datasets.
| Algorithm | AUPR | ROC-AUC | Recall | Precision | FPR |
|---|---|---|---|---|---|
Figure 4. Distribution of the probability scores predicted for known positive DTIs and randomly chosen negative DTIs among unlabeled DTIs.
Figure 5. Statistical bias in the DB-Database. (a) Distribution of the molecules according to their number of targets in the DB-Database. (b) Distribution of the proteins according to their number of ligands in the DB-Database.
Distribution in the DB-Database of the number of DTIs involving proteins from various categories, according to their number of known ligands.
| Number of ligands per protein | Number of interactions |
|---|---|
| 1 | 1106 |
| 2 to 4 | 2527 |
| 5 to 10 | 2404 |
| 11 to 20 | 1920 |
| 21 to 30 | 1238 |
| >30 | 5442 |
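To make the imbalance in this table concrete, the short sketch below computes the share of interactions in each category. The counts are copied from the table; the total is simply the sum of the listed rows, which is assumed here to cover all DTIs in the table.

```python
# Interaction counts per protein category, copied from the table above.
interactions = {
    "1": 1106,
    "2 to 4": 2527,
    "5 to 10": 2404,
    "11 to 20": 1920,
    "21 to 30": 1238,
    ">30": 5442,
}

total = sum(interactions.values())  # sum of the table rows
shares = {category: n / total for category, n in interactions.items()}

for category, n in interactions.items():
    print(f"{category:>8} ligands: {n:5d} interactions ({shares[category]:.1%})")
```

Proteins with more than 30 known ligands account for roughly 37% of the listed interactions (5442 of 14637).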
DTI prediction results for 3 marketed drugs, when the algorithm is trained on the RN-datasets or the BN-datasets: number of False Positive predicted targets, score and rank of the true target.
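| Drug | False positives (RN-datasets) | True-target score (RN-datasets) | True-target rank (RN-datasets) | False positives (BN-datasets) | True-target score (BN-datasets) | True-target rank (BN-datasets) |
|---|---|---|---|---|---|---|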
| DB11363 | 27 | 0.8 | 31 | 16 | 0.8 | 3 |
| DB11842 | 27 | 0.76 | 31 | 26 | 0.85 | 18 |
| DB11732 | 27 | 0.67 | 107 | 26 | 0.83 | 17 |
Figure 6. Balancing the BN-datasets. (a) Distribution of the proteins according to the number of positive examples or negative examples in which they are involved. (b) Distribution of the molecules according to the number of positive examples or negative examples in which they are involved.
Rate of false positives for proteins with various numbers of known ligands.
| Protein category (number of known ligands) | FPR, RN-datasets (threshold = 0.5) | FPR, BN-datasets (threshold = 0.5) | FPR, RN-datasets (threshold = 0.7) | FPR, BN-datasets (threshold = 0.7) |
|---|---|---|---|---|
| 0 | | | | |
| 1 | | | | |
| 2 to 4 | | | | |
| 5 to 10 | | | | |
| 11 to 20 | | | | |
| 21 to 30 | | | | |
| >30 | | | | |
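For reference, the false positive rate at a given score threshold is FP / (FP + TN), the fraction of true negatives that the model scores as positive. The sketch below shows the computation at the two thresholds used in the table. The scores and labels are made-up toy values, the helper `false_positive_rate` is hypothetical, and treating a score equal to the threshold as a positive prediction is an assumption.

```python
def false_positive_rate(scores, labels, threshold):
    """FPR = FP / (FP + TN): share of negatives (label 0) scored >= threshold."""
    negative_scores = [s for s, y in zip(scores, labels) if y == 0]
    if not negative_scores:
        raise ValueError("FPR is undefined without negative examples")
    false_positives = sum(s >= threshold for s in negative_scores)
    return false_positives / len(negative_scores)


# Toy predicted probabilities and ground-truth labels (1 = known DTI, 0 = negative).
scores = [0.90, 0.75, 0.60, 0.40, 0.20, 0.55]
labels = [1, 0, 0, 0, 0, 1]

fpr_05 = false_positive_rate(scores, labels, threshold=0.5)
fpr_07 = false_positive_rate(scores, labels, threshold=0.7)
print(fpr_05, fpr_07)
```

Raising the threshold can only lower or maintain the FPR, which is why the table reports both a permissive (0.5) and a stricter (0.7) cut-off.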