Eric D Cosoreanu1, Joseph Dooley1, Joshua S Fryer1, Shaun M Gordon1, Nikhil Kharbanda1, Martin Klamrowski1, Patrick N L LaCasse1, Thomas F Leung1, Muneeb A Nasir1, Chang Qiu1, Aisha S Robinson1, Derek Shao1, Boyan R Siromahov1, Evening Starlight1, Christophe Tran1, Christopher Wang1, Yu-Kai Yang1, Kevin Dick2,3, Daniel G Kyrollos1,4, James R Green1,4.
Abstract
The identification of novel drug-target interactions (DTI) is critical to drug discovery and drug repurposing to address contemporary medical and public health challenges presented by emergent diseases. Historically, computational methods have framed DTI prediction as a binary classification problem (indicating whether or not a drug physically interacts with a given protein target); however, framing the problem instead as a regression-based prediction of the physicochemical binding affinity is more meaningful. With growing databases of experimentally derived drug-target interactions (e.g. Davis, BindingDB, and KIBA), deep learning-based DTI predictors can be effectively leveraged to achieve state-of-the-art (SOTA) performance. In this work, we formulated a DTI competition as part of the coursework for a senior undergraduate machine learning course and challenged students to generate component DTI models that might surpass SOTA models and to effectively combine these component models as part of a meta-model using the Reciprocal Perspective (RP) multi-view learning framework. Following 6 weeks of concerted effort, 28 student-produced component deep-learning DTI models were leveraged in this work to produce a new SOTA RP-DTI model, denoted the Meta Undergraduate Student DTI (MUSDTI) model. Through a series of experiments we demonstrate that (1) RP can considerably improve SOTA DTI prediction, (2) our new double-cold experimental design is more appropriate for emergent DTI challenges, (3) our novel MUSDTI meta-model outperforms SOTA models, (4) RP can improve upon individual models as an ensembling method, and finally, (5) RP can be utilized for low-computation transfer learning. This work introduces a number of important revelations for the field of DTI prediction and sequence-based, pairwise prediction in general.
Year: 2022 PMID: 35918366 PMCID: PMC9344797 DOI: 10.1038/s41598-022-16493-9
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Figure 1. Conceptual overview of the proposed MUSDTI predictor.
DTI dataset sizes and their combined usage in defining the two test datasets.
| Dataset descriptor | Num. DTI pairs |
|---|---|
| Davis | 25,772 |
| BindingDB | 55,148 |
| KIBA | 117,657 |
| Training data (D+BDB) | 80,920 |
| Numerical Map data (KIBA) | 108,436 |
| Test size (double cold) | 8,178 |
| Test size (DTA-defined) | 19,550 |
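As a rough illustration of what a double-cold test set could look like, the sketch below keeps only candidate pairs whose drug and target are both absent from the training data. This is an assumption about the construction: the record here does not spell out the exact procedure, and the function name `double_cold_filter` and the toy identifiers are hypothetical.

```python
def double_cold_filter(train_pairs, candidate_pairs):
    """Keep candidate (drug, target, affinity) triples that are 'double cold'.

    Assumption: a pair is double cold when neither its drug nor its target
    occurs anywhere in the training pairs.
    """
    seen_drugs = {drug for drug, _, _ in train_pairs}
    seen_targets = {target for _, target, _ in train_pairs}
    return [
        (drug, target, affinity)
        for drug, target, affinity in candidate_pairs
        if drug not in seen_drugs and target not in seen_targets
    ]


# Toy data: only ("d9", "t9") shares no entity with the training pairs.
train_pairs = [("d1", "t1", 5.0), ("d2", "t1", 6.2)]
candidate_pairs = [("d1", "t9", 7.0), ("d9", "t1", 4.1), ("d9", "t9", 8.3)]
double_cold_test = double_cold_filter(train_pairs, candidate_pairs)
```

Pairs that share exactly one entity with the training data are dropped rather than reassigned, which is one reason a double-cold test set can be much smaller than its source dataset.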
Figure 2. Experimental design to evaluate the proposed MUSDTI predictor.
Figure 3. Example paired one-to-all score curves. A pair with dramatically differing distributions is depicted to emphasize that, even though a given drug scores relatively low from the perspective of a given protein target, that protein is the top-scoring target for that specific drug.
The 14 RP features derived from DTI pair-specific one-to-all score curves.
| Feature generic name | Short name | Type | Description |
|---|---|---|---|
| Y-in-X-percentile | | Rank | Percentile of target Y among all the predictions for drug X |
| X-in-Y-percentile | | Rank | Percentile of drug X among all the predictions for target Y |
| Adjusted reciprocal rank order | ARRO | Rank | Reciprocal product of |
| X-percentile-baseline | | Rank | Percentile rank of the target nearest to the local cutoff value of drug X |
| X-baseline | | Score | Score at the local cutoff value of drug X |
| Y-percentile-baseline | | Rank | Percentile rank of the drug nearest to the local cutoff value of target Y |
| Y-baseline | | Score | Score at the local cutoff value of target Y |
| Percentile-difference-from-baseline-X | | Fold | Difference between |
| Percentile-difference-from-baseline-Y | | Fold | Difference between |
| Fold-difference-from-baseline-X | | Fold | Fold-difference of target Y score in drug X from baseline |
| Fold-difference-from-baseline-Y | | Fold | Fold-difference of drug X score in target Y from baseline |
| SD-distance-from-mean-X | | Stats | The number of standard deviations from the mean score in drug X |
| SD-distance-from-mean-Y | | Stats | The number of standard deviations from the mean score in target Y |
| Original-Score | | Score | The original predicted score from the component model |
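A minimal sketch of how a few of these features might be computed from a pair's two one-to-all score curves is given below. Several details are assumptions not stated in the table: the percentile is taken as the share of scores less than or equal to the pair's score, the standard deviation is the population form, and the names `percentile_of` and `rp_features` are hypothetical. The baseline (local cutoff) features and ARRO are omitted because their definitions are not recoverable from this record.

```python
from statistics import mean, pstdev


def percentile_of(value, scores):
    """Percentage of scores that are less than or equal to value (assumed definition)."""
    return 100.0 * sum(s <= value for s in scores) / len(scores)


def rp_features(score, drug_curve, target_curve):
    """Compute a subset of RP features for one drug-target pair.

    drug_curve:   predicted scores of drug X against all targets
    target_curve: predicted scores of target Y against all drugs
    Both curves must contain some spread, otherwise pstdev is zero.
    """
    return {
        "y_in_x_percentile": percentile_of(score, drug_curve),
        "x_in_y_percentile": percentile_of(score, target_curve),
        "sd_distance_from_mean_x": (score - mean(drug_curve)) / pstdev(drug_curve),
        "sd_distance_from_mean_y": (score - mean(target_curve)) / pstdev(target_curve),
        "original_score": score,
    }


# Toy curves: the pair's score (4.0) is the top score for the drug
# but the lowest score for the target.
features = rp_features(4.0, [1.0, 2.0, 3.0, 4.0], [4.0, 6.0, 8.0, 10.0])
```

In this toy case the same score sits at the top of one curve and the bottom of the other, which is the kind of asymmetry the paired one-to-all curves are meant to expose.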
MUSDTI hyperparameter values.
| MUSDTI model parameter | DeepDTA dataset | DeepDTA* dataset | Double-cold dataset | Double-cold* dataset |
|---|---|---|---|---|
| Colsample by tree | 0.9362 | 0.7991 | 0.8988 | 0.9278 |
| Gamma | 1.1306 | 2.918 | 1.9723 | 4.342 |
| Learning rate | 0.2637 | 0.095 | 1.748 | 0.093 |
| Max depth | 15.0 | 13.0 | 12.0 | 12.0 |
| Min child weight | 2.0 | 9.0 | 4.0 | 1.0 |
The * designation denotes the model parameters used for the ensembled MUSDTI* model prior to the cascaded application of RP.
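The parameter names in the table resemble those of gradient-boosted tree libraries such as XGBoost (`colsample_bytree`, `gamma`, `learning_rate`, `max_depth`, `min_child_weight`). Assuming that mapping, which this record does not confirm, the "DeepDTA dataset" column could be expressed as a keyword dictionary like the sketch below; the variable name `deepdta_params` is hypothetical.

```python
# Values copied from the "DeepDTA dataset" column of the hyperparameter table.
# Assumption: the keys follow XGBoost's naming; the record does not name the library.
deepdta_params = {
    "colsample_bytree": 0.9362,
    "gamma": 1.1306,
    "learning_rate": 0.2637,
    "max_depth": int(15.0),  # tree depth is reported as a float but must be an integer
    "min_child_weight": 2.0,
}

# If XGBoost is installed, a regressor could then be built with
# xgboost.XGBRegressor(**deepdta_params).
```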
Component and MUSDTI model performance evaluated over the validation and test datasets using concordance index.
Component and MUSDTI model performance evaluated over the validation and test datasets using root mean squared error.
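For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them for affinity regression. The tie handling in the concordance index (predicted ties count as half-concordant, pairs with equal true values are skipped) is a widely used convention and is assumed here; the paper's exact implementation is not given in this record.

```python
from itertools import combinations
from math import sqrt


def concordance_index(y_true, y_pred):
    """Share of comparable pairs whose predicted order matches the true order.

    Pairs with identical true values are not comparable and are skipped;
    pairs with identical predictions contribute 0.5.
    Requires at least one comparable pair.
    """
    concordant = 0.0
    comparable = 0
    for (t1, p1), (t2, p2) in combinations(zip(y_true, y_pred), 2):
        if t1 == t2:
            continue
        comparable += 1
        if p1 == p2:
            concordant += 0.5
        elif (t1 < t2) == (p1 < p2):
            concordant += 1.0
    return concordant / comparable


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted affinities."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Toy affinities: five of the six pairs are ordered correctly.
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 5.8, 7.4, 7.0]
ci = concordance_index(y_true, y_pred)
err = rmse(y_true, y_pred)
```

The concordance index rewards correct ranking regardless of scale, whereas RMSE penalizes the size of each error, so a model can do well on one and poorly on the other.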
Figure 4. Inference rates of each component model measured over random subsets of 1 million pairs.
Figure 5. Component model performance improvement from the reciprocal perspective cascaded layer over the double-cold dataset.
Figure 6. Component model performance improvement from the reciprocal perspective cascaded layer over the DeepDTA-defined dataset.
Figure 7. Experimental results over the DeepDTA-defined datasets when incrementally incorporating reciprocal perspective component models, compared to the SOTA DeepDTA models. Two combined models are circled in the figure: the top-performing combination (top-20 models) and the first (top-10 models), which represents the performance of the proposed MUSDTI model even though later combinations achieve marginally higher performance. We opted for the component model ensemble at which performance plateaued.
Figure 8. Shapley additive features analysis. The x-axis is sorted left-to-right in increasing magnitude of SHAP value summed over the column, while the y-axis is sorted top-down in increasing magnitude of SHAP value summed over the row. Emanating out from the bottom-right are the models and features with progressively less impact on the model decision. Only the top-10 models contributing to the MUSDTI model are depicted, along with all 14 RP features.