Benoit Playe, Veronique Stoven.
Abstract
Chemogenomics, also called proteochemometrics, covers a range of computational methods that can be used to predict protein-ligand interactions at large scales in the protein and chemical spaces. These methods differ from more classical ligand-based methods (also called QSAR), which predict ligands for a given protein receptor. In the context of the drug discovery process, chemogenomics makes it possible to tackle the question of predicting off-target proteins for drug candidates, one of the main causes of undesirable side effects and of failure in drug development. The present study compares shallow and deep machine-learning approaches for chemogenomics, and explores data augmentation techniques for deep learning algorithms in chemogenomics. Shallow machine-learning algorithms rely on expert-based chemical and protein descriptors, while recent developments in deep learning make it possible to learn abstract numerical representations of molecular graphs and protein sequences that optimise performance on the prediction task. We first propose a formulation of chemogenomics with deep learning, called the chemogenomic neural network (CN), as a feed-forward neural network taking as input the combination of molecule and protein representations learnt by molecular graph and protein sequence encoders. We show that, on large datasets, the deep learning CN model outperforms state-of-the-art shallow methods, and is competitive with deep methods that use expert-based descriptors. However, on small datasets, shallow methods achieve better prediction performance than deep learning methods. Then, we evaluate data augmentation techniques, namely multi-view and transfer learning, to improve the prediction performance of the chemogenomic neural network. We conclude that a promising research direction is to integrate heterogeneous sources of data such as auxiliary tasks for which large datasets are available, or independently, multiple molecule and protein attribute views.
Keywords: Chemogenomics; Deep learning; Drug virtual screening; Graph neural networks
Year: 2020; PMID: 33431042; PMCID: PMC7011501; DOI: 10.1186/s13321-020-0413-0
Source DB: PubMed; Journal: J Cheminform; ISSN: 1758-2946; Impact factor: 5.514
Fig. 1 The chemogenomic neural network (CN)
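As a rough illustration of the architecture named in Fig. 1, the sketch below runs one forward pass of a pairwise predictor. It assumes the encoders have already produced `mol_vec` and `prot_vec`, that the "combination" of the two representations is a plain concatenation, and that a single ReLU hidden layer feeds a sigmoid output. The function names, layer sizes, and weights are hypothetical and are not taken from the paper.

```python
import math


def dense(x, weights, bias):
    """Affine layer: one output per (weight row, bias) pair."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def cn_predict(mol_vec, prot_vec, w_hidden, b_hidden, w_out, b_out):
    """Interaction probability for one (molecule, protein) pair."""
    # Assumption: the pair representation is a simple concatenation.
    pair = list(mol_vec) + list(prot_vec)
    # Hidden feed-forward layer with ReLU activation.
    hidden = [max(0.0, v) for v in dense(pair, w_hidden, b_hidden)]
    # Single output unit squashed to (0, 1) by a sigmoid.
    logit = dense(hidden, [w_out], [b_out])[0]
    return 1.0 / (1.0 + math.exp(-logit))
```

In a trained model the encoder parameters and the feed-forward weights would be optimised jointly on the interaction labels; here the weights are simply passed in.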
Fig. 2 Sketch of the graph neural network iterative process. (a) The function updates node representation vectors by aggregating information coming from itself and its neighbours. (b) As the process iterates, nodes receive information from further nodes in the graph. (c) The function builds a graph-level representation vector by aggregating information from all nodes. (d) A graph-level representation is learned at each iteration
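The four steps in the Fig. 2 caption can be mimicked with a toy, parameter-free version of message passing. This is only a sketch under stated assumptions: neighbour aggregation and the graph-level readout are both plain sums, a ReLU stands in for the learned update function, and the per-iteration graph vectors are added together. The name `gnn_graph_embedding` is hypothetical.

```python
def gnn_graph_embedding(node_feats, adjacency, n_iters=2):
    """Toy message passing over a molecular graph.

    node_feats: one feature vector (list of numbers) per node.
    adjacency:  dict mapping a node index to its neighbour indices.
    Returns the sum of the graph-level vectors from all iterations.
    """
    dim = len(node_feats[0])
    h = [list(v) for v in node_feats]
    graph_vec = [0.0] * dim
    for _ in range(n_iters):
        new_h = []
        for i, hi in enumerate(h):
            # (a) aggregate the node's own vector with its neighbours'.
            agg = list(hi)
            for j in adjacency.get(i, []):
                agg = [a + b for a, b in zip(agg, h[j])]
            # ReLU stands in for a learned update function (assumption).
            new_h.append([max(0.0, a) for a in agg])
        # (b) after each pass, information has travelled one more hop.
        h = new_h
        # (c) graph-level readout: sum over all nodes.
        pooled = [sum(col) for col in zip(*h)]
        # (d) one readout per iteration, accumulated here by addition.
        graph_vec = [g + p for g, p in zip(graph_vec, pooled)]
    return graph_vec
```

On a three-node path graph with one-hot features, the first node only picks up a signal from the third node at the second iteration, which is the behaviour described in panel (b).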
Fig. 3 Performance on DBEColi for the S1 (random split), S2 (orphan proteins in test set), S3 (orphan molecules in test set), and S4 (double orphan in test set) settings, with a positive:negative sample ratio set to 1:5. For each setting, the order from left to right in which the results are displayed is given in the legend
Fig. 4 Performance on DBHuman for the S1 (random split), S2 (orphan proteins in test set), S3 (orphan molecules in test set), and S4 (double orphan in test set) settings, with a positive:negative sample ratio set to 1:5. For each setting, the order from left to right in which the results are displayed is given in the legend
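One way to read the orphan settings in the captions of Figs. 3 and 4 is sketched below: given a set of held-out proteins and a set of held-out molecules, each pair is routed according to which of its partners is absent from training. This is an assumption about how such splits could be built, not the authors' exact procedure, and the function name `orphan_split` is hypothetical. S1 is a plain random split over pairs and is not shown.

```python
def orphan_split(pairs, test_proteins, test_molecules):
    """Partition (protein, molecule) pairs into train and orphan test sets."""
    split = {"train": [], "S2": [], "S3": [], "S4": []}
    for prot, mol in pairs:
        protein_unseen = prot in test_proteins
        molecule_unseen = mol in test_molecules
        if protein_unseen and molecule_unseen:
            split["S4"].append((prot, mol))   # double orphan
        elif protein_unseen:
            split["S2"].append((prot, mol))   # orphan protein
        elif molecule_unseen:
            split["S3"].append((prot, mol))   # orphan molecule
        else:
            split["train"].append((prot, mol))
    return split
```

Because every pair involving a held-out protein or molecule is removed from the training set, the test pairs in S2, S3, and S4 probe generalisation to entities the model has never seen.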
Fig. 5 Chemogenomic neural networks combining expert-based and learnt features in one step
Fig. 6 Modification of the proposed chemogenomic neural network combining expert-based and learnt features
AUPR score for the two transfer learning modifications of the chemogenomic neural network CN, based on a single train/validation/test split of the DBEColi dataset
| | Raw (S1) | Orphan proteins (S2) | Orphan molecules (S3) | Double orphan (S4) |
|---|---|---|---|---|
| Chemogenomic neural network (CN) | | | | |
| Curriculum learning | | | | |
The curriculum learning line corresponds to pre-training the molecule encoder of CN on the DBHuman dataset. The standard deviations are obtained by repeating the evaluation procedure 5 times
Fig. 7 AUPR scores obtained via 5-fold nested cross-validation on the DBEColi dataset for the S1 (random split), S2 (orphan proteins in test set), S3 (orphan molecules in test set), and S4 (double orphan) settings. Results are compared to the reference shallow methods and to FNN and CN as reference methods for deep learning. For each setting, the order from left to right in which the results are displayed is indicated in the legend
Fig. 8 AUPR scores obtained via 5-fold nested cross-validation on the DBHuman dataset for the S1 (random split), S2 (orphan proteins in test set), S3 (orphan molecules in test set), and S4 (double orphan) settings. Results are compared to the reference shallow methods and to FNN and CN as reference methods for deep learning. For each setting, the order from left to right in which the results are displayed is indicated in the legend
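The 5-fold nested cross-validation mentioned in the captions of Figs. 7 and 8 can be pictured as an outer loop that sets aside a test fold and an inner loop that selects hyperparameters on the remaining data. The sketch below only generates index splits. It assumes interleaved folds without shuffling and a k-fold inner loop, neither of which is specified in this record, and the names `k_folds` and `nested_cv_splits` are hypothetical.

```python
def k_folds(indices, k):
    """Split indices into k interleaved folds of near-equal size."""
    return [indices[i::k] for i in range(k)]


def nested_cv_splits(n_samples, k_outer=5, k_inner=5):
    """Yield (train, validation, test) index lists for nested CV."""
    outer = k_folds(list(range(n_samples)), k_outer)
    for i, test in enumerate(outer):
        # Everything outside the current outer fold is used for model selection.
        rest = [j for f_i, fold in enumerate(outer) if f_i != i for j in fold]
        for val in k_folds(rest, k_inner):
            val_set = set(val)
            train = [j for j in rest if j not in val_set]
            yield train, val, test
```

Hyperparameters would be chosen from the validation scores inside each outer fold, and the reported score would be the mean over the outer test folds, so the test data never influences model selection.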
Fig. 9 Sketch of the chemogenomic neural network CN of Fig. 1, modified for curriculum learning
Fig. 10 AUPR scores obtained with a 5-fold nested cross-validation scheme for the S1 (random split), S2 (orphan proteins in test set), S3 (orphan molecules in test set), and S4 (double orphan) settings on the DBEColi dataset. Results are compared to the reference shallow methods (kronSVM and NRLMF) and to FNN and CN as reference methods for deep learning. For each setting, the order from left to right in which the results are displayed is indicated in the legend
AUPR scores on the ChEMBL-based dataset in the four settings S1, S2, S3, and S4, with a test-sample positive:negative ratio of 1:5
| | S1 | S2 | S3 | S4 |
|---|---|---|---|---|
| NRLMF | | | | |
| CN | | | | |
AUPR scores (mean and standard deviation obtained by nested 5-fold cross-validation) on the DBEColi dataset in the four settings S1, S2, S3, and S4, with a test-sample positive:negative ratio of 1:5
| | S1 | S2 | S3 | S4 |
|---|---|---|---|---|
| kronSVM | | | | |
| FNN | | | | |
| Chemogenomic neural network (CN) | | | | |
| RF with proteochemometric features | | | | |
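The tables and figures above report AUPR, the area under the precision-recall curve. A minimal sketch of one common estimator, average precision, is given below. It assumes binary labels (1 for an interacting pair, 0 otherwise) and no tied scores, and the paper may have used a different interpolation of the curve. The function name `average_precision` is hypothetical.

```python
def average_precision(labels, scores):
    """Estimate the area under the precision-recall curve."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank pairs from the highest to the lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            # Precision at the rank where this positive is recovered.
            total += hits / rank
    return total / n_pos
```

With a positive:negative ratio of 1:5, a ranking that ignores the inputs is expected to score roughly 1/6 (about 0.17), which is a useful baseline when reading AUPR values.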