Ashwin Dhakal, Cole McKay, John J Tanner, Jianlin Cheng.
Abstract
New drug production, from target identification to marketing approval, takes over 12 years and can cost around $2.6 billion. Furthermore, the COVID-19 pandemic has unveiled the urgent need for more powerful computational methods for drug discovery. Here, we review the computational approaches to predicting protein-ligand interactions in the context of drug discovery, focusing on methods using artificial intelligence (AI). We begin with a brief introduction to proteins (targets), ligands (e.g. drugs) and their interactions for nonexperts. Next, we review databases that are commonly used in the domain of protein-ligand interactions. Finally, we survey and analyze the machine learning (ML) approaches implemented to predict protein-ligand binding sites, ligand-binding affinity and binding pose (conformation), including both classical ML algorithms and recent deep learning methods. After exploring the correlation between these three aspects of protein-ligand interaction, we propose that they be studied in unison. We anticipate that our review will aid exploration and development of more accurate ML-based prediction strategies for studying protein-ligand interactions.
Keywords: binding affinity; binding pose; binding site; deep learning; drug discovery; machine learning; protein–ligand interaction
Year: 2022; PMID: 34849575; PMCID: PMC8690157; DOI: 10.1093/bib/bbab476
Source DB: PubMed; Journal: Brief Bioinform; ISSN: 1467-5463; Impact factor: 11.622
Figure 1. (A) Sketch of peramivir, an inhibitor of the viral protein neuraminidase from the H1N9 influenza virus. (B) Sketch of the human Src kinase inhibitor bosutinib.
Figure 2. Human Src kinase docked with bosutinib, visualized with a hydrophobic surface generated in Chimera, PDB code 4MX0. The most hydrophobic regions are colored red and the most hydrophilic regions blue.
Figure 3. Conceptual workflow of the ML pipeline. Inputs are the properties of the target protein and ligands, and outputs are the predicted interactions.
Figure 4. Pie charts showing the distribution of prevailing datasets for the AI-based PLI prediction models. (A) Prevalent datasets for AI-based protein-ligand binding affinity prediction models. (B) Prevalent datasets for AI-based protein-ligand binding pose prediction models. (C) Prevalent datasets for AI-based protein-ligand binding site prediction models.
Figure 5. Statistics of the PDBbind dataset showing its composition from Version 2015 as well as the basic structure of Version 2020.
Figure 6. Protein–ligand interactions demonstrated through the neuraminidase–peramivir interaction. (A) Neuraminidase monomer with peramivir depicted in red. (B) View of the full monomer with peramivir, with hydrogen bonding pairs labeled and displayed in canonical atom coloring: oxygen in red and nitrogen in blue. (C) Focused view of neuraminidase–peramivir hydrogen bonding, PDB code 1L7F.
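The workflow summarized in the Figure 3 caption (protein and ligand properties in, predicted interactions out) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the descriptor values are toy numbers, `featurize` and `predict_interaction` are hypothetical helper names, and a k-nearest-neighbor vote stands in for whichever ML model a real pipeline would use.

```python
import math

def featurize(protein_props, ligand_props):
    """Concatenate protein and ligand descriptors into one input vector."""
    return list(protein_props) + list(ligand_props)

def predict_interaction(train, query, k=3):
    """Majority vote over the k training pairs nearest to the query.

    train: list of (feature_vector, label) with label 1 = binds, 0 = does not.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > len(nearest) else 0

# Hypothetical toy data: (protein descriptors, ligand descriptors, label).
pairs = [
    ((0.9, 0.1), (0.8, 0.2), 1),
    ((0.8, 0.2), (0.9, 0.1), 1),
    ((0.7, 0.3), (0.7, 0.2), 1),
    ((0.1, 0.9), (0.2, 0.8), 0),
    ((0.2, 0.8), (0.1, 0.9), 0),
    ((0.3, 0.7), (0.2, 0.7), 0),
]
train = [(featurize(p, l), y) for p, l, y in pairs]

query = featurize((0.85, 0.15), (0.85, 0.15))
print("predicted interaction label:", predict_interaction(train, query))
```

The same input/output shape applies whether the predicted quantity is a binding site, an affinity or a pose score; only the features and the model change.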
Classic ML methods to predict protein–ligand binding sites
| SN | Approach | Techniques | Features | Database used | Year |
|---|---|---|---|---|---|
| 1 | Oriented Shell Model | Support vector machine | Developed oriented shell model, utilizing distance and angular position distribution | Self-curated | 2005 |
| 2 | SitePredict | Random forest | Predicted small ligand-binding sites using backbone structure | Self-curated | 2008 |
| 3 | LIBRUS | Support vector machine | Combined ML and homology information for sequence-based ligand-binding residue prediction | Self-curated + FINDSITE’s database | 2009 |
| 4 | Qiu and Wang’s method | Random forest | Used eight structural properties to train random forest classifiers, later combined to predict binding residues | Q-SiteFinder’s dataset | 2011 |
| 5 | Wong | Support vector machine + differential evolution | Classified the grid points with the location most likely to contain bound ligands | LigASite | 2012 |
| 6 | DoGSiteScorer | Support vector machine | Web server for binding site prediction, analysis and druggability assessment | Self-curated | 2012 |
| 7 | Wong | Support vector machine | Used SVM to cluster most probable ligand-binding pockets using protein properties | LigASite + self-curated | 2013 |
| 8 | TargetS | Support vector machine + modified AdaBoost | Designed template-free predictor with classifier ensemble and spatial clustering | BioLip | 2013 |
| 9 | Wang | Support vector machine + statistical depth function | SVM model integrating sequence and structural information | PDBbind | 2013 |
| 10 | LigandRFs | Random forest | Applied random forest ensemble to identify ligand-binding residues from sequence information alone | CASP9 targets + CASP8 targets | 2014 |
| 11 | Suresh | Naive Bayes classifier | Trained Naive Bayes classifier using only sequence-based information | Self-curated | 2015 |
| 12 | OSML | Support vector machine | Proposed dynamic learning framework for constructing query-driven prediction models | BioLip + CASP9 targets | 2015 |
| 13 | PRANK | Random forest | Developed mechanism to prioritize the predicted putative pockets | Astex Diverse set + self-curated | 2015 |
| 14 | UTProt Galaxy | Support vector machine + neural network + random forest | Developed pipeline for protein–ligand binding site predictive tools using multiomics big data | Self-curated | 2015 |
| 15 | Chen | Random forest | Proposed dynamic ensemble approach to identify protein–ligand binding residues by using sequence information | ccPDB + CASP9 targets + CASP8 targets | 2016 |
| 16 | Chen | Random forest | Predicted allosteric and functional sites on proteins | PDBbind + allosteric DB + CATH DB | 2016 |
| 17 | TargetCom | Support vector machine + modified AdaBoost algorithm | Designed ligand-specific methods to predict the binding sites of protein–ligand interactions by an ensemble classifier | BioLip | 2016 |
| 18 | P2Rank 2.1 | Bayesian optimization | Improved version of P2Rank | Self-curated | 2017 |
| 19 | P2Rank | Random forest | Built stand-alone template-free tool for prediction of ligand-binding sites | Self-curated | 2018 |
| 20 | PrankWeb | Random forest | Online resource providing an interface to P2Rank | Self-curated | 2019 |
List of deep learning methods to predict protein–ligand binding sites
| SN | Approach | Techniques involved | Feature | Database used | Year |
|---|---|---|---|---|---|
| 1 | DeepCSeqSite | Deep convolutional neural network | Proposed sequence-based approach for ab initio protein–ligand binding residue prediction. | BioLip | 2019 |
| 2 | DELIA | Hybrid deep neural network + bidirectional long short-term memory network | Designed hybrid deep neural network to integrate 1D sequence-based features with 2D structure-based amino acid distance matrices. | BioLip + ATPBind | 2020 |
| 3 | Kalasanty | 3D convolutional neural network | Designed model based on U-Net’s architecture. | sc-PDB | 2020 |
| 4 | DeepSurf | Deep convolutional neural network + ResNet | Proposed surface-based deep learning approach for protein–ligand binding residue prediction. | sc-PDB | 2021 |
| 5 | PUResNet | ResNet | Based on deep ResNet + novel data cleaning process. | sc-PDB | 2021 |
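Several of the methods tabulated above predict binding residues from sequence information alone. A common way to turn a sequence into per-residue model input is a sliding window of one-hot encoded neighbors. The sketch below illustrates that general idea only; it is not the feature set of any listed method, and the window size and the `window_features` helper are assumptions chosen for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def window_features(sequence, half_width=2):
    """One-hot encode a window of residues centered on each position.

    Window slots that fall beyond the sequence termini, and residues
    outside the 20 standard letters, are encoded as all-zero vectors.
    Returns one flat feature row per residue.
    """
    rows = []
    for i in range(len(sequence)):
        row = []
        for j in range(i - half_width, i + half_width + 1):
            one_hot = [0] * len(AMINO_ACIDS)
            if 0 <= j < len(sequence):
                idx = AMINO_ACIDS.find(sequence[j])
                if idx >= 0:
                    one_hot[idx] = 1
            row.extend(one_hot)
        rows.append(row)
    return rows

features = window_features("MKTAY")  # hypothetical five-residue sequence
print(len(features), "residues,", len(features[0]), "features each")
```

Each row could then be passed to a per-residue classifier (binding versus non-binding), which is where the classical and deep learning approaches above differ.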
Binding affinity (Ki) of the SmCI group of inhibitors on three proteins [161].
| Inhibitor ligand (protein) | Porcine pancreatic elastase | Trypsin | Bovine carboxypeptidase A | Human carboxypeptidase A1 |
|---|---|---|---|---|
| SmCI | 2.66 × 10⁻⁸ | 3.81 × 10⁻⁸ | 2.83 × 10⁻⁸ | – |
| rSmCI | 1.70 × 10⁻⁸ | 3.66 × 10⁻⁸ | 9.55 × 10⁻⁸ | 2.54 × 10⁻⁸ |
| SmCI N23A | 1.94 × 10⁻⁹ | 4.08 × 10⁻¹⁰ | 4.25 × 10⁻⁸ | 1.29 × 10⁻⁸ |
Here, – indicates no data.
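Inhibition constants such as those in the table above are often compared on a logarithmic scale (pKi) or converted to a binding free energy via ΔG = RT ln Ki. The sketch below shows both conversions. It assumes the Ki values are in mol/L, which the table does not state, and it assumes a temperature of 298.15 K and a 1 M standard state; the function names are hypothetical.

```python
import math

GAS_CONSTANT = 1.987204e-3  # kcal/(mol*K)

def pki(ki_molar):
    """Negative base-10 logarithm of Ki (Ki assumed to be in mol/L)."""
    return -math.log10(ki_molar)

def binding_free_energy(ki_molar, temperature=298.15):
    """Binding free energy in kcal/mol from dG = RT ln(Ki / 1 M)."""
    return GAS_CONSTANT * temperature * math.log(ki_molar)

# Two values from the table above, assumed to be molar:
examples = {
    "SmCI vs porcine pancreatic elastase": 2.66e-8,
    "SmCI N23A vs trypsin": 4.08e-10,
}
for name, ki in examples.items():
    print(f"{name}: pKi = {pki(ki):.2f}, dG = {binding_free_energy(ki):.2f} kcal/mol")
```

A smaller Ki gives a larger pKi and a more negative ΔG, so both scales rank tighter binders higher.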
List of classical ML approaches to predict protein–ligand binding affinity
| SN | Approach | Technique involved | Feature | Database used | Year |
|---|---|---|---|---|---|
| 1 | Deng | Kernel partial least squares | Applied knowledge-based QSAR approach + used genetic algorithm-based feature selection method. | Self-curated | 2004 |
| 2 | Ashtawy | KNN + SVM + MLR + MARS + RF + BRT | Explored range of scoring functions employing ML approaches utilizing physicochemical features that characterize protein–ligand complexes. | PDBbind | 2011 |
| 3 | CSCORE | Regression | Developed Cerebellar Model Articulation Controller (CMAC) learning architecture. | PDBbind | 2011 |
| 4 | SFCscoreRF | Random forest | Followed random forest approach to train new regression models. | PDBbind + CSAR | 2013 |
| 5 | B2BScore | Random forest | Predicted binding affinity for protein–ligand complexes based on β contacts and B factor. | PDBbind | 2013 |
| 6 | Li | Random forest + multiple linear regression | Analyzed the importance of structural features to binding affinity prediction using the RF variable importance tool. | PDBbind | 2014 |
| 7 | Wang | Random forest | Predicted the protein–ligand binding affinity based on protein sequence, binding pocket, ligand structure and intermolecular interaction feature set. | PDBbind | 2014 |
| 8 | Cyscore | Linear regression | Improved protein–ligand binding affinity prediction by using a curvature-dependent surface area model. | PDBbind | 2014 |
| 9 | Pred-binding | Random forest + support vector machine | Applied ML algorithms for binding affinity prediction problem based on a large-scale dataset. | PDSP Ki DB + PubChem + DrugBank + ChemSpider | 2016 |
| 10 | Avila | ML methods available in SAnDReS | Applied machine learning box interface of SAnDReS to explore the scoring function virtual space (SFVS). | PDBbind + MOAD + BindingDB | 2017 |
| 11 | Ferreira | ML methods available in SAnDReS | Predicted Gibbs free energy of binding (ΔG) based on the crystallographic structure of complexes. | MOAD + BindingDB + PDBbind | 2018 |
| 12 | Kundu | GP + LR + MP + SMOR + Kstar + RF | Incorporated Weka 3.6.8 package to select optimum parameters of the ML algorithms. | PDBbind | 2018 |
| 13 | Boyles | Random forest + XGBoost | Used ligand-based features to improve ML scoring functions. | PDBbind + CASF | 2019 |
| 14 | RASPD+ | SVM + LR + KNN + SDN + RF + ERF | Introduced fast prefiltering method for ligand prioritization based on ML models. | PDBbind + DUD-E | 2020 |
| 15 | Amangeldiuly | RF + SVR + XGBoost + KNN | Designed prediction method for binding kinetics based on the ML analysis of protein–ligand structural features. | BindingDB + self-curated | 2020 |
| 16 | Wee and Xia’s method | Ollivier persistent Ricci curvature-based ML | Used persistent attributes as molecular descriptors, combined with a gradient boosting tree. | PDBbind | 2021 |
List of deep learning methods to predict protein–ligand binding affinity
| SN | Approach | Technique involved | Feature | Database used | Year |
|---|---|---|---|---|---|
| 1 | BgN-Score and BsN-Score | Ensemble neural networks | Assessed the scoring accuracies of two new ensemble neural network scoring functions based on bagging (BgN-Score) and boosting (BsN-Score). | PDBbind | 2015 |
| 2 | Gomes | Atomic convolution layer | Developed 3D spatial convolution operation for learning atomic-level chemical interactions. | PDBbind | 2017 |
| 3 | KDEEP | 3D convolutional neural networks | Featurized protein and ligand considering eight pharmacophoric-like properties that are used by a three-dimensional CNN model. | PDBbind | 2018 |
| 4 | DeepDTA | Convolutional neural network | Proposed deep learning–based model that uses only sequence information of both targets and drugs to predict drug–target interaction binding affinities. | Kinase dataset + KIBA dataset | 2018 |
| 5 | Pafnucy | Deep neural network | Represented molecular complex with a 4D tensor, processed by three convolutional layers and three dense (fully connected) layers. | PDBbind + CASF + Astex Diverse Set | 2018 |
| 6 | OnionNet | Deep convolutional neural network | Constructed modified deep CNN and defined customized loss function to train multiple-layer intermolecular contact features. | PDBbind + CASF | 2019 |
| 7 | Zhu | Neural network | Predicted the binding affinity from a given pose of a 3D protein–ligand complex by pairwise function based on neural network. | PDBbind + CASF-2016 | 2020 |
| 8 | DeepAtom | 3D convolutional neural network | Extracted binding-related atomic interaction patterns automatically from the voxelized complex structure. | PDBbind + Astex Diverse Set | 2020 |
| 9 | Jones | 3D CNN + Spatial Graph-CNN | Developed fusion models to benefit from feature representations of two neural network models to improve the binding affinity prediction. | PDBbind | 2020 |
| 10 | AK-score | 3D CNN ensemble | Used ensemble of multiple independently trained networks that consist of multiple channels of 3D CNN layers. | PDBbind + CASF | 2020 |
| 11 | graphDelta | Graph-convolutional neural network | Designed graph-convolutional neural networks for predicting binding constants of protein–ligand complexes. | PDBbind + CSAR + CASF | 2020 |
| 12 | DeepDTAF | Deep convolutional neural network | Employed dilated convolution to capture multiscale long-range interactions. | PDBbind | 2021 |
| 13 | LigityScore | Convolutional neural network | Designed rotationally invariant scoring functions. | PDBbind + CASF | 2021 |
| 14 | Seo | Deep attention mechanism | Employed deep attention mechanism based on intermolecular interactions. | PDBbind + CSAR | 2021 |
| 15 | DEELIG | Convolutional neural network | CNN was used to learn representations from the features. | Self-curated | 2021 |
| 16 | ResAtom System | ResNet + attention mechanism | Implemented ResNet neural network with added attention mechanism. | PDBbind + CASF | 2021 |
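Several of the 3D CNN approaches tabulated above operate on a voxelized complex, that is, a 4D tensor with three spatial axes and one feature axis. The sketch below shows the basic idea using simple per-voxel atom counts. The box size, grid resolution, two-channel layout (protein versus ligand) and the `voxelize` helper are all assumptions for illustration; published methods use richer per-atom features and often smooth occupancy functions instead of counts.

```python
def voxelize(atoms, box_size=20.0, resolution=1.0, n_channels=2):
    """Map atoms onto a cubic grid centered at the origin.

    atoms: iterable of (x, y, z, channel), coordinates in angstroms.
    Returns a nested list indexed as grid[ix][iy][iz][channel] holding
    the number of atoms of each channel in each voxel.
    """
    n = int(box_size / resolution)
    grid = [[[[0.0] * n_channels for _ in range(n)]
             for _ in range(n)] for _ in range(n)]
    half = box_size / 2.0
    for x, y, z, channel in atoms:
        # Floor division keeps atoms just outside the box from landing in voxel 0.
        idx = [int((c + half) // resolution) for c in (x, y, z)]
        if all(0 <= i < n for i in idx):  # skip atoms outside the box
            ix, iy, iz = idx
            grid[ix][iy][iz][channel] += 1.0
    return grid

# Hypothetical atoms: channel 0 = protein, channel 1 = ligand.
atoms = [
    (0.2, 0.3, 0.1, 0),
    (0.4, 0.1, 0.6, 1),
    (-3.5, 2.0, 1.2, 0),
    (50.0, 0.0, 0.0, 1),  # outside the 20 A box, ignored
]
grid = voxelize(atoms)
print("grid shape:", len(grid), len(grid[0]), len(grid[0][0]), len(grid[0][0][0]))
```

In practice the box is centered on the ligand or binding pocket before voxelization, so that the grid covers the interaction region.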
Figure 7. In silico prediction of two similar analogue inhibitors docked within binding site-2 of the human dynamin-1 PH domain (figure adapted from [141]). Both analogues share very similar intermolecular forces, such as H-bonding, yet slight differences in the ligands’ orientation occur.
List of ML methods to predict the binding score of protein–ligand binding pose
| SN | Approach | Technique involved | Feature | Database used | Year |
|---|---|---|---|---|---|
| 1 | Ashtawy | MLR + MARS + KNN + SVM + RF + BRT | Employed ML approaches utilizing physicochemical and geometrical features characterizing protein–ligand complexes | PDBbind | 2015 |
| 2 | Grudinin et al. | Regression | Predicted binding poses and affinities with a statistical parameter estimation | PDBbind + HSP90 dataset + MAP4K dataset | 2016 |
| 3 | Ragoza | Convolutional neural network | Trained CNN scoring function to discriminate binding poses using the differentiable atomic grid format as input | PDBbind | 2017 |
| 4 | Ragoza | Convolutional neural network | Trained and optimized CNN scoring functions to discriminate between correct and incorrect binding poses | CSAR | 2017 |
| 5 | Nguyen1 | Random forest + convolutional neural networks | Used mathematical deep learning for pose and binding affinity prediction | PDBbind | 2018 |
| 6 | Jose | Reinforcement learning | Represented the protein–ligand complex using a graph CNN so that both atomic and spatial features are used to score protein–ligand poses | PDBbind + self-curated | 2021 |
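The pose-scoring methods tabulated above rank candidate ligand poses, and a ranked pose is typically judged against a reference (for example, the crystallographic) pose by root-mean-square deviation (RMSD). The sketch below picks the top-scored pose and computes its RMSD to a reference. It is a simplified illustration under stated assumptions: atoms are already matched one-to-one, no alignment or symmetry correction is applied, the coordinates and scores are toy values, and the 2 Å cutoff is a commonly used convention rather than something specified in the table.

```python
import math

def rmsd(pose, reference):
    """Root-mean-square deviation between two matched coordinate lists."""
    if len(pose) != len(reference):
        raise ValueError("poses must contain the same number of atoms")
    squared = sum((a - b) ** 2
                  for p, r in zip(pose, reference)
                  for a, b in zip(p, r))
    return math.sqrt(squared / len(pose))

def best_pose(scored_poses):
    """Return the pose with the highest predicted score."""
    return max(scored_poses, key=lambda item: item[1])[0]

# Hypothetical three-atom ligand, coordinates in angstroms.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
pose_a = [(0.5, 0.0, 0.0), (2.0, 0.0, 0.0), (2.0, 1.5, 0.0)]  # shifted 0.5 A in x
pose_b = [(0.0, 0.0, 3.0), (1.5, 0.0, 3.0), (1.5, 1.5, 3.0)]  # shifted 3.0 A in z

# Scores stand in for the output of any ML scoring function.
top = best_pose([(pose_a, 0.9), (pose_b, 0.4)])
CUTOFF = 2.0  # assumed success threshold in angstroms
print("top pose within cutoff:", rmsd(top, reference) <= CUTOFF)
```

A scoring function is doing its job when the pose it ranks first is also the one closest to the reference.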