
Structure-based drug repurposing: Traditional and advanced AI/ML-aided methods.

Chinmayee Choudhury¹, N Arul Murugan², U Deva Priyakumar³.

Abstract

The current global health emergency in the form of the coronavirus disease 2019 (COVID-19) pandemic has highlighted the need for fast, accurate, and efficient drug discovery pipelines. Traditional drug discovery projects relying on in vitro high-throughput screening (HTS) involve large investments and sophisticated experimental set-ups, affordable only to big biopharmaceutical companies. In this scenario, application of efficient state-of-the-art computational methods and modern artificial intelligence (AI)-based algorithms for rapid screening of repurposable chemical space [approved drugs and natural products (NPs) with proven pharmacokinetic profiles] to identify the initial leads is a powerful option to save resources and time. Structure-based drug repurposing is a popular in silico repurposing approach. In this review, we discuss traditional and modern AI-based computational methods and tools applied at various stages of structure-based drug discovery (SBDD) pipelines. Additionally, we highlight the role of generative models in generating molecules with scaffolds from repurposable chemical space.

Entities: Chemical

Keywords: Drug repurposing; Force field; Generative modeling; Inverse design; Machine learning; Quantum mechanics


Year: 2022 PMID： 35301148 PMCID： PMC8920090 DOI： 10.1016/j.drudis.2022.03.006

Source DB: PubMed Journal: Drug Discov Today ISSN： 1359-6446 Impact factor: 8.369

Introduction

Identifying small molecules that can lead to an alteration in biochemical mechanisms via interactions with specific biological targets has been the key aspect of modern rational drug discovery (DD). This idea revolutionized the DD pipeline, resulting in extensive development of combinatorial chemistry and HTS over the past few decades. However, these techniques involve very high costs and long assay development and standardization times, which are not affordable for all. In this scenario, a shift from traditional ways of synthesizing and screening huge chemical libraries to the concept of drug repositioning/repurposing/reprofiling (DR), in which drugs with known indications are repurposed for new indications, is a safe and cost-effective alternative. This rapid drug development strategy involves evaluation of new disease pathways, identifying new targets and studying their structures, functions, and dynamics to rationally reposition suitable molecules from the known chemical space, rather than random screening.1, 2, 3 In silico DR has attracted the attention of the pharmaceutical industries and research communities worldwide during the current COVID-19 pandemic because advanced computational algorithms can predict 3D structures of targets, detect binding pockets/interaction hotspots of new drug targets, and screen the known drug candidates against new target structures, dramatically reducing the time and cost required for DR. There are different computational DR strategies. For example, computational DR approaches that have been applied to the COVID-19 pandemic can be broadly categorized into: (i) drug/target network-based models; (ii) structure-based approaches; and (iii) AI approaches. Network-based approaches are divided into two categories: network-based clustering approaches and network-based propagation approaches. 
Both network-based approaches enable the annotation of important patterns, the identification of proteins that are functionally associated with COVID-19, and the discovery of novel drug–disease or drug–target relationships useful for new therapies. Structure-based approaches enable the identification of small chemical compounds able to bind macromolecular targets to evaluate how a chemical compound can interact with its biological counterpart, to find new applications for existing drugs. AI-based networks currently appear less relevant because they need more data for their application. Rapidly emerging high-precision in silico techniques/algorithms and consistently increasing computational access to huge amounts of data regarding clinical research, pathways involved in diseases, gene expression profiles, drug target structures, pharmacophores, and so on, have supported the use of computational approaches to envisage new indications/placements for old drugs.2, 4 In silico DR pipelines involve a variety of approaches, such as genomics, systems biology, network biology, chemo/bioinformatics, and structural bioinformatics-based approaches to identify optimal ‘new target–known drug’ pairs. Among these in silico methods, structure-based drug repurposing (SBDR) is important in its own right, given that the 3D structure of the target is a prerequisite to screen the repurposable chemical space (RCS) and explore suitable ligand interactions with the target binding site through techniques including docking, pharmacophore modeling, and molecular dynamics (MD) simulations. 
Along with the approved drugs, the RCS can include all the molecules that have passed preclinical in vitro/in vivo stages and have entered the clinical phase, as well as compounds from various NP databases, such as Ayurveda, IMPPAT, Berdy’s Bioactive NP Database, Carotenoids Database, Chinese Traditional Medicinal Herbs database, FooDB, and TCMDB@Taiwan, the absorption, distribution, metabolism, excretion, and toxicity (ADMET) profiles of which are well established. Table 1 lists data sources of the RCS, drug targets, pathways, and drug–target complexes. Although traditionally SBDD mostly involves docking-based virtual screening (VS), computationally intensive methods, such as MD simulations to include flexibilities of the targets, binding free energy calculations, and quantum mechanical (QM) calculations, can also be applied for accurate predictions when a considerably smaller chemical library, such as only approved drugs, is considered for a DR project. In addition, the rapidly emerging AI–machine learning (ML) methods have essential roles in overcoming the limitations of traditional methods and can improve prediction accuracy. In this review, we discuss traditional and modern AI-based computational methods and tools applied at various stages of SBDR pipelines. Advanced ML techniques, such as generative modeling, which can be indirectly applied for SBDR, are also discussed. We also highlight recent successful applications of computational techniques for SBDR.

Table 1

Data sources for repurposable chemical space, targets, pathways, and drug–target complexes.

Database	URL	Content
Data sources for repurposable chemicals
DrugBank	https://go.drugbank.com/	Detailed chemical, pharmacological, and pharmaceutical data of drugs and sequence, structure, and pathway information of drug targets
TCM	http://tcm.cmu.edu.tw/	170 000 traditional Chinese medicine compounds, which passed ADMET filters with 3D structures
e-Drug3D	https://chemoinfo.ipmc.cnrs.fr/MOLDB/index.php	1822 compounds (maximum molecular weight: 2000), similar to the US Pharmacopeia of Small Drugs
SuperDRUG2	http://cheminfo.charite.de/superdrug2/	∼ 4600 active pharmaceutical ingredients
DNP	http://dnp.chemnetbase.com/faces/chemical/ChemicalSearch.xhtml	The Natural Products subset of Dictionary of Organic Compounds
KEGG DRUG	www.genome.jp/kegg/drug/	Drugs approved to be marketed in Europe, USA, and Japan, with information of their targets and other molecular interaction networks
Data sources to explore new targets/pathways/indications for the RCS
Therapeutic Target Database (TTD)	http://bidd.nus.edu.sg/group/cjttd/	Studied and reported protein, RNA/DNA drug targets as well as pathways involved in targeted disease
STITCH	http://stitch.embl.de/	Known and predicted interactions of chemicals and proteins
Small Molecule Pathway Database (SMPDB)	https://smpdb.ca/	Information on ∼ 350 human small-molecule pathways
Transformer	https://bioinformatics.charite.de/transformer/	Data on enzymatic/nonenzymatic transformation of various xenobiotics in humans; interactions and process of transport of drugs, prodrugs, traditional Chinese medicines etc.
Human Metabolome Database	https://hmdb.ca/	Small-molecule metabolites in the human body
KEGG PATHWAY Database	www.genome.jp/kegg/pathway.html	Detailed information on targets, molecular interaction networks, and enzymes involved in metabolism of known drugs with references to several relevant databases and web-based tools
Data sources to train and test ML models for binding affinity prediction
Protein Data Bank (PDB)	www.rcsb.org/	Experimental structures of biomacromolecules, such as proteins/nucleic acids, ribosomes etc.
PDBbind	www.pdbbind.org.cn/	Experimentally measured IC₅₀, K_d, K_i, and other binding affinity data of the PDB protein–ligand complexes
BindingDB	www.bindingdb.org/bind/index.jsp	Measured binding affinities of small, drug-like molecules and drugs with known drug targets
SCORPIO	http://scorpio.biophysics.ismb.lon.ac.uk/scorpio.html	Structurally resolved and thermodynamically characterised protein–ligand complexes
Ki Database	https://kidbdev.med.unc.edu/databases/kidb.php	Published and internally derived 55 472 Ki, or affinity values for a large number of drugs and drug candidates with GPCRs, ion channels, transporters, and enzymes
BAPPL complexes set	www.scfbio-iitd.res.in/software/drugdesign/proteinliganddataset.htm	161 protein–ligand complexes with experimental and predicted free energies of binding
DNA Drug complex data set	www.scfbio-iitd.res.in/software/drugdesign/dnadrugdataset.jsp	DNA–drug complexes comprising 16 minimized crystal structures and 34 model-built structures, along with experimental affinities
DUD.E	http://dude.docking.org/	Provides decoy molecules for testing docking and ML models; affinities of 22 886 active compounds against 102 different targets; includes 50 decoy molecules for each active molecule with similar physicochemical properties but dissimilar 2D topologies


SBDR and AI/ML techniques in modern drug discovery

The fundamentals of SBDR are based on the abilities of the drug to bind to multiple protein-binding sites. Apart from their original therapeutic targets, the drugs show affinities for other proteins, so-called ‘off-targets’. These off-targets can be carrier proteins, transporters, plasma proteins, among others, to which the drugs bind to cause side effects, which are not always detrimental and open ways to explore new indications for the drugs. One of the earliest examples of such an off-target-based approach was repositioning of sildenafil, which was originally used to treat angina; observation of sildenafil interacting with a phosphodiesterase (PDE5) resulted in this drug being repurposed for the treatment of erectile dysfunction. SBDR methods depend on the availability of the receptor protein and ligand structures. These methods mostly comprise high-throughput VS of the RCS using molecular docking and/or pharmacophore models.7, 8 The past few years have witnessed a rapid increase in the area of data-driven ML applications in general, which are becoming a vital tool during early drug discovery efforts. Multiple factors, such as rapidly accumulating relevant experimental data (e.g., DrugBank, ChEMBL, PDB, PubChem, and PDBbind), development of modern ML methods, libraries, and affordable computational power, are fueling such a surge. ML algorithms have relevant and potential applications at almost all steps of the SBDR pipeline and beyond, such as drug screening, target screening, target structure/binding site prediction, lead optimization, prediction of drug–drug interactions, and ADMET property prediction. ML methods aim to learn from existing data and predict properties instead of using physics-based understanding to explicitly compute properties. These methods can broadly be classified as supervised learning, unsupervised learning, and reinforcement learning. 
In supervised learning, markers or labels of new samples are predicted through ML models that are trained from samples with known markers. Unsupervised learning, in which the training samples without any labels are used to develop a model, is used to recognize complex patterns and to transform data to a lower dimension in general. Reinforcement learning attempts to perform reward-driven learning, in which an agent attempts to find an ideal set of actions to endorse some outcome through analysis of the environment combined with performing actions to alter that environment. Fig. 1 shows various categories of ML tasks and algorithms that are commonly used in drug design exercises. Naive Bayesian (NB), support vector machine (SVM), decision trees, random forest (RF), and artificial neural networks (ANNs) are the most popular classical ML algorithms, whereas deep Boltzmann machine (DBM), deep belief networks (DBNs), generative adversarial networks (GANs), variational autoencoders (VAEs), and adversarial autoencoders (AAEs) are some of the modern ML methods for discriminative, regression, clustering, regularization, dimensionality reduction, and generative tasks.

Figure 1

(a) Classification of machine-learning (ML) tasks based on principle of learning; (b) different types of ML algorithm. For definitions of abbreviations, please see the main text.

Despite issues with their use in other research areas, AI/ML methods have been used continuously in drug design efforts over the past 25 years or so. Earlier applications in drug design activities were dominated by classical ML methods. NB algorithms, which are supervised learning methods, have been successful in processing massive amounts of information and in predictive modeling, while having a unique tolerance of data noise. For example, NB models in combination with extended-connectivity fingerprints (ECFPs) were used by Pang et al. to classify active and inactive molecules and predict their biological activity as estrogen receptor antagonists. Similarly, Wei et al. developed multiple quantitative structure–activity relationship (QSAR) models using NB as a classifier in combination with SVM to identify HIV and hepatitis C inhibitors. RF comprises an ensemble of multiple uncorrelated decision trees, where, for a given task, each tree independently makes one prediction and the prediction with the most votes is selected as the final output. Training several decision trees reduces the impact of individual errors because the final prediction aggregates several independent predictions. Cano et al. applied RF methods to predict protein–ligand binding affinities in a VS project, in which they trained the algorithm with a data set comprising kinases, nuclear hormone receptors, and their ligands. Rahman et al. predicted the drug response confidence level for a particular genome by using multivariate RF, in which the input data were genetic and epigenetic attributes. 
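The voting step that RF relies on can be illustrated with a minimal sketch. The three "trees" below are hypothetical single-threshold rules on made-up descriptor values, not trained models, and the labels "active" and "inactive" are placeholders chosen for illustration.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    label, _ = counts.most_common(1)[0]
    return label


def make_stump(feature_index, threshold):
    """Build a hypothetical one-split 'tree' (a decision stump)."""
    def stump(features):
        return "active" if features[feature_index] > threshold else "inactive"
    return stump


# Hypothetical forest: each stump looks at a different descriptor.
forest = [make_stump(0, 0.5), make_stump(1, 0.3), make_stump(2, 0.7)]


def forest_predict(features):
    """Collect one prediction per tree, then take the majority vote."""
    return majority_vote([tree(features) for tree in forest])


prediction = forest_predict([0.9, 0.1, 0.8])
```

In a real RF each tree would be trained on a bootstrap sample with a random feature subset; only the aggregation step is shown here.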
SVMs are popular in computer-aided drug design (CADD) on account of their ability to differentiate between actives and inactives through binary class prediction or to train regression models that predict activities and rank compounds. SVMs are trained to separate input data that are not linearly separable in the original low-dimensional space by mapping them into a higher-dimensional feature space. SVM models that are specifically designed to predict drug–receptor interactions take into account protein-binding site as well as protein–ligand interaction features as important components for predictive modeling. Wang et al. developed and trained SVM models with diverse features, such as chemical structural features, pharmacological or therapeutic effects, and genomics data of the proteins, to predict drug–target interactions. Kawaii et al. used SVM models in which the drug molecules were allowed to match with numerous targets from different pathways to predict their bioactivities against multiple pathways. ANNs, by analogy with nerve cells or neurons, receive multiple input signals, calculate the weighted sum of the inputs, pass it through a nonlinear activation function, and produce an activation response. The connected neurons in the next layer then receive the output signals passed on from preceding neurons. A typical ANN comprises three components: (i) an input layer; (ii) a hidden layer; and (iii) an output layer. The middle hidden layer comprises fully or partially connected processing nodes, which receive the input variables from the input nodes, transform them, and pass them to the output nodes, which ultimately compute the output signal. ANN algorithms are iteratively trained via back propagation. The performance of ANN methods might be inferior to that of RF and SVM, especially when the data set is small, because of problems such as overfitting. 
However, with availability of big data, ANNs have re-emerged as deep learning (DL) algorithms, which are based on the feed-forward NNs of ANN with several hidden layers. These hidden layers account for the learning abilities of the computational models from multidimensional data. DL algorithms are at the development front-line in most scientific and technological fields. DL-based methods have brought about a paradigm shift in the field of CADD, from QSAR, target identification, VS to lead molecule design and optimization because they are able to recognize, interpret as well as generate complex data. Deep NNs (DNNs), recurrent NNs (RNNs), and convolutional NNs (CNNs) are the major NNs that are used in DD projects. These can be used for both prediction of molecular properties and generating molecular structures with requisite properties.
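The forward pass of a small ANN with one hidden layer (weighted sum followed by a nonlinear activation) can be sketched as follows. The weights, biases, input values, and the choice of a sigmoid activation are assumptions made purely for illustration; a trained network would obtain its weights via back propagation.

```python
import math


def sigmoid(x):
    """Nonlinear activation squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense_layer(inputs, weights, biases):
    """Each neuron computes a weighted sum of its inputs plus a bias,
    then applies the activation function."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        weighted_sum = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(sigmoid(weighted_sum))
    return outputs


def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> hidden layer -> output layer."""
    hidden = dense_layer(inputs, hidden_w, hidden_b)
    return dense_layer(hidden, output_w, output_b)


# Hypothetical parameters: 3 inputs, 2 hidden neurons, 1 output neuron.
hidden_w = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
hidden_b = [0.0, 0.1]
output_w = [[1.0, -1.0]]
output_b = [0.0]

score = forward([1.0, 0.0, 1.0], hidden_w, hidden_b, output_w, output_b)[0]
```

Stacking more calls to `dense_layer` between input and output gives the multi-hidden-layer arrangement that DL methods build on.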

Traditional and AI/ML-aided methods at different stages of SBDR pipelines

SBDR methods depend on the availability of receptor protein and ligand structures. Fig. 2 provides examples of approaches used in SBDR projects. The first step of most SBDR pipelines is to obtain high-quality 3D structures of the new targets. If a structure is not solved experimentally, one can model it computationally. Once a good-quality target structure is available, identifying and characterizing the ligand-binding sites in the receptor is the next step so that the RCS can be screened against them. This is followed by high-throughput VS of the RCS using molecular docking and/or pharmacophore models, to obtain initial repurposing candidates. These candidates are further ranked, screened, or optimized using computationally intensive MD simulations, MM-GB(PB)SA, and QM-based binding energy estimations. Once an NP or an existing drug has been found to have significant affinity for a given target, it can be used as a lead for further development to improve the binding affinity. In other words, by preserving the overall structural skeleton/scaffold of the molecule, one can attempt to change the functional groups around the structure until the desired property is achieved. Here, we highlight how classical and modern ML methods along with traditional computational methods are used at all the above-discussed stages of SBDR and the rapidly evolving generative models for generating small molecules containing privileged scaffolds (from NPs or existing drugs).

Figure 2

Possible strategies for structure-based drug repurposing (SBDR) for screening of molecules from repurposable chemical space.

Target structure prediction

The first step in SBDR is identification of the relevant target(s) of interest and the availability of their 3D structure. VS of RCS using structure-based methods (traditional or ML) requires that the 3D structure of the target is available. Experimental 3D structures from X-ray crystallography, NMR spectroscopy, or cryo-electron microscopy can be obtained from the Protein Data Bank (PDB), which contains more than 150 000 biomolecular structures. Given that there is a large gap between the number of potential targets and the number of available experimental 3D structures, there is a tremendous interest in developing computational methods that can predict protein structures reliably. In silico methods, such as threading, ab initio techniques, and homology modeling have essential roles in predicting the structure of the desired targets. Homology modeling is the most popular structure prediction method, in which the structure of the target protein is modeled based on the experimental structure of a homologous template protein. In the absence of a homologous template structure, the fold recognition or threading method is used, in which each residue of the target is aligned to a position in the template and a template is selected based on the best alignment. If a target sequence does not have a suitable template either through homology or threading, the structures are modeled from scratch by optimizing the enthalpic and entropic parameters to generate the thermodynamically most-stable 3D conformation of the target protein. I-TASSER is a widely used structure prediction tool, which uses a combination of ab initio modeling, threading, and atomistic energy refinement to generate the 3D structure of a protein from its sequence. Although comparative modeling, ab initio modeling, and threading methods have had successes, they have major limitations. 
Over the past few years, ML methods have been helping to push the accuracy of protein structure prediction from sequence toward experimental accuracy. ML methods are capable of learning the relationship between primary sequences of proteins and known 3D structures, to develop predictive models. In CASP13, a DL-based ab initio protein structure prediction method named AlphaFold showed the best performance. AlphaFold comprises a core distance map predictor, which is implemented as a deep residual NN with 220 residual blocks that process a fixed-dimensional representation of input features calculated from two 64-amino acid fragments. The NN predictions include backbone torsion angles and pairwise distances between residues. Each residual block has three layers, including a dilated convolutional layer, and the blocks cycle through dilation values of 1, 2, 4, and 8. The DL model has 21 million parameters and uses 1D and 2D input features, their combinations, and the evolutionary/coevolutionary profiles of a training set of ∼ 29 000 proteins curated from various sources. Along with a distance map, AlphaFold predicts the φ and ψ angles to generate an initial predicted structure. Recently, AlphaFold 2.0 was presented at CASP14 and outperformed all the methods known so far, to the extent that the authors claim this to be the ‘solution to a 50-year-old grand challenge in biology’. The recently developed DL-based RoseTTAFold tool has shown promise in fast, accurate protein structure and interaction predictions using a three-track network incorporating sequence (1D), distance map (2D), and spatial position (3D) information.

Binding site prediction

The logic that proteins with similar structures might have affinities for similar ligands and seem to be involved in similar functions forms the basis of SBDR. Studies reported that similar ligands could bind to multiple targets with similar local binding sites despite the low global sequence similarity, demonstrating the importance of binding site/binding pocket detection and comparison in DR. Binding sites for ligands are mostly concave surfaces characterized by specific amino acid residues in a specific geometric orientation suitable for molecular recognition and molecular function of the protein. Conventional pocket detection algorithms can be broadly classified as sequence-based, geometry-based, and energy-based methods. Geometry-based methods were the first binding site prediction methods, and use 3D structural information to explore the pockets/clefts/cavities on the protein surface. These methods are efficient but do not consider the flexibilities of the protein surface. Surfnet, proposed by Laskowski, Fpocket algorithm, LIGSITEcsc, and PASS are examples of geometry-based methods. Energy-based methods predict the most suitable binding site on the protein surface based on estimation of interaction energies of flexible probe molecules throughout the surface. One of the first methods was developed by Goodford, who calculated H-bond, electrostatic, and van der Waals components of interaction energies for different grid points on the protein surface and predicted the binding sites according to these interaction energies. Q-SiteFinder and PocketFinder are examples of energy-based methods. COACH is a combination of FINDSITE and ConCavity, which performed better than either method alone. FunFOLD, CHED, and HemeBIND also generate prediction models using a combination of different methods. 
Recently, ML-based methods, such as DeepSite, DeeplyTough, DeepDrug3D, and BionoiNet, were shown to be extremely efficient, achieving experimental accuracy for the prediction of binding sites.
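The idea behind geometry-based pocket detection can be illustrated with a deliberately crude buriedness measure: probe points surrounded by many protein atoms are more likely to lie in a cleft than points on an exposed surface. This sketch is an assumption-laden simplification and does not reproduce the algorithm of Surfnet, Fpocket, LIGSITEcsc, PASS, or any other tool named above; the 8 Å radius and the coordinates are hypothetical.

```python
import math


def buriedness(point, atoms, radius=8.0):
    """Count protein atoms within `radius` (in Å) of a probe point."""
    return sum(1 for atom in atoms if math.dist(point, atom) <= radius)


def rank_probe_points(points, atoms, radius=8.0):
    """Order probe points from most to least buried."""
    return sorted(points, key=lambda p: buriedness(p, atoms, radius), reverse=True)


# Hypothetical atom coordinates (Å): three clustered atoms and one far away.
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (20.0, 20.0, 20.0)]
probes = [(20.0, 20.0, 20.0), (0.0, 0.0, 0.0)]

most_buried = rank_probe_points(probes, atoms, radius=2.0)[0]
```

A practical implementation would also discard probe points that overlap protein atoms and would cluster neighbouring buried points into candidate pockets.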

RCS screening and lead optimization

Structure-based VS represents a highly efficient methodology for repositioning of known drug molecules to bind to potential new targets. Structure-based VS is mostly molecular docking based. Docking finds the suitable binding poses of molecules in the target binding site using a scoring function and the best-scored compounds from a large chemical library for a biomolecular target are further ranked based on the protein–ligand interactions. The RCS constitutes various classes of privileged structure with proven bioavailability and compatibility, reducing the probability of the best hits obtained via VS failing downstream in vitro/in vivo or ADMET tests. Molecular docking can be a single-target approach, in which only interactions between the known drugs and an individual target are identified, or it can be an ‘inverse docking’ approach, in which binding interactions of a molecule with multiple known targets are explored5, to estimate its target selectivity. The molecular docking method typically comprises three key steps: modeling and predocking preparation of target and ligand structures; generation and sampling of the ligand conformers in the binding pocket of the receptor; and evaluation of the docking score reflecting the binding energy of the ligand–target complexes. To address the issue of ligand flexibility, several methods are commonly used, with stochastic methods being popular. Monte Carlo (MC) and/or genetic algorithms (GA) are two such examples. The MC algorithm stochastically alters a single parameter each time to produce new conformations that are allowed or disallowed based on Boltzmann distributions. A sufficiently high temperature is assigned at the start of modeling to ensure a high chance of the next sampled conformation being accepted. Then, the temperature is gradually lowered during docking, during which a low-energy protein–ligand complex is captured as a result of the lower conformational flexibility. 
Conversely, GA adopts a methodology inspired by Darwin’s evolution theory, which is initialized by an arbitrary population of conformations modeled as a set of chromosomes that can randomly crossover and mutate to produce a new set of conformations. The compound conformations with the lowest binding energies with the target are considered the ‘fittest’ and are accepted as start points to yield a new generation. This sequence is iteratively repeated until the target–ligand complex reaches a local energy minimum. There are three broad classes of traditional scoring function: (i) empirical; (ii) knowledge based; and (iii) force-field based. In the first class, different types of polar and nonpolar intermolecular interactions are extracted from a training set comprising the reported experimental structures, and each type of interaction is represented by a parameter with a certain weight. The coefficients of these parameters are optimized through multiple linear regression models, using the reported binding affinity values of the training set molecules as the dependent variable. Force-field-based scoring functions compute the potential energy of the entire ligand–target complex by adding up contributions from van der Waals and electrostatic interaction energies between the atoms of the ligand and those of the receptor. In knowledge-based scoring, the reported receptor–ligand complexes are analyzed to obtain structural information, which is further used to develop atomic interaction potentials that describe the interactions between the ligand and receptor atoms. Fig. 3 depicts the popular computational tools/software available for tasks at different stages of SBDR.
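The MC sampling scheme described in this section (alter one parameter at a time, accept or reject by a Boltzmann criterion, and lower the temperature gradually) can be sketched as below. This is an illustrative toy, not the search routine of any docking program: the energy function is a made-up quadratic, the temperature is in reduced units with the Boltzmann constant set to 1, and all parameter values are assumptions.

```python
import math
import random


def metropolis_accept(delta_energy, temperature, rng):
    """Always accept downhill moves; accept uphill moves with
    probability exp(-dE / T) (reduced units, k_B = 1)."""
    if delta_energy <= 0:
        return True
    return rng.random() < math.exp(-delta_energy / temperature)


def anneal(energy, start, step_size=0.5, t_start=5.0, t_end=0.01,
           cooling=0.95, moves_per_temperature=50, seed=0):
    """Simulated annealing over a list of pose parameters."""
    rng = random.Random(seed)
    current = list(start)
    current_e = energy(current)
    best, best_e = list(current), current_e
    temperature = t_start
    while temperature > t_end:
        for _ in range(moves_per_temperature):
            trial = list(current)
            index = rng.randrange(len(trial))  # alter a single parameter
            trial[index] += rng.uniform(-step_size, step_size)
            trial_e = energy(trial)
            if metropolis_accept(trial_e - current_e, temperature, rng):
                current, current_e = trial, trial_e
                if current_e < best_e:
                    best, best_e = list(current), current_e
        temperature *= cooling  # gradually lower the temperature
    return best, best_e


def toy_energy(pose):
    """Hypothetical energy surface with its minimum at (1, 1)."""
    return sum((x - 1.0) ** 2 for x in pose)


best_pose, best_energy = anneal(toy_energy, [4.0, -3.0])
```

In a docking setting the pose parameters would be translations, rotations, and torsion angles, and `toy_energy` would be replaced by a scoring function.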

Figure 3

In silico tools for structure-based drug repurposing (SBDR).

Consistent efforts are being made to improve the performance of existing scoring functions by including additional terms for precise assessment of the ligands or entropy changes during receptor binding. Consensus scoring (i.e., using several scoring functions in parallel) has been developed for superior estimation of the binding affinity and to minimize false positive results. The computationally demanding, yet more accurate, QM techniques are being used to improve accuracies of the scoring functions, as discussed below.
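One simple way to combine several scoring functions is to average the rank each function assigns to a compound. The sketch below assumes that lower scores are better for every function; the compound names and score values are hypothetical, and rank averaging is only one of several possible consensus schemes.

```python
def ranks(scores):
    """Map each compound to its rank (1 = best, i.e. lowest score)."""
    ordered = sorted(scores, key=scores.get)
    return {name: position for position, name in enumerate(ordered, start=1)}


def consensus_rank(score_tables):
    """Order compounds by their mean rank across all scoring functions."""
    per_function = [ranks(table) for table in score_tables]
    compounds = per_function[0].keys()
    mean_rank = {
        compound: sum(r[compound] for r in per_function) / len(per_function)
        for compound in compounds
    }
    return sorted(mean_rank, key=mean_rank.get)


# Hypothetical scores from three scoring functions (lower = better).
empirical = {"drug_a": -9.1, "drug_b": -7.4, "drug_c": -8.0}
knowledge_based = {"drug_a": -55.0, "drug_b": -60.0, "drug_c": -58.0}
force_field = {"drug_a": -32.0, "drug_b": -20.0, "drug_c": -25.0}

ordering = consensus_rank([empirical, knowledge_based, force_field])
```

Working with ranks rather than raw scores sidesteps the fact that different scoring functions report values on different scales.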

Binding energy estimations using traditional computational methods

In force field-based MD simulations, the systems comprise atoms and ions and the electrons are not considered explicitly. MD simulations allow us to keep track of the positions and momenta of these fundamental particles as a function of time. The atoms located in different molecular centers interact with each other through van der Waals and electrostatic interactions. Usually, the former is described using the Lennard–Jones-like potential energy function, which has an attractive $-r_{ij}^{-6}$ and a repulsive $r_{ij}^{-12}$ dependence on the distance $r_{ij}$ between the atoms, whereas the latter has inverse distance dependence. The dynamics of the system can be followed by solving Newton’s equation of motion. The time step usually used is 1–2 fs for modeling the biological systems in ambient conditions (300 K and 1 atm pressure). Once trajectories of sufficient timescale are established, thermodynamic properties can be computed using the positions and momenta of all the particles. To study the kinetics of association and dissociation of protein–ligand complexes, one needs to carry out long timescale simulations, which are usually computationally demanding. However, this can be handled with the use of steered MD or simulations with enhanced sampling techniques along selected reaction coordinates. In some implementations, one has to define the egression (unbinding) pathway explicitly, whereas, in some recent implementations (such as random acceleration MD), setting the acceleration threshold for the ligand (to help the ligand to identify the pathway for release) alone is enough for the algorithm to find the egression pathway. In umbrella sampling simulations, the reaction coordinate for the dissociation is defined and the free energies for the unbinding are computed from the potential of mean force. These methods retain the advantages of traditional MD and provide free energy changes along the protein–ligand association or dissociation pathway. 
In certain targets, the residence time (RT) of the ligand within a target dictates the pharmacological activity rather than its binding affinity itself and, in these cases, enhanced sampling MD simulations can provide direct information about the RT, which is inversely proportional to $k_{\mathrm{off}}$. G-protein-coupled receptors (GPCRs), HIV protease, kinases, and translocator proteins (TSPOs) are targets for which RT is a key parameter for optimizing potent ligands. In the case of TSPO targets, the enhanced sampling MD simulations were able to explain the different $k_{\mathrm{off}}$ of a specific ligand compared with the remaining two compounds, even though all three ligands had comparable binding affinity. Its increased residence time has been attributed to the interaction of its naphthyl group with the LP1 loop along the egression pathway.
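The nonbonded terms introduced at the start of this subsection (a Lennard–Jones term with $r^{-12}$ and $r^{-6}$ dependence plus a Coulomb term with inverse distance dependence) can be evaluated for a pair of atom sets as sketched below. The single ε and σ shared by all atom pairs, the coordinates, and the partial charges are illustrative assumptions; real force fields assign type-specific parameters and combination rules. The Coulomb constant is given as approximately 332.06 kcal·Å/(mol·e²).

```python
import math

COULOMB_CONSTANT = 332.06  # approx. kcal·Å/(mol·e^2)


def lennard_jones(r, epsilon, sigma):
    """12-6 potential: repulsive r^-12 term minus attractive r^-6 term."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def coulomb(r, charge_i, charge_j):
    """Electrostatic energy with inverse distance dependence."""
    return COULOMB_CONSTANT * charge_i * charge_j / r


def nonbonded_energy(atoms_a, atoms_b, epsilon=0.1, sigma=3.4):
    """Sum pairwise van der Waals and electrostatic energies (kcal/mol).

    Each atom is a (position, partial_charge) tuple; epsilon and sigma
    are hypothetical values applied to every pair for simplicity.
    """
    total = 0.0
    for position_i, charge_i in atoms_a:
        for position_j, charge_j in atoms_b:
            r = math.dist(position_i, position_j)
            total += lennard_jones(r, epsilon, sigma)
            total += coulomb(r, charge_i, charge_j)
    return total


# Hypothetical ligand atom and two pocket atoms (coordinates in Å).
ligand = [((0.0, 0.0, 0.0), -0.5)]
pocket = [((3.8, 0.0, 0.0), 0.5), ((0.0, 4.0, 0.0), 0.0)]

interaction_energy = nonbonded_energy(ligand, pocket)
```

An MD engine evaluates terms of this kind, together with bonded terms, at every 1–2 fs time step to obtain the forces used to integrate Newton’s equation of motion.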

Binding free energy calculations using MM-GB(PB)SA

Molecular docking approaches have been in use for more than three decades, but their success rate in predicting lead compounds from a chemical library is low, which limits their application. Binding free energies and docking poses from molecular docking approaches have been found to be inaccurate in many cases. Nevertheless, they are the workhorses when compounds from large chemical libraries need to be screened. As the number of entries in certain chemical spaces is expected to grow exponentially, molecular docking approaches will continue to be needed. In addition, to obtain potential lead compounds, one can use these approaches for prescreening, with the most promising compounds then being rescored using a more reliable scoring function. This approach has been shown to be promising in ranking various protein–ligand complexes.53, 54 MM-GB(PB)SA-based binding free energies are the most widely used scoring functions for ranking protein–ligand complexes after those used in molecular docking approaches. In both MM-GBSA and MM-PBSA, the binding free energy is obtained as the sum of van der Waals and electrostatic interaction energies and polar and nonpolar solvation free energies. The two approaches differ only in the solvation free energies; the first two terms are the same. In the MM-GBSA approach, the polar contribution to the solvation free energy is obtained by solving the electrostatics of the complex in an aqueous solvent environment using the Generalized Born approach, whereas in MM-PBSA, it is obtained using the Poisson–Boltzmann equation. The nonpolar contribution to the solvation free energy in the MM-GBSA approach is obtained from the solvent-accessible surface area. The binding free energy in these approaches is generally obtained as the difference in the free energies of the end states.
In other words, the free energies are computed for the reactants (i.e., the protein and ligand in the unbound state in an aqueous solvent environment) and the product (the protein–ligand complex in an aqueous solvent environment); the free energy difference between these two states is referred to as the binding free energy. The binding free energies are computed in two different ways, referred to as 1A-MM-GB(PB)SA or 3A-MM-GB(PB)SA, depending upon whether they are computed using a trajectory of the complex alone or using trajectories of the subsystems (i.e., protein and ligand) and the complex. The former approach is computationally less demanding because a single MD simulation is carried out for the complex, and the free energies of the three systems (complex, protein, and ligand) are obtained by using the coordinates of the system of interest and stripping out the rest of the system coordinates. Another advantage of using a single trajectory for computing the binding free energies is that the change in internal energies associated with the complexation process is zero. Although it is expensive, one can compute the entropic contributions from a normal mode analysis. In most instances, the entropic contributions are not computed because it is assumed that they do not have a major role in estimating the relative binding free energy differences of different ligands. The MM-GBSA and MM-PBSA approaches do not explicitly treat the nonbonded interactions (hydrogen bonds in particular) between the solvent and the protein and ligands. In certain cases, in which the protein binding sites are occupied by ‘crystalline water’, these implicit models might not perform well, and contributions from such water molecules need to be added to the contributions obtained from implicit solvent models.
The binding free energies are generally reported as an average over various configurations from the MD simulations, so these approaches account for the conformational flexibility of proteins and ligands, which is one of their merits. These approaches have shown success in ranking various protein–ligand complexes and there are reports of them outperforming molecular docking-based ranking. For example, Rastelli et al. compared the performance of MM-GBSA and MM-PBSA with AutoDock in identifying active compounds from decoys against Plasmodium falciparum DHFR; the former two methods were able to rank the compounds in excellent agreement with experimental binding affinities. On the negative side, many benchmark studies showed larger fluctuations in binding free energies computed from longer timescale simulations. Instead, it was suggested that the binding free energies should be computed from many independent simulations of shorter timescales. In the case of avidin complexed with biotin analogs, it was shown that averaging the binding free energies over 5–50 independent MD simulations was needed to get an accuracy of 1 kJ/mol. Other studies also reported that longer timescale MD simulations were not beneficial and that timescales limited to 5 ns yielded better accuracy in binding free energies.
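The averaging scheme described in this section can be sketched as follows. The sketch assumes a single-trajectory setting in which each snapshot already provides the complex-minus-protein-minus-ligand differences of the four terms, with entropy omitted; all function names and numbers are hypothetical, and the values are in kJ/mol.

```python
from statistics import mean, stdev


def delta_g_frame(d_vdw, d_elec, d_polar_solv, d_nonpolar_solv):
    """Binding free energy estimate for one snapshot (entropic term omitted)."""
    return d_vdw + d_elec + d_polar_solv + d_nonpolar_solv


def run_average(frames):
    """Average the per-snapshot estimates of one MD simulation."""
    return mean([delta_g_frame(*frame) for frame in frames])


def ensemble_estimate(runs):
    """Mean and standard error over independent short simulations."""
    run_means = [run_average(frames) for frames in runs]
    return mean(run_means), stdev(run_means) / len(run_means) ** 0.5


# Hypothetical difference terms (vdW, electrostatic, polar, nonpolar) per snapshot,
# two snapshots for each of three independent runs.
runs = [
    [(-150.0, -60.0, 120.0, -15.0), (-148.0, -64.0, 123.0, -15.0)],
    [(-152.0, -58.0, 118.0, -16.0), (-151.0, -61.0, 121.0, -15.0)],
    [(-149.0, -62.0, 122.0, -14.0), (-150.0, -59.0, 119.0, -15.0)],
]

estimate, std_error = ensemble_estimate(runs)
print(f"binding free energy = {estimate:.1f} +/- {std_error:.1f} kJ/mol")
```

Reporting the standard error over independent runs, rather than over frames of one long trajectory, makes it straightforward to see how many simulations are needed before the estimate reaches a target precision.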

Binding free energy calculations from QM-based approaches

The binding free energies obtained using force-field approaches suffer from the use of fixed charges for the ligands in aqueous and protein environments. In reality, the electronic structure, atomic charges, and molecular dipole moments depend on the nature of the environment, and force-field methods do not account for such effects. To describe the electrostatics in solvent and protein environments, we need to use electronic structure theory-based approaches. However, these are computationally very demanding and memory intensive. The cost of electronic structure theory calculations scales as N³–N⁷, where N is the number of one-electron wavefunctions of the system; thus, the size of the system that can be handled is limited to 100–200 atoms. Here, we are interested in the interaction energies of protein–ligand complexes, which are many times larger than this. Thus, approximate methods were developed that facilitate the use of QM theory for large-scale systems, such as protein–ligand complexes: (i) QM cluster models; (ii) hybrid QM/MM models; (iii) QM fragmentation approaches; and (iv) fragment molecular orbitals. QM cluster models are based on the approximation that the binding site residues make the largest contributions to the protein–ligand binding free energies. One can obtain the model for the protein–ligand cluster by using a cut-off: the binding site residues within this distance from the center of mass of the ligand are included. It is essential to add suitable capping atoms where the peptide bonds are cut. Given that, in many cases, the structure of the binding site is stabilized by the rest of the residues in the protein, free optimization of the cluster can lead to changes in the binding mode/pose of the ligand within the binding site. Therefore, the terminal atoms of amino acids are fixed and partial optimizations are carried out to estimate the interaction energies.
The interaction energies are given as the difference between the energy of the cluster and the sum of the energies of the ligand and amino acids. Hybrid QM/MM models use an effective Hamiltonian to describe the interaction between the protein and ligand subsystems, which are described using molecular mechanics and QM, respectively. The polarization of the ligand by the environment is correctly captured by the model, but the effect resulting from back polarization (i.e., polarization of the protein environment by the ligand) is not accounted for. Since we are mainly interested in the energetics of the ligands, this approach is reliable and also computationally less demanding. The whole protein and solvent can be included in the MM region without any difficulty and their polarization effect on the ligands can be modeled correctly using this approach. However, this approximation has issues when there is significant charge transfer between the binding site residues or solvent and the ligand, or when the QM subsystem is covalently bonded to the MM region (as in irreversible inhibitors); both situations are well described by QM cluster models. The charge transfer effect can be accounted for by describing the whole system involved in the charge transfer as the QM system and the rest as the MM system. This requires the treatment of the bonded region connecting the QM and MM subsystems using the hydrogen capping method and, in certain cases, overpolarization of the QM region connected to the MM region by covalent bonds has to be screened using a damping function. The QM fragmentation scheme allows one to estimate the interaction energy of protein–ligand complexes using electronic structure theory. As the whole protein cannot be treated using QM theory, the protein is fragmented into individual amino acids and the contributions from each fragment to the interaction energy with the ligand are computed and added together to obtain the total interaction energy.
In other words, the total protein–ligand interaction is computed as the sum over the individual amino acid–ligand interactions. Usually, the bonds are cut along peptide bonds and capped with hydrogens or certain capping groups, such as acetyl or N-methyl amino groups. However, when such capping groups are used, their contributions to the total protein–ligand interaction energy should be removed at the end. Since each amino acid–ligand intermolecular complex is handled separately, the interaction energies can even be obtained using highly correlated methods, such as MP2 and coupled-cluster theory. In general, dispersion-corrected DFT or Minnesota functionals (namely M06-2X) can be adopted to best describe the interaction between the individual amino acid fragments and the ligand. In QM-based approaches, the binding free energies are approximated by binding enthalpies because the interaction energies are computed from the optimized structures of the protein–ligand complexes. With the use of dispersion-corrected DFT (B3LYP-D/6-31G*), the performance of a QM fragmentation scheme referred to as EE-GMFCC-CPCM was tested on biotin and biotin analogs bound to avidin; the correlation between the experimental and predicted binding affinities was ∼0.88. The study was based on protein–ligand configurations obtained from MD; by averaging over more configurations, the correlation was shown to improve.
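The bookkeeping behind such a fragmentation scheme can be sketched as below. This is a generic illustration of summing fragment–ligand terms and removing cap–ligand terms, not the EE-GMFCC-CPCM procedure itself; the residue labels, cap labels, and energies (kJ/mol) are hypothetical stand-ins for values that a QM program would supply.

```python
def pair_interaction_energy(e_complex, e_fragment, e_ligand):
    """Supermolecular interaction energy of one capped fragment with the ligand."""
    return e_complex - e_fragment - e_ligand


def total_interaction_energy(fragment_terms, cap_terms):
    """Sum fragment-ligand interactions, then remove the cap-ligand contributions."""
    return sum(fragment_terms.values()) - sum(cap_terms.values())


# Hypothetical fragment-ligand and cap-ligand interaction energies.
fragment_terms = {"ASP128": -45.2, "TRP120": -22.1, "SER45": -12.7}
cap_terms = {"cap_1": -1.5, "cap_2": -0.8}

print(f"total = {total_interaction_energy(fragment_terms, cap_terms):.1f} kJ/mol")
```

Because every fragment–ligand calculation is independent, the individual terms can be computed in parallel and at a higher level of theory than would be feasible for the whole complex.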

AI/ML-based scoring functions and binding affinity prediction

One of the major efforts in VS is to calculate binding affinities accurately. Whereas MD-based free energy methods can yield accurate values, they are slow; by contrast, scoring functions are fast but less accurate. ML methods are thought to have the potential to be fast and efficient while being significantly more accurate than traditional scoring functions.60, 61 In one of the first applications of ML in the context of drug repositioning, an SVM model was trained by coupling distinct docking-energy terms with the experimentally reported binding affinities of a training set of PDE inhibitors to identify direct inhibitors of Mycobacterium tuberculosis. Recently, the element-specific persistent homology (ESPH) method was used in association with a CNN by Wei and coworkers to develop TopologyNet, a multichannel topological NN in which topological features represent the biomacromolecular geometry while reducing the dimensionality of the complex 3D data. Gradient boosting decision tree (GBDT) regression was combined with the ESPH method to develop T-Bind. Here, element-specific topological fingerprints generated features represented as binned barcodes, which were fed to the models. KDEEP was devised by applying a 3D CNN to 3D voxel representations of both ligands and receptors. Ashtawy and Mahapatra established two new scoring functions, BgN-Score and BsN-Score, based on bagging and boosting ensembles of NN models, respectively, using features that were combinations of the terms from X-Score, AffiScore, GOLD, and RF-Score.
Later, Pande and coworkers proposed a scoring function known as PotentialNet based on a staged graph CNN (GCN), which encompassed steps such as covalent-only propagation, dual noncovalent–covalent propagation, and ligand-based graph gathering, using atom types, bonds, and interatomic distances as input descriptors; the authors emphasized that the whole data set, as well as the methods used for splitting the data, affects the relative performance of scoring functions. Twelve ML-based scoring functions were proposed and evaluated by Khamis and Gomaa on the PDBbind (v2013) core sets. They built RF, kNN, NN, and SVM models on an input feature set that initially comprised 108 terms from RF-Score, BALL, X-Score, and SLIDE, and used principal component analysis (PCA) to reduce this set to seven principal components. Li et al. developed the first XGBoost-based scoring function, XGB-Score, implementing GBDT for improved accuracy and speed. Su et al. also reported similar observations from their systematic study including six ML algorithms, namely Bayesian Ridge Regression (BRR), K-Nearest Neighbors (KNN), Decision Trees (DTs), Linear Support Vector Regression (L-SVR), Multilayer Perceptron (MLP), and RF. Yang et al. emphasized the importance of large, diverse, unbiased data sets for training AI/ML-based models; they found overperformance (Pearson R² = 0.73) of atomic CNN models trained on the PDBbind data set and recognized the property and topology biases in the DUD-E data set leading to artificially increased enrichment. Morrone et al. developed modular graph-based CNN models trained on structural data from protein–ligand complexes generated by molecular docking to predict activity and binding mode. The algorithm presents a dual-graph architecture with separate subnetworks for the receptor–ligand contact maps and the ligand bond connectivities.
Moro and coworkers used a combination of convolutional and fully connected NNs to develop a model to predict the performance of different common docking protocols from a protein structure and a small ligand molecule. Deep Docking is a new platform based on DL, which is able to dock billions of compounds with optimized speed and accuracy. This approach predicts the docking scores using deep QSAR models that learn from docking scores of a training set compound library. OnionNet is a DNN model to accurately predict the protein–ligand binding affinities based on rotation-free element pair-specific contacts between ligands and protein atoms. The efficiency of the model was assessed and compared with the contemporary scoring functions using the CASF-2013 benchmark and PDBbind database (v2016 core set). Sirimulla and colleagues established a DNN-based scoring function trained by 384 molecular descriptors, such as electrostatic interactions and H-bonds, calculated from the binding pockets of the PDBbind v2016 data set using BINANA software. Several other DL-based scoring functions have recently been developed to achieve speed and accuracy to predict target–receptor binding affinity, as discussed in recent reviews.75, 76, 77, 78
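Many of the scoring functions above start from simple geometric descriptors of the protein–ligand interface. The sketch below counts element-pair contacts within a distance cut-off, in the spirit of contact-based features; it is a simplified illustration and not the exact featurization of RF-Score, OnionNet, or any other published model. The atom representation, the function name, and the 6 Å cut-off are assumptions.

```python
import math
from collections import Counter


def contact_features(protein_atoms, ligand_atoms, cutoff=6.0):
    """Count protein-ligand atom pairs within `cutoff` (angstrom) per element pair.

    Atoms are (element, (x, y, z)) tuples. The counts depend only on interatomic
    distances, so they do not change when the complex is rotated or translated.
    """
    counts = Counter()
    for p_elem, p_xyz in protein_atoms:
        for l_elem, l_xyz in ligand_atoms:
            if math.dist(p_xyz, l_xyz) <= cutoff:
                counts[(p_elem, l_elem)] += 1
    return counts


# Hypothetical coordinates, for illustration only.
protein = [("C", (0.0, 0.0, 0.0)), ("N", (3.0, 0.0, 0.0)), ("O", (10.0, 0.0, 0.0))]
ligand = [("C", (0.0, 4.0, 0.0)), ("O", (3.0, 4.0, 0.0))]

features = contact_features(protein, ligand)
for pair, count in sorted(features.items()):
    print(pair, count)
```

A fixed-length vector built from such counts (one entry per element pair, optionally per distance shell) is the kind of input that RF, GBDT, or NN regressors can be trained on against experimental binding affinities.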

Generative modeling

Once an NP or an existing drug has been found to have significant affinity toward a given target, it can be taken as a lead for further development to improve its binding affinity. That is, while preserving the overall structural skeleton/scaffold of the molecule, one attempts to change the functional groups around the structure until the desired property is achieved. Over the last 2 to 3 years, modern DL-enabled generative modeling has been shown to be effective for such purposes. Molecular design typically involves the measurement or prediction of a given property of interest for guess molecules using experiments or computational methods. This is followed by understanding of the structure–property relationship; upon multiple iterations between the two steps, molecules with desired properties are obtained. Traditionally, therefore, one goes from the chemical space to the property space. Generative models, however, allow us to go from the property space to the chemical space: these methods are capable of generating molecular structures with the desired physicochemical and other pharmacodynamic/pharmacokinetic properties. The two major tasks of a generative model are to propose valid chemical structures and to condition the generation toward certain biases. Four main methods have been successful in this respect: (i) RNNs; (ii) Reinforcement Learning (RL); (iii) GANs; and (iv) VAEs. In the context of molecular design in the DD process, the chemical space is essentially infinite and, hence, such generative modeling approaches are useful for exploring this space to identify molecules that exhibit the desired properties. For optimization in the context of improving the binding affinity or other pharmacokinetic properties of NPs or existing drugs, generative models can be conditioned with multiple objectives, such as the presence of a given scaffold and exhibition of desired properties.

Recurrent neural networks

RNN-based models are considered powerful generative models in the natural language-processing domain. These models are trained on the string representation of molecules, such as simplified molecular input line entry systems (SMILES), and learn the semantics of the representation,80, 81, 82, 83 helping to generate new molecules without explicitly defining the rules for molecule design.
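Before an RNN can be trained on SMILES strings, each string has to be split into tokens and mapped to integer indices. The sketch below shows one simple way to do this; the regular expression covers only bracket atoms, the two-letter halogens Cl and Br, and two-digit ring closures as multi-character tokens, and is an assumption of this example rather than a complete SMILES grammar.

```python
import re

# Multi-character tokens are listed first so that they win over single characters.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into tokens suitable for a sequence model."""
    return SMILES_TOKEN.findall(smiles)


def build_vocab(smiles_list):
    """Map every token seen in the training strings to an integer index."""
    tokens = sorted({tok for s in smiles_list for tok in tokenize(s)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def encode(smiles, vocab):
    """Convert a SMILES string into the index sequence fed to the network."""
    return [vocab[tok] for tok in tokenize(smiles)]


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
training_set = [aspirin, "ClCBr", "[NH4+]"]
vocab = build_vocab(training_set)
print(tokenize("ClCBr"))
print(encode(aspirin, vocab))
```

A generative RNN is then trained to predict the next token index given the preceding ones, and new molecules are obtained by sampling tokens one at a time and joining them back into a string.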

Variational autoencoders

DL models based on VAEs comprise an encoder and a decoder. Generally, molecules are mapped to a latent space using an encoder, and a decoder is used to map latent vector representation back to the molecule.84, 85, 86 The latent space is often combined with optimization techniques to generate new molecules with the desired properties.

Generative adversarial networks

GANs comprise two ML models, the generator and the discriminator, which are trained simultaneously to compete with each other. The generator generates a molecule and the discriminator classifies whether that molecule belongs to the data set or is synthetic.87, 88 Once trained, the generator is used to sample new molecules from the learned distribution.

Reinforcement learning

RL methods aid generative models with the objective of maximizing the reward of the generated molecules. RL techniques have been combined with SMILES-based models to generate new molecules, but the resulting molecules have low chemical validity.90, 91, 92, 93 To overcome this problem, a graph convolutional policy network (GCPN) was proposed, which achieves 100% validity of the generated molecules. Fig. 4 shows a schematic of different generative models using different modern ML methods.

Figure 4

Schematics of simple generative models using different modern machine-learning (ML) methods; (a) recurrent neural network (RNN); (b) variational auto encoder (VAE); (c) generative adversarial network (GAN); and (d) reinforcement learning (RL).

Recent examples of SBDR

Drug repurposing was considered the most efficient route to develop therapeutics for COVID-19-like virus-associated infections. A review article published in 2019 showed that from 2012 to 2017, 172 drugs were repurposed, with 70% in different stages of clinical development. Aspirin, bevacizumab, canakinumab, difluprednate, dimethyl fumarate, sildenafil, bupropion, and thalidomide are some of the drugs from repurposable chemical space that have since been approved for treating different diseases.89, 90 A bibliometric review of drug repurposing showed that > 60% of the 35,000 drugs or drug candidates have been tested against more than one disease, whereas 189 chemicals have been tested against > 300 diseases. Drugs such as prednisolone, dexamethasone, prednisone, and methylprednisolone have been repurposed for treating > 1000 diseases. Such promising results have also attracted researchers working toward the development of therapeutics for various virus-associated infections, such as Ebola virus, Middle East respiratory syndrome-coronavirus (MERS-CoV), and severe acute respiratory syndrome-coronavirus 1 (SARS-CoV-1), over the past decade. During the recent emergence of SARS-CoV-2-associated COVID-19, drug repurposing based on computational approaches has been used to identify potential drug compounds. The chemical library of approved antipolymerase drugs, the DrugBank database,52, 93, 94 and chemical libraries of natural products were used. 3CLpro, PLpro, envelope (E) protein, spike protein, RNA-dependent RNA polymerase (RdRp), and methyltransferase proteins were considered as potential targets from the virus, whereas, in humans, proteins that mediate the interaction with the viral spike protein, such as ACE-2, TMPRSS2, and Cathepsin-L, were also considered potential targets. For example, Yadav et al. recently performed docking and MD simulations to explore the repurposing of two approved bile salts, chenodeoxycholate and ursodeoxycholate, to bind to the SARS-CoV-2 envelope protein (Fig. 5). A sequential approach involving molecular docking and binding free energy calculations using MM-GBSA was used to repurpose compounds from the DrugBank database for COVID-19 therapeutics. Fig. 6 shows the binding mode of lead compounds from the DrugBank database within the four viral targets.

Figure 5

Molecular dynamic (MD) simulation studies reveal a high influx of water molecules into the transmembrane channel of the severe acute respiratory syndrome-coronavirus 2 (SARS-CoV-2) envelope protein (a) when bound to the approved drug chenodeoxycholate (b), which is a natural bile salt.

Figure 6

Binding mode of lead compounds from the DrugBank database within the four viral targets from severe acute respiratory syndrome-coronavirus 2 (SARS-CoV-2): (a) 3CLPro; (b) PLPro; (c) RdRp; and (d) Spike protein.

Concluding remarks and prospects

Fully exploring the chemical space with currently available experimental and computational approaches is not possible. The upper limit for the number of entries in chemical space is reported to be 10¹⁸⁰ and the number of possible small organic molecules is suggested to be 10⁶⁰. Even if we had access to exascale computing facilities that could screen a compound per second, we would still need more than the lifetime of the universe to scan all the compounds. Then, even if we were able to identify top compounds with superior binding affinity, there is no assurance that these compounds would have favorable pharmacodynamic and pharmacokinetic properties (i.e., ADMET, solubility, and bioavailability). Thus, in situations such as the current COVID-19 pandemic and rapidly emerging SARS-CoV-2 variants, where one has to urgently find a scalable solution, repurposing existing drugs and screening existing NPs with experimentally annotated pharmacokinetic profiles are appropriate approaches to identify potential compounds toward any therapeutic target associated with a disease of interest within a reasonable timeline. The limited size of the repurposable chemical space (RCS) can be handled easily with currently available SBDD approaches. Here, we have summarized traditional methods applied at each stage of SBDR as well as recently developed AI algorithms, which can be used either instead of, or in association with, traditional methods to achieve accurate predictions. Computationally intensive MD simulations and QM-based methods that can be used conveniently for a small RCS for efficient binding energy estimation have also been discussed. Whereas traditional methods, such as docking-based VS, are extremely quick to screen a few thousand molecules of the RCS against new targets, the accuracy of the calculated molecular properties, such as binding affinity, is low because of the severe approximations used.
Alternatively, free energy calculations using MD simulations and QM methods are capable of providing accurate values. In recent years, modern ML methods have been seen as potential methods that will make every task throughout the DD process more efficient. Although classical ML methods are still valuable in situations where the data set size is limited, modern ML methods are proving to be disruptive and are changing the way that different tasks in DD processes are being undertaken. Recent studies have shown that ML methods can help in identifying targets, predicting 3D structures of target proteins from the sequence, helping to screen large numbers of small druglike molecules, performing generative tasks to suggest new ligands, providing retrosynthetic pathways for synthesis, controlling robotic systems to physically synthesize compounds, processing the signal corresponding to molecule characterization based on spectra, and predicting outcomes of clinical trials. For VS applications, NN-based methods have been shown to be useful for developing ML-based scoring functions that are accurate and computationally tractable. Additionally, generative methods are capable of suggesting molecules that have scaffolds identified from NPs and existing drugs. Hence, careful combination of traditional methods and data-driven methods is expected to speed up the whole DD process in general and drug repurposing in particular.
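The screening-time argument made earlier in this section can be checked with a short back-of-the-envelope calculation. The sketch assumes an age of the universe of about 13.8 billion years and considers two hypothetical throughputs: one compound per second, and an optimistic 10¹⁸ compounds per second (one compound per operation on an exascale machine, which is far beyond what any docking code achieves).

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 1.38e10  # approximate value, assumed for this estimate
N_SMALL_MOLECULES = 1e60         # size of small-molecule space quoted in the text


def screening_time_years(n_compounds, compounds_per_second):
    """Years needed to screen n_compounds at a constant throughput."""
    return n_compounds / compounds_per_second / SECONDS_PER_YEAR


for rate in (1.0, 1.0e18):
    years = screening_time_years(N_SMALL_MOLECULES, rate)
    ratio = years / AGE_OF_UNIVERSE_YEARS
    print(f"{rate:.0e} compounds/s: {years:.1e} years ({ratio:.1e} universe ages)")
```

Under either assumption the required time exceeds the age of the universe by many orders of magnitude, which is the quantitative motivation for restricting the search to the much smaller repurposable chemical space.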
