Literature DB >> 36188488

Searching for potential inhibitors of SARS-COV-2 main protease using supervised learning and perturbation calculations.

Trung Hai Nguyen^1,2, Nguyen Minh Tam^1,2, Mai Van Tuan³, Peng Zhan⁴, Van V Vu⁵, Duong Tuan Quang⁶, Son Tung Ngo^1,2.

Abstract

Inhibiting the biological activity of SARS-CoV-2 Mpro can prevent viral replication. In this context, a hybrid approach using knowledge- and physics-based methods was proposed to characterize potential inhibitors for SARS-CoV-2 Mpro. Initially, supervised machine learning (ML) models were trained to predict a ligand-binding affinity of ca. 2 million compounds with the correlation on a test set of R = 0.748 ± 0.044 . Atomistic simulations were then used to refine the outcome of the ML model. Using LIE/FEP calculations, nine compounds from the top 100 ML inhibitors were suggested to bind well to the protease with the domination of van der Waals interactions. Furthermore, the binding affinity of these compounds is also higher than that of nirmatrelvir, which was recently approved by the US FDA to treat COVID-19. In addition, the ligands altered the catalytic triad Cys145 - His41 - Asp187, possibly disturbing the biological activity of SARS-CoV-2.

Entities: Chemical

Keywords: Docking, Simulation; FEP; FEP, Free Energy Perturbation; LIE; LIE, Linear Interaction Energy; ML, Machine Learning; Machine learning; Mpro, SARS-CoV-2 Mpro; SARS-CoV-2 Mpro; SL, Supervised Learning; Supervised learning

Year: 2022 PMID： 36188488 PMCID： PMC9511900 DOI： 10.1016/j.chemphys.2022.111709

Source DB: PubMed Journal: Chem Phys ISSN： 0301-0104 Impact factor: 2.552

Introduction

Coronaviruses have the largest genomes among RNA viruses (26–32 kb) encrypting structural and nonstructural proteins [1], [2]. Coronaviruses have been infecting humans and normally cause mild respiratory syndrome [3]. However, in 2002, the severe acute respiratory syndrome coronavirus (SARS-CoV) was first recognized in Guandong, China, and was associated with 774 deaths over 8096 infected cases [4]. The Middle East respiratory syndrome coronavirus (MERS-CoV) was first reported in 2012 to be able to transfect animals to humans and lead to severe cases of respiratory syndromes and deaths [5]. This shows that coronavirus can induce severe symptoms and potential pneumonia and death. A novel coronavirus, SARS-CoV-2, causes severe acute respiratory syndromes and is related to millions of deaths worldwide since it initially spread in December 2019 in Wuhan, Hubei Province, China [6], [7], [8], [9]. The virus has been suggested to initiate from bats and can quickly transfect between humans [10]. The spreading speed is tremendously high since the virus can exist in an aerosol [11]. Despite efforts to reduce the viral outbreak, more than 450 million people have been infected to date. The viral outbreak effectuated the COVID-19 pandemic. Therefore, the development of therapy is crucial for community health. In this context, remdesivir was first approved as an antiviral drug for treating COVID-19 [12]. However, it is considered a controversial decision [13] since the drug showed disappointing trials [14], [15]. After that, Pfizer’s Paxlovid, which combined nirmatrelvir and ritonavir, was authorized as the first oral antiviral drug to treat COVID-19 by the FDA [16]. Although Paxlovid effectiveness is as high as 89% in reducing hospitalization or death compared placebo, its components might cause a severe interaction with widely used medications such as statins, blood thinners, and some antidepressants. Moreover, new SARS-CoV-2 variants in the UK, South Africa, and the US [17], [18] have prompted scientists to search for more COVID-19 drugs since emerging mutations can reduce the effectiveness of current treatments. Indeed, some variants have recently been reported to be able to escape from neutralizing antibodies [19], [20]. Therefore, developing an appropriate treatment for COVID-19 is urgently needed. Among more than 20 structural proteins and nonstructural proteins (nsp) encoded by SARS-CoV-2 genomes, the main protease (Mpro), as well as a 3-chymotrypsin-like protease (3CLpro), are known as crucial enzymes related to the replication and proliferation of a new virus [1], [21]. In particular, SARS-CoV-2 Mpro, corresponding to nsp5, and papain-like protease (PLpro), corresponding to nsp3, first autocleave themselves from the synthesis of messenger RNA (mRNA) translation. The proteases then cleave polyproteins to polypeptides, leading to the replication and functionalities of a new virus. The cleaved proteins involve endoribonuclease (NendoU), exonuclease (exoN), helicase (Hel), RNA-dependent RNA polymerase (RdRp), and an S-adenosyl-methionine-dependent ribose 2’-O-methyltransferase (2’-O-MT). It should be noted that whereas PLpro is responsible for the formation of nsp1-3, Mpro determines the formation of nsp4-16 [22]. Therefore, 3CLpro/Mpro is one of the most appropriate targets for COVID-19 drug design. 3CLpro/Mpro is a homodimer protein consisting of two chains each consisting of 306 residues divided into three domains I, II, and III [23]. Indeed, the active site of Mpro comprises His41 and Cys145 located in the cleft between domains I and II [24], [25]. Strong binding inhibitors to Mpro often form tight hydrogen bond (HB) and nonbonded (NBC) contacts with these residues [26]. Moreover, the important residues controlling the ligand-binding affinity to SARS-CoV-2 Mpro include Thr26, Ser46, Asn142, Gly143, His164, Glu166, and Gln189 [26]. Although the catalytic activity of SARS-CoV-2 Mpro rigidly relies on dimerization [23], investigation of the ligand-binding affinity in silico can be performed based on the monomeric form [27]. Because of the well-characterized structure and great interest in designing inhibitors for SARS-CoV-2 Mpro, numerous computational and experimental studies have been carried out to estimate efficient inhibitors to block the biological activity of Mpro [24], [25], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37]. Among these studies, computational approaches are widely used to speed up the screening of potential SARS-CoV-2 inhibitors since several thousand compounds can be tested over a short period of time [27], [38], [39], [40]. Indeed, computer-aided drug design (CADD) has arisen as a robust protocol for high-throughput screening of thousands/million compounds for potential inhibitors of enzymes since October 5, 1981. An article entitled “Next Industrial Revolution: Designing Drugs by Computer at Merck” was published in Fortune magazine [41]. The power of CADD has been increasingly demonstrated because the method remarkably reduces the time and cost of drug development [42]. For example, 81 inhibitors were screened over 400 000 tested compounds, yielding a hit rate of only 0.02% [43]. CADD is used not only to screen for new inhibitors but also to test existing drugs for repurposing targets [34], [40]. Therefore, numerous drugs have been discovered thanks to the contribution of CADD. Examples of some of the earliest successes of CADD include the carbonic anhydrase inhibitor dorzolamide, which was authorized in 1995 [44], [45], and saquinavir, ritonavir, and indinavir, which were authorized for inhibiting human immunodeficiency virus 1 (HIV-1) protease in 1995, 1996, and 1996, respectively [41]. Typically, the computational approach is utilized to determine promising agents that can bind well to a protein target. Investigation of the ligand-binding free energy is thus one of the most important tasks in CADD [46]. Several methods have been developed to unravel the physical/chemical process [47]. Among these, molecular docking [48] or quantitative structure-activity relationship (QSAR) [49] methods can be used to characterize the ligand-binding affinity of several thousand/million ligands. More accurate methods such as the fast pulling of ligand (FPL) [50], linear interaction energy (LIE) [51], [52], and molecular mechanism/Poisson-Boltzmann surface area (MM/PBSA) [53], [54], [55] are then used to refine the docking/QSAR outcomes. The free energy perturbation (FEP) method was finally performed to validate the list of promising inhibitors [56], [57], [58], [59]. Moreover, recent advancements in machine learning (ML) methods have benefited many areas of science and technology. In particular, ML has been used in CADD for drug discovery and repurposing [60], [61]. The most common task of ML in CADD [62] is to predict ligand binding affinities from features extracted from molecular properties including physical, chemical, and structural terms. This is a typical supervised regression problem where ML approaches such as random forest, gradient boosting, and deep learning can result in good prediction accuracy. ML was used to repurpose existing drugs for SARS-CoV-2 treatment [40]. In this work, we combined ML models with atomistic simulations to screen a large database of approximately two million compounds for potential inhibitors of SARS-CoV-2 Mpro. In particular, ML models were trained to predict ligand binding free energies for the whole database, and top-lead ligands with the strongest predicted binding affinity were selected. These top-lead compounds were subsequently subjected to physics-based calculations, including molecular docking and molecular dynamics (MD) simulations, to validate their binding mechanisms to SARS-CoV-2 Mpro. Our resulting list of promising inhibitors for SARS-CoV-2 Mpro can serve as a foundation for further experimental investigations and contribute to the rapid development of SARS-CoV-2 therapy.

Materials and Methods

Computational scheme

The computational scheme used to investigate potential inhibitors for SARS-CoV-2 Mpro from ChEMBL [63], a database of bioactive molecules with drug-like properties, is described in Figure 1 . In particular, ML models were trained and tested to accurately determine the binding free energy of ChEMBL compounds to SARS-CoV-2 Mpro. The top 100 potential inhibitors for SARS-CoV-2 Mpro, which were estimated by ML models, were docked to the protease. The structural change of the SARS-CoV-2 Mpro + inhibitor complexes was then investigated using unbiased MD simulations. The Gibbs free energy difference between the unbound and bound states of the top 100 ligands to Mpro was revealed via LIE and FEP calculations. A list of compounds for inhibiting SARS-CoV-2 Mpro from the ChEMBL database was thus obtained.

Figure 1

Computational strategy. (A) Computational approach utilized to search promising inhibitors for SARS-CoV-2 Mpro by using hybrid approach involving supervised machine learning and atomistic simulations. (B) Ligand-binding pose was preliminarily predicted via AutoDock Vina. (C) Configuration of catalytic triad Cys145 - His41 - Asp187. (D) + (E) Initial conformations of SARS-CoV-2 Mpro + inhibitor and individual inhibitors in solution.

SARS-CoV-2 Mpro and Ligands

The list of available inhibitors of SARS-CoV-2 Mpro (Table S1) was collected from literature reviews, which involved 571 compounds with the corresponding values of the half-maximal inhibitory concentration (). The experimental ligand-binding free energy was calculated as , which was based on the approximation that equals the inhibition constant () and was used as the label for training models. Table S1 contains the labeled data and SMILES strings of the corresponding compounds. The distribution of the values is shown in Figure S1 of the Supporting Information. A total of 451 and 120 compounds were randomly chosen for training and testing, respectively. The performance metrics used for model selection include RMSE, Pearson’s , and Spearman’s correlation coefficients. The best model was used to predict ligand-binding affinity for the ChEMBL database, which includes ca. 2 million bioactive compounds with drug properties. Furthermore, the SARS-CoV-2 Mpro structure was downloaded from the Protein Data Bank with ID 7JYC [64].

Supervised Learning Calculation

Regression models were trained to predict ligand-binding free energy from molecular features. They included linear regression (LR), random forest (RF), extreme gradient boosting (XGBoost) [65], and a deep learning model based on convolutional networks on graphs (GraphConv) [66]. The LR model, which is simple and therefore less prone to overfitting, was used as a baseline model. Model hyperparameters were tuned by using a tenfold cross-validation technique. Optimal values of hyperparameters that minimize mean square error (MSE) were searched by using the Hyperopt library [67]. For LR, only the L2 regularization strength (alpha) was tuned. For RF, the following hyperparameters were tuned: max_depth, min_samples_split, min_samples_leaf and max_features. For XGBoost, max_depth, min_child_weight, subsample, colsample_bytree, reg_lambda, and learning_rate were optimized. For the GraphConv model, the number of units in the graph_cov layers and dense layer, learning rate, and dropout rates were tested. The Python library Scikit-Learn [68] was used to train the LR and RF models. To train XGBoost and GraphConv [66], we used the XGBoost and DeepChem [69] libraries, respectively. Molecular features were extracted by using the RDKitDescriptors tool kit implemented in Deepchem [69]. It computed 200 physical and chemical properties such as molecular weight, polar surface area, number of valence electrons, numbers of HB donors and acceptors, and maximum and minimum partial charge. To make the models more robust and less prone to overfitting, we reduce the number of features as follows. Some features have zero values for almost all compounds in the training set. Removing features having zero for more than 99% training compound resulted in a reduction of 68 features. Strongly correlated features (absolute value of Pearson’s R > 0.95) were also removed, which resulted in a further reduction of 27 features. Finally, 105 features were used as input to the ML models. Some features may have no numerical value for certain compounds. These missing values were imputed by the median of the feature. Before inputting into the models, all features were standardized to have a zero mean and standard deviation of one. These 105 features were used to train the LR, RF, and XGBoost models. GraphConv uses convolutional networks on graphs to automatically extract useful features from the ligand molecule, which is represented as an undirected graph [66]. It is implemented as an embedding layer that accepts the SMILES string as input and outputs a fixed-length vector (called the molecular fingerprint). The fixed-length vector is then fed into a densely connected layer. Both the embedding vectors and weights of densely connected layers are learned simultaneously during training of the model. The Python code for carrying out ML modeling is available online at this GitHub URL https://github.com/nguyentrunghai/SARS-CoV-2Mpro_inhibitor_ML.

Docking Simulations

AutoDock Vina [70] using modified empirical parameters [71] was utilized to investigate the ligand binding poses to SARS-CoV-2 Mpro. In particular, AutoDockTools [72] was used to parameterize both the ligands and receptor. The docking grid center was picked as the center of mass of the native inhibitor. The grid size was picked as Å, referring to the previous assessment [26]. The exhaustiveness parameter of modified Vina was selected as the default value referring to the previous assessment [71]. The largest energy variability between docking modes was appointed as 7 kcal mol-1. The docking structure with the lowest docking energy was used as the starting conformation of MD simulations.

Molecular Dynamics Simulations

MD simulations were performed to refine the docking outcome using GROMACS 5.1.5 [73]. In particular, the Amber99SB-iLDN force field [74] was employed to topologize the protease and charge-neutralizing ions. The catalytic triad Cys145-His41-Asp187 may play a vital role in the binding process of a ligand to SARS-CoV-2 Mpro [75]. The protonation state of these residues was assigned as shown in Figure 1 C. Water molecules were parameterized via the TIP3P water model [76]. In addition, the general Amber force field [77] was utilized to parameterize the ligands using the AmberTools18 and ACPYPE packages [78], [79]. Among these, the chemical information, including geometrical parameters and charges, was obtained from quantum mechanics calculations using density functional theory (DFT) with the B3LYP functional and 6-31G(d,p) basis set. The restrained electrostatic potential approach was employed to assign the atomic charges [77]. A dodecahedron periodic boundary condition (dPBC) box with a volume of 669 nm3 was used to situate the SARS-CoV-2 Mpro + inhibitor complex. In particular, the minimum distance between the complex and the boundary is 1.0 nm. The solvated complex thus comprises ca. 66,000 atoms, respectively, including 1 protease, 1 ligand, 20,400 water molecules, and counterbalanced ions Na+. In addition, a dPBC box with a capacity of ca. 41 nm2 was employed to place the ligand, in which the minimum distance between the ligand and the boundary was 1.0 nm. The solvated ligand system hence consists of ca. 3700 atoms, including 1 ligand, 1200 water molecules, and neutralized ions Na+. MD parameters were obtained according to previous simulations [26], [36]. In particular, nonbonded interactions were cut off at 0.9 nm. The fast particle–mesh Ewald (PME) electrostatics method [80] was utilized to mimic electrostatic (cou) interactions. Moreover, the cutoff scheme was employed to calculate the van der Waals (vdW) interaction. Equilibration simulations were attempted over three steps involving energy minimization, NVT, and NPT simulations. The length of the NVT and NPT simulations is 100 ps. The last snapshot of NPT simulations was then used as the starting shape of MD simulations. The lengths of the MD simulations are 5.0 and 20.0 ns, corresponding to the solvated ligand and complex systems, respectively. The simulation for each system was repeated twice with different random initializations.

Free Energy Calculations

Linear interaction energy (LIE) calculation. The binding free energy () of a ligand to SARS-CoV-2 Mpro via the LIE approach [81] was computed as the difference between the average of the cou and vdW interactions of the ligand with neighboring atoms over various systems, including the ligand in the solvated complex (bound state - noted as subscript b) and the ligand in solution (unbound state - noted as subscript u). can be determined by where the coefficients were selected as , , and by referring to the previous assessment [26], [33]. Free energy perturbation (FEP) simulation. The FEP approach [82] was finally used to provide a more accurate estimation of the ligand binding free energy. -alteration simulations [83] were employed to change the ligand-binding system from bound () to unbound () states. The free energy alteration between two states, , can be calculated via several values of the coupling parameter over MD simulations. The ligand changing from bound to unbound states over -alteration simulations can be called the ligand-demolition process, in which the energy change can be computed by using the Bennett acceptance ratio (BAR) method [84]. The binding free energy of a ligand to a protein is thus determined as the gap of energy changes over two ligand-demolition processes, including annihilating the ligand in the soluble complex () and individual ligand (), as follows:

Analysis tools

The chemicalize webapp, an online tool of ChemAxon, was utilized to predict the protonation states of ligands. The computed error (success-docking rate and correlation coefficients) was assessed via 1000 rounds of the bootstrapping method [85]. The ligand-binding diagram was produced by using the free version of Maestro [86]. The nonhydrogen atom root-mean-square deviation (RMSD) between docked and experimental binding poses was calculated via GROMACS tools “gmx rms” [73]. A hydrogen bond (HB) contact was counted if the angle [acceptor (A)-hydrogen (H)-donor (D)] was larger than 135° and the A-D distance was less than 0.35 nm. A nonbonded contact was counted if the distance between two nonhydrogen atoms was lower than 4.5 Å. The clustering calculation was performed via GROMACS tools “gmx cluster.” The collective-variable FEL was constructed by using GROMACS tools “gmx sham.”

Results and Discussion

Prediction of Potential Inhibitors from Supervised ML Models

To rapidly screen the ChEMBL library, four regression models were trained: linear regression (LR), random forest (RF), extreme gradient boosting (XGBoost) [65], and convolutional networks on graphs (GraphConv) [66]. The performance metrics of the test set are listed in Table 1 . Due to its simplicity, the LR model is the least accurate among the four models we trained in this work. This is not unexpected since LR is not able to capture nonlinear relationships between features and targets. Nevertheless, the LR model served as a baseline for more sophisticated models. The best model by all three metrics is XGBoost, which gives an RMSE of 1.125 ± 0.095, Pearson’s R of 0.748 ± 0.044 and Spearman’s of 0.765 ± 0.048 (Table 1). However, it is not better than the second-best model, which is random forest by a large margin. The GraphConv model comes in third place, although its performance is very close to that of random forest. From this assessment, the XGBoost model was selected to predict the binding free energy for nearly 2 million compounds in the ChEMBL database. Figure 2 shows a comparison of the binding free energy between the experiment and prediction made by XGBoost for 120 compounds in the test set.

Table 1

Performance metrics of regression models in predicting binding free energy of 120 tested ligands to SARS-CoV-2 Mpro. Numbers in parentheses are error bars estimated by bootstrapping.

Model	RMSE (kcal mol^-1)	Pearson’s R	Spearman’sρ
Linear Regression	1.299 ± 0.104	0.631 ± 0.070	0.708 ± 0.053
Random Forest	1.157 ± 0.093	0.737 ± 0.046	0.753 ± 0.045
XGBoost	1.125 ± 0.095	0.748 ± 0.044	0.765 ± 0.048
GraphConv	1.161 ± 0.088	0.735 ± 0.050	0.749 ± 0.043

Figure 2

Predicted binding free energy versus experiment for 120 test compounds. Prediction was made using XGBoost model.

Performance metrics of regression models in predicting binding free energy of 120 tested ligands to SARS-CoV-2 Mpro. Numbers in parentheses are error bars estimated by bootstrapping. Predicted binding free energy versus experiment for 120 test compounds. Prediction was made using XGBoost model. The distribution of the predicted binding free energy by the XGBoost model () is shown in Figure S1 of the Supporting Information. The predicted binding free energy for the ChEMBL ligand ranges from -5.78 to -10.02 kcal mol-1 with a mean value of -7.47 ± 0.46 kcal mol-1. The top 100 compounds with values from -9.63 to -10.02 kcal mol-1 may be good candidates for SARS-CoV-2 Mpro inhibitors because their predicted binding affinities are comparable to that of nirmatrelvir with kcal mol-1. The mean value of is kcal mol-1. The predicted binding free energy is probably underestimated since the ML model was trained and tested on , which was approximated by . The list of compounds is reported in Table S2 of the Supporting Information. Although the ML model adopted a good correlation coefficient with the respective experiments (cf. Figure 2), physical-based methods were also carried out to further evaluate the binding process of these ligands to SARS-CoV-2 Mpro.

Estimation of Docking Pose of Ligands to SARS-CoV-2 Mpro

As mentioned above, although the ML method was shown to be effective in predicting potential inhibitors, physical-based approaches were also utilized to refine the predicted binding affinity and to provide physical and chemical insights into the binding process [87]. In this work, the binding process of SARS-CoV-2 Mpro and its inhibitors was investigated via two steps: molecular docking and MD simulations. In particular, the ligand-binding pose to SARS-CoV-2 Mpro was preliminarily probed via AutoDock Vina [70]. Note that AutoDock Vina provided an appropriate success-docking rate, , when the docking pose of nine ligands was compared to the respective experiments [26]. In addition, recently, the modified version of AutoDock Vina [71] was benchmarked using the different empirical parameters, which provided a better correlation to the respective experiments. However, the performance of the approach on SARS-CoV-2 Mpro was not considered. Therefore, in this work, the value of AutoDock Vina with original and modified empirical parameters was reevaluated with a larger set involving 40 different complexes (Table S3 of the Supporting Information). The obtained results were consistent with those of previous work [26]. In particular, the conformations docked via AutoDock Vina differed from the experimental structures by an amount of such that the value was %. Interestingly, the corresponding values of the modified AutoDock Vina are Å and %, respectively, which are better than those of AutoDock Vina. Here, a successfully docked shape was counted if the RMSD was lower than 2.00 Å. AutoDock Vina with the modified empirical parameters was thus employed to find the binding pose of the top 100 ligands to SARS-CoV-2 Mpro. The docking outcomes are shown in Table 2 and Table S5 of the Supporting Information. The ligand-binding free energy diffuses in the range from -11.3 to -17.6 kcal mol-1, in which the mean value is -14.89 ± 0.10 kcal mol-1. The mean value is significantly smaller than that of nirmatrelvir, kcal mol-1, which was considered a positive control. Note that the modified AutoDock Vina often offered an overestimated value of ligand-binding affinity [71]. Moreover, as mentioned above, the predicted value is probably underestimated. This may explain why the obtained docking results were significantly larger than those obtained by the ML models.

Table 2

Calculated binding free energy of top-lead compounds to SARS-CoV-2 Mpro via different approaches.

N⁰	Compound name	ΔGXGB	ΔGVina	〈Vl-scou〉b-〈Vl-scou〉u	〈Vl-svdW〉b-〈Vl-svdW〉u	ΔGLIE	ΔGcou	ΔGvdW	ΔGFEP	ΔGEXPa
1	CHEMBL3815050	-10.02	-15.7	7.16	-36.37	-16.71 ± 0.36	-2.06	-10.06	-12.12 ± 1.12
2	CHEMBL4300604	-9.98	-14.7	10.66	-33.23	-15.97 ± 0.49	-9.48	-9.49	-18.97 ± 3.59
3	CHEMBL3945443	-9.98	-15.1	9.58	-30.44	-15.12 ± 0.03	-16.92	-14.73	-31.65 ± 0.92
4	CHEMBL3678802	-9.97	-17.1	5.23	-32.15	-15.40 ± 1.39	-9.47	-8.72	-18.19 ± 0.69
5	CHEMBL4170638	-9.91	-14.0	3.62	-31.78	-15.21 ± 0.36	-3.68	-11.19	-14.87 ± 0.33
6	CHEMBL4111845	-9.79	-15.3	3.79	-35.84	-16.39 ± 0.58	-12.95	-8.46	-21.41 ± 0.69
7	CHEMBL580289	-9.75	-15.0	7.88	-32.08	-15.51 ± 0.78	-17.79	-9.45	-27.24 ± 2.87
8	CHEMBL3640406	-9.74	-15.8	15.84	-29.33	-15.10 ± 0.14	9.97	-14.96	-4.99 ± 1.26
9	CHEMBL538763	-9.74	-15.1	11.20	-30.55	-15.23 ± 0.78	6.02	-12.13	-6.11 ± 3.77
10	CHEMBL176909	-9.73	-15.3	14.99	-35.52	-16.84 ± 0.36	-6.99	-12.47	-19.45 ± 1.07
11	CHEMBL4095929	-9.72	-17.2	18.81	-32.15	-16.06 ± 0.21	9.13	-14.07	-4.95 ± 0.42
12	CHEMBL3640394	-9.69	-14.4	15.34	-29.54	-15.14 ± 0.30	7.79	-14.77	-6.98 ± 0.57
13	CHEMBL3657195	-9.68	-15.0	7.87	-30.99	-15.19 ± 0.26	5.07	-8.63	-3.56 ± 1.93
14	CHEMBL416434	-9.68	-15.9	11.72	-32.35	-15.77 ± 1.97	-7.06	-11.11	-18.17 ± 2.52
15	CHEMBL1471687	-9.67	-13.8	-8.97	-38.03	-16.39 ± 0.45	-7.19	-13.76	-20.95 ± 0.08
16	CHEMBL285908	-9.67	-11.3	11.07	-33.31	-16.01 ± 0.99	1.60	-10.40	-8.80 ± 2.00
17	CHEMBL3673817	-9.67	-16.3	2.23	-32.67	-15.40 ± 0.27	3.57	-12.21	-8.64 ± 1.74
18	CHEMBL4101092	-9.66	-17.1	14.50	-32.10	-15.83 ± 0.7	-1.73	-16.77	-18.50 ± 1.08
19	CHEMBL20260	-9.64	-15.8	-3.65	-34.18	-15.55 ± 1.34	1.54	-12.55	-11.01 ± 2.88
20	Nirmatrelvir	-9.57	-13.8	2.21	-29.80	-14.57 ± 0.06	-4.56	-9.78	-14.35 ± 0.04	-10.46 [88]

value obtained based on IC50 value, in which term was approximately equal to inhibition constant and contribution of covalent binding energy is assumed to be small [88]. Unit is kcal mol-1.

Calculated binding free energy of top-lead compounds to SARS-CoV-2 Mpro via different approaches. value obtained based on IC50 value, in which term was approximately equal to inhibition constant and contribution of covalent binding energy is assumed to be small [88]. Unit is kcal mol-1.

MD-Refined Simulations

Docking approaches normally utilize several approximations to speed up the computation [70], [72]. Therefore, the obtained outcomes are thus refined by using more accurate and precise approaches such as Steered-MD (SMD), MD, and replica-exchange MD (REMD) simulations [34], [37], [89]. In this work, although AutoDock Vina with the modified empirical parameters formed the success-docking rate , unbiased MD simulations were employed to clarify the docking results. According to previous work [26], the solvated SARS-CoV-2 Mpro + inhibitor complex was equilibrated by using 20.0 ns unbiased MD simulations for each trajectory, which is long enough to equilibrate the complex conformations. The all-atom RMSD of the complexes almost achieved stable states just after 5.0 ns (cf. Table S4 of the Supporting Information). The equilibrated conformations of the solvated complex and ligand systems were then used for free energy calculations via LIE/FEP approaches. The protease individual residues were calculated to reveal the ligand-binding mechanism by analyzing intermolecular NBC and HB contacts between top-lead compounds (Table 2 and discussed below). The probability of NBC and HB contacts of ligands to SARS-CoV-2 Mpro residues was described in Figure 3 . In particular, top-lead ligands adopted intermolecular contacts with 43 residues. The residues formed intermolecular NBCs to top-lead ligands over more than 46% of the evaluated conformations (20,000 shapes in total). In addition, only 24/43 residues adopted intermolecular HB to top-lead compounds, which occupied 6% of all investigated shapes. Note that the outcomes are obviously larger than those of twenty available inhibitors [26]. It may be argued that the top-lead compounds form a stronger ligand-binding affinity than twenty available inhibitors [26]. Moreover, the residues that were able to adopt more than 6% HB and 46% NBC to ligands are critical elements controlling the ligand-binding process to the protease. There are 13 residues satisfying the criteria (Figure 3). The residues are Thr25, His41, Ser46, Asn142, Gly143, Ser144, Cys145, His163, His164, Glu166, Asp187, Gln189, and Gln192. Interestingly, top-lead compounds frequently created intermolecular contacts with the catalytic triad Cys145-His41-Asp187, which may prevent the biological activity of SARS-CoV-2 Mpro. In addition, the residues Met49, Leu141, Met165, Leu167, Pro168, Arg188, Thr190, and Ala191 also contribute a large amount of the binding energy because they rigidly adopt NBC to ligands. Overall, top-lead ligands can bind well to Mpro by forming a large number of contacts with several residues located in the active sites. This may prevent the reduction in ligand-binding affinity when mutations of SARS-CoV-2 Mpro occur.

Figure 3

Probability of NBC and HB contacts between SARS-CoV-2 Mpro individual residues and top-lead compounds. Green rectangles denote residues that formed more than 6% HB and 46% NBC to ligands.

Probability of NBC and HB contacts between SARS-CoV-2 Mpro individual residues and top-lead compounds. Green rectangles denote residues that formed more than 6% HB and 46% NBC to ligands. As mentioned above, the catalytic triad Cys145-His41-Asp187 may play a crucial role in the binding process of a ligand to SARS-CoV-2 Mpro [75]. The distance between the Cys145-Sγ and His41-Nε () and His41-Nδ vs. Asp187-Oδ () atoms was considered to mediate the influence of ligands on SARS-CoV-2 Mpro. Moreover, SARS-CoV-2 Mpro in solution without ligand was also simulated over 220 ns (Figure S2 of the Supporting Information) to compare the difference. The collective-variable free energy landscape (FEL) using and () as reaction coordinates was constructed over equilibrium snapshots of SARS-CoV-2 Mpro with top-lead inhibitors (Table 1) and without an inhibitor. The outcome is shown in Figure 4 . The presence of inhibitors clearly altered the FEL of SARS-CoV-2 Mpro, implying a change in enzymatic biological activity. In particular, the SARS-CoV-2 Mpro without the presence of ligands formed two minima noted as A1-2 at (, () coordinates of (0.50, 0.35) and (0.73, 0.33). In addition, the SARS-CoV-2 Mpro + ligands adopted two minima denoted as B1-2 at (, () coordinates of (0.51, 0.34) and (0.29, 0.33).

Figure 4

Collective-variable FEL of SARS-CoV-2 Mpro in present and absent ligands. Distances and (, which are associated with catalytic triad Cys145 - His41 - Asp187, were utilized as reaction coordinates.

Binding free energy calculation via the LIE method.

The binding free energy of ligands and SARS-CoV-2 Mpro can be calculated using the LIE approach with a correlation coefficient of [26]. The binding free energy was thus performed to assess the top 100 inhibitors, which was suggested by the XGBoost model. The mean gap of the cou and vdW interaction energies between a ligand and surrounding molecules over bound and unbound states was calculated and is shown in Table 2 and S2 of the Supporting Information. The corresponding values over 100 ligands are and kcal mol-1, respectively. The obtained outcome suggests the domination of vdW over electrostatic interactions in the binding process of ligands to SARS-CoV-2 Mpro, which is in good agreement with previous works [34], [36]. This also explained why the empirical parameters for amyloid beta systems involving , , and were successfully applied for computing the binding affinity of a ligand to SARS-CoV-2 Mpro. The negative metrics and correspond to the loss of electrostatic interaction during the ligand association, and the hydrophobic interaction is strong as the domination of the vdW term. The obtained is shown in Table 2 and S2 of the Supporting Information. Over the top 100 compounds, the metric varies in the range from to kcal mol-1, in which the average value is kcal mol-1. The magnitude of is larger than that of by an amount of kcal mol-1, which is consistent with the previous work that the is smaller than by a value of 3.89 kcal mol-1. Note that the overestimation of compared with was observed because the ML model was trained using as an approximation to in calculating the experimental binding free energy. Moreover, the obtained are in good agreement with the docking simulation, which is kcal mol-1. The RMSE between and is 1.88 kcal mol-1. Furthermore, 19% of the ligands formed a strong binding free energy to the protease, in which the calculated was lower than -15.00 kcal mol-1 (Table 2). Interestingly, the top 19 inhibitors formed a lower than that of the positive control nirmatrelvir, whose is kcal mol-1 (Table 2). They were thus selected for further investigation to validate the outcome via the perturbation method [82].

Investigation of ligand-binding affinity via perturbation simulations

The perturbation simulation is known as one of the most accurate approaches thus far [26], [47], [90], [91]. In a recent report [26], this approach was indicated as the most accurate method for determining the binding free energy of SARS-CoV-2 Mpro and its ligands. The FEP approach formed the highest correlation coefficient, [26], for the respective experiment in comparison with other methods such as LIE, fast pulling of ligand (FPL) [50], and molecular mechanics Poisson-Boltzmann surface area (MM-PBSA) [92], [53]. The perturbation simulations were successfully applied to characterize several potential inhibitors for SARS-CoV-2 Mpro [34], [35], [90]. The approach was thus utilized to refine the binding free energy of the top 19 inhibitors, which adopted an smaller than -15.00 kcal mol-1. The obtained was mentioned in Table 2. In particular, the value fell in the range from to kcal mol-1 with a mean of kcal mol-1. This may suggest that nine compounds having kcal mol-1 would be highly potent inhibitors for SARS-CoV-2 Mpro (Table 2). These compounds include CHEMBL4300604, CHEMBL3945443, CHEMBL3678802, CHEMBL4111845, CHEMBL580289, CHEMBL176909, CHEMBL416434, CHEMBL1471687, and CHEMBL4101092. Interestingly, among nine compounds, CHEMBL3945443, which formed the strongest binding affinity to SARS-CoV-2 Mpro, has a nitrile group, the same as nirmatrelvir; unfortunately, the MD-refined structure (Figure 5 ) of the complex did not reveal this issue. This may imply the inaccuracy of conventional MD simulations since they cannot represent the covalent binding between proteins and ligands. Further computation using a quantum chemical approach would thus be employed to reveal the problem. Moreover, using the same approach, the available inhibitors can form (α-ketoamide 13b), (ritonavir), and (proscillaridin) kcal mol-1 [26], [36]. Furthermore, the binding free energy of PF-07321332 (nirmatrelvir) to SARS-CoV-2 Mpro is kcal mol-1, which was calculated as a positive control. The compound CHEMBL3945443 can thus form a covalent bond to Cys145-Sγ of SARS-CoV-2 Mpro. Furthermore, over all considered systems, the mean electrostatic free energy difference is kcal mol-1, while the average vdW free energy difference is kcal mol-1. The obtained results indicated that the vdW interaction rules over electrostatic interactions in the binding process of the considered inhibitors to SARS-CoV-2 Mpro. It should be noted that this result is consistent with previous works [26], [36].

Figure 5

Interaction diagram between SARS-CoV-2 Mpro + CHEMBL3945443. Complexed structure was obtained by calculating clustering of all equilibrium snapshots of complex with cutoff of 0.2 nm.

Conclusions

In this work, a hybrid approach using knowledge- and physical-based methods was proposed to characterize potential inhibitors of SARS-CoV-2 Mpro. Initially, the XGBoost model was trained to screen ligand-binding affinity over a large database of compounds. The predicted binding affinity for a test set strongly correlates with experimental data with a correlation coefficient of . The ligand-binding free energy of drug-like compounds from the ChEMBL database [63] was then predicted by the XGboost model. The top 100 compounds that formed the largest values of were further investigated via atomistic simulations. The binding pose of these ligands to the protease was preliminarily estimated using the modified AutoDock Vina with a successful docking rate of %. The complex was then refined via unbiased MD simulations. Moreover, the equilibrium conformations of the complex were used for binding free energy calculation via the LIE approach. A short list including 19 compounds was suggested for further analysis via the FEP method since adopting kcal mol-1. Furthermore, the perturbation simulations indicated that 9 compounds (CHEMBL4300604, CHEMBL3945443, CHEMBL3678802, CHEMBL4111845, CHEMBL580289, CHEMBL176909, CHEMBL416434, CHEMBL1471687, and CHEMBL4101092) can bind well to SARS-CoV-2 Mpro with a binding free energy of kcal mol-1, which is lower than that of nirmatrelvir, kcal mol-1. It may be argued that these compounds may act as highly potent inhibitors of SARS-CoV-2 Mpro. The obtained results also suggested that the vdW interaction may play an important role in the binding process of SARS-CoV-2 Mpro + inhibitors. Moreover, the residues Thr25, His41, Ser46, Asn142, Gly143, Ser144, Cys145, His163, His164, Glu166, Asp187, Gln189, and Gln192 play an important role since HB and NBC were rigidly adopted as ligands. Furthermore, the residues Met49, Leu141, Met165, Leu167, Pro168, Arg188, Thr190, and Ala191 also contribute a large amount of the binding energy since rigidly adopting NBC to ligands. In addition, the ligands inhibit the biological activity of SARS-CoV-2 Mpro by altering FEL construction via and (). The catalytic triad Cys145 - His41 - Asp187 was thus disturbed.

CRediT authorship contribution statement

Trung Hai Nguyen: Methodology, Software, Formal analysis, Investigation, Writing – original draft, Writing – review & editing. Nguyen Minh Tam: Methodology, Formal analysis. Mai Van Tuan: Project administration, Resources. Peng Zhan: Supervision, Conceptualization. Van V. Vu: Conceptualization, Visualization. Duong Tuan Quang: Funding acquisition, Supervision. Son Tung Ngo: Supervision, Writing – original draft, Writing – review & editing, Investigation, Validation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

68 in total

1. Binding of a diverse set of ligands to avidin and streptavidin: an accurate quantitative prediction of their relative affinities by a combination of molecular mechanics and continuum solvent models.

Authors: B Kuhn; P A Kollman
Journal: J Med Chem Date: 2000-10-05 Impact factor: 7.446

2. Structure-based drug design and modern medicine.

Authors: R Vijayakrishnan
Journal: J Postgrad Med Date: 2009 Oct-Dec Impact factor: 1.476

Review 3. Thermodynamics and Kinetics of Drug-Target Binding by Molecular Simulation.

Authors: Sergio Decherchi; Andrea Cavalli
Journal: Chem Rev Date: 2020-10-02 Impact factor: 60.622

4. Structure of M^pro from SARS-CoV-2 and discovery of its inhibitors.

Authors: Zhenming Jin; Xiaoyu Du; Yechun Xu; Yongqiang Deng; Meiqin Liu; Yao Zhao; Bing Zhang; Xiaofeng Li; Leike Zhang; Chao Peng; Yinkai Duan; Jing Yu; Lin Wang; Kailin Yang; Fengjiang Liu; Rendi Jiang; Xinglou Yang; Tian You; Xiaoce Liu; Xiuna Yang; Fang Bai; Hong Liu; Xiang Liu; Luke W Guddat; Wenqing Xu; Gengfu Xiao; Chengfeng Qin; Zhengli Shi; Hualiang Jiang; Zihe Rao; Haitao Yang
Journal: Nature Date: 2020-04-09 Impact factor: 49.962

5. Antibody resistance of SARS-CoV-2 variants B.1.351 and B.1.1.7.

Authors: Pengfei Wang; Manoj S Nair; Lihong Liu; Sho Iketani; Yang Luo; Yicheng Guo; Maple Wang; Jian Yu; Baoshan Zhang; Peter D Kwong; Barney S Graham; John R Mascola; Jennifer Y Chang; Michael T Yin; Magdalena Sobieszczyk; Christos A Kyratsous; Lawrence Shapiro; Zizhang Sheng; Yaoxing Huang; David D Ho
Journal: Nature Date: 2021-03-08 Impact factor: 69.504

Review 6. Coronavirus envelope protein: current knowledge.

Authors: Dewald Schoeman; Burtram C Fielding
Journal: Virol J Date: 2019-05-27 Impact factor: 4.099

7. ChEMBL: towards direct deposition of bioassay data.

Authors: David Mendez; Anna Gaulton; A Patrícia Bento; Jon Chambers; Marleen De Veij; Eloy Félix; María Paula Magariños; Juan F Mosquera; Prudence Mutowo; Michal Nowotka; María Gordillo-Marañón; Fiona Hunter; Laura Junco; Grace Mugumbate; Milagros Rodriguez-Lopez; Francis Atkinson; Nicolas Bosc; Chris J Radoux; Aldo Segura-Cabrera; Anne Hersey; Andrew R Leach
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971

8. Potent Noncovalent Inhibitors of the Main Protease of SARS-CoV-2 from Molecular Sculpting of the Drug Perampanel Guided by Free Energy Perturbation Calculations.

Authors: Chun-Hui Zhang; Elizabeth A Stone; Maya Deshmukh; Joseph A Ippolito; Mohammad M Ghahremanpour; Julian Tirado-Rives; Krasimir A Spasov; Shuo Zhang; Yuka Takeo; Shalley N Kudalkar; Zhuobin Liang; Farren Isaacs; Brett Lindenbach; Scott J Miller; Karen S Anderson; William L Jorgensen
Journal: ACS Cent Sci Date: 2021-02-22 Impact factor: 14.553

9. Identify potent SARS-CoV-2 main protease inhibitors via accelerated free energy perturbation-based virtual screening of existing drugs.

Authors: Zhe Li; Xin Li; Yi-You Huang; Yaoxing Wu; Runduo Liu; Lingli Zhou; Yuxi Lin; Deyan Wu; Lei Zhang; Hao Liu; Ximing Xu; Kunqian Yu; Yuxia Zhang; Jun Cui; Chang-Guo Zhan; Xin Wang; Hai-Bin Luo
Journal: Proc Natl Acad Sci U S A Date: 2020-10-13 Impact factor: 11.205

10. 3C-like protease inhibitors block coronavirus replication in vitro and improve survival in MERS-CoV-infected mice.

Authors: Athri D Rathnayake; Jian Zheng; Yunjeong Kim; Krishani Dinali Perera; Samantha Mackin; David K Meyerholz; Maithri M Kashipathy; Kevin P Battaile; Scott Lovell; Stanley Perlman; William C Groutas; Kyeong-Ok Chang
Journal: Sci Transl Med Date: 2020-08-03 Impact factor: 17.956