Literature DB >> 34426763

Docking-generated multiple ligand poses for bootstrapping bioactivity classifying Machine Learning: Repurposing covalent inhibitors for COVID-19-related TMPRSS2 as case study.

Ma'mon M Hatmal1, Omar Abuyaman1, Mutasem Taha2.   

Abstract

In the present work we introduce the use of multiple docked poses for bootstrapping machine learning-based QSAR modelling. Ligand-receptor contact fingerprints are implemented as descriptor variables. We implemented this method for the discovery of potential inhibitors of the serine protease enzyme TMPRSS2 involved the infectivity of coronaviruses. Several machine learners were scanned, however, Xgboost, support vector machines (SVM) and random forests (RF) were the best with testing set accuracies reaching 90%. Three potential hits were identified upon using the method to scan known untested FDA approved drugs against TMPRSS2. Subsequent molecular dynamics simulation and covalent docking supported the results of the new computational approach.
© 2021 The Author(s).

Entities:  

Keywords:  Bootstrapping; Covalent docking; Docking; Ligand-receptor contact fingerprints; Machine learning; Scoring

Year:  2021        PMID: 34426763      PMCID: PMC8373588          DOI: 10.1016/j.csbj.2021.08.023

Source DB:  PubMed          Journal:  Comput Struct Biotechnol J        ISSN: 2001-0370            Impact factor:   7.271


Introduction

Docking algorithms explore the conformational space of bound ligand(s) by sampling numerous ligand conformations/poses within the targeted binding site. The binding enthalpy of each docked pose/conformation is evaluated employing force fields that calculate interactions between ligand-protein complementary groups [1], [2], [3], [4]. However, docking engines ignore entropic contributions in binding and therefore need to be guided by scoring functions to select realistic docked poses [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33]. Modern docking methods can reproduce an experimental ligand crystallographic pose within high-ranking solutions. However, scoring functions are usually unable to evaluate binding free energies to accurately rank docked poses [1], [3], [34], [35], [36], [37]. The concept of Ligand-Receptor Contacts Fingerprints (LRCFs) is well established [38]. It proceeds by either mapping out all binding site atoms that contact a list of docked potent ligands and evade inactive compounds [38], [39], [40], [41], or identify binding site atoms that frequently contact a particular bound ligand during molecular dynamics or related simulations [42], [43], [44]. Significant binding site contacts can be transformed into pharmacophores [38], [39], [40], [41], [42], [43], [44]. A related concept to LRCFs is the interaction fingerprints (IFPs) [45], [46], [47], [48], [49], [50], [51], which were used as postdock equivalents to scoring functions [45], [46], [52] or as means for virtual screening by similarity search [46], [49], [53], [54]. Machine learning (ML) in drug discovery involves the implementation of statistical means for learning and predicting molecular properties [55], [56], [57], [58], [59], [60], [61], [62]. The following are popular ML algorithms used in computer-aided drug design and discovery and were evaluated in the current research: Random Forest (RF) [61], [63]; Naïve Bayesian (NB) [64], [65], [66], [67], [68]; eXtreme Gradient Boosting (XGBoost) [69], [70]; K-nearest neighbors (kNN) [71]; Support vector machine (SVM) [72], [73]; probabilistic neural networks (PNN) [74], [75], [76], [77], [78], [79]; and multilayer perceptron (MLP) [80], [81]. However, ML needs to be validated vis-à-vis statistical accuracy. Bootstrapping (BS) is a commonly used statistical approach to assign accuracy values for ML models. BS is a resampling method that uses random sampling with replacement to assign accuracy measurements to sample estimates particularly in situations of limited training data [82], [83], [84]. We herein propose to use multiple docked poses (up to hundreds per ligand), generated by a number of docking engines/scoring functions for a list of active and inactive ligands, to bootstrap bioactivity ML classifiers using LRCFs descriptors. The rational is based on a plausible assumption: Since docking algorithms implement reasonable enthalpy estimating methods, then all docked conformers/poses are enthalpically plausible, particularly for potent ligands. Therefore, ML-based convergence, among numerous docked poses/conformers of potent ligands, on certain group of atom contacts within the binding site highlights the significance of those contacts and their collective ability to classify docked virtual screening hits, i.e., as active or inactive. This method is reminiscent of 4D-QSAR, which uses molecular dynamics simulations to create conformation-dependent molecular descriptors [179]. Still, our approach is less computationally extensive and more suited for virtual screening purposes. We applied this innovative concept, i.e., bootstrapping ML with multiple docked poses, to virtually screen and repurpose known drug molecules against the COVID-19-related enzyme transmembrane serine protease 2 (TMPRSS2). TMPRSS2 is a cell surface serine protease involved in priming entry proteins necessary for respiratory viruses, e.g., influenza and COVID19, to cross cellular membranes [85], [86], [87], [88], [89], [90], [91], [92], [93], [94]. Several reversible [90] and irreversible [95], [96] inhibitors have been reported for TMPRSS2 [86], [88], [89], [90], [91], [92], e.g., nafamostat and camostat [97], [98], [99], [100], [101], [102]. TMPRSS2 had no known crystallographic structure until recently (PDB code: 7MEQ, Released 2021–04-21), nevertheless, it was successfully modeled based on its close homologue hepsin (PDB code 5ce1) [92]. Incidentally, our proposed approach of bootstrapping ML with multiple docked poses is expected to be particularly useful in such a case where ligand–receptor crystallographic structure is unavailable to help identify critical ligand-receptor binding interactions. Irreversible “covalent” inhibitors usually include reactive electrophilic “warheads” capable of reacting with nucleophilic center(s) within targeted binding sites [103], [104], [105], [106]. However, for successful covalent inhibition, irreversible bond formation should be preceded by reversible recognition [107], [108] that specifically allows the warhead close proximity to the targeted nucleophilic center (Ser478 in TMPRSS2) [107], [108]. In fact, presence of reactive warhead in particular inhibitor represents an added value to enhance bioactivity rather than an essential precondition for bioactivity. This is why we were prompted to use docking tools developed for reversible binders to bootstrap and generate ML models intended for the identification of covalent binders, as shown below. The following are reported warheads against serine hydrolases. (i) β-lactam rings [106], [109]. (ii) Carbamates [106], [109], [110], [111] (iii) Nitriles [112], [113]. Two famous nitrile-based protease covalent inhibitors are vildagliptin and saxagliptin [114], [115].(iv) Michael acceptors [107], [116], [117], [118], [119]. (v) Nitro groups [120], [121], [122], [123], [124], [125], [126], [127], [128], [129], [130], [131], [132]. (vi) Aromatic esters. [97], [133], [134], [135], [136]. Table 1 shows the different warheads and appropriate example on each.
Table 1

Major warheads used for covalent inhibition of serine proteases, their mechanisms of action, and examples on each.

WarheadNucleophile-Electrophile ChemistryExample
Target
StructureName
β-LactamL-647957Elastase
CarbamateCiluprevirNS3/4A protease (Hepatitis C)
NitrileSaxagliptinDipeptidyl Peptidase 4
Michael AcceptorsEWG: Electron withdrawing groupSyringolin Aβ5 chymotrypsin-like proteasomal subunit
Nitro3-Nitro-propionateIsocitrate lyase
Aromatic esterNafamostatProstasin
Major warheads used for covalent inhibition of serine proteases, their mechanisms of action, and examples on each. The computational workflow in this project is shown in Fig. 1. Firstly, a diverse set of known inhibitors are docked into TMPRSS2 binding pocket using 3 docking engines and 9 scoring functions. The docked ligands include active and inactive compounds, i.e., against TMPRSS2, of reversible and irreversible binding capacity. Reversible and irreversible active inhibitors are rather easy to define. However, inactive reversible and irreversible inhibitors warrant more explanation: These are molecules with or without covalent warheads, respectively, which were explicitly reported as being inactive against TMPRSS2.
Fig. 1

Computational workflow implemented in the current project.

Computational workflow implemented in the current project. The collected compounds were ionized, tautomerized and docked/scored into TMPRSS2. Identical or closely similar docked poses were filtered out using proper RMSD filter. Reasonable docked poses (as judged by plausible consensus among docking engines and scoring functions) were split into training and testing sets for ML. Several ML methods were evaluated and the best were used to predict the TMPRSS2 bioactivity classification of docked poses of FDA-approved drugs. Promising hits were further evaluated by covalent docking and molecular dynamics simulations. Three hits were found to be potential potent inhibitors of TMPRSS2, namely, capreomycin, aspoxicillin and fosamprenavir.

Material and methods

Data collection

The literature was carefully searched for TMRPSS2 inhibitors. The search identified 1064 TMPRSS2 inhibitors that can be unequivocally classified as “active” (IC50 ≤ 1000 nM) or “Inactive” (IC50 ≥ 10,000 nM). Details about these compounds are as follows: Crystallographic ligands bound to the close homologue of TMPRSS2, human hepsin, were collected and considered. In one crystallographic complex, namely, 5ce1, the ligand is reported to have anti-TMPRSS2 IC50 of 100 nM [137] and was therefore considered as “active, while in another crystallographic complex, namely, 1p57, the ligand is reported to have IC50 value of 40,000 nM, and accordingly was considered as “Inactive”. Additionally, a group of 99 reversible TMPRSS2 inhibitors were collected [90], out of which 88 compounds were reported to have Ki values ≤ 1000 nM, and were therefore categorized as “actives”, while 2 compounds were reported with Ki values ≥ 10,000 nM and were categorized as “inactives”. Two additional covalent-bond forming TMPRSS2 inhibitors were also included as “actives”, namely, nafamostat (IC50 = 100 nM) and camostat (IC50 = 1000 nM) [97]. Both contain aromatic ester warheadsand inhibit TMPRSS2 by covalent bonding [138], [139]. Additionally, 972 established “inactives” were collected [97], out of which 150 were equipped with covalent warheads (30 nitro compounds, 29 Michael acceptors, 24 nitriles, 35 carbamates, 23 β-lactams and 9 aromatic esters). However, to reduce the number of modeled compounds, it was decided to use principal component analysis (PCA) to visually select diverse representative molecules from the collected compounds as in Fig. 2.
Fig. 2

Three-dimensional plots showing three main principal components based on 8 physicochemical descriptors (LogP, Molecular Weight, hydrogen bond donors, hydrogen bond acceptors, Rotatable Bonds, Number of Rings, Number of Aromatic Rings, Molecular Fractional Polar Surface Area) calculated for (A) All collected “actives” (red cubes, ■) compared to all established “inactives” (blue spheres, ●), (B) “actives” used in current modelling (red cubes, ■) compared to all collected compounds (blue spheres, ●), (C) “inactives” used in current modelling (green spheres, ●), compared to all collected compounds (blue spheres, ●). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Three-dimensional plots showing three main principal components based on 8 physicochemical descriptors (LogP, Molecular Weight, hydrogen bond donors, hydrogen bond acceptors, Rotatable Bonds, Number of Rings, Number of Aromatic Rings, Molecular Fractional Polar Surface Area) calculated for (A) All collected “actives” (red cubes, ■) compared to all established “inactives” (blue spheres, ●), (B) “actives” used in current modelling (red cubes, ■) compared to all collected compounds (blue spheres, ●), (C) “inactives” used in current modelling (green spheres, ●), compared to all collected compounds (blue spheres, ●). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.) This shortlisted the modelled compounds into 107 divided into 15 “actives” (of which two are irreversible and 13 reversible inhibitors) and 92 “inactives” (of which 12 have nitro warheads, 9 with nitrile warheads, 6 Michael acceptors, 18 carbamates, 16 β-lactams, 2 aromatic esters, one α-keto-amide and 26 compounds lacking any reactive warheads). Table 2 shows the chemical structures of the modeled compounds and their reported bioactivities (if available) [90], [97], [137], [140].
Table 2

Chemical structures of the modeled compounds and their reported bioactivities.

Compound
StructureIC50 or Ki (nM)WarheadReference
NumberName
1Nafamostat100Aromatic ester[97]
2*Camostat1000Aromatic ester[97]
35ce1-Hepsin100NonePDB code: 5CE1
4[76]a8None[90]
5*[24]a19None[90]
6[66]a20None[90]
7[25]a19None[90]
8[54]a45None[90]
9*[71]a60None[90]
10*[31]a160None[90]
11[69]a220None[90]
12[92]a0.9None[90]
13[93]a4None[90]
14*[27]a21None[90]
15[29]a50None[90]
16CefdinirInactive**β-lactam[97]
17CefditorenInactive**β-lactam[97]
18CeftazidimeInactive**β-lactam[97]
19*CephalexinInactive**β-lactam[97]
20CloxacillinInactive**β-lactam[97]
21*SimeprevirInactive**None[97]
22MezlocillinInactive**β-lactam[97]
23PiperacillinInactive**β-lactam[97]
24*EprosartanInactive**MichaelAcceptor[97]
25SunitinibInactive**MichaelAcceptor[97]
26*AfatinibInactive**MichaelAcceptor[97]
27NicardipineInactive**Nitro[97]
28AmprenavirInactive**Carbamate[97]
29CapecitabineInactive**Carbamate[97]
30DiperodonInactive**Carbamate[97]
31*RitonavirInactive**Carbamate[97]
32ZafirlukastInactive**Carbamate[97]
33*CobicistatInactive**Carbamate[97]
34DoxazosinInactive**None[97]
35GefitinibInactive**None[97]
36PentamidineInactive**None[97]
37NafcillinInactive**β-lactam[97]
38AztreonamInactive**β-lactam[97]
39*CarbenicillinInactive**β-lactam[97]
40CefoperazoneInactive**β-lactam[97]
41DicloxacillinInactive**β-lactam[97]
42EzetimibeInactive**β-lactam[97]
43MeropenemInactive**β-lactam[97]
44EntacaponeInactive**MichaelAcceptor[97]
45BosutinibInactive**MichaelAcceptor[97]
46EtravirineInactive**Nitrile[97]
47AnastrozoleInactive**Nitrile[97]
48EscitalopramInactive**Nitrile[97]
49TofacitinibInactive**Nitrile[97]
50RuxolitinibInactive**Nitrile[97]
51TeriflunomideInactive**Nitrile[97]
52CimetidineInactive**Nitrile[97]
53*PinacidilInactive**Nitrile[97]
54FebuxostatInactive**Nitrile[97]
55ChloramphenicolInactive**Nitro[97]
56*FlutamideInactive**Nitro[97]
57*NifedipineInactive**Nitro[97]
58NimodipineInactive**Nitro[97]
59NisoldipineInactive**Nitro[97]
60NitazoxanideInactive**Nitro[97]
61NitrendipineInactive**Nitro[97]
62*NitrofurazoneInactive**Nitro[97]
63TinidazoleInactive**Nitro[97]
64TolcaponeInactive**Nitro[97]
65*AtazanavirInactive**Carbamate[97]
66FelbamateInactive**Carbamate[97]
67FenspirideInactive**Carbamate[97]
68LinezolidInactive**Carbamate[97]
69*LoratadineInactive**Carbamate[97]
70*MethocarbamolInactive**Carbamate[97]
71PhysostigmineInactive**Carbamate[97]
72RivastigmineInactive**Carbamate[97]
73SolifenacinInactive**Carbamate[97]
74ZolmitriptanInactive**Carbamate[97]
75RivaroxabanInactive**Carbamate[97]
76ClozapineInactive**None[97]
77GuanabenzInactive**None[97]
78LoxapineInactive**None[97]
79MethazolamideInactive**None[97]
80PhenforminInactive**None[97]
81PyrimethamineInactive**None[97]
82QuetiapineInactive**None[97]
83SildenafilInactive**None[97]
84*TerazosinInactive**None[97]
85*TrimethoprimInactive**None[97]
86AlfuzosinInactive**None[97]
87ArgatrobanInactive**None[97]
88FamotidineInactive**None[97]
89RanitidineInactive**None[97]
90DasatinibInactive**None[97]
91DidanosineInactive**None[97]
92MercaptopurineInactive**None[97]
93*DiminazeneInactive**None[97]
94ErlotinibInactive**None[97]
95GabexateInactive**Aromaticester[97]
961p57-Hepsin40,000None[138]
97[53]a10,000None[90]
98[78]a20,000None[90]
99SivelestatInactive**Aromatic ester[97]
100TelaprevirInactive**α-keto-amide[97]
101AzlocillinInactive**β-lactam[97]
102DoripenemInactive**β-lactam[97]
103*FamciclovirInactive**None[97]
104MupirocinInactive**Michael Acceptor[97]
105NizatidineInactive**Nitro[97]
106DarunavirInactive**Carbamate[97]
107ZanamivirInactive**None[97]

These compounds were used as testing compounds in machine learning.

Defined as “Inactive” by the particular reference.

Numbers in brackets represent the number of each compound in the original literature citation.

Chemical structures of the modeled compounds and their reported bioactivities. These compounds were used as testing compounds in machine learning. Defined as “Inactive” by the particular reference. Numbers in brackets represent the number of each compound in the original literature citation. Prior to docking, the modelled compounds were appropriately ionized using the Prepare Ligands protocol in Discovery Studio (version 4.5) assuming pH range of 6.5–8.5. Additionally, the same protocol was also used to generate tautomeric forms of each compound. This yielded 338 tautomeric forms (30 for the “active” inhibitors and 308 for the “inactive” members) for subsequent docking studies.

Homology modelling

Template search

Template search with BLAST and HHBlits has been performed against the SWISS-MODEL template library (last update: 2020-04-15, last included PDB release: 2020-04-10) [141], [142]. A total of 1166 templates were found.

Model building

Models are built based on the target-template alignment using ProMod3 [143]. Coordinates which are conserved between the target and the template are copied from the template to the model. Insertions and deletions are remodelled using a fragment library. Side chains are then rebuilt. Finally, the geometry of the resulting model is regularized by using a force field. In case loop modelling with ProMod3 fails, an alternative model is built with PROMOD-II [144].

Ligand modelling

The ligand present in the template structure (5ce1) was transferred by homology to the TMPRSS2 model because it satisfies the requested criteria by SWISS-MODEL: (a) The ligand is annotated as biologically relevant Hepsin inhibitor, (b) the ligand is in contact with the model, (c) the ligand is not clashing with the protein, (d) the residues in contact with the ligand are conserved between the target and the template.

Model quality estimation

The global and per-residue model quality has been assessed using tools implemented in SWISSS-MODEL including: QMEAN (Qualitative Model Energy Analysis) score and MolProbity score [145]. The former is a composite of 6 energy values within the homology model matrix related to protein nativeness with score values ≤ 4.0 indicating poor quality homology models. The later, on the other hand, combines several protein parameters including: clash score, Ramachandran Plot criteria (Ramachandran Favored and Ramachandran Outliers), rotamer outliers, C-Beta deviations (geometry problems around alpha-carbons), bad bonds, and bad angles [146].

Docking

The collected set of inhibitors (1–107, Table 1) were docked into the homology model of TMPRSS2 using 3 docking engines: LibDock [26], [147], LigandFit [148] and CDOCKER [149], [150]. The binding pocket was defined as the cavity volume occupied by hepsin ligand within the homology model. Docking details are found in Supplementary Sections SM1-SM3.

Scoring of docked poses

Highest ranking docked conformers/poses generated by LibDock, LigandFit, and CDOCKER were scored using 9 scoring functions: Jain [6], [15], LigScore1, LigScore2 [148], PLP1, PLP2 [151], PMF, PMF04 [152], [153], -CDOCKER Energy and -CDOCKER Interaction Energy [149]. LigScore1 and LigScore2 scores were calculated employing CFF force field (version 1.02) and using grid-based energies with a grid extension of 7.5 Å across the binding site. PMF scores were calculated employing cutoff distances of 12.0 Å for carbon-carbon interactions and other atomic interactions, while PMF04 scores were calculated employing cutoff values of 6.0 and 9.0 Å for carbon-carbon interactions and other atomic interactions, respectively. -CDOCKER Energy and -CDOCKER Interaction Energy were calculated using Momany-Rone ligand partial charge method. It was decided to select docked conformers/poses based on consensus among the 9 scoring functions [154], [155]. The consensus function assigned a value of 1 for any molecular pose ranked within the highest 20% by the particular scoring function; otherwise, it was assigned a zero value (i.e., fit was within the lowest 80%). Subsequently, the consensus function summed up the scores for each molecular pose/conformer and ranked the molecular orientations. Docked poses of a particular ligand that achieved consensus among at least 4 scoring functions were selected for subsequent processing.

RMSD filtering

The RMSD filter implemented in Discovery Studio 4.5 was used. This filter calculates the Root Mean Square Deviation (RMSD) of ligand poses (in Å). Only heavy atoms were included for RMSD calculation (i.e., hydrogen atoms were excluded). RMSD values were calculated with respect to all docked poses of a particular compound. Poses with an RMSD <2.0 Å were considered duplicates of which only the one having higher consensus score was retained.

Ligand-Receptor fingerprints

The docked poses/conformers of each modelled compound were evaluated to identify their closest binding site atoms. A binding site atom that occurs within ≤ 2.5 Å of any atom within docked ligand pose is allocated an intermolecular contact value of “one”, otherwise it is given a contact value of “zero”. Distance evaluations were automatically performed employing an in-house made FORTRAN package. Eventually, a 2D matrix is built where each row corresponds to docked ligands poses and each column corresponds to different binding site atom. The matrix is filled with binary code, whereby “zeros” correspond to inter-atomic distances >2.5 Å and “ones” for distances binding site atoms at distances ≤ 2.5 Å.

Machine learning

Seven orthogonal ML were scanned, namely, RF, XGBoost, kNN, PNN, SVM, NB, and MLP.

Random Forest (RF)

RF is a multipurpose ML strategy for classification based on ensemble of Decision Trees (DTs) [61]. Each tree predicts a classification independently and “votes” for the related class. Most of the votes decide the overall RF predictions [70]. We implemented RF learner node within KNIME Analytics Platform (Version 4.1.3) with the following settings: Splitting criterion is the Information Gain Ratio (which normalizes the standard information gain by the split entropy to overcome any unfair preference for nominal splits with many child nodes), Number of trees = 100. No limitations were imposed on the number of levels or minimum node size. The accuracy was calculated using out-of-bag internal validation.

eXtreme Gradient Boosting (XGBoost)

eXtreme Gradient Boosting (XGBoost, or XGB) relies on the ensemble of weak DT-type models to create boosted DT-type models [69], [70], [159]. We implemented the XGBoost Learner node within KNIME Analytics Platform (Version 4.1.3) with the following settings: Tree booster was implemented with depth wise grow policy, boosting rounds = 100, Eta = 0.3, Gamma = 0, maximum depth = 6, minimum child weight = 1, maximum delta step = 0, subsampling rate = 1, column sampling rate by tree = 1, column sampling rate by level = 1, lambda = 1, Alpha = 0, sketch epsilon = 0.03, scaled position weight = 1. Maximum number of bins = 256, Sample type (uniform), Normalize type (tree), and Dropout rate = 0.

k-Nearest Neighbors (kNN)

The kNN classifier depends on a distance learning methodology that calculates the activity value of an unknown member based on the bioactivities of a certain number (k) of nearest neighbors (kNNs) in the training set. In this classifier, the similarity is measured by a distance metric [160]. We implemented kNN Learner node within KNIME Analytics Platform (Version 4.1.3) with k scanned from 3 to 6.

Probabilistic neural network (PNN)

Trains a probabilistic neural network (PNN) based on the Dynamic Decay Adjustment method on labeled data using Constructive Training of Probabilistic Neural Networks as the underlying algorithm [161], [162]. We implemented PNN Learner node within KNIME Analytics Platform (Version 4.1.3) using PNN theta minus = 0.2 and theta plus = 0.4 and without specifying maximum number of epochs so that the PNN process is repeated until stable rule model is achieved.

Naïve Bayesian (NB)

NB is a simple classifier whereby class labels are predicted and assigned to external observations based on vectors of descriptors for some finite set of training observations. NB classifier assumes each descriptor to contribute independently to the probability that certain observation (e.g., compound) belongs to a particular class (e.g., active or inactive) [163], [164]. The probability of certain observation to belong to certain class is the multiplication of the individual probabilities of that class within each individual descriptor [164], [165], [166]. We implemented NB learner node within KNIME Analytics Platform (Version 4.1.3) with the following parameters: Default probability = 0.0001, minimum standard deviation = 0.0001, threshold standard deviation = 0.0 and maximum number of unique nominal values per attribute = 20.

Multilayer perceptron (MLP)

It is an implementation of the RProp algorithm for multilayer feed forward networks [167]. MLP has the capacity to learn nonlinear models in real time. MLP can have one or more nonlinear hidden layers between the input and output layers. For each hidden layer, different numbers of hidden neurons can be assigned. Each hidden neuron gives a weighted linear summation for the values from the previous layer, and the nonlinear activation function is followed. The output values are reported after the output layer transforms the values from the last hidden layer. We implemented MLP learner node within KNIME Analytics Platform (Version 4.1.3) with the following optimized parameters: Maximum number of iterations = 100, Number of hidden layers = 3, and number of hidden neurons per layer = 100.

Support vector machine (SVM)

Support vector machine (SVM) chooses a small number of boundary instances called support vectors to create discriminatory function to separates training observations into distinct classes with widest possible boundaries. SVM allows the effective use of a multitude of kernels to allow classification. A key feature of SVMs is the attempt to minimize the error on training data and reduce the computational complexity of models to avoid over fitting by tuning the factors involved in the process [168], [73]. Two SVM method were attempted, namely, C-SVM and nu-SVM. C and nu are regularization parameters that penalize misclassifications. C ranges from 0 to infinity while nu ranges between 0 and 1 and represents the lower and upper bound on the number of examples that are support vectors and that lie on the wrong side of the hyperplane. The following default settings were used in both SVM methods as implemented in the WEKA-KNIME (version 4.1.3) LibSVM node, these include: Kernel Cache (Cache Size = 40.0), kernel type is radial basis function: exp(-gamma*|u-v|^2), and loss function is 0.1, kernel coefficients epsilon = 0.001 and Gamma = 0.00. However, in nu-SVM the optimized nu value of 0.1 was used (identified using Bayesian Optimization (TPE) implemented in KNIME).

ML model evaluation

ML models were evaluated by calculating their accuracies (Eq. (1)) and Cohen’s kappa values (Eq. (2)) [169], [170], [171] against the training and testing sets (Table 2).where, TP is the true positive (correctly classified actives), TN true negatives (truly classified inactives), and n is the total number of evaluated compounds.where Po is the relative observed agreement among raters (i.e., accuracy), and Pe is the hypothetical probability of chance agreement. This is done by using the observed data to calculate the probabilities of each observer randomly seeing each category. If the raters are in complete agreement, then kappa = 1. If there is no agreement among the raters other than what would be expected by chance (as given by Pe), kappa = 0. Negative Cohen’s kappa value implies the agreement is worse than random [172]. Evaluation against the training set involves removing 20% (i.e., leave-20%-out or 5-fold cross-validation) of the data points (i.e., compounds), then building the particular ML model from the remaining data. The model is then used for classifying the removed compounds. The process is repeated until all training data points are removed from the training list and predicted at least once. Accuracy is calculated based on comparing classification results with actual bioactivity classes. On the other hand, evaluation against the testing set involves calculating the accuracy Cohen’s kappa of the particular ML model by comparing its classification results with the actual bioactivity classes of the testing set [173], [174].

Virtual screening

To collect FDA-approved drug molecules for virtual screening, we employed SMARTS codes corresponding to reactive warheads in Table 1 to screen an in house built list of FDA-approved drugs (2111 molecules) for molecules of covalent warheads. Screening was performed within DiscoveryStudio (version 4.5) environment. The resulting list was then compared with TMPRSS2-related ligands reported in the literature (active and inactive) [90], [97]. Only untested compounds (155 molecules) were kept for subsequent virtual screening as potential TMPRSS2 inhibitors. Supplementary Table SM-2 shows screened compounds and their corresponding chemical structures and warheads. The evaluated compounds were prepared, docked, scored, RMSD-filtered and have their LRCFs determined exactly in the same manner as described for the modelled training and testing compounds (sections 2.3–2.4). Subsequently, their bioactivity classifications were predicted using successful ML models.

Molecular dynamics

Docked poses of capreomycin, aspoxicillin, fosamprenavir corresponding to highest consensus scores were solvated by VMD in TIP3 water molecules. The complexes were then neutralized with NaCl and the systems were minimized by conjugate gradient minimization until 10 KJ/mol/nm. Simulations were run using CHARMM22 force field for proteins and general forcefield for drug-like molecules (CGenFF) (https://cgenff.paramchem.org) for ligands. NVE ensemble was arbitrary applied. SHAKE constraints were applied to all hydrogen atoms. Step size was set to be 2 femto-seconds (fs). During the heating phase, the temperature of the system was raised linearly from 0 to 310 K over 155,000 steps with a time step of 0.002 ps. Equilibration of the solvent molecules was achieved in 1,000,000 steps of simulation (2 ns). This was followed by 200 ns of production simulation for data collection, during which structures were stored every 1 ns. Minimization, heating, equilibration, and heating–cooling simulations were performed using OpenMM software (http://openmm.org/). Results from all simulationswere visualized using Discovery Studio (version 4.5, Biovia, USA).

Covalent docking

Capreomycin and aspoxicillin were covalently docked into TMPRSS2 homology model using CovDock software [175] through Maestro Molecular Modeling Interface (Schroedingers Inc., USA). CovDock begins with Glide docking to a receptor with the reactive residue trimmed to alanine. The reactive residue is then added and sampled to form a covalent bond with the ligand in different poses. Covalent complexes are minimized using the Prime VSGB2.0 energy model to score the top covalent complexes. An apparent affinity score, based on the Glide score of pre-reactive and post-reactive poses, is also calculated to estimate binding energies for use in virtual screening. The following docking settings were implemented: The docking mode was set to Pose Prediction (thorough), Energy cutoff to retain poses for further refinement = 2.5 kcal/mol with a maximum number of poses = 200, the reactive residue is S478, docked ligand(s) is(are) confined to an enclosing box of ≤ 20 Å, the center of the enclosing box is set to the hydroxyl of the S478. In each docked ligand, the corresponding reaction type was selected from the available drop-list (i.e., β-lactam addition and Michael acceptor). A single top-ranking docked pose was reported as output.

Results and discussion

Homology modeling of TMPRSS2

Since TMPRSS2 had no known crystallographic structure at the time of preparing this manuscript, it was necessary to build an appropriate homology model for this target. Sequence data for TMPRSS2 was obtained from Pubmed (GenBank accession number NP_001128571). Then BLAST and HHBlits search (141, 142) for pairwise sequence-to-sequence alignment was performed to search and identify close template structure(s) in the protein databank (). However, we only opted for template proteins that have at least 65% sequence coverage with TMPRSS2 and have their co-crystallized bound ligands successfully transferred by SWISS-MODEL to the proposed binding pocket of TMPRSS2. In SWISS-MODEL, for a template bound ligand to be transferred to the corresponding homology model, the ligand should be biologically related to the modelled protein, has favorable interactions and no clashing contacts within model atoms, and the contacting residues are conserved between the target and the template. Subsequent homology modelling was performed using SWISS-MODEL on the resulting alignments [143], [144], [156], [157], [158]. Hepsin crystallographic structure 5ce1 was selected as template as it scored 66% sequence coverage, had its bound ligand successfully transferred to the homology model, and achieved the highest SWISS-MODEL Global Model Quality Estimate (GMQE = 0.49), despite an overall sequence identity of 33.43%. Out of 35 binding site amino acids in the homology model, 23 are identical (66%) to their counterparts in the template, while 4 amino acids were similar (11%). Overall, amino acid homology with TMPRSS2 in the proposed binding site is 77% with the catalytic serine (S478) being conserved. Fig. 3A compares TMPRSS2 sequence to the template protein Hepsin.
Fig. 3

Criteria for the resulting TMPRSS2 homology model. (A) Alignment of TMPRSS2 (GenBank: NP_001128571) and Hepsin crystallographic template (PDB code: 5ce1, resolution 2.5 Å) yielding homology model. Gaps are shown as (–), identical and similar residues are highlighted in green and yellow, respectively. Binding site amino acids are indicated with asterisks, the catalytic S478 is marked with red asterisk. (B) Ramachandran plot of the homology model amino acids. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Criteria for the resulting TMPRSS2 homology model. (A) Alignment of TMPRSS2 (GenBank: NP_001128571) and Hepsin crystallographic template (PDB code: 5ce1, resolution 2.5 Å) yielding homology model. Gaps are shown as (–), identical and similar residues are highlighted in green and yellow, respectively. Binding site amino acids are indicated with asterisks, the catalytic S478 is marked with red asterisk. (B) Ramachandran plot of the homology model amino acids. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.) The resulting homology structure was evaluated employing structure assessment tools within SWISS-MODEL. The results are as follows (in brackets): QMEAN (-1.48), MolProbity score (1.91), clash score (6.0), Ramachandran Favored (92.17%), Ramachandran Outliers (1.16%) (See Fig. 3B), rotamer outliers (1.35%), C-Beta deviations (reporting geometry problems around alpha-carbons:7 amino acids), bad angles (37 amino acids) and no bad bonds. Interestingly, all binding site amino acids were devoid of any significant modelling-related artefacts as in Fig. 4A. Thankfully, the binding site of a recently release crystallographic structure of TMPRSS2 (7 meq, released 2021-04-21) fitted closely the binding site of our homology model within a radius of 12 Å surrounding the co-crystallized ligand. The alignment scored RMSD values of 1.76 Å and 1.74 Å based on α-carbons (Ca) and main chain carbons, respectively. Fig. 4B shows the alignment between the binding sites of the homology model and corresponding X-ray structure of TMPRSS2. Still, the side of chain of the critical binding amino acid Gln438 in the X-ray structure (corresponding to Gln475 in the homology model) is missing including the terminal amide moiety. This major artefact undermines the validity of docking studies using the crystallographic structure and adds merits to the homology model.
Fig. 4

TMPRSS2 binding site (A) Detailed view of many TMPRSS2 homology model binding site residues (green backbone) compared to their corresponding counter parts in template serine protease Hepsin (PDB ID: 5ce1, red backbone) X-ray structure including the crystallographic bound pose of hepsin inhibitor 2-[6-(1-hydroxycyclohexyl)pyridin-2-yl]-1H-indole-5-carboximidamide (compound 3 in Table 2). Most binding site amino acids are correctly aligned (B) Main binding site residues of TMPRSS2 homology model (green backbone) aligned to their counterparts in a recently released TMPRSS2 X-ray crystalographic structure (PDB ID: 7 meq, blue backbone) including the bound fragment of nafamostat (compound 1 in Table 2). The dotted arrow points to the missing side chain of Gln438 (corresponding to Gln475 in the homology model). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

TMPRSS2 binding site (A) Detailed view of many TMPRSS2 homology model binding site residues (green backbone) compared to their corresponding counter parts in template serine protease Hepsin (PDB ID: 5ce1, red backbone) X-ray structure including the crystallographic bound pose of hepsin inhibitor 2-[6-(1-hydroxycyclohexyl)pyridin-2-yl]-1H-indole-5-carboximidamide (compound 3 in Table 2). Most binding site amino acids are correctly aligned (B) Main binding site residues of TMPRSS2 homology model (green backbone) aligned to their counterparts in a recently released TMPRSS2 X-ray crystalographic structure (PDB ID: 7 meq, blue backbone) including the bound fragment of nafamostat (compound 1 in Table 2). The dotted arrow points to the missing side chain of Gln438 (corresponding to Gln475 in the homology model). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Data collection, Docking, scoring and Machine Learning.

The literature was carefully searched for TMRPSS2 inhibitors. The search identified 1064 inhibitors that can be unequivocally classified as “active” (IC50 ≤ 1,000 nM) or “inactive” (IC50 ≥ 10,000 nM) as in the Data Collection section under Experimental. However, we opted to select a subset of inhibitors for performing this study partially to minimize the computational cost, but mainly, to assess the possibility of exploiting limited number of ligands for building successful and predictive machine learning models. Limitations related to number of available ligands is often encountered in drug discovery projects particularly those involving new biotargets. To select diverse representative molecules from the collected compounds, it was decided to use principal component analysis (PCA) combined with visual selection, as in Fig. 2. Eventually, 107 compounds were selected for modelling, including: 15 actives (of which two are irreversible) and 92 inactives (of which 64 exhibit covalent warheads). Table 2 shows the chemical structures of the modeled compounds and their bioactivities (90, 97, 137, 140). Incidentally, we attempted to use automatic means for selecting diverse subset employing the Find Diverse Molecules protocol within Discovery Studio (it attempts to select diverse sets based on a maximum dissimilarity algorithm using Tanimoto distance and molecular fingerprints), however, the selected list had limited number of actives (only two molecules). Moreover, it is of comparable diversity criteria to our visually-selected subset (as in Supplementary Table SM-1). The modeled compounds were appropriately ionized and tautomerized yielding 338 tautomeric forms ready for docking. The structures were then docked into the binding pocket of TMPPRSS2 homology model using three docking engines, namely, LibDock (26,147), CDOCKER (149) and LigandFit (148). This yielded 65,941 docked conformers/poses. The resulting docked poses were then scored by 9 docking scoring functions. Docked poses of each compound that scored within top 20% of at least 4 scoring functions (i.e., consensus among 4 scoring functions) were retained for subsequent processing and the rest were discarded resulting in 12,457 docked poses/conformers. The reason for this restriction is to limit ML modelling to high-quality docked poses vis-à-vis binding enthalpies in order to minimize ML noise caused by low quality poses. The veracities of the implemented docking settings were evaluated by two means: (i) Looking at the docked poses of co-crystallized hepsin ligand (PDB code: 5ce1, compound 3 in Table 2) and nafamostat (compound 1 in Table 2). In the former case we were interested to see if the docked pose(s) approximate the experimental pose of the ligand despite being generated for another protein (i.e., hepsin). The close analogy between hepsin and TMPRSS2 binding sites (as in Fig. 4) should correct this discrepancy. On the other hand, the docked poses of nafamostat should probe the docking veracity by looking at the distance between nafamostat’s warhead (the ester carbon atom) and the nucleophilic oxygen of Ser478. Proximity between these two atoms below 3.3 Å should allow successful covalent bond formation (see section Molecular Dynamics Simulation and Covalent Docking below), and therefore demonstrates the veracity of the docking settings. Needless to say, that nafamostat covalently binds TMPRSS2 Ser478 hydroxyl as in Fig. 4B. Fig. 5 shows the best docked poses in the two cases. In compound 3 case, 2 and 14, out of 28 docked poses are within RMSD 2.0 and 4.0 Å, respectively, from the crystallographic pose, while one of the top-ranking docked poses (i.e., 4th based on consensus score and 1st based on 6 out of 9 scoring functions) achieved RMSD of 1.18 Å from the crystallographic pose, as in Fig. 5A. Moreover, upon docking 3 into its native protein, hepsin (PDB code: 5ce1), the same docking-scoring settings generated 89 docked poses, all of which were within RMSD of 1.2 Å from the crystallographic bound pose, as in Supplementary Fig. SM-1. Needless to say, success to reproduce the crystallographic pose (i.e., RMSD ≤ 2.0 Å) among top-ranked solutions is considered sufficient validation for certain docking-scoring settings [182], [183], [184].
Fig. 5

Assessment of docking veracity. (A) The co-crystallized pose of 3 (Table 2, Green skeleton) within hepsin (PDB code: 5ce1) extracted into the binding site of TMPRSS2 homology model and compared with high ranking docked pose of the same molecule within TMPRSS2 binding site (red skeleton). (B) Two high ranking docked poses of nafamostat (compound 1 in Table 2) with their ester carbons at close proximities (ca. 3.1 Å) to the nucleophilic hydroxyl of Ser478. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Assessment of docking veracity. (A) The co-crystallized pose of 3 (Table 2, Green skeleton) within hepsin (PDB code: 5ce1) extracted into the binding site of TMPRSS2 homology model and compared with high ranking docked pose of the same molecule within TMPRSS2 binding site (red skeleton). (B) Two high ranking docked poses of nafamostat (compound 1 in Table 2) with their ester carbons at close proximities (ca. 3.1 Å) to the nucleophilic hydroxyl of Ser478. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.) Unfortunately, it is not possible to perform similar analysis in nafamostat (compound 1) case because the corresponding complex (PDB code: 7 meq) includes only fragment of the bound ligand following covalent bond formation with Ser478, i.e., it is not possible to compare whole docked nafamostat with partial crystallographic fragment in 7 meq. As an alternative we opted to gauge the success of the docking-scoring settings by measuring the distance separating the warhead of docked nafamostat poses from the hydroxyl of Ser478 (as in Fig. 5B). Interestingly, 4 out of 285 docked poses were positioned within the binding site such that the warhead ester carbon of nafamostat occurs within 3.3 Å from the nucleophilic hydroxyl of Ser478. Two poses are shown in Fig. 5B. Such close proximity should allow successful covalent bond formation (177), and therefore, further validates our docking settings. (ii) By assessing the difference between docking-scoring values calculated for docked poses of active versus inactive compounds (training and testing alike). T-test analysis indicates that 8 out of 9 docking-scoring functions (namely, LigScore1, LigScore2, -PLP1, -PLP2, Jain, -PMF, -PMF04, -Cdocker Interaction Energy) have generated statistically significant scoring values for docked poses of active compounds compared to inactive counterparts (training and testing alike), as in Supplementary Table SM-7. These results further support the notion that the implemented docking/scoring conditions segregate active compounds from inactives by allocating them distinct binding poses and regions within the binding site. We hypothesize that convergence of particularly high-quality docked poses, corresponding to active ligands, on certain unique binding site contacts (i.e., not contacted by high quality docked poses of inactive ligands) highlights the significance of those binding site points as activity discriminators. Fig. 6 illustrates this point: Clearly, the optimal docked pose of active inhibitor 12 (Table 2) fills distinct space within the binding pocket from that occupied by the best docked pose of the analogous inactive inhibitor 97. This trend is also apparent upon comparing multiple high-ranking docked poses for the same compounds albeit less obvious to the human eye necessitating ML usage. Moreover, this consistency between docked poses and bioactivities further supports the veracity of docking settings.
Fig. 6

Docking/bioactivity consistency. (A) Highest-ranking docked poses of compounds 12 (Table 2, IC50 = 0.9 nM, green skeleton) and 97 (Table 2, IC50 = 10,000 nM, red skeleton). (B) Docked poses of the same compounds as were used for generating LRCFs and subsequent ML. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Docking/bioactivity consistency. (A) Highest-ranking docked poses of compounds 12 (Table 2, IC50 = 0.9 nM, green skeleton) and 97 (Table 2, IC50 = 10,000 nM, red skeleton). (B) Docked poses of the same compounds as were used for generating LRCFs and subsequent ML. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.) To eliminate repetitions in the docked poses/conformers list, which might emerge due to the use of multiple docking engines on the same set of ligands, it was decided to filter out docked poses of RMSD <2.0. This step reduced the number of poses for subsequent ML to 12073. In summary, it can be concluded that the overall effect of docking/scoring/RMSD filtering is enriching the ML list from just 108 observations (i.e., TMPRSS2 ligands) to 12,073 observations (corresponding docked poses) in a process reminiscent of statistical bootstrapping albeit the multiple docked poses represent realistic permutations rather than simple repetitive sampling. Subsequently, the docked poses were used to generate LRCFs such that binding site atoms that are positioned within 2.5 Å from docked poses are given a binary code of 1.0, otherwise they are annotated as zeros. Additionally, scoring values of docked poses (that survived the RMSD filter) calculated by all 9 scoring functions were also added as descriptors. Thereafter, the modeled list was randomly split into training (~82%, 9890 poses corresponding to 85 compounds, see Table 2) and testing (~18%, 2183 poses corresponding to 23 compounds, see Table 2) sets. The splitting was performed based on the modelled compounds (prior to docking, Table 2). Approximately 16% and 84% of the training poses correspond to active and inactive ligands, respectively. On the other hand, the testing list included ~28% and ~72% poses corresponding to active and inactive ligands, respectively. The apparent imbalance between poses corresponding to active and inactive ligands is attributed to the fact that known “active” TMPRSS2 inhibitors are rather limited in number and diversity, while their inactive counterparts are quite numerous and very diverse allowing them to generate significantly more docked poses. This imbalance is not necessarily disadvantageous as the excess of inactive poses should better demark the “forbidden” regions within the binding site. Nevertheless, such imbalance necessitates the use of Cohen’s kappa [172], [176] as additional tool for evaluating the success/failure of a particular ML. Seven MLs were evaluated against the training and testing set, namely, Xgboost, SVM, RF, PNN, NB, kNN, and MLP. LRCFs and/or docking-scoring values were evaluated as input explanatory descriptors, as in Table 3.
Table 3

Accuracy and Cohen’s Kappa values for ML models developed using different ML learners combined with LRCFs and/or scoring function values as descriptors.

LearnerDescriptorsAccuracy
Cohen's Kappa
L20%outaTestingbRe-substitutioncL20%outaTestingbRe-substitutionc
XgboostLRCFs and Scoring Functions0.960.901.000.850.731.00
LRCFs0.910.830.970.630.530.89
Scoring0.950.881.000.800.681.00
C-SVMLRCFs and Scoring Functions0.940.890.950.730.710.80
LRCFs0.860.760.870.220.210.32
Scoring Functions0.840.721.000.000.001.00
nu-SVMdLRCFs and Scoring Functions0.950.900.990.800.720.95
LRCFs0.870.800.940.510.450.77
Scoring Functions0.840.721.000.020.001.00
RFLRCFs and Scoring Functions0.95g0.901.000.79g0.720.98
LRCFs0.90g0.810.960.57g0.460.83
Scoring Functions0.95g0.891.000.78g0.691.00
PNNLRCFs and Scoring Functions0.910.880.920.620.660.67
LRCFs0.840.720.840.000.000.00
Scoring Functions0.920.880.930.630.660.68
Naïve BayasianLRCFs and Scoring Functions0.840.720.840.010.000.00
LRCFs0.840.720.840.000.000.00
Scoring Functions0.870.810.870.490.480.48
kNNeLRCFs and Scoring Functions0.930.870.940.690.640.76
LRCFs0.880.800.900.430.370.52
Scoring Functions0.930.870.940.680.640.75
MLPfLRCFs and Scoring Functions0.890.820.990.560.490.96
LRCFs0.890.820.990.570.500.96
Scoring Functions0.870.790.880.420.380.44

L20%out: Leave 20% out cross-validation for accuracy and Cohen’s Kappa.

Testing: Accuracy and Cohen’s Kappa determined against the testing set (marked with asterisks in Table 2).

Re-substitution: Accuracy and Cohen’s Kappa determined by applying the particular ML model to predict the same training compounds used to build the model.

Performed using optimized nu value of 0.1.

Performed using optimized k value of 6 (number of neighbors).

Performed using 3 hidden layers, 100 neuron per layer, and 100 iterations (epochs).

RF ML models were validated by Out-of-Bag validation instead of Leave 20% out cross-validation for accuracy and Cohen’s Kappa.

Accuracy and Cohen’s Kappa values for ML models developed using different ML learners combined with LRCFs and/or scoring function values as descriptors. L20%out: Leave 20% out cross-validation for accuracy and Cohen’s Kappa. Testing: Accuracy and Cohen’s Kappa determined against the testing set (marked with asterisks in Table 2). Re-substitution: Accuracy and Cohen’s Kappa determined by applying the particular ML model to predict the same training compounds used to build the model. Performed using optimized nu value of 0.1. Performed using optimized k value of 6 (number of neighbors). Performed using 3 hidden layers, 100 neuron per layer, and 100 iterations (epochs). RF ML models were validated by Out-of-Bag validation instead of Leave 20% out cross-validation for accuracy and Cohen’s Kappa. Clearly from Table 3, all learners achieved excellent accuracies against testing and training sets regardless of using LRCFs and/or scoring functions values as input descriptors. This indicates that the data is self-consistent. Still, in most cases the combination of both descriptor classes generally yielded better accuracies compared to either class alone (i.e., LRCF or scoring functions). Intriguingly, Cohen’s Kappa unveiled significant differences between learners not seen from accuracy values: In three leaners, the use of a single class of descriptors completely failed to yield significant Kappa, i.e., C-SVM, nu-SVM, PNN. In the particular case of NB, combining both classes, or using LRCFs alone, failed to yield significant Cohen’s Kappa values. On the other hand, in kNN and MLP cases, the combined use of both descriptor classes (LRCFs and scoring functions) failed to have any significant additive advantage over the use of one class only. However, in RF and Xgboost cases, the use of any of the two descriptor classes was successful and the combination yielded even better Cohen’s Kappa values. Overall, the varying performances of different learners vis-à-vis success of their ML models upon using differing combinations of descriptor classes points to the orthogonality of these learners and that their ML models inherently different. From the above and by comparing the behavior of different ML models in Table 3 we concluded that the best models to be used for predicting the bioactivity of screened compounds are Xgboost, RF and nu-SVM. Moreover, the orthogonality of these ML approaches prompted us to stack them in a meta-learner were each individual learner casts an equivalent vote for predicting the anti-TMPRSS2 bioactivity of screened ligand docked poses. To further exclude the possibility of chance correlation we performed y-scrambling validation experiment. In this test, 100 random bioactivity data are generated. Subsequently, each learner is challenged to use these random data to generate ML models using scoring functions and LRCFs descriptors [180]. Supporting Tables SM-4 to SM-6 show the results of the experiments employing Leave-20%-Out or Out-of-Bag cross-validations. The results show that the original non-randomized data yielded ML models of significantly superior accuracy and Cohen’s Kappa values compared to all randomized trials. The effect is particularly evident in Cohen’s Kappa values. Overall, the results support the validity of our ML models. Incidentally, upon using more stringent pose-selection criteria through imposing consensus score ≥ 8 on training and testing poses (i.e., 8 or 9 of the scoring functions ranked the selected poses within their top 20%), the best three MLs (Xgboost, RF and nu-SVM) performed slightly worse (see Supplementary Table SM-3). It is noteworthy to mention that under such restrictions (imposing consensus score ≥ 8), the number of training compounds fell from 85 to 42, whereas testing compounds decreased from 23 to 13. This is because not all screened compounds have such high ranking docked poses (consensus score ≥ 8). This reduction in training and testing data should restrict the applicability domain of the respective ML model [181]. Moreover, imposing docking consensus score ≥ 8 will undoubtedly lower the number of docking hits captured in a virtual screening campaign because many potential hits fail to reach high quality docked poses.

Virtual screening and prediction of hits’ anti-TMPRSS2 bioactivities

To utilize optimal ML models to search for potential anti-TMPRSS2 inhibitors within FDA-approved drugs, it was decided to screen 155 drug molecules of reactive warheads (Supplementary Table SM-2). All screened molecules were not tested before against TMPRSS2. Virtual screening commenced by properly ionizing and tautomerizing screened molecules. Subsequently, they were docked, scored and RMSD-filtered utilizing the same settings implemented for the training and testing sets. The resulting docked poses were then used to generate corresponding LRCFs in exactly the same manner as in the training and testing sets. Subsequently, the resulting LRCFs were substituted in the best ML models, namely, Xgboost, nu-SVM and RF, to predict the activity label of each docked pose/conformer. This resulted in a situation where each screened compound yielded a set of poses that are assigned either “active” or “inactive” labels. This prompted us to define a threshold by which to consider certain screened molecule as being promising or not, i.e., as anti-TMPRSS2, based on the ratio of docked poses/conformers predicted to be “active” compared to those predicted to be “inactive”. We decided that the most reasonable way to define such a threshold is to evaluate the active/inactive ratios within the testing set. It can be reasonably assumed the least active-to-inactive ratio of unequivocally documented active inhibitor represents an acceptable threshold for identifying potentially new active hits. Table 4 shows the percentages of active poses of testing compounds as predicted by the top three ML (i.e., Xgboost, nu-SVM and RF).
Table 4

Predicted active and inactive docked poses for testing set compounds.

Compoundsa
Predicted number of active and inactive docked poses
NumbersAnti-TMPRSS2 ActivityXgboost
nu-SVM
RF
Active PosesInactive PosesPercent Active PosesbActive PosesInactive PosesPercent Active PosesbActive PosesInactive PosesPercent Active Posesb
5Active100496.2941090.4102298.1
9Active97595.197595.199397.1
10Active901486.5792576842080.8
14Active10319999595.2102298.1
2cActive3916519.14316121.12617812.7
19Inactive2454.364112.80470
21Inactive0800803537.5
24Inactive0141011400.701410
26Inactive29540.239530.319550.1
31Inactive555028203730
33Inactive1812601293.372323.3
39Inactive018001800180
53Inactive070070070
56Inactive020020020
57Inactive013001300130
62Inactive020020020
65Inactive115011501150
69Inactive020020020
70Inactive013001300130
84Inactive09201911.10920
85Inactive41632.411660.601670
93Inactive2434.454011.13426.7
103Inactive010019100100

Compounds’ numbers and bioactivities are as in Table 2.

Determined by dividing the number of active poses by the total number poses (active + inactive).

The percent active poses of this compound (camostat, Ki = 1000 nM) were used as threshold to classify screened compounds into potential active and inactive TMPRSS2 inhibitors.

Predicted active and inactive docked poses for testing set compounds. Compounds’ numbers and bioactivities are as in Table 2. Determined by dividing the number of active poses by the total number poses (active + inactive). The percent active poses of this compound (camostat, Ki = 1000 nM) were used as threshold to classify screened compounds into potential active and inactive TMPRSS2 inhibitors. Clearly from Table 4, the anti-TMPRSS2 inhibitor 2 (camostate, Ki = 1000 nM, Table 2) shows the smallest ratios of predicted active-to-inactive docked poses among other inhibitors in the testing set, and therefore, it can be used to discriminate actives among screened compounds (Supplementary Table SM-2). Still, careful assessment of Table 4 shows that two inactives, namely, 31 (Ritonavir) and 65 (Atazanavir), to have predicted active-inactive ratios above than the proposed thresholds. Nevertheless, the total number of docked poses of these outliers are rather low (31 has 10 docked poses, while 65 has only 2 docked poses) and way from the number of docked poses of active compounds (average = 123.6, minimum = 102). On the other hand, it is noteworthy to mention that camostate has the highest number of docked poses among all active testing compounds (204 poses), which further highlights the importance of this point. Table 5 shows the predicted active-to-inactive ratios of the screened compounds. Clearly all three top MLs agreed on three drugs to exhibit active-to-inactive ratios exceeding the corresponding thresholds of camostat in Table 4. The three compounds are: 116 (aspoxicillin), 126 (capreomycin), and 196 (fosamprenavir). Moreover, these compounds have considerable number of docked poses, namely, 4404 for the two forms of capreomycin, 210 for aspoxicillin and 65 for fosamprenavir. Still, only capreomycin and aspoxicillin have their count of docked poses exceeding the minimum of docked poses of active testing compounds (i.e., 102), suggesting the lesser propensity of fosamprenavir as valid TMPRSS2 inhibitor. To validate our overall computational method, we evaluated the three promising hits by molecular dynamics simulation and covalent docking.
Table 5

Screened drug molecules and counts of their predicted “active” and “inactive” docked poses as predicted by the three top MLs.

CompoundsaPredicted number of active and inactive docked poses
Xgboost
nu-SVM
RF
Active PosesInactive Poses%Active PosesbActive PosesInactive Poses%Active PosesbActive PosesInactive Poses%Active Posesb
1085-Iodo-Sunitinib12180.50219002190
1095-CH3O-Sunitinib22530.822530.812540.4
110Acenocoumarol1109.101100110
111Acrivastine024002400240
112Alectinib038003800380
113Alogliptin032003200320
114Apalcillin42841.452831.712870.3
115Apraclonidine010010010
116Aspoxicillinc8912142.47513535.79711346.2
117Ast-130615230.2125122.305240
118Bacampicillin3415717.82516613.12117011
119Bambuterol04201412.40420
120Baricitinib025002500250
121Belinostat078007800780
122Benznidazole012001200120
123Biapenem070070070
124Brodimoprim1293.32286.72286.7
125Canertinib1012840.81912751.5012940
126Capreomycin1c,d118393156120391156.9159052475.2
Capreomycin2c,d131397757.31269102155.4172556575.3
127Carisoprodol080080080
128Carumonam011001101109.1
129Cefadroxil0870117612.60870
130Cefamandole071007100710
131Cefatrizine25750.3175602.925750.3
132Cefazedone057005700570
133Cefazolin051005100510
134Cefbuperazone066006600660
135Cefcapene011500115001150
136Cefclidin018100181011800.6
137Cefepime21051.90107001070
138Cefetamet052005200520
139Cefixime040040040
140Cefmenoxime017600176001760
141Cefmetazole025002500250
142Cefminox080047650800
143Cefodizime043004300430
144Cefonicid094009400940
145Ceforanide11660.60167021651.2
146Cefotaxime011200112001120
147Cefotetan079007900790
148Cefotiam0183011820.501830
149Cefoxitin2335.71342.90350
150Cefozopran0270012690.402700
151Cefpimizole0179011780.611780.6
152Cefpiramide330910312013110.3
153Cefpirome11570.60158001580
154Cefpodoxime074007400740
155Cefprozil030030030
156Cefroxadine058005800580
157Cefteram016001600160
158Ceftezole038003800380
159Ceftibuten061006100610
160Ceftizoxime1422.32414.70430
161Ceftriaxone018600186001860
162Cefuroxime05901581.70590
163Cefuzonam016800168001680
164Cephacetrile020020020
165Cephaloridine02301224.30230
166Cephalothin019001900190
167Cephapirin03802365.30380
168Cephradine1442.204500450
169Cilastatin0113031102.711120.9
170Cinanserin018800188001880
171Citalopram090090090
172Clobenprobit013800138001380
173Clonazepam050050050
174Clonitazene041004100410
175Cyclacillin02302218.70230
176Cypermethrin057005700570
177Dacomitinib0430034270.704300
178Dantrolene091009100910
179Debrisoquin010010010
180Demecarium1127.731023.10130
181Difenoxin026002600260
182Diphenoxylate1831.22822.40840
183Enzalutamide01601156.30160
184Ertapenem021002101204.8
185Eszopiclone041004100410
186Ethacrynic_Acid010010010
187Etonitazene055005500550
188Etozolin050050050
189Etretinate0700701614.3
190Ezogabine014400144001440
191Famitinib13230.313230.303240
192Flubanilate011001100110
193Flucloxacillin010001000100
194Flunidazole090090090
195Flunitrazepam020020020
196Fosamprenavirc224333.8303546.2204530.8
197Guanadrel0801712.5080
198Guanethidine14201420050
199Guanfacine010001000100
200Guanisoquin030030030
201Guanoxan012001200120
202Guanoxyfen011001100110
203Henatinib010010010
204Hetacillin02101204.80210
205Hydroxystilbamidine020020020
206Ibrutinib13870.323860.503880
207Imipenem016001600160
208Iobenguane050050050
209Irinotecan124402502238
210Isosulfazecin1931.15895.31931.1
211Leuprolide02002020100
212Levocabastine0801712.5080
213Levofuraltadone029002900290
214Levopropylcillin020020020
215Lodoxamide027002700270
216Loracarbef05505509.10550
217Mebendazole082008200820
218Methicillin070070070
219Mitomycin080080080
220Momelotinib012500125001250
221Moricizine021900219002190
222Naquotinib43301.253291.543301.2
223Neratinib73242.1133183.9133183.9
224Niclosamide016001600160
225Nifurtimox030030030
226Nilutamide020020020
227Nitisinone060060060
228Nitrazepam060060060
229Nitrofurantoin035003500350
230Octocrylene012001200120
231OctylMethoxycinnamic016001600160
232Olmutinib85661.4125622.165681
233Orantinib1313.11313.10320
234Osimertinib3410313.24810174.52210432.1
235Oxacilin016021412.50160
236Oxamniquine026002600260
237Pelitinib12290.422280.912290.4
238Penicillin G012001200120
239Penicillin V020002000200
240Pentagastrin55660.955660.98948215.6
241Pericyazine04101402.40410
242Piritrexim315475.4245544.235750.5
243Poziotinib0167011660.601670
244Proguanil011001100110
245Pyrotinib41432.70147031442
246Rescinnamine1712.5080080
247Rociletinib79100.849130.499081
248Romidepsin030030030
249Siguazodan1531.93515.60540
250Spebrutinib0880058750.608800
251Streptomycin29724954188118.2
252Tedizolid-phosphate020020020
253Tegaserod068006800680
254Tepotinib1881.11881.10890
255Ticarcillin026002600260
256Vasopressin020020020
257Verapamil079007900790
258Vilazodone099009901981
259Vorapaxar1352.81352.80360
260Yn-968d108601851.20860
261Zaleplon030030030
262Zopiclone03701362.70370

Compounds are as in Supplementary Table SM-2.

Determined by dividing the number of active poses by the total number poses (active + inactive).

The percent active poses of these compounds exceeded the threshold in Table 2 (compound 2 or camostat) and therefore are predicted to be active anti-TMPRSS2 inhibitors.

Screened drug molecules and counts of their predicted “active” and “inactive” docked poses as predicted by the three top MLs. Compounds are as in Supplementary Table SM-2. Determined by dividing the number of active poses by the total number poses (active + inactive). The percent active poses of these compounds exceeded the threshold in Table 2 (compound 2 or camostat) and therefore are predicted to be active anti-TMPRSS2 inhibitors.

Molecular dynamics simulation and covalent docking

As a prerequisite for successful covalent bond formation between the ligand’s warhead and the targeted nucleophilic center within the binding site (hydroxyl of Ser478), it is necessary for the warhead to reside for some time, e.g., 20 to 80 ns [177], [178], at close proximity to the nucleophilic residue. It can be assumed that the sum of van der Waals’ radii of oxygen and carbon atoms (3.3 Å, representing the hydroxyl oxygen of Ser478 and carbon atoms of electrophilic warheads, magenta line in Fig. 7) is a reasonable distance threshold for probable subsequent nucleophilic attack [177].
Fig. 7

MD simulations distance probes (A) Distances separating the reactive warheads of candidate hits: capreomycin (126), aspoxicillin (116) and fosamprenavir (196) from the nucleophilic hydroxy of Ser478 during 200 ns of MD trajectories. (B) Distances separating the reactive warheads of camostat (2, active testing compound), ibrutinib (206, predicted inactive) or cobicistat (33, inactive testing compound) or the central atom of piritrexim (242, predicted inactive) from the nucleophilic hydroxy of Ser478 during 200 ns of MD trajectories. Each time step represents 1.0 ns. The magenta lines represent the minimum distance between the reactive warhead atoms of the hits and Ser478 hydroxyl oxygen atom in a covalent-bond productive encounter (the distance represents the summation of van der Waals’ radii of carbon and oxygen atoms). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

MD simulations distance probes (A) Distances separating the reactive warheads of candidate hits: capreomycin (126), aspoxicillin (116) and fosamprenavir (196) from the nucleophilic hydroxy of Ser478 during 200 ns of MD trajectories. (B) Distances separating the reactive warheads of camostat (2, active testing compound), ibrutinib (206, predicted inactive) or cobicistat (33, inactive testing compound) or the central atom of piritrexim (242, predicted inactive) from the nucleophilic hydroxy of Ser478 during 200 ns of MD trajectories. Each time step represents 1.0 ns. The magenta lines represent the minimum distance between the reactive warhead atoms of the hits and Ser478 hydroxyl oxygen atom in a covalent-bond productive encounter (the distance represents the summation of van der Waals’ radii of carbon and oxygen atoms). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.) To evaluate the residence time and distance separating the reactive warheads of capreomycin, aspoxicillin and fosamprenavir from the nucleophilic hydroxyl of Ser478, we opted to perform 200-ns MD simulations for the three hits as they dock within the binding pocket. The starting poses were selected to have the highest consensus score among other docked poses in each case. MD trajectories show fosamprenavir to leave the binding pocket after approximately 60 ns. Moreover, it failed to have close encounters with the nucleophilic Ser478 OH (within distance threshold) as can be seen in Fig. 7A where the distance separating the electrophilic carbamate atom of fosamprenavir and the nucleophilic Ser478 OH never crossed the threshold magenta line. Accordingly, we were prompted to discard fosamprenavir from subsequent covalent docking. However, the other two hits (capreomycin and aspoxicillin) remained during MD simulation within the active site (Fig. 7A). However, aspoxicillin seems to have better chances to form covalent bond with Ser478 as it crosses/approaches the distance theshold line (magneta line in Fig. 7A) more frequently. Covalent docking shows both aspoxicillin and capreomycin to successfully form covalent bonds with the nucleophilic hydroxyl of Ser478, as in Fig. 8. In aspoxicillin case, the nucleophilic hydroxyl attacks and ring-opens the β-lactam ring (Fig. 8A and 8B), while in capreomycin case an oxa-Michael addition reaction takes place whereby the nucleophilic hydroxyl adds to the Michael acceptor feature within capreomycin (Fig. 8C and D). Moreover, covalent docking shows that each docked compound exhibits additional reversible binding interactions: Aspoxicillin forms hydrogen bonding interactions with Glu426, Ser497 and Gly476 together with hydrophobic interactions Cys502 and His333 (Fig. 8A and B). Similarly, docked capreomycin is involved in hydrogen bonding interactions with Gln475, His333, Ser497, Gly476, His316, and Gly428. Additionally, the guanidine of capereomycin is stacked against the imidazole side chain of His333.
Fig. 8

The 3-D and 2-D structure of covalent docked complexes tying the reactive warheads (A) and (B) Aspoxicilin, (C) and (D) Capreomycin 1 with the hydroxy of Ser478.

The 3-D and 2-D structure of covalent docked complexes tying the reactive warheads (A) and (B) Aspoxicilin, (C) and (D) Capreomycin 1 with the hydroxy of Ser478. Covalent docking, however, is only a decorative tool to suggest how a covalent–bond forming drug might fit into the binding pocket in case it succeeds in forming covalent bond. However, we believe the main supportive tool in our case is MD simulations. To enhance confidence in MD simulations as success gauge for our ML models, we performed MD simulations (200 ns) for further 4 compounds (starting with high-quality docked poses of consensus score of 9), namely, camostat (2), cobicistat (33), ibrutinib (206) and piritrexim (242). The former two belong to the testing set, while the latter two belong to the screened set. Camostat is the only active compound among the collection and it was correctly predicted by our ML models to be active. While the other three were all predicted to be inactive by our models, however one of them, i.e., cobicistat, is experimentally proven to be inactive. This collection should allow us to adequately probe the correlation between our ML predictions and MD simulation data. Fig. 7B shows the results of the MD study. Clearly, only camostat persisted within the vicinity of the binding site, while the rest, whether experimentally inactive or predicted to be inactive, were quickly ejected from the binding site. These results nicely correlate with the predictions of our ML models despite being generated for rather diverse set of compounds.

Conclusions

In the present work we introduce computational bootstrapping of machine learning QSAR modelling using multiple high-quality docked poses. Ligand-receptor contact fingerprints and scoring function values were used as descriptors, while several MLs were scanned. We implemented this method for the discovery of potential inhibitors for the serine protease enzyme TMPRSS2 involved in the infectivity of coronaviruses. Three hits were identified. Subsequent molecular dynamic simulation and covalent docking supported the results of the new computational approach.

CRediT authorship contribution statement

Ma’mon M. Hatmal: Resources, Formal analysis, Investigation, Writing - review & editing. Omar Abuyaman: Formal analysis, Investigation, Resources, Writing – original draft, Writing - review & editing. Mutasem Taha: Conceptualization, Methodology, Supervision, Investigation, Resources, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
  137 in total

1.  A probabilistic neural network approach for modeling and classification of bacterial growth/no-growth data.

Authors:  M Hajmeer; I Basheer
Journal:  J Microbiol Methods       Date:  2002-10       Impact factor: 2.363

Review 2.  The pharmacological landscape and therapeutic potential of serine hydrolases.

Authors:  Daniel A Bachovchin; Benjamin F Cravatt
Journal:  Nat Rev Drug Discov       Date:  2012-01-03       Impact factor: 84.694

Review 3.  Ranking poses in structure-based lead discovery and optimization: current trends in scoring function development.

Authors:  Ramkumar Rajamani; Andrew C Good
Journal:  Curr Opin Drug Discov Devel       Date:  2007-05

4.  New Electrophiles and Strategies for Mechanism-Based and Targeted Covalent Inhibitor Design.

Authors:  Sneha Ray; Andrew S Murkin
Journal:  Biochemistry       Date:  2019-04-24       Impact factor: 3.162

5.  Simulated annealing molecular dynamics and ligand-receptor contacts analysis for pharmacophore modeling.

Authors:  Ma'mon M Hatmal; Mutasem O Taha
Journal:  Future Med Chem       Date:  2017-07-19       Impact factor: 3.808

6.  MolProbity: More and better reference data for improved all-atom structure validation.

Authors:  Christopher J Williams; Jeffrey J Headd; Nigel W Moriarty; Michael G Prisant; Lizbeth L Videau; Lindsay N Deis; Vishal Verma; Daniel A Keedy; Bradley J Hintze; Vincent B Chen; Swati Jain; Steven M Lewis; W Bryan Arendall; Jack Snoeyink; Paul D Adams; Simon C Lovell; Jane S Richardson; David C Richardson
Journal:  Protein Sci       Date:  2017-11-27       Impact factor: 6.725

Review 7.  Drug discovery for a new generation of covalent drugs.

Authors:  Amit S Kalgutkar; Deepak K Dalvie
Journal:  Expert Opin Drug Discov       Date:  2012-05-19       Impact factor: 6.098

Review 8.  Efficient drug lead discovery and optimization.

Authors:  William L Jorgensen
Journal:  Acc Chem Res       Date:  2009-06-16       Impact factor: 22.384

9.  Impact of the structures of macrocyclic Michael acceptors on covalent proteasome inhibition.

Authors:  S Kitahata; F Yakushiji; S Ichikawa
Journal:  Chem Sci       Date:  2017-08-11       Impact factor: 9.825

Review 10.  Multidirectional Efficacy of Biologically Active Nitro Compounds Included in Medicines.

Authors:  Dorota Olender; Justyna Żwawiak; Lucjusz Zaprutko
Journal:  Pharmaceuticals (Basel)       Date:  2018-05-29
View more
  4 in total

1.  Discovery of new PKN2 inhibitory chemotypes via QSAR-guided selection of docking-based pharmacophores.

Authors:  Mahmoud A Al-Sha'er; Haneen A Basheer; Mutasem O Taha
Journal:  Mol Divers       Date:  2022-05-04       Impact factor: 2.943

2.  Exploiting activity cliffs for building pharmacophore models and comparison with other pharmacophore generation methods: sphingosine kinase 1 as case study.

Authors:  Lubabah A Mousa; Ma'mon M Hatmal; Mutasem Taha
Journal:  J Comput Aided Mol Des       Date:  2022-01-21       Impact factor: 3.686

3.  Discovery of new Cdc2-like kinase 4 (CLK4) inhibitors via pharmacophore exploration combined with flexible docking-based ligand/receptor contact fingerprints and machine learning.

Authors:  Mai Fayiz Al-Tawil; Safa Daoud; Ma'mon M Hatmal; Mutasem Omar Taha
Journal:  RSC Adv       Date:  2022-04-05       Impact factor: 3.361

4.  Reported Adverse Effects and Attitudes among Arab Populations Following COVID-19 Vaccination: A Large-Scale Multinational Study Implementing Machine Learning Tools in Predicting Post-Vaccination Adverse Effects Based on Predisposing Factors.

Authors:  Ma'mon M Hatmal; Mohammad A I Al-Hatamleh; Amin N Olaimat; Rohimah Mohamud; Mirna Fawaz; Elham T Kateeb; Omar K Alkhairy; Reema Tayyem; Mohamed Lounis; Marwan Al-Raeei; Rasheed K Dana; Hamzeh J Al-Ameer; Mutasem O Taha; Khalid M Bindayna
Journal:  Vaccines (Basel)       Date:  2022-02-26
  4 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.