Literature DB >> 28776988

Comprehensive and Automated Linear Interaction Energy Based Binding-Affinity Prediction for Multifarious Cytochrome P450 Aromatase Inhibitors.

Marc van Dijk1, Antonius M Ter Laak2, Jörg D Wichard2, Luigi Capoferri1, Nico P E Vermeulen1, Daan P Geerke1.   

Abstract

Cytochrome P450 aromatase (CYP19A1) plays a key role in the development of estrogen dependent breast cancer, and aromatase inhibitors have been at the front line of treatment for the past three decades. The development of potent, selective and safer inhibitors is ongoing with in silico screening methods playing a more prominent role in the search for promising lead compounds in bioactivity-relevant chemical space. Here we present a set of comprehensive binding affinity prediction models for CYP19A1 using our automated Linear Interaction Energy (LIE) based workflow on a set of 132 putative and structurally diverse aromatase inhibitors obtained from a typical industrial screening study. We extended the workflow with machine learning methods to automatically cluster training and test compounds in order to maximize the number of explained compounds in one or more predictive LIE models. The method uses protein-ligand interaction profiles obtained from Molecular Dynamics (MD) trajectories to help model search and define the applicability domain of the resolved models. Our method was successful in accounting for 86% of the data set in 3 robust models that show high correlation between calculated and observed values for ligand-binding free energies (RMSE < 2.5 kJ mol-1), with good cross-validation statistics.

Entities:  

Mesh:

Substances:

Year:  2017        PMID: 28776988      PMCID: PMC5615371          DOI: 10.1021/acs.jcim.7b00222

Source DB:  PubMed          Journal:  J Chem Inf Model        ISSN: 1549-9596            Impact factor:   4.956


Introduction

Cytochrome P450 aromatase (CYP19A1; EC 1.14.14.1) is a member of the Cytochrome P450 (CYP) superfamily of mono-oxygenases. This enzyme catalyzes a key step in estrogen biosynthesis, i.e., aromatization of androgens such as androstenedione and testosterone to estrone and 17β-estradiol, respectively.[1−7] Common to most CYPs, CYP19A1 can bind a variety of low molecular weight compounds, but it has a high catalytic specificity toward steroid substrates due to the distribution of polar and nonpolar residues within the binding site.[8,9] Overexpression of aromatase in tumor tissue was identified to play a key role in the development of estrogen receptor positive breast cancer, endometrial cancer and endometriosis.[1,3,5] Around 50–80% of breast cancers have been found to be estrogen-dependent, where estrogen binding to receptor stimulates tumor cell proliferation.[10−13] Because the aromatization reaction catalyzed by CYP19A1 is rate limiting, it serves as an ideal target for the development of selective and potent inhibitors that decrease the levels of circulating estrogen.[14,15] This has led to the development of four generations of aromatase inhibitors (AIs) in clinical use over the last three decades (Figure ).[14,16,17] Most AIs are nonsteroidal in nature (NSAIs) and derived from aminoglutethimide-like molecules, imidazole/triazole derivatives, or flavonoid analogs.[17,18] They act by means of competitive and reversible inhibition (type I) or quasi-irreversible inhibition by coordination with the heme iron (type II).[19,20] Steroidal inhibitors are typically based on the adrostenedione scaffold with various chemical substituents at varying positions, which can be functional groups responsible for mechanism based inhibition (e.g., Exemestane, Figure f).[21]
Figure 1

Members of four generations of clinical steroidal (c,f) and nonsteroidal (a,b,d,e) aromatase (CYP19A1) inhibitors. First generation: aminoglutethimide (a, Cytadren, Novartis).[99−101] Second generation: Fadrozole (b, Afema, Novartis),[99−101] and Formestane (c, Lentaron, Novartis).[14] Third generation: Anastrozol (d, Arimidex, AstraZeneca),[102] Letrozole (e, Femara, Novartis),[103−105] and Exemestane (f, Aromasine, Pfizer).[103]

Members of four generations of clinical steroidal (c,f) and nonsteroidal (a,b,d,e) aromatase (CYP19A1) inhibitors. First generation: aminoglutethimide (a, Cytadren, Novartis).[99−101] Second generation: Fadrozole (b, Afema, Novartis),[99−101] and Formestane (c, Lentaron, Novartis).[14] Third generation: Anastrozol (d, Arimidex, AstraZeneca),[102] Letrozole (e, Femara, Novartis),[103−105] and Exemestane (f, Aromasine, Pfizer).[103] Despite their clinical success, current AIs are associated with various drug related side-effects,[22,23] including effects due to inhibition of other members of the CYP family[24] that can lead to drug–drug interactions (DDI).[25] Therefore, the search for a next generation of AIs with improved potency, higher selectivity and reduced toxicity is still ongoing, and both synthetic as well as natural product derived alternatives such as coumarin, lignin and flavonoids have been explored over the years.[17,26−30] The widened search spectrum for AI lead compounds increases the need for effective methods to screen the inhibitory potential and modes of interaction for both the target protein and other CYPs. In particular, the use of in silico methods for the prioritization of compound ideas throughout the lead discovery and optimization stages potentially offers an attractive alternative for extensive use of in vitro methods.[31,32] However, especially for CYPs it remains a challenge to train generally applicable predictive models for protein binding due to their substrate promiscuity, catalytic site malleability, different modes of inhibition, and ability to bind the same ligand in multiple orientations.[19,25,33,34] The flexibility in both the structural and interaction dimensions limits the applicability of QSAR (Quantitative Structure–Activity Relationship) methods based on molecular descriptors only.[33,35,36] The addition of structural information such as in 3D-QSAR, molecular docking or extensive pharmacophore methods has increased predictive capacity, typically yielding local models for structurally similar binders.[33,36,37] However, these methods have difficulties dealing with the dynamic nature of the CYP catalytic site and substrate binding modes, providing a limited, static view of protein–substrate interactions.[38] On the other hand, molecular dynamics (MD) based free energy calculation methods have the potential to provide an accurate estimate of the binding affinity, even for very flexible systems such as CYP enzymes.[39−41] They are valuable methods in the design stage of drug discovery but due to their high CPU costs many of the pathway-based methods (including, e.g., Free Energy Perturbation (FEP) and Thermodynamic Integration (TI)) are unattractive for use in high-throughput settings especially when having to deal with multiple protein and/or ligand conformations that may contribute to binding (as in case of several CYPs).[41,42] Alternatively, end-point methods such as Linear Interaction Energy (LIE) theory may provide a trade-off between accuracy and speed by estimating the solvation free energy between the two end points using linear response theory instead of multiple intermediate steps along a pathway.[34,43−45] LIE requires empirical calibration of its scaling parameters. An advantage is that the sampling issue can be effectively and efficiently addressed using an iterative version of LIE[41] that uses (re)weighted results from multiple short simulations starting from different ligand[41] and protein conformations[40] as obtained, e.g., by molecular docking. The LIE method has been applied extensively over the last two decades in the prediction of protein–ligand binding affinities.[45,46] Most LIE models have been trained using relatively small and curated data sets ranging from as few as 10–15 up to approximately 50 compounds, yielding predictive affinity models for compounds that are structurally similar or diverse but engage in similar interactions with the target protein.[39,40,47−52] However, this situation may differ notably to the much larger data sets obtained in a typical industrial lead development and optimization project. Apart from their size, these compound libraries are often sparse and designed to increase possibilities to discover novel active ligands or scaffolds. It can become a challenge to create predictive LIE models with well-defined applicability domains[39] for these data sets and, with high-throughput in mind, to do so in an automated fashion. In the current study, we present a system of three quantitative binding affinity prediction models for CYP19A1. The models were trained using a data set of 132 putative CYP19A1 inhibitors with known inhibition constants obtained from a lead discovery study by Bayer AG that is subject to the same challenges described above in terms of compound size and chemical diversity. As an additional challenge, we aim here for a comprehensive approach for LIE model inference, parametrization and applicability assessment based on simulation and calibration data only, without the need for data set prefiltering by the user based on other information from experiment such as type of binding, which is not always available a priori. We have dealt with these challenges by developing an automated machine learning workflow that uses a stochastic approach to explore LIE-model parameter landscape. The iterative LIE (iLIE) method is used as a cost function in this search to prioritize the relative contribution of a binding orientation among a series of independent simulations of possible binding orientations. This approach allows to efficiently deal with flexible proteins (such as CYPs) that are able to bind ligands in multiple orientations. The probability of a compound to be part of a particular model is defined by a Bayesian approach that determines the maximum a posteriori estimates for the α and β parameters of the iLIE equation for a set of compounds, in order to maximize the number of explained compounds in one or more LIE models with predefined quality of the correlation between calculated and experimentally observed values for the ligand-binding free energies (in terms of RMSE and r2). We subsequently used protein–ligand interaction profiling to define the applicability domain of the resolved models in terms of protein–ligand interactions. We show that our machine learning workflow is able to explain 86% of the employed CYP19A1 data set in three fully cross-validated LIE models with distinct combinations of α and β model parameters and an error margin below 3.0 kJ mol–1 of the experimentally observed binding free energy (i.e., within typical experimental accuracy[53]). Using simulation data only, steroid inhibitors were separated from nonsteroidal inhibitors, and the presented system of binding affinity models together with their protein–ligand interaction profiles provide a comprehensive means of predicting the binding affinity of unknown ligands for CYP19A1.

Methods

Molecular Dynamics (MD) averaged protein–ligand interaction energies for LIE affinity prediction were obtained using our previously published (semi)automated iterative LIE workflow.[39,54,64] The methodological details on the protein and ligand structure preparation, molecular docking and MD simulation stages of this workflow are described below, together with the model calibration and validation strategies developed in this study.

Structure Preparation

The crystal structure of CYP19A1 with PDB ID 3EQM(55) was used after removal of the 4-androstene-3-17-dione (ASD) molecule, water oxygen atoms (none in the catalytic cavity) and crystallization buffer additives. A data set of inhibition constants (Ki) for 132 putative CYP19A1 inhibitors with known stereochemistry (Tables S1 and S4, Supporting Information) was provided by Bayer AG, Berlin. Six of the 132 compounds entered the data set as duplicate (Table S1) but with independently measured K values. These compounds were used as internal validation in the study; and indeed most of the duplicates were found within the same (local) model, Table S1. Ki values were determined using the 3H2O release assay with [1β-3H]androstenedione as substrate[56] in human placental microsomes according to FDA and UP guidelines, and were used to derive experimental binding free energies ΔGobs (Table S1). Assumptions taken in directly deriving ΔGobs from these inhibition data were supported by the obtained correlations for our final LIE models. The initially minimized 3D structures for these compounds in their neutral form were generated from their canonical SMILES string using Molecular Operating Environment (MOE) version 2012.10,[57] and the MMFF94 force field.[58] Subsequent optimization and generation of MD topology and parameter files was performed using the Automated Topology Builder (ATB) server version 1.0.[59]

Docking and Clustering

Ligands were docked into the active site of CYP19A1 using the PLANTS (Protein–Ligand Ant System) docking software version 1.2.[60] The target binding site was defined at the approximate center of the protein active site, at a position 0.7 nm distal to the heme iron center perpendicular to the heme plane, and with a size defined by a sphere with a radius of 1.1 nm. Docking poses were generated with speed 1 settings and evaluated using the ChemPLP[61] scoring function. A maximum of 300 docked poses with mutual Root-Mean-Square Deviations (RMSDs) in atomic positions of more than 0.2 nm were retained and clustered using Principle Component Analysis (PCA), using the heavy atom positions as variables. After dimensionality reduction, the scores obtained were used for subsequent k-means clustering.[62] During this analysis, an additional component or cluster was taken into account in case it would have led to a further increment of at least 5% of the explained variance in the space of the coordinates or scores, respectively. The medoids of the clusters obtained (4 to 8 per ligand) were chosen as representative binding conformations of the ligand in the CYP19A1 active site.

MD Simulations

The MD simulations for every representative binding pose obtained after docking and clustering were carried out using the GROMACS 4.5.4 package[63] and an adapted version[64] of the automated MD workflow script obtained from the GROMACS web server.[65] The GROMOS 54A7 force field[66] was applied to describe the protein and heme group in simulation. Ligand topologies were obtained with ATB[59] as described above (see Structure Preparation subsection). Each complex was energy minimized in a vacuum using steepest descent minimization and solvated in a simulation box with a Near-Densest Lattice Packing (NDLP) optimized volume[67,68] (∼6300 solvent molecules) where water was described by the SPC model.[69] 7 Cl– ions were added to neutralize the system. The system was energy minimized (steepest-descent) and gradually heated up to 300 K in four NvT simulations of 20 ps, in which harmonic potentials were used for heavy-atom positional restrains in the sequence: 100/1000, 200/5000, 300/50 and 300/0 for temperature (K) and positional restraint force constant (kJ mol–1 nm–2), respectively, together with a Berendsen thermostat[70] with a separate solute and solvent coupling time of 0.1 ps. Subsequently, three 20 ps equilibration runs were performed at NpT, using heavy-atom positional restraints on the solute with decreasing force constants of 1000, 100 and 10 kJ mol–1 nm–2. To further stabilize the system (as shown to be required before[71]), a short 500 ps unrestrained NpT simulation preceded the 2 ns production NpT simulation. A leapfrog algorithm[72] was employed for integrating the equations of motion. Heavy hydrogens (with a mass of 4.032 amu)[73] were used and all bonds were constrained using the LINCS algorithm,[74] allowing a time-step of 4 fs. In all NpT simulations, a Berendsen thermostat[70] was employed to maintain the temperature of the system close to its reference value of 300 K, using separate temperature baths for the solvent and solute degrees of freedom, with a coupling time of 0.1 ps. A Berendsen barostat[70] with a coupling time of 0.5 ps and an isothermal compressibility of 7.5 × 10–4 [kJ mol–1 nm–3]−1 was used to maintain the pressure close to its reference value of 1.013 25 bar during NpT simulations. Van der Waals and short-range electrostatic interactions were explicitly evaluated every time step for pairs of atoms within a 0.9 nm cutoff, and a grid-based neighbor list was used and updated every 2 time steps. Long-range electrostatic interactions were included using a twin-range cutoff (0.9/1.4 nm) with reaction field correction.[75] The relative dielectric constant for the medium outside the reaction field was set to 61. Center of mass motion removal was applied to both the solute rotation and translation components every 10 steps. Interaction energies and atomic coordinates were stored every 2 ps. To evaluate average ligand interaction energies of the unbound ligands in water, each ligand was solvated in a NDLP optimized simulation box filled with approximately 625 SPC water molecules. No counterions were added. The MD protocol was identical to the one described for the simulations of the protein–ligand complex.

Iterative Linear Interaction Energy (iLIE) Method

In LIE, the predicted free energy of binding for a ligand in pose i to a protein (ΔG) is estimated from MD-averaged electrostatic ⟨V⟩ and van der Waals interaction energies ⟨V⟩ between the ligand and its surrounding as obtained in complex with the protein (bound,i) and free in solution (free). Following LIE theory,[43,76]Values for the empirical LIE model parameters α and β are obtained by parametrization against a training set of compounds with experimentally determined binding affinities using linear regression analysis. An optional constant γ may be considered and has been used by others to account, e.g., for hydrophobicity of the binding site.[76,77] In the iterative version of eq ,[41] the predicted protein–ligand binding free energy ΔGpred is calculated by combining results from N multiple short MD simulations starting from different ligand poses in the protein binding site that represent local minima on the potential energy surface of the protein–ligand complex, using a weighted sum (ΣW). The relative weight (W) of each simulation (i) entering the sum is calculated according to its probability using[78]with k Boltzmann’s constant, T the temperature of the simulation system and N the number of independent simulations. Extending eq by including the weighted sum of multiple short simulations yields[41]The dependencies of the ΔG on α and β requires eq to be solved iteratively.[41]

Automated iLIE Binding Affinity Model Inference

LIE models can be trained automatically for an unknown data set of sufficient size by clustering ligands based on similarity in ΔV and ΔV energy patterns (defined as ⟨V⟩ – ⟨V⟩ and ⟨V⟩ – ⟨V⟩), respectively) as a function of α and β model parameters (and possibly an optional constant γ) using the iLIE equation as a cost function in an iterative regression analysis. This is under the assumptions that (i) ΔV and ΔV in the data set show multivariate normality and low heteroscedasticity, (ii) there is similarity in the physio-chemical nature of the ligands and their interaction with the protein,[76,79] and (iii) the analysis leads to a single model. Our machine learning workflow expands this automatic training approach with the aim of finding the a posteriori estimates for the α and β parameters of the iLIE equation (eq ) for a set of compounds that maximizes the number of explained compounds in one or more LIE models within predefined RMSE and r2 margins (RMSE ≤ 5 kJ mol–1 and r2 ≥ 0.6). The workflow (Figure ) consists of a data curation and filtering stage to deal with assumption (i) (A in Figure ), the main stochastic sampling routine using a Bayesian inference approach for parameter estimation (B), and a clustering stage where final LIE models are created (C). The latter two stages explore the existence of multiple (independent) models each describing a part of ligand physio-chemical/interaction space (assumptions (ii) and (iii)). The methods used in each of these stages are explained in more detail below.
Figure 2

Schematic overview of the automated machine learning workflow aimed at finding the a posteriori estimates for one or multiple combinations of α and β parameters of the iLIE equation for a set of compounds, which maximize the number of explained compounds in one or more LIE models with predefined RMSE and r2 cutoffs. The “data set curation and filtering” stage (A) uses FFT based MD trajectory filtering to obtain stable average values of ΔV and ΔV. Ligand poses with average ΔV/ΔV pairs outside a 97.5% confidence interval (Figure S2) identified using multivariate normal mixture model analysis are labeled as outlier (out). Protein–ligand interaction profiling is performed on the FFT based stable energy trajectories (dashed lines). Ligand groups identified by the clustering of the interaction profiles are used as input to the stochastic search (B). The existence of LIE models for these clusters is explored during an iterative four-step stochastic search in which compounds are added to the evolving model from the global compound pool, according to a progressively updated probability using the iRLS weights of the added compounds at every iteration. The charted model landscape is clustered during the last workflow stage (C) and final models are selected.

Schematic overview of the automated machine learning workflow aimed at finding the a posteriori estimates for one or multiple combinations of α and β parameters of the iLIE equation for a set of compounds, which maximize the number of explained compounds in one or more LIE models with predefined RMSE and r2 cutoffs. The “data set curation and filtering” stage (A) uses FFT based MD trajectory filtering to obtain stable average values of ΔV and ΔV. Ligand poses with average ΔV/ΔV pairs outside a 97.5% confidence interval (Figure S2) identified using multivariate normal mixture model analysis are labeled as outlier (out). Protein–ligand interaction profiling is performed on the FFT based stable energy trajectories (dashed lines). Ligand groups identified by the clustering of the interaction profiles are used as input to the stochastic search (B). The existence of LIE models for these clusters is explored during an iterative four-step stochastic search in which compounds are added to the evolving model from the global compound pool, according to a progressively updated probability using the iRLS weights of the added compounds at every iteration. The charted model landscape is clustered during the last workflow stage (C) and final models are selected.

MD Trajectory Filtering

To enforce the independence of the multiple simulations performed per ligand,[78,80] average values ⟨V⟩ and ⟨V⟩ are used in eq as obtained from segments of the interaction energy trajectories that are constant in time within a predefined cutoff. For that purpose, Fast Fourier Transform (FFT) filtering was carried out followed by spline fitting.[80] FFT smoothens the trajectory using a band-pass filter keeping the first 15 elements around the average in the frequency domain followed by inverse FFT to the original time domain (complex elements are discarded). Double spline fitting on the smoothened energy trajectory followed by gradient analysis identifies transitions when the absolute change in gradient is larger than 0.2 kJ mol–1 ps–1. The gradient was calculated using second-order central differences in the interior and first-order differences at the boundaries. For every protein–ligand MD simulation, the first continuous series of interaction energy data points with a minimum length of 100 ps is used in which both ⟨V⟩ and ⟨V⟩ do not show fluctuations (i.e., gradients) larger than the above given cutoff value. Simulations for which no such series could be identified were discarded from further analysis. This event occurred once for a total of 688 CYP19A1 MD simulations performed during this study. Multivariate normal mixture model analysis deploying the expectation maximization algorithm[81,82] was used to identify (possibly multiple) normal distributed subpopulations in the dependent variables (ΔV, ΔV). We used the algorithm implemented in the sklearn.mixture Python library for this purpose with a “full” covariance type. Individual simulations were labeled as outlier and left out from further analysis if they have variable pairs outside a 97.5% confidence interval of the fitted χ2-distributions for all of the identified populations (out in Figure ). When applicable, multiple subpopulations are treated as independent data sets in the remainder of the workflow.

Separation of Simulations

The use of multiple poses in iLIE to start short MD simulations allows the ligand to interact with the protein in multiple conformations. To guard the efficiency of the stochastic search when using multiple simulations per compound, a filtering step was performed based on an initial estimate of the α/β model parameter space by sampling a grid of predefined α/β parameters. This allowed to (i) discard results from simulations with low W representing poses unlikely to contribute to binding, and (ii) separate the combination of simulations with high W values in different regions of sampled α/β model parameter space into unique cases, to increase the chance for related cases to be grouped together in the same regression model. A square grid for the iLIE α and β model parameters was defined between a value of 0 and 1 with a grid spacing of 0.01. To compute ΔGpred and the W values, the iLIE equation (eq ) was evaluated at each grid point using all poses for a given ligand. Minima in model parameter space were defined as regions in which the deviation of ΔG from experiment is within ±5 kJ mol–1 (1.2 kcal mol–1), which equals about one pK log unit.[53] The propagation of the Wi’s for the independent simulations as a function of the model parameters in the defined region were analyzed and categorized as constant across minima or variable (illustrated in Figure ). If in the latter case there was more than one simulation for which the gradient in W propagation is of opposite sign (Figure , poses 1 and 2), these are labeled as separate cases, together with a copy of the simulations of the same ligand for which W was identified to be constant with significant value (>0.1) across minima (Figure , pose 3). Simulations with W < 0.1 across minima were discarded for further analysis (Figure , pose 4). The separation of poses following this procedure was only used during the stochastic search.
Figure 3

Propagation of weights W for four poses of a compound as a function of α or β model parameter, as obtained by solving the iLIE equation on a fixed grid of α and β parameters and focusing on the grid region where the difference between ΔGpred and experimentally observed ΔGobs is smaller than 5 kJ mol–1.

Propagation of weights W for four poses of a compound as a function of α or β model parameter, as obtained by solving the iLIE equation on a fixed grid of α and β parameters and focusing on the grid region where the difference between ΔGpred and experimentally observed ΔGobs is smaller than 5 kJ mol–1.

Stochastic Sampling

A stochastic approach was used to assess a posteriori estimates for combinations of α and β parameters for clusters of compounds in the data set, which together maximize the number of explained compounds in a minimum number of LIE models with predefined RMSE and r2 cutoffs. The stochastic search was seeded with subsets of compounds from every node in the hierarchical cluster tree obtained from protein–ligand interaction profile clustering (see the following). This approach increases the probability to identify predictive LIE models for compounds having similar protein–ligand interaction profiles. The compounds belonging to each node are first filtered for regression outliers according to the same protocol used during the stochastic search (step 3 in the following) to ensure that the start set is of maximum quality from a regression statistics point of view. Each stochastic search is an iterative process (Figure , middle pane) that consists of four steps: Increase the current compound set by 20% with a random sample from the main compound pool according to probability p defined as the probability of a compound (c) to be part of a model as p = 1 – (|{c|c ∈ R}|/|R|). R is the set that maintains a counter of the number of times each ligand was labeled as outlier according to the maximum likelihood estimates (weights) of the iterative Reweighted Least Squares regression (iRLS) while iterating (step 3). The probability table is initiated with a probability of 1 for each ligand. α and β model parameters for the new set of compounds are obtained from iRLS using ΔGobs values for the compounds as dependent variable. Andrew’s Wave was used as maximum likelihood estimator.[83] Use the iRLS weights to determine regression outliers (<0.75) and inliers (≥0.75). Update the global table of outlier counts (R) for each compound and re-evaluate the probability table (p, step 1). Evaluate the model with respect to the ΔG RMSE and coefficient of determination r2 and accept the model if statistics are within 10% deviation of the previous model and the α and β parameters are within the range 0 to 1. If accepted, store the new model and start a new iteration until there are no more ligands to be added, statistics constraint limits are reached (5 kJ mol–1 RMSE, 0.6 r2), or no successful ligand addition could be made during 50 consecutive trials.

Model Clustering

The stochastic sampling yields a collection of models describing the probability of ligands and individual poses to occur together with respect to the LIE model parameters and within the predefined statistical model quality. For every ligand in the data set, we collected the iRLS regression weights of the models in which the ligand occurs, normalized by the RMSE of the respective model and summed over individual poses. These values where visualized in a heat map by clustering them with respect to α and β model parameters, illustrating the distribution of ligands over LIE models. The heat map was used to visually select the ligand sets that maximize the number of explained compounds in one or more LIE models. The final models best representing the data set were trained using the iLIE equation (eq ) with the ligand sets selected from the heat map. Representative simulations for each compound were selected by first training a model using all poses, after which we selected the simulations within a range of 0.2 of the most dominant simulation according to their weights W (eq ). These simulations were used for retraining the final model.

Protein–Ligand Interaction Profiling

Protein–ligand interactions in all MD frames for all ligand poses were analyzed using in-house python software. The software aims to identify protein–ligand contacts as belonging to the following eight interaction types using rule-based protocols that are described in the Supporting Information: hydrogen bonded contacts (hb-ad, hb-da), water mediated hydrogen-bonded contacts (wb-da, wb-ad), halogen-bonded contacts (xb), salt bridging contacts (sb-np, sb-pn), cation-π interactions (pc), aromatic π- (ps) or T-stacking (ts), hydrophobic interactions (hf) and heme-coordination (hm, which is not explicitly accounted for during our classical simulations). Interaction subcategories include donor–acceptor (da), acceptor–donor (ad), negative–positive (np) and positive–negative (pn), for ligand and protein residue, respectively. Relationships between the interaction profiles within pairs of MD simulations were evaluated by encoding the ligand-protein interaction profile data for each independent simulation as a structured key and by building a pairwise similarity matrix based on them using the following protocol: Discard all protein–ligand interactions that occur for less than 10% of the simulation time, to reduce noise in further analysis due to infrequent interactions. Normalize the frequency of occurrence of the interaction types by their relative abundance in the full data set, to prevent abundant hydrophobic interactions to have an artificially large influence on the final similarity metric. Collect the unique protein residues involved in protein–ligand interactions from all filtered profiles after step 2. Collect the resulting (sorted) residue numbers form the primary slots of the structured key. Each slot is divided in 2 × 12 registers subsequently containing: The frequency with which any of the 12 interaction types (comprising the 8 main groups and their subcategories mentioned above) occur between the ligand and the given residue over the course of the simulation. A list of ligand SYBYL atom types involved in the interaction in the corresponding first 12 registers (4a). The similarity between two structured keys is evaluated as the sum of similarities between each register, normalized by key length (number of residues). The similarity is calculated differently for the two registry types mentioned in step 4: For every residue interaction frequency (4a), the similarity is assigned 1 if they are identical, 0 if any of the registers equals 0, and the minimum frequency of occurrence otherwise. Similarity is defined as the percentage of similarity in SYBYL atom types involved in the interaction (4b), multiplied by the frequency of occurrence of the corresponding type in the first 12 registers (4a). This multiplication ensures that the weight of similarity in SYBYL atom types for a given interaction type in the final similarity metric is never larger than that of the interaction type itself. The resulting pairwise similarity matrix is used as input for a hierarchical agglomerative cluster analysis as implemented in the Python scipy.cluster.hierarchy(84) to identify groups of (independent) ligand simulations that show similar interaction profiles, which were used as seeds for the stochastic approach for iLIE model regression described above.

Results and Discussion

The 132 compounds in the data set (Tables S1 and S4 of the Supporting Information) were selected in a lead discovery study as promising AI candidates. The 12 steroid and 121 nonsteroid compounds can be classified as resembling the second to fourth generation of clinical AI’s (Figure ) including Letrozole (compound 2), Anastrazole (compound 9) and Fadrozole (compound 13). All nonsteroid compounds have one azole ring of which the basic nitrogen atoms (mostly imidazole or triazole) have the potential to apically coordinate the heme iron.[85−89] Most of them have an aryl or apolar cyclic moiety mimicking the steroid ring of the substrate, and one or two nitrile groups. Apart from these common features there is considerable structural diversity among the ligands with respect to the topological arrangement and chemical nature of the functional groups. We initially attempted to train a single iLIE model for our data set of 132 compounds (using all docked poses and a 3/4 test-to-train set ratio), which resulted in models with negative r2 values and RMSE values higher than 7.0 kJ mol–1. Subsequent attempts to group compounds based on their molecular structures into iLIE binding affinity prediction models (i.e., based on SAI/NSAI compound groups or using structurally similar compounds derived from a chemical fingerprinting and clustering analysis) yielded local models with <10 ligands when using all ligand poses, leaving most of the data set unexplained. An initial estimate of optimum α/β parameter ranges was obtained by solving the iLIE equation for every ligand individually, taking into account all poses, on a grid of fixed α/β parameters. The range corresponding to a prediction error in ΔGpred of less than 5 kJ mol–1 for a maximum number of compounds is located in α and β ranges between 0 and 0.5 and 0.55–0.95, respectively, Figure . In this broad region, even the most densely populated area of model parameter combinations comprises no more than 50% of the ligands. Furthermore, 101 of the 132 ligands have at least two different dominant binding poses in different parts of this region based on the change in the distribution of weighted propensities W for the independent simulations. These observations illustrate the challenges mentioned before associated with LIE model development for diverse and large sets of training compounds with respect to their physicochemical variation, the possibility of ligands to bind in multiple orientations, and the unlikely existence of a single predictive model with sufficient coverage.
Figure 4

LIE α and β model parameter scan performed on each of the 132 ligands individually by solving the iLIE equation (eq ), for every point on a square grid of α and β model parameters between a value of 0 and 1 with a grid spacing of 0.01. The color gradient highlights the percentage of ligands having a ΔGpred value within 5 kJ mol–1 of ΔGobs for a given combination of α and β.

LIE α and β model parameter scan performed on each of the 132 ligands individually by solving the iLIE equation (eq ), for every point on a square grid of α and β model parameters between a value of 0 and 1 with a grid spacing of 0.01. The color gradient highlights the percentage of ligands having a ΔGpred value within 5 kJ mol–1 of ΔGobs for a given combination of α and β. To explore the existence of a small subset of iLIE models that cover a substantially larger part of the data set and that have a well-defined protein–ligand interaction profile we developed a machine learning workflow using information extracted from the MD-trajectories only. The philosophy behind the method (see Methods section for details) is described below together with the results obtained after applying it to the data set of aromatase inhibitors.

Data Set Curation and Filtering

⟨V⟩ and ⟨V⟩ derived from the independent MD simulations used for iLIE are assumed to be averages over well-separated parts of the protein–ligand interaction energy surface.[78] Progressive changes in interaction energy values during a simulation may indicate an unstable system or a conformational transition of the ligand away from its initial position. As a remedy, constant interaction energy trajectories were obtained by filtering the MD energy trajectories for large fluctuations using a spline-fitting procedure on the FFT filtered time series.[80] Convergence of interaction energy terms is illustrated for a randomly picked subset of simulations in Figure S1 of the Supporting Information. Subsequently, we performed a multivariate normal distribution analysis to identify possible multiple independent and/or partly overlapping normal distributed subpopulations in the dependent variables (i.e., ΔV and ΔV derived after FFT filtering, Tables S2 and S3). Using a 97.5% confidence interval, 18 protein–ligand simulations (29 without FFT filtering) were found to be not part of the single identified distribution (Figure S2, Supporting Information), which included all simulations for ligands 156 and 189. These two compounds were therefore not included in further analyses. From Figure S2, detected outliers (which were left out from further analyses) involve among others the combinations of the ΔV values for compounds 156, and poses 2 of compounds 69 and 300, and pose 3 of compound 148. In these cases, the interaction profiles obtained from simulation indicated ligand interactions with protein residues located near the entrance channel to the catalytic cavity (between β-sheet 4 and helix F) instead of the protein residues and heme group located around the center of the catalytic cavity as defined for docking (see Methods section). Ligands 189 and 233 were found to interact with both regions. From the substantially different average energies and interaction profiles of the simulations involving these ligands, both can be labeled as outliers with respect to the simulations of other ligands in the data set and were left out from further analyses as well.

Stochastic Sampling

The stochastic sampling stage of our machine learning workflow (Methods section and Figure C) was initiated with the filtered and curated data set described above and seeded with ligand pose clusters derived from hierarchical clustering of the protein–ligand interaction profiles obtained from the MD trajectories (described in the following). For the system considered, the full workflow is completed in ∼10 min on one core of a 4 core 2,2 GHz Intel Core i7 desktop machine. The results are visualized as a α and β heat map (Figure ) showing clusters of ligands with an associated likelihood (color gradient) of constituting a model together for the respective combinations of sampled α and β model parameters. 12 of the 130 ligands used in the search (compounds 159, 166, 182, 194, 210, 220, 223, 232, 233 and 324) could not be placed in any model with a probability larger than 0.5 and are not shown in the heat maps. The ligand sets used to train the final representative models were visually selected based on the clusters in the heat map. This selection is the only manual step within our otherwise automated machine learning workflow.
Figure 5

Results of the stochastic sampling of α and β model-parameter space for the data set of 132 CYP19A1 inhibitors. The likelihood for a compound (labeled by ID on the x-axis) to be part of a model in a given β (panel A) and α (panel B) range is shown as a heat map. The color gradient is a dimensionless measure of the likelihood calculated as the summed iRLS regression weights of a ligand in each of the sampled models of the stochastic search divided by the Root-Mean-Square Error (RMSE) of the model. The measure of likelihood is plotted as a function of the model α and β parameters with a bin size of 0.02. Regions of highest density used to train the final models are labeled as models 1 to 4 (red dashed rectangular boxes).

Results of the stochastic sampling of α and β model-parameter space for the data set of 132 CYP19A1 inhibitors. The likelihood for a compound (labeled by ID on the x-axis) to be part of a model in a given β (panel A) and α (panel B) range is shown as a heat map. The color gradient is a dimensionless measure of the likelihood calculated as the summed iRLS regression weights of a ligand in each of the sampled models of the stochastic search divided by the Root-Mean-Square Error (RMSE) of the model. The measure of likelihood is plotted as a function of the model α and β parameters with a bin size of 0.02. Regions of highest density used to train the final models are labeled as models 1 to 4 (red dashed rectangular boxes).

Selection of Representative LIE Binding Affinity Models Containing More Than 10 Compounds

The machine learning workflow sampled combinations of α/β model parameters that fall within the broad model-parameter region identified in the grid scan (Figure ). The workflow identified four posterior distributions (clusters) with centers containing more than 10 components (Figure , red dashed rectangular boxes labeled 1 to 4). These four clusters served as training set for representative LIE binding affinity models (Table ).
Table 1

Final Set of Representative LIE Binding Affinity Models Derived from the Stochastic Approximate Inference of Model Parameters for the Data Set of 132 Putative Aromatase Inhibitorsa

ModelnαβγRMSEr2q2LOOSDEPLOOr2bstrRMSEbstr
1, center160.2030.663 1.300.950.941.470.960.011.100.17
1, center160.2000.627–2.591.270.950.931.560.930.101.410.72
1, full310.2330.675 2.360.860.832.570.870.032.230.22
1, full310.2260.576–7.322.110.890.852.410.890.041.980.23
2, center130.1610.796 1.690.890.822.140.900.031.540.28
2, center130.1620.8302.451.690.890.782.380.900.031.480.32
2, full520.1070.748 2.590.760.762.590.730.032.710.10
2, full520.1010.696–2.542.550.770.742.690.750.022.650.07
3, center150.1340.893 1.470.930.911.690.930.021.460.16
3, center150.1250.838–3.231.420.930.862.110.940.051.300.12
3, full310.1080.868 1.680.920.911.800.920.021.620.17
3, full310.1100.8770.4951.680.920.901.890.920.021.660.11

Model statistics are reported for three representative LIE binding affinity models trained with and without γ parameter (kJ mol–1) for ligands belonging to the full model cluster (full) and the cluster center only (center, within 75% confidence interval). Root-Mean-Square Error (RMSE, kJ mol–1) and coefficient of determination (r2) model statistics are reported for the model, as well as for the bootstrap cross-validated model (bstr) and Leave-One-Out (LOO) cross-validated model (as Standard Error in Prediction (SDEP, kJ mol–1) and cross-validated r2 (q2)). Bootstrap cross validation was performed using a 20-fold random sampling with a training set of 75% of the model data set.

Model statistics are reported for three representative LIE binding affinity models trained with and without γ parameter (kJ mol–1) for ligands belonging to the full model cluster (full) and the cluster center only (center, within 75% confidence interval). Root-Mean-Square Error (RMSE, kJ mol–1) and coefficient of determination (r2) model statistics are reported for the model, as well as for the bootstrap cross-validated model (bstr) and Leave-One-Out (LOO) cross-validated model (as Standard Error in Prediction (SDEP, kJ mol–1) and cross-validated r2 (q2)). Bootstrap cross validation was performed using a 20-fold random sampling with a training set of 75% of the model data set. Model 1 with a cluster center of 16 ligands is able to account for a total of 31 ligands and shows little overlap with the other models in terms of α/β model parameter values. Cluster center 4 is able to account for 61% of the ligands not accounted for by model 1, with an r2 of 0.75 and RMSE of 3.3 kJ mol–1. Compared to cluster center 1, the ligands that are part of the area sampled by cluster center 4 show considerably more variation in the values for model parameters that can constitute a predictive model, in particular for α. With α close to 0, the predictive capacity of the model is mostly determined by the electrostatic component of the LIE equation. Furthermore, the dense population of ligands in cluster center 4 is predominantly determined by models with statistics in the less predictive half of the predefined RMSE and r2 margins (with 0.6 ≤ r2 < 0.8 and 3.0 < RMSE ≤ 5 kJ mol–1). However, the method did resolve two less populated clusters (Figure , cluster centers 2 and 3) that show little overlap but are contained within cluster 4. The models derived from these two centers have more favorable statistics than model 4 and together explain 95% of the ligands not explained by model 1. Model 2 and 3 cannot be combined without significantly worsening model statistics (r2= 0.59, RMSE = 3.5 kJ mol–1 for the combined model, versus r2 = 0.76, RMSE = 2.6 kJ mol–1 and r2 = 0.92, RMSE = 1.7 kJ mol–1 for models 2 and 3, respectively; Table ). Bootstrap analysis confirmed the two models to be derived from partly overlapping but distinct clusters (Figure S3, Supporting Information). The relatively low values for γ when including an offset parameter (Table ) indicates that the binding affinity can be modeled in terms of α and β only. Note that β values are higher than the theoretical value[43] of 0.5 and than the range of values previously reported for CYP LIE models (0.0–0.5),[39−41,54,90] whereas α values are relatively low compared to the corresponding range of values (0.2–0.6),[39−41,54,90]Table . The low α and high β values suggest that the binding represented by the models is mostly determined by electrostatic interactions. On the basis of the above, models 1, 2 and 3 were selected as representative models. Together, they cover 86% of the original data set with a RMSE between experimental and calculated binding free energy below 2.6 kJ mol–1, which is well within the typical experimental error.[53] The LIE models are robust from a statistical point of view shown, e.g., by Leave-One-Out and bootstrap cross-validation, Table . Although presented as three separate models, there is overlap in the distributions they were derived from, in particular for models 2 and 3. This becomes evident from the ΔGpred values and errors in prediction (Figure , Figure S4 Supporting Information) when making predictions for the same compound using each of the three models. The relationship is also reflected in the protein–ligand interaction profiles of the compounds used to train the models (see below).
Figure 6

Correlation between the predicted (ΔGpred) and observed (ΔGobs) binding free energies in three models (panels A to C for models 1 to 3, respectively) that were trained using the results from stochastic sampling. The solid diagonal lines indicate ideal correlation and the dashed lines indicate upper and lower error margins of 5 kJ mol–1. Blue filled circles correspond to compounds belonging to the cluster centers as indicated by the red boxes 1–3 in Figure , and red filled circles correspond to remaining compounds in that cluster, and gray filled circles correspond to the remaining compounds of the data set as predicted by the model. Blue and red filled crosses in panel C indicate low-affinity Fadrozole-like compounds.

Correlation between the predicted (ΔGpred) and observed (ΔGobs) binding free energies in three models (panels A to C for models 1 to 3, respectively) that were trained using the results from stochastic sampling. The solid diagonal lines indicate ideal correlation and the dashed lines indicate upper and lower error margins of 5 kJ mol–1. Blue filled circles correspond to compounds belonging to the cluster centers as indicated by the red boxes 1–3 in Figure , and red filled circles correspond to remaining compounds in that cluster, and gray filled circles correspond to the remaining compounds of the data set as predicted by the model. Blue and red filled crosses in panel C indicate low-affinity Fadrozole-like compounds. The classification of ligand-amino acid interactions occurring for more than 50% of the simulation time for all ligands identified 4 polar interaction sites (“hotspots”, Figure ) comprising (i) R115 and M374; (ii) Q225; (iii) D309 and T310; and (iv) L477, S478 and the heme prosthetic group (497); and 3 apolar contact sites: (i) I133, F134; (ii) V370; and (iii) F221, W224. This particular combination between apolar and polar interaction sites has also been identified in previous crystallographic, mutagenesis and computational studies on CYP19A1.[55,71,91−93] Below, we describe the dominant protein–ligand interactions observed in the simulations used for models 1–3 introduced above, starting with the dominant interactions for models 1 and 3, respectively.
Figure 7

Cartoon representation of Cytochrome P450 19A1 (PDB code 3EQM(55)) with the natural substrate 4-androstene-3-17-dione (ASD, cyan stick representation) bound. Protein residues (stick representation) involved in polar protein–ligand interactions in more than 50% of the simulation time are grouped in four hotspots (indicated in red, yellow, green and purple), including the heme group (HEME).

Cartoon representation of Cytochrome P450 19A1 (PDB code 3EQM(55)) with the natural substrate 4-androstene-3-17-dione (ASD, cyan stick representation) bound. Protein residues (stick representation) involved in polar protein–ligand interactions in more than 50% of the simulation time are grouped in four hotspots (indicated in red, yellow, green and purple), including the heme group (HEME). Model 1: The steroid based aromatase inhibitors, except ligand 210, form the basis of model 1. The starting poses for the simulations with highest W values for these ligands are similar to the natural substrate ASD with respect to binding orientation (i.e., as in the X-ray structure PDB ID 3EQM) and interaction profile,[55] cf. Figure S5 of the Supporting Information. Figure , panel A shows that hydrogen-bonded contacts dominate the polar interactions, between two carbonyl groups of the steroid (position A3 and D17) and residues M374 and D309 or T310, although not necessarily simultaneously. These findings are in line with previous biochemical studies on steroid binding to CYP19A1.[94−96] Steroid 210 has a bulky trifluoride group hindering formation of both hydrogen bonds, resulting in a binding orientation similar to the natural substrate but resulting in a prediction error of 19.0 kJ mol–1 by model 1. Interestingly, 82% of the NSAI ligands that are uniquely described by this model 1 form hydrogen-bonded interactions with M374 and/or D309/T310 as well (forming these interactions in on average 40% of simulation time). They predominantly do so using a carbonyl O atom or N atom acceptor part of a nitrile group and/or azole ring. In particular the presence of a nitrile group allows for simultaneous hydrogen-bond interactions with the aforementioned protein residues. In addition, there are hydrophobic contacts between residues F221 and W224 and aromatic or cyclic apolar ligand groups, and hydrogen bonding interactions with Q225 (Figure A). Figure also illustrates that potential heme-coordinating poses of the NSAI ligands are observed less frequently in model 1 in comparison to the other two models.
Figure 8

Protein–ligand interaction profiles for the compounds in models 1–3 (Table ) derived from the stochastic approximate inference. The relative interaction frequencies for each protein residue–ligand interaction are represented by vertically stacked bars where the bar colors correspond to a specific interaction type as listed in the graph legend. Hydrophobic contact frequencies are divided by 10 because of their relative abundance with respect to the other classified interactions.

Protein–ligand interaction profiles for the compounds in models 1–3 (Table ) derived from the stochastic approximate inference. The relative interaction frequencies for each protein residue–ligand interaction are represented by vertically stacked bars where the bar colors correspond to a specific interaction type as listed in the graph legend. Hydrophobic contact frequencies are divided by 10 because of their relative abundance with respect to the other classified interactions. Model 2 and 3 exclusively contain NSAIs. The contact profiles include hydrogen-bonded interactions to the same residues of the protein as in model 1 but the hydrogen bond types and frequencies differ, Figure . Model 3: Hydrogen bonding to residues Q225, M374 as well as S478 predominantly involves the (single) nitrile group of the compounds. The compounds almost exclusively contain imidazole groups of which the N atoms can be involved in hydrogen bonding to T310 and in heme coordination (cf. Figure ). 22 of the 33 compounds have a Letrozole (Figure e) like topology with one imidazole group, one nitrile containing aryl group and a variable apolar moiety. This topology is complementary to the active site and enables contacts with various interaction hotspots simultaneously (Figure C). The Letrozole-like compounds can adopt multiple binding orientations due to rotational symmetry, thereby preserving interactions. This is confirmed by the finding that per ligand, several simulations contribute to the calculated binding free energy (Table S1, last column). The importance of interactions for the (predicted) affinity becomes apparent when comparing the compounds in model 3 in the low-affinity region of the correlation plot (Figure C, crosses) with those in the high-affinity region. The low-affinity compounds are almost exclusively Fadrozole (Figure b) like compounds with either an imidazole or imidazopyridine group that is unable to interact with as many hotspots simultaneously as the Letrozole-like compounds that are in the high-affinity region. Model 2 shows more variation in the scaffold and functional groups of the compounds included when compared to model 3. Differences are the presence of halogens, carbonyl and hydroxyl groups in model 2 compounds and the replacement of the imidazole by a triazole group, the replacement of the variable apolar moiety by an aryl or thiophene group, and the replacement of a nitrile group by a halogen or keto group. Although the compounds belonging to the cluster center of model 2 are unique to it from a statistical point of view (in terms of α/β combination and other model statistics, Table and Figure S3, Supporting Information), the compounds of the model that are under- and overpredicted are more model 3- and model 1-like, respectively, in terms of topology (Table S1) and interaction profile. As such, model 2 positions itself in between model 1 and 3 as shown by the interaction profile (Figure B) and model statistics (Table ). In conclusion, the interaction profiles of the MD trajectories provide a basis for understanding and establishing the applicability domain of the resolved models in terms of observed protein–ligand interactions. These differences in interactions can explain why similar compounds fall into different models. For example, compounds 333, 318 and 246 only differ by small variation in their benzylic para-substituent but they are part of models 1, 2 and 3, respectively (Table S1). The p-acetylated compound 318 was found to contribute to model 2 with four different binding poses and accordingly it interacted with a variety of protein residues. In contrast, the slightly larger compound 333 and the nitril containing compound 246 only contributed with a single pose, which mutually differed in terms of binding orientation and interacting residues (310 for compound 246 vs 115 and 309 for compound 333). As another example, a Cl-substituted letrozole analogue (compound 139) contributed to model 2, with its poses differing from the pose that contributed most (to model 1) and that showed the F- and Br-substituents of counterpart compounds 200 and 98 on a distance from methionine 374 suited for halogen-bonding interactions, Table S1. We note that MD sampling was found to be necessary for the purpose of establishing applicability domains: when performing interaction profiling based on the docked MD starting conformations only, we could, e.g., not classify interaction profiles for models 2 and 3. Despite the diversity in the data set, our machine-learning workflow was successful in grouping together compounds with a steroid scaffold in model 1 and compounds with a Letrozole like topology composed out of an imidazole group, one nitrile containing aryl group, and a variable apolar moiety in model 3. In addition, other compounds present in these models that do not share the same topology show similar interaction profiles in terms of interaction hotspots. These structural characteristics and corresponding interaction profile define the applicability domain for models 1 and 3, which is beneficial for developing predictive LIE models as illustrated by their good model statistics. In contrast, model 2 is more diverse with topologically similar compounds being represented by small subsets (≤5) of compounds. The key protein residues interacting with the compounds comprised in model 2 are the same as for model 3. This could explain e.g. why compounds 78 and 80 fall in models 2 and 3, respectively (Table S1). However, the type of hydrogen-bond acceptors is different (Figure ), which positions model 2 in between model 1 and 3 in terms of LIE parameters. The diversity in molecular structures covered by the three models is illustrated by a selection of compounds from the models as presented in Figure S6 of the Supporting Information. Protein–ligand interaction profile results illustrate that the number and type of polar protein–ligand interactions are determinative for the LIE empirical parameters and that the variation in the data set results in a system of three overlapping but distinct models rather than one global model. The importance of polar contacts is shown by the consistent value for α between 0.10 and 0.24 as frequently reported before[79,97,98] and a β value as a function of the ligand functional groups[97,98] rather than the theoretical value of 0.5.[43] With values markedly larger than 0.5, the ligand functional group dependent value for β found in this study are different than those previously proposed for other systems,[97,98] which may well be due to choices in the force field and system set up[80] and which is in line with spread in LIE parameters previously reported among a variety of other CYPs.[39−41,54] All nonsteroidal ligands have one azole ring of which the basic nitrogen atoms have the potential to apically coordinate the heme iron. Despite the inability of commonly used force fields to accurately account for the energetics of heme coordination, a considerable number of compounds in model 2 and 3 adopted a potential coordinating pose during simulation based on interaction profile data (Figure , gray bars).

Conclusions

In this work, we present binding affinity prediction models for steroidal and nonsteroidal inhibitors of CYP19A1 aromatase using our automated LIE workflow.[39,54,64] The data set of 132 compounds used for training LIE models was acquired from a lead discovery study and poses challenges to the calibration of empirical free energy models because of the sparsity of and large structural diversity within the data set and due to the catalytic site malleability and the substrate and interaction promiscuity of the CYP protein. Nevertheless, it is a data set representative for training of LIE or other models in industrial or other applied settings. In spirit of our automated iLIE approach, we developed an automated machine learning workflow aimed at exploring the model parameter landscape and at maximizing the number of data set compounds explained in one or few LIE models. The method was successful in accounting for 86% of the data set in 3 models showing high correlation between calculated and observed values for binding free energy (r2 of 0.86, 0.76, 0.92; RMSE of 2.36, 2.59, 1.68 kJ mol–1 for the three models, respectively). The use of a maximum likelihood estimator and Bayesian inference maximize the chance of identifying posterior predictive distributions describing the data set and limiting the chance of overfitting of the LIE models. The robustness of the models is illustrated by small variations after bootstrap analysis (segmentation cross validation) and by good leave-one-out cross validation statistics (q2LOO of 0.83, 0.73, 0.92; SDEPLOO of 2.57, 2.59, 1.80 kJ mol–1, respectively). The ligand clusters leading to the final set of models are visually selected from the heat map (Figure ). In a future iteration of the software we aim to also automate this final step of our workflow, thereby completing its automation. The resolved clusters, from which the three models were derived separate the data set into a SAI model and two NSAI models that have distinct α and β model parameters. With respect to polar interactions, compounds in model 1 predominantly interact with D309/T310 and/or M374 by hydrogen-bonded interactions with carbonyl oxygens as ligand acceptors, which is common for steroids interacting with CYP19A1.[55,71,94−96] On the other hand, compounds in model 3 interact with Q225, M374 and S478 via hydrogen bonding involving the nitrile nitrogen, and with T310 by hydrogen bonding via an imidazole N atom. Rotationally symmetric binding conformations of these compounds that preserve the overall interaction profile are frequently included in the binding affinity prediction. This underlines the flexible nature of the CYP19A1 catalytic cavity and the ability of the iLIE approach to account for this. Using our machine learning workflow, we have been able to calibrate LIE binding affinity prediction models with a protein–ligand interaction based applicability domain that explains 86% of the data set using experimental affinity data and information extracted from MD only. The detailed structure and interaction based applicability domain provides a frame of reference for the user to determine if a model, and which model, can be used for binding affinity prediction of a novel compound based on the protein–ligand interaction profile and provides an estimate of the reliability of the prediction. The automated nature of the workflow allows it to be used together with our automated iLIE workflow, together creating a pipeline to apply LIE in high-throughput settings.
  86 in total

1.  What determines the van der Waals coefficient beta in the LIE (linear interaction energy) method to estimate binding free energies using molecular dynamics simulations?

Authors:  W Wang; J Wang; P A Kollman
Journal:  Proteins       Date:  1999-02-15

2.  A method to obtain a near-minimal-volume molecular simulation of a macromolecule, using periodic boundary conditions and rotational constraints.

Authors:  Henk Bekker; Jur P van den Berg; Tsjerk A Wassenaar
Journal:  J Comput Chem       Date:  2004-06       Impact factor: 3.376

Review 3.  Aromatase inhibitors: rationale and use in breast cancer.

Authors:  Cynthia Osborne; Debu Tripathy
Journal:  Annu Rev Med       Date:  2005       Impact factor: 13.739

4.  Empirical scoring functions for advanced protein-ligand docking with PLANTS.

Authors:  Oliver Korb; Thomas Stützle; Thomas E Exner
Journal:  J Chem Inf Model       Date:  2009-01       Impact factor: 4.956

Review 5.  Aromatase inhibitors in the treatment of breast cancer.

Authors:  Robert W Brueggemeier; John C Hackett; Edgar S Diaz-Cruz
Journal:  Endocr Rev       Date:  2005-04-06       Impact factor: 19.871

6.  Definition and testing of the GROMOS force-field versions 54A7 and 54B7.

Authors:  Nathan Schmid; Andreas P Eichenberger; Alexandra Choutko; Sereina Riniker; Moritz Winger; Alan E Mark; Wilfred F van Gunsteren
Journal:  Eur Biophys J       Date:  2011-04-30       Impact factor: 1.733

Review 7.  Advances and applications of binding affinity prediction methods in drug discovery.

Authors:  Marco Daniele Parenti; Giulio Rastelli
Journal:  Biotechnol Adv       Date:  2011-08-12       Impact factor: 14.227

Review 8.  Aromatase inhibitors and their application in breast cancer treatment*.

Authors:  A M Brodie; V C Njar
Journal:  Steroids       Date:  2000-04       Impact factor: 2.668

9.  Synthesis and biochemical studies of 17-substituted androst-3-enes and 3,4-epoxyandrostanes as aromatase inhibitors.

Authors:  Margarida M D S Cepa; Elisiário J Tavares da Silva; Georgina Correia-da-Silva; Fernanda M F Roleira; Natércia A A Teixeira
Journal:  Steroids       Date:  2008-07-17       Impact factor: 2.668

10.  Efficient free energy calculations for compounds with multiple stable conformations separated by high energy barriers.

Authors:  Jozef Hritz; Chris Oostenbrink
Journal:  J Phys Chem B       Date:  2009-09-24       Impact factor: 2.991

View more
  7 in total

1.  A Comparative Linear Interaction Energy and MM/PBSA Study on SIRT1-Ligand Binding Free Energy Calculation.

Authors:  Eko Aditya Rifai; Marc van Dijk; Nico P E Vermeulen; Arry Yanuar; Daan P Geerke
Journal:  J Chem Inf Model       Date:  2019-09-11       Impact factor: 4.956

2.  Docking Studies and Molecular Dynamics Simulation of Ipomoea batatas L. Leaves Compounds as Lipoxygenase (LOX) Inhibitor.

Authors:  Yeni Yeni; Supandi Supandi; Lusi P Dwita; Suswandari Suswandari; Maizatul S Shaharun; Nonni S Sambudi
Journal:  J Pharm Bioallied Sci       Date:  2020-11-05

3.  Network pharmacology-based analysis for unraveling potential cancer-related molecular targets of Egyptian propolis phytoconstituents accompanied with molecular docking and in vitro studies.

Authors:  Reham S Ibrahim; Alaa A El-Banna
Journal:  RSC Adv       Date:  2021-03-22       Impact factor: 3.361

4.  Effective estimation of the inhibitor affinity of HIV-1 protease via a modified LIE approach.

Authors:  Son Tung Ngo; Nam Dao Hong; Le Huu Quynh Anh; Dinh Minh Hiep; Nguyen Thanh Tung
Journal:  RSC Adv       Date:  2020-02-21       Impact factor: 4.036

5.  Adequate prediction for inhibitor affinity of Aβ40 protofibril using the linear interaction energy method.

Authors:  Son Tung Ngo; Binh Khanh Mai; Philippe Derreumaux; Van V Vu
Journal:  RSC Adv       Date:  2019-04-23       Impact factor: 4.036

6.  Binding free energy predictions of farnesoid X receptor (FXR) agonists using a linear interaction energy (LIE) approach with reliability estimation: application to the D3R Grand Challenge 2.

Authors:  Eko Aditya Rifai; Marc van Dijk; Nico P E Vermeulen; Daan P Geerke
Journal:  J Comput Aided Mol Des       Date:  2017-09-09       Impact factor: 3.686

7.  How Good is Jarzynski's Equality for Computer-Aided Drug Design?

Authors:  Kiet Ho; Duc Toan Truong; Mai Suan Li
Journal:  J Phys Chem B       Date:  2020-06-22       Impact factor: 2.991

  7 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.