Literature DB >> 35213804

PathDetect-SOM: A Neural Network Approach for the Identification of Pathways in Ligand Binding Simulations.

Stefano Motta¹, Lara Callea¹, Laura Bonati¹, Alessandro Pandini^2,3.

Abstract

Understanding the process of ligand-protein recognition is important to unveil biological mechanisms and to guide drug discovery and design. Enhanced-sampling molecular dynamics is now routinely used to simulate the ligand binding process, resulting in the need for suitable tools for the analysis of large data sets of binding events. Here, we designed, implemented, and tested PathDetect-SOM, a tool based on self-organizing maps to build concise visual models of the ligand binding pathways sampled along single simulations or replicas. The tool performs a geometric clustering of the trajectories and traces the pathways over an easily interpretable 2D map and, using an approximate transition matrix, it can build a graph model of concurrent pathways. The tool was tested on three study cases representing different types of problems and simulation techniques. A clear reconstruction of the sampled pathways was derived in all cases, and useful information on the energetic features of the processes was recovered. The tool is available at https://github.com/MottaStefano/PathDetect-SOM.

Entities: Chemical

Mesh：

Substances：
Ligands

Year: 2022 PMID： 35213804 PMCID： PMC8908765 DOI： 10.1021/acs.jctc.1c01163

Source DB: PubMed Journal: J Chem Theory Comput ISSN： 1549-9618 Impact factor: 6.006

Introduction

The binding of a ligand to its macromolecular target is a critical event in many cellular processes in living organisms. Understanding ligand–protein recognition and interactions at the molecular level is important to unveil biological mechanisms and to provide the basis for the design and discovery of new drugs.[1,2] Molecular docking is a well-established computational method to predict the three-dimensional structure and to estimate the binding free energy of a protein–ligand complex.[3,4] The low computational requirements of this method made it the leading approach for ligand virtual screening. In recent years, due to an impressive increase in computational power, alternative methods based on molecular dynamics (MD) have gained increasing attention for their higher accuracy in modeling ligand–protein binding by considering protein conformational flexibility.[5] These methods can be classified into two categories: those mainly focused on the bound and unbound states for estimation of the binding free energy and those aimed at reproducing the physical pathway (PP) of binding (and/or unbinding).[6] Methods that fall in the first category include end-state methods, such as the linear interaction energy (LIE)[7] and the molecular mechanics Poisson–Boltzmann surface area (MM-PBSA),[8] and alchemical free-energy perturbation methods, such as thermodynamic integration (TI)[9] and free-energy perturbation (FEP).[10] PP methods simulate the complete binding and/or unbinding events, which can in principle lead to the calculation of both thermodynamic and kinetic properties[11] and to the characterization of relevant states along the pathways. Methods falling within this category include several enhanced-sampling approaches such as steered MD (SMD),[12,13] metadynamics (MetaD)[14] and its variations,[15−18] Gaussian-accelerated MD (GaMD),[19] scaled MD,[20,21] τ-RAMD,[22] MD binding,[23,24] maze,[25,26] and CG-MD.[27] It should be noted that with the increase in computational power due to easier access to high-performance or GPU-based architectures, unbiased simulations are also becoming computationally affordable for the study of long-time-scale processes.[28−31] The PP methods have the advantage of explicitly simulating key molecular events, such as the protein conformational changes that facilitate ligand access to the binding cavity and the formation of intermediate states. All the above information is fundamental to suggest appropriate modifications of hit compounds in drug-design studies. However, PP methods generally require an extensive sampling of binding/unbinding events to obtain an accurate description of the energy landscape of the process based on reliable statistics. It follows that many events have to be analyzed through several simulation replicas or with a single simulation that describes several re-crossing events. The large amount of data from different replicas or events calls for better automated tools to analyze all the simulated events at once and to provide a clearly interpretable summary picture of the differences in the sampled pathways. We suggest the use of self-organizing maps (SOMs)[32] to handle such complex sets of data. An SOM is a type of artificial neural network useful for effective identification of patterns in the data[33−35] and has been widely used in many fields.[36,37] The most interesting property of an SOM is that it performs a dimensionality reduction by mapping multidimensional data on the SOM grid, retaining topological relationships between neurons, that is, keeping similar input data close to each other on the map.[33] Several applications of SOMs to the analysis of biomolecular simulations can be found in the literature,[38−40] ranging from comparison of the dynamics of different mutants,[41] clustering of ligand poses in virtual screening,[42] binding site identification,[43] identification of blocks for structural alphabets[44−46] and conformational analysis of loop opening.[47] More recently, we applied SOMs to the reconstruction of protein unfolding pathways on the basis of several SMD simulation replicas.[48] Here, we designed, implemented, and tested PathDetect-SOM (pathway detection on SOM), an SOM-based protocol for the analysis of ligand binding/unbinding pathways derived from MD simulations with PP methods. Taking advantage of the properties of SOMs, the tool is able to generate a model that clearly highlights differences in the pathways sampled along a simulation or in different replicas. The protocol makes it possible to obtain a synthetic view of the sampled conformational space by highlighting the relevant states, to trace the pathways followed by the system on the SOM, and to derive a network model that provides a meaningful representation of the binding/unbinding pathways. We applied this protocol to a range of ligand binding/unbinding simulations with different features, successfully obtaining not only a simple schematic representation of the pathways but also hints about the thermodynamics and/or kinetics of the process. The protocol is implemented as the batch executable R script with a command-line interface that will be accessible to biomolecular practitioners with limited or no familiarity with the R environment.

Methods

Overview of the Protocol

PathDetect-SOM is a modular command-line tool based on a three-step protocol (see Figure ):

Figure 1

Flowchart of the PathDetect-SOM protocol for ligand binding studies. Data preparation (a); map training and analysis (b); and pathway analysis (c).

Flowchart of the PathDetect-SOM protocol for ligand binding studies. Data preparation (a); map training and analysis (b); and pathway analysis (c). (a) The user selects a set of features best describing ligand conformations along the process. If a set of proteins and ligand atoms is provided, the tool will automatically compute the pairwise distances between the protein and ligand sets of atoms. This set of distances will be hereafter referred to as intermolecular distances. (b) SOM is initialized and trained with the input vectors containing the values of the selected features for all the simulation frames. Each frame is considered as a data point and assigned to the neuron with most similar feature values. During the training process, the feature values of a neuron and its neighbors are adjusted toward the values in the input vector assigned to that neuron. The final prototype vector of each output neuron summarizes the conformations associated with the neuron, and groups of similar conformations are mapped to neighboring neurons. In addition, to offer a more concise picture of the map, after training, the neurons are also grouped to a relatively small number of clusters and the representative conformation of each cluster is saved. Population analysis and average properties can then be visualized on the trained SOM. (c) The pathways followed during the simulation can be directly traced on the SOM, reconstructing the binding/unbinding pathway. This representation facilitates the identification of regions of the map exclusively sampled by specific simulations. In turn, pathways can be clustered to recover dominant binding events. Finally, a graph-based representation of transitions can be built from the transition matrix calculated at the neuron level. Community detection on this graph can highlight putative macrostates. PathDetect-SOM is distributed as an R script available under GNU General Public License at https://github.com/MottaStefano/PathDetect-SOM. The repository includes a brief guide and tutorial material based on sample trajectories from the first study case presented in the results.

Data Preparation

The feature selection is a key step for SOM training. Several features can be used to train the SOM (e.g., the simple cartesian coordinates of a set of atoms, see Supplementary Methods and Figure S1). However, the intermolecular distances are the most suitable choice to accurately describe the ligand–receptor reciprocal orientation. A set of receptor and ligand atoms is chosen for the computation of intermolecular distances. Selected atoms should describe both the binding site and the mouth at the entrance of the binding site. Ideally, both atoms from the backbone and from large or polar/charged side chains should be included when the side chain dynamics and interactions are relevant for binding. Similarly, selected ligand atoms should well describe the core molecular structure and all the relevant lateral groups. The user can provide the filtered trajectory with the coordinates of the chosen atoms in the form of an xvg file, easily obtained using the GROMACS gmx traj command. A capping value is applied to the distances to avoid that training is dominated by information on the unbound states (see Supplementary Methods and Figure S2). Details on the atom selection for the study cases presented here are summarized in Table S1.

Map Training

The selected features are used to train the SOM using an iterative approach. The map is initialized by assigning random values of the feature vectors to each neuron. In each training cycle, the input vectors representing the single conformations are presented in random order to the map and assigned to the neuron with the closest feature values, also called the best matching unit (BMU). The feature values of the BMU and its neighbors are modified to be closer to the values of the input vector. The magnitude of the modification decreases with the distance from the BMU and along the training. At the end of the iterative process, the resulting SOM preserves the topological relationship between neurons, keeping similar original input data close on the map. In a second step, with the aim of making the map easier to interpret, the neurons are further grouped in a small, but representative, number of clusters by agglomerative hierarchical clustering using Euclidean distances and complete linkage. For each system, the optimal number of clusters can be selected on the basis of silhouette profiles (Figure S3). We propose to choose the number of clusters as the one with the optimal silhouette profile within the 9–15 range. A lower number of clusters would create conformations too coarse for the process that is taking place, while a number higher than 15 would create excessive fragmentation, making the visual interpretation difficult and thus going against the purpose of the tool. A representative structure for each neuron is saved; this is defined as the structure with the feature vector closest to the neuron vector. For each cluster, a representative neuron is also chosen as the one with the feature values closest to the weighted-average feature vector of the neurons belonging to that cluster. In the latter case, the average was performed using the population of each neuron as weight. In the present work, 10 × 10 sheet-shaped SOMs with a hexagonal lattice shape and without periodic boundary conditions were trained over 5000 training cycles. The neurons were further grouped in a small, but representative, number of clusters, different for each study case, using the cluster analysis approach outlined above.

Path Analysis

The trained SOM captures the conformational space of several trajectories in a topological map. Therefore, it is possible to reconstruct the path explored by each simulation on the map. Pathways are traced on the SOM based on the annotation of the BMU associated with each frame of the simulation. Given that similar conformations are enforced to be close on the map, the pathways traced on the SOM are usually continuous. Some exceptions to this behavior may arise due to large conformational changes between two consecutive frames in the simulation. Alternatively, discontinuities may highlight important information on the process only visible by projection on the map, that is, the map has the potential to identify and report when partially geometrically similar conformations should be considered dissimilar and separated in the low-dimensional space, probably representative of distinct conformational states. The resulting SOM pathways were also clustered by agglomerative hierarchical clustering using average linkage. Two different distance metrics are implemented in the PathDetect-SOM tool: a time-dependent and a time-independent distance. In the time-dependent version, the distance between the SOM pathways of two simulations is defined as the average distance of the BMUs of each couple of frames. The distance between two BMUs is defined as the Euclidean distance between the position of the neurons on the map. This distance was also used in a previous work by the authors.[48] In the time-independent version, for each frame of the simulation, the minimum distance between the BMU of the first and the second simulation is computed and averaged over the number of frames. This approach provides a framework to compare simulations evolving at different speeds such as those presented in study case 2. For this type of simulation, indeed, frames to be compared are not at the same position along the replicas due to the different evolution of the simulations. Comparing each frame with the closest frame of the second replica is a time-independent way of performing a distance calculation between two pathways. An approximate transition matrix between each pair of neurons can be computed from the time-dependent distance approach. The matrix is then transformed into a row stochastic matrix, and a graph is built with nodes representing the neurons and edges with weight proportional to the negative logarithm of the transition probability between the corresponding neurons. Communities of nodes can be detected, and in the present work, we used the walktrap algorithm,[49] but other methods can be easily applied. A neuron representative of each community is selected as the one with the highest eigenvector centrality score in the subgraph which only contains nodes belonging to the community. In this work, for the third study case, a commitor analysis was performed using the R library markovchain.[50,51] This analysis computes the probability of hitting a set of states A before set B starting from different initial states. In this case, the two extremes were the bound and unbound states. All the analyses were performed in the R statistical environment using the kohonen package[52,53] for the SOM training and igraph package[54] for graph construction and analysis.

Results

The PathDetect-SOM protocol, developed for the analysis of ligand binding/unbinding pathways, is implemented into a command-line tool with the capability to build an SOM representation of the conformations sampled during the MD simulations. Taking advantage of the SOM topological ordering, the tool offers the possibility to visually represent pathways sampled during different events/replicas in a clear 2D representation. Finally, the geometric microstates identified by the SOM (neurons) can be represented as a graph model, built from their transition probabilities. The graph provides a clear representation of the pathways followed during the simulations, facilitating the identification of alternative routes. Community detection on the graph generates a state model analogous to kinetic partitioning. In the following sections, we present the application of the protocol to three cases that differ for the PP method used to investigate the ligand binding. The study cases were selected to represent PP simulations with different characteristics to highlight the flexibility and general applicability of the PathDetect-SOM tool. (1) The first case regards a ligand unbinding process studied through several replicas of SMD simulation. The simultaneous evolution of the replicas (due to the constant velocity of the bias) and the use of a directional collective variable (CV) makes this study case simple and optimal for testing some parameters of the tool (tests are discussed in the Supplementary Methods section, Figures S1, S2, S4, and S5 and Tables S2 and S3). (2) The second study case is a ligand unbinding problem treated with several replicas of infrequent MetaD. This method differs from the SMD used for the first case because the system evolves along the selected CV with a series of small forth and back movements that fill the free-energy basin. As a result, there is no correspondence between the simulation times of different replicas. Moreover, the type of CV chosen in this case is nondirectional and may provide very different unbinding paths. (3) The third study case consists of a single long MetaD simulation, in which several binding and unbinding events are sampled. In this case, the simulation evolves in all the directions according to two selected CVs, and the ligand has greater freedom than in the previous cases. All the simulations were performed using GROMACS[55] patched with PLUMED.[56]

Ligand Unbinding through Multiple Replicas with Constant Velocity Pulling

The hypoxia inducible factor 2α (HIF-2α) is a pharmacologically relevant transcription factor widely recognized as a target for cancer therapy.[57] Following the discovery of a buried cavity within the HIF-2α Per-ARNT-SIM-B (PAS-B) domain,[58] several artificial small molecules were identified as HIF-2α ligands and potential inhibitors of the HIF-2α dimerization with the aryl hydrocarbon receptor nuclear translocator (ARNT).[59−62] In a recent work, we investigated the unbinding of the THS-020 ligand from the HIF-2α PAS-B domain through SMD simulations.[63] 50 constant velocity SMD replicas of 25 ns each were used to pull the ligand along the selected CV, namely, the distance between the center of mass of the amino acid atoms lining the cavity and the center of mass of the ligand. The simulations analyzed in this work are those along the preferred entrance to the cavity identified in the previous work (reported as “path 1”).[63] All replicas evolved simultaneously, due to the constant velocity of the bias, and along a directional CV. The trained SOM (Figure and details in the Methods section) shows a distribution of states that ranges from the initial bound state (top-right of the map) to the unbound state (top-left). The neighbor distance plot (Figure a) represents the average similarity of a neuron with its neighbors. This map shows a compact group of neurons in correspondence of the bound state and along the right and bottom border of the map. On the contrary, neurons lying at the center of the map display more heterogeneities. In the cluster analysis of SOM neurons (see the Methods section), we identified nine clusters that represent the binding geometries explored by the system following the distance CV used for the SMD simulations (Figure b). The representative conformations extracted from the different clusters help to visualize the relevant states sampled.

Figure 2

SOM analysis of SMD simulations of THS-020 unbinding from HIF-2α: (a) neighbor distance plot. (b) Clustering of the neurons. The representative conformation of each cluster is depicted in cartoons with the ligand in sticks. The pathways followed by each replica were then mapped on the SOM (Figure S6). They are quite consistent, since they roughly evolve through the same sequence of clusters, in agreement with the high directionality imposed by the method. However, some recurrent unbinding pathways can be identified with slight differences from each other, as also emerges from the dendrogram in Figure S7. An overview of these pathways is provided by the network graph derived from the transition matrix (see the Methods section), reported in Figure a. All the simulations start from the bound state (top-right), in which the ligand presents the nitrobenzene ring parallel to the main helix, with the nitro group pointing toward the lower side of the cavity. Then, some replicas evolve through neurons at the bottom right of the map (branch 1 of the graph), while others follow pathways closer to the center of the map (branch 2 of the graph). While simulations following branch 1, which was sampled in most of the replicas (34 out of 50), show the ligand slightly rotated along its principal axis, those along branch 2 maintain the ligand in an orientation similar to the bound state and rigidly translate it along the pathway. When the nitro group reaches the solvent, however, the two branches merge before a second ramification in the graph appears (branches 3 and 4). Replicas in branch 3 describe a rigid transition of the ligand that maintains the initial bound orientation while those in branch 4 sample conformations with the ligand rotated and bound to the mouth of the cavity. The two final branches appear equally probable (22 replicas though branch 3 and 28 through branch 4).

Figure 3

Transition network for the SMD simulations of THS-020 unbinding from HIF-2α. (a) Transition network with its main ramifications explicitly indicated by black arrows (nodes are colored according to the SOM clusters). The representative conformations of neurons that characterize each branch (red circles in the network) and of the bound state (in yellow) are depicted in cartoons with the ligand in sticks. (b) Network colored according to the average SMD force of its frames (from blue to red, increasing values of this property), and the representative conformations of the neurons with the maximum forces, superimposed to the bound state (in gray). Finally, we colored neurons according to the average SMD pulling forces applied to the frames belonging to that neuron (Figure b). Results show that the pulling of the ligand out of its initial bound state requires the maximum of the force, while the remaining part of the pathway requires less force. We interpreted the peaks of maximum forces as the approximate location of the highest energy barrier to be crossed during unbinding, which corresponds to the energy necessary to pull out the ligand from its initial state.

Ligand Unbinding through Multiple Replicas with a Bidirectional Sampling

Deoxyhypusine synthase (DHS) is an enzyme responsible for the post-translational hypusination of the eukaryotic initiation factor 5A (eIF5A) that controls cell proliferation and has been linked to cancer.[64] The involvement in pathogenesis together with the high specificity and functional relevance of the hypusination reaction have made this system an important and promising therapeutic target, stimulating the design and development of inhibitors able to target the hypusination process, including the N1-guanyl-1,7-diaminoheptane (GC7). In a recent work by some of the authors, we investigated the unbinding of GC7 from DHS using an approach inspired to infrequent MetaD.[65] We used the number of contacts between the ligand and the protein binding site atoms as a single CV in 30 replicas of infrequent MetaD that were stopped when the ligand reached an unbound state. By applying the PathDetect-SOM approach to the above simulations, we obtained the trained SOM shown in Figure . The neighbor distance plot (Figure a) displays a very compact region on the left side, corresponding to different bound states. All these neurons were grouped together in the neuron clustering phase (cluster A), while the diverse unbound conformations are segregated to the opposite side (Figure b).

Figure 4

SOM analysis of simulations of GC7 unbinding from DHS. (a) Neighbor distance plot. (b) Clustering of the neurons. The representative conformation of each cluster is depicted in cartoons with the ligand in sticks. Due to the nature of these MetaD simulations, where the system evolves along the CV with a series of small forth and back movements, the direct tracing of the pathways on the map may cause a slight confusion (Figure S8). Moreover, given the lack of correspondence between simulation times of different replicas, we needed to perform a time-independent clustering of pathways (see the Methods section), which allows us to compare replicas of different lengths (dendrogram in Figure S9). Two distinct types of pathways arise from this analysis. Building a network from the transition matrix, as in the previous study case, made the differences between the two pathways more evident (Figure a).

Figure 5

Transition network for the simulations of GC7 unbinding from DHS. (a) Transition network with its ramification explicitly indicated by black arrows (nodes are colored according to the SOM clusters). The representative conformations of neurons that characterize each branch (red circles in the network) are depicted in cartoons with the ligand in sticks. (b) Network colored according to the node betweenness centrality (from white to red, increasing values of this property), and the representative conformations of neuron 17, bottleneck for pathway 1. The two pathways (branches 1 and 2 of the network, in Figure a) lead to different neurons, all describing unbound states. The separation of the unbound states in different neurons is due to the ligand exiting from the two opposite sides of the binding site. Compared to the previous case, this graph is more densely connected due to the bidirectionality of the sampling during the MetaD simulation. Most of the simulations (70%) evolve through branch 1 (Pathway A in the original work[65]) in which the ligand escapes from the side of its guanidine group. The remaining replicas (30%) proceed through an opposite pathway, indicated as branch 2 in the graph, in which the ligand exits from the side of its amino-group (Pathway B in the original work[65]). Interestingly, most of the simulations following branch 1 pass through neuron 17, a node with a high value of betweenness centrality (Figure b). As betweenness is calculated as the number of shortest paths through a node,[66] neuron 17 is a critical conformation to observe the bound/unbound transition. The representative conformation of this neuron shows the characteristic of the intermediate state hypothesized in the previous work,[65] namely, a stable salt bridge of the ligand primary amine group with Glu137.

Ligand Binding/Unbinding through a Single Metadynamic Simulation

As a third study case, we applied the PathDetect-SOM protocol to a single MetaD simulation of ligand binding. The system under study is the same presented in Ligand Unbinding through Multiple Replicas with Constant Velocity Pulling: the THS-020 binding to HIF-2α. In a previous work, starting from the SMD simulations, we built a path CV and used well-tempered MetaD to enhance the sampling along the selected CV and to reconstruct the free-energy landscape of the process.[63] During the 1.8 μs of MetaD simulation, we observed a high number of binding and unbinding events. The trained SOM (Figure and details in the Methods section) presents the starting bound conformation in the top-left corner (cluster L) and the completely unbound conformation in the top-right corner (cluster G). Due to the conformational freedom along the z(r) CV of the path CV (which represents the distance from the reference path), the ligand can also rotate and sample alternative bound conformations. This is the case of cluster I, which contains conformations in which the ligand is rotated 180° with respect of the X-ray starting structure.

Figure 6

SOM trained with MetaD simulations of THS-020 binding to HIF-2α. (a) Neighbor distance plot. (b) Clustering of the neuron vectors. The representative conformation of each cluster is depicted in cartoons with the ligand in sticks. For the sake of comparison with the free-energy landscape previously identified by the MetaD calculation,[63] we mapped the frames belonging to each free-energy basin on the SOM (Figure S10). We found that conformations belonging to each of these basins generally map in few close neurons, belonging to the same cluster on the map. In this study case, the direct tracing of pathways on the SOM is difficult due to the unique long simulation that samples multiple binding/unbinding events. The set of pathways is better represented on the SOM in the form of a movie (Supplementary Movie 1). However, the transition network analysis proposed in the PathDetect-SOM protocol is capable of providing a clear representation of the pathways sampled during the MetaD simulation (Figure a).

Figure 7

Transition network for the MetaD simulation of THS-020 binding to HIF-2α. (a) Transition network with main pathways indicated by black arrows (nodes are colored according to the SOM clustering). (b) Communities identified by the walktrap method represented on the network (nodes are colored according to the different communities). The representative conformations of the communities are depicted in cartoons with the ligand in sticks. (c) Committor probability analysis. The representative conformations of neurons with a committor probability of about 0.5 are reported in red sticks and X-ray starting conformation in gray sticks. As shown in Figure a, there are two main branches: branch 1 connects the crystallographic-like bound conformation to the unbound state, while branch 2 follows the unbinding of an alternative binding mode (cluster I). Only a small number of connections between the two branches are present, indicating that the ligand cannot freely rotate within the binding site, and it preferentially unbinds and rebinds to interconvert between the two bound states. The previous study cases sampled only one unbinding event for each replica and, for this reason, the graph model only describes the interconnection between states along the unbinding pathway. In this last case, due to the MetaD sampling of several binding/unbinding events, the obtained graph takes into account connections along both directions and thus contains more information about the kinetic of the process. Indeed, assuming that the ligand remains trapped for a sufficient time inside an energy minimum, the communities identified with the walktrap method exhibits the properties of kinetic clustering (Figure b). It is important to consider that, for an accurate calculation of kinetic properties, the transition matrix should take into account the effect of the bias potential deposited during the simulation. For this reason, this approach can provide quantitative results only if the analyzed simulation is unbiased or if a proper reweighting procedure to the transition matrix is applied. With the aim of showing the potential of the tool, we present the results obtained from the previously published MetaD simulation, without performing any reweighting procedure. The same approach applied to an unbiased simulation, or with a reweight of the transition matrix, would provide accurate description of the kinetic of the process instead of a simple indication of the energy barrier position. The identified communities well represent the ensemble of metastable states sampled along the process. Along both the branches, it is possible to identify a small community for the bound state (communities E and G); a community in which the ligand is still completely inside the binding cavity and did not reach the unbound state (C and A); a community in which the ligand is located at the mouth of the cavity, but it is already partially immersed in the solvent (B and D); and a community for the completely unbound state (F). Moreover, the transitions between communities may be associated with conformational changes with high energy barriers. Focusing on transitions between communities B and C, and between communities A and D, it seems that they are associated with the conformational changes necessary to observe ligand binding. Indeed, nodes at the boundary of these two pairs of communities display higher average RMSD values for residues at the mouth of the cavity involved in the recognition process (Figure S11). Finally, we performed a committor analysis: we computed the probability of ending in the crystallographic-like bound conformation (neuron 91, in community E) before reaching the unbound conformation (neuron 100, community F) starting from each neuron (Figure c). Given that the transition state is expected to have an equal chance of going to either states, configurations with a committor of approximately 0.50 can be considered at the transition state. In the present case, the energetic barrier seems to be located around conformations close to neurons 63, 72, 73, and 82 (in community C, Figure c). These conformations are located at the boundaries between communities E and C and are near to the bound state, in agreement with the conclusions drawn from the SMD simulations (Ligand Unbinding through Multiple Replicas with Constant Velocity Pulling).

Discussion

Data from MD simulations can contain extremely useful information on molecular processes, but it does not lead to simple canonical analysis protocols: system-specific and problem-specific strategies are often required to extract information from increasingly large trajectory files. Planning and designing appropriate strategies can be a very difficult task, and it often requires the development of ad hoc scripts for advanced analysis and the use of dedicated analysis tools. Several general-purpose tools for the analysis of MD trajectories are available, including GROMACS analysis tools,[67] CPPTRAJ,[68] VMD,[69] MDAnalysis,[70] Bio3D,[71] and MDTraj.[72] All these tools provide basic post-processing analysis such as RMSD, RMSF, radius of gyration, hbond, and contact maps. Some of them are built-in tools distributed along with the main simulation engine (GROMACS analysis tools and CPPTRAJ), while others are python or R libraries that provide a flexible framework for complex analysis (MDAnalysis, Bio3D, and MDTraj) but require the user to develop an ad hoc code. Among the most advanced post-processing methods, Markov state models (MSMs) are often used to develop a complete kinetic model of the process under investigation.[73,74] These types of analyses are often complex and require a high level of expertise by the user to obtain reliable results. For this reason, they are difficult to implement as an automated user-friendly protocol. Moreover, effective use of MSMs requires that simulated data meet strict sampling conditions, such as a lag time sufficiently long to produce a Markovian-state decomposition.[75] This implies that this method can only be used when the aggregate simulation time is in the order of hundreds of microseconds or more. Moreover, development of MSMs using enhanced-sampling MD requires reweighting procedures that nowadays are still at an early stage of development.[76] Here, we presented a tool based on SOMs specifically designed for the analysis of ligand binding pathways sampled in simulations by means of an automated protocol. Our development takes inspiration from other tools based on SOMs already developed by our group and others,[39] some of which with fast implementation on GPU.[38] These tools have been successfully applied to MD data, but they were mainly focused on clustering of macromolecular conformations and not on pathway analysis. The PathDetect-SOM tool does not have any sampling condition and can be applied to MD simulations that sample multiple ligand binding events. While it cannot be directly used to compute stationary quantities and long-time kinetics (unless one demonstrates that the criteria for MSMs are met), it provides an immediate interpretation of the pathways sampled during the simulation and can give hints about the thermodynamics and kinetics of the process. Recently, an analysis of camphor unbinding pathways from cytochrome P450cam has been performed using the t-distributed stochastic neighbor embedding (t-SNE) dimensionality reduction method.[77] Authors obtained a two-dimensional representation of the ligand trajectories that facilitate interpretation and helped in grouping similar pathways. Here, we take advantage of SOM properties to obtain a similar dimensionality reduction, with the intrinsic advantage of the resulting segregation of conformations in local microstates (neurons) that immediately allows the building of an approximate transition matrix. Moreover, SOM was also demonstrated as a powerful tool for the comparison of simulations performed with different simulation parameters.[48] In this work, we tested the PathDetect-SOM tool on a range of ligand binding/unbinding simulations with different features. In all cases, the pathways were successfully characterized and mapped over an intuitive 2D map, thus confirming the general applicability of the protocol. Moreover, depending on the simulation type, several hints regarding the energetics of the process were obtained. In the first study case, we exploited the possibility of re-mapping a property, the SMD pulling forces, on the SOM neurons in order to identify the location of the highest unbinding energy barrier along the simulation (corresponding to the frames with the largest values of the pulling forces). In the second study case, the transition graph and the betweenness centrality score of the nodes suggested the obligate transition across a neuron for the unbinding across pathway 1. Finally, in the third study case, we computed some interesting properties starting from the approximate transition matrix. Here, a reweighting procedure should have been performed to account for the effect of the bias applied during the simulation. Depending on the type of simulation, different reweighting schemes can be applied. If the bias is time-independent, the most simple and effective reweighting procedures are TRAM[78] and DHAM.[79] If the bias is time-dependent, on the other hand, the situation becomes more complicated, and the reweighting procedures that can be applied are limited. For example, a reweighting approach for MetaD simulations based on the Girsanov theorem was recently proposed by the group of B Keller.[80,81] However, random number and the force at every time step are required to calculate the relative path probability, and for this reason, the reweighting factors are computed on the fly (using a patched MD engine) during the simulation and cannot be derived in a post-processing step. With the aim of showing the potential of the tool, we present the results obtained on a MetaD simulation, without performing any reweighting procedure, but accurate kinetic properties can be derived if one analyzes unbiased simulations, or accurately reweighted trajectories. In the presented study case, the committor analysis suggested the location of the energy barrier on the SOM, while determination of the communities in the transition graph led to the identification of kinetic macrostates. As the above properties were computed from the approximate transition matrix, their accuracy strictly depends on the extension of the sampling. PathDetect-SOM has been implemented in the form of an R batch script with an easy command-line interface. While the tool was primarily designed for ligand binding studies, it can be applied to many other types of simulations (unfolding, protein–protein, or protein–peptide binding) by appropriate choice arguments on the command-line input. The batch script format offers easiness of use with flexibility of customization through simple command-line options. As future development, the tool can be extended and included in an R package to offer expert users the possibility to develop ad hoc extensions to the analyses. The tool is open source and freely available with a brief guide and tutorials at https://github.com/MottaStefano/PathDetect-SOM.

63 in total

1. MDTraj: A Modern Open Library for the Analysis of Molecular Dynamics Trajectories.

Authors: Robert T McGibbon; Kyle A Beauchamp; Matthew P Harrigan; Christoph Klein; Jason M Swails; Carlos X Hernández; Christian R Schwantes; Lee-Ping Wang; Thomas J Lane; Vijay S Pande
Journal: Biophys J Date: 2015-10-20 Impact factor: 4.033

2. Flexible docking in solution using metadynamics.

Authors: Francesco Luigi Gervasio; Alessandro Laio; Michele Parrinello
Journal: J Am Chem Soc Date: 2005-03-02 Impact factor: 15.419

3. GROMACS: fast, flexible, and free.

Authors: David Van Der Spoel; Erik Lindahl; Berk Hess; Gerrit Groenhof; Alan E Mark; Herman J C Berendsen
Journal: J Comput Chem Date: 2005-12 Impact factor: 3.376

4. Machine Learning Based Dimensionality Reduction Facilitates Ligand Diffusion Paths Assessment: A Case of Cytochrome P450cam.

Authors: J Rydzewski; W Nowak
Journal: J Chem Theory Comput Date: 2016-03-28 Impact factor: 6.006

5. Multiensemble Markov models of molecular thermodynamics and kinetics.

Authors: Hao Wu; Fabian Paul; Christoph Wehmeyer; Frank Noé
Journal: Proc Natl Acad Sci U S A Date: 2016-05-25 Impact factor: 11.205

6. A new method for predicting binding affinity in computer-aided drug design.

Authors: J Aqvist; C Medina; J E Samuelsson
Journal: Protein Eng Date: 1994-03

7. MDAnalysis: a toolkit for the analysis of molecular dynamics simulations.

Authors: Naveen Michaud-Agrawal; Elizabeth J Denning; Thomas B Woolf; Oliver Beckstein
Journal: J Comput Chem Date: 2011-04-15 Impact factor: 3.376

8. Artificial ligand binding within the HIF2alpha PAS-B domain of the HIF2 transcription factor.

Authors: Thomas H Scheuermann; Diana R Tomchick; Mischa Machius; Yan Guo; Richard K Bruick; Kevin H Gardner
Journal: Proc Natl Acad Sci U S A Date: 2009-01-07 Impact factor: 11.205

Review 9. Molecular Docking: Shifting Paradigms in Drug Discovery.

Authors: Luca Pinzi; Giulio Rastelli
Journal: Int J Mol Sci Date: 2019-09-04 Impact factor: 5.923

10. Insights Into the Binding Mechanism of GC7 to Deoxyhypusine Synthase in Sulfolobus solfataricus: A Thermophilic Model for the Design of New Hypusination Inhibitors.

Authors: Mattia D'Agostino; Stefano Motta; Alice Romagnoli; Patrick Orlando; Luca Tiano; Anna La Teana; Daniele Di Marino
Journal: Front Chem Date: 2020-12-17 Impact factor: 5.221

1 in total

1. Recognizing the Binding Pattern and Dissociation Pathways of the p300 Taz2-p53 TAD2 Complex.

Authors: Tongtong Li; Stefano Motta; Amy O Stevens; Shenghan Song; Emily Hendrix; Alessandro Pandini; Yi He
Journal: JACS Au Date: 2022-08-03

1 in total