Dynamic coupling of residues within proteins as a mechanistic foundation of many enigmatic pathogenic missense variants.

Nicholas J Ose¹, Brandon M Butler¹, Avishek Kumar¹, I Can Kazan¹, Maxwell Sanderford^2,3, Sudhir Kumar^2,3,4, S Banu Ozkan¹.

Abstract

Many pathogenic missense mutations are found in protein positions that are neither well-conserved nor fall in any known functional domains. Consequently, we lack any mechanistic underpinning of dysfunction caused by such mutations. We explored the disruption of allosteric dynamic coupling between these positions and the known functional sites as a possible mechanism for pathogenesis. In this study, we present an analysis of 591 pathogenic missense variants in 144 human enzymes that suggests that allosteric dynamic coupling of mutated positions with known active sites is a plausible biophysical mechanism and evidence of their functional importance. We illustrate this mechanism in a case study of β-Glucocerebrosidase (GCase) in which a vast majority of 94 sites harboring Gaucher disease-associated missense variants are located some distance away from the active site. An analysis of the conformational dynamics of GCase suggests that mutations on these distal sites cause changes in the flexibility of active site residues despite their distance, indicating a dynamic communication network throughout the protein. The disruption of the long-distance dynamic coupling caused by missense mutations may provide a plausible general mechanistic explanation for biological dysfunction and disease.

Substances：
Proteins

Year: 2022 PMID： 35389981 PMCID： PMC9017885 DOI： 10.1371/journal.pcbi.1010006

Source DB: PubMed Journal: PLoS Comput Biol ISSN： 1553-734X Impact factor: 4.779

Introduction

Our understanding of the factors responsible for the pathogenicity of disease-associated variants (DAVs) in proteins continues to evolve. From a biophysical perspective, it has been shown that DAVs can alter the stability of a protein [1-3]. However, a high-throughput functional assay revealed that only one-third of over 2,000 mutations led to a decrease in protein stability [4]. Rather than affecting stability, a large fraction of DAVs seems to impair specific protein-ligand function or enzymatic activity [5-8]. Furthermore, studies combining evolutionary approaches with the biochemistry of protein design have revealed that DAVs at non-conserved sites can involve complex and frequently poorly understood mechanisms [5,9-11]. Through sequencing efforts, a large catalog of missense variants in thousands of human proteins has been assembled, including those implicated in diseases (Fig 1A) [5,11-13]. However, many DAVs occur at positions that are neither evolutionarily well-conserved nor part of any known functional domain (Fig 1C). Regardless of biochemical similarity, amino acid substitutions at non-conserved sites lead to a wide range of outcomes, increasing or decreasing functional activity by up to three orders of magnitude (i.e., the rheostatic pattern of change) [14]. These enigmatic mutations are frequently misdiagnosed because neither evolutionary nor static structural features are informative. In fact, many rare missense variants occur at fast-evolving positions that do not have functional annotations (Fig 1B), which adversely impacts the prediction accuracy of commonly used methods because these variants run counter to expectations. As shown in Fig 1D, EvoD exceeds the prediction accuracy of other contemporary sequence-based metrics by accounting for additional evolutionary properties [15].

Fig 1

Frequency, evolutionary conservation, and rates of misdiagnosis of missense variants.

(a) Histogram of minor allele frequencies (MAF) of missense variants in the 1000 Genomes data set. (b) Counts of these missense variants according to evolutionary conservation and the UniProt functional annotation of their domain of residence. Evolutionary rate classes are from Kumar et al. [15], with class 0 sites containing no substitutions, class 1 sites exhibiting 0–1 substitutions per billion years, and class 2 sites exhibiting greater than 1 substitution per billion years. (c) Histogram of evolutionary conservation of sites containing only known pathogenic missense variants found in the Human Gene Mutation Database (HGMD) [16], with and without functional annotation in the UniProt database. (d) Performance of four different missense diagnosis tools, quantified by the area under the receiver operating characteristic curve (AUC), which measures their ability to discriminate between putatively neutral variants (1000 Genomes missense variants with MAF > 1%) and disease-associated variants (DAVs) found in fast-evolving positions (evolutionary rate class of 2). DAVs with MAF > 0.01% were excluded from these analyses.

Here we explore the mechanistic role of dynamic allosteric coupling of sites carrying DAVs with the catalytic sites important for enzymatic activity. Our exploration is based on the premise that many mutations alter the conformational dynamics of proteins, shifting the distribution of the ensemble and protein function, including the emergence of new functions [10,17-21], adaptation to different environments [22], and dysfunction [12,23]. We use the dynamic coupling index (DCI) to identify sites strongly coupled to active sites critical for function [24,25]. We refer to them as dynamic allosteric residue coupling (DARC) sites. A mutation at a DARC site is likely to influence conformational dynamics and allosteric regulation, making individuals carrying mutations at these sites highly susceptible to disease phenotypes.
First, to elucidate this allosteric mechanism, we used molecular dynamics (MD) simulations to examine GCase, a signature human enzyme consisting of 497 amino acids, at least 94 of which carry observed DAVs implicated in Gaucher disease (GD) [26], which is characterized by a dangerous buildup of lipids in certain organs. Genetic changes in GCase can lead to other health conditions as well, including Parkinson’s disease [27-30] and dementia with Lewy bodies [30,31]. We investigated the mechanistic impact of these mutations on conformational dynamics and allosteric regulation [32]. In the following, we report that GD mutations disrupt allosteric regulation through changes in dynamic flexibility around the catalytic sites, altering enzymatic activity essential for homeostasis. The positions harboring DAVs can be thought of as key DARC sites. All-atom MD simulations are sensitive enough to reveal how mutations at specific distal sites (usually diagnosed as benign by conventional in silico tools) modulate the overall dynamics of functionally critical sites and thereby allosterically impact function, but they are often time-consuming and computationally expensive. To explore the role of conformational dynamics and allostery in missense variants of many proteins with different 3-D structures at a broader scale, we utilized a more efficient coarse-grained approach, the elastic network model (ENM), to conduct a proteome-wide analysis of allosteric coupling for a set of enzymes. These analyses suggest that pathogenic variants are most abundant at DARC sites. We also present an analysis of DCI asymmetry, which measures the degree of asymmetry in the dynamic coupling between two sites, revealing that mutations are likely to result in a loss of function, and hence pathogenesis, if they occur at distal sites controlled by the active site.

Results

Disease-associated mutations modify dynamics throughout the protein

GCase is a member of the family of glycoside hydrolases that use glutamates to hydrolyze glucocerebroside into glucose and ceramide. Many amino acid variants of this enzyme are reported to cause Parkinson’s disease [27-30], dementia with Lewy bodies [30,31], and GD [33]. Using the crystal structure of GCase (PDB ID: 1ogs) [34] and 94 sites with DAVs [35], we calculated the Euclidean distance between each mutation site and the active site (e.g., residues 235 and 340). A vast majority (87.5%) of GD pathogenic variants occur further than 10 Å from the nearest active site residue, making direct interactions implausible. This suggests the existence of a network of indirect interactions through which a mutation at a distal site can induce dynamic changes at other regions of the protein and, by extension, impact protein function. The behavior of residues within this network can be examined using a structural dynamics-oriented approach. We illustrate this approach with the example of a single mutant, N370S, which is present at a high frequency (~70%) in the Ashkenazi Jewish population and has been studied extensively [27,32]. We first calculated the structural flexibility profiles of residues using a position-specific dynamic flexibility index, or DFI (see the Methods section). A comparison between the DFI profile of the wild-type GCase protein and that of the protein containing N370S is shown in Fig 2A. The DFI profile provides an estimate of the role of each residue in mediating structure-encoded dynamics. For GCase, the DFI profile indicates significant shifts in dynamics caused by N370S. Regions of the protein that should be rigid are now flexible and vice versa (Fig 2B). Hinges in the protein have moved or disappeared, and new hinges have appeared elsewhere. As reported in previous studies, these hinge shifts suggest a major change in dynamics and thus protein function [19,24,25,36].
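The distance criterion described above can be illustrated with a minimal sketch. It assumes one representative coordinate per residue (for example, the C-alpha atom); the coordinates below are invented for illustration and are not taken from PDB entry 1ogs, and the function names are hypothetical.

```python
import math

def nearest_active_site_distance(site_xyz, active_site_xyz):
    """Smallest Euclidean distance (in Å) from one site to any active-site residue."""
    return min(math.dist(site_xyz, a) for a in active_site_xyz)

def fraction_distal(site_coords, active_site_xyz, cutoff=10.0):
    """Fraction of sites lying farther than `cutoff` Å from the nearest active-site residue."""
    distances = [nearest_active_site_distance(xyz, active_site_xyz)
                 for xyz in site_coords]
    return sum(d > cutoff for d in distances) / len(distances)

# Made-up coordinates (Å): two active-site residues and four variant sites.
active_site = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
variant_sites = [(0.0, 0.0, 6.0), (20.0, 0.0, 0.0),
                 (3.0, 4.0, 12.0), (-15.0, 0.0, 0.0)]

print(fraction_distal(variant_sites, active_site))  # 0.75
```

Using the nearest active-site residue, rather than the active-site centroid, follows the wording "from the nearest active site residue" in the text.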

Fig 2

A comparison of DFI profiles of wild-type GCase and N370S mutant protein.

(a) The %DFI profile of the mutant protein (N370S, yellow) is contrasted with that of the wild-type (black). Dissimilarities in the two profiles demonstrate how a single point mutation (N370S) can induce changes in the flexibility profile of a different region of the protein. (b) Ribbon diagrams showing DFI as a color-coded red-white-blue spectrum; red and blue indicate the highest and lowest flexibility, respectively. The regions with the most significant changes in dynamic flexibility are highlighted.

Among the five loops surrounding the active site, we observe that loop 1 (residues 312–317) exhibits an increase in DFI scores (Fig 2A), suggesting that increased flexibility of this loop could contribute to the decrease in enzymatic activity by hindering the accessibility of the ligand to the active site, as reported previously [37]. This variant displays a small change in flexibility near loop 1. Changes in DFI within loop 1 for other studied mutations are shown in the supplementary figure (S1 Fig). Additionally, the protein with N370S shows a very large shift in flexibility between residues 387 and 396, which overlaps with loop 3 (residues 394–399); within the overlap is the R395 residue, which orients differently in the active and inactive states of the enzyme [23] (Fig 2A).

Mutations at distal sites dynamically-coupled to the active site alter long-range communication

Although the only sequence difference between the wild-type and mutant GCase is a single residue, DFI changes across the protein. This behavior suggests that changes in long-range dynamic coupling may be responsible for the altered flexibility profiles. The dynamic coupling index (DCI) captures the strength of the displacement response for site i upon perturbation of site j, relative to the average fluctuation response of site i to all other sites in the protein. In this way, DCI can reveal the degree of dynamic coupling between i and j. Here, we present DCI as a percentile rank of the DCI range observed with values ranging from 0 to 1 (%DCI). Importantly, DFI and DCI are distinct in that DFI measures the flexibility of a position. In contrast, DCI measures the pairwise coupling of one position with another. Furthermore, DCI estimates are conditional on the functional position selected for analysis. Every amino acid position in any given protein has a unique network of direct, local interactions that give rise to a unique network of highly coupled pair positions. Across the protein structure, this gives rise to an inhomogeneous 3D interaction network. Using DCI to explore this network can be insightful when considering active sites, because it is known that even far away positions may disrupt function through the mechanism of allostery [36]. Residues that are distant enough from the active site to likely have no direct interaction (>10 Å) yet are highly coupled to them (%DCI > 60 implying a greater than average response fluctuation when active site residues are perturbed) can play an important role in protein function. In the example of GCase, around half (52.6%) of the studied pathogenic variants, including N370S, occur at DARC sites (Fig 3A). In fact, according to our list of disease mutation sites [38], approximately 28% of DARC sites are associated with GD, compared to ~15% of non-DARC sites throughout the entire protein. 
Also, the %DCI values of DAV sites are significantly different (P < .001) from those of non-disease sites, as seen in Fig 3D. This suggests that variants at DARC sites are more likely to lead to genetic disease. Moreover, a comparison of DCI values of DAV sites with all other protein sites supports the same observation: mutations at DARC sites, distal sites that exhibit high coupling (i.e., high DCI), are predisposed to impact function [24,25]. Such sites may be observed in a variety of different regions and structures across a protein, as seen in Fig 3B.
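The DARC criterion used above (distal to the active site yet highly coupled to it) can be sketched as follows. This is a minimal illustration, assuming %DCI is expressed on a 0–100 percentile scale so that the quoted cutoff of 60 applies directly; the DCI values and distances are invented, and the function names are hypothetical.

```python
def percentile_rank(values):
    """Percentile rank (0-100): share of positions whose value is <= this one."""
    n = len(values)
    return [100.0 * sum(v <= x for v in values) / n for x in values]

def darc_sites(dci, distance, dci_cutoff=60.0, distance_cutoff=10.0):
    """Indices of sites with %DCI above the cutoff AND farther than the distance cutoff (Å)."""
    pct = percentile_rank(dci)
    return [i for i, (p, d) in enumerate(zip(pct, distance))
            if p > dci_cutoff and d > distance_cutoff]

# Made-up raw DCI values and distances (Å) to the nearest active-site residue.
dci = [0.8, 1.5, 1.1, 0.6, 1.3]
distance = [4.0, 12.0, 8.0, 25.0, 18.0]

print(percentile_rank(dci))      # [40.0, 100.0, 60.0, 20.0, 80.0]
print(darc_sites(dci, distance)) # [1, 4]
```

Site 3 is distal but weakly coupled, and site 0 is close to the active site, so neither qualifies; only sites 1 and 4 satisfy both conditions.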

Fig 3

DCI of GCase sites.

(a) Scatter plot of all GCase residues with dividers at %DCI = 60 and distance = 10 Å. The upper right quadrant contains DARC sites, which can affect the active site without direct interaction. Red dots indicate severe DAVs, which have a significantly higher DCI (P < .001) than other sites. (b) A ribbon diagram showing known mutation sites of GCase (represented as pink-colored dots) and the degree of coupling to the active site delineated by the color gradient, where darker and lighter shades correspond to strongly coupled and weakly coupled, respectively. (c) Average DCI profile of 20 different DAVs compared to the wild-type. In general, we observe a global loss of coupling to the active site. (d) Violin plots showing that DAVs are generally located at sites that have higher DCI with the active site.

Using MD simulations, we obtained the dynamic features of 20 DAVs, 2 neutral variants, and the wild-type protein. These variants were chosen because experimental data on their function were also available from the study of Liou et al. [35]. When comparing DCI profiles of the active site for the wild-type and proteins with DAVs, fluctuations in DCI occur at certain sites, with DCI mostly decreasing in GCase proteins with DAVs (Fig 3C). These changes in DCI imply that the long-distance communication pathways cannot follow typical channels to the active site. This communication breakdown is presumed to be a consequence of altered dynamics. Losing rigidity in a functionally critical hinge region impairs the dynamic allosteric residue coupling, leading to a dysfunctional protein [36]. Our data also suggest a link between DCI and the severity of disease mutations. The median %DCI for DAV sites for Gaucher disease marked as “severe” was 69.6%. In comparison, mutations marked as “mild” had a median of 56.6% (P < .045). This further supports the idea that positions exhibiting higher dynamic coupling to the active site have a greater impact on protein function.

Principal component analysis of DFI aligns with experimentally determined catalytic activity

As explained above, DFI profiles provide information about the dynamic function of residues throughout the protein. At the same time, DARC sites are coupled with the active site despite having no direct contact. We clustered the DFI values of DARC sites for each simulated GCase variant (Fig 4) using principal component analysis (see Methods). We found that the wild type and neutral variants (functional enzymes based on in-vitro assays) are grouped, and many of the tested proteins creating “dead enzymes” (i.e., total loss of function) are grouped as well. Liou et al. [35] used the specific activity of cross-reacting immunological material (CRIM_SA) values to estimate the catalytic rate constants (kcat), thereby giving experiment-based estimates on the functionality of these variants. The fact that variants with higher CRIM_SA values are clustered together, as are variants with low CRIM_SA values, suggests a direct correlation between DFI profiles and CRIM_SA and, therefore, a direct correlation between DFI and protein function.

Fig 4

Dendrogram showing clusters of GCase variants based on the DFI of DARC sites.

The variants for which experimental data are available show dead enzymes and fully functional (i.e., neutral) enzymes clustered within their own groups. Variants with CRIM_SA values of 0.3 to 1.0 are shown in blue, while variants with CRIM_SA values of 0.06 to 0.1 are shown in red. For other variants shown here, CRIM_SA values are between 0.1 and 0.3. These variants have reduced function compared to the wild type, but are still somewhat functional. Higher CRIM_SA values suggest superior enzyme function.

A proteome-wide analysis reveals disease-associated mutations are abundant at DARC sites

After investigating GCase, we used ENMs to expand our study to 144 human enzymes containing a total of 1024 amino acid variants (433 neutral and 591 DAVs). The ENM is a coarse-grained approach that allowed us to study the dynamics of different folds efficiently. This dataset, which was also used in our previous work [39], incorporates the HumVar data set [40] and sequences with both high query coverage (>80%) and high sequence identity (>80%), selecting only the proteins available in the Protein Data Bank [41]. Additionally, these protein structures had already been modeled, including any missing residues, using the Modeller software package [42]. As illustrated in Fig 5A, the DCI distribution of DAV sites shows a trend opposite to that of sites with neutral variants, exhibiting a significantly different distribution with P < .001. Generally, DAVs are more likely to occur at sites highly coupled to the active site. In contrast, neutral mutations are more likely to occur at sites that are less coupled. Of the variants in this ensemble, 82% occur farther than 10 Å from the active site, suggesting that allosteric communication through a 3-D network of interactions modulates the dynamics of the active site, thus impacting function.
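The observed-to-expected comparison summarized in Fig 5A can be sketched under one possible reading of the null model: within each %DCI bin, the expected count for a variant class is its share of the total, applied to all variant sites falling in that bin. This reading, the bin edges, the labels, and the function name are assumptions made for illustration; the data are invented.

```python
def observed_to_expected(pct_dci, labels, target, edges=(0, 20, 40, 60, 80, 100)):
    """Observed/expected count ratio per %DCI bin for the variant class `target`."""
    n_bins = len(edges) - 1

    def bin_index(value):
        # Bins are closed on the right; anything above the last edge goes in the last bin.
        for k in range(n_bins):
            if value <= edges[k + 1]:
                return k
        return n_bins - 1

    all_counts = [0] * n_bins
    class_counts = [0] * n_bins
    for value, label in zip(pct_dci, labels):
        k = bin_index(value)
        all_counts[k] += 1
        if label == target:
            class_counts[k] += 1

    n_all, n_class = len(pct_dci), sum(class_counts)
    ratios = []
    for total, observed in zip(all_counts, class_counts):
        expected = n_class * total / n_all  # null: class spread like all sites
        ratios.append(observed / expected if expected else float("nan"))
    return ratios

# Made-up %DCI values (0-100 scale) and variant labels.
pct_dci = [10, 30, 50, 70, 90, 90, 70, 10]
labels = ["neutral", "neutral", "neutral", "DAV", "DAV", "DAV", "neutral", "DAV"]

print(observed_to_expected(pct_dci, labels, "DAV"))  # [1.0, 0.0, 0.0, 1.0, 2.0]
```

In this toy set, DAVs are twice as frequent as expected in the highest %DCI bin, which is the kind of enrichment a ratio above 1 indicates.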

Fig 5

%DCI and asymmetry for 144 protein ensemble.

(a) %DCI values were determined for the 1024 variant sites across the 144 proteins. These distributions were compared with the expected null distribution in which %DCI values are equally distributed over all investigated sites. Observed-to-expected ratios reveal that more DAVs than expected have high %DCI, whereas fewer neutral variants than expected are observed in high %DCI categories. A ratio above 1 means that DAVs or neutral variants occur more often than the null expectation; a ratio below 1 means that they occur less often than expected. (b) Comparison of %DCIasym of sites associated with neutral variants and DAVs. The distributions show a contrast, as DAV sites tend to exhibit more positive values (P < .001), suggesting that the active site dominates the coupling. Neutral sites, on the other hand, tend to give more negative asymmetry values, suggesting that the mutation site dominates. A moving average was used to visually smooth the distribution.

DCI specifically quantifies the coupling between individual positions and, as such, DCI values depend explicitly upon the positions selected for analysis. However, these pairwise interactions are not always symmetric. An interaction network may be formed such that residue perturbations are felt more strongly in one direction than the other. If we find the difference in the DCI values between two residue positions that are not directly interacting (i.e., not in spatial contact), we get a better understanding of the dynamic allosteric relationship between the two residues. This difference, called DCI asymmetry, provides directionality to long-distance coupling, thereby suggesting a causal relationship. In any given protein, every amino acid position has a unique network of direct, local interactions that give rise to a unique network of highly coupled partner positions [9,25,43] and heterogeneity in a 3-D network of interactions.
Thus, for a particular pair of coupled amino acids (i and j), their unique network constraints differentiate the coupling of i to j from the coupling of j to i. We therefore used the wild-type structures of our enzymes to calculate i) %DCI_ij, how strongly the position of each mutation (i) is coupled to each active site position (j), and ii) %DCI_ji, how strongly each active site position is coupled to the position of each mutation. From these, we calculated iii) “%DCIasym” from (%DCI_ij – %DCI_ji) to assess the asymmetry in coupling. Among our protein ensemble, we see a slight pattern emerge, where the interaction between disease mutation sites and active sites is generally more dominated by the active site. In contrast, the interaction between neutral mutation sites and active sites is usually dominated by the mutation site (Fig 5B). This is in agreement with our earlier findings on LacI variants [9], in which substitutions at sites where functional sites dominate the communication most often result in a loss of function.
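The asymmetry calculation can be sketched as follows, assuming the Methods convention that entry [i][j] of a DCI table holds the relative response of residue i to a perturbation at residue j, and that asymmetry is the first entry minus its transpose partner. Averaging over several active-site positions is an additional assumption; the table values and function names are invented for illustration.

```python
def dci_asymmetry(dci, i, j):
    """DCI_ij - DCI_ji, where dci[i][j] is the response of residue i to a perturbation at j."""
    return dci[i][j] - dci[j][i]

def mean_asymmetry_to_active_site(dci, site, active_sites):
    """Average asymmetry between one site and every active-site position."""
    return sum(dci_asymmetry(dci, site, j) for j in active_sites) / len(active_sites)

# Made-up 3-residue DCI table; residue 1 plays the role of the active site.
dci = [
    [1.00, 0.75, 0.50],
    [0.25, 1.00, 0.50],
    [0.50, 0.25, 1.00],
]
active_sites = [1]

print(mean_asymmetry_to_active_site(dci, 0, active_sites))  # 0.5
print(mean_asymmetry_to_active_site(dci, 2, active_sites))  # -0.25
```

Under this convention, a positive value (residue 0) means the site responds more strongly to the active site than the active site responds to it, which corresponds to the active site dominating the coupling; a negative value (residue 2) corresponds to the mutation site dominating.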

A neural network trained on dynamic characteristics offers superior performance at highly evolved sites

Many different methods exist to predict the effect of missense variants on protein function. Some contemporary methods focus on evolutionary considerations alongside structural information to improve the accuracy of predictions [44-47]. As one example, PolyPhen-2 uses solvent accessibility, secondary structure propensities, and crystallographic B-factors to classify mutational sites [44]. Many other approaches consider change in polarity, volume, and charge due to mutant amino acid. A number of phenotypic prediction studies use solvent accessibility, which has proven to be a useful attribute in disease prediction [46]. Other methods utilize residue–residue interaction networks of protein structures to identify functionally important residues through network topology parameters [47,48]. Evolution-based methods generally offer better performance than methods that only use structural features, yet evolution-based methods have true positive rates less than 50% for known DAVs at less-conserved positions [5,15]. In addition, their rate of correct diagnosis of true negative (benign) mutations at highly conserved positions is less than 50% [11]. Like DCI, the DFI of mutation sites can indicate an effect on protein function [36,49]. Previously, our group has shown that DFI can predict pathogenicity of protein interface sites more accurately than the accessible surface area, a commonly used metric [6]. In this study, we extended this comparison to a variety of different metrics, a larger number of missense variants, and by adding DCI in the predictive model. Using DFI, DCI, and asymmetry from our protein ensemble, we trained a neural network to predict whether certain missense variants would be neutral or not (see Methods). 
When used to predict the pathogenicity of random subsets of our data (90% training, 10% testing; 10-fold cross-validation), this neural network reaches the upper end of performance for established predictive software in accuracy, precision, recall, and area under the receiver operating characteristic (ROC) curve (AUC) (Fig 6). Of particular interest is the performance of our neural network at highly evolving sites (see Methods). Evolution-based metrics tend to overestimate the rate of neutral mutations at highly evolving sites [11,15], leading to significantly lower recall scores. Our dynamics-based approach, however, outperforms all the other methods at these sites. This is because our method accounts for enigmatic sites, that is, allosteric DARC sites that appear neutral from an evolutionary perspective. We do not expect as many neutral mutations at those sites because we do not use any sequence information related to conservation.
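The evaluation protocol described above (a 90%/10% split repeated over ten folds, scored by accuracy, precision, and recall) can be sketched without reference to any particular classifier. The sketch assumes that DAVs are the positive class (label 1) and neutral variants the negative class (label 0); the fold construction, labels, and function names are illustrative and do not reproduce the neural network itself.

```python
import random

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall with label 1 (assumed: DAV) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[f::k] for f in range(k)]
    for f in range(k):
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, folds[f]

# Made-up labels and predictions for eight variants.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
print(classification_metrics(y_true, y_pred))
# {'accuracy': 0.75, 'precision': 0.8, 'recall': 0.8}

splits = list(k_fold_indices(1024, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Recall is the metric that drops when a method labels too many true DAVs as neutral, which is why it is the quantity highlighted for fast-evolving sites.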

Fig 6

Accuracy of prediction tools tested against 144 enzyme ensemble data.

(A) Bar plot showing accuracy, precision, recall, and area-under-the-curve (AUC) values for four different methods, including our DFI + DCI features. Without using any evolutionary data, our performance matches and may even exceed evolution-based metrics. (B) Recall values for those same metrics tested on fast-evolving sites in our data. Higher false negative rates lead to lower recall values for sequence-based metrics, but not for DFI + DCI features.

Improvement is also shown over another non-evolutionary metric, Rhapsody [50,51], which is a dynamics-based approach. Rhapsody utilizes a Gaussian Network Model (GNM), a 1-D version of ENM. Thus, the major difference between the GNM-based approach and our approach is that we simulate perturbation forces in three dimensions, whereas Rhapsody uses one-dimensional pairwise interactions.

Discussion

Allostery has been proposed as an important biophysical mechanism for protein function, which has led some to proclaim that allostery constitutes “the second secret of life,” with the genetic code constituting “the first secret of life” [52]. Laboratory-directed evolutionary studies also highlight the emergence of mutations far from the active site [25,53,54]. These distal sites play a critical role in functional evolution, particularly in the emergence of novel functions. Yet, these distal mutation sites challenge enzyme design, as it is difficult to predict them in advance [25,55-57]. Likewise, resurrected ancestral protein studies reveal that mutations distal from the active site are necessary for functional evolution. An example is the emergence of red color from a green ancestor in a close relative of Green Fluorescent Protein (GFP). This protein needs a minimum of 12 mutations and one deletion to convert from green to red color with high efficiency. A majority of the mutations are far from the chromophore. While the flexibility of the mutational sites does not change, in allosteric response to these mutations, both rigidification and increased flexibility occur in regions of the fold widely separated in the 3-D structure of the protein, accommodating the flexibility required for red photoconversion. These synergistic effects allow catalysis to proceed as desired and function without mutations of catalytic residue positions while maintaining fold stability and quaternary structure [19]. Here we observe that disease-associated (i.e., function-altering) mutations follow the same pattern. As neutral variants and DAVs provide the best opportunity to explore the molecular principles of how genetic variation shapes phenotypic change, we observed the same principle of dynamic allostery, in which functions are altered through distal mutations while the amino acid sequence of the catalytic residues is conserved.
We have found that the disruption of allosteric dynamic coupling with functionally important sites in a protein is a mechanistic explanation for many missense variants associated with diseases and other biological phenotypes. The patterns of dynamic coupling with the active sites differ between disease and neutral phenotypes for missense mutations that occur at positions spatially distant from functional (active) sites. Specific analysis of GCase also provides evidence of the same mechanism observed in resurrected ancestral protein studies. These distal mutations allosterically modify the flexibility profiles of different sites, leading to a change in function. This finding also suggests that, rather than protein stability, the disruption of ligand binding, or both acting alone, allosteric dynamic coupling together with stability explains how a large fraction of disease-associated variants impair protein-ligand function or enzymatic activity [6,7,12]. A high-throughput functional assay of over 2,000 variants also shows that only a minority of mutations led to a decrease in protein stability [4]. Thus, our findings align with the neutral theory of molecular evolution, as mutations at functionally important catalytic sites must have been eliminated by negative selection due to critical functional loss. On the other hand, the distal mutations remotely fine-tune the native state ensemble to modify function without interfering with folding or folding stability. We are in an era of rapid development of next-generation methods for whole-genome, whole-exome, and targeted sequencing, which have produced an unprecedented amount of data. Among all the genetic variation data, the most commonly observed variants are missense, and identifying the missense variants with pathogenic effects that contribute to disease or drug sensitivities is the primary goal of 21st-century genomic analysis and phylomedicine.
As stated in a review of allostery by Liu and Nussinov [52], uniting the genetic code, which constitutes “the first secret of life,” and allostery, “the second secret of life,” could reveal a generalized disease mechanism and allow for the discovery of novel drugs, as well as blueprints for innovative personalized treatment methods.

Methods

Dataset

A total of 144 individual monomeric protein structures from the Protein Data Bank (PDB) [41] were collected from a BLAST search of sequences, requiring ≥80% sequence identity and ≥80% query coverage to ensure that only structures that could be accurately mapped to human variation data were included. Human genetic variations were obtained from the HumVar and HumDiv databases [38], yielding 1024 amino acid variants, of which 433 were neutral and 591 were deleterious.

Determining catalytic sites

The catalytic sites were gathered from the Catalytic Site Atlas (CSA) database [58], which identifies the residues directly involved in catalyzing the reactions of enzymes. Since these residues are critical for protein function, they were used as input into our dynamic coupling index (DCI) metric. The entries in the CSA were either “original entries” derived from the literature itself or “homology entries” based on sequence comparison with the literature-based original entries. In either case, the catalytic sites purported by the CSA should accurately represent functional sites on the protein. Our dataset contained 144 enzymatic proteins that mapped to entries in the CSA database.

Calculating functional-dynamics profiles

Dynamic flexibility index (DFI) quantifies the dynamic stability of a given position. It measures the resilience of a position to perturbations initiated at positions in the protein distal to the residue in question, but to which it is linked via structurally encoded global dynamics. Therefore, DFI profiles provide important information about protein function. Namely, residues that exhibit very low DFI scores do not show large-amplitude fluctuations in response to random Brownian kicks but rather transfer the perturbation energies throughout the chain in a cascade fashion; examples of low-DFI residues are those in hinge regions. Hinges are parts of the protein that are generally rigid: they do not exhibit a high fluctuation response to perturbations but transfer these perturbations to the rest of the protein. Like hinges on a door, they stand still, providing an anchor point for other parts to move around.

The method for obtaining DFI is based on the perturbation response scanning (PRS) method [59], in which the C-alpha atom of each residue in the protein is modeled as a node in an elastic network model (ENM). The interaction between each pair of nodes is modeled by a harmonic potential with a distance-dependent spring constant [59,60]. A small perturbation in the form of an external random force (i.e., a Brownian kick) is sequentially applied to each node in the network, and the perturbation response of all nodes is recorded according to linear response theory as

$$[\Delta R]_{3N \times 1} = [H]^{-1}_{3N \times 3N} \, [F]_{3N \times 1}$$

where $F$ is the external random force, $H^{-1}$ is the inverse Hessian, and $\Delta R$ is the positional displacement of all $N$ nodes in three dimensions.

However, ENM is a coarse-grained model. To improve the accuracy of this model and allow sensitivity to mutations, the inverse Hessian can be replaced with the covariance matrix obtained from molecular dynamics simulations:

$$[\Delta R]_{3N \times 1} = [G]_{3N \times 3N} \, [F]_{3N \times 1}$$

Here, $G$ is the covariance matrix containing the dynamic properties of the system.
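The linear-response step can be illustrated with a small pure-Python sketch that applies a unit force to one node at a time and records the displacement magnitude of every node. It assumes a precomputed 3N × 3N covariance matrix G; averaging the magnitudes over the sampled directions is an assumption of this sketch, and the toy matrix, the seeded random directions, and all function names are invented for illustration.

```python
import math
import random

def block_matrix(scalars):
    """Expand an N x N table of scalars into a 3N x 3N matrix of scaled 3x3 identity blocks."""
    n = len(scalars)
    G = [[0.0] * (3 * n) for _ in range(3 * n)]
    for i in range(n):
        for j in range(n):
            for k in range(3):
                G[3 * i + k][3 * j + k] = scalars[i][j]
    return G

def random_unit_vectors(count, seed=0):
    """Seeded random directions on the unit sphere (normalized Gaussian triples)."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(count):
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        vectors.append([c / norm for c in v])
    return vectors

def perturbation_matrix(G, directions):
    """A[i][j]: displacement magnitude of node i for a unit force at node j, averaged over directions."""
    n = len(G) // 3
    A = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for d in directions:
            # Linear response dR = G F; F is zero except for the three components of node j.
            dR = [sum(row[3 * j + k] * d[k] for k in range(3)) for row in G]
            for i in range(n):
                x, y, z = dR[3 * i], dR[3 * i + 1], dR[3 * i + 2]
                A[i][j] += math.sqrt(x * x + y * y + z * z) / len(directions)
    return A

# Toy 2-node covariance: self terms 1.0 and 2.0, cross-coupling 0.5 (made-up numbers).
G = block_matrix([[1.0, 0.5], [0.5, 2.0]])
A = perturbation_matrix(G, random_unit_vectors(10))
print(A)  # approximately [[1.0, 0.5], [0.5, 2.0]]
```

Because the toy G is built from isotropic blocks, the response magnitude does not depend on the force direction; for a real covariance matrix the directional averaging would matter.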
The covariance matrix contains the data for long-range interactions, solvation effects, and biochemical specificities of all types of interactions. Each perturbation is performed in ten different directions to ensure an isotropic response. The perturbation is repeated for every node in the network, and the positional displacements $\Delta R$ of each node are stored in a perturbation matrix $A$ given by

$$A = \begin{bmatrix} |\Delta R^{1}|_{1} & \cdots & |\Delta R^{N}|_{1} \\ \vdots & \ddots & \vdots \\ |\Delta R^{1}|_{N} & \cdots & |\Delta R^{N}|_{N} \end{bmatrix}$$

where $|\Delta R^{j}|_{i}$ is the magnitude of the positional displacement of residue $i$ in response to a perturbation at residue $j$. The DFI score of residue $i$ is defined as the total displacement of residue $i$ induced by perturbations on all residues, which is computed by taking the sum of the $i$-th row of the perturbation matrix $A$:

$$DFI_{i} = \frac{\sum_{j=1}^{N} |\Delta R^{j}|_{i}}{\sum_{i=1}^{N} \sum_{j=1}^{N} |\Delta R^{j}|_{i}}$$

where the denominator is the total displacement of all residues, used as a normalizing factor. Therefore, the greater the DFI score at position $i$, the more flexible that site is; the lower the score, the more rigid the site is, meaning it responds less to perturbations throughout the protein. It is often useful to examine the flexibility of certain residues relative to the flexibility range of that single protein. To do this, DFI values can be ranked on a percentage scale:

$$\%DFI_{i} = \frac{n_{\le i}}{N}$$

where $n_{\le i}$ is the number of positions having DFI ≤ $DFI_{i}$ and $N$ is the total number of positions. Recently, we have extended this method to identify allosteric links or dynamic coupling between any given residue and functionally important residues by introducing a new metric, the dynamic coupling index (DCI) [36]. The DCI metric can identify dynamic allosteric residue coupling (DARC) sites, which are distal to functional sites but control them through dynamic allosteric coupling. This type of allosteric coupling is important: sites with strong dynamic allosteric coupling to functionally critical residues, regardless of separation distance, likely contribute to function. Thus, a mutation at such a site can disrupt the allosteric dynamic coupling or regulation, leading to functional degradation.
As defined, DCI is the ratio of the mean square fluctuation response of residue $i$ upon perturbation of the functional sites $j$ (i.e., catalytic residues) to the response of residue $i$ upon perturbation of all residues. DCI enables us to identify DARC site residues, which are more sensitive to perturbations exerted on residues critical for function. This index can be used to determine residues involved in allosteric regulation. It is expressed as

$$DCI_{i} = \frac{\sum_{j=1}^{N_{functional}} |\Delta R^{j}|_{i} \, / \, N_{functional}}{\sum_{j=1}^{N} |\Delta R^{j}|_{i} \, / \, N}$$

where $|\Delta R^{j}|_{i}$ is the response fluctuation profile of residue $i$ upon perturbation of residue $j$. The numerator is the average mean square fluctuation response obtained over the perturbation of the $N_{functional}$ functionally critical residues. The denominator is the average mean square fluctuation response over all residues. Just as with DFI, DCI may also be ranked on a percentage scale:

$$\%DCI_{i} = \frac{m_{\le i}}{N}$$

where $m_{\le i}$ is the number of positions having DCI ≤ $DCI_{i}$. We further investigated the change in dynamics upon mutation compared to the wild-type structure using ΔDFI and ΔDCI. The delta-DFI (ΔDFI) profile was calculated as

$$\Delta DFI_{i} = DFI_{i}^{mutant} - DFI_{i}^{WT}$$

where $DFI^{mutant}$ is the dynamics profile for the mutated protein structure and $DFI^{WT}$ is the dynamics profile for the wild-type structure. Similarly, the delta-DCI (ΔDCI) profile was calculated as

$$\Delta DCI_{i} = DCI_{i}^{mutant} - DCI_{i}^{WT}$$

One additional tool we use is DCI asymmetry, which measures preferential information transfer through asymmetric dynamic coupling. Simply put, the coupling asymmetry between positions $i$ and $j$ can be calculated as

$$DCI_{asym} = DCI_{ij} - DCI_{ji}$$

where $DCI_{ij}$ represents the relative response of residue $i$ to a perturbation at residue $j$ and $DCI_{ji}$ represents the relative response of residue $j$ to a perturbation at residue $i$. To reiterate, all dynamic analysis of the GCase protein was conducted using data from MD simulations only, while analysis of the 144-enzyme ensemble was performed using data from ENM simulations only.
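The definitions in this section can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the unit-force convention, the random seed, and the toy covariance matrix are all assumptions made for illustration. It builds the perturbation matrix from a covariance matrix by applying forces in ten random directions per residue, then derives DFI, DCI, percentile ranks, and the pairwise coupling asymmetry.

```python
import numpy as np

def perturbation_matrix(G, n_directions=10, seed=0):
    """A[i, j]: mean displacement magnitude of residue i for a unit force on residue j."""
    n_res = G.shape[0] // 3
    rng = np.random.default_rng(seed)
    A = np.zeros((n_res, n_res))
    for j in range(n_res):
        for _ in range(n_directions):
            f = rng.normal(size=3)
            f /= np.linalg.norm(f)              # random unit-force direction
            F = np.zeros(3 * n_res)
            F[3 * j:3 * j + 3] = f              # force acts on residue j only
            dR = (G @ F).reshape(n_res, 3)      # linear response: dR = G F
            A[:, j] += np.linalg.norm(dR, axis=1)
    return A / n_directions

def dfi(A):
    """Row sums of A, normalized by the total displacement of all residues."""
    return A.sum(axis=1) / A.sum()

def dci(A, functional):
    """Mean response to functional-site perturbations over mean response to all."""
    return A[:, functional].mean(axis=1) / A.mean(axis=1)

def dci_asymmetry(A, i, j):
    """DCI_ij - DCI_ji for a single pair of residues."""
    return A[i, j] / A[i].mean() - A[j, i] / A[j].mean()

def percentile_rank(values):
    """Fraction of positions whose value is <= the value at each position."""
    values = np.asarray(values)
    return np.array([(values <= v).sum() for v in values]) / values.size

# Toy stand-in for an MD covariance matrix (4 residues, 12 x 12), not real data.
M = np.random.default_rng(1).normal(size=(12, 12))
G_toy = M @ M.T
A_toy = perturbation_matrix(G_toy)
dfi_toy = dfi(A_toy)
pct_dfi_toy = percentile_rank(dfi_toy)
dci_toy = dci(A_toy, functional=[0, 1])
```

Under this sketch, ΔDFI for a variant would be `dfi(A_mutant) - dfi(A_wild_type)`, with each perturbation matrix computed from that structure's own covariance matrix.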

Molecular dynamics simulations

To compute the DFI and DCI profiles of each missense variant of GCase, we first performed MD simulations to obtain the native ensemble of each variant and then applied our analysis. The starting structure for GCase was taken from the Protein Data Bank (accession number 1ogs [34]). We used the mutagenesis tool in PyMOL [61] to create variant structures. Next, we loaded the structures into TLEAP using the ff14SB force field [62]. Protein hydrogens were then added, along with a 14.0 Å cubic box of surrounding TIP3P water molecules, followed by Na+ and Cl- ions for neutralization [63]. All systems were then energy-minimized using the SANDER module of AMBER 14 [64,65]. First, the protein was kept fixed with harmonic restraints to allow the surrounding water molecules and ions to relax, followed by a second minimization step in which the restraints were removed and the protein-solution system was further minimized. Both minimization steps employed the method of steepest descent followed by conjugate gradient. We then ran heating, density equilibration, and production using the GPU-accelerated PMEMD module of AMBER 14 [65]. Periodic boundary conditions were used in all simulations, and the lengths of all covalent bonds involving hydrogen were constrained using SHAKE [64]. Direct-sum, non-bonded interactions were cut off at distances of 9.0 Å or greater, and long-range electrostatic interactions were calculated using the particle mesh Ewald method [66,67]. During the heating cycle, we heated systems from 0 K to 300 K over a duration of 250 ps. The density of the system was then allowed to equilibrate over 5 ns at constant temperature and pressure. A Langevin thermostat was used to control the temperature at 300 K and a Berendsen barostat to adjust the pressure to 1 bar. We used a timestep of 2 fs and saved structural conformations every 10 ps.
All simulations were allowed to progress to 1 μs of total simulation time, deemed the minimal required simulation time for convergence based on earlier studies [24,68]. To calculate DFI and DCI, we computed covariance matrices using 50 ns moving windows that overlap by 25 ns over the last 500 ns of the trajectory of each simulation. To ensure ergodicity, so that the DFI and DCI profiles represent the equilibrium dynamics, two basic conditions need to be met: (i) all conformations must be sampled from the same distribution; (ii) the time windows and subsequent covariance matrices obtained ought to be independent of the initial atomic coordinates in order to eliminate global motions and accurately capture equilibrium coordinate information. Under these conditions, the final average DFI profiles will be independent of the window size, meaning that averaging DFI profiles from different time window sizes (e.g., 50 ns vs. 75 ns) will give similar results, and covariance matrices extracted from different parts of the trajectories should also result in similar DFI profiles, as seen in S2 Fig.
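The windowing scheme can be made concrete with a short sketch. The function name is hypothetical, and the conversion to frame counts assumes the 10 ps save interval stated above; the window width, overlap, and the 500 ns to 1,000 ns range follow the text.

```python
def moving_windows(start_ns, end_ns, width_ns, overlap_ns):
    """Return (start, end) pairs of overlapping windows that fit inside [start_ns, end_ns]."""
    step = width_ns - overlap_ns
    windows = []
    t = start_ns
    while t + width_ns <= end_ns:
        windows.append((t, t + width_ns))
        t += step
    return windows

# Last 500 ns of a 1 us trajectory, 50 ns windows overlapping by 25 ns.
windows = moving_windows(500, 1000, 50, 25)

# Frames are saved every 10 ps, so one 50 ns window spans 50,000 ps / 10 ps frames.
frames_per_window = 50 * 1000 // 10

print(len(windows), windows[0], windows[-1], frames_per_window)
```

One covariance matrix would be computed per window, and the resulting DFI or DCI profiles averaged across windows.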

Clustering the DFI values of DARC sites

We clustered the DFI profiles of DARC sites for various mutated GCase proteins by comparing their percentile rankings. To compare the flexibility profiles, the proteins are concatenated into a data matrix $X$. Singular value decomposition (SVD) is then used to factorize the data into an orthonormal basis, which is a representation of the vector space containing the data. SVD is similar to principal component analysis, which may be used to help understand the structure of data or to increase the signal-to-noise ratio by eliminating redundant dimensions and mapping the data onto a lower-dimensional space. Clustering by SVD acts as an effective noise filter by isolating the highest variances among data points in the top principal vectors. Consequently, the remaining insignificant singular vectors can be omitted from the reconstruction. The DFI profiles of all proteins are merged into a matrix $X$ of dimensions ($n \times m$). Here $m$ is the number of datasets (protein variants) we are clustering together, each having $n$ attributes ($n$ = number of DARC sites; thus each element in a given column represents the DFI value of a specific DARC site of a given variant). On performing SVD, $X$ is decomposed as follows:

$$X = U \zeta V^{T}$$

Here, $U$ and $V$ are unitary matrices with orthonormal columns, called the left singular vectors and right singular vectors, respectively, and $\zeta$ is a diagonal matrix whose diagonal elements are the singular values of $X$. The singular values of $X$, by convention, are arranged in decreasing order of magnitude; $\sigma = \{\sigma_i\}$ represent the variances in the corresponding left and right singular vectors. The set of highest singular values, representing the largest variance in the orthonormal singular vectors, can be interpreted as showing the characteristics of the data, and the right singular vectors create the orthonormal basis that spans the vector space representing the data.

The left singular vectors contain weights indicating the significance of each attribute in the dataset $X$. Using these features of the decomposed singular vectors, we can create another matrix, $X^{*}$, using only the highest $r$ singular values, which can mimic the basic characteristics of the original dataset. Thus, $X^{*}$ can be represented as

$$X^{*} = \zeta^{*} V^{*T}$$

Here, $\zeta^{*}$ contains only the $r$ largest singular values and $V^{*}$ contains the corresponding right singular vectors. The data are now clustered hierarchically based on the pairwise distance between different protein variants in the reconstructed DFI data with reduced dimensions. For a pair of datasets (i.e., the flexibility profiles of any two proteins) $j_1$ and $j_2$, the distance between them in the original set of data is given by

$$d_{j_1 j_2} = \sqrt{\sum_{i=1}^{n} \left( x_{i j_1} - x_{i j_2} \right)^2}$$

which in reduced dimensions can be calculated as

$$d^{*}_{j_1 j_2} = \sqrt{\sum_{i=1}^{r} \left( x^{*}_{i j_1} - x^{*}_{i j_2} \right)^2}$$

These pairwise distances are used as the parameters for clustering the flexibility profiles of GCase. The DFI values of DARC sites are aligned and combined into a dataset matrix $X$. The three largest singular values are used for reconstruction of the data and clustering. The pairwise distance between each pair of proteins, computed in reduced dimensions as above, is used to cluster them hierarchically. A bottom-up approach is used for the hierarchical clustering: initially each protein variant is assigned its own cluster, and then, in successive iterations, the closest clusters are merged into a common cluster. In this approach, the distance between clusters is defined by the average pairwise distance between their components (average linkage clustering [69]). In the end, the clusters are represented hierarchically using a dendrogram, where the vertical axis denotes the Euclidean distance between various clusters and among their sub-clusters.
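The SVD reduction and bottom-up clustering described above can be sketched as follows. This is an illustration, not the authors' code: the function names and the toy matrix (five hypothetical sites by four hypothetical variants) are assumptions, and the average-linkage loop is written out by hand for transparency rather than taken from a clustering library.

```python
import numpy as np

def reduce_with_svd(X, r=3):
    """X holds one column per variant; return the r x m reduced coordinates."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return np.diag(s[:r]) @ Vt[:r, :]          # keep the r largest singular values

def pairwise_distances(X_cols):
    """Euclidean distance between every pair of columns."""
    diff = X_cols[:, :, None] - X_cols[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=0))

def average_linkage(D):
    """Bottom-up clustering; returns a list of (members_a, members_b, distance) merges."""
    clusters = [[k] for k in range(D.shape[0])]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # cluster distance = average pairwise distance between members
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], float(d)))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Toy data: rows are DARC sites, columns are variants (values are made up).
X_toy = np.array([[0.10, 0.12, 0.80, 0.82],
                  [0.20, 0.21, 0.70, 0.69],
                  [0.30, 0.29, 0.60, 0.62],
                  [0.90, 0.88, 0.15, 0.14],
                  [0.50, 0.52, 0.40, 0.41]])
X_star = reduce_with_svd(X_toy, r=3)
D_star = pairwise_distances(X_star)
merges = average_linkage(D_star)
```

The merge distances recorded in `merges` are the heights at which branches would join in the dendrogram.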

Neural network

To enhance our prediction accuracy based on protein dynamics, we integrated a neural network-based training and prediction algorithm. As the number of dimensions in the data space increases, regular regression methods fall behind machine learning strategies and artificial neural networks. Our data contain multiple dynamics-driven metrics: the DFI of the position harboring the observed variant, the DFI of neighboring positions, and DCI. These metrics by themselves display strong correlation (Fig 5) [49], but proteins are dynamic systems, meaning that per-residue dynamics cannot capture all the relevant information about the global dynamics. Therefore, by including several distinct metrics that represent different dynamical features of the proteins, we exploited an artificial neural network-based prediction approach. The feed-forward neural network architecture deployed in this paper uses a single input layer with multiple features and a binary classification model. The features are DFIi, %DCIji, DCIasym, and the average DFI of residues within 7 Å. We use residues within 7 Å because they interact directly with the variant site. These four features, along with the corresponding sites and ground truth values, may be found in the Supplementary Materials as S1 Data. The network includes two hidden layers with 80 nodes each between the input and the output layer. The hidden layers are connected with a 50% dropout scheme to reduce overfitting. The initial node weights and biases of the hidden layers are sampled from a uniform distribution, and Rectified Linear Units are used as the activation function to reach better convergence than a sigmoid function. The output layer has initial node weights and biases sampled from the Xavier uniform initializer and a sigmoid activation function with binary label output.
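As a rough illustration of the architecture just described (four input features, two hidden layers of 80 ReLU units with 50% dropout, and a sigmoid output), the forward pass can be sketched in plain NumPy. Several details here are assumptions because the text does not specify them: the range of the uniform initializer, a zero initial output bias, dropout applied after each hidden activation, the 0.5 decision threshold, and the toy inputs. The sketch covers the forward pass only, not training.

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_init(n_in, n_out, limit=0.05):
    """Uniform weights and biases; the range `limit` is an assumed value."""
    W = rng.uniform(-limit, limit, size=(n_in, n_out))
    b = rng.uniform(-limit, limit, size=n_out)
    return W, b

def xavier_uniform_init(n_in, n_out):
    """Xavier (Glorot) uniform weights: U(-a, a) with a = sqrt(6 / (n_in + n_out))."""
    a = np.sqrt(6.0 / (n_in + n_out))
    W = rng.uniform(-a, a, size=(n_in, n_out))
    b = np.zeros(n_out)                      # assumption: bias starts at zero
    return W, b

# 4 features -> 80 -> 80 -> 1
params = [uniform_init(4, 80), uniform_init(80, 80), xavier_uniform_init(80, 1)]

def forward(x, params, train=False, drop=0.5):
    """Return the predicted probability of the positive (disease) class."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(0.0, h @ W + b)       # ReLU hidden layer
        if train:
            mask = rng.random(h.shape) >= drop
            h = h * mask / (1.0 - drop)      # inverted dropout, training only
    W, b = params[-1]
    return 1.0 / (1.0 + np.exp(-(h @ W + b)))  # sigmoid output

# Two made-up feature rows: DFI_i, %DCI_ji, DCI_asym, mean DFI within 7 A.
x_toy = np.array([[0.2, 0.9, 0.1, 0.3],
                  [0.8, 0.1, -0.2, 0.7]])
p_toy = forward(x_toy, params)
labels_toy = (p_toy >= 0.5).astype(int)      # assumed threshold for the binary label
```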
The optimization algorithm uses stochastic gradient descent with built-in momentum to minimize the cross-entropy loss function. The built-in momentum helps the optimizer escape saddle points on its way to a minimum of the loss. The learning rate for the optimizer is set to 0.001, with 1000 epochs in total for the neural network to converge. The neural network is trained with 90% of randomly selected data points and tested on the remaining 10%. This process is repeated 10 times to gather improved statistics and reduce any bias coming from the data itself. Employing a 10-fold training/testing algorithm provides a distribution of accuracies instead of a single accuracy. The evaluation metrics AUC, accuracy, precision, and recall are used to evaluate the predictive power of the classification model by comparing prediction values with the ground truth values. The four possible outcomes from the binary classifier are true positive (TP), true negative (TN), false positive (FP), and false negative (FN). Accuracy, precision, and recall are calculated as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

AUC is obtained by calculating the area under the Receiver Operating Characteristic (ROC) curve, which is generated using the true positive and false positive rates. To determine highly evolving sites, we used two different methods: (i) the ConSurf database to evaluate the conservation of each site [70,71], and (ii) Molecular Evolutionary Genetics Analysis (MEGA) software [72] to calculate evolutionary rates as described by Kumar et al. [11].
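The evaluation metrics described above can be computed from the four confusion-matrix counts, as in the following standard-library sketch. The function names and the toy labels and scores are illustrative assumptions. AUC is computed here through its equivalent pairwise-ranking form (the probability that a randomly chosen positive scores higher than a randomly chosen negative) rather than by integrating the ROC curve explicitly.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = disease, 0 = neutral)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def safe_ratio(num, den):
    """Return None when a metric is incalculable (zero denominator)."""
    return num / den if den else None

def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": safe_ratio(tp + tn, tp + tn + fp + fn),
        "precision": safe_ratio(tp, tp + fp),
        "recall": safe_ratio(tp, tp + fn),
    }

def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return safe_ratio(wins, len(pos) * len(neg))

# Made-up example: eight variants with true labels, scores, and thresholded predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

results = classification_metrics(y_true, y_pred)
auc_value = auc(y_true, scores)
```

Returning `None` for a zero denominator mirrors the situation in which a method yields no positive predictions, so that precision cannot be calculated.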

Stick diagrams of loop 1 which are colored corresponding to their DFI for 19 different disease variants.

DFI is color-coded on a red-white-blue spectrum, where red marks the most flexible sites and blue the least flexible. (TIF)

%DFI profiles averaged over different time scales demonstrate convergence.

Black: average %DFI values calculated using covariance matrix data over 400 ns to 600 ns of the wild-type GCase simulation. Blue: average %DFI values calculated using covariance matrix data over 600 ns to 800 ns of the wild-type GCase simulation. Red: average %DFI values calculated using covariance matrix data over 800 ns to 1 μs of the wild-type GCase simulation. All profiles use 50 ns moving windows that overlap by 25 ns when calculating average %DFI. (TIF)

Accuracy of prediction tools tested against highly evolving sites in 144 enzyme ensemble data.

Bar plots showing accuracy, precision, recall, and area-under-the-curve (AUC) values for four different methods, including our DFI + DCI features. (A) The prediction methods were evaluated using only fast-evolving sites according to ConSurf. (B) The prediction methods were evaluated using only fast-evolving sites according to MEGA. Note that for our highly evolving subset, Rhapsody returned 0 true positive and 0 false negative values, causing AUC, precision, and recall to be either zero or incalculable. Using either set of highly evolving sites, our classifier is slightly better in AUC and comparable in precision. However, the dynamics-based classifier has slightly lower accuracy owing to higher false positive rates. (TIF)

Enzyme ensemble information.

For each missense variant used in our analysis, lists the PDB identification code, mutation site, pathogenicity (disease or neutral), and active sites. Sites are aligned to the associated PDB file. (CSV)

Shows the mutation site of each GCase DAV used in our analysis.

Asterisk denotes sites with multiple DAVs reported. Sites are aligned to PDB ID: 1ogs [34]. (CSV)

Input data for our neural network (see Methods).

dfi_i, %dci_ji, dci asymmetry, and average dfi within 7Å columns contain input layer features and the disease(1) or neutral(0) column contains ground truth values for pathogenicity. Columns pdb id and pdb residue index exist for identification purposes. (CSV)

Contains input files for MD simulations of GCase variants.

(RAR)

4 Nov 2021

Dear Professor Ozkan,

Thank you very much for submitting your manuscript "Dynamic coupling of residues within proteins as a mechanistic foundation of many enigmatic pathogenic missense variants" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. In this case it is extremely important that you show that your conclusions apply generally and not only to one (or a handful) of examples, as pointed out by the reviewers.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation. When you are ready to resubmit, please upload the following: [1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. [2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file). Important additional instructions are given below your reviewer comments. Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.
Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts. Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. Sincerely, Anders Wallqvist Associate Editor PLOS Computational Biology Arne Elofsson Deputy Editor PLOS Computational Biology *********************** Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: In the manuscript submitted by Nicholas J. Ose, et al, with title “Dynamic coupling of residues within proteins as a mechanistic foundation of many enigmatic pathogenic missense variants”, the authors presented a computational study utilizing dynamic coupling index (DCI) and dynamic flexibility index (DFI) to analyze disease related mutations at dynamic allosteric residue coupling (DARC) sites of 144 human enzymes containing 591 pathogenic missense variants. The dynamical correlations among these mutations distal from the enzymatic active sites are systematically evaluated and compared. The manuscript is well organized and presented and should be published after minor revisions to address the following concerns. 1. Figure 5 and its caption are not clear or easy to follow. For example, term “nSVNs” is not defined anywhere in the manuscript. 2. In the title, it is indicated that Principal component analysis (PCA) was carried out for DFI values of DARC sites. But the PCA seems not to be presented in the section, or at least it is not clear. Reviewer #2: Ose et al. present a computational study mapping genetic missense mutations to phenotypes. Primary tools include the dynamic coupling index or DCI, which measures the allosteric coupling between two residue sites. 
The method exerts random external forces on the functional site (e.g., the active site) and monitors responses of other (distal) sites. Residues with larger responses, called DARC sites, are speculated to be more important for the function. The DCI is based on the linear response theory established about 15 years ago and having gained further development in multiple recent studies. Using DCI, the authors have examined disease mutations in the human acid beta-glucosidase and 144 other human enzymes with annotated disease/non-disease mutations. They found that the predicted DARC sites have a higher potential to cause a loss-of-function effect leading to diseases. Disease mutations caused a change of flexibility near the active site and an overall reduction in the dynamic coupling, suggesting the functional relevance of the metric. Identifying potential disease mutations has a great impact in improving human health. In this paper, the authors try to establish the mechanistic basis for the genotype-phenotype relationship. The question to be addressed is clearly important and some of the correlations are certainly quite interesting. Incorporating dynamics in predicting disease mutations is not completely new. For example, Ponzoni and Bahar, PNAS 2018. The use of DCI indeed brings some insights, but the significance of the findings is unclear because of the lack of connections with previous studies. For example, how will the authors compare their method with the metrics used in the 2018’s and many other previous studies? Will DCI show a stronger correlation? What about the predictive power (e.g., evaluated using AUC) compared to the current state-of-the-arts? Some discussion on the potential complementary nature of DCI to other sequence-based, structural, or dynamic metrics will also be helpful. A main argument here is the correlation between DCI and disease mutations. However, a correlation does not guarantee a deterministic role. According to Fig. 
3D, there are still plenty of disease sites associated with low DCIs and, more importantly, many non-disease sites associated with high DCIs. The correlation may suggest a functional relevance but does not necessary establish that the metric is deterministic or even a primary factor. The overall flow of the manuscript is clear, but there are some missing details that affect the clarity to some extent. For example: The description of MD simulations (in Methods) is too short and clearly does not meet the current standard for transparency and reproducibility. It is unclear if the authors performed MD for all the mentioned proteins (including >90 GCases and 144 human enzymes) or just for GCases and used ENMs for other enzymes. The principal component analysis is mentioned but there is no result. Was PCA used in the clustering? What were the 20-ns simulation windows (p. 20) used for? Minor points: In the abstract, it says 94 GD mutants but in the Results it becomes 200 (p. 5) and then 97 (p. 6). Please check the number and be consistent. Fig. 1. What is the difference between b & c? Are they from different databases? Why are the trends opposite? P. 4, line 69. ‘…, which adversely impact the prediction accuracy of commonly used methods because they run counter to expectations (Fig 1d).’ This sentence is unclear and needs to be further explained. P. 8, line 167. ‘DCI measures the coupling of a position’. It should be ‘coupling of two positions.’ Fig. 4 legend. It should be ‘DARC sites.’ Not all figure panels are cited in the text. Reviewer #3: The work presented particularly exciting. The possibility of having a structural explanation for the impact of a point mutation involved in pathology and particularly interesting. This work follows on from other research carried out either on a specific protein or on a large set of data. (State of the art could be deeper). This work at the border of these two categories and this makes reading the manuscript slightly difficult. 
I had to reread the entire manuscript several times to be sure what I was reading comma was either the protein involved in Gaucher disease or it was a large number of proteins. I am not sure, moreover, that I have followed everything correctly. Thus figure 1 presented, quickly in the manuscript, relates to a large data set point when Figure 2 it is only on the protein of Gaucher disease. The figures are not analysed for the specialist. It is difficult to know if the results are significant or not. As an example figure 2A is composed of 1 example, one small region seems different, but not at the point of mutations, but no statistical analysis allow to see it. How is it on the other SNPs? And, a general question arises on the use of SNPs associated with this pathology, is it possible to have, thanks to the different projects of 100000 genomes++, all the non-pathological SNPs and then obtain -in fact- exactly the same results. It is a necessary that must be in this paper. The work is mainly based on an existing methodology, which must nevertheless be defended in a more rigorous way. It does not seem to be very sensitive. It would be advisable to better integrate its explanation and its critical analysis in the whole of the manuscript. The choice of this enzyme seems to be more related to the distance from the active site as a problem for its function as allosteric questioning, at long range. Figure 3A represents difficulty of reading it is difficult to see what is really relevant from what is not. The black dots and red dots are not separable. The choice of threshold values is not explain. It is difficult to make an opinion. We would like to have other examples with different shapes and folds. There is clearly work. It is particularly unfortunate not to be able to evaluate it correctly because of the presentation of the manuscript and especially a rather too strong absence of the critical aspect on the results. 
Reviewer #4: In this manuscript, the authors compare disease mutations with neutral mutations in human enzymes, in terms of their allosteric dynamic coupling with known enzymatic active sites. The computation analysis is based on a combination of elastic network models and molecular dynamics simulation. The authors conduct both case studies and proteome-wide analyses, concluding that disease mutations tend to disrupt catalytic function through dynamic allosteric coupling with active sites. On Pages 6-7, in the section entitled "Disease-associated mutations modify dynamics throughout the protein", the authors investigate a single disease mutation N370S in the enzyme GCase, and show that this disease mutation leads to an increase in flexibility (as measured by DFI) within or near loop 1 and/or loop 3. Are the authors suggesting that for the GCase enzyme, disease mutations on average lead to higher DFI values within or near loop 1 and/or loop 3 than neutral mutations? If so, this assertion should be rigorously tested with p-values presented. If not, the authors should clearly describe the conclusions from their analyses. In general, the manuscript contains numerous general assertions regarding disease versus neutral mutations, only some of which are supported by p-values. The authors should support their general assertions by p-values whenever possible. The definition and application of the DCI metric are somewhat confusing. DCI is defined to measure the impact of catalytic site perturbation on a residue under investigation. Here, catalytic site is the cause, and the residue under investigation is the consequence. However, the authors then apply the DCI metric to identify and study allosteric residues, where presumably the residue under investigation is the cause, and the catalytic site is the consequence. What is the rationale for applying DCI in this context, given that the direction of cause and effect seem to be reversed? 
On Page 10, the first section is entitled "Principal component analysis of DFI aligns with experimentally determined catalytic activity". However, in this section the authors only performed clustering analysis (Fig. 4), but not principal component analysis. On Page 19, Equation (5) contains several errors. The division sign should be "/" rather than "\\". In the numerator, the index j should sum over from 1 to N_functional, rather than from N_functional to N_functional. In the denominator, the index beneath the summation symbol should be j rather than i. On Page 20, Equation (9), the equation "DCI_asymm = DCI_i - DCI_j" does not make sense and should be fixed and further elaborated. It is not clear if the authors have made all data underlying their findings fully available. The authors should try to make their data as fully available as possible. The data can either be provided as supporting information, or deposited to a public repository. Minor comments and typos: Page 5, Line 98: "wcontaining" -> "containing". Page 7, Fig. 2 mentions "%DFI profile", but "%DFI" is not defined. Page 12, Line 273: "sit" -> "site". Fig. 3a, 3b, and 3d are not referred to in the manuscript text. ********** Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified. 
Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes Reviewer #4: None ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Reviewer #3: No Reviewer #4: No Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at . Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5. Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. 
Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

24 Jan 2022

Submitted filename: GCase_Rebuttal1_TrueFinal.docx

7 Feb 2022

Dear Professor Ozkan,

Thank you very much for submitting your manuscript "Dynamic coupling of residues within proteins as a mechanistic foundation of many enigmatic pathogenic missense variants" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time.
Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Anders Wallqvist
Associate Editor
PLOS Computational Biology

Arne Elofsson
Deputy Editor
PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately.

Reviewer's Responses to Questions

Comments to the Authors:
Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have addressed all the concerns raised by this reviewer. Now the manuscript should be accepted for publication.

Reviewer #2: The authors have done great work to revise their paper and have addressed all my previous concerns. I only have two additional minor points to mention regarding the new text, which I believe the authors can fix easily:
It would be better to have a brief explanation of the scores (accuracy, recall, etc.) used in the benchmark.
Fig. 6B, only recalls are shown. Better to show other scores as well for completeness (maybe in the supplemental file).

Reviewer #3: I am particularly and pleasantly surprised by the quality of the responses given to all the reviewers. The authors have taken all the comments into account and wanted to answer all the questions in depth. They did this with great success, which gave me a much better understanding of this work. It deserves to be published as it is and I hope will have the impact it deserves.

Reviewer #4: All comments have been adequately addressed.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?
Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes
Reviewer #4: None

**********

Do you want your identity to be public for this peer review?

Reviewer #1: No
Reviewer #2: No
Reviewer #3: No
Reviewer #4: No

References: Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

6 Mar 2022

Submitted filename: GCase_Rebuttal2.docx

9 Mar 2022

Dear Professor Ozkan,

We are pleased to inform you that your manuscript 'Dynamic coupling of residues within proteins as a mechanistic foundation of many enigmatic pathogenic missense variants' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete.
PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Anders Wallqvist
Associate Editor
PLOS Computational Biology

Arne Elofsson
Deputy Editor
PLOS Computational Biology

***********************************************************

4 Apr 2022

PCOMPBIOL-D-21-01712R2

Dynamic coupling of residues within proteins as a mechanistic foundation of many enigmatic pathogenic missense variants

Dear Dr Ozkan,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date.
The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Livia Horvath
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

65 in total

1. Evolution of conformational dynamics determines the conversion of a promiscuous generalist into a specialist enzyme.

Authors: Taisong Zou; Valeria A Risso; Jose A Gavira; Jose M Sanchez-Ruiz; S Banu Ozkan
Journal: Mol Biol Evol Date: 2014-10-13 Impact factor: 16.240

2. A hinge migration mechanism unlocks the evolution of green-to-red photoconversion in GFP-like proteins.

Authors: Hanseong Kim; Taisong Zou; Chintan Modi; Katerina Dörner; Timothy J Grunkemeyer; Liqing Chen; Raimund Fromme; Mikhail V Matz; S Banu Ozkan; Rebekka M Wachter
Journal: Structure Date: 2015-01-06 Impact factor: 5.006

3. Three-dimensional reconstruction of protein networks provides insight into human genetic disease.

Authors: Xiujuan Wang; Xiaomu Wei; Bram Thijssen; Jishnu Das; Steven M Lipkin; Haiyuan Yu
Journal: Nat Biotechnol Date: 2012-01-15 Impact factor: 54.908

4. Human Gene Mutation Database (HGMD): 2003 update.

Authors: Peter D Stenson; Edward V Ball; Matthew Mort; Andrew D Phillips; Jacqueline A Shiel; Nick S T Thomas; Shaun Abeysinghe; Michael Krawczak; David N Cooper
Journal: Hum Mutat Date: 2003-06 Impact factor: 4.878

5. Association of glucocerebrosidase mutations with dementia with lewy bodies.

Authors: Lorraine N Clark; Lykourgos A Kartsaklis; Rebecca Wolf Gilbert; Beatriz Dorado; Barbara M Ross; Sergey Kisselev; Miguel Verbitsky; Helen Mejia-Santana; Lucien J Cote; Howard Andrews; Jean-Paul Vonsattel; Stanley Fahn; Richard Mayeux; Lawrence S Honig; Karen Marder
Journal: Arch Neurol Date: 2009-05

6. Prediction by graph theoretic measures of structural effects in proteins arising from non-synonymous single nucleotide polymorphisms.

Authors: Tammy M K Cheng; Yu-En Lu; Michele Vendruscolo; Pietro Lio'; Tom L Blundell
Journal: PLoS Comput Biol Date: 2008-07-25 Impact factor: 4.475