The Loss and Gain of Functional Amino Acid Residues Is a Common Mechanism Causing Human Inherited Disease.

Jose Lugo-Martinez, Vikas Pejaver, Kymberleigh A Pagel, Shantanu Jain, Matthew Mort, David N Cooper, Sean D Mooney, Predrag Radivojac.

Abstract

Elucidating the precise molecular events altered by disease-causing genetic variants represents a major challenge in translational bioinformatics. To this end, many studies have investigated the structural and functional impact of amino acid substitutions. Most of these studies were however limited in scope to either individual molecular functions or were concerned with functional effects (e.g. deleterious vs. neutral) without specifically considering possible molecular alterations. The recent growth of structural, molecular and genetic data presents an opportunity for more comprehensive studies to consider the structural environment of a residue of interest, to hypothesize specific molecular effects of sequence variants and to statistically associate these effects with genetic disease. In this study, we analyzed data sets of disease-causing and putatively neutral human variants mapped to protein 3D structures as part of a systematic study of the loss and gain of various types of functional attribute potentially underlying pathogenic molecular alterations. We first propose a formal model to assess probabilistically function-impacting variants. We then develop an array of structure-based functional residue predictors, evaluate their performance, and use them to quantify the impact of disease-causing amino acid substitutions on catalytic activity, metal binding, macromolecular binding, ligand binding, allosteric regulation and post-translational modifications. We show that our methodology generates actionable biological hypotheses for up to 41% of disease-causing genetic variants mapped to protein structures suggesting that it can be reliably used to guide experimental validation. Our results suggest that a significant fraction of disease-causing human variants mapping to protein structures are function-altering both in the presence and absence of stability disruption.

Year:  2016        PMID: 27564311      PMCID: PMC5001644          DOI: 10.1371/journal.pcbi.1005091

Source DB:  PubMed          Journal:  PLoS Comput Biol        ISSN: 1553-734X            Impact factor:   4.475


Introduction

Spurred by the advances in DNA sequencing, the accumulation of human genetic variation (and with it amino acid substitution data) has over the past two decades been unprecedented. Multiple databases and resources now enumerate and annotate amino acid substitutions, their functional impact, and association with inherited disease [1-3]. However, to further our understanding of human genetic variation and its impact on disease, it is necessary to elucidate the associated molecular alterations [4-6]. Thus, the step of identifying the underlying molecular mechanisms constitutes a serious impediment to understanding and treating human disease. A straightforward approach to integrating genetic and molecular data is to search databases for structural and functional annotations at the variation site or in the neighborhood of interest, and then provide both the possible and the likely effects of mutations on these annotations [7-12]. Although this approach is useful, its major limitation is its dependence on previously observed and curated functional information as well as our inability, except in limited cases [9, 10], to cover mutations that create functional residues. Furthermore, the deterministic nature of data integration does not easily lend itself to a principled strategy of prioritizing many of the possible molecular mechanisms based on their likelihood to impact clinical phenotype, especially when a variant resides in the neighborhood of the functional site. A more comprehensive approach to analyzing the effects of amino acid substitutions involves the use of statistical inference methods that predict functional impact. Although there are many studies that have adopted this strategy using the data from protein sequence or structure [13], most methods make inferences without specifying which functional property has been impacted. 
Such an approach, however, is feasible if the methodology can be developed to predict a specific function, say a phosphorylation site or a catalytic residue, which is then applied to sequence variants [14-16]. Furthermore, these specific functional predictions can be integrated with general variant effect predictors to provide probabilistic estimates of molecular mechanisms of disease [17]. While many successful machine learning models can be made based on sequence information alone, structural information can provide additional benefits [18, 19]. This suggests that further improvements could be made if specific predictors of protein function could be integrated into this pipeline [20]. Wang and Moult published a seminal work on the impact of germline variants on protein function [7]. They searched the Protein Data Bank (PDB) and used homology modeling to obtain 3D structures of wild-type proteins as a means to characterize the structural and functional effects of both disease and neutral variants. They reported that the majority of disease-causing substitutions affect protein stability, whereas a relatively small proportion directly disrupt molecular function. By contrast, Sahni et al. observed a rather larger fraction of function-impacting variants in their experimental studies into the impact of variants on protein-protein and protein-DNA interactions [21]. Their work therefore challenges the traditional view of the dominance of structure-impacting changes. Finally, Steward et al. examined the structural, functional and physicochemical features of wild-type protein structures where disease-causing variants occur [22]. Unfortunately, the scope of these and several other studies was limited to characterizing the functional effects of amino acid substitutions across a handful of protein functions [23, 24]. 
There is therefore a need for large-scale studies that use statistical inference methods based on protein structure to explore the relative contributions made by disruption of functional sites in disease pathogenesis. In this work, we carry out a systematic study of the alterations of specific functional sites as the underlying molecular mechanisms of disease over a data set comprising germline disease-causing amino acid substitutions mapped to protein 3D structures. In particular, we develop multiple structure-based functional residue predictors and assess the impact of disease-associated substitutions on catalytic residues, metal-binding sites, macromolecular binding sites, ligand-binding sites, allosteric sites and post-translational modification (PTM) sites. We then quantify the extent to which disruption or introduction of particular types of functional site accounts for the deleterious impact of amino acid substitutions. Our results provide evidence to support the view that the increased and decreased propensity of particular functional activities are common in human inherited disease.

Materials and Methods

Probabilistic model for alteration of residue function

For a given protein structure and a missense variant, we are interested in estimating whether a particular residue functionality, say f, has been impacted. To achieve this, we broadly distinguish between two scenarios resulting in alteration of function: (i) the mutation disrupts protein stability and subsequently impacts residue function and (ii) the mutation does not impact stability and structure yet still leads to an altered function as a consequence of modified functional propensity. The latter scenario might occur, for example, for tyrosine-to-phenylalanine substitutions that result in minimal structural changes, yet the mutation itself may have significant impact on protein phosphorylation and downstream events. This is because phenylalanine cannot be phosphorylated to subsequently create an SH2 binding site [25]. We informally refer to the events of increased or decreased functional propensity as gain and loss of functional activity. We formalize this approach as follows. Let x be a collection of features that encodes a particular mutation in a protein structure and let P(loss of f|x) be the probability of the loss of residue function f consequent to mutation. Then, we can write

$$P(\text{loss of } f \mid x) = P(\text{loss of } f \mid S, x)\,P(S \mid x) + P(\text{loss of } f \mid \bar{S}, x)\,P(\bar{S} \mid x),$$

where the event $S$ indicates that protein stability is significantly changed and $\bar{S}$ indicates that the stability is not significantly changed; i.e., $P(\bar{S} \mid x) = 1 - P(S \mid x)$. This expression gives a probabilistic formulation that can be used to estimate loss of a specific function f, with an assumption of a dichotomized impact on protein stability. We use this simple approach in part because of the issues involved in obtaining large amounts of high-quality data to assess the impact of sequence variants on stability. We note, however, that the expression can be generalized to multiple groups of stability disruption and in the limit the sum would turn into an integral. Now we briefly discuss estimating $P(S \mid x)$, $P(\text{loss of } f \mid S, x)$, and $P(\text{loss of } f \mid \bar{S}, x)$.
The posterior probability of stability disruption P(S|x) can be determined by developing a computational model given a representative data set of variants that significantly impact stability of the protein (both stabilizing and destabilizing mutations) and a representative set of variants that do not. The negative data set can also be substituted by a large representative set of variants for which the impact on stability is unknown [26, 27]. This formulation falls into the category of positive-unlabeled learning [28], a version of semi-supervised learning in which the set of negative examples is unavailable or ignored; e.g., because available negative examples are biased. Next, we discuss estimating $P(\text{loss of } f \mid \bar{S}, x)$; i.e., the probability of the loss of function given that there is no significant stability disruption. Let x′ be a collection of features that encodes the structural environment of a residue and let f be the specific residue function of interest; e.g., whether the residue is phosphorylated, DNA-binding, etc. Let $P(f \mid x'_{wt})$ denote the probability that the residue at this site is functional in the wild-type protein and $P(f \mid x'_{mt})$ the probability that the mutated residue, at the same locus in the protein, is functional. We then define the probability of the loss of residue function f as

$$P(\text{loss of } f \mid \bar{S}, x) = P(f \mid x'_{wt}) \cdot \left(1 - P(f \mid x'_{mt})\right),$$

where $1 - P(f \mid x'_{mt})$ gives the probability that the residue in the mutant protein is not functional. To estimate $P(\text{loss of } f \mid \bar{S}, x)$, we can employ the same functional residue predictor to compute prediction scores on the wild-type and mutant proteins. That is, because there are no structural changes, the same structure-based classifier can be used to compute both $P(f \mid x'_{wt})$ and $P(f \mid x'_{mt})$, with the only difference being the replaced amino acid in the feature set $x'$. As before, the probabilistic model P(f|x′) can be developed using data sets of positive and negative examples or, in the absence of representative negative examples, using positive and unlabeled data.
We will discuss the details of approximating the posterior probability of protein function in the next section. Finally, we can consider that large changes in protein stability and structure always abolish the function of a residue; i.e., $P(f \mid x'_{mt}) = 0$ given $S$. This implies that

$$P(\text{loss of } f \mid S, x) \approx P(f \mid x'_{wt});$$

that is, the probability of the loss of function at a particular residue is roughly equal to the probability that the residue was functional in the first place. In addition to the loss of protein function, we can also consider the event of the gain of residue function, where

$$P(\text{gain of } f \mid \bar{S}, x) = \left(1 - P(f \mid x'_{wt})\right) \cdot P(f \mid x'_{mt}).$$

This formulation accounts for the changes in residue microenvironments that increase its functional propensity. While most amino acid substitutions found in nature are neutral or disruptive, there are many examples in which they lead to the gain-of-function events. For example, generation of a sequence motif NX[S/T] has been observed to result in gain of N-linked glycosylation events and disease [9]. Similarly, changes in catalytic residues have been observed to increase the efficiency of catalysis, also with phenotypic implications [29]. Furthermore, assuming that significant stability changes rarely lead to gain of function we can simply take

$$P(\text{gain of } f \mid S, x) = 0.$$

In this paper, we are interested in specific types of residue function f for which sufficiently large data sets could be extracted from biological databases. We organize these residues into catalytic residues, metal-binding residues, macromolecule-binding residues, ligand-binding residues, post-translationally modified sites, and allosteric residues. For the purposes of our study, we consider certain types of residues to be functional although they may also be important for protein stability. For example, a disruption of certain metal-binding sites, say a Zn2+-binding residue, will be considered here as disruption of functional residues that consequently impacts protein stability. We do, however, note that this distinction is somewhat philosophical.
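The loss and gain model of this section can be illustrated with a minimal Python sketch. It assumes one reading of the model: without stability disruption, loss is the product of the wild-type functional probability and the mutant non-functional probability; with stability disruption, the function is always abolished and never gained. The function and argument names are hypothetical, and the three inputs stand in for the outputs of the stability and functional residue predictors.

```python
def loss_probability(p_s, p_f_wt, p_f_mt):
    """P(loss of f | x) under a dichotomized stability impact.

    p_s    : P(S | x), probability that stability is significantly changed
    p_f_wt : P(f | x'_wt), wild-type residue is functional
    p_f_mt : P(f | x'_mt), mutant residue is functional
    """
    loss_if_disrupted = p_f_wt                 # assumption: disruption abolishes function
    loss_if_stable = p_f_wt * (1.0 - p_f_mt)   # functional before, not functional after
    return loss_if_disrupted * p_s + loss_if_stable * (1.0 - p_s)


def gain_probability(p_s, p_f_wt, p_f_mt):
    """P(gain of f | x), assuming stability disruption never creates function."""
    gain_if_stable = (1.0 - p_f_wt) * p_f_mt   # not functional before, functional after
    return gain_if_stable * (1.0 - p_s)


# Illustrative numbers only (not taken from the study).
p_loss = loss_probability(p_s=0.2, p_f_wt=0.9, p_f_mt=0.1)
p_gain = gain_probability(p_s=0.2, p_f_wt=0.9, p_f_mt=0.1)
```

With a functional wild-type residue and a non-functional mutant, nearly all of the probability mass goes to loss, whichever stability scenario applies.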

Training stability predictors and functional residue predictors

All classification models in this work were trained using the positive-unlabeled framework in which we are given a set of positive examples and a set of unlabeled examples. In the case of stability predictors, P(S|x), the positive examples represent mutations that have been experimentally shown to significantly impact protein stability; based on the previous studies we selected these mutations to be either stabilizing or destabilizing with |ΔΔG| > 0.5 kcal/mol [30], although some other studies use higher values [31, 32]. The set of unlabeled examples, on the other hand, was selected using a database of human variants, dbSNP, mapped to available protein structures in PDB. In the case of functional residue predictors P(f|x′), the positive examples were selected by integrating structural and molecular data that provide experimentally observed functional residues, whereas the unlabeled examples were selected from a set of monomeric proteins in PDB. We will describe all data sets precisely at the end of the Methods section. We next discuss how to train a classification model from positive and unlabeled data. Let $\mathcal{L} = \{(x_i, y_i)\}_{i=1}^{m}$ be a labeled data set, where $x_i$ is an input example and $y_i \in \{-1, +1\}$ is its class label. Let $\mathcal{U} = \{x_j\}_{j=1}^{n}$ be a set of unlabeled examples. In the problem of learning whether a mutation impacts stability, x encodes a set of features corresponding to the mutation and y = +1 indicates large stability disruption. Similarly, in the case of functional site predictors, x encodes a particular residue microenvironment in a protein and y = +1 indicates that the residue is functional. In the positive-unlabeled formulation, all examples in $\mathcal{L}$ have positive class labels, whereas $\mathcal{U}$ is a mixture of positive and negative examples. The probability of positive examples P(y = +1) in the unlabeled set is referred to as the class prior. The task of the predictor is to learn the probability P(y = +1|x) when provided data sets $\mathcal{L}$ and $\mathcal{U}$.
Unfortunately, learning P(y = +1|x) is not straightforward because the negative examples are not available. To address this problem we rely on the body of work in semi-supervised learning that decomposes the problem into the training of a non-traditional classifier [26]; i.e., a model that distinguishes between labeled and unlabeled data, and estimating the class prior P(y = +1). We denote the posterior probability from a non-traditional classifier as P(l = +1|x), where l = +1 refers to the event of data point being labeled. We approximate these probabilities using kernel-based learning with support vector machine (SVM) classifiers as underlying optimization engines. Additionally, we estimate P(y = +1) using the AlphaMax algorithm [27], and point out other available options for an interested reader [26, 33]. Under mild assumptions [27], the output of a non-traditional classifier P(l = +1|x) can be converted into the output of a traditional classifier P(y = +1|x) using

$$P(y = +1 \mid x) = P(y = +1) \cdot \frac{n}{m} \cdot \frac{P(l = +1 \mid x)}{1 - P(l = +1 \mid x)},$$

where m and n are the sizes of labeled and unlabeled data sets, respectively. This predictor can now be applied to any data set $\mathcal{D}$ to compute the frequency of the phenomenon using the empirical mean formula $\frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} P(y = +1 \mid x)$.
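As a rough illustration of this conversion, the sketch below assumes the standard density-ratio form, in which the traditional posterior equals the class prior times n/m times the odds of being labeled. This is a reading of the text and should not be taken as the authors' exact implementation. The function names and the clipping to 1 are assumptions made here for numerical safety.

```python
def traditional_posterior(p_labeled, class_prior, m, n):
    """Convert P(l=+1|x) from a non-traditional classifier into P(y=+1|x).

    p_labeled   : P(l = +1 | x), probability that x comes from the labeled set
    class_prior : P(y = +1), fraction of positives in the unlabeled set
    m, n        : sizes of the labeled and unlabeled data sets
    """
    if not 0.0 <= p_labeled < 1.0:
        raise ValueError("p_labeled must lie in [0, 1)")
    odds = p_labeled / (1.0 - p_labeled)
    # Clipping is an assumption: imperfect estimates can push the value above 1.
    return min(1.0, class_prior * (n / m) * odds)


def estimated_frequency(posteriors):
    """Empirical mean of P(y=+1|x) over a data set of variants or residues."""
    return sum(posteriors) / len(posteriors)


# Illustrative values only.
scores = [traditional_posterior(p, class_prior=0.1, m=1000, n=10000)
          for p in (0.05, 0.2, 0.4)]
fraction_positive = estimated_frequency(scores)
```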

The probability of alteration for multiple types of function

We previously considered the loss and gain of the specific function f at a particular residue of interest. We now extend this definition to multiple types of functional residues as follows. Consider an event of loss of any function f from a set $\mathcal{F}$. We can use previous reasoning to re-write the earlier expression as

$$P(\text{loss of } \mathcal{F} \mid x) = P(\text{loss of } \mathcal{F} \mid S, x)\,P(S \mid x) + P(\text{loss of } \mathcal{F} \mid \bar{S}, x)\,P(\bar{S} \mid x).$$

To compute this probability, we need to compute probabilities $P(\text{loss of } \mathcal{F} \mid S, x)$ and $P(\text{loss of } \mathcal{F} \mid \bar{S}, x)$. Because the functional data is too sparse to learn the joint (posterior) models of residue function, we consider two models to approximate this probability using the marginal (posterior) models that the residue is functional. In the first model that we refer to as the independence model, we consider each type of functional residue to be independent of others and write

$$P(\text{loss of } \mathcal{F} \mid \bar{S}, x) = 1 - \prod_{f \in \mathcal{F}} \left(1 - P(\text{loss of } f \mid \bar{S}, x)\right).$$

The expression above is the probability that at least one of the functions from $\mathcal{F}$ has been lost. Because the functions are not in reality independent, this model may lead to overestimation. The second, more conservative model, approximates the probability of loss as

$$P(\text{loss of } \mathcal{F} \mid \bar{S}, x) = \max_{f \in \mathcal{F}} P(\text{loss of } f \mid \bar{S}, x).$$

We refer to this model as the max model. Equivalent expressions can be written for $P(\text{loss of } \mathcal{F} \mid S, x)$ as well as for the gain-of-function events. We note that $\mathcal{F}$ may contain particular groups of functions, say all types of metal binding, or can be used for all functions considered in this work.
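The two aggregation models can be compared directly in a few lines of Python. This is a minimal sketch that assumes the per-function loss probabilities have already been computed. The function names are hypothetical.

```python
from math import prod


def loss_any_independence(p_losses):
    """Independence model: probability that at least one function is lost."""
    return 1.0 - prod(1.0 - p for p in p_losses)


def loss_any_max(p_losses):
    """Max model: the more conservative estimate."""
    return max(p_losses)


# Illustrative per-function loss probabilities for one variant.
per_function = [0.5, 0.2]
upper = loss_any_independence(per_function)
lower = loss_any_max(per_function)
```

The independence estimate is never smaller than the max estimate, so the two values bracket a plausible range for a given variant.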

Graphlet kernels

In this section we briefly summarize the graphlet kernel prediction framework and show how these kernels were used to train both stability predictors and functional site predictors.

Graphs

A graph G is a pair (V, E), where V is a set of vertices (nodes) and E ⊆ V × V is a set of edges. In a vertex-labeled (colored) graph, a labeling function g is defined as g: V → Σ, where Σ is a finite alphabet, commonly referred to as vertex alphabet. A graph without self-loops, i.e. where (v, v) ∉ E, ∀v ∈ V, is said to be simple. An undirected graph is a graph where the order of the vertices in each pair (u, v) ∈ E can be ignored; otherwise, the graph is said to be a directed graph. A rooted graph G is a graph together with a distinguished vertex termed the root.

Graphlets

A graphlet is a small, simple, connected, rooted graph. We refer to a graphlet with n vertices as an n-graphlet. For more information on graphlets, we direct the reader to [34-38].

Edit distance graphlet kernels

Consider a vertex-labeled graph G = (V, E, g, Σ), where |Σ| ≥ 1. Lugo-Martinez and Radivojac [38] defined the m-edit distance representation of vertex v as

$$\phi_{(n,m)}(v) = \left(\psi_{(n_1,m)}(v), \psi_{(n_2,m)}(v), \ldots, \psi_{(n_{\kappa(n,\Sigma)},m)}(v)\right),$$

where

$$\psi_{(n_i,m)}(v) = \sum_{n_j \in E(n_i, m)} w(n_i, n_j) \cdot \varphi_{n_j}(v).$$

In the previous expression, $\varphi_{n_j}(v)$ is the count of the j-th labeled n-graphlet rooted at v, $\kappa(n, \Sigma)$ is the total number of vertex-labeled n-graphlets and $E(n_i, m)$ is a set of n-graphlets such that for each $n_j \in E(n_i, m)$ there exists an edit distance path of length at most m that transforms $n_i$ into $n_j$. That is, the number of edit operations necessary to transform $n_i$ into $n_j$ is at most m, where edit operations are defined as insertion or deletion of vertices and edges, or in the case of labeled graphs, substitutions of vertex and edge labels. Finally, weights $w(n_i, n_j) \geq 0$ are used to adjust the influence of pseudo counts and control computational complexity; in this study, we set $w(n_i, n_j) = 1$ if $n_j \in E(n_i, m)$ and $w(n_i, n_j) = 0$ otherwise. The length-m edit distance n-graphlet kernel $k_{(n,m)}(u, v)$ between vertices u and v can be computed as an inner product between the respective count vectors $\phi_{(n,m)}(u)$ and $\phi_{(n,m)}(v)$. Hence, the length-m edit distance graphlet kernel function can be expressed as

$$k_m(u, v) = \sum_{n=1}^{N} k_{(n,m)}(u, v),$$

where N is a small integer; typically defined up to N = 5 for undirected graphs. Additionally, one can define two subclasses of edit distance kernels referred to as (vertex) label-substitution $k_m^{\mathrm{ls}}$ (only allows substitutions of vertex labels) and edge-indel kernels $k_m^{\mathrm{ei}}$ (only allows insertion or deletion of edges). It is worth noting that if m = 0, then $k_m$, $k_m^{\mathrm{ls}}$ and $k_m^{\mathrm{ei}}$ are all equivalent to the standard graphlet kernel on labeled graphs [37]. In this work, we only considered the normalized kernel calculated as

$$\hat{k}(u, v) = \frac{k^*(u, v)}{\sqrt{k^*(u, u)\,k^*(v, v)}},$$

where $k^*(u, v)$ can be $k_m(u, v)$, $k_m^{\mathrm{ls}}(u, v)$, or $k_m^{\mathrm{ei}}(u, v)$. The normalized kernel has been previously shown to have favorable performance with respect to non-normalized kernels [37, 38].
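To make the count smoothing and the normalization concrete, the sketch below works on sparse count vectors stored as dictionaries. It assumes that graphlet enumeration and the edit-distance neighborhoods have been computed elsewhere, and it uses unit weights as in this study. The string identifiers and the `neighborhood` mapping are hypothetical placeholders, not part of the original framework.

```python
from math import sqrt


def smooth_counts(counts, neighborhood):
    """Pseudo counts: each graphlet receives the counts of all graphlets
    within edit distance m of it (unit weights).

    counts       : dict graphlet_id -> count of that graphlet rooted at a vertex
    neighborhood : dict graphlet_id -> set of graphlet_ids within edit
                   distance m, including the graphlet itself (assumed given)
    """
    smoothed = {}
    for g_i, members in neighborhood.items():
        total = sum(counts.get(g_j, 0) for g_j in members)
        if total:
            smoothed[g_i] = total
    return smoothed


def inner(phi_u, phi_v):
    """Inner product of two sparse count vectors."""
    if len(phi_u) > len(phi_v):
        phi_u, phi_v = phi_v, phi_u
    return sum(c * phi_v.get(g, 0) for g, c in phi_u.items())


def normalized_kernel(phi_u, phi_v):
    """k(u, v) / sqrt(k(u, u) * k(v, v)); returns 0 for an empty vector."""
    denom = sqrt(inner(phi_u, phi_u) * inner(phi_v, phi_v))
    return inner(phi_u, phi_v) / denom if denom > 0 else 0.0


# Toy example: two labeled 2-graphlets that differ by one label substitution.
neighborhood = {"AL": {"AL", "AV"}, "AV": {"AV", "AL"}, "AG": {"AG"}}
phi_u = smooth_counts({"AL": 2, "AV": 1}, neighborhood)
phi_v = smooth_counts({"AV": 1}, neighborhood)
similarity = normalized_kernel(phi_u, phi_v)
```

Sparse dictionaries are used because the number of possible labeled graphlets grows quickly with the alphabet size, while only a small fraction is observed around any one residue.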

Practical aspects of training

We used the graphlet kernel framework and SVM classifiers to construct all functional site predictors. First, we modeled protein structures as protein contact graphs, where each amino acid residue was represented as a vertex and two spatially close residues (i.e. 4.5Å or less between any two atoms) were linked by an undirected edge. Fig 1 illustrates a contact graph for a protein kinase rooted at a tyrosine residue at position 148. Next, we computed a set of normalized graphlet kernel matrices using the edit distance, label-substitution and edge-indel kernels for all pairs of examples $(x_i, x_j)$. For each kernel matrix, we used SVM [39] and the default value for the capacity parameter to train a predictor. We incorporated evolutionary information by extending the vertex alphabet Σ from the 20 standard amino acids to 40 based on the median residue conservation observed over the entire data set [38]. For example, the amino acid alanine was split into highly conserved alanines (represented as A) and other alanines (represented as a). Once each predictor was trained, we used Platt’s correction to adjust the outputs of the predictor to the 0-1 range [40].
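The contact graph construction described above can be sketched as follows, assuming that atom coordinates have already been parsed from a structure file. The input layout (a dictionary from residue identifier to a list of atom coordinates) and the function name are assumptions for illustration. The brute-force pairwise search is chosen for clarity, not speed.

```python
from itertools import combinations
from math import dist


def contact_graph(residues, cutoff=4.5):
    """Return undirected edges between residues that have any pair of atoms
    within `cutoff` angstroms of each other.

    residues : dict residue_id -> list of (x, y, z) atom coordinates
    """
    edges = set()
    for (r1, atoms1), (r2, atoms2) in combinations(residues.items(), 2):
        if any(dist(a, b) <= cutoff for a in atoms1 for b in atoms2):
            edges.add(frozenset((r1, r2)))
    return edges


# Toy coordinates (not from a real structure).
toy = {
    "Y148": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    "L149": [(4.0, 0.0, 0.0)],
    "K162": [(12.0, 0.0, 0.0)],
}
edges = contact_graph(toy)
```

For whole structures, a spatial index such as a grid or k-d tree would avoid the quadratic number of residue pairs.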
Fig 1

Left: Structure of human aurora kinase A, chain A fragment (PDB entry 2j4z) with highlighted phosphorylation site Tyr148 (denoted as Y148). Right: Corresponding level-3 protein contact graph centered at Tyr148 (denoted with double circles). Nodes represent amino acid residues and edges correspond to spatially neighboring residues (i.e. 6Å or less between Cα atoms).

In the case of stability predictors, we augmented the graphlet kernel representation using 33 features previously shown to be informative [41, 42]. In particular, we used 20 features for the 20 different amino acids to encode the mutation information (i.e. −1 for wild-type residue; +1 for mutated residue; 0 otherwise), and 13 features to encode the difference between physicochemical properties between wild-type and mutant amino acid residues [43-50].
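The 20-dimensional mutation encoding used for the stability predictors can be written in a few lines. In this sketch the ordering of amino acids and the direction of the property difference (mutant minus wild-type) are assumptions. The property scale in the example is a toy placeholder, since the 13 physicochemical scales are only given by citation here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 features


def encode_substitution(wild_type, mutant):
    """-1 at the wild-type residue, +1 at the mutant residue, 0 elsewhere."""
    vec = [0] * len(AMINO_ACIDS)
    vec[AMINO_ACIDS.index(wild_type)] = -1
    vec[AMINO_ACIDS.index(mutant)] = 1
    return vec


def property_differences(wild_type, mutant, scales):
    """One feature per physicochemical scale: mutant value minus wild-type value.

    scales : list of dicts mapping one-letter amino acid codes to a value
    """
    return [scale[mutant] - scale[wild_type] for scale in scales]


# Tyrosine-to-phenylalanine substitution with a single toy scale.
toy_scale = {"Y": 1.0, "F": 3.0}
features = encode_substitution("Y", "F") + property_differences("Y", "F", [toy_scale])
```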

Evaluation of in silico predictions

The performance of each predictor was first evaluated through a per-chain 10-fold cross-validation. In each iteration of cross-validation, 10% of protein chains were selected for the test set, whereas the remaining 90% were used for training. This ensures that all data points from the same protein sequence belong to either the training or the test set and thus reduces the chance of overestimating the accuracy of the models. We estimated the area under the ROC curve (AUC), where the ROC curve plots the true positive rate as a function of the false positive rate, and the Matthews correlation coefficient (MCC). It is impractical to validate, in vitro or in vivo, the functional effects of each amino acid substitution in our data sets. Therefore, we used independent mutagenesis experimental data to additionally evaluate the performance of our functional site predictors and also evaluate predictions of the loss of functional residues. More specifically, we downloaded all human mutagenesis experimental data from UniProt as of September 2014. This data set comprised 14,933 substitutions from 3,044 distinct proteins. We removed all entries associated with more than one substitution. The resulting 11,425 sites were mapped to high-quality PDB structures using the same steps described in the next section. The final data set comprised 3,356 amino acid substitutions from 2,809 different sites in 880 proteins. For each site in this data set, we extracted functional annotations related to metal binding, PTMs, active sites, macromolecular binding, ligand binding and allosteric activity. Then, for each functional site predictor, we built an independent test set such that (i) each site belonged to a chain that was less than 40% identical to any chain in the training data, and (ii) there were at least five positive sites in each test set. The resulting set was used to assess the performance of the functional predictors independently of the cross-validation.
Similarly, we created a test data set to evaluate loss-of-function predictions as follows: we searched the description of the mutagenesis experiment such that there was an experimentally observed disruption of a functional site. The resulting non-redundant and filtered data sets were then used to estimate the AUC and MCC of all loss-of-function predictors. We attempted to carry out the same steps for gain-of-function events but there were insufficient data.
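The per-chain cross-validation split described in this section can be sketched as below. This is a minimal illustration that assigns whole chains to folds at random. The function name, the seed and the round-robin assignment are assumptions, and the original study may have balanced folds differently.

```python
import random


def per_chain_folds(chain_ids, k=10, seed=0):
    """Split example indices into k folds so that all examples from one
    protein chain fall into the same fold.

    chain_ids : list with the chain identifier of each example
    Returns a list of (train_indices, test_indices) pairs.
    """
    chains = sorted(set(chain_ids))
    random.Random(seed).shuffle(chains)
    fold_of = {chain: i % k for i, chain in enumerate(chains)}
    folds = []
    for fold in range(k):
        test = [i for i, c in enumerate(chain_ids) if fold_of[c] == fold]
        train = [i for i, c in enumerate(chain_ids) if fold_of[c] != fold]
        folds.append((train, test))
    return folds


# Toy data: 20 chains with 3 residues each.
chain_ids = ["chain%d" % (i // 3) for i in range(60)]
folds = per_chain_folds(chain_ids)
```

Splitting by chain instead of by residue keeps neighboring residues of one protein, which share most of their structural environment, from appearing on both sides of the split.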

Disease-associated mutation data

Missense variants causing inherited disease were obtained from the Human Gene Mutation Database (HGMD) as of June 2013. A set of unlabeled inherited variants was downloaded from dbSNP v.137. All amino acid substitutions were then mapped to protein structures in PDB as follows: (i) a database of amino acid sequences from X-ray crystallographic protein structures with more than 50 amino acids and resolution less than 2.5Å was created, and (ii) for each variant, a 51 residue long sequence centered around the wild-type amino acid at the variant position was aligned using BLAST [51] against the atom sequences in PDB. All alignments without an exact match; i.e., with gaps or sequence identity lower than 100%, were excluded from this study and in the case of multiple exact matches, the structure with the best resolution was selected. This resulted in 10,629 (out of 52,406) disease-causing amino acid substitutions and 8,417 (out of 282,625) unlabeled amino acid substitutions being successfully mapped to high-quality PDB structures. Table 1 summarizes both data sets. There exists an overlap between the HGMD (disease) and dbSNP (unlabeled) variants; the subset of dbSNP variants after the removal of HGMD variants will be referred to as putatively neutral variants (Table 1).
Table 1

Summary of amino acid substitution (AAS) data sets.

| Data set name | n_AAS | n_s | n_PDB | n_c |
|---|---|---|---|---|
| Inherited disease | 52406 | 10629 | 1177 | 1387 |
| Unlabeled variants | 282625 | 8417 | 3121 | 3585 |
| Putatively neutral | 282625 | 8049 | 3047 | 3500 |

For each data set, we show the total number of amino acid substitutions (n_AAS), the number of substitutions mapped to PDB (n_s), the number of PDB entries (n_PDB) and the number of protein chains (n_c).

Functional site data sets

Metal ions annotated in X-ray structures from PDB as of May 2012 were selected using the HETATM field [52]. A metal-binding residue was defined as the residue that has at least one heavy atom (N, O or S) within 3Å of the metal ion. In order to build an unbiased classifier, we removed chains with: (i) more than 40% sequence identity with any other chain in the data set, (ii) crystallographic resolution greater than or equal to 2.5Å, and (iii) R-values greater than or equal to 30%. We only considered data sets with more than 100 metal-binding residues and we refer to these residues as positives. N-linked glycosylation sites were parsed from PDB and filtered in the same way as the metal-binding sites. In the case of phosphorylation sites, we used the data set assembled previously [20]. Catalytic residues were collected from the Catalytic Site Atlas v2.2.10 [53] and only literature-supported sites were kept as positive examples. As with the previous data sets, we filtered out chains with more than 40% sequence identity. DNA-binding sites were collected from Yan et al. [54], RNA-binding sites were downloaded from ccPDB [55] and protein-protein interaction (PPI) sites were obtained from Chung et al. [56]. Each data set was further filtered to remove redundancy. Protein-protein interaction hot spot residues were obtained from Lise et al. [57]. This data set was created from ASEdb [58] and only sites that mapped to PDB were used. Chains with more than 35% sequence identity were filtered out in the original work. In our analysis, we used ΔΔG greater than 1kcal/mol as the cutoff for hot spots. Ligand binding data sets were collected from ccPDB as of November 2014 and allosteric site data were downloaded from ASD v2.0 [59]. For each data set, we removed chains with resolution greater than or equal to 2.5Å. Protein stability data was collected from Capriotti et al. [41]. 
We used the S1615 data set which consists of 1615 single site mutations extracted from 42 different proteins in the ProTherm database [60]. The attributes for each data point included solvent accessibility, pH value, temperature and energy change ΔΔG. As previously noted, positive data points comprised mutations with |ΔΔG| > 0.5 kcal/mol. We further filtered out 112 redundant data points. Table 2 summarizes the protein stability and functional site data sets used in this study. In all situations, the unlabeled data set was constructed using a random sample of 10,000 residues selected from the 40% non-redundant set of monomers in PDB. This set was modified in the case of post-translational modifications to include only modifiable residues; e.g., Asn for N-linked glycosylation and Ser/Thr/Tyr for phosphorylation. It is important to emphasize that the set of negative examples was allowed to contain both buried and surface-exposed residues, resulting in somewhat easier downstream classification problems. On the other hand, it allowed us to apply our methods to all PDB-mappable amino acid substitutions and make unbiased inferences related to different data sets.
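The distance-based definition of metal-binding residues given in this section (at least one N, O or S atom within 3Å of the metal ion) can be illustrated with a short sketch. It assumes that protein atoms and the metal coordinates have already been extracted from ATOM and HETATM records. The tuple layout and the function name are hypothetical.

```python
from math import dist

HEAVY_ELEMENTS = {"N", "O", "S"}


def metal_binding_residues(atoms, metal_xyz, cutoff=3.0):
    """Residues with at least one N, O or S atom within `cutoff` angstroms
    of the metal ion.

    atoms     : iterable of (residue_id, element, (x, y, z))
    metal_xyz : (x, y, z) coordinates of the metal ion
    """
    return sorted({
        residue
        for residue, element, xyz in atoms
        if element in HEAVY_ELEMENTS and dist(xyz, metal_xyz) <= cutoff
    })


# Toy coordinates around a metal ion placed at the origin.
toy_atoms = [
    ("HIS94", "N", (2.0, 0.0, 0.0)),
    ("HIS94", "C", (1.0, 0.0, 0.0)),
    ("GLY10", "C", (1.5, 0.0, 0.0)),
    ("CYS5", "S", (0.0, 2.5, 0.0)),
    ("ASP7", "O", (4.0, 0.0, 0.0)),
]
binding = metal_binding_residues(toy_atoms, (0.0, 0.0, 0.0))
```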
Table 2

Performance assessment of structural and functional residue predictors using cross-validation on positive-unlabeled data sets.

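| Category | Site type | n_c | n+ | AUC | sn | sp | MCC |
|---|---|---|---|---|---|---|---|
| Protein stability | Stability (S) | 40 | 1041 | 0.735 | 0.118 | 0.989 | 0.223 |
| Metal binding | Calcium (Ca) | 1092 | 4860 | 0.895 | 0.448 | 0.986 | 0.561 |
| | Cadmium (Cd) | 199 | 1003 | 0.905 | 0.233 | 0.986 | 0.350 |
| | Cobalt (Co) | 156 | 532 | 0.934 | 0.596 | 0.986 | 0.623 |
| | Copper (Cu) | 105 | 440 | 0.951 | 0.780 | 0.985 | 0.727 |
| | Iron (Fe) | 187 | 785 | 0.976 | 0.875 | 0.986 | 0.843 |
| | Potassium (K) | 287 | 978 | 0.679 | 0.129 | 0.986 | 0.211 |
| | Magnesium (Mg) | 1348 | 3282 | 0.859 | 0.435 | 0.986 | 0.561 |
| | Manganese (Mn) | 366 | 1344 | 0.945 | 0.665 | 0.985 | 0.727 |
| | Sodium (Na) | 961 | 2753 | 0.671 | 0.105 | 0.987 | 0.211 |
| | Nickel (Ni) | 254 | 680 | 0.932 | 0.565 | 0.986 | 0.621 |
| | Zinc (Zn) | 1307 | 5778 | 0.966 | 0.623 | 0.987 | 0.691 |
| PTMs | N-glycosylation (Nglyco) | 339 | 736 | 0.785 | 0.120 | 0.986 | 0.183 |
| | Phosphorylation (Phos) | 655 | 1157 | 0.810 | 0.375 | 0.987 | 0.504 |
| Catalytic activity | Catalytic (Cat) | 721 | 2224 | 0.934 | 0.433 | 0.985 | 0.561 |
| Macromolecular binding | DNA-binding (DNA) | 139 | 3791 | 0.815 | 0.193 | 0.987 | 0.332 |
| | RNA-binding (RNA) | 83 | 3436 | 0.783 | 0.187 | 0.985 | 0.319 |
| | Protein-protein interaction (PPI) | 112 | 4350 | 0.807 | 0.091 | 0.987 | 0.191 |
| | PPI hot spots (Hotspot) | 35 | 165 | 0.803 | 0.309 | 0.986 | 0.278 |
| Ligand binding | ADP | 162 | 2589 | 0.842 | 0.335 | 0.985 | 0.475 |
| | ATP | 104 | 1733 | 0.813 | 0.242 | 0.986 | 0.382 |
| | FAD | 80 | 2248 | 0.840 | 0.307 | 0.985 | 0.448 |
| | FMN | 42 | 788 | 0.824 | 0.284 | 0.985 | 0.384 |
| | GDP | 45 | 593 | 0.843 | 0.433 | 0.985 | 0.502 |
| | GTP | 22 | 366 | 0.716 | 0.145 | 0.986 | 0.181 |
| | HEM | 83 | 2246 | 0.847 | 0.220 | 0.986 | 0.361 |
| | NAD | 73 | 1663 | 0.831 | 0.259 | 0.985 | 0.393 |
| | PLP | 34 | 477 | 0.916 | 0.505 | 0.986 | 0.543 |
| | UDP | 27 | 398 | 0.684 | 0.080 | 0.987 | 0.103 |
| Allosteric regulation | Allosteric (Allo) | 108 | 682 | 0.636 | 0.041 | 0.985 | 0.050 |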

For each data set, we show the number of protein chains (n_c) and the number of positive examples (n+). Additionally, we choose a score threshold corresponding to a specificity (sp) of 99% and report sensitivity (sn) and MCC at this threshold, as well as AUC. In each classification problem, the number of unlabeled examples was set to 10,000. S1 Table lists the full name of each ligand code used. For the purposes of this work, structurally important amino acid residues such as specific metal ion binding residues were considered a part of the portfolio of available residue functions.
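The thresholding used for Table 2 can be sketched as follows. The sketch assumes that unlabeled examples are treated as negatives when specificity is computed, which is a simplification of the positive-unlabeled setting. The function name and the small tolerance in the rank computation are choices made here for illustration.

```python
from math import ceil, sqrt


def metrics_at_specificity(pos_scores, unl_scores, specificity=0.99):
    """Pick the score threshold that keeps at least `specificity` of the
    unlabeled scores at or below it, then report sensitivity, specificity
    and MCC for the rule `score > threshold`.
    """
    ranked = sorted(unl_scores)
    k = max(1, ceil(specificity * len(ranked) - 1e-9))  # tolerance guards float error
    threshold = ranked[k - 1]
    tp = sum(s > threshold for s in pos_scores)
    fn = len(pos_scores) - tp
    fp = sum(s > threshold for s in unl_scores)
    tn = len(unl_scores) - fp
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return threshold, sn, sp, mcc


# Toy scores: 100 evenly spaced unlabeled scores and 4 positive scores.
unlabeled = [i / 100 for i in range(100)]
positives = [0.995, 0.999, 0.5, 0.985]
threshold, sn, sp, mcc = metrics_at_specificity(positives, unlabeled)
```

Because the unlabeled set contains some true positives, specificity measured this way is a lower bound on the specificity against true negatives.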

Results

In this section we present the development of a stability model and a series of structure-based functional site predictors in order to examine the molecular effects of genetic variants. We evaluate the predictors through cross-validation and using an independent data set. We then summarize our results in relation to the functional impact of disease-causing substitutions and compare them to putatively neutral variants.

Assessment of functional site predictors

All classifiers developed in this study were constructed using the positive and unlabeled data summarized in Table 2. Their performance was estimated via per-chain 10-fold cross-validation and is also shown in Table 2. S2 Table further lists the parameters of the best-performing kernel matrix obtained from a grid search over the kernel function, |Σ| ∈ {20, 40}, m ∈ {0, 1} and N ∈ {4, 5}. Each predictor's performance was assessed by means of the area under the ROC curve (AUC), the sensitivity (sn) at the 99% level of specificity (sp), and the Matthews correlation coefficient (MCC). The majority of predictors (26 out of 30) show good performance (AUC ≥ 0.70); however, predictors of functions involving smaller interfaces, such as metal-ion binding and active sites, perform better than the other functional predictors. This result is not unexpected because predictors of macromolecular binding would have benefited from incorporating higher-order structural signatures such as clefts and pockets [20].

We also use an independent data set to evaluate a subset of functional site predictors, as reported in S3 Table. Interestingly, most predictors, except for the macromolecular binding models, show similar or improved AUC values compared to those obtained from cross-validation (Table 2). Overall, despite the variability in performance, the limited number of independent data sets and the relatively small size of the validation data, these results provide evidence that the functional site predictions are of sufficient quality to identify possible molecular alterations resulting from specific missense mutations. A literature survey suggests that our predictors perform well when compared to established structure-based methods. Extensive comparisons with other work are difficult and were beyond the scope of this study, as our main goal was to probabilistically assess molecular mechanisms of disease. A set of predictors built using the same methodology was best suited to this task.
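
The evaluation measures named above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the scores are toy values, the function names are hypothetical, and the unlabeled examples are treated as the reference (negative) class, as is usual when evaluating on positive-unlabeled data. A residue is called positive when its score is strictly above the threshold.

```python
import math

def auc(pos_scores, ref_scores):
    """AUC as the probability that a positive outscores a reference example (ties count 1/2)."""
    wins = 0.0
    for p in pos_scores:
        for r in ref_scores:
            if p > r:
                wins += 1.0
            elif p == r:
                wins += 0.5
    return wins / (len(pos_scores) * len(ref_scores))

def threshold_at_specificity(ref_scores, sp=0.99):
    """Smallest reference score t such that at least a fraction sp of reference scores are <= t."""
    ordered = sorted(ref_scores)
    k = math.ceil(sp * len(ordered) - 1e-9)  # small tolerance guards against float round-off
    return ordered[k - 1]

def sn_and_mcc(pos_scores, ref_scores, threshold):
    """Sensitivity and Matthews correlation coefficient when 'score > threshold' means positive."""
    tp = sum(s > threshold for s in pos_scores)
    fn = len(pos_scores) - tp
    fp = sum(s > threshold for s in ref_scores)
    tn = len(ref_scores) - fp
    sn = tp / (tp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return sn, mcc

# Toy scores (hypothetical), with a 90% specificity level so the example stays small.
positives = [0.9, 0.8, 0.5, 0.3]
unlabeled = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.7]
t = threshold_at_specificity(unlabeled, sp=0.9)
sn, mcc = sn_and_mcc(positives, unlabeled, t)
print(auc(positives, unlabeled), t, sn, mcc)
```

Because some unlabeled examples are in fact positive, measures computed this way tend to understate the performance that would be obtained against true negatives.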

Estimating prior and posterior probabilities

To use the formal framework laid out in the Methods section, it is important that all methods approximate posterior distributions. Using positive and unlabeled data, we have approached this problem in two steps: (i) by developing classifiers that discriminate between labeled and unlabeled data, and (ii) by estimating the class priors of the positive class in the unlabeled data [27]. Estimated class priors are a particularly useful by-product of learning posterior distributions. For the stability predictor, we estimate, using the AlphaMax algorithm [27], that up to 13% of unlabeled variants significantly impact stability. When the stability model was applied to disease variants only, we estimate 14% of these variants to be impactful using the empirical mean formula. It should be noted that when the known disease variants were removed from the unlabeled data set, only 7% of the remaining variants were estimated to severely impact stability. In the case of functional predictors, we applied the AlphaMax algorithm using a set of positive variants and a set of 10,000 variants randomly sampled from a set of non-redundant monomers in PDB (S4 Table). For example, we estimate that up to 3% of PDB residues are catalytic, whereas about 5% of disease-causing and 2% of putatively neutral variants were estimated to occur at catalytic residues. Overall, we generally observe a larger fraction of function-impacting variants in the disease-causing data set than in the putatively neutral data set.
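
As a minimal sketch of the second kind of estimate mentioned above, the snippet below assumes that the "empirical mean formula" is the average of the predicted posterior probabilities over a data set; this section does not spell the formula out, so that reading is an assumption. The function name and the posterior values are hypothetical, and AlphaMax itself is not reproduced here.

```python
def empirical_mean_prior(posteriors):
    """Estimate the fraction of positives as the mean predicted posterior (assumed formula)."""
    if not posteriors:
        raise ValueError("need at least one posterior probability")
    return sum(posteriors) / len(posteriors)

# Hypothetical posteriors P(function-impacting | variant) for five variants.
toy_posteriors = [0.9, 0.1, 0.05, 0.15, 0.2]
prior_estimate = empirical_mean_prior(toy_posteriors)
print(round(prior_estimate, 2))
```

Such an estimate is only as good as the calibration of the posteriors it averages, which is why the text stresses that all methods should approximate posterior distributions.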

Applying loss and gain functional site predictors to human variants

We applied the structure-based predictors to both the wild-type and mutant structural environments as a means to identify and categorize the functional effects of amino acid substitutions causing inherited disease. The distribution of scores on the putatively neutral variants was used as an empirical null distribution. We then used a particular false positive rate (FPR) value to determine a prediction threshold at which to assess the fraction of disease mutations with loss or gain scores that are as high as or higher than the threshold. Table 3 summarizes the relative contributions of disease mutations that either decrease (loss) or increase (gain) the propensity of functional sites at a conservative threshold of 1% FPR for six different prediction outputs. Fig 2 visualizes a subset of these results for the case when stability is not impacted. Together with Table 3, it provides evidence that the loss and gain of functional sites exist even when protein stability is not disrupted; e.g., in the case of loss of function, see columns P(loss|S̄,x) and P(loss,S̄|x), which have roughly the same values as P(loss|x). When either a loss or a gain of function event is found to be statistically significant, we consider the mutation of this type of functional residue to be an active mechanism of genetic disease. For instance, at 1% FPR, we observe that the loss of catalytic residues (Cat; 3.34%; p-value = 1.93 × 10^−28) and of iron-binding residues (Fe; 3.17%; p-value = 2.06 × 10^−25) are among the most significantly affected molecular mechanisms.
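
The thresholding and enrichment test described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the scores are toy values, the function names are hypothetical, and the one-tailed Fisher's exact test is written out as a hypergeometric tail sum so that the example needs only the standard library.

```python
import math

def fpr_threshold(null_scores, fpr=0.01):
    """Lowest observed score t such that at most a fraction fpr of null scores are >= t."""
    ordered = sorted(null_scores, reverse=True)
    k = math.floor(fpr * len(ordered) + 1e-9)  # number of null scores allowed at or above t
    # Step back over ties so that no more than k null scores reach the threshold.
    while 0 < k < len(ordered) and ordered[k] == ordered[k - 1]:
        k -= 1
    return ordered[k - 1] if k > 0 else math.inf

def fisher_one_tailed(a, b, c, d):
    """P(X >= a) for the 2x2 table [[a, b], [c, d]] under the hypergeometric null."""
    total = a + b + c + d
    row1 = a + b   # disease variants
    col1 = a + c   # variants at or above the threshold
    upper = min(row1, col1)
    tail = sum(math.comb(col1, i) * math.comb(total - col1, row1 - i)
               for i in range(a, upper + 1))
    return tail / math.comb(total, row1)

# Toy loss scores (hypothetical): 100 neutral variants and 4 disease variants.
neutral = [i / 100 for i in range(100)]
disease = [0.995, 0.99, 0.5, 0.2]

t = fpr_threshold(neutral, fpr=0.01)
hits_disease = sum(s >= t for s in disease)
hits_neutral = sum(s >= t for s in neutral)
p = fisher_one_tailed(hits_disease, len(disease) - hits_disease,
                      hits_neutral, len(neutral) - hits_neutral)
print(t, hits_disease / len(disease), p)
```

By construction about 1% of the neutral variants reach the threshold, so any excess in the disease set is what the test evaluates.
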
Table 3

Percentage of disease variants with prediction scores at the 1% false positive rate threshold in putatively neutral variants.

             Single type loss events (%)                Single type gain events (%)
Data set     P(loss|S̄,x)  P(loss,S̄|x)  P(loss|x)     P(gain|S̄,x)  P(gain,S̄|x)  P(gain|x)
Ca           2.65*        2.35*        2.69*         3.90*        3.63*        3.63*
Cd           1.42         1.31         1.44          1.20         1.02         1.02
Co           1.97*        1.94*        1.98*         0.80         0.70         0.70
Cu           1.87*        1.82*        1.88*         1.29         1.15         1.15
Fe           3.14*        2.97*        3.17*         1.77*        1.62*        1.62*
K            2.45*        2.08*        2.45*         2.99*        2.60*        2.60*
Mg           3.22*        2.83*        3.14*         3.09*        2.89*        2.89*
Mn           2.42*        2.22*        2.45*         2.65*        2.47*        2.47*
Na           3.10*        2.55*        3.13*         2.50*        3.05*        3.05*
Ni           1.39         1.33         1.40          0.62         0.54         0.54
Zn           2.86*        2.64*        2.89*         1.81*        1.63*        1.63*
Nglyco       3.75         1.25         2.81          0.46         0.23         0.23
Phos         1.69         1.47         1.69          0.58         0.46         0.46
Cat          3.18*        2.90*        3.34*         4.45*        3.94*        3.94*
DNA          1.58*        1.39         1.61*         1.85*        1.75*        1.75*
RNA          0.96         0.89         0.96          1.25         1.05         1.05
PPI          1.53*        1.27         1.65*         1.94*        1.79*        1.79*
Hotspot      1.00         0.90         1.00          1.53*        1.45         1.45
ADP          3.12*        2.92*        3.16*         4.03*        3.68*        3.68*
ATP          2.73*        2.52*        2.76*         2.76*        2.41*        2.41*
FAD          2.77*        2.58*        2.81*         3.21*        2.92*        2.92*
FMN          2.15*        2.01*        2.17*         2.40*        2.18         2.18*
GDP          2.07*        1.99*        2.08*         2.41*        2.18*        2.18*
GTP          1.72*        1.48         1.72*         2.78*        2.15*        2.15*
HEM          2.26*        2.05*        2.35*         2.39*        2.05*        2.05*
NAD          3.00*        2.64*        3.09*         2.53*        2.15*        2.15*
PLP          3.15*        2.98*        3.10*         2.61*        2.37*        2.37*
UDP          2.70*        2.41*        2.71*         3.11*        2.46*        2.46*
Allo         1.47         1.24         1.49          1.96*        1.74*        1.74*

             Multi-type loss events (%)                 Multi-type gain events (%)
Model        P(loss|S̄,x)  P(loss,S̄|x)  P(loss|x)     P(gain|S̄,x)  P(gain,S̄|x)  P(gain|x)
Independence 3.71*        3.31*        3.89*         4.13*        3.37*        3.37*
Max          3.50*        1.93*        3.56*         4.35*        2.63*        2.63*

For each of the six prediction outputs and each function f, we show the percentage (%) of disease mutations that have a greater probability of loss and gain of function than a threshold corresponding to a 1% false positive rate (FPR). S1 and S2 Figs show an instance of the inverse cumulative distribution function of P(loss|x) and P(gain|x), respectively. These thresholds were estimated from the empirical null distributions of the probability of loss or gain of function on the set of dbSNP neutral data.

*Indicates a significant p-value from a one-tailed Fisher's exact test after Bonferroni correction for multiple comparisons. The p-value was separately estimated for each type of posterior distribution, jointly for loss and gain events (p < 8.62 × 10^−4). The p-values for the combined models were corrected separately.

Fig 2

Percentage of disease variants with prediction scores at the 1% false positive rate threshold in putatively neutral variants.

For each function f, the bars indicate the percentage (%) of disease mutations whose probability of loss or gain of function, in the absence of stability disruption, exceeds a conservative threshold corresponding to a 1% false positive rate (FPR). These thresholds are estimated from the null distributions of the loss and gain probabilities on the set of dbSNP neutral data, respectively. *Indicates a significant p-value from a one-tailed Fisher's exact test after Bonferroni correction for multiple hypothesis testing (p < 8.62 × 10^−4). The red line indicates the percentage of neutral variants exceeding the same thresholds, which is exactly 1%.

Table 3 also summarizes the statistical enrichment of impact on at least one functional site from the entire repertoire of functions using the independence and max models (see Methods). Here we observe a strong enrichment in all categories of loss of function, with or without impact on stability, for both the independence and max models. Additionally, we also see an enrichment in the gain-of-function events. These results provide statistical support for many individual studies that identify loss of function as a signature of human inherited disease.
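
One common way to combine per-function probabilities into a single "at least one function altered" score is sketched below. The formulas are assumptions, since the Methods section that defines the independence and max models is not part of this section: the independence form treats the per-function events as independent, and the max form takes the largest per-function probability. The function names and input values are hypothetical.

```python
def independence_model(probs):
    """P(at least one function altered), assuming per-function events are independent."""
    none_altered = 1.0
    for p in probs:
        none_altered *= (1.0 - p)
    return 1.0 - none_altered

def max_model(probs):
    """Score a variant by its single most probable functional alteration."""
    return max(probs)

# Hypothetical per-function loss probabilities for one variant.
toy_probs = [0.5, 0.2, 0.1]
print(independence_model(toy_probs), max_model(toy_probs))
```

Under these assumed forms, the independence score is never smaller than the max score for the same variant, because several moderate probabilities can add up to a high combined one.
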
Overall, our results suggest that, with some exceptions, the loss of functional residues is enriched and common in human inherited disease; similarly, the gain of functional residues is observed to be an active mechanism for catalytic activity, most types of ligand-binding residues, and the majority of metal-binding residues. In contrast to previous studies, our results suggest that the loss and gain of PTM sites do not show statistically significant enrichment in disease (although we observe enrichment for the loss); however, we note that this may be due to a considerable reduction of training data imposed by the availability of protein 3D structures, especially given the relationship between post-translational modifications and intrinsically disordered proteins [61-64].

Table 4 shows the proportions of disease and putatively neutral variants across functional categories for which molecular mechanisms can be computationally hypothesized. In the first part of the table, we compute the fraction of variants for which exactly one of the member predictors reports a score as high as or higher than the FPR-determined threshold. These fractions were computed separately for disease and neutral variants. For convenience, when a predictor outputs a value as high as or higher than the value determined by a 1% FPR, we refer to this prediction as an actionable hypothesis of loss or gain of function. When the FPR-based threshold is adjusted using the Bonferroni correction, we refer to these predictions as confident. For example, at a conservative p-value cutoff of p < 8.62 × 10^−4, we find that 1.51% of disease mutations are likely to alter exactly one metal binding site and 1.43% may alter a single ligand binding site. For all groups of molecular mechanisms, a high alteration score is more than three times as likely among disease variants as among putatively neutral variants.
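
The conservative cutoff quoted above can be reproduced with simple arithmetic, under two assumptions that are not stated in this section: a family-wise significance level of 0.05, and 58 tests (the 29 function types listed in Table 3, each tested for loss and for gain).

```python
# Assumed inputs (not stated explicitly in the text).
alpha = 0.05          # family-wise significance level
n_functions = 29      # function types listed in Table 3
n_event_types = 2     # loss and gain

bonferroni_cutoff = alpha / (n_functions * n_event_types)
print(f"{bonferroni_cutoff:.2e}")  # 8.62e-04
```

That the result matches the reported value of 8.62 × 10^−4 supports this reading, but it remains an inference from the numbers.
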
Table 4

Relative contribution of loss and gain of functional categories from amino acid substitutions.

                          Loss (%)            Gain (%)            Loss or Gain (%)
Category                  Disease  Neutral    Disease  Neutral    Disease  Neutral

I. Single mechanism
Confident biological hypotheses (p-value < 8.62 × 10^−4)
  Metal binding           1.51     0.27       1.46     0.46       2.88     0.70
  PTMs                    0        0          0.01     0          0.01     0
  Catalytic sites         0.73     0.06       0.40     0.07       1.14     0.14
  Macromolecule binding   0.59     0.20       0.64     0.20       1.21     0.40
  Ligand binding          1.43     0.47       1.86     0.46       3.07     0.87
  Allosteric sites        0.09     0.06       0.25     0.06       0.35     0.12
  All                     3.03     0.84       3.29     0.97       5.80     1.70
Actionable biological hypotheses (p-value < 0.01)
  Metal binding           4.90     2.50       5.08     2.99       8.70     5.03
  PTMs                    0.30     0.17       0.08     0.22       0.39     0.40
  Catalytic sites         3.34     1.00       3.94     1.00       7.25     2.00
  Macromolecule binding   3.89     2.93       4.03     3.26       7.39     5.75
  Ligand binding          9.74     4.65       9.55     4.97       13.72    7.06
  Allosteric sites        1.49     1.00       1.74     1.00       3.16     2.00
  All                     13.28    8.35       12.93    9.21       17.43    13.36

II. Multiple mechanisms
Confident biological hypotheses (p-value < 8.62 × 10^−4)
  Metal binding           0.72     0.16       0.49     0.10       1.23     0.27
  Macromolecule binding   0.05     0.04       0.17     0.04       0.22     0.07
  Ligand binding          0.24     0.07       0.33     0.09       0.68     0.19
  All                     1.43     0.34       1.29     0.31       2.78     0.68
Actionable biological hypotheses (p-value < 0.01)
  Metal binding           5.96     2.31       4.75     2.42       10.37    4.62
  Macromolecule binding   0.66     0.51       0.99     0.37       1.72     0.89
  Ligand binding          5.55     1.90       5.43     1.85       11.04    4.11
  All                     12.88    5.42       12.22    5.42       23.57    10.60

For each functional site category, we show the relative contributions (%) of disease and neutral substitutions where at least one function f within a category has a greater P(loss|x) or P(gain|x) than a conservative threshold at 1% FPR. This threshold is estimated from the null distributions of P(loss|x) and P(gain|x) on the putatively neutral polymorphisms data set, respectively. The table is subdivided into two parts: (i) exactly one function (or mechanism) and (ii) two or more mechanisms. In both parts, the relative contributions are assessed at two p-value cutoffs, p < 8.62 × 10^−4 and p < 0.01. Note that in a small number of cases, a loss of one function might result in the gain of another; thus, the sets of residues counted in the loss and gain may overlap.

Table 4 also shows situations with two or more functional perturbations consequent to the replacement of a given amino acid residue. The amino acid substitutions disrupting multiple functions may be important in a therapeutic context because addressing a single deficiency (e.g. iron binding) may still not result in a fully corrected phenotype because other deficiencies may still remain (e.g. ligand binding). Here, we have a significantly increased likelihood of observing multi-functional alterations in the disease set compared to the putatively neutral set; i.e., the disease set is several times more likely to contain multi-functional alterations than the putatively neutral set. For instance, 1.23% of disease mutations are likely to affect at least two metal binding sites versus only 0.27% of neutral variants, whereas 0.68% of disease variants may affect more than one ligand binding site as opposed to 0.19% of neutral polymorphisms.
If we combine the results for single and multiple mechanisms, we observe that 2.24% of disease variants are predicted, with high confidence, to impair metal-binding sites (1.51% loss of single site and 0.72% loss of multiple sites) and 1.67% probably impair ligand binding sites (1.43% loss of single site and 0.24% loss of more than one site), as depicted in Fig 3. Overall, we believe we can confidently propose molecular mechanisms of disease for 8.6% of all variants in the inherited disease data set whereas we only see about 2.4% of such variants in the neutral set. If we use a p-value cutoff of 0.01 without a Bonferroni correction, then we can computationally hypothesize a molecular mechanism for approximately 40.9% of disease variants.
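
The combined figures in this paragraph follow from adding the "All" rows of the two parts of Table 4, which is valid because a variant with exactly one predicted mechanism cannot also have two or more. The short check below uses the "Loss or Gain" values from Table 4; the variable names are illustrative only.

```python
# "All" rows, Loss or Gain (%), confident hypotheses (p < 8.62e-4), from Table 4.
single_confident = {"disease": 5.80, "neutral": 1.70}
multiple_confident = {"disease": 2.78, "neutral": 0.68}

confident_total = {k: round(single_confident[k] + multiple_confident[k], 2)
                   for k in single_confident}

# "All" rows, Loss or Gain (%), actionable hypotheses (p < 0.01), disease variants.
actionable_disease = round(17.43 + 23.57, 2)

# How much more often a confident hypothesis is proposed for disease variants.
enrichment = confident_total["disease"] / confident_total["neutral"]
print(confident_total, actionable_disease, round(enrichment, 2))
```

The sums of the rounded table entries (8.58%, 2.38% and 41.00%) agree with the quoted 8.6%, 2.4% and approximately 40.9% up to rounding.
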
Fig 3

Relative contribution of loss and gain of functional categories on each amino acid substitutions data set.

For each functional site category, we show the relative contributions (%) of disease and neutral variants where at least one function f within a category has a greater P(loss|x) or P(gain|x) than a conservative threshold at 1% FPR. This threshold is estimated from the null distributions of P(loss|x) and P(gain|x) on the putatively neutral polymorphisms data set, respectively. *Indicates a significant p-value from a Fisher's exact test after Bonferroni correction for multiple hypothesis comparisons (p < 8.62 × 10^−4).

Validation of loss of function predictions

In this study, we have proposed a novel methodology for identifying specific molecular alterations of disease mutations. Given that it is impractical to experimentally validate the predicted functional effects of each individual amino acid substitution, we use mutagenesis experimental data to independently assess the loss of functional site predictions, as shown in S5 Table. To the best of our knowledge, this is the first time a systematic assessment of computationally predicted disruptions of specific types of functional residues has been carried out in the published literature. In general, our loss of function predictors performed as expected. However, more interestingly, if one restricts the loss of function predictions to those with significant p-values (i.e. p < 0.01), then performance (AUC) rises to at least 95% for all predictors. This provides compelling evidence that our methodology can be effectively used to identify molecular mechanisms of disease and hence can be used to prioritize experimental validation. Additionally, Fig 4 depicts two case studies of loss and gain of function predictions which have been experimentally validated. We discuss each case in detail below:
Fig 4

3D visualization of protein structures with experimentally supported loss and gain of function predictions.

Left: SOD1 protein (chain A of PDB entry 2xjl) where residues H63, H71, H80 and D83 form a zinc binding pocket. The substitution D83G gives rise to a loss of zinc binding. Right: CA2 protein (chain A of PDB entry 1fqr) where H94, H96 and H119 are zinc-binding sites. Mutation T198E leads to an increase in zinc affinity.

Loss of zinc binding in superoxide dismutase (SOD1)

The functional role of SOD1 is to destroy radicals that are normally produced in cells and which are toxic to biological systems. SOD1 forms a zinc-binding pocket consisting of H63, H71, H80 and D83 [65, 66], as shown in Fig 4 (left). Mutations in SOD1 are known to cause amyotrophic lateral sclerosis [66-68]. However, the molecular mechanisms underlying these mutations often remain unclear. We predicted a loss of multiple functional activities for mutation D83G and identified zinc binding as the primary underlying molecular mechanism of disease. In particular, the predictor outputs for D83G lead to P(loss|x) ≈ 1, which is above the 1% FPR threshold of 0.20, with an empirical p-value of 1.2 × 10^−3. A literature search for experimental evidence reveals that mutation D83G causes the destabilization of the native structure, which leads to protein aggregation with the formation of amyloid-like fibrils and, ultimately, a gain of toxicity [69]. Zinc binding is a known stabilizer of protein structure and, therefore, the loss of the zinc-binding residue D83 appears to be a plausible destabilizing mechanism that ultimately impacts the biological function of SOD1. We note that the quadruple (H63, H71, H80, D83) was not part of the training data for the zinc-binding predictor. This example raises the interesting possibility that the loss of a functionally important residue (here, a zinc-binding residue) results in a loss of stability and ultimately leads to disease through the loss of the protein's function. In other words, protein structure and function appear to be intimately and bidirectionally interconnected. At this moment, however, this is only a theoretical possibility because of the lack of data about the structure and stability of the wild-type and mutant proteins in the absence of zinc ions.

Gain of zinc affinity in carbonic anhydrase 2 (CA2)

CA2 is essential for bone resorption and osteoclast differentiation. CA2 has three zinc-binding residues, H94, H96 and H119, as shown in Fig 4 (right). Multiple studies have characterized the effects of variants in CA2 via mutagenesis experiments [70-74]. Among these mutations, we predicted a gain of zinc binding for T198E, which was experimentally shown to increase zinc affinity. Specifically, the predictor outputs for T198E lead to P(gain|x) ≈ 1, which is above the 1% FPR threshold of 0.35, with a p-value of 3.7 × 10^−4. The triple (H94, H96, H119) was not part of the training data for the zinc-binding residue predictor.

Discussion

This study builds on the extensive prior work in structural bioinformatics to provide statistical evidence of the important role that alterations of multiple types of functional residue play in human genetic disease. Most of the existing work has centered around understanding the impact of sequence variants on protein stability or has only considered single types of function such as catalytic residues or protein-interaction sites [7, 20, 23, 24, 30, 75–78]. This work extends these studies by integrating the stability models with a series of functional residue predictors involving metal binding, macromolecular binding, ligand binding and others. Overall, we show and validate the feasibility of computationally predicting mutations that impair specific function using protein 3D structure data. Despite using sophisticated methodology to model loss and gain of functional residues, the nature of this research has limitations involving both data sets and methodology. First, despite major efforts employed by authors and database curators when annotating amino acid substitutions as being causative of a particular disease, it is possible that some amino acid substitutions have been misannotated as disease-causing by the original authors reporting them. Similarly, mutagenesis experimental data are known to be biased toward certain amino acid residues. For example, alanine mutations comprised about 50% of the independent amino acid substitutions data set (due to the frequent use of alanine-scanning mutagenesis). There are also limitations and biases in relation to the protein structures available in PDB as well as in selecting an appropriate set of unlabeled variants. Second, there exist both theoretical and practical limitations in the semi-supervised framework used in this work. The accuracy of our methods is predicated upon the assumption that the computational models are capable of accurately estimating the posterior probability of the class labels. 
This, however, cannot be guaranteed, so our results should be interpreted with caution. Furthermore, there are identifiability issues in estimating class priors in the positive-unlabeled framework; i.e., the estimates for the class priors do not have a unique solution and only an upper bound can be estimated [27]. On the practical side, we have been careful to prevent overfitting. We performed only minor parameter selection steps before the final functional predictors were built. Thus, there is the potential to further improve predictor performance through more extensive work. Possible improvements include the use of additional features, optimizing the distance threshold used to define an edge between two residues when constructing protein contact graphs, and the choice of the capacity parameter in the SVM, among others. Finally, this work was designed to probabilistically reason about molecular mechanisms of disease and not necessarily to develop classifiers that outperform specialized models across the board. If a user needs a tool for a particular prediction task, we recommend that the most accurate predictor for this task be selected. Despite these limitations, we believe this work contributes to an improved understanding of the impact of sequence variants on protein function. We have provided a model that considers functional alteration both when the stability of the protein is disrupted and when it is not (e.g., sequence changes can exert a functional effect in disordered regions, such as a disorder-to-order transition [79]). We believe that our work suggests a new class of approaches to disease studies that might qualify as mechanism-driven and disease-agnostic, where one might be compelled to identify a set of molecular alterations underlying a disease phenotype without necessarily studying a single disease. 
While each molecular alteration is likely to require an individualized approach to drug design and therapy, we envisage that the next generation of researchers might decide to specialize in addressing particular types of functional deficiencies rather than beginning with a particular disease.

S1 Fig. Inverse cumulative distribution function (CDF) of P(loss|x).

For the ATP-binding predictor, we plot the inverse CDF of P(loss|x) on the disease and putatively neutral data sets. (EPS)

S2 Fig. Inverse cumulative distribution function (CDF) of P(gain|x).

For the catalytic residue predictor, we plot the inverse CDF of P(gain|x) on the disease and putatively neutral data sets. (EPS)

S1 Table. Mapping between ligand codes and names.

(PDF)

S2 Table. Selected kernel matrix parameters for each structural and functional site predictor.

For each data set, we show the best-performing kernel matrix parameters obtained through a per-chain 10-fold cross-validation. The normalized edit distance kernel k(u, v) outperformed both alternative kernels on each data set. Note that the edit distance kernel with m = 0 is equivalent to a standard graphlet kernel. (PDF)

S3 Table. Performance assessment of functional residue predictors using an independent data set.

For each prediction method, we show number of proteins (n), number of positive examples (n+), number of unlabeled examples (n), AUC and sensitivity (sn) and MCC at score threshold corresponding to specificity (sp) of 99%. In the case of N-linked glycosylation (Nglyco), we only predict if the wild-type residue is an asparagine, whereas for phosphorylation (Phos), we only make predictions on threonine, tyrosine or serine residues. (PDF)

S4 Table. Class priors for structural and functional predictors.

Fraction of residues in a data set estimated to be stability-impacting or functional. Estimates on the unlabeled data were made using the AlphaMax algorithm [27]; minor manual adjustments were made by observing the log-likelihood plots. *Indicates a confident prior estimate assessed by manually observing log-likelihood plots. Estimates on the disease and putatively neutral data were made using the empirical mean formula. (PDF)

S5 Table. Performance assessment of loss of function predictions using mutagenesis experimental data.

Independent validation of predicted loss of functional site events using a set of mutagenesis experimental data mapped to protein structures in PDB. This mutagenesis data set contains 3,356 AAS from 880 human proteins. For each functional feature, we show the number of experimentally determined losses (n), AUC, sensitivity (sn) and MCC corresponding to a 99% specificity (sp) threshold. Additionally, the last five columns show the number of statistically significant (p < 0.01) loss-of-function predictions, as well as estimates for AUC (AUC*), sensitivity (sn*), specificity (sp*) and MCC (MCC*) on this filtered set. (PDF)
View more
