Thanawadee Preeprem1, Greg Gibson1.
Abstract
We have developed a novel structure-based evaluation for missense variants that explicitly models protein structure and amino acid properties to predict the likelihood that a variant disrupts protein function. A structural disruption score (SDS) is introduced as a measure of the likelihood that a case variant is functional. The score is constructed using characteristics that distinguish between causal and neutral variants within a group of proteins. The SDS is correlated with standard sequence-based deleteriousness, but shows promise for improving discrimination between neutral and causal variants at less conserved sites. The prediction was performed on 3-dimensional structures of 57 gene products whose homozygous SNPs were identified as case-exclusive variants in an exome sequencing study of epilepsy disorders. We contrasted the candidate epilepsy variants with scores for likely benign variants found in the EVS database, and for positive control variants in the same genes that are suspected to promote a range of diseases. To derive a characteristic profile of damaging SNPs, we transformed continuous scores into categorical variables based on the score distribution of each measurement, collected from all possible SNPs in this protein set, where extreme measures were assumed to be deleterious. A second epilepsy dataset was used to replicate the findings. Relative to negative controls, causal variants tend to receive higher sequence-based deleterious scores, induce larger physico-chemical changes between amino acid pairs, be located in protein domains, buried sites, or conserved protein surface clusters, and cause protein destabilization. These measures were aggregated for each variant. A list of nine high-priority putative functional variants for epilepsy was generated. Our newly developed SDS protocol facilitates SNP prioritization for experimental validation.
Keywords: epilepsy disorders; missense mutation; non-synonymous single nucleotide polymorphism; protein structural analysis; structural disruption score; variant prioritization
Year: 2014 PMID: 24795746 PMCID: PMC4001065 DOI: 10.3389/fgene.2014.00082
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. Flow diagram of the analysis pipeline. The analysis employed sequence-based deleterious prediction scores, parameters which reflect the nature of amino acid changes, and 3D structure-based evaluations. Structural analyses were performed by characterizing functionality of mutated protein residues caused by negative and positive SNPs (indicated by green and red stick representations, respectively). All analysis results were collectively used to evaluate enriched features found predominantly in causal SNPs. We then examined these predictors with regard to the case variants. The number of deleterious structure predictions per substitution represents a “structural disruption score” (SDS), and was used to rank candidate epilepsy variants.
Number of variants within each gene set, classified into three classes (cases, negative controls, and positive controls), and numbers of 3D structures used in the analysis.
| Set 1 (71 genes) | 30 (68) | 1674 (5281) | 100 (134) | 20 (24) | 8 (35) | 35 (59) | 20 (86) | 31 (80) | 114 | 57 |
| Set 2 (42 genes) | 184 (373) | 554 (1490) | 105 (205) | 2 (2) | 3 (19) | 5 (17) | 21 (46) | 20 (38) | 51 | 36 |
For each category, the number of variants is given as the number of SNPs located within the set of selected 3D structures (114 structures for gene set 1 and 51 structures for gene set 2), followed in parentheses by the total number of SNPs with and without 3D structures.
Number of structures represents the number of selected 3D structures that passed quality validation scores. The initial number of structures obtained from each data source is much larger, indicated by numbers in parentheses.
Categories for structural indicators, cutoff values for continuous numerical parameters, and number of SNPs with extreme measures.
| Stability change | 0 | 19 (1%) | 0 | |
| 0 | 0 | 0 | ||
| 20 (65%) | 830 (50%) | 66 (66%) | ||
| 0 | 1 (0%) | 0 | ||
| Dynamic sites | 0 | 38 (2%) | 3 (3%) | |
| 1 (3%) | 44 (3%) | 1 (1%) | ||
| Dynamic sites | 0 | 48 (3%) | 2 (2%) | |
| 1 (3%) | 48 (3%) | 3 (3%) | ||
| Flexible sites | 1 (3%) | 34 (2%) | 2 (2%) | |
| 1 (3%) | 52 (3%) | 0 | ||
| Sequence optimality | 1 (3%) | 49 (3%) | 5 (5%) |
All cutoff values were defined exclusively for gene set 1. The counts and percentages of variants with extreme values in each of the three variant classes are included.
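The categorization behind this table (continuous measures turned into categories using cutoffs taken from the score distribution, with extreme values treated as deleterious) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the 97.5th-percentile default mirrors the "≥ 97.5%" thresholds quoted for the dynamic and flexible site indicators, while the linear-interpolation percentile, the function names, and the toy scores are assumptions made for the example.

```python
def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def flag_extreme(reference, query, q=97.5):
    """Flag query values at or above the q-th percentile of the reference."""
    cutoff = percentile(reference, q)
    return cutoff, [v >= cutoff for v in query]


# Hypothetical reference distribution standing in for "all possible SNPs".
reference_scores = [float(i) for i in range(1, 101)]
# Hypothetical scores for three variants of interest.
variant_scores = [12.0, 97.0, 99.5]

cutoff, flags = flag_extreme(reference_scores, variant_scores)
print(cutoff, flags)
```

Defining the cutoff once on a reference set and reusing it elsewhere matches the note that all cutoff values were defined exclusively for gene set 1.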
| Sequence-based deleterious scores | SIFT | <0.0001 | (6.60) | df 1704 | <0.0001 | (4.61) | df 598 |
| PolyPhen2_HDIV | <0.0001 | (8.15) | df 1772 | <0.0001 | (7.38) | df 657 | |
| PolyPhen2_HVAR | <0.0001 | (10.22) | df 1772 | <0.0001 | (8.70) | df 657 | |
| LRT | <0.0001 | (4.19) | df 1734 | 0.0066 | (2.73) | df 654 | |
| MutationTaster | <0.0001 | (9.39) | df 1674 | <0.0001 | (5.57) | df 631 | |
| MutationAssessor | <0.0001 | (15.30) | df 1772 | <0.0001 | (6.06) | df 652 | |
| Sequence conservation scores | GERP | <0.0001 | (5.30) | df 1772 | <0.0001 | (3.92) | df 657 |
| phyloP | <0.0001 | (4.81) | df 1772 | 0.0010 | (3.29) | df 657 | |
| SiPhy | <0.0001 | (5.89) | df 1771 | <0.0001 | (4.95) | df 657 | |
| Structure-based scores | ΔΔG | <0.0001 | (4.81) | df 1772 | <0.0001 | (3.83) | df 657 |
| B-factor | 0.0190 | (−2.35) | df 1756 | 0.2450 | (1.16) | df 657 | |
| RMSF | 0.2304 | (−1.20) | df 1756 | 0.2264 | (1.21) | df 657 | |
| P(Flexible) | 0.1185 | (−1.56) | df 1772 | 0.7594 | (−0.31) | df 657 | |
| Γ | 0.9090 | (−0.11) | df 1772 | 0.2308 | (−1.20) | df 657 | |
| RSA | 0.0035 | (−2.93) | df 1772 | 0.0658 | (−1.84) | df 657 | |
All tests were performed on a subset of SNPs whose protein structures pass quality validations. Significant statistics indicate different means between negative and positive SNPs.
Statistical parameters include the two-tailed p-value, the value of the t-statistic (t ratio), and the degrees of freedom (df). Significance was assessed at α = 0.05. The df equals n − 2, where n is the number of samples used in the analysis.
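As a rough illustration of where df = n − 2 comes from, the sketch below computes a pooled-variance (Student's) two-sample t statistic, for which the degrees of freedom are n1 + n2 − 2. With 1674 neutral and 100 causal SNPs in gene set 1, that gives 1772, the df reported for scores available on every SNP. Whether the authors used exactly this pooled form is an assumption, the sample values are invented, and the p-value is omitted because it needs the t-distribution CDF from a statistics library.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Two-sample Student's t statistic with pooled variance.

    Returns (t, df) where df = len(a) + len(b) - 2.
    """
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    # Pooled sample variance weights each group by its own degrees of freedom.
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, df


# Hypothetical scores for two small groups of SNPs.
neutral = [1.0, 2.0, 3.0, 4.0, 5.0]
causal = [2.0, 3.0, 4.0, 5.0, 6.0]

t_ratio, df = pooled_t(neutral, causal)
print(t_ratio, df)
```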
Figure 2. Density plots of six deleterious scores for Case, Neutral, and Causal SNPs. By most of the standard deleteriousness scores, the distributions of cases in gene set 1 (Panel A) are closer to the neutral than to the causal variants, and the neutral and causal variants are significantly different. The “known epilepsy” dataset (gene set 2, Panel B) demonstrated similar results. In this gene set, variants documented to cause epilepsy were regarded as “cases,” while variants associated with other disease types were considered “causal SNPs (positive control).” Although three prediction programs (SIFT, PolyPhen2_HDIV, and PolyPhen2_HVAR) suggested that case and causal SNPs share similar distributions of deleterious scores, the remaining three programs indicate that their prediction algorithms do not favor the two types of causal variants equally. More importantly, case SNPs (epilepsy-causing SNPs) resemble neutral SNPs more than the causal ones.
Figure 3. Density plots for relative solvent accessibility and free energy change for Case, Neutral, and Causal SNPs. The two structure-based scores demonstrate that Case and Causal SNPs share similar characteristics. Results were obtained from two independent sets of genes (Panels A,B). Note the shift of the blue curves (cases) toward the causal SNPs (red) and away from the neutral ones (green).
Fisher's exact test statistics for gene sets 1 and 2.
| Enriched in causal SNPs | Deleterious count ≥ 4 | <0.0001 | <0.0001 |
| Conservation count = 3 | Enriched in neutral SNPs | <0.0001 | |
| Induce large amino acid change (Grantham score ≥ 100) | <0.0001 | <0.0001 | |
| Induce disulfide change | NS ( | <0.0001 | |
| Induce glycine/proline change | 0.0003 | <0.0001 | |
| Locate in buried site (RSA ≤ 20%) | 0.0109 | <0.0001 | |
| Locate in conformationally rigid site (P(Flexible) ≤ 0.74) | NS ( | 0.0060 | |
| Locate on protein patch | <0.0001 | <0.0001 | |
| Locate in protein domain | 0.0204 | <0.0001 | |
| Strongly reduce protein stability (ΔΔG ≥ 4 kcal/mol) | NS ( | 0.0400 | |
| Reduce protein stability (ΔΔG ≥ 0.5 kcal/mol) | 0.0009 | <0.0001 | |
| Enriched in neutral SNPs | Conservation count = 3 | <0.0001 | Enriched in causal SNPs |
| Locate in highly dynamic site (normalized B-factor ≥ 97.5%) | NS ( | 0.0224 | |
| Locate in highly flexible site (P(Flexible) ≥ 97.5%) | 0.0468 | NS ( | |
High sequence conservation (conservation count = 3) is a feature that is enriched in causal SNPs of gene set 2, but is more likely to be found in neutral SNPs of gene set 1.
Data for gene set 1 includes 100 causal SNPs and 1674 neutral SNPs.
Data for gene set 2 includes 289 causal SNPs (184 epilepsy case variants plus 105 non-epilepsy positive control variants), and 554 neutral SNPs.
Significance was assessed at α = 0.05. Non-significant test statistics are labeled NS, followed by the corresponding p-value.
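For readers unfamiliar with the test, each row above corresponds to a 2×2 table (feature present or absent, causal or neutral SNP). A minimal two-sided Fisher's exact test can be sketched with the hypergeometric distribution, as below. The counts are invented for illustration and are not taken from the study, and the two-sided rule used here (summing all tables no more probable than the observed one) is one common convention that the source does not specify.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table at least as extreme (no more probable) than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


# Hypothetical counts: rows = causal / neutral, columns = feature present / absent.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(p_value)
```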
Case SNPs with high structural disruption scores.
| Structural disrupted variants which locate in high tolerance genes | A | (4075)TGC > CGC [C1359R] | Yes [6] | Yes [180] | Yes [2.05] | Yes [3-layer(αβα) sandwich] | Yes [1.64] | 5 | high [0.26] | ||||
| B | (685)CGA > GGA [R227G] | Yes [125] | Yes [R > G] | Yes [3-layer(αβα) sandwich] | Yes [0.87] | 4 | high [0.77] | ||||||
| C | (1211)CGG > CAG [R404Q] | Yes [4] | Yes [10.55] | Yes | Yes [up-down bundle] | Yes [0.90] | 5 | High [0.8] | |||||
| D | (1064)ATC > ACC [I463T] | Yes [4] | Yes [4.42] | Yes [3-layer(αβα) sandwich] | Yes [1.37] | 4 | High [0.05] | ||||||
| E | (449)TCC > TGC [S150C] | Yes [5] | Yes [112] | Yes [0.77] | Yes [α−β horseshoe] | 4 | High [0.51] | ||||||
| F | (1517)GAT > GGT [D506G] | Yes [4] | Yes [D > G] | Yes [18.27] | Yes [1.40] | 4 | High [1.08] | ||||||
| G | (127)CTG > GTG [L43V] | Yes [4] | Yes [4.66] | Yes [3-layer(αβα) sandwich] | Yes [2.06] | 4 | High [0.17] | ||||||
| H | (409)CGC > TGC [R137C] | Yes [4] | Yes [180] | Yes [1.17] | Yes [up-down bundle] | Yes [1.83] | 5 | High [0.27] | |||||
| I | (2993)GGA > GAA [G998E] | Yes [5] | Yes [G > E] | Yes [2.27] | Yes [3.23] | 4 | High [0.32] | Yes [97.75] | |||||
| Structural disrupted variants which locate in low tolerance genes | J | (830)GGA > GTA [G277V] | Yes [6] | Yes [109] | Yes [G > V] | Yes [3-layer(αβα) sandwich] | Yes [1.64] | 5 | Low [−0.45] | ||||
| K | (821)CGT > CAT [R274H] | Yes [4] | Yes [0] | Yes [α−β complex] | Yes [0.95] | 4 | Low [−0.29] | ||||||
| L | (374)AAT > AGT [N125S] | Yes [4] | Yes [6.65] | Yes [orthogonal bundle] | Yes [1.36] | 4 | Low [−0.14] | ||||||
| M | (336)ATA > ATG [I112M] | Yes [5] | Yes [0] | Yes [α horseshoe] | Yes [1.41] | 4 | Low [−0.32] | ||||||
| N | (566)GAA > GGA [E189G] | Yes [4] | Yes [E > G] | Yes [up-down bundle] | Yes [1.79] | 4 | Low [−0.30] | ||||||
Each candidate epilepsy variant is suggested to disrupt protein structure/function if its structural disruption score (SDS) is high (≥4 out of 7). A list of 14 putative structural disrupted variants from 30 missense mutations is reported, along with the corresponding 7 scores that make up the SDS. The variants are classified into two groups, based on the level of polymorphism tolerance of their genes.
Only the values corresponding to enriched characteristics of causal SNPs are included in the table; the empty cells do have values but they are not presented here for clarity.
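The scoring rule described above (count how many of seven indicators are met, and call the variant structurally disrupted when the count is at least 4) can be sketched as follows. This is an illustrative reading rather than the authors' implementation: the thresholds are the ones quoted in the enrichment table (deleterious count ≥ 4, Grantham score ≥ 100, RSA ≤ 20%, ΔΔG ≥ 0.5 kcal/mol), while the `Variant` class, its field names, and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    deleterious_count: int   # number of sequence-based tools calling it deleterious
    grantham: float          # Grantham score of the substitution
    gly_pro_change: bool     # substitution involves glycine or proline
    rsa_percent: float       # relative solvent accessibility, in percent
    on_patch: bool           # located on a conserved surface patch
    in_domain: bool          # located in an annotated protein domain
    ddg: float               # predicted stability change, kcal/mol


def sds(v):
    """Count how many of the seven indicators are met (0-7)."""
    indicators = [
        v.deleterious_count >= 4,
        v.grantham >= 100,
        v.gly_pro_change,
        v.rsa_percent <= 20,
        v.on_patch,
        v.in_domain,
        v.ddg >= 0.5,
    ]
    return sum(indicators)


# Hypothetical variant, not one of the variants listed in the table.
example = Variant(5, 120.0, False, 10.0, False, True, 1.2)
score = sds(example)
is_disrupted = score >= 4  # "high" SDS threshold from the table note
print(score, is_disrupted)
```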
Summary of structural disrupted case SNPs.
| Structural disrupted variants which locate in high tolerance gene | A | (4075)TGC > CGC [C1359R] | ABC transporter A family member 6 [lipid homeostasis] | Large change in amino acid properties, mutation causes protein destabilization but does not alter disulfide bonds | n/a | n/a | 5 | 0.30, 2.00 | |
| B | (685)CGA > GGA [R227G] | Hydrolase [neuron development] | Near active site (low Γ), loss of side chain (R > G) | Linked to Chanarin-Dorfman syndrome (fat depositions in internal organs) | Less likely | 4 | 0.16, 0.23 | ||
| C | (1211)CGG > CAG [R404Q] | Lipoxygenase [lipid metabolism] | Near active site, on best protein patch | Shares substrate with COX (COX-2 expression increases upon electrical stimuli) | Likely | 5 | 0.05, 0.37 | ||
| D | (1064)ATC > ACC [I463T] | RNA helicase [mRNA degradation control] | Stabilization center | Gain of function in Drosophila's homolog suppresses seizure; mRNA loss accounts for 1/3 of human diseases | Maybe | 4 | 0.23, 1.16 | ||
| E | (449)TCC>TGC [S150C] | Epiphycan [cartilage development] | Stabilization center, mutation yields preferred hydrophobic core | Osteoarthritis | No | 4 | 0.61, 2.42 | ||
| F | (1517)GAT > GGT [D506G] | DNA helicase [DNA damage repair] | 3 indications as a binding residue, confirmed by mutagenesis | Effective cellular protection mechanism helps animals survive brain injuries after induced seizures | Likely | 4 | 0.57, 3.76 | ||
| G | (127)CTG > GTG [L43V] | Esterase [lipid metabolism] | Mutation locates at a turn region (not favorable in highly structured proteins) | Antiepileptic drugs interfere with lipid metabolisms | Likely (drug response) | 4 | 0.28, 2.62 | ||
| H | (409)CGC > TGC [R137C] | Neuromedin-U receptor 1 [uterus contraction, vasoconstriction] | Diminishes stabilizing salt bridge and causes protein destabilization | Control of food intake | No | 5 | 0, 0.08 | ||
| I | (2993)GGA > GAA [G998E] | Partner and localizer of BRCA2 [homologous recombination repair] | Mutation may interfere with conformational flexibility of protein, largely decreases protein stability | Several mutations identified in breast cancer but disease associations are not definitive | No | 4 | 0.59, 2.40 | ||
| Structural disrupted variants which locate in low tolerance gene | J | (830)GGA > GTA [G277V] | Mitochondria endonuclease [programmed cell death] | Rigid residue at turn region, controls positioning of C-terminal and active site, confirmed by mutagenesis | n/a but reduces substrate binding | n/a | 5 | 0.27, 1.20 | |
| K | (821)CGT > CAT [R274H] | Fatty-acid amide hydrolase2 [lipid metabolism] | Mildly decreases protein stability | Gene's regulation of endocannabinoid system is linked to Alzheimer's and other CNS disorders | Maybe | 4 | n/a | ||
| L | (374)AAT > AGT [N125S] | Monoamine oxidase type A [neurotransmitter metabolism] | Far from functional sites, mildly reduces protein stability | Gene catalyzes several neurotransmitters and associated with behavioral phenotypes, confirmed by animal studies | No | 4 | n/a | ||
| M | (336)ATA > ATG [I112M] | Phosphatase regulator [cellular process regulation] | Longer amino acid side chain causes steric clash | Member of KEGG epilepsy pathway; protein in the same family linked to Lafora disease (teenager-onset of epilepsy) | Likely | 4 | n/a | ||
| N | (566)GAA > GGA [E189G] | Tyrosine-protein phosphatase non-receptor type 14 [cellular process regulation] | Mutation does not alter inter-residue bonding but slightly decreases protein stability | Several mutations identified in colorectal cancers | No | 4 | 0.93, 3.22 |
The table summarizes the structural and clinical findings for each of the top 14 case variants. The 9 putative functional variants for epilepsy (variants A–H, M) are identified from a subset of the 14 variants. Eight variants (variants A–H) are located in “high tolerance genes” and do not possess features compatible with those of neutral SNPs. The ninth variant (variant M) is located in a “low tolerance gene”; the gene is likely to be associated with epilepsy disorders.
Variant features include all structural changes/implications that were collected during the analysis, regardless of their significance in feature enrichment toward causal SNPs.
Disease implications denote any clinically relevant associations found in the literature.
Epilepsy implications indicate our opinions on whether or not the variant contributes to epilepsy development. The opinion is based upon several data sources. However, the considerations exclude the SDS of a variant and its minor allele frequencies (MAFs). (The SDS had already been utilized as a filter during the variant prioritization step. The allele frequencies are presented here solely for comparison purposes).
Figure 4. 3D structures for the nine high-priority variants for epilepsy. The nine high-priority variants for epilepsy include eight structural disrupted variants (A–H) which are located in genes with high tolerance of polymorphisms (Table 5) and one structural disrupted variant (M) which resides in a low tolerance gene (Table 5). In all figures, mutated protein residues caused by the SNPs are indicated with blue surface representations along with the amino acid name (wild type) and residue index. Yellow surface representations refer to predicted binding sites or catalytic sites, except in (M) in which it represents a protein residue with close proximity to the altered amino acid. Orange representations in (C) indicate the predicted best conserved residue cluster on the protein surface. The substitution of Ser150Cys in (E) adds one favorable hydrophobic residue (green surface representation) to the core of the α/β horseshoe. The magenta sticks in (H) represent the partner residue, Glu117, which forms a salt bridge with wild type Arg137. This salt bridge is lost in the presence of mutant Cys at position 137.
Step-wise analysis for correlation of SDS and Condel score.
| All SDS | 0.1504 | 0.0342 | none |
| SDS-high deleterious count | 0.1112 | 0.0717 | 16 |
| SDS-large amino acid change | 0.1508 | 0.0340 | 5 |
| SDS-induce gly/Pro change | 0.1308 | 0.0496 | 7 |
| SDS-locate in buried site | 0.0782 | NS ( | 15 |
| SDS-locate on protein patch | 0.1629 | 0.0270 | 3 |
| SDS-locate in protein domain | 0.2093 | 0.0110 | 25 |
| SDS-reduce protein stability | 0.1243 | 0.0560 | 20 |
Initial SDS includes 7 parameters, described in Table .
The full dataset has 30 missense variants. All data points were used to test for a correlation between SDS and Condel score. When an SDS component was removed during the step-wise analysis, the SDSs of the variants to which the excluded parameter applied were affected. In such cases, the correlation analysis was performed with the 30 data points minus the number of exclusions indicated in the last column. Significance was assessed at α = 0.05 and α = 0.10. Non-significant test statistics are labeled NS, followed by the corresponding p-value.
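One way to read the step-wise procedure in this footnote is sketched below: for each SDS component in turn, variants to which that component applied are excluded, and the correlation between SDS and Condel score is recomputed on the remaining variants. This is a hedged interpretation, not the authors' code. The use of Pearson correlation is an assumption (the source does not name the coefficient), and the component names, indicator values, and Condel scores are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns None if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


def stepwise_correlation(indicators, condel):
    """For each component, drop variants where it applies, then correlate.

    indicators: one dict per variant mapping component name -> bool.
    Returns {component: (correlation or None, number of excluded variants)}.
    """
    results = {}
    for name in indicators[0]:
        # Kept variants have the component unset, so their SDS is unchanged.
        kept = [(sum(ind.values()), c)
                for ind, c in zip(indicators, condel) if not ind[name]]
        n_excluded = len(indicators) - len(kept)
        r = pearson([s for s, _ in kept], [c for _, c in kept]) if len(kept) >= 2 else None
        results[name] = (r, n_excluded)
    return results


# Hypothetical data: five variants, two SDS components.
toy_indicators = [
    {"buried": True, "domain": True},
    {"buried": True, "domain": False},
    {"buried": False, "domain": True},
    {"buried": False, "domain": False},
    {"buried": False, "domain": True},
]
toy_condel = [0.9, 0.6, 0.5, 0.1, 0.7]

results = stepwise_correlation(toy_indicators, toy_condel)
print(results)
```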