Vasundhara Dehiya, Jaya Thomas, Lee Sael.
Abstract
Can structural information of proteins generate essential features for predicting the deleterious effect of a single nucleotide variant (SNV), independent of whether the SNV is already known to occur in disease? We address this question by examining the performance of features generated from prior knowledge, with the goal of determining the pathogenic effect of rare variants in rare diseases. Our approach prioritizes SNV loci using protein structure-based features. These features are generated from the geometric, physical, chemical, and functional properties of the variant loci and of their structural neighbors, using multiple homologous structures. Structure-based features alone, trained on 80% of the combined HumVar-HumDiv dataset (HumVD-train) and tested on the remaining 20% (HumVD-test) as well as on the ClinVar and ClinVar rare variant rare disease (ClinVarRVRD) datasets, discriminated well between pathogenic and benign SNVs. Combining structure- and sequence-based features generated from prior knowledge in a random forest model improved the F scores to 0.84 (HumVD-test), 0.75 (ClinVar), and 0.75 (ClinVarRVRD). Adding features based on the difference between wild-type and mutant amino acids to the loci-based features raised the F scores further to 0.90 (HumVD-test), 0.78 (ClinVar), and 0.76 (ClinVarRVRD). The empirical examination and the high F scores obtained from loci information alone suggest that the location of an SNV plays a primary role in determining the functional impact of a mutation, and that structure-based features can help enhance prediction performance.
Year: 2018 PMID: 30265692 PMCID: PMC6161878 DOI: 10.1371/journal.pone.0204101
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Dataset statistics.
The number of benign and pathogenic variants in each dataset is listed.
| Dataset | Benign | Pathogenic | Total |
|---|---|---|---|
| HumVD (all) | 3256 | 6884 | 10140 |
| HumVD-train (80%) | 2591 | 5521 | 8112 |
| HumVD-test (20%) | 665 | 1363 | 2028 |
| ClinVar | 152 | 285 | 437 |
| ClinVarRVRD | 117 | 163 | 280 |
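The class balance of each dataset can be checked directly from the counts in the table. The following is a minimal sketch; the dictionary and function names are illustrative and do not come from the paper.

```python
# Counts copied from the dataset statistics table: (benign, pathogenic).
DATASETS = {
    "HumVD (all)": (3256, 6884),
    "HumVD-train (80%)": (2591, 5521),
    "HumVD-test (20%)": (665, 1363),
    "ClinVar": (152, 285),
    "ClinVarRVRD": (117, 163),
}


def pathogenic_fraction(benign: int, pathogenic: int) -> float:
    """Fraction of variants in a dataset that are labelled pathogenic."""
    return pathogenic / (benign + pathogenic)


for name, (benign, pathogenic) in DATASETS.items():
    frac = pathogenic_fraction(benign, pathogenic)
    print(f"{name}: total={benign + pathogenic}, pathogenic={frac:.1%}")
```

Pathogenic variants are the majority class in every dataset, which is one reason a class-weighted metric is a reasonable choice for the evaluations that follow.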
Fig 1. Loci-specific feature extraction pipeline.
Fig 2. Structural neighborhood.
The green sphere depicts a radius of 9 Å centered at 116.A of the PDBID:1IJN structure. All amino acid residues found within this region are considered neighbors of the query SNV locus, 116.A.
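The neighborhood definition in the Fig 2 caption amounts to a distance cutoff around the query residue. The sketch below assumes each residue is represented by a single coordinate (for example, one representative atom); the caption does not say which atoms are used, so that choice, the function name, and the toy coordinates are all assumptions made for illustration.

```python
import math

RADIUS_ANGSTROM = 9.0  # neighborhood radius stated in the Fig 2 caption


def structural_neighbors(query_id, coords, radius=RADIUS_ANGSTROM):
    """Return sorted IDs of residues within `radius` of the query residue.

    `coords` maps a residue ID such as "116.A" to an (x, y, z) tuple in
    angstroms. One coordinate per residue is an assumption of this sketch.
    """
    center = coords[query_id]
    return sorted(
        res_id
        for res_id, xyz in coords.items()
        if res_id != query_id and math.dist(center, xyz) <= radius
    )


# Toy coordinates, made up for illustration (not taken from 1IJN).
toy = {
    "116.A": (0.0, 0.0, 0.0),
    "115.A": (3.8, 0.0, 0.0),
    "117.A": (0.0, 3.8, 0.0),
    "140.A": (5.0, 5.0, 5.0),   # about 8.66 angstroms away: inside the sphere
    "200.A": (12.0, 0.0, 0.0),  # 12 angstroms away: outside the sphere
}
print(structural_neighbors("116.A", toy))
```

For real structures, a spatial index would avoid the all-pairs scan, but the linear pass keeps the cutoff logic easy to read.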
List of structure- and sequence-based features.
The rank represents the importance of individual features obtained by 5-fold cross validation on the HumVar-HumDiv training data (Size = 8,112) using Random Forest for attribute evaluation using weighted F score as evaluation metric.
| Name | Description | Rank |
|---|---|---|
| KDmean | mean KD hydrophobicity value | 2 |
| RSAmax | maximum residual solvent accessibility value | 4 |
| nRSA | mean residual solvent accessibility of structural neighbors | 5 |
| nNum | number of amino acids within structural neighborhood | 6 |
| Bstddev | standard deviation of B factors | 7 |
| nB | mean B factor of structural neighbors | 8 |
| nKD | mean KD hydrophobicity of structural neighbors | 10 |
| nSC | mean sequence conservation of structural neighbors | 13 |
| nBinding | number of binding site types in neighborhoods | 14 |
| Binding | number of binding site types | 15 |
| Mapreg | whether SNV locus is within core region of phi-psi Ramachandran map (PolyPhen2) | 16 |
| PSIC | PSIC score of wild type amino acid (PolyPhen2) | 1 |
| Nobs | number of amino acids observed at the substitution position in the multiple alignment (PolyPhen2) | 3 |
| MinDJxn | distance of SNV locus from closest exon/intron junction (PolyPhen2) | 9 |
| CodPos | position of SNV locus within a codon (PolyPhen2) | 11 |
| SeqCons | sequence conservation | 12 |
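Two of the neighbor-based features in the table, nNum and nKD, can be sketched once the neighbor residues are known. This sketch assumes that "KD" refers to the Kyte-Doolittle hydropathy scale and that neighbors are given as one-letter amino acid codes; the function name and the handling of an empty neighborhood are illustrative choices, not the authors' implementation.

```python
from statistics import mean

# Kyte-Doolittle hydropathy index, keyed by one-letter amino acid code.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def neighborhood_features(neighbor_residues):
    """Compute nNum (neighbor count) and nKD (mean hydropathy of neighbors)."""
    if not neighbor_residues:
        # Assumption: an empty neighborhood is reported as 0 neighbors, nKD 0.0.
        return {"nNum": 0, "nKD": 0.0}
    return {
        "nNum": len(neighbor_residues),
        "nKD": mean(KD[aa] for aa in neighbor_residues),
    }


# Hypothetical neighborhood of three hydrophobic residues.
print(neighborhood_features(["I", "V", "L"]))
```

Other neighbor features in the table (nRSA, nB, nSC) follow the same pattern, with a different per-residue value averaged over the neighborhood.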
SNV prediction methods used in the comparative study.
‘Type’ indicates whether the predictions are generated on a web-server or pre-computed. ‘Benign labels’ and ‘Pathogenic labels’ show how the classification labels of each algorithm were interpreted to match the Benign and Pathogenic labels.
| Algorithms | Type | Benign labels | Pathogenic labels |
|---|---|---|---|
| PolyPhen2 | web-server | benign | probably damaging, possibly damaging |
| SIFT | code, precomput. | tolerated, t. low confidence | deleterious, d. low confidence |
| CADD | precomput. | | |
| FATHMM | web-server | tolerated | damaging |
| LRT | web-server, precomput. | neutral | damaging |
| M-CAP | web-server | neutral | damaging |
| MutationAssessor | web-server | low, neutral | high, medium |
| MutationTaster | web-server | polymorphism, p. automatic | disease causing, d.c. automatic |
| PROVEAN | web-server | neutral | damaging |
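The label harmonization described by this table can be expressed as a lookup. The sketch below copies the label strings verbatim from the table, abbreviations included; CADD is omitted because the table lists no labels for it. The names `TOOL_LABELS` and `harmonize` are hypothetical.

```python
# (benign labels, pathogenic labels) per tool, copied verbatim from the table.
# CADD is left out: the table gives no label mapping for it.
TOOL_LABELS = {
    "PolyPhen2": (["benign"], ["probably damaging", "possibly damaging"]),
    "SIFT": (["tolerated", "t. low confidence"],
             ["deleterious", "d. low confidence"]),
    "FATHMM": (["tolerated"], ["damaging"]),
    "LRT": (["neutral"], ["damaging"]),
    "M-CAP": (["neutral"], ["damaging"]),
    "MutationAssessor": (["low", "neutral"], ["high", "medium"]),
    "MutationTaster": (["polymorphism", "p. automatic"],
                       ["disease causing", "d.c. automatic"]),
    "PROVEAN": (["neutral"], ["damaging"]),
}


def harmonize(tool: str, label: str) -> str:
    """Map a tool-specific label onto 'Benign' or 'Pathogenic'."""
    benign, pathogenic = TOOL_LABELS[tool]
    key = label.strip().lower()
    if key in benign:
        return "Benign"
    if key in pathogenic:
        return "Pathogenic"
    raise ValueError(f"Unknown label {label!r} for tool {tool!r}")


print(harmonize("PolyPhen2", "possibly damaging"))
print(harmonize("MutationAssessor", "low"))
```

Raising on unknown labels, rather than defaulting to one class, keeps unmapped outputs from silently skewing a comparison.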
Validation results for five feature sets.
Weighted F scores on the validation set (20% of HumVD-train) are reported for all algorithms. ‘Sequence’ refers to the sequence-based features (PSIC, Nobs, MinDJxn, CodPos, and SeqCons); ‘Structure’ refers to the structure-based features (KDmean, RSAmax, nRSA, nNum, Bstddev, nB, nKD, nSC, nBinding, Binding, Mapreg); ‘Str. noNeigh’ refers to the structure-based features without neighbor information (KDmean, RSAmax, Bstddev, Binding, Mapreg); ‘Seq + Str’ refers to all sixteen sequence and structure features listed above; ‘All + mutation’ refers to all prior features listed above plus the features based on the difference between wild-type and mutant amino acids listed in S1 Table. Boldface numbers highlight the top two weighted F scores for each feature set.
| Algorithm | Sequence | Structure | Str. noNeigh | Seq + Str | All + mutation |
|---|---|---|---|---|---|
| Naive Bayes | 0.71 | 0.69 | 0.82 | 0.89 | |
| SVM | 0.81 | 0.67 | 0.64 | 0.82 | |
| Logistic Regression | 0.81 | 0.68 | 0.65 | 0.82 | |
| KNN | 0.82 | ||||
| MLP | 0.82 | 0.69 | 0.68 | 0.89 | |
| Decision Table | 0.71 | 0.89 | |||
| Random Forest | 0.71 | ||||
Fig 3. ROC curves for three datasets.
A. HumVD, B. ClinVar, and C. ClinVarRVRD.
Test results for different features.
Summary of results for multiple feature types on optimal weighted F scores. (FNR = False Negative Rate, TPR = True Positive Rate, FPR = False Positive Rate, TNR = True Negative Rate).
| Dataset | Feature set | FNR | TPR | FPR | TNR | Weighted F |
|---|---|---|---|---|---|---|
| HumVD-test | Sequence | 0.12 | 0.88 | 0.29 | 0.71 | 0.83 |
| | Structure | 0.21 | 0.79 | 0.29 | 0.71 | 0.76 |
| | Structure (noNeigh) | 0.17 | 0.83 | 0.50 | 0.50 | 0.72 |
| | Prior (Seq + Str) | 0.13 | 0.87 | 0.78 | 0.84 | |
| | All + mutation | | | | | |
| ClinVar | Sequence | 0.10 | 0.90 | 0.59 | 0.41 | 0.71 |
| | Structure | 0.18 | 0.82 | 0.50 | 0.50 | 0.70 |
| | Structure (noNeigh) | 0.21 | 0.79 | 0.62 | 0.38 | 0.64 |
| | Prior (Seq + Str) | 0.20 | 0.80 | 0.75 | | |
| | All + mutation | 0.45 | 0.55 | | | |
| ClinVarRVRD | Sequence | 0.28 | 0.72 | 0.42 | 0.58 | 0.66 |
| | Structure | 0.52 | 0.48 | 0.67 | | |
| | Structure (noNeigh) | 0.41 | 0.59 | 0.36 | 0.64 | 0.61 |
| | Prior (Seq + Str) | 0.19 | 0.81 | 0.32 | 0.68 | 0.75 |
| | All + mutation | 0.20 | 0.80 | | | |
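The four rates and the weighted F score reported in this table all derive from a binary confusion matrix. The sketch below assumes that pathogenic is the positive class and that "weighted F" means the per-class F1 scores averaged by class support; the source does not spell out the weighting, and the function name and example counts are made up for illustration.

```python
def rates_and_weighted_f(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Compute FNR, TPR, FPR, TNR and a support-weighted F score.

    Pathogenic is treated as the positive class (an assumption).
    Both classes must contain at least one example.
    """
    n_pos = tp + fn  # actual pathogenic variants
    n_neg = fp + tn  # actual benign variants

    # F1 for each class, treating that class as the "positive" one.
    f_pathogenic = 2 * tp / (2 * tp + fp + fn)
    f_benign = 2 * tn / (2 * tn + fn + fp)

    return {
        "FNR": fn / n_pos,
        "TPR": tp / n_pos,
        "FPR": fp / n_neg,
        "TNR": tn / n_neg,
        # Average the per-class F1 scores, weighted by class size.
        "weighted_F": (n_pos * f_pathogenic + n_neg * f_benign) / (n_pos + n_neg),
    }


# Hypothetical counts, not taken from the paper.
result = rates_and_weighted_f(tp=80, fn=20, fp=30, tn=70)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

By construction FNR + TPR = 1 and FPR + TNR = 1, which matches the pairing of those columns in the complete rows of the table.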
Performance comparison with existing methods.
| Algorithm | Weighted F (ClinVar) | Weighted F (ClinVarRVRD) |
|---|---|---|
| PolyPhen2 | 0.73 | 0.69 |
| SIFT | 0.70 | |
| CADD | 0.70 | 0.65 |
| CADD | 0.75 | 0.70 |
| FATHMM | 0.69 | 0.66 |
| LRT | 0.73 | 0.71 |
| M-CAP | 0.69 | 0.58 |
| MutationAssessor | 0.68 | 0.64 |
| MutationTaster | 0.55 | 0.44 |
| PROVEAN | 0.71 | 0.69 |
| Prior (Seq + Str) | 0.75 | 0.75 |
| All + mutation | 0.78 | 0.76 |
Fig 4. Pathogenic rare variants in binding sites.
A. SNV at chr7:117590378 (Residue N1303) associated with a rare disease, cystic fibrosis. The SNV locus is near the binding site of N6-Phenylethyl-ATP. B. SNV at chr5:68296301 (Residue H1940) associated with SHORT syndrome. The SNV locus is at a binding site of compound 5e. C. SNV at chrX:154031373 (Residue P153) associated with Rett syndrome. The SNV locus is near the DNA binding site. D. SNV at chr15:48468527 (Residue N1489) associated with Marfan syndrome. The SNV locus is at the calcium binding site.
Fig 5. Pathogenic rare variants in structurally stable sites.
A. SNV at chr9:130480398 (Residue V263) associated with a rare disease, citrullinemia type I. The SNV locus is on a stable α-helix structure near the domain binding site. B. SNV at chr7:117590378 (Residue Y569) associated with a rare disease, cystic fibrosis. The SNV locus is in the middle of a stable triple-stranded parallel β-sheet. C. SNV at chr3:10149822 (Residue R167) associated with a rare disease, pheochromocytoma. The SNV locus is part of a stable α-helix structure.
Fig 6. Four benign SNVs on the BRCA2 gene.
A. Multiple alignment of PDBIDs:1IYJ, 1MIU, and 1MJE marked with four benign SNV loci. B. SNV at chr13:32356536 (Residue S2445), at the tip of a short α-helix. C. SNV at chr13:32380124 (Residue V3007), at the tip of an unstable β-sheet. D. SNV at chr13:32379467 (Residue V2898), in the middle of an unstable β-sheet. E. SNV at chr13:32371035 (Residue E2777), in the middle of an unstable α-helix where one of the structures, PDBID:1IYJ, terminates.