Xinjun Zhang, Meng Li, Hai Lin, Xi Rao, Weixing Feng, Yuedong Yang, Matthew Mort, David N Cooper, Yue Wang, Yadong Wang, Clark Wells, Yaoqi Zhou, Yunlong Liu.
Abstract
Synonymous single-nucleotide variants (sSNVs) have largely gone unstudied because they do not alter protein sequence, yet mounting evidence suggests that they may affect RNA conformation, splicing, and the stability of nascent mRNAs, and thereby promote various diseases. Accurately prioritizing deleterious sSNVs from a pool of neutral ones can significantly improve our ability to select functional genetic variants identified in genome-sequencing projects and, therefore, advance our understanding of disease etiology. In this study, we develop a computational algorithm to prioritize sSNVs based on their impact on mRNA splicing and protein function. In addition to genomic features that potentially affect splicing regulation, our proposed algorithm also includes dozens of structural features that characterize the effects of alternatively spliced exons on protein function. Our systematic evaluation of thousands of sSNVs suggests that several structural features, including intrinsic protein disorder scores, solvent accessible surface areas, protein secondary structures, and known and predicted protein family domains, differ significantly between disease-causing and neutral sSNVs. Our results suggest that protein structure features offer an added dimension of information for distinguishing disease-causing from neutral synonymous variants. The inclusion of structural features increases the predictive accuracy of functional sSNV prioritization.
Entities:
Keywords: Position Specific Score Matrix; Position Weight Matrix; Random Forest; Solvent Accessible Surface Area; Splice Site
Year: 2017 PMID: 28391525 PMCID: PMC5602096 DOI: 10.1007/s00439-017-1783-x
Source DB: PubMed Journal: Hum Genet ISSN: 0340-6717 Impact factor: 5.881
Fig. 1 Cumulative distribution function (CDF) curves and Kolmogorov–Smirnov (K–S) test p values for various protein structure features of the exons containing disease-causing (red) and neutral (black) sSNVs. a CDF of the average solvent accessible surface area (ASA) of all the amino-acid residues in the exon. b K–S test p values for the average, minimum and maximum ASA values of all the amino-acid residues in the exon. c CDF of the average disorder score of all the residues in the affected exon. d K–S test p values for 12 disorder score-derived features (Supplementary Table 1). e CDF of the average probability of the most likely protein secondary structure (alpha-helix, beta-sheet, or random coil) over all the residues in the affected exon. f K–S test p values for 12 protein secondary structure-derived features (Supplementary Table 1). g, h CDF and K–S p values of the percentage of the exon overlapping with known/predicted Pfam domains. i, j CDF and K–S p values of the normalized post-translational modification (PTM) counts in the affected exon
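The K–S comparisons summarized in Fig. 1 can be illustrated with a minimal sketch in Python. It is not the authors' code. The function names `ks_statistic` and `ks_pvalue_asymptotic` and the ASA values are hypothetical, invented for illustration. The p value uses the standard large-sample Kolmogorov series, which is only a rough approximation for small samples.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: the largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # Empirical CDFs only change at observed values, so checking those suffices.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_pvalue_asymptotic(d, n_a, n_b, terms=100):
    """Large-sample approximation of the two-sided p value (Kolmogorov series)."""
    if d <= 0.0:
        return 1.0
    lam = math.sqrt(n_a * n_b / (n_a + n_b)) * d
    series = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                 for j in range(1, terms + 1))
    # Clamp to [0, 1] because the truncated series can overshoot slightly.
    return min(1.0, max(0.0, 2.0 * series))


# Hypothetical per-exon average ASA values, invented for illustration only.
disease_asa = [0.21, 0.25, 0.28, 0.30, 0.33, 0.35]
neutral_asa = [0.31, 0.36, 0.40, 0.42, 0.47, 0.52]

d = ks_statistic(disease_asa, neutral_asa)
p = ks_pvalue_asymptotic(d, len(disease_asa), len(neutral_asa))
print(f"D = {d:.3f}, approximate p = {p:.3f}")
```

A small p value indicates that the feature's distribution differs between the two groups of exons. For real analyses, a tested statistics library is preferable to this hand-rolled approximation.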
Fig. 2 Comparison between regSNP-splicing and SPANR on the independent test variant data set and the ClinVar variant data set. a, b ROC curves showing the performance of regSNP-splicing (red curve) and SPANR (blue curve) on an independent test data set for VSS and VIE variants, respectively. c, d ROC curves showing the performance of regSNP-splicing (red curve), SPANR (blue curve) and MutPred Splice (black curve) for VSS and VIE variants documented in the ClinVar database, respectively
Fig. 3 Inverse correlation between average minor allele frequency (MAF) and average predicted disease-causing probability (DCP) for a on-splicing site and b off-splicing site 1000 Genomes variants. MAF, ranging between 0 and 1, is divided into 20 equal bins, each spanning a 0.05 increment of MAF. For all the variants with MAF falling into a given bin, we calculated the average MAF and the average DCP. Each dot represents one such pair of average MAF and average DCP. A linear model was fitted to the 20 dots and the R² value was calculated
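The binning and fitting procedure described in the Fig. 3 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation. The function names `bin_averages` and `linear_fit_r2` are hypothetical and the input values are synthetic. The caption does not say how bin edges are handled, so this sketch assumes half-open bins, with an MAF of exactly 1 placed in the last bin.

```python
def bin_averages(mafs, dcps, n_bins=20):
    """Average MAF and DCP within each of n_bins equal-width MAF bins on [0, 1]."""
    sums = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for maf, dcp in zip(mafs, dcps):
        # Assumption: bins are [k/n, (k+1)/n); MAF == 1.0 goes into the last bin.
        idx = min(int(maf * n_bins), n_bins - 1)
        sums[idx][0] += maf
        sums[idx][1] += dcp
        sums[idx][2] += 1
    # Empty bins produce no dot.
    return [(s_maf / n, s_dcp / n) for s_maf, s_dcp, n in sums if n > 0]


def linear_fit_r2(points):
    """Ordinary least-squares line through (x, y) points, with its R^2."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Synthetic variants (one per bin), invented for illustration only:
# DCP falls linearly as MAF rises.
example_mafs = [0.025 + 0.05 * i for i in range(20)]
example_dcps = [0.6 - 0.4 * maf for maf in example_mafs]

dots = bin_averages(example_mafs, example_dcps)
slope, intercept, r_squared = linear_fit_r2(dots)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, R^2 = {r_squared:.3f}")
```

A negative slope with a high R² corresponds to the inverse relationship the caption describes: rarer variants receive higher predicted disease-causing probabilities on average.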