Sahar Gelfman, Quanli Wang, K Melodi McSweeney, Zhong Ren, Francesca La Carpia, Matt Halvorsen, Kelly Schoch, Fanni Ratzon, Erin L Heinzen, Michael J Boland, Slavé Petrovski, David B Goldstein.
Abstract
Identifying the underlying causes of disease requires accurate interpretation of genetic variants. Current methods ineffectively capture pathogenic non-coding variants in genic regions, resulting in overlooking synonymous and intronic variants when searching for disease risk. Here we present the Transcript-inferred Pathogenicity (TraP) score, which uses sequence context alterations to reliably identify non-coding variation that causes disease. High TraP scores single out extremely rare variants with lower minor allele frequencies than missense variants. TraP accurately distinguishes known pathogenic and benign variants in synonymous (AUC = 0.88) and intronic (AUC = 0.83) public datasets, dismissing benign variants with exceptionally high specificity. TraP analysis of 843 exomes from epilepsy family trios identifies synonymous variants in known epilepsy genes, thus pinpointing risk factors of disease from non-coding sequence data. TraP outperforms leading methods in identifying non-coding variants that are pathogenic and is therefore a valuable tool for use in gene discovery and the interpretation of personal genomes.

While non-coding synonymous and intronic variants are often not under strong selective constraint, they can be pathogenic through affecting splicing or transcription. Here, the authors develop a score that uses sequence context alterations to predict pathogenicity of synonymous and non-coding genetic variants, and provide a web server of pre-computed scores.
Year: 2017 PMID: 28794409 PMCID: PMC5550444 DOI: 10.1038/s41467-017-00141-2
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 TraP model construction and evaluation. a TraP construction workflow and main features calculated for TraP: (1) information acquisition from all genes and transcripts that harbor the variant, (2) changes to a splice site motif that affect its binding affinity to the splicing machinery, (3) creation of new splice junctions that might interact with the splicing machinery, (4) creation and disruption of cis-acting binding sites for splicing regulatory proteins (SRP), (5) interactions between features, such as a stronger effect of a new splice site on an exon with a weak original splice site (red representing a new splice site). The model is trained using synonymous variants that are either known pathogenic variants (blue box, left) or DNMs from healthy individuals (red box, right). b A receiver-operating characteristic curve showing the results of 10 rounds of 10-fold cross-validation, with an average AUC of 0.86. c Model predictions for the training set show a clear separation of pathogenic variants (blue) versus control DNMs (red). TraP (y-axis) exhibits a minimum threshold for pathogenic variants of 0.459, below which reside all control DNMs. GERP++ score (x-axis) considers 49.5% of benign variants as conserved
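The evaluation scheme in panel b (10 rounds of 10-fold cross-validation) can be illustrated with a minimal sketch of how the train/test index splits might be generated. This is not the authors' code: the function name `repeated_kfold`, the shuffling strategy and the seed are assumptions made only for illustration, and the classifier itself is left out.

```python
import random

def repeated_kfold(n_samples, k=10, rounds=10, seed=0):
    """Yield (round, fold, train_idx, test_idx) for repeated k-fold CV.

    Hypothetical helper: each round reshuffles the sample indices and
    deals them into k folds, so every sample is tested once per round.
    """
    rng = random.Random(seed)
    for r in range(rounds):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        for f in range(k):
            test = idx[f::k]                 # every k-th shuffled index
            test_set = set(test)
            train = [i for i in idx if i not in test_set]
            yield r, f, train, test

# 10 rounds x 10 folds gives 100 train/test splits; an average AUC
# would be taken over the per-split (or per-round) AUC values.
n_splits = sum(1 for _ in repeated_kfold(200))
```

Repeating the shuffle in every round reduces the dependence of the averaged AUC on any single partition of the training set.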
Fig. 2 TraP and allele frequency of synonymous and intronic variants. a TraP density plots for training-set pathogenic variants (red), control DNMs (blue) and 1.46 M ExAC synonymous variants (green). b Correlation between TraP and MAF for 29,985 synonymous variants that create strong cryptic splice sites. The data set was binned into 20 groups by taking 5% score intervals and examining the correlation of the 20 points with the average MAF for each group. c Correlation between GERP++ score and MAF for 29,985 synonymous variants that create strong cryptic splice sites. The data set was binned into 20 groups as in (b). d MAF distributions for different types of variants. MAF distribution for synonymous variants is presented with no TraP threshold (yellow), minimum pathogenic TraP (≥ 0.459, orange) and high TraP (≥ 0.93, red). Synonymous variants with high TraP (red) have significantly lower average MAF than NS variants (bright blue). MAF distribution of CADD top-scoring synonymous variants (97.84th percentile) is also presented (green). e MAF distributions based on a non-GERP++ TraP model for 1.46 M ExAC synonymous variants. Thresholds used differ from the final TraP model: the minimum pathogenic TraP threshold used is the 25th percentile score (≥ 0.66, orange) and the high TraP threshold is the 75th percentile score (≥ 0.955, red). f MAF distributions for 1.5 M intronic variants from 776 sequenced whole genomes. MAF distribution is presented for variants with no TraP threshold (yellow), minimum pathogenic TraP (≥ 0.459, orange) and high TraP (≥ 0.93, red). The whiskers of the boxplots extend to the most extreme data point, which is no more than 1.5 times the interquartile range away from the box
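The binning described for panels b and c can be sketched as follows. The caption does not say whether the "5% score intervals" are equal-width score ranges or percentile groups; this sketch assumes equal-width bins over a score range of [0, 1], and the function names `bin_mean_maf` and `pearson` are illustrative rather than taken from the paper.

```python
from statistics import mean

def bin_mean_maf(scores, mafs, n_bins=20):
    """Group variants into equal-width score bins over [0, 1] and return
    (bin midpoint, mean MAF) for every non-empty bin.
    Assumption: scores lie in [0, 1], as TraP scores do."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for s, m in zip(scores, mafs):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        sums[idx] += m
        counts[idx] += 1
    return [((i + 0.5) / n_bins, sums[i] / counts[i])
            for i in range(n_bins) if counts[i]]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Toy data only (not from the paper): higher scores paired with lower MAF.
points = bin_mean_maf([0.02, 0.03, 0.52, 0.97, 1.0],
                      [0.4, 0.2, 0.1, 0.01, 0.03])
r = pearson([p[0] for p in points], [p[1] for p in points])
```

Correlating the per-bin averages rather than individual variants smooths out the heavy skew of MAF values, at the cost of giving every bin equal weight regardless of how many variants it holds.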
Fig. 3 ROC curves of ClinVar pathogenic and benign variants. a A ROC curve of ClinVar pathogenic and benign synonymous variants, calculated for TraP (red), GERP++ (green) and CADD (blue). b Same as a but for ClinVar intronic variants. Colored area represents high specificity region
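For readers less familiar with the quantities behind these curves, the area under a ROC curve equals the probability that a randomly chosen pathogenic variant scores higher than a randomly chosen benign one, with ties counted as half. The sketch below computes that quantity directly, together with specificity at a score threshold. The function names and the toy scores are assumptions for illustration, not values or code from the paper.

```python
def auc(pos_scores, neg_scores):
    """AUC as the fraction of (pathogenic, benign) pairs in which the
    pathogenic variant scores higher; ties count as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def specificity(neg_scores, threshold):
    """Fraction of benign variants scoring below the threshold,
    i.e. correctly not called pathogenic."""
    return sum(s < threshold for s in neg_scores) / len(neg_scores)

# Toy scores only (hypothetical, not taken from ClinVar).
toy_auc = auc([0.9, 0.3], [0.5, 0.1])                  # 3 of 4 pairs ranked correctly
toy_spec = specificity([0.1, 0.2, 0.5, 0.95], 0.93)    # 3 of 4 benign below 0.93
```

The pairwise loop is quadratic in the number of variants; for large datasets a rank-based formulation gives the same value more efficiently.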
Fig. 4 Epilepsy synonymous DNMs vs. ClinVar benign controls. A quantile–quantile plot for 103 Epi4K DNMs and 4,352 benign ClinVar synonymous variants is calculated for a TraP scores, c GERP++ scores and e CADD scores. Score distributions for training-set control DNMs, ClinVar benign variants and Epi4K DNMs are calculated using b TraP, d GERP++ and f CADD. The whiskers of the boxplots extend to the most extreme data point, which is no more than 1.5 times the interquartile range away from the box
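The whisker rule stated in the boxplot captions can be written out as a short sketch. Quartile definitions differ between plotting tools, so the choice of `statistics.quantiles` with `method="inclusive"` is an assumption, as is the function name `whisker_ends`.

```python
from statistics import quantiles

def whisker_ends(values, k=1.5):
    """Return the (lower, upper) whisker ends: the most extreme data
    points lying within k * IQR of the box (Q1 to Q3)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = min(v for v in values if v >= q1 - k * iqr)
    upper = max(v for v in values if v <= q3 + k * iqr)
    return lower, upper

# Toy data: 100 lies beyond Q3 + 1.5 * IQR, so the upper whisker stops at 8.
ends = whisker_ends([1, 2, 3, 4, 5, 6, 7, 8, 100])
```

Points outside the whisker ends are the ones a boxplot would draw individually as outliers.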
Epi4K DNMs with high TraP scores
| Variant ID | TraP | Gene ID | Phenotype associated with gene | Ref |
|---|---|---|---|---|
| | | | | [ |
| 18-40695437-G-A | 0.812 | | Schizophrenia, synaptic signaling; weak PD | [ |
| 2-222428815-G-A | 0.438 | | | [ |
| 14-23858626-T-C | 0.430 | | Cardiomyopathy | |
| 3-42700087-G-A | 0.383 | | Unknown | |
| | | | | [ |
| 13-52249316-A-G | 0.352 | | Unknown | |
Genes associated with epilepsy are in bold and genes associated with other neurological disorders are underlined.
Fig. 5 Minigene design and quantification. a Minigene design. (A) Exon 10 and flanking genomic sequence were amplified from patient and parent DNA and cloned into the pI-12 splicing reporter vector. (B) Predicted splicing if the splice site mutation has no effect on WT splicing. (C) Predicted skipping of exon 10 if the splice site is disrupted by K333. b Semi-quantitative PCR gel of splicing isoforms from the parent harboring the W97C variant and the proband harboring both the W97C and K333 variants