Matthew Mort, Timothy Sterne-Weiler, Biao Li, Edward V Ball, David N Cooper, Predrag Radivojac, Jeremy R Sanford, Sean D Mooney.
Abstract
We have developed a novel machine-learning approach, MutPred Splice, for the identification of coding region substitutions that disrupt pre-mRNA splicing. Applying MutPred Splice to human disease-causing exonic mutations suggests that 16% of mutations causing inherited disease and 10 to 14% of somatic mutations in cancer may disrupt pre-mRNA splicing. For inherited disease, the main mechanism responsible for the splicing defect is splice site loss, whereas for cancer the predominant mechanism of splicing disruption is predicted to be exon skipping via loss of exonic splicing enhancers or gain of exonic splicing silencer elements. MutPred Splice is available at http://mutdb.org/mutpredsplice.
Year: 2014 PMID: 24451234 PMCID: PMC4054890 DOI: 10.1186/gb-2014-15-1-r19
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Table 1. Summary of original data sets used in this study
| Disease-causing splice altering variants (DM-SAVs) | Splice altering variants (SAVs) | Inherited disease-causing coding region mutations that disrupt pre-mRNA splicing, derived from HGMD | 1,189 | 453 |
| Disease-causing splice neutral variants (DM-SNVs) | Splice neutral variants (SNVs) | Inherited disease-causing missense mutations not reported to disrupt splicing derived from the same set of genes as the DM-SAVs. The majority are not expected to have any effect on exon splicing but approximately 25% may nevertheless disrupt splicing | 7,729 | 364 |
| Polymorphic splice neutral variants (SNP-SNVs) | Splice neutral variants (SNVs) | Putatively ‘neutral’ common coding region SNPs (minor allele frequency >0.3) from the 1000 Genomes Project. The majority are not expected to have any effect on pre-mRNA splicing | 7,339 | 3,773 |
Table 2. Summary of training set sizes derived from the data sets outlined in Table 1
| Training set | Positive set | Negative set |
|---|---|---|
| Disease negative set | DM-SAVs (1,189, 1,189, 2,601) | DM-SNVs (7,729, 7,363, 31,967) |
| SNP negative set | DM-SAVs (1,189, 1,189, 2,090) | SNP-SNVs (7,339, 7,253, 70,847) |
| Mixed negative set (disease and SNP) | DM-SAVs (1,189, 1,189, 6,335) | DM-SNVs and SNP-SNVs (15,068, 14,616, 111,630) |
| Random SNP set (control) | SNP-SNVs (50%) (3,669, 3,669, 9,901) | SNP-SNVs (50%) (3,670, 3,613, 7,349) |
Numbers of training examples for each iteration (Iter. 1, Iter. 2 and Iter. 3) are shown in parentheses.
Summary of features investigated in this study
| Feature | Feature type | Description |
|---|---|---|
| Distance to nearest splice site | SNP-based | Distance between a given variant and the nearest 5′ or 3′ splice site in the target exon. |
| ESR change | SNP-based | Change in the frequency of ESR elements subsequent to a single base substitution. This includes: ESE to neutral (ESE loss); ESE to ESE (no change); Neutral to ESE (ESE gain); ESE to ESS (ESE loss and ESS gain); Neutral to neutral (no change); ESS to ESS; Neutral to ESS (ESS gain); ESS to neutral (ESS loss); ESS to ESE (ESS loss and ESE gain) |
| In ESE | SNP-based | Frequency of ESE binding sites (in the wild-type) that overlap with the location of the variant |
| In ESS | SNP-based | Frequency of ESS binding sites (in the wild-type) that overlap with the variant |
| ESR hexamer score (ESR-HS) | SNP-based | Hexamer scoring function to express the relationship between disease and neutral variants and their differential distributions with respect to loss or gain of an ESE or ESS |
| Spectrum kernel | SNP-based | Frequency of 3-mers and 4-mers over an 11 bp window (wild type and mutant) |
| Change in natural splice site strength | SNP-based | MaxEnt splice site score of natural splice site in mutant allele minus MaxEnt splice site score of wild-type allele |
| Maximum cryptic splice site | SNP-based | Maximum cryptic splice site (5′ and 3′) score (outside of the natural splice site) found overlapping the variant on the mutant allele |
| Evolutionarily conserved element | SNP-based | PhastCons conserved element probability for substitution site, based on multiple alignments of 46 placental mammals |
| Base-wise evolutionary conservation | SNP-based | PhyloP base-wise sequence conservation score at site of single base substitution based on multiple sequence alignment of 46 placental mammals |
| Natural wild-type splice site strength | Exon-based | MaxEntScan score of the natural 5′ and 3′ splice site of the wild-type target exon |
| Flanking intron size | Exon-based | Length in base-pairs of the upstream and downstream introns flanking the target exon |
| Intronic ESS density | Exon-based | Intronic ESS density was calculated for 100 bp upstream and 100 bp downstream of the target exon |
| Exonic ESS density | Exon-based | ESS density was calculated across the first 50 bp and the last 50 bp of the target exon. If the length of the exon was less than 100 bp, then the full length of the exon was used to calculate the ESS density |
| Exonic ESE density | Exon-based | Same as above but for ESEs |
| Internal coding exon | Exon-based | {true, false}, Is the target exon an internal coding exon (that is, the target exon is not the first or last coding exon) |
| Exonic GC content | Exon-based | Percentage of nucleotides that are either guanine or cytosine in the target exon |
| Exon size | Exon-based | Size of the target exon |
| Constitutive exon | Exon-based | Is the target exon constitutively spliced |
| Exon number | Gene-based | Number of exons in the transcript |
| Transcript number | Gene-based | Number of different reported isoforms that the target gene encodes |
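Two of the sequence-derived features in the table are simple enough to sketch directly: the spectrum kernel counts (3-mers and 4-mers over an 11 bp window, wild type and mutant) and the exonic GC content. The Python below is a minimal illustration, not the authors' implementation. The function names, the decision to centre the window on the variant, and the toy exon sequence are all assumptions made for this example.

```python
from collections import Counter


def kmer_spectrum(window, ks=(3, 4)):
    """Count every overlapping k-mer of the requested sizes in a sequence window."""
    counts = Counter()
    for k in ks:
        for i in range(len(window) - k + 1):
            counts[window[i:i + k]] += 1
    return counts


def variant_window(exon_seq, pos, alt, flank=5):
    """Return (wild-type, mutant) windows of up to 2*flank+1 bp around a 0-based position.

    Centring the window on the variant is an assumption of this sketch.
    """
    start = max(0, pos - flank)
    end = min(len(exon_seq), pos + flank + 1)
    wild = exon_seq[start:end]
    mutant = exon_seq[start:pos] + alt + exon_seq[pos + 1:end]
    return wild, mutant


def gc_content(seq):
    """Percentage of nucleotides that are guanine or cytosine."""
    if not seq:
        return 0.0
    return 100.0 * sum(base in "GC" for base in seq.upper()) / len(seq)


# Toy exon (hypothetical sequence) with a T>C substitution at 0-based position 10.
exon = "ATGGCCAAGGTACCTGAAGCT"
wild, mutant = variant_window(exon, pos=10, alt="C")
wild_counts = kmer_spectrum(wild)
mutant_counts = kmer_spectrum(mutant)
print(wild, mutant, gc_content(exon))
```

The wild-type and mutant count vectors would then be supplied side by side as features; how the original study encoded them is not stated in the table.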
Figure 1. Feature ranking for Disease negative set versus SNP negative set (Iter. 1), shown by means of the average AUC using 10-fold cross-validation. The linear support vector machine (SVM) classifier was trained with only the specific feature (or feature subset) that was being tested. As a control, each training example was also assigned a randomly generated numerical value. AUC values for all features were then compared with the AUC produced by a classifier trained with only the randomly generated attribute by means of a Bonferroni-corrected t-test (P < 0.05). AUC values that differ significantly from that of the random attribute are indicated by asterisks in parentheses for the respective data sets (significant Disease negative set feature, significant SNP negative set feature). Features are ranked by reference to the Disease negative set.
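The Bonferroni step mentioned in the caption can be sketched as follows. This is a minimal illustration that assumes the per-feature P values from the t-tests have already been computed; the feature names and P values are invented placeholders, not results from the study.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each test as significant after multiplying its P value by the number of tests."""
    n_tests = len(p_values)
    return {name: min(1.0, p * n_tests) < alpha for name, p in p_values.items()}


# Hypothetical P values for three features, each compared against the random attribute.
example = {"feature_a": 0.001, "feature_b": 0.02, "feature_c": 0.4}
flags = bonferroni_significant(example)
print(flags)
```

With three tests, a raw P value of 0.02 becomes 0.06 after correction and no longer passes the 0.05 threshold, which is why the correction matters when many features are ranked at once.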
Figure 2. Model performance evaluation using ROC curves when applied to the same unseen test set of 352 variants (238 positive and 114 negative). For each of the four training sets (Table 2), three different random forest (RF) classification models were built (Iter. 1, Iter. 2 and Iter. 3). The percentage AUC for each training set and specific iteration is shown in parentheses.
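For readers unfamiliar with the AUC values quoted in the caption, the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python; the function name and the toy labels and scores are assumptions for illustration, and the study itself does not state how its AUC values were computed.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count as 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: one positive (score 0.2) is ranked below one negative (score 0.3).
print(roc_auc([1, 1, 0, 0], [0.9, 0.2, 0.3, 0.1]))
```

The pairwise loop is quadratic, which is acceptable for a test set of this size (238 positives by 114 negatives).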
Standard performance benchmarks for MutPred Splice based on an unseen test set of 352 variants (238 positive, 114 negative) using the three different iterations (Iter. 1, Iter. 2 and Iter. 3) of the four different training sets identified in this study (Table 2)
| Training set | Iteration | FPR (%) | Sensitivity (%) | Specificity (%) | Accuracy (%) | AUC (%) | MCC |
|---|---|---|---|---|---|---|---|
| Disease negative set | Iter. 1 | 7.0 | 53.4 | 93.0 | 73.2 | 75.2 | 0.45 |
| | Iter. 2 | 7.0 | 52.5 | 93.0 | 72.8 | 75.9 | 0.44 |
| | Iter. 3 | 4.4 | 55.0 | 95.6 | 75.3 | 77.1 | 0.49 |
| SNP negative set | Iter. 1 | 36.8 | 73.1 | 63.2 | 68.1 | 76.4 | 0.35 |
| | Iter. 2 | 36.8 | 72.3 | 63.2 | 67.7 | 76.8 | 0.34 |
| | Iter. 3 | 34.2 | 71.0 | 65.8 | 68.4 | 78.3 | 0.35 |
| Mixed negative set | Iter. 1 | 7.9 | 56.3 | 92.1 | 74.2 | 78.8 | 0.46 |
| | Iter. 2 | 7.9 | 56.7 | 92.1 | 74.4 | 78.6 | 0.46 |
| Random SNP set | Iter. 1 | 0.0 | 1.3 | 100.0 | 50.6 | 50.6 | 0.06 |
| | Iter. 2 | 0.9 | 1.7 | 99.1 | 50.4 | 45.2 | 0.03 |
| | Iter. 3 | 29.8 | 31.1 | 70.2 | 50.6 | 50.3 | 0.01 |
Classification models were built using RF with 1,000 trees. The unseen test set was experimentally characterized with respect to the splicing phenotype. Performance benchmarks for the final classification model (Mixed negative set; Iter. 3) are highlighted in bold. Performance metrics where appropriate were calculated using a probability threshold (general score) ≥0.60. The Random SNP set is a control set. MCC, Matthews correlation coefficient.
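The footnote states that metrics were calculated at a probability threshold (general score) of ≥0.60. A minimal sketch of that thresholding step is shown below; the variant identifiers and scores are hypothetical, and the function name is not part of MutPred Splice.

```python
def call_variants(scores, threshold=0.60):
    """Label each variant SAV if its general score meets the threshold, otherwise SNV."""
    return {vid: ("SAV" if score >= threshold else "SNV") for vid, score in scores.items()}


# Hypothetical general scores for three variants.
scores = {"variant_1": 0.82, "variant_2": 0.60, "variant_3": 0.41}
calls = call_variants(scores)
print(calls)
```

Because the comparison is inclusive, a variant scoring exactly 0.60 is called a SAV, matching the "≥0.60" wording of the footnote.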
Predicted proportion of exonic variants that disrupt pre-mRNA splicing in human genetic disease (Inherited disease, that is, germline; and Cancer, that is, somatic) and also identified in the general population (1000 Genomes Project participants)
| Inherited disease | 11.0% (5,193/47,228) | 90.3% (468/518) | 30.5% (4,130/13,559) | 16.0% (9,791/61,305) |
| Cancer | 9.2% (32,056/347,380) | 8.6% (9,010/105,094) | 32.4% (9,141/28,256) | 10.4% (50,207/480,730) |
| 1000 Genomes | 6.8% (7,016/103,445) | 6.7% (5,968/89,396) | 19.5% (273/1,400) | 6.8% (13,257/194,241) |
The somatic Cancer data set includes driver and passenger mutations recorded in COSMIC [63]. The 1000 Genomes Project data set was derived from the 1000 Genomes Project without any MAF filter having been applied, that is, all rare and common variants were included. The proportion of predicted SAVs for each data set is shown together with the frequencies of predicted SAVs; the sizes of the data sets are shown in parentheses.
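The percentages in the table follow directly from the counts in parentheses, and the final column is the sum of the three preceding per-category columns. The short check below recomputes them from the table's own numbers; the function and variable names are illustrative only.

```python
def percent(predicted, total):
    """Proportion of predicted SAVs as a percentage, rounded to one decimal place."""
    return round(100.0 * predicted / total, 1)


def combine(categories):
    """Sum (predicted SAVs, data set size) pairs across the per-category columns."""
    predicted = sum(p for p, _ in categories)
    total = sum(t for _, t in categories)
    return predicted, total


# (predicted SAVs, data set size) for the three per-category columns of each row.
table = {
    "Inherited disease": [(5193, 47228), (468, 518), (4130, 13559)],
    "Cancer": [(32056, 347380), (9010, 105094), (9141, 28256)],
    "1000 Genomes": [(7016, 103445), (5968, 89396), (273, 1400)],
}

for name, categories in table.items():
    predicted, total = combine(categories)
    print(name, predicted, total, percent(predicted, total))
```

The combined counts reproduce the final column of each row (for example 9,791/61,305 for Inherited disease, or 16.0%).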
Predicted proportion of exonic variants from two gene subsets (tumor suppressor versus oncogenes) that disrupt pre-mRNA splicing in human genetic disease (Inherited disease, that is, germline; and Cancer, that is, somatic) and also identified in the general population (1000 Genomes Project participants)
| Data set | Tumor suppressor genes | Oncogenes |
|---|---|---|
| Inherited disease | 25.3% (1,130/4,463) | 10.9% (132/1,207) |
| Cancer | 16.0% (1,612/10,082) | 10.9% (525/4,831) |
| 1000 Genomes | 7.4% (84/1,133) | 8.0% (49/612) |
The somatic Cancer data set includes driver and passenger mutations recorded in COSMIC [63]. The 1000 Genomes Project data set was derived from the 1000 Genomes Project without any MAF filter having been applied, that is, all rare and common variants were included. The proportion of predicted SAVs for each data set is shown, together with the frequencies of predicted SAVs; the sizes of the data sets are shown in parentheses.
Figure 3. Case study illustrating the semi-supervised approach employed in this study. The disease-causing (DM) missense mutation CM080465 in the OPA1 gene (NM_015560.2: c.1199C>T; NP_056375.2: p.P400L) was not originally reported to disrupt splicing but was later shown in vitro to disrupt pre-mRNA splicing [25]. CM080465 was included in the negative set in the first iteration (Iter. 1). The Iter. 1 model, however, predicted CM080465 to disrupt pre-mRNA splicing (SAV). In the next iteration (Iter. 2), CM080465 was excluded from the negative set. The Iter. 2 model still predicted CM080465 to be a SAV and so, in the final iteration (Iter. 3), this variant was included in the positive set. This demonstrates that a semi-supervised approach can, at least in some instances, correctly re-label an incorrectly labeled training example. SAV, splice-altering variant; SNV, splice neutral variant.
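The relabeling logic described in the caption can be written down as a small function. This is a sketch of the caption's three-step description only, not the study's code. The identifiers other than CM080465 and all model calls are hypothetical, and the handling of a variant flagged in Iter. 1 but predicted SNV in Iter. 2 (here it simply stays excluded) is an assumption, because the caption does not cover that case.

```python
def relabel_negatives(negatives, iter1_calls, iter2_calls):
    """Apply the two relabeling steps to the negative set.

    Returns (negatives kept for Iter. 2, variants moved to the positive set for Iter. 3).
    """
    iter2_negatives = []
    iter3_positives = []
    for variant_id in negatives:
        if iter1_calls[variant_id] != "SAV":
            # Iter. 1 agrees with the negative label: keep the variant as a negative.
            iter2_negatives.append(variant_id)
        elif iter2_calls.get(variant_id) == "SAV":
            # Flagged in Iter. 1 and still predicted SAV in Iter. 2: promote to positive.
            iter3_positives.append(variant_id)
        # Otherwise the variant stays excluded (assumption of this sketch).
    return iter2_negatives, iter3_positives


# CM080465 follows the path in the caption; the other two variants are hypothetical.
negatives = ["CM080465", "VAR_B", "VAR_C"]
iter1_calls = {"CM080465": "SAV", "VAR_B": "SNV", "VAR_C": "SAV"}
iter2_calls = {"CM080465": "SAV", "VAR_C": "SNV"}
print(relabel_negatives(negatives, iter1_calls, iter2_calls))
```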
Comparison of three existing tools used to identify exonic SAVs with MutPred Splice
| Splicing focus | Splice site disruption | All exonic and intronic | ESE/ESS disruption and cryptic splice site | All exonic |
| Prediction output | Binary label | Multiple output scores | Multiple output scores | Probabilistic, with additional hypothesis of splicing mechanism disrupted |
| TP | 41 | 65 | 68 (61) | 121 |
| FP | 4 | 33 | 15 | 7 |
| TN | 79 | 50 | 68 | 76 |
| FN | 140 (0) | 116 | 113 (57) | 60 |
| FPR (%) | 4.8 | 39.8 | 18.1 | 8.4 |
| Sensitivity (%) | 22.7 (100.0) | 35.9 | 37.6 (51.7) | 66.9 |
| Specificity (%) | 95.2 | 60.2 | 81.9 | 91.6 |
| Accuracy (%) | 58.9 (97.6) | 48.1 | 59.7 (66.8) | 79.2 |
| MCC | 0.22 (0.93) | -0.04 | 0.19 (0.34) | 0.54 |
Evaluation was based on 264 exonic variants (181 positive, 83 negative). Performance metrics are given for guidance only as not all tools may be directly comparable (due to different applications or limitations). Performance scores in parentheses reflect adjusted performance based upon the evaluation of only specific categories of splicing mutation (for example, splice site disruption) relevant to the respective tool. For methods that output multiple scores for a variant (HSF and Skippy), performance metrics may differ depending upon the features and thresholds applied. TP, true positives; FP, false positives; TN, true negatives; FN, false negatives; FPR, false positive rate; MCC, Matthews correlation coefficient.
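The summary rows of the table can be recomputed from the TP, FP, TN and FN counts. The sketch below uses the standard definitions; note that the reported Accuracy values are consistent with the mean of sensitivity and specificity (balanced accuracy) rather than with (TP + TN) divided by the total, so that is what the function returns. The function name and dictionary keys are illustrative choices.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Return FPR, sensitivity, specificity and balanced accuracy (all in %) plus MCC."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    fpr = fp / (fp + tn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "fpr": 100.0 * fpr,
        "sensitivity": 100.0 * sensitivity,
        "specificity": 100.0 * specificity,
        "balanced_accuracy": 100.0 * (sensitivity + specificity) / 2.0,
        "mcc": mcc,
    }


# Counts from the final column of the table (121 TP, 7 FP, 76 TN, 60 FN).
metrics = confusion_metrics(tp=121, fp=7, tn=76, fn=60)
print({name: round(value, 2) for name, value in metrics.items()})
```

Rounded to the table's precision, these counts give an FPR of 8.4%, sensitivity of 66.9%, specificity of 91.6%, accuracy of 79.2% and an MCC of 0.54, matching the final column.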
Figure 4. Role of exonic variants in aberrant mRNA processing for Inherited disease and Cancer data sets. The somatic Cancer variants were derived from COSMIC and include both driver and passenger mutations. For all mutation types and the combined total, the proportions of predicted SAVs in both Inherited disease and Cancer were significantly enriched (Fisher’s exact test with Bonferroni correction applied; P < 0.05) when compared to exonic variants identified in the 1000 Genomes Project (unlike the SNP negative training set, in this instance no MAF filter was applied, that is, all rare and common variants were included).
Figure 5. Confident hypotheses of the underlying splicing mechanism disrupted for predicted exonic SAVs in Inherited disease and somatic variants in Cancer. Significant enrichment (+) or depletion (-) for a specific hypothesis is shown for the Cancer versus Inherited disease data sets (Fisher’s exact test with a Bonferroni-corrected threshold of P < 0.05).
Figure 6. Proportion of exonic variants involved in aberrant mRNA processing for a set of tumor suppressor genes (71 genes) and a set of oncogenes (54 genes), from three different data sets (Inherited disease, somatic mutations in Cancer, and variants identified in the 1000 Genomes Project with no MAF filter applied, that is, all rare and common variants included). Disease-causing substitutions in tumor suppressor (TS) genes tend to be recessive loss-of-function mutations, in contrast to disease-causing substitutions in oncogenes, which are usually dominant gain-of-function mutations. Inherited disease and Cancer are significantly enriched in the TS gene set (denoted by an asterisk), when compared with the equivalent set of oncogenes, for mutations that are predicted to result in aberrant mRNA processing (SAVs). P-values were calculated using Fisher’s exact test with a Bonferroni-corrected threshold of P < 0.05.
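The enrichment test named in the caption can be sketched in plain Python. The function below is a generic two-sided Fisher's exact test for a 2x2 table, computed from log-binomial coefficients so that large counts do not overflow; it is an illustration rather than the study's code. The counts are the tumor suppressor and oncogene figures reported for the Inherited disease and 1000 Genomes data sets, and the exact software and sidedness the authors used are not stated here, so the two-sided form is an assumption.

```python
import math


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def log_prob(x):
        # Hypergeometric log-probability of seeing x in the top-left cell.
        return _log_comb(row1, x) + _log_comb(row2, col1 - x) - _log_comb(n, col1)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    observed = log_prob(a)
    # Sum every table that is no more probable than the observed one.
    total = sum(
        math.exp(lp)
        for lp in (log_prob(x) for x in range(lowest, highest + 1))
        if lp <= observed + 1e-7
    )
    return min(1.0, total)


# Rows: tumor suppressor genes, oncogenes. Columns: predicted SAV, not predicted SAV.
inherited = fisher_exact_two_sided(1130, 4463 - 1130, 132, 1207 - 132)
genomes = fisher_exact_two_sided(84, 1133 - 84, 49, 612 - 49)
print(inherited < 0.05, genomes < 0.05)
```

A Bonferroni-corrected comparison would divide the 0.05 threshold by the number of tests performed; that number is not given in the caption and would need to be supplied.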