Xueqiu Jian, Eric Boerwinkle, Xiaoming Liu.
Abstract
In silico tools have been developed to predict variants that may have an impact on pre-mRNA splicing. The major limitation of the application of these tools to basic research and clinical practice is the difficulty in interpreting the output. Most tools only predict potential splice sites given a DNA sequence without measuring splicing signal changes caused by a variant. Another limitation is the lack of large-scale evaluation studies of these tools. We compared eight in silico tools on 2959 single nucleotide variants within splicing consensus regions (scSNVs) using receiver operating characteristic analysis. The Position Weight Matrix model and MaxEntScan outperformed other methods. Two ensemble learning methods, adaptive boosting and random forests, were used to construct models that take advantage of individual methods. Both models further improved prediction, with outputs of directly interpretable prediction scores. We applied our ensemble scores to scSNVs from the Catalogue of Somatic Mutations in Cancer database. Analysis showed that predicted splice-altering scSNVs are enriched in recurrent scSNVs and known cancer genes. We pre-computed our ensemble scores for all potential scSNVs across the human genome, providing a whole genome level resource for identifying splice-altering scSNVs discovered from large-scale sequencing studies.
Year: 2014 PMID: 25416802 PMCID: PMC4267638 DOI: 10.1093/nar/gku1206
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
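The abstract describes combining individual tool scores with adaptive boosting. The following is a minimal sketch of that idea, assuming decision stumps as weak learners and a rescaling of the weighted vote to the range 0 to 1; it is not the authors' implementation, and the feature values, function names, and the rescaling are all hypothetical.

```python
import math

def stump_predict(stump, row):
    """Predict +1 or -1 from one feature compared against one threshold."""
    feature, threshold, sign = stump
    return sign if row[feature] >= threshold else -sign

def train_stump(X, y, weights):
    """Return (weighted error, stump) for the stump with the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for sign in (1, -1):
                stump = (feature, threshold, sign)
                err = sum(w for row, label, w in zip(X, y, weights)
                          if stump_predict(stump, row) != label)
                if best is None or err < best[0]:
                    best = (err, stump)
    return best

def adaboost_fit(X, y, n_rounds=5):
    """Discrete AdaBoost: reweight the examples after each weak learner."""
    weights = [1.0 / len(X)] * len(X)
    model = []
    for _ in range(n_rounds):
        err, stump = train_stump(X, y, weights)
        if err >= 0.5:  # weak learner is no better than chance; stop
            break
        err = max(err, 1e-10)  # avoid log(0) when a stump is perfect
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, stump))
        weights = [w * math.exp(-alpha * label * stump_predict(stump, row))
                   for row, label, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def adaboost_score(model, row):
    """Weighted vote rescaled to [0, 1] (an assumed, illustrative rescaling)."""
    margin = sum(alpha * stump_predict(stump, row) for alpha, stump in model)
    return 0.5 * (1 + margin / sum(alpha for alpha, _ in model))

# HYPOTHETICAL toy data: each row holds two made-up tool scores for one variant;
# labels are +1 (splice-altering) or -1 (not). Feature 0 alone separates them.
X = [[0.9, 0.8], [0.7, 0.2], [0.8, 0.9], [0.1, 0.3], [0.2, 0.1], [0.3, 0.7]]
y = [1, 1, 1, -1, -1, -1]
model = adaboost_fit(X, y)
print(adaboost_score(model, [0.85, 0.5]), adaboost_score(model, [0.15, 0.5]))
```

In practice an established library implementation would be preferable; the pure-Python version is shown only to make the reweighting step explicit.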
Summary of the dataset used in the present study
| Data source | 5′ | 3′ | Intronic | Exonic | Total |
|---|---|---|---|---|---|
| Positive | | | | | |
| HGMD | 725 | 65 | 2 | 788 | 790 |
| SpliceDisease | 182 | 84 | 235 | 31 | 266 |
| DBASS | 63 | 45 | 96 | 12 | 108 |
| Subtotal | 970 | 194 | 333 | 831 | 1164 |
| Negative | | | | | |
| 1000 Genomes | 712 | 1083 | 1548 | 247 | 1795 |
| Total | 1682 | 1277 | 1881 | 1078 | 2959 |
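Each variant in the table above appears to be counted twice: once by splice-site side (5′ or 3′) and once by location (intronic or exonic), so both pairs of columns should sum to the row total. A small sketch that checks this and derives the share of positive variants, with hypothetical variable names:

```python
# (5' site, 3' site, intronic, exonic, total) per data source, copied from the table.
dataset = {
    "HGMD": (725, 65, 2, 788, 790),
    "SpliceDisease": (182, 84, 235, 31, 266),
    "DBASS": (63, 45, 96, 12, 108),
    "1000 Genomes": (712, 1083, 1548, 247, 1795),
}
POSITIVE_SOURCES = ("HGMD", "SpliceDisease", "DBASS")

def row_is_consistent(row):
    """Both classifications (site side, location) must add up to the row total."""
    five_prime, three_prime, intronic, exonic, total = row
    return five_prime + three_prime == total and intronic + exonic == total

all_consistent = all(row_is_consistent(row) for row in dataset.values())
n_positive = sum(dataset[source][4] for source in POSITIVE_SOURCES)
n_total = sum(row[4] for row in dataset.values())
positive_fraction = n_positive / n_total

print(all_consistent, n_positive, n_total, round(positive_fraction, 3))
# prints: True 1164 2959 0.393
```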
Missing rates of the prediction scores for eight in silico tools
| Tool | No. of missing | Missing rate |
|---|---|---|
| PWM | 77 | 0.026 |
| MES | 82 | 0.028 |
| NNSplice | 68 | 0.023 |
| HSF | 66 | 0.022 |
| GeneSplicer | 563 | 0.190 |
| GENSCAN | 2466 | 0.833 |
| NetGene2 | 1887 | 0.638 |
| SplicePredictor | 2252 | 0.761 |
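The missing rates above are consistent with dividing each count by the 2959 scSNVs in the dataset. A short sketch that reproduces the column, with hypothetical variable names:

```python
# Total number of scSNVs in the dataset (from the dataset summary table).
TOTAL_SCSNVS = 2959

# Number of variants with no prediction score per tool (from the table above).
missing_counts = {
    "PWM": 77, "MES": 82, "NNSplice": 68, "HSF": 66,
    "GeneSplicer": 563, "GENSCAN": 2466, "NetGene2": 1887, "SplicePredictor": 2252,
}

def missing_rate(n_missing, total=TOTAL_SCSNVS):
    """Fraction of variants for which a tool produced no score."""
    return n_missing / total

missing_rates = {tool: round(missing_rate(n), 3) for tool, n in missing_counts.items()}

# List the tools from highest to lowest missing rate.
for tool, rate in sorted(missing_rates.items(), key=lambda item: -item[1]):
    print(f"{tool:16s}{rate:.3f}")
```

The four tools with missing rates below 0.03 (PWM, MES, NNSplice, HSF) are the ones whose scores appear in the evaluation table further down.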
Figure 1. Averaged ROC curves for seven individual scores and four ensemble scores with 10-fold cross-validation.
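For readers unfamiliar with ROC analysis, the sketch below shows how ROC points and the area under the curve (AUC) can be computed for one score on one fold. It assumes that higher scores indicate a splice-altering variant; the labels and scores are made-up toy values, not data from the study.

```python
def roc_points(labels, scores):
    """(false positive rate, true positive rate) at each distinct cutoff, strictest first."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def roc_auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# HYPOTHETICAL toy fold: 1 = splice-altering, 0 = not.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
print(roc_points(toy_labels, toy_scores))
print(round(toy_auc, 3))  # 8 of 9 positive/negative pairs are ranked correctly
```

Under 10-fold cross-validation, this calculation would be repeated on each held-out fold and the results averaged.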
Figure 2. Plots of sensitivity and specificity for seven individual scores and four ensemble scores using cutoff values that maximize accuracy. All measures are reported as averages based on 10-fold cross-validation.
Mean values of the evaluation measures for the 11 scores based on 10-fold cross-validation
| Score | AUC | Cutoff | Accuracy | Sensitivity | Specificity | PPV^a | NPV^b |
|---|---|---|---|---|---|---|---|
| PWM1 | 0.951 | 0.075 | 0.905 | 0.838 | 0.950 | 0.918 | 0.897 |
| PWM2 | 0.946 | 0.062 | 0.911 | 0.851 | 0.952 | 0.922 | 0.904 |
| MES | 0.941 | 0.217 | 0.895 | 0.859 | 0.919 | 0.875 | 0.908 |
| NNSplice1 | 0.902 | 0.146 | 0.827 | 0.787 | 0.854 | 0.782 | 0.860 |
| NNSplice2 | 0.910 | 0.125 | 0.845 | 0.768 | 0.895 | 0.830 | 0.855 |
| HSF1 | 0.930 | 0.044 | 0.885 | 0.825 | 0.925 | 0.879 | 0.889 |
| HSF2 | 0.927 | 0.039 | 0.883 | 0.825 | 0.922 | 0.875 | 0.889 |
| AdaBoost | 0.963 | 0.708 | 0.924 | 0.881 | 0.952 | 0.923 | 0.925 |
| Random forests | 0.964 | 0.515 | 0.923 | 0.898 | 0.941 | 0.911 | 0.932 |
| ada_add^c | 0.977 | 0.612 | 0.937 | 0.907 | 0.955 | 0.930 | 0.941 |
| rf_add^d | 0.978 | 0.598 | 0.935 | 0.901 | 0.958 | 0.935 | 0.936 |
^a PPV: positive predictive value.
^b NPV: negative predictive value.
^c ada_add: AdaBoost with four additional scores included.
^d rf_add: random forests with four additional scores included.
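The measures in the table follow from a confusion matrix at a chosen cutoff. The sketch below computes them and selects the cutoff that maximizes accuracy, assuming a variant is called splice-altering when its score is at or above the cutoff. The data are made-up toy values and the function names are hypothetical.

```python
def confusion_counts(labels, scores, cutoff):
    """Count TP, FP, TN, FN when score >= cutoff is called positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        called_positive = s >= cutoff
        if called_positive and y == 1:
            tp += 1
        elif called_positive and y == 0:
            fp += 1
        elif y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def ratio(numerator, denominator):
    """Safe division; undefined measures are returned as NaN."""
    return numerator / denominator if denominator else float("nan")

def measures(tp, fp, tn, fn):
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "ppv": ratio(tp, tp + fp),           # positive predictive value
        "npv": ratio(tn, tn + fn),           # negative predictive value
    }

def best_cutoff(labels, scores):
    """Try every observed score as a cutoff and keep the most accurate one."""
    return max(sorted(set(scores)),
               key=lambda c: measures(*confusion_counts(labels, scores, c))["accuracy"])

# HYPOTHETICAL toy data: 1 = splice-altering, 0 = not.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.5, 0.45, 0.3, 0.2, 0.1]
cutoff = best_cutoff(toy_labels, toy_scores)
result = measures(*confusion_counts(toy_labels, toy_scores, cutoff))
print(cutoff, result)
```

Because PPV and NPV depend on the proportion of positives in the data, the values in the table reflect the class balance of this particular dataset.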
Figure 3. P-values of 10-fold cross-validated paired t-tests for AUCs between any two scores. The scores are ordered by AUC. The further a cell is from its vertical/horizontal labels, the larger the difference between the AUCs of the vertical/horizontal scores. Score pairs with significantly different AUCs are shown in bold. Comparisons between different score variations of the same tool are underlined. The remaining cells represent non-significant AUC differences.
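A paired t-test on cross-validated AUCs compares two scores fold by fold. The sketch below computes the t statistic from per-fold differences and compares it with the two-sided 5% critical value for 9 degrees of freedom (2.262). The per-fold AUCs are invented placeholders, not values from the study, and the standard library has no t distribution, so no exact p-value is computed.

```python
import math
import statistics

def paired_t_statistic(auc_a, auc_b):
    """Return (t statistic, degrees of freedom) for paired per-fold AUCs."""
    if len(auc_a) != len(auc_b) or len(auc_a) < 2:
        raise ValueError("need two equally long lists with at least two folds")
    diffs = [a - b for a, b in zip(auc_a, auc_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1

# HYPOTHETICAL per-fold AUCs for two scores (placeholders for illustration only).
hypothetical_auc_a = [0.96, 0.97, 0.95, 0.96, 0.98, 0.96, 0.97, 0.95, 0.96, 0.97]
hypothetical_auc_b = [0.94, 0.95, 0.94, 0.93, 0.96, 0.95, 0.94, 0.93, 0.95, 0.94]

# Two-sided critical value of Student's t at alpha = 0.05 with 9 degrees of freedom.
T_CRIT_DF9 = 2.262

t_stat, df = paired_t_statistic(hypothetical_auc_a, hypothetical_auc_b)
significant = df == 9 and abs(t_stat) > T_CRIT_DF9
print(round(t_stat, 2), df, significant)
```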
Trend for the distribution of predicted splice-altering scSNVs in COSMIC
| No. of times observed | No. of scSNVs | No. of predicted splice-altering scSNVs | Proportion |
|---|---|---|---|
| 1 or 2 | 41 455 | 29 322 | 0.707 |
| 3 or 4 | 299 | 222 | 0.742 |
| 5 or more | 129 | 109 | 0.845 |
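The proportions above rise with recurrence. One way to test such a trend is a Cochran-Armitage test; the record does not state which test the authors used, so the choice of test and the group scores 1, 2, 3 below are assumptions made for illustration.

```python
import math

# (recurrence group, number of scSNVs, number predicted splice-altering), from the table.
recurrence_groups = [
    ("1 or 2", 41455, 29322),
    ("3 or 4", 299, 222),
    ("5 or more", 129, 109),
]

proportions = {label: round(k / n, 3) for label, n, k in recurrence_groups}

def cochran_armitage_z(groups, scores):
    """Z statistic for a linear trend in proportions across ordered groups."""
    n_total = sum(n for _, n, _ in groups)
    k_total = sum(k for _, _, k in groups)
    p = k_total / n_total  # pooled proportion under the null hypothesis
    trend = sum(s * (k - n * p) for s, (_, n, k) in zip(scores, groups))
    sum_ns = sum(n * s for s, (_, n, _) in zip(scores, groups))
    sum_ns2 = sum(n * s * s for s, (_, n, _) in zip(scores, groups))
    variance = p * (1 - p) * (sum_ns2 - sum_ns ** 2 / n_total)
    return trend / math.sqrt(variance)

# ASSUMPTION: equally spaced scores for the three ordered groups.
z = cochran_armitage_z(recurrence_groups, scores=(1, 2, 3))
p_two_sided = math.erfc(abs(z) / math.sqrt(2))  # normal approximation
print(proportions, z, p_two_sided)
```

A positive z indicates that the proportion of predicted splice-altering scSNVs increases with the number of times a variant is observed.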
Distribution of predicted splice-altering scSNVs within cancer/non-cancer genes in COSMIC
| Gene | No. of scSNVs | No. of predicted splice-altering scSNVs | Proportion |
|---|---|---|---|
| Cancer | 2566 | 2025 | 0.789 |
| Non-cancer | 39 317 | 27 628 | 0.703 |
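The enrichment in cancer genes can be summarized from the table as a 2x2 comparison. The sketch below computes the odds ratio and a Pearson chi-square test with one degree of freedom; the record does not state which test the authors used, so the choice of test is an assumption.

```python
import math

# Counts from the table above: (number of scSNVs, number predicted splice-altering).
cancer_total, cancer_predicted = 2566, 2025
noncancer_total, noncancer_predicted = 39317, 27628

# 2x2 table cells: rows are gene class, columns are predicted / not predicted.
a = cancer_predicted
b = cancer_total - cancer_predicted
c = noncancer_predicted
d = noncancer_total - noncancer_predicted

odds_ratio = (a * d) / (b * c)

def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table."""
    n = a + b + c + d
    observed = [a, b, c, d]
    expected = [
        (a + b) * (a + c) / n, (a + b) * (b + d) / n,
        (c + d) * (a + c) / n, (c + d) * (b + d) / n,
    ]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

chi2 = pearson_chi2_2x2(a, b, c, d)
# Upper tail of a chi-square distribution with 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi2 / 2))
print(round(a / cancer_total, 3), round(c / noncancer_total, 3), odds_ratio, chi2, p_value)
```

An odds ratio above 1 means that scSNVs in cancer genes are more likely to be predicted splice-altering than those in non-cancer genes.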