Tabea V Riepe1,2, Mubeen Khan2, Susanne Roosing2, Frans P M Cremers2, Peter A C 't Hoen1.
Abstract
Hereditary disorders are frequently caused by genetic variants that affect pre-messenger RNA splicing. Though genetic variants in the canonical splice motifs almost always disrupt splicing, the pathogenicity of variants in the noncanonical splice sites (NCSS) and deep intronic (DI) regions is difficult to predict. Multiple splice prediction tools have been developed for this purpose, with the latest tools employing deep learning algorithms. We benchmarked established and deep learning splice prediction tools on published gold standard sets of 71 NCSS and 81 DI variants in the ABCA4 gene and 61 NCSS variants in the MYBPC3 gene with functional assessment in midigene and minigene splice assays. The selection of splice prediction tools included CADD, DSSP, GeneSplicer, MaxEntScan, MMSplice, NNSPLICE, SPIDEX, SpliceAI, SpliceRover, and SpliceSiteFinder-like. Based on the area under the receiver operating characteristic curve, the best-performing tool was SpliceRover for ABCA4 NCSS variants, SpliceAI for ABCA4 DI variants, and the Alamut 3/4 consensus approach (GeneSplicer, MaxEntScan, NNSPLICE, and SpliceSiteFinder-like) for NCSS variants in MYBPC3. Overall, the performance in a real-time clinical setting is much more modest than reported by the developers of the tools.
Keywords: ABCA4; MYBPC3; RNA splicing; deep learning; splice prediction tools; variant effect prediction
Year: 2021 PMID: 33942434 PMCID: PMC8360004 DOI: 10.1002/humu.24212
Source DB: PubMed Journal: Hum Mutat ISSN: 1059-7794 Impact factor: 4.878
Overview of the most important properties of the different splice prediction tools
| Tool | Approach | Algorithm | Score range | Characteristic | Training data | Input data | Nucleotide positions | Interface | Year |
|---|---|---|---|---|---|---|---|---|---|
| CADD | Support vector machine with linear kernel | ML | – | Integrates more than 60 genomic features into a single score | 13,141,299 SNVs, 627,071 insertions and 926,968 deletions from simulated and observed variants | VCF file | – | Website | 2014 |
| DSSP | CNN with long short‐term memory | DL | 0–1 | Individual prediction for SDS and SAS | HS3D | 140 nt sequence with the consensus sequence in the middle | 140 nt | Python script | 2018 |
| GeneSplicer | Decision tree and Markov model | ML | 0–15 | Markov model captures additional dependencies among neighboring bases at splice sites | 1323 plant genes and 1115 human genes | FASTA sequence | Up to 80 nt on both sites of splice site | Alamut | 2001 |
| MaxEntScan | Maximum entropy | Other | 0–12 | Use of different constraints sorted by the effect on entropy, only second‐order dependencies | 1821 nonredundant transcripts with 12,715 introns | 9‐mer FASTA sequence | 9 nt at SAS, 23 nt at SDS | Alamut | 2004 |
| MMSplice | Individual modules scoring exon, intron, and splice sites | DL | – | Predicts quantitative physical measures of splicing | Vex‐seq + GENCODE | VCF file | All nucleotides in intron, exon, intron structure | Python package | 2018 |
| NNSPLICE | Hidden Markov model and neural network | ML | 0–1 | Captures pairwise correlations between adjacent nucleotides | 285 multiple‐exon human DNA sequences from GenBank | FASTA sequence | −7 to +8 at SAS, −21 to +20 at SDS | Alamut | 1997 |
| SPIDEX | Bayesian modeling | ML | 0–1 | Tissue‐specific PSI values | Illumina Human Body Map 2.0 project | VCF file | Depending on features, up to 2000 nt in introns and 300 nt in exons | Txt file with precomputed values | 2015 |
| SpliceAI | Deep learning with ResNet blocks | DL | 0–1 | Predicts nucleosome positioning from sequence | GENCODE | VCF file | 10,000 nt | Python package | 2019 |
| SpliceRover | CNN | DL | 0–1 | Identifies regions/structures of interest by normalizing contribution scores, and individual models for SDS and SAS | Human and plant | FASTA sequence | Minimal 400 nt | Website | 2018 |
| SpliceSiteFinder‐like | Position weight matrices | Other | 0–100 | – | – | – | – | Alamut | 1987 |
Abbreviations: CNN, convolutional neural network; DL, deep learning; ML, machine learning; nt, nucleotides; SAS, splice acceptor site; SDS, splice donor site; VCF, variant call format.
Figure 1. Variant effect on splicing and splice site. (a) Distribution of splice‐altering variants and distribution of variants that affected either the splice acceptor site (SAS) or splice donor site (SDS) in the ABCA4 NCSS, ABCA4 DI, and MYBPC3 NCSS data sets. (b, c) Plot of the number of splice‐altering and nonsplice‐altering NCSS variants present at the SDS (+3 to +6, panel b) and SAS (−14 to −3, panel c) and the first or last two nucleotides of the exon, and the number of variants found to affect splicing
Figure 2. Receiver operator curve (ROC) and area under the curve (AUC) for the five splice prediction tools with the highest AUC for each data set. ROC curves for (a) ABCA4 NCSS variants, (b) ABCA4 DI variants, and (c) MYBPC3 NCSS variants. The AUC values are given in the insets
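The AUC reported for each tool can be read as the probability that a randomly chosen splice-altering variant receives a higher score than a randomly chosen non-splice-altering one. The following is a minimal sketch of that rank-based definition, not the authors' pipeline: the function name `roc_auc` and the toy scores and labels are made up for illustration, and it assumes that a higher score means a stronger predicted splice effect.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 = splice-altering in the assay, 0 = no splice effect.
    Ties between a positive and a negative count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one variant of each class")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Made-up scores for six hypothetical variants (not taken from the study).
toy_scores = [0.91, 0.40, 0.35, 0.80, 0.10, 0.35]
toy_labels = [1, 1, 0, 1, 0, 1]
toy_auc = roc_auc(toy_scores, toy_labels)
print(f"toy AUC = {toy_auc:.4f}")
```

The pairwise loop is quadratic in the number of variants, which is acceptable for data sets of the size used here (61 to 81 variants).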
Figure 3. Comparison of the area under the curve (AUC) for all tools in the three different data sets. In addition to the individual tools, the Alamut 3/4 consensus was included. The best tool for each category is highlighted in dark blue. For the other category, both the Alamut 3/4 consensus approach and MaxEntScan showed comparable high AUC values and are, therefore, highlighted
Comparison of the optimal thresholds for each data set with the suggested threshold by the developers
| Tool | ABCA4 NCSS | ABCA4 DI | MYBPC3 NCSS | Suggested threshold |
|---|---|---|---|---|
| CADD | 2.66 | 0.24 | 2.09 | 5ʹ extended: 7.39, 3ʹ intronic: 0.0964, exonic: 0.39 |
| DSSP | 0.01 | 0.13 | 0.01 | 0.30 |
| GeneSplicer | 0.18 | 0.05 | 0.21 | – |
| MaxEntScan | 0.26 | 0.31 | 0.24 | 0.10 |
| MMSplice | 1.42 | – | 1.37 | 2 |
| NNSPLICE | 0.13 | 0.40 | 0.30 | 0.05 |
| Spidex | 0.86 | – | 1.72 | 5 |
| SpliceAI | 0.19 | 0.18 | 0.11 | 0.20 |
| SpliceRover | 0.18 | 0.26 | 0.10 | – |
| SpliceSiteFinder‐like | 0.01 | 0.12 | 0.09 | 0.05 |
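A per-tool threshold turns a continuous score into a binary call, and a 3/4 consensus then combines four such calls. The sketch below illustrates that idea under stated assumptions: this section does not define the consensus rule, so "at least three of the four Alamut tools reach their cutoff" is our reading of "3/4"; the function name `consensus_call` is hypothetical; the per-variant scores are made up; and the cutoffs are copied from the first value column of the table above, assuming that the variant scores are on the same scale as those cutoffs.

```python
def consensus_call(scores, thresholds, min_votes=3):
    """Return True when at least `min_votes` tools reach their cutoff.

    scores: mapping of tool name to that tool's score for one variant
            (None or a missing key means the tool gave no prediction).
    thresholds: mapping of tool name to its cutoff.
    """
    votes = 0
    for tool, cutoff in thresholds.items():
        score = scores.get(tool)
        if score is not None and score >= cutoff:
            votes += 1
    return votes >= min_votes


# Cutoffs from the first value column of the thresholds table.
thresholds = {
    "GeneSplicer": 0.18,
    "MaxEntScan": 0.26,
    "NNSPLICE": 0.13,
    "SpliceSiteFinder-like": 0.01,
}

# Made-up scores for two hypothetical variants.
variant_a = {"GeneSplicer": 0.25, "MaxEntScan": 0.30,
             "NNSPLICE": 0.05, "SpliceSiteFinder-like": 0.02}
variant_b = {"GeneSplicer": 0.10, "MaxEntScan": 0.30,
             "NNSPLICE": None, "SpliceSiteFinder-like": 0.02}

call_a = consensus_call(variant_a, thresholds)  # three tools reach the cutoff
call_b = consensus_call(variant_b, thresholds)  # only two tools reach the cutoff
```

Treating a missing prediction as a non-vote is a design choice of this sketch; a stricter variant could instead refuse to call variants with missing scores.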
Confusion matrix and statistical measures of the ABCA4 NCSS variants
| Tool | Missing values | TP | FP | TN | FN | Accuracy (%) | PPV (%) | Sensitivity (%) | Specificity (%) | NPV (%) | MCC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Alamut Consensus 3/4 | 0 | 35 | 2 | 5 | 29 | 56 | 95 | 55 | 71 | 15 | 0.16 |
| CADD | 0 | 40 | 3 | 4 | 24 | 62 | 93 | 63 | 57 | 14 | 0.12 |
| DSSP | 0 | 51 | 2 | 5 | 13 | 79 | 96 | 80 | 71 | 28 | 0.35 |
| GeneSplicer | 0 | 31 | 3 | 4 | 33 | 49 | 91 | 48 | 57 | 11 | 0.03 |
| MaxEntScan | 0 | 40 | 2 | 5 | 24 | 63 | 95 | 63 | 71 | 17 | 0.21 |
| MMSplice | 0 | 43 | 2 | 5 | 21 | 68 | 96 | 67 | 71 | 19 | 0.24 |
| NNSPLICE | 0 | 42 | 2 | 5 | 22 | 66 | 95 | 66 | 71 | 19 | 0.23 |
| Spidex | 5 | 43 | 2 | 5 | 21 | 68 | 96 | 67 | 71 | 19 | 0.24 |
| SpliceAI | 0 | 50 | 1 | 6 | 14 | 79 | 98 | 78 | 86 | 30 | 0.42 |
| SpliceRover | 0 | 48 | 1 | 6 | 16 | 76 | 98 | 75 | 86 | 27 | 0.39 |
| SpliceSiteFinder‐like | 0 | 40 | 2 | 5 | 24 | 63 | 95 | 63 | 71 | 17 | 0.21 |
Abbreviations: FN, false negatives; FP, false positives; MCC, Matthews correlation coefficient; NPV, negative predictive value; PPV, positive predictive value; TN, true negatives; TP, true positives.
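The statistical measures in these tables follow directly from the four confusion-matrix counts. As a minimal sketch using the standard definitions (the function name `confusion_metrics` is ours, not the authors'), the SpliceAI row for the ABCA4 NCSS variants (TP 50, FP 1, TN 6, FN 14) can be recomputed like this:

```python
from math import sqrt, nan


def _ratio(numerator, denominator):
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else nan


def confusion_metrics(tp, fp, tn, fn):
    """Standard binary classification measures from confusion-matrix counts."""
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "ppv": _ratio(tp, tp + fp),          # positive predictive value
        "sensitivity": _ratio(tp, tp + fn),  # true positive rate
        "specificity": _ratio(tn, tn + fp),  # true negative rate
        "npv": _ratio(tn, tn + fn),          # negative predictive value
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


# SpliceAI, ABCA4 NCSS variants, counts taken from the table above.
spliceai_ncss = confusion_metrics(tp=50, fp=1, tn=6, fn=14)
for name, value in spliceai_ncss.items():
    print(f"{name}: {value:.2f}")
```

Rounded to whole percentages and two decimals for the MCC, these values agree with the SpliceAI row of the table. With only 7 non-splice-altering variants in this data set, specificity and NPV rest on very few observations, which is why the MCC is a useful complement to accuracy here.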
Confusion matrix and statistical measures of the ABCA4 DI variants
| Tool | Missing values | TP | FP | TN | FN | Accuracy (%) | PPV (%) | Sensitivity (%) | Specificity (%) | NPV (%) | MCC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Alamut Consensus 3/4 | 0 | 11 | 14 | 46 | 10 | 70 | 44 | 52 | 77 | 82 | 0.28 |
| CADD | 0 | 12 | 24 | 36 | 9 | 59 | 33 | 57 | 60 | 80 | 0.15 |
| DSSP | 0 | 13 | 19 | 41 | 8 | 67 | 41 | 62 | 68 | 84 | 0.27 |
| GeneSplicer | 0 | 14 | 16 | 44 | 7 | 72 | 47 | 67 | 73 | 86 | 0.36 |
| MaxEntScan | 0 | 13 | 21 | 39 | 8 | 64 | 38 | 62 | 65 | 83 | 0.24 |
| NNSPLICE | 0 | 14 | 17 | 43 | 7 | 70 | 45 | 67 | 72 | 86 | 0.35 |
| SpliceAI | 0 | 19 | 3 | 57 | 2 | 94 | 86 | 90 | 95 | 97 | 0.84 |
| SpliceRover | 0 | 15 | 14 | 46 | 6 | 75 | 52 | 71 | 77 | 88 | 0.44 |
| SpliceSiteFinder‐like | 0 | 11 | 27 | 33 | 10 | 54 | 29 | 52 | 55 | 77 | 0.06 |
Abbreviations: FN, false negatives; FP, false positives; MCC, Matthews correlation coefficient; NPV, negative predictive value; PPV, positive predictive value; TN, true negatives; TP, true positives.
Confusion matrix and statistical measures of the MYBPC3 NCSS variants
| Tool | Missing values | TP | FP | TN | FN | Accuracy (%) | PPV (%) | Sensitivity (%) | Specificity (%) | NPV (%) | MCC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Alamut Consensus 3/4 | 0 | 23 | 4 | 23 | 11 | 75 | 85 | 68 | 85 | 68 | 0.53 |
| CADD | 0 | 21 | 8 | 19 | 13 | 66 | 72 | 62 | 70 | 59 | 0.32 |
| DSSP | 0 | 22 | 10 | 17 | 12 | 64 | 69 | 65 | 63 | 59 | 0.28 |
| GeneSplicer | 0 | 25 | 6 | 21 | 9 | 75 | 81 | 74 | 78 | 70 | 0.51 |
| MaxEntScan | 0 | 24 | 6 | 21 | 10 | 74 | 80 | 71 | 78 | 68 | 0.48 |
| MMSplice | 0 | 25 | 6 | 21 | 9 | 75 | 81 | 74 | 78 | 70 | 0.51 |
| NNSPLICE | 0 | 23 | 6 | 21 | 11 | 72 | 79 | 68 | 78 | 66 | 0.45 |
| Spidex | 3 | 20 | 9 | 18 | 14 | 62 | 69 | 59 | 67 | 56 | 0.25 |
| SpliceAI | 0 | 22 | 8 | 19 | 12 | 67 | 73 | 65 | 70 | 61 | 0.35 |
| SpliceRover | 0 | 22 | 9 | 18 | 12 | 66 | 71 | 65 | 67 | 60 | 0.31 |
| SpliceSiteFinder‐like | 0 | 25 | 6 | 21 | 9 | 75 | 81 | 74 | 78 | 70 | 0.51 |
Abbreviations: FN, false negatives; FP, false positives; MCC, Matthews correlation coefficient; NPV, negative predictive value; PPV, positive predictive value; TN, true negatives; TP, true positives.