Jun-Ichi Takeda, Sae Fukami, Akira Tamura, Akihide Shibata, Kinji Ohno.
Abstract
Prediction of the effect of a single-nucleotide variant (SNV) in an intronic region on aberrant pre-mRNA splicing is challenging except for an SNV affecting the canonical GU/AG splice sites (ss). To predict pathogenicity of SNVs at intronic positions -50 (Int-50) to -3 (Int-3) close to the 3' ss, we developed light gradient boosting machine (LightGBM)-based IntSplice2 models using pathogenic SNVs in the Human Gene Mutation Database (HGMD) and ClinVar and common SNVs in dbSNP with 0.01 ≤ minor allele frequency (MAF) < 0.50. The LightGBM models were generated using features representing splicing cis-elements. The average recall/sensitivity and specificity of IntSplice2 by fivefold cross-validation (CV) of the training dataset were 0.764 and 0.884, respectively. The recall/sensitivity of IntSplice2 was lower than the average recall/sensitivity of 0.800 of IntSplice, which we previously built with support vector machine (SVM) modeling for the same intronic positions. In contrast, the specificity of IntSplice2 was higher than the average specificity of 0.849 of IntSplice. For benchmarking (BM) of IntSplice2 against IntSplice, we made a test dataset that was not used to train IntSplice. After excluding the test dataset from the training dataset, we generated IntSplice2-BM and compared it with IntSplice using the test dataset. IntSplice2-BM was superior to IntSplice in all seven statistical measures: accuracy, precision, recall/sensitivity, specificity, F1 score, negative predictive value (NPV), and Matthews correlation coefficient (MCC). We made the IntSplice2 web service available at https://www.med.nagoya-u.ac.jp/neurogenetics/IntSplice2.
Keywords: LightGBM; aberrant splicing; intronic mutations; single nucleotide variations; splice acceptor site
Year: 2021 PMID: 34349788 PMCID: PMC8326971 DOI: 10.3389/fgene.2021.701076
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
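Seven statistical measures of IntSplice2, as indicated in the Human Mutation guidelines (Vihinen, 2013; Grimm et al., 2015), by fivefold CV of Training Dataset-1787.

| Model | Accuracy | Precision | Recall/Sensitivity | Specificity | F1 score | NPV | MCC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| IntSplice2 | 0.826 | 0.861 | 0.764 | 0.884 | 0.809 | 0.800 | 0.654 |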
| Measure | Formula | Definition |
| --- | --- | --- |
| Accuracy | $\frac{TP + TN}{TP + TN + FP + FN}$ | Rate of correctly predicted true positives and true negatives in the whole dataset |
| Precision/Positive Prediction Value (PPV) | $\frac{TP}{TP + FP}$ | Rate of true positives in predicted positives |
| Recall/Sensitivity | $\frac{TP}{TP + FN}$ | Rate of true positives in actual positives |
| Specificity | $\frac{TN}{TN + FP}$ | Rate of true negatives in actual negatives |
| F1 score | $\frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ | Harmonic mean of precision and recall. Higher precision and higher recall increase the F1 score, but a discrepancy between precision and recall lowers it |
| NPV | $\frac{TN}{TN + FN}$ | Rate of true negatives in predicted negatives |
| MCC | $\frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$ | Correlation coefficient between the actual and predicted binary conditions. Unlike the other measures, MCC balances the ratio between actual positives and actual negatives |

| Predicted condition | Actual positive | Actual negative |
| --- | --- | --- |
| Predicted positive | True positive (TP) | False positive (FP) |
| Predicted negative | False negative (FN) | True negative (TN) |
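
The definitions above reduce to a few lines of arithmetic on the four confusion-matrix counts. The following Python sketch computes all seven measures from those counts. The function name `seven_measures` and the example counts are illustrative assumptions and are not taken from the paper or its datasets.

```python
from math import sqrt


def seven_measures(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Compute the seven measures from confusion-matrix counts.

    Assumes every denominator is non-zero, i.e. both classes are present
    and the model predicts each class at least once.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        # Harmonic mean of precision and recall.
        "f1": 2 * precision * recall / (precision + recall),
        "npv": tn / (tn + fn),
        "mcc": (tp * tn - fp * fn)
        / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


# Hypothetical counts chosen only for illustration (not from the paper).
example = seven_measures(tp=80, fp=10, fn=20, tn=90)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

The sketch does not guard against empty classes; a production version would need to decide how to report a measure whose denominator is zero.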
FIGURE 1. Evaluation of IntSplice2 by fivefold CV. (A) Five iterated and mean ROC curves with AUROCs. (B) Five iterated and mean PR curves with AUPRs.
FIGURE 2. The top 10 most important of the 110 features of IntSplice2.
Seven statistical measures of the IntSplice2-BM and IntSplice models using Test Dataset-288, which has no circularity with the respective training datasets.

| Model | Accuracy | Precision | Recall/Sensitivity | Specificity | F1 score | NPV | MCC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| IntSplice2-BM | 0.826 | 0.873 | 0.764 | 0.889 | 0.815 | 0.790 | 0.658 |
| IntSplice | 0.802 | 0.854 | 0.729 | 0.875 | 0.787 | 0.764 | 0.611 |
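
As a consistency check on the Test Dataset-288 values, the reported figures can be reproduced from whole-number confusion-matrix counts. The sketch below assumes that Test Dataset-288 contains 144 pathogenic and 144 common SNVs. This even split is an assumption made for illustration and is not stated in this excerpt. Under that assumption, recall and specificity fix TP and TN, and the remaining measures follow from the counts.

```python
from math import sqrt

# ASSUMPTION: a balanced test set (144 positives, 144 negatives).
N_POS = 144
N_NEG = 144

# Recall and specificity as reported for Test Dataset-288.
reported = {
    "IntSplice2-BM": {"recall": 0.764, "specificity": 0.889},
    "IntSplice": {"recall": 0.729, "specificity": 0.875},
}


def back_calculate(recall: float, specificity: float) -> dict[str, float]:
    """Recover integer counts from recall/specificity, then derive the rest."""
    tp = round(recall * N_POS)
    tn = round(specificity * N_NEG)
    fn = N_POS - tp
    fp = N_NEG - tn
    return {
        "accuracy": (tp + tn) / (N_POS + N_NEG),
        "precision": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "npv": tn / (tn + fn),
        "mcc": (tp * tn - fp * fn)
        / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


derived = {name: back_calculate(**vals) for name, vals in reported.items()}
for name, measures in derived.items():
    print(name, {k: round(v, 3) for k, v in measures.items()})
```

With these assumed counts (TP = 110 and TN = 128 for IntSplice2-BM; TP = 105 and TN = 126 for IntSplice), the derived accuracy, precision, F1 score, NPV, and MCC agree with the tabulated values to three decimal places.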
FIGURE 3. A representative screenshot of the output of the IntSplice2 web service. As previously reported, g.73550880G>A on chromosome 10 (GRCh37/hg19), identified in a patient with Usher syndrome, is at the ninth nucleotide from the 3' end of intron 45 of CDH23. When a user chooses "GRCh37/hg19" and enters the chromosome number "10" and the genomic coordinate "73550880", the IntSplice2 web service returns the result in the same browser window.
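
The positional convention used by IntSplice2 (Int-50 to Int-3, counted back from the 3' end of the intron) can be made concrete with a small helper. The sketch below is illustrative only. The function names are hypothetical, and the exon start coordinate 73550889 is not given in the source: it is inferred from the statement that g.73550880 is the ninth nucleotide from the 3' end of the intron, assuming CDH23 is transcribed on the plus strand.

```python
def intronic_position(variant_pos: int, exon_first_base: int, strand: str = "+") -> int:
    """Return the position relative to the 3' splice site (e.g. -9 for Int-9).

    exon_first_base is the genomic coordinate of the first nucleotide of the
    downstream exon; the last intronic nucleotide is position -1.
    """
    if strand == "+":
        offset = variant_pos - exon_first_base
    elif strand == "-":
        # On the minus strand the intron lies at higher genomic coordinates.
        offset = exon_first_base - variant_pos
    else:
        raise ValueError("strand must be '+' or '-'")
    if offset >= 0:
        raise ValueError("variant is not upstream of the exon's first base")
    return offset


def in_intsplice2_range(position: int) -> bool:
    """True for Int-50 to Int-3, the window covered by IntSplice2."""
    return -50 <= position <= -3


# ASSUMPTION: exon first base inferred from the figure caption, not looked up.
EXON_FIRST_BASE = 73550889
pos = intronic_position(73550880, EXON_FIRST_BASE, strand="+")
print(f"Int{pos}", in_intsplice2_range(pos))
```

Positions -2 and -1 (the canonical AG dinucleotide) fall outside this window, which matches the abstract's statement that IntSplice2 targets Int-50 to Int-3.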