Kuokuo Li, Tengfei Luo, Yan Zhu, Yuanfeng Huang, An Wang, Di Zhang, Lijie Dong, Yujian Wang, Rui Wang, Dongdong Tang, Zhen Yu, Qunshan Shen, Mingrong Lv, Zhengbao Ling, Zhenghuan Fang, Jing Yuan, Bin Li, Kun Xia, Xiaojin He, Jinchen Li, Guihu Zhao.
Abstract
A proportion of variants previously classified as benign or of uncertain significance in humans may induce abnormal splicing, and such variants are challenging to identify. An increasing number of methods have been developed to predict splicing variants, but their performance has not been fully evaluated using independent benchmarks. Here, we manually sourced ∼50 000 positive/negative splicing variants from >8000 studies and selected the independent splicing variants to evaluate the performance of prediction methods. These methods performed differently in recognizing splicing variants in donor and acceptor regions, which argues for applying different weight coefficients when predicting novel splicing variants. Of these methods, 66.67% exhibited higher specificities than sensitivities, suggesting that more moderate cut-off values are necessary to distinguish splicing variants. Moreover, the high correlation and consistent prediction ratio among methods supported the feasibility of integrating splicing prediction methods to identify splicing variants. We developed a splicing analytics platform called SPCards, which curates splicing variants from publications and predicts splicing scores of variants in genomes. SPCards also offers variant-level and gene-level annotation information, including allele frequency, non-synonymous prediction and comprehensive functional information. SPCards is suitable for high-throughput genetic identification of splicing variants, particularly those located in non-canonical splicing regions.
Year: 2022 PMID: 35993808 PMCID: PMC9458456 DOI: 10.1093/nar/gkac686
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 19.160
Figure 1. The workflow of SPCards.
Summary of positive splicing variants in SPCards
| Method | N-gene | N-variant | Donor: CSV | Donor: −3 to +8 except CSV | Donor: other splicing variants | Acceptor: CSV | Acceptor: −50 to +2 except CSV | Acceptor: other splicing variants |
|---|---|---|---|---|---|---|---|---|
| RT–PCR | 1084 | 3669 | 1080 | 852 | 322 | 750 | 431 | 234 |
| Minigene | 364 | 1142 | 232 | 327 | 121 | 166 | 186 | 110 |
| RNA-seq | 176 | 2080 | 96 | 101 | 760 | 78 | 496 | 549 |
| MFASS | 479 | 1051 | 164 | 111 | 224 | 85 | 273 | 194 |
| Experiment | 1081 | 7736 | 3319 | 1024 | 144 | 2719 | 490 | 40 |
| In silico analysis | 2063 | 6122 | 3277 | 261 | 48 | 2369 | 127 | 40 |
| Total | 3345 | 21 800 | 8168 | 2676 | 1619 | 6167 | 2003 | 1167 |
RT–PCR, reverse transcription–polymerase chain reaction; MFASS, multiplex functional assay of splicing using Sort-seq; Experiment, variants validated by experimental evidence, including minigene assay, site-directed mutagenesis or patient-derived RNA sample analysis, in the SQUIRLS database; CSV, canonical splicing variants. As the validation method of variants in ClinVar was not available, we classified these variants as in silico analysis. −3 to +8 except CSV, the donor splicing consensus region except canonical splicing variants; −50 to +2 except CSV, the region including the acceptor splicing consensus region, polypyrimidine tract and branch point except canonical splicing variants.
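The region categories used in the table can be illustrated with a small sketch. It assumes a signed offset relative to the exon-intron boundary with no position 0 (negative offsets upstream, positive downstream), and it assumes that CSV means the canonical intronic dinucleotide (donor +1/+2, acceptor −2/−1); the source does not spell out either convention. The function name `classify_splice_region` is hypothetical and is not part of SPCards.

```python
def classify_splice_region(site: str, offset: int) -> str:
    """Assign a variant to one of the table's region categories.

    site   -- "donor" or "acceptor"
    offset -- signed distance from the splice site, no position 0
              (assumed convention, not taken from the source)
    """
    if offset == 0:
        raise ValueError("offset 0 does not exist in this convention")
    if site == "donor":
        # Assumption: canonical donor positions are intronic +1 and +2.
        if offset in (1, 2):
            return "CSV"
        if -3 <= offset <= 8:
            return "-3 to +8 except CSV"
    elif site == "acceptor":
        # Assumption: canonical acceptor positions are intronic -2 and -1.
        if offset in (-2, -1):
            return "CSV"
        if -50 <= offset <= 2:
            return "-50 to +2 except CSV"
    else:
        raise ValueError("site must be 'donor' or 'acceptor'")
    return "Other splicing variants"


if __name__ == "__main__":
    for site, offset in [("donor", 1), ("donor", 5), ("acceptor", -30), ("acceptor", 10)]:
        print(site, offset, classify_splice_region(site, offset))
```

The labels in the code use an ASCII hyphen in place of the minus sign printed in the table headers.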
Performance evaluation based on the SPCards splicing data
| Methods | Positive variant (%) | Negative variant (%) | PPV | NPV | Specificity | FPR | Sensitivity | FNR | Accuracy | MCC | AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CADD-splice | 2998 (88.10) | 3378 (99.27) | 0.63 | | 0.65 | 0.35 | 0.68 | 0.32 | 0.66 | 0.33 | 0.73 |
| dbscSNV_ADA | 1492 (43.84) | 446 (13.11) | | 0.66 | 0.83 | 0.17 | | | | | |
| dbscSNV_RF | 1492 (43.84) | 446 (13.11) | 0.94 | 0.58 | 0.83 | 0.17 | 0.82 | 0.18 | 0.82 | | |
| dpsi_max_tissue | 2900 (85.22) | 3360 (98.74) | 0.84 | 0.63 | | | 0.37 | 0.63 | 0.68 | 0.38 | 0.75 |
| dpsi_zscore | 2900 (85.22) | 3360 (98.74) | 0.72 | 0.67 | 0.82 | 0.18 | 0.53 | 0.47 | 0.68 | 0.37 | 0.75 |
| ESRseq | 2995 (88.01) | 3377 (99.24) | 0.52 | 0.56 | 0.67 | 0.33 | 0.41 | 0.59 | 0.55 | 0.08 | 0.54 |
| GeneSplicer | 995 (29.24) | 855 (25.12) | 0.74 | 0.61 | 0.76 | 0.24 | 0.58 | 0.42 | 0.66 | 0.35 | 0.72 |
| KipoiSplice4 | 2739 (80.49) | 3253 (95.59) | 0.89 | | | | 0.53 | 0.47 | 0.76 | 0.53 | 0.72 |
| MaxEntScan | 2965 (87.13) | 3377 (99.24) | 0.57 | 0.68 | 0.52 | 0.48 | 0.72 | 0.28 | 0.62 | 0.25 | 0.63 |
| MMsplice | 2897 (85.13) | 3377 (99.24) | | 0.63 | | | 0.32 | 0.68 | 0.68 | 0.43 | 0.71 |
| regsnp | 1773 (52.10) | 1557 (45.75) | 0.89 | | 0.90 | 0.10 | 0.67 | 0.33 | 0.78 | 0.58 | 0.84 |
| SPiCE | 1622 (47.66) | 459 (13.49) | 0.92 | 0.68 | 0.70 | 0.30 | | | | | |
| SPiCE_MES | 1622 (47.66) | 460 (13.51) | 0.94 | 0.60 | 0.80 | 0.20 | | | | | |
| SPiCE_SSF | 1622 (47.66) | 460 (13.51) | 0.94 | 0.53 | 0.82 | 0.18 | 0.79 | 0.21 | 0.80 | 0.53 | 0.88 |
| SpliceAI | 1680 (49.37) | 159 (4.67) | | 0.23 | 0.84 | 0.16 | 0.73 | 0.27 | 0.74 | 0.34 | 0.83 |
| Spliceogen | 2995 (88.01) | 3377 (99.24) | 0.83 | 0.66 | 0.91 | | 0.48 | 0.52 | 0.71 | 0.44 | 0.72 |
| Squirl | 2995 (88.01) | 3377 (99.24) | 0.74 | | 0.81 | 0.19 | 0.61 | 0.39 | 0.71 | 0.43 | 0.78 |
| Synvep | 373 (10.96) | 585 (17.19) | 0.45 | 0.67 | 0.56 | 0.44 | 0.56 | 0.44 | 0.56 | 0.12 | 0.59 |
SpliceAI: SpliceAI scores > 0.1 were integrated into SPCards. The benchmark data contained 3403 positive variants and 3403 negative variants; the percentages in the "Positive variant" and "Negative variant" columns are relative to these totals. PPV, positive predictive value; NPV, negative predictive value; FNR, false-negative rate; Sensitivity, true-positive rate; FPR, false-positive rate; Specificity, true-negative rate; MCC, Matthews correlation coefficient; AUC, area under the curve. Values in bold are the top performances of predicted methods.
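For readers who want to recompute the threshold-based columns, the sketch below derives them from confusion-matrix counts using the standard textbook definitions. The function name `binary_metrics` and the example counts are hypothetical; the source does not publish per-method confusion matrices here, and AUC is omitted because it requires the raw scores rather than binary calls.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the table's threshold-based metrics from confusion counts.

    Assumes every denominator is non-zero, i.e. the method scored at
    least one positive and one negative variant and made both kinds of call.
    """
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "Specificity": tn / (tn + fp),   # true-negative rate
        "FPR": fp / (tn + fp),           # 1 - specificity
        "Sensitivity": tp / (tp + fn),   # true-positive rate
        "FNR": fn / (tp + fn),           # 1 - sensitivity
        "Accuracy": (tp + tn) / (tp + fp + tn + fn),
        "MCC": (tp * tn - fp * fn) / mcc_denominator,
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only, not taken from the benchmark.
    for name, value in binary_metrics(tp=80, fp=10, tn=90, fn=20).items():
        print(f"{name}: {value:.2f}")
```

Because Specificity + FPR and Sensitivity + FNR each equal 1 by definition, the paired columns in the table can be cross-checked against each other.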
Figure 2. Performance of splicing prediction methods within three regions based on benchmark data. We used integrated functionally validated splicing variants reported from 2019 to 2021 in SPCards. (A) All splicing variants. (B) Donor −3 to +8, the donor splicing consensus region: variants from three bases upstream to eight bases downstream of the donor site. (C) Acceptor −50 to +2: variants from 50 bases upstream to two bases downstream of the acceptor site, including the acceptor splicing consensus region, potential polypyrimidine tract and potential branch point.
Figure 3. Correlation and consistent prediction ratio among splicing prediction methods. We retained only variants that had prediction scores in both methods for the correlation and consistent prediction ratio analysis. (A) Pearson's correlation coefficients (R). (B) Consistent prediction ratio of binary predictions between pairs of splicing methods. The thresholds of the splicing prediction methods are shown in Supplementary Table S6.
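The pairwise comparison described in this caption can be sketched as follows. This is a minimal illustration, assuming that the consistent prediction ratio is the fraction of shared variants on which two methods make the same binary call and that a score at or above a method's cut-off counts as a predicted splicing variant. The function names, variant identifiers, scores and cut-offs are all hypothetical; the actual thresholds are in Supplementary Table S6, which is not reproduced here.

```python
import math


def pearson_r(xs: list, ys: list) -> float:
    """Pearson's correlation coefficient for two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


def compare_methods(scores_a: dict, scores_b: dict, cutoff_a: float, cutoff_b: float):
    """Return (Pearson R, consistent prediction ratio) for two methods.

    Only variants scored by both methods are retained, as in the caption.
    """
    shared = sorted(set(scores_a) & set(scores_b))
    xs = [scores_a[v] for v in shared]
    ys = [scores_b[v] for v in shared]
    # A pair of calls is consistent when both methods fall on the same
    # side of their own cut-off (assumed definition).
    consistent = sum((x >= cutoff_a) == (y >= cutoff_b) for x, y in zip(xs, ys))
    return pearson_r(xs, ys), consistent / len(shared)


if __name__ == "__main__":
    # Hypothetical scores keyed by variant identifier.
    method_a = {"v1": 0.9, "v2": 0.2, "v3": 0.6, "v4": 0.1}
    method_b = {"v1": 0.8, "v2": 0.3, "v3": 0.4, "v5": 0.7}
    r, ratio = compare_methods(method_a, method_b, cutoff_a=0.5, cutoff_b=0.5)
    print(f"R = {r:.3f}, consistent prediction ratio = {ratio:.3f}")
```

Restricting to shared variants matters because coverage differs widely between methods, so the two statistics are only comparable within a pair.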
Integrated data sources in SPCards
| Category | Data source |
|---|---|
| Variant-level | |
| Allele frequency | gnomAD, ExAC, ESP6500, 1000 Genomes Project, Kaviar, HRC |
| Splicing prediction | CADD-splice, SpliceAI, dpsi_max_tissue, dpsi_zscore, dbscSNV_ADA, dbscSNV_RF, MaxEntScan, GeneSplicer, ESRseq, Spliceogen, Squirl, regsnp, MMsplice, KipoiSplice4, Synvep, SPiCE_SSF, SPiCE_MES, SPiCE |
| Non-synonymous prediction | ReVe, SIFT, SIFT4G, Polyphen2_HDIV, Polyphen2_HVAR, LRT, MutationTaster, MutationAssessor, FATHMM, PROVEAN, VEST4, MetaSVM, MetaLR, MetaRNN, M-CAP, REVEL, MutPred, MVP, MPC, PrimateAI, DEOGEN2, BayesDel_addAF, BayesDel_noAF, ClinPred, LIST-S2, Aloft, CADD_coding, DANN, fathmm-MKL_coding_pred, fathmm-XF_coding, Eigen-raw_coding, Eigen-PC-raw_coding, GenoCanyon_score, integrated_fitCons, GM12878_fitCons, H1-hESC_fitCons, HUVEC_fitCons, LINSIGHT, GERP++_RS, phyloP100way_vertebrate, phyloP30way_mammalian, phyloP17way_primate, phastCons100way_vertebrate, phastCons30way_mammalian, phastCons17way_primate, SiPhy_29way_logOdds, bStatistic_converted |
| Disease-related | Gene4Denovo, ClinVar, InterVar, ICGC, COSMIC, NCI |
| Gene-level | |
| Basic information | UniProtKB, UniProt, Gene Ontology, InterPro, InBio Map, BioSystems |
| Genic intolerance | RVIS, LoFtool, GDI, Episcore, heptanucleotide context intolerance score, pLI |
| Disease-related | OMIM, MGI, HPO |
| Gene expression | BrainSpan, GTEx, The Human Protein Atlas |
| Target drug | DGIdb |
gnomAD, Genome Aggregation Database; ExAC, The Exome Aggregation Consortium; ESP6500, NHLBI GO Exome Sequencing Project; Kaviar, Kaviar Genomic Variant Database; HRC, Haplotype Reference Consortium; RVIS, Residual Variation Intolerance Score; GDI, Human Gene Damage Index; pLI, the probability of being loss-of-function intolerant; OMIM, Online Mendelian Inheritance in Man; MGI, Mouse Genome Informatics; ICGC, International Cancer Genome Consortium; COSMIC, Catalogue of Somatic Mutations in Cancer; NCI, NCI-60 Human Tumor Cell Lines Screen; HPO, Human Phenotype Ontology; GTEx, Genotype-Tissue Expression; DGIdb, The Drug Gene Interaction Database.