Somayah Albaradei, Arturo Magana-Mora, Maha Thafar, Mahmut Uludag, Vladimir B Bajic, Takashi Gojobori, Magbubah Essack, Boris R Jankovic.
Abstract
BACKGROUND: Accurate identification of exon/intron boundaries is critical for the correct annotation of genes with multiple exons. Donor and acceptor splice sites (SS) demarcate these boundaries. Therefore, deriving accurate computational models to predict SS is useful for the functional annotation of genes and genomes, and for finding alternative SS associated with different diseases. Although various models have been proposed for the in silico prediction of SS, their accuracy must improve before annotation can be considered reliable. Moreover, models are often derived and tested using the same genome, providing no evidence of broad applicability, i.e., to other, poorly studied genomes.
Keywords: AUC, area under curve; AcSS, acceptor splice site; Acc, accuracy; Bioinformatics; CNN, convolutional neural network; CONV, convolutional layers; DL, deep learning; DNA, deoxyribonucleic acid; DT, decision trees; Deep-learning; DoSS, donor splice site; FC, fully connected layer; ML, machine learning; NB, naive Bayes; NN, neural network; POOL, pooling layer; Prediction; RF, random forest; RNA, ribonucleic acid; ReLU, rectified linear unit layer; SS, splice site; SVM, support vector machine; Sn, sensitivity; Sp, specificity; Splice sites; Splicing
Year: 2020 PMID: 32550561 PMCID: PMC7285987 DOI: 10.1016/j.gene.2020.100035
Source DB: PubMed Journal: Gene X ISSN: 2590-1583
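Statistical measures used to assess the performance of the models. TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives.
| Measure | Equation |
|---|---|
| Accuracy (Acc) | $\frac{TP + TN}{TP + TN + FP + FN}$ |
| Specificity (Sp) | $\frac{TN}{TN + FP}$ |
| Sensitivity (Sn) | $\frac{TP}{TP + FN}$ |
| F1 Score (F1) | $\frac{2\,TP}{2\,TP + FP + FN}$ |
| Error rate | $1 - \mathrm{Acc}$ |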
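The measures listed above can be computed directly from confusion-matrix counts. The following is a minimal sketch using the standard definitions of accuracy, specificity, sensitivity, and F1; the function name `classification_metrics` and the example counts are assumptions made for illustration and are not taken from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return Acc, Sp, Sn, F1, and error rate from confusion-matrix counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total            # fraction of all samples classified correctly
    sp = tn / (tn + fp)                # fraction of false SS correctly rejected
    sn = tp / (tp + fn)                # fraction of true SS correctly detected
    f1 = 2 * tp / (2 * tp + fp + fn)   # harmonic mean of precision and sensitivity
    return {"Acc": acc, "Sp": sp, "Sn": sn, "F1": f1, "Error rate": 1 - acc}


# Hypothetical counts, chosen only for illustration (not results from the paper).
metrics = classification_metrics(tp=90, tn=95, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Multiplying each value by 100 gives percentages on the same scale as the performance tables in this record.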
Performance metrics for the detection of donor and acceptor SS by Splice2Deep on five organisms.
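| Splice site | Organism | Acc | Sp | Sn | F1 | AUC |
|---|---|---|---|---|---|---|
| AcSS | Homo sapiens | 96.91 | 97.80 | 95.61 | 96.91 | 98.69 |
| AcSS | Arabidopsis thaliana | 95.21 | 94.86 | 95.53 | 95.22 | 98.31 |
| AcSS | Oryza sativa japonica | 93.89 | 93.62 | 94.16 | 93.92 | 97.52 |
| AcSS | Drosophila melanogaster | 94.07 | 95.04 | 94.09 | 94.07 | 98.16 |
| AcSS | Caenorhabditis elegans | 98.08 | 97.78 | 98.38 | 98.09 | 99.49 |
| DoSS | Homo sapiens | 97.38 | 98.83 | 95.93 | 96.38 | 99.10 |
| DoSS | Arabidopsis thaliana | 95.59 | 95.67 | 95.50 | 95.58 | 98.69 |
| DoSS | Oryza sativa japonica | 94.33 | 94.41 | 94.25 | 94.33 | 98.30 |
| DoSS | Drosophila melanogaster | 90.52 | 93.71 | 90.46 | 91.52 | 96.56 |
| DoSS | Caenorhabditis elegans | 97.68 | 97.74 | 97.63 | 97.69 | 99.48 |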
Comparison of the SS prediction accuracy of Splice2Deep and state-of-the-art tools on five well-studied organisms. Results in bold represent the best performing model. N/A indicates that the tool has not been, and cannot be, trained for that organism.
| Splice site | Organism | GeneSplicer | SplicePredictor | DeepSS | Splicerover | Splice2Deep |
|---|---|---|---|---|---|---|
| AcSS | Homo sapiens | 83.31 | 88.01 | 94.85 | 95.35 | |
| AcSS | Arabidopsis thaliana | 87.76 | 92.13 | N/A | 94.35 | |
| AcSS | Oryza sativa japonica | 84.21 | 89.42 | N/A | N/A | |
| AcSS | Drosophila melanogaster | 88.66 | 88.69 | N/A | N/A | |
| AcSS | Caenorhabditis elegans | N/A | N/A | 93.32 | N/A | |
| DoSS | Homo sapiens | 79.48 | 88.20 | 94.76 | 96.18 | |
| DoSS | Arabidopsis thaliana | 90.85 | 92.49 | N/A | 94.25 | |
| DoSS | Oryza sativa japonica | 86.17 | 87.50 | N/A | N/A | |
| DoSS | Drosophila melanogaster | 90.19 | 88.79 | N/A | N/A | |
| DoSS | Caenorhabditis elegans | N/A | N/A | 94.01 | N/A | |
Relative error rates associated with SS detection when using Splice2Deep and the best performing SS prediction tools.
| Splice site | Organism | Best performing model | Error rate of the best performing model (%) | Error rate of Splice2Deep (%) | Relative error rate reduction (%) |
|---|---|---|---|---|---|
| AcSS | Homo sapiens | Splicerover | 4.65 | 3.09 | 33.55 |
| AcSS | Arabidopsis thaliana | Splicerover | 5.65 | 4.79 | 15.22 |
| AcSS | Oryza sativa japonica | SplicePredictor | 10.58 | 6.11 | 42.25 |
| AcSS | Drosophila melanogaster | SplicePredictor | 11.31 | 5.93 | 47.57 |
| AcSS | Caenorhabditis elegans | DeepSS | 6.68 | 1.92 | 71.26 |
| DoSS | Homo sapiens | Splicerover | 3.82 | 2.62 | 45.80 |
| DoSS | Arabidopsis thaliana | Splicerover | 5.75 | 4.41 | 23.30 |
| DoSS | Oryza sativa japonica | SplicePredictor | 12.50 | 5.67 | 54.64 |
| DoSS | Drosophila melanogaster | SplicePredictor | 9.81 | 9.48 | 3.36 |
| DoSS | Caenorhabditis elegans | DeepSS | 5.95 | 5.03 | 15.46 |
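The relative error rate reduction column appears to be the drop in error rate expressed as a percentage of the competing tool's error rate. The sketch below assumes that formula, which is inferred from the table values rather than stated in this record; the function name `relative_error_reduction` is hypothetical.

```python
def relative_error_reduction(baseline_error, new_error):
    """Percentage by which new_error improves on baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error * 100


# First AcSS row of the table: Splicerover at 4.65% vs. Splice2Deep at 3.09%.
reduction = relative_error_reduction(4.65, 3.09)
print(round(reduction, 2))  # 33.55
```

Since the error rate is 1 − Acc, the same function can be applied to accuracies by passing `100 - acc` for each model.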
Fig. 1. Accuracy results obtained from the cross-organism model validation. A–E) Cross-organism validation results for the prediction of AcSS; F–J) cross-organism validation results for the prediction of DoSS.
Annotation for each organism and the number of positive and negative SS samples.
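| Splice site | Organism | Number of sequences | Assembly & Genebuild reference |
|---|---|---|---|
| DoSS | Homo sapiens | 250,400 (true) | GRCh38.p12 |
| DoSS | Arabidopsis thaliana | 110,299 (true) | TAIR10 |
| DoSS | Oryza sativa japonica | 103,426 (true) | IRGSP-1.0 |
| DoSS | Drosophila melanogaster | 30,118 (true) | BDGP6.22 |
| DoSS | Caenorhabditis elegans | 77,387 (true) | WBcel235 |
| AcSS | Homo sapiens | 248,150 (true) | GRCh38.p12 |
| AcSS | Arabidopsis thaliana | 112,318 (true) | TAIR10 |
| AcSS | Oryza sativa japonica | 104,028 (true) | IRGSP-1.0 |
| AcSS | Drosophila melanogaster | 28,703 (true) | BDGP6.22 |
| AcSS | Caenorhabditis elegans | 77,763 (true) | WBcel235 |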
Fig. 2. Data representation. A) Mononucleotide embedding with dimensions 4 × L, and B) trinucleotide embedding with dimensions 64 × L.
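The two representations in Fig. 2 can be illustrated with one-hot encoding. This is a minimal sketch, not the authors' code: it assumes one-hot columns over the alphabet A, C, G, T and overlapping trinucleotides with stride 1, so a sequence of length L yields L − 2 trinucleotide columns. How the paper pads or windows the sequence to reach exactly L columns is not stated in this record. All function names are hypothetical.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
# All 64 trinucleotides in lexicographic order (AAA, AAC, ..., TTT).
TRINUCLEOTIDES = ["".join(p) for p in product(NUCLEOTIDES, repeat=3)]


def one_hot(tokens, alphabet):
    """Return a len(alphabet) x len(tokens) matrix with a single 1 per column."""
    index = {symbol: row for row, symbol in enumerate(alphabet)}
    matrix = [[0] * len(tokens) for _ in alphabet]
    for col, token in enumerate(tokens):
        matrix[index[token]][col] = 1  # raises KeyError on ambiguous bases such as N
    return matrix


def mono_embedding(seq):
    """4 x L mononucleotide embedding."""
    return one_hot(list(seq), NUCLEOTIDES)


def tri_embedding(seq):
    """64-row trinucleotide embedding (assumption: overlapping, stride 1)."""
    tokens = [seq[i:i + 3] for i in range(len(seq) - 2)]
    return one_hot(tokens, TRINUCLEOTIDES)


seq = "ACGTAG"
mono = mono_embedding(seq)
tri = tri_embedding(seq)
print(len(mono), len(mono[0]))  # 4 6
print(len(tri), len(tri[0]))    # 64 4
```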
Fig. 3. Splice2Deep model overview. Local and surrounding windows. ‘SS’ refers to splice site and ‘N’ to nucleotides.
Fig. 4. Splice2Deep learning model. It takes a DNA sequence as input, embedded in 2D (either 4 × L or 64 × L), and applies k motif detectors (filters), max pooling, flattening, and a fully connected layer with SoftMax to output scores.
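The pipeline described for Fig. 4 (filters, max pooling, flatten, fully connected layer, SoftMax) can be sketched as a plain-Python forward pass. This is an illustration under stated assumptions, not the Splice2Deep implementation: the weights are random and untrained, the sizes (`k`, `filter_length`, `pool`) are hypothetical, and a ReLU is applied after the convolution because the keyword list mentions a ReLU layer.

```python
import math
import random


def conv_relu(x, filters):
    """Slide each filter along the sequence axis, then apply ReLU."""
    n_rows, seq_len = len(x), len(x[0])
    feature_maps = []
    for f in filters:
        flen = len(f[0])  # filter length along the sequence; the filter spans all rows
        fmap = []
        for start in range(seq_len - flen + 1):
            s = sum(f[r][j] * x[r][start + j]
                    for r in range(n_rows) for j in range(flen))
            fmap.append(max(0.0, s))
        feature_maps.append(fmap)
    return feature_maps


def max_pool(feature_maps, size):
    """Non-overlapping max pooling along the sequence axis."""
    return [[max(fm[i:i + size]) for i in range(0, len(fm) - size + 1, size)]
            for fm in feature_maps]


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, filters, fc_weights, fc_bias, pool_size):
    pooled = max_pool(conv_relu(x, filters), pool_size)
    flat = [v for fm in pooled for v in fm]             # flatten
    logits = [sum(w * v for w, v in zip(row, flat)) + b
              for row, b in zip(fc_weights, fc_bias)]   # fully connected layer
    return softmax(logits)


# Toy 4 x L one-hot input for a 10-nucleotide sequence.
seq = "ACGTACGTAC"
x = [[1.0 if base == n else 0.0 for base in seq] for n in "ACGT"]

rng = random.Random(0)
k, filter_length, pool = 3, 4, 2  # hypothetical sizes, not the tuned values
filters = [[[rng.uniform(-1, 1) for _ in range(filter_length)] for _ in range(4)]
           for _ in range(k)]
n_flat = k * ((len(seq) - filter_length + 1) // pool)
fc_weights = [[rng.uniform(-1, 1) for _ in range(n_flat)] for _ in range(2)]
fc_bias = [0.0, 0.0]

probs = forward(x, filters, fc_weights, fc_bias, pool)
print(probs)  # two scores (e.g. true SS vs. false SS) that sum to 1
```

In practice a deep learning framework would replace these loops and learn the filter and layer weights from labeled sequences.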
Grid search space for the tuning of the CNN and NN hyperparameters.
| CNN model hyperparameters | Search space |
|---|---|
| Activation function | [tanh, |
| Number of neurons on FC layer | [128, 250, |
| Initialization mode | [uniform, |
| Batch size | [16, 32, |
| Dropout rate | [0.01, 0.1, 0.2, |
| Optimizer | [SGD, Adam, |
| Number of filters (4 × L embedding) | [16, |
| Filter length (4 × L embedding) | [ |
| Filter width (4 × L embedding) | [4] |
| Number of filters (64 × L embedding) | [16, |
| Filter length (64 × L embedding) | [ |
| Filter width (64 × L embedding) | [64] |
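A grid search over such a space amounts to evaluating every combination of candidate values. The sketch below is illustrative only: the dictionary holds just the values still visible in the table for three of the hyperparameters (the original lists are truncated, so the grid is incomplete), and `toy_score` is a hypothetical placeholder standing in for training a model and measuring its validation accuracy.

```python
from itertools import product

# Only the values visible in the table; the original search lists are truncated.
search_space = {
    "batch_size": [16, 32],
    "dropout_rate": [0.01, 0.1, 0.2],
    "optimizer": ["SGD", "Adam"],
}


def grid(space):
    """Yield one configuration dict per combination of candidate values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def toy_score(config):
    """Hypothetical placeholder for 'train with config, return validation accuracy'."""
    return 1.0 - config["dropout_rate"]


configs = list(grid(search_space))
best = max(configs, key=toy_score)
print(len(configs))  # 12 combinations (2 x 3 x 2)
print(best)
```

The number of combinations grows multiplicatively with each added hyperparameter, which is worth keeping in mind when each evaluation requires training a network.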