Victor Akpokiro, Trevor Martin, Oluwatosin Oluwadare.
Abstract
BACKGROUND: Identifying splice site regions is an important step in the genomic DNA sequencing pipelines of biomedical and pharmaceutical research. Within this research purview, efficient and accurate splice site detection is highly desirable, and a variety of computational models have been developed toward this end. Neural network architectures have recently been shown to outperform classical machine learning approaches for the task of splice site prediction. Despite these advances, there is still considerable potential for improvement, especially regarding model prediction accuracy and error rate.
Keywords: Convolutional neural network (CNN); Deep learning (DL); Dense neural network (DNN); Ensemble learning; Feature extraction; Splice sites (SS)
Year: 2022 PMID: 36203144 PMCID: PMC9535948 DOI: 10.1186/s12859-022-04971-w
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Fig. 1 Illustration of the two-step biochemical process at splice sites. This figure shows the canonical sequence distribution at a splice site location; the introns are spliced out (hence the name splice sites), with proteins as the final product
EnsembleSplice neural network hyper-parameter search space
| Neural Network | Hyper-parameter | Range | Steps | Selected |
|---|---|---|---|---|
| CNN | Filters | 8–400 | 8 | 72, 120, 136, 144, 168, 208, 250, 272 |
| | Kernel size | 1–9 | 2 | 3, 4, 5, 7, 9 |
| | Dropout | 0.05–0.30 | 0.05 | 0.20, 0.35 |
| | Max-pool size | 1–9 | 2 | 3 |
| DNN | Units | 32–704 | 32 | 32, 128, 224, 250, 256, 352, 512, 704 |
| | Kernel regularizers | 0.0025, 0.025, 0.036 | – | 0.0025, 0.025, 0.036 |
| | Dropout | 0.05–0.50 | 0.05 | 0.1, 0.15, 0.25 |
This table shows the convolutional neural network (CNN) and dense neural network (DNN) search space, including the search range, step size, and the selected hyperparameters
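As a rough illustration of how a search space like the one above can be enumerated and sampled, here is a minimal pure-Python sketch. It is not the authors' tuning code; the `grid` helper and the random-sampling step are illustrative assumptions (only the integer-valued CNN ranges from the table are shown).

```python
import random

# Sketch (not the authors' tuning code): each range (low, high, step)
# from the table is enumerated inclusively, then one configuration is
# sampled at random, as a single step of a random search would do.
def grid(low, high, step):
    """Enumerate low, low + step, ..., up to and including high."""
    vals = []
    v = low
    while v <= high:
        vals.append(v)
        v += step
    return vals

cnn_space = {
    "filters": grid(8, 400, 8),        # 8, 16, ..., 400 (50 values)
    "kernel_size": grid(1, 9, 2),      # 1, 3, 5, 7, 9
    "max_pool_size": grid(1, 9, 2),    # 1, 3, 5, 7, 9
}

random.seed(0)  # for reproducibility of this illustration
config = {name: random.choice(vals) for name, vals in cnn_space.items()}
```

A tuner would train a candidate model with each sampled `config` and keep the best-scoring values, which the "Selected" column of the table records.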
EnsembleSplice’s CNN and DNN model architectures
| Neural networks | Layer type |
|---|---|
| CNN 1 | Conv1D(72, 5) |
| | Conv1D(144, 7) |
| | Conv1D(168, 7) |
| | Flatten() |
| | Dropout(0.20) |
| | Dense(2, "sigmoid") |
| CNN 2 | Conv1D(136, 3) |
| | Conv1D(72, 4) |
| | MaxPooling1D(7) |
| | Conv1D(272, 7) |
| | MaxPooling1D(3) |
| | Flatten() |
| | Dropout(0.35) |
| | Dense(2, "sigmoid") |
| CNN 3 | Conv1D(208, 9) |
| | MaxPooling1D(6) |
| | Conv1D(120, 5) |
| | MaxPooling1D(3) |
| | Flatten() |
| | Dropout(0.20) |
| | Dense(2, "sigmoid") |
| CNN 4 | Conv1D(250, 5) |
| | Conv1D(250, 5) |
| | Conv1D(250, 5) |
| | MaxPooling1D(3) |
| | Flatten() |
| | Dropout(0.20) |
| | Dense(2, "sigmoid") |
| DNN 1 | Flatten() |
| | Dense(704) |
| | Dense(224) |
| | Dropout(0.1) |
| | Dense(512) |
| | Dropout(0.15) |
| | Dense(2, "sigmoid") |
| DNN 2 | Flatten() |
| | Dense(704) |
| | Dense(224) |
| | Dense(128) |
| | Dropout(0.15) |
| | Dense(2, "sigmoid") |
| DNN 3 | Flatten() |
| | Dense(256) |
| | Dense(352) |
| | Dense(32) |
| | Dense(352) |
| | Dropout(0.15) |
| | Dense(2, "sigmoid") |
| DNN 4 | Flatten() |
| | Dense(250) |
| | Dense(250) |
| | Dense(250) |
| | Dropout(0.25) |
| | Dense(2, "sigmoid") |
For convolutional (Conv1D) layers, the first and second parameters are the number of filters and the kernel size, respectively; all use the same activation function (ReLU) and padding. The parameter of each max-pooling layer is the pool size. For the dense neural networks (DNNs), the parameter of each dense layer is the number of nodes, with the ReLU activation function. DNN 4 uses a random-normal kernel initializer
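The layer parameters above determine the sequence lengths flowing through each base model. Here is a minimal sketch of that shape arithmetic (not the authors' code), assuming 'valid' padding, stride 1, and non-overlapping pooling; the input length of 400 is a hypothetical value for illustration only.

```python
# Sketch of output-length arithmetic for the Conv1D / MaxPooling1D
# layers in the table. Assumptions (not stated in the source): 'valid'
# padding, stride 1, non-overlapping pooling, input length 400.
def conv1d_out_len(length, kernel_size):
    """Sequence length after a Conv1D with 'valid' padding, stride 1."""
    return length - kernel_size + 1

def maxpool1d_out_len(length, pool_size):
    """Sequence length after non-overlapping 1D max pooling."""
    return length // pool_size

# Walking CNN 3 from the table:
# Conv1D(208, 9) -> MaxPooling1D(6) -> Conv1D(120, 5) -> MaxPooling1D(3)
L = 400                        # hypothetical one-hot sequence length
L = conv1d_out_len(L, 9)       # 392
L = maxpool1d_out_len(L, 6)    # 65
L = conv1d_out_len(L, 5)       # 61
L = maxpool1d_out_len(L, 3)    # 20
```

The `Flatten()` layer then turns this final (length, channels) feature map into a single vector for the `Dense(2, "sigmoid")` output layer.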
Fig. 2 EnsembleSplice’s CNN and DNN model architectures. This figure depicts the architecture of each CNN and DNN base model used in the cross-validation experiment: a CNN 1; b CNN 2; c CNN 3; d CNN 4; e DNN 1; f DNN 2; g DNN 3; h DNN 4, each shown with its respective layers and their distinct labels
Fig. 3 Cross-validation ensemble model architectures. These are architectural representations of each ensemble model and the combination of base models it uses in the cross-validation experiment: a Ensemble ENS1 contains all the DNNs (DNN1, DNN2, DNN3, DNN4); b Ensemble ENS2 contains all the CNNs (CNN1, CNN2, CNN3, CNN4); c Ensemble ENS3 contains all the neural network models (DNN1, DNN2, DNN3, DNN4, CNN1, CNN2, CNN3, CNN4); d Ensemble ENS4 consists of CNN1, CNN2, CNN3, DNN1, and DNN3; e Ensemble ENS5 consists of DNN1, DNN3, DNN4, CNN1, CNN2, and CNN3; f Ensemble ENS6 consists of DNN1, DNN3, DNN4, CNN1, and CNN2. We selected Ensemble ENS2 based on our cross-validation experiment
Cross-validation results for the genomic organism datasets
| Datasets | SpliceSites | Metrics | ENS1 | ENS2 | ENS3 | ENS4 | ENS5 | ENS6 |
|---|---|---|---|---|---|---|---|---|
| | Acceptor | Double fault | 0.033 | 0.00 | 0.01 | 0.01 | 0.007 | 0.011 |
| | | Correlation | 0.612 | 0.06 | 0.22 | 0.20 | 0.21 | 0.33 |
| | | Q-statistics | 0.89 | 0.131 | 0.50 | 0.65 | 0.553 | 0.83 |
| | | Disagreement | 0.03 | 0.00 | 0.03 | 0.03 | 0.02 | 0.03 |
| | | Accuracy | 0.89 | 0.936 | 0.94 | 0.93 | 0.94 | 0.93 |
| | Donor | Double fault | 0.013 | 0.00 | 0.00 | 0.00 | 0.003 | 0.003 |
| | | Correlation | 0.496 | 0.02 | 0.18 | 0.11 | 0.19 | 0.20 |
| | | Q-statistics | 0.796 | − 0.001 | 0.44 | 0.37 | 0.451 | 0.478 |
| | | Disagreement | 0.015 | 0.00 | 0.01 | 0.01 | 0.01 | 0.01 |
| | | Accuracy | 0.93 | 0.958 | 0.95 | 0.95 | 0.94 | 0.94 |
| | Acceptor | Double fault | 0.023 | 0.003 | 0.012 | 0.01 | 0.011 | 0.01 |
| | | Correlation | 0.667 | 0.215 | 0.358 | 0.401 | 0.413 | 0.415 |
| | | Q-statistics | 0.988 | 0.713 | 0.843 | 0.98 | 0.982 | 0.985 |
| | | Disagreement | 0.023 | 0.016 | 0.097 | 0.027 | 0.03 | 0.025 |
| | | Accuracy | 0.913 | 0.947 | 0.946 | 0.945 | 0.948 | 0.942 |
| | Donor | Double fault | 0.013 | 0.019 | 0.008 | 0.006 | 0.007 | 0.007 |
| | | Correlation | 0.638 | 0.132 | 0.317 | 0.3 | 0.315 | 0.326 |
| | | Q-statistics | 0.992 | 0.308 | 0.689 | 0.83 | 0.882 | 0.747 |
| | | Disagreement | 0.016 | 0.079 | 0.089 | 0.056 | 0.085 | 0.016 |
| | | Accuracy | 0.93 | 0.954 | 0.954 | 0.95 | 0.953 | 0.952 |
| | Acceptor | Double fault | 0.034 | 0.003 | 0.015 | 0.01 | 0.013 | 0.015 |
| | | Correlation | 0.702 | 0.19 | 0.325 | 0.338 | 0.353 | 0.399 |
| | | Q-statistics | 0.989 | 0.555 | 0.667 | 0.978 | 0.844 | 0.978 |
| | | Disagreement | 0.028 | 0.022 | 0.083 | 0.037 | 0.069 | 0.037 |
| | | Accuracy | 0.894 | 0.938 | 0.938 | 0.939 | 0.937 | 0.933 |
| | Donor | Double fault | 0.022 | 0.001 | 0.008 | 0.007 | 0.01 | 0.008 |
| | | Correlation | 0.665 | 0.103 | 0.289 | 0.298 | 0.338 | 0.315 |
| | | Q-statistics | 0.989 | 0.274 | 0.773 | 0.894 | 0.978 | 0.907 |
| | | Disagreement | 0.022 | 0.057 | 0.024 | 0.025 | 0.033 | 0.025 |
| | | Accuracy | 0.907 | 0.952 | 0.952 | 0.951 | 0.949 | 0.946 |
| Average | Acceptor | Double fault | 0.03 | 0.01 | 0.02 | 0.01 | 0.012 | |
| | | Correlation | 0.66 | 0.30 | 0.31 | 0.32 | 0.38 | |
| | | Q-statistics | 0.955 | 0.58 | 0.87 | 0.793 | 0.931 | |
| | | Disagreement | 0.027 | 0.012 | 0.033 | 0.040 | 0.030 | |
| | | Accuracy | 0.830 | 0.940 | 0.940 | 0.940 | 0.930 | |
| | Donor | Double fault | 0.015 | 0.012 | 0.010 | 0.006 | 0.008 | |
| | | Correlation | 0.599 | 0.260 | 0.240 | 0.28 | 0.28 | |
| | | Q-statistics | 0.9256 | 0.630 | 0.700 | 0.770 | 0.710 | |
| | | Disagreement | 0.017 | 0.040 | 0.030 | 0.040 | 0.020 | |
| | | Accuracy | 0.920 | 0.950 | 0.950 | 0.950 | 0.950 | |
This table depicts the five-fold cross-validation results, with averages across the organism distribution, the evaluation metrics, and the ensemble combinations considered. Results highlighted in black show the best average evaluation metrics. ENS1 consists of DNN1, DNN2, DNN3, DNN4; ENS2 consists of CNN1, CNN2, CNN3, CNN4; ENS3 consists of DNN1, DNN2, DNN3, DNN4, CNN1, CNN2, CNN3, CNN4; ENS4 consists of CNN1, CNN2, CNN3, DNN1, DNN3; ENS5 consists of DNN1, DNN3, DNN4, CNN1, CNN2, CNN3; ENS6 consists of DNN1, DNN3, DNN4, CNN1, CNN2
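The pairwise diversity measures in the table are standard ensemble-diversity statistics. The sketch below uses their usual definitions (an assumption on our part, not formulas extracted from the paper): for two base classifiers, count the examples both get right (n11), only one gets right (n10, n01), and both get wrong (n00), then combine the counts.

```python
import math

# Sketch using standard ensemble-diversity definitions (assumed, not
# extracted from the paper). Inputs are per-example correctness flags
# for two base classifiers: 1 = correct prediction, 0 = wrong.
def diversity(a, b):
    n11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    n10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    n01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n00 = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    n = n11 + n10 + n01 + n00
    return {
        # fraction of examples both classifiers misclassify
        "double_fault": n00 / n,
        # fraction of examples on which the classifiers disagree
        "disagreement": (n10 + n01) / n,
        # Yule's Q-statistic: near 0 for independent classifiers
        "q_statistic": (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10),
        # phi correlation between the two correctness variables
        "correlation": (n11 * n00 - n01 * n10)
        / math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)),
    }

# Toy correctness vectors for two hypothetical base models
m = diversity([1, 1, 1, 0, 1, 0], [1, 1, 0, 0, 1, 1])
```

Lower double fault, disagreement near its useful range, and low Q/correlation indicate base models that make complementary errors, which is what makes an ensemble combination like ENS2 attractive.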
Fig. 4 EnsembleSplice architectural pipeline. This figure depicts the ensemble architecture used for this experiment: the one-hot encoded datasets, the ensemble neural network combination, the predictions and labels, and the logistic regression and evaluation steps
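The one-hot encoding step of the pipeline can be sketched in a few lines of plain Python. This is an illustration, not the authors' preprocessing code; in particular the channel ordering A, C, G, T and the all-zero handling of ambiguous bases are assumptions.

```python
# Sketch of one-hot encoding a DNA sequence (channel order A, C, G, T
# and all-zero rows for ambiguous bases such as N are assumptions).
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a len(seq) x 4 list of 0/1 rows."""
    idx = {b: i for i, b in enumerate(BASES)}
    rows = []
    for base in seq.upper():
        row = [0.0] * len(BASES)
        if base in idx:            # unknown bases stay all-zero
            row[idx[base]] = 1.0
        rows.append(row)
    return rows

x = one_hot("ACGTN")
```

Each encoded sequence becomes a (length, 4) matrix, the natural input shape for the Conv1D base models and, after flattening, for the DNN base models.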
Evaluation performance comparison results
| Datasets | SpliceSites | Model | Sp | Sn | Pre | Err | Acc | MCC | F1 |
|---|---|---|---|---|---|---|---|---|---|
| | Acceptor | iSS-CNN | 87.27 | 91.81 | 87.82 | 10.45 | 89.55 | 79.17 | 81.45 |
| | | SpliceRover | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| | | DeepSplicer | 92.55 | 92.91 | | 7.27 | 92.73 | 85.46 | 92.74 |
| | | SpliceFinder | 89.09 | 93.09 | 89.51 | 8.90 | 91.09 | 82.24 | 91.26 |
| | | EnsembleSplice | 91.52 | | | | | | |
| | Donor | iSS-CNN | 94.36 | 94.90 | 94.39 | 5.35 | 94.64 | 89.27 | 89.84 |
| | | SpliceRover | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| | | DeepSplicer | 95.45 | 94.36 | 95.40 | 5.09 | 94.91 | 89.82 | 94.88 |
| | | SpliceFinder | 94.00 | 95.09 | 94.06 | 5.45 | 94.54 | 89.09 | 94.57 |
| | | EnsembleSplice | | | | | | | |
| | Acceptor | SpliceRover | 88.31 | 89.25 | 88.42 | 11.22 | 88.78 | 77.57 | 88.83 |
| | | iSS-CNN | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| | | DeepSplicer | 90.00 | 94.50 | 90.43 | 7.75 | 92.25 | 84.59 | 92.40 |
| | | SpliceFinder | 90.88 | 92.69 | 91.04 | 8.22 | 91.78 | 83.58 | 91.86 |
| | | EnsembleSplice | | | | | | | |
| | Donor | SpliceRover | 86.88 | 87.13 | 86.91 | 13.00 | 87.00 | 74.00 | 87.02 |
| | | iSS-CNN | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| | | DeepSplicer | 90.44 | 90.86 | | 7.25 | 92.75 | 85.59 | 92.91 |
| | | SpliceFinder | 93.50 | 91.13 | 93.34 | 7.69 | 92.31 | 84.65 | 92.22 |
| | | EnsembleSplice | 94.38 | | | | | | |
| | Acceptor | SpliceRover | 88.25 | 93.44 | 88.83 | 9.16 | 90.84 | 81.80 | 91.08 |
| | | iSS-CNN | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| | | DeepSplicer | 90.88 | 91.19 | 90.90 | 8.97 | 91.03 | 82.06 | 91.04 |
| | | SpliceFinder | 90.75 | 89.94 | 90.67 | 9.66 | 90.34 | 80.69 | 90.3 |
| | | EnsembleSplice | | | | | | | |
| | Donor | SpliceRover | 85.44 | 91.13 | 86.22 | 11.72 | 88.28 | 76.69 | 88.61 |
| | | iSS-CNN | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| | | DeepSplicer | 88.00 | | | 7.69 | 92.31 | 84.94 | 91.97 |
| | | SpliceFinder | 93.00 | 91.25 | 92.88 | 7.87 | 92.13 | 84.26 | 92.06 |
| | | EnsembleSplice | 96.06 | 96.06 | | | | | |
This table shows the EnsembleSplice splice site prediction performance results compared against other methods: iSS-CNN [17], SpliceRover [12], SpliceFinder [13], and DeepSplicer [14]. We show the prediction accuracy and error rate, among other evaluation metrics. Figures highlighted in black denote the best performance; N/A marks methods with no model for the given dataset. For this table, Sp denotes specificity, Sn sensitivity, Pre precision, Err error rate, Acc accuracy, MCC the Matthews correlation coefficient, and F1 the F1 score
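The metrics in the table follow from the binary confusion counts. Below is a sketch using their standard definitions (not the authors' evaluation code); note the table reports percentages, whereas this helper returns fractions, and the toy counts are hypothetical.

```python
import math

# Sketch using standard definitions of the table's metrics, computed
# from binary confusion counts (TP, TN, FP, FN). Returns fractions,
# while the table reports percentages.
def metrics(tp, tn, fp, fn):
    sn = tp / (tp + fn)                    # sensitivity (recall)
    sp = tn / (tn + fp)                    # specificity
    pre = tp / (tp + fp)                   # precision
    acc = (tp + tn) / (tp + tn + fp + fn)  # accuracy
    err = 1.0 - acc                        # error rate (Err + Acc = 1)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )                                      # Matthews correlation coeff.
    f1 = 2 * pre * sn / (pre + sn)         # harmonic mean of Pre and Sn
    return {"Sp": sp, "Sn": sn, "Pre": pre, "Err": err,
            "Acc": acc, "MCC": mcc, "F1": f1}

# Hypothetical confusion counts for illustration
m = metrics(tp=90, tn=85, fp=15, fn=10)
```

A quick consistency check on the table itself: Err and Acc always sum to 100% under these definitions, which the reported rows satisfy (e.g. 7.27 + 92.73 for DeepSplicer).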
Fig. 5 EnsembleSplice model interpretability. This figure shows sequence logos visualizing the importance score of each nucleotide per position for the HS3D datasets: a acceptor positive splice sites; b acceptor negative splice sites; c donor positive splice sites; d donor negative splice sites