Prabina Kumar Meher, Tanmaya Kumar Sahu, A R Rao, S D Wahi.
Abstract
BACKGROUND: Identifying splice sites is essential for gene annotation. Although existing approaches achieve an acceptable level of accuracy, there is still room for improvement. Moreover, most approaches are species-specific, so approaches that work across species are needed.
Keywords: Hsplice; Hybrid approach; Machine learning; Sequence encoding
Year: 2016 PMID: 27252772 PMCID: PMC4888255 DOI: 10.1186/s13015-016-0078-4
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
List of selected features using F-score
| Feature type | Number of selected features |
|---|---|
| Positional | 4 |
| Dependency | 4 |
| Compositional | 41 |
Of the 344 generated features, 49 were selected: four positional, four dependency and 41 compositional features
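The F-score ranking behind this selection can be sketched as follows. This is a minimal illustration, assuming the commonly used two-class F-score (squared distance of each class mean from the overall mean, divided by the sum of the within-class sample variances); the source does not state the exact formula or the cut-off rule, and the names `f_score` and `select_top_features` are hypothetical.

```python
from statistics import mean, variance


def f_score(pos, neg):
    """F-score of a single feature (assumed two-class definition).

    pos, neg: values of the feature in the positive and negative class.
    Larger scores mean the feature separates the classes better.
    """
    overall = mean(pos + neg)
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)  # sample variances (n - 1)
    return between / within if within > 0 else 0.0


def select_top_features(X_pos, X_neg, k):
    """Rank features by F-score and return the indices of the top k.

    X_pos, X_neg: lists of rows, one row per sequence, one column per feature.
    """
    n_features = len(X_pos[0])
    scores = [
        f_score([row[j] for row in X_pos], [row[j] for row in X_neg])
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores


# Toy data (made up): feature 0 separates the classes, feature 1 does not.
X_pos = [[1.0, 5.0], [1.2, 3.0], [0.9, 4.0]]
X_neg = [[3.0, 4.5], [3.1, 3.5], [2.9, 4.0]]
top, scores = select_top_features(X_pos, X_neg, k=1)
print(top, scores)
```

In the toy data both classes have the same mean for feature 1, so its score is zero and only feature 0 is kept. In the setting above, `k` would be 49 out of 344 candidate features, if a top-k rule was what the authors used.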
Fig. 1 a ROC curves of SVM with linear, polynomial, sigmoid and RBF kernels in each fold of the fivefold cross-validation. b Bar plots of AUC-ROC values for SVM with the RBF kernel at different values of gamma (shown over each bar) in each fold of the fivefold cross-validation. SVM with polynomial and RBF kernels performed almost equally. The AUC-ROC of the SVM with the RBF kernel nearly stabilized once gamma exceeded 0.2 in all five folds
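The role of gamma in panel b can be illustrated with the kernel itself. This is a minimal sketch, assuming the standard RBF form K(x, y) = exp(-gamma * ||x - y||^2); the source does not say which SVM implementation was used, and the vectors and gamma grid below are made up for illustration.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF similarity between two encoded sequences (assumed standard form)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Two hypothetical feature vectors at squared distance 2.0.
x = [1.0, 0.0, 1.0]
y = [0.0, 0.0, 0.0]
for gamma in (0.05, 0.2, 0.8):
    # Larger gamma makes similarity fall off faster with distance.
    print(gamma, round(rbf_kernel(x, y, gamma), 4))
```

A small gamma treats distant points as similar and gives a smoother decision boundary; a large gamma makes the kernel more local.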
Fig. 2 Estimates of AUC-ROC and AUC-PR for the proposed approach under balanced (a) and imbalanced (b) situations. The ten bars represent ten subsets, each drawn at random from the original data, with AUC-ROC/AUC-PR computed over fivefold cross-validation
Performance accuracy of the proposed approach
| Measure | Dataset | Human | Cattle | Fish | Worm |
|---|---|---|---|---|---|
| AUC-ROC | Balanced | 96.05 | 96.94 | 96.95 | 96.24 |
| AUC-ROC | Imbalanced | 97.21 | 97.45 | 97.41 | 98.06 |
| AUC-PR | Balanced | 97.64 | 97.89 | 97.91 | 97.90 |
| AUC-PR | Imbalanced | 93.24 | 93.34 | 93.38 | 92.29 |
The performance of the proposed approach is measured by AUC-ROC and AUC-PR in all four species under both balanced and imbalanced situations. AUC-ROC is similar across the four species in both situations, whereas AUC-PR is higher in the balanced case than in the imbalanced case
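The different behaviour of the two measures under imbalance can be reproduced on toy data. This is a minimal sketch, assuming AUC-ROC is computed as the pairwise ranking probability and AUC-PR is approximated by average precision; the source does not say how either area was estimated, and the scores below are made up for illustration.

```python
def auc_roc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean precision at the rank of each positive (a step-wise AUC-PR).

    Assumes no score ties between positives and negatives.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


pos_scores = [0.9, 0.8, 0.6]
neg_scores = [0.7, 0.3, 0.2]

# Balanced: 3 positives, 3 negatives.
bal_scores = pos_scores + neg_scores
bal_labels = [1] * 3 + [0] * 3

# Imbalanced: same score distribution, but five times as many negatives.
imb_scores = pos_scores + neg_scores * 5
imb_labels = [1] * 3 + [0] * 15

print(auc_roc(bal_scores, bal_labels), auc_roc(imb_scores, imb_labels))
# -> 0.889 in both cases (rounded)
print(average_precision(bal_scores, bal_labels),
      average_precision(imb_scores, imb_labels))
# -> 0.917 balanced, 0.792 imbalanced (rounded)
```

AUC-ROC depends only on how positives rank against negatives, so replicating the negatives leaves it unchanged. Precision is diluted by every extra false positive ranked above a true site, which is why the precision-recall area falls when negatives outnumber positives.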
Performance accuracies of different methods in predicting donor splice sites using NN269 dataset
| Approaches | AUC-ROC | AUC-PR | Type of kernel used |
|---|---|---|---|
| MM1-SVM | 97.62 | 89.58 | Polynomial |
| LIK-SVM | 98.04 | 92.65 | Locally improved kernel |
| WD-SVM | | 92.86 | Weighted degree kernel |
| WDS-SVM | 98.13 | 92.47 | Weighted degree shift kernel |
| EFFECT | 98.20 | 92.81 | – |
| Proposed | 96.53 | | Radial basis function |
WD-SVM achieved the highest AUC-ROC, whereas the proposed approach achieved the highest AUC-PR. Among the existing methods, MM1-SVM achieved the lowest AUC-ROC and AUC-PR
Estimates of AUC-ROC and AUC-PR of different methods for balanced dataset in predicting donor splice sites using human, bovine, fish and worm species
| Measure | Species | MM1-SVM | LIK-SVM | WD-SVM | WDS-SVM | EFFECT | Proposed |
|---|---|---|---|---|---|---|---|
| AUC-ROC | Human | 97.07 | 97.13 | 97.25 | 97.06 | 97.15 | 96.05 |
| AUC-ROC | Bovine | 96.98 | 97.63 | 97.83 | 97.59 | 97.70 | 96.94 |
| AUC-ROC | Fish | 97.24 | 97.34 | 97.68 | 97.53 | 97.59 | 96.95 |
| AUC-ROC | Worm | 97.49 | 98.02 | 98.23 | 98.12 | 98.15 | 96.24 |
| AUC-PR | Human | 96.78 | 97.52 | 97.67 | 97.38 | 97.58 | 97.64 |
| AUC-PR | Bovine | 96.66 | 97.48 | 97.59 | 97.26 | 97.51 | 97.89 |
| AUC-PR | Fish | 96.85 | 97.42 | 97.67 | 97.39 | 97.49 | 97.91 |
| AUC-PR | Worm | 96.92 | 97.51 | 97.78 | 97.63 | 97.71 | 97.90 |
In all four species, the AUC-ROC of the proposed approach is lower than that of the other approaches, whereas its AUC-PR is on par with that of the other approaches (except MM1-SVM)
Estimates of AUC-ROC and AUC-PR of different methods for imbalanced dataset in predicting donor splice sites using human, bovine, fish and worm species
| Measure | Species | MM1-SVM | LIK-SVM | WD-SVM | WDS-SVM | EFFECT | Proposed |
|---|---|---|---|---|---|---|---|
| AUC-ROC | Human | 97.32 | 97.61 | 97.73 | 97.30 | 97.42 | 97.21 |
| AUC-ROC | Bovine | 97.57 | 97.89 | 97.93 | 97.65 | 97.70 | 97.45 |
| AUC-ROC | Fish | 97.71 | 97.85 | 97.92 | 97.77 | 97.57 | 97.41 |
| AUC-ROC | Worm | 97.99 | 98.26 | 98.51 | 98.30 | 98.45 | 98.06 |
| AUC-PR | Human | 89.95 | 92.23 | 92.36 | 92.17 | 92.41 | 93.24 |
| AUC-PR | Bovine | 90.02 | 92.13 | 92.39 | 92.16 | 92.42 | 93.34 |
| AUC-PR | Fish | 90.10 | 92.18 | 92.43 | 92.26 | 92.47 | 93.38 |
| AUC-PR | Worm | 89.10 | 90.27 | 90.89 | 91.53 | 91.67 | 92.29 |
In all four species, the AUC-ROC of the proposed approach is on par with that of the other approaches, whereas its AUC-PR is slightly higher than that of the other approaches
Fig. 3 Snapshots of the server page (a) and the result page after running an example dataset (b) of the developed prediction server HSplice. The server has been trained on human, cattle and fish splice site datasets. The user only needs to supply the test sequence to predict donor splice sites for the species of interest