Zhao-Chun Xu, Peng Wang, Wang-Ren Qiu, Xuan Xiao.
Abstract
Gene splicing is one of the most significant biological processes in eukaryotic gene expression. Through RNA splicing, a single pre-mRNA can yield one or more mature messenger RNAs that carry the coded information for multiple biological functions. Identifying splicing sites in DNA/RNA sequences is therefore important for both biomedical research and the discovery of new drugs. Because identification by experimental techniques alone is expensive and time-consuming, new computational methods are needed. To identify splice donor sites and splice acceptor sites accurately and quickly, we constructed iSS-PC, a deep sparse auto-encoder model with two hidden layers, based on the minimum error law. Sequence samples are formulated by incorporating twelve physical-chemical properties of the dinucleotides within DNA into PseDNC via a battery of cross-covariance and auto-covariance transformations. Five-fold cross-validation tests on the same benchmark data-sets indicated that the new predictor outperformed the existing predictor iSS-PseDNC (accuracy of 90.56% versus 87.71% on donor sites and 91.11% versus 88.73% on acceptor sites). The same approach is expected to be applicable to many other related problems. An easy-to-use web-server for identifying splicing sites has been established for free access at: http://www.jci-bioinfo.cn/iSS-PC.
Year: 2017 PMID: 28811565 PMCID: PMC5557945 DOI: 10.1038/s41598-017-08523-8
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Sketch map showing how a pre-mRNA becomes a mature messenger RNA.
Figure 2. Sketch map showing the steps for establishing a predictor for a biological system.
The test results of splice donor site sequences based on different characteristic parameter τ values.
| Predictor | ACC(%) | MCC(%) | Sn(%) | Sp(%) |
|---|---|---|---|---|
| | 88.88 | 77.77 | 88.34 | 89.43 |
| | 80.58 | 61.15 | 81.01 | 80.14 |
| | 90.56 | 81.13 | 90.09 | 91.04 |
| | 90.74 | 81.49 | 90.77 | 90.71 |
The test results of splice acceptor site sequences based on different characteristic parameter τ values.
| Predictor | ACC(%) | MCC(%) | Sn(%) | Sp(%) |
|---|---|---|---|---|
| | 89.01 | 78.09 | 87.40 | 96.08 |
| | 90.02 | 80.04 | 89.69 | 90.36 |
| | 91.11 | 82.24 | 90.14 | 92.11 |
| | 90.95 | 81.95 | 99.44 | 92.50 |
Comparison of the 5-fold cross-validation test results on the benchmark data-set containing only splice donor site sequences.
| Predictor | ACC(%) | MCC(%) | Sn(%) | Sp(%) |
|---|---|---|---|---|
| iSS-PseDNC (a) | 87.71 | 75.46 | 89.56 | 85.86 |
| iSS-PC (b) | 90.56 | 81.13 | 90.09 | 91.04 |
(a) The prediction method developed by Wei Chen (2014).
(b) The prediction method proposed in this paper.
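A 5-fold cross-validation test partitions the benchmark samples into five folds and uses each fold once for testing while training on the other four. The following is a minimal sketch of such a split using only the standard library. Stratifying by class label and the helper name `stratified_k_folds` are assumptions made for illustration; the record does not say how the folds were drawn.

```python
import random


def stratified_k_folds(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs, spreading each class evenly over k folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin into the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


# Toy labels: 10 true sites (1) and 10 false sites (0).
labels = [1] * 10 + [0] * 10
for train, test in stratified_k_folds(labels):
    print(len(train), len(test))
```

Each sample appears in exactly one test fold, so the metrics can be pooled or averaged over the five runs.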
Comparison of the 5-fold cross-validation test results on the benchmark data-set containing only splice acceptor site sequences.
| Predictor | ACC(%) | MCC(%) | Sn(%) | Sp(%) |
|---|---|---|---|---|
| iSS-PseDNC (a) | 88.73 | 77.89 | 94.24 | 83.07 |
| iSS-PC (b) | 91.11 | 82.24 | 90.14 | 92.11 |
(a) The prediction method developed by Wei Chen (2014).
(b) The prediction method proposed in this paper.
Figure 3. ROC curves of the two different predictors for the splice donor site sequences.
Figure 4. ROC curves of the two different predictors for the splice acceptor site sequences.
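An ROC curve plots the true-positive rate against the false-positive rate as the decision threshold is swept over the classifier scores, and the area under it (AUC) summarizes the curve in one number. The sketch below shows one common way to build the curve and integrate it with the trapezoid rule. The function names and the toy labels and scores are hypothetical and are not the values behind Figures 3 and 4.

```python
def roc_points(labels, scores):
    """Return ROC points (FPR, TPR), sweeping the threshold from high to low."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for k, (score, y) in enumerate(ranked):
        if y == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a tied score group.
        if k + 1 == len(ranked) or ranked[k + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(auc(roc_points(labels, scores)), 4))
```

A predictor whose curve lies closer to the top-left corner has a larger AUC, which is how two predictors are usually compared on such a figure.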
The 5-fold cross-validation test results obtained from different classification algorithms with the same feature extraction method on the benchmark data-set containing only splice donor site sequences.
| Predictor | ACC(%) | MCC(%) | Sn(%) | Sp(%) |
|---|---|---|---|---|
| iSS-PC (a) | 90.56 | 81.13 | 90.09 | 91.04 |
| iSS-SVM (b) | 77.59 | 55.25 | 75.68 | 79.50 |
| iSS-RF (c) | 83.13 | 66.38 | 80.11 | 86.14 |
| iSS-libD3C (d) | 83.38 | 67.09 | 78.43 | 88.32 |
(a) The predictor with SAE proposed in this paper.
(b) The predictor with SVM created in WEKA with the default parameters.
(c) The predictor with Random Forest (RF) created in WEKA.
(d) The predictor with an ensemble classifier libD3C.
The 5-fold cross-validation test results obtained from different classification algorithms with the same feature extraction method on the benchmark data-set containing only splice acceptor site sequences.
| Predictor | ACC(%) | MCC(%) | Sn(%) | Sp(%) |
|---|---|---|---|---|
| iSS-PC (a) | 91.11 | 82.24 | 90.14 | 92.11 |
| iSS-SVM (b) | 73.10 | 46.23 | 71.94 | 74.29 |
| iSS-RF (c) | 85.80 | 71.60 | 84.70 | 86.90 |
| iSS-libD3C (d) | 83.15 | 66.55 | 79.38 | 87.04 |
(a) The predictor with SAE proposed in this paper.
(b) The predictor with SVM created in WEKA with the default parameters.
(c) The predictor with Random Forest (RF) created in WEKA.
(d) The predictor with an ensemble classifier libD3C.
Figure 5. A semi-screenshot of the homepage for the web-server “iSS-PC”.
The original values of the twelve PC properties for each dinucleotide.
| Code | HC1 | HC2 | HC3 | HC4 | HC5 | HC6 | HC7 | HC8 | HC9 | HC10 | HC11 | HC12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AA | 0.97 | −5.37 | 35.5 | −0.27 | 35 | 66.51 | 1.9 | −1.2 | −18.66 | 12.1 | 35.1 | 3.9 |
| AC | 0.13 | −10.5 | 33.1 | −0.21 | 60 | 108.8 | 1.3 | −1.5 | −13.1 | 9.8 | 31.5 | 4.6 |
| AG | 0.33 | −6.78 | 30.6 | −0.08 | 60 | 85.12 | 1.6 | −1.5 | −14 | 6.3 | 31.9 | 3.4 |
| AT | 0.58 | −6.57 | 43.2 | −0.28 | 20 | 72.29 | 0.9 | −0.9 | −15.01 | 2.1 | 29.3 | 5.9 |
| CA | 1.04 | −6.57 | 37.7 | −0.01 | 60 | 64.92 | 1.9 | −1.7 | −9.45 | 6.1 | 37.3 | 1.3 |
| CC | 0.19 | −8.26 | 35.3 | −0.03 | 130 | 99.31 | 3.1 | −2.3 | −8.11 | 2.9 | 32.9 | 2.4 |
| CG | 0.52 | −9.69 | 31.3 | −0.03 | 85 | 88.84 | 3.6 | −2.8 | −10.03 | 4.5 | 36.1 | 0.7 |
| CT | 0.33 | −6.78 | 30.6 | −0.18 | 60 | 85.12 | 1.6 | −1.5 | −14 | 1.6 | 31.9 | 3.4 |
| GA | 0.98 | −9.81 | 39.6 | 0.03 | 60 | 80.03 | 1.6 | −1.5 | −13.48 | 2.3 | 36.3 | 3.4 |
| GC | 0.73 | −14.6 | 38.4 | 0.02 | 85 | 135.8 | 3.1 | −2.3 | −11.08 | 4 | 33.6 | 4 |
| GG | 0.19 | −8.26 | 35.3 | −0.06 | 130 | 99.31 | 3.1 | −2.3 | −8.11 | 6.1 | 32.9 | 2.4 |
| GT | 0.13 | −10.51 | 33.1 | −0.18 | 60 | 108.8 | 1.3 | −1.5 | −13.1 | 2.1 | 31.5 | 4.6 |
| TA | 0.73 | −3.82 | 31.6 | 0.18 | 20 | 50.11 | 1.5 | −0.9 | −11.85 | 2.3 | 37.8 | 2.5 |
| TC | 0.98 | −9.81 | 39.6 | −0.11 | 60 | 80.03 | 1.6 | −1.5 | −13.48 | 4.5 | 36.3 | 3.4 |
| TG | 1.04 | −6.57 | 37.7 | 0.13 | 60 | 64.92 | 1.9 | −1.7 | −9.45 | 9.8 | 37.3 | 1.3 |
| TT | 0.97 | −5.37 | 35.5 | −0.28 | 35 | 66.51 | 1.9 | −1.2 | −18.66 | 2.8 | 35.1 | 3.9 |
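The abstract describes turning a sequence into features through auto-covariance and cross-covariance transformations of these dinucleotide properties. The sketch below follows the commonly used dinucleotide-based definitions: the sequence is mapped to one numeric track per property, and the covariance between a track and a lagged copy of the same track (auto) or of another track (cross) becomes one feature. Only the HC1 and HC5 columns are copied from the table above. The raw values are used without the normalization that would normally come first, and the names `covariance`, `acc_features` and `max_lag` are hypothetical. Whether the lag plays the role of the parameter τ in the result tables is an assumption, and the paper's exact formulation may differ.

```python
# Dinucleotides in the same order as the table rows: AA, AC, AG, AT, CA, ...
DINUCS = [a + b for a in "ACGT" for b in "ACGT"]

# Two of the twelve property columns (HC1 and HC5), copied from the table above.
HC1 = [0.97, 0.13, 0.33, 0.58, 1.04, 0.19, 0.52, 0.33,
       0.98, 0.73, 0.19, 0.13, 0.73, 0.98, 1.04, 0.97]
HC5 = [35, 60, 60, 20, 60, 130, 85, 60,
       60, 85, 130, 60, 20, 60, 60, 35]
PROPS = {"HC1": dict(zip(DINUCS, HC1)), "HC5": dict(zip(DINUCS, HC5))}


def property_track(seq, table):
    """Map each overlapping dinucleotide of seq to its property value."""
    return [table[seq[i:i + 2]] for i in range(len(seq) - 1)]


def covariance(seq, prop_a, prop_b, lag):
    """Auto-covariance if prop_a == prop_b, cross-covariance otherwise."""
    a = property_track(seq, PROPS[prop_a])
    b = property_track(seq, PROPS[prop_b])
    n = len(a) - lag
    if n <= 0:
        raise ValueError("sequence too short for this lag")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((a[i] - mean_a) * (b[i + lag] - mean_b) for i in range(n)) / n


def acc_features(seq, max_lag):
    """Concatenate all auto- and cross-covariance terms for lags 1..max_lag."""
    names = list(PROPS)
    return [covariance(seq, u, v, lag)
            for lag in range(1, max_lag + 1)
            for u in names
            for v in names]


features = acc_features("ACGTACGTAC", max_lag=2)
print(len(features))  # 2 lags x 2 properties x 2 properties
```

With all twelve properties the same loop would produce 12 auto-covariance and 132 cross-covariance terms per lag, so the feature vector length grows linearly with `max_lag`.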
Figure 6. A sketch map of a deep sparse auto-encoder model with two hidden layers.
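To make the idea of a sparse auto-encoder with two hidden layers concrete, the sketch below computes the loss of one auto-encoder layer (reconstruction error plus a KL-divergence sparsity penalty) and then passes inputs through two stacked encoders. It is a pure-Python illustration, not the iSS-PC implementation: the sigmoid activation, the layer sizes (12, 8, 4), the sparsity target `rho`, the penalty weight `beta` and all function names are assumptions, and no training step is shown.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    """One fully connected sigmoid layer; weights holds one row per output unit."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def kl_sparsity(mean_activations, rho=0.05):
    """KL penalty that pushes each hidden unit's mean activation towards rho."""
    return sum(rho * math.log(rho / r)
               + (1 - rho) * math.log((1 - rho) / (1 - r))
               for r in mean_activations)


def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def sae_loss(batch, enc, dec, rho=0.05, beta=3.0):
    """Reconstruction error plus beta times the sparsity penalty for one layer."""
    hidden = [dense(x, *enc) for x in batch]
    recon = [dense(h, *dec) for h in hidden]
    mse = sum(sum((r - x) ** 2 for r, x in zip(rv, xv))
              for rv, xv in zip(recon, batch)) / (2 * len(batch))
    n_hidden = len(hidden[0])
    mean_act = [sum(h[j] for h in hidden) / len(batch) for j in range(n_hidden)]
    return mse + beta * kl_sparsity(mean_act, rho)


rng = random.Random(0)
enc1 = init_layer(8, 12, rng)    # first hidden layer: 12 inputs -> 8 units
dec1 = init_layer(12, 8, rng)    # decoder used only to score the first layer
enc2 = init_layer(4, 8, rng)     # second hidden layer: 8 -> 4 units
batch = [[rng.random() for _ in range(12)] for _ in range(5)]

loss1 = sae_loss(batch, enc1, dec1)
codes = [dense(dense(x, *enc1), *enc2) for x in batch]
print(round(loss1, 4), len(codes[0]))
```

In a full model each layer would be trained to minimize this loss before the next is stacked on top, and the final codes would feed a classifier that separates true splice sites from false ones.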