Ying Zeng, Hongjie Yuan, Zheming Yuan, Yuan Chen.
Abstract
BACKGROUND: Splice site prediction is a long-standing problem in bioinformatics. Although many computational approaches to splice site prediction achieve satisfactory accuracy, further improvement matters because it allows gene structure to be predicted more accurately. A proper window size must be determined before prediction. An overly long window may introduce irrelevant features that reduce predictive accuracy, whereas a short window that retains the maximum information may perform better in both predictive accuracy and time cost. Furthermore, false splice sites that follow the GT-AG rule far outnumber true splice sites, so accurate and rapid prediction from large, imbalanced samples has always been a challenge. We therefore developed a new computational method, the chi-square decision table (χ2-DT), for donor splice site prediction based on a short window size and large imbalanced samples.
Keywords: Balanced decision table; Chi-square test; Donor splice site; Short window size; χ2-DT
Year: 2019 PMID: 30975175 PMCID: PMC6460831 DOI: 10.1186/s13062-019-0236-y
Source DB: PubMed Journal: Biol Direct ISSN: 1745-6150 Impact factor: 4.540
Descriptions of various datasets
| Datasets | Number of true donor sites | Number of false donor sites |
|---|---|---|
| HS3Dall | 2796 | 271928 |
| HS3DI | 2796 | 2796 |
| HS3DII | 2796 | 5000 |
| HS3DIII | 2796 | 10000 |
| HS3DIV | 2796 | 15000 |
| HS3D-test1:1 | 796 | 796 |
| HS3D-train1:1 | 2000 | 2000 |
| HS3D-train1:10 | 2000 | 20000 |
| HS3D-train1:20 | 2000 | 40000 |
| HS3D-train1:50 | 2000 | 100000 |
| HS3D-train1:135 | 2000 | 271132 |
| BG-570orig | 2127 | 149039 |
| BG-570muta | 2081 | 149572 |
Fig. 1 Compressing the 2 × 4 contingency table of position 6. a: 2 × 4 contingency table of position 6. b: 2 × 2 contingency table of position 6 after compression
Fig. 2 Illustration of the compression procedure (position 6 in HS3D-train1:135)
Fig. 3 log(p⁻¹) values for different positions. ↑: columns marked with arrows indicate positions whose log(p⁻¹) values are higher than that of position −2. For simplicity, only the log(p⁻¹) values of positions −15 to +15 are shown
Imbalanced decision table based on HS3D-train1:135
| Sample | Decision rule | | | Total |
|---|---|---|---|---|
| | ( | … | ( | |
| positive | 5 | … | 11 | 2000 |
| negative | 47,512 | … | 368 | 271,132 |
Balanced decision table based on HS3D-train1:135
| Sample | Decision rule | | | Total |
|---|---|---|---|---|
| | ( | … | ( | |
| positive | 5 | … | 11 | 2000 |
| negative (adjusted) | 350.5 | … | 2.7 | 2000 |
Independent test accuracy based on various window sizes
| Window size | Feature dimension | SN (%) | SP (%) | (SN + SP)/2(%) | MCC | Time (mm:ss) |
|---|---|---|---|---|---|---|
| 11 bp (−3 to +8) | 27 | 93.09 | 91.58 | 92.34 | 0.847 | 00:18 |
| 20 bp (−10 to +10) | 36 | 93.34 | 90.95 | 92.15 | 0.843 | 00:24 |
| 40 bp (−20 to +20) | 56 | 91.33 | 91.83 | 91.58 | 0.832 | 01:09 |
| 138 bp (−70 to +68) | 154 | 92.71 | 89.45 | 91.08 | 0.822 | 07:18 |
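The SN, SP, (SN + SP)/2 and MCC columns follow the usual confusion-matrix definitions. As a minimal sketch, the Python snippet below recomputes them for the 11 bp row. The counts are an assumption: they are back-calculated from the reported percentages and the 796:796 composition of HS3D-test1:1, not taken from the paper, and `binary_metrics` is a hypothetical helper name.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, their mean, and Matthews correlation coefficient."""
    sn = tp / (tp + fn)                      # true positive rate
    sp = tn / (tn + fp)                      # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "mean": (sn + sp) / 2, "MCC": mcc}

# Assumed counts for a 796:796 test set (inferred from the percentages, not reported)
m = binary_metrics(tp=741, fn=55, tn=729, fp=67)
print(round(m["SN"] * 100, 2), round(m["SP"] * 100, 2),
      round(m["mean"] * 100, 2), round(m["MCC"], 3))
```

With these assumed counts the four values round to the figures in the 11 bp row (93.09, 91.58, 92.34, 0.847).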
Independent test accuracy based on imbalanced and balanced decision tables
| Training set | SN (%) imbal. | SN (%) bal. | SP (%) imbal. | SP (%) bal. | (SN + SP)/2 (%) imbal. | (SN + SP)/2 (%) bal. | MCC imbal. | MCC bal. |
|---|---|---|---|---|---|---|---|---|
| HS3D-train1:1 | 93.09 | 93.09 | 91.58 | 91.58 | 92.34 | 92.34 | 0.847 | 0.847 |
| HS3D-train1:10 | 81.53 | 94.35 | 96.36 | 91.08 | 89.51 | 92.71 | 0.788 | 0.855 |
| HS3D-train1:20 | 78.14 | 93.59 | 96.98 | 92.46 | 87.56 | 93.03 | 0.765 | 0.861 |
| HS3D-train1:50 | 76.76 | 94.22 | 96.98 | 92.34 | 86.87 | 93.28 | 0.753 | 0.866 |
| HS3D-train1:135 | 68.84 | 93.97 | 97.61 | 92.71 | 83.23 | 93.34 | 0.694 | 0.867 |
imbal. denotes the imbalanced decision table; bal. denotes the balanced decision table
Independent test accuracy based on different classifiers
| Classifier | SN (%) HS3D-train1:1 | SN (%) HS3D-train1:135 | SP (%) HS3D-train1:1 | SP (%) HS3D-train1:135 | (SN + SP)/2 (%) HS3D-train1:1 | (SN + SP)/2 (%) HS3D-train1:135 | MCC HS3D-train1:1 | MCC HS3D-train1:135 |
|---|---|---|---|---|---|---|---|---|
| RF | 94.77 | 16.58 | 91.31 | 99.87 | 93.04 | 58.23 | 0.862 | 0.297 |
| ANN | 91.58 | 12.06 | 91.83 | 99.91 | 91.71 | 55.98 | 0.834 | 0.248 |
| RVKDE | 96.23 | 23.37 | 88.82 | 99.50 | 92.53 | 61.43 | 0.853 | 0.353 |
| χ2-DT | 93.09 | 93.97 | 91.58 | 92.71 | 92.34 | 93.34 | 0.847 | 0.867 |
Independent test accuracy based on different features
| Testing set | Feature | SN (%) | SP (%) | (SN+SP)/2 (%) | MCC | Q9 (%) |
|---|---|---|---|---|---|---|
| BG-570orig | positional | 93.09 | 92.11 | 92.60 | 0.349 | 92.58 |
| BG-570orig | positional+compositional | 93.51 | 92.15 | 92.83 | 0.352 | 92.70 |
| BG-570muta | positional | 90.55 | 91.77 | 91.16 | 0.329 | 91.14 |
| BG-570muta | positional+compositional | 92.67 | 92.12 | 92.40 | 0.344 | 92.39 |
10-fold cross-validation accuracy in comparison with long-window-size-based methods
| Method | Window size (bp) | Ratio of positive-to-negative samples | SN (%) | SP (%) | (SN + SP)/2 (%) | Q9 (%) |
|---|---|---|---|---|---|---|
| MM1-H2MM | 140 | 2796:27960 (1:10) | 93.81 | 91.69 | 92.75 | 92.63 |
| SVM-B | 140 | 2796:27960 (1:10) | 94.13 | 90.99 | 92.56 | 92.39 |
| Meher’s method | 102 | 2796:53124 (1:19) | 88.30 | 89.40 | 88.90 | 88.80 |
| χ2-DT | 11 | 2796:271928 (1:97) | 94.11 | 92.58 | 93.35 | 93.30 |
10-fold cross-validation accuracy in comparison with short-window-size-based methods
| Method | AUC-ROC (±SE) 2796:2796 | AUC-ROC (±SE) 2796:5000 | AUC-ROC (±SE) 2796:10000 | AUC-ROC (±SE) 2796:15000 | AUC-PR (±SE) 2796:2796 | AUC-PR (±SE) 2796:5000 | AUC-PR (±SE) 2796:10000 | AUC-PR (±SE) 2796:15000 |
|---|---|---|---|---|---|---|---|---|
| SAE | 0.946 (±0.0031) | 0.945 (±0.0031) | 0.944 (±0.0030) | 0.945 (±0.0030) | 0.945 (±0.0031) | 0.876 (±0.0045) | 0.772 (±0.0055) | 0.682 (±0.0059) |
| MEM | 0.948 (±0.0031) | 0.946 (±0.0031) | 0.947 (±0.0030) | 0.947 (±0.0030) | 0.947 (±0.0031) | 0.878 (±0.0045) | 0.773 (±0.0055) | 0.683 (±0.0059) |
| MDD | 0.945 (±0.0031) | 0.942 (±0.0032) | 0.944 (±0.0030) | 0.944 (±0.0030) | 0.944 (±0.0031) | 0.872 (±0.0046) | 0.769 (±0.0055) | 0.680 (±0.0059) |
| MM1 | 0.945 (±0.0031) | 0.941 (±0.0032) | 0.936 (±0.0032) | 0.941 (±0.0031) | 0.942 (±0.0032) | 0.870 (±0.0046) | 0.765 (±0.0056) | 0.679 (±0.0060) |
| WMM | 0.927 (±0.0036) | 0.924 (±0.0036) | 0.924 (±0.0035) | 0.925 (±0.0034) | 0.924 (±0.0037) | 0.867 (±0.0046) | 0.703 (±0.0060) | 0.675 (±0.0060) |
| χ2-DT | 0.965 (±0.0023) | 0.969 (±0.0027) | 0.971 (±0.0025) | 0.971 (±0.0025) | 0.953 (±0.0030) | 0.932 (±0.0034) | 0.879 (±0.0042) | 0.856 (±0.0038) |
SE: standard error
2 × 9 table for counting the number of the samples in each grid
| | 0 < | 0.05 < | 0.29 < | 0.55 < | 0.57 < | 0.62 < | 0.69 < | 0.71 < | 0.84 < | 0.85 < |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 < | 0 | 4 | 0 | 1 | 0 | 1 | 0 | 2 | 0 | 2 |
| 0.5 < | 2 | 0 | 3 | 0 | 3 | 0 | 1 | 0 | 1 | 0 |
2 × 3 table for counting the number of the samples in each grid
| | 0 < x ≤ 0.05 | 0.05 < x ≤ 0.29 | 0.29 < x < 1 |
|---|---|---|---|
| 0 < | 0 | 4 | 6 |
| 0.5 < | 2 | 0 | 8 |
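The 2 × 3 table above can be obtained from the finer table before it by summing adjacent columns. The Python sketch below merges columns at given cut indices and computes Pearson's chi-square statistic for the result. The cut indices are read off the two tables; how the method chooses them is not stated in this excerpt. The function names are hypothetical, and the base-10 logarithm is an assumption, since the base of log(p⁻¹) is not given here.

```python
import math

def merge_columns(table, groups):
    """Sum the columns of each row over half-open index ranges (start, stop)."""
    return [[sum(row[a:b]) for a, b in groups] for row in table]

def chi_square(table):
    """Pearson chi-square statistic of an r x c table of counts."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:                 # skip cells in empty rows or columns
                stat += (observed - expected) ** 2 / expected
    return stat

# Counts copied from the finer grid table (two rows, ten columns of counts)
fine = [
    [0, 4, 0, 1, 0, 1, 0, 2, 0, 2],
    [2, 0, 3, 0, 3, 0, 1, 0, 1, 0],
]
coarse = merge_columns(fine, [(0, 1), (1, 2), (2, 10)])
stat = chi_square(coarse)

# df = (2 - 1) * (3 - 1) = 2; only for 2 degrees of freedom is the
# chi-square survival function exactly exp(-x / 2).
p_value = math.exp(-stat / 2)
print(coarse)                                # [[0, 4, 6], [2, 0, 8]]
print(stat, math.log10(1 / p_value))         # log base 10 is an assumption
```

For tables with other degrees of freedom the closed form used for `p_value` does not apply, and a general chi-square survival function would be needed.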
Imbalanced decision table
| Sample | Decision rule | | | | Total |
|---|---|---|---|---|---|
| | ( | ( | ( | ( | |
| positive | 38 | 12 | 26 | 11 | 87 |
| negative | 184 | 1026 | 289 | 188 | 1687 |
Balanced decision table
| Sample | Decision rule | | | | Total |
|---|---|---|---|---|---|
| | ( | ( | ( | ( | |
| positive | 38 | 12 | 26 | 11 | 87 |
| negative | 9.5 | 52.9 | 14.9 | 9.7 | 87 |
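The adjusted negative counts in the balanced table are consistent with scaling each negative count by the ratio of the positive total to the negative total (87/1687 here). The Python sketch below reproduces that scaling. The scaling rule is inferred from the numbers in the two tables, and `majority_labels` is a hypothetical helper that assumes each rule votes for its larger count; neither is confirmed as the authors' exact procedure.

```python
def balance_negatives(pos_counts, neg_counts):
    """Rescale negative counts so that their total equals the positive total."""
    scale = sum(pos_counts) / sum(neg_counts)
    return [n * scale for n in neg_counts]

def majority_labels(pos_counts, neg_counts):
    """Assumed voting rule: each decision rule predicts the class with the larger count."""
    return ["positive" if p > n else "negative"
            for p, n in zip(pos_counts, neg_counts)]

# Counts copied from the imbalanced decision table (4 rules, 87 vs 1687 samples)
pos = [38, 12, 26, 11]
neg = [184, 1026, 289, 188]

adjusted = balance_negatives(pos, neg)
print([round(x, 1) for x in adjusted])   # [9.5, 52.9, 14.9, 9.7]
print(majority_labels(pos, neg))         # all four rules favor "negative"
print(majority_labels(pos, adjusted))    # three of four rules now favor "positive"
```

Under the assumed voting rule, the raw counts would label every rule negative, which fits the drop in SN reported for the imbalanced decision table as the negative-to-positive ratio grows.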