Yuming Ma, Yihui Liu, Jinyong Cheng.
Abstract
Protein secondary structure prediction is one of the most important and challenging problems in bioinformatics. Machine learning techniques have been applied to the problem with substantial success, but there is still room for improvement toward the theoretical limit. In this paper, we present a novel method for protein secondary structure prediction based on data partitioning and a semi-random subspace method (PSRSM). Data partitioning is a key strategy of the method. First, the protein training dataset is partitioned into several subsets based on protein sequence length. Then, on each subset, base classifiers are trained on the subspace data generated by the semi-random subspace method and combined by majority vote into an ensemble classifier. Each ensemble classifier is thus trained on a different subset and is used to predict the secondary structures of proteins whose sequence length falls in that subset's range. Experiments on the 25PDB, CB513, CASP10, CASP11, CASP12, and T100 datasets achieve accuracies of 86.38%, 84.53%, 85.51%, 85.89%, 85.55%, and 85.09%, respectively. The experimental results show that our method outperforms other state-of-the-art methods.
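The majority vote step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each base classifier outputs one three-state label (H, E, or C) per residue; the function name `majority_vote` and the tie-breaking rule are assumptions for illustration, not details taken from the paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-residue labels from several base classifiers.

    predictions: list of equal-length strings, one string per base
    classifier, over the three-state alphabet H, E, C.
    Assumption: ties go to the label seen first among the classifiers.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")

    combined = []
    for column in zip(*predictions):
        counts = Counter(column)
        # max() returns the first label that reaches the highest count
        combined.append(max(counts, key=counts.get))
    return "".join(combined)


# Three hypothetical base classifiers voting on a four-residue fragment
ensemble_prediction = majority_vote(["HHEC", "HEEC", "CHEC"])
```

With an odd number of base classifiers, two-way ties cannot occur, but a three-way split over H, E, and C is still possible, which is why the sketch needs an explicit tie rule.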
Year: 2018 PMID: 29959372 PMCID: PMC6026213 DOI: 10.1038/s41598-018-28084-8
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. PSRSM framework. Training data D is partitioned into k subsets D1, D2, …, Di, …, Dk; Sij is the jth subspace data of subset Di, and Cij is a base classifier trained on Sij.
Q3 accuracy (%) of the tested methods on the CASP10, CASP11, CASP12, and CB513 datasets. (The results of SPINE-X, PSIPRED, JPRED, and DeepCNF are taken from the papers [2,27].)
| Method | CASP10 | CASP11 | CASP12 | CB513 |
|---|---|---|---|---|
| SPINE-X | 80.7 | 79.3 | 76.9 | 78.9 |
| PSIPRED | 81.2 | 80.7 | 78.0 | 79.2 |
| JPRED | 81.6 | 80.4 | 75.1 | 81.7 |
| DeepCNF | 84.4 | 84.7 | 82.1 | 82.3 |
| PSRSM | 85.51 | 85.89 | 85.55 | 84.53 |
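For readers unfamiliar with the metric, Q3 is the percentage of residues whose three-state label (H, E, or C) is predicted correctly. The sketch below computes it for a single sequence; the function name `q3_accuracy` is illustrative, and how the paper aggregates Q3 over a whole dataset (per residue or per protein) is not stated here.

```python
def q3_accuracy(predicted, observed):
    """Return the percentage of residues with a correctly predicted state.

    predicted, observed: equal-length strings over the alphabet H, E, C.
    """
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Hypothetical five-residue example with one wrong residue
example_q3 = q3_accuracy("HHHEC", "HHEEC")
```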
Q3 accuracy of PSRSM and DeepCNF for each protein in the T100 dataset.
| Protein name | PSRSM (Q3%) | DeepCNF (Q3%) | Length | Protein name | PSRSM (Q3%) | DeepCNF (Q3%) | Length |
|---|---|---|---|---|---|---|---|
| 5K4W_A | 96.88 | 85.67 | 321 | 5Y5Z_A | 82.70 | 80.45 | 578 |
| 5MOI_A | 80.27 | 68.61 | 223 | 6B2N_A | 71.10 | 87.07 | 263 |
| 5MOJ_A | 89.24 | 78.30 | 223 | 6BT3_C | 85.00 | 73.64 | 220 |
| 5MOK_ | 89.24 | 77.58 | 223 | 6F0E_A | 86.86 | 80.45 | 312 |
| 5NA1_A | 76.47 | 79.66 | 408 | 6F1T_G | 82.45 | 80.85 | 376 |
| 5O7K_A | 80.21 | 86.46 | 96 | 6F40_A | 75.75 | 74.79 | 1460 |
| 5QAN_A | 91.77 | 78.60 | 243 | 5GZJ_A | 85.24 | 88.30 | 359 |
| 5UB4_A | 83.93 | 84.29 | 280 | 5BK1_H | 93.22 | 80.93 | 236 |
| 5VSA_A | 86.62 | 83.44 | 314 | 5GZI_B | 84.68 | 87.74 | 359 |
| 6AOK_A | 85.71 | 75.12 | 217 | 5K4Y_A | 97.19 | 86.56 | 320 |
| 6FEL_A | 94.07 | 89.83 | 236 | 5LCP_B | 95.00 | — | 20 |
| 6F2L_A | 70.07 | 85.53 | 304 | 5LH4_A | 99.55 | 87.44 | 223 |
| 6F0Z_A | 80.13 | 87.70 | 317 | 5MB5_A | 88.18 | 81.82 | 330 |
| 6EM0_ | 78.83 | 85.20 | 581 | 5MR9_A | 81.37 | 71.57 | 102 |
| 6EHH_A | 94.89 | 85.80 | 176 | 5NXG_A | 98.05 | 83.27 | 257 |
| 5QAE_A | 92.18 | 79.84 | 243 | 5O5I_A | 72.83 | 90.22 | 92 |
| 5QAK_A | 92.18 | 79.84 | 243 | 5V6F_A | 76.81 | 76.81 | 138 |
| 6AX2_A | 73.91 | 82.61 | 46 | 5WHI_A | 93.79 | 90.06 | 161 |
| 6AZ2_A | 91.70 | 81.22 | 229 | 5WXE_A | 60.71 | 60.71 | 28 |
| 6B5G_A | 94.32 | 89.86 | 493 | 6F1D_A | 94.87 | 88.03 | 117 |
| 6B7Z_A | 86.54 | 85.09 | 966 | 5KDB_A | 96.18 | 86.01 | 393 |
| 6BB5_A | 94.96 | 84.89 | 139 | 5KDY_A | 95.42 | 86.26 | 393 |
| 6BBQ_A | 76.73 | 89.62 | 520 | 5N2O_A | 88.57 | 92.86 | 70 |
| 6FD3_A | 80.67 | 85.33 | 300 | 5NEC_A | 84.48 | 86.64 | 741 |
| 6B3G_A | 87.88 | 87.88 | 99 | 5O3U_A | 91.99 | 83.70 | 724 |
| 5XXR_A | 87.12 | 88.64 | 132 | 5O6V_A | 70.36 | 74.19 | 496 |
| 5WVM_ | 84.68 | 80.16 | 509 | 5OQZ_A | 77.78 | — | 18 |
| 5WCT_A | 63.64 | 73.80 | 187 | 5OYD_A | 89.39 | 85.10 | 396 |
| 5W30_A | 79.44 | 79.44 | 180 | 5UG6_A | 91.28 | 87.25 | 149 |
| 5MZV_B | 80.81 | 80.81 | 198 | 5UOE_A | 94.24 | 86.77 | 990 |
| 6F73_B | 62.02 | 78.22 | 574 | 5UOZ_A | 71.43 | — | 21 |
| 6BVC_A | 83.62 | 81.92 | 177 | 5V23_A | 78.57 | 86.73 | 98 |
| 5M3U_ | 91.35 | 83.17 | 416 | 5VDF_A | 94.52 | 87.67 | 73 |
| 5BJZ_B | 97.24 | 85.18 | 398 | 5W92_A | 71.07 | 78.68 | 197 |
| 5LUH_A | 90.74 | 79.26 | 270 | 5WAT_A | 82.22 | 86.36 | 315 |
| 5MOP_ | 90.13 | 83.41 | 223 | 5WOT_A | 93.43 | 80.30 | 198 |
| 5MR5_A | 80.39 | 72.55 | 102 | 5WOZ_A | 89.86 | 92.03 | 138 |
| 5NXP_A | 98.45 | 83.33 | 258 | 5WPX_A | 79.78 | 78.65 | 89 |
| 5XEE_A | 76.53 | 77.55 | 98 | 5XBK_A | 80.77 | 81.01 | 416 |
| 5YPK_A | 91.32 | 83.88 | 242 | 5M88_A | 89.71 | 92.65 | 136 |
| 5YQW_A | 87.41 | 79.89 | 532 | 5MNV_A | 89.19 | 87.71 | 407 |
| 5YWZ_A | 73.55 | 80.17 | 242 | 5MOS_A | 99.55 | 87.44 | 223 |
| 5Z0T_A | 94.03 | 80.38 | 637 | 5MVO_A | 70.45 | 75.95 | 291 |
| 6AX6_A | 79.15 | 81.28 | 235 | 5N1D_A | 90.37 | 84.99 | 353 |
| 6BGN_A | 98.33 | 83.33 | 60 | 5N1N_A | 88.95 | 88.67 | 353 |
| 6C2I_A | 74.21 | 79.08 | 411 | 5O5C_A | 82.08 | 84.39 | 519 |
| 6C8S_A | 88.13 | 78.63 | 379 | 5OQ1_A | 85.40 | 81.02 | 137 |
| 5WDD_A | 93.45 | 91.07 | 168 | 5ORK_B | 78.41 | 85.51 | 352 |
| 6AVD_A | 70.00 | 80.00 | 40 | 5OTY_A | 73.39 | 77.49 | 342 |
| 6FO0_N | 88.75 | 87.28 | 480 | 5URT_A | 71.43 | — | 21 |
(The DeepCNF online server reports an error for protein sequences with more than 4000 or fewer than 26 amino acids, so no DeepCNF value is given for those proteins.)
Average Q3 accuracies of PSRSM, DeepCNF, SPIDER3, MUFOLD, PSIPRED, and JPRED on T100, together with their Q3 accuracies in the internal regions and at the boundary regions of secondary structures.
| Method | Q3 (average) | Q3 (internal) | Q3 (boundary) |
|---|---|---|---|
| DeepCNF | 82.78 | 85.68 | 73.30 |
| SPIDER3 | 82.41 | 88.25 | 70.72 |
| MUFOLD | 84.35 | 89.28 | 74.65 |
| PSIPRED | 76.33 | 82.84 | 63.06 |
| JPRED | 74.45 | 81.42 | 60.25 |
| PSRSM | 85.09 | 89.89 | 75.33 |
DeepCNF is available only for proteins with lengths in [26, 4000], MUFOLD for lengths in [30, 700], and JPRED for lengths in [20, 800].
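The length limits above determine which of these three methods can be run on a given T100 protein. A small sketch of that check follows; the dictionary and function names are illustrative, and only the three ranges stated in the note are encoded.

```python
# Inclusive length ranges as stated in the note above
LENGTH_LIMITS = {
    "DeepCNF": (26, 4000),
    "MUFOLD": (30, 700),
    "JPRED": (20, 800),
}


def usable_methods(length):
    """Return the length-limited methods that accept a protein of this length."""
    return sorted(
        name
        for name, (low, high) in LENGTH_LIMITS.items()
        if low <= length <= high
    )


# 5LCP_B has 20 residues and 6F40_A has 1460 in the T100 table
short_protein_methods = usable_methods(20)
long_protein_methods = usable_methods(1460)
```

This matches the T100 table, where proteins of length 18 to 21 have no DeepCNF value.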
Comparison of classifier_C and PSRSM1 on CB513.
| Protein length L | Q3 (%), Classifier_C | Q3 (%), PSRSM1 | Classifier_C training proteins | Classifier_C training amino acids | PSRSM1 training proteins | PSRSM1 training amino acids |
|---|---|---|---|---|---|---|
| [1,100] | 75.48 | 83.25 | 176 | 10996 | 2260 | 161952 |
| (100,200] | 78.17 | 76.44 | 255 | 37369 | 0 | 0 |
| (200,300] | 78.60 | 75.83 | 137 | 34072 | 0 | 0 |
| (300,400] | 75.94 | 73.82 | 105 | 35529 | 0 | 0 |
| (400,500] | 75.81 | 72.07 | 63 | 27818 | 0 | 0 |
| L > 500 | 74.01 | 71.23 | 64 | 42277 | 0 | 0 |
| all | 77.16 | 77.57 | 800 | 188061 | 2260 | 161952 |
Q3 accuracy (%) of each ensemble classifier on proteins of different lengths in the T100 dataset.
| Protein length L | PSRSM1 | PSRSM2 | PSRSM3 | PSRSM4 | PSRSM5 | PSRSM6 |
|---|---|---|---|---|---|---|
| [1,100] | 79.84 | 63.11 | 62.75 | 63.40 | 62.89 | 64.13 |
| (100,200] | 78.19 | 84.58 | 81.02 | 78.99 | 77.18 | 78.16 |
| (200,300] | 74.39 | 78.14 | 87.59 | 78.99 | 75.95 | 75.15 |
| (300,400] | 74.00 | 75.63 | 78.80 | 87.51 | 78.62 | 77.64 |
| (400,500] | 74.23 | 76.69 | 77.09 | 80.81 | 83.24 | 77.06 |
| L > 500 | 73.59 | 75.87 | 75.64 | 76.30 | 77.12 | 83.93 |
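The table supports the routing rule described in the abstract: for every length range, the highest Q3 comes from the ensemble classifier trained on the matching subset. The sketch below checks this directly from the values copied out of the table; the variable and function names are illustrative.

```python
# Rows copied from the table above; columns are PSRSM1 .. PSRSM6
Q3_BY_LENGTH = {
    "[1,100]":   [79.84, 63.11, 62.75, 63.40, 62.89, 64.13],
    "(100,200]": [78.19, 84.58, 81.02, 78.99, 77.18, 78.16],
    "(200,300]": [74.39, 78.14, 87.59, 78.99, 75.95, 75.15],
    "(300,400]": [74.00, 75.63, 78.80, 87.51, 78.62, 77.64],
    "(400,500]": [74.23, 76.69, 77.09, 80.81, 83.24, 77.06],
    "L > 500":   [73.59, 75.87, 75.64, 76.30, 77.12, 83.93],
}


def best_classifier(scores):
    """Return the name of the classifier with the highest Q3 in a row."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return f"PSRSM{best_index + 1}"


# Best classifier per length range, in table order
best_per_range = [best_classifier(row) for row in Q3_BY_LENGTH.values()]
```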
Training time on each subset of the ASTRAL + CullPDB.
| Subset | No sampling | Sampling (PSRSM) |
|---|---|---|
| D1 | 7 days | 1.5 days |
| D2 | 30 days | 6 days |
| D3 | 45 days | 8 days |
| D4 | 40 days | 7 days |
| D5 | 15 days | 3 days |
| D6 | 35 days | 6.5 days |
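The reduction in training time can be expressed as a speedup factor per subset. The sketch below is plain arithmetic on the table values; the summed totals are meaningful only under the assumption that the subsets are trained one after another, which is not stated here.

```python
# (no sampling, sampling) training time in days, copied from the table above
TRAINING_DAYS = {
    "D1": (7.0, 1.5),
    "D2": (30.0, 6.0),
    "D3": (45.0, 8.0),
    "D4": (40.0, 7.0),
    "D5": (15.0, 3.0),
    "D6": (35.0, 6.5),
}

# Speedup factor per subset: time without sampling divided by time with sampling
speedups = {
    name: no_sampling / sampling
    for name, (no_sampling, sampling) in TRAINING_DAYS.items()
}

# Totals, assuming sequential training of the six subsets
total_no_sampling = sum(days[0] for days in TRAINING_DAYS.values())
total_sampling = sum(days[1] for days in TRAINING_DAYS.values())
overall_speedup = total_no_sampling / total_sampling
```

On these numbers, sampling cuts training time by a factor of roughly 4.7 to 5.7 on each subset.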
Subsets of training data ASTRAL + CullPDB.
| Subset | Protein length L | Number of proteins | Number of amino acids |
|---|---|---|---|
| D1 | (0, 100] | 2260 | 161952 |
| D2 | (100, 200] | 5256 | 774167 |
| D3 | (200, 300] | 3548 | 877583 |
| D4 | (300, 400] | 2382 | 822913 |
| D5 | (400, 500] | 1170 | 519422 |
| D6 | (500, ∞) | 1058 | 707309 |
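The length-based partition in this table can be sketched as follows. The interval boundaries come from the table; the function names and the use of `bisect` are illustrative choices, not the authors' implementation.

```python
from bisect import bisect_left

# Upper bounds of D1 .. D5; anything longer than 500 falls into D6
UPPER_BOUNDS = [100, 200, 300, 400, 500]


def subset_index(length):
    """Map a protein length to its subset number (1 for D1, ..., 6 for D6).

    Intervals are half-open on the left, e.g. (100, 200], as in the table.
    """
    if length <= 0:
        raise ValueError("protein length must be positive")
    return bisect_left(UPPER_BOUNDS, length) + 1


def partition_by_length(sequences):
    """Group amino acid sequences into the six length-based subsets."""
    subsets = {f"D{i}": [] for i in range(1, 7)}
    for sequence in sequences:
        subsets[f"D{subset_index(len(sequence))}"].append(sequence)
    return subsets


# Hypothetical sequences of length 20, 150, and 501
example_subsets = partition_by_length(["A" * 20, "A" * 150, "A" * 501])
```

The same `subset_index` lookup can serve at prediction time to pick which ensemble classifier handles a query protein.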
Figure 2. Relationship between Q3 accuracy and dimension of subspace.