Lili Qian, Yaping Wen, Guosheng Han.
Abstract
Cancerlectins play an important role in the initiation, survival, growth, metastasis, and spread of cancer. Studying their function is therefore of great significance, because it can help identify tumor markers and support tumor prevention, treatment, and prognosis. However, numerous studies have generated a large amount of protein data, and traditional prediction methods can no longer meet the needs of analysis. Developing powerful computational models based on these data to discriminate cancerlectins from non-cancerlectins on a large scale has become an important topic. In this study, we developed a feature extraction method to identify cancerlectins based on the fusion of g-gap dipeptides. Analysis of variance was used to select the optimal feature set, and a support vector machine was used to classify the data. Rigorous nested 10-fold cross-validation showed that our method achieved a prediction accuracy of 83.91% and a sensitivity of 83.15%. To further evaluate the classification model constructed in this work, we built a new data set, on which the prediction accuracy reached 83.3%. Experimental results show that our method performs better than state-of-the-art methods.
Keywords: analysis of variance; cancerlectins; feature selection; g-gap dipeptide; support vector machine
Year: 2020 PMID: 32318092 PMCID: PMC7147460 DOI: 10.3389/fgene.2020.00275
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
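The g-gap dipeptide features named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the normalization by the number of counted pairs, and the default `max_gap=9` (chosen because gaps 0 through 9 appear in the feature names of the optimal subset) are assumptions. Feature names follow the `X_Yg` pattern used in that table, for example `L_R1` for L and R separated by one residue.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def g_gap_dipeptide_features(sequence, g):
    """Frequencies of residue pairs separated by g intervening positions.

    Assumption: frequencies are normalized by the number of counted pairs.
    """
    names = [f"{a}_{b}{g}" for a, b in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(names, 0)
    total = 0
    for i in range(len(sequence) - g - 1):
        key = f"{sequence[i]}_{sequence[i + g + 1]}{g}"
        if key in counts:  # skip pairs containing non-standard residues
            counts[key] += 1
            total += 1
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


def fused_features(sequence, max_gap=9):
    """Concatenate the g-gap dipeptide features for g = 0..max_gap."""
    fused = {}
    for g in range(max_gap + 1):
        fused.update(g_gap_dipeptide_features(sequence, g))
    return fused


# Hypothetical toy sequence, used only to show the call pattern.
features = g_gap_dipeptide_features("LARLAR", 1)
print(features["L_R1"])
```

With ten gap values and 400 residue pairs per gap, the fused vector in this sketch has 4000 entries, which is why a feature selection step is needed before classification.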
Figure 1. The flowchart of our method.
Figure 2. Prediction accuracy curve of feature subset.
F-value and P-value of features in optimal feature subset.
| Feature | F-value | P-value |
| --- | --- | --- |
| L_R1 | 26.76446 | 3.63E-07 |
| R_L4 | 24.81686 | 9.38E-07 |
| Q_E0 | 20.28248 | 8.77E-06 |
| I_D0 | 16.70216 | 5.28E-05 |
| N_K3 | 16.34925 | 6.32E-05 |
| N_D6 | 15.78628 | 8.40E-05 |
| Q_P9 | 15.52386 | 9.61E-05 |
| I_D4 | 15.23462 | 0.000111 |
| D_N0 | 14.73123 | 0.000144 |
| P_A1 | 14.28921 | 0.000181 |
| N_D1 | 14.13802 | 0.000195 |
| P_L5 | 13.87658 | 0.000223 |
| L_P7 | 13.82865 | 0.000229 |
| S_N5 | 13.69697 | 0.000245 |
| A_L2 | 13.26494 | 0.000306 |
| A_R2 | 12.96445 | 0.000357 |
| L_P5 | 12.90963 | 0.000367 |
| R_Q3 | 12.90722 | 0.000368 |
| L_R8 | 12.702 | 0.000409 |
| N_D3 | 12.40946 | 0.000476 |
| N_G8 | 12.37352 | 0.000485 |
| D_N7 | 12.2193 | 0.000526 |
| D_N8 | 12.09143 | 0.000562 |
| L_C0 | 11.94945 | 0.000605 |
| N_V1 | 11.87518 | 0.000629 |
| E_L5 | 11.79776 | 0.000655 |
| Q_P1 | 11.78632 | 0.000659 |
| Q_A0 | 11.54244 | 0.000748 |
| L_E6 | 11.50195 | 0.000764 |
| R_P4 | 11.4276 | 0.000794 |
| P_L6 | 11.23968 | 0.000877 |
| Q_M7 | 11.22643 | 0.000883 |
| D_G0 | 11.22351 | 0.000884 |
| S_P2 | 11.17902 | 0.000905 |
| Q_L1 | 11.06357 | 0.000961 |
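The F-values listed above come from analysis of variance. As a hedged illustration of how such a score can be computed for one feature and used to rank features, the sketch below implements the standard one-way ANOVA F statistic (between-group mean square divided by within-group mean square). The function names and the toy numbers are hypothetical and are not taken from the paper's data.

```python
def anova_f_value(groups):
    """One-way ANOVA F statistic for a single feature across sample groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    # Note: raises ZeroDivisionError if the feature is constant within every group.
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def rank_by_f_value(positives, negatives):
    """Sort feature names by decreasing F-value.

    positives / negatives: lists of {feature_name: value} dicts, one per sample.
    """
    scores = {
        name: anova_f_value([[s[name] for s in positives],
                             [s[name] for s in negatives]])
        for name in positives[0]
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: feature "b" separates the classes better than "a".
pos = [{"a": 1.0, "b": 5.0}, {"a": 2.0, "b": 5.5}, {"a": 3.0, "b": 6.0}]
neg = [{"a": 2.0, "b": 1.0}, {"a": 3.0, "b": 1.5}, {"a": 4.0, "b": 2.0}]
print(rank_by_f_value(pos, neg))
```

A larger F-value means the feature's class means differ more relative to its within-class spread, so the top-ranked features are the natural candidates for the optimal subset.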
Figure 3. The ROC curve for cancerlectin prediction using the optimal 35 g-gap dipeptides.
Classification of new data.
| Sequence ID | Predicted label |
| --- | --- |
| 1016841179 | 1 |
| 1016841154 | 1 |
| 1016841024 | 1 |
| 1016841005 | 1 |
| 560189093 | 1 |
| 720063203 | 1 |
| 727346123 | 1 |
| 469469047 | 0 |
| 403420575 | 1 |
| 385719187 | 1 |
| 384367986 | 1 |
| 388890228 | 1 |
| 1508736536 | 1 |
| 873090602 | 1 |
| 1022943309 | 1 |
| 974005177 | 1 |
| 392996940 | 0 |
| 385719190 | 1 |
| 1391723745 | 1 |
| 400260732 | 1 |
| 1370479176 | 1 |
| 1370451719 | 1 |
| 1034557774 | 1 |
| 768011769 | 1 |
| 768007991 | 1 |
| 768006291 | 0 |
| 1258501064 | 0 |
| 1272616377 | 1 |
| 1272616369 | 1 |
| 859066280 | 0 |
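The 83.3% accuracy reported in the abstract for the new data set can be checked against the rows above. The sketch assumes that each row is a sequence identifier followed by a 0/1 outcome, and that a 1 counts as a correct classification; neither reading is stated explicitly in the table.

```python
# Second-column values of the 30 rows above, in order.
outcomes = [
    1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
    1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
    1, 1, 1, 1, 1, 0, 0, 1, 1, 0,
]

# Assumption: a 1 marks a correctly classified sequence.
correct = sum(outcomes)
accuracy = correct / len(outcomes)
print(f"{correct}/{len(outcomes)} = {accuracy:.1%}")
```

Under that assumption, 25 of the 30 sequences are classified correctly, which matches the reported figure.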
Comparison of classification results of new data.
| Method | Accuracy (%) |
| --- | --- |
| CancerPred, amino acid composition (Kumar and Panwar) | 70 |
| CancerPred, dipeptide composition (Kumar and Panwar) | 76.67 |
| CancerPred, split composition, 2-part (Kumar and Panwar) | 56.67 |
| CancerPred, split composition, 4-part (Kumar and Panwar) | 60 |
| Our method | 83.3 |
Comparison with the results of existing classification models.
| Method | Sensitivity (%) | Specificity (%) | Accuracy (%) |
| --- | --- | --- | --- |
| Kumar and Panwar | 68.00 | 69.90 | 69.09 |
| Lin et al. | 69.10 | 80.10 | 75.19 |
| Damodaran et al. | 75.28 | 80.53 | 77.48 |
| Our method | 83.15 | 80.87 | 83.91 |
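Sensitivity and accuracy, the metrics quoted in the abstract, are conventionally derived from a confusion matrix, as is specificity. The sketch below shows the standard definitions. The function name and the counts in the usage line are hypothetical and do not reproduce any result from the comparison above.

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion-matrix counts.

    tp / fn: positives (cancerlectins) predicted correctly / incorrectly.
    tn / fp: negatives (non-cancerlectins) predicted correctly / incorrectly.
    """
    sensitivity = tp / (tp + fn)  # share of true positives recovered
    specificity = tn / (tn + fp)  # share of true negatives recovered
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts, for illustration only.
sn, sp, acc = classification_metrics(tp=8, fn=2, tn=9, fp=1)
print(f"Sn={sn:.2%}  Sp={sp:.2%}  Acc={acc:.2%}")
```

In a cross-validation setting these values are typically computed per fold and then averaged, so the averaged accuracy need not equal a simple weighted mean of the averaged sensitivity and specificity.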