Zixiao Zhang, Yue Gong, Bo Gao, Hongfei Li, Wentao Gao, Yuming Zhao, Benzhi Dong.
Abstract
Soluble N-ethylmaleimide-sensitive factor attachment protein receptor (SNARE) proteins are a large family of transmembrane proteins located in organelles and vesicles. Their important roles include initiating vesicle fusion and activating and fusing proteins during exocytosis; SNARE proteins are also vital for regulating the transport of membrane proteins and non-regulatory vesicles. It is therefore important to establish a method that identifies SNARE proteins efficiently. However, the identification accuracy of existing methods such as SNARE-CNN is not satisfactory. In our study, we developed a method based on a support vector machine (SVM) that can effectively recognize SNARE proteins. We used the position-specific scoring matrix (PSSM) method to extract features from SNARE protein sequences, used the support vector machine recursive feature elimination with correlation bias reduction (SVM-RFE-CBR) algorithm to rank the features by importance, and then selected the optimal feature subset based on that ranking. We trained the model on the selected features using 10-fold cross-validation and tested its performance on an independent dataset. In the independent test, our method achieved a sensitivity of 68%, specificity of 94%, accuracy of 92%, area under the curve (AUC) of 84%, and Matthews correlation coefficient (MCC) of 0.48. On these common evaluation metrics, our method performs better than other existing classification methods in identifying SNARE proteins.
Keywords: SNARE proteins; SVM-RFE-CBR; machine learning; position-specific scoring matrix; support vector machine
Year: 2021 PMID: 34987554 PMCID: PMC8721734 DOI: 10.3389/fgene.2021.809001
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1. Flow chart of SNARE protein recognition based on PSSM profiles and SVM.
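The feature-extraction step of this workflow can be illustrated with a minimal sketch. It assumes that the 400-dimensional feature vector is obtained by summing, for each of the 20 residue types, the PSSM rows at the positions where that residue occurs (a common aggregation, not confirmed by the text here). The function name `pssm_to_400` and the toy matrix are hypothetical; the column order is the one PSI-BLAST uses in its ASCII PSSM output.

```python
# Column order used by PSI-BLAST ASCII PSSM files.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def pssm_to_400(sequence, pssm):
    """Collapse an L x 20 PSSM into a fixed-length 400-dimensional vector.

    Assumption: rows are summed per residue type (20 types x 20 columns).
    """
    if len(sequence) != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")
    index = {aa: k for k, aa in enumerate(AMINO_ACIDS)}
    block = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence, pssm):
        k = index.get(residue)
        if k is None:
            continue  # skip non-standard residues such as 'X'
        for j, score in enumerate(row):
            block[k][j] += score
    # Flatten row by row: features 0-19 belong to 'A', 20-39 to 'R', ...
    return [value for row in block for value in row]


# Toy input: a three-residue sequence with constant PSSM rows.
toy_sequence = "AAR"
toy_pssm = [[1] * 20, [2] * 20, [5] * 20]
features = pssm_to_400(toy_sequence, toy_pssm)
```

Because the output length does not depend on the sequence length, proteins of different sizes map to vectors that a single SVM can consume.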
Summary of SNARE protein and non-SNARE protein datasets.
| Dataset | SNARE | Non-SNARE | Total |
|---|---|---|---|
| Original dataset | 682 | 2,583 | 3,265 |
| Train dataset | 644 | 2,234 | 2,878 |
| Test dataset | 38 | 349 | 387 |
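The training set in the table above is imbalanced (644 SNARE versus 2,234 non-SNARE sequences). The sketch below shows one way to build 10 cross-validation folds that keep this class ratio roughly constant. Stratification is an assumption, since the text only states that 10-fold cross-validation was used, and the helper name `stratified_folds` is hypothetical.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds, distributing each class evenly."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin into the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Labels matching the training-set counts: 1 = SNARE, 0 = non-SNARE.
labels = [1] * 644 + [0] * 2234
folds = stratified_folds(labels, k=10)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
```

Each fold then serves once as the validation set while the remaining nine are used for training.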
FIGURE 2. Results of dimension reduction using the SVM-RFE-CBR algorithm.
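The ranking idea behind this dimension reduction can be sketched as plain SVM-RFE: train a linear SVM, drop the feature with the smallest squared weight, and repeat. This is a simplified illustration only. The correlation bias reduction (CBR) step is deliberately omitted, the linear SVM is a small subgradient-descent trainer written for this example, and the names `train_linear_svm`, `svm_rfe`, and the toy data are hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Batch subgradient descent on the L2-regularized hinge loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:  # sample violates the margin
                for j in range(d):
                    grad_w[j] += yi * xi[j]
                grad_b += yi
        w = [(1 - lr * lam) * wj + lr * grad_w[j] / n for j, wj in enumerate(w)]
        b += lr * grad_b / n
    return w, b


def svm_rfe(X, y):
    """Return feature indices ranked from most to least important."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while remaining:
        X_sub = [[row[j] for j in remaining] for row in X]
        w, _ = train_linear_svm(X_sub, y)
        # Ranking criterion of SVM-RFE: the squared weight of each feature.
        worst = min(range(len(remaining)), key=lambda k: w[k] ** 2)
        eliminated.append(remaining.pop(worst))
    return eliminated[::-1]  # last eliminated = most important


# Toy data: feature 0 separates the classes strongly, feature 2 weakly,
# and feature 1 is constant (uninformative).
X = [[2.0, 0.0, 0.5], [3.0, 0.0, 0.2], [-2.0, 0.0, -0.5], [-3.0, 0.0, -0.2]]
y = [1, 1, -1, -1]
ranking = svm_rfe(X, y)
```

Keeping the first 350 entries of such a ranking over 400 features would give a reduced feature set of the size compared in this section.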
Comparison of prediction results between SVM-RFE-CBR dimension reduction and original dimension.
| Feature-dimension | Sn | Sp | Acc | AUC | MCC | F-score |
|---|---|---|---|---|---|---|
| 350 | | | | | | |
| 400 | 0.68 | 0.94 | 0.91 | 0.83 | 0.48 | 0.5 |
Bold values indicate the maximum value in each column.
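The indicators in the table above (Sn, Sp, Acc, MCC, F-score) all derive from the four confusion-matrix counts. The sketch below uses the standard definitions; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the study.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute common binary-classification metrics from confusion counts."""
    sn = tp / (tp + fn)                      # sensitivity (recall)
    sp = tn / (tn + fp)                      # specificity
    acc = (tp + tn) / (tp + fn + tn + fp)    # accuracy
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f_score = (2 * precision * sn / (precision + sn)) if (precision + sn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc, "F-score": f_score}


# Illustrative counts only.
scores = binary_metrics(tp=6, fn=2, tn=9, fp=3)
```

On an imbalanced test set, accuracy is dominated by the majority class, which is why MCC and the F-score are reported alongside it.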
Performance comparison between SVM and other classification methods.
| Method | Sn | Sp | Acc | MCC |
|---|---|---|---|---|
| KNN | | 0.906 | 0.898 | |
| Random Forest | 0.620 | 0.962 | 0.900 | 0.70 |
| Naïve Bayes | 0.853 | 0.595 | 0.624 | 0.28 |
| SVM | 0.650 | | | 0.70 |
Bold values indicate the maximum value in each column.
FIGURE 3. ROC curves of the different classifiers.
FIGURE 4. (A) Performance comparison between our classification method and other classification methods on the training dataset. (B) Performance comparison between our classification method and other classification methods on the test dataset.