Dong Ma, Zhihua Chen, Zhanpeng He, Xueqin Huang.
Abstract
Machine learning is widely used to solve complex problems in engineering and science, and many machine learning-based methods have achieved good results across fields. SNAREs are key elements of membrane fusion and are required for the fusion process of stable intermediates. They are also associated with the development of some psychiatric disorders. This study processes the original sequence data with the synthetic minority oversampling technique (SMOTE) to address data imbalance and uses the iLearnPlus platform to select the most suitable machine learning model for identifying SNARE proteins. With the adaptive skip dipeptide composition (ASDC) descriptor for feature extraction and LightGBM with appropriately chosen parameters as the classifier, a sensitivity of 66.67%, specificity of 93.63%, accuracy of 91.33%, and MCC of 0.528 were obtained in the cross-validation dataset, and a sensitivity of 66.67%, specificity of 93.63%, accuracy of 91.33%, and MCC of 0.528 were obtained in the independent dataset. These results demonstrate that this combination performs well in classifying SNARE proteins and outperforms the other methods compared.
Keywords: ASDC features; SMOTE; SNARE protein identification; data imbalance; machine learning
Year: 2022 PMID: 35154261 PMCID: PMC8832978 DOI: 10.3389/fgene.2021.818841
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1. The research flow diagram of SNARE protein identification using a decision tree model.
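The oversampling step of this workflow can be illustrated with a minimal sketch of the SMOTE idea: each synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. The function name `smote`, its parameters, and the choice of a randomly drawn base sample are illustrative assumptions, not the configuration used in the study.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic samples interpolated between minority samples.

    Simplified SMOTE: a base sample is drawn at random, one of its k nearest
    minority neighbours is chosen, and a new point is placed on the segment
    joining them.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # relative position along base -> neighbour
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic


# Toy usage with three hypothetical 2-D feature vectors
minority_samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_samples = smote(minority_samples, n_new=4, k=2)
```

Because every synthetic point is an interpolation, the new samples stay inside the region already spanned by the minority class rather than duplicating existing sequences.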
Feature dimensions of selected feature extraction algorithms and their AUROC performance under multiple classifiers.
| Descriptor | Feature dimension | RandomForest | LightGBM | XGBoost |
|---|---|---|---|---|
| ASDC | 400 | | | |
| QSOrder | 44 | 0.8401 | 0.864 | 0.8604 |
| DDE | 400 | 0.824 | 0.8604 | 0.849 |
| CKSAAP | 1,600 | 0.8337 | 0.8664 | 0.8588 |
| AAC | 20 | 0.8467 | 0.8514 | 0.8428 |
Bold values indicate the feature extraction algorithm that performs best under a particular classification algorithm.
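The 400-dimensional ASDC descriptor listed above can be sketched as follows, assuming the usual definition in which each of the 400 ordered residue pairs is counted over all possible gaps and then divided by the total number of counted pairs. The function name `asdc` and the handling of non-standard residues (pairs containing them are skipped) are assumptions made for illustration. This is not the iLearnPlus implementation.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed order that defines the feature index
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def asdc(sequence):
    """Return a 400-dimensional adaptive skip dipeptide composition vector."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    seq = sequence.upper()
    for i in range(len(seq)):
        for j in range(i + 1, len(seq)):  # every gap from 0 to len(seq) - 2
            pair = seq[i] + seq[j]
            if pair in counts:  # skip pairs with non-standard residues
                counts[pair] += 1
    total = sum(counts.values())
    return [counts[d] / total if total else 0.0 for d in DIPEPTIDES]


features = asdc("ACA")  # counted pairs: AC, AA, CA
```

The double loop is quadratic in sequence length, which is acceptable for a sketch. Unlike a plain dipeptide count, the descriptor also captures pairs of residues that are far apart in the sequence.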
Model performance under different n values.
| n | Cross-validation Sens | Cross-validation Spec | Cross-validation Acc | Cross-validation MCC | Independent Sens | Independent Spec | Independent Acc | Independent MCC |
|---|---|---|---|---|---|---|---|---|
| 628 | 82.97 | 82.954 | 82.486 | 0.6522 | 94.2 | 60.84 | 63.69 | 0.3102 |
| 1,256 | 95.148 | 89.886 | 92.516 | 0.8523 | 73.91 | 76.56 | 76.33 | 0.3152 |
| 2,510 | 98.486 | 91.832 | 95.158 | 0.9055 | 76.81 | 86.34 | 85.5 | 0.4492 |
| 3,764 | 99.07 | 94.394 | 96.73 | 0.936 | 65.22 | 93.22 | 90.83 | 0.5071 |
| 5,019 | 99.302 | 94.682 | 96.99 | 0.941 | 62.32 | 94.04 | 91.33 | 0.5081 |
| 6,640 | 99.292 | 94.414 | 96.852 | 0.9384 | 59.42 | 94.99 | 91.95 | 0.5149 |
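The four metrics reported in these tables follow from the confusion matrix of a binary classifier. A minimal sketch, using the standard definitions of sensitivity, specificity, accuracy, and the Matthews correlation coefficient (MCC), is given here. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the study.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy, MCC) from confusion counts."""
    sens = tp / (tp + fn) if (tp + fn) else 0.0   # true positive rate
    spec = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    total = tp + fn + tn + fp
    acc = (tp + tn) / total if total else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sens, spec, acc, mcc


# Hypothetical imbalanced test set: 10 positives, 90 negatives
sens, spec, acc, mcc = binary_metrics(tp=8, fn=2, tn=85, fp=5)
print(f"Sens={sens:.2%} Spec={spec:.2%} Acc={acc:.2%} MCC={mcc:.4f}")
```

On an imbalanced test set, accuracy is dominated by the majority class, so MCC can remain moderate even when accuracy is high. This is consistent with the gap between the accuracy and MCC columns on the independent data.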
The performance of the three classifiers on the independent test set (n = 2,510).
| n = 2,510 | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC |
|---|---|---|---|---|
| RandomForest | 63.77 | | | 0.444 |
| LightGBM | | 86.31 | 85.5 | |
| XGBoost | 73.91 | 86.99 | 85.87 | 0.4412 |
Bold values indicate the feature extraction algorithm that performs best under a particular classification algorithm.
The performance of the three classifiers on the independent test set (n = 5,019).
| n = 5,019 | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC |
|---|---|---|---|---|
| RandomForest | 46.38 | 95.66 | 91.45 | 0.435 |
| LightGBM | | | | |
| XGBoost | | 94.58 | 91.7 | 0.5132 |
Bold values indicate the feature extraction algorithm that performs best under a particular classification algorithm.
FIGURE 2. The relationship between the number of leaves and model performance.
FIGURE 3. The relationship between the maximum tree depth (maxdepth) and model performance.
FIGURE 4. The relationship between learning rate and model performance.
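The three hyperparameters examined in Figures 2 to 4 correspond to the LightGBM parameters `num_leaves`, `max_depth`, and `learning_rate`. A minimal sketch of how a search grid over them might be built is given here. The candidate values are hypothetical, since the ranges explored in the study are not listed here, and the helper name `candidate_params` is an assumption.

```python
from itertools import product

# Hypothetical candidate values; not the ranges used in the study
NUM_LEAVES = [15, 31, 63]
MAX_DEPTH = [4, 6, 8]
LEARNING_RATE = [0.01, 0.05, 0.1]


def candidate_params():
    """Return parameter dicts, dropping combinations that cannot be realised."""
    grid = []
    for leaves, depth, rate in product(NUM_LEAVES, MAX_DEPTH, LEARNING_RATE):
        # A binary tree limited to depth d has at most 2**d leaves, so a
        # larger num_leaves would have no effect under that depth limit.
        if leaves <= 2 ** depth:
            grid.append(
                {"num_leaves": leaves, "max_depth": depth, "learning_rate": rate}
            )
    return grid


grid = candidate_params()
print(len(grid), "candidate settings")
```

Each dictionary could be passed as keyword arguments to a LightGBM classifier and scored by cross-validation. Filtering on `num_leaves <= 2 ** max_depth` avoids evaluating settings that differ only in name.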
Comparison with the experimental results of 2D CNN in the same setting.
| Classifier | Cross-validation Sens | Cross-validation Spec | Cross-validation Acc | Cross-validation MCC | Independent Sens | Independent Spec | Independent Acc | Independent MCC |
|---|---|---|---|---|---|---|---|---|
| 2D CNN | 76.6 | 93.5 | 89.7 | 0.7 | 65.8 | 90.3 | 87.9 | 0.46 |
| This method | 98.168 | 90.736 | 94.718 | 0.8974 | 81.58 | 94.84 | 93.54 | 0.6839 |