Shi-Shi Yuan1, Dong Gao1, Xue-Qin Xie1, Cai-Yi Ma1, Wei Su1, Zhao-Yue Zhang1,2, Yan Zheng3, Hui Ding1.
Abstract
Ion binding proteins (IBPs) can selectively and non-covalently interact with ions. IBPs in phages also play an important role in biological processes. Therefore, accurate identification of IBPs is necessary for understanding their biological functions and the molecular mechanisms that involve binding to ions. Since experimental methods in molecular biology are still labor-intensive and costly for identifying IBPs, it is helpful to develop computational methods that identify IBPs quickly and efficiently. In this work, a random forest (RF)-based model was constructed to identify IBPs. Based on protein sequence information and the physicochemical properties of residues, the dipeptide composition combined with the physicochemical correlation between two residues was proposed for feature extraction. A feature selection technique based on analysis of variance (ANOVA) was used to exclude redundant information. By comparison with other classification methods, we demonstrated that our method could identify IBPs accurately. Based on the model, a Python package named IBPred was built; its source code can be accessed at https://github.com/ShishiYuan/IBPred.
Keywords: Feature extraction; Ion binding proteins; Predictor; Random forest
Year: 2022 PMID: 36147670 PMCID: PMC9474292 DOI: 10.1016/j.csbj.2022.08.053
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155
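The dipeptide composition mentioned in the abstract can be illustrated with a small sketch. The function below counts residue pairs separated by a fixed gap and normalises the counts to frequencies. The function name, the example sequence, and the handling of non-standard residues are assumptions made for illustration; this is not the IBPred implementation and it omits the physicochemical correlation part.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def gapped_dipeptide_composition(sequence, gap=0):
    """Return the frequency of every residue pair separated by `gap` positions.

    Hypothetical helper for illustration only.
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = 0
    for i in range(len(sequence) - gap - 1):
        pair = sequence[i] + sequence[i + gap + 1]
        if pair in counts:  # assumption: skip pairs with non-standard residues
            counts[pair] += 1
            total += 1
    # Normalise to frequencies; very short input yields an all-zero vector.
    return {p: (c / total if total else 0.0) for p, c in counts.items()}


# Made-up toy sequence, not taken from the paper's datasets.
features_g0 = gapped_dipeptide_composition("MKVLAAGIMKV", gap=0)
features_g1 = gapped_dipeptide_composition("MKVLAAGIMKV", gap=1)
print(len(features_g0))   # 400 possible dipeptides
print(features_g0["MK"])  # "MK" occurs in 2 of the 10 adjacent pairs
```

Repeating the call for gaps 0 to 9 and concatenating the results would give one 400-dimensional block per gap, which is the general shape of k-spaced pair features.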
Fig. 1 The flow of the model building.
The search spaces of search methods and the number of attempts.
| Parameters | Grid Search | Bayesian Search |
|---|---|---|
| “criterion” | Gini, Entropy | Gini, Entropy |
| “max_depth” | 5, 40, 75, 110, 145 | 5, 6, …, 150 |
| “min_samples_split” | 2, 7, 12, 17, 22, 27 | 2, 3, …, 30 |
| “n_estimators” | 10, 25, 63, 158, 398, 1000 | 10^x, x ∈ |
| “min_samples_leaf” | 5 | 1, 2, …, 10 |
| “max_leaf_nodes” | 100 | 50, 51, …, 150 |
| “ccp_alpha” | 0.001 | 10^x, x ∈ [-10, 0] |
| # of attempts | 360 | 64 (Our setting) |
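As a quick check of the grid-search column, the sketch below enumerates every combination of the listed values and counts them. The dictionary keys follow the parameter names in the table; the lowercase criterion strings and the exponent spacing used for `n_estimators` are assumptions made for illustration.

```python
from itertools import product

# Grid-search values as listed in the table above.
# Lowercase criterion names are an assumption (the table prints "Gini, Entropy").
grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [5, 40, 75, 110, 145],
    "min_samples_split": [2, 7, 12, 17, 22, 27],
    "n_estimators": [10, 25, 63, 158, 398, 1000],
    "min_samples_leaf": [5],
    "max_leaf_nodes": [100],
    "ccp_alpha": [0.001],
}

# Cartesian product of all value lists: one dict per candidate setting.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combinations))  # 360

# Assumption: the n_estimators values look like 10**x rounded to an integer,
# with x stepping from 1 to 3 in increments of 0.4.
exponents = [1 + 0.4 * i for i in range(6)]
log_spaced = [round(10 ** x) for x in exponents]
print(log_spaced)
```

The count of 360 matches the number of grid-search attempts in the last table row (2 × 5 × 6 × 6, with the remaining three parameters fixed).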
Fig. 2 The IFS curves of different search methods and feature extraction strategies in 10-fold cross-validation on the training dataset. The values in brackets are the best results on each IFS curve, i.e. the points with the highest average AUC. (A) Grid search. (B) Bayesian search.
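An IFS curve of this kind is commonly produced by adding ranked features one at a time and scoring each prefix. The sketch below shows that loop in generic form. The function name, the toy feature names, and the toy scoring function are hypothetical; in practice `evaluate` would be something like a cross-validated AUC, and nothing here reproduces the paper's numbers.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Grow the feature set one ranked feature at a time; keep the best prefix.

    ranked_features: feature names sorted from most to least important.
    evaluate: callable taking a list of features and returning a score
              (higher is better).
    """
    curve = []  # (number of features, score) points of the IFS curve
    best_score, best_size = float("-inf"), 0
    for size in range(1, len(ranked_features) + 1):
        score = evaluate(ranked_features[:size])
        curve.append((size, score))
        if score > best_score:
            best_score, best_size = score, size
    return ranked_features[:best_size], best_score, curve


# Toy stand-in for a cross-validated AUC: made-up gains per feature and a
# small penalty per added feature.
toy_gain = {"f1": 0.30, "f2": 0.20, "f3": 0.01, "f4": 0.00}


def toy_score(subset):
    return 0.5 + sum(toy_gain[f] for f in subset) - 0.02 * len(subset)


subset, score, curve = incremental_feature_selection(
    ["f1", "f2", "f3", "f4"], toy_score
)
print(subset)  # ['f1', 'f2']
```

The peak of the curve marks the feature subset that would be kept, which is what the bracketed values in the figure report.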
The performance comparison of models on the training dataset and testing dataset using different search methods and feature extraction strategies.
| Search Method | Features | |||||||
|---|---|---|---|---|---|---|---|---|
| Grid | PseCKSAAP (122D) | 0.904 ± 0.042 | 0.757 | 70.77 | 0.430 | 78.26 | 66.67 | 72.46 |
| | PseCKSAAP + DDE (146D) | 0.891 ± 0.048 | 0.808 | 75.38 | 0.517 | 82.61 | 71.43 | 77.02 |
| | PseCKSAAP + DDE + CTD (173D) | 0.871 ± 0.027 | 0.836 | | | 73.91 | 85.71 | 79.81 |
| | PseCKSAAP + DDE + CTD + QSOrder (148D) | 0.871 ± 0.044 | 0.804 | 78.46 | 0.515 | 60.87 | | 74.48 |
| Bayesian | PseCKSAAP (112D) | | 0.751 | 73.85 | 0.558 | | 61.90 | 78.78 |
| | PseCKSAAP + DDE (193D) | 0.911 ± 0.029 | | 76.92 | 0.578 | 91.30 | 69.05 | |
| | PseCKSAAP + DDE + CTD (242D) | 0.895 ± 0.042 | 0.774 | 67.69 | 0.480 | | 52.38 | 74.02 |
| | PseCKSAAP + DDE + CTD + QSOrder (239D) | 0.899 ± 0.038 | 0.805 | 75.38 | 0.486 | 73.91 | 76.19 | 75.05 |
Note: AUC values expressed as mean ± standard deviation are the results on the training dataset. The values highlighted in bold denote the best performance value for each metric across search methods and feature extraction strategies.
The performance comparison of different algorithms on the training dataset and testing dataset using Bayesian search and PseCKSAAP + DDE for feature extraction.
| Algorithm | |||||||
|---|---|---|---|---|---|---|---|
| SVM (229D) | | 0.769 | 69.23 | 0.500 | 95.65 | 54.76 | 75.21 |
| DT (11D) | 0.775 ± 0.059 | 0.698 | 60.00 | 0.386 | 95.65 | 40.84 | 68.06 |
| NB (243D) | 0.950 ± 0.033 | 0.693 | 63.08 | 0.458 | | 42.86 | 71.43 |
| RF (193D) | 0.911 ± 0.029 | | | | 91.30 | | |
| AB (254D) | 0.926 ± 0.041 | 0.671 | 60.00 | 0.386 | 95.65 | 40.48 | 68.06 |
Note: AUC values expressed as mean ± standard deviation are the results on the training dataset. The values highlighted in bold denote the best performance value for each metric across search methods and feature extraction strategies.
Fig. 3 The basic statistical information about the features of the optimal model. PseCKSAAP(Pse) denotes the features extracted from the physicochemical properties in PseCKSAAP, and PseCKSAAP(CKSAAP) denotes the features extracted from the CKSAAP part. (A) The sum of F-scores and the total number of features. (B) The F-scores of features; colored points indicate features that were used in the optimal model. Each box stretches from the first quartile (Q1) to the third quartile (Q3) of the F-scores, with a line at the median. The whiskers extend from the box by 1.5 times the inter-quartile range (IQR), which equals Q3 - Q1. The grey dots represent the features that were not involved in the optimal model construction (that is, were not contained in the optimal feature subset). (C) The counts of selected features and their sum of F-scores for different gaps. “G0” to “G9” denote the 0–9 gap dipeptides, respectively. (D) The ranked features and their F-scores, together with the cumulative sum of F-scores over the features used in the optimal model, in rank order.
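The F-scores in this figure come from ANOVA-based feature ranking. A minimal sketch of the textbook one-way ANOVA F statistic for a single feature is given below, followed by a ranking of two toy features. The function name and the toy values are assumptions for illustration; this is not necessarily the exact computation used in IBPred.

```python
def anova_f_score(groups):
    """One-way ANOVA F statistic for one feature.

    groups: list of lists, one list of feature values per class.
    Assumes at least two groups and non-zero within-group variance.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the class means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the samples around their own class mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Toy feature values per class (made up): [positive samples, negative samples].
toy_features = {
    "feature_a": [[1, 2, 3], [2, 3, 4]],
    "feature_b": [[1, 2, 3], [7, 8, 9]],
}
scores = {name: anova_f_score(groups) for name, groups in toy_features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores)  # {'feature_a': 1.5, 'feature_b': 54.0}
print(ranked)  # ['feature_b', 'feature_a']
```

A larger F-score means the class means are further apart relative to the spread within each class, so the feature separates the two classes better.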
Significant dipeptide features (p ≤ 0.001) in MAX_CKSAAP and DDE of the training dataset.
| MAX_CKSAAP | Intersection | DDE |
|---|---|---|
| NI, ND, DI, VF, QA, | EC, II, YD | RL, VR, SR, ID, |
The performance of different models on the extra dataset.
| Search Method | Model | ||||||
|---|---|---|---|---|---|---|---|
| Grid | RF-P (122D) | 0.548 | 52.46 | 0.033 | 99.35 | 1.29 | 50.32 |
| | RF-PD (146D) | 0.617 | 59.15 | 0.179 | 74.56 | 42.33 | 58.45 |
| | RF-PDC (173D) | 0.543 | 52.32 | 0.016 | 97.22 | 3.33 | 50.28 |
| | RF-PDCQ (148D) | 0.448 | 46.88 | −0.072 | 56.00 | 36.94 | 46.47 |
| Bayesian | RF-P (112D) | 0.576 | 52.33 | 0.028 | 99.82 | 0.51 | 50.16 |
| | RF-PD (193D) | 0.604 | 57.74 | 0.149 | 69.58 | 44.82 | 57.20 |
| | RF-PDC (242D) | 0.499 | 52.18 | 0.000 | | 0.00 | 50.00 |
| | RF-PDCQ (239D) | 0.488 | 51.67 | −0.017 | 95.77 | 3.56 | 49.66 |
| Bayesian | SVM-PD (229D) | | | | 75.21 | | |
| | DT-PD (11D) | 0.542 | 53.71 | 0.068 | 93.21 | 10.62 | 51.91 |
| | NB-PD (243D) | 0.597 | 58.89 | 0.180 | 82.63 | 32.99 | 57.81 |
| | ABC-PD (254D) | 0.624 | 58.13 | 0.179 | 89.48 | 23.93 | 56.70 |
Note: “P” is PseCKSAAP, “PD” is PseCKSAAP + DDE, “PDC” is PseCKSAAP + DDE + CTD, “PDCQ” is PseCKSAAP + DDE + CTD + QSOrder. The values highlighted in bold denote the best performance value for each metric across models.