Ying Zeng, Yuan Chen, Zheming Yuan.
Abstract
BACKGROUND: Lysine succinylation is a protein post-translational modification that is widely involved in cell differentiation, cell metabolism and other important physiological activities. To study the molecular mechanism of succinylation in depth, succinylation sites need to be accurately identified, and because experimental approaches are costly and time-consuming, there is a great demand for reliable computational methods. Feature extraction is a key step in building succinylation site prediction models, and the development of effective new features improves predictive accuracy. Because the number of false succinylation sites far exceeds that of true sites, traditional classifiers perform poorly, and designing a classifier that effectively handles highly imbalanced datasets has always been a challenge.
Keywords: Chi-square statistical difference table; ChiDT; Feature selection; Imbalanced dataset; Succinylation site
Year: 2022 PMID: 35144656 PMCID: PMC8832670 DOI: 10.1186/s13040-022-00290-1
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Fig. 1 Flow chart of iSuc-ChiDT
Fig. 2 Illustration of compression procedure (position −10 in Tr_data)
Fig. 3 Chi-square values for different positions in Tr_data
Frequency distribution of amino acids at the i-th position
| Sample | Residue 1 | Residue 2 | … | … | Residue 20 | Total |
|---|---|---|---|---|---|---|
| Positive | | | … | … | | |
| Negative | | | … | … | | |
| Total | | | … | … | | |
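A positive/negative by residue frequency table like the one above is the usual input to a Pearson chi-square test of independence, which is one plausible way to obtain per-position values such as those plotted in Fig. 3. The sketch below is a minimal illustration under that assumption; the function name `chi_square_2xk` and the toy counts are hypothetical and are not taken from the paper.

```python
import numpy as np

def chi_square_2xk(pos_counts, neg_counts):
    """Pearson chi-square statistic for a 2 x k contingency table.

    pos_counts / neg_counts: residue counts at one position in the
    positive and negative samples (k = 20 for the amino acids).
    """
    observed = np.array([pos_counts, neg_counts], dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
    col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, k)
    expected = row_totals @ col_totals / observed.sum()
    mask = expected > 0                                # skip residues never seen
    diff_sq = (observed - expected) ** 2
    return float((diff_sq[mask] / expected[mask]).sum())

# Toy counts (made up for illustration): residue 1 is enriched in positives.
toy_pos = [30] + [10] * 19
toy_neg = [100] * 20
print(chi_square_2xk(toy_pos, toy_neg))
```

A larger statistic indicates a position whose residue distribution differs more strongly between positive and negative samples.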
20 × 9 chi-square statistical difference table
| Amino acid residue | Position 1 | … | … | Position 9 |
|---|---|---|---|---|
| 1 | | … | … | |
| … | … | … | … | … |
| | | … | … | |
| … | … | … | … | … |
| 20 | | … | … | |
Imbalanced decision table
| Sample | Rule* | … | Rule* | Total |
|---|---|---|---|---|
| Positive | 23 | … | 32 | 4748 |
| Negative | 2907 | … | 83 | 50,551 |
*For instance, “(P−1 = −2.028) ∧ (−0.907 ≤ P2 ≤ 0.501) ∧ (−0.715 ≤ P−8 ≤ 0.066)” denotes the rule in which P−1 takes the value −2.028, P2 ranges from −0.907 to 0.501, and P−8 ranges from −0.715 to 0.066, where “∧” denotes logical conjunction
Balanced decision table
| Sample | Rule | … | Rule | Total |
|---|---|---|---|---|
| Positive | 23 | … | 32 | 4748 |
| Negative | 273.04 | … | 7.80 | 4748 |
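The negative row of the balanced table is numerically consistent with rescaling each negative count by the ratio of class totals, 4748 / 50,551, so that both classes sum to 4748. This reading is inferred from the numbers shown in the two decision tables rather than from an explicit statement; the helper name `balance_negatives` is hypothetical.

```python
def balance_negatives(neg_counts, n_pos_total, n_neg_total):
    """Rescale per-rule negative counts so the negative total equals the positive total."""
    scale = n_pos_total / n_neg_total
    return [round(count * scale, 2) for count in neg_counts]

# Negative counts for the two rules visible in the imbalanced table.
balanced = balance_negatives([2907, 83], n_pos_total=4748, n_neg_total=50551)
print(balanced)  # [273.04, 7.8]
```

Positive counts are left unchanged, so a rule's positive-to-negative ratio after balancing reflects enrichment rather than raw class frequency.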
Various evaluation indexes on different ratios of positives to negatives
| Positives/Negatives* | SN (%) | SP (%) | MCC | Q9 |
|---|---|---|---|---|
| 100/100 | 93.00 | 95.00 | 0.880 | 0.939 |
| 100/1000 | 93.00 | 95.00 | 0.752 | 0.939 |
| 100/10000 | 93.00 | 95.00 | 0.371 | 0.939 |
*positive testing sample size/negative testing sample size
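The dependence of MCC on class ratio in the table above can be checked directly from the standard MCC definition: with SN and SP held fixed, only the confusion-matrix counts change as negatives are added. The sketch below is a minimal illustration; the function name `mcc_from_rates` is hypothetical.

```python
import math

def mcc_from_rates(sn, sp, n_pos, n_neg):
    """Matthews correlation coefficient from sensitivity, specificity and class sizes."""
    tp = sn * n_pos          # true positives
    fn = n_pos - tp          # false negatives
    tn = sp * n_neg          # true negatives
    fp = n_neg - tn          # false positives
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Same SN = 93% and SP = 95%, increasing numbers of negatives.
for n_neg in (100, 1000, 10000):
    print(f"100/{n_neg}: MCC = {mcc_from_rates(0.93, 0.95, 100, n_neg):.3f}")
```

Run with the three ratios in the table, this reproduces the MCC column (0.880, 0.752, 0.371) to three decimals, which illustrates why MCC alone is hard to compare across test sets with different imbalance.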
Fig. 4 Chi-MIC-share scores after introduction of each feature. The red line represents the forced termination of feature introduction
Features retained by Chi-MIC-share
| Retained features | Type | Retained features | Type |
|---|---|---|---|
| 1 | position | 6 | position |
| 2 | position | 7 | AAC |
| 3 | position | 8 | AAC |
| 4 | position | 9 | position |
| 5 | position | 10 | position |
Independent test accuracy for different classifiers based on the same input features
| Classifier | SN (%) | SP (%) | MCC | Q9 |
|---|---|---|---|---|
| RF | 2.75 | 99.83 | 0.115 | 0.312 |
| ANN | 0 | 99.90 | −0.009 | 0.293 |
| RVKDE | 9.84 | 97.25 | 0.106 | 0.362 |
5-fold cross-validation accuracy for different encoding schemes based on the ChiDT classifier
| Encoding scheme | Feature dimension | SN (%) | SP (%) | MCC | Q9 |
|---|---|---|---|---|---|
| Binary | 180 | 63.20 | 62.41 | 0.258 | 0.623 |
| Physicochemical properties (531) | 4779 | 58.86 | 60.39 | 0.188 | 0.593 |
| Physicochemical properties (10) | 90 | 59.77 | 62.59 | 0.225 | 0.607 |
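The feature dimensions in the table above are consistent with nine sequence positions per sample: 9 × 20 = 180 for binary encoding, 9 × 531 = 4779 and 9 × 10 = 90 for the two physicochemical schemes. As a minimal sketch of the binary case, the code below one-hot encodes a nine-residue segment; the residue ordering in `AMINO_ACIDS`, the all-zero handling of unknown symbols, and the function name `binary_encode` are assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 standard residues

def binary_encode(segment):
    """One-hot encode a peptide segment: 20 binary features per position."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    features = []
    for aa in segment:
        one_hot = [0] * len(AMINO_ACIDS)
        if aa in index:              # unknown or padding symbols stay all-zero (assumption)
            one_hot[index[aa]] = 1
        features.extend(one_hot)
    return features

vector = binary_encode("LAYVTKAGK")   # nine residues
print(len(vector))                    # 180
```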
Independent test accuracy based on different window sizes
| Window size | SN (%) | SP (%) | MCC | Q9 |
|---|---|---|---|---|
| 51 (−25 ~ +25) | 68.50 | 61.10 | 0.162 | 0.646 |
| 31 (−15 ~ +15) | 64.57 | 66.48 | 0.174 | 0.655 |
| 11 (−5 ~ +5) | 62.20 | 65.03 | 0.152 | 0.636 |
Independent test accuracy with or without Chi-MIC-share
| Feature selection | Feature dimension | SN (%) | SP (%) | MCC | Q9 | Time (mm:ss) |
|---|---|---|---|---|---|---|
| No feature selection | 239 | 70.08 | 62.95 | 0.182 | 0.663 | 06:14 |
| Chi-MIC-share | 10 | 70.47 | 66.27 | 0.205 | 0.683 | 00:17 |
Independent test accuracy for different methods
| Method | SN (%) | SP (%) | MCC | Q9 |
|---|---|---|---|---|
| SucPred | 27.20 | 67.30 | −0.030 | 0.436 |
| iSuc-PseAAC | 12.20 | 88.70 | 0.013 | 0.374 |
| SuccFind | 25.20 | 79.20 | 0.029 | 0.451 |
| SuccinSite | 37.10 | 88.20 | 0.199 | 0.548 |
| iSuc-PseOpt | 30.30 | 75.80 | 0.038 | 0.478 |
| pSuc-Lys | 22.40 | 82.60 | 0.036 | 0.436 |
| Success | 14.20 | 86.80 | 0.007 | 0.386 |
| PSuccE | 37.50 | 88.60 | 0.204 | 0.551 |
| Position: | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Original sequence segment: | L | A | Y | V | T | K | A | G | K | |||
| Mutated sequence segment: | L | A | Y | V | T | K | A | G |