Rulan Wang, Zhuo Wang, Hongfei Wang, Yuxuan Pang, Tzong-Yi Lee.
Abstract
Lysine crotonylation (Kcr) is a type of protein post-translational modification (PTM) that plays important roles in a variety of cellular regulatory processes. Several methods have been proposed for identifying crotonylation sites. However, most of them predict well on either histone or non-histone proteins, but not both. This work therefore aims for a more balanced performance across species, covering plant (non-histone) and mammalian (histone) proteins. Support vector machine (SVM) and random forest (RF) classifiers were employed. In cross-validation, the RF classifier based on the EGAAC attribute achieved the best predictive performance, which is competitive with existing methods while being more robust on imbalanced datasets. In addition, an independent test compared this study with existing methods that use the same features or the same classifier. The SVM and RF classifiers achieved their best performances of 92% sensitivity, 88% specificity, 90% accuracy, and an MCC of 0.80 on the relatively small mammalian dataset, and 77% sensitivity, 83% specificity, 70% accuracy, and an MCC of 0.54 on the large-scale plant dataset. Finally, a cross-species independent test was carried out, which indicated species-specific differences between plant and mammalian crotonylation sites.
Year: 2020 PMID: 33235255 PMCID: PMC7686339 DOI: 10.1038/s41598-020-77173-0
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Position-specific amino acid composition analysis of crotonylated and non-crotonylated sequences in the plant dataset. (a) Position-specific amino acid composition of crotonylated sequences in the plant dataset, shown as a WebLogo frequency plot. (b) Comparison of position-specific amino acid composition between crotonylated sequences in the plant dataset (upper part) and crotonylated sequences in the mammalian dataset (lower part), based on TwoSampleLogo. (c) Position-specific amino acid composition of crotonylated sequences in the mammalian dataset, shown as a WebLogo frequency plot. (d) and (e) Statistics of the amino acid composition (AAC) in the plant and mammalian datasets, respectively. Panel (d) shows large differences in the composition of K, E and S in the plant dataset, and panel (e) shows large differences in the composition of K, L, A, D, E and N in the mammalian dataset.
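The AAC statistics in panels (d) and (e) are per-residue frequencies. A minimal sketch of how such a 20-dimensional composition vector could be computed, assuming peptides are given as one-letter strings; the function name `aac` and the example peptide are hypothetical and not taken from the paper:

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order that defines the feature order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(peptide):
    """Return the 20-dimensional amino acid composition of a peptide.

    Characters outside the 20 standard residues (for example padding
    symbols) are ignored; this handling is an assumption of this sketch.
    """
    residues = [r for r in peptide.upper() if r in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    counts = Counter(residues)
    return [counts[a] / len(residues) for a in AMINO_ACIDS]


# Hypothetical 9-residue peptide, used only to exercise the function.
features = aac("MKKLAEESK")
k_fraction = features[AMINO_ACIDS.index("K")]  # 3 of 9 residues are K
```

Averaging these vectors separately over crotonylated and non-crotonylated peptides would give per-residue statistics of the kind compared in panels (d) and (e).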
Performance on plant dataset.
| Features | Dimension | Dataset | Classifier | Sn | Sp | Acc | MCC |
|---|---|---|---|---|---|---|---|
| AAC | 20 | Plant | libsvm | 0.69 | 0.61 | 0.65 | 0.31 |
| AAPC | 400 | Plant | libsvm | 0.64 | 0.68 | 0.66 | 0.33 |
| BE | 620 | Plant | libsvm | 0.69 | 0.63 | 0.66 | 0.33 |
| CKSAAP | 1600 | Plant | libsvm | 0.65 | 0.67 | 0.66 | 0.32 |
| EAAC | 540 | Plant | libsvm | 0.68 | 0.72 | 0.71 | 0.40 |
| EGAAC | 135 | Plant | libsvm | 0.74 | 0.66 | 0.70 | 0.40 |
| PSSM | 620 | Plant | libsvm | 0.71 | 0.48 | 0.60 | 0.20 |
| Incorporated | 3935 | Plant | libsvm | 0.74 | 0.72 | 0.73 | 0.41 |
| AAC | 20 | Plant | RF | 0.68 | 0.60 | 0.64 | 0.28 |
| AAPC | 400 | Plant | RF | 0.58 | 0.69 | 0.64 | 0.28 |
| BE | 620 | Plant | RF | 0.73 | 0.63 | 0.68 | 0.36 |
| CKSAAP | 1600 | Plant | RF | 0.60 | 0.68 | 0.64 | 0.28 |
| EAAC | 540 | Plant | RF | 0.83 | 0.70 | 0.77 | 0.54 |
| EGAAC | 135 | Plant | RF | 0.82 | 0.69 | 0.75 | 0.51 |
| PSSM | 620 | Plant | RF | 0.70 | 0.60 | 0.65 | 0.31 |
| Incorporated | 3935 | Plant | RF | 0.85 | 0.73 | 0.79 | 0.57 |
"Incorporated" stands for the combination of all single features, that is, AAC + AAPC + BE + CKSAAP + EAAC + EGAAC + PSSM.
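The Sn, Sp, Acc and MCC columns follow the standard definitions for binary classification. A minimal sketch of those formulas, assuming the confusion-matrix counts are available; the counts in the example are hypothetical and were not taken from the paper:

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute sensitivity, specificity, accuracy and MCC from counts.

    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    sn = tp / (tp + fn)                      # sensitivity (recall on positives)
    sp = tn / (tn + fp)                      # specificity (recall on negatives)
    acc = (tp + tn) / (tp + fn + tn + fp)    # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# Hypothetical balanced test set: 100 positives and 100 negatives.
m = binary_metrics(tp=85, fn=15, tn=73, fp=27)
```

MCC ranges from -1 to 1, with 0 corresponding to chance-level prediction, which is why it is a useful complement to accuracy when the classes are imbalanced.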
Performance on mammalian dataset.
| Features | Dimension | Dataset | Classifier | Sn | Sp | Acc | MCC |
|---|---|---|---|---|---|---|---|
| AAC | 20 | Mammalian | libsvm | 0.90 | 0.87 | 0.88 | 0.76 |
| AAPC | 400 | Mammalian | libsvm | 0.98 | 0.76 | 0.87 | 0.75 |
| BE | 620 | Mammalian | libsvm | 0.83 | 0.93 | 0.88 | 0.76 |
| CKSAAP | 1600 | Mammalian | libsvm | 0.92 | 0.81 | 0.86 | 0.73 |
| EAAC | 540 | Mammalian | libsvm | 0.90 | 0.89 | 0.89 | 0.78 |
| EGAAC | 135 | Mammalian | libsvm | 0.90 | 0.93 | 0.91 | 0.83 |
| PSSM | 620 | Mammalian | libsvm | 0.98 | 0.91 | 0.94 | 0.85 |
| Incorporated | 3935 | Mammalian | libsvm | 1.0 | 0.85 | 0.92 | 0.86 |
| AAC | 20 | Mammalian | RF | 0.93 | 0.87 | 0.89 | 0.79 |
| AAPC | 400 | Mammalian | RF | 0.89 | 0.76 | 0.82 | 0.65 |
| BE | 620 | Mammalian | RF | 0.93 | 0.87 | 0.89 | 0.79 |
| CKSAAP | 1600 | Mammalian | RF | 0.93 | 0.81 | 0.86 | 0.73 |
| EAAC | 540 | Mammalian | RF | 0.93 | 0.87 | 0.89 | 0.79 |
| EGAAC | 135 | Mammalian | RF | 0.92 | 0.88 | 0.90 | 0.80 |
| PSSM | 620 | Mammalian | RF | 0.94 | 0.91 | 0.92 | 0.83 |
| Incorporated | 3935 | Mammalian | RF | 0.90 | 0.82 | 0.86 | 0.79 |
"Incorporated" stands for the combination of all single features, that is, AAC + AAPC + BE + CKSAAP + EAAC + EGAAC + PSSM.
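The 135 dimensions of EGAAC are consistent with a sliding window of 5 residues over a 31-residue peptide (27 windows × 5 residue groups). A minimal sketch under those assumptions; the window size, peptide length and the five-group partition below follow the common EGAAC convention and are assumptions here, since this summary does not state the parameters the paper used:

```python
# Five physicochemical groups commonly used for EGAAC (assumed, not from the source).
GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive": "KRH",
    "negative": "DE",
    "uncharged": "STCPNQ",
}


def egaac(peptide, window=5):
    """Enhanced grouped amino acid composition.

    For every sliding window, emit the fraction of residues that fall
    into each group, in the insertion order of GROUPS.
    """
    features = []
    for start in range(len(peptide) - window + 1):
        segment = peptide[start:start + window]
        for members in GROUPS.values():
            in_group = sum(segment.count(a) for a in members)
            features.append(in_group / window)
    return features


# Hypothetical 31-residue peptide with the candidate lysine at the centre.
peptide = "A" * 15 + "K" + "A" * 15
vec = egaac(peptide)
```

With the same 31-residue assumption, the other window-based dimensions in the table also work out: EAAC gives 27 × 20 = 540 and BE gives 31 × 20 = 620.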
Performance comparison between our method and existing available crotonylation site prediction tools (pKcr).
| Features | Dataset | Tool | Accuracy | Sensitivity | Specificity | MCC | AUC |
|---|---|---|---|---|---|---|---|
| AAC | Non-histone | pKcr | 0.83 | 0.21 | 0.90 | 0.10 | 0.67 |
| AAC | Non-histone | This method | 0.64 | 0.60 | 0.68 | 0.28 | 0.68 |
| CKSAAP | Non-histone | pKcr | 0.83 | 0.22 | 0.90 | 0.11 | 0.68 |
| CKSAAP | Non-histone | This method | 0.64 | 0.60 | 0.68 | 0.28 | 0.71 |
| BE | Non-histone | pKcr | 0.84 | 0.33 | 0.90 | 0.19 | 0.74 |
| BE | Non-histone | This method | 0.68 | 0.73 | 0.63 | 0.36 | 0.77 |
| EAAC | Non-histone | pKcr | 0.85 | 0.42 | 0.90 | 0.27 | 0.81 |
| EAAC | Non-histone | This method | 0.77 | 0.83 | 0.70 | 0.54 | 0.84 |
| EGAAC | Non-histone | pKcr | 0.85 | 0.42 | 0.90 | 0.25 | 0.81 |
| EGAAC | Non-histone | This method | 0.77 | 0.83 | 0.70 | 0.51 | 0.82 |
The above comparison indicates that our method is more robust and gives a more balanced performance than pKcr.
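One way to make "balanced" concrete is to look at the gap between sensitivity and specificity and at their mean (balanced accuracy). Neither quantity is reported in the source; the sketch below simply applies them to four rows of the table above as an illustration:

```python
# (Sensitivity, Specificity) pairs copied from the AAC and EAAC rows of the table.
rows = {
    ("AAC", "pKcr"): (0.21, 0.90),
    ("AAC", "This method"): (0.60, 0.68),
    ("EAAC", "pKcr"): (0.42, 0.90),
    ("EAAC", "This method"): (0.83, 0.70),
}


def balanced_accuracy(sn, sp):
    """Mean of sensitivity and specificity; insensitive to class ratio."""
    return (sn + sp) / 2


def sn_sp_gap(sn, sp):
    """Absolute difference between sensitivity and specificity."""
    return abs(sn - sp)


summary = {
    key: (round(balanced_accuracy(sn, sp), 3), round(sn_sp_gap(sn, sp), 2))
    for key, (sn, sp) in rows.items()
}
```

A small gap together with a higher balanced accuracy is what the table's lower-accuracy but higher-MCC rows reflect.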
Performance comparison between our method and other two existing tools (CKSAAP_CrotSite and iKcr-PseEns).
| Dataset | Method | Classifier | Feature | Sn | Sp | Acc | MCC | AUC |
|---|---|---|---|---|---|---|---|---|
| Mammalian | CKSAAP_CrotSite | libsvm | CKSAAP | 0.92 | 0.99 | 0.98 | 0.92 | 0.99 |
| Mammalian | This paper | libsvm | CKSAAP | 0.92 | 0.81 | 0.86 | 0.73 | 0.94 |
| Mammalian | iKcr-PseEns | Ensemble Random Forest | PseAAC | 0.90 | 0.95 | 0.94 | 0.81 | 0.97 |
| Mammalian | This paper | Random Forest | PseAAC | 0.93 | 0.87 | 0.89 | 0.79 | 0.95 |
The above comparison indicates that our method performs competitively with these two published tools.
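The AUC column summarises ranking quality independently of any decision threshold. A minimal sketch of the rank-based (Mann-Whitney) formulation, assuming each test peptide has a 0/1 label and a classifier score; the labels and scores below are hypothetical:

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive is scored
    above a randomly chosen negative, counting ties as one half.
    Quadratic in the number of samples, which is fine for a sketch.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores for six peptides.
auc = roc_auc([1, 1, 0, 0, 1, 0], [0.9, 0.8, 0.7, 0.3, 0.6, 0.6])
```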
Comparison of performance before and after feature selection method in the incorporated feature.
| Selection method | Number of features | Classifier | Sn | Sp | Acc | MCC | AUC |
|---|---|---|---|---|---|---|---|
| Original | 3935 | svm | 0.74 | 0.71 | 0.73 | 0.43 | 0.78 |
| Chi-square | 100 | svm | 0.77 | 0.70 | 0.74 | 0.47 | 0.81 |
| LGBM | 100 | svm | 0.77 | 0.75 | 0.76 | 0.45 | 0.83 |
| MRMD | 100 | svm | 0.75 | 0.73 | 0.74 | 0.45 | 0.84 |
| Original | 3935 | RF | 0.83 | 0.65 | 0.74 | 0.49 | 0.82 |
| Chi-square | 100 | RF | 0.84 | 0.70 | 0.77 | 0.55 | 0.84 |
| LGBM | 100 | RF | 0.85 | 0.72 | 0.78 | 0.54 | 0.84 |
| MRMD | 100 | RF | 0.83 | 0.70 | 0.76 | 0.55 | 0.83 |
Here, ‘Original’ corresponds to the incorporated feature set (AAC + AAPC + BE + CKSAAP + EAAC + EGAAC + PSSM) of 3935 dimensions. ‘Chi-square’, ‘LGBM’ and ‘MRMD’ correspond to the top-100 features selected by the Chi-square, LGBM and MRMD feature selection methods, respectively.
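As an illustration of the Chi-square row, the sketch below ranks features by a chi-square statistic and keeps the top k. It assumes non-negative feature values (signed features such as raw PSSM scores would need rescaling first) and is not the authors' implementation; the function names and the toy data are hypothetical:

```python
def chi2_scores(X, y):
    """Chi-square statistic of each non-negative feature against the labels.

    Observed value: sum of the feature within a class.
    Expected value: total of the feature times the class frequency.
    """
    n_samples, n_features = len(X), len(X[0])
    classes = sorted(set(y))
    scores = []
    for j in range(n_features):
        total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = total * y.count(c) / n_samples
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def select_top_k(X, y, k):
    """Return the indices of the k best-scoring features and the reduced matrix."""
    scores = chi2_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    keep = sorted(ranked[:k])
    return keep, [[row[j] for j in keep] for row in X]


# Toy data: 4 samples, 3 features; only feature 0 separates the classes.
X = [[1, 0, 3], [1, 1, 2], [0, 1, 3], [0, 0, 2]]
y = [1, 1, 0, 0]
keep, X_reduced = select_top_k(X, y, k=1)
```

In the setting of the table, `k` would be 100 and `X` would hold the 3935-dimensional incorporated features.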
Performance of cross-species evaluation.
| Training set | Feature | Validation set | Classifier | Sn | Sp | Acc | MCC | AUC |
|---|---|---|---|---|---|---|---|---|
| Plant | AAC | Mammalian | SVM | 0.48 | 0.51 | 0.49 | −0.02 | 0.54 |
| Plant | AAPC | Mammalian | SVM | 0.11 | 0.57 | 0.34 | −0.36 | 0.30 |
| Plant | BE | Mammalian | SVM | 0.44 | 0.36 | 0.40 | −0.20 | 0.55 |
| Plant | CKSAAP | Mammalian | SVM | 0.26 | 0.54 | 0.40 | −0.20 | 0.39 |
| Plant | EAAC | Mammalian | SVM | 0.48 | 0.70 | 0.59 | 0.19 | 0.45 |
| Plant | EGAAC | Mammalian | SVM | 0.28 | 0.74 | 0.51 | 0.02 | 0.45 |
| Plant | PSSM | Mammalian | SVM | 0.15 | 0.55 | 0.35 | −0.33 | 0.35 |
| Plant | AAC | Mammalian | RF | 0.16 | 0.65 | 0.41 | −0.21 | 0.45 |
| Plant | AAPC | Mammalian | RF | 0.19 | 0.66 | 0.43 | −0.16 | 0.45 |
| Plant | BE | Mammalian | RF | 0.40 | 0.625 | 0.51 | 0.02 | 0.53 |
| Plant | CKSAAP | Mammalian | RF | 0.48 | 0.60 | 0.54 | 0.08 | 0.59 |
| Plant | EAAC | Mammalian | RF | 0.42 | 0.70 | 0.56 | 0.13 | 0.64 |
| Plant | EGAAC | Mammalian | RF | 0.33 | 0.71 | 0.52 | 0.05 | 0.63 |
| Plant | PSSM | Mammalian | RF | 0.21 | 0.68 | 0.45 | −0.14 | 0.47 |
In this evaluation, the plant dataset was used as the training set and the mammalian dataset as the test set.
Figure 2. Flowchart of this paper. The workflow consists of four main steps: data collection and preprocessing, feature investigation, model training and evaluation, and independent testing.