Dongxu Zhao, Zhixia Teng, Yanjuan Li, Dong Chen.
Abstract
Recently, several anti-inflammatory peptides (AIPs) have been found in the process of the inflammatory response, and these peptides have been used to treat some inflammatory and autoimmune diseases. Accurately identifying AIPs from given amino acid sequences is therefore critical for discovering novel and efficient anti-inflammatory peptide-based therapeutics and for accelerating their application in therapy. In this paper, a random forest-based model called iAIPs is proposed for identifying AIPs. First, the original samples are encoded with three feature extraction methods: g-gap dipeptide composition (GDC), dipeptide deviation from the expected mean (DDE), and amino acid composition (AAC). Second, the optimal feature subset is generated by a two-step feature selection method, in which the features are ranked by analysis of variance (ANOVA) and the optimal subset is then chosen by an incremental feature selection strategy. Finally, the optimal feature subset is fed into the random forest classifier to construct the identification model. Experimental results show that iAIPs achieves an AUC value of 0.822 on an independent test dataset, indicating that the proposed model outperforms the existing methods in terms of AUC. Furthermore, the extraction of features from peptide sequences provides a basis for evolutionary analysis. The study of peptide identification helps to understand the diversity of species and to analyze their evolutionary history.
Keywords: anti-inflammatory peptides; evolutionary analysis; evolutionary information; feature extraction; random forest
Year: 2021 PMID: 34917130 PMCID: PMC8669811 DOI: 10.3389/fgene.2021.773202
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1. The framework of iAIPs.
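To make the encoding step of the framework concrete, here is a minimal Python sketch of two of the three descriptors named in the abstract, AAC and GDC. It is an illustration, not the authors' implementation. It assumes that a gap of g means g residues lie between the two paired residues, and that pairs containing non-standard residues are skipped; the function names `aac` and `gdc` are hypothetical. DDE is omitted because the source does not give its formula.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def aac(seq):
    """Amino acid composition: frequency of each standard residue (20 values)."""
    seq = seq.upper()
    if not seq:
        return [0.0] * len(AMINO_ACIDS)
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def gdc(seq, gap=1):
    """g-gap dipeptide composition (400 values).

    Assumption: residues at positions i and i + gap + 1 form a pair,
    i.e. `gap` residues sit between them. Frequencies are normalized
    by the number of valid pairs found in the sequence.
    """
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    step = gap + 1
    total = 0
    for i in range(len(seq) - step):
        pair = seq[i] + seq[i + step]
        if pair in counts:  # skip pairs with non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]
```

Concatenating the two vectors, for example `aac(peptide) + gdc(peptide, gap=1)`, gives a 420-value feature vector for one peptide under these assumptions.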
Performance comparison of various single features.
| Feature | SN | SP | ACC | MCC | AUC |
|---|---|---|---|---|---|
| Amino acid composition (AAC) | 0.529 | 0.845 | 0.719 | 0.398 | 0.760 |
| Dipeptide deviation from the expected mean (DDE) | 0.589 | 0.854 | 0.748 | 0.464 | 0.784 |
| G-gap dipeptide composition (GDC)-gap1 | 0.456 | 0.862 | 0.700 | 0.353 | 0.764 |
| GDC-gap2 | 0.466 | 0.852 | 0.697 | 0.348 | 0.751 |
| GDC-gap3 | 0.454 | 0.869 | 0.703 | 0.361 | 0.741 |
| GDC-gap4 | 0.449 | 0.853 | 0.692 | 0.335 | 0.733 |
| CKSAAGP | 0.477 | 0.861 | 0.707 | 0.371 | 0.732 |
| CTriad | 0.215 | 0.897 | 0.624 | 0.155 | 0.668 |
| GAAC | 0.533 | 0.750 | 0.663 | 0.288 | 0.679 |
| GDPC | 0.525 | 0.826 | 0.706 | 0.370 | 0.727 |
| GTPC | 0.470 | 0.855 | 0.701 | 0.357 | 0.742 |
| TPC | 0.304 | 0.910 | 0.668 | 0.277 | 0.739 |
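The column headers SN, SP, ACC, MCC, and AUC are not defined in this record. The sketch below assumes the standard definitions (sensitivity, specificity, accuracy, Matthews correlation coefficient, and area under the ROC curve) and shows how they could be computed from confusion-matrix counts and prediction scores. The function names are hypothetical, and the AUC here uses the pairwise-ranking formulation rather than any particular library routine.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return SN, SP, ACC, and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)  # sensitivity: fraction of positives recovered
    sp = tn / (tn + fp)  # specificity: fraction of negatives recovered
    acc = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


def auc(pos_scores, neg_scores):
    """AUC as the probability that a positive outscores a negative.

    Ties between a positive and a negative score count as one half.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

Because MCC accounts for all four confusion-matrix cells, it is less sensitive to class imbalance than ACC, which is one reason the two columns can rank methods differently.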
Performance comparison of various combined features under fivefold cross-validation on the training dataset.
| Feature | SN | SP | ACC | MCC | AUC |
|---|---|---|---|---|---|
| AAC+DDE | 0.582 | 0.857 | 0.747 | 0.461 | 0.784 |
| AAC+GDC-gap1 | 0.483 | 0.870 | 0.715 | 0.388 | 0.770 |
| AAC+GDC-gap2 | 0.453 | 0.871 | 0.704 | 0.363 | 0.773 |
| AAC+GDC-gap3 | 0.435 | 0.866 | 0.694 | 0.339 | 0.759 |
| AAC+GDC-gap4 | 0.447 | 0.873 | 0.703 | 0.360 | 0.760 |
| DDE+GDC-gap1 | 0.586 | 0.858 | 0.749 | 0.466 | 0.790 |
| DDE+GDC-gap2 | 0.588 | 0.854 | 0.748 | 0.464 | 0.791 |
| DDE+GDC-gap3 | 0.583 | 0.860 | 0.749 | 0.466 | 0.785 |
| DDE+GDC-gap4 | 0.587 | 0.851 | 0.746 | 0.459 | 0.784 |
| AAC+DDE+GDC-gap1 | 0.585 | 0.860 | 0.750 | 0.468 | 0.794 |
| AAC+DDE+GDC-gap2 | 0.584 | 0.852 | 0.745 | 0.457 | 0.790 |
| AAC+DDE+GDC-gap3 | 0.593 | 0.857 | 0.751 | 0.471 | 0.784 |
| AAC+DDE+GDC-gap4 | 0.587 | 0.855 | 0.748 | 0.464 | 0.785 |
Performance comparison of various combined features on the independent dataset.
| Feature | SN | SP | ACC | MCC | AUC |
|---|---|---|---|---|---|
| AAC+DDE | 0.564 | 0.860 | 0.742 | 0.450 | 0.808 |
| AAC+GDC-gap1 | 0.488 | 0.884 | 0.725 | 0.413 | 0.799 |
| AAC+GDC-gap2 | 0.455 | 0.878 | 0.708 | 0.373 | 0.787 |
| AAC+GDC-gap3 | 0.448 | 0.881 | 0.707 | 0.371 | 0.795 |
| AAC+GDC-gap4 | 0.462 | 0.865 | 0.704 | 0.362 | 0.783 |
| DDE+GDC-gap1 | 0.569 | 0.857 | 0.742 | 0.450 | 0.812 |
| DDE+GDC-gap2 | 0.560 | 0.854 | 0.736 | 0.437 | 0.805 |
| DDE+GDC-gap3 | 0.576 | 0.857 | 0.745 | 0.456 | 0.808 |
| DDE+GDC-gap4 | 0.569 | 0.857 | 0.742 | 0.450 | 0.801 |
| AAC+DDE+GDC-gap1 | 0.560 | 0.859 | 0.739 | 0.443 | 0.806 |
| AAC+DDE+GDC-gap2 | 0.557 | 0.855 | 0.736 | 0.437 | 0.805 |
| AAC+DDE+GDC-gap3 | 0.552 | 0.855 | 0.734 | 0.433 | 0.806 |
| AAC+DDE+GDC-gap4 | 0.567 | 0.859 | 0.742 | 0.450 | 0.801 |
Performance of various classifiers using the AAC+DDE+GDC-gap1 feature set under fivefold cross-validation on the training dataset.
| Classifier | SN | SP | ACC | MCC | AUC |
|---|---|---|---|---|---|
| Random forest | 0.585 | 0.860 | 0.750 | 0.468 | 0.794 |
| AdaBoost | 0.579 | 0.743 | 0.678 | 0.324 | 0.661 |
| Gradient Boost Decision Tree (GBDT) | 0.583 | 0.788 | 0.706 | 0.379 | 0.686 |
| LightGBM | 0.564 | 0.754 | 0.678 | 0.321 | 0.659 |
| XGBoost | 0.576 | 0.757 | 0.684 | 0.336 | 0.666 |
| J48 | 0.552 | 0.737 | 0.663 | 0.292 | 0.647 |
| Logistic | 0.497 | 0.677 | 0.605 | 0.175 | 0.624 |
| Sequential minimal optimization (SMO) | 0.476 | 0.725 | 0.626 | 0.206 | 0.601 |
| SGD | 0.491 | 0.689 | 0.610 | 0.182 | 0.590 |
| Naïve Bayes | 0.483 | 0.684 | 0.603 | 0.168 | 0.604 |
Performance of various classifiers using the AAC+DDE+GDC-gap1 feature set on the independent dataset.
| Classifier | SN | SP | ACC | MCC | AUC |
|---|---|---|---|---|---|
| Random forest | 0.560 | 0.859 | 0.739 | 0.443 | 0.806 |
| AdaBoost | 0.607 | 0.809 | 0.728 | 0.426 | 0.708 |
| GBDT | 0.640 | 0.798 | 0.735 | 0.443 | 0.719 |
| LightGBM | 0.538 | 0.859 | 0.730 | 0.424 | 0.698 |
| XGBoost | 0.579 | 0.847 | 0.740 | 0.446 | 0.713 |
| J48 | 0.524 | 0.738 | 0.652 | 0.266 | 0.621 |
| Logistic | 0.498 | 0.658 | 0.594 | 0.156 | 0.615 |
| SMO | 0.442 | 0.701 | 0.598 | 0.147 | 0.572 |
| SGD | 0.493 | 0.679 | 0.604 | 0.173 | 0.586 |
| Naïve Bayes | 0.486 | 0.676 | 0.600 | 0.162 | 0.602 |
FIGURE 2. Comparison of identification performance before and after dimensionality reduction.
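The dimensionality reduction referred to here is the two-step selection described in the abstract: ANOVA ranking followed by incremental feature selection (IFS). The following Python sketch shows one way this could look, under assumptions: the F-score is the standard one-way ANOVA statistic, and `evaluate` is a hypothetical callable that returns a quality score (for example, a cross-validated AUC from a random forest) for a given list of feature indices. It is not the authors' code.

```python
def anova_f(values, labels):
    """One-way ANOVA F-score of a single feature across class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    means = {y: sum(g) / len(g) for y, g in groups.items()}
    # Between-group and within-group mean squares.
    between = sum(len(g) * (means[y] - grand_mean) ** 2
                  for y, g in groups.items()) / (k - 1)
    within = sum((v - means[y]) ** 2
                 for y, g in groups.items() for v in g) / (n - k)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


def rank_features(X, y):
    """Feature indices sorted by descending F-score (X is a list of rows)."""
    scores = [anova_f([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def incremental_feature_selection(ranked, evaluate):
    """Grow the subset one ranked feature at a time and keep the best prefix.

    `evaluate` is a hypothetical scoring callback supplied by the caller.
    """
    best_subset, best_score = [], float("-inf")
    for size in range(1, len(ranked) + 1):
        subset = ranked[:size]
        score = evaluate(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score
```

IFS evaluates only the nested prefixes of the ranked list, so the number of model fits grows linearly with the number of features rather than exponentially as in an exhaustive subset search.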
Performance of different identification models on the independent dataset.
| Method | SN | SP | ACC | MCC | AUC |
|---|---|---|---|---|---|
| AntiInflam (LA) | 0.258 | 0.892 | 0.638 | 0.197 | 0.647 |
| AntiInflam (MA) | 0.786 | 0.417 | 0.565 | 0.210 | 0.706 |
| AIEpred | 0.555 | 0.899 | 0.762 | 0.495 | 0.767 |
| AIPpred | 0.741 | 0.746 | 0.744 | 0.479 | 0.813 |
| iAIPs (our work) | 0.567 | 0.874 | 0.751 | 0.471 | 0.822 |