Minchao Jiang1, Renfeng Zhang2, Yixiao Xia1, Gangyong Jia1, Yuyu Yin1, Pu Wang3, Jian Wu4, Ruiquan Ge1.
Abstract
Parasites can cause enormous damage to their hosts. Studies have shown that antiparasitic peptides (APPs) can inhibit the growth and development of parasites and even kill them. Because traditional biological methods for determining the activity of antiparasitic peptides are time-consuming and costly, a method for large-scale prediction of APPs is urgently needed. We propose a computational approach called i2APP that efficiently identifies APPs using a two-step machine learning (ML) framework. First, to address the imbalance between positive and negative samples in the training set, random undersampling is used to generate a balanced training dataset. Then, physicochemical features and terminus-based features are extracted, and a first round of classification is performed with Light Gradient Boosting Machine (LGBM) and Support Vector Machine (SVM) to obtain 264-dimensional higher-level features. These features are ranked by the Maximal Information Coefficient (MIC), and the features with the highest MIC values are retained. Finally, the SVM algorithm is used for the second classification in the optimized feature space, which completes the i2APP prediction model. On independent datasets, the accuracy and AUC of i2APP are 0.913 and 0.935, respectively, which are better than those of the state-of-the-art methods. The key idea of the proposed method is that multi-level features are extracted from peptide sequences, and the higher-level features distinguish APPs from non-APPs well.
Keywords: T-distributed stochastic neighbor embedding; antiparasitic peptides; feature representation; feature selection; maximum information coefficient
Year: 2022 PMID: 35571057 PMCID: PMC9091563 DOI: 10.3389/fgene.2022.884589
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.772
FIGURE 1. Different distributions of APP and non-APP sequences. (A) V46p+46n. (B) T255p+1863n.
Peptide sequence features.
| Feature type | Feature |
|---|---|
| Sequence-based | Basic Kmer (kmer) |
| | Distance-based Residue (DR) |
| | Distance Pair (DP) |
| | Auto covariance (feature-AC) |
| | Auto-cross covariance (ACC) |
| | Cross covariance (feature-CC) |
| | Physicochemical distance transformation (PDT) |
| | Parallel correlation pseudo amino acid composition (PC-PseAAC) |
| | Series correlation pseudo amino acid composition (SC-PseAAC) |
| | General parallel correlation pseudo amino acid composition (PC-PseAAC-General) |
| | General series correlation pseudo amino acid composition (SC-PseAAC-General) |
| | Select and combine the n most frequent amino acids according to their frequencies (Top-n-gram) |
| | Profile-based Physicochemical distance transformation (PDT-Profile) |
| | Distance-based Top-n-gram (DT) |
| | Profile-based Auto covariance (AC-PSSM) |
| | Profile-based Cross covariance (CC-PSSM) |
| | Profile-based Distance-based Top-n-gram (PSSM-DT) |
| | Profile-based Auto-cross covariance (ACC-PSSM) |
| Terminus-based | One_hot |
| | One_hot_6_bit |
| | Binary_5_bit |
| | Hydrophobicity_matrix |
| | Meiler_parameters |
| | Acthely_factors |
| | PAM250 |
| | BLOSUM62 |
| | Miyazawa_energies |
| | Micheletti_potentials |
| | AESNN3 |
| | ANN4D |
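
To make the two feature families more concrete, the following minimal sketch computes one sequence-based descriptor (k-mer frequencies) and one terminus-based descriptor (one-hot encoding of the first few residues). It is an illustration only, not the authors' implementation: the function names, the window of 5 residues, the zero-padding rule, and the example peptide are all assumptions.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_frequencies(seq, k=2):
    """Return normalized frequencies of all 20**k possible k-mers in seq."""
    n_windows = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(n_windows))
    total = max(n_windows, 1)  # avoid division by zero for very short sequences
    return [counts["".join(p)] / total for p in product(AMINO_ACIDS, repeat=k)]


def one_hot_terminus(seq, n_residues=5, from_c_terminus=False):
    """One-hot encode n_residues taken from the N- (default) or C-terminus.

    Sequences shorter than n_residues are zero-padded (an assumption);
    non-standard residues are encoded as all zeros.
    """
    window = seq[-n_residues:] if from_c_terminus else seq[:n_residues]
    vector = []
    for pos in range(n_residues):
        residue = window[pos] if pos < len(window) else None
        vector.extend(1 if aa == residue else 0 for aa in AMINO_ACIDS)
    return vector


peptide = "GIGKFLHSAK"  # hypothetical example sequence
features = kmer_frequencies(peptide, k=1) + one_hot_terminus(peptide)
print(len(features))  # 20 composition values + 5 * 20 one-hot values
```

Concatenating several such descriptors yields one fixed-length vector per peptide, which is the form a classifier such as LGBM or SVM expects.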
FIGURE 2. The whole model consists of four parts. The first part is the collection, division, and downsampling of the dataset. The second part is feature extraction and feature selection for each peptide sequence. The third part analyzes the effect of different classifiers through 10-fold cross-validation. In the fourth part, the proposed model is evaluated on an independent test set.
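The downsampling step in the first part can be sketched as follows. This is a minimal illustration of random undersampling, not the authors' code. The class sizes of 255 positives and 1863 negatives are read from the "T255p+1863n" label in the Figure 1 caption, which is an assumption about what that label means, and the sample identifiers are placeholders.

```python
import random


def random_undersample(positives, negatives, seed=0):
    """Randomly drop samples from the larger class until both classes match in size."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    n = min(len(positives), len(negatives))
    balanced_pos = rng.sample(positives, n) if len(positives) > n else list(positives)
    balanced_neg = rng.sample(negatives, n) if len(negatives) > n else list(negatives)
    return balanced_pos, balanced_neg


# Placeholder identifiers standing in for peptide sequences (hypothetical data).
positives = [f"APP_{i}" for i in range(255)]
negatives = [f"nonAPP_{i}" for i in range(1863)]

pos, neg = random_undersample(positives, negatives)
print(len(pos), len(neg))  # 255 255
```

Undersampling discards most of the negative samples, so the balanced model trades some specificity for a large gain in sensitivity, as the balanced versus unbalanced results in this record suggest.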
The results of cross-validation on the training set with different classifiers.
| Dataset | Model | ACC (%) | SN (%) | SP (%) | AUC | MCC | F1 |
|---|---|---|---|---|---|---|---|
| Training Set | SVM | **90.0** | **93.2** | 86.9 | | **0.803** | **0.900** |
| | Bayes | 86.5 | 83.2 | 87.9 | 0.865 | 0.729 | 0.838 |
| | Knn | 86.3 | 93.0 | 80.5 | 0.893 | 0.736 | 0.867 |
| | DT | 82.7 | 82.0 | 84.5 | 0.833 | 0.660 | 0.824 |
| | RF | 87.5 | 91.9 | 83.7 | 0.951 | 0.753 | 0.877 |
| | Ada | 82.2 | 84.8 | 79.8 | 0.823 | 0.645 | 0.822 |
The bold values indicate the best performance.
The results of independent test on the testing set with different classifiers.
| Dataset | Model | ACC (%) | SN (%) | SP (%) | AUC | MCC | F1 |
|---|---|---|---|---|---|---|---|
| Testing Set | SVM | **91.3** | **97.8** | 84.8 | **0.935** | **0.833** | **0.918** |
| | Bayes | 85.9 | 84.8 | 87.0 | 0.868 | 0.718 | 0.857 |
| | Knn | 89.1 | 97.8 | 80.4 | 0.910 | 0.800 | 0.900 |
| | DT | 82.6 | 80.4 | 84.8 | 0.826 | 0.653 | 0.822 |
| | RF | 88.0 | 93.5 | 82.6 | 0.931 | 0.765 | 0.887 |
| | Ada | 88.0 | 91.3 | 84.8 | 0.880 | 0.762 | 0.884 |
The bold values indicate the best performance.
FIGURE 3. The performance of different classifiers in cross-validation on the training set.
Comparison of our model with the existing methods through cross-validation on the training set.
| Method | ACC (%) | SN (%) | SP (%) | MCC | F1 |
|---|---|---|---|---|---|
| NM-BD | 88.8 | 85.5 | 92.2 | 0.778 | 0.884 |
| RUS-BD | 88.2 | 92.5 | 83.9 | 0.768 | 0.887 |
| **i2APP** | **90.0** | **93.2** | 86.9 | **0.803** | **0.900** |
The bold values indicate the best performance.
Comparison of our model with the existing methods through independent test on the testing set.
| Method | ACC (%) | SN (%) | SP (%) | MCC | F1 |
|---|---|---|---|---|---|
| AMPfun | 73.9 | 52.2 | 95.7 | 0.531 | 0.667 |
| PredAPP | 88.0 | 97.8 | 78.3 | 0.776 | 0.891 |
| **i2APP** | **91.3** | **97.8** | 84.8 | **0.833** | **0.918** |
The bold values indicate the best performance.
The results of ten-fold cross-validation on the balanced or unbalanced datasets.
| Method | ACC (%) | SN (%) | SP (%) | MCC | F1 |
|---|---|---|---|---|---|
| PredAPP (unbalanced) | 91.9 | 52.5 | 97.3 | 0.574 | 0.609 |
| i2APP (balanced) | 90.0 | 93.2 | 86.9 | 0.803 | 0.900 |
| i2APP (unbalanced) | 96.5 | 76.7 | 99.3 | 0.826 | 0.839 |
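
The rows above show why accuracy alone can be misleading on an unbalanced dataset: overall accuracy is a class-size-weighted average of sensitivity and specificity, so a large negative class lets a high SP mask a low SN. The sketch below reproduces the reported ACC values approximately from SN and SP. It assumes 255 positives and 1863 negatives for the unbalanced set (read from the "T255p+1863n" label in the Figure 1 caption) and 255 of each for the balanced set. Because cross-validation results are averaged over folds and the table values are rounded, agreement is only approximate.

```python
def overall_accuracy(sn, sp, n_pos, n_neg):
    """Accuracy implied by sensitivity (sn) and specificity (sp) for given class sizes."""
    return (sn * n_pos + sp * n_neg) / (n_pos + n_neg)


N_POS, N_NEG = 255, 1863  # assumed class sizes of the unbalanced training set

pred_app_unbalanced = overall_accuracy(0.525, 0.973, N_POS, N_NEG)
i2app_unbalanced = overall_accuracy(0.767, 0.993, N_POS, N_NEG)
i2app_balanced = overall_accuracy(0.932, 0.869, 255, 255)

for name, acc in [
    ("PredAPP (unbalanced)", pred_app_unbalanced),
    ("i2APP (unbalanced)", i2app_unbalanced),
    ("i2APP (balanced)", i2app_balanced),
]:
    print(f"{name}: ACC ~ {100 * acc:.1f}%")
```

With these assumptions, PredAPP reaches roughly 92% accuracy while detecting only about half of the positives, which is why MCC and F1 are the more informative columns for the unbalanced rows.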
The results of independent test using the balanced or unbalanced datasets as the training set.
| Method | ACC (%) | SN (%) | SP (%) | MCC | F1 |
|---|---|---|---|---|---|
| i2APP (balanced) | 91.3 | 97.8 | 84.8 | 0.833 | 0.918 |
| i2APP (unbalanced) | 93.5 | 100.0 | 87.0 | 0.877 | 0.939 |
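
As a worked check, the metrics in this table can be recomputed from a confusion matrix using the standard definitions of ACC, SN, SP, MCC, and F1. The counts used below are not stated in this record; they are back-calculated under the assumption that the independent test set holds 46 positives and 46 negatives (the "V46p+46n" label in the Figure 1 caption), so SN = 97.8% corresponds to 45 of 46 positives and SP = 84.8% to 39 of 46 negatives.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fn + tn + fp)
    sn = tp / (tp + fn)  # sensitivity (recall on positives)
    sp = tn / (tn + fp)  # specificity (recall on negatives)
    f1 = 2 * tp / (2 * tp + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "MCC": mcc, "F1": f1}


# Inferred counts for i2APP (balanced); these are assumptions, not reported values.
balanced = binary_metrics(tp=45, fn=1, tn=39, fp=7)
# Inferred counts for i2APP (unbalanced): SN = 100% and SP = 87.0% (40 of 46).
unbalanced = binary_metrics(tp=46, fn=0, tn=40, fp=6)

for name, m in [("balanced", balanced), ("unbalanced", unbalanced)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Under these inferred counts, the recomputed values round to the figures in both rows, which supports the reading that the test set is balanced with 46 sequences per class.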
FIGURE 4. The effect of shuffling the sequence.
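A sequence-shuffling control of the kind named in this caption can be sketched as below. Shuffling keeps the amino acid composition of a peptide but destroys residue order, so order-dependent features (for example terminus-based encodings) change while composition features do not. How the authors shuffled their sequences is not described in this record; the function name and the example peptide are assumptions.

```python
import random
from collections import Counter


def shuffle_sequence(seq, seed=None):
    """Return a random permutation of the residues in seq."""
    rng = random.Random(seed)
    residues = list(seq)
    rng.shuffle(residues)  # in-place permutation of the residue list
    return "".join(residues)


peptide = "GIGKFLHSAKKFGKAFVGEIMNS"  # hypothetical example sequence
shuffled = shuffle_sequence(peptide, seed=42)

# Composition is preserved even though the order is generally different.
print(Counter(shuffled) == Counter(peptide))
```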
FIGURE 5. t-SNE visualization of the testing set after dimensionality reduction of the higher-level features.