Yixiao Zhai, Yu Chen, Zhixia Teng, Yuming Zhao.
Abstract
Excessive oxidative stress responses can threaten our health, and thus it is essential to produce antioxidant proteins to regulate the body's oxidative responses. The low number of known antioxidant proteins makes it difficult to extract their representative features. Our experimental method did not use structural information but instead studied antioxidant proteins from a sequence perspective while focusing on the impact of data imbalance on sensitivity, thus greatly improving the model's sensitivity for antioxidant protein recognition. We developed a method based on the Composition of k-spaced Amino Acid Pairs (CKSAAP) and the Conjoint Triad (CT) features derived from the amino acid composition and protein-protein interactions. SMOTE and the Max-Relevance-Max-Distance algorithm (MRMD) were used to balance the training data and to select the optimal feature subset, respectively. The test set was classified with a random forest algorithm, using 10-fold cross-validation, according to the selected feature subset. The sensitivity was 0.792, the specificity was 0.808, and the average accuracy was 0.8.
Keywords: antioxidant protein; machine learning; random forest; sequence feature; unbalanced dataset
Year: 2020 PMID: 33195258 PMCID: PMC7658297 DOI: 10.3389/fcell.2020.591487
Source DB: PubMed Journal: Front Cell Dev Biol ISSN: 2296-634X
FIGURE 1. The method flowchart. The original dataset (training and test datasets) is processed in four phases. (1) CKSAAP and CT are used to extract 743D features. (2) In the unbalanced data processing phase, eight methods are adopted to balance the training dataset. (3) In the feature selection phase, the 743D features are ranked by MRMD score and the optimal feature set is selected with a Random Forest classifier. (4) The selected feature subset is used to classify the test set and obtain the final result.
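The CKSAAP half of phase (1) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `cksaap`, the gap value `k=0`, and the choice to normalize by the number of valid pairs are all assumptions. A single gap value yields 400 pair frequencies, which is consistent with the 743D total once the 343 CT dimensions are added, but the entry does not state which gap was used.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def cksaap(sequence, k=0):
    """Return 400 frequencies of residue pairs separated by k positions.

    k=0 means adjacent residues. The value of k is an assumption here.
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = 0
    for i in range(len(sequence) - k - 1):
        pair = sequence[i] + sequence[i + k + 1]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    # Normalize by the number of counted pairs (assumed convention).
    return [counts[p] / total if total else 0.0 for p in pairs]


features = cksaap("ACDACD", k=0)  # toy sequence, for illustration only
```

The toy sequence contains five adjacent pairs, so each entry of `features` is a multiple of 1/5 and the vector sums to 1.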
Classification of amino acids.
| No. | Dipole scale | Volume scale | Class |
| --- | --- | --- | --- |
| 1 | − | − | Ala, Gly, Val |
| 2 | − | + | Ile, Leu, Phe, Pro |
| 3 | + | + | Tyr, Met, Thr, Ser |
| 4 | + + | + | His, Asn, Gln, Trp |
| 5 | + + + | + | Arg, Lys |
| 6 | | + | Asp, Glu |
| 7 | + | + | Cys |
FIGURE 2. The 343-dimensional feature composition diagram. (1) The 20 amino acids are classified into seven categories, g1∼g7, as shown in the amino acid classification part. (2) In the 343D vector space composed of the seven types of amino acids, the seven categories are arranged and combined, resulting in 343 trimers. (3) Examples of sequence conversion into features. The figure was adapted from the Supplementary Figure in Shen et al. (2007).
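The conversion described in this caption can be sketched as below. It is a minimal illustration under stated assumptions: the name `conjoint_triad` is hypothetical, the class membership is taken from the amino acid classification table (with tryptophan as `W`), and the function returns raw trimer counts, leaving out any normalization step the original method may apply.

```python
from itertools import product

# Seven classes from the amino acid classification table (one-letter codes).
CLASSES = {1: "AGV", 2: "ILFP", 3: "YMTS", 4: "HNQW", 5: "RK", 6: "DE", 7: "C"}
RESIDUE_TO_CLASS = {aa: c for c, members in CLASSES.items() for aa in members}


def conjoint_triad(sequence):
    """Count each of the 7 * 7 * 7 = 343 class trimers along a sequence."""
    triads = list(product(range(1, 8), repeat=3))
    counts = dict.fromkeys(triads, 0)
    classes = [RESIDUE_TO_CLASS.get(aa) for aa in sequence]
    # Slide a window of three consecutive residues over the sequence.
    for triad in zip(classes, classes[1:], classes[2:]):
        if None not in triad:  # skip windows with non-standard residues
            counts[triad] += 1
    return [counts[t] for t in triads]


vector = conjoint_triad("AGVC")  # toy sequence: classes 1, 1, 1, 7
```

For the toy sequence, only two trimers occur, (1, 1, 1) and (1, 1, 7), so the vector holds two ones and 341 zeros.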
FIGURE 3. (A) Comparison of the results obtained with different feature extraction methods. The evaluation indicators for the CKSAAP+CT method were higher than those of the other methods. (B) The average value of the final results obtained with three types of unbalanced processing methods. The result obtained by oversampling is much higher than those of the other methods, which suggests that repeatedly sampling the minority-class data to synthesize new samples is more conducive to extracting features that distinguish antioxidant proteins. (C) Comparison of classification performance before and after dimensionality reduction with MRMD. Sn, Sp, and Acc are greatly improved, and the Sn and Sp results are very balanced. (D) Comparison of the sequence characteristics of antioxidant and non-antioxidant proteins. Triplets that combine first-type and second-type amino acids appear more frequently.
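The oversampling idea in panel (B), synthesizing new minority-class samples from existing ones, can be illustrated with a SMOTE-style interpolation. This is a simplified sketch rather than the library implementation used in the study: the function name `smote_like`, the neighbor count `k`, and the toy points are all assumptions.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbors of the chosen sample, excluding itself.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist2(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical 2D minority samples, for illustration only.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_like(minority, n_new=5)
```

Each synthetic point lies on a line segment between two existing minority samples, so the new data stay inside the region the minority class already occupies.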
Accuracy of eight data imbalance processing methods with different classifiers.
| ADASYN | 0.705 | 0.686 | 0.533 |
| BorderlineSMOTE | 0.733 | 0.638 | 0.533 |
| SVMSMOTE | 0.733 | 0.705 | 0.533 |
| ClusterCentroids (RandomState = 0) | 0.667 | 0.648 | 0.6 |
| NearMiss (version = 1) | 0.733 | 0.686 | 0.6 |
| NearMiss (version = 2) | 0.638 | 0.648 | 0.6 |
| NearMiss (version = 3) | 0.686 | 0.638 | 0.6 |
| SMOTEENN | 0.59 | 0.571 | 0.562 |
| SMOTETomek | 0.724 | 0.648 | 0.543 |
Comparison of the best results in our research with the results of AodPred.
| AodPred | 0.751 | 0.745 | 0.748 |
Prediction results of the models established with different data imbalance processing methods.
| ADASYN | 0.698 | 0.712 | 0.705 | 0.705 | 0.41 | 0.766 |
| BorderlineSMOTE | 0.736 | 0.731 | 0.733 | 0.733 | 0.467 | 0.807 |
| SVMSMOTE | 0.736 | 0.731 | 0.733 | 0.733 | 0.467 | 0.78 |
| ClusterCentroids (RandomState = 0) | 0.604 | 0.731 | 0.667 | 0.665 | 0.337 | 0.743 |
| NearMiss (version = 1) | 0.755 | 0.712 | 0.733 | 0.733 | 0.467 | 0.775 |
| NearMiss (version = 2) | 0.66 | 0.615 | 0.638 | 0.638 | 0.276 | 0.733 |
| NearMiss (version = 3) | 0.698 | 0.673 | 0.686 | 0.686 | 0.371 | 0.734 |
| SMOTEENN | 0.604 | 0.557 | 0.59 | 0.59 | 0.181 | 0.628 |
| SMOTETomek | 0.755 | 0.692 | 0.724 | 0.724 | 0.448 | 0.805 |
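For reference, the sensitivity (Sn), specificity (Sp), and accuracy (Acc) indicators named in this entry follow the standard confusion-matrix definitions. The sketch below assumes those standard definitions. The function name `evaluate` and the counts are hypothetical and are not taken from the study.

```python
def evaluate(tp, fn, tn, fp):
    """Return (Sn, Sp, Acc) from confusion-matrix counts.

    tp/fn: antioxidant proteins predicted correctly / missed.
    tn/fp: non-antioxidant proteins predicted correctly / misclassified.
    """
    sn = tp / (tp + fn)                    # sensitivity: recall on positives
    sp = tn / (tn + fp)                    # specificity: recall on negatives
    acc = (tp + tn) / (tp + fn + tn + fp)  # overall accuracy
    return sn, sp, acc


# Hypothetical counts, for illustration only.
sn, sp, acc = evaluate(tp=8, fn=2, tn=7, fp=3)
```

On an imbalanced dataset, Acc can stay high while Sn is low, which is why Sn and Sp are reported alongside it.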