Xiaoqing Ru, Lihong Li, Chunyu Wang.
Abstract
The unique properties of bacteriophages give them an important role in bioinformatics research. In real applications, the function of bacteriophage virion proteins is the main area of interest, so it is important to distinguish bacteriophage virion proteins from non-phage virion proteins accurately. Extracting comprehensive and effective sequence features is vital to protein classification. To represent protein information more fully, this paper combines the features extracted by a feature representation algorithm based on sequence information (CCPA) with those extracted by a feature representation algorithm based on both sequence and structure information. After feature extraction, the Max-Relevance-Max-Distance (MRMD) algorithm is used to select the optimal feature set, the one with the strongest correlation with the class labels and low redundancy between features. A random forest, which randomly samples both the training examples and the candidate variables at each node, is then evaluated with 10-fold cross-validation on the bacteriophage protein classification task. The model reaches an accuracy of 93.5% in the classification of phage proteins in this study. The study also found that, among the eight physicochemical properties considered, charge has the greatest impact on the classification of bacteriophage proteins. These results indicate that the model discussed in this paper is an important tool in bacteriophage protein research.
Keywords: feature extraction; feature selection; hybrid sequence features; machine learning; phage virion proteins
Year: 2019 PMID: 30972038 PMCID: PMC6443926 DOI: 10.3389/fmicb.2019.00507
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640
Figure 1. Outline flowchart of this study.
Figure 2. Eight physicochemical properties of amino acids.
Figure 3. Two-two combination process of amino acids. (A) Two-two combination of residues. (B) Three-dimensional heat map of amino acid frequency. (C) Heat map of amino acid frequency.
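The two-two combination of residues in Figure 3 corresponds to counting ordered residue pairs, which gives 20 x 20 = 400 values per sequence. The sketch below covers only the simplest case, adjacent pairs; it is not the AKSNG method listed in the tables, which may count pairs differently. The function name `dipeptide_frequencies` and the example sequence are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def dipeptide_frequencies(sequence):
    """Return a 400-element list of adjacent residue-pair frequencies."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for a, b in zip(sequence, sequence[1:]):
        pair = a + b
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[p] / total for p in DIPEPTIDES]

# Hypothetical toy sequence, for illustration only
vec = dipeptide_frequencies("MKKLAAKK")
kk_frequency = vec[DIPEPTIDES.index("KK")]
```

Arranging the 400 values as a 20 x 20 grid gives the kind of frequency heat map described in panels (B) and (C).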
Classification results of three data sets under different classification algorithms.
| CCPA | 188D | 68.5 | 78.3 | 91.3 |
| MRMD | 185D | 68.5 | 78.3 | 91.5 |
| AKSNG | 400D | 60.3 | 71.8 | 88.7 |
| MRMD | 252D | 60.3 | 72.8 | 89.0 |
| Seq-Str | 473D | 80.6 | 80.9 | 92.6 |
| MRMD | 189D | 82.0 | 83.1 | 93.2 |
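The MRMD rows show the feature dimension after selection. As a rough illustration of the idea stated in the abstract (strong correlation with the class labels, low redundancy between features), the sketch below ranks features by the sum of a relevance term and a distance term. It is a simplified stand-in, not the published MRMD tool: using absolute Pearson correlation for relevance, mean Euclidean distance for redundancy, and an unweighted sum are all assumptions, and the toy data is invented.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mrmd_like_ranking(columns, labels):
    """Rank feature columns by |correlation with labels| + mean distance to other columns."""
    scores = []
    for i, col in enumerate(columns):
        relevance = abs(pearson(col, labels))
        others = [euclidean(col, c) for j, c in enumerate(columns) if j != i]
        distance = sum(others) / len(others) if others else 0.0
        scores.append((relevance + distance, i))
    # Highest combined score first
    return [i for _, i in sorted(scores, reverse=True)]

# Hypothetical toy data: 4 samples, 3 feature columns
labels = [1, 1, 0, 0]
columns = [
    [1.0, 0.9, 0.1, 0.0],  # tracks the labels closely
    [0.5, 0.5, 0.5, 0.5],  # constant, carries no label information
    [0.4, 0.6, 0.5, 0.5],  # varies, but not with the labels
]
ranking = mrmd_like_ranking(columns, labels)
```

Keeping the top-ranked features up to some cutoff would give a reduced set such as the 185D, 252D and 189D sets in the table; how the cutoff is chosen is not stated here.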
Classification performance under different feature extraction methods.
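| Feature type | Dimension | Sn (%) | Sp (%) | Acc (%) | MCC (%) |
| --- | --- | --- | --- | --- | --- |
| Seq based | 188D | 87.4 | 93.6 | 91.3 | 81.5 |
| Seq based | 400D | 82.8 | 92.4 | 88.7 | 76.1 |
| Seq and str based | 473D | 86.2 | 97.2 | 92.6 | 85.1 |
| Com based | 588D | 87.1 | 93.2 | 91.2 | 80.7 |
| Com based | 661D | 87.5 | 96.5 | 93.1 | 85.3 |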
Classification performance under each model.
| Model | Features (dimension) | Sn (%) | Sp (%) | Acc (%) | MCC (%) |
| --- | --- | --- | --- | --- | --- |
| Model 1 | CCPA (188) | 87.5 | 93.4 | 91.5 | 81.4 |
| Model 2 | AKSNG (400) | 82.9 | 92.2 | 89.0 | 76.0 |
| Model 3 | Seq-Str (473) | 86.7 | 96.6 | 93.2 | 84.8 |
| Model 4 | Combine (588) | 87.6 | 93.5 | 91.5 | 81.5 |
| Model 5 | Combine (661) | 87.9 | 96.3 | 93.5 | 85.3 |
Performance comparison against recent methods.
| Method | Sn (%) | Sp (%) | Acc (%) | MCC (%) |
| --- | --- | --- | --- | --- |
| Feng et al. | 75.7 | 80.7 | 79.1 | 54.9 |
| Ding et al. | 75.7 | 89.4 | 85.0 | 65.5 |
| Zhang et al. | 87.0 | 83.0 | 85.0 | 70.1 |
| This study | 87.9 | 96.3 | 93.5 | 85.3 |
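Assuming the four numeric columns in the performance tables are sensitivity (Sn), specificity (Sp), accuracy (Acc) and the Matthews correlation coefficient (MCC), with the third column matching the 93.5% accuracy quoted in the abstract, they can be computed from a binary confusion matrix as sketched below. The counts in the example are hypothetical and are not taken from the study.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute Sn, Sp, Acc and MCC from binary confusion-matrix counts."""
    sn = tp / (tp + fn)                      # sensitivity: recall on positives
    sp = tn / (tn + fp)                      # specificity: recall on negatives
    acc = (tp + tn) / (tp + tn + fp + fn)    # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}

# Hypothetical counts, for illustration only
m = classification_metrics(tp=90, tn=95, fp=5, fn=10)
```

MCC lies between -1 and 1, so values such as 85.3 in the tables would correspond to 0.853 on that scale.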
Impact of physicochemical properties on classification.
| Rank | Feature | Score | Description |
| --- | --- | --- | --- |
| 1 | Fea 120 | 1.0 | Sequence position at which 100% of the neutral-charge amino acids have occurred |
| 2 | Fea 157 | 0.9968696407744475 | Sequence position at which 100% of the helical amino acids have occurred |
| 3 | Fea 178 | 0.9950260206126923 | Sequence position at which 100% of the soluble amino acids have occurred |
| 4 | Fea 99 | 0.9949600329187752 | Sequence position at which 100% of the neutral-polarizability amino acids have occurred |
| 5 | Fea 136 | 0.9948079966447566 | Sequence position at which 100% of the high-surface-tension amino acids have occurred |
| 6 | Fea 83 | 0.994509178771573 | Sequence position at which 100% of the high-polarity amino acids have occurred |
| 7 | Fea 52 | 0.994137797849692 | Sequence position at which 100% of the small van der Waals volume amino acids have occurred |
| 8 | Fea 31 | 0.9937317569946658 | Sequence position at which 100% of the hydrophilic amino acids have occurred |
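Each feature in the table above is a distribution-style descriptor: the position in the sequence by which all residues of one physicochemical class have appeared. A minimal sketch of that computation follows, assuming a three-way charge grouping (positive K/R, negative D/E, neutral otherwise) that is a common convention rather than something given in the source. The function name and the example sequence are hypothetical.

```python
import math

# Assumed charge grouping (common convention, not taken from the source)
CHARGE_GROUPS = {
    "positive": set("KR"),
    "negative": set("DE"),
    "neutral": set("ACFGHILMNPQSTVWY"),
}

def distribution_positions(sequence, members, fractions=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Positions (as % of sequence length) of the first, 25%, 50%, 75% and
    100% occurrence of residues belonging to `members`."""
    positions = [i + 1 for i, aa in enumerate(sequence) if aa in members]
    if not positions:
        return [0.0] * len(fractions)
    result = []
    for f in fractions:
        k = max(1, math.ceil(f * len(positions)))  # k-th occurrence, 1-based
        result.append(100.0 * positions[k - 1] / len(sequence))
    return result

# Hypothetical toy sequence, for illustration only
seq = "MKDEAGLKRA"
neutral = distribution_positions(seq, CHARGE_GROUPS["neutral"])
positive = distribution_positions(seq, CHARGE_GROUPS["positive"])
```

The last element of each returned list is the 100% position, the quantity that the top-ranked features describe for their respective residue classes.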