Zifan Guo, Pingping Wang, Zhendong Liu, Yuming Zhao.
Abstract
Thermophilicity is an important property of proteins, as it can determine denaturation and cell death. Methods for distinguishing thermophilic from non-thermophilic proteins are therefore of interest and can contribute to protein design and engineering. In this article, we describe the use of feature dimension reduction and LIBSVM to identify thermophilic proteins. The highest cross-validation accuracy was 96.02%, obtained with 119 parameters; with only 16 features, the accuracy was 93.33%. We discuss the importance of the different features for identification and compare the performance of the support vector machine with that of other methods.
Keywords: amino acid; feature dimension reduction; feature selection; support vector machine; thermophilic proteins
Year: 2020 PMID: 33195148 PMCID: PMC7642589 DOI: 10.3389/fbioe.2020.584807
Source DB: PubMed Journal: Front Bioeng Biotechnol ISSN: 2296-4185
Figure 1. Study flowchart. (I) The original protein sequence is input for feature extraction. (II) A feature extraction algorithm is used to obtain feature descriptors of each protein. (III) MRMD2.0 is used to rank the importance of features and select features. (IV) A support vector machine is used for parameter optimization and to build the trained model. (V) Three parameters are used to evaluate the performance of the model: sensitivity (SE), specificity (SP), and accuracy (ACC).
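Step (V) can be illustrated with a minimal sketch that computes SE, SP, and ACC from confusion-matrix counts. It uses the standard definitions of these metrics and assumes that thermophilic proteins are the positive class, which the text above does not state. The function name `evaluate` and the example counts are hypothetical.

```python
def evaluate(tp, fn, tn, fp):
    """Return (SE, SP, ACC) in percent from confusion-matrix counts.

    Assumption: the positive class is "thermophilic".
    tp/fn: thermophilic proteins predicted correctly/incorrectly.
    tn/fp: non-thermophilic proteins predicted correctly/incorrectly.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must contain at least one sample")
    se = 100.0 * tp / (tp + fn)                     # sensitivity
    sp = 100.0 * tn / (tn + fp)                     # specificity
    acc = 100.0 * (tp + tn) / (tp + fn + tn + fp)   # overall accuracy
    return se, sp, acc


# Hypothetical counts, chosen only to illustrate the arithmetic.
se, sp, acc = evaluate(tp=90, fn=10, tn=80, fp=20)
print(f"SE={se:.2f}  SP={sp:.2f}  ACC={acc:.2f}")
```

Because ACC pools both classes, it always lies between SE and SP, weighted by the class sizes.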
The results of feature selection using MRMD2.0.
| Feature descriptor | Dimensions (after/before) | ACC (%) |
| --- | --- | --- |
| AAC | 16/20 | 87.94 |
| DPC | 103/400 | 87.00 |
| DDE | 365/400 | 85.60 |
| CTDC | 33/39 | 85.01 |
| CTDT | 39/39 | 80.50 |
| CTriad | 338/343 | 79.80 |
| CKSAAP | 143/150 | 79.04 |
| GTPC | 107/125 | 78.63 |
| GDPC | 13/25 | 78.57 |
| TPC | 1,008/1,023 | 77.11 |
The two numbers in the second column of the table are the feature dimension after reduction and the feature dimension before reduction, respectively.
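The "before" dimensions of the first two rows follow from the descriptor definitions: amino acid composition (AAC) has one value per standard amino acid (20), and dipeptide composition (DPC) has one value per ordered pair of amino acids (20 × 20 = 400). The sketch below shows one common way to compute both descriptors as frequencies. The function names and the exact normalization are assumptions for illustration, not necessarily those of the feature extraction tool used in the study.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(seq):
    """Amino acid composition: 20 residue frequencies."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dpc(seq):
    """Dipeptide composition: 400 frequencies over overlapping pairs."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    counts = {}
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        counts[pair] = counts.get(pair, 0) + 1
    total = len(seq) - 1  # number of overlapping dipeptide windows
    return [counts.get(a + b, 0) / total
            for a, b in product(AMINO_ACIDS, repeat=2)]


# Hypothetical toy sequence, used only to show the vector sizes.
toy = "MKKLDLKAVE"
print(len(aac(toy)), len(dpc(toy)))
```

The subsets that MRMD2.0 retains (16 of 20 and 103 of 400) are not listed here, so the sketch stops at the full descriptors.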
The results of classification using SVM and various feature combinations.
| Method | SE (%) | SP (%) | ACC (%) |
| --- | --- | --- | --- |
| The method of Lin and Chen | 93.77 | 92.69 | 93.27 |
| AAC (16) | 93.44 | 93.19 | 93.33 |
| AAC (16) + CTDC (33) | 93.77 | 92.81 | 93.33 |
| AAC (16) + DPC (103) | 95.85 | 96.22 | 96.02 |
The numbers in parentheses in the first column of the table give the number of dimensions retained for the feature descriptor preceding the parentheses.
The results of classification accuracy using LIBSVM and various combinations of important features.
| Number of features | Feature combination | ACC (%) |
| --- | --- | --- |
| 1 | K | 76.41 |
| 2 | K + D | 77.50 |
| 3 | K + D + LK | 78.29 |
A plus sign in the second column of the table indicates that the listed features were used together for model training and classification. For example, “K + D” means that the data sets were modeled and classified using the two features K and D.
Figure 2. Visualization of the ability of important features to classify thermophilic and non-thermophilic proteins. (A) Violin plot of the K feature. (B) Scatter plot of the K and D features. (C) 3D scatter plot of the K, D, and LK features. K is the percentage of lysine in the amino acid sequence, D is the percentage of aspartic acid in the amino acid sequence, and LK is the percentage content of the dipeptide consisting of leucine and lysine in the amino acid sequence.
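The three features in the caption can be computed directly from a sequence, as in the minimal sketch below. It assumes that the LK percentage is taken over all overlapping dipeptide windows, which is the usual dipeptide composition convention but is not spelled out in the caption. The function names and the toy sequence are hypothetical.

```python
def residue_percent(seq, residue):
    """Percentage of positions in seq occupied by the given residue."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return 100.0 * seq.count(residue) / len(seq)


def dipeptide_percent(seq, dipeptide):
    """Percentage of overlapping two-residue windows equal to dipeptide."""
    seq = seq.upper()
    windows = len(seq) - 1
    if windows < 1:
        raise ValueError("sequence must contain at least two residues")
    hits = sum(1 for i in range(windows) if seq[i:i + 2] == dipeptide)
    return 100.0 * hits / windows


def k_d_lk(seq):
    """Return the (K, D, LK) feature triple for one protein sequence."""
    return (residue_percent(seq, "K"),
            residue_percent(seq, "D"),
            dipeptide_percent(seq, "LK"))


# Hypothetical toy sequence: 2 K and 2 D in 6 residues, 1 LK in 5 windows.
print(k_d_lk("MKLKDD"))
```

Each protein then becomes a point in one, two, or three dimensions, which corresponds to the views in panels (A) to (C).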
The performance of different classification methods in the prediction of the data sets.
| Method | SE (%) | SP (%) | ACC (%) |
| --- | --- | --- | --- |
| SVM (this article) | 95.85 | 96.22 | 96.02 |
| LMT | 92.35 | 90.29 | 91.40 |
| Logistic | 91.15 | 88.90 | 90.11 |
| Random Forest | 91.69 | 87.51 | 89.75 |
| BayesNet | 88.08 | 86.25 | 87.24 |
| REPTree | 83.60 | 84.62 | 84.07 |
| J48 | 83.50 | 80.33 | 82.03 |
Figure 3. The performance of the method described in this article and of six other predictors when the input consists of 16 amino acid composition parameters and 103 dipeptide composition parameters. The performance metrics are sensitivity (SE), specificity (SP), and accuracy (ACC).