Zhi Qun Tang, Hong Huang Lin, Hai Lei Zhang, Lian Yi Han, Xin Chen, Yu Zong Chen.
Abstract
Various computational methods have been used to predict protein and peptide function from sequence. A particular challenge is to derive functional properties from sequences that show low or no homology to proteins of known function. Recently, a machine learning method, support vector machines (SVM), has been explored for predicting the functional class of proteins and peptides from properties derived from their amino acid sequences, independent of sequence similarity. It has shown promising potential for a wide spectrum of protein and peptide classes, including some low- and non-homologous proteins. This method can thus be explored as a potential tool to complement alignment-based, clustering-based, and structure-based methods for predicting protein function. This article reviews the strategies, current progress, and underlying difficulties in using SVM for predicting the functional class of proteins. The relevant software and web servers are described. The reported prediction performances in the application of these methods are also presented.
Keywords: machine learning method; peptide function; protein family; protein function; protein function prediction; support vector machines
Year: 2009 PMID: 20066123 PMCID: PMC2789692 DOI: 10.4137/bbi.s315
Source DB: PubMed Journal: Bioinform Biol Insights ISSN: 1177-9322
Figure 1. Schematic diagram illustrating the process of training and prediction of the functional class of proteins and peptides using the support vector machine (SVM) method. A, B: feature vectors of proteins belonging to a functional class; E, F: feature vectors of proteins not belonging to that functional class. The sequence-derived features hj, vj, pj, … represent structural and physicochemical properties such as hydrophobicity, polarizability, and volume, or properties such as domain information, subcellular localization, and post-translational (PT) modification profiles.
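The sequence-to-feature-vector step in Figure 1 can be illustrated with two of the simplest descriptors that appear in the performance tables, amino acid composition and dipeptide composition. This is a minimal sketch: the function names and the choice to skip non-standard residue letters are assumptions made for illustration, not the procedure of any specific server described here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def amino_acid_composition(seq):
    """Return 20 fractions, one per standard amino acid."""
    seq = seq.upper()
    counts = [seq.count(a) for a in AMINO_ACIDS]
    total = sum(counts)  # non-standard letters are ignored (assumption)
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return [c / total for c in counts]

def dipeptide_composition(seq):
    """Return 400 fractions, one per ordered pair of standard residues."""
    seq = seq.upper()
    index = {a + b: k for k, (a, b) in enumerate(product(AMINO_ACIDS, repeat=2))}
    counts = [0] * len(index)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in index:  # skip pairs containing non-standard letters
            counts[index[pair]] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no valid dipeptides")
    return [c / total for c in counts]
```

Each protein, whatever its length, is mapped to a fixed-length vector (20 or 400 values), which is what allows an SVM to be trained without a sequence alignment.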
Web-servers for computing functional class of proteins and peptides by using support vector machines. Web-sites of support vector machine software are also given.
| Category | Server or software |
|---|---|
| Server for Predicting Protein Functional Class | CTKPred: SVM prediction and classification of the cytokine family |
| | GPCRpred: SVM prediction of families and subfamilies of G-protein coupled receptors |
| | pSLIP: SVM protein subcellular localization prediction |
| | SVMProt: SVM protein functional family prediction from protein sequence |
| Server for Predicting Peptide Functional Class | MHC-BPS: SVM prediction of MHC-binding peptides of flexible lengths |
| | SVMHC: SVM prediction of MHC-binding peptides |
| | SVRMHC: SVM prediction of MHC-binding peptides |
| | WAPP: SVM prediction of MHC-binding, proteasomal cleavage and TAP transport peptides |
| SVM Software and servers | SVM light |
| | LIBSVM |
| | mySVM |
| | SMO |
| | BSVM |
| | WinSVM |
| | LS-SVMlab |
| | GIST SVM Server |
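Two of the packages listed above, SVM light and LIBSVM, read training data as sparse text, one sample per line in the form `<label> <index>:<value> ...` with 1-based feature indices in ascending order. A minimal sketch of writing a feature vector in that form follows; the helper name `to_libsvm_line` is hypothetical and is not part of either package.

```python
def to_libsvm_line(label, features):
    """Format one sample as '<label> <index>:<value> ...'.

    Indices are 1-based; zero-valued features are omitted,
    which the sparse format permits.
    """
    parts = [str(label)]
    for i, value in enumerate(features, start=1):
        if value != 0:
            parts.append(f"{i}:{value:g}")
    return " ".join(parts)

# Example: a class member (+1 written as 1) with three features.
line = to_libsvm_line(1, [0.5, 0, 0.25])
```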
Figure 2. Support vector machines. (a) Definition of hyper-plane and margin. The circular dots and square dots represent samples of class −1 and class +1, respectively. (b) The available hyper-planes H, H’, H”, …, corresponding to a set of training data. (c) Unique optimal separating hyper-plane of a set of training data. (d) Basic idea of support vector machines: Projection of the training data nonlinearly into a higher-dimensional feature space via φ, and subsequent construction of a separating hyper-plane with maximum margin in that space.
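Panels (a) and (d) of Figure 2 can be made concrete with a small numerical sketch. It assumes the usual canonical scaling, in which the closest samples satisfy |w·x + b| = 1 so that the margin width is 2/‖w‖, and it uses the degree-2 polynomial map in two dimensions as one example of φ. The function names and the choice of kernel are illustrative assumptions, not taken from the article.

```python
import math

def decision(w, b, x):
    """Signed value of the hyper-plane function w.x + b; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Width of the margin, 2/||w||, under the canonical scaling |w.x + b| = 1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def phi(x):
    """Explicit feature map of the degree-2 homogeneous polynomial kernel in 2-D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly2_kernel(x, y):
    """Kernel value (x.y)^2, equal to phi(x).phi(y) without computing phi."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

# Points inside and outside the unit circle are not linearly separable in 2-D,
# but in phi-space the plane w = (1, 0, 1), b = -1 separates them,
# because w.phi(x) + b = x1^2 + x2^2 - 1.
inside = decision((1.0, 0.0, 1.0), -1.0, phi((0.5, 0.0)))   # negative
outside = decision((1.0, 0.0, 1.0), -1.0, phi((2.0, 0.0)))  # positive
```

The kernel function lets the SVM work with inner products in the higher-dimensional space without ever constructing φ(x) explicitly.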
Performance of machine learning methods for predicting functional class of proteins as reported in the literature. All of the data and results were collected from the original papers. Please refer to the respective references for complete results. N+, N– and N are the number of class members, non-members and all proteins (members + non-members) respectively, P+ and P– are prediction accuracy for class members and non-members respectively, P is the overall accuracy, and MCC is the Matthews correlation coefficient.
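| Protein class | Sub-class | Descriptors | N (N+/N–) | Validation method | P+ (%) | P– (%) | P (%) | MCC | Reference |
|---|---|---|---|---|---|---|---|---|---|
| Enzymes | 46 sub-classes: | Physicochemical properties | 956~9216 (35~3892/807~5324) | Independent evaluation | 53.0~99.3 | 85.0~99.7 | 81.8~99.7 | 0.31~0.98 | ( |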
| 54 sub-classes: | Functional Domain Composition and pseudo amino acid composition | 503~3582 (3~2002/327~3548) | Jackknife Test | 25.0~ 100.0 | ( | ||||
| Transporters | 20 sub-classes: | Physicochemical properties | 613~7508 (50~1220/513~7299) | Independent evaluation | 60.6~ 97.1 | 91.5~ 99.9 | 91.4~ 99.7 | 0.27~ 0.97 | ( |
| Allergenic proteins | Amino acid | 1278 (578/700) | Independent evaluation | 88.9 | 81.9 | 85.0 | 0.71 | ( | |
| Dipeptide composition | 1278 (578/700) | Independent evaluation | 82.8 | 85.0 | 84.0 | 0.68 | |||
| Physicochemical properties | 23474 (1005/22469) | Independent evaluation | 93.0 | 99.9 | 99.7 | 0.96 | ( | ||
| Crystallizable proteins | Mono-, di-, tri-peptide composition, physicochemical and structural properties | 923 (721/202) | 10-fold CV | 65.0 | 69.0 | 67.0 | ( | ||
| Mitochondrial proteins | Amino acid composition | 10372 (1432/8940) | 5-fold CV | 78.9 | 90.0 | 88.2 | 0.62 | ( | |
| G-protein coupled receptors | All GPCRs | Physicochemical properties | 2247 (927/1320) | Independent evaluation | 95.6 | 98.1 | 97.4 | 0.93 | ( |
| Dipeptide composition | 3302 (778/2524) | 5-fold CV | 98.6 | 99.8 | 99.5 | 0.99 | ( | ||
| Protein power spectrum | 946 | Jackknife | 96.1 | ( | |||||
| Gi/o binding type | Structural characteristics | 132 (61/71) | 4-fold CV | 77.0 | 78.3 | ( | |||
| Gq/11 binding type | (extra cellular loops, intracellular loops etc) | 132 (47/85) | 4-fold CV | 68.1 | 72.7 | ||||
| Gs binding type | 132 (24/108) | 4-fold CV | 83.3 | 95.2 | |||||
| Rhodopsin-like (Class A) | Protein power spectrum | 540 | Jackknife | 97.0 | 0.93 | ( | |||
| Secretin-like (Class B) | 187 | Jackknife | 96.3 | 0.94 | |||||
| Metabotropic glutamate (Class C) | 103 | Jackknife | 94.2 | 0.95 | |||||
| Fungal pheromone (Class D) | 21 | Jackknife | 81.0 | 0.92 | |||||
| cAMP receptors (Class E) | 5 | Jackknife | 100.0 | 1 | |||||
| Frizzled/smoothened (Class F) | 90 | Jackknife | 95.6 | 0.94 | |||||
| Nuclear receptors | All nuclear receptors | Amino acid composition | 282 | 5-fold CV | 82.6 | 0.74 | (Bhasin and Raghava, | ||
| Dipeptide composition | 282 | 5-fold CV | 97.5 | 0.96 | 2004a) | ||||
| Physicochemical properties | 872 (334/538) | Independent evaluation | 89.5 | 97.6 | ( | ||||
| Protein power spectrum | 465 | Jackknife | 95.3 | ( | |||||
| Thyroid hormone-like | Protein power spectrum | 165 | Jackknife | 95.8 | 0.95 | ( | |||
| HNF4-like | 114 | Jackknife | 97.4 | 0.96 | |||||
| Estrogen-like | 130 | Jackknife | 97.7 | 0.96 | |||||
| Fushitarazu-F1 like | 35 | Jackknife | 94.3 | 0.97 | |||||
| Nerve growth factor IB-like | 5 | Jackknife | 80.0 | 0.89 | |||||
| Germ cell nuclear receptor | 2 | Jackknife | 100.0 | 1.0 | |||||
| 0A Knirps-like | 7 | Jackknife | 42.9 | 0.65 | |||||
| 0B DAX-like | 7 | Jackknife | 71.4 | 0.84 | |||||
| RNA-binding proteins | All RNA-binding proteins | Amino acid composition and limited range correlation of hydrophobicity and solvent accessible surface area | 6264 (1496/4768) | 10-fold CV | 76.5 | 97.2 | 92.2 | ( | |
| Physicochemical properties | 5126 (2161/2965) | Independent evaluation | 97.8 | 96.0 | 96.1 | 0.8 | ( | ||
| rRNA-binding | Amino acid composition, limited range correlation of hydrophobicity, solvent accessible surface area | 5824 (1056/4768) | 10-fold CV | 100.0 | 99.9 | 99.9 | ( | ||
| Physicochemical properties | 1680 (708/972) | Independent evaluation | 94.1 | 98.7 | 98.6 | 0.74 | ( | ||
| tRNA-binding | Physicochemical properties | 886 (94/792) | Independent evaluation | 94.1 | 99.9 | 99.8 | 0.92 | ( | |
| mRNA-binding | 2383 (277/2106) | 79.3 | 96.5 | 96.0 | 0.53 | ||||
| snRNA-binding | 2021 (33/1988) | 45.0 | 99.7 | 99.5 | 0.38 | ||||
| DNA-binding proteins | All DNA-binding proteins | Amino acid composition, limited range correlation of hydrophobicity, solvent accessible surface area | 12507 (7739/4768) | 10-fold CV | 92.8 | 77.1 | 86.8 | ( | |
| Surface and overall composition, overall charge and positive potential patches on the protein surface | 359 (121/238) | 5-fold CV | 89.1 | 82.1 | 93.9 | ( | |||
| Jackknife | 90.5 | 81.8 | 94.9 | ||||||
| leave 1-pair holdout | 86.3 | 80.6 | 87.5 | ||||||
| Leave-half holdout | 83.3 | 82.5 | 83.5 | ||||||
| Physicochemical properties | 8575 (4240/4335) | Independent evaluation | 90.9 | 87.6 | 88.5 | 0.74 | ( | ||
| DNA condensation | Physicochemical properties | 2410 (50/2360) | Independent evaluation | 94.9 | 98.3 | 98.3 | 0.47 | ( | |
| DNA integration | 1307 (134/1173) | 87.9 | 99.9 | 99.7 | 0.91 | ||||
| DNA recombination | 3357 (889/2468) | 87.8 | 98.9 | 97.9 | 0.87 | ||||
| DNA repair | 5785 (2142/3643) | 88.7 | 96.8 | 95.3 | 0.84 | ||||
| DNA replication | 3734 (1131/2603) | 85.6 | 96.6 | 95.4 | 0.79 | ||||
| DNA-directed | 2348 (273/2075) | 72.9 | 99.7 | 98.9 | 0.79 | ||||
| DNA polymerase | |||||||||
| DNA-directed | 2594 (484/2110) | 90.8 | 99.4 | 98.8 | 0.91 | ||||
| RNA polymerase | |||||||||
| Repressor | 3684 (1337/2347) | 93.3 | 95.6 | 95.4 | 0.76 | ||||
| Transcription factors | 2354 (670/1684) | 86.1 | 99.5 | 99.3 | 0.79 | ||||
| Lipid-binding proteins | All lipid-binding proteins | Physicochemical properties | 6933 (3232/3701) | Independent evaluation | 89.9 | 97 | 94.1 | 0.88 | ( |
| Lipid transport | 2262 (153/2109) | 79.5 | 99.8 | 99.6 | 0.8 | ||||
| Lipid metabolism | 2262 (293/1969) | 79.5 | 99.2 | 98.8 | 0.72 | ||||
| Lipid synthesis | 3498 (891/2607) | 82.2 | 99.6 | 98.1 | 0.87 | ||||
| Lipid degradation | 2178 (403/1775) | 78.9 | 99.9 | 99.3 | 0.87 | ||||
| Transmembrane proteins | Functional Domain Composition | 2059 | jackknife test | 86.3 | ( | ||||
| independent test | 67.5 | ||||||||
| self-consistency | 93.9 | ||||||||
| Pseudo-amino acid composition | 2059 | jackknife test | 82.4 | ( | |||||
| independent test | 90.3 | ||||||||
| self-consistency | 99.9 | ||||||||
| Physicochemical properties | 4668 (2105/2563) | Independent evaluation | 90.1 | 86.7 | 86.7 | 0.75 | ( | ||
| Cytokines | All cytokines | Dipeptide composition | 1110 (437/673) | 7-fold CV | 92.5 | 97.2 | 95.3 | 0.9 | ( |
| FGF/HBGF | 437 (83/354) | 92.7 | 98.6 | 97.5 | 0.92 | ||||
| TGF-β | 437 (190/247) | 97.4 | 94.7 | 95.8 | 0.92 | ||||
| TNF | 437 (96/341) | 94.0 | 98.8 | 97.7 | 0.94 | ||||
| Joint class (IL-6, LIF//OSM, MDK/PTN, NGF) | 437 (68/369) | 91.0 | 99.7 | 98.4 | 0.94 | ||||
| 6 sub-classes: | N.A | 46.7~ 100 | 85.5~ 100 | 84~ 98 | 0.65~ 0.96 | ||||
| Functional classes in yeast | All proteins 13 classes: | Functional domain composition | 4902 | Jackknife | 72.0 | ( | |||
| 86~725 | Jackknife | 15~90 | |||||||
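Many of the results in the table above were obtained by k-fold cross-validation (CV) or a jackknife test, which in this literature means leaving out one sample at a time. The sketch below shows how the corresponding train/test index splits can be generated; the function names are illustrative, and in practice the samples would normally be shuffled before splitting.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # spread the remainder
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def jackknife_splits(n_samples):
    """Leave-one-out: each sample is the test set exactly once."""
    return k_fold_splits(n_samples, n_samples)
```

An independent evaluation, by contrast, uses a separate held-out set that plays no part in training or parameter selection.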
Performance of support vector machine prediction of functional classes of peptides. N+ and N– are the number of members and non-members in a class, P+ and P– are the reported prediction accuracy for members and non-member respectively, and P is the reported overall accuracy.
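| Peptide class | Descriptors | N (N+/N–) | Validation method | P+ (%) | P– (%) | P (%) | Reference |
|---|---|---|---|---|---|---|---|
| A0201 | Orthogonal factors from physical properties | (36/167) | 10-fold cross validation | 76.3 | 71.2 | 71.6 | ( |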
| 55.0 | 87.4 | 81.7 | |||||
| 46.3 | 89.8 | 86.7 | |||||
| Amino acid sequence | 113 | 10-fold cross validation | 90.0 | 78.0 (Mc) | ( | ||
| physico-chemical properties | (1125/6911) | Validation set (130/6664) | 99.2 | 97.5 | 97.5 | ||
| A1 | Amino acid sequence | 28 | 10-fold cross validation | 98.0 | 96.0 (Mc) | ( | |
| physico-chemical properties | (200/6831) | Validation set (40/6830) | 75.0 | 99.7 | 99.6 | ||
| A3 | Amino acid sequence | 73 | 10-fold cross validation | 91.0 | 80.0 (Mc) | ( | |
| physico-chemical properties | (139/6833) | Validation set (30/6833) | 93.3 | 98.8 | 98.7 | ||
| B8 | Amino acid sequence | 25 | 10-fold cross validation | 91.0 | 79.0 (Mc) | ( | |
| physico-chemical properties | (168/6833) | Validation set (20/6830) | 95.0 | 99.8 | 99.8 | ||
| B2705 | Amino acid sequence | 29 | 10-fold cross validation | 100.0 | 100.0 (Mc) | ( | |
| physico-chemical properties | (141/7361) | Validation set (21/7359) | 95.0 | 99.9 | 99.9 | ||
| DRB1.0401 | Binary code of amino acid sequence | 567 | 5-fold cross validation | 80.2, 87.1 | 77.4, 85.0 | 78.8, 86.1 | ( |
| physico-chemical properties | (539/6883) | Validation set (100/6704) | 95.0 | 99.9 | 99.9 | ||
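Several peptide predictors in the table above take the amino acid sequence itself, or a binary code of it, as input. One common way to build such a code is to represent each residue by a 20-element indicator vector and concatenate the vectors along the peptide. The sketch below shows that encoding; whether the cited studies used exactly this scheme is an assumption.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def binary_encode(peptide):
    """Concatenate one 20-bit indicator vector per residue (length 20 * len(peptide))."""
    vector = []
    for residue in peptide.upper():
        position = AMINO_ACIDS.find(residue)
        if position < 0:
            raise ValueError(f"non-standard residue: {residue!r}")
        bits = [0] * len(AMINO_ACIDS)
        bits[position] = 1
        vector.extend(bits)
    return vector
```

Under this scheme a 9-residue peptide becomes a 180-dimensional vector with exactly nine non-zero entries, so fixed-length peptides map directly to fixed-length SVM inputs.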
Performance of support vector machine prediction of functional classes of novel proteins.
| Enzymes without a homolog in NR databases 2004 ( | 12 | 66.7% | Thiocyanate hydrolase beta subunit (EC 3.5.5.8) [O66186] | Extracellular phospholipase (EC 3.1.1.5) [P82476] |
| Potential cysteine protease avirulence protein avrPpiC2 (EC 3.4.22.-) [Q9F3T4] | Alginate lyase precursor (EC 4.2.2.3) [P39049] | |||
| Extracellular phospholipase (EC 3.1.1.5) [P82476] | ||||
| Enzymes without a homolog in Swissprot database 2004 ( | 50 | 72% | DNA polymerase III, theta subunit (EC 2.7.7.7) [P28689] | Beta-agarase B (EC 3.2.1.81) [P488401] |
| Telomere elongation protein (EC 2.7.7.-) [P17214] | ||||
| Ammonia monooxygenase (EC 1.13.12.-) [Q04508] | ||||
| Viral proteins without a homolog in Swissprot database 2004 ( | 25 | 72% | Endonuclease II [P07059]; Outer capsid protein VP4 [P35746] | TRL10 (Structural envelope glycoprotein) [AAL27474] |
| Protein kinase [P00513] | BARF0 protein [Q8AZJ4] | |||
| Bacterial proteins without a homolog in Swissprot database 2004 ( | 90 | 76.7% | 2-aminomuconate deaminase [P81593] | Alginate lyase [Q59478] |
| Aminopeptidase G [Q54340] | Alpha-N-AFase II [P82594] | |||
| Plant proteins without a homolog in Swissprot database ( | 31 | 71.4% | Antimicrobial peptide 4 [AAL05055] | LeMan3 [Q9FUQ6] |
| Sucrose phosphatase [Q84ZX9] | ||||
| Pairs of homologous enzymes of different families 2004 ( | 8 | 62% | Glycolate oxidase [P05414] and IPP isomerase [Q84W37]; Creatine amidinohydrolase [P38488] and Proline dipeptidase [O58885] | Cystathionine gamma-synthase [P38675] and Methionine gamma-lyase [P13254] |
| Exocellobiohydrolase 1 [P38676] and Cystathionine gamma-lyase [Q8VCN5] | ||||
| Remote homologs ( | 445 | 46.3% | 1cem (1,4-D-glucan-glucanohydrolase catalytic domain) and its remote homolog 1qazA (Alginate lyase A1–III from Sphingomonas species; Chain: A) |
Dataset statistics and prediction performance of SVM prediction of six protein functional classes by using different descriptor sets
| Protein functional family | Descriptor class | Training set P | Training set N | Testing set TP | Testing set FN | Testing set TN | Testing set FP | Independent evaluation set TP | Independent evaluation set FN | Sen (%) | Independent evaluation set TN | Independent evaluation set FP | Spec (%) | Q (%) | MCC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EC2.4 | 1 | 1249 | 2120 | 1154 | 1 | 9065 | 12 | 724 | 176 | 80.4 | 5064 | 4 | 99.9 | 97.0 | 0.879 |
| | 2 | 1319 | 2120 | 1080 | 5 | 8806 | 1 | 646 | 154 | 82.9 | 5067 | 1 | 100.0 | 97.4 | 0.884 |
| | 3 | 1105 | 1756 | 1295 | 4 | 9166 | 5 | 768 | 132 | 85.3 | 5066 | 2 | 100.0 | 97.8 | 0.911 |
| | 4 | 1239 | 2221 | 1161 | 4 | 8701 | 5 | 756 | 144 | 84.0 | 5067 | 1 | 100.0 | 97.6 | 0.903 |
| | 5 | 1242 | 2223 | 1160 | 2 | 8690 | 14 | 753 | 147 | 83.7 | 5065 | 3 | 99.9 | 97.5 | 0.900 |
| | 6 | 1214 | 2077 | 1145 | 45 | 8846 | 4 | 741 | 159 | 82.3 | 5067 | 1 | 100.0 | 97.3 | 0.893 |
| | 7 | 1293 | 2624 | 1072 | 39 | 8295 | 8 | 696 | 204 | 77.3 | 5065 | 3 | 99.9 | 96.5 | 0.860 |
| | 8 | 1275 | 2747 | 1129 | 0 | 8177 | 3 | 782 | 118 | 86.9 | 5065 | 3 | 99.9 | 98.0 | 0.921 |
| | 9 | 1358 | 3887 | 1015 | 31 | 7040 | 0 | 796 | 104 | 88.4 | 5067 | 1 | 100.0 | 98.2 | 0.930 |
| GPCR | 1 | 1590 | 7458 | 1847 | 1 | 14166 | 3 | 501 | 12 | 97.7 | 6776 | 62 | 99.1 | 99.0 | 0.927 |
| | 2 | 564 | 711 | 1728 | 3 | 14121 | 5 | 498 | 15 | 97.1 | 6800 | 38 | 99.4 | 99.3 | 0.946 |
| | 3 | 1169 | 4628 | 1122 | 4 | 10208 | 1 | 491 | 22 | 95.7 | 6800 | 38 | 99.4 | 99.2 | 0.938 |
| | 4 | 1257 | 4474 | 1037 | 1 | 10363 | 0 | 492 | 21 | 95.9 | 6790 | 48 | 99.3 | 99.1 | 0.930 |
| | 5 | 1290 | 4724 | 997 | 8 | 10113 | 0 | 487 | 26 | 94.9 | 6795 | 43 | 99.4 | 99.1 | 0.929 |
| | 6 | 757 | 2060 | 1536 | 2 | 12777 | 0 | 494 | 19 | 96.3 | 6813 | 25 | 99.6 | 99.4 | 0.951 |
| | 7 | 812 | 2950 | 1482 | 1 | 11887 | 0 | 487 | 26 | 94.9 | 6746 | 92 | 98.7 | 98.4 | 0.885 |
| | 8 | 1590 | 7458 | 693 | 12 | 7322 | 57 | 503 | 10 | 98.1 | 6780 | 58 | 99.2 | 99.1 | 0.933 |
| | 9 | 834 | 4361 | 1461 | 0 | 10476 | 0 | 493 | 20 | 96.1 | 6819 | 19 | 99.7 | 99.5 | 0.959 |
| TC8.A | 1 | 98 | 8014 | 9 | 0 | 13105 | 0 | 17 | 46 | 27.0 | 7962 | 0 | 100.0 | 99.4 | 0.518 |
| | 2 | 94 | 7962 | 50 | 0 | 14824 | 0 | 41 | 22 | 65.1 | 7962 | 0 | 100.0 | 99.7 | 0.806 |
| | 3 | 94 | 7962 | 53 | 0 | 14501 | 0 | 42 | 21 | 66.7 | 7962 | 0 | 100.0 | 99.7 | 0.815 |
| | 4 | 94 | 7962 | 47 | 0 | 11250 | 0 | 37 | 26 | 58.7 | 7962 | 0 | 100.0 | 99.7 | 0.765 |
| | 5 | 94 | 7962 | 47 | 0 | 11137 | 0 | 37 | 26 | 58.7 | 7962 | 0 | 100.0 | 99.7 | 0.765 |
| | 6 | 94 | 7962 | 64 | 0 | 15283 | 0 | 44 | 19 | 69.8 | 7962 | 0 | 100.0 | 99.8 | 0.835 |
| | 7 | 94 | 7962 | 59 | 0 | 15045 | 0 | 43 | 20 | 68.3 | 7962 | 0 | 100.0 | 99.8 | 0.825 |
| | 8 | 114 | 810 | 52 | 0 | 15114 | 0 | 41 | 22 | 65.1 | 7962 | 0 | 100.0 | 99.7 | 0.806 |
| | 9 | 103 | 1077 | 63 | 0 | 14847 | 0 | 47 | 16 | 74.6 | 7962 | 0 | 100.0 | 99.8 | 0.863 |
| Chlorophyll | 1 | 523 | 1559 | 166 | 0 | 14297 | 0 | 70 | 12 | 85.4 | 6830 | 16 | 99.8 | 99.6 | 0.83 |
| | 2 | 440 | 934 | 248 | 1 | 7927 | 1 | 73 | 9 | 89.0 | 6841 | 5 | 99.9 | 99.8 | 0.91 |
| | 3 | 425 | 603 | 264 | 0 | 15253 | 0 | 77 | 5 | 93.9 | 6841 | 5 | 99.9 | 99.9 | 0.94 |
| | 4 | 415 | 574 | 273 | 1 | 15282 | 0 | 75 | 7 | 91.5 | 6842 | 4 | 99.9 | 99.8 | 0.93 |
| | 5 | 429 | 615 | 259 | 1 | 15240 | 1 | 75 | 7 | 91.5 | 6843 | 3 | 100.0 | 99.9 | 0.94 |
| | 6 | 482 | 946 | 202 | 5 | 14910 | 0 | 72 | 10 | 87.8 | 6844 | 2 | 100.0 | 99.8 | 0.92 |
| | 7 | 394 | 3337 | 210 | 85 | 12517 | 2 | 62 | 20 | 75.6 | 6834 | 12 | 99.8 | 99.5 | 0.79 |
| | 8 | 399 | 1273 | 289 | 1 | 14582 | 1 | 77 | 5 | 93.9 | 6832 | 14 | 99.8 | 99.7 | 0.89 |
| | 9 | 458 | 477 | 231 | 0 | 15379 | 0 | 76 | 6 | 92.7 | 6842 | 4 | 99.9 | 99.9 | 0.93 |
| Lipid synthesis | 1 | 849 | 2026 | 705 | 3 | 8229 | 7 | 476 | 159 | 75.0 | 5882 | 4 | 99.9 | 97.5 | 0.850 |
| | 2 | 927 | 2037 | 629 | 1 | 8225 | 0 | 507 | 128 | 79.8 | 5886 | 0 | 100.0 | 98.0 | 0.884 |
| | 3 | 898 | 2968 | 659 | 0 | 7294 | 0 | 509 | 126 | 80.2 | 5886 | 0 | 100.0 | 98.1 | 0.886 |
| | 4 | 968 | 3227 | 588 | 1 | 7035 | 0 | 493 | 142 | 77.6 | 5886 | 0 | 100.0 | 97.8 | 0.871 |
| | 5 | 970 | 3280 | 586 | 1 | 6982 | 0 | 491 | 144 | 77.3 | 5886 | 0 | 100.0 | 97.8 | 0.869 |
| | 6 | 874 | 2112 | 681 | 2 | 8149 | 1 | 525 | 110 | 82.7 | 5884 | 2 | 100.0 | 98.3 | 0.899 |
| | 7 | 863 | 2415 | 692 | 2 | 7845 | 2 | 512 | 123 | 80.6 | 5883 | 3 | 100.0 | 98.1 | 0.886 |
| | 8 | 815 | 1613 | 740 | 2 | 8638 | 11 | 525 | 110 | 80.7 | 5879 | 7 | 99.9 | 98.2 | 0.961 |
| | 9 | 800 | 3492 | 757 | 0 | 6770 | 0 | 541 | 94 | 85.2 | 5886 | 0 | 100.0 | 98.6 | 0.916 |
| rRNA binding | 1 | 548 | 579 | 3390 | 6 | 9598 | 22 | 1821 | 90 | 95.3 | 4662 | 6 | 99.9 | 98.5 | 0.964 |
| | 2 | 1133 | 1225 | 2811 | 0 | 8974 | 0 | 1827 | 84 | 95.6 | 4668 | 0 | 100.0 | 98.7 | 0.969 |
| | 3 | 1126 | 1638 | 2816 | 2 | 8560 | 1 | 1811 | 100 | 94.8 | 4668 | 0 | 100.0 | 98.5 | 0.963 |
| | 4 | 1337 | 1958 | 2697 | 0 | 8241 | 0 | 1783 | 128 | 93.3 | 4668 | 0 | 100.0 | 98.1 | 0.953 |
| | 5 | 1372 | 1976 | 2572 | 0 | 8223 | 0 | 1784 | 127 | 93.4 | 4668 | 0 | 100.0 | 98.1 | 0.953 |
| | 6 | 921 | 1208 | 2971 | 52 | 8991 | 0 | 1824 | 87 | 95.5 | 4668 | 0 | 100.0 | 98.7 | 0.968 |
| | 7 | 878 | 2743 | 3040 | 26 | 7442 | 14 | 1808 | 103 | 97.9 | 4634 | 34 | 99.3 | 97.9 | 0.951 |
| | 8 | 810 | 972 | 3075 | 3 | 9182 | 2 | 1848 | 63 | 96.7 | 4668 | 0 | 100.0 | 99.0 | 0.977 |
| | 9 | 1103 | 3175 | 2815 | 26 | 7024 | 0 | 1805 | 106 | 94.5 | 4668 | 0 | 100.0 | 98.4 | 0.961 |
MCC-based performance scores of SVM prediction of different protein functional classes by using different descriptor classes.
| EC2.4 | 9, 8, 3, 4, 5 | 6, 2, 1, 7 | ||
| GPCR | 9, 6, 2, 3, 8, 4, 5, 1 | 7 | ||
| TC8.A | 9, 6, 7, 3, 2, 8 | 4, 5 | 1 | |
| Chlorophyll | 3, 5, 4, 9, 6, 2 | 8, 1 | 7 | |
| Lipid synthesis | 8, 9 | 6, 7, 3, 2, 4, 5, 1 | ||
| rRNA binding | 8, 2, 6, 1, 3, 9, 5, 4, 7 |
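Within each row of the table above, the descriptor classes appear in descending order of MCC. As a minimal check, the sketch below sorts the nine descriptor classes for EC2.4 by the MCC values reported in the dataset statistics table; the helper name `rank_by_mcc` is illustrative, and how the original authors assigned classes to the separate columns is not reproduced here.

```python
# MCC per descriptor class for EC2.4, copied from the dataset statistics table.
ec24_mcc = {1: 0.879, 2: 0.884, 3: 0.911, 4: 0.903, 5: 0.900,
            6: 0.893, 7: 0.860, 8: 0.921, 9: 0.930}

def rank_by_mcc(mcc_by_class):
    """Return descriptor class labels sorted from highest to lowest MCC."""
    return sorted(mcc_by_class, key=lambda c: mcc_by_class[c], reverse=True)

ranking = rank_by_mcc(ec24_mcc)
# ranking == [9, 8, 3, 4, 5, 6, 2, 1, 7], the order listed in the EC2.4 row
```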
Protein descriptors important for characterizing DNA-binding proteins, as selected by recursive feature elimination, a feature selection method.
| No. | Feature | Descriptor |
|---|---|---|
| 1 | F168 | Solvent accessibility Composition Group 1 |
| 2 | F166 | Secondary structure Group 3 3/4th Distribution |
| 3 | F147 | Secondary structure Composition Group 1 |
| 4 | F75 | Polarity Group 2 1/4th First Distribution |
| 5 | F43 | Normalized Van der Waals volume Composition Group 2 |
| 6 | F155 | Secondary structure Group 1 2/4th Distribution |
| 7 | F91 | Polarizability Group 1 1/4th First Distribution |
| 8 | F143 | Surface tension Group 3 1/4th First Distribution |
| 9 | F171 | Solvent accessibility Transition Group 1 |
| 10 | F126 | Surface tension Composition Group 1 |
| 11 | F87 | Polarizability Transition Group 1 |
| 12 | F145 | Surface tension Group 3 3/4th Distribution |
| 13 | F15 | Composition of R |
| 14 | F6 | Composition of G |
| 15 | F177 | Solvent accessibility Group 1 3/4th Distribution |
| 16 | F154 | Secondary structure Group 1 1/4th First Distribution |
| 17 | F89 | Polarizability Transition Group 3 |
| 18 | F133 | Surface tension Group 1 1/4th First Distribution |
| 19 | F42 | Normalized Van der Waals volume Composition Group 1 |
| 20 | F85 | Polarizability Composition Group 2 |
| 21 | F175 | Solvent accessibility Group 1 1/4th First Distribution |
| 22 | F130 | Surface tension Transition Group 2 |
| 23 | F127 | Surface tension Composition Group 2 |
| 24 | F151 | Secondary structure Transition Group 2 |
| 25 | F98 | Polarizability Group 2 3/4th Distribution |
| 26 | F8 | Composition of I |
| 27 | F67 | Polarity Transition Group 2 |
| 28 | F148 | Secondary structure Composition Group 2 |
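Recursive feature elimination, the method behind the list above, repeatedly trains a linear classifier, scores each feature by its squared weight, and discards the lowest-scoring one. The sketch below illustrates the loop on a toy dataset. The small subgradient-descent trainer stands in for a real SVM package, and the data, function names and parameter values are all assumptions made for illustration rather than the settings used in the reviewed work.

```python
def train_linear_svm(X, y, lr=0.1, lam=0.01, epochs=200):
    """Full-batch subgradient descent on the L2-regularised hinge loss (no bias term)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [lam * wj for wj in w]
        for xi, yi in zip(X, y):
            if yi * sum(wj * xj for wj, xj in zip(w, xi)) < 1:  # margin violated
                for j in range(d):
                    grad[j] -= yi * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def recursive_feature_elimination(X, y, n_keep=1):
    """Return (eliminated, kept): indices in elimination order and the survivors."""
    active = list(range(len(X[0])))
    eliminated = []
    while len(active) > n_keep:
        reduced = [[row[j] for j in active] for row in X]
        w = train_linear_svm(reduced, y)
        worst = min(range(len(active)), key=lambda k: w[k] ** 2)  # smallest w^2
        eliminated.append(active.pop(worst))
    return eliminated, active

# Toy data: only feature 0 tracks the class label; features 1 and 2 are noise.
X = [[2.0, 0.1, -0.3], [2.0, -0.2, 0.2], [-2.0, 0.1, 0.3], [-2.0, -0.2, -0.2]]
y = [1, 1, -1, -1]
eliminated, kept = recursive_feature_elimination(X, y, n_keep=1)
```

Reading the elimination order backwards gives a ranking of the features, with the last survivors being the most informative for the classifier.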