Bum Ju Lee, Moon Sun Shin, Young Joon Oh, Hae Seok Oh, Keun Ho Ryu.
Abstract
BACKGROUND: Predicting the function of an unknown protein is an essential goal in bioinformatics. Sequence similarity-based approaches are widely used for function prediction; however, they are often inadequate in the absence of similar sequences or when the sequence similarity among known protein sequences is statistically weak. This study aimed to develop an accurate prediction method for identifying protein function, irrespective of sequence and structural similarities.
Year: 2009 PMID: 19664241 PMCID: PMC2731080 DOI: 10.1186/1477-5956-7-27
Source DB: PubMed Journal: Proteome Sci ISSN: 1477-5956 Impact factor: 2.480
Negative samples for the fatty acid metabolism protein class
| Protein class | Number of proteins |
| Transport | 637 |
| Transcription | 538 |
| Gluconate utilisation | 60 |
| Amino acid biosynthesis | 393 |
| DNA-binding | 486 |
| Acetylcholine receptor inhibitor | 103 |
| G-protein coupled receptor | 220 |
| Guanine nucleotide-releasing factor | 370 |
| Fibre protein | 47 |
| Transmembrane | 351 |
| Oxidoreductases | 58 |
| Hydrolases | 75 |
| Isomerases | 50 |
| Lyases | 72 |
| Other proteins | 354 |
Features used for protein function classification
| No. | Feature | Description | Dimension |
| 1 | Number of amino acids | Number of residues in each protein | 1 |
| 2 | Molecular weight | Molecular weight of the protein | 1 |
| 3 | Theoretical pI | The pH at which the net charge of the protein is zero (isoelectric point) | 1 |
| 4 | Amino acid composition | Percentage of each amino acid in the protein | 20 |
| 5 | Positively charged residue_2 | Percentage of positively charged residues in the protein (lysine and arginine) | 1 |
| 6 | Positively charged residue_3 | Percentage of positively charged residues in the protein (histidine, lysine, and arginine) | 1 |
| 7 | Number of atoms | Total number of atoms | 1 |
| 8 | Carbon | Total number of carbon atoms in the protein sequence | 1 |
| 9 | Hydrogen | Total number of hydrogen atoms in the protein sequence | 1 |
| 10 | Nitrogen | Total number of nitrogen atoms in the protein sequence | 1 |
| 11 | Oxygen | Total number of oxygen atoms in the protein sequence | 1 |
| 12 | Sulphur | Total number of sulphur atoms in the protein sequence | 1 |
| 13 | Extinction coefficient_All | Amount of light a protein absorbs at a certain wavelength (assuming ALL Cys residues appear as half cysteines) | 1 |
| 14 | Extinction coefficient_No | Amount of light a protein absorbs at a certain wavelength (assuming NO Cys residues appear as half cysteines) | 1 |
| 15 | Instability index | The stability of the protein | 1 |
| 16 | Aliphatic index | The relative volume of the protein occupied by aliphatic side chains | 1 |
| 17 | GRAVY | Grand average of hydropathicity | 1 |
| 18 | Percentage of continuous changes from positively charged residues to positively charged residues | | 1 |
| 19 | Percentage of continuous changes from negatively charged residues to negatively charged residues | | 1 |
| 20 | Percentage of continuous changes from positively charged residues to negatively charged residues or from negatively charged residues to positively charged residues | | 1 |
| 21 | Percentage of | | 10 |
| 22 | Percentage of | | 10 |
| 23 | Percentage of | | 10 |
| 24 | Charged | Physicochemical property | 1 |
| 25 | Negatively charged residues | Percentage of negatively charged residues in the protein | 1 |
| 26 | Polar | Physicochemical property | 1 |
| 27 | Aliphatic | Physicochemical property | 1 |
| 28 | Aromatic | Physicochemical property | 1 |
| 29 | Small | Physicochemical property | 1 |
| 30 | Tiny | Physicochemical property | 1 |
| 31 | Bulky | Physicochemical property | 1 |
| 32 | Hydrophobic | Physicochemical property | 1 |
| 33 | Hydrophobic and aromatic | Physicochemical properties | 1 |
| 34 | Neutral, weakly and hydrophobic | Physicochemical properties | 1 |
| 35 | Hydrophilic and acidic | Physicochemical properties | 1 |
| 36 | Hydrophilic and basic | Physicochemical properties | 1 |
| 37 | Acidic | Physicochemical property | 1 |
| 38 | Polar and uncharged | Physicochemical properties | 1 |
| 39 | Amino acid pair ratio | Percentage compositions for each of the 400 possible amino acid dipeptides | 400 |
| Total | 484 |
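Several of the composition features in this table can be computed directly from the sequence. The following is a minimal sketch, not the authors' implementation: the function names and the toy sequence are hypothetical, and normalising dipeptide counts by the number of adjacent pairs (sequence length minus one) is an assumption, since the table only says "percentage compositions".

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Feature 4: percentage of each amino acid (20 values)."""
    counts = Counter(seq)
    return {aa: 100.0 * counts[aa] / len(seq) for aa in AMINO_ACIDS}


def positive_charge_percentages(seq):
    """Features 5 and 6: percentage of K+R residues and of H+K+R residues."""
    n = len(seq)
    kr = seq.count("K") + seq.count("R")
    hkr = kr + seq.count("H")
    return 100.0 * kr / n, 100.0 * hkr / n


def dipeptide_composition(seq):
    """Feature 39: percentage of each of the 400 possible adjacent residue pairs.

    Assumption: the denominator is the number of adjacent pairs, len(seq) - 1.
    """
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = max(len(seq) - 1, 1)
    return {a + b: 100.0 * pairs[a + b] / total
            for a, b in product(AMINO_ACIDS, repeat=2)}


seq = "MKRHACDK"  # hypothetical toy sequence, not taken from the dataset
comp = amino_acid_composition(seq)
pos2, pos3 = positive_charge_percentages(seq)
dipep = dipeptide_composition(seq)
```

The dictionary sizes (20 and 400) match the dimensions listed for features 4 and 39.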
Features selected by CFS for each protein class
| Protein class | Selected features |
| Transport | R, G, H, I, M, positively charged residue_3, carbon, CC, CD, CE, CH, CK, CN, CQ, CW, CY, FM, GW, HC, HR, IC, IG, LF, LG, LM, MF, MM, MQ, PC, QC, SC, TC, WD, YH, polar, hydrophobic, hydrophobic and aromatic, hydrophilic and basic |
| Transcription | D, C, Q, F, V, positively charged residue_3, sulphur, extinction coefficient_all, instability index, aliphatic index, GRAVY, |
| Translation | NumOfAAs, D, L, hydrogen, GRAVY, |
| Gluconate utilisation | Positively charged residue_3, instability index, aliphatic index, |
| Amino acid biosynthesis | NumOfAAs, theoretical pI, D, C, G, S, sulphur, instability index, aliphatic index, GRAVY, |
| Fatty acid metabolism | NumOfAAs, R, D, C, Q, E, G, I, F, S, negatively charged residue, positively charged residue_3, instability index, aliphatic index, GRAVY, |
| Acetylcholine receptor inhibitor | Molecular weight, C, M, |
| G-protein coupled receptor | Theoretical pI, D, C, Q, E, G, K, F, S, T, negatively charged residue, positively charged residue_3, sulphur, |
| Guanine nucleotide-releasing factor | A, Q, H, I, V, positively charged residue_2, positively charged residue_3, oxygen, instability index, aliphatic index, GRAVY, |
| Fibre protein | G, M, T, positively charged residue_2, |
| Transmembrane | Theoretical pI, D, C, L, S, W, negatively charged residue, extinction coefficient_all, instability index, GRAVY, |
Selection ratios for traditional and new features in the CFS method
| Protein class | Number of selected features | Merit | Traditional features (n = 451) | New features |
| Transport | 38 | 0.302 | 8.43% | 0% |
| Transcription | 51 | 0.387 | 11.31% | 9.09% |
| Translation | 76 | 0.499 | 16.85% | 27.27% |
| Gluconate utilisation | 59 | 0.59 | 13.08% | 15.15% |
| Amino acid biosynthesis | 52 | 0.309 | 11.53% | 15.15% |
| Fatty acid metabolism | 90 | 0.303 | 19.96% | 24.24% |
| Acetylcholine receptor inhibitor | 52 | 0.974 | 11.53% | 9.09% |
| G-protein coupled receptor | 39 | 0.487 | 8.65% | 9.09% |
| Guanine nucleotide-releasing factor | 69 | 0.36 | 15.30% | 27.27% |
| Fibre protein | 31 | 0.481 | 6.87% | 3.03% |
| Transmembrane | 35 | 0.443 | 7.76% | 9.09% |
The merit value is the highest merit obtained for the optimal feature subset of each class. The selected features are highly correlated with the class and only weakly correlated with one another.
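The trade-off described above (high feature-class correlation, low feature-feature correlation) is what the standard CFS merit heuristic rewards. A minimal sketch follows, assuming the usual formulation in which the merit of a subset of k features is k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation. Whether the authors' software used exactly this form is an assumption, and the input values below are illustrative only.

```python
import math


def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """Merit of a k-feature subset under the standard CFS heuristic (assumed form)."""
    numerator = k * mean_feature_class_corr
    denominator = math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return numerator / denominator


# Same relevance to the class, different redundancy among the features.
low_redundancy = cfs_merit(4, 0.4, 0.2)
high_redundancy = cfs_merit(4, 0.4, 0.8)
```

With equal class relevance, the less redundant subset scores the higher merit, which is why subset search guided by this heuristic tends to drop features that duplicate one another.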
Accuracy of predictions using training and blind test datasets with the SVM and random forest methods
| Category | Protein class | Training positive | Training negative | Test positive | Test negative | SVM_FF train | SVM_FF test | SVM_CFS train | SVM_CFS test | RF_FF train | RF_FF test | RF_CFS train | RF_CFS test |
| Biological process | Transport | 2,824 | 3,583 | 298 | 414 | 73.26 | 71.34 | 94.38 | 93.53 | 93.14 | 92.41 | 94.66 | |
| Biological process | Transcription | 3,644 | 3,872 | 415 | 421 | 87.78 | 85.04 | 96.62 | | 94.25 | 94.61 | 94.65 | 94.73 |
| Biological process | Translation | 139 | 1,886 | 16 | 210 | 98.81 | | 98.37 | 97.78 | 97.87 | 96.90 | 98.07 | 97.34 |
| Biological process | Gluconate utilisation | 53 | 420 | 7 | 46 | 98.73 | 98.11 | 98.94 | 98.11 | 97.04 | 98.11 | 98.30 | |
| Biological process | Amino acid biosynthesis | 2,769 | 3,970 | 289 | 460 | 73.55 | 76.63 | 90.28 | 92.12 | 95.69 | | 96.29 | |
| Biological process | Fatty acid metabolism | 601 | 3,445 | 81 | 369 | 90.58 | 87.55 | 94.19 | 92 | 95.99 | 94.88 | 96.93 | |
| Molecular function | Acetylcholine receptor inhibitor | 93 | 1,840 | 10 | 205 | 100 | 99.53 | 100 | | 100 | | 100 | |
| Molecular function | G-protein coupled receptor | 2,571 | 3,828 | 263 | 448 | 76.04 | 77.07 | 98.76 | | 96.62 | 97.74 | 97.60 | 97.46 |
| Molecular function | Guanine nucleotide-releasing factor | 335 | 3,994 | 35 | 446 | 98.96 | 98.75 | 99.51 | | 98.49 | | 98.98 | 98.54 |
| Cellular component | Fibre protein | 42 | 1,266 | 6 | 140 | 99.84 | | 99.92 | | 99.38 | | 99.84 | 98.63 |
| Domain | Transmembrane | 1,904 | 3,930 | 223 | 426 | 80.01 | 79.81 | 97.15 | | 96.02 | | 96.46 | 97.38 |
Training accuracy was measured by 10-fold cross-validation while building each classification model; test accuracy was measured by applying the built model to the blind test dataset. Accuracies for both the training and test datasets are presented to demonstrate a good balance between overfitting and underfitting. Positive: number of positive samples; Negative: number of negative samples; FF: full features; CFS: correlation-based feature subset selection method. Bold values indicate the highest value among the four methods.
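The 10-fold cross-validation mentioned above can be illustrated with a small index-splitting routine. This is a sketch under stated assumptions rather than the authors' procedure: the function names are hypothetical, and stratifying the folds by class is an assumption made here because several classes are strongly imbalanced (for example, 139 positive versus 1,886 negative training samples for Translation).

```python
import random


def stratified_k_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds, keeping class proportions roughly equal."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal indices round-robin across folds
    return folds


def cross_validation_splits(labels, k=10, seed=0):
    """Yield (train_indices, validation_indices), holding out each fold once."""
    folds = stratified_k_folds(labels, k, seed)
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]


# Class sizes of the Translation training set from the table above.
labels = [1] * 139 + [0] * 1886
splits = list(cross_validation_splits(labels))
```

Each sample is held out exactly once, and the model evaluated on a fold never sees that fold during fitting; the blind test set stays outside this loop entirely.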
Figure 1. Area under the ROC curves for the four methods for each protein class.
Detailed results of SVM without feature selection (SVM_FF)
| Protein class | Sensitivity | Specificity | F-measure | MCC |
| Transport | 31.54 | 100 | 0.48 | 0.46 |
| Transcription | 71.08 | 98.81 | 0.83 | 0.73 |
| Translation | 81.25 | 100 | 0.90 | 0.9 |
| Gluconate utilisation | 85.71 | 100 | 0.92 | 0.92 |
| Amino acid biosynthesis | 39.45 | 100 | 0.57 | 0.53 |
| Fatty acid metabolism | 30.86 | 100 | 0.47 | 0.52 |
| Acetylcholine receptor inhibitor | 100 | 99.51 | 0.95 | 0.95 |
| G-protein coupled receptor | 38.02 | 100 | 0.55 | 0.53 |
| Guanine nucleotide-releasing factor | 82.86 | 100 | 0.91 | 0.9 |
| Fibre protein | 83.33 | 100 | 0.91 | 0.91 |
| Transmembrane | 41.26 | 100 | 0.58 | 0.56 |
MCC: Matthews correlation coefficient
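The four measures in these result tables all follow from the confusion matrix. A minimal sketch, assuming the standard definitions: the function name is hypothetical, and the counts in the example are back-calculated (not reported) from the Transport test set of 298 positive and 414 negative samples together with the SVM_FF sensitivity of 31.54% and specificity of 100%.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, F-measure, MCC and accuracy from confusion counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = (2 * precision * sensitivity / (precision + sensitivity)
                 if precision + sensitivity else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
        "mcc": mcc,
        "accuracy": accuracy,
    }


# Assumed counts: 94 of 298 positives recovered, no false positives among 414 negatives.
m = binary_metrics(tp=94, fn=204, tn=414, fp=0)
```

Under these assumed counts the sketch reproduces the Transport row (sensitivity 31.54%, F-measure 0.48, MCC 0.46), which shows how a perfect specificity can coexist with a low F-measure and MCC when most positives are missed.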
Detailed results of SVM with feature selection (SVM_CFS)
| Protein class | Sensitivity | Specificity | F-measure | MCC |
| Transport | 87.58 | 97.83 | 0.92 | 0.87 |
| Transcription | 98.31 | 95.01 | 0.97 | 0.93 |
| Translation | 68.75 | 100 | 0.82 | 0.82 |
| Gluconate utilisation | 85.71 | 100 | 0.92 | 0.92 |
| Amino acid biosynthesis | 79.58 | 100 | 0.89 | 0.84 |
| Fatty acid metabolism | 56.79 | 99.73 | 0.72 | 0.71 |
| Acetylcholine receptor inhibitor | 100 | 100 | 1.00 | 1.00 |
| G-protein coupled receptor | 99.24 | 97.54 | 0.98 | 0.96 |
| Guanine nucleotide-releasing factor | 88.57 | 99.78 | 0.93 | 0.92 |
| Fibre protein | 83.33 | 100 | 0.91 | 0.91 |
| Transmembrane | 97.76 | 97.89 | 0.97 | 0.95 |
MCC: Matthews correlation coefficient
Detailed results of the random forest method without feature selection (RF_FF)
| Protein class | Sensitivity | Specificity | MCC | F-measure |
| Transport | 87.58 | 95.89 | 0.84 | 0.91 |
| Transcription | 96.87 | 92.4 | 0.89 | 0.95 |
| Translation | 56.25 | 100 | 0.74 | 0.72 |
| Gluconate utilisation | 85.71 | 100 | 0.92 | 0.92 |
| Amino acid biosynthesis | 96.89 | 95.65 | 0.92 | 0.95 |
| Fatty acid metabolism | 71.6 | 100 | 0.82 | 0.84 |
| Acetylcholine receptor inhibitor | 100 | 100 | 1.00 | 1.00 |
| G-protein coupled receptor | 95.44 | 99.11 | 0.95 | 0.97 |
| Guanine nucleotide-releasing factor | 85.71 | 100 | 0.92 | 0.92 |
| Fibre protein | 83.33 | 100 | 0.91 | 0.91 |
| Transmembrane | 95.07 | 99.3 | 0.95 | 0.97 |
MCC: Matthews correlation coefficient
Detailed results of the random forest method with feature selection (RF_CFS)
| Protein class | Sensitivity | Specificity | F-measure | MCC |
| Transport | 90.27 | 97.10 | 0.93 | 0.88 |
| Transcription | 96.63 | 92.87 | 0.95 | 0.90 |
| Translation | 62.50 | 100.00 | 0.77 | 0.78 |
| Gluconate utilisation | 100.00 | 100.00 | 1.00 | 1.00 |
| Amino acid biosynthesis | 95.85 | 96.30 | 0.95 | 0.92 |
| Fatty acid metabolism | 77.78 | 99.73 | 0.87 | 0.85 |
| Acetylcholine receptor inhibitor | 100.00 | 100.00 | 1.00 | 1.00 |
| G-protein coupled receptor | 96.58 | 97.99 | 0.97 | 0.95 |
| Guanine nucleotide-releasing factor | 85.71 | 99.55 | 0.90 | 0.89 |
| Fibre protein | 66.67 | 100.00 | 0.80 | 0.81 |
| Transmembrane | 94.62 | 98.83 | 0.96 | 0.94 |
MCC: Matthews correlation coefficient
Comparative performance of the novel feature set and traditional feature set using SVM
| Protein class | Novel feature set (33 features): training accuracy | Novel: test accuracy | Novel: sensitivity | Novel: specificity | Novel: AUC | Traditional feature set (451 features): training accuracy | Traditional: test accuracy | Traditional: sensitivity | Traditional: specificity | Traditional: AUC |
| Transport | 75.0273 | | 36.2 | 100 | 0.681 | 73.2636 | 72.19 | 33.6 | 100 | 0.668 |
| Transcription | 87.9723 | 88.15 | 99.3 | 77.2 | 0.882 | 92.5625 | | 98.3 | 96.4 | 0.974 |
| Translation | 97.0864 | 97.34 | 62.5 | 100 | 0.813 | 98.8642 | | 81.3 | 100 | 0.906 |
| Gluconate utilisation | 96.8288 | 96.22 | 71.4 | 100 | 0.857 | 98.7315 | | 85.7 | 100 | 0.929 |
| Amino acid biosynthesis | 74.8627 | | 42.6 | 100 | 0.713 | 73.5272 | 77.43 | 41.5 | 100 | 0.708 |
| Fatty acid metabolism | 92.4123 | | 45.7 | 100 | 0.728 | 90.5586 | 87.77 | 32.1 | 100 | 0.66 |
| Acetylcholine receptor inhibitor | 98.448 | 99.06 | 80 | 100 | 0.9 | 100 | | 100 | 99.5 | 0.998 |
| G-protein coupled receptor | 78.6998 | | 48.3 | 100 | 0.741 | 76.1838 | 77.35 | 38.8 | 100 | 0.694 |
| Guanine nucleotide-releasing factor | 97.4359 | 97.92 | 77.1 | 99.6 | 0.883 | 98.8681 | | 82.9 | 100 | 0.914 |
| Fibre protein | 96.789 | 95.89 | 0 | 100 | 0.5 | 99.8471 | | 83.3 | 100 | 0.917 |
| Transmembrane | 85.2931 | | 58.3 | 100 | 0.791 | 79.8937 | 80.58 | 43.5 | 100 | 0.717 |
AUC: Area under the curve.
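The SVM AUC values in this table appear to coincide with the mean of sensitivity and specificity, which is what the area under the ROC curve reduces to when the curve is built from a single hard decision threshold. The sketch below illustrates that identity; the function name is hypothetical, and whether the authors computed AUC this way is an assumption based only on the reported numbers.

```python
def single_point_auc(sensitivity, specificity):
    """Trapezoidal area under the ROC polyline (0,0) -> (FPR, TPR) -> (1,1)."""
    fpr, tpr = 1.0 - specificity, sensitivity
    left_triangle = 0.5 * fpr * tpr
    right_trapezoid = 0.5 * (1.0 - fpr) * (tpr + 1.0)
    return left_triangle + right_trapezoid  # equals (sensitivity + specificity) / 2


# Sensitivity and specificity taken from the novel-feature columns, as fractions.
transport_auc = single_point_auc(0.362, 1.0)
transcription_auc = single_point_auc(0.993, 0.772)
fibre_auc = single_point_auc(0.0, 1.0)
```

For Fibre protein this makes the reported 0.5 easy to read: a model that labels every test sample negative has 0% sensitivity and 100% specificity, so it sits on the chance diagonal despite its high accuracy.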
Comparative performance of the novel feature set and traditional feature set using the random forest
| Protein class | Novel feature set (33 features): training accuracy | Novel: test accuracy | Novel: sensitivity | Novel: specificity | Novel: AUC | Traditional feature set (451 features): training accuracy | Traditional: test accuracy | Traditional: sensitivity | Traditional: specificity | Traditional: AUC |
| Transport | 91.3688 | 90.30 | 86.6 | 93 | 0.968 | 92.9764 | | 89.9 | 95.9 | 0.975 |
| Transcription | 90.9659 | 91.26 | 93.7 | 88.8 | 0.98 | 94.4252 | | 96.4 | 94.3 | 0.99 |
| Translation | 97.679 | | 68.8 | 100 | 0.95 | 98.0741 | | 68.8 | 100 | 0.996 |
| Gluconate utilisation | 96.4059 | | 85.7 | 100 | 0.997 | 97.2516 | | 85.7 | 100 | 0.992 |
| Amino acid biosynthesis | 93.7676 | 94.52 | 91.7 | 96.3 | 0.983 | 94.836 | | 94.8 | 95.9 | 0.991 |
| Fatty acid metabolism | 95.7242 | | 72.8 | 99.2 | 0.97 | 96.2926 | 94 | 69.1 | 99.5 | 0.964 |
| Acetylcholine receptor inhibitor | 99.6896 | | 100 | 100 | 1 | 99.8965 | | 100 | 100 | 1 |
| G-protein coupled receptor | 94.4679 | 95.92 | 94.3 | 96.9 | 0.991 | 96.8745 | | 94.7 | 98.7 | 0.993 |
| Guanine nucleotide-releasing factor | 96.7429 | 96.67 | 62.9 | 99.3 | 0.956 | 98.4985 | | 74.3 | 99.8 | 0.992 |
| Fibre protein | 97.4771 | 95.89 | 33.3 | 98.6 | 0.798 | 99.2355 | | 83.3 | 100 | 0.998 |
| Transmembrane | 93.555 | 93.52 | 87.4 | 96.7 | 0.978 | 95.9719 | | 94.2 | 99.3 | 0.995 |
AUC: Area under the curve.
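The random forest AUC values are higher than the mean of sensitivity and specificity (for Transport with the novel features, 0.968 against a mean of about 0.898), which is consistent with a ROC curve traced over continuous scores rather than a single threshold. A minimal sketch of the rank-based AUC follows; the function name and the scores are hypothetical, and the use of per-sample scores such as tree vote fractions is an assumption, not something the record states.

```python
def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores for three positive and two negative samples.
scores = [0.9, 0.8, 0.35, 0.6, 0.2]
labels = [1, 1, 1, 0, 0]
auc = roc_auc(scores, labels)
```

Because this measure uses the full ranking, a classifier can reach a high AUC even when its default threshold yields modest sensitivity, as seen for Guanine nucleotide-releasing factor and Fibre protein with the novel features.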
Figure 2. Comparison of nine features used for classification of guanine nucleotide-releasing factor versus negative proteins.
Figure 3. Comparison of three features used for classification of transcription versus negative proteins.
Figure 4. Comparison of local information used for classification of gluconate utilisation versus negative proteins.