Arvind Kumar Tiwari, Rajeev Srivastava.
Abstract
Knowledge of previously uncharacterized proteins has grown massively with the advancement of high-throughput microarray technologies. Protein function prediction is one of the most challenging problems in bioinformatics. Homology-based approaches were traditionally used to predict protein function, but they fail when a new protein differs from previously characterized ones. To alleviate these problems, numerous computational intelligence techniques have been proposed in recent years. This paper presents a comprehensive state-of-the-art review of computational intelligence techniques for protein function prediction using sequence, structure, protein-protein interaction network, and gene expression data. The applications covered include prediction of DNA and RNA binding sites, subcellular localization, enzyme functions, signal peptides, catalytic residues, nuclear and G-protein coupled receptors, membrane proteins, and pathway analysis from gene expression datasets. The paper also summarizes the results that many researchers have obtained on these problems by using computational intelligence techniques with appropriate datasets to improve prediction performance. The summary shows that ensemble classifiers and the integration of multiple heterogeneous data are useful for protein function prediction.
Year: 2014 PMID: 25574395 PMCID: PMC4276698 DOI: 10.1155/2014/845479
Source DB: PubMed Journal: Int J Proteomics ISSN: 2090-2166
Summary of computational intelligence (CI) techniques in prediction of binding sites.
| Reference | CI techniques | Binding sites/residues | Performance | Datasets |
|---|---|---|---|---|
| [ | ANN | DNA | Accuracy: 64%, | Amino acid sequence composition, solvent accessibility, and secondary structure |
| [ | ANN | DNA | Accuracy: 73.6% | Position specific scoring matrices (PSSM) |
| [ | SVM | DNA | Accuracy: 90% | Surface and overall composition, overall charge and positive potential patches on the protein surface |
| [ | SVM | DNA | Accuracy: 82.30% | Amino acid sequence, PSSM, and low-resolution structural information |
| [ | SVM | DNA | Accuracy: 77.2%, sensitivity: 76.4%, and specificity: 76.6% | Position specific scoring matrices (PSSM) |
| [ | SVM | DNA | Accuracy: 96.6%, sensitivity: 90.7% | Amino acid sequence, pseudoamino acid composition, autocross-covariance transforms, and dipeptide composition |
| [ | SVM | DNA | Accuracy: 80%, sensitivity: 85.1%, and specificity: 85.3% | Normalized PSSM score, normalized solvent accessible surface area, and protein backbone structure |
| [ | SVM | DNA | MCC: 0.67, accuracy: 89.6%, sensitivity: 88.4%, and specificity: 90.8% | PSSM, amino acid composition, hydrophobicity, polarity, polarizability, secondary structure, solvent accessibility, normalized van der Waals volume, binding propensity, and nonbinding propensity |
| [ | Ensemble of ANN and SVM | DNA | Accuracy: 89.00% | PSSM and structural features such as secondary structure, solvent accessibility, and globularity |
| [ | Random forest | DNA | Accuracy: 78.20%, sensitivity: 78.06%, and specificity: 78.22% | PSSM with mean and standard deviation of side chain pKa value, hydrophobicity index, and molecular mass |
| [ | Random forest | DNA | Accuracy: 91.41%, MCC: 0.70, and AUC: 0.913 | PSSM, secondary structure information, orthogonal binary vector information, and two physicochemical properties (dipoles and volumes of the side chains) |
| [ | Random forest | DNA | Accuracy: 83.96% | Pseudoamino acid composition |
| [ | Gaussian Naive Bayes | DNA | Accuracy: 79.10% and MCC: 0.583 | PSSM, predicted secondary structure, and predicted relative solvent accessibility |
| [ | Naive Bayes classifier | RNA | Accuracy: 85.00% | Amino acid sequence, relative accessible surface area, sequence entropy, hydrophobicity, secondary structure, and electrostatic potential |
| [ | SVM | RNA | MCC: 0.31 | Amino acid sequence and PSSM |
| [ | SVM | RNA | Accuracy: 87.99%, sensitivity: 79.95%, and specificity: 90.36% | Smoothed PSSM with the correlation and dependency from the neighboring residues |
| [ | SVM | RNA | AUC: 0.83 | PSSM, accessible surface area, betweenness centrality, and retention coefficient |
| [ | Random forest | RNA | MCC: 0.5637, accuracy: 88.63%, sensitivity: 53.70%, and specificity: 96.97% | PSSM, physicochemical properties of amino acids, polarity-charge, and hydrophobicity |
| [ | SVM | RNA | Accuracy: 79.72% and | Protein sequence, amino acid composition, hydrophobicity, secondary structure, predicted solvent accessibility, normalized van der Waals volume, polarity, and polarizability |
| [ | SVM | rRNA, RNA, and DNA | rRNA accuracy: 84%, | Protein sequence, amino acid composition, hydrophobicity, secondary structure, predicted solvent accessibility, normalized van der Waals volume, polarity, and polarizability |
| [ | SVM | DNA and RNA | DNA sensitivity: 69.40% and specificity: 70.47%; RNA sensitivity: 66.28% and specificity: 69.84% | Side chain pKa value, hydrophobicity index, and molecular mass |
| [ | SVM | DNA and RNA | Accuracy: 79.00%, sensitivity: 77.30%, and specificity: 79.30% for DNA; accuracy: 77.70%, sensitivity: 71.60%, and specificity: 78.70% for RNA-binding residues | PSSM with mean and standard deviation of side chain pKa value, hydrophobicity index, and molecular mass |
| [ | SVM | Metal binding | Accuracy: 78.10% | Physicochemical properties of the amino acid sequences |
| [ | Bayesian classifier | Zinc | Specificity: 99.8%, sensitivity: 75.5% | Structural properties of a protein |
| [ | Structural comparison | DNA | Accuracy: 98% and precision: 84% | Combination of structural comparison and the evaluation of statistical potential |
| [ | Structural comparison | RNA | Accuracy: 98% and precision: 91% for predicting RBPs; accuracy: 93% and precision: 78% for predicting RNA binding residues | Distance-scaled, finite, ideal gas reference based statistical energy function, and structural alignment |
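
The performance columns in these tables mix accuracy, sensitivity, specificity, precision, and the Matthews correlation coefficient (MCC), so entries are not always directly comparable. As a reminder of how the measures relate, the following minimal sketch derives them from a binary confusion matrix. The function name `binary_metrics` and the example counts are illustrative assumptions and are not taken from any of the cited studies.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute common binary classification metrics from confusion-matrix counts.

    tp/fp/tn/fn are the numbers of true positives, false positives,
    true negatives, and false negatives. Ratios with an empty
    denominator are reported as 0.0 (an assumption of this sketch).
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # also called recall
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "mcc": mcc,
    }


# Hypothetical imbalanced example: 60 binding residues among 1000 residues.
metrics = binary_metrics(tp=50, fp=10, tn=930, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On imbalanced data such as binding-residue prediction, accuracy can stay high while sensitivity drops, which is one reason several rows report MCC or sensitivity and specificity alongside accuracy.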
Summary of computational intelligence (CI) techniques in prediction of subcellular localization.
| Reference | CI techniques | Performance | Datasets |
|---|---|---|---|
| [ | SVM | Accuracy: 86.3% | Amino acid compositions |
| [ | SVM | Average accuracy: 66.7% | Functional domain composition of protein |
| [ | SVM | Overall recall: 89.8% | Amino acid subsequence |
| [ | SVM | Overall accuracy: 93.1% | Physicochemical property of amino acid |
| [ | SVM | Accuracy: 84.9% | Amino acid composition, dipeptide composition, and similarity information |
| [ | SVM | Accuracy: 91.2% | Compositions of residues, dipeptides, and physicochemical properties |
| [ | | Overall accuracy: 80% | Dipeptide composition of amino acids |
| [ | | Overall accuracy: 92.5% | Amino acid compositions, dipeptide compositions, and physicochemical properties |
| [ | | Overall accuracy: 85.4% | Functional domain composition |
| [ | | Overall accuracy: 93.57% | PSSM and pseudoamino acid composition |
| [ | SVM | Overall accuracy: 74.00% | N-terminal targeting sequences amino acid composition and protein sequence motifs |
| [ | CSVM | Overall accuracy: 80.03% | Pseudoamino acid composition |
| [ | SVM with GA | Overall accuracy: 72.82% | Physicochemical property of amino acid |
| [ | SVM | Overall accuracy: 90.96% and MCC: 0.8655 | Combination of sequence alignment and feature based on amino acid composition |
| [ | SVM | Accuracy: 73.71% | Amino acid composition and PSSM |
| [ | SVM | Accuracy: 88.3% | Pseudoamino acid composition |
| [ | SVM | Recall: 91.30% | Sequence motifs |
| [ | SVM | Accuracy: up to 94.00% | Amino acid composition, amino acid pair, and 1, 2, 3 gapped amino acid pair compositions |
| [ | SVM | Accuracy: up to 93% | Integrates features from phylogenetic profiles and gene ontology |
| [ | Recurrent NN | Overall accuracy: 72.55% | Pseudo amino acid composition |
| [ | N-to-1 NN | Accuracy: up to 89% | Protein sequence |
| [ | SVM | Overall accuracy: 93.57% | Amino acid and dipeptide composition, reduced physicochemical properties, gene ontology, PSSM, and pseudoaverage chemical shift |
| [ | SVM and ANN | Accuracy: 68% | Structural properties of a protein |
Summary of computational intelligence (CI) techniques in prediction of enzyme function/family.
| Ref. | CI techniques | Performance | Datasets |
|---|---|---|---|
| [ | | Accuracy: 85% | Functional domain composition |
| [ | | Accuracy: 76.6% | Amphiphilic pseudoamino acid composition |
| [ | OET- | Overall accuracy: 91.3%, 93.7%, and 98.3% for the 1st, 2nd, and 3rd level | Functional domain composition and PSSM |
| [ | | Accuracy: 99% | Amino acid composition |
| [ | Fuzzy | Accuracy: 56.9% | Pseudoamino acid composition, approximate entropy, and hydrophobicity |
| [ | SVM | Accuracy: 80.87% | Amphiphilic pseudo amino acid composition |
| [ | SVM with DWT | Accuracy: 91.9% | Pseudoamino acid composition |
| [ | SVM | MCC: 0.92 and accuracy: 93% | Pseudoamino acid composition with CTF |
| [ | SVM | Accuracy: 91.32% | Functional domain composition |
| [ | SVM | Accuracy: 81% to 98% and MCC: 0.82 to 0.98 | Pseudoamino acid composition with CTF |
| [ | SVM | Accuracy: 95.25% | Structural features based on fragment libraries |
| [ | SVM | Accuracy: 69.1–99.6% | Amino acid sequence |
| [ | SVM | Sensitivity: 85.6% and specificity: 86.1% | Pseudoamino acid composition |
| [ | SVM | Accuracy: 77.4% | Sequence similarity, amino acid composition, physicochemical properties, and dipeptide composition |
| [ | Bayesian classifier | Accuracy: 45% | Structural properties |
| [ | Random forest | Overall accuracy: 94.87%, 87.7%, and 84.25% for the 1st, 2nd, and 3rd level | Sequence derived features |
| [ | Random forest | Precision: 0.98 and recall: 0.89 | Set of specificity determining residues |
| [ | SVM and random forest | Accuracy: 71.29–99.53% by SVM and 94–99.31% by random forest | Sequence derived properties |
| [ | N-to-1 neural network | Overall accuracy: 96%, specificity: 80%, and FP rates: 7% | Amino acid sequences |
Summary of computational intelligence (CI) techniques in prediction of signal peptides.
| Reference | CI techniques | Performance | Datasets |
|---|---|---|---|
| [ | ANN | Accuracy: 97% | Amino acid sequences |
| [ | ANN | Accuracy: 97% | Amino acid sequences |
| [ | Bidirectional recurrent NN | Accuracy: 97% | Amino acid sequences |
| [ | OET- | Accuracy: 73.4% | Pseudoamino acid composition |
| [ | ANN | Accuracy: 93% | Amino acid sequences |
| [ | SVM | Accuracy: 97% | Pseudoamino acid composition |
| [ | SVM | Sensitivity: 90.97% and selectivity: 97.42% | Position specific amino acid composition |
| [ | Bayesian reasoning network | Accuracy: 97.73% for secretory and nonsecretory and 90.90% for signal peptide cleavage site | Sequence derived features |
Summary of computational intelligence techniques in prediction of catalytic residue.
| Reference | CI techniques | Performance | Datasets |
|---|---|---|---|
| [ | ANN | Accuracy: 69% | Features of amino acid sequence and structure |
| [ | GA with ANN | Accuracy: 91.2% | Residue properties |
| [ | SVM | Accuracy: 86% | Sequence and structural properties |
| [ | SVM | Recall: 61% | Protein structure |
| [ | SVM | Accuracy: 88.6%–95.76% | Sequence and structural properties |
| [ | SVM | MCC: 0.74, sensitivity: 0.76, and specificity: 0.51 | Structural features of a protein |
Summary of computational intelligence techniques in prediction of nuclear/GPC receptor.
| Reference | CI techniques | Prediction | Performance | Datasets |
|---|---|---|---|---|
| [ | SVM | NR | Overall accuracy: 82.6%–97.5% | Amino acid composition and dipeptide composition |
| [ | SVM | NR | Overall accuracy: 96% | 4-tuple residue composition |
| [ | SVM | NR | Overall accuracy: 99.6% | Pseudoamino acid composition |
| [ | SVM | NR | Accuracy: 98% | Pseudoamino acid composition |
| [ | SVM | NR | Accuracy: 97% | Amino acid composition, dipeptide composition, and physicochemical property |
| [ | Fuzzy | NR | Overall accuracy: 93% | Pseudoamino acid composition with physicochemical and statistical features |
| [ | SVM | GPCR | Overall accuracy: 99.5% | Dipeptide composition of amino acids |
| [ | SVM | GPCR | Overall accuracy: 89.8%–96.4% | Amino acid composition and dipeptide composition |
| [ | SVM | GPCR | Overall accuracy: 99.6% | Pseudoamino acid composition |
| [ | Adaboost | GPCR | Overall accuracy: 96.4% and MCC: 0.930 | Pseudoamino acid composition with approximate entropy and hydrophobicity patterns |
| [ | PCA | GPCR | Overall accuracy: 80.47–99.5% | Sequence derived features |
Summary of computational intelligence techniques in prediction of membrane protein.
| Reference | CI techniques | Performance | Datasets |
|---|---|---|---|
| [ | | Overall accuracy: 92.6% | PSSM and pseudoamino acid composition |
| [ | | Overall accuracy: 87.65% | Protein sequence and PPI data |
| [ | Fuzzy | Overall accuracy: 95.7% | Pseudoamino acid composition |
| [ | OET- | Overall accuracy: 91.6% | Pseudoamino acid composition |
| [ | SVM | Overall accuracy: 90.1% | Protein sequence |
| [ | Discriminant analysis | Overall accuracy: 86.5% | Protein sequence information |
| [ | Ensemble classifier | Overall accuracy: 91.2% | Pseudoamino acid composition and the approximate entropy |
Summary of computational intelligence for protein function prediction by using protein interaction network.
| Reference | CI techniques | Performance | Datasets |
|---|---|---|---|
| [ | Markov random field | Specificity: 45%, sensitivity: 64% | Functional probability of each protein |
| [ | Network flow based algorithm | Accuracy: 10–90% | Structure of protein interactive maps |
| [ | Neighbor based techniques | Precision: 0.9-1.0, recall: 98% | Label 1 and label 2 neighbors |
| [ | Association analysis based method | Accuracy: 93% | H confidence, adjacency matrix |
| [ | Naïve Bayes classifier | Precision: 49%, recall: 62%, MCC: 0.37 | PPI data |
| [ | RWR with | Accuracy: 58–73% | Neighborhood features |
| [ | Time sequenced subnetwork | Significant module: 95.95% | Integrating the gene expression data and PPI data |
| [ | Gibbs sampling based bootstrapping | TP/FP: 0.5 to 1.5 | Interaction and annotation data |
| [ | Network based approach | Precision: 54.83%, | Function-function correlation |
| [ | Neighborhood majority voting system | Precision: 67.3%, | Diffusion state distance (DSD) |
Summary of computational intelligence for protein function prediction by using gene expression data.
| Reference | CI techniques | Performance | Datasets |
|---|---|---|---|
| [ | Multilayer perceptron | TP rate: up to 79.6%, FP rate: up to 97% | DNA array expression data |
| [ | MRF with Bayesian | Sensitivity: 87% | PPI, genetic interactions, highly correlated gene expression network, protein complex data, and structural properties |
| [ | SVM | Accuracy: 89.44% | Gene expression data |
| [ | Genetic programming | Accuracy: 92.50–98.7% | Gene expression data |
| [ | Majority voting genetic programming | Accuracy: 81.82% | Gene expression data |
| [ | Genetic programming | Accuracy: 94.9–99.27% | Gene expression data |
| [ | Genetic programming | Accuracy: 95.24–100% | Gene expression values and constant values |
| [ | Fuzzy nearest cluster | Top N accuracy: 65.27% | Gene expression data |
| [ | | Accuracy: 0.16–0.24 | PPI and gene expression data |
| [ | Hypergraph | Accuracy: 97.95% | Gene expression data |
| [ | Discriminative local subspaces with SVM | Average precision: 63% and | Gene expression data |
Summary of computational intelligence techniques in pathway analysis from gene expression data.
| Reference | CI techniques | Performance | Datasets |
|---|---|---|---|
| [ | Gene set enrichment analysis | Sensitivity: 0.78, specificity: 0.98, AUC: 0.94 | Gene expression data with significance analysis of microarray |
| [ | Linear discriminant analysis | Error rate: 10–15% | Covariance matrix with group relationships among variables |
| [ | Random forest | Error rate: 11–17% | Gene expression data |
| [ | Naïve Bayes, decision tree based ensemble classifier | Accuracy: 91.2% and | Gene expression data |
| [ | SVM, Bayesian approach, C5.0, and random forest | Error rate: 7–15% | Gene expression data |
| [ | Bayesian approach | AUC: 90.56%, Accuracy: 75.7% | Single-nucleotide polymorphisms |
The results analysis of different classifiers to predict protein functions.
| Computational intelligence based techniques | Metric | DNA | RNA | Membrane | Enzyme | Nuclear receptor | G-protein coupled receptor | Overall |
|---|---|---|---|---|---|---|---|---|
| Random forest | ACC | 78.6 | 64.7 | 89.2 | 81.6 | 97.4 | 94.6 | 86.7 |
| Random forest | MCC | 0.74 | 0.71 | 0.86 | 0.74 | 0.94 | 0.97 | 0.84 |
| | ACC | 66.9 | 60.3 | 96.8 | 66.3 | 76.8 | 94.8 | 78.8 |
| | MCC | 0.74 | 0.51 | 0.68 | 0.70 | 0.85 | 0.97 | 0.76 |
| Naïve Bayes | ACC | 64.8 | 86 | 84.4 | 60.7 | 98.8 | 97.2 | 80.7 |
| Naïve Bayes | MCC | 0.61 | 0.64 | 0.81 | 0.65 | 0.87 | 0.98 | 0.77 |
| SVM with AAC | ACC | 72.4 | 73.5 | 87.7 | 89.3 | 99.8 | 70.8 | 84.1 |
| SVM with AAC | MCC | 0.83 | 0.81 | 0.90 | 0.82 | 0.74 | 0.82 | 0.82 |
| SVM with AAC + DC | ACC | 82 | 71.3 | 95 | 91.8 | 96.9 | 94.3 | 91 |
| SVM with AAC + DC | MCC | 0.86 | 0.78 | 0.91 | 0.84 | 0.94 | 0.95 | 0.89 |
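
The labels AAC and DC in the SVM rows presumably stand for amino acid composition and dipeptide composition, two sequence features that recur throughout the tables above. The sketch below shows one common way to compute them, as a 20-dimensional and a 400-dimensional frequency vector. The function names, the decision to skip nonstandard residues, and the normalization by sequence length are assumptions of this example and may differ from the feature extraction used in the reviewed studies.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def _clean(seq):
    """Uppercase the sequence and drop nonstandard residue codes (assumption)."""
    return "".join(c for c in seq.upper() if c in AMINO_ACIDS)


def amino_acid_composition(seq):
    """Return the fraction of each standard amino acid (20 values)."""
    seq = _clean(seq)
    if not seq:
        return [0.0] * len(AMINO_ACIDS)
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return the fraction of each adjacent residue pair (400 values)."""
    seq = _clean(seq)
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] += 1
    total = max(len(seq) - 1, 1)  # number of overlapping pairs
    return [counts[d] / total for d in DIPEPTIDES]


# Hypothetical toy sequence, not taken from any dataset in the review.
sequence = "MKTAYIAKQR"
features = amino_acid_composition(sequence) + dipeptide_composition(sequence)
print(len(features))  # 420-dimensional AAC + DC feature vector
```

Concatenating the two vectors gives a fixed-length input regardless of protein length, which is what makes these features convenient for classifiers such as SVM.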