Abstract
Knowledge of protein folding has a profound impact on understanding the heterogeneity and molecular function of proteins, further facilitating drug design. Predicting the 3D structure (fold) of a protein is a key problem in molecular biology. Determination of the fold of a protein mainly relies on molecular experimental methods. With the development of next-generation sequencing techniques, the discovery of new protein sequences has been rapidly increasing. With such a great number of proteins, the use of experimental techniques to determine protein folding is extremely difficult because these techniques are time consuming and expensive. Thus, computational prediction methods that can automatically, rapidly, and accurately classify unknown protein sequences into specific fold categories are urgently needed. Computational recognition of protein folds has been a recent research hotspot in bioinformatics and computational biology. Many computational efforts have been made, generating a variety of computational prediction methods. In this review, we conduct a comprehensive survey of recent computational methods, especially machine learning-based methods, for protein fold recognition. This review is anticipated to assist researchers in their pursuit to systematically understand the computational recognition of protein folds.
Keywords: computational method; machine learning; protein fold recognition
Year: 2016 PMID: 27999256 PMCID: PMC5187918 DOI: 10.3390/ijms17122118
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Summary of database sources of protein structure classification.
| Database Sources | Websites | References |
|---|---|---|
| PDB | [ | |
| UniProt | [ | |
| DSSP | [ | |
| SCOP | [ | |
| SCOP2 | [ | |
| CATH | [ | |
Figure 1. Architectures of two protein databases: SCOP and CATH.
Figure 2. Framework of machine learning-based methods for protein fold recognition.
Figure 3. Construction types of ensemble-classifier models.
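As a rough illustration of how an ensemble-classifier model can combine its members, the sketch below applies simple majority voting to the fold labels predicted by several base classifiers. Majority voting is only one possible combination scheme and is assumed here for illustration; the function names `majority_vote` and `ensemble_predict` and the example labels are hypothetical and are not taken from any of the surveyed methods.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by most base classifiers.

    Ties are broken in favor of the label that appears first in
    `predictions` (an arbitrary choice made for this sketch).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def ensemble_predict(per_classifier_predictions):
    """Combine predictions from several base classifiers.

    `per_classifier_predictions` holds one list per base classifier,
    all aligned so that position i refers to the same protein.
    """
    return [majority_vote(list(votes))
            for votes in zip(*per_classifier_predictions)]


# Hypothetical fold labels from three base classifiers for three proteins.
votes = [
    ["a.1", "b.1", "c.1"],  # classifier 1
    ["a.1", "b.6", "c.1"],  # classifier 2
    ["a.3", "b.1", "c.2"],  # classifier 3
]
print(ensemble_predict(votes))  # ['a.1', 'b.1', 'c.1']
```

Weighted voting or a second-level classifier trained on the base outputs could be substituted for `majority_vote` without changing the surrounding structure.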
Sequence distribution of the 27 fold classes in the DD dataset.
| Index | Fold Identifier | Fold Name | STrain | STest | Total |
|---|---|---|---|---|---|
| 1 | a.1 | Globin-like | 13 | 6 | 19 |
| 2 | a.3 | Cytochrome c | 7 | 9 | 16 |
| 3 | a.4 | DNA/RNA-binding 3-helical bundle | 12 | 20 | 32 |
| 4 | a.24 | 4-Helical up-and-down bundle | 7 | 8 | 15 |
| 5 | a.26 | 4-Helical cytokines | 9 | 9 | 18 |
| 6 | a.39 | EF hand-like | 6 | 9 | 15 |
| 7 | b.1 | Immunoglobulin-like β-sandwich | 30 | 44 | 74 |
| 8 | b.6 | Cupredoxin-like | 9 | 12 | 21 |
| 9 | b.121 | Nucleoplasmin-like/VP | 16 | 13 | 29 |
| 10 | b.29 | ConA-like lectins/glucanases | 7 | 6 | 13 |
| 11 | b.34 | SH3-like barrel | 8 | 8 | 16 |
| 12 | b.40 | OB-Fold | 13 | 19 | 32 |
| 13 | b.42 | β-Trefoil | 8 | 4 | 12 |
| 14 | b.47 | Trypsin-like serine proteases | 9 | 4 | 13 |
| 15 | b.60 | Lipocalins | 9 | 7 | 16 |
| 16 | c.1 | TIM β/α-barrel | 29 | 48 | 77 |
| 17 | c.2 | FAD/NAD(P)-binding domain | 11 | 12 | 23 |
| 18 | c.3 | Flavodoxin-like | 11 | 13 | 24 |
| 19 | c.23 | NAD(P)-binding Rossmann | 13 | 27 | 40 |
| 20 | c.37 | P-loop containing nucleoside triphosphate hydrolases | 10 | 12 | 22 |
| 21 | c.47 | Thioredoxin-fold | 9 | 8 | 17 |
| 22 | c.55 | Ribonuclease H-like motif | 10 | 12 | 22 |
| 23 | c.69 | α/β-Hydrolases | 11 | 7 | 18 |
| 24 | c.93 | Periplasmic binding protein-like | 11 | 4 | 15 |
| 25 | d.15 | β-Grasp (ubiquitin-like) | 7 | 8 | 15 |
| 26 | d.58 | Ferredoxin-like | 13 | 27 | 40 |
| 27 | g.3 | Knottins (small inhibitors, toxins, lectins) | 13 | 27 | 40 |
| Total | | | 311 | 383 | 694 |
Note that STrain denotes the training dataset, and STest denotes the testing dataset.
Performance of representative machine learning-based methods in the literature on the DD dataset.
| Index | Methods | Classifier Type | References | Overall Accuracy (%) |
|---|---|---|---|---|
| 1 | Nanni et al. (2006) | Ensemble | [ | 61.1 |
| 2 | PFP-Pred (2006) | Ensemble | [ | 62.1 |
| 3 | Shamim et al. (2007) | Single (SVM) | [ | 60.5 |
| 4 | PFRES (2007) | Ensemble | [ | 68.4 |
| 5 | Damoulas et al. (2008) | Single (SVM) | [ | 68.1 |
| 6 | ALHK (2008) | Ensemble | [ | 61.8 |
| 7 | GAOEC (2008) | Ensemble | [ | 64.7 |
| 8 | PFP-FunDSeqE (2009) | Ensemble | [ | 70.5 |
| 9 | ACCFold_AC (2009) | Single (SVM) | [ | 65.3 |
| 10 | ACCFold_ACC (2009) | Single (SVM) | [ | 66.6 |
| 11 | Ghanty et al. (2009) | Ensemble | [ | 68.6 |
| 12 | TAXFOLD (2011) | Single (SVM) | [ | 71.5 |
| 13 | Alok Sharma et al. (2012) | Single (SVM) | [ | 69.5 |
| 14 | MarFold (2012) | Ensemble | [ | 71.7 |
| 15 | Kavousi et al. (2012) | Ensemble | [ | 73.1 |
| 16 | PFP-RFSM (2013) | Single (RF) | [ | 73.7 |
| 17 | Feng and Hu (2014) | Ensemble | [ | 70.2 |
| 18 | PFPA (2015) | Ensemble | [ | 73.6 |
| 19 | Feng et al. (2016) | Ensemble | [ | 70.8 |
| 20 | ProFold (2016) | Ensemble | [ | 76.2 |
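To make the "Overall Accuracy (%)" column concrete, the following sketch computes overall accuracy under its usual definition: the percentage of test proteins whose predicted fold matches the true fold. It assumes the cited methods use this standard definition and that they are evaluated on the 383-sequence DD test set listed above. The helper names `overall_accuracy` and `approx_correct_count` are hypothetical, and the back-calculated count is only an approximation because the reported percentages are rounded.

```python
N_TEST = 383  # size of the DD testing dataset (STest) given above


def overall_accuracy(true_folds, predicted_folds):
    """Percentage of proteins whose predicted fold equals the true fold."""
    if len(true_folds) != len(predicted_folds):
        raise ValueError("label lists must have the same length")
    if not true_folds:
        raise ValueError("at least one protein is required")
    correct = sum(t == p for t, p in zip(true_folds, predicted_folds))
    return 100.0 * correct / len(true_folds)


def approx_correct_count(accuracy_percent, n_test=N_TEST):
    """Approximate number of correct predictions behind a reported accuracy."""
    return round(accuracy_percent * n_test / 100)


# Toy example with hypothetical labels: 3 of 4 predictions are correct.
truth = ["a.1", "b.1", "c.1", "d.15"]
predicted = ["a.1", "b.1", "c.2", "d.15"]
print(overall_accuracy(truth, predicted))  # 75.0

# An accuracy of 76.2% on 383 test proteins is roughly 292 correct predictions.
print(approx_correct_count(76.2))  # 292
```

Because the fold classes are unevenly sized, overall accuracy is dominated by the larger folds; per-fold accuracy would need the individual predictions, which are not given here.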