Junjie Chen, Bingquan Liu, Dong Huang.
Abstract
Protein remote homology detection is one of the central problems in bioinformatics. Although several computational methods have been proposed, the problem is still far from solved. In this paper, an ensemble classifier for protein remote homology detection, called SVM-Ensemble, is proposed with a weighted voting strategy. SVM-Ensemble combines three basic classifiers built on different feature spaces: Kmer, ACC, and SC-PseAAC. These features describe proteins from different perspectives, incorporating both the sequence composition and the sequence-order information along the protein sequence. Experimental results on a widely used benchmark dataset show that SVM-Ensemble clearly improves predictive performance for protein remote homology detection. Moreover, it outperformed other state-of-the-art methods, achieving the highest average ROC score (0.943) on this benchmark.
Year: 2016 PMID: 27294123 PMCID: PMC4875977 DOI: 10.1155/2016/5813645
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1. Flowchart showing how the ensemble classifier is formed by combining the three basic classifiers at the superfamily level. The ensemble strategy is first applied at the superfamily level, and the query protein P is then assigned to the superfamily for which its score is highest.
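The superfamily-level weighted voting described in Figure 1 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the example scores, the superfamily labels, and the weights are all hypothetical, and the sketch assumes that each basic classifier returns one score per candidate superfamily and that the votes are combined as a normalized weighted sum.

```python
from typing import Dict, List


def weighted_vote(
    scores_per_classifier: List[Dict[str, float]],
    weights: List[float],
) -> Dict[str, float]:
    """Combine per-superfamily scores from several classifiers.

    Assumption: the vote is a weighted sum with weights normalized to 1.
    """
    if len(scores_per_classifier) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    combined: Dict[str, float] = {}
    for scores, weight in zip(scores_per_classifier, weights):
        for superfamily, score in scores.items():
            combined[superfamily] = (
                combined.get(superfamily, 0.0) + (weight / total) * score
            )
    return combined


def predict_superfamily(
    scores_per_classifier: List[Dict[str, float]],
    weights: List[float],
) -> str:
    """Assign the query protein to the superfamily with the highest combined score."""
    combined = weighted_vote(scores_per_classifier, weights)
    return max(combined, key=combined.get)


# Hypothetical scores for one query protein P from the three basic classifiers.
kmer_scores = {"a.1": 0.2, "b.1": 0.9}
acc_scores = {"a.1": 0.6, "b.1": 0.1}
pseaac_scores = {"a.1": 0.7, "b.1": 0.3}
example_weights = [0.5, 0.2, 0.3]  # illustrative only, not taken from the paper

example_inputs = [kmer_scores, acc_scores, pseaac_scores]
example_combined = weighted_vote(example_inputs, example_weights)
example_prediction = predict_superfamily(example_inputs, example_weights)
```

Normalizing the weights keeps the combined score on the same scale as the individual scores, which makes results from different weight settings easier to compare.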
Figure 2. Performance of the three basic predictors over all parameter combinations. A k value of 2 was used in SVM-Kmer and a LAG value of 14 in SVM-ACC. SVM-SC-PseAAC achieves its best performance with λ = 5 and w = 0.2. Parameter w is the main factor affecting performance, whereas parameter λ has only a minor impact.
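To make the Kmer feature space with k = 2 concrete, the sketch below computes a 400-dimensional dipeptide composition vector for a protein sequence. It is an illustration under stated assumptions rather than the authors' code: the function name `kmer_features` is hypothetical, and the choices to normalize counts to frequencies and to skip windows containing non-standard residues are assumptions.

```python
from itertools import product
from typing import List

# The 20 standard amino acids, in alphabetical order of their one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_features(sequence: str, k: int = 2) -> List[float]:
    """Return the k-mer occurrence frequencies of a protein sequence.

    The vector has 20**k entries, ordered lexicographically by k-mer.
    Assumption: windows containing non-standard residues are skipped.
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}

    counts = [0.0] * len(kmers)
    n_windows = 0
    for start in range(len(sequence) - k + 1):
        position = index.get(sequence[start:start + k])
        if position is not None:
            counts[position] += 1.0
            n_windows += 1

    if n_windows == 0:
        return counts  # sequence too short or no valid window: all zeros
    return [count / n_windows for count in counts]


# Toy sequence: its windows are MH, HM, MH.
example_vector = kmer_features("MHMH", k=2)
mh_position = AMINO_ACIDS.index("M") * 20 + AMINO_ACIDS.index("H")
```

With k = 2 each protein maps to a fixed-length vector regardless of sequence length, which is what allows it to be fed to an SVM.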
Performance of the three basic predictors with optimal parameters on the benchmark dataset.
| Methods | Optimal parameters | ROC[a] | ROC50[a] |
|---|---|---|---|
| SVM-Kmer | k = 2 | 0.912 | 0.785 |
| SVM-ACC | LAG = 14 | 0.787 | 0.483 |
| SVM-SC-PseAAC | λ = 5, w = 0.2 | 0.911 | 0.657 |
[a]Average ROC and ROC50 scores.
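For readers unfamiliar with the ROC50 score, the sketch below computes a truncated ROC score for one ranked list of predictions. It follows the common definition of ROC-n (the area under the ROC curve up to the first n false positives, normalized to the range 0 to 1) and is not taken from the paper's evaluation code. The function name `roc_n` and the handling of lists with fewer than n negatives are assumptions, and tied scores are not treated specially.

```python
from typing import Sequence


def roc_n(scores: Sequence[float], labels: Sequence[bool], n: int = 50) -> float:
    """Area under the ROC curve up to the first n false positives, scaled to [0, 1].

    Assumption: if fewer than n negatives exist, all of them are used,
    so the result equals the ordinary ROC score (AUC) in that case.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")

    # Rank predictions from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(1 for _, is_pos in ranked if is_pos)
    total_neg = len(ranked) - total_pos
    limit = min(n, total_neg)
    if total_pos == 0 or limit == 0:
        raise ValueError("need at least one positive and one negative example")

    true_pos = 0
    false_pos = 0
    area = 0
    for _, is_pos in ranked:
        if is_pos:
            true_pos += 1
        else:
            false_pos += 1
            area += true_pos  # positives ranked above this false positive
            if false_pos == limit:
                break
    return area / (limit * total_pos)


example_scores = [0.9, 0.8, 0.7, 0.6]
example_labels = [True, False, True, False]
example_roc50 = roc_n(example_scores, example_labels, n=50)
example_roc1 = roc_n(example_scores, example_labels, n=1)
```

Averaging such per-test-set values over all test sets would give figures comparable in kind to the average ROC and ROC50 scores reported in the tables.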
Performance of ensemble classifier combining various predictors with weighted voting. The best performance was achieved by combining SVM-Kmer, SVM-ACC, and SVM-SC-PseAAC. The symbol ⊕ denotes the weighted voting operator.
| Ensemble methods with superfamily-level strategy | ROC[a] | ROC50[a] |
|---|---|---|
| SVM-Kmer ⊕ SVM-ACC | 0.929 | 0.767 |
| SVM-Kmer ⊕ SVM-SC-PseAAC | 0.937 | 0.715 |
| SVM-ACC ⊕ SVM-SC-PseAAC | 0.922 | 0.691 |
| SVM-Kmer ⊕ SVM-ACC ⊕ SVM-SC-PseAAC | 0.943 | 0.744 |
[a]Average ROC and ROC50 scores.
Top 10 most discriminative features in three feature spaces. These features describe the characteristics of proteins from various perspectives.
| Rank | Kmer | ACC | SC-PseAAC |
|---|---|---|---|
| 1 | MH | CC | |
| 2 | WC | AC | |
| 3 | IM | CC | |
| 4 | MC | AC | |
| 5 | MY | CC | |
| 6 | VM | AC | |
| 7 | YW | AC | |
| 8 | YR | CC | |
| 9 | HW | CC | |
| 10 | MQ | AC | |
Note: the subscripts in the ACC and SC-PseAAC features denote hydrophobicity (h1), hydrophilicity (h2), and mass (m).
Performance comparison of different methods on the benchmark dataset.
| Methods | ROC[a] | ROC50[a] | Source |
|---|---|---|---|
| SVM-Ensemble | 0.943 | 0.744 | This study |
| SVM-Pairwise | 0.896 | 0.464 | Liao and Noble, 2003 |
| SVM-LA | 0.925 | 0.649 | Saigo et al., 2004 |
| Mismatch | 0.925 | 0.649 | Leslie et al., 2004 |
| Monomer-dist | 0.919 | 0.508 | Lingner and Meinicke, 2006 |
| SVM-WCM | 0.904 | 0.445 | Lingner and Meinicke, 2008 |
| SVM-Ngram-LSA | 0.859 | 0.628 | Dong et al., 2006 |
| SVM-Pattern-LSA | 0.879 | 0.626 | Dong et al., 2006 |
| SVM-Motif-LSA | 0.859 | 0.628 | Dong et al., 2006 |
| SVM-Top-n-gram-combine-LSA | 0.939 | 0.767 | Liu et al., 2008 |
| PseAACIndex | 0.880 | 0.620 | Liu et al., 2013 |
| PseAACIndex-Profile | 0.922 | 0.712 | Liu et al., 2013 |
| SVM-DR | 0.919 | 0.715 | Liu et al., 2014 |
| disPseAAC | 0.922 | 0.721 | Liu et al., 2015 |
[a]Average ROC and ROC50 scores.