C R Peng, L Liu, B Niu, Y L Lv, M J Li, Y L Yuan, Y B Zhu, W C Lu, Y D Cai.
Abstract
Identifying which proteins can interact with RNA is important for protein annotation, because interactions between RNA and proteins influence the structure of the ribosome and play important roles in gene expression. This paper identifies RNA-binding proteins using voting systems. First, 34 learning algorithms available in Weka are chosen for investigation. A simple majority voting system (SMVS) is then used to predict RNA-binding proteins, achieving an average ACC (overall prediction accuracy) of 79.72% and an average MCC (Matthews correlation coefficient) of 59.77% on the independent test dataset. Next, the mRMR (minimum redundancy maximum relevance) strategy is adapted for algorithm selection, and the MCC value of each classifier is assigned as the weight of that classifier's vote. The best average MCC, 64.70% on the independent test dataset, is attained when 22 algorithms are selected and combined through weighted votes; the corresponding ACC is 82.04%.
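The two voting schemes named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' Weka pipeline: the function names (`mcc`, `simple_majority_vote`, `weighted_vote`), the 0/1 label encoding, the tie-breaking rule (ties go to the negative class), and the clamping of negative weights to zero are all assumptions made for illustration. The abstract states only that each classifier's MCC is used as the weight of its vote; which dataset that MCC is estimated on is not specified there.

```python
from math import sqrt
from typing import List, Sequence


def mcc(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews correlation coefficient for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when the denominator vanishes.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def simple_majority_vote(votes: Sequence[Sequence[int]]) -> List[int]:
    """votes[i][j] is classifier i's 0/1 prediction for sample j.

    A sample is labeled 1 only if strictly more than half of the
    classifiers vote 1 (assumed tie-breaking rule).
    """
    n_classifiers = len(votes)
    return [1 if 2 * sum(column) > n_classifiers else 0 for column in zip(*votes)]


def weighted_vote(votes: Sequence[Sequence[int]], weights: Sequence[float]) -> List[int]:
    """Each classifier's vote counts with its weight (for example, its MCC).

    Negative weights are clamped to zero (an assumption, so that a
    worse-than-chance classifier simply has no say).
    """
    clamped = [max(w, 0.0) for w in weights]
    total = sum(clamped)
    labels = []
    for column in zip(*votes):
        positive = sum(w for w, v in zip(clamped, column) if v == 1)
        labels.append(1 if 2 * positive > total else 0)
    return labels
```

With three hypothetical classifiers, a call such as `weighted_vote(votes, [mcc(y_train, p) for p in train_predictions])` would let the more reliable classifiers outweigh the others, whereas `simple_majority_vote(votes)` treats every classifier equally.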
Year: 2011 PMID: 21826121 PMCID: PMC3149752 DOI: 10.1155/2011/506205
Source DB: PubMed Journal: J Biomed Biotechnol ISSN: 1110-7243
The distribution of proteins in the training dataset and the test dataset.
| Dataset | A | B |
|---|---|---|
| Basic training dataset | 1376 | 1376 |
| Independent test dataset | 687 | 687 |
Figure 1. The average ACC values of 34 algorithms in basic training dataset.
Figure 2. The average MCC values of 34 algorithms in basic training dataset.
Figure 3. The average ACC values of 34 algorithms in independent test dataset (including the results of SMVS and WMVS_MCC).
Figure 4. The average MCC values of 34 algorithms in independent test dataset (including the results of SMVS and WMVS_MCC).
The standard deviation of the 34 algorithms.
| Algorithm | SD of ACC (%), basic training dataset | SD of MCC (%), basic training dataset | SD of ACC (%), independent test dataset | SD of MCC (%), independent test dataset |
|---|---|---|---|---|
| AdaBoostM1 | 0.61 | 1.16 | 1.00 | 1.94 |
| J48 | 0.88 | 1.76 | 1.42 | 2.84 |
| IBk | 0.52 | 1.01 | 1.18 | 2.21 |
| MultiClassClassifier | 0.60 | 1.21 | 1.04 | 2.09 |
| PART | 0.55 | 1.25 | 1.26 | 2.54 |
| MultilayerPerceptron | 1.26 | 2.52 | 2.22 | 3.04 |
| KStar | 0.72 | 1.41 | 1.07 | 2.00 |
| Bagging | 0.76 | 1.51 | 0.43 | 0.88 |
| NBTree | 0.82 | 1.64 | 2.04 | 4.09 |
| Decorate | 0.73 | 1.47 | 1.16 | 2.25 |
| RandomForest | 0.67 | 1.32 | 0.62 | 1.25 |
| JRip | 0.48 | 0.96 | 2.25 | 4.43 |
| RandomCommittee | 0.51 | 0.99 | 1.23 | 2.59 |
| FilteredClassifier | 1.11 | 2.22 | 1.16 | 2.32 |
| ClassificationViaRegression | 0.96 | 1.91 | 0.80 | 1.57 |
| Dagging | 0.70 | 1.38 | 1.00 | 2.00 |
| AttributeSelectedClassifier | 0.85 | 1.71 | 0.66 | 1.40 |
| REPTree | 0.71 | 1.46 | 1.32 | 2.66 |
| SMO | 0.55 | 1.10 | 1.06 | 2.11 |
| J48graft | 1.06 | 2.12 | 1.40 | 2.81 |
| Ridor | 1.01 | 2.14 | 1.70 | 3.44 |
| RandomSubSpace | 0.91 | 1.84 | 1.22 | 2.44 |
| EnsembleSelection | 0.78 | 1.60 | 1.35 | 2.42 |
| SimpleLogistic | 0.41 | 0.83 | 0.92 | 1.84 |
| DecisionTable | 0.98 | 2.06 | 1.86 | 3.87 |
| DataNearBalancedND | 0.88 | 1.76 | 1.42 | 2.84 |
| RacedIncrementalLogitBoost | 0.63 | 1.59 | 1.68 | 3.61 |
| SimpleCart | 0.63 | 1.26 | 1.13 | 2.25 |
| LogitBoost | 0.43 | 0.87 | 1.23 | 2.47 |
| ND | 0.88 | 1.76 | 1.42 | 2.84 |
| BayesNet | 0.51 | 1.02 | 1.02 | 2.10 |
| ClassBalancedND | 0.88 | 1.76 | 1.42 | 2.84 |
| OrdinalClassClassifier | 0.88 | 1.76 | 1.42 | 2.84 |
| END | 0.88 | 1.76 | 1.42 | 2.84 |
The comparison of the predictors.
| Predictor | Average ACC (%) | Average MCC (%) | SD of ACC (%) | SD of MCC (%) |
|---|---|---|---|---|
| Best individual algorithm | 79.29 | 58.58 | 1.06 | 2.11 |
| SMVS | 79.72 | 59.77 | 0.76 | 1.49 |
| WMVS | 80.82 | 61.94 | 0.68 | 1.32 |
| SMVS_AS | 81.88 | 64.40 | 0.55 | 1.02 |
| WMVS_AS | 82.04 | 64.70 | 0.42 | 0.81 |
Figure 5. The average MCC value of SMVS_AS and WMVS_AS.
Figure 6. Distribution of algorithms.