Jiancheng Zhong, Jianxin Wang, Wei Peng, Zhen Zhang, Yi Pan.
Abstract
BACKGROUND: Essential proteins are indispensable for cell survival. Identifying them is important for understanding how a cell works. Various types of features are related to protein essentiality, and many methods have been proposed that combine some of these features to predict essential proteins. However, it remains a challenge to design an effective method that predicts essential proteins by integrating different features and that explains how the selected features determine essentiality. Gene expression programming (GEP) is a learning algorithm that learns relationships between variables in a data set and builds models to explain these relationships.
Year: 2013 PMID: 24267033 PMCID: PMC3856491 DOI: 10.1186/1471-2164-14-S4-S7
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. ROC curves and AUC values of ten classifiers trained by 10-fold cross-validation. The original data are divided into 10 equal subsets; nine folds are used to train the classifier and the remaining fold is used for testing. The process is repeated ten times to generate ten classifiers, with each of the ten subsets used exactly once as testing data. The figure shows the ROC curves and corresponding AUC values of these classifiers.
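The 10-fold procedure described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `k_fold_indices`, the shuffling seed, and the round-robin assignment of samples to folds are assumptions made for the example. Because 5093 proteins (the data set size given for Figure 2) cannot be split into ten exactly equal parts, fold sizes here differ by at most one.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs so that each sample is tested exactly once.

    Hypothetical helper for illustration; the source does not specify how
    proteins were assigned to folds.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Round-robin assignment keeps fold sizes within one sample of each other.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# One classifier would be trained per split (training itself is omitted here).
for fold_number, (train_idx, test_idx) in enumerate(k_fold_indices(5093, k=10), 1):
    print(fold_number, len(train_idx), len(test_idx))
```

Each of the ten iterations would train one classifier on `train_idx` and evaluate it on `test_idx`, giving the ten ROC curves the caption refers to.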
Figure 2. ROC curves and AUC values of our GEP classifier and of other methods that each use an individual feature. We select one classifier whose prediction performance is average among the ten classifiers generated by 10-fold cross-validation, and test it on the original data containing 5093 proteins with all available learning features. The figure shows the ROC curves of our classifier and of the other methods that use an individual feature.
Comparison between GEP and the methods using individual features
| Methods | SN | SP | FPR | PPV | NPV | F-measure | ACC | MCC |
|---|---|---|---|---|---|---|---|---|
| DC | 0.4002 | 0.8217 | 0.1783 | 0.4002 | 0.8217 | 0.4002 | 0.7251 | 0.2219 |
| BC | 0.3505 | 0.8069 | 0.1931 | 0.3505 | 0.8069 | 0.3505 | 0.7023 | 0.1574 |
| CC | 0.3548 | 0.8082 | 0.1918 | 0.3548 | 0.8082 | 0.3548 | 0.7043 | 0.1630 |
| SC | 0.3676 | 0.8120 | 0.1880 | 0.3676 | 0.8120 | 0.3676 | 0.7102 | 0.1796 |
| EC | 0.3676 | 0.8120 | 0.1880 | 0.3676 | 0.8120 | 0.3676 | 0.7102 | 0.1796 |
| IC | 0.4010 | 0.8220 | 0.1780 | 0.4010 | 0.8220 | 0.4010 | 0.7255 | 0.2230 |
| NC | 0.4353 | 0.8321 | 0.1679 | 0.4353 | 0.8321 | 0.4353 | 0.7412 | 0.2674 |
| PeC | 0.4036 | 0.8227 | 0.1773 | 0.4036 | 0.8227 | 0.4036 | 0.7267 | 0.2263 |
| ION | 0.5124 | 0.8551 | 0.1449 | 0.5124 | 0.8551 | 0.5124 | 0.7766 | 0.3675 |
| WDC | 0.4576 | 0.8390 | 0.1610 | 0.4580 | 0.8388 | 0.4578 | 0.7516 | 0.2967 |
The proteins in the PPI network are ranked in descending order according to the scores assigned by our classifier and by the existing methods. We select the top 1167 proteins ranked by each method as candidate essential proteins. The remaining 3926 (= 5093 - 1167) proteins are regarded as non-essential. Based on the known essential proteins, the values of sensitivity (SN), specificity (SP), false positive rate (FPR), positive predictive value (PPV), negative predictive value (NPV), F-measure, accuracy (ACC) and Matthews correlation coefficient (MCC) are calculated for each method. The table lists the results.
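The evaluation described in the table note can be sketched with the standard definitions of these statistics. The function names and the toy scores below are hypothetical. The confusion counts in the worked example (TP = 467, FP = 700, FN = 700, TN = 3226) are not given in the source; they are back-calculated on the assumption that there are 1167 known essential proteins, which the equal SN and PPV values in most rows suggest, and they are consistent with the DC row to four decimals.

```python
import math


def confusion_from_top_k(scores, essential, k):
    """Treat the k highest-scoring proteins as predicted essential.

    scores: dict mapping protein id -> score (hypothetical input format).
    essential: set of known essential protein ids.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    predicted = set(ranked[:k])
    tp = len(predicted & essential)
    fp = len(predicted) - tp
    fn = len(essential & set(scores)) - tp
    tn = len(scores) - tp - fp - fn
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Standard definitions of the statistics listed in the table."""
    sn = tp / (tp + fn)    # sensitivity (recall)
    sp = tn / (tn + fp)    # specificity
    ppv = tp / (tp + fp)   # positive predictive value (precision)
    npv = tn / (tn + fn)   # negative predictive value
    return {
        "SN": sn,
        "SP": sp,
        "FPR": 1.0 - sp,
        "PPV": ppv,
        "NPV": npv,
        "F": 2 * sn * ppv / (sn + ppv),
        "ACC": (tp + tn) / (tp + fp + fn + tn),
        "MCC": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


# Toy ranking: the top 2 of 4 proteins are called essential.
toy_scores = {"a": 0.9, "b": 0.8, "c": 0.3, "d": 0.1}
print(confusion_from_top_k(toy_scores, {"a", "c"}, k=2))

# Assumed counts consistent with the DC row (see the lead-in).
for name, value in metrics(467, 700, 700, 3226).items():
    print(f"{name}: {value:.4f}")
```

When the number of selected proteins equals the number of known essential proteins, FP equals FN, so SN, PPV and F-measure coincide, which matches the pattern in most rows of the table.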
Comparison of average AUC between GEP and other machine learning based methods.
| Methods | AUC |
|---|---|
| SVM | 0.577 |
| SMO | 0.608 |
| NaiveBayes | 0.744 |
| Bayes Network | 0.731 |
| RBF Network | 0.669 |
| J48 | 0.687 |
| Random Tree | 0.612 |
| Random Forest | 0.721 |
| NaiveBayes Tree | 0.746 |
| Acencio | 0.778 |
This table shows the average AUC values of our GEP classifiers and some machine learning methods.
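For readers unfamiliar with the statistic, the AUC values in the table can be interpreted as the probability that a randomly chosen essential protein receives a higher score than a randomly chosen non-essential one. The sketch below computes AUC directly from that pairwise definition; the function name and the toy data are hypothetical, and the source does not say which tool was used to compute its AUC values.

```python
def auc_from_scores(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of samples,
    which is fine for illustration but slow for large data sets.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
print(auc_from_scores([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of essential from non-essential proteins.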
Figure 3. Flowchart of building the GEP classifier. This figure shows the flowchart of building the GEP classifier.
Parameters used in our GEP method
| Parameter | Description of parameter | Setting of parameter |
|---|---|---|
| P1 | Number of Population | 12000 |
| P2 | Length of Gene | 1 |
| P3 | Length of Chromosome | 60 |
| P4 | Length of Head | 20 |
| P5 | Mutation rate% | 0.25 |
| P6 | Cross rate% | 0.1 |
| P7 | Number of Generation | 500 |
| P8 | Function set | +, -, *, =, /, Sqrt, Log, Exp, Abs, Max, Min |
| P9 | Fitness Function Name | SSPN |
This table lists some parameters used in our GEP method.