Peng Yang, Xiaoli Li, Hon-Nian Chua, Chee-Keong Kwoh, See-Kiong Ng.
Abstract
An increasing number of genes have been experimentally confirmed in recent years as causative genes for various human diseases. This newly available knowledge can be exploited by machine learning methods to discover additional unknown genes that are likely to be associated with diseases. In particular, positive-unlabeled learning (PU learning) methods, which require only a positive training set P (confirmed disease genes) and an unlabeled set U (the unknown candidate genes) instead of a negative training set N, have been shown to be effective in uncovering new disease genes in the current scenario. However, using only a single source of data for prediction can be susceptible to bias due to incompleteness and noise in the genomic data, and a single machine learning predictor is prone to bias caused by the inherent limitations of individual methods. In this paper, we propose an effective PU learning framework that integrates multiple biological data sources and an ensemble of powerful machine learning classifiers for disease gene identification. Our proposed method integrates data from multiple biological sources for training PU learning classifiers. A novel ensemble-based PU learning method, EPU, is then used to integrate multiple PU learning classifiers to achieve accurate and robust disease gene predictions. Our evaluation experiments across six disease groups showed that EPU achieved significantly better results than various state-of-the-art prediction methods as well as ensemble learning classifiers. By integrating multiple biological data sources for training and the outputs of an ensemble of PU learning classifiers for prediction, we are able to minimize the potential bias and errors in individual data sources and machine learning algorithms to achieve more accurate and robust disease gene predictions.
In the future, our EPU method provides an effective framework to integrate additional biological and computational resources for better disease gene predictions.
Year: 2014 PMID: 24816822 PMCID: PMC4016241 DOI: 10.1371/journal.pone.0097079
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Overall schema of the EPU learning algorithm.
EPU is a framework that uses positive and ‘weighted’ unlabeled examples to build an ensemble classifier for disease gene identification. First, EPU extracts candidate positives (CP) and reliable negatives (RN) from the unlabeled set. It then applies a random walk algorithm to weight the remaining unlabeled genes on genetic networks. To obtain a reliable and robust measure on U, EPU consults three biological networks: a PPI network, a GO similarity network and a gene expression network. After obtaining the ensemble-weighted genes, EPU builds three PU learning classifiers. Finally, a novel ensemble strategy is applied to combine the outputs of these classifiers to make the final predictions.
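The network-weighting step can be illustrated with a random walk with restart, a common way to score nodes by their proximity to a set of seed nodes. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the restart probability, the toy graphs, the function names `random_walk_with_restart` and `average_over_networks`, and the choice to average scores across the three networks are all hypothetical.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-9, max_iter=1000):
    """Score every node by its steady-state visiting probability.

    adj   : dict mapping node -> list of neighbour nodes (undirected graph,
            every node assumed to have at least one neighbour)
    seeds : set of confirmed disease genes where the walker restarts
    """
    nodes = sorted(adj)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    prob = dict(start)
    for _ in range(max_iter):
        # With probability `restart` the walker jumps back to a seed gene.
        nxt = {n: restart * start[n] for n in nodes}
        # Otherwise it moves to a uniformly chosen neighbour.
        for n in nodes:
            share = (1.0 - restart) * prob[n] / len(adj[n])
            for m in adj[n]:
                nxt[m] += share
        delta = sum(abs(nxt[n] - prob[n]) for n in nodes)
        prob = nxt
        if delta < tol:
            break
    return prob


def average_over_networks(networks, seeds, **kwargs):
    """Assumed combination rule: plain average of the per-network scores."""
    runs = [random_walk_with_restart(adj, seeds, **kwargs) for adj in networks]
    return {n: sum(r[n] for r in runs) / len(runs) for n in runs[0]}


# Toy networks over the same four genes; "A" is the only confirmed disease gene.
ppi = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
go_sim = {"A": ["B", "C"], "B": ["A"], "C": ["A", "D"], "D": ["C"]}
coexpr = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B"], "D": ["A"]}

ppi_scores = random_walk_with_restart(ppi, {"A"})
weights = average_over_networks([ppi, go_sim, coexpr], {"A"})
print(sorted(weights.items(), key=lambda kv: -kv[1]))
```

On the chain-shaped `ppi` graph the score drops with distance from the seed, so genes closer to known disease genes receive larger weights. Averaging over several networks is one simple way to keep a single noisy network from dominating the weight.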
Figure 2. Ensemble learning algorithm.
Overall comparison of classification performance among different techniques.
| Disease group | Techniques | Precision (p) | Recall (r) | F-measure (F) |
| --- | --- | --- | --- | --- |
| Cardiovascular | PUDI | 82.0% | 80.3% | 80.4% |
| | ProDiGe | 54.3% | 96.3% | 69.3% |
| | Smalter's method | 75.4% | 67.6% | 70.6% |
| | Xu's method | 72.1% | 60.0% | 65.4% |
| | EPU | 85.2% | 81.0% | |
| Endocrine | PUDI | 83.6% | 75.3% | 79.2% |
| | ProDiGe | 57.3% | 87.7% | 69.3% |
| | Smalter's method | 76.4% | 58.8% | 66.5% |
| | Xu's method | 75.4% | 62.0% | 68.0% |
| | EPU | 88.1% | 87.7% | |
| Neurological | PUDI | 70.3% | 80.1% | 74.9% |
| | ProDiGe | 63.1% | 74.0% | 68.1% |
| | Smalter's method | 60.6% | 65.9% | 63.1% |
| | Xu's method | 59.7% | 66.7% | 63.0% |
| | EPU | 78.2% | 80.4% | |
| Metabolic | PUDI | 80.1% | 84.8% | 82.4% |
| | ProDiGe | 58.7% | 84.5% | 69.3% |
| | Smalter's method | 59.1% | 84.7% | 69.6% |
| | Xu's method | 65.6% | 78.3% | 71.4% |
| | EPU | 83.3% | 93.9% | |
| Ophthalmological | PUDI | 71.6% | 78.5% | 74.9% |
| | ProDiGe | 58.3% | 77.7% | 66.6% |
| | Smalter's method | 56.7% | 77.8% | 65.5% |
| | Xu's method | 64.2% | 71.3% | 67.4% |
| | EPU | 89.3% | 81.0% | |
| Cancer | PUDI | 76.3% | 80.0% | 78.0% |
| | ProDiGe | 71.1% | 79.8% | 75.3% |
| | Smalter's method | 73.8% | 79.0% | 76.3% |
| | Xu's method | 71.0% | 79.7% | 75.1% |
| | EPU | 81.2% | 84.5% | |
| Average performance | PUDI | 77.3% | 79.8% | 78.3% |
| | ProDiGe | 60.5% | 83.3% | 69.7% |
| | Smalter's method | 67.0% | 72.3% | 68.6% |
| | Xu's method | 68.0% | 69.7% | 68.4% |
| | EPU | 84.2% | 84.8% | |
PUDI is an SVM-based approach that partitions unlabeled genes into multiple levels with different associations to confirmed disease genes. ProDiGe is a bagging method that iteratively chooses random subsets from the unlabeled set and trains multiple classifiers. Smalter's method integrates multiple biological features, such as topological features, sequence-derived features, and evolutionary age features. Xu's method employs a KNN classifier to predict disease genes.
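The bagging idea attributed to ProDiGe above can be sketched as follows: repeatedly draw a random subset of the unlabeled genes, treat it as a provisional negative set, train a classifier against the positives, and average the scores each gene receives when it was left out of the draw. This is a hedged illustration only. The nearest-centroid base classifier, the feature vectors, the number of bags, and the function name `bagging_pu_scores` are assumptions made for the example and are not taken from ProDiGe or from this paper.

```python
import random


def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bagging_pu_scores(positives, unlabeled, n_bags=50, seed=0):
    """Average out-of-bag score for every unlabeled gene (higher = more positive-like)."""
    rng = random.Random(seed)
    pos_c = centroid(positives)
    totals = [0.0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    # Keep at least one unlabeled gene out of every bag so it can be scored.
    bag_size = min(len(positives), len(unlabeled) - 1)
    for _ in range(n_bags):
        bag = set(rng.sample(range(len(unlabeled)), bag_size))
        # Stand-in base classifier: centroid of the provisional negatives.
        neg_c = centroid([unlabeled[i] for i in bag])
        for i, x in enumerate(unlabeled):
            if i in bag:
                continue  # genes used as negatives in this bag are not scored
            totals[i] += sq_dist(x, neg_c) - sq_dist(x, pos_c)
            counts[i] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


# Toy data: two confirmed disease genes and four unlabeled genes,
# the first of which sits close to the positives.
positives = [[1.0, 1.0], [0.9, 1.1]]
unlabeled = [[1.0, 0.9], [0.0, 0.0], [0.1, -0.1], [-0.1, 0.2]]
scores = bagging_pu_scores(positives, unlabeled)
print(scores)
```

Because each bag treats only a small random part of U as negative, a hidden disease gene in U contaminates only some of the bags, and its averaged score can still stand out.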
Overall comparison to single-expert classifiers.
| Disease group | Techniques | Precision (p) | Recall (r) | F-measure (F) |
| --- | --- | --- | --- | --- |
| Cardiovascular | MSVM | 74.3% | 87.6% | 80.4% |
| | WNB | 57.3% | 72.5% | 63.9% |
| | WKNN(3) | 60.1% | 68.6% | 64.0% |
| | EPU | 85.2% | 81.0% | |
| Endocrine | MSVM | 83.4% | 85.2% | 84.2% |
| | WNB | 61.3% | 70.4% | 65.3% |
| | WKNN(3) | 64.5% | 53.1% | 57.9% |
| | EPU | 88.1% | 87.7% | |
| Neurological | MSVM | 69.3% | 83.7% | 75.8% |
| | WNB | 61.1% | 74.4% | 67.0% |
| | WKNN(3) | 62.3% | 67.1% | 64.6% |
| | EPU | 78.2% | 80.4% | |
| Metabolic | MSVM | 84.0% | 91.3% | 87.4% |
| | WNB | 68.8% | 79.9% | 73.9% |
| | WKNN(3) | 76.6% | 78.8% | 77.6% |
| | EPU | 83.3% | 93.9% | |
| Ophthalmological | MSVM | 78.4% | 86.1% | 81.9% |
| | WNB | 61.2% | 78.7% | 68.8% |
| | WKNN(3) | 67.3% | 72.2% | 69.6% |
| | EPU | 89.3% | 81.0% | |
| Cancer | MSVM | 73.4% | 83.9% | 78.3% |
| | WNB | 72.5% | 85.1% | 78.3% |
| | WKNN(3) | 76.4% | 81.0% | 78.6% |
| | EPU | 81.2% | 84.5% | |
| Average performance | MSVM | 78.6% | 86.3% | 81.3% |
| | WNB | 63.7% | 76.8% | 69.5% |
| | WKNN(3) | 67.9% | 70.1% | 68.7% |
| | EPU | 84.2% | 84.8% | |
EPU is compared with its three component classifiers, Multi-level Support Vector Machine (MSVM), Weighted Naïve Bayes (WNB) and Weighted K-Nearest Neighbor (WKNN), on six disease groups. WKNN(3) is an instance-based classifier that predicts the class of an unlabeled gene based on its 3 closest labeled genes.
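A weighted k-nearest-neighbour vote with k = 3 can be sketched in a few lines. The example below is illustrative and rests on assumptions: the per-example weights stand in for whatever confidence EPU assigns to labeled and weighted unlabeled genes, and the Euclidean distance, the toy feature vectors, and the function name `wknn_predict` are hypothetical.

```python
import math


def wknn_predict(query, labeled, k=3):
    """Weighted vote among the k labeled genes closest to `query`.

    labeled : list of (features, label, weight) with label in {+1, -1};
              the weight is an assumed confidence in that label.
    """
    nearest = sorted(labeled, key=lambda ex: math.dist(query, ex[0]))[:k]
    vote = sum(label * weight for _, label, weight in nearest)
    return 1 if vote > 0 else -1


# Toy training set: two confident positives, one confident negative,
# and two low-weight negatives (e.g. weakly supported unlabeled genes).
labeled = [
    ([1.0, 1.0], +1, 1.0),
    ([0.9, 1.0], +1, 1.0),
    ([0.0, 0.0], -1, 1.0),
    ([0.1, 0.1], -1, 0.4),
    ([0.2, 0.0], -1, 0.3),
]

near_positive = wknn_predict([0.8, 0.9], labeled)
near_negative = wknn_predict([0.05, 0.05], labeled)
# The 3 nearest neighbours here are two low-weight negatives and one
# confident positive, so the weighted vote differs from a plain majority.
borderline = wknn_predict([0.5, 0.5], labeled)
print(near_positive, near_negative, borderline)
```

The borderline query shows why the weights matter: an unweighted majority vote would label it negative, while the weighted vote follows the single high-confidence positive neighbour.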
Novel cancer-related genes predicted by EPU.
| Gene ID | Supporting literature |
| --- | --- |
| SIGLEC7 | Ito A. et al. (2001) Binding specificity of siglec7 to disialogangliosides of renal cell carcinoma: possible role of disialogangliosides in tumor progression. FEBS Lett. |
| PRDX4 | Lee S.U. et al. (2008) Involvement of peroxiredoxin IV in the 16alpha-hydroxyestrone-induced proliferation of human MCF-7 breast cancer cells. Cell Biol Int 32(4): 401–5. |
| | Park H.J. et al. (2008) Proteomic profiling of endothelial cells in human lung cancer. J Proteome Res 7(3): 1138–50. |
| PRDX5 | Enqman L., et al. (2003) Thioredoxin reductase and cancer cell growth inhibition by organotellurium compounds that could be selectively incorporated into tumor cells. Bioorg Med Chem 11(23): 5091–100. |
| | McNaughton M., et al. (2004) Cyclodextrin-derived diorganyl tellurides as glutathione peroxidase mimics and inhibitors of thioredoxin reductase and cancer cell growth. J Med Chem 47(1): 233–9. |
| | Enqman L., et al. (2000) Water-soluble organotellurium compounds inhibit thioredoxin reductase and the growth of human cancer cells. Anticancer Drug Des. 15(5): 323–30. |
| HNRNPL | Goehe, R.W., et al. (2010) hnRNPL regulates the tumorigenic capacity of lung cancer xenografts in mice via caspase-9 pre-mRNA processing. J. Clin. Invest. 120(11): 3923. |
| | Hope N.R., et al. (2011) The expression profile of RNA-binding proteins in primary and metastatic colorectal cancer: relationship of heterogeneous nuclear ribonucleoproteins with prognosis. Hum Pathol. 42(3): 393–402. |
| SRPK1 | Hayes, G.M., et al. (2007) Serine-arginine protein kinase 1 overexpression is associated with tumorigenic imbalance in mitogen-activated protein kinase pathways in breast, colonic, and pancreatic carcinomas. Cancer Res. 67(5): 2972–80. |
| ABCB10 | Tang, L., et al. (2009) Exclusion of ABCB8 and ABCB10 as cancer candidate genes in acute myeloid leukemia. Letter to the Editor. Leukemia 23: 1000–2. |
| PHF10 | Wet M., et al. (2010) Preparation of PHF10 antibody and analysis of PHF10 expression gastric cancer tissues. Journal of Xiao Bao Yu Fen Zi Mian Yi Xue 26(9): 874–6. |
| | Li C., et al. (2012) MicroRNA-409-3p regulates cell proliferation and apoptosis by targeting PHF10 in gastric cancer. Cancer Lett 320(2): 187–97. |
EPU is used to discover novel cancer-related genes from the unlabeled gene set. The table lists 12 candidate genes associated with cancer and the corresponding literature evidence.