Herbert Pang (1), Debayan Datta, Hongyu Zhao. (1) Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC 27710, USA. pathwayrf@gmail.com
Abstract
MOTIVATION: There is great interest in pathway-based methods for genomics data analysis in the research community. Although machine learning methods, such as random forests, have been developed to correlate survival outcomes with a set of genes, no study has assessed how well these methods incorporate pathway information when analyzing microarray data. In general, genes that are identified without incorporating biological knowledge are more difficult to interpret. Correlating pathway-based gene expression with survival outcomes may lead to more biologically meaningful prognostic biomarkers. Thus, a comprehensive study of how these methods perform in a pathway-based setting is warranted.
RESULTS: In this article, we describe a pathway-based method using random forests to correlate gene expression data with survival outcomes and introduce a novel bivariate node-splitting random survival forest. The proposed method allows researchers to identify important pathways for predicting patient prognosis and time to disease progression, and to discover important genes within those pathways. We compared different implementations of random forests with different split criteria and found that bivariate node-splitting random survival forests with the log-rank test are among the best. We also performed simulation studies showing that random forests outperform several other machine learning algorithms and perform comparably to a newly developed component-wise Cox boosting model. Thus, pathway-based survival analysis using machine learning tools represents a promising approach for dissecting pathways and for generating new biological hypotheses from microarray studies.
AVAILABILITY: R package Pwayrfsurvival is available from URL: http://www.duke.edu/~hp44/pwayrfsurvival.htm.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
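To make the log-rank split criterion mentioned in the abstract concrete, the following is a minimal sketch of how a survival tree node could score candidate splits on a single gene's expression values. It assumes the standard two-sample log-rank statistic and an ordinary univariate split; it is not the bivariate node-splitting rule introduced in the article, and it does not reproduce the Pwayrfsurvival package. The function names `logrank_statistic` and `best_split` are hypothetical and chosen only for illustration.

```python
from math import sqrt


def logrank_statistic(times, events, in_left):
    """Absolute standardized two-sample log-rank statistic.

    times   : follow-up time per sample
    events  : 1/True if the event was observed, 0/False if censored
    in_left : True if the sample falls in the left daughter node
    """
    # Distinct times at which at least one event was observed.
    event_times = sorted({t for t, e in zip(times, events) if e})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        # Samples still at risk just before time t (overall and in the left node).
        n = sum(1 for u in times if u >= t)
        n_left = sum(1 for u, g in zip(times, in_left) if u >= t and g)
        # Events occurring exactly at time t (overall and in the left node).
        d = sum(1 for u, e in zip(times, events) if u == t and e)
        d_left = sum(
            1 for u, e, g in zip(times, events, in_left) if u == t and e and g
        )
        observed_minus_expected += d_left - d * n_left / n
        if n > 1:
            p = n_left / n
            variance += d * p * (1 - p) * (n - d) / (n - 1)
    if variance <= 0:
        return 0.0
    return abs(observed_minus_expected) / sqrt(variance)


def best_split(x, times, events):
    """Scan midpoints between sorted unique values of x.

    Returns (threshold, statistic) for the split with the largest
    log-rank statistic, or (None, 0.0) if no split separates the data.
    """
    values = sorted(set(x))
    best = (None, 0.0)
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        in_left = [v <= threshold for v in x]
        stat = logrank_statistic(times, events, in_left)
        if stat > best[1]:
            best = (threshold, stat)
    return best


if __name__ == "__main__":
    # Toy data (made up for illustration): low expression, short survival.
    expression = [0.0, 0.0, 1.0, 1.0]
    survival_time = [1, 2, 3, 4]
    event_observed = [1, 1, 1, 1]
    print(best_split(expression, survival_time, event_observed))
```

In a pathway-based setting, such a search would presumably be restricted to the genes belonging to one pathway at a time, so that each forest is grown from, and scored on, a single pathway's expression data.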
Authors: Aravind Subramanian; Pablo Tamayo; Vamsi K Mootha; Sayan Mukherjee; Benjamin L Ebert; Michael A Gillette; Amanda Paulovich; Scott L Pomeroy; Todd R Golub; Eric S Lander; Jill P Mesirov Journal: Proc Natl Acad Sci U S A Date: 2005-09-30 Impact factor: 11.205
Authors: Zahra Madjd; Lindy G Durrant; Sarah E Pinder; Ian O Ellis; John Ronan; Sarah Lewis; Neil K Rushmere; Ian Spendlove Journal: Cancer Immunol Immunother Date: 2004-09-16 Impact factor: 6.968
Authors: Antonino B D'Assoro; Robert Busby; Kelly Suino; Emmanuella Delva; Gustavo J Almodovar-Mercado; Heidi Johnson; Christopher Folk; Daniel J Farrugia; Vlad Vasile; Franca Stivala; Jeffrey L Salisbury Journal: Oncogene Date: 2004-05-20 Impact factor: 9.867
Authors: Vincent Vuaroqueaux; Patrick Urban; Martin Labuhn; Mauro Delorenzi; Pratyaksha Wirapati; Christopher C Benz; Renata Flury; Holger Dieterich; Frédérique Spyratos; Urs Eppenberger; Serenella Eppenberger-Castori Journal: Breast Cancer Res Date: 2007 Impact factor: 6.466
Authors: Kang K Yan; Xiaofei Wang; Wendy W T Lam; Varut Vardhanabhuti; Anne W M Lee; Herbert H Pang Journal: Comput Biol Med Date: 2020-08-06 Impact factor: 4.589
Authors: Guoan Chen; Sinae Kim; Jeremy M G Taylor; Zhuwen Wang; Oliver Lee; Nithya Ramnath; Rishindra M Reddy; Jules Lin; Andrew C Chang; Mark B Orringer; David G Beer Journal: J Thorac Oncol Date: 2011-09 Impact factor: 15.609