SungHwan Kim (1), Chien-Wei Lin (2), George C Tseng (3)
1. Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA; Department of Statistics, Korea University, Seoul, South Korea.
2. Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA.
3. Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA; Department of Computational and Systems Biology; Department of Human Genetics, University of Pittsburgh, Pittsburgh, PA, USA.
Abstract
MOTIVATION: Supervised machine learning is widely applied to transcriptomic data to predict disease diagnosis, prognosis or survival. Robust and interpretable classifiers with high accuracy are usually favored for their clinical and translational potential. The top scoring pair (TSP) algorithm is one example: it uses a simple rank-based procedure to identify rank-altered gene pairs for classifier construction. Although many classification methods perform well in cross-validation within a single expression profile study, performance usually drops substantially in cross-study validation (i.e. the prediction model is established in a training study and applied to an independent test study) for all machine learning methods, including TSP. This failure of cross-study validation has largely diminished the potential translational and clinical value of the models. The purpose of this article is to develop a meta-analytic top scoring pair (MetaKTSP) framework that combines multiple transcriptomic studies and generates a robust prediction model applicable to independent test studies.
RESULTS: We proposed two frameworks, which select the top gene pairs for model construction either by averaging TSP scores or by combining P-values from individual studies. We applied the proposed methods to simulated data sets and to three large-scale real applications in breast cancer, idiopathic pulmonary fibrosis and pan-cancer methylation. The results showed superior cross-study validation accuracy and biomarker selection for the new meta-analytic framework. In conclusion, combining multiple omics data sets in the public domain increases the robustness and accuracy of the classification model, which will ultimately improve disease understanding and clinical treatment decisions to benefit patients.
AVAILABILITY AND IMPLEMENTATION: An R package, MetaKTSP, is available online (http://tsenglab.biostat.pitt.edu/software.htm).
CONTACT: ctseng@pitt.edu
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
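The score-averaging idea summarized in the abstract can be illustrated with a small sketch. This is not the MetaKTSP R package or its API. It assumes the standard TSP score for a gene pair (i, j), namely the difference between the two classes in the frequency of the event x_i < x_j, and it assumes that scores are averaged with their sign so that pairs whose direction flips between studies cancel out. The function names (`tsp_score`, `mean_score_top_pairs`, `predict`) and the toy studies are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def tsp_score(samples, labels, i, j):
    """Signed TSP score: P(x_i < x_j | class 0) - P(x_i < x_j | class 1)."""
    freq = {}
    for cls in (0, 1):
        rows = [s for s, y in zip(samples, labels) if y == cls]
        freq[cls] = sum(1 for s in rows if s[i] < s[j]) / len(rows)
    return freq[0] - freq[1]


def mean_score_top_pairs(studies, n_pairs=1):
    """Rank gene pairs by the absolute value of the study-averaged signed score.

    Assumption: signed averaging, so inconsistent directions across studies cancel.
    Returns a list of (abs_mean_score, (i, j), mean_score) tuples.
    """
    n_genes = len(studies[0][0][0])
    ranked = []
    for i, j in combinations(range(n_genes), 2):
        avg = sum(tsp_score(x, y, i, j) for x, y in studies) / len(studies)
        ranked.append((abs(avg), (i, j), avg))
    ranked.sort(key=lambda t: t[0], reverse=True)
    return ranked[:n_pairs]


def predict(sample, pair, mean_score):
    """Classify one sample with a single pair, using only the order of two genes."""
    i, j = pair
    holds = sample[i] < sample[j]
    if mean_score > 0:  # x_i < x_j is more frequent in class 0
        return 0 if holds else 1
    return 1 if holds else 0


# Hypothetical toy studies (rows = samples, columns = genes) on different scales.
study_a = (
    [[1.0, 2.0, 1.5], [0.5, 1.8, 0.2], [2.0, 1.0, 0.5], [3.0, 0.5, 4.0]],
    [0, 0, 1, 1],
)
study_b = (
    [[100, 250, 300], [120, 180, 90], [400, 150, 500], [310, 200, 100]],
    [0, 0, 1, 1],
)

top = mean_score_top_pairs([study_a, study_b], n_pairs=1)
best_abs, best_pair, best_mean = top[0]
new_sample = [5.0, 9.0, 1.0]  # from a third, unseen "study" on yet another scale
predicted_class = predict(new_sample, best_pair, best_mean)
```

Because the rule depends only on the relative order of two genes within a sample, the two toy studies can sit on very different measurement scales without any normalization, which is the property that makes rank-based pairs attractive for cross-study prediction. A K-pair classifier would keep several top pairs and take a majority vote over their individual predictions.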