Mat Soukup¹, HyungJun Cho, Jae K Lee. ¹Division of Biometrics III, Food and Drug Administration, 9201 Corporate Blvd, Rm. N-250, Rockville, MD 20850, USA.
Abstract
MOTIVATION: Genome-wide microarray data are often used in challenging classification problems involving clinically relevant subtypes of human diseases. However, attempts to identify a parsimonious, robust prediction model that performs consistently well on future independent data have not been successful, because model selection is biased when an extremely large number of candidate models is searched during model construction. Furthermore, common criteria of prediction model performance, such as classification error rates, are not sensitive enough to discriminate among such an astronomical number of competing models. Also, although several different classification approaches have been used to tackle such problems, no direct comparison of these methods has been made.
RESULTS: We introduce a novel measure for assessing the performance of a prediction model, the misclassification-penalized posterior (MiPP): the sum of the posterior classification probabilities penalized by the number of incorrectly classified samples. Using MiPP, we implement a forward step-wise cross-validated procedure to find optimal prediction models with different numbers of features on a training set. Our final robust classification model and its dimension are then determined on a completely independent test dataset. This MiPP-based classification modeling approach enables us to identify the most parsimonious robust prediction models, with only two or three features, on well-known microarray datasets. These models show superior performance to other models in the literature, which often use 40-100 or more features.
AVAILABILITY: Our MiPP software program is available at the Bioconductor website (http://www.bioconductor.org).
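The verbal definition of MiPP given in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the exact formula, so the details here are assumptions: each sample contributes the posterior probability assigned to its true class, a sample counts as misclassified when the highest-posterior class differs from its true class, and the penalty is one unit per misclassified sample. The function name `mipp` and its arguments are hypothetical and are not the interface of the Bioconductor MiPP package.

```python
from typing import Sequence


def mipp(posteriors: Sequence[Sequence[float]], true_labels: Sequence[int]) -> float:
    """Assumed form of MiPP: sum of true-class posteriors minus the
    number of misclassified samples (illustrative only)."""
    if len(posteriors) != len(true_labels):
        raise ValueError("posteriors and true_labels must have the same length")

    total_posterior = 0.0
    n_misclassified = 0
    for probs, label in zip(posteriors, true_labels):
        # Predicted class is the one with the largest posterior probability.
        predicted = max(range(len(probs)), key=probs.__getitem__)
        # Accumulate the posterior assigned to the true class.
        total_posterior += probs[label]
        if predicted != label:
            n_misclassified += 1
    return total_posterior - n_misclassified


# Two hypothetical models with the same error rate (1 of 3 samples wrong)
# but different confidence in their correct calls.
labels = [0, 1, 1]
confident = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
hesitant = [[0.6, 0.4], [0.6, 0.4], [0.45, 0.55]]

score_confident = mipp(confident, labels)
score_hesitant = mipp(hesitant, labels)
```

Under these assumptions both models have the same error rate, yet the more confident model receives the higher score, which is the kind of sensitivity the abstract says error rates alone lack.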