Vincenzo Lagani¹, Ioannis Tsamardinos. ¹ Institute of Computer Science, Foundation for Research and Technology-Hellas (FORTH) and Computer Science Department, University of Crete, Heraklion, Greece. vlagani@ics.forth.gr
Abstract
MOTIVATION: Variable selection is a typical approach to molecular-signature and biomarker discovery; however, its application to survival data is often complicated by censored samples. We propose a new variable selection algorithm for high-dimensional, right-censored data, called Survival Max-Min Parents and Children (SMMPC). The algorithm is conceptually simple and scalable, is based on the theory of Bayesian networks (BNs) and the Markov blanket, and extends the corresponding algorithm for classification tasks (MMPC). The selected variables have a structural interpretation: if T is the survival time (in general, the time-to-event), SMMPC returns the variables adjacent to T in the BN representing the data distribution. The selected variables also have a causal interpretation, which we discuss.

RESULTS: We conduct an extensive empirical analysis of prototypical and state-of-the-art variable selection algorithms for survival data that are applicable to high-dimensional biological data. SMMPC selects on average the smallest variable subsets (fewer than a dozen variables per dataset), while statistically significantly outperforming all methods in the study that return a manageable number of genes, that is, few enough to be inspected by a human expert.

AVAILABILITY: Matlab and R code are freely available from http://www.mensxmachina.org
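The abstract states that SMMPC extends MMPC and returns the variables adjacent to T. As a rough illustration of the generic max-min idea only, and not the authors' implementation, the sketch below greedily adds the variable whose weakest association with T (largest p-value over the conditioning sets tried so far) is still significant, then prunes variables that become independent of T given some subset of the others. Everything named here is an assumption made for the example: the function `max_min_select`, the parameters `alpha` and `max_k`, and the callback `ci_pvalue`, which stands in for a conditional independence test. For right-censored data such a test would have to account for censoring; here a toy oracle replaces it.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: p-value for "X is independent of T given Z".
CITest = Callable[[str, Sequence[str]], float]


def max_min_select(variables: Sequence[str], ci_pvalue: CITest,
                   alpha: float = 0.05, max_k: int = 2) -> List[str]:
    """Illustrative max-min forward selection followed by backward pruning."""
    selected: List[str] = []
    # Largest p-value seen so far per variable, i.e. its minimum association.
    worst_p: Dict[str, float] = {v: ci_pvalue(v, ()) for v in variables}
    candidates = {v for v in variables if worst_p[v] <= alpha}

    while candidates:
        # "Max-min": take the candidate whose weakest association is strongest.
        best = min(sorted(candidates), key=lambda v: worst_p[v])
        selected.append(best)
        candidates.discard(best)
        # Re-test the remaining candidates on subsets containing the new variable.
        for v in list(candidates):
            for k in range(1, min(max_k, len(selected)) + 1):
                for subset in combinations(selected, k):
                    if best in subset:
                        worst_p[v] = max(worst_p[v], ci_pvalue(v, subset))
            if worst_p[v] > alpha:
                candidates.discard(v)

    # Backward phase: drop variables independent of T given some other subset.
    for v in list(selected):
        others = [s for s in selected if s != v]
        for k in range(1, min(max_k, len(others)) + 1):
            if any(ci_pvalue(v, z) > alpha for z in combinations(others, k)):
                selected.remove(v)
                break
    return selected


def toy_pvalue(x: str, z: Sequence[str]) -> float:
    """Toy oracle (assumed): A is adjacent to T, B relates to T only via A."""
    if x == "A":
        return 0.001
    if x == "B":
        return 0.9 if "A" in z else 0.01
    return 0.8  # C is unrelated to T


print(max_min_select(["A", "B", "C"], toy_pvalue))  # → ['A']
```

In the toy run, B looks associated with T on its own but is discarded once A is selected, which mirrors the structural reading in the abstract: only variables adjacent to T are kept.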