Sebastian Pölsterl1, Sailesh Conjeti2, Nassir Navab3, Amin Katouzian4. 1. Computer Aided Medical Procedures, Technische Universität München, Boltzmannstraße 3, 85748 Garching bei München, Germany. Electronic address: sebastian.poelsterl@tum.de. 2. Computer Aided Medical Procedures, Technische Universität München, Boltzmannstraße 3, 85748 Garching bei München, Germany. Electronic address: conjeti@in.tum.de. 3. Computer Aided Medical Procedures, Technische Universität München, Boltzmannstraße 3, 85748 Garching bei München, Germany; Johns Hopkins University, 3400 North Charles Street, Baltimore, MD 21218, USA. Electronic address: nassir.navab@tum.de. 4. IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, USA. Electronic address: akatouz@us.ibm.com.
Abstract
BACKGROUND: In clinical research, the primary interest is often the time until occurrence of an adverse event, i.e., survival analysis. Its application to electronic health records is challenging for two main reasons: (1) patient records comprise high-dimensional feature vectors, and (2) feature vectors are a mix of categorical and real-valued features, which implies varying statistical properties among features. To learn from high-dimensional data, researchers can choose from a wide range of methods in the fields of feature selection and feature extraction. Whereas feature selection is well studied, little work has focused on utilizing feature extraction techniques for survival analysis.
RESULTS: We investigate how well feature extraction methods can deal with features having varying statistical properties. In particular, we consider multiview spectral embedding algorithms, which have been developed specifically for these situations. We propose to use random survival forests to accurately determine local neighborhood relations from right-censored survival data. We evaluated 10 combinations of feature extraction methods and 6 survival models with and without intrinsic feature selection in the context of survival analysis on 3 clinical datasets. Our results demonstrate that for small sample sizes - less than 500 patients - models with built-in feature selection (Cox model with ℓ1 penalty, random survival forest, and gradient boosted models) outperform feature extraction methods by a median margin of 6.3% in concordance index (interquartile range: [-1.2%, 14.6%]).
CONCLUSIONS: If the number of samples is insufficient, feature extraction methods are unable to reliably identify the underlying manifold, which makes them of limited use in these situations. For large sample sizes - in our experiments, 2500 samples or more - feature extraction methods perform as well as feature selection methods.
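The spectral embedding methods evaluated in the abstract build a neighborhood graph over patients and embed it into a low-dimensional space. The following sketch, which is not taken from the paper, illustrates the single-view case (Laplacian eigenmaps) with plain NumPy; the multiview variants the authors consider extend this idea by combining neighborhood graphs from several feature groups. All function and parameter names here are illustrative assumptions.

```python
import numpy as np

def laplacian_eigenmaps(X, n_neighbors=5, n_components=2):
    """Minimal single-view spectral embedding (Laplacian eigenmaps) sketch.

    Builds a symmetric k-nearest-neighbor graph over the rows of X,
    forms the normalized graph Laplacian, and returns the eigenvectors
    associated with the smallest non-trivial eigenvalues as coordinates.
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances between all rows.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    # Unweighted kNN adjacency, symmetrized so the graph is undirected.
    W = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(d2[i])[1:n_neighbors + 1]  # skip self at position 0
        W[i, idx] = 1.0
    W = np.maximum(W, W.T)
    degrees = W.sum(axis=1)
    L = np.diag(degrees) - W                      # unnormalized Laplacian
    Ds = np.diag(1.0 / np.sqrt(degrees))          # D^{-1/2}
    L_sym = Ds @ L @ Ds                           # symmetric normalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L_sym)      # ascending eigenvalues
    # Drop the trivial constant eigenvector, map back to the D^{-1/2} scale.
    return Ds @ eigvecs[:, 1:n_components + 1]

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
embedding = laplacian_eigenmaps(X, n_neighbors=5, n_components=2)
print(embedding.shape)  # (30, 2)
```

In the paper's setting, the Euclidean kNN step above is the weak link for mixed categorical/real-valued features, which is why the authors propose deriving neighborhood relations from a random survival forest instead.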
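The abstract reports results in terms of the concordance index (c-index), the standard discrimination measure for right-censored survival models: among all comparable patient pairs, it is the fraction for which the model assigns higher risk to the patient who experiences the event earlier. A minimal self-contained implementation of Harrell's c-index, written here for illustration rather than taken from the paper, could look like this:

```python
def concordance_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is comparable if the subject with the shorter observed
    time actually experienced the event (events[i] truthy); the pair is
    concordant when that subject also has the higher predicted risk.
    Tied risk scores count as half-concordant.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable

# Perfect ranking: the earliest event gets the highest risk score.
print(concordance_index([1, 2, 3, 4], [1, 1, 1, 1], [4, 3, 2, 1]))  # 1.0
```

A c-index of 0.5 corresponds to random predictions and 1.0 to a perfect ranking, so the reported 6.3% median margin is a substantial difference on this scale. Production code would typically use an existing implementation, e.g. `concordance_index_censored` in scikit-survival.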