Joshua Millstein (1), Francesca Battaglin (2,3), Malcolm Barrett (1), Shu Cao (1), Wu Zhang (2), Sebastian Stintzing (4), Volker Heinemann (5), Heinz-Josef Lenz (2).
1. Department of Preventive Medicine, CA 90033, USA.
2. Department of Medicine, Division of Medical Oncology, Norris Comprehensive Cancer Center, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA.
3. Clinical and Experimental Oncology Department, Medical Oncology Unit 1, Veneto Institute of Oncology IOV-IRCCS, Padua 35128, Italy.
4. Medical Department, Division of Oncology and Hematology, Charité Universitaetsmedizin Berlin, Berlin 10117, Germany.
5. Department of Medicine III, University Hospital Munich, Munich 80336, Germany.
Abstract
MOTIVATION: Large amounts of information generated by genomic technologies are accompanied by statistical and computational challenges due to redundancy, badly behaved data and noise. Dimensionality reduction (DR) methods have been developed to mitigate these challenges. However, many approaches are not scalable to large dimensions or result in excessive information loss.

RESULTS: The proposed approach partitions data into subsets of related features and summarizes each subset into one and only one new feature, thus defining a surjective mapping from original to reduced features. A constraint on information loss determines the size of the reduced dataset. Simulation studies demonstrate that when multiple related features are associated with a response, this approach can substantially increase the number of true associations detected compared with principal components analysis, non-negative matrix factorization or no DR. This increase in true discoveries is explained both by a reduced multiple-testing challenge and by a reduction in extraneous noise. In an application to real data collected from metastatic colorectal cancer tumors, more associations between gene expression features and both progression-free survival and response to treatment were detected in the reduced dataset than in the full, untransformed dataset.

AVAILABILITY AND IMPLEMENTATION: Freely available R package from CRAN, https://cran.r-project.org/package=partition.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
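The partitioning idea described under RESULTS can be illustrated with a minimal Python sketch. It is not the algorithm implemented in the `partition` R package. It assumes a simple greedy merging scheme, a summary defined as the row-wise mean of standardized features, and a hypothetical information measure (the smallest squared correlation between a group summary and any of its members). The function names, the example data, and the threshold of 0.8 are all illustrative assumptions.

```python
from itertools import combinations
from statistics import mean, pstdev


def standardize(col):
    """Center a feature and scale it to unit variance (zeros if constant)."""
    m, s = mean(col), pstdev(col)
    if s == 0:
        return [0.0 for _ in col]
    return [(x - m) / s for x in col]


def corr(a, b):
    """Pearson correlation; evaluates to 0.0 if either input is constant."""
    za, zb = standardize(a), standardize(b)
    return mean(x * y for x, y in zip(za, zb))


def summarize(cols):
    """Summarize a group of features as the row-wise mean of their z-scores."""
    z = [standardize(c) for c in cols]
    return [mean(row) for row in zip(*z)]


def info_retained(cols):
    """Hypothetical information measure (an assumption of this sketch):
    smallest squared correlation between the summary and any member."""
    if len(cols) == 1:
        return 1.0
    s = summarize(cols)
    return min(corr(s, c) ** 2 for c in cols)


def reduce_features(features, threshold=0.8):
    """Greedily merge groups of features while every merged group still
    retains at least `threshold` of the information measure.

    `features` maps feature name -> list of values. Each original feature
    ends up in exactly one group, and each group yields one new feature.
    """
    groups = [[name] for name in features]
    while True:
        best = None
        for g1, g2 in combinations(groups, 2):
            score = info_retained([features[n] for n in g1 + g2])
            if score >= threshold and (best is None or score > best[0]):
                best = (score, g1, g2)
        if best is None:
            break  # no remaining merge satisfies the information constraint
        _, g1, g2 = best
        groups = [g for g in groups if g is not g1 and g is not g2]
        groups.append(g1 + g2)
    reduced = {}
    for g in groups:
        cols = [features[n] for n in g]
        # Singleton groups pass through unchanged; merged groups are summarized.
        reduced["+".join(g)] = list(cols[0]) if len(g) == 1 else summarize(cols)
    return groups, reduced


# Illustrative data: "a" and "b" are nearly redundant, "c" is not.
example = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1.1, 2.0, 2.9, 4.2, 5.1, 5.8],
    "c": [2, -1, 0, 1, -2, 0],
}
groups, reduced = reduce_features(example, threshold=0.8)
print(groups)
```

In this sketch a higher threshold keeps more of the original features separate, which is one way to read the statement that the constraint on information loss determines the size of the reduced dataset.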