Haohan Wang, Bryon Aragam, Eric P Xing.
Abstract
A fundamental and important challenge in modern datasets of ever increasing dimensionality is variable selection, which has taken on renewed interest recently due to the growth of biological and medical datasets with complex, non-i.i.d. structures. Naïvely applying classical variable selection methods such as the Lasso to such datasets may lead to a large number of false discoveries. Motivated by genome-wide association studies in genetics, we study the problem of variable selection for datasets arising from multiple subpopulations, when this underlying population structure is unknown to the researcher. We propose a unified framework for sparse variable selection that adaptively corrects for population structure via a low-rank linear mixed model. Most importantly, the proposed method does not require prior knowledge of individual relationships in the data and adaptively selects a covariance structure of the correct complexity. Through extensive experiments, we illustrate the effectiveness of this framework over existing methods. Further, we test our method on three different genomic datasets from plants, mice, and humans, and discuss the knowledge we discover with our model.
Keywords: Applied computing → Genetics; Computational genomics; Computing methodologies → Supervised learning; Confounding Correction; Information systems → Data mining; Linear Mixed Model; Sparsity; Variable Selection
Year: 2017 PMID: 29629235 PMCID: PMC5889139 DOI: 10.1109/BIBM.2017.8217687
Source DB: PubMed Journal: Proceedings (IEEE Int Conf Bioinformatics Biomed) ISSN: 2156-1125