Haohan Wang, Benjamin J Lengerich, Bryon Aragam, Eric P Xing.
Abstract
MOTIVATION: Association studies to discover links between genetic markers and phenotypes are central to bioinformatics. Methods of regularized regression, such as variants of the Lasso, are popular for this task. Despite the good predictive performance of these methods in the average case, they suffer from unstable selections of correlated variables and inconsistent selections of linearly dependent variables. Unfortunately, as we demonstrate empirically, such problematic situations of correlated and linearly dependent variables often exist in genomic datasets and lead to under-performance of classical methods of variable selection.
Year: 2019 PMID: 30184048 PMCID: PMC6449749 DOI: 10.1093/bioinformatics/bty750
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Proportion of simulations in which the irrepresentable condition failed to hold on gene expression, methylation and miRNA datasets for glioblastoma, breast cancer and lung cancer.
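As a rough illustration of the quantity behind Fig. 1, the sketch below checks the irrepresentable condition in the form commonly stated in the Lasso consistency literature: with S the true support, the largest absolute entry of X_{S^c}^T X_S (X_S^T X_S)^{-1} sign(β_S) must stay below 1. The function name, the margin parameter `eta` and the toy matrices are assumptions made for illustration; the authors' simulation protocol is not described in this record.

```python
import numpy as np

def irrepresentable_condition_holds(X, beta, eta=0.0):
    """Return True if the irrepresentable condition holds for design X and true coefficients beta.

    Hypothetical helper: `eta` is an optional margin, 0.0 gives the plain condition.
    """
    X = np.asarray(X, dtype=float)
    beta = np.asarray(beta, dtype=float)
    support = beta != 0
    X_s = X[:, support]    # columns of the truly relevant variables
    X_c = X[:, ~support]   # columns of the irrelevant variables
    if X_s.shape[1] == 0 or X_c.shape[1] == 0:
        return True        # nothing to compare, so the condition holds trivially
    signs = np.sign(beta[support])
    # w = (X_S^T X_S)^{-1} sign(beta_S), computed with a linear solve
    w = np.linalg.solve(X_s.T @ X_s, signs)
    stat = np.max(np.abs(X_c.T @ (X_s @ w)))
    return bool(stat < 1.0 - eta)

# Toy designs (assumed data): an orthogonal design, and one whose irrelevant
# third variable is strongly correlated with both relevant variables.
X_ok = np.eye(3)
X_bad = np.array([[1.0, 0.0, 0.7],
                  [0.0, 1.0, 0.7],
                  [0.0, 0.0, 0.1]])
beta_true = np.array([1.0, 1.0, 0.0])

holds_ok = irrepresentable_condition_holds(X_ok, beta_true)
holds_bad = irrepresentable_condition_holds(X_bad, beta_true)
```

In the second design the irrelevant variable loads 0.7 on each relevant variable, so the statistic is 1.4 and the condition fails, which is the kind of situation the abstract describes for correlated genomic features.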
Fig. 2. AUC of each variable selection method. Methods are: Wald Hypothesis Testing (Wald), Sure Independence Screening (SIS), Lasso, Ridge Regression (RR), Elastic Net (EN), Adaptive Lasso (AL), SCAD, MCP, Trace Lasso (TL), Inverse Covariance Regularizer (IC) and Precision Lasso (PL). The vertical axis represents the area under the ROC curve for variable selection. The results are averaged over ten runs, and the SD is also shown. The plot shows that our methods (PL and IC) have a clear advantage over traditional methods on simulated data. Note that the AUC is calculated for the variable selection task, not for the prediction of binary outcomes.
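To make the distinction in the caption above concrete, here is a minimal sketch of an AUC computed for variable selection: the "labels" are whether each variable is truly associated, and the "scores" are the magnitudes of the estimated coefficients. The function name and the toy numbers are assumptions for illustration, and the pairwise (Mann-Whitney) formulation is one standard way to compute AUC, not necessarily the authors' implementation.

```python
def selection_auc(true_support, scores):
    """AUC of ranking truly associated variables above unassociated ones.

    true_support: iterable of booleans, True where the variable is truly associated.
    scores: iterable of non-negative importance scores, e.g. |estimated coefficient|.
    """
    pos = [s for s, t in zip(scores, true_support) if t]
    neg = [s for s, t in zip(scores, true_support) if not t]
    if not pos or not neg:
        raise ValueError("need at least one associated and one unassociated variable")
    # Count pairs where the associated variable outranks the unassociated one;
    # ties count as half a win.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example (assumed data): variables 0 and 1 are truly associated.
true_support = [True, True, False, False]
abs_coefficients = [0.9, 0.1, 0.5, 0.0]
auc = selection_auc(true_support, abs_coefficients)
```

Here three of the four (associated, unassociated) pairs are ranked correctly, so the selection AUC is 0.75. No phenotype predictions enter the calculation.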
Genes that were selected from breast cancer gene expression data and are annotated in the COSMIC dataset to have somatic mutations associated with tumours
| Method | Selected gene | Tumor associations | Driver? |
|---|---|---|---|
| Precision lasso | | | |
| | | Prostate | |
| | | pre B-cell ALL, Myoepithelioma | |
| | | Uterine leiomyoma | |
| Wald test | | Follicular thyroid | |
| | | Lipoma | |
| Lasso | | Lipoma; Leiomyoma; Pleomorphic salivary gland adenoma | |
| | | DFSP, Aneurysmal bone cyst | |
| Ridge regression | | Follicular thyroid | |
| | | DFSP, Aneurysmal bone cyst | |
| Elastic net | | Lipoma; Leiomyoma; Pleomorphic salivary gland adenoma | |
| | | DFSP, Aneurysmal bone cyst | |
| Adaptive lasso | | Acute myeloid leukaemia | |
| | | Lipoma; Leiomyoma; Pleomorphic salivary gland adenoma | |
| | | DFSP, Aneurysmal bone cyst | |
| SCAD | | Megakaryoblastic leukaemia of downs syndrome | |
| | | Acute lymphoblastic leukaemia | |
| | | Lipoma; Leiomyoma; Pleomorphic salivary gland adenoma | |
| MCP | | Acute myeloid leukaemia | |
| | | Lipoma; Leiomyoma; Pleomorphic salivary gland adenoma | |
| | | DFSP, Aneurysmal bone cyst | |
| Trace lasso | | | |
| | | Lipoma | |
| | | Melanoma | |
| | | DFSP, Aneurysmal bone cyst | |
| Inverse covariance | | | |
| | | Papillary thyroid, NSCLC | |
| | | Acute lymphoblastic leukaemia | |
Notes: Genes with associations to breast cancer are bolded, and genes associated with high-confidence driver mutations are annotated in the rightmost column. Each method was constrained to select exactly 100 genes from a common set. We see that Precision Lasso selects the most relevant genes.