Joel B Fontanarosa, Yang Dai.
Abstract
We use least absolute shrinkage and selection operator (LASSO) regression to select the genetic markers and phenotypic features that are most informative with respect to a trait of interest. We compare several strategies for applying LASSO methods in risk prediction models, using the Genetic Analysis Workshop 17 exome simulation data, which consist of 697 individuals with genotypic and phenotypic (smoking, age, sex) information, in a 5-fold cross-validated fashion. The cross-validated averages of the area under the receiver operating characteristic curve range from 0.45 to 0.63 across strategies that use only genotypic markers, and improve to 0.69-0.87 when both genotypic and phenotypic information are used. The ability of the LASSO method to find true causal markers is limited, but the method discovered several common variants (e.g., FLT1) under certain conditions.
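For orientation, the LASSO-penalized fit can be written as below. This is a sketch under stated assumptions rather than the authors' exact formulation: it assumes a binary affected/unaffected outcome $y_i \in \{0, 1\}$, a logistic model, and the standard L1-penalized objective. The symbols $x_i$ (feature vector of individual $i$), $\beta_0$, $\beta$, $p$ (number of features), and $\lambda$ (penalty weight) are notation introduced here, not taken from the record.

$$
(\hat\beta_0, \hat\beta) = \arg\min_{\beta_0,\,\beta} \left\{ -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \left(\beta_0 + x_i^{\top}\beta\right) - \log\left(1 + e^{\beta_0 + x_i^{\top}\beta}\right) \right] + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\}
$$

Larger values of $\lambda$ shrink more coefficients exactly to zero, which is what makes the method usable for selecting markers: the variables with nonzero $\hat\beta_j$ form the selected model.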
Year: 2011 PMID: 22373537 PMCID: PMC3287908 DOI: 10.1186/1753-6561-5-S9-S69
Source DB: PubMed Journal: BMC Proc ISSN: 1753-6561
Prediction results for various model types
| Model | Model type | Training AROC | Testing AROC | Number of true (a) | Size (b) | N |
|---|---|---|---|---|---|---|
| 1 | Genotypes only | 0.57 | 0.55 | 3.57 | 179.43 | 200 |
| | Genotypes restricted | 0.56 | 0.55 | 0.84 | 22.07 | 200 |
| | Combined model | 0.82 | 0.82 | 1.27 | 28.38 | 200 |
| | Combined model restricted | 0.82 | 0.82 | 1.06 | 18.70 | 200 |
| 2a | Genotypes only | 0.61 | 0.54 | 9.98 | 545.33 | 50 |
| | Genotypes restricted | 0.56 | 0.55 | 0.86 | 21.66 | 50 |
| | Combined model | 0.83 | 0.81 | 2.78 | 94.32 | 50 |
| | Combined model restricted | 0.83 | 0.82 | 1.14 | 20.57 | 50 |
| 2b | Genotypes only | 0.73 | 0.54 | 11.65 | 348.86 | 150 |
| | Genotypes restricted | 0.58 | 0.56 | 2.01 | 29.57 | 150 |
| | Combined model | 0.85 | 0.78 | 9.35 | 228.43 | 150 |
| | Combined model restricted | 0.83 | 0.82 | 2.48 | 29.26 | 150 |
| 3 | Genotypes only | 0.62 | 0.54 | 11.32 | 294.68 | 200 |
| | Genotypes restricted | 0.58 | 0.56 | 1.75 | 22.84 | 200 |
| | Combined model | 0.83 | 0.82 | 3.94 | 64.17 | 200 |
| | Combined model restricted | 0.83 | 0.82 | 2.04 | 20.40 | 200 |
a Average number of causal simulation markers included.
b Average number of variables in each model.
Averaged results from a 5-fold evaluation procedure on N simulation data sets. Training AROC values were obtained from the internal 10-fold cross-validation on the training sets, as implemented in the R package glmnet. Testing AROC values were determined by applying each of the trained models to the five independent testing sets.
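The evaluation loop described in this caption can be sketched in plain Python. This is a minimal illustration, not the authors' code: the original analysis used the R package glmnet, whereas here the model fit is abstracted behind a hypothetical `fit` callable, and the function names `auc`, `k_fold_indices`, and `cross_validated_auc` are introduced only for this example. The AROC is computed as the fraction of case/control pairs in which the case receives the higher score, with ties counted as one half.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k nearly equal test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_auc(labels, features, fit, k=5, seed=0):
    """Average testing AUC over k folds.

    `fit` is a hypothetical callable (an assumption of this sketch): it takes
    training features and labels and returns a function that scores one
    feature vector. A LASSO fit would be plugged in here.
    """
    aucs = []
    for test in k_fold_indices(len(labels), k, seed):
        held_out = set(test)
        train = [i for i in range(len(labels)) if i not in held_out]
        score = fit([features[i] for i in train], [labels[i] for i in train])
        aucs.append(auc([labels[i] for i in test],
                        [score(features[i]) for i in test]))
    return sum(aucs) / len(aucs)
```

In the procedure the caption describes, the penalty weight would additionally be tuned by an internal 10-fold cross-validation inside each training set; that inner loop would live inside `fit` and is omitted here.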
Feature selection
| Model type | Model 1: Gene | Model 1: SNP | Model 1: Count (a) | Model 1: MAF (b) | Model 1: Causal (c) | Model 3: Gene | Model 3: SNP | Model 3: Count (a) | Model 3: MAF (b) | Model 3: Causal (c) |
|---|---|---|---|---|---|---|---|---|---|---|
| Gene only | | C13S523 | 35 | 0.0667 | Y | | C13S523 | 71 | 0.0667 | Y |
| | | C15S3360 | 22 | 0.0029 | N | | C11S6885 | 63 | 0.0014 | N |
| | | C8S4379 | 17 | 0.0050 | N | | C8S4379 | 61 | 0.0050 | N |
| | | C6S4146 | 15 | 0.0050 | N | RPA3 | C7S297 | 58 | 0.0007 | N |
| | | C9S4013 | 13 | 0.0308 | N | | C1S10178 | 54 | 0.0007 | N |
| | | C13S522 | 12 | 0.0280 | Y | | C17S2981 | 52 | 0.0007 | N |
| Gene restricted | | C13S523 | 19 | 0.0667 | Y | | C13S523 | 44 | 0.0667 | Y |
| | | C17S3819 | 9 | 0.0043 | N | | C13S522 | 24 | 0.0280 | Y |
| | | C13S522 | 8 | 0.0280 | Y | | C7S2324 | 21 | 0.0976 | N |
| | | C3S2197 | 7 | 0.0108 | N | | C8S4379 | 18 | 0.0050 | N |
| | | C9S4013 | 7 | 0.0308 | N | | C17S4578 | 16 | 0.1664 | Y |
| | | C7S2324 | 7 | 0.0976 | N | | C1S9189 | 15 | 0.0065 | Y |
| Combined | Age | Age | 200 | NA | Y | Age | Age | 200 | NA | Y |
| | Smoke | Smoke | 163 | NA | Y | Smoke | Smoke | 185 | NA | Y |
| | | C13S523 | 49 | 0.0667 | Y | | C13S523 | 81 | 0.0667 | Y |
| | | C13S522 | 16 | 0.0280 | Y | | C13S522 | 34 | 0.0280 | Y |
| | | C18S2492 | 7 | 0.0172 | Y | | C18S2492 | 18 | 0.0172 | Y |
| | | C6S853 | 3 | 0.0036 | N | | C17S4578 | 8 | 0.1664 | Y |
| | | C1S6533 | 3 | 0.0115 | Y | | C1S6533 | 8 | 0.0115 | Y |
| | | C2S1 | 2 | 0.0093 | N | | C3S2197 | 7 | 0.0108 | N |
| Combined restricted | Age | Age | 200 | NA | Y | Age | Age | 200 | NA | Y |
| | Smoke | Smoke | 163 | NA | Y | Smoke | Smoke | 180 | NA | Y |
| | | C13S523 | 49 | 0.0667 | Y | | C13S523 | 75 | 0.0667 | Y |
| | | C13S522 | 17 | 0.0280 | Y | | C13S522 | 32 | 0.0280 | Y |
| | | C18S2492 | 7 | 0.0172 | Y | | C18S2492 | 17 | 0.0172 | Y |
| | | C1S6533 | 3 | 0.0115 | Y | | C3S2197 | 6 | 0.0108 | N |
| | | C22S1540 | 3 | 0.0201 | N | | C1S6533 | 6 | 0.0115 | Y |
| | | C10S4869 | 3 | 0.0050 | N | | C4S1861 | 5 | 0.0022 | Y |
a Number of times a given variable was observed in four out of five trained models.
b Minor allele frequency.
c Variables used to determine disease risk by the GAW17 simulators.
The most frequently selected variables for models 1 and 3; a variable is counted for a simulation data set when it occurred in at least four out of five trained models. All models were run on the 200 simulation data sets.
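The counting rule behind the Count column can be sketched as follows. This is an illustration under assumptions, not the authors' code: it assumes that each simulation data set yields five sets of selected variables (one per trained model) and that a variable is tallied once per data set when it appears in at least four of them. The function names and the toy input are hypothetical.

```python
from collections import Counter


def stable_variables(fold_selections, min_folds=4):
    """Variables present in at least `min_folds` of the per-fold selected sets."""
    per_fold = Counter(v for selected in fold_selections for v in set(selected))
    return {v for v, c in per_fold.items() if c >= min_folds}


def selection_counts(replicates, min_folds=4):
    """Count, over simulation data sets, how often each variable is stable."""
    counts = Counter()
    for fold_selections in replicates:
        counts.update(stable_variables(fold_selections, min_folds))
    return counts


# Toy input (made up for illustration, not the paper's results):
# two simulation data sets, five trained models each.
toy_replicates = [
    [{"Age", "C13S523"}, {"Age", "C13S523"}, {"Age"},
     {"Age", "C13S523"}, {"Age", "C13S523"}],
    [{"Age"}, {"Age", "C13S523"}, {"Age"}, {"Age"}, {"Age", "Smoke"}],
]
toy_counts = selection_counts(toy_replicates)
```

Calling `toy_counts.most_common()` then lists variables from most to least frequently stable, which is the ordering used within each block of the table.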