Yoonhee Kim1, Qing Li1, Cheryl D Cropp1, Heejong Sung1, Juanliang Cai1, Claire L Simpson1, Brian Perry1, Abhijit Dasgupta2, James D Malley3, Alexander F Wilson1, Joan E Bailey-Wilson1.
Abstract
Machine learning approaches are an attractive option for analyzing large-scale data to detect genetic variants that contribute to variation of a quantitative trait, without requiring specific distributional assumptions. We evaluate two machine learning methods, random forests and logic regression, and compare them to standard simple univariate linear regression, using the Genetic Analysis Workshop 17 mini-exome data. We also apply these methods after collapsing multiple rare variants within genes and within gene pathways. Linear regression and the random forest method performed better when rare variants were collapsed based on genes or gene pathways than when each variant was analyzed separately. Logic regression performed better when rare variants were collapsed based on genes rather than on pathways.
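The collapsing step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which collapsing scheme was used, so the code below assumes a simple burden count (the sum of rare minor alleles per gene) followed by an ordinary least-squares slope for one collapsed predictor. The function names, the 0/1/2 genotype coding, and the 1% minor-allele-frequency cutoff are all assumptions made for illustration, not details taken from the paper.

```python
from collections import defaultdict


def collapse_by_gene(genotypes, variant_gene, maf, rare_threshold=0.01):
    """Collapse rare variants into one burden score per gene.

    genotypes      : list of rows, one per individual, of minor-allele counts (0/1/2)
    variant_gene   : gene label for each variant (column)
    maf            : minor-allele frequency for each variant (column)
    rare_threshold : hypothetical cutoff; variants with maf below it are "rare"

    Returns {gene: [burden for each individual]}. Genes that contain no rare
    variant do not appear in the result.
    """
    rare_columns = [j for j, freq in enumerate(maf) if freq < rare_threshold]
    burden = defaultdict(lambda: [0] * len(genotypes))
    for j in rare_columns:
        gene = variant_gene[j]
        for i, row in enumerate(genotypes):
            burden[gene][i] += row[j]
    return dict(burden)


def simple_regression_slope(x, y):
    """Least-squares slope of y on a single predictor x (univariate regression)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx


# Toy data: 4 individuals, 3 variants; the third variant is common and is skipped.
genotypes = [[0, 1, 0], [1, 0, 2], [0, 0, 0], [1, 1, 0]]
variant_gene = ["G1", "G1", "G2"]
maf = [0.005, 0.004, 0.2]

collapsed = collapse_by_gene(genotypes, variant_gene, maf)
trait = [2.0, 2.0, 0.0, 4.0]  # made-up quantitative trait values
slope_g1 = simple_regression_slope(collapsed["G1"], trait)
```

Pathway-level collapsing would follow the same pattern with a pathway label per variant in place of the gene label.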
Year: 2011 PMID: 22373484 PMCID: PMC3287827 DOI: 10.1186/1753-6561-5-S9-S104
Source DB: PubMed Journal: BMC Proc ISSN: 1753-6561
Percentage of CVs and UNCVs that were in the top-ranked 10% of predictors (RF and ULR) or that were included in the final model (LR) in at least 5%, 10%, and 20% of the 200 simulated replicates
| Data set | Variant type | Total number^a | Random forest, % variants ranked in the top 10% of predictors | | | Univariate linear regression, % variants ranked in the most significant 10% | | | Logic regression, % variants included in final model | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | In ≥5 PoR | In ≥10 PoR | In ≥20 PoR | In ≥5 PoR | In ≥10 PoR | In ≥20 PoR | In ≥5 PoR | In ≥10 PoR | In ≥20 PoR |
| Uncollapsed | CVs | 36 | 100 | 72 | 33 | 94 | 72 | 39 | | | |
| Uncollapsed | UNCVs | 12,485 | 76 | 46 | 8 | 90 | 25 | 3 | | | |
| Gene-collapsed | CVs | 15 | 100 | 100 | 50 | 87 | 73 | 53 | 91^b | 64 | 45 |
| Gene-collapsed | UNCVs | 6,642 | 98 | 81 | 24 | 90 | 26 | 3 | 63 | 32 | 9 |
| Pathway-collapsed | CVs | 167 | 99 | 95 | 72 | 93 | 50 | 26 | 1^c | 0 | 0 |
| Pathway-collapsed | UNCVs | 2,249 | 91 | 68 | 38 | 88 | 32 | 8 | 0 | 0 | 0 |
UNCVs are noncausal variants that were not used in the simulation to determine Q2, were not in one of the causal genes, and did not display correlation of at least 0.6 with any CVs.
a Total number of nonmonomorphic variants in Asians. For LR we excluded 4 common CVs and 4,725 common NCVs.
b Final LR model: 3 trees with 10 leaves.
c Final LR model: 2 trees with 20 leaves.
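The summary statistic reported in the table (the percentage of variants that fall in the top-ranked 10% of predictors in at least 5%, 10%, or 20% of replicates) can be sketched as follows. This is an illustrative reading of the table caption, not the authors' code: the function name, the input layout (one best-first ranking per replicate), and the rounding of the top-10% cutoff are assumptions.

```python
def percent_variants_by_pick_rate(ranked_replicates, variants,
                                  top_fraction=0.10, thresholds=(5, 10, 20)):
    """Percentage of `variants` ranked in the top fraction often enough.

    ranked_replicates : list of rankings, one per simulated replicate; each
                        ranking is a list of predictor ids ordered best first
    variants          : ids of the variants of interest (e.g. the CVs)
    top_fraction      : share of predictors counted as "top-ranked"
    thresholds        : minimum percentages of replicates to report

    Returns {threshold: % of variants in the top fraction in at least
    threshold% of the replicates}.
    """
    n_replicates = len(ranked_replicates)
    hits = {v: 0 for v in variants}
    for ranking in ranked_replicates:
        # Assumed cutoff rule: floor of top_fraction * size, at least 1.
        n_top = max(1, int(len(ranking) * top_fraction))
        for predictor in ranking[:n_top]:
            if predictor in hits:
                hits[predictor] += 1

    result = {}
    for t in thresholds:
        n_pass = sum(1 for v in variants
                     if 100.0 * hits[v] / n_replicates >= t)
        result[t] = 100.0 * n_pass / len(variants)
    return result


# Toy example: 10 predictors, 20 replicates, top 10% = the single best predictor.
ids = [f"v{i}" for i in range(10)]


def ranking_with_top(top):
    """Build a ranking that puts `top` first and keeps the rest in id order."""
    return [top] + [v for v in ids if v != top]


replicates = ([ranking_with_top("v0")] * 10 + [ranking_with_top("v1")] * 3
              + [ranking_with_top("v2")] + [ranking_with_top("v9")] * 6)
summary = percent_variants_by_pick_rate(replicates, ["v0", "v1", "v2", "v3"])
```

For logic regression, the same counting would apply with "included in the final model" replacing "ranked in the top 10%", so each replicate would contribute a set of selected predictors instead of a ranking.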
Figure 1. LR pick rate (by chromosome) for the most frequently identified noncausal genes and any causal gene using gene-collapsed data