Bareng A S Nonyane, Andrea S Foulkes.
Abstract
BACKGROUND: Population-based investigations aimed at uncovering genotype-trait associations often involve high-dimensional genetic polymorphism data as well as information on multiple environmental and clinical parameters. Machine learning (ML) algorithms offer a straightforward analytic approach for selecting subsets of these inputs that are most predictive of a pre-defined trait. The performance of these algorithms in the presence of covariates, however, is not well characterized.
Year: 2008 PMID: 19014573 PMCID: PMC2620353 DOI: 10.1186/1471-2156-9-71
Source DB: PubMed Journal: BMC Genet ISSN: 1471-2156 Impact factor: 2.797
Figure 1. TDR to detect genotype effects under all approaches to handling covariates using Random Forests. For all models the Z effect is fixed at 0.5, except for Model 3, where the Z effect is 2. For Model 2, corr(X1, Z) = 0.5. For Models 1–3 the X1 effect is varied over [0.1, 0.9], while for Models 4–6 the effect sizes of any main effects in the model are fixed at 0.5 and the effect size of the interaction term is varied over [0.1, 0.9].
Figure 2. TDR to detect genotype effects under all approaches to handling covariates using MARS. Z and X1 effect sizes are as in Figure 1.
Table 1. Summary of TDR analysis for binary genotypes and covariates
| Model | Include Z | Stratify by Z | Residualize by Z | Ignore Z |
| 1. A | +/+ | +/+ | +/+ | +/+ |
| 2. A | ++/+ | +/+ | +/+ | ++/++ |
| 3. A | --/-- | --/-- | --/-- | -/- |
| 4. I | ++/++ | ++/++ | ++/++ | ++/++ |
| 5. I | +/++ | +/++ | -/+ | -/+ |
| 6. C | +/+ | +/+ | -/+ | -/+ |
Summary of simulation results in Figures 1 and 2. Results are given in pairs corresponding to RF and MARS, respectively: "+" indicates reasonable TDR (≥ 80%) for detecting moderate effect sizes (≥ 0.5); "++" indicates reasonable TDR (≥ 80%) for detecting small effect sizes (≥ 0.3); "-" indicates lower TDR (50% to 80%) for moderate effect sizes; and "--" indicates very low TDR (< 20%) at moderate effect sizes. Correlation between X1 and Z is fixed at 0.5 for Model 2.
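The four approaches compared in the table above can be pictured as data-preparation steps applied before a learner such as RF or MARS is fit. The Python below is a minimal sketch and not the authors' implementation. It assumes that "residualize" means replacing the trait with the residuals of a least-squares regression on Z, and that "stratify" means building one dataset per level of Z. The names `residualize` and `prepare` are hypothetical.

```python
from statistics import mean


def residualize(y, z):
    """Residuals of y after a simple least-squares fit on one covariate z (assumed meaning)."""
    z_bar, y_bar = mean(z), mean(y)
    sxx = sum((zi - z_bar) ** 2 for zi in z)
    if sxx == 0:
        # A constant covariate can only remove the mean of the trait.
        return [yi - y_bar for yi in y]
    slope = sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, y)) / sxx
    intercept = y_bar - slope * z_bar
    return [yi - (intercept + slope * zi) for yi, zi in zip(y, z)]


def prepare(approach, x, z, y):
    """Return a list of (predictor rows, trait values) datasets for the chosen approach."""
    if approach == "include":
        # Z enters the learner alongside the genotype.
        return [([(xi, zi) for xi, zi in zip(x, z)], list(y))]
    if approach == "ignore":
        # Z is dropped entirely.
        return [([(xi,) for xi in x], list(y))]
    if approach == "residualize":
        # The learner sees the genotype only; the trait has Z regressed out.
        return [([(xi,) for xi in x], residualize(y, z))]
    if approach == "stratify":
        # One dataset per observed level of Z, genotype only.
        datasets = []
        for level in sorted(set(z)):
            idx = [i for i, zi in enumerate(z) if zi == level]
            datasets.append(([(x[i],) for i in idx], [y[i] for i in idx]))
        return datasets
    raise ValueError(f"unknown approach: {approach}")
```

Each returned dataset would then be passed to the learner of choice; stratification yields several fits whose results need to be read per stratum.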
Figure 3. True and false discovery rates for genotype effects with correlated predictors using Random Forests. The first plot illustrates the effect of confounding on TDR when X1 and Z effects are fixed at 0.5, while the second plot illustrates FDR given that the Z effect is fixed at 0.5.
Figure 4. True and false discovery rates for genotype effects with correlated predictors using MARS. Z and X1 effect sizes are as in Figure 3.
Table 2. TDR and FDR under Model 2
| Approach | RF TDR | RF FDR | MARS TDR | MARS FDR |
| 1. Include | + | - | + | + |
| 2. Stratify by Z | - | + | - | + |
| 3. Residualize by Z | - | + | + | + |
| 4. Ignore | + | - | + | - |
Correlation between X1 and Z is varied as shown in Figures 3 and 4. "+" indicates reasonable TDR (> 80% for RF and > 70% for MARS) or low FDR (< 20% for RF and < 10% for MARS) for all levels of correlation; "-" indicates decreasing TDR or increasing FDR as correlation increases.
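To make the two rates concrete, the following Python sketch tallies them over simulation replicates. The definitions used here are assumptions for illustration only, since this record does not state them: TDR is taken as the share of replicates in which the causal genotype is selected, and FDR as the share of replicates in which at least one predictor with no true effect is selected. The function name `discovery_rates` and the predictor labels are hypothetical.

```python
def discovery_rates(selections, causal, null_predictors):
    """Return (TDR, FDR) over replicates under the assumed definitions.

    selections      -- one collection of selected predictor names per replicate
    causal          -- name of the predictor with a true effect (e.g. "X1")
    null_predictors -- names of predictors with no true effect
    """
    n = len(selections)
    if n == 0:
        raise ValueError("at least one replicate is required")
    null_set = set(null_predictors)
    # Replicates in which the causal predictor was picked up.
    true_hits = sum(1 for chosen in selections if causal in chosen)
    # Replicates in which any null predictor was picked up.
    false_hits = sum(1 for chosen in selections if null_set & set(chosen))
    return true_hits / n, false_hits / n


# Four hypothetical replicates with one causal and two null predictors.
replicates = [{"X1"}, {"X1", "X2"}, {"X3"}, set()]
tdr, fdr = discovery_rates(replicates, "X1", {"X2", "X3"})
```

Repeating this tally at each level of corr(X1, Z) would give curves of the kind summarized by the "+" and "-" entries above.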
Table 3. TDR for detecting genotype effects in the presence of a continuous covariate
| M | Include Z | Residualize by Z | Ignore Z |
| M | 0.969/0.546 | 0.993/0.999 | 0.965/0.995 |
| M | 0.922/0.729 | 0.942/0.731 | 0.116/0.740 |
| M | 0.829/0.703 | 0.133/0.033 | 0.510/0.628 |
| M | 0.999/0.909 | 0.995/0.998 | 0.930/0.984 |
| M | 0.898/0.963 | 0.137/0.033 | 0.137/0.036 |
| M | 0.906/0.854 | 0.119/0.037 | 0.127/0.045 |
TDR is given in pairs corresponding to RF and MARS, respectively.
Table 4. Summary of results in Table 3
| M | Include Z | Residualize by Z | Ignore Z |
| M | +/- | +/+ | +/+ |
| M | +/- | +/- | --/- |
| M | +/- | --/-- | -/- |
| M | +/+ | +/+ | +/+ |
| M | +/+ | --/-- | --/-- |
| M | +/+ | --/-- | --/-- |
Results are given in pairs corresponding to RF and MARS, respectively:
"+" denotes reasonable TDR (> 80%) to detect genotype;
"-" denotes lower TDR (50% – 80%) to detect genotype;
and "--" denotes very low TDR (< 50%) to detect genotype.
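The legend above amounts to a simple threshold rule on each TDR value. The Python sketch below applies that rule to an "RF/MARS" pair written as in the TDR table for the continuous covariate. The function names `tdr_symbol` and `summarize_pair` are hypothetical, and treating a TDR of exactly 50% or 80% as "-" is an assumption, since the legend does not say which symbol applies at the boundaries.

```python
def tdr_symbol(tdr):
    """Map a TDR given as a proportion in [0, 1] to the legend's symbol."""
    if tdr > 0.80:
        return "+"   # reasonable TDR (> 80%)
    if tdr >= 0.50:
        return "-"   # lower TDR (50% to 80%); boundary handling is assumed
    return "--"      # very low TDR (< 50%)


def summarize_pair(pair):
    """Convert an 'RF/MARS' pair such as '0.969/0.546' into a symbol pair."""
    rf, mars = (float(value) for value in pair.split("/"))
    return f"{tdr_symbol(rf)}/{tdr_symbol(mars)}"


# Pairs taken from the TDR table for the continuous covariate.
examples = ["0.969/0.546", "0.116/0.740", "0.137/0.033"]
symbols = [summarize_pair(p) for p in examples]
```

Applied to every cell of the TDR table, this rule gives the same symbols as the summary table.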
Table 5. Random Forest results of the analysis of ACTG data with HDL-c levels as trait and SNPs from ApoCIII, ApoE, EL and HL genes and race/ethnicity as predictors
| Predictor | Include* | Ignore* | Residualize* | Stratify*: White | Stratify*: Black | Stratify*: Hispanic |
| -482C/T (rs2854117) | -2.04(0.96) | 1.19(0.97) | ||||
| -455T/C (rs2854116) | 9.04(0.66) | -0.57(0.98) | 1.27(0.98) | |||
| intron 1 (466)G/C (rs2070669) | 4.70(0.93) | 5.85(0.88) | 5.91(0.90) | 4.55(0.91) | -2.78(0.94) | -2.62(1.00) |
| Gly34Gly C/T (rs4520) | 8.86(1.12) | 5.51(1.03) | 4.77(1.07) | 2.99(1.02) | 0.91(0.99) | |
| exon 4 SstI 4348(5) C/G (rs5128) | 0.60(1.01) | 2.12(1.04) | 2.34(1.04) | 2.91(1.03) | -3.20(0.97) | 2.95(1.00) |
| Arg112Cys T/C (rs429358) | 4.87(1.08) | 1.45(1.00) | 2.29(1.06) | 6.29(1.08) | -5.22(0.96) | 6.30(0.98) |
| Arg158Cys T/C (rs7412) | 6.02(0.98) | 7.49(1.00) | 7.49(0.93) | 5.08(0.92) | -5.20(0.97) | 2.69(0.98) |
| rs12970066 | 3.94(1.03) | 4.00(0.97) | 5.54(0.97) | 1.12(0.97) | -1.81(1.05) |
| Asn396Ser | 7.06(1.03) | 8.30(1.05) | 8.39(1.10) | 2.27(1.04) | 5.93(0.90) | 2.09(0.97) |
| rs3829632 (-1309A/G) | -1.10(0.99) | 1.24(0.98) | 2.25(1.04) | -1.64(0.98) | 0.00(0.00) | -2.25(0.96) |
| rs2070895 | 7.86(1.09) | 5.54(0.99) | 5.58(1.11) | -1.73(0.95) | 2.82(0.97) |
| rs12595191 | -0.93(0.97) | -1.77(1.00) | -1.07(0.99) | -3.62(1.00) | -1.99(0.99) | 0.07(0.99) |
| rs690 | 10.41(1.08) | 1.28(0.99) | 0.53(0.98) | 3.93(1.05) | -3.98(0.97) | 9.61(0.96) |
| rs6084 | 7.42(1.01) | 6.47(1.01) | 6.27(1.06) | -1.15(1.01) | ||
| Race/ethnicity | NA | NA | NA | NA | NA |
"*" indicates the approach to handling the race/ethnicity covariate;
"NA" indicates that the predictor was not included in the analysis.
The two highest importance scores from RF are in bold.
Table 6. MARS results of the analysis of ACTG data with HDL-c levels as trait and SNPs from ApoCIII, ApoE, EL and HL genes and race/ethnicity as predictors
| Predictor | Include* | Ignore* | Residualize* | Stratify*: White | Stratify*: Black | Stratify*: Hispanic |
| -482C/T (rs2854117) | 4 | - | 1 | 1 | - | - |
| -455T/C (rs2854116) | - | 3 | 2 | 3 | - | - |
| intron 1 (466)G/C (rs2070669) | - | - | - | - | - | - |
| Gly34Gly C/T (rs4520) | - | - | - | - | - | - |
| exon 4 SstI 4348(5) C/G (rs5128) | - | - | - | - | - | - |
| Arg112Cys T/C (rs429358) | - | - | 1 | - | - | - |
| Arg158Cys T/C (rs7412) | 3 | 1 | - | 2 | - | - |
| rs12970066 | 5 | - | - | 4 | 1 | - |
| Asn396Ser | 4 | - | - | 1 | 1 | - |
| rs3829632 (-1309A/G) | - | - | - | - | - | - |
| rs2070895 | 3 | 1 | - | 2 | - | - |
| rs12595191 | - | - | - | - | - | - |
| rs690 | 4 | 2 | - | 1 | - | - |
| rs6084 | 2 | - | 1 | - | 1 | - |
| Race/ethnicity | 1 | NA | NA | NA | NA | NA |
"*" indicates the approach to handling the race/ethnicity covariate;
"NA" indicates that the predictor was not included in the analysis;
and "-" indicates that the predictor was not selected in the final MARS model.