Michael Lau, Claudia Wigmann, Sara Kress, Tamara Schikowski, Holger Schwender.
Abstract
BACKGROUND: Genetic risk scores (GRS) summarize genetic features such as single nucleotide polymorphisms (SNPs) in a single statistic with respect to a given trait. So far, GRS are typically built using generalized linear models or regularized extensions. However, these linear methods are usually not able to incorporate gene-gene interactions or non-linear SNP-response relationships. Tree-based statistical learning methods such as random forests and logic regression may be an alternative to such regularized-regression-based methods and are investigated in this article. Moreover, we consider modifications of random forests and logic regression for the construction of GRS.
Keywords: Bagging; Elastic net; Epistasis; Logic regression; Polygenic risk scores; Random forests; Simulation study; Statistical learning; Variable selection
Year: 2022 PMID: 35313824 PMCID: PMC8935722 DOI: 10.1186/s12859-022-04634-w
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
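The abstract notes that GRS built from generalized linear models cannot represent gene-gene interactions. A minimal sketch of why, assuming a GRS that is a plain weighted sum of minor-allele counts: the contribution of one SNP never depends on the genotype at another SNP. The function name `linear_grs` and the weights are hypothetical and are not taken from the article.

```python
from typing import Sequence


def linear_grs(weights: Sequence[float], genotypes: Sequence[int]) -> float:
    """Weighted sum of minor-allele counts (0, 1, or 2 per SNP)."""
    if len(weights) != len(genotypes):
        raise ValueError("need exactly one weight per SNP")
    return sum(w * g for w, g in zip(weights, genotypes))


# Hypothetical per-SNP weights, e.g. log odds ratios from a fitted linear model.
weights = [0.18, 0.41, -0.25]

# Effect of one extra minor allele at SNP 1, for two different genotypes at SNP 2.
effect_when_snp2_is_0 = linear_grs(weights, [1, 0, 0]) - linear_grs(weights, [0, 0, 0])
effect_when_snp2_is_2 = linear_grs(weights, [1, 2, 0]) - linear_grs(weights, [0, 2, 0])
```

Both effects equal the weight of SNP 1, so an additive score of this form has no term through which two SNPs could act jointly.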
Fig. 1 Workflow of constructing and evaluating genetic risk scores
Fig. 2 Exemplary tree models for three binary input variables predicting two different classes. Panel a shows a classification tree; panel b depicts a logic tree describing a Boolean expression, where a true expression is identified as one class and a false expression as the other. Negated input variables/leaves are marked by white letters on a black background. Both trees are equivalent, i.e., they yield the same prediction for every predictor setting
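The equivalence described in the Fig. 2 caption can be checked by enumerating all eight settings of three binary inputs. The sketch below uses a hypothetical Boolean expression, `(x1 AND NOT x2) OR x3`, which is an assumption for illustration and not necessarily the expression shown in the figure.

```python
from itertools import product


def logic_tree(x1: bool, x2: bool, x3: bool) -> int:
    """Hypothetical Boolean expression (x1 AND NOT x2) OR x3; true maps to class 1."""
    return int((x1 and not x2) or x3)


def classification_tree(x1: bool, x2: bool, x3: bool) -> int:
    """Nested binary splits that encode the same decision rule."""
    if x3:
        return 1
    if not x1:
        return 0
    return 0 if x2 else 1


# All 2^3 predictor settings.
settings = list(product([False, True], repeat=3))
equivalent = all(logic_tree(*s) == classification_tree(*s) for s in settings)
```

Exhaustive enumeration is feasible here only because the inputs are binary and few in number.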
Parameter settings for the first simulation scenario resulting in 27 settings in total
| Parameter | Considered realizations |
|---|---|
| Odds ratio | 1.2, 1.5, 1.8 |
| Amount of noise SNPs | 4, 14, 44 |
| Sample size | 500, 1000, 2000 |
| Prevalence | Resulting in balanced data sets |
| MAF | Randomly chosen from [0.15, 0.45] |
| Repetitions | 100 |
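The 27 settings follow from a full factorial design over the three varied parameters (3 × 3 × 3). A minimal sketch of the enumeration; the dictionary keys are hypothetical names, and the total run count assumes that every setting is repeated 100 times.

```python
from itertools import product

odds_ratios = [1.2, 1.5, 1.8]
noise_snps = [4, 14, 44]
sample_sizes = [500, 1000, 2000]

# One dictionary per combination of the three varied parameters.
settings = [
    {"odds_ratio": o, "n_noise_snps": m, "sample_size": n}
    for o, m, n in product(odds_ratios, noise_snps, sample_sizes)
]

repetitions = 100  # assumed to apply to each setting separately
total_runs = len(settings) * repetitions
```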
Study parameters for the second simulation scenario resulting in 45 settings in total
| Parameter | Considered realizations |
|---|---|
| Odds ratio of gene-gene interaction | 1.2, 1.5, 1.8, 2.1, 2.4 |
| Amount of noise SNPs | 5, 15, 45 |
| Interacting SNPs (j, k) | (1, 2), (1, 4), (4, 5) |
| Sample size | 2000 |
| Prevalence | Resulting in balanced data sets |
| MAF | Randomly chosen from [0.15, 0.45] |
| Repetitions | 100 |
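To make the role of these parameters concrete, the following sketch simulates genotypes and a binary outcome driven by one gene-gene interaction. It is an illustration under stated assumptions, not the data-generating model of the article (Eq. (3) is not reproduced here): genotypes are drawn under Hardy-Weinberg equilibrium, the interaction is taken to be "at least one minor allele at both SNPs", the `intercept` is a hypothetical parameter, and the sketch does not enforce balanced classes.

```python
import math
import random


def simulate_interaction_data(n, n_snps, j, k, odds_ratio, intercept=0.0, seed=0):
    """Simulate n individuals with n_snps SNPs and one interaction between SNPs j and k (1-based)."""
    rng = random.Random(seed)
    # MAF per SNP, drawn uniformly from the interval given in the table.
    mafs = [rng.uniform(0.15, 0.45) for _ in range(n_snps)]
    beta = math.log(odds_ratio)  # interaction effect on the log-odds scale
    genotypes, outcomes = [], []
    for _ in range(n):
        # Two independent allele draws per SNP (Hardy-Weinberg assumption).
        row = [(rng.random() < p) + (rng.random() < p) for p in mafs]
        # Assumed interaction coding: both SNPs carry at least one minor allele.
        interaction = int(row[j - 1] > 0 and row[k - 1] > 0)
        prob = 1.0 / (1.0 + math.exp(-(intercept + beta * interaction)))
        genotypes.append(row)
        outcomes.append(int(rng.random() < prob))
    return mafs, genotypes, outcomes


mafs, genotypes, outcomes = simulate_interaction_data(
    n=200, n_snps=7, j=1, k=2, odds_ratio=1.8
)
```

Fixing the seed keeps a single run reproducible; repeated simulations would use a different seed per repetition.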
Study parameters for the third simulation scenario resulting in 20 settings in total
| Parameter | Considered realizations |
|---|---|
| Odds ratio of GxE interaction | 1.2, 1.5, 1.8, 2.1, 2.4 |
| Amount of noise SNPs | 45 |
| Interacting GxE SNP j | 2, 5 |
| Correlation between | 0.5, 0.9 |
| Sample size | 2000 |
| Prevalence | Resulting in balanced data sets |
| MAF | Randomly chosen from [0.15, 0.45] |
| Repetitions | 100 |
Regarded hyperparameter settings
| Algorithm | Hyperparameter | Considered realizations |
|---|---|---|
| Random forests & random forests VIM | mtry | |
| Random forests & random forests VIM | min.node.size | |
| Random forests & random forests VIM | num.trees | 2000 |
| Logic regression & logic bagging | ntrees | |
| Logic regression & logic bagging | nleaves | |
| Logic regression | Cooling schedule | Experimental |
| Logic regression | Simulated annealing iterations | 500000 |
| Logic bagging | Bagging iterations | 500 |
| Elastic net | | |
| Elastic net | | Cross-validation |
The mentioned hyperparameter names are the names of the corresponding arguments in the respective software packages. For a description of the parameters, see Additional file 1: Section 2
Fig. 3 Mean AUC, evaluated on the test data, for random forests, random forests VIM, logic regression, logic bagging, elastic net, and the true underlying model in the first simulation scenario, which considers SNPs with marginal effects
Fig. 4 Mean AUC, evaluated on the test data, for random forests, random forests VIM, logic regression, logic bagging, elastic net, and the true underlying model in the second simulation scenario, which incorporates interactions of SNPs. Designs 2.1, 2.2, and 2.3 describe the scenarios in which both interacting SNPs also exhibit marginal effects, only one of the two SNPs shows a marginal signal, or neither of them induces a main effect, i.e., (j, k) = (1, 2), (1, 4), or (4, 5) in Eq. (3), respectively
Fig. 5 Mean AUC, evaluated on the test data, for random forests, random forests VIM, logic regression, logic bagging, and elastic net in the third simulation scenario, which incorporates continuous input variables. Designs 3.1 and 3.2 describe the scenarios in which the GxE-interacting SNP also exhibits a moderate marginal effect or does not induce a main effect, i.e., j = 2 or 5 in Eq. (4), respectively
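The AUC reported in Figs. 3 to 5 can be read as the probability that a randomly chosen case receives a higher risk score than a randomly chosen control, with ties counted as one half. A minimal sketch of that pairwise definition; the function name `auc` and the example scores are hypothetical, and the article's own software may compute the statistic differently.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of cases (1) and controls (0)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("both classes must be present")
    # Each case-control pair contributes 1 if the case scores higher, 0.5 on a tie.
    wins = sum((c > d) + 0.5 * (c == d) for c in cases for d in controls)
    return wins / (len(cases) * len(controls))


# Hypothetical GRS values on a test set with two controls and two cases.
example_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

The pairwise form is quadratic in the sample size, which is acceptable for illustration; rank-based implementations give the same value more efficiently.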
Descriptive statistics of the regarded data set from the SALIA study stratified according to the status of rheumatic diseases
| Variable | Controls | Cases |
|---|---|---|
| Number of participants | 394 | 123 |
| Mean age [years] ± sd | | |
| Mean BMI [kg/m²] ± sd | | |
| Mean pack-years of smoking [years] ± sd | | |
| Mean [μg/m³] ± sd | | |
| Mean [μg/m³] ± sd | | |
| Mean [μg/m³] ± sd | | |
| Mean [μg/m³] ± sd | | |
| Mean [μg/m³] ± sd | | |
| Mean [μg/m³] ± sd | | |
Median p-values of the Wald tests for univariate models only including the GRS built on the SALIA data set
| Algorithm | Median p-value |
|---|---|
| Random forests | 0.018 |
| Random forests VIM | 0.167 |
| Logic regression | 0.353 |
| Logic bagging | 0.021 |
| Elastic net | 0.512 |
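For a univariate model that contains only the GRS, a Wald test compares the estimated coefficient with its standard error. A minimal sketch of the two-sided p-value under the usual normal approximation, followed by a median over several p-values; the function name, the input values, and the assumption that the median is taken over repeated evaluations are all illustrative and not taken from the article.

```python
import math
from statistics import median


def wald_p_value(beta_hat: float, std_error: float) -> float:
    """Two-sided Wald p-value for H0: beta = 0, using a standard normal reference."""
    z = beta_hat / std_error
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical coefficient estimates and standard errors from three evaluations.
p_values = [wald_p_value(b, se) for b, se in [(0.9, 0.35), (0.4, 0.33), (0.7, 0.30)]]
median_p = median(p_values)
```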
Fig. 6 AUC for random forests, random forests VIM, logic regression, logic bagging, and elastic net in the application to data from the SALIA study evaluated on the test data. Results for single unadjusted models also considering the alternative genome-wide construction approach
Fig. 7 AUC for random forests, random forests VIM, logic regression, logic bagging, and elastic net in the application to data from the SALIA study evaluated on the test data. Results for the final age-adjusted models with different air pollution indicators